Convert each value of a Java Spark Dataset into a row using explode() - java

I want to convert each value of a Spark Dataset (say x rows and y columns) into individual rows (the result should have x*y rows) with an additional column.
For example,
ColA  ColB  ColC
1     2     3
4     5     6
Should produce,
NewColA  NewColB
1        ColA
4        ColA
2        ColB
5        ColB
3        ColC
6        ColC
Each value in NewColB is the name of the column that the corresponding value in NewColA came from; e.g., 1 and 4 have ColA in NewColB because they originally came from ColA, and so on.
I have seen a few implementations of the explode() function in Java, but I want to know how it can be used in my use case. Also note that the input may be large (x*y may be in the millions).

The simplest way to accomplish this is with the stack() function built into Spark SQL.
import org.apache.spark.sql.functions.expr
import spark.implicits._ // for .toDF; assumes a SparkSession named `spark`, as in spark-shell

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("ColA", "ColB", "ColC")
df.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 2| 3|
| 4| 5| 6|
+----+----+----+
val df2 = df.select(expr("stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)"))
df2.show()
+-------+-------+
|NewColA|NewColB|
+-------+-------+
| 1| ColA|
| 2| ColB|
| 3| ColC|
| 4| ColA|
| 5| ColB|
| 6| ColC|
+-------+-------+
Sorry, the examples are in Scala, but they should be easy to translate.
It's also possible, albeit more complicated and less efficient, to do this with a .flatMap().
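Since the question asks for the Java API, here is a minimal Java sketch of the same stack() approach, assuming a Dataset<Row> named df with the columns above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// selectExpr accepts the same SQL stack() expression used in the Scala example.
Dataset<Row> df2 = df.selectExpr(
    "stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)");
df2.show();

Because stack() expands each row locally (no shuffle is involved), it scales well to the millions of rows mentioned in the question.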

Related

Avoiding multiple joins in a specific case with multiple columns with the same domain in dataframes of apache spark sql

I was asked to do something in Apache Spark SQL (Java API), through dataframes, that I think would be very costly if done naively (I'm still working on the naive approach, but I believe it would require at least 4 joins).
I got the following dataframe:
+----+----+----+----+----+----------+------+
| C1| C2| C3| C4| C5|UNIQUE KEY|points|
+----+----+----+----+----+----------+------+
| A| A|null|null|null| 1234| 2|
| A|null|null| H|null| 1235| 3|
| A| B|null|null|null| 1236| 3|
| B|null|null|null| E| 1237| 1|
| C|null|null| G|null| 1238| 1|
| F|null| C| E|null| 1239| 2|
|null|null| D| E| G| 1240| 1|
+----+----+----+----+----+----------+------+
C1, C2, C3, C4 and C5 have the same domain of values, UNIQUE KEY is a unique key, and points is an integer that should be counted only once per distinct value across the C columns of a row (e.g., the first row A,A,null,null,null,key,2 is equivalent to A,null,null,null,null,key,2 or A,A,A,A,null,key,2).
I was asked to "for each existing C value, get the total number of points".
So the output should be:
+----+------+
| C1|points|
+----+------+
| A| 8|
| B| 4|
| C| 3|
| D| 1|
| E| 4|
| F| 2|
| G| 2|
| H| 3|
+----+------+
I was going to split the dataframe into multiple small ones (one column for a C column and one for the points) through simple .select("C1","points"), .select("C2","points") and so on. But I believe that would be very costly if the amount of data is big; there should be some sort of trick via map reduce, but I couldn't find one myself since I'm still new to all this. I think I'm missing some concepts on how to apply map reduce.
I also thought about using the explode function: putting [C1, C2, C3, C4, C5] together in a column, then exploding so I get 5 rows for each row, and then just grouping by key... but I believe this would blow up the amount of data at some point, and if we are talking about GBs it may not be feasible. I hope you can find the trick I'm looking for.
Thanks for your time.
Using explode would probably be the way to go here. It won't increase the amount of data and will be a lot more computationally efficient than using multiple joins (note that a single join by itself is an expensive operation).
In this case, you can convert the columns to an array, retaining only the unique values for each separate row. This array can then be exploded and all nulls filtered away. At this point, a simple groupBy and sum will give you the wanted result.
In Scala:
import org.apache.spark.sql.functions.{array, array_distinct, explode, sum}
import spark.implicits._ // for the $"..." column syntax; array_distinct needs Spark 2.4+

df.select(explode(array_distinct(array("C1", "C2", "C3", "C4", "C5"))).as("C1"), $"points")
  .filter($"C1".isNotNull)
  .groupBy($"C1")
  .agg(sum($"points").as("points"))
  .sort($"C1") // not really necessary
This will give you the wanted result:
+----+------+
| C1|points|
+----+------+
| A| 8|
| B| 4|
| C| 3|
| D| 1|
| E| 4|
| F| 2|
| G| 2|
| H| 3|
+----+------+
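Since the question mentions the Java API, a hedged Java translation of the Scala snippet above (assumes df is a Dataset<Row> with columns C1..C5 and points; array_distinct requires Spark 2.4+):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> result = df
    .select(explode(array_distinct(array(col("C1"), col("C2"), col("C3"),
                                         col("C4"), col("C5")))).as("C1"),
            col("points"))
    .filter(col("C1").isNotNull())         // drop the exploded nulls
    .groupBy(col("C1"))
    .agg(sum(col("points")).as("points"))
    .sort(col("C1"));                      // optional, for readable output
result.show();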

What are the necessary conditions for taking a union of two datasets in Spark Java

What are the necessary conditions, e.g., the number of columns, identical columns, or different columns?
Let's assume you have two dataframes.
val df1 = spark.sql("SELECT 1 as a,3 as c")
val df2 = spark.sql("SELECT 1 as a,2 as b")
df1.union(df2) will work, because its only condition is the same number of columns:
df1.union(df2).show()
+---+---+
| a| c|
+---+---+
| 1| 3|
| 1| 2|
+---+---+
As you can see, it takes df1's column names and matches df2's columns only by position, not by name.
If you use unionByName, e.g.
df1.unionByName(df2).show()
it won't work, because it tries to look for a 'c' column in df2.
To sum up: both union styles need the same number of columns; unionByName additionally requires that the columns have the same names.
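The same checks in Java, as a minimal sketch (assumes an existing SparkSession named spark):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df1 = spark.sql("SELECT 1 AS a, 3 AS c");
Dataset<Row> df2 = spark.sql("SELECT 1 AS a, 2 AS b");

df1.union(df2).show();           // works: columns matched by position
// df1.unionByName(df2).show();  // fails with AnalysisException: df2 has no column named 'c'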

Getting distinct combination of two columns

Using Hibernate, I want to get the rows of values such that:
col1 | col2
-------+-------
1 | 2
-------+-------
2 | 1
-------+-------
3 | 4
-------+-------
4 | 5
-------+-------
4 | 3
will produce:
col1 | col2
-------+-------
1 | 2
-------+-------
3 | 4
-------+-------
4 | 5
Can I get this in Hibernate on Grails? Or can anyone provide a MySQL implementation of this? I've been battling this long enough.
You can use MySQL's LEAST() and GREATEST() functions to make sure that the smaller value comes first and the greater one comes later. This way you can use DISTINCT to eliminate duplicates:
select distinct least(col1, col2) as col1, greatest(col1, col2) as col2
from yourtable
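To run that from Hibernate, a minimal sketch via a native query (the Session variable and the table name yourtable are assumptions carried over from the SQL above):

import java.util.List;
import org.hibernate.Session;

@SuppressWarnings("unchecked")
List<Object[]> rows = session.createNativeQuery(
        "SELECT DISTINCT LEAST(col1, col2) AS col1, "
      + "GREATEST(col1, col2) AS col2 FROM yourtable")
    .getResultList();

// Each result row comes back as an Object[] of the two selected columns.
for (Object[] row : rows) {
    System.out.println(row[0] + " | " + row[1]);
}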
You can also group on the two columns like:
select * from yourtable group by (col1+col2);

GWT Hide grouped data in DataGrid

We're using GWT. I have a DataGrid with many repeated values in the column on the left. I would like to hide these. For example:
I have:
Town | Address | Color
------------------------------------------
Springfield | Springfield Heights | Blue
Springfield | Bum town | Red
Springfield | Little Italy | Blue
Shelbyville | Manhattan Square | Green
Shelbyville | Chinatown | Red
I would like to have:
Town | Address | Color
------------------------------------------
Springfield | Springfield Heights | Blue
| Bum town | Red
| Little Italy | Blue
Shelbyville | Manhattan Square | Green
| Chinatown | Red
I tried a few things, but they don't work well with sortable columns. Is there a standard way to do this?
You can override getCellStyleNames for the cell used in your Town column. This method gives you a Context, which you can use to see where the cell is in the column (context.getIndex()). Using this information, you can compare the value in this cell with the value in the cell above it (if any). If it is the same, return a style that hides the value in this cell.
Note that it won't work if you simply return an empty value when overriding getValue for your cell, because that would make the next cell show its value even when it is the same. You can, of course, work around this by looking upwards until you find a non-empty cell, but overriding getCellStyleNames and simply hiding repeating values looks like the simpler solution.
Because this is a method on your cell, it should behave well with updates, sorted columns, etc.
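A rough sketch of that idea (MyRow, getTown(), and the dataProvider backing the grid are hypothetical; "hiddenCell" is assumed to be a CSS class with visibility: hidden):

import java.util.List;

import com.google.gwt.cell.client.Cell;
import com.google.gwt.cell.client.TextCell;
import com.google.gwt.user.cellview.client.Column;
import com.google.gwt.view.client.ListDataProvider;

// Assumed to be the provider already attached to the DataGrid.
final ListDataProvider<MyRow> dataProvider = new ListDataProvider<MyRow>();

Column<MyRow, String> townColumn = new Column<MyRow, String>(new TextCell()) {
    @Override
    public String getValue(MyRow row) {
        return row.getTown();
    }

    @Override
    public String getCellStyleNames(Cell.Context context, MyRow row) {
        int index = context.getIndex();
        List<MyRow> rows = dataProvider.getList();
        // Hide the value when it repeats the town in the row directly above.
        if (index > 0 && rows.get(index - 1).getTown().equals(row.getTown())) {
            return "hiddenCell";
        }
        return null;
    }
};

Since the provider's list reflects the current sort order, the comparison stays correct after the user sorts a column.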

Strategy for parsing natural language descriptions into structured data

I have a set of requirements and I'm looking for the best Java-based strategy / algorithm / software to use. Basically, I want to take a set of recipe ingredients entered by real people in natural English and parse the metadata out into a structured format (see the requirements below for what I'm trying to do).
I've looked around here and other places, but have found no high-level advice on what direction to follow. So, I'll put it to the smart people :-):
What's the best / simplest way to solve this problem? Should I use a natural language parser, a DSL, Lucene/Solr, or some other tool/technology? NLP seems like it may work, but it looks really complex. I'd rather not spend a whole lot of time doing a deep dive just to find out it can't do what I'm looking for, or that there is a simpler solution.
Requirements
Given these recipe ingredient descriptions....
"8 cups of mixed greens (about 5 ounces)"
"Eight skinless chicken thighs (about 1ΒΌ lbs)"
"6.5 tablespoons extra-virgin olive oil"
"approximately 6 oz. thinly sliced smoked salmon, cut into strips"
"2 whole chickens (3 .5 pounds each)"
"20 oz each frozen chopped spinach, thawed"
".5 cup parmesan cheese, grated"
"about .5 cup pecans, toasted and finely ground"
".5 cup Dixie Diner Bread Crumb Mix, plain"
"8 garlic cloves, minced (4 tsp)"
"8 green onions, cut into 2 pieces"
I want to turn it into this....
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
| | Measure | | | weight | weight | | |
| # | value | Measure | ingredient | value | measure | preparation | Brand Name |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
| 1. | 8 | cups | mixed greens | 5 | ounces | - | - |
| 2. | 8 | - | skinless chicken thigh | 1.5 | pounds | - | - |
| 3. | 6.5 | tablespoons | extra-virgin olive oil | - | - | - | - |
| 4. | 6 | ounces | smoked salmon | - | - | thinly sliced, cut into strips | - |
| 5. | 2 | - | whole chicken | 3.5 | pounds | - | - |
| 6.  | 20      | ounces      | frozen chopped spinach  | -      | -         | thawed                         | -           |
| 7.  | .5      | cup         | parmesan cheese         | -      | -         | grated                         | -           |
| 8. | .5 | cup | pecans | - | - | toasted, finely ground | - |
| 9. | .5 | cup | Bread Crumb Mix, plain | - | - | - | Dixie Diner |
| 10. | 8 | - | garlic clove | 4 | teaspoons | minced | - |
| 11. | 8 | - | green onions | - | - | cut into 2 pieces | - |
|-----|---------|-------------|-------------------------|--------|-----------|--------------------------------|-------------|
Note the diversity of the descriptions. Some things are abbreviated, some are not. Some numbers are numbers, some are spelled out.
I would love something that does a perfect parse/translation. But, would settle for something that does reasonably well to start.
Bonus question: after suggesting a strategy / tool, how would you go about it?
Thanks!
Joe
Short answer. Use GATE.
Long answer. You need some tool for pattern recognition in text, something that can catch patterns like:
{Number}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}
{Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient}{"("}{Value}{")"}
...
Where {Number} is a number, {Ingredient} is taken from a dictionary of ingredients, {Measure} from a dictionary of measures, and so on.
The patterns I described are very similar to GATE's JAPE rules. With them you catch text that matches a pattern and assign a label to each part of the pattern (number, ingredient, measure, etc.). Then you extract the labeled text and put it into a single table.
The dictionaries I mentioned can be represented by Gazetteers in GATE.
So, GATE covers all your needs. It's not the easiest way to start, since you will have to learn at least GATE's basics, JAPE rules and Gazetteers, but with such approach you will be able to get really good results.
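To make the pattern idea concrete without pulling in GATE, here is a tiny Java regex sketch of the {Number}{Space}{Measure}{Space}{"of"}{Space}{Ingredient} shape; the measure list is a small stand-in for what a real Gazetteer dictionary would provide:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IngredientPattern {
    // Stand-ins for Gazetteer dictionaries; a real system would load these from files.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?|\\.\\d+";
    private static final String MEASURE =
        "cups?|tablespoons?|teaspoons?|oz\\.?|ounces?|lbs?\\.?|pounds?";

    private static final Pattern LINE = Pattern.compile(
        "(?<number>" + NUMBER + ")\\s+"
        + "(?:(?<measure>" + MEASURE + ")\\s+(?:of\\s+)?)?"
        + "(?<ingredient>.+)",
        Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = LINE.matcher("8 cups of mixed greens (about 5 ounces)");
        if (m.matches()) {
            System.out.println(m.group("number"));     // 8
            System.out.println(m.group("measure"));    // cups
            System.out.println(m.group("ingredient")); // mixed greens (about 5 ounces)
        }
    }
}

This only covers the simplest shapes; JAPE rules plus Gazetteers handle the spelled-out numbers and nested parentheticals far more gracefully.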
It is basically natural language parsing. (You already did stemming: chicken[s].)
So basically it is a translation process.
Fortunately, the context is very restricted.
You need a supportive translation workflow, where you can add dictionary entries, adapt the grammar rules, and retry.
An easy process/workflow is much more important in this case than the algorithms.
I am interested in both aspects.
If you need a programming hand for an initial prototype, feel free to contact me. I can see you are already working quite systematically.
Unfortunately, I do not know of fitting frameworks. You are doing something that Mathematica wants to do with its Alpha (natural language commands yielding results).
Data mining? But simple natural language parsing with a manual adaptation process should give fast and easy results.
You can also try Gexp.
Then you write rules as Java code, such as:
seq(Number, opt(Measure), Ingredient, opt(seq(token("("), Number, Measure, token(")"))))
Then you add a group to capture (group(String name, Matcher m)), extract the parts of the pattern, and store that information in a table.
For Number and Measure you should use similar Gexp patterns, or I would recommend some shallow parsing for noun-phrase detection with words from Ingredients.
If you don't want to be exposed to the nitty-gritty of NLP and machine learning, there are a few hosted services that do this for you:
Zestful (disclaimer: I'm the author)
Spoonacular
Edamam
If you are interested in the nitty-gritty, the New York Times wrote about how they parsed their ingredient archive. They open-sourced their code, but abandoned it soon after. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
Do you have access to a tagged corpus for training a statistical model? That is probably the most fruitful avenue here. You could build one up using epicurious.com; scrape a lot of their recipe ingredients lists, which are in the kind of prose form you need to parse, and then use their helpful "print a shopping list" feature, which provides the same ingredients in a tabular format. You can use this data to train a statistical language model, since you will have both the raw untagged data, and the expected parse results for a large number of examples.
This might be a bigger project than you have in mind, but I think in the end it will produce better results than a structured top-down parsing approach will.
