I want to convert each value of a Spark Dataset (say x rows and y columns) into an individual row (the result should have x*y rows), with an additional column.
For example,
ColA ColB ColC
1 2 3
4 5 6
Should produce,
NewColA NewColB
1 ColA
4 ColA
2 ColB
5 ColB
3 ColC
6 ColC
The values in NewColB are the names of the original columns of the values in NewColA, i.e. values 1 and 4 have ColA in NewColB because they originally came from ColA, and so on.
I have seen a few implementations of the explode() function in Java, but I want to know how it can be used in my use case. Also note that the input may be large (x*y may be in the millions).
The simplest way to accomplish this is with the stack() function built into Spark SQL.
val df = Seq((1, 2, 3), (4, 5, 6)).toDF("ColA", "ColB", "ColC")
df.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 2| 3|
| 4| 5| 6|
+----+----+----+
import org.apache.spark.sql.functions.expr

val df2 = df.select(expr("stack(3, ColA, 'ColA', ColB, 'ColB', ColC, 'ColC') as (NewColA, NewColB)"))
df2.show()
+-------+-------+
|NewColA|NewColB|
+-------+-------+
| 1| ColA|
| 2| ColB|
| 3| ColC|
| 4| ColA|
| 5| ColB|
| 6| ColC|
+-------+-------+
Sorry, the examples are in Scala, but they should be easy to translate to Java. Note that the row order differs from the one in your question, but the same x*y rows are produced.
It's also possible, albeit more complicated and less efficient, to do this with a .flatMap().
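For reference, the flatMap idea can be sketched in plain Java on in-memory rows (hypothetical MeltSketch class and melt helper; in Spark you would do the same per Row inside Dataset.flatMap, emitting one (value, columnName) pair per column):

```java
import java.util.ArrayList;
import java.util.List;

public class MeltSketch {
    // Melt each row (ColA, ColB, ColC) into (value, columnName) pairs,
    // mirroring what a Dataset.flatMap would emit per Row.
    static List<Object[]> melt(int[][] rows, String[] cols) {
        List<Object[]> out = new ArrayList<>();
        for (int[] row : rows) {
            for (int j = 0; j < cols.length; j++) {
                out.add(new Object[]{row[j], cols[j]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] rows = {{1, 2, 3}, {4, 5, 6}};
        String[] cols = {"ColA", "ColB", "ColC"};
        // prints 1 ColA, 2 ColB, 3 ColC, 4 ColA, 5 ColB, 6 ColC (one per line)
        for (Object[] r : melt(rows, cols)) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

This produces x*y output rows per the question; the stack() version above does the same work inside the SQL engine without a lambda per row, which is why it tends to be more efficient.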
What are the necessary conditions for a union: the number of columns, identical column names, or different columns?
Let's assume you have two DataFrames.
val df1 = spark.sql("SELECT 1 as a,3 as c")
val df2 = spark.sql("SELECT 1 as a,2 as b")
df1.union(df2) will work, because its only condition is that both DataFrames have the same number of columns.
df1.union(df2).show()
+---+---+
| a| c|
+---+---+
| 1| 3|
| 1| 2|
+---+---+
As you can see, it takes df1's column names and matches df2's columns only by index, not by name.
If you use unionByName, e.g.
df1.unionByName(df2).show()
it won't work, because it tries to look up a column named 'c' in df2 and fails.
To sum up: both union styles require the same number of columns, and unionByName additionally requires that the columns have the same names. (Since Spark 3.1, unionByName also takes an allowMissingColumns flag that fills missing columns with nulls.)
So my project has a "friends list" and in the MySQL database I have created a table:
nameA
nameB
Primary Key (nameA, nameB)
This will lead to a lot of entries, but to keep my database normalised I'm not sure how else to achieve this.
My project also uses Redis; I could store the pairs there instead.
When a person joins the server, I would then have to search all of the entries to see whether their name appears as nameA or nameB, and then pair those two names together as friends; this may also be inefficient.
Cheers.
The task is quite common: you want to store pairs where A|B has the same meaning as B|A. Since a table has ordered columns, one of the two IDs will be stored in the first column and the other in the second, but which to store first, and why?
One solution is to always store the lesser ID first and the greater ID second:
userid1 | userid2
--------+--------
1 | 2
2 | 5
2 | 6
4 | 5
This has the advantage that you store each pair only once, as feels natural, but the disadvantage that you must look a person up in both columns, finding their friend sometimes in the first and sometimes in the second column. That can make queries somewhat clumsy.
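The lesser-first convention is also easy to enforce at the application layer; here is a minimal Java sketch (hypothetical FriendPairs class and key helper, not tied to any particular database driver):

```java
import java.util.HashSet;
import java.util.Set;

public class FriendPairs {
    // Canonical form: always put the lesser ID first, so the pair A|B
    // and the pair B|A map to the same stored key.
    static String key(int a, int b) {
        return Math.min(a, b) + "|" + Math.max(a, b);
    }

    public static void main(String[] args) {
        Set<String> friends = new HashSet<>();
        friends.add(key(2, 1)); // same pair, either order
        friends.add(key(1, 2));
        System.out.println(friends.size()); // 1: stored only once
        // A lookup still has to consider both positions, which is the
        // clumsiness mentioned above; in SQL this is the familiar
        // WHERE userid1 = :id OR userid2 = :id condition.
        System.out.println(friends.contains(key(1, 2))); // true
    }
}
```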
Another method is to store the pairs redundantly (by using a trigger typically):
userid1 | userid2
--------+--------
1 | 2
2 | 1
2 | 5
2 | 6
4 | 5
5 | 2
5 | 4
6 | 2
Here querying is easier: Look the person up in one column and find their friends in the other. However, it looks kind of weird to have all pairs duplicated. And you rely on a trigger, which some people don't like.
A third method is to store numbered friendships:
friendship | user_id
-----------+--------
1 | 1
1 | 2
2 | 2
2 | 5
3 | 2
3 | 6
4 | 4
4 | 5
This gives both users in the pair equal standing. But in order to find friends, you need two passes: find the friendships for a user, then find the friends in those friendships. However, the design is very clear and even extensible, i.e. you could have friendships of three, four, or more users.
No method is clearly better than the others.
This is a largely conceptual question, so I don't have any code to show. I'll try to explain this as best I can. I am writing a program that is supposed to find common sequences of numbers in a large table of random combinations.
So for example take this data:
1 5 3 9 6 3 8 8 3 3
6 7 5 5 5 4 9 2 0 1
6 4 4 3 7 8 3 9 5 6
2 4 2 4 5 5 3 4 7 7
1 5 6 3 4 9 9 3 3 2
0 2 7 9 4 5 3 9 8 3
These are random combinations of the digits 0-9. For every 3-digit (or longer) sequence found more than once, I need to put that sequence into another table. For example, the first row contains "5 3 9" and the 6th row also contains "5 3 9", so I would put that sequence in a separate table with the number of times it was found.
I'm still working out the algorithm for actually making these comparisons, but I figure I'll have to start with "1 5 3", compare that to every 3-number trio found, then move on to "5 3 9", then "3 9 6", etc.
My main problem right now is that I don't know how to do this when the numbers are stored in a database. My database table has 11 columns: one column for each individual number, and one column for the 10-digit sequence as a whole. The columns are called Sequence, 1stNum, 2ndNum, 3rdNum...10thNum.
Visual: first row in my database for the data above would be this :
| 1 5 3 9 6 3 8 8 3 3 | 1 | 5 | 3 | 9 | 6 | 3 | 8 | 8 | 3 | 3 |
("|" divide columns)
How do I make these comparisons efficiently in Java? I'm iterating over every row in the table many times: once for the initial sequence to be compared, and for each of those sequences I go through every row again, basically a for loop inside a for loop. This sounds like it will take a ton of queries and could take forever if the table gets massive (which it will).
Is it more computationally efficient to iterate through the database with queries, or to dump the database and iterate through a file?
I tried to explain this as best I could; it's a very confusing process for me. I can clarify anything you need me to. I just need guidance on the best course of action.
Here is what I would do, assuming you have retrieved the sequences in a list :
List<String> sequences = Arrays.asList("1539638833","6755549201","6443783956","2424553477","1563499332","0279453983");
Map<String,Integer> count = new HashMap<>();
for (String seq : sequences) {
    // slide a window of length 3 across each sequence
    for (int i = 0; i <= seq.length() - 3; i++) {
        String sub = seq.substring(i, i + 3);
        count.merge(sub, 1, Integer::sum);
    }
}
System.out.println(count);
Output:
{920=1, 783=1, 945=1, 332=1, 963=1, 644=1, 156=1, 983=1, 453=1, 153=1, 388=1, 534=1,
455=1, 245=1, 539=2, 554=1, 242=1, 555=1, 553=1, 437=1, 883=1, 349=1, 755=1, 675=1,
638=1, 395=1, 201=1, 956=1, 933=1, 499=1, 634=1, 839=1, 794=1, 027=1, 477=1, 833=1,
347=1, 492=1, 378=1, 279=1, 993=1, 443=1, 396=1, 398=1, 549=1, 563=1, 424=1}
You can then store these values in the database from the Map.
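Since only sequences that occur more than once need storing, you can filter the map first; a small sketch (hypothetical RepeatedSequences class, independent of the database layer):

```java
import java.util.HashMap;
import java.util.Map;

public class RepeatedSequences {
    // Keep only sequences that were counted more than once.
    static Map<String, Integer> repeatedOnly(Map<String, Integer> count) {
        Map<String, Integer> repeated = new HashMap<>();
        for (Map.Entry<String, Integer> e : count.entrySet()) {
            if (e.getValue() > 1) {
                repeated.put(e.getKey(), e.getValue());
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        Map<String, Integer> count = new HashMap<>();
        count.put("539", 2);
        count.put("156", 1);
        System.out.println(repeatedOnly(count)); // {539=2}
    }
}
```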
You can do it in SQL with a union all clause (use union all, not union, so that identical (count, sequence) pairs coming from different column windows are not collapsed into one row):
select sum(c), sequence
from
(
select
count(*) as c, concat(col1, col2, col3) as sequence
from t
group by col1, col2, col3
union all
select
count(*) as c, concat(col2, col3, col4) as sequence
from t
group by col2, col3, col4
union all (... and so on enumerating through the column combinations)
) as tt
group by sequence
I would imagine a pure Java implementation would be quicker and have less memory overhead, but if you already have the data in the database it may be quick enough.
I'm a newbie to Hive and would like help writing a UDF for a weighting-factor calculation.
The calculation seems simple.
I have one table with KEY, VALUE pairs grouped by GROUP_ID. For each row of a group I want to calculate the weighting factor, a float between 0 and 1 that is the weight of that element within the group.
The sum of the weighting factors within a group must be 1.
In this example the value is a distance, so the weight is inversely proportional to the distance.
GROUP_ID | KEY | VALUE(DISTANCE)
====================================
1 10 4
1 11 3
1 12 2
2 13 1
2 14 5
3 .. ..
...
Math function: Wi = 1 / (Xi * sum(1/Xk, for k = 1 to N))
GROUP_ID | KEY | VALUE | WEIGHTING_FACTOR
=======================================================
1 10 4 1/(4*(1/4+1/3+1/2)) = 0.23
1 11 3 1/(3*(1/4+1/3+1/2)) = 0.31
1 12 2 1/(2*(1/4+1/3+1/2)) = 0.46
2 13 1 1/(1*(1/1+1/5)) = 0.83
2          14     5        1/(5*(1/1+1/5)) = 0.17
3 .. ..
...
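To sanity-check the formula before writing any Hive code, here is a small Java sketch (hypothetical WeightingFactor class) computing Wi = 1/(Xi * sum(1/Xk)) for one group; the weights always sum to 1:

```java
public class WeightingFactor {
    // weight_i = 1 / (x_i * sum_k(1/x_k)), over all x_k in the group
    static double[] weights(double[] xs) {
        double s = 0;
        for (double x : xs) s += 1.0 / x;   // sum of reciprocals
        double[] w = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            w[i] = 1.0 / (xs[i] * s);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] w = weights(new double[]{4, 3, 2}); // group 1 from the table
        // roughly 0.23, 0.31, 0.46 (matches the table above), summing to 1
        System.out.println(w[0] + " " + w[1] + " " + w[2]);
    }
}
```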
Do you have a suggestion on whether to use a UDF, UDAF, or UDTF?
Or maybe I need to use a "transform"?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
Solved using Windowing and Analytics Functions
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html
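For completeness, the windowed version can be sketched directly in HiveQL; this is an untested sketch assuming a table t with columns group_id, key, and value (the formula from the question, with the per-group reciprocal sum computed by a window function):

```sql
SELECT group_id,
       `key`,
       value,
       1.0 / (value * SUM(1.0 / value) OVER (PARTITION BY group_id)) AS weighting_factor
FROM t;
```

No UDF is needed: SUM(...) OVER (PARTITION BY group_id) gives each row the group's sum of 1/value, and the weights in each group sum to 1 by construction.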
Source: https://stackoverflow.com/a/18919834/2568351