String,Dataset pair in Spark 2.0

String,Dataset pair in Spark 2.0 - java

I have a dataset of transactions where each transactions represent a purchase of a single item. So, each order is recorded as 3 transactions if the order contained 3 items.
Example dataset:
User Order, ItemCount, ItemPrice
1 1 1 10
1 1 1 10
1 2 1 30
1 2 1 30
2 3 1 20
2 3 1 20
3 4 1 15
3 4 1 15
3 4 1 15
To reduce the dataset I have grouped by order and user and aggregated ItemCount and ItemPrice to get a dataset like this:
User Order, ItemCount, OrderAmount
1 1 2 20
1 2 2 60
2 3 2 40
3 4 3 45
Now I want to group the orders by user and do some analysis on the orders for each user. Is there a way in Spark to group the orders by user and end with a pair of > where User is the user id and the Dataset contains the orders?
The only solution I see at the moment is to convert the dataset to rdd and do groupbykey to get rddpair> and then write some code to do my analysis on the list of rows.
I would prefer a solution where I can work with the orders as a Dataset and do my analysis using Dataset functionality. Can anyone point me into the right direction here? Is this possible?
I am new to spark and have been using Spark with Java so far as I have very limited experience with Scala, but examples in Scala would help.

Just group by user and order and aggregate columns itemcount and itemprice. Then group by user and run all the aggregations in the appropriate columns.
df.groupBy($"User", $"Order").agg(sum($"ItemCount").as("count"),
sum($"ItemPrice").as("total"))
.groupBy($"User").agg(avg($"total").as("avg_amount"),
avg($"count").as("avg_count"),
count($"count").as("total_purchases"))

Related

What is the most efficient way to compare rows in a MySQL table with Java

This is a largely conceptual question so i dont have any code to show. I'll try to explain this as best i can. I am writing a program that is supposed to find common sequences of numbers found in a large table of random combinations.
So for example take this data:
1 5 3 9 6 3 8 8 3 3
6 7 5 5 5 4 9 2 0 1
6 4 4 3 7 8 3 9 5 6
2 4 2 4 5 5 3 4 7 7
1 5 6 3 4 9 9 3 3 2
0 2 7 9 4 5 3 9 8 3
These are random combinatinos of the numbers 1-9. For every 3 digit (or more) sequence found more than once i need to put that into another database. So the first row contains "5 3 9" and the 6th row also contains "5 3 9". I would put that sequence in a separate table with the number of times it was found.
I'm still working out the algorithm for actually making these comparisons but i figure i'll have to start with "1 5 3", compare that to every single 3 number trio found, then move on to "5 3 9" then "3 9 6" etc....
MY MAIN PROBLEM RIGHT NOW is that i dont know how to do this if these numbers are stored in a database. My database table has 11 columns. One column for each individual number, and one column for the 10 digit sequence as a whole. Columns are called Sequence, 1stNum, 2ndNum, 3rdNum...10thNum.
Visual: first row in my database for the data above would be this :
| 1 5 3 9 6 3 8 8 3 3 | 1 | 5 | 3 | 9 | 6 | 3 | 8 | 8 | 3 | 3 |
("|" divide columns)
How do i make comparisons efficiently with Java? I'm iterating over every row in the table many times. Once for the initial sequence to be compared, and for every one of those sequences i go through each row. Basically a for loop in a for loop. This sounds like its going to take a ton of queries and could take forever if the table gets to be massive (which it will).
Is it more computationally efficient if i iterate through a database using queries or if i dump the database and iterate through a file?
I tried to explain this as best as i could, its a very confusing process for me. I can clarify anything you need me to. I just need guidance on what the best course of action for this would be.

Here is what I would do, assuming you have retrieved the sequences in a list :
List<String> sequences = Arrays.asList("1539638833","6755549201","6443783956","2424553477","1563499332","0279453983");
Map<String,Integer> count = new HashMap<>();
for (String seq : sequences) {
int length = seq.length();
for (int i=0 ; i<length - 2 ; i++) {
String sub = seq.substring(i,i + 3);
count.put(sub,count.containsKey(sub) ? count.get(sub) + 1 : 1);
}
}
System.out.println(count);
Ouput :
{920=1, 783=1, 945=1, 332=1, 963=1, 644=1, 156=1, 983=1, 453=1, 153=1, 388=1, 534=1,
455=1, 245=1, 539=2, 554=1, 242=1, 555=1, 553=1, 437=1, 883=1, 349=1, 755=1, 675=1,
638=1, 395=1, 201=1, 956=1, 933=1, 499=1, 634=1, 839=1, 794=1, 027=1, 477=1, 833=1,
347=1, 492=1, 378=1, 279=1, 993=1, 443=1, 396=1, 398=1, 549=1, 563=1, 424=1}
You can then store these values in the database from the Map.

You can do it in sql with a union clause:
select sum(c), sequence
from
(
select
count(*) as c, concat(col1 ,col2 , col3) as sequence
from t
group by col1, col2, col3
union
select
count(*) as c, concat(col2 ,col3 , col4) as sequence
from t
group by col2, col3, col4
union (... and so on enumerating through the column combinations)
) as tt
group by sequence
I would imagine a pure java implementation would be quicker and have less
memory overhead. But if you already have it in the database it may be quick
enough.

Weighting Factor calculation with Hive UDF

I'm newbie to Hive, I would an help to write an UDF function for weighting factor calculation.
The calculation seems simple.
I have one table with some values KEY,VALUE grouped by GROUP_ID. For each row of one group I want calculate the weighting factor, a float beetween 0 and 1 that's the weight of that element of the group.
The sum of weighting factors into the group must be 1.
In this example the value is the distance, then the weight is inversely proportional to the distance.
GROUP_ID | KEY | VALUE(DISTANCE)
====================================
1 10 4
1 11 3
1 12 2
2 13 1
2 14 5
3 .. ..
...
Math function: 1/(Xi * sum(1/Xk)) from k=1 to N)
GROUP_ID | KEY | VALUE | WEIGHTING_FACTOR
=======================================================
1 10 4 1/(4*(1/4+1/3+1/2)) = 0.23
1 11 3 1/(3*(1/4+1/3+1/2)) = 0.31
1 12 2 1/(2*(1/4+1/3+1/2)) = 0.46
2 13 1 1/(1*(1/1+1/5)) = 0.83
2 14 5 1/(5*(1/1+1+5)) = 0.17
3 .. ..
...
Have you a suggestion for using UDF, UDAF or UDTF function?
Maybe I must use a "Transform"?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Solved using Windowing and Analytics Functions
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/ptf-window.html
Source: https://stackoverflow.com/a/18919834/2568351

Multithreaded search of single collection for duplicates

use I can't divide into segads. As for my above example if 5 threads are set, then first segment would take 2 first object, and second 3th and 4th, so they dont find dups, but there are dups if we merge them, its 2th and 3th.
There could be more complex strate take from first threads .. ah nevermind, to hard to explain.
And ofcourse, problelection itself in my plans.
Tha
EDIT:
InChunk, and then continue analyzing that chunk till the end. ;/

I think the process of dividing up the items to be de-duped is going to have to look at the end of the section and move forward to encompass dups past it. For example, if you had:
1 1 2 . 2 4 4 . 5 5 6
And you dividing up into blocks of 3, then the dividing process would take 1 1 2 but see that there was another 2 so it would generate 1 1 2 2 as the first block. It would move forward 3 again and generate 4 4 5 but see that there were dups forward and generate 4 4 5 5. The 3rd thread would just have 6. It would become:
1 1 2 2 . 4 4 5 5 . 6
The size of the blocks are going to be inconsistent but as the number of items in the entire list gets large, these small changes are going to be insignificant. The last thread may have very little to do or be short changed altogether but again, as the number of elements gets large, this should not impact the performance of the algorithm.
I think this method would be better than somehow having one thread handle the overlapping blocks. With that method, if you had a lot of dups, you could see it having to handle a lot more than 2 contiguous blocks if you were unlucky in the positing of the dups. For example:
1 1 2 . 2 4 5 . 5 5 6
One thread would have to handle that entire list because of the 2s and the 5s.

I would use a chunk-based division, a task queue (e.g. ExecutorService) and private hash tables to collect duplicates.
Each thread in the pool will take chunks on demand from the queue and add 1 to the value corresponding to the key of the item in the private hash table. At the end they will merge with the global hash table.
At the end just parse the hash table and see which keys have a value greater than 1.
For example with a chunk size of 3 and the items:
1 2 2 2 3 4 5 5 6 6
Assume to have 2 threads in the pool. Thread 1 will take 1 2 2 and thread 2 will take 2 3 4. The private hash tables will look like:
1 1
2 2
3 0
4 0
5 0
6 0
and
1 0
2 1
3 1
4 1
5 0
6 0
Next, thread 1 will process 5 5 6 and thread 2 will process 6:
1 1
2 2
3 0
4 0
5 2
6 1
and
1 0
2 1
3 1
4 1
5 0
6 1
At the end, the duplicates are 2, 5 and 6:
1 1
2 3
3 1
4 1
5 2
6 2
This may take up some amount of space due to the private tables of each thread, but will allow the threads to operate in parallel until the merge phase at the end.

First In, First Out (FIFO) Inventory

Let's say I have the following two records;
tran_id item_id qty_in qty_out price
1 1 15 0 1.50
2 1 15 0 1.60
Now, when I want to consume 20 units of item_id 1, I want to consume 15 of rated 1.50 and 5 of rated 1.60 on a FIFO.
Can somebody give me an idea as to how I should proceed?

Your SQL statement could look something like this
select * from tablename where item_id = 1 order by tran_id asc
That should give you your records with the first items at the top and so on. Then in your java code you can adjust the quantities accordingly.

How to group rows in htmldatatable?

How to group rows in htmldatatable?
I am using JSF.
A short example :
TransNum TransAmount InvoiceNum InvoiceAmount
1 50 1 10
1 50 2 15
1 50 3 30
2 10 1 6
2 10 2 5
If I select Grouping column as "InvoiceNum" then the table should look like:-
(i.e Grouping is done on InvoiceNum):
TransNum TransAmount InvoiceNum InvoiceAmount
1
1 50 1 10
2 10 1 6
2
1 50 2 15
2 10 2 5
3
1 50 3 30
TransNum TransAmount InvoiceNum InvoiceAmount
Similarly, grouping can be done based on multiple columns values too.
Thanks in advance.

JSF h:dataTable has no built-in grouping.
Either you find a component that fits your needs in one of the component libraries, such as Primefaces, Richfaces or Icefaces.
Or you have to implement it yourself in the backing bean by sorting the list in the way you want.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

String,Dataset pair in Spark 2.0 - java

Related

What is the most efficient way to compare rows in a MySQL table with Java

Weighting Factor calculation with Hive UDF

Multithreaded search of single collection for duplicates

First In, First Out (FIFO) Inventory

How to group rows in htmldatatable?

Categories

Resources