I will write an EFT system in Java. I will read information from a file, and the file's content follows a standard format. For example:
# Number of banks
2
# BankID, InitialCashReserve
1 0
2 100
# EFTID, Amount, FromBankID, ToBankID
1 40 1 2
2 10 2 1
3 20 2 1
4 30 2 1
5 40 2 1
6 50 2 1
7 60 1 2
Is there an easy way to read these, or do I have to read line by line and check?
You'll have to read it line by line.
If I were you, I'd load the entire contents of the file into some sort of object structure before using the data; that way you won't have to go back and forth through the file stream during the operation of your program.
If there's a library for your file type, that library will pretty much do those 2 steps for you anyway.
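For illustration, here is a minimal sketch of that line-by-line approach for your format (the Bank and Transfer classes and the file name eft.txt are just made-up names): skip the '#' comment lines, read the bank count first, then the bank lines, then the transfer lines.

import java.io.*;
import java.util.*;

// Minimal sketch of the line-by-line approach: skip '#' comment lines, read the
// bank count first, then that many bank lines, then transfer lines, and keep
// everything in simple objects. Bank, Transfer and "eft.txt" are made-up names.
public class EftFileReader {
    static class Bank {
        final int id; final long cash;
        Bank(int id, long cash) { this.id = id; this.cash = cash; }
    }
    static class Transfer {
        final int id; final long amount; final int from; final int to;
        Transfer(int id, long amount, int from, int to) {
            this.id = id; this.amount = amount; this.from = from; this.to = to;
        }
    }

    public static void main(String[] args) throws IOException {
        List<Bank> banks = new ArrayList<>();
        List<Transfer> transfers = new ArrayList<>();
        int bankCount = -1;

        try (BufferedReader in = new BufferedReader(new FileReader("eft.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) continue;   // skip comments
                String[] p = line.split("\\s+");
                if (bankCount < 0) {
                    bankCount = Integer.parseInt(p[0]);                 // number of banks
                } else if (banks.size() < bankCount) {
                    banks.add(new Bank(Integer.parseInt(p[0]), Long.parseLong(p[1])));
                } else {
                    transfers.add(new Transfer(Integer.parseInt(p[0]), Long.parseLong(p[1]),
                            Integer.parseInt(p[2]), Integer.parseInt(p[3])));
                }
            }
        }
        System.out.println(banks.size() + " banks, " + transfers.size() + " transfers loaded");
    }
}

Once the data is in memory like this, the rest of the program never has to touch the file again.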
I have a dataset of transactions where each transaction represents a purchase of a single item. So an order is recorded as 3 transactions if it contained 3 items.
Example dataset:
User, Order, ItemCount, ItemPrice
1 1 1 10
1 1 1 10
1 2 1 30
1 2 1 30
2 3 1 20
2 3 1 20
3 4 1 15
3 4 1 15
3 4 1 15
To reduce the dataset I have grouped by order and user and aggregated ItemCount and ItemPrice to get a dataset like this:
User, Order, ItemCount, OrderAmount
1 1 2 20
1 2 2 60
2 3 2 40
3 4 3 45
Now I want to group the orders by user and do some analysis on the orders for each user. Is there a way in Spark to group the orders by user and end up with pairs of <User, Dataset>, where User is the user id and the Dataset contains that user's orders?
The only solution I see at the moment is to convert the Dataset to an RDD and do groupByKey to get an RDD of pairs <User, Iterable<Row>>, and then write some code to do my analysis on the list of rows.
I would prefer a solution where I can work with the orders as a Dataset and do my analysis using Dataset functionality. Can anyone point me in the right direction here? Is this possible?
I am new to Spark and have been using it with Java so far, as I have very limited experience with Scala, but examples in Scala would also help.
Just group by user and order and aggregate the ItemCount and ItemPrice columns. Then group by user and run the aggregations on the appropriate columns.
df.groupBy($"User", $"Order").agg(sum($"ItemCount").as("count"),
sum($"ItemPrice").as("total"))
.groupBy($"User").agg(avg($"total").as("avg_amount"),
avg($"count").as("avg_count"),
count($"count").as("total_purchases"))
The Pig documentation says that if some conditions are met (these conditions are described in the docs), Pig can do a map-side GROUP. Can someone explain this algorithm? I want to get a deep understanding of what can be done with MapReduce.
For example, imagine the file below:
10 - 1
14 - 2
10 - 3
12 - 4
12 - 5
20 - 6
21 - 7
17 - 8
12 - 9
17 - 10
Then the load is going to store your file like this (imagine your cluster has 3 nodes; if you use an identity MapReduce job you can achieve the same result by setting the number of reducers to 3. If your file is skewed you can have some performance problems).
The loader used for this must guarantee that it does not split a single value of a key across multiple splits. (http://wiki.apache.org/pig/MapSideCogroup)
part-r-00000    part-r-00001    part-r-00002
10 - 1          14 - 2          20 - 6
10 - 3          17 - 8          21 - 7
12 - 4          17 - 10
12 - 5
12 - 9
Now, the Hadoop framework is going to spawn one map task for each partition generated. In this case, 3 map tasks.
So imagine that you are going to sum the second field per key; that processing can run entirely on the map side.
part-m-00000
10 - 4
12 - 18
part-m-00001
14 - 2
17 - 18
part-m-00002
20 - 6
21 - 7
In the case of COGROUP, I imagine it is going to execute in a similar way: each map task is going to operate on two partition files with the same keys.
You can read the source code for the algorithm. A one-liner answer is that both implement a merge algorithm, i.e. the data has to be sorted by group key in advance so that (a) sorting is not required and (b) by iterating over the data you can find where the group key changes.
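To make the merge idea concrete, here is a small sketch (my own illustration, not Pig's actual code) of map-side grouping over one key-sorted split: because the split is sorted by key and no key crosses splits, a single pass can sum per key and emit a record each time the key changes.

import java.util.*;

// Sketch of map-side grouping over one key-sorted split: sum the second field
// per key in a single pass, emitting a record whenever the key changes.
// The array below stands in for the lines of part-r-00000 above.
public class MapSideGroupSketch {
    public static void main(String[] args) {
        int[][] split = {{10, 1}, {10, 3}, {12, 4}, {12, 5}, {12, 9}};
        Integer currentKey = null;
        int sum = 0;
        for (int[] rec : split) {
            if (currentKey != null && rec[0] != currentKey) {
                System.out.println(currentKey + " - " + sum);   // group finished
                sum = 0;
            }
            currentKey = rec[0];
            sum += rec[1];
        }
        if (currentKey != null) System.out.println(currentKey + " - " + sum);
    }
}

Running this over part-r-00000 prints 10 - 4 and 12 - 18, which is exactly the map output shown above, with no reduce phase needed.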
That's why I can't divide it into segments. As for my example above, if 5 threads are set, then the first segment would take the first 2 objects and the second the 3rd and 4th, so neither finds dups, but there are dups if we merge them: the 2nd and the 3rd.
There could be a more complex strategy, e.g. take from the first thread's segment... ah never mind, too hard to explain.
And of course, the selection itself is in my plans.
Thanks.
EDIT:
InChunk, and then continue analyzing that chunk till the end. ;/
I think the process of dividing up the items to be de-duped is going to have to look at the end of the section and move forward to encompass dups past it. For example, if you had:
1 1 2 . 2 4 4 . 5 5 6
and you were dividing it up into blocks of 3, then the dividing process would take 1 1 2 but see that there was another 2, so it would generate 1 1 2 2 as the first block. It would move forward 3 again and take 4 4 5, but see that there were dups ahead and generate 4 4 5 5. The 3rd thread would just have 6. It would become:
1 1 2 2 . 4 4 5 5 . 6
The sizes of the blocks are going to be inconsistent, but as the number of items in the entire list gets large, these small changes are going to be insignificant. The last thread may have very little to do or be short-changed altogether, but again, as the number of elements gets large, this should not impact the performance of the algorithm.
I think this method would be better than somehow having one thread handle the overlapping blocks. With that method, if you had a lot of dups, you could see it having to handle a lot more than 2 contiguous blocks if you were unlucky in the positioning of the dups. For example:
1 1 2 . 2 4 5 . 5 5 6
One thread would have to handle that entire list because of the 2s and the 5s.
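Here is a small sketch of that boundary extension (my own illustration, assuming the items are already sorted, as in the examples above): each block is pushed forward until the value changes, so equal values never straddle two blocks.

import java.util.*;

// Sketch of splitting a sorted array into roughly-equal blocks whose
// boundaries are pushed forward so that equal values never straddle a block.
public class DupAwareBlocks {
    public static void main(String[] args) {
        int[] items = {1, 1, 2, 2, 4, 4, 5, 5, 6};   // sorted input
        int blockSize = 3;
        List<int[]> blocks = new ArrayList<>();       // each entry: {start, end} (end exclusive)

        int start = 0;
        while (start < items.length) {
            int end = Math.min(start + blockSize, items.length);
            // extend the block while the next item equals the last item in it
            while (end < items.length && items[end] == items[end - 1]) end++;
            blocks.add(new int[]{start, end});
            start = end;
        }

        for (int[] b : blocks)
            System.out.println(Arrays.toString(Arrays.copyOfRange(items, b[0], b[1])));
        // prints [1, 1, 2, 2], [4, 4, 5, 5] and [6], matching the example above
    }
}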
I would use a chunk-based division, a task queue (e.g. ExecutorService) and private hash tables to collect duplicates.
Each thread in the pool will take chunks on demand from the queue and add 1 to the value corresponding to the key of the item in the private hash table. At the end they will merge with the global hash table.
At the end, just scan the global hash table and see which keys have a value greater than 1. A rough code sketch follows the worked example below.
For example with a chunk size of 3 and the items:
1 2 2 2 3 4 5 5 6 6
Assume we have 2 threads in the pool. Thread 1 will take 1 2 2 and thread 2 will take 2 3 4. The private hash tables will look like:
1 1
2 2
3 0
4 0
5 0
6 0
and
1 0
2 1
3 1
4 1
5 0
6 0
Next, thread 1 will process 5 5 6 and thread 2 will process 6:
1 1
2 2
3 0
4 0
5 2
6 1
and
1 0
2 1
3 1
4 1
5 0
6 1
At the end, the duplicates are 2, 5 and 6:
1 1
2 3
3 1
4 1
5 2
6 2
This may take up some amount of space due to the private tables of each thread, but will allow the threads to operate in parallel until the merge phase at the end.
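A rough sketch of this scheme in Java (ExecutorService, a shared chunk queue, per-thread HashMaps merged into a ConcurrentHashMap at the end; the chunk size, thread count and items are taken from the example):

import java.util.*;
import java.util.concurrent.*;

// Sketch of the chunk-queue / private-hash-table approach: each worker pulls
// chunks from a shared queue, counts items in its own HashMap, then merges
// into a global ConcurrentHashMap; keys with a final count > 1 are duplicates.
public class ParallelDupCounter {
    public static void main(String[] args) throws InterruptedException {
        int[] items = {1, 2, 2, 2, 3, 4, 5, 5, 6, 6};
        int chunkSize = 3, threads = 2;

        Queue<int[]> chunks = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < items.length; i += chunkSize)
            chunks.add(Arrays.copyOfRange(items, i, Math.min(i + chunkSize, items.length)));

        ConcurrentHashMap<Integer, Integer> global = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                Map<Integer, Integer> local = new HashMap<>();     // private table
                int[] chunk;
                while ((chunk = chunks.poll()) != null)
                    for (int item : chunk) local.merge(item, 1, Integer::sum);
                // merge the private table into the global one
                local.forEach((k, v) -> global.merge(k, v, Integer::sum));
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        global.forEach((k, v) -> { if (v > 1) System.out.println("duplicate: " + k); });
    }
}

With the example input this prints 2, 5 and 6 as duplicates, matching the final table above.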
I am supposed to go, for example, from point 1B to 5D. How am I supposed to reach it? Can anyone give me a hint on how to get started on this? I'm not asking for code, just tips/hints. Thanks.
    A  B  C  D  E
1   5  1  4  4  1
2   3  4  3  3  4
3   4  3  1  1  3
4   4  3  4  2  5
5   3  4  1  1  3
Start by defining your data structures, and then apply the algorithm to them.
You may refer to the pseudo-code on the wiki: A* Search Algorithm.
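To show the shape of it, here is a rough Java sketch of A* over the grid above. It assumes (my assumptions, not stated in the question) that the number in a cell is the cost of stepping onto it and that you can move up/down/left/right.

import java.util.*;

// A* sketch on the cost grid above (assumption: a cell's number is the cost of
// stepping onto it, and movement is 4-directional).
public class AStarGrid {
    static final int[][] COST = {
            {5, 1, 4, 4, 1},
            {3, 4, 3, 3, 4},
            {4, 3, 1, 1, 3},
            {4, 3, 4, 2, 5},
            {3, 4, 1, 1, 3}};

    // Manhattan distance: admissible because every step costs at least 1.
    static int h(int r, int c, int tr, int tc) {
        return Math.abs(r - tr) + Math.abs(c - tc);
    }

    static int cheapestPath(int sr, int sc, int tr, int tc) {
        int n = COST.length, m = COST[0].length;
        int[][] best = new int[n][m];
        for (int[] row : best) Arrays.fill(row, Integer.MAX_VALUE);
        best[sr][sc] = 0;

        // queue entries: {row, col, g, f}; ordered by f = g + h
        PriorityQueue<int[]> open = new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[3]));
        open.add(new int[]{sr, sc, 0, h(sr, sc, tr, tc)});
        int[][] moves = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};

        while (!open.isEmpty()) {
            int[] cur = open.poll();
            if (cur[0] == tr && cur[1] == tc) return cur[2];
            if (cur[2] > best[cur[0]][cur[1]]) continue;           // stale queue entry
            for (int[] mv : moves) {
                int nr = cur[0] + mv[0], nc = cur[1] + mv[1];
                if (nr < 0 || nr >= n || nc < 0 || nc >= m) continue;
                int g = cur[2] + COST[nr][nc];
                if (g < best[nr][nc]) {
                    best[nr][nc] = g;
                    open.add(new int[]{nr, nc, g, g + h(nr, nc, tr, tc)});
                }
            }
        }
        return -1;                                                 // unreachable
    }

    public static void main(String[] args) {
        // 1B -> 5D: row 1 column B is (0, 1), row 5 column D is (4, 3) zero-indexed.
        System.out.println(cheapestPath(0, 1, 4, 3));
    }
}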
Imagine I have the following "pageview matrix":
COLUMN HEADINGS: books placement resources br aca
Each row represents a session
So this is a sample of my matrix:
4 5 0 2 2
1 2 1 7 3
1 3 6 1 6
It is saved in a .txt file.
Can I give this as an input to a k-means program and obtain clusters based on the highest frequency of occurrence? How do I use it?
Can I give this as an input to a k-means program and obtain clusters based on the highest frequency of occurrence?
This is not what k-means does.
You can feed it to a k-means algorithm though. Each row is just a point in a 5d space - what part are you having trouble with?
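For example, here is a rough sketch (using Apache Commons Math; the file name sessions.txt and k = 2 are my assumptions) that reads each row of the matrix as a 5-dimensional point and clusters the rows:

import java.nio.file.*;
import java.util.*;
import java.util.stream.*;
import org.apache.commons.math3.ml.clustering.*;

// Sketch: read the sessions-by-pageview matrix (one session per line, 5
// whitespace-separated counts) and cluster the rows with k-means.
// "sessions.txt" and k = 2 are assumptions for illustration.
public class SessionClustering {
    public static void main(String[] args) throws Exception {
        List<DoublePoint> sessions = Files.readAllLines(Paths.get("sessions.txt")).stream()
                .filter(line -> !line.trim().isEmpty())
                .map(line -> Arrays.stream(line.trim().split("\\s+"))
                        .mapToDouble(Double::parseDouble).toArray())
                .map(DoublePoint::new)
                .collect(Collectors.toList());

        KMeansPlusPlusClusterer<DoublePoint> kmeans = new KMeansPlusPlusClusterer<>(2, 100);
        List<CentroidCluster<DoublePoint>> clusters = kmeans.cluster(sessions);

        for (int i = 0; i < clusters.size(); i++) {
            System.out.println("Cluster " + i + " centre: "
                    + Arrays.toString(clusters.get(i).getCenter().getPoint())
                    + ", sessions: " + clusters.get(i).getPoints().size());
        }
    }
}

Note that this groups similar sessions together; it does not rank pages by frequency of occurrence, which is the point made above.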