What is the algorithm behind GROUP USING 'collected' and 'merge' - java

The Pig documentation says that if certain conditions are met (these conditions are described in the doc), Pig can do a map-side GROUP. Can someone explain this algorithm? I want to get a deep understanding of what can be done with MapReduce.

For example, imagine the file below:
10 - 1
14 - 2
10 - 3
12 - 4
12 - 5
20 - 6
21 - 7
17 - 8
12 - 9
17 - 10
Then the load is going to store your file like this (imagine your cluster has 3 nodes; if you use an identity MapReduce job you can achieve the same result by setting the number of reducers to 3; if your file is skewed you can have some performance problems).
The loader used for this must guarantee that it does not split the records for a single key across multiple input splits. (http://wiki.apache.org/pig/MapSideCogroup)
part-r-00000    part-r-00001    part-r-00002
10 - 1          14 - 2          20 - 6
10 - 3          17 - 8          21 - 7
12 - 4          17 - 10
12 - 5
12 - 9
Now, the Hadoop framework is going to spawn one map task for each partition generated, in this case 3 map tasks.
So imagine that you are going to sum the second field per key; that process can run entirely on the map side.
part-m-00000
10 - 4
12 - 18
part-m-00001
14 - 2
17 - 18
part-m-00002
20 - 6
21 - 7
In the case of COGROUP, I imagine that it is going to execute in a similar way: each map task is going to operate on two partition files with the same keys.

You can read the source code for the algorithm. A one-line answer is that both implement a merge algorithm: the data has to be sorted by the group key in advance, so that (a) sorting is not required and (b) by iterating over the data you can find where the group key changes.
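For illustration, here is a minimal Java sketch of that idea (my own toy code, not Pig's actual implementation): because the records in each split are already sorted by key, a single pass can detect where the key changes and emit one aggregated row per group, with no shuffle or reduce phase.

// Illustrative sketch only: one pass over key-sorted records, emitting a
// per-key sum whenever the key changes. This is the merge idea, not Pig's code.
public class MapSideGroupSketch {
    public static void main(String[] args) {
        // Simulated contents of one split: {key, value} pairs, already sorted by key.
        int[][] split = { {10, 1}, {10, 3}, {12, 4}, {12, 5}, {12, 9} };

        int currentKey = split[0][0];
        int sum = 0;
        for (int[] record : split) {
            if (record[0] != currentKey) {              // key boundary: emit finished group
                System.out.println(currentKey + " - " + sum);
                currentKey = record[0];
                sum = 0;
            }
            sum += record[1];
        }
        System.out.println(currentKey + " - " + sum);   // emit the last group
    }
}

Running this over the first split above prints "10 - 4" and "12 - 18", matching part-m-00000.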

Related

File processing with a standard format

I will write an EFT system in Java. I will read information from a file, and the file's content follows a standard format. For example:
# Number of banks
2
# BankID, InitialCashReserve
1 0
2 100
# EFTID, Amount, FromBankID, ToBankID
1 40 1 2
2 10 2 1
3 20 2 1
4 30 2 1
5 40 2 1
6 50 2 1
7 60 1 2
Is there an easy way to read these, or do I have to read line by line and check?
You'll have to read it line by line.
If I were you, I'd load the entire contents of the file into some sort of object structure before using the data, that way you won't have to be going back and forth through the file stream during the operation of your program.
If there's a library for your file type, that library will pretty much do those 2 steps for you anyway.
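For illustration, a minimal hand-rolled parser might look like the sketch below; the Transfer record and the method names are my own assumptions, not a standard API. It skips the '#' comment lines and loads the four-field EFT lines into objects up front, as suggested above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of a line-by-line parser for the format above; names are illustrative.
public class EftFileReader {
    record Transfer(int id, int amount, int fromBank, int toBank) {}

    public static List<Transfer> read(String path) throws IOException {
        List<Transfer> transfers = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
                String[] fields = line.split("\\s+");
                if (fields.length == 4) {                             // EFT lines have 4 fields
                    transfers.add(new Transfer(
                            Integer.parseInt(fields[0]), Integer.parseInt(fields[1]),
                            Integer.parseInt(fields[2]), Integer.parseInt(fields[3])));
                }
                // the bank-count line (1 field) and bank lines (2 fields) would be
                // handled the same way, dispatching on fields.length
            }
        }
        return transfers;
    }
}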

Multi Thread Processing in Java

I'm just having a few issues getting my head around this problem. Any help would be greatly appreciated.
The program must read a text file and compute the sum of divisors of each input number. For example, the number 20 has the sum 1+2+4+5+10=22. These sums are then totaled line by line. For each of those line totals, the sum of divisors is computed again, and finally those results are added up into a grand total.
E.g Initial File
1 2 4 6 15 20
25 50 100 125 250 500
16 8 3
Then it computes the sums of divisors:
1 1 3 6 9 22
6 43 117 31 218 592
15 7 1
Summed up line by line
42
1007
23
Then the sums of divisors of the above line totals are computed:
54
73
1
Then finally totaled up and returned.
128
I need to complete the process with each line being handled by a thread pool.
My logic is as follows.
5.2. For each input line (add each line to an ArrayBlockingQueue, then submit each item in the queue to an ExecutorService, which will run the following):
5.2.1. Parse the current input line into integers
5.2.2. For each integer in the current input line
5.2.2.1. Compute the sum-of-divisors of this integer
5.2.2.2. Add this to the cumulated sum-of-divisors
5.2.3. Compute the sum-of-divisors of this cumulated sum
5.2.4. Add this to the grand total
I get stuck after 5.2. Do I create a new class that implements the Runnable interface and then add the cumulated sum to an atomic array, or is it best to create a class that implements the Callable interface and have it return the cumulated sum? Or is there a completely different way?
Here is what I have so far, which returns the desired result, but sequentially.
http://pastebin.com/AyB58fpr
Use java.util.concurrent.Future and java.util.concurrent.Executors.newFixedThreadPool(int nThreads). This will be really easy to do.
Follow the Oracle tutorial if you are not familiar with Executors.
I prefer the Callable interface since it doesn't create a dependency between the code that processes the input and the way the output is gathered.
The usual approach is to collect the tasks in a list of Futures. See this answer for an example.
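A minimal sketch of that wiring (with the input lines hard-coded rather than read from the file, and a naive divisor-sum loop, just to keep it self-contained; this is not the pastebin code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one Callable per input line, results gathered from Futures.
public class DivisorSums {
    // Sum of proper divisors of n (e.g. 20 -> 1+2+4+5+10 = 22).
    static long sumOfDivisors(long n) {
        if (n == 1) return 1;              // the example in the question counts 1 -> 1
        long sum = 0;
        for (long d = 1; d <= n / 2; d++) {
            if (n % d == 0) sum += d;      // proper divisors only (exclude n itself)
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = List.of("1 2 4 6 15 20", "25 50 100 125 250 500", "16 8 3");

        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Future<Long>> futures = new ArrayList<>();
        for (String line : lines) {
            futures.add(pool.submit((Callable<Long>) () -> {
                long lineSum = 0;
                for (String token : line.trim().split("\\s+")) {
                    lineSum += sumOfDivisors(Long.parseLong(token)); // 5.2.2
                }
                return sumOfDivisors(lineSum);                       // 5.2.3
            }));
        }

        long grandTotal = 0;
        for (Future<Long> f : futures) {
            grandTotal += f.get();                                   // 5.2.4
        }
        pool.shutdown();
        System.out.println(grandTotal);                              // 128 for the example
    }
}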

What was wrong with my logic on Java insertion sorts?

I had a question on a test - I got it wrong, but I don't know why.
Question: An array of integers is to be sorted from biggest to smallest using an insertion sort. Assume the array originally contains the following elements:
5 9 17 12 2 14
What will it look like after the third pass through the for loop?
17 9 5 12 2 14
17 12 9 5 2 14 - Correct answer
17 12 9 5 14 2
17 14 12 9 5 2
9 5 17 12 2 14
In an insertion sort, I thought the source array is untouched; I felt, at the third pass, the destination array would be incomplete.
How did the test arrive at this answer?
Basically, after n passes of insertion sort, the first n+1 elements are sorted in the correct order and the rest are untouched. As you can see in the alternatives, the correct answer is the only one that fulfills that. Every step is only inserting relative to the already sorted numbers, so each step one extra number is correctly sorted.
Step 0 (Original, 5 assumed sorted)
5 9 17 12 2 14
Step 1, takes the 9 and puts it in the correct place before 5 (result, 9 5 sorted)
9 5 17 12 2 14
Step 2, takes the 17 and puts it in the correct place before 9 (result 17 9 5 sorted)
17 9 5 12 2 14
Step 3, takes the 12 and puts it in the correct place after 17 and before 9 (result 17 12 9 5 sorted)
17 12 9 5 2 14
Insertion sort is typically done in-place. After k iterations the first k + 1 elements are sorted, hence the answer you've shown above.
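For reference, here is a small stand-alone sketch (my own illustration, not test code) of a descending, in-place insertion sort that prints the array after each pass; the third pass produces exactly the correct answer above.

import java.util.Arrays;

// Illustration: in-place insertion sort, descending, printing each pass.
public class InsertionSortDemo {
    public static void main(String[] args) {
        int[] a = {5, 9, 17, 12, 2, 14};
        for (int i = 1; i < a.length; i++) {      // pass i inserts a[i]
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] < key) {        // '<' gives biggest-to-smallest order
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
            System.out.println("After pass " + i + ": " + Arrays.toString(a));
        }
    }
}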

Reducing the granularity of a data set

I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:
# Students Year Subject Teacher
1 30 7 Math Mrs Smith
2 28 7 Math Mr Cork
3 20 8 Math Mrs Smith
4 20 8 English Mr White
5 18 8 English Mr Book
6 10 12 Math Mrs Jones
Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:
# Students Year Subject
1 58 7 Math
2 20 8 Math
3 38 8 English
4 10 12 Math
My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.
My initial (perhaps naive) thought would be to do something along the following lines:
Until I exhaust the list of records:
Each 'unique' record that I come across is added as a key to a hashmap.
If I encounter a record that has the same data for this new level of aggregation, add its quantity to the existing one.
Now, for all I know, this is a fairly common problem and there are much better ways of doing it, so I'd welcome any feedback as to whether I'm pointing myself in the right direction.
"Get a new cache" not an option I'm afraid :)
-Dave.
Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a dbms does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.
Of course, you would have to build the index when populating the cache and maintain it as data is modified.
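A minimal sketch of the hashmap roll-up described in the question (the Record type and the string composite key are my own illustrative choices, not your cache's API):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: roll up (students, year, subject, teacher) records to (year, subject).
public class RollUp {
    record Record(int students, int year, String subject, String teacher) {}

    public static Map<String, Integer> byYearAndSubject(List<Record> records) {
        Map<String, Integer> totals = new HashMap<>();
        for (Record r : records) {
            String key = r.year() + "|" + r.subject();       // composite aggregation key
            totals.merge(key, r.students(), Integer::sum);   // add to any existing total
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Record> cache = List.of(
                new Record(30, 7, "Math", "Mrs Smith"),
                new Record(28, 7, "Math", "Mr Cork"),
                new Record(20, 8, "Math", "Mrs Smith"),
                new Record(20, 8, "English", "Mr White"),
                new Record(18, 8, "English", "Mr Book"),
                new Record(10, 12, "Math", "Mrs Jones"));
        // totals: 7|Math=58, 8|Math=20, 8|English=38, 12|Math=10 (iteration order not guaranteed)
        System.out.println(byYearAndSubject(cache));
    }
}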

Interpolating Large Datasets On the Fly

Interpolating Large Datasets
I have a large data set of about 0.5 million records representing the exchange rate between USD and GBP over the course of a given day.
I have an application that wants to be able to graph this data or maybe a subset. For obvious reasons I do not want to plot 0.5 million points on my graph.
What I need is a smaller data set (100 points or so) which represents the given data as accurately as possible. Does anyone know of any interesting and performant ways this can be achieved?
Cheers, Karl
There are several statistical methods for reducing a large dataset to a smaller, easier to visualize dataset. It's not clear from your question what summary statistic you want. I've just assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic that I'm not considering.
Summarizing a trend over time
Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):
> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a loess-smoothed line to the points
> lines(lowess(cars), col = 2)
# plot a finer-grained loess-smoothed line to the points
> lines(lowess(cars, f=.2), col = 3)
The parameter f controls how tightly the regression fits to your data. Use some thoughtfulness with this, as you want something that accurately fits your data without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.
It's also straightforward to access the results of the smoothing. Here's how to do that:
> data = lowess( cars$speed, cars$dist )
> data
$x
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25
$y
[1] 4.965459 4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698
The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
One thought is to use the DBMS to compress the data for you using an appropriate query. Something along the lines of having it take a median for a specific range; a pseudo-query:
SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)
Here truncate_to_hour is something appropriate to your DBMS. You could take a similar approach with some other function that segments the time into unique blocks (such as rounding to the nearest 5-minute interval), or another aggregate function in place of median, whichever is appropriate. Depending on the complexity of the time-segmenting procedure and how your DBMS optimizes it, it may be more efficient to run the query against a temporary table that already holds the segmented time value.
If you wanted to write your own, one obvious solution would be to break your record set into fixed number-of-points chunks, for which the value would be the average (mean, median, ... pick one). This has the probable advantage of being the fastest, and shows overall trends.
But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.
I use this at work for things like error graphs; I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I have 1-hour resolution for a few days, then 1-day resolution for a few months.
The naive approach is simply calculating an average per time interval corresponding to a pixel.
http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg
This does not show fluctuations. I would suggest also calculating the standard deviation in each time interval and plotting that too (essentially making each pixel taller than a single pixel). I could not locate an example, but I know that Gnuplot can do this (though Gnuplot is not written in Java).
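As a rough Java sketch of that suggestion (the bucket count and method name are arbitrary choices of mine): split the series into a fixed number of buckets and keep both the mean and the standard deviation of each bucket, so each plotted point can carry an error bar.

// Sketch: reduce a large series to nBuckets (mean, stddev) pairs.
public class Downsample {
    public static double[][] meanAndStdDev(double[] values, int nBuckets) {
        double[][] out = new double[nBuckets][2];           // [bucket][0]=mean, [1]=stddev
        int bucketSize = (int) Math.ceil(values.length / (double) nBuckets);
        for (int b = 0; b < nBuckets; b++) {
            int start = b * bucketSize;
            int end = Math.min(start + bucketSize, values.length);
            int n = end - start;
            if (n <= 0) break;                               // ran out of data
            double sum = 0, sumSq = 0;
            for (int i = start; i < end; i++) {
                sum += values[i];
                sumSq += values[i] * values[i];
            }
            double mean = sum / n;
            out[b][0] = mean;
            out[b][1] = Math.sqrt(Math.max(0, sumSq / n - mean * mean)); // population stddev
        }
        return out;
    }
}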
How about making an enumeration/iterator wrapper? I'm not familiar with Java, but it may look something like this:
import java.util.Enumeration;

// Wraps an Enumeration<Double> and returns, for each call to nextElement(),
// the mean of the next frameSize values (despite the class name, this is an
// average over each frame, not a true median).
class MedianEnumeration implements Enumeration<Double>
{
    private final Enumeration<Double> frameEnum;
    private final int frameSize;

    MedianEnumeration(Enumeration<Double> e, int len) {
        frameEnum = e;
        frameSize = len;
    }

    public boolean hasMoreElements() {
        return frameEnum.hasMoreElements();
    }

    public Double nextElement() {
        Double sum = frameEnum.nextElement();
        int i;
        for (i = 1; (i < frameSize) && frameEnum.hasMoreElements(); ++i) {
            sum += frameEnum.nextElement();
        }
        return (sum / i);   // mean of the frame just consumed
    }
}
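A possible usage sketch, assuming the class above sits in the same package (and noting again that it returns the mean of each frame rather than a true median):

import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical usage: collapse every 3 consecutive rates into one averaged value.
public class DownsampleDemo {
    public static void main(String[] args) {
        List<Double> rates = Arrays.asList(1.52, 1.53, 1.51, 1.58, 1.57, 1.59);
        Enumeration<Double> smoothed =
                new MedianEnumeration(Collections.enumeration(rates), 3);
        while (smoothed.hasMoreElements()) {
            System.out.println(smoothed.nextElement()); // prints the mean of each group of three
        }
    }
}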
