Multi Thread Processing in Java - java

Im just having a few issues getting my head around this problem. Any help would be greatly appreciated.
The program must read a text file, where it will compute the sum of divisors of each input number. For example, number 20 has the sum 1+2+4+5+10=22. Then these sums are summed up line-by-line. For each of these sums above the divisor is then found, and then finally they are totaled up.
E.g Initial File
1 2 4 6 15 20
25 50 100 125 250 500
16 8 3
Then computes the sum of divisors.
1 1 3 6 9 22
6 43 117 31 218 592
15 7 1
Summed up line by line
42
1007
23
Then above sums are computed.
54
73
1
Then finally totaled up and returned.
128
I need to complete the process with each new line being completed by a threadpool.
My logic is as follows.
5.2. For each input line (Add each line to an ArrayBlockingQueue,
Then add each item in the Queue to an ExecutorService Which will run the follow)
5.2.1. Parse the current input line into integers
5.2.2. For each integer in the current input line
5.2.2.1. Compute the sum-of-divisors of this integer
5.2.2.2. Add this to the cumulated sum-of-divisors
5.2.3. Compute the sum-of-divisors of this cumulated sum
5.2.4. Add this to the grand total
I get stuck after 5.2, Do I either create a new class that implements the runnable interface and then adds the cumulated sum to an atomicArray, or is best to create a class that implements the callable interface and then get it to return the cumulated sum? Or is there a completely different way.
Here is what i have so far which returns the desired result but in a sequential matter.
http://pastebin.com/AyB58fpr

Use
java.util.concurrent.Future
and a
java.util.concurrent.Executors.newFixedThreadPool(int nThreads)
This will be really easy to do.
Follow the Oracle tutorial if you are not familiar with Executors.

I prefer the Callable interface since that doesn't create a dependency of the code which processes the input to how the output is gathered.
The usual approach is to collect the tasks in a list of Futures. See this answer for an example.

Related

Array and file creation basic Java Array Problem

I happen to be having a hard time with a problem. A test will be coming up soon for me and I'm just not so sure as to how to get this problem done. (Nothing too advanced) My main issue is that the reading of and creating the final file. the problem is right below. Any help is appreciated, thanks!
I've been told to add a summary due to the difficulty to understand what im asking. So essentially you have a file, text.txt and the program takes said file and adds all the numbers in it, spaces make the numbers distinguished and if a zero is at the front it needs to get taken away. All numbers are separated (whatever number of spaces are between them) is all separated with a space + space and a space = at the end followed by space the result of the addition. I'm looking for help as how to:
a) Make the new file
b) Make the +s and =s
c) Transferring from the original file
d) How to do the addition (dealing with large space gaps and doing the summing)
IM STILL VERY NEW TO JAVA SO ANY LONGER EXPLANATIONS WOULD BE GREATLY APPRECIATED
The approach you are to implement is to store each integer in an array of digits, with one digit per array element. We will be using arrays of length 25, so we will be able to store integers up to 25 digits long. We have to be careful in how we store these digits. Consider, for example, storing the numbers 38423 and 27. If we store these at the “front” of the array with the leading digit of each number in index 0 of the array, then when we go to add these numbers together, we’re likely to add them like this:
38423
27
Thus, we would be adding 3 and 2 in the first column and 8 and 7 in the second column. Obviously this won’t give the right answer. We know from elementary school arithmetic that we have to shift the second number to the right to make sure that we have the appropriate columns lined up:
38423
27
To simulate this right-shifting of values, we will store each value as a sequence of exactly 25 digits, but we’ll allow the number to have leading 0’s. For example, the problem above is converted into:
0000000000000000000038423
0000000000000000000000027
Now the columns line up properly and we have plenty of space at the front in case we have even longer numbers to add to these.
The data for your program will be stored in a file called sum.txt. Each line of the input file will have a different addition problem for you to solve. Each line will have one or more integers to be added together. Take a look at the input file at the end of this write-up and the output you are supposed to produce. Notice that you produce a line of output for each input line showing the addition problem you are solving and its answer. Your output should also indicate at the end how many lines of input were processed. You must exactly reproduce this output.
You should use the techniques described in chapter 6 to open a file, to read it line by line, and to process the contents of each line. In reading these numbers, you won’t be able to read them as ints or longs because many of them are too large to be stored in an int or long. So you’ll have to read them as String values using calls on the method next(). Your first task, then, will be to convert a String of digits into an array of 25 digits. As described above, you’ll want to shift the number to the right and include leading 0’s in front. Handout #15 provides an example of how to process an array of digits. Notice in particular the use of the String method charAt and the method Character.getNumericValue. These will be helpful for solving this part of the problem.
You are to add up each line of numbers, which means that you’ll have to write some code that allows you to add together two of these numbers or to add one of them to another. This is something you learned in Elementary School to add starting from the right, keeping track of whether there is a digit to carry from one column to the next. Your challenge here is to take a process that you are familiar with and to write code that performs the corresponding task.
Your program also must write out these numbers. In doing so, it should not print any leading 0’s. Even though it is convenient to store the number internally with leading 0’s, a person reading your output would rather see these numbers without any leading 0’s.
You can assume that the input file has numbers that have 25 or fewer digits and that the answer is always 25 digits or fewer. Notice, however, that you have to deal with the possibility that an individual number might be 0 or the answer might be 0. There will be no negative integers in the input file.
You should solve this problem using arrays that are exactly 25 digits long. Certain bugs can be solved by stretching the array to something like 26 digits, but it shouldn’t be necessary to do that and you would lose style points if your arrays require more than 25 digits.
The choice of 25 for the number of digits is arbitrary (a magic number), so you should introduce a class constant that you use throughout that would make it easy to modify your code to operate with a different number of digits.
Consider the input file as an example of the kind of problems your program must solve. We might use a more complex input file for actual grading.
The Java class libraries include classes called BigInteger and BigDecimal that use a strategy similar to what we are asking you to implement in this program. You are not allowed to solve this problem using BigInteger or BigDecimal. You must solve it using arrays of digits.
The sample program of handout #18 should be particularly helpful to study to prepare you for this programming problem. It has some significant differences from the task you are asked to solve, but it also has some similarities that you will find helpful to study. The textbook has a discussion of the program starting in section 7.6.
You may assume that the input file has no errors. In particular, you may assume that each line of input begins with at least one number and that each number and each answer will be 25 digits or fewer. There will be whitespace separating the various numbers, although there is no guarantee about how much whitespace there will be between numbers.
You will again be expected to use good style throughout your program and to comment each method and the class itself. A major portion of the style points will be awarded based on how you break this program down into static methods. As with the sample program in handout #18, try to think in terms of logical subtasks of the overall task and create different methods for different subtasks. You should have at least four static methods other than main and you are welcome to introduce more than four if you find it helpful.
Your program should be stored in a file called Sum.java. You will need to include the files Scanner.java and sum.txt from the class web page (under the “assignments” link) in the same folder as your program. For those using DrJava, you will either have to use a full path name for the file sum.txt (see section 6.2.2 of the book) or you will have to put the file in the same directory as the DrJava program.
Input file sum.txt
82384
204 435
22 31 12
999 483
28350 28345 39823 95689 234856 3482 55328 934803
7849323789 22398496 8940 32489 859320
729348690234239 542890432323 534322343298
3948692348692348693486235 5834938349234856234863423
999999999999999999999999 432432 58903 34
82934 49802390432 8554389 4789432789 0 48372934287
0
0 0 0
7482343 0 4879023 0 8943242
3333333333 4723 3333333333 6642 3333333333
Output that should be produced
82384 = 82384
204 + 435 = 639
22 + 31 + 12 = 65
999 + 483 = 1482
28350 + 28345 + 39823 + 95689 + 234856 + 3482 + 55328 + 934803 = 1420676
7849323789 + 22398496 + 8940 + 32489 + 859320 = 7872623034
729348690234239 + 542890432323 + 534322343298 = 730425903009860
3948692348692348693486235 + 5834938349234856234863423 = 9783630697927204928349658
999999999999999999999999 + 432432 + 58903 + 34 = 1000000000000000000491368
82934 + 49802390432 + 8554389 + 4789432789 + 0 + 48372934287 = 102973394831
0 = 0
0 + 0 + 0 = 0
7482343 + 0 + 4879023 + 0 + 8943242 = 21304608
3333333333 + 4723 + 3333333333 + 6642 + 3333333333 = 10000011364
Total lines = 14

Understanding java 8 stream's filter method

I recently learned about Streams in Java 8 and saw this example:
IntStream stream = IntStream.range(1, 20);
Now, let's say that we want to find the first number that's divisable both by 3 and a 5. We'd probably filter twice and findFirst as follows:
OptionalInt result = stream.filter(x -> x % 3 == 0)
.filter(x -> x % 5 == 0)
.findFirst();
That's all sounds pretty reasonable. The surprising part came when I tried to do this:
OptionalInt result = stream.filter(x -> {System.out.println(x); return x % 3 == 0;})
.filter(x -> {System.out.println(x); return x % 5 == 0;})
.findFirst();
System.out.println(result.getAsInt());
I expected to get something like: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 and then: 3 6 9 12 15 18. Because we first iterate over all the the numbers between 1 to 20, filter out only those that are divisable by 3 and then iterate this new Stream and find those that are divisable by 5.
But instead I got this output: 1 2 3 3 4 5 6 6 7 8 9 9 10 11 12 12 13 14 15 15 15
It looks like it doesn't go over all the numbers. Moreover, it looks like it checks x % 5 == 0 only for those numbers that are divisable by 3.
I don't understand how come it doesn't iterate over all of the numbers.
Here's an online snippet of the code: http://www.tryjava8.com/app/snippets/5454a7f2e4b070922a64002b
Well, the thing to understand about streams is that, unlike lists, they don't (necessarily) hold all the items but rather compute each item at a time (lazy evaluation).
It means that when you did IntStream stream = IntStream.range(1, 20); you didn't actually create a collection with 20 items. You created a dynamically computed collection. Each call to this stream's next will compute the next item. The rest of the items are still "not there" (sort of speaking).
Same goes for the filter.
When you add the filter that's checking division by 3 you'll get a new stream that's combined of 2 computations - the first one returns the numbers from 1 until it gets to 20, the second computation returns the numbers that are divided by 3. It's important to understand that each time only the first item is calculated. That's why when you added the check for division by 5 it only worked on those items that were divisible by 3. Same goes as to why the printing stopped at 15. The findFirst method returns the first number that passes all 3 computations (the 1-20 range computation, the division by 3 computation and the division by 5 computation).
A Stream is a lazy evaluation mechanism for processing Collections efficiently. This means that all the intermediate operations on the Stream are not evaluated unless necessary for the final (terminal) operation.
In your example, the terminal operation is firstFirst(). This means the Stream will evaluate the pipeline of intermediate operations until it finds a single int that results from passing the input Stream through all the intermediate operations.
The second filter only receives ints that pass the first filter, so it only processes the numbers 3,6,9,12,15 and then stops, since 15 passes the filter, and supplies the findFirst() operation the only output it needs.
The first filter will only process ints of the input Stream as long as the terminal operation still requires data, and therefore it will only process 1 to 15.

What is the algorithm behind GROUP USING 'collected' and 'merge'

Pig documentation says if some conditions are met(these conditions are described in doc), Pig can do map side GROUP. Can someone explain this algorithm? I want to get a deep understanding of what can be done by MapReduce.
For example, imagine the file below:
10 - 1
14 - 2
10 - 3
12 - 4
12 - 5
20 - 6
21 - 7
17 - 8
12 - 9
17 - 10
Then, the load is going to store your file, like this (imagine your cluster have 3 nodes - If you use a Identity Map-Reduce Job than you can achieve the same result setting the reduce number to 3. If your file is skweed you can have some performance problems).
The loader used for this must guarantee that it does not split a single value of a key across multiple splits. (http://wiki.apache.org/pig/MapSideCogroup)
part-r-00000 part-r-00001 part-r-00002
10 - 1 14 - 2 20 - 6
10 - 3 17 - 8 21 - 7
12 - 4 17 - 10
12 - 9
Now, hadoop framework is going to spawns one map task for each partition generated. I this case 3 map-tasks.
So imagine that you are going to sum the second field, that process can run just in the map side.
part-m-00000
10 - 17
12 - 13
part-m-00001
14 - 2
17 - 18
part-m-00002
20 - 13
In the case of COGROUP, i imagine that is goind to execute in a similiar way. Each map-task is going to operate in two partitions file with the same keys.
You can read the source code for the algorithm. A one liner answer is that both implement a merge algorithm (i.e. data has to be sorted by group key in advanced so that (a) sorting is not required and (b) by iterating over the data you can find where the group key changes.

knapsack-like algorithm

Here is a very interesting java problem I've found:
Before book printing was found the books were copied by certain people called "writers".
The bookkeeper has a stack of N books that need to be copied.For that purpose he has K writers. Each book can have a different number of pages and every writer can only take books from the top of the stack (if he takes book 1 then he can take book 2 but not book 4 or book 5). The bookkeeper knows the number of pages each book has and he needs to share the books between the writers in order for the maximum number of pages a writer has to copy to be the minimum possible.The pages of course can't be split for example you can't have a 30 page book split into 15 and 15.
For example if we have 7 books with 3 writers and the books pages accordingly: 30 50 5 60 90 5 80 then the optimal solution would be for the first writer to take the first 4 books, the second writer the next book and the 3rd the last two books so we would have:
1st = 145 pages
2nd = 90 pages
3rd = 85 pages
So the program is to write an algorithm which finds the optimal solution for sharing the pages between the writers. So in the end of the program you have to show how many pages each one got.
This was in a programming contest years ago and I wanted to give it a try and what I've found so far is that if you take the total number of pages of all the books and divide them by the number of writers you get in the example 106.66 pages and then you try to give to each writer the continuous books from the stack that are closest to that number, but that doesn't work well at all for large page numbers especially if the number of pages a book has exceeds the total number of pages divided by the number of writers
So share your opinion and give tips and hints if you'd like, mathematical or whatever, this is a very interesting algorithm to be found!
I've come up with a straight forward solution, perhaps not very efficient, but the logic works. Basically you start with the number of writers being the same number as that of the number of books and reduce until you have your number of writers.
To show with an example. Suppose you start with your seven values, 30 50 5 60 90 5 80. For each step you reduce it by one by summing up the "lowest pair". The values in bold are the pair being carried on to the next iteration.
7
30 50 5 60 90 5 80
6
30 55 60 90 5 80
5
30 55 60 90 85
4
85 60 90 85
3
145 90 85
With some pseudo programming, this example shows how it could be implemented
main(books: { 30 50 5 60 90 5 80 }, K: 3)
define main(books, K)
writers = books
while writers.length > K do
reduceMinimalPair(writers)
endwhile
end
define reduceMinimalPair(items)
index = 0
minvalue = items[0] + items[1]
for i in 1..items.length-1 do
if items[i] + items[i + 1] < minvalue then
index = i
minvalue = items[i] + items[i + 1]
endif
endfor
items.removeat(index)
items.removeat(index + 1)
items.insertat(index, minvalue)
end
Let us assume you have books 1...n with pages b1,b2,...,bn. Assume you have K writers.
Initialize a matrix F[1...n,1...K]=infinity.
Let F[i,1]= sum_j=1..i (bj)
Now, for every k=2..K
F[i,k] = min_j=1..i( max(F[j,k-1], sum_r=j+1..i (br) )
I think the way you thought is the right one, but if you are saying it did not work for big numbers then maybe you should check if a bigger number than the average exists and do something else in that case. Maybe remove the number and give it from the start to a writer or something along those lines
Alternate to solving it with Dynamic Programming, you can also binary search a upper page limit that everyone will not copy more than this number of pages. When this number converge, that's the answer.

Interpolating Large Datasets On the Fly

Interpolating Large Datasets
I have a large data set of about 0.5million records representing the exchange rate between the USD / GBP over the course of a given day.
I have an application that wants to be able to graph this data or maybe a subset. For obvious reasons I do not want to plot 0.5 million points on my graph.
What I need is a smaller data set (100 points or so) which accurately (as possible) represents the given data. Does anyone know of any interesting and performant ways this data can be achieved?
Cheers, Karl
There are several statistical methods for reducing a large dataset to a smaller, easier to visualize dataset. It's not clear from your question what summary statistic you want. I've just assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic that I'm not considering.
Summarizing a trend over time
Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):
> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a loess-smoothed line to the points
> lines(lowess(cars), col = 2)
# plot a finger-grained loess-smoothed line to the points
> lines(lowess(cars, f=.2), col = 3)
The parameter f controls how tightly the regression fits to your data. Use some thoughtfulness with this, as you want something that accurately fits your data without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.
It's also straightforward to access the results of the smoothing. Here's how to do that:
> data = lowess( cars$speed, cars$dist )
> data
$x
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25
$y
[1] 4.965459 4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698
The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
One thought is use the DBMS to compress the data for you using an appropriate query. Something along the lines of having it take a median for a specific range, a pseudo-query:
SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)
Where truncate_to_hour is something appropriate to your DBMS. Or a similar approach with some kind of function to segment the time into unique blocks (such as round to nearest 5 minute interval), or another math function to aggregate the group thats appropriate in place of median. Given the complexity of the time segmenting procedure and how your DBMS optimizes it may be more efficient to run a query on a temporary table with the segmented time value.
If you wanted to write your own, one obvious solution would be to break your record set into fixed number-of-points chunks, for which the value would be the average (mean, median, ... pick one). This has the probable advantage of being the fastest, and shows overall trends.
But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.
I use this at work for things like error graphs, I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I have 1-hour resolution for a few days, then 1-day resolution for a few months.
The naive approach is simply calculating an average per timeinterval corresponding to a pixel.
http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg
This does not show flunctuations. I would suggest also calculating the standard deviation in each time interval and plot that too (essentially making each pixel higher than one single pixel). I could not locate an example, but I know that Gnuplot can do this (but is not written in Java).
How about to make enumeration/iterator wrapper. I'm not familiar with Java, but it may looks similar to:
class MedianEnumeration implements Enumeration<Double>
{
private Enumeration<Double> frameEnum;
private int frameSize;
MedianEnumeration(Enumeration<Double> e, int len) {
frameEnum = e;
frameSize = len;
}
public boolean hasMoreElements() {
return frameEnum.hasMoreElements();
}
public Double nextElement() {
Double sum = frameEnum.nextElement();
int i;
for(i=1; (i < frameSize) && (frameEnum.hasMoreElements()); ++i) {
sum += (Double)frameEnum.nextElement();
}
return (sum / i);
}
}

Categories