Exponentially Growing Stream
I have a Stream that grows exponentially as it creates permutations: each call to addWeeks increases the number of elements in the Stream.
Stream<SeasonBuilder> sbStream = sbSet.stream();
for (int i = 1; i <= someCutOff; i++) {
sbStream = sbStream.map(sb -> sb.addWeeks(possibleWeeks))
.flatMap(Collection::stream);
}
// Collect SeasonBuilders into a Set
return sbStream.collect(Collectors.toSet()); // size > 750 000
Problems
Each call to addWeeks returns a Set<SeasonBuilder> and collecting everything into a Set takes a while.
addWeeks is not static and needs to be called on each SeasonBuilder in the stream, each time through the loop.
public Set<SeasonBuilder> addWeeks(
    final Set<Set<ImmutablePair<Integer, Integer>>> possibleWeeks) {
  return possibleWeeks.stream()
      .filter(containsMatchup()) // Finds the weeks to add
      .map(this::addWeek)        // Create new SeasonBuilders with the new week
      .collect(Collectors.toSet());
}
Out of memory error... when possibleWeeks has size = 15
Questions
Should I be using a method chain other than map followed by flatMap?
How can I modify addWeeks so that I don't have to collect everything into a Set?
Should I return a Stream<SeasonBuilder>? Can I flatMap a Stream?
Update:
Thanks for the help everyone!
I have put the code for the methods in a gist
Thanks to @Holger and @lexicore for suggesting returning a Stream<SeasonBuilder> in addWeeks. Minor performance increase, as was predicted by @lexicore.
I tried using parallelStream() and there was no significant change in performance
Context
I am creating all possible permutations of a Fantasy Football season, which will be used elsewhere for stats analysis. In a 4-team, 14-week season, for any given week, there could be three different possibilities
(1 vs 2) , (3 vs 4)
(1 vs 3) , (2 vs 4)
(1 vs 4) , (2 vs 3)
To solve the problem, plug in the permutations, and we have all our possible seasons. Done! But wait... what if Team 1 only ever plays Team 2. Then the other teams would be sad. So there are some constraints on the permutations.
Every team must play each other roughly the same amount of times (i.e. Team 1 cannot play against Team 3 ten times in a single season). In this example - 4-teams, 14 weeks - each team is capped at playing another team 5 times. So some sort of filtering has to happen when creating permutations, to avoid non-valid seasons.
Where this gets more interesting is:
6 Team League -- 15 possible weeks
8 Team League -- 105 possible weeks
10 Team League -- 945 possible weeks
I am trying to optimize performance where possible, because there are a lot of permutations to create. A 4-team, 14-week season creates 756 756 (= 3 × 14!/(5!5!4!)) possible seasons, given the constraints. 6-team or 8-team seasons just get crazier.
Your whole construction is very suspicious to begin with. If you're interested in performance it is unlikely that generating all permutations explicitly is a good approach.
I also don't believe that collecting to set and streaming again is the performance problem.
But nevertheless, to answer your question: why don't you return Stream<SeasonBuilder> from addWeeks directly? Why do you collect it to a set first? Return the stream directly, without collecting:
public Stream<SeasonBuilder> addWeeks(
final Set<Set<ImmutablePair<Integer, Integer>>> possibleWeeks) {
return possibleWeeks.stream()
.filter(containsMatchup()) // Finds the weeks to add
.map(this::addWeek); // Create new SeasonBuilders with the new week
}
You won't need map/flatMap then, just one flatMap:
sbStream = sbStream.flatMap(sb -> sb.addWeeks(possibleWeeks));
But this won't help your performance much anyway.
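Putting that together, the driver loop from the question would then look like this (a sketch that only rearranges the question's own code, nothing new added):
Stream<SeasonBuilder> sbStream = sbSet.stream();
for (int i = 1; i <= someCutOff; i++) {
    sbStream = sbStream.flatMap(sb -> sb.addWeeks(possibleWeeks));
}
// Still a single terminal collect at the very end
return sbStream.collect(Collectors.toSet());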
Related
It differs from How to apply to sort and limiting after groupBy using Java streams because I want to solve this problem in exactly one iteration. Imagine I have the following entity:
@Getter
@Setter
@AllArgsConstructor
public static class Hospital {
private AREA area;
private int patients;
}
public enum AREA {
AREA1, AREA2, AREA3
}
Now, given a list of hospitals, I want to find the areas with the most patients in them. Here's what I have done so far:
public static void main(String[] args) {
List<Hospital> list = Arrays.asList(
new Hospital(AREA.AREA1, 20),
new Hospital(AREA.AREA2, 10),
new Hospital(AREA.AREA1, 10),
new Hospital(AREA.AREA3, 40),
new Hospital(AREA.AREA2, 10));
Map<AREA, Integer> map = findTopTen(list);
for (AREA area : map.keySet())
System.out.println(area);
}
public static Map<AREA, Integer> findTopTen(Iterable<Hospital> iterable) {
Map<AREA, Integer> iterationOneResult = StreamSupport.stream(iterable.spliterator(), false)
.collect(Collectors.groupingBy(Hospital::getArea,
Collectors.summingInt(Hospital::getPatients)));
return iterationOneResult.entrySet().stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.limit(10)
.collect(Collectors.toMap(Map.Entry::getKey,
Map.Entry::getValue, (o, o2) -> o,
LinkedHashMap::new));
}
Clearly I've iterated twice in order to find the top ten areas with the most patients (once for grouping hospitals by area and summing the patients of each group, and once more for finding the top ten areas).
Now what I want to know is:
Is there any better approach to solve this problem in one stream and therefore one iteration?
Is there any performance benefit for doing it in one iteration, and what is the best practice for solving this kind of problem? (From my point of view: on one hand, when I call collect, which is a terminal operation, the first time, it iterates my iterable and saves the intermediate result in another object; in my code I named that object iterationOneResult. So using one stream and calling collect one time would omit that intermediate result, which is the main benefit of using streams in Java. On the other hand, solving this problem in one iteration would make it much faster.)
Let me try to answer your questions, and provide some context on why they're maybe not the right ones:
Is there any better approach to solve this problem in one stream and therefore one iteration?
The fundamental issue here is that your goal is to find the groups with the maximum values, starting with just the raw members of those groups, unsorted. Therefore, before you can find maximum anything, you will need to assign the members to groups. The problem is, which members are in a group determines that group's value - this leads to the logical conclusion that you can't make decisions like "what are the top ten groups" before sorting all your members into groups.
This is one of the reasons that groupingBy is a Collector - collectors are used by the terminal operation collect, which is a fancy way of saying they consume the whole stream and return not a stream but a resolved something - they "end" the stream.
The reason it needs to end the stream (i.e. to wait for the last element before returning its groups) is because it cannot give you group A before seeing the last element, because the last element may belong to group A. Grouping is an operation which, on an unsorted dataset, cannot be pipelined.
This means that, no matter what you do, there is a hard logical requirement that you will first have to group your items somehow, then find the maximum. This first, then order implies two iterations: one over the items, a second over the groups.
Is there any performance benefit for doing it in one iteration, and what is the best practice for solving this kind of problem? (From my point of view: on one hand, when I call collect, which is a terminal operation, the first time, it iterates my iterable and saves the intermediate result in another object; in my code I named that object iterationOneResult. So using one stream and calling collect one time would omit that intermediate result, which is the main benefit of using streams in Java. On the other hand, solving this problem in one iteration would make it much faster.)
Re-read the above: "two iterations: one over the items, a second over the groups". These will always have to happen. However, note that these are two iterations over two different things. Given that you probably have fewer groups than members, the latter iteration will be shorter. Your runtime will not be O(2n) = O(n), but rather O(f(n, m)), where f(n,m) will be "the cost of sorting the n members into m groups, plus the cost of finding the maximal k groups".
Is there any performance benefit for doing it in one iteration
Well... no, since as discussed you can't.
What is the best practice for solving this kind of problem?
I cannot emphasize this enough: clean code.
99.9% of the time, you will waste more time optimizing with custom classes than they gain you back in performance, if they can gain you anything at all. The easy gain to be had here is minimizing the number of lines of code, and maximizing how understandable they are to future programmers.
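If you still want the whole computation to read as a single pipeline while accepting the two inherent passes, one option is Collectors.collectingAndThen. This is only a readability sketch based on the question's own findTopTen, not a performance claim:
public static Map<AREA, Integer> findTopTen(Iterable<Hospital> iterable) {
    return StreamSupport.stream(iterable.spliterator(), false)
            .collect(Collectors.collectingAndThen(
                    Collectors.groupingBy(Hospital::getArea,
                            Collectors.summingInt(Hospital::getPatients)),
                    grouped -> grouped.entrySet().stream()
                            .sorted(Map.Entry.<AREA, Integer>comparingByValue(Comparator.reverseOrder()))
                            .limit(10)
                            .collect(Collectors.toMap(Map.Entry::getKey,
                                    Map.Entry::getValue, (a, b) -> a,
                                    LinkedHashMap::new))));
}
The grouping pass and the sorting pass are both still there; collectingAndThen just packages them as a single collector.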
I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, it represents records that are 10-15 minutes behind "real time" (for example, look at _last_updt).
For example, at 00:20 I get data timestamped 00:10; at 00:35 I get data from 00:20; at 00:50 I get data from 00:40. So the interval at which I get new data is fixed (every 15 minutes), although the interval between timestamps changes slightly.
I am trying to consume this data on Dataflow (Apache Beam) and for that I am playing with Sliding Windows. My idea is to collect and work on 4 consecutive datapoints (4 x 15min = 60min), and ideally update my calculation of sum/averages as soon as a new datapoint is available. For that, I've started with the code:
PCollection<TrafficData> trafficData = input
.apply("MapIntoSlidingWindows", Window.<TrafficData>into(
SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
.every(Duration.standardMinutes(15))) // interval to get new data
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()))
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes());
Unfortunately, it looks like when I receive a new datapoint from my input, I do not get a new (updated) result from the GroupByKey that follows.
Is this something wrong with my SlidingWindows? Or am I missing something else?
One issue may be that the watermark is going past the end of the window and dropping all later elements. You may try allowing a few minutes of lateness after the watermark passes:
PCollection<TrafficData> trafficData = input
.apply("MapIntoSlidingWindows", Window.<TrafficData>into(
SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
.every(Duration.standardMinutes(15))) // interval to get new data
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane())
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
.withAllowedLateness(Duration.standardMinutes(15))
.accumulatingFiredPanes());
Let me know if this helps at all.
So @Pablo (from my understanding) gave the correct answer. But I had some suggestions that would not fit in a comment.
I wanted to ask whether you need sliding windows? From what I can tell, fixed windows would do the job for you and be computationally simpler as well. Since you are using accumulating fired panes, you don't need to use a sliding window since your next DoFn function will already be doing an average from the accumulated panes.
As for the code, I made changes to the early and late firing logic. I also suggest increasing the window size. Since you know the data comes every 15 minutes, you should be closing the window after 15 minutes rather than exactly on 15 minutes. But you also don't want to pick a window size that eventually lines up with multiples of 15 (such as 20, since both hit a boundary at 60 minutes), so pick a number that is coprime to 15, for example 19. Also allow for late entries.
PCollection<TrafficData> trafficData = input
.apply("MapIntoFixedWindows", Window.<TrafficData>into(
FixedWindows.of(Duration.standardMinutes(19)))
.triggering(AfterWatermark.pastEndOfWindow()
// fire the moment you see an element
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
//this line is optional since you already have a past end of window and an early firing. But just in case
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
.withAllowedLateness(Duration.standardMinutes(60))
.accumulatingFiredPanes());
Let me know if that solves your issue!
EDIT
I could not quite follow how you compute the values in your example, so I am using a generic example. Below is a generic averaging function:
public class AverageFn extends CombineFn<Integer, AverageFn.Accum, Double> {
public static class Accum {
int sum = 0;
int count = 0;
}
@Override
public Accum createAccumulator() { return new Accum(); }
@Override
public Accum addInput(Accum accum, Integer input) {
accum.sum += input;
accum.count++;
return accum;
}
@Override
public Accum mergeAccumulators(Iterable<Accum> accums) {
Accum merged = createAccumulator();
for (Accum accum : accums) {
merged.sum += accum.sum;
merged.count += accum.count;
}
return merged;
}
@Override
public Double extractOutput(Accum accum) {
return ((double) accum.sum) / accum.count;
}
}
In order to run it you would add the line:
PCollection<Double> average = trafficData.apply(Combine.globally(new AverageFn()));
Since you are currently using accumulating firing triggers, this would be the simplest way to code a solution.
HOWEVER, if you want to use discarding fired panes, you would need to use a PCollectionView to store the previous average and pass it as a side input to the next computation in order to keep track of the values. This is a little more complex to code, but it should improve performance, since only constant work is done per window, unlike with accumulating firing.
Does this make enough sense for you to write your own version with discarding fired panes?
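For what it's worth, here is a very rough sketch of the side-input wiring described above. It is not from the original discussion: the PCollection<Integer> values, the name previousAverage, and how you arrange for the view to actually hold the previous window's value (rather than the current one) are all left as assumptions for you to fill in.
// Publish the combined average as a singleton view per window...
PCollectionView<Double> previousAverage =
    values.apply(Combine.globally(new AverageFn()).asSingletonView());

// ...and read it as a side input while processing the main stream,
// which would be windowed with discardingFiredPanes().
PCollection<Double> updated = values.apply(
    ParDo.of(new DoFn<Integer, Double>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            double previous = c.sideInput(previousAverage);
            // combine c.element() with the previous average however you need
            c.output(previous);
        }
    }).withSideInputs(previousAverage));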
I am iterating through a List of Strings with +- 1500 entries. In each iteration I am again iterating through a List of Strings, but this time with +- 35 million entries. The result of the application is perfect. But it takes the application a long time (2+ hours) to give me the result. How should I structure multithreading to make my application faster?
The order of the result List is not important.
Should I divide the big List (35 million entries) into smaller blocks and iterate through them in parallel? (How can I determine the ideal number of blocks?)
Should I start a thread for each iteration of the small List? (This will create 1500 threads and I guess a lot of them will run in parallel.)
What are my other options?
Representation of the code:
List<String> result = new ArrayList<String>();
for(Iterator<String> i = data1.iterator();i.hasNext();){ //1500 entries
String val = i.next();
for(Iterator<String> j = data2.iterator();j.hasNext();){ //35 million entries
String test = j.next();
if(val.equals(test)){
result.add(val);
break;
}
}
}
for(Iterator<String> h = result.iterator();h.hasNext();){
//write to file
}
UPDATE
After restructuring my code and implementing the answer given by JB Nizet my application now runs a lot faster. It now only takes 20 seconds to get to the same result! Without multi-threading!
You could use a parallel stream:
List<String> result =
data1.parallelStream()
.filter(data2::contains)
.collect(Collectors.toList());
But since you call contains() on data2 1500 times, and since contains() is O(N) for a list, transforming it to a HashSet first could make things much faster: contains() on HashSet is O(1). You might not even need multi-threading anymore:
Set<String> data2Set = new HashSet<>(data2);
List<String> result =
data1.stream()
.filter(data2Set::contains)
.collect(Collectors.toList());
I also agree with this idea. Here is what to do:
First, find the number of processors in your system.
Based on the number of processors, split your records and create exactly that many threads (at most numberOfProcessors * 2; beyond that, context switching between threads will degrade performance).
Do not create unnecessarily many threads; that will not speed up your application. Work out exactly how many threads to create based on the number of processors and the amount of memory in your system. Efficient parallel processing depends on your machine's hardware as well.
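To make that concrete, here is a rough sketch (the method name and variables are mine, not from the question) that builds a HashSet from data2 and splits data1 into one chunk per core. Note that, as the question's update shows, the HashSet lookup alone may already make the threading unnecessary:
static List<String> intersect(List<String> data1, List<String> data2) throws Exception {
    int threads = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    Set<String> lookup = new HashSet<>(data2);                  // O(1) contains
    List<String> result = Collections.synchronizedList(new ArrayList<>());
    int chunkSize = (data1.size() + threads - 1) / threads;
    List<Future<?>> futures = new ArrayList<>();
    for (int start = 0; start < data1.size(); start += chunkSize) {
        List<String> chunk = data1.subList(start, Math.min(start + chunkSize, data1.size()));
        futures.add(pool.submit(() -> {
            for (String s : chunk) {
                if (lookup.contains(s)) {
                    result.add(s);
                }
            }
        }));
    }
    for (Future<?> f : futures) {
        f.get();                                                // wait for every chunk
    }
    pool.shutdown();
    return result;
}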
I am trying to get my app to determine the best solution for grouping 20 golfers into foursomes.
I have data that shows when a golfer played, what date and the others in the group.
I would like the groups made up of golfer who haven't played together, or when everyone has played together, the longest amount of time that they played together. In other words, I want groups made up of players who haven't played together in a while as opposed to last time out.
Creating a list of all 20! permutations to determine the lowest-scoring combinations didn't work well.
Is there another solution that I am not thinking of?
@Salix-alba's answer is spot on to get you started. Basically, you need a way to figure out how much time has already been spent together by members of your golfing group. I'll assume for illustration that you have a method to determine how much time two golfers have spent together. The algorithm can then be summed up as:
Compute total time spent together of every group of 4 golfers (see Salix-alba's answer), storing the results in an ordered fashion.
Pick the group of 4 golfers with the least time together as your first group.
Continue to pick groups from your ordered list of possible groups such that no member of the next group picked is a member of any prior group picked.
Halt when all golfers have a group, which will always happen before you run out of possible combinations.
By way of a quick example, not promised to compile (I wrote it in the answer window directly):
Let's assume you have a method time(a,b) where a and b are the golfer identities, and the result is how much time the two golfers have spent together.
Let's also assume that we will use a TreeMap<Integer, Collection<Integer>> to keep track of "weights" associated with groups, in a sorted manner.
Now, let's construct the weights of our groups using the above assumptions:
TreeMap<Integer,Collection<Integer>> options = new TreeMap<Integer, Collection<Integer>>();
for(int i=0;i<17;++i) {
for(int j=i+1;j<18;++j) {
for(int k=j+1;k<19;++k) {
for(int l=k+1;l<20;++l) {
Integer timeTogether = time(i,j) + time(i,k) + time(i,l) + time(j,k) + time(j,l)+time(k,l);
Collection<Integer> group = new HashSet<Integer>();
group.add(i);
group.add(j);
group.add(k);
group.add(l);
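// Note (not in the original answer): two groups with the same total time will
// collide on this key and the later put() will overwrite the earlier one; a
// real implementation might map each weight to a list of groups instead.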
options.put(timeTogether, group);
}
}
}
}
Collection<Integer> golferLeft = new HashSet<Integer>(); // to quickly determine if we should consider a group.
for (int a = 0; a < 20; a++) { // 20 golfers in this example
    golferLeft.add(a);
}
Collection<Collection<Integer>> finalPicks = new ArrayList<Collection<Integer>>();
Map.Entry<Integer, Collection<Integer>> least;
do {
    least = options.pollFirstEntry();
    if (least != null && golferLeft.containsAll(least.getValue())) {
        finalPicks.add(least.getValue());
        golferLeft.removeAll(least.getValue());
    }
} while (golferLeft.size() > 0 && least != null);
And at the end of the final loop, finalPicks will have a number of collections, with each collection representing a play-group.
Obviously, you can tweak the weight function to get different results -- say you would rather weight by how long it has been since members of the group last played together. In that case, instead of using play time, sum up the time since the last game for each pair in the group (using some arbitrarily large but reasonable value if they have never played), and instead of picking the group with the least weight, pick the one with the largest. And so on.
I hope this has been a helpful primer!
There should be 20 C 4 possible groupings which is 4845. It should be possible to generate these combinations quite easily with four nested for loops.
int count = 0;
for(int i=0;i<17;++i) {
for(int j=i+1;j<18;++j) {
for(int k=j+1;k<19;++k) {
for(int l=k+1;l<20;++l) {
System.out.println(""+i+"\t"+j+"\t"+k+"\t"+l);
++count;
}
}
}
}
System.out.println("Count "+count);
You can quickly loop through all of these and use some objective function to work out which is the most optimal grouping. Your problem definition is a little fuzzy, so I'm not sure how to tell which is the best combination.
That's just the number of ways of picking four golfers out of 20; you really need 5 groups of 4, which I think is 20C4 * 16C4 * 12C4 * 8C4 = 305,540,235,000. This is still in the realm of exhaustive computation, though you might need to wait a few minutes.
Another approach might be a probabilistic one. Just pick the groups at random, rejecting illegal combinations and those which don't meet your criteria. Keep picking random groups until you have found one that is good enough.
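A minimal sketch of that probabilistic idea (the score function is hypothetical; you would build it from your play history, for example by summing time(a, b) over the six pairs in each group of four):
List<Integer> golfers = new ArrayList<>();
for (int g = 0; g < 20; g++) {
    golfers.add(g);
}
List<List<Integer>> best = null;
int bestScore = Integer.MAX_VALUE;            // lower = less time already spent together
Random random = new Random();
for (int trial = 0; trial < 100_000; trial++) {
    Collections.shuffle(golfers, random);
    List<List<Integer>> grouping = new ArrayList<>();
    for (int start = 0; start < 20; start += 4) {
        grouping.add(new ArrayList<>(golfers.subList(start, start + 4)));
    }
    int score = score(grouping);              // hypothetical scoring function
    if (score < bestScore) {
        bestScore = score;
        best = grouping;
    }
}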
First of all, let me tell you that I have read the following question that has been asked before, Java HashMap performance optimization / alternative, and I have a similar question.
What I want to do is take a LOT of dependencies from New York Times text, processed by the Stanford parser, and store the dependencies in a HashMap along with their scores, i.e. if I see a dependency twice I will increment its score in the HashMap by 1.
The task starts off really quickly, about 10 sentences a second, but slows down quickly. At 30 000 sentences (assuming 10 words in each sentence and about 3-4 dependencies for each word, which I'm storing) there are about 300 000 entries in my HashMap.
How can I increase the performance of my HashMap? What kind of hash key can I use?
Thanks a lot
Martinos
EDIT 1:
OK, maybe I phrased my question wrongly. The byte arrays are not used in MY project but in the similar question from another person linked above; I don't know what they are using them for, hence why I asked.
Secondly: I will not post the full code, as I think it would make things very hard to understand, but here is a sample:
With the sentence "i am going to bed" I have the dependencies:
(i , am , -1)
(i, going, -2)
(i,to,-3)
(am, going, -1)
.
.
.
(to,bed,-1)
These dependencies for all sentences (1 000 000 sentences) will be stored in a HashMap.
If I see a dependency twice I will get the score of the existing dependency and add 1.
And that is pretty much it. All is well, but the rate of adding sentences to the HashMap (or retrieving them) slows down on this line:
dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);
Can anyone tell me why?
Regards
Martinos
Trove has optimized hashmaps for the case where key or value are of primitive type.
However, much will still depend on smart choice of structure and hash code for your keys.
This part of your question is unclear: you describe the task starting off really quickly (about 10 sentences a second) and slowing down by the time you reach 30 000 sentences (about 300 000 entries in your HashMap), but you don't say what the performance is for the larger data. Your map grows, which is kind of obvious. Hashmaps are O(1) only in theory; in practice you will see some performance changes with size, due to less cache locality and due to occasional jumps caused by rehashing. So put() and get() times will not be constant, but they should still be close to constant. Perhaps you are using the HashMap in a way which doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size and you can't change that unless you change your algorithm.
Google 'fastutil' and you will find a superior solution for mapping object keys to scores.
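For example (a sketch, with Object standing in for whatever key class you use), fastutil's Object2IntOpenHashMap stores the score as a primitive int and can increment it in one call:
Object2IntOpenHashMap<Object> dependancyBank = new Object2IntOpenHashMap<>();
dependancyBank.defaultReturnValue(0);
// adds 1 to the stored score, starting from 0 the first time the key is seen
dependancyBank.addTo(newDependancy, 1);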
Take a look at the Guava multimaps: http://www.coffee-bytes.com/2011/12/22/guava-multimaps They are designed to basically keep a list of things that all map to the same key. That might solve your need.
How can I increase the performance of my HashMap?
If it's taking more than 1 microsecond per get() or put(), you have a bug IMHO. You need to determine why it's taking as long as it is. Even in the worst case, where every object has the same hashCode, you won't have performance this bad.
What kind of hash key can I use?
That depends on the data type of the key. What is it?
and finally what are byte[] a = new byte[2]; byte[] b = new byte[3]; in the question that was posted above?
They are arrays of bytes. They can be used as values to look up, but it's likely that you need a different value type.
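If the key is something like the (word, word, offset) triples shown in the question, what keeps HashMap lookups fast is a small immutable key class with consistent equals() and hashCode(). A sketch (the field names are my assumption, not the asker's actual class):
public final class Dependency {
    private final String governor;
    private final String dependent;
    private final int offset;

    public Dependency(String governor, String dependent, int offset) {
        this.governor = governor;
        this.dependent = dependent;
        this.offset = offset;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Dependency)) return false;
        Dependency other = (Dependency) o;
        return offset == other.offset
                && governor.equals(other.governor)
                && dependent.equals(other.dependent);
    }

    @Override
    public int hashCode() {
        int result = governor.hashCode();
        result = 31 * result + dependent.hashCode();
        result = 31 * result + offset;
        return result;
    }
}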
A HashMap has an overloaded constructor which takes an initial capacity as input. The slowdown you see is because of rehashing, during which the HashMap is effectively unusable. To prevent frequent rehashing you need to start with a HashMap of greater initial capacity. You can also set a load factor, which indicates what fraction of the capacity may fill up before rehashing occurs.
public HashMap(int initialCapacity).
Pass the initial capacity to the HashMap during object construction. It is preferable to set the capacity to roughly twice the number of elements you expect to add to the map during the course of your program's execution.
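A minimal sketch of that advice (the size estimate is illustrative; substitute your own key class for Object):
int expectedEntries = 2_000_000;     // roughly twice the number of dependencies you expect
float loadFactor = 0.75f;            // the default
Map<Object, Integer> dependancyBank = new HashMap<>(expectedEntries, loadFactor);
// with capacity 2 000 000 and load factor 0.75, no rehash happens before 1 500 000 entries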