I am iterating through a List of Strings with roughly 1,500 entries. In each iteration I am again iterating through a List of Strings, but this time with roughly 35 million entries. The result of the application is correct, but it takes the application a long time (2+ hours) to produce it. How should I structure multithreading to make my application faster?
The order of the result List is not important.
Should I divide the big List (35 million entries) into smaller blocks and iterate through them in parallel? (How can I determine the ideal number of blocks?)
Should I start a thread for each iteration of the small List? (This would create 1,500 threads, and I guess a lot of them would run in parallel.)
What are my other options?
Representation of the code:
List<String> result = new ArrayList<String>();

for (Iterator<String> i = data1.iterator(); i.hasNext(); ) { // 1500 entries
    String val = i.next();
    for (Iterator<String> j = data2.iterator(); j.hasNext(); ) { // 35 million entries
        String test = j.next();
        if (val.equals(test)) {
            result.add(val);
            break;
        }
    }
}

for (Iterator<String> h = result.iterator(); h.hasNext(); ) {
    // write to file
}
UPDATE
After restructuring my code and implementing the answer given by JB Nizet my application now runs a lot faster. It now only takes 20 seconds to get to the same result! Without multi-threading!
You could use a parallel stream:
List<String> result =
data1.parallelStream()
.filter(data2::contains)
.collect(Collectors.toList());
But since you call contains() on data2 1500 times, and since contains() is O(N) for a list, transforming it to a HashSet first could make things much faster: contains() on HashSet is O(1). You might not even need multi-threading anymore:
Set<String> data2Set = new HashSet<>(data2);
List<String> result =
data1.stream()
.filter(data2Set::contains)
.collect(Collectors.toList());
I also agree with your idea. What do you need to do now?
First, determine the number of processors in your system.
Based on the number of processors, split your records and create exactly that many threads (at most numberOfProcessors * 2; beyond that, performance will degrade because of context switching between threads).
Do not create an unnecessarily large number of threads; that will not speed up your application. Work out exactly how many threads you should create based on the number of processors and the amount of memory in the system. Efficient parallel processing depends on your machine's hardware as well.
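As a rough sketch of that advice (the method name, the chunking scheme, and the reuse of the HashSet idea from the earlier answer are my own assumptions, not part of the question), a fixed-size pool with one task per chunk of the 1,500-entry list could look like this:

static List<String> matchInParallel(List<String> data1, List<String> data2) throws Exception {
    int threads = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    Set<String> data2Set = new HashSet<>(data2);              // shared, read-only, O(1) contains()

    int chunk = (data1.size() + threads - 1) / threads;       // ceiling division
    List<Future<List<String>>> futures = new ArrayList<>();
    for (int start = 0; start < data1.size(); start += chunk) {
        List<String> slice = data1.subList(start, Math.min(start + chunk, data1.size()));
        futures.add(pool.submit(() ->
                slice.stream().filter(data2Set::contains).collect(Collectors.toList())));
    }

    List<String> result = new ArrayList<>();
    for (Future<List<String>> f : futures) {
        result.addAll(f.get());                               // merge the partial results
    }
    pool.shutdown();
    return result;
}

Given that the HashSet alone already brought the run down to about 20 seconds (see the update above), the extra threads may not even be necessary.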
It differs from this question, How to apply to sort and limiting after groupBy using Java streams, because I want to solve this problem in exactly one iteration. Imagine I have the following entity:
@Getter
@Setter
@AllArgsConstructor
public static class Hospital {
    private AREA area;
    private int patients;
}

public enum AREA {
    AREA1, AREA2, AREA3
}
Now, given a list of hospitals, I want to find the areas with the most patients in them. Here's what I have done so far:
public static void main(String[] args) {
    List<Hospital> list = Arrays.asList(
            new Hospital(AREA.AREA1, 20),
            new Hospital(AREA.AREA2, 10),
            new Hospital(AREA.AREA1, 10),
            new Hospital(AREA.AREA3, 40),
            new Hospital(AREA.AREA2, 10));

    Map<AREA, Integer> map = findTopTen(list);
    for (AREA area : map.keySet())
        System.out.println(area);
}
public static Map<AREA, Integer> findTopTen(Iterable<Hospital> iterable) {
    Map<AREA, Integer> iterationOneResult = StreamSupport.stream(iterable.spliterator(), false)
            .collect(Collectors.groupingBy(Hospital::getArea,
                    Collectors.summingInt(Hospital::getPatients)));
    return iterationOneResult.entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .limit(10)
            .collect(Collectors.toMap(Map.Entry::getKey,
                    Map.Entry::getValue, (o, o2) -> o,
                    LinkedHashMap::new));
}
Clearly I've iterated twice in order to find the top ten areas with the most patients: once for grouping the hospitals by area and summing the patients of each group, and once more for finding the top ten areas.
Now what I want to know is:
Is there any better approach to solve this problem in one stream and therefore one iteration?
Is there any performance benefit to doing it in one iteration, and what is the best practice for solving this kind of problem? (From my point of view: on the one hand, when I call collect, which is a terminal operation, the first time, it iterates my iterable and saves the intermediate result in another object; in my code I named that object iterationOneResult. Using one stream and calling collect only once would omit that intermediate result, which is the main benefit of using streams in Java. On the other hand, solving this problem in one iteration would make it much faster.)
Let me try to answer your questions, and provide some context on why they're maybe not the right ones:
Is there any better approach to solve this problem in one stream and therefore one iteration?
The fundamental issue here is that your goal is to find the groups with the maximum values, starting with just the raw members of those groups, unsorted. Therefore, before you can find maximum anything, you will need to assign the members to groups. The problem is, which members are in a group determines that group's value - this leads to the logical conclusion that you can't make decisions like "what are the top ten groups" before sorting all your members into groups.
This is one of the reasons that groupingBy is a Collector - a collector performs a terminal operation, which is a fancy way of saying it consumes the whole stream and returns not a stream but a resolved something - it "ends" the stream.
The reason it needs to end the stream (i.e. to wait for the last element before returning its groups) is because it cannot give you group A before seeing the last element, because the last element may belong to group A. Grouping is an operation which, on an unsorted dataset, cannot be pipelined.
This means that, no matter what you do, there is a hard logical requirement that you will first have to group your items somehow, then find the maximum. This "first ..., then ..." ordering implies two iterations: one over the items, and a second over the groups.
Is there any performance benefit to doing it in one iteration, and what is the best practice for solving this kind of problem? (From my point of view: on the one hand, when I call collect, which is a terminal operation, the first time, it iterates my iterable and saves the intermediate result in another object; in my code I named that object iterationOneResult. Using one stream and calling collect only once would omit that intermediate result, which is the main benefit of using streams in Java. On the other hand, solving this problem in one iteration would make it much faster.)
Re-read the above: "two iterations: one over the items, a second over the groups". These will always have to happen. However, note that these are two iterations over two different things. Given that you probably have fewer groups than members, the latter iteration will be shorter. Your runtime will not be O(2n) = O(n), but rather O(f(n, m)), where f(n,m) will be "the cost of sorting the n members into m groups, plus the cost of finding the maximal k groups".
Is there any performance benefit for doing it in one iteration
Well... no, since as discussed you can't.
What is the best practice for solving this kind of problem?
I cannot emphasize this enough: clean code.
99.9% of the time, you will waste more time optimizing with custom classes than they gain you back in performance, if they can gain you anything at all. The easy gain to be had here is minimizing the number of lines of code, and maximizing how understandable they are to future programmers.
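For what it's worth, here is a sketch of how to keep the code compact while accepting the two logical passes: the original two steps wrapped into a single collect call via Collectors.collectingAndThen. The method name is illustrative; it still groups first and then sorts the much smaller map of groups, exactly as argued above, it just reads as one pipeline.

public static Map<AREA, Integer> findTopTenCompact(Iterable<Hospital> iterable) {
    return StreamSupport.stream(iterable.spliterator(), false)
            .collect(Collectors.collectingAndThen(
                    Collectors.groupingBy(Hospital::getArea,
                            Collectors.summingInt(Hospital::getPatients)),
                    grouped -> grouped.entrySet().stream()
                            .sorted(Map.Entry.<AREA, Integer>comparingByValue(Comparator.reverseOrder()))
                            .limit(10)
                            .collect(Collectors.toMap(Map.Entry::getKey,
                                    Map.Entry::getValue, (a, b) -> a,
                                    LinkedHashMap::new))));
}

Whether this reads better than two explicit statements is a matter of taste; performance-wise it is essentially the same.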
I wonder: if I use a HashMap to collect the conditions and loop over them with a single if statement, can I reach higher performance than writing the conditions one by one as if-else if statements?
In my opinion, the one-by-one if-else statements may be faster, because a loop evaluates one extra condition on each pass (has the counter reached the target?), so effectively each if statement runs two checks. Of course the bodies of the statements differ, but if we talk about the performance of the statements themselves, I think the one-by-one form would be better?
Edit: this is just sample code; my question is about the performance difference between these two kinds of statement.
Map<String, Integer> words = new HashMap<String, Integer>();
String letter = "d";
int n = 4;
words.put("a", 1);
words.put("b", 2);
words.put("c", 3);
words.put("d", 4);
words.put("e", 5);

words.forEach((word, number) -> {
    if (letter.equals(word)) {
        System.out.println(number * n);
    }
});
String letter = "d";
int n = 4;
if (letter.equals("a")) {
    System.out.println(n * 1);
} else if (letter.equals("b")) {
    System.out.println(n * 2);
} else if (letter.equals("c")) {
    System.out.println(n * 3);
} else if (letter.equals("d")) {
    System.out.println(n * 4);
} else if (letter.equals("e")) {
    System.out.println(n * 5);
}
For your example, having a HashMap but then doing an iterative lookup seems to be a bad idea. The point of using a HashMap is to be able to do a hash based lookup. That is much faster than doing an iterative lookup.
Also, from your example, cascading if-then tests will definitely be faster, since they avoid the overhead of the map iterator and the extra function calls, as well as the overhead of the iterator skipping empty storage locations in the hash map's backing array. A better question is whether the cascading if-thens are faster than iterating across a simple list. That is hard to answer: cascading if-thens seem likely to be faster, except that if there are a lot of them, the cost of loading the code should be added.
For string lookups, a list data structure provides adequate behavior up to a limiting value, above which a more sophisticated data structure must be used. What that limiting value is depends on the environment. For string comparisons, I've found the transition to lie between 20 and 100 elements.
For particular lookups, and where low-level optimizations are available, the transition value may be much larger. For example, when doing integer lookups in C, which can do direct memory lookups, the transition value is much higher.
Typical data structures are HashMaps, Tries, and sorted arrays. Each fits particular patterns of access. For example, sorted arrays are fastest and most compact, but are expensive to update. HashMaps support dynamic updates, and for good hash functions, provide constant time lookups. But, HashMaps are space inefficient, since they depend on having empty cells between hash values.
For cases which do not involve "very large" data sets, and which are not in critical "hot" code paths, HashMaps are the usual structure which is used.
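As a small illustration of the sorted-array option mentioned above (the keys, values, and multiplier here are made up for the example): build the structure once, then look keys up with a binary search, holding the values at the same index in a parallel array.

String[] keys   = {"a", "b", "c", "d", "e"};    // must be kept sorted
int[]    values = { 1,   2,   3,   4,   5 };

int idx = Arrays.binarySearch(keys, "d");       // O(log n) lookup
if (idx >= 0) {
    System.out.println(values[idx] * 4);        // prints 16, like the map version
}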
If you have a Map and you want to retrieve one letter, I'm not sure why you would loop at all?
Map<String, Integer> words = new HashMap<String, Integer>();
String letter = "d";
int n = 4;
words.put("a", 1);
words.put("b", 2);
words.put("c", 3);
words.put("d", 4);
words.put("e", 5);

if (words.containsKey(letter)) {
    System.out.println(words.get(letter) * n);
} else {
    System.out.println(letter + " doesn't exist in Map");
}
If you aren't using the benefits of a Map, then why use a Map at all?
A forEach will actually touch every key in the map. The number of checks in your if/else depends on where the letter sits in the chain and on how long the list of available letters is. If the letter you choose is the last one in the chain, it will complete all the checks before printing. If it is first, it will only do one, which is much faster than having to check them all.
It would be easy for you to write the two examples and run a timer to determine which is actually faster.
https://www.baeldung.com/java-measure-elapsed-time
There are a lot of wasted comparisons if you have to run through 1 million if/else statements and only select one, which could be anywhere in the chain. That doesn't even count typos and the horror of code maintenance. Using a Map lookup would be much quicker. If you are only talking about 100 if/else statements (still too many in my opinion), then you may be able to break even on speed.
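A rough sketch of that timing, reusing the words map, letter, and n from the question (not a rigorous benchmark; the iteration count is arbitrary and a harness such as JMH would be more trustworthy): run each variant many times and compare the elapsed nanoseconds.

long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    Integer number = words.get(letter);          // map lookup variant
    if (number != null) {
        int product = number * n;
    }
}
long mapTime = System.nanoTime() - start;

start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    if (letter.equals("a")) { int product = n * 1; }
    else if (letter.equals("b")) { int product = n * 2; }
    else if (letter.equals("c")) { int product = n * 3; }
    else if (letter.equals("d")) { int product = n * 4; }
    else if (letter.equals("e")) { int product = n * 5; }
}
long ifElseTime = System.nanoTime() - start;

System.out.println("map lookup: " + mapTime + " ns, if/else: " + ifElseTime + " ns");

The JIT can distort micro-measurements like this, so treat the numbers as a rough indication only.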
Say I have a function mutateElement() which does x operations and a function mutateElement2() which does y operations. What is the difference in performance between these two pieces of code?
Piece1:
List<Object> result = array.stream()
        .map(elem -> {
            mutateElement(elem);   // assuming the mutators modify elem in place
            mutateElement2(elem);
            return elem;
        })
        .collect(Collectors.toList());
Piece2:
List<Object> array = array.stream().map(elem ->
mutateElement(elem);
)
.collect(Collectors.toList());
array = array.stream().map(elem ->
mutateElement2(elem);
)
.collect(Collectors.toList());
Clearly the first implementation is better, as it only uses one iterator, whereas the second uses two. But would the difference be noticeable if I had, say, a million elements in the array?
The first implementation is not better simply because it uses only one iterator; it is better because it only collects once.
Nobody can tell you whether the difference would be noticeable if you had a million elements. (And if someone did try to tell you, you should not believe them.) Benchmark it.
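If you do want numbers, a minimal JMH sketch along these lines could work. Everything here is illustrative: the class name, the data size, and the use of plain string suffixing to stand in for mutateElement and mutateElement2; it assumes JMH is on the classpath.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class OnePassVsTwoPassBenchmark {

    List<String> data;

    @Setup
    public void setup() {
        // one million dummy elements to map over
        data = IntStream.range(0, 1_000_000)
                .mapToObj(Integer::toString)
                .collect(Collectors.toList());
    }

    @Benchmark
    public List<String> onePass() {
        return data.stream()
                .map(s -> (s + "x") + "y")       // both "mutations" in a single pass
                .collect(Collectors.toList());
    }

    @Benchmark
    public List<String> twoPasses() {
        List<String> first = data.stream().map(s -> s + "x").collect(Collectors.toList());
        return first.stream().map(s -> s + "y").collect(Collectors.toList());
    }
}

Run the two @Benchmark methods and compare; the relative difference on your own hardware is the only answer that matters.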
Whether you use a stream or an external loop, the problem is the same: one iteration over the List in the first piece of code and two iterations in the second.
The execution time of the second version is therefore logically greater.
Besides, invoking the terminal operation on the stream twice rather than once:
.collect(Collectors.toList());
also has a cost.
But would the difference be noticeable if I had say a million elements
in the array.
It could be.
But the question is hard to answer with a plain yes or no.
It depends on other parameters such as the CPUs, the number of concurrent users and processes, and your definition of "noticeable".
I have a simple loop:
List<String> results = new ArrayList<String>();
for (int i = 0; i < 100000000; i++) {
    results.add("a" + i + "b");
}
return results;
It must be a simple idea, because if I start to create a thread pool I will spend memory on additional objects; it should also work without Java 8.
How can I implement a simple parallel loop?
The easiest way to speed this up is to create the ArrayList with the size that you need ahead of time. I have no idea why you would need a list like this, but I can only assume that your program slows down every time it has to allocate and copy another backing array because it ran out of space.
List<String> results = new ArrayList<String>(100000000); // sized for the number of iterations
This can't be done in parallel for a simple reason: Java doesn't provide an efficient List implementation for concurrent access.
The simplest solution would be to implement your own LinkedList and link several LinkedLists into one; the lists can be filled a lot faster in parallel, and the merge operation can afterwards be done in O(1).
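A rough sketch of that split-and-merge idea using only pre-Java 8 APIs (the method name and the chunking are illustrative; note that, unlike the O(1) linked-list splice described above, merging ArrayLists with addAll copies the elements):

static List<String> buildInParallel(final int total, int threads) throws InterruptedException {
    List<List<String>> parts = new ArrayList<List<String>>();
    List<Thread> workers = new ArrayList<Thread>();
    int chunk = total / threads;
    for (int t = 0; t < threads; t++) {
        final int from = t * chunk;
        final int to = (t == threads - 1) ? total : from + chunk;
        final List<String> part = new ArrayList<String>(to - from);
        parts.add(part);
        Thread worker = new Thread(new Runnable() {
            public void run() {
                for (int i = from; i < to; i++) {
                    part.add("a" + i + "b");      // each worker fills its own list, no contention
                }
            }
        });
        workers.add(worker);
        worker.start();
    }
    for (Thread w : workers) {
        w.join();                                 // wait for every worker to finish
    }
    List<String> result = new ArrayList<String>(total);
    for (List<String> part : parts) {
        result.addAll(part);                      // merge the partial lists at the end
    }
    return result;
}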
Using Java 8 parallel streams you can do this:
List<String> results = IntStream.range(0, 100000000)
        .parallel()
        .mapToObj(i -> new StringBuilder()
                .append("a")
                .append(i)
                .append("b")
                .toString())
        .collect(Collectors.toList());
First, IntStream.range(0, 100000000) creates an IntStream with the values 0 to 100000000 (exclusive). Then the IntStream is made parallel using the parallel method. Then I use mapToObj to turn the IntStream into a Stream<String>. Finally, the stream is collected into a List. I also use a StringBuilder to make the string concatenation a bit faster.
I have a List<String> and there are almost 20,000 records in it (and maybe more)...
I need to iterate over this list, and it takes almost 3 minutes...
Here is my block of code:
for (String string : list) {
    response += string;
    response += "/t";
}
I have two questions:
Is the long time due to the List iteration itself or to the operation performed on each item?
Depending on the answer to question 1, how can I speed up this operation?
The poor performance is more likely to be caused by your use of string concatenation. Use a StringBuilder instead.
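For illustration, here is the same loop rewritten with a StringBuilder, keeping the question's "/t" separator as-is: each append reuses the builder's internal buffer instead of copying the whole string on every +=, which turns the quadratic concatenation into roughly linear work.

StringBuilder sb = new StringBuilder();
for (String string : list) {
    sb.append(string);
    sb.append("/t");        // same separator literal as in the question
}
String response = sb.toString();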
Consider using a Map if it is applicable. Here is a link describing common Java objects and how much their operations cost in Big-O notation:
http://objectissues.blogspot.com/2006/11/big-o-notation-and-java-constant-time.html