There are two test cases which use parallelStream():
List<Integer> src = new ArrayList<>();
for (int i = 0; i < 20000; i++) {
    src.add(i);
}
List<String> strings = new ArrayList<>();
src.parallelStream().filter(integer -> (integer % 2) == 0).forEach(integer -> strings.add(integer + ""));
System.out.println("=size=>" + strings.size());
=size=>9332
List<Integer> src = new ArrayList<>();
for (int i = 0; i < 20000; i++) {
    src.add(i);
}
List<String> strings = new ArrayList<>();
src.parallelStream().forEach(integer -> strings.add(integer + ""));
System.out.println("=size=>" + strings.size());
=size=>17908
Why do I always lose data when using parallelStream?
What did I do wrong?
ArrayList isn't thread-safe. You need to do
List<String> strings = Collections.synchronizedList(new ArrayList<>());
or
List<String> strings = new Vector<>();
to ensure all updates are synchronized, or switch to
List<String> strings = src.parallelStream()
        .filter(integer -> (integer % 2) == 0)
        .map(integer -> integer + "")
        .collect(Collectors.toList());
and leave the list building to the Streams framework. Note that it's undefined whether the list returned by collect is modifiable, so if that is a requirement, you may need to modify your approach.
In terms of performance, Stream.collect is likely to be much faster than using Stream.forEach to add to a synchronized collection, since the Streams framework can collect values in each thread separately without synchronization and combine the results at the end in a thread-safe fashion.
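To make that concrete, here is a sketch (my own illustration, not the library's internals) of the same pipeline written with Collector.of, which spells out the three ingredients: a supplier that gives each worker thread its own list, an accumulator that adds to it without locking, and a combiner that merges the partial lists:
List<String> strings = IntStream.range(0, 20000)
        .parallel()
        .filter(i -> i % 2 == 0)
        .mapToObj(String::valueOf)
        .collect(Collector.of(
                ArrayList::new,                // supplier: one list per worker thread
                (list, s) -> list.add(s),      // accumulator: thread-local, no locking
                (left, right) -> { left.addAll(right); return left; })); // combiner: merge partial lists
System.out.println("=size=>" + strings.size()); // always 10000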
ArrayList isn't thread-safe. While one thread sees a list with 30 elements, another might still see 29 and overwrite the 30th position, losing one element.
Another issue arises when the array backing the list needs to be resized: a new array (with double the size) is created and the elements of the original array are copied into it. Other threads may have added elements that the resizing thread never saw, or multiple threads may resize at once and only one of them will win.
When using multiple threads you need to either synchronize access to the list or use a thread-safe list (by wrapping it via Collections.synchronizedList or by using a CopyOnWriteArrayList, to mention two possible solutions). Even better would be to use the collect method on the stream to put everything into a list.
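If you want to see the lost-update race for yourself, here is a quick demo (my own example, not from the question); the unsafe version usually comes up short and can even throw an ArrayIndexOutOfBoundsException during a concurrent resize:
List<Integer> unsafe = new ArrayList<>();
List<Integer> safe = Collections.synchronizedList(new ArrayList<>());
IntStream.range(0, 20000).parallel().forEach(unsafe::add); // races: lost adds, possible AIOOBE
IntStream.range(0, 20000).parallel().forEach(safe::add);   // each add holds the wrapper's lock
System.out.println("unsafe size = " + unsafe.size()); // usually < 20000
System.out.println("safe size   = " + safe.size());   // always 20000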
parallelStream() with forEach is a deadly combination if not used carefully.
Please take a look at the points below to avoid bugs:
If you have a pre-existing list object to which you want to add more objects from a parallelStream loop, wrap it with Collections.synchronizedList, passing the pre-existing list to it before looping over the parallel stream (see the sketch after this list).
If you have to create a new list, you can initialize the list as a Vector outside the loop.
or
If you have to create a new list, simply use parallelStream and collect the output at the end.
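A minimal sketch of the first point; existing is a hypothetical stand-in for whatever list you already have:
List<String> existing = new ArrayList<>(); // pre-existing list, may already hold data
List<String> syncView = Collections.synchronizedList(existing);
IntStream.range(0, 1000).parallel()
         .forEach(i -> syncView.add("item-" + i)); // every add goes through the synchronized wrapper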
You lose the benefits of using streams (and parallel streams) when you try to do mutation. As a general rule, avoid mutation when using streams; Venkat Subramaniam explains why. Instead, use collectors. Also try to get a lot accomplished within the stream chain. For example:
System.out.println(
    IntStream.range(0, 200000)
             .filter(i -> i % 2 == 0)
             .mapToObj(String::valueOf)
             .collect(Collectors.toList()).size()
);
You can run that in parallel by adding .parallel() to the chain:
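That is, keeping the same chain as above:
System.out.println(
    IntStream.range(0, 200000)
             .parallel()                 // the only change: opt in to parallel execution
             .filter(i -> i % 2 == 0)
             .mapToObj(String::valueOf)
             .collect(Collectors.toList()).size()
);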
Related
I want to convert this while loop to equivalent code using Java 8 streams, but I don't know how to both stream the List and remove elements from it.
private List<String> nameList = new ArrayList<>();
while (nameList.size() > 0) {
    String nameListFirstEntry = nameList.get(0);
    nameList.remove(0);
    setNameCombinations(nameListFirstEntry);
}
I guess this will do
nameList.forEach(this::setNameCombinations);
nameList.clear();
In case you don't need the original list anymore, you might as well create a new empty list instead.
Because List#remove(int) also returns the element, you can both stream the list's elements and remove them via a stream:
Stream.generate(() -> nameList.remove(0))
      .limit(nameList.size())
      .forEach(this::setNameCombinations);
This code doesn't break any "rules". Note that nameList.size() in limit() is evaluated once, when the pipeline is constructed, so the stream is capped at the list's original size before any elements are removed. From the javadoc of Stream#generate():
Returns an infinite sequential unordered stream where each element is generated by the provided Supplier. This is suitable for generating constant streams, streams of random elements, etc.
There is no mention of any restrictions on how the supplier is implemented, or that it must have no side effects. The Supplier's only contract is to supply.
For those who doubt this works, here's some test code using 100K elements showing that order is indeed preserved:
int size = 100000;
List<Integer> list0 = new ArrayList<>(size); // the reference list
IntStream.range(0, size).boxed().forEach(list0::add);
List<Integer> list1 = new ArrayList<>(list0); // will feed stream
List<Integer> list2 = new ArrayList<>(size); // will consume stream
Stream.generate(() -> list1.remove(0))
      .limit(list1.size())
      .forEach(list2::add);
System.out.println(list0.equals(list2)); // always true
I have this piece of code:
ArrayList<ArrayList<Double>> results = new ArrayList<ArrayList<Double>>();
IntStream.range(0, 100).parallel().forEach(x -> {
    for (int y = 0; y < 100; y++) {
        for (int z = 0; z < 100; z++) {
            for (int q = 0; q < 100; q++) {
                results.add(someMethodThatReturnsArrayListDouble());
            }
        }
    }
});
System.out.println(results.size());
After running this code I always get a different results.size(), always a few short. Any idea why that is and how to fix it?
ArrayList is not thread-safe. If you try to add items to it from different threads (which is what a parallelized stream does), it is likely to break.
From the docs:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more elements, or explicitly resizes the backing array; merely setting the value of an element is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method.
The easiest fix, in this case, would be to remove the call to parallel().
Your result list is not synchronized. There are multiple ways to solve your problem; the best is to let the Java Stream API handle the combining of the lists:
List<List<Double>> results = IntStream.range(0, 100).parallel()
        .boxed()
        .flatMap(x -> {
            List<List<Double>> local = new ArrayList<>();
            for (int y = 0; y < 100; y++) {
                for (int z = 0; z < 100; z++) {
                    for (int q = 0; q < 100; q++) {
                        local.add(someMethodThatReturnsArrayListDouble());
                    }
                }
            }
            return local.stream();
        })
        .collect(Collectors.toList());
This collects the lists inside the mapping function and returns them as a stream to be combined at the end of the pipeline using Collectors.toList(), which is thread-safe.
Use Vector; it's a thread-safe implementation of List.
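For example, a sketch of the question's code with only the declaration changed (the nested loops stay the same):
List<ArrayList<Double>> results = new Vector<>(); // every Vector method is synchronized
IntStream.range(0, 100).parallel().forEach(x -> {
    // ... same nested y/z/q loops as in the question ...
    results.add(someMethodThatReturnsArrayListDouble());
});
Bear in mind that every add now contends for the same lock, so the collect-based solutions above will usually scale better.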
I want to run this code in parallel using a Java parallel stream and collect the results into two ArrayLists. The code below works, except that the lack of thread safety in ArrayList may cause incorrect results, and I don't want to synchronize the ArrayLists. Can someone suggest a proper way to use a parallel stream for my case?
List<Integer> passedList= new ArrayList<>();
List<Integer> failedList= new ArrayList<>();
Integer[] input = {0,1,2,3,4,5,6,7,8,9};
List<Integer> myList = Arrays.asList(input);
myList.parallelStream().forEach(element -> {
    if (isSuccess(element)) { // Some SOAP API call.
        passedList.add(element);
    } else {
        failedList.add(element);
    }
});
System.out.println(passedList);
System.out.println(failedList);
An appropriate solution would be to use Collectors.partitioningBy:
Integer[] input = {0,1,2,3,4,5,6,7,8,9};
List<Integer> myList = Arrays.asList(input);
Map<Boolean, List<Integer>> map = myList.parallelStream()
        .collect(Collectors.partitioningBy(element -> isSuccess(element)));
List<Integer> passedList = map.get(true);
List<Integer> failedList = map.get(false);
This way you will have no thread-safety problems, as the task is decomposed in map-reduce fashion: the parts of the input are processed independently and joined afterwards. If your isSuccess method is slow, you are likely to see a performance boost here.
By the way, you can create a parallel stream directly from the original array using Arrays.stream(input).parallel(), with no need to create the intermediate myList:
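Map<Boolean, List<Integer>> map = Arrays.stream(input)
        .parallel()
        .collect(Collectors.partitioningBy(element -> isSuccess(element)));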
I'm trying to implement a method that finds duplicate objects in a List. The goal is to traverse the List and find the duplicates using multiple threads. So far I have used ExecutorService as follows.
ExecutorService executor = Executors.newFixedThreadPool(5);
for (int i = 0; i < jobs; i++) {
    Runnable worker = new TaskToDo(jobs);
    executor.execute(worker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
System.out.println("Finished all threads");
In the TaskToDo class I iterate through the list. When a duplicate is detected, one of the two is removed from the List. These are the problems I faced:
When using multiple threads at the executor, it does not work as intended: some duplicate values still exist in the list. A single thread at the executor works perfectly, though. I also tried
List<String> list = Collections.synchronizedList(new LinkedList<String>()), but the same problem exists.
What is the best data structure I can use to remove duplicates with good performance?
Google gave some results suggesting concurrent structures, but I'm having difficulty figuring out a correct approach to achieve this.
Appreciate your help. Thanks in advance... :)
Following is the code for iterating through the specified list object. Here the actual contents of the files are compared.
for (int i = currentTemp; i < list.size() - 1; i++) {
    if (isEqual(list.get(currentTemp), list.get(i + 1))) {
        synchronized (list) {
            list.remove(i + 1);
            i--;
        }
    }
}
With your current logic, you would have to synchronize at coarser granularity, otherwise you risk removing the wrong element.
for (int i = currentTemp; i < list.size() - 1; i++) {
    synchronized (list) {
        if (i + 1 < list.size() && isEqual(list.get(currentTemp), list.get(i + 1))) {
            list.remove(i + 1);
            i--;
        }
    }
}
You see, the isEqual() check must be inside the synchronized block to ensure atomicity of the equivalence check with the element removal. Assuming most of your concurrent processing benefit would come from asynchronous comparison of list elements using isEqual(), this change nullifies any benefit you sought.
Also, checking list.size() outside the synchronized block isn't good enough, because list elements can be removed by other threads. And unless you have a way of adjusting your list index down when elements are removed by other threads, your code will unknowingly skip checking some elements in the list. The other threads are shifting elements out from under the current thread's for loop.
This task would be much better implemented using an additional collection to keep track of the indexes that should be removed:
private volatile Set<Integer> indexesToRemove =
    Collections.synchronizedSet(new TreeSet<Integer>(
        new Comparator<Integer>() {
            @Override public int compare(Integer i1, Integer i2) {
                return i2.compareTo(i1); // sort descending for later element removal
            }
        }
    ));
The above should be declared at the same shared level as your list. Then the code for iterating through the list should look like this, with no synchronization required:
int size = list.size();
for (int i = currentTemp; i < size - 1; i++) {
    if (!indexesToRemove.contains(i + 1)) {
        if (isEqual(list.get(currentTemp), list.get(i + 1))) {
            indexesToRemove.add(i + 1);
        }
    }
}
And finally, after you have join()ed the worker threads back to a single thread, do this to de-duplicate your list:
for (Integer i : indexesToRemove) {
    list.remove(i.intValue());
}
Because we used a descending-sorted TreeSet for indexesToRemove, we can simply iterate its indexes and remove each from the list.
If your algorithm acts on sufficient data that you might really benefit from multiple threads, you encounter another issue that will tend to mitigate any performance benefits. Each thread has to scan the entire list to see if the element it is working on is a duplicate, which will cause the CPU cache to keep missing as various threads compete to access different parts of the list.
This is known as False Sharing.
Even if False Sharing does not get you, you are de-duping the list in O(N^2) because for each element of the list, you re-iterate the entire list.
Instead, consider using a Set to initially collect the data. If you cannot do that, test the performance of adding the list elements to a Set. That should be a very efficient way to approach this problem.
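For instance, if the elements have (or can be keyed by) sensible equals/hashCode, a Set removes duplicates in a single pass; a sketch assuming a List<String> named list:
Set<String> unique = new LinkedHashSet<>(list); // keeps the first occurrence, preserves order
List<String> deduped = new ArrayList<>(unique); // O(n) instead of O(n^2) pairwise comparison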
If you're trying to dedup a large number of files, you really ought to be using a hash-based structure. Concurrently modifying lists is dangerous, not least because indexes into the list will constantly be changing out from under you, and that's bad.
If you can use Java 8, my approach would look something like this. Let's assume you have a List<String> fileList.
Collection<String> deduplicatedFiles = fileList.parallelStream()
    .map(FileSystems.getDefault()::getPath) // convert strings to Paths
    .collect(Collectors.toConcurrentMap(
        path -> {
            try {
                // read out the file contents and wrap them in a ByteBuffer,
                // which is a suitable key for a hash map
                return ByteBuffer.wrap(Files.readAllBytes(path));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        },
        path -> path.toString(),   // in the values, convert back to String
        (first, second) -> first)) // resolve duplicates by choosing arbitrarily
    .values();
That's the entire thing: it concurrently reads all the files, hashes them (though with an unspecified hash algorithm that may not be great), deduplicates them, and spits out a list of files with distinct contents.
If you're using Java 7, then what I'd do would be something like this.
CompletionService<Void> service = new ExecutorCompletionService<>(
    Executors.newFixedThreadPool(4));
final ConcurrentMap<ByteBuffer, String> unique = new ConcurrentHashMap<>();
for (final String file : fileList) {
    service.submit(new Runnable() {
        @Override public void run() {
            try {
                ByteBuffer buffer = ByteBuffer.wrap(Files.readAllBytes(
                    FileSystems.getDefault().getPath(file)));
                unique.putIfAbsent(buffer, file);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }, null);
}
for (int i = 0; i < fileList.size(); i++) {
    service.take();
}
Collection<String> result = unique.values();
I have an ArrayList, and I need to filter it (only to remove some elements).
I can't modify the original list.
What is my best option regarding performance:
Recreate another list from the original one, and remove items from it:
code:
List<Foo> newList = new ArrayList<Foo>(initialList);
for (Foo item : initialList) {
    if (...) {
        newList.remove(item);
    }
}
Create an empty list, and add items:
code:
List<Foo> newList = new ArrayList<Foo>(initialList.size());
for (Foo item : initialList) {
    if (...) {
        newList.add(item);
    }
}
Which of these options is best? Should I use anything other than ArrayList? (I can't change the type of the original list, though.)
As a side note, approximately 80% of the items will be kept in the list. The list contains from 1 to around 20 elements.
The best option is to go with whatever is easiest to write and maintain.
If performance is a problem, profile the application afterwards; don't optimize prematurely.
In addition, I'd use filtering from a library like google-collections or Commons Collections to make the code more readable:
Collection<Foo> newCollection = Collections2.filter(initialList, new Predicate<Foo>() {
    public boolean apply(Foo item) {
        return (...); // apply your test here
    }
});
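On Java 8+, the same filtering is a one-liner with streams; a sketch, where passesYourTest is a hypothetical stand-in for your real predicate:
List<Foo> newList = initialList.stream()
        .filter(item -> passesYourTest(item)) // passesYourTest: hypothetical predicate
        .collect(Collectors.toList());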
Anyway, as it seems you are optimizing for performance, I'd go with System.arraycopy if you indeed want to keep most of the original items:
String[] src = initialList.toArray(new String[initialList.size()]);
String[] arr = new String[src.length];
int dstIndex = 0, blockStartIdx = 0, blockSize = 0;
for (int currIdx = 0; currIdx < src.length; currIdx++) {
    String item = src[currIdx];
    if (item.length() <= 4) {        // item filtered out: flush the pending block
        if (blockSize > 0)
            System.arraycopy(src, blockStartIdx, arr, dstIndex, blockSize);
        dstIndex += blockSize;
        blockSize = 0;
    } else {                         // item kept: extend the pending block
        if (blockSize == 0)
            blockStartIdx = currIdx;
        blockSize++;
    }
}
if (blockSize > 0) {                 // flush the trailing block
    System.arraycopy(src, blockStartIdx, arr, dstIndex, blockSize);
    dstIndex += blockSize;
}
List<String> newList = new ArrayList<>(Arrays.asList(arr).subList(0, dstIndex));
It seems to be about 20% faster than your option 3. Even more so (40%) if you can skip the new ArrayList creation at the end.
See: http://pastebin.com/sDhV8BUL
You might want to go with creating a new list from the initial one and removing items: there would be fewer method calls that way, since you're keeping ~80% of the original items.
Other than that, I don't know of any other way to filter the items.
Edit: Apparently Google Collections has something that might interest you?
As @Sanjay says, "when in doubt, measure". But creating an empty ArrayList and then adding items to it is the most natural implementation, and your first goal should be to write clear, understandable code. And I'm 99.9% sure it will be the faster one too.
Update: By copying the old List to a new one and then striking out the elements you don't want, you incur the cost of element removal. The ArrayList.remove() method needs to iterate up to the end of the array on each removal, copying each reference down a position in the list. This almost certainly will be more expensive than simply creating a new ArrayList and adding elements to it.
Note: Be sure to allocate the new ArrayList to an initial capacity set to the size of the old List to avoid reallocation costs.
The second is faster (iterate and add to the new list as needed): every remove() call in the first has to shift the remaining elements down one position, and if you ever iterate and remove from the same list you also risk a ConcurrentModificationException.
What the result type should be depends on what you are going to need the filtered list for.
I'd first follow the age-old advice: when in doubt, measure.
Should I use anything else than ArrayList?
That depends on what kind of operations you will be performing on the filtered list, but ArrayList is usually a good bet unless you're doing something that really shouldn't be backed by a contiguous array of elements.
List newList = new ArrayList(initialList.size());
I don't mean to nitpick, but if your new list won't exceed 80% of the initial size, why not fine-tune the initial capacity to ((int)(initialList.size() * .8) + 1)?
Since I was only getting suggestions here, I decided to run my own benchmark to be sure.
Here are the conclusions (with an ArrayList of Strings):
Solution 1, remove items from the copy: 2400 ms.
Solution 2, create an empty list and fill it: 1600 ms. newList = new ArrayList<Foo>();
Solution 3, same as 2, except you set the initial size of the List: 1530 ms. newList = new ArrayList<Foo>(initialList.size());
Solution 4, same as 2, except you set the initial size of the List + 1: 1500 ms. newList = new ArrayList<Foo>(initialList.size() + 1); (as explained by @Soronthar)
Source : http://pastebin.com/c2C5c9Ha