Here I am using a Java parallel stream to iterate through a List and make a REST call with each list element as input. I need to add all the results of the REST calls to a collection, for which I am using an ArrayList. The code given below works, except that ArrayList is not thread-safe, so it can produce incorrect results, and adding the needed synchronization would cause contention, undermining the benefit of parallelism.
Can someone please suggest a proper way of using a parallel stream for my case?
public void myMethod() {
List<List<String>> partitions = getInputData();
final List<String> allResult = new ArrayList<String>();
partitions.parallelStream().forEach(serverList -> callRestAPI(serverList, allResult));
}
private void callRestAPI(List<String> serverList, List<String> allResult) {
List<String> result = //Do a REST call.
allResult.addAll(result);
}
You can do the operation with map instead of forEach - that will guarantee thread safety (and is cleaner from a functional programming perspective):
List<String> allResult = partitions.parallelStream()
.map(this::callRestAPI)
.flatMap(List::stream) //flattens the lists
.collect(Collectors.toList());
And your callRestAPI method:
private List<String> callRestAPI(List<String> serverList) {
List<String> result = //Do a REST call.
return result;
}
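For anyone who wants to try this end to end, here is a minimal sketch of the same map/flatMap/collect pipeline. The callRestAPI body is only a hypothetical stand-in (it fabricates one string per server instead of doing a real REST call), and the class name RestFanOut is made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RestFanOut {
    // Hypothetical stand-in for the REST call: returns one result per server name.
    static List<String> callRestAPI(List<String> serverList) {
        return serverList.stream()
                .map(server -> "result-from-" + server)
                .collect(Collectors.toList());
    }

    static List<String> collectAll(List<List<String>> partitions) {
        return partitions.parallelStream()
                .map(RestFanOut::callRestAPI)   // each partition is handled independently
                .flatMap(List::stream)          // flatten the per-partition lists
                .collect(Collectors.toList());  // the framework merges per-thread results
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e"));
        System.out.println(collectAll(partitions));
    }
}
```

No shared list is mutated here, so there is nothing to synchronize, and the results come back in the encounter order of the input.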
I wouldn't shy away from synchronising access to your ArrayList. Given that you're accessing a remote service via Rest, I suspect the cost of synchronisation would be negligible. I would measure the effect before you spend time optimising.
There are two test cases which use parallelStream():
List<Integer> src = new ArrayList<>();
for (int i = 0; i < 20000; i++) {
src.add(i);
}
List<String> strings = new ArrayList<>();
src.parallelStream().filter(integer -> (integer % 2) == 0).forEach(integer -> strings.add(integer + ""));
System.out.println("=size=>" + strings.size());
=size=>9332
List<Integer> src = new ArrayList<>();
for (int i = 0; i < 20000; i++) {
src.add(i);
}
List<String> strings = new ArrayList<>();
src.parallelStream().forEach(integer -> strings.add(integer + ""));
System.out.println("=size=>" + strings.size());
=size=>17908
Why do I always lose data when using parallelStream?
What did I do wrong?
ArrayList isn't thread safe. You need to do
List<String> strings = Collections.synchronizedList(new ArrayList<>());
or
List<String> strings = new Vector<>();
to ensure all updates are synchronized, or switch to
List<String> strings = src.parallelStream()
.filter(integer -> (integer % 2) == 0)
.map(integer -> integer + "")
.collect(Collectors.toList());
and leave the list building to the Streams framework. Note that it's undefined whether the list returned by collect is modifiable, so if that is a requirement, you may need to modify your approach.
In terms of performance, Stream.collect is likely to be much faster than using Stream.forEach to add to a synchronized collection, since the Streams framework can handle collection of values in each thread separately without synchronization and combine the results at the end in a thread safe fashion.
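If you do need a list you can modify afterwards, one option is to name the list type yourself with Collectors.toCollection. This is just a sketch of that idea; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class MutableCollect {
    static List<String> evenStrings(int limit) {
        return IntStream.range(0, limit)
                .parallel()
                .filter(i -> i % 2 == 0)
                .mapToObj(i -> i + "")
                // toCollection names the concrete list type, so the result is a plain ArrayList
                .collect(Collectors.toCollection(ArrayList::new));
    }

    public static void main(String[] args) {
        List<String> strings = evenStrings(20000);
        strings.add("extra");   // fine: we asked for an ArrayList explicitly
        System.out.println("=size=>" + strings.size());
    }
}
```

The stream still builds partial lists per thread and merges them, so no element is lost even though ArrayList itself is not thread-safe.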
ArrayList isn't thread-safe. While one thread sees a list with 30 elements, another might still see 29 and overwrite the 30th position (losing one element).
Another issue can arise when the array backing the list needs to be resized. A new, larger array (about 1.5 times the size in OpenJDK) is created and the elements from the original array are copied into it. Other threads may have added elements in the meantime that the resizing thread did not see, or several threads may resize at once and only one of them will win.
When using multiple threads you need to either synchronize access to the list or use a thread-safe list (for example by wrapping it with Collections.synchronizedList or by using a CopyOnWriteArrayList, to mention two possible solutions). Even better would be to use the collect method on the stream to put everything into a list.
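A small sketch of the two thread-safe options mentioned above, kept in the forEach style of the question. The helper name fill is made up for the example. Note that CopyOnWriteArrayList copies its backing array on every add, so it is shown only for completeness and is a poor fit for write-heavy loops like this one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.stream.IntStream;

class SafeForEach {
    // Adds count strings to target from a parallel stream and returns the final size.
    static int fill(List<String> target, int count) {
        IntStream.range(0, count)
                .parallel()
                .forEach(i -> target.add(String.valueOf(i)));
        return target.size();
    }

    public static void main(String[] args) {
        // Every add() takes the wrapper's lock, so no update is lost.
        System.out.println(fill(Collections.synchronizedList(new ArrayList<>()), 10000));
        // Every add() copies the array under a lock: safe, but expensive for many writes.
        System.out.println(fill(new CopyOnWriteArrayList<>(), 10000));
    }
}
```

Both calls end up with all 10000 elements, but the order of the elements in the list is not defined.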
Using parallelStream with forEach is a dangerous combination if not used carefully.
Keep the following points in mind to avoid bugs:
If you have a pre-existing list to which you want to add objects from a parallelStream loop, wrap it with Collections.synchronizedList before looping through the parallel stream, and add through the wrapper.
If you have to create a new list, you can use a Vector, created outside the loop.
or
If you have to create a new list, simply use parallelStream and collect the output at the end.
You lose the benefits of using stream (and parallel stream) when you try to do mutation. As a general rule, avoid mutation when using streams. Venkat Subramaniam explains why. Instead, use collectors. Also try to get a lot accomplished within the stream chain. For example:
System.out.println(
IntStream.range(0, 200000)
.filter(i -> i % 2 == 0)
.mapToObj(String::valueOf)
.collect(Collectors.toList()).size()
);
You can run that in parallel by adding .parallel() to the chain, for example right after IntStream.range(0, 200000).
I have the following method that calls itself recursively:
public ArrayList<SpecTreeNode> getLeavesBelow()
{
ArrayList<SpecTreeNode> result = new ArrayList<>();
if (isLeaf())
{
result.add(this);
}
for (SpecTreeNode stn : chList)
{
result.addAll(stn.getLeavesBelow());
}
return result;
}
I'd like to convert the for loop to use parallelStream. I think I'm partly there but not sure how to implement .collect() to 'addAll' to result:
chList.parallelStream()
.map(SpecTreeNode::getLeavesBelow)
.collect();
Some assistance would be much appreciated.
Just like this, right? Am I missing something?
result.addAll(
chList.parallelStream()
.map(SpecTreeNode::getLeavesBelow)
.flatMap(Collection::stream)
.collect(Collectors.toList())
);
Unrelated to your question but because you're seeking performance improvements: you may see some gains by specifying an initial size for your ArrayList to avoid reallocating multiple times.
A LinkedList may be a preferable data structure if you can't anticipate the size, as all you're doing here is continually appending to the end of the list. However, if you need to randomly access elements of this list later then it might not be.
I would do it by making the recursive method return a Stream of nodes instead of a List, then filter to keep only the leaves and finally collect to a list:
public List<SpecTreeNode> getLeavesBelow() {
return nodesBelow(this)
.parallel()
.filter(SpecTreeNode::isLeaf)
.collect(Collectors.toList());
}
private Stream<SpecTreeNode> nodesBelow(SpecTreeNode node) {
return Stream.concat(
Stream.of(node),
node.chList.stream()
.flatMap(this::nodesBelow));
}
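Here is a self-contained sketch of the same stream-returning approach that you can run as is. The name field, the add helper, the leafNames helper and the definition of isLeaf (no children) are assumptions made for the example; only chList, isLeaf and getLeavesBelow come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SpecTreeNode {
    final String name;                                   // hypothetical, for printing only
    final List<SpecTreeNode> chList = new ArrayList<>();

    SpecTreeNode(String name) { this.name = name; }

    // Hypothetical builder helper: attaches a child and returns this node.
    SpecTreeNode add(SpecTreeNode child) { chList.add(child); return this; }

    boolean isLeaf() { return chList.isEmpty(); }        // assumed definition of a leaf

    // This node followed by all of its descendants, depth-first.
    Stream<SpecTreeNode> nodesBelow() {
        return Stream.concat(
                Stream.of(this),
                chList.stream().flatMap(SpecTreeNode::nodesBelow));
    }

    List<SpecTreeNode> getLeavesBelow() {
        return nodesBelow()
                .parallel()
                .filter(SpecTreeNode::isLeaf)
                .collect(Collectors.toList());
    }

    static List<String> leafNames(SpecTreeNode root) {
        return root.getLeavesBelow().stream().map(n -> n.name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SpecTreeNode root = new SpecTreeNode("root")
                .add(new SpecTreeNode("a").add(new SpecTreeNode("a1")).add(new SpecTreeNode("a2")))
                .add(new SpecTreeNode("b"));
        System.out.println(leafNames(root));
    }
}
```

Whether .parallel() actually helps here depends on the tree; for small trees the sequential version is likely to be at least as fast, so it is worth measuring.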
I want to run this code in parallel using a Java parallel stream and collect the results in two ArrayLists. The code given below works, except that ArrayList is not thread-safe, which may cause incorrect results, and I don't want to synchronize the ArrayLists. Can someone please suggest a proper way of using a parallel stream for my case?
List<Integer> passedList= new ArrayList<>();
List<Integer> failedList= new ArrayList<>();
Integer[] input = {0,1,2,3,4,5,6,7,8,9};
List<Integer> myList = Arrays.asList(input);
myList.parallelStream().forEach(element -> {
if (isSuccess(element)) {//Some SOAP API call.
passedList.add(element);
} else {
failedList.add(element);
}
});
System.out.println(passedList);
System.out.println(failedList);
An appropriate solution would be to use Collectors.partitioningBy:
Integer[] input = {0,1,2,3,4,5,6,7,8,9};
List<Integer> myList = Arrays.asList(input);
Map<Boolean, List<Integer>> map = myList.parallelStream()
.collect(Collectors.partitioningBy(element -> isSuccess(element)));
List<Integer> passedList = map.get(true);
List<Integer> failedList = map.get(false);
This way you will have no thread-safety problems, as the task is decomposed in a map-reduce manner: the parts of the input are processed independently and joined afterwards. If your isSuccess method is slow, you are likely to see a performance boost here.
By the way, you can create a parallel stream from the original array using Arrays.stream(input).parallel(), without needing to create the intermediate myList.
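Putting both suggestions together, a runnable sketch could look like this. The isSuccess body is a hypothetical stand-in for the SOAP call (even numbers "pass"), and the class name is made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PartitionDemo {
    // Hypothetical stand-in for the SOAP API call: even numbers count as success.
    static boolean isSuccess(Integer element) {
        return element % 2 == 0;
    }

    static Map<Boolean, List<Integer>> partition(Integer[] input) {
        return Arrays.stream(input)     // stream straight from the array, no intermediate list
                .parallel()
                .collect(Collectors.partitioningBy(PartitionDemo::isSuccess));
    }

    public static void main(String[] args) {
        Integer[] input = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        Map<Boolean, List<Integer>> map = partition(input);
        System.out.println(map.get(true));   // passed
        System.out.println(map.get(false));  // failed
    }
}
```

The map returned by partitioningBy always contains both the true and the false key, so neither get call returns null even if one group is empty.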
I have a Java RESTful web service that's hosted on Tomcat. In one of my web service methods, I load a big ArrayList of objects (about 25,000 entries) from Redis. This ArrayList is updated once every 30 minutes. There are multiple threads reading from this ArrayList all the time. When I update the ArrayList, I want to cause minimum disruption and delay, since there could be other threads reading from it.
I was wondering what the best way to do this is. One way is to add the synchronized keyword to the method that updates the list. But the synchronized method has an overhead, since no threads can read while the update is going on. The update method itself could take a few hundred milliseconds, since it involves reading from Redis plus deserialization.
class WebService {
ArrayList<Entry> list = new ArrayList<Entry>();
//need to call this every 30 mins.
synchronized void updateArrayList(){
//read from redis & add elements to list
}
void readFromList(){
for(Entry e: list) {
//do some processing
}
}
}
Updated the final solution:
I ended up using no explicit synchronization primitives.
Does it have to be the same List instance getting updated? Can you build a new list every 30 minutes and replace a volatile reference?
Something along these lines:
class WebService {
private volatile List<Entry> theList;
void updateList() {
List<Entry> newList = getEntriesFromRedis();
theList = Collections.unmodifiableList(newList);
}
public List<Entry> getList() {
return theList;
}
}
The advantage of this approach is that you don't have to do any other synchronization anywhere else.
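To make the reader side explicit, here is a minimal sketch of the reference swap, using String entries instead of Entry so that it is self-contained. The class and method names are made up for the example. The one thing readers should do is copy the volatile reference into a local variable once, so a single loop never mixes the old and the new list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class EntryCache {
    // Readers always see either the old list or the fully built new one.
    private volatile List<String> entries = Collections.emptyList();

    // Called by the refresh timer: build the new list off to the side,
    // then publish it with a single volatile write.
    void refresh(List<String> freshlyLoaded) {
        entries = Collections.unmodifiableList(new ArrayList<>(freshlyLoaded));
    }

    int totalLength() {
        List<String> snapshot = entries;   // read the reference once
        int total = 0;
        for (String e : snapshot) {
            total += e.length();
        }
        return total;
    }
}
```

Because the published list is never modified after the write, readers need no locking at all, and a refresh in progress does not block them.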
A reader-writer lock (or ReadWriteLock in Java) is what you need.
A reader-writer lock will allow concurrent access for read operations, but mutually exclusive access for write.
It will look something like
class WebService {
final ReentrantReadWriteLock listRwLock = new ReentrantReadWriteLock();
ArrayList<Entry> list = new ArrayList<Entry>();
//need to call this every 30 mins.
void updateArrayList(){
listRwLock.writeLock().lock();
try {
//read from redis & add elements to list
} finally {
listRwLock.writeLock().unlock();
}
}
void readFromList(){
listRwLock.readLock().lock();
try {
for(Entry e: list) {
//do some processing
}
} finally {
listRwLock.readLock().unlock();
}
}
}
Here is the solution I finally ended up with,
class WebService {
// key = timeWindow (for ex:10:00 or 10:30 or 11:00), value = <List of entries for that timewindow>
ConcurrentHashMap<String, List<Entry>> map= new ConcurrentHashMap<String, List<Entry>>();
//have setup a timer to call this every 10 mins.
void updateArrayList(){
// Populate the map for the next time window with the corresponding entries, so that it's ready before we start using it. Also clean up the expired entries for older time windows.
}
void readFromList(){
List<Entry> list = map.get(currentTimeWindow);
for(Entry e: list) {
//do some processing
}
}
}
ArrayList is not thread-safe. You can use a Vector instead to make it thread-safe.
You can also get a thread-safe ArrayList through the Collections API, but I would recommend Vector since it already provides what you want.
//Use the Collections.synchronizedList method
List list = Collections.synchronizedList(new ArrayList());
...
//If you want to use an iterator on the synchronized list, use it
//like this. It should be inside a synchronized block.
synchronized (list) {
Iterator iterator = list.iterator();
while (iterator.hasNext())
...
iterator.next();
...
}
I would recommend that you go through this:
http://beginnersbook.com/2013/12/difference-between-arraylist-and-vector-in-java/
List<String> list = new ArrayList<String>();
list.add("a");
...
list.add("z");
synchronized(list) {
Iterator<String> i = list.iterator();
while(i.hasNext()) {
...
}
}
and
List<String> list = new ArrayList<String>();
list.add("a");
...
list.add("z");
List<String> synchronizedList = Collections.synchronizedList(list);
synchronized(synchronizedList) {
Iterator<String> i = synchronizedList.iterator();
while(i.hasNext()) {
...
}
}
Specifically, I'm not clear as to why synchronized is required in the second instance when a synchronized list provides thread-safe access to the list.
If you don't lock around the iteration, you may get a ConcurrentModificationException if another thread modifies the list during the loop.
Synchronizing all of the methods doesn't prevent that in the slightest.
This (and many other things) is why Collections.synchronized* is completely useless.
You should use the classes in java.util.concurrent. (and you should think carefully about how you will guarantee you will be safe)
As a general rule of thumb:
Slapping locks around every method is not enough to make something thread-safe.
For much more information, see my blog
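To illustrate the rule of thumb with something other than iteration, here is a sketch of a check-then-act sequence on a synchronized list. The class and method names are made up for the example. contains() and add() are each synchronized on their own, but the pair is only atomic if the caller holds the list's lock across both calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

class AddIfAbsent {
    private final List<String> list = Collections.synchronizedList(new ArrayList<>());

    // Without the synchronized block, two threads could both pass the
    // contains() check and then both add the same value.
    boolean addIfAbsent(String value) {
        synchronized (list) {   // the wrapper uses itself as its lock
            if (list.contains(value)) {
                return false;
            }
            list.add(value);
            return true;
        }
    }

    int size() {
        return list.size();
    }

    static int demo() {
        AddIfAbsent set = new AddIfAbsent();
        // 1000 parallel attempts, but only 10 distinct values.
        IntStream.range(0, 1000).parallel().forEach(i -> set.addIfAbsent("k" + (i % 10)));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the block in place the list ends up with exactly the 10 distinct values; removing it makes duplicates possible, although a given run may or may not show them.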
synchronizedList only makes each call atomic. In your case, the loop makes multiple calls, so between each call/iteration another thread can modify the list. If you use one of the concurrent collections, you don't have this problem.
To see how a concurrent collection such as CopyOnWriteArrayList differs from ArrayList:
List<String> list = new CopyOnWriteArrayList<String>();
list.addAll(Arrays.asList("a,b,c,d,e,f,g,h,z".split(",")));
for(String s: list) {
System.out.print(s+" ");
// would trigger a ConcurrentModificationException with ArrayList
list.clear();
}
Even though the list is cleared repeatedly, it prints the following, because that was the content of the list when the iterator was created.
a b c d e f g h z
The second code needs to be synchronized because of the way synchronized lists are implemented. This is explained in the javadoc:
It is imperative that the user manually synchronize on the returned list when iterating over it
The main difference between the two code snippets is the effect of the add operations:
with the synchronized list, you have a visibility guarantee: other threads will see the newly added items if they call synchronizedList.get(..) for example.
with the ArrayList, other threads might not see the newly added items immediately - they might actually not ever see them.