Why does parallelStream not use the entire available parallelism? - java

I have a custom ForkJoinPool created with parallelism of 25.
customForkJoinPool = new ForkJoinPool(25);
I have a list of 700 file names and I used code like this to download the files from S3 in parallel and convert them into Java objects:
customForkJoinPool.submit(() -> {
    return fileNames
        .parallelStream()
        .map((fileName) -> {
            Logger log = Logger.getLogger("ForkJoinTest");
            long startTime = System.currentTimeMillis();
            log.info("Starting job at Thread:" + Thread.currentThread().getName());
            MyObject obj = readObjectFromS3(fileName);
            long endTime = System.currentTimeMillis();
            log.info("completed a job with Latency:" + (endTime - startTime));
            return obj;
        })
        .collect(Collectors.toList());
});
When I look at the logs, I see only 5 threads being used. With a parallelism of 25, I expected this to use 25 threads. The average latency to download and convert the file to an object is around 200ms. What am I missing?
Maybe a better question is: how does a parallel stream figure out how much to split the original list before creating threads for it? In this case, it looks like it decided to split it 5 ways and stop.

Why are you doing this with ForkJoinPool? It's meant for CPU-bound tasks with subtasks that are too fast to warrant individual scheduling. Your workload is IO-bound and with 200ms latency the individual scheduling overhead is negligible.
Use an Executor:
import static java.util.stream.Collectors.toList;
import static java.util.concurrent.CompletableFuture.supplyAsync;

ExecutorService threads = Executors.newFixedThreadPool(25);
List<MyObject> result = fileNames.stream()
        .map(fn -> supplyAsync(() -> readObjectFromS3(fn), threads))
        .collect(toList()).stream()
        .map(CompletableFuture::join)
        .collect(toList());

I think the answer is in this passage from the ForkJoinPool javadoc:
"The pool attempts to maintain enough active (or available) threads by dynamically adding, suspending, or resuming internal worker threads, even if some tasks are stalled waiting to join others. However, no such adjustments are guaranteed in the face of blocked I/O or other unmanaged synchronization."
In your case, the downloads will be performing blocking I/O operations.
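If the download really has to run inside a ForkJoinPool, the javadoc's escape hatch for blocking calls is ForkJoinPool.ManagedBlocker. Below is a minimal sketch, assuming the readObjectFromS3 and MyObject names from the question; wrapping the blocking call this way tells the pool that a worker is stalled, so it may add a compensating thread:

import java.util.concurrent.ForkJoinPool;

// Hedged sketch: readObjectFromS3 and MyObject are the question's own names.
static MyObject downloadWithManagedBlocker(String fileName) throws InterruptedException {
    class S3Blocker implements ForkJoinPool.ManagedBlocker {
        MyObject result;
        boolean done;

        @Override
        public boolean block() {
            result = readObjectFromS3(fileName); // the blocking I/O happens here
            done = true;
            return true;
        }

        @Override
        public boolean isReleasable() {
            return done;
        }
    }
    S3Blocker blocker = new S3Blocker();
    ForkJoinPool.managedBlock(blocker); // may spawn a spare worker while this one blocks
    return blocker.result;
}

The stream's map step would then call downloadWithManagedBlocker(fileName) instead of readObjectFromS3(fileName) directly (handling the InterruptedException). Even so, an ExecutorService sized for the I/O, as in the answer above, is usually the simpler choice.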

Related

Groovy ASTBuilder bad performance with multiple threads

I'm using Groovy's ASTBuilder (version 2.5.5) in a project. It's being used to parse and analyze Groovy expressions received via a REST API. This REST service receives thousands of requests, and the analysis is done on the fly.
I'm noticing some serious performance issues in a multithreaded environment. Below is a simulation, running 100 threads in parallel:
int numthreads = 100;
final Callable<Void> task = () -> {
    long initial = System.currentTimeMillis();
    // Simple rule
    new AstBuilder().buildFromString("a+b");
    System.out.print(String.format("\n\nThread took %s ms.",
            System.currentTimeMillis() - initial));
    return null;
};
final ExecutorService executorService = Executors.newFixedThreadPool(numthreads);
final List<Callable<Void>> tasks = new ArrayList<>();
while (numthreads-- > 0) {
    tasks.add(task);
}
for (Future<Void> future : executorService.invokeAll(tasks)) {
    future.get();
}
I'm trying different thread loads. The greater the number, the slower it gets.
100 threads => ~1800ms
200 threads => ~2500ms
300 threads => ~4000ms
However, if I serialize the threads (e.g. by setting the pool size to 1), I get much better results, around 10ms per thread. Can someone please help me understand why this is happening?
When running multithreaded code, the computer shares the physical CPU cores among the threads. That means the more the number of threads exceeds the number of cores, the less benefit you get from each additional thread. In your example the number of threads grows with the number of tasks, so as the task count goes up, every CPU core is forced to process more and more threads. At the same time you may notice that the difference between numthreads = 1 and numthreads = 4 is very small, because in that case every core processes only a few threads (or even just one). Don't set the number of threads much higher than the number of hardware threads your CPU provides; it doesn't make a lot of sense.
Additionally, in your example you are comparing how different numbers of threads perform with different numbers of tasks. But to see the efficiency of multithreaded code, you have to compare how different numbers of threads perform with the same number of tasks. I would change the example as follows:
int threadNumber = 16;
int taskNumber = 200;
// ...task method
final ExecutorService executorService = Executors.newFixedThreadPool(threadNumber);
final List<Callable<Void>> tasks = new ArrayList<>();
while (taskNumber-- > 0) {
    tasks.add(task);
}
long start = System.currentTimeMillis();
for (Future<Void> future : executorService.invokeAll(tasks)) {
    future.get();
}
long end = System.currentTimeMillis() - start;
System.out.println(end);
executorService.shutdown();
Try this code with threadNumber=1 and, let's say, threadNumber=16 and you'll see the difference.
Dynamic evaluation of expressions involves a lot of resources including class loading, security manager, compilation and execution. It is not designed for high performance. If you just need to evaluate an expression for its value, you could try groovy.util.Eval. It may not consume as many resources as AstBuilder. However, it is probably not going to be that much different, so don't expect too much.
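For example, a hedged sketch of the Eval route (the expression and the bound values here are made up for illustration):

import groovy.util.Eval;

// Eval.xy binds the variables x and y, compiles the expression and evaluates it.
// Convenient, but it still compiles a script under the hood, so it is not free.
Object result = Eval.xy(2, 3, "x + y"); // -> 5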
If you want to get the AST only and not any extra information like types, you could call the parser more directly. This would involve a lot fewer resources. See org.codehaus.groovy.control.ParserPluginFactory for more direct access to the source parser.

Java ForkJoinPool Threads Not Completing

I am attempting to parallelise a for-loop using Java streams & ForkJoinPool in order to control the number of threads used. When run with a single thread, the parallelised code returns the same result as the sequential program. The sequential code is a set of standard for-loops:
for (String file : fileList) {
    for (String item : xList) {
        for (String x : aList) {
            // action code
        }
    }
}
And the following is my parallel implementation:
ForkJoinPool threadPool = new ForkJoinPool(NUM_THREADS);
int chunkSize = aList.size() / NUM_THREADS;
for (String file : fileList) {
    for (String item : xList) {
        IntStream.range(0, NUM_THREADS)
            .parallel().forEach(i -> threadPool.submit(() -> {
                aList.subList(i * chunkSize, Math.min(i * chunkSize + chunkSize - 1, aList.size() - 1))
                    .forEach(x -> {
                        // action code
                    });
            }));
        threadPool.shutdown();
        threadPool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
When using more than 1 thread, only a limited number of iterations are completed. I have attempted to use .shutdown() and .awaitTermination() to ensure completion of all threads, however this doesn't seem to work. The number of iterations that complete differs dramatically from run to run (between 0 and 1500).
Note: I'm using a Macbook Pro with 8 available cores (4 dual-cores), and my action code does not contain references that make parallelisation unsafe.
Any advice would be much appreciated, thank you!
I think the actual problem you have is caused by your calling shutdown on the ForkJoinPool. If you look into the javadoc, this results in "an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted" - i.e. I'd expect only the first batch of submitted tasks to actually finish.
BTW, there's no real point in using a ForkJoinPool the way you use it. A ForkJoinPool is intended to split workload recursively, not unlike what you do by creating sublists in the loop, but it is supposed to be fed by RecursiveActions that split their work themselves, rather than having the work split up beforehand in a loop. That's just a side note, though; your code should run fine, but it would be clearer if you just submitted your tasks to a normal ExecutorService, e.g. one you get from Executors.newFixedThreadPool(parallelism), rather than to a new ForkJoinPool().
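For illustration, a minimal sketch of that suggestion, reusing the names from the question (fileList, xList, aList, NUM_THREADS); the pool is shut down once, after every task has been submitted and awaited:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);
List<Future<?>> futures = new ArrayList<>();
for (String file : fileList) {
    for (String item : xList) {
        for (String x : aList) {
            futures.add(pool.submit(() -> {
                // action code for (file, item, x)
            }));
        }
    }
}
for (Future<?> f : futures) {
    f.get(); // waits for completion and rethrows any task exception
}
pool.shutdown();
// (f.get() throws InterruptedException/ExecutionException, so this belongs in a
// method that declares or handles them.)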

Is there a way to force parallelStream() to go parallel?

If the input size is too small, the library automatically serializes the execution of the maps in the stream, but this automation doesn't and can't take into account how heavy the map operation is. Is there a way to force parallelStream() to actually parallelize CPU-heavy maps?
There seems to be a fundamental misunderstanding. The linked Q&A discusses that the stream apparently doesn’t work in parallel, due to the OP not seeing the expected speedup. The conclusion is that there is no benefit in parallel processing if the workload is too small, not that there was an automatic fallback to sequential execution.
It’s actually the opposite. If you request parallel, you get parallel, even if it actually reduces the performance. The implementation does not switch to the potentially more efficient sequential execution in such cases.
So if you are confident that the per-element workload is high enough to justify the use of a parallel execution regardless of the small number of elements, you can simply request a parallel execution.
As can easily be demonstrated:
Stream.of(1, 2).parallel()
    .peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
    .forEach(System.out::println);
On Ideone, it prints
processing 2 in Thread[main,5,main]
2
processing 1 in Thread[ForkJoinPool.commonPool-worker-1,5,main]
1
but the order of messages and details may vary. It may even be possible that in some environments both tasks happen to get executed by the same thread, if it can steal the second task before another thread has started up to pick it up. But of course, if the tasks are expensive enough, this won't happen. The important point is that the overall workload has been split and enqueued to be potentially picked up by other worker threads.
If execution by a single thread happens in your environment for the simple example above, you may insert simulated workload like this:
Stream.of(1, 2).parallel()
    .peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
    .map(x -> {
        LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(3));
        return x;
    })
    .forEach(System.out::println);
Then, you may also see that the overall execution time will be shorter than “number of elements” × “processing time per element” if the “processing time per element” is high enough.
Update: the misunderstanding might be caused by Brian Goetz's misleading statement: “In your case, your input set is simply too small to be decomposed”.
It must be emphasized that this is not a general property of the Stream API, but of the Map that has been used. A HashMap has a backing array, and the entries are distributed within that array depending on their hash code. It might be the case that splitting the array into n ranges doesn't lead to a balanced split of the contained elements, especially if there are only two. The implementors of the HashMap's Spliterator considered searching the array for elements to get a perfectly balanced split to be too expensive, not that splitting two elements was not worth it.
Since the HashMap’s default capacity is 16 and the example had only two elements, we can say that the map was oversized. Simply fixing that would also fix the example:
long start = System.nanoTime();
Map<String, Supplier<String>> input = new HashMap<>(2);
input.put("1", () -> {
    System.out.println(Thread.currentThread());
    LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
    return "a";
});
input.put("2", () -> {
    System.out.println(Thread.currentThread());
    LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
    return "b";
});
Map<String, String> results = input.keySet()
    .parallelStream().collect(Collectors.toConcurrentMap(
        key -> key,
        key -> input.get(key).get()));
System.out.println("Time: " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
on my machine, it prints
Thread[main,5,main]
Thread[ForkJoinPool.commonPool-worker-1,5,main]
Time: 2058
The conclusion is that the Stream implementation always tries to use parallel execution, if you request it, regardless of the input size. But it depends on the input’s structure how well the workload can be distributed to the worker threads. Things could be even worse, e.g. if you stream lines from a file.
If you think that the benefit of a balanced split is worth the cost of a copying step, you could also use new ArrayList<>(input.keySet()).parallelStream() instead of input.keySet().parallelStream(), as the distribution of elements within an ArrayList always allows a perfectly balanced split.
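For completeness, a hedged sketch of that variant, reusing the input map from the example above:

// Copy the keys once; ArrayList's spliterator splits its index range evenly,
// so the two suppliers reliably end up eligible for different worker threads.
Map<String, String> results = new ArrayList<>(input.keySet())
        .parallelStream()
        .collect(Collectors.toConcurrentMap(
                key -> key,
                key -> input.get(key).get()));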

Multithread - OutOfMemory

I am using a ThreadPoolExecutor with 5 active threads; the number of tasks is huge: 20,000.
The queue is filled up with instances of Runnable tasks (pool.execute(new WorkingThreadTask())) almost immediately.
Each WorkingThreadTask has a HashMap:
Map<Integer, HashMap<Integer, String>> themap ;
each map can have up to 2000 items, and each sub-map has 5 items. There is also a shared BlockingQueue.
While the process is running I am getting an OutOfMemoryError. I'm running a 32-bit JVM with -Xms1024m -Xmx1024m.
How can I handle this problem? I don't think I have leaks in the hashmap... when the thread is finished, the hashmap is cleaned up, right?
Update:
After running a profiler and checking the memory, the biggest hit is:
byte[] 2,516,024 hits, 918 MB
I don't know from where it's called or used.
Name Instance count Size (bytes)
byte[ ] 2519560 918117496
oracle.jdbc.ttc7.TTCItem 2515402 120739296
char[ ] 357882 15549280
java.lang.String 9677 232248
int[ ] 2128 110976
short[ ] 2097 150024
java.lang.Class 1537 635704
java.util.concurrent.locks.ReentrantLock$NonfairSync 1489 35736
java.util.Hashtable$Entry 1417 34008
java.util.concurrent.ConcurrentHashMap$HashEntry[ ] 1376 22312
java.util.concurrent.ConcurrentHashMap$Segment 1376 44032
java.lang.Object[ ] 1279 60216
java.util.TreeMap$Entry 828 26496
oracle.jdbc.dbaccess.DBItem[ ] 802 10419712
oracle.jdbc.ttc7.v8TTIoac 732 52704
I'm not sure about the inner map but I suspect the problem is that you are creating a large number of tasks that is filling memory. You should be using a bounded task queue and limit the job producer.
Take a look at my answer here: Process Large File for HTTP Calls in Java
To summarize it, you should create your own bounded queue and then use a RejectedExecutionHandler to block the producer until there is space in the queue. Something like:
final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(100);
ThreadPoolExecutor threadPool =
        new ThreadPoolExecutor(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS, queue);
// we need our RejectedExecutionHandler to block if the queue is full
threadPool.setRejectedExecutionHandler(new RejectedExecutionHandler() {
    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        try {
            // this will block the producer until there's room in the queue
            executor.getQueue().put(task);
        } catch (InterruptedException e) {
            throw new RejectedExecutionException(
                    "Unexpected InterruptedException", e);
        }
    }
});
Edit:
I don't think I have leaks in hashmap... when thread is finished hashmap is cleaned right?
You might consider aggressively calling clear() on the work HashMap and other collections when the task completes. Although they should be reaped by the GC eventually, giving the GC some help may solve your problem if you have limited memory.
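A minimal sketch of that idea, assuming WorkingThreadTask is a Runnable and themap is its per-task field:

// Drop the references as soon as the work is done, even if it throws.
@Override
public void run() {
    try {
        // ... the existing work that fills and uses themap ...
    } finally {
        for (HashMap<Integer, String> sub : themap.values()) {
            sub.clear();
        }
        themap.clear();
    }
}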
If this doesn't work, a profiler is the way to go to help you identify where the memory is being held.
Edit:
After looking at the profiler output, the byte[] is interesting. Typically this indicates some sort of serialization or other IO. You may also be storing blobs in a database. The oracle.jdbc.ttc7.TTCItem is very interesting however. That indicates to me that you are not closing a database connection somewhere. Make sure to use proper try/finally blocks to close your connections.
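As a hedged sketch of the connection-handling point, here is the try-with-resources equivalent of such try/finally blocks (dataSource, id, the query and the column name are placeholders, not from the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// try-with-resources closes the ResultSet, PreparedStatement and Connection in
// reverse order, even if an exception is thrown while reading.
try (Connection con = dataSource.getConnection();
     PreparedStatement ps = con.prepareStatement("SELECT payload FROM items WHERE id = ?")) {
    ps.setInt(1, id);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            byte[] payload = rs.getBytes("payload");
            // process payload
        }
    }
}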
HashMap carries quite a lot of overhead in terms of memory usage: about 36 bytes minimum per entry, plus the size of the key/value itself, each of which will be at least 32 bytes (I think that's about the typical value for a 32-bit Sun JVM). Doing some quick math:
20,000 tasks, each with map with 2000 entry hashmap. The value in the map is another map with 5 entries.
-> a 5-entry map is 1 Map + 5 Map.Entry objects + 5 keys + 5 values = 16 objects at 32 bytes => 512 bytes per sub-map.
-> a 2000-entry map is 1 Map + 2000 Map.Entry objects + 2000 keys + 2000 sub-maps (each 512 bytes) => 2000*(512+32+32) + 32 => 1.1MB
-> 20,000 tasks, each of 1.1MB -> 23GB
So, your overall footprint is 23GB.
The logical solution is to restrict the depth of the blocking queue feeding the ExecutorService, and only create enough child tasks to keep it busy. Set a limit of about 64 entries in the queue, and then you will never have more than 64 + 5 tasks instantiated at one time. When space becomes available in the executor's queue, you can create and add another task.
You can improve the efficiency by not adding so many tasks ahead of what is being processed. Try checking the queue and only adding to it if there are fewer than 1000 entries.
You can also make the data structures more efficient. A Map with an Integer key can often be reduced to an array of some kind.
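A small sketch of that idea, assuming the five inner keys can be mapped to array indices 0..4:

// Replacing the HashMap<Integer, String> sub-maps with String[] arrays removes
// the per-entry node objects and the Integer boxing.
Map<Integer, String[]> themap = new HashMap<>();
themap.put(42, new String[] { "a", "b", "c", "d", "e" });
String third = themap.get(42)[2]; // instead of themap.get(42).get(2)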
Lastly, 1 GB isn't that much these days. My mobile phone has 2 GB. If you are going to process large amount of data, I suggest getting a machine with 32-64 GB of memory and a 64-bit JVM.
From the large byte[]s, I'd suspect IO related issues (unless you are handling video/audio or something).
Things to look at:
DB: Are you trying to read a large amount of stuff at once? You can e.g. use a cursor to avoid that.
File/Network: Are you trying to read large amounts of stuff from a file or the network at once? You should "propagate the load" to whatever is reading and regulate the read rate.
UPDATE: OK, so you are using a cursor to read from DB. Now you need to make sure that the reading from the cursor only progresses as you finish stuff (aka "propagate the load"). To do this, use a thread pool like this:
BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>(queueSize);
ThreadPoolExecutor tpe = new ThreadPoolExecutor(
        threadNum,
        threadNum,
        1000,
        TimeUnit.HOURS,
        queue,
        new ThreadPoolExecutor.CallerRunsPolicy());
Now when you post to this service from your code which reads from the DB, it will block when the queue is full (the calling thread is used to run tasks and hence blocks).
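For illustration, a hedged sketch of feeding that pool from the cursor (readRow and process are hypothetical placeholders for the question's per-row work):

// Because of CallerRunsPolicy, execute() runs the task on the reading thread
// whenever the queue is full, so the cursor is only drained as fast as the
// workers (plus the reader) can keep up.
while (resultSet.next()) {
    final Object row = readRow(resultSet); // hypothetical: copy the current row out of the cursor
    tpe.execute(() -> process(row));       // hypothetical per-row processing
}
tpe.shutdown();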

Extra bytes appearing when building file data using multiple threads

I am working on a large-scale dataset and, after building a model, I use multithreading (the whole project is in Java) as follows:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i = 0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do whatever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); // 24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();
When I receive the result from each callable, I output it to a file. Is the output in the exact same order as the list of initial Callables was made, in spite of some callables completing before others? It seems like it should be, but I'm not sure.
Also, I expect a total of 6.2 million bytes to be written to the outfile, but I get an additional 2000 bytes (yeah, for free). That messes up my submission, and I think it is because of some concurrency issue. I tested this on a small dataset and it seems to work fine there (264 bytes expected and received).
Is there anything wrong I am doing with the Executor framework or Futures?
Q: Is the order the same as the one specified for the tasks? Yes.
From the API:
"Returns: A list of Futures representing the tasks, in the same sequential order as produced by the iterator for the given task list. If the operation did not time out, each task will have completed. If it did time out, some of these tasks will not have completed."
As for the "extra" bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).
The order in which the callables are executed doesn't matter from the code you have here. You write the results in the order you store the futures in the list. Even if they were executed in reverse order, the file should appear the same, as your file writing is single-threaded.
I suspect your callables are interacting with each other and you get different results depending on the number of core you use. e.g. You might be using SimpleDateFormat.
I suggest you run this twice in the same program with a dataset which completes in a short time. Run it first with only one thread in the thread pool and a second time with 24 threads. You should be able to compare the results from both runs with Arrays.equals(byte[], byte[]) and see that you get exactly the same results.
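A minimal sketch of that comparison, reusing the callables collection from the question (runWith is a hypothetical helper):

// Run the same callables once with 1 thread and once with 24, then compare the bytes.
static byte[] runWith(int threads, Collection<Track1Callable> callables) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    for (Future<byte[]> result : executor.invokeAll(callables)) {
        buffer.write(result.get());
    }
    executor.shutdown();
    return buffer.toByteArray();
}

// byte[] sequential = runWith(1, callables);
// byte[] parallel   = runWith(24, callables);
// System.out.println("identical: " + Arrays.equals(sequential, parallel));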
