I recently came across some odd behavior on a machine where mapping a function that returns Future[T] was executing sequentially. The same problem does not occur on other machines: work is interleaved as one would expect. I later discovered that this was likely because Scala was being a little too smart and choosing an ExecutionContext that matched the machine's resources: one core, one worker.
Here's some simple code that reproduces the problem:
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val ss = List("the", "quick", "brown", "fox",
              "jumped", "over", "the", "lazy", "dog")

def r(s: String): Future[String] = Future {
  for (i <- 1 to 10) yield {
    println(String.format("r(%s) waiting %s more seconds...",
                          s, (10 - i).toString))
    Thread.sleep(1000)
  }
  s.reverse
}

val f_revs = ss.map { r(_) }
println("Look ma, no blocking!")
val rev = f_revs.par.map { Await.result(_, Duration.Inf) }.mkString(" ")
println(rev)
Running this on the machine exhibiting the strange behavior produces sequential output like this:
Look ma, no blocking!
r(the) waiting 9 more seconds...
r(the) waiting 8 more seconds...
r(the) waiting 7 more seconds...
Providing a custom ExecutionContext:
import java.util.concurrent.Executors

val pool = Executors.newFixedThreadPool(5)
implicit val ec = ExecutionContext.fromExecutor(pool)
allows the threads to interleave on this machine. But now I have a new problem: the thread pool does not shut down, causing the program to hang indefinitely. Apparently this is the expected behavior for fixed thread pools, and I can shut it down by calling pool.shutdown() somewhere.
Call me stubborn, but I don't want to have to tell the thread pool to shut down. Is there a way to configure the pool to shut itself down when all of its queues are empty (possibly after some delay), just as the default pool does? I've looked through the ExecutionContext documentation, but I'm not finding what I'm looking for.
Scala uses its own fork-join implementation, which behaves differently from the Java ones; hence the different behavior between the default ExecutionContext and the one you created with Executors.
An easier way to do what you want would be to set the following System properties to configure the default ExecutionContext:
scala.concurrent.context.minThreads to impose a minimum number of threads. Defaults to 1.
scala.concurrent.context.numThreads to set the number of threads. Defaults to x1.
scala.concurrent.context.maxThreads to impose a maximum number of threads. Defaults to x1.
Each of these can be either a number or a number preceded by x, indicating a multiple of the number of available processors. To increase the number of threads, you have to change both numThreads and maxThreads. In your case, setting both to x2 should work.
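For example, a minimal sketch of setting these programmatically (one assumption worth flagging: the properties must be set before anything touches the default context, because they are read only once, when the context is initialized; this is equivalent to passing -Dscala.concurrent.context.numThreads=x2 -Dscala.concurrent.context.maxThreads=x2 on the command line):

// Must run before the first Future is created: the default
// ExecutionContext reads these properties once, at initialization.
System.setProperty("scala.concurrent.context.numThreads", "x2")
System.setProperty("scala.concurrent.context.maxThreads", "x2")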
It looks like Java 7 has some additional ExecutorServices, in particular a ForkJoinPool that does what I want (i.e., there is no need to call shutdown() on the pool, since its worker threads are daemon threads and do not keep the JVM alive).
Changing the pool to the following is sufficient to achieve what I want:
val pool = new java.util.concurrent.ForkJoinPool(5)
Java 8 apparently has even more services.
Related
I'm using Groovy's AstBuilder (version 2.5.5) in a project. It's being used to parse and analyze Groovy expressions received via a REST API. This REST service receives thousands of requests, and the analysis is done on the fly.
I'm noticing some serious performance issues in a multithreaded environment. Below is a simulation, running 100 threads in parallel:
int numthreads = 100;
final Callable<Void> task = () -> {
    long initial = System.currentTimeMillis();
    // Simple rule
    new AstBuilder().buildFromString("a+b");
    System.out.print(String.format("\n\nThread took %s ms.",
            System.currentTimeMillis() - initial));
    return null;
};
final ExecutorService executorService = Executors.newFixedThreadPool(numthreads);
final List<Callable<Void>> tasks = new ArrayList<>();
while (numthreads-- > 0) {
    tasks.add(task);
}
for (Future<Void> future : executorService.invokeAll(tasks)) {
    future.get();
}
I'm testing with different thread loads; the greater the number, the slower.
100 threads => ~1800ms
200 threads => ~2500ms
300 threads => ~4000ms
However, if I serialize the threads (for example, by setting the pool size to 1), I get much better results, around 10 ms per thread. Can someone please help me understand why this is happening?
When running multithreaded code, the machine shares its physical CPU cores among the threads. That means the more the number of threads exceeds the number of cores, the less benefit you get from each additional thread. In your example, the number of threads grows with the number of tasks, so as the task count rises, each CPU core is forced to juggle more and more threads. At the same time, you may notice that the difference between numthreads = 1 and numthreads = 4 is very small, because in that case each core processes only a few threads (or even just one). There is little point in setting the number of threads much higher than the number of hardware threads your CPU provides.
Additionally, your example compares how different numbers of threads perform with different numbers of tasks. To see the efficiency of multithreaded code, you have to compare how different numbers of threads perform with the same number of tasks. I would change the example as follows:
int threadNumber = 16;
int taskNumber = 200;
// ...task defined as above
final ExecutorService executorService = Executors.newFixedThreadPool(threadNumber);
final List<Callable<Void>> tasks = new ArrayList<>();
while (taskNumber-- > 0) {
    tasks.add(task);
}
long start = System.currentTimeMillis();
for (Future<Void> future : executorService.invokeAll(tasks)) {
    future.get();
}
long end = System.currentTimeMillis() - start;
System.out.println(end);
executorService.shutdown();
Try this code with threadNumber = 1 and, let's say, threadNumber = 16, and you'll see the difference.
Dynamic evaluation of expressions involves a lot of resources, including class loading, the security manager, compilation, and execution. It is not designed for high performance. If you just need to evaluate an expression for its value, you could try groovy.util.Eval. It may not consume as many resources as AstBuilder, but it is probably not going to be that much different, so don't expect too much.
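For instance, a minimal sketch of the Eval route (assuming groovy is on the classpath; note that Eval binds fixed variable names, so the expression refers to x and y):

import groovy.util.Eval;

// Eval compiles and runs the expression in one call; xy() binds its
// first two arguments to the variables 'x' and 'y' in the expression.
Object result = Eval.xy(1, 2, "x + y");
System.out.println(result); // prints 3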
If you want to get the AST only and not any extra information like types, you could call the parser more directly. This would involve a lot fewer resources. See org.codehaus.groovy.control.ParserPluginFactory for more direct access to the source parser.
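As a rough illustration rather than a definitive recipe (sketched against the Groovy 2.5 API), one way to obtain just the AST is to run a CompilationUnit only through the conversion phase, skipping the later class-generation phases:

import org.codehaus.groovy.control.CompilationUnit;
import org.codehaus.groovy.control.Phases;

// Sketch: parse the source into an AST without running later compiler
// phases. After compile(Phases.CONVERSION), the parsed ModuleNodes are
// available via unit.getAST().getModules().
CompilationUnit unit = new CompilationUnit();
unit.addSource("Expr.groovy", "a + b");
unit.compile(Phases.CONVERSION);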
I am attempting to parallelise a for-loop using Java streams & ForkJoinPool in order to control the number of threads used. When run with a single thread, the parallelised code returns the same result as the sequential program. The sequential code is a set of standard for-loops:
for (String file : fileList) {
    for (String item : xList) {
        for (String x : aList) {
            // action code
        }
    }
}
And the following is my parallel implementation:
ForkJoinPool threadPool = new ForkJoinPool(NUM_THREADS);
int chunkSize = aList.size() / NUM_THREADS;

for (String file : fileList) {
    for (String item : xList) {
        IntStream.range(0, NUM_THREADS)
                 .parallel().forEach(i -> threadPool.submit(() -> {
                     aList.subList(i * chunkSize,
                                   Math.min(i * chunkSize + chunkSize - 1, aList.size() - 1))
                          .forEach(x -> {
                              // action code
                          });
                 }));
        threadPool.shutdown();
        threadPool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
When using more than one thread, only a limited number of iterations are completed. I have attempted to use .shutdown() and .awaitTermination() to ensure completion of all threads, but this doesn't seem to work. The number of iterations that complete differs dramatically from run to run (between 0 and 1500).
Note: I'm using a MacBook Pro with 8 logical cores (4 physical cores with two hardware threads each), and my action code does not contain references that make parallelisation unsafe.
Any advice would be much appreciated, thank you!
I think the actual problem you have is caused by calling shutdown on the ForkJoinPool inside the loop. If you look into the javadoc, this results in "an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted" - i.e., I'd expect only the tasks submitted in the first pass of the loop to actually finish.
BTW, there's no real point in using a ForkJoinPool the way you use it. A ForkJoinPool is intended to split workload recursively, not unlike what you do by creating sublists in the loop - but a ForkJoinPool is supposed to be fed RecursiveActions that split their work themselves, rather than having the work split up beforehand in a loop. That's just a side note, though; your code should run fine, but it would be clearer if you just submitted your tasks to a normal ExecutorService, e.g. one you get from Executors.newFixedThreadPool(parallelism), rather than to a new ForkJoinPool().
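To illustrate, here is a hedged sketch of that suggestion (exception handling elided; NUM_THREADS, fileList, xList, aList and the action code are the question's placeholders): the pool is created once, shutdown() is called only after all batches have run, and the sublist bounds are clamped with an exclusive upper index:

int chunkSize = (aList.size() + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);

for (String file : fileList) {
    for (String item : xList) {
        List<Callable<Void>> chunks = new ArrayList<>();
        for (int i = 0; i < NUM_THREADS; i++) {
            int from = Math.min(i * chunkSize, aList.size());
            int to = Math.min(from + chunkSize, aList.size()); // subList's toIndex is exclusive
            chunks.add(() -> {
                for (String x : aList.subList(from, to)) {
                    // action code
                }
                return null;
            });
        }
        pool.invokeAll(chunks); // blocks until every chunk is processed
    }
}
pool.shutdown(); // only once, after all work is done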
I have collection of elements that I want to process in parallel. When I use a List, parallelism works. However, when I use a Set, it does not run in parallel.
I wrote a code sample that shows the problem:
public static void main(String[] args) {
    ParallelTest test = new ParallelTest();
    List<Integer> list = Arrays.asList(1, 2);
    Set<Integer> set = new HashSet<>(list);
    ForkJoinPool forkJoinPool = new ForkJoinPool(4);

    System.out.println("set print");
    try {
        forkJoinPool.submit(() ->
            set.parallelStream().forEach(test::print)
        ).get();
    } catch (Exception e) {
        return;
    }

    System.out.println("\n\nlist print");
    try {
        forkJoinPool.submit(() ->
            list.parallelStream().forEach(test::print)
        ).get();
    } catch (Exception e) {
        return;
    }
}

private void print(int i) {
    System.out.println("start: " + i);
    try {
        TimeUnit.SECONDS.sleep(1);
    } catch (InterruptedException e) {
    }
    System.out.println("end: " + i);
}
This is the output that I get on Windows 7:
set print
start: 1
end: 1
start: 2
end: 2
list print
start: 2
start: 1
end: 1
end: 2
We can see that the first element from the Set had to finish before the second element was processed, whereas for the List, the second element starts before the first element finishes.
Can you tell me what causes this issue, and how to avoid it using a Set collection?
I can reproduce the behavior you see, where the parallelism doesn't match the parallelism you've specified for the fork-join pool. After setting the fork-join pool parallelism to 10 and increasing the number of elements in the collection to 50, I see the parallelism of the list-based stream rise only to 6, whereas the parallelism of the set-based stream never gets above 2.
Note, however, that this technique of submitting a task to a fork-join pool to run the parallel stream in that pool is an implementation "trick" and is not guaranteed to work. Indeed, the threads or thread pool that is used for execution of parallel streams is unspecified. By default, the common fork-join pool is used, but in different environments, different thread pools might end up being used. (Consider a container within an application server.)
In the java.util.stream.AbstractTask class, the LEAF_TARGET field determines the amount of splitting that is done, which in turn determines the amount of parallelism that can be achieved. The value of this field is based on ForkJoinPool.getCommonPoolParallelism() which of course uses the parallelism of the common pool, not whatever pool happens to be running the tasks.
Arguably this is a bug (see OpenJDK issue JDK-8190974), though this entire area is unspecified anyway. In any case, this area of the system definitely needs development, for example in terms of splitting policy, the amount of parallelism available, and dealing with blocking tasks, among other issues. A future release of the JDK may address some of them.
Meanwhile, it is possible to control the parallelism of the common fork-join pool through the use of system properties. If you add this line to your program,
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "10");
and you run the streams in the common pool (or if you submit them to your own pool that has a sufficiently high level of parallelism set) you will observe that many more tasks are run in parallel.
You can also set this property on the command line using the -D option, e.g. -Djava.util.concurrent.ForkJoinPool.common.parallelism=10.
Again, this is not guaranteed behavior, and it may change in the future. But this technique will probably work for JDK 8 implementations for the foreseeable future.
UPDATE 2019-06-12: The bug JDK-8190974 was fixed in JDK 10, and the fix has been backported to an upcoming JDK 8u release (8u222).
So I think I sort of understand how fixed thread pools work (using Executors.newFixedThreadPool, built into Java), but from what I can see, there's usually a set number of jobs you want done, and you know how many when you start the program. For example:
int numWorkers = Integer.parseInt(args[0]);
int threadPoolSize = Integer.parseInt(args[1]);
ExecutorService tpes = Executors.newFixedThreadPool(threadPoolSize);
WorkerThread[] workers = new WorkerThread[numWorkers];
for (int i = 0; i < numWorkers; i++) {
    workers[i] = new WorkerThread(i);
    tpes.execute(workers[i]);
}
Where each WorkerThread does something really simple; that part is arbitrary. What I want to know is: what if you have a fixed pool size (say 8 max) but you don't know how many workers you'll need to finish the task until runtime?
The specific example is: If I have a pool size of 8 and I'm reading from standard input. As I read, I split the input into blocks of a set size. Each one of these blocks is given to a thread (along with some other information) so that they can compress it. As such, I don't know how many threads I'll need to create as I need to keep going until I reach the end of the input. I also have to somehow ensure that the data stays in the same order. If thread 2 finishes before thread 1 and just submits its work, my data will be out of order!
Would a thread pool be the wrong approach in this situation then? It seems like it'd be great (since I can't use more than 8 threads at a time).
Basically, I want to do something like this:
ExecutorService tpes = Executors.newFixedThreadPool(threadPoolSize);
BufferedInputStream inBytes = new BufferedInputStream(System.in);
byte[] buff = new byte[BLOCK_SIZE];
byte[] dict = new byte[DICT_SIZE];
WorkerThread worker;
int bytesRead = 0;
while ((bytesRead = inBytes.read(buff)) != -1) {
    System.arraycopy(buff, BLOCK_SIZE - DICT_SIZE, dict, 0, DICT_SIZE);
    worker = new WorkerThread(buff, dict);
    tpes.execute(worker);
}
This is not working code, I know, but I'm just trying to illustrate what I want.
I left out a bit, but note how buff and dict have changing values and that I don't know how long the input is. I don't think I can actually do this, though, because worker already exists after the first call! I can't just say worker = new WorkerThread a bunch of times, since isn't it already pointing at an existing thread (true, a thread that might be dead)? And obviously, in this implementation, if it did work, I wouldn't be running in parallel. But my point is: I want to keep creating threads until I hit the max pool size, wait until a thread is done, then keep creating threads until I reach the end of the input.
I also need to keep stuff in order, which is the part that's really annoying.
Your solution is completely fine (the only point is that parallelism is perhaps not necessary if the workload of your WorkerThreads is very small).
With a thread pool, the number of submitted tasks is not relevant. There may be fewer or more tasks than threads in the pool; the thread pool takes care of that.
However, and this is important: you rely on some kind of order of the results of your WorkerThreads, but when using parallelism, this order is not guaranteed! It doesn't matter whether you use a thread pool or how many worker threads you have; it will always be possible for your results to finish in an arbitrary order!
To keep the order right, give each WorkerThread the number of the current item in its constructor, and let them put their results in the right order after they are finished:
int noOfWorkItem = 0;
while ((bytesRead = inBytes.read(buff)) != -1) {
    System.arraycopy(buff, BLOCK_SIZE - DICT_SIZE, dict, 0, DICT_SIZE);
    // give each worker its own copy of the buffers, since buff and
    // dict are overwritten on the next iteration of the loop
    worker = new WorkerThread(buff.clone(), dict.clone(), noOfWorkItem++);
    tpes.execute(worker);
}
As #ignis points out, parallel execution may not be the best answer for your situation.
However, to answer the more general question, there are several other Executor implementations to consider beyond FixedThreadPool, some of which may have the characteristics that you desire.
As far as keeping things in order, typically you would submit tasks to the executor, and for each submission, you get a Future (which is an object that promises to give you a result later, when the task finishes). So, you can keep track of the Futures in the order that you submitted tasks, and then when all tasks are done, invoke get() on each Future in order, to get the results.
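A minimal sketch of that pattern under the question's setup (inBytes, buff, and BLOCK_SIZE as above; out is an assumed OutputStream for the compressed result, compress() is a hypothetical stand-in for the worker's compression logic, and exception handling is elided):

ExecutorService pool = Executors.newFixedThreadPool(8);
List<Future<byte[]>> results = new ArrayList<>();

byte[] buff = new byte[BLOCK_SIZE];
int bytesRead;
while ((bytesRead = inBytes.read(buff)) != -1) {
    byte[] block = Arrays.copyOf(buff, bytesRead); // private copy per task
    results.add(pool.submit(() -> compress(block))); // compress() is hypothetical
}
pool.shutdown();

// get() blocks until that particular task is done, so iterating the
// Futures in submission order writes the output in input order,
// regardless of which task finished first.
for (Future<byte[]> f : results) {
    out.write(f.get());
}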
I am working on a large scale dataset and after building a model, I use multithreading (whole project in Java) as follows:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i = 0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do whatever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); // 24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();
When I receive the result from each Callable, I output it to a file. Will the output be in the exact order in which the list of initial Callables was made, even if some complete before others? It seems it should be, but I'm not sure.
Also, I expect a total of 6.2 million bytes to be written to the outfile, but I get an additional 2000 bytes (yay, for free). That messes up my submission, and I think it is because of some concurrency issue. I tested this on a small dataset and it seems to work fine there (264 bytes expected and received).
Am I doing anything wrong with the Executor framework or Futures?
Q: Is the order the same as the one in which the tasks were specified? Yes.
From the API:
Returns: A list of Futures representing the tasks, in the same sequential order as produced by the iterator for the given task list. If the operation did not time out, each task will have completed. If it did time out, some of these tasks will not have completed.
As for the "extra" bytes: have you tried doing all of this sequentially (i.e., without using an executor) and checking whether you obtain different results? It seems that your problem lies outside the code provided (and probably is not due to concurrency).
The order in which the callables are executed doesn't matter for the code you have here. You write the results in the order you stored the Futures in the list; even if the tasks were executed in reverse order, the file would appear the same, as your file writing is single-threaded.
I suspect your callables are interacting with each other, and you get different results depending on the number of cores you use - e.g., you might be sharing a SimpleDateFormat instance, which is not thread-safe.
I suggest you run this twice in the same program with a dataset that completes in a short time: first with only one thread in the thread pool, and a second time with 24 threads. You should be able to compare the results from both runs with Arrays.equals(byte[], byte[]) and see whether you get exactly the same results.
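A rough sketch of that check (runOnce is a hypothetical helper that runs the whole pipeline with the given pool size and returns the bytes it would have written, rather than writing them to a file):

// Hypothetical helper: run the pipeline with 'threads' workers and
// return the output as a byte array.
byte[] singleThreaded = runOnce(1);
byte[] multiThreaded = runOnce(24);

// If the callables are truly independent, both runs must be identical.
System.out.println("identical: " + Arrays.equals(singleThreaded, multiThreaded));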