I am splitting up a computation among eight threads and writing the results to a file, as follows:
1a. Each of seven threads processes its input and writes its output to its own ByteArrayOutputStream; when the stream closes, the thread offers an <Integer, ByteArrayOutputStream> pair to a ConcurrentLinkedQueue and calls countDown() on a CountDownLatch (initialized to 7).
1b. Concurrently, an eighth thread reads in all of the input data that will be processed on the next iteration. This thread awaits on the CountDownLatch when it finishes reading in its data.
2a. When the CountDownLatch reaches 0, the eighth thread wakes up, sorts the ConcurrentLinkedQueue using the Integer in the <Integer, ByteArrayOutputStream> pair as the sort key, then iterates through the queue and appends the byte arrays to a file. (There might be a more efficient way to traverse the list in order without sorting it, but the list only has seven elements, so the runtime of the sort is a non-issue.)
2b. Concurrently, the other seven threads process the input that has been prepared for them by the eighth thread.
This process loops until all data are processed (typically 40-80 iterations).
Each thread processes an equal-sized input chunk of 8 MB (except possibly on the last iteration); each ByteArrayOutputStream contains 1-4 MB, and the output size can't be known ahead of time. Typically the runtimes of the earliest-completing and latest-completing CPU-bound threads are within 20% of each other.
I am wondering if there is an IO library (or a method in java.io or java.nio that I've missed) that already does something like this. At present the eighth thread (the IO thread) is idle about 75% of the time, but every way I've come up with to alleviate this inefficiency strikes me as too complicated (and hence too risky in terms of creating deadlocks or data races). For example, I could divide the input into 4 MB chunks, give two chunks to each of the seven CPU-bound threads and one chunk to the IO-bound thread, which would in theory reduce the IO thread's idle time to 25% (25% on IO, 50% on a 4 MB chunk, 25% idle). But this is a brittle solution that might not port to another CPU: on another machine the IO-bound thread might instead turn into a bottleneck if, say, its runtime were 150% that of the CPU-bound threads. I'd really like a self-balancing solution, so that I don't need to fine-tune the load balancing by hand.
The inefficiency consists in waiting for all 7 outputs to be complete before thread 8 processes any of them. It would be better to run 7 queues instead of one, i.e. one per source thread, and read them in the order necessary. That way, as soon as the first queue has any data it is processed immediately, rather than having to wait for the other 6; similarly for queues 2..6. When thread 8 finishes the last queue it can then start producing, or indeed it could be doing that instead of waiting for any specific queue to start producing.
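For illustration, a minimal sketch of that writer loop (all names hypothetical; it assumes each worker offers exactly one ByteArrayOutputStream per iteration to its own BlockingQueue):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Writer-side loop, one BlockingQueue per worker thread. take() blocks only
// until the next worker *in order* has produced its chunk, so chunk 1 is
// written while chunks 2..7 are still being computed.
void writeIteration(List<BlockingQueue<ByteArrayOutputStream>> queues,
                    OutputStream out) throws InterruptedException, IOException {
    for (BlockingQueue<ByteArrayOutputStream> q : queues) {
        q.take().writeTo(out); // appends in thread order, no sorting needed
    }
}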
I've modified the algorithm as follows:
Rather than assigning the input chunks directly to the CPU threads, I instead put the chunks in a BlockingQueue from which the CPU threads poll/take their work chunks
Output is sent to a ConcurrentSkipListMap<Integer, ByteArrayOutputStream>
The CPU threads simply loop until canceled. The IO thread peeks at the ConcurrentSkipListMap (using firstKey) to see if there is any data to be written; I maintain a counter of what the next key should be, to ensure that the output streams are written in order. It then checks the length of the BlockingQueue to see if any data needs to be added to it (if queue.size() < N, I add N more chunks, where N initially equals 12). If it did either or both IO tasks it loops; otherwise it processes a chunk from the BlockingQueue itself and then loops.
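For concreteness, a sketch of that IO-thread loop (helper names like refill and process are hypothetical; it uses firstEntry/pollFirstEntry rather than firstKey for brevity, and nextKey is the int counter mentioned above):

// Sketch only: output is the ConcurrentSkipListMap, inputQueue is the
// BlockingQueue of work chunks, n is the refill threshold.
while (!done) {
    boolean didIo = false;
    Map.Entry<Integer, ByteArrayOutputStream> e = output.firstEntry();
    if (e != null && e.getKey() == nextKey) {   // next in-order chunk ready?
        output.pollFirstEntry().getValue().writeTo(file);
        nextKey++;
        didIo = true;
    }
    if (inputQueue.size() < n) {                // keep the work queue topped up
        refill(inputQueue, n);                  // hypothetical: enqueue n chunks
        didIo = true;
    }
    if (!didIo) {                               // nothing to read or write:
        Input chunk = inputQueue.poll();        // help with the CPU work instead
        if (chunk != null) process(chunk);
    }
}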
The BlockingQueue should not be empty unless the entire input has already been processed by the IO thread - an empty queue indicates that the queue.size() < N threshold needs to be raised. For this reason the CPU threads' logic is
while (!cancel) {
    try {
        Input input = queue.poll();          // fast path: queue is normally non-empty
        if (input == null) {
            log.warn("Empty queue");         // signals the N threshold may be too low
            input = queue.take();            // block until the IO thread refills
        }
        process(input);
    } catch (InterruptedException ex) {
        cancel = true;
        Thread.currentThread().interrupt();  // preserve the interrupt status
    }
}
I'm slightly confused by the internal scheduling mechanism of the ExecutorService and the ForkJoinPool.
I understand the ExecutorService scheduling is done this way.
A bunch of tasks are queued. Once a thread is available it will handle the first available task and so forth.
Meanwhile, a ForkJoinPool is presented as distinct because it uses a work-stealing algorithm. If I understand correctly it means a thread can steal some tasks from another thread.
Yet, I don't really understand the difference between the mechanism implemented in ExecutorService and in ForkJoinPool. From my understanding, both mechanisms should reduce the idle time of each thread as much as possible.
I would understand if in the case of an ExecutorService, each thread would have its own queue. Yet, it is not the case as the queue is shared by the different threads of the pool...
Any clarification would be more than welcome!
Suppose you have a very big array of ints and you want to add all of them. With an ExecutorService you might say: let's divide that array into chunks, say four tasks per thread. So if you have an array of 160 elements (and 4 CPUs), you get 160 / (4 * 4) = 10, i.e. 16 chunks each holding 10 ints. Create runnables/callables, submit those to an executor service (and of course think of a way to merge those results once they are done).
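As a hedged sketch of that set-up (names and sizes illustrative, not a definitive implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Sum a 160-element array in 16 chunks of 10 on a fixed pool of 4 threads,
// then merge the partial sums.
static long parallelSum(int[] array) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<Long>> parts = new ArrayList<>();
    int chunk = 10;
    for (int i = 0; i < array.length; i += chunk) {
        final int from = i;
        final int to = Math.min(i + chunk, array.length);
        parts.add(pool.submit(() -> {
            long sum = 0;
            for (int j = from; j < to; j++) sum += array[j];
            return sum;
        }));
    }
    long total = 0;
    for (Future<Long> f : parts) total += f.get(); // merge the results
    pool.shutdown();
    return total;
}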
Now your hope is that each CPU will take 4 of those tasks and work on them. Now let's also suppose that some of the numbers are very complicated to add (of course they aren't, but bear with me): it could turn out that 3 threads/CPUs are done with their work while one of them is still busy with its first chunk. No one wants that, of course, but it could happen. The bad thing now is that you can't do anything about it.
What ForkJoinPool does instead is say: provide me with how you want to split your task and the implementation for the minimal workload, and I'll take care of the rest. In the Stream API this is done with Spliterators, mainly via two methods: trySplit (which either returns null, meaning nothing more can be split, or a new Spliterator - meaning a new chunk) and forEachRemaining, which processes elements once you can't split your task anymore. And this is where work stealing will help you.
You say how your chunks are computed (usually split in half) and what to do when you can't split anymore. ForkJoinPool will dispatch the initial chunks to its threads, and when some of them are free - done with their own work - they query other threads' queues to see if they have work. If they notice that there are chunks in some other thread's queue, they take them, split them on their own and work on those. It can even turn out that they don't do the entire work on those chunks themselves - some other thread can now query this thread's queue, notice that there is still work to do, and so on... This is far better, because now, when those 3 threads are free, they can pick up some other work to do - and all of them stay busy.
This example is a bit simplified, but not very far from reality. You just need to have a lot more chunks than CPUs/threads for work stealing to work; thus trySplit usually needs a smart implementation, and you need lots of elements in the source of your stream.
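For contrast, a minimal RecursiveTask version of the same sum, splitting in half down to a threshold; the work stealing described above absorbs any imbalance between the halves:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10; // the "minimal workload"
    private final int[] a;
    private final int from, to;

    SumTask(int[] a, int from, int to) {
        this.a = a; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {   // small enough: just add the elements
            long sum = 0;
            for (int i = from; i < to; i++) sum += a[i];
            return sum;
        }
        int mid = (from + to) >>> 1;    // otherwise split in half
        SumTask left = new SumTask(a, from, mid);
        left.fork();                    // queued; an idle thread may steal it
        long right = new SumTask(a, mid, to).compute();
        return right + left.join();
    }
}
// usage: long total = new ForkJoinPool().invoke(new SumTask(array, 0, array.length));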
I have been reading the source code of PriorityBlockingQueue in Java and I was wondering :
why is the tryGrow() method releasing the lock acquired during the offer() method, just to do its thing non-blocking, and then blocking again when ready to replace the contents of the queue? I mean, it could have just kept the lock it had...
how come this works? Growing the queue involves an array copy; doesn't that misbehave under concurrent adds, where additional adds can arrive while the current add is increasing the size of the queue?
Because the memory allocation can be comparatively slow and can be done while the array is unlocked.
By releasing the lock it is allowing other threads to continue functioning while it is allocating the (potentially large) new array.
As this process can be done without locks it is good practice to do so. You should only hold a lock for the minimum length of time that you have to.
Sufficient checks are made to ensure no other thread is doing this at the same time.
UNSAFE.compareAndSwapInt(this, allocationSpinLockOffset, 0, 1)
will allow only one thread into this section of code at a time.
Note the
lock.lock();
if (newArray != null && queue == array) {
This grabs the lock again and then confirms that the array it is about to replace is the same one it grabbed a copy of at the start. If it has been replaced meanwhile then it just abandons the one it has just created on the assumption that some other thread has grown the array.
If it is still the same it then copies the old data into the new bigger array and plants it back into the field.
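In outline, the pattern looks like this (a simplified paraphrase of the JDK source, not the verbatim implementation; it assumes fields lock, allocationSpinLock and queue, and uses an AtomicInteger where the real code uses Unsafe directly):

private void tryGrow(Object[] array, int oldCap) {
    lock.unlock();                          // let offer()/poll() keep running
    Object[] newArray = null;
    if (allocationSpinLock.compareAndSet(0, 1)) { // only one grower at a time
        try {
            newArray = new Object[oldCap + (oldCap >> 1)]; // slow part, no lock held
        } finally {
            allocationSpinLock.set(0);
        }
    }
    if (newArray == null)                   // another thread is growing:
        Thread.yield();                     // back off and let it finish
    lock.lock();                            // re-acquire before publishing
    if (newArray != null && queue == array) { // still the array we copied from?
        queue = newArray;
        System.arraycopy(array, 0, newArray, 0, oldCap);
    }
}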
As Kamil nicely explains.
The purpose of that unlock is only to make sure that the faster thread grows the queue, so that we do not waste time blocking the "better" (faster) threads.
Assume that we have several million long lines of text that must be parsed.
On my i7 2600 CPU it takes about 13 milliseconds to parse every 1000 lines.
Therefore, parsing 1,000,000 lines takes around 13 seconds.
To decrease execution time, I have tried using multiple threads.
Using a blocking queue, I push the 1,000,000 lines as 1,000 chunks, each containing 1,000 lines, and consume the chunks using 8 threads. The code is simple and seems to be working; however, the performance is not encouraging and takes around 11 seconds.
Here is the main fraction of multi-threaded code:
for (int i = 0; i < threadCount; i++) {
    Runnable r = new Runnable() {
        public void run() {
            try {
                while (true) {
                    InputType chunk = inputQ.poll(10, TimeUnit.MILLISECONDS);
                    if (chunk == null) {
                        if (inputRemains.get())
                            continue;   // producer not done yet; poll again
                        else
                            return;     // queue drained, no more input: exit
                    }
                    processItem(chunk);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
    Thread t = new Thread(r);
    t.start();                          // start the worker
    threadList.add(t);
}

for (Thread t : threadList)             // wait for all workers to finish
    t.join();
I have used ExecutorService too but the performance is worse!
Changing the chunk size does not help either; the performance does not improve.
This suggests that the blocking queue is not the bottleneck.
On the other hand, when I run 4 instances of the serial program concurrently, it takes just 15 seconds for all 4 instances to finish. This means that I can process 4,000,000 lines using 4 processes in 15 seconds; the speed-up is around 3.4, which is very promising compared to the 1.2 speed-up from multi-threading.
I am wondering whether anyone has any idea about this.
The problem is very straightforward: a set of lines in a blocking queue and several threads that poll items from the queue and process them in parallel. The queue is filled initially, so the threads are fully busy.
I have had similar experiences before, but I cannot figure out why multi-processing is better.
I should also mention that I run the test on Windows 7 and using a 1.7 JRE.
Any idea is welcome, and thanks beforehand.
Edit:
So I initially thought that your timing was around your entire program. If you are just timing the processing of the lines after they have been read into memory, then it may be that your processItem(chunk); method is either doing IO of its own, or writing information into a synchronized object or other shared variable that stops it from being able to fully run concurrently.
I am wondering whether anyone has any idea about this.
Your problem may be that you are IO bound and not CPU bound. The only way you will get a large speed improvement by adding more threads is if you are doing more CPU processing than reading from (or writing to) disk. Once you have maxed out the IO capabilities of your disk subsystem, there is not much you can do to improve the speed of the processing. As you have demonstrated, adding more threads can actually slow down an IO-bound program.
I'd add a single extra thread (i.e. 2 processing threads) to see if that helps. If all you are getting is a 2 second speed improvement then you are going to have to divide the file up over multiple drives or move it to a memory drive if this is a repeated task to be able to read it faster.
I have used ExecutorService too but the performance is worse!
This might happen because you are using too many threads or maybe processing too few lines per iteration/chunk.
On the other hand, when I run 4 instances of the serial program concurrently, it takes just 15 seconds for all 4 instances to finish
I suspect this is because each of them can benefit from the OS's disk cache. When the first application reads block #1, the other 3 applications don't have to. Try copying the file 4 times and running 4 serial applications at the same time, each on its own file. You should see the difference.
I would blame the parallelisation of your code. If items are available to process, then several threads will be competing for the same resource (the queue), and contention for synchronisation locks is a bit of a performance killer. If items are being processed faster than they are being added to the queue, then the threads that are being starved are pretty much just busy loops, e.g. while (true) {}: your poll time is very short, and when the polling fails you simply try again immediately.
A little note on synchronisation. To begin with, the JVM uses busy loops to wait for a resource to become available, since (in general) code is written to release synchronisation locks as quickly as possible and the alternative (a context switch) is quite expensive. Eventually, if the JVM finds it is spending most of its time waiting on synchronisation locks, it will default to switching out to a different thread when it cannot acquire a lock.
A better solution is to have one thread reading in the data and dispatching a new thread whenever there is both an available slot for a thread and data for it. Here an Executor would be useful, as it can keep track of which threads have finished and which are still busy. But the pseudo-code would look something like:
int charsRead;
char[] buffer = new char[BUF_SIZE];
int startIndex = 0;
while ((charsRead = inputStreamReader.read(buffer, startIndex,
        buffer.length - startIndex)) != -1) {
    // find the last newline so we don't give a thread any partial lines
    int lastNewLine = findFirstNewLineBeforeIndex(buffer, startIndex + charsRead);
    waitForAvailableThread(); // if fewer than the max threads are running,
                              // this should return immediately
    Thread t = new Thread(createRunnable(buffer, lastNewLine));
    t.start();
    addRunningThread(t);
    // copy any overshoot to the start of a new buffer; use a new buffer
    // because another thread is now reading from the previous one
    char[] newBuffer = new char[BUF_SIZE];
    startIndex = startIndex + charsRead - lastNewLine - 1;
    System.arraycopy(buffer, lastNewLine + 1, newBuffer, 0, startIndex);
    buffer = newBuffer;
}
waitForRemainingThreadsToTerminate();
it takes about 13 milliseconds to parse every 1000 lines.
Therefore, parsing 1,000,000 lines takes around 13 seconds.
The JVM doesn't warm up until it has done something around 10,000 times, after which it can be 10-100x faster; so it could be 13 seconds, or it could be 130 ms or less.
Using a blocking queue, I push the 1,000,000 lines as 1,000 chunks, each containing 1,000 lines, and consume the chunks using 8 threads. The code is simple and seems to be working; however, the performance is not encouraging and takes around 11 seconds.
I suggest you retest with one thread; you are likely to find it takes less than 11 seconds.
The bottleneck is the time it takes to parse the text into a line and create the String object; the rest is just overhead that doesn't address the true bottleneck.
If you read different files, one per CPU, you can get close to linear speed-up. The problem with reading lines is that you have to read them one after the other, so you get little benefit from concurrency.
The i7 2600 uses HT (hyper-threading) to provide 8 threads, and parsing is mainly memory work, so there is little benefit from HT.
Suppose you need to deal with 2 threads, a Reader and a Processor.
Reader will read a portion of the stream data and will pass it to the Processor, that will do something.
The idea is not to stress the Reader with too much data.
In the set-up, I do:
// Processor will pick up data from pipeIn and will place the output in pipeOut
Thread p = new Thread(new Processor(pipeIn, pipeOut));
p.start();
// Reader will pick a bunch of bits from the InputStream and place it to pipeIn
Thread r = new Thread(new Reader(inputStream, pipeIn));
r.start();
Needless to say, neither pipe is null, when initialized.
I am thinking... When the Processor has been started, it attempts to read from pipeIn in the following loop:
while (readingShouldContinue) {
    Thread.sleep(1); // to avoid a tight loop
    byte[] justRead = readFrom.getDataCurrentlyInQueue();
    writeDataToPipe(processData(justRead));
}
If there is no data to write, it will write nothing, should be no problem.
The Reader comes alive and picks up some data from a stream:
while ((in.read(buffer)) != -1) {
    // writes to what the Processor considers pipeIn
    writeTo.addDataToQueue(buffer);
}
In the Pipe itself, I synchronize access to the data.
public byte[] getDataCurrentlyInQueue() {
    synchronized (q) {
        byte[] a = q.peek();
        q.clear();
        return a;
    }
}
I expect the 2 threads to run semi-parallel, interchanging activity between the Reader and the Processor. What happens, however, is that:
the Reader reads all blocks up front
the Processor treats everything as 1 single block
What am I missing, please?
What am I missing, please?
(First I should point out that you've left out some critical bits of the code and other information that is needed for a specific fact-based answer.)
I can think of a number of possible explanations:
There may simply be a bug in your application. There's not a lot of point guessing what that bug might be, but if you showed us more code ...
The OS thread scheduler will tend to let an active thread keep running until it blocks. If your processor has only one core (or if the OS only allows your application to use one core), then the second thread may starve ... long enough for the first one to finish.
Even if you have multiple cores, the OS thread scheduler may be slow to assign extra cores, especially if the 2nd thread starts and then immediately blocks.
It is possible that there is some "granularity" effect in the buffering that is causing work not to appear in the queue. (You could view this as a bug ... or as a tuning issue.)
It could simply be that you are not giving the application enough load for multi-threading to kick in.
Finally, I can't figure out the Thread.sleep stuff either. A properly written multi-threaded application does not use Thread.sleep for anything but long term delays; e.g. threads that do periodic house-keeping tasks in the background. If you use sleep instead of blocking, then 1) you risk making the application non-responsive, and 2) you may encourage the OS thread scheduler to give the thread fewer time slices. It could well be that this is the source of your trouble vis-a-vis thread starvation.
You have reinvented parts of the java.util.concurrent library. It would make things a lot easier if you modeled your threads with a BlockingQueue instead of synchronizing things yourself.
Basically your producer would put chunks on the BlockingQueue, and your consumer would loop over the queue and call take(). That way the consumer blocks/waits until there is a new chunk on the queue.
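A minimal sketch of that shape (the buffer size, the process() handler and the empty-array sentinel are illustrative choices, not from your code; in is assumed to be an effectively final InputStream):

import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// put() blocks when the bounded queue is full (the Reader is never
// "stressed"); take() blocks when it is empty (no sleep/poll loop needed).
BlockingQueue<byte[]> pipe = new ArrayBlockingQueue<>(16);

Thread reader = new Thread(() -> {
    try {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            byte[] chunk = new byte[n];            // fresh copy per chunk, so
            System.arraycopy(buf, 0, chunk, 0, n); // chunks never alias buf
            pipe.put(chunk);
        }
        pipe.put(new byte[0]);                     // end-of-stream sentinel
    } catch (Exception e) {
        e.printStackTrace();
    }
});

Thread processor = new Thread(() -> {
    try {
        byte[] chunk;
        while ((chunk = pipe.take()).length > 0) {
            process(chunk);                        // hypothetical handler
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

reader.start();
processor.start();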
The reader is reading everything before its first time-slice. This means that the reading is finishing before the processor ever gets a chance to run.
Try increasing the amount of bytes that are being read, or slow down the reader somehow; maybe with a sleep() call every once in a while.
Btw. Don't poll. It is a horrendous waste of CPU cycles, and it doesn't scale at all.
Also use a synchronized queue and forget the manual locking. http://docs.oracle.com/javase/tutorial/collections/implementations/queue.html
When using multiple threads you need to determine whether you:
have work which can be performed in parallel efficiently;
are not adding more overhead than the improvement you are likely to achieve;
are not redoing something the OS or some library is already optimised to do.
In your case, you have a good example of when not to use multi-threads. The OS is already tuned to read ahead and buffer data before you ask for it. The work the Reader does is relatively trivial. The overhead of creating new buffers, adding them to a queue and passing the data between threads is likely to be greater than the amount of work you are performing in parallel.
When you try to use multiple threads to do a task best done by a single thread, you will get strange profiling/tuning results.
+1 For a good question.
I am working on a tutorial for my Java concurrency course. The objective is to use thread pools to compute prime numbers in parallel.
The design is based on the Sieve of Eratosthenes. It has an array of n booleans, where n is the largest integer you are checking; each element in the array represents one integer, with true meaning prime and false meaning non-prime, and the array is initially all true.
A thread pool is used with a fixed number of threads (we are supposed to experiment with the number of threads in the pool and observe the performance).
A thread is given an integer multiple to process. The thread then finds the first true element in the array that is not a multiple of the thread's integer, and creates a new task on the thread pool, which is given the found number.
After the new task is created, the existing thread continues to set all multiples of its integer in the array to false.
The main program thread starts the first thread with the integer '2', and then waits for all spawned threads to finish. It then spits out the prime numbers and the time taken to compute.
The issue I have is that the more threads there are in the thread pool, the slower it runs, with 1 thread being the fastest. It should be getting faster, not slower!
All the examples on the internet about Java thread pools create n worker threads and have the main thread wait for all of them to finish. The method I use is recursive, as a worker can spawn more worker threads.
I would like to know what is going wrong, and if Java thread pools can be used recursively.
Your solution may run slower as threads are added because of some of the following problems:
Thread creation overheads: creating a thread is expensive.
Processor contention: if there are more threads than there are processors to execute them, some of the threads will be suspended waiting for a free processor. The result is that the average processing rate for each thread drops. Also, the OS then needs to time-slice the threads, and that takes away time that would otherwise be used for "real" work.
Virtual memory contention: each thread needs memory for its stack. If your machine doesn't have enough physical memory for the workload, each new thread stack increases virtual memory contention, which results in paging, which slows things down.
Cache contention: each thread will (presumably) be scanning a different section of the array, resulting in memory cache misses. This slows down memory accesses.
Lock contention: if your threads are all reading and updating a shared array and using synchronized and one lock object to control access to the array, you could be suffering from lock contention. If a single lock object is used, each thread will spend most of its time waiting to acquire the lock. The net result is that the computation is effectively serialized, and the overall processing rate drops to the rate of a single processor / thread.
The first four problems are inherent to multi-threading, and there are no real solutions ... apart from not creating too many threads and reusing the ones that you have already created. However, there are a number of ways to attack the lock contention problem. For example,
Recode the application so that each thread scans for multiple integers, but in its own section of the array (see the sketch after this list). This will eliminate lock contention on the array, though you will then need a way to tell each thread what to do, and that needs to be designed with contention in mind.
Create an array of locks for different regions of the array, and have the threads pick the lock to use based on the region of the array they are operating on. You would still get contention, but on average you should get less.
Design and implement a lockless solution. This would entail DEEP UNDERSTANDING of the Java memory model. And it would be very difficult to prove / demonstrate that a lockless solution does not contain subtle concurrency flaws.
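A sketch of the first suggestion (illustrative names, not a definitive implementation): the base primes up to sqrt(n) are found sequentially, then each pool thread marks multiples only inside its own disjoint slice of the array, so the marking phase needs no locks at all.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

static boolean[] parallelSieve(int n, int nThreads) throws Exception {
    boolean[] prime = new boolean[n + 1];
    Arrays.fill(prime, 2, n + 1, true);

    int sqrt = (int) Math.sqrt(n);
    List<Integer> basePrimes = new ArrayList<>();
    for (int p = 2; p <= sqrt; p++) {           // small sequential sieve
        if (prime[p]) {
            basePrimes.add(p);
            for (int m = p * p; m <= sqrt; m += p) prime[m] = false;
        }
    }

    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    List<Future<?>> done = new ArrayList<>();
    int slice = (n - sqrt + nThreads - 1) / nThreads; // partition (sqrt, n]
    for (int t = 0; t < nThreads; t++) {
        final int lo = sqrt + 1 + t * slice;
        final int hi = Math.min(n, lo + slice - 1);
        done.add(pool.submit(() -> {
            for (int p : basePrimes) {
                int first = ((lo + p - 1) / p) * p; // first multiple >= lo
                for (int m = first; m <= hi; m += p)
                    prime[m] = false;               // writes stay in [lo, hi]
            }
        }));
    }
    for (Future<?> f : done) f.get();           // wait for every slice
    pool.shutdown();
    return prime;
}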
Finally, recursive creation of threads is probably a mistake, since it will make it harder to implement thread reuse and the anti-lock-contention measures.
How many processors are available on your system? If #threads > #processors, adding more threads is going to slow things down for a compute-bound task like this.
Remember no matter how many threads you start, they're still all sharing the same CPU(s). The more time the CPU spends switching between threads, the less time it can be doing actual work.
Also note that the cost of starting a thread is significant compared to the cost of checking a prime - you can probably do hundreds or thousands of multiplications in the time it takes to fire up 1 thread.
The key point of a thread pool is to keep a set of threads alive and re-use them to process tasks. Usually the pattern is to have a queue of tasks and let a free thread from the pool pick one up to process. If there is no free thread and the pool is full, just wait.
The problem you designed is not a good one to be solved by a thread pool, because you need threads to run in order. Correct me if I'm wrong here.
thread #1: set 2's multiple to false
thread #2: find 3, set 3's multiple to false
thread #3: find 5, set 5's multiple to false
thread #4: find 7, set 7's multiple to false
....
These threads need to be run in order, and their interleaving (how the runtime schedules them) matters.
For example, if thread #3 starts running before thread #1 sets "4" to false, it will find "4" and continue to reset 4's multiples. This ends up doing a lot of extra work, although the final result will be correct.
Restructure your program to create a fixed ThreadPoolExecutor in advance. Make sure you call ThreadPoolExecutor#prestartAllCoreThreads(). Have your main method submit a task for the integer 2; each task will submit another task. Since you are using a thread pool, you won't be creating and terminating a bunch of threads, but instead allowing the same threads to take on new tasks as they become available. This will reduce overall execution overhead.
You should discover that in this case the optimum number of threads is equal to the number of processors (P) on the machine. It is often the case that the optimum number of threads is P+1. This is because P+1 minimizes overhead from context switching while also minimizing loss from idle/blocking time.
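A minimal sketch of that structure (checkMultiples is a hypothetical method implementing the sieve step described in the question; P comes from availableProcessors()):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Fixed pool of P+1 threads, prestarted; tasks submit follow-on tasks
// instead of spawning new threads.
int p = Runtime.getRuntime().availableProcessors();
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        p + 1, p + 1, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
pool.prestartAllCoreThreads();              // workers exist before the first task
pool.submit(() -> checkMultiples(2, pool)); // each task submits the next one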