Poor Multi-threading performance compared to Multi-processing in Java

Assume that we have several million long lines of text that must be parsed.
On my i7 2600 CPU it takes about 13 milliseconds to parse every 1000 lines.
Therefore, parsing 1,000,000 lines takes around 13 seconds.
To decrease the execution time, I tried using multiple threads.
Using a blocking queue, I push the 1,000,000 lines as a set of 1000 chunks, each containing 1000 lines, and consume the chunks using 8 threads. The code is simple and seems to work; however, the performance is not encouraging: it takes around 11 seconds.
Here is the main part of the multi-threaded code:
for (int i = 0; i < threadCount; i++) {
    Runnable r = new Runnable() {
        public void run() {
            try {
                while (true) {
                    InputType chunk = inputQ.poll(10, TimeUnit.MILLISECONDS);
                    if (chunk == null) {
                        if (inputRemains.get())
                            continue;
                        else
                            return;
                    }
                    processItem(chunk);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
    Thread t = new Thread(r);
    threadList.add(t);
    t.start();
}
for (Thread t : threadList)
    t.join();
I have used ExecutorService too but the performance is worse!
Changing the chunk size does not help either; the performance does not improve.
This suggests that the blocking queue is not the bottleneck.
On the other hand, when I run 4 instances of the serial program concurrently, it takes just 15 seconds for all 4 instances to finish. This means that I can process 4,000,000 lines using 4 processes in 15 seconds, so the speed-up is around 3.4, which is very promising compared to the 1.2 speed-up of multi-threading.
Does anyone have any idea about this?
The problem is very straightforward: a set of lines in a blocking queue and several threads that poll items from the queue and process them in parallel. The queue is filled initially, so the threads are fully busy.
I have had similar experiences before, but I cannot figure out why multi-processing is better.
I should also mention that I ran the test on Windows 7 using a 1.7 JRE.
Any idea is welcome, and thanks in advance.

Edit:
So I initially thought that your timing was around your entire program. If you are just timing the processing of the lines after they have been read into memory, then it may be that your processItem(chunk); method is either doing IO of its own or writing information into a synchronized object or other shared variable, which stops it from being able to run fully concurrently.
I am wondering that anyone has any idea about this?
Your problem may be that you are IO bound and not CPU bound. The only way you will get a large speed improvement by adding more threads is if you are doing more CPU processing than you are doing reading from (or writing to) disk. Once you have maxed out the IO capabilities of your disk subsystem, there is not much that you can do to improve the speed of the processing. As you have demonstrated, adding more threads can actually slow down an IO bound program.
I'd add a single extra thread (i.e. 2 processing threads) to see if that helps. If all you are getting is a 2 second speed improvement, then, if this is a repeated task, you are going to have to divide the file up over multiple drives or move it to a memory drive to be able to read it faster.
I have used ExecutorService too but the performance is worse!
This might happen because you are using too many threads or maybe processing too few lines per iteration/chunk.
On the other hand, when I run 4 instances of the serial program concurrently, it just takes 15 seconds to all 4 instances finish
I suspect this is because each of them can use each other's disk cache from the OS. When the first application reads block #1, the other 3 applications don't have to. Try copying the file 4 times and try 4 serial applications running at the same time each on their own file. You should see the difference.
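To make the "too many threads or too few lines per chunk" point concrete, here is a minimal sketch of chunked parsing on a fixed thread pool. It assumes the lines are already in memory; parseLine is a hypothetical stand-in for the real parser, and the class and method names are made up for illustration. The pool is sized from availableProcessors(), and each task handles a whole chunk so that the per-task overhead stays small compared to the parsing work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedParsing {
    // Hypothetical stand-in for the real per-line parser: counts comma-separated fields.
    static int parseLine(String line) {
        return line.split(",").length;
    }

    // Splits the lines into chunks and parses each chunk as one pool task.
    static long parseAll(List<String> lines, int chunkSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int from = 0; from < lines.size(); from += chunkSize) {
                final List<String> chunk =
                        lines.subList(from, Math.min(from + chunkSize, lines.size()));
                tasks.add(() -> {
                    long fields = 0;
                    for (String line : chunk) fields += parseLine(line);
                    return fields;
                });
            }
            long total = 0;
            // invokeAll blocks until every chunk has been parsed.
            for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) lines.add("a,b,c," + i);
        int threads = Runtime.getRuntime().availableProcessors();
        long start = System.nanoTime();
        long total = parseAll(lines, 1000, threads);
        System.out.println(total + " fields in "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Varying chunkSize and threads in main is a cheap way to see where the overhead starts to dominate on a given machine.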

I would blame the parallelisation of your code. If items are available to process then several threads will be competing for the same resource (the queue). Contention for synchronisation locks is a bit of a performance killer. If items are being processed faster than they are being added to the queue then the threads that are being starved are pretty much just busy loops, e.g. while (true) {}. This is because your poll time is very short and when the polling fails you simply try again immediately.
A little note on synchronisation. To begin with, the JVM uses busy loops to wait for a resource to become available, as (in general) code is written to release synchronisation locks as quickly as possible and the alternative (doing a context switch) is quite expensive. Eventually, if the JVM finds it is spending most of its time waiting for synchronisation locks, it will fall back to switching to a different thread when it cannot acquire a lock.
A better solution is to have one thread reading in the data and dispatching a new thread whenever there is both an available slot for a thread and data for a new thread. Here an Executor would be useful, as it can keep track of which threads have finished and which are still busy. But the pseudo-code would look something like:
int charsRead;
char[] buffer = new char[BUF_SIZE];
int startIndex = 0; // number of leftover chars already at the start of buffer
while ((charsRead = inputStreamReader.read(buffer, startIndex, buffer.length - startIndex)) != -1) {
    int filled = startIndex + charsRead;
    // find the last new line so we don't give a thread any partial lines
    int lastNewLine = findFirstNewLineBeforeIndex(buffer, filled);
    waitForAvailableThread(); // returns immediately if fewer than max threads are running
    Thread t = new Thread(createRunnable(buffer, lastNewLine));
    t.start();
    addRunningThread(t);
    // copy any overshoot to the start of a new buffer;
    // use a new buffer as another thread is now reading from the previous one
    char[] newBuffer = new char[BUF_SIZE];
    startIndex = filled - lastNewLine - 1;
    System.arraycopy(buffer, lastNewLine + 1, newBuffer, 0, startIndex);
    buffer = newBuffer;
}
waitForRemainingThreadsToTerminate();
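The same "one reader dispatching work to a limited number of workers" idea can be sketched with a ThreadPoolExecutor instead of hand-managed threads. This is only an illustration under assumptions: the class and method names are made up, the per-chunk work is a placeholder that counts characters, and a bounded queue with CallerRunsPolicy is one possible way to stand in for waitForAvailableThread() (when the pool is saturated, the reading thread processes the chunk itself instead of spinning).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ReaderDispatch {
    // Reads whole lines, groups them into chunks, and hands each chunk to the pool.
    // Returns the total number of characters processed (a placeholder for real work).
    static long process(Reader source, int linesPerChunk, int threads)
            throws IOException, InterruptedException {
        final AtomicLong total = new AtomicLong();
        // Bounded queue + CallerRunsPolicy: when all workers are busy and the queue
        // is full, the reading thread runs the chunk itself, which throttles reading.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(threads * 2),
                new ThreadPoolExecutor.CallerRunsPolicy());
        try (BufferedReader in = new BufferedReader(source)) {
            List<String> chunk = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == linesPerChunk) {
                    submit(pool, chunk, total);
                    chunk = new ArrayList<>(linesPerChunk); // a worker owns the old list now
                }
            }
            if (!chunk.isEmpty()) submit(pool, chunk, total);
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
        return total.get();
    }

    private static void submit(ThreadPoolExecutor pool, final List<String> chunk,
                               final AtomicLong total) {
        pool.execute(() -> {
            long chars = 0;
            for (String l : chunk) chars += l.length(); // stand-in for real parsing
            total.addAndGet(chars);
        });
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(new StringReader("ab\ncde\nf\n"), 2, 4)); // prints 6
    }
}
```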

it takes about 13 milliseconds to parse every 1000 lines.
Therefore, parsing 1,000,000 lines takes around 13 seconds.
The JVM doesn't warm up until it has run a piece of code about 10,000 times, after which that code can be 10-100x faster, so it could be 13 seconds or it could be 130 ms or less.
Using a blocking queue, I push 1,000,000 lines as a set of 1000 chunk each containing 1000 lines and consume the chunks using 8 threads. The code is simple and seems to be working however, the performance is not encouraging and takes around 11 seconds.
I suggest you retest with one thread; you are likely to find it takes less than 11 seconds.
The bottleneck is the time it takes to parse the input into lines and create the String objects; the rest is just overhead which doesn't address the true bottleneck.
If you read different files, one per CPU, you can get close to a linear speed-up. The problem with reading lines is that you have to read one after the other, so you get little benefit from concurrency.
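The warm-up effect is easy to see by timing the same work over several rounds in one JVM. The sketch below is only an illustration: parse is a hypothetical stand-in for the real parser, and the round and iteration counts are arbitrary. The first round or two typically include interpretation and JIT compilation, so later rounds are the ones worth comparing.

```java
class WarmupTiming {
    // Hypothetical stand-in for the real parser: sums the decimal digits in a line.
    static int parse(String line) {
        int sum = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c >= '0' && c <= '9') sum += c - '0';
        }
        return sum;
    }

    // Parses the same line `iterations` times and returns the elapsed nanoseconds.
    // The checksum is added to sink[0] so the JIT cannot discard the loop.
    static long time(String line, int iterations, long[] sink) {
        long start = System.nanoTime();
        long checksum = 0;
        for (int i = 0; i < iterations; i++) checksum += parse(line);
        sink[0] += checksum;
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String line = "id=12345;value=67890;flag=1";
        long[] sink = new long[1];
        for (int round = 1; round <= 5; round++) {
            long nanos = time(line, 200_000, sink);
            System.out.println("round " + round + ": " + nanos / 1_000 + " us");
        }
        System.out.println("checksum " + sink[0]); // printed so the work stays observable
    }
}
```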

The i7 2600 has 4 physical cores and uses HT (Hyper-Threading) to expose 8 hardware threads, and parsing is mainly memory work, so there is little benefit from HT.

Related

Cause of delay in multi-threading

I wrote a program in Java to print one million numbers in a for loop.
for (int i =0;i<1000000;i++){
System.out.println(i);
}
It took around 7.5 seconds.
I wrote a custom class in Java that implements the Runnable interface; it takes two parameters as limits and prints the values between them.
public class ThreadCustom implements Runnable {
    int start;
    int end;
    String name;

    ThreadCustom(int start, int end, String name) {
        this.start = start;
        this.end = end;
        this.name = name;
    }

    @Override
    public void run() {
        for (int i = start; i <= end; i++) {
            System.out.println(i);
        }
    }
}
I created 10 objects of my custom thread class and assigned each object a chunk of 100k numbers to print, so in the end all one million numbers get printed (not in order, of course), but it takes around 9.5 seconds.
What's the reason for this 2 second delay? Is it because of the time slicing and context switching that takes place between threads? I am executing a Java process and it spawns 10 threads. Am I thinking in the right direction?
Update: I commented out System.out.println to see how it performs when there is only the iteration.
Printed time without threads
2019-04-14 22:18:07.111 // start
2019-04-14 22:18:07.116 // end
Using ThreadCustom class:
2019-04-14 22:26:42.339
2019-04-14 22:26:42.341

The extra time is spent in two ways:
1) the overhead involved in setting up each thread's execution context
2) the likely scenario that you are spawning more threads than there are logical processors available on your machine
Since the amount of processing required to increment a loop counter and print an integer is minimal, running it in parallel will, in the majority of cases, degrade performance.
If you were however to do something like count the distinct pixel colors on any given image during each iteration, you would see a significant performance advantage when using multiple threads.

I wrote a program in Java to print [1 million] in a for loop... I created 10 objects of my custom thread class, ... but it takes around 9.5 seconds. What's the reason for this 2 seconds delay?
Threads are only faster if they can work independently. In the case of printing numbers to System.out, all of the threads are contending for access to the same resource, System.out, which is a synchronized PrintStream. This means that most of the time is wasted waiting for another thread to release the lock on System.out. Any additional "delay" with threaded programs is most likely because of the lock contention and the context switching between the threads.
To test thread speed appropriately, you need to run some sort of independent CPU task in each thread. Calculating Math.sqrt(...) a bunch of times is a better example. On my newer Macbook, I can do 1 billion (with a b) Math.sqrt(...) calls in ~8.1 seconds but 10 threads can each do 100 million in ~1.1 seconds in parallel. But wait, you might say, 10 * 1.1 > 8 seconds of total CPU. I have 4 cores, so with 10 threads running, there is a lot of in and out of the CPUs. 4 threads doing 250m each take 2.1 seconds which is a lot closer to 8.1 secs with the single thread example.
Lastly, Java performance testing is really hard. I bet if you ran your two programs a number of times you would see some different results. Any program that runs quickly is really not a good judge of speed, or at best is a very rough approximation. Also, you need to be careful, or else the HotSpot compiler might optimize your loops away at runtime, so you need to try to do actual work.
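A rough version of the Math.sqrt test described above might look like the following sketch. The class and method names are made up, the problem size is arbitrary, and the timings will differ from machine to machine; the sum is returned and printed so the loop counts as real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SqrtBench {
    // Sums Math.sqrt(i) for i in [from, to).
    static double sumSqrt(long from, long to) {
        double sum = 0;
        for (long i = from; i < to; i++) sum += Math.sqrt(i);
        return sum;
    }

    // Splits [0, n) into at most `threads` contiguous ranges and sums them in parallel.
    static double parallelSumSqrt(long n, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> parts = new ArrayList<>();
            long step = (n + threads - 1) / threads;
            for (long from = 0; from < n; from += step) {
                final long lo = from;
                final long hi = Math.min(n, from + step);
                Callable<Double> task = () -> sumSqrt(lo, hi);
                parts.add(pool.submit(task));
            }
            double sum = 0;
            for (Future<Double> part : parts) sum += part.get();
            return sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long n = 200_000_000L;
        for (int threads : new int[] {1, 2, 4, 8}) {
            long start = System.nanoTime();
            double sum = parallelSumSqrt(n, threads);
            System.out.printf("%d thread(s): %d ms (sum=%.3e)%n",
                    threads, (System.nanoTime() - start) / 1_000_000, sum);
        }
    }
}
```

Because each range is independent and nothing is shared until the final sum, this kind of task should scale up to roughly the number of physical cores.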

Multithreaded application increases runtime with number of threads

I am implementing a multithreaded solution of the Barnes-Hut algorithm for the N-Body problem.
Main class does the following
public void runSimulation() {
    for (int i = 0; i < numWorkers; i++) {
        new Thread(new Worker(i, this, gnumBodies, numSteps)).start();
    }
    try {
        startBarrier.await();
        stopBarrier.await();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
bh.startBarrier and bh.stopBarrier are CyclicBarriers whose barrier actions set startTime and stopTime to System.nanoTime() when the barrier is reached.
The worker's run method:
public void run() {
    try {
        bh.startBarrier.await();
        for (int j = 0; j < numSteps; j++) {
            for (int i = wid; i < gnumBodies; i += bh.numWorkers) {
                bh.addForce(i);
                bh.moveBody(i);
            }
            bh.barrier.await();
        }
        bh.stopBarrier.await();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
addForce(i) goes through a tree and does some calculations. It does not affect any shared variables, so no synchronization is used. O(NlogN).
moveBody(i) does calculations on one element and no synchronization is used. O(N).
When bh.barrier is reached, a tree with all bodies is built up (barrier action).
Now to the problem. The runtime increases linearly with the number of threads used.
Runtimes for gnumBodies = 240, numSteps = 85000 and four cores:
1 thread = 0.763
2 threads = 0.952
3 threads = 1.261
4 threads = 1.563
Why isn't the runtime decreasing with the number of threads used?
edit: added hardware info

What hardware are you running it on? Running multiple threads has its overhead, so it might not be worthwhile to split your task into sub-tasks that are too small.
Also, try using an ExecutorService instead of raw threads. That way you can use a thread pool instead of creating an actual thread for each task. There is no use in having more threads than your hardware can handle.
It also looks to me like each thread will do the same work. Can this be the case? When creating a worker you are using the same parameters each time, apart from i.

Multithreading does not increase the execution speed unless you also have multiple CPU cores.
Threads doing math calculations can run at full speed
If you have only one CPU core, it is all the same if you run a calculation in one thread or in multiple threads. Running in multiple threads gives no performance benefit, but comes with an overhead of thread switching, so actually the total performance will be a little worse.
If you have multiple available CPU cores, then the threads can run physically in parallel up to the number of cores. This means 4-8 threads may work well on today's desktop hardware.
Threads waiting for IO and getting suspended
Threads make sense if you don't do a mathematical calculation, but do something which involves slow I/O such as network, files, or databases. Instead of holding up your whole program while one thread waits for the IO, another thread can use the same CPU core. This is the reason why web servers and database solutions work with more threads than CPU cores.
Avoid unnecessary synchronization
Nevertheless, your measurement points to a synchronization mistake.
I suggest removing all the xxBarrier.await() calls from the thread code.
I am unsure what your goal is with the xxBarriers vs. System.nanoTime(), but any unnecessary synchronization can easily result in slow performance. Instead of calculating, you're waiting on the xxBarriers.

Your workers do the same job numWorker times, independently.
The only shared object is your CyclicBarrier. await() waits until all parties have invoked await on this barrier. As the number of workers increases, each worker spends more time in await().

If you have multiple cores or if hyperthreading is available, then running multiple threads will take the benefit of underlying hardware.
If only one core is present, multi-threading can give a 'perceived' benefit if your application involves at least one non-CPU-intensive kind of work, such as interaction with a human. Humans are very slow compared to modern CPUs. Hence, if your application needs to get multiple inputs from a human and also process them, it makes sense to separate the input and the calculations into two threads. By the time the human provides the next input, part of the calculation can be completed in another thread, which gives the overall improvement in time.
If your application must do calculations and multi-threading support in hardware is not present, it is better to use a single thread. Your 'calculations' are already lined up in the pipeline back-to-back and the CPU will already be running at (almost) max speed. Multi-threading would require context-switching time, which will increase the time taken to do the calculations.

When I ran the application with a larger number of bodies and fewer steps, the application scaled as expected. So the problem was probably the overhead of the barrier(s)!
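The barrier overhead can be reproduced in isolation. The sketch below is an assumption-laden stand-in for the simulation, not the original code: the per-body work is a single addition instead of addForce/moveBody, the barrier action only counts rounds instead of rebuilding a tree, and the class name is invented. With 240 bodies and 85,000 steps, each step does so little work that the per-step await() is likely to dominate as workers are added.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

class BarrierSteps {
    // Runs `steps` rounds; in each round worker w updates bodies w, w+workers, ...
    // and then waits at the barrier. Returns how many times the barrier action ran.
    static int run(final double[] bodies, final int steps, final int workers)
            throws InterruptedException {
        final AtomicInteger rounds = new AtomicInteger();
        // The barrier action stands in for the per-step tree rebuild.
        final CyclicBarrier barrier = new CyclicBarrier(workers, rounds::incrementAndGet);
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int wid = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int s = 0; s < steps; s++) {
                        for (int i = wid; i < bodies.length; i += workers) {
                            bodies[i] += 1.0; // stand-in for addForce/moveBody
                        }
                        barrier.await(); // every worker pays this once per step
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        return rounds.get();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int workers = 1; workers <= 4; workers++) {
            double[] bodies = new double[240];
            long start = System.nanoTime();
            run(bodies, 85_000, workers);
            System.out.println(workers + " worker(s): "
                    + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}
```

Raising the array size and lowering the step count shifts the balance back toward useful work, which matches the observation above.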

Merging several ByteArrayOutputStreams into one FileOutputStream

I am splitting up a computation among eight threads and writing the results to a file, as follows:
1a. Each of seven threads processes its input and writes its output to its own ByteArrayOutputStream; when the stream closes, the thread offers an <Integer, ByteArrayOutputStream> to a ConcurrentLinkedQueue, and calls countDown() on a CountDownLatch (that was initialized to 7).
1b. Concurrently, an eighth thread reads in all of the input data that will be processed on the next iteration. This thread awaits on the CountDownLatch when it finishes reading in its data.
2a. When the CountDownLatch reaches 0, the eighth thread wakes up, sorts the ConcurrentLinkedQueue using the Integer in the <Integer, ByteArrayOutputStream> as the sort key, then iterates through the queue and appends the byte arrays to a file. (There might be a more efficient way to traverse the list in order without sorting it, but the list only has seven elements in it so the runtime of the sort method is a non-issue.)
2b. Concurrently, the other seven threads process the input that has been prepared for them by the eighth thread.
** This process loops until all data are processed (typically 40-80 iterations).
Each thread processes an equal-sized input chunk (except possibly on the last iteration) of 8mb; each ByteArrayOutputStream contains from 1-4 mb, and the output size can't be known ahead of time. Typically the runtimes of the earliest-completing and latest-completing CPU-bound threads are within 20% of each other.
I am wondering if there is an IO library (or a method in java.io or java.nio that I've missed) that already does something like this - at present the eighth thread (the IO thread) is idle about 75% of the time, but any way I've come up with to alleviate this inefficiency strikes me as being too complicated (and hence too risky in terms of creating deadlocks or data races); for example, I could divide the input into 4 mb chunks and then give two chunks to the seven CPU-bound threads and one chunk to the IO-bound thread which would in theory reduce the IO thread's idle time to 25% (25% on IO, 50% on a 4 mb chunk, 25% idle), but this is a brittle solution that might not port to another CPU (meaning that on another CPU the IO-bound thread might then turn into a bottleneck if e.g. its runtime is 150% that of the CPU-bound threads) - I'd really like a self-balancing solution so that I don't need to fine-tune the load-balancing by hand.

The inefficiency consists in waiting for all 7 outputs to be complete before thread 8 processes any of them. It would be better to run 7 queues instead of one, i.e. one per source thread, and read them in the order necessary. That way, when the first queue has any data it is processed immediately, rather than having to wait for the other 6; similarly for queues 2..6. When thread 8 finishes the last queue it can then start producing, or indeed it could be doing that instead of waiting for any specific queue to start producing.

I've modified the algorithm as follows:
Rather than assigning the input chunks directly to the CPU threads, I instead put the chunks in a BlockingQueue from which the CPU threads poll/take their work chunks
Output is sent to a ConcurrentSkipListMap<Integer, ByteArrayOutputStream>
The CPU threads simply loop until canceled. The IO thread peeks on the ConcurrentSkipListMap (using firstKey) to see if there is any data to be written (I maintain a counter of what the next key should be to ensure that the output streams are written in order), then it checks the length of the BlockingQueue to see if any data needs to be added to it (if queue.size() < N then I add N more chunks to it, where N initially equals 12); if it did either or both IO tasks then it loops, otherwise it processes a chunk from the BlockingQueue and then loops.
The BlockingQueue should not be empty unless the entire input has already been processed by the IO thread - an empty queue indicates that the queue.size() < N threshold needs to be raised. For this reason the CPU threads' logic is
while (!cancel) {
    try {
        Input input = queue.poll();
        if (input == null) {
            log.warn("Empty queue");
            input = queue.take();
        }
        process(input);
    } catch (InterruptedException ex) {
        cancel = true;
    }
}
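The in-order output side described above (a ConcurrentSkipListMap plus a "next expected key" counter) could be sketched as follows. This is an illustration under assumptions rather than the actual implementation: OrderedWriter, publish, drainTo and chunk are invented names, and firstEntry() is used instead of firstKey() because firstKey() throws NoSuchElementException on an empty map.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class OrderedWriter {
    private final ConcurrentSkipListMap<Integer, ByteArrayOutputStream> done =
            new ConcurrentSkipListMap<>();
    private int nextKey = 0; // touched only by the IO thread

    // Called by CPU threads when chunk `index` has been processed.
    void publish(int index, ByteArrayOutputStream result) {
        done.put(index, result);
    }

    // Called by the IO thread: writes every result that is next in sequence and
    // returns how many chunks were written. Stops at the first gap.
    int drainTo(OutputStream out) throws IOException {
        int written = 0;
        Map.Entry<Integer, ByteArrayOutputStream> first;
        while ((first = done.firstEntry()) != null && first.getKey() == nextKey) {
            first.getValue().writeTo(out);
            done.remove(first.getKey());
            nextKey++;
            written++;
        }
        return written;
    }

    // Test helper: wraps a string in a ByteArrayOutputStream.
    static ByteArrayOutputStream chunk(String s) throws IOException {
        ByteArrayOutputStream b = new ByteArrayOutputStream();
        b.write(s.getBytes(StandardCharsets.US_ASCII));
        return b;
    }

    public static void main(String[] args) throws IOException {
        OrderedWriter w = new OrderedWriter();
        ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for a FileOutputStream
        w.publish(1, chunk("B"));
        System.out.println(w.drainTo(file)); // 0: chunk 0 has not arrived yet
        w.publish(0, chunk("A"));
        System.out.println(w.drainTo(file)); // 2
        System.out.println(file);            // AB
    }
}
```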

Understanding Threads + Asynchronous

So I have a program that I made that needs to send a lot (like 10,000+) of GET requests to a URL and I need it to be as fast as possible. When I first created the program I just put the connections into a for loop but it was really slow because it would have to wait for each connection to complete before continuing. I wanted to make it faster so I tried using threads and it made it somewhat faster but I am still not satisfied.
I'm guessing the correct way to go about this and making it really fast is using an asynchronous connection and connecting to all of the URLs. Is this the right approach?
Also, I have been trying to understand threads and how they work but I can't seem to get it. The computer I am on has an Intel Core i7-3610QM quad-core processor. According to Intel's website for the specifications for this processor, it has 8 threads. Does this mean I can create 8 threads in a Java application and they will all run concurrently? Any more than 8 and there will be no speed increase?
What exactly does the number represent next to "Threads" in the task manager under the "Performance" tab? Currently, my task manager is showing "Threads" as over 1,000. Why is it this number and how can it even go past 8 if that's all my processor supports?
I also noticed that when I tried my program with 500 threads as a test, the number in the task manager increased by 500 but it had the same speed as if I set it to use 8 threads instead. So if the number is increasing according to the number of threads I am using in my Java application, then why is the speed the same?
Also, I have tried doing a small test with threads in Java but the output doesn't make sense to me.
Here is my Test class:
import java.text.SimpleDateFormat;
import java.util.Date;

public class Test {
    private static int numThreads = 3;
    private static int numLoops = 100000;
    private static SimpleDateFormat dateFormat = new SimpleDateFormat("[hh:mm:ss] ");

    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= numThreads; i++) {
            final int threadNum = i;
            new Thread(new Runnable() {
                public void run() {
                    System.out.println(dateFormat.format(new Date()) + "Start of thread: " + threadNum);
                    for (int i = 0; i < numLoops; i++)
                        for (int j = 0; j < numLoops; j++);
                    System.out.println(dateFormat.format(new Date()) + "End of thread: " + threadNum);
                }
            }).start();
            Thread.sleep(2000);
        }
    }
}
This produces an output such as:
[09:48:51] Start of thread: 1
[09:48:53] Start of thread: 2
[09:48:55] Start of thread: 3
[09:48:55] End of thread: 3
[09:48:56] End of thread: 1
[09:48:58] End of thread: 2
Why does the third thread start and end right away while the first and second take 5 seconds each? If I add more than 3 threads, the same thing happens for all threads above 2.
Sorry if this was a long read, I had a lot of questions.
Thanks in advance.

Your processor has 4 physical cores and, with Hyper-Threading, 8 hardware threads (logical processors). This does in fact mean that only 8 things can be running at any given moment. That doesn't mean that you are limited to only 8 threads, however.
When a thread is synchronously opening a connection to a URL it will often sleep while it waits for the remote server to get back to it. While that thread is sleeping other threads can be doing work. If you have 500 threads and all 500 are sleeping then you aren't using any of the cores of your CPU.
On the flip side, if you have 500 threads and all 500 threads want to do something then they can't all run at once. To handle this scenario there is a special tool. Processors (or more likely the operating system or some combination of the two) have a scheduler which determines which threads get to be actively running on the processor at any given time. There are many different rules, and sometimes random activity, that control how these schedulers work. This may explain why in the above example thread 3 always seems to finish first. Perhaps the scheduler is preferring thread 3 because it was the most recent thread to be scheduled by the main thread; it can be impossible to predict the behavior sometimes.
Now to answer your question regarding performance. If opening a connection never involved a sleep then it wouldn't matter if you were handling things synchronously or asynchronously you would not be able to get any performance gain above 8 threads. In reality, a lot of the time involved in opening a connection is spent sleeping. The difference between asynchronous and synchronous is how to handle that time spent sleeping. Theoretically you should be able to get nearly equal performance between the two.
With a multi-threaded model you simply create more threads than there are cores. When the threads hit a sleep they let the other threads do work. This can sometimes be easier to handle because you don't have to write any scheduling or interaction between the threads.
With an asynchronous model you only create a single thread per core. If that thread needs to sleep then it doesn't sleep but actually has to have code to handle switching to the next connection. For example, assume there are three steps in opening a connection (A,B,C):
while (!connectionsList.isEmpty()) {
    // use an explicit Iterator so finished connections can be removed while looping
    for (Iterator<Connection> it = connectionsList.iterator(); it.hasNext(); ) {
        Connection connection = it.next();
        if (connection.getState() == READY_FOR_A) {
            connection.stepA();
            // this method should return immediately and the connection
            // should go into the waiting state for some time before going
            // into the READY_FOR_B state
        }
        if (connection.getState() == READY_FOR_B) {
            connection.stepB();
            // same immediate return behavior as above
        }
        if (connection.getState() == READY_FOR_C) {
            connection.stepC();
            // same immediate return behavior as above
        }
        if (connection.getState() == WAITING) {
            // do nothing, skip over
        }
        if (connection.getState() == FINISHED) {
            it.remove();
        }
    }
}
Notice that at no point does the thread sleep so there is no point in having more threads than you have cores. Ultimately, whether to go with a synchronous approach or an asynchronous approach is a matter of personal preference. Only at absolute extremes will there be performance differences between the two and you will need to spend a long time profiling to get to the point where that is the bottleneck in your application.
It sounds like you're creating a lot of threads and not getting any performance gain. There could be a number of reasons for this.
It's possible that your establishing a connection isn't actually sleeping in which case I wouldn't expect to see a performance gain past 8 threads. I don't think this is likely.
It's possible that all of the threads are using some common shared resource. In this case the other threads can't work because the sleeping thread has the shared resource. Is there any object that all of the threads share? Does this object have any synchronized methods?
It's possible that you have your own synchronization. This can create the issue mentioned above.
It's possible that each thread has to do some kind of setup/allocation work that is defeating the benefit you are gaining by using multiple threads.
If I were you I would use a tool like JVisualVM to profile your application when running with some smallish number of threads (20). JVisualVM has a nice colored thread graph which will show when threads are running, blocking, or sleeping. This will help you understand the thread/core relationship as you should see that the number of running threads is less than the number of cores you have. In addition if you see a lot of blocked threads then that can help lead you to your bottleneck (if you see a lot of blocked threads use JVisualVM to create a thread dump at that point in time and see what the threads are blocked on).
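The claim that sleeping threads leave the cores free for other threads can be checked with simulated I/O. In this sketch Thread.sleep stands in for waiting on a remote server; the class name, task count and latency are arbitrary assumptions. With blocking tasks like these, the wall-clock time should keep dropping well past 8 threads, which would not be the case for CPU-bound work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BlockingIoPool {
    // Runs `tasks` simulated requests, each blocking for `latencyMs`, on `threads`
    // pool threads and returns the wall-clock time in milliseconds.
    static long run(int tasks, final long latencyMs, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> work = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                work.add(() -> {
                    Thread.sleep(latencyMs); // stand-in for waiting on a remote server
                    return null;
                });
            }
            long start = System.nanoTime();
            pool.invokeAll(work); // returns once every task has completed
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int threads : new int[] {1, 8, 64}) {
            System.out.println(threads + " thread(s): " + run(64, 50, threads) + " ms");
        }
    }
}
```

If the real program shows no gain beyond 8 threads, that is a hint that the threads are blocked on something shared rather than on the network.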

Some concepts:
You can have many threads in the system, but only some of them (max 8 in your case) will be "scheduled" on the CPU at any point of time. So, you cannot get more performance than 8 threads running in parallel. In fact the performance will probably go down as you increase the number of threads, because of the work involved in creating, destroying and managing threads.
Threads can be in different states : http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
Out of those states, only the RUNNABLE threads stand to get a slice of CPU time. The operating system decides how CPU time is assigned to threads. In a regular system with thousands of threads, it can be completely unpredictable when a certain thread will get CPU time and how long it will stay on the CPU.
About the problem you are solving:
You seem to have figured out the correct solution - making parallel asynchronous network requests. However, practically speaking starting 10000+ threads and that many network connections, at the same time, may be a strain on the system resources and it may just not work. This post has many suggestions for asynchronous I/O using Java. (Tip: Don't just look at the accepted answer)
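As one possible shape for the asynchronous approach, here is a sketch that uses the java.net.http.HttpClient that ships with Java 11 and later (this is an assumption about the runtime; it is not available on older JREs). The URLs, the class name and the in-flight limit are hypothetical. A Semaphore caps the number of outstanding requests so that 10,000+ connections are not opened at once.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

class AsyncGets {
    // Builds one GET request per URL; no network traffic happens here.
    static List<HttpRequest> buildRequests(List<String> urls) {
        List<HttpRequest> requests = new ArrayList<>();
        for (String url : urls) {
            requests.add(HttpRequest.newBuilder(URI.create(url)).GET().build());
        }
        return requests;
    }

    // Sends all requests asynchronously with at most `maxInFlight` outstanding and
    // returns the status codes in request order (-1 for a failed request).
    static List<Integer> fetchAll(HttpClient client, List<HttpRequest> requests, int maxInFlight)
            throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<Integer>> pending = new ArrayList<>();
        for (HttpRequest request : requests) {
            permits.acquire(); // blocks the submitting thread when too many are in flight
            pending.add(client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                    .thenApply(HttpResponse::statusCode)
                    .exceptionally(error -> -1)
                    .whenComplete((status, error) -> permits.release()));
        }
        List<Integer> statuses = new ArrayList<>();
        for (CompletableFuture<Integer> f : pending) statuses.add(f.join());
        return statuses;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical target; replace with the real endpoint.
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) urls.add("http://localhost:8080/item?id=" + i);
        System.out.println(fetchAll(HttpClient.newHttpClient(), buildRequests(urls), 100));
    }
}
```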

This solution is more specific to the general problem of trying to make 10k requests as fast as possible. I would suggest that you abandon the Java HTTP libraries and use Apache's HttpClient instead. They have several suggestions for maximizing performance which may be useful. I have heard the Apache HttpClient library is just faster in general as well, lighter weight and less overhead.

Java multithreading, getting threads to work in parallel

Suppose you need to deal with 2 threads, a Reader and a Processor.
Reader will read a portion of the stream data and will pass it to the Processor, that will do something.
The idea is to not stress the Reader with too much data.
In the set up, i
// Processor will pick up data from pipeIn and will place the output in pipeOut
Thread p = new Thread(new Processor(pipeIn, pipeOut));
p.start();
// Reader will pick a bunch of bits from the InputStream and place it to pipeIn
Thread r = new Thread(new Reader(inputStream, pipeIn));
r.start();
Needless to say, neither pipe is null, when initialized.
I am thinking ... When Processor has been started it attempts to read from the pipeIn, in the following loop:
while (readingShouldContinue) {
Thread.sleep(1); // To avoid tight loop
byte[] justRead = readFrom.getDataCurrentlyInQueue();
writeDataToPipe(processData(justRead));
}
If there is no data to write, it will write nothing, should be no problem.
The Reader comes alive and picks up some data from a stream:
while ((in.read(buffer)) != -1) {
// Writes to what processor considers a pipeIn
writeTo.addDataToQueue(buffer);
}
In Pipe itself, I synchronize access to the data.
public byte[] getDataCurrentlyInQueue() {
synchronized (q) {
byte[] a = q.peek();
q.clear();
return a;
}
}
I expect the two threads to run more or less in parallel, alternating between the Reader and the Processor. What happens, however, is that
the Reader reads all blocks up front
the Processor treats everything as one single block
What am I missing, please?

What am i missing please?
(First I should point out that you've left out some critical bits of the code and other information that is needed for a specific fact-based answer.)
I can think of a number of possible explanations:
There may simply be a bug in your application. There's not a lot of point guessing what that bug might be, but if you showed us more code ...
The OS thread scheduler will tend to let an active thread keep running until it blocks. If your processor has only one core (or if the OS only allows your application to use one core), then the second thread may starve ... long enough for the first one to finish.
Even if you have multiple cores, the OS thread scheduler may be slow to assign extra cores, especially if the 2nd thread starts and then immediately blocks.
It is possible that there is some "granularity" effect in the buffering that is causing work not to appear in the queue. (You could view this as a bug ... or as a tuning issue.)
It could simply be that you are not giving the application enough load for multi-threading to kick in.
Finally, I can't figure out the Thread.sleep stuff either. A properly written multi-threaded application does not use Thread.sleep for anything but long term delays; e.g. threads that do periodic house-keeping tasks in the background. If you use sleep instead of blocking, then 1) you risk making the application non-responsive, and 2) you may encourage the OS thread scheduler to give the thread fewer time slices. It could well be that this is the source of your trouble vis-a-vis thread starvation.

You reinvented parts of the Java concurrency library. It would make things a lot easier if you modeled your threads with a BlockingQueue instead of synchronizing things yourself.
Basically your producer would put chunks on the BlockingQueue and your consumer would loop with while(true) over the queue and call take(). That way the consumer would block/wait until there is a new chunk on the queue.
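A minimal sketch of that producer/consumer setup follows. The names (ReaderProcessor, pump, END) are invented, the "processing" only counts blocks and bytes, and the queue capacity of 4 is an arbitrary choice. A bounded queue makes put() block, so the Reader cannot run far ahead of the Processor, and a sentinel block marks the end of the stream so no polling or sleeping is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ReaderProcessor {
    private static final byte[] END = new byte[0]; // sentinel: no more data

    // Reads the stream in `blockSize` pieces on a reader thread while the calling
    // thread processes them. Returns {blocks processed, bytes processed}.
    static long[] pump(final InputStream in, final int blockSize) throws Exception {
        final BlockingQueue<byte[]> pipe = new ArrayBlockingQueue<>(4);
        Thread reader = new Thread(() -> {
            try {
                try {
                    byte[] buffer = new byte[blockSize];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        // copy, because the buffer is reused for the next read
                        if (n > 0) pipe.put(Arrays.copyOf(buffer, n));
                    }
                } finally {
                    pipe.put(END); // always signal the end, even after a read error
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();
        long blocks = 0, bytes = 0;
        // take() blocks until a block is available; no sleep or polling needed
        for (byte[] block = pipe.take(); block != END; block = pipe.take()) {
            blocks++;              // stand-in for processData(block)
            bytes += block.length;
        }
        reader.join();
        return new long[] {blocks, bytes};
    }

    public static void main(String[] args) throws Exception {
        long[] r = pump(new ByteArrayInputStream(new byte[10_000]), 1024);
        System.out.println(r[0] + " blocks, " + r[1] + " bytes"); // 10 blocks, 10000 bytes
    }
}
```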

The reader is reading everything before its first time-slice. This means that the reading is finishing before the processor ever gets a chance to run.
Try increasing the amount of bytes that are being read, or slow down the reader somehow; maybe with a sleep() call every once in a while.
Btw. Don't poll. It is a horrendous waste of CPU cycles, and it doesn't scale at all.
Also use a synchronized queue and forget the manual locking. http://docs.oracle.com/javase/tutorial/collections/implementations/queue.html

When using multiple threads you need to determine whether
you have work which can be performed in parallel efficiently,
you are not adding more overhead than the improvement you are likely to achieve,
the OS or some library is not already optimised to do what you are trying to do.
In your case, you have a good example of when not to use multi-threads. The OS is already tuned to read ahead and buffer data before you ask for it. The work the Reader does is relatively trivial. The overhead of creating new buffers, adding them to a queue and passing the data between threads is likely to be greater than the amount of work you are performing in parallel.
When you try to use multiple threads to do a task best done by a single thread, you will get strange profiling/tuning results.
+1 For a good question.
