Buffering Input files and parallel processing in java

Buffering Input files and parallel processing in java - java

I have a collection of 1000 files in gz format. I want to process them in chunks in parallel, say 8 in each round. When I let every thread opens a file and read from disk that resulted in a significant delay due to many processes trying to read from different locations.
I just wonder if there is an efficient method to handle multiple files reads? Or shall I buffer all the files into memory first (e.g. all 8 files and then hand the buffers to the threads). If so, what would be the best way of buffering files? bufferArray? or some alternative structures?
Thank you.

If you use a fixed size pool of say 8 (because you have 8 cores) you may find this is reasonably efficient as decompressing the files is cpu intensive.
You may find however that this is no faster than using 4 threads or only 2 because the real bottleneck is reading the data from disk. If this is the case, the only thing you can do is get a faster disk. e.g. mirror the disk, or use an SSD which can be 20x faster.

I suspect you're swamping your process with 1000 threads. Threads aren't particularly lightweight (e.g. each will grab by default 512k of stack space).
A more efficient model may be to use a thread pool (via ThreadPoolExecutor) and tune it for the optimal number of simultaneous threads on your system (e.g. you've suggested 8 above - I'd suggest this hinges to some degree on the number of free CPUs you have).
Each .gz file would be represented by one Callable submitted to the executor, and the executor will look after running multiple jobs simultaneously.

Related

How to prevent physical memory consuming when running parallel Java processes

I have big list (up to 500 000) of some functions.
My task is to generate some graph for each function (it can be do independently from other functions) and dump output to the file (it can be several files).
The process of generating graphs can be time consuming.
I also have server with 40 physical cores and 128GB ram.
I have tried to implement parallel processing using java Threads/ExecutorPool, but it seems not to use processors all resources.
On some inputs the program takes up to 25 hours to run and only 10-15 cores are working according to htop.
So the second thing I've tried is to create 40 distinct processes (using Runtime.exec) and split the list among them.
This method uses processor all resources (100% load on all 40 cores) and speedups performance up to 5 times on previous example (it takes only 5 hours which is reasonable for my task).
But the problem of this method is that, each java process runs separately and consumes memory independently from others. Is some scenarios all 128gb of ram is consumed after 5 minutes of parallel work. One solution that I am using now is to call System.gc() for each process if Runtime.totalMemory > 2GB. This slows down overall performance a bit (8 hours on previous input) but lefts memory usage in reasonable boundaries.
But this configuration works only for my server. If you run it on the server with 40 core and 64GB run, you need to tune Runtime.totalMemory > 2GB condition.
So the question is what is the best way to avoid such aggressive memory consuming?
Is it normal practice to run separate processes to do parallel jobs?
Is there any other parallel method in Java (maybe fork/join?) which uses 100% physical resources of processor.

You don't need to call System.gc() explicitly! The JVM will do it automatically when needed, and almost always does it better. You should, however, set the max heap size (-Xmx) to a number that works well.
If your program won't scale further you have some kind of congestion. You can either analyse your program and your java- and system settings and figure out why, or run it as multiple processes. If each process is multi-threaded, then you may get better performance using 5-10 processes instead of 40.
Note that you may get higher performance with more than one thread per core. Fiddle around with 1-8 threads per core and see if throughput increases.
From your description it sounds like you have 500,000 completely independent items of work and that each work item doesn't really need a lot of memory. If that is true, then memory consumption isn't really an issue. As long as each process has enough memory so it doesn't have to gc very often then gc isn't going to affect the total execution time by much. Just make sure you don't have any dangling references to objects you no longer need.

One of the problems here: it is still very hard to understand how many threads, cores, ... are actually available.
My personal suggestion: there are several articles on the java specialist newsletter which do a very deep dive into this subject.
For example this one: http://www.javaspecialists.eu/archive/Issue135.html
or a more recent new, on "the number of available processors": http://www.javaspecialists.eu/archive/Issue220.html

How to push a code to use as much CPU resources as possible?

I have a very large set of text files. The task was to calculate the document frequencies (number of document that contain a certain term) for all the terms (uniquely) inside this huge corpus. Simply starting from the first file and calculating everything in a serialized manner seemed to be a dumb thing to do (I admit I did it just to see how disastrous it is).
I realized that if I do this calculation in a Map-Reduce manner, meaning clustering my data into smaller pieces and in the end aggregating the results, I would get the results much faster.
My PC has 4 cores, so I decided to separate my data into 3 distinct subsets and feeding each subset to a separate thread waiting for all the threads to finish their work and passing their results to a another method to aggregate everything.
I tests it with a very small set of data, worked fined. Before I use the actual data, I tested it with a larger set to I can study its behaviour better. I started jvisualvm and htop to see how the cpu and memory is working. I can see that 3 threads are running and cpu cores are also busy. But the usage of these cores are rarely above 50%. This means that my application is not really using the full power of my PC. is this related to my code, or is this how it is supposed to be. My expectation was that each thread uses as much cpu core resource as possible.
I use Ubuntu.

Sounds to me that you have an IO bound application. You are spending more time in your individual threads reading the data from the disk then you are actually processing the information that is read.
You can test this by migrating your program to another system with a SSD to see if the CPU performance changes. You can also read in all of the files and then process them later to see if that changes the CPU curve during processing time. I suspect it will.

As already stated you're bottle-necked by something probably disk IO. Try separating the code that reads from disk from the code that processes the data, and use separate thread pools for each. Afterwards, a good way to quickly scale your thread pools to properly fit your resources is to use one of the Executors thread pools.

You are IO bound for a problem like this on a single machine, not CPU bound. Are you actively reading the files? Only if you had all the files in-memory would you start to saturate the CPU. That is why map-reduce is effective. It scales the total IO throughput more than CPU.

You can possibly speed up this quite a bit if you are on Linux and use tmpfs for storing the data in memory, instead on disk.

java- how to determine the optimum number of threads for a specific type of processing, in different types of servers

I have a java program which goes to some websites, converts the website's HTML into XML, then runs some xquery commands on the XML, finally stores the result into csv, which is then uploaded into Cloud file storage (like Amazon S3).
Now, I want to split the work into multiple threads so that it is done faster-- but how do I determine the number of threads that is optimum for my work?
I want to determine the number of threads that I should allow, for the different types of Amazon EC2 instances... Is there a library or framework that can help me with this?
Or, do I have to manually run the code on an Amazon EC2 instance, and keep changing the number of threads, and measure the time taken?
Specifically, I want to keep a balance between total time taken to process all threads, versus the number of threads that are allowed to run simultaneously... And if I could clearly see this correlation for different servers with different CPU/RAM capacities that would be great...Any advice/guidance would be appreciated...

The type of work you describe is almost certainly I/O bound -- most of the time is spent waiting for data to be downloaded or uploaded. If so, your goal is simply to make full use of upload / download bandwidth.
If so, the optimal number of threads will be more than the number of physical cores on the machine (which would be the right place to start for a CPU-bound process).
It's hard to say from this info what the optimum number of threads will be as it depends on how much you're downloading and how fast the link is. Try doubling the number of threads until performance starts to suffer.

I think you should profile your app with single thread using JHAT, MAT, etc... and then decide how many thread based on machine config you want to run. It will give you a general idea of how expensive your thread is. You can then run load test (like 10,000 items queued up against 10 threads) to validate the limits that you came up with, and tune accordingly.

To find the number of logical cores available you can use:
int processors = Runtime.getRuntime().availableProcessors();
and create a ThreadPool with that many. See also :
Finding Number of Cores in Java
Java: How to scale threads according to cpu cores?

Code in the thread pool runs much slower than unthreaded

I have the code which reads a set of binary files which essentially consist from a lot of serialized java objects. I'm trying to parallelize the code, by running the reading of the files in the thread pool ( Executors.newFixedThreadPool )
What I'm seeing is that when threaded, the reading runs actually slower than in a single thread -- from 1.5 to 10 time slower,depending on the number of threads.
In my test-case I'm actually reading the same file (35mb) from multiple threads, so I'm not bound by I/O in any way. I do not run more threads than CPUs and I do not have any synchronization between pools -- i.e. I'm just processing independently a bunch of files.
Does anyone have an idea what could be a possible reason for this slow performance when threaded ? What should I look for ? Or what's the best way to dissect the problem? I already looked for static variables in the classes, which could be shared between threads and I don't see any.
Can one of the java.* classes when instantiated in the thread run significantly slower, (e.g. java.zip.deflate which I'm using)?
Thanks for any hints.
Upd: Another interesting hint is that when a single thread is running the execution time of the function which does the reading is constant to high precision, but when running multiple threads, I see significant variation in timings.

Sounds to me like you are expecting a java.zip.deflate read of 35mb to run faster when you add multiple threads doing the same job. It won't. In fact, although you may not be IO bound, you are still incurring kernel overhead with each thread that you add -- buffer copies, etc.. Even if you are reading entirely out of kernel buffer space, you incur CPU and processing overhead.
That said, I am surprised that you incur 1.5 to 10 times slower. If each of your processing threads is then writing output then obviously that won't be cached.
However I suspect that you may be incurring memory contention. If you are handling a Java serialized object stream, you need to watch your memory consumption unless you are resetting it often. Serialization keeps a lot of references around to objects so that large contiguous streams can generate a tremendous amount of GC bandwidth.
I'd connect to your program using jconsole and watch the memory tab closely. As the survivor and old-gen spaces fill you will see non-linear CPU implications.

Just because all thread workers are reading from same file does not mean for sure it is not IO bound. It might be. It might not be. To be sure, setup your test case so that all thread workers are reading from a file in memory vs. off disk.
You mentioned above that you believe the OS has cached the file, but do you know for sure if the file is being opened in read-only/shared mode? If not, then the OS could still be locking the file to insure only one thread has access at a time.
Potentially related links:
Reading a single file with Multiple Thread: should speed up?
Java multi-thread application that reads a single file

The problem was caused by java.util.zip.Inflate class which actually has lot of synchronized methods (because several of them use native code), so when multiple threads are being run, the synchronized methods are competing with each other and making the code very close to sequential.
The solution was to replace the java.util.zip classes by the java only version from GNU classpath (e.g. from here http://git.savannah.gnu.org/cgit/classpath.git/tree/java/util/zip)

Processing huge files in java

I have a huge file of around 10 GB. I have to do operations such as sort, filter, etc on the files in Java. Each operation can be done in parallel.
Is it good to start 10 threads and read the file in parallel ? Each thread reads 1 GB of the file.
Is there any other option to solve the issue with extra large files and processing them as fast as possible? Is NIO good for such scenarios?
Currently, I am performing operations in serial and it takes around 20 mins to process such files.
Thanks,

Is it good to start 10 threads and read the file in parallel ?
Almost certainly not - although it depends. If it's from an SSD (where there's effectively no seek time) then maybe. If it's a traditional disk, definitely not.
That doesn't mean you can't use multiple threads though - you could potentially create one thread to read the file, performing only the most rudimentary tasks to get the data into processable chunks. Then use a producer/consumer queue to let multiple threads process the data.
Without knowing more than "sort, filter, etc" (which is pretty vague) we can't really tell how parallelizable the process is in the first place - but trying to perform the IO in parallel on a single file will probably not help.

Try profiling the code to see where the bottlenecks are. Have you tried having one thread read the whole file (or as much as possible), and give that off to 10 threads for processing? If File I/O is your bottleneck (which seems plausible), this should improve your overall run time.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.