I have a huge file of around 10 GB. I have to perform operations such as sort, filter, etc. on the file in Java. Each operation can be done in parallel.
Is it good to start 10 threads and read the file in parallel? Each thread would read 1 GB of the file.
Is there any other option to solve the issue with extra large files and processing them as fast as possible? Is NIO good for such scenarios?
Currently, I am performing operations in serial and it takes around 20 mins to process such files.
Thanks,
Is it good to start 10 threads and read the file in parallel?
Almost certainly not - although it depends. If it's from an SSD (where there's effectively no seek time) then maybe. If it's a traditional disk, definitely not.
That doesn't mean you can't use multiple threads though - you could potentially create one thread to read the file, performing only the most rudimentary tasks to get the data into processable chunks. Then use a producer/consumer queue to let multiple threads process the data.
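A minimal sketch of that producer/consumer split, assuming a line-oriented file where a chunk of lines is the unit of work; the chunk size, the file name, and process() are placeholders rather than anything from the question:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedReader {
    private static final List<String> POISON = new ArrayList<>(); // end-of-input marker

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(64);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Consumers: take chunks off the queue and process them.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        List<String> chunk = queue.take();
                        if (chunk == POISON) { queue.put(POISON); return; } // pass the marker on and stop
                        process(chunk); // your sort/filter logic goes here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single producer: sequential read, only the most rudimentary work per line.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-file.txt"))) {
            List<String> chunk = new ArrayList<>(10_000);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == 10_000) {
                    queue.put(chunk);
                    chunk = new ArrayList<>(10_000);
                }
            }
            if (!chunk.isEmpty()) queue.put(chunk);
        }
        queue.put(POISON); // tell the consumers there is no more input
        pool.shutdown();
    }

    private static void process(List<String> chunk) {
        // placeholder for the actual per-chunk work
    }
}
```

The bounded queue gives you back-pressure for free: if the workers fall behind, the reader blocks instead of filling the heap with unprocessed chunks.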
Without knowing more than "sort, filter, etc" (which is pretty vague) we can't really tell how parallelizable the process is in the first place - but trying to perform the IO in parallel on a single file will probably not help.
Try profiling the code to see where the bottlenecks are. Have you tried having one thread read the whole file (or as much as possible) and hand that off to 10 threads for processing? If file I/O is your bottleneck (which seems plausible), this should improve your overall run time.
Related
I have looked at examples that describe best practices for file create/write operations, but none of them take my requirements into consideration. I have to create a class which reads the contents of one file, does some data transformation, writes the transformed contents to a different file, and then sends that file to a web service. Both files can ultimately be quite large, up to around 20 MB, and it is unpredictable when these files will be created because they are generated by users. There could be two minutes between occurrences of this process, or several could happen in the same second. The load is not extreme; it will not be hundreds of these operations in the same second, but it could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Reading the first file.
Doing the data transformation.
Writing the contents to the new file.
Sending the file to the service.
Deleting the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is whether you need to write the file to disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care whether the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk would be if you would communicate between your processes via disk files (i.e. your producer writes a file to disk, sends some notification to your consumer and afterwards your consumer reads the file from disk - for example based on a file name it receives from the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
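A rough sketch of that, assuming the transformation can work on one chunk of bytes at a time and that the result can stay in memory for the web-service call (per the point above); transform() and the 8 KB buffer size are placeholders:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transformer {

    /** Reads the input file in 8 KB chunks, transforms each chunk,
     *  and returns the result as a byte array ready to hand to the web service. */
    public static byte[] transformFile(Path input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(input))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(transform(buffer, read)); // placeholder transformation
            }
        }
        return out.toByteArray();
    }

    private static byte[] transform(byte[] chunk, int length) {
        // placeholder: apply the real data transformation to this chunk
        byte[] result = new byte[length];
        System.arraycopy(chunk, 0, result, 0, length);
        return result;
    }
}
```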
Regarding adding multiple threads, you should test whether that improves your application's performance or not. If your application is already I/O intensive, adding multiple threads will add even more contention on your I/O streams, which would result in a performance degradation.
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but may be overkill if your needs are just a couple of files, given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. one thread will download the file, process it, and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
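A rough sketch of that shape; the pipeline.threads property name and the download/process/upload bodies are placeholders, not your actual service calls:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TransferPipeline {
    // Pool size exposed as a system property so it can be tuned without code changes.
    private final int poolSize = Integer.getInteger("pipeline.threads", 4);
    private final ExecutorService pool = Executors.newFixedThreadPool(poolSize);

    public void submit(String remoteName) {
        pool.submit(() -> {
            try {
                byte[] raw = download(remoteName);   // fetch the source file
                byte[] result = process(raw);        // apply the data transformation
                upload(remoteName, result);          // push the result to the service
            } catch (Exception e) {
                // log and decide whether to retry or give up
            }
        });
    }

    private byte[] download(String name) { /* placeholder */ return new byte[0]; }
    private byte[] process(byte[] data) { /* placeholder */ return data; }
    private void upload(String name, byte[] data) { /* placeholder */ }

    public void shutdown() { pool.shutdown(); }
}
```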
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But on a modern (e.g. Linux) operating system on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time).
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)
I am working on an application which has to read and process ~29K files (~500 GB) every day. The files are in zipped format and available on an FTP server.
What I have done: I download the files from FTP, unzip them, and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a smaller number). I've written some code and tested it on ~3.5K files (~32 GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that each file has to be read line by line, and each line has to be written to a new file with some modification (some information removed and some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; do it in memory instead.
Maybe there is a lot of garbage collection. See if you can re-use objects.
Maybe the network is the bottleneck, etc.
You can, for example, use VisualVM.
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some ideas:
You can parallelize the processing of the information that has been read: hand multiple read lines to one thread (out of a pool), which processes them sequentially (a sketch follows this list)
Use java.nio instead of java.io see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of a profiler, simply write log messages and measure the duration in multiple parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
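For the first idea above, here is a sketch of decompressing one .gz file in memory (rather than unzipping to disk) and handing batches of lines to a worker pool; the batch size, file name, and processBatch() are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.GZIPInputStream;

public class GzLineProcessor {
    private static final int BATCH_SIZE = 5_000; // lines handed to a worker at once

    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        Path gz = Paths.get("input-0001.gz"); // placeholder file name
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gz)), StandardCharsets.UTF_8))) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    List<String> work = batch;
                    workers.submit(() -> processBatch(work)); // sequential work per batch
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                List<String> work = batch;
                workers.submit(() -> processBatch(work));
            }
        }
        workers.shutdown();
    }

    private static void processBatch(List<String> lines) {
        // placeholder: remove/add information per line and append to the output file
    }
}
```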
If you are interested in parallel computing, then try Apache Spark; it is meant to do exactly what you are looking for.
I have a very large set of text files. The task was to calculate the document frequencies (the number of documents that contain a certain term) for all the unique terms inside this huge corpus. Simply starting from the first file and calculating everything serially seemed to be a dumb thing to do (I admit I did it just to see how disastrous it was).
I realized that if I do this calculation in a Map-Reduce manner, meaning clustering my data into smaller pieces and in the end aggregating the results, I would get the results much faster.
My PC has 4 cores, so I decided to separate my data into 3 distinct subsets, feed each subset to a separate thread, wait for all the threads to finish their work, and pass their results to another method to aggregate everything.
I tested it with a very small set of data and it worked fine. Before using the actual data, I tested it with a larger set so I could study its behaviour better. I started jvisualvm and htop to see how the CPU and memory were doing. I can see that 3 threads are running and the CPU cores are busy, but the usage of these cores is rarely above 50%. This means that my application is not really using the full power of my PC. Is this related to my code, or is this how it is supposed to be? My expectation was that each thread would use as much CPU core resource as possible.
I use Ubuntu.
Sounds to me like you have an I/O bound application. You are spending more time in your individual threads reading the data from the disk than you are actually processing the information that is read.
You can test this by migrating your program to another system with a SSD to see if the CPU performance changes. You can also read in all of the files and then process them later to see if that changes the CPU curve during processing time. I suspect it will.
As already stated, you're bottlenecked by something, probably disk I/O. Try separating the code that reads from disk from the code that processes the data, and use separate thread pools for each. Afterwards, a good way to quickly scale your thread pools to properly fit your resources is to use one of the Executors factory methods.
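One way to wire that up, as a rough sketch; the single reader thread, the countTerms() placeholder, and the pool sizes are assumptions, not the asker's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitPools {
    // One thread does all the disk reads; the CPU-bound work gets its own pool.
    private static final ExecutorService ioPool = Executors.newSingleThreadExecutor();
    private static final ExecutorService cpuPool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());

    public static void submitFile(Path file) {
        ioPool.submit(() -> {
            try {
                List<String> lines = Files.readAllLines(file); // read on the I/O thread
                cpuPool.submit(() -> countTerms(lines));       // process on a CPU thread
            } catch (IOException e) {
                // log and skip the file
            }
        });
    }

    private static void countTerms(List<String> lines) {
        // placeholder: update the per-term document counts here
    }
}
```

Keeping a single reader thread serializes the disk access (avoiding seek thrashing) while the CPU pool keeps the cores busy.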
You are IO bound for a problem like this on a single machine, not CPU bound. Are you actively reading the files? Only if you had all the files in-memory would you start to saturate the CPU. That is why map-reduce is effective. It scales the total IO throughput more than CPU.
You can possibly speed this up quite a bit if you are on Linux and use tmpfs to store the data in memory instead of on disk.
I have a collection of 1000 files in gz format. I want to process them in chunks in parallel, say 8 in each round. When I let every thread open a file and read from disk, it resulted in a significant delay due to many threads trying to read from different locations.
I just wonder if there is an efficient method to handle multiple file reads. Or should I buffer all the files into memory first (e.g. all 8 files) and then hand the buffers to the threads? If so, what would be the best way of buffering the files? A buffer array, or some alternative structure?
Thank you.
If you use a fixed-size pool of, say, 8 threads (because you have 8 cores), you may find this is reasonably efficient, as decompressing the files is CPU intensive.
You may find, however, that this is no faster than using only 4 or even 2 threads, because the real bottleneck is reading the data from disk. If that is the case, the only thing you can do is get a faster disk, e.g. mirror the disk, or use an SSD, which can be 20x faster.
I suspect you're swamping your process with 1000 threads. Threads aren't particularly lightweight (e.g. each will grab by default 512k of stack space).
A more efficient model may be to use a thread pool (via ThreadPoolExecutor) and tune it for the optimal number of simultaneous threads on your system (e.g. you've suggested 8 above - I'd suggest this hinges to some degree on the number of free CPUs you have).
Each .gz file would be represented by one Callable submitted to the executor, and the executor will look after running multiple jobs simultaneously.
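A sketch of that arrangement; the pool size of 8, collectGzFiles(), and the per-line work are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

public class GzJobRunner {

    /** One Callable per .gz file; the pool bounds how many run at once. */
    static Callable<Long> jobFor(Path gzFile) {
        return () -> {
            long lineCount = 0;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
                while (reader.readLine() != null) {
                    lineCount++; // placeholder: do the real per-line work here
                }
            }
            return lineCount;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to your machine
        List<Future<Long>> results = new ArrayList<>();
        for (Path gz : collectGzFiles()) {
            results.add(pool.submit(jobFor(gz)));
        }
        for (Future<Long> f : results) {
            f.get(); // wait for each job and surface any exception
        }
        pool.shutdown();
    }

    private static List<Path> collectGzFiles() {
        // placeholder: list the 1000 .gz files here
        return new ArrayList<>();
    }
}
```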
I have code which reads a set of binary files that essentially consist of a lot of serialized Java objects. I'm trying to parallelize the code by running the reading of the files in a thread pool (Executors.newFixedThreadPool).
What I'm seeing is that when threaded, the reading actually runs slower than in a single thread, from 1.5 to 10 times slower, depending on the number of threads.
In my test-case I'm actually reading the same file (35mb) from multiple threads, so I'm not bound by I/O in any way. I do not run more threads than CPUs and I do not have any synchronization between pools -- i.e. I'm just processing independently a bunch of files.
Does anyone have an idea what could be a possible reason for this slow performance when threaded? What should I look for? Or what's the best way to dissect the problem? I already looked for static variables in the classes, which could be shared between threads, and I don't see any.
Can one of the java.* classes, when instantiated in a thread, run significantly slower (e.g. java.util.zip.Deflater, which I'm using)?
Thanks for any hints.
Upd: Another interesting hint is that when a single thread is running the execution time of the function which does the reading is constant to high precision, but when running multiple threads, I see significant variation in timings.
Sounds to me like you are expecting a java.util.zip deflate read of 35 MB to run faster when you add multiple threads doing the same job. It won't. In fact, although you may not be I/O bound, you are still incurring kernel overhead with each thread that you add: buffer copies, etc. Even if you are reading entirely out of kernel buffer space, you incur CPU and processing overhead.
That said, I am surprised that you incur 1.5 to 10 times slower. If each of your processing threads is then writing output then obviously that won't be cached.
However I suspect that you may be incurring memory contention. If you are handling a Java serialized object stream, you need to watch your memory consumption unless you are resetting it often. Serialization keeps a lot of references around to objects so that large contiguous streams can generate a tremendous amount of GC bandwidth.
I'd connect to your program using jconsole and watch the memory tab closely. As the survivor and old-gen spaces fill you will see non-linear CPU implications.
Just because all thread workers are reading from the same file does not mean for sure it is not I/O bound. It might be. It might not be. To be sure, set up your test case so that all thread workers are reading from a file in memory vs. off disk.
You mentioned above that you believe the OS has cached the file, but do you know for sure whether the file is being opened in read-only/shared mode? If not, then the OS could still be locking the file to ensure only one thread has access at a time.
Potentially related links:
Reading a single file with Multiple Thread: should speed up?
Java multi-thread application that reads a single file
The problem was caused by the java.util.zip.Inflater class, which actually has a lot of synchronized methods (because several of them use native code), so when multiple threads are run, the synchronized methods compete with each other and make the code very close to sequential.
The solution was to replace the java.util.zip classes with the Java-only versions from GNU Classpath (e.g. from http://git.savannah.gnu.org/cgit/classpath.git/tree/java/util/zip).