I would like to read a huge binary file (~100GB) efficiently in Java. I have to process each line of it, and the line processing will happen in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What would be the optimum buffer size? Is there any formula for that?
If this is a binary file, then reading in "lines" does not make a lot of sense.
If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process (see the sketch after the tips below).
And repeat.
Tips:
Use a bounded (blocking) queue in case you can read lines faster than you can process them.
Recycle the byte[] objects to reduce garbage generation.
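A minimal sketch of that producer loop, assuming '\n' marks the end of a "line" and that worker threads take byte[] lines off a bounded queue such as an ArrayBlockingQueue. The names and sizes are illustrative, the usual java.io and java.util.concurrent imports are assumed, and the byte[] recycling from the tips is omitted for brevity:

// Producer: reads bytes, cuts them into "lines" at '\n', hands them to workers.
static void readLines(String path, BlockingQueue<byte[]> queue)
        throws IOException, InterruptedException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
        byte[] buf = new byte[64 * 1024];
        int len = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                queue.put(Arrays.copyOf(buf, len));   // blocks if the workers fall behind
                len = 0;
            } else {
                if (len == buf.length) buf = Arrays.copyOf(buf, buf.length * 2);  // grow for long lines
                buf[len++] = (byte) b;
            }
        }
        if (len > 0) queue.put(Arrays.copyOf(buf, len));  // trailing line without a '\n'
    }
}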
If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.
If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated, but potentially faster than read() or readLine().
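For comparison, a rough sketch of that NIO variant, reading fixed-size chunks through a FileChannel into a ByteBuffer; the 64K size and the file name are arbitrary assumptions, not from the question:

try (FileChannel ch = FileChannel.open(Paths.get("huge.bin"), StandardOpenOption.READ)) {
    ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
    while (ch.read(buf) != -1) {
        buf.flip();
        // scan buf for end-of-"line" markers here, carrying any partial line over to the next pass
        buf.clear();
    }
}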
Does reading in chunks work?
BufferedReader or BufferedInputStream both read in chunks, under the covers.
What will be the optimum buffer size?
The exact buffer size probably isn't that important. I'd make it a few KB or tens of KB.
Any formula for that?
No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
Java 8, streaming
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}
Related
I read the following blog:
https://medium.com/@jerzy.chalupski/a-closer-look-at-the-okio-library-90336e37261
It is said that"
the Sinks and Sources are often connected into a pipe. Smart folks at Square realized that there’s no need to copy the data between such pipe components like the java.io buffered streams do. All Sources and Sinks use Buffers under the hood, and Buffers keep the data in Segments, so quite often you can just take an entire Segment from one Buffer and move it to another."
I just don't understand where the copying of data happens in java.io, and in which cases a Segment would be moved to another Buffer.
After reading the source code of Okio: if I write strings to a file with Okio like the following:
val sink = logFile.appendingSink().buffer()
sink.writeUtf8("xxxx")
there will be no "moving a Segment to another Buffer". Am I right?
Java's BufferedReader is just a Reader that buffers data into a buffer – the buffer being a char[], or something like that – so that every time you need a bunch of bytes/chars from it, it doesn't need to read bytes from a file/network/whatever your byte source is (as long as it has buffered enough bytes). A BufferedWriter does the opposite operation: whenever you write a bunch of bytes to the BufferedWriter, it doesn't actually write bytes to a file/socket/whatever, but it "parks" them into a buffer, so that it can flush the buffer only once it's full.
Overall, this minimises access to file/network/whatever, as it could be expensive.
When you pipe a BufferedReader to a BufferedWriter, you effectively have 2 buffers. How does Java move bytes from one buffer to the other? It copies them from the source to the sink using System.arraycopy (or something equivalent). Everything works well, except that copying a bunch of bytes requires an amount of time that grows linearly as the size of the buffer(s) grow. Hence, copying 1 MB will take roughly 1000 times more than copying 1 KB.
Okio, on the other hand, doesn't actually copy bytes. Oversimplifying the way it works, Okio has a single byte[] with the actual bytes, and the only thing that gets moved from the source to the sink is the pointer (or reference) to that byte[], which requires the same amount of time regardless of its size.
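To make the copying in java.io concrete, this is essentially the loop in which bytes are copied array to array; a sketch, with in and out standing for any buffered InputStream/OutputStream rather than code from a particular library:

byte[] buf = new byte[8192];
int n;
while ((n = in.read(buf)) != -1) {   // a BufferedInputStream copies from its internal buffer into buf
    out.write(buf, 0, n);            // a BufferedOutputStream copies from buf into its internal buffer
}

As the blog post describes, Okio's Sink.write(Buffer source, long byteCount) can instead hand whole Segments from one Buffer to the other without touching the individual bytes.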
I want to convert an input stream to a byte[] and I'm using IOUtils.toByteArray(inputStream). Would it be more efficient to use a wrapper like BufferedInputStream for the inputStream? Does it save memory?
Would it be more efficient to use a wrapper like BufferedInputStream for the inputStream?
Not by any significance. IOUtils.toByteArray reads data into a buffer of 4096 bytes. BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really improves performance, because it reduces the number of operating-system calls by a factor of about 8000. And operating-system calls take a lot of time, comparatively.
Does it save memory?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
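As a rough illustration of that single-byte case (the file name is a placeholder):

InputStream in = new BufferedInputStream(new FileInputStream("data.bin"));
int b;
while ((b = in.read()) != -1) {
    // handle the byte b; most of these read() calls are served from the 8K
    // internal buffer, only roughly one in 8192 goes down to the operating system
}
in.close();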
In terms of final memory consumption it wouldn't help, as you will need to move the whole stream into a byte[] anyway; the size of that array would be the same, so memory consumption would be the same.
What BufferedInputStream does is wrap another stream: instead of hitting the underlying stream for every read, it fills an internal buffer with a larger block and then serves reads from that buffer until it is empty. It can make your read operations faster, as the underlying stream is accessed in batches instead of on every call, but it doesn't reduce the amount of data you end up holding on the other side.
I have a file containing data that is meaningful only in chunks of a certain size, which is prepended to each chunk, e.g.
{chunk_1_size}
{chunk_1}
{chunk_2_size}
{chunk_2}
{chunk_3_size}
{chunk_3}
{chunk_4_size}
{chunk_4}
{chunk_5_size}
{chunk_5}
.
.
{chunk_n_size}
{chunk_n}
The file is really really big ~ 2GB and the chunk size is ~20MB (which is the buffer that I want to have)
I would like to buffer-read this file to reduce the number of calls to the actual hard disk.
But I am not sure how much buffer to have because the chunk size may vary.
Pseudocode of what I have in mind:
while (!EOF) {
    /* the chunk size is an integer, i.e. 4 bytes */
    int chunkSize = readChunkSize();
    /* according to the chunk size, read that number of bytes from the file */
    readChunk(chunkSize);
}
If, let's say, I pick an arbitrary buffer size, then I might run into situations like:
First buffer contains chunkSize_1 + chunk_1 + partialChunk_2 -- I have to keep track of the leftover, then get the remaining bytes of the chunk from the next buffer and concatenate them to the leftover to complete the chunk.
First buffer contains chunkSize_1 + chunk_1 + partialChunkSize_2 (the chunk size is an integer, i.e. 4 bytes, so let's say I get only two of those bytes from the first buffer) -- I have to keep track of partialChunkSize_2 and then get the remaining bytes from the next buffer to form an integer that actually gives me the next chunkSize.
The buffer might not even be able to hold one whole chunk at a time -- I have to keep calling read until the first chunk is completely read into memory.
You don't have much control over the number of calls to the hard disk. There are several layers between you and the hard disk (OS, driver, hardware buffering) that you cannot control.
Set a reasonable buffer size in your Java code (1M) and forget about it unless and until you can prove there is a performance issue that is directly related to buffer sizes. In other words, do not fall into the trap of premature optimization.
See also https://stackoverflow.com/a/385529/18157
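One way to sidestep the partial-chunk bookkeeping listed in the question is to put the buffering underneath a DataInputStream and let readInt()/readFully() span buffer boundaries for you. A sketch, assuming each size prefix is a 4-byte big-endian int as in the layout shown, with "data.bin" and processChunk() as hypothetical placeholders:

try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("data.bin"), 1024 * 1024))) {  // 1M buffer, as suggested above
    while (true) {
        int chunkSize;
        try {
            chunkSize = in.readInt();        // 4-byte chunk size
        } catch (EOFException e) {
            break;                           // clean end of file
        }
        byte[] chunk = new byte[chunkSize];
        in.readFully(chunk);                 // reads the whole ~20MB chunk, across buffer refills
        processChunk(chunk);                 // hypothetical handler
    }
}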
You might need to do some analysis to get an idea of the average chunk size before choosing a buffer size for reading the data.
You are saying you want to keep a buffer and read until a chunk is complete, so that you have meaningful data.
Are you copying the file somewhere else, or sending this data on to another place? For some of those activities the Java NIO packages have better facilities than reading data into JVM buffers.
The buffer size should be large enough to read the bulk of a chunk in one go.
If you plan to hold the data in memory, reading it through buffers and keeping it there is still a memory-costly operation; buffers can be freed in several ways, e.g. with basic flush operations.
Please also check Apache Commons IO (FileUtils) for reading/writing data.
How do I determine the buffer size when using a BufferedInputStream for reading a batch of files? Is it based on the file size? I'm using:
byte[] buf = new byte[4096];
If I increase the buffer size, will it read more quickly?
The default, which is deliberately undocumented, is 8192 bytes. Unless you have a compelling reason to change it, don't change it.
You can easily test it yourself, but it's not really a big issue. A few kilobytes is enough for the buffer, so you'll get good reading speeds.
If you profile your application and do realize that file IO is a performance bottleneck, there are ways to make it quicker, such as memory-mapping the file.
What you show there is the "byte size" that you are reading into (the array).
If you are reading from a FileInputStream (i.e. non buffered), then changing that size will change the read size, yes.
This is different from the internal buffer used by BufferedInputStream. That buffer doesn't have a getter, but you can specify its size in the constructor and "remember it" from that, I suppose. The default is 8K, which may not be optimal.
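In other words, the array you read into and the BufferedInputStream's internal buffer are sized independently; a sketch with arbitrary sizes and a placeholder file name:

InputStream in = new BufferedInputStream(new FileInputStream("some-file.bin"), 64 * 1024); // 64K internal buffer
byte[] buf = new byte[4096];   // the array you read into
int n = in.read(buf);          // fills up to 4096 bytes, mostly served from the 64K internal buffer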
Can anyone recommend whether I should do something like:
os = new GZIPOutputStream(new BufferedOutputStream(...));
or
os = new BufferedOutputStream(new GZIPOutputStream(...));
Which is more efficient? Should I use BufferedOutputStream at all?
GZIPOutputStream already comes with a built-in buffer. So, there is no need to put a BufferedOutputStream right next to it in the chain. gojomo's excellent answer already provides some guidance on where to place the buffer.
The default buffer size for GZIPOutputStream is only 512 bytes, so you will want to increase it to 8K or even 64K via the constructor parameter. The default buffer size for BufferedOutputStream is 8K, which is why you can measure an advantage when combining the default GZIPOutputStream and BufferedOutputStream. That advantage can also be achieved by properly sizing the GZIPOutputStream's built-in buffer.
So, to answer your question: "Should I use BufferedOutputStream at all?" → No, in your case, you should not use it, but instead set the GZIPOutputStream's buffer to at least 8K.
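That would look something like this (the 64K size and the file name are just examples):

os = new GZIPOutputStream(new FileOutputStream("file.gz"), 64 * 1024);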
In what order should I use GZIPOutputStream and BufferedOutputStream?
For object streams, I found that wrapping the buffered stream around the gzip stream for both input and output was almost always significantly faster. The smaller the objects, the better this did. Better than or the same as no buffered stream in all cases.
ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(fis)));
oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(fos)));
However, for text and straight byte streams, I found that it was a toss-up, with the gzip stream around the buffered stream being only slightly better. But better than no buffered stream in all cases.
reader = new InputStreamReader(new GZIPInputStream(new BufferedInputStream(fis)));
writer = new OutputStreamWriter(new GZIPOutputStream(new BufferedOutputStream(fos)));
I ran each version 20 times, cut off the first run, and averaged the rest. I also tried buffered-gzip-buffered, which was slightly better for objects and worse for text. I did not play with buffer sizes at all.
For the object streams, I tested 2 serialized object files in the 10s of megabytes. For the larger file (38mb), it was 85% faster on reading (0.7 versus 5.6 seconds) but actually slightly slower for writing (5.9 versus 5.7 seconds). These objects had some large arrays in them which may have meant larger writes.
method crc date time compressed uncompressed ratio
defla eb338650 May 19 16:59 14027543 38366001 63.4%
For the smaller file (18mb), it was 75% faster for reading (1.6 versus 6.1 seconds) and 40% faster for writing (2.8 versus 4.7 seconds). It contained a large number of small objects.
method crc date time compressed uncompressed ratio
defla 92c9d529 May 19 16:56 6676006 17890857 62.7%
For the text reader/writer I used a 64mb csv text file. The gzip stream around the buffered stream was 11% faster for reading (950 versus 1070 milliseconds) and slightly faster when writing (7.9 versus 8.1 seconds).
method crc date time compressed uncompressed ratio
defla c6b72e34 May 20 09:16 22560860 63465800 64.5%
The buffering helps when the ultimate destination of the data is best read/written in larger chunks than your code would otherwise push it. So you generally want the buffering to be as close as possible to the place that wants the larger chunks. In your examples, that's the elided "...", so wrap the BufferedOutputStream with the GZIPOutputStream. And tune the BufferedOutputStream buffer size to match what testing shows works best with the destination.
I doubt a BufferedOutputStream on the outside would help much, if at all, over no explicit buffering. Why not? The GZIPOutputStream will do its write()s to "..." in the same-sized chunks whether the outside buffering is present or not. So there's no optimizing for "..." possible; you're stuck with whatever sizes GZIPOutputStream write()s.
Note also that you're using memory more efficiently by buffering the compressed data rather than the uncompressed data. If your data often achieves 6X compression, the 'inside' buffer is equivalent to an 'outside' buffer 6X as big.
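Putting that together, a sketch of the arrangement described in this answer, with the buffer sitting next to the destination so that it holds compressed bytes (the size and file name are examples):

os = new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream("file.gz"), 64 * 1024));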
Normally you want a buffer close to your FileOutputStream (assuming that's what ... represents) to avoid too many calls into the OS and frequent disk access. However, if you're writing a lot of small chunks to the GZIPOutputStream, you might benefit from a buffer around the GZIPOutputStream as well. The reason is that the write method in GZIPOutputStream is synchronized, and it also leads to a few other synchronized calls and a couple of native (JNI) calls (to update the CRC32 and do the actual compression). These all add extra overhead per call. So in that case I'd say you'll benefit from both buffers.
I suggest you try a simple benchmark to time how long it takes to compress a large file and see if it makes much difference. GZIPOutputStream does have buffering, but it is a smaller buffer. I would try the first option with a 64K buffer, but you might find that doing both is better.
Read the javadoc, and you will discover that BufferedInputStream is used to buffer bytes read from some original source. Once you have the raw (compressed) bytes, you want to decompress them, so you wrap the BufferedInputStream with a GZIPInputStream. It makes little sense to put the buffering outside the GZIP stream instead, because then you have to ask: who is buffering the bytes that GZIP itself reads from the source?
new GZIPInputStream(new BufferedInputStream(new FileInputXXX