I read the following blog:
https://medium.com/#jerzy.chalupski/a-closer-look-at-the-okio-library-90336e37261
It says:
"The Sinks and Sources are often connected into a pipe. Smart folks at Square realized that there’s no need to copy the data between such pipe components like the java.io buffered streams do. All Sources and Sinks use Buffers under the hood, and Buffers keep the data in Segments, so quite often you can just take an entire Segment from one Buffer and move it to another."
I don't understand where the copying of data happens in java.io,
or in which cases a Segment would be moved to another Buffer.
I have read the Okio source code. If I write strings to a file with Okio like this:
val sink = logFile.appendingSink().buffer()
sink.writeUtf8("xxxx")
there will be no "moving a Segment to another Buffer". Am I right?
Java's BufferedReader is just a Reader that buffers data into a buffer – the buffer being a char[], or something like that – so that every time you need a bunch of bytes/chars from it, it doesn't need to read bytes from a file/network/whatever your byte source is (as long as it has buffered enough bytes). A BufferedWriter does the opposite operation: whenever you write a bunch of bytes to the BufferedWriter, it doesn't actually write bytes to a file/socket/whatever, but it "parks" them into a buffer, so that it can flush the buffer only once it's full.
Overall, this minimises access to file/network/whatever, as it could be expensive.
When you pipe a BufferedReader to a BufferedWriter, you effectively have 2 buffers. How does Java move bytes from one buffer to the other? It copies them from the source to the sink using System.arraycopy (or something equivalent). Everything works well, except that copying a bunch of bytes takes an amount of time that grows linearly with the size of the buffer(s). Hence, copying 1 MB will take roughly 1000 times longer than copying 1 KB.
Okio, on the other hand, doesn't actually copy bytes. Oversimplifying the way it works, Okio has a single byte[] with the actual bytes, and the only thing that gets moved from the source to the sink is the pointer (or reference) to that byte[], which requires the same amount of time regardless of its size.
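To make the difference concrete, here is a toy sketch in plain Java. It does not use Okio and is not how Okio is implemented; the class name SegmentMoveSketch, the queue-of-arrays "buffer" and the segment size are assumptions made up for the illustration. It only contrasts copying bytes with handing over a reference.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: real Okio Buffers and Segments are more involved than this.
class SegmentMoveSketch {
    // Illustrative segment size; a "buffer" here is just a queue of byte[] segments.
    static final int SEGMENT_SIZE = 8192;

    // java.io style: every byte is copied from the source array into a new array.
    static byte[] copyBytes(byte[] source) {
        byte[] target = new byte[source.length];
        System.arraycopy(source, 0, target, 0, source.length); // cost grows with the byte count
        return target;
    }

    // Simplified Okio style: hand over the reference to the segment; no bytes are copied.
    static void moveSegment(Deque<byte[]> from, Deque<byte[]> to) {
        to.addLast(from.removeFirst()); // same cost regardless of segment size
    }

    public static void main(String[] args) {
        Deque<byte[]> sourceBuffer = new ArrayDeque<>();
        Deque<byte[]> sinkBuffer = new ArrayDeque<>();
        byte[] segment = new byte[SEGMENT_SIZE];
        sourceBuffer.addLast(segment);

        moveSegment(sourceBuffer, sinkBuffer);
        System.out.println("same array moved: " + (sinkBuffer.peekFirst() == segment));
        System.out.println("copy is a different array: " + (copyBytes(segment) != segment));
    }
}
```

After moveSegment the sink holds the very same array object the source held, which is the "pointer move" described above; copyBytes always produces a second array.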
I want to convert an input stream to byte[] and I'm using IOUtils.toByteArray(inputStream). Will it be more efficient to use a wrapper like BufferedInputStream for the inputStream? Does it save memory?
Will it be more efficient to use a wrapper like BufferedInputStream for the inputStream?
Not significantly. IOUtils.toByteArray reads data into a buffer of 4096 bytes. BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really improves performance because it reduces the number of operating system calls by a factor of roughly 8000. And operating system calls take a lot of time, comparatively.
Does it save memory ?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
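The point about one-byte reads can be checked with a small experiment. The sketch below is only an illustration: CountingStream is a hypothetical stand-in for a file or socket stream that counts how many read calls reach it, and the data is an in-memory array rather than a real file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadCallCounter {
    // Hypothetical stand-in for a file or socket stream: counts how often it is asked for data.
    static class CountingStream extends InputStream {
        private final InputStream in;
        int calls = 0;

        CountingStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    // Reads the data one byte at a time and returns how many calls reached the source.
    static int callsWhenReadingByteByByte(byte[] data, boolean buffered) {
        CountingStream source = new CountingStream(data);
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        try {
            while (in.read() != -1) {
                // consume and discard
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        byte[] data = new byte[64 * 1024];
        System.out.println("unbuffered calls: " + callsWhenReadingByteByByte(data, false));
        System.out.println("buffered calls:   " + callsWhenReadingByteByByte(data, true));
    }
}
```

With a real file each counted call would be an operating system call, which is where the saving comes from. A caller that already reads in blocks of several KB would see little difference between the two variants.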
In terms of final memory consumption it wouldn't help: you need to move the whole stream into a byte[] anyway, the size of that array would be the same, so memory consumption would be the same.
A BufferedInputStream wraps another stream and, instead of reading from it directly on every call, reads a large block into an internal buffer and serves subsequent reads from that buffer until it is empty. That makes many small reads faster, because they are batched into a few large ones, but it doesn't help when the consumer already reads in large blocks, as IOUtils.toByteArray does.
I have a file containing data that is meaningful only in chunks of a certain size, and the size is written at the start of each chunk, e.g.
{chunk_1_size}
{chunk_1}
{chunk_2_size}
{chunk_2}
{chunk_3_size}
{chunk_3}
{chunk_4_size}
{chunk_4}
{chunk_5_size}
{chunk_5}
.
.
{chunk_n_size}
{chunk_n}
The file is really big (~2 GB) and the chunk size is ~20 MB (which is the buffer size that I want to have).
I would like to read this file through a buffer to reduce the number of calls to the actual hard disk.
But I am not sure how large the buffer should be, because the chunk size may vary.
Pseudocode of what I have in mind:
while (!EOF) {
    /* the chunk size is an integer, i.e. 4 bytes */
    chunkSize = readChunkSize();
    /* read that many bytes from the file */
    readChunk(chunkSize);
}
If, let's say, I pick an arbitrary buffer size, then I might run into situations like these:
The first buffer contains chunkSize_1 + chunk_1 + partialChunk_2. I have to keep track of the leftover, then get the remaining part of the chunk from the next buffer and concatenate it to the leftover to complete the chunk.
The first buffer contains chunkSize_1 + chunk_1 + partialChunkSize_2 (the chunk size is an integer, i.e. 4 bytes, so let's say I get only two of those bytes from the first buffer). I have to keep track of partialChunkSize_2 and then get the remaining bytes from the next buffer to form the integer that actually gives me the next chunkSize.
The buffer might not even be able to hold one whole chunk at a time. I have to keep calling read until the first chunk is completely read into memory.
You don't have much control over the number of calls to the hard disk. There are several layers between you and the hard disk (OS, driver, hardware buffering) that you cannot control.
Set a reasonable buffer size in your Java code (1M) and forget about it unless and until you can prove there is a performance issue that is directly related to buffer sizes. In other words, do not fall into the trap of premature optimization.
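As a sketch of why the partial-chunk cases from the question need no manual bookkeeping: DataInputStream.readInt() and readFully() keep reading until they have all the bytes they need, whatever the size of the BufferedInputStream underneath. The example assumes the chunk size is stored as a 4-byte big-endian int (which is what readInt expects; the question does not state the byte order), and ChunkReader, readChunks and sample are hypothetical names. A real 2 GB file would be processed chunk by chunk instead of being collected in a list.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ChunkReader {
    // Reads [4-byte big-endian size][payload] records until the end of the stream.
    static List<byte[]> readChunks(InputStream raw, int bufferSize) {
        List<byte[]> chunks = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(raw, bufferSize))) {
            while (true) {
                int size;
                try {
                    size = in.readInt();      // keeps reading until all 4 bytes have arrived
                } catch (EOFException endOfStream) {
                    break;                    // no further size header (a truncated one also ends here)
                }
                byte[] chunk = new byte[size];
                in.readFully(chunk);          // keeps reading until the chunk is complete
                chunks.add(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    // Builds a small in-memory sample in the same format (stand-in for the real file).
    static byte[] sample(int... sizes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int size : sizes) {
                out.writeInt(size);
                out.write(new byte[size]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The 16-byte buffer is deliberately smaller than the chunks to exercise the partial cases.
        List<byte[]> chunks = readChunks(new ByteArrayInputStream(sample(5, 100, 37)), 16);
        for (byte[] c : chunks) {
            System.out.println("chunk of " + c.length + " bytes");
        }
    }
}
```

The buffer size only affects how many reads reach the underlying stream, not correctness, which is why a fixed value such as 1M is a reasonable starting point.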
See also https://stackoverflow.com/a/385529/18157
You might need to do some analysis to get an idea of the average chunk size before choosing a buffer size for reading the data.
As I understand it, you want to fix a buffer size and keep reading until a chunk is complete, so that you have meaningful data.
Are you copying the file somewhere else, or sending this data to another place?
For some of these tasks the Java NIO packages have better implementations than reading the data into JVM buffers.
The buffer size should be large enough to hold the largest chunks of data.
If you plan to hold the data in memory, reading it through buffers and keeping it there is still a memory-costly operation; buffers can be freed in several ways using basic flush operations.
Please also check Apache Commons IO FileUtils for reading and writing data.
I would like to read a huge binary file (~100 GB) efficiently in Java. I have to process each line of it. The line processing will be done in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What will be the optimum buffer size? Is there any formula for that?
If this is a binary file, then reading in "lines" does not make a lot of sense.
If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks the end of your "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process.
And repeat.
Tips:
Use a bounded buffer in case you can read lines faster than you can process them.
Recycle the byte[] objects to reduce garbage generation.
If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.
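A minimal sketch of the reader side described above, assuming a single delimiter byte marks the end of each "line" (the question does not say what the real marker is). RecordProducer and produce are hypothetical names, and for brevity the sketch allocates a fresh byte[] per record instead of recycling arrays.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class RecordProducer {
    // Splits the stream on `delimiter` and hands each record to the queue.
    // put() blocks when the queue is full, so the reader cannot outrun the workers.
    static int produce(InputStream raw, int delimiter, BlockingQueue<byte[]> queue) {
        int records = 0;
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        try (InputStream in = new BufferedInputStream(raw)) {
            int b;
            while ((b = in.read()) != -1) {       // served from the in-memory buffer
                if (b == delimiter) {
                    queue.put(current.toByteArray());
                    current.reset();              // reuse the accumulator for the next record
                    records++;
                } else {
                    current.write(b);
                }
            }
            if (current.size() > 0) {             // last record without a trailing delimiter
                queue.put(current.toByteArray());
                records++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing", e);
        }
        return records;
    }

    public static void main(String[] args) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16); // bounded queue
        byte[] data = {1, 2, 0, 3, 0, 4, 5, 6};                     // 0 is the delimiter here
        int n = produce(new ByteArrayInputStream(data), 0, queue);
        System.out.println(n + " records queued");
    }
}
```

Worker threads would call queue.take() in a loop; the queue capacity (16 here, chosen arbitrarily) bounds how far the reader can get ahead of them.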
If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated but potentially faster than read() or readLine().
Does reading in chunks work?
BufferedReader or BufferedInputStream both read in chunks, under the covers.
What will be the optimum buffer size?
The buffer size is probably not that important. I'd make it a few KB or tens of KB.
Any formula for that?
No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
Java 8, streaming
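// Java 8+; needs java.nio.file.Files, java.nio.file.Paths and java.util.stream.Stream
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) { // closes the file when done
    lines.forEach(l -> {
        // Do anything line by line
    });
}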
This does not look trivial, especially for a buffered read/write FileChannel. Is there an open-source implementation somewhere that I can base my implementation on?
To be clear for those who did not understand:
FileChannel does buffering at the OS level and I want to do buffering at the Java level. Read here to understand: FileChannel#force and buffering
@Peter I want to write a huge file to disk from a fast message stream. Buffering and batching are the way to go. So I want to batch in Java and then call FileChannel.write.
I recommend using a BufferedOutputStream wrapping a FileOutputStream. I do not believe you will see any performance improvement by mucking with ByteBuffer and FileChannel, and that you'll be left with a lot of hard-to-maintain code if you go that route.
The reasoning is quite simple: regardless of the approach you take, the steps involved are the same:
1. Generate bytes. You don't say how you plan to do this, and it could introduce an additional level of temporary buffering into the equation. But regardless, the Java data has to be turned into bytes.
2. Accumulate bytes into a buffer. You want to buffer your data before writing it, so that you're not making lots of small writes. That's a given. But where that buffer lives is immaterial.
3. Move bytes from the Java heap to the C heap, across the JNI barrier. Writing a file is a native operation, and it doesn't read directly from the Java heap. So whether you buffer on the Java heap and then move the buffered bytes, or buffer in a direct ByteBuffer (and yes, you want a direct buffer), you're still moving the bytes. You will make more JNI calls with the ByteBuffer, but that's a marginal cost.
4. Invoke the operating system's write call (write on POSIX systems), a kernel call that copies bytes from the C heap into a kernel-maintained disk buffer.
5. Write the kernel buffer to disk. This will outweigh all the other steps combined, because disks are slow.
There may be a few microseconds gained or lost depending on exactly how you implement these steps, but you can't change the basic steps.
The FileChannel does give you the option to call force(), to ensure that step #5 actually happens. This is likely to actually decrease your overall performance, as the underlying fsync call will not return until the bytes are written. And if you really want to do it, you can always get the channel from the underlying stream.
Bottom line: I'm willing to bet that you're actually IO-bound, and there's no cure for that save better hardware.
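The recommendation above can be sketched roughly as follows. The 64 KB buffer size, the message contents and the names BufferedFileWriterSketch, writeMessages and tempFile are assumptions made for the example; the optional force() call is reached through the channel of the underlying FileOutputStream.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BufferedFileWriterSketch {
    // Writes `count` small messages through a Java-level buffer; returns the resulting file length.
    static long writeMessages(File target, int count, boolean forceToDisk) {
        try (FileOutputStream file = new FileOutputStream(target);
             BufferedOutputStream out = new BufferedOutputStream(file, 64 * 1024)) {
            byte[] message = "message\n".getBytes(StandardCharsets.US_ASCII);
            for (int i = 0; i < count; i++) {
                out.write(message);            // accumulates in the 64 KB buffer
            }
            out.flush();                       // hand the buffered bytes to the OS
            if (forceToDisk) {
                file.getChannel().force(true); // optional: block until the device has them
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.length();
    }

    // Creates a throwaway file for the demo.
    static File tempFile() {
        try {
            File f = File.createTempFile("buffered-sketch", ".log");
            f.deleteOnExit();
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeMessages(tempFile(), 1000, true) + " bytes written");
    }
}
```

Passing false for forceToDisk skips the fsync-style wait, which is usually the faster choice when durability of each batch is not required.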
FileChannel only works with ByteBuffers, so it is naturally buffered. If you need additional buffering you can copy data from ByteBuffer to ByteBuffer, but I am not sure why you would want to.
FileChannel does buffering at the OS level
FileChannel does tell the OS what to do. The OS usually has a disk cache but FileChannel has no idea whether this is the case or not.
I want to do buffering in the Java level
You are in luck, because you don't have a choice. ;) This is the only option.
I would have two threads: the producer thread produces ByteBuffers and appends them to the tail of a queue, and the consumer thread removes some ByteBuffers from the head of the queue each time and calls fileChannel.write(ByteBuffer[]).
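A minimal sketch of the consumer side of that design, with the producer simulated inline to keep the example short and deterministic. BatchingChannelWriter, drainAndWrite and demo are hypothetical names; in a real program the producer would run in its own thread and the consumer would loop, blocking on the queue when it is empty.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BatchingChannelWriter {
    // Consumer side: drains whatever is queued and writes it with gathering writes.
    static long drainAndWrite(BlockingQueue<ByteBuffer> queue, FileChannel channel) {
        List<ByteBuffer> batch = new ArrayList<>();
        queue.drainTo(batch);                          // take everything currently queued
        ByteBuffer[] buffers = batch.toArray(new ByteBuffer[0]);
        long total = 0;
        for (ByteBuffer buffer : buffers) {
            total += buffer.remaining();
        }
        long written = 0;
        try {
            while (written < total) {                  // a gathering write may be partial
                written += channel.write(buffers);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    // Simulates a producer that queued three 100-byte messages, then runs the consumer once.
    static long demo() {
        BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 3; i++) {
            queue.add(ByteBuffer.wrap(new byte[100])); // producer appends to the tail
        }
        try {
            Path file = Files.createTempFile("batch-sketch", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                drainAndWrite(queue, channel);
            }
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo() + " bytes in file");
    }
}
```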
What is the exact use of flush()? What is the difference between stream and buffer? Why do we need buffer?
The advantage of buffering is efficiency. It is generally faster to write a block of 4096 bytes one time to a file than to write, say, one byte 4096 times.
The disadvantage of buffering is that you miss out on the feedback. Output to a handle can remain in memory until enough bytes are written to make it worthwhile to write to the file handle. One part of your program may write some data to a file, but a different part of the program or a different program can't access that data until the first part of your program copies the data from memory to disk. Depending on how quickly data is being written to that file, this can take an arbitrarily long time.
When you call flush(), you are asking for whatever data is in the buffer to be written out to the file handle immediately, even if the buffer is not full.
Data is sometimes cached in a buffer before it is actually written to disk; flush causes whatever is in the buffer to be written out.
flush tells an output stream to send all the data to the underlying stream. It's necessary because of internal buffering. The essential purpose of a buffer is to minimize calls to the underlying stream's APIs. If I'm storing a long byte array to a FileOutputStream, I don't want Java to call the operating system file API once per byte. Thus, buffers are used at various stages, both inside and outside Java. Even if you did call fputc once per byte, the OS wouldn't really write to disk each time, because it has its own buffering.
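The effect of flush() on a buffered stream can be observed with a small sketch. A ByteArrayOutputStream stands in for the file here so that the number of bytes that have reached the underlying stream can be inspected; FlushDemo and visibleBeforeAndAfterFlush are hypothetical names.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FlushDemo {
    // Returns {bytes visible downstream before flush, bytes visible downstream after flush}.
    static int[] visibleBeforeAndAfterFlush() {
        ByteArrayOutputStream underlying = new ByteArrayOutputStream(); // stand-in for a file
        BufferedOutputStream out = new BufferedOutputStream(underlying, 4096);
        try {
            out.write(new byte[100]);        // smaller than the buffer: stays in memory
            int before = underlying.size();
            out.flush();                     // pushes the 100 bytes to the underlying stream
            int after = underlying.size();
            return new int[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = visibleBeforeAndAfterFlush();
        // prints "before flush: 0, after flush: 100"
        System.out.println("before flush: " + sizes[0] + ", after flush: " + sizes[1]);
    }
}
```

Note that this only shows data leaving the Java-level buffer. With a real file, the OS may still hold the bytes in its own cache after flush() returns.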