I need a FIFO ring buffer to temporarily store byte-stream data. The buffer should have two internal position pointers, denoting the last read and write positions respectively, so that read and write operations on the buffer may be interleaved.
Are there any existing libraries for this that are efficiently implemented?
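To make the requirement concrete, here is a minimal sketch of the structure described above: a fixed-capacity byte ring buffer with separate read and write positions, so reads and writes can be interleaved. The class and method names are illustrative, not from any library.

```java
// Minimal fixed-capacity FIFO byte ring buffer (illustrative sketch).
// readPos and writePos wrap around the array; count tracks how many bytes
// are currently stored, so read and write calls can be freely interleaved.
class ByteRingBuffer {
    private final byte[] buf;
    private int readPos = 0;   // next position to read from
    private int writePos = 0;  // next position to write to
    private int count = 0;     // bytes currently stored

    ByteRingBuffer(int capacity) {
        buf = new byte[capacity];
    }

    // Writes up to len bytes from src; returns how many were actually written.
    int write(byte[] src, int off, int len) {
        int n = Math.min(len, buf.length - count);
        for (int i = 0; i < n; i++) {
            buf[writePos] = src[off + i];
            writePos = (writePos + 1) % buf.length;
        }
        count += n;
        return n;
    }

    // Reads up to len bytes into dst; returns how many were actually read.
    int read(byte[] dst, int off, int len) {
        int n = Math.min(len, count);
        for (int i = 0; i < n; i++) {
            dst[off + i] = buf[readPos];
            readPos = (readPos + 1) % buf.length;
        }
        count -= n;
        return n;
    }

    int available() { return count; }
}
```

As for existing libraries: Okio's Buffer (discussed below) covers exactly this interleaved read/write use case, and java.nio.ByteBuffer can be used similarly with some flip/compact bookkeeping.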
Related
I read the following blog:
https://medium.com/@jerzy.chalupski/a-closer-look-at-the-okio-library-90336e37261
It says:
"the Sinks and Sources are often connected into a pipe. Smart folks at Square realized that there’s no need to copy the data between such pipe components like the java.io buffered streams do. All Sources and Sinks use Buffers under the hood, and Buffers keep the data in Segments, so quite often you can just take an entire Segment from one Buffer and move it to another."
I just don't understand where the copying of data happens in java.io,
and in which case a Segment would be moved to another Buffer.
After reading the source code of Okio, it seems that if I write strings to a file with Okio like the following:
val sink = logFile.appendingSink().buffer()
sink.writeUtf8("xxxx")
there will be no "moving a segment to another Buffer". Am I right?
Java's BufferedReader is just a Reader that buffers data into a buffer – the buffer being a char[], or something like that – so that every time you need a bunch of bytes/chars from it, it doesn't need to read bytes from a file/network/whatever your byte source is (as long as it has buffered enough bytes). A BufferedWriter does the opposite operation: whenever you write a bunch of bytes to the BufferedWriter, it doesn't actually write bytes to a file/socket/whatever, but it "parks" them into a buffer, so that it can flush the buffer only once it's full.
Overall, this minimises access to file/network/whatever, as it could be expensive.
When you pipe a BufferedReader to a BufferedWriter, you effectively have 2 buffers. How does Java move bytes from one buffer to the other? It copies them from the source to the sink using System.arraycopy (or something equivalent). Everything works well, except that copying a bunch of bytes requires an amount of time that grows linearly with the size of the buffer(s). Hence, copying 1 MB will take roughly 1000 times longer than copying 1 KB.
Okio, on the other hand, doesn't actually copy bytes. Oversimplifying the way it works, Okio has a single byte[] with the actual bytes, and the only thing that gets moved from the source to the sink is the pointer (or reference) to that byte[], which requires the same amount of time regardless of its size.
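The difference can be sketched with a toy example (this is not Okio's actual API, just an illustration of the idea): a java.io-style pipe physically copies bytes between arrays, while an Okio-style pipe models a buffer as a queue of segments and transfers data by moving a segment reference from one queue to another.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SegmentMoveDemo {
    // java.io style: bytes are physically copied between two arrays.
    // Cost grows linearly with the amount of data.
    static void copyStyle(byte[] source, byte[] sink) {
        System.arraycopy(source, 0, sink, 0, source.length);
    }

    // Okio style (heavily oversimplified): a "buffer" is a queue of segments,
    // and a transfer just moves the segment reference from one queue to the
    // other -- constant time, regardless of how big the segment is.
    static void moveStyle(Deque<byte[]> sourceBuffer, Deque<byte[]> sinkBuffer) {
        sinkBuffer.addLast(sourceBuffer.removeFirst()); // no bytes are copied
    }
}
```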
I want to convert an input stream to a byte[] and I'm using IOUtils.toByteArray(inputStream). Will it be more efficient to wrap the inputStream in a BufferedInputStream? Does it save memory?
Will it be more efficient to wrap the inputStream in a BufferedInputStream?
Not significantly. IOUtils.toByteArray reads data into a buffer of 4096 bytes. BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really improves performance, because it reduces the number of operating system calls by a factor of about 8000. And operating system calls take a lot of time, comparatively.
Does it save memory?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
In terms of final memory consumption it wouldn't help: you will need to move the whole stream into a byte[] anyway, the size of that array would be the same, so overall memory consumption would be the same.
What BufferedInputStream does is wrap another stream: instead of reading from the underlying stream directly on every call, it fills an internal buffer with one large read and then serves your reads from that buffer, going back to the underlying stream only when the buffer is exhausted. That can make many small read operations faster, because they are batched into fewer large reads, but it doesn't change the size of the array you end up with.
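The case where buffering pays off is the byte-at-a-time access pattern mentioned above. A small sketch: the same counting loop can be run over a raw stream or a BufferedInputStream-wrapped one; with the wrapper, most read() calls are served from the internal 8192-byte buffer instead of hitting the underlying source.

```java
import java.io.IOException;
import java.io.InputStream;

class ByteAtATimeDemo {
    // Reads a stream one byte at a time and returns the total byte count.
    // With an unbuffered stream, every read() call goes to the underlying
    // source; wrapped in a BufferedInputStream, roughly only one in 8192
    // calls does.
    static long countBytes(InputStream in) throws IOException {
        long total = 0;
        while (in.read() != -1) {
            total++;
        }
        return total;
    }
}
```

Usage would be `countBytes(new BufferedInputStream(new FileInputStream(path)))` versus `countBytes(new FileInputStream(path))`; the result is identical, only the number of underlying reads differs.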
I have a file containing data that is meaningful only in chunks of a certain size, which is prepended to each chunk, e.g.
{chunk_1_size}
{chunk_1}
{chunk_2_size}
{chunk_2}
{chunk_3_size}
{chunk_3}
{chunk_4_size}
{chunk_4}
{chunk_5_size}
{chunk_5}
.
.
{chunk_n_size}
{chunk_n}
The file is really big, ~2 GB, and the chunk size is ~20 MB (which is the buffer size that I want to have).
I would like to buffer-read this file to reduce the number of calls to the actual hard disk,
but I am not sure how big a buffer to use because the chunk size may vary.
pseudo code of what I have in mind:
while (!EOF) {
    /* the chunk size is an integer, i.e. 4 bytes */
    chunkSize = readChunkSize();
    /* according to the chunk size, read that many bytes from the file */
    readChunk(chunkSize);
}
If, let's say, I pick an arbitrary buffer size, then I might run into situations like:
The first buffer contains chunkSize_1 + chunk_1 + partialChunk_2 --- I have to keep track of the leftover, then get the remaining part of the chunk from the next buffer and concatenate it to the leftover to complete the chunk.
The first buffer contains chunkSize_1 + chunk_1 + partialChunkSize_2 (the chunk size is an integer, i.e. 4 bytes, so let's say I get only two of those bytes from the first buffer) --- I have to keep track of partialChunkSize_2, then get the remaining bytes from the next buffer to form the integer that actually gives me the next chunkSize.
The buffer might not even be able to hold one whole chunk at a time --- I have to keep calling read until the first chunk is completely read into memory.
You don't have much control over the number of calls to the hard disk. There are several layers between you and the hard disk (OS, driver, hardware buffering) that you cannot control.
Set a reasonable buffer size in your Java code (say, 1 MB) and forget about it unless and until you can prove there is a performance issue that is directly related to buffer size. In other words, do not fall into the trap of premature optimization.
See also https://stackoverflow.com/a/385529/18157
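One way to sidestep all the partial-chunk bookkeeping the question worries about is to let DataInputStream do it: readInt() assembles the 4-byte size prefix even when it straddles an internal buffer boundary, and readFully() keeps reading until the whole chunk is in memory. A sketch, assuming big-endian size prefixes as DataInputStream expects; the class and handler names are illustrative:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

class ChunkReader {
    interface ChunkHandler {
        void onChunk(byte[] chunk);
    }

    // Reads {size}{data} records until end of file. The BufferedInputStream
    // (1 MB here) batches disk access; DataInputStream handles reads that
    // straddle buffer boundaries, so chunk size and buffer size need not match.
    static void readChunks(String path, ChunkHandler handler) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path), 1 << 20))) {
            while (true) {
                int chunkSize;
                try {
                    chunkSize = in.readInt(); // 4-byte big-endian size prefix
                } catch (EOFException eof) {
                    break; // clean end of file between records
                }
                byte[] chunk = new byte[chunkSize];
                in.readFully(chunk); // blocks until the whole chunk is read
                handler.onChunk(chunk);
            }
        }
    }
}
```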
You might need to do some analysis to get an idea of the average chunk size before choosing a buffer size for reading the data.
You are saying you want to keep a fixed buffer size and read data until the chunk is done, so that you have some meaningful data.
Are you copying the file somewhere else, or sending this data to another place?
For some activities, the Java NIO packages have better implementations to deal with this than reading data into JVM buffers.
The buffer size should be large enough to read the biggest chunks of data.
If you are planning to hold the data in memory, reading it through buffers and keeping it there will still be a memory-costly operation; buffers can be released in various ways, e.g. with basic flush operations.
Please also check Apache Commons IO (FileUtils) for reading/writing data.
I am new to NIO and I found an article saying "block-based transmission is commonly more effective than stream-based transmission".
It means read(ByteBuffer) is block-based transmission and read(byte[]) is stream-based transmission.
I want to know the internal difference between the two methods.
PS: I have also heard that block-based transmission means transferring byte arrays and stream-based transmission means transferring bytes one by one. I think that's wrong, because java.io.FileInputStream.read(byte[]) transfers a byte array as well.
One thing that makes a ByteBuffer more efficient is using direct memory. This avoids a copy from direct memory into a byte[]. If you are merely copying data from one Channel to another, this can be up to 30% faster. If you are reading byte by byte, it can be slightly slower to use a ByteBuffer, as it has more overhead accessing each byte. If you use it to read binary data, e.g. an int or a double, it can be much faster, as it can grab the whole value in one access.
I think you're talking about buffer-based vs stream-based I/O operations. Java NIO is buffer oriented in the sense that data is first read into a buffer which is then processed; this gives you flexibility, though you need to be sure the buffer has all the data you require before you process it. With stream-based I/O, on the other hand, you read one or more bytes from the stream and they are not cached anywhere. Note that buffering is a separate concern from blocking: streams are always blocking, whereas NIO channels can additionally be put into non-blocking mode.
While I wouldn't use "stream-based" to characterize read(byte[]), there are efficiency gains to a ByteBuffer over a byte[] in some cases.
See A simple rule of when I should use direct buffers with Java NIO for network I/O? and ByteBuffer.allocate() vs. ByteBuffer.allocateDirect()
The memory backing a ByteBuffer can be (if "direct") easier for the JVM to pass to the OS and to do IO tricks with (for example passing the memory directly to read to write calls), and may not be "on" the JVM's heap. The memory backing a byte[] is on the JVM heap and IO generally does not go directly into the memory used by the array (instead it often goes through a bounce buffer --- because the GC may "move" array objects around in memory while IO is pending or the array memory may not be contiguous).
However, if you have to manipulate the data in Java, a ByteBuffer may not make much difference, as you'll eventually have to copy the data into the Java heap to manipulate it. If you're copying data in and back out without manipulating it, a direct ByteBuffer can be a win.
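As an illustration of that copy-in-and-back-out case, here is a sketch of a channel-to-channel copy loop using one direct ByteBuffer. Because the buffer's memory lives off the JVM heap, the JVM can hand it to the OS for both the read and the write without bouncing through a heap array:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

class ChannelCopy {
    // Copies everything from src to dst through one direct buffer and
    // returns the number of bytes copied.
    static long copy(ReadableByteChannel src, WritableByteChannel dst)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // off-heap memory
        long total = 0;
        while (src.read(buf) != -1) {
            buf.flip();                  // switch from filling to draining
            while (buf.hasRemaining()) {
                total += dst.write(buf); // may need several writes to drain
            }
            buf.clear();                 // ready to be filled again
        }
        return total;
    }
}
```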
As we all know, Java allows us to use a byte array as a buffer for data. My case here is with J2ME.
The scenario is that I have two buffers of equal size, and I need to swap them as they get full, one after the other.
In detail:
Two buffers, buff1 and buff2.
Reading data from buff1 while writing other data to buff2.
Then, when buff2 gets full,
they swap their positions: now reading from buff2 and writing to buff1.
The above cycle goes on.
So how do I detect when a buffer is full and is ready to be swapped?
so how do I detect when a buffer is full
The buffer itself is never full (or empty). It is just a fixed amount of reserved memory.
You need to keep track of the useful parts (i.e. those with meaningful data) yourself. Usually, this is just an integer that counts how many bytes were written into the buffer (starting from the beginning).
When that integer reaches the buffer length, your buffer is "full".
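The counter idea above can be sketched as a small double-buffer holder. This is an illustrative example (all names are made up), kept to plain arrays and no generics so it would also fit a constrained J2ME environment:

```java
// Double-buffering sketch: a fill counter tells you when the write buffer
// is "full" and the pair should be swapped. All names are illustrative.
class DoubleBuffer {
    private byte[] writeBuf;
    private byte[] readBuf;
    private int written = 0; // bytes written into writeBuf so far

    DoubleBuffer(int size) {
        writeBuf = new byte[size];
        readBuf = new byte[size];
    }

    // The buffer is "full" when the counter reaches the buffer length.
    boolean isFull() {
        return written == writeBuf.length;
    }

    // Appends one byte; the caller checks isFull() and swaps when needed.
    void put(byte b) {
        writeBuf[written++] = b;
    }

    // Swaps roles: the full buffer becomes the one being read, and the
    // other one is reused for writing from position 0.
    void swap() {
        byte[] tmp = writeBuf;
        writeBuf = readBuf;
        readBuf = tmp;
        written = 0;
    }

    byte[] currentReadBuffer() { return readBuf; }
}
```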
Three values can be used to specify the state of a buffer at any given moment in time: SOURCE LINK
* position
* limit
* capacity
Position
When you read from a channel, you put the data that you read into an underlying array. The position variable keeps track of how much data you have written. More precisely, it specifies into which array element the next byte will go. Thus, if you've read three bytes from a channel into a buffer, that buffer's position will be set to 3, referring to the fourth element of the array.
Limit
The limit variable specifies how much data there is left to get (in the case of writing from a buffer into a channel), or how much room there is left to put data into (in the case of reading from a channel into a buffer).
The position is always less than, or equal to, the limit.
Capacity
The capacity of a buffer specifies the maximum amount of data that can be stored therein. In effect, it specifies the size of the underlying array -- or, at least, the amount of the underlying array that we are permitted to use.
The limit can never be larger than the capacity.
so how do I detect when a buffer is full and is ready to be swapped?
Check the value of the position field against the limit (or the capacity): the buffer is full when the position has reached the limit.
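For a java.nio.ByteBuffer specifically, that check is what hasRemaining() does, as this small sketch shows:

```java
import java.nio.ByteBuffer;

class BufferStateDemo {
    // After writing into a buffer, position counts the bytes put in so far;
    // the buffer is "full" once position has reached the limit, i.e. once
    // hasRemaining() returns false.
    static boolean isFull(ByteBuffer buf) {
        return !buf.hasRemaining(); // position == limit
    }
}
```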