In ByteToMessageDecoder (https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java), which ReplayingDecoder derives from, the cumulation ByteBuf (used to accumulate data until enough has been read from the network to begin decoding) seems to be implemented like a dynamic array.
By this I mean that, if the current cumulation ByteBuf has the capacity to retain the incoming data, it is copied to the cumulation ByteBuf. If there is insufficient capacity, the cumulation ByteBuf is expanded and both the previous cumulation ByteBuf and the incoming data are written to the newly allocated instance. Is there a reason that a CompositeByteBuf with a bounded number of components is not used here instead?
Using a PooledByteBufAllocator should help reduce the number of memory allocations, but it still seems to me that using a CompositeByteBuf in conjunction with a PooledByteBufAllocator would be the most efficient solution, as it would optimize both memory allocations and copies.
However, before I go down the rabbit hole of implementing my own pipeline stage for zero-copy aggregation (a rough sketch of what I have in mind is below), I wanted to ask whether there is a particular reason for the current implementation (e.g. is the construction of a CompositeByteBuf being performed under the hood by one of the copy calls, or has someone already found that the current strategy is empirically better?).
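For reference, here is a rough sketch of the kind of aggregation stage I have in mind (hypothetical handler name; Netty 4.1-style CompositeByteBuf API assumed):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical zero-copy aggregation stage: incoming buffers are added as
// components of a CompositeByteBuf instead of being copied into a single
// cumulation buffer.
public class CompositeAggregationHandler extends ChannelInboundHandlerAdapter {
    private CompositeByteBuf cumulation;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (!(msg instanceof ByteBuf)) {
            ctx.fireChannelRead(msg);
            return;
        }
        if (cumulation == null) {
            cumulation = ctx.alloc().compositeBuffer();
        }
        // true = advance the writer index so the added bytes become readable.
        cumulation.addComponent(true, (ByteBuf) msg);
        // ... attempt to decode frames from 'cumulation' here, then
        // cumulation.discardReadComponents() to release consumed buffers ...
    }
}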
Thanks in advance
Direct buffers were introduced in JDK 1.4 along with Java NIO. One reason for them is that the Java GC may move memory on the heap, so the buffer data must be kept off-heap.
I'm wondering why the traditional Java blocking IO API (BIO) doesn't need a direct buffer. Does BIO use something like a direct buffer internally, or are there other mechanisms to avoid the "memory movement" problem?
The simple answer is: it doesn't matter. Java has a clear, public spec: the JLS, the JVMS, and the javadoc of the core library. Java implementations do exactly what those 3 documents state, and you may trust that somehow it 'works'. This isn't as trite as it sounds. For example, the JMM (Java Memory Model, part of the JLS) lays out all sorts of things a JVM 'may' do with regard to re-ordering instructions and caching local writes. That is tricky, because given that it is a 'may', a JVM may not actually misbehave even though your code is buggy: a JVM may do X, and if it does, your code breaks; it's just that on your machine, at this time, with this song playing on your music player, the JVM chose never to do X, so you can't observe the problem.
Fortunately, the BIO stuff mostly has no 'may' in it.
Here is the basic outline of BIO in Java:
You call .read(), .read(byte[]), or .read(byte[], off, len).
(This is not a guarantee, just an implementation detail; a JVM is not required to do it this way.) The JVM will read 'as much as is currently available'. Hence, .read(some100SizedByteArr) may read only 50 bytes, even though calling read again would return more bytes: 50 just happened to be 'ready' in the network buffer. Lots of folks get that wrong and think .read(byte[]) will fill the byte array if it can. Nope. That would make it impossible to write code that processes data as it comes in!
(Again, not a guarantee.) Given that byte arrays can be shoved around in memory, you'd think that's a problem, but it really isn't: that byte[] is guaranteed not to magically grow new bytes in it; there is no way with the BIO API to say 'just fill this array as the bytes fly in over the wire'. The only way to fill that array is to call .read() on your InputStream. That is a blocking operation, and the JVM can therefore 'deal with it' as it pleases. Perhaps the native layer simply locks out the garbage collector until data is returned (this isn't as pricey as it sounds; once at least 1 byte can be returned, the .read() method returns quickly, it doesn't wait for more data beyond the first byte - at least, that's how most JVMs do it). Perhaps it reads the data into a clone buffer that lives off-heap and blits it over into your array later (sounds inefficient, perhaps, but a JVM is free to do it this way). Possibly the JVM marks that byte array as off-limits for movement of any sort and the GC just collects 'around' it. It doesn't matter - a JVM can do whatever it wants, as long as it guarantees that .read(byte[]):
Blocks until EOF is reached (in which case it returns -1), or at least 1 byte is available.
Fills the byte array with the bytes so returned.
Marks the InputStream as having 'consumed' all the bytes you just got.
Returns a value representing how many bytes have been filled.
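To make that concrete, here is a minimal sketch of a loop that fills an array completely, relying only on these guarantees (readFully is just an illustrative name):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Fills 'target' completely; loops because a single read() may return fewer
// bytes than the array can hold.
static void readFully(InputStream in, byte[] target) throws IOException {
    int filled = 0;
    while (filled < target.length) {
        int n = in.read(target, filled, target.length - filled);
        if (n == -1) {
            throw new EOFException("stream ended after " + filled + " bytes");
        }
        filled += n;
    }
}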
That's sort of the point of Java: the how is irrelevant. Had the how not been irrelevant, writing a JVM for a new platform could be either impossible or require full virtualization, making it incredibly slow. The docs give themselves some 'may' clauses exactly so that this can be avoided.
One place where 'may' does show up in BIO: what happens when you .interrupt() a thread that is currently blocked in a BIO .write() call (and the bytes haven't all been sent yet; let's say the network is slow and you sent a big array), or in a BIO .read() call (which blocks until at least 1 byte is available; let's say the other side isn't sending anything)? The docs leave it open. It 'may' result in an IOException being thrown, thus ending the read/write call, with a message indicating you interrupted it. Or .interrupt() does nothing, and it is in fact impossible to interrupt a thread frozen on a BIO call. Most JVMs do the exception thing (fortunately), but the docs leave room: if for whatever reason the underlying OS/arch doesn't make that feasible, a JVM is free to do nothing when you attempt to interrupt(). Conclusion: if you want to write proper 'write once, run anywhere' code, you can't rely on being able to .interrupt() BIO freezes.
In a high-volume multi-threaded Java project I need to implement a non-blocking buffer.
In my scenario I have a web layer that receives ~20,000 requests per second. I need to accumulate some of those requests in some data structure (aka the desired buffer) and when it is full (let's assume it is full when it contains 1000 objects) those objects should be serialized to a file that will be sent to another server for further processing.
The implementation should be non-blocking.
I examined ConcurrentLinkedQueue but I'm not sure it fits the job.
I think I need to use 2 queues, such that once the first fills up it is replaced by a new one, and the full queue ("the first") gets handed off for further processing (roughly the idea sketched below). This is the basic idea I'm thinking of at the moment, but I still don't know if it is feasible, since I'm not sure I can switch pointers in Java (in order to swap in the fresh queue).
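To illustrate what I mean by switching pointers, something along these lines is what I have in mind (element type and names are just placeholders):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

class SwappableBuffer {
    private final AtomicReference<Queue<Object>> active =
            new AtomicReference<Queue<Object>>(new ConcurrentLinkedQueue<Object>());

    void add(Object request) {
        active.get().add(request);
    }

    // "Switching pointers": atomically replace the full queue with a fresh one
    // and hand the old one off for serialization. Requests added concurrently
    // with the swap may land in either queue.
    Queue<Object> swapOut() {
        return active.getAndSet(new ConcurrentLinkedQueue<Object>());
    }
}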
Any advice?
Thanks
What I usually do with requirements like this is create a pool of buffers at app startup and store the references in a BlockingQueue. The producer thread pops buffers, fills them, and then pushes the refs onto another queue upon which the consumers are waiting. When the consumer/s are done (data written to file, in your case), the refs get pushed back onto the pool queue for re-use. This provides lots of buffer storage, no need for expensive bulk copying inside locks, eliminates GC actions, provides flow control (if the pool empties, the producer is forced to wait until some buffers are returned), and prevents memory runaway, all in one design.
More: I've used such designs for many years in various other languages too (C++, Delphi), and they work well. I have an 'ObjectPool' class that contains the BlockingQueue and a 'PooledObject' class to derive the buffers from. PooledObject has an internal private reference to its pool (it gets initialized on pool creation), allowing a parameterless release() method. This means that, in complex designs with more than one pool, a buffer always gets released to the correct pool, reducing cockup potential.
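A stripped-down sketch of the pool idea (plain byte[] buffers instead of a PooledObject class; names and sizes are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BufferPool {
    private final BlockingQueue<byte[]> pool;

    BufferPool(int count, int bufferSize) {
        pool = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) {
            pool.add(new byte[bufferSize]);   // pre-allocate everything at startup
        }
    }

    // Blocks if the pool is empty - this is the flow control mentioned above.
    byte[] acquire() throws InterruptedException {
        return pool.take();
    }

    // Return a buffer for re-use once the consumer has written it to file.
    void release(byte[] buffer) {
        pool.offer(buffer);
    }
}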
Most of my apps have a GUI, so I usually dump the pool level to a status bar on a timer, say every second. I can then see roughly how much load there is, whether any buffers are leaking (the number consistently goes down and the app eventually deadlocks on an empty pool), or whether I am double-releasing (the number consistently goes up and the app eventually crashes).
It's also fairly easy to change the number of buffers at runtime, by either creating more and pushing them into the pool, or by waiting on the pool, removing buffers and letting GC destroy them.
I think you have a very good point with your solution. You would need two queues, the processingQueue would be the buffer size you want (in your example that would be 1000) while the waitingQueue would be a lot bigger. Every time the processingQueue is full it will put its contents in the specified file and then grab the first 1000 from the waitingQueue (or less if the waiting queue has fewer than 1000).
My only concern about this is that you mention 20000 per second and a buffer of 1000. I know the 1000 was an example, but if you don't make it bigger it might just be that you are moving the problem to the waitingQueue rather than solving it, as your waitingQueue will receive 1000 new ones faster than the processingQueue can process them, giving you a buffer overflow in the waitingQueue.
Instead of putting each request object in a queue, allocate an array of size 1000, and when it is filled, put that array in the queue to the sender thread which serializes and sends the whole array. Then allocate another array.
How are you going to handle the situation where the sender cannot work fast enough and its queue overflows? To avoid an OutOfMemoryError, use a queue of limited size.
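A minimal sketch of that arrangement with a bounded hand-off queue (names and sizes are only examples; synchronization of onRequest across producer threads is omitted):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchingProducer {
    // At most 100 full batches may wait for the sender thread; a full queue
    // pushes back on the producer instead of exhausting memory.
    private final BlockingQueue<Object[]> outbound = new ArrayBlockingQueue<>(100);
    private Object[] batch = new Object[1000];
    private int fill = 0;

    void onRequest(Object request) throws InterruptedException {
        batch[fill++] = request;
        if (fill == batch.length) {
            outbound.put(batch);          // blocks if the sender has fallen behind
            batch = new Object[1000];     // allocate a fresh array for the next batch
            fill = 0;
        }
    }

    // The sender thread takes full batches from here, serializes and ships them.
    BlockingQueue<Object[]> outbound() {
        return outbound;
    }
}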
I might be getting something wrong, but you may use an ArrayList for this, as you don't need to poll per element from your queue. You just flush (create a copy and clear) your list in a synchronized section when its size reaches the limit and you need to send it. Adding to this list should also be synchronized with this flush operation.
Swapping your arrays might not be safe - if your sending is slower than your generation, buffers may soon start overwriting each other. And allocating 20,000 elements' worth of arrays per second is almost nothing for the GC.
Object lock = new Object();
List<Object> list = new ArrayList<>();

// producer side: adding an element
synchronized (lock) {
    list.add(element);
}

// ...

// This check outside the sync block is a quick, dirty check for performance;
// it is not valid outside the sync block.
// This first check costs less than a nanosecond and filters out 99.9%
// of the synchronized(lock) sections.
if (list.size() > 1000) {
    synchronized (lock) { // this should take less than a microsecond
        if (list.size() > 1000) { // this one is the valid check
            // Make sure this is async (i.e. saved in a separate thread) or <1ms;
            // the new array allocation must be the slowest part here.
            sendAsyncInASeparateThread(new ArrayList<>(list));
            list.clear();
        }
    }
}
UPDATE
Considering that the sending is async, the slowest part here is new ArrayList(list), which should take around 1 microsecond for 1000 elements, i.e. roughly 20 microseconds of copying per second at this rate. I didn't measure that; I extrapolated from the proportion that about 1 million elements are allocated in ~1 ms.
If you still require a super-fast synchronized queue, you might want to have a look at MentaQueue.
What do you mean by "switch pointers"? There are no pointers in Java (unless you're talking about references).
Anyway, as you probably saw from the Javadoc, ConcurrentLinkedQueue has a "problem" with the size() method (it is not a constant-time operation). Still, you could use your original idea of 2 (or more) buffers that get switched. There are probably going to be some bottlenecks with the disk I/O, so maybe the non-constant time of size() won't be a problem here either.
Of course if you want it to be non-blocking, you better have a lot of memory and a fast disk (and large / bigger buffers).
This is more like a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server... For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, how many bytes should I read at a time? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when defining the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1-byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
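For example, a sketch of the mark/reset usage being described ("data.bin" is just a placeholder file):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class MarkDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
            // Promise that up to ~1 MB read after this point can be replayed by reset();
            // reading past the current internal buffer before reset() forces it to grow.
            in.mark(1_000_000);
            byte[] header = new byte[64];
            int n = in.read(header);      // consume some bytes, still replayable
            System.out.println("header bytes read: " + n);
            in.reset();                   // rewind to the marked position
        }
    }
}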
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller array of bytes. If that doesn't matter, check how much memory your program consumes. I think 1024-4096 bytes is a good range for buffering the data while you continue to receive.
If you pump data you normally do not need to use any Buffered streams. Just make sure you use a decently sized (8-64k) temporary byte[] buffer passed to the read method (or use a pump method which does it). The default buffer size is too small for most usages (and if you use a larger temp array it will be ignored anyway)
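A minimal sketch of such a pump method (32k is just one point in the suggested 8-64k range):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copies everything from in to out using a single temporary buffer.
static void pump(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[32 * 1024];
    int n;
    while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
    }
    out.flush();
}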
I am exploring Netty to communicate objects between VMs. I use ObjectEncoder and ObjectDecoder, respectively, to serialize them.
I quickly found out that this solution is limited to objects of at most 1 MB. As I intend to communicate larger objects, and given that I do not intend to limit this size, I used Integer.MAX_VALUE as the maximum frame length.
Unfortunately it looks like this value is picked up to initialize some buffers, resulting in unnecessary GC-ing and very likely an OutOfMemoryError.
Is there a way to create an unlimited ObjectEncoder/Decoder while using DynamicChannelBuffers so that not too much memory is wasted?
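For reference, this is roughly how I am wiring it up (Netty 3.x API; constructor overloads from memory and may differ slightly between versions):

import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.handler.codec.serialization.ObjectDecoder;
import org.jboss.netty.handler.codec.serialization.ObjectEncoder;

public class ObjectPipelineFactory implements ChannelPipelineFactory {
    public ChannelPipeline getPipeline() {
        ChannelPipeline pipeline = Channels.pipeline();
        pipeline.addLast("encoder", new ObjectEncoder());
        // Integer.MAX_VALUE lifts the 1 MB limit, but it also seems to be used
        // to size internal buffers.
        pipeline.addLast("decoder", new ObjectDecoder(Integer.MAX_VALUE));
        return pipeline;
    }
}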
ObjectDecoder extends LengthFieldBasedFrameDecoder, which extends FrameDecoder. FrameDecoder manages the decode buffer, and it uses a dynamic buffer with an initial capacity of 256.
However, once you receive a large object, the dynamic buffer expands itself but never shrinks. If you have multiple connections that exchange large objects, your ObjectDecoders will all end up with very large buffers eventually, potentially leading to an OutOfMemoryError.
This issue was fixed last week, and a new release (3.2.7.Final) will be out this week.
I have two scenarios in Netty where I am trying to minimize memory copies and optimize memory usage:
(1) Reading a very large frame (20 megabytes).
(2) Reading lots of very small frames (20 megabytes in total at 50 bytes per frame) to rebuild into one message at a higher level in the pipeline.
For the first scenario, as I get a length at the beginning of the frame, I extended FrameDecoder. Unfortunately, as I don't see how to return the length to Netty (I only indicate whether the frame is complete or not), I believe Netty is going through multiple fill-buffer, copy, and realloc cycles, thus using far more memory than is required. Is there something I am missing here? Or should I avoid FrameDecoder entirely if I expect this scenario?
In the second scenario, I am currently creating a linked list of all the little frames which I wrap using ChannelBuffers.wrappedBuffer (which I can then wrap in a ChannelBufferInputStream), but I am again using far more memory than I expected to use (perhaps because the allocated ChannelBuffers have spare space?). Is this the right way to use Netty ChannelBuffers?
There is a specialized version of the frame decoder called LengthFieldBasedFrameDecoder. It is handy when you have a header with the message length. It can even extract the message length from the header, given an offset.
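For example, assuming a frame that starts with a 4-byte length field which does not count itself, the decoder might be configured roughly like this (Netty 3.x package names):

import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder;

public class FrameDecoderSetup {
    static void addFrameDecoder(ChannelPipeline pipeline) {
        pipeline.addLast("frameDecoder", new LengthFieldBasedFrameDecoder(
                20 * 1024 * 1024,  // maxFrameLength: allow frames up to ~20 MB
                0,                 // lengthFieldOffset: length field starts the frame
                4,                 // lengthFieldLength: 4-byte length field
                0,                 // lengthAdjustment: length covers only the body
                4));               // initialBytesToStrip: drop the length header
    }
}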
Actually, ChannelBuffers.wrappedBuffer does not create copies of the received data; it creates a composite buffer from the given buffers, so your received frame data will not be copied. If you are holding the composite buffers / your custom wrapper in the code and forget to nullify them, memory leaks can happen.
These are the practices I follow:
Allocate direct buffers for long-lived objects and slice them on use.
When I want to join/encode multiple buffers into one big buffer, I use ChannelBuffers.wrappedBuffer.
If I have a buffer and want to do something with it or a portion of it, I make a slice of it by calling slice or slice(0, ..) on the channel buffer instance.
If I have a channel buffer and know the position of the data, and it is small, I always use the getXXX methods.
If I have a channel buffer which is used in many places to make something out of it, I always make it modifiable and slice it on use.
Note: ChannelBuffer.slice does not make a copy of the data; it creates a channel buffer with new reader & writer indexes.
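A small illustration of the wrap-and-slice pattern described above (Netty 3.x ChannelBuffers; byte values are arbitrary):

import org.jboss.netty.buffer.ChannelBuffer;
import org.jboss.netty.buffer.ChannelBuffers;

public class WrapAndSlice {
    public static void main(String[] args) {
        ChannelBuffer a = ChannelBuffers.wrappedBuffer(new byte[]{1, 2, 3, 4});
        ChannelBuffer b = ChannelBuffers.wrappedBuffer(new byte[]{5, 6, 7, 8});

        // Composite view over both buffers; the bytes are not copied.
        ChannelBuffer joined = ChannelBuffers.wrappedBuffer(a, b);

        // A slice shares the same data; only the indexes are new.
        ChannelBuffer firstHalf = joined.slice(0, 4);

        // getByte() reads at an absolute index without moving readerIndex.
        System.out.println(firstHalf.getByte(3));   // prints 4
    }
}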
In the end, it appeared the best way to handle my FrameDecoder issue was to write my own handler on top of SimpleChannelUpstreamHandler. As soon as I determined the length from the header, I created a ChannelBuffer with a size exactly matching that length. This (along with other changes) significantly improved the memory performance of my application.