Non-blocking buffer in Java

In a high-volume, multi-threaded Java project I need to implement a non-blocking buffer.
In my scenario I have a web layer that receives ~20,000 requests per second. I need to accumulate some of those requests in some data structure (aka the desired buffer), and when it is full (let's assume it is full when it contains 1000 objects) those objects should be serialized to a file that will be sent to another server for further processing.
The implementation should be non-blocking.
I examined ConcurrentLinkedQueue but I'm not sure it can fit the job.
I think I need to use 2 queues in a way that once the first gets filled it is replaced by a new one, and the full queue ("the first") gets delivered for further processing. This is the basic idea I'm thinking of at the moment, and I still don't know if it is feasible, since I'm not sure I can switch pointers in Java (in order to switch the full queue).
Any advice?
Thanks

What I usually do with requirements like this is create a pool of buffers at app startup and store the references in a BlockingQueue. The producer thread pops buffers, fills them, and then pushes the refs to another queue upon which the consumers are waiting. When the consumer/s are done (data written to file, in your case), the refs get pushed back onto the pool queue for re-use. This provides lots of buffer storage, no need for expensive bulk copying inside locks, eliminates GC actions, provides flow control (if the pool empties, the producer is forced to wait until some buffers are returned), and prevents memory runaway, all in one design.
More: I've used such designs for many years in various other languages too, (C++, Delphi), and it works well. I have an 'ObjectPool' class that contains the BlockingQueue and a 'PooledObject' class to derive the buffers from. PooledObject has an internal private reference to its pool, (it gets initialized on pool creation), so allowing a parameterless release() method. This means that, in complex designs with more than one pool, a buffer always gets released to the correct pool, reducing cockup-potential.
Most of my apps have a GUI, so I usually dump the pool level to a status bar on a timer, every second, say. I can then see roughly how much loading there is, if any buffers are leaking, (number consistently goes down and then app eventually deadlocks on empty pool), or I am double-releasing, (number consistently goes up and app eventually crashes).
It's also fairly easy to change the number of buffers at runtime, by either creating more and pushing them into the pool, or by waiting on the pool, removing buffers and letting GC destroy them.
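A minimal sketch of that pool idea, assuming a simple fixed-size Buffer type (the class and method names here are illustrative, not the actual ObjectPool/PooledObject code described above):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class Buffer {
    final Object[] items = new Object[1000];
    int count;
    void clear() { count = 0; }
}

class BufferPool {
    private final BlockingQueue<Buffer> pool;

    BufferPool(int buffers) {
        pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) pool.add(new Buffer());
    }

    // producer blocks here if every buffer is in flight - this is the flow control
    Buffer acquire() throws InterruptedException { return pool.take(); }

    // consumer returns the buffer once its contents have been written out
    void release(Buffer b) { b.clear(); pool.add(b); }
}

The producer calls acquire(), fills the buffer, and puts it on a second BlockingQueue for the consumers; a consumer calls release() once the file has been written.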

I think you have a very good point with your solution. You would need two queues: the processingQueue would be the buffer size you want (in your example, 1000), while the waitingQueue would be a lot bigger. Every time the processingQueue is full, it writes its contents to the specified file and then grabs the first 1000 from the waitingQueue (or fewer if the waitingQueue has fewer than 1000).
My only concern about this is that you mention 20000 per second and a buffer of 1000. I know the 1000 was an example, but if you don't make it bigger it might just be that you are moving the problem to the waitingQueue rather than solving it, as your waitingQueue will receive 1000 new ones faster than the processingQueue can process them, giving you a buffer overflow in the waitingQueue.
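A rough sketch of that hand-off, assuming requests land on a bounded waitingQueue first (the Request type and writeToFile method are placeholders, not from the question):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchWriter {
    static class Request {}   // placeholder for the real request object

    private final BlockingQueue<Request> waitingQueue = new ArrayBlockingQueue<>(100_000);

    // called by the web layer for every incoming request
    void submit(Request r) throws InterruptedException { waitingQueue.put(r); }

    // writer thread: take up to 1000 requests at a time and serialize them
    void writerLoop() throws InterruptedException {
        List<Request> batch = new ArrayList<>(1000);
        while (true) {
            batch.add(waitingQueue.take());     // block until something arrives
            waitingQueue.drainTo(batch, 999);   // then grab up to 999 more without blocking
            writeToFile(batch);                 // serialization / hand-off to the sender
            batch.clear();
        }
    }

    private void writeToFile(List<Request> batch) { /* serialize here */ }
}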

Instead of putting each request object in a queue, allocate an array of size 1000, and when it is filled, put that array in the queue to the sender thread which serializes and sends the whole array. Then allocate another array.
How are you going to handle the situation where the sender cannot work fast enough and its queue overflows? To avoid an out-of-memory error, use a queue of limited size.
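A minimal sketch of that batching approach; the bounded hand-off queue is what protects you from the overflow case just mentioned (names and sizes are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ArrayBatcher {
    private static final int BATCH = 1000;

    // bounded: if the sender falls behind, put() blocks instead of exhausting memory
    private final BlockingQueue<Object[]> toSender = new ArrayBlockingQueue<>(64);

    private Object[] current = new Object[BATCH];
    private int filled = 0;

    // called from the receiving thread
    void accept(Object request) throws InterruptedException {
        current[filled++] = request;
        if (filled == BATCH) {
            toSender.put(current);        // hand the full array to the sender thread
            current = new Object[BATCH];  // allocate a fresh array for the next batch
            filled = 0;
        }
    }

    // called by the sender thread
    Object[] nextBatch() throws InterruptedException { return toSender.take(); }
}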

I might be getting something wrong, but you can use an ArrayList for this, as you don't need to poll per element from your queue. You just flush (create a copy and clear) your list in a synchronized section when its size reaches the limit and you need to send it. Adding to this list should also be synchronized with this flush operation.
Swapping your arrays might not be safe - if your sending is slower than your generation, buffers may soon start overwriting each other. And allocating arrays for 20,000 elements per second is almost nothing for the GC.
final Object lock = new Object();
final List<Request> list = new ArrayList<>();

// producer side
synchronized (lock) {
    list.add(request);   // the incoming request object
}
...
// This outer check is a quick-and-dirty performance optimization;
// it is not reliable outside the sync block.
// It costs well under a nanosecond and skips 99.9% of the
// synchronized(lock) sections.
if (list.size() > 1000) {
    synchronized (lock) {               // this should take less than a microsecond
        if (list.size() > 1000) {       // this check is valid
            // make sure this is async (i.e. saved in a separate thread) or < 1 ms;
            // the new ArrayList allocation should be the slowest part here
            sendAsyncInASeparateThread(new ArrayList<>(list));
            list.clear();
        }
    }
}
UPDATE
Considering that sending is async, the slowest part here is new ArrayList(list), which should take around 1 microsecond for 1000 elements, or about 20 microseconds per second in total. I didn't measure that; I extrapolated it from the proportion in which 1 million elements are allocated in ~1 ms.
If you still require a super-fast synchronized queue, you might want to have a look at MentaQueue.

What do you mean by "switch pointers"? There are no pointers in Java (unless you're talking about references).
Anyway, as you probably saw from the Javadoc, ConcurrentLinkedQueue has a "problem" with the size() method. Still, you could use your original idea of 2 (or more) buffers that get switched. There are probably going to be some bottlenecks with the disk I/O, so maybe the non-constant time of size() won't be a problem here either.
Of course if you want it to be non-blocking, you better have a lot of memory and a fast disk (and large / bigger buffers).
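On the "switch pointers" idea itself: in Java you swap references, and an AtomicReference can do the swap atomically. A minimal sketch (queue type and names are illustrative):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

class SwappableBuffer {
    private final AtomicReference<Queue<Object>> active =
            new AtomicReference<>(new ConcurrentLinkedQueue<>());

    // request threads append to whichever queue is currently active
    void add(Object request) {
        active.get().add(request);
    }

    // writer thread: atomically install an empty queue and get the full one back
    Queue<Object> swap() {
        return active.getAndSet(new ConcurrentLinkedQueue<>());
    }
}

One caveat: a thread that read the old reference just before the swap can still add to the old queue for a moment, so the writer should drain it once more (or wait briefly) before serializing.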

Related

Java: How to make sure that a shared data write affects multiple readers

I am trying to code a processor-intensive task, so I would like to use multithreading and share the calculation between the available processor cores.
Let's say I have thousands of iterations and all iterations have two phases:
Phase 1: Several working threads scan through hundreds of thousands of options, reading data from a shared array (or some other data structure); the data is not modified during this phase.
Phase 2: One thread collects the results from all the working threads (while they are waiting) and makes modifications to the shared array.
The phases are sequential, so there is no overlap (no concurrent writing and reading of the data). My problem is: how can I be sure that the data (cache) is updated for the working threads before the next Phase 1 starts?
I am assuming that when people speak about cache or caching in this context, they mean the processor cache (correct me if I'm wrong).
As I understand it, volatile can be used for non-reference types only, and there is no point in using synchronized, because the working threads would block each other when reading (there can be thousands of reads while processing an option).
What else can I use in this case?
Right now I have a few ideas, but I have no idea how costly they are (most probably very):
create new working threads for all iterations
in a synchronized block, make a copy of the array (which can be up to 195 kB in size) for each thread before a new iteration begins
I read about ReentrantReadWriteLock, but I can't understand how it is related to caching. Can acquiring a read lock force the reader's cache to update?
The thing I was searching for was mentioned in the Java concurrency tutorial; I just had to look deeper. In this case it was the AtomicIntegerArray class. Unfortunately it is not efficient enough for my needs, but I ran some tests that may be worth sharing.
I approximated the cost of different memory access methods by running them many times, averaging the elapsed times, and breaking everything down to one average read or write.
I used an integer array of size 50,000 and repeated every test method 100 times, then averaged the results. The read tests perform 50,000 random(ish) reads. The results show the approximate time of one read/write access. This can't be taken as an exact measurement, but I believe it gives a good sense of the time costs of the different access methods. On different processors, or with different sizes, the results may be completely different depending on cache sizes and clock speeds.
So the results are:
Fill time with set is: 15.922673ns
Fill time with lazySet is: 4.5303152ns
Atomic read time is: 9.146553ns
Synchronized read time is: 57.858261399999996ns
Single threaded fill time is: 0.2879112ns
Single threaded read time is: 0.3152002ns
Immutable copy time is: 0.2920892ns
Immutable read time is: 0.650578ns
Points 1 and 2 show the write results on an AtomicIntegerArray with sequential writes. In some article I read about the good efficiency of the lazySet() method, so I wanted to test it. It usually outperforms the set() method by about 4 times, although different array sizes show different results.
Points 3 and 4 show the difference between "atomic" access and synchronized access (a synchronized getter) to one item of the array via random(ish) reads by four different threads simultaneously. This clearly indicates the benefits of "atomic" access.
Since the first four values looked shockingly high, I really wanted to measure the access times without multithreading, so I got the results in points 5 and 6. I tried to copy and modify the methods from the previous tests to keep the code as close as possible. Of course there can be optimizations I can't influence.
Then, just out of curiosity, I came up with points 7 and 8, which imitate immutable access. Here one thread creates the array (by sequential writes) and passes its reference to another thread, which does the random(ish) read accesses on it.
The results vary heavily if the parameters are changed, such as the size of the array or the number of methods running.
The conclusion:
If an algorithm is extremely memory-intensive (lots of reads from the same small array, interrupted by short calculations - which is my case), multithreading can slow down the calculation instead of speeding it up. But if it does very many reads compared to the size of the array, it may be helpful to use an immutable copy of the array and multiple threads.
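For reference, a minimal sketch of the access patterns compared above (set, lazySet, get); the size is illustrative and this is not the original benchmark harness:

import java.util.concurrent.atomic.AtomicIntegerArray;

public class AtomicArrayDemo {
    public static void main(String[] args) {
        AtomicIntegerArray shared = new AtomicIntegerArray(50_000);

        for (int i = 0; i < shared.length(); i++) {
            shared.lazySet(i, i);   // ordered (cheaper) write; shared.set(i, i) is a full volatile write
        }

        long sum = 0;
        for (int i = 0; i < shared.length(); i++) {
            sum += shared.get(i);   // volatile-style read, safe to call from other threads
        }
        System.out.println(sum);
    }
}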

Increasing program speedup when using shared memory

I have a program that calculates Pi from the Chudnovsky formula. It's written in Java and it uses a shared Vector that is used to save intermediate calculations like factorials and powers that include the index of the element.
However, I believe that since it's a synchronized Vector (thread-safe by default), only one thread can read or write to it at a time. So when we have lots of threads, instead of seeing increasing speedup, we see the computation time become constant.
Is there anything that I can do to circumvent that? What to do when there are too many threads reading/writing to the same shared memory?
When the access pattern is lots of reads and occasional writes, you can protect an unsynchronized data structure with a ReentrantReadWriteLock. It allows multiple readers, but only a single writer.
Depending on your implementation, you might also benefit from using a ConcurrentHashMap.
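A minimal sketch of the read/write-lock approach, assuming the shared cache is a plain ArrayList of intermediate values (names are illustrative):

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedCache {
    private final List<BigDecimal> values = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    BigDecimal get(int i) {
        lock.readLock().lock();      // many threads may hold the read lock at once
        try {
            return values.get(i);
        } finally {
            lock.readLock().unlock();
        }
    }

    void add(BigDecimal v) {
        lock.writeLock().lock();     // exclusive: blocks readers and other writers
        try {
            values.add(v);
        } finally {
            lock.writeLock().unlock();
        }
    }
}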
You might be able to cheat a bit and use either an AtomicIntegerArray or an AtomicReferenceArray of Futures/CompletionStages.
Store the results of each thread in a stack. One thread collects results from every thread and adds them together. Of course the stack should not be empty.
If you want multiple threads to work on factorials, why not create a thread or two that produce a list of factorial results? Other threads can just look up results if needed.
Instead of having the same shared memory, you can have multiple threads with individual memories in a stack. Eventually, add all these up together (or occasionally) with one thread!
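A sketch of that "per-thread partial results, one thread combines them" idea; partialTerm() is a placeholder for the real per-index Chudnovsky term:

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartialSums {
    static BigDecimal partialTerm(int k) { return BigDecimal.valueOf(k); } // placeholder

    public static void main(String[] args) throws Exception {
        int terms = 1_000, threads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<BigDecimal>> futures = new ArrayList<>();

        for (int t = 0; t < threads; t++) {
            final int start = t;
            Callable<BigDecimal> task = () -> {
                BigDecimal local = BigDecimal.ZERO;           // thread-private, no sharing
                for (int k = start; k < terms; k += threads) {
                    local = local.add(partialTerm(k));
                }
                return local;
            };
            futures.add(pool.submit(task));
        }

        BigDecimal total = BigDecimal.ZERO;
        for (Future<BigDecimal> f : futures) {
            total = total.add(f.get());                       // single combining thread
        }
        pool.shutdown();
        System.out.println(total);
    }
}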
If you need high throughput, you can consider using Disruptor and RingBuffer.
At a crude level you can think of a Disruptor as a multicast graph of queues where producers put objects on it that are sent to all the consumers for parallel consumption through separate downstream queues. When you look inside you see that this network of queues is really a single data structure - a ring buffer.
Each producer and consumer has a sequence counter to indicate which slot in the buffer it's currently working on. Each producer/consumer writes its own sequence counter but can read the others' sequence counters.
A few useful links:
https://lmax-exchange.github.io/disruptor
http://martinfowler.com/articles/lmax.html
https://softwareengineering.stackexchange.com/questions/244826/can-someone-explain-in-simple-terms-what-is-the-disruptor-pattern
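A minimal sketch of that producer/consumer flow, assuming the LMAX Disruptor 3.x DSL (the event class and handler here are illustrative):

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    static class LongEvent { long value; }   // event slot reused in the ring buffer

    public static void main(String[] args) {
        int bufferSize = 1024;   // must be a power of two

        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        // consumer: runs on its own thread, sees events in sequence
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("got " + event.value));

        disruptor.start();

        // producer: claim a slot, write into it, publish
        RingBuffer<LongEvent> ringBuffer = disruptor.getRingBuffer();
        for (long i = 0; i < 10; i++) {
            final long v = i;
            ringBuffer.publishEvent((event, seq) -> event.value = v);
        }
    }
}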

How to avoid 100% CPU utilization without removing while(true)

I am working on a system where I need a while(true) loop that constantly listens to a queue and increments counts in memory.
The data is constantly coming in on the queue, so I cannot avoid the loop. But naturally it pushes my CPU utilization to 100%.
So, how can I keep a thread alive that listens to the tail of the queue and performs some action, while keeping the CPU utilization below 100%?
Blocking queues were invented exactly for this purpose.
Also see this: What are the advantages of Blocking Queue in Java?
LinkedBlockingQueue.take() is what you should be using. This waits for an entry to arrive on the queue, with no additional synchronization mechanism needed.
(There are one or two other blocking queues in Java, IIRC, but they have features that make them unsuitable in the general case. Don't know why such an important mechanism is buried so deeply in arcane classes.)
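A minimal sketch of that pattern (the message type and processing step are placeholders):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class CountingConsumer {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // producer side: called from wherever messages arrive
    void publish(String msg) { queue.offer(msg); }

    // consumer thread: sleeps inside take() at ~0% CPU until an item arrives
    void consumeLoop() throws InterruptedException {
        while (true) {
            String msg = queue.take();   // blocks, no busy-waiting
            process(msg);
        }
    }

    private void process(String msg) { /* increment in-memory counts here */ }
}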
Usually a queue has a way to retrieve an item from it such that your thread is descheduled (thus using 0% CPU) until something arrives in the queue...
Based on your comments on another answer, you want to have a queue that is based on changes in HSQLDB.
Some quick googling turns up:
http://hsqldb.org/doc/guide/triggers-chapt.html
It appears you can set it up so that changes cause a trigger to fire, which will notify a class you write implementing the org.hsqldb.Trigger interface. Have that class contain a reference to a LinkedBlockingDeque from the java.util.concurrent package and have the trigger add the change to the queue.
You now have a blocking queue that your reading thread will block on until hsqldb fires a trigger (from an update by a writer), which will put something in the queue. The waiting thread will then unblock and take the item off the queue.
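A rough sketch of the trigger side; note this is only a sketch, and the fire(...) signature shown matches HSQLDB 2.x - check the Trigger interface of your actual HSQLDB version:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

import org.hsqldb.Trigger;

public class ChangeTrigger implements Trigger {
    // shared with the reading thread, which blocks on take()
    public static final BlockingQueue<Object[]> CHANGES = new LinkedBlockingDeque<>();

    public void fire(int type, String triggerName, String tableName,
                     Object[] oldRow, Object[] newRow) {
        CHANGES.offer(newRow);   // hand the changed row to the waiting thread
    }
}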
lbalazscs and Brain have excellent answers. Since I couldn't share my code, it was hard for them to give me the exact fix for my issue. And having a while(true) that constantly polls a queue is surely the wrong way to go about it. So here is what I did:
I used a ScheduledExecutorService with a 10-second delay.
I read a block of messages (say 10k) and process those messages.
The thread is invoked again and the "loop" continues.
This considerably reduces my CPU usage. Suggestions welcomed.
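A sketch of that schedule-and-drain approach (queue and message types are placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class BatchPoller {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start() {
        // run drainBatch, then wait 10 seconds after it finishes before running it again
        scheduler.scheduleWithFixedDelay(this::drainBatch, 0, 10, TimeUnit.SECONDS);
    }

    private void drainBatch() {
        List<String> batch = new ArrayList<>(10_000);
        queue.drainTo(batch, 10_000);   // grab up to 10k messages without blocking
        for (String msg : batch) {
            // update in-memory counts here
        }
    }
}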
Lots of answers here from people who read books and spent time in school; not many based on direct logic, as far as I can see.
while(true) will set your program to use all the CPU power that's basically allotted to it by the Windows scheduler, usually running what is in the loop as fast as possible, over and over. This doesn't mean that if your application shows 100% and you run a game, your empty-loop .exe will take all the OS's CPU power; the game should still run as intended. It is more like a visual bug, similar to the Windows idle process and some other processes. The fix is to add a Sleep(1) (at least 1 millisecond), or better yet a Sleep(5), to make sure other stuff can run and the CPU is not constantly spinning through your while(true) as fast as possible. This will generally drop the reported CPU usage to 0% or 1%, as one full millisecond is a long rest even for an older CPU.
Many times, while(true) or generic endless loops are bad designs and can be drastically slowed down, even to Sleep(1000) - one-second interval checks - or higher. Endless loops are not always bad designs, but they can usually be improved.
Funny to see this bug, which I learned about when I was about 12 learning C, pop up, along with all the 'dumb' answers given.
Just know that if you try it, unless the slower scripting language you use has fixed this somewhere along the line by itself, Windows will claim a lot of CPU is being used on an empty loop even though the OS actually has free resources to spare.

Does having a lot of null objects drain memory? And if so, is there an alternative to using ArrayDeque as a queue?

I'm a n00b, so I'm sorry if I'm way off with this one, but I am using ArrayDeque as a queue for some threads to process. Each thread processes an item in the queue (each thread checks if there's data in the queue and, if there is, does queue.poll(); if the result is good it is sent to a solution queue, otherwise the data is either discarded or part of it is sent back to the queue for further processing).
Here's my problem: the longer my program runs, the more memory it uses, and eventually I get OutOfMemory errors (but it stays maxed out for a while before this happens). I'm learning Java, so I'm not sure if I have identified this correctly, but I ran YourKit on my code and it said:
Find arrays with big number of 'null' elements.
Problem: Possible memory waste.
Possible solution: Use alternate data structures e.g. maps or rework algorithms.
YourKit also showed me that 93% of my memory was stuck here (in the heap dump). Yesterday I asked a question about ArrayDeque.poll() being a possible memory hog and got a comment saying it was not, because my data is turned into 'null' once it's polled.
So, my two questions (as in my title): Is having a constantly growing number of null objects a problem (I am not sure if they get GC'ed, but since there were several million in the heap dump, I suspect maybe not)? If so, is there an alternative to ArrayDeque, maybe something that GCs items when they are no longer needed (my program is constantly processing and adding items to a queue, but even though the number of items to process goes down, the memory consumption never goes down; when the program is done it just suddenly drops to zero, and if the queue gradually empties I would expect it to gradually get smaller)?
Another, slightly related question: I'm dealing with a few billion items in a queue being processed by threads, but memory is causing it to fail. Is there a point in trying to improve my internal program queue, or would it make more sense to use a real queueing system (like RabbitMQ or ActiveMQ)? (I'm really new to programming, so I'm not sure when I've reached the limit of a tool and how to either improve it or figure out what to use next.)
ArrayDeque stores items in a flat array with a help of two "pointers" - head and tail. If the total number of elements in the queue exceeds the current size of this array, its size is doubled.
When you poll an item from the queue, the slot in this array is cleared (set to null), but the array never really shrinks! This means that if you first offer a million items to the queue and then poll all of them, the ArrayDeque still maintains an array of at least 1 million entries, all of them set to null. This explains the "Find arrays with big number of 'null' elements" message.
It seems your application at some point offers a huge number of elements to the queue. Try (periodically?) calling the following code:
queue = new ArrayDeque<String>(queue);
This will copy the contents of the old queue, allowing the unnecessarily big internal array to be garbage collected.
Note that there is no such thing as a "null object": if you removed an item from the queue and that item is no longer referenced by your code, it will be garbage collected.
It looks like the ArrayDeque implementation never shrinks its internal array, so it just keeps growing forever. When an object is polled from the deque, its corresponding array element is set to null, and the object will eventually be garbage-collected (if all other references to it disappear as well). But the internal array in ArrayDeque just keeps growing.
The Deque interface is also implemented by LinkedList and ConcurrentLinkedDeque, so you're probably best off using one of those.

Most efficient collection for this kind of LILO?

I am programming a list of recent network messages communicated to/from a client. Basically I just want a list that stores up to X number of my message objects. Once the list reaches the desired size, the oldest (first) item in the list should be removed. The collection needs to maintain its order, and all I will need to do is
iterate through it,
add an item to the end, and
remove an item from the beginning, if #2 makes it too long.
What is the most efficient structure/array/collection/method for doing this? Thanks!
You want to use a Queue.
I don't think LILO is the real term...but you're looking for a FIFO Queue
I second #rich-adams re: Queue. In particular, since you mentioned responding to network messages, I think you may want something that handles concurrency well. Check out ArrayBlockingQueue.
Based on your third requirement, I think you're going to have to extend or wrap an existing implementation, and I recommend you start with ConcurrentLinkedQueue.
Other recommendations of using any kind of blocking queue are leading you down the wrong path. A blocking queue will not allow you to add an element to a full queue until another element is removed. Furthermore, they block while waiting for that operation to happen. By your own requirements, this isn't the behavior you want. You want to automatically remove the first element when a new one is added to a full queue.
It should be fairly simple to create a wrapper around ConcurrentLinkedQueue, overriding the offer method to check the size and capacity (your wrapper class will maintain the capacity). If they're equal, your offer method will need to poll the queue to remove the first element before adding the new one.
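A sketch of such a wrapper (composition rather than subclassing, since ConcurrentLinkedQueue isn't designed for overriding its capacity behavior); names are illustrative:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class EvictingQueue<E> {
    private final ConcurrentLinkedQueue<E> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger();  // size() on the queue itself is O(n)
    private final int capacity;

    EvictingQueue(int capacity) { this.capacity = capacity; }

    void offer(E e) {
        queue.offer(e);
        if (size.incrementAndGet() > capacity) {
            if (queue.poll() != null) {   // drop the oldest element
                size.decrementAndGet();
            }
        }
    }

    Iterable<E> view() { return queue; }   // iterate oldest to newest
}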
You can use an ArrayList for this. Today's computers copy data at such speeds that it doesn't matter unless your list can contain billions of elements.
Performance information: copying 10 million elements takes 13 ms (thirteen milliseconds) on my dual core. So thinking even a second about the optimal data structure is a waste unless your use case is vastly different, i.e. you have more than 10 million elements and your application does nothing but insert and remove elements. If you operate in any way on the elements inserted/removed, chances are that the time spent in that operation exceeds the cost of the insert/remove.
A linked list seems better at first glance, but it needs more time when allocating memory, and the code is more complex (with all the pointer updating), so the runtime is worse. The only advantage of using a LinkedList in Java is that the class already implements the Queue interface, so it is more natural to use in your code (using peek() and pop()).
[EDIT] So let's have a look at efficiency. What is efficiency? The fastest algorithm? The one which takes the least amount of lines (and therefore has the least amount of bugs)? The algorithm which is easiest to use (= least amount of code on the developer side + less bugs)? The algorithm which performs best (which is not always the fastest algorithm)?
Let's look at some details: LinkedList implements Queue, so the code which uses the list is a bit more simple (list.pop() instead of list.remove(0)). But LinkedList will allocate memory for each add() while ArrayList only allocates memory once per N elements. And to reduce this even further, ArrayList will allocate N*3/2 elements, so as your list grows, the number of allocations will shrink. If you know the size of your list in advance, ArrayList will only allocate memory once. This also means that the GC has less clutter to clean up. So from a performance point of view, ArrayList wins by an order of magnitude in the average case.
The synchronized versions are only necessary when several threads access the data structure. With Java 5, many of those have seen dramatic speed improvements. If you have several threads putting and popping, use ArrayBlockingQueue; LinkedBlockingQueue might also be an option despite the worse allocation performance, since the implementation may allow pushing and popping from two different threads at the same time as long as the queue size is >= 2 (in this special case, the two threads won't have to touch the same pointers). To decide, the only option is to run a profiler and measure which version is faster.
That said: any advice on performance is wrong 90% of the time unless it is backed by a measurement. Today's systems have become so complex, and there is so much going on in the background, that it is impossible for a mere human to understand or even enumerate all the factors that play a role.
You can get by with a plain old ArrayList.
When adding, just do (suppose the ArrayList is called al):
if (al.size() >= YOUR_MAX_ARRAY_SIZE) {
    al.remove(0);        // drop the oldest element
}
al.add(newMessage);      // then append the new item (newMessage is a placeholder)
I think that you want to implement a Queue<E> where the peek, poll, and remove methods act as if there is nothing at the head until the count exceeds the threshold that you want. You probably want to wrap one of the existing implementations.
LinkedList should be what you're looking for
