Most efficient collection for this kind of LILO? - java

I am programming a list of recent network messages communicated to/from a client. Basically I just want a list that stores up to X number of my message objects. Once the list reaches the desired size, the oldest (first) item in the list should be removed. The collection needs to maintain its order, and all I will need to do is
iterate through it,
add an item to the end, and
remove an item from the beginning, if #2 makes it too long.
What is the most efficient structure/array/collection/method for doing this? Thanks!

You want to use a Queue.

I don't think LILO is the standard term... but you're looking for a FIFO queue (LILO, last in/last out, describes the same ordering as first in/first out).

I second #rich-adams re: Queue. In particular, since you mentioned responding to network messages, I think you may want something that handles concurrency well. Check out ArrayBlockingQueue.

Based on your third requirement, I think you're going to have to extend or wrap an existing implementation, and I recommend you start with ConcurrentLinkedQueue.
Other recommendations of using any kind of blocking queue are leading you down the wrong path. A blocking queue will not allow you to add an element to a full queue until another element is removed. Furthermore, they block while waiting for that operation to happen. By your own requirements, this isn't the behavior you want. You want to automatically remove the first element when a new one is added to a full queue.
It should be fairly simple to create a wrapper around ConcurrentLinkedQueue, overriding the offer method to check the size and capacity (your wrapper class will maintain the capacity). If they're equal, your offer method will need to poll the queue to remove the first element before adding the new one.
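A minimal sketch of such a wrapper (the class name and capacity-tracking approach here are my own; note that size() on ConcurrentLinkedQueue is O(n), so for large capacities you may prefer to track the count in an AtomicInteger instead):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical bounded wrapper: evicts the head when a new element
// would exceed the fixed capacity.
public class BoundedQueue<E> {
    private final Queue<E> queue = new ConcurrentLinkedQueue<>();
    private final int capacity;

    public BoundedQueue(int capacity) {
        this.capacity = capacity;
    }

    public boolean offer(E e) {
        while (queue.size() >= capacity) {
            queue.poll(); // drop the oldest element
        }
        return queue.offer(e);
    }

    public E poll() { return queue.poll(); }
    public int size() { return queue.size(); }
}
```

Note that the eviction and the insertion are not a single atomic step, so under heavy contention the queue can momentarily hold one element more than the capacity; for a recent-messages buffer that is usually acceptable.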

You can use an ArrayList for this. Today's computers copy data at such speeds that it doesn't matter unless your list can contain billions of elements.
Performance information: copying 10 million elements takes 13ms (thirteen milliseconds) on my dual core. So thinking even a second about the optimal data structure is a waste unless your use case is vastly different, i.e. you have more than 10 million elements and your application does nothing but insert and remove elements. If you operate in any way on the elements inserted/removed, chances are that the time spent in this operation exceeds the cost of the insert/remove.
A linked list seems better at first glance, but it needs more time when allocating memory, plus the code is more complex (with all the pointer updating), so the runtime is worse. The only advantage of using a LinkedList in Java is that the class already implements the Queue interface, so it is more natural to use in your code (using peek() and pop()).
[EDIT] So let's have a look at efficiency. What is efficiency? The fastest algorithm? The one which takes the least amount of lines (and therefore has the least amount of bugs)? The algorithm which is easiest to use (= least amount of code on the developer side + less bugs)? The algorithm which performs best (which is not always the fastest algorithm)?
Let's look at some details: LinkedList implements Queue, so the code which uses the list is a bit simpler (list.pop() instead of list.remove(0)). But LinkedList will allocate memory for each add(), while ArrayList only allocates memory once per N elements. And to reduce this even further, ArrayList will allocate N*3/2 elements when it grows, so as your list grows, the number of allocations shrinks. If you know the size of your list in advance, ArrayList will only allocate memory once. This also means that the GC has less clutter to clean up. So from a performance point of view, ArrayList wins by an order of magnitude in the average case.
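As a small illustration of the allocation point (the timing figures above are the author's, not reproduced here), pre-sizing an ArrayList avoids all intermediate reallocations:

```java
import java.util.ArrayList;
import java.util.List;

public class PresizeDemo {
    public static void main(String[] args) {
        // Known size up front: the backing array is allocated exactly once.
        List<Integer> presized = new ArrayList<>(10_000);
        for (int i = 0; i < 10_000; i++) {
            presized.add(i);
        }

        // Unknown size: the backing array is reallocated and copied each
        // time capacity runs out (growing by roughly 1.5x per step).
        List<Integer> growing = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            growing.add(i);
        }
        System.out.println(presized.size() + " " + growing.size());
    }
}
```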
The synchronized versions are only necessary when several threads access the data structure. With Java 5, many of those have seen dramatic speed improvements. If you have several threads putting and popping, use ArrayBlockingQueue, but in this case LinkedBlockingQueue might be an option despite the bad allocation performance, since the implementation might allow pushing and popping from two different threads at the same time as long as the queue size is >= 2 (in this special case, the two threads won't have to access the same pointers). To decide that, the only option is to run a profiler and measure which version is faster.
That said: any advice on performance is wrong 90% of the time unless it is backed by a measurement. Today's systems have become so complex, and there is so much going on in the background, that it is impossible for a mere human to understand or even enumerate all the factors which play a role.

You can get by with a plain old ArrayList.
When adding, just do (suppose the ArrayList is called al):
if (al.size() >= YOUR_MAX_ARRAY_SIZE)
{
    al.remove(0); // drop the oldest element before appending
}
al.add(newMessage);

I think that you want to implement a Queue<E> where the peek, poll and remove methods act as if there is nothing at the head until the count exceeds the threshold that you want. You probably want to wrap one of the existing implementations.

LinkedList should be what you're looking for


What will happen in HashMap , if we put an element while rehashing is happening?

I want to know what happens if we try to put an element in the map while resizing or rehashing is happening. Does it go to the new, larger table or to the old one?
And also, what is the use of the extra free space in a HashMap, which is 25% of the original map size when the load factor is 75%?
Perhaps this needs a coherent answer.
I want to know what happens if we try to put an element in the map while resizing or rehashing is happening.
This question only makes sense if you have two or more threads performing operations on the HashMap. If you are doing that your code is not thread-safe. Its behavior is unspecified, version specific, and you are likely to have bad things happen at unpredictable times. Things like entries being lost, inexplicable NPEs, even one of your threads going into an infinite loop.
You should not write code where two or more threads operate on a HashMap without appropriate external synchronization to avoid simultaneous operations. If you do, I cannot tell you what will happen.
If you only have one thread using the HashMap, then the scenario you are concerned about is impossible. The resize happens during an update operation.
If you have multiple threads and synchronize to prevent any simultaneous operations, then the scenario you are concerned about is impossible. The other alternative is to use ConcurrentHashMap, which is designed to work correctly when multiple threads can read and write simultaneously. (Naturally, the code for resizing a ConcurrentHashMap is a lot more complicated. But it ensures that entries end up in the right place.)
Does it go to the new, larger table or to the old one?
Assuming you are talking about the multi-threaded non-synchronized case, the answer is unspecified and possibly version specific. (I haven't checked the code.) For the other cases, the scenario is impossible.
And also, what is the use of the extra free space in a HashMap, which is 25% of the original map size when the load factor is 75%?
It is not used. If the load factor is 75%, at least 25% of the hash slots will be empty / never used. (Until you hit the point where the hash array cannot be expanded any further for architectural reasons. But you will rarely reach that point.)
This is a performance trade-off. The Sun engineers determined / judged that a load factor of 75% would give the best trade-off between memory used and time taken to perform operations on a HashMap. As you increase the load factor, the space utilization gets better but most operations on the HashMap get slower, because the average length of a hash chain increases.
You are free to use a different load factor value if you want. Just be aware of the potential consequences.
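For example, both the initial capacity and the load factor can be passed to the HashMap constructor (a small sketch of the trade-off described above):

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // Initial capacity 64, load factor 0.9f: better space utilization,
        // but longer hash chains on average, so slower operations.
        Map<String, Integer> dense = new HashMap<>(64, 0.9f);

        // Load factor 0.5f: resizes when half full, so fewer collisions
        // at the cost of more empty slots.
        Map<String, Integer> sparse = new HashMap<>(64, 0.5f);

        dense.put("a", 1);
        sparse.put("a", 1);
        System.out.println(dense.get("a") + " " + sparse.get("a"));
    }
}
```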
Resizing and multithreading
If you access the hash map from a single thread, then it cannot happen. Resizing is triggered not by a timer, but by an operation that changes the number of elements in the hash map, e.g. it is triggered by the put() operation. If you call put() and the hash map sees that resizing is needed, it will perform the resizing and then add your new element. This means the new element is added after resizing, no element is lost, and there is no inconsistent behaviour in any of the methods.
But if you access your hash map from multiple threads, then there can be many sorts of problems. For instance, if two threads call put() at the same time, both can trigger resizing. One of the consequences can be that the new element of one of the threads is lost. Even if resizing is not needed, multithreading can lead to losing some elements. For instance, two threads generate the same bucket index and there is no such bucket yet. Both threads create the bucket and add it to the array of buckets. But the most recent one wins; the other will be overwritten.
It is nothing specific to hash map. It is a typical problem when you modify object by multiple threads. To handle hash maps in multithreading environment correctly, you can either implement synchronization or use a class that is already thread safe, ConcurrentHashMap.
Load factor
Elements in a hash map are stored in buckets. If each hash corresponds to a single bucket index, then the access time is O(1).
The more hashes you have, the higher the probability that two hashes produce the same bucket index. Then they will be stored in the same bucket, and the access time will increase.
One solution to reduce such collisions is to use another hash function. But 1) designing a hash function that fits particular requirements can be a very non-trivial task (besides reducing collisions, it should provide acceptable performance), and 2) you can only improve the hash in your own classes, not in the libraries you use.
Another, simpler solution is to use a bigger number of buckets for the same number of hashes. When you reduce the ratio (number of hashes) / (number of buckets), you reduce the probability of collisions and thus keep the access time close to O(1). But the price is that you need more memory. For instance, with a load factor of 75%, 25% of the bucket array is unused; with a load factor of 10%, 90% is unused.
There is no solution that fits all cases. Try different values and measure performance and memory usage, and then decide what is better in your case.

Java object that dereferences itself after a time

I have a large HashMap for storing POJOs.
As time goes on, I add more and more POJOs and the map gets bigger and bigger and eventually I run out of memory.
I want the objects within the map to dereference themselves after a set period of time (so they are effectively being cached).
I was thinking about using timers within the objects but I'm not sure if this is the standard (or proper) way of doing it.
Any advice would be greatly appreciated, thanks in advance!
I would at least implement an absolute threshold on your map's size (in number of objects, if you know the POJOs' size). On add, check whether the size limit will be exceeded, and if so, iterate through to find the oldest MaxCount * N items (N is a fraction; probably start with 20% = 0.2) and remove them. Because you trim a percentage of the limit (very important, versus a fixed number), adding items is still O(1) amortized. If, additionally, you want a timer on your collection that throws away anything older than some threshold, that would also help you keep the size manageable in a separate thread, without the call to add taking an exorbitant amount of time in a few rare cases.
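One way to sketch this trim-on-add idea (class and field names are my own; I use a LinkedHashMap because it keeps insertion order, so the "oldest" entries are simply the first ones in iteration order):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-capped cache that bulk-trims the oldest entries.
public class CappedCache<K, V> {
    private final int maxCount;
    private final double trimFraction; // e.g. 0.2 = drop oldest 20%
    private final Map<K, V> map = new LinkedHashMap<>();

    public CappedCache(int maxCount, double trimFraction) {
        this.maxCount = maxCount;
        this.trimFraction = trimFraction;
    }

    public void put(K key, V value) {
        if (map.size() >= maxCount) {
            int toRemove = (int) (maxCount * trimFraction);
            Iterator<K> it = map.keySet().iterator();
            for (int i = 0; i < toRemove && it.hasNext(); i++) {
                it.next();
                it.remove(); // evict the oldest entries in bulk
            }
        }
        map.put(key, value);
    }

    public V get(K key) { return map.get(key); }
    public int size() { return map.size(); }
}
```

For time-based expiry rather than size-based, the same structure could be swept periodically by a background thread, as the answer suggests.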

Non blocking buffer in java

In a high volume multi-threaded java project I need to implement a non-blocking buffer.
In my scenario I have a web layer that receives ~20,000 requests per second. I need to accumulate some of those requests in some data structure (aka the desired buffer) and when it is full (let's assume it is full when it contains 1000 objects) those objects should be serialized to a file that will be sent to another server for further processing.
The implementation should be a non-blocking one.
I examined ConcurrentLinkedQueue but I'm not sure it can fit the job.
I think I need to use 2 queues in a way that once the first gets filled it is replaced by a new one, and the full queue ("the first") gets delivered for further processing. This is the basic idea I'm thinking of at the moment, and still I don't know if it is feasible since I'm not sure I can switch pointers in java (in order to switch the full queue).
Any advice?
Thanks
What I usually do with requirements like this is create a pool of buffers at app startup and store the references in a BlockingQueue. The producer thread pops buffers, fills them and then pushes the refs to another queue upon which the consumers are waiting. When the consumer/s are done (data written to file, in your case), the refs get pushed back onto the pool queue for re-use. This provides lots of buffer storage, no need for expensive bulk copying inside locks, eliminates GC actions, provides flow control (if the pool empties, the producer is forced to wait until some buffers are returned), and prevents memory runaway, all in one design.
More: I've used such designs for many years in various other languages too, (C++, Delphi), and it works well. I have an 'ObjectPool' class that contains the BlockingQueue and a 'PooledObject' class to derive the buffers from. PooledObject has an internal private reference to its pool, (it gets initialized on pool creation), so allowing a parameterless release() method. This means that, in complex designs with more than one pool, a buffer always gets released to the correct pool, reducing cockup-potential.
Most of my apps have a GUI, so I usually dump the pool level to a status bar on a timer, every second, say. I can then see roughly how much loading there is, if any buffers are leaking, (number consistently goes down and then app eventually deadlocks on empty pool), or I am double-releasing, (number consistently goes up and app eventually crashes).
It's also fairly easy to change the number of buffers at runtime, by either creating more and pushing them into the pool, or by waiting on the pool, removing buffers and letting GC destroy them.
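A bare-bones sketch of that pool design (names are mine; a real version would hold reusable buffer objects rather than plain lists, and the interrupt handling here is simplified):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical buffer pool: empty buffers wait in a BlockingQueue and
// are recycled after the consumer is done with them.
public class BufferPool {
    private final BlockingQueue<List<String>> pool;

    public BufferPool(int buffers, int bufferSize) {
        pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(new ArrayList<>(bufferSize));
        }
    }

    // Producer blocks here if the pool is empty: built-in flow control.
    public List<String> acquire() {
        try {
            return pool.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    // Consumer clears the buffer and returns it to the pool for re-use.
    public void release(List<String> buffer) {
        buffer.clear();
        pool.add(buffer);
    }

    // Pool level, e.g. for dumping to a status bar as described above.
    public int available() { return pool.size(); }
}
```

In the full design a second BlockingQueue would carry filled buffers from the producer to the consumers; only the pool side is shown here.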
I think you have a very good point with your solution. You would need two queues, the processingQueue would be the buffer size you want (in your example that would be 1000) while the waitingQueue would be a lot bigger. Every time the processingQueue is full it will put its contents in the specified file and then grab the first 1000 from the waitingQueue (or less if the waiting queue has fewer than 1000).
My only concern about this is that you mention 20000 per second and a buffer of 1000. I know the 1000 was an example, but if you don't make it bigger it might just be that you are moving the problem to the waitingQueue rather than solving it, as your waitingQueue will receive 1000 new ones faster than the processingQueue can process them, giving you a buffer overflow in the waitingQueue.
Instead of putting each request object in a queue, allocate an array of size 1000, and when it is filled, put that array in the queue to the sender thread which serializes and sends the whole array. Then allocate another array.
How are you going to handle the situation when the sender cannot work fast enough and its queue is overflown? To avoid out of memory error, use queue of a limited size.
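A sketch of this batching idea (names and sizes are illustrative; here a full queue silently drops the batch, whereas a real implementation would need to block, spill to disk, or report the overflow):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical batching buffer: the producer fills a plain array and
// hands the whole batch to a bounded queue for the sender thread.
public class BatchingBuffer {
    private static final int BATCH_SIZE = 1000;
    private final BlockingQueue<Object[]> outbox = new ArrayBlockingQueue<>(64);
    private Object[] current = new Object[BATCH_SIZE];
    private int count = 0;

    // Called by the producer (single-threaded in this sketch).
    public void add(Object request) {
        current[count++] = request;
        if (count == BATCH_SIZE) {
            if (!outbox.offer(current)) {
                // Queue full: the sender has fallen behind. This sketch
                // drops the batch; a real implementation must decide
                // whether to block, spill, or signal an error.
            }
            current = new Object[BATCH_SIZE];
            count = 0;
        }
    }

    // Called by the sender thread; returns null if no batch is ready.
    public Object[] nextBatch() {
        return outbox.poll();
    }
}
```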
I might be getting something wrong, but you could use an ArrayList for this, as you don't need to poll per element from your queue. You just flush (create a copy and clear) your list in a synchronized section when its size reaches the limit and you need to send it. Adding to this list should also be synced with this flush operation.
Swapping your arrays might not be safe - if your sending is slower than your generation, buffers may soon start overwriting each other. And 20000-elements array allocation per second is almost nothing for GC.
Object lock = new Object();
List<Object> list = new ArrayList<>();
synchronized (lock) {
    list.add(request);
}
...
// this check outside is a quick dirty check for performance,
// it's not valid outside the sync block
// this first check takes less than a nanosecond and will filter out 99.9%
// of the `synchronized(lock)` sections
if (list.size() > 1000) {
    synchronized (lock) { // this should take less than a microsecond
        if (list.size() > 1000) { // this one is valid
            // make sure this is async (i.e. saved in a separate thread) or <1ms
            // new array allocation must be the slowest part here
            sendAsyncInASeparateThread(new ArrayList<>(list));
            list.clear();
        }
    }
}
UPDATE
Considering that sending is async, the slowest part here is new ArrayList<>(list), which should take around 1 microsecond for 1000 elements, i.e. around 20 microseconds per second at 20 batches per second. I didn't measure that; I extrapolated from the proportion that 1 million elements are allocated in ~1 ms.
If you still require a super-fast synchronized queue, you might want to have a look at the MentaQueue
What do you mean by "switch pointers"? There are no pointers in Java (unless you're talking about references).
Anyways, as you probably saw from the Javadoc, ConcurrentLinkedQueue has a "problem" with the size() method. Still, you could use your original idea of 2 (or more) buffers that would get switched. There's probably going to be some bottlenecks with the disk I/O. Maybe the non-constant time of size() won't be a problem here either.
Of course if you want it to be non-blocking, you better have a lot of memory and a fast disk (and large / bigger buffers).
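For what it's worth, the "switch pointers" idea from the question can be expressed with an AtomicReference, which swaps the active buffer in a single atomic step (a sketch with names of my own):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical double-buffer: the active queue lives in an
// AtomicReference, and a full queue is swapped for a fresh one.
public class SwappingBuffer<E> {
    private final AtomicReference<Queue<E>> active =
            new AtomicReference<>(new ConcurrentLinkedQueue<>());

    public void add(E e) {
        active.get().offer(e);
    }

    // Atomically replaces the active queue and returns the old one for
    // serialization. Caveat: elements offered by other threads between
    // their get() and this swap may land in either queue.
    public Queue<E> drain() {
        return active.getAndSet(new ConcurrentLinkedQueue<>());
    }
}
```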

Does having a lot of null objects drain memory? and if so is there an alternative to using ArrayDeque as a queue?

I'm a n00b so I'm sorry if I'm way off with this one, but I am using ArrayDeque as a queue for some threads to process. Each thread processes an item from the queue (each thread checks whether there's data in the queue and, if there is, does queue.poll(); if the item is good then it's sent away to a solution queue, otherwise the data is either discarded or part of it is sent back to the queue for further processing).
Here's my problem: the longer my program runs, the more memory it uses, and eventually I get OutOfMemoryErrors (but memory stays maxed out for a while before this happens). I'm learning Java, so I'm not sure if I have identified this correctly, but I ran YourKit on my code and it said:
Find arrays with big number of 'null' elements.
Problem: Possible memory waste.
Possible solution: Use alternate data structures e.g. maps or rework algorithms.
YourKit also showed me that 93% of my memory was stuck here (in the heap dump). Yesterday I asked a question about ArrayDeque.poll() being a possible memory hog and got a comment saying that it was not, because my data is set to null once it's polled.
So my two questions (as in my title) are: is a constantly growing number of null elements a problem (I am not sure if they get GC'ed, but since there were several million in the heap dump, I suspect maybe not)? If so, is there an alternative to ArrayDeque, maybe something that GCs items when they are no longer needed (my program is constantly processing and adding items to a queue, but even though the number of items to process goes down, the memory consumption never does; when the program finishes it just suddenly drops to zero, whereas if the queue were gradually shrinking I would expect memory use to gradually shrink too)?
Another slightly related question: I'm dealing with a few billion items in a queue that's being processed by threads, but memory is causing it to fail. Is there a point in trying to improve my internal program queue, or would it make more sense to use a real queue program like RabbitMQ or ActiveMQ? (I'm really new to programming, so I'm not sure when I've reached the limit of a tool and how to either improve it or figure out what to use next.)
ArrayDeque stores items in a flat array with a help of two "pointers" - head and tail. If the total number of elements in the queue exceeds the current size of this array, its size is doubled.
When you poll an item from the queue, the slot in this array is cleared (set to null), but the array never really shrinks! This means that if you first offer a million items to the queue and then poll all of them, the ArrayDeque still maintains an array of at least 1 million entries, all of them set to null. This explains the Find arrays with big number of 'null' elements message.
Seems like your application at some point in time offers huge number of elements to the queue. Try (periodically?) calling the following code:
queue = new ArrayDeque<String>(queue);
This will copy the contents of old queue, garbage collecting unnecessarily big internal array.
Note that there is no such thing as a null object - if you removed an item from the queue and this item is no longer referenced by your code - it will be garbage collected.
It looks like the ArrayDeque implementation never shrinks its internal array, so it just keeps growing forever. When an object is polled from the deque, its corresponding array element is set to null, and the object will eventually be garbage-collected (if all other references to it disappear as well). But the internal array in ArrayDeque just keeps growing.
The Deque interface is also implemented by LinkedList and ConcurrentLinkedDeque, so you're probably best off using one of those.

most efficient data storage object for queuing a max number of elements

I need a storage type that I can set a max number of elements for and whenever I add something to the tail, the head is truncated as necessary with low overhead. I can of course do this manually if I have to. Example
max = 1000
fill it with integers 1-1000 : [1,2,...,999,1000]
add numbers 1000 - 1500 : [500,501,....,1499,1500]
It has to be as cheap an operation as possible since I will be running multiple threads at this time, one doing audio recording. I don't care about keeping the head elements as they are popped off, I would like to get rid of them in a bulk operation.
I checked out the queue types in the SDK, not sure which could suit these needs, possibly a linked queue of some kind.
Thanks for any help
Use a ring buffer, also known as a circular queue; these can be implemented as arrays, so they're particularly cheap. See this question for an implementation in Java.
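For illustration, a minimal int ring buffer might look like this (not thread-safe; a real version for the audio use case would need synchronization or a lock-free design):

```java
// A fixed-capacity ring buffer that overwrites the oldest element when
// full, so "truncating the head" costs nothing but an index bump.
public class RingBuffer {
    private final int[] data;
    private int head = 0; // index of the oldest element
    private int size = 0;

    public RingBuffer(int capacity) {
        data = new int[capacity];
    }

    public void add(int value) {
        int tail = (head + size) % data.length;
        data[tail] = value;
        if (size < data.length) {
            size++;
        } else {
            head = (head + 1) % data.length; // overwrite: advance head
        }
    }

    // i = 0 returns the oldest element still in the buffer.
    public int get(int i) {
        return data[(head + i) % data.length];
    }

    public int size() { return size; }
}
```

Because the backing store is a plain array that never reallocates, adds are O(1) with no per-element garbage, which matches the low-overhead requirement.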
