I have a large HashMap for storing POJOs.
As time goes on, I add more and more POJOs and the map gets bigger and bigger and eventually I run out of memory.
I want entries in the map to expire and be removed after a set period of time (so the map effectively acts as a cache).
I was thinking about using timers within the objects but I'm not sure if this is the standard (or proper) way of doing it.
Any advice would be greatly appreciated, thanks in advance!
I would at least implement an absolute threshold on your map's size (in number of objects, if you know the POJOs' size). On add, check whether your size limit will be exceeded; if so, iterate through to find the oldest MaxCount * N items (N is a fraction; probably start with 20% = 0.2) and remove them. Because you trim a percentage of the limit (very important, versus a fixed number), adding items is still O(1) (amortized). If, additionally, you want a timer on your collection that throws away anything older than some threshold, that would also help you keep the size manageable in a separate thread, without the call to add taking an exorbitant amount of time in a few rare cases.
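A minimal sketch of that trim-on-add idea, assuming insertion order approximates age (the class and its names are hypothetical; a LinkedHashMap keeps insertion order, so its first entries are the oldest):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a map that trims its oldest 20% when a size limit is hit. */
class TrimmingCache<K, V> {
    private static final double TRIM_FRACTION = 0.2;
    private final int maxSize;
    // LinkedHashMap keeps insertion order, so the first entries are the oldest.
    private final LinkedHashMap<K, V> map = new LinkedHashMap<>();

    TrimmingCache(int maxSize) { this.maxSize = maxSize; }

    void put(K key, V value) {
        if (map.size() >= maxSize) {
            int toRemove = (int) (maxSize * TRIM_FRACTION);
            Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
            for (int i = 0; i < toRemove && it.hasNext(); i++) {
                it.next();
                it.remove();   // drops the oldest entries first
            }
        }
        map.put(key, value);
    }

    V get(K key) { return map.get(key); }
    int size() { return map.size(); }
}
```

Because a whole fraction is trimmed at once, most put() calls do no trimming at all, which is what keeps the amortized cost O(1).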
I want to know: when resizing or rehashing is happening, what will happen if we try to put an element in the map? Does it go to the new, larger map or the old one?
And also, what is the use of the extra free space in a HashMap, which is 25% of the original map size since the load factor is 75%?
Perhaps this needs a coherent answer.
I want to know when resizing or rehashing is happening what will happen if we try to put an element in the map.
This question only makes sense if you have two or more threads performing operations on the HashMap. If you are doing that your code is not thread-safe. Its behavior is unspecified, version specific, and you are likely to have bad things happen at unpredictable times. Things like entries being lost, inexplicable NPEs, even one of your threads going into an infinite loop.
You should not write code where two or more threads operate on a HashMap without appropriate external synchronization to avoid simultaneous operations. If you do, I cannot tell you what will happen.
If you only have one thread using the HashMap, then the scenario you are concerned about is impossible. The resize happens during an update operation.
If you have multiple threads and synchronize to prevent any simultaneous operations, then the scenario you are concerned about is impossible. The other alternative is to use ConcurrentHashMap, which is designed to work correctly when multiple threads read and write simultaneously. (Naturally, the code for resizing a ConcurrentHashMap is a lot more complicated. But it ensures that entries end up in the right place.)
Does it go to the new, larger map or the old map?
Assuming you are talking about the multi-threaded non-synchronized case, the answer is unspecified and possibly version specific. (I haven't checked the code.) For the other cases, the scenario is impossible.
And also, what is the use of the extra free space in a HashMap, which is 25% of the original map size since the load factor is 75%?
It is not used. If the load factor is 75%, at least 25% of the hash slots will be empty / never used. (Until you hit the point where the hash array cannot be expanded any further for architectural reasons. But you will rarely reach that point.)
This is a performance trade-off. The Sun engineers determined / judged that a load factor of 75% would give the best trade-off between memory used and time taken to perform operations on a HashMap. As you increase the load factor, the space utilization gets better but most operations on the HashMap get slower, because the average length of a hash chain increases.
You are free to use a different load factor value if you want. Just be aware of the potential consequences.
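For illustration, a different load factor is just a constructor argument; the numbers here are arbitrary, not a recommendation:

```java
import java.util.HashMap;
import java.util.Map;

class LoadFactorDemo {
    // A map tuned for density: 64 buckets, and no resize until the
    // entry count passes the threshold capacity * loadFactor = 64 * 0.9.
    static Map<String, Integer> denseMap() {
        return new HashMap<>(64, 0.9f);
    }

    // The default is equivalent to new HashMap<>(16, 0.75f).
    static Map<String, Integer> defaultMap() {
        return new HashMap<>();
    }
}
```

A higher load factor trades longer average hash chains (slower lookups) for fewer empty slots, exactly the trade-off described above.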
Resizing and multithreading
If you access the hash map from a single thread, then it cannot happen. Resizing is triggered not by a timer, but by an operation that changes the number of elements in the hash map; for example, it is triggered by a put() operation. If you call put() and the hash map sees that resizing is needed, it performs the resizing first and then adds your new element. This means the new element is added after resizing: no element is lost, and there is no inconsistent behaviour in any of the methods.
But if you access your hash map from multiple threads, then there can be many sorts of problems. For instance, if two threads call put() at the same time, both can trigger resizing. One consequence can be that the new element of one of the threads is lost. Even if resizing is not needed, multithreading can lead to losing some elements. For instance, two threads generate the same bucket index, and there is no such bucket yet. Both threads create the bucket and add it to the array of buckets. But the most recent write wins; the other one is overwritten.
This is nothing specific to hash maps. It is a typical problem when you modify an object from multiple threads. To handle hash maps correctly in a multithreaded environment, you can either implement synchronization yourself or use a class that is already thread-safe, such as ConcurrentHashMap.
Load factor
Elements in hash map are stored in buckets. If each hash corresponds to a single bucket index, then the access time is O(1).
The more hashes you have, the higher the probability that two hashes produce the same bucket index. Then they will be stored in the same bucket, and the access time will increase.
One solution to reduce such collisions is to use another hash function. But 1) designing hash functions that fit particular requirements can be a very non-trivial task (besides reducing collisions, the function should provide acceptable performance), and 2) you can improve the hash only in your own classes, not in the libraries you use.
Another, simpler solution is to use a bigger number of buckets for the same number of hashes. When you reduce the ratio (number of hashes) / (number of buckets), you reduce the probability of collisions and thus keep the access time close to O(1). But the price is that you need more memory. For instance, with a 75% load factor, 25% of the bucket array is unused; with a 10% load factor, 90% is unused.
There is no solution that fits all cases. Try different values and measure performance and memory usage, and then decide what is better in your case.
I'm trying to maintain a map of keys to their respective elapsed time (long values). Guava's AtomicLongMap works quite nicely for this with one issue: I only want to maintain largest values (elapsed time) so that the Map does not become ridiculous in size (there are large number of possible keys).
Thus I would like to ideally evict entries to maintain the map at a certain size. The largest values would be kept. Obviously I could do this in a blocking fashion (synchronized) but I'm looking for something less blocking as this map is accessed very very frequently by many threads.
One idea I have is to make a reaper that would run after a certain threshold has been hit: copy the map, trim it, and then reset the reference (probably an AtomicReference or a marked volatile). Of course this has many downsides, such as maintaining a separate thread and losing data while the map is being copied, along with, I'm sure, various other things that could go wrong.
Is there a data structure / library I should consider instead?
The simplest option is likely to add a concurrent ring buffer alongside the map. When you add an entry to the map, its key goes into the ring buffer, and if the buffer is full it hands you back an old key whose entry you remove. This way you are limited to the size of the ring buffer and you don't need an additional thread.
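A hedged sketch of that idea, pairing a map with a fixed-size ring of keys. This uses ArrayBlockingQueue as the ring and a plain ConcurrentHashMap as a stand-in for Guava's AtomicLongMap; the class name is made up. Note the caveats: repeated keys occupy multiple ring slots (so a hot key can be evicted early), and the check-then-evict step is not atomic.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch: a ring buffer of keys caps the map's size. */
class RingBoundedTimes {
    private final ConcurrentMap<String, Long> times = new ConcurrentHashMap<>();
    private final ArrayBlockingQueue<String> ring;

    RingBoundedTimes(int capacity) {
        this.ring = new ArrayBlockingQueue<>(capacity);
    }

    void record(String key, long elapsed) {
        // merge keeps the largest elapsed time seen for the key
        times.merge(key, elapsed, Math::max);
        if (!ring.offer(key)) {           // ring full: evict one old key
            String evicted = ring.poll();
            if (evicted != null) times.remove(evicted);
            ring.offer(key);
        }
    }

    Long get(String key) { return times.get(key); }
}
```

The eviction is approximate, but nothing blocks: each call does at most one offer, one poll, and one map removal.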
A List grows when it is full, but a HashMap/Hashtable grows when it reaches its load factor. Why can't a HashMap wait to resize until it is full? Is this part of the underlying hashing algorithm?
There's a big difference between an array-list and a hash-map: the former stores each entry into discrete slots, while the latter may put more than one entry into a slot if the entries' hashes match. That means that a hash-map may start to slow down long before every slot is taken and indeed, it's quite unlikely that you'd fill every slot once and once only before having to double-up in a slot.
If you've got a fixed set of things that can be hashed, it is possible to create a hash and from it a hash-map that will store just that fixed set of things in an efficient manner: the result is called a perfect hash.
Because the probability of hash collisions rises dramatically towards the end of its capacity (there are just not enough empty buckets). As more entries end up in the same buckets, the effectiveness of queries is diminished long before it is full. Depending on the hashing algorithm, the optimal load factor may vary.
The effectiveness of queries on arrays is not affected by their load factor, which is why it makes no sense to resize them earlier.
An arraylist always puts new elements in the next free slot, there's no need to expand until that slot is taken unless you want to save time in the future - in which case you can use ensureCapacity.
A hashmap, on the other hand, calculates an integer value for each object you put into it. Based on this value the object is stored in a particular bucket; this is done to support fast look-ups. However, the calculated value is not necessarily unique, and even if it were, two different values might point at the same bucket. This is especially common for small numbers of buckets, and is ridiculously likely to happen if your buckets are almost full.
Consider a hashmap which stores people in buckets based on their birthday. Even with 365 buckets, with 10 people there's roughly a 10% chance that you would have a collision. With 23 there's a 50% chance (this is the classic birthday problem).
Now, a single collision isn't a big deal, but when you use a hashmap you typically do it for the fast lookups. If several items are in the same bucket, the time it takes to perform a lookup grows longer and longer. Therefore, for performance reasons, you want to increase the number of buckets in order to decrease the density of your elements.
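The birthday numbers above are easy to check, assuming a uniform hash; this helper is illustrative, not from any library:

```java
class BirthdayCollision {
    // Probability that at least two of n keys land in the same one
    // of b buckets, assuming a uniform hash (the birthday problem).
    static double collisionProbability(int n, int b) {
        double pNoCollision = 1.0;
        for (int i = 0; i < n; i++) {
            // i-th key must avoid the i buckets already taken
            pNoCollision *= (double) (b - i) / b;
        }
        return 1.0 - pNoCollision;
    }
}
```

With n = 23 and b = 365 this comes out just above 50%, matching the figure quoted above; the same formula explains why a hashmap resizes well before every bucket is occupied.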
I need a storage type that I can set a max number of elements for and whenever I add something to the tail, the head is truncated as necessary with low overhead. I can of course do this manually if I have to. Example
max = 1000
fill it with integers 1-1000 : [1,2,...,999,1000]
add numbers 1000 - 1500 : [500,501,....,1499,1500]
It has to be as cheap an operation as possible since I will be running multiple threads at this time, one doing audio recording. I don't care about keeping the head elements as they are popped off, I would like to get rid of them in a bulk operation.
I checked out the queue types in the SDK, not sure which could suit these needs, possibly a linked queue of some kind.
Thanks for any help
Use a ring buffer, also known as a circular queue; these can be implemented as arrays, so they're particularly cheap. See this question for an implementation in Java.
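A minimal array-backed sketch, assuming overwrite-on-full semantics (the class and method names are made up):

```java
/** Minimal sketch of an overwriting ring buffer: adding to a full
 *  buffer silently drops the oldest element. */
class RingBuffer {
    private final int[] buf;
    private int head = 0;   // index of the oldest element
    private int size = 0;

    RingBuffer(int capacity) { this.buf = new int[capacity]; }

    void add(int value) {
        int tail = (head + size) % buf.length;
        buf[tail] = value;
        if (size < buf.length) {
            size++;
        } else {
            head = (head + 1) % buf.length;  // overwrite: advance past the old head
        }
    }

    int get(int i) {                 // 0 = oldest surviving element
        return buf[(head + i) % buf.length];
    }

    int size() { return size; }
}
```

Both add and get are O(1) with no allocation after construction, which is what makes the structure cheap enough for an audio-recording thread.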
I am programming a list of recent network messages communicated to/from a client. Basically I just want a list that stores up to X number of my message objects. Once the list reaches the desired size, the oldest (first) item in the list should be removed. The collection needs to maintain its order, and all I will need to do is
iterate through it,
add an item to the end, and
remove an item from the beginning, if #2 makes it too long.
What is the most efficient structure/array/collection/method for doing this? Thanks!
You want to use a Queue.
I don't think LILO is the real term...but you're looking for a FIFO Queue
I second #rich-adams re: Queue. In particular, since you mentioned responding to network messages, I think you may want something that handles concurrency well. Check out ArrayBlockingQueue.
Based on your third requirement, I think you're going to have to extend or wrap an existing implementation, and I recommend you start with ConcurrentLinkedQueue.
Other recommendations of using any kind of blocking queue are leading you down the wrong path. A blocking queue will not allow you to add an element to a full queue until another element is removed. Furthermore, they block while waiting for that operation to happen. By your own requirements, this isn't the behavior you want. You want to automatically remove the first element when a new one is added to a full queue.
It should be fairly simple to extend ConcurrentLinkedQueue, overriding the offer method to check the size against the capacity (your subclass will maintain the capacity). If they're equal, your offer method will need to poll the queue to remove the first element before adding the new one.
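A hedged sketch of that subclass (the name is hypothetical; note that ConcurrentLinkedQueue.size() is O(n), and that the size check and the poll are not atomic together, so under heavy contention the bound is only approximate):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch: evict the head when offering to a full queue. */
class EvictingQueue<E> extends ConcurrentLinkedQueue<E> {
    private final int capacity;

    EvictingQueue(int capacity) { this.capacity = capacity; }

    @Override
    public boolean offer(E e) {
        while (size() >= capacity) {
            poll();                  // drop the oldest element to make room
        }
        return super.offer(e);
    }
}
```

Unlike a blocking queue, offering to a full queue never waits; the oldest element is simply discarded, which matches the stated requirement.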
You can use an ArrayList for this. Today's computers copy data at such speeds that it doesn't matter unless your list can contain billions of elements.
Performance information: copying 10 million elements takes 13 ms (thirteen milliseconds) on my dual core. So thinking even a second about the optimal data structure is a waste unless your use case is vastly different, i.e. you have more than 10 million elements and your application does nothing but insert and remove elements. If you operate on the inserted/removed elements in any way, chances are that the time spent in that operation exceeds the cost of the insert/remove.
A linked list seems better at first glance, but it needs more time when allocating memory, plus the code is more complex (with all the pointer updating). So the runtime is worse. The only advantage of using a LinkedList in Java is that the class already implements the Queue interface, so it is more natural to use in your code (using peek() and pop()).
[EDIT] So let's have a look at efficiency. What is efficiency? The fastest algorithm? The one which takes the least amount of lines (and therefore has the least amount of bugs)? The algorithm which is easiest to use (= least amount of code on the developer side + less bugs)? The algorithm which performs best (which is not always the fastest algorithm)?
Let's look at some details: LinkedList implements Queue, so the code which uses the list is a bit simpler (list.pop() instead of list.remove(0)). But LinkedList will allocate memory on each add(), while ArrayList only allocates memory once per N elements. To reduce this even further, ArrayList will allocate N*3/2 elements, so as your list grows, the number of allocations shrinks. If you know the size of your list in advance, ArrayList will only allocate memory once. This also means the GC has less clutter to clean up. So from a performance point of view, ArrayList wins by an order of magnitude in the average case.
The synchronized versions are only necessary when several threads access the data structure. With Java 5, many of those have seen dramatic speed improvements. If you have several threads putting and popping, use ArrayBlockingQueue; but in this case, LinkedBlockingQueue might be an option despite the bad allocation performance, since the implementation might allow pushing and popping from two different threads at the same time as long as the queue size is >= 2 (in this special case, the two threads won't have to access the same pointers). To decide, the only option is to run a profiler and measure which version is faster.
That said: any advice on performance is wrong 90% of the time unless it is backed by a measurement. Today's systems have become so complex, and there is so much going on in the background, that it is impossible for a mere human to understand or even enumerate all the factors which play a role.
You can get by with a plain old ArrayList.
When adding, just do (suppose the ArrayList is called al)
if (al.size() >= YOUR_MAX_ARRAY_SIZE)
{
    al.remove(0);
}
I think you want to implement a Queue<E> where the peek, poll and remove methods act as if there is nothing at the head until the count exceeds your threshold. You probably want to wrap one of the existing implementations.
LinkedList should be what you're looking for