I have a hashtable that is under heavy traffic. I want to add a timeout mechanism to it so that records that are too old get removed. My concerns are:
- It should be lightweight.
- The remove operation is not time critical. I mean (with a timeout value of 1 hour) the removal can happen after 1 hour or after 1 hour and 15 minutes; that is not a problem.
My idea is:
- I create a big array (used as a ring buffer) that stores the put time and the hashtable key.
- When adding to the hashtable, I use the array index to find the next slot in the array.
- If the array slot is empty, I store the insertion time and the hashtable key there.
- If the array slot is not empty, I compare its insertion time to check whether the timeout has expired.
- If the timeout has expired, I remove that key from the hashtable (if it has not been removed yet).
- If the timeout has not expired, I keep incrementing the index until I find an empty or timed-out slot.
- When removing from the hashtable, nothing is done to the big array.
In short, every add operation to the hashtable may remove one timed-out element from it, or do nothing; a rough sketch follows.
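Something like this is what I have in mind (just a rough sketch; the class and field names are only illustrative, and it assumes a single writer thread):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Rough sketch of the ring-buffer eviction idea described above.
class TimeoutMap<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final long[] putTimes;     // insertion time per ring slot; 0 means empty
    private final Object[] slotKeys;   // key recorded in each ring slot
    private final long timeoutMillis;
    private int index;                 // next slot to probe (assumes a single writer thread)

    TimeoutMap(int slots, long timeoutMillis) {
        this.putTimes = new long[slots];
        this.slotKeys = new Object[slots];
        this.timeoutMillis = timeoutMillis;
    }

    void put(K key, V value) {
        map.put(key, value);
        long now = System.currentTimeMillis();
        // Probe forward until an empty or timed-out slot turns up, then claim it.
        for (int probes = 0; probes < putTimes.length; probes++) {
            int slot = index;
            index = (index + 1) % putTimes.length;
            boolean empty = putTimes[slot] == 0;
            boolean stale = !empty && now - putTimes[slot] >= timeoutMillis;
            if (stale) {
                map.remove(slotKeys[slot]);   // harmless if the entry is already gone
            }
            if (empty || stale) {
                putTimes[slot] = now;
                slotKeys[slot] = key;
                return;                       // at most one eviction per put
            }
        }
        // Every slot is still "young": nothing is evicted this time.
    }

    V get(K key) {
        return map.get(key);
    }
}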
What would be a more elegant and more lightweight solution?
Thanks for the help.
My approach would be to use the Guava MapMaker:
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

import com.google.common.base.Function;
import com.google.common.collect.MapMaker;

ConcurrentMap<String, MyValue> graphs = new MapMaker()
        .maximumSize(100)
        .expireAfterWrite(1, TimeUnit.HOURS)
        .makeComputingMap(
            new Function<String, MyValue>() {
                public MyValue apply(String string) {
                    return calculateMyValue(string);
                }
            });
This might not be exactly what you're describing, but chances are it's close enough. And it's much easier to write (plus it uses a well-tested code base).
Note that you can tweak the behaviour of the resulting Map by calling different methods before the make*() call.
You should rather consider using a LinkedHashMap or maybe a WeakHashMap.
The former has a constructor to set the iteration order of its elements to the order of last access; this makes it trivial to remove elements that are too old. And its removeEldestEntry method can be overridden to define your own policy on when to remove the eldest entry automatically after the insertion of a new one.
The latter uses weak references to keys, so any key which has no other reference to it can be automatically garbage collected.
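For the LinkedHashMap route, a minimal LRU-style sketch using the access-order constructor and removeEldestEntry (the size cap of 1000 is just an illustrative choice; a time-based policy would instead store the insertion time with each value and compare it against the clock here):

import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered LinkedHashMap that drops the least-recently-accessed entry past a size cap.
Map<String, String> cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
    private static final int MAX_ENTRIES = 1000;   // illustrative limit, not a timeout

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > MAX_ENTRIES;
    }
};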
I think a much easier solution is to use LRUMap from Apache Commons Collections. Of course you can write your own data structures if you enjoy it or want to learn, but this problem is so common that numerous ready-made solutions exist. (I'm sure others will point you to other implementations too; after a while your problem will be choosing the right one from them :))
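For illustration, assuming the commons-collections4 artifact is on the classpath, usage looks roughly like this:

import org.apache.commons.collections4.map.LRUMap;

// LRUMap evicts the least-recently-used entry once the maximum size is reached.
LRUMap<String, String> cache = new LRUMap<>(1000);   // 1000 is an arbitrary example capacity
cache.put("key", "value");
String value = cache.get("key");                     // get() also counts as a "use" for LRU purposes

Note that LRUMap bounds the cache by size, not by age; entries only disappear when newer ones push them out.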
Under the assumption that the currently most heavily accessed items in your cache structure are in the significant minority, you may well get by with randomly selecting items for removal (you have a low probability of removing something very useful). I've used this technique and, in this particular application, it worked very well and took next to no implementation effort.
Related
I have a list of numbers. In my program I would frequently be checking if a certain number is part of my list. If it is not part of my list, I add it to the list, otherwise I do nothing. I have found myself using a hashmap to store the items instead of an arraylist.
void add(Map<Integer, Integer> mp, int item) {
    if (!mp.containsKey(item)) {
        mp.put(item, 1);
    }
}
As you can see above, I put anything as the value, since I would not be using the values.
I have tested this process to be a lot faster than using an arraylist. (Also, containsKey() for hashmap is O(1) while contains() for arraylist is O(n))
Although it works well for me, it feels awkward for the simple reason that it is not the right data structure. Is this a good practice? Is there a better DS that I can use? Is there not any list that utilizes hashing to store values?
I have a list of numbers. In my program I would frequently be checking if a certain number is part of my list. If it is not part of my list, I add it to the list, otherwise I do nothing.
You are describing a set. From the Javadoc, a java.util.Set is:
A collection that contains no duplicate elements.
Further, the operation you are describing is add():
Adds the specified element to this set if it is not already present.
In code, you would create a new Set (this example uses a HashSet):
Set<Integer> numbers = new HashSet<>();
Then any time you encounter a number you wish to keep track of, just call add(). If x is already present, the set will remain unchanged and will not throw any errors – you don't need to be careful about adding things, just add anything you see, and the set sort of "filters" out the duplicates for you.
numbers.add(x);
It's beyond your original question, but there are various things you can do with the data once you've populated a set - check if other numbers are present/absent, iterate the numbers in the set, etc. The Javadoc shows which functionality is available to use.
An alternative solution from the standard library is a java.util.BitSet. This is - in my opinion - only an option if the values of item are not too big, and if they are relatively close together. If your values are not near each other (and not starting near to zero), then it might be worthwhile looking for third party solutions that offers sparse bit sets or other sparse data structures.
You can use a bit set like:
BitSet bits = new BitSet();

void add(int item) {
    bits.set(item);
}
And as suggested in the comments by Eritrean, you can also use a Set (e.g. HashSet). Internally, a HashSet uses a HashMap, so it will perform similarly to your current solution, but it does away with having to put sentinel values in yourself (you just add or remove the item itself).
As an added benefit, if you use Collection<Integer> as the type of parameters/fields in your code, you can easily switch between an ArrayList and a HashSet and test it without having to change code all over the place.
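For instance, a small hypothetical sketch of that switch (addIfAbsent is just an illustrative name):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

// Only the declaration changes when switching implementations; callers stay the same.
Collection<Integer> numbers = new HashSet<>();
// Collection<Integer> numbers = new ArrayList<>();

void addIfAbsent(int item) {
    if (!numbers.contains(item)) {   // effectively O(1) for HashSet, O(n) for ArrayList
        numbers.add(item);
    }
}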
Looking at the source of Java 6, HashSet<E> is actually implemented using HashMap<E,Object>, with a dummy object instance stored as the value of every entry of the Set.
I think that wastes 4 bytes (on 32-bit machines) on the size of each entry.
But, why is it still used? Is there any reason to use it besides making it easier to maintain the code?
Actually, it's not just HashSet. All implementations of the Set interface in Java 6 are based on an underlying Map. This is not a requirement; it's just the way the implementation is. You can see for yourself by checking out the documentation for the various implementations of Set.
Your main questions are:
"But, why is it still used? Is there any reason to use it besides making it easier to maintain the code?"
I assume that code maintenance is a big motivating factor. So is preventing duplication and bloat.
Set and Map are similar interfaces, in that duplicate elements are not allowed. (I think the only Set not backed by a Map is CopyOnWriteArraySet, which is an unusual Collection because every mutation copies its entire backing array.)
Specifically:
From the documentation of Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)

The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).
And from Map:
An object that maps keys to values.
A map cannot contain duplicate keys; each key can map to at most one value.
If you can implement your Sets using existing code, any benefit (speed, for example) you can realize from existing code accrues to your Set as well.
If you choose to implement a Set without a Map backing, you have to duplicate code designed to prevent duplicate elements. Ah, the delicious irony.
That said, there's nothing preventing you from implementing your Sets differently.
My guess is that HashSet was originally implemented in terms of HashMap in order to get it done quickly and easily. In terms of lines of code, HashSet is a fraction of HashMap.
I would guess that the reason it still hasn't been optimized is fear of change.
However, the waste is much worse than you think. On both 32-bit and 64-bit, HashSet is 4x larger than necessary, and HashMap is 2x larger than necessary. HashMap could be implemented with an array with keys and values in it (plus chains for collisions). That means two pointers per entry, or 16 bytes on a 64-bit VM. In fact, HashMap contains an Entry object per entry, which adds 8 bytes for the pointer to the Entry and 8 bytes for the Entry object header. HashSet also uses 32 bytes per element, but the waste is 4x instead of 2x since it only requires 8 bytes per element.
Yes, you are right, a small amount of wastage is definitely there. Small because, for every entry, it uses the same object PRESENT (which is declared final). Hence the only wastage is the value reference in each entry of the HashMap.
Mostly, I think they took this approach for maintainability and reusability. (The JCF developers would have thought: we have tested HashMap anyway, so why not reuse it?)
But if you have huge collections and you are a memory freak, then you may opt for better alternatives like Trove or Google Collections.
I looked at your question and it took me a while to think about what you said. So here's my opinion regarding the HashSet implementation.
It is necessary to have the dummy instance to know if the value is or is not present in the set.
Take a look at the add method
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
And now let's take a look at put's return value:
Returns the previous value associated with key, or null if there was no mapping for key. (A null return can also indicate that the map previously associated null with key.)
So the PRESENT object is just used to represent that the set contains the e value. I think you asked why not use null instead of PRESENT. But then you would not be able to tell whether the entry was previously in the map, because map.put(key, value) would always return null and you would have no way to know whether the key already existed.
That being said, you could argue that they could have used an implementation like this:
public boolean add(E e) {
    if (map.containsKey(e)) {
        return false;
    }
    map.put(e, null);
    return true;
}
I guess they accept wasting 4 bytes to avoid computing the hashCode of the key twice (which could be expensive) when the key is actually going to be added.
If your question referred to why they used a HashMap that wastes 8 bytes (because of the Map.Entry) instead of some other data structure using a similar Entry of only 4, then yes, I would say they did it for the reasons you mentioned.
I am guessing that it has never turned up as a significant problem for real applications or important benchmarks. Why complicate the code for no real benefit?
Also note that object sizes are rounded up in many JVM implementations, so there may not actually be an increase in size (I don't know for this example). Also, the code for HashMap is likely to be compiled and in cache. Other things being equal, more code => more cache misses => lower performance.
After searching through pages like this, wondering why the standard implementation is mildly inefficient, I found com.carrotsearch.hppc.IntOpenHashSet.
Your question:
I think that wastes 4 byte (on 32-bit machines) for the size of the entry itself.
Just one Object instance is created for the entire HashSet data structure, and reusing HashMap saves re-writing essentially the same code all over again:
private static final Object PRESENT = new Object();
All keys map to this single value, the PRESENT object.
Case 1:
One HashMap with 100,000 entries.
Case 2:
Two HashMaps with 50,000 entries each.
Which of the above cases will take more execution time and more memory? Or is there a significant difference between the two?
Is it feasible to replace one HashMap with a large number of entries with two HashMaps with fewer entries each?
You're better off with a single hash map.
Look-ups are very efficient in a hash map, and they're designed to have lots of elements in. It'll be slower overall if you have to put something in place to search one map and then look in the other one if you don't find it in the first one.
(There won't be much difference in memory usage either way.)
If it's currently too slow, have a check that your .hashCode() and .equals() are not inefficient.
The memory requirements should be similar for the two cases (since the HashMap storage is an array of Entries whose size is the capacity of the map, so two arrays of 50K would take the same space as one array of 100K).
The get() and put() methods should also have similar performance in both cases, since the calculation of the hash code of the key and the matching index is not affected by the size of the map. The only thing that may affect the performance of these operations is the average number of keys mapped to the same index, and that should also be independent of the size of the map, since as the size grows, the Entries array grows, so the average number of Entries per index should stay the same.
On the other hand, if you use two smaller maps, you have to add logic to decide which map to use. You can try to search for a key in the first map and if not found, search in the second map. Or you can have a criteria that determines which map is used (for example, String keys starting in A to M will be stored in the first map, and the rest in the second map). Therefore, each operation on the map will have an additional overhead.
Therefore, I would suggest using a single HashMap.
The performance difference between using one or two HashMaps should not matter much, so go for the simpler solution, the single HashMap.
But since you ask this question, I assume that you have performance problems. Often, using a HashMap as a cache is a bad idea, because it keeps a reference to the cached objects alive, thus basically disabling garbage collection. Consider redesigning your cache, for example using SoftReferences (a class in the standard Java API), which allows the garbage collector to collect your cached objects while still being able to reuse them as long as they have not been garbage collected yet.
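A minimal sketch of a SoftReference-based cache along those lines (the SoftCache class here is hypothetical, not a library class):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Values are held through SoftReferences, so the GC may reclaim them under memory pressure.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;             // never cached
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key);         // the value was collected; drop the stale entry
        }
        return value;
    }
}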
As everyone mentioned, you should be using one single hash map. If you are having trouble with 100k entries, then there is a major problem with your code.
Here are some pointers on HashMap:
Don't use overly complex objects as keys (in my opinion, a String key is as far as you should go for a HashMap).
If you do use a complex object as a key, make sure its equals and hashCode methods are as efficient as possible, as too much calculation within these methods can greatly reduce the efficiency of the HashMap.
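For instance, a hypothetical key class that keeps those methods cheap by being immutable and caching its hash code:

import java.util.Objects;

// Immutable key whose hash is computed once, so repeated map operations stay cheap.
final class CompositeKey {
    private final String name;
    private final int version;
    private final int hash;            // cached at construction time

    CompositeKey(String name, int version) {
        this.name = name;
        this.version = version;
        this.hash = Objects.hash(name, version);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey other = (CompositeKey) o;
        return version == other.version && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}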
I need to randomly access keys in a HashMap. Right now, I am using Set's toArray() method on the Set that HashMap's keySet() returns, and casting it as a String[] (my keys are Strings). Then I use Random to pick a random element of the String array.
public String randomKey() {
    String[] keys = (String[]) myHashMap.keySet().toArray(new String[0]);
    Random rand = new Random();
    return keys[rand.nextInt(keys.length)];
}
It seems like there ought to be a more elegant way of doing this!
I've read the following post, but it seems even more convoluted than the way I'm doing it. If the following solution is the better one, why is that so? Selecting random key and value sets from a Map in Java
There is no facility in a HashMap to return an entry without knowing the key so, if you want to use only that class, what you have is probably as good a solution as any.
Keep in mind however that you're not actually restricted to using a HashMap.
If you're going to be reading this collection far more often than writing it, you can create your own class which contains both a HashMap of the mappings and a different collection of the keys that allows random access (like a Vector).
That way, you won't incur the cost of converting the map to a set then an array every time you read, it will only happen when necessary (adding or deleting items from your collection).
Unfortunately, a Vector allows multiple keys of the same value so you would have to defend against that when inserting (to ensure fairness when selecting a random key). That will increase the cost of insertion.
Deletion would also be increased cost since you would have to search for the item to remove from the vector.
I'm not sure there's an easy single collection for this purpose. If you wanted to go the whole hog, you could have your current HashMap, a Vector of the keys, and yet another HashMap mapping the keys to the vector indexes.
That way, all operations (insert, delete, change, get-random) would be O(1) in time, very efficient in terms of time, perhaps less so in terms of space :-)
Or there's a halfway solution that still uses a wrapper but creates a long-lived array of strings whenever you insert, change or delete a key. That way, you only create the array when needed and you still amortise the costs. Your class then uses the hashmap for efficient access with a key, and the array for random selection.
And the change there is minimal. You already have the code for creating the array, you just have to create your wrapper class which provides whatever you need from a HashMap (and simply passes most calls through to the HashMap) plus one extra function to get a random key (using the array).
Now, I'd only consider using those methods if performance is actually a problem though. You can spend untold hours making your code faster in ways that don't matter :-)
If what you have is fast enough, it's fine.
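As a rough sketch of the "whole hog" structure described above (class and method names are illustrative; it uses an ArrayList instead of a Vector, plus a swap-with-last trick so removal is O(1) as well):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// HashMap for key -> value, ArrayList for O(1) random access to keys,
// and a second map from key to its list index so removal is O(1) too.
class RandomAccessMap<K, V> {
    private final Map<K, V> values = new HashMap<>();
    private final List<K> keys = new ArrayList<>();
    private final Map<K, Integer> indexOf = new HashMap<>();
    private final Random rand = new Random();

    void put(K key, V value) {
        if (values.put(key, value) == null) {      // only track keys that are genuinely new
            indexOf.put(key, keys.size());
            keys.add(key);
        }
    }

    void remove(K key) {
        if (values.remove(key) == null) {
            return;                                // key was not present
        }
        int idx = indexOf.remove(key);
        K last = keys.remove(keys.size() - 1);     // swap the last key into the freed slot
        if (!last.equals(key)) {
            keys.set(idx, last);
            indexOf.put(last, idx);
        }
    }

    K randomKey() {
        return keys.get(rand.nextInt(keys.size())); // assumes the map is not empty
    }
}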
Why not copy the keys into a List, shuffle it with the Collections.shuffle method, save the result in a variable, and simply pop one off the top as required?
http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#shuffle(java.util.List)
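A minimal sketch of that idea, assuming the same myHashMap from the question and a re-shuffle whenever its key set changes:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shuffle the keys once, then hand them out from the end of the list.
List<String> shuffledKeys = new ArrayList<>(myHashMap.keySet());
Collections.shuffle(shuffledKeys);

String nextRandomKey() {
    return shuffledKeys.remove(shuffledKeys.size() - 1);   // "pop one off the top"
}

Note this hands out each key at most once per shuffle, which may or may not be what you want.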
You could avoid copying the whole keyset into a temporary data structure, by first getting the size, choosing the random index and then iterating over the keyset the appropriate number of times.
This code would need to be synchronized to avoid concurrent modifications.
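Something along those lines (a sketch; it assumes the map is non-empty and is not modified while iterating):

import java.util.Map;
import java.util.Random;

String randomKey(Map<String, ?> map, Random rand) {
    int target = rand.nextInt(map.size());   // choose the random index first
    int i = 0;
    for (String key : map.keySet()) {        // then walk the key set that many steps
        if (i++ == target) {
            return key;
        }
    }
    throw new IllegalStateException("map was modified concurrently");
}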
If you really just want any element of a set, this will work fine.
String s = set.iterator().next();
If you are unsure whether there is an element in the set, use:
String s;
Iterator<String> it = set.iterator();
if (it.hasNext()) {
    s = it.next();
} else {
    // set was empty
}
I am using a hashmap to store objects with a key that evolves over time.
HashMap<String, Stuff> hm = new HashMap<String, Stuff>();
Stuff stuff = new Stuff();
hm.put("OriginalKey", stuff);
I didn't find anything better than removing "OriginalKey" and calling put() to add a new entry with the same object.
hm.remove("OriginalKey");
hm.put("NewKey", stuff);
remove() seems to be taking a significant CPU toll, hence my questions:
What is the actual memory cost of leaving duplicate entries (there is no risk of overlap)?
Am I just missing some neat swapKey() method?
What is the actual memory cost of leaving duplicate entries (there is no risk of overlap)?
Well, you've got an extra entry, and the key itself can't be garbage collected. If the key is "large", that could be a problem. It also means that you'll never be able to get an accurate count, you'll never be able to sensibly iterate over all the values, etc. It seems like a bad idea to me.
Am I just missing some neat swapKey() method?
There's no such thing - and it feels like a fairly rare requirement to me. Any such method would pretty much have to do what you're doing anyway - it has to find the old key, remove it from the data structure, and insert an entry for the new key. I can't easily imagine any optimizations possible just by knowing about both operations at once.
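If you wanted such a helper anyway, it would just wrap the same two operations (renameKey is a hypothetical utility, not an existing API):

import java.util.Map;

// Re-keys an entry by removing the old mapping and re-inserting it under the new key.
static <K, V> void renameKey(Map<K, V> map, K oldKey, K newKey) {
    V value = map.remove(oldKey);
    if (value != null) {               // assumes null values are not stored in the map
        map.put(newKey, value);
    }
}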
Swapping the key is not easily possible, since the key is used for hashing.
Changing the key means that the hash value is most probably different too; in that case, changing the key amounts to a deletion followed by a reinsertion.