Looking at this question made me curious about which to use: HashSet vs ArrayList. The HashSet seems to have better lookup and the ArrayList better insertion (for many, many objects). Since I can't insert using an ArrayList and then search through it using a HashSet, I'm going to have to pick one or the other. Would inserting into an ArrayList and then converting it to a HashSet for the lookups be slower overall than just inserting into a HashSet and looking up there? Or should I stick with the ArrayList, where the faster inserting makes up for the worse lookup?
It very much depends on the size of the collection and the way you use it. For example, you can reuse the same HashSet across many lookups instead of rebuilding it each time, which saves time, or you can keep both collections up to date as you insert.
Creating a fresh HashSet copy for each element lookup will always be slower.
You can also use LinkedHashSet, which has quick insertion and a HashSet's lookup speed, at the cost of slightly higher memory consumption; note that it offers no index-based access, so reaching the nth element is still an O(n) iteration.
You must decide for your specific application which tradeoff pays off better. Do you first insert everything, then spend the rest of the time looking up, maybe occasionally adding a few more? Use HashSet. Do you have a lot of duplicates, which you must suppress? Another strong point for HashSet. Do you insert a lot all the time and only do an occasional lookup? Then use ArrayList. And so on, there are many more combinations and in some cases you'll have to benchmark it to see.
It totally depends on your use case. If you implement the hashCode method correctly, the insert operation of a HashSet is also O(1). If you don't need random access to the elements (by index) and you don't want duplicates, HashSet is the better choice.
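To make the tradeoff concrete, here is a minimal sketch (the class name and element values are just for illustration) contrasting the two lookup paths: ArrayList.contains() scans linearly, while HashSet.contains() hashes straight to a bucket:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            list.add(i);   // amortized O(1) append
            set.add(i);    // expected O(1) insert (hashing)
        }

        long t0 = System.nanoTime();
        boolean inList = list.contains(99_999);  // O(n) linear scan
        long t1 = System.nanoTime();
        boolean inSet = set.contains(99_999);    // expected O(1) hash lookup
        long t2 = System.nanoTime();

        System.out.println("list: " + inList + " in " + (t1 - t0) + " ns");
        System.out.println("set:  " + inSet + " in " + (t2 - t1) + " ns");
    }
}
```

On a worst-case element like the last one, the gap between the two timings grows with the collection size.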
Related
I'm looking for a good sorted data structure in Java. After doing some research I got a few hints about using TreeSet/TreeMap. But these classes lack one thing: random access to an element in the set. For example, I want to access the nth element of the sorted set, but with TreeSet I must iterate over the other n-1 elements before I can get there. That would be wasteful, since I will have up to several thousand elements in my set.
The use case is like below
9:20 AM what is this object? edited by user1
9:30 AM what is this book ? edited by user2
9:40 AM what is this red book? edited by user1
I always want to show the latest edited title by that user. I know that the latest one will have the greatest timestamp. For this I found that ConcurrentSkipListSet/Map are good, but I would like to know if there are any better ways to implement this functionality.
Assuming you need to keep your data sorted, your best bet is TreeMap. There is no silver bullet that is both a sorted collection and also performs O(1) random access. In an ordered (insertion-order) collection you can access an element by index directly, but you cannot benefit from index-based access if your collection is required to be sorted.
If you need concurrency, ConcurrentSkipListMap is good. It's suitable for large-scale concurrent access to data. However, in terms of performance, it's no match for our Red-Black tree based pal, TreeMap. Thus if you don't need concurrency, forget ConcurrentSkipListMap and stick with TreeMap.
TreeMap is elegant and satisfies your need. Nonetheless, in practice, using a HashMap and sorting the data whenever you need it might be better. Try both and find out which one wins in your case.
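As a rough sketch of the timestamp use case above (the class name and timestamp values are made up for illustration), a TreeMap keyed by edit time keeps entries sorted, so the latest edit is simply lastEntry():

```java
import java.util.TreeMap;

public class LatestEdit {
    public static void main(String[] args) {
        // Key: edit timestamp in millis; value: the title at that time.
        // TreeMap keeps keys sorted, so the newest edit is lastEntry().
        TreeMap<Long, String> editsByTime = new TreeMap<>();
        editsByTime.put(1_000L, "what is this object?");   // 9:20 AM, user1
        editsByTime.put(2_000L, "what is this book ?");    // 9:30 AM, user2
        editsByTime.put(3_000L, "what is this red book?"); // 9:40 AM, user1

        System.out.println(editsByTime.lastEntry().getValue());
        // prints "what is this red book?"
    }
}
```

Swapping TreeMap for ConcurrentSkipListMap gives the same lastEntry() API with thread safety, if concurrency turns out to be needed.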
Java has tons of different Collections designed for concurrency and thread safety, and I'm at a loss as to which one to choose for my situation.
Multiple threads may be calling .add() and .remove(), and I will be copying this list frequently with something like List<T> newList = new ArrayList<T>(concurrentList). I will never be looping over the concurrent list.
I thought about something like CopyOnWriteArrayList, but I've read that it can be very inefficient because it copies itself every time it's modified. I'm hoping to find a good compromise between safety and efficiency.
What is the best list (or set) for this situation?
As @SpiderPig said, the best-case scenario with a List would be an immutable, singly-linked list.
However, looking at what's being done here, a List is unnecessary (@bhspencer's comment). A ConcurrentSkipListSet will work most efficiently (@augray).
This Related Thread's accepted answer offers more insight on the pros and cons of different concurrent collections.
You might want to look into whether a ctrie would be appropriate for your use case - it has thread-safe add and remove operations, and "copying" (in actuality, taking a snapshot of) the data structure runs in O(1). I'm aware of two JVM implementations of the data structure: implementation one, implementation two.
Collections.newSetFromMap(new ConcurrentHashMap<...>())
This is typically how a standard Set is implemented (HashSet is really a thin wrapper over HashMap). It offers the performance and concurrency advantages of ConcurrentHashMap, without extra features you don't need here, such as ordering (ConcurrentSkipListSet), copying on every modification (COW lists), or FIFO/LIFO ordering (concurrent queues).
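A minimal sketch of this approach (the class name is illustrative; since Java 8 the equivalent ConcurrentHashMap.newKeySet() factory also exists):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetDemo {
    public static void main(String[] args) throws InterruptedException {
        // A thread-safe Set view backed by a ConcurrentHashMap.
        Set<String> set = Collections.newSetFromMap(new ConcurrentHashMap<>());

        // Concurrent adds from multiple threads are safe.
        Thread t1 = new Thread(() -> set.add("a"));
        Thread t2 = new Thread(() -> set.add("b"));
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(set.size()); // 2
    }
}
```

Copying it with `new ArrayList<>(set)` is also safe here, since ConcurrentHashMap iterators are weakly consistent and never throw ConcurrentModificationException.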
Edit: I didn't see @bhspencer's comment on the original post, apologies for stealing the spotlight.
A HashSet, being hash-based, will beat a List for lookups.
Adding at the end and removing from the front are fast with a LinkedList.
Index-based access is fast in an ArrayList, since it is backed by an array, but searching for a value is still a linear scan.
I'm looking for a high performing data structure that behaves like a set and where the elements will always be an array of ints. The data structure only needs to fulfill this interface:
trait SetX {
  def size: Int
  def add(element: Array[Int])
  def toArray: Array[Array[Int]]
}
The set should not contain duplicates, where equality is content-based as in Arrays.equals(int[] a, int[] a2) - i.e. two arrays with the same values count as the same element.
Before creating it I have a rough idea of how many elements there will be but need resizing behaviour in case there are more than initially thought. The elements will always be the same length and I know what that is at the time of creation.
Of course I could use a Java HashSet (wrapping the arrays of course) but this is being used in a tight loop and it is too slow. I've looked at Trove and that works nicely (by using arrays but providing a TObjectHashingStrategy) but I was hoping that since my requirements are so specific there might be a quicker/more efficient way to do this.
Has anyone ever come across this or have an idea how I could accomplish this?
The trait above is Scala but I'm very happy with Java libs or code.
I should really say what I am doing. I am basically generating a large number of int arrays in a tight loop and at the end of it I just want to see the unique ones. I never have to remove elements from the set or anything else. Just add lots of int arrays to the set and at the end get out the unique ones.
Look at prefix trees. You can follow the tree structure during array generation itself, so by the end of generation you already know whether the generated array is present in the set. A prefix tree would also consume much less memory than an ordinary hash set.
If you are generating arrays and the chance of their being equal is not very small, I suspect you are drawing numbers from a very limited range. That would simplify the prefix-tree implementation, too.
I'm sure that a proper implementation would be faster than using any set implementation to store whole arrays.
The downside of this solution is that you need to implement the data structure yourself, because it will be deeply integrated with the logic of your code.
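A minimal sketch of such a prefix tree (the class and method names are my own, and a HashMap of children is used for simplicity; a real implementation over a limited value range could use plain arrays for the children instead):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal prefix tree (trie) over int arrays: add() returns true only
// when the array was not already present, so duplicates are detected
// during generation without hashing whole arrays.
public class IntArrayTrie {
    private static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
        boolean terminal; // true if an array ends at this node
    }

    private final Node root = new Node();
    private int size;

    public boolean add(int[] element) {
        Node node = root;
        for (int value : element) {
            node = node.children.computeIfAbsent(value, v -> new Node());
        }
        if (node.terminal) return false; // already present
        node.terminal = true;
        size++;
        return true;
    }

    public int size() { return size; }

    public static void main(String[] args) {
        IntArrayTrie set = new IntArrayTrie();
        System.out.println(set.add(new int[] {1, 2, 3})); // true
        System.out.println(set.add(new int[] {1, 2, 3})); // false (duplicate)
        System.out.println(set.size());                   // 1
    }
}
```

Since all your arrays have the same length, only leaf nodes will ever be terminal, which keeps the structure simple.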
If you want high performance then write your own:
Call it ArraySetInt.
Sets are usually implemented as either trees or hash tables.
If you want an array-based set, it will slow down adding and maybe deleting, but will speed up iteration and keep memory usage low.
First, look at how ArrayList is implemented.
Remove the Object storage and replace it with primitive int.
Then rename add() to put() and change it to sorted insertion: use Arrays.binarySearch() to find the insertion position and to check whether the element already exists in one step, and System.arraycopy() to do the insert.
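A rough sketch of the ArraySetInt idea described above (the growth policy and initial capacity are illustrative choices, not requirements):

```java
import java.util.Arrays;

// Sorted primitive int array: Arrays.binarySearch() finds the insertion
// point and detects duplicates in one step; System.arraycopy() shifts
// elements to make room, as described above.
public class ArraySetInt {
    private int[] data = new int[10];
    private int size;

    public boolean put(int value) {
        int pos = Arrays.binarySearch(data, 0, size, value);
        if (pos >= 0) return false;            // already present
        int insertAt = -(pos + 1);             // decode the insertion point
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // grow like ArrayList
        }
        System.arraycopy(data, insertAt, data, insertAt + 1, size - insertAt);
        data[insertAt] = value;
        size++;
        return true;
    }

    public boolean contains(int value) {
        return Arrays.binarySearch(data, 0, size, value) >= 0;
    }

    public int size() { return size; }
}
```

Note that binarySearch returns `-(insertionPoint) - 1` for a missing key, which is why the decode step above works.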
Without knowing how much data you have, or whether you are doing more reads than writes:
You should probably try (i.e. benchmark) the naive case of an array of arrays, or an array of specially wrapped arrays (composite objects holding the array plus its cached hash code). Generally, on small data sets not much beats looping through an array (e.g. a HashMap over an Enum can actually be slower than looping).
If you have a really large amount of data and you're willing to make some compromises, you might consider a Bloom filter, but it sounds like you don't have much data.
I'd go for some classic solution wrapping the array by a class providing faster equals and hashCode. The hashCode can be simply cached and equals can make use of it for quickly saying no in case of differing arrays.
I'd avoid Arrays.hashCode, as it uses a weak multiplier (31), which might lead to unneeded collisions. For a really fast equals you could even use cryptography and say that two arrays are equal if and only if their SHA-1 hashes are equal (you'd be the first to find a collision :D).
The ArrayWrapper is rather simple and should be faster than using a TObjectHashingStrategy, as it never has to look at the data itself for the hash (fewer cache misses) and it has the fastest possible hashCode and equals.
You could also look for some CompactHashSet implementation as it can be faster due to better memory locality.
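A minimal sketch of such a wrapper (the class name ArrayWrapper comes from the answer above; the particular multiplier is my own illustrative choice):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// The hash is computed once and cached; equals() compares cached hashes
// first, so differing arrays are usually rejected without touching
// their contents.
public final class ArrayWrapper {
    private final int[] data;
    private final int hash;

    public ArrayWrapper(int[] data) {
        this.data = data;
        int h = 1;
        for (int v : data) {
            h = h * 0x01000193 + v; // FNV-style prime instead of 31
        }
        this.hash = h;
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ArrayWrapper)) return false;
        ArrayWrapper other = (ArrayWrapper) o;
        return hash == other.hash && Arrays.equals(data, other.data);
    }

    public static void main(String[] args) {
        Set<ArrayWrapper> set = new HashSet<>();
        set.add(new ArrayWrapper(new int[] {1, 2, 3}));
        set.add(new ArrayWrapper(new int[] {1, 2, 3})); // duplicate, ignored
        System.out.println(set.size()); // 1
    }
}
```

The wrapper assumes the underlying array is not mutated after wrapping; otherwise the cached hash goes stale.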
I'm making a Java application that will store a bunch of random words (which can be added to or deleted from the application at any time). I want fast lookups to check whether a given word is in the dictionary or not. What would be the best Java data structure for this? As of now I was thinking about using a HashMap, with the same word as both the key and the value. Is this common practice? Using the same string for both the key and value in a (key, value) pair seems weird to me, so I wanted to make sure there wasn't some better idea that I was overlooking.
I was also considering a TreeMap to keep the words sorted, giving me O(lg n) lookup time, but the HashMap should give an expected O(1) lookup time as I understand it, so I figured that would be better.
So basically I just want to make sure the hashMap idea with the strings doubling as both key and value in each (key,value) pair would be a good decision. Thanks.
I want fast lookups to see whether a given word is in the dictionary or not. What would be the best java data structure to use for this?
This is the textbook use case for a Set. You can use a HashSet. The naive implementation of Set<T> uses a corresponding Map<T, Object> that simply marks whether an entry exists or not.
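A minimal example of the Set approach (the words are arbitrary), which sidesteps the awkward key-equals-value HashMap entirely:

```java
import java.util.HashSet;
import java.util.Set;

public class Dictionary {
    public static void main(String[] args) {
        Set<String> words = new HashSet<>();
        words.add("apple");   // expected O(1) insert
        words.add("banana");
        words.remove("apple");

        System.out.println(words.contains("banana")); // true, O(1) lookup
        System.out.println(words.contains("apple"));  // false
    }
}
```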
If you're storing it as a collection of words in a dictionary, I'd suggest taking a look at Tries. They require less memory than a Set and have quick lookup times of worst case O(string length).
Any class implementing Set should serve your purpose. However, do note that a Set will not allow duplicates; for that matter, even a Map won't allow duplicate keys. I would suggest using an ArrayList (assuming synchronization is not needed) if you need to add duplicate entries and treat them as separate.
My only concern would be memory if you use a HashSet with a very large collection of words, since you will have to load the entire collection into memory. Your collection would have to be very large for this to be a problem, so a HashSet should usually be fine. If you do have a very large collection of words, you can try a tree and load into memory only the parts you are interested in.
Also keep in mind that insertion into a tree is fast, but not as fast as into a HashSet, because the tree has to keep every element in sorted order. Again, nothing major, but if you add a lot of words at a time, factor that in if you consider a tree.
What is the best practice for initializing an ArrayList in Java?
If I initialize an ArrayList using the new operator, then the ArrayList will by default have memory allocated for 10 elements. Which is a performance hit.
I don't know, maybe I am wrong, but it seems to me that I should create an ArrayList by specifying the size, if I am sure about the size!
Which is a performance hit.
I wouldn't worry about the "performance hit". Object creation in Java is very fast. The performance difference is unlikely to be measurable by you.
By all means use a size if you know it. If you don't, there's nothing to be done about it anyway.
The kind of thinking that you're doing here is called "premature optimization". Donald Knuth says it's the root of all evil.
A better approach is to make your code work before you make it fast. Optimize with data in hand that tells you where your code is slow. Don't guess - you're likely to be wrong. You'll find that you rarely know where the bottlenecks are.
If you know how many elements you will add, initialize the ArrayList with correct number of objects. If you don't, don't worry about it. The performance difference is probably insignificant.
This is the best advice I can give you:
Don't worry about it. Yes, you have several options for creating an ArrayList, but using new, the default option provided by the library, isn't a bad choice; otherwise it would have been foolish to make it the default for everyone without clarifying what's better.
If it turns out that this is a problem, you'll quickly discover it when you profile. That's the proper place to find problems, when you profile your application for performance/memory problems. When you first write the code, you don't worry about this stuff -- that's premature optimization -- you just worry about writing good, clean code, with good design.
If your design is good, you should be able to fix this problem in no time, with little impact to the rest of the system. Effective Java 2nd Edition, Item 52: Refer to objects by their interfaces. You may even be able to switch to a LinkedList, or any other kind of List out there, if that turns out to be a better data structure. Design for this kind of flexibility.
Finally, Effective Java 2nd Edition, Item 1: Consider static factory methods instead of constructors. You may even be able to combine this with Item 5: Avoid creating unnecessary objects, if in fact no new instances are actually needed (e.g. Integer.valueOf doesn't always create a new instance).
Related questions
Java Generics Syntax - in-depth about type inferring static factory methods (also in Guava)
On ArrayList micromanagement
Here are some specific tips if you need to micromanage an ArrayList:
You can use ArrayList(int initialCapacity) to set the initial capacity of a list. The list will automatically grow beyond this capacity if needed.
When you're about to populate/add to an ArrayList and you know what the total number of elements will be, you can use ensureCapacity(int minCapacity) (or the constructor above directly) to reduce the number of intermediate growth operations. Each add will run in amortized constant time regardless of whether or not you do this (as guaranteed in the API), so this can only reduce the cost by a constant factor.
You can trimToSize() to minimize the storage usage.
This kind of micromanagement is generally unnecessary, but should you decide (justified by conclusive profiling results) that it's worth the hassle, you may choose to do so.
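A short sketch of the three knobs above together (the sizes are arbitrary):

```java
import java.util.ArrayList;

public class CapacityDemo {
    public static void main(String[] args) {
        // Known element count up front: size the backing array once.
        ArrayList<Integer> list = new ArrayList<>(1_000);

        // Or grow an existing list to a known capacity before a bulk add.
        list.ensureCapacity(2_000);
        for (int i = 0; i < 2_000; i++) {
            list.add(i); // amortized O(1) either way; this only avoids regrowth
        }

        // Release unused capacity once the list is fully populated.
        list.trimToSize();
        System.out.println(list.size()); // 2000
    }
}
```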
See also
Collections.singletonList - Returns an immutable list containing only the specified object.
If you already know (approximately) the size of your ArrayList, you should use the constructor that takes a capacity. But most of the time developers don't really know what will be in the list, and a capacity of 10 is sufficient for most cases.
The 10-element default is an approximation and isn't a performance hit unless you already know that your ArrayList will contain tons of elements, in which case the repeated resizing of the backing array will be the performance hit.
You don't need to give an ArrayList an initial size; you can always add or remove elements from it easily.
If performance is a concern, keep the following things in mind:
Initialization of ArrayList is very fast. Don't worry about it.
Adding/removing element from ArrayList is also very fast. Don't worry about it.
If you find your code runs too slowly, the first thing to blame is your algorithm, no offense. Machine specs, OS, and language play a part too, but their contribution is usually insignificant compared to your algorithm's.
If you don't know the size of the ArrayList, then you're probably better off using a LinkedList, since LinkedList.add() runs in constant time without ever resizing a backing array.
However as most people here have said you should not worry about speed before you do some kind of profiling.
You can use this old, but good (in my opinion) article for reference.
http://chaoticjava.com/posts/linkedlist-vs-arraylist/
Since ArrayList is backed by an array, an initial size has to be chosen for that array.
If you really care, you can call trimToSize() once you have constructed and populated the object. The javadoc states that the capacity will be at least as large as the list size. As previously stated, it's unlikely you will find that the memory allocated to an ArrayList is a performance bottleneck, and if it were, I would recommend using an array instead.