Insertion order in CopyOnWriteArraySet VS HashSet - java

Everyone knows that HashSet stores elements in buckets, chosen from the element's hash code and the current size of the hash table.
But how does CopyOnWriteArraySet store elements? I thought it made a snapshot of those buckets and copied them. It looks like it doesn't. Does it store them in a 'normal' array one by one and check equals()?
Does it even use the hashing principle?

CopyOnWriteArraySet is a Set-wrapper for CopyOnWriteArrayList, which stores its elements in an array, so it does not use hashing. That's why it doesn't have the O(1) lookup benefit of a HashSet.
The docs say it is only suitable for small sets.
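For illustration, here is a minimal sketch (the class name and element values are invented) that adds the same elements to both set types; CopyOnWriteArraySet keeps insertion order because it is just a copied array scanned with equals(), while HashSet's order depends on hash codes.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

public class SetOrderDemo {
    public static void main(String[] args) {
        Set<String> cow = new CopyOnWriteArraySet<>();
        Set<String> hash = new HashSet<>();
        for (String s : new String[] {"banana", "apple", "cherry"}) {
            cow.add(s);   // linear scan of the backing array, checks equals()
            hash.add(s);  // hashed into a bucket
        }
        System.out.println(cow);  // [banana, apple, cherry] -- insertion order
        System.out.println(hash); // some hash-dependent order
    }
}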


Java HashSet vs Array Performance

I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are (and the number won't change), and was wondering whether Array would have a notable performance advantage over HashSet for storing/retrieving said elements.
On paper, Array guarantees constant time insertion (since I know the size ahead of time) and retrieval, but the code for HashSet looks much cleaner and adds some flexibility, so I'm wondering if I'm losing anything performance-wise using it, at least, theoretically.
It depends on your data:
HashSet gives you an O(1) contains() method but doesn't preserve order.
ArrayList's contains() is O(n), but you can control the order of the entries.
With an array, inserting anything in between is O(n) in the worst case, since you have to shift the existing data down to make room for the insertion. On the Set side you can use a SortedSet such as TreeSet, which inserts in O(log n) and offers more flexible operations.
I believe a Set is more flexible.
The choice greatly depends on what you want to do with it.
If it is only what is mentioned in your question:
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are
If this is all you need to do, then you need neither of them. Collection has a size() method that gives you its size, i.e. how many elements there are in the collection.
If what you mean by "collection of objects" is not really a Collection yet, and you need to choose a type of collection to store your objects for further processing, then you need to know that different kinds of collections have different capabilities and characteristics.
First, I believe that for a fair comparison you should consider using ArrayList instead of an array, so that you don't need to deal with reallocation yourself.
Then it becomes a choice of ArrayList vs HashSet, which is quite straightforward:
Do you need a List or a Set? They serve different purposes: Lists provide indexed access, and iteration is in index order, while Sets are mainly for keeping a distinct set of data, and by their nature you won't have indexed access.
After you have decided between List and Set, it is then a choice of List/Set implementation: normally for Lists you choose between ArrayList and LinkedList, while for Sets you choose between HashSet and TreeSet.
All of these choices depend on what you want to do with that collection of data; they perform differently for different operations.
For example, indexed access in an ArrayList is O(1); in a HashSet (though not meaningful) it is O(n) (and, just for your interest, it is O(n) in a LinkedList and roughly O(n log n) in a TreeSet).
Adding a new element is an amortized O(1) operation for both ArrayList and HashSet. Inserting in the middle is O(n) for an ArrayList, while it doesn't make sense for a HashSet. Both suffer from reallocation, and both need O(n) for it (a HashSet is normally slower to reallocate, because it has to redistribute every element into the new buckets).
To find whether a certain element exists in the collection, ArrayList is O(n) and HashSet is O(1).
There are still lots of other operations you can do, so it is fairly meaningless to discuss performance without knowing what you want to do.
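As a rough, hedged illustration of the contains() difference described above, here is a toy timing (not a proper benchmark; the collection size and the probed value are arbitrary):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) { list.add(i); set.add(i); }

        long t0 = System.nanoTime();
        boolean inList = list.contains(n - 1); // O(n): scans the whole array
        long t1 = System.nanoTime();
        boolean inSet = set.contains(n - 1);   // O(1): jumps straight to the bucket
        long t2 = System.nanoTime();

        System.out.printf("list: %b in %d us, set: %b in %d us%n",
                inList, (t1 - t0) / 1000, inSet, (t2 - t1) / 1000);
    }
}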
Theoretically, and as the SCJP6 study guide says :D
arrays are faster than collections, and as said, most collections are built mainly on arrays (Maps are not considered Collections, but they are included in the Collections framework).
If you can guarantee that the number of your elements won't change, why get stuck with objects built on objects (collections built on arrays) when you can use the underlying structure directly (arrays)?
It looks like you will want a HashMap that maps IDs to counts. In particular,
HashMap<Integer, Integer> counts = new HashMap<Integer, Integer>();
// getOrDefault avoids a NullPointerException the first time an ID is seen
counts.put(uniqueID, counts.getOrDefault(uniqueID, 0) + 1);
This way, you get amortized O(1) adds, contains and retrievals. Essentially, an array indexed by the unique ID associated with each object IS a HashMap. By using the HashMap, you get the added bonus of not having to manage the size of the array, not having to map the keys to an array index yourself, AND constant access time.
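A self-contained usage sketch of that counting idea, assuming the IDs arrive as a plain int array (the ids values are invented for illustration):

import java.util.HashMap;
import java.util.Map;

public class IdCounts {
    public static void main(String[] args) {
        int[] ids = {7, 3, 7, 9, 3, 7};                     // hypothetical input
        Map<Integer, Integer> counts = new HashMap<>();
        for (int id : ids) {
            counts.put(id, counts.getOrDefault(id, 0) + 1); // amortized O(1) per update
        }
        System.out.println(counts);                         // e.g. {3=2, 7=3, 9=1}
    }
}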

HashSet and LinkedHashSet [duplicate]

So I'm developing a program where I need to store unique customer data of primitive types. In that regard, I have been reading a book about data structures, and came to the conclusion that I should use a HashSet.
Now this book states that a HashSet has faster insertion and removal than a LinkedHashSet. This baffles me a bit. I thought that the only difference between the two was that a LinkedHashSet uses some extra memory, using a linked list to keep the order.
Can anyone elaborate?
Choose your data structure wisely.
You can use LinkedHashSet instead of HashSet if the order of insertion is important to you. With the additional bookkeeping, memory use and processor cycles take a small hit.
Edit 1:
Things to consider other than the insertion order: because LinkedHashSet maintains a doubly linked list, it will be slightly slower for insertion and removal, but slightly faster for iteration.
To quote the Javadoc:
This class provides all of the optional Set operations, and permits null elements. Like HashSet, it provides constant-time performance for the basic operations (add, contains and remove), assuming the hash function disperses elements properly among the buckets. Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list, with one exception: Iteration over a LinkedHashSet requires time proportional to the size of the set, regardless of its capacity. Iteration over a HashSet is likely to be more expensive, requiring time proportional to its capacity.
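To make the ordering difference concrete, here is a small sketch (the customer IDs are made up):

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class InsertionOrderDemo {
    public static void main(String[] args) {
        Set<String> hash = new HashSet<>();
        Set<String> linked = new LinkedHashSet<>();
        for (String id : new String[] {"C-42", "A-07", "B-19"}) {
            hash.add(id);
            linked.add(id);
        }
        System.out.println(linked); // [C-42, A-07, B-19] -- insertion order kept
        System.out.println(hash);   // some hash-dependent order
    }
}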
TreeSet, LinkedHashSet and HashSet in Java are three Set implementations in the collection framework and, like many others, they are used to store objects. The main feature of TreeSet is sorting, of LinkedHashSet is insertion order, and HashSet is just a general-purpose collection for storing objects. HashSet is implemented using HashMap in Java, while TreeSet is implemented using TreeMap. TreeSet is a SortedSet implementation, which allows it to keep elements in the sorted order defined by either the Comparable or the Comparator interface. Comparable is used for natural-order sorting and Comparator for custom-order sorting of objects, and a Comparator can be provided while creating an instance of TreeSet. Anyway, before seeing the differences between TreeSet, LinkedHashSet and HashSet, let's look at some similarities between them:
1) Duplicates: all three implement the Set interface, which means they are not allowed to store duplicates.
2) Thread safety: HashSet, TreeSet and LinkedHashSet are not thread-safe; if you use them in a multi-threaded environment where at least one thread modifies the Set, you need to synchronize them externally.
3) Fail-fast iterators: the Iterators returned by TreeSet, LinkedHashSet and HashSet are fail-fast, i.e. if the Set is structurally modified after the Iterator's creation in any way other than through the Iterator's own remove() method, it will throw a ConcurrentModificationException on a best-effort basis.
Now let's see the differences between HashSet, LinkedHashSet and TreeSet in Java:
Performance and Speed: the first difference between them is speed. HashSet is the fastest, LinkedHashSet is second (almost the same as HashSet), and TreeSet is a bit slower because of the sorting operation it needs to perform on each insertion. TreeSet provides guaranteed O(log(n)) time for common operations like add, remove and contains, while HashSet and LinkedHashSet offer constant-time performance, e.g. O(1) for add, contains and remove, provided the hash function distributes elements uniformly among the buckets.
Ordering: HashSet does not maintain any order, while LinkedHashSet maintains the insertion order of elements, much like the List interface, and TreeSet maintains the sorted order of elements.
Internal Implementation: HashSet is backed by a HashMap instance, LinkedHashSet is implemented using HashSet and a linked list, while TreeSet is backed by a NavigableMap, which by default is a TreeMap.
null: both HashSet and LinkedHashSet allow null, but TreeSet doesn't: it throws a java.lang.NullPointerException when you insert null, because TreeSet uses the compareTo() method of its elements to compare them, and comparing with null throws a NullPointerException.
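For example, a minimal sketch of both the sorting and the null behaviour (element values are arbitrary):

import java.util.TreeSet;

public class TreeSetNullDemo {
    public static void main(String[] args) {
        TreeSet<String> names = new TreeSet<>();
        names.add("bob");
        names.add("alice");
        System.out.println(names);  // [alice, bob] -- sorted by compareTo()
        try {
            names.add(null);        // comparing null throws a NullPointerException
        } catch (NullPointerException e) {
            System.out.println("TreeSet rejected null: " + e);
        }
    }
}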

copy elements of arraylist in a set with the same order java

I've sorted an ArrayList of ints in ascending order, but when I copy it into a set, the elements are not sorted any more.
I'm using this:
HashSet<Integer> set = new HashSet<Integer>(sortedArray);
Why is it like that?
LinkedHashSet will keep the order. TreeSet will sort based either on an external Comparator or natural ordering through Comparable.
A general point of a Set is that order is irrelevant. Hashing is intended to put the elements in as random an order as possible. LinkedHashSet maintains a linked-list between references to the elements, so can maintain an order.
BitSet (which is not a Set) may, or may not, provide a more efficient data structure.
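For the question above, a minimal sketch of both alternatives, using a List<Integer> named sortedList as a stand-in for the asker's sorted list:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class KeepOrderDemo {
    public static void main(String[] args) {
        List<Integer> sortedList = Arrays.asList(1, 3, 5, 9);             // already ascending
        Set<Integer> keepsInsertionOrder = new LinkedHashSet<>(sortedList);
        Set<Integer> keepsSortedOrder = new TreeSet<>(sortedList);
        System.out.println(keepsInsertionOrder); // [1, 3, 5, 9]
        System.out.println(keepsSortedOrder);    // [1, 3, 5, 9]
    }
}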
HashSets don't sort or maintain order, and the API docs will tell you this:
it does not guarantee that the order will remain constant over time.
Consider using another type of Set such as a TreeSet.
If you just care about uniqueness, use the HashSet. If you're after sorting, then consider the TreeSet.
You need to use a TreeSet and provide a Comparator object or implement the Comparable interface for your data. You can read about object ordering here.
A hash set is designed for quick access to unique data, not for maintaining a particular order.

How do HashSets in Java work? [duplicate]

Can someone explain to me how HashSets in Java work and why they are faster than using ArrayLists?
A HashSet is actually a HashMap where the value is always the same.
The way a HashMap works is described in many places (it is referred to as "hashtable" as well). In short: it generates hashes of keys (objects) and positions them into a table. Then each time you look for a key, its hash is computed and the bucket in the table is referenced directly. This means you have just one operation (best case) to access the map.
The HashSet simply contains the keys, so .contains(..) is O(1). That and remove(..) are the only operations for which a HashSet is faster than an ArrayList (where they are O(n)). Iteration is the same, addition is the same.
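A simplified sketch of the "HashSet is a HashMap with a constant value" idea (a toy class, not the actual JDK source): a set can be built on top of a map by storing a shared dummy value for every key.

import java.util.HashMap;
import java.util.Map;

// Toy set backed by a map, mirroring the idea that HashSet wraps HashMap.
public class ToySet<E> {
    private static final Object PRESENT = new Object(); // shared dummy value
    private final Map<E, Object> map = new HashMap<>();

    public boolean add(E e)      { return map.put(e, PRESENT) == null; } // expected O(1)
    public boolean contains(E e) { return map.containsKey(e); }          // expected O(1)
    public boolean remove(E e)   { return map.remove(e) != null; }       // expected O(1)
    public int size()            { return map.size(); }
}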
First, HashSet, unlike ArrayList, is a Set: it cannot contain duplicates while an ArrayList can - so they are built for different purposes. It also does not guarantee ordering - again, unlike a list.
Second, a HashSet is built on the hash table data structure, which allows O(1) seek time for an element.
Note that in many cases a HashSet is slower than an ArrayList - if you want to iterate over all elements, for example, doing it in an ArrayList will usually be faster than in a HashSet [because of the poor cache behaviour of a hash table, among other reasons].
These are 2 different data structures.
The concept behind HashSet is hashing (key probing).
I.e. you use a transformation of the input key to get the index of the location of the value in an array.
This is a constant-time O(1) operation, since an array allows random access.
The ArrayList also offers O(1) access, since it too is backed by an array.
But only for random access and appending at the end.
Search, though, is an O(n) operation for an ArrayList, since you have to go through all the elements in the list to get to the value, unlike the HashSet, where you just transform the key and access the array. Search in a HashSet is O(1).
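A hedged sketch of that key-to-index transformation, roughly following the power-of-two table scheme that hash-based collections commonly use (the exact details vary by implementation):

public class BucketIndexDemo {
    // Map a key's hashCode() to a bucket index in a table whose length is a power of two.
    static int bucketIndex(Object key, int tableLength) {
        int h = key.hashCode();
        h ^= (h >>> 16);              // spread the high bits into the low bits
        return h & (tableLength - 1); // equivalent to h mod tableLength for powers of two
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("alice", 16));
        System.out.println(bucketIndex("bob", 16));
    }
}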
As a matter of fact, iterating over and appending to an ArrayList, for example, is faster.
And heck, you cannot even sort a HashSet.
But the fastest of all is the NoOp. There is nothing just remotely as fast as the NoOp. Granted, it doesn't do much, the NoOp. But it's really fast at that!
You need to be more precise in what you consider to be "faster than".

TreeMap or HashMap? [duplicate]

When to use hashmaps or treemaps?
I know that I can use TreeMap to iterate over the elements when I need them to be sorted.
But is that all? Is there no optimization when I just want to consult the map, or are there specific uses where one is optimal?
TreeMap provides guaranteed O(log n) lookup time (and insertion etc), whereas HashMap provides O(1) lookup time if the hash code disperses keys appropriately.
Unless you need the entries to be sorted, I'd stick with HashMap. Or there's ConcurrentHashMap of course. I can't remember the details of the differences between all of them, but HashMap is a perfectly reasonable "default" option :)
For completeness, I should point out that there was a discussion on Stack Overflow a month or so ago about the internals of various maps. See the comments in this question, which I will copy into this answer if bestsss is happy for me to do so.
Hashtables (usually) perform search operations (lookups) in time bounded between O(1) and O(n), with an average-case complexity of O(1 + n/k); binary search trees (BSTs), however, perform search operations (lookups) in time bounded between O(log_2(n)) and O(n), with an average-case complexity of O(log_2(n)). The implementation of each (and every) data structure should be known (by you) to understand the advantages, drawbacks, time complexity of operations, and code complexity.
For example, a hashtable often has a fixed number of slots (some of which may not be filled at all), with lists of collisions hanging off them. Trees, on the other hand, usually have two pointers (references) per node, though this can be more if the implementation allows more than two child nodes per node; this lets the tree grow as nodes are added, but it may not allow duplicates. (The default implementation of a Java TreeMap does not allow duplicates.)
There are special cases to consider as well, for example, what if the number of elements in a particular data structure increases without bound or approaches the limit of an underlying part of the data structure? What about amortized operations that perform some rebalancing or cleanup operation?
For example, in a hashtable, when the number of elements in the table becomes sufficiently large, an arbitrary number of collisions can occur. Trees, on the other hand, usually require some re-balancing procedure after an insertion (or deletion).
So, if you have something like a cache (e.g. the number of elements is bounded, or the size is known) then a hashtable is probably your best bet; however, if you have something more like a dictionary (e.g. populated once and looked up many times) then I'd use a tree.
This is only the general case, however (no further information was given). You have to understand the processes involved, and how they happen, to make the right choice of data structure.
When I need a multi-map (ranged lookup) or sorted flattening of a collection, then it can't be a hashtable.
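The ranged-lookup point is easy to show, since NavigableMap operations such as subMap() only exist on the sorted map (the example data is made up):

import java.util.TreeMap;

public class RangedLookupDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> grades = new TreeMap<>();
        grades.put(55, "fail");
        grades.put(72, "pass");
        grades.put(88, "merit");
        grades.put(95, "distinction");

        // All entries with keys in [70, 90) -- a query a HashMap cannot answer directly.
        System.out.println(grades.subMap(70, 90)); // {72=pass, 88=merit}
        System.out.println(grades.firstKey());     // 55
    }
}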
The largest difference between the two is the underlying structure used in the implementation.
HashMaps use an array plus a hashing function to store elements. When you try to insert or delete an item, the hashing function converts the key into an index into the array where the object is/should be stored (ignoring collisions). While hashmaps are generally very fast, because they don't need to iterate over large amounts of data, they slow down momentarily when they fill up, because they need to copy all the keys/values into a new, larger array.
TreeMaps store the data in a sorted tree structure. While this means that they never have to allocate more space and copy over to it, operations require that part of the data already stored be traversed, and sometimes large parts of the structure have to change.
Of the two, HashMaps will generally have better performance when you don't need sorting.
Inserting new elements into a HashMap will, on average, be a good deal faster than inserting elements into a TreeMap. Unless you need your elements sorted, I'd go with the HashMap.
Don't forget there is also LinkedHashMap which is nearly as fast as HashMap for add/contains/remove operations but also maintains the insertion order.
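A short sketch contrasting the three iteration orders (the keys are arbitrary):

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderDemo {
    public static void main(String[] args) {
        String[] keys = {"pear", "apple", "mango"};
        Map<String, Integer> hash = new HashMap<>();
        Map<String, Integer> linked = new LinkedHashMap<>();
        Map<String, Integer> tree = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            hash.put(keys[i], i);
            linked.put(keys[i], i);
            tree.put(keys[i], i);
        }
        System.out.println(linked); // {pear=0, apple=1, mango=2} -- insertion order
        System.out.println(tree);   // {apple=1, mango=2, pear=0} -- sorted by key
        System.out.println(hash);   // some hash-dependent order
    }
}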
