Fast data structure for random and sequential access - java

I'm looking for a data structure or a combination of various data structures that perform very well on random and sequential access.
I need to map an (integer) id to a (double) value and sort by that value. The values can occur multiple times.
The amount of data can potentially be large.
Insertion or deletion are not critical. Iteration and Get Operations are.
I'm using Java. Currently I have a Guava Multimap, built from a TreeMap and ArrayList for sequential access. For random access I use a HashMap in parallel.
Any suggestions?

When insertion and deletion are not critical, a sorted array might be your friend. You could search it directly via Arrays.binarySearch and your custom Comparator.
In case you don't know any sane upper bound on the size, you can switch to an ArrayList (or implement your own resizing, but why...).
I guess this could be faster than the TreeMap, which is good when insertion and/or deletion are important, but suffers from bad spatial locality (a binary tree with many pointers to follow).
The optimal structure would place all the data in a single array, which is impossible in Java (you'd need a C struct for this). You could fake it by packing the doubles into longs; this is sure to work and to be fast (Double.doubleToLongBits and back are intrinsics, and both datatypes are 64 bits long). This would mean a non-trivial amount of work, especially for sorting (if sorting is uncommon enough, a conversion into some sortable array and back would do).
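A minimal sketch of that round trip (the array and variable names are made up for illustration; note that the raw bit patterns of negative doubles do not sort in the same order as the doubles themselves, hence the extra work for sorting mentioned above):

public class PackedDoubles {
    public static void main(String[] args) {
        double[] values = {3.5, 1.25, 9.0};

        // Store the doubles as raw 64-bit patterns in a long[].
        long[] packed = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            packed[i] = Double.doubleToLongBits(values[i]);
        }

        // Read one back without any loss.
        double restored = Double.longBitsToDouble(packed[1]);
        System.out.println(restored); // prints 1.25
    }
}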
In order to get faster search, you could use hashing, e.g. via a HashMap pointing to the first element and linking the elements. As your keys are ints, some primitive-capable implementation would help (e.g. Trove or fastutil).
There are countless possibilities, but keeping all your data in sync can be hard.
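A minimal sketch, using only java.util (and assuming Java 8+ for Map.Entry.comparingByValue), of one way to combine a HashMap for random access with a value-sorted view for iteration; the names are illustrative and this is not the OP's Guava setup:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdValueIndex {
    public static void main(String[] args) {
        Map<Integer, Double> byId = new HashMap<>(); // random access: id -> value
        byId.put(3, 2.5);
        byId.put(1, 9.0);
        byId.put(2, 2.5);

        // Rebuild the iteration order when needed: entries sorted by value.
        List<Map.Entry<Integer, Double>> byValue = new ArrayList<>(byId.entrySet());
        byValue.sort(Map.Entry.comparingByValue());

        System.out.println(byId.get(2)); // 2.5, random access
        for (Map.Entry<Integer, Double> e : byValue) {
            System.out.println(e.getKey() + " -> " + e.getValue()); // ascending by value
        }
    }
}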

Related

Is it possible to add/update a sorted list in constant time?

Suppose you are given a list of integers that have already been sorted such as (1,7,13,14,50). It should be noted that the list will contain no duplicates.
Is there some data structure that could store this while allowing me to add any new element (at its proper location) in constant time? add(10) would yield (1,7,10,13,14,50).
Similarly, would I be able to update an element (such as changing 7 to 19) and shift the order accordingly in constant time? change(7,19) yields (1,13,14,19,50).
For a class I need to write a data structure that performs these operations as quickly as possible, but I just wanted to know if constant time could be done and if not, then what would the ideal runtime be?
Constant-time insertion, O(1), would only occur as a best case for any of these data structures. Hash tables generally have the best insertion time, but it might not always be O(1) if there are collisions and separate chaining. A hash table isn't sorted, though, so its complexity is irrelevant here.
Binary trees have a good insertion time, and as a bonus they stay sorted as you insert new nodes. Insertion takes O(log n) time on average, however. The best case is O(1), when the tree is empty.
Those were just a couple examples, see here for more info on the complexities of these operations: http://bigocheatsheet.com/
In general? No. Determining where to insert a new element or re-ordering the list after insertion involves performing analysis of the list's contents, which involves reading the elements of the list, which (in general) means iterating over some portion of the length of the list. This (again, in general) is dependent on how many elements are in the list, which by definition is not a constant. Hence, a constant-time sorted insert is simply not possible except in special cases.
A binary tree, TreeSet, would be adequate. An array with Arrays.binarySearch and Arrays.copyOf would be fine too, because here we have ints, so we do not need the wrapper class Integer.
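A minimal sketch of the sorted-array variant (using System.arraycopy in place of a copy helper; the search is O(log n), but the shift makes the insert itself O(n)):

import java.util.Arrays;

public class SortedIntArray {
    // Insert 'value' into the already-sorted array, returning a new sorted array.
    static int[] insert(int[] sorted, int value) {
        int pos = Arrays.binarySearch(sorted, value);
        if (pos < 0) {
            pos = -pos - 1; // convert the "not found" result into the insertion point
        }
        int[] result = new int[sorted.length + 1];
        System.arraycopy(sorted, 0, result, 0, pos);
        result[pos] = value;
        System.arraycopy(sorted, pos, result, pos + 1, sorted.length - pos);
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 7, 13, 14, 50};
        System.out.println(Arrays.toString(insert(a, 10))); // [1, 7, 10, 13, 14, 50]
    }
}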
For real constant time, O(1), one must pay in space: use a BitSet. To add 17, simply set bit 17 to true. There are optimized methods to find the next set bit and so on.
But I doubt optimizing is really needed at this spot. File I/O might pay off more.
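For reference, a minimal sketch of the BitSet idea:

import java.util.BitSet;

public class IntMembership {
    public static void main(String[] args) {
        BitSet present = new BitSet();

        present.set(17); // "add" 17 in O(1)
        System.out.println(present.get(17)); // true

        // Iterate the set bits in ascending order.
        for (int i = present.nextSetBit(0); i >= 0; i = present.nextSetBit(i + 1)) {
            System.out.println(i);
        }
    }
}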

Should I use a `HashSet` or a `TreeSet` for a very large dataset?

I have a requirement to store 2 to 15 million Accounts (each a String of length 15) in a data structure for lookups and uniqueness checks. Initially I planned to store them in a HashSet, but I suspect the lookups will be slow because of hash collisions and will eventually be slower than a TreeMap (using binary search).
There is no requirement for the data to be sorted. I am using Java 7. I have a 64 GB system with 48 GB dedicated to this application.
This question is not a duplicate of HashSet and TreeSet performance test because that question is about the performance of adding elements to a Set and this question is about the performance of checking an existing Set for duplicate values.
If you have 48 GB of dedicated Memory for your 2 million to 15 million records, your best bet is probably to use a HashMap<Key, Record>, where your key is an Integer or a String depending on your requirements.
You will be fine as far as hash collisions go as long as you give enough memory to the Map and have an appropriate load factor.
I recommend using the following constructor: new HashMap<>(13_000_000); (30% more than your expected number of records - which will be automatically expanded by HashMap's implementation to 2^24 cells).
Tell your application that this Map will be very large from the get-go so it doesn't need to automatically grow as you populate it.
HashMap offers O(1) access time for its members, whereas TreeMap has O(log n) lookup time, but it can be more memory-efficient and doesn't need a clever hashing function. However, if you're using String or Integer keys, you don't need to worry about designing a hash function, and the constant-time lookups will be a huge improvement. The other advantage of TreeMap / TreeSet is the sorted ordering, which you stated you don't care about, so use HashMap.
If the only purpose of the list is to check for unique account numbers, then everything I've said above is still true, but as you stated in your question, you should use a HashSet<String>, not a HashMap. The performance recommendations and constructor argument are still applicable.
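A minimal sketch of the pre-sized HashSet approach (the capacity mirrors the constructor argument suggested above, and the account string is made up):

import java.util.HashSet;
import java.util.Set;

public class AccountLookup {
    public static void main(String[] args) {
        // Pre-size the set so it never has to rehash while being populated.
        Set<String> accounts = new HashSet<>(13_000_000);

        String account = "123456789012345"; // a hypothetical 15-character account id
        boolean added = accounts.add(account);       // false if it was already present
        boolean exists = accounts.contains(account); // O(1) expected lookup

        System.out.println(added + " " + exists); // true true
    }
}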
Further reading: HashSet and TreeSet performance test
When we tried to store 50 million records in a HashMap with proper initialization parameters, insertion started to slow down, especially after 35 million records. Changing to a TreeMap gave constant insertion and retrieval performance.
Observation: TreeMap gave better performance than a HashMap for a large input set. For a smaller set, HashMap will of course perform better.

Collection to store primitive ints that allows for faster contains() & ordered iteration

I need a space-efficient collection to store a large list of primitive ints (around 800,000 of them) that allows fast contains() checks and iteration in a defined order.
A fast contains() check, to test whether an int is in the list or not, is the main priority, as it is done very frequently.
I'm open to using widely used & popular 3rd party libraries like Trove, Guava & such others.
I have looked at TIntSet from Trove, but I believe it would not let me define the order of iteration.
Edit:
The size of collection would be around 800,000 ints.
The range of values in the collection will be from 0 to Integer.MAX_VALUE. The order of iteration should be based on the order in which I add the values to the collection, or maybe I just provide an ordered int[] and it should iterate in that same order.
As the data structure I would choose an array of longs (each of which I logically treat as two ints). The high-int part (bits 63-32) represents the int value you add to the collection. The low-int part (bits 31-0) represents the index of the successor when iterating. For your 800,000 unique integers you would create a long array of size 800,000.
Now you organize the array as a balanced binary tree ordered by your values: smaller values to the left, higher values to the right. You need two more tracking values: one int pointing to the index where iteration starts, and one int pointing to the index of the value inserted last.
Whenever you add a new value, reorganize your balanced binary tree and update the pointer from the last value added so that it points to the newly added value (as indexes).
Wrap these values (the array and the two tracking ints) in a collection of your choice.
With this data structure you get O(log n) search performance and a memory footprint of twice the size of the values.
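A minimal sketch of the bit packing this answer relies on (the helper names are made up; building and rebalancing the tree itself is left out):

public class PackedNode {
    // Value in the high 32 bits, successor index in the low 32 bits.
    static long pack(int value, int nextIndex) {
        return ((long) value << 32) | (nextIndex & 0xFFFFFFFFL);
    }

    static int value(long packed)     { return (int) (packed >>> 32); }
    static int nextIndex(long packed) { return (int) packed; }

    public static void main(String[] args) {
        long node = pack(123_456, 7);
        System.out.println(value(node));     // 123456
        System.out.println(nextIndex(node)); // 7
    }
}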
This reeks of a database, but since you require a more direct approach, use a memory-mapped file via java.nio. In particular, a self-defined ordering of 800,000 ints will hardly do otherwise. The contains() could be realized with a BitSet in memory, though, kept parallel to the ordering in the file.
You can use two sets: one hash-based set (e.g. TIntSet) for fast contains() operations, and another tree-based set like TreeSet to iterate in a specific order.
Whenever you add an int, you update both sets at the same time.
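A minimal sketch of that idea using only java.util collections (the class name is made up; the Trove TIntSet mentioned above would avoid boxing, but the structure is the same):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class DualSetIndex {
    private final Set<Integer> hashed = new HashSet<>();     // fast contains()
    private final TreeSet<Integer> sorted = new TreeSet<>(); // ascending iteration

    void add(int value) {            // keep both structures in sync on every insert
        hashed.add(value);
        sorted.add(value);
    }

    boolean contains(int value) {
        return hashed.contains(value);
    }

    Iterable<Integer> ascending() {
        return sorted;
    }
}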
It sounds like LinkedHashSet might be what you're looking for. Internally, it maintains two structures: a hash table and a linked list running through the entries, allowing both fast contains() (from the former) and a defined, insertion-order iteration (from the latter).
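A minimal sketch of the LinkedHashSet approach (note that it boxes the ints, unlike the primitive-collection libraries mentioned in the question):

import java.util.LinkedHashSet;
import java.util.Set;

public class InsertionOrderedInts {
    public static void main(String[] args) {
        Set<Integer> values = new LinkedHashSet<>(1_000_000);
        values.add(42);
        values.add(7);
        values.add(1_000_000_000);

        System.out.println(values.contains(7)); // true, O(1) expected

        for (int v : values) {
            System.out.println(v); // prints 42, 7, 1000000000 in insertion order
        }
    }
}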
Just use an ArrayList<Integer>.

Fastest sort for small collections

Many times I have to sort large numbers of small lists and arrays; it is quite rare that I need to sort large arrays. Which is the fastest sort algorithm for sorting:
arrays
(array)lists
of size 8-15 elements of these types:
integer
strings of 10-40 characters
?
I am listing element types because some algorithms do more compare operations and less swap operations.
I am considering Merge Sort, Quick Sort, Insertion Sort and Shell sort (2^k - 1 increment).
Arrays.sort(..) / Collections.sort(..) will make that decision for you.
For example, the OpenJDK 7 implementation of Arrays.sort(..) has INSERTION_SORT_THRESHOLD = 47; it uses insertion sort for arrays with fewer than 47 elements.
Unless you can prove that this is a bottleneck, the built-in sorts are fine:
Collections.sort(myIntList);
Arrays.sort(myArray);
Actually, there isn't a universal answer. Among other things, the performance of a Java sort algorithm will depend on the relative cost of the compare operation, and (for some algorithms) on the order of the input. In the case of a list, it also depends on the list implementation type.
But @Bozho's advice is sound, as is @Sean Patrick Floyd's comment.
FOLLOWUP
If you believe that the performance difference is going to be significant for your use-case, then you should get hold of some implementations of different algorithms, and test them out using the actual data that your application needs to deal with. (And if you don't have the data yet, it is too soon to start tuning your application, because the sort performance will depend on actual data.)
In short, you'll need to do the benchmarking yourself.
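If it does come to benchmarking, a crude timing loop like the one below (not a substitute for a proper harness such as JMH; the sizes and counts are only illustrative) gives a first impression of how the built-in sort handles many small arrays:

import java.util.Arrays;
import java.util.Random;

public class SmallSortTiming {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int lists = 1_000_000;
        int size = 12; // a typical small-list size from the question

        long start = System.nanoTime();
        for (int i = 0; i < lists; i++) {
            int[] a = new int[size];
            for (int j = 0; j < size; j++) {
                a[j] = rnd.nextInt();
            }
            Arrays.sort(a); // falls back to insertion sort for arrays this small
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Sorted " + lists + " arrays of " + size + " ints in " + elapsedMs + " ms");
    }
}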

Why is it better to convert a HashSet to a TreeSet than to work directly with a TreeSet?

In many places on the web, including Sun's website, the following sentence appears:
It is generally faster to perform actions on a HashSet and then convert the HashSet to a TreeSet.
Well, I'm a little bit confused. It's correct that adding an element to a HashSet is O(1) and adding an object to a TreeSet (a red-black tree) is O(log n), but when I convert the HashSet to the TreeSet I need to sort my data, which is O(n log n). So why is it faster to work with a HashSet and then convert it to a TreeSet? I know that if you remove or look up existing elements there is a difference between hash and tree, but I don't think that is the factor Sun refers to (at least I hope so, since it looks like a very small thing). Another thing is that the hashCode method can be poor, so adding elements to the hash will not be O(1), or the hashCode method can be expensive. So generally I don't understand the sentence. Can anyone help me?
It depends on how many operations happen in the hash table before you copy the elements to the sorted tree structure. If all you do is insert n distinct elements into the hash table, then no, it will not be faster to do that and then copy them to the tree :)
A hashed set of items can be converted to a sorted tree either by using a regular sort and then building the tree from that, or by inserting the items into the tree one at a time. The former means an extra copy/traversal; the latter means extra overhead to maintain a balanced tree (although if you iterate a hash table, you get the items in effectively random order, which means you could probably avoid most rebalancing).
Hash tables are indeed typically faster than search trees for the operations that are well supported (insert/modify/delete), but it's definitely not worth doing what Sun recommends until you actually measure the performance of your whole application and can expect a valuable overall speedup from what will likely be a slight improvement.
Hash tables do have an even larger advantage over sorted trees when the key comparison is expensive (as with strings), because for large sets, fewer items will have a hash collision than a search tree is deep, and because it's possible to cache the hash code for keys already in the set, skipping the expensive comparison for (probably) all but the matching result.
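A minimal sketch of the hash-then-convert pattern under discussion (the element values are made up):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class HashThenTree {
    public static void main(String[] args) {
        // Do the bulk of the work (inserts, de-duplication, lookups) against the HashSet...
        Set<String> hashed = new HashSet<>();
        hashed.add("banana");
        hashed.add("apple");
        hashed.add("cherry");

        // ...and pay the O(n log n) sorting cost once, when sorted order is actually needed.
        TreeSet<String> sorted = new TreeSet<>(hashed);
        System.out.println(sorted); // [apple, banana, cherry]
    }
}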
