I need an efficient data structure to store a large number (millions) of records on a live server (up to a hundred insertions, deletions, or updates per second).
Clients need to be able to grab a chunk of that data, sorted, beginning from some point; scroll (i.e. get records before and after the ones they initially got); and receive live updates.
Initially I considered some form of linked ordered set with an index. However, even though the records are unique in the sense that they have an id, the values of the fields by which the set would be ordered are not. I could resolve collisions by inserting more than one record into each node, but that does not seem right.
The other solution I came up with is a linked set with an index, kept sorted through insertions, deletions, and updates. The big-O of that would be O(n) rather than O(log n), but I'm guessing that if I still have the index, it would speed up the process a lot? Or could I binary search for the place to insert? I do not think I can do that with a linked list, though.
What would be the most efficient solution, and which one is best given that I need clients to receive live updates on the state of this data structure?
The code will be in Java.
Millions of records -> First estimate whether you want to (and can) hold all the data in RAM.
Have a look at B-trees.
Algorithm    Average      Worst case
---------    ---------    ----------
Space        O(n)         O(n)
Search       O(log n)     O(log n)
Insert       O(log n)     O(log n)
Delete       O(log n)     O(log n)
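If the data does fit in RAM, here is a minimal sketch (my own illustration, not from the answer above) of serving "chunk from a point" and backward-scroll queries with a ConcurrentSkipListMap, which stays sorted and tolerates concurrent readers alongside ~100 writes per second; LiveIndex and Key are made-up names:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class LiveIndex {
    // Composite key: the (non-unique) sort value plus the unique record id as a tie-breaker.
    static final class Key implements Comparable<Key> {
        final long sortValue;
        final long id;
        Key(long sortValue, long id) { this.sortValue = sortValue; this.id = id; }
        @Override public int compareTo(Key o) {
            int c = Long.compare(sortValue, o.sortValue);
            return c != 0 ? c : Long.compare(id, o.id);
        }
    }

    // Skip list: O(log n) insert/delete/seek, usable under concurrent reads and writes.
    private final ConcurrentSkipListMap<Key, String> records = new ConcurrentSkipListMap<>();

    void put(Key key, String record) { records.put(key, record); }
    void remove(Key key)             { records.remove(key); }

    // Up to 'limit' records starting at or after 'from'.
    List<String> pageAfter(Key from, int limit) {
        return firstN(records.tailMap(from, true), limit);
    }

    // Scrolling backwards: records strictly before 'from', nearest first.
    List<String> pageBefore(Key from, int limit) {
        return firstN(records.headMap(from, false).descendingMap(), limit);
    }

    private static List<String> firstN(NavigableMap<Key, String> view, int limit) {
        List<String> out = new ArrayList<>(limit);
        for (Map.Entry<Key, String> e : view.entrySet()) {
            if (out.size() == limit) break;
            out.add(e.getValue());
        }
        return out;
    }
}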
In Java these kinds of requirements are usually solved by using a TreeMap like a database index. The TreeMap interface isn't particularly well designed for this, so there are some tricks to it:
Your record objects should implement a Key interface or base class that just exposes the sort fields and ID. This interface should not extend Comparable.
Your record objects will be both keys and values in the TreeMap, and each record will map to itself, but the Key interface will be used as the key, so the type of the map is TreeMap<Key, Record>. Remember that every put should be of the form put(record, record).
When you make the TreeMap, use the constructor that takes a custom comparator. Pass a comparator that compares Keys using the sort fields AND the ID, so that there will be no duplicates.
To search in the map, you can use other implementations of the Key interface -- you don't have to use complete records. Because a caller can't provide an ID, though, you can't use TreeMap.get() to find a record that matches the sort fields. Use a key with ID = 0 and TreeMap.ceilingEntry to get the first record with a key >= the search key, and then check the sort fields to see if they match.
Note that if you need multiple orderings on different fields, you can make your records implement multiple Key interfaces and put them in multiple maps.
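Putting those tricks together, a minimal sketch (the Record, Probe, and getName names are illustrative):

import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

interface Key {
    String getName();   // the sort field (here a single name; expose as many as you need)
    long getId();       // the unique record ID, used only as a tie-breaker
}

class Record implements Key {
    final long id;
    final String name;
    final String data;
    Record(long id, String name, String data) { this.id = id; this.name = name; this.data = data; }
    public String getName() { return name; }
    public long getId() { return id; }
}

// A lightweight Key used only for searching; callers don't need a full Record.
class Probe implements Key {
    final String name;
    Probe(String name) { this.name = name; }
    public String getName() { return name; }
    public long getId() { return 0; }   // ID = 0 sorts before any real record, assuming real ids are positive
}

class Demo {
    public static void main(String[] args) {
        // The comparator includes the ID so "equal" sort fields never collide.
        TreeMap<Key, Record> index = new TreeMap<>(
                Comparator.comparing(Key::getName).thenComparingLong(Key::getId));

        Record r = new Record(42, "Smith", "...");
        index.put(r, r);   // every put is put(record, record)

        // Find the first record whose sort field matches "Smith", if any.
        Map.Entry<Key, Record> e = index.ceilingEntry(new Probe("Smith"));
        Record match = (e != null && e.getKey().getName().equals("Smith"))
                ? e.getValue() : null;
    }
}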
I'm trying to remove duplicates from key-value pairs, and sorting the data first seems like the best way to do this. I have tuples (both values are Integer), so the code doesn't have to work for arbitrary objects; if it can be optimised for Integers, that would be great. I would like to sort all my pairs first by value, and then by key (note that I need both operations while maintaining the key-value relationship).
I'm new to Java, and I was wondering if there are sorting methods in a Map (or any other data structure I can use) that would do this for me. Since the dataset I'm using is huge (>50 GB), I have to save time wherever possible. I have tried simply adding all the pairs into a Set (as a concatenated string of both integers) and then taking them out, but it takes too long. I'm open to switching to external-sort algorithms if needed (I'm using a 64 GB machine, so anything that takes more than O(n) space will be problematic).
Well, you can both sort and eliminate duplicates by storing the data in a TreeMap. TreeMap is an implementation of Map whose keys are sorted by their natural order. You can implement Comparable<Data_Type> and override public int compareTo(T t) to define the sorting order.
Since a Map cannot hold duplicate keys, a duplicate entry will automatically overwrite the old one.
Have a look at this link: Sort a HashMap in Java
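As a minimal sketch of that idea, here using a TreeSet with a custom comparator instead of natural ordering, deduplicating int pairs while sorting by value and then key (this only addresses the in-memory part; a >50 GB dataset may still need external sorting):

import java.util.Comparator;
import java.util.TreeSet;

class DedupSort {
    public static void main(String[] args) {
        // Pairs are int[]{key, value}, ordered by value, then key.
        // The TreeSet drops exact duplicate pairs automatically.
        TreeSet<int[]> pairs = new TreeSet<>(
                Comparator.<int[]>comparingInt(p -> p[1]).thenComparingInt(p -> p[0]));

        pairs.add(new int[]{3, 7});
        pairs.add(new int[]{1, 7});
        pairs.add(new int[]{3, 7});   // duplicate pair: not added again

        for (int[] p : pairs) {
            System.out.println(p[0] + " -> " + p[1]);   // prints 1 -> 7, then 3 -> 7
        }
    }
}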
I am trying to justify whether I'm using the most appropriate Data Structure for a set of scenarios.
The first scenario is an estate agent selling properties at different prices where no price is duplicated. Customers choose a range of prices & obtain a list of properties in that range.
To store the collection of property data I would choose a TreeSet. As no property will have the same price, I could have pairs of price (key) and property details (value). This would work with a TreeSet because there are no duplicate entries, and the TreeSet can keep prices sorted in natural order. Additionally, the main operation for the scenario is search/contains, which takes O(log n). Although there are faster search/contains operations, e.g. in a HashMap, I need ordering. If I need to insert or delete an entry, I believe these operations are also O(log n).
To return a list of properties within a price range, I think I can use the headSet() method?
However, I've read in some threads that I could store the data in a HashMap and create a TreeSet from the HashMap; would it be worth doing this?
You need an ordered set to be able to serve this type of query. Therefore a tree structure is better suited to your needs than a hash map. However, the equivalent of a HashMap is a TreeMap, not a TreeSet - you need a mapping between key and value. As for the range operations, there is a method better suited to your needs - subMap.
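A small illustration of subMap (the prices and descriptions are made up):

import java.util.SortedMap;
import java.util.TreeMap;

class PriceRange {
    public static void main(String[] args) {
        TreeMap<Integer, String> byPrice = new TreeMap<>();
        byPrice.put(250_000, "2-bed flat");
        byPrice.put(315_000, "3-bed semi");
        byPrice.put(420_000, "4-bed detached");

        // Properties priced in [300_000, 400_000]. subMap's upper bound is exclusive,
        // so pass endPrice + 1 (or use the inclusive overload subMap(from, true, to, true)).
        SortedMap<Integer, String> inRange = byPrice.subMap(300_000, 400_000 + 1);
        System.out.println(inRange);   // {315000=3-bed semi}
    }
}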
Redis has a data structure called a sorted set.
The interface is roughly that of a SortedMap, but sorted by value rather than key. I could almost make do with a SortedSet, but they seem to assume static sort values.
Is there a canonical Java implementation of a similar concept?
My immediate use case is to build a set with a TTL on each element. The value of the map would be the expiration time, and I'd periodically prune expired elements. I'd also be able to bump the expiration time periodically.
So... several things.
First, decide which kind of access you'll be doing more of. If you'll be doing more HashMap actions (get, put) than accessing a sorted list, then you're better off just using a HashMap and sorting the values when you want to prune the collection.
As for pruning the collection, it sounds like you want to just remove values that have a time less than some timestamp rather than removing the earliest n items. If that's the case then you're better off just filtering the HashMap based on whether the value meets a condition. That's probably faster than trying to sort the list first and then remove old entries.
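For example, pruning could then be a single filter pass over the map's values (the fooMap name and expiresAt field are assumed, not from the question):

long now = System.currentTimeMillis();
fooMap.values().removeIf(item -> item.expiresAt <= now);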
Since you need two separate conditions, one on the keys and one on the values, the best performance on very large amounts of data will likely require two data structures. You could rely on a regular Set and, separately, insert the same objects into a PriorityQueue ordered by TTL. Bumping the TTL could be done by writing to a field of the object that holds an additional TTL; then, when you remove the next object, you check whether there is an additional TTL, and if so, you put the object back with this new TTL and set the additional TTL to 0. (I suggest this because removing an arbitrary element from a PriorityQueue costs O(n).) This yields O(log n) time for insertion and for removal of the next object (plus the cost of the bumped TTLs, which depends on how often bumping happens), and O(1) or O(log n) time for bumping a TTL, depending on the implementation of Set that you choose.
Of course, the cleanest approach would be to design a new class encapsulating all this.
Also, all of this is overkill if your data set is not very large.
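A rough sketch of the Set + PriorityQueue idea with lazy TTL bumping (the TtlSet and Item names and fields are my own):

import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

class TtlSet {
    static final class Item {
        final String value;
        long expiresAt;   // current deadline (millis since epoch)
        long bumpedTo;    // a later deadline if bumped, 0 otherwise

        Item(String value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Set<Item> items = new HashSet<>();
    private final PriorityQueue<Item> byExpiry =
            new PriorityQueue<>(Comparator.comparingLong((Item i) -> i.expiresAt));

    void add(Item item) {   // O(log n)
        items.add(item);
        byExpiry.add(item);
    }

    // Bumping only writes a field: O(1). The queue is fixed up lazily in pruneExpired,
    // because removing an arbitrary element from a PriorityQueue costs O(n).
    void bump(Item item, long newDeadline) {
        item.bumpedTo = newDeadline;
    }

    void pruneExpired(long now) {
        while (!byExpiry.isEmpty() && byExpiry.peek().expiresAt <= now) {
            Item item = byExpiry.poll();   // O(log n)
            if (item.bumpedTo > now) {
                // The item was bumped: re-queue it with its new deadline.
                // (Mutating expiresAt is safe here because the item is out of the queue.)
                item.expiresAt = item.bumpedTo;
                item.bumpedTo = 0;
                byExpiry.add(item);
            } else {
                items.remove(item);        // truly expired
            }
        }
    }
}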
You can implement it using a combination of two data structures: a sorted mapping of keys to scores, and a sorted reverse mapping of scores to keys.
In Java, typically these would be implemented with TreeMap (if we are sticking to the standard Collections Framework).
Redis uses skip lists for maintaining the ordering, but skip lists and balanced binary search trees (such as TreeMap) both serve the purpose here, providing average O(log N) access.
For a given sorted set, we can implement it as a standalone class as follows:

import java.util.*;

class SortedSet {
    private final TreeMap<String, Integer> keyToScore = new TreeMap<>();
    private final TreeMap<Integer, Set<String>> scoreToKey = new TreeMap<>();

    void addItem(String key, int score) {
        if (keyToScore.containsKey(key)) {
            // Remove the old key and old score
            int oldScore = keyToScore.remove(key);
            Set<String> keys = scoreToKey.get(oldScore);
            keys.remove(key);
            if (keys.isEmpty()) {
                scoreToKey.remove(oldScore);
            }
        }
        // Add key and score to both maps
        keyToScore.put(key, score);
        scoreToKey.computeIfAbsent(score, s -> new TreeSet<>()).add(key);
    }

    List<String> getKeysInRange(int startScore, int endScore) {
        // Traverse scoreToKey within the range and retrieve all values
        List<String> result = new ArrayList<>();
        for (Set<String> keys : scoreToKey.subMap(startScore, true, endScore, true).values()) {
            result.addAll(keys);
        }
        return result;
    }

    // ...
}
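Usage would then look like this (hypothetical keys and scores):

SortedSet zset = new SortedSet();
zset.addItem("alice", 7);
zset.addItem("bob", 3);
zset.addItem("alice", 5);        // re-adding moves alice from score 7 to score 5
zset.getKeysInRange(1, 10);      // ["bob", "alice"], ordered by score 3, then 5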
I am wondering if there is a more efficient method for getting objects out of my LinkedHashMap with timestamps greater than a specified time. I.e. something better than the following:
Iterator<Foo> it = foo_map.values().iterator();
Foo foo;
while (it.hasNext()) {
    foo = it.next();
    if (foo.get_timestamp() < minStamp) continue;
    break;
}
In my implementation, each of my objects has essentially three values: an "id," a "timestamp," and "data." The objects are inserted in order of their timestamps, so when I iterate over the set I get ordered results (as guaranteed by the LinkedHashMap contract). The map is keyed by the objects' ids, so I can quickly look them up by id.
When I look them up by a timestamp condition, however, I get an iterator over sorted results. This is an improvement over a generic HashMap, but I still need to iterate sequentially over much of the range until I find the first entry with a higher timestamp than the specified one.
Since the results are already sorted, is there any algorithm I can pass the iterator (or collection) to that can search faster than sequentially? If I went with a TreeMap as an alternative, would it offer overall speed advantages, or is it doing essentially the same thing in the background? Since the collection is already sorted by insertion order, I'm thinking a TreeMap has a lot of overhead I don't need?
There is no faster way ... if you just use a LinkedHashMap.
If you want faster access, you need to use a different data structure. For example, a TreeSet with an appropriate comparator might be a better solution for this aspect of your problem. If your TreeSet is ordered by date, for instance, calling tailSet with an appropriate dummy value gives you all elements greater than or equal to a given date.
Since the results are already sorted, is there any algorithm I can pass the iterator (or collection to), that can search it faster than sequential?
Not for a LinkedHashMap.
However, if the ordered list was an ArrayList instead, then you could use "binary search" on the list ... provided that you could lock it to prevent concurrent modifications while you are searching. (Actually, concurrency is a potential issue to consider no matter how you implement this ... including your current linear search.)
If you want to keep the ability to do id lookups, then you need two data structures; e.g. a TreeSet and a HashMap which share their element objects. A TreeSet will probably be more efficient than trying to maintain an ArrayList in order assuming that there are random insertions and/or random deletions.
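A sketch of that two-structure combination (TimestampIndex is a made-up name; Foo's fields follow the question):

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class Foo {
    final long id;
    final long timestamp;
    final String data;
    Foo(long id, long timestamp, String data) {
        this.id = id; this.timestamp = timestamp; this.data = data;
    }
}

class TimestampIndex {
    // Ordered by timestamp, with id as a tie-breaker so equal timestamps don't collide.
    private final TreeSet<Foo> byTime = new TreeSet<>(
            Comparator.comparingLong((Foo f) -> f.timestamp).thenComparingLong(f -> f.id));
    private final Map<Long, Foo> byId = new HashMap<>();

    void add(Foo foo) {
        byTime.add(foo);
        byId.put(foo.id, foo);
    }

    Foo getById(long id) { return byId.get(id); }

    // All entries with timestamp >= minStamp, located in O(log n) via a dummy element
    // whose id (Long.MIN_VALUE) sorts before any real record with the same timestamp.
    SortedSet<Foo> since(long minStamp) {
        return byTime.tailSet(new Foo(Long.MIN_VALUE, minStamp, null));
    }
}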
Does someone know a nice solution for EnumSet + List?
I mean I need to store enum values while preserving insertion order, and I need to be able to access the index of an enum value in the collection in O(1) time.
The closest thing I can think of in the API is the LinkedHashSet:
From http://java.sun.com/j2se/1.4.2/docs/api/java/util/LinkedHashSet.html:
Hash table and linked list implementation of the Set interface, with predictable iteration order.
I doubt it's possible to do what you want. Basically, you want to look up indexes in constant time, even after modifying the order of the list. Unless you allow remove / reorder operations to take O(n) time, I believe you can't get away with lower than O(log n) (which can be achieved by a heap structure).
The only way I can see to satisfy ordering and O(1) access is to duplicate the data in a List and an array of indexes (wrapped in a nice little OrderedEnumSet, of course).
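A sketch of that List-plus-index-array idea (OrderedEnumSet as suggested above; the implementation details are my guess at what it could look like):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OrderedEnumSet<E extends Enum<E>> {
    private final List<E> order = new ArrayList<>();   // preserves insertion order
    private final int[] index;                         // ordinal -> position, -1 if absent

    OrderedEnumSet(Class<E> type) {
        index = new int[type.getEnumConstants().length];
        Arrays.fill(index, -1);
    }

    boolean add(E e) {
        if (index[e.ordinal()] != -1) return false;    // already in the set
        index[e.ordinal()] = order.size();
        order.add(e);
        return true;
    }

    int indexOf(E e) {                                 // O(1), as required
        return index[e.ordinal()];
    }

    // Removing or reordering would need an O(n) fix-up of 'index',
    // which is exactly the trade-off described in the answer above.
}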