Performing the fastest search - which collection should i use?

Performing the fastest search - which collection should i use? - java

I know:
If you need fast access to elements using index, ArrayList should be choice.
If you need fast access to elements using a key, use HashMap.
If you need fast add and removal of elements, use LinkedList (but it has a very poor seeking performance).
In order to perform the fastest search, on the basis of data stored in a collection object, which collection should I use?
Below is my code:
public void fillAndSearch(Collection<Student> collection) {
if(collection!=null){
for (int i=0; i<=10; i++) {
Student student = new Student("name" + i, "id" + i);
collection.add(student);
}
}
//here We have to perform searching for "name7" or "id5",
//then which implementation of collection will be fastest?
}
class Student {
String name;
String id;
Student(String name, String id) {
this.name = name;
this.id = id;
}
}

The thing which is often skipped when comparing ArrayList and LinkedList is cache and memory management optimisations. ArrayList is effectively just an array which means that it is stored in a continuous space in the memory. This allows the Operating System to use optimisations such as "when a byte in memory was accessed, most likely the next byte will be accessed soon". Because of this, ArrayList is faster than LinkedList in all but one case: when inserting/deleting the element at the beginning of the list (because all elements in the array have to be shifted). Adding/deleting at the end or in the middle, iterating over, accessing the element are all faster in case of ArrayList.
If you need to search for student with given name and id, it sounds to me like a map with composite key - Map<Student, StudentData>. I would recommend to use HashMap implementation, unless you need to be able to both search the collection and retrieve all elements sorted by key in which case TreeMap may be a better idea. Although remember that HashMap has O(1) access time, while TreeMap has O(logn) access time.

With given restrictions, you should use HashMap.
It will give you quick search, as you wished.
If you care about traversing elements in specific order, you should choose TreeMap (natural order) or LinkedHashMap (insertion order).
If your collection is guaranteed immutable, you can use sorted ArrayList with binary search, it will save you some memory. In this case, you can search only by one specific key, which is undesirable in many real world applications.
Anyway, you should have really huge number of elements (millions/billions) to feel the difference between O(logN) solutions and O(1) solutions.
If you want to learn more about data structures, I recommend you to review Algorythms course by Princeton university on coursera.com

There is nothing wrong in keeping multiple collections to access your data faster.
In this situation I would use 2 HashMap<String, Student>'s. One for each search-key.
(PS: Or if you don't know which kind of keyword is used to search for, then you can store both in the same map).

Related

TreeSet vs ArrayList and sort [duplicate]

I have implemented a graph.
I want to sort a given subset of vertices with respect to their degrees.
Therefore, I've written a custom comparator named DegreeComparator.
private class DegreeComparator implements Comparator<Integer>
{
#Override
public int compare(Integer arg0, Integer arg1)
{
if(adj[arg1].size() == adj[arg0].size()) return arg1 - arg0;
else return adj[arg1].size() - adj[arg0].size());
}
}
So, which one of the below is more efficient?
Using TreeSet
public Collection<Integer> sort(Collection<Integer> unsorted)
{
Set<Integer> sorted = new TreeSet<Integer>(new DegreeComparator());
sorted.addAll(unsorted);
return sorted;
}
Using ArrayList
Collections.sort(unsorted, new DegreeComparator());
Notice that the second approach is not a function, but a one-line code.
Intuitively, I'd rather choose the second one. But I'm not sure if it is more efficient.

Java API contains numerous Collection and Map implementations so it might be confusing to figure out which one to use. Here is a quick flowchart that might help with choosing from the most common implementations

A TreeSet is a Set. It removes duplicates (elements with the same degree). So both aren't equivalent.
Anyway, if what you want naturally is a sorted list, then sort the list. This will work whether the collection has duplicates or not, and even if it has the same complexity (O(n*log(n)) as populating a TreeSet, it is probably faster (because it just has to move elements in an array, instead of having to create lots of tree nodes).

If you only sort once, then the ArrayList is an obvious winner. The TreeSet is better if you add or remove items often as sorting a list again and again would be slow.
Note also that all tree structures need more memory and memory access indirection which makes them slower.
If case of medium sized lists, which change rather frequently by a single element, the fastest solution might be using ArrayList and inserting into the proper position (obviously assuming the arrays get sorted initially).
You'd need to determine the insert position via Arrays.binarySearch and insert or remove. Actually, I would't do it, unless the performance were really critical and a benchmark would show it helps. It gets slow when the list get really big and the gain is limited as Java uses TimSort, which is optimized for such a case.
As pointed in a comment, assuring that the Comparator returns different values is sometimes non-trivial. Fortunately, there's Guava's Ordering#arbitrary, which solves the problem if you don't need to be compatible with equals. In case you do, a similar method can be written (I'm sure I could find it somewhere if requested).

Is there a Java data structure equivalent to Redis sorted sets (zset)

Redis has a data structure called a sorted set.
The interface is roughly that of a SortedMap, but sorted by value rather than key. I could almost make do with a SortedSet, but they seem to assume static sort values.
Is there a canonical Java implementation of a similar concept?
My immediate use case is to build a set with a TTL on each element. The value of the map would be the expiration time, and I'd periodically prune expired elements. I'd also be able to bump the expiration time periodically.

So... several things.
First, decide which kind of access you'll be doing more of. If you'll be doing more HashMap actions (get, put) than accessing a sorted list, then you're better off just using a HashMap and sorting the values when you want to prune the collection.
As for pruning the collection, it sounds like you want to just remove values that have a time less than some timestamp rather than removing the earliest n items. If that's the case then you're better off just filtering the HashMap based on whether the value meets a condition. That's probably faster than trying to sort the list first and then remove old entries.

Since you need two separate conditions, one on the keys and the other one on the values, it is likely that the best performance on very large amounts of data will require two data structures. You could rely on a regular Set and, separately, insert the same objects in PriorityQueue ordered by TTL. Bumping the TTL could be done by writing in a field of the object that contains an additional TTL; then, when you remove the next object, you check if there is an additional TTL, and if so, you put it back with this new TTL and additional TTL = 0 [I suggest this because the cost of removal from a PriorityQueue is O(n)]. This would yield O(log n) time for removal of the next object (+ cost due to the bumped TTLs, this will depend on how often it happens) and insertion, and O(1) or O(log n) time for bumping a TTL, depending on the implementation of Set that you choose.
Of course, the cleanest approach would be to design a new class encapsulating all this.
Also, all of this is overkill if your data set is not very large.

You can implement it using a combination of two data structures.
A sorted mapping of keys to scores. And a sorted reverse mapping of scores to keys.
In Java, typically these would be implemented with TreeMap (if we are sticking to the standard Collections Framework).
Redis uses Skip-Lists for maintaining the ordering, but Skip-Lists and Balanced Binary Search Trees (such as TreeMap) both serve the purpose to provide average O(log(N)) access here.
For a given sort set,
we can implement it as an independent class as follows:
class SortedSet {
TreeMap<String, Integer>> keyToScore;
TreeMap<Integer, Set<String>>> scoreToKey
public SortedSet() {
keyToScore= new TreeMap<>();
scoreToKey= new TreeMap<>();
}
void addItem(String key, int score) {
if (keyToScore.contains(key)) {
// Remove old key and old score
}
// Add key and score to both maps
}
List<String> getKeysInRange(int startScore, int endScore) {
// traverse scoreToKey and retrieve all values
}
....
}

Java HashSet vs Array Performance

I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are (and the number won't change), and was wondering whether Array would have a notable performance advantage over HashSet for storing/retrieving said elements.
On paper, Array guarantees constant time insertion (since I know the size ahead of time) and retrieval, but the code for HashSet looks much cleaner and adds some flexibility, so I'm wondering if I'm losing anything performance-wise using it, at least, theoretically.

Depends on your data;
HashSet gives you an O(1) contains() method but doesn't preserve order.
ArrayList contains() is O(n) but you can control the order of the entries.
Array if you need to insert anything in between, worst case can be O(n), since you will have to move the data down and make room for the insertion. In Set, you can directly use SortedSet which too has O(n) too but with flexible operations.
I believe Set is more flexible.

The choice greatly depends on what do you want to do with it.
If it is what mentioned in your question:
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are
If this is what you need to do, the you need neither of them. There is a size() method in Collection for which you can get the size of it, which mean how many of them there are in the collection.
If what you mean for "collection of object" is not really a collection, and you need to choose a type of collection to store your objects for further processing, then you need to know, for different kind of collections, there are different capabilities and characteristic.
First, I believe to have a fair comparison, you should consider using ArrayList instead Array, for which you don't need to deal with the reallocation.
Then it become the choice of ArrayList vs HashSet, which is quite straight-forward:
Do you need a List or Set? They are for different purpose: Lists provide you indexed access, and iteration is in order of index. While Sets are mainly for you to keep a distinct set of data, and given its nature, you won't have indexed access.
After you made your decision of List or Set to use, then it is a choice of List/Set implementation, normally for Lists, you choose from ArrayList and LinkedList, while for Sets, you choose between HashSet and TreeSet.
All the choice depends on what you would want to do with that collection of data. They performs differently on different action.
For example, an indexed access in ArrayList is O(1), in HashSet (though not meaningful) is O(n), (just for your interest, in LinkedList is O(n), in TreeSet is O(nlogn) )
For adding new element, both ArrayList and HashSet is O(1) operation. Inserting in the middle is O(n) for ArrayList, while it doesn't make sense in HashSet. Both will suffer from reallocation, and both of them need O(n) for the reallocation (HashSet is normally slower in reallocation, because it involve calculation of hash for each element again).
To find if certain element exists in the collection, ArrayList is O(n) and HashSet is O(1).
There are still lots of operations you can do, so it is quite meaningless to discuss for performance without knowing what you want to do.

theoretically, and as SCJP6 Study guide says :D
arrays are faster than collections, and as said, most of the collections depend mainly on arrays (Maps are not considered Collection, but they are included in the Collections framework)
if you guarantee that the size of your elements wont change, why get stuck in Objects built on Objects (Collections built on Arrays) while you can use the root objects directly (arrays)

It looks like you will want an HashMap that maps id's to counts. Particularly,
HashMap<Integer,Integer> counts=new HashMap<Integer,Integer>();
counts.put(uniqueID,counts.get(uniqueID)+1);
This way, you get amortized O(1) adds, contains and retrievals. Essentially, an array with unique id's associated with each object IS a HashMap. By using the HashMap, you get the added bonus of not having to manage the size of the array, not having to map the keys to an array index yourself AND constant access time.

Searching LinkedHashMap, faster method than sequential?

I am wondering if there is a more efficient method for getting objects out of my LinkedHashMap with timestamps greater than a specified time. I.e. something better than the following:
Iterator<Foo> it = foo_map.values().iterator();
Foo foo;
while(it.hasNext()){
foo = it.next();
if(foo.get_timestamp() < minStamp) continue;
break;
}
In my implementation, each of my objects has essentially three values: an "id," "timestamp," and "data." The objects are insterted in order of their timestamps, so when I call an iterator over the set, I get ordered results (as required by the linked hashmap contract). The map is keyed to the object's id, so I can quickly lookup them up by id.
When I look them up by a timestamp condition, however, I get an iterator with sorted results. This is an improvement over a generic hashmap, but I still need to iterate sequentially over much of the range until I find the next entry with a higher timestamp than the specified one.
Since the results are already sorted, is there any algorithm I can pass the iterator (or collection to), that can search it faster than sequential? If I went with a treemap as an alternative, would it offer overall speed advantages, or is it doing essentially the same thing in the background? Since the collection is sorted by insertion order already, I'm thinking tree map has a lot more overhead I don't need?

There is no faster way ... if you just use a LinkedHashMap.
If you want faster access, you need to use a different data structure. For example, a TreeSet with an appropriate comparator might be a better solution for this aspect of your problem. For example if your TreeSet is ordered by date, then calling tailSet with an appropriate dummy value can give you all elements greater or equal to a given date.
Since the results are already sorted, is there any algorithm I can pass the iterator (or collection to), that can search it faster than sequential?
Not for a LinkedHashMap.
However, if the ordered list was an ArrayList instead, then you could use "binary search" on the list ... provided that you could lock it to prevent concurrent modifications while you are searching. (Actually, concurrency is a potential issue to consider no matter how you implement this ... including your current linear search.)
If you want to keep the ability to do id lookups, then you need two data structures; e.g. a TreeSet and a HashMap which share their element objects. A TreeSet will probably be more efficient than trying to maintain an ArrayList in order assuming that there are random insertions and/or random deletions.

Searching data in a array(list)

I have a ArrayList containing Attributes
class Attribute{
private int id;
public string getID(){
return this.id;
}
private string value;
public string getValue(){
return this.value;
}
//... more properties here...
}
Well I filled the ArrayList with like hundreds of those attributes. And I want to find the Attribute with a defined ID. I want to do something like this:
ArrayList<Attribute> arr = new ArrayList<Attribute>();
fillList(arr); //Method that puts a lot of these Attributes in the list
arr.find(234); //Find the attribute with the ID 234;
Is looping over the ArrayList the only solution.

Well something's going to have to loop over the array list, yes. There are various ways of doing this, different libraries etc.
If you fill the array in an ordered way (e.g. so that low IDs always come before high IDs) then you can perform a binary search in O(log N) time. Otherwise, it'll be O(N).
If you're going to search by IDs a lot, however, why not create a Map<Integer, Attribute> to start with - e.g. a HashMap, or a LinkedHashMap if you want to preserve ordering?
If you're only going to search for a single ID (or a few), however, this almost certainly won't be worth it - there's a cost involved in hashing, after all; filling the map will be more expensive than filling the list, and the difference is likely to be greater than the time saved looking up a few IDs.
Have you already established that this is a performance bottleneck? If so, this is an easy place to improve by using a map (or just a sorted list with a binary search). If not, I wouldn't disturb your code if it more naturally uses a list than a map - but you should certainly check whether it's a bottleneck or not.

You want to use a Map

If you wish to access elements of a collection using element attributes, and this attribute is guaranteed to be unique per element, then you really should use a Map. Try a Map with the Attribute.id as the key.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.