Since sets can only have unique values does this mean every time you add an element to a set it has to check whether it is equal to every element there and is hence O(n)?
If this is the case it would make them much slower than ArrayLists, so is the only time you should actually use a set when you need to make sure your elements are all unique, or is there some other advantage to them?
This depends on the implementation of a set.
C++
An std::set in C++ is typically implemented as a red-black tree and guarantees an insert complexity of O(log(n)) (source).
std::set is an associative container that contains a sorted set of unique objects of type Key. Sorting is done using the key comparison function Compare. Search, removal, and insertion operations have logarithmic complexity.
C++11's std::unordered_set has an insert complexity of O(1) (source).
Unordered set is an associative container that contains a set of unique objects of type Key. Search, insertion, and removal have average constant-time complexity.
Java
In Java, adding an element to a HashSet is O(1). From the documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets
Inserting an element into a TreeSet is O(log(n)).
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
All classes implementing Set can be found in the documentation.
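As a quick illustration, here is a minimal sketch contrasting the two (the strings are arbitrary sample data):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetAddDemo {
    public static void main(String[] args) {
        Set<String> hashSet = new HashSet<>(); // add/contains: expected O(1)
        Set<String> treeSet = new TreeSet<>(); // add/contains: O(log n), kept sorted

        for (String s : new String[] {"banana", "apple", "apple", "cherry"}) {
            hashSet.add(s); // the duplicate "apple" is silently ignored
            treeSet.add(s);
        }

        System.out.println(hashSet); // iteration order is not guaranteed
        System.out.println(treeSet); // [apple, banana, cherry]
    }
}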
Conclusion
Adding to a set is, in most cases, not slower than adding to an ArrayList or std::vector. However, a set does not necessarily keep items in the order in which they are inserted. Also, accessing some Nth element of a set has a worse complexity than the same operation on an ArrayList or std::vector. Each has their advantages and disadvantages and should be used accordingly.
You tagged this Java as well as C++, so I'll answer for both:
In C++, std::set is an ordered container, likely implemented as a tree. Regardless of the implementation, adding to a set and checking whether an element is in a set are guaranteed to be O(log n). For std::unordered_set, which is new in C++11, those operations are O(1) (given a proper hashing function).
In Java, java.util.Set is an interface that can be implemented by many different classes. The complexities of the operations are up to those classes. The most commonly used sets are TreeSet and HashSet. The operations on the former are O(log n), and on the latter they are O(1) (again, given a proper hashing function).
A C++ std::set is normally implemented as a red-black tree. This means that adding to it will be O(log n).
A C++ std::unordered_set is implemented as a hash table, so insertion is O(1) on average.
You forget that a set may not be a bulk list of elements; it can be arranged (and it indeed is) in such a way that searches are much faster than O(N).
http://en.wikipedia.org/wiki/Set_(abstract_data_type)#Implementations
It depends. Different languages provide different implementations. Even Java has two different sets: a TreeSet and a HashSet. Adding to a TreeSet is O(log n), since the elements are kept in order; adding to a HashSet is O(1) on average.
In C++, std::set is typically implemented as a binary search tree, so inserting takes O(log N) time. If you just need unique keys with fast insertion, you can try a hash-based container (std::unordered_set, or the older non-standard hash_map), which inserts in constant time on average.
Time complexity: Wrapper Classes vs Primitives.
When the value changes, especially multiple times, primitives give better time.
Example:
int counter = 0;
while (x > y) {
    counter++;
}
is much faster than:
Integer counter = 0;
while (x > y) {
    counter++;
}
When the value remains the same, wrapper classes give better time since only a reference to the wrapper object is passed to the algorithm. This comes in handy when defining parameters of methods that do not change their value.
Example:
public int sum (Integer one, Integer two, Integer three){
    int sum = one+two+three;
    return sum;
}
is faster than
public int sum (int one, int two, int three){
    int sum = one+two+three;
    return sum;
}
The values that are passed to the methods could be primitive, the important thing is the definition of the parameters of the method itself, that is to say:
public int sum (Integer one, Integer two, Integer three){
    int sum = one+two+three;
    return sum;
}

int a = 1; int b = 2; int c = 3;
int total = sum(a, b, c); // the primitive arguments are autoboxed to match the Integer parameters
The cumulative effect of using wrapper classes as described above could significantly improve the performance of a program.
As others have stated, sets are usually either trees, O(log n), or hash tables, O(1). However, there is one thing you can be sure about: no sane set implementation would have O(n) behaviour.
Related
In my code I maintain a data heap whose basic component is a Map. The Map would be somewhat like:
Key -TableName\Field\Attribute1
Value-3
I used to retrieve my value by:
map.get(key)
Now I require a List instead of a single value. The Map would be somewhat like:
Key -TableName\Field\Attribute1
Value-[3,30,300]
Now I need to retrieve my value by:
map.get(key).get(index)
How much would this change affect the performance of my Code?
I don't think there exists a single answer to your question; it's broad, and you can consider all the answers. Basically, you need to understand the time complexity of HashMap and ArrayList, and that will give you an idea of how your code is going to perform.
A chart of the common complexity classes (source) gives you an idea of how O(1), O(log n), O(n), ... compare.
Now, HashMap has the following time complexity (see here for more details on how HashMap works in Java):
get() and put() - usually O(1), but O(n) in the worst case
In the best case, get() and put() cost O(1). If you have an inefficient hash function, the data is not distributed evenly across buckets, so you might end up with slow get() and put() methods. An efficient hash function distributes the data across all buckets in a balanced manner; if the buckets containing entries are unbalanced, you will have slower get() and put(). See here and here for details on how to design your hash function.
Note the HashMap performance improvement in Java 8.
ArrayList, on the other hand, has the following time complexities:
add() - amortized O(1) (appending at the end)
remove() - O(n)
get() - O(1)
contains() - O(n) (traversal)
Basically, you perform an O(1) operation (O(n) in the worst case) to get the list, and then another O(1) operation to get the item from the list. So both operations are constant time, provided your hash function is efficient.
As mentioned in other answers, there are a variety of other ways to improve the performance of your code, such as using an array instead of a list where possible, and so on.
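To make the cost concrete, here is a minimal sketch of the two look-ups back to back (the key is taken from the question; the class name and data are illustrative):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapOfListsDemo {
    public static void main(String[] args) {
        Map<String, List<Integer>> data = new HashMap<>();
        data.put("TableName\\Field\\Attribute1", new ArrayList<>(Arrays.asList(3, 30, 300)));

        // expected O(1) for the hash look-up, then O(1) for the ArrayList index access
        int value = data.get("TableName\\Field\\Attribute1").get(2);
        System.out.println(value); // prints 300
    }
}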
Although it is nearly impossible to tell the impact on performance without seeing the code, the overhead of storing a list instead of a single Integer should be relatively small: in both cases you end up paying for unboxing the int from Integer and for the hash look-up on the key, so the only additional operation is the get(index) on the list.
If the list inside the map is an ArrayList, the operation is a fast O(1) lookup.
If the list inside the map is a LinkedList, the operation is O(n), where n is the number of elements on the individual list.
It's worth noting that if the list has a small fixed size (say, three or four elements, as shown in your example), you may be better off defining a custom class that holds these elements as int fields. This would save memory and reduce the overhead.
For larger lists of fixed size you may want to consider int[] arrays, because they avoid boxing and reduce memory overhead.
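For the small fixed-size case, a hypothetical holder class could look like this (the class and field names are made up for illustration):

final class AttributeValues {
    final int first;   // e.g. 3
    final int second;  // e.g. 30
    final int third;   // e.g. 300

    AttributeValues(int first, int second, int third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }
}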
The best case for a HashMap lookup is O(1); the worst case depends on hash collisions, and JDK 8 has a nice improvement to handle that scenario. So from the map's perspective the lookup cost is the same whether you associate a single value or a list of values with the key.
The cost of looking up a value in the list by index depends on the type of list you are using: if it is array-based (i.e. ArrayList) it is constant, but if it is a linked list the cost is O(N).
So it really comes down to which type of list you put in the map, depending on your performance goal.
What I need:
Fastest put/remove; this is used a lot.
Iteration, also used frequently.
Holds an object, e.g. Player. remove should be O(1), so maybe a HashMap?
No duplicate keys
direct get() is never used, mainly iterating to retrieve data.
I don't worry about memory, I just want the fastest speed possible even if it's at the cost of memory.
For iteration, nothing is faster than a plain old array. Entries are stored sequentially in memory, so the JVM can get to the next entry simply by adding the length of one entry to its address.
Arrays are typically a bit of a hassle to deal with compared to maps or lists (e.g: no dictionary-style lookups, fixed length). However, in your case I think it makes sense to go with a one or two dimensional array since the length of the array will not change and dictionary-style lookups are not needed.
So if I understand you correctly, you want a two-dimensional grid that holds information about which player, if any, is in a specific tile? To me it doesn't sound like you should be removing or adding things to the grid. I would simply use a two-dimensional array that holds type Player or something similar. Then if no player is in a tile you can set that position to null, or to some static value like Player.none() or Tile.empty(), however you'd want to implement it. Either way, a simple two-dimensional array should work fine. :)
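A minimal sketch of that idea (Player here is just a stand-in for your real class, and the 20x20 size is illustrative):

class Player { } // stand-in for the real Player class

class Board {
    private final Player[][] grid = new Player[20][20]; // null means the tile is empty

    void place(Player p, int x, int y) {
        grid[x][y] = p;
    }

    void clear(int x, int y) {
        grid[x][y] = null;
    }

    Player playerAt(int x, int y) {
        return grid[x][y]; // may be null if the tile is empty
    }
}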
The best Collection for your case is a LinkedList. Linked lists allow for fast iteration, and fast removal and addition at any place in the list. For example, if you use an ArrayList and you want to insert something at index i, then you have to move all the elements from i to the end one entry to the right. The same happens if you want to remove. In a linked list you can add and remove in constant time.
Since you need two dimensions, you can use linked lists inside of linked lists:
List<List<Tile>> players = new LinkedList<List<Tile>>(); // LinkedList has no capacity constructor
for (int i = 0; i < 20; ++i) {
    List<Tile> tiles = new LinkedList<Tile>();
    for (int j = 0; j < 20; ++j) {
        tiles.add(new Tile());
    }
    players.add(tiles);
}
Use a map of sets to guarantee O(1) vertex lookup and amortized O(1) edge insertions and deletions.
HashMap<VertexT, HashSet<EdgeT>> incidenceMap;
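A sketch of how that could be wired up with plain JDK collections (the class and method names are illustrative, not from the original post; VertexT and EdgeT stand in for your own types):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IncidenceMap<VertexT, EdgeT> {
    private final Map<VertexT, Set<EdgeT>> incidenceMap = new HashMap<>();

    void addVertex(VertexT v) {
        incidenceMap.putIfAbsent(v, new HashSet<>()); // expected O(1)
    }

    void addEdge(VertexT v, EdgeT e) {
        incidenceMap.computeIfAbsent(v, k -> new HashSet<>()).add(e); // amortized O(1)
    }

    void removeEdge(VertexT v, EdgeT e) {
        Set<EdgeT> edges = incidenceMap.get(v); // expected O(1) vertex lookup
        if (edges != null) {
            edges.remove(e); // expected O(1)
        }
    }
}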
There is no simple one-size-fits-all solution to this.
For example, if you only want to append, iterate and use Iterator.remove(), there are two obvious options: ArrayList and LinkedList
ArrayList uses less memory, but Iterator.remove() is O(N)
LinkedList uses more memory, but Iterator.remove() is O(1)
If you also want to do fast lookup (e.g. Collection.contains tests), or removal using Collection.remove, then HashSet is going to be better ... if the collections are likely to be large. A HashSet won't allow you to put an object into the collection multiple times, but that could be an advantage. It also uses more memory than either ArrayList or LinkedList.
If you were more specific on the properties required, and what you are optimizing for (speed, memory use, both?) then we could give you better advice.
The requirement of not allowing duplicates is effectively adding a requirement for efficient get().
Your options are either hash-based, or O(Log(N)). Most likely, hashcode will be faster, unless for whatever reason, calling hashCode() + equals() once is much slower than calling compareTo() Log(N) times. This could be, for instance, if you're dealing with very long strings. Log(N) is not very much, by the way: Log(1,000,000,000) ~= 30.
If you want to use a hash-based data structure, then HashSet is your friend. Make sure that Player has a good, fast implementation of hashCode(). If you know the number of entries ahead of time, specify the HashSet capacity: ceil(N / load_factor) + 1 (the default load factor is 0.75).
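For instance, presizing the set so it never rehashes during the initial fill (a sketch; the element type, class name, and count are placeholders):

import java.util.HashSet;
import java.util.Set;

public class PresizedSetDemo {
    public static void main(String[] args) {
        int n = 1_000_000; // number of entries known ahead of time
        // ceil(N / load_factor) + 1 with the default load factor of 0.75,
        // so the table never has to be rehashed while it is filled.
        Set<String> players = new HashSet<>((int) Math.ceil(n / 0.75) + 1);
        players.add("player-1");
        System.out.println(players.size());
    }
}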
If you want to use a sort-based structure, implement an efficient Player.compareTo(). Your choices are then TreeSet or a skip list; they're pretty comparable in terms of characteristics. TreeSet is nice in that it's available out of the box in the JDK, whereas only a concurrent skip list (ConcurrentSkipListSet) is available. Both need to be rebalanced as you add data, which may take time, and I don't know how to predict which will be better.
Which is the most efficient way of finding an element in terms of performance? Say I have hundreds of strings, and I need to find whether a specified string is among them. ArrayList has a contains() method, but with a plain array I need to iterate through it for the same purpose. Can anyone explain which is the best way of doing this in terms of performance?
Say I have hundreds of strings. I need to find whether a specified string is available in those bulk strings.
That sounds like you want a HashSet<String> - not a list or an array. At least, that's the case if the hundreds of strings are the same every time you want to search. If you're searching within a different set of strings every time, you're not going to do better than O(N) if you receive the set in an arbitrary order.
In general, checking for containment in a list/array is an O(N) operation, whereas in a hash-based data structure it's O(1). Of course there's also the cost of performing the hashing and equality checking, but that's a different matter.
Another option would be a sorted list, which would be O(log N).
If you care about the ordering, you might want to consider a LinkedHashSet<String>, which maintains insertion order but still has O(1) access. (It's basically a linked list combined with a hash set.)
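For example, building the set once and reusing it for every membership test (the sample data is made up):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsDemo {
    public static void main(String[] args) {
        List<String> hundredsOfStrings = Arrays.asList("alpha", "beta", "gamma"); // stand-in data

        // Build once: O(N). Each later membership test is O(1) on average.
        Set<String> lookup = new HashSet<>(hundredsOfStrings);

        System.out.println(lookup.contains("beta"));  // true
        System.out.println(lookup.contains("delta")); // false
    }
}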
An ArrayList uses an array as its backing data, so the performance will be the same for both.
Look at the implementation of ArrayList#contains, which calls indexOf():
public int indexOf(Object o) {
    if (o == null) {
        for (int i = 0; i < size; i++)
            if (elementData[i] == null)
                return i;
    } else {
        for (int i = 0; i < size; i++)
            if (o.equals(elementData[i]))
                return i;
    }
    return -1;
}
You would do the exact same thing if you implemented the contains() on your own for an array.
You don't have to worry about performance issues; it will not matter much. It's good and easy to use the contains() method of ArrayList.
I have an unsorted Collection of objects [that are comparable]. Is it possible to get a sublist of the N smallest elements of the collection without having to call sort?
I was looking at the possibility of doing a SortedList with a limited capacity, but that didn't look like the right option.
I could easily write this, but I was wondering if there was another way.
I am not able to modify the existing collection's structure.
Since you don't want to call sort(), it seems like you are trying to avoid an O(n log(n)) runtime cost. There is actually a way to do that in O(n) time -- you can use a selection algorithm.
There are methods to do this in the Guava libraries (Google's core Java libraries); look in Ordering and check out:
public <E extends T> List<E> Ordering.leastOf(Iterable<E> iterable, int k)
public <E extends T> List<E> Ordering.greatestOf(Iterable<E> iterable, int k)
These are implementations of quickselect, and since they're written generically, you could just call them on your Set and get a list of the k smallest things. If you don't want to use the entire Guava libraries, the docs link to the source code, and I think it should be straightforward to port the methods to your project.
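Usage would look roughly like this (assuming Guava is on the classpath and the elements are Comparable; the data is just illustrative):

import com.google.common.collect.Ordering;

import java.util.Arrays;
import java.util.List;

public class LeastOfDemo {
    public static void main(String[] args) {
        List<Integer> unsorted = Arrays.asList(7, 2, 9, 4, 1, 8);

        // The 3 smallest elements, selected without sorting the whole list,
        // returned in ascending order.
        List<Integer> threeSmallest = Ordering.<Integer>natural().leastOf(unsorted, 3);

        System.out.println(threeSmallest); // [1, 2, 4]
    }
}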
If you don't want to deviate too far from the standard libraries, you can always use a sorted set like TreeSet, though this gets you logarithmic insert/remove time instead of the nice O(1) performance of the hash-based Set, and it ends up being O(n log(n)) in the end. Others have mentioned using heaps. This will also get you O(n log(n)) running time, unless you use some of the fancier heap variants. There's a fibonacci heap implementation in GraphMaker if you're looking for one of those.
Which of these makes sense really depends on your project, but I think that covers most of the options.
I would probably create a sorted set. Insert the first N items from your unsorted collection into your sorted set. Then for the remainder of your unsorted collection:
insert each item in the sorted set
delete the largest item from the sorted set
Repeat until you've processed all items in the unsorted collection
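A sketch of that procedure with a TreeSet (note that a TreeSet silently drops duplicates, so this only works as-is if the elements are distinct):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class NSmallestWithTreeSet {
    static List<Integer> nSmallest(Collection<Integer> items, int n) {
        TreeSet<Integer> smallest = new TreeSet<>();
        for (int x : items) {
            if (smallest.size() < n) {
                smallest.add(x);                              // fill up to n elements first
            } else if (x < smallest.last() && smallest.add(x)) {
                smallest.pollLast();                          // drop the current largest, O(log n)
            }
        }
        return new ArrayList<>(smallest);                     // ascending order
    }

    public static void main(String[] args) {
        System.out.println(nSmallest(Arrays.asList(7, 2, 9, 4, 1, 8), 3)); // [1, 2, 4]
    }
}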
Yes, you can put all of them into a max heap data structure with a fixed size of N, adding each item only if it is smaller than the largest item currently in the heap (checked with the get() "peek" method). Once you have processed everything, the heap will, by definition, hold the N smallest items. Good implementations perform close to O(M) (where M is the size of the set) when most items fail the comparison, and O(M log N) in the worst case. Here's some pseudocode:
MaxHeap maxHeap = new MaxHeap(N);
for (Item x : mySetOfItems) {
    if (maxHeap.size() < N) {
        maxHeap.add(x);                 // fill the heap up to N items first
    } else if (x < maxHeap.get()) {     // get() peeks at the largest item
        maxHeap.removeMax();            // evict the current largest
        maxHeap.add(x);
    }
}
The Apache Commons Collections class PriorityBuffer seems to be their flagship binary heap data structure; try using that one.
http://en.wikipedia.org/wiki/Heap_%28data_structure%29
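If you would rather stay inside the JDK, java.util.PriorityQueue can play the role of the max heap (it is a min-heap by default, so reverse the comparator); a rough sketch:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class NSmallestWithHeap {
    static List<Integer> nSmallest(List<Integer> items, int n) {
        // The reversed comparator turns PriorityQueue (a min-heap) into a max heap,
        // so peek() returns the largest of the candidates kept so far.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(n, Comparator.reverseOrder());
        for (int x : items) {
            if (maxHeap.size() < n) {
                maxHeap.add(x);
            } else if (x < maxHeap.peek()) {
                maxHeap.poll();      // evict the current largest
                maxHeap.add(x);
            }
        }
        List<Integer> result = new ArrayList<>(maxHeap);
        result.sort(null);           // optional: return them in ascending order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nSmallest(Arrays.asList(7, 2, 9, 4, 1, 8), 3)); // [1, 2, 4]
    }
}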
Don't you just want to make a heap?
I have to implement a data structure that groups elements into equivalence classes.
The API:
interface Grouper<T> {
    void same(T l, T r);
    Set<EquivalenceClass<T>> equivalenceClasses();
}

interface EquivalenceClass<T> {
    Set<T> members();
}
For example the grouping behaves like this:
Grouper g;
g.same(a, b);
g.equivalenceClasses() -> [[a,b]]
g.same(b, a);
g.equivalenceClasses() -> [[a,b]]
g.same(b, c);
g.equivalenceClasses() -> [[a,b,c]]
g.same(d, e);
g.equivalenceClasses() -> [[a,b,c], [d,e]]
g.same(c, d);
g.equivalenceClasses() -> [[a,b,c,d]]
I'm looking for an implementation that works up to ~10 million entries. It should be optimized to fill it and get the equivalence classes once.
Take a look at Union-Find. The union ("same") can be done trivially in O(log N), and can be done in effectively O(1) with some optimizations. The "equivalenceClasses" operation is O(N), which is the cost of visiting everything anyway.
If you are only going to query the equivalences classes once, the best solution is to build an undirected graph over the elements. Each equivalence is an undirected edge between the two items, and the equivalence classes correspond to the connected components. The time and space complexity will both be linear if you do it right.
Alternatively, you can use a Union-Find data structure, which will give you almost-linear time complexity. It may also be considered simpler, because all the complexities are encapsulated into the data structure. The reason Union-Find is not linear comes down to supporting efficient queries while the classes are growing.
Union-find is the best data structure for your problem, as long you only care about total running time (some operations may be slow, but the total cost of all operations is guaranteed to be nearly linear). Enumerating the members of each set is not typically supported in the plain version of union-find in textbooks though. As the name suggests, union-find typically only supports union (i.e., same) and find, which returns an identifier guaranteed to be the same as the identifier returned by a call to find on an element in the same set. If you need to enumerate the members of each set, you may have to implement it yourself so you can add, for example, child pointers so that you can traverse each tree representing a set.
If you are implementing this yourself, you don't have to implement the full union-find data structure to achieve amortized O(lg n) time per operation. Essentially, in this "light" version of union-find, each set would be a singly linked list with an extra pointer inside each node that points to a set identifier node that can be used to test whether two nodes belong to the same list. When the same method is executed, you can just append the smaller list to the larger and update the set identifiers for the elements of the smaller list. The total cost is at most O(lg n) per element because an element can be a member of the smaller list involved in a same operation at most O(lg n) times.
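A compact sketch of weighted union with path compression, keeping a member list on each root so the classes can be enumerated at the end (an illustration of the idea, not a drop-in implementation of the interface above; the class and method names are chosen for the example):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UnionFind<T> {
    private final Map<T, T> parent = new HashMap<>();
    private final Map<T, List<T>> members = new HashMap<>(); // only roots keep a member list

    private T find(T x) {
        if (parent.putIfAbsent(x, x) == null) {
            List<T> singleton = new ArrayList<>();
            singleton.add(x);
            members.put(x, singleton);     // first time we see x: its own singleton class
            return x;
        }
        T root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // path compression: point every node on the path straight at the root
        T cur = x;
        while (!parent.get(cur).equals(cur)) {
            T next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    void same(T l, T r) {
        T a = find(l);
        T b = find(r);
        if (a.equals(b)) {
            return;
        }
        if (members.get(a).size() < members.get(b).size()) { // weighted union: merge the
            T tmp = a; a = b; b = tmp;                       // smaller class into the larger
        }
        parent.put(b, a);
        members.get(a).addAll(members.remove(b));
    }

    Collection<List<T>> equivalenceClasses() {
        return members.values();
    }
}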