Data structure to group elements into equivalence classes - java

I have to implement a data structure that groups elements into equivalence classes.
The API:
interface Grouper<T> {
    void same(T l, T r);
    Set<EquivalenceClass<T>> equivalenceClasses();
}

interface EquivalenceClass<T> {
    Set<T> members();
}
For example, the grouping behaves like this:
Grouper g;
g.same(a, b);
g.equivalenceClasses() -> [[a,b]]
g.same(b, a);
g.equivalenceClasses() -> [[a,b]]
g.same(b, c);
g.equivalenceClasses() -> [[a,b,c]]
g.same(d, e);
g.equivalenceClasses() -> [[a,b,c], [d,e]]
g.same(c, d);
g.equivalenceClasses() -> [[a,b,c,d]]
I'm looking for an implementation that works for up to ~10 million entries. It should be optimized for filling it once and then reading off the equivalence classes once.

Take a look at Union-Find. The union ("same") can be done trivially in O(log N), and in effectively O(1) with some optimizations. The "equivalenceClasses" query is O(N), which is the cost of visiting everything anyway.
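For illustration, here's a rough sketch of that idea in Java (all class and method names are made up, not from any library; the question's EquivalenceClass wrapper is simplified to a raw Set). Union by size plus path compression gives the near-O(1) amortized unions, and one linear pass groups elements by their root:
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: Union-Find with union by size and path compression.
class UnionFindGrouper<T> {
    private final Map<T, T> parent = new HashMap<>();
    private final Map<T, Integer> size = new HashMap<>();

    // find() with path compression: point every visited node at the root
    private T find(T x) {
        parent.putIfAbsent(x, x);
        size.putIfAbsent(x, 1);
        T root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        while (!parent.get(x).equals(root)) {
            T next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    // union by size: hang the smaller tree under the larger root
    public void same(T l, T r) {
        T a = find(l), b = find(r);
        if (a.equals(b)) return;
        if (size.get(a) < size.get(b)) { T tmp = a; a = b; b = tmp; }
        parent.put(b, a);
        size.put(a, size.get(a) + size.get(b));
    }

    // one O(N) pass: group every element by the root of its tree
    public Set<Set<T>> equivalenceClasses() {
        Map<T, Set<T>> byRoot = new HashMap<>();
        for (T x : parent.keySet()) {
            byRoot.computeIfAbsent(find(x), k -> new HashSet<>()).add(x);
        }
        return new HashSet<>(byRoot.values());
    }
}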

If you are only going to query the equivalences classes once, the best solution is to build an undirected graph over the elements. Each equivalence is an undirected edge between the two items, and the equivalence classes correspond to the connected components. The time and space complexity will both be linear if you do it right.
Alternatively, you can use a Union-Find data structure, which gives you almost-linear time complexity. It may also be considered simpler, because all the complexity is encapsulated in the data structure. The reason Union-Find is not linear comes down to supporting efficient queries while the classes are still growing.
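A hedged sketch of the graph approach (the class name is hypothetical): each same call records an undirected edge in an adjacency map, and a final iterative depth-first search collects the connected components in linear time:
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: equivalences as an undirected graph,
// equivalence classes as its connected components.
class GraphGrouper<T> {
    private final Map<T, Set<T>> adj = new HashMap<>();

    public void same(T l, T r) {
        adj.computeIfAbsent(l, k -> new HashSet<>()).add(r);
        adj.computeIfAbsent(r, k -> new HashSet<>()).add(l);
    }

    public List<Set<T>> equivalenceClasses() {
        List<Set<T>> classes = new ArrayList<>();
        Set<T> seen = new HashSet<>();
        for (T start : adj.keySet()) {
            if (!seen.add(start)) continue;     // already in some component
            Set<T> component = new HashSet<>();
            Deque<T> stack = new ArrayDeque<>(); // iterative DFS: no recursion
            stack.push(start);
            component.add(start);
            while (!stack.isEmpty()) {
                T node = stack.pop();
                for (T next : adj.get(node)) {
                    if (seen.add(next)) {
                        component.add(next);
                        stack.push(next);
                    }
                }
            }
            classes.add(component);
        }
        return classes;
    }
}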

Union-Find is the best data structure for your problem, as long as you only care about total running time (some individual operations may be slow, but the total cost of all operations is guaranteed to be nearly linear). However, enumerating the members of each set is not typically supported in the plain textbook version of union-find. As the name suggests, it only supports union (i.e., same) and find, which returns an identifier guaranteed to be the same as the one returned by find on any element of the same set. If you need to enumerate the members of each set, you may have to extend the structure yourself, for example with child pointers that let you traverse each tree representing a set.
If you are implementing this yourself, you don't have to build the full union-find data structure to achieve amortized O(lg n) time per operation. In this "light" version of union-find, each set is a singly linked list with an extra pointer in each node that points to a set-identifier node, which can be used to test whether two nodes belong to the same list. When the same method is executed, you just append the smaller list to the larger one and update the set identifiers for the elements of the smaller list. The total cost is at most O(lg n) per element, because an element can be on the smaller side of a same operation at most O(lg n) times.
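A minimal sketch of that "light" scheme, with a HashSet standing in for the linked list (all names made up): the set an element maps to is its identifier, and same merges the smaller class into the larger one, so total work stays O(n lg n):
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the smaller-into-larger merging scheme.
class MergingGrouper<T> {
    // each element points at the set that currently represents its class
    private final Map<T, Set<T>> classOf = new HashMap<>();

    public void same(T l, T r) {
        Set<T> a = classOf.computeIfAbsent(l, k -> newClass(k));
        Set<T> b = classOf.computeIfAbsent(r, k -> newClass(k));
        if (a == b) return;                        // already equivalent
        if (a.size() < b.size()) { Set<T> t = a; a = b; b = t; }
        a.addAll(b);                               // append the smaller class
        for (T x : b) classOf.put(x, a);           // repoint its members
    }

    private Set<T> newClass(T x) {
        Set<T> s = new HashSet<>();
        s.add(x);
        return s;
    }

    public Collection<Set<T>> equivalenceClasses() {
        // identity-based set: dedupes shared references without hashing whole classes
        Set<Set<T>> unique = Collections.newSetFromMap(new IdentityHashMap<>());
        unique.addAll(classOf.values());
        return unique;
    }
}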

Related

List vs. Map: Which takes less space and is more efficient?

I have two classes Foo and Bar.
class Foo {
    Set<Integer> bars;       // Foo objects have a collection of Bars
    Set<Integer> adjacents;  // adjacency list of Foos
}

class Bar {
    int foo;                        // ID of the Foo this object belongs to
    Ipsum ipsum;                    // an arbitrary class, but it must be present
    Map<Integer, Float> adjacents;  // adjacency list of Bars
}
The number of Bars is predefined (up to 1000), so I could use an array.
But the number of Foos is undefined (at most #ofBars/4).
When you consider addition, deletion and get(), I need the one which is faster and takes less space (because I'm going to use serialization).
Here are my options (as far as I have thought)
Option 1: Don't define a class for Foo. Instead, use List<Set<Integer>> foo; and another map, Map<Integer, Set<Integer>> fooAdjacencies;
Option 2: Use Map<Integer, Set<Integer>> foo; if I want to get the bars of i, I simply write foo.get(i).
Option 3: Don't define classes. Instead, use option 2 for Foo and, for Bar:
Map<Integer, Ipsum> bar;
Map<Integer, Map<Integer, Float>> barAdjacencies;
Which option should I choose in terms of space and time efficiency?
This sounds like it'd be very helpful for you (specifically the Data Structures section): http://bigocheatsheet.com/
You say
I need my structure to be efficient while adding, removing and finding elements. No other behavior.
The problem is that Lists and Maps are usually used in totally different cases. Their names describe their use cases fairly well -- you use a List if you need to list something (probably in some sequential order), while a Map would be used if you need to map an input to an output. You can use a Map as a List by mapping Integers to your elements, but that's overcomplicating things a bit. However, even within List and Map you can have different implementations that differ wildly in asymptotic performance.
With few exceptions, data structures will take O(n) space, which makes sense. If memory serves, anything other than an ArrayList (or other collections backed only by a primitive array) will have a decent amount of space overhead as they use other objects (e.g. Nodes for LinkedLists and Entry objects for Maps) to organize the underlying structure. I wouldn't worry too much about this overhead though unless space really is at a premium.
For best-performance addition, deletion, and search, you want to look at how the data structure is implemented.
A LinkedList-style implementation will net you O(1) addition and deletion (with a good constant factor, too!), but an expensive O(n) get(), because the list has to be traversed every time you want to fetch something. Note that Java's LinkedList also removes in O(n) time: while the act of unlinking a node is O(1), that only applies if you already hold a reference to the node being removed. Because you don't, a removal costs O(n) for searching for the node plus O(1) for unlinking it.
Data structures backed with a plain array will have O(1) get() because it's an array, but takes O(n) to add, and delete, because any addition/deletion other than at the last element requires all other elements to be shuffled (in Java's implementation at least). Searching for something using an object instead of an index is done in O(n) time because you have to iterate over the array to find the object.
The following two structures are usually Maps, and so usually require you to implement equals() (and hashCode() for HashMaps):
Data structures backed by a tree (e.g. TreeMap) have guaranteed O(lg n) add/remove, as a good implementation is self-balancing, so even worst-case additions and deletions only have to walk the height of the tree. get() operations are O(lg n) as well. Using a tree requires that your elements be sortable/comparable in some way, which can be a bonus or a hindrance, depending on your usage.
Hash-based data structures have amortized (average) O(1) everything, albeit with a slightly higher constant factor due to the overhead of hashing (and following any chains if the hash spread is poor). HashMaps could start sucking if you write a bad hashCode() function, though, so you want to be careful with that, although the implementers of Java's HashMap did do some magic behind the scenes to try to at least partially negate the effect of bad hashCode() implementations.
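As a quick illustration of that requirement (the class here is made up for the example), a key type with a consistent equals()/hashCode() pair might look like:
import java.util.Objects;

// Illustrative only: a key class with consistent equals() and hashCode(),
// as HashMap and HashSet require for correct behaviour.
final class BarKey {
    private final int foo;
    private final int id;

    BarKey(int foo, int id) { this.foo = foo; this.id = id; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BarKey)) return false;
        BarKey other = (BarKey) o;
        return foo == other.foo && id == other.id;
    }

    @Override public int hashCode() {
        return Objects.hash(foo, id);  // spreads values across buckets
    }
}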
Hope that rundown helped. If you clear up how your program is structured, I might be able to give a recommendation. Until then, the best I can do is show you the options and let you pick.
I find this problem description a little hard to follow, but I think you're just looking for general collections/data structures advice.
A list (say, an array list) easily allows you to add and iterate over elements. When it grows beyond the size of the underlying array, a one-off costly resize operation is executed to add more space; that is fine because it happens rarely and the amortized cost is low. Searching for a specific element in a list is slow because you need to traverse it in order; there is no implied ordering in most lists. Deleting elements depends on the underlying implementation: an array list is slow in this regard, because removing from anywhere but the end requires shifting all subsequent elements over. When using lists you also have to consider where you are adding elements. Linked lists are slower to iterate but can cheaply add and remove elements at any position; array lists cannot cheaply add an element anywhere but the end.
Per your requirements, if you need to execute a "get" or find on an element, then you need some kind of search functionality to speed it up. This makes a map better, as you can locate elements in O(log n) (or even O(1)) time instead of the linear scan needed for an unordered list. Adding and removing elements in a map is also relatively fast, so that's probably your best option.
Most importantly, implement it more than one way and profile it yourself to learn more :) Lists are rarely a good choice when searching is required though.

Is adding to a set O(n)?

Since sets can only contain unique values, does this mean that every time you add an element to a set, it has to be compared against every existing element, making the add O(n)?
If so, that would make sets much slower than ArrayLists. Is the only reason to use them ensuring that your elements are all unique, or do they have other advantages?
This depends on the implementation of a set.
C++
An std::set in C++ is typically implemented as a red-black tree and guarantees an insert complexity of O(log(n)) (source).
std::set is an associative container that contains a sorted set of unique objects of type Key. Sorting is done using the key comparison function Compare. Search, removal, and insertion operations have logarithmic complexity.
C++11's std::unordered_set has an average insert complexity of O(1) (source).
Unordered set is an associative container that contains set of unique objects of type Key. Search, insertion, and removal have average constant-time complexity.
Java
In Java, adding an element to a HashSet is O(1). From the documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets
Inserting an element into a TreeSet is O(log(n)).
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
All classes implementing Set can be found in the documentation.
Conclusion
Adding to a set is, in most cases, no slower than adding to an ArrayList or std::vector. However, a set does not necessarily keep items in the order in which they were inserted, and accessing the Nth element of a set has worse complexity than the same operation on an ArrayList or std::vector. Each has its advantages and disadvantages and should be used accordingly.
You tagged this Java as well as C++, so I'll answer for both:
In C++, std::set is an ordered container, likely implemented as a tree. Regardless of the implementation, adding to a set and checking whether an element is in a set are guaranteed to be O(log n). For std::unordered_set, which is new in C++11, those operations are O(1) (given a proper hashing function).
In Java, java.util.Set is an interface, which many different classes implement; the complexities of the operations are up to those classes. The most commonly used sets are TreeSet and HashSet. Operations on the former are O(log n); on the latter they're O(1) (again, given a proper hashing function).
A C++ std::set is normally implemented as a red-black tree. This means that adding to it will be O(log n).
A C++ std::unordered_set is implemented as a hash table, so insertion is O(1).
You forget that a set may not be a bulk list of elements; it can be arranged (and it indeed is) in such a way that searches are much faster than O(N).
http://en.wikipedia.org/wiki/Set_(abstract_data_type)#Implementations
It depends. Different languages provide different implementations. Even Java has two different sets: a TreeSet and a HashSet. Adding to a TreeSet is O(log n), since the elements are kept in order.
In C++, sets are typically implemented as binary search trees, so inserting has O(log N) time complexity. If you only need uniqueness and not ordering, you can use std::unordered_set instead, which has constant average insertion time.
Time complexity: Wrapper Classes vs Primitives.
When the value changes, especially multiple times, primitives give better performance.
Example:
int counter = 0;
while (x > y) {
    counter++;
}
is much faster than:
Integer counter = 0;
while (x > y) {
    counter++;
}
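The reason is autoboxing: since Integer is immutable, each counter++ on the boxed version unboxes, increments, and re-boxes, roughly equivalent to:
counter = Integer.valueOf(counter.intValue() + 1);  // allocates a fresh Integer for most values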
When the value remains the same, the gap narrows: only a reference to the wrapper object is passed to the method, and no new objects are created, so wrappers are tolerable for parameters whose values the method does not change.
Example:
public int sum(Integer one, Integer two, Integer three) {
    int sum = one + two + three;  // each argument is unboxed once
    return sum;
}
The values passed to the method can still be primitives; what matters is the declaration of the method's parameters, because autoboxing wraps the arguments at the call site:
int a = 1;
int b = 2;
int c = 3;
int total = sum(a, b, c);  // a, b and c are boxed to Integer automatically
All else being equal, though, the primitive version
public int sum(int one, int two, int three) {
    int sum = one + two + three;
    return sum;
}
avoids boxing and unboxing entirely, so it is never slower. The cumulative effect of unnecessary boxing, as in the counter example above, can significantly hurt the performance of a program.
As others have stated, sets are usually either trees, O(log n), or hash tables, O(1). One thing you can be sure about, though: no sane set implementation will have O(n) insertion behaviour.

Get the N smallest [Comparable] items in a set

I have an unsorted Collection of objects [that are comparable]. Is it possible to get a sublist of the N smallest items without having to call sort?
I was looking at the possibility of doing a SortedList with a limited capacity, but that didn't look like the right option.
I could easily write this, but I was wondering if there was another way.
I am not able to modify the existing collection's structure.
Since you don't want to call sort(), it seems like you are trying to avoid an O(n log(n)) runtime cost. There is actually a way to do that in O(n) time -- you can use a selection algorithm.
There are methods to do this in the Guava libraries (Google's core Java libraries); look in Ordering and check out:
public <E extends T> List<E> leastOf(Iterable<E> iterable, int k)
public <E extends T> List<E> greatestOf(Iterable<E> iterable, int k)
These are implementations of quickselect, and since they're written generically, you could just call them on your Set and get a list of the k smallest things. If you don't want to use the entire Guava libraries, the docs link to the source code, and I think it should be straightforward to port the methods to your project.
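For example, a hedged usage sketch (the class and variable names are made up; elements must be Comparable):
import com.google.common.collect.Ordering;
import java.util.List;
import java.util.Set;

class LeastOfDemo {
    // the k smallest elements of an unsorted set, in ascending order
    static List<Integer> kSmallest(Set<Integer> items, int k) {
        return Ordering.<Integer>natural().leastOf(items, k);
    }
}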
If you don't want to deviate too far from the standard libraries, you can always use a sorted set like TreeSet, though this gets you logarithmic insert/remove time instead of the nice O(1) performance of a hash-based Set, and it ends up being O(n log(n)) overall. Others have mentioned using heaps; this will also get you O(n log(n)) running time, unless you use some of the fancier heap variants. There's a Fibonacci heap implementation in GraphMaker if you're looking for one of those.
Which of these makes sense really depends on your project, but I think that covers most of the options.
I would probably create a sorted set. Insert the first N items from your unsorted collection into the sorted set. Then, for each remaining item in the unsorted collection:
insert the item into the sorted set
delete the largest item from the sorted set
Repeat until you've processed all items in the unsorted collection (see the sketch below).
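A sketch of that approach with a TreeSet (names hypothetical); note that a TreeSet drops duplicates, so this assumes the elements are distinct:
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: keep a TreeSet bounded at n while scanning once.
class BoundedSortedSet {
    static <T extends Comparable<T>> List<T> nSmallest(Collection<T> items, int n) {
        TreeSet<T> smallest = new TreeSet<>();
        for (T item : items) {
            smallest.add(item);
            if (smallest.size() > n) {
                smallest.pollLast();  // evict the current largest
            }
        }
        return new ArrayList<>(smallest);  // ascending order
    }
}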
Yes: you can put them into a max-heap data structure with a fixed size of N, adding an item only when the heap isn't full yet or when the item is smaller than the heap's largest element (checked with the get() "peek" method). Once you have processed everything, the heap holds, by definition, the N smallest items. This runs in O(M log N) time (where M is the size of the set), which is close to the theoretical best. Here's some pseudocode:
MaxHeap maxHeap = new MaxHeap(N);
for (Item x : mySetOfItems) {
    if (maxHeap.size() < N) {
        maxHeap.add(x);               // heap not yet full
    } else if (x < maxHeap.get()) {   // x beats the current largest
        maxHeap.removeMax();          // evict the largest
        maxHeap.add(x);
    }
}
The Apache Commons Collections class PriorityBuffer seems to be their flagship binary heap data structure; try using that one.
http://en.wikipedia.org/wiki/Heap_%28data_structure%29
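If you'd rather stay in the standard library, here's a hedged sketch (names made up) using java.util.PriorityQueue with a reversed comparator as the max-heap:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: PriorityQueue + reverseOrder() acts as a max-heap.
class HeapSelect {
    static <T extends Comparable<T>> List<T> nSmallest(Collection<T> items, int n) {
        PriorityQueue<T> maxHeap = new PriorityQueue<>(n, Comparator.reverseOrder());
        for (T x : items) {
            if (maxHeap.size() < n) {
                maxHeap.offer(x);                       // heap not yet full
            } else if (x.compareTo(maxHeap.peek()) < 0) {
                maxHeap.poll();                         // evict the largest of the current n
                maxHeap.offer(x);
            }
        }
        return new ArrayList<>(maxHeap);  // note: not in sorted order
    }
}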
Don't you just want to make a heap?

Efficient EnumSet + List

Does someone know a nice solution for an EnumSet + List?
I mean, I need to store enum values while preserving insertion order, and I need to be able to access the index of an enum value in the collection in O(1) time.
The closest thing I can think of in the standard API is the LinkedHashSet:
From http://java.sun.com/j2se/1.4.2/docs/api/java/util/LinkedHashSet.html:
Hash table and linked list implementation of the Set interface, with predictable iteration order.
I doubt it's possible to do what you want. Basically, you want to look up indexes in constant time, even after modifying the order of the list. Unless you allow remove / reorder operations to take O(n) time, I believe you can't get away with lower than O(log n) (which can be achieved by a heap structure).
The only way I can see to satisfy ordering and O(1) access is to duplicate the data in a List and an array of indexes (wrapped in a nice little OrderedEnumSet, of course).
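A sketch of that duplicated-data idea (OrderedEnumSet is just the hypothetical name from above): an EnumMap holds each value's index and a list preserves insertion order. It is append-only, per the point above that removal or reordering would break the O(1) lookup:
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;

// Hypothetical sketch: insertion order plus O(1) index lookup, append-only.
class OrderedEnumSet<E extends Enum<E>> {
    private final EnumMap<E, Integer> index;          // value -> position
    private final List<E> order = new ArrayList<>();  // insertion order

    OrderedEnumSet(Class<E> type) {
        this.index = new EnumMap<>(type);
    }

    boolean add(E value) {
        if (index.containsKey(value)) return false;  // set semantics: no duplicates
        index.put(value, order.size());
        order.add(value);
        return true;
    }

    int indexOf(E value) {
        Integer i = index.get(value);  // O(1): EnumMap is an array lookup
        return i == null ? -1 : i;
    }

    List<E> inOrder() {
        return order;
    }
}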

Are LinkedLists an unintuitive solution since most of the time I don't need to know the physical location of an element in a Collection?

Recently a coworker showed me some code he had written with a LinkedList and I couldn't get my head around it.
a -> b -> c -> d -> e -> f
If I want to get d from the LinkedList, don't I have to traverse the list starting with a and iterating up to d or starting with f and iterating back to d?
Why would I care WHERE d is stored physically in the Collection?
Not every linked list is linked in both directions, but normally, yes. This type of collection features sequential access in the forward direction, or in both the forward and reverse directions.
The advantages are:
least amount of memory overhead except for a flat array
very fast insert and delete
memory can be allocated and released one element at a time
easy to implement (not so important with modern languages but it was important in C89 and C99)
LIFO or FIFO ordering is possible
I think the right question is not WHERE but HOW your collection stores its elements. Depending on that, the costs of adding, searching, deleting, and keeping the collection consistent differ. So when you choose a collection type, keep in mind which operation will be the most frequent and pick the best fit for your case.
Linked lists typically have better performance characteristics than arrays for adding and removing elements.
And yes, if you're operating on sorted data, you do normally care what order elements are stored in.
You probably don't care, regardless of whether you're using a LinkedList or an ArrayList. LinkedLists offer the advantage of cheaply adding elements to the beginning of the list, which is an O(n) operation on an ArrayList.
Lists are not about "physical locations" (whatever you mean by that), lists are a certain data structure that can grow and shrink and provide decent complexity across the various operations.
You don't have to explicitly traverse the linked list, as LinkedList offers indexOf(Object) and get(int). These will still traverse the list, but will do it implicitly.
You'll care about how a collection orders items because this affects the efficiency of operations on the collection, particularly insert, fetch, and removal. Any ordering of the items in a collection also affects the timing of algorithms that use the data structure.
