Deduplication using a Java Set

I have a collection of objects, let's call them A, B, C, D,... and some are equal to others. If A and C are equal, then I want to replace every reference to C with a reference to A. This means (a) object C can be garbage collected, freeing up memory, and (b) I can later use "==" to compare objects in place of an expensive equals() operation. (These objects are large and the equals() operation is slow.)
My instinct was to use a java.util.Set. When I encounter C I can easily see if there is an entry in the Set equal to C. But if there is, there seems to be no easy way to find out what that entry is, and replace my reference to the existing entry. Am I mistaken? Iterating over all the entries to find the one that matches is obviously a non-starter.
Currently, instead of a Set, I'm using a Map in which the value is always the same as the key. Calling map.get(C) then finds A. This works, but it feels incredibly convoluted. Is there a more elegant way of doing it?

This problem is not simple de-duplication: it is a form of canonicalization.
The standard approach is to use a Map rather than a Set. Here's a sketch of how to do it:
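import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public <T> List<T> canonicalizeList(List<T> input) {
    // Maps each element to the first equal instance seen so far.
    HashMap<T, T> map = new HashMap<>();
    List<T> output = new ArrayList<>();
    for (T element : input) {
        T canonical = map.get(element);
        if (canonical == null) {
            // First occurrence: this element becomes the canonical instance.
            canonical = element;
            map.put(canonical, canonical);
        }
        output.add(canonical);
    }
    return output;
}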
Note that this is O(N) on average, assuming hashCode() distributes the elements well. If you can safely assume that the percentage of duplicates in input is likely to be small, then you could presize map and output based on the size of input to avoid intermediate resizing.
Now you seem to be saying that you are doing it this way already (last paragraph), and you are asking if there is a better way. As far as I know, there isn't one. (The HashSet API would let you test whether a set contains a value equal to element, but it does not let you find out which object that is in O(1).)
For what it is worth, under the hood the HashSet<T> class is implemented on top of a HashMap (a HashMap<T, Object> in which every key maps to the same dummy value). So you would not be saving time or space by using a HashSet directly ...
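The map-based approach can also be wrapped in a small reusable helper. The following is only a sketch: the class name Canonicalizer and its method are made up for illustration, and null values are not handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: maps each value to the first equal instance seen. */
final class Canonicalizer<T> {
    private final Map<T, T> canonical = new HashMap<>();

    /** Returns the stored instance equal to value, storing value itself if none exists yet. */
    T canonicalize(T value) {
        // putIfAbsent returns the existing mapping, or null if value was just inserted.
        T existing = canonical.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    public static void main(String[] args) {
        Canonicalizer<String> strings = new Canonicalizer<>();
        String a = strings.canonicalize(new String("large object"));
        String c = strings.canonicalize(new String("large object"));
        System.out.println(a == c); // true: both references point at the first instance
    }
}
```

After canonicalization, == is a valid substitute for equals() among the canonical instances, which is the property the question asks for.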

Related

Extend HashMap to get all objects with the specified hashcode?

From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code. I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
The reasoning is that I'm trying to optimize a part of a code that, currently, is just a while loop that finds the first object with that hashcode and stores/removes it. This would be a lot faster if I could just return the full list in one go.
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
    Occurences++;
    Possibles.add(WorkingMap.get(toSearch));
    WorkingMap.remove(toSearch);
}
The keys are Chunk objects and the values are Strings. Here are the hashcode() and equals() functions for the Chunk class:
/**
 * Returns a string representation of the ArrayList of words
 * thereby storing chunks with the same words but with different
 * locations and next words in the same hash bucket, triggering the
 * use of equals() when searching and adding
 */
public int hashCode() {
    return (Words.toString()).hashCode();
}

/**
 * result depends on the value of location. A location of -1 is obviously
 * not valid and therefore indicates that we are searching for a match rather
 * than adding to the map. This allows multiples of keys with matching hashcodes
 * to be considered unequal when adding to the hashmap but equal when searching
 * it, which is integral to the MakeMap() and GetOptions() methods of the
 * RandomTextGenerator class.
 */
@Override
public boolean equals(Object obj) {
    Chunk tempChunk = (Chunk) obj;
    if (LocationInText == -1 && Words.size() == tempChunk.GetText().size()) {
        for (int i = 0; i < Words.size(); i++) {
            if (!Words.get(i).equals(tempChunk.GetText().get(i))) {
                return false;
            }
        }
        return true;
    }
    else {
        if (tempChunk.GetLocation() == LocationInText) {
            return true;
        }
        return false;
    }
}
Thanks!

HashMap does not expose any way to do this, but I think you're misunderstanding how HashMap works in the first place.
The first thing you need to know is that if every single object had exactly the same hash code, HashMap would still work. It would never "mix up" keys. If you call get(key), it will only return the value associated with key.
The reason this works is that HashMap only uses hashCode as a first grouping, but then it checks the object you passed to get against the keys stored in the map using the .equals method.
There is no way, from the outside, to tell that HashMap uses linked lists. (In fact, in more recent versions of Java, it doesn't always use linked lists.) The implementation doesn't provide any way to look at hash codes, to find out how hash codes are grouped, or anything along those lines.
while (WorkingMap.containsKey(toSearch)) {
    Occurences++;
    Possibles.add(WorkingMap.get(toSearch));
    WorkingMap.remove(toSearch);
}
This code does not "find the first object with that hashcode and store/remove it." It finds the one and only object equal to toSearch according to .equals, stores and removes it. (There can only be one such object in a Map.)
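To make this concrete, here is a small sketch in which every key has the same hash code. The Key class is hypothetical and exists only for this demonstration; the map still keeps one entry per distinct key and never mixes them up.

```java
import java.util.HashMap;
import java.util.Map;

final class CollidingKeyDemo {
    /** Hypothetical key whose hash code is deliberately identical for every instance. */
    static final class Key {
        final String name;

        Key(String name) { this.name = name; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override public int hashCode() { return 0; } // legal, but every key collides
    }

    static Map<Key, String> build() {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second");
        map.put(new Key("a"), "third"); // replaces the value stored under the equal key
        return map;
    }

    public static void main(String[] args) {
        Map<Key, String> map = build();
        System.out.println(map.size());                    // 2
        System.out.println(map.remove(new Key("a")));      // third
        System.out.println(map.containsKey(new Key("a"))); // false: there was only one such entry
    }
}
```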

Your while loop does not really loop. If WorkingMap is a plain Java HashMap, it runs at most once: get(key) returns the single value currently stored under that key, and after the first pass the key has been removed, so containsKey returns false.
Several things in the question are still unclear to me, but if this behaviour is what you need and the rest of your code relies on it, the following may help.
What type is Possibles? An ArrayList?
// this should do the same as your while loop
if (WorkingMap.containsKey(toSearch)) {
    Possibles.add(WorkingMap.get(toSearch));
    WorkingMap.remove(toSearch);
}
// Further: extend your Possibles class so that it holds the LinkedList you want.
public class possibilities {
    // List<LinkedList<String>> container = new ArrayList<LinkedList<String>>();
    public Map<Chunk, LinkedList<String>> container2 = new HashMap<Chunk, LinkedList<String>>();
    public void put(Chunk key, String value) {
        if (!this.container2.containsKey(key)) {
            this.container2.put(key, new LinkedList<String>());
        }
        this.container2.get(key).add(value);
    }
}
// this works with the updated Possibles
if (WorkingMap.containsKey(toSearch)) {
    Possibles.put(toSearch, WorkingMap.get(toSearch));
    WorkingMap.remove(toSearch);
}
//---
However, while it can work like that, keys should not be complex objects.
Note that the LinkedLists take memory. How big are the chunks? Check your memory usage.
Possibles.container2.keySet();
Good luck
Sail

From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code.
Yes, but it's more complicated than that. It often needs to put objects in linked lists even when they have differing hash codes, since it only uses some bits of the hash codes to choose which bucket to store objects in; the number of bits it uses depends on the current size of the internal hash table, which approximately depends on the number of things in the map. And when a bucket needs to contain multiple objects it will also try to use binary trees like a TreeMap if possible (if objects are mutually Comparable), rather than linked lists.
Anyway.....
I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
No.
A HashMap compares keys for equality according to the equals method. Equality according to the equals method is the only valid way to set, replace, or retrieve values associated with a particular key.
Yes, it also uses hashCode as a way to arrange objects in a structure that allows for far faster location of potentially equal objects. Still, the contract for matching keys is defined in terms of equals, not hashCode.
Note that it is perfectly legal for every hashCode method to be implemented as return 0; and the map will still work just as correctly (but very slowly). So any idea that involves getting a list of objects sharing a hash code is either impossible or pointless or both.
I'm not 100% sure what you're doing in your equals method with the LocationInText variable, but it looks dangerous, as it violates the contract of the equals method. It is required that the equals method be symmetric, transitive, and consistent:
Symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
Transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
Consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
And the hashCode method is required to always agree with equals about equal objects:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
The LocationInText variable is playing havoc with those rules, and may well break things. If not today, then some day. Get rid of it!
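The symmetry violation is easy to reproduce. The sketch below uses a simplified stand-in for the question's Chunk (the field names and the condensed equals logic are assumptions made for illustration), keeping the idea that a location of -1 marks a "search" chunk.

```java
import java.util.Arrays;
import java.util.List;

final class AsymmetricEqualsDemo {
    /** Simplified stand-in for the question's Chunk; names are illustrative. */
    static final class Chunk {
        final List<String> words;
        final int location; // -1 marks a "search" chunk, as in the question

        Chunk(List<String> words, int location) {
            this.words = words;
            this.location = location;
        }

        @Override public boolean equals(Object obj) {
            if (!(obj instanceof Chunk)) return false;
            Chunk other = (Chunk) obj;
            if (location == -1) return words.equals(other.words); // search mode: compare words
            return other.location == location;                    // storage mode: compare locations
        }

        @Override public int hashCode() { return words.hashCode(); }
    }

    public static void main(String[] args) {
        Chunk search = new Chunk(Arrays.asList("the", "cat"), -1);
        Chunk stored = new Chunk(Arrays.asList("the", "cat"), 5);
        System.out.println(search.equals(stored)); // true
        System.out.println(stored.equals(search)); // false: symmetry is violated
    }
}
```

Whether a lookup succeeds then depends on which of the two objects the map happens to call equals on, which is an implementation detail that HashMap does not promise to keep stable.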
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
    Occurences++;
    Possibles.add(WorkingMap.get(toSearch));
    WorkingMap.remove(toSearch);
}
Something that jumps out at me is that you only need to do the key lookup once, instead of doing it three times, since Map.remove returns the removed value or null if the key is not present:
for (;;) {
    String s = WorkingMap.remove(toSearch);
    if (s == null) break;
    Occurences++;
    Possibles.add(s);
}
Either way, the loop is still faulty, since it is supposed to be impossible for a map to contain more than one key equal to toSearch. I can't overstate that the LocationInText variable as you're using it is not a good idea.
I agree with the other commenters that it looks like you're looking for a map-of-list structure. Some Java libraries like Guava offer a Multimap for this, but you can do it manually pretty easily. I think the declaration you want is:
Map<Chunk,List<String>> map = new HashMap<>();
To add a new chunk-string pair to the map, do:
void add(Chunk chunk, String string) {
    map.computeIfAbsent(chunk, k -> new ArrayList<>()).add(string);
}
That method puts a new ArrayList in the map if the chunk is new, or fetches the existing ArrayList if there is one for that chunk. Then it adds the string to the list that it fetched or created.
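Retrieving the list of all strings for a particular chunk value is as simple as map.get(chunkToSearch). Since get returns null when the chunk is absent, use getOrDefault when adding the result to your Possibles list: Possibles.addAll(map.getOrDefault(chunkToSearch, Collections.emptyList()));.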
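Put together, the map-of-lists idea might look like the sketch below. The class name NextWordIndex is hypothetical, and String keys stand in for Chunk so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical map-of-lists wrapper; String keys stand in for Chunk. */
final class NextWordIndex {
    private final Map<String, List<String>> map = new HashMap<>();

    /** Appends value to the list stored under key, creating the list on first use. */
    void add(String key, String value) {
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values stored under key, or an empty list if the key is absent. */
    List<String> get(String key) {
        return map.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        NextWordIndex index = new NextWordIndex();
        index.add("the cat", "sat");
        index.add("the cat", "ran");
        System.out.println(index.get("the cat")); // [sat, ran]
        System.out.println(index.get("a dog"));   // []
    }
}
```

With this structure a single lookup returns every candidate at once, so the counting loop from the question is no longer needed: the number of occurrences is the size of the returned list.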
Other potential optimizations I'd point out:
In your Chunk.hashCode method, consider caching the hash code instead of recomputing it every time the method is called. If Chunk is mutable (which is not a good idea for a map key, but vaguely allowed so long as you're careful) then recompute the hash code only after the Chunk's value has changed. Also, if Words is a List, which it seems to be, it would likely be faster to use its hash code than convert it to a string and use the string's hash code, but I'm not sure.
In your Chunk.equals method, you can return true immediately if the instances are the same (which they often will be). Also, if GetText returns a copy of the data, then don't call it; you can access the private Words list of the other Chunk since you are in the same class, and finally, you can just defer to the List.equals method:
@Override
public boolean equals(Object o) {
    return (this == o) || (o instanceof Chunk && this.Words.equals(((Chunk)o).Words));
}
Simple! Fast!
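Both suggestions combine naturally if the chunk is made immutable. This is a sketch under that assumption; the class name ImmutableChunk and its fields are illustrative and not taken from the question's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of an immutable chunk; class and field names are illustrative. */
final class ImmutableChunk {
    private final List<String> words;
    private final int hash; // computed once; safe because words never changes

    ImmutableChunk(List<String> words) {
        // Defensive copy so later changes to the caller's list cannot affect this key.
        this.words = Collections.unmodifiableList(new ArrayList<>(words));
        this.hash = this.words.hashCode();
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        return this == o
            || (o instanceof ImmutableChunk && words.equals(((ImmutableChunk) o).words));
    }

    public static void main(String[] args) {
        ImmutableChunk a = new ImmutableChunk(Arrays.asList("the", "cat"));
        ImmutableChunk b = new ImmutableChunk(Arrays.asList("the", "cat"));
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```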

Can I retrieve a key stored in a Hashset in java in some way?

I am trying to perform bidirectional A* search, and I ran into this issue.
Suppose I have an object A in a hashset H. Suppose there is another object B, such that A.equals(B) is true (and their hash values are also the same), though A and B point to different objects. Now, when I check if object B is in hashset H, it returns true, as expected. However, suppose now, I want to access some attribute in object A based on this, I then need to access object A. Note that accessing the same attribute in B will not work since they are equal only under the equals method, but do not point to the same object. How can I achieve this?
One way is to use a Hashmap, such that the value type is the same as the key type, and every time I store some key in the hashmap, I store along with it the same object as its value. But this incurs extra memory overhead of storing the value, when what I really need is a copy of the key itself. Is there any other way to achieve this?

I don't believe there's any way of doing this efficiently using HashSet<E>. (You could list all the keys and check them for equality, of course.)
However, although the HashMap approach would be irritating, it wouldn't actually take any more memory. HashSet is implemented using HashMap (at least in the stock JDK 7) so there's a full map entry for each set entry anyway... and no extra memory is taken to store the value, because they'd both just be references to the same object.

One way is to use a Hashmap, such that the value type is the same as the key type, and every time I store some key in the hashmap, I store along with it the same object as its value. But this incurs extra memory overhead of storing the value, when what I really need is a copy of the key itself.
In actual fact, the implementation of HashSet in (at least) the Oracle Java SE Library in Java 7 has a HashMap inside it. So your concern about the extra memory usage of HashMap is unwarranted.
Here is a link to the source code: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/HashSet.java/
Incidentally, the internal map is declared as HashMap<E, Object> rather than HashMap<E, E>. Thought exercise for the reader: why did they do that?
The only other reasonable alternative (using HashSet) would be to iterate over the set elements, testing each one to see if it is equal to the one you are looking for. That is obviously very (time) inefficient.
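For the search scenario in the question, the map-based lookup might look like the following sketch. The Node class, its fields, and the VisitedNodes wrapper are assumptions made for illustration; the point is only that get returns the stored instance, whose attributes can then be read.

```java
import java.util.HashMap;
import java.util.Map;

final class VisitedNodes {
    /** Hypothetical search node: equality covers the position only, not the cost. */
    static final class Node {
        final int x;
        final int y;
        final double cost;

        Node(int x, int y, double cost) { this.x = x; this.y = y; this.cost = cost; }

        @Override public boolean equals(Object o) {
            return o instanceof Node && ((Node) o).x == x && ((Node) o).y == y;
        }

        @Override public int hashCode() { return 31 * x + y; }
    }

    private final Map<Node, Node> visited = new HashMap<>();

    /** Stores node unless an equal node is already present (the first one is kept). */
    void add(Node node) { visited.putIfAbsent(node, node); }

    /** Returns the stored node equal to probe, or null if there is none. */
    Node find(Node probe) { return visited.get(probe); }

    public static void main(String[] args) {
        VisitedNodes forward = new VisitedNodes();
        forward.add(new Node(2, 3, 7.5));
        Node hit = forward.find(new Node(2, 3, 0.0));
        System.out.println(hit.cost); // 7.5, read from the stored node rather than the probe
    }
}
```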

You could iterate over all values in the HashSet and check the equals method of B. If the iterator returns an object which is equal to B you have found A.

Performance of Arrays.asList().contains() comparing to several if equals instructions

Which of these instructions is better in terms of performance and memory usage:
if(val.equals(CONSTANT1) || val.equals(CONSTANT2) ..... || val.equals(CONSTANTn)) {
}
OR
if(Arrays.asList(CONSTANT1,CONSTANT2, ..... ,CONSTANTn).contains(val)) {
}

A better question to ask would be how to write this code more clearly (and faster, if performance actually matters). The answer to that would be a switch statement (or possibly even polymorphism, if you want to convert your constants into an enum) or a lookup array.
But if you insist on comparing your two approaches, the first is slightly faster. To see this, let's look at what the second approach entails:
create a new array with the constants, to pass them to the vararg parameter of Arrays.asList
create a new list object wrapping that array
iterate over that array, comparing each element with equals
The third step is equivalent to your first approach.
Finally, it's worth noting that such an operation will likely take far less than a microsecond, so unless you invoke this method millions of times per second, any approach will be fast enough.
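As a sketch of the clearer alternatives, the example below shows a switch on strings and, as a variation on the lookup-array idea, a Set that is built once. The constant values are hypothetical stand-ins for CONSTANT1 to CONSTANTn.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class ConstantLookup {
    // Hypothetical constants standing in for CONSTANT1..CONSTANTn; built once, not per call.
    private static final Set<String> ALLOWED =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("RED", "GREEN", "BLUE")));

    /** Set-based membership test; returns false for null. */
    static boolean isAllowed(String val) {
        return ALLOWED.contains(val);
    }

    /** Equivalent switch on strings (Java 7+); throws NullPointerException if val is null. */
    static boolean isAllowedSwitch(String val) {
        switch (val) {
            case "RED":
            case "GREEN":
            case "BLUE":
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("GREEN"));      // true
        System.out.println(isAllowedSwitch("PINK")); // false
    }
}
```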

Theoretically #1 is faster but insignificantly, because Arrays.asList creates only one object - a list view (wrapper) of the specified array, there is no array copying:
public static <T> List<T> asList(T... a) {
    return new ArrayList<T>(a);
}

private static class ArrayList<E> extends AbstractList<E>
    implements RandomAccess, java.io.Serializable
{
    private static final long serialVersionUID = -2764017481108945198L;
    private final E[] a;

    ArrayList(E[] array) {
        if (array==null)
            throw new NullPointerException();
        a = array;
    }
    // ... (remaining methods omitted)
}

Since you are not using a loop I guess that the number of values is so low that in practice any differences will be irrelevant.
However, having said that, if one was to iterate by hand and use equals() versus asList() and contains()... it would still be the same.
Arrays.asList() returns a private implementation of a list which extends AbstractList and simply wraps around the existing array by reference (no copy is done). The contains() method uses the indexOf() which goes through the array using equals() on each element until it finds a match and then returns it. If you would break on your loop when you find an equals then both implementations would be quite equivalent.
The only difference would be a tiny memory footprint for the additional list structure that Arrays.asList() creates, other than that...

if(val.equals(CONSTANT1) || val.equals(CONSTANT2) ..... || val.equals(CONSTANTn)) {
}
is better in terms of performance and memory, because the second one takes time to build a list before it starts searching for val in that list. Extra memory is required to maintain the list, and extra time is spent iterating through it. Comparing val with the constants directly, on the other hand, makes use of short-circuit evaluation.

Generics, Guava Ordering.arbitrary()

@SuppressWarnings("unchecked")
public static final Ordering<EmailTemplate> ARBITRARY_ORDERING = (Ordering)Ordering.arbitrary();
public static final Ordering<EmailTemplate> ORDER_BY_NAME = Ordering.natural().nullsFirst().onResultOf(GET_NAME);
public static final Ordering<EmailTemplate> ORDER_BY_NAME_SAFE = Ordering.allEqual().nullsFirst()
        .compound(ORDER_BY_NAME)
        .compound(ARBITRARY_ORDERING);
Here's the code I use to order EmailTemplate.
If I have a list of EmailTemplate, I want the null elements of the list to appear at the beginning, then the elements with a null name, and then the rest in natural name order; if two elements have the same name, they should be in an arbitrary order.
Is this how I am supposed to do it? It seems strange to start the comparator with "allEqual", I think...
I also wonder what the best way is to deal with Ordering.arbitrary(), since it's a static method that returns Ordering. Is there any elegant way to use it? I don't really like this kind of useless line that comes with a warning:
@SuppressWarnings("unchecked")
public static final Ordering<EmailTemplate> ARBITRARY_ORDERING = (Ordering)Ordering.arbitrary();
By the way, the documentation says:
Returns an arbitrary ordering over all objects, for which compare(a,
b) == 0 implies a == b (identity equality). There is no meaning
whatsoever to the order imposed, but it is constant for the life of the VM.
Does this mean that my object being compared with this Ordering will never be garbage collected?

Regarding the second question: no. Guava uses the identity hash codes of the objects to sort them arbitrarily.
Regarding the first question: I would use a comparison chain to sort by name, then by arbitrary order:
// static, so that it can be instantiated from a static field initializer
private static class ByNameThenArbitrary implements Comparator<EmailTemplate> {
    @Override
    public int compare(EmailTemplate e1, EmailTemplate e2) {
        return ComparisonChain.start()
                .compare(e1.getName(), e2.getName(), Ordering.natural().nullsFirst())
                .compare(e1, e2, Ordering.arbitrary())
                .result();
    }
}
Then I would create the real ordering to order the templates with nulls first:
private static final Ordering<EmailTemplate> ORDER =
        Ordering.from(new ByNameThenArbitrary()).nullsFirst();
Not tested, though.

I'm pretty sure you're making this too complicated:
Ordering.arbitrary() works with any Object, and compound doesn't require you to restrict it to EmailTemplate
nullsFirst() takes priority whenever a null is compared, so I'd suggest applying it last
You don't need to define multiple constants; it can all be written as one expression
I'd go for
public static final Ordering<EmailTemplate> ORDER_BY_NAME_SAFE = Ordering
        .natural()
        .onResultOf(GET_NAME)
        .compound(Ordering.arbitrary())
        .nullsFirst();
but I haven't tested it.
What's confusing here is the way compound and nullsFirst interact. With compound, the receiver (this) takes precedence, while with nullsFirst the null test wins. Both are logical:
compound works left to right
nullsFirst must test for null first, otherwise we'd get an exception
but taken together it's confusing.
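The two levels of null handling can be seen without Guava. The sketch below swaps in java.util.Comparator (Java 8+) for Guava's Ordering and leaves out the arbitrary tie-breaker, since the JDK has no direct counterpart for it; EmailTemplate here is a minimal stand-in for the question's class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NullSafeOrderingDemo {
    /** Minimal stand-in for the question's EmailTemplate. */
    static final class EmailTemplate {
        private final String name;

        EmailTemplate(String name) { this.name = name; }

        String getName() { return name; }
    }

    // Inner nullsFirst: templates with a null name sort before named ones.
    static final Comparator<EmailTemplate> BY_NAME =
        Comparator.comparing(EmailTemplate::getName,
            Comparator.nullsFirst(Comparator.<String>naturalOrder()));

    // Outer nullsFirst, applied last: null templates sort before everything else.
    static final Comparator<EmailTemplate> BY_NAME_NULL_SAFE = Comparator.nullsFirst(BY_NAME);

    public static void main(String[] args) {
        List<EmailTemplate> templates = new ArrayList<>(Arrays.asList(
            new EmailTemplate("welcome"), null, new EmailTemplate(null), new EmailTemplate("alert")));
        templates.sort(BY_NAME_NULL_SAFE);
        for (EmailTemplate t : templates) {
            System.out.println(t == null ? "null template" : t.getName());
        }
        // prints, one per line: null template, null, alert, welcome
    }
}
```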
Does this mean that my object being compared with this Ordering will never be garbage collected?
No, it uses weak references. Whenever an object isn't referenced elsewhere, it can be garbage collected. This does not contradict "the ordering is constant for the life of the VM", since an object that no longer exists can't be compared anymore.
Note that Ordering.arbitrary() is indeed arbitrary and based on object's identity rather than on equals, which means that
Ordering.arbitrary().compare(new String("a"), new String("a"))
doesn't return 0.
I wonder if an "equals-compatible arbitrary ordering" could be implemented.

Accessing hidden getEntry(Object key) in HashMap

I have a similar problem to the one discussed here, but with stronger practical usage.
For example, I have a Map<String, Integer> and a function that is given a key and, in case the mapped integer value is negative, puts null into the map:
Map<String, Integer> map = new HashMap<String, Integer>();

public void nullifyIfNegative(String key) {
    Integer value = map.get(key);
    if (value != null && value.intValue() < 0) {
        map.put(key, null);
    }
}
In this case, the lookup (and hence the hashCode calculation for the key) is done twice: once for the lookup and once for the replacement. It would be nice to have another method (which already exists inside HashMap) that allows this to be done more efficiently:
public void nullifyIfNegative(String key) {
    Map.Entry<String, Integer> entry = map.getEntry(key);
    if (entry != null && entry.getValue().intValue() < 0) {
        entry.setValue(null);
    }
}
The same applies when you want to update map values that are themselves immutable:
Map<String, String>: I want to append something to the string value.
Map<String, int[]>: I want to insert a number into the array.
So the case is quite common. Solutions that might work, but not for me:
Reflection. It works, but I cannot sacrifice performance just for this nice feature.
Use org.apache.commons.collections.map.AbstractHashedMap (it has at least a protected getEntry() method), but unfortunately commons-collections does not support generics.
Use generic commons-collections, but this library (AFAIK) is out of date (not in sync with the latest library version from Apache) and, critically, is not available in the central Maven repository.
Use value wrappers, which means "making values mutable" (e.g. use mutable integers [e.g. org.apache.commons.lang.mutable.MutableInt], or collections instead of arrays). This solution costs extra memory, which I would like to avoid.
Try to extend java.util.HashMap with a custom class implementation (which would have to be in the java.util package) and put it in the endorsed folder (as java.lang.ClassLoader will refuse to load it in Class<?> defineClass(String name, byte[] b, int off, int len), see sources), but I don't want to patch the JDK, and it seems that the list of packages that can be endorsed does not include java.util.
A similar request has already been raised on the sun.com bug tracker, but I would like to know the opinion of the community and what the way out might be, keeping maximum memory and performance efficiency in mind.
If you agree that this would be nice and beneficial functionality, please vote for this bug!

As a logical matter, you're right that a single getEntry would save you a hash lookup. As a practical matter, unless you have a specific use case where you have reason to be concerned about the performance hit (which seems pretty unlikely: hash lookup is common, O(1), and well optimized), what you're worrying about is probably negligible.
Why don't you write a test? Create a hashtable with a few 10's of millions of objects, or whatever's an order of magnitude greater than what your application is likely to create, and average the time of a get() over a million or so iterations (hint: it's going to be a very small number).
A bigger issue with what you're doing is synchronization. You should be aware that if you're doing conditional alterations on a map you could run into issues, even if you're using a synchronized map, as you'd have to lock access to the key covering the span of both the get() and put() operations.

Not pretty, but you could use a lightweight object to hold a reference to the actual value to avoid second lookups.
HashMap<String, String[]> map = ...;
// append value to the current value of key
String key = "key";
String value = "value";
// I use an array to hold a reference - even uglier than the whole idea itself ;)
String[] ref = new String[1]; // lightweight object
String[] prev = map.put(key, ref); // single lookup: returns the previous holder, if any
ref[0] = (prev != null) ? prev[0] + value : value;
I wouldn't worry about hash lookup performance too much though (Steve B's answer is pretty good in pointing out why). Especially with String keys, I wouldn't worry too much about hashCode() as its result is cached. You could worry about equals() though as it might be called more than once per lookup. But for short strings (which are often used as keys) this is negligible too.
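If Java 8 or later is available (an assumption, since the thread predates it), Map.merge and Map.computeIfPresent express this kind of read-modify-write as a single call, and HashMap overrides them so that the entry is located once. The sketch below uses hypothetical helper names. Note that returning null from the remapping function removes the mapping rather than storing a null value, so removeIfNegative is not an exact replacement for the question's nullifyIfNegative.

```java
import java.util.HashMap;
import java.util.Map;

final class SingleLookupUpdate {
    /** Appends suffix to the value stored under key, or stores suffix if the key is absent. */
    static void append(Map<String, String> map, String key, String suffix) {
        map.merge(key, suffix, String::concat);
    }

    /** Removes the entry when its value is negative; a null result deletes the mapping. */
    static void removeIfNegative(Map<String, Integer> map, String key) {
        map.computeIfPresent(key, (k, v) -> v < 0 ? null : v);
    }

    public static void main(String[] args) {
        Map<String, String> text = new HashMap<>();
        append(text, "key", "foo");
        append(text, "key", "bar");
        System.out.println(text.get("key")); // foobar

        Map<String, Integer> numbers = new HashMap<>();
        numbers.put("a", -1);
        numbers.put("b", 2);
        removeIfNegative(numbers, "a");
        removeIfNegative(numbers, "b");
        System.out.println(numbers); // {b=2}
    }
}
```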

There is no performance gain from this proposal, because the performance of Map in the average case is O(1). But enabling access to the raw Entry in such a case would raise another problem. It would be possible to change the key in the entry (even if only via reflection) and thereby break the order of the internal array.
