Java: efficient way to remember which objects are processed

What is the most efficient way to remember which objects are processed?
Obviously one could use a hash set:
Set<Foo> alreadyProcessed = new HashSet<>();

void process(Foo foo) {
    if (!alreadyProcessed.contains(foo)) {
        // Do something
        alreadyProcessed.add(foo);
    }
}
This makes me wonder why I should store the whole object, when I only want to check whether its hash exists in the set. Assume that the hash of any foo is unique.
Is there any more performant way to do this?
Bear in mind that a very large number of objects will be processed and that the actual processing code will not always be very heavy. Also, it is not possible for me to have a precompiled worklist of objects; it will be built up dynamically during processing.

Set#contains can be very fast. It depends on how your hashCode() and equals() methods are implemented. Try caching the hash code value to make it faster (as String.java does).
The other simple and fast option is to add a boolean member to your Foo class: foo.done = true;
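For illustration, a minimal sketch of the hash-caching idea in the style of String.hashCode(); the payload field is an assumption, and equals() is omitted for brevity:

public final class Foo {
    private final int[] data; // hypothetical payload

    private int hash; // 0 means "not yet computed", as in String

    Foo(int[] data) {
        this.data = data;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = java.util.Arrays.hashCode(data); // compute once
            hash = h;                            // cache for later calls
        }
        return h;
    }
}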

You can't use the hash code alone, since equality of the hash codes of two objects does not imply that the objects are equal.
Else depending on the use case you want to remember if you have already processed
a) the same object, tested by reference, or
b) an equal object, tested by a call to Object.equals(Object)
For b) you can use a standard Set implementation.
For a) you can also use a standard Set implementation if you know that the equals method returns equality of references; otherwise you would need something like an IdentityHashSet.
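The JDK ships no IdentityHashSet class, but a reference-equality Set can be built from IdentityHashMap; a minimal sketch:

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// A Set whose membership test uses ==, not equals()
Set<Foo> seenByReference =
        Collections.newSetFromMap(new IdentityHashMap<Foo, Boolean>());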
No mention of performance in this answer, you need to address correctness first!

Write good code. Optimize it for performance only if you can show that you need to in your use case.
There is no performance advantage in storing a hash code rather than the object. If you doubt that, remember that what is being stored is a reference to the object, not a copy of it. In reality that's going to be 64 bits, pretty much the same as the hash code. You've already spent a substantial amount of time thinking about a problem that none of your users will ever notice. (If you are doing these calculations millions of times in a tight, mission-critical loop, that's another matter.)
Using the set is simple to understand. Doing anything else runs a risk that a future maintainer will not understand the code and introduce a bug.
Also don't forget that a hash code is not guaranteed to be unique for every different object. Every so often storing the hash code will give you a false positive, causing you to fail to process an object you wanted to process. (As an aside, you need to make sure that equals() only considers two objects equal if they are the same object. The default Object.equals() does this, so don't override it.)
Use the Set. If you are processing a very large number of objects, use a more efficient Set than HashSet. That is much more likely to give you a performance speedup than anything clever with hashing.
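Incidentally, Set.add already reports whether the element was new, so the contains-then-add pair from the question collapses into one call:

void process(Foo foo) {
    // add() returns true only if foo was not already present,
    // performing the membership test and the insertion in one step
    if (alreadyProcessed.add(foo)) {
        // Do something
    }
}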

Related

Is there a way to retrieve a jvm object by its hash code

Say I have an object and obj.hashCode() returns 8973846. Can I call a function with the hash code and get the object back?
No. hashCode() is not unique (i.e. different objects can have the same hashCode. Even different objects of the same type can have the same hashCode), so it's not possible to implement such a method.
The best you could do would be, when you create your objects, to put them into a big HashMap<Integer,Object> that maps hash codes to instances. That way, you'd be able to retrieve them later.
Two major problems, though:
Because hash codes aren't guaranteed to be unique, you'll retrieve something with the right hash code, but not necessarily the thing you were expecting. You'd need to code everything so that hash codes were unique with high probability (which is going to be hard when there's only 32 bits to play with).
Your garbage collector is going to have a huge problem here unless you also remove objects from the hash map when you've finished with them. Normally, the garbage collector cleans up any instances that don't have any references left, but in your case, everything will maintain a reference inside the hash map. Welcome to Memory Leak City, Arizona.
You might try a WeakHashMap to alleviate the second problem, though that might cause more problems: when you try to retrieve an object later, it might have disappeared...
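To make those caveats concrete, here is a hypothetical sketch of such a registry (an illustration of the problems, not a recommendation; all names are invented):

import java.util.HashMap;
import java.util.Map;

final class ObjectRegistry {
    private final Map<Integer, Object> byHash = new HashMap<>();

    void register(Object o) {
        // Later objects with the same hash code silently overwrite earlier ones
        byHash.put(o.hashCode(), o);
    }

    Object lookup(int hashCode) {
        // May return null, or an unrelated object that shares the hash code
        return byHash.get(hashCode);
    }

    void release(Object o) {
        // Without this, the map pins every registered object in memory forever
        byHash.remove(o.hashCode(), o);
    }
}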

Unique tasks all along a program

In my program, I do some tasks, parametrized by a MyParameter object (I call doTask(MyParameter parameter) to run a task).
From the beginning to the end of the program, I can create a lot of tasks (a few million at least), BUT I want to run each of them only once (if a task has already been executed, the method does nothing).
Currently, I'm using a HashSet to store the MyParameter objects for the tasks already executed, but if a MyParameter object is 100 bytes and I run 10M tasks in my program, that is at least 1 GB in memory...
How can I optimize that, to use as little memory as possible?
Thanks a lot guys
If all you need to know is whether a particular MyParameter has been processed or not, ditch the HashSet and use a BitSet instead.
Basically, if all you need to know is whether a particular MyParameter is done or not, then storing the entire MyParameter in the set is overkill - you only need to store a single bit, where 0 means "not done" and 1 means "done". This is exactly what a BitSet is designed for.
The hashes of your MyParameter values are presumably unique, otherwise your current approach of using a HashSet is pointless. If so, then you can use the hashCode() of each MyParameter as an index into the bit set, using the corresponding bit as an indicator of whether the given MyParameter is done or not.
That probably doesn't make much sense as is, so the following is a basic implementation. (Feel free to substitute the for loop, numParameters, getParameter(), etc with whatever it is that you're actually using to generate MyParameters)
BitSet doneSet = new BitSet();
for (int i = 0; i < numParameters; ++i) {
    MyParameter parameter = getParameter(i);
    // Mask off the sign bit: BitSet indices must be non-negative
    int bit = parameter.hashCode() & 0x7FFFFFFF;
    if (!doneSet.get(bit)) {
        doTask(parameter);
        doneSet.set(bit);
    }
}
The memory usage of this approach depends on how BitSet is implemented internally, but even in the worst case (bit indices spread across the whole non-negative int range) the bit set tops out at 2^31 bits, i.e. 256 MB, which is still significantly better than simply storing all your MyParameters in a HashSet.
If, in fact, you do need to hang onto your MyParameter objects once you've processed them because they contain the result of processing, then you can possibly save space by storing just the result portion of the MyParameter in the HashSet (if such a thing's possible - your question doesn't make this clear).
If, on the other hand, you really do need each MyParameter in its entirety once you're done processing it, then you're already doing pretty much the best you can do. You might be able to do a little better memory-wise by storing them as a vector (i.e. expandable array) of MyParameters (which avoids some of the memory overheads inherent in using a HashSet), but this will incur a speed penalty due to time needed to expand the vector and an O(n) search time.
A TreeSet will give you somewhat better memory performance than a HashSet, at the cost of log(n) lookups.
You can use a NoSql key-value store such as Cassandra or LevelDB, which are essentially external hash tables.
You may be able to compress the MyParameter representation, but if it's only at 100 bytes currently then I don't know how much smaller you'd be able to get it.

compareTo involving non-comparable field: how to maintain transitivity?

Consider a class with a comparable (consistent with equals) and a non-comparable field (of a class about which I do not know whether it overrides Object#equals or not).
The class's instances shall be compared, where the resulting order shall be consistent with equals, i.e. 0 is returned iff both fields are equal (as per Object#equals), and consistent with the order of the comparable field. I used System.identityHashCode to cover most of the cases not covered by these requirements (the order of instances with the same comparable value but different other values is arbitrary), but I am not sure whether this is the best approach.
public class MyClass implements Comparable<MyClass> {
    private Integer intField;
    private Object nonCompField;

    public int compareTo(MyClass other) {
        int intFieldComp = this.intField.compareTo(other.intField);
        if (intFieldComp != 0)
            return intFieldComp;
        if (this.nonCompField.equals(other.nonCompField))
            return 0;
        // ...and now? My current approach:
        if (System.identityHashCode(this.nonCompField) < System.identityHashCode(other.nonCompField))
            return -1;
        else
            return 1;
    }
}
Two problems I see here:
If System.identityHashCode is the same for two objects, each is greater than the other. (Can this happen at all?)
The order of instances with the same intField value and different nonCompField values need not be consistent between runs of the program, as far as I understand what System.identityHashCode does.
Is that correct? Are there more problems? Most importantly, is there a way around this?
The first problem, although highly unlikely, could happen (I think you would need an enormous amount of memory, and very bad luck). But it's solved by Guava's Ordering.arbitrary(), which uses the identity hash code behind the scenes, but maintains a cache of comparison results for the cases where two different objects have the same identity hash code.
Regarding your second question, no, the identity hash codes are not preserved between runs.
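A hedged sketch of how Ordering.arbitrary() could replace the identity-hash-code branch in the compareTo above (assuming Guava is on the classpath):

import com.google.common.collect.Ordering;

public int compareTo(MyClass other) {
    int intFieldComp = this.intField.compareTo(other.intField);
    if (intFieldComp != 0)
        return intFieldComp;
    if (this.nonCompField.equals(other.nonCompField))
        return 0;
    // A consistent arbitrary order over all objects; ties between equal
    // identity hash codes are broken by an internal cache
    return Ordering.arbitrary().compare(this.nonCompField, other.nonCompField);
}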
System.identityHashCode […] the same for two objects […] (Can this happen at all?)
Yes it can. Quoting from the Java API Documentation:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects.
identityHashCode(Object x) returns the same hash code for the given object as would be returned by the default method hashCode(), whether or not the given object's class overrides hashCode().
So you may encounter hash collisions, and with memory ever growing but hash codes staying fixed at 32 bit, they will become increasingly more likely.
The order of instances with the same intField value and different nonCompField values need not be consistent between runs of the program, as far as I understand what System.identityHashCode does.
Right. It might even be different during a single invocation of the same program: You could have (1,foo) < (1,bar) < (1,baz) even though foo.equals(baz).
Most importantly, is there a way around this?
You can maintain a map which maps each distinct value of the non-comparable type to a sequence number which you increase for each distinct value you encounter.
Memory management will be tricky, though: You cannot use a WeakHashMap as the code might make your key object unreachable but still hold a reference to another object of the same value. So either you maintain a list of weak references to all the objects of a given value, or you simply use strong references and accept the fact that any uncomparable value ever encountered will never be garbage collected.
Note that this scheme will still not result in reproducible sequence numbers unless you create values reproducibly in just the same order.
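A minimal sketch of that sequence-number scheme with strong references (so, as noted, distinct values are never garbage collected); the class and method names are illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class ArbitraryOrder {
    private final Map<Object, Long> sequence = new ConcurrentHashMap<>();
    private final AtomicLong next = new AtomicLong();

    // Equal values (per equals()) share a number; distinct values get
    // increasing numbers in order of first appearance
    long sequenceOf(Object value) {
        return sequence.computeIfAbsent(value, v -> next.getAndIncrement());
    }

    int compare(Object a, Object b) {
        return Long.compare(sequenceOf(a), sequenceOf(b));
    }
}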
If the class of the nonCompField has implemented a reasonably good toString(), you might be able to use
return String.valueOf(this.nonCompField).compareTo(String.valueOf(other.nonCompField));
Unfortunately, the default Object.toString() uses the hashcode, which has potential issues as noted by others.

Am I writing this method the right way?

I've got an ArrayList called conveyorBelt, which stores orders that have been picked and placed on the conveyor belt. I've got another ArrayList called readyCollected which contains a list of orders that can be collected by the customer.
What I'm trying to do with the method I created is: when an ordNum is entered, it returns true if the order is ready to be collected by the customer (thus removing the collected order from readyCollected). If the order hasn't even been picked yet, then it returns false.
I was wondering is this the right way to write the method...
public boolean collectedOrder(int ordNum)
{
    int index = 0;
    Basket b = new Basket(index);
    if (conveyorBelt.isEmpty()) {
        return false;
    }
    else {
        readyCollected.remove(b);
        return true;
    }
}
I'm a little confused since you're not using ordNum at all.
If you want to confirm operation of your code and generally increase the reliability of what you're writing, you should check out unit testing and the Java frameworks available for this.
You can solve this problem using an ArrayList, but I think that this is fundamentally the wrong way to think about the problem. An ArrayList is good for storing a complete sequence of data without gaps where you are only likely to add or remove elements at the very end. It's inefficient to remove elements at other positions, and if you have just one value at a high index, then you'll waste a lot of space filling in all lower positions with null values.
Instead, I'd suggest using a Map that associates order numbers with the particular order. This more naturally encodes what you want - every order number is a key associated with the order. Maps, and particularly HashMaps, have very fast lookups (expected constant time) and use (roughly) the same amount of space no matter how many keys there are. Moreover, the time to insert or remove an element from a HashMap is expected constant time, which is extremely fast.
As for your particular code, I agree with Brian Agnew on this one that you probably want to write some unit tests for it and find out why you're not using the ordNum parameter. That said, I'd suggest reworking the system to use a HashMap instead of an ArrayList before doing this; the savings in time and code complexity will really pay off.
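For concreteness, a hedged sketch of that Map-based rework; the Order type and its getNumber() accessor are assumptions:

import java.util.HashMap;
import java.util.Map;

// Orders ready for collection, keyed by order number
private final Map<Integer, Order> readyCollected = new HashMap<>();

// Called when a picked order becomes ready for collection
void markReady(Order order) {
    readyCollected.put(order.getNumber(), order);
}

// Removes and reports in one step: true iff the order was ready
public boolean collectedOrder(int ordNum) {
    return readyCollected.remove(ordNum) != null;
}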
Based on your description, why isn't this sufficient:
public boolean collectedOrder(int ordNum) {
    return (readyCollected.remove(ordNum) != null);
}
Why does the conveyorBelt ArrayList even need to be checked?
As already pointed out, you most likely need to be using ordNum.
Aside from that the best answer anyone can give with the code you've posted is "perhaps". Your logic certainly looks correct and ties in with what you've described, but whether it's doing what it should depends entirely on your implementation elsewhere.
As a general pointer (which may or may not be applicable in this instance), you should make sure your code deals with edge cases and incorrect values. So you might want to flag that something is wrong if readyCollected.remove(b) returns false, for instance, since that indicates that b wasn't in the list to remove.
As already pointed out, take a look at unit tests using JUnit for this type of thing. It's easy to use and writing thorough unit tests is a very good habit to get into.
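As a flavour of what such a test might look like (JUnit 4; the class and method names here are assumptions):

import static org.junit.Assert.assertFalse;
import org.junit.Test;

public class OrderSystemTest {
    @Test
    public void orderNotYetPickedCannotBeCollected() {
        OrderSystem system = new OrderSystem(); // hypothetical class under test
        assertFalse(system.collectedOrder(42)); // nothing picked, so not collectable
    }
}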

hashCode uniqueness

Is it possible for two instances of Object to have the same hashCode()?
In theory an object's hashCode is derived from its memory address, so all hashCodes should be unique, but what if objects are moved around during GC?
I think the docs for Object's hashCode method state the answer:
"As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)"
Given a reasonable collection of objects, having two with the same hash code is quite likely. In the best case it becomes the birthday problem, with a 50% chance of a clash at around 77,000 objects (√(2 · 2³² · ln 2) ≈ 77,000). In practice objects are created with a relatively small pool of likely hash codes, and clashes can easily happen with merely thousands of objects.
Using memory address is just a way of obtaining a slightly random number. The Sun JDK source has a switch to enable use of a Secure Random Number Generator or a constant. I believe IBM (used to?) use a fast random number generator, but it was not at all secure. The mention in the docs of memory address appears to be of a historical nature (around a decade ago it was not unusual to have object handles with fixed locations).
Here's some code I wrote a few years ago to demonstrate clashes:
class HashClash {
    public static void main(String[] args) {
        final Object obj = new Object();
        final int target = obj.hashCode();
        Object clash;
        long ct = 0;
        do {
            clash = new Object();
            ++ct;
        } while (clash.hashCode() != target && ct < 10L * 1000 * 1000 * 1000L);
        if (clash.hashCode() == target) {
            System.out.println(ct + ": " + obj + " - " + clash);
        } else {
            System.out.println("No clashes found");
        }
    }
}
RFE to clarify docs, because this comes up way too frequently: CR 6321873
Think about it. There are an infinite number of potential objects, and only 4 billion hash codes. Clearly, an infinity of potential objects share each hash code.
The Sun JVM either bases the Object hash code on a stable handle to the object or caches the initial hash code. Compaction during GC will not alter the hashCode(). Everything would break if it did.
Is it possible?
Yes.
Does it happen with any reasonable degree of frequency?
No.
I assume the original question is only about the hash codes generated by the default Object implementation. The fact is that hash codes must not be relied on for equality testing and are only used in some specific hash mapping operations (such as those implemented by the very useful HashMap implementation).
As such they have no need of being really unique - they only have to be unique enough to not generate a lot of clashes (which will render the HashMap implementation inefficient).
Also, it is expected that when developers implement classes that are meant to be stored in HashMaps, they will implement a hash code algorithm that has a low chance of clashes for objects of the same class (assuming you only store objects of the same class in application HashMaps), and knowing about the data makes it much easier to implement robust hashing.
Also see Ken's answer about equality necessitating identical hash codes.
Are you talking about the actual class Object or objects in general? You use both in the question. (And real-world apps generally don't create a lot of instances of Object)
For objects in general, it is common to write a class for which you want to override equals(); and if you do that, you must also override hashCode() so that two different instances of that class that are "equal" must also have the same hash code. You are likely to get a "duplicate" hash code in that case, among instances of the same class.
Also, when implementing hashCode() in different classes, they are often based on something in the object, so you end up with less "random" values, resulting in "duplicate" hash codes among instances of different classes (whether or not those objects are "equal").
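A standard illustration of that contract (a hypothetical value class, not from the question):

import java.util.Objects;

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    // Equal points must report equal hash codes, so hashCode() is
    // derived from exactly the fields that equals() compares
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}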
In any real-world app, it is not unusual to find two different objects with the same hash code.
If there were as many hash codes as memory addresses, then it would take the whole memory to store the hashes themselves. :-)
So, yes, the hash codes should sometimes happen to coincide.
