How to efficiently store a set of tuples/pairs in Java

I need to check whether the combination of a long value and an int value has already been seen, in a very performance-critical part of an application. Both values can become quite large; at the very least, the long will exceed Integer.MAX_VALUE in some cases.
Currently I have a very simple implementation using a Set<Pair<Integer, Long>>, but it requires too many allocations: even when the pair is already in the set, an add/existence check such as seen.add(Pair.of(i, l)) allocates a new Pair on every call.
Is there a better way in Java (without libraries like Guava, Trove or Apache Commons) to do this check with minimal allocations and good time complexity?
Two ints would be easy because I could combine them into one long in the Set, but the long cannot be avoided here.
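(For reference, the two-int packing I mean is something like the sketch below; there is obviously no room left once one of the components is itself a long.)

static long packedKey(int a, int b) {
    // a in the upper 32 bits, b in the lower 32 bits (sign bits masked off)
    return ((long) a << 32) | (b & 0xFFFFFFFFL);
}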
Any suggestions?

Here are two possibilities.
One thing common to both of the following suggestions is to store a bunch of pairs together as triples of ints in an int[]. The first int would be the int value itself, and the next two ints would be the upper and lower halves of the long.
If you didn't mind a 33% extra space disadvantage in exchange for an addressing speed advantage, you could use a long[] instead and store the int and long in separate indexes.
You'd never call an equals method. You'd just compare the three ints with three other ints, which would be very fast. You'd never call a compareTo method. You'd just do a custom lexicographic comparison of the three ints, which would be very fast.
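A minimal sketch of that packing and comparison might look like this (the helper names are mine):

static void pack(int[] a, int slot, int intVal, long longVal) {
    int base = slot * 3;
    a[base] = intVal;
    a[base + 1] = (int) (longVal >>> 32); // upper half of the long
    a[base + 2] = (int) longVal;          // lower half of the long
}

// Lexicographic comparison of two packed entries: no equals() or compareTo()
// calls, and any consistent total order is sufficient for membership testing.
static int compareSlots(int[] a, int slotA, int slotB) {
    for (int k = 0; k < 3; k++) {
        int cmp = Integer.compare(a[slotA * 3 + k], a[slotB * 3 + k]);
        if (cmp != 0) return cmp;
    }
    return 0;
}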
B* tree
If memory usage is the ultimate concern, you can make a B* tree using an int[][] or an ArrayList<int[]>. B* trees are relatively quick and fairly compact.
There are also other types of B-trees that might be more appropriate to your particular use case.
Custom hash set
You can also implement a custom hash set with a custom, fast-calculated hash function (perhaps XOR the int and the upper and lower halves of the long together, which will be very fast) rather than relying on the hashCode method.
You'd have to figure out how to implement the int[] buckets to best suit the performance of your application. For example, how do you want to convert your custom hash code into a bucket number? Do you want to rebucket everything when the buckets start getting too many elements? And so on.
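As a very rough sketch of what such a set could look like (the names and the linear-probing scheme are my own choices; growth/rehashing is omitted, so the table would have to be sized generously up front):

// An allocation-free open-addressing set for (int, long) pairs.
final class IntLongSet {
    private final int[] ints;
    private final long[] longs;
    private final boolean[] used;
    private final int mask;

    IntLongSet(int capacityPow2) { // capacity must be a power of two
        ints = new int[capacityPow2];
        longs = new long[capacityPow2];
        used = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // XOR the int and the upper and lower halves of the long, as suggested above.
    private static int hash(int i, long l) {
        int h = i ^ (int) (l >>> 32) ^ (int) l;
        return h ^ (h >>> 16); // spread the bits a little
    }

    // Returns true if the pair was newly added, false if it was already present.
    boolean add(int i, long l) {
        int slot = hash(i, l) & mask;
        while (used[slot]) {
            if (ints[slot] == i && longs[slot] == l) return false; // already seen
            slot = (slot + 1) & mask; // linear probing
        }
        used[slot] = true;
        ints[slot] = i;
        longs[slot] = l;
        return true;
    }
}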

How about creating a class that holds two primitives instead? You would save at least 24 bytes just for the headers of the Integer and Long objects on a 64-bit JVM.
Under these conditions you are looking for a pairing function, i.e. a way to generate a unique number from two numbers. The Wikipedia page on pairing functions has a very good (and simple) example of one such possibility.

How about
class Pair {
    int v1;
    long v2;

    @Override
    public boolean equals(Object o) {
        return o instanceof Pair && v1 == ((Pair) o).v1 && v2 == ((Pair) o).v2;
    }

    @Override
    public int hashCode() {
        return 31 * (31 + Integer.hashCode(v1)) + Long.hashCode(v2);
    }
}
class Store {
    // initial capacity should be tweaked; note that HashSet itself is not
    // thread-safe, so concurrent add() calls need external synchronization
    private static final Set<Pair> store = new HashSet<>(100 * 1024);
    private static final ThreadLocal<Pair> threadPairUsedForContains = new ThreadLocal<>();

    void init() { // each thread has to call init() first
        threadPairUsedForContains.set(new Pair());
    }

    boolean contains(int v1, long v2) { // zero-allocation contains()
        Pair pair = threadPairUsedForContains.get();
        pair.v1 = v1;
        pair.v2 = v2;
        return store.contains(pair);
    }

    void add(int v1, long v2) {
        Pair pair = new Pair();
        pair.v1 = v1;
        pair.v2 = v2;
        store.add(pair);
    }
}

Related

java: memoizing construction through hash function

I have an X object whose constructor takes in 4 integer fields. To calculate its hash code, I simply throw them in an array and use Arrays.hashCode.
Currently the constructor is private and I have a static creator method. I'd like to memoize construction so that whenever the creator method is called with 4 integer parameters that have been called before, I can return the same object as last time. [Ideally without having to create another X object to compare with.]
Originally I tried a HashSet, but that required me to create a new X just to check whether the set contains an equal object... never mind the fact that I can't 'get' an element back out of a HashSet.
My next idea is to use a Hashtable which maps the hash code of the int array of the 4 fields to the object. I'm not sure why, but that doesn't feel right. It feels like I'm doing too much work; isn't the point of a hash code to be a sort of mapping to a bunch of objects which calculate to the same hash code?
I appreciate your advice.
The purpose of a hash code is generally to narrow down the location in which to look for a particular object. Or put another way, the idea is that your hash code makes it so that if two objects have the same hash code they are "very likely" to be the same object.
Now, how likely is "very likely" essentially depends on the width (number of bits) and quality of the hash code. In the case of Java, with 32 bit hash codes, this "very likely" still generally means "not near enough to 100% that you can do away with an actual comparison of the object data". So as well as implementing hashCode(), you need to implement equals() on an object that is used as the key to a Java Map (HashMap etc).
Or put another way: your implementation is essentially correct, even though it looks like you're doing a lot of work. The upshot is that if what you are looking for is a performance improvement, you may as well just create a new object each time. But if functionally you require that there never exists more than one object with a given set of values, then your implementation is essentially correct.
Things you could do in principle:
if you had a large number of ints, then for hashCode(), just form the hash code from a 'sample' of a couple of them -- the idea is to 'narrow down the choices', i.e. to make it fairly (but not 100%) likely that an equal hash code means an equal object. Your equals() has to go through and check all the values anyway, so there's little point in cycling through all of them in both hashCode() and equals() (a minimal sketch of this follows the list);
potentially, you can use a stronger hash code, so that you literally assume that equal hash codes mean equal objects. In effect, you cycle through all of the values once in the hash code function and don't have an equals function at all. In practice this means using at least a strong-ish 64 bit hash code. It's probably not worth it for the case you mention. But if you want to understand a little about how it would work, I would point you to a tutorial I wrote on the advanced use of hash codes in Java.
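To illustrate the 'sampling' idea from the first point, a sketch (the class and its field are hypothetical):

final class IntVector {
    private final int[] values; // assumed to be a large array of ints

    IntVector(int[] values) {
        this.values = values;
    }

    @Override
    public int hashCode() {
        // hash only a 'sample': the first, middle and last elements
        return 31 * (31 * values[0] + values[values.length / 2])
                + values[values.length - 1];
    }

    @Override
    public boolean equals(Object o) {
        // equals() still compares every value
        return o instanceof IntVector
                && java.util.Arrays.equals(values, ((IntVector) o).values);
    }
}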
If the 4 integers during construction mean the resulting object will be exactly the same, then use those as the key, not their hash. Notice I'm not using your full Object as the key, just the 4 integer values. The MyObjectSpecification below will be a tiny object.
public class MyObjectSpecification {
    private final int i1, i2, i3, i4;

    public MyObjectSpecification(int i1, int i2, int i3, int i4) {
        this.i1 = i1;
        this.i2 = i2;
        this.i3 = i3;
        this.i4 = i4;
    }

    public boolean equals(Object o) {
        // ...
    }

    public int hashCode() {
        // ...
    }
}
public class MyObject {
    private static final Map<MyObjectSpecification, MyObject> myObjects =
            new ConcurrentHashMap<MyObjectSpecification, MyObject>();

    private MyObject(MyObjectSpecification spec) {
        // ...
    }

    public static MyObject getMyObject(int i1, int i2, int i3, int i4) {
        MyObjectSpecification spec = new MyObjectSpecification(i1, i2, i3, i4);
        // computeIfAbsent (Java 8+) makes the check-then-create step atomic;
        // a separate containsKey/put could create two objects under contention
        return myObjects.computeIfAbsent(spec, MyObject::new);
    }
}
Not sure how you plan to use the Hashtable, but I think something like the below would do your job (note the caveat in the comments):

private static Hashtable<Integer, MyObject> objectInstances =
        new Hashtable<Integer, MyObject>();

public static MyObject instance(int i1, int i2, int i3, int i4) {
    int hashKey = Arrays.hashCode(new int[]{i1, i2, i3, i4});
    // get the object from the hashtable
    MyObject myObject = objectInstances.get(hashKey);
    // if the object was not already created, create it now and put it in the hashtable;
    // caveat: two distinct (i1, i2, i3, i4) tuples can collide on the same hashKey,
    // in which case this returns the wrong object -- keying on the values themselves
    // (as in the answer above) avoids that
    if (myObject == null) {
        myObject = new MyObject(i1, i2, i3, i4);
        objectInstances.put(hashKey, myObject);
    }
    return myObject;
}

Creating a hash from several Java string objects

What would be the fastest and more robust (in terms of uniqueness) way for implementing a method like
public abstract String hash(String[] values);
The values[] array has 100 to 1,000 members, each of a which with few dozen characters, and the method needs to be run about 10,000 times/sec on a different values[] array each time.
Should a long string be built using a StringBuilder buffer and a hash method then invoked on the buffer contents, or is it better to keep invoking the hash method for each string from values[]?
Obviously a hash of at least 64 bits is needed (e.g., MD5) to avoid collisions, but is there anything simpler and faster that could be done, at the same quality?
For example, what about
public String hash(String[] values)
{
    long result = 0;
    for (String v : values)
    {
        result += v.hashCode();
    }
    return String.valueOf(result);
}
Definitely don't use plain addition due to its linearity properties, but you can modify your code just slightly to achieve very good dispersion.
public String hash(String[] values) {
    long result = 17;
    for (String v : values) result = 37 * result + v.hashCode();
    return String.valueOf(result);
}
It doesn't provide a 64 bit hash, but given the title of the question it's probably worth mentioning that since Java 1.7 there is java.util.Objects#hash(Object...).
Here is a simple implementation using the Objects class, available since Java 7:

@Override
public int hashCode()
{
    return Objects.hash(this.variable1, this.variable2);
}
You should watch out for creating weaknesses when combining methods (the built-in Java hash function and your own). I did a little research on cascaded ciphers, and this is an example of the same problem: the addition might interfere with the internals of hashCode().
The internals of hashCode() look like this:
for (int i = 0; i < len; i++) {
    h = 31 * h + val[off++];
}

so adding the resulting numbers together means the last characters of all the strings in the array are effectively just summed, which does nothing for the randomness (which is already low enough in a hash function of this form).
If you want real pseudorandomness, take a look at the FNV hash algorithm. It is among the fastest hash algorithms out there and was designed specifically for use in hash tables.
It goes like this:
long hash = 0xCBF29CE484222325L; // the 64-bit FNV offset basis
for (String s : strings)
{
    hash ^= s.hashCode();
    hash *= 0x100000001B3L; // the 64-bit FNV prime
}
^ This is not the actual implementation of FNV, since it takes ints as input instead of bytes, but I think it works just as well.
First, a hash code is typically numeric, e.g. an int. Moreover, your version of the hash function computes a number and then returns its string representation, which IMHO does not make any sense.
I'd improve your hash method as follows:
public int hash(String[] values) {
    int result = 0;
    for (String v : values) {
        result = result * 31 + v.hashCode();
    }
    return result;
}
Take a look at hashCode() as implemented in the class java.lang.String.

Data Structure to cache most frequent elements

Suppose I read a stream of integers. The same integer may appear more than once in the stream. Now I would like to keep a cache of N integers that appeared most frequently. The cache is sorted by the frequency of the stream elements.
How would you implement it in Java?
You want to use a binary indexed tree. The code in the link is for C++ and should be fairly straightforward to convert to Java (AFAICT the code would be the same):
Paper by Peter Fenwick
Implementation in C++
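For reference, the core of a Fenwick (binary indexed) tree transcribes to Java along these lines (a sketch over counts of the values 1..n):

final class FenwickTree {
    private final int[] tree; // 1-based; tree[0] is unused

    FenwickTree(int n) {
        tree = new int[n + 1];
    }

    // add delta to the count of value i (1 <= i <= n)
    void update(int i, int delta) {
        for (; i < tree.length; i += i & -i) {
            tree[i] += delta;
        }
    }

    // sum of the counts of values 1..i
    int prefixSum(int i) {
        int sum = 0;
        for (; i > 0; i -= i & -i) {
            sum += tree[i];
        }
        return sum;
    }
}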
Use a Guava Multiset and sort it by frequency
public class MyData implements Comparable<MyData> {
    public int frequency = 0;
    public Integer data;

    @Override
    public int compareTo(MyData that) {
        // Integer.compare avoids the overflow that plain subtraction can suffer
        return Integer.compare(this.frequency, that.frequency);
    }
}
Have it stored in a PriorityQueue
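For example (a sketch; note that PriorityQueue does not reorder existing elements when a frequency later changes, so an element has to be removed and re-inserted after its count is updated):

// Max-heap: the most frequent element sits at the head.
PriorityQueue<MyData> byFrequency = new PriorityQueue<>(Comparator.reverseOrder());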
Create an object model for the int with a count property inside. Create a SortedVector collection extending Vector. Each time an integer occurs, add it to the vector if it isn't there yet; otherwise find it, increment its count property, and call Collections.sort(this) within your vector.
Do you know the range of the numbers? If so, it might make sense to use an array. For example, if I knew that the range of the numbers was between 0 and 10, I would make an array of size 11. Each element in this array would count the number of times I've seen a given number. Then, you just have to remember the most frequently seen number.
e.g.
int[] array = new int[11]; // counts for the values 0..10
int freqIndex = -1;
int freqCount = -1;

void readVal(int n) {
    array[n] += 1;
    if (array[n] > freqCount) {
        freqIndex = n;
        freqCount = array[n];
    }
}
Of course, this approach is bad if the distribution of numbers is sparse.
I'd try a priority queue.

Java priority queue implementation - memory locality

I am trying to implement an efficient priority queue in Java. I got to a good implementation of a binary heap but it doesn't have the ideal cache performance. For this I started studying the Van Emde Boas layout in a binary heap which led me to a "blocked" version of a binary heap, where the trick is to calculate the children and parent indices.
Although I was able to do this, the cache behavior (and running time) got worse. I think the problem is that locality of reference is probably not being achieved, since this is Java - I'm not sure that using an array of objects actually makes the objects contiguous in memory in Java; can anyone confirm this, please?
Also I would very much like to know what kind of data structure Java's native PriorityQueue uses, if anyone knows.
In general, there is no good way to force your objects in the queue to occupy a contiguous chunk of memory. There are, however, some techniques that are suitable for special cases.
At a high level, the techniques involve using byte arrays and 'serializing' data to and from the array. This is actually quite effective if you are storing very simple objects. For example, if you are storing a bunch of 2D points + weights, you can simply write the byte equivalents of the weight, x-coordinate and y-coordinate.
The problem at this point, of course, is in allocating instances while peeking/popping. You can avoid this by using a callback.
Note that even in cases where the object being stored itself is complex, using a technique similar to this where you keep one array for the weights and a separate array of references for the actual objects allows you to avoid following the object reference until absolutely necessary.
Going back to the approach for storing simple immutable value-types, here's an incomplete sketch of what you could do:
abstract class LowLevelPQ<T> {
    interface DataHandler<R, T> {
        R handle(byte[] source, int startLoc);
    }

    LowLevelPQ(int entryByteSize) { ... }

    abstract void encode(T element, byte[] target, int startLoc);
    abstract T decode(byte[] source, int startLoc);
    abstract int compare(byte[] data, int startLoc1, int startLoc2);

    <R> R peek(DataHandler<R, T> handler) { ... }
    <R> R pop(DataHandler<R, T> handler) { ... }
}

class WeightedPoint {
    WeightedPoint(int weight, double x, double y) { ... }
    int weight() { ... }
    double x() { ... }
    ...
}

class WeightedPointPQ extends LowLevelPQ<WeightedPoint> {
    WeightedPointPQ() {
        super(4 + 8 + 8); // int, double, double
    }

    int compare(byte[] data, int startLoc1, int startLoc2) {
        // compares the big-endian encoded weight byte by byte
        // (assumes non-negative weights, or the sign byte would compare wrongly)
        for (int i = 0; i < 4; ++i) {
            int v1 = 0xFF & (int) data[startLoc1 + i];
            int v2 = 0xFF & (int) data[startLoc2 + i];
            if (v1 < v2) { return -1; }
            if (v1 > v2) { return 1; }
        }
        return 0;
    }
    ...
}
I don't think it would. Remember, "arrays of objects" aren't arrays of objects, they are arrays of object references (unlike arrays of primitives which really are arrays of the primitives). I'd expect the object references are contiguous in memory, but since you can make those references refer to any objects you want whenever you want, I doubt there's any guarantee that the objects referred to by the array of references will be contiguous in memory.
For what it's worth, the JLS section on arrays says nothing about any guarantees of contiguousness.
I think there is some FUD going on here. It is basically inconceivable that any implementation of arrays would not use contiguous memory. And the way the term is used in the JVM specification when describing the .class file format makes it pretty clear that no other implementation is contemplated.
java.util.PriorityQueue uses a binary heap, as it says in the Javadoc, implemented via an array.
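For reference, it uses the usual implicit 0-based array layout, where index arithmetic replaces child/parent pointers:

static int parent(int i) { return (i - 1) / 2; }
static int leftChild(int i) { return 2 * i + 1; }
static int rightChild(int i) { return 2 * i + 2; }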

Java: Equalator? (removing duplicates from a collection of objects)

I have a bunch of objects of a class Puzzle. I have overridden equals() and hashCode(). When it comes time to present the solutions to the user, I'd like to filter out all the Puzzles that are "similar" (by the standard I have defined), so the user only sees one of each.
Similarity is transitive.
Example:
Result of computations:
A (similar to D)
B (similar to C)
C
D
In this case, only A or D and B or C would be presented to the user - but not two similar Puzzles. Two similar puzzles are equally valid. It is only important that they are not both shown to the user.
To accomplish this, I wanted to use an ADT that prohibits duplicates. However, I don't want to change the equals() and hashCode() methods to return a value about similarity instead. Is there some Equalator, like Comparator, that I can use in this case? Or is there another way I should be doing this?
The class I'm working on is a Puzzle that maintains a grid of letters. (Like Scrabble.) If a Puzzle contains the same words but in a different orientation, it is considered similar. So the following puzzle:
(2, 2): A
(2, 1): C
(2, 0): T
Would be similar to:
(1, 2): A
(1, 1): C
(1, 0): T
Okay you have a way of measuring similarity between objects. That means they form a Metric Space.
The question is, is your space also a Euclidean space like normal three dimensional space, or integers or something like that? If it is, then you could use a binary space partition in however many dimensions you've got.
(The question is, basically: is there a homomorphism between your objects and an n-dimensional real number vector? If so, then you can use techniques for measuring closeness of points in n-dimensional space.)
Now, if it's not a Euclidean space then you've got a bigger problem. An example of a non-Euclidean space that programmers might be most familiar with would be the Levenshtein distance between two strings.
If your problem is one of seeing how similar a string is to a list of already existing strings, then I don't know of any algorithms that would do that without O(n²) time. Maybe there are some out there.
But another important question is: how much time do you have? How many objects? If you have the time, or if your data set is small enough that an O(n²) algorithm is practical, then you just have to iterate through your list of objects to see whether each candidate is below a certain similarity threshold. If so, reject it.
Just extend AbstractCollection and override the add method, backing it with an ArrayList or whatever. Your code would look something like this:
class SimilarityRejector<T> extends AbstractCollection<T> {
    private final ArrayList<T> base;
    private final double threshold;

    public SimilarityRejector(double threshold) {
        base = new ArrayList<T>();
        this.threshold = threshold;
    }

    @Override
    public boolean add(T t) {
        // reject t if it is too similar to anything already accepted
        for (T compare : base) {
            if (similarityComparison(t, compare) < threshold) return false;
        }
        return base.add(t);
    }

    @Override
    public Iterator<T> iterator() {
        return base.iterator();
    }

    @Override
    public int size() {
        return base.size();
    }
}
etc. Obviously T would need to be a subclass of some class on which you can perform the similarity comparison. If you have a Euclidean metric, then you can use a space partition rather than going through every other item.
I'd use a wrapper class that overrides equals and hashCode accordingly.
private static class Wrapper {
    private final Puzzle puzzle;

    public Wrapper(Puzzle puzzle) {
        this.puzzle = puzzle;
    }

    @Override
    public boolean equals(Object object) {
        // ...
    }

    @Override
    public int hashCode() {
        // ...
    }
}
and then you wrap all your puzzles, put them in a map, and get them out again…
public Collection<Collection<Puzzle>> method(Collection<Puzzle> puzzles) {
    Map<Wrapper, Collection<Puzzle>> map = new HashMap<Wrapper, Collection<Puzzle>>();
    for (Puzzle each : puzzles) {
        Wrapper wrapper = new Wrapper(each);
        Collection<Puzzle> coll = map.get(wrapper);
        if (coll == null) map.put(wrapper, coll = new ArrayList<Puzzle>());
        coll.add(each);
    }
    return map.values();
}
Create a TreeSet using your Comparator
Add all elements into the set
All duplicates are stripped out
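A minimal sketch, assuming a similarityComparator that returns 0 for puzzles you consider equivalent (it must still impose a consistent total order for the TreeSet to behave):

// similarityComparator is assumed to exist; compare() == 0 means "similar"
Set<Puzzle> unique = new TreeSet<>(similarityComparator);
unique.addAll(allPuzzles); // similar duplicates are silently dropped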
Normally "similarity" is not a transitive relationship. So the first step would be to think of this in terms of equivalence rather than similarity. Equivalence is reflexive, symmetric and transitive.
Easy approach here is to define a puzzle wrapper whose equals() and hashCode() methods are implemented according to the equivalence relation in question.
Once you have that, drop the wrapped objects into a java.util.Set and that filters out duplicates.
IMHO, the most elegant way was described by Gili (a TreeSet with a custom Comparator).
But if you'd like to do it yourself, this seems to be the easiest and clearest solution:
/**
 * Distinct input list values (removes duplicates)
 * @param items items to process
 * @param comparator comparator to recognize equal items
 * @return new collection with unique values
 */
public static <T> Collection<T> distinctItems(List<T> items, Comparator<T> comparator) {
    List<T> result = new ArrayList<>();
    for (int i = 0; i < items.size(); i++) {
        T item = items.get(i);
        boolean exists = false;
        for (int j = 0; j < result.size(); j++) {
            if (comparator.compare(result.get(j), item) == 0) {
                exists = true;
                break;
            }
        }
        if (!exists) {
            result.add(item);
        }
    }
    return result;
}
