Java implementation-How to improve the speed of hash table

Java implementation-How to improve the speed of hash table - java

I am implementing a hash table (as per requirements). It works ok with a small input but unfortunately it's way too slow when dealing with a large number of input. I tried BufferedInputStream but it doesn't make any differences. Basically I implemented it following the logic below. Any ideas how I can improve the speed? Is there a specific function that causes the bad performance? Or we might need to close the Scanner?
int [] table = new int [30000];// creat an array as the table
Scanner scan = new Scanner (System.in); //use scanner to read the input file.
while (scan.hasNextLine()) {
//read one line at a time, and a sequence of int into an array list called keys
// functions used here is string.split(" ");
}
hashFuction{
//use middle-squaring on each elements to the array list keys.
// Math.pow() and % 30000, which is the table size, to generate the hash value
// assign table [hashvalue]= value
}

So first, you should now what part of the program is slow. Optimizing everything is a stupid idea, optimizing the fast part is even worse.
Math.pow() and % 30000, which is the table size
This is pretty wrong.
Never use floating point operations for things like hashing. It's slow and badly distributed.
Never use a table size which is neither a power of two nor prime.
You failed to tell us anything about what you're hashing and why... so let's assume you need to map a pair of two ints into the table.
class IntPair {
private int x;
private int y;
public int hashCode() {
// the multiplier must be odd for good results
// its exact value doesn't matter much, but it mustn't equal to your table size; ideally, it should be co-prime
return 54321 * x + y;
}
public boolean equals() {
do yourself
}
}
//// Prime table size. The division is slow, but it works slightly better than power of two.
int[] table = new int[30011]; // this is a prime
int hashCodeToIndex(int hashCode) {
int nonNegative = hashCode & Integer.MAX_VALUE;
return nonNegative % table.length;
}
//// Power of two table size. No division, faster.
int[] table2 = new int[1<<15]; // this is 2**15, i.e., 32768
int smear(int hashCode) {
// doing nothing may be good enough, if the hashCode is well distributed
// otherwise, see e.g., https://github.com/google/guava/blob/c234ed7f015dc90d0380558e663f57c5c445a288/guava/src/com/google/common/collect/Hashing.java#L46
return hashCode;
}
int hashCodeToIndex(int hashCode) {
// the "&" cleans all unwanted bits
return smear(hashCode) & (table2.length - 1);
}
// an alternative, explanation upon request
int hashCodeToIndex2(int hashCode) {
return smear(hashCode) >>> 17;
}

Related

Creating combinations of a BitSet

Assume I have a Java BitSet. I now need to make combinations of the BitSet such that only Bits which are Set can be flipped. i.e. only need combinations of Bits which are set.
For Eg. BitSet - 1010, Combinations - 1010, 1000, 0010, 0000
BitSet - 1100, Combination - 1100, 1000, 0100, 0000
I can think of a few solutions E.g. I can take combinations of all 4 bits and then XOR the combinations with the original Bitset. But this would be very resource-intensive for large sparse BitSets. So I was looking for a more elegant solution.

It appears that you want to get the power set of the bit set. There is already an answer here about how to get the power set of a Set<T>. Here, I will show a modified version of the algorithm shown in that post, using BitSets:
private static Set<BitSet> powerset(BitSet set) {
Set<BitSet> sets = new HashSet<>();
if (set.isEmpty()) {
sets.add(new BitSet(0));
return sets;
}
Integer head = set.nextSetBit(0);
BitSet rest = set.get(0, set.size());
rest.clear(head);
for (BitSet s : powerset(rest)) {
BitSet newSet = s.get(0, s.size());
newSet.set(head);
sets.add(newSet);
sets.add(s);
}
return sets;
}

You can perform the operation in a single linear pass instead of recursion, if you realize the integer numbers are a computer’s intrinsic variant of “on off” patterns and iterating over the appropriate integer range will ultimately produce all possible permutations. The only challenge in your case, is to transfer the densely packed bits of an integer number to the target bits of a BitSet.
Here is such a solution:
static List<BitSet> powerset(BitSet set) {
int nBits = set.cardinality();
if(nBits > 30) throw new OutOfMemoryError(
"Not enough memory for "+BigInteger.ONE.shiftLeft(nBits)+" BitSets");
int max = 1 << nBits;
int[] targetBits = set.stream().toArray();
List<BitSet> sets = new ArrayList<>(max);
for(int onOff = 0; onOff < max; onOff++) {
BitSet next = new BitSet(set.size());
for(int bitsToSet = onOff, ix = 0; bitsToSet != 0; ix++, bitsToSet>>>=1) {
if((bitsToSet & 1) == 0) {
int skip = Integer.numberOfTrailingZeros(bitsToSet);
ix += skip;
bitsToSet >>>= skip;
}
next.set(targetBits[ix]);
}
sets.add(next);
}
return sets;
}
It uses an int value for the iteration, which is already enough to represent all permutations that can ever be stored in one of Java’s builtin collections. If your source BitSet has 2³¹ one bits, the 2³² possible combinations do not only require a hundred GB heap, but also a collection supporting 2³² elements, i.e. a size not representable as int.
So the code above terminates early if the number exceeds the capabilities, without even trying. You could rewrite it to use a long or even BigInteger instead, to keep it busy in such cases, until it will fail with an OutOfMemoryError anyway.
For the working cases, the int solution is the most efficient variant.
Note that the code returns a List rather than a HashSet to avoid the costs of hashing. The values are already known to be unique and hashing would only pay off if you want to perform lookups, i.e. call contains with another BitSet. But to test whether an existing BitSet is a permutation of your input BitSet, you wouldn’t even need to generate all permutations, a simple bit operation, e.g. andNot would tell you that already. So for storing and iterating the permutations, an ArrayList is more efficient.

Random number array with lazy initialization

For a distributed application project I want to have two instances share the same/know (pseudo-)random numbers.
This can be achieved by using the same seed for the random number generator (RNG). But this only works if both applications use the RNG output in the same order. In my case this is hard or even impossible.
Another way of doing this is would be (psedocode):
rng.setSeed(42);
int[] rndArray;
for(...) {
rndArray[i] = rng.nextInt();
}
Now both applications would have the same array of random numbers and my problem would be solved.
BUT the array would have to be large, very large. This is where the lazy initialization part comes in: How can I write a class that where rndArray.get(i) is always the same random number (depending on the seed) without generating all values between 0 and i-1?
I am using JAVA or C++, but this problem should be solvable in most programming languages.

You can use a formula based on a random seed.
e.g.
public static int generate(long seed, int index) {
Random rand = new Random(seed + index * SOME_PRIME);
return rand.nextInt();
}
This will produce the same value for a given seed and index combination. Don't expect it to be very fast however. Another approach is to use a formula like.
public static int generate(long seed, int index) {
double num = seed * 1123529253211.0 + index * 10123457689.0;
long num2 = Double.doubleToRawLongBits(num);
return (int) ((num2 >> 42) ^ (num2 >> 21) ^ num2);
}

If it's large and sparse you can use a hash table (downside: the numbers you get depend on your access pattern).
Otherwise you could recycle the solution to a problem from the Programming Pearls (search for something like "programming pearls initialize array"), but it wastes memory iirc.
Last solution I can think of, you could use a random generator which can efficiently jump to a specified position - the one at http://mathforum.org/kb/message.jspa?messageID=1519417 is decently fast, but it generates 16 numbers at a time; anything better?

Unique hashCode with two fields without order

I need a hashCode implementation in Java which ignores the order of the fields in my class Edge. It should be possible that Node first could be Node second, and second could be Node first.
Here is my method is depend on the order:
public class Edge {
private Node first, second;
#Override
public int hashCode() {
int hash = 17;
int hashMultiplikator = 79;
hash = hashMultiplikator * hash
+ first.hashCode();
hash = hashMultiplikator * hash
+ second.hashCode();
return hash;
}
}
Is there a way to compute a hash which is for the following Edges the same but unique?
Node n1 = new Node("a");
Node n2 = new Node("b");
Edge ab = new Edge(n1,n2);
Edge ba = new Edge(n2,n1);
ab.hashCode() == ba.hashCode() should be true.

You can use some sort of commutative operation instead of what you have now, like addition:
#Override
public int hashCode() {
int hash = 17;
int hashMultiplikator = 79;
int hashSum = first.hashCode() + second.hashCode();
hash = hashMultiplikator * hash * hashSum;
return hash;
}
I'd recommend that you still use the multiplier since it provides some entropy to your hash code. See my answer here, which says:
Some good rules to follow for hashing are:
Mix up your operators. By mixing your operators, you can cause the results to vary more. Using simply x * y in this test, I had a very
large number of collisions.
Use prime numbers for multiplication. Prime numbers have interesting binary properties that cause multiplication to be more volatile.
Avoid using shift operators (unless you really know what you're doing). They insert lots of zeroes or ones into the binary of the
number, decreasing volatility of other operations and potentially even
shrinking your possible number of outputs.

To solve you problem you have to combine both hashCodes of the components.
An example could be:
#Override
public int hashCode() {
int prime = 17;
return prime * (first.hashCode() + second.hashCode());
}
Please check if this matches your requirements. Also a multiplikation or an XOR insted of an addition could be possible.

Hashcode for objects with only integers

How do you in a general (and performant) way implement hashcode while minimizing collisions for objects with 2 or more integers?
update: as many stated, you cant ofcource eliminate colisions entierly (honestly didnt think about it). So my question should be how do you minimize collisions in a proper way, edited to reflect that.
Using NetBeans' autogeneration fails; for example:
public class HashCodeTest {
#Test
public void testHashCode() {
int loopCount = 0;
HashSet<Integer> hashSet = new HashSet<Integer>();
for (int outer = 0; outer < 18; outer++) {
for (int inner = 0; inner < 2; inner++) {
loopCount++;
hashSet.add(new SimpleClass(inner, outer).hashCode());
}
}
org.junit.Assert.assertEquals(loopCount, hashSet.size());
}
private class SimpleClass {
int int1;
int int2;
public SimpleClass(int int1, int int2) {
this.int1 = int1;
this.int2 = int2;
}
#Override
public int hashCode() {
int hash = 5;
hash = 17 * hash + this.int1;
hash = 17 * hash + this.int2;
return hash;
}
}
}

Can you in a general (and performant) way implement hashcode without
colisions for objects with 2 or more integers.
It is technically impossible to have zero collision when hashing to 32 bits (one integer) something made of more than 32 bits (like 2 or more integers).

This is what eclipse auto-generates:
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + getOuterType().hashCode();
result = prime * result + int1;
result = prime * result + int2;
return result;
}
And with this code your testcase passes...
PS: And don't forget to implement equals()!

There is no way to eliminate hash collisions entirely. Your approach is basically the preferred one to minimize collisions.

Creating a hash method with zero collisions is impossible. The idea of a hash method is you're taking a large set of objects and mapping it to a smaller set of integers. The best you can do is minimize the number of collisions you get within a subset of your objects.

As others have said, it's more important to minimize collisions that to eliminate them -- especially since you didn't say how many buckets you're aiming for. It's going to be much easier to have zero collisions with 5 items in 1000 buckets than if you have 5 items in 2 buckets! And even if there are plenty of buckets, your collisions could look very different with 1000 buckets vs 1001.
Another thing to note is that there's a good chance that the hash you provide won't even be the one the HashMap eventually uses. If you take a look at the OpenJDK HashMap code, for instance, you'll see that your keys' hashCodes are put through a private hash method (line 264 in that link) which re-hashes them. So, if you're going through the trouble of creating a carefully constructed custom hash function to reduce collisions (rather than just a simple, auto-generated one), make sure you also understand who's going to use it, and how.

Dealing with overflow in Java without using BigInteger

Suppose I have a method to calculate combinations of r items from n items:
public static long combi(int n, int r) {
if ( r == n) return 1;
long numr = 1;
for(int i=n; i > (n-r); i--) {
numr *=i;
}
return numr/fact(r);
}
public static long fact(int n) {
long rs = 1;
if(n <2) return 1;
for (int i=2; i<=n; i++) {
rs *=i;
}
return rs;
}
As you can see it involves factorial which can easily overflow the result. For example if I have fact(200) for the foctorial method I get zero. The question is why do I get zero?
Secondly how do I deal with overflow in above context? The method should return largest possible number to fit in long if the result is too big instead of returning wrong answer.
One approach (but this could be wrong) is that if the result exceed some large number for example 1,400,000,000 then return remainder of result modulo
1,400,000,001. Can you explain what this means and how can I do that in Java?
Note that I do not guarantee that above methods are accurate for calculating factorial and combinations. Extra bonus if you can find errors and correct them.
Note that I can only use int or long and if it is unavoidable, can also use double. Other data types are not allowed.
I am not sure who marked this question as homework. This is NOT homework. I wish it was homework and i was back to future, young student at university. But I am old with more than 10 years working as programmer. I just want to practice developing highly optimized solutions in Java. In our times at university, Internet did not even exist. Today's students are lucky that they can even post their homework on site like SO.

Use the multiplicative formula, instead of the factorial formula.

Since its homework, I won't want to just give you a solution. However a hint I will give is that instead of calculating two large numbers and dividing the result, try calculating both together. e.g. calculate the numerator until its about to over flow, then calculate the denominator. In this last step you can chose the divide the numerator instead of multiplying the denominator. This stops both values from getting really large when the ratio of the two is relatively small.
I got this result before an overflow was detected.
combi(61,30) = 232714176627630544 which is 2.52% of Long.MAX_VALUE
The only "bug" I found in your code is not having any overflow detection, since you know its likely to be a problem. ;)

To answer your first question (why did you get zero), the values of fact() as computed by modular arithmetic were such that you hit a result with all 64 bits zero! Change your fact code to this:
public static long fact(int n) {
long rs = 1;
if( n <2) return 1;
for (int i=2; i<=n; i++) {
rs *=i;
System.out.println(rs);
}
return rs;
}
Take a look at the outputs! They are very interesting.
Now onto the second question....
It looks like you want to give exact integer (er, long) answers for values of n and r that fit, and throw an exception if they do not. This is a fair exercise.
To do this properly you should not use factorial at all. The trick is to recognize that C(n,r) can be computed incrementally by adding terms. This can be done using recursion with memoization, or by the multiplicative formula mentioned by Stefan Kendall.
As you accumulate the results into a long variable that you will use for your answer, check the value after each addition to see if it goes negative. When it does, throw an exception. If it stays positive, you can safely return your accumulated result as your answer.
To see why this works consider Pascal's triangle
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
which is generated like so:
C(0,0) = 1 (base case)
C(1,0) = 1 (base case)
C(1,1) = 1 (base case)
C(2,0) = 1 (base case)
C(2,1) = C(1,0) + C(1,1) = 2
C(2,2) = 1 (base case)
C(3,0) = 1 (base case)
C(3,1) = C(2,0) + C(2,1) = 3
C(3,2) = C(2,1) + C(2,2) = 3
...
When computing the value of C(n,r) using memoization, store the results of recursive invocations as you encounter them in a suitable structure such as an array or hashmap. Each value is the sum of two smaller numbers. The numbers start small and are always positive. Whenever you compute a new value (let's call it a subterm) you are adding smaller positive numbers. Recall from your computer organization class that whenever you add two modular positive numbers, there is an overflow if and only if the sum is negative. It only takes one overflow in the whole process for you to know that the C(n,r) you are looking for is too large.
This line of argument could be turned into a nice inductive proof, but that might be for another assignment, and perhaps another StackExchange site.
ADDENDUM
Here is a complete application you can run. (I haven't figured out how to get Java to run on codepad and ideone).
/**
* A demo showing how to do combinations using recursion and memoization, while detecting
* results that cannot fit in 64 bits.
*/
public class CombinationExample {
/**
* Returns the number of combinatios of r things out of n total.
*/
public static long combi(int n, int r) {
long[][] cache = new long[n + 1][n + 1];
if (n < 0 || r > n) {
throw new IllegalArgumentException("Nonsense args");
}
return c(n, r, cache);
}
/**
* Recursive helper for combi.
*/
private static long c(int n, int r, long[][] cache) {
if (r == 0 || r == n) {
return cache[n][r] = 1;
} else if (cache[n][r] != 0) {
return cache[n][r];
} else {
cache[n][r] = c(n-1, r-1, cache) + c(n-1, r, cache);
if (cache[n][r] < 0) {
throw new RuntimeException("Woops too big");
}
return cache[n][r];
}
}
/**
* Prints out a few example invocations.
*/
public static void main(String[] args) {
String[] data = ("0,0,3,1,4,4,5,2,10,0,10,10,10,4,9,7,70,8,295,100," +
"34,88,-2,7,9,-1,90,0,90,1,90,2,90,3,90,8,90,24").split(",");
for (int i = 0; i < data.length; i += 2) {
int n = Integer.valueOf(data[i]);
int r = Integer.valueOf(data[i + 1]);
System.out.printf("C(%d,%d) = ", n, r);
try {
System.out.println(combi(n, r));
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
}
Hope it is useful. It's just a quick hack so you might want to clean it up a little.... Also note that a good solution would use proper unit testing, although this code does give nice output.

You can use the java.math.BigInteger class to deal with arbitrarily large numbers.

If you make the return type double, it can handle up to fact(170), but you'll lose some precision because of the nature of double (I don't know why you'd need exact precision for such huge numbers).
For input over 170, the result is infinity

Note that java.lang.Long includes constants for the min and max values for a long.
When you add together two signed 2s-complement positive values of a given size, and the result overflows, the result will be negative. Bit-wise, it will be the same bits you would have gotten with a larger representation, only the high-order bit will be truncated away.
Multiplying is a bit more complicated, unfortunately, since you can overflow by more than one bit.
But you can multiply in parts. Basically you break the to multipliers into low and high halves (or more than that, if you already have an "overflowed" value), perform the four possible multiplications between the four halves, then recombine the results. (It's really just like doing decimal multiplication by hand, but each "digit" is, say, 32 bits.)

You can copy the code from java.math.BigInteger to deal with arbitrarily large numbers. Go ahead and plagiarize.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java implementation-How to improve the speed of hash table - java

Related

Creating combinations of a BitSet

Random number array with lazy initialization

Unique hashCode with two fields without order

Hashcode for objects with only integers

Dealing with overflow in Java without using BigInteger

Categories

Resources