When a key's hash code is calculated, the spread() method is called:
static final int spread(int h) {
    return (h ^ (h >>> 16)) & HASH_BITS;
}
where HASH_BITS equals 0x7fffffff. So, what is the purpose of HASH_BITS? Someone says it forces the sign bit to 0, but I am not sure about that.
The index of a KV Node in the hash buckets is calculated by the following formula:
index = (n - 1) & hash
where hash is the result of spread() and n is the length of the hash-bucket table, whose maximum is 2^30:
private static final int MAXIMUM_CAPACITY = 1 << 30;
So the maximum of n - 1 is 2^30 - 1, which means the top bit of hash will never be used in the index calculation.
But I still don't understand why it is necessary to clear the top bit of hash to 0. It seems there must be more reasons for doing so.
/**
* Spreads (XORs) higher bits of hash to lower and also forces top
* bit to 0. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int spread(int h) {
    return (h ^ (h >>> 16)) & HASH_BITS;
}
I think it is to avoid collision with the reserved hash codes MOVED (-1), TREEBIN (-2) and RESERVED (-3), whose sign bits are always 1.
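That reading matches the constants in the OpenJDK ConcurrentHashMap source. Here is a small self-contained sketch (the class name is mine) showing that spread() always produces a non-negative value and therefore can never collide with the negative sentinel hashes:

```java
public class SpreadDemo {
    // constants as defined in java.util.concurrent.ConcurrentHashMap
    static final int HASH_BITS = 0x7fffffff; // usable bits of normal node hash
    static final int MOVED     = -1;         // hash for forwarding nodes
    static final int TREEBIN   = -2;         // hash for roots of trees
    static final int RESERVED  = -3;         // hash for transient reservations

    static final int spread(int h) {
        return (h ^ (h >>> 16)) & HASH_BITS;
    }

    public static void main(String[] args) {
        // even hash codes with the sign bit set come out non-negative
        System.out.println(spread(-1));                // prints 2147418112
        System.out.println(spread(Integer.MIN_VALUE)); // prints 32768
        // so a key's spread hash can never equal one of the sentinels
        for (int h : new int[] { 0, -1, -2, -3, Integer.MIN_VALUE, 0xdeadbeef }) {
            int s = spread(h);
            if (s < 0 || s == MOVED || s == TREEBIN || s == RESERVED)
                throw new AssertionError("collision with sentinel for " + h);
        }
    }
}
```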
I would like to convert integers of arbitrary length, represented in binary format, to their ASCII decimal form.
One example: for the integer 33023, the hexadecimal byte representation is 0x80FF. I would like to convert 0x80FF into the ASCII form of 33023, which has the hexadecimal representation 0x3333303233.
I am working in a Java Card environment, which does not recognize the String type, so I would have to do the conversion manually via binary manipulation.
What is the most efficient way to go about solving this, as the Java Card environment on a 16-bit smart card is very constrained?
This is trickier than you may think, as it requires base conversion, and base conversion is performed over the entire number, using big-integer arithmetic.
That of course doesn't mean we cannot create an efficient implementation of said big-integer arithmetic specifically for this purpose. Here is an implementation that left-pads with zeros (which is usually required on Java Card) and uses no additional memory (!). You may have to copy the original value of the big endian number if you want to keep it, though - the input value is overwritten. Putting it in RAM is highly recommended.
This code simply divides the bytes by the new base (10 for decimals), returning the remainder. The remainder is the next lowest digit. As the input value has now been divided, the next remainder is the digit one position more significant than the one before. It keeps dividing and returning the remainder until the value is zero and the calculation is complete.
The tricky part of the algorithm is the inner loop, which divides the value by 10 in place while returning the remainder, using tail division over bytes. It produces one remainder / decimal digit per run. This also means that the order of the function is O(n), where n is the number of digits in the result (counting the tail division as a single operation). Note that n can be calculated as ceil(bigNumBytes * log_10(256)); these results are precalculated in the BYTES_TO_DECIMAL_SIZE table. log_10(256) is of course a constant decimal value, just over 2.408.
Here is the final code with optimizations (see the edit for different versions):
/**
* Converts an unsigned big endian value within the buffer to the same value
* stored using ASCII digits. The ASCII digits may be zero padded, depending
* on the value within the buffer.
* <p>
* <strong>Warning:</strong> this method zeros the value in the buffer that
* contains the original number. It is strongly recommended that the input
* value is in fast transient memory as it will be overwritten multiple
* times - until it is all zero.
* </p>
* <p>
* <strong>Warning:</strong> this method fails if not enough bytes are
* available in the output BCD buffer while destroying the input buffer.
* </p>
* <p>
* <strong>Warning:</strong> the big endian number can only occupy 16 bytes
* or less for this implementation.
* </p>
*
* @param uBigBuf
*            the buffer containing the unsigned big endian number
* @param uBigOff
*            the offset of the unsigned big endian number in the buffer
* @param uBigLen
*            the length of the unsigned big endian number in the buffer
* @param decBuf
*            the buffer that is to receive the BCD encoded number
* @param decOff
*            the offset in the buffer to receive the BCD encoded number
* @return decLen, the length in the buffer of the received BCD encoded
*         number
*/
public static short toDecimalASCII(byte[] uBigBuf, short uBigOff,
        short uBigLen, byte[] decBuf, short decOff) {
    // variables required to perform long division by 10 over bytes
    // possible optimization: reuse remainder for dividend (yuk!)
    short dividend, division, remainder;
    // calculate stuff outside of the loop
    final short uBigEnd = (short) (uBigOff + uBigLen);
    final short decDigits = BYTES_TO_DECIMAL_SIZE[uBigLen];
    // --- basically perform division by 10 in a loop, storing the remainder
    // traverse from right (least significant) to left for the decimals
    for (short decIndex = (short) (decOff + decDigits - 1); decIndex >= decOff; decIndex--) {
        // --- the following code performs tail division by 10 over bytes
        // clear the remainder at the start of the division
        remainder = 0;
        // traverse from left (most significant) to right for the input
        for (short uBigIndex = uBigOff; uBigIndex < uBigEnd; uBigIndex++) {
            // get the rest of the previous result times 256 (bytes are base 256)
            // ... and add the next positive byte value
            // optimization: shift by 8 positions instead of multiplying
            dividend = (short) ((remainder << 8) + (uBigBuf[uBigIndex] & 0xFF));
            // do the division
            division = (short) (dividend / 10);
            // optimization: perform the modular calculation using
            // ... subtraction and multiplication
            // ... instead of calculating the remainder directly
            remainder = (short) (dividend - division * 10);
            // store the result in place for the next iteration
            uBigBuf[uBigIndex] = (byte) division;
        }
        // the remainder is what we were after
        // add '0' to create an ASCII digit
        decBuf[decIndex] = (byte) (remainder + '0');
    }
    return decDigits;
}
/*
 * pre-calculated array storing the number of decimal digits for a big
 * endian encoded number of len bytes: ceil(len * log_10(256))
 */
private static final byte[] BYTES_TO_DECIMAL_SIZE = { 0, 3, 5, 8, 10, 13,
        15, 17, 20, 22, 25, 27, 29, 32, 34, 37, 39 };
To extend the input size simply calculate and store the next decimal sizes in the table...
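For illustration, here is a self-contained harness (plain Java SE rather than an actual Java Card toolchain; the class name is mine) that runs the conversion on the 0x80FF example from the question and also verifies that the table entries agree with the ceil(len * log_10(256)) formula:

```java
public class AsciiConversionDemo {
    // number of decimal digits for a big endian number of len bytes:
    // ceil(len * log_10(256)) -- the same values as the table above
    private static final byte[] BYTES_TO_DECIMAL_SIZE = { 0, 3, 5, 8, 10, 13,
            15, 17, 20, 22, 25, 27, 29, 32, 34, 37, 39 };

    // the tail-division conversion, unchanged from the answer above
    public static short toDecimalASCII(byte[] uBigBuf, short uBigOff,
            short uBigLen, byte[] decBuf, short decOff) {
        short dividend, division, remainder;
        final short uBigEnd = (short) (uBigOff + uBigLen);
        final short decDigits = BYTES_TO_DECIMAL_SIZE[uBigLen];
        for (short decIndex = (short) (decOff + decDigits - 1); decIndex >= decOff; decIndex--) {
            remainder = 0;
            for (short uBigIndex = uBigOff; uBigIndex < uBigEnd; uBigIndex++) {
                dividend = (short) ((remainder << 8) + (uBigBuf[uBigIndex] & 0xFF));
                division = (short) (dividend / 10);
                remainder = (short) (dividend - division * 10);
                uBigBuf[uBigIndex] = (byte) division;
            }
            decBuf[decIndex] = (byte) (remainder + '0');
        }
        return decDigits;
    }

    public static void main(String[] args) {
        // 0x80FF == 33023; note the input buffer is zeroed by the call
        byte[] in = { (byte) 0x80, (byte) 0xFF };
        byte[] out = new byte[BYTES_TO_DECIMAL_SIZE[2]];
        short len = toDecimalASCII(in, (short) 0, (short) 2, out, (short) 0);
        System.out.println(new String(out, 0, len)); // prints "33023"

        // the table entries agree with ceil(len * log_10(256))
        for (int i = 1; i < BYTES_TO_DECIMAL_SIZE.length; i++) {
            if (BYTES_TO_DECIMAL_SIZE[i] != (byte) Math.ceil(i * Math.log10(256)))
                throw new AssertionError("table mismatch at " + i);
        }
    }
}
```

The same ceil formula can be used to generate the additional table entries for larger inputs.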
I see an LCG implementation in Java under Random class as shown below:
/*
* This is a linear congruential pseudorandom number generator, as
* defined by D. H. Lehmer and described by Donald E. Knuth in
* <i>The Art of Computer Programming,</i> Volume 3:
* <i>Seminumerical Algorithms</i>, section 3.2.1.
*
* @param bits random bits
* @return the next pseudorandom value from this random number
*         generator's sequence
* @since 1.1
*/
protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
}
But the link below says that an LCG should be of the form x2 = (a*x1 + b) mod M:
https://math.stackexchange.com/questions/89185/what-does-linear-congruential-mean
But the above code does not look like that form. Instead it uses & in place of the modulo operation, as in this line:
nextseed = (oldseed * multiplier + addend) & mask;
Can somebody help me understand this approach of using & instead of modulo operation?
Bitwise-ANDing with a mask of the form 2^n - 1 is the same as computing the number modulo 2^n: any 1-bits higher up in the number contribute multiples of 2^n and so can safely be discarded. Note, however, that some multiplier/addend combinations work very poorly if you make the modulus a power of two (rather than a power of two minus one). The code shown is fine, but make sure the approach is appropriate for your constants.
This can be used if mask + 1 is a power of 2.
For instance, if you want to do modulo 4, you can write x & 3 instead of x % 4 to obtain the same result.
Note, however, that this requires x to be non-negative.
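A quick sketch of the equivalence, and of where it breaks down for negative operands (class name is mine):

```java
public class MaskModDemo {
    public static void main(String[] args) {
        int m = 16; // modulus, a power of two; the mask is m - 1
        // for non-negative x the two expressions agree
        for (int x = 0; x < 1000; x++) {
            if ((x & (m - 1)) != (x % m))
                throw new AssertionError("mismatch at " + x);
        }
        // for negative operands they differ: % keeps the sign,
        // while & always yields the non-negative low bits
        System.out.println(-7 % 16); // prints -7
        System.out.println(-7 & 15); // prints 9
    }
}
```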
I am working on a problem where execution times are becoming too large, and now I'm looking for possible optimizations.
The question: is there any (considerable) difference in performance between using String or Integer as the hash key?
The problem: I have a graph whose nodes are stored in a hashtable with String keys. For example, the keys are "0011" or "1011", etc. I could convert these to integers as well, if that would mean an improvement in execution time.
Integer will perform better than String. Below is the hash code computation for both.
Integer hash code implementation
/**
* Returns a hash code for this <code>Integer</code>.
*
* @return a hash code value for this object, equal to the
* primitive <code>int</code> value represented by this
* <code>Integer</code> object.
*/
public int hashCode() {
    return value;
}
String hash code implementation
/**
* Returns a hash code for this string. The hash code for a
* <code>String</code> object is computed as
* <blockquote><pre>
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* </pre></blockquote>
* using <code>int</code> arithmetic, where <code>s[i]</code> is the
* <i>i</i>th character of the string, <code>n</code> is the length of
* the string, and <code>^</code> indicates exponentiation.
* (The hash value of the empty string is zero.)
*
* @return a hash code value for this object.
*/
public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;
        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
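To see the difference concretely, this small sketch (class name is mine) recomputes the documented polynomial by hand and compares it to String.hashCode(); the String computation walks every character, while Integer.hashCode() is a single field read:

```java
public class StringHashDemo {
    public static void main(String[] args) {
        String s = "abc";
        // recompute s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] by hand
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        System.out.println(h + " " + s.hashCode()); // prints "96354 96354"
        if (h != s.hashCode())
            throw new AssertionError("polynomial disagrees with hashCode()");
    }
}
```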
If you have a performance problem, it's quite unlikely that the issue is due to the HashMap/Hashtable. While hashing a string is slightly more expensive than hashing an integer, the difference is rather small, and the hash code is cached (so it's not recalculated if you reuse the same string object), so you are unlikely to get any significant performance benefit from converting to integers first.
It's probably more fruitful to look somewhere else for the source of your performance issue. Have you tried profiling your code yet?
There is a difference in speed. A HashMap uses hashCode to calculate the bucket, and the Integer implementation is much simpler than the String one.
Having said that, if you are having problems with execution times, you need to do some proper measurements and profiling yourself. That's the only way to find out what the problem is with the execution time, and using Integers instead of Strings will usually only have a minimal effect on performance, meaning that your performance problem might be elsewhere.
For example, look at this post if you want to do some proper micro benchmarks. There are many other resources available for profiling etc..
Here is an implementation of HashMap.
It provides this code for getting index of the bin:
private int getIndex(K key)
{
    int hash = key.hashCode() % nodes.length;
    if (hash < 0)
        hash += nodes.length;
    return hash;
}
To make sure the hash value is not bigger than the size of the table,
the result of the user provided hash function is used modulo the
length of the table. We need the index to be non-negative, but the
modulus operator (%) will return a negative number if the left operand
(the hash value) is negative, so we have to test for it and make it
non-negative.
If hash turns out to be very big negative value, the additions hash += nodes.length in the cycle may take a lot of processing.
I think there should be an O(1) algorithm for it (independent of the hash value).
If so, how can it be achieved?
It can't be a very big negative value.
The result of anything % nodes.length is always less than nodes.length in absolute value, so you need a single if, not a loop. This is exactly what the code does:
if (hash < 0) /* `if', not `while' */
    hash += nodes.length;
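For illustration, here is that correction as a minimal standalone sketch (class and method names are mine):

```java
public class IndexDemo {
    // One addition is always enough: (h % length) lies strictly
    // between -length and +length, so adding length at most once
    // brings it into [0, length).
    static int getIndex(int hashCode, int length) {
        int hash = hashCode % length;
        if (hash < 0)
            hash += length;
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(getIndex(-7, 10));                // prints 3
        System.out.println(getIndex(Integer.MIN_VALUE, 10)); // prints 2
        System.out.println(getIndex(42, 10));                // prints 2
    }
}
```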
This is not the approach HashMap uses in reality:
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
This works because length is always a power of 2, and for such lengths this is the same as an unsigned % length.
If hash turns out to be very big negative value, the additions hash += nodes.length in the cycle may take a lot of processing.
The hash at this point must be between -length + 1 and length - 1, so it cannot be a very large negative value, and the code wouldn't work if it could be. In any case, it doesn't matter how large the value is; the cost is always the same.
I am wondering: if we implement our own hash map that doesn't use a power-of-two table length (for the initial capacity and whenever we resize), can we then just take the object's hash code modulo the table size directly, instead of first applying a supplemental hash function to the object's hash code?
for example
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    // int hash = hash(key.hashCode()); // the original way
    // can we just use the key's hash code if our table length is not a power of two?
    int hash = key.hashCode();
    int i = indexFor(hash, table.length);
    ...
    ...
}
Presuming we're talking about OpenJDK 7, the additional hash is used to stimulate avalanching; it is a mixing function. It is used because the mapping from a hash to a bucket, since we're using a power of 2 for the capacity, is a mere bitwise & (a % b is equivalent to a & (b - 1) iff b is a power of 2); this means only the lower bits matter, so applying this mixing step helps protect against poorer hashes.
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
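A small demonstration (class name is mine) of what the mixing buys: two hash codes that differ only in bits above the mask collide without it, but land in different buckets after it:

```java
public class MixDemo {
    // OpenJDK 7's supplemental hash, copied from the answer above
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int mask = 15;                // table of length 16
        int a = 0x10000, b = 0x20000; // differ only in bits above the mask

        // without mixing, both keys land in bucket 0
        System.out.println((a & mask) + " " + (b & mask)); // prints "0 0"
        // with mixing, the high bits influence the bucket
        System.out.println((hash(a) & mask) + " " + (hash(b) & mask)); // prints "1 2"
    }
}
```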
If you want to use sizes that aren't powers of 2, the above may not be needed.
Actually changing the mapping from hashes to buckets (which normally relies on the capacity being a power of 2) will require you to look at indexFor:
static int indexFor(int h, int length) {
    return h & (length-1);
}
You could use (h & 0x7fffffff) % length here.
You can think of the mod function as a simple form of hash function. It maps a large range of values onto a smaller space. Assuming the original hash code is well designed, I see no reason why a mod cannot be used to map the hash code onto an index for the table size you are using.
If your original hash function is not well implemented - e.g. it always returns an even number - you will create quite a lot of collisions using just a mod function as your hash function.
This is true; you can pick pseudo-prime table sizes instead to mitigate it.
Note: indexFor would need to use % compensated for the sign, instead of a simple &, which can actually make the lookup slower:
indexFor = (h & Integer.MAX_VALUE) % length
// or
indexFor = Math.abs(h % length)
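A short sketch (class name is mine) showing that both variants land in [0, length) for negative hash codes, though they may pick different (but equally valid) buckets:

```java
public class NonPowerOfTwoIndex {
    public static void main(String[] args) {
        int length = 10; // a non-power-of-two table size
        int h = -7;      // a negative hash code

        // variant 1: clear the sign bit first, then take the remainder
        int i1 = (h & Integer.MAX_VALUE) % length;
        // variant 2: take the remainder's magnitude; safe because
        // |h % length| < length, unlike Math.abs(h) % length, which
        // breaks for Integer.MIN_VALUE
        int i2 = Math.abs(h % length);

        System.out.println(i1 + " " + i2); // prints "1 7"
        if (i1 < 0 || i1 >= length || i2 < 0 || i2 >= length)
            throw new AssertionError("index out of range");
    }
}
```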