Logic of code Guava's mightContain - java

I was going through the code of the Guava library and wanted to understand the probabilistic match code in mightContain. Could anyone explain what the code is doing, especially with the bitwise operators?
Here is the code:
public <T> boolean mightContain(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
    long hash64 = Hashing.murmur3_128().newHasher().putObject(object, funnel).hash().asLong();
    int hash1 = (int) hash64;
    int hash2 = (int) (hash64 >>> 32);
    for (int i = 1; i <= numHashFunctions; i++) {
        int nextHash = hash1 + i * hash2;
        if (nextHash < 0) {
            nextHash = ~nextHash;
        }
        // up to here, the code is identical with the previous method
        if (!bits.get(nextHash % bits.size())) {
            return false;
        }
    }
    return true;
}

Assuming this is code from the BloomFilter class, the logic goes like this:
Given the key, perform all of the chosen hashes on that key. Use each hash to pick a bit number and check if that bit is set. If any bits are not set in the filter at that position then this key cannot have been added.
If all of the bits are found to be set then we can only say that the filter might have had the key added. This is because it is possible for a different key (or a combination of a number of different keys) to result in all of the checked bits being set.
Note that adding a key to the filter does almost exactly the same thing, except that it sets all of the bits generated.
A Bloom Filter object operates as follows.
A number of hash functions are chosen, each will calculate the location of a bit in the filter. (see Optimal number of hash functions for discussion on how many).
Hold a bit pattern of arbitrary length. The exact length is not critical, but it should be big enough (see Probability of false positives for a discussion on what big enough means).
Each time a key is added to the filter, all configured hash functions are performed on the key resulting in a number of bits being set in the pattern.
To check if a key has already been added, perform all of the hash functions and check the bits found there. If any are found to be zero then this key certainly has not been added to the filter.
If all bits are found to be set then it may be that this key has been added. You will need to perform further checks to confirm.
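The steps above can be sketched in a few lines. This is only a minimal illustration, not Guava's implementation: the class name TinyBloomFilter is made up, keys are limited to String, and the second hash is just a scrambled copy of String.hashCode() rather than half of a murmur3 hash.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch for String keys (illustrative, not Guava's code). */
final class TinyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    TinyBloomFilter(int numBits, int numHashFunctions) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** Derives the i-th bit index from two base hashes (double hashing). */
    private int index(String key, int i) {
        int hash1 = key.hashCode();
        // Assumption: a cheap second hash obtained by scrambling the first one.
        int hash2 = Integer.rotateLeft(hash1 * 0x9E3779B9, 15) | 1;
        int combined = hash1 + i * hash2;
        if (combined < 0) {
            combined = ~combined; // flip all bits so the value is non-negative
        }
        return combined % numBits;
    }

    /** Adding a key sets every bit chosen by the hash functions. */
    void put(String key) {
        for (int i = 1; i <= numHashFunctions; i++) {
            bits.set(index(key, i));
        }
    }

    /** Checking a key tests the same bits that put would have set. */
    boolean mightContain(String key) {
        for (int i = 1; i <= numHashFunctions; i++) {
            if (!bits.get(index(key, i))) {
                return false; // a cleared bit proves the key was never added
            }
        }
        return true; // all bits set: possibly added, possibly a false positive
    }
}
```

A freshly created filter answers false for every key, and after put("apple") it always answers true for "apple". Other keys may also answer true once enough bits are set, which is the false positive case described above.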

There are only two bitwise operators here: >>> and ~.
The >>> is the "right shift, don't carry sign bit" operator. In Java, by default, if you shift:
1000 1100
right by 3 (using >>) you will obtain:
1111 0001
Using >>> which does not carry the sign bit you will get:
0001 0001
The second (~) is bitwise negation, which flips every bit. Since ~x equals -x - 1, it is a simple way to obtain a non-negative number from a negative one, and the code needs a non-negative value here because the result is used as an index into the bit array. Applying this operator to:
1100 1010
which is a negative byte in Java will yield:
0011 0101
which is positive.
Basically, this code creates a hash of the object using a fast hash function, uses it to derive a series of positions in a BitArray (no idea what that is, probably an internal structure of BloomFilter), and reports that the object is definitely NOT present as soon as one of those positions is not set.
I suspect the BitArray is updated each time you add to the BloomFilter (using .put(), or .putAll()).
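To see the two operators in action, here is a small sketch (the class and method names are made up for illustration). One thing to keep in mind: Java promotes byte operands to int before shifting, so the demo uses the int -116, whose low eight bits are the 1000 1100 pattern from the example above.

```java
final class ShiftDemo {
    /** Pads a binary string to 32 digits so the leading bits are visible. */
    static String bits(int value) {
        return String.format("%32s", Integer.toBinaryString(value)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int negative = -116; // low byte is 1000 1100, upper 24 bits are all 1s

        // Arithmetic shift: copies of the sign bit are shifted in from the left.
        System.out.println(bits(negative >> 3));  // low byte is 1111 0001

        // Logical shift: zeros are shifted in from the left.
        System.out.println(bits(negative >>> 3)); // top three bits are 0

        // Bitwise negation: ~x == -x - 1, so a negative int becomes non-negative.
        System.out.println(~negative);            // 115
    }
}
```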

Related

Why use 1<<4 instead of 16?

The OpenJDK code for java.util.HashMap includes the following line:
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
Why is 1 << 4 used here, and not 16? I'm curious.
Writing 1 << 4 instead of 16 doesn't change the behavior here. It's done to emphasize that the number is a power of two, and not a completely arbitrary choice. It thus reminds developers experimenting with different numbers that they should stick to the pattern (e.g., use 1 << 3 or 1 << 5, not 20) so they don't break all the methods which rely on it being a power of two. There is a comment just above:
/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
No matter how big a java.util.HashMap grows, its table capacity (array length) is maintained as a power of two. This allows the use of a fast bitwise AND operation (&) to select the bucket index where an object is stored, as seen in methods that access the table:
final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) { // <-- bitwise 'AND' here
        ...
There, n is the table capacity, and (n - 1) & hash wraps the hash value to fit that range.
More detail
A hash table has an array of 'buckets' (HashMap calls them Node), where each bucket stores zero or more key-value pairs of the map.
Every time we get or put a key-value pair, we compute the hash of the key. The hash is some arbitrary (maybe huge) number. Then we compute a bucket index from the hash, to select where the object is stored.
Hash values bigger than the number of buckets are "wrapped around" to fit the table. For example, with a table capacity of 100 buckets, the hash values 5, 105, 205, would all be stored in bucket 5. Think of it like degrees around a circle, or hours on a clock face.
(Hashes can also be negative. A value of -95 could correspond to bucket 5, or 95, depending on how it was implemented. The exact formula doesn't matter, so long as it distributes hashes roughly evenly among the buckets.)
If our table capacity n were not a power of two, the formula for the bucket would be Math.abs(hash % n), which uses the modulo operator to calculate the remainder after division by n, and uses abs to fix negative values. That would work, but be slower.
Why slower? Imagine an example in decimal, where you have some random hash value 12,459,217, and an arbitrary table length of 1,234. It's not obvious that 12459217 % 1234 happens to be 753. It's a lot of long division. But if your table length is an exact power of ten, the result of 12459217 % 1000 is simply the last 3 digits: 217.
Written in binary, a power of two is a 1 followed by some number of 0s, so the equivalent trick is possible. For example, if the capacity n is decimal 16, that's binary 10000. So, n - 1 is binary 1111, and (n - 1) & hash keeps only the last bits of the hash corresponding to those 1s, zeroing the rest. This also zeroes the sign bit, so the result cannot be negative. The result is from 0 to n-1, inclusive. That's the bucket index.
Even as CPUs get faster and their multimedia capabilities have improved, integer division is still one of the most expensive single-instruction operations you can do. It can be 50 times slower than a bitwise AND, and avoiding it in frequently executed loops can give real improvements.
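A quick way to convince yourself that the mask and the remainder agree for a power-of-two capacity is to compute both. This is just a sketch with made-up names; it uses Math.floorMod instead of the Math.abs(hash % n) mentioned above, so that both methods return the same bucket for negative hashes too.

```java
final class BucketIndexDemo {
    /** Bucket index for a power-of-two capacity: keep only the low bits. */
    static int maskIndex(int hash, int capacity) {
        return (capacity - 1) & hash;
    }

    /** Bucket index for any positive capacity, using a non-negative remainder. */
    static int modIndex(int hash, int capacity) {
        return Math.floorMod(hash, capacity);
    }

    public static void main(String[] args) {
        int capacity = 1 << 4; // 16, a power of two
        for (int hash : new int[] {5, 21, -95, Integer.MIN_VALUE, 12459217}) {
            // Both columns are identical as long as capacity is a power of two.
            System.out.println(hash + " -> " + maskIndex(hash, capacity)
                    + " / " + modIndex(hash, capacity));
        }
    }
}
```

With a capacity that is not a power of two (say 20), maskIndex no longer matches modIndex, which is exactly why the constant carries the MUST be a power of two comment.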
I can't read the developer's mind, but we do things like that to indicate a relationship between the numbers.
Compare this:
int day = 86400;
vs
int day = 60 * 60 * 24; // 86400
The second example clearly shows the relationship between the numbers, and Java is smart enough to compile that as a constant.
I think the reason is that a developer can very easily change the value (the Javadoc says 'The default initial capacity - MUST be a power of two.'), for example to 1 << 5 or 1 << 3, without needing to do any calculations.

Using DHT to lookup stuff. SHA-1. Chord protocol

I'm trying to implement the Chord protocol in order to quickly look up some nodes and keys in a small network. What I can't figure out is this: Chord considers the nodes and keys as being placed on a circle, with their placement dictated by the hash values obtained by applying the SHA-1 hash function. How exactly do I operate on those values? Do I treat them as a string such as de9f2c7f d25e1b3a fad3e85a 0bd17d9b 100db4b3 and then compare them as such, considering that "a" < "b" is true? Or how? How do I know if a key is before or after another?
Since the keyspace is a ring, a single value can't be said to be greater than another, because if you go the other way around the ring, the opposite is true. You can say a value is within a range or not. In the Chord DHT, each server is responsible for the keys within the range of values between it and its predecessor.
I would advise against using strings for the hash values. You shouldn't use the hashCode function for distributed systems, but you do need to do math on the hash keys when adding new nodes. You could try converting the hashes into BigIntegers instead.
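As a rough sketch of the "within a range" test with BigInteger: the class and method names below are made up, and the half-open interval (from, to] is an assumption chosen to match a node being responsible for the keys between its predecessor and itself.

```java
import java.math.BigInteger;

final class ChordRing {
    /** Parses a SHA-1 hex string as a non-negative integer. */
    static BigInteger fromHex(String sha1Hex) {
        return new BigInteger(sha1Hex, 16);
    }

    /** True if key lies in the ring interval (from, to], wrapping past zero if needed. */
    static boolean inInterval(BigInteger key, BigInteger from, BigInteger to) {
        if (from.compareTo(to) < 0) {
            // Ordinary case: the interval does not cross zero.
            return key.compareTo(from) > 0 && key.compareTo(to) <= 0;
        }
        // Wrapped case (from == to is treated as the whole ring).
        return key.compareTo(from) > 0 || key.compareTo(to) <= 0;
    }
}
```

A node with identifier n and predecessor p would then own a key k whenever inInterval(k, p, n) is true, regardless of whether the range crosses zero.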
SHA-1 hashes are not strings but very large numbers, usually written in hex. They are often stored as strings because they would otherwise require a native 160-bit number type. They are built from five 32-bit words that are then often 'strung' together.
Using SHA-1 strings as the numbers they represent is not hard, but it requires a library that can handle such large numbers (like BigInt or bcmath). These libraries work by processing the digits in the string one column at a time from right to left, much like a person adding, multiplying, or dividing with pen and paper. They typically have functions for common math as well as comparisons, and often take strings as arguments. Also, make sure that you use a big-number function any time you need to go from hex to decimal, or else your 160-bit hex number will likely get rounded into a 64-bit float or similar and lose most of its accuracy.
Greater-than and less-than comparisons are used in Chord to work out ranges, but they do so using modulo so that they 'wrap', making ranges such as [64, 2] possible. The actual formula is
fingers[k] = find_successor((n + 2^(k-1)) mod 2^160)
where 'n' is the SHA-1 of a node and 'k' is the finger number.
Remember, 'n' will be hex while 'k' and 'mod 2^160' will typically be decimal, so this is where your BigInt hex-to-decimal conversion will be needed.
Even if your programming framework lets you create these variables as hex, 160 is specifically a decimal number (literally meaning one hundred and sixty bits), and besides, wrapping your brain around 'mod 2^160' is already hard enough without visualizing it as hex. Convert 'n' to decimal rather than converting 'k' etc. to hex, and then use a BigInt library to do the math, including comparisons.
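In Java, BigInteger takes care of the hex-versus-decimal worry, since the value is just a number once parsed. Here is a minimal sketch of the finger start computation, (n + 2^(k-1)) mod 2^160; the names ChordFingers and fingerStart are hypothetical, and the lookup of the successor itself is left out.

```java
import java.math.BigInteger;

final class ChordFingers {
    static final int M = 160; // number of bits in a SHA-1 identifier
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(M); // 2^160

    /** Start of the k-th finger of node n, for 1 <= k <= 160. */
    static BigInteger fingerStart(BigInteger n, int k) {
        BigInteger offset = BigInteger.ONE.shiftLeft(k - 1); // 2^(k-1)
        return n.add(offset).mod(RING_SIZE);                 // wrap around the ring
    }
}
```

For a node at the very top of the ring (n = 2^160 - 1), fingerStart(n, 1) wraps around to 0, which is the behaviour the modulo is there to provide.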

Exclusive or between N bit sets

I am implementing a program in Java using BitSets and I am stuck in the following operation:
Given N BitSets, return a BitSet with a 0 in every position where more than one of the BitSets has a 1, and a 1 otherwise.
As an example, suppose we have these 3 sets:
10010
01011
00111
11100 expected result
For the following sets :
10010
01011
00111
10100
00101
01000 expected result
I am trying to do this with bitwise operations, and I have realized that what I need is literally the exclusive or between all the sets, but not in an iterative fashion, so I am quite stumped with what to do. Is this even possible?
I wanted to avoid the costly solution of having to check each bit in each set, and keep a counter for each position...
Thanks for any help
Edit : as some people asked, this is part of a project I'm working on. I am building a time table generator and basically one of the soft constraints is that no student should have only 1 class in 1 day, so those Sets represent the attending students in each hour, and I want to filter the ones who have only 1 class.
You can do what you want with two values. One has the bits set at least once, the second has those set more than once. The combination can be used to determine those set once and no more.
int[] ints = {0b10010, 0b01011, 0b00111, 0b10100, 0b00101};
int setOnce = 0, setMore = 0;
for (int i : ints) {
    setMore |= setOnce & i;
    setOnce |= i;
}
int result = setOnce & ~setMore;
System.out.println(String.format("%5s", Integer.toBinaryString(result)).replace(' ', '0'));
prints
01000
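Since the question uses java.util.BitSet rather than int, here is the same two-accumulator idea sketched with BitSet. The class and method names are made up; BitSet.and, or, and andNot modify the receiver in place, hence the clones.

```java
import java.util.BitSet;
import java.util.List;

final class ExactlyOnce {
    /** Returns the positions that are set in exactly one of the given sets. */
    static BitSet exactlyOnce(List<BitSet> sets) {
        BitSet setOnce = new BitSet(); // bits seen at least once
        BitSet setMore = new BitSet(); // bits seen more than once
        for (BitSet current : sets) {
            BitSet repeated = (BitSet) setOnce.clone();
            repeated.and(current);     // bits already seen that appear again
            setMore.or(repeated);
            setOnce.or(current);
        }
        BitSet result = (BitSet) setOnce.clone();
        result.andNot(setMore);        // seen at least once, but never twice
        return result;
    }
}
```

BitSet works on whole 64-bit words internally, so this still avoids a per-bit counter; the cost is a handful of word operations per input set.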
Well first of all, you can't do this without checking every bit in each set. If you could solve this question without checking some arbitrary bit, then that would imply that there exist two solutions (i.e. two different ones for each of the two values that bit can be).
If you want a more efficient way of computing the XOR of multiple bit sets, I'd consider representing your sets as integers rather than with sets of individual bits. Then simply XOR the integers together to arrive at your answer. Note, however, that a plain XOR yields 1 wherever an odd number of sets have the bit set, so it matches the expected results above only when no bit is set in three or more sets. Otherwise, it seems to me that you would have to iterate through each bit, check its value, and compute the solution on your own (as you described in your question).

A good hash function to use in interviews for integer numbers, strings?

I have come across situations in an interview where I needed to use a hash function for integers or for strings. In such situations, which ones should we choose? I've gone wrong in these situations because I end up choosing ones that generate a lot of collisions, but hash functions tend to be so mathematical that you cannot recall them in an interview. Are there any general recommendations so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both inputs in an "interview situation"?
Here is a simple recipe from Effective Java, page 33:
1. Store some constant nonzero value, say, 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      i. If the field is a boolean, compute (f ? 1 : 0).
      ii. If the field is a byte, char, short, or int, compute (int) f.
      iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
      iv. If the field is a float, compute Float.floatToIntBits(f).
      v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
      vi. If the field is an object reference and this class’s equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a “canonical representation” for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2.a into result as follows:
      result = 31 * result + c;
3. Return result.
When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.
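Applied to a concrete class, the recipe might look like the sketch below. Account and its fields are hypothetical, picked only so that each kind of field from the list shows up once.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical value class, used only to illustrate the recipe. */
final class Account {
    private final boolean active;
    private final int branch;
    private final long number;
    private final double balance;
    private final String owner;  // may be null
    private final int[] history; // every element is significant

    Account(boolean active, int branch, long number, double balance,
            String owner, int[] history) {
        this.active = active;
        this.branch = branch;
        this.number = number;
        this.balance = balance;
        this.owner = owner;
        this.history = history;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Account)) return false;
        Account other = (Account) o;
        return active == other.active
                && branch == other.branch
                && number == other.number
                && Double.compare(balance, other.balance) == 0
                && Objects.equals(owner, other.owner)
                && Arrays.equals(history, other.history);
    }

    @Override
    public int hashCode() {
        int result = 17;                                           // nonzero start value
        result = 31 * result + (active ? 1 : 0);                   // boolean
        result = 31 * result + branch;                             // int
        result = 31 * result + (int) (number ^ (number >>> 32));   // long
        long balanceBits = Double.doubleToLongBits(balance);       // double, then as long
        result = 31 * result + (int) (balanceBits ^ (balanceBits >>> 32));
        result = 31 * result + (owner == null ? 0 : owner.hashCode()); // object reference
        result = 31 * result + Arrays.hashCode(history);           // array
        return result;
    }
}
```

Two Account instances built from the same values compare equal and return the same hash code, which is the property the closing advice of the recipe asks you to test.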
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer: this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions but that's unavoidable when hashing a large structure to an integer :-)
For integers, I usually go with k % p where p = size of the hash table and is a prime number and for strings I choose hashcode from String class. Is this sufficient enough for an interview with a major tech company? – phoenix 2 days ago
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive because a lot of the hashing functions I first saw used bit-shifts and XOR, and a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (could be faster too depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great sizeof int hash value - indeed "perfect" if inputs are randomly scattered through the int values (rather than say incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shift the values etc..
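A sketch of that table lookup for the RGB case, in Java: the class name and the use of java.util.Random with a caller-supplied seed are assumptions made for illustration, and any source of random ints would do for filling the tables.

```java
import java.util.Random;

final class RgbTableHash {
    // One row of 256 random ints per input byte (r, g, b).
    private final int[][] table = new int[3][256];

    /** Fills the lookup tables with pseudo-random ints; the seed is an arbitrary choice. */
    RgbTableHash(long seed) {
        Random random = new Random(seed);
        for (int[] row : table) {
            for (int i = 0; i < row.length; i++) {
                row[i] = random.nextInt();
            }
        }
    }

    /** Hashes a 24-bit RGB value by XOR-ing one table entry per byte. */
    int hash(int rgb) {
        int r = (rgb >>> 16) & 0xFF;
        int g = (rgb >>> 8) & 0xFF;
        int b = rgb & 0xFF;
        return table[0][r] ^ table[1][g] ^ table[2][b];
    }
}
```

Because each input byte selects a full random int, changing a single input bit swaps in an unrelated table entry, so about half of the output bits flip on average.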
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....

BitMask operation in java

Consider the scenario
I have values assigned like these
Amazon - 1
Walmart - 2
Target - 4
Costco - 8
BJ's - 16
In the DB, data is stored by masking these values based on their availability for each product.
E.g.,
Mask  Product   Description
1     laptop    Available in Amazon
17    iPhone    Available in Amazon and BJ's
24    Mattress  Available in Costco and BJ's
All the products are masked like this and stored in the DB.
How do I retrieve all the retailers based on the masked value?
E.g., for Mattress the masked value is 24. How would I find or list Costco and BJ's programmatically? Any algorithm/logic would be highly appreciated.
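int mattress = 24;
int numStores = 5; // Amazon, Walmart, Target, Costco, BJ's
int mask = 1;
for (int i = 0; i < numStores; ++i) {
    // Parentheses are required: != binds tighter than & in Java.
    if ((mask & mattress) != 0) {
        System.out.println("Store " + i + " has mattresses!");
    }
    mask = mask << 1;
}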
The if statement lines up the bits: if the mattress value has the same bit set as the mask, then the store whose mask that is sells mattresses. An AND of the mattress value and the mask value will only be non-zero when the store sells mattresses. On each iteration we move the mask bit one position to the left.
Note that the mask values should be positive, not negative; if need be, you can multiply by negative one.
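To list names rather than store indexes, the same loop can collect from an array ordered by bit position. This is a small sketch; the RetailerMask class and its decode method are made up, and the array order assumes the values 1, 2, 4, 8, 16 given in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class RetailerMask {
    // Index i corresponds to bit value 1 << i, matching the values in the question.
    private static final String[] RETAILERS =
            {"Amazon", "Walmart", "Target", "Costco", "BJ's"};

    /** Returns the names of all retailers whose bit is set in the mask. */
    static List<String> decode(int mask) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < RETAILERS.length; i++) {
            if ((mask & (1 << i)) != 0) {
                names.add(RETAILERS[i]);
            }
        }
        return names;
    }
}
```

RetailerMask.decode(24) returns [Costco, BJ's] and RetailerMask.decode(17) returns [Amazon, BJ's].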
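Assuming you mean in a SQL database, then in your retrieval SQL, you can generally add e.g. WHERE (MyField & 16) = 16, WHERE (MyField & 24) = 24 etc. (The bitwise AND operator is & in MySQL, PostgreSQL and SQL Server; Oracle uses the BITAND function instead.)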
However, note that if you're trying to optimise such retrievals, and the number of rows typically matching a query is much smaller than the total number of rows, then this probably isn't a very good way to represent this data. In that case, it would be better to have a separate "ProductStore" table that contains (ProductID, StoreID) pairs representing this information (and indexed on StoreID).
Are there at most two retailers whose inventories sum to the "masked" value in each case? If so you will still have to check all pairs to retrieve them, which will take n² time. Just use a nested loop.
If the value represents the sum of any number of retailers' inventories, then you are trying to solve the subset-sum problem, so unfortunately you cannot do it in better than 2^n time.
If you are able to augment your original data structure with information to look up the retailers contributing to the sum, then this would be ideal. But since you are asking the question I am assuming you don't have access to the data structure while it is being built, so to generate all subsets of retailers for checking you will want to look into Knuth's algorithm for generating all k-combinations (and run it for 1...k) given in TAOCP Vol 4a Sec 7.2.1.3.
http://www.antiifcampaign.com/
Remember this. If you can remove the "if" with another construct(map/strategy pattern), for me you can let it there, otherwise that "if" is really dangerous!! (F.Cirillo)
In this case you can use map of map with bitmask operation.
Luca.
