Related
i have arrays containing random unique numbers from 0 to integer.max value.
How can i generate a unique id/signature(int) to identify each array uniquely rather than searching through each array and checking each number.
e.g
int[] x = {2,4,8,1,88,12....};
int[] y = {123,456,64,87,1,12...};
int[] z = {2,4,8,1...};
int[] xx = {213,3534,778,1,2,234....};
..................
..................
and so on.
Each array can have different length, but the numbers are not repeated within the array and can be repeated in other arrays. The purpose of unique id for each array is to identify it through the id so that the searching can be made fast. The arrays contain ids of components and the unique signature/id for the array will identify the components contained in it.
Also the id generated sould be same regardless of the order of the values in the array. Like {1,5} and {5,1} should generate the same id.
I have looked up different number paring methods, but the resulting number grows as the length of the array increases to a point it can not fit in an int.
The IDS assigned to components can be adjusted, they don't have to be a sequence of integers as long as a good range of numbers is available. The only requirement is that once an id is generated for an array (a collection of component ids) they should not collide. And can be generated at runtime if the collection in that array changes.
This can be approximately solved with a hash function h() with an order-normalization function (such as sort()). A hash function is lossy, since the number of unique hashes (2^32 or 2^64) is smaller than the number of possible variable length sets of integers, resulting in a small chance of two distinct sets having the same ID (hash collision). Typically this won't be a problem if
you use a good hash function, and
your data set is not ridiculously large.
The order normalization function would ensure that sets {x, y} and {y, x} are hashed to the same value.
For the hash function you have many options, but choose a hash that minimizes collision probability, such as a cryptographic hash (SHA-256, MD5) or if you need bleeding edge performance use MurmurHash3 or other hash du jour. MurmurHash3 can produce an integer as output, while the cryptographic hashes require an extra step of extracting 4 or 8 bytes from the binary output and unpacking to an integer. (Use any consistent selection of bytes such as first or last.)
In pseudocode:
int getId(setOfInts) {
intList = convert setOfInts to integer list
sortedIntList = sort(intList)
ilBytes = cast sortedIntList to byte array
hashdigest = hash(ilBytes)
leadingBytes = extract 4 or 8 leading bytes of hashdigest
idInt = cast leadingBytes to integer
return idInt
}
Strictly speaking, what you ask for is not possible: even with arrays of just two elements, there are many more possible arrays (about 261 after ignoring order) than possible signatures (232). And your arrays aren't limited to two elements, so your situation is exponentially worse.
However, if you can accept a low rate of duplicates and false matches, a simple approach is to just add together all of the elements with the + operator (which essentially computes the sum modulo 232). This is the approach taken by java.util.Set<Integer>'s hashCode() method. It doesn't completely eliminate the need to compare whole arrays (because you'll need to detect false matches), but it will radically reduce the number of such comparisons (because very few arrays will match any given array).
You want {1, 5} and {5, 1} to have the same ID. That rules out standard hash functions, which will give different results in that situation. One option is to sort the array before hashing. Be aware that cryptographic hashes are slow; you may find that a non-crypto hash like FNV is sufficient. It will certainly be faster.
To avoid sorting simply add all the numbers mod 2^32 or mod 2^64, as #ruakh suggests and accept that you will have a proportion of collisions. Adding in the array length will avoid some collisions: {5, 1} will not match {1, 2, 3} in that case as (2+(5+1)) != (3+(1+2+3)). You may want to test with your real data to see if this gives enough of an advantage.
I'm trying to generate n random numbers that depend on an input string. It would be a function generateNumbers(String input) that generates the same set of numbers for the same input string but entirely different numbers for a slightly different input string.
My question is: Is there an easy way to do this?
I agree with nihlon, if what you want is a function f() returning an int such that f(string1) != f(string2) for any string1, string2 in some set of strings S, then you're looking for a perfect hash.
Obviously, if S is the set of all possible strings, there are way more than 2^32, or even 2^64, so no such f() can exist returning an int or even long. Hence, the question is: how is S characterized?
Also, are you sure you need unique numbers for different strings? In most problem domains regular hashing is adequate...
As Roberto says, a hash is one way to do this, with a small possibility of two different strings hashing to the same value. That probability depends on the maximum size of string you allow and the bit-size of the resulting hash number.
You could also use an encryption, but then you would have to limit the string size to one or two blocks of a block cipher. Two blocks of AES is 32 characters, and will produce a 256 bit number.
Pick the smallest string size you can live with, and the largest hash size/block size you can work with. A non-cryptographic hash like the fnv hash will be faster than a cryptographic hash like SHA-256, but obviously less secure. You do not say how important security is to you.
It's usually said that inserting and finding a string in a hash table is O(1). But how is hash key of a string made? Why it's not considered O(L), length of string?
It is clear to me that why for integers it is O(1), but not for strings.
I do understand why in general, inserting into a hash table is O(1), but I am confused about the step before inserting the hash into table: making the hash value.
Also is there any difference between how hash keys for strings are generated in java and unordered_map in C++?
Thanks.
Inserting etc. in a hashtable is O(1) in the sense that it is constant (or more precisely, bounded) in regard to the number of elements in the table.
The "O(1)" in this context makes no claim about how fast you can compute your hashes. If the effort for this grows in some way, that is the way it is. However, I find it unlikely that the complexity of a decent (i.e. "fit for this application") hash function will ever be worse than linear in the "size" (i.e. the length in our string-example) of the object being hashed.
It's usually said that inserting and finding a string in a hashtable is O(1). But how is hash key of a string made ? Why it's not O(L), length of string? It's clear for me that why for integers it's O(1), but not for strings.
The O(1) commonly quoted means the time doesn't grow with the number of elements in the container. As you say, the time to generate a hash value from a string might not itself be O(1) in the length of the string - though for some implementations it is: for example Microsoft's C++ std::hash<std::string> has:
size_t _Val = 2166136261U;
size_t _First = 0;
size_t _Last = _Keyval.size();
size_t _Stride = 1 + _Last / 10;
if (_Stride < _Last)
_Last -= _Stride;
for(; _First < _Last; _First += _Stride)
_Val = 16777619U * _Val ^ (size_t)_Keyval[_First];
return (_Val);
The _Stride is a tenth of the string length, so a fixed number of characters that far apart will be incorporated in the hash value. Such a hash function is O(1) in the length of the string.
GCC's C++ Standard library takes a different approach: in v4.7.2 at least, it calls down through a _Hash_impl support class to the static non-member function _Hash_bytes, which does a Murmur hash incorporating every byte. GCC's hash<std::string> is therefore O(N) in the length of the string.
GCC's higher prioritorisation of collision minimisation is also evident in its use of prime numbers of buckets for std::unordered_set and std::unordered_map, which MS's implementation doesn't do - at least up until VS2013/VC12; summarily MS's approach will be lighter-weight/faster for keys that aren't collision prone, and at lower load factors, but degrades earlier and more dramatically otherwise.
And is there any difference between how hash keys for strings are produced between hashTable in java and unordered_map in C++?
How strings are hashed is not specified by the C++ Standard - it's left to the individual compiler implementations. Consequently, different compromises are struck by different compilers - even different versions of the same compiler.
The documentation David Pérez Cabrera's answer links to explains the hashCode function in Java:
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
That's clearly O(N) in the length of the string.
Returning quickly to...
It's usually said that inserting and finding a string in a hashtable is O(1).
...a "key" ;-P insight is that in many problem domains, the real-world lengths of the strings is known not to vary significantly, or hashing for the worst-case length is still plenty fast enough. Consider a person's or company's name, a street address, an identifier from some source code, a programming-language keyword, a product/book/CD etc name: you can expect a billion keys to take roughly a million times more memory to store than the first thousand. With a hash table, most operations on the entire data set can be expected to take a million times longer. And this will be as true in 100 years' time as it is today. Importantly, if some request comes in related to a single key, it shouldn't take much longer to perform than it used to with a thousand keys (assuming sufficient RAM, and ignoring CPU caching effects) - though sure, if it's a long key it may take longer than for a short key, and if you have ultra-low-latency or hard-realtime requirements, you may care. But, the average throughput for requests with random keys will be constant despite having a million times more data.
Only when you have a problem domain with massive variance in key size and the key-hashing time is significant given your performance needs, or where you expect the average key size to increase over time (e.g. if the keys are video streams, and every few years people are bumping up resolutions and frame rates creating an exponential growth in key size), will you need to pay close attention to the hashing (and key comparison) costs.
Acording to implementation of Java, Hashtable use the hashCode method of key (String or Integer).
Hashtable
String.hashCode
Integer.hashCode
And C++ use std::hash<std::string> or std::hash<int> according to http://en.cppreference.com/w/cpp/utility/hash and the implementation was in functional file (/path/to/c++... /include/c++/4.8/functional)
The complexity of a hashing function is never O(1). If the length of the string is n then the complexity is surely O(n). However, if you compute all hashes in a given array, you won't have to calculate for the second time and you can always compare two strings in O(1) time by comparing the precalculated hashes.
I have come across situations in an interview where I needed to use a hash function for integer numbers or for strings. In such situations which ones should we choose ? I've been wrong in these situations because I end up choosing the ones which have generate lot of collisions but then hash functions tend to be mathematical that you cannot recollect them in an interview. Are there any general recommendations so atleast the interviewer is satisfied with your approach for integer numbers or string inputs? Which functions would be adequate for both inputs in an "interview situation"
Here is a simple recipe from Effective java page 33:
Store some constant nonzero value, say, 17, in an int variable called result.
For each significant field f in your object (each field taken into account by the
equals method, that is), do the following:
Compute an int hash code c for the field:
If the field is a boolean, compute (f ? 1 : 0).
If the field is a byte, char, short, or int, compute (int) f.
If the field is a long, compute (int) (f ^ (f >>> 32)).
If the field is a float, compute Float.floatToIntBits(f).
If the field is a double, compute Double.doubleToLongBits(f), and
then hash the resulting long as in step 2.1.iii.
If the field is an object reference and this class’s equals method
compares the field by recursively invoking equals, recursively
invoke hashCode on the field. If a more complex comparison is
required, compute a “canonical representation” for this field and
invoke hashCode on the canonical representation. If the value of the
field is null, return 0 (or some other constant, but 0 is traditional).
48 CHAPTER 3 METHODS COMMON TO ALL OBJECTS
If the field is an array, treat it as if each element were a separate field.
That is, compute a hash code for each significant element by applying
these rules recursively, and combine these values per step 2.b. If every
element in an array field is significant, you can use one of the
Arrays.hashCode methods added in release 1.5.
Combine the hash code c computed in step 2.1 into result as follows:
result = 31 * result + c;
Return result.
When you are finished writing the hashCode method, ask yourself whether
equal instances have equal hash codes. Write unit tests to verify your intuition!
If equal instances have unequal hash codes, figure out why and fix the problem.
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be a simple as possible (fast to execute) and avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer - this is the standard hashCode() implementation in java.lang.Integer
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
a ^= (a << 13);
a ^= (a >>> 17);
a ^= (a << 5);
return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
int result=0;
for(int i=0; i<data.length; i++) {
result^=data[i].hashCode();
result=Integer.rotateRight(result, 1);
}
return result;
}
Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions but that's unavoidable when hashing a large structure to a integer :-)
For integers, I usually go with k % p where p = size of the hash table and is a prime number and for strings I choose hashcode from String class. Is this sufficient enough for an interview with a major tech company? – phoenix 2 days ago
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about 50/50 chance of flipping. I found that quite counter-intuitive because a lot of the hashing functions I first saw used bit-shifts and XOR and a flipped input bit usually flipped one output bit (usually in another bit position, so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (could be faster too depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great sizeof int hash value - indeed "perfect" if inputs are randomly scattered through the int values (rather than say incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shift the values etc..
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....
This might sound as an very vague question upfront but it is not. I have gone through Hash Function description on wiki but it is not very helpful to understand.
I am looking simple answers for rather complex topics like Hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow ?
What is the difference between HashMap, HashTable and HashList ?
What do we mean by 'Constant Time Complexity' and why does different implementation of the hash gives constant time operation ?
Lastly, why in most interview questions Hash and LinkedList are asked, is there any specific logic for it from testing interviewee's knowledge?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
Here is a good explanation about hashing. For example you want to store the string "Rachel" you apply a hash function to that string to get a memory location. myHashFunction(key: "Rachel" value: "Rachel") --> 10. The function may return 10 for the input "Rachel" so assuming you have an array of size 100 you store "Rachel" at index 10. If you want to retrieve that element you just call GetmyHashFunction("Rachel") and it will return 10. Note that for this example the key is "Rachel" and the value is "Rachel" but you could use another value for that key for example birth date or an object. Your hash function may return the same memory location for two different inputs, in this case you will have a collision you if you are implementing your own hash table you have to take care of this maybe using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies that: each key is equally likely to hash to any of the n memory slots independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n. h(k) = k mod n. For example if your array size is n = 100 and your key is an integer k = 15 then h(k) = 10.
Hashtable is synchronised and Hashmap is not.
Hashmap allows null values as key but Hashtable does not.
The purpose of a hash table is to have O(c) constant time complexity in adding and getting the elements. In a linked list of size N if you want to get the last element you have to traverse all the list until you get it so the complexity is O(N). With a hash table if you want to retrieve an element you just pass the key and the hash function will return you the desired element. If the hash function is well implemented it will be in constant time O(c) This means you dont have to traverse all the elements stored in the hash table. You will get the element "instantly".
Of couse a programer/developer computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and HashTable are maps; they are a collection of unqiue keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
Hashing is transforming a given entity (in java terms - an object) to some number (or sequence). The hash function is not reversable - i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object by getting some memory address by the JVM.
The JVM address thing is unimportant detail. Each class can override the hashCode() method with its own algorithm. Modren Java IDEs allow for generating good hashCode methods.
Hashtable and hashmap are the same thing. They key-value pairs, where keys are hashed. Hash lists and hashsets don't store values - only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is - 1, or close to 1
This is basic computer-science material, and it is supposed that everyone is familiar with it. I think google have specified that the hashtable is the most important data-structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such list would have O(n) complexity, meaning that you have to parse the whole list (or half of it, on average) to perform such an operation.
Hashing is a very simple and effective way of speeding it up: consider that we split the whole list in a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, by having a list of names, we could use first letter as the quality that will choose in which small list to look. In this way, by partitioning the data by the first letter of the key, we obtained a simple hash, that would be able to split the whole list in ~30 smaller lists, so that each operation would take O(n)/30 time.
However, we could note that the results are not that perfect. First, there are only 30 of them, and we can't change it. Second, some letters are used more often than others, so that the set with Y or Z will be much smaller that the set with A. For better results, it's better to find a way to partition the items in sets of roughly same size. How could we solve that? This is where you use hash functions. It's such a function that is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
int hash(const char* str){
int rez = 0;
for (int i = 0; i < strlen(str); i++)
rez = rez * 37 + str[i];
return rez % NUMBER_OF_PARTITIONS;
};
This would assure a quite even distribution and configurable number of sets (also called buckets).
What do we mean by Hashing, how does
it work internally ?
Hashing is the transformation of a string shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table. It contains array of items. Hash tables contain an index from the data item's key and use this index to place the data into the array.
What algorithm does it follow ?
In simple words most of the Hash algorithms work on the logic "index = f(key, arrayLength)"
Lastly, why in most interview
questions Hash and LinkedList are
asked, is there any specific logic for
it from testing interviewee's
knowledge ?
Its about how good you are at logical reasoning. It is most important data-structure that every programmers know it.