I am looking for a way to create an int/long representation of an arbitrary alphanumeric String. Hash codes won't do it, because I can't afford hash collisions, i.e. the representation must be unique and repeatable.
The numeric representation will be used to perform (hopefully) efficient comparisons. Creating the numeric key will take some time, but it only has to happen once, whereas I need to perform vast numbers of comparisons with it - which will hopefully be much faster than comparing the raw Strings.
Any other ideas on faster String comparison will be most appreciated too...
Unless your string is limited in length, you can't avoid collisions.
There are 4294967296 possible values for an integer (2^32). If you have a string of more than 4 ASCII characters, or more than two Unicode characters, then there are more possible string values than possible integer values. You can't have a unique integer value for every possible 5 character string. Long values have more possible values, but they would only provide a unique value for strings of up to 8 ASCII characters.
Hash codes are useful as a two step process: first see if the hash code matches, then check the whole string. For most strings that don't match, you only need to do the first step, and it's really fast.
Can't you just start with a hash code, and if the hash codes match, do a character by character comparison?
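A minimal sketch of that two-step compare (the Key wrapper is purely illustrative); note that java.lang.String already caches its hash code after the first call, so a plain s1.hashCode() == s2.hashCode() && s1.equals(s2) gets you most of this for free:

final class Key {
    final String value;
    final int hash;

    Key(String value) {
        this.value = value;
        this.hash = value.hashCode(); // computed once, reused for every compare
    }

    boolean matches(Key other) {
        // Cheap int compare first; full string compare only on a hash match.
        return hash == other.hash && value.equals(other.value);
    }
}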
How long are the strings? If they are very short, then a unique ID can be generated by treating the characters as digits in base 36 (26 letters + 10 digits) that form an n-digit number, where n is the length of the string. On the other hand, if the strings are short enough to allow this, direct comparison won't be an issue anyway.
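A minimal sketch of that base-36 encoding, assuming only lowercase letters and digits; a long can hold up to 12 such characters, since 36^12 < 2^63 (strings with leading '0' characters collide with their shorter forms, so fixed-length or padded input is assumed):

static long base36Key(String s) {
    long key = 0;
    for (int i = 0; i < s.length(); i++) {
        key = key * 36 + Character.digit(s.charAt(i), 36); // '0'-'9','a'-'z' -> 0-35
    }
    return key;
}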
Otherwise you'll have to generate a collision-free hash and this can only be done when the complete problem space is known in advance (i.e. if you know all strings that can possibly occur). You will want to have a look at perfect hashing, although the only feasible algorithm to find a perfect hash function that I know is probabilistic so collisions are still theoretically possible.
There might be other ways to find such a function. Knuth called this a “rather amusing … puzzle” in TAoCP but he doesn't give an algorithm either.
In general, you give far too little information to find an algorithm that doesn't require probing the whole problem space in some manner. That invariably means exponential running time, although the search could be guided by machine-learning heuristics. I'm not sure if this is advisable in your case.
Perhaps:
String y = "oiu291981u39u192u3198u389u28u389u";
BigInteger bi = new BigInteger(y, 36);
System.out.println(bi);
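BigInteger implements Comparable, so two such keys can be ordered with bi.compareTo(other); for lowercase alphanumeric strings of equal length this matches the lexicographic order of the originals, while across different lengths the ordering is numeric rather than lexicographic (and leading zeros are dropped).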
At the end of the day, a single alphanumeric character has at least 36 possible values. If you include punctuation, lower case, etc then you can easily pass 72 possible values.
A non-colliding number that allows you to quickly compare strings would necessarily grow exponentially with the length of the string.
So you first must decide on the longest string you are expecting to compare. Assuming it's N characters in length, and assuming you ONLY need uppercase letters and the numerals 0-9 then you need to have an integer representation that can be as high as
36^N
For a string of length 25 (a common name field) you end up needing a binary number with 130 bits.
If you compose that into 32 bit numbers, you'll need five (130 bits don't quite fit into four). Then you can compare each number (five integer compares should take no time, compared to walking the string). I would recommend a big number library, but for this specialized case I'm pretty sure you can write your own and get better performance.
If you want to handle 72 possible values per character (uppercase, lowercase, numerals, punctuation...) and you need 10 characters, then you'll need 62 bits - two 32 bit integers (or one 64 bit if you're on a system that supports 64 bit computing)
If, however, you cannot restrict the characters in the string (i.e. any of the 256 byte values could occur) and you can't bound the length of the string, then comparing the strings directly is the only way to go - but there's a shortcut.
Cast the pointer of the string to a 32 bit unsigned integer array, and compare the string 4 bytes at a time (or 64 bits / 8 bytes at a time on a 64 bit processor). This means that a 100 character string requires at most 25 compares to find which is greater.
You may need to re-define the character set (and convert the strings) so that the characters with higher precedence are assigned values closer to 0, and lower precedence values closer to 255 (or vice versa, depending on how you are comparing them).
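Java has no pointer casts, but ByteBuffer gives a similar effect. A hedged sketch, assuming both arrays have been padded with zero bytes to a multiple of 8 (ByteBuffer reads big-endian by default, which preserves byte-wise ordering):

import java.nio.ByteBuffer;

static int compareChunked(byte[] a, byte[] b) {
    ByteBuffer ba = ByteBuffer.wrap(a), bb = ByteBuffer.wrap(b);
    while (ba.hasRemaining() && bb.hasRemaining()) {
        // Compare 8 bytes per step instead of one character at a time.
        int c = Long.compareUnsigned(ba.getLong(), bb.getLong());
        if (c != 0) return c;
    }
    return a.length - b.length; // the shorter string is a prefix of the longer
}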
Good luck!
-Adam
As long as it's a hash function, be it String.hashCode(), MD5 or SHA-1, collisions are unavoidable unless you have a fixed limit on the string's length. It is mathematically impossible to have a one-to-one mapping from an infinite set to a finite set.
Stepping back, is collision avoidance absolutely necessary?
A few questions in the beginning:
Did you test that simple string comparison is too slow?
What does the comparison look like ('ABC' == 'abc' or 'ABC' != 'abc')?
How many strings do you have to compare?
How many comparisons do you have to do?
What do your strings look like (length, letter case)?
As far as I remember, a String in Java is an object, and two identical string literals point to the same interned object; this is not true for strings constructed at runtime unless you call intern() on them.
So maybe it would be enough to compare references (String.equals() already starts with exactly that reference check).
If that doesn't help, you can try a Pascal-style string representation, where the first element is the length; if your strings have varying lengths, comparing the lengths first should save some CPU time.
How long are your strings? Unless you choose an int representation with at least as many bits as the string, collisions will always be possible no matter what conversion you're using. So if you're using a 32 bit integer, you can only uniquely represent strings of up to 4 bytes.
How big are your strings? Arbitrarily long strings cannot be compressed into 32/64 bit format.
If you don't want collisions, try something insane like SHA-512. I can't guarantee there won't be collisions, but I don't think they have found any yet.
Assuming "alphanumeric" means letters and numbers, you could treat each letter/number as a base-36 digit. Unfortunately, large strings will cause the number to grow rapidly and you'd have to resort to big integers, which are hardly efficient.
If your strings are usually different when you make the comparison (i.e. searching for a specific string) the hash might be your best option. Once you get a potential hit, you can do the string comparison to be sure. A well-designed hash will make collisions exceedingly rare.
It would seem that an MD5 hash would work fine; a hash collision would be extremely unlikely. But depending on the length of your strings, a hash that has to fit in an int/long would hit its maximum value very quickly.
Why don't you do something like 1stChar + (B x 2ndChar) + (B² x 3rdChar) ..., where B is the number of possible character values and you use the simple integer value of each character, i.e. a = 1, b = 2 etc.? (Powers of 10 would collide, since character values go above 9.) This gives a unique value for each string, even for two strings that are just the same letters in a different order.
Of course it gets more complicated if you need to worry about Unicode rather than just ASCII, and the numbers get large quickly if you need to handle long strings.
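A hedged sketch of that positional idea in collision-free form: treat the UTF-8 bytes of the string as the digits of one big base-256 number (the helper name positionalKey is illustrative):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

static BigInteger positionalKey(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] digits = new byte[utf8.length + 1];
    digits[0] = 1; // sentinel so strings differing only in leading NUL bytes stay distinct
    System.arraycopy(utf8, 0, digits, 1, utf8.length);
    return new BigInteger(1, digits); // signum 1: interpret the bytes as unsigned
}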
Are the standard Java string comparison functions definitely not efficient enough?
String length may vary, but let's say 10 characters for now.
In that case, in order to guarantee uniqueness you'd have to use some sort of big integer representation. I doubt that doing comparisons on big integers would be substantially faster than doing string comparisons in the first place. I'll second what others have said here: use some sort of hash, then in the event of a hash match check the original strings to weed out any collisions.
In any case, if your strings are around 10 characters, I doubt that comparing, say, a bunch of 32 bit hashes will be all that much faster than direct string comparisons. I think you have to ask yourself if it's really worth the additional complexity.
Related
I'm trying to generate n random numbers that depend on an input string. It would be a function generateNumbers(String input) that generates the same set of numbers for the same input string but entirely different numbers for a slightly different input string.
My question is: Is there an easy way to do this?
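One common approach (a sketch, not a vetted design): derive a seed from the input string and feed it to a seeded PRNG, so the same string always reproduces the same sequence. The method name generateNumbers is taken from the question; note that seeding with hashCode() means two different strings can, rarely, share a sequence - seeding from a cryptographic hash of the input reduces that risk.

import java.util.Random;

static int[] generateNumbers(String input, int n) {
    Random rng = new Random(input.hashCode()); // deterministic seed per input string
    int[] out = new int[n];
    for (int i = 0; i < n; i++) {
        out[i] = rng.nextInt();
    }
    return out;
}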
I agree with nihlon, if what you want is a function f() returning an int such that f(string1) != f(string2) for any string1, string2 in some set of strings S, then you're looking for a perfect hash.
Obviously, if S is the set of all possible strings, there are way more than 2^32, or even 2^64, so no such f() can exist returning an int or even long. Hence, the question is: how is S characterized?
Also, are you sure you need unique numbers for different strings? In most problem domains regular hashing is adequate...
As Roberto says, a hash is one way to do this, with a small possibility of two different strings hashing to the same value. That probability depends on the maximum size of string you allow and the bit-size of the resulting hash number.
You could also use an encryption, but then you would have to limit the string size to one or two blocks of a block cipher. Two blocks of AES is 32 characters, and will produce a 256 bit number.
Pick the smallest string size you can live with, and the largest hash size/block size you can work with. A non-cryptographic hash like the FNV hash will be faster than a cryptographic hash like SHA-256, but obviously less secure. You do not say how important security is to you.
I'm trying to implement the Chord protocol in order to quickly look up some nodes and keys in a small network. What I can't figure out is ... Chord considers the nodes and keys as being placed on a circle, with their placement dictated by the hash values obtained by applying the SHA-1 hash function. How exactly do I operate with those values? Do I keep them as strings, like de9f2c7f d25e1b3a fad3e85a 0bd17d9b 100db4b3, and then compare them as such, considering that "a" < "b" is true? Or how? How do I know if a key is before or after another?
Since the keyspace is a ring, a single value can't be said to be greater than another, because if you go the other way around the ring, the opposite is true. You can say a value is within a range or not. In the Chord DHT, each server is responsible for the keys within the range of values between it and its predecessor.
I would advise against using strings for the hash values. You shouldn't use the hashCode function in distributed systems, and you need to do math on the hash keys when adding new nodes. You could try converting the hashes into BigIntegers instead.
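A minimal sketch of that conversion (the helper name chordId is illustrative):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

static BigInteger chordId(String key) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("SHA-1")
            .digest(key.getBytes(StandardCharsets.UTF_8));
    return new BigInteger(1, digest); // signum 1: treat all 160 bits as unsigned
}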
SHA-1 hashes are not strings but very long hex numbers - they are often stored as strings because they would otherwise require a native 160 bit number type. They are built as five 32 bit words that are then often 'strung' together as hex.
Using SHA-1 strings as the numbers they represent is not hard, but requires a library that can handle such large numbers (like BigInt or bcmath). These libraries work by calculating the numbers within the string one column at a time, from right to left, much like a person using pen and paper to add, multiply, divide, etc. They typically have functions for doing common math as well as comparisons, and often take strings as arguments. Also, make sure you use a big-number function any time you need to go from hex to dec, or else your 160 bit hex number will likely get rounded into a 64 bit dec float or similar and lose most of its accuracy.
More-than/less-than comparisons are used in Chord to figure out ranges, but they work modulo 2^160 so that they 'wrap', making ranges such as [64, 2] possible. The actual finger formula is

finger[k] = successor((n + 2^(k-1)) mod 2^160)

where 'n' is the SHA-1 of a node and 'k' is the finger number.
Remember, 'n' will be hex while 'k' and the 2^160 modulus will typically be dec, so this is where your BigInt hex-to-dec conversion will be needed.
Even if your programming framework lets you create these variables as hex, 160 is specifically dec (literally meaning one hundred and sixty bits), and besides, wrapping your brain around mod 2^160 is already hard enough without visualizing it as hex. Convert 'n' to dec rather than converting 'k' etc. to hex, and then use a BigInt library to do the math, including the comparisons.
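Since the comparisons are really range checks on the ring, here is a hedged sketch of the usual "is id in the half-open interval (from, to] on the ring" test, using BigInteger as suggested above:

import java.math.BigInteger;

static boolean inRing(BigInteger id, BigInteger from, BigInteger to) {
    if (from.compareTo(to) < 0) {
        return id.compareTo(from) > 0 && id.compareTo(to) <= 0;
    }
    // The interval wraps past zero, e.g. (2^160 - 5, 3].
    return id.compareTo(from) > 0 || id.compareTo(to) <= 0;
}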
I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be: why do you want to save a few bytes in enum references? Other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.
One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
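A minimal sketch of the first (byte array) option; a thin AbstractList wrapper restores a read-only List view. It assumes the enum has at most 256 constants, and the class name is illustrative:

import java.util.AbstractList;

final class ByteBackedEnumList<E extends Enum<E>> extends AbstractList<E> {
    private final E[] constants;   // lookup table: ordinal -> enum constant
    private final byte[] ordinals; // one byte per element instead of one reference

    ByteBackedEnumList(Class<E> type, E[] values) {
        this.constants = type.getEnumConstants();
        this.ordinals = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            ordinals[i] = (byte) values[i].ordinal();
        }
    }

    @Override public E get(int index) {
        return constants[ordinals[index] & 0xFF]; // mask undoes byte sign extension
    }

    @Override public int size() { return ordinals.length; }
}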
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits.
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.
So I know I can convert a string to a hash code simply by calling .hashCode(), but is there a way (or some other function out there) that, instead of returning an integer, returns a double between 0 and 1? I was thinking of just dividing the number by the maximum possible integer, but wasn't sure if there was a better way.
*Edit (more information about why I'm trying to do this): I'm doing a mathematical operation, and I'm trying to group different objects so that each group performs the same mathematical operation but with a different parameter to the function. Each member has a list of characteristics that "group" them... so I was thinking of putting the characteristics into a string, hashing it, and finding the group's value from that.
You couldn't just divide by Integer.MAX_VALUE, as that wouldn't deal with negative numbers. You could use:
private static double INTEGER_RANGE = 1L << 32;
...
// First need to put it in the range [0, INTEGER_RANGE)
double doubleHash = ((long) text.hashCode() - Integer.MIN_VALUE) / INTEGER_RANGE;
That should be okay, as far as I'm aware... but I'm not going to make any claims about the distribution. There may well be a fairly simple way of using the 32 bits to make a unique double (per unique hash code) in the right range, but if you don't care too much about that, this will be simpler.
Dividing it should be OK, but you might lose some "precision" due to rounding problems, etc., that doubles can have.
In general a hash is used to identify something while trying to ensure it's unique; losing precision might cause problems with that.
You could write your own String.hashCodeDouble() returning the desired number, perhaps using a common hash algorithm (let's say, MD5) and adapting it to your required response range.
Example: do the MD5 of the String to get a hash, then simply put a 0. in front of it...
Remember that .hashCode() is used by lots of classes in Java; you can't simply override it to return a double.
This smells bad but might do what you want:
Integer iHash = "123".hashCode();
String sHash = "0." + Math.abs((long) iHash); // abs guards against negative hash codes
Double dHash = Double.valueOf(sHash);
I am currently storing a list of words (around 120,000) in a HashSet, for the purpose of using it as a list to check entered words against to see if they are spelt correctly, and just returning yes or no.
I was wondering if there is a way to do this which takes up less memory. Currently the 120,000 words take around 12 MB; the actual file the words are read from is around 900 KB.
Any suggestions?
Thanks in advance
You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie
Check out bloom filters or cuckoo hashing. Bloom filter or cuckoo hashing?
I am not sure if this is the answer to your question, but these alternatives are worth looking into. Bloom filters are mainly used for spell-checker kinds of use cases.
HashSet is probably not the right structure for this. Use a trie instead.
This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.
http://www.pathcom.com/~vadco/dawg.html
TWL06 - 178,691 words - fits into 494,676 bytes
The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.
If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.
http://www.pathcom.com/~vadco/adtdawg.html
All the very best,
JohnPaul Adamovsky
12MB to store 120,000 words is about 100 bytes per word. Probably at least 32 bytes of that is String overhead. If words average 10 letters and they are stored as 2-byte chars, that accounts for another 20 bytes. Then there is the reference to each String in your HashSet, which is probably another 4 bytes. The remaining 44 bytes is probably the HashSet entry and indexing overhead, or something I haven't considered above.
The easiest thing to go after is the overhead of the String objects themselves, which can take far more memory than is required to store the actual character data. So your main approach would be to develop a custom representation that avoids storing a separate object for each string. In the course of doing this, you can also get rid of the HashSet overhead, since all you really need is a simple word lookup, which can be done by a straightforward binary search on an array that will be part of your custom implementation.
You could create your custom implementation as an array of type int with one element for each word. Each of these int elements would be broken into sub-fields that contain a length and an offset that points into a separate backing array of type char. Put both of these into a class that manages them, and that supports public methods allowing you to retrieve and/or convert your data and individual characters given a string index and an optional character index, and to perform the simple searches on the list of words that are needed for your spell check feature.
If you have no more than 16777216 characters of underlying string data (e.g., 120,000 strings times an average length of 10 characters = 1.2 million chars), you can take the low-order 24 bits of each int and store the starting offset of each string into your backing array of char data, and take the high-order 8 bits of each int and store the size of the corresponding string there.
Your char data will have your erstwhile strings crammed together without any delimiters, relying entirely upon the int array to know where each string starts and ends.
Taking the above approach, your 120,000 words (at an average of 10 letters each) would require about 2,400,000 bytes of backing array data and 480,000 bytes of integer index data (120,000 x 4 bytes), for a total of 2,880,000 bytes, which is about a 75 percent savings over the present 12MB amount you have reported above.
The words in the arrays would be sorted alphabetically, and your lookup process could be a simple binary search on the int array (retrieving the corresponding words from the char array for each test), which should be very efficient.
If your words happen to be entirely ASCII data, you could save an additional 1,200,000 bytes by storing the backing data as bytes instead of as chars.
This could get more difficult if you needed to alter these strings. Apparently, in your case (spell checker), you don't need to (unless you want to support user additions to the list, which would be infrequent anyway, and so re-writing the char data and indexes to add or delete words might be acceptable).
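A hedged sketch of the packed layout described above (the class name PackedWordList and its methods are illustrative; the index is assumed to be sorted by word):

final class PackedWordList {
    private final int[] index;  // per word: high 8 bits = length, low 24 bits = offset
    private final char[] chars; // all words concatenated, no delimiters

    PackedWordList(int[] index, char[] chars) {
        this.index = index;
        this.chars = chars;
    }

    String wordAt(int i) {
        int entry = index[i];
        return new String(chars, entry & 0xFFFFFF, entry >>> 24); // offset, length
    }

    boolean contains(String word) {
        int lo = 0, hi = index.length - 1;
        while (lo <= hi) { // plain binary search over the sorted index
            int mid = (lo + hi) >>> 1;
            int cmp = wordAt(mid).compareTo(word);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}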
One way to save memory is to use a radix tree. This is better than a trie, as the prefixes are not stored redundantly.
As your dictionary is fixed, another way is to build a perfect hash function for it. Your hash set then does not need buckets (and the associated overhead), as there cannot be collisions. Every implementation of a hash table/hash set that uses open addressing can be used for this (like Google Collections' ImmutableSet).
The problem is by design: storing such a huge number of words in a HashSet for spell-checking isn't a good idea.
You can either use a spell-checker (example: http://softcorporation.com/products/spellcheck/ ), or you can build up "auto word completion" with a prefix tree (description: http://en.wikipedia.org/wiki/Trie ).
There is no way to reduce memory usage within this design.
You can also try a Radix Tree (Wiki, Implementation). It is somewhat like a trie, but more memory efficient.