Given an array of numbers, I would like to create a numeric identifier that represents that combination as uniquely as possible.
For example:
int[] inputNumbers = { 543, 134, 998 };
int identifier = createIdentifier(inputNumbers);
System.out.println( identifier );
Output:
4532464234
-The returned number must be as unique as possible
-The ordering of the elements must influence the result
-The algorithm must always return the same result for the same input array
-The algorithm must be as fast as possible, since it will be used a lot inside 'for' loops
The purpose of this algorithm is to create a small value that can be stored in a DB and compared easily. It is nothing critical, so it's acceptable for some arrays of numbers to return the same value, but those cases must be rare.
Can you suggest a good way to accomplish this?
The standard ( Java 7 ) implementation of Arrays.hashCode(int[]) has the required properties. It is implemented thus:
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
As you can see, the implementation is fast, and the result depends on the order of the elements as well as the element values.
If there is a requirement that the hash values be the same across all Java platforms, I think you can rely on that being satisfied. The javadoc says that the method returns the same value you would get by calling hashCode() on an equivalent List<Integer>, and the formula for List.hashCode() is specified.
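As a quick check of these properties, here is a minimal sketch. The class name IdentifierDemo and the wrapper createIdentifier (named after the method in the question) are only illustrative; the actual work is done by Arrays.hashCode.

```java
import java.util.Arrays;
import java.util.List;

class IdentifierDemo {
    // Thin wrapper using the method name from the question.
    static int createIdentifier(int[] numbers) {
        return Arrays.hashCode(numbers);
    }

    public static void main(String[] args) {
        int[] a = { 543, 134, 998 };
        int[] b = { 998, 134, 543 }; // same values, different order
        List<Integer> asList = Arrays.asList(543, 134, 998);

        // Same input always gives the same result.
        System.out.println(createIdentifier(a) == createIdentifier(a.clone())); // true
        // The value matches the specified List<Integer> formula.
        System.out.println(createIdentifier(a) == asList.hashCode());           // true
        // Reordering the elements changes the result for this input.
        System.out.println(createIdentifier(a) == createIdentifier(b));         // false
    }
}
```

Note that the result is a 32-bit int, so collisions are possible but should be rare for typical inputs.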
Have a look at Arrays.hashCode(int[]), it is doing exactly this.
documentation
What you're looking for is the array's hash code.
int hash = Arrays.hashCode(new int[]{1, 2, 3, 4});
See also the Java API
I also say you are looking for some kind of hash function.
I don't know how much you will rely on point 3 (the algorithm must always return the same result for the same input array), but this depends on the JVM implementation.
So, depending on your use case, you might run into some trouble (the solution then would be to use an external hash library).
For further information take a look at this SO question: Java, Object.hashCode() result constant across all JVMs/Systems?
EDIT
I just read that you want to store the values in a DB. In that case I would recommend using an external hashing library that is reliable and guaranteed to yield the same value every time it is invoked. Otherwise you would have to re-hash your whole DB every time you start your application to keep it in a consistent state.
EDIT2
Since you are using only plain ints, the hash value should be the same every time, as @Stephen C showed in his answer.
The question asked me to return a set containing all the possible combinations of strings made up of "cc" and "ddd" for a given length n.
So, for example, if the given length was 5, then the set would include "ccddd" and "dddcc".
Length 6 would return a set containing "cccccc" and "dddddd",
length 7 would return a set containing "ccdddcc", "dddcccc" and "ccccddd",
and length 12 would return 12 different combinations, and so on.
However, the set returned is empty.
Can you please help?
"Please excuse the extremely poor coding style"
public static Set<String> set = new HashSet<String>();
public static Set<String> generateset(int n) {
String s = strings(n,n,"");
return set; // change this
}
public static String strings(int n,int size, String s){
if(n == 3){
s = s + ("cc");
return "";}
if(n == 2){
s = s + ("ddd");
return "";}
if(s.length() == size)
set.add(s);
return strings(n-3,size,s) + strings(n-2,size,s);
}
I think you'll need to rethink your approach. This is not an easy problem, so if you're extremely new to Java (and not extremely familiar with other programming languages), you may want to try some easier problems involving sets, lists, or other collections, before you tackle something like this.
Assuming you want to try it anyway: recursive problems like this require very clear thinking about how you want to accomplish the task. I think you have a general idea, but it needs to be much clearer. Here's how I would approach the problem:
(1) You want a method that returns a list (or set) of strings of length N. Your recursive method returns a single String, and as far as I can tell, you don't have a clear definition of what the resulting string is. (Clear definitions are very important in programming, but probably even more so when solving a complex recursive problem.)
(2) The strings will either begin with "cc" or "ddd". Thus, to form your resulting list, you need to:
(2a) Find all strings of length N-2. This is where you need a recursive call to get all strings of that length. Go through all strings in that list, and add "cc" to the front of each string.
(2b) Similarly, find all strings of length N-3 with a recursive call; go through all the strings in that list, and add "ddd" to the front.
(2c) The resulting list will be all the strings from steps (2a) and (2b).
(3) You need base cases. If N is 0 or 1, the resulting list will be empty. If N==2, it will have just one string, "cc"; if N==3, it will have just one string, "ddd".
You can use a Set instead of a list if you want, since the order won't matter.
Note that it's a bad idea to use a global list or set to hold the results. When a method is calling itself recursively, and every invocation of the method touches the same list or set, you will go insane trying to get everything to work. It's much easier if you let each recursive invocation hold its own local list with the results. Edit: This needs to be clarified. Using a global (i.e. instance field that is shared by all recursive invocations) collection to hold the final results is OK. But the approach I've outlined above involves a lot of intermediate results--i.e. if you want to find all strings whose length is 8, you will also be finding strings whose length is 6, 5, 4, ...; using a global to hold all of those would be painful.
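Steps (1) to (3) can be sketched as follows. This is only one possible reading of the outline above; the class name CcDddStrings and the method name generate are made up for illustration, and each invocation builds its own local set rather than touching a shared one.

```java
import java.util.HashSet;
import java.util.Set;

class CcDddStrings {
    // Returns every string of exactly n characters built from "cc" and "ddd" blocks.
    static Set<String> generate(int n) {
        Set<String> result = new HashSet<>();
        if (n < 2) {
            return result;              // base case: nothing fits in 0 or 1 characters
        }
        if (n == 2) {
            result.add("cc");           // base case: the only string of length 2
            return result;
        }
        if (n == 3) {
            result.add("ddd");          // base case: the only string of length 3
            return result;
        }
        // Step (2a): strings that begin with "cc".
        for (String tail : generate(n - 2)) {
            result.add("cc" + tail);
        }
        // Step (2b): strings that begin with "ddd".
        for (String tail : generate(n - 3)) {
            result.add("ddd" + tail);
        }
        return result;                  // step (2c): both groups together
    }

    public static void main(String[] args) {
        System.out.println(generate(5)); // contains "ccddd" and "dddcc"
        System.out.println(generate(12).size()); // 12
    }
}
```

Because the same lengths are recomputed many times, caching the result per length would help for large n, but that is left out here to keep the recursion easy to follow.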
To see why the set is returned empty, simply follow the logic. Say you execute generateset(5), which will execute strings(5,5,""):
First call, strings(5,5,""): (s.length() == size) is false, hence nothing is added to the set
Second call, strings(2,5,""): (n == 2) is true, so the method returns before reaching set.add, hence nothing is added to the set
Third call, strings(3,5,""): (n == 3) is true, so the method again returns early and nothing is added to the set
So the set remains unchanged.
I am pulling data values from a database that returns a List of <Integer>. However, I would like to see if the List contains my BigInteger. Is there a simple way to do this?
I currently have the following code in Java:
ArrayList<Integer> arr = new ArrayList<Integer>() {{add(new Integer(29415));}};
boolean contains = arr.contains(29415); // true
boolean contains2 = arr.contains(new BigInteger("29415")); // false
I'm not sure on an efficient way to do this?
The correct answer will be returned by evaluation of the following:
val != null
&& BigInteger.valueOf(Integer.MIN_VALUE).compareTo(val) <= 0
&& BigInteger.valueOf(Integer.MAX_VALUE).compareTo(val) >= 0
&& list.contains(val.intValue())
This will correctly solve the question of whether the BigInteger you have is "contained" within the List<Integer>. Note that here we only downcast where necessary. If the val is outside the range of Integer values there is no need to downcast as we know that the value cannot be within the list.
A more relevant question is whether you should actually be using a List<BigInteger> in place of a List<Integer>, but that is a different question and not part of the answer to your explicit question.
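Wrapped in a helper, the range-checked lookup might look like the sketch below. The class name BigIntegerLookup and the method name containsValue are hypothetical. The bounds are inclusive, so Integer.MIN_VALUE and Integer.MAX_VALUE themselves can still be found in the list.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

class BigIntegerLookup {
    private static final BigInteger INT_MIN = BigInteger.valueOf(Integer.MIN_VALUE);
    private static final BigInteger INT_MAX = BigInteger.valueOf(Integer.MAX_VALUE);

    // True only if val fits in an int and that int is present in the list.
    static boolean containsValue(List<Integer> list, BigInteger val) {
        return val != null
                && val.compareTo(INT_MIN) >= 0
                && val.compareTo(INT_MAX) <= 0
                && list.contains(val.intValue()); // safe: val is known to fit in an int
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(29415, 7);
        System.out.println(containsValue(list, new BigInteger("29415")));        // true
        // 2^32 + 29415 would truncate to 29415, but the range check rejects it first.
        System.out.println(containsValue(list, new BigInteger("4294996711")));   // false
    }
}
```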
While arshajii provides a solution that works, I would vote against it.
You should never downcast values. You run the risk of your program producing larger values that translate to invalid values when downcast. This kind of bug is super nasty to troubleshoot months later.
If your code works with BigInteger, then you should convert all values from the database into BigInteger. This is an upcast in which you cannot lose values.
Overall I would value correctness over efficiency. If anything, I would reconsider your usage of BigInteger (maybe long is fine?), but because you have it, I assume you have a reason for it.
In Java, List.contains() uses the equals() method internally, and because BigInteger.equals(Integer) returns false, your List.contains() also returns false. Either use a List<BigInteger> or extract the int value from the BigInteger (as arshajii explained). Of course, if you really want to search efficiently, you should think of a binary search (in a sorted list) or of another data structure such as a Map.
You can try using BigInteger#intValue():
arr.contains(myBigInteger.intValue())
Note, however, that if myBigInteger is too big to fit into an int, then only the lower 32 bits will be returned (as described in the linked docs). Therefore, you might want to check that myBigInteger lies between Integer.MIN_VALUE and Integer.MAX_VALUE before checking for containment.
I have been given some code to optimise. One of the bits contains some code which takes a set with elements and for all elements in the set compares them to all other elements. The comparison isn't symmetric so no shortcut there. The code looks as follows:
for(String string : initialSet)
{
Set<String> copiedSet = new HashSet<>(initialSet);
copiedSet.remove(string);
for(String innerString : copiedSet)
{
/**
* Magic, unicorns, and elves! Compare the distance of the two strings by
* some very fancy method! No need to detail it here, just believe me it
* works, it isn't the subject of the question!
*/
}
}
To my understanding, the complexity would look as follows: the initial loop has a complexity of O(n) where n is the size of the initial set. Creating a set via the copy constructor would, in my understanding induce equals tests on all elements as the set would need to ensure the contract of the set, that is, no duplicate elements. This would mean that for n insertions, the complexity would increase by the sum from 0 to n-1. The removal would again need to check, in the worst case, n elements. The inner for loop then loops on n-1 elements.
The method I have used this far is simply:
for(String string : set)
{
for(String innerString : set)
{
if(! string.equals(innerString))
{
/**
* Magic, unicorns, and elves! Compare the distance of the two strings by
* some very fancy method! No need to detail it here, just believe me it
* works, it isn't the subject of the question!
*/
}
}
}
In my understanding, this would induce a complexity of roughly O(n^2) abstracting the complexity of the code in the if clause.
Therefore, the second piece of code would be better by at least one order plus the sum I outlined above. However, I am working with a dangerous assumption: I am assuming how the copy constructor of HashSet works. Simple benchmarks showed that the results were indeed better for the second snippet by about a factor of n. I would like to tap into your knowledge to confirm these findings and gain more insight into the workings of the copy constructor if possible. Also, the ideal case would be to find a resource listing functions by time complexity, but I guess that last one will remain wishful thinking!
The source code for the copy constructor is widely available, so you can study that as well as clone() and see if one of them suits you.
But truly, if all you are trying to do is avoid comparing an element with itself then I think your second idea involving magic, unicorns, and Elvis elves, is probably the best idea of all. Comparing every element in a Set with every other element in it is inherently an O(n^2) problem, and you're not going to get much better than that.
There's no reason to compare the elements in a Set. By definition, they are all different to one another.
From the javadoc:
A collection that contains no duplicate elements.
More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element.
As implied by its name, this interface models the mathematical set abstraction.
If you have a different type of collection, though, and want to skip comparing an element with itself, you can iterate with index variables (i and j) and skip the steps in which they are equal. For example:
for (int i = 0; i < collection.size(); i++) {
for (int j = 0; j < collection.size(); j++) {
if (i != j) {
//compare
}
}
}
I'm not sure exactly what you are doing in your "comparison" but if it really is just finding matching elements then the Set Interface at http://docs.oracle.com/javase/tutorial/collections/interfaces/set.html has some useful methods.
For example:
s1.retainAll(s2) — transforms s1 into the intersection of s1 and s2. (The intersection of two sets is the set containing only the elements common to both sets.)
s1.removeAll(s2) — transforms s1 into the (asymmetric) set difference of s1 and s2. (For example, the set difference of s1 minus s2 is the set containing all of the elements found in s1 but not in s2.)
s1.addAll(s2) — transforms s1 into the union of s1 and s2. (The union of two sets is the set containing all of the elements contained in either set.)
This lets you easily get intersections, combinations, etc for Java Sets.
In general the Java collections classes use very efficient algorithms so you are unlikely to improve upon them without a lot of work.
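The three bulk operations mutate the set they are called on, so a common pattern is to work on a copy. A small sketch, with the class name SetOperations and its method names chosen only for this example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetOperations {
    // Elements common to both sets; the inputs are left untouched.
    static Set<String> intersection(Set<String> s1, Set<String> s2) {
        Set<String> copy = new HashSet<>(s1);
        copy.retainAll(s2);
        return copy;
    }

    // Elements of s1 that are not in s2 (asymmetric difference).
    static Set<String> difference(Set<String> s1, Set<String> s2) {
        Set<String> copy = new HashSet<>(s1);
        copy.removeAll(s2);
        return copy;
    }

    // Elements contained in either set.
    static Set<String> union(Set<String> s1, Set<String> s2) {
        Set<String> copy = new HashSet<>(s1);
        copy.addAll(s2);
        return copy;
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> s2 = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println(intersection(s1, s2)); // contains b and c
        System.out.println(difference(s1, s2));   // contains a
        System.out.println(union(s1, s2));        // contains a, b, c and d
    }
}
```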
I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but as I'm memory constrained and need to store 2 million pairs with 7-character keys, I need something that's memory efficient; retrieval speed is a secondary concern.
Currently I'm going along the line of:
List<Tuple<String, int>> list = new ArrayList<Tuple<String, int>>();
list.add(...); // load from file
Collections.sort(list);
and then for retrieval:
Collections.binarySearch(list, key); // log(n), acceptable
Should I perhaps go for a custom tree (each node a single character, each leaf with result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.
Edit: I just saw you mentioned the String were UK postcodes so I'm fairly confident you couldn't get very wrong by using a Trove TLongIntHashMap (btw Trove is a small library and it's very easy to use).
Edit 2: Lots of people seem to find this answer interesting so I'm adding some information to it.
The goal here is to use a map containing keys/values in a memory-efficient way so we'll start by looking for memory-efficient collections.
The following SO question is related (but far from identical to this one).
What is the most efficient Java Collections library?
Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and, that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) about memory and speed of Trove compared to the default Collections. Here's a snippet:
                   100000 put operations   100000 contains operations
java collections   1938 ms                 203 ms
trove              234 ms                  125 ms
pcj                516 ms                  94 ms
And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:
java collections   oscillates between 6644536 and 7168840 bytes
trove              1853296 bytes
pcj                1866112 bytes
So even though benchmarks always need to be taken with a grain of salt, these numbers suggest that Trove not only saves memory but can also be considerably faster.
So our goal now becomes to use Trove (given that, once you put millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).
You mentioned 2 million pairs, 7 characters long keys and a String/int mapping.
2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String,Integer>, which is why Trove makes a lot of sense here.
However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using say only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge altogether objects creation and represent your 7 characters on a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.
You stated specifically that your keys were 7 characters long and then commented they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of keys/values pair into memory using Trove.
The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.
(*) say you only have at most 256 codepoints/characters used, then it fits on 7*8 == 56 bits, which is small enough to fit in a long.
Sample method for encoding the String keys into long's (assuming ASCII characters, one byte per character for simplification - 7 bits would be enough):
long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IndexOutOfBoundsException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        // Mask to one byte so a character >= 128 cannot sign-extend into the higher bytes.
        result |= (key.charAt(i) & 0xFFL) << (i * 8);
    }
    return result;
}
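To check that such an encoding loses no information, it helps to have the reverse direction as well. The following sketch pairs an encoder with a hypothetical decode method (not part of the original answer); the class name PostcodeKeys is also made up, and the code assumes single-byte characters and keys that never contain the NUL character.

```java
class PostcodeKeys {
    // Packs up to 8 single-byte characters into a long, first character in the lowest byte.
    static long encode(String key) {
        if (key.length() > 8) {
            throw new IndexOutOfBoundsException("key is longer than 8 characters");
        }
        long result = 0;
        for (int i = 0; i < key.length(); i++) {
            result |= (key.charAt(i) & 0xFFL) << (i * 8);
        }
        return result;
    }

    // Reverses encode(); stops at the first point where only zero bytes remain.
    static String decode(long packed) {
        StringBuilder sb = new StringBuilder(8);
        while (packed != 0) {
            sb.append((char) (packed & 0xFF));
            packed >>>= 8; // unsigned shift so a set top bit cannot loop forever
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long packed = encode("SW1A1AA");
        System.out.println(decode(packed)); // SW1A1AA
    }
}
```

The resulting long values can then be used as keys in a primitive-keyed map such as the TLongIntHashMap mentioned above.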
Use the Trove library.
The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.
First off, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want to take that approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or, in particular, a TreeMap would consume that much more memory, and the first two would provide constant-time access and the tree map logarithmic access, along with a nicer interface than a plain list. I would try to measure how much the difference in memory consumption really is.
UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific for strings, such as tries (especially patricia tries), which might reduce the storage space needed for the strings.
What you are looking for is a succinct-trie - a trie which stores its data in nearly the least amount of space theoretically possible.
Unfortunately, there are no succinct-trie class libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).
In the meanwhile, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.
Have you looked at tries? I've not used them, but they may fit with what you're doing.
A custom tree would have the same complexity of O(log n), don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList because the linked list allocates one extra object per stored value, which will amount to a lot of objects in your case.
As Erick writes using the Trove library is a good place to start as you save space in storing int primitives rather than Integers.
However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:
If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
You could consider transforming each String into a UTF-8 encoded byte array and wrapping this in class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
You could write the map to a file and then memory map a portion or all of the file.
You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.
maybe you can go with a RadixTree?
Use java.util.TreeMap instead of java.util.HashMap. It makes use of a red-black binary search tree and doesn't use more memory than what is required for holding the nodes containing the elements in the map. No extra buckets, unlike HashMap or Hashtable.
I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.
I'd consider using some cache as these often have the overflow-to-disk ability.
You might create a key class that matches your needs. Perhaps like this:
public class MyKey implements Comparable<MyKey>
{
char[] keyValue = new char[7];
public MyKey(String keyValue)
{
... load this.keyValue from the String keyValue.
}
public int compareTo(MyKey rhs)
{
... blah
}
public boolean equals(Object rhs)
{
... blah
}
public int hashCode()
{
... blah
}
}
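One way the placeholders in such a key class could be filled in is sketched below. It is an assumption-laden variant rather than the author's code: it is named CompactKey, stores one byte per character instead of a char array, and assumes the keys are plain ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CompactKey implements Comparable<CompactKey> {
    private final byte[] bytes; // one byte per character (ASCII assumed)

    CompactKey(String key) {
        this.bytes = key.getBytes(StandardCharsets.US_ASCII);
    }

    @Override
    public int compareTo(CompactKey other) {
        int common = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < common; i++) {
            int diff = bytes[i] - other.bytes[i]; // ASCII bytes are 0..127, so no sign issues
            if (diff != 0) {
                return diff;
            }
        }
        return bytes.length - other.bytes.length; // shorter key sorts first on a tie
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CompactKey && Arrays.equals(bytes, ((CompactKey) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes); // consistent with equals
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Each key still costs one object plus one array, so this saves less than packing the key into a primitive, but it works with any map or sorted list that accepts objects.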
try this one
OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for(int i = 0; i < 2000000; i++)
{
myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));
public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
public boolean containsValue(Object value) {
if(value != null)
{
Class<? extends Object> aClass = value.getClass();
if(aClass.isArray())
{
Collection values = this.values();
for(Object val : values)
{
int[] newval = (int[]) val;
int[] newvalue = (int[]) value;
if(newval[0] == newvalue[0])
{
return true;
}
}
}
}
return false;
}
}
Actually, HashMap and List are too general for a task as specific as looking up an int by zipcode. You should take advantage of what you know about the data. One option is to use a prefix tree whose leaves store the int value. It could also be pruned if (my guess) a lot of codes with the same prefix map to the same integer.
Looking up the int for a zipcode in such a tree takes time linear in the length of the code and does not grow as the number of codes increases, compared to O(log(N)) for binary search.
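A bare-bones prefix tree along these lines is sketched below. The names PostcodeTrie, put and get are hypothetical, and -1 is used as a "not found" marker on the assumption that stored values are non-negative. The per-node HashMap keeps the sketch short but is not memory efficient; a memory-conscious version would use compact child arrays and the pruning mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

class PostcodeTrie {
    static final int MISSING = -1; // assumed sentinel: real values are non-negative

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int value = MISSING;
    }

    private final Node root = new Node();

    // Walks or creates one node per character, then stores the value at the end.
    void put(String code, int value) {
        Node node = root;
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.value = value;
    }

    // Takes one step per character of the code, regardless of how many codes are stored.
    int get(String code) {
        Node node = root;
        for (int i = 0; i < code.length() && node != null; i++) {
            node = node.children.get(code.charAt(i));
        }
        return node == null ? MISSING : node.value;
    }
}
```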
Since you are intending to use hashing, you can try numerical conversions of the strings based on ASCII values.
the simplest idea will be
int sum=0;
for(int i=0;i<arr.length;i++){
sum+=(int)arr[i];
}
Then hash "sum" using a well-defined hash function. Choose the hash function based on the expected input patterns.
E.g., if you use the division method:
public int hasher(int sum){
return sum%(a prime number);
}
Selecting a prime number that is not close to an exact power of two improves performance and gives a more uniform distribution of hashed keys.
Another method is to weight the characters based on their respective positions.
E.g., if you use the above method, both "abc" and "cab" will be hashed to the same location. If you need them to be stored in two distinct locations, give weights to the positions, as positional number systems do.
int sum=0;
int weight=1;
for(int i=0;i<arr.length;i++){
sum+= (int)arr[i]*weight;
weight=weight*2; // using powers of 2 gives better results. (you know why :))
}
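The difference between the two sums can be seen in a small runnable sketch. The class name StringHashes and its method names are invented for this example, and the prime 101 is an arbitrary choice. Note that with a doubling weight, positions beyond the 31st overflow an int, so this is only meant for short keys.

```java
class StringHashes {
    // Sum of character codes: ignores the order of the characters.
    static int plainSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Position-weighted sum with weights 1, 2, 4, ...: order now matters.
    static int weightedSum(String s) {
        int sum = 0;
        int weight = 1;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i) * weight;
            weight *= 2;
        }
        return sum;
    }

    // Division method: map a hash onto a table with a prime number of slots.
    static int bucket(int hash, int prime) {
        return (hash & 0x7fffffff) % prime; // clear the sign bit to keep the index non-negative
    }

    public static void main(String[] args) {
        System.out.println(plainSum("abc") + " " + plainSum("cab"));       // 294 294
        System.out.println(weightedSum("abc") + " " + weightedSum("cab")); // 689 685
        System.out.println(bucket(weightedSum("abc"), 101));               // 83
    }
}
```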
As your sample is quite large, you'd do better to resolve collisions with a chaining mechanism rather than a probe sequence.
After all, which method you choose depends entirely on the nature of your application.
The problem is objects' memory overhead, but using some tricks you can try to implement your own hashset. Something like this. Like others said strings have quite large overhead so you need to "compress" it somehow. Also try not to use too many arrays(lists) in hashtable (if you do chaining type hashtable) as they are also objects and also have overhead. Better yet do open addressing hashtable.
So, here is the actual question (it's for a homework):
A hashtable is a data structure that allows access and manipulation of the data in constant time (O(1)). The hashtable array must be initialized to null during the creation of the hashtable in order to identify the empty cells. In most cases, the time penalty is enormous, especially considering that most cells will never be read. We ask that you implement a hashtable that bypasses this problem at the price of a heavier insertion, but still in constant time. For the purpose of this homework and to simplify your work, we suppose that you can't delete elements in this hashtable.
In the archive of this homework you will find the interface of a hashtable that you need to fill in. You can use the function hashCode() from Java as a hash function. You will have to use the Vector data structure from Java in order to bypass the initialization, and you have to find out by yourself how to do so. You can only insert elements at the end of the vector so that the complexity is still O(1).
Here are some facts to consider:
In a hashtable containing integers, the table contains numeric values (but they don't make any sense).
In a stack, you cannot access elements over the highest element, but you know for sure that all the values are valid. Furthermore, you know the index of the highest element.
Use those facts to bypass the initialization of the hashtable. The table must use linear probing to resolve collisions.
Also, here is the interface that I need to implement for this homework:
public interface NoInitHashTable<E>
{
public void insert(E e);
public boolean contains(E e);
public void rehash();
public int nextPrime(int n);
public boolean isPrime(int n);
}
I have already implemented nextPrime and isPrime (I don't think they are different from a normal hashtable). The other three I need to figure out.
I thought a lot about it and discussed it with my teammate but I really can't find anything. I only need to know the basic principle of how to implement it, I can handle the coding.
tl;dr I need to implement an array hashtable that works without initializing the array to null at the start. The insertion must be done in constant time. I only need to know the basic principle of how to do that.
I think I have seen this in a book as exercise with answer at the back, but I can't remember which book or where. It is generally relevant to the question of why we usually concentrate on the time a program takes rather than the space - a program that runs efficiently in time shouldn't need huge amounts of space.
Here is some pseudo-code that checks if a cell in the hash table is valid. I will leave the job of altering the data structures it defines to make another cell in the hash table valid as a remaining exercise for the reader.
// each cell here is for a cell at the same offset in the
// hash table
int numValidWhenFirstSetValid[SIZE];
int numValidSoFar = 0; // initialise only this
// Only cells 0..numValidSoFar-1 here are valid.
int validOffsetsInOrderSeen[SIZE];
boolean isValid(int offsetInArray)
{
int supposedWhenFirstValid =
numValidWhenFirstSetValid[offsetInArray];
if (supposedWhenFirstValid >= numValidSoFar)
{
return false;
}
if (supposedWhenFirstValid < 0)
{
return false;
}
if (validOffsetsInOrderSeen[supposedWhenFirstValid] !=
offsetInArray)
{
return false;
}
return true;
}
Edit - this is exercise 24 in section 2.2.6 of Knuth Vol 1. The provided answer references exercise 2.12 of "The Design and Analysis of Computer Algorithms" by Aho, Hopcroft, and Ullman. You can avoid any accusation of plagiarism in your answer by referencing the source of the question you were asked :-)
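For readers who want to see the whole technique in one place, here is a minimal Java sketch of the validity check together with a possible way to mark a cell valid. The class name LazyValidity and the method markValid are my own additions, not part of the pseudo-code above. Java zero-fills new arrays, so this only illustrates the idea; the logic never relies on the arrays' initial contents, only on numValid starting at 0.

```java
class LazyValidity {
    private final int[] whenFirstValid; // per cell: the slot in validOffsets that the cell claims
    private final int[] validOffsets;   // cell offsets in the order they became valid
    private int numValid = 0;           // the only value the technique needs initialised

    LazyValidity(int size) {
        whenFirstValid = new int[size];
        validOffsets = new int[size];
    }

    // A cell is valid only if its claimed slot is in range and points back at the cell.
    boolean isValid(int offset) {
        int claimed = whenFirstValid[offset];
        return claimed >= 0
                && claimed < numValid
                && validOffsets[claimed] == offset;
    }

    // Constant-time: record the cell on top of the stack of valid offsets.
    void markValid(int offset) {
        if (!isValid(offset)) {
            whenFirstValid[offset] = numValid;
            validOffsets[numValid] = offset;
            numValid++;
        }
    }
}
```

In a hashtable, insert would call markValid for the cell it writes, and contains would treat any cell for which isValid returns false as empty.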
Mark each element in the hashtable with some color (1, 2, ...).
E.g.
Current color:
int curColor = 0;
When you put an element into the hash table, associate the current color (curColor) with it.
When you search, ignore elements that don't have the current color, i.e. keep only those with element.color == curColor.
If you need to clear the hashtable, just increment the current color (curColor++).
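A rough sketch of the coloring idea, using a fixed-size table of int slots: the class name ColoredSlots and its methods are hypothetical. The sketch starts the color at 1 so that Java's zero-filled color array reads as "empty"; in other words it still relies on that array having been initialised once, and what it avoids is re-initialising on every clear.

```java
class ColoredSlots {
    private final int[] values;
    private final int[] colors; // color each slot was last written with
    private int curColor = 1;   // starts above 0 so untouched slots never match

    ColoredSlots(int size) {
        values = new int[size];
        colors = new int[size];
    }

    // Writing a slot stamps it with the current color.
    void put(int index, int value) {
        values[index] = value;
        colors[index] = curColor;
    }

    // A slot counts as filled only if it carries the current color.
    boolean isSet(int index) {
        return colors[index] == curColor;
    }

    int get(int index) {
        if (!isSet(index)) {
            throw new IllegalStateException("slot " + index + " is empty");
        }
        return values[index];
    }

    // Clears every slot in constant time by moving to a fresh color.
    void clear() {
        curColor++;
    }
}
```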