I came across an interview question for which I am not sure what the correct answer is.
The problem is below.
Given an array of integers, return indices of the two numbers such that they add up to a specific target.
The way I solved it was by looping through the elements and storing, in a map, the diff between the target and each element as the key and the index of the current element as the value.
Then, later on, when the diff appears in the array, I can look up in the map which element has this diff and at what index.
Up to now it is fine.
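For concreteness, here is a minimal sketch of that map-based approach (the class and method names are mine):
import java.util.HashMap;
import java.util.Map;

public class TwoSum {
    static int[] twoSum(int[] nums, int target) {
        // maps (target - nums[i]) -> i for the elements seen so far
        Map<Integer, Integer> wanted = new HashMap<>();
        for (int i = 0; i < nums.length; i++) {
            Integer j = wanted.get(nums[i]); // an earlier element was waiting for this value
            if (j != null) return new int[]{j, i};
            wanted.put(target - nums[i], i);
        }
        return null; // no pair sums to target
    }
}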
However, for a follow-up question, "What if the elements are doubles?", I am not sure what the issue is, if any.
When searching, I came across a couple of posts mentioning the correct way to compute a hash code using a logical shift and XOR. However, I see that similar logic is used in Java's Double.hashCode method.
I thought that the problem could be that, when computing the diff, there might be precision loss; hence, it might end up mapping to a different hash bucket.
However, when I tried, I couldn't come up with such an input. What is the actual problem? And how do I test/solve it?
I tried simply changing the numbers to double, but the logic worked fine.
This program illustrates the problem, with a two-element array:
public strictfp class Test {
    public static void main(String[] args) {
        // in[1] differs from 1e10 by one ulp, yet 1e10 + in[1] still rounds to 2e10
        double[] in = {1e10, Math.nextUp(1e10)};
        double target = 2e10;
        System.out.println(in[0] + in[1]);
        double diff = target - in[0];
        System.out.println(diff);
        System.out.println(in[1] == diff);           // false: in[1] is not target - in[0]
        System.out.println(in[0] + in[1] == target); // true: yet the pair sums to target
    }
}
output:
2.0E10
1.0E10
false
true
The problem is that your logic assumes that there is only one value that, when added to an element of your array, gives the target sum. Because of rounding, with double elements, there can be multiple values that give the same sum.
In my example, the sum of the two elements of the array is equal to the target, but the second element is not equal to the difference between in[0] and the target.
My first thought on a solution to the floating point version of the problem is as follows:
Create an array of ordered pairs containing value and index.
Sort by value.
For each element x, do a binary search for an element y such that x.value + y.value is equal to the target.
If you find one, return [x.index, y.index]
This is O(n log(n)) where the int version was O(n).
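Here is a rough sketch of that idea (all names are mine; the original index is stored alongside each value in a double, which is exact for array-sized indices). The scan over adjacent entries with the same sum is the crucial part: as demonstrated above, several stored values can pair with x to produce exactly the target:
import java.util.Arrays;
import java.util.Comparator;

public class TwoSumDouble {
    // sort + binary search; pairs[i] = {value, original index}
    static int[] twoSum(double[] in, double target) {
        double[][] pairs = new double[in.length][];
        for (int i = 0; i < in.length; i++) pairs[i] = new double[]{in[i], i};
        Arrays.sort(pairs, Comparator.comparingDouble((double[] p) -> p[0]));
        for (double[] x : pairs) {
            int lo = 0, hi = pairs.length - 1;
            while (lo <= hi) {                  // binary search for a partner y
                int mid = (lo + hi) >>> 1;
                double sum = x[0] + pairs[mid][0];
                if (sum == target) {
                    // several adjacent values may give exactly this sum; scan them,
                    // skipping x itself (compare by original index, not value)
                    int k = mid;
                    while (k > 0 && x[0] + pairs[k - 1][0] == target) k--;
                    for (; k < pairs.length && x[0] + pairs[k][0] == target; k++) {
                        if (pairs[k][1] != x[1]) {
                            return new int[]{(int) x[1], (int) pairs[k][1]};
                        }
                    }
                    break;                      // only candidate was x itself
                }
                if (sum < target) lo = mid + 1; else hi = mid - 1;
            }
        }
        return null;                            // no pair sums to target
    }

    public static void main(String[] args) {
        double[] in = {1e10, Math.nextUp(1e10)};
        int[] result = twoSum(in, 2e10);        // finds [0, 1] despite the rounding
        System.out.println(Arrays.toString(result));
    }
}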
I have a number of entries in an array (FT = [-10.5, 6.5, 7.5, -7.5]) to which I am applying binary splitting, appending the pieces to a result array of arrays (LT = [[-10.5], [6.5, 7.5, -7.5], [6.5, 7.5], [-7.5]]). The tree describing the splitting for my example is below:
[-10.5, 6.5, 7.5, -7.5]
    /          \
[-10.5]   [6.5, 7.5, -7.5]
               /        \
         [6.5, 7.5]   [-7.5]
Now, from the array LT, I want to retrieve only the "leaf" arrays (T = [[-10.5], [6.5, 7.5], [-7.5]]), given the size of the initial array FT.
How to achieve this (get T) in Java?
I am presenting a way of thinking about your problem. I am not fleshing it out into a full algorithm; I am leaving some parts for you to fill in.
First, if LT is empty, no splitting has occurred. In this case the original FT was the leaf array, and we have no way of telling what it was. The problem cannot be solved.
If LT contains n arrays, then there must exist some m (0 < m < n) so that the first m arrays form the left subtree and the rest form the right subtree. We don’t know m, so we simply try all possible values of m in turn. For each possible m we check whether a solution for this value of m is possible by trying to reconstruct each subtree.
So define an auxiliary method to check if a part of LT can form a subtree and return the leaves if it can.
Your auxiliary method will work like this: if there is only one array, it's a leaf, so return it. If there are two arrays, they cannot form a subtree. If there are three, they form a subtree exactly if the first is the concatenation of the other two. If there are more than three, then again we need to consider all the possibilities of how they are distributed into subtrees. The difference from before is that we know which full array the subtrees come from, namely the frontmost array, so all candidate splits should be checked against it. For starters, if the second array is not a prefix of the first, we cannot have a subtree.
Your algorithm will no doubt get recursive at some point.
Pruning opportunity: in this kind of binary tree, where every node has either zero or two children, the total number of nodes is always odd. So for a solution to exist, n needs to be even and m needs to be odd.
I would consider coding my algorithm using lists rather than arrays because I think it’s more convenient to pass lists of lists rather than arrays of arrays or lists of arrays around.
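Along those lines, here is a minimal sketch of one way the pieces can fit together, assuming LT lists the subtree arrays in preorder (left subtree before right subtree); all names are mine, and the pruning and prefix-check shortcuts are left out:
import java.util.ArrayList;
import java.util.List;

public class LeafRecovery {

    // Returns the leaves of the subtree encoded by lt.subList(from, to),
    // where lt.get(from) is that subtree's root array, or null if the
    // range cannot form a valid subtree.
    static List<List<Double>> leaves(List<List<Double>> lt, int from, int to) {
        if (to - from == 1) { // a single array is a leaf
            List<List<Double>> leaf = new ArrayList<>();
            leaf.add(lt.get(from));
            return leaf;
        }
        List<Double> root = lt.get(from);
        // try every split: arrays [from+1, m) form the left subtree,
        // arrays [m, to) form the right subtree
        for (int m = from + 2; m < to; m++) {
            List<Double> joined = new ArrayList<>(lt.get(from + 1));
            joined.addAll(lt.get(m));
            if (!joined.equals(root)) continue; // child roots must rebuild the parent
            List<List<Double>> left = leaves(lt, from + 1, m);
            List<List<Double>> right = leaves(lt, m, to);
            if (left != null && right != null) {
                left.addAll(right);
                return left;
            }
        }
        return null; // no split worked
    }

    public static void main(String[] args) {
        List<List<Double>> all = new ArrayList<>();
        all.add(List.of(-10.5, 6.5, 7.5, -7.5)); // FT as the root, then LT in preorder
        all.add(List.of(-10.5));
        all.add(List.of(6.5, 7.5, -7.5));
        all.add(List.of(6.5, 7.5));
        all.add(List.of(-7.5));
        System.out.println(leaves(all, 0, all.size()));
        // prints [[-10.5], [6.5, 7.5], [-7.5]]
    }
}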
Happy further refinement and coding.
I have to write in Java (with CPLEX) the variable x[i][j] that is the sum over k of x[i][j][k].
i, j and k are the indices of three sets.
I have already declared x[i][j][k], but I would like to know the right expression.
Thanks
You appear to be confusing yourself regarding the type of x[i][j]. Is it a set with several elements, indexed by k, or is it a number representing their sum? You seem to start out this step in computation with the first answer and conclude with the second.
A solution to this is to use another variable to store the result, maybe something like:
sums[i][j] = sum(x[i][j]);
where sum(list) is a function that takes in a list, starts with a return value of 0, and iterates over the elements of the input list adding each one to the return value, then returns that value. You can check here for some ideas about how to implement that.
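If the CPLEX Concert API is the target, something along these lines might work: a sketch assuming x is an IloNumVar[][][] already created on an IloCplex model (verify the exact overloads against your CPLEX version):
import ilog.concert.IloException;
import ilog.concert.IloNumExpr;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

public class SumOverK {
    // builds expressions sums[i][j] = sum over k of x[i][j][k]
    static IloNumExpr[][] sumOverK(IloCplex cplex, IloNumVar[][][] x) throws IloException {
        IloNumExpr[][] sums = new IloNumExpr[x.length][];
        for (int i = 0; i < x.length; i++) {
            sums[i] = new IloNumExpr[x[i].length];
            for (int j = 0; j < x[i].length; j++) {
                sums[i][j] = cplex.sum(x[i][j]); // Concert sums an array of expressions
            }
        }
        return sums;
    }
}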
I am pretty sure that I thoroughly understand how the methods with only one recursion work.
Ex) calculating factorial
public int factorial(int n) { // factorial recursion
    if (n == 0) {
        return 1;
    } else {
        return n * factorial(n - 1);
    }
}
For these methods, I can even picture what's going on in the stacks and what values are being returned at each stack level.
But whenever I encounter methods with double recursion, the nightmare begins.
Below is a problem with a double recursion from CodingBat.
Ex) Given an array of ints, is it possible to choose a group of some of the ints such that the group sums to the given target? If yes, return true; if no, return false.
You use three parameters: a starting index start, an int array nums, and a target int value target.
Below is the solution for this problem.
public boolean groupSum(int start, int[] nums, int target) {
    if (start >= nums.length) return (target == 0);
    if (groupSum(start + 1, nums, target - nums[start])) return true;
    if (groupSum(start + 1, nums, target)) return true;
    return false;
}
My attempt to understand this solution goes like this. Say I was given the array {2,4,8}, starting index 0, and target value 10. So (0,{2,4,8},10) goes into the method, and the function gets re-called at
if (groupSum(start + 1, nums, target - nums[start])) return true;
so it becomes (1,{2,4,8},8), and this repeats until the start index hits 3. When it hits 3, the stack at the last level(?) goes to the second recursive call. And this is where I start losing track of what's happening.
Can anybody break this down for me? And when people use double recursion (I know it's very inefficient and that in practice almost no one uses it, but this is just an attempt to understand it), can they actually visualize what's going to happen? Or do they just use it, hoping that the base case and recursion will work properly? I think this applies generally to all the people who wrote merge sort, the Tower of Hanoi algorithm, etc.
Any help is greatly appreciated.
The idea of a double recursion is to break the problem into two smaller problems. Once you solve the smaller problems, you can either join their solutions (as is done in merge sort) or choose one of them - which is done in your example, which only requires the second smaller problem to be solved if solving the first smaller problem didn't solve the full problem.
Your example tries to determine if there is a subset of the input nums array whose sum is the target sum. start determines which part of the array is considered by the current recursive call (when it's 0, the entire array is considered).
The problem is broken into two, since if such a subset exists, it either contains the first element of the array (in which case the problem is reduced to finding whether there's a subset of the last n-1 elements of the array whose sum is target minus the value of the first element) or doesn't contain it (in which case the problem is reduced to finding whether there's a subset of the last n-1 elements of the array whose sum is target).
The first recursion handles the case where the subset contains the first element, which is why it makes a recursive call that would look for the target sum minus the first element in the remaining n-1 elements of the array. If the first recursion returns true, it means that the required subset exists, so the second recursion is never called.
The second recursion handles the case where the subset doesn't contain the first element, which is why it makes a recursive call that would look for the target sum in the remaining n-1 elements of the array (this time the first element is not subtracted from the target sum, since the first element is not included in the sum). Again, if the second recursive call returns true, it means that the required subset exists.
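One concrete way to watch the two branches at work (my own illustration, not part of the original solution) is to add an indent parameter and print every call; running this on {2,4,8} with target 10 shows the include branch being abandoned and the exclude branch succeeding:
public class GroupSumTrace {
    // same logic as the posted solution, plus an indent parameter so each
    // recursive call prints at its depth in the call tree
    static boolean groupSum(int start, int[] nums, int target, String indent) {
        System.out.println(indent + "groupSum(start=" + start + ", target=" + target + ")");
        if (start >= nums.length) return (target == 0);
        // branch 1: include nums[start] in the group
        if (groupSum(start + 1, nums, target - nums[start], indent + "  ")) return true;
        // branch 2: exclude nums[start] from the group
        return groupSum(start + 1, nums, target, indent + "  ");
    }

    public static void main(String[] args) {
        System.out.println(groupSum(0, new int[]{2, 4, 8}, 10, "")); // true: 2 + 8 = 10
    }
}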
Well, if you want to visualize it, it's usually kind of like a tree. You first follow one path through the tree until the end, then step back one and pick a different path (if possible). If there is none, or you are happy with your result, you just take another step back, and so on.
I don't know if this helps you but when I learned recursion, it helped to just think of my method as already working.
So I thought: great, so basically my method is already working, but I can't call it with the same parameters, and I have to make sure I return the right value for these exact parameters by using different ones.
If we take that example:
At first we know that if we have no numbers left to look at, then the answer depends on whether the target is 0. (first line)
Now what do we do with the rest? Well... we'd need to think about it for a moment.
Just think about the very first number. Under what circumstances is it part of the solution? Well, that would be if you could create target - firstnumber with the rest of the numbers, because then, when you add firstnumber, you reach target.
So you try to see if that's possible. If so, it's solvable. (second line)
But if not, it's still possible that the first number just isn't important for the solution. So you have to try again to build the target without that number. (third line)
And that's basically all there is to this.
Of course to think like this you need two things:
1. You need to believe that your method already works for other parameters
2. You need to make sure your recursion terminates. That's the first line in this example, but you should always think about whether there is any combination of parameters that will just create an endless recursion.
Try to understand it like this: recursion can be rewritten as a while-loop, where the condition of the while is the negation of the stop condition of the recursion.
As already said, there is nothing called double recursion.
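To make the while-loop idea concrete for a branching recursion like groupSum: the call stack becomes an explicit stack of pending (start, target) pairs. A rough sketch, with all names mine:
import java.util.ArrayDeque;
import java.util.Deque;

public class GroupSumIterative {
    // the recursion's call stack becomes an explicit stack of (start, target)
    // pairs that are still waiting to be explored
    static boolean groupSum(int[] nums, int target) {
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[]{0, target});
        while (!stack.isEmpty()) {                  // loop runs while work remains
            int[] frame = stack.pop();
            int start = frame[0], t = frame[1];
            if (start >= nums.length) {
                if (t == 0) return true;            // base case: exact sum found
                continue;
            }
            stack.push(new int[]{start + 1, t});                 // exclude nums[start]
            stack.push(new int[]{start + 1, t - nums[start]});   // include nums[start]
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(groupSum(new int[]{2, 4, 8}, 10)); // true (2 + 8)
    }
}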
I am trying to code a frequency analysis program for fun. I currently have everything stored in a HashMap, and I index the values using an iterator.
My values, however, are stored as integers; how could I go about converting these entries into percentages, or a more accessible format, so I can compare them later?
I was thinking I could use getValue(), but this returns an Object.
Can anyone point me in the right direction? Should I be using a HashMap? Should I transfer the values into an array the size of the HashMap?
HashMaps are indeed ideal for building frequency tables, and the value type should definitely be Integer (if you tried to store percentages, you'd have to update all the percentages each time you add a new value). If you have another class that contains the HashMap as a field, you could add a method for retrieving the percentage of a specific character (note that I don't recall the exact method names):
public float getPercentage(char c) {
    if (!map.containsKey(c))
        return 0;
    int sum = 0;
    for (Integer count : map.values()) // total of all counts in the table
        sum += count;
    return map.get(c) / (float) sum;   // this character's share of the total
}
If you want the percentages for all characters, you should make a method that produces a new hashmap that contains the percentages, calculated in a similar fashion. If you want to be fancy (read: overengineer), you could even implement an Iterator that produces percentages from the original Integer hashmap.
I'm assuming you have a map of the form {'A':5, 'B':4, etc} meaning A appears five times in your text, B four times, etc.
In that case, to calculate the frequency of a given letter, you need to know the total number of letters in the map (i.e. 9 in the example above). You can do this one of two ways:
Iterate over the entire map, and sum up the values
Keep a running count of the number of times you add something to the map, so you can use it later.
Both are reasonable solutions to the problem. I'd prefer option 2, especially if you are doing things interactively, whereas option 1 might suffice in a batch mode setting.
You should parametrize your hashmap so that getValue() returns an Integer. You could use the object type Float if you would like a percentage instead.
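For example, a minimal sketch with a parameterized map (names and sample input are mine):
import java.util.HashMap;
import java.util.Map;

public class FreqExample {
    public static void main(String[] args) {
        // parameterized map: get()/getValue() return Integer, not Object
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : "hello".toCharArray()) {
            counts.merge(c, 1, Integer::sum);        // count occurrences
        }
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            float pct = 100f * e.getValue() / total; // percentage computed on demand
            System.out.println(e.getKey() + ": " + pct + "%");
        }
    }
}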
I have a file with many random integers (around a million), each separated by white space. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in Java?
I can think of
1. Create a hash map, where the key is the integer from the file and the value is its count. For every number in the file, check if that key already exists in the hash map; if yes, value++, else make a new entry in the hash map.
2. Make a BST, where each node is an integer from the file. For every integer from the file, see if there is already a node in the BST; if yes, do value++ (value is part of the node).
I feel the hash map is the better option if I can come up with a good hashing function.
Can someone please suggest what the best way of doing this is? Is there any other efficient algorithm that I can use?
Edit #2:
Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is the older, obsolete post, which could still be valid if someone had MANY more points than a million.)
A HashMap would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers and, for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a hash has to make allowances for should make it a few times faster.)
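A rough sketch of that counting-array idea, assuming the values fall in 0..999 and using a hypothetical input file numbers.txt:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class RangeCount {
    public static void main(String[] args) throws FileNotFoundException {
        int[] counts = new int[1000];               // assumes values in 0..999
        try (Scanner in = new Scanner(new File("numbers.txt"))) {
            while (in.hasNextInt()) {
                counts[in.nextInt()]++;             // one array slot per value
            }
        }
        // pick the 10 most frequent values by repeated selection
        for (int rank = 0; rank < 10; rank++) {
            int best = 0;
            for (int v = 1; v < counts.length; v++) {
                if (counts[v] > counts[best]) best = v;
            }
            if (counts[best] <= 0) break;           // fewer than 10 distinct values
            System.out.println(best + " occurred " + counts[best] + " times");
            counts[best] = -1;                      // exclude this value from later rounds
        }
    }
}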
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases though (it would have to be a wide range with "grouped" data, and even then the tree might win).
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.
Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashMap will be.
Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't bother optimizing. It will be fast enough. Only bother if it's something that's going to run repeatedly within an application.
HashMap
A million integers is not really a lot, even for interpreted languages, but especially for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.
Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)
Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement on step 4, you only need to step forward through the array in steps equivalent to your 10th-longest run. Any run longer than that will overlap with your sampling. If the tenth-longest run is 100 elements long, you only need to sample elements 100, 200, 300, and so on, and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th-longest is sure to overlap with your sampling.
You should apply this optimisation only once your 10th-longest run is very long compared to other runs in the array.
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out
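A minimal sketch of steps 1-5 above, without the sampling refinement and with a hard-coded array standing in for the file input:
import java.util.Arrays;

public class TopRuns {
    public static void main(String[] args) {
        int[] nums = {5, 3, 5, 9, 3, 5, 9, 9, 9, 3, 1}; // stand-in for the file input
        Arrays.sort(nums);
        // top[i][0] = value, top[i][1] = run length; tracks the ten longest runs
        int[][] top = new int[10][2];
        for (int i = 0; i < nums.length; ) {
            int j = i;
            while (j < nums.length && nums[j] == nums[i]) j++; // measure this run
            int len = j - i;
            // replace the currently shortest of the top ten if this run is longer
            int min = 0;
            for (int k = 1; k < 10; k++) if (top[k][1] < top[min][1]) min = k;
            if (len > top[min][1]) top[min] = new int[]{nums[i], len};
            i = j;
        }
        for (int[] t : top) if (t[1] > 0) System.out.println(t[0] + " x " + t[1]);
    }
}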
If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.
If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.
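A small sketch of that length-1-array trick (names and sample data are mine): the int[] is mutated in place, so incrementing an existing count never allocates a new Integer:
import java.util.HashMap;
import java.util.Map;

public class MutableCounts {
    public static void main(String[] args) {
        int[] data = {7, 3, 7, 7, 3};
        Map<Integer, int[]> counts = new HashMap<>();
        for (int value : data) {
            int[] c = counts.get(value);
            if (c == null) {
                counts.put(value, new int[]{1}); // first occurrence
            } else {
                c[0]++;                          // increment in place, no new objects
            }
        }
        counts.forEach((k, v) -> System.out.println(k + " -> " + v[0]));
    }
}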
Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case a HashMap is O(n)). Once you are done reading the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(n log n).
For example:
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Scanner;

public class Frequencies {

    public static class Count implements Comparable<Count> {
        int value;
        int count;

        public Count(int value) {
            this.value = value;
            this.count = 1;
        }

        public void increment() {
            count++;
        }

        public int compareTo(Count other) {
            return other.count - count; // descending by count
        }
    }

    public static void main(String[] args) throws Exception {
        Scanner input = new Scanner(new FileInputStream(new File("...")));
        HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
        while (input.hasNextInt()) {
            int tempInt = input.nextInt();
            Count tempCount = dataset.get(tempInt);
            if (tempCount != null) {
                tempCount.increment();
            } else {
                dataset.put(tempInt, new Count(tempInt));
            }
        }
        List<Count> counts = new ArrayList<Count>(dataset.values());
        Collections.sort(counts);
        // print the ten most frequent values
        for (Count c : counts.subList(0, Math.min(10, counts.size()))) {
            System.out.println(c.value + " occurred " + c.count + " times");
        }
    }
}
Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache, where an element's access count determines whether it stays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html
This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
    return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?
One suggested way to do it is with a sorted linked list. When you insert an element, you walk down the linked list; if it's there, you increment the node's count, otherwise you create a new node with a count of 1. After you have inserted every element, you have a sorted list of elements. Note, though, that each insertion scans the list, so building it this way is O(n^2) in the worst case, not O(n log n); your hash-map method, n inserts followed by an O(n log n) sort, is asymptotically cheaper.