Storing values within a hashmap - java

I am trying to code a frequency analysis program for fun. I currently have everything stored in a HashMap, and I index the values using an iterator.
My values, however, are stored as Integers. How could I go about converting these entries into percentages, or some more accessible format, so I can compare them later?
I was thinking I could use getValue(), but that returns an Object.
Can anyone point me in the right direction? Should I be using a HashMap? Should I transfer the values into an array the size of the HashMap?

HashMaps are indeed ideal for building frequency tables, and the value type should definitely be Integer (if you stored percentages instead, you would have to update every percentage each time you added a new value). If you have another class that contains the HashMap as a field, you could add a method for retrieving the percentage of a specific character (note that I don't recall the exact method names):
public float getPercentage(char c) {
    if (!map.containsKey(c))
        return 0;
    int sum = 0;
    for (Integer count : map.values())
        sum += count;
    return map.get(c) / (float) sum;
}
If you want the percentages for all characters, you should make a method that produces a new HashMap containing the percentages, calculated in a similar fashion. If you want to be fancy (read: overengineer), you could even implement an Iterator that produces percentages from the original Integer HashMap.
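A minimal sketch of such a method, assuming the counts live in a field map of type Map<Character, Integer> as in the snippet above:
public Map<Character, Float> getAllPercentages() {
    int sum = 0;
    for (int count : map.values())
        sum += count;
    Map<Character, Float> percentages = new HashMap<Character, Float>();
    for (Map.Entry<Character, Integer> entry : map.entrySet())
        percentages.put(entry.getKey(), entry.getValue() / (float) sum);
    return percentages;
}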

I'm assuming you have a map of the form {'A':5, 'B':4, etc} meaning A appears five times in your text, B four times, etc.
In that case, to calculate the frequency of a given letter, you need to know the total number of letters in the map (i.e. 9 in the example above). You can do this one of two ways:
Iterate over the entire map, and sum up the values
Keep a running count of the number of times you add something to the map, so you can use it later.
Both are reasonable solutions to the problem. I'd prefer option 2, especially if you are doing things interactively, whereas option 1 might suffice in a batch mode setting.
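A sketch of option 2, keeping a running total alongside the map (the class and method names are just for illustration):
import java.util.HashMap;
import java.util.Map;

public class FrequencyTable {
    private final Map<Character, Integer> counts = new HashMap<Character, Integer>();
    private int total = 0;

    public void add(char c) {
        Integer old = counts.get(c);
        counts.put(c, old == null ? 1 : old + 1);
        total++; // the running count: no need to re-sum the values later
    }

    public float frequency(char c) {
        Integer count = counts.get(c);
        return count == null ? 0 : count / (float) total;
    }
}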

You should parametrize your HashMap so that getValue() returns an Integer. You could use Float as the value type if you would like to store a percentage instead.
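For example (a small sketch; character keys are an assumption based on the question):
Map<Character, Integer> map = new HashMap<Character, Integer>();
map.put('e', 42);
for (Map.Entry<Character, Integer> entry : map.entrySet()) {
    int count = entry.getValue(); // typed as Integer: no cast from Object needed
}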

Related

difference between 2d array and hashmap

I am relatively new to Java and I just want to make sure I get the basic concepts right. So my question is: how is a HashMap different from a 2D array? I will illustrate with an example, and if someone could correct me where I am wrong, that would be great.
You cannot access or change the first dimension of a 2D array directly, in contrast to a HashMap. For example, if you have arr[2][5], you cannot change the first index (the 2) to something else. In other words, if we have int arr[2][2], you cannot change it to, say, arr["Cars"][2], whereas with a HashMap you can use such keys. With a map entry like (Martin, 25), you can easily replace it with (Joe, 22).
You can search quite easily in a HashMap on the first value. Say you want to find the age of Martin from the previous example: you can simply search for the key Martin and the age 25 will be returned.
I have been taught that 2D arrays represent a table, something like:
arr[2][3]
1 [1 , 2 , 3]
2 [1 , 2 , 3]
But in reality you cannot access or change the 1 and 2 outside the [] grid; they serve only as an imaginary aid to illustrate the concept of 2D arrays.
Could you please correct me if I am wrong, or make any additional comments on that?
Thank you
A HashMap uses keys and values, not indices, so you can only search by key; there is no index to access. Keys must be unique: you cannot have two identical keys, and if you reassign something to an existing key, the old key's value is replaced. The key works much like the index of an array, except that a HashMap key can be any object, whereas an array's indexes must be int primitives.
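For example, a small sketch of the key-replacement behaviour described above:
Map<String, Integer> ages = new HashMap<String, Integer>();
ages.put("Martin", 25);
ages.put("Martin", 26);       // same key: the old value 25 is replaced
int age = ages.get("Martin"); // 26 -- lookup is by key, not by numeric index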
It's like comparing apples and oranges.
A 2D array is just a two-dimensional grid of objects; a HashMap is a special kind of associative array (also called a dictionary or map) which associates generic keys with generic values. HashMap is not the only such implementation: TreeMap, for example, exists too, providing roughly the same interface but a totally different implementation.
The other main difference is that a HashMap is made to fulfill a specific requirement which is unnecessary in an array: being able to store sparse keys without wasting too much space, while keeping get and put operations constant-time.
This can be seen easily:
int[] intMap = new int[10];
HashMap<Integer,Integer> hashIntMap = new HashMap<Integer,Integer>();
Now suppose that you want to insert the pair (500,100):
intMap[500] = 100;
hashIntMap.put(500, 100);
In the first case you will need to have enough room in the array (at least 501 elements) to be able to access the cell at index 500. In a HashMap there is no such requirement, since elements are stored by hash code and bucketed into far fewer cells than the array would require.

How to represent a small map of integers as an array?

Let's say I want to build a small simple map, where the key is N integers (fixed, usually one) and the value is M integers (also fixed, usually one).
Now I would like to store the data in an integer array, for space efficiency. I am programming on the JVM, but it should not really make a difference, unless the algorithm would require storing pointers as integers.
Has anyone defined a simple data structure that can do that?
[EDIT] The answers I had so far seem to show that no one understands my question, so I'll try to clarify. Firstly, forget about the M and N; just imagine I said one int key and one int value. AFAIK, if you want to use a normal HashMap where the key is an integer and the value is an integer, then you will end up with at least 2 + 3 * N objects, where N is the number of entries.
What I want to know is: can you pack all those ints into a single array of primitive ints, reducing your object count to two, independent of the number of keys - one for the int[], and one for the wrapper object that gives you some map-like interface? Neither my keys nor my values will ever be null, and I don't need a full standard java.util.Map implementation either. I just need get, put, and remove, taking and returning primitive ints, not Integer objects. Access does not need to be O(1), as in a normal HashMap.
As far as I know, the answer is no, at least not in Java.
But you can simply keep your keys and your values in an int array (each) with matching indexes, and if you keep the key array sorted, you can perform a binary search.
The trouble with Java arrays, though, is that they're fixed-size structures, so if you want to store more elements than your arrays were allocated for, reallocating them is quite expensive.
So you'll have to make a trade-off between the size of the array and the number of reallocations, maybe something similar to how ArrayList does it.
The other, somewhat smaller problem is that int is a primitive type, so there's no null value, but you can designate a special int value to denote null.
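A rough sketch of that approach (hypothetical class name; it uses two parallel arrays rather than the single packed array the question asks for, which costs one extra object but keeps the code simple):
import java.util.Arrays;

public class IntIntArrayMap {
    private int[] keys = new int[16];
    private int[] values = new int[16];
    private int size = 0;

    // No nulls exist for ints, so the caller supplies the "missing" sentinel.
    public int get(int key, int defaultValue) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        return i >= 0 ? values[i] : defaultValue;
    }

    public void put(int key, int value) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        if (i >= 0) { values[i] = value; return; }
        int insert = -(i + 1); // binarySearch encodes the insertion point
        if (size == keys.length) { // reallocate, doubling like ArrayList does
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        System.arraycopy(keys, insert, keys, insert + 1, size - insert);
        System.arraycopy(values, insert, values, insert + 1, size - insert);
        keys[insert] = key;
        values[insert] = value;
        size++;
    }
}
Here get is O(log n) via binary search, and put is O(n) in the worst case because of the shift, which fits the question's relaxed access requirements.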
I think I understand the problem, I'll take a shot at a solution:
The very simple way of achieving what you want is to keep a single two-dimensional array, int array[NKeys][2], where array[i][0] is the key and array[i][1] is one of the M values. You could then iterate the array and, for every query of a key, return every array[i][1] such that array[i][0] == key. Of course, this works for a single int key rather than a set of int keys. Also, it is awful in complexity terms. I can't think of any other way of doing this without adding more Java objects/C pointers.
Since the ArrayList (actually AbstractList) class has the equals method overridden appropriately, you can directly use lists as keys and values in a map, like so:
Map<List<Integer>, List<Integer>> map = new HashMap<List<Integer>, List<Integer>>();

Is there a kind of map that optimizes for *sequences of keys* that have the same value?

If you are mapping Java shorts to a few immutable objects, and it is often the case that a consecutive sequence of short keys (neighbors) maps to the same value, is there some map structure that allows you to save more memory than a HashMap, while keeping fast access (O(1) or O(log n))?
I could invert the map, and I would use much less memory, but then I would have to go through every mapping to know whether a specific short is mapped, and to what it is mapped (O(n)).
I suppose some kind of TreeMap could do that; maybe there is something like that in some collections library?
Have a look at interval trees.
I once used a TreeMap with a custom key class and corresponding comparator to implement this. My key class contained both ends of a range of double values. Queries were specified as a range with both ends being the same and the comparator did the rest.
There were a few choices to be made, though:
How should remove() be handled?
What should happen if a get() is issued with a key range that overlaps two or more ranges?
Would it make sense to bundle this behaviour in a new Map implementation - possibly a subclass of TreeMap?
You can use a binary tree with one entry for each interval of shorts that map to the same value.
The key would be the start of the interval, while the data is the length of the interval plus the mapped objects.
Thus, to find whether a given short is mapped, you need to locate the node in the tree with the highest key less than or equal to the given one (O(log n)) and check whether the given short falls within the interval that node represents.
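A minimal sketch of that idea using java.util.TreeMap (assuming the stored intervals never overlap; floorEntry finds the entry with the highest key less than or equal to the given one):
import java.util.Map;
import java.util.TreeMap;

public class ShortIntervalMap<V> {
    private static class Interval<V> {
        final short end; // inclusive end of the interval
        final V value;
        Interval(short end, V value) { this.end = end; this.value = value; }
    }

    // key = start of the interval, data = its end plus the mapped value
    private final TreeMap<Short, Interval<V>> tree = new TreeMap<Short, Interval<V>>();

    public void put(short start, short end, V value) {
        tree.put(start, new Interval<V>(end, value));
    }

    public V get(short key) {
        Map.Entry<Short, Interval<V>> e = tree.floorEntry(key);
        if (e == null || key > e.getValue().end)
            return null; // key falls outside every stored interval
        return e.getValue().value;
    }
}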
This solution is pretty different - very old-fashioned, but approaching O(1), small and fast.
90% of the values will fit into 4 bits, whereas a map or tree entry takes hundreds of bits to represent (without a lot of custom reimplementation). So start by representing them in an array of 4-bit entries:
// Used to store nybbles containing small values, with direct arithmetic mapping.
// A value of 15 indicates that the value is larger than 14.
// Size: 32KB
byte[] zeroTo14Array = new byte[(1<<Short.SIZE)/2];
static final short BIGGER_THAN_NYBBLE = 15;
Then use an efficient short-to-byte map (from fastutil or GNU Trove) to represent the values from 15 to 255:
// Use to store bytes with values 15-255.
// If value is 0, value is larger than 255.
Short2ByteOpenHashMap byteMap = new Short2ByteOpenHashMap();
Finally, use an efficient short-to-object map for everything else:
// Use to store values larger than 255
Short2ObjectOpenHashMap<Value> objectMap = new Short2ObjectOpenHashMap<Value>();
// just a sketch
public class Value
{
    short shortValue;
    String optional;
}
I can post the rest of the untested code, if you'd like.
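For reference, a rough, untested sketch of what the combined lookup might look like (it assumes fastutil's Short2ByteOpenHashMap, whose get returns 0 for missing keys by default):
public short get(short key) {
    int u = key & 0xFFFF;                 // treat the short as an unsigned index
    int pair = zeroTo14Array[u >> 1];     // two nybbles are packed per byte
    int nybble = (u & 1) == 0 ? (pair & 0x0F) : ((pair >> 4) & 0x0F);
    if (nybble < BIGGER_THAN_NYBBLE)
        return (short) nybble;            // 0-14: stored directly in the nybble array
    int byteValue = byteMap.get(key) & 0xFF;
    if (byteValue != 0)
        return (short) byteValue;         // 15-255: stored in the byte map
    return objectMap.get(key).shortValue; // anything larger: stored as an object
}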

Hash: How does it work internally?

This might sound like a very vague question up front, but it is not. I have gone through the Hash Function description on wiki, but it is not very helpful for understanding.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow?
What is the difference between HashMap, Hashtable and HashList?
What do we mean by 'constant time complexity', and why do different implementations of the hash give constant-time operations?
Lastly, why in most interview questions are Hash and LinkedList asked? Is there any specific logic to it for testing an interviewee's knowledge?
I know my question list is big, but I would really appreciate clear answers to these questions, as I really want to understand the topic.
Here is a good explanation of hashing. Say you want to store the string "Rachel". You apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel", value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call getMyHashFunction("Rachel") and it will return 10. Note that for this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in this case you have a collision. If you are implementing your own hash table, you have to take care of this, maybe using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies the condition that each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One such method is called the division method: we map a key k into one of n slots by taking the remainder of k divided by n, i.e. h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 15, then h(k) = 15.
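In code, the division method is a one-liner (a trivial sketch, assuming non-negative keys):
// Division method: h(k) = k mod n, mapping key k into one of n slots.
static int hash(int k, int n) {
    return k % n;
}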
Hashtable is synchronized and HashMap is not.
HashMap allows a null key and null values, but Hashtable does not.
The purpose of a hash table is to provide constant-time, O(1), adding and getting of elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key, and the hash function locates the desired element. If the hash function is well implemented, it runs in constant time, O(1). This means you don't have to traverse all the elements stored in the hash table; you get the element "instantly".
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and Hashtable are maps; they are collections of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible - i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by the JVM, using some memory address.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs can generate good hashCode methods for you.
Hashtable and HashMap store the same thing: key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values - only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant - that is, 1, or close to 1.
This is basic computer-science material, and it is assumed that everyone is familiar with it. I think Google has said that the hashtable is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list has O(n) complexity, meaning that you have to scan the whole list (or half of it, on average) to perform the operation.
Hashing is a very simple and effective way of speeding that up: consider splitting the whole list into a set of small lists. Items in one such small list would have something in common, and that something can be deduced from the key. For example, given a list of names, we could use the first letter as the quality that chooses which small list to look in. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation takes O(n)/30 time.
However, the results are not that perfect. First, there are only 30 partitions, and we can't change that. Second, some letters are used more often than others, so the set for Y or Z will be much smaller than the set for A. For better results, it's better to find a way to partition the items into sets of roughly the same size. How could we solve that? This is where hash functions come in: a hash function is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
#include <string.h>

#define NUMBER_OF_PARTITIONS 1024 /* any bucket count will do */

unsigned int hash(const char* str) {
    unsigned int rez = 0; /* unsigned, so overflow wraps instead of going negative */
    for (size_t i = 0; i < strlen(str); i++)
        rez = rez * 37 + str[i];
    return rez % NUMBER_OF_PARTITIONS;
}
This would assure a quite even distribution and a configurable number of sets (also called buckets).
What do we mean by hashing? How does it work internally?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table: it contains an array of items. A hash table computes an index from a data item's key and uses that index to place the data into the array.
What algorithm does it follow?
In simple words, most hash algorithms work on the logic index = f(key, arrayLength).
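In Java terms, that logic boils down to something like this (a sketch; table is a hypothetical bucket array):
// index = f(key, arrayLength): derive a bucket index from the key's hashCode.
// Masking off the sign bit keeps the index non-negative.
int index = (key.hashCode() & 0x7FFFFFFF) % table.length;
table[index] = value; // place the item in its bucket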
Lastly, why in most interview questions are Hash and LinkedList asked? Is there any specific logic to it for testing an interviewee's knowledge?
It's about how good you are at logical reasoning. These are among the most important data structures, and every programmer should know them.

Most frequently repeated numbers in a huge list of numbers

I have a file which has many random integers (around a million), each separated by whitespace. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in Java?
I can think of:
1. Create a hash map where the key is the integer from the file and the value is its count. For every number in the file, check if that key already exists in the hash map; if yes, increment the value, else make a new entry in the hash map.
2. Make a BST where each node is an integer from the file. For every integer from the file, see if there is already a node in the BST; if yes, increment its count (the count is part of the node).
I feel a hash map is the better option if I can come up with a good hashing function.
Can someone please suggest the best way of doing this? Is there any other efficient algorithm that I can use?
Edit #2:
Okay, I screwed up my own first rule - never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range - so I just did that. It still runs in about a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is an older, obsolete post that could still be valid if someone had MANY more points than a million.)
A HashMap would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers and, for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a hash has to make allowances for should make it a few times faster.)
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases, though (it would have to be a wide range and "grouped" data, and even then the tree might win).
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increment. There is the indirection of the pointer to the Integer, and the fact that you are storing at least twice as many objects. This doesn't even count the overhead of the method calls, since with any luck they should be inlined.
Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashMap will be.
Java handles hashing. You don't need to write a hash function; just start pushing stuff into the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't bother optimizing. It will be fast enough. Only bother if it's something that's going to run repeatedly within an application.
HashMap
A million integers is not really a lot, even for interpreted languages, but especially for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.
Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)
Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement of step 4, you only need to step forward through the array in steps equivalent to your 10th-longest run. Any run longer than that will overlap with your sampling. If the tenth-longest run is 100 elements long, you only need to sample elements 100, 200, 300, and so on, and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th-longest is sure to overlap with your sampling.
You should apply this optimisation only once your 10th-longest run is very long compared to the other runs in the array.
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out
If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.
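A minimal sketch of the counting-array idea described above (RANGE is an assumed upper bound; numbers stands in for the ints read from the file):
static final int RANGE = 1000; // assumption: all values fall in [0, RANGE)
int[] counts = new int[RANGE];
for (int v : numbers)
    counts[v]++; // no hashing, no autoboxing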
If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.
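A sketch of that HashMap<Integer, int[]> trick - the one-element array acts as a mutable cell, so incrementing an existing count allocates nothing:
Map<Integer, int[]> counts = new HashMap<Integer, int[]>();
int[] cell = counts.get(number);
if (cell == null)
    counts.put(number, new int[] { 1 }); // first occurrence
else
    cell[0]++; // no new Integer created for the increment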
Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done searching the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(nLogn).
For example (wrapped here in a hypothetical FrequencyCounter class, with imports, so it compiles as-is):
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Scanner;

public class FrequencyCounter {
    public static class Count implements Comparable<Count> {
        int value;
        int count;

        public Count(int value) {
            this.value = value;
            this.count = 1;
        }

        public void increment() {
            count++;
        }

        public int compareTo(Count other) {
            return other.count - count; // sort in descending order of count
        }
    }

    public static void main(String[] args) throws Exception {
        Scanner input = new Scanner(new FileInputStream(new File("...")));
        HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
        while (input.hasNextInt()) {
            int tempInt = input.nextInt();
            Count tempCount = dataset.get(tempInt);
            if (tempCount != null) {
                tempCount.increment();
            } else {
                dataset.put(tempInt, new Count(tempInt));
            }
        }
        List<Count> counts = new ArrayList<Count>(dataset.values());
        Collections.sort(counts);
    }
}
Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache, where an element's access count determines whether it stays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html
This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
    return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?
The correct way to do it is with a linked list. When you insert an element, you walk down the linked list; if it's there, you increment the node's count, otherwise you create a new node with a count of 1. After you have inserted each element, you would have a sorted list of elements in O(n*log(n)).
For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.
