In Java, I need an algorithm to find the maximum number of occurrences in a collection of integers. For example, if my set is [2,4,3,2,2,1,4,2,2], the algorithm needs to output 5, because 2 is the most frequently occurring integer and it appears 5 times. Think of it as finding the peak of the histogram of the set of integers.
The challenge is that I have to do this one by one for multiple sets of many integers, so it needs to be efficient. Also, I do not know in advance which element will appear most often in a set; it is totally random.
I thought about putting the values of the set into an array, sorting it, and then iterating over the array, counting consecutive appearances of each number and keeping the maximum of those counts, but I am guessing it will take a long time. Are there any libraries or algorithms that could help me do it efficiently?
I would loop over the collection inserting into a Map datastructure with the following logic:
If the integer has not yet been inserted into the map, then insert key=integer, value=1.
If the key exists, increment the value.
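A minimal sketch of that counting loop (class and method names are illustrative; the example input is the one from the question):
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MaxOccurrences {
    static int maxOccurrences(Iterable<Integer> values) {
        Map<Integer, Integer> counts = new HashMap<>();
        int max = 0;
        for (Integer v : values) {
            // insert 1 for a new key, otherwise increment the existing count
            int c = counts.merge(v, 1, Integer::sum);
            if (c > max) {
                max = c;   // track the peak as we go, so no second pass is needed
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxOccurrences(Arrays.asList(2, 4, 3, 2, 2, 1, 4, 2, 2))); // 5
    }
}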
There are two Maps in Java you could use - HashMap and TreeMap - these are compared below:
HashMap vs. TreeMap
You can skip the detailed explanation and jump straight to the summary if you wish.
A HashMap is a Map which stores key-value pairs in an array of buckets. The index used for key k is (roughly):
k.hashCode() % numberOfBuckets
Sometimes two completely different keys will end up at the same index. To solve this, each location in the array is really a linked list, which means every lookup always has to loop over the linked list and check for equality using the k.equals(other) method. Worst case, all keys get stored at the same location and the HashMap becomes an unindexed list.
As the HashMap gains more entries, the likelihood of these clashes increases, and the efficiency of the structure decreases. To solve this, when the number of entries reaches a critical point (determined by the loadFactor argument in the constructor), the structure is resized:
A new array is allocated at about twice the current size
A loop is run over all the existing keys
The key's location is recomputed for the new array
The key-value pair is inserted into the new structure
As you can see, this can become relatively expensive if there are many resizes.
This problem can be overcome if you can pre-allocate the HashMap at an appropriate size before you begin, e.g. map = new HashMap<>((int) (input.size() * 1.5)). For large datasets, this can dramatically reduce memory churn.
Because the keys are essentially randomly positioned in the HashMap, the key iterator will iterate over them in a random order. Java does provide the LinkedHashMap, which iterates in the order the keys were inserted.
Performance for a HashMap:
Given the correct size and good distribution of hashes, lookup is constant-time.
With bad distribution, performance drops to (in the worst case) linear search - O(n).
With bad initial sizing, performance becomes that of rehashing. I can't trivially calculate this, but it's not good.
OTOH a TreeMap stores entries in a balanced tree - a dynamic structure that is incrementally built up as key-value pairs are added. Insert is dependent on the depth of the tree (log(tree.size())), but is predictable - unlike HashMap, there are no hiatuses, and no edge conditions where performance drops.
Each insert and lookup is more expensive than for a well-distributed HashMap, though.
Further, in order to insert the key in the tree, every key must be comparable to every other key, requiring the k.compareTo(other) method from the Comparable interface. Obviously, given the question is about integers, this is not a problem.
Performance for a TreeMap:
Insert of n elements is O(n log n)
Lookup is O(log n)
Summary
First thoughts: Dataset size:
If small (even in the 1000's and 10,000's) it really doesn't matter on any modern hardware
If large, to the point of causing the machine to run out of memory, then TreeMap may be the only option
Otherwise, size is probably not the determining factor
In this specific case, a key factor is whether the expected number of unique integers is large or small compared to the overall dataset size.
If small, then the overall time will be dominated by key lookup in a small set, so optimization is irrelevant (you can stop here).
If large, then the overall time will be dominated by insert, and the decision rests on more factors:
Dataset is of known size?
If yes: The HashMap can be pre-allocated, and so memory churn eliminated. This is especially important if the hashCode() method is expensive (not in our case)
If no: A TreeMap provides more predictable performance and may be the better choice
Is predictable performance with no large stalls required, eg in real-time systems or on the event thread of a GUI?
If yes: A TreeMap provides much better predictability with no stalls
If no: A HashMap probably provides better overall performance for the whole computation
One final consideration, if nothing above is decisive:
Is a sorted list of keys of value?
If yes (eg to print a histogram): A TreeMap has already sorted the keys, and so is convenient
However, if performance is important, the only way to decide is to code against the Map interface, then profile both the HashMap and the TreeMap to see which is actually better in your situation. Premature optimization is the root of much evil :)
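As a side note on the histogram case above, a TreeMap-based count gives you the keys in sorted order for free. A minimal sketch (not part of the original answer), using the integers from the question:
import java.util.Map;
import java.util.TreeMap;

public class SortedHistogram {
    public static void main(String[] args) {
        int[] data = {2, 4, 3, 2, 2, 1, 4, 2, 2};
        Map<Integer, Integer> counts = new TreeMap<>();   // keys kept in ascending order
        for (int v : data) {
            counts.merge(v, 1, Integer::sum);
        }
        // entrySet() iterates in ascending key order, so this prints a sorted histogram
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}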
What's wrong with sorting? That's O(n log n), which isn't bad at all. Any better solution would either require more information about the input sets (an upper bound on the numbers involved, perhaps) or involve a Map<Integer, Integer> or something equivalent.
The basic method is to sort the collection and then simply run through the sorted collection. (This is O(n log n) + O(n), which is O(n log n).)
If the numbers are bounded (say, to [-10000, 10000]) and the collection contains a lot of integers, you can use a lookup table and count each element. This takes O(n + l) (O(n) for the count, O(l) to find the max element) where l is the range length (20001 in this case).
As you can see, if n >> l then this becomes O(n), which is better than the first approach, but if n << l then it's O(l), which is constant but big enough to make this unusable.
Another variant of the previous is to use a hash table instead of a lookup table. This improves the complexity to O(n) but is not guaranteed to be faster than the second approach when n >> l.
The good news is that the values don't have to be bounded.
I'm not much of a Java person, but if you need help coding these, let me know.
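A minimal sketch of the bounded lookup-table approach described above (assuming the values really do fall in [-10000, 10000]):
public class BoundedCount {
    static int maxOccurrences(int[] values) {
        final int MIN = -10000, MAX = 10000;           // assumed bounds
        int[] counts = new int[MAX - MIN + 1];         // one slot per possible value
        int best = 0;
        for (int v : values) {
            // tracking the maximum during the counting pass folds the O(l) scan into O(n)
            best = Math.max(best, ++counts[v - MIN]);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(maxOccurrences(new int[] {2, 4, 3, 2, 2, 1, 4, 2, 2})); // 5
    }
}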
Here is a sample implementation for your problem. It returns the number with the highest frequency; if two numbers are tied for the most occurrences, the larger number is returned. If you want the frequency instead, return mf rather than mv.
public int mode(int[] a, int n) {
    int mf = 0;       // highest frequency found so far
    int mv = a[0];    // value with that frequency
    for (int i = 0; i < n; i++) {
        int f = 0;
        for (int j = 0; j < n; j++) {   // count occurrences of a[i] (O(n^2) overall)
            if (a[i] == a[j]) {
                f++;
            }
        }
        if (f > mf || (f == mf && a[i] > mv)) {   // prefer the larger value on ties
            mf = f;
            mv = a[i];
        }
    }
    return mv;   // return mf here instead if you want the frequency
}
Since it's a collection of integers, one can use either
radix sort to sort the collection and that takes O(nb) where b is the number of bits used to represent the integers (32 or 64, if you use java's primitive integer data types), or
a comparison-based sort (quicksort, merge sort, etc) and that takes O(n log n).
Notes:
The larger your n becomes, the more likely that radix sort will be faster than comparison-based sorts. For smaller n, you are probably better off with a comparison-based sort.
If you know a bound on the values in the collection, b will be even smaller than 32 (or 64) making the radix sort more desirable.
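For illustration, a sketch of the sort-then-scan step (using a comparison sort via Arrays.sort; a radix sort could be substituted for the sorting line):
import java.util.Arrays;

public class SortAndScan {
    static int maxOccurrences(int[] values) {
        if (values.length == 0) return 0;
        int[] a = values.clone();
        Arrays.sort(a);                               // O(n log n) comparison sort
        int best = 1, run = 1;
        for (int i = 1; i < a.length; i++) {
            run = (a[i] == a[i - 1]) ? run + 1 : 1;   // extend or reset the current run
            best = Math.max(best, run);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(maxOccurrences(new int[] {2, 4, 3, 2, 2, 1, 4, 2, 2})); // 5
    }
}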
This little puppy works (edited to return the frequency instead of the number):
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.atomic.AtomicInteger;

public static int mostFrequent(int[] numbers) {
    // Anonymous subclass whose get() creates a zero counter on first access
    Map<Integer, AtomicInteger> map = new HashMap<Integer, AtomicInteger>() {
        @Override
        public AtomicInteger get(Object key) {
            AtomicInteger value = super.get(key);
            if (value == null) {
                value = new AtomicInteger();
                super.put((Integer) key, value);
            }
            return value;
        }
    };
    for (int number : numbers)
        map.get(number).incrementAndGet();

    List<Entry<Integer, AtomicInteger>> entries =
            new ArrayList<Map.Entry<Integer, AtomicInteger>>(map.entrySet());
    Collections.sort(entries, new Comparator<Entry<Integer, AtomicInteger>>() {
        @Override
        public int compare(Entry<Integer, AtomicInteger> o1, Entry<Integer, AtomicInteger> o2) {
            return o2.getValue().get() - o1.getValue().get();   // descending by count
        }
    });
    return entries.get(0).getValue().get(); // return the largest *frequency*
    // Use this next line instead to return the most frequent *number*
    // return entries.get(0).getKey();
}
AtomicInteger was chosen to avoid creating new objects with every increment, and the code reads a little cleaner.
The anonymous map class was used to centralize the "if null" code.
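As an aside, on Java 8+ the same counting can be written with Map.merge, which avoids both the anonymous subclass and AtomicInteger (a sketch, not the original answer's code):
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public static int mostFrequent(int[] numbers) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int number : numbers) {
        counts.merge(number, 1, Integer::sum);   // insert 1, or add 1 to the existing count
    }
    return Collections.max(counts.values());     // the largest frequency
}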
Here's a test:
public static void main(String[] args) {
System.out.println(mostFrequent(new int[] { 2, 4, 3, 2, 2, 1, 4, 2, 2 }));
}
Output:
5
Using a HashMap:
import java.util.HashMap;

public class NumberCounter {

    static HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
    static int[] arr = {1, 2, 1, 23, 4, 5, 4, 1, 2, 3, 12, 23};
    static int max = 0;

    public static void main(String[] args) {
        // Count the occurrences of each number
        for (int i = 0; i < arr.length; i++) {
            if (map.get(arr[i]) != null) {
                Integer newValue = map.get(arr[i]);
                newValue += 1;
                map.put(arr[i], newValue);
            } else {
                map.put(arr[i], 1);
            }
        }
        // Find the largest count by iterating over all of the map's values
        for (Integer count : map.values()) {
            if (count > max) {
                max = count;
            }
        }
        System.out.print(max);
    }
}
Related
Majority element question:
Given an array of size n, find the majority element. The majority element is the element that appears more than ⌊ n/2 ⌋ times.
You may assume that the array is non-empty and the majority element always exist in the array.
// Solution1 - Sorting ----------------------------------------------------------------
class Solution {
    public int majorityElement(int[] nums) {
        Arrays.sort(nums);
        return nums[nums.length / 2];
    }
}

// Solution2 - HashMap ---------------------------------------------------------------
class Solution {
    public int majorityElement(int[] nums) {
        // int[] arr1 = new int[nums.length];
        HashMap<Integer, Integer> map = new HashMap<>(100);
        Integer k = new Integer(-1);
        try {
            for (int i : nums) {
                if (map.containsKey(i)) {
                    map.put(i, map.get(i) + 1);
                } else {
                    map.put(i, 1);
                }
            }
            for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
                if (entry.getValue() > (nums.length / 2)) {
                    k = entry.getKey();
                    break;
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Error");
        }
        return k;
    }
}
The Arrays.sort() function is implemented in Java using QuickSort and has O(n log n) time complexity.
On the other hand, using HashMap to find the majority element has only O(n) time complexity.
Hence, solution 1 (sorting) should take longer than solution 2 (HashMap), but when I was doing the question on LeetCode, the average time taken by solution 2 is much more (almost 8 times more) than solution 1.
Why is that the case? I'm really confused.....
Is the size of the test case the reason? Will solution 2 become more efficient when the number of elements in the test case increases dramatically?
Big O isn't a measure of actual performance. It only gives you an idea of how your performance will evolve as n grows.
Practically, an algorithm that is O(n log n) will eventually be slower than one that is O(n) for some n. But that n might be 1, 10, 10^6 or even 10^600 - at which point it's probably irrelevant because you'll never run into such a data set - or you won't have enough hardware for it.
Software engineers have to consider both actual performance and performance at the practical limit. For example, hash map lookup is in theory faster than unsorted array lookup... but then most arrays are small (10-100 elements), negating the asymptotic advantage due to the extra code complexity.
You could certainly optimize your code a bit, but in this case you're unlikely to change the outcome for small n unless you introduce another factor (e.g. artificially slow down the time per cycle with a constant).
(I wanted to find a good metaphor to illustrate, but it's harder than expected...)
It depends on the test cases; some test cases will be faster with the HashMap, while others won't.
Why is that? Solution 1 guarantees O(N log N) in the worst case, but the HashMap is O(N * (M + R)), where M is the cost of collisions and R the cost of resizing the array.
HashMap internally uses an array of nodes named table, and it resizes a number of times as the input grows or shrinks. And you assigned it an initial capacity of 100.
So let's see what happens. Java uses separate chaining to resolve collisions, and some test cases may have lots of collisions, which leads to a lot of time spent querying or updating the HashMap.
Conclusion: the performance of the HashMap implementation is affected by two factors: 1. resizing the table array based on the input size, and 2. how many collisions appear in the input.
As per the following link document: Java HashMap Implementation
I'm confused by the implementation of HashMap (or rather, by an enhancement in HashMap). My questions are:
Firstly
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
Why and how are these constants used? I would like some clear examples of this.
How do they achieve a performance gain with this?
Secondly
If you see the source code of HashMap in JDK, you will find the following static inner class:
static final class TreeNode<K, V> extends java.util.LinkedHashMap.Entry<K, V> {
    HashMap.TreeNode<K, V> parent;
    HashMap.TreeNode<K, V> left;
    HashMap.TreeNode<K, V> right;
    HashMap.TreeNode<K, V> prev;
    boolean red;

    TreeNode(int arg0, K arg1, V arg2, HashMap.Node<K, V> arg3) {
        super(arg0, arg1, arg2, arg3);
    }

    final HashMap.TreeNode<K, V> root() {
        HashMap.TreeNode arg0 = this;
        while (true) {
            HashMap.TreeNode arg1 = arg0.parent;
            if (arg0.parent == null) {
                return arg0;
            }
            arg0 = arg1;
        }
    }
    //...
}
How is it used? I just want an explanation of the algorithm.
HashMap contains a certain number of buckets. It uses hashCode to determine which bucket to put a key into. For simplicity's sake, imagine it as a modulus.
If our hashCode is 123456 and we have 4 buckets, 123456 % 4 = 0, so the item goes in the first bucket.
If our hashCode function is good, it should provide an even distribution so that all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.
But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.
The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.
The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.
This tree is a Red-Black tree, presumably chosen because it offers some worst-case guarantees. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo method of Comparable if the objects implement that interface, else the identity hash code.
If entries are removed from the map, the number of entries in the bucket might shrink to the point where this tree structure is no longer necessary. That's what UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops to six or fewer, we might as well go back to using a linked list.
Finally, there is the MIN_TREEIFY_CAPACITY = 64.
When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small HashMap, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger HashMap, with more buckets that are less full. This constant basically says not to start making buckets into trees if our HashMap is very small - it should resize to be larger first instead.
To answer your question about the performance gain, these optimisations were added to improve the worst case. You would probably only see a noticeable performance improvement because of these optimisations if your hashCode function was not very good.
It is designed to protect against bad hashCode implementations and also provides basic protection against collision attacks, where a bad actor may attempt to slow down a system by deliberately selecting inputs which occupy the same buckets.
To put it more simply (as simply as I can), plus some more details.
These properties depend on a lot of internal things that would be very cool to understand - before moving to them directly.
TREEIFY_THRESHOLD -> when a single bucket reaches this (and the total number of buckets exceeds MIN_TREEIFY_CAPACITY), it is transformed into a red-black tree node. Why? Because of search speed. Think about it in a different way:
it would take at most 32 steps to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
Some intro for the next topic: why is the number of bins/buckets always a power of two? At least two reasons: it is faster than the modulo operation, and modulo on negative numbers is negative. You can't put an Entry into a "negative" bucket:
int arrayIndex = hashCode % buckets; // will be negative
buckets[arrayIndex] = Entry; // obviously will fail
Instead of modulo, there is a nice trick used:
(n - 1) & hash // n is the number of bins, hash - is the hash function of the key
That is semantically the same as the modulo operation (for a power-of-two n): it keeps the lower bits. This has an interesting consequence when you do:
Map<String, String> map = new HashMap<>();
In the case above, the decision of where an entry goes is based only on the last 4 bits of your hashcode (the default capacity is 16 buckets).
This is where multiplying the buckets comes into play. Under certain conditions (it would take a lot of time to explain the exact details), the number of buckets is doubled. Why? When the buckets are doubled, one more bit comes into play.
So if you have 16 buckets, the last 4 bits of the hashcode decide where an entry goes. You double the buckets: with 32 buckets, the last 5 bits decide where the entry will go.
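A tiny demo of that masking (illustrative numbers only; the real HashMap also spreads the high bits of the hashCode with h ^ (h >>> 16) before masking):
public class BucketIndexDemo {
    public static void main(String[] args) {
        int hash = 123456;                      // pretend hashCode of some key
        System.out.println(hash & (16 - 1));    // 16 buckets: last 4 bits decide -> 0
        System.out.println(hash & (32 - 1));    // 32 buckets after doubling: last 5 bits -> still 0 here
        System.out.println(-7 % 16);            // plain modulo on a negative hash -> -7 (unusable index)
        System.out.println(-7 & (16 - 1));      // masking stays non-negative -> 9
    }
}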
This process is called re-hashing, and it can be slow. That is why (for people who care) HashMap is jokingly described as: fast, fast, fast, slooow. There are other implementations - search for "pauseless hashmap"...
Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from these bins to others (one more bit is added to the (n - 1) & hash computation, so entries might move to other buckets), and a bin might drop to UNTREEIFY_THRESHOLD. At this point it no longer pays off to keep the bin as a red-black tree node, and it becomes a LinkedList again, like
entry.next.next....
MIN_TREEIFY_CAPACITY is the minimum number of buckets before a certain bucket is transformed into a Tree.
TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passed a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.
From the implementation:
/*
* Implementation notes.
*
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
You would need to visualize it: say there is a class Key with only the hashCode() method overridden to always return the same value.
public class Key implements Comparable<Key> {

    private String name;

    public Key(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 1;   // every Key lands in the same bucket
    }

    public String keyName() {
        return this.name;
    }

    public int compareTo(Key key) {
        // returns a +ve or -ve integer, e.g. based on the name
        return this.name.compareTo(key.name);
    }
}
and then somewhere else, I am inserting 9 entries into a HashMap with all keys being instances of this class. e.g.
Map<Key, String> map = new HashMap<>();
Key key1 = new Key("key1");
map.put(key1, "one");
Key key2 = new Key("key2");
map.put(key2, "two");
Key key3 = new Key("key3");
map.put(key3, "three");
Key key4 = new Key("key4");
map.put(key4, "four");
Key key5 = new Key("key5");
map.put(key5, "five");
Key key6 = new Key("key6");
map.put(key6, "six");
Key key7 = new Key("key7");
map.put(key7, "seven");
Key key8 = new Key("key8");
map.put(key8, "eight");
// Since the hashCode is the same, all entries will land in the same bucket, let's call it bucket 1. Up to here, all entries in bucket 1 are arranged in a LinkedList structure, e.g. key1 -> key2 -> key3 -> ... and so on. But when I insert one more entry:
Key key9 = new Key("key9");
map.put(key9, "nine");
the threshold value of 8 will be reached, and it will rearrange the bucket 1 entries into a tree (red-black) structure, replacing the old linked list, e.g.
key1
/ \
key2 key3
/ \ / \
Tree traversal is faster {O(log n)} than LinkedList {O(n)} and as n grows, the difference becomes more significant.
The change in the HashMap implementation was added with JEP-180. The purpose was to:
Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class
However, pure performance is not the only gain. It also prevents the HashDoS attack, in case a hash map is used to store user input, because the red-black tree that is used to store data in the bucket has worst-case insertion complexity of O(log n). The tree is used after certain criteria are met - see Eugene's answer.
To understand the internal implementation of HashMap, you need to understand hashing.
Hashing, in its simplest form, is a way to assign a unique code to any variable/object after applying a formula/algorithm to its properties.
A true hash function must follow this rule –
“Hash function should return the same hash code each and every time when the function is applied on same or equal objects. In other words, two equal objects must produce the same hash code consistently.”
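For illustration, a hypothetical Point class that follows the rule: hashCode is derived from exactly the fields that equals compares, so equal objects always hash equally:
import java.util.Objects;

public final class Point {
    private final int x, y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;    // equality based on x and y
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);      // derived from the same fields, so equal points hash equally
    }
}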
I have a reachability matrix in the form of a HashMap. The keys are the row numbers and the values are the lists of columns which are non-zero for that row of the reachability matrix. I want to generate the antecedent set from this matrix, which is built by reading the non-zero entries for each column. The matrix has 5000 rows. If I use a for loop to check whether each key is present in the value set of every key, the number of iterations is 5000*5000. I want to avoid this. Is there an efficient algorithm which can avoid this many iterations?
I think the best approach is to iterate over the values that are in the matrix, instead of the values that could be in the matrix. Since the matrix is organized by rows instead of by columns, that means navigating it the same way:
final Map<Integer, List<Integer>> reverseReachabilityMatrix = new HashMap<>();
for (final Map.Entry<Integer, List<Integer>> reachabilityMatrixRow :
         reachabilityMatrix.entrySet()) {
    final Integer rowNumber = reachabilityMatrixRow.getKey();
    final List<Integer> columnNumbers = reachabilityMatrixRow.getValue();
    for (final Integer columnNumber : columnNumbers) {
        if (!reverseReachabilityMatrix.containsKey(columnNumber)) {
            reverseReachabilityMatrix.put(columnNumber, new ArrayList<>());
        }
        reverseReachabilityMatrix.get(columnNumber).add(rowNumber);
    }
}
(where reverseReachabilityMatrix is simply a columnwise representation of the same matrix).
(Note: the resulting lists in reverseReachabilityMatrix will not be in any meaningful order. If you need them to be, then you'll need to adjust the above in some way. For example, you can use for (int rowNumber = 1; rowNumber <= numRows; ++rowNumber) instead of iterating over the HashMap in its internal order.)
Incidentally, although I preserved the HashMap<Integer, List<Integer>> structure above for consistency with what you've already got, I must say that HashMap<Integer, List<Integer>> does not seem like the right data-structure here, for two reasons:
If your row numbers are 1 through n, and if the majority of rows have at least one nonzero entry, then it's much more efficient (both time-wise and space-wise) to use an array or ArrayList structure. That won't change the asymptotic complexity, but it should make a noticeable difference in the actual runtime.
It seems like contains is going to be a common operation here; you will very frequently want to check if the reachability-list for a given row-number includes a given column-number. So a Set, such as a TreeSet, seems more appropriate. (With an ArrayList, the contains method has to iterate through the whole list.)
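Following that second point, a variant of the same loop using Set values (so later contains checks are O(1) with HashSet, or ordered with TreeSet) — a sketch with the same hypothetical names as above:
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

static Map<Integer, Set<Integer>> reverse(Map<Integer, List<Integer>> reachabilityMatrix) {
    Map<Integer, Set<Integer>> reverseReachabilityMatrix = new HashMap<>();
    for (Map.Entry<Integer, List<Integer>> row : reachabilityMatrix.entrySet()) {
        for (Integer columnNumber : row.getValue()) {
            reverseReachabilityMatrix
                .computeIfAbsent(columnNumber, c -> new HashSet<>())  // create the set on first use
                .add(row.getKey());                                   // record "column is reachable from row"
        }
    }
    return reverseReachabilityMatrix;
}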
Have you considered storing just the non-zero elements in a smaller matrix and keeping a count of the zero elements for each column?
Let me just say this is a HW project from UC Berkeley, so if you could give me a hint instead of the solution, it'd be great.
The problem is to create a data structure called "YearlyRecord" that maps a word to the number of times it is encountered in an ngram file. Here are the restrictions.
getting the count must be O(1)
getting the number of mappings must be O(1)
"put" must be on average O(log n)
getting the "rank" must be O(1)
counts, which provides a data structure of all of the counts in increasing order
words, which provides a data structure of all of the words, in the order of their counts
Note: "rank" is a function that takes a word and tells me its rank (e.g. 1, 2, 3, 4), which is based on the number of occurrences we have of it.
My solutions so far:
Hashmap (word -> count) provides O(1) getCount operation and
O(1) size operation
make a private inner class that implements Comparator which compares based on the number of occurrences in the hashmap. Then use
this comparator as input to a TreeSet of the words. This provides the
"words" collection, which we can return in O(1) time.
Make another TreeSet of the counts. This provides an O(1) return of the counts in increasing order.
rank: Here is where I am stuck. It is clear that rank should be a map as well. I have the words in the TreeSet in increasing order of rank, but no indices to map to.
I have a file which has many random integers (around a million), each separated by whitespace. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in Java?
I can think of:
1. Create a hash map where the key is the integer from the file and the value is the count. For every number in the file, check if that key already exists in the hash map; if yes, value++, else make a new entry in the hash map.
2. Make a BST where each node is an integer from the file. For every integer from the file, see if there is already a node in the BST; if yes, do value++ (value is part of the node).
I feel a hash map is the better option if I can come up with a good hashing function.
Can someone please suggest the best way of doing this? Is there any other efficient algorithm that I can use?
Edit #2:
Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is an older, obsolete post that could still be valid if someone had MANY more points than a million)
A HashMap would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers, and for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a hash has to make allowances for should make it a few times faster.)
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases though (it would have to be a wide range of "grouped" data, and even then the tree might win).
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.
Normally this would be premature optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashMap will be.
Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't bother optimizing. It will be fast enough. Only bother if it's something that's going to run within an application.
HashMap
A million integers is not really a lot, even for interpreted languages, but especially for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.
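For what it's worth, a minimal sketch of that simplest approach (the numbers array is assumed to be already parsed from the file): count with a HashMap, then sort the entries by count and take the first ten.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

static List<Integer> topTen(int[] numbers) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int n : numbers) {
        counts.merge(n, 1, Integer::sum);                 // frequency count
    }
    List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));  // most frequent first
    List<Integer> top = new ArrayList<>();
    for (int i = 0; i < Math.min(10, entries.size()); i++) {
        top.add(entries.get(i).getKey());
    }
    return top;
}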
Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)
Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement of step 4, you only need to step forward through the array in steps equivalent to your 10th-longest run. Any run longer than that will overlap with your sampling. If the tenth-longest run is 100 elements long, you only need to sample elements 100, 200, 300, etc., and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th-longest is sure to overlap with your sampling.
You should only apply this optimisation once your 10th-longest run is very long compared to other runs in the array.
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out
If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.
If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.
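A minimal sketch of the HashMap<Integer, int[]> idea from the first paragraph (illustrative method name): the mutable int[] cell means each increment mutates the existing value object instead of boxing a new Integer.
import java.util.HashMap;
import java.util.Map;

static Map<Integer, int[]> countWithArrays(int[] numbers) {
    Map<Integer, int[]> counts = new HashMap<>();
    for (int n : numbers) {
        int[] slot = counts.get(n);
        if (slot == null) {
            slot = new int[1];   // one mutable cell per distinct value
            counts.put(n, slot);
        }
        slot[0]++;               // increment in place, no new Integer value created
    }
    return counts;
}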
Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done searching the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(n log n).
For example:
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Scanner;

public static class Count implements Comparable<Count> {
    int value;
    int count;

    public Count(int value) {
        this.value = value;
        this.count = 1;
    }

    public void increment() {
        count++;
    }

    public int compareTo(Count other) {
        return other.count - count;   // sorts by descending count
    }
}

public static void main(String args[]) throws Exception {
    Scanner input = new Scanner(new FileInputStream(new File("...")));
    HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
    while (input.hasNextInt()) {
        int tempInt = input.nextInt();
        Count tempCount = dataset.get(tempInt);
        if (tempCount != null) {
            tempCount.increment();
        } else {
            dataset.put(tempInt, new Count(tempInt));
        }
    }
    List<Count> counts = new ArrayList<Count>(dataset.values());
    Collections.sort(counts);   // counts is now ordered from most to least frequent
}
Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache, where an element's access count determines whether it stays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html
This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?
The correct way to do it is with a linked list. When you insert an element, you go down the linked list; if it's there, you increment the node's count, otherwise you create a new node with a count of 1. After you have inserted each element, you would have a sorted list of elements in O(n log n).
For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.