Most frequently repeated numbers in a huge list of numbers - java

I have a file which has many random integers (around a million), each separated by whitespace. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in Java?
I can think of
1. Create a hash map where the key is the integer from the file and the value is its count. For every number in the file, check whether that key already exists in the hash map; if yes, increment the count, else make a new entry.
2. Make a BST where each node holds an integer from the file. For every integer from the file, see if there is already a node for it in the BST; if yes, increment the count stored in that node.
I feel the hash map is the better option if I can come up with a good hashing function.
Can someone please suggest the best way of doing this? Is there any other efficient algorithm I can use?

Edit #2:
Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is the older, now-obsolete post, which could still be valid if someone had MANY more data points than a million.)
A HashMap would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers, and for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a Hash has to make allowances for should make it a few times faster.)
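For illustration, a minimal sketch of that counting-array idea, assuming the values all lie in [0, max); the method name and parameters are just for this example:
static int[] countWithArray(int[] values, int max) {
    int[] counts = new int[max];
    for (int v : values) {
        counts[v]++;       // direct index into the array, no hashing and no boxing
    }
    return counts;         // counts[i] == how many times i appeared
}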
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases though (it would have to be a wide range and "grouped" data, and even then the tree might win).
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.
Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashSet will.

Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't bother optimizing. It will be fast enough. Only bother if it's something that's going to run repeatedly within an application.

HashMap
A million integers is not really a lot, even for interpreted languages, let alone for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.
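For illustration, a rough sketch of the plain-HashMap approach; the file name numbers.txt and the entry-sorting way of picking the top 10 are assumptions for this example, not something from the answer above:
import java.util.*;

public class TopTen {
    public static void main(String[] args) throws Exception {
        Map<Integer, Integer> counts = new HashMap<>();
        // numbers.txt is an assumed file of whitespace-separated integers
        try (Scanner in = new Scanner(new java.io.File("numbers.txt"))) {
            while (in.hasNextInt()) {
                counts.merge(in.nextInt(), 1, Integer::sum); // count each number
            }
        }
        counts.entrySet().stream()
              .sorted(Map.Entry.<Integer, Integer>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.println(e.getKey() + " x " + e.getValue()));
    }
}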

Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)

Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement on step 4, you only need to step forward through the array in steps equivalent to your 10th-longest run: if the tenth-longest run is 100 elements long, you only need to sample elements 100, 200, 300 and so on, and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th-longest is sure to overlap with your sampling.
You should apply this optimisation only once your 10th-longest run is very long compared to the other runs in the array (a sketch of the basic sort-and-scan version, without this refinement, follows below).
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out
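For illustration, a rough sketch of the basic sort-and-scan idea; the method name, the PriorityQueue used to keep the 10 longest runs, and the int[]{value, count} pairs are choices made for this example:
import java.util.*;

static List<int[]> topTenRuns(int[] numbers) {
    Arrays.sort(numbers);
    // min-heap of {value, runLength} pairs, shortest run at the head
    PriorityQueue<int[]> best = new PriorityQueue<>(Comparator.comparingInt((int[] r) -> r[1]));
    for (int i = 0; i < numbers.length; ) {
        int j = i;
        while (j < numbers.length && numbers[j] == numbers[i]) j++; // end of this run
        best.offer(new int[] {numbers[i], j - i});
        if (best.size() > 10) best.poll();                          // drop the shortest run
        i = j;
    }
    return new ArrayList<>(best); // the 10 most frequent values with their counts (unordered)
}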

If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.

If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.
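A small sketch of the HashMap<Integer, int[]> trick described above (the helper name is made up for the example); the length-1 array is mutated in place, so no new object is created per increment:
import java.util.*;

static Map<Integer, int[]> countWithArrayValues(int[] numbers) {
    Map<Integer, int[]> counts = new HashMap<>();
    for (int n : numbers) {
        int[] box = counts.get(n);
        if (box == null) {
            counts.put(n, new int[] {1}); // first occurrence of this value
        } else {
            box[0]++;                     // increment without creating a new Integer
        }
    }
    return counts;
}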

Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done reading the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(n log n).
For example:
public static class Count implements Comparable<Count> {
    int value;
    int count;

    public Count(int value) {
        this.value = value;
        this.count = 1;
    }

    public void increment() {
        count++;
    }

    public int compareTo(Count other) {
        return other.count - count;
    }
}

public static void main(String args[]) throws Exception {
    Scanner input = new Scanner(new FileInputStream(new File("...")));
    HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
    while (input.hasNextInt()) {
        int tempInt = input.nextInt();
        Count tempCount = dataset.get(tempInt);
        if (tempCount != null) {
            tempCount.increment();
        } else {
            dataset.put(tempInt, new Count(tempInt));
        }
    }
    List<Count> counts = new ArrayList<Count>(dataset.values());
    Collections.sort(counts);
    for (int i = 0; i < 10 && i < counts.size(); i++) {
        System.out.println(counts.get(i).value + " occurs " + counts.get(i).count + " times");
    }
}

Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache, where the element's access count determines whether it stays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html

This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
    return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?

The correct way to do it is with a linked list. When you insert an element, you go down the linked list; if it's there, you increment the node's count, otherwise you create a new node with a count of 1. After you have inserted each element, you would have a sorted list of elements in O(n*log(n)).
For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.


A good data structure for storing and searching integers?

Edit: Fixed typos and tried to clear up the ambiguity.
I have a list of five digit integers in a text file. The expected amount can only be as large as what a 5-digit integer can store. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do, is read the data, put it in the structure, and then be able to determine if there is a specific integer in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so - you could just use a bitvector for this. You'd store one bit per number, all initially zero, and then set a bit whenever that number is entered. This has extremely fast lookups and extremely fast setting, but is very memory-intensive and infeasible if the integers can be totally arbitrary. This would probably be modeled with a BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that is 100,000 bits, which means 100,000 booleans to store the existence of 0-99,999).
"Inserting" a number N means turning the N-th bit on. Searching a number N means checking if N-th bit is on.
Pseudocode of the insertion logic is:
bitmap[number / 8] |= (1 << (number % 8));
Searching looks like:
(bitmap[number / 8] & (1 << (number % 8))) != 0;
If you understand the rationale, then here's even better news: Java already has BitSet, which does what I described above.
So the code looks like this:
BitSet bitset = new BitSet(100000);
// inserting number
bitset.set(number);
// search if number exists
bitset.get(number); // true if exists
If the number of times each number occurs doesn't matter (as you said, only inserts and checking whether a number exists), then you only have a maximum of 100,000 possible values. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only 100 kilobytes of memory.
Then, instead of adding a number like 11111, 22222, or 33333, do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
A Set is the go-to data structure to "find", and here's a tiny amount of code you need to make it happen:
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt)
.limit(scanner.nextInt())
.collect(Collectors.toSet());

fastest way to map a large number of longs

I'm writing a java application that transforms numbers (long) into a small set of result objects. This mapping process is very critical to the app's performance as it is needed very often.
public static Object computeResult(long input) {
Object result;
// ... calculate
return result;
}
There are about 150,000,000 different key objects, and about 3,000 distinct values.
The transformation from the input number (long) to the output (immutable object) can be computed by my algorithm with a speed of 4,000,000 transformations per second. (using 4 threads)
I would like to cache the mapping of the 150M different possible inputs to make the translation even faster, but I ran into some difficulties creating such a cache:
public class Cache {
    private static long[] sortedInputs; // 150M length
    private static Object[] results;    // 150M length

    public static Object lookupCachedResult(long input) {
        int index = Arrays.binarySearch(sortedInputs, input);
        return results[index];
    }
}
I tried to create two arrays with a length of 150M. The first array holds all possible input longs, sorted numerically. The second array holds a reference to one of the 3,000 distinct, precalculated result objects at the index corresponding to the first array's input.
To get the cached result, I do a binary search for the input number on the first array. The cached result is then looked up in the second array at the same index.
Sadly, this cache method is not faster than computing the results. Not even half as fast: only about 1.5M lookups per second (also using 4 threads).
Can anyone think of a faster way to cache results in such a scenario?
I doubt there is a database engine that is able to answer more than 4,000,000 queries per second on, let's say an average workstation.
Hashing is the way to go here, but I would avoid using HashMap, as it only works with objects, i.e. it must build a Long each time you insert a long, which can slow it down. Maybe this performance issue is not significant due to the JIT, but I would recommend at least trying the following and measuring its performance against the HashMap variant:
Save your longs in a long array of some length n > 3000 and do the hashing by hand via a very simple (and thus efficient) hash function like index = key % n. Since you know your 3,000 possible values beforehand, you can empirically find an array length n such that this trivial hash function won't cause collisions. That way you circumvent rehashing etc. and get true O(1) performance.
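A rough sketch of that hand-rolled scheme; the table length 4099 and the EMPTY sentinel are assumptions for the example, and it relies on the premise above that you have empirically found a length n for which key % n never collides across your known keys:
public class LongToObjectTable {
    private static final int TABLE_SIZE = 4099;        // assumed collision-free length, n > 3000
    private static final long EMPTY = Long.MIN_VALUE;  // assumed to never be a real key
    private final long[] keys = new long[TABLE_SIZE];
    private final Object[] values = new Object[TABLE_SIZE];

    public LongToObjectTable() {
        java.util.Arrays.fill(keys, EMPTY);
    }

    public void put(long key, Object value) {
        int index = (int) Math.floorMod(key, (long) TABLE_SIZE); // the trivial hash: key % n
        keys[index] = key;
        values[index] = value;
    }

    public Object get(long key) {
        int index = (int) Math.floorMod(key, (long) TABLE_SIZE);
        return keys[index] == key ? values[index] : null;        // null if not cached
    }
}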
Secondly, I would recommend you look at Java numerical libraries like
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are backed by native Lapack and BLAS implementations that are usually highly optimized by very smart people. Maybe you can formulate your algorithm in terms of matrix/vector-algebra such that it computes the whole long-array at one time (or chunk-wise).
There are about 150,000,000 different key objects, and about 3,000 distinct values.
Given the few distinct values, you should ensure that they get re-used (unless they're pretty small objects). For this an Interner is perfect (though you can also run your own).
I tried HashMap and TreeMap; both attempts ended in an OutOfMemoryError.
There's a huge memory overhead for both of them. And there isn't much point in using a TreeMap, as it uses a sort of binary search, which you've already tried.
There are at least three implementations of a long-to-object map available; google for "primitive collections". These should use slightly more memory than your two arrays. With hashing being usually O(1) (let's ignore the worst case as there's no reason for it to happen, is there?) and much better memory locality, it'll beat(*) your binary search by a factor of 20. Your binary search needs log2(150e6), i.e., about 27 steps, while hashing may need on average maybe two. This depends on how tightly you pack the hash table; this is usually a parameter given when it gets created.
In case you run your own (which you most probably shouldn't), I'd suggest using an array of size 1 << 28, i.e., 268,435,456 entries, so that you can use bitwise operations for indexing.
(*) Such predictions are hard, but I'm sure it's worth trying.
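If you did roll your own along those lines, one possible sketch is an open-addressing long-to-Object map like the following; the 1 << 28 sizing comes from the suggestion above, while the mixing constant and linear probing are illustrative choices (and note that arrays this size take several gigabytes):
public class LongObjectHashMap {
    private static final int CAPACITY = 1 << 28;   // power of two, as suggested above
    private static final int MASK = CAPACITY - 1;
    private final long[] keys = new long[CAPACITY];
    private final Object[] values = new Object[CAPACITY];
    private final boolean[] used = new boolean[CAPACITY];

    private static int index(long key) {
        long h = key * 0x9E3779B97F4A7C15L;        // cheap mixing step (an assumption)
        return (int) (h ^ (h >>> 32)) & MASK;      // bitwise AND instead of %
    }

    public void put(long key, Object value) {
        int i = index(key);
        while (used[i] && keys[i] != key) i = (i + 1) & MASK;  // linear probing
        keys[i] = key;
        values[i] = value;
        used[i] = true;
    }

    public Object get(long key) {
        int i = index(key);
        while (used[i]) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & MASK;
        }
        return null;
    }
}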

Peak Value of Number of Occurrences in Array of Integers

In Java, I need an algorithm to find the maximum number of occurrences in a collection of integers. For example, if my set is [2,4,3,2,2,1,4,2,2], the algorithm needs to output 5, because 2 is the most frequently occurring integer and it appears 5 times. Consider it like finding the peak of the histogram of the set of integers.
The challenge is that I have to do it one by one for multiple sets of many integers, so it needs to be efficient. Also, I do not know beforehand which element will appear most often in a set; it is totally random.
I thought about putting the values of the set into an array, sorting it and then iterating over the array, counting consecutive appearances of the numbers and keeping the maximum of the counts, but I am guessing it would take a huge amount of time. Are there any libraries or algorithms that could help me do it efficiently?
I would loop over the collection inserting into a Map datastructure with the following logic:
If the integer has not yet been inserted into the map, then insert key=integer, value=1.
If the key exists, increment the value.
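A minimal sketch of that counting loop, followed by taking the maximum count (the peak of the histogram); the method name is made up for the example:
import java.util.*;

static int peakCount(Collection<Integer> numbers) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int n : numbers) {
        Integer c = counts.get(n);
        counts.put(n, c == null ? 1 : c + 1); // insert with 1 or increment
    }
    int max = 0;
    for (int c : counts.values()) {
        max = Math.max(max, c);               // the most frequent element's count
    }
    return max;
}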
There are two Maps in Java you could use - HashMap and TreeMap - these are compared below:
HashMap vs. TreeMap
You can skip the detailed explanation and jump straight to the summary if you wish.
A HashMap is a Map which stores key-value pairs in an array. The index used for key k is (roughly):
k.hashCode() % capacity
where capacity is the length of the internal array.
Sometimes two completely different keys will end up at the same index. To solve this, each location in the array is really a linked list, which means every lookup always has to loop over the linked list and check for equality using the k.equals(other) method. Worst case, all keys get stored at the same location and the HashMap becomes an unindexed list.
As the HashMap gains more entries, the likelihood of these clashes increases, and the efficiency of the structure decreases. To solve this, when the number of entries reaches a critical point (determined by the loadFactor argument in the constructor), the structure is resized:
A new array is allocated at about twice the current size
A loop is run over all the existing keys
The key's location is recomputed for the new array
The key-value pair is inserted into the new structure
As you can see, this can become relatively expensive if there are many resizes.
This problem can be overcome if you can pre-allocate the HashMap at an appropriate size before you begin, e.g. map = new HashMap<>((int) (input.size() * 1.5)). For large datasets, this can dramatically reduce memory churn.
Because the keys are essentially randomly positioned in the HashMap, the key iterator will iterate over them in a random order. Java does provide the LinkedHashMap, which will iterate in the order the keys were inserted.
Performance for a HashMap:
Given the correct size and good distribution of hashes, lookup is constant-time.
With bad distribution, performance drops to (in the worst case) linear search - O(n).
With bad initial sizing, performance becomes that of rehashing. I can't trivially calculate this, but it's not good.
OTOH a TreeMap stores entries in a balanced tree - a dynamic structure that is incrementally built up as key-value pairs are added. Insert is dependent on the depth of the tree (log(tree.size())), but is predictable - unlike HashMap, there are no hiatuses, and no edge conditions where performance drops.
Each insert and lookup is more expensive than with a well-distributed HashMap, though.
Further, in order to insert the key in the tree, every key must be comparable to every other key, requiring the k.compareTo(other) method from the Comparable interface. Obviously, given the question is about integers, this is not a problem.
Performance for a TreeMap:
Insert of n elements is O(n log n)
Lookup is O(log n)
Summary
First thoughts: Dataset size:
If small (even in the 1000's and 10,000's) it really doesn't matter on any modern hardware
If large, to the point of causing the machine to run out of memory, then TreeMap may be the only option
Otherwise, size is probably not the determining factor
In this specific case, a key factor is whether the expected number of unique integers is large or small compared to the overall dataset size.
If small, then the overall time will be dominated by key lookup in a small set, so optimization is irrelevant (you can stop here).
If large, then the overall time will be dominated by insert, and the decision rests on more factors:
Dataset is of known size?
If yes: The HashMap can be pre-allocated, and so memory churn eliminated. This is especially important if the hashCode() method is expensive (not in our case)
If no: A TreeMap provides more predictable performance and may be the better choice
Is predictable performance with no large stalls required, eg in real-time systems or on the event thread of a GUI?
If yes: A TreeMap provides much better predictability with no stalls
If no: A HashMap probably provides better overall performance for the whole computation
One final point if there is not an overwhelming point from above:
Is a sorted list of keys of value?
If yes (eg to print a histogram): A TreeMap has already sorted the keys, and so is convenient
However, if performance is important, the only way to decide would be to code against the Map interface and then profile both the HashMap and the TreeMap to see which is actually better in your situation. Premature optimization is the root of much evil :)
What's wrong with sorting? That's O(n log n), which isn't bad at all. Any better solution would either require more information about the input sets (an upper bound on the numbers involved, perhaps) or involve a Map<Integer, Integer> or something equivalent.
The basic method is to sort the collection and then simply run through the sorted collection. (This would be done in O(n log n + n), which is O(n log n).)
If the numbers are bounded (say, between -10,000 and 10,000) and the collection contains a lot of integers, you can use a lookup table and count each element. This would take O(n + l) (O(n) for the count, O(l) to find the max element), where l is the range length (20,001 in this case).
As you can see, if n >> l then this becomes O(n), which is better than the first method, but if n << l then it's O(l), which is constant but large enough to make this unusable.
Another variant of the previous is to use a hash table instead of a lookup table. This improves the complexity to O(n) but is not guaranteed to be faster than the second method when n >> l.
The good news is that the values don't have to be bounded.
I'm not much of a Java programmer, but if you need help coding these, let me know.
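For illustration, a rough sketch of the bounded-range lookup-table variant (the second approach above); the method name and the [min, max] parameters are assumptions for the example:
static int maxOccurrences(int[] values, int min, int max) {
    int[] counts = new int[max - min + 1];  // one counter per possible value in the range
    for (int v : values) {
        counts[v - min]++;                  // O(n) counting pass
    }
    int peak = 0;
    for (int c : counts) {
        if (c > peak) peak = c;             // O(l) scan for the maximum count
    }
    return peak;
}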
Here is a sample implementation. It returns the number with the highest frequency; if two numbers are tied for the most occurrences, the larger number is returned. If you want to return the frequency instead, change the last line of the code to return mf.
public int mode(int[] a, int n) {
    int i, j, f, mf = 0, mv = a[0];
    for (i = 0; i < n; i++) {
        f = 0;
        for (j = 0; j < n; j++) {
            if (a[i] == a[j]) {
                f++;
            }
        }
        if (f > mf || f == mf && a[i] > mv) {
            mf = f;
            mv = a[i];
        }
    }
    return mv;
}
Since it's a collection of integers, one can use either
radix sort to sort the collection and that takes O(nb) where b is the number of bits used to represent the integers (32 or 64, if you use java's primitive integer data types), or
a comparison-based sort (quicksort, merge sort, etc) and that takes O(n log n).
Notes:
The larger your n becomes, the more likely that radix sort will be faster than comparison-based sorts. For smaller n, you are probably better off with a comparison-based sort.
If you know a bound on the values in the collection, b will be even smaller than 32 (or 64) making the radix sort more desirable.
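A short sketch of the scan that follows either sort; Arrays.sort here is the JDK's comparison sort, and a radix sort would be a drop-in replacement for that line (the method name is made up for the example):
import java.util.Arrays;

static int maxRunAfterSort(int[] values) {
    Arrays.sort(values);                  // after sorting, equal values are adjacent
    int best = 0, run = 0;
    for (int i = 0; i < values.length; i++) {
        run = (i > 0 && values[i] == values[i - 1]) ? run + 1 : 1;
        if (run > best) best = run;       // longest run == highest occurrence count
    }
    return best;
}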
This little puppy works (edited to return the frequency instead of the number):
public static int mostFrequent(int[] numbers) {
    Map<Integer, AtomicInteger> map = new HashMap<Integer, AtomicInteger>() {
        public AtomicInteger get(Object key) {
            AtomicInteger value = super.get(key);
            if (value == null) {
                value = new AtomicInteger();
                super.put((Integer) key, value);
            }
            return value;
        }
    };
    for (int number : numbers)
        map.get(number).incrementAndGet();

    List<Entry<Integer, AtomicInteger>> entries =
            new ArrayList<Map.Entry<Integer, AtomicInteger>>(map.entrySet());
    Collections.sort(entries, new Comparator<Entry<Integer, AtomicInteger>>() {
        @Override
        public int compare(Entry<Integer, AtomicInteger> o1, Entry<Integer, AtomicInteger> o2) {
            return o2.getValue().get() - o1.getValue().get();
        }
    });
    return entries.get(0).getValue().get(); // return the largest *frequency*
    // Use this next line instead to return the most frequent *number*
    // return entries.get(0).getKey();
}
AtomicInteger was chosen to avoid creating new objects with every increment, and the code reads a little cleaner.
The anonymous map class was used to centralize the "if null" code
Here's a test:
public static void main(String[] args) {
    System.out.println(mostFrequent(new int[] { 2, 4, 3, 2, 2, 1, 4, 2, 2 }));
}
Output:
5
Using a HashMap:
import java.util.HashMap;

public class NumberCounter {
    static HashMap<Integer, Integer> map;
    static int[] arr = {1, 2, 1, 23, 4, 5, 4, 1, 2, 3, 12, 23};
    static int max = 0;

    public NumberCounter() {
        map = new HashMap<Integer, Integer>();
    }

    public static void main(String[] args) {
        Integer newValue = 1;
        NumberCounter c = new NumberCounter();
        for (int i = 0; i < arr.length; i++) {
            if (map.get(arr[i]) != null) {
                newValue = map.get(arr[i]);
                newValue += 1;
                map.put(arr[i], newValue);
            } else {
                map.put(arr[i], 1);
            }
        }
        // Scan the counts themselves, not the array positions, for the maximum.
        for (int count : map.values()) {
            if (count > max) {
                max = count;
            }
        }
        System.out.print(max);
    }
}

How do I count repeated words?

Given a 1GB (very large) file containing words (some repeated), we need to read the file and output how many times each word is repeated. Please let me know if my solution is high-performance or not.
(For simplicity, let's assume we have already captured the words in an ArrayList<String>.)
I think the complexity is O(n). Am I correct?
public static void main(String[] args) {
    ArrayList al = new ArrayList();
    al.add("math1");
    al.add("raj1");
    al.add("raj2");
    al.add("math");
    al.add("rj2");
    al.add("math");
    al.add("rj3");
    al.add("math2");
    al.add("rj1");
    al.add("is");
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (int i = 0; i < al.size(); i++) {
        String s = (String) al.get(i);
        map.put(s, null);
    }
    for (int i = 0; i < al.size(); i++) {
        String s = (String) al.get(i);
        if (map.get(s) == null) {
            map.put(s, 1);
        } else {
            int count = (int) map.get(s);
            count = count + 1;
            map.put(s, count);
        }
    }
    System.out.println("");
}
I think you could do better than using a HashMap.
Food for thought on the HashMap solution
Your answer is acceptable, but consider this: for simplicity's sake, let's assume you read the file one byte at a time into a StringBuffer until you hit a space. At that point you'll call toString() to convert the StringBuffer into a String. You then check if the string is in the HashMap, and either it gets stored or its counter gets incremented.
The English dictionary included with Linux has 400k words and is about 5MB in size. So of the "1GB" of text you read, we can guess that you'll only be storing about 5MB of it in your HashMap. The rest of the file will be converted into strings that will need to be garbage collected after you're finished looking for them in your map. I could be wrong, but I believe the bytes will be iterated over again during the construction of the String, since the byte array needs to be copied internally, and again when calculating the hash code. So the solution may waste a fair amount of CPU cycles and force GC to occur often.
It's OK to point things like this out in your interview, even if it's the only solution you can think of.
I might consider using a custom radix tree or Trie-like structure.
Keep in mind how the insert method of a radix tree/Trie works: it takes a stream of chars/bytes (usually a string) and compares each element against the current position in the tree. If the prefix exists, it just advances down the tree and the byte stream in lock step. When it hits a new suffix, it begins adding nodes into the tree. Once the end of the stream is reached, it marks that node as EOW (end of word). Now consider that we could do the same thing while reading a much larger stream, by resetting the current position to the root of the tree any time we hit a space.
If we wrote our own radix tree (or maybe a Trie) whose nodes had end-of-word counters (instead of markers) and whose insert method read directly from the file, we could insert nodes into the tree one byte/char at a time until we read a space. At that point the insert method would increment the end-of-word counter (if it's an existing word), reset the current position in the tree back to the root, and start inserting bytes/chars again. The way a radix tree works is to collapse the duplicated prefixes of words. For example:
The following file:
math1 raj1 raj2 math rj2 math rj3
would be converted to:
(root)
  +- math (eow=2)
  |   +- 1 (eow=1)
  +- r
      +- aj
      |   +- 1 (eow=1)
      |   +- 2 (eow=1)
      +- j
          +- 2 (eow=1)
          +- 3 (eow=1)
The insertion time into a tree like this would be O(k), where k is the length of the longest word. But since we are inserting/comparing as we read each byte, we aren't any less efficient than just reading the file, which we have to do anyway.
Also, note that we would read bytes into a temp variable on the stack, so the only time we need to allocate memory from the heap is when we encounter a new word (actually a new suffix). Therefore, garbage collection wouldn't happen nearly as often, and the total memory used by a radix tree would be a lot smaller than a HashMap.
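A rough sketch of that idea, using a plain (uncompressed) trie rather than a true radix tree for brevity; each node carries an end-of-word counter, and the reader resets to the root whenever it hits whitespace. The class and method names are made up for the example:
import java.io.*;
import java.util.*;

public class TrieWordCounter {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int endOfWordCount = 0;
    }

    public static Node countWords(Reader reader) throws IOException {
        Node root = new Node();
        Node current = root;
        boolean inWord = false;
        int ch;
        while ((ch = reader.read()) != -1) {
            if (Character.isWhitespace(ch)) {
                if (inWord) current.endOfWordCount++;  // finished a word: bump its counter
                current = root;                        // reset to the root for the next word
                inWord = false;
            } else {
                current = current.children.computeIfAbsent((char) ch, c -> new Node());
                inWord = true;
            }
        }
        if (inWord) current.endOfWordCount++;          // count the last word if no trailing space
        return root;
    }
}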
Theoretically, since HashMap access is generally O(1), I guess your algorithm is O(n), but in reality it has several inefficiencies. Ideally you would iterate over the contents of the file just once, processing (i.e. counting) the words as you read them in. There's no need to store the entire file contents in memory (your ArrayList). You loop over the contents three times: once to read them, and the second and third times in the two loops in your code above. In particular, the first loop in your code above is completely unnecessary. Finally, your use of HashMap will be slower than needed because the default size at construction is very small, and it will have to grow internally a number of times, forcing a rebuilding of the hash table each time. Better to start it off at a size appropriate for what you expect it to hold. You also have to factor the load factor into that.
Have you considered using a mapreduce solution? If the dataset gets bigger, it would really be better to split it into pieces and count the words in parallel.
You should read through the file with words only once.
No need to put the nulls beforehand - you can do it within the main loop.
The complexity is indeed O(n) in both cases, but you want to make the constant as small as possible. (O(n) = 1000 * O(n), right :) )
To answer your question, first you need to understand how HashMap works. It consists of buckets, and every bucket is a linked list. If, due to hashing, another pair needs to occupy the same bucket, it is added to the end of the linked list. So, if the map has a high load factor, searching and inserting will not be O(1) anymore, and the algorithm becomes inefficient. Moreover, if the map's load factor exceeds the predefined load factor (0.75 by default), the whole map will be rehashed.
This is an excerpt from JavaDoc http://download.oracle.com/javase/6/docs/api/java/util/HashMap.html:
The expected number of entries in the map and its load factor should
be taken into account when setting its initial capacity, so as to
minimize the number of rehash operations. If the initial capacity is
greater than the maximum number of entries divided by the load factor,
no rehash operations will ever occur.
So I would recommend predefining the map capacity, guessing that every word is unique:
Map<String,Integer> map= new HashMap<String,Integer>(al.size());
Without that, your solution is not efficient enough, though it is still linear (O(3n) is O(n)): due to the amortization of rehashing, inserting the elements will cost about 3n operations instead of n.

How to represent a small map of integers as an array?

Let's say I want to build a small simple map, where the key is N integers (fixed, usually one) and the value is M integers (also fixed, usually one).
Now I would like to store the data in an integer array, for space efficiency. I am programming on the JVM, but it should not really make a difference, unless the algorithm would require storing pointers as integers.
Has anyone defined a simple data structure that can do that?
[EDIT] The answers I had so far seem to show that no one understands my question, so I'll try to clarify. Firstly, forget about the M and N; just imagine I said one int key and one int value. AFAIK, if you want to use a normal HashMap where the key is an integer and the value is an integer, then you will end up with at least 2 + 3 * N objects, where N is the number of entries.
What I want to know is: can you pack all those ints into a single array of primitive ints, reducing your object count to two, independent of the number of keys? One for the int[], and one for the wrapper object that gives you some map-like interface. Neither my keys nor my values will ever be null, and I don't need a full standard java.util.Map implementation either. I just need get, put, and remove, taking and returning primitive ints, not Integer objects. Access does not need to be O(1), as in a normal HashMap.
As far as I know, the answer is no, at least not in Java.
But you can simply keep your keys and your values in an int array (each) with matching indexes, and if you keep the key array sorted, you can perform a binary search.
The trouble with Java arrays, though, is that they're fixed-size structures, so if you want to store more elements than your arrays were allocated for, reallocating them is quite expensive.
So you'll have to make a trade-off between the size of the array and the number of reallocations, maybe something similar to how ArrayList does it.
The other, somewhat smaller problem is that int, being a primitive type, has no null value, but you can designate a special int value to denote null.
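A minimal sketch of that parallel-array idea (get and put only; remove would shift the arrays the other way); the class name, initial capacity, and NOT_FOUND sentinel are assumptions for the example:
import java.util.Arrays;

public class IntIntArrayMap {
    public static final int NOT_FOUND = Integer.MIN_VALUE; // assumed "no such key" sentinel
    private int[] keys = new int[16];
    private int[] values = new int[16];
    private int size = 0;

    public int get(int key) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        return i >= 0 ? values[i] : NOT_FOUND;
    }

    public void put(int key, int value) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        if (i >= 0) {                 // key already present: overwrite its value
            values[i] = value;
            return;
        }
        int insertAt = -(i + 1);      // binarySearch encodes the insertion point
        if (size == keys.length) {    // grow, roughly the way ArrayList does
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        System.arraycopy(keys, insertAt, keys, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        keys[insertAt] = key;
        values[insertAt] = value;
        size++;
    }
}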
I think I understand the problem, I'll take a shot at a solution:
The very simple way of achieving what you want is to keep a single two-dimensional array int[nKeys][2], where array[i][0] is the key and array[i][1] is one of the M values. You could then iterate over the array and, for every query of a key, return every array[i][1] such that array[i][0] == key. Of course, this works for a single int key rather than a set of int keys. Also, this is awful in complexity. I can't think of any other way of doing this without adding more Java objects/C pointers.
Since the ArrayList class (actually AbstractList) has the equals method overridden appropriately, you can directly use a map like so:
Map<List, List> map = new HashMap<List, List>();
