Most frequent words - java

What's the most efficient way in Java to get the 50 most frequent words with their frequency out of a text?
I want to do this over around ~1,000,000 texts, each with around ~10,000 words, and hope it will run in a reasonable time frame.

Most efficient would probably be using a Patricia trie that links to a max-heap. Every time you read a word, find it on the trie, go to the heap and increase-key. If it's not in the trie, add it and set its key in the heap appropriately.
With a Fibonacci heap, increase-key is amortized O(1).
A not so unreasonable solution is to use a Map<String, Integer>, adding the count every time a word is encountered, and then custom-sorting its entrySet() based on the count to get the top 50.
If the O(N log N) sort is unacceptable, use a selection algorithm to find the top 50 in O(N).
Which technique is better really depends on what you're asking for (i.e. the comment asking whether this is more of an [algorithm] question than a [java] question is very telling).
The Map<String, Integer> followed by selection algorithm is most practical, but the Patricia trie solution clearly beats it in space efficiency alone (since common prefixes are not stored redundantly).
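For concreteness, here is a rough Java sketch of the Map<String, Integer> route. Instead of a full sort or a true selection algorithm, it keeps a bounded min-heap of 50 entries, a common alternative that runs in O(N log 50); the method name and the hard-coded 50 are purely illustrative.
import java.util.*;

// Sketch: counts are built elsewhere; extract the top 50 with a size-bounded min-heap.
static List<Map.Entry<String, Integer>> top50(Map<String, Integer> counts) {
    PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());   // min-heap on count
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        heap.offer(e);
        if (heap.size() > 50) heap.poll();   // evict the smallest of the 51, keeping the 50 largest so far
    }
    List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
    result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());        // highest count first
    return result;
}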

Following pseudocode should do the trick:
build a map<word, count>
build a tokenizer that gives you a word per iteration
for each word*,
    if word in map, increment its count
    otherwise add with count = 1
sort words by count
for each of the first 50 words,
    output word, frequency = count / total_words
This is essentially O(N), and is what jpabluz suggested. However, if you are going to use this on any sort of "in the wild" text, you will notice lots of garbage: uppercase/lowercase, punctuation, URLs, stop-words such as 'the' or 'and' with very high counts, multiple variations of the same word... The right way to do it is to lowercase all words, remove all punctuation (and things such as URLs), and add stop-word removal and stemming at the point marked with the asterisk in the above pseudocode.
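A rough Java rendering of this pseudocode, with the lowercasing and punctuation removal folded in; the tiny stop-word set is only a placeholder and stemming is left out.
import java.util.*;

static Map<String, Integer> countWords(String text) {
    // Placeholder stop-word list; a real one would be far longer.
    Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "a", "of"));
    Map<String, Integer> counts = new HashMap<>();
    // Lowercase, then split on anything that is not a letter (drops punctuation, digits, URL separators).
    for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
        if (token.isEmpty() || stopWords.contains(token)) continue;   // the asterisk step (stemming omitted)
        counts.merge(token, 1, Integer::sum);                         // increment, or insert with count = 1
    }
    return counts;
}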

Your best chance would be an O(n) algorithm. I would go for a text reader that splits out the words, then adds each one to an ordered tree, ordered by number of appearances, with each node linked to its word. After that, just do a 50-iteration traversal to get the highest values.

O(n):
Count the number of words
Split your text into a list of words
Create a map of word => number_of_occurrences
Traverse the map and select the top 50
Divide their counts by the total number of words to get the frequencies
Of course, some of these steps may be combined or may be unnecessary, depending on the data structures you use.

Related

Dictionary data structure + fast complexity methods

I'm trying to build from scratch, a data structure that would be able to hold a vast dictionary (of words/characters).
The "words" can be made out of an arbitrarily large number of characters.
The dictionary would need standard methods such as search, insert, delete.
I need the methods to have time complexity better than O(log(n)), so somewhere between O(log(n)) and O(1), e.g. O(log(log(n))),
where n = dictionary size (number of elements).
I've looked into various tree structures: for example a B-tree, whose methods are O(log(n)) (not fast enough), as well as a trie, which seemed most appropriate for a dictionary, but because the words can be arbitrarily long it seemed like its complexity would not be faster than O(log(n)).
Any explanation you could provide would be appreciated.
A trie has significant memory requirements but the access time is usually faster than O(log n).
If I recall correctly, the access time depends on the length of the word, not on the number of words in the structure.
The efficiency and memory consumption also depend on exactly which implementation of the trie you choose to use. There are some pretty efficient implementations out there.
For more information on Tries see:
http://en.wikipedia.org/wiki/Trie
http://algs4.cs.princeton.edu/52trie/
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
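For concreteness, here is a minimal uncompressed trie sketch, assuming lowercase a-z words only; the class and method names are illustrative, and the implementations linked above are far more refined.
final class Trie {
    private static final int R = 26;                        // alphabet size: 'a'..'z'
    private static final class Node {
        Node[] next = new Node[R];
        boolean isWord;
    }
    private final Node root = new Node();

    public void insert(String word) {                       // O(word length), independent of dictionary size
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (node.next[c] == null) node.next[c] = new Node();
            node = node.next[c];
        }
        node.isWord = true;
    }

    public boolean contains(String word) {                  // also O(word length)
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (node.next[c] == null) return false;
            node = node.next[c];
        }
        return node.isWord;
    }
}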
If your question is how to achieve as few string comparisons as possible, then a hash table is probably a very good answer, as it requires close to O(1) string comparisons. Note that hashing the key takes time proportional to the string length, as can the time for a string comparison.
But this is nothing new. Can we do better for long strings? To be more precise, we will assume the string length to be bounded by M. We will also assume that the length of every string is known (for long strings, this can make a difference).
First notice that the search time is bounded below by the string length, and is Ω(M) in the worst case: comparing two strings can require comparing all their characters, since the strings may differ only at the last character compared. On the other hand, in the best case the comparison can conclude immediately, either because the lengths are different or because the strings differ in the first characters compared.
Now you can reason as follows: consider the whole set of strings in the dictionary and find the position of the first character on which they differ. Based on the value of this character, you decompose the set into a number of subsets. You can continue this decomposition recursively until you get singletons.
For example,
able
about
above
accept
accident
accompany
is organized as
*bl*
*bou*
*bov*
*c*e**
*c*i****
*c*o*****
where an asterisk stands for a character which is just ignored, and the remaining characters are used for discrimination.
As you can see, in this particular example two or three character comparisons are enough to recognize any word in the dictionary.
This representation can be described as a finite state automaton such that in every state you know which character to check next and what are the possible outcomes, leading to the next states. It has a K-ary tree structure (where K is the size of the alphabet).
For an efficient implementation, every state can be represented by the position of the decision character and an array of links to the next states. Actually, this is a trie structure with path compression. (As @peter.petrov said, there are many variants of the trie structure.)
How do we use it? There are two situations:
1) the search string is known to be in the dictionary: then a simple traversal of the tree is guaranteed to find it. It will do so after a number of character comparisons equal to the depth D of the corresponding leaf in the tree, i.e. O(D) comparisons. This can be a very significant saving.
2) the search string may not be in the dictionary: during traversal of the tree you can observe an early rejection; otherwise, in the end you find a single potential match. Then you can't avoid performing an exhaustive comparison, O(1) in the best case, O(M) in the worst. (On average O(M) for random strings, but probably better for real-world distributions.) But you will compare against a single string, never more.
In addition to that device, if your distribution of key lengths is sparse, it may be useful to maintain a hash table of the key lengths, so that immediate rejection of the search string can occur.
As final remarks, notice that this solution's cost is not directly a function of N, and that time sublinear in M could likely be achieved by suitable heuristics taking advantage of the particular distribution of the strings.
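As an illustration of the lookup only (not of the construction), here is a hypothetical node layout and search for the path-compressed structure described above, assuming an ASCII alphabet and an already-built tree; all names are made up for this sketch.
final class DiscriminationTree {
    static final int ALPHABET = 128;                        // assumption: ASCII input

    static final class Node {
        final int decisionIndex;                            // position of the character tested at this state
        final Node[] children;                              // one slot per possible character (internal nodes)
        final String leafWord;                              // the single candidate string (leaves only)

        Node(int decisionIndex, Node[] children) { this.decisionIndex = decisionIndex; this.children = children; this.leafWord = null; }
        Node(String leafWord) { this.decisionIndex = -1; this.children = null; this.leafWord = leafWord; }
    }

    static boolean contains(Node root, String key) {
        Node node = root;
        while (node.leafWord == null) {                                   // descend until a singleton is reached
            if (node.decisionIndex >= key.length()) return false;         // early rejection: key too short
            char c = key.charAt(node.decisionIndex);
            if (c >= ALPHABET || node.children[c] == null) return false;  // early rejection: no word matches here
            node = node.children[c];
        }
        return key.equals(node.leafWord);                                 // one exhaustive comparison, O(M) worst case
    }
}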

Algorithm to remove words in corpus with small occurrence

I have a large (+/- 300,000 lines) dataset of text fragments that contain some noisy elements. With noisy I mean words of slang, type errors, etc… I wish to filter out these noisy elements to have a more clean dataset.
I read some papers that propose filtering these out by keeping track of the occurrence of each word. By setting a threshold (e.g. fewer than 20 occurrences) we can assume these words are noise and can safely be removed from the corpus.
Maybe there are some libraries or algorithms that do this in a fast and efficient way. Of course I tried it myself first, but this is EXTREMELY slow!
So to summarize, I am looking for an algorithm that can filter out words that occur less often than a particular threshold, in a fast and efficient way. Maybe I should add a small example:
This is just an example of whaat I wish to acccomplish.
The words 'whaat' and 'acccomplish' are misspelled and thus likely to occur less often (if we assume we live in a perfect world and typos are rare…). I wish to end up with
This is just an example of I wish to.
Thanks!
PS: If possible, I'd like to have an algorithm in Java (or pseudo-code so I can write it myself)
I think you are overcomplicating it with the approach suggested in the comments.
You can do it with 2 passes on the data:
Build a histogram: a Map<String, Integer> that counts the number of occurrences
For each word, print it to the new 'clean' file if and only if map.get(word) > THRESHOLD
As a side note, I think a fixed-threshold approach is not the best choice. I personally would filter out words that occur less than MEAN - 3*STD times, where MEAN is the average count per word and STD is the standard deviation. (Three standard deviations mean you are catching words that fall outside the expected normal distribution with probability of ~99%.) You can 'play' with the constant factor and find what best suits your needs.
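A minimal Java sketch of the two-pass idea with a fixed threshold (the file paths and the threshold are placeholders; the mean/standard-deviation threshold from the side note above could be computed from the histogram between the two passes).
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

static void filterRareWords(Path in, Path out, int threshold) throws IOException {
    // Pass 1: build the histogram of word counts.
    Map<String, Integer> histogram = new HashMap<>();
    try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) histogram.merge(word, 1, Integer::sum);
            }
        }
    }
    // Pass 2: rewrite the corpus, keeping only words above the threshold.
    try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
         BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            StringBuilder kept = new StringBuilder();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty() && histogram.get(word) > threshold) {
                    if (kept.length() > 0) kept.append(' ');
                    kept.append(word);
                }
            }
            writer.write(kept.toString());
            writer.newLine();
        }
    }
}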

Running time of insertion into 2 hashtables with iteration and printing

I have a program that does the following:
Iterates through a string, placing words into a HashMap<String, Integer> where the key represents a unique word and the value represents a running total of occurrences (incremented each time the word is found).
I believe up to this point we are O(n) since each of the insertions is constant time.
Then, I iterate through the hashmap and insert the values into a new HashMap<Integer, List<String>>. The String goes into the List in the value where the count matches. I think that we are still at O(n) because the operations used on HashMaps and Lists are constant time.
Then, I iterate through the HashMap and print the Strings in each List.
Does anything in this program cause me to go above O(n) complexity?
That is O(n), unless your word-parsing algorithm is not linear (but it should be).
You're correct, with a caveat. In a hash table, insertions and lookups take expected O(1) time each, so the expected runtime of your algorithm is O(n). If you have a bad hash function, there's a chance it will take longer than that, usually (for most reasonable hash table implementations) O(n²) in the worst case.
Additionally, as @Paul Draper pointed out, this assumes that the computation of the hash code for each string takes time O(1) and that comparing the strings in the table takes time O(1). If you have strings whose lengths aren't bounded from above by some constant, it might take longer to compute the hash codes. In fact, a more accurate analysis would be that the runtime is O(n + L), where L is the total length of all the strings.
Hope this helps!
Beyond the two issues that Paul Draper and templatetypedef point out, there's another potential one. You write that your second map is a HashMap<Integer, List<String>>. This allows for total linear complexity only if the implementation you choose for the list allows (amortized) constant-time appending. This is the case if you use an ArrayList and add entries at the end, or choose a LinkedList and add entries at either end.
I think this covers the default choices for most developers, so it's not really an obstacle.
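For illustration, a sketch of that inversion step with ArrayList values, so each append is amortized O(1); the method name is made up.
import java.util.*;

static Map<Integer, List<String>> groupByCount(Map<String, Integer> counts) {
    Map<Integer, List<String>> byCount = new HashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>())   // create the list on first use
               .add(e.getKey());                                        // amortized O(1) append at the end
    }
    return byCount;
}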

Algorithm for Permutations without Repetition?

In a program I am making that generates anagrams for a given set of letters, my current approach is to:
Get all the combinations of all the letters
Get the permutations of each combination group
Sort the resulting permutations alphabetically
Remove duplicate entries
My question pertains to the mathematics of permutations. I am wondering if it is possible to flat-out calculate the array size needed to store all of the remaining entries after removal of duplicate entries (using, say, the number of repeated letters in conjunction with the permutation formula or something).
I apologize about the vagueness of my question, I am still researching more about combinations and permutations. I will try to elaborate my goal as my understanding of combinations and permutations expands, and once I re-familiarize myself with my program (it was a spare-time project of mine last summer).
If you have n elements, and a[0] duplicates of one element, a[1] duplicates of another element, and so on up to a[k], then the total number of distinct permutations (up to duplicates) is n!/(a[0]! a[1]! ... a[k]!).
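A small sketch of that formula in Java, using BigInteger since factorials overflow long very quickly; the method names are illustrative.
import java.math.BigInteger;
import java.util.*;

static BigInteger distinctPermutations(String letters) {
    // Count the multiplicity a[i] of each distinct letter.
    Map<Character, Integer> multiplicity = new HashMap<>();
    for (char c : letters.toCharArray()) multiplicity.merge(c, 1, Integer::sum);

    BigInteger result = factorial(letters.length());        // n!
    for (int count : multiplicity.values()) {
        result = result.divide(factorial(count));           // divide by each a[i]!
    }
    return result;
}

static BigInteger factorial(int n) {
    BigInteger f = BigInteger.ONE;
    for (int i = 2; i <= n; i++) f = f.multiply(BigInteger.valueOf(i));
    return f;
}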
FYI, if you're interested, with Guava you could write
Collection<List<Character>> uniquePermutations =
Collections2.orderedPermutations(Lists.charactersOf(string));
and the result would be the unique permutations of the characters, accounting for duplicates and everything. You could even call its .size() method -- or just look at its implementation for hints. (Disclosure: I contribute to Guava.)
Generating all the permutations is a really bad idea. The word "overflow", for instance, has 40320 permutations, so memory consumption gets really high as your word length grows.
I believe the problem you are trying to solve can be reduced to finding out whether one word is an anagram of another.
Then you can solve it by counting how many times each letter occurs (it will be a 26-tuple) and comparing these tuples against each other.
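A minimal sketch of that comparison, assuming lowercase a-z words; the method name is made up.
static boolean isAnagram(String a, String b) {
    if (a.length() != b.length()) return false;
    int[] counts = new int[26];                   // the 26-tuple of letter counts
    for (int i = 0; i < a.length(); i++) {
        counts[a.charAt(i) - 'a']++;              // add letters of the first word
        counts[b.charAt(i) - 'a']--;              // subtract letters of the second
    }
    for (int c : counts) {
        if (c != 0) return false;                 // some letter occurs a different number of times
    }
    return true;
}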

How do I count repeated words?

Given a 1 GB (very large) file containing words (some repeated), we need to read the file and output how many times each word is repeated. Please let me know whether my solution is performant or not.
(For simplicity, let's assume we have already captured the words in an ArrayList<String>.)
I think the complexity is O(n). Am I correct?
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public static void main(String[] args) {
    List<String> al = new ArrayList<String>();
    al.add("math1");
    al.add("raj1");
    al.add("raj2");
    al.add("math");
    al.add("rj2");
    al.add("math");
    al.add("rj3");
    al.add("math2");
    al.add("rj1");
    al.add("is");
    Map<String, Integer> map = new HashMap<String, Integer>();
    // First pass: pre-populate the map with null counts for every word.
    for (int i = 0; i < al.size(); i++) {
        String s = al.get(i);
        map.put(s, null);
    }
    // Second pass: count the occurrences of each word.
    for (int i = 0; i < al.size(); i++) {
        String s = al.get(i);
        if (map.get(s) == null) {
            map.put(s, 1);
        } else {
            int count = map.get(s);
            map.put(s, count + 1);
        }
    }
    System.out.println(map);   // print each word with its count
}
I think you could do better than using a HashMap.
Food for thought on the hashmap solution
Your answer is acceptable, but consider this: for simplicity's sake, let's assume you read the file one byte at a time into a StringBuffer until you hit a space, at which point you call toString() to convert the StringBuffer into a String. You then check whether the string is in the HashMap, and either it gets stored or its counter gets incremented.
The English dictionary included with Linux has 400k words and is about 5 MB in size. So of the "1GB" of text you read, we can guess that you'll only be storing about 5 MB of it in your HashMap. The rest of the file will be converted into strings that will need to be garbage collected after you're finished looking them up in your map. I could be wrong, but I believe the bytes will be iterated over again during the construction of the String, since the byte array needs to be copied internally, and again when calculating the hash code. So the solution may waste a fair amount of CPU cycles and force GC to occur often.
It's OK to point things like this out in your interview, even if it's the only solution you can think of.
I might consider using a custom radix tree or trie-like structure.
Keep in mind how the insert method of a radix tree/trie works: it takes a stream of chars/bytes (usually a string) and compares each element against the current position in the tree. If the prefix exists, it just advances down the tree and the byte stream in lock step. When it hits a new suffix, it begins adding nodes to the tree. Once the end of the stream is reached, it marks that node as EOW (end of word). Now consider that we could do the same thing while reading a much larger stream, by resetting the current position to the root of the tree any time we hit a space.
If we wrote our own radix tree (or maybe a trie) whose nodes had end-of-word counters (instead of markers) and whose insert method read directly from the file, we could insert nodes into the tree one byte/char at a time until we read a space. At that point the insert method would increment the end-of-word counter (if it's an existing word), reset the current position in the tree back to the root, and start inserting bytes/chars again. The way a radix tree works is to collapse the duplicated prefixes of words. For example:
The following file:
math1 raj1 raj2 math rj2 math rj3
would be converted to:
(root)
|- math (eow=2)
|   |- 1 (eow=1)
|- r
    |- aj
    |   |- 1 (eow=1)
    |   |- 2 (eow=1)
    |- j
        |- 2 (eow=1)
        |- 3 (eow=1)
The insertion time into a tree like this would be O(k), where k is the length of the longest word. But since we are inserting/comparing as we read each byte, we aren't any less efficient than just reading the file, which we have to do anyway.
Also, note that we would read bytes into a temporary variable that lives on the stack, so the only time we need to allocate memory from the heap is when we encounter a new word (actually, a new suffix). Therefore, garbage collection wouldn't happen nearly as often, and the total memory used by a radix tree would be a lot smaller than that of a HashMap.
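A simplified sketch of that idea, using a plain (uncompressed) trie rather than a true radix tree, with an end-of-word counter per node and an insert loop that reads one character at a time and resets to the root at whitespace; ASCII only, and all names are made up.
import java.io.IOException;
import java.io.Reader;

final class CountingTrie {
    private static final int R = 128;                          // assumption: ASCII alphabet
    private static final class Node {
        Node[] next = new Node[R];
        int endOfWordCount;                                    // counter instead of a boolean marker
    }
    private final Node root = new Node();

    void countWords(Reader in) throws IOException {
        Node node = root;
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (node != root) node.endOfWordCount++;       // finished a word
                node = root;                                   // reset for the next word
            } else if (c < R) {
                if (node.next[c] == null) node.next[c] = new Node();
                node = node.next[c];
            }
        }
        if (node != root) node.endOfWordCount++;               // last word, if the file doesn't end in whitespace
    }
}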
Theoretically, since HashMap access is generally O(1), I guess your algorithm is O(n), but in reality it has several inefficiencies. Ideally you would iterate over the contents of the file just once, processing (i.e. counting) the words as you read them in. There's no need to store the entire file contents in memory (your ArrayList). You loop over the contents three times: once to read them, and a second and third time in the two loops in your code above. In particular, the first loop in your code is completely unnecessary. Finally, your use of HashMap will be slower than needed because the default size at construction is very small, and it will have to grow internally a number of times, forcing a rebuild of the hash table each time. Better to start it off at a size appropriate for what you expect it to hold. You also have to take the load factor into account.
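For illustration, a single-pass version along those lines: stream the file, count as you read, and presize the HashMap (the expected number of distinct words is a guess supplied by the caller).
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

static Map<String, Integer> countFromFile(Path file, int expectedDistinctWords) throws IOException {
    // Presize so the table never has to grow (capacity = expected entries / load factor).
    Map<String, Integer> counts = new HashMap<>((int) (expectedDistinctWords / 0.75f) + 1);
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);   // one pass, no intermediate list
            }
        }
    }
    return counts;
}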
Have you considered using a MapReduce solution? If the dataset gets bigger, it would really be better to split it into pieces and count the words in parallel.
You should read through the file with words only once.
No need to put the nulls beforehand - you can do it within the main loop.
The complexity is indeed O(n) in both cases, but you want to make the constant as small as possible. (O(n) = 1000 * O(n), right :) )
To answer your question, first you need to understand how HashMap works. It consists of buckets, and every bucket is a linked list. If, due to hashing, another pair needs to occupy the same bucket, it is added to the end of that linked list. So if the map has a high load, searching and inserting are no longer O(1), and the algorithm becomes inefficient. Moreover, if the map's fill ratio exceeds the predefined load factor (0.75 by default), the whole map is rehashed.
This is an excerpt from JavaDoc http://download.oracle.com/javase/6/docs/api/java/util/HashMap.html:
The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
So I would recommend predefining the map capacity, guessing that every word is unique:
Map<String,Integer> map= new HashMap<String,Integer>(al.size());
Without that, your solution is not efficient enough, though it is still linear at roughly O(3n): because of the amortized cost of rehashing, inserting the elements will cost about 3n operations instead of n.
