Count word frequency of huge text file [duplicate] - java

This question already has answers here:
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word along with its frequency count into a new file. The result should be sorted in descending order of frequency count.
My Approach:
Sort the given file - external sort
Count the frequency of each word sequentially, store the count in another file (along with the word)
Sort the output file based on frequency count - external sort.
I want to know if there are better approaches. I have heard of disk-based hash tables and B+ trees, but have never tried them.
Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.
Edit: Based on the comments, I agree that a dictionary should, in practice, fit in the memory of today's computers. But let's assume a hypothetical dictionary of words that is too large to fit in memory.

I would go with a map reduce approach:
Distribute your text file across nodes, assuming the text on each node fits into RAM.
Calculate each word's frequency within the node (using hash tables).
Collect the results on a master node and combine them.
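A minimal single-JVM sketch of the same idea (count chunks that fit in RAM, then merge the partial counts); the chunk size, file argument and the assumption that the merged totals fit on the "master" are all placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedWordCount {

    // "Map" step: count the words in one chunk of lines that fits in RAM.
    static Map<String, Long> countChunk(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // "Reduce" step: fold a chunk's partial counts into the running totals.
    static void merge(Map<String, Long> total, Map<String, Long> partial) {
        partial.forEach((word, count) -> total.merge(word, count, Long::sum));
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> total = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == 100_000) {   // arbitrary chunk size
                    merge(total, countChunk(chunk));
                    chunk.clear();
                }
            }
            merge(total, countChunk(chunk));     // leftover lines
        }
        total.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}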

All unique words probably fit in memory so I'd use this approach:
Create a dictionary (HashMap<String, Integer>).
Read the huge text file line by line.
Add new words into the dictionary and set value to 1.
Add 1 to the value of existing words.
After you've parsed the entire huge file:
Sort the dictionary by frequency.
Write, to a new file, the sorted dictionary with words and frequency.
Remember, though, to convert the words to either all lowercase or all uppercase.
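A minimal sketch of those steps (the file names and the \W+ tokenization are assumptions):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class WordFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();

        // Read the huge file line by line; only the unique words are kept in memory.
        try (Stream<String> lines = Files.lines(Paths.get("huge.txt"))) {
            lines.forEach(line -> {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);  // new word -> 1, existing word -> +1
                    }
                }
            });
        }

        // Sort by frequency, descending, and write "word<TAB>count" to a new file.
        try (PrintWriter out = new PrintWriter("frequencies.txt")) {
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                  .forEach(e -> out.println(e.getKey() + "\t" + e.getValue()));
        }
    }
}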

The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava's). If this map exceeds your memory, you could try a key-value store (e.g. Berkeley DB JE or MapDB). These key-value stores work like a map, but they store their values on disk. I used MapDB for a similar problem and it was blazing fast.
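For illustration, a rough sketch against the MapDB 3.x API (the file name and map name are arbitrary, and the API differs between MapDB versions, so treat this as an outline rather than the exact calls):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.concurrent.ConcurrentMap;

public class DiskBackedWordCount {
    public static void main(String[] args) {
        // The counts live in a file-backed map, so they are not limited by the heap.
        DB db = DBMaker.fileDB("wordcounts.db").make();
        ConcurrentMap<String, Integer> counts = db
                .hashMap("counts", Serializer.STRING, Serializer.INTEGER)
                .createOrOpen();

        counts.merge("example", 1, Integer::sum);  // same usage as an ordinary Map

        db.close();
    }
}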

If the unique words and their frequencies fit in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).
You can then sort the entries of the hash table by the number of occurrences.

Related

script to build word pairs and their frequencies in native java

It's my first time programming ever, and we have an assignment on finding both word frequencies and word-pair frequencies in a text file.
I've followed several tutorials online and implemented a rather fast word count solution; however, I have no clue how to implement a method that gets all the word pairs in the text file and sums up the frequencies of duplicate word pairs before adding them to an array (hashmap).
I tried asking my instructor, but he is adamant that we figure it out ourselves. Please just point me in the right direction to a paper / article / tutorial (anything) I can read in order to solve this.
Thanks in advance.
Ideally this would be done using a hash map with String keys and Integer values. You can check whether a pair is already in the hash map before adding it as a new key and frequency to the map.
Here is an example of a previously answered question using this method.
Counting frequency of words from a .txt file in java
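Building on that, a minimal sketch of the pair-counting idea (adjacent words form a pair; the file name and \W+ tokenization are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WordPairCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> pairCounts = new HashMap<>();

        for (String line : Files.readAllLines(Paths.get("input.txt"))) {
            String[] words = line.toLowerCase().split("\\W+");
            // Each word and its right-hand neighbour form one pair.
            for (int i = 0; i + 1 < words.length; i++) {
                if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
                String pair = words[i] + " " + words[i + 1];
                // A new pair gets count 1; a duplicate pair's count goes up by 1.
                pairCounts.merge(pair, 1, Integer::sum);
            }
        }

        pairCounts.forEach((pair, count) -> System.out.println(pair + "\t" + count));
    }
}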

What is the most memory efficient method of storing a large number of Strings in a map?

I want to store huge amounts of Strings in a Map<String, MagicObject>, so that the MagicObjects can be accessed quickly. There are so many entries to this Map that memory is becoming a bottleneck. Assuming the MagicObjects can't be optimized, what is the most efficient type of map I could use for this situation? I am currently using the following:
gnu.trove.map.hash.TCustomHashMap<byte[], MagicObject>
If your keys are long enough and share a lot of long enough common prefixes, then you can save memory by using a trie (prefix tree) data structure. Answers to this question point to a couple of Java implementations of trie.
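A bare-bones trie, just to show how keys with common prefixes share nodes (the lowercase a-z alphabet is an assumption):

public class Trie {
    private static class Node {
        Node[] children = new Node[26];  // one slot per lowercase letter
        boolean isKey;                   // true if a key ends at this node
    }

    private final Node root = new Node();

    public void insert(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                node.children[i] = new Node();
            }
            node = node.children[i];     // keys with a shared prefix reuse these nodes
        }
        node.isKey = true;
    }

    public boolean contains(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) {
                return false;
            }
        }
        return node.isKey;
    }
}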
As another idea, consider Huffman coding to compress your strings before putting them in the map, as long as your strings are fixed (their number and content don't change).
I'm a little late to this party but this question came up in a related search and piqued my interest. I don't usually answer Java questions.
There are so many entries to this Map that memory is becoming a bottleneck.
I doubt it.
For the storage of strings in memory to become a bottleneck you need an awfully large number of unique strings[1]. To put things into perspective, I recently worked with a 1.8m word dictionary (1.8m unique English words) and they took up around 1.6MB in RAM at runtime.
If you used every word in the dictionary as a key, you'd still only use 1.6MB of RAM[2] to store the keys, hence memory cannot be your bottleneck.
What I suspect you are experiencing is the O(n^2) performance of string matching. By this I mean that as more keys are added, performance slows down quadratically[3]. This is unavoidable if you are using strings as keys.
If you want to speed things up a bit, store each key into a hashtable that doesn't store duplicates and use the hash key as the key to your map.
NOTES:
[1] I'm assuming the strings are all unique or else you would not attempt to use them as a key into a map.
[2] Even if Java uses 2 bytes per character, it still only comes to 3.2MB of memory, total.
[3] It slows down even more if you choose the wrong data structure, such as an unbalanced binary tree, to store your values. I don't know how the map stores values internally, but an unbalanced binary tree degrades to O(n) per lookup - pretty much the worst performance you can find here.

What data structure to use for indexing data for partial %infix% searching?

Imagine you have a huge cache of data that is to be searched in 4 ways:
exact match
prefix%
%suffix
%infix%
I'm using a Trie for the first 3 types of searching, but I can't figure out how to approach the fourth one other than sequentially processing a huge array of elements.
If your dataset is huge, consider using a search platform like Apache Solr so that you don't end up in a performance mess.
You can construct a navigable map or set (e.g. TreeMap or TreeSet) for #2 (with keys in normal order) and #3 (with keys reversed), as sketched below.
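A small sketch of that idea (the Character.MAX_VALUE upper bound assumes keys never contain that character):

import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixSuffixIndex {
    private final NavigableSet<String> forward = new TreeSet<>();   // for "prefix%"
    private final NavigableSet<String> reversed = new TreeSet<>();  // for "%suffix", keys stored reversed

    public void add(String word) {
        forward.add(word);
        reversed.add(new StringBuilder(word).reverse().toString());
    }

    // All words starting with the given prefix: one range view on the sorted set.
    public NavigableSet<String> byPrefix(String prefix) {
        return forward.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    // Any word ending with the given suffix: prefix search over the reversed keys.
    public boolean anyWithSuffix(String suffix) {
        String rev = new StringBuilder(suffix).reverse().toString();
        return !reversed.subSet(rev, true, rev + Character.MAX_VALUE, false).isEmpty();
    }
}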
For option 4 you can construct a collection with a key for every starting letter. You can simplify this depending on your requirements. This can use more space but gives O(log n) lookup times.
For #4, I am thinking that if you pre-compute the number of occurrences of each character, then you can look up in that table the entries that have at least as many occurrences of each character as the search string.
How efficient this algorithm is will probably depend on the nature of the data and the search string. It might be useful to give some examples of both here to get better answers.
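A rough sketch of that filtering idea, assuming lowercase ASCII entries; the counts table only prunes candidates, so a final contains() check is still needed:

import java.util.ArrayList;
import java.util.List;

public class InfixCandidateFilter {
    // Pre-computed letter counts, one int[26] per indexed entry.
    private final List<String> entries = new ArrayList<>();
    private final List<int[]> letterCounts = new ArrayList<>();

    public void add(String entry) {
        entries.add(entry);
        letterCounts.add(countLetters(entry));
    }

    // Entries whose letter counts cover the query's counts are the only ones that can contain it.
    public List<String> search(String query) {
        int[] needed = countLetters(query);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            if (covers(letterCounts.get(i), needed) && entries.get(i).contains(query)) {
                result.add(entries.get(i));
            }
        }
        return result;
    }

    private static int[] countLetters(String s) {
        int[] counts = new int[26];
        for (char c : s.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') counts[c - 'a']++;
        }
        return counts;
    }

    private static boolean covers(int[] have, int[] need) {
        for (int i = 0; i < 26; i++) {
            if (have[i] < need[i]) return false;
        }
        return true;
    }
}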

Storing a dictionary in a hashtable

I have an assignment that I am working on, and I can't get a hold of the professor to get clarity on something. The idea is that we are writing an anagram solver, using a given set of words, that we store in 3 different dictionary classes: Linear, Binary, and Hash.
So we read in the words from a text file, and for the first 2 dictionary objects (linear and binary), we store the words as an ArrayList...easy enough.
But for the HashDictionary, he wants us to store the words in a HashTable. I'm just not sure what the values are going to be for the HashTable, or why you would do that. The instructions say we store the words in a Hashtable for quick retrieval, but I just don't get what the point of that is. It makes sense to store words in an ArrayList, but I'm just not sure how key/value pairing helps with a dictionary.
Maybe I'm not giving enough details, but I figured someone would have seen something like this and it would be obvious to them.
Each of our classes has a contains method that returns a boolean representing whether or not a word passed in is in the dictionary; so the linear does a linear search of the ArrayList, the binary does a binary search of the ArrayList, and I'm not sure about the hash....
The difference is speed. Both methods work, but the hash table is fast.
When you use an ArrayList, or any sort of List, to find an element, you must inspect each list item, one by one, until you find the desired word. If the word isn't there, you've looped through the entire list.
When you use a HashTable, you perform some "magic" on the word you are looking up known as calculating the word's hash. Using that hash value, instead of looping through a list of values, you can immediately deduce where to find your word - or, if your word doesn't exist in the hash, that your word isn't there.
I've oversimplified here, but that's the general idea. You can find another question here with a variety of explanations on how a hash table works.
Here is a small code snippet utilizing a HashMap.
// We will map our words to their definitions; word is the key, definition is the value
Map<String, String> dictionary = new HashMap<String, String>();
map.put("hello","A common salutation");
map.put("chicken","A delightful vessel for protein");
// Later ...
map.get("chicken"); // Returns "A delightful vessel for protein";
The problem you describe asks that you use a HashMap as the basis for a dictionary that fulfills three requirements:
Adding a word to the dictionary
Removing a word from the dictionary
Checking if a word is in the dictionary
It seems counter-intuitive to use a map, which stores a key and a value, since all you really want to do is store just a key (or just a value). However, as I described above, a HashMap makes it extremely quick to find the value associated with a key. Similarly, it makes it extremely quick to see if the HashMap knows about a key at all. We can leverage this quality by storing each of the dictionary words as a key in the HashMap, and associating it with a garbage value (since we don't care about it), such as null.
You can see how to fulfill the three requirements, as follows.
Map<String, Object> map = new HashMap<String, Object>();
// Add a word
map.put("word", null);
// Remove a word
map.remove("word");
// Check for the presence of a word
map.containsKey("word");
I don't want to overload you with information, but the requirements we have here align with a data structure known as a Set. In Java, a commonly used Set is the HashSet, which is almost exactly what you are implementing with this bit of your homework assignment. (In fact, if this weren't a homework assignment explicitly instructing you to use a HashMap, I'd recommend you instead use a HashSet.)
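For comparison, the same three operations with a HashSet would look like this (a hypothetical snippet, not part of the assignment):
Set<String> dictionary = new HashSet<String>();
dictionary.add("word");       // Add a word
dictionary.remove("word");    // Remove a word
dictionary.contains("word");  // Check for the presence of a word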
Arrays are hard to find stuff in. If I gave you array[0] = "cat"; array[1] = "dog"; array[2] = "pikachu";, you'd have to check each element just to know whether jigglypuff is a word. If I gave you hash["cat"] = 1; hash["dog"] = 1; hash["pikachu"] = 1;, the check is instant: you just look the word up directly. The value 1 doesn't matter in this particular case, although you can put useful information there, such as how many times you've looked up a word; or maybe 1 will mean a real word and 2 will mean the name of a Pokemon; or, for a real dictionary, it could hold a sentence-long definition. That part is less relevant.
It sounds like you don't really understand hash tables then. Even Wikipedia has a good explanation of this data structure.
Your hash table is just going to be a large array of strings (initially all empty). You compute a hash value using the characters in your word, and then insert the word at that position in the table.
There are issues when the hash value for two words is the same. And there are a few solutions. One is to store a list at each array position and just shove the word onto that list. Another is to step through the table by a known amount until you find a free position. Another is to compute a secondary hash using a different algorithm.
The point of this is that hash lookup is fast. It's very quick to compute a hash value, and then all you have to do is check that the word at that array position exists (and matches the search word). You follow the same rules for hash value collisions (in this case, mismatches) that you used for the insertion.
You want your table size to be a prime number that is larger than the number of elements you intend to store. You also need a hash function that diverges quickly so that your data is more likely to be dispersed widely through your hash table (rather than being clustered heavily in one region).
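A small sketch of the chaining variant (using String.hashCode() rather than a hand-rolled hash function, and a fixed table size passed in by the caller):

import java.util.LinkedList;

public class StringHashTable {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    public StringHashTable(int size) {            // ideally a prime larger than the expected word count
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int index(String word) {
        // Math.floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(word.hashCode(), buckets.length);
    }

    public void insert(String word) {
        LinkedList<String> bucket = buckets[index(word)];
        if (!bucket.contains(word)) {             // collisions share a bucket (separate chaining)
            bucket.add(word);
        }
    }

    public boolean contains(String word) {
        return buckets[index(word)].contains(word);
    }
}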
Hope this is a help and points you in the right direction.

Term Document Frequencies in Java with Lucene/Lingpipe in a large corpus

I am trying to analyze a large corpus of documents, which are in a huge file (3.5GB, 300K lines, 300K documents), one document per line. In this process I am using Lucene for indexing and Lingpipe for preprocessing.
The problem is that I want to get rid of very rare words in the documents. For example, if a word occurs less than MinDF times in the corpus (the huge file), I want to remove it.
I can try to do it with Lucene: compute the document frequencies for all distinct terms, sort them in ascending order, get the terms that have a DF lower than MinDF, go over the huge file again, and remove these terms line by line.
This process will be insanely slow. Does anybody know of any quicker way to do this using Java?
Regards
First create a temp index, then use the information in it to produce the final index. Use IndexReader.terms(), iterate over that, and you have TermEnum.docFreq for each term. Accumulate all the low-frequency terms and then feed that list into an analyzer that extends StopWordAnalyzerBase when you are creating the final index.
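A rough sketch of the term-collection step against the pre-4.0 Lucene API this answer refers to (IndexReader.terms()/TermEnum were removed in Lucene 4, so adjust to your version; the index path and threshold are placeholders):

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class LowFreqTerms {
    // Collect all terms in the temp index whose document frequency is below minDf.
    public static Set<String> collect(String tempIndexPath, int minDf) throws IOException {
        Set<String> lowFreq = new HashSet<String>();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(tempIndexPath)));
        TermEnum terms = reader.terms();
        while (terms.next()) {
            if (terms.docFreq() < minDf) {
                lowFreq.add(terms.term().text());
            }
        }
        terms.close();
        reader.close();
        return lowFreq;
    }
}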
