I have a set of documents. I want to know the frequency count of each word in each document (i.e. the term frequency) using a Java program. Thanks in advance. I already know how to find the frequency count for each word; my question is how to get the unique words in each document from the list of documents.
You can split your documents on spaces and punctuation, go through the resulting array, and count the frequency of each word (a Map<String, Integer> would really help you with this).
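A minimal sketch of that approach (the class and method names are mine, not from any library):

```java
import java.util.HashMap;
import java.util.Map;

public class TermFrequency {
    // Split a document on runs of non-alphanumeric characters,
    // then tally each token in a Map<String, Integer>.
    public static Map<String, Integer> countWords(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // insert 1 or add 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("To be, or not to be."));
    }
}
```

Running `countWords` once per document gives you the per-document term frequencies; the unique words of a document are simply the key set of its map.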
Resources :
Java - faster data structure to count word frequency?
On the same topic :
How to count words in java
If it's more than a one time problem to solve, you should consider using Lucene to index your documents. Then this post would help you answer your question.
It's my first time programming ever, and we have an assignment on finding both word frequencies and word-pair frequencies in a text file.
I've followed several tutorials online and implemented a rather fast word-count solution, but I have no clue how to implement a method to get all the word pairs in the text file and sum up the frequencies of duplicate word pairs before adding them to a map (HashMap).
I tried asking my instructor, but he is adamant that we figure it out ourselves. Please just point me in the right direction to a paper / article / tutorial (anything) I can read in order to solve this.
Thanks in advance.
Ideally this would be done using a hash map with String keys and Integer values. You can check whether a pair is already in the hash map before adding it as a new key with frequency 1, and increment it otherwise.
Here is an example of a previously answered question using this method.
Counting frequency of words from a .txt file in java
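The same check-then-insert idea extends to word pairs by joining each adjacent pair into a single key. A sketch under that assumption (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class PairFrequency {
    // Count adjacent word pairs: join each pair into one key, then
    // either insert it with count 1 or increment the existing count.
    public static Map<String, Integer> countPairs(String[] words) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            String pair = words[i] + " " + words[i + 1];
            if (pairCounts.containsKey(pair)) {
                pairCounts.put(pair, pairCounts.get(pair) + 1); // seen before
            } else {
                pairCounts.put(pair, 1); // first occurrence
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        String[] words = "the cat and the cat".split(" ");
        System.out.println(countPairs(words));
    }
}
```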
This question already has answers here:
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
(16 answers)
Closed 10 years ago.
I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word and its frequency count into a new file. The result should be sorted in descending order of frequency count.
My Approach:
Sort the given file - external sort
Count the frequency of each word sequentially, store the count in another file (along with the word)
Sort the output file based on frequency count - external sort.
I want to know if there are better approaches to do it. I have heard of disk based hash tables? or B+ trees, but never tried them before.
Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.
Edit: Based on the comments, agreed that a dictionary should in practice fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.
I would go with a map reduce approach:
Distribute your text file on nodes, assuming each text in a node can fit into RAM.
Calculate each word's frequency within the node (using hash tables).
Collect each result in a master node and combine them all.
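The combine step in the last bullet amounts to merging the per-node hash tables into one. A sketch of that merge (the node maps here are illustrative stand-ins for results shipped to the master):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombineCounts {
    // Merge the partial word counts computed on each node into one total.
    public static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            // Sum counts for words that appear on more than one node.
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> node1 = Map.of("apple", 3L, "pear", 1L);
        Map<String, Long> node2 = Map.of("apple", 2L);
        System.out.println(combine(List.of(node1, node2)));
    }
}
```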
All unique words probably fit in memory so I'd use this approach:
Create a dictionary (HashMap<String, Integer>).
Read the huge text file line by line.
Add new words into the dictionary and set value to 1.
Add 1 to the value of existing words.
After you've parsed the entire huge file:
Sort the dictionary by frequency.
Write, to a new file, the sorted dictionary with words and frequency.
Remember, though, to normalize the words to either lowercase or uppercase.
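The steps above can be sketched as follows (the file path is a placeholder; streaming the file keeps only one line in memory at a time):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public class FrequencySorter {
    // Count words line by line, then sort entries by descending frequency.
    public static List<Map.Entry<String, Integer>> sortedCounts(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // new word -> 1, else +1
                }
            }
        });
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Path.of("huge.txt"))) {
            for (Map.Entry<String, Integer> e : sortedCounts(lines)) {
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```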
The best way to achieve it would be to read the file line by line and store the words in a Multimap (e.g. Guava's). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley DB JE, or MapDB). These key-value stores work similarly to a map, but they store their values on the HDD. I used MapDB for a similar problem and it was blazing fast.
If the list of unique words and their frequencies fits in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).
You can then sort the entries of the hash table by the number of occurrences.
Imagine you have a huge cache of data that is to be searched in 4 ways:
exact match
prefix%
%suffix
%infix%
I'm using Trie for the first 3 types of searching, but I can't figure out how to approach the fourth one other than sequential processing of huge array of elements.
If your dataset is huge, consider using a search platform like Apache Solr so that you don't end up in a performance mess.
You can construct a navigable map or set (e.g. TreeMap or TreeSet) for type 2 (with keys in normal order) and type 3 (with keys reversed).
For option 4 you can construct a collection with a key for every starting letter. You can simplify this depending on your requirement. This can lead to more space being used but get O(log n) lookup times.
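A sketch of the navigable-set idea for types 2 and 3 (the class is mine; suffix matches come back reversed, since they are looked up against the reversed keys):

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixSuffixIndex {
    private final TreeSet<String> keys = new TreeSet<>();
    private final TreeSet<String> reversedKeys = new TreeSet<>();

    public void add(String key) {
        keys.add(key);
        reversedKeys.add(new StringBuilder(key).reverse().toString());
    }

    // prefix% : every key in the range [prefix, prefix + '\uffff')
    public SortedSet<String> byPrefix(String prefix) {
        return keys.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // %suffix : a prefix search over the reversed keys.
    public SortedSet<String> bySuffix(String suffix) {
        String rev = new StringBuilder(suffix).reverse().toString();
        return reversedKeys.subSet(rev, rev + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        PrefixSuffixIndex index = new PrefixSuffixIndex();
        index.add("counting");
        index.add("country");
        index.add("string");
        System.out.println(index.byPrefix("coun")); // counting, country
        System.out.println(index.bySuffix("ing"));  // reversed matches
    }
}
```

Both lookups are O(log n) range views over the tree, which is what makes the space-for-time trade-off in the answer above attractive.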
For #4, I am thinking that if you pre-compute the number of occurrences of each character, then you can look up in that table the entries that have at least as many occurrences of each character as the search string.
How efficient this algorithm is will probably depend on the nature of the data and of the search string. It might be useful to give some examples of both here to get better answers.
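A sketch of that character-count prefilter (names are mine; note the survivors still need a real contains() check, since having the right characters does not guarantee they are contiguous):

```java
import java.util.ArrayList;
import java.util.List;

public class InfixPrefilter {
    // Count occurrences of each character 'a'..'z' in a string.
    static int[] charCounts(String s) {
        int[] counts = new int[26];
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    // Keep only entries with at least as many of each character as the
    // search string, then confirm the actual substring match.
    public static List<String> search(List<String> entries, String needle) {
        int[] need = charCounts(needle);
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            int[] have = charCounts(entry);
            boolean possible = true;
            for (int i = 0; i < 26; i++) {
                if (have[i] < need[i]) { possible = false; break; }
            }
            if (possible && entry.contains(needle)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> entries = List.of("database", "cascade", "abacus");
        System.out.println(search(entries, "cad")); // [cascade]
    }
}
```

In practice the per-entry count vectors would be pre-computed once rather than rebuilt on every search, as the answer above suggests.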
I am trying to analyze a large corpus of documents, which are in a huge file (3.5GB, 300K lines, 300K documents), one document per line. In this process I am using Lucene for indexing and Lingpipe for preprocessing.
The problem is that I want to get rid of very rare words in the documents. For example, if a word occurs less than MinDF times in the corpus (the huge file), I want to remove it.
I can try to do it with Lucene: compute the document frequencies for all distinct terms, sort them in ascending order, get the terms that have a DF lower than MinDF, go over the huge file again, and remove these terms line by line.
This process will be insanely slow. Does anybody know of any quicker way to do this using Java?
Regards
First create a temporary index, then use the information in it to produce the final index. Use IndexReader.terms(), iterate over it, and you have TermEnum.docFreq() for each term. Accumulate all the low-frequency terms and then feed that list into an analyzer that extends StopwordAnalyzerBase when creating the final index.
From what I understand, the demo IndexFiles example in the Lucene contributions directory will create an inverted index from document terms to the corresponding document pathnames.
I was wondering if there was a way to add the term frequency in each document to the index as well.
In other words (if I understand this right), the original mapping:
term -> list of (document pathnames)
would become:
term -> list of (document pathname, term frequency in that document)
Is there a way to achieve this? Currently, I am counting the term frequency on the fly by opening each document pathname in Java and then counting the terms. There is significant overhead, since there are potentially hundreds of documents to open and process.
Lucene generally does store the term frequencies, and can also store the term offsets and positions. The frequency info is stored in a file with the extension "frq," so if you have that in your index folder, you are storing term frequencies.
You didn't say why you care, or what you want to do with the frequencies. Usually Lucene uses them to compute relevance scores for your queries. If you want the raw frequencies, this other question discusses how to retrieve them: Get term frequencies in Lucene