Add term frequency to Lucene index - Java

From what I understand, the demo IndexFiles example in the Lucene contributions directory will create an inverted index from document terms to the corresponding document pathnames.
I was wondering if there was a way to add the term frequency in each document to the index as well.
In other words (if I understand this right), the original mapping:
term -> list of (pathnames of documents)
would become:
term -> list of (pathname of document, term frequency in that document)
Is there a way to achieve this? Currently, I am counting the term frequency on the fly by opening each document pathname in Java and counting the terms. This adds huge overhead, since there are potentially hundreds of documents to open and process.

Lucene generally does store the term frequencies, and can also store the term offsets and positions. The frequency info is stored in a file with the extension ".frq", so if you have that in your index folder, you are storing term frequencies.
You didn't say why you care, or what you want to do with the frequencies. Usually Lucene uses them to compute relevance scores for your queries. If you want the raw frequencies, this other question discusses how to retrieve them: Get term frequencies in Lucene

Related

Count word frequency of huge text file [duplicate]

This question already has answers here: Parsing one terabyte of text and efficiently counting the number of occurrences of each word (16 answers). Closed 10 years ago.
I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output the words and their frequency counts into a new file. The result should be sorted in descending order of frequency count.
My Approach:
Sort the given file - external sort
Count the frequency of each word sequentially, store the count in another file (along with the word)
Sort the output file based on frequency count - external sort.
I want to know if there are better approaches. I have heard of disk-based hash tables and B+ trees, but have never tried them before.
Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.
Edit: Based on the comments, I agree that a dictionary should in practice fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.
I would go with a map-reduce approach:
Distribute your text file across nodes, assuming the text on each node fits into RAM.
Calculate each word's frequency within the node (using hash tables).
Collect each result in a master node and combine them all.
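A minimal sketch of the combine step, assuming each node has already produced its own per-node map of counts (the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical combine step: merge per-node word counts into one global map.
public class CountMerger {
    public static Map<String, Long> merge(List<Map<String, Long>> perNodeCounts) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : perNodeCounts) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                // Add this node's count for the word to the running total.
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }
}
```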
All unique words probably fit in memory so I'd use this approach:
Create a dictionary (HashMap<String, Integer>).
Read the huge text file line by line.
Add new words into the dictionary and set value to 1.
Add 1 to the value of existing words.
After you've parsed the entire huge file:
Sort the dictionary by frequency.
Write, to a new file, the sorted dictionary with words and frequency.
Remember, though, to convert the words to either lowercase or uppercase.
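A minimal sketch of this approach, assuming Java 8; the file names and the tokenization rule ("\\W+") are placeholders:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical sketch: stream the huge file line by line, count words in a HashMap,
// then sort the entries by descending frequency and write them out.
public class WordFrequency {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get("huge.txt"), StandardCharsets.UTF_8)) {
            lines.forEach(line -> {
                for (String word : line.toLowerCase().split("\\W+")) {   // normalize to lowercase
                    if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
                }
            });
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> b.getValue().compareTo(a.getValue()));     // descending frequency
        try (PrintWriter out = new PrintWriter("frequencies.txt", "UTF-8")) {
            for (Map.Entry<String, Integer> e : sorted) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

Note that only the map of unique words lives in memory; the input file itself is never held in full.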
The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley JE DB, or MapDB). These key-value stores work like a map, but they store their values on disk. I used MapDB for a similar problem and it was blazing fast.
If the list of unique words and their frequencies fits in memory (not the whole file, just the unique words), you can use a hash table and read the file sequentially (without storing it).
You can then sort the entries of the hash table by the number of occurrences.

Term Document Frequencies in Java with Lucene/Lingpipe in a large corpus

I am trying to analyze a large corpus of documents, which are in a huge file (3.5GB, 300K lines, 300K documents), one document per line. In this process I am using Lucene for indexing and Lingpipe for preprocessing.
The problem is that I want to get rid of very rare words in the documents. For example, if a word occurs less than MinDF times in the corpus (the huge file), I want to remove it.
I can try to do it with Lucene: compute the document frequencies for all distinct terms, sort them in ascending order, get the terms that have a DF lower than MinDF, go over the huge file again, and remove these terms line by line.
This process will be insanely slow. Does anybody know of any quicker way to do this using Java?
Regards
First create a temp index, then use the information in it to produce the final index. Use IndexReader.terms(), iterate over that, and you have TermEnum.docFreq() for each term. Accumulate all the low-frequency terms and then feed that info into an analyzer that extends StopwordAnalyzerBase when you are creating the final index.
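For illustration, a sketch of the first pass against the pre-4.0 Lucene API; instead of extending StopwordAnalyzerBase, it just collects the low-DF terms into a plain set, which can then be handed to whatever analyzer builds the final index. The field name "contents", the path, and the MIN_DF value are assumptions:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch: scan the temp index and collect every term whose
// document frequency is below MIN_DF, for use as a stop set later.
public class LowFreqTerms {
    private static final int MIN_DF = 5;   // assumed threshold

    public static Set<String> collect(String tempIndexPath) throws Exception {
        Set<String> lowFreq = new HashSet<String>();
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(tempIndexPath)));
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            if ("contents".equals(t.field()) && terms.docFreq() < MIN_DF) {
                lowFreq.add(t.text());
            }
        }
        terms.close();
        reader.close();
        return lowFreq;   // pass this as the stop set of the analyzer used for the final index
    }
}
```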

Advice on how to improve a current fuzzy search implementation

I'm currently working on implementing a fuzzy search for a terminology web service and I'm looking for suggestions on how I might improve the current implementation. It's too much code to share, but I think an explanation might suffice to prompt thoughtful suggestions. I realize it's a lot to read but I'd appreciate any help.
First, the terminology is basically just a number of names (or terms). For each term, we split it into tokens on spaces and then iterate through each character to add it to the trie. On a terminal node (such as when the character y in strawberry is reached) we store, in a list, an index into the master term list. So a terminal node can have multiple indices (since the terminal node for strawberry will match 'strawberry' and 'allergy to strawberry').
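A rough sketch of the structure described, not the actual code (all names are hypothetical): each token of a term is inserted character by character, and the terminal node records the index of every master-list term containing that token.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the trie described above.
class TermTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        List<Integer> termIndices = new ArrayList<>();   // filled only on terminal nodes
    }

    final Node root = new Node();

    // Insert one term from the master list, e.g. insert("allergy to strawberry", 42).
    void insert(String term, int masterIndex) {
        for (String token : term.toLowerCase().split("\\s+")) {
            Node node = root;
            for (char c : token.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.termIndices.add(masterIndex);           // terminal node for this token
        }
    }
}
```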
As for the actual search, the search query is also broken up into tokens by space. The search algorithm is run for each token. The first character of the search token must be a match (so 'traw' will never match 'strawberry'). After that, we go through the children of each successive node. If there is a child with a character that matches, we continue the search with the next character of the search token. If no child matches the given character, we look at the children using the current character of the search token (so not advancing it). This is the fuzziness part, so 'stwb' will match 'strawberry'.
When we reach the end of the search token, we search through the rest of the trie structure at that node to get all potential matches (since the indices into the master term list are only on the terminal nodes). We call this the roll up. We store the indices by setting their bits in a BitSet. Then, we simply AND the BitSets from the results of each search token. We then take, say, the first 1000 or 5000 indices from the ANDed BitSets and find the actual terms they correspond to. We use Levenshtein to score each term and then sort by score to get our final results.
This works fairly well and is pretty fast. There are over 390k nodes in the tree and over 1.1 million actual term names. However, there are problems with this as it stands.
For example, searching for 'car cat' will return Catheterization, when we don't want it to (since the search query is two words, the result should be at least two words). That would be easy enough to check, but it doesn't take care of a situation like Catheterization Procedure, since that is two words. Ideally, we'd want it to match something like Cardiac Catheterization.
Based on the need to correct this, we came up with some changes. For one, we go through the trie in a mixed depth/breadth search. Essentially we go depth first as long as a character matches. Those child nodes that didn't match get added to a priority queue. The priority queue is ordered by edit distance, which can be calculated while searching the trie (since if there's a character match, distance remains the same and if not, it increases by 1). By doing this, we get the edit distance for each word.
We are no longer using the BitSet. Instead, it's a map from the index to a TermInfo object. This object stores the index of the query phrase and the term phrase and the score. So if the search is "car cat" and a matched term is "Catheterization procedure", the term phrase indices will be 1, as will the query phrase indices. For "Cardiac Catheterization" the term phrase indices will be 1,2, as will the query phrase indices. As you can see, it's very simple afterward to look at the count of term phrase indices and query phrase indices, and if they aren't at least equal to the search word count, they can be discarded.
After that, we add up the edit distances of the words, remove the words from the term that match the term phrase index, and count the remaining letters to get the true edit distance. For example, if you matched the term "allergy to strawberries" and your search query was "straw" you would have a score of 7 from strawberries, then you'd use the term phrase index to discard strawberries from the term, and just count "allergy to" (minus the spaces) to get the score of 16.
This gets us the accurate results we expect. However, it is far too slow. Where before we could get 25-40 ms on a one-word search, now it could be as much as half a second. It's largely from things like instantiating TermInfo objects, using .add() operations, .put() operations and the fact that we have to return a large number of matches. We could limit each search to only return 1000 matches, but there's no guarantee that the first 1000 results for "car" would match any of the first 1000 matches for "cat" (remember, there are over 1.1 million terms).
Even for a single query word, like cat, we still need a large number of matches. This is because if we search for 'cat' the search is going to match car and roll up all the terminal nodes below it (which will be a lot). However, if we limited the number of results, it would place too heavy an emphasis on words that begin with the query and not the edit distance. Thus, words like catheterization would be more likely to be included than something like coat.
So, basically, are there any thoughts on how we could handle the problems that the second implementation fixed, but without as much of the slowdown that it introduced? I can include some selected code if it might make things clearer, but I didn't want to post a giant wall of code.
Wow... tough one.
Well, why don't you use Lucene? As far as I know, it is the best and current state of the art when it comes to problems like yours.
However, I want to share some thoughts...
Fuzziness isn't something like straw*; it's rather the mistyping of some words, and every missing/wrong character adds 1 to the distance.
It's generally very, very hard to have partial matching (wildcards) and fuzziness at the same time!
Tokenizing is generally a good idea.
Everything also heavily depends on the data you get. Are there spelling mistakes in the source files or only in the search queries?
I have seen some pretty nice implementations using multi-dimensional range trees.
But I really think that if you want to accomplish all of the above, you need a pretty neat combination of a graph structure and a nice indexing algorithm.
You could, for example, use a semantic database like Sesame and, when importing your documents, import every token and document as a node. Then, depending on position in the document etc., you can add a weighted relation.
Then you need the tokens in some structure where you can do efficient fuzzy matches, such as BK-trees.
I think you could index the tokens in a MySQL database and do bit-wise comparison functions to get differences. There's a function that returns all matching bits; if you transliterate your strings to ASCII and group the bits, you could achieve something pretty fast.
However, if you matched the tokens to the string, you can construct a hypothetical perfect-match entity and query your semantic database for the nearest neighbours.
You would have to break the words apart into partial words when tokenizing to achieve partial matches.
However, you can also do wildcard matches (prefix, suffix, or both), but no fuzziness then.
You can also index the whole word or different concatenations of tokens.
However, there may be special BK-tree implementations that support this, but I have never seen one.
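Since BK-trees come up here, a self-contained sketch of one over Levenshtein distance (all names are hypothetical and not tied to any particular library):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over Levenshtein distance (hypothetical, standalone sketch).
public class BKTree {
    private Node root;

    private static class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return;                                  // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // Collect all stored words within maxDist edits of the query.
    public List<String> search(String query, int maxDist) {
        List<String> results = new ArrayList<>();
        if (root != null) search(root, query, maxDist, results);
        return results;
    }

    private void search(Node node, String query, int maxDist, List<String> results) {
        int d = levenshtein(query, node.word);
        if (d <= maxDist) results.add(node.word);
        // Triangle inequality: only subtrees whose edge distance lies in
        // [d - maxDist, d + maxDist] can contain matches.
        for (int i = Math.max(0, d - maxDist); i <= d + maxDist; i++) {
            Node child = node.children.get(i);
            if (child != null) search(child, query, maxDist, results);
        }
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```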
I did a number of iterations of a spelling corrector ages ago, and here's a recent description of the basic method. Basically the dictionary of correct words is in a trie, and the search is a simple branch-and-bound. I used a repeated depth-first trie walk, bounded by Levenshtein distance, because each additional increment of distance results in much more of the trie being walked; the cost, for small distance, is basically exponential in the distance, so going to a combined depth/breadth search doesn't save much but makes it a lot more complicated.
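This is not the original code, only a sketch of one way such a bounded walk can look, assuming a plain dictionary trie (all names are hypothetical): each trie edge extends a Levenshtein DP row, and a branch is abandoned as soon as the row minimum exceeds the allowed distance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: depth-first walk of a dictionary trie, branch-and-bound
// on Levenshtein distance.
class FuzzyTrie {
    private static class Node {
        Map<Character, Node> children = new TreeMap<>();
        String word;                                     // non-null on terminal nodes
    }

    private final Node root = new Node();

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.word = word;
    }

    // Return all dictionary words within maxDist edits of the query.
    List<String> search(String query, int maxDist) {
        List<String> results = new ArrayList<>();
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i <= query.length(); i++) firstRow[i] = i;
        for (Map.Entry<Character, Node> e : root.children.entrySet())
            walk(e.getValue(), e.getKey(), query, firstRow, maxDist, results);
        return results;
    }

    private void walk(Node node, char c, String query, int[] prevRow,
                      int maxDist, List<String> results) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (node.word != null && row[cols - 1] <= maxDist)
            results.add(node.word);
        if (rowMin <= maxDist)            // bound: descend only if a match is still possible
            for (Map.Entry<Character, Node> e : node.children.entrySet())
                walk(e.getValue(), e.getKey(), query, row, maxDist, results);
    }
}
```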
(Aside: You'd be amazed how many ways physicians can try to spell "acetylsalicylic acid".)
I'm surprised at the size of your trie. A basic dictionary of acceptable words is maybe a few thousand. Then there are common prefixes and suffixes. Since the structure is a trie, you can connect together sub-tries and save a lot of space. Like the trie of basic prefixes can connect to the main dictionary, and then the terminal nodes of the main dictionary can connect to the trie of common suffixes (which can in fact contain cycles). In other words, the trie can be generalized into a finite state machine. That gives you a lot of flexibility.
REGARDLESS of all that, you have a performance problem. The nice thing about performance problems is, the worse they are, the easier they are to find. I've been a real pest on StackOverflow pointing this out. This link explains how to do it, links to a detailed example, and tries to dispel some popular myths. In a nutshell, the more time it is spending doing something that you could optimize, the more likely you will catch it doing that if you just pause it and take a look. My suspicion is that a lot of time is going into operations on overblown data structure, rather than just getting to the answer. That's a common situation, but don't fix anything until samples point you directly at the problem.

term frequency using java program

I have a set of documents. I want to know the frequency count of each word in each document (i.e., the term frequency) using a Java program. Thanks in advance. I know how to find the frequency count for each word. My question is about how to take the unique words in each document from the list of documents.
You can split your documents on spaces and punctuation, go through the resulting array and then count frequency for each word (a Map<String, Integer> would really help you with this).
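For example, a minimal sketch of that idea (the split pattern is just one possible choice); the unique words of a document are simply the key set of the resulting map:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: term frequency of one document held as a String.
public class DocumentTermFrequency {
    public static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> tf = new HashMap<>();
        // Split on whitespace and punctuation; the map's key set is the set of unique words.
        for (String word : document.toLowerCase().split("[\\s\\p{Punct}]+")) {
            if (!word.isEmpty()) tf.merge(word, 1, Integer::sum);
        }
        return tf;
    }
}
```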
Resources:
Java - faster data structure to count word frequency?
On the same topic:
How to count words in java
If it's more than a one time problem to solve, you should consider using Lucene to index your documents. Then this post would help you answer your question.

Get term frequencies in Lucene

Is there a fast and easy way of getting term frequencies from a Lucene index, without doing it through the TermVectorFrequencies class, since that takes an awful lot of time for large collections?
What I mean is, is there something like TermEnum which has not just the document frequency but term frequency as well?
UPDATE:
Using TermDocs is way too slow.
Use TermDocs to get the term frequency for a given document. As with the document frequency, you get the TermDocs from an IndexReader, using the term of interest.
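For example, a sketch against the pre-4.0 Lucene API; the field name "content", the term, and the index path are assumptions:

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch: per-document frequency of one term, read via TermDocs.
public class TermFreqDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        TermDocs termDocs = reader.termDocs(new Term("content", "strawberry"));
        while (termDocs.next()) {
            // doc() is the internal document id, freq() the term frequency in that document
            System.out.println("doc=" + termDocs.doc() + " tf=" + termDocs.freq());
        }
        termDocs.close();
        reader.close();
    }
}
```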
You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.
If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order (skips are alright, but you can't jump back and forth in the document list efficiently).
Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to give to the OS for its own file caching.
The trunk version of Lucene (to be 4.0, eventually) now exposes the totalTermFreq() for each term from the TermsEnum. This is the total number of times this term appeared in all content (but, like docFreq, does not take into account deletions).
TermDocs gives the TF of a given term in each document that contains the term. You can get the DF by iterating through each <document, frequency> pair and counting the number of pairs, although TermEnum should be faster. IndexReader has a termDocs(Term) method that returns a TermDocs for the given Term and index.
