string matching algorithms used by lucene

I want to know the string matching algorithms used by Apache Lucene. I have been going through the index file format used by Lucene given here. It seems that Lucene stores all words occurring in the text as-is, along with their frequency of occurrence in each document.
But as far as I know, efficient string matching would require preprocessing the words occurring in the documents.
Example:
Search for "iamrohitbanga is a user of stackoverflow" (using fuzzy matching) in some documents.
It is possible that there is a document containing the string "rohit banga".
To find that the substrings "rohit" and "banga" are present in the search string, it would need some efficient substring matching.
I want to know which algorithm that is. Also, if it does some preprocessing, which function call in the Java API triggers it?

As Yuval explained, in general Lucene is geared at exact matches (by normalizing terms with analyzers at both index and query time).
In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.
The way this works is that a Lucene term dictionary itself is really a form of a suffix tree. You can see this in the file formats that you mentioned in a few places:
Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
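The "bone"/"boy" example quoted above can be reproduced in a few lines of Java. This is only a sketch of the prefix-compression idea, not Lucene's actual on-disk encoding; the class, the Entry type and the method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixCompressionSketch {
    /** Hypothetical entry: length of the prefix shared with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }

        @Override
        public String toString() {
            return "(" + prefixLength + ", \"" + suffix + "\")";
        }
    }

    /** Encodes a sorted term list as (shared prefix length, suffix) pairs. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(previous.length(), term.length());
            while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return out;
    }

    /** Rebuilds the full terms by reusing the prefix of the previously decoded term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (Entry e : entries) {
            String term = previous.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            previous = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> encoded = encode(List.of("bone", "boy", "boys"));
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

Because each entry depends on the previous one, decoding has to start from a known full term, which is why an index at regular intervals is needed for random access.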
So at the low level it is a suffix tree, but at the higher level we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite state automaton (DFA):
Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.
Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.

The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens. During this process, it may collate equivalent strings (e.g. upper- and lower-case variants, stemmed forms, forms with and without diacritics).
The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP.
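To make the "dictionary plus posting list" idea concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the analyze method is a stand-in for an Analyzer (it only lower-cases and splits on non-alphanumeric characters), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // Term dictionary (sorted) mapping each term to the ids of the documents containing it.
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Stand-in for an analyzer: lower-case, then split on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Looks up a single term; the query goes through the same analysis as the documents. */
    SortedSet<Integer> search(String queryTerm) {
        List<String> tokens = analyze(queryTerm);
        if (tokens.isEmpty()) {
            return Collections.emptySortedSet();
        }
        return postings.getOrDefault(tokens.get(0), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Rohit Banga is a user of StackOverflow");
        index.add(2, "Another user");
        System.out.println(index.search("ROHIT")); // normalized to "rohit" before lookup
        System.out.println(index.search("roh"));   // no partial matching: lookup is exact
    }
}
```

The lookup is a plain map access on the normalized term, which is why no substring-matching algorithm is involved.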
However, FuzzyQuery and WildcardQuery use something similar, first searching for matching terms and then using them for the full match. Please see Robert Muir's blog post about AutomatonQuery for a new, efficient approach to this problem.

As you pointed out, Lucene stores only the list of terms that occurred in the documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply breaks the text into words separated by spaces. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.
Please look at the Lucene API docs for the TokenFilter class.
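As a rough illustration of how such tokens could be produced, the sketch below splits an unspaced string with a greedy longest-match against a known vocabulary. It is plain Java, not a Lucene TokenFilter, and the vocabulary and all names are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class VocabularySplitter {
    /**
     * Emits the original string followed by the vocabulary words found in it,
     * scanning left to right and always taking the longest known word.
     */
    static List<String> tokenize(String input, Set<String> vocabulary) {
        List<String> tokens = new ArrayList<>();
        tokens.add(input);
        int pos = 0;
        while (pos < input.length()) {
            int end = -1;
            for (int e = input.length(); e > pos; e--) {
                if (vocabulary.contains(input.substring(pos, e))) {
                    end = e;
                    break;
                }
            }
            if (end < 0) {
                pos++; // this character starts no known word, skip it
                continue;
            }
            tokens.add(input.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("i", "am", "rohit", "banga");
        System.out.println(tokenize("iamrohitbanga", vocabulary));
    }
}
```

In a real analyzer the same logic would live inside a token filter, and the vocabulary would have to come from somewhere (a word list, names already seen in the corpus, and so on).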

Related

Lucene: Mining email addresses, names, and identifiers from an index

I have a lucene index with approx. 1 million documents. From these documents, I want to mine
email addresses
signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
specific identifiers from each of the documents (that follow a regex pattern "\s[0-9]{3}[a-zA-Z0-9]{6}\s").
I understand that ideally, using Solr, this is much easier to do at index build time, but how can one do this from an already built Lucene index?
I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for # to find all email addresses, but I got zero results. If I search for # in Luke I also get zero results. If I search for #hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa#hotmail.com.
The index was created using StandardAnalyzer. Not sure if it matters, but the text is in UTF-8 I believe.
Any helpful suggestions or pointers would be great! Note this is not for a front end, so the query doesn't have to be near real-time.

Analysis does matter, yes. The standard analyzer will treat whitespace and punctuation, such as #, as a place to split input into tokens. As such, you wouldn't expect to see any of them actually present in the indexed data.
You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
Generally, you might want to use a different analyzer for an email field, in order to index each address as a single token. With the current index you should get reasonable results searching for a particular e-mail address since, though the analyzer removes the punctuation, searching for the (usually) three tokens of an email consecutively in a phrase would be expected to get good matches. However, a regex search like \w*#\w*\.\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of e-mail domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search in order to do what you are asking.
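Since this is an offline job, another option is to run the pattern over the raw text itself, assuming the original text is still available (for example as a stored field or from the source files). The sketch below uses only java.util.regex; how the text is fetched is not shown, and the class and method names are hypothetical. It is a slight variation on the pattern from the question: lookarounds replace the surrounding \s so that two identifiers separated by a single space are both found, and an identifier at the very start or end of the text also counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierMiner {
    // Three digits followed by six alphanumerics, not touching any other non-whitespace character.
    private static final Pattern ID =
            Pattern.compile("(?<!\\S)[0-9]{3}[a-zA-Z0-9]{6}(?!\\S)");

    /** Returns every identifier found in the given raw text, in order of appearance. */
    static List<String> extractIds(String rawText) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(rawText);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extractIds("ref 123abcDEF and 999zzzzzz end"));
    }
}
```

With a million documents this is a linear scan over all the text, which should be acceptable here since the query does not need to be near real-time.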

Lucene Index problems with "-" character

I'm having trouble with a Lucene index that contains indexed words with "-" characters.
It works for some words that contain "-" but not for all, and I can't find the reason why it's not working.
The field I'm searching in is analyzed and contains versions of the word with and without the "-" character.
I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
Here is an example:
If I search for "gsx-*" I get a result; the indexed field contains
"SUZUKI GSX-R 1000 GSX-R1000 GSXR"
But if I search for "v-*" I get no result. The indexed field of the expected result contains:
"SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"
If I search for "v-strom" without "*" it works, but if I just search for "v-str", for example, I don't get the result. (There should be a result, because this is for a live search in a webshop.)
So, what's the difference between the two expected results? Why does it work for "gsx-" but not for "v-"?

StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.
So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or just starting off with one of the two mentioned and composing it with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string both in the index and while parsing the query.

ClassicAnalyzer handles '-' as a useful, non-delimiter character. As I understand ClassicAnalyzer, it handles '-' like the pre-3.1 StandardAnalyzer because ClassicAnalyzer uses ClassicTokenizer which treats numbers with an embedded '-' as a product code, so the whole thing is tokenized as one term.
When I was at Regenstrief Institute I noticed this after upgrading Luke, as the LOINC standard medical terms (LOINC was initiated by R.I.) are identified by a number followed by a '-' and a checkdigit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (and this was because we built the index with the 2.9.2 Lucene.NET).

(Based on Lucene 4.7) StandardTokenizer splits hyphenated words in two. For example, it splits "chat-room" into "chat" and "room" and indexes the two words separately instead of indexing them as a single whole word. It is quite common for separate words to be connected with a hyphen: “sport-mad,” “camera-ready,” “quick-thinking,” and so on. A significant number are hyphenated names, such as “Emma-Claire.” When doing a whole-word search or query, users expect to find the word including those hyphens. There are, however, cases where the parts really are separate words, which is why Lucene keeps the hyphen out of the default definition.
To support the hyphen in StandardAnalyzer, you have to make changes in StandardTokenizerImpl.java, which is a class generated by JFlex.
Refer to this link for a complete guide.
You have to add the following line to SUPPLEMENTARY.jflex-macro, which is included by the StandardTokenizerImpl.jflex file.
MidLetterSupp = ( [\u002D] )
After making the changes, provide the StandardTokenizerImpl.jflex file as input to the JFlex engine and click on generate. The output will be StandardTokenizerImpl.java.
Then rebuild the index using that class.

The ClassicAnalyzer is recommended for indexing text containing product codes like 'GSX-R1000'. It will recognize this as a single term and does not split it into its parts. But, for example, the text 'Europe/Berlin' will be split up by the ClassicAnalyzer into the words 'Europe' and 'Berlin'. This means if you have a text indexed by the ClassicAnalyzer containing the phrase
Europe/Berlin GSX-R1000
you can search for "europe", "berlin" or "GSX-R1000".
But be careful which analyzer you use for the search. I think the best choice to search a Lucene index is the KeywordAnalyzer. With the KeywordAnalyzer you can also search for specific fields in a document and you can build complex queries like:
(processid:4711) (berlin)
This query will search documents with the phrase 'berlin' but also a field 'processid' containing the number 4711.
But if you search the index for the phrase "europe/berlin" you will get no result! This is because the KeywordAnalyzer does not change your search phrase, but the phrase 'Europe/Berlin' was split up into two separate words by the ClassicAnalyzer. This means you have to search for 'europe' and 'berlin' separately.
To solve this conflict you can translate a search term entered by the user into a search query that fits your needs, using the following code:
QueryParser parser = new QueryParser("content", new ClassicAnalyzer());
Query result = parser.parse(searchTerm);
searchTerm = result.toString("content");
This code will translate the search phrase
Europe/Berlin
into
europe berlin
which will result in the expected document set.
Note: This will also work for more complex situations. The search term
Europe/Berlin GSX-R1000
will be translated into:
(europe berlin) GSX-R1000
which will search correctly for all phrases in combination using the KeywordAnalyzer.

Partial match on a dictionary

I am working with GATE (Java Based NLP Framework) and want to find words with partial match with a dictionary.
For example I have a disease dictionary with following terms
Congestive cardiac failure
Congestive Heart Failure
Colon Cancer
.
.
.
Thousands of more terms
Let's assume I have a string "Father had cardiac failure last year". From this string I want to identify "cardiac failure" as a partial match, because it occurs as part of a term in the dictionary.
I have seen some discussion on similar subjects in Python, JS and C#, but I am not sure what can help in such a case here.
I wonder if I can utilize Aho-Corasick here.

The UIMA Concept Mapper annotator addon includes functionality similar to what you are looking for. You may consider:
using UIMA inside GATE: http://gate.ac.uk/userguide/chap:uima
developing a similar component using the main ideas from the addon

Maybe you should use Lucene. Treat each line of the dictionary as a document, and each sentence in the text as a query.

One question that arises is which substrings you want to include in the search. If you included all substrings just "Heart" would also be a match, but that is not really a disease.
Maybe all right-aligned (word-)substrings (perhaps with length > 1) would be acceptable.
So one thing you could do is to train the Aho-Corasick pattern matcher with the substrings you want to include. To keep the information about which dictionary term a substring came from, you probably need to modify the algorithm a bit (if keeping that information is important) or build another data structure to look it up afterwards.
In any case I would convert the disease list and the documents you want to search to lower case before training/matching. If there is a chance of misspellings, there are also papers on fuzzy Aho-Corasick automata.
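Here is a small sketch of the bookkeeping described above: it generates the right-aligned word substrings (two words or more) of every dictionary term, lower-cases everything, and remembers which term each substring came from. To keep it short it uses a plain hash map lookup over word n-grams of the sentence instead of an actual Aho-Corasick automaton, so it shows the data preparation rather than the efficient matcher. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PartialDictionaryMatcher {
    // Lower-cased right-aligned word substring -> dictionary terms it was taken from.
    private final Map<String, Set<String>> suffixToTerms = new HashMap<>();
    private int maxWords = 0;

    PartialDictionaryMatcher(Collection<String> terms) {
        for (String term : terms) {
            String[] words = term.toLowerCase(Locale.ROOT).trim().split("\\s+");
            maxWords = Math.max(maxWords, words.length);
            // Keep only suffixes that are at least two words long.
            for (int start = 0; start + 2 <= words.length; start++) {
                String suffix = String.join(" ", Arrays.copyOfRange(words, start, words.length));
                suffixToTerms.computeIfAbsent(suffix, k -> new TreeSet<>()).add(term);
            }
        }
    }

    /** Returns each matched substring of the sentence together with its source terms. */
    Map<String, Set<String>> match(String sentence) {
        String[] words = sentence.toLowerCase(Locale.ROOT).trim().split("\\W+");
        Map<String, Set<String>> hits = new LinkedHashMap<>();
        for (int i = 0; i < words.length; i++) {
            int limit = Math.min(words.length, i + maxWords);
            for (int j = i + 2; j <= limit; j++) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, j));
                Set<String> origins = suffixToTerms.get(candidate);
                if (origins != null) {
                    hits.put(candidate, origins);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PartialDictionaryMatcher matcher = new PartialDictionaryMatcher(
                Arrays.asList("Congestive cardiac failure", "Congestive Heart Failure", "Colon Cancer"));
        System.out.println(matcher.match("Father had cardiac failure last year"));
    }
}
```

For thousands of terms and long documents, the same suffix-to-term map could be fed into an Aho-Corasick implementation so the text is scanned only once.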

Fuzzy string search in Java, including word swaps

I am a Java beginner, trying to write a program that will match an input to a list of predefined strings. I have looked at Levenshtein distance, but I have come to problems such as this:
If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.
Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?
Thanks!

You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).
The basic process is this:
Initialization: tokenize your search terms, and store them in a series of HashSets, one per term. Or, if you want to give different weights to each word, use HashMap where the word is the key.
Processing: tokenize each input string, and probe each of the sets of search terms to determine how closely they apply to the input. See above for a description of algorithms.
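A minimal sketch of this idea, simplified to one token set per string and scored by set overlap (Jaccard similarity) so that word order does not matter. The stop-word list and all names are assumptions for the example, not something Lucene prescribes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenOverlapMatcher {
    // Hypothetical stop words that carry no meaning for matching.
    private static final Set<String> STOP_WORDS = Set.of("of", "the", "a", "an");

    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Jaccard similarity of the two token sets: shared tokens divided by all distinct tokens. */
    static double score(String input, String candidate) {
        Set<String> a = tokenize(input);
        Set<String> b = tokenize(candidate);
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) common.size() / union.size();
    }

    /** Returns the candidate with the highest score (the first one wins on ties). */
    static String bestMatch(String input, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double s = score(input, candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestMatch("fillet of beef", List.of("fillet of tuna", "beef fillet")));
    }
}
```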
There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).

Lucene does support fuzzy search based on Levenshtein distance.
https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches
But Lucene is meant for searching a set of documents rather than for plain string matching, so it might be overkill for you. There are other Java implementations available. Take a look at http://www.merriampark.com/ldjava.htm

It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".
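A sketch of that two-level idea: one generic edit-distance routine is used on characters to decide whether two words are "the same", and again on word lists using that fuzzy word comparison. The threshold of one character edit per word is an arbitrary assumption, and the names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

public class WordLevelLevenshtein {
    /** Classic Levenshtein distance over arbitrary sequences with a pluggable equality test. */
    static <T> int editDistance(List<T> a, List<T> b, BiPredicate<T, T> same) {
        int[] prev = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            int[] cur = new int[b.size() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int substitute = prev[j - 1] + (same.test(a.get(i - 1), b.get(j - 1)) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.size()];
    }

    static int charDistance(String x, String y) {
        List<Character> a = x.chars().mapToObj(c -> (char) c).collect(Collectors.toList());
        List<Character> b = y.chars().mapToObj(c -> (char) c).collect(Collectors.toList());
        return editDistance(a, b, Character::equals);
    }

    /** Two words count as equal if they differ by at most one character edit (assumed threshold). */
    static boolean wordsMatch(String x, String y) {
        return charDistance(x, y) <= 1;
    }

    static int wordDistance(String x, String y) {
        List<String> a = Arrays.asList(x.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        List<String> b = Arrays.asList(y.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        return editDistance(a, b, WordLevelLevenshtein::wordsMatch);
    }

    public static void main(String[] args) {
        System.out.println(wordsMatch("filet", "fillet"));
        System.out.println(wordDistance("filet of beef", "fillet of beef"));
        System.out.println(wordDistance("fillet of beef", "fillet of tuna"));
    }
}
```

Note that Levenshtein distance still counts reordered words as edits, so for inputs like "fillet of beef" versus "beef fillet" this would need to be combined with an order-insensitive comparison.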

How to query lucene with "like" operator? [duplicate]

The wildcard * can only be used at the end of a word, like user*.
I want to query with a like %user%, how to do that?

The trouble with LIKE queries is that they are expensive in terms of time taken to execute. You can set up QueryParser to allow leading wildcards with the following:
QueryParser.setAllowLeadingWildcard(true)
And this will allow you to do searches like:
*user*
But this will take a long time to execute. Sometimes when people say they want a LIKE query, what they actually want is a fuzzy query. This would allow you to do the following search:
user~
Which would match the terms users and fuser. You can specify how similar the matched terms must be to the term in your query using a float value between 0 and 1, where a higher value demands a closer match. For example, user~0.8 would match fewer terms than user~0.5.
I suggest you also take a look at regex query, which supports regular expression syntax for Lucene searches. It may be closer to what you really need. Perhaps something like:
.*user.*

Lucene provides the ReverseStringFilter, which makes leading wildcard searches like *user possible. It works by indexing all terms in reverse order.
But I think there is no way to do something similar to 'LIKE %user%'.
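The trick behind reversed indexing can be shown without Lucene: once every term is stored reversed, a leading wildcard such as *user becomes an ordinary prefix search for "resu". The sketch below uses a TreeSet as a stand-in for the sorted term dictionary; the class and method names are hypothetical and this is not how ReverseStringFilter itself is wired into an analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ReversedTermIndex {
    // Sorted set of reversed terms, standing in for the term dictionary.
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    void add(String term) {
        reversedTerms.add(reverse(term));
    }

    /** Answers the query "*suffix" by a prefix scan over the reversed terms. */
    List<String> endingWith(String suffix) {
        String reversedPrefix = reverse(suffix);
        List<String> matches = new ArrayList<>();
        for (String reversed : reversedTerms.tailSet(reversedPrefix, true)) {
            if (!reversed.startsWith(reversedPrefix)) {
                break; // sorted order: no later entry can match
            }
            matches.add(reverse(reversed));
        }
        return matches;
    }

    public static void main(String[] args) {
        ReversedTermIndex index = new ReversedTermIndex();
        for (String term : List.of("user", "users", "fuser", "superuser")) {
            index.add(term);
        }
        System.out.println(index.endingWith("user"));
    }
}
```

This also shows why it does not help with %user%: "users" reversed is "sresu", which does not start with "resu".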

Since Lucene 2.1 you can use
QueryParser.setAllowLeadingWildcard(true);
but this can kill performance. The LuceneFAQ has some more info for this.

When you think about it, it is not entirely surprising that Lucene's support for wildcarding is (normally) restricted to a wildcard at the end of a word pattern.
Keyword search engines work by creating a reverse (inverted) index of all words in the corpus, which is sorted in word order. When you do a normal non-wildcard search, the engine makes use of the fact that index entries are sorted to locate the entry or entries for your word in O(logN) steps, where N is the number of words or entries. For a word pattern with a suffix wildcard, the same thing happens to find the first matching word, and other matches are found by scanning the entries until the fixed part of the pattern no longer matches.
However, for a word pattern with a wildcard prefix and a wildcard suffix, the engine would have to look at all entries in the index. This would be O(N) ... unless the engine built a whole stack of secondary indexes for matching literal substrings of words. (And that would make indexing a whole lot more expensive). And for more complex patterns (e.g. regexes) the problem would be even worse for the search engine.
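The difference between the two cases can be sketched with a sorted set standing in for the index. This is an illustration in plain Java with hypothetical names, not Lucene code: the prefix search seeks to the first candidate and stops as soon as the prefix no longer matches, while the infix search has no choice but to look at every entry.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class SortedTermDictionary {
    private final TreeSet<String> terms = new TreeSet<>();

    SortedTermDictionary(Collection<String> words) {
        terms.addAll(words);
    }

    /** Pattern "prefix*": seek in O(log N), then scan only the matching range. */
    List<String> withPrefix(String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // fixed part no longer matches, so nothing further can
            }
            matches.add(term);
        }
        return matches;
    }

    /** Pattern "*infix*": sorted order gives no help, every entry is inspected. */
    List<String> containing(String infix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.contains(infix)) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedTermDictionary dictionary = new SortedTermDictionary(
                List.of("abuser", "superuser", "user", "username", "users", "utility"));
        System.out.println(dictionary.withPrefix("user"));
        System.out.println(dictionary.containing("user"));
    }
}
```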
