I am doing a general search against Elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with a single box to enter search terms. Various search options are allowed (for example, surrounding a phrase with quotes to search for the phrase rather than its component words). This means I am doing full text searches.
All is well except that my account refs have forward slashes in them, and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is being searched as [AC OR 1234 OR A01]. Initially I thought this was a regex issue, but I don't think it is.
I raised a similar question a while ago, and one suggestion that I thought had worked was to add "analyzer": "keyword" to the query (in my code
queryStringQueryBuilder.analyzer("keyword")
).
The problem with this is that many of the other fields searched are not keyword fields, and it stops a lot of the flexible search options from working (case sensitivity etc.). I assume it has effectively turned the text search into an exact match.
I've been looking at this the wrong way around for a while now, and as I see it I can't fix it in the index or even in the general analyser settings: even if the account ref field is tokenised and analysed perfectly for my requirement, the search will still hit all the other fields with [AC OR 1234 OR A01].
Is there a way of configuring the search query so that it does not split the account number on forward slashes? I could try ignoring all punctuation if it is possible to split only on whitespace, although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full text search but would not split the search term on punctuation? If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.
The simplest way to do it is to replace / with some character that doesn't cause the word to be split into two tokens but doesn't interfere with your other terms (_, . or ' should work), or to remove / completely using a mapping char filter. There is a similar example here: https://stackoverflow.com/a/23640832/783043
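As a minimal sketch of the query-time half of that idea in Java (nothing here is from the linked answer): normalize the raw input before it reaches the existing query builder, so the account ref stays one token. The same replacement has to exist on the index side (for example via the mapping char filter) or the single token won't match anything.
// Replace the slash with a character the standard tokenizer does not split on,
// then pass the result to the existing queryStringQueryBuilder.
String userInput = "AC/1234/A01";
String normalized = userInput.replace('/', '_');   // searched as one token: AC_1234_A01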
Related
I use Lucene to search for specific patterns using a regular expression. A new use case came up where I need to look up a specific string matching a regex pattern. A good example would be to look up prices in documents: prices can be written in many ways, and just looking for "1256.88" as stored in the database is not enough. The value in the document may have a currency in front of it, behind it, or not present at all ("EUR 1256,88", "1256,88 EUR" or just "1256,88"). The value may or may not have thousands separators. And of course all of this can be combined. So I want to search for a specific, known price ("1256.88") that is part of a regex at the same time. An example regex would be
[0-9]{1,10}([.,][0-9]{0,2})?([ ]?[€$])?
What is the Lucene way of doing this? Is there a way to search with a regex AND an "example"?
Or do I have to search with a regex and then filter out wrong hits manually afterwards? How do I find out which strings triggered the match?
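(Not from the original thread, but as a hedged sketch of one way to combine the two, assuming a "body" field and Lucene's RegexpQuery: that query matches single indexed terms, so with the StandardAnalyzer "EUR 1256,88" is two tokens and the term-level regex can only target the numeric one. Re-running a full java.util.regex pattern over the stored text of each hit then recovers the exact string that triggered the match.)
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;

// Term-level query: the known value in either separator spelling.
RegexpQuery query = new RegexpQuery(new Term("body", "1256[.,]88"));

// Post-filter each hit's stored text to see what actually matched.
Pattern full = Pattern.compile("(EUR ?)?1256[.,]88( ?(EUR|€|\\$))?");
Matcher m = full.matcher(storedFieldText);   // storedFieldText: hypothetical per-hit stored field
if (m.find()) {
    String triggering = m.group();           // the exact matching string
}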
I have a Lucene index with approx. 1 million documents. From these documents, I want to mine:
email addresses
signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
specific identifiers from each of the documents (that follow a regex pattern "\s[0-9]{3}[a-zA-Z0-9]{6}\s").
I understand that doing this at index build time, ideally using Solr, would be much easier, but how can one do it from an already-built Lucene index?
I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for # to find all email addresses, but I actually got zero results. If I search for # in Luke I get zero results. If I search for #hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa#hotmail.com.
The index was created using StandardAnalyzer. Not sure if it matters, but the text is in UTF-8 I believe.
Any helpful suggestions or pointers are great! Note this is not for a front end, so the query doesn't have to be near real-time.
Analysis does matter, yes. The standard analyzer will treat whitespace and punctuation, such as #, as a place to split input into tokens. As such, you wouldn't expect to see any of them actually present in the indexed data.
You can use Lucene's regex query, particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
Generally, you might want to use a different analyzer for an email field, in order to index the address as a single token. You should still get reasonable results searching for a particular email address with the current index: though the analyzer removes the punctuation, searching for the (usually) three tokens of an email consecutively as a phrase would be expected to match well. However, a regex search like \w*#\w*\.\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of email domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search.
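As a sketch of that re-indexing idea (the field name "email" is an assumption, and the class locations are as in Lucene 5+), a per-field analyzer keeps the address as one token while leaving everything else on the standard analyzer:
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// The hypothetical "email" field is indexed as a single untokenized term,
// so the # (or @) survives and wildcard/regex queries against it can see it.
Map<String, Analyzer> perField = new HashMap<>();
perField.put("email", new KeywordAnalyzer());
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
// Use this analyzer both in the IndexWriterConfig when re-indexing and in the
// QueryParser at search time, so both sides agree on the tokenization.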
I am working with GATE (a Java-based NLP framework) and want to find words that partially match terms in a dictionary.
For example I have a disease dictionary with following terms
Congestive cardiac failure
Congestive Heart Failure
Colon Cancer
.
.
.
Thousands of more terms
Let's assume I have a string "Father had cardiac failure last year". From this string I want to identify "cardiac failure" as a partial match, because it occurs as part of a term in the dictionary.
I have seen some discussion on a similar subject in Python, JS and C#, but I am not sure what can help in such a case here.
I wonder if I can utilize Aho-Corasick over here.
The UIMA ConceptMapper annotator addon includes functionality similar to what you are looking for. You may consider:
using UIMA inside GATE: http://gate.ac.uk/userguide/chap:uima
developing a similar component using the main ideas from the addon
Maybe you should use Lucene. Treat each line of the dictionary as a document, and each sentence in the text as a query.
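A minimal sketch of that approach (assuming a recent Lucene; ByteBuffersDirectory, for instance, is Lucene 8+): each dictionary term becomes a one-field document, and each sentence is escaped and run as a plain query, so a shared phrase like "cardiac failure" scores the matching dictionary entries highly.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;

public class DictionaryMatch {
    public static void main(String[] args) throws Exception {
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // One tiny document per dictionary term.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String term : new String[] {
                    "Congestive cardiac failure", "Congestive Heart Failure", "Colon Cancer"}) {
                Document doc = new Document();
                doc.add(new TextField("term", term, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        // Each sentence from the text becomes a query.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        QueryParser parser = new QueryParser("term", analyzer);
        String sentence = "Father had cardiac failure last year";
        for (ScoreDoc hit : searcher.search(parser.parse(QueryParser.escape(sentence)), 5).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("term") + "  score=" + hit.score);
        }
    }
}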
One question that arises is which substrings you want to include in the search. If you included all substrings, just "Heart" alone would also be a match, but that is not really a disease.
Maybe all right-aligned (word-)substrings (perhaps with length > 1) would be acceptable.
So one thing you could do is to train the Aho-Corasick pattern matcher with the substrings you want to include. To keep the information about which dictionary term a substring came from, you probably need to modify the algorithm a bit (if keeping that information is important) or build another data structure to look it up afterwards.
In any case I would convert the disease list and the documents you want to search to lower case before training/matching. If there is a chance of misspellings, there are also papers on fuzzy Aho-Corasick automata.
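A sketch of that combination, using the open-source org.ahocorasick:ahocorasick library (my assumption; the answer doesn't name an implementation): the keywords are the right-aligned word suffixes of 2+ words, everything is lower-cased up front, and a side map records which dictionary term each suffix came from.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

Map<String, String> sourceTerm = new HashMap<>();          // suffix -> dictionary term
Trie.TrieBuilder builder = Trie.builder().onlyWholeWords();
for (String term : new String[] {"Congestive cardiac failure", "Colon Cancer"}) {
    String[] words = term.toLowerCase().split(" ");
    for (int i = 0; i < words.length - 1; i++) {           // right-aligned suffixes of 2+ words
        String suffix = String.join(" ", Arrays.copyOfRange(words, i, words.length));
        builder.addKeyword(suffix);
        sourceTerm.put(suffix, term);
    }
}
Trie trie = builder.build();
for (Emit emit : trie.parseText("father had cardiac failure last year")) {
    System.out.println(emit.getKeyword() + " <- " + sourceTerm.get(emit.getKeyword()));
}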
Let's say I have about 1000 sentences that I want to offer as suggestions when a user is typing into a field.
I was thinking about running a Lucene in-memory search and then feeding the results into the suggestions set.
The trigger for running the searches would be the space character and exit from the input field.
I intend to use this with GWT, so the client will just be getting the results from the server.
I don't want to do what Google is doing, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on that. Sort of like when I'm typing the title for a question here on Stack Overflow.
Did anyone do something like this before? Is there already a library I could use?
I was working on a similar solution. This paper, titled Effective Phrase Prediction, was quite helpful for me. You will have to prioritize the suggestions as well.
If you've only got 1000 sentences, you probably don't need a powerful indexer like Lucene. I'm not sure whether you want to do "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions. Here are solutions to both:
Assuming that you want to complete the sentence input by the user, then you could put all of your strings into a SortedSet, and use the tailSet method to get a list of strings that are "greater" than the input string (since the string comparator considers a longer string A that starts with string B to be "greater" than B). Then, iterate over the top few entries of the set returned by tailSet to create a set of strings where the first inputString.length() characters match the input string. You can stop iterating as soon as the first inputString.length() characters don't match the input string.
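A compact sketch of that loop (the variable names are just illustrative):
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

TreeSet<String> sentences = new TreeSet<>();     // the ~1000 suggestion strings
String input = "how do i";
List<String> suggestions = new ArrayList<>();
for (String candidate : sentences.tailSet(input)) {
    if (!candidate.startsWith(input)) {
        break;                                   // sorted order: nothing later matches either
    }
    suggestions.add(candidate);
    if (suggestions.size() == 10) {
        break;                                   // only the top few entries
    }
}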
If you want to do keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are and how many unique words there are in them. If this set is small enough, you'll be able to get away with a HashMap<String,Set<String>>, where you map keywords to the sentences that contain them. Then you could handle multiword queries by intersecting the sets.
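And a sketch of that keyword map, with the set intersection for a multiword query (again, names are illustrative):
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

Map<String, Set<String>> keywordIndex = new HashMap<>();
for (String sentence : sentences) {                      // the same 1000 sentences
    for (String word : sentence.toLowerCase().split("\\s+")) {
        keywordIndex.computeIfAbsent(word, k -> new HashSet<>()).add(sentence);
    }
}

Set<String> result = null;
for (String word : "lucene memory".toLowerCase().split("\\s+")) {   // example query
    Set<String> hits = keywordIndex.getOrDefault(word, new HashSet<>());
    if (result == null) {
        result = new HashSet<>(hits);
    } else {
        result.retainAll(hits);                          // intersect with previous words
    }
}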
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). I don't think either solution would scale to hundreds of thousands of suggestions either. Do either of those do what you want? Happy to provide code if you'd like it.
I am using Lucene in my portal (J2EE based) for indexing and search services.
The problem is with Lucene's keywords. When you use one of them in a search query, you get an error.
For example:
searchTerms = "ik OR jij"
This works fine, because it will search for "ik" or "jij"
searchTerms = "ik AND jij"
This works fine, it searches for "ik" and "jij"
But when you search:
searchTerms = "OR"
searchTerms = "AND"
searchTerms = "ik OR"
searchTerms = "OR ik"
Etc., it will fail with an error:
Component Name: STSE_RESULTS Class: org.apache.lucene.queryParser.ParseException Message: Cannot parse 'OR jij': Encountered "OR" at line 1, column 0.
Was expecting one of:
...
It makes sense: these words are reserved keywords for Lucene and act as operators.
In Dutch, the word "OR" is important because it stands for "Ondernemings Raad" (works council). It is used in many texts, and it needs to be found. Searching for "or", for example, does work, but does not return texts matching the term "OR". How can I make it searchable?
How can I escape the keyword "or"? Or how can I tell Lucene to treat "or" as a search term, NOT as a keyword?
I suppose you have tried putting the "OR" into double quotes?
If that doesn't work I think you might have to go so far as to change the Lucene source and then recompile the whole thing, as the operator "OR" is buried deep inside the code. Actually, compiling probably isn't even enough: you'll have to change the file QueryParser.jj in the source package that serves as input for JavaCC, then run JavaCC, then recompile the whole thing.
The good news, however, is that there's only one line to change:
| <OR: ("OR" | "||") >
becomes
| <OR: ("||") >
That way, you'll have only "||" as logical OR operator. There is a build.xml that also contains the invocation of JavaCC, but you have to download that tool yourself. I can't try it myself right now, I'm afraid.
This is perhaps a good question for the Lucene developer mailing list, but please let us know if you do that and they come up with a simpler solution ;-)
OR, NOT and AND are reserved keywords. I solved this problem just 2 days ago by lower-casing those 3 words in the user's search term before feeding it into the Lucene query parser. Note that if you search and replace these keywords, make sure you use word boundaries (\b) so you don't end up changing words such as ANDROID and ORDER.
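In Java that replacement looks roughly like this (a sketch; the \b word boundaries do the work):
// Only standalone operators are lower-cased; ANDROID and ORDER are untouched.
String sanitized = userQuery
        .replaceAll("\\bOR\\b", "or")
        .replaceAll("\\bAND\\b", "and")
        .replaceAll("\\bNOT\\b", "not");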
I then let the user specify NOT and AND by using - and +, just like Google does.
Escaping OR and AND with double quotes works for me. So try with a Java string like
String query = "field:\"AND\"";
I have read your question many times! =[
Please look at these suggestions.
How is your index stored? A document contains fields, and each field can be:
1) Stored 2) Tokenized 3) Indexed 4) Vector
It can make a significant difference.
Please use Luke; it can tell you how your indexes are actually stored. Luke is a must-have if you are working with Lucene, as it gives you a real idea of how indexes are stored, and it also offers search. Try it and let us know with your update!
You're probably doing something wrong when you're building the query. I'll second Narayan's suggestion on getting Luke (as posted in the comments) and try running your queries with that. It has been a little while since I used Lucene, but I don't remember ever having issues with OR and AND.
Other than that, you can try escaping the input strings using QueryParser.escape(userQuery)
More On Escaping
You can escape the "OR" when it's a search term, or write your own query parser for a different syntax. Lucene offers an extensive query API in addition to the parser, with which you can support your own query syntax quite easily.
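As a minimal sketch of that programmatic route (class names per recent Lucene; in 4.x and earlier BooleanQuery is constructed directly instead of via a Builder, and the field name "body" is just an example): no parser is involved, so "OR" can never be read as an operator. Note the term is lower-cased to match what the StandardAnalyzer put in the index.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// "or" is just a term here, never an operator.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("body", "or")), BooleanClause.Occur.SHOULD);
builder.add(new TermQuery(new Term("body", "jij")), BooleanClause.Occur.SHOULD);
BooleanQuery query = builder.build();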