Interpret Queries of Lucene - java

I was wondering if there is any way to interpret Queries of Lucene in simple terms?
For example :
Example # 1:
Input Query - name:John
Output - Interpreted as : Find all entries where attribute "name" is equal to "John".
Example # 2:
Input Query - name:John AND phoneNumber:1234
Output - Interpreted as : Find all entries where attribute "name" is equal to "John" and attribute "phoneNumber" is equal to "1234".
Any tutorials in this regard will be helpful,
Thanks

The Lucene documentation does a pretty decent job in explaining basic queries and their interpretation. It seems as though that's all you're looking for; once you get into some of the more advanced query types, it gets hairy, but the documentation should always be your first stop; it's fairly comprehensive.
Edit: Ah, you want automated query explanation. I don't know of any that currently exist; I think you'll have to write your own, but if you're starting with standard QueryParser Syntax, I think the best input for your interpreter would be the output of QueryParser.parse(). That breaks down the free text into Lucene query objects that shouldn't be too difficult to wrap in a utility function that outputs a plain-English string for each one.
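To make the "utility function" idea concrete, here is a minimal sketch in plain Java. It is only a stand-in: it does not use Lucene at all and understands nothing beyond field:value clauses joined by AND or OR. The class and method names are made up for illustration. A real version would presumably walk the Query objects returned by QueryParser.parse() (for example BooleanQuery and TermQuery) instead of splitting the string itself.

```java
// Hypothetical stand-in for a Lucene query explainer.
// Handles only whitespace-separated field:value clauses joined by AND / OR.
class SimpleQueryExplainer {

    static String explain(String query) {
        StringBuilder out = new StringBuilder("Find all entries where ");
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND")) {
                out.append(" and ");
            } else if (token.equals("OR")) {
                out.append(" or ");
            } else {
                int colon = token.indexOf(':');
                // Reject anything that is not of the form field:value.
                if (colon <= 0 || colon == token.length() - 1) {
                    throw new IllegalArgumentException("Unsupported clause: " + token);
                }
                out.append("attribute \"").append(token, 0, colon)
                   .append("\" is equal to \"").append(token.substring(colon + 1))
                   .append("\"");
            }
        }
        return out.append(".").toString();
    }

    public static void main(String[] args) {
        System.out.println(explain("name:John AND phoneNumber:1234"));
    }
}
```

Phrases, ranges, wildcards, and nested groups are deliberately left out; each would need its own plain-English template.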

Related

Exact Match in SOLR 5.1

I have set up Solr 5.1.0 with a working data import from a MySQL database. It is working well.
But I want only exact-match results, or results relevant to the exact match.
For example:
Dancers in Mumbai
It returns all results that contain both "dancers" and "mumbai", but also results that contain only "dancers" or only "mumbai". I want only results that contain both "dancers" and "mumbai", not the others.
This is not a complete answer, but it's the direction I'm trying to take with a similar problem. Comments are very welcome.
Step 1:
Implement multiple Solr cores, core 1 is "jobs" (dancers/lawyers/etc), and core 2 is "cities" (mumbai/chennai/etc).
Step 2:
Query each core for exact matches, so implement the KeywordTokenizerFactory on the relevant field to find exact matches only. This will give you all the matches across cores (e.g. jobs:dancers and cities:mumbai).
Step 3:
Perform your general query using EDisMax for a user-friendly search (e.g. searching for "dancers in mumbai" across many fields), and use the boost field to boost the jobs/cities found in the earlier query.
I would love to know if there is a better way of doing something this elaborate, but I have not found it yet. Hope it helps.
Using required terms like: +dancers +mumbai
Or a phrase query: "dancers in mumbai"
Would work.
You can also set the default operator for your query to be "AND", using the q.op parameter.
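If the search text comes from user input, the two query forms above can be built with a couple of string helpers. This is a minimal sketch in plain Java; the class and method names are hypothetical, and it does not escape the other special characters of the query syntax.

```java
import java.util.StringJoiner;

// Hypothetical helpers that turn free text into the two query forms above.
class ExactMatchQueries {

    // Prefix every whitespace-separated word with '+', marking it as required.
    static String requiredTerms(String userInput) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String word : userInput.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                joiner.add("+" + word);
            }
        }
        return joiner.toString();
    }

    // Wrap the input in double quotes so it is parsed as a single phrase.
    // Embedded double quotes are backslash-escaped; nothing else is escaped.
    static String phrase(String userInput) {
        return "\"" + userInput.trim().replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(requiredTerms("dancers mumbai"));
        System.out.println(phrase("dancers in mumbai"));
    }
}
```

Note that requiredTerms("dancers in mumbai") would also mark "in" as required, which may or may not be what you want depending on how the field is analyzed.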

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
An example search query: keywords: "berlin house john" name: "berlin house john" author: "berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list. The best matching result is of course in the first position, but the following results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which of course does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure that all words appear somewhere across the fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords, and one or more terms must match in name, and one or more terms must match in author.
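Since the search terms arrive as free text, a query of this shape would have to be generated. The following is a minimal sketch in plain Java, with hypothetical class and method names; it assumes the words contain no double quotes or other characters that need escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical builder for the "(f:a OR f:b) AND (g:a OR g:b)" query shape.
class PerFieldQueryBuilder {

    static String build(List<String> fields, String userInput) {
        String[] words = userInput.trim().split("\\s+");
        List<String> groups = new ArrayList<>();
        for (String field : fields) {
            // One OR-group per field, containing every word of the input.
            List<String> clauses = new ArrayList<>();
            for (String word : words) {
                clauses.add(field + ":\"" + word + "\"");
            }
            groups.add("(" + String.join(" OR ", clauses) + ")");
        }
        // Every field group must match.
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("keywords", "name", "author"),
                                 "berlin house john"));
    }
}
```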
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience, defaultOperator behaves the same as an operator specified in the query. I can't say why; it's just what I have observed.
Please try the query with the local param {!q.op=AND}
I guess you are using the default query parser; correct me if I am wrong.

problem with Lucene's automagical query conversion

Recently I started using Lucene. However, after a few days I noticed that the queries I provide as Strings are converted by Lucene into more general ones.
Example:
MY QUERY: "want to go" (including " as I'm searching whole phrases)
QUERY OBJECT created from my query (.toString): text:"want ? go"
NUMBER OF RESULTS for texts:
I want to go out today -> 1 result - correct
I want sdfto go out today -> 1 result - incorrect, should be 0
I wanted to match exactly the phrase "want to go" and not "want whatever go". I noticed that only the words "to" and "a" are replaced with "?".
My question is: why is Lucene changing the queries I provide, and how can I force Lucene to run my queries unchanged?
Moreover, I'm using StandardAnalyzer (for both indexing and querying).
"to" is a stop word, meaning it is not indexed and not searched by some analyzers [including StandardAnalyzer], because it is usually not useful for searching. If you don't want it to be 'stopped', you will need to use a different analyzer [both for indexing and searching], but it will probably give worse results.
You can also remove the word 'to' from the field STOP_WORDS.
IMPORTANT: your indexing analyzer and searching analyzer should be consistent, including the STOP_WORDS field!
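To see why "want sdfto go" still matches, it helps to model what a stop filter does to token positions. The sketch below is plain Java and not Lucene code: the names are hypothetical, the tokenizer just splits on whitespace, and the stop list is passed in as a parameter. A removed stop word leaves a gap, and a gap in the phrase matches any token at that position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of phrase matching with stop-word position gaps (not Lucene code).
class StopWordPhraseDemo {

    // Lower-case and split on whitespace; a stop word becomes null so that
    // the remaining words keep their original positions.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            tokens.add(stopWords.contains(word) ? null : word);
        }
        return tokens;
    }

    // The phrase matches if, at some offset, every non-null phrase token equals
    // the document token at the same relative position; null slots match anything.
    static boolean phraseMatches(String phrase, String document, Set<String> stopWords) {
        List<String> p = analyze(phrase, stopWords);
        List<String> d = analyze(document, stopWords);
        for (int start = 0; start + p.size() <= d.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < p.size() && ok; i++) {
                ok = p.get(i) == null || p.get(i).equals(d.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("to", "a"));
        Set<String> none = Collections.emptySet();
        System.out.println(phraseMatches("want to go", "I want sdfto go out today", stop));
        System.out.println(phraseMatches("want to go", "I want sdfto go out today", none));
    }
}
```

With the stop list in place the phrase is effectively "want ? go"; with an empty stop list, "to" is a real term and the second sentence no longer matches.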

Case-insensitive search with eXist-db

I am going through a final refinement posted by the client, which needs me to do a case-insensitive query. I will basically walk through how this simple program works.
First of all, in my Java class, I did a fairly simple webpage parsing:
title=(String)results.get("title");
doc = docBuilder.parse("http://" + server + ":" + port + "/exist/rest/db/wb/xql/media_lookup.xql?" + "&title=" + title);
This Java statement references an XQuery file "media_lookup.xql" which is stored on localhost, and the only parameter we are passing is the string "title".
Secondly, let's take a look at that XQuery file:
$title := request:get-parameter('title',""),
$mediaNodes := doc('/db/wb/portfolio/media_data.xml'),
$query := $mediaNodes//media[contains(title,$title)],
Then it evaluates that query. This XQuery gets the "title" parameter that is passed from our Java class, and queries the "media_data" XML file stored in the database, which contains a bunch of media nodes, each with a 'title' element node. As you may expect, this simple query just matches those media nodes whose 'title' element contains the value of the 'title' parameter as a substring. So if our 'title' is "Chi", it will return media nodes whose title may be "Chicago" or "Chicken".
The refinement request posted by the client is that there should be NO case-sensitivity. The very intuitive way is to modify the XQuery statement by using a lower-case function in it, like:
$query := $mediaNodes//media[contains(lower-case(title/text()), lower-case($title))],
However, here is the problem: this modified query makes my machine run out of memory. Since my "media_data.xml" is quite huge and contains thousands of millions of media nodes,
I assume the lower-case() function will run on each of the entries, thus causing the machine to crash.
I've talked with some experienced XQuery programmers, and they think I should use an index to solve this problem, and I will definitely look into that. But before that, I am posting this problem here to get other ideas or suggestions. Do you think any other approach may help? For example, could I tweak the Java parse statement to achieve the case-insensitivity? I think I have seen some people do string concatenation using "contains." in Java before passing it to the server.
Any idea or help is welcomed.
The refinement request posted by the client is that there should be NO case-sensitivity. The very intuitive way is to modify the XQuery statement by using a lower-case function in it, like:
$query := $mediaNodes//media[contains(lower-case(title/text()), lower-case($title))],
However, the question comes: this modified query will run my machine into memory overflow. Since my "media_data.xml" is quite huge and contains thousands of millions of media nodes, I assume the lower-case() function will run on each of the entries, thus causing the machine to crash.
Such fears are not justified.
Any sane implementation of XPath uses automatic memory for its functions. This means that the memory required for evaluating a particular predicate, including the result of lower-case() becomes freed (in languages with no garbage collection) or unreferenced and ready for garbage collection immediately after the evaluation of the predicate.
A table index is probably not the solution, as the absence of an index will slow things down but not trigger a memory overflow.
I think your best bet is to duplicate the title in your database, copying it into an all-lowercase field (or uppercase, which makes it clearer that it was converted), and to query the alternate title while presenting the normal title.
To save some processing, you can do the case conversion of $title before the query.
You can drop the ampersand in your URL, I'm not sure all webservers parse the ?& correctly.
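On the Java side, the last two suggestions could look roughly like the sketch below: the title is lower-cased and URL-encoded before the request is built, and the stray ampersand is gone. The class and method names are hypothetical, and the port is assumed to be an int. This only normalizes the search string; the stored titles still need a lower-cased counterpart (such as the duplicated lowercase title suggested above) for the comparison to be case-insensitive.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Locale;

// Hypothetical helper that builds the media_lookup.xql request URL.
class MediaLookupUrl {

    static String build(String server, int port, String title) {
        try {
            // Lower-case once on the client, then encode for use in a query string.
            String normalized = URLEncoder.encode(title.toLowerCase(Locale.ROOT), "UTF-8");
            return "http://" + server + ":" + port
                    + "/exist/rest/db/wb/xql/media_lookup.xql?title=" + normalized;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("localhost", 8080, "Chi Town"));
    }
}
```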

Matching inexact company names in Java

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
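As a rough illustration of the normalize-then-score idea, here is a minimal sketch in plain Java. The class name, the normalization rules (lower-casing, turning "&" into "and", dropping everything except ASCII letters, digits and spaces), and the closest() helper are assumptions made for this example; the distance function is the standard two-row dynamic-programming Levenshtein.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical name matcher: normalize both sides, then pick the smallest edit distance.
class CompanyNameDistance {

    // Lower-case, spell out '&', keep only ASCII letters, digits and single spaces.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT)
                .replace("&", " and ")
                .replaceAll("[^a-z0-9 ]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Classic Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Return the known name whose normalized form is closest to the input, or null if none.
    static String closest(Iterable<String> knownNames, String input) {
        String needle = normalize(input);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : knownNames) {
            int distance = levenshtein(normalize(candidate), needle);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(closest(
                Arrays.asList("A. B. Widgets & Co Ltd.", "Acme Corp"),
                "A.B. Widgets and Co"));
    }
}
```

To avoid false matches you would probably also want to reject the best candidate when its distance is above some threshold, and that threshold has to be tuned on real data.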
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. There are ~20 different algorithms already implemented (incl. Levenshtein, Jaro-Winkler, Monge-Elkan, etc.), and its code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos to its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
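The linked code is not reproduced here, so the following is only a guess at what such scoring could look like, written as a minimal plain-Java sketch. The class name, the list of ignored words (including "and", "inc" and "limited"), and the choice to divide by the longer string's length are all assumptions.

```java
import java.util.Locale;

// Hypothetical LCS-based similarity score in the range 0.0 to 1.0.
class LcsScore {

    // Lower-case, drop punctuation, drop common company suffix words, drop spaces.
    static String simplify(String name) {
        return name.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", "")
                .replaceAll("\\b(co|ltd|llc|inc|limited|and)\\b", "")
                .replace(" ", "");
    }

    // Length of the longest common subsequence, by dynamic programming.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    // LCS length of the simplified names, relative to the longer simplified name.
    static double score(String a, String b) {
        String x = simplify(a);
        String y = simplify(b);
        int longer = Math.max(x.length(), y.length());
        return longer == 0 ? 0.0 : (double) lcsLength(x, y) / longer;
    }

    public static void main(String[] args) {
        System.out.println(score("A. B. Widgets & Co Ltd.", "AB Widgets Limited"));
    }
}
```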
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may support the use of regular expressions (regex). See below for some tutorials in Java. Here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular expression for each company that encompasses the variations in spelling that you might anticipate, or the sub-elements of the company name that you would like to weight as being significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
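For the example company from the question, a per-company pattern might look like the sketch below. The pattern itself is an assumption made up for illustration: it covers only the variants listed in the question, and a real pattern would need to be tuned for each company.

```java
import java.util.regex.Pattern;

// Hypothetical per-company pattern for "A. B. Widgets & Co Ltd." and its variants.
class CompanyPattern {

    static final Pattern AB_WIDGETS = Pattern.compile(
            "a\\.?\\s*b\\.?\\s+widgets"          // "A. B.", "A.B.", "AB" or "A B", then "Widgets"
            + "(\\s*(&|and)\\s*co\\.?)?"         // optional "& Co" / "and Co"
            + "(\\s+(ltd\\.?|limited))?",        // optional "Ltd." / "Limited"
            Pattern.CASE_INSENSITIVE);

    // matches() requires the whole (trimmed) input to fit the pattern.
    static boolean matches(String incoming) {
        return AB_WIDGETS.matcher(incoming.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("AB Widgets Limited"));
        System.out.println(matches("AB Gadgets Ltd"));
    }
}
```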
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
* LCS code
* Example usage (guessing a category based on what people entered)
To be more precise: rather than Longest Common Subsequence, Longest Common Substring should work better, as the order of characters is important.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.
