Searching multiple fields with Lucene - java

I'm having some trouble with a search I'm trying to implement. I need for a user to be able to enter a search query into a web interface and for the back-end Java to search for the query in a number of fields. An example of this might be best:
Say I have a List containing "Person" objects. Say each object holds two String fields about the person:
FirstName: Jack
Surname: Smith
FirstName: Mary
Surname: Jackson
If a user enters "jack", I need the search to match both objects: the first on FirstName and the second on Surname.
I've been looking at using a MultiFieldQueryParser but can't get the fields set up right. Any help on this or pointing to a good tutorial would be greatly appreciated.

MultiFieldQueryParser is what you want, as you say.
Make sure:
The field names are always used consistently
The same Analyzer is used on both fields, and also on the query parser
You won't find partial words by default, so if you search for jack you won't find jackson. (You can search for jack* in that case.)
Regarding field names, I always set up an enum for them and pass e.g. MyFieldEnum.firstname.name() to Lucene. That way the compiler catches spelling mistakes, the enum is a good place to put Javadoc explaining what each field is for, and it shows the complete list of fields you support in your Lucene documents.
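The enum idea and the jack* hint above can be combined in a small sketch. The names PersonField and PersonQueries.prefixQuery are made up for illustration and are not part of Lucene. The sketch only builds a query string in Lucene's classic syntax (field:term*); it assumes the index-time Analyzer lowercases terms and that the input is a single plain word without query-syntax characters. The resulting string would still have to be parsed by a query parser configured with the same Analyzer.

```java
import java.util.Locale;
import java.util.StringJoiner;

/** Hypothetical field names for the Person documents (illustration only). */
enum PersonField {
    /** The person's given name. */
    firstname,
    /** The person's family name. */
    surname
}

final class PersonQueries {
    private PersonQueries() {}

    /**
     * Builds a classic-syntax query string that matches any field containing
     * a term that starts with the given word.
     * Assumption: terms were lowercased at index time, so the word is lowercased here.
     */
    static String prefixQuery(String word) {
        String term = word.trim().toLowerCase(Locale.ROOT);
        StringJoiner query = new StringJoiner(" OR ");
        for (PersonField field : PersonField.values()) {
            // field.name() keeps the field spelling in exactly one place
            query.add(field.name() + ":" + term + "*");
        }
        return query.toString();
    }
}
```

With the two example people, PersonQueries.prefixQuery("Jack") yields a query that hits "Jack Smith" through firstname and "Mary Jackson" through surname.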


How to get result set by checking a specific element in an aggregated array using JOOQ?

I want to filter results by a specific value in the aggregated array in the query.
Here is a little description of the problem.
A section belongs to a garden, a garden belongs to a district, and a district belongs to a province.
Users have multiple sections. Those sections belong to gardens, the gardens to districts, and the districts to provinces.
I want to get the ids of users that have the value 2 in their district array.
I tried to use the any operator, but it doesn't work properly (syntax error).
Any help would be appreciated.
PS: This is possible using plain SQL.
rs = dslContext.select(
        field("user_id"),
        field("gardens_array"),
        field("province_array"),
        field("district_array"))
    .from(table(select(
            arrayAggDistinct(field("garden")).as("gardens_array"),
            arrayAggDistinct(field("province")).as("province_array"),
            arrayAggDistinct(field("district")).as("district_array"))
        .from(table("lst.user"))
        .leftJoin(table(select(
                field("section.user_id").as("user_id"),
                field("garden.garden").as("garden"),
                field("garden.province").as("province"),
                field("garden.district").as("district"))
            .from(table("lst.section"))
            .leftJoin("lst.garden")
            .on(field("section.garden").eq(field("garden.garden")))
            .leftJoin("lst.district")
            .on(field("district.district").eq(field("garden.district")))).as("lo"))
        .on(field("user.user_id").eq(field("lo.user_id")))
        .groupBy(field("user.user_id"))).as("joined_table"))
    .where(val(2).equal(DSL.any("district_array")))
    .fetch()
    .intoResultSet();
Your code is calling DSL.any(T...), which corresponds to the expression any(?) in PostgreSQL, where the bind value is a String[] in your case. But you don't want "district_array" to be a bind value; you want it to be a column reference. So either assign your arrayAggDistinct() expression to a local variable and reuse that, or reuse your field("district_array") expression, or replicate it:
val(2).equal(DSL.any(field("district_array", Integer[].class)))
Notice that it's usually a good idea to be explicit about data types (e.g. Integer[].class) when working with the plain SQL templating API, or even better, use the code generator.

How to predict correct country name for user provided country name?

I am planning to do some data cleanup.
Situation: my data has a country field that contains user-entered country names. It may contain spelling mistakes or different names for the same country (for example US, U.S.A, or United States for USA). I also have a list of correct country names.
What I want: to predict which country from the list an entry most likely refers to. For example, if "U.S." is given, it should be changed to "USA" (the correct country name in our list).
Is there any way I can do this using Java, OpenNLP, or any other method?
You can use the Getty API. It will give you abbreviations of country names, so it is worth experimenting with.
OR
You can also use the Levenshtein distance to find the closest country name.
Try this out; it should help.
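The Levenshtein suggestion could look roughly like the following sketch. The class name CountryMatcher and the method closest are illustrative names, and the case-insensitive comparison is an assumption about how the data should be normalized. It uses the standard two-row dynamic programming formulation of edit distance and returns the list entry with the smallest distance.

```java
import java.util.List;
import java.util.Locale;

final class CountryMatcher {
    private CountryMatcher() {}

    /** Classic edit distance: insertions, deletions and substitutions each cost 1. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the known country with the smallest distance to the input, or null for an empty list. */
    static String closest(String input, List<String> countries) {
        String needle = input.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String country : countries) {
            int distance = levenshtein(needle, country.toLowerCase(Locale.ROOT));
            if (distance < bestDistance) {
                bestDistance = distance;
                best = country;
            }
        }
        return best;
    }
}
```

Edit distance tends to work for misspellings such as "Germny", but it is a weak signal for abbreviations versus full names ("US" versus "United States"), so in practice a maximum accepted distance and some manual review are probably needed.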
You can try attaching Google's location autocomplete API to your text box or select element.
With this API, users get Google-style autocomplete suggestions while typing.
If you have city or state information that is already sanitized, you could look up the country from it.
You could also define aliases in your list of country names and point the aliases to the preferred notation. For example, US, United States, and USA are all aliases of U.S.A. You could have the program append to the alias database so that it improves as it is used. You might have to make multiple passes over the data, and a certain amount of manual work is involved.
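A minimal sketch of the alias idea, assuming an in-memory map stands in for the alias database. The class CountryAliases and its methods are hypothetical names, and normalizing by trimming and lowercasing is an assumption. Unknown inputs come back empty so that they can be reviewed manually and then added as new aliases.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class CountryAliases {
    // normalized alias -> preferred notation
    private final Map<String, String> aliasToPreferred = new HashMap<>();

    private static String key(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** Registers (or overwrites) an alias; calling this after manual review lets the table grow over time. */
    void addAlias(String alias, String preferred) {
        aliasToPreferred.put(key(alias), preferred);
    }

    /** Returns the preferred notation, or empty if the input is not a known alias yet. */
    Optional<String> resolve(String raw) {
        return Optional.ofNullable(aliasToPreferred.get(key(raw)));
    }
}
```

For example, after addAlias("United States", "U.S.A"), the input " united states " resolves to "U.S.A", while an unseen spelling resolves to empty and can be queued for review.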

freebase MQL query to match any field

My application receives a number of Strings as input: the name of the "object" I'm looking for, plus other values such as the year an artist was born or the last album they made.
The application has no knowledge of the type of the input object, so I'm trying to write an MQL query that, given the name of the object and the other values (in any field, as I don't know the type of what I'm querying), returns the type of what I searched for. Once I have its type, I could, for example, make a better query asking for specific fields.
An example is worth a thousand words, so let's assume my input is "The Police" and "So Lonely", one of their songs. I only know that "The Police" is the name of what I'm looking for; I know nothing about "So Lonely", so I need to insert it into the query somehow to get better results, without knowing its type.
My first basic query is:
[{
"name": "The Police",
"type": []
}]
and it works, but I can't refine my search by including "So Lonely", which could narrow the search output.
Any hints?
Another thing I need to accomplish once the above works (and I still have no idea how to do it) is printing a summary of the entity I found. In the example above, I would not want ALL the information about "The Police", only basic general fields, such as their albums or the year the band was formed.
Is it possible to do something like that without knowing the type of what I found?

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
An example search query: keywords: "berlin house john" name: "berlin house john" author: "berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field. That is an enormous list in which the best matching result is of course in first position, but the following results have very little relevance (perhaps only one field matching) and simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a match for all three words in all three fields, which of course does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure that all words appear somewhere across the fields, I would suggest copying all relevant fields into one field at index time and querying that field instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords, and one or more terms must match in name, and one or more terms must match in author.
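Since the terms come from free-text user input, a query of this shape would normally be assembled in code. A rough sketch, with hypothetical names (SolrQueryBuilder, anyTermInEveryField) and assuming the input has already been split into individual words; quotes and backslashes inside a term are escaped so that each term stays a quoted phrase:

```java
import java.util.ArrayList;
import java.util.List;

final class SolrQueryBuilder {
    private SolrQueryBuilder() {}

    /**
     * Builds "(f1:"t1" OR f1:"t2") AND (f2:"t1" OR f2:"t2") ...":
     * every field must match at least one of the terms.
     */
    static String anyTermInEveryField(List<String> fields, List<String> terms) {
        List<String> groups = new ArrayList<>();
        for (String field : fields) {
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                // escape backslashes first, then double quotes
                String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
                clauses.add(field + ":\"" + escaped + "\"");
            }
            groups.add("(" + String.join(" OR ", clauses) + ")");
        }
        return String.join(" AND ", groups);
    }
}
```

Called with the fields keywords, name, author and the terms berlin, house, john, this produces the same query as written out above, on a single line.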
As of Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience defaultOperator behaves the same as an operator specified in the query. I can't say why; that is just what I have observed.
Please try the query with the param {!q.op=AND}.
I guess you are using the default query parser; correct me if I am wrong.

Lucene search on a Hibernate List field

I have a Hibernate annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example:
"Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, if a user searches TestClass for "Foo Snafu", the record will be found, I am guessing because the tokens Foo and Snafu are both present in the List<String> for this record.
Is there a way I can prevent this from happening?
The real world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for Lewis Smith, or Joe Clay.
Update: It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
Update 2: It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".
Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields. You could just append the list index to the field name you are already using. I don't have complete information on your needs, of course, but something like this is probably the better solution.
If you just want to search for the precise text "Foo Snafu", you can use a PhraseQuery. If you want to be sure your phrase query doesn't cross from one list item to the next (i.e., if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.
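Why a delimiter (or a position gap) stops a phrase from crossing list items can be shown with a toy model. This is not Lucene code: PositionGapDemo is a made-up class that imitates how the values of one field are laid out as consecutive token positions, with a configurable number of unused positions between values, and treats an exact phrase as a run of adjacent positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PositionGapDemo {
    private PositionGapDemo() {}

    /** Lays out all values as one position list; null marks an unused position between values. */
    static List<String> flatten(List<String> values, int gap) {
        List<String> positions = new ArrayList<>();
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                for (int g = 0; g < gap; g++) {
                    positions.add(null);
                }
            }
            String[] tokens = values.get(v).trim().toLowerCase(Locale.ROOT).split("\\s+");
            positions.addAll(Arrays.asList(tokens));
        }
        return positions;
    }

    /** An exact phrase matches only if its tokens occupy adjacent positions. */
    static boolean matchesPhrase(List<String> positions, String phrase) {
        List<String> wanted = Arrays.asList(phrase.trim().toLowerCase(Locale.ROOT).split("\\s+"));
        return Collections.indexOfSubList(positions, wanted) >= 0;
    }
}
```

With the values "Bar Foo" and "Snafu Bar" and a gap of 0, the phrase "Foo Snafu" matches across the two items; with a gap of 1 it no longer does, while "Snafu Bar" still matches. A sloppy phrase query would need a gap larger than the slop to get the same protection.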
