Lucene search on a Hibernate List field - java

I have a Hibernate annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example:
"Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, if a user searches TestClass for "Foo Snafu", the record will be found, I am guessing because Foo and Snafu are both tokens in the List<String> for this record.
Is there a way I can prevent this from happening?
The real world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for Lewis Smith, or Joe Clay.
Update: It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
Update 2: It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".

Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields; you could just append the list index to the field name you are already using. While I don't have complete information on your needs, of course, doing something like this is probably the better solution.
If you just want to search for the precise text "Foo Snafu", you can use a PhraseQuery. If you want to be sure your PhraseQuery doesn't cross from one list item to the next (i.e., if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.
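To make the boundary problem concrete, here is a minimal sketch in plain Java that models token positions in a multi-valued field. It is a toy model and not Lucene code: the class and method names are made up for illustration, and the `gap` parameter only mimics the idea behind the positionIncrementGap mentioned in the question (in Lucene, an Analyzer can supply such a gap via getPositionIncrementGap).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model of token positions in a multi-valued field; not Lucene code. */
final class PositionGapSketch {

    /** Maps each lower-cased token to the positions it occupies across all values. */
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int next = 0;
        for (String value : values) {
            for (String token : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(next++);
            }
            next += gap; // leave a hole so the next value starts far away
        }
        return positions;
    }

    /** Exact phrase match: every term must sit at consecutive positions. */
    static boolean phraseMatches(Map<String, List<Integer>> positions, String phrase) {
        String[] terms = phrase.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int start : positions.getOrDefault(terms[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = positions.getOrDefault(terms[i], List.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With the values "Bar Foo" and "Snafu Bar", a gap of 0 lets the phrase "Foo Snafu" match across the item boundary, while a gap of, say, 100 keeps exact phrases inside a single list item.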

Related

Apache Solr, Require multiple of a single field while optimizing query to only hit specific indexes

My data is partitioned inside solr so that when I send a request "+apple" (required apple) I only hit partition 'a' to search.
Because of this optimization I cannot easily use boolean logic that spans all my data.
Solr query: +fruit:banana +fruit:apple
Result: no single partition holds records matching both required values, so I get 0 results, because I am searching the 'a' partition AND the 'b' partition, each with both required fields. In this case it is very unlikely that a record with the fruit field has two names; however, this is a multi-valued field, so it is possible, and I want those records to be at the top of my result from Solr.
One way to get what I want would be to change the query to: fruit:banana fruit:apple
However, this will sometimes return results that are neither apple nor banana, because Solr marks both as optional and so I allow it to search all my indexes. For example:
fruit:banana fruit:apple Country:Mexico
This might return oranges in Mexico... in which case I would rather get 0 results.
Also, doing two separate queries is not an option...does anyone know of a better way to get this 'REQUIRED OR' functionality with my partition optimization?
I am also open to other designs; I'm just looking for input.

Using Stanford NER for extracting Address from a text document?

I was looking at Stanford NER and thinking of using its Java APIs to extract postal addresses from a text document. The document may be any document that has a postal address section, e.g. utility bills or electricity bills.
The approach I am thinking of is:
Define postal address as a named entity using LOCATION and other primitive named entities.
Define segmentation and other sub-processes.
I am trying to find an example pipeline for this (what steps are required, in detail). Has anyone done this before? Suggestions welcome.
To be clear: all credit goes to Raj Vardhan (and John Bauer) who had an interaction on the [java-nlp-user] mailing list.
Raj Vardhan wrote about the plan to work on "finding street address in a sentence":
Here is an approach I have thought of:
1. Find the event-anchor in a sentence.
2. Select outgoing edges in the SemanticGraph from that event node with relations such as "prep-in" or "prep-at".
3. IF the dependent value in the relation has the POS tag NNP:
   a) Find outgoing edges from the dependent value's node with relations such as "nn".
   b) Connect all such nodes in increasing order of occurrence in the sentence.
   c) PRINT the resulting value as the location where the event occurred.
This is obviously with certain assumptions, such as a direct dependency between the event-anchor and the location in a sentence.
Not sure whether this could help you, but I wanted to mention it just in case. Again, any credit should go to Raj Vardhan (and John Bauer).
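For what it's worth, the quoted approach can be sketched in plain Java over a hand-built dependency graph. This is only an illustration under assumptions: Token, Edge and location are hypothetical stand-ins and not CoreNLP classes, the relation labels are taken verbatim from the mail above, and a real pipeline would get the tokens and edges from the parser.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** Toy walk over typed dependency edges; the types here are not CoreNLP classes. */
final class LocationFromDependencies {

    /** A word with its position in the sentence and its part-of-speech tag. */
    static final class Token {
        final int index;
        final String word;
        final String pos;

        Token(int index, String word, String pos) {
            this.index = index;
            this.word = word;
            this.pos = pos;
        }
    }

    /** A typed dependency from a governor token to a dependent token. */
    static final class Edge {
        final Token governor;
        final String relation;
        final Token dependent;

        Edge(Token governor, String relation, Token dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /** Returns the location phrase attached to the event anchor, if one is found. */
    static Optional<String> location(Token eventAnchor, List<Edge> edges) {
        for (Edge e : edges) {
            boolean locative = e.relation.equals("prep-in") || e.relation.equals("prep-at");
            if (e.governor != eventAnchor || !locative || !e.dependent.pos.equals("NNP")) {
                continue;
            }
            // Collect the head noun plus its "nn" modifiers.
            List<Token> parts = new ArrayList<>();
            parts.add(e.dependent);
            for (Edge m : edges) {
                if (m.governor == e.dependent && m.relation.equals("nn")) {
                    parts.add(m.dependent);
                }
            }
            // Restore sentence order before joining the words.
            parts.sort(Comparator.comparingInt(t -> t.index));
            return Optional.of(parts.stream().map(t -> t.word).collect(Collectors.joining(" ")));
        }
        return Optional.empty();
    }
}
```

As the mail itself notes, this only works when the location hangs directly off the event anchor; a full postal address would need more than this.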

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list, where of course, the best matching result is at the first position, but the next results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words, in all three fields, which of course does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure that all words appear somewhere across the fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords, and one or more terms must match in name, and one or more terms must match in author.
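Since the search terms come from free-text input, a query of that shape would have to be generated. A minimal sketch, assuming the terms are already split into words and contain no characters that need escaping (the class and method names are made up for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

/** Builds "(f:"a" OR f:"b") AND (g:"a" OR g:"b")" style query strings. */
final class PerFieldOrQuery {

    /** One OR-group per field, with the groups joined by AND. No escaping is done. */
    static String build(List<String> fields, List<String> terms) {
        return fields.stream()
                .map(field -> terms.stream()
                        .map(term -> field + ":\"" + term + "\"")
                        .collect(Collectors.joining(" OR ", "(", ")")))
                .collect(Collectors.joining(" AND "));
    }
}
```

Calling build with the fields keywords, name and author and the terms berlin, house and john yields the query shown above on a single line.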
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience, defaultOperator behaves the same as an operator specified in the query. I can't say why; it's just what I have seen.
Please try the query with the local param {!q.op=AND}
I guess you are using the default query parser; correct me if I am wrong.

Nesting searches in Lucene without duplicating keywords

I want to set up a search in Lucene (actually Lucene.NET, but I can convert from Java as necessary) using the following logic:
1. Search string is: A B C
2. Search one field in the index for anything that matches A, B, or C. (Query: (field1:A field1:B field1:C))
3. For each term that didn't match in step 2, search a second field for it while keeping the results from the first search (Query: (+(field1:A) +(field2:B field2:C)))
4. For each term that didn't match in step 3, search a third field...
5. Continue until running out of fields, or there's a search which has used every term.
Currently, my code can test whether a given search produces NO results, and ANDs together all the ones that do produce results. But I have no way to stop it before it tests against every field (which unnecessarily limits the results) - it's currently ending up with a query like: (+(field1:A field1:B field1:C) +(field3:A field3:B field3:C)) when I want it to be (+(field1:A field1:C) +(field3:B)). I can't just look at the results from the first search and remove words from the search string because the Analyzer mangles the words when it parses it for search, and I have no way to un-mangle them to figure out which of the original search terms it corresponds to.
Any suggestions?
Edit:
Ok, generally I prefer describing my problems in the abstract, but I think some part of it is getting lost in the process, so I'll be more concrete.
I'm building a search engine for a site which needs to have several layers of search logic. A few example searches which I'll trace out are:
Headphones
Monster Headphones
White Monster Headphones
White Foobar Headphones
The index contains documents with seven fields - the relevant ones to this example are:
"datatype": A string representing what type of item this document represents (product, category, brand), so we know how to display it
"brand": The brand(s) that are relevant (categories have multiple brands, products and brands have one each)
"path": The path to a given category (i.e. "Audio Headphones In-Ear" for "Audio > Headphones > In-Ear")
"keywords": Various things that describe the product that don't go anywhere else.
In general, the logic for each step of the search is as follows:
Check to see if we have a match.
If so, filter the results based on that match, and continue parsing the rest of the search terms in the next step.
If not, parse the search terms in the next step.
Each step is something like:
Search for a category
Search for a brand
Search for keywords
So here's how those example searches should play out:
Headphones
Search for a category: +path:headphones +datatype:Category
There are matches (the Headphone category), and no words from the original query are left, so we return it.
Monster Headphones
Search for a category: +(path:monster path:headphones) +datatype:Category
Matches were found for path:headphones and datatype:Category, leaving "Monster" unmatched
Search for a brand: +path:headphones +brand:monster
Matches were found for path:headphones and brand:monster, and no words from the original query are left, so we return all the headphones by Monster.
White Monster Headphones
Search for a category: +(path:monster path:headphones path:white) +datatype:Category
Matches were found for path:headphones, and datatype:Category, leaving "White" and "Monster" unmatched
Search for a brand: +path:headphones +(brand:monster +brand:white)
Matches were found for path:headphones and brand:monster, leaving "White" unmatched
Search keywords: +path:headphones +brand:monster +keywords:white
There are matches, and no words from the original query are left, so we return them.
White Foobar Headphones
Search for a category: +(path:foobar path:headphones path:white) +datatype:Category
Matches were found for path:headphones, and datatype:Category, leaving "White" and "Foobar" unmatched
Search for a brand: +path:headphones +(brand:foobar +brand:white)
Nothing was found, so we continue.
Search keywords: +path:headphones +(keywords:white keywords:foobar)
Matches were found for path:headphones and keywords:white, leaving "Foobar" unmatched
... (continue searching other fields, including product description) ...
There are search terms still unmatched ("Foobar"), return "No results found"
The problem I have is twofold:
I don't want the matches to continue once everything's matched (only products have descriptions, so once it reaches that step we'll never return something that's not a product). I could manage this by using denis's GetHitTerms from here, except that I then end up searching for the first matched term in all subsequent fields until everything matches (i.e. in example #2, I'd have +path:headphones +(brand:headphones brand:monster)).
Despite my example above, my actual search query on the path field looks like +path:headphon +datatype:Taxonomy because I'm mangling it for searching. So I can't take the matched term and just remove that from the original query (because "headphon" != "headphones").
Hopefully that makes it clearer what I'm looking for.
I don't understand your use case, but you sound like you're asking about the BooleanQuery API. You can get the clauses of your query by calling getClauses.
A simple example:
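import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
// Each SHOULD clause is optional: a document matches if it contains either term.
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("field1", "a")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("field1", "b")), BooleanClause.Occur.SHOULD);
BooleanClause[] clauses = bq.getClauses();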
EDIT: maybe you're just asking for a search algorithm. In pseudocode:
generate_query (qs_that_matched, qs_that_didnt_match, level):
    new_query = qs_that_matched AND level:qs_that_didnt_match
    qs_still_unmatched = ...
    qs_which_just_matched = ...
    if qs_still_unmatched != null:
        return generate_query(qs_that_matched AND qs_which_just_matched, qs_still_unmatched, level+1)
    else:
        return qs_that_matched AND qs_which_just_matched
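The same idea, written iteratively as a small plain-Java sketch. It is a toy under stated assumptions and not Lucene code: the `levels` map is a hypothetical stand-in for "run the query against this field and see which terms hit", it must be an ordered map (e.g. a LinkedHashMap) so the fields are tried in the intended order, and the terms are assumed to be already analyzed, so "headphon" would be looked up as "headphon".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy sketch of the level-by-level search; not Lucene code. */
final class LayeredSearchSketch {

    /**
     * @param levels ordered map of field name to the terms that would hit in that
     *               field (hypothetical stand-in for running a real query per level)
     * @param terms  already-analyzed search terms
     * @return the combined query, or empty if some term never matched
     */
    static Optional<String> build(Map<String, Set<String>> levels, List<String> terms) {
        List<String> unmatched = new ArrayList<>(terms);
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Set<String>> level : levels.entrySet()) {
            if (unmatched.isEmpty()) {
                break; // every term is placed, so deeper fields are never searched
            }
            String field = level.getKey();
            List<String> hits = unmatched.stream()
                    .filter(level.getValue()::contains)
                    .collect(Collectors.toList());
            if (hits.isEmpty()) {
                continue; // nothing matched here; carry all terms to the next field
            }
            clauses.add(hits.stream()
                    .map(term -> field + ":" + term)
                    .collect(Collectors.joining(" ", "+(", ")")));
            unmatched.removeAll(hits);
        }
        return unmatched.isEmpty()
                ? Optional.of(String.join(" ", clauses))
                : Optional.empty();
    }
}
```

Because only the still-unmatched terms are passed down, a term that matched at an earlier level is never searched again in later fields, and the loop stops as soon as every term has been placed.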
In the end, I built a QueryTree class and stored the queries in a tree structure. It stores a reference to a function that takes a query, a list of terms to pump into that query, whether it should AND or OR those terms, and a list of children (which represent unique combinations of matching terms).
To perform the next level of searching, I just call Evaluate(Func<string, QueryParser.Operator, Query> newQuery) on the deepest nodes in my tree, with a reference to a function which takes terms and an operator and returns the correct Query for that set of logic. The Evaluate function then tests that new query against the list of unmatched terms that have been passed down to it and the result sets of all ancestral Query objects (by ANDing with the parent, which ANDs with its parent and so on). It then creates children for each set of matching terms, using GetHitTerms, and gives the unmatched terms to the child. Repeat for each level of search.
I suspect that there's a better way to do this - I didn't even look into Bobo that Xodarap mentioned, and I never really got faceted searching (as per denis) working. However, it's working, which means it's time to move on to other aspects of the site.

Searching multiple fields with Lucene

I'm having some trouble with a search I'm trying to implement. I need for a user to be able to enter a search query into a web interface and for the back-end Java to search for the query in a number of fields. An example of this might be best:
Say I have a List containing "Person" objects. Say each object holds two String fields about the person:
FirstName: Jack
Surname: Smith
FirstName: Mary
Surname: Jackson
If a user enters "jack", I need the search to match both objects, the first on FirstName and the second on Surname.
I've been looking at using a MultiFieldQueryParser but can't get the fields set up right. Any help on this or pointing to a good tutorial would be greatly appreciated.
MultiFieldQueryParser is what you want, as you say.
Make sure:
The field names are always used consistently
The same Analyzer is used on both fields, and also on the query parser
You won't find partial words by default, so if you search for jack you won't find jackson. (You can search for jack* in that case.)
Regarding field names, I always set up an enum for them and then use e.g. MyFieldEnum.firstname.name() when passing field names to Lucene. That way the compiler can catch a spelling mistake, and the enum is also a good place to put Javadoc describing what the fields are for and to see the complete list of fields you wish to support in your Lucene documents.
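A minimal sketch of that enum idea in plain Java. The field names are assumptions based on the Person example, and expand only illustrates, roughly, how a single term gets spread over every field as optional clauses; it is not the Lucene parser itself.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Keeps the Lucene field names in one place; the names here are illustrative. */
final class FieldNamesSketch {

    /** Hypothetical field names for the Person documents. */
    enum PersonField { firstname, surname }

    /** All field names as strings, e.g. to hand to a multi-field query parser. */
    static String[] names() {
        return Arrays.stream(PersonField.values())
                .map(Enum::name)
                .toArray(String[]::new);
    }

    /** Spreads one term over every field as optional clauses. */
    static String expand(String term) {
        return Arrays.stream(names())
                .map(name -> name + ":" + term)
                .collect(Collectors.joining(" "));
    }
}
```

Here expand("jack*") gives firstname:jack* surname:jack*, which is the shape of query needed to match both Jack Smith and Mary Jackson.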
