Lucene : getting same length results when querying - java

I've tried every query type in Lucene and none of them seems to work. What I'm trying to do is simple: I want to query the index and get only exact matches. By exact I mean that the results should have the same text as the search term AND the same length, as you would get from an equality comparison in a database. For example, when I search for jodie foster, one of the results is: List of awards and nominations received by Jodie Foster. I don't want results that contain the search term; I want results that are exactly the search term.
First of all, this is how I'm building the lucene index:
IndexWriterConfig luceneConfig = new IndexWriterConfig(new StandardAnalyzer());
Path path = Paths.get("C:/Users/i_l_g/Desktop/DBpedia/qls_labels");
Directory dir = FSDirectory.open(path);
IndexWriter writer = new IndexWriter(dir, luceneConfig);
while (rs.next()) {
Document doc = new Document();
doc.add(new Field("entity", rs.getString("entity"), TextField.TYPE_STORED));
doc.add(new Field("label", rs.getString("label"), TextField.TYPE_STORED));
writer.addDocument(doc);
}
rs is a ResultSet; I'm just extracting rows from a table and indexing them with Lucene.
Next, I tried querying this index with every type of query, but I get the same set of results every time, almost as if I hadn't changed the query type at all. My last attempt used a PhraseQuery:
StandardAnalyzer analyzer = new StandardAnalyzer();
PhraseQuery.Builder builder = new PhraseQuery.Builder();
PhraseQuery q;
builder.add(new Term("label","jodie"));
builder.add(new Term("label","foster"));
builder.setSlop(0);
q=builder.build();
This is the set of results I'm getting every single time, if it could be of any help:
Found 5 hits.
1. Jodie Foster
2. Alicia Christian "Jodie" Foster
3. Jodie Foster filmography
4. Impress Jodie Foster
5. List of awards and nominations received by Jodie Foster
I didn't think this would take so long. I've been trying to solve this issue for 2 days now, I've visited dozens of links, and nobody else seems to have had this problem. Please help.

Related

Apache Lucene one-to-many query

I am trying to build a Lucene query that will work with the following one-to-many relationship. I'm trying to do this in Lucene 5.5, but if I can't, I'll upgrade the project to a newer version.
Say I have two objects like so: one Company that has multiple Items.
Company (one)
String name
String address_state
String address_street
...
Items items
Items (many)
Int item_id
String item_name
...
Int item_price
How would I search for Companies in a particular state that have an item with a particular name and a price below a certain point? For instance, search for companies in CA that have an item named "Phone" with a price below 150?
I only have around 300k companies but around 5 million items, so I'd rather filter by company first if possible.
To anyone out there, thanks.
I would recommend looking at the block join approach in Lucene (it is already available in version 5.5).
Here is some example code that should give an idea of how to do this:
final Document item1 = new Document();
item1.add(new TextField("item_name", "item1", Field.Store.YES));
item1.add(new TextField("type", "item", Field.Store.YES));
final Document item2 = new Document();
item2.add(new TextField("item_name", "item2", Field.Store.YES));
item2.add(new TextField("type", "item", Field.Store.YES));
final Document company1 = new Document();
company1.add(new TextField("name", "company1", Field.Store.YES));
company1.add(new TextField("type", "company", Field.Store.YES));
writer.addDocuments(Arrays.asList(new Document[] {item1, item2, company1}));
In this example, I create 2 items and attach them to a company. Mind the order of the documents passed to addDocuments: child documents (items in your case) must come before the parent document (the company in your case). You may have as many items as you want in a block.
Later, you can query very efficiently using the query types from this package.
An example query could look something like this:
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new BooleanClause(new TermQuery(new Term("item_name", "item1")), BooleanClause.Occur.MUST));
BooleanQuery childQuery = builder.build();
ToParentBlockJoinQuery parentQuery =
new ToParentBlockJoinQuery(
childQuery,
new QueryBitSetProducer(new TermQuery(new Term("type", "company"))),
ScoreMode.Avg);
This query finds all companies that have an item named item1. You should be able to extend it fairly easily as well.
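To make the ordering requirement concrete, here is a minimal plain-Java sketch of the idea behind a block join. It is only a model, not Lucene code: the class BlockJoinModel, the method toParents and the doc ids are all made up for illustration. It assumes what is described above, namely that each block is indexed with the children first and the parent last, so the parent of a matching child is the first parent doc id at or after the child's id.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Plain-Java model of the block layout; none of these names are Lucene APIs. */
class BlockJoinModel {

    /**
     * Maps matching child doc ids to their parent doc ids.
     * The BitSet has one bit set per parent document, and every child
     * is assumed to come before its parent in the index.
     */
    static List<Integer> toParents(BitSet parents, int[] matchingChildren) {
        List<Integer> result = new ArrayList<>();
        for (int child : matchingChildren) {
            // The parent of a child is the first parent bit at or after the child id.
            int parent = parents.nextSetBit(child);
            if (parent >= 0 && !result.contains(parent)) {
                result.add(parent);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Doc ids 0 and 1 are items of the company at doc id 2;
        // doc ids 3, 4 and 5 are items of the company at doc id 6.
        BitSet parents = new BitSet();
        parents.set(2);
        parents.set(6);
        // Suppose the child query matched items 1, 4 and 5.
        System.out.println(toParents(parents, new int[] {1, 4, 5})); // prints [2, 6]
    }
}
```

In this model, adding the company before its items would attribute those items to the next company in the index, which is why the order in addDocuments matters.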

How to get exact search using lucene

When we search for "prd gem", it returns all results whose names contain prd gem.
But when we search for only "prd", it returns all results with prd in them, such as prd, prd gem, prd time, etc. Why isn't this an exact search?
The code is as follows:
luceneQuery = queryBuilder.phrase()
.onField("productId")
.andField("productName").andField("refId")
.sentence(searchText)
.createQuery();
Exact search works fine for names that contain a space: if I search for "Prd Gem", it shows only the one product named "Prd Gem". But when I search for a single word like "prd", exact search does not work: it shows all products, such as "prd" and "prd gem".
What changes are needed in the code above to get an exact match in that case as well?
That's because you are tokenizing the data in the Lucene index.
By default, Lucene breaks strings into tokens in order to enable, and speed up, searches on individual words.
I assume you are using the latest Hibernate Search. Can you try annotating the "productName" field with the following:
@Field(name = "productName", index = Index.YES, analyze = Analyze.NO, norms = Norms.NO)
The analyze = Analyze.NO part should disable this feature.
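As a rough illustration of the difference, here is a plain-Java sketch that models an analyzed field versus a non-analyzed one. It does not call Lucene or Hibernate Search at all: the class ExactMatchModel and its methods are hypothetical, and the "analyzer" is just lowercasing plus splitting on whitespace, which is an assumption and much simpler than a real analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of analyzed vs. non-analyzed matching; not Lucene or Hibernate Search code. */
class ExactMatchModel {

    /** Rough stand-in for an analyzer: lowercase, then split on whitespace. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Analyzed field: matches if the value contains the query tokens as a consecutive run. */
    static boolean analyzedPhraseMatch(String fieldValue, String query) {
        return Collections.indexOfSubList(tokenize(fieldValue), tokenize(query)) >= 0;
    }

    /** Non-analyzed field: the whole value is a single term, so only an identical value matches. */
    static boolean keywordMatch(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    public static void main(String[] args) {
        System.out.println(analyzedPhraseMatch("prd gem", "prd")); // true: "prd" is one of the tokens
        System.out.println(keywordMatch("prd gem", "prd"));        // false: the values differ
        System.out.println(keywordMatch("prd", "prd"));            // true: identical value
    }
}
```

Note that in this model the non-analyzed comparison is also case-sensitive, so "Prd" would not match "prd" unless the values are normalized the same way when indexing and when querying.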

Apache Lucene Sort Issues with GAE-Lucene addDocuments

I have been trying to get Sort working with Apache Lucene on Google App Engine. I am using https://github.com/UltimaPhoenix/luceneappengine to integrate Lucene with GAE. Here is what I am doing:
I have a list of Documents, which I put into Lucene through the IndexWriter's addDocuments() method.
for (Object object : objects) {
Document doc = new Document();
doc.add(new Field("id", generateDocId(object), idType));
doc.add(new NumericDocValuesField("sortLong",<Long Value>));
documents.add(doc);
}
I am basically aggregating all the documents into a list and writing them to the index using
IndexWriter writer = getWriter();
writer.addDocuments(documents);
I am trying to query a few documents, based on some Query as well as Sort
Sort sort = new Sort(new SortField("sortLong", SortField.Type.LONG, true));
TopFieldDocs docs = searcher.search(new MatchAllDocsQuery(),2000,sort);
Problem:
When I use addDocuments() to bulk index the documents, my sorted queries do not return the data in the correct sort order. However, if I index each document with addDocument(), the sorted queries work correctly.
This has led me to deduce that there is something inherently wrong with addDocuments(). The sort won't work unless I open the IndexWriter, call addDocument() and close the IndexWriter, which I am unwilling to do because I have many thousands of records to index.
Is there a solution to this problem? Or is it a known defect?

How to do a Multi field - Phrase search in Lucene?

Title asks it all... I want to do a multi field - phrase search in Lucene.. How to do it ?
for example :
I have fields as String s[] = {"title","author","content"};
I want to search harry potter across all fields.. How do I do it ?
Can someone please provide an example snippet ?
Use MultiFieldQueryParser; it's a QueryParser that constructs queries to search multiple fields.
Another way is to create a BooleanQuery consisting of one TermQuery per field (in your case, one PhraseQuery per field).
A third way is to copy the content of the other fields into your default content field.
Add
Generally speaking, querying on multiple fields isn’t the best practice for user-entered queries. More commonly, all words you want searched are indexed into a contents or keywords field by combining various fields.
Update
Usage:
Query query = MultiFieldQueryParser.parse(Version.LUCENE_30, new String[] {"harry potter","harry potter","harry potter"}, new String[] {"title","author","content"},new SimpleAnalyzer());
IndexSearcher searcher = new IndexSearcher(...);
TopDocs hits = searcher.search(query, 10); // top 10 hits; the old Hits class was removed in Lucene 3.0
The MultiFieldQueryParser will resolve the query in this way: (See javadoc)
Parses a query which searches on the fields specified. If x fields are specified, this effectively constructs:
(field1:query1) (field2:query2) (field3:query3)...(fieldx:queryx)
Hope this helps.
Some more intensive googling turned up this:
http://lucene.472066.n3.nabble.com/Phrase-query-on-multiple-fields-td2292312.html.
Since it is the latest and best, I guess I'll go with that approach. Either way, it might help someone who is looking for the same thing I am.
You need to use MultiFieldQueryParser and wrap the query string in escaped double quotes so that it is parsed as a phrase. I have tested it with Lucene 8.8.1 and it works like magic.
String queryStr = "harry potter";
queryStr = "\"" + queryStr.trim() + "\"";
Query query = new MultiFieldQueryParser(new String[]{"title","author","content"}, new StandardAnalyzer()).parse(queryStr);
System.out.println(query);
It will print:
(title:"harry potter") (author:"harry potter") (content:"harry potter")
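If the phrase comes from user input, it may itself contain double quotes or backslashes, which would break the quoting above. Here is a small plain-Java helper as a sketch of one way to handle that. The class PhraseQuoting and the method quotePhrase are hypothetical names, and the sketch assumes that the classic query parser treats a backslash as the escape character inside a quoted phrase.

```java
/** Hypothetical helper for building a quoted phrase; it does not depend on Lucene. */
class PhraseQuoting {

    /** Trims the input, backslash-escapes embedded backslashes and double quotes, then wraps it in double quotes. */
    static String quotePhrase(String input) {
        String escaped = input.trim()
                .replace("\\", "\\\\")  // escape backslashes first so later escapes are not doubled
                .replace("\"", "\\\""); // then escape embedded double quotes
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quotePhrase("  harry potter ")); // prints "harry potter"
    }
}
```

The result could then be passed to parse(...) in place of the hand-built queryStr.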

Lucene: queries and docs with multiple fields

I have a collection of documents consisting of several fields, and I need to perform queries with several terms coming from multiple fields.
What do you suggest me to use ? MultiFieldQueryParser or MultiPhraseQuery ?
thanks
How about BooleanQuery?
http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html
Choice of Analyzer
First of all, watch out which analyzer you are using. I was stumped for a while only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post around the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See my answer here, related to this answer.
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);
