Lucene query with phrase order slop and OR clause

I need to build a Lucene query that matches both "convert int to string" and "convert integer to string". The matched text may also have other words between the terms, for example "How could I convert a proper int to a well formatted string". I tried the following:
Query query = new SpanNearQuery(new SpanQuery[] {
new SpanTermQuery(new Term("title", "convert")),
new SpanTermQuery(new Term("title", "int")),
new SpanTermQuery(new Term("title", "string"))
},
50,
true);
return query;
and the following:
MultiPhraseQuery mpq = new MultiPhraseQuery();
mpq.setSlop(50);
mpq.add(new Term("title","convert"));
mpq.add(new Term[]{new Term("title","int"),new Term("title", "integer")});
mpq.add(new Term("title","string"));
return mpq;
and also the following:
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("title","convert")), Occur.MUST);
BooleanQuery idFilter = new BooleanQuery();
idFilter.setMinimumNumberShouldMatch(1);
idFilter.add(new TermQuery(new Term("title", "int")), BooleanClause.Occur.SHOULD);
idFilter.add(new TermQuery(new Term("title", "integer")), BooleanClause.Occur.SHOULD);
bq.add(idFilter, BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("title","string")), Occur.MUST);
return bq;
None of them does what I need. Can someone help me write a valid query that both respects the order of the terms and lets me specify an "OR" condition? Thanks.

Your first attempt is closest to the mark. The stumbling block there is how to handle int vs integer.
Two approaches come to mind. The best one might be to add a SynonymFilter to your analyzer. That lets you define a synonym that converts integer to int at index time, which reduces the need for more complex query logic.
As for handling it strictly in the query construction, I don't know of a way to wrap a boolean query in a span query. A wildcard query, or more precisely a prefix query, seems like it would serve the purpose, and it can be used in a SpanNearQuery by wrapping it in a SpanMultiTermQueryWrapper, something like:
Query query = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("title", "convert")),
    // the prefix query int* covers both "int" and "integer"
    new SpanMultiTermQueryWrapper<PrefixQuery>(new PrefixQuery(new Term("title", "int"))),
    new SpanTermQuery(new Term("title", "string"))
},
50,    // slop
true); // terms must appear in order
return query;
int* is not quite identical to int OR integer (the prefix also matches terms such as interface or internal), but hopefully it's close enough.
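For an exact int OR integer at the middle position, Lucene's span package also has SpanOrQuery, which takes several SpanQuery clauses and is itself a SpanQuery, so it can sit inside a SpanNearQuery. The sketch below does not call Lucene at all. It is a plain-Java model, under simplifying assumptions, of what such an ordered query with slop would accept: one term from each slot, in order, with at most maxSlop other tokens inside the matched span. The class and method names are hypothetical, and the slop arithmetic is an approximation of Lucene's.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Plain-Java model only; none of these names are Lucene classes.
class OrderedNearSketch {

    /**
     * Returns true if tokens contain one term from each slot, in slot order,
     * with at most maxSlop non-matching tokens inside the matched span.
     */
    static boolean matches(List<String> tokens, List<Set<String>> slots, int maxSlop) {
        if (slots.isEmpty()) {
            return false;
        }
        for (int start = 0; start < tokens.size(); start++) {
            if (!slots.get(0).contains(tokens.get(start))) {
                continue;
            }
            int pos = start;
            int slot = 1;
            // Take the earliest occurrence of each following slot; this gives
            // the shortest span that begins at 'start'.
            while (slot < slots.size()) {
                int next = -1;
                for (int i = pos + 1; i < tokens.size(); i++) {
                    if (slots.get(slot).contains(tokens.get(i))) {
                        next = i;
                        break;
                    }
                }
                if (next < 0) {
                    break;
                }
                pos = next;
                slot++;
            }
            if (slot == slots.size()) {
                int gaps = (pos - start + 1) - slots.size();
                if (gaps <= maxSlop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> slots = List.of(
                Set.of("convert"), Set.of("int", "integer"), Set.of("string"));
        List<String> title = Arrays.asList(
                "how could i convert a proper int to a well formatted string".split(" "));
        System.out.println(matches(title, slots, 50));
    }
}
```

The middle slot holds both alternatives, which is the part the prefix query only approximates.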

Related

How to search full text in lucene 4.10

I want to search PDF text for a phrase such as "Labor Law". But the results include every file that contains the words "Labor" and "Law", not just the exact phrase. Could someone please check my code below:
EnglishAnalyzer analyzer = new EnglishAnalyzer();
analyzer.setVersion(Version.LATEST);
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse("Labor Law");
Directory indexDirectory = FSDirectory.open(new File(indexLucencePath));
DirectoryReader dirReader = DirectoryReader.open(indexDirectory);
indexSearcher = new IndexSearcher(dirReader);
ScoreDoc[] queryResults = indexSearcher.search(query, numOfResults).scoreDocs;
List<IndexItem> results = new ArrayList<IndexItem>();
for (ScoreDoc scoreDoc : queryResults) {
Document doc = indexSearcher.doc(scoreDoc.doc);
results.add(new IndexItem(doc.get(IndexItem.TITLE), doc.get(IndexItem.CONTENT)));
}

Try a phrase query:
Query query = parser.parse("\"Labor Law\"");
Or require that all terms be present:
Query query = parser.parse("+Labor +Law");
You can also build the query yourself, like this:
BooleanQuery query = new BooleanQuery();
// TermQuery is not analyzed, so the terms are given in the lowercased form the analyzer indexes
TermQuery clause1 = new TermQuery(new Term("content", "labor"));
TermQuery clause2 = new TermQuery(new Term("content", "law"));
query.add(new BooleanClause(clause1, BooleanClause.Occur.MUST));
query.add(new BooleanClause(clause2, BooleanClause.Occur.MUST));

Several analyzers are available; try different ones against your requirement (see "Comparison of Lucene Analyzers"). The question "Lucene: Multi-word phrases as search terms" may also help.

Lucene OR search using Boolean query

I have an index with multiple fields, one of which is a string field in which I store category names for a product... such as "Electronics", "Home", "Garden", etc
new StringField("category_name", categoryName, Field.Store.YES)); //categoryName is a value such as "Electronics"
I am performing a Boolean query to find products by name, price, and category but I'm not sure how to do an OR search so that I can query for two categories at the same time.
My current query looks like this:
String cat = "Electronics";
TermQuery catQuery = new TermQuery(new Term("category_name", cat));
bq.add(new BooleanClause(catQuery, BooleanClause.Occur.MUST)); // where "bq" is the boolean query I am adding to, I tried .SHOULD but that didn't help either
This works fine for a single-category search, but I am not sure how to search for "Electronics OR Home", which would be two categories.

You can write it like this:
BooleanQuery categoryQuery = new BooleanQuery();
TermQuery catQuery1 = new TermQuery(new Term("category_name", "Electronics"));
TermQuery catQuery2 = new TermQuery(new Term("category_name", "Home"));
categoryQuery.add(new BooleanClause(catQuery1, BooleanClause.Occur.SHOULD));
categoryQuery.add(new BooleanClause(catQuery2, BooleanClause.Occur.SHOULD));
bq.add(new BooleanClause(categoryQuery, BooleanClause.Occur.MUST));

Lucene: IndexSearcher.search() causes java heap space error on very large database

I have a very large database (approximately 30 million records, each with at least 26 fields) which I have indexed with Apache Lucene Java.
I am constructing a query from two fields. Each search term could appear in any one of nine fields, and I want my query to return a Document if both of the search terms appear in any of the relevant fields in the Document. The query is structured like so:
private Query CreateQuery(String theSearchTerm, String theField) throws ParseException
{
StandardAnalyzer theAnalyzer = new StandardAnalyzer(Version.LUCENE_35);
Query q;
QueryParser qp = new QueryParser(Version.LUCENE_35, theField, theAnalyzer);
qp.setDefaultOperator(QueryParser.Operator.AND);
qp.setAllowLeadingWildcard(true);
q = qp.parse(theSearchTerm);
return q;
}
public ScoreDoc[] RunTheQuery(String searchTerm1, String searchTerm2) throws IOException, ParseException
{
Directory theIndex = new SimpleFSDirectory(new File("C:\\MyDirectory"));
IndexSearcher theSearcher = new IndexSearcher(IndexReader.open(theIndex));
BooleanQuery theTopLevelBooleanQuery = new BooleanQuery();
BooleanQuery fields1 = new BooleanQuery();
BooleanQuery fields2 = new BooleanQuery();
BooleanQuery fields3 = new BooleanQuery();
BooleanQuery fields4 = new BooleanQuery();
BooleanQuery fields5 = new BooleanQuery();
BooleanQuery fields6 = new BooleanQuery();
BooleanQuery fields7 = new BooleanQuery();
BooleanQuery fields8 = new BooleanQuery();
BooleanQuery fields9 = new BooleanQuery();
BooleanQuery innerQuery = new BooleanQuery();
fields1.add(CreateQuery(searchTerm1, param1), BooleanClause.Occur.MUST);
fields1.add(CreateQuery(searchTerm2, param2), BooleanClause.Occur.MUST);
fields2.add(CreateQuery(searchTerm1, param3), BooleanClause.Occur.MUST);
fields2.add(CreateQuery(searchTerm2, param4), BooleanClause.Occur.MUST);
fields3.add(CreateQuery(searchTerm1, param5), BooleanClause.Occur.MUST);
fields3.add(CreateQuery(searchTerm2, param6), BooleanClause.Occur.MUST);
fields4.add(CreateQuery(searchTerm1, param7), BooleanClause.Occur.MUST);
fields4.add(CreateQuery(searchTerm2, param8), BooleanClause.Occur.MUST);
fields5.add(CreateQuery(searchTerm1, param9), BooleanClause.Occur.MUST);
fields5.add(CreateQuery(searchTerm2, param10), BooleanClause.Occur.MUST);
fields6.add(CreateQuery(searchTerm1, param11), BooleanClause.Occur.MUST);
fields6.add(CreateQuery(searchTerm2, param12), BooleanClause.Occur.MUST);
fields7.add(CreateQuery(searchTerm1, param13), BooleanClause.Occur.MUST);
fields7.add(CreateQuery(searchTerm2, param14), BooleanClause.Occur.MUST);
fields8.add(CreateQuery(searchTerm1, param15), BooleanClause.Occur.MUST);
fields8.add(CreateQuery(searchTerm2, param16), BooleanClause.Occur.MUST);
fields9.add(CreateQuery(searchTerm1, param17), BooleanClause.Occur.MUST);
fields9.add(CreateQuery(searchTerm2, param18), BooleanClause.Occur.MUST);
innerQuery.add(fields1, BooleanClause.Occur.SHOULD);
innerQuery.add(fields2, BooleanClause.Occur.SHOULD);
innerQuery.add(fields3, BooleanClause.Occur.SHOULD);
innerQuery.add(fields4, BooleanClause.Occur.SHOULD);
innerQuery.add(fields5, BooleanClause.Occur.SHOULD);
innerQuery.add(fields6, BooleanClause.Occur.SHOULD);
innerQuery.add(fields7, BooleanClause.Occur.SHOULD);
innerQuery.add(fields8, BooleanClause.Occur.SHOULD);
innerQuery.add(fields9, BooleanClause.Occur.SHOULD);
theTopLevelBooleanQuery.add(innerQuery, BooleanClause.Occur.MUST);
TopScoreDocCollector collector = TopScoreDocCollector.create(200, true);
//Heap space error occurs here
theSearcher.search(theTopLevelBooleanQuery, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
return hits;
}
My problem is that when I call the IndexSearcher.search() method, the java.exe process on the server (Windows Server 2003 R2) consumes more than 540 MB, which causes a java heap space error. For completeness, the java app is running on a web server (currently Oracle Glassfish, although I'm looking to move to Apache Tomcat).
Does anyone have an idea for how to stop this heap space error? A StackOverflow post (http://stackoverflow.com/questions/7259736/cant-open-lucene-index-java-heap-space) seems to address a similar problem, but doesn't really give a detailed answer.
Is the only answer to increase the amount of memory that the Java process can use? Is the only answer to write a new searcher, in which case can anyone recommend a good article about light weight searchers?
Is there a way of solving this issue by modifying the above code?
Any help would be gratefully received,
Thanks,
Rik

You can increase the Java heap size like this:
java -Xmx6g myprogram
See also the post "Increase heap size in Java" and the IBM SDK for Java documentation.
Lucene: Multi-word phrases as search terms

I'm trying to make a searchable phone/local business directory using Apache Lucene.
I have fields for street name, business name, phone number etc. The problem that I'm having is that when I try to search by street where the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I try to search with just one word, e.g 'crescent', I get all the results that I want.
I'm indexing the data with the following:
String LocationOfDirectory = "C:\\dir\\index";
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
w.close();
My searches work like this:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:
String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);
However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and #), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and # are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.
I'm beginning to go slightly mad, does anyone know what I'm doing wrong?

You don't get your documents back because you index with StandardAnalyzer, which splits the text into tokens, lowercases them, and removes stop words. The only term indexed for your example is therefore 'crescent'. Wildcard queries are not analyzed, so 'the' stays in the query as a mandatory part and the query term never matches an indexed token. The same goes for phrase queries built by hand in your scenario.
KeywordAnalyzer is probably not suitable for your use case because it treats the whole field content as a single token. You can use SimpleAnalyzer for the street field: it splits the input on all non-letter characters and lowercases the resulting tokens. You can also consider WhitespaceAnalyzer combined with LowerCaseFilter. Try the different options and work out which works best for your data and users.
Also, you can use different analyzers per field (e.g. with PerFieldAnalyzerWrapper) if changing analyzer for that field breaks other searches.
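The mismatch described above can be modelled without Lucene. The sketch below is a toy stand-in for an analyzer that lowercases and drops stop words, as the answer describes for the Lucene 3.x StandardAnalyzer. The stop list and all names are assumptions made for illustration; this is not Lucene's tokenizer or its real stop-word set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of "lowercase + remove stop words"; not a Lucene class.
class StopWordSketch {

    // Hypothetical stop list, much shorter than any real one.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of");

    /** Splits on non-letter/non-digit characters, lowercases, drops stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("the crescent");
        System.out.println("indexed terms: " + indexed);
        // An unanalyzed query term is compared as-is against the indexed terms.
        System.out.println("raw term found: " + indexed.contains("the crescent"));
        // Running the query text through the same analysis yields terms that do exist.
        System.out.println("analyzed query found: " + indexed.containsAll(analyze("The Crescent")));
    }
}
```

The raw string "the crescent" is never among the indexed terms, while the same text passed through the same analysis is, which is why a query built from analyzed text behaves differently from a hand-built one.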

I found that my attempt to build a query without a QueryParser was not working, so I stopped constructing queries by hand and used a QueryParser instead. All of the recommendations I saw online said to use the same Analyzer in the QueryParser as during indexing, so I built the QueryParser with a StandardAnalyzer.
This works for this example because the StandardAnalyzer removes the word "the" from the street "the crescent" both during indexing and when the query is parsed, so the query never asks for a term that isn't in the index.
However, a search for "Grove Road" exposes a problem with the out-of-the-box behaviour: the query returns every result containing either "Grove" OR "Road". This is easily fixed by setting the QueryParser's default operator to AND instead of OR.
In the end, the correct solution was the following:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
qp.setDefaultOperator(QueryParser.Operator.AND);
Query q = qp.parse("grove road");
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

@RikSaunderson's solution for finding documents in which all subqueries of a query must occur still works with Lucene 9.
QueryParser queryParser = new QueryParser(LuceneConstants.CONTENTS, new StandardAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);

If you want the street to match as exact words, you could index the field "Street" as NOT_ANALYZED, which keeps the stop word "the".
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.NOT_ANALYZED));

There is no need to use any Analyzer here, because Hibernate implicitly uses StandardAnalyzer, which splits the words on white space. The solution is to set analyze to NO; the field is then indexed as a single term, so a multi-word phrase search works automatically.
@Column(name="skill")
@Field(index=Index.YES, analyze=Analyze.NO, store=Store.NO)
@Analyzer(definition="SkillsAnalyzer")
private String skill;

MongoDb's $set equivalent in its java Driver

Is there a way to modify the value of one of the keys in MongoDB via its Java driver? I tried the following:
someCollection.update(DBObject query, DBObject update);
someCollection.findAndModify(DBObject query, DBObject update);
But both functions completely replace the matched document with the update document. How can I update only the value of a particular key, as $set does in the mongo shell (other than building a completely new document with all fields copied and one field changed)?

BasicDBObject carrier = new BasicDBObject();
BasicDBObject query = new BasicDBObject();
query.put("YOUR_QUERY_STRING", YOUR_QUERY_VALUE);
BasicDBObject set = new BasicDBObject("$set", carrier);
carrier.put("a", 6);
carrier.put("b", "wx1");
myColl.updateMany(query, set);
This should work; the accepted answer above is not right.

Try something like this:
BasicDBObject set = new BasicDBObject("$set", new BasicDBObject("age", 10));
set.append("$set", new BasicDBObject("name", "Some Name"));
someCollection.update(someSearchQuery, set);
Also look at this example.

None of the solutions mentioned above worked for me. I realized that the query should be of type Document and not BasicDBObject:
Document set = new Document("$set", new Document("firstName","newValue"));
yourMongoCollection.updateOne(new Document("_id",objectId), set);
Where "yourMongoCollection" is of type "MongoCollection" and "objectId" of type "ObjectId"

The previous answer pointed me in the right direction, but the code to add a 2nd object to the update did not work for me. The following did:
BasicDBObject newValues = new BasicDBObject("age", 10);
newValues.append("name", "Some Name");
BasicDBObject set = new BasicDBObject("$set", newValues);
collection.update(someSearchQuery, set);
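Why the second append("$set", ...) in the earlier answer loses the first field can be seen with an ordinary map. The sketch below assumes that the update object behaves like a java.util.Map keyed by operator name, which is how the legacy BasicDBObject is commonly described; it uses only LinkedHashMap and does not touch the MongoDB driver. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Map-based model of an update document; no MongoDB classes involved.
class SetUpdateSketch {

    /** Two puts under the same "$set" key: the second replaces the first. */
    static Map<String, Object> overwritten() {
        Map<String, Object> update = new LinkedHashMap<>();
        update.put("$set", Map.of("age", 10));
        update.put("$set", Map.of("name", "Some Name")); // "age" is lost here
        return update;
    }

    /** One "$set" key whose value carries every field to change. */
    static Map<String, Object> combined() {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("age", 10);
        fields.put("name", "Some Name");
        Map<String, Object> update = new LinkedHashMap<>();
        update.put("$set", fields);
        return update;
    }

    public static void main(String[] args) {
        System.out.println("overwritten: " + overwritten());
        System.out.println("combined:    " + combined());
    }
}
```

Under that assumption, the working version above is the combined() shape: all fields go into one inner object, and that object is attached to "$set" once.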

First, unless I want to reconfigure/reformat/"re-type" my values I'd go only with findAndModify and not update.
Here is a working example for copy-and-paste purposes (fill in the placeholders):
Boolean updateValue(DB db, String id, String key, Object newValue)
{
DBCollection collection = db.getCollection(<collection name>);
// Identify your required document (id, key, etc...)
DBObject query = new BasicDBObject("_ID",<ID or key value>);
DBObject update = new BasicDBObject("$set", new BasicDBObject(key, newValue));
// These flags guarantee that you get the updated result
DBObject result = collection.findAndModify(query, null, null, false, update,true, true);
//Just for precaution....
if(result == null)
return false;
return result.get(key).equals(newValue);
}

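According to the documentation, the aggregation stage $set is an alias for $addFields, so you can use that (note that an aggregation by itself only changes the documents it returns, not the stored ones):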
var iterable = collection.aggregate(Arrays.asList(
Aggregates.addFields(new Field("foo", "bar"))
));
