How to search across multiple fields in Lucene using query syntax?

I'm searching a Lucene index and I'm building search queries like
field1:"hello" AND field2:"world"
but I'd also like to search for a value in any field, alongside values in specific fields, in the same query, e.g.
field1:"hello" AND anyField:"world"
Can anyone tell me how I can search across all indexed fields in this way?

Based on the answers I got for this question: Impact of repeat value across multiple fields in Lucene...
I can put the same search term into multiple fields and therefore create an "all" field which I put everything in. This way I can create a query like...
field1:"hello" AND all:"world"
This seems to work very nicely, avoids the need for huge search queries, and apparently has minimal performance impact.

To search multiple fields, use a Boolean (OR) query with one clause per field. The MultiFieldQueryParser will build that for you, but the fields still need to be enumerated. There is no implicit "all" field, but IndexReader.getFieldNames can list the fields that exist in the index.
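As a rough illustration of the "one OR clause per field" idea, the sketch below builds the query string by hand using only the Java standard library. The class AnyFieldQuery and its methods quote and anyOf are hypothetical names made up for this example; they are not part of Lucene, and the field list is assumed to be known up front.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not a Lucene class): expands one phrase into an OR clause per field. */
final class AnyFieldQuery {

    // Wrap the value in double quotes, escaping backslashes and quotes inside it.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds (f1:"value" OR f2:"value" ...) for the given fields.
    static String anyOf(List<String> fields, String value) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        return fields.stream()
                .map(field -> field + ":" + quote(value))
                .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        // Assumed field names, matching the ones used in the question.
        List<String> fields = List.of("field1", "field2", "field3");
        String query = "field1:" + quote("hello") + " AND " + anyOf(fields, "world");
        System.out.println(query);
    }
}
```

The resulting string can then be handed to whichever query parser you already use. With many fields the query grows quickly, which is the trade-off the "all" field approach avoids.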

This might not apply to you, but in Azure Search, which is based on Lucene, using Lucene syntax, I use this:
name:plywood^100 OR plywood
Results with "plywood" in the "name" field are boosted.

Related

Why do we create mapping in elasticsearch while setting up repository?

Okay, I understand from this question why a mapping is needed.
Now I am going through a piece of code that generates the mapping while creating the Elasticsearch repository by pushing a dummy object and then deleting it.
I understand that Elasticsearch can generate mappings, but what is the point of doing so? It does not help with search queries (at least not the regex one I tried, unless you explicitly declare the field as type keyword in your mapping).
I would be thankful if someone can explain this.
Elasticsearch does generate a mapping when you index a document without defining one, but it bases that mapping on the data in the first document. For example, suppose your index has a product-id field and you index it without an explicit mapping. Elasticsearch then generates two data types for this field, text and keyword, when you index product-id as below.
{
"product-id" : "1"
}
What you want depends on your use case. Suppose product-id is a fixed keyword and you only need exact search or aggregations on it, not full-text search. Then you are better off with an explicit mapping that defines it as a keyword field, so that Elasticsearch storage and queries are optimal. You can refer to this Stack Overflow comment for more information.
Bottom line: when you want greater control over how your data is indexed, it is always better to define an explicit mapping than to rely on the default mapping generated by Elasticsearch.

Can we ever have Documents with different fields in a single Lucene index?

This question has cropped up in my mind because, when constructing the output from a query's results, I want to make sure that I extract all the Fields (and not try extracting non-existent ones) from the Documents in the TopDocs... in anticipation of finding indices which contain Documents from an older version of my app.
The interesting/curious thing is that you have a method Document.getFields(). I.e. the Document class, rather than, for example, IndexSearcher or DirectoryReader, is responsible for telling us the Fields used. Theoretically, therefore, you could store and later retrieve Documents with a different set of Fields.
At this stage of my TDD I am just going to test that fields are extracted from the first Document in the TopDocs, and assume that all the others have the same fields.
But are there any use cases where one might have Documents with differing fields?

Elasticsearch, Nested "ANDS" and "ORS"

I am having some difficulty structuring the exact Elasticsearch query that I am looking for, specifically using the java api.
It seems like if I construct a fieldsearch using the java api, I can only use a single field and a single term. If I use a querystring, it looks like I can apply an entire query to a set of fields. What I want to do is apply a specific query to one field, and another query to a different field.
This is confusing I know. This is the type of query I would like to construct
(name contains "foo" or name contains "bar") AND ( date equals today)
I am really loving Elasticsearch for its speed and flexibility, but the docs on http://www.elasticsearch.org/ are kind of tough to parse (I noticed "introduction" and "concepts" have no links, but the API section does). If anyone has some good resources on mastering these queries, I'd love to see them. Thanks!
Sounds like a bool query with two must clauses:
matchQuery("name", "foo bar")
rangeQuery("date").from("2013-02-05").to("2013-02-06")
Does it help?

Lucene custom scoring (Lucene 3.2) involves iterating through all documents in the index - fastest way?

I'm trying to implement a custom scoring formula in Lucene that has nothing to do with tf-idf (so changing just the similarity, for example, will not work).
In order to do this, I need to be able to take my custom Query and generate a score for every document stored in the index - not just the ones that match the terms in the query (since my scoring involves checking what are essentially synonyms, so even if a doc doesn't have the exact Terms, it could still produce a positive score). Is the best way to simply create an IndexReader and call Document d = reader.doc(i) for all docs (as described here), and then generate a score on the spot?
I've been looking around at Lucene's scoring internals, specifically various Scorer and Collector classes, and it appears that what happens (for Lucene 3.2) is a Weight provides a Scorer, which along with the Collector loops through all documents that match the query. Can I utilize this structure in some way, but again get a custom Scorer implementation to consider ALL documents?
If you decide to go for a custom scoring scheme, the proper way is to use a subclass of CustomScoreQuery with getCustomScoreProvider overridden to return your subclass of CustomScoreProvider. The CustomScoreQuery constructor requires a subquery. Here you will want to provide a fast native Lucene Query that will narrow down the result set as much as possible before going through your custom score calculation. You can also arrange to store any number of float values with each of your docs and make those accessible to your custom score provider. You will need to provide an appropriate ValueSourceQuery to the constructor of CustomScoreQuery for each such float value. See the Javadocs on these classes, they are well written. Unfortunately I don't have a Java snippet at hand.
As I understand Lucene, it stores (term, doc) pairs in its index, so that querying is implemented as:
1. Get the documents containing the query terms.
2. Score and sort them.
I've never implemented my own scoring, but I'd look at IndexReader.termDocs first; it seems to implement step 1.
With IndexReader.termDocs you can iterate through a term's posting list, that is, all documents that contain that term. You could use this to build your own query processing on top of Lucene, but then you won't be able to use Query, Similarity and the rest of the built-in scoring machinery.
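To make the posting-list idea concrete, here is a toy inverted index in plain Java. It is not Lucene's API and says nothing about how Lucene is implemented internally; ToyIndex, add and search are names invented for this sketch, and the "score" is simply the number of distinct query terms a document contains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy inverted index (illustration only, not Lucene's API). */
final class ToyIndex {
    // term -> posting list of doc ids, ascending because docs are added in order
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Tokenizes on whitespace and records the new doc id under each term.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each doc at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) list.add(docId);
        }
        return docId;
    }

    // Step 1: walk the posting list of every query term.
    // Step 2: score each hit by how many distinct query terms it contains.
    Map<Integer, Integer> search(String... terms) {
        Set<String> distinct = new HashSet<>();
        for (String t : terms) distinct.add(t.toLowerCase(Locale.ROOT));

        Map<Integer, Integer> scores = new TreeMap<>();
        for (String term : distinct) {
            for (int docId : postings.getOrDefault(term, List.of())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("quick brown fox");  // doc 0
        index.add("lazy brown dog");   // doc 1
        index.add("unrelated text");   // doc 2
        // Doc 2 contains neither term, so it never gets a score.
        System.out.println(index.search("brown", "fox")); // {0=2, 1=1}
    }
}
```

Documents that contain none of the query terms are never visited, which is why a synonym-aware score needs either extra query terms or extra indexed terms to reach them.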
Also, if you are working with synonyms, Lucene has some support for them in the contrib package. Another possible solution, if you haven't tried it already, is to inject synonyms into the documents through an Analyzer (or similar) at index time. That way documents can be returned even if they don't contain the exact query terms.

Is it valid for Hibernate list() to return duplicates?

Is anyone aware of the validity of Hibernate's Criteria.list() and Query.list() methods returning multiple occurrences of the same entity?
Occasionally I find when using the Criteria API, that changing the default fetch strategy in my class mapping definition (from "select" to "join") can sometimes affect how many references to the same entity can appear in the resulting output of list(), and I'm unsure whether to treat this as a bug or not. The javadoc does not define it, it simply says "The list of matched query results." (thanks guys).
If this is expected and normal behaviour, then I can de-dup the list myself, that's not a problem, but if it's a bug, then I would prefer to avoid it, rather than de-dup the results and try to ignore it.
Anyone got any experience of this?
Yes, getting duplicates is perfectly possible if you construct your queries so that this can happen. See for example Hibernate CollectionOfElements EAGER fetch duplicates elements
I also started noticing this behavior in my Java API as it started to grow. Glad there is an easy way to prevent it. As a matter of habit, I've started appending:
.setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY)
To all of my criteria that return a list. For example:
List<PaymentTypeAccountEntity> paymentTypeAccounts = criteria()
.setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY)
.list();
If you have an object which has a list of sub objects on it, and your criteria joins the two tables together, you could potentially get duplicates of the main object.
One way to ensure that you don't get duplicates is to use a DistinctRootEntityResultTransformer. The main drawback to this is if you are using result set buffering/row counting. The two don't work together.
I had the exact same issue with Criteria API. The simple solution for me was to set distinct to true on the query like
CriteriaQuery<Foo> query = criteriaBuilder.createQuery(Foo.class);
query.distinct(true);
Another option that came to mind earlier is to simply copy the resulting list into a Set, which by definition holds each object only once.
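A minimal sketch of the Set approach, using only the standard library. The helper name distinct and the string rows are placeholders for this example; with real entities the result depends on how equals and hashCode are defined on them. A LinkedHashSet is used here so that the original result order is preserved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class Dedup {
    // Removes repeated elements while keeping first-seen order.
    static <T> List<T> distinct(List<T> results) {
        return new ArrayList<>(new LinkedHashSet<>(results));
    }

    public static void main(String[] args) {
        // Stand-in for a list() result where a join repeated the root row.
        List<String> rows = List.of("order-1", "order-1", "order-2", "order-1");
        System.out.println(distinct(rows)); // [order-1, order-2]
    }
}
```

Note that this de-duplicates in memory after the query has run, so it does not reduce the number of rows fetched from the database.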
