Lucene searching by numeric values - java

I'm building a Java Lucene-based search system that, on addition, adds a certain number of meta-fields, one of which is a sourceId field, which denotes where the entry came from.
I'm now trying to retrieve all documents from a particular source, but the index doesn't appear to be able to find them. However, if I search for a wildcard value, the returned documents all have the correct value for this field.
The Lucene query I'm using is quite simple, basically index-source-id:1, but it fails to return any hits. If I search for content:a* I get dozens of documents, all of which, when asked, return the value 1 for the index-source-id field, which is correct.
Any ideas?

I have only worked with the PHP port, but have you checked which text analyzer you are using? This FAQ seems to indicate that, as in the PHP version, you need to use a different one that doesn't remove digits.
You can find a list of analyzers here
Just to be sure, you have set the id to be indexable?

Related

Lucene does not work for MUSTNOT Boolean query

I created a Lucene index for a set of data and am trying to retrieve results from it.
When I do a boolean query with SHOULD, Lucene returns the expected result.
e.g.: (title:"america")
On the other hand, when I do a MUST_NOT query, it returns empty results even though a lot of the data satisfies this criterion.
(-title:"america")
I think I am making some silly mistake, but I have not been able to figure it out so far. Could someone please give me some pointers?
Understood the issue. I need to combine MUST_NOT with at least one other, non-negated clause.
Quote from Lucene in Action (https://www.bookdepository.com/Lucene-in-Action-Erik-Hatcher/9781933988177):
Placing a NOT in front of a term excludes documents matching the following term.
Negating a term must be combined with at least one non-negated term to return documents; in other words, it isn’t possible to use a query like NOT term to find all documents that don’t contain a term.
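To see why a lone negated clause returns nothing, here is a toy model in plain Java. It is not Lucene code: the class name, the tiny title index, and the document ids are all made up for illustration. The only point it makes is that a prohibited term can remove documents from a candidate set but never contributes any. In Lucene itself the usual workaround is, as far as I know, to pair the negation with a match-all clause, e.g. *:* -title:"america" in query syntax, or a MatchAllDocsQuery with MUST next to the MUST_NOT clause.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of how prohibited clauses are applied; NOT real Lucene code.
class NegationSketch {
    // Hypothetical inverted index for the "title" field: term -> ids of documents containing it.
    static final Map<String, Set<Integer>> TITLE_INDEX = Map.of(
            "america", Set.of(1, 3),
            "europe", Set.of(2),
            "asia", Set.of(4));

    // Hypothetical set of every document id in the index (what a match-all clause would select).
    static final Set<Integer> ALL_DOCS = Set.of(1, 2, 3, 4);

    // Prohibited terms only remove ids from the candidate set; they never add any.
    static Set<Integer> search(Set<Integer> candidates, List<String> prohibitedTerms) {
        Set<Integer> hits = new TreeSet<>(candidates);
        for (String term : prohibitedTerms) {
            hits.removeAll(TITLE_INDEX.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // -title:america on its own: there is no positive clause, so nothing to subtract from.
        System.out.println(search(Set.of(), List.of("america")));   // prints []
        // *:* -title:america: select everything first, then exclude.
        System.out.println(search(ALL_DOCS, List.of("america")));   // prints [2, 4]
    }
}
```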

Lucene - Keyword Field confusion

I started learning Lucene, so I am reading Lucene in Action. An excerpt from this book regarding fields is:
Keyword—Isn’t analyzed, but is indexed and stored in the index verbatim.
This type is suitable for fields whose original value should be preserved in
its entirety, such as URLs, file system paths, dates, personal names, Social
Security numbers, telephone numbers, and so on
What I understood from this is that if a text is indexed with a Keyword field, it is not analyzed (not split into tokens) but is indexed. However, what I don't understand is the "stored in the index verbatim" part.
I am confused about storing in the index. I assumed that if the text is indexed it will get stored in the index data structure.
Can anyone please explain this to me with an example?
I think you must be reading the first edition of Lucene in Action. That book is 11 years old and hopelessly outdated. I wouldn't be inclined to worry too much about understanding the conventions of Lucene 1.4.
The Second Edition is available. It's five years old and is based on Lucene 3.0, so it's definitely somewhat outdated, especially given the big changes in Lucene 4.0, but not hopelessly so. Reading that would certainly be much more useful.
The difference between storing and indexing a field does still exist though. In Lucene parlance:
Index - The field is indexed, and can be searched for. Keyword fields (or, more recently, StringField) are not analyzed, but they are indexed, so their complete content can be searched without tokenization.
Store - The field is stored, in its entirety, separately from the indexed form for later retrieval. When you get a search result from Lucene (for instance, from IndexSearcher.doc(int)), the document you get back will only have stored fields in it.
As such, you can have a field that you can search on, but won't be returned in results, or a field that is returned in results but can't be searched.

Finding number of unique terms over multiple fields

I need to find the number (or list) of unique terms over a combination of two or more fields in Lucene-Java. I am using the Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding the list of unique terms from a single (specific) field, or over all the fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2 and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!
If the combination of fields is known in advance, the simplest and quickest approach (in search terms) is to index a combined field, e.g. heightGender (1.23:male). You can then just count the unique terms in this field. However, this doesn't offer any flexibility at search time.
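As a rough sketch of the combined-field idea, the plain Java below builds one key per row and counts the distinct keys. It does not touch Lucene at all; the class name, the ":" separator, and the helper names are assumptions for illustration. In a real index the key would be written to an extra untokenized field at indexing time and the unique terms of that field counted instead.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of counting unique (height, gender) pairs via a combined key.
class CombinedFieldSketch {
    // Hypothetical separator; pick one that cannot occur inside either field value.
    static final String SEPARATOR = ":";

    // The value that would be indexed in the extra combined field.
    static String combinedKey(String height, String gender) {
        return height + SEPARATOR + gender;
    }

    // Each row is {height, gender}; duplicates collapse because the keys go into a set.
    static int countUniquePairs(List<String[]> rows) {
        Set<String> unique = new LinkedHashSet<>();
        for (String[] row : rows) {
            unique.add(combinedKey(row[0], row[1]));
        }
        return unique.size();
    }

    public static void main(String[] args) {
        // The sample data from the question.
        List<String[]> rows = List.of(
                new String[] {"1", "M"}, new String[] {"2", "F"},
                new String[] {"3", "M"}, new String[] {"3", "F"},
                new String[] {"4", "M"}, new String[] {"4", "F"});
        System.out.println(countUniquePairs(rows)); // prints 6
    }
}
```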
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain your query to each value of one field (e.g. Gender (male/female)) and retrieve all the values (and document counts) of the other field.
However if you do not have the ability to change the indexing process then you are left with doing a brute force search using Boolean queries to find the number of documents in the index for all combinations of the field values in which you are interested. I presume you are only counting combinations where the number of documents is non-zero.
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/).

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
An example search query: keywords: "berlin house john" name: "berlin house john" author: "berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field. That is an enormous list in which the best matching result is of course in first position, but the following results have very little relevance (perhaps only one field matching) and simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which of course does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure that all words appear somewhere across the fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
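Since the search terms arrive as free text, a query of this shape has to be assembled by the application. A minimal sketch in plain Java follows; the class and method names are made up, the input is assumed to be non-empty, and it does not escape query-syntax special characters, which a real application would need to handle before sending the string to Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Builds "(f:"a" OR f:"b") AND (g:"a" OR g:"b")" style query strings from free text.
class FieldQuerySketch {
    // Assumes userInput contains at least one word; special characters are NOT escaped here.
    static String build(List<String> fields, String userInput) {
        String[] terms = userInput.trim().split("\\s+");
        List<String> groups = new ArrayList<>();
        for (String field : fields) {
            // Within one field, any of the terms may match.
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                clauses.add(field + ":\"" + term + "\"");
            }
            groups.add("(" + String.join(" OR ", clauses) + ")");
        }
        // Across fields, every group must match.
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("keywords", "name", "author"), "berlin house john"));
    }
}
```

For the fields and input from the question this produces the same query as above, on a single line.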
As of Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience defaultOperator behaves the same as an operator specified in the query. I can't say why; it's just what I have observed.
Please try the query with the param {!q.op=AND}
I guess you use the default query parser; correct me if I am wrong.

Improving the speed of Solr query over 16 million tweets

I use Solr (SolrCloud) to index and search my tweets. There are about 16 million tweets and the index size is approximately 3 GB. The tweets are indexed in real time as they arrive, so that real-time search is possible. Currently I use the lowercase field type for my tweet body field. A search with a single term takes around 7 seconds, and the time increases linearly with each additional search term. 3 GB is the maximum RAM allocated to the Solr process. A sample Solr search query looks like this:
tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data
Any suggestions on improving the speed of searching? Currently I run only 1 shard which contains the entire tweet collection.
The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy, and leading wildcards can be readily handled with a ReversedWildcardFilterFactory. A term with wildcards on both ends, however, forces a scan of every indexed value in the field rather than using the index to locate matching documents. Combining the two approaches would only allow you to search:
tweet_body:*big tweet_body:big*
Which is not the same thing. If you really must search for terms with a leading AND trailing wildcard, I would recommend looking into indexing your data as N-grams.
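To give an idea of what N-gram indexing buys you, here is a small plain-Java sketch that splits a token into character n-grams. The class and method names are made up for illustration; in Solr the equivalent work would normally be done by an n-gram filter in the field's analysis chain (NGramFilterFactory, as far as I know) rather than by hand-written code.

```java
import java.util.ArrayList;
import java.util.List;

// Generates character n-grams of a token, the way an n-gram analysis step would at index time.
class NGramSketch {
    // Returns all substrings of length minGram..maxGram, shortest sizes first.
    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= token.length(); start++) {
                grams.add(token.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // "big" is one of the indexed grams, so *big* becomes an exact term lookup.
        System.out.println(ngrams("bigdata", 3, 3)); // prints [big, igd, gda, dat, ata]
    }
}
```

The trade-off is a larger index, since every token is stored as many short terms.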
I wasn't previously aware of it, but it seems the lowercase field type is a lowercase-filtered KeywordAnalyzer. This is not what you want. It means the entire field is treated as a single token. That is good for identification numbers and the like, but not for a body of text you wish to perform a full text search on.
So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to perform the query you are looking for with:
tweet_body:big AND tweet_body:data AND tweet_tag:big_data
You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.
Try using filter queries, as filter queries run in parallel.
