Lucene A-Z list - java

I would like to build an index view with A, B, C, D, E... so that I can facet all my Lucene results by first letter.
I have been googling around and haven't found anything.
I tried the Bobo facet library, but it didn't work. I would like to obtain an array like this:
Results{
prefix A: 1 results
prefix B: 2 results
prefix C: 3 results
prefix D: 0 results
....
}
This way I can enable or disable each button depending on whether there are results for its prefix.
Any ideas?
Thanks.
Thanks for your response!
Currently I am using Hibernate Search as the core system. To get facets I use the Bobo library (http://code.google.com/p/bobo-browse). So, as you suggested, I am thinking about creating a new field with the first word of the title. That way I can get the facets with Bobo.
For the moment I am not considering installing Solr.
I thought I could find some code that avoids indexing this new field, I mean code that facets on a wildcard query rather than directly on a field, but I didn't find anything :)
Hibernator.

Have you considered using Solr (or ElasticSearch) instead of plain Lucene? If so, all you would need to do is store the first letter of the word as a separate field on the indexed object and then run a facet search on that firstLetter field. Solr and ElasticSearch both have faceting out of the box.
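To make the idea concrete, here is a minimal plain-Java sketch of the counting that a firstLetter facet would do. It does not use Lucene, Solr, or Bobo; the class name AlphabetFacets, its method names, and the sample titles are all hypothetical. It only shows how the value of such a field could be derived and how the A-Z buckets, including the empty ones, could be filled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AlphabetFacets {

    // Value for a hypothetical "firstLetter" field: the upper-cased first
    // character of the title, or '#' when it is not a letter in A-Z.
    public static char firstLetter(String title) {
        String trimmed = title == null ? "" : title.trim();
        if (trimmed.isEmpty()) {
            return '#';
        }
        char c = trimmed.toUpperCase(Locale.ROOT).charAt(0);
        return (c >= 'A' && c <= 'Z') ? c : '#';
    }

    // Counts titles per letter. Every letter A-Z is present in the result,
    // so letters with a count of 0 can be rendered as disabled buttons.
    public static Map<Character, Integer> countByFirstLetter(List<String> titles) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c = 'A'; c <= 'Z'; c++) {
            counts.put(c, 0);
        }
        for (String title : titles) {
            char letter = firstLetter(title);
            if (letter != '#') {
                counts.merge(letter, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample data, made up for illustration.
        List<String> titles = Arrays.asList(
                "Apple", "banana", "Blueberry", "cherry", "Cranberry", "Coconut");
        Map<Character, Integer> counts = countByFirstLetter(titles);
        for (char c = 'A'; c <= 'D'; c++) {
            System.out.println("prefix " + c + ": " + counts.get(c) + " results");
        }
    }
}
```

In a real setup the firstLetter value would be written to the index at indexing time, and the counting would be left to the faceting component rather than done by hand like this.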

Related

How to get top words by lucene index and search?

I used the Lucene library to create an index and search it. Now I want to get the 30 words that appear most often in my texts. How can I do that?
If you are using Lucene 4.0 or later, you can use the HighFreqTerms class, such as:
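import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;
// reader is an already opened IndexReader; "mytextfield" is the field to inspect
TermStats[] commonTerms = HighFreqTerms.getHighFreqTerms(reader, 30, "mytextfield");
for (TermStats commonTerm : commonTerms) {
    System.out.println(commonTerm.termtext.utf8ToString()); // or whatever you need to do with it
}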
From each TermStats object, you can get the frequencies, field name, and text.
A quick search in SO got me this: Get highest frequency terms from Lucene index
Would this work for you? It sounds like exactly the same question.

WildcardQuery not returning correct result

I have created an index from some data. Now I am using WildcardQuery to search this data. The indexed documents have a field named Product Code, which is the field I am searching.
Below is the code that I am using for creating the query and searching:
Term productCodeTerm = new Term("Product Code", "*"+searchText+"*");
query = new WildcardQuery(productCodeTerm);
searcher.search(query, 100);
The searchText variable contains the string used to search the index. When searchText is 'jf', I get the following results:
JF32358
JF5215
JF2592
However, when I search for 25, f2, f3, or anything other than j, f, or jf, the query returns no hits.
I don't understand why this happens. Can someone help me understand why the search behaves this way?
What analyzer did you use at indexing time? Given your examples, you should make sure that your analyzer:
does lowercasing,
does not remove digits,
does not split at boundaries between letters and digits.
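To see why these three points matter, here is a small plain-Java simulation. It does not use Lucene. It assumes, as one possible explanation only (the question does not say which analyzer was used), that the index was built with an analyzer that lowercases and keeps letters only. The class WildcardDemo and its methods are hypothetical stand-ins, and only the * wildcard is modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class WildcardDemo {

    // Stand-in for an analyzer that lowercases and drops everything except letters.
    public static List<String> letterOnlyTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Stand-in for an analyzer that keeps the whole value as one lowercased token.
    public static List<String> keywordLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text.toLowerCase(Locale.ROOT));
        return tokens;
    }

    // True if any indexed token matches the pattern, where '*' matches any run of characters.
    public static boolean matchesAny(List<String> tokens, String wildcard) {
        String[] literals = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        Pattern pattern = Pattern.compile(regex.toString());
        for (String token : tokens) {
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stripped = letterOnlyTokens("JF2592");   // digits are lost
        List<String> whole = keywordLowercaseTokens("JF2592"); // digits are kept
        for (String query : new String[] {"*jf*", "*25*", "*f2*"}) {
            System.out.println(query + " letters-only: " + matchesAny(stripped, query)
                    + ", whole token: " + matchesAny(whole, query));
        }
    }
}
```

With the letters-only tokens, only *jf* finds anything, which is the pattern described in the question. With the whole value kept as a single lowercased token, all three patterns match.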
The Lucene FAQ says:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by
default. As of Lucene 2.1, they can be enabled by calling
QueryParser.setAllowLeadingWildcard( true ). Note that this can be an
expensive operation: it requires scanning the list of tokens in the
index in its entirety to look for those that match the pattern.
For more information check here.

problem with Lucene's automagical query conversion

I recently started using Lucene. After a few days, however, I noticed that the queries I provide as Strings are converted by Lucene into more general ones.
Example:
MY QUERY: "want to go" (including " as I'm searching whole phrases)
QUERY OBJECT created from my query (.toString): text:"want ? go"
NUMBER OF RESULTS for texts:
I want to go out today -> 1 result - correct
I want sdfto go out today -> 1 result - incorrect, should be 0
I wanted to match exactly the phrase "want to go", not "want whatever go". I noticed that only the words "to" and "a" are replaced with "?".
My question is why Lucene changes the queries I provide, and how I can force Lucene to run my queries unchanged.
Moreover, I'm using StandardAnalyzer (for both indexing and querying).
"to" is a stop word, meaning that some analyzers (including StandardAnalyzer) neither index nor search it, because it is usually not useful for searching. If you don't want it to be stopped, you will need to use a different analyzer (for both indexing and searching), but it will probably give worse results.
You can also remove the word 'to' from the stop-word list (the STOP_WORDS field) that the analyzer uses.
IMPORTANT: your indexing analyzer and searching analyzer should be consistent, including the STOP_WORDS field!
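As a rough illustration of what happens, here is a simplified plain-Java model of stop-word removal in a phrase query. It is not Lucene code. The class StopWordDemo, its methods, and the two-word stop list are hypothetical. A removed stop word leaves a hole at its position (shown by Lucene as "?"), and a hole matches whatever word sits at that position in the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordDemo {

    // Lowercases and splits the text; stop words become null "holes" so that
    // the remaining words keep their original positions.
    public static List<String> analyze(String text, Set<String> stopWords) {
        List<String> positions = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            positions.add(stopWords.contains(word) ? null : word);
        }
        return positions;
    }

    // Exact phrase match by position: a hole in the phrase accepts any document token.
    public static boolean phraseMatches(List<String> doc, List<String> phrase) {
        for (int start = 0; start + phrase.size() <= doc.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < phrase.size(); i++) {
                String wanted = phrase.get(i);
                if (wanted != null && !wanted.equals(doc.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> withStops = new HashSet<>(Arrays.asList("a", "to")); // hypothetical list
        Set<String> noStops = new HashSet<>();
        String doc = "I want sdfto go out today";
        String phrase = "want to go";
        System.out.println("with stop words: "
                + phraseMatches(analyze(doc, withStops), analyze(phrase, withStops)));
        System.out.println("without stop words: "
                + phraseMatches(analyze(doc, noStops), analyze(phrase, noStops)));
    }
}
```

With the stop list, the phrase becomes want ? go and also matches "want sdfto go". With an empty stop list, applied consistently to documents and queries, "to" is kept and that false match goes away.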

Why is Lucene sometimes not matching InChIKeys?

I have indexed my database using Hibernate Search. I use a custom analyzer, both for indexing and for querying. I have a field called inchikey that should not get tokenized. Example values are:
BBBAWACESCACAP-UHFFFAOYSA-N
KEZLDSPIRVZOKZ-AUWJEWJLSA-N
When I look into my index with Luke I can confirm that they are not tokenized, as required.
However, when I search for them using the web app, some inchikeys are found and others are not. Curiously, for the inchikeys that are not found, the search DOES work when I search without the last hyphen, like so: BBBAWACESCACAP-UHFFFAOYSA N
I have not been able to find a common element in the inchikeys that are not found.
Any idea what is going on here?
I use a MultiFieldQueryParser to search over the different fields in the database:
String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new ChemicalNameAnalyzer());
//Disable the following if search performance is too slow
parser.setAllowLeadingWildcard(true);
FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
More details about our setup have been posted here by Tim and me.
It turns out that the last entries in the input file are not being indexed correctly. These ARE being tokenized. In fact, it seems they are indexed twice: once without being tokenized and once with. When I search, I cannot find the un-tokenized version.
I have not yet found the reason, but I think it perhaps has to do with our parser ending while Lucene is still indexing the last entries, and as a result Lucene reverting to the default analyzer (StandardAnalyzer). When I find the culprit I will report back here.
Adding @Analyzer(impl = ChemicalNameAnalyzer.class) to the fields solves the problem, but what I want is my original setup, with the default analyzer defined once, in config, like so:
<property name="hibernate.search.analyzer">path.to.ChemicalNameAnalyzer</property>

Lucene searching by numeric values

I'm building a Java Lucene-based search system that, on addition, adds a certain number of meta-fields, one of which is a sourceId field, which denotes where the entry came from.
I'm now trying to retrieve all documents from a particular source, but the index doesn't appear to be able to find them. However, if I search for a wildcard value, the returned documents all have the correct value for this field.
The Lucene query I'm using is quite simple, basically index-source-id:1, but it fails to return any hits. If I search for content:a* I get dozens of documents, all of which, when asked, return the value 1 for index-source-id, which is correct.
Any ideas?
I have only worked with the PHP port, but have you checked which text analyzer you are using? This FAQ seems to indicate that, as in the PHP version, you need to use a different one that doesn't remove digits.
You can find a list of analyzers here
Just to be sure: have you set the ID field to be indexed?
