I used the Lucene library to create an index and search it. Now I want to get the 30 most frequent words appearing in my texts. What can I do?
If you are using Lucene 4.0 or later, you can use the HighFreqTerms class, like this:
TermStats[] commonTerms = HighFreqTerms.getHighFreqTerms(reader, 30, "mytextfield");
for (TermStats commonTerm : commonTerms) {
System.out.println(commonTerm.termtext.utf8ToString()); //Or whatever you need to do with it
}
From each TermStats object, you can get the frequencies, field name, and text.
A quick search on SO turned up this: Get highest frequency terms from Lucene index
Would this work for you? It sounds like the exact same question.
Related
I would like to have an index view with A B C D E... in order to facet all my Lucene results by the alphabet.
I have been googling around and I haven't found anything..
I tried the Bobo facet library but it didn't work. I would like to obtain a result like this:
Results{
prefix A: 1 results
prefix B: 2 results
prefix C: 3 results
prefix D: 0 results
....
}
This way I can disable or enable the buttons if I have results for the prefix.
Any ideas?
Thanks.
Thanks for your response!
Currently I am using Hibernate Search as the core system. To get facets I use the BOBO library (http://code.google.com/p/bobo-browse). So, as you said, I am thinking about creating a new field with the first word of the title. That way I can get the facets with BOBO.
For the moment I am not considering installing Solr.
I thought I could find some code that would avoid indexing this new field (that is, code to facet on a wildcard query rather than directly on a field), but I didn't find anything :)
Hibernator.
Have you considered using Solr (or ElasticSearch) instead of plain Lucene? If so, all you would need to do is store the first letter of the word as a separate field on the indexed object and then do a facet search on that firstLetter field. Solr (and ElasticSearch) have faceting out of the box.
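Even without Solr, the shape of that facet result can be sketched in plain Java (the class and method names below are illustrative, not part of any library): count hits per first letter, then enable or disable each alphabet button based on the count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FirstLetterFacets {
    // Count how many result titles start with each letter A-Z: the same
    // information a facet on a "firstLetter" field would return.
    public static Map<Character, Integer> countByFirstLetter(List<String> titles) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c = 'A'; c <= 'Z'; c++) {
            counts.put(c, 0); // ensure every letter appears, even with 0 hits
        }
        for (String title : titles) {
            if (title == null || title.isEmpty()) continue;
            char first = Character.toUpperCase(title.charAt(0));
            if (counts.containsKey(first)) {
                counts.put(first, counts.get(first) + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Aspirin", "Benzene", "Butane", "Caffeine");
        Map<Character, Integer> facets = countByFirstLetter(titles);
        System.out.println("A: " + facets.get('A')); // 1
        System.out.println("B: " + facets.get('B')); // 2
        System.out.println("D: " + facets.get('D')); // 0 -> disable the D button
    }
}
```

A zero count then means the corresponding button gets disabled.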
I have created an index from some data. Now I am using a WildcardQuery to search this data. The indexed documents have a field named Product Code against which I am searching.
Below is the code that I am using for creating the query and searching:
Term productCodeTerm = new Term("Product Code", "*"+searchText+"*");
query = new WildcardQuery(productCodeTerm);
searcher.search(query, 100);
The searchText variable contains the search string used to search the index. When searchText is 'jf', I get the following results:
JF32358
JF5215
JF2592
Now, when I search using 25, f2, f3, or anything other than j, f, or jf, the query gets no hits.
I cannot understand why this is happening. Can someone help me understand why the search behaves this way?
What analyzer did you use at indexing time? Given your examples, you should make sure that your analyzer:
does lowercasing,
does not remove digits,
does not split at boundaries between letters and digits.
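To illustrate what those three requirements mean in practice, here is a plain-Java sketch of the desired token stream (this is not Lucene's analyzer API, just the behavior it should produce): lowercase the text and split only on characters that are neither letters nor digits, so a code like JF32358 survives as the single token jf32358 and a wildcard such as *25* can then match inside it.

```java
import java.util.ArrayList;
import java.util.List;

public class CodeTokenizerSketch {
    // Lowercase the input and split only on runs of characters that are
    // neither letters nor digits. A product code like "JF32358" therefore
    // stays one token, "jf32358", instead of being split into "jf" + "32358".
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Product JF32358")); // [product, jf32358]
    }
}
```

If your analyzer instead splits at letter/digit boundaries (as some word-delimiter filters do), the index contains "jf" and "32358" as separate terms, which is exactly the symptom described: *jf* matches but *25* does not.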
The Lucene FAQ page says:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by
default. As of Lucene 2.1, they can be enabled by calling
QueryParser.setAllowLeadingWildcard( true ). Note that this can be an
expensive operation: it requires scanning the list of tokens in the
index in its entirety to look for those that match the pattern.
For more information check here.
We are developing an application to detect plagiarism. We are using Apache Lucene for document indexing. I need to create an occurrence vector for each document using the index we created. I would like to know whether there is a way to do this with Apache Lucene. I tried TermFreqVectors but couldn't find a proper way. Any suggestion or help is highly appreciated.
Thanks.
The TermFreqVector class does what you'd like, I think. It can even give you term positions so that you can detect ordered sequences of words. To generate the vector, you need to specify this at indexing time like this:
String text = "text you want to index; you could also use a Reader here";
Document doc = new Document();
doc.add(new Field("text", text, Store.NO, Index.ANALYZED, TermVector.WITH_POSITIONS));
At retrieval time, you can run phrase queries (e.g., "a b c"~25) or SpanQuery instances (which you have to construct programmatically).
To get term frequency and position information from the index, do something like this:
TermPositionVector v = (TermPositionVector) this.reader.getTermFreqVector(docnum, this.textField);
int wordIndex = v.indexOf("want");
int[] positions = v.getTermPositions(wordIndex); // should return the position(s) of the word "want" in your text
If you want to achieve this, you could use a RAMDirectory to store your document (assuming you only want to do this for one document).
Then you can use IndexReader.termDocs(Term term) to fetch the TermDocs for this directory, containing the document id (only one if you store one doc) and the frequency of the term in the document.
You can then do this for each term to create your occurrence vector.
You could of course also do this for more than one document and create multiple occurrence vectors at once.
http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/index/IndexReader.html
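The occurrence vector itself is just a term-to-frequency map. Here is a plain-Java sketch of the data structure you would accumulate from those per-term frequencies (names are illustrative; this stands in for the Lucene calls above, not for their API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OccurrenceVector {
    // Build a term -> frequency map for one document's text: the same shape
    // of data you would collect from the TermDocs frequencies per term.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            Integer old = freqs.get(token);
            freqs.put(token, old == null ? 1 : old + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("the cat sat on the mat"));
        // {the=2, cat=1, sat=1, on=1, mat=1}
    }
}
```

For plagiarism detection you would then compare these vectors across documents (e.g., by cosine similarity).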
Since you are presumably looking to find similarities between documents (i.e., similar documents), you might want to have a look at the MoreLikeThis implementation in Lucene: http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
I have indexed my database using Hibernate Search. I use a custom analyzer, both for indexing and for querying. I have a field called inchikey that should not get tokenized. Example values are:
BBBAWACESCACAP-UHFFFAOYSA-N
KEZLDSPIRVZOKZ-AUWJEWJLSA-N
When I look into my index with Luke I can confirm that they are not tokenized, as required.
However, when I try to search them using the web app, some inchikeys are found and others are not. Curiously, for these inchikeys the search DOES work when I search without the last hyphen, as so: BBBAWACESCACAP-UHFFFAOYSA N
I have not been able to find a common element in the inchikeys that are not found.
Any idea what is going on here?
I use a MultiFieldQueryParser to search over the different fields in the database:
String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new ChemicalNameAnalyzer());
//Disable the following if search performance is too slow
parser.setAllowLeadingWildcard(true);
FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
More details about our setup have been posted here by Tim and me.
It turns out the last entries in the input file are not being indexed correctly. These ARE being tokenized. In fact, it seems they are indexed twice: once without being tokenized and once with. When I search I cannot find the un-tokenized.
I have not yet found the reason, but I think it may have to do with our parser finishing while Lucene is still indexing the last entries, causing Lucene to revert to the default analyzer (StandardAnalyzer). When I find the culprit I will report back here.
Adding @Analyzer(impl = ChemicalNameAnalyzer.class) to the fields solves the problem, but what I want is my original setup, with the default analyzer defined once, in the config, like so:
<property name="hibernate.search.analyzer">path.to.ChemicalNameAnalyzer</property>
I'm an MCS 2nd year student. I'm doing a project in Java in which I have different images. For storing the description of, say, IMAGE-1, I have an ArrayList named IMAGE-1; similarly for IMAGE-2 an ArrayList IMAGE-2, and so on.
Now I need to develop a search engine in which I need to find all images whose description matches a word entered in the search engine.
For example, if I enter "computer" then I should be able to find all images whose description contains "computer".
So my questions are:
How should I do this efficiently?
How should I maintain all those ArrayLists, since I can have hundreds of them? Or should I use another data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed because multiple items can match a token. The initialization of the collection is omitted from the snippet, but the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The general-purpose HashMap implementation is not the most efficient in this case. When you start running into memory problems, you can look into a more space-efficient tree implementation (such as a radix tree).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
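Here is a fuller sketch of the token map described above, including the collection initialization that the one-line snippet leaves out (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImageIndex {
    // token -> names of all images whose description contains that token
    private final Map<String, List<String>> index = new HashMap<>();

    public void add(String imageName, String description) {
        for (String token : description.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            List<String> items = index.get(token);
            if (items == null) {                 // initialize the collection on first use
                items = new ArrayList<>();
                index.put(token, items);
            }
            if (!items.contains(imageName)) items.add(imageName);
        }
    }

    public List<String> search(String word) {
        List<String> hits = index.get(word.toLowerCase());
        return hits == null ? new ArrayList<>() : hits;
    }
}
```

Lookup is then a single map access, regardless of how many images you have.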
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
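That approach can be sketched as follows (names are illustrative); it returns the indexes of all matching descriptions:

```java
import java.util.ArrayList;
import java.util.List;

public class DescriptionScan {
    // Linear scan over all descriptions; fine for up to roughly 10,000 short texts.
    public static List<Integer> search(String[] descriptions, String query) {
        List<Integer> matches = new ArrayList<>();
        String needle = query.toLowerCase();           // case folding
        for (int i = 0; i < descriptions.length; i++) {
            if (descriptions[i].toLowerCase().indexOf(needle) >= 0) {
                matches.add(i);                        // remember which entry matched
            }
        }
        return matches;
    }
}
```

Because indexOf matches substrings, "comp" finds both "Computer" and "compare", as noted above.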
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest using the Hashtable class, or organizing your content into a tree, to optimize searching.