Identify existence of keywords in document from list


I want to create a tag list for a Lucene document based on a pre-determined list.
So, if we have a document with the text
Looking for a Java programmer with experience in Lucene
and we have the keyword list (about 1000 items)
java, php, lucene, c# [...]
I want to identify that the keywords Java and Lucene exist in the document.
Just doing a java OR php OR lucene will not work because then I will not know which keyword generated the hit.
Any suggestions on how to implement this in Lucene?

I assume that you have one or more indexed fields, and you want to build your tag cloud based on the intersection of your keywords and the indexed terms of a document.
Your problem is very similar to highlighting, so the same ideas apply. You can either:
re-analyze the stored fields of your Lucene document, or
use term vectors for fast access to the terms of your document's indexed fields.
Note that if you want to use term vectors, you need to enable them at indexing time (see the Field.TermVector.YES documentation and the Field constructor).
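The intersection idea can be sketched in plain Java without any Lucene dependency. This is only an illustration: the class KeywordTagger and its terms() method are hypothetical, and the simple lowercase-and-split tokenizer stands in for whatever analyzer you index with. In a real setup the document terms would come from re-analysis or from the term vector.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class KeywordTagger {

    // Hypothetical stand-in for a Lucene analyzer: lowercases the text and
    // splits on whitespace and basic punctuation, keeping '#' and '+' so that
    // keywords such as "c#" survive as single terms.
    static Set<String> terms(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[\\s,.;:!?()]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Returns the keywords that occur among the document's terms,
    // in the order they appear in the keyword list.
    static Set<String> tags(String text, Iterable<String> keywords) {
        Set<String> docTerms = terms(text);
        Set<String> found = new LinkedHashSet<>();
        for (String keyword : keywords) {
            String normalized = keyword.toLowerCase(Locale.ROOT);
            if (docTerms.contains(normalized)) {
                found.add(normalized);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Looking for a Java programmer with experience in Lucene";
        // prints [java, lucene]
        System.out.println(tags(text, Arrays.asList("java", "php", "lucene", "c#")));
    }
}
```

With about 1000 keywords, one set lookup per keyword is cheap, so the cost is dominated by obtaining the document's terms.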

Yes, this works
FullTextSession fts = Search.getFullTextSession(getSessionFactory().getCurrentSession());
Query q = fts.getSearchFactory().buildQueryBuilder()
.forEntity(Offer.class).get()
.keyword()
.onField("id")
.matching(myId)
.createQuery();
Object[] dId = (Object[]) fts.createFullTextQuery(q, Offer.class)
.setProjection(ProjectionConstants.DOCUMENT_ID)
.uniqueResult();
if(dId != null){
IndexReader indexReader = fts.getSearchFactory().getIndexReaderAccessor().open(Offer.class);
TermFreqVector freq = indexReader.getTermFreqVector((Integer) dId[0], "description");
}
Remember to index the field with TermVector.YES in the Hibernate Search annotation for that field.

Related

WildcardQuery not returning correct result

I have created an index using some data. Now I am using WildcardQuery to search this data. The documents indexed have a field name Product Code against which I am searching.
Below is the code that I am using for creating the query and searching:
Term productCodeTerm = new Term("Product Code", "*"+searchText+"*");
query = new WildcardQuery(productCodeTerm);
searcher.search(query, 100);
The searchText variable contains the search string that is used to search the index. When searchText is 'jf', I get the following result:
JF32358
JF5215
JF2592
Now, when I try to search using 25, or f2 or f3 or anything else other than using only j,f,jf, then the query has no hits.
I am not able to understand why it is happening. Can someone help me understand the reason the search is behaving in this way?

What analyzer did you use at indexing time? Given your examples, you should make sure that your analyzer:
does lowercasing,
does not remove digits,
does not split at boundaries between letters and digits.
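To see why these three points matter, here is a small plain-Java sketch (no Lucene involved) that applies a wildcard pattern to lists of terms. The class WildcardOverTerms and both term lists are made up for illustration. It assumes that a wildcard has to match one whole indexed term and that the pattern itself is not analyzed, so what ends up in the index decides whether there is a hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WildcardOverTerms {

    // Translates a wildcard pattern ('*' = any run of characters, '?' = one
    // character) into a regular expression that must match a whole term.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the indexed terms that the wildcard matches in full.
    static List<String> matches(String wildcard, List<String> indexedTerms) {
        Pattern pattern = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed output of an analyzer that keeps each code as one lowercase term.
        List<String> whole = Arrays.asList("jf32358", "jf5215", "jf2592");
        // Assumed output of an analyzer that splits between letters and digits.
        List<String> split = Arrays.asList("jf", "32358", "5215", "2592");

        System.out.println(matches("*f3*", whole)); // [jf32358]
        System.out.println(matches("*f3*", split)); // []
        System.out.println(matches("*JF*", whole)); // [] because the pattern is not lowercased
    }
}
```

Which of these cases applies to your index depends on the analyzer you actually used, so inspecting the indexed terms (for example with Luke) is the quickest check.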

In the lucene FAQ page it says :
Leading wildcards (e.g. *ook) are not supported by the QueryParser by
default. As of Lucene 2.1, they can be enabled by calling
QueryParser.setAllowLeadingWildcard( true ). Note that this can be an
expensive operation: it requires scanning the list of tokens in the
index in its entirety to look for those that match the pattern.
For more information check here.

Create an Occurrence Vector Using Apache Lucene

We are developing an application to detect plagiarism. We are using Apache lucene for document indexing. I have a need to create an occurrence vector for each document using the index we created. I would like to know whether there is a way to do this using apache lucene. I tried to use TermFreqVectors but I couldn't find a proper way. Any suggestion or help is highly appreciated.
Thanks.

The TermFreqVector class does what you'd like, I think. It can even give you term positions so that you can detect ordered sequences of words. To generate the vector, you need to specify this at indexing time like this:
String text = "text you want to index; you could also use a Reader here";
Document doc = new Document();
doc.add(new Field("text", text, Store.NO, Index.ANALYZED, TermVector.WITH_POSITIONS));
At retrieval time, you can run phrase queries (e.g, "a b c"~25) or SpanQuerys (which you have to construct programmatically).
To get term frequency and position information from the index, do something like this:
TermPositionVector v = (TermPositionVector) this.reader.getTermFreqVector(docnum, this.textField);
int wordIndex = v.indexOf("want");
int[] positions = v.getTermPositions(wordIndex); // should return the position(s) of the word "want" in your text
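If it helps to see what such a vector contains, here is a plain-Java sketch that builds one for a single text. It does not use Lucene: the class OccurrenceVector is hypothetical, and the tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption, so positions can differ from what your analyzer produces, for example when stop words are removed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class OccurrenceVector {

    // Maps each term to the list of token positions at which it occurs.
    // The frequency of a term is the size of its position list.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> vector = new LinkedHashMap<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading punctuation
            }
            vector.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
        return vector;
    }

    public static void main(String[] args) {
        String text = "text you want to index; you could also use a Reader here";
        Map<String, List<Integer>> vector = positions(text);
        System.out.println(vector.get("want")); // [2]
        System.out.println(vector.get("you"));  // [1, 5]
    }
}
```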

If you want to achieve this, you could use a RAMDirectory to store your document (assuming you only want to do this for one document).
Then you can use IndexReader.termDocs(Term term) to fetch the TermDocs for this directory, which contain the document id (only one if you store one doc) and the frequency of the term in the document.
You can then do this for each term to create your occurrence vector.
You could of course also do this for more than one document and create multiple occurrence vectors at once.
http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/index/IndexReader.html
As I'm sure you are looking for similarities between documents (that is, similar documents), you might want to have a look at Lucene's MoreLikeThis implementation: http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

search lucene on a specific document

I'm using Lucene with Java to index some text documents. Now, after I get some top documents for a keyword search, I want to further refine my search and search only those top documents with some additional keywords, each document once. Can somebody tell me how I can search a specific document with a specific keyword, not the whole index, but let's say just 123.xml with the keywords "bla blah"?
thanx in advance

If you want to refine your search, you should use filters (look at IndexSearcher's search(Query query, Filter filter, int n, Sort sort) method)! Filters will be executed on the result set and are the proper way to implement refined searches.
Have a look at this page to find out how to use filters: http://www.javaranch.com/journal/2009/02/filtering-a-lucene-search.html
Anyway:
If you want to search in just one document, you can either take that document, store it in a RAMDirectory, and search the RAMDirectory just as you would your normal index, or you can have a field containing a unique identifier for each document and add it to your query, e.g. "content:(bla blah) AND unique_doc_id:(doc1)".

Why is Lucene sometimes not matching InChIKeys?

I have indexed my database using Hibernate Search. I use a custom analyzer, both for indexing and for querying. I have a field called inchikey that should not get tokenized. Example values are:
BBBAWACESCACAP-UHFFFAOYSA-N
KEZLDSPIRVZOKZ-AUWJEWJLSA-N
When I look into my index with Luke I can confirm that they are not tokenized, as required.
However, when I try to search them using the web app, some inchikeys are found and others are not. Curiously, for these inchikeys the search DOES work when I search without the last hyphen, as so: BBBAWACESCACAP-UHFFFAOYSA N
I have not been able to find a common element in the inchikeys that are not found.
Any idea what is going on here?
I use a MultiFieldQueryParser to search over the different fields in the database:
String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new ChemicalNameAnalyzer());
//Disable the following if search performance is too slow
parser.setAllowLeadingWildcard(true);
FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
More details about our setup have been posted here by Tim and me.

It turns out the last entries in the input file are not being indexed correctly. These ARE being tokenized. In fact, it seems they are indexed twice: once without being tokenized and once with. When I search I cannot find the un-tokenized.
I have not yet found the reason, but I think it perhaps has to do with our parser ending while Lucene is still indexing the last entries, and as a result Lucene reverting to the default analyzer (StandardAnalyzer). When I find the culprit I will report back here.
Adding @Analyzer(impl = ChemicalNameAnalyzer.class) to the fields solves the problem, but what I want is my original setup, with the default analyzer defined once, in config, like so:
<property name="hibernate.search.analyzer">path.to.ChemicalNameAnalyzer</property>

Reverse search in Hibernate Search

I'm using Hibernate Search (which uses Lucene) to search some data I have indexed in a directory. It works fine, but I need to do a reverse search. By reverse search I mean that I have a list of queries stored in my database, and each time a Data object is created I need to check which of these queries match it. I need this to alert the user when a Data object matches a query he has created. So I need to index the single Data object that has just been created and see which queries in my list return it as a result.
I've seen Lucene MemoryIndex Class to create an index in memory so I can do something like this example for every query in a list (though iterating in a Java list of queries would not be very efficient):
//Iterating over my list<Query>
MemoryIndex index = new MemoryIndex();
//Add all fields
index.addField("myField", "myFieldData", analyzer);
...
QueryParser parser = new QueryParser("myField", analyzer);
float score = index.search(query);
if (score > 0.0f) {
System.out.println("it's a match");
} else {
System.out.println("no match found");
}
The problem here is that this Data class has several Hibernate Search annotations (@Field, @IndexedEmbedded, ...) which indicate how fields should be indexed, so when I invoke the index() method on the FullTextEntityManager instance it uses this information to index the object in the directory. Is there a similar way to index it in memory using this information?
Is there a more efficient way of doing this reverse search?

Just index the new object (if you use automatic indexing you don't have to do anything besides committing the current transaction), then retrieve the queries you want to run and run all of them in a boolean query, combining the stored query with the id of the new object. Something like this:
...
BooleanQuery query = new BooleanQuery();
query.add(storedQuery, BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term(ProjectionConstants.ID, id)), BooleanClause.Occur.MUST); // TermQuery takes a Term; id is assumed to be a String here
...
If you get a result you know the query matched.

Since MemoryIndex is a completely separate component that doesn't extend or implement Lucene's Directory or IndexReader, I don't think there's a way you can plug it into the Hibernate Search annotations. I'm guessing that if you choose to use MemoryIndex, you'll need to write your own addField() calls, which basically mirror what you're doing in the annotations.
How many queries are we talking about here? Depending on how many there are you might be able to get away with just running the queries on the main index that Hibernate maintains, ensuring to constrain the search to the document ID you just added. Or for every document that's added, create a one-document in-memory index using RAMDirectory and run the queries through that.
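The overall shape of "run every stored query against the one new document" can be sketched in plain Java. This is a deliberate simplification and not Hibernate Search or Lucene code: the class ReverseSearch is hypothetical, each stored query is modeled as a set of lowercase terms that must all be present, and the query names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ReverseSearch {

    // Returns the names of the stored queries whose required terms all
    // occur in the text of the newly created document.
    static List<String> matchingQueries(Map<String, Set<String>> storedQueries, String documentText) {
        Set<String> docTerms = new HashSet<>(
                Arrays.asList(documentText.toLowerCase(Locale.ROOT).split("\\s+")));
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Set<String>> query : storedQueries.entrySet()) {
            if (docTerms.containsAll(query.getValue())) {
                matched.add(query.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical stored queries, keyed by the alert they belong to.
        Map<String, Set<String>> stored = new LinkedHashMap<>();
        stored.put("alice-alert", new HashSet<>(Arrays.asList("java", "lucene")));
        stored.put("bob-alert", new HashSet<>(Arrays.asList("php")));
        // prints [alice-alert]
        System.out.println(matchingQueries(stored, "Java developer with Lucene experience"));
    }
}
```

The document is tokenized once and reused for every query, which is the same reason a one-document in-memory index is attractive when the number of stored queries is large.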
