We are developing an application to detect plagiarism, and we are using Apache Lucene for document indexing. I need to create an occurrence vector for each document using the index we created. Is there a way to do this with Apache Lucene? I tried to use TermFreqVectors but couldn't find a proper way. Any suggestion or help is highly appreciated.
Thanks.
The TermFreqVector class does what you'd like, I think. It can even give you term positions so that you can detect ordered sequences of words. To generate the vector, you need to enable it at indexing time, like this:
String text = "text you want to index; you could also use a Reader here";
Document doc = new Document();
doc.add(new Field("text", text, Store.NO, Index.ANALYZED, TermVector.WITH_POSITIONS));
At retrieval time, you can run phrase queries (e.g., "a b c"~25) or SpanQuery objects (which you have to construct programmatically).
To get term frequency and position information from the index, do something like this:
TermPositionVector v = (TermPositionVector) this.reader.getTermFreqVector(docnum, this.textField);
int wordIndex = v.indexOf("want");
int[] positions = v.getTermPositions(wordIndex); // should return the position(s) of the word "want" in your text
If you want to achieve this, you could use a RAMDirectory to store your document (assuming you only want to do this for one document).
Then you can use IndexReader.termDocs(Term term) to fetch the TermDocs for this directory, containing the document id (only one if you store one document) and the frequency of the term in that document.
You can then do this for each term to create your occurrence vector.
You could of course also do this for more than one document and create multiple occurrence vectors at once.
http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/index/IndexReader.html
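The RAMDirectory/termDocs approach could be sketched roughly like this, against the Lucene 3.x API that the docs linked above describe (the field name, analyzer, and sample text are assumptions, not something from your setup):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class OccurrenceVector {
    public static void main(String[] args) throws Exception {
        // Index a single document into an in-memory directory.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31)));
        Document doc = new Document();
        doc.add(new Field("text", "the text you want to index",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Walk every term with termDocs() to build the occurrence vector.
        Map<String, Integer> occurrences = new HashMap<String, Integer>();
        IndexReader reader = IndexReader.open(dir);
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term term = terms.term();
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                // Only one document is stored, so freq() is its occurrence count.
                occurrences.put(term.text(), termDocs.freq());
            }
        }
        reader.close();
        System.out.println(occurrences);
    }
}
```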
Since you are presumably looking to find similarities between documents, you might also want to have a look at the MoreLikeThis implementation in Lucene: http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/search/similar/MoreLikeThis.html
Related
I have a huge dictionary containing around 1.2 million strings. As input I will get a sentence, and for each word of the input sentence I need to check whether it is present in the dictionary.
Performance is the highest priority for me, hence I would like to keep this dictionary in memory. I want to complete my dictionary lookup in less than a millisecond. Kindly suggest how I can achieve this. Is there any existing external API which does this?
So you only need a set of the words from the dictionary, and to check whether it contains all the words of the sentence:
Set<String> dictionaryIndex = new HashSet<>();
Set<String> sentence = new HashSet<>();
if (!dictionaryIndex.containsAll(sentence)) {
    // at least one word of the sentence is not in the dictionary
}
However, if you want to do more, consider a database, maybe an embedded in-memory database like H2 or Derby. You can then do more, and a query would be:
SELECT COUNT(*) FROM dictionary WHERE word IN('think', 'positive', 'human')
You might even consider a spelling library. Such libraries keep a smaller dictionary and use stemming: 'learn' for learning, learner, learned, learns.
If you are open to using external APIs, I would suggest you go for Elasticsearch's percolate API. With performance being the priority, this exactly fits your requirement.
The Java API can be found here.
You can index all the keywords and then feed it a document (in your case, the sentence).
Indexing:
for (String obj : keywordLst) {
    client.prepareIndex("myindex", ".percolator", obj)
          .setSource(XContentFactory.jsonBuilder()
              .startObject()
                  .field("query", QueryBuilders.matchPhraseQuery("content", obj))
              .endObject())
          .setRefresh(true)
          .execute().actionGet();
}
Searching:
XContentBuilder docBuilder = XContentFactory.jsonBuilder().startObject();
docBuilder.field("doc").startObject();
docBuilder.field("content", text);
docBuilder.endObject(); //End of the doc field
docBuilder.endObject(); //End of the JSON root object
PercolateResponse response = client.preparePercolate()
        .setSource(docBuilder)
        .setIndices("myindex").setDocumentType("type")
        .execute().actionGet();
for (PercolateResponse.Match match : response) {
    // found matches
}
I think the 1.2 million strings may not fit in memory, or may come uncomfortably close to your memory limit (consider a bad case where the average string length is 256 characters).
If some kind of pre-processing is allowed, I think you'd better first reduce the sequence of strings to a set of distinct words, i.e. convert your data into another data set that will easily fit in memory without overflowing it.
After that, you can rely on in-memory data structures such as HashMap.
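For a rough sense of scale: a plain HashSet lookup is a hash computation plus an equality check, well under a millisecond per word. A self-contained sketch (the dictionary words and sentence here are placeholders; a real dictionary would be loaded from a file):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DictionaryLookup {
    public static void main(String[] args) {
        // Placeholder dictionary; a real one would hold ~1.2 million entries.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("think", "positive", "human"));

        // Check each word of the input sentence against the dictionary.
        String sentence = "humans think";
        for (String word : sentence.toLowerCase().split("\\s+")) {
            System.out.println(word + " -> " + dictionary.contains(word));
        }
    }
}
```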
I used the Lucene library to create an index and search it. But now I want to get the top 30 most frequently appearing words in my texts. What can I do?
If you are using Lucene 4.0 or later, you can use the HighFreqTerms class, such as:
TermStats[] commonTerms = HighFreqTerms.getHighFreqTerms(reader, 30, "mytextfield");
for (TermStats commonTerm : commonTerms) {
System.out.println(commonTerm.termtext.utf8ToString()); //Or whatever you need to do with it
}
From each TermStats object, you can get the frequencies, field name, and text.
A quick search on SO got me this: Get highest frequency terms from Lucene index.
Would this work for you? It sounds like the exact same question.
I now have several Lucene index sets (I call them shards), which index different document sets. They are independent, meaning I can perform a search on each of them without reading the others. When I get a query request, I want to search it over every index set and combine the results to form the final top documents.
I know that when scoring the documents, Lucene needs to know the idf of every term, and different index sets will give a different idf for the same term (because different index sets hold different document sets). Thus, to my understanding, I cannot compare document scores from different index sets directly. How, then, should I generate the final result?
An obvious solution would be to first merge the indexes and then perform the search over the big index. However, this is too time-consuming for me and thus unacceptable. Does anyone have a better solution?
P.S.: I don't want to use any packages or software (like Katta) except Lucene and Hadoop.
I think MultiReader is what you are looking for. If you have multiple IndexReaders, say reader1 and reader2:
MultiReader multiReader = new MultiReader(reader1, reader2);
IndexSearcher searcher = new IndexSearcher(multiReader);
Since the searcher works on the combined reader, term statistics such as idf are computed over all the sub-indexes together, so the resulting scores are directly comparable.
I want to create a tag list for a Lucene document based on a pre-determined list.
So, if we have a document with the text
Looking for a Java programmer with experience in Lucene
and we have the keyword list (about 1000 items)
java, php, lucene, c# [...]
I want to identify that the keywords Java and Lucene exist in the document.
Just doing a "java OR php OR lucene" query will not work, because then I will not know which keyword generated the hit.
Any suggestions on how to implement this in Lucene?
I assume that you have one or more indexed fields, and you want to build your tag cloud based on the intersection of your keywords and the indexed terms for a document.
Your problem is very similar to highlighting, so the same ideas apply. You can either:
re-analyze the stored fields of your Lucene document, or
use term vectors for fast access to the document's indexed terms.
Note that if you want to use term vectors, you need to enable them at indexing time (see the Field.TermVector.YES documentation and the Field constructor).
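The term-vector route could look roughly like this under the Lucene 3.x API (the field name "text" and the way you obtain the reader and docId are assumptions):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TagExtractor {
    // Hypothetical keyword list; yours would hold ~1000 entries.
    private static final Set<String> KEYWORDS = new HashSet<String>(
            Arrays.asList("java", "php", "lucene", "c#"));

    // Intersect the keyword list with the document's term vector, so you
    // know exactly which keywords occur in the document.
    public static List<String> extractTags(IndexReader reader, int docId)
            throws IOException {
        List<String> tags = new ArrayList<String>();
        TermFreqVector vector = reader.getTermFreqVector(docId, "text");
        if (vector != null) {
            for (String term : vector.getTerms()) {
                if (KEYWORDS.contains(term)) {
                    tags.add(term); // this keyword generated a hit
                }
            }
        }
        return tags;
    }
}
```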
Yes, this works
FullTextSession fts = Search.getFullTextSession(getSessionFactory().getCurrentSession());
Query q = fts.getSearchFactory().buildQueryBuilder()
        .forEntity(Offer.class).get()
        .keyword()
        .onField("id")
        .matching(myId)
        .createQuery();
Object[] dId = (Object[]) fts.createFullTextQuery(q, Offer.class)
        .setProjection(ProjectionConstants.DOCUMENT_ID)
        .uniqueResult();
if (dId != null) {
    IndexReader indexReader = fts.getSearchFactory().getIndexReaderAccessor().open(Offer.class);
    TermFreqVector freq = indexReader.getTermFreqVector((Integer) dId[0], "description");
}
You have to remember to index the field with TermVector.YES in your Hibernate Search annotation for the field.
I'm using Lucene with Java to index some text documents. After I get the top documents for a keyword search, I want to refine my search further and search only those top documents with some additional keywords, each document once. Can somebody tell me how I can search a specific document with a specific keyword: not the whole index, but, say, just 123.xml with the keywords "bla blah"?
Thanks in advance.
If you want to refine your search, you should use filters: look at IndexSearcher's search(Query query, Filter filter, int n, Sort sort) method. Filters are executed against the result set and are the proper way to implement refined searches.
Have a look at this page to find out how to use filters: http://www.javaranch.com/journal/2009/02/filtering-a-lucene-search.html
Anyway:
If you want to search in just one document, you can either take that one document, store it in a RAMDirectory, and search the RAMDirectory just as you would your normal index; or you can have a field containing a unique identifier for each document and add it to your query, e.g. content:(bla blah) AND unique_doc_id:(doc1).
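The unique-identifier variant could be sketched like this against the Lucene 3.x API (the field names "filename" and "content", the analyzer, and how you obtain the searcher are assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

public class SingleDocSearch {
    // Restrict the keyword query to the single document whose
    // "filename" field is "123.xml" by ANDing both clauses.
    public static TopDocs searchOneDoc(IndexSearcher searcher)
            throws java.io.IOException, ParseException {
        BooleanQuery combined = new BooleanQuery();
        combined.add(new TermQuery(new Term("filename", "123.xml")),
                BooleanClause.Occur.MUST);
        QueryParser parser = new QueryParser(Version.LUCENE_31, "content",
                new StandardAnalyzer(Version.LUCENE_31));
        combined.add(parser.parse("bla blah"), BooleanClause.Occur.MUST);
        // At most the one matching document can come back.
        return searcher.search(combined, 10);
    }
}
```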