Get word position in document with Lucene - Java

I wonder how to get the position of a word in a document using Lucene.
I have already generated the index files, and I want to extract some information from the index, such as the indexed word, the position of the word in the document, etc.
I created a reader like this:
public void readIndex(Directory indexDir) throws IOException {
    IndexReader ir = IndexReader.open(indexDir);
    Fields fields = MultiFields.getFields(ir);
    System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
    for (String field : fields) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef text;
        while ((text = termsEnum.next()) != null) {
            System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
        }
    }
}
I modified the writer to:
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I tried to check whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which function gives me the position.

Before you try to retrieve the positional information, you have to make sure that the indexing happened with the positional information enabled in the first place.
From the Javadoc for TermsEnum.docsAndPositions(): "Get DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed."
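In code, that looks roughly like the sketch below (assuming Lucene 4.x, the fields object from your reader, and that positions were indexed, as your hasPositions() == true suggests):
Terms terms = fields.terms("word");
TermsEnum termsEnum = terms.iterator(null);
BytesRef text;
while ((text = termsEnum.next()) != null) {
    // null liveDocs means "all documents"; null reuse allocates a fresh enum
    DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
    if (postings == null) continue; // positions were not indexed for this term
    int docId;
    while ((docId = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        // freq() is the number of occurrences in this doc;
        // each nextPosition() call returns one of them, in order
        for (int i = 0; i < postings.freq(); i++) {
            System.out.println("doc=" + docId + ", position=" + postings.nextPosition());
        }
    }
}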

Related

Prefix search using Lucene

I am trying to implement autocomplete using Lucene's search functionality. I have the following code, which searches by the query prefix, but along with that it also gives me all the sentences containing that word, while I want it to display only sentences or words starting exactly with that prefix.
For example, for the prefix m:
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two queries. How do I do that? I am stuck here, and I am also new to Lucene. Can anyone help me with this? Thanks in advance.
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a non-analyzed field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    w.addDocument(doc);
}
Main:
try {
    // 0. Specify the analyzer for tokenizing text.
    //    The same analyzer should be used for indexing and searching
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    // 1. create the index
    Directory index = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < source.size(); i++) {
        addDoc(writer, source.get(i), (i + 1) + "z");
    }
    writer.close();
    // 2. query
    Term term = new Term("title", querystr);
    // create the prefix query object
    PrefixQuery query = new PrefixQuery(term);
    // 3. search
    int hitsPerPage = 20;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    // 4. Get results
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(d.get("title"));
    }
    reader.close();
} catch (Exception e) {
    System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
I see two solutions:
1. as suggested by Yahnoosh, save the title field twice, once as a TextField (analyzed) and once as a StringField (not analyzed);
2. save it just as a TextField, but when querying use SpanFirstQuery:
// 2. query
Term term = new Term("title", querystr);
// create the prefix query and wrap it as a span query
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query q = new SpanFirstQuery(wrapper, 1);
SpanFirstQuery(wrapper, 1) only accepts matches that end by position 1, i.e. terms in the very first position of the field, which gives you the starts-with behavior.
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
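For reference, a minimal sketch of the two-field layout described above (Lucene 4.x field classes; the title_exact field name is just an illustration):
// indexing: one analyzed copy for matches inside the title,
// one non-analyzed copy whose single term is the whole title
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new StringField("title_exact", title, Field.Store.NO));
// autocomplete: prefix-query the non-analyzed copy, so "m" matches
// only titles that start with m ("machine", "movies of all time")
Query autocomplete = new PrefixQuery(new Term("title_exact", prefix));
Note that StringField is not run through the analyzer, so it is not lowercased: either lowercase the title before indexing or make sure the prefix matches the stored case.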

Lucene indexing - lots of docs/phrases

What approach should I use to index the following set of files?
Each file contains around 500k lines of characters (400 MB). The characters are not words; for the sake of the question, assume they are random characters, without spaces.
I need to be able to find each line which contains a given 12-character string, for example:
line:
AXXXXXXXXXXXXJJJJKJIDJUD....etc., up to 200 chars
interesting part: XXXXXXXXXXXX
While searching, I'm only interested in characters 1-13 (so XXXXXXXXXXXX). After the search I would like to be able to read the line containing XXXXXXXXXXXX without looping through the file.
I wrote the following POC (simplified for the question):
Indexing:
while ((line = br.readLine()) != null) {
    doc = new Document();
    Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
    doc.add(fileNameField);
    Field characterOffset = new IntField(CHARACTER_OFFSET, charsRead, Field.Store.YES);
    doc.add(characterOffset);
    String id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        writer.addDocument(doc);
    } catch (IndexOutOfBoundsException ior) {
        // cut off for the sake of the question
    } finally {
        // simplified snippet; characterOffset is the number of chars to skip while reading the file (ultimately bytes read)
        charsRead += line.length() + 2;
    }
}
Searching:
RegexpQuery q = new RegexpQuery(new Term(CONTENTS, id), RegExp.NONE); // because id can be a regexp matching a 12-char string
TopDocs results = searcher.search(q, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
Map<String, Set<Integer>> fileToOffsets = new HashMap<String, Set<Integer>>();
for (int i = 0; i < numTotalHits; i++) {
    Document doc = searcher.doc(hits[i].doc);
    String fileName = doc.get(FILE_NAME);
    if (fileName != null) {
        String foundIds = doc.get(CONTENTS);
        Set<Integer> offsets = fileToOffsets.get(fileName);
        if (offsets == null) {
            offsets = new HashSet<Integer>();
            fileToOffsets.put(fileName, offsets);
        }
        String offset = doc.get(CHARACTER_OFFSET);
        offsets.add(Integer.parseInt(offset));
    }
}
The problem with this approach is that it will create one doc per line.
Can you please give me hints on how to approach this problem with Lucene, and whether Lucene is the way to go here?
Instead of adding a new document for each iteration, use the same document and keep adding fields with the same name to it, something like:
Document doc = new Document();
Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
doc.add(fileNameField);
int bytesRead = 0; // assumed to start at 0; not declared in the original snippet
String id;
while ((line = br.readLine()) != null) {
    id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        // What is this (characterOffset) field for?
        Field characterOffset = new IntField(CHARACTER_OFFSET, bytesRead, Field.Store.YES);
        doc.add(characterOffset);
    } catch (IndexOutOfBoundsException ior) {
        // cut off
    } finally {
        if ("".equals(line)) {
            bytesRead += 1;
        } else {
            bytesRead += line.length() + 2;
        }
    }
}
writer.addDocument(doc);
This will add the id from each line as a new term in the same field. The same query should continue to work.
I'm not really sure what to make of your use of the CharacterOffset field, though. Each value will, as with the ids, be appended to the end of the field as another term. It won't be directly associated with a particular term, aside from being, one would assume, the same number of tokens into the field. If you need to retrieve a particular line, rather than the contents of the whole file, your current approach of indexing line by line might be the most reasonable.

How do you get fields from an IndexReader now that TermEnum doesn't exist in Lucene 4.1?

What's the equivalent of this 3.6 code in Lucene 4.1:
IndexReader ir = IndexReader.open(dir);
TermEnum termEnum = ir.terms(t);
which is used in many of my test cases.
I've checked the migration guide, which says that instead of
TermEnum termsEnum = ...;
while (termsEnum.next()) {
    Term t = termsEnum.term();
    System.out.println("field=" + t.field() + "; text=" + t.text());
}
Do this:
for (String field : fields) {
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef text;
    while ((text = termsEnum.next()) != null) {
        System.out.println("field=" + field + "; text=" + text.utf8ToString());
    }
}
But where does fields come from? How do I get fields from my IndexReader?
Use
MultiFields.getFields(ir)
instead
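Putting it together, a minimal sketch for Lucene 4.1 (the same enumeration loop as above, with the fields object obtained from MultiFields):
IndexReader ir = DirectoryReader.open(dir);
Fields fields = MultiFields.getFields(ir);
if (fields != null) { // null if the index contains no postings
    for (String field : fields) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef text;
        while ((text = termsEnum.next()) != null) {
            System.out.println("field=" + field + "; text=" + text.utf8ToString());
        }
    }
}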

Lucene 4.0 in-text search

I'm using Lucene 4.0 with Java. I'm trying to search for a string inside a string. Taking the Lucene hello world example below, I wish to find the text "lucene" inside the phrase "inLuceneAction". I want it to find two matches in this case instead of one.
Any idea how to do it?
Thanks
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "inLuceneAction", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();
        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }
        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}
If you index the terms in the default way, meaning inLuceneAction is one term, Lucene won't be able to seek to this term given the query lucene, because it has a different prefix. Analyze the string so that it results in three indexed terms, in, Lucene, and Action, and then it will be found. You'll either find a ready-made analyzer for this or you'll have to write your own. Writing your own analyzers is a bit out of scope for a single Stack Overflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.
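One ready-made building block worth looking at is WordDelimiterFilter from the lucene-analyzers-common module, which can split tokens on case changes. A hedged sketch (class and flag names as of Lucene 4.x; verify against your version):
Analyzer camelCaseAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // split on whitespace first, then break camelCase tokens into word parts
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                | WordDelimiterFilter.PRESERVE_ORIGINAL; // also keep "inLuceneAction" itself
        TokenStream filter = new WordDelimiterFilter(source, flags, null);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        return new TokenStreamComponents(source, filter);
    }
};
With this, "inLuceneAction" would be indexed as in, lucene, action (plus the original token), so the query lucene matches. Remember to use the same analyzer at both index and query time.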

Lucene 4.0 API - NRTManager simple case usage

I'm really struggling with this new API and the lack of examples for core things like the NRTManager.
I followed this example, and here is the final result.
This is how the NRTManager is built:
analyzer = new StopAnalyzer(Version.LUCENE_40);
config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
writer = new IndexWriter(FSDirectory.open(new File(ConfigUtil.getProperty("lucene.directory"))), config);
mgrWriter = new NRTManager.TrackingIndexWriter(writer);
ReferenceManager<IndexSearcher> mgr = new NRTManager(mgrWriter, new SearcherFactory(), true);
Adding a new element to the NRT Manager's writer:
long gen = -1;
try {
    Document userDoc = DocumentManager.getDocument(user);
    gen = mgrWriter.addDocument(userDoc);
} catch (Exception e) {}
return gen;
After some small amount of time I need to update the previous document:
// Acquire a searcher from the NRTManager, using the generation obtained in the creation step
((NRTManager) mgr).waitForGeneration(gen);
searcher = mgr.acquire();
// Search for the document based on some user id
Term idTerm = new Term(USER_ID, Integer.toString(userId));
Query idTermQuery = new TermQuery(idTerm);
TopDocs result = searcher.search(idTermQuery, 1);
if (result.totalHits > 0) resultDoc = searcher.doc(result.scoreDocs[0].doc);
else resultDoc = null;
The problem is that resultDoc will always be null. What am I missing? I should not have to use commit() or flush() to see those changes.
I am using an NRTManagerReopenThread, as exemplified here.
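For reference, a minimal sketch of wiring up that reopen thread (constructor signature from Lucene 4.0; the stale-time values are only illustrative):
NRTManagerReopenThread reopenThread = new NRTManagerReopenThread(
        (NRTManager) mgr,
        5.0,   // targetMaxStaleSec: reopen at least this often when no one is waiting
        0.1);  // targetMinStaleSec: reopen this fast when a caller blocks in waitForGeneration()
reopenThread.setName("NRT Reopen Thread");
reopenThread.setDaemon(true);
reopenThread.start();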
Later edit: userDoc creation:
public static Document getDocument(User user) {
    Document doc = new Document();
    FieldType storedType = new FieldType();
    storedType.setStored(true);
    storedType.setIndexed(false);
    // Store user data
    doc.add(new Field(USER_ID, user.getId().toString(), storedType));
    doc.add(new Field(USER_NAME, user.getFirstName() + user.getLastName(), storedType));
    FieldType unstoredType = new FieldType();
    unstoredType.setStored(false);
    unstoredType.setIndexed(true);
    Field field = null;
    // Analyze Location
    String tokens = "";
    if (user.getLocation() != null && !user.getLocation().isEmpty()) {
        for (Tag location : user.getLocation()) tokens += location.getName() + " ";
        field = new Field(USER_LOCATION, tokens, unstoredType);
        field.setBoost(Constants.LOCATION);
        doc.add(field);
    }
    // Analyze Language
    if (user.getLanguage() != null && !user.getLanguage().isEmpty()) {
        // Same as Location
    }
    // Analyze Career
    if (user.getCareer() != null && !user.getCareer().isEmpty()) {
        // Same as Location
    }
    return doc;
}
Your problem is not NRT-related. You are searching against the USER_ID field although it has not been indexed; that can't work. If you don't want your ID field to be tokenized, just call FieldType#setTokenized(false) (or just use StringField, which does exactly that by default: indexed but not tokenized).
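A minimal sketch of the corrected field setup inside getDocument (field names taken from the question; everything else unchanged):
// indexed so it is searchable, but kept as a single untokenized term
FieldType idType = new FieldType();
idType.setStored(true);
idType.setIndexed(true);
idType.setTokenized(false);
doc.add(new Field(USER_ID, user.getId().toString(), idType));
// or, equivalently:
doc.add(new StringField(USER_ID, user.getId().toString(), Field.Store.YES));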
