In short, I need to invert the mapping of fields and values from one index into a resulting index.
The following is the scenario.
Index 1 Structure
[Field => Values] [Stored]
Doc 1
    keys => keyword1
    Ids  => id1, id1, id2, id3, id7, id11, etc.
Doc 2
    keys => keyword2
    Ids  => id3, id11, etc.
Index 2 Structure
[Field => Values] [Stored]
Doc 1
    ids  => id1
    keys => keyword1, keyword1
Doc 3
    ids  => id3
    keys => keyword1, keyword2, etc.
Please note that the keys<->ids mapping is reversed in the resulting Index.
What do you think is the most efficient way to accomplish this in terms of time complexity?
The only approach I could think of is:
1) index1Reader.terms();
2) Process only terms belonging to "Ids" field
3) For each term, get TermDocs
4) For each doc, load it, get "keys" field info
5) Create a new Lucene Doc, add 'Id', multi Keys, write it to index2.
6) Go to step 2.
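A minimal sketch of those steps (Lucene 3.x API, field names taken from the structure above; index2Writer is assumed to be the Index 2 writer):
TermEnum termEnum = index1Reader.terms(new Term("Ids", ""));    // 1) position at the first "Ids" term
TermDocs termDocs = index1Reader.termDocs();
do {
    Term term = termEnum.term();
    if (term == null || !"Ids".equals(term.field())) {
        break;                                                   // 2) stop once we leave the "Ids" field
    }
    Document outDoc = new Document();
    outDoc.add(new Field("ids", term.text(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    termDocs.seek(termEnum);                                     // 3) TermDocs for this id
    while (termDocs.next()) {
        Document srcDoc = index1Reader.document(termDocs.doc()); // 4) load the stored "keys" value
        outDoc.add(new Field("keys", srcDoc.get("keys"), Field.Store.YES, Field.Index.ANALYZED));
    }
    index2Writer.addDocument(outDoc);                            // 5) write the reversed mapping
} while (termEnum.next());                                       // 6) next "Ids" term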
Since the fields are stored, I'm sure that there are multiple ways of doing it.
Please suggest any performance techniques. Even the slightest improvement will have a huge impact in my scenario, considering that the Index1 size is ~6 GB.
Total no. of unique keywords: 18 million;
Total no. of unique ids: 0.9 million
Interesting UPDATE
Optimization 1
While adding a new doc, instead of creating multiple duplicate 'Field' objects, building a single StringBuffer with a " " delimiter and then adding the whole value as a single Field gives up to a 25% improvement.
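A minimal before/after sketch of that change (keysForThisId is a hypothetical list holding the keys collected for one id):
// Before: one Field object per key value.
for (String key : keysForThisId) {
    doc.add(new Field("keys", key, Field.Store.YES, Field.Index.NOT_ANALYZED));
}

// After: a single space-delimited value, added once (~25% faster in my runs).
StringBuffer buffer = new StringBuffer();
for (String key : keysForThisId) {
    buffer.append(key).append(' ');
}
doc.add(new Field("keys", buffer.toString().trim(), Field.Store.YES, Field.Index.ANALYZED));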
UPDATE 2: Code
public void go() throws IOException, ParseException {
    String id = null;
    int counter = 0;
    while ((id = getNextId()) != null) { // this method is not taking time..
        System.out.println("Node id: " + id);
        updateIndex2DataForId(id);
        if (++counter > 10) {
            break;
        }
    }
    index2Writer.close();
}
private void updateIndex2DataForId(String id) throws ParseException, IOException {
    // Get all terms containing the node id
    TermDocs termDocs = index1Reader.termDocs(new Term("id", id));
    // Iterate
    Document doc = new Document();
    doc.add(new Field("id", id, Store.YES, Index.NOT_ANALYZED));
    int docId = -1;
    while (termDocs.next()) {
        docId = termDocs.doc();
        doc.add(getKeyDataAsField(docId, Store.YES, Index.NOT_ANALYZED));
    }
    index2Writer.addDocument(doc);
}

private Field getKeyDataAsField(int docId, Store storeOption, Index indexOption)
        throws CorruptIndexException, IOException {
    Document doc = index1Reader.document(docId, fieldSelector); // fieldSelector loads only "key"
    Field f = new Field("key", doc.get("key"), storeOption, indexOption);
    return f;
}
Using FieldCache worked like a charm, but we need to allot more and more RAM to accommodate all the field values on the heap.
I've updated the above updateIndex2DataForId() with the following snippet:
private void updateIndex2DataForId(String id) throws ParseException, IOException {
    // Get all terms containing the node id
    TermDocs termDocs = index1Reader.termDocs(new Term("id", id));
    // Iterate
    Document doc = new Document();
    doc.add(new Field("id", id, Store.YES, Index.NOT_ANALYZED));
    int docId = -1;
    StringBuffer buffer = new StringBuffer();
    while (termDocs.next()) {
        docId = termDocs.doc();
        buffer.append(keys[docId]).append(" "); // keys[] is pre-populated using FieldCache
    }
    doc.add(new Field("keys", buffer.toString().trim(), Store.YES, Index.ANALYZED));
    index2Writer.addDocument(doc);
}
String[] keys = FieldCache.DEFAULT.getStrings(index1Reader, "keywords");
It made everything faster; I can't give exact metrics, but the improvement was very substantial.
Now the program completes in a reasonable amount of time. Anyway, further guidance is highly appreciated.
Related
I am trying to get all terms and their postings (the Terms object) from a Lucene document field (i.e., how to calculate term frequency in Lucene?). According to the documentation there is a method to do that:
public final Terms getTermVector(int docID, String field) throws IOException
Retrieve term vector for this document and field, or null if term vectors were not indexed. The returned Fields instance acts like a single-document inverted index (the docID will be 0).
There is a parameter called int docID. What is this? For a given document, what is its id, and how does Lucene assign it?
Following Lucene's documentation I used a StringField as my id, and it is not an int.
import org.apache.lucene.document.*;
Document doc = new Document();
Field idField = new StringField("id",post.Id,Field.Store.YES);
Field bodyField = new TextField("body", post.Body, Field.Store.YES);
doc.add(idField);
doc.add(bodyField);
I have five questions accordingly:
How does Lucene know that the id field should be used as the docID for this document? Or does it even use it at all?
I used a String for my id, but this method takes an int. Does that cause a problem?
Is there any appropriate method to get postings?
I have used TextField. Is there any way to retrieve the term vector (Terms) of that field? I don't want to re-index my documents as explained here, because the index is too large (35 GB).
Is there any way to get the term count and each term's frequency from a TextField?
To calculate term frequency we can use IndexReader.getTermVector(int docID, String field). int docID is a parameter that refers to the internal document id created by Lucene. You can retrieve docID with code like the following:
String index = "index/AIndex/";
String query = "the query text"
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("docField", analyzer);
Query lQuery = parser.parse(query);
]TopDocs results = searcher.search(lQuery , requiredHits);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = (int) results.totalHits.value;
for (int i = start; i < numTotalHits; i++)
{
int docID = hits[i].doc;
Terms termVector = reader.getTermVector(docID, "docField");
}
Each termVector object holds the terms and their frequencies for one document field, and you can retrieve them with the following code:
private HashMap<String, Long> termsFrequency = new HashMap<>();

TermsEnum itr = termVector.iterator();
long allTermFrequency = 0;
BytesRef term;
while ((term = itr.next()) != null) {
    String termText = term.utf8ToString();
    long tf = itr.totalTermFreq();
    termsFrequency.put(termText, tf);
    allTermFrequency += tf;
}
Note: Don't forget to enable storing term vectors, as I explained here (or this one), when you are indexing documents. If you index your documents without storing term vectors, the getTermVector method will return null. All of the predefined Lucene field types have this option disabled by default, so you need to set it yourself.
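For reference, a minimal sketch of enabling term vectors at index time: a custom FieldType based on TextField. The "id"/"body" field names and the post object come from the question; writer is an assumed IndexWriter.
FieldType bodyType = new FieldType(TextField.TYPE_STORED);
bodyType.setStoreTermVectors(true);          // required for getTermVector(...) to return non-null
bodyType.setStoreTermVectorPositions(true);  // only needed if you also want positions
bodyType.freeze();

Document doc = new Document();
doc.add(new StringField("id", post.Id, Field.Store.YES));
doc.add(new Field("body", post.Body, bodyType));
writer.addDocument(doc);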
I have n users (at the moment only 1,000, but it should be more than 100,000 in the future, so the approach shouldn't be too inefficient).
Each user is represented by m (usually 1-10) documents. In my first attempt I concatenated the documents into one String. The String can contain all kinds of letters, numbers, and special characters ("/", "\r\n", "&", "+", ...); anything is possible.
In the end I want an n x n matrix with a similarity score comparing each user to every other user (the diagonal should be the highest score, because each user is most similar to itself).
Example:
user/user| userA | userB | userC |
userA | 1.00 | 0.94 | 0.33 |
userB | 0.92 | 1.00 | 0.12 |
userC | 0.35 | 0.22 | 1.00 |
That's what I want to achieve. I'm using Lucene, but I could switch to another framework if Lucene doesn't provide this.
I have done this:
public class Similarity {

    public static void main(String[] args) throws IOException {
        UserFactory userFactory = UserFactory.getInstance();
        UserBase base = userFactory.getUserFromCsv("user.csv");
        Similarity sim = new Similarity();
        sim.indexing(base);
    }

    private StandardAnalyzer analyzer = null;
    private Directory index = null;
    private IndexWriterConfig config = null;
    private IndexWriter w = null;

    public Similarity() throws IOException {
        analyzer = new StandardAnalyzer();
        index = new RAMDirectory();
        config = new IndexWriterConfig(analyzer);
        w = new IndexWriter(index, config);
    }

    public void query(User user) {
        // How??
    }

    public void indexing(UserBase base) throws IOException {
        for (User user : base.getUsers()) {
            addDoc(w, user.getText(), user.getId());
        }
        w.close();
    }

    private void addDoc(IndexWriter w, String text, String id) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("text", text, Field.Store.YES));
        doc.add(new StringField("id", id, Field.Store.YES));
        w.addDocument(doc);
    }
}
The User class is really simple and only has two fields, text and id. I want to compare the text of each user: user.getText().
My first attempt was to use a normal QueryParser:
public void query(User user) throws IOException, ParseException {
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query q = new QueryParser("text", analyzer).parse(user.getText());
    TopDocs docs = searcher.search(q, 10000);
    for (ScoreDoc hit : docs.scoreDocs) {
        Document d = searcher.doc(hit.doc);
        System.out.println(d.get("id") + " " + hit.score);
    }
}
The problems with this attempt were the following:
I have to replace all special characters before I can run it (I replaced "/"; there are probably more, but I only tested with one document, and I never know which characters could be a problem in the future).
The most similar user is not the user I used as the query; worse, that user doesn't even appear in the list...
But when I use only a subset (the first 20 characters), it does find the user I used as the query. Strange? Probably a flaw in my thinking...
Either way, I don't think that's the best approach to this problem...
I also tried MoreLikeThis (I already deleted the code, sorry), but it didn't work; I couldn't even get a result, it was always empty, even when comparing everything with everyone.
I'm a beginner with Lucene, so there could be a few flaws in my thinking.
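For reference, a rough sketch of what such a MoreLikeThis attempt could look like (not the original, deleted code; it assumes the "text" and "id" fields from the indexing code above, needs org.apache.lucene.queries.mlt.MoreLikeThis and java.io.StringReader, and lowers minTermFreq/minDocFreq, whose defaults often cause empty results for short texts):
public void queryMlt(User user) throws IOException {
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(analyzer);
    mlt.setFieldNames(new String[] { "text" });
    mlt.setMinTermFreq(1); // defaults (2 and 5) can drop every term for short texts
    mlt.setMinDocFreq(1);

    Query like = mlt.like("text", new StringReader(user.getText()));
    TopDocs docs = searcher.search(like, 10000);
    for (ScoreDoc hit : docs.scoreDocs) {
        Document d = searcher.doc(hit.doc);
        System.out.println(d.get("id") + " " + hit.score);
    }
    reader.close();
}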
What approach should I use to index the following set of files?
Each file contains around 500k lines of characters (400 MB). The characters are not words; they are, let's say for the sake of the question, random characters without spaces.
I need to be able to find each line which contains a given 12-character string, for example:
line:
AXXXXXXXXXXXXJJJJKJIDJUD.... etc., up to 200 chars
interesting part: XXXXXXXXXXXX
While searching, I'm only interested in characters 1-13 (so XXXXXXXXXXXX). After the search I would like to be able to read the line containing XXXXXXXXXXXX without looping through the file.
I wrote the following PoC (simplified for the question).
Indexing:
while ((line = br.readLine()) != null) {
    doc = new Document();
    Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
    doc.add(fileNameField);
    Field characterOffset = new IntField(CHARACTER_OFFSET, charsRead, Field.Store.YES);
    doc.add(characterOffset);
    String id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        writer.addDocument(doc);
    } catch (IndexOutOfBoundsException ior) {
        // cut off for the sake of the question
    } finally {
        // simplified for the sake of the question: characterOffset is the number of chars
        // to skip while reading the file (ultimately bytes read)
        charsRead += line.length() + 2;
    }
}
Searching:
RegexpQuery q = new RegexpQuery(new Term(CONTENTS, id), RegExp.NONE); // because id can be a regexp describing the 12-char string
TopDocs results = searcher.search(q, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
Map<String, Set<Integer>> fileToOffsets = new HashMap<String, Set<Integer>>();
for (int i = 0; i < numTotalHits; i++) {
    Document doc = searcher.doc(hits[i].doc);
    String fileName = doc.get(FILE_NAME);
    if (fileName != null) {
        String foundIds = doc.get(CONTENTS);
        Set<Integer> offsets = fileToOffsets.get(fileName);
        if (offsets == null) {
            offsets = new HashSet<Integer>();
            fileToOffsets.put(fileName, offsets);
        }
        String offset = doc.get(CHARACTER_OFFSET);
        offsets.add(Integer.parseInt(offset));
    }
}
The problem with this approach is that it will create one doc per line.
Can you please give me hints on how to approach this problem with Lucene, and whether Lucene is the way to go here?
Instead of adding a new document for each iteration, use the same document and keep adding fields with the same name to it, something like:
Document doc = new Document();
Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
doc.add(fileNameField);
String id;
while ((line = br.readLine()) != null) {
    id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        // What is this (characterOffset) field for?
        Field characterOffset = new IntField(CHARACTER_OFFSET, bytesRead, Field.Store.YES);
        doc.add(characterOffset);
    } catch (IndexOutOfBoundsException ior) {
        // cut off
    } finally {
        if ("".equals(line)) {
            bytesRead += 1;
        } else {
            bytesRead += line.length() + 2;
        }
    }
}
writer.addDocument(doc);
This will add the id from each line as a new term in the same field. The same query should continue to work.
I'm not really sure what to make of your use of the CharacterOffset field, though. Each value will, like the ids, be appended to the end of the field as another term. It won't be directly associated with a particular id term, aside from being, one would assume, the same number of tokens into the field. If you need to retrieve a particular line, rather than the contents of the whole file, your current approach of indexing line by line might be the most reasonable.
I have a web application which stores customers' usernames, emails, and phone numbers.
For a start, I want customers to be able to search for other users by email, phone, or username, just so I understand the whole Lucene concept. Later on I will add functionality to search within the items a user posts. I am following this example from www.lucenetutorial.com/lucene-in-5-minutes.html:
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();

        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }

        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}
I want new customers to be added to the index automatically on registration. The customerId is a timestamp. So should I add a new document for each field of the customer's details, or should I concatenate all the fields into one string and add it as a single document? Please go easy on me, I am really new.
This is a good place to start with Lucene's indexing mechanism:
http://www.ibm.com/developerworks/library/wa-lucene/
The bottom line is that when Lucene indexes a document, it first converts it into Lucene's document form. A Lucene document comprises a set of fields, and each field is a set of terms. Terms are nothing but streams of bytes.
The document to be indexed is passed to an analyzer, which produces these terms from it; these terms are the keywords that are matched during the search process.
When we perform a search, the query is analyzed with the same analyzer and is then matched against the terms.
So you don't have to create a document for each field; rather, you should create a single document for each user.
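A minimal sketch of that, modeled on the addDoc method from the question (the field names here are only illustrative):
// One document per customer; each attribute becomes its own field.
private static void addCustomer(IndexWriter w, String customerId, String username,
                                String email, String phone) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("customerId", customerId, Field.Store.YES)); // exact-match key, not tokenized
    doc.add(new TextField("username", username, Field.Store.YES));       // analyzed, searchable text
    doc.add(new StringField("email", email, Field.Store.YES));           // keep emails as single tokens
    doc.add(new StringField("phone", phone, Field.Store.YES));
    w.addDocument(doc);
}
On registration you would call addCustomer(...) with the new user's details; a query against the "username", "email", or "phone" field then matches that single document.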
I wonder how to get the position of a word in a document using Lucene.
I have already generated the index files, and I want to extract some information from the index, such as the indexed words, the position of each word in its document, etc.
I created a reader like this:
public void readIndex(Directory indexDir) throws IOException {
    IndexReader ir = IndexReader.open(indexDir);
    Fields fields = MultiFields.getFields(ir);
    System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
    for (String field : fields) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef text;
        while ((text = termsEnum.next()) != null) {
            System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
        }
    }
}
I modified the writer to:
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I checked whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which method gives me the positions.
Before you try to retrieve the positional information, you've got to make sure that the indexing happened with the positional information enabled in the first place.
TermsEnum.docsAndPositions(): Get DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed.
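A minimal sketch of reading positions back from a term vector with that API (Lucene 4.x; ir is the IndexReader from the reader above, docId is an existing document number, and "word" is the field indexed with term vectors and positions as in the modified writer):
Terms termVector = ir.getTermVector(docId, "word");
if (termVector != null && termVector.hasPositions()) {
    TermsEnum termsEnum = termVector.iterator(null);
    BytesRef text;
    while ((text = termsEnum.next()) != null) {
        DocsAndPositionsEnum dpEnum = termsEnum.docsAndPositions(null, null);
        dpEnum.nextDoc();                 // a term vector is a one-document index
        int freq = dpEnum.freq();         // occurrences of this term in the document
        for (int i = 0; i < freq; i++) {
            System.out.println(text.utf8ToString() + " at position " + dpEnum.nextPosition());
        }
    }
}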