Lucene indexing - lots of docs/phrases - java

What approach should I use to index the following set of files?
Each file contains around 500k lines of characters (400 MB). The characters are not words; for the sake of the question, let's say they are random characters without spaces.
I need to be able to find each line which contains a given 12-character string, for example:
line:
AXXXXXXXXXXXXJJJJKJIDJUD....ect up to 200 chars
interesting part: XXXXXXXXXXXX
While searching, I'm only interested in characters 1-13 (so XXXXXXXXXXXX). After the search I would like to be able to read the line containing XXXXXXXXXXXX without looping through the file.
I wrote the following PoC (simplified for the question):
Indexing:
while ((line = br.readLine()) != null) {
    doc = new Document();
    Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
    doc.add(fileNameField);
    Field characterOffset = new IntField(CHARACTER_OFFSET, charsRead, Field.Store.YES);
    doc.add(characterOffset);
    String id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        writer.addDocument(doc);
    } catch (IndexOutOfBoundsException ior) {
        // cut off for the sake of the question
    } finally {
        // simplified snippet for the sake of the question: characterOffset is the number of
        // chars to skip while reading the file (ultimately bytes read)
        charsRead += line.length() + 2;
    }
}
Searching:
RegexpQuery q = new RegexpQuery(new Term(CONTENTS, id), RegExp.NONE); // because id can be a regexp covering the 12-char string
TopDocs results = searcher.search(q, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
Map<String, Set<Integer>> fileToOffsets = new HashMap<String, Set<Integer>>();
for (int i = 0; i < numTotalHits; i++) {
    Document doc = searcher.doc(hits[i].doc);
    String fileName = doc.get(FILE_NAME);
    if (fileName != null) {
        String foundIds = doc.get(CONTENTS);
        Set<Integer> offsets = fileToOffsets.get(fileName);
        if (offsets == null) {
            offsets = new HashSet<Integer>();
            fileToOffsets.put(fileName, offsets);
        }
        String offset = doc.get(CHARACTER_OFFSET);
        offsets.add(Integer.parseInt(offset));
    }
}
The problem with this approach is that it creates one document per line.
Can you please give me hints on how to approach this problem with Lucene, and whether Lucene is the right tool here?

Instead of adding a new document for each iteration, use the same document and keep adding fields with the same name to it, something like:
Document doc = new Document();
Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
doc.add(fileNameField);
String id;
int bytesRead = 0;
while ((line = br.readLine()) != null) {
    id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        // What is this (characterOffset) field for?
        Field characterOffset = new IntField(CHARACTER_OFFSET, bytesRead, Field.Store.YES);
        doc.add(characterOffset);
    } catch (IndexOutOfBoundsException ior) {
        // cut off
    } finally {
        if ("".equals(line)) {
            bytesRead += 1;
        } else {
            bytesRead += line.length() + 2;
        }
    }
}
writer.addDocument(doc);
This will add the id from each line as a new term in the same field. The same query should continue to work.
I'm not really sure what to make of your use of the CharacterOffset field, though. Each value will, as with the ids, be appended to the end of the field as another term. It won't be directly associated with a particular term, aside from being, one would assume, the same number of tokens into the field. If you need to retrieve a particular line, rather than the contents of the whole file, your current approach of indexing line by line might be the most reasonable.
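If you do keep one document per line, a small refinement (just a sketch, not tested against your setup, assuming Lucene 4.x field classes) is to index the 12-character id as a single un-analyzed term with StringField and keep the offset as a stored-only StoredField, so only the id contributes indexed terms while the offset stays retrievable per hit:
// One document per line: the id is the only indexed term, the offset is only stored.
Document doc = new Document();
doc.add(new StringField(CONTENTS, line.substring(1, 13), Field.Store.YES)); // exact 12-char term
doc.add(new StoredField(CHARACTER_OFFSET, charsRead));                      // retrievable, not searchable
doc.add(new StringField(FILE_NAME, file.getName(), Field.Store.YES));
writer.addDocument(doc);
Your RegexpQuery against CONTENTS still works, because StringField indexes the whole value as one term.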

How to deal with NumberFormatException when reading from a csv file [duplicate]

My task is to read values from a CSV file and import each line of information from this file into an object array. I think my issue is the blank data elements in my CSV file, which break my parsing from String to int, but I have found no way to deal with this. Here is my code:
fileStream = new FileInputStream(pFileName);
rdr = new InputStreamReader(fileStream);
bufRdr = new BufferedReader(rdr);
lineNum = 0;

while (line != null) {
    lineNum++;
    String[] Values = new String[13];
    Values = line.split(",");
    int cumulPos = Integer.parseInt(Values[6]);
    int cumulDec = Integer.parseInt(Values[7]);
    int cumuRec = Integer.parseInt(Values[8]);
    int curPos = Integer.parseInt(Values[9]);
    int hosp = Integer.parseInt(Values[10]);
    int intenCar = Integer.parseInt(Values[11]);
    double latitude = Double.parseDouble(Values[4]);
    double longitude = Double.parseDouble(Values[5]);
    covidrecordArray[lineNum] = new CovidRecord(Values[0], cumulPos, cumulDec, cumuRec, curPos, hosp,
            intenCar, new Country(Values[1], Values[2], Values[3], Values[13], latitude, longitude));
}
If anyone could help it would be greatly appreciated.
As already suggested, use a proper CSV parser if you can, but if for some reason you can't, this could be one way to do it. Be sure to read the comments in the code:
fileStream = new FileInputStream(pFileName);
rdr = new InputStreamReader(fileStream);
bufRdr = new BufferedReader(rdr);

// Remove the following line if there is no header line in the CSV file.
String line = bufRdr.readLine();

String csvFileDataDelimiter = ",";
List<CovidRecord> recordsList = new ArrayList<>();

// True value calculated later in code (read comments).
int expectedNumberOfElements = 0; // 0 is default

while ((line = bufRdr.readLine()) != null) {
    line = line.trim();
    // If for some crazy reason a blank line is encountered...skip it.
    if (line.isEmpty()) {
        continue;
    }

    /* Get the expected number of elements within each CSV file data line.
       This is based on the number of actual delimiters within a data line
       plus 1. It is only calculated from the very first data line. */
    if (expectedNumberOfElements == 0) {
        expectedNumberOfElements = line.replaceAll("[^\\" + csvFileDataDelimiter + "]", "").length() + 1;
    }

    /* Create and fill (with null strings) an array of the expected size of
       a CSV data line. This is done because if a data line contains nothing
       for the last data element on that line, then when the line is split,
       the array that is created will be short by one element. This ensures
       that there will always be a null string ("") present within the array
       when there is nothing in the CSV data line. This null string is used
       in data validations so as to provide a default value (like 0) if an
       array element contains an actual null string (""). */
    String[] csvLineElements = new String[expectedNumberOfElements];
    Arrays.fill(csvLineElements, "");

    /* Take the array from the split (values) and place the data into
       the csvLineElements[] array. */
    String[] values = line.split("\\s*,\\s*"); // Takes care of any comma/whitespace combinations (if any).
    for (int i = 0; i < values.length; i++) {
        csvLineElements[i] = values[i];
    }

    /* Is the csvLineElements[] element a String representation of a signed
       or unsigned integer value ("-?\\d+")? If so, convert the String array
       element into an int value. If not, provide a default value of 0. */
    int cumulPos = Integer.parseInt(csvLineElements[6].matches("-?\\d+") ? csvLineElements[6] : "0");
    int cumulDec = Integer.parseInt(csvLineElements[7].matches("-?\\d+") ? csvLineElements[7] : "0");
    int cumuRec = Integer.parseInt(csvLineElements[8].matches("-?\\d+") ? csvLineElements[8] : "0");
    int curPos = Integer.parseInt(csvLineElements[9].matches("-?\\d+") ? csvLineElements[9] : "0");
    int hosp = Integer.parseInt(csvLineElements[10].matches("-?\\d+") ? csvLineElements[10] : "0");
    int intenCar = Integer.parseInt(csvLineElements[11].matches("-?\\d+") ? csvLineElements[11] : "0");

    /* Is the csvLineElements[] element a String representation of a signed
       or unsigned integer or floating point value ("-?\\d+(\\.\\d+)?")?
       If so, convert the String array element into a double value.
       If not, provide a default value of 0.0. */
    double latitude = Double.parseDouble(csvLineElements[4]
            .matches("-?\\d+(\\.\\d+)?") ? csvLineElements[4] : "0.0d");
    double longitude = Double.parseDouble(csvLineElements[5]
            .matches("-?\\d+(\\.\\d+)?") ? csvLineElements[5] : "0.0d");

    /* Create an instance of Country to pass into the constructor of
       CovidRecord below. */
    Country country = new Country(csvLineElements[1], csvLineElements[2],
                                  csvLineElements[3], csvLineElements[13],
                                  latitude, longitude);

    // Create and add an instance of CovidRecord to the recordsList List.
    recordsList.add(new CovidRecord(csvLineElements[0], cumulPos, cumulDec,
                                    cumuRec, curPos, hosp, intenCar, country));
}
// Do what you want with the recordsList List....
For obvious reasons, the code above was not tested. If you have any problems with it then let me know.
You will also notice that instead of the covidrecordArray[] array of CovidRecord I opted to use a List interface named recordsList. This list can grow dynamically, whereas the array is fixed, meaning you would need to determine the number of data lines within the file when initializing the array. This is not required with the List.
You can create one generic method for the null check: if the value is null, return an empty string (or anything else, based on your needs):
int hosp = Integer.parseInt(checkForNull(Values[10]));
public static String checkForNull(String val) {
return (val == null ? " " : val);
}
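Note that a null check alone won't stop the NumberFormatException for blank or non-numeric values; one way to cover both cases is a small parse-with-default helper (a sketch; the method name parseIntOrDefault is only illustrative, not from the original code):
// Hypothetical helper: returns a fallback value when the field is missing or not a number.
public static int parseIntOrDefault(String val, int defaultValue) {
    if (val == null || val.trim().isEmpty()) {
        return defaultValue;
    }
    try {
        return Integer.parseInt(val.trim());
    } catch (NumberFormatException nfe) {
        return defaultValue; // non-numeric values end up here
    }
}
Usage would then be, for example: int hosp = parseIntOrDefault(Values[10], 0);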

Prefix search using lucene

I am trying to do autocomplete using Lucene's search functionality. I have the following code, which searches by the query prefix, but along with that it also gives me all the sentences containing that word, while I want it to display only the sentences or words starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two queries. How do I do that? I am stuck here, and I am also new to Lucene. Can anyone please help me with this? Thanks in advance.
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a non-analyzed field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    w.addDocument(doc);
}
Main:
try {
    // 0. Specify the analyzer for tokenizing text.
    //    The same analyzer should be used for indexing and searching.
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // 1. Create the index.
    Directory index = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
    for (int i = 0; i < source.size(); i++) {
        addDoc(writer, source.get(i), (i + 1) + "z");
    }
    writer.close();

    // 2. Query.
    Term term = new Term("title", querystr);
    // Create the prefix query object.
    PrefixQuery query = new PrefixQuery(term);

    // 3. Search.
    int hitsPerPage = 20;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    // 4. Get results.
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(d.get("title"));
    }
    reader.close();
} catch (Exception e) {
    System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
I see two solutions:
1. As suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
2. Save it just as a TextField, but when querying use SpanFirstQuery:
// 2. query
Term term = new Term("title", querystr);
// create the prefix query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query finalQuery = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
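A minimal sketch of that two-field setup (assuming Lucene 4+ field classes; the field names title and title_raw are only illustrative):
// Index the title twice: analyzed for term matches inside the title, raw for autocomplete.
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));        // analyzed: term queries within the title
doc.add(new StringField("title_raw", title, Field.Store.YES));  // not analyzed: whole title as one term

// Autocomplete: a prefix query against the non-analyzed field matches only titles starting with "m".
Query autocomplete = new PrefixQuery(new Term("title_raw", "m"));
Note that StringField indexes the value exactly as given, so prefix matching is case-sensitive unless you lowercase both the stored value and the prefix yourself.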

Get word position In document with lucene

I wonder how to get the position of a word in a document using Lucene.
I have already generated the index files, and I want to extract some information from the index, such as the indexed word, the position of the word in the document, etc.
I created a reader like this :
public void readIndex(Directory indexDir) throws IOException {
    IndexReader ir = IndexReader.open(indexDir);
    Fields fields = MultiFields.getFields(ir);
    System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
    for (String field : fields) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef text;
        while ((text = termsEnum.next()) != null) {
            System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
        }
    }
}
I modified the writer to :
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I also tried checking whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which function can give me the positions.
Before you try to retrieve the positional information, you've got to make sure that the indexing happened with the positional information enabled in the first place.
From the Javadoc of TermsEnum.docsAndPositions(): get a DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed.
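For example, inside the terms loop of your reader, something along these lines should expose the positions (a sketch assuming Lucene 4.x, where TermsEnum.docsAndPositions() returns a DocsAndPositionsEnum, and assuming positions were indexed as in your FieldType setup):
// For the term the TermsEnum is currently positioned on:
DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
if (postings != null) { // null if positions were not indexed for this field
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        int freq = postings.freq();
        for (int i = 0; i < freq; i++) {
            int position = postings.nextPosition(); // token position within the document
            System.out.println("doc=" + postings.docID() + " position=" + position);
        }
    }
}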

Lucene 4.0 API - NRTManager simple case usage

I'm literally struggling with this new API and the lack of examples for core things like the NRT Manager.
I followed this example and here is the final result:
This is how the NRT Manager is built:
analyzer = new StopAnalyzer(Version.LUCENE_40);
config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
writer = new IndexWriter(FSDirectory.open(new File(ConfigUtil.getProperty("lucene.directory"))), config);
mgrWriter = new NRTManager.TrackingIndexWriter(writer);
ReferenceManager<IndexSearcher> mgr = new NRTManager(mgrWriter, new SearcherFactory(), true);
Adding a new element to the NRT Manager's writer:
long gen = -1;
try {
    Document userDoc = DocumentManager.getDocument(user);
    gen = mgrWriter.addDocument(userDoc);
} catch (Exception e) {}
return gen;
After some small amount of time I need to update the previous document:
// Acquire a searcher from the NRTManager. I am using the generation obtained in the creation step.
((NRTManager) mgr).waitForGeneration(gen);
searcher = mgr.acquire();

// Search for the document based on some user id.
Term idTerm = new Term(USER_ID, Integer.toString(userId));
Query idTermQuery = new TermQuery(idTerm);
TopDocs result = searcher.search(idTermQuery, 1);
if (result.totalHits > 0) resultDoc = searcher.doc(result.scoreDocs[0].doc);
else resultDoc = null;
The problem is that resultDoc will always be null. What am I missing? I should not have to use commit() or flush() in order to see those changes.
I am using a NRTManagerReopenThread as exemplified here.
Later edit: userDoc creation
public static Document getDocument(User user) {
    Document doc = new Document();

    FieldType storedType = new FieldType();
    storedType.setStored(true);
    storedType.setIndexed(false);

    // Store user data
    doc.add(new Field(USER_ID, user.getId().toString(), storedType));
    doc.add(new Field(USER_NAME, user.getFirstName() + user.getLastName(), storedType));

    FieldType unstoredType = new FieldType();
    unstoredType.setStored(false);
    unstoredType.setIndexed(true);

    Field field = null;

    // Analyze Location
    String tokens = "";
    if (user.getLocation() != null && !user.getLocation().isEmpty()) {
        for (Tag location : user.getLocation()) tokens += location.getName() + " ";
        field = new Field(USER_LOCATION, tokens, unstoredType);
        field.setBoost(Constants.LOCATION);
        doc.add(field);
    }

    // Analyze Language
    if (user.getLanguage() != null && !user.getLanguage().isEmpty()) {
        // Same as Location
    }

    // Analyze Career
    if (user.getCareer() != null && !user.getCareer().isEmpty()) {
        // Same as Location
    }

    return doc;
}
Your problem is not NRT-related. You are searching against the USER_ID field although it has not been indexed; this can't work. If you don't want your ID field to be tokenized, just call FieldType#setTokenized(false) (or just use StringField, which does exactly that by default: indexed but not tokenized).
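A minimal sketch of what that change could look like in getDocument() (assuming the Lucene 4.0 StringField/FieldType API; everything else stays as you have it):
// Option 1: StringField is indexed but not tokenized (and here also stored),
// so TermQuery(new Term(USER_ID, "...")) can find the document.
doc.add(new StringField(USER_ID, user.getId().toString(), Field.Store.YES));

// Option 2: keep a custom FieldType, but make it indexed and un-tokenized.
FieldType idType = new FieldType();
idType.setStored(true);
idType.setIndexed(true);
idType.setTokenized(false);
doc.add(new Field(USER_ID, user.getId().toString(), idType));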

Missing hits on lucene index search

I index one big database overview (just text fields) on which the user must be able to search (see the indexFields method below). This search used to be done in the database with an ILIKE query, but it was slow, so now the search is done on the index. However, when I compare the search results from the DB query with the results I get from the index search, there are always far fewer results from the index.
I'm not sure if I am making a mistake in the indexing or in the search process. To me it all seems to make sense. Any ideas?
Here is the code. All advice appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(
        Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();

private void indexFields(IndexWriter writer) {
    DetachedCriteria criteria = DetachedCriteria.forClass(Activit.class);
    int count = 0;
    int max = 50000;
    boolean existMoreToIndex = true;
    List<Activit> result = new ArrayList<Activit>();
    while (existMoreToIndex) {
        try {
            result = activitService.listPaged(count, max);
            if (result.size() < max)
                existMoreToIndex = false;
            if (result.size() == 0)
                return;
            for (Activit ao : result) {
                Document doc = new Document();
                doc.add(new Field("id", String.valueOf(ao.getId()),
                        Field.Store.YES, Field.Index.ANALYZED));
                if (ao.getActivityOwner() != null)
                    doc.add(new Field("field1", ao.getActivityOwner(),
                            Field.Store.YES, Field.Index.ANALYZED));
                if (ao.getActivityResponsible() != null)
                    doc.add(new Field("field2", ao.getActivityResponsible(),
                            Field.Store.YES, Field.Index.ANALYZED));
                try {
                    writer.addDocument(doc);
                } catch (CorruptIndexException e) {
                    e.printStackTrace();
                }
            }
            count += max;
        } catch (Exception e) {
            // closing of the try/while was cut off in the original snippet
            e.printStackTrace();
        }
    }
}
// SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
    Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
    Version version = Version.LUCENE_CURRENT;
    String[] fields = { "field1", "field2" };
    try {
        File tempFile = new File("C://testindex");
        Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
        Searcher searcher = new IndexSearcher(INDEX_DIR, true);
        QueryParser parser = new MultiFieldQueryParser(version, fields,
                new StandardAnalyzer(version, stopSet));
        Query query = parser.parse(searchCriteria);
        TopDocs topDocs = searcher.search(query, 500);
        ScoreDoc[] hits = topDocs.scoreDocs;
        // here I always get a smaller hits length
        searcher.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    // conversion of the hits to Activit objects (and the return) was cut off in the original snippet
}
Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke; you can see what your (analyzed) indexed documents look like, as well as your parsed queries. That should let you see what's going wrong.
Also, can you give an example of searchCriteria? And the corresponding SQL query? Without that, it's hard to know if the indexing is done correctly. You may also not need to use MultiFieldQueryParser, which is quite inefficient.
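If you can't run Luke, one quick check (a sketch against the Lucene 3.x analysis API your code already uses; it assumes CharTermAttribute, available since 3.1) is to print what the analyzer actually produces for a field value and compare it with the terms in your parsed query:
// Print the tokens the analyzer produces for a given field value;
// compare these against the terms in parser.parse(searchCriteria).toString().
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, new HashSet<String>());
TokenStream ts = analyzer.tokenStream("field1", new StringReader("text exactly as stored in the DB"));
CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println("token: " + termAttr.toString());
}
ts.end();
ts.close();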
