Lucene 4.0 API - NRTManager simple case usage - java

I'm struggling with this new API and the lack of examples for core things like the NRTManager.
I followed this example and here is the final result:
This is how the NRT Manager is built:
analyzer = new StopAnalyzer(Version.LUCENE_40);
config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
writer = new IndexWriter(FSDirectory.open(new File(ConfigUtil.getProperty("lucene.directory"))), config);
mgrWriter = new NRTManager.TrackingIndexWriter(writer);
ReferenceManager<IndexSearcher> mgr = new NRTManager(mgrWriter, new SearcherFactory(), true);
Adding a new element to the NRT Manager's writer:
long gen = -1;
try {
    Document userDoc = DocumentManager.getDocument(user);
    gen = mgrWriter.addDocument(userDoc);
} catch (Exception e) {
    // exception swallowed here (simplified for the question); it should at least be logged
}
return gen;
After some small amount of time I need to update the previous document:
// Acquire a searcher from the NRTManager. I am using the generation obtained in the creation step
((NRTManager)mgr).waitForGeneration(gen);
searcher = mgr.acquire();
//Search for the document based on some user id
Term idTerm = new Term(USER_ID, Integer.toString(userId));
Query idTermQuery = new TermQuery(idTerm);
TopDocs result = searcher.search(idTermQuery, 1);
if (result.totalHits > 0)
    resultDoc = searcher.doc(result.scoreDocs[0].doc);
else
    resultDoc = null;
The problem is that resultDoc is always null. What am I missing? I should not have to call commit() or flush() in order to see those changes.
I am using a NRTManagerReopenThread as exemplified here.
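Roughly like this (a sketch; the staleness bounds below are illustrative values, not my exact configuration):
// reopen thread keeps fresh searchers available; 5.0/0.1 are the
// target max/min staleness in seconds (illustrative values)
NRTManagerReopenThread reopenThread = new NRTManagerReopenThread((NRTManager) mgr, 5.0, 0.1);
reopenThread.setName("NRT Reopen Thread");
reopenThread.setDaemon(true);
reopenThread.start();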
Later edit: userDoc creation
public static Document getDocument(User user) {
    Document doc = new Document();
    FieldType storedType = new FieldType();
    storedType.setStored(true);
    storedType.setIndexed(false);
    // Store user data
    doc.add(new Field(USER_ID, user.getId().toString(), storedType));
    doc.add(new Field(USER_NAME, user.getFirstName() + user.getLastName(), storedType));
    FieldType unstoredType = new FieldType();
    unstoredType.setStored(false);
    unstoredType.setIndexed(true);
    Field field = null;
    // Analyze Location
    String tokens = "";
    if (user.getLocation() != null && !user.getLocation().isEmpty()) {
        for (Tag location : user.getLocation()) tokens += location.getName() + " ";
        field = new Field(USER_LOCATION, tokens, unstoredType);
        field.setBoost(Constants.LOCATION);
        doc.add(field);
    }
    // Analyze Language
    if (user.getLanguage() != null && !user.getLanguage().isEmpty()) {
        // Same as Location
    }
    // Analyze Career
    if (user.getCareer() != null && !user.getCareer().isEmpty()) {
        // Same as Location
    }
    return doc;
}

Your problem is not NRT-related. You are searching against the USER_ID field although it has not been indexed; that can't work. If you don't want your ID field to be tokenized, just call FieldType#setTokenized(false) (or just use StringField, which does exactly that by default: indexed but not tokenized).
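For example, a minimal sketch of the indexing side (reusing the question's USER_ID constant):
// Option 1: keep the custom FieldType, but make the id indexed and untokenized
FieldType idType = new FieldType();
idType.setStored(true);
idType.setIndexed(true);
idType.setTokenized(false);
doc.add(new Field(USER_ID, user.getId().toString(), idType));
// Option 2: StringField is indexed but not tokenized by default
doc.add(new StringField(USER_ID, user.getId().toString(), Field.Store.YES));
With either variant, a TermQuery on the exact id string will match.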

Related

Add new column attribute to the shapefile and save it to database using Geotools Java

I am transforming a shapefile by adding new column attributes. Since this task is performed using Java, the only option I know of for now is GeoTools. I have 2 main concerns:
1. I am not able to figure out how to actually add a new column. Is feature.setAttribute("col", "value") the answer? I see the example in this post, https://gis.stackexchange.com/questions/215660/modifying-feature-attributes-of-a-shapefile-in-geotools, but I don't understand the solution.
// Upload the ShapeFile
File file = JFileDataStoreChooser.showOpenFile("shp", null);
Map<String, Object> params = new HashMap<>();
params.put("url", file.toURI().toURL());
DataStore store = DataStoreFinder.getDataStore(params);
SimpleFeatureSource featureSource = store.getFeatureSource(store.getTypeNames()[0]);
String typeName = store.getTypeNames()[0];
FeatureSource<SimpleFeatureType, SimpleFeature> source = store.getFeatureSource(typeName);
Filter filter = Filter.INCLUDE;
FeatureCollection<SimpleFeatureType, SimpleFeature> collection = source.getFeatures(filter);
try (FeatureIterator<SimpleFeature> features = collection.features()) {
    while (features.hasNext()) {
        SimpleFeature feature = features.next();
        // adding new columns
        feature.setAttribute("ShapeID", "SHP1213");
        feature.setAttribute("UserName", "John");
        System.out.print(feature.getID());
        System.out.print(":");
        System.out.println(feature.getDefaultGeometryProperty().getValue());
    }
}
/*
* Write the features to the shapefile
*/
Transaction transaction = new DefaultTransaction("create");
// featureSource.addFeatureListener(fl);
if (featureSource instanceof SimpleFeatureStore) {
    SimpleFeatureStore featureStore = (SimpleFeatureStore) featureSource;
    featureStore.setTransaction(transaction);
    try {
        featureStore.addFeatures(collection);
        transaction.commit();
    } catch (Exception problem) {
        problem.printStackTrace();
        transaction.rollback();
    } finally {
        transaction.close();
    }
    System.exit(0); // success!
} else {
    System.out.println(typeName + " does not support read/write access");
    System.exit(1);
}
Assuming setAttribute is what adds the column, I get the following error for the above code.
Exception in thread "main" org.geotools.feature.IllegalAttributeException:Unknown attribute ShapeID:null value:null
at org.geotools.feature.simple.SimpleFeatureImpl.setAttribute(SimpleFeatureImpl.java:238)
at org.geotools.Testing.WritetoDatabase.main(WritetoDatabase.java:73)
2. After making these changes I want to store the result in the database (PostGIS). I figured the snippet below does that task, but it doesn't seem to work for me with just the shapefile insertion.
Properties params = new Properties();
params.put("user", "postgres");
params.put("passwd", "postgres");
params.put("port", "5432");
params.put("host", "127.0.0.1");
params.put("database", "test");
params.put("dbtype", "postgis");
dataStore = DataStoreFinder.getDataStore(params);
The error is a NullPointerException in the above case.
In GeoTools a (Simple)FeatureType is immutable (unchangeable), so you can't just add a new attribute to a shapefile. First you must make a new FeatureType with your new attribute included.
FileDataStore ds = FileDataStoreFinder.getDataStore(new File("/home/ian/Data/states/states.shp"));
SimpleFeatureType schema = ds.getSchema();
// create new schema
SimpleFeatureTypeBuilder builder = new SimpleFeatureTypeBuilder();
builder.setName(schema.getName());
builder.setSuperType((SimpleFeatureType) schema.getSuper());
builder.addAll(schema.getAttributeDescriptors());
// add new attribute(s)
builder.add("shapeID", String.class);
// build new schema
SimpleFeatureType nSchema = builder.buildFeatureType();
Then you need to convert all your existing features to the new schema and add the new attribute.
// loop through features adding new attribute
List<SimpleFeature> features = new ArrayList<>();
try (SimpleFeatureIterator itr = ds.getFeatureSource().getFeatures().features()) {
    while (itr.hasNext()) {
        SimpleFeature f = itr.next();
        SimpleFeature f2 = DataUtilities.reType(nSchema, f);
        f2.setAttribute("shapeID", "newAttrValue");
        // System.out.println(f2);
        features.add(f2);
    }
}
Finally, open the Postgis datastore and write the new features to it.
Properties params = new Properties();
params.put("user", "postgres");
params.put("passwd", "postgres");
params.put("port", "5432");
params.put("host", "127.0.0.1");
params.put("database", "test");
params.put("dbtype", "postgis");
DataStore dataStore = DataStoreFinder.getDataStore(params);
SimpleFeatureSource source = dataStore.getFeatureSource("tablename");
if (source instanceof SimpleFeatureStore) {
    SimpleFeatureStore store = (SimpleFeatureStore) source;
    store.addFeatures(DataUtilities.collection(features));
} else {
    System.err.println("Unable to write to database");
}
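One caveat (my assumption, for the case where the table does not exist yet): the PostGIS datastore can only write to an existing table, so you may need to create it from the new schema first. A sketch:
// Create the target table from the retyped schema, then fetch it as before
// (skip this if the table already exists with matching columns)
dataStore.createSchema(nSchema);
SimpleFeatureSource source = dataStore.getFeatureSource(nSchema.getTypeName());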

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. I have the following code, which searches by query prefix, but it also gives me all the sentences containing that word, while I want it to display only sentences or words starting exactly with that prefix.
For example, for the prefix m:
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two results. How can I do this? I'm stuck here, and I am new to Lucene. Can anyone help me with this? Thanks in advance.
private void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use an un-analyzed field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    w.addDocument(doc);
}
Main:
try {
    // 0. Specify the analyzer for tokenizing text.
    // The same analyzer should be used for indexing and searching
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    // 1. create the index
    Directory index = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < source.size(); i++) {
        addDoc(writer, source.get(i), (i + 1) + "z");
    }
    writer.close();
    // 2. query
    Term term = new Term("title", querystr);
    // create the term query object
    PrefixQuery query = new PrefixQuery(term);
    // 3. search
    int hitsPerPage = 20;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    // 4. Get results
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(d.get("title"));
    }
    reader.close();
} catch (Exception e) {
    System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
as suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed)
save it just as a TextField, but when querying use SpanFirstQuery
// 2. query
Term term = new Term("title", querystr);
// create the prefix query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query firstPositionQuery = new SpanFirstQuery(wrapper, 1); // "final" is a reserved word in Java
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
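A minimal sketch of that two-field layout (the title_exact field name is my own; lowercase the value at index time if your queries are lowercased):
// at index time: one analyzed field for term queries within the title,
// one un-analyzed field where the whole title is a single term
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new StringField("title_exact", title, Field.Store.NO));
// at query time: a prefix query against the un-analyzed field, so "m"
// matches "machine" and "movies of all time" but not "eye muscles"
Query autocomplete = new PrefixQuery(new Term("title_exact", "m"));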

Lucene indexing - lots of docs/phrases

What approach should I use to index the following set of files?
Each file contains around 500k lines of characters (400 MB). The characters are not words; for the sake of the question, assume they are random characters without spaces.
I need to be able to find each line which contains a given 12-character string, for example:
line:
AXXXXXXXXXXXXJJJJKJIDJUD....ect up to 200 chars
interesting part: XXXXXXXXXXXX
While searching, I'm only interested in characters 1-13 (so XXXXXXXXXXXX). After the search I would like to be able to read the line containing XXXXXXXXXXXX without looping through the file.
I wrote the following POC (simplified for the question).
Indexing:
while ((line = br.readLine()) != null) {
    doc = new Document();
    Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
    doc.add(fileNameField);
    Field characterOffset = new IntField(CHARACTER_OFFSET, charsRead, Field.Store.YES);
    doc.add(characterOffset);
    String id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        writer.addDocument(doc);
    } catch (IndexOutOfBoundsException ior) {
        // cut off for sake of question
    } finally {
        // simplified for the question; characterOffset is the number of chars
        // to skip while reading the file (ultimately bytes read)
        charsRead += line.length() + 2;
    }
}
Searching:
RegexpQuery q = new RegexpQuery(new Term(CONTENTS, id), RegExp.NONE); // because id can be a regexp matching a 12-char string
TopDocs results = searcher.search(q, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
Map<String, Set<Integer>> fileToOffsets = new HashMap<String, Set<Integer>>();
for (int i = 0; i < numTotalHits; i++) {
    Document doc = searcher.doc(hits[i].doc);
    String fileName = doc.get(FILE_NAME);
    if (fileName != null) {
        String foundIds = doc.get(CONTENTS);
        Set<Integer> offsets = fileToOffsets.get(fileName);
        if (offsets == null) {
            offsets = new HashSet<Integer>();
            fileToOffsets.put(fileName, offsets);
        }
        String offset = doc.get(CHARACTER_OFFSET);
        offsets.add(Integer.parseInt(offset));
    }
}
The problem with this approach is that it will create one doc per line.
Can you please give me hints on how to approach this problem with Lucene, and whether Lucene is the way to go here?
Instead of adding a new document for each iteration, use the same document and keep adding fields with the same name to it, something like:
Document doc = new Document();
Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
doc.add(fileNameField);
String id;
while ((line = br.readLine()) != null) {
    id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        // What is this (characterOffset) field for?
        Field characterOffset = new IntField(CHARACTER_OFFSET, bytesRead, Field.Store.YES);
        doc.add(characterOffset);
    } catch (IndexOutOfBoundsException ior) {
        // cut off
    } finally {
        if ("".equals(line)) {
            bytesRead += 1;
        } else {
            bytesRead += line.length() + 2;
        }
    }
}
writer.addDocument(doc);
This will add the id from each line as a new term in the same field. The same query should continue to work.
I'm not really sure what to make of your use of the CHARACTER_OFFSET field, though. Each value will, as with the ids, be appended to the end of the field as another term. It won't be directly associated with a particular id term, aside from being, one would assume, the same number of tokens into the field. If you need to retrieve a particular line, rather than the contents of the whole file, your current approach of indexing line by line might be the most reasonable.
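If you do keep one document per file, note that the stored values of a repeated field come back as a parallel array, so ids and offsets are only correlated by position. A sketch (reusing the question's constants):
Document doc = searcher.doc(hits[0].doc);
String[] ids = doc.getValues(CONTENTS);             // one entry per indexed line
String[] offsets = doc.getValues(CHARACTER_OFFSET); // parallel to ids by position only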

Get word position In document with lucene

I wonder how to get the position of a word in a document using Lucene.
I have already generated the index files, and I want to extract some information from the index, such as the indexed terms and the position of each term in its document.
I created a reader like this :
public void readIndex(Directory indexDir) throws IOException {
    IndexReader ir = IndexReader.open(indexDir);
    Fields fields = MultiFields.getFields(ir);
    System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
    for (String field : fields) {
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef text;
        while ((text = termsEnum.next()) != null) {
            System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
        }
    }
}
I modified the writer to :
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I checked whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which function gives me the positions.
Before you try to retrieve the positional information, you've got to make sure that the indexing happened with the positional information enabled in the first place.
From the javadoc of TermsEnum.docsAndPositions(): "Get DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed."
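A sketch of how that looks inside your reader loop (assumes positions were indexed, which your FieldType does enable):
// inside the while loop over termsEnum:
DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
if (postings != null) { // null means positions were not indexed for this field
    while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        int freq = postings.freq();
        for (int i = 0; i < freq; i++) {
            System.out.println("doc=" + postings.docID() + " position=" + postings.nextPosition());
        }
    }
}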

Missing hits on lucene index search

I index one big database overview (just text fields) on which the user must be able to search (see the indexFields method below). This search used to be done in the database with an ILIKE query, but it was slow, so now the search is done on the index. However, when I compare the results of the DB query with the results I get from the index search, the index search always returns far fewer results.
I'm not sure if I am making a mistake in the indexing or in the search process. To me it all seems to make sense. Any ideas?
Here is the code. All advice appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(
Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();
private void indexFields(IndexWriter writer) {
    DetachedCriteria criteria = DetachedCriteria.forClass(Activit.class);
    int count = 0;
    int max = 50000;
    boolean existMoreToIndex = true;
    List<Activit> result = new ArrayList<Activit>();
    while (existMoreToIndex) {
        try {
            result = activitService.listPaged(count, max);
            if (result.size() < max)
                existMoreToIndex = false;
            if (result.size() == 0)
                return;
            for (Activit ao : result) {
                Document doc = new Document();
                doc.add(new Field("id", String.valueOf(ao.getId()),
                        Field.Store.YES, Field.Index.ANALYZED));
                if (ao.getActivitOwner() != null)
                    doc.add(new Field("field1", ao.getActivitOwner(),
                            Field.Store.YES, Field.Index.ANALYZED));
                if (ao.getActivitResponsible() != null)
                    doc.add(new Field("field2", ao.getActivitResponsible(),
                            Field.Store.YES, Field.Index.ANALYZED));
                try {
                    writer.addDocument(doc);
                } catch (CorruptIndexException e) {
                    e.printStackTrace();
                }
            }
            count += max;
        } catch (Exception e) { // catch clause was cut off in the original snippet
            e.printStackTrace();
        }
    }
}
//SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
    Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
    Version version = Version.LUCENE_CURRENT;
    String[] fields = { "field1", "field2" };
    try {
        File tempFile = new File("C://testindex");
        Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
        Searcher searcher = new IndexSearcher(INDEX_DIR, true);
        QueryParser parser = new MultiFieldQueryParser(version, fields,
                new StandardAnalyzer(version, stopSet));
        Query query = parser.parse(searchCriteria);
        TopDocs topDocs = searcher.search(query, 500);
        ScoreDoc[] hits = topDocs.scoreDocs;
        // here hits.length is always smaller than the number of DB results
        searcher.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null; // mapping hits back to Activit objects omitted for the question
}
Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke, you can see what your (analyzed) indexed documents look like, as well as your parsed queries - should let you see what's going wrong.
Also, can you give an example of searchCriteria? And the corresponding SQL query? Without that, it's hard to know if the indexing is done correctly. You may also not need to use MultiFieldQueryParser, which is quite inefficient.
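Also worth checking (an assumption about the cause, since ILIKE '%x%' is a substring match): Lucene term queries match whole tokens, so a parsed query for "act" will not hit "interacting" the way the ILIKE query did. A directly constructed wildcard query is the closer equivalent:
// substring-style match on field1, analogous to ILIKE '%act%'
// (leading wildcards can be slow on large indexes; consider n-gram
// indexing instead if this becomes a bottleneck)
Query substringQuery = new WildcardQuery(new Term("field1", "*act*"));
TopDocs topDocs = searcher.search(substringQuery, 500);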
