Lucene 4.0 overrides final method tokenStream - java

For various reasons I have to work with the latest release of Lucene's API. The API isn't well documented yet, so I find myself unable to perform a simple addDocument().
Here is the Writer initialization:
analyzer = new StopAnalyzer(Version.LUCENE_40);
config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
writer = new IndexWriter(FSDirectory.open(new File(ConfigUtil.getProperty("lucene.directory"))), config);
The simple toDocument method:
public static Document getDocument(User user) {
    Document doc = new Document();

    FieldType storedType = new FieldType();
    storedType.setStored(true);
    storedType.setTokenized(false);

    // Store user data
    doc.add(new Field(USER_ID, user.getId().toString(), storedType));
    doc.add(new Field(USER_NAME, user.getFirstName() + user.getLastName(), storedType));

    FieldType unstoredType = new FieldType();
    unstoredType.setStored(false);
    unstoredType.setTokenized(true);

    // Analyze location
    String tokens = "";
    if (user.getLocation() != null && !user.getLocation().isEmpty()) {
        for (Tag location : user.getLocation()) tokens += location.getName() + " ";
        doc.add(new Field(USER_LOCATION, tokens, unstoredType));
    }
    return doc;
}
When running:
Document userDoc = DocumentManager.getDocument(userWrap);
IndexAccess.getWriter().addDocument(userDoc);
This is the error message I get:
class org.apache.lucene.analysis.util.ReusableAnalyzerBase overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
It may be a simple matter, but I cannot find any reference to help with this problem. I'm using a default analyzer, and I followed a tutorial in order to avoid the deprecated Field.Index.ANALYZED.

This is due to some kind of JAR version mismatch. You may be depending on a contrib JAR that in turn depends on a different version of Lucene. Try to get hold of the exact dependency set at runtime and look for any version mismatches.
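For instance, you can check at runtime which JAR a suspect class was actually loaded from (a minimal sketch; the class name is taken from the error above, and note that getCodeSource() can return null for bootstrap classes):
// Print the JAR that a given class was loaded from
Class<?> clazz = Class.forName("org.apache.lucene.analysis.util.ReusableAnalyzerBase");
System.out.println(clazz.getProtectionDomain().getCodeSource().getLocation());
Running the same check for IndexWriter and any contrib classes should reveal whether two different Lucene versions are on the classpath.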


MongoDB "Invalid BSON Field Name"

I know that there's probably a better way to do this, but I'm completely stumped. I'm writing a Discord bot in which a user is able to add points to other users; however, I can't figure out how to replace a user's "points". My code is as follows:
BasicDBObject cursor = new BasicDBObject();
cursor.put(user.getAsMember().getId(), getMongoPoints(user.getAsMember()));
if (cursor.containsKey(user.getAsMember().getId())) {
    Document old = new Document(user.getAsMember().getId(), getMongoPoints(user.getAsMember()));
    Document doc = new Document(user.getAsMember().getId(), getMongoPoints(user.getAsMember()) + Integer.parseInt(amount.getAsString()));
    collection.findOneAndUpdate(old, doc);
}
My getMongoPoints function:
public static int getMongoPoints(Member m) {
    ConnectionString connectionString = new ConnectionString("database");
    MongoClientSettings settings = MongoClientSettings.builder()
            .applyConnectionString(connectionString)
            .build();
    MongoClient mongoClient = MongoClients.create(settings);
    MongoDatabase database = mongoClient.getDatabase("SRU");
    MongoCollection<Document> collection = database.getCollection("points");
    DistinctIterable<Integer> docs = collection.distinct(m.getId(), Integer.class);
    MongoCursor<Integer> result = docs.iterator();
    return result.next();
}
I've tried findOneAndReplace, however that simply makes a new entry without deleting the old one. The error I receive is: Invalid BSON field name 262014495440896000
Everything else works, including writing to the database itself, which is why I'm stumped. Any help would be greatly appreciated, and I apologize if this is written poorly.
BSON field names must be strings. From the spec:
Zero or more modified UTF-8 encoded characters followed by '\x00'. The (byte*) MUST NOT contain '\x00', hence it is not full UTF-8.
To use 262014495440896000 as a field name, convert it to string first.
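A minimal sketch of that conversion (assuming JDA's getIdLong(); Filters and Updates come from com.mongodb.client.model. Note also that findOneAndUpdate expects $-operators such as $set in its update document, which is another common trigger for this exact error):
// Hypothetical fix: use a String key and a $set update instead of a plain document
String fieldName = String.valueOf(user.getAsMember().getIdLong());
int newPoints = getMongoPoints(user.getAsMember()) + Integer.parseInt(amount.getAsString());
collection.findOneAndUpdate(
        Filters.exists(fieldName),           // match a document that already has this field
        Updates.set(fieldName, newPoints));  // $set replaces the value in place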

Lucene : Check Index availability before

I would like to know if there are ways to check the index's availability before creating it. I referred to a lot of threads, like Lucene.NET - check if document exists in index, but my program didn't work.
I am using the TutorialsPoint example to implement this with Java 1.5.
Below is a little code snippet which I used:
LuceneTester.java
static String indexDir = "D:\\Lucene\\Index";
static String dataDir = "D:\\Lucene\\Data";
Main:
Directory directory = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(directory);
Term term = new Term("D:\\Lucene\\Data", "record1.txt");
TermDocs docs = reader.termDocs(term);
if (docs.next()) {
    System.out.println("Already Indexed");
} else {
    tester = new LuceneTester();
    tester.createIndex();
    tester.search("Anish");
}
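One thing worth trying first: Lucene 3.x exposes a static existence check on the Directory, and since IndexReader.open throws when no index exists there, the check should come before opening the reader. A minimal sketch reusing the snippet's own variables:
// Sketch: test for an existing index before trying to open a reader on it
Directory directory = FSDirectory.open(new File(indexDir));
if (IndexReader.indexExists(directory)) {
    System.out.println("Already Indexed");
} else {
    tester = new LuceneTester();
    tester.createIndex();
    tester.search("Anish");
}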

Lucene 4.1 is ignoring FieldType.tokenized = false

I'm using Lucene 4.1 to index keyword/value pairs, where the keywords and values are not real words; i.e., they are voltages, settings, etc. that should not be analyzed or tokenized, e.g. $P14R / 16777216. (This is FCS data, for any flow cytometrists out there.)
For indexing, I create a FieldType with indexed = true, stored = true, and tokenized = false. These mimic the ancient Field.Keyword from Lucene 1, for which I have the book. :-) I even freeze the fieldType.
I see these values in the debugger. I create the document and index.
When I read the index and document and look at the Fields in the debugger, I see all my fields. The names and fieldsData look correct. However, the FieldType is wrong. It shows indexed = true, stored = true, and tokenized = true. The result is that my searches (using a TermQuery) do not work.
How can I fix this? Thanks.
p.s. I am using a KeywordAnalyzer in the IndexWriterConfig. I'll try to post some demo code later, but it's off to my real job for today. :-)
DEMO CODE:
public class LuceneDemo {
    public static void main(String[] args) throws IOException {
        Directory lDir = new RAMDirectory();
        Analyzer analyzer = new KeywordAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(lDir, iwc);

        // BTW, Lucene, anyway you could make this even more tedious???
        // ever heard of builders, Enums, or even old fashioned bits?
        FieldType keywordFieldType = new FieldType();
        keywordFieldType.setStored(true);
        keywordFieldType.setIndexed(true);
        keywordFieldType.setTokenized(false);

        Document doc = new Document();
        doc.add(new Field("$foo", "$bar123", keywordFieldType));
        doc.add(new Field("contents", "$foo=$bar123", keywordFieldType));
        doc.add(new Field("$foo2", "$bar12345", keywordFieldType));
        Field onCreation = new Field("contents", "$foo2=$bar12345", keywordFieldType);
        doc.add(onCreation);
        System.out.println("When creating, the field's tokenized is " + onCreation.fieldType().tokenized());

        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(lDir);
        Document d1 = reader.document(0);
        Field readBackField = (Field) d1.getFields().get(0);
        System.out.println("When read back the field's tokenized is " + readBackField.fieldType().tokenized());

        IndexSearcher searcher = new IndexSearcher(reader);

        // exact match works
        Term term = new Term("$foo", "$bar123");
        Query query = new TermQuery(term);
        TopDocs results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);

        // partial match fails
        term = new Term("$foo", "123");
        query = new TermQuery(term);
        results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);

        // wildcard search works
        term = new Term("contents", "*$bar12345");
        query = new WildcardQuery(term);
        results = searcher.search(query, 10);
        System.out.println("when searching for : " + query.toString() + " hits = " + results.totalHits);
    }
}
output will be:
When creating, the field's tokenized is false
When read back the field's tokenized is true
when searching for : $foo:$bar123 hits = 1
when searching for : $foo:123 hits = 0
when searching for : contents:*$bar12345 hits = 1
You can try to use a KeywordAnalyzer for the fields you don't want tokenized.
If you need multiple analyzers (that is, if you have other fields that do need tokenization), PerFieldAnalyzerWrapper is the way to go.
Note that many analyzers (StandardAnalyzer, for example) lower-case tokens, so for fields indexed through such an analyzer you need to convert your search strings to lower case first; KeywordAnalyzer indexes the term as-is.
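A minimal sketch of the PerFieldAnalyzerWrapper approach (the "description" field is hypothetical; everything else mirrors the demo code):
// Keyword analysis by default, full tokenization only for selected fields
Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("description", new StandardAnalyzer(Version.LUCENE_41)); // hypothetical tokenized field
Analyzer analyzer = new PerFieldAnalyzerWrapper(new KeywordAnalyzer(), perField);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);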
The demo code proves that the value of tokenized is different when you read it back. Not sure whether that is a bug or not.
But that isn't why the partial search doesn't work. The partial search doesn't work because Lucene doesn't do partial matches on terms (unless you use a WildcardQuery), e.g. it says so here on StackOverflow.
Been using Google so long I guess I didn't understand that. :-)

Java MongoDB getting value for sub document

I am trying to get the value of a key from a sub-document, and I can't seem to figure out how to use the BasicDBObject.get() function since the key is embedded two levels deep. Here is the structure of the document:
File {
    name: file_1
    report: {
        name: report_1,
        group: RnD
    }
}
Basically a file has multiple reports, and I need to retrieve the names of all reports in a given file. I am able to do BasicDBObject.get("name") and get the value "file_1", but how do I do something like BasicDBObject.get("report.name")? I tried that, but it did not work.
You should first get the "report" object and then access its contents. You can see the sample code below.
DBCursor cur = coll.find();
for (DBObject doc : cur) {
    String fileName = (String) doc.get("name");
    System.out.println(fileName);
    DBObject report = (BasicDBObject) doc.get("report");
    String reportName = (String) report.get("name");
    System.out.println(reportName);
}
I found a second way of doing it on another post (I didn't save the link, otherwise I would have included it):
((BasicDBObject) query.get("report")).getString("name")
where query = (BasicDBObject) cursor.next()
You can also use queries, as in the case of MongoTemplate and so on...
Query query = new Query(Criteria.where("report.name").is("some value"));
You can try this; it worked for me:
BasicDBObject query = new BasicDBObject("report.name", "some value");
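If it helps, a short usage sketch with the legacy driver (the "report_1" value is taken from the example document above):
// Dot notation in the query matches fields inside the embedded "report" document
DBCursor cur = coll.find(new BasicDBObject("report.name", "report_1"));
while (cur.hasNext()) {
    System.out.println(cur.next());
}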

Lucene : Changing the default facet delimiter?

First post on this wonderful site!
My goal is to use hierarchical facets for searching an index using Lucene. However, my facets need to be delimited by a character other than '/' (in this case, '~'). Example:
Categories
Categories~Category1
Categories~Category2
I have created a class that implements the FacetIndexingParams interface (a copy of DefaultFacetIndexingParams with the DEFAULT_FACET_DELIM_CHAR param set to '~').
Paraphrased indexing code (using FSDirectory for both index and taxonomy):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer)
IndexWriter writer = new IndexWriter(indexDir, config)
TaxonomyWriter taxo = new LuceneTaxonomyWriter(taxDir, OpenMode.CREATE)

Document doc = new Document()
// Add bunch of Fields... hidden for the sake of brevity

List<CategoryPath> categories = new ArrayList<CategoryPath>()
row.tags.split('\\|').each{ tag ->
    def cp = new CategoryPath()
    tag.split('~').each{
        cp.add(it)
    }
    categories.add(cp)
}

NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo, facetIndexingParams)
categoryDocBuilder.setCategoryPaths(categories).build(doc)
writer.addDocument(doc)
// Commit and close both writer and taxo.
Search code paraphrased:
// Create index and taxonomy readers to get info from index and taxonomy
IndexReader indexReader = IndexReader.open(indexDir)
TaxonomyReader taxo = new LuceneTaxonomyReader(taxDir)
Searcher searcher = new IndexSearcher(indexReader)

QueryParser parser = new QueryParser(Version.LUCENE_34, "content", new StandardAnalyzer(Version.LUCENE_34))
parser.setAllowLeadingWildcard(true)
Query q = parser.parse(query)
TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true)

List<FacetResult> res = null
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
FacetSearchParams facetSearchParams = new FacetSearchParams(facetIndexingParams)
CountFacetRequest cfr = new CountFacetRequest(new CategoryPath(""), 99)
cfr.setDepth(2)
cfr.setSortBy(SortBy.VALUE)
facetSearchParams.addFacetRequest(cfr)

FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo)
def cp = new CategoryPath("Category~Category1", (char)'~')
searcher.search(DrillDown.query(q, cp), MultiCollector.wrap(tdc, facetsCollector))
The results always return a list of facets in the form of "Category/Category1".
I have used the Luke tool to look at the index and it appears the facets are being delimited by the '~' character in the index.
What is the best route to do this? Any help is greatly appreciated!
I have figured out the issue. The search and indexing are working as they are supposed to. It is how I have been getting the facet results that is the issue. I was using:
res = facetsCollector.getFacetResults()
res.each{ result ->
    result.getFacetResultNode().getLabel().toString()
}
What I needed to use was:
res = facetsCollector.getFacetResults()
res.each{ result ->
    result.getFacetResultNode().getLabel().toString((char)'~')
}
The difference being the parameter sent to the toString function!
Easy to overlook, tough to find.
Hope this helps others.
