How to deal with identifier fields in Lucene? - java

I've stumbled upon a problem similar to the one described in this other question: I have a field named something like 'type', which is an identifier, i.e., it's case-sensitive and I want to use it for exact searches: no tokenisation, no similarity searches, just a plain "find exactly 'Sport:01'". I might benefit from 'Sport*' too, but that's not extremely important in my case.
I cannot make it work: I thought the right kind of field to store this in is StringField.TYPE_STORED, with DOCS_AND_FREQS_AND_POSITIONS and setOmitNorms(true). However, this way I can't correctly resolve a query like +type:"RockMusic" +title:"a sample title" using the standard analyzer, because, as far as I understand, the analyzer converts the input into lower case (i.e., rockmusic), while the type is stored in its original mixed-case form (hence, I cannot resolve it even if I remove the title clause).
I'd like to mix case-insensitive search over title with case-sensitive search over type, since I have cases where type := BRAIN is an acronym and is different from 'Brain'.
So, what's the best way to manage fields and searches like the above? Are there alternatives other than text and string fields?
I'm using Lucene 6.6.0, but this is a general issue, regarding multiple (all?) Lucene versions.
Some code showing the details is here (see testIdMixedCaseID*). The real use case is rather more complicated; if you want to take a look, the problem is with the field CC_FIELD, which might be 'BioProc', and in that case nothing can be found.
Please note I need to use plain Lucene, not Solr or Elasticsearch.

The following notes are based on Lucene 8.x, not on Lucene 6.6, so there may be some syntax differences - but I take your point that any such differences should be incidental to your question.
Here are some notes, where I will focus on the following aspect of your question:
However, this way I can't correctly resolve a query like: +type:"RockMusic" +title:"a sample title" using the standard analyzer
I think there are 2 parts to this:
Firstly, the query example using "a sample title" will - as you say - not work well with the standard analyzer, for the reasons you state.
But, secondly, it is possible to combine the two types of query you want to use, in a way which I believe gets you what you need: An exact match for the type field (e.g. RockMusic) and a more traditional tokenized & case-insensitive result for the title field (a sample title).
Here is how I would do that:
Here is some simple test data:
public static void buildIndex() throws IOException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Document doc;

    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "another different title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "Rock Music", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);
    }
}
Here is the query code:
public static void doSearch() throws QueryNodeException, ParseException, IOException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    IndexSearcher searcher = new IndexSearcher(reader);

    // Exact (un-analyzed) match against the StringField:
    TermQuery typeQuery = new TermQuery(new Term("type", "RockMusic"));

    // Tokenized, case-insensitive match against the TextField:
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("title", analyzer);
    Query titleQuery = parser.parse("A Sample Title");

    Query query = new BooleanQuery.Builder()
            .add(typeQuery, BooleanClause.Occur.MUST)
            .add(titleQuery, BooleanClause.Occur.MUST)
            .build();

    System.out.println("Query: " + query.toString());
    System.out.println();

    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;
    for (ScoreDoc hit : hits) {
        System.out.println("doc = " + hit.doc + "; score = " + hit.score);
        Document doc = searcher.doc(hit.doc);
        System.out.println("Type = " + doc.get("type")
                + "; Title = " + doc.get("title"));
        System.out.println();
    }
}
The output from the above query is as follows:
Query: +type:RockMusic +(title:a title:sample title:title)
doc = 0; score = 0.7016101
Type = RockMusic; Title = a sample title
doc = 1; score = 0.2743341
Type = RockMusic; Title = another different title
As you can see, this query is a little different from the one taken from your question.
But the list of found documents shows that (a) the Rock Music document was not found at all (good - because Rock Music does not match the "type" search term of RockMusic); and (b) the title a sample title got a far higher match score than the another different title document, when searching for A Sample Title.
Additional notes:
This query works by combining a StringField exact search with a more traditional TextField tokenized search - this latter search being processed by the StandardAnalyzer (matching how the data was indexed in the first place).
I am making an assumption about the score ranking being useful to you - but for title searches, I think that is reasonable.
This approach would also apply to your BRAIN vs. brain example, for StringField data.
(I also assume that, for a user interface, a user could select the "RockMusic" type value from a drop-down, and enter the "A Sample Title" search in an input field - but this is getting off-topic, I think).
You could obviously enhance the analyzer to include stop-words, and so on, as needed.
Of course, my examples involve hard-coded data - but it would not take much to generalize this approach to handle dynamically-provided search terms.
Hope that this makes sense - and that I understood the problem correctly.

Going to answer myself...
I discovered what @andrewjames outlines in his excellent analysis by making a number of tests of my own. Essentially, fields like "type" don't play well with the standard analyzer; they are best indexed and searched with an analyzer like KeywordAnalyzer, which, in practice, stores the original value as-is and searches it accordingly.
Most real cases are like my example, i.e., a mix of ID-like fields, which need exact matching, plus fields like 'title' or 'description', where user searches are best served by per-token searching, word-based scoring, stop-word elimination, etc.
Because of that, PerFieldAnalyzerWrapper (see also my sample code, linked above) helps a lot: it is a wrapper analyzer that dispatches analysis to field-specific analyzers, based on the field name.
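As an illustration, here is a minimal sketch of that setup (field names taken from my examples above; KeywordAnalyzer handles the ID-like field, StandardAnalyzer everything else):

Map<String, Analyzer> perField = new HashMap<>();
perField.put("type", new KeywordAnalyzer()); // IDs: kept verbatim, case-sensitive
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

// Use the same wrapper analyzer for both indexing and query parsing:
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
QueryParser parser = new QueryParser("title", analyzer);
Query q = parser.parse("+type:BioProc +title:\"a sample title\"");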
One thing to add: it's still not clear to me which analyzer is used when a query is built without a parser (e.g., using new TermQuery(new Term(fname, fval))), so for now I use a QueryParser.
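(For the record, as far as I can tell the answer is: none. A hand-built TermQuery goes through no analyzer at all - the term text is matched against the indexed tokens verbatim - which is why it pairs naturally with KeywordAnalyzer-indexed fields:)

// No analysis happens here: "BioProc" must match the indexed token exactly.
Query q = new TermQuery(new Term("CC_FIELD", "BioProc"));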

Related

How to read attributes out of multiple nested documents in MongoDB Java?

I need some help with a project I am planning to do. At this stage I am trying to learn how to use NoSQL databases in Java.
I've got a few nested documents looking like this:
[Image: MongoDB nesting structure]
As you can see in the image, my inner attributes are "model" and "construction".
Now I need to iterate through all the documents in my collection, whose key names are unknown, because they are generated at runtime when a user enters some information.
At the end I need to list them in a TreeView, keeping the structure they already have in the database.
What I've tried is getting keySets from the documents, but I cannot get past the second layer of the structure. I am able to print the whole object in JSON format, but I cannot access specific attributes like "model" or "construction".
MongoCollection<Document> collection = mongoDatabase.getCollection("test");
MongoCursor<Document> cursor = collection.find().iterator();
while (cursor.hasNext()) {
    Document document = cursor.next();
    for (String key : document.keySet()) {
        Document vehicles = (Document) document.get(key);
        //System.out.println(key);
        //System.out.println(document.get(key));
    }
}
Document cars = (Document) vehicle.get("cars");
Document types = (Document) cars.get("coupes");
Document brands = (Document) types.get("Ford");
Document model = (Document) brands.get("Mustang GT");
Here I tried to get some properties by hardcoding the key names of the documents, but I can't seem to get any value either. It keeps telling me that it could not read from vehicle, because it is null.
Most tutorials and forum posts somehow don't work for me. I don't know if they use a different version of the MongoDB driver. Mine is mongodb-driver 3.12.7, if that helps in any way.
I have been trying to get this working for days now and it is driving me crazy.
I hope there is someone out there who is able to help me with this problem.
Here is a way you can try, using the Document class's methods. Use the Document#getEmbedded method to navigate the embedded (sub-document) path.
try (MongoCursor<Document> cursor = collection.find().iterator()) {
    while (cursor.hasNext()) {
        // Get a document
        Document doc = cursor.next();
        // Get the sub-document with the known key path "vehicles.cars.coupes"
        Document coupes = doc.getEmbedded(
                Arrays.asList("vehicles", "cars", "coupes"),
                Document.class);
        // For each of the sub-documents within the "coupes" get the
        // dynamic keys and their values.
        for (Map.Entry<String, Object> coupe : coupes.entrySet()) {
            System.out.println(coupe.getKey()); // e.g., Mercedes
            // The dynamic sub-document for the dynamic key (e.g., Mercedes):
            // {"S-Class": {"model": "S-Class", "construction": "2011"}}
            Document coupeSubDoc = (Document) coupe.getValue();
            // Get the coupeSubDoc's keys and values
            coupeSubDoc.keySet().forEach(k -> {
                System.out.println("\t" + k); // e.g., S-Class
                System.out.println("\t\t" + "model" + " : " +
                        coupeSubDoc.getEmbedded(Arrays.asList(k, "model"), String.class));
                System.out.println("\t\t" + "construction" + " : " +
                        coupeSubDoc.getEmbedded(Arrays.asList(k, "construction"), String.class));
            });
        }
    }
}
The above code prints to the console as:
Mercedes
    S-Class
        model : S-Class
        construction : 2011
Ford
    Mustang
        model : Mustang GT
        construction : 2015
I think this is not a complete answer to his question.
Here he says:
Now I need to iterate through all the documents in my collection, whose keynames are unknown, because they are generated in runtime, when a user enters some information.
Your answer, @prasad_, just covers his example with vehicles, cars, and so on. He needs a way to handle unknown key/value pairs, I guess. For example, in this case he only knows the keys vehicles, cars, coupes, Mercedes/Ford and their subkeys. If another user inserts some new key/value pairs into the collection, he will have problems, because he can't navigate through the new document without having a look into the database.
I'm also interested in the solution, because I have never nested my key/value pairs and can't see the advantage of it. Am I wrong, or does it make the programming more difficult?
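For fully unknown keys I would expect something like a recursive walk over each document to be needed; a rough sketch (untested, hypothetical helper):

static void walk(Document doc, int depth) {
    String indent = String.join("", Collections.nCopies(depth, "\t"));
    for (Map.Entry<String, Object> entry : doc.entrySet()) {
        System.out.println(indent + entry.getKey());
        Object value = entry.getValue();
        if (value instanceof Document) {
            // Recurse into sub-documents whose keys we do not know in advance
            walk((Document) value, depth + 1);
        } else {
            System.out.println(indent + "\t" + value);
        }
    }
}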

Get N terms with top TFIDF scores for each documents in Lucene (PyLucene)

I am currently using PyLucene, but since there is no documentation for it, I guess a solution in Java for Lucene will also do (though if anyone has one in Python, that would be even better).
I am working with scientific publications and, for now, I retrieve their keywords. However, for some documents there are simply no keywords. An alternative would be to get the N words (5-8) with the highest TFIDF scores.
I am not sure how to do it, and also when. By when, I mean: do I have to tell Lucene to compute these values at indexing time, or is it possible to do it when searching the index?
What I would like to have for each query is something like this:
Query Ranking
Document1, top 5 TFIDF terms, Lucene score (default TFIDF)
Document2, " " , " "
...
What would also be possible is to first retrieve the ranking for the query, and then compute the top 5 TFIDF terms for each of these documents.
Does anyone have an idea how I could do this?
If a field is indexed, document frequencies can be retrieved with getTerms. If a field has stored term vectors, term frequencies can be retrieved with getTermVector.
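For illustration, a rough sketch of both calls in Java (Lucene 4.x-style API; the field name "contents" is an assumption) for a single document:

// Requires the field to have been indexed with term vectors enabled:
Terms vector = reader.getTermVector(docId, "contents");
TermsEnum te = vector.iterator(null); // pass null in 4.x; no-arg iterator() in 5.x+
BytesRef term;
while ((term = te.next()) != null) {
    long tf = te.totalTermFreq(); // frequency of this term within the document
    int df = reader.docFreq(new Term("contents", BytesRef.deepCopyOf(term)));
    // tf and df are the raw counts needed for a tf*idf computation
}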
I also suggest looking at MoreLikeThis, which uses tf*idf to create a query similar to the document, from which you can extract the terms.
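For instance, a minimal MoreLikeThis sketch for pulling out the top terms of one document (the field name and thresholds here are assumptions to tune):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setAnalyzer(new StandardAnalyzer());
mlt.setFieldNames(new String[] {"contents"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);
mlt.setMaxQueryTerms(5); // keep only the 5 highest-scoring terms
String[] topTerms = mlt.retrieveInterestingTerms(docId);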
And if you'd like a more pythonic interface, that was my motivation for lupyne:
from lupyne import engine
searcher = engine.IndexSearcher(<filepath>)
df = dict(searcher.terms(<field>, counts=True))
tf = dict(searcher.termvector(<docnum>, <field>, counts=True))
query = searcher.morelikethis(<docnum>, <field>)
After digging a bit in the mailing list, I ended up with what I was looking for.
Here is the method I came up with:
def getTopTFIDFTerms(docID, reader):
    termVector = reader.getTermVector(docID, "contents")
    termsEnumvar = termVector.iterator(None)
    termsref = BytesRefIterator.cast_(termsEnumvar)
    tc_dict = {}     # Counts of each term
    dc_dict = {}     # Number of docs associated with each term
    tfidf_dict = {}  # TF-IDF values of each term in the doc
    N_terms = 0
    try:
        while termsref.next():
            termval = TermsEnum.cast_(termsref)
            fg = termval.term().utf8ToString()  # Term in unicode
            tc = termval.totalTermFreq()        # Term count in the doc
            # Number of docs having this term in the index
            dc = reader.docFreq(Term("contents", termval.term()))
            N_terms = N_terms + 1
            tc_dict[fg] = tc
            dc_dict[fg] = dc
    except:
        print 'error in term_dict'
    # Compute TF-IDF for each term (float() avoids Python 2 integer division)
    for term in tc_dict:
        tf = float(tc_dict[term]) / N_terms
        idf = 1 + math.log(float(N_DOCS_INDEX) / (dc_dict[term] + 1))
        tfidf_dict[term] = tf * idf
    # Get a sorted representation of the dictionary
    sorted_x = sorted(tfidf_dict.items(), key=operator.itemgetter(1), reverse=True)
    # Get the top 5
    top5 = [i[0] for i in sorted_x[:5]]  # replace 5 by TOP N
    return top5
I am not sure why I have to cast the termsEnum as a BytesRefIterator; I got this from a thread in the mailing list, which can be found here.
Hope this will help :)

Lucene Analyzer query and Search Results Relevance Score

First of all, sorry for my bad English!
I am new to the Lucene library (since last Wednesday) and I'm trying to understand how to get the best relevance level for matching documents, based on the terms found.
I use Lucene 4.10.0 (no Solr).
I'm able to index/search English/Arabic text, as well as supporting hit highlighting for these texts.
Now I have a problem with the relevance of search results.
If I search for "Mohammad Omar" in three docs:
doc1 = new Document();
doc1.add(new TextField("contents", "xyz abc, 123 Mohammad Abu Omar 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc1));

doc2 = new Document();
doc2.add(new TextField("contents", "xyz abc, 123 Omar bin Mohammad 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc2));

doc3 = new Document();
doc3.add(new TextField("contents", "xyz abc, 123 Abu Mohammad Omar 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc3));
...etc
I get the same score for these 3 docs.
It looks like Lucene ignores the word order and just scores on the number of matches.
I expect the following as the best result order:
doc3 THEN doc1 THEN doc2
but I get:
doc1 THEN doc2 THEN doc3 (ALL HAVE THE SAME SCORE)
For searching in lowercase and in substrings I use an extended analyzer like this:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(reader);
    TokenStream filter = new LowerCaseFilter(source);
    filter = new WordDelimiterFilter(filter, Integer.MAX_VALUE, null);
    return new TokenStreamComponents(source, filter);
}
Any idea how to achieve this?
From here: http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term
I see that boosting query terms and/or using regular expressions could be an option, but this means I would have to handle user input manually. Isn't there an "out of the box" solution (like a function, filter, or analyzer)?
Many thanks!
What does your "Mohammad Omar" query look like in code? If you need just this exact phrase, feed the string into a PhraseQuery, or, if you use QueryParser, wrap the phrase in quotes to produce a PhraseQuery.
If you need both this phrase as well as documents containing both terms separately in the search results, you could include "Mohammad Omar" both as a phrase (as specified above) and as separate terms, something like this: some_field:"Mohammad Omar" some_field:Mohammad some_field:Omar. You can also add boosting for the phrase element so that phrase results rank higher.
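Programmatically, a minimal sketch of that combined query (Lucene 4.x mutable-query API; the field name "contents" and the boost value are assumptions, and the terms are lowercased to match your lower-casing analyzer):

// Exact phrase, boosted so phrase matches rank higher:
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "mohammad"));
phrase.add(new Term("contents", "omar"));
phrase.setBoost(2.0f);

// Combine with the individual terms (SHOULD = optional, affects scoring):
BooleanQuery query = new BooleanQuery();
query.add(phrase, BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("contents", "mohammad")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("contents", "omar")), BooleanClause.Occur.SHOULD);

The SHOULD clauses keep both the phrase and the single terms optional, so documents matching the boosted phrase rank highest while term-only matches still appear.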

Using Lucene, how to index TXT files into different fields?

I am using the NSF data, which is in txt format. I have indexed the data and can send a query and get several results. But how can I search for something in a selected field (e.g. title)? Since all of the NSF data are plain txt files, I don't think Lucene can recognize which part of a file is a "title" or something else. Should I first transform the txt files into XML files (with tags telling Lucene which part is the "title")? Can Lucene do that? I have no idea how to split the txt files into several fields. Can anyone please give me some suggestions? Thanks a lot!
BTW, every txt file looks like this:
---begin---
Title: Mitochondrial DNA and Historical Demography
Type: Award
Date: August 1, 1991
Number: 9000006
Abstract: asdajsfhsjdfhsjngfdjnguwiehfrwiuefnjdnfsd
----end----
You have to split the text into its several parts. You can then use the resulting strings to create a field for each part of the text, e.g. the title.
Create your lucene document with the fields like this:
Document doc = new Document();
doc.add(new Field("title", titleString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("abstract", abstractString, Field.Store.NO, Field.Index.TOKENIZED));
and so on. After indexing the document you can search in the title like this: title:dna
More complex queries and mixing multiple fields in the query also possible: +title:dna +abstract:"some example text" -number:935353
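As for getting the fields out of the raw txt in the first place, a rough parsing sketch (hypothetical file name; assumes each field sits on a single "Key: value" line, as in the sample above) might be:

Document doc = new Document();
for (String line : Files.readAllLines(Paths.get("award.txt"))) {
    int sep = line.indexOf(": ");
    if (sep < 0) continue; // skip the ---begin--- / ----end---- markers
    String name = line.substring(0, sep).toLowerCase();  // e.g. "title"
    String value = line.substring(sep + 2);
    doc.add(new Field(name, value, Field.Store.NO, Field.Index.TOKENIZED));
}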

How to retrieve the Field that "hit" in Lucene

Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But it would be nice to get more info, like: what was the complete value (e.g. "key1=value1") that goes with that hit?
The best I can figure out is to get the Document, then get its list of IndexableFields, then loop over all of them and see if field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there.)
This seems completely insane, because isn't that what Lucene just did? Shouldn't the hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and staring at the APIs haven't revealed anything straightforward, but I feel like I must be searching for the wrong stuff.
You might want to try the IndexSearcher.explain() method. Once you have the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched fields. Example:
for (String field : fields) {
    Query query = new WildcardQuery(new Term(field, "*alue1"));
    Explanation ex = searcher.explain(query, docID);
    if (ex.isMatch()) {
        //Your query matched field
    }
}
