How to index a String in Lucene? - java

I'm using Lucene to index strings that I read from a document. I'm not using a Reader, since I need to index the strings into different fields.
document.add(new Field("FIELD1","string1", Field.Store.YES, Field.Index.UNTOKENIZED));
document.add(new Field("FIELD2","string2", Field.Store.YES, Field.Index.UNTOKENIZED));
This works for building the index, but searching with
QueryParser queryParser = new QueryParser("FIELD1", new StandardAnalyzer());
Query query = queryParser.parse(searchString);
Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.length());
doesn't return any results.
But when I index a sentence like,
document.add(new Field("FIELD1","This is sentence to be indexed", Field.Store.YES, Field.Index.TOKENIZED));
searching works fine.
Thanks.

You need to set the parameter for the word fields to Field.Index.TOKENIZED as well, so that the indexed terms match what the analyzer produces at search time. With Field.Index.UNTOKENIZED, "string1" is indexed as a single, unanalyzed term, while the QueryParser runs your search string through the StandardAnalyzer, so the query term and the indexed term can easily fail to line up.
Use this:
document.add(new Field("FIELD1","string1", Field.Store.YES, Field.Index.TOKENIZED));
document.add(new Field("FIELD2","string2", Field.Store.YES, Field.Index.TOKENIZED));
When you want to index a string containing multiple words, e.g. "two words", as a single searchable element without tokenizing it into two words, you can either use the KeywordAnalyzer during indexing, which treats the whole string as one token, or use StringField in newer versions of Lucene.
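For illustration, here is a minimal sketch of the StringField approach on a newer Lucene (8.x class names assumed; ByteBuffersDirectory and the field values are just for the example). The whole string is indexed as one term, and a TermQuery, which bypasses analysis entirely, matches it exactly:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ExactMatchExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory, for illustration
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // StringField indexes the whole value as a single, unanalyzed term
            doc.add(new StringField("FIELD1", "two words", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // TermQuery does not analyze its input, so it matches the exact indexed term
            TopDocs hits = searcher.search(new TermQuery(new Term("FIELD1", "two words")), 10);
            System.out.println("Hits: " + hits.totalHits.value);
        }
    }
}
```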

Related

How to deal with identifier fields in Lucene?

I've stumbled upon a problem similar to the one described in this other question: I have a field named something like 'type', which is an identifier, i.e. it's case-sensitive and I want to use it for exact searches: no tokenisation, no similarity searches, just a plain "find exactly 'Sport:01'". Matching 'Sport*' might be useful too, but it's not essential in my case.
I cannot make it work. I thought the right kind of field to store this was StringField.TYPE_STORED, with DOCS_AND_FREQS_AND_POSITIONS and setOmitNorms(true). However, this way I can't correctly resolve a query like +type:"RockMusic" +title:"a sample title" using the standard analyzer, because, as far as I understand, the analyzer converts the input to lower case (i.e. rockmusic) while the type is stored in its original mixed-case form (hence I cannot resolve it even if I remove the title clause).
I'd like to mix case-insensitive search over title with case-sensitive search over type, since I have cases where type := BRAIN is an acronym and is different from 'Brain'.
So, what's the best way to manage fields and searches like the above? Are there alternatives other than text and string fields?
I'm using Lucene 6.6.0, but this is a general issue, regarding multiple (all?) Lucene versions.
Some code showing the details is here (see testIdMixedCaseID*). The real use case is rather more complicated; if you want to take a look, the problem is with the field CC_FIELD, which might be 'BioProc', and nothing can be found in that case.
Please note I need to use plain Lucene, not Solr or Elasticsearch.
The following notes are based on Lucene 8.x rather than Lucene 6.6, so there may be some syntax differences, but I take your point that any such differences should be incidental to your question.
Here are some notes, where I will focus on the following aspect of your question:
However, this way I can't correctly resolve a query like: +type:"RockMusic" +title:"a sample title" using the standard analyzer
I think there are two parts to this:
First, the query example using "a sample title" will, as you say, not work well with the standard analyzer, for the reasons you state.
But, second, it is possible to combine the two types of query you want in a way that I believe gets you what you need: an exact match on the type field (e.g. RockMusic) and a more traditional tokenized, case-insensitive match on the title field (a sample title).
Here is how I would do that:
Here is some simple test data:
public static void buildIndex() throws IOException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Document doc;

    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "another different title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "Rock Music", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);
    }
}
Here is the query code:
public static void doSearch() throws QueryNodeException, ParseException, IOException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    IndexSearcher searcher = new IndexSearcher(reader);

    TermQuery typeQuery = new TermQuery(new Term("type", "RockMusic"));

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("title", analyzer);
    Query titleQuery = parser.parse("A Sample Title");

    Query query = new BooleanQuery.Builder()
            .add(typeQuery, BooleanClause.Occur.MUST)
            .add(titleQuery, BooleanClause.Occur.MUST)
            .build();

    System.out.println("Query: " + query.toString());
    System.out.println();

    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;
    for (ScoreDoc hit : hits) {
        System.out.println("doc = " + hit.doc + "; score = " + hit.score);
        Document doc = searcher.doc(hit.doc);
        System.out.println("Type = " + doc.get("type")
                + "; Title = " + doc.get("title"));
        System.out.println();
    }
}
The output from the above query is as follows:
Query: +type:RockMusic +(title:a title:sample title:title)
doc = 0; score = 0.7016101
Type = RockMusic; Title = a sample title
doc = 1; score = 0.2743341
Type = RockMusic; Title = another different title
As you can see, this query is a little different from the one taken from your question.
But the list of found documents shows that (a) the Rock Music document was not found at all (good - because Rock Music does not match the "type" search term of RockMusic); and (b) the title a sample title got a far higher match score than the another different title document, when searching for A Sample Title.
Additional notes:
This query works by combining a StringField exact search with a more traditional TextField tokenized search - this latter search being processed by the StandardAnalyzer (matching how the data was indexed in the first place).
I am making an assumption about the score ranking being useful to you - but for title searches, I think that is reasonable.
This approach would also apply to your BRAIN vs. brain example, for StringField data.
(I also assume that, for a user interface, a user could select the "RockMusic" type value from a drop-down, and enter the "A Sample Title" search in an input field - but this is getting off-topic, I think).
You could obviously enhance the analyzer to include stop-words, and so on, as needed.
Of course, my examples involve hard-coded data - but it would not take much to generalize this approach to handle dynamically-provided search terms.
Hope that this makes sense - and that I understood the problem correctly.
Going to answer myself...
I discovered what @andrewjames outlines in his excellent analysis by running a number of tests of my own. Essentially, fields like "type" don't play well with the standard analyzer; they are best indexed and searched with an analyzer like KeywordAnalyzer, which in practice stores the original value as-is and searches it accordingly.
Most real cases are like my example, i.e. a mix of ID-like fields, which need exact matching, and fields like 'title' or 'description', which best serve user searches with per-token matching, word-based scoring, stop-word elimination, etc.
Because of that, PerFieldAnalyzerWrapper (see also my sample code, linked above) is a great help: it is a wrapper analyzer that dispatches analysis to field-specific analyzers, based on the field name.
One thing to add: it still isn't clear to me which analyzer is used when a query is built without a parser (e.g. using new TermQuery(new Term(fname, fval))), so for now I use a QueryParser.
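As a sketch of the setup described above (Lucene 8.x class names assumed; the field name 'type' is from the question):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class PerFieldSetup {
    public static IndexWriterConfig makeConfig() {
        Map<String, Analyzer> perField = new HashMap<>();
        // 'type' is an identifier: KeywordAnalyzer keeps the whole value as one
        // case-sensitive token, both at index time and at query-parse time
        perField.put("type", new KeywordAnalyzer());
        // every other field (e.g. 'title') falls back to the StandardAnalyzer
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
        return new IndexWriterConfig(analyzer);
    }
}
```

Passing the same wrapper analyzer to the QueryParser keeps index-time and query-time analysis consistent. A hand-built TermQuery, by contrast, uses no analyzer at all, which is why the last point above matters.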

Lucene : Search with partial words

I am working on integrating Lucene into our application. Lucene is working so far: for example, when I search for "Upload" and a document contains the text "Upload", it works; but when I search for "Uplo", it doesn't. Any ideas?
Code :
Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query, 50);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
    objectIds.add(Integer.valueOf(document.get("id")));
    System.out.println("");
    System.out.println("id " + document.get("id"));
    System.out.println("content " + document.get("contents"));
}
return objectIds;
Thank you.
'Upload' is likely ONE token in your Lucene index, a token being the smallest unit that is not split any further. If you want to match partial words like 'Uplo', it is better to use Lucene n-gram indexing. Note that n-gram indexing increases the space requirements of your inverted index.
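For illustration, an index-time analyzer that also emits n-grams might look roughly like this (Lucene 8.x signatures assumed; the 2 to 5 gram range is just an example). With this, a query for "uplo" can match a document containing "upload":

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class NGramIndexAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream filter = new LowerCaseFilter(source);
        // emit 2..5-character n-grams of each token, keeping the original token too
        filter = new NGramTokenFilter(filter, 2, 5, true);
        return new TokenStreamComponents(source, filter);
    }
}
```

The query side should then use a plain lowercasing analyzer (not the n-gram one), so that "uplo" is looked up as a single term against the indexed grams.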
You can use wildcard searches.
"?" symbol for single character wildcard search and "*" symbol for Multiple character wildcard searches (0 or more characters).
example - "Uplo*"
Change
Query query = queryParser.parse(text);
to
Query query = queryParser.parse("*" + text + "*");
Note that the classic QueryParser rejects a leading wildcard by default, so you also need to call queryParser.setAllowLeadingWildcard(true) first. Leading wildcards can be slow on large indexes.
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:
te?t
Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
test*
You can also use the wildcard searches in the middle of a term.
te*t
Note: You cannot use a * or ? symbol as the first character of a search.

How to map pairs of different URLs in a Lucene index and query against those?

How is it possible to add URL mappings to Lucene and read them back?
I want to store pairs like url1 - url2, and when I query for url1, get url2. I have tried PhraseQuery, TermQuery and FuzzyQuery but couldn't get a result.
For example:
http://www.w3.org/2004/02/skos/core#Mountain - http://www.w3.org/2004/02/skos/core#Everst
At index time, you can add the URLs as two separate fields: url1 searchable (indexed), and url2 just stored.
Assuming a more recent version of Lucene (at least 4):
doc.add(new StringField("url1", url1String, Store.NO));
doc.add(new StoredField("url2", url2String));
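Putting that together, a minimal sketch might look as follows (Lucene 8.x assumed; the in-memory directory is just for illustration, and the URLs are taken from the question):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UrlMapping {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            Document doc = new Document();
            // url1 is indexed as one exact, unanalyzed term; url2 is only stored
            doc.add(new StringField("url1", "http://www.w3.org/2004/02/skos/core#Mountain", Store.NO));
            doc.add(new StoredField("url2", "http://www.w3.org/2004/02/skos/core#Everst"));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // exact term match on url1, then read back the stored url2
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("url1", "http://www.w3.org/2004/02/skos/core#Mountain")), 1);
            if (hits.scoreDocs.length > 0) {
                System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("url2"));
            }
        }
    }
}
```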

Lucene Analyzer query and Search Results Relevance Score

First of all, sorry for my bad English!
I am new to the Lucene library (since last Wednesday) and I'm trying to understand how to get the best relevance ranking of matching documents based on the terms found.
I use Lucene 4.10.0 (no Solr).
I'm able to index/search English/Arabic text, including hit highlighting for these texts.
Now I have a problem with the relevance of search results.
If I search for "Mohammad Omar" in these three docs:
doc1.add(new TextField("contents", "xyz abc, 123 Mohammad Abu Omar 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc1));
doc2 = new Document();
doc2.add(new TextField("contents", "xyz abc, 123 Omar bin Mohammad 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc2));
doc3 = new Document();
doc3.add(new TextField("contents", "xyz abc, 123 Abu Mohammad Omar 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc3));
...etc
I get the same score for all three docs.
It looks like Lucene ignores the word order and scores only on the number of matches.
I expect the best results to be:
doc3 THEN doc1 THEN doc2
but I get:
doc1 THEN doc2 THEN doc3 (ALL HAVE THE SAME SCORE)
For searching in lower case and in substrings I use an extended analyzer like this:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(reader);
    TokenStream filter = new LowerCaseFilter(source);
    filter = new WordDelimiterFilter(filter, Integer.MAX_VALUE, null);
    return new TokenStreamComponents(source, filter);
}
Any idea how to achieve this?
From here: http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term
I see that boosting query terms and/or using regex could be an option, but that means I would have to handle user input manually. Isn't there an out-of-the-box solution (like a function, filter or analyzer)?
many thanks!
What does your "Mohammad Omar" query look like in code? If you need just this exact phrase, feed the string into a PhraseQuery, or, if you use QueryParser, wrap the phrase in quotes to produce a PhraseQuery.
If you need both this phrase as well as documents containing both terms separately in the search results, you could include "Mohammad Omar" both as a phrase (as specified above) and as separate terms, something like this: some_field:"Mohammad Omar" some_field:Mohammad some_field:Omar. You can also add boosting for the phrase element so that phrase results rank higher.
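A programmatic sketch of that combination might look like this (Lucene 5+ query classes shown; on 4.10 you would build the PhraseQuery with add(Term) and call setBoost on it instead, and the field name and terms are taken from the question, lowercased to match a lowercasing analyzer):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PhraseWithTerms {
    public static Query build() {
        // the exact phrase, boosted so phrase hits rank above scattered term hits
        Query phrase = new BoostQuery(new PhraseQuery("contents", "mohammad", "omar"), 5.0f);
        return new BooleanQuery.Builder()
                .add(phrase, Occur.SHOULD)
                .add(new TermQuery(new Term("contents", "mohammad")), Occur.SHOULD)
                .add(new TermQuery(new Term("contents", "omar")), Occur.SHOULD)
                .build();
    }
}
```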

Using Lucene, how to index TXT files into different fields?

I am using the NSF data, which is in plain txt format. I have indexed the data and can send a query and get several results. But how can I search within a selected field (e.g. title)? Since the NSF data are plain txt files, I don't think Lucene can recognize which part of a file is the "title" or anything else. Should I first transform the txt files into XML files (with tags telling Lucene which part is the "title")? Can Lucene do that? I have no idea how to split the txt files into several fields. Can anyone give me some suggestions? Thanks a lot!
BTW, every txt file looks like this:
---begin---
Title: Mitochondrial DNA and Historical Demography
Type: Award
Date: August 1, 1991
Number: 9000006
Abstract: asdajsfhsjdfhsjngfdjnguwiehfrwiuefnjdnfsd
----end----
You have to split the text into its parts yourself. You can then use the resulting strings to create a field for each part of the text, e.g. the title.
Create your lucene document with the fields like this:
Document doc = new Document();
doc.add(new Field("title", titleString, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("abstract", abstractString, Field.Store.NO, Field.Index.TOKENIZED));
and so on. After indexing the document you can search in the title like this: title:dna
More complex queries mixing multiple fields are also possible: +title:dna +abstract:"some example text" -number:935353
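Since the records use a "Key: value" line format, the splitting step itself is plain string handling. A hypothetical helper (names are illustrative) could look like this:

```java
import java.util.HashMap;
import java.util.Map;

public class NsfRecordParser {
    // Split the "Key: value" lines of one record into a field -> value map,
    // with keys lowercased so they can be used directly as Lucene field names.
    public static Map<String, String> parseRecord(String text) {
        Map<String, String> fields = new HashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String key = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "Title: Mitochondrial DNA and Historical Demography\n"
                + "Type: Award\n"
                + "Number: 9000006";
        Map<String, String> f = parseRecord(record);
        System.out.println(f.get("title"));
    }
}
```

Each map entry can then be added to the Document as its own field, as shown above.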
