How to make a lucene analyzer more strict?

I have a custom Lucene analyzer for names.
I mostly get the correct match back, but I would like to exclude results that are only loosely matched.
Example:
Query: Art Inn Hotel Essen
One of the results: Hotel Garni an der Eissporthalle, score: 7.6011443
I'd like to prevent this result: even though it is not the topmost, it is still inappropriate. Is that possible?
I use the following analyzer and query:
public class MyAnalyzer extends Analyzer {
    private static final Version VERSION = Version.LUCENE_4_9;
    @Override
    protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
        // Tokenize, then lowercase and fold accented characters to ASCII.
        final Tokenizer source = new StandardTokenizer(VERSION, reader);
        TokenStream result = new StandardFilter(VERSION, source);
        result = new LowerCaseFilter(VERSION, result);
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
// Query construction; VERSION is the same Version.LUCENE_4_9 constant as above.
BooleanQuery q = new BooleanQuery();
q.add(new QueryParser(VERSION, "name", new MyAnalyzer()).parse(name), Occur.MUST);
Also, why is this result matched at all? The only term the two strings have in common is Hotel.
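One likely explanation, offered as an assumption because the question does not show how the parser is configured: the classic QueryParser combines the words of a query with OR by default, so a document only needs to share one term with the query to match. The sketch below is plain Java, not Lucene; `tokenize` and `sharedTerms` are hypothetical stand-ins for the analyzer above (ASCII folding is left out). It shows which terms the query and the unwanted result have in common.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class TermOverlapSketch {

    // Rough stand-in for the analyzer: lowercase, then split on anything
    // that is not a letter or a digit.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Terms that occur in both the query and the candidate name.
    static Set<String> sharedTerms(String query, String candidate) {
        Set<String> shared = tokenize(query);
        shared.retainAll(tokenize(candidate));
        return shared;
    }

    public static void main(String[] args) {
        // prints [hotel]
        System.out.println(sharedTerms("Art Inn Hotel Essen",
                "Hotel Garni an der Eissporthalle"));
    }
}
```

If that is the cause, making every term mandatory, for example with `setDefaultOperator(QueryParser.Operator.AND)` on the parser, should exclude this result. Whether that is too strict for name matching depends on the data.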

Related

Trying to get more matches with lucene

I'm using Java and Lucene to match each song in a list I receive from a service against local files. What I'm currently struggling with is finding a query that gets me as many matches per song as possible. At least one matching file per song would be great.
This is what I have at the moment:
public List<String> getMatchesForSong(String artist, String title, String album) throws ParseException, IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    String defaultQuery = "(title: \"%s\"~2) AND ((artist: \"%s\") OR (album: \"%s\"))";
    String searchQuery = String.format(defaultQuery, title, artist, album);
    Query query = new QueryParser("title", analyzer).parse(searchQuery);
    if (indexWriter == null) {
        indexWriter = createIndexWriter(indexDir);
        indexSearcher = createIndexSearcher(indexWriter);
    }
    TopDocs topDocs = indexSearcher.search(query, 20);
    if (topDocs.totalHits > 0) {
        return parseScoreDocsList(topDocs.scoreDocs);
    }
    return null;
}
This works very well when there are no inconsistencies, even for non-English characters. But it does not return a single match if, for example, I receive a song with the title "The Sun Was In My Eyes: Part One" while my corresponding file has the title "The Sun Was In My Eyes: Part 1", or if I receive it as "Pt. 1".
I don't get matches either when the received title has more words than the corresponding file's title, as in "The End of all Times (Martyrs Fire)" versus "The End of all Times". The same could happen for album names.
So what improvements should I make to my code in order to get more matches?

I eventually found out that using a PhraseQuery for the title or album isn't the best approach, since it makes Lucene search for an exact match of the phrase.
What I ended up doing was creating a TermQuery for each word of both the title and the album, and joining everything in a BooleanQuery.
private Query parseQueryForSong(String artist, String title, String album) throws ParseException {
    String[] artistArr = artist.split(" ");
    String[] titleArr = sanitizePhrase(title).split(" ");
    String[] albumArr = sanitizePhrase(album).split(" ");
    BooleanQuery.Builder mainQueryBuilder = new BooleanQuery.Builder();
    BooleanQuery.Builder albumQueryBuilder = new BooleanQuery.Builder();
    // The artist still has to match as an exact phrase.
    PhraseQuery artistQuery = new PhraseQuery("artist", artistArr);
    // Title words are optional clauses, so a partial title overlap is enough.
    for (String titleWord : titleArr) {
        if (!titleWord.isEmpty()) {
            mainQueryBuilder.add(new TermQuery(new Term("title", titleWord)), BooleanClause.Occur.SHOULD);
        }
    }
    // Album words go into their own sub-query; at least one of them has to match.
    for (String albumWord : albumArr) {
        if (!albumWord.isEmpty()) {
            albumQueryBuilder.add(new TermQuery(new Term("album", albumWord)), BooleanClause.Occur.SHOULD);
        }
    }
    mainQueryBuilder.add(artistQuery, BooleanClause.Occur.MUST);
    mainQueryBuilder.add(albumQueryBuilder.build(), BooleanClause.Occur.MUST);
    StandardAnalyzer analyzer = new StandardAnalyzer();
    // Re-parse the query's string form, presumably so that the analyzer is applied to the raw terms.
    Query mainQuery = new QueryParser("title", analyzer).parse(mainQueryBuilder.build().toString());
    return mainQuery;
}
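The `sanitizePhrase` helper is not shown in the answer. Here is a minimal sketch of what it might look like, assuming its only job is to strip punctuation and collapse whitespace so that each remaining word can become its own TermQuery. The class name and the regular expression are illustrative, not the author's code.

```java
final class PhraseSanitizerSketch {

    // Hypothetical implementation: the original helper is not shown.
    // Replaces every run of characters that are neither letters nor digits
    // with a single space, then trims the ends.
    static String sanitizePhrase(String phrase) {
        if (phrase == null) {
            return "";
        }
        return phrase.replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    public static void main(String[] args) {
        // prints: The End of all Times Martyrs Fire
        System.out.println(sanitizePhrase("The End of all Times (Martyrs Fire)"));
        // prints: The Sun Was In My Eyes Part 1
        System.out.println(sanitizePhrase("The Sun Was In My Eyes: Part 1"));
    }
}
```

With punctuation removed, splitting on a single space as the method above does no longer produces tokens such as `(Martyrs` or `Eyes:`.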

Hibernate search highlighting not analyzed fields

I'd like to highlight a non-analyzed field as a whole when it matches the search query.
The indexed entity looks as follows:
@Entity
@Indexed
@AnalyzerDef(
    name = "documentAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(
            factory = StopFilterFactory.class,
            params = {
                @Parameter(name = "words", value = "stoplist.properties"),
                @Parameter(name = "ignoreCase", value = "true")
            }
        )
    }
)
public class Document {
    ...
    @Field(analyze = Analyze.NO)
    private String notAnalyzedField; // has "x-xxx-xxx" format
    @Field(analyze = Analyze.YES)
    private String analyzedField;
}
Suppose I have a Document with notAnalyzedField: "a-bbb-ccc", then I run a search query with the same value and highlight search results using the following code:
String highlightText(Query query, Analyzer analyzer, String fieldName, String text) {
    QueryScorer queryScorer = new QueryScorer(query);
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span>", "</span>");
    Highlighter highlighter = new Highlighter(formatter, queryScorer);
    return highlighter.getBestFragment(analyzer, fieldName, text);
}
As a result I get the following snippet: "a-<span>bbb</span>-<span>ccc</span>".
This seems reasonable, because the analyzer treats the token a as a stop word and - as a delimiter, so neither is highlighted. But I cannot figure out how to avoid using an analyzer while highlighting this field. There are a few methods in the Highlighter class that take a TokenStream instead of an Analyzer, but I'm not sure how to use them.
The result I want is the whole field highlighted: "<span>a-bbb-ccc</span>"
Is there a way to achieve this with hibernate-search?

Where does your analyzer come from?
You might want to get it from Hibernate Search:
FullTextEntityManager em = /*...*/;
Analyzer analyzer = em.getSearchFactory()
    .getAnalyzer(Document.class);
highlightText(query, analyzer, fieldName, text);
If it doesn't work, try using a KeywordAnalyzer: highlightText(query, new KeywordAnalyzer(), fieldName, text);

Why my version of case insensitive Lucene keyword analyzer is not working

I am trying to index documents for case insensitive search using KeywordTokenizer.
I have created a custom Analyzer that is supposed to do keyword tokenisation as well as convert all keywords to lowercase:
public class LowercasingKeywordAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        KeywordTokenizer keywordTokenizer = new KeywordTokenizer();
        return new TokenStreamComponents(keywordTokenizer, new LowerCaseFilter(keywordTokenizer));
    }
}
Why does the search return no results when I submit a TermQuery with the query term lowercased? Here is a unit test reproducing the issue:
@Test
public void experiment() throws IOException, ParseException {
    Analyzer analyzer = new LowercasingKeywordAnalyzer();
    Directory directory = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new StringField("fieldname", text, Store.NO));
    iwriter.addDocument(doc);
    iwriter.close();
    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    //THE TEST PASSES WITH THE CASE SENSITIVE QUERY TERM, BUT DOES NOT PASS WITH LOWERCASED
    //Query query = new TermQuery(new Term("fieldname", "This is the text to be indexed."));
    Query query = new TermQuery(new Term("fieldname", "This is the text to be indexed.".toLowerCase()));
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    ireader.close();
    directory.close();
}
Please help me to identify what is wrong here?
NOTE: I am aware of Lucene QueryParsers as well as deprecation of some interfaces, please do not bother commenting on this.

StringField is not analyzed. No analyzer you define will affect it. You can use a TextField instead, or a Field where you can define your own FieldType. Or just lowercase it before constructing the field and continue to use StringField.
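The effect can be illustrated with a small model in plain Java. This is not Lucene code: the class and method names are made up for illustration, and the only behaviour assumed is that a StringField value is indexed verbatim as a single term and that a TermQuery compares terms exactly.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ExactTermIndexSketch {

    private final Set<String> terms = new HashSet<>();

    // Models a StringField: the whole value becomes one term, unchanged.
    void addVerbatim(String value) {
        terms.add(value);
    }

    // Models the last suggestion: lowercase the value before indexing it.
    void addLowercased(String value) {
        terms.add(value.toLowerCase(Locale.ROOT));
    }

    // Models a TermQuery: exact, case-sensitive comparison with indexed terms.
    boolean matches(String term) {
        return terms.contains(term);
    }

    public static void main(String[] args) {
        String text = "This is the text to be indexed.";
        String lowercasedQuery = text.toLowerCase(Locale.ROOT);

        ExactTermIndexSketch verbatim = new ExactTermIndexSketch();
        verbatim.addVerbatim(text);
        System.out.println(verbatim.matches(lowercasedQuery)); // prints false

        ExactTermIndexSketch lowercased = new ExactTermIndexSketch();
        lowercased.addLowercased(text);
        System.out.println(lowercased.matches(lowercasedQuery)); // prints true
    }
}
```

In the unit test from the question, the equivalent change would be to pass `text.toLowerCase()` to the StringField, or to switch to a TextField so that the custom analyzer is applied.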

Lucene: prefix query not working with WhitespaceAnalyzer

I'm experimenting a little with Lucene's various Query objects, and I'm trying to understand why a prefix query doesn't match any documents when a WhitespaceAnalyzer is used for indexing. Consider the following test code:
protected String[] ids = { "1", "2" };
protected String[] unindexed = { "Netherlands", "Italy" };
protected String[] unstored = { "Amsterdam has lots of bridges",
        "Venice has lots of canals" };
protected String[] text = { "Amsterdam", "Venice" };
@Test
public void testWhitespaceAnalyzerPrefixQuery() throws IOException, ParseException {
    File indexes = new File(
            "C:/LuceneInActionTutorial/indexes");
    FSDirectory dir = FSDirectory.open(indexes);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
            new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(
                    Version.LUCENE_4_9), Integer.MAX_VALUE));
    IndexWriter writer = new IndexWriter(dir, config);
    for (int i = 0; i < ids.length; i++) {
        Document doc = new Document();
        doc.add(new StringField("id", ids[i], Store.NO));
        doc.add(new StoredField("country", unindexed[i]));
        doc.add(new TextField("contents", unstored[i], Store.NO));
        doc.add(new Field("city", text[i], TextField.TYPE_STORED));
        writer.addDocument(doc);
    }
    writer.close();
    DirectoryReader dr = DirectoryReader.open(dir);
    IndexSearcher is = new IndexSearcher(dr);
    QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
            "contents", new WhitespaceAnalyzer(Version.LUCENE_4_9));
    queryParser.setLowercaseExpandedTerms(true);
    Query q = queryParser.parse("Ven*");
    assertTrue(q.getClass().getSimpleName().contains("PrefixQuery"));
    TopDocs hits = is.search(q, 10);
    assertEquals(1, hits.totalHits);
}
If I replace the WhitespaceAnalyzer with the StandardAnalyzer, the test passes. I used Luke to inspect the index content, but couldn't find any differences in how Lucene stores the values during indexing. Could anybody please clarify what's going wrong?

StandardAnalyzer lowercases text when it is indexed. WhitespaceAnalyzer does not, so with WhitespaceAnalyzer the term in the index is "Venice".
The query parser will lowercase your query, though, because you have called setLowercaseExpandedTerms(true). This is also the default; to disable it you need to set it to false explicitly. So your query is "ven*", which does not match "Venice".
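The mismatch can be reproduced without Lucene. The following plain-Java sketch uses hypothetical helpers (`whitespaceTokens`, `anyStartsWith`) that only imitate whitespace tokenization and prefix matching; it is meant to show the case sensitivity described above, not how Lucene is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixCaseSketch {

    // Splits on whitespace; optionally lowercases each token, which is the
    // part of StandardAnalyzer's behaviour that matters here.
    static List<String> whitespaceTokens(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(lowercase ? t.toLowerCase(Locale.ROOT) : t);
            }
        }
        return tokens;
    }

    // Case-sensitive prefix match against the indexed tokens.
    static boolean anyStartsWith(List<String> tokens, String prefix) {
        for (String t : tokens) {
            if (t.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String contents = "Venice has lots of canals";
        List<String> caseKept = whitespaceTokens(contents, false);
        List<String> lowercased = whitespaceTokens(contents, true);

        System.out.println(anyStartsWith(caseKept, "ven"));   // prints false
        System.out.println(anyStartsWith(lowercased, "ven")); // prints true
        System.out.println(anyStartsWith(caseKept, "Ven"));   // prints true
    }
}
```

The last line corresponds to what the answer suggests: calling `queryParser.setLowercaseExpandedTerms(false)` in the test keeps the prefix as `Ven`, which matches the case-preserved term.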

How to get Lucene Fuzzy Search result's matching terms?

How do you get the matching fuzzy term and its offset when using Lucene Fuzzy Search?
IndexSearcher mem = ....(some standard code)
QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);
// the ~ triggers the fuzzy search as per "Lucene In Action"
The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is matched. How do I find out which term matched and what its offsets are?
I have made sure that all CONTENT_FIELDs are added with term vectors stored, including positions and offsets.

There was no straightforward way of doing this; however, I reconsidered Jared's suggestion and was able to get a solution working.
I am documenting it here in case someone else has the same issue.
Create a class that implements org.apache.lucene.search.highlight.Formatter:
public class HitPositionCollector implements Formatter
{
    // MatchOffset is a simple DTO
    private List<MatchOffset> matchList;
    public HitPositionCollector()
    {
        matchList = new ArrayList<MatchOffset>();
    }
    // this is where the term's start and end offsets, as well as the actual term, are captured
    @Override
    public String highlightTerm(String originalText, TokenGroup tokenGroup)
    {
        // only token groups with a positive score are hits
        if (tokenGroup.getTotalScore() > 0)
        {
            MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(), tokenGroup.getStartOffset(), tokenGroup.getEndOffset());
            getMatchList().add(mo);
        }
        // the text itself is returned unchanged; nothing is actually highlighted
        return originalText;
    }
    /**
     * @return the matchList
     */
    public List<MatchOffset> getMatchList()
    {
        return matchList;
    }
}
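`MatchOffset` is described only as a simple DTO, so its definition is an assumption. A minimal sketch follows, consistent with the constructor call above and with the getters used in the main code (`getToken`, `getStartPos`, `getEndPos`). The field names and `toString` are illustrative.

```java
final class MatchOffset {

    private final String token;
    private final int startPos;
    private final int endPos;

    // Argument order follows the call in HitPositionCollector:
    // token text, start offset, end offset.
    MatchOffset(String token, int startPos, int endPos) {
        this.token = token;
        this.startPos = startPos;
        this.endPos = endPos;
    }

    String getToken() {
        return token;
    }

    int getStartPos() {
        return startPos;
    }

    int getEndPos() {
        return endPos;
    }

    @Override
    public String toString() {
        return token + "[" + startPos + "," + endPos + ")";
    }

    public static void main(String[] args) {
        // Offsets computed by hand for the sentence used in the test.
        String text = "the quick brown fox jumps over the lazy dog";
        int start = text.indexOf("brown");
        MatchOffset mo = new MatchOffset("brown", start, start + "brown".length());
        System.out.println(mo); // prints brown[10,15)
    }
}
```

The offsets 10 and 15 are the same values the assertions in the main code expect for "brown".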
Main Code
public void testHitsWithHitPositionCollector() throws Exception
{
    System.out.println(" .... testHitsWithHitPositionCollector");
    String fuzzyStr = "bro*";
    QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer);
    Query fzyQry = parser.parse(fuzzyStr);
    TopDocs hits = searcher.search(fzyQry, 10);
    QueryScorer scorer = new QueryScorer(fzyQry, "f");
    HitPositionCollector myFormatter = new HitPositionCollector();
    //Highlighter(Formatter formatter, Scorer fragmentScorer)
    Highlighter highlighter = new Highlighter(myFormatter, scorer);
    highlighter.setTextFragmenter(
        new SimpleSpanFragmenter(scorer)
    );
    Analyzer analyzer2 = new SimpleAnalyzer();
    int loopIndex = 0;
    //for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    String title = doc.get("f");
    TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(),
        hits.scoreDocs[0].doc,
        "f",
        doc,
        analyzer2);
    String fragment = highlighter.getBestFragment(stream, title);
    System.out.println(fragment);
    assertEquals("the quick brown fox jumps over the lazy dog", fragment);
    MatchOffset mo = myFormatter.getMatchList().get(loopIndex++);
    assertTrue(mo.getEndPos() == 15);
    assertTrue(mo.getStartPos() == 10);
    assertTrue(mo.getToken().equals("brown"));
}
