Trying to get more matches with lucene - java

I'm using Java and Lucene to match each song in a list I receive from a service against local files. What I'm currently struggling with is finding a query that returns as many matches per song as possible. Getting at least one matching file per song would be great.
This is what I have at the moment:
public List<String> getMatchesForSong(String artist, String title, String album) throws ParseException, IOException {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    String defaultQuery = "(title: \"%s\"~2) AND ((artist: \"%s\") OR (album: \"%s\"))";
    String searchQuery = String.format(defaultQuery, title, artist, album);
    Query query = new QueryParser("title", analyzer).parse(searchQuery);
    if (indexWriter == null) {
        indexWriter = createIndexWriter(indexDir);
        indexSearcher = createIndexSearcher(indexWriter);
    }
    TopDocs topDocs = indexSearcher.search(query, 20);
    if (topDocs.totalHits > 0) {
        return parseScoreDocsList(topDocs.scoreDocs);
    }
    return null;
}
This works very well when there are no inconsistencies, even for non-English characters. But it does not return a single match if, for example, I receive a song with the title "The Sun Was In My Eyes: Part One" while my corresponding file has the title "The Sun Was In My Eyes: Part 1", or if I receive it as "Pt. 1".
I don't get matches either when a title has more words than the corresponding file's title, such as "The End of all Times (Martyrs Fire)" as opposed to "The End of all Times". The same could happen with album names.
So, what improvements should I make to my code in order to get more matches?

I eventually found out that using a PhraseQuery for the title or album isn't the best approach, since it makes Lucene search for an exact match of the phrase.
What I ended up doing was creating a TermQuery for each word of both the title and the album, and joining everything in a BooleanQuery.
private Query parseQueryForSong(String artist, String title, String album) throws ParseException {
    String[] artistArr = artist.split(" ");
    String[] titleArr = sanitizePhrase(title).split(" ");
    String[] albumArr = sanitizePhrase(album).split(" ");
    BooleanQuery.Builder mainQueryBuilder = new BooleanQuery.Builder();
    BooleanQuery.Builder albumQueryBuilder = new BooleanQuery.Builder();
    PhraseQuery artistQuery = new PhraseQuery("artist", artistArr);
    // every title word is optional, so extra or missing words no longer prevent a match
    for (String titleWord : titleArr) {
        if (!titleWord.isEmpty()) {
            mainQueryBuilder.add(new TermQuery(new Term("title", titleWord)), BooleanClause.Occur.SHOULD);
        }
    }
    for (String albumWord : albumArr) {
        if (!albumWord.isEmpty()) {
            albumQueryBuilder.add(new TermQuery(new Term("album", albumWord)), BooleanClause.Occur.SHOULD);
        }
    }
    // the artist phrase and at least one album word are required
    mainQueryBuilder.add(artistQuery, BooleanClause.Occur.MUST);
    mainQueryBuilder.add(albumQueryBuilder.build(), BooleanClause.Occur.MUST);
    StandardAnalyzer analyzer = new StandardAnalyzer();
    // re-parse the query's string form so that the terms go through the analyzer
    Query mainQuery = new QueryParser("title", analyzer).parse(mainQueryBuilder.build().toString());
    return mainQuery;
}
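The sanitizePhrase helper is not shown above. A minimal sketch of what such a normalization step could look like, in plain Java with no Lucene dependency, follows. The class name TitleNormalizer and the synonym table are assumptions made for illustration, not the original code. The idea is to map variants such as "Part One", "Part 1" and "Pt. 1" to one form and to drop bracketed suffixes such as "(Martyrs Fire)". For this to help, the same normalization would presumably have to be applied to the titles at index time as well as to the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TitleNormalizer {
    // Hypothetical synonym table; extend it with the variants that occur in your data.
    private static final Map<String, String> SYNONYMS = Map.of(
            "pt", "part",
            "one", "1", "two", "2", "three", "3", "four", "4", "five", "5");

    /** Lower-cases, drops bracketed suffixes and punctuation, and maps known variants. */
    static String normalize(String phrase) {
        String cleaned = phrase.toLowerCase(Locale.ROOT)
                .replaceAll("[\\(\\[][^\\)\\]]*[\\)\\]]", " ") // drop "(...)" and "[...]"
                .replaceAll("[^\\p{L}\\p{N}]+", " ")           // punctuation and spaces -> one space
                .trim();
        List<String> words = new ArrayList<>();
        for (String word : cleaned.split(" ")) {
            if (!word.isEmpty()) {
                words.add(SYNONYMS.getOrDefault(word, word));
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Sun Was In My Eyes: Part One"));
        System.out.println(normalize("The Sun Was In My Eyes: Pt. 1"));
        System.out.println(normalize("The End of all Times (Martyrs Fire)"));
    }
}
```

Note that a blanket mapping like "one" to "1" also rewrites titles where the word is not a part number, which is harmless only as long as both sides of the comparison are normalized the same way.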

Related

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. The following code searches by the query prefix, but it also returns every sentence that contains a word starting with that prefix, whereas I want only the sentences or words that themselves start with the prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two results. How can I do this? I am stuck here, and I am new to Lucene. Can anyone help me with this? Thanks in advance.
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a non-analyzed field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
1. As suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
2. Save it just as a TextField, but use a SpanFirstQuery when querying:
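// 2. query
Term term = new Term("title", querystr);
// create the prefix query object
PrefixQuery pq = new PrefixQuery(term);
// wrap it so that it can take part in a span query
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
// match only when the term is the first token of the field
// ("final" is a reserved word in Java, so the variable needs another name)
Query finalQuery = new SpanFirstQuery(wrapper, 1);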
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
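To make the difference between the two fields concrete, here is a small plain-Java simulation using the titles from the question. It does not use Lucene at all: the class and method names are made up for illustration, and the whitespace split only stands in for an analyzer. It shows why a prefix query on an analyzed field matches any word of the title, while a prefix query on a non-analyzed field matches only titles that start with the prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixFieldDemo {
    /** Mimics a prefix query on an analyzed field: any token may start with the prefix. */
    static List<String> matchAnyToken(List<String> titles, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            for (String token : title.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(prefix)) {
                    hits.add(title);
                    break; // one matching token is enough for the title to be a hit
                }
            }
        }
        return hits;
    }

    /** Mimics a prefix query on a non-analyzed field: the whole title is a single term. */
    static List<String> matchWholeTitle(List<String> titles, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (title.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
                "holiday mansion houseboat", "eye muscles", "movies of all time", "machine");
        System.out.println(matchAnyToken(titles, "m"));
        System.out.println(matchWholeTitle(titles, "m"));
    }
}
```

One caveat for the real index: a non-analyzed field stores the value verbatim, so if matching should be case-insensitive, the title has to be lower-cased before it is added to that field, and the prefix lower-cased before querying.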

How will I go about indexing a customer using Lucene

I have a web application which stores customers' usernames, emails and phone numbers.
I want customers to be able to search for other users by email, phone or username, as a start, just to understand the whole Lucene concept. Later on I will add functionality to search the items a user posts. I am following this example on www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
I want new customers to be added to the index automatically on registration. The customerId is a timestamp. Should I add a new document for each field of the customer's details, or should I concatenate all fields into a string and add that as a single document? Please go easy on me, I am really new.
This is a good place to start with Lucene's indexing mechanism:
http://www.ibm.com/developerworks/library/wa-lucene/
The bottom line is that when Lucene indexes a document, it first converts it into Lucene's document form. A Lucene document comprises a set of fields, and each field is a set of terms. Terms are nothing but streams of bytes.
The document to be indexed is passed to an analyzer, which forms these terms out of it; these terms are the keywords that are matched during the search process.
When we perform a search, the query is analyzed by the same analyzer and then matched against the terms.
So you don't have to create a document for each field; rather, you should create a single document for each user, with one field per attribute.
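As a rough mental model of "one document per customer, one field per attribute", here is a toy inverted index in plain Java. It is not Lucene and not how Lucene is implemented internally; the class name CustomerIndexSketch, the field names and the lower-case-and-split "analyzer" are all assumptions made for illustration. It only shows how terms are recorded per field and how a query, analyzed the same way, is looked up against them.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CustomerIndexSketch {
    // "field:term" -> ids of the customers whose document contains that term
    private final Map<String, Set<Long>> postings = new HashMap<>();

    /** Indexes one customer as a single document with three fields. */
    void addCustomer(long customerId, String username, String email, String phone) {
        addField(customerId, "username", username);
        addField(customerId, "email", email);
        addField(customerId, "phone", phone);
    }

    private void addField(long customerId, String field, String value) {
        // Stand-in for an analyzer: lower-case and split on whitespace.
        for (String term : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(customerId);
            }
        }
    }

    /** Normalizes the query text the same way and looks the term up in one field. */
    Set<Long> search(String field, String text) {
        String term = text.toLowerCase(Locale.ROOT).trim();
        return postings.getOrDefault(field + ":" + term, new TreeSet<>());
    }
}
```

Calling addCustomer once per registration, then search("email", ...) or search("username", ...), mirrors what you would do with one Lucene Document per customer and a query against the chosen field.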

How to search fields with wildcard and spaces in Hibernate Search

I have a search box that performs a search on the title field based on the given input, so the user is shown all available titles starting with the text entered. It is based on Lucene and Hibernate Search. It works fine until a space is entered; then the results disappear. For example, I want "Learning H" to give me "Learning Hibernate" as the result, but this doesn't happen. Could you please advise me what I should use here instead?
Query Builder:
QueryBuilder qBuilder = fullTextSession.getSearchFactory()
.buildQueryBuilder().forEntity(LearningGoal.class).get();
Query query = qBuilder.keyword().wildcard().onField("title")
.matching(searchString + "*").createQuery();
BooleanQuery bQuery = new BooleanQuery();
bQuery.add(query, BooleanClause.Occur.MUST);
for (LearningGoal exGoal : existingGoals) {
Term omittedTerm = new Term("id", String.valueOf(exGoal.getId()));
bQuery.add(new TermQuery(omittedTerm), BooleanClause.Occur.MUST_NOT);
}
@SuppressWarnings("unused")
org.hibernate.Query hibQuery = fullTextSession.createFullTextQuery(
query, LearningGoal.class);
Hibernate class:
@AnalyzerDef(name = "searchtokenanalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name = "ignoreCase", value = "true") }) })
@Analyzer(definition = "searchtokenanalyzer")
public class LearningGoal extends Node {
I found a workaround for this problem. The idea is to tokenize the input string and remove stop words. For the last token I create a query using a keyword wildcard, and for all the previous words I create a TermQuery. Here is the full code:
BooleanQuery bQuery = new BooleanQuery();
Session session = persistence.currentManager();
FullTextSession fullTextSession = Search.getFullTextSession(session);
Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer("searchtokenanalyzer");
QueryParser parser = new QueryParser(Version.LUCENE_35, "title", analyzer);
String[] tokenized=null;
try {
Query query= parser.parse(searchString);
String cleanedText=query.toString("title");
tokenized = cleanedText.split("\\s");
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
QueryBuilder qBuilder = fullTextSession.getSearchFactory()
.buildQueryBuilder().forEntity(LearningGoal.class).get();
for(int i=0;i<tokenized.length;i++){
if(i==(tokenized.length-1)){
Query query = qBuilder.keyword().wildcard().onField("title")
.matching(tokenized[i] + "*").createQuery();
bQuery.add(query, BooleanClause.Occur.MUST);
}else{
Term exactTerm = new Term("title", tokenized[i]);
bQuery.add(new TermQuery(exactTerm), BooleanClause.Occur.MUST);
}
}
for (LearningGoal exGoal : existingGoals) {
Term omittedTerm = new Term("id", String.valueOf(exGoal.getId()));
bQuery.add(new TermQuery(omittedTerm), BooleanClause.Occur.MUST_NOT);
}
org.hibernate.Query hibQuery = fullTextSession.createFullTextQuery(
bQuery, LearningGoal.class);
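The same idea (exact match on every word except the last, prefix match on the last) can be sketched without Hibernate Search as a small helper that builds a Lucene-style query string. This is only an illustration: the class name, the stop-word list and the whitespace tokenization are assumptions, and the real analyzer configured above may tokenize differently. Special query-syntax characters in the input are not escaped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AutocompleteQuerySketch {
    // Hypothetical stop-word list; the real one comes from the configured stop filter.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    /** Builds "+field:word ... +field:lastWord*" from free-text input. */
    static String build(String field, String input) {
        List<String> tokens = new ArrayList<>();
        for (String raw : input.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        StringBuilder query = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                query.append(' ');
            }
            // "+" marks the clause as required, like BooleanClause.Occur.MUST
            query.append('+').append(field).append(':').append(tokens.get(i));
            if (i == tokens.size() - 1) {
                query.append('*'); // only the word still being typed is a prefix
            }
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("title", "Learning H"));
    }
}
```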
SQL uses different wildcards than any terminal. In SQL '%' replaces zero or more occurrences of any character (in the terminal you use '*' instead), and the underscore '_' replaces exactly one character (in the terminal you use '?' instead). Hibernate doesn't translate the wildcard characters.
So in the second line you have to replace matching(searchString + "*") with
matching(searchString + "%")

How to get Lucene Fuzzy Search result's matching terms?

How do you get the matching fuzzy term and its offset when using Lucene Fuzzy Search?
IndexSearcher mem = ....(some standard code)
QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);
// the ~ triggers the fuzzy search as per "Lucene In Action"
The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is matched. How do I find out which term matched and what its offsets are?
I have made sure that all CONTENT_FIELDs are added with term vectors stored, including positions and offsets.
There was no straightforward way of doing this; however, I reconsidered Jared's suggestion and was able to get the solution working.
I am documenting this here just in case someone else has the same issue.
Create a class that implements org.apache.lucene.search.highlight.Formatter
public class HitPositionCollector implements Formatter
{
    // MatchOffset is a simple DTO
    private List<MatchOffset> matchList;
    public HitPositionCollector()
    {
        matchList = new ArrayList<MatchOffset>();
    }
    // this is where the term's start and end offsets, as well as the actual term, are captured
    @Override
    public String highlightTerm(String originalText, TokenGroup tokenGroup)
    {
        // only token groups with a positive score matched the query
        if (tokenGroup.getTotalScore() > 0)
        {
            MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(), tokenGroup.getStartOffset(), tokenGroup.getEndOffset());
            getMatchList().add(mo);
        }
        // return the text unchanged; this formatter only records matches
        return originalText;
    }
    /**
     * @return the matchList
     */
    public List<MatchOffset> getMatchList()
    {
        return matchList;
    }
}
Main Code
public void testHitsWithHitPositionCollector() throws Exception
{
System.out.println(" .... testHitsWithHitPositionCollector");
String fuzzyStr = "bro*";
QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer);
Query fzyQry = parser.parse(fuzzyStr);
TopDocs hits = searcher.search(fzyQry, 10);
QueryScorer scorer = new QueryScorer(fzyQry, "f");
HitPositionCollector myFormatter= new HitPositionCollector();
//Highlighter(Formatter formatter, Scorer fragmentScorer)
Highlighter highlighter = new Highlighter(myFormatter,scorer);
highlighter.setTextFragmenter(
new SimpleSpanFragmenter(scorer)
);
Analyzer analyzer2 = new SimpleAnalyzer();
int loopIndex=0;
//for (ScoreDoc sd : hits.scoreDocs) {
Document doc = searcher.doc( hits.scoreDocs[0].doc);
String title = doc.get("f");
TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(),
hits.scoreDocs[0].doc,
"f",
doc,
analyzer2);
String fragment = highlighter.getBestFragment(stream, title);
System.out.println(fragment);
assertEquals("the quick brown fox jumps over the lazy dog", fragment);
MatchOffset mo= myFormatter.getMatchList().get(loopIndex++);
assertTrue(mo.getEndPos()==15);
assertTrue(mo.getStartPos()==10);
assertTrue(mo.getToken().equals("brown"));
}
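As an independent cross-check for the offsets collected this way, the matching can be approximated in plain Java. The sketch below is not what Lucene does internally (Lucene uses its own automaton-based fuzzy matching); it uses a textbook Levenshtein distance over letter-only tokens, and the names FuzzyOffsetSketch and Match are made up for illustration. It reports each token within a given edit distance of the search term, together with its start offset and exclusive end offset in the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FuzzyOffsetSketch {
    /** A matched token with its character offsets; end is exclusive. */
    static final class Match {
        final String token;
        final int start;
        final int end;

        Match(String token, int start, int end) {
            this.token = token;
            this.start = start;
            this.end = end;
        }
    }

    /** Textbook dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    /** Returns every token of text within maxEdits of term, with its offsets. */
    static List<Match> find(String text, String term, int maxEdits) {
        List<Match> matches = new ArrayList<>();
        Matcher words = Pattern.compile("\\p{L}+").matcher(text);
        while (words.find()) {
            String token = words.group().toLowerCase(Locale.ROOT);
            if (editDistance(token, term) <= maxEdits) {
                matches.add(new Match(token, words.start(), words.end()));
            }
        }
        return matches;
    }
}
```

For the sentence used in the test above, find(text, "brawn", 1) yields the single token "brown" with the same start and end offsets that the test asserts (10 and 15).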

lucene get matched terms in query

What is the best way to find out which terms in a query matched against a given document returned as a hit in lucene?
I have tried a weird method involving the hit highlighting package in Lucene contrib, and also a method that searches for every word in the query against the topmost document ("docId: xy AND description: each_word_in_query").
Neither gives satisfactory results.
Hit highlighting does not report some of the words that matched for documents other than the first one.
I'm not sure if the second approach is the best alternative.
The explain method of the searcher is a nice way to see which parts of a query were matched and how they affect the overall score.
Example taken from the book Lucene In Action 2nd Edition:
public class Explainer {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: Explainer <index dir> <query>");
System.exit(1);
}
String indexDir = args[0];
String queryExpression = args[1];
Directory directory = FSDirectory.open(new File(indexDir));
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
"contents", new SimpleAnalyzer());
Query query = parser.parse(queryExpression);
System.out.println("Query: " + queryExpression);
IndexSearcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
// iterate over the returned hits only; totalHits can be larger than scoreDocs.length
for (ScoreDoc match : topDocs.scoreDocs) {
Explanation explanation = searcher.explain(query, match.doc);
System.out.println("----------");
Document doc = searcher.doc(match.doc);
System.out.println(doc.get("title"));
System.out.println(explanation.toString());
}
}
}
This will explain the score of each document that matches the query.
I have not tried it yet, but have a look at the implementation of org.apache.lucene.search.highlight.QueryTermExtractor.
