How do you get the matching fuzzy term and its offset when using Lucene fuzzy search?
IndexSearcher mem = ....(some standard code)
QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);
// the ~ triggers the fuzzy search as per "Lucene In Action"
The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is matched. How do I find out which term matched and what its offsets are?
I have made sure that all CONTENT_FIELDs are added with term vectors stored with positions and offsets.
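Presumably the fields were added along these lines (a sketch against the Lucene 3.0 API; the writer variable and document text are illustrative):
Document doc = new Document();
doc.add(new Field(CONTENT_FIELD, "fuzzy wuzzy was a bear", Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc); // writer: an IndexWriter over the same index that mem searches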
There was no straightforward way of doing this; however, I reconsidered Jared's suggestion and was able to get the solution working.
I am documenting this here just in case someone else has the same issue.
Create a class that implements org.apache.lucene.search.highlight.Formatter
public class HitPositionCollector implements Formatter
{
// MatchOffset is a simple DTO
private List<MatchOffset> matchList;
public HitPositionCollector()
{
matchList = new ArrayList<MatchOffset>();
}
// this is where the term's start and end offsets, as well as the matched term itself, are captured
@Override
public String highlightTerm(String originalText, TokenGroup tokenGroup)
{
if (tokenGroup.getTotalScore() > 0)
{
MatchOffset mo = new MatchOffset(tokenGroup.getToken(0).toString(), tokenGroup.getStartOffset(), tokenGroup.getEndOffset());
getMatchList().add(mo);
}
return originalText;
}
/**
* @return the matchList
*/
public List<MatchOffset> getMatchList()
{
return matchList;
}
}
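For completeness, here is a minimal sketch of the MatchOffset DTO (the original only describes it as a simple DTO; the field and accessor names are inferred from the test code below):
// minimal sketch of MatchOffset; names inferred from its usage
public class MatchOffset
{
private final String token;
private final int startPos;
private final int endPos;
public MatchOffset(String token, int startPos, int endPos)
{
this.token = token;
this.startPos = startPos;
this.endPos = endPos;
}
public String getToken() { return token; }
public int getStartPos() { return startPos; }
public int getEndPos() { return endPos; }
}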
Main Code
public void testHitsWithHitPositionCollector() throws Exception
{
System.out.println(" .... testHitsWithHitPositionCollector");
String fuzzyStr = "bro*";
QueryParser parser = new QueryParser(Version.LUCENE_30, "f", analyzer);
Query fzyQry = parser.parse(fuzzyStr);
TopDocs hits = searcher.search(fzyQry, 10);
QueryScorer scorer = new QueryScorer(fzyQry, "f");
HitPositionCollector myFormatter = new HitPositionCollector();
//Highlighter(Formatter formatter, Scorer fragmentScorer)
Highlighter highlighter = new Highlighter(myFormatter,scorer);
highlighter.setTextFragmenter(
new SimpleSpanFragmenter(scorer)
);
Analyzer analyzer2 = new SimpleAnalyzer();
int loopIndex = 0;
// only the first hit is examined here; loop over hits.scoreDocs to process every hit
Document doc = searcher.doc( hits.scoreDocs[0].doc);
String title = doc.get("f");
TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(),
hits.scoreDocs[0].doc,
"f",
doc,
analyzer2);
String fragment = highlighter.getBestFragment(stream, title);
System.out.println(fragment);
assertEquals("the quick brown fox jumps over the lazy dog", fragment);
MatchOffset mo = myFormatter.getMatchList().get(loopIndex++);
assertTrue(mo.getEndPos()==15);
assertTrue(mo.getStartPos()==10);
assertTrue(mo.getToken().equals("brown"));
}
Related
I'm using Java and Lucene to match each song of a list I receive from a service with local files. What I'm currently struggling with is finding a query that will get me the greatest number of matches per song. If I could get at least one matching file per song, it would be great.
This is what I have at the moment:
public List<String> getMatchesForSong(String artist, String title, String album) throws ParseException, IOException {
StandardAnalyzer analyzer = new StandardAnalyzer();
String defaultQuery = "(title: \"%s\"~2) AND ((artist: \"%s\") OR (album: \"%s\"))";
String searchQuery = String.format(defaultQuery, title, artist, album);
Query query = new QueryParser("title", analyzer).parse(searchQuery);
if (indexWriter == null) {
indexWriter = createIndexWriter(indexDir);
indexSearcher = createIndexSearcher(indexWriter);
}
TopDocs topDocs = indexSearcher.search(query, 20);
if (topDocs.totalHits > 0) {
return parseScoreDocsList(topDocs.scoreDocs);
}
return null;
}
This works very well when there are no inconsistencies, even for non-English characters. But it will not return a single match if, for example, I receive a song with the title "The Sun Was In My Eyes: Part One" but my corresponding file has the title "The Sun Was In My Eyes: Part 1", or if I receive it as "Pt. 1".
I don't get matches either when the titles have more words than the corresponding files, like "The End of all Times (Martyrs Fire)" as opposed to "The End of all Times". The same could happen for album names.
So, what I'd like to know is what improvements I should make in my code in order to get more matches.
So I eventually found out that using a PhraseQuery for the title or album isn't the best approach, since that causes Lucene to search for an exact match of the phrase.
What I ended up doing was building a TermQuery for each of the words of both the title and the album, and joining everything in a BooleanQuery.
private Query parseQueryForSong(String artist, String title, String album) throws ParseException {
String[] artistArr = artist.split(" ");
String[] titleArr = sanitizePhrase(title).split(" ");
String[] albumArr = sanitizePhrase(album).split(" ");
BooleanQuery.Builder mainQueryBuilder = new BooleanQuery.Builder();
BooleanQuery.Builder albumQueryBuilder = new BooleanQuery.Builder();
PhraseQuery artistQuery = new PhraseQuery("artist", artistArr);
for (String titleWord : titleArr) {
if (!titleWord.isEmpty()) {
mainQueryBuilder.add(new TermQuery(new Term("title", titleWord)), BooleanClause.Occur.SHOULD);
}
}
for (String albumWord : albumArr) {
if (!albumWord.isEmpty()) {
albumQueryBuilder.add(new TermQuery(new Term("album", albumWord)), BooleanClause.Occur.SHOULD);
}
}
mainQueryBuilder.add(artistQuery, BooleanClause.Occur.MUST);
mainQueryBuilder.add(albumQueryBuilder.build(), BooleanClause.Occur.MUST);
StandardAnalyzer analyzer = new StandardAnalyzer();
Query mainQuery = new QueryParser("title", analyzer).parse(mainQueryBuilder.build().toString());
return mainQuery;
}
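A note on the last step: the built BooleanQuery is converted to a string and re-parsed, presumably so the raw (unanalyzed) words pass through the StandardAnalyzer. A short usage sketch (the artist/title/album values are made up for illustration):
// usage sketch; argument values are hypothetical
Query query = parseQueryForSong("Pink Floyd", "Time", "The Dark Side of the Moon");
TopDocs topDocs = indexSearcher.search(query, 20);
for (ScoreDoc sd : topDocs.scoreDocs) {
    System.out.println(indexSearcher.doc(sd.doc).get("title"));
}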
I have a custom lucene analyzer for names.
I mostly get the correct match back, but I would like to prevent the return of results that are "not so close" matches.
Example:
Query: Art Inn Hotel Essen
One of the results: Hotel Garni an der Eissporthalle, score: 7.6011443
I'd like to prevent this result; even though it is not the topmost, it is still inappropriate. Is that possible?
I use the following analyzer:
public class MyAnalyzer extends Analyzer {
static final Version VERSION = Version.LUCENE_4_9;
@Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
final Tokenizer source = new StandardTokenizer(VERSION, reader);
TokenStream result = new StandardFilter(VERSION, source);
result = new LowerCaseFilter(VERSION, result);
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
}
}
BooleanQuery q = new BooleanQuery();
q.add(new QueryParser(VERSION, "name", new MyAnalyzer()).parse(name), Occur.MUST);
Also I wonder: why is the result matched at all? Is it because the only term that occurs in both of them is the string Hotel?
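Lucene scores are not normalized across queries, so there is no universally meaningful absolute cutoff. Still, if a hand-tuned threshold is acceptable, here is a minimal sketch of dropping low-scoring hits (the threshold value, the searcher variable, and the name field are assumptions):
// minimal sketch: discard hits below a hand-tuned score threshold (5.0f is an arbitrary assumption)
float minScore = 5.0f;
TopDocs topDocs = searcher.search(q, 10);
for (ScoreDoc sd : topDocs.scoreDocs) {
    if (sd.score >= minScore) {
        System.out.println(searcher.doc(sd.doc).get("name") + ", score: " + sd.score);
    }
}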
I'm trying to create a term-document matrix for a small corpus to experiment further with LSI. However, I couldn't find a way to do it with Lucene 4.4.
I know how to get the term vector for each document, as follows:
//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size()); //just testing
I thought I could just union all the term vectors together as columns in a matrix. However, the term vectors of different documents have different sizes, and I don't know how to pad the shorter vectors with zeros, so this method certainly does not work.
Hence, I wonder if someone can show me how to create a term-document matrix with Lucene 4.4. (If possible, please show sample code.)
If Lucene does not support this, what other way would you recommend?
Many thanks,
I found the solution to my problem here. A very detailed example is given by Mr. Sujit, although the code is written for an older version of Lucene, so many things have to be changed. I'll update the details when I finish my code.
Here is my solution, which works on Lucene 4.4:
public class BuildTermDocumentMatrix {
// fields inferred from their usage below
private final IndexReader reader;
private final IndexSearcher searcher;
private final File corpus;
private final Map<String, Integer> termIdMap;
public BuildTermDocumentMatrix(File index, File corpus) throws IOException{
reader = DirectoryReader.open(FSDirectory.open(index));
searcher = new IndexSearcher(reader);
this.corpus = corpus;
termIdMap = computeTermIdMap(reader);
}
/**
* Map each term to a fixed integer so that we can build the document matrix later.
* The integer is used to assign the term to a specific row of the term-document matrix.
*/
private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
Map<String,Integer> termIdMap = new HashMap<String,Integer>();
int id = 0;
Fields fields = MultiFields.getFields(reader);
Terms terms = fields.terms("contents");
TermsEnum itr = terms.iterator(null);
BytesRef term = null;
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
if (termIdMap.containsKey(termText))
continue;
//System.out.println(termText);
termIdMap.put(termText, id++);
}
return termIdMap;
}
/**
* build term-document matrix for the given directory
*/
public RealMatrix buildTermDocumentMatrix () throws IOException {
//iterate through directory to work with each doc
int col = 0;
int numDocs = countDocs(corpus); //get the number of documents here
int numTerms = termIdMap.size(); //total number of terms
RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);
for (File f : corpus.listFiles()) {
if (!f.isHidden() && f.canRead()) {
//I build term document matrix for a subset of corpus so
//I need to lookup document by path name.
//If you build for the whole corpus, just iterate through all documents
String path = f.getPath();
BooleanQuery pathQuery = new BooleanQuery();
pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
TopDocs hits = searcher.search(pathQuery, 1);
//get term vector
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
TermsEnum itr = termVector.iterator(null);
BytesRef term = null;
//compute term weight
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
int row = termIdMap.get(termText);
long termFreq = itr.totalTermFreq();
long docCount = itr.docFreq();
double weight = computeTfIdfWeight(termFreq, docCount, numDocs);
tdMatrix.setEntry(row, col, weight);
}
col++;
}
}
return tdMatrix;
}
}
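The countDocs and computeTfIdfWeight helpers are not shown above; here is a sketch of what they might look like (the exact tf-idf variant is an assumption, not necessarily what the original used):
// sketch of the helpers referenced above; the tf-idf formula is one common variant
private int countDocs(File corpus) {
    int count = 0;
    for (File f : corpus.listFiles()) {
        if (!f.isHidden() && f.canRead()) count++;
    }
    return count;
}
private double computeTfIdfWeight(long termFreq, long docCount, int numDocs) {
    // raw term frequency times smoothed inverse document frequency
    return termFreq * Math.log((double) numDocs / (docCount + 1));
}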
You can also refer to the following code; in more recent Lucene versions this is quite easy.
public void testSparseFreqDoubleArrayConversion() throws Exception {
Terms fieldTerms = MultiFields.getTerms(index, "text");
if (fieldTerms != null && fieldTerms.size() != -1) {
IndexSearcher indexSearcher = new IndexSearcher(index);
for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
assertNotNull(vector);
assertTrue(vector.length > 0);
}
}
}
I tried indexing dates with the DateTools.dateToString() method. It works properly for indexing as well as searching.
But my already-indexed data, which has some references, stores the date as new Date().getTime().
So my problem is how to perform a range query on this data...
Any solution to this?
Thanks in advance.
You need to use a TermRangeQuery on your date field. That field always needs to be indexed with DateTools.dateToString() for it to work properly. Here's a full example of indexing and searching on a date range with Lucene 3.0:
public class LuceneDateRange {
public static void main(String[] args) throws Exception {
// setup Lucene to use an in-memory index
Directory directory = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
MaxFieldLength mlf = MaxFieldLength.UNLIMITED;
IndexWriter writer = new IndexWriter(directory, analyzer, true, mlf);
// use the current time as the base of dates for this example
long baseTime = System.currentTimeMillis();
// index 10 documents with 1 second between dates
for (int i = 0; i < 10; i++) {
Document doc = new Document();
String id = String.valueOf(i);
String date = buildDate(baseTime + i * 1000);
doc.add(new Field("id", id, Store.YES, Index.NOT_ANALYZED));
doc.add(new Field("date", date, Store.YES, Index.NOT_ANALYZED));
writer.addDocument(doc);
}
writer.close();
// search for documents from 5 to 8 seconds after base, inclusive
IndexSearcher searcher = new IndexSearcher(directory);
String lowerDate = buildDate(baseTime + 5000);
String upperDate = buildDate(baseTime + 8000);
boolean includeLower = true;
boolean includeUpper = true;
TermRangeQuery query = new TermRangeQuery("date",
lowerDate, upperDate, includeLower, includeUpper);
// display search results
TopDocs topDocs = searcher.search(query, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
System.out.println(doc);
}
}
public static String buildDate(long time) {
return DateTools.dateToString(new Date(time), Resolution.SECOND);
}
}
You'll get much better search performance if you use a NumericField for your date, and then NumericRangeFilter/Query to do the range search.
You just have to encode your date as a long or int. One simple way is to call your Date's .getTime() method, but this may give far more resolution (milliseconds) than you need. If you only need day granularity, you can encode it as a YYYYMMDD integer.
Then, at search time, do the same conversion on your start/end dates and run a NumericRangeQuery/Filter.
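A minimal sketch of the day-resolution variant against the Lucene 3.0 API (the field name and date values are illustrative):
// index a day-resolution date as a YYYYMMDD integer
Document doc = new Document();
doc.add(new NumericField("date").setIntValue(20100315));
writer.addDocument(doc);
// search for all of March 2010, inclusive on both ends
NumericRangeQuery<Integer> query =
    NumericRangeQuery.newIntRange("date", 20100301, 20100331, true, true);
TopDocs topDocs = searcher.search(query, 10);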
What is the best way to find out which terms in a query matched against a given document returned as a hit in lucene?
I have tried a weird method involving the hit-highlighting package in Lucene contrib, and also a method that searches for every word in the query against the topmost document ("docId: xy AND description: each_word_in_query").
Neither gives satisfactory results: hit highlighting does not report some of the words that matched for documents other than the first one, and I'm not sure the second approach is the best alternative.
The explain method of the Searcher is a nice way to see which parts of a query matched and how they affect the overall score.
Example taken from the book Lucene In Action 2nd Edition:
public class Explainer {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: Explainer <index dir> <query>");
System.exit(1);
}
String indexDir = args[0];
String queryExpression = args[1];
Directory directory = FSDirectory.open(new File(indexDir));
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
"contents", new SimpleAnalyzer());
Query query = parser.parse(queryExpression);
System.out.println("Query: " + queryExpression);
IndexSearcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
for (int i = 0; i < topDocs.scoreDocs.length; i++) { // totalHits may exceed the number of returned docs
ScoreDoc match = topDocs.scoreDocs[i];
Explanation explanation = searcher.explain(query, match.doc);
System.out.println("----------");
Document doc = searcher.doc(match.doc);
System.out.println(doc.get("title"));
System.out.println(explanation.toString());
}
}
}
This will explain the score of each document that matches the query.
I haven't tried it yet, but have a look at the implementation of org.apache.lucene.search.highlight.QueryTermExtractor.
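For reference, a minimal sketch of what that might look like (query and reader are assumed to be your parsed Query and an open IndexReader; note that multi-term queries such as fuzzy queries generally have to be rewritten against the reader first so that their expanded terms become visible):
// sketch: extract the terms a query matches on, via the highlighter contrib API
Query rewritten = query.rewrite(reader); // expands fuzzy/wildcard queries into primitive term queries
WeightedTerm[] terms = QueryTermExtractor.getTerms(rewritten);
for (WeightedTerm term : terms) {
    System.out.println(term.getTerm() + " (weight " + term.getWeight() + ")");
}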