Lucene 4.0 in text search

I'm using Lucene 4.0 with Java and I'm trying to search for a string inside a string. Taking the Lucene "hello world" example, I want to find the text "lucene" inside the title "inLuceneAction", so that the search returns two matches instead of one.
Any idea how to do this?
Thanks
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import java.io.IOException;

public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        // The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "inLuceneAction", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();
        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }
        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}

If you index the terms in the default way, inLuceneAction becomes a single term, and Lucene cannot seek to it when given lucene because the term starts with a different prefix. Analyze the string so that it produces three indexed terms (in, Lucene, Action) and the document will be found. You will either find a ready-made analyzer for this or have to write your own. Writing your own analyzer is out of scope for a single StackOverflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.
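The splitting step can be pictured without Lucene. What follows is a minimal plain-Java sketch, not a Lucene analyzer: the class name CamelCaseSplitSketch and the method split are hypothetical, and the regular expression only handles simple lower-to-upper case transitions. Within Lucene itself, I believe the WordDelimiterFilter in the analyzers-common module is a ready-made filter that can split on case changes; the code only shows which terms such an analysis chain would need to emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CamelCaseSplitSketch {

    // Split at every lower-to-upper case transition, then lowercase each part.
    static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.split("(?<=[a-z])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The title from the question becomes three separate terms.
        System.out.println(split("inLuceneAction")); // [in, lucene, action]
    }
}
```

Once "lucene" exists as its own indexed term, a query for lucene can match this title as well as "Lucene for Dummies", which gives the two hits the question asks for.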

Related

Lucene searching and ranking with topdocs

I have managed to create an index using some supplied Java code with Lucene and have indexed 9 XML files successfully. I now have to modify another supplied Java file to search the index. I can output the number of hits, but I need to modify the output further so that submitting a query prints the top ten results in the following format:
Ranking: 0. Filename: .xml FilePath: c:/folder/movie.xml Score: 0.5
I'm trying to write a for loop, but none of the examples I have tried seem to work. This is my first venture with both Java and Lucene, so any help would be greatly appreciated.
public class LuceneSearch {
    public int n = 0;
    //String fileName;
    /*searchIndex is the method involved with initiating Searching the Index
    via the standardanalyzer by iterating through and using Hits for the results*/
    public static void searchIndex(String searchString) throws IOException, ParseException {
        String fieldContents = "summary";//current field name to search for. Each text item field name= 'contents'
        String fileName = "filename";
        String filePath = "filepath";
        Directory directory = FSDirectory.getDirectory("/Users/Jac/Documents/index/");
        //get index location
        //initiate reader and searcher classes
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        //initiate standardanalyzer
        Analyzer analyzer = new StandardAnalyzer();
        //parse the query contents field with queryparser
        QueryParser queryParser = new QueryParser(fieldContents, analyzer);
        //get user query string
        Query query = queryParser.parse(searchString.toLowerCase());
        //Initiate HITS class and utilise methods
        TopDocs hits = indexSearcher.search(query,null,10);
        System.out.println("Searching for '" + searchString.toLowerCase() + "'");
        //Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.totalHits);
        System.out.println("Searching XML Tag Element '" + searchString.toLowerCase() + "'");
        System.out.println("Number of hits: " + hits.totalHits);
        for(ScoreDoc scoreDoc : hits.scoreDocs) {
            // Document doc = IndexSearcher.doc(scoreDoc.doc);
            System.out.println("Ranking: ");
            // System.out.println(doc.get("fullpath"));
        }
        System.out.println("***Search Complete***");
    }
    public static void main(String[] args) throws Exception {

Java, Lucene : Sort search results with highest hit rate.

I am working on a Spring-MVC application in which I am saving contents of user-data and using Lucene to index and search. Currently the functionality is working fine. Is it possible to sort the result with the highest matching probability first? I am currently saving paragraphs or more of text in indexes. Thank you.
Save code :
Directory directory = org.apache.lucene.store.FSDirectory.open(path);
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.commit();
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
if (filePath != null) {
    File file = new File(filePath); // current directory
    doc.add(new TextField("path", file.getPath(), Field.Store.YES));
}
doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
FieldType fieldType = new FieldType(TextField.TYPE_STORED);
fieldType.setTokenized(false);
if (groupNotes != null) {
    doc.add(new Field("contents", text + "\n" + tagFileName + "\n" + String.valueOf(groupNotes.getNoteNumber()), fieldType));
} else {
    doc.add(new Field("contents", text + "\n" + tagFileName, fieldType));
}
Search code :
    File file = new File(path.toString());
    if ((file.isDirectory()) && (file.list().length > 0)) {
        if (text.contains(" ")) {
            String[] textArray = text.split(" ");
            for (String str : textArray) {
                Directory directory = FSDirectory.open(path);
                IndexReader indexReader = DirectoryReader.open(directory);
                IndexSearcher indexSearcher = new IndexSearcher(indexReader);
                Query query = new WildcardQuery(new Term("contents", "*" + str + "*"));
                TopDocs topDocs = indexSearcher.search(query, 100);
                for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                    System.out.println("Score is " + scoreDoc.score);
                    org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
                    objectIds.add(Integer.valueOf(document.get("id")));
                }
                indexSearcher.getIndexReader().close();
                directory.close();
            }
        }
    }
}
Thank you.

Your question is not entirely clear to me, so the following are only guesses.
IndexSearcher has methods that take an org.apache.lucene.search.Sort argument:
public TopFieldDocs search(Query query, int n, Sort sort, boolean doDocScores, boolean doMaxScore) throws IOException
or
public TopFieldDocs search(Query query, int n, Sort sort) throws IOException
See whether these methods solve your issue.
If you simply want to sort by score, don't collect only the document ids; collect the score too, in a POJO that has a score field.
Collect these POJOs in a List, then sort the list by score outside the loop.
for (ScoreDoc hit : hits) {
    //additional code
    pojo.setScore(hit.score);
    list.add(pojo);
}
Then, outside the for loop:
list.sort((POJO p1, POJO p2) -> p2.getScore().compareTo(p1.getScore()));
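One point worth stating: the ScoreDoc array returned by a plain IndexSearcher.search(query, n) call is already ordered by descending score, so explicit sorting only matters when hits from several searches are merged, as in the per-word loop in the question. Here is a minimal plain-Java sketch of that merge step. ScoredId and mergeAndRank are hypothetical names, the sample scores are made up, and summing the scores per document is one possible choice, not something Lucene prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScoreMergeSketch {

    // Hypothetical holder for one hit: the stored id plus the hit's score.
    static final class ScoredId {
        final int id;
        final float score;

        ScoredId(int id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    // Sum the scores per id, then return the ids with the highest total first.
    static List<Integer> mergeAndRank(List<ScoredId> hits) {
        Map<Integer, Float> totals = new LinkedHashMap<>();
        for (ScoredId hit : hits) {
            totals.merge(hit.id, hit.score, Float::sum);
        }
        List<Map.Entry<Integer, Float>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<Integer> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Float> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<ScoredId> hits = new ArrayList<>();
        hits.add(new ScoredId(7, 0.4f)); // matched the first word
        hits.add(new ScoredId(3, 0.9f)); // matched the first word
        hits.add(new ScoredId(7, 0.8f)); // same object matched the second word too
        System.out.println(mergeAndRank(hits)); // [7, 3]
    }
}
```

Merging by id also removes the duplicates that the question's loop would otherwise add to objectIds when one document matches several words.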

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. The following code searches by the query prefix, but it also returns every sentence that contains a word with that prefix, whereas I want only the sentences or words that start with the prefix.
Example, for the prefix m:
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two entries. I am stuck on how to do this, and I am new to Lucene. Can anyone help? Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a string field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}
Main:
        try {
            // 0. Specify the analyzer for tokenizing text.
            // The same analyzer should be used for indexing and searching
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // 1. create the index
            Directory index = FSDirectory.open(new File(indexDir));
            IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
            for (int i = 0; i < source.size(); i++) {
                addDoc(writer, source.get(i), + (i + 1) + "z");
            }
            writer.close();
            // 2. query
            Term term = new Term("title", querystr);
            //create the term query object
            PrefixQuery query = new PrefixQuery(term);
            // 3. search
            int hitsPerPage = 20;
            IndexReader reader = IndexReader.open(index);
            IndexSearcher searcher = new IndexSearcher(reader);
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
            searcher.search(query, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;
            // 4. Get results
            for (int i = 0; i < hits.length; ++i) {
                int docId = hits[i].doc;
                Document d = searcher.doc(docId);
                System.out.println(d.get("title"));
            }
            reader.close();
        } catch (Exception e) {
            System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
        }
    }
}

I see two solutions:
1. As suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
2. Save it just as a TextField, but use a SpanFirstQuery when querying:
// 2. query
Term term = new Term("title", querystr);
// create the prefix query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
// "final" is a reserved word in Java, so the variable needs another name
Query firstTermQuery = new SpanFirstQuery(wrapper, 1);

If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, and one non-analyzed, so that titles are indexed without being broken into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field, so that they match only from the start of the title. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
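To see why the non-analyzed field gives the desired autocomplete behaviour, here is a small plain-Java model, not Lucene code. It treats each whole title as a single term kept in sorted order, which is roughly how a prefix lookup over untokenized terms behaves. The class and method names are hypothetical, and the sample titles are the ones from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

public class WholeTitlePrefixSketch {

    // Return the titles that start with the prefix, treating each title as one term.
    static List<String> complete(List<String> titles, String prefix) {
        TreeSet<String> terms = new TreeSet<>();
        for (String title : titles) {
            terms.add(title.toLowerCase(Locale.ROOT));
        }
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        // tailSet seeks to the first term >= prefix; matching terms are contiguous,
        // so the scan can stop at the first term that no longer starts with it.
        for (String term : terms.tailSet(p)) {
            if (!term.startsWith(p)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "holiday mansion houseboat", "eye muscles",
                "movies of all time", "machine");
        System.out.println(complete(titles, "m")); // [machine, movies of all time]
    }
}
```

Because "holiday mansion houseboat" and "eye muscles" are each one term that begins with another letter, the prefix m never reaches them, which is the behaviour the question asks for.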

Deleting a document in apache lucene having exact match

I want to delete a document in Apache Lucene only when it is an exact match. For example, I have documents containing this text:
Document1: Bilal
Document2: Bilal Ahmed
Document3: Bilal Ahmed - 54
When I try to remove the document with the query 'Bilal', it deletes all three documents, while it should delete only the first document, the exact match.
The code I use is this:
String query = "bilal";
String field = "userNames";
Term term = new Term(field, query);
IndexWriter indexWriter = null;
File indexDir = new File(idexedDirectory);
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
indexWriter = new IndexWriter(directory, iwc);
indexWriter.deleteDocuments(term);
indexWriter.close();
This is how I am indexing my documents:
File indexDir = new File("C:\\Local DB\\TextFiled");
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
//Thirdly We tell the Index Writer that which document to index
indexWriter = new IndexWriter(directory, iwc);
int i = 0;
try (DataSource db = DataSource.getInstance()) {
    PreparedStatement ps = db.getPreparedStatement(
            "SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);
    try (ResultSet resultSet = ps.executeQuery()) {
        while (resultSet.next()) {
            i++;
            doc = new Document();
            text = resultSet.getString("username");
            doc.add(new StringField("userNames", text, Field.Store.YES));
            indexWriter.addDocument(doc);
            System.out.println("User Name : " + text + " : " + userID);
        }
    }

You have not shown how you index those documents. If they are indexed using StandardAnalyzer with tokenization on, these results are to be expected: StandardAnalyzer splits the text into words, and since each of your documents contains Bilal, the delete term matches all of them.
The general advice is to always add a unique id field and to query and delete by that field.
If you can't do this, index the same text in a separate field without tokenization and use a phrase query to find the exact match, but this sounds like a horrible hack to me.
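The difference between the two indexing styles can be modelled in a few lines of plain Java. This is an illustration only, not Lucene's implementation: termsTokenized and termsKeyword are hypothetical helpers that mimic, respectively, a lowercasing word tokenizer and a field indexed as one untokenized term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExactDeleteSketch {

    // Mimics a tokenized field: lowercase, then split on non-alphanumerics.
    static List<String> termsTokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Mimics an untokenized field: the whole value is one term, case preserved.
    static List<String> termsKeyword(String value) {
        return Arrays.asList(value);
    }

    // Keep only the documents whose terms do NOT contain the delete term.
    static List<String> deleteByTerm(List<String> docs, String term, boolean tokenized) {
        List<String> remaining = new ArrayList<>();
        for (String doc : docs) {
            List<String> terms = tokenized ? termsTokenized(doc) : termsKeyword(doc);
            if (!terms.contains(term)) {
                remaining.add(doc);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Bilal", "Bilal Ahmed", "Bilal Ahmed - 54");
        System.out.println(deleteByTerm(docs, "bilal", true));  // []
        System.out.println(deleteByTerm(docs, "Bilal", false)); // [Bilal Ahmed, Bilal Ahmed - 54]
    }
}
```

In the untokenized case the delete term has to equal the stored value exactly, including case: in this model, deleting by "bilal" against the keyword terms removes nothing.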

How will I go about indexing a customer using Lucene

I have a web application which stores customers' usernames, emails and phone numbers.
To start with, and to understand the whole Lucene concept, I want customers to be able to search for other users by email, phone or username. Later on I will add functionality to search within the items a user posts. I am following the example at www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        // The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();
        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }
        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}
I want new customers to be added to the index automatically on registration. The customerId is a timestamp. Should I add a new document for each field of the customer's details, or should I concatenate all fields into one string and add that as a single document? Please go easy on me, I am really new.

This is a good place to start with the Lucene indexing mechanism:
http://www.ibm.com/developerworks/library/wa-lucene/
In short, when Lucene indexes a document, it first converts it into a Lucene Document. This Document consists of a set of fields, and each field is a set of terms. Terms are nothing but streams of bytes.
The document to be indexed is passed to an analyzer, which forms these terms out of it; these terms are the keywords that are matched during searching.
When we perform a search, the query is analyzed with the same analyzer and then matched against the terms.
So you don't have to create a document for each field; rather, you should create a single document for each user.
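The "one document per user, one field per attribute" layout can be modelled in plain Java as follows. This is only a sketch of the data shape, not Lucene code: a document is represented as a map from field name to value, the field names customerId, username, email and phone are assumptions taken from the question's wording, and the sample customers are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CustomerDocSketch {

    // One "document" per customer: a map from field name to value.
    static Map<String, String> customerDoc(String id, String username, String email, String phone) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("customerId", id);
        doc.put("username", username);
        doc.put("email", email);
        doc.put("phone", phone);
        return doc;
    }

    // Return the customerId of every document whose field equals the value, ignoring case.
    static List<String> search(List<Map<String, String>> index, String field, String value) {
        List<String> ids = new ArrayList<>();
        for (Map<String, String> doc : index) {
            String stored = doc.get(field);
            if (stored != null && stored.equalsIgnoreCase(value)) {
                ids.add(doc.get("customerId"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>();
        index.add(customerDoc("1001", "alice", "alice@example.com", "555-0100"));
        index.add(customerDoc("1002", "bob", "bob@example.com", "555-0101"));
        System.out.println(search(index, "email", "bob@example.com")); // [1002]
        System.out.println(search(index, "username", "Alice"));        // [1001]
    }
}
```

Keeping the attributes as separate fields of one document means a hit on any of them leads back to the same customer, which concatenating everything into one string or spreading a customer over several documents would make harder.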
