How do I go about indexing a customer using Lucene (Java)?

I have a web application which stores customers' usernames, emails and phone numbers.
I want customers to be able to search for other users by email, phone or username for a start, just to understand the whole Lucene concept. Later on I will add functionality to search within the items a user posts. I am following this example: www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        // The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "Lucene in Action", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();

        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }

        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}
I want new customers to be added to the index automatically on registration. The customerId is a timestamp. So should I add a new document for each field of a customer's details, or should I concatenate all fields into one string and add that as a single document? Please go easy on me, I am really new.

This is a good place to start with Lucene's indexing mechanism:
http://www.ibm.com/developerworks/library/wa-lucene/
In short, when Lucene indexes a document it first converts it into Lucene's document form. A Lucene document comprises a set of fields, and each field is a set of terms; a term is just a stream of bytes. The document to be indexed is passed to an analyzer, which produces those terms, and the terms are the keywords that are matched during the search process.
When you perform a search, the query is run through the same analyzer and then matched against the terms.
So you don't have to create a document for each field; rather, you should create a single document for each user.
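To make that concrete, here is a minimal sketch of how registration could add one document per customer, with a separate field for each attribute (the method and field names are my own, not from the question; adapt them to your code):

// Sketch: index one Lucene document per customer at registration time.
// Field names ("username", "email", "phone", "customerId") are illustrative.
private static void addCustomer(IndexWriter w, String username, String email,
                                String phone, String customerId) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("username", username, Field.Store.YES));
    doc.add(new TextField("email", email, Field.Store.YES));
    // Phone numbers and ids should not be tokenized, so use StringField.
    doc.add(new StringField("phone", phone, Field.Store.YES));
    doc.add(new StringField("customerId", customerId, Field.Store.YES));
    w.addDocument(doc);
}

Searching by email, phone or username then just means parsing the query against the corresponding field (or using a MultiFieldQueryParser across all three).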

Related

Lucene searching and ranking with topdocs

I have managed to create an index using some supplied Java code with Lucene and indexed 9 XML files successfully. I now have to modify another supplied Java file to search the index. I am able to output the number of hits, but I need to further modify the output so that when you submit a query it outputs the top ten results in the following format:-
Ranking: 0. Filename: .xml FilePath: c:/folder/movie.xml Score: 0.5
I'm trying to create a for loop, but none of the examples I have tried seem to work. This is my first venture with both Java and Lucene, so any help would be greatly appreciated.
public class LuceneSearch {
    public int n = 0;
    //String fileName;

    /* searchIndex is the method that initiates searching the index
       via the StandardAnalyzer and iterates through the hits for the results. */
    public static void searchIndex(String searchString) throws IOException, ParseException {
        String fieldContents = "summary"; // current field name to search; each text item field name = 'contents'
        String fileName = "filename";
        String filePath = "filepath";
        // get index location
        Directory directory = FSDirectory.getDirectory("/Users/Jac/Documents/index/");
        // initiate reader and searcher classes
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // initiate StandardAnalyzer
        Analyzer analyzer = new StandardAnalyzer();
        // parse the query against the contents field with QueryParser
        QueryParser queryParser = new QueryParser(fieldContents, analyzer);
        // get user query string
        Query query = queryParser.parse(searchString.toLowerCase());
        // run the search and collect the top 10 hits
        TopDocs hits = indexSearcher.search(query, null, 10);
        System.out.println("Searching for '" + searchString.toLowerCase() + "'");
        //Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.totalHits);
        System.out.println("Searching XML Tag Element '" + searchString.toLowerCase() + "'");
        System.out.println("Number of hits: " + hits.totalHits);
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            // Document doc = IndexSearcher.doc(scoreDoc.doc);
            System.out.println("Ranking: ");
            // System.out.println(doc.get("fullpath"));
        }
        System.out.println("***Search Complete***");
    }

    public static void main(String[] args) throws Exception {
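For the result loop the question asks about, a minimal sketch might look like the following (my own assumption, using the "filename" and "filepath" field names declared above; note that the document lookup is an instance call on indexSearcher, not a static call on IndexSearcher):

// Sketch: print the top hits in the requested format,
// assuming the index stores "filename" and "filepath" fields.
int rank = 0;
for (ScoreDoc scoreDoc : hits.scoreDocs) {
    Document doc = indexSearcher.doc(scoreDoc.doc);
    System.out.println("Ranking: " + rank
            + ". Filename: " + doc.get("filename")
            + " FilePath: " + doc.get("filepath")
            + " Score: " + scoreDoc.score);
    rank++;
}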

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. I have the following code, which searches by the query prefix, but it also returns every sentence containing that word, while I want it to display only sentences or words that start exactly with that prefix.
For example, for the prefix m:
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two queries. How do I do that? I am stuck here, and I am also new to Lucene. Can anyone please help me with this? Thanks in advance.
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a string field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}
Main:
try {
    // 0. Specify the analyzer for tokenizing text.
    // The same analyzer should be used for indexing and searching
    StandardAnalyzer analyzer = new StandardAnalyzer();
    // 1. create the index
    Directory index = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < source.size(); i++) {
        addDoc(writer, source.get(i), (i + 1) + "z");
    }
    writer.close();
    // 2. query
    Term term = new Term("title", querystr);
    // create the term query object
    PrefixQuery query = new PrefixQuery(term);
    // 3. search
    int hitsPerPage = 20;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    // 4. Get results
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(d.get("title"));
    }
    reader.close();
} catch (Exception e) {
    System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
1. As suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
2. Save it just as a TextField, but when querying use SpanFirstQuery:
// 2. query
Term term = new Term("title", querystr);
// create the prefix query and wrap it so it only matches at the start of the field
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
// note: "final" is a reserved word in Java, so use a different variable name
Query finalQuery = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying within the title, and one non-analyzed, so that titles are indexed without being broken into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field, so that it matches only from the start of the title. Your ordinary term queries should be issued against the analyzed field for matches anywhere within the title.
I hope that makes sense.
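A minimal sketch of that two-field approach (the field name "title_exact" and the variable names are my own, not from the question):

// Indexing: store the title twice.
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));         // analyzed: for term queries within the title
doc.add(new StringField("title_exact", title, Field.Store.YES)); // not analyzed: kept whole as a single token

// Autocomplete: a prefix query against the non-analyzed field,
// so only titles that start with the typed text match.
// (StringField keeps the original casing, so lower-case the title at
// index time if the user input is lower-cased before querying.)
PrefixQuery autocomplete = new PrefixQuery(new Term("title_exact", userInput));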

Deleting a document in apache lucene having exact match

I want to delete a document in Apache Lucene only when there is an exact match. For example, I have documents containing the text:
Document1: Bilal
Document2: Bilal Ahmed
Document3: Bilal Ahmed - 54
When I try to remove a document with the query 'Bilal', it deletes all three documents, while it should delete only the first one, the exact match.
The Code I use is this:
String query = "bilal";
String field = "userNames";
Term term = new Term(field, query);
IndexWriter indexWriter = null;
File indexDir = new File(idexedDirectory);
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
indexWriter = new IndexWriter(directory, iwc);
indexWriter.deleteDocuments(term);
indexWriter.close();
This is how I am indexing my documents:
File indexDir = new File("C:\\Local DB\\TextFiled");
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
//Thirdly We tell the Index Writer that which document to index
indexWriter = new IndexWriter(directory, iwc);
int i = 0;
try (DataSource db = DataSource.getInstance()) {
PreparedStatement ps = db.getPreparedStatement(
"SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);
try (ResultSet resultSet = ps.executeQuery()) {
while (resultSet.next()) {
i++;
doc = new Document();
text = resultSet.getString("username");
doc.add(new StringField("userNames", text, Field.Store.YES));
indexWriter.addDocument(doc);
System.out.println("User Name : " + text + " : " + userID);
}
}
That depends on how those documents were indexed. If they were indexed using StandardAnalyzer with tokenization on, it is understandable that you get these results: StandardAnalyzer tokenizes the text into individual words, and since every one of your documents contains 'Bilal', all of them match the term and are deleted.
The general advice is to always add a unique id field and query/delete by that id field.
If you can't do this, index the same text as a separate field, without tokenization, and use a phrase query to find the exact match, but that sounds like a horrible hack to me.
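A minimal sketch of the id-based approach (the "user_id" field name simply mirrors the SQL column in the indexing code above and is my own choice):

// At indexing time, store the database id as a separate, non-tokenized field.
doc.add(new StringField("user_id", String.valueOf(resultSet.getInt("user_id")), Field.Store.YES));

// At deletion time, delete by that unique id instead of by username.
indexWriter.deleteDocuments(new Term("user_id", String.valueOf(idToDelete)));
indexWriter.commit();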

What analyzer should I use so that I get hits for misspelled words?

I am writing full-text search functionality in my project using Lucene 4.3.
Everything works just fine when I add data, but when querying I only get hits if at least one word in the query matches at least one word in the value of a field in the index.
For example, with this indexing code:
private static StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

public static void addCustomerDoc(Map<String, String[]> parameters, String path, long customerId) throws IOException {
    File file = new File(path + "/index/");
    FSDirectory indexDir = FSDirectory.open(file);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
    IndexWriter writer = new IndexWriter(indexDir, config);
    Document doc = new Document();
    doc.add(new TextField("email", parameters.get("email")[0].toString(), Field.Store.YES));
    doc.add(new TextField("username", parameters.get("username")[0].toString(), Field.Store.YES));
    doc.add(new TextField("phone", parameters.get("phone")[0].toString(), Field.Store.YES));
    doc.add(new StringField("customerId", "" + customerId, Field.Store.YES));
    addDoc(writer, doc);
    writer.close();
}

private static void addDoc(IndexWriter writer, Document doc) throws IOException {
    writer.addDocument(doc);
    writer.commit();
}
to add a user like
username = foobar
email = foobar#example.com
phone = 0723123456
If I search for foo, fooba or foobarx I get no hits. Shouldn't I get a result even if I type just f, or overshoot the word with foobarx?
If you are using the query parser syntax, you should look into wildcard and fuzzy query syntax.
You can search for a prefix with syntax like:
username:foob*
And you can use a fuzzy query instead, with:
username:foobarx~
Or you can limit how loose the fuzzy matching is, with a number between 0 and 1, higher being more restrictive, like:
username:foobarx~0.5
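If you build the queries in code rather than through the query parser, the programmatic equivalents would look roughly like this (a sketch; the field name "username" matches the indexing code above):

// Prefix match: hits documents whose username starts with "foo".
Query prefix = new PrefixQuery(new Term("username", "foo"));

// Fuzzy match: hits usernames within a small edit distance of "foobarx",
// so an extra or missing character does not prevent a match.
Query fuzzy = new FuzzyQuery(new Term("username", "foobarx"));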

Lucene 4.0 in text search

I'm using Lucene 4.0 with Java. I'm trying to search for a string inside another string. Taking the Lucene hello-world example, I wish to find the text "lucene" inside the phrase "inLuceneAction", so that I get two matches in this case instead of one.
Any idea how to do it?
Thanks
public class HelloLucene {
    public static void main(String[] args) throws IOException, ParseException {
        // 0. Specify the analyzer for tokenizing text.
        // The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, "inLuceneAction", "193398817");
        addDoc(w, "Lucene for Dummies", "55320055Z");
        addDoc(w, "Managing Gigabytes", "55063554A");
        addDoc(w, "The Art of Computer Science", "9900333X");
        w.close();

        // 2. query
        String querystr = args.length > 0 ? args[0] : "lucene";
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
        }

        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}
If you index the terms in the default way, meaning inLuceneAction is a single term, Lucene won't be able to seek to that term given the query lucene, because it has a different prefix. Analyze the string so that it produces three indexed terms, in, Lucene and Action, and then the document will be fetched. You'll either find a ready-made analyzer for this or you'll have to write your own. Writing your own analyzer is a bit out of scope for a single StackOverflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.
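As one possible direction (my own sketch, not necessarily the ready-made analyzer hinted at above), the lucene-analyzers-common module ships a WordDelimiterFilter that can split tokens on case changes, so a custom analyzer along these lines would index inLuceneAction as in / lucene / action:

// Sketch: a custom analyzer that splits camelCase tokens such as "inLuceneAction"
// into "in", "lucene", "action" using WordDelimiterFilter (lucene-analyzers-common).
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

public class CamelCaseAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
                  | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                  | WordDelimiterFilter.PRESERVE_ORIGINAL;
        TokenStream result = new WordDelimiterFilter(source, flags, null);
        // lower-case the parts so a query for "lucene" matches "Lucene"
        result = new LowerCaseFilter(Version.LUCENE_40, result);
        return new TokenStreamComponents(source, result);
    }
}

Use this analyzer in place of StandardAnalyzer for both the IndexWriterConfig and the QueryParser; a query for lucene should then match the inLuceneAction document.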
