I have a Lucene index of around 22,000 documents, but I have been facing a unique problem with it while writing a search program.
Each document has Title, description and long_description fields; these fields contain data about different diseases and their symptoms. Now when I search for a phrase like
"infection of the small intestine", I expect "Cholera" to be the first result. (By the way, I am using MultiFieldQueryParser with StandardAnalyzer.)
The reason I expect Cholera to come first is that it contains the exact phrase "infection of the small intestine" in the long_description field. But instead of appearing at the top, it ends up near the bottom, because plenty of other documents mention the term "infection" in the title field (which is substantially shorter than the description field). This can be seen in the screenshot below.
So just because "Cholera" does not have the most pertinent information in the "title" field, it ends up near the bottom. I saw the following thread where using "~3" (phrase slop) is suggested, but is that something I should do for every query behind the scenes? Isn't there a better way of doing it?
Searching phrases in Lucene
Make your query boost hits in title high, description medium and long_description low, like this:
title:intestine^100 description:intestine^10 long_description:intestine^1
In this example, title matches are weighted 100x, description matches 10x and long_description matches 1x; the boost multiplies into each clause's contribution to the score, and documents with the highest total score sort first. You can pick any numbers you like for the boost values.
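If you build the query programmatically, MultiFieldQueryParser can take these boosts directly. A minimal sketch (field names from the question, boost values arbitrary; Lucene 5+ constructors shown, older 4.x versions also take a Version argument):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;

    // Per-field boosts: title counts 100x, description 10x, long_description 1x.
    Map<String, Float> boosts = new HashMap<>();
    boosts.put("title", 100f);
    boosts.put("description", 10f);
    boosts.put("long_description", 1f);

    MultiFieldQueryParser parser = new MultiFieldQueryParser(
            new String[] {"title", "description", "long_description"},
            new StandardAnalyzer(),
            boosts);

    // Quoting the input keeps it a phrase query, so the exact phrase still counts.
    Query query = parser.parse("\"infection of the small intestine\"");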
You can change the norm computation (computeNorm / lengthNorm) in DefaultSimilarity so that field length counts for less.
Please check http://www.supermind.org/blog/378/lucene-scoring-for-dummies and http://blog.architexa.com/2010/12/custom-lucene-scoring/
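For Lucene 4.x (adapt to your version), the overridable hook in DefaultSimilarity is lengthNorm, which computeNorm delegates to. A rough sketch that flattens field-length normalization, so short title fields stop getting an outsized advantage; note that norms are computed at index time, so the custom Similarity has to be set on the IndexWriterConfig (and the data reindexed) as well as on the IndexSearcher:

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    // Ignores the number of terms in a field when computing norms, keeping only
    // the index-time field boost, so title length no longer dominates scoring.
    public class FlatLengthSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(FieldInvertState state) {
            return state.getBoost();
        }
    }

    // At index time:  indexWriterConfig.setSimilarity(new FlatLengthSimilarity());
    // At search time: indexSearcher.setSimilarity(new FlatLengthSimilarity());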
I'm trying to better organise the types of tasks regularly sent to my team, based on the titles and the short comment people enter.
Our team only handles a handful (maybe 10 or so) of different types of tasks, so I've put together a list of common words used within the description of each type of task, and I've been using this to categorise the issues. For example, an issue might come through like "User x doesn't have access to office after hours, please update their swipecard access level". What I've got so far is: if the comments contain 'swipecard' or 'access', it's a building access type request.
I've quickly found myself with code that's LOTS of... if contains, and if !contains...
Is there a neater way of doing what I'm after?
If you want to make it complex, it sounds like you have a classification problem.
If you want to keep it simple, you're probably on the right track with your if statements and contains(). To get to a cleaner solution, I would approach it as follows:
Create a class to model your categories - give it two attributes: String categoryName, List<String> commonlyUsedWords;
Populate a list with instances of that class - one per type.
For each issue, loop through the list of categories and check how many words match, and store that as a percentage (e.g. 8 out of 10 words match, therefore 80% match).
Return the category with the highest match rate (a minimal sketch of this follows below).
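A rough Java sketch of that approach (the class, field and category names are just illustrative):

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    class Category {
        final String categoryName;
        final List<String> commonlyUsedWords;

        Category(String categoryName, List<String> commonlyUsedWords) {
            this.categoryName = categoryName;
            this.commonlyUsedWords = commonlyUsedWords;
        }

        // Fraction of this category's keywords found in the issue text (case-insensitive).
        double matchRate(String issueText) {
            String lower = issueText.toLowerCase();
            long hits = commonlyUsedWords.stream()
                    .filter(w -> lower.contains(w.toLowerCase()))
                    .count();
            return (double) hits / commonlyUsedWords.size();
        }
    }

    public class Categoriser {
        static String bestCategory(String issueText, List<Category> categories) {
            return categories.stream()
                    .max(Comparator.comparingDouble((Category c) -> c.matchRate(issueText)))
                    .map(c -> c.categoryName)
                    .orElse("uncategorised");
        }

        public static void main(String[] args) {
            List<Category> categories = Arrays.asList(
                    new Category("building access", Arrays.asList("swipecard", "access", "door")),
                    new Category("password reset", Arrays.asList("password", "login", "reset")));

            System.out.println(bestCategory(
                    "User x doesn't have access to office after hours, "
                            + "please update their swipecard access level",
                    categories)); // prints "building access"
        }
    }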
Is it possible to configure or otherwise alter how Elasticsearch scores its results?
When running a search for "term" using the NativeSearchQueryBuilder, documents that contain one instance of the term are all scored the same. This makes sense. However, one of the documents contains just the term, whereas the others contain the term plus other data. For example:
Doc1: Title : Space
Doc2: Title : Space Time
Doc3: Title : No Space
So when searching for Space, is there any way to make Doc1 score more highly?
Edit:
So, a little more detail following briarheart's response. I think the problem is the way we're implementing typeahead searches. If I run the Space query using our standard search, the ranking is as briarheart outlined, but our typeahead scores everything equally because we use the wildcard request part and look for "term*", so "Space" and "Space Lane" both match it equally well.
So really I guess I'm asking the wrong question. Scoring is working as it should; I just need to figure out a better implementation of typeahead.
(The Suggest Request Part doesn't seem to fit the use case, as it would involve picking and resubmitting the desired suggestion.)
I do not know exactly what your query looks like, but in the case of a full-text search for the term "Space", the document "Doc1" from your example will get the highest score because of the length of its "Title" field: shorter fields carry more weight in terms of relevance.
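For the typeahead follow-up, one option (a sketch against an older Spring Data Elasticsearch / Elasticsearch Java API; class and method names may differ in your version) is to replace the wildcard with a match_phrase_prefix query, which still matches as-you-type prefixes but keeps normal relevance scoring instead of the constant score a wildcard query produces:

    import org.elasticsearch.index.query.QueryBuilders;
    import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
    import org.springframework.data.elasticsearch.core.query.SearchQuery;

    // "title" is the field from the example above; "Space" stands in for the user's input.
    SearchQuery typeahead = new NativeSearchQueryBuilder()
            .withQuery(QueryBuilders.matchPhrasePrefixQuery("title", "Space"))
            .build();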
I have started working on Lucene (v 4.10.2) search-based ranking/scoring.
Consider the following scenario: I search for 'Mark' in my search box. The auto-complete result shows the top 5 people named 'Mark' (although there might be hundreds of Marks in the Lucene index files).
I go to Mark Zuckerberg's profile, which initially appears in 4th place in the search. Say I have clicked his profile a lot of times. Now, in my view, the next time I search for 'Mark', 'Mark Zuckerberg' should come at the top of the list.
Several questions come to mind (and I don't even know whether I'm on the right track):
1) How do I achieve this using the Lucene library (automated or custom scoring)?
2) Can we change the scoring after a search?
3) Does the Lucene library store the scoring in the index files?
4) Can we store the scoring in the index files?
Please let me know whether I'm on the right track.
This is what I would try, regardless of any performance and index maintainability issues for now.
I would add a multivalued string field for the users that have hit the profile document at least once.
Every time a user (say "vipul") hits an auto-completed profile (say "Mark Zuckerberg"), I would add the username to that special multivalued string field in the profile document.
When searching, I would add a term on the special field with the current username as the value, boosting it, so that it comes first in the searches.
Now, some performance. Since updating the full document only to update a single field could be quite expensive, I would try something with SortedSetDocValuesField. I honestly haven't tried anything with this relatively new field yet, but if I understand it well, it was designed for situations like this.
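The search-time half of that idea could look roughly like this (Lucene 4.10-style API; the "clicked_by" field name and the boost value are placeholders for the multivalued field described above):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    String currentUser = "vipul";

    // Require the name match, and add an optional, boosted clause on the field
    // that records which users have clicked this profile.
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("name", "mark")), BooleanClause.Occur.MUST);

    TermQuery clickedByMe = new TermQuery(new Term("clicked_by", currentUser));
    clickedByMe.setBoost(10f); // tune so clicked profiles rise without drowning out text relevance
    query.add(clickedByMe, BooleanClause.Occur.SHOULD);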
How can I boost some particular words in a Lucene index?
For example, I have a list of items:
"lucene in action"
"solr in action"
"solr in action book"
"building search applications"
"building search applications book"
I consider the word "book" unimportant and would like to down-weight it. I would not like to use a filter to remove the word completely from search results, as it might still be useful; some books might have the word "book" in their name (for example, "book of mormon").
Currently, I use
new StandardAnalyzer(version)
and store fields as
new TextField("name", name, Field.Store.YES)
Ideally, I would like to have a dictionary with a list of terms to boost and provide it to Lucene. I know that I can boost at search time if I break the request into terms (like "lucene" AND "book"^0.5), but that is not what I want.
In Apache Lucene, you can configure boosting in three different places: document, field and query. Since you don't want to boost at the query level, boosting at the field level might come in handy in your case: see the setBoost() method of the Field class.
Keep in mind that if you add the boost to a field, you need to do so before adding the document to the index.
You also need to think about what to do when you delete a document from the index, or when your dictionary of words changes (which I'm pretty sure it will).
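A rough index-time sketch of that (Lucene 4.x-era Field.setBoost; the dictionary, the 0.5f value and the indexWriter variable are assumptions, and note the boost applies to the whole "name" field, not just the matching term):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;

    Set<String> downWeightedWords = new HashSet<>(Arrays.asList("book"));

    String name = "solr in action book";
    TextField nameField = new TextField("name", name, Field.Store.YES);

    // Lower the field boost if the name contains any down-weighted dictionary word.
    for (String word : downWeightedWords) {
        if (name.toLowerCase().contains(word)) {
            nameField.setBoost(0.5f);
            break;
        }
    }

    Document doc = new Document();
    doc.add(nameField);
    indexWriter.addDocument(doc); // assumes an open IndexWriter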
I was looking at a price comparison site like this one. So the question is: how does it know that two products from two different sites are the same product, and club the two into the same bucket to show the price comparison?
If it were only books, then I could understand it: all books have a unique ISBN number, so just write some website-specific code that fetches the data from each website and compares it.
e.g. you have two websites:
www.xyz.com
www.pqr.com
Now these two websites list their books differently, i.e. the HTML will be different, so parse the HTML and fetch the ISBN and price from it. Then, for the corresponding ISBN, we can put the two websites' prices side by side. That is simple, but how do you match products that do not have an id which is unique and uniform across websites the way an ISBN is (pressure cookers, watches, etc.)?
Thanks.
Other products also have identification numbers: in Europe it is the EAN, which is currently being turned into a global number called the GTIN. In e-commerce, Amazon IDs (ASIN, of which ISBN is a subset) are often used.
If you don't have these numbers available, which is usually the case, you will need a strategy called Record Linkage or Data Matching.
TL;DR It usually uses a string matching algorithm to find similarly "worded" products (using an inverted index on n-grams, for example). In the end you can use machine learning to remove the wrong matches (false positives). This requires a lot of training data (there are no public datasets available, or only very small ones), and thus most of the time a human will check those matches.
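To make the n-gram idea concrete, here is a tiny, self-contained illustration (not from the book) that compares two product titles via Jaccard similarity over character trigrams; a real record-linkage pipeline would put these trigrams into an inverted index and add blocking and a learned classifier on top:

    import java.util.HashSet;
    import java.util.Set;

    public class TrigramSimilarity {
        // Character trigrams of a normalised, padded string.
        static Set<String> trigrams(String s) {
            String t = "  " + s.toLowerCase().replaceAll("\\s+", " ") + "  ";
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + 3 <= t.length(); i++) {
                grams.add(t.substring(i, i + 3));
            }
            return grams;
        }

        // Jaccard similarity: |intersection| / |union| of the two trigram sets.
        static double jaccard(String a, String b) {
            Set<String> ga = trigrams(a), gb = trigrams(b);
            Set<String> inter = new HashSet<>(ga);
            inter.retainAll(gb);
            Set<String> union = new HashSet<>(ga);
            union.addAll(gb);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            // Same product, worded differently on two sites.
            System.out.println(jaccard(
                    "Prestige 3L Pressure Cooker Aluminium",
                    "Prestige Aluminium Pressure Cooker 3 Litre"));
        }
    }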
For a more detailed analysis of the problem I can only recommend reading the book Data Matching by Peter Christen. It goes deep into information retrieval (how to find similar products) and then into sorting right from wrong matches using machine learning (e.g. via structural analysis).
There are also plenty of papers by him available on the net, so check out his scholar profile.