Is it possible to configure or otherwise alter how Elasticsearch scores its results?
When running a search for "term" using the NativeSearchQueryBuilder, documents that contain one instance of the term are all scored the same. This makes sense. However, one of the documents contains just the term, whereas the others contain the term plus other data. For example:
Doc1: Title : Space
Doc2: Title : Space Time
Doc3: Title : No Space
So when searching for Space, is there any way to make Doc1 score more highly?

Edit:

So, a little more detail following briarheart's response. I think the problem is the way we're implementing typeahead searches. If I run the Space query using our standard search, the ranking is as briarheart outlined, but our typeahead scores everything equally because we are using the wildcard request part and looking for "term*", so "Space" and "Space Lane" both match that equally well.

So really I guess I'm asking the wrong question. Scoring is working as it should; I just need to figure out a better implementation of type-ahead.
(The Suggest Request Part doesn't seem to fit the use case as this would involve picking and resubmitting the desired suggestion).
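For illustration, the two queries involved look roughly like this (a minimal Spring Data Elasticsearch sketch; the field name and API era are assumptions). Wildcard queries are rewritten to a constant-score form by default, which is consistent with every match scoring the same:

import org.elasticsearch.index.query.QueryBuilders;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

// Typeahead-style wildcard query: every match gets the same constant score,
// so "Space" and "Space Lane" tie.
SearchQuery typeahead = new NativeSearchQueryBuilder()
        .withQuery(QueryBuilders.wildcardQuery("title", "space*"))
        .build();

// Standard full-text query: field-length normalization applies, so the
// document whose title is just "Space" comes out on top.
SearchQuery standard = new NativeSearchQueryBuilder()
        .withQuery(QueryBuilders.matchQuery("title", "Space"))
        .build();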
I do not know what exactly your query looks like, but in the case of a full text search for the term "Space", the document "Doc1" from your example will get the highest score because of the length of its "Title" field. Shorter fields carry more weight in terms of relevance.
Related
I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking instead of simply doing a search based on the term, then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.
I wrote the Chunked Element Range Indexes blog post, and found out last-minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques such as wildcard matching were suddenly much faster. That surprised me because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, especially if I tried introducing a wildcard search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do it, as I believe they'd add some overhead. Call cts:search or one of the other cts calls to match a lexicon; I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index, which is held entirely in memory, so it's faster. Turn on all your wildcard indexes on your db if you can.
It should be called from a custom XQuery script in a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as streamlined as possible to handle most type-ahead scenarios correctly (that is, fast enough).
I'd suggest you find ways to whittle down the set of values in the range index to fewer than 100,000, so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if a user has already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned, since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top.

Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train of thought, so don't interrupt them if you're going to suggest search phrases that won't help them get what they want. I've seen that way too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
Hoping that helps!
You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons produce suggestions based on tokenized character data, not entire values of elements (or JSON properties). It might be worth looking into that.
Alternatively, since you are mentioning wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wild-carded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual contents.
HTH!
I have started working on search-based ranking/scoring in Lucene (v 4.10.2).

Consider the following scenario: I am searching for 'Mark' in my search box. The auto-complete result shows the top 5 people named 'Mark' (although there might be hundreds of Marks in the Lucene index files).

I go to Mark Zuckerberg's profile, which initially appears in 4th place in the search results. Say I have clicked his profile a lot of times. Now, as I see it, the next time I search for 'Mark', 'Mark Zuckerberg' should come at the top of the list.

Several questions come to mind (I don't even know whether I'm on the right track or not):
1) How can I achieve this using the Lucene library (automated or custom scoring)?
2) Can we change the scoring after any search?
3) Does the Lucene library store scores in the index files?
4) Can we store scores in the index files?
Please let me know if I'm on the right track or not.
This is what I would try, setting aside any performance and index maintainability issues for now.

I would add a multivalued string field for users that have hit the profile document at least once. Every time a user (say "vipul") hits an auto-completed profile (say "Mark Zuckerberg"), I would add the username to the special multivalued string field in the profile document.

When searching, I would add a term on the special field with the current username as the value, boosting it, so it comes first in the searches.
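A minimal sketch of that idea against Lucene 4.10 (field names, analyzer assumptions, and the boost factor are all illustrative, not a definitive implementation):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Indexing: add one StringField per user who has clicked the profile.
Document profile = new Document();
profile.add(new TextField("name", "Mark Zuckerberg", Field.Store.YES));
profile.add(new StringField("clickedBy", "vipul", Field.Store.NO)); // repeatable

// Searching as user "vipul": the SHOULD clause only boosts, never filters.
BooleanQuery query = new BooleanQuery();
// lowercase term because TextField content goes through the analyzer at index time
query.add(new TermQuery(new Term("name", "mark")), BooleanClause.Occur.MUST);
TermQuery clicked = new TermQuery(new Term("clickedBy", "vipul"));
clicked.setBoost(10f); // tunable; large enough to float clicked profiles upward
query.add(clicked, BooleanClause.Occur.SHOULD);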
Now, some performance. Since updating the full document only to update a single field could be quite expensive, I would try something with the SortedSetDocValuesField. I honestly haven't tried anything yet with this relatively new field, but if I understand it correctly, it was designed for situations like this.
I've stored a Lucene document with a single TextField that contains words without stems.

I need to implement a search program that allows users to search for words and exact words, but if I've stored words without stemming, a stemmed search cannot be done.

Is there a method to search for both exact words and/or stemmed words in documents without storing two fields?

Thanks in advance.
Indexing two separate fields seems like the right approach to me.
Stemmed and unstemmed text require different analysis strategies, and so require you to provide a different Analyzer to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers. That is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (heavier scoring on terms that are not touched by the stemmer, particularly).
There is no need to store the text in each of these fields, but it only makes sense to index them in separate fields.
You can apply a different analyzer to different fields by using a PerFieldAnalyzerWrapper, by the way. Like:
Map<String, Analyzer> analyzerList = new HashMap<String, Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));    // stemming analysis
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44)); // no stemming
// Fields not listed in the map fall back to the default analyzer (first argument).
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);
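The wrapper is itself just an Analyzer, so it can be handed to both the index writer and the query parser, keeping index-time and query-time analysis in agreement (a continuation of the snippet above; the field name is illustrative):

// Index-time analysis:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
// Query-time analysis (parse against whichever field you are searching):
QueryParser parser = new QueryParser(Version.LUCENE_44, "stemmedText", analyzer);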
I can see a couple of possibilities to accomplish it though, if you really want to.
One would be to create your own stem filter, based on (or possibly extending) the one you wish to use already, and add in the ability to keep the original tokens after stemming. Mind your position increments, in this case. Phrase queries and the like may be problematic.
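For what it's worth, a rough sketch of that first idea using Lucene's stock KeywordRepeatFilter (which emits each token twice, one copy flagged as a keyword so the stemmer skips it) followed by RemoveDuplicatesTokenFilter; the class name and analysis chain are illustrative:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Indexes both the original and the stemmed form of every token in the same
// field, at the same position, so phrase queries keep working.
class StemAndExactAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_44, source);
        stream = new KeywordRepeatFilter(stream);          // duplicate each token; one copy is keyword-flagged
        stream = new PorterStemFilter(stream);             // stems only the non-keyword copy
        stream = new RemoveDuplicatesTokenFilter(stream);  // drop copies the stemmer left unchanged
        return new TokenStreamComponents(source, stream);
    }
}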
The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after manually stemming. Two fields added with the same name are effectively concatenated. You'd want to store the text in a separate field, in this case. Expect wonky scoring.
Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment, and section (e.g. Car, Salary, Computer, Food, ...).

The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for later selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.

So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting section?
My first thought was to do it with a neural network and learn from the input when the user selects another section.

But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.

Could anyone give me a hint on how to start, or any other possible solution for this?

Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.

First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.

There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then) and what any development effort will bring you. One hour of work could bring you an 80% satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// Normalise once so "Fuel", "FUEL" and "fuel" all match the keyword check.
String myString = paramInput.toUpperCase();
if (myString.contains("FUEL")) {
    // do the fuel functionality
}
In a simple app, if you only have some specific sections, you can take the string from the comment as above, check whether it contains certain keywords, and set the section value accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
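A minimal sketch of that idea under assumed names and a Lucene 4.x-era API (one document per section, the user's comment as the query; error handling omitted):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44)));

// One document per section: its name plus keywords likely to appear in comments.
Document car = new Document();
car.add(new TextField("section", "Car", Field.Store.YES));
car.add(new TextField("keywords", "fuel petrol service repair tires", Field.Store.NO));
writer.addDocument(car);
writer.close();

// "Best guess" = the top-scoring section for the user's comment.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
QueryParser parser = new QueryParser(Version.LUCENE_44, "keywords",
        new StandardAnalyzer(Version.LUCENE_44));
ScoreDoc[] hits = searcher.search(parser.parse("Car service"), 1).scoreDocs;
String bestGuess = hits.length > 0 ? searcher.doc(hits[0].doc).get("section") : null;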
P.S. Neural network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement string matching with one, I wouldn't even know where to start.
It seems to me that the following will do:

hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)

The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit?
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'to which of the provided pre-trained classes does the input belong?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (sections) that spans every possible input that your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the sections beforehand, an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...).

One option is then to find a list of each section that the typed keywords belong to, suggest multiple sections, and let the user pick one. The feedback you get from user selections would enable improvements to the suggestions in the future. Another would be to calculate a Bayesian probability, the probability that a word belongs to section X given the previously stored mappings, for all keywords and sections, and either take the modal section or normalise over each unique keyword and take the mean; see the sketch below. Calculations of probabilities will need to be updated as you gather more information, of course; perhaps this could be done with every new addition in a background thread.
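A rough sketch of that counting approach (class, method, and field names here are all hypothetical, and persistence to your database is omitted):

import java.util.HashMap;
import java.util.Map;

class SectionSuggester {
    // Hypothetical in-memory store: keyword -> (section -> how often mapped).
    private final Map<String, Map<String, Integer>> counts =
            new HashMap<String, Map<String, Integer>>();

    // Call this whenever a user files a comment keyword under a section.
    void record(String keyword, String section) {
        Map<String, Integer> bySection = counts.get(keyword);
        if (bySection == null) {
            bySection = new HashMap<String, Integer>();
            counts.put(keyword, bySection);
        }
        Integer n = bySection.get(section);
        bySection.put(section, n == null ? 1 : n + 1);
    }

    // Estimate P(section | keyword) from the mappings stored so far.
    double probability(String keyword, String section) {
        Map<String, Integer> bySection = counts.get(keyword);
        if (bySection == null) return 0.0;
        int total = 0;
        for (int n : bySection.values()) total += n;
        Integer n = bySection.get(section);
        return n == null ? 0.0 : n / (double) total;
    }
}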
I have a Lucene index of around 22,000 Lucene documents, but I have been facing a peculiar problem with it while creating a search program.

Each document has Title, description and long_description fields; these fields contain data related to different diseases and their symptoms. Now when I search for a phrase like "infection of the small intestine", I expect "Cholera" to be the first result (by the way, I am using MultiFieldQueryParser with StandardAnalyzer).

The reason I expect Cholera to be the first one is that it has the exact phrase "infection of the small intestine" in the long_description field. But instead of this result coming out on top, it comes way at the bottom, because there are plenty of other documents which mention the term "infection" in the title field (which is substantially shorter than the description field).

So just because "Cholera" does not have the most pertinent information in the "title" field, it comes way at the bottom. I saw the following thread where the use of "~3" is suggested, but is that what I should do for all my queries behind the scenes? Isn't there a better way of doing it?
Searching phrases in Lucene
Make your query boost the hits in title high, description medium and long_description low, like this:
title:intestine^100 description:intestine^10 long_description:intestine^1
This example weights title matches 100 times, description matches 10 times, and long_description matches 1 time; the boost multiplies each field's contribution to the score, and higher-scoring documents sort first. You can pick any numbers you like for the boost values.
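Since the question already uses MultiFieldQueryParser, one hedged way to get those boosts without hand-writing the query string is the parser constructor that takes a boost map (the version constant matches the earlier snippets and is an assumption):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Per-field boosts applied to every term the parser produces.
Map<String, Float> boosts = new HashMap<String, Float>();
boosts.put("title", 100f);
boosts.put("description", 10f);
boosts.put("long_description", 1f);

MultiFieldQueryParser parser = new MultiFieldQueryParser(
        Version.LUCENE_44,
        new String[] { "title", "description", "long_description" },
        new StandardAnalyzer(Version.LUCENE_44),
        boosts);

// Quoting keeps the input a phrase query; the boosts still apply per field.
Query q = parser.parse("\"infection of the small intestine\"");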
You can override computeNorm in DefaultSimilarity to change how field length affects the score.
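For illustration, a minimal sketch of that idea; note that in Lucene 4.x the convenient override point on DefaultSimilarity is lengthNorm, which feeds the norm that computeNorm encodes (treat this as a sketch, not the one true approach):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// A Similarity that ignores field length, so short titles stop dominating.
public class NoLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
        return state.getBoost(); // keep the field boost, drop the 1/sqrt(length) factor
    }
}

// Norms are baked in at index time, so set the similarity on the
// IndexWriterConfig as well as on the IndexSearcher:
//   writerConfig.setSimilarity(new NoLengthNormSimilarity());
//   searcher.setSimilarity(new NoLengthNormSimilarity());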
Please check http://www.supermind.org/blog/378/lucene-scoring-for-dummies and http://blog.architexa.com/2010/12/custom-lucene-scoring/