Automated vs Custom Lucene Scoring

I have started working with search-based ranking/scoring in Lucene (v4.10.2).
Consider the following scenario: I type 'Mark' in my search box. The auto-complete result shows the top 5 people named 'Mark' (although there might be hundreds of Marks in the Lucene index files).
I go to Mark Zuckerberg's profile, which is initially in 4th place in the results. Say I have clicked his profile many times. In my view, the next time I search for 'Mark', 'Mark Zuckerberg' should come at the top of the list.
Several questions come to mind (I don't even know whether I'm on the right track):
1) How can this be achieved using the Lucene library? (Automated or custom scoring)
2) Can we change the scoring after any search?
3) Does the Lucene library store the scoring in the index files?
4) Can we store the scoring in the index files?
Please let me know if I'm on the right track or not.

This is what I would try, regardless of any performance and index
maintainability issues for now.
I would add a multivalued string field for users that have at least once hit the
profile document.
Every time a user (say "vipul") hits an auto-completed profile (say
"Mark Zuckerberg") I would add the username to the special multivalued string
field in the profile document.
When searching I would add a term in the special field with the current username
as the value, boosting it, so it comes first in the searches.
Now, some performance. Since updating the full document only to update a single
field could be quite expensive, I would try something with the
SortedSetDocValuesField. I honestly haven't tried anything yet with this
relatively new field. But if I understand well, it was designed for
situations like this.
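
The ranking effect described above can be sketched without Lucene at all. This is a minimal in-memory simulation, not Lucene code: the Profile class, the clickedBy set (standing in for the multivalued field), the sample names other than "Mark Zuckerberg", and the CLICK_BOOST value are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory stand-in for the approach above; no Lucene classes are used.
final class ClickBoostSketch {

    // Hypothetical profile document: a name, a base relevance score, and the
    // set of usernames that have clicked it (the "multivalued field").
    static final class Profile {
        final String name;
        final double baseScore;
        final Set<String> clickedBy = new HashSet<>();

        Profile(String name, double baseScore) {
            this.name = name;
            this.baseScore = baseScore;
        }
    }

    // Assumed boost value; in practice it would need tuning.
    static final double CLICK_BOOST = 10.0;

    // Record that a user opened a profile from the auto-complete list.
    static void recordClick(Profile profile, String username) {
        profile.clickedBy.add(username);
    }

    // Rank matches for one user: base score, plus the boost if that user
    // has clicked the profile before. Returns names, best match first.
    static List<String> rank(List<Profile> matches, String username) {
        List<Profile> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingDouble(
                (Profile p) -> p.baseScore
                        + (p.clickedBy.contains(username) ? CLICK_BOOST : 0.0))
                .reversed());
        List<String> names = new ArrayList<>();
        for (Profile p : sorted) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Profile smith = new Profile("Mark Smith", 1.4);
        Profile jones = new Profile("Mark Jones", 1.3);
        Profile zuckerberg = new Profile("Mark Zuckerberg", 1.1);
        List<Profile> matches = List.of(smith, jones, zuckerberg);

        System.out.println(rank(matches, "vipul"));       // Zuckerberg is last before any click
        recordClick(zuckerberg, "vipul");
        System.out.println(rank(matches, "vipul"));       // Zuckerberg is first for vipul
        System.out.println(rank(matches, "someoneElse")); // other users keep the original order
    }
}
```

In Lucene itself, the same effect would come from an optional boosted term clause on the special field, so the boost only reorders documents that already match the name query.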

Related

Configure Elasticsearch Result Scoring

Is it possible to configure or otherwise alter how Elasticsearch scores its results?
When running a search for "term" using the NativeSearchQueryBuilder, documents that contain one instance of the term are all scored the same. This makes sense. However, one of the documents contains just the term, whereas the others contain the term and other data. For example:
Doc1: Title : Space
Doc2: Title : Space Time
Doc3: Title : No Space
So when searching for Space, is there any way to make Doc1 score more highly?
-Edit
So, a little more detail following briarheart's response. I think the problem is the way we're implementing typeahead searches. If I run the Space query using our standard search the ranking is as outlined by briarheart, but our typeahead scores everything equally because we are using the wildcard request part and looking for "term*" so "Space" and "Space Lane" do both match that equally well.
So really I guess I'm asking the wrong question. Scoring is working as it should, I just need to figure out a better implementation of type ahead.
(The Suggest Request Part doesn't seem to fit the use case as this would involve picking and resubmitting the desired suggestion).

I do not know exactly what your query looks like, but with a full-text search for the term "Space", the document "Doc1" from your example will get the highest score because of the length of its "Title" field. Shorter fields carry more weight in terms of relevance.
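
As a rough illustration of why shorter fields score higher: Lucene's classic TF-IDF similarity scales a match by a length norm of 1/sqrt(number of terms in the field). The sketch below computes only that factor. The whitespace tokenization is an assumption (real analyzers tokenize differently), Lucene stores the norm with reduced precision, and BM25-based versions normalize length with a different formula.

```java
import java.util.Locale;

// Computes only the classic length-norm factor; it is not a full scorer.
final class LengthNormSketch {

    // 1 / sqrt(number of terms), using whitespace tokenization (an assumption).
    static double lengthNorm(String fieldValue) {
        String trimmed = fieldValue.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // assumed convention for an empty field in this sketch
        }
        int numTerms = trimmed.split("\\s+").length;
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        for (String title : new String[] {"Space", "Space Time", "No Space"}) {
            System.out.println(
                    String.format(Locale.ROOT, "%s -> %.3f", title, lengthNorm(title)));
        }
        // Space -> 1.000
        // Space Time -> 0.707
        // No Space -> 0.707
    }
}
```

With one matching term in each title, the single-word "Space" keeps the full weight while the two-word titles are scaled down, which matches the ordering described above.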

Elasticsearch: Learning from clicks (Search result ranking)

I have read the chapter "Learning from clicks" in the book Programming Collective Intelligence and liked the idea: the search engine there learns which results the user clicked and uses this information to improve the ranking of results.
I think it would improve the quality of the search ranking a lot in my Java/Elasticsearch application if I could learn from the user clicks.
In the book, they build a multilayer perceptron (MLP) network to use the learned information even for new search phrases. They use Python with a SQL database to calculate the search ranking.
Has anybody implemented something like this already with Elasticsearch or knows an example project?
It would be great, if I could manage the clicking information directly in Elasticsearch without needing an extra SQL database.

In the field of Information Retrieval (the general academic field of search and recommendations) this is more generally known as Learning to Rank. Whether it's clicks, conversions, or other forms of sussing out what's a "good" or "bad" result for a keyword search, learning to rank uses either a classifier or regression process to learn what features of the query and document correlate with relevance.
Clicks?
For clicks specifically, there are reasons to be skeptical that optimizing for clicks is ideal. There's a paper from Microsoft Research I'm trying to dig up that claims that in their case, clicks are only 45% correlated with relevance. Click+dwell is often a more useful general-purpose indicator of relevance.
There's also the risk of self-reinforcing bias in search, as I talk about in this blog article. There's a chance that if you're already showing a user mediocre results, and they keep clicking on those mediocre results, you'll end up reinforcing search to keep showing users mediocre results.
Beyond clicks, there are often domain-specific considerations for what you should measure. For example, classically in e-commerce, conversions matter. Perhaps a search result click that led to a purchase should count more. Netflix famously tries to suss out what it means when you watch a movie for 5 minutes and go back to the menu vs 30 minutes and exit. Some search use cases are informational: clicking may mean something different when you're researching and clicking many search results vs when you're shopping for a single item.
So sorry to say it's not a silver bullet. I've heard of many successful and unsuccessful attempts at doing Learning to Rank, and it mostly boils down to how successful you are at measuring what your users consider relevant. The difficulty of this problem surprises a lot of people.
For Elasticsearch...
For Elasticsearch specifically, there's this plugin (disclaimer: I'm the author), which is documented here. Once you've figured out how to "grade" a document for a specific query (whether it's clicks or something more), you can train a model that can then be fed into Elasticsearch via this plugin for your ranking.

What you would need to do is store information about the clicks in a field inside the Elasticsearch index. Every click would result in an update of a document. Since an update action is actually a delete followed by an insert (see the Update API), you need to make sure your document text is stored, not only indexed. You can then use a Function Score Query to build a score function reflecting the value stored in the index.
Alternatively, you could store the information in a separate database and use a script function inside the score function to access the database. I wouldn't suggest this solution due to performance issues.
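
For the first option (a click count stored with each document), the effect of a score function can be illustrated in plain Java. This sketch does not call Elasticsearch. The formula textScore * (1 + log10(1 + clicks)) is an assumption, chosen so that zero clicks leaves the text score unchanged and further clicks give diminishing returns; a Function Score Query would apply a comparable function on the server side.

```java
// Illustrative score function only; the formula is an assumption,
// not Elasticsearch's built-in behaviour.
final class ClickScoreSketch {

    // Combine the text relevance score with a stored click count.
    static double combinedScore(double textScore, long clicks) {
        if (clicks < 0) {
            throw new IllegalArgumentException("clicks must be >= 0");
        }
        // log10 dampens the effect so heavily clicked documents do not dominate.
        return textScore * (1.0 + Math.log10(1.0 + clicks));
    }

    public static void main(String[] args) {
        System.out.println(combinedScore(2.0, 0)); // 2.0, no clicks leaves the score unchanged
        System.out.println(combinedScore(2.0, 9)); // 4.0, since log10(10) = 1
    }
}
```

The dampening matters because of the self-reinforcing effect of clicks: a linear multiplier would let an early lead in clicks grow without bound.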

I get the point of your question. You want to build a learning-to-rank model within the Elasticsearch framework. The relevance of each doc to the query is computed online. You want to combine the query and the doc to compute the score, so a custom function to compute _score is needed. I am new to Elasticsearch, and I'm looking for a way to solve the problem.
Lucene is a more general search engine that lets you define your own scorer to compute relevance, and I have developed several applications on it before.
This article gives a basic understanding of customizing the scorer. However, for Elasticsearch, I haven't found related articles. You are welcome to discuss your progress on Elasticsearch with me.

How do I achieve the task of distributing my index table over 3 systems?

I want to achieve something like this
Given a document, say a txt file with an id, I need to process it, stem the words, and generate an index table from it. But this index table is distributed over 3 systems, probably on the basis that words beginning with letters from [a-h] are indexed on the 1st system, the next third on the 2nd, and the last third on the 3rd system. But I have no idea what technology I should use to achieve this. The index table data structure ought to be in RAM so that search queries can be answered quickly (supposing we are able to index it in this way and have a user searching for a word or sentence from a different system). Can this purpose be fulfilled by using Java sockets?
Actually, we (a group of 5) are trying to make a small but distributed search engine. Supposing the crawling has been done and the page (the document I was talking about) is saved somewhere, and I extract it and do the processing, stemming, etc., I would like to finally build a distributed index data structure based on the scheme mentioned above. Would it be possible? I just want to know what technology to use to achieve something like this, such as modifying a data structure inside a program running on another machine (but in the same network).
Secondly, we don't actually know whether this approach is feasible. If it is not, I would be keen to know the correct way to look at a distributed index table.

Have the index information saved as you crawl the documents. Have a head node which presents the search user interface. The head node then distributes the search to the index nodes, and collects the results to present to the user.
There are a number of available frameworks, such as MapReduce, which will help you solve this problem.
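
The head-node design can be sketched in a single process before any networking is added. In the sketch below, each IndexNode stands in for one machine holding part of the inverted index in RAM, and HeadNode routes each term by its first letter using the split proposed in the question (a-h, i-q, and everything else; the boundaries after 'h' are an assumption). The class names are hypothetical, stemming is omitted, and in a real deployment the calls on IndexNode would become network requests, for example over sockets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DistributedIndexSketch {

    // One index node: an in-memory inverted index from term to document ids.
    static final class IndexNode {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        void add(String term, int docId) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }

        Set<Integer> lookup(String term) {
            return postings.getOrDefault(term, Collections.emptySet());
        }
    }

    // Head node: routes terms to the index nodes and merges their answers.
    static final class HeadNode {
        private final List<IndexNode> nodes = new ArrayList<>();

        HeadNode() {
            for (int i = 0; i < 3; i++) {
                nodes.add(new IndexNode());
            }
        }

        // a-h -> node 0, i-q -> node 1, everything else -> node 2.
        static int nodeFor(String term) {
            char first = Character.toLowerCase(term.charAt(0));
            if (first >= 'a' && first <= 'h') return 0;
            if (first >= 'i' && first <= 'q') return 1;
            return 2;
        }

        // Split a document into lowercase terms and send each to its node.
        void index(int docId, String text) {
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    nodes.get(nodeFor(term)).add(term, docId);
                }
            }
        }

        // Ids of documents containing every query term (AND semantics).
        Set<Integer> search(String query) {
            Set<Integer> result = null;
            for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty()) continue;
                Set<Integer> hits = nodes.get(nodeFor(term)).lookup(term);
                if (result == null) {
                    result = new TreeSet<>(hits);
                } else {
                    result.retainAll(hits);
                }
            }
            return result == null ? new TreeSet<>() : result;
        }
    }

    public static void main(String[] args) {
        HeadNode head = new HeadNode();
        head.index(1, "apple pie recipe");
        head.index(2, "zebra apple");
        System.out.println(head.search("apple"));       // [1, 2]
        System.out.println(head.search("apple zebra")); // [2]
    }
}
```

One consequence of partitioning by term is that a multi-word query may touch several nodes, and the head node has to intersect the posting lists itself, as search does here.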

Storing data in Lucene or database

I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed, some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the counts of hits for manually entered keywords and TermEnums in automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much / all of this using hashmaps in Java to work out counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed, is it wise to have Lucene create a temporary index in memory, and pass the words and hit counts out so other programs can write them to database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time? So that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries which I'm not sure of yet.
Hope this makes sense.
Mr Morgan.

From the description you have given in the question, I think Lucene alone will be sufficient. (No need for MySQL or Solr.) The Lucene API is also easy to use, and you won't need to change your frontend code.
From every RSS feed, you can create a Document having three fields, namely title, description and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.
How frequently do you want the system to automatically generate the popular terms? For example, do you want to show the users the "most popular terms last week", etc.? If so, you can use the NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can find the document frequency of each term in the retrieved documents to find the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
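
The counting step can be sketched independently of Lucene. The plain-Java example below filters feed items by a date range, drops stop words, and counts in how many items each term appears (its document frequency). The FeedItem class, the tiny stop-word list, and the simple tokenization are assumptions for illustration; with Lucene, the date filter and the term statistics would come from the index instead.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class PopularTermsSketch {

    // Hypothetical feed item with the three fields suggested above.
    static final class FeedItem {
        final String title;
        final String description;
        final LocalDate date;

        FeedItem(String title, String description, LocalDate date) {
            this.title = title;
            this.description = description;
            this.date = date;
        }
    }

    // Deliberately tiny stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    // Top n terms by document frequency among items dated within [from, to].
    static List<String> topTerms(List<FeedItem> items, LocalDate from, LocalDate to, int n) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (FeedItem item : items) {
            if (item.date.isBefore(from) || item.date.isAfter(to)) {
                continue; // outside the requested date range
            }
            // Collect each term once per item so we count documents, not occurrences.
            Set<String> unique = new HashSet<>();
            String text = (item.title + " " + item.description).toLowerCase(Locale.ROOT);
            for (String token : text.split("\\W+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    unique.add(token);
                }
            }
            for (String term : unique) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        // Highest count first; ties broken alphabetically for a deterministic result.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<FeedItem> items = Arrays.asList(
                new FeedItem("Lucene scoring tips", "Scoring in the index", LocalDate.of(2024, 1, 2)),
                new FeedItem("Index the feeds", "Lucene and Solr", LocalDate.of(2024, 1, 3)),
                new FeedItem("Old news", "Lucene", LocalDate.of(2023, 1, 1)));
        System.out.println(
                topTerms(items, LocalDate.of(2024, 1, 1), LocalDate.of(2024, 1, 7), 2));
        // [index, lucene]
    }
}
```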

I can recommend you check out Apache Solr. In a nutshell, Solr is a web enabled front end to Lucene that simplifies integration and also provides value added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word counting feature you are asking about, Solr has a concept of "faceting" which will exactly fit the problem you are describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/

Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, which are far easier than Lucene.
Regards

Lucene search results sort by custom order list (unique to each user)

I have authenticated users in my application who have access to a shared database of up to 500,000 items. Each of the users has their own public facing web site and needs the ability to prioritize the items on display (think upvote) on their own site.
Out of the 500,000 items they may only have up to 200 prioritized items; the order of the rest of the items is of less importance.
Each of the users will prioritize the items differently.
I initially asked a similar MySQL question here, "Mysql results sorted by list which is unique for each user", and got a good answer, but I believe a better option may be a non-SQL indexed solution.
Can this be done in Lucene? Is there another search technology which would be better for this?
PS: Google implements a similar setup with their search results, where you can prioritize and exclude your own search results if you are logged in.
Update: re-tagged with sphinx, as I have been reading the documentation and I believe it may be able to do what I am looking for with "per-document attribute values" stored in memory. Interested to hear any feedback on this from Sphinx gurus.

You'll definitely want to store the id of the item in each document object when building your index. There are a few ways to do the next step, but an easy one would be to take the prioritized items and add them to your search query, something like this for each special item:
"OR item_id:%d^X"
where X is the amount of boost you'd like to use. You'll probably need to empirically tweak this number to make sure that just being "upvoted" doesn't put it to the top of a list searching for something totally unrelated.
Doing it this way will at least save you a lot of annoying postprocessing steps that would require you to iterate over the whole result set. Hopefully the proper sorting will be there right from querying the index.
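
A minimal sketch of building such a query string, assuming Lucene's classic query parser syntax (field:value^boost) and a hypothetical item_id field. The helper name withPriorityBoosts and the boost value are illustrative, not part of any library.

```java
import java.util.List;

// Builds a query string only; parsing and searching are left to Lucene.
final class PriorityQuerySketch {

    // Append one boosted item_id clause per prioritized item to the user's query.
    static String withPriorityBoosts(String userQuery, List<Integer> prioritizedIds, float boost) {
        StringBuilder sb = new StringBuilder("(").append(userQuery).append(")");
        for (int id : prioritizedIds) {
            sb.append(" OR item_id:").append(id).append('^').append(boost);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withPriorityBoosts("title:lamp", List.of(17, 42), 2.5f));
        // (title:lamp) OR item_id:17^2.5 OR item_id:42^2.5
    }
}
```

Because the clauses are joined with OR, a prioritized item matches even when the user's own query does not, and the boost only controls how high it ranks. That is why the boost needs the empirical tuning mentioned above.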
