We are developing an app in Java using the Elasticsearch Java API. We have indexed metadata and want to apply ranking/scoring at indexing time or search time.
I also don't know whether it is possible to rank/score a result that users choose/approve by clicking on it, i.e. to mark that result as popular and increase its popularity.
How can we implement this? Thanks for your suggestions.
Elasticsearch allows us to change/modify its scoring via the _score.
I assume your requirement is to maintain a custom ranking in your documents rather than rely on the Elasticsearch scoring.
If so, you need to design your documents accordingly. Add a field named userRank to all documents and increment its value whenever a user clicks that document in the results. Using function_score you can then add the userRank field value to the calculated _score.
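A minimal sketch of such a query, assuming a numeric `userRank` field and an index named `profiles` (both names are examples, not from the question); `boost_mode: sum` adds the field value to the computed `_score`:

```json
POST /profiles/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "mark" } },
      "field_value_factor": {
        "field": "userRank",
        "factor": 1.0,
        "missing": 0
      },
      "boost_mode": "sum"
    }
  }
}
```

The `missing: 0` keeps documents that have never been clicked from failing the function.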
There's a large and complex field called learning to rank that studies how to turn quality information about documents and queries into relevance ranking rules.
For Elasticsearch specifically, there is this plugin that could help (disclaimer: I'm the creator).
Related
I have started working on search-based ranking/scoring in Lucene (v4.10.2).
Consider the following scenario: I type 'Mark' in my search box. The auto-complete results show the top 5 people named 'Mark' (although there might be hundreds of Marks in the Lucene index files).
I go to Mark Zuckerberg's profile, which initially appears in 4th place in the search results. Say I have clicked his profile a lot of times. Now, in my view, the next time I search 'Mark', 'Mark Zuckerberg' should come at the top of the list.
Several questions come to mind (I don't even know whether I'm on the right track):
1) How can I achieve this using the Lucene library (automated or custom scoring)?
2) Can we change the scoring after a search?
3) Does the Lucene library store scores in the index files?
4) Can we store scores in the index files?
Please let me know if I'm on the right track or not.
This is what I would try, disregarding any performance and index-maintainability issues for now.
I would add a multivalued string field for users who have hit the profile document at least once.
Every time a user (say "vipul") hits an auto-completed profile (say "Mark Zuckerberg"), I would add the username to this special multivalued string field in the profile document.
When searching, I would add a term on the special field with the current username as the value, boosting it so that previously clicked profiles come first in the results.
Now, some performance. Since updating the full document only to update a single field can be quite expensive, I would try something with SortedSetDocValuesField. I honestly haven't tried anything with this relatively new field yet, but if I understand it correctly, it was designed for situations like this.
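A rough sketch of the query side in plain query-string form, assuming classic Lucene query-parser syntax and made-up field names (`name`, `clickedBy`); the boost factor 5 is an arbitrary placeholder:

```java
public class ClickBoostQuery {
    // "+name:<term>" is required; "clickedBy:<user>^5" is optional but
    // boosts profiles this user has clicked before, so they sort first.
    static String buildQuery(String term, String username) {
        return "+name:" + term + " clickedBy:" + username + "^5";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("mark", "vipul"));
        // prints +name:mark clickedBy:vipul^5
    }
}
```

The resulting string would be handed to Lucene's QueryParser; documents not clicked by this user still match, they just score lower.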
I have read the chapter "Learning from clicks" in the book Programming Collective Intelligence and liked the idea: the search engine there learns which results users click on and uses this information to improve the ranking of results.
I think it would improve the quality of the search ranking a lot in my Java/Elasticsearch application if I could learn from the user clicks.
In the book, they build a multilayer perceptron (MLP) network to use the learned information even for new search phrases. They use Python with a SQL database to calculate the search ranking.
Has anybody implemented something like this already with Elasticsearch or knows an example project?
It would be great, if I could manage the clicking information directly in Elasticsearch without needing an extra SQL database.
In the field of Information Retrieval (the general academic field of search and recommendations), this is more generally known as Learning to Rank. Whether it's clicks, conversions, or other ways of sussing out what's a "good" or "bad" result for a keyword search, learning to rank uses either a classification or regression process to learn what features of the query and document correlate with relevance.
Clicks?
For clicks specifically, there are reasons to be skeptical that optimizing for clicks is ideal. There's a paper from Microsoft Research (that I'm still trying to dig up) which claims that, in their case, clicks were only 45% correlated with relevance. Click plus dwell time is often a more useful general-purpose indicator of relevance.
There's also the risk of self-reinforcing bias in search, as I talk about in this blog article. There's a chance that if you're already showing a user mediocre results, and they keep clicking on those mediocre results, you'll end up reinforcing search to keep showing users mediocre results.
Beyond clicks, there are often domain-specific considerations for what you should measure. For example, classically in e-commerce, conversions matter: perhaps a search result click that led to a purchase should count more. Netflix famously tries to suss out what it means when you watch a movie for 5 minutes and go back to the menu vs. 30 minutes and exit. Some search use cases are informational: clicking may mean something different when you're researching and clicking many search results vs. when you're shopping for a single item.
So, sorry to say, it's not a silver bullet. I've heard of many successful and unsuccessful attempts at doing Learning to Rank, and it mostly boils down to how successful you are at measuring what your users consider relevant. The difficulty of this problem surprises a lot of people.
For Elasticsearch...
For Elasticsearch specifically, there's this plugin (disclaimer: I'm the author), which is documented here. Once you've figured out how to "grade" a document for a specific query (whether it's clicks or something more), you can train a model that can then be fed into Elasticsearch via this plugin for your ranking.
What you would need to do is store information about the clicks in a field inside the Elasticsearch index. Every click would then result in an update of a document. Since an update is actually a delete-and-reindex operation (see the Update API), you need to make sure your document source is stored, not only indexed. You can then use a Function Score Query to build a score function that reflects the value stored in the index.
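A sketch of both halves, with placeholder index, id, and field names, and endpoint paths that vary slightly by Elasticsearch version: first incrementing a `clicks` counter via the Update API, then folding it into the score with a script inside a function_score query.

```json
POST /myindex/_update/doc-42
{
  "script": { "source": "ctx._source.clicks += 1" }
}

POST /myindex/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "text": "mark" } },
      "script_score": {
        "script": { "source": "_score + doc['clicks'].value" }
      }
    }
  }
}
```

The scripted update avoids fetching and resending the whole document from the client, though internally Elasticsearch still reindexes it.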
Alternatively, you could store the information in a separate database and use a script function inside the score function to access the database. I wouldn't suggest this solution due to performance issues.
I get the point of your question: you want to build a learning-to-rank model within the Elasticsearch framework, where the relevance of each document to the query is computed online, and query and document are combined to compute the score, so a custom function to compute _score is needed. I am new to Elasticsearch and am still looking for a way to solve this myself.
Lucene is a lower-level search library that lets you define your own scorer to compute relevance, and I have developed several applications on it before.
This article describes the basics of customizing the scorer. However, for Elasticsearch I haven't found related articles. Feel free to discuss your progress on Elasticsearch with me.
I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed; some will have keywords entered manually by users, whereas in other cases popular terms will be generated automatically by the system. So could I have Lucene use query strings to return hit counts for manually entered keywords, and use TermEnum in the automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much or all of this using hashmaps in Java to work out the counts, but if I use Lucene, my question concerns the best way to store the words for counting. Taking a single RSS feed, is it wise to have Lucene create a temporary index in memory and pass the words and hit counts out so that other programs can write them to a database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time, so that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries, which I'm not sure is possible yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your frontend code.
From every RSS feed, you can create a Document with three fields, namely title, description, and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.
How frequently do you want the system to generate the popular terms automatically? For example, do you want to show users the "most popular terms last week", etc.? If so, you can use a NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can find the document frequency of each term in the retrieved documents to identify the most popular terms. (Do not forget to remove stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be stopwords.)
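Once you have the documents for a date range, the "most popular terms" step is just a document-frequency count. A plain-Java sketch of that counting, with a toy stopword list and naive tokenization standing in for a real analyzer:

```java
import java.util.*;
import java.util.stream.*;

public class PopularTerms {
    private static final Set<String> STOPWORDS =
            Set.of("the", "a", "an", "of", "and", "in", "to", "is");

    // Count in how many documents each non-stopword term appears.
    static Map<String, Long> documentFrequencies(List<String> docs) {
        return docs.stream()
                .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+"))
                        .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                        .distinct())  // count each term once per document
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "The price of oil rises",
                "Oil companies report profits",
                "Profits and the price of gas");
        System.out.println(documentFrequencies(docs));
    }
}
```

Sorting the resulting map entries by value descending then gives the top-N popular terms for the range.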
I can recommend you check out Apache Solr. In a nutshell, Solr is a web-enabled front end to Lucene that simplifies integration and also provides value-added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word-counting feature you are asking about, Solr has a concept of "faceting" which exactly fits the problem you are describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
Solr is definitely the way to go, although I would caution against using it with Apache Tomcat on Windows, as the install process is a bloody nightmare. More than happy to guide you through it if you like, as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, which are far easier than Lucene.
Regards
I have authenticated users in my application who have access to a shared database of up to 500,000 items. Each of the users has their own public facing web site and needs the ability to prioritize the items on display (think upvote) on their own site.
Out of the 500,000 items, they may have only up to 200 prioritized items; the order of the remaining items is of less importance.
Each of the users will prioritize the items differently.
I initially asked a similar MySQL question here (Mysql results sorted by list which is unique for each user) and got a good answer, but I believe a better option may be a non-SQL indexed solution.
Can this be done in Lucene? Is there another search technology that would be better for this?
P.S. Google implements a similar type of setup with their search results, where you can prioritize and exclude your own search results if you are logged in.
Update: re-tagged with sphinx, as I have been reading the documentation and believe it may be able to do what I am looking for with "per-document attribute values" stored in memory. Interested to hear any feedback on this from Sphinx gurus.
You'll definitely want to store the id of each item in the document object when building your index. There are a few ways to do the next step, but an easy one would be to take the prioritized items and add them to your search query, with a clause like this for each special item:
"OR item_id:%d^X"
where X is the boost factor you'd like to use. You'll probably need to tweak this number empirically to make sure that merely being "upvoted" doesn't put an item at the top of the results for something totally unrelated.
Doing it this way will at least spare you a lot of annoying post-processing steps that would require iterating over the whole result set -- hopefully the proper sorting will be there right from querying the index.
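A minimal sketch of assembling such a query string, assuming Lucene query-parser syntax where `^` sets a boost; the field names and boost factor are placeholders:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PriorityQueryBuilder {
    // Wrap the user's query and OR in one boosted clause per prioritized item id.
    static String withPriorities(String userQuery, List<Long> itemIds, int boost) {
        String boosted = itemIds.stream()
                .map(id -> "item_id:" + id + "^" + boost)
                .collect(Collectors.joining(" OR "));
        return boosted.isEmpty() ? userQuery
                : "(" + userQuery + ") OR " + boosted;
    }

    public static void main(String[] args) {
        System.out.println(withPriorities("title:widgets", List.of(42L, 99L), 10));
        // prints (title:widgets) OR item_id:42^10 OR item_id:99^10
    }
}
```

In practice you would cap the list (the question mentions up to 200 prioritized items per user) to keep the query from growing unbounded.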
I need to index a lot of text. The search results must give me the name of the files containing the query and all of the positions where the query matched in each file - so, I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?
Update: Lucene has been suggested. Can you give me some info on how I should use Lucene to achieve this? (I have seen examples where the search query returned only the matching files.)
For Java, try Lucene.
I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store term position information in order to get the snippets you are looking for. The Token API may help.
It all depends on how you are going to access it. And of course, how many are going to access it. Read up on MapReduce.
If you are going to roll your own, you will need to create an index file which is essentially a map between unique words and tuples like (file, line, offset). Of course, you can also think of other in-memory data structures like a trie (prefix tree), a Judy array, and the like...
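A toy sketch of such an index in plain Java, mapping each word to (file, line, offset) tuples; the `Posting` record and the whitespace tokenizer are simplifications for illustration:

```java
import java.util.*;

public class InvertedIndex {
    // One posting: where a word occurs.
    record Posting(String file, int line, int offset) {}

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Index the lines of one file; a real tokenizer would be smarter.
    void addFile(String fileName, List<String> lines) {
        for (int ln = 0; ln < lines.size(); ln++) {
            String line = lines.get(ln).toLowerCase();
            int pos = 0;
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                int offset = line.indexOf(word, pos);
                index.computeIfAbsent(word, k -> new ArrayList<>())
                     .add(new Posting(fileName, ln, offset));
                pos = offset + word.length();
            }
        }
    }

    List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addFile("a.txt", List.of("hello world", "world peace"));
        System.out.println(idx.lookup("world"));
    }
}
```

Because the postings carry line and offset, a search can report match positions without re-reading the whole file, which is exactly what the question asks for.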
Some 3rd party solutions are listed here.
Have a look at http://www.compass-project.org/. It can be seen as a wrapper on top of Lucene: Compass simplifies common usage patterns of Lucene such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also uses built-in optimizations for concurrent commits and merges.
The Overview can give you more info
http://www.compass-project.org/overview.html
I have integrated this into a Spring project in no time. It is really easy to use and gives what your users will see as Google-like results.
Lucene - Java
It's open source as well, so you are free to use and deploy it in your application.
As far as I know, the Eclipse IDE's help system is powered by Lucene, so it has been tested by millions.
Also take a look at Lemur Toolkit.
Why don't you try to construct a state machine by reading all the files? Transitions between states will be letters, and states will be either final (some files contain the considered word, in which case the list is available there) or intermediate.
As for multiple-word lookups, you'll have to handle them independently before intersecting the results.
I believe the Boost::Statechart library may be of some help for that matter.
I'm aware you asked for a library; I just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).