How to collapse duplicates in search results

How to collapse duplicates in search results - java

We use Hibernate Search 6 CR2 with Elasticsearch and Spring Boot 2.4.0. Is there any way to collapse duplicates in search results?
We tried to kind of "collapse" them like this:
searchResults = searchSession.search(Items.class)
.select(f -> f.field(field.getCode(), String.class))
.where(f -> f.phrase()
.field(field.getCode())
.matching(phrase)
.slop(SLOP))
.fetchHits(20)
.stream()
.distinct()
.collect(Collectors.toList());
...but this method works only on small amount of results (less than fetchHits size) and when there are not so many identical hits. When we tried this method on another index with thousands hits (~28M docs) we saw that it's not working as expected because of fetchHits setting -- some search results that should be -- are lost. And of course, the main question here is that by using this method we don't distinct search results while searching, it happens after the original search, so it's not the best solution.
Another solution was found here but it's a bit outdated and not an actual answer for our question.
Over Hibernate Search forums we found another solution for similar task, we tried to implement it and it worked, but as a downsides we got 2x multiplication for index document fields (8 fields now instead of 4).
So after all, is it possible to tune HS to collapse duplicates in search results without help of these extra-fields? Or, if it's OK... Okay then! We'll remember this and use as a solution in future cases.
P.S.: we implement search-as-you-type prediction service so it's not necessary for original entities to be extracted.

The solution you linked is the most straightforward way to get a list of all values in matched documents for a given field. It is what aggregations are for.
Yes, it requires additional fields. Generally speaking, you can't get performance out of thin air: to get a smaller execution time, you need to use more memory.
That being said, if what you want is suggestions, you should probably have a look at Elasticsearch's suggester feature.
There is no API for this in Hibernate Search (yet), so you will have to transform JSON in order to leverage this feature. It's relatively easy, and you even have an example for your very use case in the reference documentation (have a look at the second example).
Of course if you really want to use phrase queries, it's going to be more complicated. I'd suggest you have a look at the phrase suggester or maybe the completion suggester.
Should you need to register a field with a type that is not supported out of the box by Hibernate Search (e.g. completion), it's possible too: you will just need a custom bridge. See this example.

Related

How to limit the number of "candidate" documents when using the "fq" parameter in SOLR

I believe this will require custom coding, but what I'd like to do is
a) require that queries run against solr contain a value for the filter query parameter (fq)
b) short circuit the request if the resulting candidate list from the filter query is larger than some threshold.
From looking at the SOLR code is seems like I can do the first by providing a custom component to the /select request handler (via the config file), but the second is a little bit more murky.
Is anyone aware of a plugin already implementing something like this? Or has suggestions on how to implement it? As best I can see if I extend QueryComponent and inject a custom SolrIndexSearcher it ought to be possible, but I'm not seeing where the filter cache is used exactly (get from cache I found).
edit, adding a sample query
Here's an example query, after sanitizing it
https://....?q=startDate%3A%5B2019-01-10T00%3A00%3A00.000Z%20TO%202019-01-10T23%3A59%3A59.999Z%5D%20AND%20(name%3A%22grr%22)&wt=json&sort=startDate%20asc&fl=recordId,startDate,aName,aSomethingOrOtherType,&start=0&rows=50&joinCollection=
The application is multi-tenant, and while some tenants produce a smaller volume of documents others produce very larger volumes. We already force a date range into the search to limit the total number of potential documents looked at, but what we'd really like is be able to determine the total number of potential documents before considering anything else and fail the request if it's too broad. That way we don't incur the cost of the additional search parameters or sorting.

Practical to use snippets as search suggest?

I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking instead of simply doing a search based on the term, then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.

I wrote the Chunked Element Range Indexes blog post, and found out last-minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques such as wildcard matching were suddenly much faster. That surprised me because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, expecially if I tried introducing a wild-card search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do that as I believe they'd add some overhead. Call cts:search or one of the other cts calls to match a lexicon. I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index since which are all in memory, so faster. Turn on all your wildcard indexes on your db if you can.
It should be called from a custom XQuery script in a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as stream-lined as possible to do most type-ahead scenarios correctly (that is, fast enough).
I'd suggest you find ways to whittle down the set of values in the range index to less than 100,000 so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if a user already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top. Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train-of-thought, so don't interrupt them if you're going to suggest search phrases that won't help them get what they want. I've seen that way too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
Hoping that helps!

You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons would produce suggestions based on tokenized character data, not entire values of elements (or json properties). It might be worth looking into that.
Alternatively, since you are mentioning wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wild-carded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual contents.
HTH!

How to design the server side of an autocomplete box like Quora?

I don't want to use Lucene because i think it is to heavy.
Is there any easier way to implement this (Millons of data) ?

If you don't want to have to worry about performance, I recommend you take a look at Amazon Web Services new CloudSearch service. It's fast and scales as your needs scale. It can also handle millions of documents without a problem and supports wildcard searches (ex: quo*, would retrieve Quora).
Check it out here.

Obviously this isn't how it definitely works at either Quora or Google, as I haven't had the pleasure to work at either...this is just how I'd go about doing it.
The first thing to obtain is a list of search terms - I'm assuming you don't want to know how this is done, as it will really depend on all sorts of things, but basically you're either going to do a select distinct title from pages (in the case of the autocomplete on Wikipedia) or something much more advanced in the case of Google's.
The next step is also pretty simple at a high level: you need to perform the query select title from titles where title like 'Qu%' in the case of the user typing Qu into the search box. The list of titles is then returned to the browser as the response to some kind of Ajax request, perhaps in the form of JSON or similar. And you need to do it as fast as possible - that's where it becomes difficult.
How do they do it so quickly? There are probably four things to bear in mind.
They have LOTS of machines handling the requests. Bear in mind that Google's autocomplete is turned on by default and works in (almost?) all languages. That's a lot of searches against the autocomplete index. A lot more than there will be against the web index itself: for each web search request, Google will probably have processed 3 or 4 autocomplete requests.
They're probably doing it in memory. Google is already known to store its web indexes in memory, so I would expect them to be doing the same with this.
Specialised software (this is where it gets really interesting). While a traditional database or a NoSQL database could do this and do it quickly I would expect the big boys to actually be doing this with specialised code whose sole purpose is to provide autocomplete suggestions. The SQL statement I provided above was purely to demonstrate the logical request that would be needed. You're probably looking at some kind of specialised tree, such as a suffix tree, radix tree, or similar.
Sharding. To cope with the quantity of data and the number of machines doing the requests you're going to need to shard. That is ensure that a certain subset of all the machines involved only process requests requests that begin with one or more letters. eg a group of X machines processing searches that begin with a certain letter or even 2 letters. That means that you've got more machines, but they don't each have to have the whole index to hand. How does a particular group of machines get chosen? You're either routing once the request is in your data centre, or you could route on the client side (eg in your Javascript decide which IP to query based upon the first X letters of the search term)
So, that's how I would do it. Not having had the experience of the enormous datasets Google/Quora are dealing with, I'm sure there are things that I've not considered. But, it's a start.
And, here's how I have done it, purely in an experimental environment at home:
I had a simple list of a good few hundred thousand titles to search. These were loaded into a dedicated MongoDB collection, which had a single index defined on it. I then had a Play Framework controller in front of it and used jQuery's autocomplete plugin to do the search.
Obviously this is tiny compared with what you are looking for, but MongoDB should provide the same kind of performance for your dataset provided you follow the recommendations (ie good hardware, lots of RAM, keep the indexes in memory). In addition, Mongo supports sharding, and the Play Framework is shared nothing, so adding new machines to cope with the load should your userbase grow would be straightforward in this situation.
By the way, Mongo is by no means the only solution, traditional SQL databases will be up to the job too, of course - I was just using Mongo for other reasons.

First, for autocomplete you should aim to get the response back to the user in <= 100ms if you want something that appears fast. That should be your first concern. Any setup that can't do that probably won't be good enough for users. In my own tests in Firefox using Firebug, Google's autocomplete returned returns in about 50ms and Quora in about 65ms.
See, e.g.
http://stackoverflow.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay
Apparently, Quora uses prefix matching, not full text search which makes it faster. To roll your own fast prefix-based autocomplete, which should be sufficient for many cases, but won't handle things like misspellings using fuzzy matching, etc., try an in-memory data store like Redis. The details can be seen here:
http://charlesleifer.com/blog/powerful-autocomplete-with-redis-in-under-200-lines-of-python/
I haven't been able to get CloudSearch (95-125ms in browser fetching from endpoint directly as measured by Firebug, and + 20-30ms longer accessing endpoint via cURL in PHP) down to the low latencies of Google and Quora I cited regardless of the simplicity of the search query. An Elasticsearch cluster is a bit faster. These statements obviously depend upon use case and probably don't generalize well, but something to think about.

Can I make Lucene return an unlimited number of search results?

I am using Lucene 3.0.1 in a Java 5 environment.
I've been researching this issue a little bit, but the documentation hasn't given any direct answers.
Using the search method
TopFieldDocs search(Weight weight, Filter filter, int nDocs, Sort sort)
I always need to provide a maximum number of search results nDocs.
What if I wanted to have all matching results? It feels like setting nDocs to Integer.MAX_VALUE is a kind of hacky way to do this (and would result in speed and memory performance drop?).
Anyone else who has any idea?

You are using a search method that returns the top n hits for a query.
There are other (more low-level) methods that do not have the limitation, and it says in the documentation that "applications should only use this if they need all of the matching documents. The high-level search API (search(Query, int)) is usually more efficient, as it skips non-high-scoring hits.".
So if you really need all documents, you can use the low-level API. I doubt that it makes a big difference in performance to passing a really high limit to the high-level API. If you need all documents (and there really are a lot of them), it is going to be slow either way, especially if sorting is involved.

Lucene search results sort by custom order list (unique to each user)

I have authenticated users in my application who have access to a shared database of up to 500,000 items. Each of the users has their own public facing web site and needs the ability to prioritize the items on display (think upvote) on their own site.
out of the 500,000 items they may only have up to 200 prioritized items, the order of the rest of the items is of less importance.
Each of the users will prioritize the items differently.
I initially asked a similar mysql question here Mysql results sorted by list which is unique for each user and got a good answer but i believe a better option may be to opt for a non sql indexed solution.
Can this be done in Lucene?, is there another search technology which would be better for this.
ps. Google implements a similar type setup with their search results where you can prioritize and exclude your own search results if you are logged in.
Update: re-tagged with sphinx as i have been reading the documentation and i believe it may be able to do what i am looking for with "per-document attribute values" stored in memory - interested to hear any feedback on this from sphinx gurus

You'll definitely want to store the id of item in each document object when building your index. There's a few ways to do the next step, but an easy one would be take the prioritized items and add them to your search query, something like this for each special item:
"OR item_id=%d+X"
where X is the amount of boost you'd like to use. You'll probably need to empirically tweak this number to make sure that just being "upvoted" doesn't put it to the top of a list searching for something totally unrelated.
Doing it this way will at least prevent you from a lot of annoying postprocessing steps that would require you to iterate over the whole result set -- hopefully the proper sorting will be there right from querying the index.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.