I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking instead of simply doing a search based on the term, then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.
I wrote the Chunked Element Range Indexes blog post, and found out last-minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques such as wildcard matching were suddenly much faster. That surprised me because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, especially if I tried introducing a wildcard search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do it, as I believe they'd add some overhead. Call cts:search or one of the other cts calls to match a lexicon. I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index, which is held entirely in memory, so it's faster. Turn on all your wildcard indexes on your database if you can.
It should be called from a custom XQuery script in a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as streamlined as possible to do most type-ahead scenarios correctly (that is, fast enough).
I'd suggest you find ways to whittle down the set of values in the range index to fewer than 100,000, so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if a user has already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned, since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top. Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train of thought, so don't interrupt them with suggested search phrases that won't help them get what they want. I've seen that way too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
Hoping that helps!
You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons would produce suggestions based on tokenized character data, not entire values of elements (or json properties). It might be worth looking into that.
Alternatively, since you are mentioning wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wild-carded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual contents.
HTH!
We use Hibernate Search 6 CR2 with Elasticsearch and Spring Boot 2.4.0. Is there any way to collapse duplicates in search results?
We tried to kind of "collapse" them like this:
searchResults = searchSession.search(Items.class)
        .select(f -> f.field(field.getCode(), String.class))
        .where(f -> f.phrase()
                .field(field.getCode())
                .matching(phrase)
                .slop(SLOP))
        .fetchHits(20)
        .stream()
        .distinct()
        .collect(Collectors.toList());
...but this method only works on a small number of results (fewer than the fetchHits size) and when there are not many identical hits. When we tried this method on another index with thousands of hits (~28M docs), we saw that it does not work as expected because of the fetchHits limit: some search results that should be returned are lost. And of course, the main problem is that with this method we don't deduplicate the search results during the search itself; it happens after the original search, so it's not the best solution.
Another solution was found here but it's a bit outdated and not an actual answer for our question.
On the Hibernate Search forums we found another solution for a similar task; we tried to implement it and it worked, but as a downside we doubled the number of indexed document fields (8 fields now instead of 4).
So, after all, is it possible to tune HS to collapse duplicates in search results without the help of these extra fields? Or, if that's simply how it's done... okay then! We'll remember this and use it as a solution in future cases.
P.S.: we are implementing a search-as-you-type prediction service, so it's not necessary for the original entities to be extracted.
The solution you linked is the most straightforward way to get a list of all values in matched documents for a given field. It is what aggregations are for.
Yes, it requires additional fields. Generally speaking, you can't get performance out of thin air: to get a smaller execution time, you need to use more memory.
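For reference, here is roughly what that aggregation-based approach looks like with the Hibernate Search 6 DSL, reusing your phrase predicate. This is only a sketch: the field behind field.getCode() would have to be declared aggregable (e.g. @KeywordField(aggregable = Aggregable.YES)), and I've capped the number of terms at 20.
// imports: org.hibernate.search.engine.search.aggregation.AggregationKey,
//          org.hibernate.search.engine.search.query.SearchResult
AggregationKey<Map<String, Long>> byValue = AggregationKey.of("byValue");

SearchResult<Items> result = searchSession.search(Items.class)
        .where(f -> f.phrase()
                .field(field.getCode())
                .matching(phrase)
                .slop(SLOP))
        .aggregation(byValue, f -> f.terms()
                .field(field.getCode(), String.class)
                .maxTermCount(20))
        .fetch(0);   // 0 hits: we only want the aggregated, de-duplicated values

Map<String, Long> distinctValues = result.aggregation(byValue);   // value -> document count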
That being said, if what you want is suggestions, you should probably have a look at Elasticsearch's suggester feature.
There is no API for this in Hibernate Search (yet), so you will have to transform JSON in order to leverage this feature. It's relatively easy, and you even have an example for your very use case in the reference documentation (have a look at the second example).
Of course if you really want to use phrase queries, it's going to be more complicated. I'd suggest you have a look at the phrase suggester or maybe the completion suggester.
Should you need to register a field with a type that is not supported out of the box by Hibernate Search (e.g. completion), it's possible too: you will just need a custom bridge. See this example.
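If it helps, here is a rough sketch of what a completion-suggester request looks like when sent through Elasticsearch's low-level REST client (the same client Hibernate Search uses underneath). The index name, suggester name and field name are made up, and the field would have to be mapped with the completion type, e.g. via the custom bridge mentioned above.
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SuggesterSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Ask the completion suggester for up to 10 entries starting with "qu".
        Request request = new Request("POST", "/items-read/_search");
        request.setJsonEntity(
            "{ \"_source\": false,"
          + "  \"suggest\": { \"item-suggest\": {"
          + "      \"prefix\": \"qu\","
          + "      \"completion\": { \"field\": \"name_suggest\", \"size\": 10 } } } }");

        Response response = client.performRequest(request);
        // Suggestions come back under suggest.item-suggest[0].options[*].text
        System.out.println(EntityUtils.toString(response.getEntity()));
        client.close();
    }
}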
I don't want to use Lucene because I think it is too heavy.
Is there any easier way to implement this (millions of records)?
If you don't want to have to worry about performance, I recommend you take a look at Amazon Web Services' new CloudSearch service. It's fast and scales as your needs scale. It can also handle millions of documents without a problem and supports wildcard searches (e.g. quo* would retrieve Quora).
Check it out here.
Obviously this isn't how it definitely works at either Quora or Google, as I haven't had the pleasure to work at either...this is just how I'd go about doing it.
The first thing to obtain is a list of search terms - I'm assuming you don't want to know how this is done, as it will really depend on all sorts of things, but basically you're either going to do a select distinct title from pages (in the case of the autocomplete on Wikipedia) or something much more advanced in the case of Google's.
The next step is also pretty simple at a high level: you need to perform the query select title from titles where title like 'Qu%' in the case of the user typing Qu into the search box. The list of titles is then returned to the browser as the response to some kind of Ajax request, perhaps in the form of JSON or similar. And you need to do it as fast as possible - that's where it becomes difficult.
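For illustration, the naive version of that round trip could be as small as the following JDBC sketch (connection details, table and column names are placeholders, and "limit" is PostgreSQL/MySQL syntax):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class NaivePrefixLookup {
    // Returns up to 10 titles starting with whatever the user has typed so far.
    public static List<String> suggest(String userInput) throws Exception {
        List<String> titles = new ArrayList<String>();
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/search", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "select title from titles where title like ? order by title limit 10")) {
            ps.setString(1, userInput + "%");   // 'Qu' becomes the pattern 'Qu%'
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    titles.add(rs.getString("title"));
                }
            }
        }
        return titles;   // serialise to JSON and return as the Ajax response
    }
}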
How do they do it so quickly? There are probably four things to bear in mind.
They have LOTS of machines handling the requests. Bear in mind that Google's autocomplete is turned on by default and works in (almost?) all languages. That's a lot of searches against the autocomplete index. A lot more than there will be against the web index itself: for each web search request, Google will probably have processed 3 or 4 autocomplete requests.
They're probably doing it in memory. Google is already known to store its web indexes in memory, so I would expect them to be doing the same with this.
Specialised software (this is where it gets really interesting). While a traditional database or a NoSQL database could do this and do it quickly, I would expect the big boys to actually be doing it with specialised code whose sole purpose is to provide autocomplete suggestions. The SQL statement I provided above was purely to demonstrate the logical request that would be needed. You're probably looking at some kind of specialised tree, such as a suffix tree, radix tree, or similar (a minimal trie sketch follows this list).
Sharding. To cope with the quantity of data and the number of machines handling the requests, you're going to need to shard. That is, ensure that a certain subset of all the machines involved only processes requests that begin with one or more letters, e.g. a group of X machines processing searches that begin with a certain letter or even two letters. That means you've got more machines, but they don't each have to have the whole index to hand. How does a particular group of machines get chosen? You're either routing once the request is in your data centre, or you could route on the client side (e.g. in your JavaScript, decide which IP to query based upon the first X letters of the search term).
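Here is the kind of minimal, in-memory prefix tree I have in mind for point 3. It's only a sketch: a real system would also rank completions by popularity rather than returning them in arbitrary order.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Insert titles once, then collect up to `limit` completions for whatever the user has typed.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    boolean endOfTitle;
}

public class AutocompleteTrie {
    private final TrieNode root = new TrieNode();

    public void insert(String title) {
        TrieNode node = root;
        for (char c : title.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.endOfTitle = true;
    }

    public List<String> complete(String prefix, int limit) {
        List<String> results = new ArrayList<>();
        TrieNode node = root;
        for (char c : prefix.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return results;   // nothing starts with this prefix
        }
        collect(node, new StringBuilder(prefix.toLowerCase()), results, limit);
        return results;
    }

    private void collect(TrieNode node, StringBuilder current, List<String> results, int limit) {
        if (results.size() >= limit) return;
        if (node.endOfTitle) results.add(current.toString());
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, results, limit);
            current.deleteCharAt(current.length() - 1);
            if (results.size() >= limit) return;
        }
    }
}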
So, that's how I would do it. Not having had the experience of the enormous datasets Google/Quora are dealing with, I'm sure there are things that I've not considered. But, it's a start.
And, here's how I have done it, purely in an experimental environment at home:
I had a simple list of a good few hundred thousand titles to search. These were loaded into a dedicated MongoDB collection, which had a single index defined on it. I then had a Play Framework controller in front of it and used jQuery's autocomplete plugin to do the search.
Obviously this is tiny compared with what you are looking for, but MongoDB should provide the same kind of performance for your dataset provided you follow the recommendations (i.e. good hardware, lots of RAM, keep the indexes in memory). In addition, Mongo supports sharding, and the Play Framework is shared-nothing, so adding new machines to cope with the load should your userbase grow would be straightforward in this situation.
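The query side of that experiment boils down to something like this with the MongoDB Java driver (database, collection and field names are placeholders, and I'm leaving out the Play controller and the jQuery autocomplete plugin):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import java.util.regex.Pattern;

public class MongoPrefixLookup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> titles =
                    client.getDatabase("autocomplete").getCollection("titles");
            // The single index on the (lowercased) title field mentioned above.
            titles.createIndex(Indexes.ascending("title"));

            String prefix = "qu".toLowerCase();
            // An anchored, case-sensitive regex like ^qu can use the index as a range scan.
            for (Document d : titles.find(Filters.regex("title", "^" + Pattern.quote(prefix)))
                                    .limit(10)) {
                System.out.println(d.getString("title"));
            }
        }
    }
}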
By the way, Mongo is by no means the only solution, traditional SQL databases will be up to the job too, of course - I was just using Mongo for other reasons.
First, for autocomplete you should aim to get the response back to the user in <= 100ms if you want something that appears fast. That should be your first concern. Any setup that can't do that probably won't be good enough for users. In my own tests in Firefox using Firebug, Google's autocomplete returned in about 50ms and Quora's in about 65ms.
See, e.g.
http://stackoverflow.com/questions/536300/what-is-the-shortest-perceivable-application-response-delay
Apparently, Quora uses prefix matching rather than full-text search, which makes it faster. To roll your own fast prefix-based autocomplete (sufficient for many cases, though it won't handle things like misspellings via fuzzy matching), try an in-memory data store like Redis. The details can be seen here:
http://charlesleifer.com/blog/powerful-autocomplete-with-redis-in-under-200-lines-of-python/
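The linked post builds its own prefix sets; on a newer Redis you can get similar behaviour from a single sorted set and ZRANGEBYLEX. A minimal sketch with the Jedis client (key name and sample data are made up):
import redis.clients.jedis.Jedis;

public class RedisPrefixAutocomplete {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Load lowercased titles once, all with the same score, so the set
            // is ordered lexicographically and ZRANGEBYLEX can do prefix scans.
            jedis.zadd("titles", 0, "quora");
            jedis.zadd("titles", 0, "quantum mechanics");
            jedis.zadd("titles", 0, "quentin tarantino");

            String prefix = "qu".toLowerCase();
            // "[qu" .. "[qu\uffff" covers members starting with "qu" (for typical characters);
            // offset 0, count 10 caps the number of suggestions returned.
            for (String suggestion :
                    jedis.zrangeByLex("titles", "[" + prefix, "[" + prefix + "\uffff", 0, 10)) {
                System.out.println(suggestion);
            }
        }
    }
}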
I haven't been able to get CloudSearch (95-125ms in the browser fetching from the endpoint directly, as measured by Firebug, and 20-30ms longer accessing the endpoint via cURL in PHP) down to the low latencies of Google and Quora I cited, regardless of the simplicity of the search query. An Elasticsearch cluster is a bit faster. These statements obviously depend upon the use case and probably don't generalize well, but they're something to think about.
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: Money amount, Date, Comment and Section (e.g. Car, Salary, Computer, Food, ...)
The sections are created "on the fly". When the user enters a new section, it will be added to the Section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is, is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section.
My first thought was, do it with a neural network and learn from the input, if the user selects another section.
But my problem is, I don't know how to start at all. I tried Encog with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Anyone could give me a hint how to start or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
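A bare-bones version of that first solution could be nothing more than a hand-maintained keyword map; the sections and synonyms below are invented sample data.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SectionGuesser {
    // Hand-maintained synonym lists per section.
    private static final Map<String, List<String>> KEYWORDS = new HashMap<String, List<String>>();
    static {
        KEYWORDS.put("Car", Arrays.asList("fuel", "gas", "petrol", "service", "tyre", "tire"));
        KEYWORDS.put("Food", Arrays.asList("lunch", "dinner", "restaurant", "pizza"));
        KEYWORDS.put("Salary", Arrays.asList("salary", "wage", "payroll"));
    }

    // Returns the first section whose keyword appears in the comment, or null if none match.
    public static String guessSection(String comment) {
        String normalized = comment.toLowerCase();
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (normalized.contains(keyword)) {
                    return entry.getKey();
                }
            }
        }
        return null;
    }
}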
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80% satisfying feature, while aiming for 90% could cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
String myString = paramInput.toUpperCase();   // normalise case so "fuel", "Fuel" and "FUEL" all match
if (myString.contains("FUEL")) {
    // do the fuel functionality
}
In a simple app, if you will only have some specific sections in your application, you can take the string from the comment, check whether it contains certain keywords, and set the value of Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and the potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
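As a rough sketch of that idea, assuming the Lucene 3.x API (the category names and keywords are invented):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CategoryGuesser {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // One document per category: the stored name plus keywords/phrases
        // that might show up in a user's comment.
        IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        addCategory(writer, "Car", "fuel gas petrol car service garage tyre");
        addCategory(writer, "Food", "lunch dinner restaurant pizza groceries");
        writer.close();

        // Run the user's comment through the index; the top hit is the "best guess".
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "keywords", analyzer);
        Query query = parser.parse(QueryParser.escape("Car service at the garage"));
        TopDocs hits = searcher.search(query, 1);
        if (hits.totalHits > 0) {
            System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("name"));
        }
        searcher.close();
    }

    private static void addCategory(IndexWriter writer, String name, String keywords) throws Exception {
        Document doc = new Document();
        doc.add(new Field("name", name, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("keywords", keywords, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}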
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not compute the numerical best fit directly.
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the pre-trained classes does the input belong to?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input that your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps the inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand:
An easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find the list of Sections that the typed keywords belong to, suggest several Sections and let the user pick one; the feedback you get from those selections would improve the suggestions over time. Another would be to calculate a Bayesian probability (the probability that a word belongs to Section X given the previously stored mappings) for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will need to be updated as you gather more information, of course; perhaps this could be done with every new addition in a background thread.
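As a very rough sketch of the counting behind that Bayesian idea, in plain Java: the counts would come from the comment -> Section mappings you already store, and the class and method names are made up.
import java.util.HashMap;
import java.util.Map;

public class SectionSuggester {
    // counts.get(word).get(section) = how often a stored comment containing
    // `word` ended up in `section`; built from mappings already in the database.
    private final Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();

    public void recordMapping(String comment, String section) {
        for (String word : comment.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Map<String, Integer> perSection = counts.get(word);
            if (perSection == null) {
                perSection = new HashMap<String, Integer>();
                counts.put(word, perSection);
            }
            Integer old = perSection.get(section);
            perSection.put(section, old == null ? 1 : old + 1);
        }
    }

    // P(section | word) is estimated as count(word, section) / count(word);
    // the suggestion is the section with the highest summed probability over
    // the words the user has typed so far.
    public String suggest(String comment) {
        Map<String, Double> scores = new HashMap<String, Double>();
        for (String word : comment.toLowerCase().split("\\W+")) {
            Map<String, Integer> perSection = counts.get(word);
            if (perSection == null) continue;
            int total = 0;
            for (int c : perSection.values()) total += c;
            for (Map.Entry<String, Integer> e : perSection.entrySet()) {
                Double old = scores.get(e.getKey());
                double p = e.getValue() / (double) total;
                scores.put(e.getKey(), old == null ? p : old + p);
            }
        }
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}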
I am using Lucene 3.0.1 in a Java 5 environment.
I've been researching this issue a little bit, but the documentation hasn't given any direct answers.
Using the search method
TopFieldDocs search(Weight weight, Filter filter, int nDocs, Sort sort)
I always need to provide a maximum number of search results nDocs.
What if I wanted to have all matching results? It feels like setting nDocs to Integer.MAX_VALUE is a kind of hacky way to do this (and would result in speed and memory performance drop?).
Does anyone have any ideas?
You are using a search method that returns the top n hits for a query.
There are other (more low-level) methods that do not have the limitation, and it says in the documentation that "applications should only use this if they need all of the matching documents. The high-level search API (search(Query, int)) is usually more efficient, as it skips non-high-scoring hits.".
So if you really need all documents, you can use the low-level API. I doubt that it makes a big difference in performance compared to passing a really high limit to the high-level API. If you need all documents (and there really are a lot of them), it is going to be slow either way, especially if sorting is involved.
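If you do go the low-level route, a Collector that simply gathers every matching doc ID looks roughly like this against the Lucene 3.0.x API:
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Collects the ID of every matching document, with no scoring and no limit.
public class AllDocsCollector extends Collector {
    private final List<Integer> docIds = new ArrayList<Integer>();
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        // Scores are not needed, so the scorer is ignored.
    }

    @Override
    public void collect(int doc) {
        docIds.add(docBase + doc);   // doc is relative to the current segment
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;   // order does not matter, which lets Lucene pick a faster path
    }

    public List<Integer> getDocIds() {
        return docIds;
    }
}

// Usage (with your existing searcher and query):
//   AllDocsCollector collector = new AllDocsCollector();
//   searcher.search(query, collector);
//   List<Integer> allMatches = collector.getDocIds();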
I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into a set of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality is different. This sort of technique seems to be used in full-text searches, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like. And just splitting on white space doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non-stop-words of the two documents might not be the best approach for good similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms that let you compare many documents at the same time in this way, much more efficiently than doing a pairwise comparison between each pair of documents.
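To illustrate the chunking idea in plain Java: break each document into overlapping n-token shingles (ideally after stemming and stop-word removal) and compare the shingle sets, for example with Jaccard similarity. This is only a sketch with naive tokenisation.
import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {
    // Builds the set of overlapping n-token chunks ("shingles") of a document.
    // Tokenisation here is naive splitting on non-word characters; a stemmer
    // (e.g. a Porter stemmer) would normally be applied to each token first.
    static Set<String> shingles(String text, int n) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + n; j++) {
                if (j > i) sb.append(' ');
                sb.append(tokens[j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Jaccard similarity: |intersection| / |union| of the two shingle sets.
    static double similarity(String docA, String docB, int n) {
        Set<String> a = shingles(docA, n);
        Set<String> b = shingles(docB, n);
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return intersection.size() / (double) union.size();
    }

    public static void main(String[] args) {
        System.out.println(similarity(
                "The quick brown fox jumps over the lazy dog",
                "A quick brown fox jumped over a lazy dog", 3));
    }
}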
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.