I'm using the Solr 3.6.1 Webapp with the SOLR-2155 Patch for geohash field support.
I set everything up as described here: https://github.com/dsmiley/SOLR-2155
Every now and then the search returns a completely wrong distance for every hit. Refreshing the result doesn't help. If I change the offset of the search (e.g. listing the 2nd page), everything is fine again, and even going back to the first page then shows the right distances.
It's not the webapp, because the JSON stream I get directly from Solr contains the same distances as the result score. In my case it is always 20015.12 km.
Here is my query:
fq={!geofilt}
&fl=internalid,score
&start=0
&rows=10
&sort=geodist()+asc
&sfield=location
&pt=53.6,10.11
&d=50
&q={!func}geodist()
I wrote SOLR-2155. I'm not particularly happy with the distance sorting code therein; there seems to be something I overlooked. It shouldn't be a math issue; it's some sort of Lucene internals issue, I think. If you have just one point per document, then use solr.LatLonType. If you have multiple then.... :-( I don't know what the problem is. In Solr 4, the replacement for this is called SpatialRecursivePrefixTreeFieldType; it uses different code, but it is similar, so I'm not sure whether the problem will go away or not. But even in that case, again, I'm not at all happy with the implementation. I know how I want to do it right, but it's not at the top of my TODO list right now.
BTW, you are basically reporting a bug, and the proper place to report a bug is the issue tracker of the project in question -- in this case, the SOLR-2155 repository on GitHub. Stack Overflow doesn't make sense for that.
We use Hibernate Search 6 CR2 with Elasticsearch and Spring Boot 2.4.0. Is there any way to collapse duplicates in search results?
We tried to kind of "collapse" them like this:
searchResults = searchSession.search(Items.class)
        .select(f -> f.field(field.getCode(), String.class))
        .where(f -> f.phrase()
                .field(field.getCode())
                .matching(phrase)
                .slop(SLOP))
        .fetchHits(20)
        .stream()
        .distinct()
        .collect(Collectors.toList());
...but this method only works on a small number of results (fewer than the fetchHits size) and when there are not too many identical hits. When we tried it on another index with thousands of hits (~28M docs), we saw that it does not work as expected because of the fetchHits limit -- some search results that should be there get lost. And of course the main issue is that this approach doesn't deduplicate results during the search itself; it happens after the original search, so it's not the best solution.
We found another solution here, but it's a bit outdated and not an actual answer to our question.
On the Hibernate Search forums we found another solution for a similar task; we tried it and it worked, but as a downside the number of indexed document fields doubled (8 fields now instead of 4).
So, after all that: is it possible to tune Hibernate Search to collapse duplicates in search results without the help of these extra fields? Or, if that's just how it is done... okay then! We'll remember it and use it as the solution in future cases.
P.S.: we are implementing a search-as-you-type prediction service, so we don't need the original entities to be loaded.
The solution you linked is the most straightforward way to get a list of all values in matched documents for a given field. It is what aggregations are for.
Yes, it requires additional fields. Generally speaking, you can't get performance out of thin air: to get a smaller execution time, you need to use more memory.
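For reference, a terms aggregation with the Hibernate Search 6 DSL would look roughly like the sketch below. The "_keyword" suffix is only an assumption here: it stands for the extra aggregable copy of the analyzed field (declared with aggregable = Aggregable.YES), along the lines of the forum solution you linked.
import java.util.Map;
import org.hibernate.search.engine.search.aggregation.AggregationKey;

AggregationKey<Map<String, Long>> byValueKey = AggregationKey.of("byValue");

Map<String, Long> distinctValues = searchSession.search(Items.class)
        .where(f -> f.phrase()
                .field(field.getCode())
                .matching(phrase)
                .slop(SLOP))
        // aggregate on the (hypothetical) un-analyzed keyword copy of the field
        .aggregation(byValueKey, f -> f.terms()
                .field(field.getCode() + "_keyword", String.class)
                .maxTermCount(20))
        .fetch(0)
        .aggregation(byValueKey);
Each distinct value appears once as a map key together with its document count, and the deduplication happens inside Elasticsearch rather than in application code after the fact.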
That being said, if what you want is suggestions, you should probably have a look at Elasticsearch's suggester feature.
There is no API for this in Hibernate Search (yet), so you will have to transform JSON in order to leverage this feature. It's relatively easy, and you even have an example for your very use case in the reference documentation (have a look at the second example).
Of course if you really want to use phrase queries, it's going to be more complicated. I'd suggest you have a look at the phrase suggester or maybe the completion suggester.
Should you need to register a field with a type that is not supported out of the box by Hibernate Search (e.g. completion), it's possible too: you will just need a custom bridge. See this example.
I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking instead of simply doing a search based on the term, then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.
I wrote the Chunked Element Range Indexes blog post, and found out at the last minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques, such as wildcard matching, were suddenly much faster. That surprised me, because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, especially if I tried introducing a wildcard search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do it, as I believe they'd add some overhead. Instead, call cts:search or one of the other cts calls that match against a lexicon; I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index, which is held in memory, so it's faster. Turn on all your wildcard indexes on your database if you can.
It should be called from a custom XQuery script on a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as streamlined as possible to handle most type-ahead scenarios correctly (that is, fast enough).
I'd suggest you find ways to whittle down the set of values in the range index to less than 100,000 so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if a user already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top. Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train-of-thought, so don't interrupt them if you're going to suggest search phrases that won't help them get what they want. I've seen that way too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
Hoping that helps!
You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons would produce suggestions based on tokenized character data, not entire values of elements (or json properties). It might be worth looking into that.
Alternatively, since you are mentioning wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wild-carded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual contents.
HTH!
Just double-checking on this: I assume there is no built-in way to keep such info (the Analyzer used, the Lucene Version, and so on) bundled up with the index files in your index directory, and that you have to work out a way to do it yourself.
Obviously you might be using different Analyzers for different directories, and 99% of the time it is pretty important to use the right one when constructing a QueryParser: if your QueryParser uses a different one, all sorts of inaccuracies might crop up in the results.
Equally, getting the wrong Version of the index files might, for all I know, not result in a complete failure: again, you might instead get inaccurate results.
I wonder whether the Lucene people have ever considered bundling up this sort of info with the index files? Equally I wonder if anyone knows whether any of the Lucene derivative apps, like Elasticsearch, maybe do incorporate such a mechanism?
Actually, just looking inside the "_0" files (_0.cfe, _0.cfs and _0.si) of an index, all three do contain the word "Lucene", seemingly followed by version info. Hmmm...
PS: other related thoughts occur to me. Say you are indexing a text document of some kind (or 1000 documents)... and you want to keep your index up to date each time it is opened. One obvious way to do this would be to compare the last-modified date of individual files with the last time the index was updated: any documents which are now out of date would need to have the info pertaining to them removed from the index, and then be re-indexed.
This need must come up all the time in connection with Lucene indices. How is it generally tackled in the absence of helpful "meta info" included with the index files proper?
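In code, that approach would look roughly like the sketch below (Lucene 5.x-style APIs, much like what the IndexFiles demo that ships with Lucene does; the field names are just placeholders):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Re-index a file only if it has changed since it was last indexed.
void indexIfStale(IndexWriter writer, Path file, long lastIndexedMillis) throws IOException {
    long fileModified = Files.getLastModifiedTime(file).toMillis();
    if (fileModified <= lastIndexedMillis) {
        return; // still up to date, nothing to do
    }
    Document doc = new Document();
    doc.add(new StringField("path", file.toString(), Field.Store.YES));
    doc.add(new LongField("lastModified", fileModified, Field.Store.YES));
    doc.add(new TextField("contents", new String(Files.readAllBytes(file)), Field.Store.NO));
    // updateDocument = delete any previous document with this path, then add the new one
    writer.updateDocument(new Term("path", file.toString()), doc);
}
The lastIndexedMillis value would come from whatever bookkeeping you keep alongside the index (or from the stored "lastModified" field of the existing document for that path).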
Anyone interested in this issue:
It does appear from what I said that the Version is contained in the index files. I looked at the CheckIndex class and the various info you can get from that, e.g. CheckIndex.Status.SegmentInfoStatus, without finding a way to obtain the Version. I'm starting to assume this is deliberate, and that the idea is just to let Lucene handle the updating of the index as required. Not an entirely satisfactory state of affairs if so...
As for getting other things, such as the Analyzer class, it appears you have to implement this sort of "metadata" stuff yourself if you want it... this could be done by just including a text file in with the other files, or, alternatively, it appears you can use the index's commit user data. Of course your Version could also be stored this way.
For writing such info, see IndexWriter.setCommitData().
For retrieving such info, you have to use one of several (?) subclasses of IndexReader, such as DirectoryReader.
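A minimal sketch of both sides (Lucene 5.x-style APIs; in more recent Lucene versions setCommitData() was replaced by setLiveCommitData(), so adjust accordingly):
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory dir = FSDirectory.open(Paths.get("/path/to/index"));

// Writing: attach your own metadata to the next commit.
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
Map<String, String> commitData = new HashMap<>();
commitData.put("analyzer", StandardAnalyzer.class.getName());
commitData.put("appIndexFormat", "1");
writer.setCommitData(commitData);
writer.commit();
writer.close();

// Reading: the user data travels with the commit.
DirectoryReader reader = DirectoryReader.open(dir);
Map<String, String> userData = reader.getIndexCommit().getUserData();
String analyzerClass = userData.get("analyzer");
reader.close();
The keys and values ("analyzer", "appIndexFormat") are entirely up to you; Lucene just stores them with the segments file for that commit.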
I am currently implementing some text-revision comparison visualizations, and I'm trying to find information about how Wikipedia achieves its "View history" feature, which lets you compare the current revision with an older one.
You can find one example (about Stack Overflow!) here:
http://en.wikipedia.org/w/index.php?title=Stack_Overflow&diff=512241244&oldid=458578615
I have implemented several ideas so far and have also tried to reproduce the way Wikipedia does it. For this I've implemented the Levenshtein distance algorithm ( http://en.wikipedia.org/wiki/Levenshtein_distance ).
Let's assume I have two lists. I iterate over the first list and check, at the same index position in the second list, whether the string found there is more than 50% similar. If it is, I just print both strings side by side in my comparison view and continue with the next item of the first list. If it is not, I check the next item in the second list until I find a match, or leave the field for the second list blank if none can be found. (Although I would basically prefer that a sentence from the second list also always appears in the comparison view instead of being left out, e.g. with a blank field on the first list's side.)
This method has some weaknesses. First of all, if a sentence got deleted I would need to check the positions around the current index so as not to simply "forget" it. But then I also need to take care that text positions don't get swapped around when I do so.
Has any of you tried to achieve something similar with Java? If there are code examples of how you or others achieved it, I would gladly take a look to learn from them.
And of course, if you know anything about the algorithm Wikipedia (and wikis in general, I assume?) uses for revision comparison, I'd be glad to hear it.
Thanks a lot
Wikipedia explains how the wiki difference engine works - http://en.wikipedia.org/wiki/Help:Diff
You can follow the links at the bottom of the page to learn more, but this page lists the template used.
Another implementation besides Wikipedia's version control is diff on Unix-flavored systems. GNU makes the source code for diff available, which may let you look at their algorithms, here:
http://ftp.gnu.org/gnu/diffutils/
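The core idea behind these diff engines is a longest-common-subsequence (LCS) line diff; GNU diff uses a more efficient variant (Myers' algorithm), but the output shape is the same. As a minimal, self-contained Java sketch of the idea (not Wikipedia's actual code):
import java.util.ArrayList;
import java.util.List;

// Minimal LCS-based line diff: lines present in both revisions are kept,
// lines only in the old revision are marked "-", lines only in the new one "+".
public class SimpleLineDiff {

    public static List<String> diff(List<String> oldLines, List<String> newLines) {
        int n = oldLines.size(), m = newLines.size();
        // lcs[i][j] = length of the longest common subsequence of
        // oldLines[i..] and newLines[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (oldLines.get(i).equals(newLines.get(j))) {
                    lcs[i][j] = lcs[i + 1][j + 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i + 1][j], lcs[i][j + 1]);
                }
            }
        }
        // Walk the table to emit the edit script.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (oldLines.get(i).equals(newLines.get(j))) {
                out.add("  " + oldLines.get(i)); i++; j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + oldLines.get(i)); i++;
            } else {
                out.add("+ " + newLines.get(j)); j++;
            }
        }
        while (i < n) out.add("- " + oldLines.get(i++));
        while (j < m) out.add("+ " + newLines.get(j++));
        return out;
    }
}
Run it on the sentences (or lines) of the two revisions and render the "-"/"+" pairs side by side; the unchanged lines anchor the alignment, which avoids the position-inversion problem described in the question.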
I'm having a problem with Solr 3.4. I'm using its spatial search functions, such as geodist and geofilt.
Everything seems OK, and the results are returned supposedly sorted by distance from a given center point.
However, since Solr 3.4 lacks the ability to return function results in the response, I had to calculate the distance manually (in PHP in this case).
I read the docs, and geodist should be a function that implements the haversine formula for the geographic distance between two lat/lng points. I ported the function to PHP (easy!) and made sure it gives correct results.
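For reference, the standard haversine formula in question looks like this (sketched in Java rather than my PHP; the 6371.0087714 km mean Earth radius is the value used by Lucene's DistanceUtils as far as I can tell, so treat it as an assumption):
// Standard haversine great-circle distance between two lat/lon points, in km.
// Solr's geodist() returns kilometres by default, so the port has to use the
// same Earth radius and the same unit to be comparable.
static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    double earthRadiusKm = 6371.0087714; // mean Earth radius (assumed to match Solr/Lucene)
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    return earthRadiusKm * c;
}
Note the result is in kilometres; comparing against values expressed in miles requires converting (1 mile = 1.609344 km).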
The problem is that Solr seems to calculate the distance with a different formula that I couldn't find. So when I re-calculate the distance in PHP, I get inconsistent distances (e.g. 132 miles instead of 83 miles), which is not a difference I can tolerate.
My solution: I figured it would be handy to compare the two functions directly, to see whether I had made a mistake in my port. I dug into the Solr code and extracted the literal implementation of haversine from org.apache.solr.search.function.distance.HaversineConstFunction, and the results were almost identical; I made this testing script (full source code and data).
My conclusion was that Solr (or Lucene) does not use haversine as the geodist implementation -- but I don't know which formula it does use.
UPDATE: The bug has been resolved. I think I went too far with my tests. The incorrect results occurred because of a wrong parameter name: I was using order (the SQL term) instead of sort (the Solr convention) to change the ordering of the results from the Solr web service.
See the update; the bug has been resolved. Thanks to #jarnbjo and #TreyA for reminding me of a stupid issue. I should look for stupid mistakes in my own code before debugging library code in the future.