Lucene Search Luke vs Hibernate Search different result - java

I am running the following lucene query phrase in luke:
+(debtorNumber:10200000 originalDebtorNumber:10200000) +(serviceName:"skype
for"^840.0 (serviceName:for* serviceId:for*) (serviceName:skype*
serviceId:skype*))
At the beginning it shows the expected results, for example:
Skype for Business for Managers
Microsoft Skype for Business Conferencing (Plan2)
Telephone dial-in for Skype for Business Conferencing
and so on.
The same query executed with Hibernate Search returns different results :/
For example, I get the following results:
antivirus protection for your PC, notebook or server
central administration for thin clients
"skype for" matches only appear on the 3rd or 4th page.
The Java code is:
SearchManager searchManager = Search.getSearchManager(cache);
CacheQuery<MyType> query = searchManager.getQuery(booleanQuery, MyType.class);
List<MyType> pagedResult = query
    .maxResults(criteria.getPageSize())
    .firstResult(Math.toIntExact(criteria.getOffset()))
    .list();
The following logs the query shown above, which I then ran in Luke:
log.info("Lucene Search boolean query:" + booleanQuery);
Please advise.

There might be multiple reasons for the difference; let me try to compile a checklist.
Different index
The main difference I can think of is that Luke will always target a single index: the one you opened explicitly.
Hibernate Search will actually run the query on a composite view of all indexes containing MyType and its indexed subclasses (and any shards you might have). Often that's just one index, but could you possibly have multiple indexes open?
That will affect the results, and definitely the scores.
Different Lucene version
Verify that the Luke version you're using is built on the exact same version of Lucene.
Check the scoring
You can use a Projection query to have Infinispan Query / Hibernate Search explain the scores of all results it produced; this can be very useful to understand what is going on.
See FullTextQuery.EXPLANATION and FullTextQuery.SCORE in the Projections section, and Example 105.
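For illustration, a minimal sketch of such a projection query using the plain Hibernate Search API (this assumes a Hibernate Session named session is at hand; the Infinispan CacheQuery API exposes projections in a similar way):
import java.util.List;
import org.hibernate.search.FullTextQuery;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

// Project the score and the Lucene explanation alongside each entity.
FullTextSession fts = Search.getFullTextSession(session);
FullTextQuery ftQuery = fts.createFullTextQuery(booleanQuery, MyType.class);
ftQuery.setProjection(FullTextQuery.SCORE, FullTextQuery.EXPLANATION, FullTextQuery.THIS);
List<Object[]> rows = ftQuery.list();
for (Object[] row : rows) {
    // row[0] = score, row[1] = org.apache.lucene.search.Explanation, row[2] = entity
    log.info("score=" + row[0] + "\n" + row[1]);
}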
IndexReader
You can also use the SearchManager to get the low-level IndexReader(s) and run the query directly, by-passing Infinispan and Hibernate Search code.
SearchIntegrator si = searchManager.unwrap(SearchIntegrator.class);
IndexReader reader = si.getIndexReaderAccessor().open(MyType.class);
That might help narrow down which component is affecting your expected scoring.
The IndexReaderAccessor can open an index by type or by name. When opened by name, it opens that single index; when opened by type, it applies the rules needed to satisfy polymorphic queries and might return an aggregate. It might be interesting to experiment with both of them to verify they return the same results.
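A minimal sketch of that experiment (the index name "MyType" is an assumption; by default Hibernate Search names an index after its entity):
import org.apache.lucene.index.IndexReader;
import org.hibernate.search.indexes.IndexReaderAccessor;

IndexReaderAccessor accessor = si.getIndexReaderAccessor();
IndexReader byType = accessor.open(MyType.class); // polymorphic view, possibly an aggregate
IndexReader byName = accessor.open("MyType");     // exactly the one named index
try {
    // run the same Lucene query against both readers and compare hits and scores
} finally {
    accessor.close(byType);
    accessor.close(byName);
}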
...and check the basics
Make sure you're opening the same physical index :-)
In particular, recent versions of Infinispan might apply sharding transparently to improve data distribution in the cluster; this can be confusing when debugging scoring, especially if you're not aware of it.

Related

Set pagination off while searching documents in marklogic for a specified criteria

I am searching MarkLogic using JsonDocumentManager by providing a StructuredQueryDefinition. As a result I get a DocumentPage, which defaults to 50 records (the page length defaulted in JsonDocumentManager). But I want to retrieve all the documents in one go.
I can see two options to solve this: either increase the page length to a limit that the supplied criteria cannot exceed, or provide the page offset to jsonDocumentManager.search(queryDefinition, pageOffset) in a loop until documentPage.isLastPage() returns true; a sketch of that loop follows below.
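For reference, a minimal sketch of that loop, reusing the jsonDocumentManager and queryDefinition from above:
import com.marklogic.client.document.DocumentPage;
import com.marklogic.client.document.DocumentRecord;

long start = 1;
DocumentPage page;
do {
    page = jsonDocumentManager.search(queryDefinition, start);
    while (page.hasNext()) {
        DocumentRecord record = page.next();
        // process each document
    }
    start += page.getPageSize();
} while (!page.isLastPage());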
Could someone please let me know of further options, if any? Is there a pagination parameter I can switch to false so that MarkLogic does not do a paginated search?
As stated by @grtjn, it's always best to paginate, and even faster if you can run requests in parallel. For that reason, the Java API doesn't have a flag to get all results. Nor do the layers it builds on: the REST API and the search:search API.
The layer those build on, cts:search, uses server-side lazy evaluation to paginate efficiently under the hood while appearing to get all results. With that said, if you must have another option besides those you already know about, consider creating a Resource extension and having it call directly into the cts:search API.
For what it's worth, in MarkLogic 9 we'll be providing the Data Movement SDK, which will do all the pagination and parallelization for you under the hood on the client side. It is specifically designed for long-running data movement applications that need to export or manipulate large datasets. If that's of interest, please consider joining the early access program so you can try it out.

Why would Sybase not use a functional index?

I've created a functional index on a sybase table.
create index acadress_codpost_lower on acadress(LOWER(l5_codpost))
I then run a complex query that uses the index. Without the index it takes 17.086 seconds. With the index it takes 0.076 seconds.
I've run it from two different SQL clients and on both development and pre-prod Sybase servers. In all cases I see the acceleration from the index.
However, when we run an identical query from Java (and I know it's identical because I've logged the generated SQL and used it directly in the SQL clients), the performance is exactly the same as before we added the index.
What possible reason might there be for identical SQL queries to use the index when run from ACE and SQuirreL but not from Java?
My first thought is that maybe Sybase is caching execution plans for the Prepared Statements and not using the index. We've tried restarting the Java server several times (other services use the Sybase server so it's harder to bounce) and it has made no difference.
The other possibility is that we are using a very old version of the Sybase driver:
jConnect (TM) for JDBC(TM)/7.00(Build 26502)/P/EBF17993/JDK16/Thu Jun 3 3:09:09 2010
Is it possible that functional indexes are not supported by this version of jConnect?
Does anyone know if either of these theories might be correct, or whether there is something else I've missed?
I've been looking into this off and on for the past week or so and while I still do not have a definitive answer I do have a plausible theory.
I tried the suggestions from the comments, and thanks to them I was able to narrow the cause down to a single change. If I have the query:
"where LOWER(aca.l5_codpost) like '"+StringEscapeUtils.escapeSql("NG179GT".toLowerCase())+"'"
Then the query uses the index and returns extremely quickly.
If on the other hand I have:
where LOWER(aca.l5_codpost) like :postcode
query.setString("postcode", "NG179GT".toLowerCase());
Then it does not use the index.
The theory is that Sybase optimizes the query plan with no information about the contents of :postcode, so it does not use the index. It doesn't recompile the query once it does know the contents, so it never uses the index.
I've tried forcing the index using (index acadress_codpost_lower) and that made no difference.
I've tried set forceplan off and set literal_autoparam off and neither made any difference.
The only thing I can find that changes the behavior is directly embedding the value into the query string vs. having it as a parameter.
So the workaround is embedding the parameter into the query string, although I'd still like to know what's actually happening and solve the problem properly.

Flexible search in database

I have a legacy system that allows users to manage entities called "TRANSACTION" in the (MySQL) DB, mapped to a Transaction class in Java. Transaction objects have about 30 fields; some are columns in the DB, some are joins to other tables like CUSTOMER, PRODUCT, COMPANY, and so on.
Users have access to a "Search" screen, where they are allowed to search using a TransactionId and a couple of extra fields, but they want more flexibility. Basically, they want to be able to search using any field in TRANSACTION or any linked table.
I don't know how to make the search both flexible and quick. Is there any way? I don't think having an index for every combination of columns is a valid solution, but full table scans are also not acceptable... is there any reasonable design? I'm using Criteria to build the queries, but that is not the problem.
Also, I think MySQL is not using the right indexes: when I have Hibernate log the SQL, I can almost always improve the response time by forcing an index... I'm starting to use something like this trick adapted to Criteria to force a specific index, but I'm not proud of the "if" chain. I'm getting something like
if (queryDto.getFirstName() != null) {
    // force index "IDX_TX_BY_FIRSTNAME"
} else if (queryDto.getProduct() != null) {
    // force index "IDX_TX_BY_PRODUCT"
}
and it feels horrible
Sorry if the question is "too open", I think this is a typical problem, but I can't find a good approach
Hibernate is very good for writing data, while SQL still excels at reading it. jOOQ might be a better alternative in your case, and since you're using MySQL it's free of charge anyway.
jOOQ is like Criteria on steroids: you can build more complex queries using the exact syntax you'd use for native querying, with type safety and all the features your current DB has to offer.
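For illustration, a minimal sketch of dynamic filtering in jOOQ (this assumes a DSLContext named ctx and uses plain string table/field names; the generated type-safe classes would normally be preferable):
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;
import static org.jooq.impl.DSL.trueCondition;
import org.jooq.Condition;
import org.jooq.Result;

// Each filter only appears in the SQL when its value is present.
Condition condition = trueCondition();
if (queryDto.getFirstName() != null) {
    condition = condition.and(field("c.first_name").eq(queryDto.getFirstName()));
}
if (queryDto.getProduct() != null) {
    condition = condition.and(field("t.product_id").eq(queryDto.getProduct()));
}
Result<?> result = ctx.select()
        .from(table("TRANSACTION").as("t"))
        .join(table("CUSTOMER").as("c")).on(field("t.customer_id").eq(field("c.id")))
        .where(condition)
        .fetch();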
As for indexes, you can't simply index every field combination. It's better to index the most-used fields and create compound indexes that cover as many use cases as possible. Sometimes the query executor will not use an index because a scan is faster, so it's not always a good idea to force one. What works in your test environment might not hold for the production system.

Appengine Search API vs Datastore

I am trying to decide whether I should use the App Engine Search API or the Datastore for an App Engine Connected Android project. The only distinction that the Google documentation makes is:
... an index search can find no more than 10,000 matching documents.
The App Engine Datastore may be more appropriate for applications that
need to retrieve very large result sets.
Given that I am already very familiar with the Datastore: will someone please help me, assuming I don't need 10,000 results?
Are there any advantages to using the Search API versus using Datastore for my queries (per the quote above, it seems sensible to use one or the other)? In my case the end user must be able to search, update existing entries, and create new entities. For example if my app is a bookstore, the user must be able to add new books, add reviews to existing books, search for a specific book.
My data structure is such that the content will be supplied by the end user. Document vs Datastore entity: which is cheaper to update? $$, etc.
Can they supplement each other: Datastore and Search API? What's the advantage? Why would someone consider pairing the two? What's the catch/cost?
Some other info:
The datastore is a transactional system, which is important in many use cases. The search API is not. For example, you can't put and delete a document in a search index in a single transaction.
The datastore has a lot in common with a NoSql DB like Cassandra, while the search API is really a textual search engine, very similar to something like Lucene. If you understand how a reverse index works, you'll get a better understanding of how the search API works.
A very good reason to combine usage of the datastore API and the search API is that the datastore makes it very difficult to do some types of queries (e.g. free text queries, geospatial queries) that the search API handles very easily. Thus, you could store your main entities in the datastore, but then use the search API if you need to search in ways the datastore doesn't allow. Down the road, I think it would be great if the datastore and search API were more tightly integrated, for example by letting you do free text search against indexed Text fields, where app engine would automatically create a search Document Index behind the scenes for you.
The key difference is that with the Datastore you cannot search inside entities. If you have a book called "War and peace", you cannot find it if a user types "war peace" in a search box. The same with reviews, etc. Therefore, it's not really an option for you.
The most serious con of the Search API is eventual consistency, as stated here:
https://developers.google.com/appengine/docs/java/search/#Java_Consistency
It means that when you add or update a record with the Search API, the change may not be reflected immediately. Imagine a case where a user uploads a book or updates his account settings, and nothing changes because the update hasn't reached all servers yet.
I think the Search API is only good for one thing: search. It basically acts as a search engine for your data in the Datastore.
So my advice is to keep data in the Datastore where the user expects immediate results, and use the Search API to search data where the user won't expect immediate results.
The Datastore only provides a few query operators (=, !=, <, >); nested filters and multiple inequalities would be either costly or impossible (timeouts), and search results may contain a lot of false positives. You can do partial string search by tokenizing, but this will bloat your entity. The best way around these limitations is to use Structured Properties and/or Ancestor Queries.
The Search API, on the other hand, runs a full-text search on Search Documents, which is faster and more accurate than NDB queries, without relying on tokenized data. The downside is that it relies on the indexed data being kept up to date.
Use the Datastore to process your data (create, update, delete), then run a function to put that data into documents and cluster them using indexes, and run your searches with the Search API.
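To make the pairing concrete, a minimal sketch using a hypothetical "books" index (bookKeyString is a placeholder for the entity's key string; store the entity in the Datastore as usual, then mirror its searchable fields into a search Document):
import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Results;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;

Index index = SearchServiceFactory.getSearchService()
        .getIndex(IndexSpec.newBuilder().setName("books").build());

// After saving the book entity, mirror its searchable fields.
Document doc = Document.newBuilder()
        .setId(bookKeyString) // e.g. KeyFactory.keyToString(bookKey)
        .addField(Field.newBuilder().setName("title").setText("War and Peace"))
        .build();
index.put(doc);

// Free-text search the Datastore cannot do; map ids back to Datastore keys.
Results<ScoredDocument> hits = index.search("war peace");
for (ScoredDocument hit : hits) {
    String keyString = hit.getId(); // fetch the entity from the Datastore by key
}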

Strategy for locale sensitive sort with pagination

I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting follows the locale (LC_COLLATE) set for the database. Those rules are incorrect for users whose locale differs from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server side) code, I need to retrieve all rows from the database and then sort. This results in a tremendous performance hit given the amount of data. Thus I would like to avoid this.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the function that transforms a language-dependent text string, based on the current LC_COLLATE setting (more or less the current language), into a transformed string that yields the proper collation order for that language when sorted as a binary byte sequence (e.g. with strcmp()).
If you implement this for Postgres, say so that it takes a string and a collation order, you will be able to ORDER BY strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say, one per language) using that function to store the results of strxfrm(), so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about these issues: Collation, ICU (which, as far as I know, is also used by Java).
Alternatively, as a less sophisticated solution, if data input happens only through Java, you could compute these strxfrm() values in Java (which has a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
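For what it's worth, a minimal sketch of the Java side: java.text.Collator is Java's counterpart to strxfrm(), and a CollationKey's byte form sorts binary-wise in the locale's collation order (the Swedish locale is just an example):
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

Collator collator = Collator.getInstance(new Locale("sv", "SE"));
CollationKey key = collator.getCollationKey("Örebro");
byte[] sortable = key.toByteArray(); // persist next to the original text, index it, ORDER BY it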
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google turns up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the database's sort order on a per-query basis. Therefore, one has to consider other solutions.
If the number of results is really big (hundreds of thousands?), I have no solution, except showing only the number of results and asking the user to make a more precise request. Otherwise, server-side sorting can do, depending on the precise conditions...
In particular, using a cache could improve things tremendously. The first (unlimited) request to the database would not be much slower than a query limited in the number of results, and the subsequent requests would be much faster. Paging and re-sorting often make for several requests, so the cache would work well (even with a duration of a few minutes).
I use EhCache as a technical solution.
Sorting and paging go together: sort first, then page.
The raw results could be memorized in the cache.
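A minimal sketch of that sort-then-page approach (userLocale, cachedResults, pageNumber and pageSize are assumptions; the list would come from the cache after the first request):
import java.text.Collator;
import java.util.Collections;
import java.util.List;

Collator collator = Collator.getInstance(userLocale); // the requesting user's locale
List<String> names = cachedResults;                   // full result list from the cache
Collections.sort(names, collator);                    // Collator implements Comparator
int from = Math.min((pageNumber - 1) * pageSize, names.size());
int to = Math.min(from + pageSize, names.size());
List<String> pageView = names.subList(from, to);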
To reduce the performance hit, some hints:
you can run the query once just to get the result-set size, and warn the user if there are too many results (either ask for confirmation of a slow query, or ask for additional selection fields)
only request the columns you need and skip all others (usually some data is not shown immediately for all results, but displayed on mouse-over for example; this data can be requested lazily, only as needed, thereby reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the database columns or the computed values
if you have values repeated across multiple results, you can request that data/columns separately (so you retrieve it from the database once and cache it only once), and retrieve only a key (typically an id) in the main request.
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time and may not work anymore, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken on Postgres 8.4.3. I fixed it; you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you'll have to compile and install it by hand (as described in the README and INSTALL from the original module), but even so the sorting works incorrectly. I tried it on FreeBSD 8.0 with LC_COLLATE set to cs_CZ.UTF-8.
