How to combine Hibernate Search (Lucene) with paging and ACLs


I am using Spring Security with ACLs to secure the documents in my application. I also use Hibernate Search (on top of Lucene) to search for the documents, and this search supports paging. (The indexed documents are only the metadata of documents stored in a database.)
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Document.class).get();
Query query = queryBuilder.keyword().onFields(fieldNames.toArray(new String[0])).matching(searchQuery)
.createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, Document.class);
fullTextQuery.setFirstResult(pageable.getFirstItem());
fullTextQuery.setMaxResults(pageable.getPageSize());
Now I have to combine the paging with the ACLs. The only idea I have at the moment is to remove the paging from the FullTextQuery, read all search result documents, filter them by their ACLs, and then do the paging by hand. But I don't like that solution, because it loads all the documents instead of only the ones for the page.
Does anybody have a better idea?

If your ACL is not too complex, that is, you have a small, finite number of access levels, then I suggest using a Filter and a BitSet to implement it.
Here you'll find additional examples of ACL implementations with Filters:
http://java.dzone.com/articles/how-implement-row-level-access
Here you'll find a cached bitset filter implementation which has been in production for at least 5 years (it's my open source webapp for a searchable parallel text corpus)
Look for the addSourceFilter method
http://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/LuceneQueryBuilder.java
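The filter-plus-bitset idea can be illustrated without Lucene. What follows is only a minimal plain-Java sketch built on java.util.BitSet; it is not Lucene's Filter API and not the code from the linked projects, and the names AclBitSetPaging and page are made up for illustration. It assumes a bitset of readable document ids has already been computed (and possibly cached) per access level, and it shows that paging is then applied to the filtered hits only.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Plain-Java illustration of a bitset ACL filter combined with paging. */
final class AclBitSetPaging {

    /**
     * Returns one page of readable document ids.
     *
     * @param rankedHits  document ids in relevance order (hypothetical search output)
     * @param readable    bit i is set when the current access level may read document i
     * @param firstResult zero-based offset into the filtered hits
     * @param maxResults  page size
     */
    static List<Integer> page(int[] rankedHits, BitSet readable, int firstResult, int maxResults) {
        List<Integer> page = new ArrayList<>();
        if (maxResults <= 0) {
            return page;
        }
        int visible = 0; // readable hits seen so far
        for (int docId : rankedHits) {
            if (!readable.get(docId)) {
                continue; // hidden by the ACL, does not count towards paging
            }
            if (visible++ < firstResult) {
                continue; // belongs to an earlier page
            }
            page.add(docId);
            if (page.size() == maxResults) {
                break; // page is full, no need to look at further hits
            }
        }
        return page;
    }
}
```

With hits {5, 2, 9, 0, 7, 3} and readable bits {0, 2, 3, 7}, calling page(hits, readable, 1, 2) returns [0, 7]. In a real Lucene setup the bitset would be applied inside the search itself, so hidden documents never reach the result list.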

I have hit the same problem too and I don't think there is a simple answer.
I think there are only two solutions. The first is the one you have suggested, which has the performance problems you've described: you have to load the documents, resolve the ACL for each result, and then do your own paging. The alternative is to push this work to the indexing side and index your ACL in Lucene. This gives you the search performance, hiding the results a user can't see by adding filter terms based on the current user/group/permissions/roles, but at the expense of maintaining the index with ACL information. If your ACL is simple then this may be an option. If your ACL is hierarchical then it's still an option, but more complicated. It's also tricky to keep your index up to date with the ACL.
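To make the index-side approach a bit more concrete, here is a minimal plain-Java sketch of the matching rule. It assumes each document is indexed with a set of ACL tokens and that the current user is expanded into the same kind of tokens at query time. The token prefixes ("u:", "g:", "r:") and the names AclTokens, principalTokens and mayRead are purely illustrative, not part of Hibernate Search or Spring Security.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrates ACL tokens stored with each document and matched at query time. */
final class AclTokens {

    /** Expands the current user into the tokens a document ACL field may contain. */
    static Set<String> principalTokens(String user, Set<String> groups, Set<String> roles) {
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add("u:" + user);
        for (String group : groups) {
            tokens.add("g:" + group);
        }
        for (String role : roles) {
            tokens.add("r:" + role);
        }
        return tokens;
    }

    /** A document is visible when its ACL tokens share at least one token with the user. */
    static boolean mayRead(Set<String> documentAclTokens, Set<String> principalTokens) {
        return !Collections.disjoint(documentAclTokens, principalTokens);
    }
}
```

In an actual index, mayRead would correspond to a filter that ORs one term query per principal token on the ACL field, so the engine only counts and pages over visible documents. The cost mentioned above remains: every ACL change means re-indexing the affected documents.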
The fact that you are starting to look into this sort of functionality may indicate that you are beginning to stretch your Database/Hibernate/Lucene solution. Maybe a content repository like Jackrabbit may be a better fit? I guess this is probably a step too far but it may be worth taking a look to see how it does it. Alternatively take a look at SOLR, particularly this issue which describes what a thorny problem it is.

Here is my ACL implementation with complex User/Group/Role hierarchical ACL system using pure Lucene queries (on top of Hibernate Search).

Related

Lucene Search Luke vs Hibernate Search different result

I am running the following lucene query phrase in luke:
+(debtorNumber:10200000 originalDebtorNumber:10200000) +(serviceName:"skype for"^840.0 (serviceName:for* serviceId:for*) (serviceName:skype* serviceId:skype*))
Luke shows the expected results at the top, for example:
Skype for Business for Managers
Microsoft Skype for Business Conferencing (Plan2)
Telephone dial-in for Skype for Business Conferencing
and so on.
The same query executed with Hibernate Search shows a different result.
I get, for example, the following results:
antivirus protection for your PC, notebook or server
central administration for thin clients
Results containing "skype for" only appear on the 3rd or 4th page.
The java code is:
SearchManager searchManager = Search.getSearchManager(cache);
CacheQuery<MyType> query = searchManager.getQuery(booleanQuery, MyType.class);
List<MyType> pagedResult = query
    .maxResults(criteria.getPageSize())
    .firstResult(Math.toIntExact(criteria.getOffset()))
    .list();
This logs the above query which I used in Luke
log.info("Lucene Search boolean query:" + booleanQuery);
Please advise.

There might be multiple reasons for the difference; let me try to compile a checklist.
Different index
The main difference I can think of is that Luke will always target a single index: the one you opened explicitly.
Hibernate Search will actually run the query on a composite view of all indexes containing MyType and indexed subclasses (and any shards you might have). Often that's just one index, but you possibly have multiple indexes opened?
That will affect the results, and definitely the scores.
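To see why the set of indexes matters for scores, consider the inverse document frequency used by Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The sketch below is only an illustration under two assumptions: that the classic similarity is in effect (newer Lucene versions default to BM25), and that statistics are aggregated over all indexes in the composite view. The class name IdfComparison is made up.

```java
/** Shows how the same term gets a different weight when index statistics change. */
final class IdfComparison {

    /** Classic TF-IDF inverse document frequency: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

For a term found in 9 documents, idf(9, 1000) is about 5.6 while idf(9, 100000) is about 10.2. So the same query clause can weigh quite differently depending on whether one index or an aggregate of several is searched, which changes the relative ranking of clauses such as the boosted phrase and the prefix matches.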
Different Lucene version
Verify that the Luke version you're using is using the exact same version of Lucene.
Check the scoring
You can use a Projection query to have Infinispan Query / Hibernate Search explain the scores of all results it produced; this can be very useful to understand what is going on.
See FullTextQuery.EXPLANATION and FullTextQuery.SCORE in section Projections, and Example 105.
IndexReader
You can also use the SearchManager to get the low-level IndexReader(s) and run the query directly, by-passing Infinispan and Hibernate Search code.
SearchIntegrator si = searchManager.unwrap(SearchIntegrator.class);
si.getIndexReaderAccessor(). ...
That might help narrow down which component is affecting the scoring you expect.
The IndexReaderAccessor can open an index by type or by name. When opened by name it will open the single index; when opened by type it will apply the rules to satisfy polymorphic queries and might return an aggregate. It might be interesting to experiment with both to verify that they return the same results.
...and check the basics
Make sure you're opening the same physical index :-)
In particular recent versions of Infinispan might apply sharding transparently to improve data distribution in the cluster, this might be confusing when debugging scoring - especially when you're not aware of it.

Set pagination off while searching documents in marklogic for a specified criteria

I am doing a search in MarkLogic using JsonDocumentManager by providing a StructuredQueryDefinition. As a result I get a DocumentPage, which defaults to 50 records (the default page length in JsonDocumentManager). But I want to retrieve all the documents in one go.
I can see two options to solve this: either increase the page length to a limit that cannot be exceeded for the criteria I am supplying, or call jsonDocumentManager.search(queryDefinition, pageOffset) with an increasing page offset in a loop until documentPage.isLastPage() returns true.
Could someone please let me know of any further options? Is there a pagination parameter I can switch to false so that MarkLogic does not do a paginated search?

As stated by @grtjn, it's always best to paginate, and it's even faster if you can run requests in parallel. For that reason, the Java API doesn't have a flag to get all results. Nor do the layers it builds on: the REST API and the search:search API.
The layer those build on, cts:search, uses server-side lazy evaluation to paginate efficiently under the hood while it appears to get all results. With that said, if you must have another option besides those you already know about, consider creating a resource extension and have it call the cts:search API directly.
For what it's worth, in MarkLogic 9 we'll be providing the Data Movement SDK which will do all the pagination and parallelization for you under the hood on the client side. It is specifically designed for long-running data movement applications that need to export or manipulate large datasets. If that's of interest, please consider joining the early access program and you can try it out.
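The page-by-page loop from the question can be sketched roughly as follows. To keep the example self-contained it does not use the MarkLogic client types: ResultPage is a hypothetical stand-in for DocumentPage, and the search function stands in for a call like jsonDocumentManager.search(queryDefinition, start). It also assumes that the start offset is 1-based.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

/** Collects all results by requesting one page after another. */
final class PageLoop {

    /** Minimal stand-in for a result page; not the MarkLogic DocumentPage type. */
    interface ResultPage<T> {
        List<T> items();

        boolean isLastPage();
    }

    /**
     * Fetches pages until the last page is reached.
     *
     * @param search     returns the page beginning at the given 1-based start offset
     * @param pageLength number of results per page, used to advance the offset
     */
    static <T> List<T> fetchAll(LongFunction<ResultPage<T>> search, int pageLength) {
        List<T> all = new ArrayList<>();
        long start = 1;
        while (true) {
            ResultPage<T> page = search.apply(start);
            all.addAll(page.items());
            // Stop on the last page; the empty check guards against an endless loop.
            if (page.isLastPage() || page.items().isEmpty()) {
                break;
            }
            start += pageLength;
        }
        return all;
    }
}
```

Because each page is requested independently, the same structure also lends itself to issuing the requests in parallel once the total number of results is known.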

Appengine Search API vs Datastore

I am trying to decide whether I should use App-engine Search API or Datastore for an App-engine Connected Android Project. The only distinction that the google documentation makes is
... an index search can find no more than 10,000 matching documents. The App Engine Datastore may be more appropriate for applications that need to retrieve very large result sets.
Given that I am already very familiar with the Datastore: Will someone please help me, assuming I don't need 10,000 results?
Are there any advantages to using the Search API versus using Datastore for my queries (per the quote above, it seems sensible to use one or the other)? In my case the end user must be able to search, update existing entries, and create new entities. For example if my app is a bookstore, the user must be able to add new books, add reviews to existing books, search for a specific book.
My data structure is such that the content will be supplied by the end user. Document vs Datastore entity: which is cheaper to update? $$, etc.
Can they supplement each other: Datastore and Search API? What's the advantage? Why would someone consider pairing the two? What's the catch/cost?

Some other info:
The datastore is a transactional system, which is important in many use cases. The search API is not. For example, you can't put and delete a document in a search index in a single transaction.
The datastore has a lot in common with a NoSQL DB like Cassandra, while the search API is really a textual search engine, very similar to something like Lucene. If you understand how an inverted index works, you'll get a better understanding of how the search API works.
A very good reason to combine usage of the datastore API and the search API is that the datastore makes it very difficult to do some types of queries (e.g. free text queries, geospatial queries) that the search API handles very easily. Thus, you could store your main entities in the datastore, but then use the search API if you need to search in ways the datastore doesn't allow. Down the road, I think it would be great if the datastore and search API were more tightly integrated, for example by letting you do free text search against indexed Text fields, where app engine would automatically create a search Document Index behind the scenes for you.

The key difference is that with the Datastore you cannot search inside entities. If you have a book called "War and peace", you cannot find it if a user types "war peace" in a search box. The same with reviews, etc. Therefore, it's not really an option for you.
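The "war peace" example comes down to what a full-text engine keeps that a datastore does not: an inverted index from tokens to documents. Here is a toy plain-Java sketch of that idea; it is not the App Engine Search API, and the class name TinyInvertedIndex and its tokenization rules are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: maps each lower-cased token to the ids of documents containing it. */
final class TinyInvertedIndex {

    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the ids of documents that contain every token of the query. */
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(token, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs); // keep only documents matching all tokens so far
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }

    /** Splits on anything that is not a letter or digit and lower-cases the tokens. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

After add("b1", "War and peace"), a search for "war peace" finds b1, because both tokens point to that document. An equality filter on the title property would only match the exact string.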

The most serious con of Search API is Eventual Consistency as stated here:
https://developers.google.com/appengine/docs/java/search/#Java_Consistency
It means that when you add or update a record with the Search API, the change may not be reflected immediately. Imagine a case where a user uploads a book or updates their account settings, and nothing changes because the change hasn't reached all servers yet.
I think Search API is only good for one thing: Search. It basically acts as a search engine for your data in Datastore.
So my advice is to keep data for which the user expects an immediate result in the Datastore, and to use the Search API to search data for which the user won't expect an immediate result.

The Datastore only provides a few query operators (=, !=, <, >). Nested filters and multiple inequalities are either costly or impossible (timeouts), and search results may contain a lot of false positives. You can do partial string search by tokenizing, but this will bloat your entities. The best way around these limitations is to use Structured Properties and/or Ancestor Queries.
The Search API, on the other hand, runs a full-text search on search documents, which is faster and more accurate than NDB queries and does not rely on tokenized data. The downside is that it relies on the documents being kept up to date.
Use the Datastore to process your data (create, update, delete), then run a function that puts this data into the Search API as documents grouped into indexes, and run the searches using the Search API.

how to cache the objects for display tags in jsp JSTL

I am using the displaytag for the pagination purpose.
I have millions of records in the DB, and going from one page to the next takes quite a long time.
Is there a way to cache the objects that need to be shown, so that moving between pages is faster?
Requirement: we are querying and displaying the files in directories in a Linux environment. Each folder has thousands of files.

How are you reading from the DB? It would be good to see more of your implementation.
As a general guideline:
If you read all your data into a list from the DB and only display a page, you will be wasting resources (processing and memory). This can kill your app. Try an approach that will just go for the page you're needing.
If you are using a framework like Hibernate, you can implement caching and paging without much trouble.
If you are using direct JDBC, you will have to limit the rows returned by your query. The proper technique depends on the database engine you're using. Please provide this information.
Be aware that your problem might be the amount of read information rather than a caching problem (just depends on the implementation).
As a sample, in Oracle, you would need to know the page and the page size. With both, you could limit the query with "where rownum <= pagesize * page" (or something similar, depending on how you index the pages) and navigate to the first row you need with the absolute(int) method of ResultSet. On other engines it might be more efficient.
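A variant of the Oracle approach that avoids skipping rows on the client is the nested ROWNUM query. This is a different technique from the absolute(int) navigation described above, and the sketch is only illustrative: page numbers are assumed to be 1-based, and the class and method names are made up. The two placeholders are meant to be bound through a PreparedStatement, first to lastRow and then to rowsToSkip.

```java
/** Builds a paginated Oracle query using the nested ROWNUM pattern. */
final class OraclePaging {

    /** 1-based index of the last row on the given 1-based page. */
    static long lastRow(int page, int pageSize) {
        return (long) page * pageSize;
    }

    /** Number of rows that precede the given 1-based page. */
    static long rowsToSkip(int page, int pageSize) {
        return (long) (page - 1) * pageSize;
    }

    /**
     * Wraps an already ordered query so that only one page is returned.
     * Bind the first placeholder to lastRow and the second to rowsToSkip.
     */
    static String paginate(String orderedQuery) {
        return "SELECT * FROM ("
                + " SELECT q.*, ROWNUM rn FROM (" + orderedQuery + ") q WHERE ROWNUM <= ?"
                + ") WHERE rn > ?";
    }
}
```

For page 3 with a page size of 20, the bound values are 60 and 40, so only rows 41 to 60 travel to the application. The inner query needs a stable ORDER BY, otherwise pages may overlap or skip rows.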
Now, if you're paginating with some framework, normally they support some implementation of a "DataProvider" so you can control how to fetch results for each page.

Caching solutions and Querying

Are there any in-memory/caching solutions for java that allow for a form of Querying for specific attributes of objects in the Cache?
I realize this is something that a full blown database would be used for, but I want to be able to have the speed/performance of a cache with the Querying ability of a database.

JBoss Cache has search functionality. It's called JBossCacheSearchable. From the site:
This is the integration package between JBoss Cache and Hibernate Search.
The goal is to add search capabilities to JBoss Cache. We achieve this by using Hibernate Search to index user objects as they are added to the cache and modified. The cache is queried by passing in a valid Apache Lucene query which is then used to search through the indexes and retrieve matching objects from the cache.
Main JBoss Cache page: http://www.jboss.org/jbosscache/
JBossCacheSearch: http://www.jboss.org/community/docs/DOC-10286

Nowadays the answer should be updated to Infinispan, the successor of JBoss Cache, which has much-improved search technology.

Terracotta or GBeans or POJOCache

At first, HSQLDB came to mind, but that's an in-memory relational database rather than an object database. Might want to look at this list. There's a few object databases there, one of which might meet your needs.

Look at db4o, a rather lightweight Java object database. You can even query the data using regular Java code:
List<Student> students = database.query(new Predicate<Student>() {
    public boolean match(Student student) {
        return student.getAge() < 20
            && student.getGrade().equals(gradeA);
    }
});
(From this article).

Another idea is to use Lucene and a RAMDirectory implementation of Directory to index what you put into your cache. That way, you can query using all the search engine query features which Lucene provides.
In your case, you will probably index the relevant properties of your objects as-is (without using an Analyzer) and query using a boolean equality operator.
Lucene is very lightweight, performant, thread-safe and memory consumption is low.
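The idea of indexing selected properties as-is and querying them by equality can be shown with a plain-Java stand-in. The sketch below does not use Lucene or RAMDirectory; it keeps a hash-based secondary index next to the cache, which is roughly what an un-analyzed, exact-match field gives you. The class IndexedCache and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** A cache with one exact-match secondary index on an attribute of the cached values. */
final class IndexedCache<K, V> {

    private final Map<K, V> cache = new HashMap<>();
    // attribute value -> keys of the cached objects that currently have that value
    private final Map<Object, Set<K>> index = new HashMap<>();
    private final Function<? super V, ?> attribute;

    IndexedCache(Function<? super V, ?> attribute) {
        this.attribute = attribute;
    }

    void put(K key, V value) {
        V previous = cache.put(key, value);
        if (previous != null) {
            // Drop the stale index entry of the replaced value.
            Set<K> keys = index.get(attribute.apply(previous));
            if (keys != null) {
                keys.remove(key);
            }
        }
        index.computeIfAbsent(attribute.apply(value), a -> new HashSet<>()).add(key);
    }

    /** Returns all cached values whose indexed attribute equals the given value. */
    List<V> findBy(Object attributeValue) {
        List<V> matches = new ArrayList<>();
        for (K key : index.getOrDefault(attributeValue, Collections.<K>emptySet())) {
            matches.add(cache.get(key));
        }
        return matches;
    }
}
```

A lookup by attribute is then a hash access instead of a scan over the whole cache. A search library adds what this sketch lacks: several fields, boolean combinations, ranges and text analysis.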

You might want to check out this library:
http://casperdatasets.googlecode.com
This is a dataset technology. It supports tabular data (either from a database or constructed in code), and you can then construct queries and filters against the dataset (and sort it), all in memory. It's fast and easy to use. Most importantly, you can perform queries against any column or attribute of the dataset.
