Apache Jackrabbit vs Rolling your own Content Repository - java

I'm looking at using Apache Jackrabbit for a personal project, but would like to know the advantages/disadvantages of using it versus a custom content repository or a traditional database with content stored in the file system.
I'm not sure that I like the tree structure of the JCR, so if you can explain the design decisions (advantages/disadvantages) behind it, that would also help.

There are several benefits to using an existing JCR implementation, such as Jackrabbit or ModeShape. First and foremost, you immediately get a lot of functionality for free:
Hierarchical data storage - Lots of data is naturally hierarchical, and a JCR repository allows you to organize your data in the way that your applications will access it. Anything keyed by URI, date/time, category, or folder structure is a natural fit for storage in a repository.
Use a standard Java API - The JCR API is a standard Java API with a TCK, meaning your applications can rely upon standard behavior and are not tied to a particular JCR implementation (see the sketch after this list).
Flexible schema enforcement - You can choose whether and where the node structure and property values are enforced by how you define and use node types.
Data evolution - Your data structure will likely evolve over time, and JCR makes this very easy to do.
Query and full-text search - Your applications can navigate through the data, or they can query for content independently of location. The JCR query languages are very rich, and they support full-text search.
Transactions - You can control the transaction boundaries, which means that JCR Sessions can participate in JTA transactions controlled by your application or its container.
Events - Your applications can be notified when nodes and/or properties are added, changed, or removed.
Clustering - Scale your application by clustering the JCR Repository across multiple processes. Each implementation configures clustering differently, but they behave the same way to client applications.
Versioning - JCR includes a standard mechanism for versioning content. It might not be suitable to all use cases, but it is extremely handy when it is a fit.
Locking - JCR includes a standard mechanism for short-term locks, which are useful when your applications need to ensure parts of the repository are updated by only one process.
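To give a feel for that standard API, here is a minimal sketch against javax.jcr using Jackrabbit's TransientRepository; the node names, property, and query string are made up for illustration:

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryResult;

    import org.apache.jackrabbit.core.TransientRepository;

    public class JcrSketch {
        public static void main(String[] args) throws Exception {
            Repository repository = new TransientRepository(); // in-process Jackrabbit
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                // Hierarchical storage: create /articles/hello under the root node
                Node articles = session.getRootNode().addNode("articles");
                Node post = articles.addNode("hello", "nt:unstructured");
                post.setProperty("title", "Hello JCR");
                session.save();

                // Standard query language (JCR-SQL2) with full-text search
                QueryResult result = session.getWorkspace().getQueryManager()
                        .createQuery(
                            "SELECT * FROM [nt:unstructured] AS n WHERE CONTAINS(n.*, 'Hello')",
                            Query.JCR_SQL2)
                        .execute();
                for (NodeIterator it = result.getNodes(); it.hasNext();) {
                    System.out.println(it.nextNode().getPath());
                }
            } finally {
                session.logout();
            }
        }
    }

The same code runs against any compliant implementation; only the way you obtain the Repository instance differs.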
If some of these features are important to you, then you should definitely consider reusing an existing implementation rather than rolling your own - otherwise you'll be spending all your time implementing these kinds of features.
However, if none of these features fits your use case, then you should consider other data storage technologies:
Relational databases work great when your data is very constrained, when your schema is not likely to change too often, or when your data is flat (lots of values of a few key types). (Note that many JCR implementations can store the content inside relational databases, so "I have to store my data in a relational database" is not really a good reason for your application to directly use a relational database.)
Key-value stores work great when you need to store arbitrary values by unique keys, and all access is via gets and puts. The values are usually opaque to the store.
Document stores are similar to key-value stores, except that the store is aware of the structure of the values. Some document stores support queries.
Other storage technologies have their own sweet spot.
Other things to consider are whether you need or want an eventually-consistent database or a strongly-consistent database. It's far easier to write many "conventional" applications against strongly-consistent databases, and in fact most JCR repositories (including Jackrabbit and ModeShape) are strongly-consistent.

Related

Is there a mature Java Workflow Engine for BPM backed by NoSQL?

I am researching how to build a general application or microservice to enable building workflow-centric applications. I have done some research on frameworks (see below), and the most promising candidates share a hard reliance on RDBMSes to store workflow and process state, combined with JPA-annotated entities. In my opinion, this undermines the possibility of designing a general, data-driven workflow microservice. It seems that a truly general workflow system could be built on NoSQL solutions like MongoDB or Cassandra by storing data objects and rules in JSON or XML. These would allow executing code to enforce types or schemas while using one or two simple Java objects to retrieve and save entities. As I see it, this could enable a single application to be deployed as a Controller for different domains' Model-View pairs without modification (admittedly given a very clever interface).
I have tried to find a workflow engine/BPM framework that supports NoSQL backends. The closest I have found is Activiti-Neo4J, which appears to be an abandoned project providing a connector between Activiti and Neo4j.
Is there a Java workflow engine/BPM framework that supports NoSQL backends and generalizes data objects without requiring specific POJO entities?
If I were to give up on my ideal, magically general solution, I would probably choose a framework like jBPM or Activiti, since they have great feature sets and are mature. In trying to find other candidates, I have found a veritable graveyard of abandoned projects, like this one on Java-Source.net.
Yes, Temporal Workflow has pluggable persistence and runs on Cassandra as well as on SQL databases. It has been tested with up to 100 Cassandra nodes and can support tens of thousands of events per second and hundreds of millions of open workflows.
It allows you to model your workflow logic as plain old Java classes and ensures that the code is fully fault tolerant and durable across all sorts of failures. This includes local variables and threads.
See this presentation, which goes into more detail about the programming model.
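To illustrate the programming model, here is a minimal sketch using Temporal's Java SDK; the OrderWorkflow/OrderActivities names and the one-day timer are hypothetical, not taken from Temporal's docs:

    import java.time.Duration;

    import io.temporal.activity.ActivityInterface;
    import io.temporal.activity.ActivityOptions;
    import io.temporal.workflow.Workflow;
    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;

    @WorkflowInterface
    public interface OrderWorkflow {
        @WorkflowMethod
        void processOrder(String orderId);
    }

    @ActivityInterface
    interface OrderActivities {
        void reserveInventory(String orderId);
        void chargeCustomer(String orderId);
    }

    class OrderWorkflowImpl implements OrderWorkflow {
        private final OrderActivities activities = Workflow.newActivityStub(
                OrderActivities.class,
                ActivityOptions.newBuilder()
                        .setStartToCloseTimeout(Duration.ofMinutes(5))
                        .build());

        @Override
        public void processOrder(String orderId) {
            activities.reserveInventory(orderId);
            // Durable timer: survives process crashes and redeploys, along with
            // the workflow's local variables and call stack
            Workflow.sleep(Duration.ofDays(1));
            activities.chargeCustomer(orderId);
        }
    }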
I think the reason workflow engines are often based on an RDBMS is not the database schema, but rather the coupling to a transaction-safe data store.
Transactional robustness is an important factor for workflow engines, especially for long-running or nested transactions which are typical for complex workflows.
So maybe this is one reason why most engines (like Activiti) did not focus on a data-driven approach. (I am not talking about data replication here, which is covered by NoSQL databases in most cases.)
If you take a look at the Imixs-Workflow project, you will find a different approach based on Java Enterprise. This engine uses a generic data object which can consume any kind of serializable data values. The problem of data retrieval is solved with Lucene search technology. Each object is translated into a virtual document with name/value pairs for each item. This makes it easy to search through the processed business data as well as to query structured workflow data like the status information or the process owners. So this is one possible solution.
Apart from that, you always have the option to store your business data into a NoSQL database. This is independent from the workflow data of a running process instance as far as you link both objects together.
Going back to the aspect of transactional robustness, it is a good idea to store the reference to your NoSQL data store in the process instance, which is transaction-aware. Also take a look here.
So the only problem you can run into is the fact that it is very hard to synchronize a transaction context from EJB/JPA to an 'external' NoSQL database. For example: what do you do when your data was successfully saved into your NoSQL data store (e.g. Cassandra), but the transaction of the workflow engine fails and a rollback is triggered?
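As a hedged sketch of that linking idea, here the business data lands in MongoDB (via the official Java driver) and only the returned id is kept in the process instance; the plain Map below stands in for the engine's generic workitem and is not a real Imixs API:

    import java.util.HashMap;
    import java.util.Map;

    import org.bson.Document;
    import org.bson.types.ObjectId;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class LinkBusinessData {
        public static void main(String[] args) {
            // Business data lives in the NoSQL store...
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> orders =
                    client.getDatabase("business").getCollection("orders");
            Document order = new Document("customer", "ACME").append("total", 99.90);
            orders.insertOne(order); // driver fills in the generated _id
            ObjectId id = order.getObjectId("_id");

            // ...while the transaction-aware process instance keeps only the reference.
            // (Stand-in for the engine's generic workitem, not a real Imixs call.)
            Map<String, Object> processInstance = new HashMap<>();
            processInstance.put("order.ref", id.toHexString());
            client.close();
        }
    }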
The designers of the Activiti project have also been aware of the problem you have stated, but knew it would take quite a rewrite to implement such flexibility, which, arguably, should have been designed into the project from the beginning. As you'll see in the link provided below, the problem has been a lack of interfaces against which to code implementations other than that of a relational database. With version 6 they went ahead and ripped off the bandaid and refactored the framework around a set of interfaces for which different implementations (think Neo4j, MongoDB, or whatever other persistence technology you fancy) can be written and plugged in.
In the linked article below, they provide some code examples for a simple in-memory implementation of the aforementioned interfaces. It looks pretty cool and sounds like it may be precisely what you're looking for.
https://www.javacodegeeks.com/2015/09/pluggable-persistence-in-activiti-6.html
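The exact Activiti 6 interfaces are shown in the article; purely for flavor, here is a hypothetical in-memory data manager in the same spirit (the names below are mine, not Activiti's):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical contract in the spirit of Activiti 6's pluggable data managers
    interface TaskDataManager {
        TaskEntity findById(String id);
        void insert(TaskEntity task);
        void delete(String id);
    }

    class TaskEntity {
        String id;
        String name;
    }

    // Swap this for a Neo4j- or MongoDB-backed implementation without touching
    // the engine code, which programs only against the interface.
    class InMemoryTaskDataManager implements TaskDataManager {
        private final Map<String, TaskEntity> store = new ConcurrentHashMap<>();

        @Override public TaskEntity findById(String id) { return store.get(id); }
        @Override public void insert(TaskEntity task) { store.put(task.id, task); }
        @Override public void delete(String id) { store.remove(id); }
    }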

Solr as primary database [duplicate]

My team is working with a third party CMS that uses Solr as a search index. I've noticed that it seems like the authors are using Solr as a database of sorts in that each document returned contains two fields:
The Solr document ID (basically a classname and database id)
An XML representation of the entire object
So basically it runs a search against Solr, downloads the XML representation of the object, and then instantiates the object from the XML rather than looking it up in the database using the id.
My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.
Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?
EDIT: When I say "XML representation" - I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.
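For reference, here is a minimal SolrJ sketch of the alternative described above (query Solr for ids, then hydrate from the database); the core URL, field, and table names are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchThenHydrate {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
            SolrQuery query = new SolrQuery("title:widget");
            query.setFields("id"); // fetch only the key from Solr
            QueryResponse response = solr.query(query);

            try (Connection db =
                    DriverManager.getConnection("jdbc:postgresql://localhost/app")) {
                PreparedStatement ps =
                        db.prepareStatement("SELECT * FROM items WHERE id = ?");
                for (SolrDocument doc : response.getResults()) {
                    ps.setLong(1, Long.parseLong((String) doc.getFieldValue("id")));
                    try (ResultSet row = ps.executeQuery()) {
                        // ...instantiate the domain object from the database row
                    }
                }
            }
            solr.close();
        }
    }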
Yes, you can use Solr as a database, but there are some really serious caveats:
Solr's most common access pattern, which is over HTTP, doesn't respond particularly well to batch querying. Furthermore, Solr does NOT stream data, so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when you design large-scale data access patterns with Solr.
Although Solr's performance scales horizontally (more machines, more cores, etc.) as well as vertically (more RAM, better machines, etc.), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.
Developers who are used to relational databases will often run into problems when they use the same DAO design patterns in a Solr paradigm, because of the way Solr uses filters in queries. There will be a learning curve for developing the right approach to building an application that uses Solr for part of its large queries or stateful modifications.
The "enterprisy" tools that allow for advanced session management and statefull entities that many advanced web-frameworks (Ruby, Hibernate, ...) offer will have to be thrown completely out the window.
Relational databases are meant to deal with complex data and relationships, and they are thus accompanied by state-of-the-art metrics and automated analysis tools. With Solr, I've found myself writing such tools and manually stress-testing a lot, which can be a time sink.
Joining: this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In Solr, there aren't any robust methods for joining data across indices.
Resiliency: for high availability, SolrCloud uses a distributed file system underneath (i.e. HCFS). This model is quite different from that of a relational database, which usually achieves resiliency using slaves and masters, or RAID, and so on. So you have to be ready to provide the resiliency infrastructure Solr requires if you want it to be cloud-scalable and resilient.
That said, there are plenty of obvious advantages to Solr for certain tasks (see http://wiki.apache.org/solr/WhyUseSolr): loose queries are much easier to run and return meaningful results. Indexing is done by default, so most arbitrary queries run pretty effectively (unlike an RDBMS, where you often have to optimize and de-normalize after the fact).
Conclusion: even though you CAN use Solr as an RDBMS, you may find (as I have) that there is ultimately "no free lunch", and the cost savings of super-cool Lucene text searches and high-performance, in-memory indexing are often paid for by less flexibility and the adoption of new data access workflows.
It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving this using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as default response format is largely irrelevant, you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.
This was probably done for performance reasons; if it doesn't cause any problems, I would leave it alone. There is a big grey area of what should be in a traditional database vs. a Solr index. I've seen people do similar things (usually key-value pairs or JSON instead of XML) for UI presentation and only fetch the real object from the database if needed for updates/deletes. All reads just go to Solr.
I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.
I had a similar idea, in my case to store some simple JSON data in Solr, using Solr as a database. However, a BIG caveat that changed my mind was the Solr upgrade process.
Please see https://issues.apache.org/jira/browse/LUCENE-9127.
Apparently, there has in the past (pre-v6) been a recommendation to re-index documents after major version upgrades (not just run IndexUpdater), although you did not have to do this to maintain functionality (I cannot vouch for this myself; it is just what I have read). Now, if you have upgraded two major versions but did not re-index (actually, fully delete the docs and then the index files themselves) after the first major version upgrade, your core is no longer recognized.
Specifically in my case, I started with Solr v6. After upgrading to v7, I ran IndexUpdater, so the index was then at v7. After upgrading to v8, the core would not load. I had no idea why - my index was at v7, so that satisfies the version-minus-1 compatibility statement from Solr, right? Well, no - wrong.
I did an experiment. I started fresh from v6.6, created a core, and added some documents. I upgraded to v7.7.3 and ran IndexUpdater, so the index for that core was now at v7.7.3. I upgraded to v8.6.0, after which the core would not load. Then I repeated the same steps, except after running IndexUpdater I also re-indexed the documents. Same problem. Then I repeated everything again, except I did not just re-index: I deleted the docs from the index, deleted the index files, and then re-indexed. Now, when I arrived at v8.6.0, my core was there and everything was OK.
So, the takeaway for the OP or anyone else contemplating this idea (using Solr as db) is that you must EXPECT and PLAN to re-index your documents/data from time to time, meaning you must store them somewhere else anyway (a previous poster alluded to this idea), which sort of defeats the concept of a database. Unless of course your Solr core/index will be short-lived (not last more than one major version Solr upgrade), you never intend to upgrade Solr more than 1 version, or the Solr devs change this upgrade limitation. So, as an index for data stored elsewhere (and readily available for re-indexing when necessary), Solr is excellent. As a database for the data itself, it strongly "depends".
Adding to @Jayunit100's response: using Solr as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between what you write and when you can read it back.
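That lag is governed by Solr's commit settings; as a small illustration, here is a SolrJ write using commitWithin (the core name and fields are hypothetical):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title", "widget");
            // commitWithin = 5000 ms: the update is accepted immediately but only
            // becomes visible to searches once the commit fires - hence the read lag.
            solr.add(doc, 5000);
            solr.close();
        }
    }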

Migration from one ORM to another

Here is my problem. I am using the Play 2 Framework right now, and it provides Ebean as my default ORM product. I know Java fairly well and decided to write a website using Java, but I also want to learn Go and ultimately port my website's backend code to Go (using Go's Revel framework). I know my data will still be there, but I will have to use a different ORM product to rewrite all the models. Will this cause a problem even though I maintain the exact same database structure?
It depends on what your definition of 'problem' is.
ORM frameworks provide facilities to map database information (relational data) onto OOP objects. ORM frameworks vary in which DBMSes they support, the default naming rules when mapping table/column names to classes/fields, update cascading, transaction management, cache management, SQL translation, etc.
You can keep your database schema and map it using a different ORM; the above are just some of the problems you may or may not encounter along the way.
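For example, switching ORMs over an unchanged schema mostly means re-declaring the mappings; here is a minimal JPA sketch against a hypothetical users table, with explicit names to sidestep each ORM's default naming rules:

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // The same "users" table the previous ORM was mapping, re-declared for a
    // JPA provider. Explicit table/column names shield you from differences
    // in each ORM's default naming conventions.
    @Entity
    @Table(name = "users")
    public class User {
        @Id
        @Column(name = "id")
        private Long id;

        @Column(name = "email_address")
        private String email;

        // getters/setters omitted
    }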

Is there a good generic pattern for data store queries that hide vendor specific logic?

I'm trying to find a good way to implement a generic search API in Java that will allow my users to search the backend repository without needing to know what that backend technology is, and so that if in the future we switch vendors I can reimplement the underlying logic without needing to recode the API. The repository underneath could be a relational database or a document store like SOLR, CouchDB, MongoDB, etc... It would need to support all the typical search requirements such as wildcards, ranges, bitwise operators, and so on.
Are there any standard ways of approaching this problem?
Would JPA be my best bet? Would it do everything I need it for, including non-relational databases?
Thanks in advance!
What you need is an ORM framework like Hibernate; if you go with plain JPA, you will need to reinvent a lot of wheels.
Using Hibernate, you can write the business logic for searching the backend database or repository without vendor-specific implementation details, and if you later need to change the backend, you can do so without affecting your existing business code.
I would advise you to check the Hibernate documentation for further reference.
The Spring Data umbrella of projects provides a nice DAO abstraction named CrudRepository. I believe most of the sub-projects (JPA, MongoDB, etc.) provide some implementations of it.
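For instance, a Spring Data repository interface looks the same whether the module behind it is JPA or MongoDB; the Article entity and finder method here are hypothetical:

    import java.util.List;

    import org.springframework.data.repository.CrudRepository;

    // A plain entity; annotate per chosen store (@Entity for JPA, @Document for MongoDB)
    class Article {
        Long id;
        String title;
    }

    // Spring Data derives the query from the method name; the same interface
    // style works across the JPA, MongoDB, and other Spring Data modules.
    public interface ArticleRepository extends CrudRepository<Article, Long> {
        List<Article> findByTitleContaining(String fragment);
    }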
JPA would be one of a number of implementations you would use to map your relational database to objects. It would not protect you from database changes.
I think you're looking for the DAO Pattern. What I'm doing is as follows:
Create an interface for each DAO
Create a higher level DAO implementation that simply calls my actual database specific implementation
Wire the higher level DAO implementation to the database specific implementation with Spring.
This way, no code anywhere touches database specific implementation. The connections are formed only in XML.
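Here is a stripped-down sketch of that layering; all the names are illustrative:

    import java.util.List;

    // The only contract callers ever see
    interface ProductDao {
        List<String> findNamesMatching(String pattern);
    }

    // Store-specific implementation, hidden behind the interface and selected
    // via Spring wiring (XML or @Configuration) rather than in calling code.
    class SolrProductDao implements ProductDao {
        @Override
        public List<String> findNamesMatching(String pattern) {
            // translate 'pattern' into a Solr query here
            throw new UnsupportedOperationException("sketch only");
        }
    }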
JPA is designed around RDBMSes ... only. Using it for other types of datastores makes little sense, since things like its query language leak SQL syntax. JDO is designed for datastore agnosticism and provides persistence to many datastores via its implementations, such as DataNucleus, though not all of those that you mention.
JPA is designed around RDBMSes, and Hibernate is also designed for RDBMSes. There are a few JPA implementations that support NoSQL, and similar projects are built around Hibernate to support NoSQL databases. However, the API itself is tuned for RDBMSes.
Implementing the DAO pattern would require you to write your own query API, and later to extend the implementation whenever your data store changes.
JDO and DataNucleus are designed from the ground up for heterogeneous data stores. DataNucleus already has support for a dozen stores, plus RDBMSes. The beauty is that the query API remains constant across stores: JDO allows you to work with your domain model and leave the storage details to implementations like DataNucleus.
Hence I suggest the JDO API with DataNucleus.
The link below lists the data stores and features already available in DataNucleus:
http://www.datanucleus.org/products/accessplatform_3_0/datastore_features.html
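For flavor, a small JDO sketch; the persistence unit name and Product class are hypothetical, and the same JDOQL filter runs unchanged whichever store DataNucleus is configured with:

    import java.util.List;

    import javax.jdo.JDOHelper;
    import javax.jdo.PersistenceManager;
    import javax.jdo.PersistenceManagerFactory;
    import javax.jdo.Query;
    import javax.jdo.annotations.PersistenceCapable;

    @PersistenceCapable
    class Product {
        String name;
        double price;
        Product(String name, double price) { this.name = name; this.price = price; }
    }

    public class JdoSketch {
        public static void main(String[] args) {
            // "MyUnit" is a hypothetical persistence unit configured for DataNucleus
            PersistenceManagerFactory pmf =
                    JDOHelper.getPersistenceManagerFactory("MyUnit");
            PersistenceManager pm = pmf.getPersistenceManager();
            try {
                pm.makePersistent(new Product("widget", 5.0));

                // JDOQL is datastore-agnostic; DataNucleus translates it per store
                Query<Product> q = pm.newQuery(Product.class, "price < 100");
                List<Product> cheap = q.executeList();
                System.out.println(cheap.size());
            } finally {
                pm.close();
            }
        }
    }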
