Using Solr/Lucene as persistence technology - java

Solr/Lucene's inverted index and query support a subset of RDBMS functionality, e.g. filtering, sorting, group-by, and paging. In this sense it is very close to a NoSQL database, as it also does not support transactions or joins.
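As a hedged sketch of that overlap (SolrJ 4.x style; the core URL and field names are illustrative assumptions, not from any real schema), the RDBMS-style operations map directly onto standard query parameters:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RdbmsStyleQuery {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("category:books");        // filtering (WHERE)
        q.setSort("price", SolrQuery.ORDER.asc);   // sorting (ORDER BY)
        q.setStart(0);                             // paging (OFFSET)
        q.setRows(10);                             // paging (LIMIT)
        q.set("group", true);                      // grouping (GROUP BY)
        q.set("group.field", "author");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getGroupResponse().getValues());
    }
}
```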
With a framework like Hibernate Search, it is possible to map even complex objects to the index and perform basic CRUD operations, while supporting full-text search.
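A minimal sketch of such a mapping, assuming Hibernate Search 4.x annotations (the entity and field names are mine):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed            // mirror this entity into the Lucene index
public class Article {
    @Id
    @DocumentId     // the primary key doubles as the Lucene document id
    private Long id;

    @Field          // tokenized, searchable full-text field
    private String title;

    @Field
    private String body;

    // getters and setters omitted for brevity
}
```

Once mapped, the entity is saved through the usual Session/EntityManager CRUD calls, and Hibernate Search keeps the index updated behind the scenes.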
Considerations:
1) Write throughput
In my past experience, a Lucene index's write throughput is much lower than an RDBMS's.
2) Query Speed
Query speed for a Lucene index should be comparable, if not faster, thanks to the inverted index.
3) Scalability
This can be addressed using replication or SolrCloud.
4) Ability to handle large data set
I have used a Lucene index with 15M+ documents on a single JVM without any performance issues.
Background:
I am currently using MongoDB with Solr and it is working well enough. However, it is not as "simple" as I would like it to be, because of:
Keeping the Mongo and Solr index in sync (not a trivial task; see the sketch after this list)
Transformation between Java object <-> Mongo <-> Solr (Spring Data and SolrJ help, but still not great).
Why use two "persistence" technologies if one will do?
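To make the sync pain concrete, here is a hedged sketch of the dual write that every save currently implies (SolrJ 4.x style; the URL and field names are assumptions, and Article is the entity sketched earlier):

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ArticleRepository {
    private final SolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/articles");

    public void save(Article article) throws Exception {
        // 1) persist to MongoDB (e.g. via Spring Data) - omitted here
        // 2) mirror the same fields into Solr by hand
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", article.getId());
        doc.addField("title", article.getTitle());
        doc.addField("body", article.getBody());
        solr.add(doc);
        solr.commit(); // a crash between steps 1 and 2 leaves the two stores out of sync
    }
}
```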
From the small-scale tests I have done so far, I haven't found any technical roadblock that would prevent me from using Solr/Lucene as persistence. However, I also don't want to commit to such a drastic refactoring without more information. I am also aware of projects like Solandra that attempt to bring NoSQL and Solr together, but they don't seem to be mature enough.
Question
So for applications where full-text search is a major (but not the only) requirement, is it feasible to forgo traditional (RDBMS) and contemporary (NoSQL) data stores?
Great reference, thanks to raticulin:
Atlassian (Jira) - Lucene Generic Data Indexing

I think I remember watching a presentation from Atlassian where they explained that for Jira they were using just Lucene nowadays; they had dropped their previous DB (whatever it was) and were using Lucene as storage too. They were happy.
If someone can confirm it was them, that would be cool.
Edit:
http://blogs.atlassian.com/rebelutionary/downloads/tssjs2007-lucene-generic-data-indexing.pdf

Lucene - Full Text Search/Information Retrieval Library.
Solr - Enterprise Search Server built on top of Lucene.
Lucene/Solr should not be used in place of a persistence layer; it will not replace an RDBMS, and it is not useful to compare the two: you are comparing apples and oranges.
As for the indexing throughput of Lucene that you are comparing with an RDBMS, a direct comparison will not help; a number of factors affect Lucene throughput depending on your search schema configuration.
Lucene has one of the best-known data structures for information retrieval; the query speed you get depends on a number of factors, from configuration to hardware.
Obviously, that's the way to go.
Handling 15M+ documents on a single JVM is great, but it does not say much without knowing the document size, the feature set used, JVM memory, CPU cores, etc.
Now if your real scalability bottleneck is the RDBMS, you could pick a NoSQL datastore based on your persistence needs and then integrate it with Solr/Lucene to provide full-text search capability. Since NoSQL is rapidly evolving and fairly new, you might not find stable adapters to integrate Solr/Lucene with a NoSQL store.
Edit:
Now that the question is updated, this is already well debated in the question NoSQL (MongoDB) vs Lucene (or Solr) as your database. It can be a pain to have too many moving parts, and Lucene/Solr could very well replace MongoDB, depending on the app. But you have to consider that NoSQL data stores are built from the ground up to be fully distributed: you don't lose functionality (or have it limited) as you scale, while Solr was not built with distributed computing in mind, so there are Distributed Search limitations when it comes to horizontal scaling. SolrCloud may be the answer to that.

Related

JSON ad-hoc vs NoSQL document database for embedded desktop software in Java

I need to choose between an ad-hoc solution with JSON and an embedded NoSQL DB (probably OrientDB).
Scenario:
Open-source desktop software in Java (free as in beer)
Single connection
Continuous Delivery (will change)
Really easy client installation (copy and paste)
about 20,000 records
polyglot persistence
The problem:
setting up a NoSQL DB is hard
one environment build, interoperability (Linux and Windows)
lack of embedded document NoSQL DBs for Java
complexity
So is ad-hoc JSON the right option? Any recommendation for a truly embedded NoSQL database? Or another approach?
Thanks.
One of the main motivations behind the development, and adoption, of NoSQL databases is the possibility to scale horizontally, which is needed when your database reaches a size large enough that it may require more nodes processing its operations to stay responsive.
If improving performance is the motivation, one should move a database to a NoSQL approach only when it is reaching a huge amount of data. As a side note, it is even interesting to think about the etymology behind the name of one of the most successful NoSQL databases so far: MongoDB gets the prefix "mongo" from humongous, i.e. enormous. This clearly states the purpose of such tools.
That being said, considering that in your scenario you are dealing with only 20 thousand records, you have many other alternatives that are easier to manage. You can go for ad-hoc JSON, or use more traditional, solid, and stable tools like embedded Firebird or the most obvious and widely used option for embedded databases: SQLite.
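For a data set of about 20 thousand records, a single-file embedded database keeps installation down to copy and paste; a minimal sketch assuming the xerial sqlite-jdbc driver (driver choice and file name are my assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedStore {
    public static void main(String[] args) throws Exception {
        Class.forName("org.sqlite.JDBC"); // needed only for older driver versions
        // The whole database is one file next to the application.
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:app.db");
             Statement st = c.createStatement()) {
            st.executeUpdate("CREATE TABLE IF NOT EXISTS record (id INTEGER PRIMARY KEY, payload TEXT)");
            st.executeUpdate("INSERT INTO record (payload) VALUES ('{\"name\":\"example\"}')");
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM record")) {
                rs.next();
                System.out.println("records: " + rs.getInt(1));
            }
        }
    }
}
```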

High-performance storage for messages

I have been looking for a high-performance file storage solution to be used for persisting SOAP messages in a Java EE environment.
We are currently using a CLOB table on an Oracle RDBMS, but it is very expensive to scale. While Oracle works well for storing the related metadata, it does not perform well with the message content. An insert into a table with a CLOB gives roughly 1000% worse performance than one without it (measured by comparing a VARCHAR2(4000) insert with a CLOB insert, with in-row storage disabled for the CLOB).
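A hedged sketch of the measurement just described (table and column names are mine; a real benchmark would also batch inserts and warm up the connection):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class InsertComparison {
    // Runs the same insert against a VARCHAR2(4000) table and a CLOB table
    // (with in-row storage disabled for the CLOB) and reports the rate.
    static void insertBatch(Connection conn, String table, String body, int n) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO " + table + " (id, body) VALUES (?, ?)");
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            ps.setInt(1, i);
            ps.setString(2, body); // the driver converts to a CLOB where needed
            ps.executeUpdate();
        }
        ps.close();
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%s: %.0f inserts/sec%n", table, n / secs);
    }
}
```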
Persisting messages on the file system is one option, but I have serious doubts about how an average file system would perform when storing millions of files per day. Considering we have to keep those files for several months, it just doesn't sound right.
I know there are several open-source key-value databases (Jackrabbit and MongoDB, to name a few) that might be up to the task, but I just can't find the time to evaluate them all. I would also like to hear about the performance of open-source RDBMSs.
Considering that the volume of transmitted messages is ever increasing, the priority is low latency and high performance. We do not require clustering or transactionality, and (minor) data loss on system failure is acceptable.
Requirements:
Must be able to maintain a rate of at least 100 persisted messages/sec with a message size of 8 kilobytes
Must be able to store at least 100 million messages
Must support deletion of persisted messages by age
Must support persisting while deletion is in progress
Must support retrieval of message by id
Help is appreciated
Here is a nice comparison between MongoDB and SQL Server (I believe Oracle would have similar performance). You can see from the charts that Mongo can handle 20,000 inserts per second. Mongo also has a query language based on JSON that can do almost everything regular SQL can, and it has sharded clusters and replica sets which can handle all the necessary backups and failover (some basic info here).
Also, if you are interested in digging a little deeper, 10gen has an online course starting in two weeks that awards a certificate.
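As a hedged illustration of that JSON-based query language from Java (MongoDB 2.x driver style; host, database, and field names are assumptions):

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class MongoQueryExample {
    public static void main(String[] args) throws Exception {
        DBCollection messages = new MongoClient("localhost")
                .getDB("test").getCollection("messages");

        // Roughly: SELECT * FROM messages WHERE status = 'DONE' AND size > 4096
        BasicDBObject query = new BasicDBObject("status", "DONE")
                .append("size", new BasicDBObject("$gt", 4096));
        DBCursor cursor = messages.find(query).limit(10);
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
    }
}
```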
You can try the following products:
HBase
MongoDB
Cassandra
Solr 4.0 (only)
These are the ones I have experience with. There are a lot of other good products on the market that can do what you want.
Some observations: none of them has this "delete by age" feature out of the box, as far as I know, but it should be really simple to implement. Easiest in MongoDB, I assume (see the sketch below).
If you try Solr, you should stick with the 4.x versions, as these are the only ones with support for near-real-time commits, which affects your "delete and insert" requirement.
All of them have great performance, but I did not run a benchmark against your requirements. If I were you, I would run my own benchmarks.
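For the missing delete-by-age, a hedged sketch of how it could look with the MongoDB 2.x Java driver (collection and field names are mine; this assumes each message carries a createdAt timestamp):

```java
import java.util.Date;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class DeleteByAge {
    public static void main(String[] args) throws Exception {
        DBCollection messages = new MongoClient("localhost")
                .getDB("store").getCollection("messages");
        // Remove everything older than 90 days; inserts can continue meanwhile.
        Date cutoff = new Date(System.currentTimeMillis() - 90L * 24 * 3600 * 1000);
        messages.remove(new BasicDBObject("createdAt",
                new BasicDBObject("$lt", cutoff)));
    }
}
```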
Oracle 11g introduced a data deduplication feature, which will improve the performance of the Oracle database with CLOBs.
This is what I've discovered so far. I will try to update this answer after evaluating each product.
I started my experiments with MongoDB, which on paper looked like a viable option. Here is a summary of my findings:
Written in C++
Replication (replicaset) requires 3 nodes for high availability
One of the nodes is elected as a master - only the master can write
Scaling out is done by sharding (partitioning)
Each shard is essentially a replica set - therefore a sharded environment requires at least 6 nodes for high availability
A mongod instance consumes all available memory - virtualization should be used for resource partitioning (if you intend to run an application server on the same hardware)
Master re-election may take up to one minute
Document collections (tables) use an exclusive lock during write operations
The Java API is exceptionally easy to use and includes a virtual filesystem called GridFS (see the sketch after this list)
Single-node write performance on the test system was ~20,000 inserts/sec for a 1-kilobyte document
Single-node read performance was ~20,000 reads/sec for a 1-kilobyte document
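A hedged taste of the GridFS API mentioned above (2.x driver style; names are illustrative):

```java
import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.gridfs.GridFSInputFile;

public class GridFsExample {
    public static void main(String[] args) throws Exception {
        DB db = new MongoClient("localhost").getDB("messages");
        GridFS fs = new GridFS(db);

        // Store a message body as a file-like object.
        GridFSInputFile in = fs.createFile(
                "<soap:Envelope>...</soap:Envelope>".getBytes("UTF-8"));
        in.setFilename("msg-42.xml");
        in.save();

        // Retrieve it by name and stream it out.
        GridFSDBFile out = fs.findOne("msg-42.xml");
        out.writeTo(System.out);
    }
}
```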
The fact that MongoDB would require 6 nodes in a two-data-center configuration made me look further for more cost-efficient solutions.
Apache Cassandra:
Written in Java
Replication requires 3 nodes for high availability
Database survives network partitioning
Replication algorithm has been designed for multiple data centers
All nodes are writable
Scaling out can be done by adding more nodes (up to a certain limit)
Cassandra may require JVM garbage collection tuning
The Java API is not the easiest to work with
Single-node write performance was ~7,000 inserts/sec for a 1-kilobyte document
Single-node read performance was ~7,000 reads/sec for a 1-kilobyte document
While Cassandra was slower in a single-node configuration, its write performance in a high-availability configuration would match MongoDB's. The ability to perform writes on every node (even during network partitioning) is a very welcome addition for logging.
Couchbase:
Unfortunately I was unable to test Couchbase.
For now we'll keep using Oracle SecureFiles. Should we run out of resources on Oracle, both Cassandra and MongoDB seem like viable alternatives.

Caching a large number of transactional objects in Java

I am looking for the best solution for caching a large number of simple transactional POJOs in memory. The transactions happen on an Oracle database, on 3-4 tables, performed by an external application. A second application is of the business-intelligence type: based on the transactions in the database, it evaluates the updated POJOs (mapped to the tables) and applies various business rules.
A Hibernate solution relies on the transactions happening on the same server, whereas in our case the transactions happen somewhere else, and I am not sure the cached objects can be queried.
Questions:
Is there an Oracle JDBC API that would trigger an update event on the Java side?
Which caching solution would support #1?
Can the cached objects be queried?
Oracle databases support Java triggers, so in theory you could implement something like this yourself; see this guide. Your Java trigger could invoke the client library of whichever distributed caching solution you are using, to update or evict stale entries.
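A hedged sketch of that approach (everything here is illustrative: the class, procedure, and table names are mine, and the SQL side is shown only as comments):

```java
// Published to Oracle as a Java stored procedure. After loading this class
// into the database with loadjava, you would create a PL/SQL call spec and
// a trigger, roughly:
//
//   CREATE OR REPLACE PROCEDURE notify_change (p_id IN NUMBER)
//     AS LANGUAGE JAVA NAME 'CacheNotifier.onRowChanged(java.math.BigDecimal)';
//
//   CREATE OR REPLACE TRIGGER orders_notify
//     AFTER INSERT OR UPDATE ON orders FOR EACH ROW
//   BEGIN
//     notify_change(:NEW.id);
//   END;
public class CacheNotifier {
    // Runs inside the database JVM; Oracle maps NUMBER to BigDecimal.
    public static void onRowChanged(java.math.BigDecimal id) {
        // Here you would call your cache client to evict or refresh the
        // entry for this primary key (application-specific).
        System.out.println("Row changed: " + id);
    }
}
```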
Oracle also has a caching solution of its own, known as Coherence, which might have integration like this built in; it might be worth checking out. Search for "java distributed cache" for some alternatives.
As far as I know, Hibernate does not support queries on objects stored in its cache.
However, if you cache an entire collection of objects separately, there are some libraries which will allow you to perform SQL-like queries on those collections:
LambdaJ - supports advanced queries, not as fast
CQEngine - supports typical queries, extremely fast
BTW, I am the author of CQEngine. I like both of those libraries, but please excuse my slight bias for my own one :)
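A hedged sketch of what querying a cached collection looks like (CQEngine 2.x style; the attribute-definition signature differs slightly between versions, and the Order class is illustrative):

```java
import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.hash.HashIndex;
import com.googlecode.cqengine.query.option.QueryOptions;
import static com.googlecode.cqengine.query.QueryFactory.equal;

public class PojoCacheExample {
    static class Order {
        final long id;
        final String status;
        Order(long id, String status) { this.id = id; this.status = status; }
    }

    // Attribute exposing Order.status to the query engine.
    static final SimpleAttribute<Order, String> STATUS =
            new SimpleAttribute<Order, String>("status") {
                public String getValue(Order o, QueryOptions qo) { return o.status; }
            };

    public static void main(String[] args) {
        IndexedCollection<Order> cache = new ConcurrentIndexedCollection<Order>();
        cache.addIndex(HashIndex.onAttribute(STATUS)); // indexed like a DB column
        cache.add(new Order(1, "OPEN"));
        cache.add(new Order(2, "CLOSED"));

        // SQL-like query over the in-memory collection.
        for (Order o : cache.retrieve(equal(STATUS, "OPEN"))) {
            System.out.println(o.id);
        }
    }
}
```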

Is Hibernate Search a clean abstraction for Lucene?

I have used Hibernate in the past and I share many people's frustration with ORMs. Since traditional databases are relational, any ORM is a leaky abstraction, and much of my time ends up being spent understanding the details of the abstraction so I can achieve good performance.
Hibernate Search, however, works on top of Lucene. Since a Lucene index contains a collection of documents of the same type, it might not have the same problems as Hibernate over a relational database. Does Hibernate Search provide a clean abstraction, or is it fraught with the same problems as Hibernate+MySQL?
I'm considering moving from an existing implementation in raw Lucene to Hibernate Search.
A good abstraction is one that solves a problem not addressed by the underlying library, with little or no compromise on the feature set. In the case of Lucene, these problems could be:
index distribution,
synchronization with another persisted data source,
...
Then the best abstraction depends on the problem you need to solve:
If you just want to build an inverted index on a single server and query it, stick to plain Lucene (see the sketch after this list). Lucene is already the best abstraction available; any other abstraction would add overhead while probably preventing you from using some features, without making things much easier.
If you want to go distributed, then Solr or Elasticsearch would help you a lot.
If you want to integrate full-text functionality with another persisted data set, then Hibernate Search or Compass could be interesting candidates.
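For the first option (plain Lucene on a single server), the core API is already compact; a hedged sketch in Lucene 4.x style (paths and field names are mine):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PlainLucene {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/index"));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // Index one document.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_40, analyzer));
        Document doc = new Document();
        doc.add(new TextField("title", "Hibernate Search versus plain Lucene",
                Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Query it back.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        QueryParser parser = new QueryParser(Version.LUCENE_40, "title", analyzer);
        for (ScoreDoc hit : searcher.search(parser.parse("lucene"), 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
    }
}
```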

Comparison of NoSQL Databases for Java

I want to find out more about the NoSQL databases/datastores available for use from Java; so far I have tried Project Voldemort. Apart from the awfully chosen name, it seems fine.
I'd like to find out more about other such database systems. The Wikipedia article has a list of some of them, and there is some documentation on their project pages.
However, instead of comparing the technical specs and tutorials provided by the authors, what I would like to know is:
What are your experiences with working with these libraries on real projects? Which one would you recommend for use based on that experience, which one you wouldn't and why?
I know that the only people able to answer this question are those who have actually used more than one such database, but I hope someone has done so.
EDIT:
By "real project" I primarily mean a project in production (but in absence of these anything larger than a homework or finished tutorial applies).
I worked with a relational database that had enormous amount of data in it, most of it concentrated in a single table, which was denormalized for performance anyway. But, because of the entire mess with constraints etc, creating a usable cluster had shown horrible results in both stability and performance.
Now, I'm quite sure that most likely any of these NoSQL systems would be a better choice then what I had at disposal. But, there has to be a difference between them, too. Whether it is in documentation, stability between versions, community, ease of use, whatever... And there are many giants. Which ones shoulders to choose? :D
We have been working with HBase on our projects. Our experience is:
The community is very dynamic and extremely helpful
The installation procedure for developers is quite easy, in either pseudo-distributed or standalone mode
We have been using it in integration tests, much like unit tests
Installing a cluster is also easy, but compared to some other NoSQL stores it has more components to install
Administration - still ongoing, so I cannot say much about it yet
Do not use it for SQL-like SELECT queries; for those we use Apache Solr
To make development and testing easier we have come up with a simple object mapper - https://github.com/smart-it/smart-dao
The reason I chose HBase is that, like other NoSQL stores, it solves sharding and scaling by design, which makes things easier in the long run, and that seems to hold well.
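For a flavor of the raw HBase client API that mappers like the one above wrap, here is a hedged sketch (HBase 0.9x style; table, family, and qualifier names are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HTable table = new HTable(conf, "users");

        // Write one cell: row key -> column family:qualifier -> value.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);

        // Read it back by row key.
        Result result = table.get(new Get(Bytes.toBytes("row-1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
    }
}
```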
Maybe the most prominent of the Java NoSQL solutions is Cassandra. It has some features beyond Voldemort (an order-preserving partitioner which allows range queries; a BigTable-style structure for values) and is missing others (no alternate storage backends or vector clocks for versioning).
Its performance is best for fast writes, but its biggest strength is probably the ease with which it can be horizontally scaled by adding new nodes (something where Voldemort is a bit more static).
Compared to, say, MongoDB, its data model is quite simple, and often there is no point in using much more than the key/value abstraction (that is, handle data mapping on the client side and store serialized objects).
It has full replication and distribution, unlike some k/v stores (CouchDB, from what I understand).
It's pretty difficult to nail down a good choice without knowing exactly what your use case is. Much of it depends on what kind of data model you are comfortable with and what fits your needs. You have key-value stores, document-oriented stores, column-oriented stores, etc. Another huge factor is each product's take on scaling and how it chooses to deal with availability/consistency trade-offs.
I like MongoDB. I like how it supports queries and I like the document-oriented data model. It fits many problems that I seem to run into. There is a Great (with a capital G) community, as seen at the recent MongoSV event.
Your best bet is to pick 3 different products and evaluate them. I would also see if you can find companies who have presented at conferences and told the stories of how they were successful. Videos from MongoSV will be available soon.
