Caching a large number of transaction objects - Java

I am looking for the best solution for caching a large number of simple transactional POJOs in memory. The transactions happen in an Oracle database, on 3-4 tables, driven by an external application. Another application is a kind of Business Intelligence tool which, based on the transactions in the database, evaluates the updated POJOs (mapped to the tables) and applies various business rules.
A Hibernate solution relies on the transactions happening on the same server, whereas in our case the transactions happen somewhere else, and I am not sure the cached objects can be queried.
Questions:
1. Is there an Oracle JDBC API that would trigger an update event on the Java side?
2. Which caching solution would support #1?
3. Can cached objects be queried?

Oracle databases support Java triggers, so in theory you could implement something like this yourself; see this guide. In theory, your Java trigger could invoke the client library of whichever distributed caching solution you are using, to update or evict stale entries.
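Purely as an illustration (none of this comes from the question), here is a sketch of what the Java side could look like. The class would be loaded into the database with loadjava and published through a PL/SQL call specification; the HTTP eviction endpoint is a hypothetical stand-in for whatever client library your cache actually provides:

```java
// Hypothetical sketch: a static method an Oracle trigger could invoke via a call spec, e.g.
//   CREATE OR REPLACE PROCEDURE notify_cache (p_key IN VARCHAR2)
//     AS LANGUAGE JAVA NAME 'CacheNotifier.evict(java.lang.String)';
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class CacheNotifier {
    public static void evict(String key) throws Exception {
        // The endpoint below is an assumption; substitute your cache's client API.
        URL url = new URL("http://cache-host:8080/evict?key="
                + URLEncoder.encode(key, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.getResponseCode();  // fire the eviction request
        conn.disconnect();
    }
}
```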
Oracle also has a caching solution of its own, known as Coherence. It might have integration like this built in; at the least it might be worth checking out. Search for "java distributed cache" for some alternatives.
As far as I know, Hibernate does not support queries on objects stored in its cache.
However, if you cache an entire collection of objects separately, there are some libraries which will allow you to perform SQL-like queries on those collections:
LambdaJ - supports advanced queries, not as fast
CQEngine - supports typical queries, extremely fast
BTW I am the author of CQEngine. I like both of those libraries. But please excuse my slight bias for my own one :)
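To make the CQEngine option concrete, here is a minimal sketch of querying an in-memory collection (written against the CQEngine 2.x-style API; the Txn class and its fields are made-up examples, not from the question):

```java
import static com.googlecode.cqengine.query.QueryFactory.equal;

import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.Attribute;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.hash.HashIndex;
import com.googlecode.cqengine.query.option.QueryOptions;
import com.googlecode.cqengine.resultset.ResultSet;

public class TxnCache {
    static class Txn {
        final String account;
        final double amount;
        Txn(String account, double amount) { this.account = account; this.amount = amount; }
    }

    // Attribute exposing the account field to CQEngine's query engine.
    static final Attribute<Txn, String> ACCOUNT = new SimpleAttribute<Txn, String>("account") {
        public String getValue(Txn txn, QueryOptions options) { return txn.account; }
    };

    public static void main(String[] args) {
        IndexedCollection<Txn> txns = new ConcurrentIndexedCollection<>();
        txns.addIndex(HashIndex.onAttribute(ACCOUNT));  // index for fast equality queries
        txns.add(new Txn("ACC-1", 42.0));

        // SQL-like query over the cached collection.
        try (ResultSet<Txn> results = txns.retrieve(equal(ACCOUNT, "ACC-1"))) {
            results.forEach(t -> System.out.println(t.amount));
        }
    }
}
```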

Related

Using Solr/Lucene as persistence technology

Solr/Lucene's inverted index and query capabilities support a subset of RDBMS functionality, i.e. filtering, sorting, group-by and paging. In this sense it is very close to a NoSQL database, as it also does not support transactions or joins.
With a framework like Hibernate Search, it is possible to map even complex objects to the index and perform basic CRUD operations, while supporting full-text search.
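For example, a minimal Hibernate Search mapping might look like the sketch below (annotations as in the Hibernate Search 3.x-5.x line; the entity and its fields are illustrative assumptions):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed   // this entity is mirrored into a Lucene index
public class Article {
    @Id
    private Long id;

    @Field  // analyzed and indexed for full-text search
    private String title;

    @Field
    private String body;
}
```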
Considerations:
1) Write throughput
From my past experience, a Lucene index's write throughput is much lower than that of an RDBMS.
2) Query Speed
Query speed for a Lucene index should be comparable, if not faster, thanks to the inverted index.
3) Scalability
Could be resolved using replication or SolrCloud.
4) Ability to handle large data set
I have used a Lucene index with 15M+ documents on a single JVM without any performance issues.
Background:
I am currently using MongoDB with Solr and it is working well enough. However, it is not as "simple" as I would like it to be, due to:
Keeping Mongo and the Solr index in sync (not a trivial task)
Transformation between Java object <-> Mongo <-> Solr (Spring Data and SolrJ help, but are still not great)
Why use two "persistence" technologies if one will do?
From the small-scale tests I have done so far, I haven't found any technical roadblock that would prevent me from using Solr/Lucene as persistence. However, I also don't want to commit to such a drastic refactoring without more information. I am also aware of projects like Solandra, which attempt to bring NoSQL and Solr together, but they don't seem to be mature enough.
Question
So for applications where full-text search is a major (but not the only) requirement, is it feasible to forgo a traditional (RDBMS) or contemporary (NoSQL) data store?
Great reference, thanks to raticulin:
Atlassian (Jira) - Lucene Generic Data Indexing
I think I remember watching a presentation from Atlassian where they explained that for Jira they were nowadays using just Lucene; they had dropped their previous DB (whatever it was) and were using Lucene as storage too. They were happy.
If someone can confirm it was them, that would be cool.
Edit:
http://blogs.atlassian.com/rebelutionary/downloads/tssjs2007-lucene-generic-data-indexing.pdf
Lucene - Full Text Search/Information Retrieval Library.
Solr - Enterprise Search Server built on top of Lucene.
Lucene/Solr should not be used in place of persistence; they will not be able to replace an RDBMS, nor is it a good idea to compare them to an RDBMS: you are comparing apples and oranges.
Comparing the indexing throughput of Lucene directly with an RDBMS will not help, and it is not right to compare them directly; there are a number of factors that affect Lucene throughput, depending on your search schema configuration.
Lucene has one of the best-known data structures for information retrieval; the query speed you get depends on a number of factors, from configuration to hardware, etc.
Obviously, that's the way to go.
Handling 15M+ documents on a single JVM is great, but it does not say much without knowing the document size, the feature set used, JVM memory, CPU cores, etc.
Now, if the RDBMS is your real scalability bottleneck, you could pick a NoSQL datastore based on your persistence needs, and then integrate Solr/Lucene with it to provide full-text search capability. Since NoSQL is rapidly evolving and fairly new, you might not find stable adapters to integrate Solr/Lucene with a given NoSQL store.
Edit:
Now that the question is updated, this is already well debated in the question NoSQL (MongoDB) vs Lucene (or Solr) as your database. It could be a pain to have too many moving parts, and Lucene/Solr could very well replace MongoDB, depending on the app. But you have to consider that NoSQL data stores are built from the ground up to be fully distributed, so you don't lose functionality, or have it limited, due to scaling, while Solr was not built with distributed computing in mind, so there are Distributed Search limitations when it comes to horizontal scaling. SolrCloud may be the answer to that.

Overhead of using Coherence cache

I am considering caching key-value lists that are stored in a database. Right now, for rendering JSF pages, a lot of redundant queries are executed to find the names to be displayed for certain keys (O/R mapper: EclipseLink).
The values are quasi-static, but can change, very seldom, through use of the application (no change to the database except by the application in question).
A simple cache would suffice when using only one application server. However, load balancing across multiple servers should be possible, avoiding stale values being returned when data is changed on one server and the change is therefore not reflected on another server.
One idea would be to use Oracle Coherence as a distributed cache. I'm not sure whether this is overkill, given that the data changes only very seldom and the cache itself does not need to be distributed; only the invalidation does.
What is the overhead of Coherence in terms of memory, execution time and network communication? Are there any alternatives that better suit my use case?
I am talking about 50,000 key-value pairs, mainly short strings.
If the invalidation is that rare, then you can use a local cache and something like a JMS Topic that everyone subscribes to in order to handle the invalidation.
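A minimal sketch of that idea, assuming any JMS provider and a topic named "cache.invalidation" (the topic name and the map-backed cache are assumptions):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.jms.*;

public class InvalidationListener implements MessageListener {
    private final ConcurrentMap<String, String> localCache = new ConcurrentHashMap<>();

    // Each server subscribes to the topic; evictions arrive from all peers.
    public void subscribe(TopicConnectionFactory factory) throws JMSException {
        TopicConnection conn = factory.createTopicConnection();
        TopicSession session = conn.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("cache.invalidation");
        session.createSubscriber(topic).setMessageListener(this);
        conn.start();
    }

    @Override
    public void onMessage(Message message) {
        try {
            // Evict the stale key; the next read repopulates it from the database.
            localCache.remove(((TextMessage) message).getText());
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }

    // Called by whichever server performed the update.
    public static void publishInvalidation(TopicConnectionFactory factory, String key)
            throws JMSException {
        TopicConnection conn = factory.createTopicConnection();
        try {
            TopicSession session = conn.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("cache.invalidation");
            session.createPublisher(topic).publish(session.createTextMessage(key));
        } finally {
            conn.close();
        }
    }
}
```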
There's also something like Ehcache as an alternative, since it's OSS and free to use versus Coherence, if that's important. I like to use Ehcache's pull-through ability.
Coherence has relatively low overhead, and can easily manage 50,000 (or 50,000,000) objects. However, if your use case is super simple, and you don't mind doing the invalidation work yourself, and don't need the various QoS that Coherence provides, then it probably is overkill.
Also, this simple use case can easily be done using the Coherence Standard Edition, which is far less expensive (licensed per server instead of per processor, and it's a much lower price).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

Performance difference of native SQL (using MySQL) vs using Hibernate ORM?

I am using Spring MVC for an application that involves a multilevel back end for management and a customer/member front end. The project was initially started with no framework and simple native JDBC calls for database access.
As the project has grown significantly (as they always do), I have been making more significant database calls, sometimes querying large result sets.
I am doing what I can to have my DB calls closely emulate object-relational mapping best practices, but I am still just using JDBC. I have been contemplating whether or not I should make the transition to Hibernate, but was unsure if it would be worth it. I would be willing to do it if there was a performance gain.
Is there any performance gain from using Hibernate( or even just Object Relational Mapping) over native SQL using JDBC?
Is there any performance gain from using Hibernate (or even just Object Relational Mapping) over native SQL using JDBC?
Using an ORM, a data mapper, etc. won't make the same SQL queries run faster. However, when using Hibernate you can benefit from things like lazy loading, second-level caching and query caching, and these features can help improve performance. I'm not saying Hibernate is perfect for every use case (for the special cases Hibernate can't handle well, you can always fall back to native SQL), but it does a very decent job and definitely improves development time (even after adding time spent on optimization).
But the best way to convince yourself would be to measure things; in your case, I would probably create a Hibernate prototype covering some representative scenarios and benchmark it.
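To illustrate the features mentioned above, a hedged sketch of enabling lazy loading and the second-level cache on an entity (the Product/Category entities are assumptions, and a cache provider such as Ehcache still has to be configured separately):

```java
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)  // second-level cache for this entity
public class Product {
    @Id
    private Long id;

    private String name;

    // Lazy loading: the Category row is fetched only when first accessed.
    @ManyToOne(fetch = FetchType.LAZY)
    private Category category;  // Category is another mapped entity (not shown)
}
```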
ORM lets you stay inside the OOP world, but this comes at the cost of performance, especially with many-to-many relations in our case. We were using Hibernate by default, doing performance optimization with JDBC where required.
Hibernate will make the development and maintenance of your app easier, but it won't necessarily make DB access quicker.
If your native JDBC calls use inefficient SQL then you might see some performance improvement, as Hibernate tends to generate good SQL.

Java Fast Data Storage & Retrieval

I need to store records in persistent storage and retrieve them on demand. The requirements are as follows:
Extremely fast retrieval and insertion
Each record will have a unique key. This key will be used to retrieve the record
The data stored should be persistent i.e. should be available upon JVM restart
A separate process would move stale records to RDBMS once a day
What do you guys think? I cannot use a standard database because of latency issues. In-memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not really call for SQL. I am thinking of some kind of flat-file-based solution. Any ideas? Any open source projects? I am sure there must be someone who has solved this problem before.
There are lots of diverse tools and methods, but I think none of them can shine in all of these requirements.
For low latency, you can only rely on in-memory data access; disks are physically too slow (and SSDs too). If the data does not fit in the memory of a single machine, we have to distribute it across enough nodes that their combined memory suffices.
For persistence, we have to write our data to disk after all. Assuming optimal organization, this can be done as a background activity without affecting latency.
However, for reliability (failover, HA or whatever), disk operations cannot be totally independent of the access methods: we have to wait for the disks when modifying data, to make sure our operation will not disappear. Concurrency also adds some complexity and latency.
The data model is not a restriction here: most of the methods support access based on a unique key.
We have to decide:
whether the data fits in the memory of one machine, or we have to find a distributed solution,
whether concurrency is an issue, or there are no parallel operations,
whether reliability is strict and we cannot lose modifications, or we can live with the fact that an unplanned crash would result in data loss.
Solutions might be:
self-implemented data structures using the standard Java library, files, etc.; this may not be the best solution, because reliability and low latency require clever implementations and lots of testing,
traditional RDBMSs, which have a flexible data model and durable, atomic, isolated operations, caching, etc.; they actually know too much, and are mostly hard to distribute. That's why they are too slow if you cannot turn off the unwanted features, which is usually the case.
NoSQL and key-value stores are good alternatives. These terms are quite vague and cover lots of tools. Examples are:
BerkeleyDB or Kyoto Cabinet as single-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine,
Project Voldemort as a distributed key-value store: uses BerkeleyDB Java Edition inside, simple and distributed,
ScalienDB as a distributed key-value store: reliable, but not too slow for writes either,
MemcacheDB, Redis and other caching databases with persistence,
popular NoSQL systems like Cassandra, CouchDB, HBase, etc.: used mainly for big data.
A list of NoSQL tools can be found e.g. here.
Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily; however, we have to be careful with the hardware too (like the network properties mentioned above).
Have a look at LinkedIn's Voldemort.
If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle storing itself to disk for you.
What about something like CouchDB?
I would use a BlockingQueue for that. Simple, and built into Java.
I do something similar using realtime data from the Chicago Mercantile Exchange. The data is sent to one place for realtime use... and to another place (via TCP), using a BlockingQueue (producer/consumer) to persist the data to a database (Oracle, H2).
The consumer uses a time-delayed commit to avoid disk-sync (fsync) issues in the database. (H2-type databases commit asynchronously by default and avoid that issue.)
I log the persisting in the consumer to keep track of the queue size, to be sure it can keep up with the producer. Works pretty well for me.
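A simplified sketch of that producer/consumer arrangement (the table name, batch size and record shape are assumptions; the connection is expected to have auto-commit disabled so the periodic commit below controls the disk sync):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PersistingConsumer implements Runnable {
    private final BlockingQueue<String> queue;
    private final Connection conn;  // auto-commit disabled by the caller

    public PersistingConsumer(BlockingQueue<String> queue, Connection conn) {
        this.queue = queue;
        this.conn = conn;
    }

    @Override
    public void run() {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO records(payload) VALUES (?)")) {
            while (!Thread.currentThread().isInterrupted()) {
                // Block for the first record, then drain a bounded batch.
                String record = queue.take();
                ps.setString(1, record);
                ps.addBatch();
                for (int i = 0; i < 500; i++) {
                    String next = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (next == null) break;
                    ps.setString(1, next);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();  // one commit per batch, not per record
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```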
MySQL with shards may be a good idea. However, it depends on the data volume, the transactions per second and the latency you need.
In-memory databases are also a good idea; in fact, MySQL provides memory-based tables as well.
Would a tuple space / JavaSpaces work? Also check out other enterprise data fabrics like Oracle Coherence and GemStone.
MapDB provides highly performant HashMaps/TreeMaps that are persisted to disk. It's a single library that you can embed in your Java program.
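A minimal sketch with MapDB (written against the 3.x-style API; the file and map names are assumptions):

```java
import java.util.concurrent.ConcurrentMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class MapDbExample {
    public static void main(String[] args) {
        DB db = DBMaker.fileDB("records.db").transactionEnable().make();
        ConcurrentMap<String, String> map = db
                .hashMap("records", Serializer.STRING, Serializer.STRING)
                .createOrOpen();

        map.put("key-1", "value-1");
        db.commit();  // flush the change to disk
        db.close();   // contents survive a JVM restart
    }
}
```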
Have you actually proved that using an out-of-process SQL database like MySQL or SQL Server is too slow, or is this an assumption?
You could use a SQL database approach in conjunction with an in-memory cache to ensure that retrievals do not hit the database at all. Despite the fact that the records are plain text, I would still advise using SQL over a flat-file solution (e.g. using a text column in your table schema), as the RDBMS will perform optimisations that a file system cannot (e.g. caching recently accessed pages).
However, without more information about your access patterns, expected throughput, etc. I can't provide much more in the way of suggestions.
If you are looking for a simple key-value store and don't need complex sql querying, Berkeley DB might be worth a look.
Another alternative is Tokyo Cabinet, a modern DBM implementation.
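For the Berkeley DB route, a minimal sketch using Berkeley DB Java Edition (the environment directory, which must already exist, and the database name are assumptions):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class BdbExample {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "records", dbConfig);

        // Keys and values are raw byte arrays; retrieval is by unique key.
        db.put(null,
               new DatabaseEntry("key-1".getBytes(StandardCharsets.UTF_8)),
               new DatabaseEntry("value-1".getBytes(StandardCharsets.UTF_8)));

        db.close();
        env.close();
    }
}
```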
How bad would it be if you lost a couple of entries in the event of a crash?
If that isn't too bad, the following approach might work for you:
Create a flat file for each entry, with the file name equal to the ID. Possibly one file for a smallish number of consecutive entries.
Make sure your controller has a good cache and/or use one of the existing caches implemented in Java.
Talk to a file system expert about how to make this really fast.
It is simple and it might be fast. Of course you lose transactions, including the ACID properties.
Sub-millisecond reads and writes mean you cannot depend on disk, and you have to be careful about network latency. Just forget about standard SQL-based solutions, main-memory or not: in a millisecond, you cannot get much more than 100 KB over a 1 Gbit network. Ask a telecom engineer; they are used to solving these kinds of problems.
How much does it matter if you lose a record or two? Where are they coming from? Do you have a transactional relationship with the source?
If you have serious reliability requirements then I think you may need to be prepared to pay some DB overhead.
Perhaps you could separate the persistence problem from the in-memory problem. Use a pub-sub approach: one subscriber looks after the in-memory side, the other persists the data, ready for a subsequent startup.
Distributed caching products such as WebSphere eXtreme Scale (no Java EE dependency) might be relevant if you can buy rather than build.
Chronicle Map is a ConcurrentMap implementation which stores keys and values off-heap, in a memory-mapped file, so you have persistence across JVM restarts.
ChronicleMap.get() is consistently faster than 1 µs, sometimes as fast as 100 ns per operation. It's the fastest solution in its class.
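A minimal sketch of a persisted Chronicle Map (the entry count, average key/value samples and file name are assumptions):

```java
import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

public class ChronicleExample {
    public static void main(String[] args) throws IOException {
        ChronicleMap<String, String> map = ChronicleMap
                .of(String.class, String.class)
                .averageKey("record-key-0001")   // sizing hints for the off-heap layout
                .averageValue("a short string value")
                .entries(1_000_000)              // expected maximum number of entries
                .createPersistedTo(new File("records.dat"));

        map.put("key-1", "value-1");  // written through to the memory-mapped file
        map.close();                  // contents survive a JVM restart
    }
}
```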
Will all the records and keys you need fit in memory at once? If so, you could just use a HashMap<String, String> and serialize it to disk, since HashMap is Serializable.
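A minimal sketch of that approach, writing the map to disk on shutdown and reloading it on startup (the file name is an assumption):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class MapStore {
    private static final File FILE = new File("records.ser");

    static void save(HashMap<String, String> map) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(FILE))) {
            out.writeObject(map);  // HashMap and String are both Serializable
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, String> load() throws IOException, ClassNotFoundException {
        if (!FILE.exists()) return new HashMap<>();
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(FILE))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```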

Ideas on this alternative to ORM + RDBMS? [closed]

I am currently developing a proof of concept for an alternative data store. The reason is that I need to enhance a read-mostly clustered webapp, but also that I want to free myself from the pain of the sometimes overly complex ORM+RDBMS solution.
Overall the idea is quite similar to a distributed cache with persistence (letting the cluster be the SoR), however:
want to be able to retrieve any object along with its children, by id (providing class & id) [only that to start off, as the main querying part is already resolved with Lucene in my app]
need to have a map of maps of types (~ tables in the relational world), and therein distributed maps of 'dehydrated' stored objects (flattening the object graph via reflection deep cloning)
a bin log (like Prevayler, for example) for eventual recovery if the whole cluster goes down, for development (and the ability to refactor code / change structure), and perhaps to be processed asynchronously for other purposes (reporting, whatever)
eventually, later on, try to integrate a statically-typed query mechanism, like LINQ, Jaque or H2's JaQu / see ODBs / Lucene (?)
it has to be transaction-aware (not sure about "JTA type" though)
I'm planning to implement this idea with Hazelcast (I love its super-simple API) or Terracotta (which I have never used, but I'm aware of their 'sweet spot': medium-term data). If you will, my aim is to do more or less what Jonas once blogged about. Using one of these, the stored data would roughly have to fit in the sum of the JVM heaps of the cluster.
This should be pretty simple to scale, would avoid the relational impedance mismatch (i.e. saving as with an ODB), and would avoid the JDBC + I/O overhead.
Do you know of other tools/frameworks or combination thereof already providing similar functionality, that I'm ignoring?
Can you suggest other ways of tackling this 'getting rid of the DB'? What flaws do you already see in this idea?
Concurrency-wise would it make sense to consider Scala instead of Java?
How about non-relational data stores such as CouchDB, Neo4j, Hypertable, HBase?
A similar question was asked one month ago - but there was no concrete solution.
BTW I just stumbled upon the concept of Enterprise Data Fabric, which, to my surprise, describes a lot of these ideas.
Definitely give Terracotta a try. It's free (unless you go Enterprise, which has an SLA and support). It is a JVM-level cluster, so to speak, so you don't have the issues associated with sessions on multiple boxes behind disparate JK workers (assuming you're using this for a J2EE app).
I'm just rambling, so have a look here: http://en.wikipedia.org/wiki/Terracotta_Cluster
UPDATE: there are numerous bits of info on Terracotta on the web too, e.g. http://blog.terracottatech.com/2007/12/fud_of_the_week_terracotta_doe.html
UPDATE 2: a bit of background on my views: I work for a company with a fairly big audience. We have an enterprise MySQL setup running with a master and about 5 slaves (times 2, considering we have 2 channels, with 4 app servers per channel), using MySQL's JDBC replication driver (for which we've already submitted various patches). We use Spring 2.5/Hibernate 3 with Spring's declarative JTA transaction management, so read-onlies go to the slaves. With the advent of numerous Ajax enhancements on a future version of our site, our DB servers' load has gone up: we create pricing summaries for thousands of products for all countries, taking into account duties/tax rules for all these countries (plus promotions and real-time auctions running all the time), and then the Ajax services have the latest prices in a blink. Terracotta takes the load off the DB and app servers by making these prices available to all app servers on a JVM layer, with all the JVMs across the boxes linked. So server A can update the prices every few minutes, and if Ajax hits server B, the prices are available immediately. I know there are people/companies out there with similar businesses, who probably have better ideas and implementations, so I'm always open for discussion, but this is my two cents.
I get inspiration from the guys at Facebook too, for instance this very informative article:
http://www.facebook.com/note.php?note_id=23844338919
They talk about memcached, which you should also definitely check out.
As Neo4j is mentioned in the question, I'm chiming in with a few thoughts on using a graph database in this case. (I'm part of the Neo4j team)
retrieving children is trivial in a graph db
there is a map implementation for Neo4j
as graphs are native to a graph db, you could consider not flattening the object graph, but persisting the data in nodes and edges/relationships (this gives you more flexibility in handling the data)
Neo4j is fully transactional
With the new DB technologies emerging today, there's really no need to stay with a RDBMS if your data isn't a good fit for the relational paradigm.
Seems to me Terracotta is a perfect fit for your requirements:
cluster a map to retrieve children via keys (e.g. a clustered Map)
map of maps - no problem
no explicit bin log - but Terracotta already persists everything to disk, so a full cluster restart is already supported
already integrated with Compass, Hibernate Search, and Lucene for search
Transactions? Too slow. Use the cache as a datastore. With persistence, you won't lose data: writes go to (clustered) memory and trickle back to the DB.
In addition, Terracotta does the "reflection" thing you ask for, although it doesn't use reflection, as that is far too slow; it uses bytecode manipulation (BCM). Only changes are propagated over the network.
Hazelcast, btw, requires serialization, so it will be slow and will not do well at all with a map-of-maps data structure (every put will result in a full deep-clone copy across the network), and it doesn't have any kind of persistence built in.
Interesting.
I have a view that we all develop a zoo which comprises all the abstraction layers we habitually use in our projects. And each abstraction layer is a completely different animal.
My goal is to minimize the amount of time spent on the care and feeding of the animals whenever it diverts me from solving the problem at hand; it's overhead, wasted resources. So the fewer, simpler abstraction layers we can get away with, the more productive we are.
I can usually do just fine with two beasties - OOP and RDBMS, coupled through a nice, simple, minimal, hand-crafted DAL. For me, ORM is mostly overhead: one abstraction too many, and a pretty hungry one.
Don't discount the option of treating stored procedures as an abstraction tool, either. If you're really comfortable with SQL, they can be a useful resource for implementing a lightweight BL facade that means not needing to think about the ORM problem.
And this post suggests the emergence of alternatives to RDBMS for some requirements, anyway.
Thanks for your answers.
Actually, you talk about DBs, which is something I want to take completely out of the picture.
The use case I'm targeting is a startup's small/medium-sized clustered webapp (boxes in a LAN, or in the cloud). It needs to retrieve objects at ~RAM-speed levels and scale fairly easily. As a side effect, one wouldn't have to think about DB server installations, impedance mismatch, JDBC, caches, polluting domain models with annotations, etc.
Again, what I want to accomplish is something like what is described here, and I would love to have some more feedback on ideas concerning the actual implementation (why use Terracotta instead of Hazelcast, use serialization or deep cloning via reflection or whatever else, and also the major drawbacks of an approach like this - e.g. why you wouldn't swap your current ORM/DB setup for it).
It has to be super simple to integrate, so it'll feature a really neat Java API to improve code readability. No other software (DB, memcached) will be required.
Try GigaSpaces. I think they have exactly what you require, and if I'm not mistaken there's a free version for startups.
Some concepts:
"Space" is some place where you can store and retrieve objects
Space can be backed by any JDBC-compliant DB, automatically (no code, only configuration)
Space can be started in your java process, so all accesses are at RAM speed
Space can be clustered/partitioned in any way you want (full mirror, partial, grid).
Space supports distributed or local transactions
Check out their wiki (but only the "programmer's guide"; all the rest is marketing BS).
