Ideas on this alternative to ORM + RDBMS? [closed] - java

I am currently developing a proof of concept for an alternative data store. I need to enhance a read-mostly clustered webapp, and I also want to free myself from the pain of the sometimes overly complex ORM+RDBMS combination.
Overall the idea is quite similar to a distributed cache with persistence (letting the cluster be the SoR), however:
- I want to be able to retrieve any object along with its children, by id (providing class & id) [only that to start off, as the main querying part is already resolved with Lucene in my app].
- I need a map of maps of types (~ tables in the relational world), and therein distributed maps of 'dehydrated' stored objects (flattening the object graph via reflective deep cloning).
- I need a bin log (like Prevayler, for example) for:
  - eventual recovery if the whole cluster goes down,
  - development (and the ability to refactor code / change structure),
  - perhaps asynchronous processing for other purposes (reporting, whatever).
- Eventually, later on, I'd try to integrate a statically-typed query mechanism, like LINQ, Jaque or H2's JaQu / see ODBs / Lucene (?).
- It has to be transaction-aware (not sure about "JTA type" though).
I'm planning to implement this idea with Hazelcast (I love its super-simple API) or Terracotta (which I've never used - but I'm aware of their 'sweet spot', medium-term data). If you will, my aim is to do more or less what Jonas once blogged about. Using one of these, the stored data would roughly have to fit in the sum of the JVM heaps of the cluster.
This should be pretty simple to scale, and would avoid the relational impedance mismatch (i.e. save objects as you would with an ODB) as well as the JDBC + I/O overhead.
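To make the "map of maps" idea concrete, here is a minimal sketch of what I have in mind on top of Hazelcast (the ClusterStore class and the byte[] 'dehydrated' representation are hypothetical; the Hazelcast/IMap calls are the standard ones from com.hazelcast.core):
```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Hypothetical sketch: one distributed map per type (~ a table), keyed by id,
// holding the 'dehydrated' (flattened/serialized) state of each object.
public class ClusterStore {
    private final HazelcastInstance hz = Hazelcast.newHazelcastInstance();

    // Each class gets its own distributed map, like a table in an RDBMS.
    private IMap<Long, byte[]> mapFor(Class<?> type) {
        return hz.getMap(type.getName());
    }

    public void put(Class<?> type, long id, byte[] dehydrated) {
        mapFor(type).put(id, dehydrated);
    }

    // Retrieve by (class, id); children would be rehydrated from their own maps.
    public byte[] get(Class<?> type, long id) {
        return mapFor(type).get(id);
    }
}
```
Retrieval by class & id then becomes a single partitioned map lookup, which is exactly the read-mostly access pattern I care about.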
Do you know of other tools/frameworks, or combinations thereof, that already provide similar functionality and that I'm overlooking?
Can you suggest other ways of tackling this 'getting rid of the DB'? What flaws do you already see in this idea?
Concurrency-wise, would it make sense to consider Scala instead of Java?
How about non-relational data stores such as CouchDB, Neo4j, Hypertable, or HBase?
A similar question was asked one month ago - but there was no concrete solution.
BTW I just stumbled upon the concept of Enterprise Data Fabric, which, to my surprise, describes a lot of these ideas.

Definitely give Terracotta a try. It's free (unless you go Enterprise, which has an SLA and support). It is a JVM-level cluster, so to speak, so you don't have the issues associated with sessions on multiple boxes behind disparate JK workers (assuming you're using this for a J2EE app).
I'm just rambling, so have a look here: http://en.wikipedia.org/wiki/Terracotta_Cluster
UPDATE: There are numerous bits of info on Terracotta on the web too, e.g. http://blog.terracottatech.com/2007/12/fud_of_the_week_terracotta_doe.html
UPDATE 2: A bit of background on my views: I work for a company with a fairly big audience. We have an enterprise MySQL setup with a master and about 5 slaves (times 2, since we have 2 channels, with 4 app servers per channel), using MySQL's JDBC replication driver (for which we've already submitted various patches). We use Spring 2.5/Hibernate 3 with Spring's declarative JTA transaction management, so read-only transactions go to the slaves.
With the advent of numerous Ajax enhancements in a future version of our site, our DB servers' load has gone up: we create pricing summaries for thousands of products for all countries, taking into account duty/tax rules for all these countries (plus promotions and real-time auctions running all the time), so that the Ajax services can serve the latest prices in a blink. Terracotta takes the load off the DB and app servers by making these prices available to all app servers at the JVM layer, with all the JVMs across the boxes linked. So server A can update the prices every few minutes, and if an Ajax call hits server B, the prices are available immediately.
I know there are people/companies out there with similar businesses who probably have better ideas and implementations, so I'm always open for discussion, but this is my two cents.
I get inspiration from the guys at Facebook too; for instance, this very informative article:
http://www.facebook.com/note.php?note_id=23844338919
They talk about memcached, which you should also definitely check out.

As Neo4j is mentioned in the question, I'm chiming in with a few thoughts on using a graph database in this case. (I'm part of the Neo4j team)
- retrieving children is trivial in a graph db
- there is a map implementation for Neo4j
- as graphs are native to a graph db, you could consider not flattening the object graph, but persisting data in nodes and edges/relationships (this gives you more flexibility in handling the data)
- neo4j is fully transactional
With the new DB technologies emerging today, there's really no need to stay with a RDBMS if your data isn't a good fit for the relational paradigm.
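For illustration, here is a minimal sketch of the node/relationship approach using Neo4j's 1.x-era embedded API (the store path, property names and CHILD relationship type are made-up examples):
```java
import org.neo4j.graphdb.*;
import org.neo4j.kernel.EmbeddedGraphDatabase;

// Minimal sketch (Neo4j 1.x embedded API): persist an object graph as
// nodes and relationships instead of flattening it.
public class GraphStoreSketch {
    enum RelTypes implements RelationshipType { CHILD }

    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("target/graph-db");
        Transaction tx = db.beginTx();
        try {
            Node parent = db.createNode();
            parent.setProperty("type", "Order");
            Node child = db.createNode();
            child.setProperty("type", "OrderLine");
            parent.createRelationshipTo(child, RelTypes.CHILD);

            // Retrieving children is a local traversal - no joins needed:
            for (Relationship rel : parent.getRelationships(RelTypes.CHILD, Direction.OUTGOING)) {
                Node c = rel.getEndNode();
            }
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```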

Seems to me Terracotta is a perfect fit for your requirements:
- cluster a map to retrieve children via keys (e.g. a clustered Map)
- map of maps - no problem
- no explicit bin log - but Terracotta already persists everything to disk, so a full cluster restart is already supported
- already integrated with Compass, Hibernate Search, and Lucene for search
Transactions? Too slow. Use the cache as a datastore. With persistence enabled you won't lose data: writes go to (clustered) memory and trickle back to the DB.
In addition, Terracotta does the "reflection" thing you ask for - although it doesn't use reflection, as that is far too slow. It uses bytecode manipulation (BCM), and only changes are propagated over the network.
Hazelcast, btw, requires serialization, so it will be slow and will not do well at all with a map-of-maps data structure (every put will result in a full deep-clone copy across the network), and it doesn't have any kind of persistence built in.

Interesting.
My view is that we all develop a zoo comprising all the abstraction layers we habitually use in our projects - and each abstraction layer is a completely different animal.
My goal is to minimize the time spent on the care and feeding of the animals whenever it diverts me from solving the problem at hand - it's overhead, wasted resources. So the fewer and simpler the abstraction layers we can get away with, the more productive we are.
I can usually do just fine with two beasties - OOP and RDBMS - coupled through a nice, simple, minimal, hand-crafted DAL. For me, ORM is mostly overhead: one abstraction too many, and a pretty hungry one at that.
Don't discount the option of treating stored procedures as an abstraction tool, either. If you're really comfortable with SQL, they can be a useful resource for implementing a lightweight BL facade that sidesteps the ORM problem.
And this post suggests the emergence of alternatives to RDBMS for some requirements, anyway.

Thanks for your answers.
Actually, you're talking about DBs, which is something I want to take out of the picture completely.
The use case I'm targeting is a startup's small/medium-sized clustered webapp (boxes on a LAN, or in the cloud). It needs to retrieve objects at ~RAM-speed levels and scale fairly easily. As a side effect, one wouldn't have to think about DB server installations, impedance mismatch, JDBC, caches, polluting domain models with annotations, etc.
Again, what I want to accomplish is something like what is described here, and I would love some more feedback on ideas concerning the actual implementation (why use Terracotta instead of Hazelcast, use serialization or deep cloning via reflection or whatever else, and also the major drawbacks of an approach like this - e.g. why you wouldn't swap it in for your current ORM/DB setup).
It has to be super simple to integrate, so it'll feature a really neat Java API to improve code readability. No other software (DB, memcached, etc.) will be required.

Try GigaSpaces. I think they have exactly what you require, and if I'm not mistaken there's a free version for startups.
Some concepts:
- A "Space" is a place where you can store and retrieve objects.
- A space can be backed by any JDBC-compliant DB, automatically (no code, only configuration).
- A space can be started in your Java process, so all accesses are at RAM speed.
- A space can be clustered/partitioned in any way you want (full mirror, partial, grid).
- A space supports distributed and local transactions.
Check their wiki (but only the "programmer's guide" - all the rest is marketing BS).
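As a rough sketch of how a space is used (based on the OpenSpaces API as I recall it; the Price class, the space name and the @SpaceId placement are illustrative assumptions - check the programmer's guide for the exact details):
```java
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

// Hypothetical sketch of GigaSpaces' OpenSpaces API: an embedded space
// ("/./" means in this JVM), written to and read back by template matching.
public class SpaceSketch {
    public static class Price {
        private String sku;
        private Double amount;
        public Price() {}
        public Price(String sku, Double amount) { this.sku = sku; this.amount = amount; }
        @com.gigaspaces.annotation.pojo.SpaceId
        public String getSku() { return sku; }
        public void setSku(String sku) { this.sku = sku; }
        public Double getAmount() { return amount; }
        public void setAmount(Double amount) { this.amount = amount; }
    }

    public static void main(String[] args) {
        GigaSpace space = new GigaSpaceConfigurer(
                new UrlSpaceConfigurer("/./priceSpace").space()).gigaSpace();
        space.write(new Price("SKU-1", 9.99));

        Price template = new Price();
        template.setSku("SKU-1");   // null fields act as wildcards
        Price match = space.read(template);
    }
}
```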

Related


Java Caching frameworks for storing huge data.
Context: We are developing a RESTful service using Jersey 2.6 and will deploy it on WAS 8.5. This service needs to serve more than 10 million requests per day.
We need to implement a cache to store more than 300k objects (the data will come from the DB), and we need some way to update the cache on a daily basis.
Is this approach of caching 300k objects and updating them daily recommended?
Are there any Java frameworks which support this kind of functionality?
Your question is too general to get a clear answer. You need to describe the problem you are trying to solve:
- Are you concerned about response times?
- Are you trying to protect your DB from doing heavy lifting?
- Are you expecting to have to scale out, and want to be sure you can deal with future loads?
Additionally, some more contextual information would be useful, especially:
- How dynamic is your data compared to your requests?
- What percentage of your data population will be requested on an average day? (How many of the 3 lakh objects will be enquired upon at least once per day? If you don't know, provide your best guess.)
Your figures of 3 lakh (300k) data points and 10M requests mean that you expect to hit each object on average 33 times a day, which indicates that you are more concerned about back-end DB load than about your responses being right up to date.
In my experience there are a lot of fairly primitive solutions that will work much better than going for a heavyweight distributed system such as Mongo, Cassandra or Coherence.
My first response would be: keep it simple - 300k objects is not too much to store in an internal hash table which you flush once a day and populate on first request (see the sketch below).
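A minimal sketch of that "flush once a day, populate on first request" idea in plain Java (the DailyCache class and its loader function are hypothetical):
```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: an in-process cache that is cleared once a day
// and repopulated lazily on first request for each key.
public class DailyCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;  // e.g. a DB lookup

    public DailyCache(Function<K, V> loader) {
        this.loader = loader;
        // Flush the whole cache once a day; entries reload on next access.
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(map::clear, 1, 1, TimeUnit.DAYS);
    }

    public V get(K key) {
        return map.computeIfAbsent(key, loader::apply);
    }
}
```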
If you need to scale horizontally, I would suggest memcached (via spymemcached) with a 1-day cache time, populating it whenever you don't find an existing entry.
I would NOT go for something like Cassandra or Mongo unless you have really compelling reasons to require a persistent store. Rationale: purging can become really onerous, especially if your data is fast-moving. For example, Cassandra does not really know how to delete; instead it "tombstones" deleted entries, which means your data store will simply grow and grow until you create a purging strategy.
The question is whether the caching must be distributed at all. Remember, a cache holds data you have already seen - replicating it around the cluster on the off-chance it might be of use elsewhere... well, why?
Distributed cache systems: Redis, Cassandra in memory, MongoDB in memory.
A local RocksDB (which lets you store byte[] -> byte[]) on SSDs makes a fine local cache layer. You might also add a distributed layer on top of it. That is usually better than something off the shelf, and should also be easy to implement.
10 million requests per day isn't much: even if they all arrive within 10 hours, that is 1 million per hour, i.e. 1,000,000 / 3600, or roughly 280 requests per second. With that kind of load you can usually get by with an efficient frontend and an efficient backend. (We can serve 40k pages per second per core, and we have 24 cores - you do the math. Data in memory, no caching done.)
For the caching provider I suggest Coherence. I am using Coherence at my company, and it is very robust and stays synchronized across multiple clusters.
As for how to handle the cache, it depends on the nature of your application. Based on my experience with caching, I've decided to update the cache in the following scenarios:
- grid paging
- browsing
and to clear the cache and reload the data in these:
- edit item
- add new item
- delete item
I decided this because maintaining the cache is an overkill headache that will blow up in your face when you deal with statistics and nested hierarchies.
Hope this helps.
Yes, there are - for example Coherence and Hazelcast. Both are distributed caches.
http://java.dzone.com/articles/sneak-peek-jcache-api-jsr-107
In general you should cache what you are actually using, and the cache should always be in sync, not refreshed daily. You place recently used objects in the cache and use read/write-through caching to your DB.
If you have money, the best one is Coherence (its reputation is proven by the big financial companies using it).
Hazelcast is another distributed cache you can use; it sits one level below Coherence on performance metrics.
You could try Ehcache. It can be used as a query cache or even as a Hibernate second-level cache.
You can configure how long entities should be stored in the cache before they are invalidated.
If you already have WebSphere ND 8.5.5, you may take a look at WebSphere eXtreme Scale, which is provided with it. It is a distributed, partitioned caching solution that integrates with WebSphere. See the WebSphere eXtreme Scale overview for more details.
See the new JCache standard (JSR 107 in the Java Community Process). This API is implemented by Coherence and other caching implementations (Ehcache etc.), and also has a small reference implementation that you can use for basic use cases.
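For reference, a minimal sketch of the JSR-107 API with a one-day expiry, which maps directly onto the daily-refresh requirement (the cache name and key/value types are illustrative):
```java
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

// Sketch of the JCache (JSR-107) API; any compliant provider (Coherence,
// Ehcache, Hazelcast, the reference implementation) can sit behind it.
public class JCacheSketch {
    public static void main(String[] args) {
        CacheManager manager = Caching.getCachingProvider().getCacheManager();

        MutableConfiguration<Long, String> config = new MutableConfiguration<Long, String>()
                .setTypes(Long.class, String.class)
                // Entries expire one day after creation, forcing a reload.
                .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_DAY));

        Cache<Long, String> cache = manager.createCache("products", config);
        cache.put(1L, "some product data");
        String value = cache.get(1L);
    }
}
```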
Yes, any of the Java caching frameworks should be able to help you. Coherence (note: I work with Coherence at Oracle), for example, can definitely handle 3,00,000 items easily (I assume you are from India since you use lakh!), but I suggest only using Coherence if you are deploying this on more than one server.

Different scenarios on distributed processing [closed]

I have a web application - a simple web application archive file - that has several storage adapters for different storage types, i.e. MongoDB and CouchDB. Using this application I can store/query data in those databases through the web services I have written. Currently I can only have one single database instance per application, not more than one, which prevents parallel processing.
What I want is to run my application on several machines. On top of those, I want to write a UI enabling the client to store/query the data without knowing the database types/addresses.
I have two different scenarios and wanted to ask which of them is the better way to do it, and why.
1) Let's say I have three servers running three single databases - CouchDB. I can upload my application to those servers and then, with the help of my UI or a layer above my application, I can define a map of servers so that I can store and query the data.
As you can see above, database and application live on the same server, so both are remote to the client.
2) Let's say the three servers are still running remotely, but in this case my application is local, and I have enabled it to accept several database instances.
I actually prefer the first one, since in that case I won't need to extend my application, but I wanted to hear what you think about it. I would be glad if you could provide some sources on these kinds of distributed scenarios - I have no experience at all with this kind of thing.
Please take a look at this article describing the Instagram architecture. It's quite interesting to see how 3 engineers handled 15-25 million users with 150 million photos per day.
I would also recommend an interesting blog which describes different scalability solutions for popular web resources:
Facebook: An Example Canonical Architecture for Scaling Billions of Messages
Tumblr Architecture - 15 Billion Page Views a Month and Harder to Scale than Twitter
There is a lot of information there.
But the most common themes are:
- keep everything as easy as possible
- use well-known, mature technologies which have good community support
- scale each part of the application separately
And even though you may find explanations of each of these, I'd like to focus on the last one, given your requirements.
When you want to make your application horizontally scalable, you need to consider each cluster as a separate logical module, regardless of the actual number of servers involved. E.g. for your web application you can set up several instances of that application and put a load balancer in front of them. Thus users access a single entry point (e.g. http://mysite.com), while the actual instance serving them may be arbitrary.
If instances need to collaborate with each other, then you need to avoid per-instance in-memory state and instead use "key-value" storages such as Redis, along with message brokers such as ActiveMQ, RabbitMQ, or cloud offerings like Iron.io, etc.
Data storage also needs to be treated as a single entry point, e.g. a sharded cluster (MongoDB, for instance, supports auto-sharding out of the box, and most NoSQL solutions - CouchDB, HBase - have it as well).
So basically you call some shard controller, which redirects you to the corresponding instance according to a specific shard key. But please note that sharding is usually quite a non-trivial thing; that is why, in most cases, when you deal with an RDBMS you end up scaling vertically.
Considering everything above, I would suggest a structure along these lines.
Ideally, all the servers should be physically near each other (e.g. in the same data centre). But if you are going to use your application worldwide, then you need to place your instances so as to minimize latency. Here is a quite interesting lecture about server configuration (even though it's about MongoDB, I believe some of the approaches may be helpful in your case as well): https://www.youtube.com/watch?v=TZOH92mZIN8
But if you do not need all your servers for distributed "map/reduce" computing, and a single particular server instance is enough to produce the result, then I believe scenario #1 is fairly suitable and better for your needs (provided you set up a load balancer in front of your instances).

Solution to provide shared entities between multiple Java processes

I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows; I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ON DUPLICATE KEY UPDATE, using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to look up/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
- NoSQL: something like MongoDB, HBase, etc.
- a caching solution like Ehcache, memcached, etc.
The problem is that I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
- very fast on a single system; most of the applications I mentioned were built for distributed systems, but that's not the case here
- easy to learn/use (I want to be able to prototype it in a day)
- mature technology
- free to use, even for commercial purposes
- preferably open source
You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI, as sketched below.
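A minimal sketch of such a coordinator (interface and class names are made up; note that plain RMI may struggle at >100k lookups/updates per second, so measure first):
```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a coordinator process exposing shared flow state via RMI.
interface FlowStore extends Remote {
    void update(String flowId, String info) throws RemoteException;
    String lookup(String flowId) throws RemoteException;
}

class FlowStoreImpl implements FlowStore {
    private final Map<String, String> flows = new ConcurrentHashMap<>();

    public void update(String flowId, String info) { flows.put(flowId, info); }
    public String lookup(String flowId) { return flows.get(flowId); }

    public static void main(String[] args) throws Exception {
        // Export the coordinator and register it so other processes can find it.
        FlowStore stub = (FlowStore) UnicastRemoteObject.exportObject(new FlowStoreImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("flowStore", stub);
    }
}
```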
If you only want to exchange objects between Java applications, you could also look into tuple spaces. There is a Java-specific implementation of spaces, JavaSpaces, which should be able to do what you need. I'm not sure it can keep up with your performance requirements, though. I'm also not sure how widely this technology is still being used, since it only supports Java and isn't as flexible as NoSQL stores are these days.
Wikipedia has a more detailed description and a list of different implementations, many of which are open source.
The other option is to go with Redis: it has notifications, and it can certainly scale to the requirements you are looking for.
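A rough sketch with the Jedis client, using a hash per flow plus pub/sub for the notifications mentioned above (key and channel names are made up):
```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

// Hypothetical sketch: shared state lives in Redis hashes, and pub/sub
// notifies other processes about updates.
public class RedisFlowStore {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store a piece of flow information, keyed by flow id.
            jedis.hset("flow:42", "status", "PARTIAL");
            // Notify other processes that flow 42 changed.
            jedis.publish("flow-updates", "42");
        }

        // A consumer process would subscribe (blocking call) like this:
        new Thread(() -> {
            try (Jedis sub = new Jedis("localhost", 6379)) {
                sub.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        System.out.println("flow updated: " + message);
                    }
                }, "flow-updates");
            }
        }).start();
    }
}
```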
The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view, I would say distributed caches are the replacement for that nowadays. Especially take a look at Hazelcast and Infinispan.
From the performance viewpoint, I am not happy with the performance of the "big" distributed caching solutions when only a single in-memory cache is needed; see my write-up on the cache2k benchmarks page (Hazelcast needs to be added there).
Anyway, please clarify your problem statement first, because your question falls into the XY-problem category. You are not describing the actual problem, and your question just boils down to a "fast reliable distributed objects" solution. What kind of data comes in? What is the rate? Who accesses it? What consistency guarantees need to be met, considering that writing and reading happen in parallel?
From the term "flow of information", it sounds more like a complex event processing problem to me.

RealWorld HazelCast [closed]

Does anyone have any real-world experience with the Hazelcast distributed data grid and execution product? How has it worked for you? It has an astonishingly simple API, and functionality that seems almost too good to be true for such a simple-to-use tool. I have done some very simple apps and it seems to work as advertised so far. So here I am, looking for the real-world 'reality check'. Thank you.
We've been using it in production since version 1.8+, mainly for the distributed locking feature. It works great; we've found a couple of workarounds/bugs, but those were fixed relatively fast.
With 1.8M locks per day, we have found no problems so far.
I recommend starting with version 1.9.4.4.
There are still some open issues with its development; see
http://code.google.com/p/hazelcast/issues/list
Generally, you can choose to either let it use its own multicast algorithm or specify your own IPs. We've tried it in a LAN environment and it works pretty well. Performance-wise it's not bad, but the monitoring tool didn't work very well, as it failed to update most of the time. If you can live with the current issues, then by all means go for it. I would use it with caution, but it's a great working tool IMHO.
Update:
We've been using Hazelcast for a few months now and it's working very well. The settings are relatively easy to configure and, with the new updates, comprehensive enough to customize even small things, such as the number of threads allowed for read/write operations.
We are using Hazelcast (1.9.4.6 now) in production, integrated with a complicated transactional service. It was added to alleviate immediate database throughput issues. We have discovered that we frequently have to stop it, bringing down all transaction services for at least an hour. We are running clients in superclient mode because it is the only option that even remotely meets our performance requirements (about 4 times faster than native clients). Unfortunately, stopping a superclient node causes split-brain issues and makes the grid lose records, forcing a complete shutdown of services. We have been trying to make this product work for us for almost a full year now, and even paid to have two Hazelcast reps flown in to help. They were unable to produce a solution, but were able to tell us that we were probably doing it wrong. In their opinion it should work better, but it was pretty much a wasted trip.
At this point we are on the hook for over six figures per year in licensing fees, and we are currently using about 5 times the resources to keep the grid alive and meet our performance needs than we would be using with a clustered and optimized database stack. This was absolutely the wrong decision for us.
This product is killing us. Use with caution, sparingly, and only for simple services.
If my own company and projects count as real world, here's my experience. I wanted to get as close as possible to eliminating external (disk) storage in favor of limitless and persistent "RAM". For starters, that eliminates CRUD plumbing, which sometimes makes up to 90% of the so-called "middle tier". There are other benefits: since RAM is your "database", you don't need any complex caches or HTTP session replication (which in turn eliminates the ugly sticky-session technique).
I believe RAM is the future, and Hazelcast has everything needed to be an in-memory database: queries, transactions, etc. So I wrote a mini-framework abstracting it, to load data from persistent storage (I can plug in anything that can store BLOBs - the fastest turned out to be MySQL). It would take too long to explain why I didn't like Hazelcast's built-in persistence support; it's rather generic and rudimentary, and they should remove it. Implementing your own distributed and optimized write-behind and write-through is not rocket science; it took me a week.
Everything was fine until I started performance testing. Queries are slow, even after all the optimizations I did: indexes, Portable serialization, explicit comparators, etc. A simple "greater than" query on an indexed field takes 30 seconds on a set of 60K records of 1K each (map entries). I believe the Hazelcast team did everything they could. As much as I hate to say it, Java collections are still slow compared to the super-optimized C++ code normal databases use. There are some open-source Java projects that address that. However, at this time query performance is unacceptable; it should be instant on a single local instance. It is an in-memory technology, after all.
I switched to Mongo for the database, but kept Hazelcast for shared runtime data - namely sessions. Once they improve query performance, I'll switch back.
If you have alternatives to Hazelcast, maybe look at those first. We run it in production and it is still quite buggy - just check out the open issues.
However, the integration with Spring, Hibernate etc. is quite nice, and the setup is really easy :)
We use Hazelcast in our e-commerce application to make sure our inventory is consistent.
We make extensive use of distributed locking to make sure SKU items of inventory are modified atomically, because there are hundreds of nodes in our web application cluster operating concurrently on these items.
Also, we use a distributed map for caching purposes, shared across all the nodes. Since scaling nodes in Hazelcast is so simple, and it utilizes all the CPU cores, it gives it an added advantage over Redis or any other caching framework.
We have been using Hazelcast for the last 3 years in our e-commerce application to make sure availability (supply & demand) is consistent, atomic, available & scalable.
We use IMap (distributed map) to cache the data, and Entry Processors for read & write operations to perform fast in-memory operations on the IMap without having to worry about locks. A rough sketch of that pattern follows.
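A sketch of the IMap + EntryProcessor pattern, using the Hazelcast 3.x-era API (the map name, SKU key and stock-adjustment logic are illustrative):
```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;

import java.util.Map;

// Hypothetical sketch: atomically adjust the stock level for a SKU with an
// EntryProcessor, so no explicit lock is needed. The processor runs on the
// cluster member that owns the key's partition.
public class StockAdjuster {
    static class AdjustStock extends AbstractEntryProcessor<String, Integer> {
        private final int delta;
        AdjustStock(int delta) { this.delta = delta; }

        @Override
        public Object process(Map.Entry<String, Integer> entry) {
            int current = entry.getValue() == null ? 0 : entry.getValue();
            entry.setValue(current + delta);  // applied atomically on the owning node
            return entry.getValue();
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Integer> stock = hz.getMap("stock");
        stock.put("SKU-1", 10);
        Object newLevel = stock.executeOnKey("SKU-1", new AdjustStock(-3));
        System.out.println(newLevel); // 7
    }
}
```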

Anyone using JavaSpaces technology? [closed]

Are there real, practical uses of JavaSpaces technology out there, and how exactly is it implemented?
We are currently using JavaSpaces (the Sun Outrigger implementation) to coordinate loosely coupled processes. The idea behind it is compelling, and the API is very simple. The actual implementation has been a problem: it's built on Jini, so 5 or 6 processes are required to bring up a space, and, at least in Sun's implementation, there is no way to have it communicate over specific ports, which makes firewalls a bit of a pain.
The other issue we have run into is that there is no implied ordering in the space. So if you put 5 objects in, and your template on the read/take matches all 5, it is unspecified which one you will get. Depending on the application, this may or may not be an issue.
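For readers unfamiliar with the API, here is a minimal sketch of the write/take template matching being discussed (the Task entry is made up; looking up the JavaSpace itself requires Jini discovery, which is elided):
```java
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Hypothetical sketch of the JavaSpaces API: entries are matched by template,
// with null fields acting as wildcards.
public class SpaceExample {
    // Entries must be public classes with public object fields and a no-arg constructor.
    public static class Task implements Entry {
        public String type;
        public Integer payload;
        public Task() {}
        public Task(String type, Integer payload) { this.type = type; this.payload = payload; }
    }

    public static void process(JavaSpace space) throws Exception {
        space.write(new Task("resize", 42), null, Lease.FOREVER);

        Task template = new Task();   // null fields match anything
        template.type = "resize";
        // take() removes a matching entry; which of several matches you get is unspecified.
        Task task = (Task) space.take(template, null, Long.MAX_VALUE);
    }
}
```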
GigaSpaces is a mature version of JavaSpaces. It is widely used in financial applications, which are kept quiet.
As for the implementation, it is basically a transactional object database on top of Jini. The queries are similar to db4o.
I've seen it used in a financial application, mostly for managing compute workers (grid style), where entries were written into the space from front-tier applications and pulled out by workers matching on a field showing that work was needed. Results could be written back into the space, triggering a notify registered by the front-tier app, which then reads back the finished piece of work.
For compute workers it's OK, but the lack of ordering may be an issue for you (if only because of unpredictability) - some implementations have features to enforce FIFO ordering. It was also used for long-term data storage, since it was persistent, but I don't think that was a good idea: the admin tooling wasn't good enough to make it manageable, and performance suffered due to the volume of data.
Dan Creswell's Blitz JavaSpaces implementation was used - it has a good range of features (it can run in transient or persistent modes), is designed to be robust (with transaction logging) while retaining high performance, and is very tunable. As with the other Jini services, you can configure the "exporter" to have it listen on specific ports to make firewalling easier - SSL transports and full PKI were used too, and are made possible by Jini's abstraction of communication.
I think GigaSpaces is the only implementation that has continued to innovate by extending the specification in numerous ways, which is nice to see. They've made it fit a wide variety of use cases and added implementation features such as clustering and high availability. Using it would worry me, though, as I'd be much happier seeing two or more community implementations of these features, given GigaSpaces is fairly proprietary.
I believe Orbitz, which is a reservation system for hotels, runs on Jini.
Based on Java Posse episodes #82, #84 and #86 (an interview with Vin Simmons), this technology is sometimes used in military or financial applications, which are unfortunately kept quiet.
I used it a few years back, but it probably has not changed much.
@Keith: it is (or at least used to be) possible to start all the services in a single process/JVM, and I think there is documentation out there on how to do this.
I believe Jini/JavaSpaces is used in a few large applications (ticketing, cell phones etc.) in Europe. It is also used by GE Aircraft for research and analysis.
The SORCER lab at Texas Tech has a large SOA architecture built on top of Jini/JavaSpaces, and you may be able to find some help there.
I'm not aware of any new usage of JavaSpaces at this point in time. For distributed computing, most large-scale systems are being built with In-Memory Data Grid technology or partitioned NoSQL-like solutions. (I see a lot of Oracle Coherence being used, but that's probably because I work with it.)
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
