I have a task to build a prototype for a massively scalable distributed shared memory (DSM) app. The prototype would only serve as a proof of concept, but I want to spend my time most effectively by picking the components that would be used in the real solution later on.
The aim of this solution is to take data input from an external source, churn it, and make the result available to a number of frontends. Those "frontends" would just take the data from the cache and serve it without extra processing. The number of frontend hits on this data can literally be millions per second.
The data itself is very volatile; it can (and does) change quite rapidly. However, the frontends should see "old" data until the newest has been processed and cached. The processing and writing is done by a single (redundant) node while the other nodes only read the data. In other words: no read-through behaviour.
I was looking into solutions like memcached; however, this particular one doesn't fulfil all our requirements, which are listed below:
The solution must at least have a Java client API that is reasonably well maintained, as the rest of the app is written in Java and we are seasoned Java developers;
The solution must be totally elastic: it should be possible to add new nodes without restarting other nodes in the cluster;
The solution must be able to handle failover. Yes, I realize this means some overhead, but the overall served data size isn't big (1G max) so this shouldn't be a problem. By "failover" I mean seamless execution without hardcoding/changing server IP address(es) like in memcached clients when a node goes down;
Ideally it should be possible to specify the degree of data overlapping (e.g. how many copies of the same data should be stored in the DSM cluster);
There is no need to permanently store all the data, but there might be a need for post-processing some of the data (e.g. serialization to the DB).
Price. Obviously we prefer free/open source, but we're happy to pay a reasonable amount if a solution is worth it. In any case, a paid 24/7 support contract is a must.
The whole thing has to be hosted in our data centers, so SaaS offerings like Amazon SimpleDB are out of scope. We would only consider this if no other options were available.
Ideally the solution would be strictly consistent (as in CAP); however, eventual consistency can be considered as an option.
Thanks in advance for any ideas.
Have a look at Hazelcast. It is a pure Java, open source (Apache license), highly scalable in-memory data grid product. It offers 24/7 support, and it solves all of your problems; I have tried to address each of them below:
It has a native Java Client.
It is 100% dynamic. Add and remove nodes dynamically. No need to change anything.
Again, everything is dynamic.
You can configure the number of backups.
Hazelcast supports persistence.
Everything that Hazelcast offers is free (open source), and it does offer enterprise-level support.
Hazelcast is a single JAR file and super easy to use: just add the JAR to your classpath. Have a look at the screencast on the main page.
Hazelcast is strictly consistent. You can never read stale data.
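For illustration, here is a minimal sketch of what a node could look like (Hazelcast 3.x-style API; the map name and backup count are just examples, not from the question):

```java
// Minimal sketch: the writer node puts processed results into a distributed
// map; reader nodes join the same cluster and read them. Names are illustrative.
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.Map;

public class DsmNode {
    public static void main(String[] args) {
        Config config = new Config();
        // Degree of data overlap: keep two extra copies of each entry.
        config.getMapConfig("results").setBackupCount(2);

        // Starting an instance forms or joins the cluster automatically.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        Map<String, byte[]> results = hz.getMap("results");

        results.put("latest", new byte[] {1, 2, 3}); // writer node
        byte[] latest = results.get("latest");       // any reader node
    }
}
```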
I suggest you use Redisson, a Redis-based in-memory data grid for Java. It implements BitSet, BloomFilter, Set, SortedSet, Map, ConcurrentMap, List, Queue, Deque, BlockingQueue, BlockingDeque, ReadWriteLock, Semaphore, Lock, AtomicLong, CountDownLatch, Publish/Subscribe, RemoteService, ExecutorService, LiveObjectService and SchedulerService on top of the Redis server! It supports master/slave, sentinel and cluster server modes, and automatic cluster/sentinel topology discovery is supported as well. This lib is free and open source.
It works perfectly in the cloud thanks to AWS ElastiCache support.
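A minimal sketch of how this could look (it assumes a Redis server on localhost; the map name is illustrative):

```java
// Minimal sketch: connect to a single Redis server and use a distributed map.
import org.redisson.Redisson;
import org.redisson.api.RMap;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonSketch {
    public static void main(String[] args) {
        Config config = new Config();
        // For cluster or sentinel modes use useClusterServers() /
        // useSentinelServers() instead of useSingleServer().
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        RMap<String, String> map = redisson.getMap("results");
        map.put("latest", "processed-data");
        String latest = map.get("latest");

        redisson.shutdown();
    }
}
```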
Depending on what you prefer, I would follow the others in suggesting Hazelcast if you lean towards AP in the CAP theorem, but if you need CP, I would choose Redis.
Have a look at Terracotta's JVM clustering; it's open source ;)
It has no API as such; it works at the JVM level, so when you store a value in a replicated object it is sent to all other nodes.
Even locking and the like work transparently, without adding any new code.
You may want to check out Java-specific solutions like Coherence: http://www.oracle.com/global/ru/products/middleware/coherence/index.html
However, I consider such solutions too complex and prefer to use solutions like memcached. The big disadvantage of memcached for your purpose is its apparent lack of record locking, and there is no built-in way to replicate data for failover. That is why I would look into key-value data stores; many of them would satisfy your needs completely.
Here is a list of key-value data stores that may help you with your task:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
Just pick one that you feel comfortable with.
I am doing a similar project, but targeting the .NET platform instead. Apart from the already mentioned solutions, I think you should take a look at ScaleOut StateServer and Alachisoft NCache. I am afraid neither of these alternatives is cheap, but in my judgement they are a safer bet than open source for commercial solutions.
Both provide Java client APIs, even though I have only played around with the .NET APIs.
StateServer features self-discovery of new cache nodes, and NCache has a management console where new cache nodes can be added.
Both should be able to handle failovers seamlessly.
StateServer can have 1 or 2 passive copies of the data. NCache features more caching topologies to choose between.
If you mean write-through/write-behind to a database, that is available in both.
I have no idea how many cache servers you plan to use, but here are the full price specs:
ScaleOut StateServer
Alachisoft NCache
Both are installed and configured locally on your server and they both have GUI Management.
I am not sure exactly what strictly consistent involves, so I'll leave that for you to investigate.
Overall, StateServer is the best option if you want to skip configuring every little detail in the cache cluster, while NCache offers a great many features and caching topologies to choose from.
Depending on how the data behaves towards the clients (if the same data is read many times by the same client), it might be a good idea to mix local caching on the clients with the distributed caching in the cluster (available in both NCache and StateServer). Just a thought.
The specified use case seems to fit Netflix's Hollow, a read-only replicated cache with a single producer and multiple consumers.
Have you thought about using a standard messaging solution like RabbitMQ?
RabbitMQ is an open source implementation of the AMQP protocol.
Your application seems more or less like a publish/subscribe system.
The publisher node is the one that does the processing and puts messages (the processed data) in a queue on the server.
Subscribers can get messages from the server in various ways. AMQP decouples the producer and the consumer of messages and is very flexible in how you can combine the two sides.
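For illustration, a minimal sketch of the publisher side using the RabbitMQ Java client (it assumes a broker on localhost; the exchange name "data" is just an example):

```java
// Minimal sketch: publish processed data to a fanout exchange, which copies
// every message to each subscriber's queue.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class DataPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.exchangeDeclare("data", "fanout");
            channel.basicPublish("data", "", null, "processed-data".getBytes());
        }
    }
}
```

Each subscriber would declare its own queue, bind it to the "data" exchange, and consume from it.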
Java Caching frameworks for storing huge data.
Context: We are developing a RESTful service using Jersey 2.6 and will deploy it on WAS 8.5. This service needs to serve more than 10 million requests per day.
We need to implement a cache to store more than 300k objects (the data will come from the DB), and we need some way to update the cache on a daily basis.
Is this approach of caching 300k objects and updating them on a daily basis recommended?
Are there any Java frameworks which support this kind of functionality?
Your question is too general to get a clear answer. You need to describe what problem you are trying to solve.
Are you concerned about response times?
Are you trying to protect your DB from doing heavy lifting?
Are you expecting to have to scale out, and want to be sure that you can deal with future loads?
Additionally some more contextual information would be useful, especially:
How dynamic is your data compared to your requests?
What percentage of your data population will be requested on average per day? (How many of the 3 lakh objects will be enquired upon at least once per day? If you don't know, provide your best guess).
Your figures of 3 lakh (300k) data points and 10M requests mean that you expect to hit each object on average 33 times a day, which indicates that you are more concerned about back-end DB load than about your responses being right up to date.
In my experience there are a lot of fairly primitive solutions that will work much better than going for heavyweight distributed systems such as Mongo, Cassandra or Coherence.
My first response would be: keep it simple. 300k objects is not too much to store in an internal hash table that you flush once a day and populate on first request.
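For illustration, a minimal sketch of that idea using only the JDK (the class and method names are mine, not from any framework):

```java
// Minimal sketch: an in-memory table that is flushed once a day and
// repopulated lazily on first request, e.g. from the DB.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class DailyCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a DB lookup

    public DailyCache(Function<K, V> loader) {
        this.loader = loader;
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Flush the whole table once a day; entries repopulate on next request.
        scheduler.scheduleAtFixedRate(map::clear, 1, 1, TimeUnit.DAYS);
    }

    public V get(K key) {
        return map.computeIfAbsent(key, loader); // populate on first request
    }
}
```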
If you need to scale horizontally, I would suggest memcached (e.g. via spymemcached) with a one-day cache time, populating an entry when you don't find an existing one.
I would NOT go for something like Cassandra or Mongo unless you have really compelling reasons to require a persistent store. Rationale: purging can become really onerous, especially if your data is fast-moving. For example, Cassandra does not really know how to delete, but instead "tombstones" deleted entries, which means that your data store will simply grow and grow until you create a strategy for purging.
The question is whether the caching must be distributed at all. Remember that a cache holds data you have already seen; replicating it to other nodes on the off-chance it might be of use there... well, why?
Distributed cache systems: Redis, Cassandra in memory, MongoDB in memory.
A local RocksDB (it lets you store byte[] -> byte[]) on SSDs makes a fine local cache layer. You might also add a distributed layer on top of it. That is usually better than something off the shelf and should also be easy to implement.
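A rough sketch of such a local layer with RocksJava (the database path is illustrative):

```java
// Minimal sketch: a byte[] -> byte[] local store on disk (ideally an SSD).
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class LocalCache {
    static {
        RocksDB.loadLibrary(); // load the native library once
    }

    public static void main(String[] args) throws RocksDBException {
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/local-cache")) {
            db.put("key".getBytes(), "value".getBytes());
            byte[] value = db.get("key".getBytes()); // null if absent
        }
    }
}
```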
10 million requests per day isn't much. Even if they all arrive within 10 hours, that is 1 million per hour, i.e. 1,000,000 / 60 / 60 ≈ 280 requests per second. With that load you can usually get by with an efficient frontend and an efficient backend. We can do 40k pages per second per core, and with 24 cores... you know the math. That is with the data in memory and no caching done at all...
For the caching provider I suggest Coherence. I am using Coherence at my company, and it is very robust and synchronized over multiple clusters.
For the other point, about how to handle the cache, it depends on the nature of your application. Based on my experience with caching, I've decided to update the cache in place in the following scenarios:
1. Grid paging
2. Browsing
and decided to clear the cache and reload the data again on:
1. Edit item
2. Add new item
3. Delete item
I decided this because maintaining the cache incrementally is an overkill headache that will blow up in your face when you handle any kind of statistics and nested hierarchies.
Hope this helped you.
Yes, there are, for example Coherence and Hazelcast. Both are distributed caches.
http://java.dzone.com/articles/sneak-peek-jcache-api-jsr-107
In general you should cache what you are actually using, and the cache should always be in sync, not refreshed daily. You place the recently used objects in the cache and use a read/write-through cache to your DB.
If you have money, the best one is Coherence (its reputation is proven by big financial companies).
Hazelcast is another distributed cache you can use; it is one level below Coherence based on performance metrics.
You could try Ehcache. It can be used as a query cache or even as a Hibernate second-level cache.
You can configure how long entities should be stored in the cache before they are invalidated.
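For example, a minimal sketch with the Ehcache 2.x API (the cache name, size and TTL are illustrative):

```java
// Minimal sketch: a cache whose entries expire one day after being written.
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheSketch {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.getInstance();
        // name, maxElementsInMemory, overflowToDisk, eternal, TTL, TTI (seconds)
        Cache cache = new Cache("entities", 300_000, false, false, 86_400, 0);
        manager.addCache(cache);

        cache.put(new Element("key", "value"));
        Element element = cache.get("key"); // null once the entry has expired

        manager.shutdown();
    }
}
```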
If you already have WebSphere ND 8.5.5, you may take a look at WebSphere eXtreme Scale, which is provided with it. It is a distributed, partitioned caching solution that integrates with WebSphere. See the WebSphere eXtreme Scale overview for more details.
See the new JCache standard (JSR 107 in the Java Community Process). This API is implemented by Coherence and other caching implementations (Ehcache etc.), and it also has a small reference implementation that you can use for basic use cases.
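A minimal sketch of the JCache API (provider-independent; the cache name is illustrative):

```java
// Minimal sketch: the same code runs against any JSR-107 provider on the
// classpath (Coherence, Ehcache, Hazelcast, the reference implementation...).
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class JCacheSketch {
    public static void main(String[] args) {
        CacheManager manager = Caching.getCachingProvider().getCacheManager();
        MutableConfiguration<String, String> config =
                new MutableConfiguration<String, String>()
                        .setTypes(String.class, String.class);
        Cache<String, String> cache = manager.createCache("entities", config);

        cache.put("key", "value");
        String value = cache.get("key");
    }
}
```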
Yes, any of the Java caching frameworks should be able to help you. Coherence (note: I work with Coherence at Oracle) for example can definitely handle 3,00,000 items easily (I assume you are from India if you use lakh!), but I suggest only using Coherence if you are deploying this on more than one server.
I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows; I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ... ON DUPLICATE KEY UPDATE, using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to lookup/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system; most of the applications I mentioned were built for distributed systems, but that's not the case here.
easy to learn/use (I want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source
You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.
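A rough sketch of what that coordinator could look like (the interface and names are mine, purely illustrative):

```java
// Minimal sketch: one process exports a shared store over RMI; the other
// processes look it up in the registry and call get/put on the stub.
// Values passed over RMI must be Serializable.
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.concurrent.ConcurrentHashMap;

interface SharedStore extends Remote {
    Object get(String key) throws RemoteException;
    void put(String key, Object value) throws RemoteException;
}

public class Coordinator implements SharedStore {
    private final ConcurrentHashMap<String, Object> data = new ConcurrentHashMap<>();

    public Object get(String key) { return data.get(key); }
    public void put(String key, Object value) { data.put(key, value); }

    public static void main(String[] args) throws Exception {
        SharedStore stub =
                (SharedStore) UnicastRemoteObject.exportObject(new Coordinator(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("shared-store", stub); // other processes look this up
    }
}
```

Whether plain RMI keeps up with >100k lookups/updates per second is something you would have to measure, though.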
If you only want to exchange objects between Java applications, you could also look into tuple spaces. There are specific implementations of spaces for Java, such as JavaSpaces, which should be able to do what you need. I'm not sure they can keep up with the performance requirement, though. I'm also not sure how widely this technology is still used, since it only supports Java and isn't as flexible as NoSQL stores are these days.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
The other option is to go with Redis; it has notifications and can certainly scale to the requirements you are looking at.
The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view, I would say distributed caches are the replacement for that nowadays. In particular, take a look at Hazelcast and Infinispan.
From the performance viewpoint, I am not happy with the performance of the "big" distributed caching solutions when only a single in-memory cache is needed; see my write-up on the cache2k benchmarks page (Hazelcast still needs to be added there).
Anyway, please clarify your problem statement first, because your question falls into the XY problem category. You are not describing the actual problem, and your question just boils down to a "fast reliable distributed objects" solution. What kind of data comes in? What is the rate? Who accesses it? What consistency guarantees need to be met, considering the fact that writing and reading happen in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.
I am looking for a Map to share information between two instances of a Java web application running on separate machines. Reads and writes to this map need to be very fast and don't have to be transactional, i.e. it's ok if one instance has stale data for a while.
Any recommendations?
I need to keep track of the last time a user did something in the application, so it's not terribly bad if this information is out of date. Speed and ease of use are important. I don't want writes to the Map to impact response times.
I would try Hazelcast, JGroups or Ehcache. All support a distributed map.
EDIT: Another option is to use RMI to expose a service running in one or the other JVM. This avoids the need for an additional library.
Additionally, there is memcached, which is very robust and proven over time.
I am considering caching key-value lists stored in the database. Right now, for rendering JSF pages, a lot of redundant queries are executed to find the names to be displayed for some keys (O/R mapper: EclipseLink).
The values are quasi-static but can, very occasionally, be changed through the application (no change in the database except by the application in question).
A simple cache would suffice when only using one application server. However, load balancing with multiple servers should be possible, without returning stale values if data is changed on one server and therefore not yet reflected on the other.
One idea would be to use Oracle Coherence as a distributed cache. I'm not sure whether this is overkill, given that the data changes only very seldom and the cache itself does not need to be distributed; only the invalidation does.
What is the overhead of coherence in terms of memory, execution times and network communication? Are there any alternatives that better suit my use case?
I am talking about 50,000 key-value pairs, mainly short strings.
If the invalidation is that rare, then you can use a local cache and something like a JMS Topic that everyone subscribes to in order to handle the invalidation.
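A rough sketch of the subscriber side (plain javax.jms API; the ConnectionFactory and Topic would come from your provider, e.g. via JNDI, and all names here are illustrative):

```java
// Minimal sketch: each server keeps a local map and evicts an entry whenever
// an invalidation message naming that key arrives on the shared topic.
import java.util.concurrent.ConcurrentHashMap;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

public class InvalidatingCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    public String get(String key) {
        return cache.get(key); // load from the DB on a miss, omitted here
    }

    public void listen(ConnectionFactory factory, Topic invalidations) throws JMSException {
        Connection connection = factory.createConnection(); // stays open for the app's lifetime
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(invalidations);
        consumer.setMessageListener(message -> {
            try {
                cache.remove(((TextMessage) message).getText()); // evict by key
            } catch (JMSException e) {
                cache.clear(); // on a malformed message, fall back to a full flush
            }
        });
        connection.start();
    }
}
```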
There's also something like Ehcache as an alternative, since it's OSS and free to use vs. Coherence, if that's important. I like to use Ehcache's pull-through ability.
Coherence has relatively low overhead, and can easily manage 50,000 (or 50,000,000) objects. However, if your use case is super simple, and you don't mind doing the invalidation work yourself, and don't need the various QoS that Coherence provides, then it probably is overkill.
Also, this simple use case can easily be done using the Coherence Standard Edition, which is far less expensive (licensed per server instead of per processor, and it's a much lower price).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
We have a large cluster of tomcat servers and I'm trying to find an efficient way to share a count among all of them. This count is the number of "widgets" purchased and needs to be checked for every page view. Any server can complete a sale and increment that count, at which point the new value should be made available to all the cluster members.
We don't want to use the count from the database because there will be many page views between updates across the cluster and a get operation to the db for every page view seems unnecessary.
We have an extensive memcached cluster where we could store the value, get it on every page view, and anyone who updates the value sets the new value to the cluster. This again seems wasteful because of a cache get for each page view.
What I'd like to do is have an in-memory value on each server and a multicast (or similar) message telling all servers that a sale just happened and the new number is X. That would seem to be the most efficient, because an action is only taken when an update is made, instead of doing work for every page view.
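Something like this rough sketch is what I have in mind (plain JDK multicast; the group address and port are made up, and I realize concurrent sales on different servers could still collide):

```java
// Minimal sketch: page views read the in-memory value; a sale increments it
// and multicasts the new total, which every member merges with max().
// Note: two simultaneous sales on different servers can still lose an
// increment, so a real version would need per-server counters or similar.
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

public class WidgetCounter {
    private final AtomicLong count = new AtomicLong();
    private final InetAddress group = InetAddress.getByName("230.0.0.1");
    private final MulticastSocket socket = new MulticastSocket(4446);

    public WidgetCounter() throws Exception {
        socket.joinGroup(group);
        new Thread(this::receiveLoop).start();
    }

    public long get() { return count.get(); } // free in-memory read per page view

    public void recordSale() throws Exception { // called when a sale completes
        byte[] payload =
                ByteBuffer.allocate(8).putLong(count.incrementAndGet()).array();
        socket.send(new DatagramPacket(payload, payload.length, group, 4446));
    }

    private void receiveLoop() {
        byte[] buf = new byte[8];
        DatagramPacket packet = new DatagramPacket(buf, buf.length);
        while (true) {
            try {
                socket.receive(packet);
                long received = ByteBuffer.wrap(packet.getData()).getLong();
                count.updateAndGet(current -> Math.max(current, received));
            } catch (Exception e) {
                return; // socket closed
            }
        }
    }
}
```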
How have you handled this in your applications? Am I over-thinking this... should we just throw it in memcached?
Thanks!
Both JBossCache and EhCache can operate in UDP multicast mode, replicating an in-memory cache across multiple virtual machines. Unlike memcached, they operate inside the VM, so a "cache get" is essentially a free operation. They're also pure Java, so there's no need to maintain a separate cache system.
JBossCache also provides transactions and sync/async operation, so if those are of interest to you, I'd pick it over EhCache.
If you already have a memcached infrastructure, I don't see why you shouldn't use that; it'll be ideal for this. Whether it'll be wasteful or just another drop in the ocean, only testing will tell you.
The architecture will be simple as well, compared to introducing something as intrusive as Terracotta or another cache-sharing mechanism.
Have a look at Terracotta. It gives you a distributed JVM memory model, so the value of an object on every box gets updated at once.
They have a wrapper around Ehcache, or you can use it transparently to your code with some XML config.
Terracotta provides commercial and open source licenses, and typically they play down the open source in favour of the commercial, but the free open source version will definitely do what you need and will allow your apps to scale a very long way.