Ehcache performance on a large cluster

I would like to use an Ehcache replicated cache, first as the backend for the Hibernate second-level cache, and second as a general-purpose cache for any data.
I know how a distributed cache like memcached works, and I know it can scale to large clusters, but I cannot find out how Ehcache replication behaves on large clusters.
Does anyone have a pointer to some information or some kind of benchmark?
I found that several replication mechanisms can be used, such as RMI, JGroups, JMS or Terracotta, and that RMI and Terracotta seem to be the most popular.
How do they compare on large clusters?
Will replication kill my performance as I add many nodes (say, several dozen)?

A fully replicated cache will only work if your application is read-mostly. A replicated cache cannot scale for writes: passing every update to all the other nodes will kill your performance. You need a partitioned cache with backup replicas. Partitioned caches scale linearly even for write-intensive applications.
Try Hazelcast! It is an open source (Apache license), transactional, partitioned caching solution for Java. It comes with a Hibernate second-level cache plugin.
Several dozen? No problem. A Hazelcast 100-node cluster demo can be found here.
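To make the scaling argument above concrete, here is a minimal sketch that counts how many network messages a single write triggers under each strategy. It is a back-of-the-envelope model, not Ehcache or Hazelcast code: the class name, the method names and the "one hop to the owner plus one message per backup" cost model are all assumptions made for illustration.

```java
// Hypothetical model of replication cost per write; not real Ehcache or Hazelcast code.
final class ReplicationFanOut {

    // Full replication: every write is pushed to all other nodes.
    static int messagesPerWriteReplicated(int nodes) {
        return Math.max(nodes - 1, 0);
    }

    // Partitioned cache (assumed model): at most one hop to the node that owns
    // the key, plus one message per backup replica. Never more than nodes - 1.
    static int messagesPerWritePartitioned(int nodes, int backups) {
        return Math.min(1 + backups, Math.max(nodes - 1, 0));
    }

    // Picks the partition that owns a key: hash modulo partition count.
    static int partitionFor(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    public static void main(String[] args) {
        for (int nodes : new int[] {2, 12, 36, 100}) {
            System.out.printf("nodes=%d replicated=%d partitioned(1 backup)=%d%n",
                    nodes,
                    messagesPerWriteReplicated(nodes),
                    messagesPerWritePartitioned(nodes, 1));
        }
    }
}
```

Under this model the replicated cost grows with the cluster size, while the partitioned cost stays constant once the cluster is larger than the number of copies kept.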

A good solution to the cluster scaling problem is the notion of "buddy replication", where data is only replicated to each node's neighbours (however you define that), rather than to all nodes. You get failover without the scaling issue.
To my knowledge, Ehcache doesn't do this. However, JBoss Cache does, and it integrates with Hibernate in the same way that Ehcache does.
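As a rough illustration of the buddy idea described above, the sketch below picks the replication targets for a node by treating the cluster as a ring and taking the next few nodes as its neighbours. The ring layout, the class name and the method name are assumptions for the example; this is not how JBoss Cache is implemented or configured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of buddy selection on a ring; not JBoss Cache code.
final class BuddyRing {

    // Returns the indices of the nodes that receive a copy of the data owned
    // by `node`: the next `buddies` nodes on a ring of `clusterSize` nodes.
    static List<Integer> buddiesOf(int node, int clusterSize, int buddies) {
        if (clusterSize <= 0 || node < 0 || node >= clusterSize || buddies < 0) {
            throw new IllegalArgumentException("invalid node, cluster size or buddy count");
        }
        // A node cannot have more buddies than there are other nodes.
        int count = Math.min(buddies, clusterSize - 1);
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            result.add((node + i) % clusterSize);
        }
        return result;
    }
}
```

With one buddy per node, each write is copied to exactly one other node regardless of how many nodes the cluster has.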

Have you read the section in the manual about Distributed Caching with ehcache?
There are further chapters on:
RMI Distributed Caching
Distributed Caching using JGroups
Distributed Caching using JMS
Distributed Caching via Terracotta

Related

Putting a cache in front of a distributed Redis cache

I have a Java enterprise application that does a lot of fetching of cached data.
The data is stored in a 3-server Redis cluster and is accessed by 5 backend API nodes.
I am seeing that we are putting a lot of stress on the Redis caches, which is why I am wondering whether it is dumb to put an in-memory cache such as Ehcache in front of Redis. With this solution I would set a very short TTL in Ehcache.
Is this a common solution, or is it more reasonable to look into expanding the Redis cluster?

What you are describing is called a near cache. It is an absolutely legitimate solution in some cases. It trades freshness of the values for performance, so you can only consider it if seeing slightly stale values is tolerable in your case. Just FYI, Apache Ignite supports this feature out of the box.
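A minimal sketch of the near-cache idea, assuming a plain in-process map with a short TTL in front of some remote lookup. The class name, the `remoteLookup` function (standing in for a call to the Redis cluster) and the injectable clock are all hypothetical; this uses neither the Ehcache nor the Apache Ignite API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical near cache: a local map with a short TTL in front of a remote cache.
final class NearCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteLookup; // assumed to wrap the remote (e.g. Redis) read
    private final long ttlMillis;
    private final LongSupplier clock;          // injectable so that expiry can be tested

    NearCache(Function<K, V> remoteLookup, long ttlMillis, LongSupplier clock) {
        this.remoteLookup = remoteLookup;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Serves from the local map while the entry is fresh; otherwise asks the
    // remote cache and remembers the answer for ttlMillis.
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> cached = local.get(key);
        if (cached != null && now < cached.expiresAtMillis) {
            return cached.value;
        }
        V fresh = remoteLookup.apply(key);
        if (fresh != null) {
            local.put(key, new Entry<>(fresh, now + ttlMillis));
        } else {
            local.remove(key);
        }
        return fresh;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`. The TTL is the upper bound on how stale a value served from the local map can be, which is the trade-off mentioned above.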

Using Replicated Cache vs LB sticky session

I need to keep some data in a cache on the server. The servers are in a cluster and a call can go to any of them. In such a scenario, is it better to use a replicated/distributed cache like EhCache, or to use the load balancer's session stickiness?
If the data in the cache is big, won't serialization and deserialization across all servers have a performance impact?
Also, in the case of a distributed cache, what is the optimal number of servers up to which such a cache is effective? Since data is replicated to all nodes, with, say, 20 nodes it is like master-to-master replication across all nodes. By that I mean each node will get notifications from the other 19 and will send its modifications to the other 19. Does this type of setup scale?

As always in distributed systems, the answer depends on several things:
A load balancer with sticky sessions is for sure the simpler way for the developer, since it doesn’t make any difference if the application runs on 1, 2 or 100 servers. If this is all you care about, stick with it and you can stop reading right here.
I’m not sure how session-aware load balancers are implemented or what their general limit in terms of requests per second would be, but they have at least one big disadvantage compared with a distributed cache: what do you do if the machine handling a session is down? If you distribute your cache, any machine can serve the request, and it doesn’t matter if one of them fails. The serialisation/deserialisation part is not a big problem; the network is more likely to be the bottleneck if you don't run in at least a 1 Gbit network environment, but it should be OK.
For distributed cache you could go either with Hazelcast, Infinispan or similar solutions, which would simplify the access from your own application. (Update: these implementations use DHT to distribute the cache)
For a fully replicated cache you could use Ehcache, which you mentioned, or Infinispan. Here the advantage over a distributed cache is much faster access, since all the data is replicated on every machine and you only need to access it locally. The disadvantages are slower writes (so use it for read-very-often, write-very-seldom scenarios) and the fact that your cache is limited by the amount one machine is able to store. If you are running your applications on servers with 64GB of RAM this is OK. If you want to distribute them over small Amazon instances, it is probably a bad idea. I think that before you hit any problems with updating too many nodes you will run out of memory, and that, at least, is very easy to calculate: AVG_CACHE_NEEDED_PER_CLIENT * NUMBER_OF_CLIENTS < MEMORY_FOR_CACHE_AVAILABLE (on one server). If you need more cache than is available on any single node in your Ehcache cluster, full replication is no longer possible.
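The memory check in the paragraph above can be written down directly. The sketch below is only a worked instance of that inequality; the figures used in `main` (2 MiB of cached data per client, 10,000 clients, 48 GiB or 3 GiB left for the cache on one node) are made-up numbers for illustration, not measurements.

```java
// Worked instance of: AVG_CACHE_NEEDED_PER_CLIENT * NUMBER_OF_CLIENTS < MEMORY_FOR_CACHE_AVAILABLE
final class ReplicatedCacheSizing {

    // True when the whole cache fits on a single node, which full replication requires.
    static boolean fitsOnOneNode(long avgCacheBytesPerClient,
                                 long numberOfClients,
                                 long cacheBytesAvailablePerNode) {
        // multiplyExact throws instead of silently overflowing on absurd inputs.
        long needed = Math.multiplyExact(avgCacheBytesPerClient, numberOfClients);
        return needed < cacheBytesAvailablePerNode;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long gib = 1024L * mib;
        long perClient = 2 * mib;  // hypothetical average per client
        long clients = 10_000;     // hypothetical number of clients
        System.out.println("48 GiB for cache: " + fitsOnOneNode(perClient, clients, 48 * gib));
        System.out.println("3 GiB for cache:  " + fitsOnOneNode(perClient, clients, 3 * gib));
    }
}
```

With these example figures the cache needs roughly 19.5 GiB, so it fits on the large server but not on the small instance.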
Or you could use a Redis cluster or similar independent from your application and the servers your application is running on. This would allow you to scale the cache at a different speed than the rest of your application, however the access to the data wouldn’t be that trivial.
Of course the actual decision depends on your very specific use-case and the demands you are putting on your application.
Personally I was very happy when I found out today that Azure WebPages have a load balancer with sticky session support, and I don’t need to reconfigure my application to use Redis as a session object store, and can just keep everything as it is.
But for a huge workload with hundreds of servers, a simple load balancer will probably be overwhelmed, and a distributed cache or a centralized cache (Redis) will be the way to go.

Cache implementation

I've been researching this for a week now, but I'd like some thoughts on my particular situation...
2 physical servers:
Server A - public WAR, admin WAR
Server B - public WAR
Requirements:
Both WARs need to view the same data.
admin WAR modifies / adds data to the cache.
public WARs modify other parts of the cache / add data to it.
entire cache needs to reside in memory on each physical server (if I add something on Server A admin WAR or public WAR, it needs to show up on Server B public WAR) so in the event of a failure, we aren't waiting for half the cache to be populated
1,500 active users/server, vast majority of traffic is read, very little write
Additional hardware is out of the question.
Is there a good third party caching solution for this scenario? It seems most distributed caching systems want to leave half the data on Server A and half on Server B, which wouldn't meet our failover performance needs.
Thanks for any ideas!

You should look at Redis

http://www.gigaspaces.com/ has a solution for that: it allows you to create a "Space" that serves as a cache in replicated mode, so each node will have an exact copy of the data.
They also have a solution for fail-over or hot standby.
Edit:
GigaSpaces is far more than just a shared cache, but you can use just the caching solution. It's called an in-memory data grid. They have dramatically changed their web pages, so I can't find the exact page, but if you search through the documentation you'll find it.
You can start here
http://www.gigaspaces.com/datagrid
But the technology is not free.

Take a look at the replicated options for EhCache.
It sounds like you've been searching for information on "distributed caches", which has a different definition than "replicated cache". A distributed cache is a larger cache system spread out among many machines, so that the loss of any one machine in the cluster does not bring down the entire cache, just a portion of it. In this scenario the total size of your cache can reach (number of machines times memory of each machine).
In a replicated cache, the cached data is copied to every machine, which limits the total cache size to what a single machine can hold.
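The capacity difference between the two definitions above can be sketched as follows. This is a simplified model: it ignores backup copies in the distributed case, and it takes the smallest node as the limit in the replicated case (with identical machines that is simply the memory of one machine). The class and method names are made up for the example.

```java
// Hypothetical capacity model for distributed vs. replicated caches.
final class CacheCapacity {

    // Distributed cache: each entry lives on one node, so node capacities add up.
    // (Simplification: backup copies, which reduce the usable total, are ignored.)
    static long distributedCapacity(long[] nodeMemory) {
        long total = 0;
        for (long memory : nodeMemory) {
            total += memory;
        }
        return total;
    }

    // Replicated cache: every node holds every entry, so the smallest node sets the limit.
    static long replicatedCapacity(long[] nodeMemory) {
        if (nodeMemory.length == 0) {
            return 0;
        }
        long smallest = Long.MAX_VALUE;
        for (long memory : nodeMemory) {
            smallest = Math.min(smallest, memory);
        }
        return smallest;
    }
}
```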

It seems most distributed caching systems want to leave half the data
on Server A and half on Server B, which wouldn't meet our failover
performance needs.
No, you can tweak that easily. Otherwise you need sticky sessions (you have to know exactly which cache stores your data). You can choose any solution on the market: EhCache, GigaSpaces, GridGain, etc. I would recommend JBoss Cache; IMHO it is the simplest and exactly what you need.

There are many solutions in this space.
Memcached
EhCache
Infinispan
All of them can be configured as distributed caches. AFAIK Infinispan works best when left as an embedded cache in JBoss AS; last I checked, it was difficult to integrate into other app servers. If you have the money, I would recommend BigMemory from Terracotta. It's the commercial derivative of EhCache and provides a lot of additional nice-to-have features.

We use Apache Commons JCS and have been very pleased with it. It claims to be almost twice as fast as EHCache. For the situation you have described, you would probably configure a Lateral TCP Cache.

Shared cache between Tomcat web apps

I'm looking for a solution to share a cache between two Tomcat web apps running on different hosts. The cache is being used for data synchronization, so the cache must be guaranteed to be up-to-date at all times between the two Tomcat instances. (Sorry, I'm not 100% sure if the correct terminology for this requirement is "consistency" or something more specific like having the ACID properties.) Another requirement, of course, is that access to the cache should be fast, with about equal numbers of writes and reads. I do have access to a shared filesystem, so that is a consideration.
I've looked at something like ehcache, but in order to get a shared cache between the webapps I would need either to build on top of a Terracotta environment or to use the new ehcache cache server. The former (Terracotta) seems like overkill for this, while the cache web server seems like it wouldn't provide the fast performance that I want.
Another solution I've looked at is building something simple on top of a fast key-value store like Redis or memcachedb. Redis is in-memory but can easily be configured to be a centralized cache, while memcachedb is a disk-based persistent cache, which could work because I have a shared filesystem.
I'm looking for suggestions on how to best solve this problem. The solution needs to be a relatively mature technology as it will be used in a production environment.
Thanks in advance!

I'm quite sure that you don't require terracotta or ehcache server if you need a distributed cache. Ehcache with one of the four replication mechanisms would do.
However, based on what you've written I guess that you're looking for more than just a cache. Memcached/Ehcache are examples of what you might call a caching layer for your application - nothing more.
If you find yourself using words like 'guaranteed', 'up-to-date' and 'ACID', you're better off using an in-memory DB such as Oracle TimesTen, MySQL Cluster or Redis with disk-based persistent storage.

You can use memcached (not memcachedb) for fast and efficient caching. Redis or memcachedb could be overkill unless you want persistent caching. Memcached can be clustered very easily, and you can use the spymemcached Java client to access it. Memcached is very mature and is running on several hundred thousand, if not millions, of production servers. It can be monitored through Nagios and Munin when in production.

hibernate distributed 2nd level cache options

Not really a question but I'm looking for comments/suggestions from anyone who has experiences using one or more of the following:
EhCache with RMI
EhCache with JGroups
EhCache with Terracotta
Gigaspaces Data Grid
A bit of background: our application is read-only for the most part, but there is some user data that is read-write and some that is only written (and can also be reasonably inaccurate). In addition, it would be nice to have tools that enable us to flush and fill the cache at intervals or by admin intervention.
Regarding the first option - are there any concerns about the overhead of RMI and performance of Java serialization?

I have been working with EhCache for Hibernate and as an application-level cache for 3 years.
We use it with RMI for cache invalidation and it works really well. If you use the cache for replication, take care with the object graph: it can get very heavy with high-cardinality relations.
If you use EhCache for Hibernate, you can use it for the query cache (a good improvement for read-only tables), and if the table is modified it clears the cache automatically.
Using EhCache to cache collections is a good idea too, to avoid joins and sub-selects.
To clear the caches at time intervals, you can implement an EhCache cache extension that clears them. We did it, and it works well.
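The answer above does not show its cache extension, so the following is only a sketch of the general idea using the JDK scheduler and a plain `Map` as a stand-in for the cache. It does not use the Ehcache `CacheExtension` interface; the class name and the `flushNow` method (for the "admin intervention" case from the question) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical periodic flusher; a plain Map stands in for the real cache.
final class PeriodicCacheFlusher implements AutoCloseable {

    private final Map<?, ?> cache;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "cache-flusher");
                thread.setDaemon(true); // do not keep the JVM alive just for flushing
                return thread;
            });

    PeriodicCacheFlusher(Map<?, ?> cache, long period, TimeUnit unit) {
        this.cache = cache;
        // Clear the whole cache every `period`, starting one period from now.
        scheduler.scheduleAtFixedRate(cache::clear, period, period, unit);
    }

    // Manual flush, e.g. triggered from an admin screen.
    void flushNow() {
        cache.clear();
    }

    // Stops the periodic flushing.
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The map passed in should be safe for concurrent use (for example a `ConcurrentHashMap`), since the flush runs on its own thread.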

Also check out Hazelcast, Coherence and GemStone. These are distributed caching solutions with Query support. They also have ready-to-go second level cache plug-in for Hibernate. Hazelcast is open source.
