I have a Java enterprise application that does a lot of fetching of cached data.
The data is stored in a 3-server Redis cluster and is accessed by 5 backend API nodes.
I am seeing that we are putting a lot of stress on the Redis caches, which is why I am wondering whether it is dumb to put an in-memory cache such as Ehcache in front of Redis. With this solution I would set the TTL in Ehcache to be very short.
Is this a common solution, or is it more reasonable to look into expanding the Redis cluster?
What you are describing is called a near cache. It's an absolutely legitimate solution in some cases. It trades freshness of the values for performance, so you can only consider this option if seeing slightly stale values is tolerable in your case. Just FYI, Apache Ignite supports this feature out of the box.
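To make the idea concrete, here is a minimal sketch of a near cache using only the JDK. It is not Ehcache or Ignite code: the class name `NearCache`, the `remoteLookup` function (a stand-in for the Redis read) and the injectable `clock` are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical in-process near cache with a short TTL in front of a slower remote cache. */
class NearCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteLookup; // assumption: wraps the Redis GET
    private final long ttlMillis;
    private final LongSupplier clock;          // e.g. System::currentTimeMillis

    NearCache(Function<K, V> remoteLookup, long ttlMillis, LongSupplier clock) {
        this.remoteLookup = remoteLookup;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = local.get(key);
        if (entry != null && now < entry.expiresAtMillis) {
            return entry.value; // still fresh enough: no network round trip
        }
        // Expired or missing: go to the remote cache and remember the answer briefly.
        V value = remoteLookup.apply(key);
        if (value != null) {
            local.put(key, new Entry<>(value, now + ttlMillis));
        } else {
            local.remove(key);
        }
        return value;
    }
}
```

With a TTL of one second, for example, each API node asks Redis for a hot key at most about once per second, and a reader may see a value that is up to one second stale. That is the trade-off described above.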
Related
Java Caching frameworks for storing huge data.
Context: We are developing a RESTful service using Jersey 2.6 and will deploy it on WAS 8.5. This service needs to serve more than 10 million requests per day.
We need to implement a cache to store more than 300k objects (the data will come from the DB), and we need some way to update the cache on a daily basis.
Is this approach of caching 300k objects and updating them on a daily basis recommended?
Is there any Java framework which supports this kind of functionality?
Your question is too general to get a clear answer. You need to describe the problem you are trying to solve.
Are you concerned about response times?
Are you trying to protect your DB from doing heavy lifting?
Are you expecting to have to scale out, and do you want to be sure that you can deal with future loads?
Additionally some more contextual information would be useful, especially:
How dynamic is your data compared to your requests?
What percentage of your data population will be requested on average per day? (How many of the 3 lakh objects will be enquired upon at least once per day? If you don't know, provide your best guess).
Your figures given as 3 lakh (300k) data points and 10M requests means that you are expecting to hit each object on average 33 times a day, which indicates that you are more concerned about back end DB load than your responses being right up to date.
In my experience there are a lot of fairly primitive solutions which will work much better than going for a heavyweight distributed system such as Mongo, Cassandra or Coherence.
My first response would be: Keep it simple - 300k objects is not too much to store in an internal hash table which you flush once a day and populate on first request.
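As a rough sketch of that simple approach, assuming a hypothetical `dbLoader` function that fetches one object from the database, an in-process map that fills on first request and is cleared once a day could look like this:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical in-process cache: populated on first request, flushed once a day. */
class DailyFlushedCache<K, V> implements AutoCloseable {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> dbLoader; // assumption: loads one object from the DB
    private final ScheduledExecutorService scheduler;

    DailyFlushedCache(Function<K, V> dbLoader) {
        this.dbLoader = dbLoader;
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "cache-flush");
            t.setDaemon(true); // do not keep the JVM alive just for flushing
            return t;
        });
        // Clear everything every 24 hours; entries are reloaded lazily afterwards.
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.DAYS);
    }

    /** Returns the cached value, loading it from the DB on the first request. */
    V get(K key) {
        return map.computeIfAbsent(key, dbLoader);
    }

    void flush() {
        map.clear();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The flush here runs every 24 hours counted from startup; flushing at a fixed time of day would need a computed initial delay, which is left out of the sketch.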
If you need to scale horizontally, I would suggest Memcached with the Spymemcached client and a 1 day cache time, populating an entry whenever you don't find an existing one.
I would NOT go for something like Cassandra or Mongo unless you have real compelling reasons to require a persistent store. Rationale: Purging can become really onerous, especially if your data is fast moving. For example: Cassandra does not really know how to delete, but instead "tombstones" deleted entries, which means that your data store will simply grow and grow until you create a strategy for purging.
The question is whether the cache must be distributed at all. Remember the caching is something you have seen. And posting this around for the chance it might be of use... well why.
Distributed Cache system: Redis, Cassandra in Memory. MongoDB in memory.
A local RocksDB (which lets you store byte[] -> byte[]) on SSDs makes a fine local cache layer. You might also add a distributed layer on top of it. This is usually better than something off the shelf, and it should also be easy to implement.
10 million requests per day isn't much. Even if they all arrive within 10 hours, that is 1 million per hour, or 1,000,000 / 60 / 60 ≈ 280 requests per second. Depending on the effort you put in, you can usually get by with an efficient frontend and an efficient backend. We can serve 40k pages per second per core, and with 24 cores... you can do the math. That is with the data in memory and no caching at all.
For the caching provider I suggest Coherence. I am using Coherence at my company; it is very robust and stays synchronized across multiple clusters.
How to handle the cache depends on the nature of your application. Based on my experience with caching, I decided to update the cache in the following scenarios:
1. Grid paging
2. Browsing
and to clear the cache and reload the data in these scenarios:
1. Edit item
2. Add new item
3. Delete item
I decided this because maintaining the cache entry by entry is an overkill headache that will blow up in your face when you handle statistics and nested hierarchies.
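One way to picture that policy is the sketch below. It is only an illustration: `ItemPageCache` is a hypothetical class, and the in-memory `items` list stands in for the real database table. Reads fill the cache page by page, and any write throws the whole cache away.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical page cache: filled on paging/browsing, cleared on any write. */
class ItemPageCache {
    private final List<String> items = new ArrayList<>();          // stand-in for the DB table
    private final Map<Integer, List<String>> pages = new HashMap<>();
    private final int pageSize;

    ItemPageCache(int pageSize) {
        this.pageSize = pageSize;
    }

    /** Grid paging / browsing: serve from cache, loading the page on a miss. */
    synchronized List<String> getPage(int pageNo) {
        return pages.computeIfAbsent(pageNo, this::loadPage);
    }

    /** Add, edit and delete all clear the cache instead of patching cached pages. */
    synchronized void addItem(String item) {
        items.add(item);
        pages.clear();
    }

    synchronized void editItem(int index, String item) {
        items.set(index, item);
        pages.clear();
    }

    synchronized void deleteItem(int index) {
        items.remove(index);
        pages.clear();
    }

    private List<String> loadPage(int pageNo) {
        int from = Math.min(pageNo * pageSize, items.size());
        int to = Math.min(from + pageSize, items.size());
        return new ArrayList<>(items.subList(from, to));
    }
}
```

Clearing everything costs a few extra reloads after each write, but it avoids having to work out which cached pages or aggregates a single change affects.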
Hope this helped you.
Yes, there are. For example: Coherence and Hazelcast. Both are distributed caches.
http://java.dzone.com/articles/sneak-peek-jcache-api-jsr-107
In general you should cache what you are using, and the cache should always be in sync, not just refreshed daily. You place the recently used objects in the cache, and you read and write through the cache to your DB.
If you have the money, the best one is Coherence (its reputation is proven by big financial companies).
Hazelcast is another distributed in-memory cache you can use; it is one level below Coherence based on performance metrics.
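A minimal sketch of that pattern, assuming a plain `Map` as a stand-in for the database and an arbitrary `capacity` chosen by the caller, might look like this. It keeps only the most recently used entries and sends every write to the store first:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical read-through / write-through cache that keeps recently used entries. */
class ReadWriteThroughCache<K, V> {
    private final Map<K, V> cache;
    private final Map<K, V> database; // assumption: stand-in for the real DB

    ReadWriteThroughCache(Map<K, V> database, int capacity) {
        this.database = database;
        // Access-ordered LinkedHashMap: the eldest entry is the least recently used one.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Read-through: on a miss, load from the DB and keep the value in the cache. */
    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = database.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Write-through: update the DB first, then the cache, so both stay in sync. */
    synchronized void put(K key, V value) {
        database.put(key, value);
        cache.put(key, value);
    }

    synchronized boolean isCached(K key) {
        return cache.containsKey(key);
    }
}
```

This only keeps a single JVM in sync with the database. With several nodes, the distributed products mentioned in this thread take care of propagating the writes.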
You could try Ehcache. It can be used as a query cache or even as the Hibernate second-level cache.
You can configure how long entities should be stored in cache before they are invalidated.
If you already have WebSphere ND 8.5.5, you may take a look at WebSphere eXtreme Scale, which is provided with it. It is a distributed, partitioned caching solution that integrates with WebSphere. See the WebSphere eXtreme Scale overview for more details.
See the new JCache standard (JSR 107 in the Java Community Process). This API is implemented by Coherence and other caching implementations (ehcache etc.), and also has a small reference implementation that you can use for basic use cases.
Yes, any of the Java caching frameworks should be able to help you. Coherence (note: I work with Coherence at Oracle) for example can definitely handle 3,00,000 items easily (I assume you are from India if you use lakh!), but I suggest only using Coherence if you are deploying this on more than one server.
I've been researching this for a week now, but I'd like some thoughts on my particular situation...
2 physical servers:
Server A - public WAR, admin WAR
Server B - public WAR
Requirements:
Both WARs need to view the same data.
admin WAR modifies / adds data to the cache.
public WARs modify other parts of the cache / add data to it.
entire cache needs to reside in memory on each physical server (if I add something on Server A admin WAR or public WAR, it needs to show up on Server B public WAR) so in the event of a failure, we aren't waiting for half the cache to be populated
1,500 active users/server, vast majority of traffic is read, very little write
Additional hardware is out of the question.
Is there a good third party caching solution for this scenario? It seems most distributed caching systems want to leave half the data on Server A and half on Server B, which wouldn't meet our failover performance needs.
Thanks for any ideas!
You should look at Redis
http://www.gigaspaces.com/ has a solution for that. It allows you to create a "Space" that serves as a cache in replicated mode, so each node will have an exact copy of the data.
They also have a solution for failover or hot standby.
Edit:
GigaSpaces is far more than just a shared cache, but you can use just the caching solution. It's called an in-memory data grid. They have dramatically changed their web pages, so I can't find the exact page, but if you search through the documentation you'll find it.
You can start here
http://www.gigaspaces.com/datagrid
But the technology is not free.
Take a look at the replicated options for EhCache.
Sounds like you've been searching for information on "distributed caches", which has a different definition than "replicated cache". A distributed cache is a larger cache system spread out among many machines, so that the loss of any one machine in the cluster does not bring down the entire cache, but just a portion. In this scenario the total size of your cache can reach (number of machines times memory of each machine).
In a replicated cache, the cached data is replicated across each machine, limiting you to a total cache size of max(memory of any one machine).
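The difference can be sketched with plain in-process maps, each standing in for one server's memory. The class `CacheTopologies` and its hash-based choice of owner are assumptions for illustration, not how any particular product places data:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of replicated versus distributed (partitioned) placement. */
class CacheTopologies {
    /** Creates n empty maps, each standing in for the cache memory of one server. */
    static List<Map<String, String>> nodes(int n) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.add(new HashMap<>());
        }
        return result;
    }

    /** Replicated: every node stores every entry, so any node can serve any read. */
    static void putReplicated(List<Map<String, String>> nodes, String key, String value) {
        for (Map<String, String> node : nodes) {
            node.put(key, value);
        }
    }

    /** Distributed: exactly one node, picked from the key's hash, stores the entry. */
    static void putPartitioned(List<Map<String, String>> nodes, String key, String value) {
        nodes.get(ownerOf(key, nodes.size())).put(key, value);
    }

    static int ownerOf(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }
}
```

After 100 replicated puts on two nodes, each node holds all 100 entries; after 100 partitioned puts, the two nodes hold 100 entries between them. That is why a replicated cache is capped by the memory of one machine, and also why it survives the loss of a node with nothing to reload.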
It seems most distributed caching systems want to leave half the data
on Server A and half on Server B, which wouldn't meet our failover
performance needs.
No, you can tweak that easily. Otherwise you need sticky sessions (you have to know exactly which cache stores your data). You can choose any solution on the market: EhCache, GigaSpaces, GridGain, etc. I would recommend JBoss Cache; IMHO it is the simplest and exactly what you need.
There are many solutions in this space.
Memcached
EhCache
Infinispan
All of them can be configured as distributed caches. AFAIK Infinispan works best when left as an embedded cache in JBoss AS; last I checked it was difficult to integrate into other app servers. If you have money I would recommend BigMemory from Terracotta. It's the commercial derivative of EhCache and provides a lot of additional nice-to-have features.
We use Apache Commons JCS and have been very pleased with it. It claims to be almost twice as fast as EHCache. For the situation you have described, you would probably configure a Lateral TCP Cache.
I'm looking for a solution to share a cache between two Tomcat web apps running on different hosts. The cache is being used for data synchronization, so the cache must be guaranteed to be up-to-date at all times between the two Tomcat instances. (Sorry, I'm not 100% sure if the correct terminology for this requirement is "consistency" or something more specific like having the ACID property.) Another requirement, of course, is that it should be fast to access the cache, with about equal numbers of writes and reads. I do have access to a shared filesystem, so that is a consideration.
I've looked at something like ehcache but in order to get a shared cache between the webapps I would either need to implement on top of a Terracotta environment or using the new ehcache cache server. The former (Terracotta) seems like overkill for this, while the cache web server seems like it wouldn't provide the fast performance that I want.
Another solution I've looked at is building something simple on top of a fast key-value store like Redis or memcachedb. Redis is in-memory but can easily be configured to be a centralized cache, while memcachedb is a disk-based persistent cache which could work because I have a shared filesystem.
I'm looking for suggestions on how to best solve this problem. The solution needs to be a relatively mature technology as it will be used in a production environment.
Thanks in advance!
I'm quite sure that you don't require terracotta or ehcache server if you need a distributed cache. Ehcache with one of the four replication mechanisms would do.
However, based on what you've written I guess that you're looking for more than just a cache. Memcached/Ehcache are examples of what you might call a caching layer for your application - nothing more.
If you find yourself using words like 'guaranteed', 'up-to-date' and 'ACID', you're better off using an in-memory DB like Oracle TimesTen, MySQL Cluster or Redis with disk-based persistent storage.
You can use memcached (not memcachedb) for fast and efficient caching. Redis or memcachedb could be overkill unless you want persistent caching. Memcached can be clustered very easily, and you can use the spymemcached Java client to access it. Memcached is very mature and is running on several hundred thousand, if not millions, of production servers. It can be monitored through Nagios and Munin when in production.
Not really a question but I'm looking for comments/suggestions from anyone who has experiences using one or more of the following:
EhCache with RMI
EhCache with JGroups
EhCache with Terracotta
Gigaspaces Data Grid
A bit of background: our application is read-only for the most part, but there is some user data that is read-write and some that is only written (and can also be reasonably inaccurate). In addition, it would be nice to have tools that enable us to flush and fill the cache at intervals or by admin intervention.
Regarding the first option - are there any concerns about the overhead of RMI and performance of Java serialization?
I've been working with EhCache for Hibernate and as an application-level cache for 3 years.
We use it with RMI for cache invalidation and it works really well. If you use the cache for replication, you should take care with the object graph; it can become very heavy with high-cardinality relations.
If you use EhCache for Hibernate you could use it for the query cache (it's a good improvement for read-only tables), and if the table is modified it clears the cache automatically.
Using EhCache to cache collections is a good idea too, to avoid joins and sub-selects.
To clean the caches at time intervals you could implement a cache extension of EhCache that cleans the caches. We did it, it works well.
Also check out Hazelcast, Coherence and GemStone. These are distributed caching solutions with Query support. They also have ready-to-go second level cache plug-in for Hibernate. Hazelcast is open source.
I would like to use Ehcache replicated cache, first as the backend to Hibernate second level cache, second as a cache for any data.
I know how a distributed cache like memcached is working, and I know it can scale to large clusters, but I cannot find how Ehcache replication behaves on large clusters.
Does anyone have a pointer to some information or some kind of benchmark?
I found that many replication strategies can be used, like RMI, JGroups, JMS or Terracotta, and RMI and Terracotta seem the most popular.
How do they compare on large clusters?
Will the replication kill my performance as I add many nodes (like several dozen)?
A fully replicated cache will only work if your application is read-mostly. A replicated cache cannot scale; passing the updates to all the other nodes will kill your performance. You need a partitioned cache with backup replicas. Partitioned caches scale linearly even for write-intensive applications.
Try Hazelcast! It is an open source (Apache license), transactional, partitioned caching solution for Java. It comes with a Hibernate second-level cache plugin.
Several dozen? No problem. A Hazelcast 100-node cluster demo can be found here.
A good solution to the cluster scaling problem is the notion of "buddy replication", where data is only replicated to each node's neighbours (however you define that), rather than to all nodes. You get failover without the scaling issue.
To my knowledge, ehcache doesn't do this. However, JBossCache does, and that also integrates with Hibernate in the same way that ehcache does.
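As a rough illustration of buddy replication (not JBossCache's actual implementation), the sketch below treats the next node in a ring as the "neighbour". The class `BuddyReplicatedCache`, the ring layout and the `failNode` helper are all assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical buddy replication: each write goes to its owner and one neighbour only. */
class BuddyReplicatedCache {
    // Each map stands in for one node's memory; null marks a failed node.
    private final List<Map<String, String>> nodes = new ArrayList<>();

    BuddyReplicatedCache(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    /** Assumption: the buddy is simply the next node in the ring. */
    int buddyOf(int node) {
        return (node + 1) % nodes.size();
    }

    /** A write on the owner node is copied to its buddy, not to the whole cluster. */
    void put(int owner, String key, String value) {
        store(owner, key, value);
        store(buddyOf(owner), key, value);
    }

    /** Reads go to the owner; if it has failed, the buddy still has the data. */
    String get(int owner, String key) {
        Map<String, String> data = nodes.get(owner);
        if (data == null) {
            data = nodes.get(buddyOf(owner));
        }
        return data == null ? null : data.get(key);
    }

    void failNode(int node) {
        nodes.set(node, null);
    }

    int copiesOf(String key) {
        int copies = 0;
        for (Map<String, String> node : nodes) {
            if (node != null && node.containsKey(key)) {
                copies++;
            }
        }
        return copies;
    }

    private void store(int node, String key, String value) {
        Map<String, String> data = nodes.get(node);
        if (data != null) {
            data.put(key, value);
        }
    }
}
```

However many nodes the cluster has, each write touches two of them, so the replication cost per write stays constant while the data still survives the loss of a single node.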
Have you read the section in the manual about Distributed Caching with ehcache?
There are further chapters on:
RMI Distributed Caching
Distributed Caching using JGroups
Distributed Caching using JMS
Distributed Caching via Terracotta