Cache implementation - java

I've been researching this for a week now, but I'd like some thoughts on my particular situation...
2 physical servers:
Server A - public WAR, admin WAR
Server B - public WAR
Requirements:
Both WARs need to view the same data.
admin WAR modifies / adds data to the cache.
public WARs modify other parts of the cache / add data to it.
The entire cache needs to reside in memory on each physical server (if I add something via the admin WAR or public WAR on Server A, it needs to show up in the public WAR on Server B), so that in the event of a failure we aren't waiting for half the cache to be repopulated.
1,500 active users/server, vast majority of traffic is read, very little write
Additional hardware is out of the question.
Is there a good third party caching solution for this scenario? It seems most distributed caching systems want to leave half the data on Server A and half on Server B, which wouldn't meet our failover performance needs.
Thanks for any ideas!

You should look at Redis

GigaSpaces (http://www.gigaspaces.com/) has a solution for that: it lets you create a "Space" that serves as a cache in replicated mode, so each node holds an exact copy of the data.
They also have a solution for failover or hot standby.
Edit:
GigaSpaces is far more than just a shared cache, but you can use just the caching part. It's called an in-memory data grid. They have dramatically changed their web pages, so I can't find the exact page, but if you search through the documentation you'll find it.
You can start here
http://www.gigaspaces.com/datagrid
But the technology is not free.

Take a look at the replicated options for EhCache.
Sounds like you've been searching for information on "distributed caches", which has a different definition from "replicated cache". A distributed cache is a larger cache system spread out among many machines, so that the loss of any one machine in the cluster does not bring down the entire cache, just a portion of it. In this scenario the total size of your cache can reach (number of machines times memory of each machine).
In a replicated cache, the cached data is copied to every machine, which limits the total cache size to what a single machine can hold (in practice, the memory of the smallest machine).
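To make the size difference concrete, here is a minimal sketch of the arithmetic. The class and method names are made up for illustration, and it assumes a distributed cache that keeps no backup copies (real products usually do, which reduces the usable total).

```java
import java.util.Arrays;

class CacheCapacity {
    /** Distributed (partitioned) cache: each node holds a slice, so capacities add up. */
    static long distributedCapacity(long[] nodeMemoryMb) {
        return Arrays.stream(nodeMemoryMb).sum();
    }

    /** Replicated cache: every node holds everything, so the smallest node sets the limit. */
    static long replicatedCapacity(long[] nodeMemoryMb) {
        return Arrays.stream(nodeMemoryMb).min().orElse(0L);
    }

    public static void main(String[] args) {
        // Two servers with 4 GB of cache memory each (illustrative numbers only).
        long[] nodes = {4096, 4096};
        System.out.println("distributed: " + distributedCapacity(nodes) + " MB");
        System.out.println("replicated:  " + replicatedCapacity(nodes) + " MB");
    }
}
```

With only two servers and a requirement that either one can serve the whole data set after a failure, the replicated figure is the one that matters.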

"It seems most distributed caching systems want to leave half the data on Server A and half on Server B, which wouldn't meet our failover performance needs."
No, that is easy to tweak. Otherwise you would need sticky sessions (you would have to know exactly which cache stores your data). You can choose any solution on the market: EhCache, GigaSpaces, GridGain, etc. I would recommend JBoss Cache; IMHO it is the simplest and exactly what you need.

There are many solutions in this space.
Memcached
EhCache
Infinispan
All of them can be configured as distributed caches. AFAIK Infinispan works best when left as an embedded cache in JBoss AS; last I checked, it was difficult to integrate into other app servers. If you have money, I would recommend BigMemory from Terracotta. It's the commercial derivative of EhCache and provides a lot of additional nice-to-have features.

We use Apache Commons JCS and have been very pleased with it. It claims to be almost twice as fast as EHCache. For the situation you have described, you would probably configure a Lateral TCP Cache.

Related

Using Replicated Cache vs LB sticky session

I need to keep some data in a cache on the server. The servers are in a cluster and a call can go to any of them. In such a scenario, is it better to use a replicated/distributed cache like EhCache, or to use the load balancer's session stickiness?
If the data size (in cache) is big, won't serialization and deserialization across all servers have a performance impact?
Also, in the case of a distributed cache, what is the optimal number of servers up to which such a cache is effective? Since data is replicated to all nodes, with, say, 20 nodes it's like master-to-master replication across all nodes. By that I mean each node will get notifications from the other 19 and will send its modifications to the other 19. Does such a setup scale?

As always in distributed systems, the answer depends on several things:
A load balancer with sticky sessions is for sure the simpler way for the developer, since it doesn’t make any difference if the application runs on 1, 2 or 100 servers. If this is all you care about, stick with it and you can stop reading right here.
I’m not sure how session-aware load balancers are implemented or what their general limit in requests per second would be, but they have at least one big disadvantage compared with a distributed cache: what do you do if the machine handling a given set of sessions goes down? If you distribute your cache, any machine can serve the request, and it doesn’t matter if one of them fails. The serialisation/deserialisation part is not a big problem; the network is more likely to be the bottleneck if you don't run on at least a 1 Gbit network, but it should be OK.
For a distributed cache you could go with Hazelcast, Infinispan or similar solutions, which simplify access from your own application. (Update: these implementations use a DHT to distribute the cache.)
For a fully replicated cache you could use Ehcache, which you mentioned, or Infinispan. The advantage over a distributed cache is much faster reads, since all the data is replicated on every machine and you only need to access it locally. The disadvantages are slower writes (so use it for read-very-often, write-very-seldom scenarios) and the fact that your cache is limited by the amount one machine is able to store. If you are running your applications on servers with 64 GB of RAM this is OK. If you want to distribute them over small Amazon instances, it is probably a bad idea. I think you will run out of memory before you hit any problems with updating too many nodes, and that limit is at least very easy to calculate: AVG_CACHE_NEEDED_PER_CLIENT * NUMBER_OF_CLIENTS < MEMORY_FOR_CACHE_AVAILABLE (on one server). If you need more cache than is available on any node in your Ehcache cluster, full replication is no longer possible.
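The memory check above is simple enough to write down as code. This is only a sketch of the inequality from the answer; the class name, method name and sample numbers are made up for illustration.

```java
class ReplicationSizing {
    /**
     * True if the whole cache fits on a single node, i.e.
     * AVG_CACHE_NEEDED_PER_CLIENT * NUMBER_OF_CLIENTS < MEMORY_FOR_CACHE_AVAILABLE.
     * multiplyExact throws ArithmeticException instead of silently overflowing.
     */
    static boolean fitsOnOneNode(long avgBytesPerClient, long clients, long cacheBytesPerNode) {
        return Math.multiplyExact(avgBytesPerClient, clients) < cacheBytesPerNode;
    }

    public static void main(String[] args) {
        long fourGiB = 4L << 30; // memory set aside for the cache on one server (assumed)
        System.out.println(fitsOnOneNode(2_000_000L, 1_500L, fourGiB)); // about 3 GB needed
        System.out.println(fitsOnOneNode(2_000_000L, 3_000L, fourGiB)); // about 6 GB needed
    }
}
```

If the check fails for the smallest node in the cluster, full replication is off the table and a partitioned (distributed) layout is the remaining option.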
Or you could use a Redis cluster or similar independent from your application and the servers your application is running on. This would allow you to scale the cache at a different speed than the rest of your application, however the access to the data wouldn’t be that trivial.
Of course the actual decision depends on your very specific use-case and the demands you are putting on your application.
Personally I was very happy when I found out today that Azure WebPages have a load balancer with sticky session support, and I don’t need to reconfigure my application to use Redis as a session object store, and can just keep everything as it is.
But for a huge workload with hundreds of servers a simple load balancer probably will be rather overwhelmed, and distributed cache, or centralized replicated cache (Redis) will be the way to go.

Caching approach for a cluster of servers

I have a Java application deployed on a cluster of JBoss AS 5.1 which requires a lot (> 3 GB) of data to be cached.
Right now the server cluster has just 2 nodes (separate machines).
Here are specific requirements:
Both nodes should not require data to be loaded into cache (i.e., there should either be replication or cache should reside on a separate server)
The data should never expire.
Both of the above requirements are REALLY important for the application. I'd be thankful if the suggestion would be made keeping both of these in mind.
I should also add a third requirement:
ease of use
The application was initially using a HashMap. I tried replacing the HashMap with JBoss Cache 3.2.1 for its replication and thread-safety features, but I'm not really happy with JBoss Cache's performance. Also, when I load the data into the cache, the 8 GB of RAM is almost entirely used (most of it by the cache entries).
I'd like to hear the experience of people who have handled such kind of caching scenario themselves. Thanks for your time in advance.

You can try GigaSpaces XAP, whose data grid can act as a replicated cache. It performs very well.
http://www.gigaspaces.com/datagrid

If you want a cache that provides a Java HashMap interface and can easily support gigabytes of cache data, with no expiry, then check out Oracle Coherence. This would use the Coherence "distributed cache" option (which is the default configuration). For more info, see: http://coherence.oracle.com/
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

I have 'solved' this problem before (work code, can't show you)... but, I can tell you this much:
with large volumes, a large amount of memory is used in overhead in HashMaps.
you can save a lot of memory by replacing java.util.* classes with smart uses of arrays.
every time you have memory allocations you also have to scan/collect that memory in the GC, so saving memory also improves performance.
Wherever you can, use arrays....
Edit: Apparently the concept of Hash Maps has been forgotten.... Has the Java implementation of HashMap made people believe it is the only way? A structured set of arrays, with a hash function, and a binary sort.... all basic structures... http://en.wikipedia.org/wiki/Hash_table
One array to add keys to, a parallel array to store the values in, and an int-based hash table for fast lookup into the key array...
Computer Science - maybe second year ;-)
Edit again: I used the core concepts I have described in the JDOM project, here: https://github.com/hunterhacker/jdom/blob/master/core/src/java/org/jdom2/StringBin.java
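For readers who have not seen the layout described above (a key array, a parallel value array, and an int-based hash table pointing into them), here is a minimal sketch. It is not the JDOM StringBin code and not the author's work code; the class name is made up, it only supports non-null String keys, and it has no removal.

```java
import java.util.Arrays;

class ArrayBackedMap<V> {
    private String[] keys = new String[16];
    private Object[] values = new Object[16];   // parallel to keys
    private int[] table = new int[32];          // slot -> index into keys/values, -1 = empty
    private int size = 0;

    ArrayBackedMap() {
        Arrays.fill(table, -1);
    }

    /** Linear probing: returns the slot holding the key, or the empty slot where it belongs. */
    private int slotFor(String key) {
        int mask = table.length - 1;            // table length is always a power of two
        int h = key.hashCode();
        int slot = (h ^ (h >>> 16)) & mask;
        while (table[slot] != -1 && !keys[table[slot]].equals(key)) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    void put(String key, V value) {
        int slot = slotFor(key);
        if (table[slot] != -1) {                // existing key: overwrite in place
            values[table[slot]] = value;
            return;
        }
        if (size == keys.length) {              // arrays full: grow, then find the slot again
            grow();
            slot = slotFor(key);
        }
        keys[size] = key;
        values[size] = value;
        table[slot] = size++;
    }

    @SuppressWarnings("unchecked")
    V get(String key) {
        int index = table[slotFor(key)];
        return index == -1 ? null : (V) values[index];
    }

    int size() {
        return size;
    }

    /** Doubles the arrays and rebuilds the int table; it stays at most half full. */
    private void grow() {
        keys = Arrays.copyOf(keys, keys.length * 2);
        values = Arrays.copyOf(values, values.length * 2);
        table = new int[keys.length * 2];
        Arrays.fill(table, -1);
        for (int i = 0; i < size; i++) {
            table[slotFor(keys[i])] = i;
        }
    }
}
```

Compared with java.util.HashMap there is no per-entry node object: the only per-entry cost beyond the key and value themselves is two array slots and an int, which is where the memory saving comes from.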

Shared cache between Tomcat web apps

I'm looking for a solution to share a cache between two Tomcat web apps running on different hosts. The cache is being used for data synchronization, so it must be guaranteed to be up to date at all times on both Tomcat instances. (Sorry, I'm not 100% sure whether the correct term for this requirement is "consistency" or something more specific, like having ACID properties.) Another requirement, of course, is that access to the cache should be fast, with about equal numbers of writes and reads. I do have access to a shared filesystem, so that is a consideration.
I've looked at something like ehcache but in order to get a shared cache between the webapps I would either need to implement on top of a Terracotta environment or using the new ehcache cache server. The former (Terracotta) seems like overkill for this, while the cache web server seems like it wouldn't provide the fast performance that I want.
Another solution I've looked at is building something simple on top of a fast key-value store like Redis or memcachedb. Redis is in-memory but can easily be configured to be a centralized cache, while memcachedb is a disk-based persistent cache which could work because I have a shared filesystem.
I'm looking for suggestions on how to best solve this problem. The solution needs to be a relatively mature technology as it will be used in a production environment.
Thanks in advance!

I'm quite sure that you don't need Terracotta or the Ehcache server if you need a distributed cache. Ehcache with one of its four replication mechanisms would do.
However, based on what you've written, I guess that you're looking for more than just a cache. Memcached and Ehcache are examples of what you might call a caching layer for your application, nothing more.
If you find yourself using words like 'guaranteed', 'up-to-date' and 'ACID', you're better off using an in-memory DB like Oracle TimesTen, MySQL Cluster or Redis with disk-based persistent storage.

You can use memcached (not memcachedb) for fast and efficient caching. Redis or memcachedb could be overkill unless you want persistent caching. Memcached can be clustered very easily, and you can use the spymemcached Java client to access it. Memcached is very mature and is running on several hundred thousand, if not millions, of production servers. It can be monitored with Nagios and Munin in production.

memcached tomcat mysql on 1GB RAM

I am new to memcached and caching in general. I have a java web application running on Ubuntu + Tomcat + MySQL on a VPS Server with 1GB of memory.
Does it make sense to add a memcached layer with about 256MB for caching? Will this be too much load on the server? Which is more appropriate caching rendered html pages or database objects?
Please advise.

If you're going to cache pages, don't use memcached, use Varnish. However, there's a good chance that's not a great use of memory. Cacheing pages trades memory for computation and database work, but it does cost quite a lot of memory per page, so it's best for cases where the computation and database work needed to produce a single page amounts to a lot (or the pages are very small!). Also, consider that page cacheing won't be effective, or even possible, if you want to use per-user customisation on your pages (eg showing the number of items in a shopping cart). At least not without getting into some truly hairy shenanigans (edge-side includes, anyone?).
If you're not going to cache pages, and your app is on a single machine, then there's no point using memcached or similar. The point of cache servers like that is to make the memory on one machine work as a cache for another - like how a file server shares a disk, they're essentially memory servers. On a single machine, you might as well give all the memory to Java and cache objects on the heap.
Are you using an object-relational mapper? If so, see if it has any support for a second-level cache. The big three implementations (Hibernate, OpenJPA, and EclipseLink) all support in-memory caches. They're likely to do a much better job than you would if you did the cacheing yourself.
But, if you're not using a mapper, you have no choice but to do the cacheing yourself. There are extension points in LinkedHashMap for building LRU caches, and then of course there's the people's favourite, SoftReference, in combination with a HashMap. Plus, there are probably cache implementations out there you could download and use - I'd be shocked if there wasn't something in the Apache Commons libraries.
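The LinkedHashMap extension point mentioned above is removeEldestEntry, used together with access ordering. A minimal sketch follows; the class name LruCache is just for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");        // touch "a", so "b" becomes the least recently used entry
        cache.put("c", 3);     // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet());
    }
}
```

LinkedHashMap is not thread-safe, and in access-order mode even get() modifies the map, so in a web application wrap it with Collections.synchronizedMap or guard it with your own lock.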

memcached won't add any noticeable load on your server, but it will be memory your app can't use. If you only plan to have a single app server for a while, you're better off using an in-JVM cache.
As for what to cache, the answer falls somewhere in the middle of the above. You don't want to cache exactly what's in your database and you certainly don't want to cache the final output. You have a data model representation in your application that isn't exactly what's in the DB (e.g. a User object might be made up of multiple queries from a few different tables). Cache that kind of thing, as it's the most reusable.
There's lots of info in the memcached site that should help you understand and get going with caching in general and memcached specifically.
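As an illustration of the in-JVM option, here is a minimal sketch of a cache for assembled domain objects such as the User example above. The class name and the loader are hypothetical; the loader stands in for whatever code runs the queries and builds the object. The sketch is unbounded and never expires entries, so a real one would need a size limit or invalidation on writes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

class DomainObjectCache<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // builds the object on a cache miss

    DomainObjectCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached object, invoking the loader at most once per missing key. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    /** Call this after the underlying rows change so the next get() reloads. */
    void invalidate(K key) {
        entries.remove(key);
    }

    public static void main(String[] args) {
        // The lambda is a placeholder for "run several queries and assemble a User".
        DomainObjectCache<Integer, String> users =
                new DomainObjectCache<>(id -> "user-" + id);
        System.out.println(users.get(42));   // loaded
        System.out.println(users.get(42));   // served from the heap
    }
}
```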

It might make sense to do that. Why not try a smaller size like 64 MB and see how that goes? The more resources you give to memcached, the less there is for everything else. Try it and see what gives you the best performance.

Choosing a distributed shared memory solution

I have a task to build a prototype for a massively scalable distributed shared memory (DSM) app. The prototype would only serve as a proof-of-concept, but I want to spend my time most effectively by picking the components which would be used in the real solution later on.
The aim of this solution is to take data input from an external source, churn it and make the result available for a number of frontends. Those "frontends" would just take the data from the cache and serve it without extra processing. The amount of frontend hits on this data can literally be millions per second.
The data itself is very volatile; it can (and does) change quite rapidly. However the frontends should see "old" data until the newest has been processed and cached. The processing and writing is done by a single (redundant) node while other nodes only read the data. In other words: no read-through behaviour.
I was looking into solutions like memcached however this particular one doesn't fulfil all our requirements which are listed below:
The solution must at least have Java client API which is reasonably well maintained as the rest of app is written in Java and we are seasoned Java developers;
The solution must be totally elastic: it should be possible to add new nodes without restarting other nodes in the cluster;
The solution must be able to handle failover. Yes, I realize this means some overhead, but the overall served data size isn't big (1G max) so this shouldn't be a problem. By "failover" I mean seamless execution without hardcoding/changing server IP address(es) like in memcached clients when a node goes down;
Ideally it should be possible to specify the degree of data overlapping (e.g. how many copies of the same data should be stored in the DSM cluster);
There is no need to permanently store all the data but there might be a need of post-processing of some of the data (e.g. serialization to the DB).
Price. Obviously we prefer free/open source, but we're happy to pay a reasonable amount if a solution is worth it. In any case, a paid 24-hour support contract is a must.
The whole thing has to be hosted in our data centers so SaaS offerings like Amazon SimpleDB are out of scope. We would only consider this if no other options would be available.
Ideally the solution would be strictly consistent (as in CAP); however, eventual consistency can be considered as an option.
Thanks in advance for any ideas.

Have a look at Hazelcast. It is a pure Java, open source (Apache license), highly scalable in-memory data grid product, and 24x7 support is available. It addresses all of your requirements; I have tried to go through each of them below:
It has a native Java Client.
It is 100% dynamic. Add and remove nodes dynamically. No need to change anything.
Again everything is dynamic.
You can configure number of backup nodes.
Hazelcast supports persistence.
Everything that Hazelcast offers is free (open source), and enterprise-level support is available.
Hazelcast is a single jar file and super easy to use: just add the jar to your classpath. Have a look at the screencast on the main page.
Hazelcast is strictly consistent. You can never read stale data.

I suggest you use Redisson, a Redis-based in-memory data grid for Java. It implements (BitSet, BloomFilter, Set, SortedSet, Map, ConcurrentMap, List, Queue, Deque, BlockingQueue, BlockingDeque, ReadWriteLock, Semaphore, Lock, AtomicLong, CountDownLatch, Publish / Subscribe, RemoteService, ExecutorService, LiveObjectService, SchedulerService) on top of a Redis server. It supports master/slave, sentinel and cluster server modes, with automatic discovery of the cluster/sentinel server topology. The library is free and open source.
It works well in the cloud thanks to AWS ElastiCache support.

Depending on what you prefer: I would follow the others in suggesting Hazelcast if you lean towards AP in the CAP theorem, but if you need CP, I would choose Redis.

Have a look at Terracotta's JVM clustering; it's open source ;)
It has no API because it works efficiently at the JVM level: when you store a value in a replicated object, it is sent to all other nodes.
Even locking and all those things work transparently, without adding any new code.

You may want to check out Java-specific solutions like Coherence: http://www.oracle.com/global/ru/products/middleware/coherence/index.html
However, I consider such solutions too complex and prefer to use solutions like memcached. The big disadvantages of memcached for your purpose seem to be the lack of record locking and the absence of a built-in way to replicate data for failover. That is why I would look into key-value data stores. Many of them would satisfy your needs completely.
Here is a list of key-value data stores that may help you with your task:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
Just pick one that you feel comfortable with.

I am doing a similar project, but instead targeting the .NET platform. Apart from the already mentioned solutions, I think you should take a look at ScaleOut StateServer and Alachisoft NCache. I am afraid neither of these alternatives are cheap, but they are a safer bet than open source for commercial solutions according to my judgement.
Both provide Java client APIs, even though I have only played around with the .NET APIs.
StateServer features self-discovery of new cache nodes, and NCache has a management console where new cache nodes can be added.
Both should be able to handle failovers seamlessly.
StateServer can have 1 or 2 passive copies of the data. NCache features more caching topologies to choose between.
If you mean write-through/write-behind to a database, that is available in both.
I have no idea how many cache servers you plan to use, but here are the full price specs:
ScaleOut StateServer
Alachisoft NCache
Both are installed and configured locally on your server and they both have GUI Management.
I am not sure exactly what strictly consistent involves, so I'll leave that for you to investigate..
Overall, StateServer is the best option if you want to skip configuring every little detail in the cache cluster, while NCache offers a great many features and caching topologies to choose from.
Depending on the behaviour of data towards the clients (if the data is read many times from the same client) it might be a good idea to mix local caching on the clients with the distributed caching in the cluster (available for both NCache and StateServer), just a thought.

The specified use case seems to fit into Netflix's Hollow. This is a read-only replicated cache with a single producer and multiple consumers.
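The single-producer, many-readers pattern can be sketched inside one JVM with an atomically swapped immutable snapshot. To be clear, this is not Hollow's API, and the class name is made up; it only illustrates how readers keep seeing the old data until the producer publishes a complete new set. Getting the snapshot to other machines still needs a transport, which is the part a product like Hollow provides.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class SnapshotCache<K, V> {
    // Readers always see one complete, immutable snapshot.
    private final AtomicReference<Map<K, V>> current =
            new AtomicReference<>(Collections.emptyMap());

    /** Called by the single producer once a full new data set has been computed. */
    void publish(Map<K, V> freshData) {
        // Defensive copy, so later changes to freshData cannot leak into the snapshot.
        current.set(Collections.unmodifiableMap(new HashMap<>(freshData)));
    }

    /** Lock-free read against whichever snapshot is current at this moment. */
    V get(K key) {
        return current.get().get(key);
    }

    /** A reader can hold on to one snapshot to get a consistent view across several keys. */
    Map<K, V> snapshot() {
        return current.get();
    }

    public static void main(String[] args) {
        SnapshotCache<String, Integer> prices = new SnapshotCache<>();
        Map<String, Integer> batch = new HashMap<>();
        batch.put("item", 100);
        prices.publish(batch);

        batch.put("item", 120);                  // still being processed: readers see 100
        System.out.println(prices.get("item"));
        prices.publish(batch);                   // readers switch to the new value at once
        System.out.println(prices.get("item"));
    }
}
```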

Have you thought about using a standard messaging solution like RabbitMQ?
RabbitMQ is an open source implementation of the AMQP protocol.
Your application seems more or less like a Publish/subscribe system.
The Publisher node is the one that does the processing and puts messages (processed data) in a queue in the servers.
Subscribers can get messages from the server in various ways. AMQP decouples the producer and the consumer of messages and is very flexible in how you can combine the two sides.
