Tomcat 6 cluster with shared objects

Tomcat 6 cluster with shared objects - java

We have a large cluster of tomcat servers and I'm trying to find an efficient way to share a count among all of them. This count is the number of "widgets" purchased and needs to be checked for every page view. Any server can complete a sale and increment that count, at which point the new value should be made available to all the cluster members.
We don't want to use the count from the database because there will be many page views between updates across the cluster and a get operation to the db for every page view seems unnecessary.
We have an extensive memcached cluster where we could store the value, get it on every page view, and anyone who updates the value sets the new value to the cluster. This again seems wasteful because of a cache get for each page view.
What I'd like to do is have an in-memory value on each server and a multicast (or similar mechanism) message tell all servers that they just incremented and the new number is X. That would seem to be the most efficient because an action is only taken when update is made, instead of doing work for every page view.
How have you handled this in your applications? Am I over-thinking this... should we just throw it in memcached?
Thanks!

Both JBossCache and EhCache can operate in UDP multicast mode, replicating an in-memory cache across multiple virtual machines. Unlike memcached, they operate inside the VM, and so a "cache get" is essentially a free operation. They're also pure java, so no needing to maintain a separate cache system.
JBossCache also provides transactions and sync/async operation, so if those are of interest to you, I'd pick that over EHCache.

If you already have a memcached infrastructure, i don't see why you shouldn't use that it'll be ideal for this. Wether it'll be wasteful or another drop in the ocean, only testing will tell you.
The architecture will be simple as well, compared to introducing something as intrusive as terracotta or another cache sharing mechanism.

Have a look at Terracotta. It gives you a distributed JVM memory model so the value of an object on every box gets update at once.
They have a wrapper around Ehcache, or you can use it transparently to your code with some XML config.
Terracotta provides commercial and open source licenses, and typically, they play down the open source in favour of the commercial -- but the free open source will definitely do what you need and will allow yours apps to scale a very long way.

Related

How to create a simple distributed Map for Java application?

I am looking for a Map to share information between two instances of a Java web application running on separate machines. Reads and writes to this map need to be very fast and don't have to be transactional i.e. its ok if one instance has stale data for a while.
Any recommendations?
I need to keep track of the last time a user did something in the application, so its not terribly bad if this information is out of date. Speed and ease of use are important. I don't want writes to the Map to impact response times.

I would try Hazelcast, JGroups or Ehcache. All support a distributed map.
EDIT: Another option is to use RMI top a service running in one or the other JVM. This avoids the need for an additional library.

Additionally, there is Memcached which is very robust and proven over the time.

Overhead of using coherence cache

I consider caching key-value lists stored in database. Right now for rendering of JSF pages, a lot of redundant queries are executed to find the names to be displayed for some keys (O/R-Mapper: Eclipselink).
The values are quasi-static, but can change very seldom by using the application (no change in database except by the application in question).
A simple cache would suffice when only using one application server. However, load balancing with multiple servers should be possible, avoiding returning stale values if data is changed using one server and therefore not reflected by the other server.
One idea would be to use oracle coherence as distributed cache. I'm not sure whether this is overkill because of the fact that the data is only changed very seldomly and the cache itself does not need to be distributed, only the invalidation should be.
What is the overhead of coherence in terms of memory, execution times and network communication? Are there any alternatives that better suit my use case?
I talk about 50.000 key value pairs, mainly short strings.

If the invalidation is that rare, then you can use a local cache and something like a JMS Topic that everyone subscribes to in order to handle the invalidation.
There's also something like EHCache as an alternative, since it's OSS and free to use vs Coherence, if that's important. I like to use EHCaches pull through ability.

Coherence has relatively low overhead, and can easily manage 50,000 (or 50,000,000) objects. However, if your use case is super simple, and you don't mind doing the invalidation work yourself, and don't need the various QoS that Coherence provides, then it probably is overkill.
Also, this simple use case can easily be done using the Coherence Standard Edition, which is far less expensive (licensed per server instead of per processor, and it's a much lower price).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

How to Implement caching for a web application

What are the different ways to cache a web application data, developed using Java and NoSQL database? Databases also provide caching, are they, the only & always the best option to go with, for caching?
How else can I cache my data of users on the application. Application contains very user specific data like in a social network. Are there some simple thumb rules of what type of things should be cached?
Can I also cache my data on the application server using Java ?

If you want a rule of thumb, here's what Michael Jackson (not that Michael Jackson) said:
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
The ancient tradition is that you don't optimise until you've profiled - that is, until you have hard evidence as to what actually needs to be optimised. Cacheing is a kind of optimisation; it is very likely to be important for your app, but until you are able to put your app under load and look at what objects are taking a long time to obtain (loading from the database or whatever), you won't know what needs cacheing. It really doesn't matter how smart you are, or what advice you get here - until you do that, you will not know what needs to be cached.
As for things you can cache, it's anything, but i suppose you can classify it into three groups:
Things that have come fresh from the database. These are easy to cache, because at the point at which you go to the database, you have the identifying information you'd need for a cache key (primary key, query parameters, etc). By cacheing them, you save the time taken to get them from the database - this involves IO, so it is likely to be quite large.
Things that have been produced by computation in the domain model (news feeds in a social app, perhaps). These may be trickier to cache, because more contextual information goes into producing them; you might have to refactor your code to create a single point where the required information is all to hand, so you can apply cacheing to it. Or you might find that this exists already. Cacheing these will save all the database access needed to obtain the information that goes into making them, as well as all the computation; the time taken for computation may or may not be a significant addition to the time taken for IO. Invalidating cached things of this kind is likely to be much harder than pure database objects.
Things that are being sent to the browser - pages, or fragments of pages. These can be quite easy to cache, because in a properly-designed application, they're uniquely identified by either the URL, or the combination of URL and user. Cacheing these will save all the computation in your app; it can even avoid servicing requests, because it can be done by a reverse proxy sitting in front of your app server. Two problems. Firstly, it uses a huge amount of memory: the page rendered from a few kilobytes of objects could be tens or hundreds of kilobytes in size (my Facebook homepage is 50 kB). That means you have to save a vast amount of computation to make it a better deal than cacheing at the database or domain model layers, and there just isn't that much computation between the domain model and the HTML in a sensibly-designed application. Secondly, invalidation is even harder than in the domain model, and is likely to happen prohibitively often - anything which changes the page or the fragment needs to invalidate the cache.
Finally, the actual mechanism: start with something simple and in-process, like a map with limited size and a least-recently-used eviction policy. That's simple but effective. Something out-of-process like EHCache is more complicated, but has two advantages: you can share caches between multiple processes (helpful if you have a cluster, which you probably will at some point), and you can store data where the garbage collector won't see it, which might save some CPU time (might - this is too big a subject to get into here).
But i reiterate my first point: don't cache until you know what needs to be cached, and once you do, be mindful of the limitations on the benefits of cacheing, and try to keep your cacheing strategy as simple as possible (but no simpler, of course).

I'll assume you're building a relatively typical web application that:
has a single server used for persistence
multiple web servers
ties authenticated users to a single server via sticky sessions through a load balancer
Now, with that stated to answer so of your questions. Most persistence, database or NoSQL, likely have some sort of caching built in such that if you execute the same simple query repeatedly (e.g. retrieval by primary key) it's able to cache the result. However, the more complex the query, the less likely persistence can perform caching on it. In addition, if there's only one server for persistence (i.e. no sharding, or write master/read slaves) it quickly becomes the bottleneck. So the application level caching you want to do usually should occur on the web servers to reduce load on the database.
As far as what should be cached, the heuristic is items frequently accessed and/or expensive to generate (in terms of database/web server processing/memory). Typical candidates are the home page and any other landing page of a site - often the best approach for these is generating a static file and serving that. The next pieces depend on your application, but typically the most effective strategy is caching as close to the final result as possible - often the HTML being served. For your social network this might be a list of featured updates or some such.
As far as user sessions are concerned, these are definitely a good candidate for caching. In this case you can probably get a lot of mileage out of judicious use of the web server's session scope (assuming a JSP server). This data lives in memory and is a good place to keep of user specific information shown once a user authenticates on every page (e.g. first and last name).
Now the final thing to consider is dealing with cache invalidation and really is the hard part of all this (naming stuff is the other hard thing in computer science). In this case using something like memcached or ehcache as others have mentioned is the right approach. ehcache can easily run in process with your java application and does a good job of expiring things, with policies for least recently used and least frequently used, and allowing you to use both memory and disk for caching. What you'll need to think about is the situations where you need to expire something form the cache ahead of this schedule because data's changed. In this case you need to work through those dependencies in your application's architecture so that it read/writes to the cache as appropriate.

memcached tomcat mysql on 1GB RAM

I am new to memcached and caching in general. I have a java web application running on Ubuntu + Tomcat + MySQL on a VPS Server with 1GB of memory.
Does it make sense to add a memcached layer with about 256MB for caching? Will this be too much load on the server? Which is more appropriate caching rendered html pages or database objects?
Please advise.

If you're going to cache pages, don't use memcached, use Varnish. However, there's a good chance that's not a great use of memory. Cacheing pages trades memory for computation and database work, but it does cost quite a lot of memory per page, so it's best for cases where the computation and database work needed to produce a single page amounts to a lot (or the pages are very small!). Also, consider that page cacheing won't be effective, or even possible, if you want to use per-user customisation on your pages (eg showing the number of items in a shopping cart). At least not without getting into some truly hairy shenanigans (edge-side includes, anyone?).
If you're not going to cache pages, and your app is on a single machine, then there's no point using memcached or similar. The point of cache servers like that is to make the memory on one machine work as a cache for another - like how a file server shares a disk, they're essentially memory servers. On a single machine, you might as well give all the memory to Java and cache objects on the heap.
Are you using an object-relational mapper? If so, see if it has any support for a second-level cache. The big three implementations (Hibernate, OpenJPA, and EclipseLink) all support in-memory caches. They're likely to do a much better job than you would if you did the cacheing yourself.
But, if you're not using a mapper, you have no choice but to do the cacheing yourself. There are extension points in LinkedHashMap for building LRU caches, and then of course there's the people's favourite, SoftReference, in combination with a HashMap. Plus, there are probably cache implementations out there you could download and use - i'd be shocked if there wasn't something in the Apache Commons libraries.

memcached won't add any noticeable load on your server, but it will be memory your app can't use. If you only plan to have a single app server for a while, you're better off using an in-JVM cache.
As far what to cache, the answer falls somewhere in the middle of the above. You don't want to cache exactly what's in your database and you certainly don't want to cache the final output. You have a data model representation in your application that isn't exactly what's in the DB (e.g. a User object might be made up of multiple queries from a few different tables). Cache that kind of thing as it's most reusable.
There's lots of info in the memcached site that should help you understand and get going with caching in general and memcached specifically.

It might make sense to do that, why don't try a smaller size like 64 MB and see how that goes. When you use more resources for the memcache, there is less for everything else. You should try it and see what will give you the best performance.

Choosing a distributed shared memory solution

I have a task to build a prototype for a massively scalable distributed shared memory (DSM) app. The prototype would only serve as a proof-of-concept, but I want to spend my time most effectively by picking the components which would be used in the real solution later on.
The aim of this solution is to take data input from an external source, churn it and make the result available for a number of frontends. Those "frontends" would just take the data from the cache and serve it without extra processing. The amount of frontend hits on this data can literally be millions per second.
The data itself is very volatile; it can (and does) change quite rapidly. However the frontends should see "old" data until the newest has been processed and cached. The processing and writing is done by a single (redundant) node while other nodes only read the data. In other words: no read-through behaviour.
I was looking into solutions like memcached however this particular one doesn't fulfil all our requirements which are listed below:
The solution must at least have Java client API which is reasonably well maintained as the rest of app is written in Java and we are seasoned Java developers;
The solution must be totally elastic: it should be possible to add new nodes without restarting other nodes in the cluster;
The solution must be able to handle failover. Yes, I realize this means some overhead, but the overall served data size isn't big (1G max) so this shouldn't be a problem. By "failover" I mean seamless execution without hardcoding/changing server IP address(es) like in memcached clients when a node goes down;
Ideally it should be possible to specify the degree of data overlapping (e.g. how many copies of the same data should be stored in the DSM cluster);
There is no need to permanently store all the data but there might be a need of post-processing of some of the data (e.g. serialization to the DB).
Price. Obviously we prefer free/open source but we're happy to pay a reasonable amount if a solution is worth it. In any way, paid 24hr/day support contract is a must.
The whole thing has to be hosted in our data centers so SaaS offerings like Amazon SimpleDB are out of scope. We would only consider this if no other options would be available.
Ideally the solution would be strictly consistent (as in CAP); however, eventual consistence can be considered as an option.
Thanks in advance for any ideas.

Have a look at Hazelcast. It is pure Java, open source (Apache license) highly scalable in-memory data grid product. It does offer 7X24 support. And it does solve all of your problems I tried to explain each of them below:
It has a native Java Client.
It is 100% dynamic. Add and remove nodes dynamically. No need to change anything.
Again everything is dynamic.
You can configure number of backup nodes.
Hazelcast support persistency.
Everything that Hazelcast offers is free(open source) and it does offer enterprise level support.
Hazelcast is single jar file. super easy to use. Just add jar to your classpath. Have a look at screen cast in main page.
Hazelcast is strictly consistent. You can never read stale data.

I suggest you to use Redisson - Redis based In-memory Data Grid for Java. Implements (BitSet, BloomFilter, Set, SortedSet, Map, ConcurrentMap, List, Queue, Deque, BlockingQueue, BlockingDeque, ReadWriteLock, Semaphore, Lock, AtomicLong, CountDownLatch, Publish / Subscribe, RemoteService, ExecutorService, LiveObjectService, SchedulerService) on top of Redis server! It supports master/slave, sentinel and cluster server modes. Automatic cluster/sentinel servers topology discovery supported also. This lib is free and open-source.
Perfectly works in cloud thanks to AWS Elasticache support

Depending of what you prefer, i would surely follow the others by suggesting Hazelcast if you're towards AP from the CAP Theorem but if you need CP, i would choose Redis

Have a look at Terracotta's JVM clustering, it's OpenSource ;)
It has no API while it works efficent at JVM level, when you store the value in a replicated object it is sent to all other nodes.
Even locking and all those things work transparent and without adding any new code.

You may want to checkout Java-specific solutions like Coherence: http://www.oracle.com/global/ru/products/middleware/coherence/index.html
However, I consider such solutions to be too complex and prefer to use solutions like memcached. Big disadvantage of memcached for your purpose is lack of record lock it seems and there is no built in way to replicate data for failover. That is why I would look into the key-value data stores. Many of them would satisfy your need completely.
Here is a list of key-value data stores that may help you with your task:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
Just pick one that you fill comfortable with.

I am doing a similar project, but instead targeting the .NET platform. Apart from the already mentioned solutions, I think you should take a look at ScaleOut StateServer and Alachisoft NCache. I am afraid neither of these alternatives are cheap, but they are a safer bet than open source for commercial solutions according to my judgement.
Both provide Java client APIs, even though I have only played around with the .NET APIs.
StateServer features self-discovery of new cache nodes, and NCache has a management console where new cache nodes can be added.
Both should be able to handle failovers seamlessly.
StateServer can have 1 or 2 passive copies of the data. NCache features more caching topologies to choose between.
If you mean write-through/write-behind to a database that is available in both.
I have no idea how many cache servers you plan to use, but here are the full price specs:
ScaleOut StateServer
Alachisoft NCache
Both are installed and configured locally on your server and they both have GUI Management.
I am not sure exactly what strictly consistent involves, so I'll leave that for you to investigate..
Overall, StateServer is the best option if you want to skip configuring every little detail in the cache cluster, while NCache features very many features and caching topologies to choose from.
Depending on the behaviour of data towards the clients (if the data is read many times from the same client) it might be a good idea to mix local caching on the clients with the distributed caching in the cluster (available for both NCache and StateServer), just a thought.

The specified use case seems to fit into Netflix's Hollow. This is a read-only replicated cache with a single producer and multiple consumers.

Have you tought about using a standard messaging solution like rabbitmq ?
RabbitMQ is an open source implementation of the AMQP protocol.
Your application seems more or less like a Publish/subscribe system.
The Publisher node is the one that does the processing and puts messages (processed data) in a queue in the servers.
Subscribers can get messages from the server in various ways. AMQP decouples the producer and the consumer of messages and is very flexible in how you can combine the two sides.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.