How to Implement caching for a web application

What are the different ways to cache data in a web application developed using Java and a NoSQL database? Databases also provide caching; are they the only, and always the best, option for caching?
How else can I cache my users' data in the application? The application contains very user-specific data, as in a social network. Are there some simple rules of thumb for what types of things should be cached?
Can I also cache my data on the application server using Java?

If you want a rule of thumb, here's what Michael Jackson (not that Michael Jackson) said:
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
The ancient tradition is that you don't optimise until you've profiled - that is, until you have hard evidence as to what actually needs to be optimised. Cacheing is a kind of optimisation; it is very likely to be important for your app, but until you are able to put your app under load and look at what objects are taking a long time to obtain (loading from the database or whatever), you won't know what needs cacheing. It really doesn't matter how smart you are, or what advice you get here - until you do that, you will not know what needs to be cached.
As for things you can cache, it's anything, but I suppose you can classify it into three groups:
Things that have come fresh from the database. These are easy to cache, because at the point at which you go to the database, you have the identifying information you'd need for a cache key (primary key, query parameters, etc). By cacheing them, you save the time taken to get them from the database - this involves IO, so it is likely to be quite large.
Things that have been produced by computation in the domain model (news feeds in a social app, perhaps). These may be trickier to cache, because more contextual information goes into producing them; you might have to refactor your code to create a single point where the required information is all to hand, so you can apply cacheing to it. Or you might find that this exists already. Cacheing these will save all the database access needed to obtain the information that goes into making them, as well as all the computation; the time taken for computation may or may not be a significant addition to the time taken for IO. Invalidating cached things of this kind is likely to be much harder than pure database objects.
Things that are being sent to the browser - pages, or fragments of pages. These can be quite easy to cache, because in a properly-designed application, they're uniquely identified by either the URL, or the combination of URL and user. Cacheing these will save all the computation in your app; it can even avoid servicing requests, because it can be done by a reverse proxy sitting in front of your app server. Two problems. Firstly, it uses a huge amount of memory: the page rendered from a few kilobytes of objects could be tens or hundreds of kilobytes in size (my Facebook homepage is 50 kB). That means you have to save a vast amount of computation to make it a better deal than cacheing at the database or domain model layers, and there just isn't that much computation between the domain model and the HTML in a sensibly-designed application. Secondly, invalidation is even harder than in the domain model, and is likely to happen prohibitively often - anything which changes the page or the fragment needs to invalidate the cache.
Finally, the actual mechanism: start with something simple and in-process, like a map with limited size and a least-recently-used eviction policy. That's simple but effective. Something out-of-process like EHCache is more complicated, but has two advantages: you can share caches between multiple processes (helpful if you have a cluster, which you probably will at some point), and you can store data where the garbage collector won't see it, which might save some CPU time (might - this is too big a subject to get into here).
But I reiterate my first point: don't cache until you know what needs to be cached, and once you do, be mindful of the limitations on the benefits of cacheing, and try to keep your cacheing strategy as simple as possible (but no simpler, of course).
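
The "map with limited size and a least-recently-used eviction policy" mentioned above can be sketched with the JDK's LinkedHashMap. This is only a minimal illustration: the class name LruCache and the field maxEntries are made up for the example, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A size-limited, least-recently-used cache built on LinkedHashMap. Not thread-safe. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries; // hypothetical name for the size limit

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Consulted after each insertion; returning true evicts the least-recently-used entry
        return size() > maxEntries;
    }
}
```

If several request threads share the cache, it would need external synchronisation (for example Collections.synchronizedMap), because in access-ordered mode even get() modifies the map's internal ordering.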

I'll assume you're building a relatively typical web application that:
has a single server used for persistence
has multiple web servers
ties authenticated users to a single server via sticky sessions through a load balancer
Now, with that stated, to answer some of your questions: most persistence layers, whether relational or NoSQL, likely have some sort of caching built in, such that if you execute the same simple query repeatedly (e.g. retrieval by primary key) they can serve the cached result. However, the more complex the query, the less likely the persistence layer can cache it. In addition, if there's only one server for persistence (i.e. no sharding, or write master/read slaves) it quickly becomes the bottleneck. So the application-level caching you want to do should usually occur on the web servers, to reduce load on the database.
As far as what should be cached, the heuristic is items frequently accessed and/or expensive to generate (in terms of database/web server processing/memory). Typical candidates are the home page and any other landing page of a site - often the best approach for these is generating a static file and serving that. The next pieces depend on your application, but typically the most effective strategy is caching as close to the final result as possible - often the HTML being served. For your social network this might be a list of featured updates or some such.
As far as user sessions are concerned, these are definitely a good candidate for caching. In this case you can probably get a lot of mileage out of judicious use of the web server's session scope (assuming a JSP server). This data lives in memory and is a good place to keep user-specific information that is shown on every page once a user authenticates (e.g. first and last name).
Now the final thing to consider is cache invalidation, which really is the hard part of all this (naming things is the other hard thing in computer science). In this case using something like memcached or ehcache, as others have mentioned, is the right approach. ehcache can easily run in process with your Java application and does a good job of expiring things, with policies for least recently used and least frequently used, and allows you to use both memory and disk for caching. What you'll need to think about is the situations where you need to expire something from the cache ahead of this schedule because the data has changed. In this case you need to work through those dependencies in your application's architecture so that it reads from and writes to the cache as appropriate.

Related

Limit memory consumption of Vaadin session

In Vaadin Flow web apps, the state of the entire user-interface is maintained in the session on the web server, with automatic dynamic generation of the HTML/CSS/JavaScript needed to represent that UI remotely on the web browser client. Depending on the particular app, and the number of users, this can result in a significant amount of memory used on the web container.
Is it possible to limit the amount of memory a session and requests related to it can use?
For example, I would like to limit each user session to one megabyte. This limit should apply to any objects created when handling requests. Is that possible?

It is theoretically possible, but it is not practical.
As far as I am aware, no JVM keeps track of the amount of memory that (say) a thread allocates. So if you wanted to do this, you would have to build a lot of infrastructure yourself. Here are a couple of theoretical ideas.
You could use bytecode engineering to inject some code before each new to measure and record the size of the object allocated. You would need to run this across your entire codebase ... including any Java SE classes and 3rd-party classes that your app uses.
You could modify the JVM to record the information itself. For example, you might modify the memory allocator that new uses.
However, both of these are liable to be a lot of work to implement, debug and maintain. And both are liable to have a significant performance impact.
It is not clear to me why you would need this ... as a general thing. If you have a problem with the memory usage of particular types of requests, then it would be simpler for the request code itself to keep tabs on how big the request data structures are getting. When the data structures get too large, the request could "abort" itself.
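
As a rough illustration of request code "keeping tabs" on its own data structures, here is a minimal sketch. Everything in it is hypothetical: RequestBudget and RowCollector are made-up names, and the per-row size estimate is a crude assumption rather than a real measurement of heap usage.

```java
import java.util.ArrayList;
import java.util.List;

/** Tracks a rough byte estimate for the data one request has gathered; aborts past a limit. */
class RequestBudget {
    private final long maxBytes;
    private long usedBytes;

    RequestBudget(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Records an estimated allocation; throws if the request would exceed its budget. */
    void charge(long estimatedBytes) {
        if (usedBytes + estimatedBytes > maxBytes) {
            throw new IllegalStateException("Request exceeded its memory budget of " + maxBytes + " bytes");
        }
        usedBytes += estimatedBytes;
    }

    long usedBytes() {
        return usedBytes;
    }
}

/** Example of a request-scoped data structure that charges the budget as it grows. */
class RowCollector {
    private final RequestBudget budget;
    private final List<String> rows = new ArrayList<>();

    RowCollector(RequestBudget budget) {
        this.budget = budget;
    }

    void add(String row) {
        // Crude estimate (an assumption): 2 bytes per char plus a flat 40-byte overhead
        budget.charge(2L * row.length() + 40);
        rows.add(row);
    }

    int size() {
        return rows.size();
    }
}
```

The exception is the "abort": the request handler would catch it and return an error instead of continuing to grow. This only accounts for structures that cooperate with the budget, not for every object the request allocates.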

As the correct Answer by Stephen C explains, there is no simple automatic approach to limiting or managing the memory used in Java.
Given the nature of Vaadin Flow web apps, a large amount of memory may be consumed on the server for user sessions containing all the state of each user’s user-interface.
Reduce memory usage of your codebase
The first step is to examine your code base.
Do you have data replicated across users that could instead be shared across users in a thread-safe manner? Do you have cached data not often used that could instead be retrieved again from its source (database, web services call)? Do you cache parts of the UI not currently onscreen that could instead be instantiated again later when needed?
More RAM
Next step is to simply add more memory to your web server.
Buying RAM is much cheaper than paying for the time of programmers and sysadmins. And it is so simple to just drop in more sticks of memory.
Multiple web servers
The next step after that is horizontal scaling: Use multiple web servers.
With load balancers you can spread the user load across servers fairly. And “sticky” sessions can be used to direct further user interactions to the same server to continue a session.
Of course, this horizontal scaling approach is more complicated. But this approach is commonly done in the industry, and well-understood.
Vaadin Fusion
Another programming step could involve refactoring to build parts of your app using Vaadin Fusion.
Instead of your app being driven from the server as with Vaadin Flow, Fusion is focused on web components running in the browser. Instead of writing in pure Java, you write in TypeScript, a superset of JavaScript. Fusion can make calls into Vaadin Flow server as needed to access data and services there.
Consulting
The Vaadin Ltd company sells consulting services, as do others, to assist with any of these steps.
Session serialization
Be aware that without taking these steps, when running low on memory, some web containers such as Apache Tomcat will serialize sessions to disk to purge them from memory temporarily.
This can result in poor performance if the human users are actively still engaged with those sessions. But the more serious problem is that all the objects in your entire sessions must be serializable. And you must code for reconnecting database connections, etc. If supporting such serialization is not feasible, you likely can turn off this serialize-sessions-on-low-memory feature of the web server. But then your web server will suffer when running out of memory with no such recourse available.

Improving the efficiency of a cache-heavy system

I'm about to improve the efficiency of a cache-heavy system, which has the following properties/architecture:
The system has 2 components, a single instance backend and multiple frontend instances, spread across remote data centers.
The backend generates data and writes it to a relational database that is replicated to multiple data centers.
The frontends handle client requests (common web traffic based) by reading data from the database and serving it. Data is stored in a local cache for an hour before it expires and has to be retrieved again.
(The cache’s eviction policy is LRU based).
I want to mention that there are two issues with the implementation above:
It turns out that many of the database accesses are redundant because the underlying data didn't actually change.
On the other hand, a change isn't reflected until the cache TTL elapses, causing staleness issues.
Can you advise a solution that fixes both of these problems?
Should the solution change if the data were stored in a NoSQL database like Cassandra rather than a classic relational database?

Unfortunately, there is no silver bullet here. There are two obvious variants:
Keep a long TTL or cache forever, but invalidate the cached data when it is updated. This might get quite complex and error prone.
Simply lower the TTL to get faster updates. The low TTL approach is IMHO the KISS approach. We go as low as 27 seconds. A cache with a TTL this low does not get a lot of hits during normal operation, but helps a lot when a flash crowd hits your application.
In case your database is powerful enough and has acceptable latency, approach 2 is the simplest one.
If your database does not have acceptable latency, or your application does multiple sequential reads from the database per web request, then you can use a cache that provides refresh ahead or background refresh. This means the cache refreshes the entries automatically and there is no additional latency except for the first read. However, this approach comes with the downside of increasing the database load.
Cassandra may not support the same access strategies as the classic database. Changing to Cassandra will affect your caching as well, e.g. in case you also cache query results. However, the high-level concept stays the same. Your data access layers may change to an asynchronous or reactive pattern, since Cassandra has support for that.
If you want to do invalidation (solution 1) with Cassandra, you can get information from the database about which data has been updated, see CASSANDRA-8844. You may get similar information from "classical" SQL databases, but that is a vendor-specific feature.
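
To make the two variants concrete, here is a minimal sketch of a cache with a short TTL (variant 2) that also exposes an explicit invalidate hook (variant 1). It is not refresh-ahead: an expired entry is reloaded on the next read. The names TtlCache, loader and clock are assumptions for illustration; the clock is passed in so that expiry can be exercised without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Caches loaded values for a fixed time-to-live, then reloads them on the next read. */
class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<K, V> loader; // hypothetical: reads the value from the database
    private final LongSupplier clock;    // e.g. System::currentTimeMillis in production

    TtlCache(long ttlMillis, Function<K, V> loader, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent readers trigger a single reload
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && now < old.expiresAtMillis)
                        ? old
                        : new Entry<V>(loader.apply(k), now + ttlMillis));
        return entry.value;
    }

    /** Variant 1: drop an entry as soon as the underlying data is known to have changed. */
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

With a 27-second TTL, a flash crowd asking for the same key causes at most one database read per 27 seconds for that key, which is where a low-TTL cache earns its keep.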

Global State in Java/Spring

I have a basic Java/Spring MVC CRUD application in production on my company's intranet. I am still a beginner really, this application is what I've used to learn Java and web applications. Basically it has a table that uses AJAX to refresh its data on regular intervals, and an html form that is input into the database. The refresh is important because the data is viewed on multiple computers that need to see the input from the others.
The problem is that, due to network issues outside of my control, the database transactions on certain computers can be very slow.
I have been playing around with React/Redux JavaScript client applications in the past few weeks and the concept of state. Now, as best I can tell, global state or variables are pretty reviled by the Java community. Bugs, difficulty in testing, etc.
But Redux gave me an idea that, when a user hits "submit" instead of inserting a row into SQL, it stores that object in memory on the server. Then at regular intervals that memory is inserted into the database - so the user does not have to wait for database transactions, only communication with the server. Table refreshes don't look at the database - they look at this memory.
But, again as a beginner, I don't see people do this. Why is it a bad idea?

In general, it isn't done for two reasons:
the state is not guaranteed, because it is not actually written.
If you restart the application before the data is flushed to the database, it is silently dropped. This is obviously not a good thing in general, but your mileage may vary. If you don't care so much, this might be OK. You could remedy this by persisting it somewhere locally.
the state is also not guaranteed, because you may end up not being able to write the data, for example because of some database constraint.
So, in general it is frowned upon, because you are lying to the client ... You say you wrote it, but there's no actual effort to ensure this has actually happened.
But then again, if the data is less important, it might be OK.
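
For illustration, here is a minimal sketch of the buffering idea from the question, with comments marking where the two problems above bite. WriteBehindBuffer and databaseWriter are hypothetical names; a real version would need a scheduler to call flush() and some way to report rejected rows back to users.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

/** Buffers submitted rows in memory and writes them to the database later, in batches. */
class WriteBehindBuffer<T> {
    private final Queue<T> pending = new ConcurrentLinkedQueue<>();
    private final Consumer<T> databaseWriter; // hypothetical: performs the real INSERT, throws on failure

    WriteBehindBuffer(Consumer<T> databaseWriter) {
        this.databaseWriter = databaseWriter;
    }

    /** Returns immediately; at this point nothing durable has happened. */
    void submit(T row) {
        pending.add(row);
    }

    /** Rows not yet flushed. These are lost if the process stops now (first problem). */
    int pendingCount() {
        return pending.size();
    }

    /** Writes everything queued; returns the rows the database rejected (second problem). */
    List<T> flush() {
        List<T> rejected = new ArrayList<>();
        T row;
        while ((row = pending.poll()) != null) {
            try {
                databaseWriter.accept(row);
            } catch (RuntimeException e) {
                rejected.add(row); // the user was already told this row was saved
            }
        }
        return rejected;
    }
}
```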

Best way to synchronize cache data between two servers [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I want to synchronize cached data between two servers. Both servers share the same database, but for better performance I have cached the data in a HashMap at startup.
I therefore want to synchronize the cached data without restarting the servers. (Both servers start at the same time.)
Please suggest the best and most efficient way to do this.

Instead of trying to synchronize the cached data between two server instances, why not centralize the caching instead using something like memcached/couchbase or redis? Using distributed caching with something like ehcache is far more complicated and error prone IMO vs centralizing the cached data using a caching server like those mentioned.
As an addendum to my original answer, when deciding what caching approach to use (in memory, centralized), one thing to take into account is the volatility of the data that is being cached.
If the data is stored in the DB, but does not change after the servers load it, then you don't even need synchronization between the servers. Just let them each load this static data into memory from the source and then go about their merry ways doing whatever it is they do. The data won't be changing, so no need to introduce a complicated pattern for keeping the data in sync between the servers.
If there is indeed a level of volatility in the data (like say you are caching looked up entity data from the DB in order to save hits to the DB), then I still think centralized caching is a better approach than in-memory distributed and synchronized caching. You just need to make sure that you use an appropriate expiration on the cached data to allow natural refresh of the data from time to time. Also, you might want to just drop the cached data from the centralized store when in the update path for a particular entity and then just let it be reloaded from the DB into the cache on the next request for that data. This is IMO better than trying to do a true write-through cache where you write to the underlying store as well as the cache. The DB itself might make tweaks to the data (via defaulting unsupplied values for example), and your cached data in that case might not match what's in the DB.
EDIT:
A question was asked in the comments about the advantages of a centralized cache (I'm guessing against something like an in memory distributed cache). I'll provide my opinion on that, but first a standard disclaimer. Centralized caching is not a cure-all. It aims to solve specific issues related to in-jvm-memory caching. Before evaluating whether or not to switch to it, you should understand what your problems are first and see if they fit with the benefits of centralized caching. Centralized caching is an architectural change and it can come with issues/caveats of its own. Don't switch to it simply because someone says it's better than what you are doing. Make sure the reason fits the problem.
Okay, now onto my opinion for what kinds of problems centralized caching can solve vs in-jvm-memory (and possibly distributed) caching. I'm going to list two things although I'm sure there are a few more. My two big ones are: Overall Memory Footprint and Data Synchronization Issues.
Let's start with Overall Memory Footprint. Say you are doing standard entity caching to protect your relational DB from undue stress. Let's also say that you have a lot of data to cache in order to really protect your DB; say in the range of many GBs. If you are doing in-jvm-memory caching, and you had, say, 10 app server boxes, you would need that additional memory ($$$) times 10, once for each of the boxes that would be doing the caching in JVM memory. In addition, you would then have to allocate a larger heap to your JVM in order to accommodate the cached data. I'm of the opinion that the JVM heap should be small and streamlined in order to ease the garbage collection burden. If you have large chunks of Old Gen that can't be collected, then you're going to stress your garbage collector when it goes into a full GC and tries to reap something back from that bloated Old Gen space. You want to avoid long GC pause times, and bloating your Old Gen is not going to help with that. Plus, if your memory requirement is above a certain threshold, and you happen to be running 32-bit machines for your app layer, you'll have to upgrade to 64-bit machines and that can be another prohibitive cost.
Now if you decided to centralize the cached data instead (using something like Redis or Memcached), you could significantly reduce the overall memory footprint of the cached data because you could have it on a couple of boxes instead of all of the app server boxes in the app layer. You probably want to use a clustered approach (both technologies support it) and at least two servers to give you high availability and avoid a single point of failure in your caching layer (more on that in a sec). By only having a couple of machines to support the needed memory requirement for caching, you can save some considerable $$. Also, you can tune the app boxes and the cache boxes differently now as they are serving distinct purposes. The app boxes can be tuned for high throughput and low heap and the cache boxes can be tuned for large memory. And having smaller heaps will definitely help out with overall throughput of the app layer boxes.
Now one quick point for centralized caching in general. You should set up your application in such a way that it can survive without the cache in case it goes completely down for a period of time. In traditional entity caching, this means that when the cache goes completely unavailable, you just are hitting your DB directly for every request. Not awesome, but also not the end of the world.
Okay, now for Data Synchronization Issues. With distributed in-jvm-memory caching, you need to keep the cache in sync. A change to cached data in one node needs to replicate to the other nodes and be sync'd into their cached data. This approach is a little scary in that if for some reason (network failure for example) one of the nodes falls out of sync, then when a request goes to that node, the data the user sees will not be accurate against what's currently in the DB. Even worse, if they make another request and that hits a different node, they will see different data and that will be confusing to the user. By centralizing the data, you eliminate this issue. Now, one could then argue that the centralized cache needs concurrency control around updates to the same cached data key. If two concurrent updates come in for the same key, how do you make sure the two updates don't stomp on each other? My thought here is to not even worry about this; when an update happens, drop the item from the cache (and write through directly to the DB) and let it be reloaded on the next read. It's safer and easier this way. If you don't want to do that, then you can use CAS (Check-And-Set) functionality instead for optimistic concurrency control if you really want to update both the cache and DB on updates.
So to summarize, you can save money and better tune your app layer machines if you centralize the data they cache. You also can get better accuracy of that data as you have fewer data synchronization issues to deal with. I hope this helps.
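
The "drop on update, reload on next read" approach described above can be sketched roughly as follows. This is only an illustration: UserRepository is a made-up class, and the two maps are stand-ins for a centralized cache client (Redis, memcached) and the database. The trim() call is a stand-in for the kind of tweak a real DB might apply to written data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Cache-aside access: reads populate the cache, updates write to the DB and drop the cached copy. */
class UserRepository {
    // Stand-ins (assumptions): a real system would use a cache client and a database here
    private final Map<Long, String> cache = new ConcurrentHashMap<>();
    private final Map<Long, String> db = new ConcurrentHashMap<>();
    int dbReads = 0; // exposed only so the example can show when the DB is hit

    String find(long id) {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        dbReads++;
        String fromDb = db.get(id);
        if (fromDb != null) {
            cache.put(id, fromDb); // cache exactly what the DB returned
        }
        return fromDb;
    }

    void update(long id, String name) {
        db.put(id, name.trim()); // the DB may store something other than what was sent
        cache.remove(id);        // drop rather than update; the next read reloads from the DB
    }
}
```

Because the cache is only ever filled from what the DB returns, a tweak made by the DB on write cannot leave the cached copy out of step with the stored one.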

First, do try to forget about premature optimization. Do you really need the cache? 99% of the time you do not. In that case your solution is to remove the redundant code.
If, however, you do need it, try to stop re-inventing wheels. There are perfectly good ready-to-use libraries, for example ehCache, which has a distributed mode.

Use Hazelcast. It allows data synchronization between servers using a multicast protocol. It's easy to use. It supports locking and other features.

memcached tomcat mysql on 1GB RAM

I am new to memcached and caching in general. I have a java web application running on Ubuntu + Tomcat + MySQL on a VPS Server with 1GB of memory.
Does it make sense to add a memcached layer with about 256MB for caching? Will this be too much load on the server? Which is more appropriate caching rendered html pages or database objects?
Please advise.

If you're going to cache pages, don't use memcached, use Varnish. However, there's a good chance that's not a great use of memory. Cacheing pages trades memory for computation and database work, but it does cost quite a lot of memory per page, so it's best for cases where the computation and database work needed to produce a single page amounts to a lot (or the pages are very small!). Also, consider that page cacheing won't be effective, or even possible, if you want to use per-user customisation on your pages (eg showing the number of items in a shopping cart). At least not without getting into some truly hairy shenanigans (edge-side includes, anyone?).
If you're not going to cache pages, and your app is on a single machine, then there's no point using memcached or similar. The point of cache servers like that is to make the memory on one machine work as a cache for another - like how a file server shares a disk, they're essentially memory servers. On a single machine, you might as well give all the memory to Java and cache objects on the heap.
Are you using an object-relational mapper? If so, see if it has any support for a second-level cache. The big three implementations (Hibernate, OpenJPA, and EclipseLink) all support in-memory caches. They're likely to do a much better job than you would if you did the cacheing yourself.
But, if you're not using a mapper, you have no choice but to do the cacheing yourself. There are extension points in LinkedHashMap for building LRU caches, and then of course there's the people's favourite, SoftReference, in combination with a HashMap. Plus, there are probably cache implementations out there you could download and use - I'd be shocked if there wasn't something in the Apache Commons libraries.
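
As a rough sketch of the SoftReference-plus-HashMap combination mentioned above (SoftCache is a made-up name, and the class is not thread-safe), the idea is that the garbage collector may clear softly reachable values under memory pressure, so the cache gives memory back instead of causing an OutOfMemoryError:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** A cache whose values the garbage collector may reclaim under memory pressure. Not thread-safe. */
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it was never cached or has been reclaimed. */
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // the referent was cleared by the GC; drop the stale entry
        }
        return value;
    }
}
```

Callers have to treat a null result as a cache miss and reload the value. Note that keys whose values were cleared stay in the map until they are next looked up, so this sketch suits a bounded key space.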

memcached won't add any noticeable load on your server, but it will be memory your app can't use. If you only plan to have a single app server for a while, you're better off using an in-JVM cache.
As far as what to cache, the answer falls somewhere in the middle of the above. You don't want to cache exactly what's in your database and you certainly don't want to cache the final output. You have a data model representation in your application that isn't exactly what's in the DB (e.g. a User object might be made up of multiple queries from a few different tables). Cache that kind of thing as it's most reusable.
There's lots of info in the memcached site that should help you understand and get going with caching in general and memcached specifically.

It might make sense to do that. Why not try a smaller size like 64 MB and see how that goes? When you use more resources for memcached, there is less for everything else. You should try it and see what gives you the best performance.
