I am currently thinking about Caching Strategies and more importantly avoiding any duplication of data inside the cache. My query is kind of language agnostic but very much programming related.
My question is about efficient caching of paged or filtered data and, more than that, distributed caching. For the latter I have decided to go with memcached, specifically a .NET port of it. I have seen a commercial alternative in NCache, but memcached seems perfectly acceptable to me and is apparently used by Facebook, MySpace, etc.
What I am after, then, is a strategy by which you can keep objects in the cache along with references to them from paged data. If I have 100 items and I page them, I could cache the ids of products 1-10 and cache each product separately. If I were to sort the items descending, items 1-10 would be different products, so I would not want to store the actual objects each time the paging, sorting, or filtering changed. Instead I would store the ids of the objects, so that I could then perform a transactional lookup in the database if some of them are not already in the cache or are invalid.
My initial idea was this for a cache key.
paged_<pageNumber><pageSize><sort><sortDirection>[<filter>]
I would then iterate through the cache keys and remove any that start with "paged_". My question ultimately is whether anyone knows of any patterns or strategies for caching this kind of data, such as paged data, while also making sure that objects are not cached more than once.
memcached is native code and would not have a problem clearing the cache in the way I have described above, but obviously the more items there are in the cache, the longer it would take. I am interested in whether anyone knows of a solution or theory for this type of problem that is currently being employed. I am sure there will be. Thank you for your time.
TIA
Andrew
I once tried what I think is a similar caching strategy and found it unwieldy. I eventually ended up just caching the objects that make up the pages and generating the pages on every request. Ten cache hits to construct a page should (hopefully) give a sub-second response time, which is pretty much instant to the users of your service.
If you must cache entire pages (I think of them as result sets) then perhaps you could run the user request through a hash and use that as your cache key. It's a hard problem to visualize with a concrete example or code (for me at least).
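To make the two ideas above a bit more concrete, here is a minimal Java sketch, assuming an in-process map stands in for memcached. Each page is cached as a list of ids under a hashed request key, and every object is cached exactly once under its own id. The class `PagedIdCache`, its method names, and the two loader functions are hypothetical and only for illustration; they are not part of any memcached client.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: pages hold only ids, objects are stored once by id.
final class PagedIdCache<T> {
    private final Map<String, List<Long>> pageIds = new ConcurrentHashMap<>();
    private final Map<Long, T> objectsById = new ConcurrentHashMap<>();

    // Hash the request so the key stays short however long the filter is.
    static String pageKey(int pageNumber, int pageSize, String sort,
                          String direction, String filter) {
        String raw = pageNumber + "|" + pageSize + "|" + sort + "|" + direction + "|" + filter;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(raw.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("paged_");
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 ships with every Java platform
        }
    }

    // loadIds runs only on a page miss; loadObjects runs only for ids not yet cached.
    // Both loaders are assumed to return non-null results.
    List<T> getPage(String key,
                    Function<String, List<Long>> loadIds,
                    Function<List<Long>, Map<Long, T>> loadObjects) {
        List<Long> ids = pageIds.computeIfAbsent(key, loadIds);
        List<Long> missing = new ArrayList<>();
        for (Long id : ids) {
            if (!objectsById.containsKey(id)) {
                missing.add(id);
            }
        }
        if (!missing.isEmpty()) {
            objectsById.putAll(loadObjects.apply(missing));
        }
        List<T> page = new ArrayList<>();
        for (Long id : ids) {
            T item = objectsById.get(id);
            if (item != null) {
                page.add(item);
            }
        }
        return page;
    }

    // Drop every cached page (the id lists) but keep the objects themselves.
    void invalidatePages() {
        pageIds.clear();
    }

    // Drop one object, for example after it was updated in the database.
    void invalidateObject(long id) {
        objectsById.remove(id);
    }
}
```

With this layout, changing the sort order or filter only produces a new id list. Products that appear on several pages are still loaded and stored once.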
I want to know whether it is useful to use a ConcurrentHashMap for user data. I have the user data saved in a MySQL database and retrieve it when a user logs in (or when someone edits the user). Every time the user goes to another page, this user data is refreshed. Should I use a map and save changes from my application there, with the database in the background, or should I load it directly from the DB each time? I want to make the application as performant as possible.
What you are describing is a cache. Suppose the calls to the database are expensive because there is a lot of data to load, or because the query used to extract the data is complex and takes a long time. This is where a cache comes into play. It is basically in-memory storage, which is much faster than querying the database because the data is already loaded in memory.
The process of filling the cache takes about as long as querying the DB for the data (generally more, but of the same order). So it makes sense to use a cache only if it brings a benefit in time. There is a trade-off, though: speed vs. freshness of data. Depending on your use case you must find the right balance between the two, and only then will you find out whether it is really worthwhile.
As you describe it, i.e. user updates that need to be saved and displayed, using a cache seems like a bit of overkill IMO, unless you have a lot of registered users and many of them use the system simultaneously. If you decide to use one, keep in mind the concurrency issues that may arise. Concurrent hash maps save you from many hazards, but at some cost in performance.
If performance is the priority, I think you should keep the logged-in users in memory.
That way, read requests would be fast because you would not need to query the database. However, you would need to update the map whenever one of the logged-in users is edited.
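A minimal sketch of that idea, assuming user records are keyed by a numeric id and that `loadFromDb` is whatever function reads one user from MySQL. The class name `LoggedInUserCache` and its methods are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: keeps logged-in users in memory, with the database as source of truth.
final class LoggedInUserCache<U> {
    private final Map<Long, U> usersById = new ConcurrentHashMap<>();
    private final Function<Long, U> loadFromDb; // assumed to return a non-null user

    LoggedInUserCache(Function<Long, U> loadFromDb) {
        this.loadFromDb = loadFromDb;
    }

    // On login, read the user once from the database and remember it.
    U onLogin(long userId) {
        U user = loadFromDb.apply(userId);
        usersById.put(userId, user);
        return user;
    }

    // Page requests read from memory; a miss falls back to the database.
    U get(long userId) {
        return usersById.computeIfAbsent(userId, loadFromDb);
    }

    // Call after the edit has been written to the database.
    // replace() only touches users that are currently in the map.
    void onEdit(long userId, U updated) {
        usersById.replace(userId, updated);
    }

    // Free the memory when the user logs out.
    void onLogout(long userId) {
        usersById.remove(userId);
    }
}
```

The important part is `onEdit`: if an edit bypasses it (for example a change made by another application directly in the database), the map keeps serving the old record.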
A human cannot tell the difference between a 1ms delay and a 50ms delay. So it is overkill to optimize beyond "good enough".
MySQL already does a flavor of caching; your addition of another cache may actually slow down the response time.
Some of my RDBMS tables have millions of records and some have a few thousand. I am already caching those records in ehcache. Say I have a million customers already cached in ehcache from a DB table. Now I have to search/filter customers on multiple attributes, which are decided at run time.
One approach is to apply the filtering to the cached data. The good thing is that I save I/O calls, which are costly. The bad thing is that I need to do the filtering in the application (Java).
The second approach is to fetch the data from the DB using a DB index. The good thing is that the index eliminates scanning through all records. The bad thing is that I need to make I/O calls.
Which approach is better performance-wise?
One approach is to apply the filtering to the cached data. The good thing is that I save I/O calls, which are costly. The bad thing is that I need to do the filtering in the application (Java).
You cannot be sure that your cache contains all the data, or that it is consistent. Keeping your cache in sync with the database, possibly honoring transactions, leads you into many other problems.
If we are talking about read-only, analytical data that fits completely in memory, you can load everything into the appropriate data structures (HashMap, tree, etc.). Then you don't need a cache.
Filtering on cached data typically means a sequential scan through the data. This might not be very fast. Some caches provide indexing, but then you are locked in to very vendor-specific extensions.
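To show what "sequential scan" means here, the following is a rough Java sketch, assuming each cached customer is represented as a `Map` of attribute name to value. `CachedFilter`, `matchAll`, and `scan` are hypothetical names and do not come from ehcache. The criteria chosen at run time are folded into one predicate, and every cached row is then tested against it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: filtering cached rows without an index.
final class CachedFilter {
    // Fold the run-time criteria (attribute -> required value) into one predicate.
    static Predicate<Map<String, Object>> matchAll(Map<String, Object> criteria) {
        Predicate<Map<String, Object>> combined = row -> true;
        for (Map.Entry<String, Object> criterion : criteria.entrySet()) {
            combined = combined.and(
                    row -> criterion.getValue().equals(row.get(criterion.getKey())));
        }
        return combined;
    }

    // Every row is visited once, so the cost grows linearly with the cache size.
    static List<Map<String, Object>> scan(Collection<Map<String, Object>> cachedRows,
                                          Predicate<Map<String, Object>> predicate) {
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> row : cachedRows) {
            if (predicate.test(row)) {
                hits.add(row);
            }
        }
        return hits;
    }
}
```

With a million cached customers, each search touches all million rows, whereas a suitable database index can skip most of them.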
The second approach is to fetch the data from the DB using a DB index. The good thing is that the index eliminates scanning through all records. The bad thing is that I need to make I/O calls.
If not all of your data is in the cache, you need to do a DB request anyway, and the DB needs to do the index access anyway, too. A database query can return just the IDs, so you can save the redundant transfer of the row data. Consistency may be an issue here.
Which approach is better performance-wise?
Also keep in mind your own performance as a programmer. Building overly complex solutions will not make you happy or look good in the long run.
What you need to do depends on the cost of database I/O and your problem domain.
If the result set is large, holding the entire result set in memory (a server cache, e.g. Hazelcast) is not feasible. In that case you have to fetch a chunk of data at a time (query-based paging). The downside of query-based paging is that multiple page requests mean multiple calls to the database.
Can anyone suggest how to implement a hybrid of the two approaches?
I haven't put any sample code here since I think the question is more about the logic than about specific code. If you need sample code, I can add it.
Thanks in advance.
The most effective solution is to use the primary key as the paging criterion. This lets us rely on first-class constructs such as a BETWEEN range query, which is simple for the RDBMS to optimize, and the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process. First, retrieve the collection of primary keys and generate the intervals that each identify a proper subset of the data. Second, run the actual queries against the data, one per interval.
This approach is almost as fast as the brute-force version. The memory consumption is about one tenth. By selecting the appropriate page-size for this implementation, you may alter the ratio between execution time and memory consumption. This version is also stateless, it does not keep references to resources like the ScrollableResults version does, nor does it strain the database like the version using setFirstResult/setMaxResult.
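The interval-generation step can be sketched in plain Java as follows. This is an illustration rather than the code from the referenced article, and it assumes the primary keys are numeric and have already been fetched in ascending order. `KeyRangePager` and `intervals` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turns a sorted key list into BETWEEN ranges.
final class KeyRangePager {
    // Each result is an inclusive {low, high} pair covering at most pageSize keys.
    static List<long[]> intervals(List<Long> sortedKeys, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (int start = 0; start < sortedKeys.size(); start += pageSize) {
            int end = Math.min(start + pageSize, sortedKeys.size()) - 1;
            ranges.add(new long[] { sortedKeys.get(start), sortedKeys.get(end) });
        }
        return ranges;
    }
}
```

Each pair would then be bound to a query of the form `WHERE id BETWEEN :low AND :high`. Gaps in the key sequence are harmless, because no rows exist for the missing keys inside a range.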
Effective pagination using Hibernate
Which is the most efficient way to retrieve data for a search operation?
The requirement is as follows: the application needs a search-like feature for known variables (search keywords).
NB: The application already has the search keywords stored as keys in a data cache, in the form of objects maintained at application level, and they are used for purposes other than searching.
There are two possibilities available to enable searching:
(1) Perform some pattern matching with java.util.regex.Pattern and then fetch the identified result rows from the cache, or
(2) Ask the database to perform the match and retrieve the matching rows.
I need to know which is more efficient.
Any input, or data from simulations run for similar operations, would be appreciated.
Option 1 is preferable because it does not involve network I/O.
Pattern matching and looking up in local cache will most likely take nanoseconds or a few milliseconds while sending a request to the database over the wire and waiting for the response will take a few dozen (or a few hundred) milliseconds. It's irrelevant that the database possibly implements the actual data look-up a bit faster than your own code.
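For reference, option 1 can be as small as the sketch below. It assumes the local cache is exposed as a `Map` from search keyword to cached row. `LocalCacheSearch` and `search` are hypothetical names, and case-insensitive matching is my own choice rather than something the question asks for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: regex search over the keys of a local cache.
final class LocalCacheSearch {
    static <V> List<V> search(Map<String, V> cacheByKeyword, String regex) {
        // Compile once per search, not once per key.
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<V> hits = new ArrayList<>();
        for (Map.Entry<String, V> entry : cacheByKeyword.entrySet()) {
            // find() matches anywhere in the key; matches() would require the whole key.
            if (pattern.matcher(entry.getKey()).find()) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }
}
```

This still visits every key, so the advantage over the database comes from avoiding the network round trip, not from a smarter lookup.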
This became too big to put into a comment:
To simply answer your question: Option 1 is preferable with what you describe, e.g. a local cache and a database accessible over the network.
I'd like to emphasize "local" cache. If we're talking about a distributed cache, you incur the network penalty, and then the answer would be "we need more information". Factors to consider are the average size of a row, median network latency, read and write probability, and so on. Answering this is a real pain.
When I face such a decision, I usually go through the following steps to decide what to use. The main metric here is simplicity, i.e. I'm looking for the simplest possible solution that saves my time while still giving a responsive site.
When starting, I try with no cache.
If that doesn't suffice and I still have one app server, I implement a local cache.
When I need to scale out by adding more app servers (behind a load-balancer), I try with no caches again (relying on the DB cache)
Only if that hits a performance limit, I implement a distributed cache system by attaching redis or memcache instances as needed (probably keeping a small cache on the individual app servers).
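The last step above (a small cache on each app server in front of a shared one) could look roughly like this in Java. The `RemoteCache` interface is a stand-in I made up for a redis or memcached client, not a real client API, and `TieredCache` is likewise hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: local cache, then shared cache, then database.
final class TieredCache<K, V> {
    // Stand-in for a distributed cache client; a real one would go over the network.
    interface RemoteCache<K, V> {
        V get(K key);          // returns null on a miss
        void put(K key, V value);
    }

    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final RemoteCache<K, V> remote;
    private final Function<K, V> database;

    TieredCache(RemoteCache<K, V> remote, Function<K, V> database) {
        this.remote = remote;
        this.database = database;
    }

    V get(K key) {
        V value = local.get(key);           // 1. in-process lookup, no network
        if (value != null) {
            return value;
        }
        value = remote.get(key);            // 2. shared cache, one network hop
        if (value == null) {
            value = database.apply(key);    // 3. database as the last resort
            if (value == null) {
                return null;                // nothing to cache
            }
            remote.put(key, value);
        }
        local.put(key, value);
        return value;
    }
}
```

Note that this sketch has no expiry or invalidation at all, so the local tier will serve stale data after a write until something removes the entry.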
Modern databases provide caching support. Most ORM frameworks cache retrieved data too. Why is this duplication necessary?
Because to get the data from the database's cache, you still have to:
Generate the SQL from the ORM's "native" query format
Do a network round-trip to the database server
Parse the SQL
Fetch the data from the cache
Serialise the data to the database's over-the-wire format
Deserialize the data into the database client library's format
Convert the database client library's format into language-level objects (i.e. a collection of whatevers)
By caching at the application level, you don't have to do any of that. Typically, it's a simple lookup of an in-memory hashtable. Sometimes (if caching with memcache) there's still a network round-trip, but all of the other stuff no longer happens.
Here are a couple of reasons why you may want this:
An application caches just what it needs so you should get a better cache hit ratio
Accessing a local cache will probably be a couple of orders of magnitude faster than accessing the database due to network latency - even with a fast network
Scaling read-write transactions using a strongly consistent cache
Scaling read-only transactions can be done fairly easily by adding more Replica nodes.
However, that does not work for the Primary node, since it can only be scaled vertically.
And that's where a cache comes into play. For read-write database transactions that need to be executed on the Primary node, the cache can help you reduce the query load by directing it to a strongly consistent cache, like the Hibernate second-level cache.
Using a distributed cache
Storing an application-level cache in the memory of the application is problematic for several reasons.
First, the application memory is limited, so the volume of data that can be cached is limited as well.
Second, when traffic increases and we want to start new application nodes to handle the extra traffic, the new nodes would start with a cold cache, making the problem even worse as they incur a spike in database load until the cache is populated with data.
To address this issue, it's better to have the cache running as a distributed system, like Redis. This way, the amount of cached data is not limited by the memory size on a single node since sharding can be used to split the data among multiple nodes.
And, when a new application node is added by the auto-scaler, the new node will load data from the same distributed cache. Hence, there's no cold cache issue anymore.
Even if a database engine caches data, indexes, or query result sets, it still takes a round-trip to the database for your application to benefit from that cache.
An ORM framework runs in the same space as your application. So there's no round-trip. It's just a memory access, which is generally a lot faster.
The framework can also decide to keep data in cache as long as it needs it. The database may decide to expire cached data at unpredictable times, when other concurrent clients make requests that utilize the cache.
Your application-side ORM framework may also cache data in a form that the database can't return, e.g. as a collection of Java objects instead of a stream of raw data. If you rely on database caching, your ORM has to repeat that transformation into objects, which adds overhead and decreases the benefit of the cache.
Also, the database's cache might not be as practical as one might think. I copied this from http://highscalability.com/bunch-great-strategies-using-memcached-and-mysql-better-together (it's MySQL-specific, though).
Given that MySQL has a cache, why is memcached needed at all?
The MySQL cache is associated with just one instance. This limits the cache to the maximum addressable memory of one server. If your system is larger than the memory of one server, then using the MySQL cache won't work. And if the same object is read from another instance, it's not cached.
The query cache invalidates on writes. You build up all that cache and it goes away when someone writes to it. Your cache may not be much of a cache at all, depending on usage patterns.
The query cache is row based. Memcached can cache any type of data you want and it isn't limited to caching database rows. Memcached can cache complex objects that are directly usable without a join.
The performance considerations related to the network roundtrips have correctly been pointed out.
To that, it must be added that caching data anywhere other than in the DBMS (NOT the "database") creates the problem of potentially obsolete data still being presented as "up to date".
Giving in to the temptation of a performance improvement comes at the expense of the guarantee (watertight, or at least close to it) of reliably correct and consistent data.
Consider this every time accuracy and consistency are crucial.
A lot of good answers here. I'll add one other point: I know my access pattern, the database doesn't.
Depending on what I'm doing, I know that if the data ends up stale, that's not really a problem. The DB doesn't, and would have to reload the cache with the new data.
I know that I'll come back to a piece of data a few times over the next while, so it's important to keep it around. The DB has to guess at what to keep in the cache; it doesn't have the information I do. So if I fetch it from the DB over and over, it may not be in the cache if the server is busy. I could get a cache miss. With my cache, I can be sure I get a hit. This is especially true for data that is non-trivial to get (i.e. a few joins, some group functions) as opposed to just a single row. Getting a row with the primary key of 7 is easy for the DB, but if it has to do some real work, the cost of the cache miss is much higher.
No doubt modern databases provide caching facilities, but when you have a lot of traffic on your site and need to perform many database transactions at the same time, you will not get high performance. To increase performance in this case, the Hibernate cache will help you by optimizing the database application. The cache stores data already loaded from the database, so that when the application wants to access that data again, both the access time and the traffic between the application and the database are reduced.
That said - caches can sometimes become a burden and actually slow down the server. When you have high load, the algorithm that decides what is cached and what is not might not fit the incoming requests...what you get is a cache that starts to operate like a FIFO queue over time...this begins to make itself known when the table that sits behind the cache has significantly more records than are ever going to be cached in memory...
A good trade-off would be to cluster the data you want to cache. Have a main server that pumps updates to the clusters; the timing of those updates should be tunable for each table through its TTL (time to live) setting.
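As a loose illustration of per-table TTL settings (not of the clustering itself), here is a small Java sketch in which each table would get its own cache instance with its own time to live. `TtlCache` is a hypothetical name, and the injectable clock exists only so that expiry can be tested without sleeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch: entries are reloaded once they are older than the TTL.
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis in production

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, reloading it when it is missing or expired.
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || now >= entry.expiresAt) {
            entry = new Entry<>(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

A slowly changing table such as roles might use a TTL of several minutes, while a frequently updated table would use a much shorter one. The right values depend entirely on how stale each table is allowed to be.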
Your logic and data on the user node can then sit on the same server, which opens up in-memory databases, or if it does have to fetch data then you could set it up to use a pipe instead of a network call...
This takes some thought about how you want to use the data, and when/if you cluster you have to be aware of distributed transactions (transactions over more than one database)...but if the data being cached will be updated on its own without links into other db spaces then you can get away with this....
The problem with ORM caching is that if the database is updated independently through another application, the ORM cache can become out of date...Also it can get tricky if you do an update to a set...the update might change something that is in your cache, and it needs some sort of algorithm to identify which records need to be removed/updated in memory (slowing down the update!?) - and then this algorithm becomes incredibly tricky and bug prone!
If using ORM caching then keep to a simple rule...cache simple objects that hardly ever change (user/role details for example), that are small in size, and that are hit many times in a request...if it's outside of this then I suggest clustering the data for performance.