We want to use Guava cache for caching third-party data to get better response times. The cache needs to be pre-loaded by making a sequence of API calls (~4000 of them). Each API response contains the cache key and its value. These API calls need to be made in parallel from multiple threads (i.e. a thread pool) to speed up cache loading. Each cache entry would have an expiry time, which can be set via expireAfterAccess().
After a cache entry expires, it needs to be refreshed automatically in the background. There should also be a way (an API) to stop this background refresh so that we do not keep making API calls endlessly; we will call this API once we stop receiving user requests for a configured time interval.
Is it possible to delegate the thread management for cache loading and refresh to Guava? That is, given the API call, the code to map the JSON response to a Java object, and the cache key-value design, can Guava perform the pre-loading and refresh on its own?
Thanks.
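For reference, a minimal sketch of the pre-loading step described above (the pool size, the expiry duration, and fetchFromApi/ApiResponse are all hypothetical stand-ins for the real API call plus JSON mapping):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CachePreloader {

    // Hypothetical holder for the API response, which contains both
    // the cache key and its value, as described in the question.
    static class ApiResponse {
        final String key;
        final String value;
        ApiResponse(String key, String value) { this.key = key; this.value = value; }
    }

    // Entries expire when not accessed for some duration (value here is arbitrary).
    static final Cache<String, String> cache = CacheBuilder.newBuilder()
            .expireAfterAccess(30, TimeUnit.MINUTES)
            .build();

    // Fire the ~4000 API calls from a fixed pool and put the results in the cache.
    public static void preload(List<String> requests) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (String request : requests) {
            pool.execute(() -> {
                ApiResponse r = fetchFromApi(request);
                cache.put(r.key, r.value);
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    // Placeholder for the real API call and JSON-to-object mapping.
    static ApiResponse fetchFromApi(String request) {
        return new ApiResponse("key-" + request, "value-" + request);
    }
}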
Automatic refreshing in Guava can be enabled via CacheBuilder.refreshAfterWrite(). The relevant semantics are described as:
Specifies that active entries are eligible for automatic refresh once
a fixed duration has elapsed after the entry's creation, or the most
recent replacement of its value. [ ... ] Currently automatic refreshes
are performed when the first stale request for an entry occurs.
By overriding the method CacheLoader.reload() you can use a thread pool to load values asynchronously.
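A sketch of such an override (the pool size and fetchFromApi are assumptions; Guava also offers CacheLoader.asyncReloading(loader, executor) as a ready-made wrapper for the same thing):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListenableFutureTask;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncRefreshExample {

    private static final Executor refreshPool = Executors.newFixedThreadPool(4);

    public static LoadingCache<String, String> build() {
        return CacheBuilder.newBuilder()
                .refreshAfterWrite(60, TimeUnit.SECONDS)
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return fetchFromApi(key); // blocking load on a cache miss
                    }

                    @Override
                    public ListenableFuture<String> reload(String key, String oldValue) {
                        // Refresh on the pool; readers keep seeing oldValue meanwhile.
                        ListenableFutureTask<String> task =
                                ListenableFutureTask.create(() -> fetchFromApi(key));
                        refreshPool.execute(task);
                        return task;
                    }
                });
    }

    private static String fetchFromApi(String key) {
        return "value-for-" + key; // placeholder for the real API call
    }
}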
The problem with this behavior is that you always get a few reads of stale values before the new value has been loaded (if the load succeeds). An alternative cache implementation, such as cache2k, starts refreshing immediately after the duration elapses. The latter approach yields more recent data, but possibly more needless loads. See some recent discussion about that here: https://github.com/ben-manes/caffeine/issues/261
From the Guava documentation here, when a request is received for a cache key whose refresh time has expired, the cache starts an asynchronous thread to refresh the key (if it is set up that way) and returns the existing value. What I did not get from the documentation is what happens when another request is received immediately after the first one, while the previous async refresh thread is still running. Will it:
1. Return the old data AND start another async refresh thread, OR
2. See that another refresh thread is already running and just return the old data without spawning a new one?
If the behavior is the first one, I would be concerned that under a very high request rate it might end up using significant resources in the background.
Note that the JavaDoc for refreshAfterWrite states,
The semantics of refreshes are specified in LoadingCache.refresh(K)
This clarifies the behavior as,
Returns without doing anything if another thread is currently loading the value for {@code key}. If the cache loader associated with this cache performs refresh asynchronously then this method may return before refresh completes.
The cache is smart enough to do (2), and you can easily verify that by writing a simple unit test.
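For instance, a rough sketch of such a test as a main method (names and timings are illustrative): it holds one refresh in flight and checks that a concurrent read returns the old value without triggering a second reload.

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleRefreshDemo {
    public static void main(String[] args) throws Exception {
        AtomicInteger reloads = new AtomicInteger();
        CountDownLatch reloadStarted = new CountDownLatch(1);
        CountDownLatch releaseReload = new CountDownLatch(1);

        LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .refreshAfterWrite(10, TimeUnit.MILLISECONDS)
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return "old";
                    }

                    @Override
                    public ListenableFuture<String> reload(String key, String oldValue)
                            throws Exception {
                        reloads.incrementAndGet();
                        reloadStarted.countDown();
                        releaseReload.await(); // keep this refresh "in flight"
                        return Futures.immediateFuture("new");
                    }
                });

        cache.get("k");   // initial load
        Thread.sleep(50); // let the entry go stale

        // First stale read triggers the refresh and blocks inside reload above.
        Thread first = new Thread(() -> cache.getUnchecked("k"));
        first.start();
        reloadStarted.await();

        // Second read while the refresh is still running:
        System.out.println(cache.get("k")); // prints "old" -- no waiting
        System.out.println(reloads.get());  // prints 1 -- no second refresh spawned

        releaseReload.countDown();
        first.join();
    }
}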
I use CacheBuilder with expireAfterWrite(2000, TimeUnit.MILLISECONDS). I send 10000 requests to my program, and I expect the RemovalListener to be called 10000 times, 2 seconds after each write. I do not observe this behaviour; instead the RemovalListener gets called only 1 or 2 times.
Can someone please explain what CacheBuilder is doing? As I explained above, it is doing something totally different from what the Guava documentation says.
In the same spirit, I use maximumSize(1000), and after sending my program 10000 requests I expect the RemovalListener to be called 9000 times. But it's called only 1 or 2 times.
How does this module actually work?
EDIT
I explicitly call Cache.cleanUp() each time I receive a request.
The removal behavior is documented and works as expected (emphasis mine):
When Does Cleanup Happen?
Caches built with CacheBuilder do not perform cleanup and evict values "automatically," or instantly after a value expires, or anything of the sort. Instead, it performs small amounts of maintenance during write operations, or during occasional read operations if writes are rare.
The reason for this is as follows: if we wanted to perform Cache maintenance continuously, we would need to create a thread, and its operations would be competing with user operations for shared locks. Additionally, some environments restrict the creation of threads, which would make CacheBuilder unusable in that environment.
Instead, we put the choice in your hands. If your cache is high-throughput, then you don't have to worry about performing cache maintenance to clean up expired entries and the like. If your cache does writes only rarely and you don't want cleanup to block cache reads, you may wish to create your own maintenance thread that calls Cache.cleanUp() at regular intervals.
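For example, a dedicated maintenance thread (the one-second period here is an arbitrary choice) makes the listener fire close to the expiry time instead of waiting for the next read or write:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CleanupThreadExample {
    public static void main(String[] args) throws InterruptedException {
        RemovalListener<String, String> listener = notification ->
                System.out.println("removed " + notification.getKey()
                        + " because " + notification.getCause());

        Cache<String, String> cache = CacheBuilder.newBuilder()
                .expireAfterWrite(2000, TimeUnit.MILLISECONDS)
                .removalListener(listener)
                .build();

        // Without this thread, expired entries are only cleaned up (and the
        // listener only called) piggybacked on other reads and writes.
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        cleaner.scheduleAtFixedRate(cache::cleanUp, 1, 1, TimeUnit.SECONDS);

        cache.put("k", "v");
        Thread.sleep(4000); // listener prints: removed k because EXPIRED
        cleaner.shutdown();
    }
}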
If you want more control over the cache and a dedicated executor to take care of calling RemovalListeners, use Caffeine -- a high-performance, near-optimal caching library based on Java 8 with an API similar to Guava's Cache (same author). Caffeine has more advanced removal handling:
You may specify a removal listener for your cache to perform some operation when an entry is removed, via Caffeine.removalListener(RemovalListener). The RemovalListener gets passed the key, value, and RemovalCause.
Removal listener operations are executed asynchronously using an Executor. The default executor is ForkJoinPool.commonPool() and can be overridden via Caffeine.executor(Executor). When the operation must be performed synchronously with the removal, use CacheWriter instead.
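A minimal sketch with Caffeine (the single-thread executor is an arbitrary choice; the listener receives the key, value, and RemovalCause directly):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalListener;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CaffeineRemovalExample {
    public static void main(String[] args) {
        RemovalListener<String, String> listener = (key, value, cause) ->
                System.out.println("removed " + key + " because " + cause);

        Cache<String, String> cache = Caffeine.newBuilder()
                .expireAfterWrite(2, TimeUnit.SECONDS)
                .executor(Executors.newSingleThreadExecutor()) // runs removal listeners
                .removalListener(listener)
                .build();

        cache.put("k", "v");
    }
}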
I'm adding caching functionality to one of the DoFns inside a Dataflow pipeline in Java. The DoFn currently uses a REST client to send a request to an endpoint (which charges per request, and whose response changes roughly every hour) for every input element. What I want is to cache the response from the endpoint and have each entry expire after 15 minutes.
I found two ways to do this. One, from a similar Stack Overflow question, suggested using a static variable to hold a cache service (I used Guava for caching). However, I wasn't sure how to update the expiry from outside of the DoFn.
Another approach I found is to use stateful processing to store a hash that keeps track of the requests and responses, and a TimerSpec to clear the "cache" every 15 minutes. Although it appears that there is no way to set a timer for each element in the cache.
I haven't tried the second approach yet. While I'm going to implement it, I wonder if someone has run into similar situations and has any suggestions or better approaches.
However, I wasn't sure how to update the expiry from outside of the DoFn.
Do you need to? The DoFn does a lookup first; if the entry has already expired, it issues a request and updates the cache. Cache reads and writes need to be thread-safe.
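A sketch of that static-cache approach (assuming Beam's Java SDK; EnrichFn and callEndpoint are hypothetical names, and the REST call is a placeholder):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.transforms.DoFn;

public class EnrichFn extends DoFn<String, String> {

    // One cache per worker JVM, shared by all instances and threads of this
    // DoFn; Guava's LoadingCache is thread-safe, so no extra locking is needed.
    private static final LoadingCache<String, String> CACHE =
            CacheBuilder.newBuilder()
                    .expireAfterWrite(15, TimeUnit.MINUTES)
                    .build(new CacheLoader<String, String>() {
                        @Override
                        public String load(String key) {
                            return callEndpoint(key);
                        }
                    });

    @ProcessElement
    public void processElement(@Element String key, OutputReceiver<String> out)
            throws Exception {
        // Expired entries are reloaded on demand by the CacheLoader above.
        out.output(CACHE.get(key));
    }

    private static String callEndpoint(String key) {
        return "response-for-" + key; // placeholder for the real REST call
    }
}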
Although it appears that there is no way to set a timer for each element in the cache.
You can probably set a timer that scans the entire cache every X minutes and refreshes expired entries. However, if your state is not keyed, a global state like this will limit the parallelism of your pipeline.
My use case is that I need to implement a cache on top of a service that should expire entries after a certain amount of time (from their time of creation).
When an entry expires, a service lookup should be done to get the latest value; let's call this a service refresh.
But if the service refresh fails, I should be able to fall back to the stale data in the cache.
Since the entry has already expired, though, I no longer have it in the cache.
So I am thinking of controlling the expiration myself: a cache entry would only be expired if the service is available to deliver the latest data; otherwise the entry is kept.
I was looking into Google Guava cache, but it only provides a removalListener, which merely notifies me of the event; I cannot control the expiration through it.
Is there any third-party cache implementation which can serve my purpose?
This kind of resilience and robustness semantics is implemented in cache2k; we have been using it in production for quite some time. The setup looks like:
// org.cache2k.CacheBuilder (cache2k 0.x API), not Guava's CacheBuilder
Cache<Key, Value> cache = CacheBuilder.newCache(Key.class, Value.class)
    .name("myCache")
    .source(new YourSourceImplementation())
    .backgroundRefresh(true)
    .suppressExceptions(true)
    .expiryDuration(60, TimeUnit.SECONDS)
    .build();
With exceptionExpiryDuration you can set a shorter interval for the retry. There was a similar question on SO which I also answered; see: Is it possible to configure Guava Cache (or other library) behaviour to be: If time to reload, return previous entry, reload in background (see specs). There you will find some more details about it.
Regardless of what cache you use, you will run into a lot of issues, since exception handling in general, and building robust and resilient applications in particular, needs some thought about the details.
That said, I am not totally happy with the solution yet, since I think we need more control, e.g. over how long stale data should be served. Furthermore, the cache needs some alerting if there is stale data in it. I put some thoughts on how to improve this here: https://github.com/cache2k/cache2k/issues/26
Feedback is very welcome.
We are using a Guava LoadingCache which is built with a CacheLoader.
What we are looking for is a cache which refreshes its content regularly, but also expires keys after a given (longer) timeframe if the key is not accessed anymore.
Is it possible to use .refreshAfterWrite(30, TimeUnit.SECONDS) and also .expireAfterAccess(10, TimeUnit.MINUTES) on the same cache?
My experience is that the keys are never evicted because of the regular reload through refreshAfterWrite. The documentation leaves me a little uncertain about this point.
This should behave as you desire. From the CacheBuilder docs:
Currently automatic refreshes are performed when the first stale request for an entry occurs. The request triggering refresh will make a blocking call to CacheLoader.reload(K, V) and immediately return the new value if the returned future is complete, and the old value otherwise.
So if a key is queried 30 seconds after its last write, it will be refreshed; if it is not queried for 10 minutes after its last access, it will become eligible for expiration without being refreshed in the meantime.
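In other words, the two settings combine like this (durations taken from the question; lookup is a placeholder):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class RefreshPlusExpiry {
    static final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
            .refreshAfterWrite(30, TimeUnit.SECONDS) // refresh on first stale read
            .expireAfterAccess(10, TimeUnit.MINUTES) // evict if untouched this long
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) {
                    return lookup(key);
                }
            });

    // Refreshing only happens on access, so a key nobody reads is never
    // refreshed and simply expires 10 minutes after its last access.
    private static String lookup(String key) {
        return "value-for-" + key; // placeholder for the real lookup
    }
}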