I want to use Caffeine for caching, and I need write-behind behaviour: I want to limit how often I write to the database. The documentation speaks of a write-back cache, so it should be possible, but there is no example of how to configure it. I have implemented a CacheWriter, but I don't understand how to configure it so that, for example, the writer is only called once every 10 seconds (if something in the cache changed, of course).
CacheWriter is an extension point, and the documentation describes the use cases where it may make sense. Those cases are beyond the scope of the library itself; if they were built in, the result would likely be too rigid.
The writer is called atomically during a normal write operation (but not a computation). This ensures that a sequential order of changes is observed for a given key. For write-behind the writer would add the entry into a queue that is processed asynchronously, e.g. to batch the operations.
When implementing this capability, you might want to consider the following (a rough sketch follows the list):
Coalescing the updates (e.g. collect into a LinkedHashMap)
Flushing a batch early, ahead of the periodic write-behind, if the buffer exceeds a threshold size
Loading from the write-behind buffer if the operations have not yet been flushed (This avoids an inconsistent view, e.g. due to eviction)
Handling retries, rate limiting, and striping depending on the characteristics of the external resource
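For illustration, here is a minimal sketch of a coalescing write-behind writer, assuming Caffeine 2.x's CacheWriter interface; the class name and flushToDatabase are made up, and a real implementation would also deal with retries and failure handling:

    import com.github.benmanes.caffeine.cache.CacheWriter;
    import com.github.benmanes.caffeine.cache.RemovalCause;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch only: coalesce writes per key and flush them on a fixed schedule.
    public final class WriteBehindCacheWriter<K, V> implements CacheWriter<K, V> {
      private final Map<K, V> buffer = new LinkedHashMap<>();
      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      public WriteBehindCacheWriter(long interval, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::flush, interval, interval, unit);
      }

      @Override public synchronized void write(K key, V value) {
        // Coalescing: only the latest value per key survives until the next flush.
        buffer.put(key, value);
      }

      @Override public synchronized void delete(K key, V value, RemovalCause cause) {
        // A real implementation would decide how removals and evictions map to
        // database operations; they are ignored here.
      }

      private void flush() {
        Map<K, V> batch;
        synchronized (this) {
          if (buffer.isEmpty()) {
            return;
          }
          batch = new LinkedHashMap<>(buffer);
          buffer.clear();
        }
        flushToDatabase(batch); // placeholder for your persistence code
      }

      private void flushToDatabase(Map<K, V> batch) {
        // write the batch to the external store
      }
    }

It could then be registered when building the cache, with something like Caffeine.newBuilder().writer(new WriteBehindCacheWriter<>(10, TimeUnit.SECONDS)).build().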
Update:
Wim Deblauwe provided a nice example using RxJava.
I use CacheBuilder with expireAfterWrite(2000, TimeUnit.MILLISECONDS). I send 10,000 requests to my program and I expect the RemovalListener to be called 10,000 times, about 2 seconds after each write. I do not observe this behaviour; instead the RemovalListener is called only 1 or 2 times.
Can someone please explain what CacheBuilder is doing, because as described above it behaves quite differently from what the Guava documentation suggests.
In the same spirit, when I use maximumSize(1000) and send my program 10,000 requests, I expect the RemovalListener to be called 9,000 times. But it is called only 1 or 2 times.
How does this module actually work?
EDIT
I now explicitly call cleanUp() each time I receive a request.
The removal behavior is documented and works as expected (emphasis mine):
When Does Cleanup Happen?
Caches built with CacheBuilder do not perform cleanup and evict values "automatically," or instantly after a value expires, or anything of the sort. Instead, it performs small amounts of maintenance during write operations, or during occasional read operations if writes are rare.
The reason for this is as follows: if we wanted to perform Cache maintenance continuously, we would need to create a thread, and its operations would be competing with user operations for shared locks. Additionally, some environments restrict the creation of threads, which would make CacheBuilder unusable in that environment.
Instead, we put the choice in your hands. If your cache is high-throughput, then you don't have to worry about performing cache maintenance to clean up expired entries and the like. If your cache does writes only rarely and you don't want cleanup to block cache reads, you may wish to create your own maintenance thread that calls Cache.cleanUp() at regular intervals.
If you want more control over the cache and a dedicated executor to take care of calling RemovalListeners, use Caffeine -- a high performance, near optimal caching library based on Java 8 -- which has an API similar to Guava's Cache (same author). Caffeine has more advanced removal handling:
You may specify a removal listener for your cache to perform some operation when an entry is removed, via Caffeine.removalListener(RemovalListener). The RemovalListener gets passed the key, value, and RemovalCause.
Removal listener operations are executed asynchronously using an Executor. The default executor is ForkJoinPool.commonPool() and can be overridden via Caffeine.executor(Executor). When the operation must be performed synchronously with the removal, use CacheWriter instead.
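For illustration, a minimal sketch in the spirit of the question's setup (expired entries are still only discovered during maintenance, but the listener itself runs on the executor):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.RemovalCause;
    import java.util.concurrent.TimeUnit;

    Cache<String, Long> cache = Caffeine.newBuilder()
        .expireAfterWrite(2000, TimeUnit.MILLISECONDS)
        .maximumSize(1000)
        .removalListener((String key, Long value, RemovalCause cause) ->
            System.out.println(key + " removed: " + cause))
        .build();

If prompt expiration on an idle cache matters, newer Caffeine versions also let you supply a scheduler via Caffeine.scheduler(Scheduler.systemScheduler()).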
I'm working on an ultra low latency and high performance application.
The core is single-threaded, so I don't need to worry about concurrency there.
I'm developing a scheduled logging function which logs messages periodically, to prevent the same messages from flooding the log.
So the log class contains a ConcurrentHashMap; one thread updates it (putting new keys or updating existing values), while another thread periodically loops through the map to log all the messages.
My concern is that logging while looping through the map may take time; would that block the thread trying to update the map? Any blocking is unacceptable, since our application core is single-threaded.
And is there any other data structure other than ConcurrentHashMap I can use to reduce the memory footprint?
Is there a thread-safe way to iterate the map without blocking in the read-only case? Even if the iterated data is stale, that is still acceptable.
The Java API docs say:
[...] even though all operations are thread-safe, retrieval operations do not entail locking [...]
Moreover, the entrySet() method documentation tells us that:
The [returned] set is backed by the map, so changes to the map are reflected in the set, and vice-versa.
That means the map can be modified while it is being iterated over; iteration does not lock the whole map. ConcurrentHashMap's iterators are weakly consistent: they never throw ConcurrentModificationException, and they may or may not reflect updates made after the iterator was created.
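A minimal sketch of that, with made-up message names (in the real application the updater and the logger would run on different threads):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class LogSnapshot {
      public static void main(String[] args) {
        ConcurrentHashMap<String, Long> pending = new ConcurrentHashMap<>();

        // Updater side: record or bump a message count without waiting on the logger.
        pending.merge("order rejected", 1L, Long::sum);

        // Logger side: the entry-set iterator is weakly consistent, so it never
        // throws ConcurrentModificationException and never blocks writers; it may
        // simply miss entries added after the iteration started.
        for (Map.Entry<String, Long> entry : pending.entrySet()) {
          System.out.println(entry.getKey() + " x" + entry.getValue());
        }
      }
    }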
There may be other structures that would allow you to reduce the memory footprint, reduce latency, and perhaps offer a more uniform & consistent performance profile.
One of these is a worker pattern where your main worker publishes to a nonblocking queue and the logger pulls from that queue. This should decouple the two processes, allow for multiple concurrent publishers and allow you to scale out loggers.
One possible data structure for this is ConcurrentLinkedQueue. I have very little experience with Java, so I'm not sure how its performance profile differs from ConcurrentHashMap's, but this is a very common pattern in distributed systems and in Go.
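A rough sketch of that worker pattern in Java, with made-up names (in the real application the publisher and the logger would run on separate threads):

    import java.util.concurrent.ConcurrentLinkedQueue;

    public class LogWorker {
      public static void main(String[] args) {
        ConcurrentLinkedQueue<String> logQueue = new ConcurrentLinkedQueue<>();

        // Publisher side: offer() is lock-free and never blocks the core thread.
        logQueue.offer("order 42 rejected");

        // Logger side: drain whatever is currently queued; poll() returns null
        // when the queue is empty instead of blocking.
        String message;
        while ((message = logQueue.poll()) != null) {
          System.out.println(message);
        }
      }
    }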
I have a datastream in which the order of the events is important. The time characteristic is set to EventTime as the incoming records have a timestamp within them.
In order to guarantee the ordering, I set the parallelism for the program to 1. Could that become a problem, performance wise, when my program gets more complex?
If I understand correctly, I need to assign watermarks to my events if I want to keep the stream ordered by timestamp. This is quite simple. But I'm reading that even that doesn't guarantee order. Later on, I want to do stateful computations over that stream, so I use a FlatMap function, which needs the stream to be keyed. But if I key the stream, the order is lost again. AFAIK this is because of the different stream partitions, which are "caused" by parallelism.
I have two questions:
Do I need parallelism? What factors do I need to consider here?
How would I achieve "ordered parallelism" with what I described above?
Several points to consider:
Setting the parallelism to 1 for the entire job will prevent scaling your application, which will affect performance. Whether this actually matters depends on your application requirements, but it would certainly be a limitation, and could be a problem.
If the aggregates you've mentioned are meant to be computed globally across all the event records then operating in parallel will require doing some pre-aggregation in parallel. But in this case you will then have to reduce the parallelism to 1 in the later stages of your job graph in order to produce the ultimate (global) results.
If on the other hand these aggregates are to be computed independently for each value of some key, then it makes sense to consider keying the stream and to use that partitioning as the basis for operating in parallel.
All of the operations you mention require some state, whether computing max, min, averages, or uptime and downtime. For example, you can't compute the maximum without remembering the maximum encountered so far.
If I understand correctly how Flink's NiFi source connector works, then if the source is operating in parallel, keying the stream will result in out-of-order events.
However, none of the operations you've mentioned require that the data be delivered in-order. Computing uptime (and downtime) on an out-of-order stream will require some buffering -- these operations will need to wait for out-of-order data to arrive before they can produce results -- but that's certainly doable. That's exactly what watermarks are for; they define how long to wait for out-of-order data. You can use an event-time timer in a ProcessFunction to arrange for an onTimer callback to be called when all earlier events have been processed.
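As an illustration, here is a rough sketch of that timer-based buffering in a KeyedProcessFunction. Event and its getTimestamp() accessor are hypothetical stand-ins for your record type, and events sharing a timestamp within the same key would need extra handling:

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Buffer events per key until the watermark passes their timestamp, then emit them.
    public class BufferingFunction extends KeyedProcessFunction<String, Event, Event> {

      private transient MapState<Long, Event> buffer;

      @Override
      public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("buffer", Long.class, Event.class));
      }

      @Override
      public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        long ts = event.getTimestamp();
        buffer.put(ts, event);
        // onTimer fires once the watermark reaches ts, i.e. when no earlier
        // events are expected any more (within the watermark's bound).
        ctx.timerService().registerEventTimeTimer(ts);
      }

      @Override
      public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        Event event = buffer.get(timestamp);
        if (event != null) {
          out.collect(event);
          buffer.remove(timestamp);
        }
      }
    }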
You could always sort the keyed stream. Here's an example.
The uptime/downtime calculation should be easy to do with Flink's CEP library (which sorts its input, btw).
UPDATE:
It is true that after applying a ProcessFunction to a keyed stream the stream is no longer keyed. But in this case you could safely use reinterpretAsKeyedStream to inform Flink that the stream is still keyed.
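A hedged example, assuming the stream was originally keyed by a hypothetical Event::getDeviceId and that the ProcessFunction did not change the partitioning (reinterpretAsKeyedStream is an experimental API, and using it on a stream that is not actually partitioned that way produces incorrect results):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.DataStreamUtils;
    import org.apache.flink.streaming.api.datastream.KeyedStream;

    // "processed" is the output of the ProcessFunction applied to a stream that was
    // keyed by Event::getDeviceId; the per-key partitioning has not been changed.
    KeyedStream<Event, String> reinterpret(DataStream<Event> processed) {
      return DataStreamUtils.reinterpretAsKeyedStream(processed, Event::getDeviceId);
    }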
As for CEP, this library uses state on your behalf, making it easier to develop applications that need to react to patterns.
The main benefit of reactive programming is that it is fault-tolerant and can process many more events than a blocking implementation, even though each individual item is usually processed more slowly.
What I don't fully understand is how and where the events are stored. I know there is an event buffer and it can be tweaked, but that buffer can easily exhaust memory if the queue is unbounded, can't it? Can this buffer flush onto disk? Isn't it a risk to have it in memory? Can it be configured similarly to Lagom event sourcing or persistent Akka actors, where events can be stored in a database?
The short answer is no, this buffer cannot be persisted -- at least not in the reference implementation.
The internal in-memory buffer can hold up to 128 emitted values by default, but there are some caveats. First of all, there is backpressure: the situation where the source emits items faster than the observer or operator consumes them. When this internal buffer overflows you get a MissingBackpressureException, and there is no disk or other mechanism to persist the overflow. However, you can tweak the behaviour, for instance keep only the latest emission or simply drop new ones. There are special operators for that: onBackpressureBuffer, onBackpressureDrop, onBackpressureLatest.
RxJava 2 introduces a new type, Flowable, which supports backpressure by default and gives more ways to tweak the internal buffer.
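For instance, a small RxJava 2 sketch that makes the overflow policy explicit instead of letting the stream fail with MissingBackpressureException (the timings are arbitrary):

    import io.reactivex.Flowable;
    import io.reactivex.schedulers.Schedulers;
    import java.util.concurrent.TimeUnit;

    public class BackpressureDemo {
      public static void main(String[] args) throws InterruptedException {
        Flowable.interval(1, TimeUnit.MILLISECONDS)      // fast producer
            .onBackpressureDrop(n -> System.out.println("dropped " + n))
            .observeOn(Schedulers.single())
            .subscribe(n -> {
              Thread.sleep(100);                         // slow consumer
              System.out.println("handled " + n);
            });

        Thread.sleep(3_000);                             // keep the demo alive
      }
    }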
Rx is a way to process data streams; you should consider whether you can consume all the items, and how to store them if you can't.
One of the main advantages of RxJava is its contract, and there are ways to create your own operators or to use extensions such as rxjava-extras.
Say I have a Cache that is defined like this:
private static Cache<String, Long> alertsUIDCache = CacheBuilder.newBuilder().
        expireAfterAccess(60, TimeUnit.SECONDS).build();
From what I read (please correct me if I am wrong):
If a value is written to the cache at 0:00, it should be moved to "ready to be evicted" status after 60 seconds. The actual removal of the value from the cache will happen at the next cache modification (what exactly is a cache modification?). Is that right?
Also, I am not sure what the difference is between the invalidateAll() and cleanUp() methods; can someone provide an explanation?
First part, from this link: How does Guava expire entries in its CacheBuilder?
I'm going to focus on expireAfterAccess, but the procedure for expireAfterWrite is almost identical. In terms of the mechanics, when you specify expireAfterAccess in the CacheBuilder, then each segment of the cache maintains a linked list access queue for entries in order from least-recent-access to most-recent-access. The cache entries are actually themselves nodes in the linked list, so when an entry is accessed, it removes itself from its old position in the access queue, and moves itself to the end of the queue.
Second part, from this link: Guava CacheLoader - invalidate does not immediately invalidate entry if both expireAfterWrite and expireAfterAccess are set:
invalidate should remove the entry immediately -- not waiting for another query -- and should force the value to get reloaded on the very next query to that key.
cleanUp: Performs any pending maintenance operations needed by the cache. Exactly which activities are performed -- if any -- is implementation-dependent.
From the Guava documentation: https://github.com/google/guava/wiki/CachesExplained
Explicit Removals
At any time, you may explicitly invalidate cache entries rather than waiting for entries to be evicted. This can be done:
individually, using Cache.invalidate(key)
in bulk, using Cache.invalidateAll(keys)
to all entries, using Cache.invalidateAll()
When Does Cleanup Happen?
Caches built with CacheBuilder do not perform cleanup and evict values "automatically," or instantly after a value expires, or anything of the sort. Instead, it performs small amounts of maintenance during write operations, or during occasional read operations if writes are rare.
The reason for this is as follows: if we wanted to perform Cache maintenance continuously, we would need to create a thread, and its operations would be competing with user operations for shared locks. Additionally, some environments restrict the creation of threads, which would make CacheBuilder unusable in that environment.
Instead, we put the choice in your hands. If your cache is high-throughput, then you don't have to worry about performing cache maintenance to clean up expired entries and the like. If your cache does writes only rarely and you don't want cleanup to block cache reads, you may wish to create your own maintenance thread that calls Cache.cleanUp() at regular intervals.
If you want to schedule regular cache maintenance for a cache which only rarely has writes, just schedule the maintenance using ScheduledExecutorService.
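For example, a sketch reusing the alertsUIDCache from the question above (the one-second period is arbitrary):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class CacheMaintenance {
      private static final Cache<String, Long> alertsUIDCache = CacheBuilder.newBuilder()
          .expireAfterAccess(60, TimeUnit.SECONDS)
          .build();

      public static void main(String[] args) {
        // Run the cache's pending maintenance once per second so that expired
        // entries are evicted (and removal listeners fired) even when the cache
        // sees few reads or writes.
        ScheduledExecutorService maintenance = Executors.newSingleThreadScheduledExecutor();
        maintenance.scheduleAtFixedRate(alertsUIDCache::cleanUp, 1, 1, TimeUnit.SECONDS);
      }
    }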