We have a simple but heavily used cache, implemented with a ConcurrentHashMap. Now we want to refresh all values at regular intervals (say, every 15 minutes).
I would like code like this:
private void regularCacheCleanup() {
final long now = System.currentTimeMillis();
final long delta = now - cacheCleanupLastTime;
if (delta < 0 || delta > 15 * 60 * 1000) {
cacheCleanupLastTime = now;
clearCache();
}
}
Except it should be:
Thread safe
Non-blocking and extremely performant if the cache isn't going to be cleared
No dependencies except on java.* classes (so no Google CacheBuilder)
Rock-solid ;-)
Can't start new threads
Right now I'm thinking of implementing a short timer in a ThreadLocal. When it expires, the real timer will be checked in a synchronized way. That's an awful lot of code, however, so a simpler idea would be nice.
The mainstream way to tackle this issue would be to use a timer thread to refresh your cache at the specified interval. However, since you can't create new threads, a possible implementation I can think of is a pseudo-timed cache refresh. Basically, I would insert checks into the cache accessors (the put and get methods): each time clients use these methods, I would check whether the cache needs to be refreshed before performing the put or get action. This is the rough idea:
class YourCache {
// holds the last time the cache has been refreshed in millis
private volatile long lastRefreshDate;
// indicates that cache is currently refreshing entries
private volatile boolean cacheCurrentlyRefreshing;
// refresh interval in millis (e.g. 15 minutes)
private static final long REFRESH_INTERVAL = 15 * 60 * 1000L;
private Map cache = // Your concurrent map cache...
public void put(Object key, Object element) {
if (cacheNeedsRefresh()) {
refresh();
}
cache.put(key, element);
}
public Object get(Object key) {
if (cacheNeedsRefresh()) {
refresh();
}
return cache.get(key);
}
private boolean cacheNeedsRefresh() {
// make sure that cache is not currently being refreshed by some
// other thread.
if (cacheCurrentlyRefreshing) {
return false;
}
return (System.currentTimeMillis() - lastRefreshDate) >= REFRESH_INTERVAL;
}
private void refresh() {
// make sure the cache did not start refreshing between cacheNeedsRefresh()
// and refresh() by some other thread.
if (cacheCurrentlyRefreshing) {
return;
}
// signal to other threads that cache is currently being refreshed.
cacheCurrentlyRefreshing = true;
try {
// refresh your cache contents here
} finally {
// set the lastRefreshDate and signal that cache has finished
// refreshing to other threads.
lastRefreshDate = System.currentTimeMillis();
cacheCurrentlyRefreshing = false;
}
}
}
Personally I wouldn't do it like this, but if you don't want to or can't create timer threads then this could be an option for you.
Note that although this implementation avoids locks, it is still prone to duplicate refreshes due to races. If that is acceptable for your requirements, it should be no problem. If you have stricter requirements, however, you need to add locking to properly synchronise the threads and rule out those races.
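A lock-free middle ground is to elect exactly one refreshing thread with a compare-and-set on the timestamp. This is only a minimal sketch of that idea (the class and names are illustrative, not from the code above); the common path costs a single volatile read:

import java.util.concurrent.atomic.AtomicLong;

class RefreshGate {
    private static final long REFRESH_INTERVAL = 15 * 60 * 1000L;
    private final AtomicLong lastRefresh = new AtomicLong(System.currentTimeMillis());

    // Returns true for exactly one caller per interval; all other
    // callers see a fresh timestamp or lose the CAS race.
    boolean shouldRefresh() {
        long last = lastRefresh.get();
        long now = System.currentTimeMillis();
        return (now - last >= REFRESH_INTERVAL)
                && lastRefresh.compareAndSet(last, now);
    }
}

Calling shouldRefresh() from put and get would then replace the cacheNeedsRefresh()/cacheCurrentlyRefreshing pair.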
Related
I have a scenario where I have to maintain a Map that can be populated by multiple threads, each modifying its respective List (the unique identifier/key being the thread name). When the list size for a thread exceeds a fixed batch size, we have to persist the records to the database.
Aggregator class
private volatile ConcurrentHashMap<String, List<T>> instrumentMap = new ConcurrentHashMap<String, List<T>>();
private final ReentrantLock lock = new ReentrantLock();
public void addAll(List<T> entityList, String threadName) {
lock.lock();
try {
List<T> instrumentList = instrumentMap.get(threadName);
if(instrumentList == null) {
instrumentList = new ArrayList<T>(batchSize);
instrumentMap.put(threadName, instrumentList);
}
if (instrumentList.size() >= batchSize - 1) {
instrumentList.addAll(entityList);
recordSaver.persist(instrumentList);
instrumentList.clear();
} else {
instrumentList.addAll(entityList);
}
} finally {
lock.unlock();
}
}
There is one more separate thread, running every 2 minutes (using the same lock), that persists all the records in the Map (to make sure we have something persisted every 2 minutes and the map size does not get too big):
if (/* some condition */) {
    Thread.sleep(2 * 60 * 1000L); // 2 minutes
    aggregator.getLock().lock();
    try {
        List<T> instrumentList = instrumentMap.values().stream().flatMap(x -> x.stream()).collect(Collectors.toList());
        if (instrumentList.size() > 0) {
            saver.persist(instrumentList);
            instrumentMap.values().parallelStream().forEach(x -> x.clear());
        }
    } finally {
        aggregator.getLock().unlock();
    }
}
This solution is working fine in almost every scenario we tested, except that sometimes some of the records go missing, i.e. they are not persisted at all, although they were added fine to the Map.
My questions are:
What is the problem with this code?
Is ConcurrentHashMap not the best solution here?
Does the List that is used with the ConcurrentHashMap have an issue?
Should I use the compute method of ConcurrentHashMap here (no need, I think, as the ReentrantLock is already doing the same job)?
The answer provided by @Slaw in the comments did the trick. We were letting the instrumentList instance escape in a non-synchronized way, i.e. access/operations were happening on the list without any synchronization. Passing a copy to further methods fixed it.
The following lines of code are where the issue was happening:
recordSaver.persist(instrumentList);
instrumentList.clear();
Here we are allowing the instrumentList instance to escape in a non-synchronized way, i.e. it is passed to another class (recordSaver.persist) where it is acted on, but we are also clearing the list on the very next line (in the Aggregator class), and all of this happens without synchronization. The list's state can't be predicted in the record saver... a really stupid mistake.
We fixed the issue by passing a cloned copy of instrumentList to the recordSaver.persist(...) method. This way instrumentList.clear() has no effect on the list available in recordSaver for further operations.
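In code, the fix amounts to handing recordSaver a snapshot instead of the live list (a sketch using the same names as above):

// pass a defensive copy so clearing the live list cannot race with
// whatever recordSaver still does with the list it received
recordSaver.persist(new ArrayList<>(instrumentList));
instrumentList.clear();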
I see that you are using ConcurrentHashMap's parallelStream within a lock. I am not knowledgeable about Java 8+ stream support, but quick searching shows that:
ConcurrentHashMap is a complex data structure that used to have concurrency bugs in the past
Parallel streams must abide by complex and poorly documented usage restrictions
You are modifying your data within a parallel stream
Based on that information (and my gut-driven concurrency bugs detector™), I wager a guess that removing the call to parallelStream might improve the robustness of your code. In addition, as mentioned by @Slaw, you should use an ordinary HashMap in place of ConcurrentHashMap if all instrumentMap usage is already guarded by the lock.
Of course, since you don't post the code of recordSaver, it is possible that it too has bugs (and not necessarily concurrency-related ones). In particular, you should make sure that the code that reads records from persistent storage (the one you are using to detect the loss of records) is safe, correct, and properly synchronized with the rest of your system (preferably by using a robust, industry-standard SQL database).
It looks like this was an attempt at optimization where it was not needed. In that case, less is more and simpler is better. In the code below, only two concepts for concurrency are used: synchronized to ensure a shared list is properly updated and final to ensure all threads see the same value.
import java.util.ArrayList;
import java.util.List;
public class Aggregator<T> implements Runnable {
private final List<T> instruments = new ArrayList<>();
private final RecordSaver recordSaver;
private final int batchSize;
public Aggregator(RecordSaver recordSaver, int batchSize) {
super();
this.recordSaver = recordSaver;
this.batchSize = batchSize;
}
public synchronized void addAll(List<T> moreInstruments) {
instruments.addAll(moreInstruments);
if (instruments.size() >= batchSize) {
storeInstruments();
}
}
public synchronized void storeInstruments() {
if (instruments.size() > 0) {
// in case recordSaver works async
// recordSaver.persist(new ArrayList<T>(instruments));
// else just:
recordSaver.persist(instruments);
instruments.clear();
}
}
@Override
public void run() {
while (true) {
try { Thread.sleep(1L); } catch (Exception ignored) {
break;
}
storeInstruments();
}
}
class RecordSaver {
void persist(List<?> l) {}
}
}
I am building a backend service whereby a REST call to my service creates a new thread. The thread waits for another REST call; if it does not receive anything within, say, 5 minutes, the thread will die.
To keep track of all the threads I have a collection that holds the currently running threads, so that when the REST call finally comes in (such as a user accepting or declining an action) I can identify the thread by its userID. If it's declined we just remove that thread from the collection; if it's accepted the thread can carry on with the next action. I have implemented this using a ConcurrentMap to avoid concurrency issues.
Since this is my first time working with threads I want to make sure that I am not overlooking any issues that may arise. Please have a look at my code and tell me if I could do it better or if there are any flaws.
public class UserAction extends Thread {
int userID;
boolean isAccepted = false;
boolean isDeclined = false;
long timeNow = System.currentTimeMillis();
long timeElapsed = timeNow + 50000;
public UserAction(int userID) {
this.userID = userID;
}
public void declineJob() {
this.isDeclined = true;
}
public void acceptJob() {
this.isAccepted = true;
}
public boolean waitForApproval(){
while (System.currentTimeMillis() < timeElapsed){
System.out.println("waiting for approval");
if (isAccepted) {
return true;
} else if (isDeclined) {
return false;
}
}
return isAccepted;
}
@Override
public void run() {
if (!waitForApproval()) {
// must've timed out or user declined so remove from list and return thread immediately
Controller.tCollection.remove(userID);
// end the thread here
return;
}
// must've been accepted so continue working
}
}
public class Controller {
public static ConcurrentHashMap<Integer, UserAction> tCollection = new ConcurrentHashMap<>();
public static void main(String[] args) throws InterruptedException {
int barberID1 = 1;
int barberID2 = 2;
tCollection.put(barberID1, new UserAction(barberID1));
tCollection.put(barberID2, new UserAction(barberID2));
tCollection.get(barberID1).start();
tCollection.get(barberID2).start();
Thread.sleep(1000);
// simulate REST call accepting/declining job after 1 second. Usually this would be in a spring mvc RESTcontroller in a different class.
tCollection.get(barberID1).acceptJob();
tCollection.get(barberID2).declineJob();
}
}
You don't need (explicit) threads for this. Just a shared pool of task objects that are created on the first rest call.
When the second rest call comes, you already have a thread to use (the one that's handling the rest call). You just need to retrieve the task object according to the user id. You also need to get rid of expired tasks, which can be done with for example a DelayQueue.
Pseudocode:
public void rest1(User u) {
UserTask ut = new UserTask(u);
pool.put(u.getId(), ut);
delayPool.put(ut); // Assuming UserTask implements Delayed with a 5 minute delay
}
public void rest2(User u, Action a) {
UserTask ut = pool.get(u.getId());
if(!a.isAccepted() || ut == null)
pool.remove(u.getId());
else
process(ut);
// Clean up the pool from any expired tasks, can also be done in the beginning
// of the method, if you want to make sure that expired actions aren't performed
UserTask expired;
while ((expired = delayPool.poll()) != null)
pool.remove(expired.getId());
}
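For completeness, a sketch of what the assumed UserTask implements Delayed could look like (the 5-minute TTL and the getId() accessor are assumptions carried over from the pseudocode above):

import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

class UserTask implements Delayed {
    private static final long TTL_MILLIS = 5 * 60 * 1000L;
    private final int userId;
    private final long expiresAt = System.currentTimeMillis() + TTL_MILLIS;

    UserTask(int userId) { this.userId = userId; }

    int getId() { return userId; }

    @Override
    public long getDelay(TimeUnit unit) {
        // remaining time until this task expires, in the requested unit
        return unit.convert(expiresAt - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}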
There's a synchronization issue: you should make your flags isAccepted and isDeclined of type AtomicBoolean.
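A minimal sketch of just the flag handling (the rest of UserAction would stay as in the question):

import java.util.concurrent.atomic.AtomicBoolean;

public class UserAction extends Thread {
    private final AtomicBoolean isAccepted = new AtomicBoolean(false);
    private final AtomicBoolean isDeclined = new AtomicBoolean(false);

    public void acceptJob()  { isAccepted.set(true); }
    public void declineJob() { isDeclined.set(true); }

    // waitForApproval() then reads isAccepted.get() / isDeclined.get()
}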
A critical concept is that you need to take steps to make sure changes to memory in one thread are communicated to other threads that need that data. Those mechanisms are called memory fences, and they often occur implicitly between synchronization calls.
The idea of a (simple) von Neumann architecture with a 'central memory' is false for most modern machines; you need to know that data is being shared between caches/threads correctly.
Also, as others suggest, creating a thread for each task is a poor model. It scales badly and leaves your application vulnerable to keeling over if too many tasks are submitted. Memory is limited, so you can only have so many pending tasks at a time, but the ceiling for threads will be much lower.
That is made all the worse because you're spin-waiting. Spin-waiting puts a thread into a loop that polls for a condition. A better model would wait on a condition variable, so threads that aren't doing anything (other than waiting) can be suspended by the operating system until notified that the thing they're waiting for is (or may be) ready.
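As a sketch, the spin loop could be replaced with plain wait/notify on a small gate object (this is a simplified stand-in for waitForApproval, not the original code):

public class ApprovalGate {
    private Boolean accepted; // null = no answer yet

    public synchronized boolean waitForApproval(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (accepted == null) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) return false; // timed out
            wait(remaining); // suspended by the OS instead of spinning
        }
        return accepted;
    }

    public synchronized void answer(boolean isAccepted) {
        accepted = isAccepted;
        notifyAll(); // wake the waiting thread
    }
}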
There are often significant overheads in time and resources in creating and destroying threads. Given that most platforms can simultaneously execute only a relatively small number of threads, creating lots of 'expensive' threads that spend most of their time swapped out (suspended) doing nothing is very inefficient.
The right model launches a pool of a fixed number of threads (or relatively fixed number) and places tasks in a shared queue that the threads 'take' work from and process.
That model is known generically as a "Thread Pool".
The entry level implementation you should look at is ThreadPoolExecutor:
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
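A minimal sketch of that model (the task body and pool size are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolExample {
    public static void main(String[] args) {
        // a small, fixed number of threads shared by all tasks
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int userId = 1; userId <= 100; userId++) {
            final int id = userId;
            pool.submit(() -> handleUser(id)); // queued until a worker is free
        }
        pool.shutdown();
    }

    private static void handleUser(int userId) {
        // process the user's task here
    }
}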
I need to store objects in a cache, and these objects take a long time to create. I started with ConcurrentHashMap<id, Future<Object>> and everything was fine, until Out of Memory started to happen. I moved to SoftReferences and it was better, but now I need to control eviction. I'm in the process of moving to Ehcache.
I'm sure there is a library for this, but I really need to understand the logic of doing the cache storage and calculation in two phases, while keeping everything consistent and not recalculating something that is already calculated or in the process of being calculated. It's a two-level cache: one level for the more persistent results and the other for results in the process of being calculated.
Any hints on how to better the following code which I'm sure has concurrency problems in the Callable.call() method?
public class TwoLevelCache {
// cache that serializes everything except Futures
private Cache complexicos = new Cache();
private ConcurrentMap<Integer, Future<Complexico>> calculations =
new ConcurrentHashMap<Integer, Future<Complexico>>();
public Complexico get(final Integer id) {
// if in cache return it
Complexico c = complexicos.get(id);
if (c != null) { return c; }
// if in calculation wait for future
Future<Complexico> f = calculations.get(id);
if (f != null) { return f.get(); } // exceptions obviated
// if not, setup calculation
Callable<Complexico> callable = new Callable<Complexico>() {
public Complexico call() throws Exception {
Complexico complexico = compute(id);
// this might be a problem here
// but how to synchronize without
// blocking the whole structure?
complexicos.put(id, complexico);
calculations.remove(id);
return complexico;
}
};
// store calculation
FutureTask<Complexico> task = new FutureTask<Complexico>(callable);
Future<Complexico> future = calculations.putIfAbsent(id, task);
if (future == null) {
// not previously being run, so start the calculation
task.run();
return task.get(); // exceptions obviated
} else {
// there was a previous calculation, so use that
return future.get(); // exceptions obviated
}
}
private Complexico compute(final Integer id) {
// very long computation of complexico
}
}
And what do you do with the values once they are calculated?
What is the number of new calculations per second?
If they are used (stored) and then disposed of, then I think a Reactive approach (RxJava and similar) could be a nice solution. You could put your "tasks" (POJOs with all the info needed to perform the calculation) on some off-heap structure (it could be a persistent queue, etc.) and only perform as many calculations at a time as you want (throttle the process with the number of computational threads you want to have).
This way you would avoid OOM and would also gain much more control over the entire process.
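Staying within java.util.concurrent, the same throttling idea can be sketched with a bounded queue drained by a fixed number of computation threads (all names here are illustrative; RxJava would give you richer operators on top of this):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThrottledCalculator {
    // bounded: producers block instead of exhausting the heap with pending tasks
    private final BlockingQueue<Integer> pending = new ArrayBlockingQueue<>(1000);

    public ThrottledCalculator(int computationThreads) {
        for (int i = 0; i < computationThreads; i++) {
            Thread worker = new Thread(this::drain);
            worker.setDaemon(true);
            worker.start();
        }
    }

    public void enqueue(int id) throws InterruptedException {
        pending.put(id); // blocks when the queue is full
    }

    private void drain() {
        try {
            while (true) {
                int id = pending.take();
                // compute(id), store the result, then dispose of the task
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}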
The following class acts as a simple cache that is updated very infrequently (say, twice a day) and read quite a lot (up to several times a second). There are two different types, a List and a Map. My question is about the new assignment after the data is updated in the update method. What's the best (safest) way for the new data to be applied?
I should add that it isn't necessary for readers to see the absolute latest value. The requirements are just to get either the old or the new value at any given time.
public class Foo {
private ThreadPoolExecutor _executor;
private List<Object> _listObjects = new ArrayList<Object>(0);
private Map<Integer, Object> _mapObjects = new HashMap<Integer, Object>();
private Object _mutex = new Object();
private boolean _updateInProgress;
public void update() {
synchronized (_mutex) {
if (_updateInProgress) {
return;
} else {
_updateInProgress = true;
}
}
_executor.execute(new Runnable() {
@Override
public void run() {
try {
List<Object> newObjects = loadListObjectsFromDatabase();
Map<Integer, Object> newMapObjects = loadMapObjectsFromDatabase();
/*
* this is the interesting part
*/
_listObjects = newObjects;
_mapObjects = newMapObjects;
} catch (final Exception ex) {
// error handling
} finally {
synchronized (_mutex) {
_updateInProgress = false;
}
}
}
});
}
public Object getObjectById(Integer id) {
return _mapObjects.get(id);
}
public List<Object> getListObjects() {
return new ArrayList<Object>(_listObjects);
}
}
As you see, currently no ConcurrentHashMap or CopyOnWriteArrayList is used. The only synchronisation is done in the update method.
Although not necessary for my current problem, it would be also great to know the best solution for cases where it is essential for readers to always get the absolute latest value.
You could use plain synchronization unless you are reading over 10,000 times per second.
If you want concurrent access I would use one of the concurrent collections, like ConcurrentHashMap or CopyOnWriteArrayList. These are simpler to use than synchronizing the collection. (I.e. you don't need them for performance reasons; use them for simplicity.)
BTW: A modern CPU can perform billions of operations in 0.1 seconds so several times a second is an eternity to a computer.
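A minimal sketch of the plain-synchronization variant (the method split is my assumption; the fields mirror the ones in the question):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Foo {
    private List<Object> listObjects = new ArrayList<>();
    private Map<Integer, Object> mapObjects = new HashMap<>();

    // writer: swap in the freshly loaded collections together
    public synchronized void replaceAll(List<Object> newList, Map<Integer, Object> newMap) {
        listObjects = newList;
        mapObjects = newMap;
    }

    // readers: each sees either the old pair or the new pair, never a half-updated state
    public synchronized Object getObjectById(Integer id) {
        return mapObjects.get(id);
    }

    public synchronized List<Object> getListObjects() {
        return new ArrayList<>(listObjects);
    }
}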
I am also seeing this issue and can think of multiple solutions:
Use a synchronized block around both pieces of code, the one reading and the one writing.
Make a separate remove list and add all removable items to that list. Remove them in the same thread that reads the list, just after reading is done. This way reading and deleting happen in sequence and no error occurs.
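A sketch of that second idea (the list type and the removal predicate are illustrative):

import java.util.ArrayList;
import java.util.List;

public class ListCleaner {
    public void scanAndClean(List<Object> items) {
        List<Object> toRemove = new ArrayList<>();
        synchronized (items) {
            for (Object item : items) {
                if (shouldRemove(item)) {
                    toRemove.add(item); // defer removal to avoid ConcurrentModificationException
                }
            }
            items.removeAll(toRemove); // same thread, after reading is done
        }
    }

    private boolean shouldRemove(Object item) {
        return false; // placeholder predicate
    }
}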
I was recently looking for a way to implement a doubly buffered thread-safe cache for regular objects.
The need arose because we had some cached data structures that were being hit numerous times for each request and needed to be reloaded from a very large document (1s+ unmarshalling time), and we couldn't afford to let all requests be delayed that long every minute.
Since I couldn't find a good thread-safe implementation I wrote my own, and now I am wondering if it's correct and if it can be made smaller... Here it is:
package nl.trimpe.michiel;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
/**
* Abstract class implementing a double buffered cache for a single object.
*
* Implementing classes can load the object to be cached by implementing the
* {@link #retrieve()} method.
*
* @param <T>
* The type of the object to be cached.
*/
public abstract class DoublyBufferedCache<T> {
private static final Log log = LogFactory.getLog(DoublyBufferedCache.class);
private Long timeToLive;
private long lastRetrieval;
private T cachedObject;
private Object lock = new Object();
private volatile Boolean isLoading = false;
public T getCachedObject() {
checkForReload();
return cachedObject;
}
private void checkForReload() {
if (cachedObject == null || isExpired()) {
if (!isReloading()) {
synchronized (lock) {
// Recheck expiration because another thread might have
// refreshed the cache before we were allowed into the
// synchronized block.
if (isExpired()) {
isLoading = true;
try {
cachedObject = retrieve();
lastRetrieval = System.currentTimeMillis();
} catch (Exception e) {
log.error("Exception occurred retrieving cached object", e);
} finally {
isLoading = false;
}
}
}
}
}
}
protected abstract T retrieve() throws Exception;
private boolean isExpired() {
return (timeToLive > 0) ? ((System.currentTimeMillis() - lastRetrieval) > (timeToLive * 1000)) : true;
}
private boolean isReloading() {
return cachedObject != null && isLoading;
}
public void setTimeToLive(Long timeToLive) {
this.timeToLive = timeToLive;
}
}
What you've written isn't threadsafe. In fact, you've stumbled onto a common fallacy that is quite a famous problem: the double-checked locking problem. Many solutions like yours (and there are several variations on this theme) all have issues.
There are a few potential solutions to this, but imho the easiest is simply to use a ScheduledExecutorService and reload what you need every minute or however often you need to. When you reload, put the result into the cache, and calls for it just return the latest version. This is threadsafe and easy to implement. Sure, it's not loaded on demand but, apart from the initial value, you'll never take a performance hit while you retrieve the value. I'd call this over-eager loading rather than lazy loading.
For example:
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Cache<T> {
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    private final Callable<T> method;
    private final Runnable refresh;
    // volatile so readers always see the most recently published Future
    private volatile Future<T> result;
    private final long ttl;
public Cache(Callable<T> method, long ttl) {
if (method == null) {
throw new NullPointerException("method cannot be null");
}
if (ttl <= 0) {
throw new IllegalArgumentException("ttl must be positive");
}
this.method = method;
this.ttl = ttl;
// initial hits may result in a delay until we've loaded
// the result once, after which there will never be another
// delay because we will only refresh with complete results
result = executor.submit(method);
// schedule the refresh process
refresh = new Runnable() {
    public void run() {
        Future<T> future = executor.submit(method);
        try {
            future.get(); // wait until the new result is complete
            result = future; // only publish complete results
        } catch (InterruptedException | ExecutionException e) {
            // keep serving the previous result on failure
        }
        executor.schedule(refresh, ttl, TimeUnit.MILLISECONDS);
    }
};
executor.schedule(refresh, ttl, TimeUnit.MILLISECONDS);
}
public T getResult() throws InterruptedException, ExecutionException {
    return result.get();
}
}
That takes a little explanation. Basically, you're creating a generic interface for caching the result of a Callable, which will be your document load. Submitting a Callable (or Runnable) returns a Future. Calling Future.get() blocks until it returns (completes).
So what this does is implement a get() method in terms of a Future, so initial queries won't fail (they will just block). After that, every ttl milliseconds the refresh method is called. It submits the method to the scheduler and calls Future.get(), which waits for the result to complete. Once complete, it replaces the result member. Subsequent Cache.getResult() calls will return the new value.
There is a scheduleAtFixedRate() method on ScheduledExecutorService, but I avoid it because if the Callable takes longer than the scheduled delay you will end up with multiple refreshes running at the same time and then have to worry about that or about throttling. It's easier for the process to just resubmit itself at the end of a refresh.
I'm not sure I understand your need. Is your need to have faster loading (and reloading) of the cache for a portion of the values?
If so, I would suggest breaking your datastructure into smaller pieces.
Just load the piece that you need at the time. If you divide the size by 10, you will divide the loading time by something related to 10.
This could apply to the original document you are reading, if possible. Otherwise, it would be the way you read it, where you skip a large part of it and load only the relevant part.
I believe that most data can be broken down into pieces. Choose the more appropriate, here are examples:
by starting letter : A*, B* ...
partition your id into two parts: the first part is a category; look for it in the cache and load it if needed, then look for the second part inside.
If your need is not the initial loading time, but the reloading, maybe you don't mind the actual time for reloading, but want to be able to use the old version while loading the new?
If that is your need, I suggest making your cache an instance (as opposed to static) that is available in a field.
You trigger reloading every minute with a dedicated thread (or at least not the regular threads), so that you don't delay your regular threads.
Reloading creates a new instance, loads it with data (takes 1 second), and then simply replaces the old instance with the new. (The old one will get garbage-collected.) Replacing one object reference with another is an atomic operation.
Analysis: can any other thread still get access to the old cache up to the last instant? In the worst case, in the instruction just after one thread has obtained the old cache instance, another thread replaces the old instance with a new one. But this doesn't make your code faulty: asking the old cache instance will still give a value that was correct just before, which is acceptable under the requirement I gave in the first sentence.
To make your code more correct, you can make your cache instances immutable (no setters available, no way to modify internal state). This makes it clearer that they are correct to use in a multi-threaded context.
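A sketch of that instance-swap idea (the map-based contents are illustrative; I've marked the field volatile so the swap is also visible across threads):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SwappableCache {
    // volatile so every reader sees either the old or the new snapshot
    private volatile Map<Integer, Object> snapshot = Collections.emptyMap();

    public Object get(Integer id) {
        return snapshot.get(id); // never blocks, never sees a half-built map
    }

    // called from the dedicated reload thread every minute
    public void reload() {
        Map<Integer, Object> fresh = new HashMap<>();
        // ...load data into fresh (takes ~1 second)...
        snapshot = Collections.unmodifiableMap(fresh); // atomic reference swap
    }
}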
You appear to be locking more than is required: in your good case (cache full and valid) every request acquires a lock. You can get away with locking only when the cache is expired.
If we are reloading, do nothing.
If we are not reloading, check expiration; if not expired, go ahead.
If we are not reloading and we are expired, get the lock and double-check expiration to make sure we have not successfully loaded since the last check.
Also note you may wish to reload the cache in a background thread so that not even the one request is held up waiting for the cache to fill.
private void checkForReload() {
if (cachedObject == null || isExpired()) {
if (!isReloading()) {
if (isExpired()) {
synchronized (lock) {
// Recheck expiration because another thread might have
// refreshed the cache before we were allowed into the
// synchronized block.
if (isExpired()) {
isLoading = true;
try {
cachedObject = retrieve();
lastRetrieval = System.currentTimeMillis();
} catch (Exception e) {
log.error("Exception occurred retrieving cached object", e);
} finally {
isLoading = false;
}
}
}
}
}
}
}