Thread safety when iterating through concurrent collections

I'm writing a client-server application where I have to deal with multiple threads. I've got some servers that send alive-packets every few seconds. Those servers are maintained in a ConcurrentHashMap that maps their EndPoints to the time the last alive-packet arrived from the respective server.
Now I've got a thread that has to "sort out" all the servers that haven't sent alive-packets for a specific amount of time.
I guess I can't just do it like this, can I?
for( IPEndPoint server : this.fileservers.keySet() )
{
    Long time = this.fileservers.get( server );
    //If server's time is updated here, I got a problem
    if( time > fileserverTimeout )
        this.fileservers.remove( server );
}
Is there a way I can get around that without acquiring a lock for the whole loop (that I then have to respect in the other threads as well)?

There is probably no problem here, depending on what exactly you store in the map. Your code looks a little weird to me, since you seem to save "the duration for which the server hasn't been active".
My first idea for recording that data was to store "the latest timestamp at which the server has been active". Then your code would look like this:
package so3950354;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
public class ServerManager {
    private final ConcurrentMap<Server, Long> lastActive = new ConcurrentHashMap<Server, Long>();
    /** May be overridden by a special method for testing. */
    protected long now() {
        return System.currentTimeMillis();
    }
    public void markActive(Server server) {
        lastActive.put(server, Long.valueOf(now()));
    }
    public void removeInactive(long timeoutMillis) {
        final long now = now();
        Iterator<Map.Entry<Server, Long>> it = lastActive.entrySet().iterator();
        while (it.hasNext()) {
            final Map.Entry<Server, Long> entry = it.next();
            final long backThen = entry.getValue().longValue();
            /*
             * Even if some other code updates the timestamp of this server now,
             * the server had timed out at some point in time, so it may be
             * removed. It's bad luck, but impossible to avoid.
             */
            if (now - backThen >= timeoutMillis) {
                it.remove();
            }
        }
    }
    static class Server {
    }
}
If you really want to ensure that no code ever calls markActive during a call to removeInactive, there is no way around explicit locking. What you probably want is:
concurrent calls to markActive are allowed.
during markActive no calls to removeInactive are allowed.
during removeInactive no calls to markActive are allowed.
This looks like a typical scenario for a ReadWriteLock, where markActive is the "reading" operation and removeInactive is the "writing" operation.
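A minimal sketch of that locking scheme follows. It is an illustration under assumptions, not part of the class above: the name LockedServerManager is made up, servers are identified by a plain String, and the current time is passed in explicitly to keep the behaviour deterministic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical variant of ServerManager; servers are identified by a String here.
class LockedServerManager {
    private final ConcurrentMap<String, Long> lastActive = new ConcurrentHashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // "Reading" side: many threads may hold the read lock at once,
    // so concurrent calls to markActive do not block each other.
    void markActive(String server, long now) {
        lock.readLock().lock();
        try {
            lastActive.put(server, now);
        } finally {
            lock.readLock().unlock();
        }
    }

    // "Writing" side: the write lock is exclusive, so no markActive call
    // can run while inactive servers are being removed.
    int removeInactive(long now, long timeoutMillis) {
        lock.writeLock().lock();
        try {
            int before = lastActive.size();
            lastActive.entrySet().removeIf(e -> now - e.getValue() >= timeoutMillis);
            return before - lastActive.size();
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean isKnown(String server) {
        return lastActive.containsKey(server);
    }
}
```

The map stays a ConcurrentHashMap because several markActive calls can still run at the same time under the shared read lock.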

I don't see how another thread can update the server's time at that point in your code. Once you've retrieved the time of a server from the map using this.fileservers.get( server ), another thread cannot change its value as Long objects are immutable. Yes, another thread can put a new Long object for that server into the map, but that doesn't affect this thread, because it has already retrieved the time of the server.
So as it stands I can't see anything wrong with your code. The iterators in a ConcurrentHashMap are weakly consistent which means they can tolerate concurrent modification, so there is no risk of a ConcurrentModificationException being thrown either.
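To see the weakly consistent behaviour in isolation, here is a small sketch. The class name and the sample keys are made up for illustration; it modifies the map in the middle of an iteration, which would trigger a ConcurrentModificationException with a plain HashMap.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

class WeaklyConsistentDemo {
    // Removes every entry whose value exceeds 10 while also inserting a new key
    // mid-iteration; returns the keys left in the map afterwards (sorted).
    static Set<String> run() {
        ConcurrentHashMap<String, Long> map = new ConcurrentHashMap<>();
        map.put("a", 1L);
        map.put("b", 50L);
        map.put("c", 99L);
        Iterator<Map.Entry<String, Long>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            map.put("late", 0L); // structural change during iteration, tolerated by the iterator
            if (entry.getValue() > 10L) {
                it.remove();
            }
        }
        return new TreeSet<>(map.keySet());
    }
}
```

Whether the iterator happens to visit the "late" entry is not specified, but the iteration completes either way and no exception is thrown.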

(See Roland's answer, which takes the ideas here and fleshes them out into a fuller example, with some great additional insights.)
Since it's a concurrent hash map, you can do the following. Note that CHM's iterators all implement the optional methods, including remove(), which you want. See the CHM API docs, which state:
This class and its views and iterators
implement all of the optional methods
of the Map and Iterator interfaces.
This code should work (I don't know the type of the Key in your CHM):
ConcurrentHashMap<K,Long> fileservers = ...;
for( Iterator<Map.Entry<K,Long>> fsIter = fileservers.entrySet().iterator(); fsIter.hasNext(); )
{
    Map.Entry<K,Long> thisEntry = fsIter.next();
    Long time = thisEntry.getValue();
    if( time > fileserverTimeout )
        fsIter.remove(); // takes no argument; removes the entry last returned by next()
}
But note that there may be race conditions elsewhere... You need to make sure that other bits of code accessing the map can cope with this kind of spontaneous removal -- i.e., probably wherever you touch fileservers.put() you'll need a bit of logic involving fileservers.putIfAbsent(). This solution is less likely to create bottlenecks than using synchronized, but it also requires a bit more thought.
Where you wrote "If server's time is updated here, I got a problem" is exactly where putIfAbsent() comes in. If the entry is absent, either you hadn't seen it before, or you just recently dropped it from the table. If the two sides of this need to be coordinated, then you may instead want to introduce a lockable record for the entry, and carry out the synchronization at that level (i.e., sync on the record while doing remove(), rather than on the whole table). Then the put() end of things can also sync on the same record, eliminating a potential race.
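A related lock-free option, which is neither the putIfAbsent() logic nor the lockable record described above, is ConcurrentMap's two-argument remove(key, value). It removes the entry only if it is still mapped to the value that was inspected. The sketch below assumes the values are "last seen" timestamps in milliseconds; the class name, method name and parameters are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentMap;

class ConditionalRemoval {
    // Removes entries whose timestamp is at least timeoutMillis old, but only if
    // the timestamp has not been replaced since it was read. Returns the number removed.
    static int removeStale(ConcurrentMap<String, Long> lastSeen, long now, long timeoutMillis) {
        int removed = 0;
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            Long seen = entry.getValue();
            // remove(key, value) is atomic: if another thread stored a newer
            // timestamp in the meantime, the values differ and nothing is removed.
            if (now - seen >= timeoutMillis && lastSeen.remove(entry.getKey(), seen)) {
                removed++;
            }
        }
        return removed;
    }
}
```

With this, an update that lands between the read and the removal wins, and the server stays in the map until a later sweep.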

First, make the map synchronized:
this.fileservers = Collections.synchronizedMap(Map)
Then use the double-checked strategy that is used in Singleton classes:
if( time > fileserverTimeout )
{
synchronized(this.fileservers)
{
if( time > fileserverTimeout )
this.fileservers.remove( server );
}
}
Now this makes sure that once you are inside the synchronized block, no updates can occur. This is because once the lock on the map is taken, the synchronized wrapper cannot hand that lock to another thread for update, remove, etc.
Checking the time twice makes sure that synchronization is used only when there is a genuine case for deletion.

Related

Java garbage collection in multithreaded application for local variable

I have the following use case:
A single background thread maintains a set of accountIDs in memory and fetches the latest accountIds every second.
Multiple other processes running in parallel check whether a particular accountID is present in the above Set.
To achieve this, I have the following code. runTask() is the method responsible for fetching a new Set every second. The doesAccountExist method is called by other parallel threads to check whether an accountId exists in the Set.
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
class AccountIDFetcher {
    private Set<String> accountIds;
    private ScheduledExecutorService scheduledExecutorService;
    public AccountIDFetcher() {
        this.accountIds = new HashSet<String>();
        scheduledExecutorService = new ScheduledThreadPoolExecutor(1);
        scheduledExecutorService.scheduleWithFixedDelay(this::runTask, 0, 1, TimeUnit.SECONDS);
    }
    // Following method runs every 1 second
    private void runTask() {
        accountIds = getAccountIds();
    }
    // other parallel thread calls below method
    public boolean doesAccountExist(String accountId) {
        return accountIds.contains(accountId);
    }
    private Set<String> getAccountIds() {
        Set<String> accounts = new HashSet<String>();
        // calls Database and put list of accountIds into above set
        return accounts;
    }
}
I have the following question:
In the runTask method, I just change the reference of the accountIds variable to a new object. So if Thread-2 is in the middle of searching for an accountId in doesAccountExist() and, at the same time, Thread-1 executes runTask() and changes the reference of accountIds to a new object, then the old object gets orphaned. Is it possible that the old object can be garbage collected before Thread-2 finishes searching in it?

tl;dr
You asked:
Is it possible that the old object can be garbage collected before Thread-2 finishes searching in it?
No, the old Set object does not become garbage while some thread is still using it.
An object only becomes a candidate for garbage-collection after each and every reference to said object (a) goes out of scope, (b) is set to null, or (c) is a weak reference.
No, an object in use within a method will not be garbage-collected
In Java, an object reference assignment is atomic, as discussed in another Question. When this.accountIds is directed to point to a new Set object, that happens in one logical operation. That means that any other code in any other thread accessing the accountIds member field will always successfully access either the old Set object or the new Set object, always one or the other.
If during that re-assignment another thread accessed the old Set object, that other thread's code is working with a copy of the object reference. You can think of your doesAccountExist method:
public boolean doesAccountExist(String accountId) {
return accountIds.contains(accountId);
}
…as having a local variable with a copy of the object reference, as if written like this:
public boolean doesAccountExist(String accountId) {
Set<String> set = this.accountIds ;
return set.contains(accountId);
}
While one thread is replacing the reference to a new Set on the member field accountIds, the doesAccountExist method already has a copy of the reference to the old Set. At that moment, while one thread is changing the member field reference, and another thread has a local reference, the garbage collector sees both the new and old Set objects as having (at least) one reference each. So neither is a candidate for being garbage-collected.
Actually, more technically correct would be explaining that at the point in your line return accountIds.contains(accountId); where execution reaches the accountIds portion, the current (old) Set will be accessed. A moment later the contains method begins its work, during which re-assigning a new Set to that member field has no effect on this method's work-in-progress already using the old Set.
This means that even after the new Set has been assigned in one thread, the other thread may still be continuing its work of searching the old Set. This may or may not be a problem depending on the business context of your app. But your Question did not address this stale-data transactional aspect.
So regarding your question:
Is it possible that the old object can be garbage collected before Thread-2 finishes searching in it?
No, the old Set object does not become garbage while some thread is still using it.
Other issues
Your code does have other issues.
Visibility
You declared your member field as private Set<String> accountIds;. If you access that member field across threads on a host machine with multiple cores, then you have a visibility problem. The caches on each core may not be refreshed immediately when you assign a different object to that member field. As currently written, it is entirely possible that one thread accessing this.accountIds will gain access to the old Set object even after that variable was assigned the new Set object.
If you do not already know about the issues I mention, study up on concurrency. There is more involved than I can cover here. Learn about the Java Memory Model. And I strongly recommend reading and re-reading the classic book, Java Concurrency in Practice by Brian Goetz, et al.
volatile
One solution is to mark the member field as volatile. So, this:
private Set<String> accountIds;
…becomes this:
volatile private Set<String> accountIds;
Marking as volatile avoids a stale cache on a CPU core pointing to the old object reference rather than the new object reference.
AtomicReference
Another solution is using an object of AtomicReference class as the member field. I would mark it final so that one and only one such object is ever assigned to that member field, so the field is a constant rather than a variable. Then assign each new Set object as the payload contained within that AtomicReference object. Code wanting the current Set object calls a getter method on that AtomicReference object. This call is guaranteed to be thread-safe, eliminating the need for volatile.
Concurrent access to existing Set
Another possible problem with your code might be concurrent access to an existing Set. If you have more than one thread accessing the existing Set, then you must protect that resource.
One way to protect access to that Set is to use a thread-safe implementation of Set such as ConcurrentSkipListSet.
From what you have shown in the Question, the only access to the existing Set that I noticed is calling contains. If you are never modifying the existing Set, then merely calling contains across multiple threads may be safe; I just don't know, so you'd have to research it.
If you intend to never modify an existing Set, then you can enforce that by using an unmodifiable set. One way to produce an unmodifiable set is to construct and populate a regular set. Then feed that regular set to the method Set.copyOf. So your getAccountIds method would look like this:
private Set<String> getAccountIds() {
Set<String> accounts = new HashSet<String>();
// calls Database and put list of accountIds into above set
return Set.copyOf( accounts );
}
Return a copy rather than a reference
There are two easy ways to avoid dealing with concurrency:
Make the object immutable
Provide a copy of the object
As for the first way, immutability, the Java Collections Framework is generally very good but unfortunately lacks explicit mutability & immutability in its type system. The Set.of methods and Collections.unmodifiableSet both provide a Set that cannot be modified. But the type itself does not proclaim that fact. So we cannot ask the compiler to enforce a rule such as our AtomicReference only storing an immutable set. As an alternative, consider using a third-party collections library with immutability as part of its type system, perhaps Eclipse Collections or Google Guava.
As for the second way, we can make a copy of our Set of account IDs whenever needing access. So we need a getCurrentAccountIds method that goes into the AtomicReference, retrieves the Set stored there, and calls Set.copyOf to produce a new set of the same contained objects. This copy operation is not documented as being thread-safe. So we should mark the method synchronized to allow only one copy operation at a time. Bonus: We can mark this method public to give any calling programmer access to the Set of account IDs for their own perusal.
synchronized public Set < UUID > getCurrentAccountIds ( )
{
return Set.copyOf( this.accountIdsRef.get() ); // Safest approach is to return a copy rather than original set.
}
Our convenience method doesAccountExist should call that same getCurrentAccountIds to obtain a copy of the set before doing its "contains" logic. This way we do not care whether or not the "contains" work is thread-safe.
Caveat: I am not satisfied with using Set.copyOf as a means to avoid any possible thread-safety issues. That method notes that if the passed collection being copied is already an unmodifiable set, then a copy may not be made. In real work I would use a Set implementation that guarantees thread-safety, whether found bundled with Java or by adding a third-party library.
Do not use object within constructor
I do not like seeing the scheduled executor service appearing within your constructor. I see two issues there: (a) app lifecycle and (b) using an object within a constructor.
Creating the executor service, scheduling tasks on that service, and shutting down that service are all related to the lifecycle of the app. An object generally should not be aware of its lifecycle within the greater app. This account IDs provider object should know only how to do its job (provide IDs) but should not be responsible for putting itself to work. So your code is mixing responsibilities, which is generally a poor practice.
Another problem is that the executor service is being scheduled to immediately start using the very object that we are still constructing. Generally, the best practice is to not use an object while still under construction. You may get away with such use, but doing so is risky and is prone to leading to bugs. A constructor should be short and sweet, used just to validate inputs, establish required resources, and ensure the integrity of the object being birthed.
I did not pull the service out of your constructor only because I did not want this Answer to go too far out into the weeds. However, I did make two adjustments. (a) I changed the initial delay on the scheduleWithFixedDelay call from zero to one second. This is a hack to give the constructor time to finish birthing the object before its first use. (b) I added the tearDown method to properly shut down the executor service so its backing thread-pool does not continue running indefinitely in zombie fashion.
Tips
I suggest renaming your getAccountIds() method. The get wording in Java is usually associated with the JavaBeans convention of accessing an existing property. In your case you are generating an entirely new replacement set of values. So I would change that name to something like fetchFreshAccountIds.
Consider wrapping your scheduled task with a try-catch. Any Exception or Error bubbling up to reach the ScheduledExecutorService brings a silent halt to any further scheduling. See ScheduledExecutorService Exception handling.
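The silent halt described in the tip above can be reproduced with a small sketch. The class name and the simulated failure are made up for illustration; the task throws on its first run, and the counter shows that no further runs happen.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SilentHaltDemo {
    // Schedules a repeating task that throws on its first run
    // and returns how many times the task actually ran.
    static int runsBeforeHalt() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> future = ses.scheduleWithFixedDelay(() -> {
                runs.incrementAndGet();
                throw new IllegalStateException("simulated database failure");
            }, 0, 10, TimeUnit.MILLISECONDS);
            try {
                future.get(); // blocks until the task fails, then rethrows the failure
            } catch (ExecutionException expected) {
                // The exception surfaces only here; nothing is printed by default.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            try {
                Thread.sleep(100); // ample time for more runs, had they not been suppressed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return runs.get();
        } finally {
            ses.shutdownNow();
        }
    }
}
```

Code that never calls get() on the returned future sees no sign of the failure at all, which is why catching inside the task body is the safer habit.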
Example code.
Here is a complete example of my take on your code.
Caveat: Use at your own risk. I am not a concurrency expert. This is meant as food-for-thought, not production use.
I used UUID as the data type of the account IDs to be more realistic and clear.
I changed some of your class & method names for clarity.
Notice which methods are private and which are public.
package work.basil.example;
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
public class AccountIdsProvider
{
// Member fields
private final AtomicReference < Set < UUID > > accountIdsRef;
private final ScheduledExecutorService scheduledExecutorService;
// Constructor
public AccountIdsProvider ( )
{
this.accountIdsRef = new AtomicReference <>();
this.accountIdsRef.set( Set.of() ); // Initialize to empty set.
this.scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
scheduledExecutorService.scheduleWithFixedDelay( this :: replaceAccountIds , 1 , 1 , TimeUnit.SECONDS ); // I strongly suggest you move the executor service and the scheduling work to be outside this class, to be a different class’ responsibility.
}
// Task to be run by scheduled executor service. Fetches a fresh set of account IDs and swaps it in.
private void replaceAccountIds ( )
{
// Beware: Any uncaught Exception or Error bubbling up to the scheduled executor services halts the scheduler immediately and silently.
try
{
System.out.println( "Running replaceAccountIds. " + Instant.now() );
Set < UUID > freshAccountIds = this.fetchFreshAccountIds();
this.accountIdsRef.set( freshAccountIds );
System.out.println( "freshAccountIds = " + freshAccountIds + " at " + Instant.now() );
}
catch ( Throwable t )
{
t.printStackTrace();
}
}
// Stands in for the database query that finds the currently relevant account IDs.
private Set < UUID > fetchFreshAccountIds ( )
{
int limit = ThreadLocalRandom.current().nextInt( 0 , 4 );
HashSet < UUID > uuids = new HashSet <>();
for ( int i = 1 ; i <= limit ; i++ )
{
uuids.add( UUID.randomUUID() );
}
return Set.copyOf( uuids ); // Return unmodifiable set.
}
// Calling programmers can get a copy of the set of account IDs for their own perusal.
// Pass a copy rather than a reference for thread-safety.
// Synchronized in case copying the set is not thread-safe.
synchronized public Set < UUID > getCurrentAccountIds ( )
{
return Set.copyOf( this.accountIdsRef.get() ); // Safest approach is to return a copy rather than original set.
}
// Convenience method for calling programmers.
public boolean doesAccountExist ( UUID accountId )
{
return this.getCurrentAccountIds().contains( accountId );
}
// Destructor
public void tearDown ( )
{
// IMPORTANT: Always shut down your executor service. Otherwise the backing pool of threads may run indefinitely, like a zombie 🧟‍.
if ( Objects.nonNull( this.scheduledExecutorService ) )
{
System.out.println( "INFO - Shutting down the scheduled executor service. " + Instant.now() );
this.scheduledExecutorService.shutdown(); // I strongly suggest you move the executor service and the scheduling work to be outside this class, to be a different class’ responsibility.
}
}
public static void main ( String[] args )
{
System.out.println( "INFO - Starting app. " + Instant.now() );
AccountIdsProvider app = new AccountIdsProvider();
try { Thread.sleep( Duration.ofSeconds( 10 ).toMillis() ); } catch ( InterruptedException e ) { e.printStackTrace(); }
app.tearDown();
System.out.println( "INFO - Ending app. " + Instant.now() );
}
}

Garbage collection is not the main issue with this code. The lack of any synchronization is the main issue.
If thread-2 is "searching" in something, it necessarily has a reference to that thing, so it's not going to get GC'd.
Why not use 'synchronized' so you can be sure of what will happen?

Garbage-collection-wise, you will not get any surprises, but not for the reason the accepted answer implies. It is somewhat trickier.
Maybe visualizing this will help.
accountIds ----> some_instance_1
Suppose ThreadA is now working with some_instance_1. It started to search for accountId in it. While that operation is going, ThreadB changes what that reference is pointing to. So it becomes:
some_instance_1
accountIds ----> some_instance_2
Because reference assignment is atomic, this is what ThreadA will also see if it reads that reference again. At this point in time, some_instance_1 is eligible for garbage collection, since no one refers to it. Just take note that this will only happen if ThreadA sees the write that ThreadB did. Either way, you are safe (GC-wise only), because ThreadA works with either a stale copy (which you said is OK) or the most recent one.
This does not mean that everything is fine with your code.
What that answer gets right is that reference assignment is atomic, so once a thread writes to a reference (accountIds = getAccountIds()), a reading thread (accountIds.contains(accountId)) that actually performs the read will see the write. I say "actually" because an optimizer might not even issue such a read to begin with. In very simple (and somewhat wrong) words, each thread might get its own copy of accountIds, and because that is a "plain" read without any special semantics (like volatile, release/acquire, synchronization, etc.), reading threads have no obligation to see the writing thread's action.
So, even if someone actually did accountIds = getAccountIds(), it does not mean that reading threads will see this. And it gets worse. This write might not ever be seen. You need to introduce special semantics if you want guarantees (and you absolutely do).
For that you need to declare the accountIds field volatile:
private volatile Set<String> accountIds = ...
so that when multiple threads are involved, you get the needed visibility guarantees.
Then, so as not to interfere with any in-flight updates of accountIds, you can simply work on a local copy of the reference:
public boolean doesAccountExist(String accountId) {
Set<String> local = accountIds;
return local.contains(accountId);
}
Even if accountIds changes while you are in this method, you do not care about that change, since you are searching against local, which is unaware of the change.

Atomic multi-entry operations on ConcurrentHashMap

I need to perform a two-entry concurrent operation on a ConcurrentHashMap atomically.
I have a ConcurrentHashMap of Client, with an Integer id as key; every client has a selectedId attribute, which contains the id of another client or itself (means nobody selected).
At every clientChangedSelection(int whoChangedSelection) concurrent event, I need to check atomically if both the client and the selected client are referencing each other. If they do, they get removed and returned.
In the meantime clients can be added or removed by other threads.
The "ideal" solution would be to have a lock for every entry and lock the affected entries; every clientChangedSelection runs in its own thread, so they would wait if necessary. Of course that's not practical. On top of that, ConcurrentHashMap doesn't offer APIs to manually lock buckets, as far as I know. And on top of that again, I've read somewhere that the buckets' locks aren't reentrant. Not sure if that's true or why.
My "imaginative" approach makes heavy use of nested compute() methods to guarantee atomicity. If ConcurrentHashMap's locks aren't reentrant, this won't work. It loses any readability, requires "value capturing" workarounds, and performance is probably bad. But performance isn't much of an issue as long as it doesn't affect threads working on unrelated entries (i.e. in different buckets).
public Client[] match(int id){
final Client players[]=new Client[]{null,null};
clients.computeIfPresent(id,(idA, playerA)->{
if(playerA.selectedId!=idA){
clients.computeIfPresent(playerA.selectedId,(idB, playerB)->{
if(playerB.selectedId==idA){
players[0]=playerA;
players[1]=playerB;
return null;
}else{
return playerB;
}
});
}
if(players[0]==null){
return playerA;
}else{
return null;
}
});
if(players[0]==null){
return null;
}else{
return players;
}
}
The "unacceptable" approach synchronizes the entire match method. This invalidates the point of having concurrent events in the first place.
The "wrong" approach temporarily removes the two clients while working with them, and adds them back in case. This makes concurrent events using the entries fail instead of waiting, as "in use" becomes indistinguishable from "not present".
I think I'll go back to a timer which inspects the whole map in one pass every n seconds. No additional synchronization would be required, but it's less elegant.
This is, more or less, a common concurrency situation, but it's made interesting by the ConcurrentHashMap, which discourages reinventing the wheel too much.
What would your approach be? Any suggestions?
Edit 1
Synchronizing every access (thus defeating the point of using a ConcurrentHashMap) is not a viable solution either. Concurrent access must be preserved, else the problem itself wouldn't exist.
I've removed the selectedId parameter from match(), but note that doesn't really matter. The fictitious event clientChangedSelection(int whoChangedSelection) represents the concurrent event. Could happen any time in any operating thread. match() is just an example function that gets called to handle the matching. Hope I made it clearer.
Edit 2
This is the doubly-synchronized function I ended up with. idSelected() is an example of a method that requires synchronization, as it modifies client attributes. Synchronization for put() and remove() is not required in this case; what the function sees is new enough.
There happen to be two checks: the first one is there just to get the clients to synchronize on, the second one is there to tell if a previously executed match succeeded and removed the client while the current one was waiting.
match() can't match the same client twice, and that was important (the atomic part).
match() can still match concurrently removed clients (removed with classic map APIs, not by the same function), and that's tolerable.
public void idSelected(int id, int selectedId){
Client playerA=clients.get(id);
if(playerA!=null){
synchronized(playerA){
playerA.selectedId=selectedId;
}
}
}
public Client[] match(int id, int selectedId){
// determine if players exist, in order to have objects to synchronize on
Client playerA=clients.get(id);
if(playerA==null){
return null;
}
Client playerB=clients.get(selectedId);
if(playerB==null){
return null;
}
// sort players in order to do nested synchronization safely
if(id>selectedId){
final Client t=playerA;
playerA=playerB;
playerB=t;
}
// check under synchronization
synchronized(playerA){
if(clients.containsKey(playerA.id)){
synchronized(playerB){
if(clients.containsKey(playerB.id)){
if(playerA.selectedId==playerB.id&&playerB.selectedId==playerA.id){
clients.remove(id);
clients.remove(selectedId);
return new Client[]{playerA,playerB};
}
}
}
}
}
return null;
}

Scalable patterns for thread-safe hashtable puts when keeping track of frequency

This was an interview question I got some time last week and it ended on a cliffhanger. The question was simple: Design a service that keeps track of the frequency of "messages" (a 1 line string, could be in different languages) passed to it. There are 2 broad APIs: submitMsg(String msg) and getFrequency(String msg). My immediate reaction was to use a HashMap that uses a String as a key (in this case, a message) and an Integer as a value (to keep track of counts/frequency).
The submitMsg api simply sees whether a message exists in the hashMap. If it doesn't, put the message and set the frequency to 1; if it does, then get the current count and increment it by 1. The interviewer then pointed out this would fail miserably in the event multiple threads access the SAME key at the SAME exact time.
For example: At 12:00:00:000 Thread1 would try to "submitMsg", and thereby my method would (1) do a get on the hashMap and see that the value is not null, it is in fact, say, 100, and (2) do a put that increments the frequency by 1 so that the key's value is 101. Meanwhile, consider that Thread2 ALSO tried to do a submitMsg at exactly 12:00:00:000, and the method once again internally did a get on the hashMap (which returned 100 - this is the race condition), after which it also sets the frequency to 101. Alas, the true frequency should have been 102 and not 101, and this is a major design flaw in a heavily multithreaded environment. I wasn't sure how to stop this from happening: putting a lock on simply the write isn't good enough, and having a lock on a read didn't make sense. What would have been ideal is to "lock" an element if a get was invoked internally via the submitMsg api, because we expect it to be "written to" soon after. The lock would be released once the frequency had been updated, but if someone were to use the getFrequency() api, having a pure lock wouldn't make sense. I'm not sure whether a mutex would help here because I don't have a strong background in distributed systems.
I'm looking to the SO community for help on the best way to think through a problem like this. Is the magic in the data structure to be used or some kind of synchronization that I need to do in my api itself? How can we maintain the integrity of "frequency" while maintaining the scalability of the service as well?

Well, your initial idea isn't a million miles off, you just need to make it thread safe. For instance, you could use a ConcurrentHashMap<String, AtomicInteger>.
public void submitMsg(String msg) {
AtomicInteger previous = map.putIfAbsent(msg, new AtomicInteger(1));
if (null != previous) {
previous.incrementAndGet();
}
}
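For completeness, here is a sketch of the whole service with both APIs, written as a variation on the snippet above that uses computeIfAbsent instead of putIfAbsent. The class name FrequencyService and the stress helper are made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class FrequencyService {
    private final ConcurrentMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    void submitMsg(String msg) {
        // computeIfAbsent creates the counter at most once per key;
        // incrementAndGet is an atomic read-modify-write, so no update is lost.
        counts.computeIfAbsent(msg, key -> new AtomicInteger()).incrementAndGet();
    }

    int getFrequency(String msg) {
        AtomicInteger counter = counts.get(msg);
        return counter == null ? 0 : counter.get();
    }

    // Hammers a single key from several threads and returns the final count.
    static int stress(int threads, int perThread) {
        FrequencyService service = new FrequencyService();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    service.submitMsg("hello");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return service.getFrequency("hello");
    }
}
```

Readers never block here: getFrequency simply reads the current value of the counter, which may be a moment behind an in-flight increment.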

The simplest solution is using Guava's com.google.common.collect.ConcurrentHashMultiset:
private final ConcurrentHashMultiset<String> multiset = ConcurrentHashMultiset.create();
public void submitMsg(String msg) {
multiset.add(msg);
}
public int count(String msg) {
return multiset.count(msg);
}
But this is basically the same as Aurand's solution, just that somebody already implemented the boring details like creating the counter if it doesn't exist yet, etc.

Treat it as a Producer–consumer problem.
The service is the producer; it should add each message to a queue that feeds the consumer. You could run one queue per producer to ensure that the producers do not wait.
The consumer encapsulates the HashTable, and pulls the messages off the queue and updates the table.
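A rough sketch of that arrangement, with a single queue and a single consumer thread, might look like this. All names are made up for illustration, and the counts are only read after the consumer has stopped; serving getFrequency while the consumer is still running would need an additional thread-safe handoff that is not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueuedFrequencyCounter {
    // Hypothetical "poison pill" telling the consumer to stop; compared by identity.
    private static final String STOP = new String("<stop>");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Plain HashMap is enough: only the consumer thread ever touches it while running.
    private final Map<String, Integer> counts = new HashMap<>();
    private final Thread consumer = new Thread(this::drain);

    void start() {
        consumer.start();
    }

    // Producer side: enqueue and return immediately, without touching the table.
    void submitMsg(String msg) {
        queue.add(msg);
    }

    // Consumer side: take messages one at a time and update the table.
    private void drain() {
        try {
            for (String msg = queue.take(); msg != STOP; msg = queue.take()) {
                counts.merge(msg, 1, Integer::sum);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Lets the consumer finish everything queued so far, then returns the counts.
    Map<String, Integer> stopAndGetCounts() {
        queue.add(STOP);
        try {
            consumer.join(); // join makes the consumer's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counts;
    }
}
```

Because one thread owns the table, the increment itself needs no lock; the contention moves to the queue, which is built for it.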

reliably forcing Guava map eviction to take place

EDIT: I've reorganized this question to reflect the new information that since became available.
This question is based on the responses to a question by Viliam concerning Guava Maps' use of lazy eviction: Laziness of eviction in Guava's maps
Please read this question and its response first, but essentially the conclusion is that Guava maps do not asynchronously calculate and enforce eviction. Given the following map:
ConcurrentMap<String, MyObject> cache = new MapMaker()
.expireAfterAccess(10, TimeUnit.MINUTES)
.makeMap();
Once ten minutes have passed following access to an entry, it will still not be evicted until the map is "touched" again. Known ways to do this include the usual accessors: get(), put() and containsKey().
The first part of my question [solved]: what other calls cause the map to be "touched"? Specifically, does anyone know if size() falls into this category?
The reason for wondering this is that I've implemented a scheduled task to occasionally nudge the Guava map I'm using for caching, using this simple method:
public static void nudgeEviction() {
cache.containsKey("");
}
However I'm also using cache.size() to programmatically report the number of objects contained in the map, as a way to confirm this strategy is working. But I haven't been able to see a difference from these reports, and now I'm wondering if size() also causes eviction to take place.
Answer: So Mark has pointed out that in release 9, eviction is invoked only by the get(), put(), and replace() methods, which would explain why I wasn't seeing an effect for containsKey(). This will apparently change with the next version of guava which is set for release soon, but unfortunately my project's release is set sooner.
This puts me in an interesting predicament. Normally I could still touch the map by calling get(""), but I'm actually using a computing map:
ConcurrentMap<String, MyObject> cache = new MapMaker()
.expireAfterAccess(10, TimeUnit.MINUTES)
.makeComputingMap(loadFunction);
where loadFunction loads the MyObject corresponding to the key from a database. It's starting to look like I have no easy way of forcing eviction until r10. But even being able to reliably force eviction is put into doubt by the second part of my question:
The second part of my question [solved]: In reaction to one of the responses to the linked question, does touching the map reliably evict all expired entries? In the linked answer, Niraj Tolia indicates otherwise, saying eviction is potentially only processed in batches, which would mean multiple calls to touch the map might be needed to ensure all expired objects were evicted. He did not elaborate, however this seems related to the map being split into segments based on concurrency level. Assuming I used r10, in which a containsKey("") does invoke eviction, would this then be for the entire map, or only for one of the segments?
Answer: maaartinus has addressed this part of the question:
Beware that containsKey and other reading methods only run postReadCleanup, which does nothing but on each 64th invocation (see DRAIN_THRESHOLD). Moreover, it looks like all cleanup methods work with single Segment only.
So it looks like calling containsKey("") wouldn't be a viable fix, even in r10. This reduces my question to the title: How can I reliably force eviction to occur?
Note: Part of the reason my web app is noticeably affected by this issue is that when I implemented caching I decided to use multiple maps - one for each class of my data objects. So with this issue there is the possibility that one area of code is executed, causing a bunch of Foo objects to be cached, and then the Foo cache isn't touched again for a long time so it doesn't evict anything. Meanwhile Bar and Baz objects are being cached from other areas of code, and memory is being eaten. I'm setting a maximum size on these maps, but this is a flimsy safeguard at best (I'm assuming its effect is immediate - still need to confirm this).
UPDATE 1: Thanks to Darren for linking the relevant issues - they now have my votes. So it looks like a resolution is in the pipeline, but seems unlikely to be in r10. In the meantime, my question remains.
UPDATE 2: At this point I'm just waiting for a Guava team member to give feedback on the hack maaartinus and I put together (see answers below).
LAST UPDATE: feedback received!

I just added the method Cache.cleanUp() to Guava. Once you migrate from MapMaker to CacheBuilder you can use that to force eviction.

I was wondering about the same issue you described in the first part of your question. From what I can tell from looking at the source code for Guava's CustomConcurrentHashMap (release 9), it appears that entries are evicted on the get(), put(), and replace() methods. The containsKey() method does not appear to invoke eviction. I'm not 100% sure because I only took a quick pass at the code.
Update:
I also found a more recent version of the CustomConcurrentHashmap in Guava's git repository and it looks like containsKey() has been updated to invoke eviction.
Both release 9 and the latest version I just found do not invoke eviction when size() is called.
Update 2:
I recently noticed that Guava r10 (yet to be released) has a new class called CacheBuilder. Basically this class is a forked version of the MapMaker but with caching in mind. The documentation suggests that it will support some of the eviction requirements you are looking for.
I reviewed the updated code in r10's version of the CustomConcurrentHashMap and found what looks like a scheduled map cleaner. Unfortunately, that code appears unfinished at this point but r10 looks more and more promising each day.

Beware that containsKey and other reading methods only run postReadCleanup, which does nothing but on each 64th invocation (see DRAIN_THRESHOLD). Moreover, it looks like all cleanup methods work with single Segment only.
The easiest way to enforce eviction seems to be to put some dummy object into each segment. For this to work, you'd need to analyze CustomConcurrentHashMap.hash(Object), which is surely not a good idea, as this method may change at any time. Moreover, depending on the key class, it may be hard to find a key with a hashCode ensuring it lands in a given segment.
You could use reads instead, but would have to repeat them 64 times per segment. Here, it'd be easy to find a key with an appropriate hashCode, since any object is allowed as an argument.
Maybe you could hack into the CustomConcurrentHashMap source code instead, it could be as trivial as
public void runCleanup() {
final Segment<K, V>[] segments = this.segments;
for (int i = 0; i < segments.length; ++i) {
segments[i].runCleanup();
}
}
but I wouldn't do it without a lot of testing and/or an OK by a guava team member.

Yep, we've gone back and forth a few times on whether these cleanup tasks should be done on a background thread (or pool), or should be done on user threads. If they were done on a background thread, this would eventually happen automatically; as it is, it'll only happen as each segment gets used. We're still trying to come up with the right approach here - I wouldn't be surprised to see this change in some future release, but I also can't promise anything or even make a credible guess as to how it will change. Still, you've presented a reasonable use case for some kind of background or user-triggered cleanup.
Your hack is reasonable, as long as you keep in mind that it's a hack, and liable to break (possibly in subtle ways) in future releases. As you can see in the source, Segment.runCleanup() calls runLockedCleanup and runUnlockedCleanup: runLockedCleanup() will have no effect if it can't lock the segment, but if it can't lock the segment it's because some other thread has the segment locked, and that other thread can be expected to call runLockedCleanup as part of its operation.
Also, in r10, there's CacheBuilder/Cache, analogous to MapMaker/Map. Cache is the preferred approach for many current users of makeComputingMap. It uses a separate CustomConcurrentHashMap, in the common.cache package; depending on your needs, you may want your GuavaEvictionHacker to work with both. (The mechanism is the same, but they're different Classes and therefore different Methods.)

I'm not a big fan of hacking into or forking external code until absolutely necessary. This problem occurs in part due to an early decision for MapMaker to fork ConcurrentHashMap, thereby dragging in a lot of complexity that could have been deferred until after the algorithms were worked out. By patching above MapMaker, the code is robust to library changes so that you can remove your workaround on your own schedule.
An easy approach is to use a priority queue of weak reference tasks and a dedicated thread. This has the drawback of creating many stale no-op tasks, which can become excessive due to the O(lg n) insertion penalty. It works reasonably well for small, less frequently used caches. It was the original approach taken by MapMaker and it's simple to write your own decorator.
A more robust choice is to mirror the lock amortization model with a single expiration queue. The head of the queue can be volatile so that a read can always peek to determine if it has expired. This allows all reads to trigger an expiration and an optional clean-up thread to check regularly.
By far the simplest is to use #concurrencyLevel(1) to force MapMaker to use a single segment. This reduces the write concurrency, but most caches are read heavy so the loss is minimal. The original hack to nudge the map with a dummy key would then work fine. This would be my preferred approach, but the other two options are okay if you have high write loads.
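To make the single-expiration-queue option a bit more concrete, here is a minimal sketch. It is not Guava's implementation and does not wrap MapMaker: ExpiringCache and everything in it are hypothetical, it expires after write rather than after access, and it assumes a fixed time-to-live so that the queue stays roughly ordered by expiry time:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

// Hypothetical write-expiring cache with one expiration queue and an explicit cleanUp().
class ExpiringCache<K, V> {
    private static final class Entry<K, V> {
        final K key;
        final V value;
        final long expiresAt;

        Entry(K key, V value, long expiresAt) {
            this.key = key;
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentMap<K, Entry<K, V>> map = new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<Entry<K, V>> expirationQueue = new ConcurrentLinkedQueue<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so tests can control time

    ExpiringCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        cleanUp();
        Entry<K, V> entry = new Entry<>(key, value, clock.getAsLong() + ttlMillis);
        map.put(key, entry);
        expirationQueue.add(entry);
    }

    V get(K key) {
        cleanUp(); // every read peeks at the head of the queue
        Entry<K, V> entry = map.get(key);
        return entry == null ? null : entry.value;
    }

    // Does not trigger cleanup, mirroring the lazy behaviour discussed above.
    int size() {
        return map.size();
    }

    // Can also be called from a scheduled task to force eviction on an idle cache.
    void cleanUp() {
        long now = clock.getAsLong();
        Entry<K, V> head;
        while ((head = expirationQueue.peek()) != null && head.expiresAt <= now) {
            // remove(head) takes out exactly this entry; false means another thread got there first
            if (expirationQueue.remove(head)) {
                // only drop the mapping if it still points at the expired entry
                map.remove(head.key, head);
            }
        }
    }
}
```

Because cleanUp() only ever looks at the head of the queue, the cost per read is small when nothing has expired, and a background thread calling it periodically covers caches that are never touched.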

I don't know if it is appropriate for your use case, but your main concern about the lack of background cache eviction seems to be memory consumption, so I would have thought that using softValues() on the MapMaker, to allow the garbage collector to reclaim entries from the cache when a low-memory situation occurs, could easily be the solution for you. I have used this on a subscription server (ATOM) where entries are served through a Guava cache using SoftReferences for values.

Based on maaartinus's answer, I came up with the following code which uses reflection rather than directly modifying the source (If you find this useful please upvote his answer!). While it will come at a performance penalty for using reflection, the difference should be negligible since I'll run it about once every 20 minutes for each caching Map (I'm also caching the dynamic lookups in the static block which will help). I have done some initial testing and it appears to work as intended:
public class GuavaEvictionHacker {
//Class objects necessary for reflection on Guava classes - see Guava docs for info
private static final Class<?> computingMapAdapterClass;
private static final Class<?> nullConcurrentMapClass;
private static final Class<?> nullComputingConcurrentMapClass;
private static final Class<?> customConcurrentHashMapClass;
private static final Class<?> computingConcurrentHashMapClass;
private static final Class<?> segmentClass;
//MapMaker$ComputingMapAdapter#cache points to the wrapped CustomConcurrentHashMap
private static final Field cacheField;
//CustomConcurrentHashMap#segments points to the array of Segments (map partitions)
private static final Field segmentsField;
//CustomConcurrentHashMap$Segment#runCleanup() enforces eviction on the calling Segment
private static final Method runCleanupMethod;
static {
try {
//look up Classes
computingMapAdapterClass = Class.forName("com.google.common.collect.MapMaker$ComputingMapAdapter");
nullConcurrentMapClass = Class.forName("com.google.common.collect.MapMaker$NullConcurrentMap");
nullComputingConcurrentMapClass = Class.forName("com.google.common.collect.MapMaker$NullComputingConcurrentMap");
customConcurrentHashMapClass = Class.forName("com.google.common.collect.CustomConcurrentHashMap");
computingConcurrentHashMapClass = Class.forName("com.google.common.collect.ComputingConcurrentHashMap");
segmentClass = Class.forName("com.google.common.collect.CustomConcurrentHashMap$Segment");
//look up Fields and set accessible
cacheField = computingMapAdapterClass.getDeclaredField("cache");
segmentsField = customConcurrentHashMapClass.getDeclaredField("segments");
cacheField.setAccessible(true);
segmentsField.setAccessible(true);
//look up the cleanup Method and set accessible
runCleanupMethod = segmentClass.getDeclaredMethod("runCleanup");
runCleanupMethod.setAccessible(true);
}
catch (ClassNotFoundException cnfe) {
throw new RuntimeException("ClassNotFoundException thrown in GuavaEvictionHacker static initialization block.", cnfe);
}
catch (NoSuchFieldException nsfe) {
throw new RuntimeException("NoSuchFieldException thrown in GuavaEvictionHacker static initialization block.", nsfe);
}
catch (NoSuchMethodException nsme) {
throw new RuntimeException("NoSuchMethodException thrown in GuavaEvictionHacker static initialization block.", nsme);
}
}
/**
* Forces eviction to take place on the provided Guava Map. The Map must be an instance
* of either {@code CustomConcurrentHashMap} or {@code MapMaker$ComputingMapAdapter}.
*
* @param guavaMap the Guava Map to force eviction on.
*/
public static void forceEvictionOnGuavaMap(ConcurrentMap<?, ?> guavaMap) {
try {
//we need to get the CustomConcurrentHashMap instance
Object customConcurrentHashMap;
//get the type of what was passed in
Class<?> guavaMapClass = guavaMap.getClass();
//if it's a CustomConcurrentHashMap we have what we need
if (guavaMapClass == customConcurrentHashMapClass) {
customConcurrentHashMap = guavaMap;
}
//if it's a NullConcurrentMap (auto-evictor), return early
else if (guavaMapClass == nullConcurrentMapClass) {
return;
}
//if it's a computing map we need to pull the instance from the adapter's "cache" field
else if (guavaMapClass == computingMapAdapterClass) {
customConcurrentHashMap = cacheField.get(guavaMap);
//get the type of what we pulled out
Class<?> innerCacheClass = customConcurrentHashMap.getClass();
//if it's a NullComputingConcurrentMap (auto-evictor), return early
if (innerCacheClass == nullComputingConcurrentMapClass) {
return;
}
//otherwise make sure it's a ComputingConcurrentHashMap - error if it isn't
else if (innerCacheClass != computingConcurrentHashMapClass) {
throw new IllegalArgumentException("Provided ComputingMapAdapter's inner cache was an unexpected type: " + innerCacheClass);
}
}
//error for anything else passed in
else {
throw new IllegalArgumentException("Provided ConcurrentMap was not an expected Guava Map: " + guavaMapClass);
}
//pull the array of Segments out of the CustomConcurrentHashMap instance
Object[] segments = (Object[])segmentsField.get(customConcurrentHashMap);
//loop over them and invoke the cleanup method on each one
for (Object segment : segments) {
runCleanupMethod.invoke(segment);
}
}
catch (IllegalAccessException iae) {
throw new RuntimeException(iae);
}
catch (InvocationTargetException ite) {
throw new RuntimeException(ite.getCause());
}
}
}
I'm looking for feedback on whether this approach is advisable as a stopgap until the issue is resolved in a Guava release, particularly from members of the Guava team when they get a minute.
EDIT: updated the solution to allow for auto-evicting maps (NullConcurrentMap or NullComputingConcurrentMap residing in a ComputingMapAdapter). This turned out to be necessary in my case, since I'm calling this method on all of my maps and a few of them are auto-evictors.

Synchronizing on an Integer value [duplicate]

Possible Duplicate:
What is the best way to increase number of locks in java
Suppose I want to lock based on an integer id value. In this case, there's a function that pulls a value from a cache and does a fairly expensive retrieve/store into the cache if the value isn't there.
The existing code isn't synchronized and could potentially trigger multiple retrieve/store operations:
//pseudocode
public Page getPage (Integer id){
Page p = cache.get(id);
if (p==null)
{
p=getFromDataBase(id);
cache.store(p);
}
return p;
}
What I'd like to do is synchronize the retrieve on the id, e.g.
if (p==null)
{
synchronized (id)
{
..retrieve, store
}
}
Unfortunately this won't work because 2 separate calls can have the same Integer id value but a different Integer object, so they won't share the lock, and no synchronization will happen.
Is there a simple way of ensuring that you have the same Integer instance? For example, will this work:
synchronized (Integer.valueOf(id.intValue())){
The javadoc for Integer.valueOf() seems to imply that you're likely to get the same instance, but that doesn't look like a guarantee:
Returns a Integer instance representing the specified int value. If a new Integer instance is not required, this method should generally be used in preference to the constructor Integer(int), as this method is likely to yield significantly better space and time performance by caching frequently requested values.
So, any suggestions on how to get an Integer instance that's guaranteed to be the same, other than the more elaborate solutions like keeping a WeakHashMap of Lock objects keyed to the int? (nothing wrong with that, it just seems like there must be an obvious one-liner that I'm missing).

You really don't want to synchronize on an Integer, since you don't have control over what instances are the same and what instances are different. Java just doesn't provide such a facility (unless you're using Integers in a small range) that is dependable across different JVMs. If you really must synchronize on an Integer, then you need to keep a Map or Set of Integer so you can guarantee that you're getting the exact instance you want.
Better would be to create a new object, perhaps stored in a HashMap that is keyed by the Integer, to synchronize on. Something like this:
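public Page getPage(Integer id) {
Page p = cache.get(id);
if (p == null) {
synchronized (getCacheSyncObject(id)) {
// re-check: another thread may have stored the page while we waited for the lock
p = cache.get(id);
if (p == null) {
p = getFromDataBase(id);
cache.store(p);
}
}
}
return p;
}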
private ConcurrentMap<Integer, Integer> locks = new ConcurrentHashMap<Integer, Integer>();
private Object getCacheSyncObject(final Integer id) {
locks.putIfAbsent(id, id);
return locks.get(id);
}
To explain this code, it uses ConcurrentMap, which allows use of putIfAbsent. You could do this:
locks.putIfAbsent(id, new Object());
but then you incur the (small) cost of creating an Object for each access. To avoid that, I just save the Integer itself in the Map. What does this achieve? Why is this any different from just using the Integer itself?
When you do a get() from a Map, the keys are compared with equals() (or at least the method used is the equivalent of using equals()). Two different Integer instances of the same value will be equal to each other. Thus, you can pass any number of different Integer instances of "new Integer(5)" as the parameter to getCacheSyncObject and you will always get back only the very first instance that was passed in that contained that value.
There are reasons why you may not want to synchronize on Integer ... you can get into deadlocks if multiple threads are synchronizing on Integer objects and are thus unwittingly using the same locks when they want to use different locks. You can fix this risk by using the
locks.putIfAbsent(id, new Object());
version and thus incurring a (very) small cost to each access to the cache. Doing this, you guarantee that this class will be doing its synchronization on an object that no other class will be synchronizing on. Always a Good Thing.
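Putting the pieces together with the private-Object variant, as a sketch only. LockedPageCache, the loader function (standing in for getFromDataBase) and the loads counter are assumptions added for illustration, and the loader must not return null because ConcurrentHashMap rejects null values:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

// Hypothetical cache that synchronizes on one private lock object per id.
class LockedPageCache<P> {
    private final ConcurrentMap<Integer, P> cache = new ConcurrentHashMap<>();
    private final ConcurrentMap<Integer, Object> locks = new ConcurrentHashMap<>();
    private final IntFunction<P> loader; // stands in for getFromDataBase(id)
    final AtomicInteger loads = new AtomicInteger(); // demonstration only: counts real loads

    LockedPageCache(IntFunction<P> loader) {
        this.loader = loader;
    }

    // Equal ids (even distinct Integer instances) always map to the same lock object.
    Object lockFor(Integer id) {
        Object fresh = new Object();
        Object existing = locks.putIfAbsent(id, fresh);
        return existing != null ? existing : fresh;
    }

    P getPage(Integer id) {
        P page = cache.get(id);
        if (page == null) {
            synchronized (lockFor(id)) {
                // re-check: another thread may have loaded the page while we waited
                page = cache.get(id);
                if (page == null) {
                    loads.incrementAndGet();
                    page = loader.apply(id);
                    cache.put(id, page);
                }
            }
        }
        return page;
    }
}
```

Note that the locks map in this sketch only ever grows; with a large id space you would want some way of pruning it.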

Use a thread-safe map, such as ConcurrentHashMap. This will allow you to manipulate the map safely, but use a different lock to do the real computation. In this way you can have multiple computations running simultaneously with a single map.
Use ConcurrentMap.putIfAbsent, but instead of placing the actual value, use a Future that is computationally light to construct, possibly the FutureTask implementation. Run the computation and then get the result, which will block in a thread-safe way until done.

Integer.valueOf() only returns cached instances for a limited range. You haven't specified your range, but in general, this won't work.
However, I would strongly recommend you not take this approach, even if your values are in the correct range. Since these cached Integer instances are available to any code, you can't fully control the synchronization, which could lead to a deadlock. This is the same problem people have trying to lock on the result of String.intern().
The best lock is a private variable. Since only your code can reference it, you can guarantee that no deadlocks will occur.
By the way, using a WeakHashMap won't work either. If the instance serving as the key is unreferenced, it will be garbage collected. And if it is strongly referenced, you could use it directly.
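A small demonstration of the range issue (the class and method names are made up). The Integer.valueOf documentation promises caching only for -128 to 127; outside that range the outcome depends on the JVM, so no result is claimed for it here:

```java
// Hypothetical demo class showing why identity of boxed Integers is unreliable.
class IntegerIdentityDemo {
    // True only if both valueOf calls returned the very same object.
    static boolean sameInstance(int value) {
        return Integer.valueOf(value) == Integer.valueOf(value);
    }

    public static void main(String[] args) {
        // Inside the documented cache range: always the same instance.
        System.out.println("127  -> " + sameInstance(127));
        // Outside the range: may or may not be the same instance, depending on the JVM.
        System.out.println("1000 -> " + sameInstance(1000));
    }
}
```

Either way, a lock that any other code can obtain by calling Integer.valueOf with the same number is not a lock you control.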

Using synchronized on an Integer sounds really wrong by design.
If you need to synchronize each item individually only during retrieve/store, you can create a Set and store the currently locked items there. In other words,
// this contains only those IDs that are currently locked, that is, this
// will contain only very few IDs most of the time
Set<Integer> activeIds = ...
Object retrieve(Integer id) {
// acquire "lock" on item #id
synchronized(activeIds) {
while(activeIds.contains(id)) {
try {
activeIds.wait();
} catch(InterruptedException e){...}
}
activeIds.add(id);
}
try {
// do the retrieve here...
return value;
} finally {
// release lock on item #id
synchronized(activeIds) {
activeIds.remove(id);
activeIds.notifyAll();
}
}
}
The same goes for the store.
The bottom line is: there is no single line of code that solves this problem exactly the way you need.

How about a ConcurrentHashMap with the Integer objects as keys?

You could have a look at this code for creating a mutex from an ID. The code was written for String IDs, but could easily be edited for Integer objects.

As you can see from the variety of answers, there are various ways to skin this cat:
Goetz et al's approach of keeping a cache of FutureTasks works quite well in situations like this where you're "caching something anyway" so don't mind building up a map of FutureTask objects (and if you did mind the map growing, at least it's easy to make pruning it concurrent)
As a general answer to "how to lock on ID", the approach outlined by Antonio has the advantage that it's obvious when the map of locks is added to/removed from.
You may need to watch out for a potential issue with Antonio's implementation, namely that the notifyAll() will wake up threads waiting on all IDs when one of them becomes available, which may not scale very well under high contention. In principle, I think you can fix that by having a Condition object for each currently locked ID, which is then the thing that you await/signal. Of course, if in practice there's rarely more than one ID being waited on at any given time, then this isn't an issue.
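One way the Condition-per-ID idea could look, as an untested-in-production sketch: IdLocks is a hypothetical helper, it is not reentrant (locking the same ID twice from one thread would block forever), and a single ReentrantLock still guards the bookkeeping map:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-ID lock: waiters for one ID are not woken by releases of other IDs.
class IdLocks {
    private final ReentrantLock guard = new ReentrantLock();
    // Present key = ID currently locked; the value is what waiters for that ID await.
    private final Map<Integer, Condition> active = new HashMap<>();

    void lock(Integer id) throws InterruptedException {
        guard.lock();
        try {
            Condition released;
            // Loop handles spurious wakeups and losing the race to another waiter.
            while ((released = active.get(id)) != null) {
                released.await();
            }
            active.put(id, guard.newCondition());
        } finally {
            guard.unlock();
        }
    }

    void unlock(Integer id) {
        guard.lock();
        try {
            Condition released = active.remove(id);
            if (released != null) {
                // Wakes only the threads that were waiting on this particular ID.
                released.signalAll();
            }
        } finally {
            guard.unlock();
        }
    }
}
```

Callers would wrap the retrieve/store in lock(id) and a finally block calling unlock(id), just as with the Set-based version.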

Steve,
your proposed code has a bunch of problems with synchronization. (Antonio's does as well).
To summarize:
You need to cache an expensive object.
You need to make sure that while one thread is doing the retrieval, another thread does not also attempt to retrieve the same object.
That for n-threads all attempting to get the object only 1 object is ever retrieved and returned.
That for threads requesting different objects that they do not contend with each other.
pseudo code to make this happen (using a ConcurrentHashMap as the cache):
ConcurrentMap<Integer, java.util.concurrent.Future<Page>> cache = new ConcurrentHashMap<Integer, java.util.concurrent.Future<Page>>();
public Page getPage(Integer id) {
Future<Page> myFuture = new Future<Page>();
cache.putIfAbsent(id, myFuture);
Future<Page> actualFuture = cache.get(id);
if ( actualFuture == myFuture ) {
// I am the first w00t!
Page page = getFromDataBase(id);
myFuture.set(page);
}
return actualFuture.get();
}
Note:
java.util.concurrent.Future is an interface
java.util.concurrent.Future does not actually have a set() but look at the existing classes that implement Future to understand how to implement your own Future (Or use FutureTask)
Pushing the actual retrieval to a worker thread will almost certainly be a good idea.
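For reference, a compilable take on the pseudo code using FutureTask, in the spirit of the memoizer idea. FuturePageCache and its loader function (standing in for getFromDataBase) are assumptions, and the computation runs on the calling thread rather than a worker:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

// Hypothetical cache of Futures: at most one load per id, other callers wait on get().
class FuturePageCache<P> {
    private final ConcurrentMap<Integer, Future<P>> cache = new ConcurrentHashMap<>();
    private final Function<Integer, P> loader; // stands in for getFromDataBase(id)

    FuturePageCache(Function<Integer, P> loader) {
        this.loader = loader;
    }

    P getPage(Integer id) throws InterruptedException, ExecutionException {
        Future<P> future = cache.get(id);
        if (future == null) {
            Callable<P> load = () -> loader.apply(id);
            FutureTask<P> task = new FutureTask<>(load); // cheap to construct, nothing runs yet
            future = cache.putIfAbsent(id, task);
            if (future == null) {
                // We won the race: run the load here; everyone else blocks in get().
                future = task;
                task.run();
            }
        }
        return future.get();
    }
}
```

One thing this sketch leaves out: if the loader throws, the failed Future stays in the map and every later call for that id fails the same way, so real code would probably remove the entry on failure.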

See section 5.6 in Java Concurrency in Practice: "Building an efficient, scalable, result cache". It deals with the exact issue you are trying to solve. In particular, check out the memoizer pattern.
(source: umd.edu)
