Constructing multiple threads to output words to a map - java

I have a wordCount(CharacterReader charReader) function which takes a stream of characters and converts it into words.
I also have a Collection<CharacterReader> characterReaders containing multiple character streams. The number of readers in the collection can vary; I want to read from all of the streams and keep a running count of all the words.
I'm a little confused about threads and couldn't find any examples which were similar to this.
I essentially want multiple threads outputting their words into a SortedMap so I can have a real time total word count.
How would I go about doing this?
Thanks

If you are going to have multiple threads writing to the map, you need to use a ConcurrentSkipListMap, which is both a SortedMap and a ConcurrentMap.
For each CharacterReader in the collection you can create a Runnable which calls the wordCount function (and which accesses the map described above).
After creating the Runnables you can create an ExecutorService (for example using Executors.newCachedThreadPool()), pass it all the Runnables and wait for them to finish (see the example in the javadoc of the ExecutorService class).
You can also create each Runnable just before submitting it to the ExecutorService.
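A minimal sketch of that approach, assuming the CharacterReader type and a wordCount method like the one in the question (the class and helper names here are illustrative):

import java.util.Collection;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WordCountRunner {

    private final ConcurrentMap<String, Integer> counts = new ConcurrentSkipListMap<>();

    public void countAll(Collection<CharacterReader> readers) throws InterruptedException {
        ExecutorService executor = Executors.newCachedThreadPool();
        for (CharacterReader reader : readers) {
            executor.submit(() -> wordCount(reader));   // one task per character stream
        }
        executor.shutdown();                            // no new tasks accepted
        executor.awaitTermination(1, TimeUnit.HOURS);   // wait for all readers to finish
    }

    private void wordCount(CharacterReader reader) {
        // read words from the stream and update the shared map, e.g.
        // counts.merge(word, 1, Integer::sum);
    }
}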

Create a WordMap class which encapsulates your sorted map, and makes sure all the accesses to the map are properly synchronized. Or use a concurrent map that is already thread safe.
Create an instance of this class. Use the Executors class to create an ExecutorService with the characteristics that you desire.
Then iterate through the collection, and for each reader, create a Callable or a Runnable filling the WordMap instance with the words found in this reader, and submit this Callable or Runnable to the ExecutorService.
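A minimal sketch of such a WordMap wrapper, assuming a simple word-to-count mapping (the class and method names are illustrative):

import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

public class WordMap {

    private final SortedMap<String, Integer> counts = new TreeMap<>();

    public synchronized void add(String word) {
        counts.merge(word, 1, Integer::sum);   // insert 1 or increment the existing count
    }

    public synchronized SortedMap<String, Integer> snapshot() {
        // defensive copy so callers can iterate without holding the lock
        return Collections.unmodifiableSortedMap(new TreeMap<>(counts));
    }
}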

vainolo and JB's answers are both good.
I will add one thing, which is a description of how to make a highly concurrent data structure to store your word counts.
As vainolo said, a ConcurrentSkipListMap is the basic data structure you want, because it is both sorted and concurrent. To make good use of it, you want to avoid doing any locking, which means you must avoid patterns that involve a lock-read-write-unlock cycle. That has two consequences: putting a new word into the map should not involve a lock, and neither should incrementing the count of an existing word.
You can safely add new things to the map using ConcurrentMap's putIfAbsent method. However, that alone is not quite enough, because you have to supply a potential value every time you use it, which is potentially expensive. The easiest thing to do is to use a sort of double-checked locking pattern, where you first simply try to get an existing value, then if you find there isn't one, add a new one with putIfAbsent (you can't simply call put, because there could be a race between two threads putting at the same time).
Incrementing without locking can easily be done by not storing integers in the map, but rather objects which themselves contain integers. That way, you never have to put an incremented value in the map, you just increment the object already there. AtomicInteger seems like a good candidate for this.
Putting that together, you get:
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WordCounts {

    private final ConcurrentMap<String, AtomicInteger> counts =
            new ConcurrentSkipListMap<String, AtomicInteger>();

    public void count(String word) {
        AtomicInteger count = getCount(word);
        count.incrementAndGet();
    }

    private AtomicInteger getCount(String word) {
        AtomicInteger count = counts.get(word);
        if (count == null) {
            AtomicInteger newCount = new AtomicInteger();
            count = counts.putIfAbsent(word, newCount); // returns the existing counter if another thread won the race
            if (count == null) count = newCount;
        }
        return count;
    }
}

Related

Adding to AtomicInteger within ConcurrentHashMap

I have the following defined:

private ConcurrentMap<Integer, AtomicInteger> staffValues = new ConcurrentHashMap<Integer, AtomicInteger>();

private void add() {
    staffValues.replace(100, staffValues.get(100), new AtomicInteger(staffValues.get(100).addAndGet(200)));
}
After testing, the values I am getting are not expected, and I think there is a race condition here. Does anyone know if this would be considered threadsafe by wrapping the get call in the replace function?
A good way to handle situations like this is to use the computeIfAbsent method (not the compute method that @the8472 recommends).
computeIfAbsent accepts two arguments: the key, and a mapping function that is only called if no value is currently present for that key. Since an AtomicInteger is thread safe to increment from multiple threads, you can use it easily in the following manner:

staffValues.computeIfAbsent(100, k -> new AtomicInteger(0)).addAndGet(200);
There are a few issues with your code. The biggest is that you're ignoring the return value of ConcurrentHashMap.replace: if the replacement doesn't happen (because another thread made a replacement in parallel), you simply proceed as if it had. This is the main reason you're getting wrong results.
I also think it's a design mistake to mutate an AtomicInteger and then immediately replace it with a different AtomicInteger; even if you can get this working, there's simply no reason for it.
Lastly, I don't think you should call staffValues.get(100) twice. I don't think that causes a bug in the current code — your correctness depends only on the second call returning a "newer" result than the first, which I think is actually guaranteed by ConcurrentHashMap — but it's fragile and subtle and confusing. In general, when you call ConcurrentHashMap.replace, its third argument should be something you computed using the second.
Overall, you can simplify your code either by not using AtomicInteger:
private ConcurrentMap<Integer, Integer> staffValues = new ConcurrentHashMap<>();

private void add() {
    final Integer prevValue = staffValues.get(100);
    staffValues.replace(100, prevValue, prevValue + 200);
}
or by not using replace (and perhaps not even ConcurrentMap, depending on how else you're touching this map):
private Map<Integer, AtomicInteger> staffValues = new HashMap<>();

private void add() {
    staffValues.get(100).addAndGet(200);
}
You don't need to use replace(). AtomicInteger is a mutable value that does not need to be substituted whenever you want to increment it. In fact addAndGet already increments it in place.
Instead, use compute to put a default value (presumably 0) into the map when none is present, and otherwise get the pre-existing value and increment it.
If, on the other hand, you want to use immutable values, put Integer instances instead of AtomicInteger into the map and update them with the atomic compute/replace/merge operations.
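For illustration, a rough sketch of both variants applied to the staffValues example (the class and field names here are mine, not from the question):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class StaffValueExamples {

    // Variant 1: mutable AtomicInteger values, created on demand and incremented in place.
    private final ConcurrentMap<Integer, AtomicInteger> mutableValues = new ConcurrentHashMap<>();

    void addMutable() {
        mutableValues.computeIfAbsent(100, k -> new AtomicInteger(0)).addAndGet(200);
    }

    // Variant 2: immutable Integer values, updated atomically with merge.
    private final ConcurrentMap<Integer, Integer> immutableValues = new ConcurrentHashMap<>();

    void addImmutable() {
        immutableValues.merge(100, 200, Integer::sum); // puts 200 if absent, otherwise adds 200 atomically
    }
}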

Is this dictionary function thread-safe (ConcurrentHashMap+AtomicInteger)?

I need to write a really simple dictionary which will be append only. The dictionary will be shared between many threads. When any thread calls getId I want to make sure the same id is always returned for the same word, i.e. there should be only one id for any unique word.
Now obviously I could just synchronize access to the getId method, but that is not very fun. So I wondered if there was a lock-free way to achieve this.
In particular, I am wondering about the thread safety of using java.util.concurrent.ConcurrentHashMap#computeIfAbsent. The javadoc for the interface ConcurrentMap says:
The default implementation may retry these steps when multiple threads attempt updates including potentially calling the mapping function multiple times.
From that description, it is not clear to me if that means that the mapping function might be called more than once for the same key?
If that is the case (i.e. the mapper might be called more than once for the same key), then I think the following code is most likely not thread-safe as it could call getAndIncrement more than once for the same key (i.e. word).
If that is not the case, then I think the following code is thread-safe. Can anyone confirm?
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class Dictionary {

    private final AtomicInteger index = new AtomicInteger();
    private final ConcurrentHashMap<String, Integer> words = new ConcurrentHashMap<>();

    public int getId(final String word) {
        return words.computeIfAbsent(word, this::newId);
    }

    private int newId(final String word) {
        return index.getAndIncrement();
    }
}
This is guaranteed to be thread safe by the ConcurrentMap Javadoc (emphasis mine):
Actions in a thread prior to placing an object into a ConcurrentMap as a key or value happen-before actions subsequent to the access or removal of that object from the ConcurrentMap in another thread.
The ConcurrentHashMap Javadoc has an example similar to yours:
A ConcurrentHashMap can be used as a scalable frequency map (a form of histogram or multiset) by using LongAdder values and initializing via computeIfAbsent. For example, to add a count to a ConcurrentHashMap<String,LongAdder> freqs, you can use freqs.computeIfAbsent(key, k -> new LongAdder()).increment();
While this uses computeIfAbsent, it should be analogous to putIfAbsent.
The java.util.concurrent package Javadoc talks about "happens-before":
Chapter 17 of the Java Language Specification defines the happens-before relation on memory operations such as reads and writes of shared variables. The results of a write by one thread are guaranteed to be visible to a read by another thread only if the write operation happens-before the read operation.
And the Language Specification says:
Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.
So your code should be thread safe.
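For reference, a minimal sketch of the LongAdder frequency-map pattern quoted above (the class name is illustrative):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class FrequencyMap {

    private final ConcurrentHashMap<String, LongAdder> freqs = new ConcurrentHashMap<>();

    public void add(String word) {
        // for ConcurrentHashMap, computeIfAbsent applies the mapping function at most once per key
        freqs.computeIfAbsent(word, k -> new LongAdder()).increment();
    }

    public long count(String word) {
        LongAdder adder = freqs.get(word);
        return adder == null ? 0L : adder.sum();
    }
}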

How to fix non-atomic use of get/check/put?

I have a JSONArray which I am iterating to populate my Map as shown below. My ppJsonArray will have data like this -
[693,694,695,696,697,698,699,700,701,702]
Below is my code, which my static analysis tool flagged for a thread-safety issue -
Map<Integer, Integer> m = new HashMap<Integer, Integer>();
ConcurrentMap<String, Map<Integer, Integer>> partitionsToNodeMap = new ConcurrentHashMap<String, Map<Integer, Integer>>();
int hostNum = 2;

JSONArray ppJsonArray = j.getJSONArray("pp");
for (int i = 0; i < ppJsonArray.length(); i++) {
    m.put(Integer.parseInt(ppJsonArray.get(i).toString()), hostNum);
}

Map<Integer, Integer> tempMap = partitionsToNodeMap.get("PRIMARY");
if (tempMap != null) {
    tempMap.putAll(m);
} else {
    tempMap = m;
}
partitionsToNodeMap.put("PRIMARY", tempMap);
But when I am running static analysis tool, it is complaining as -
Non-atomic use of get/check/put on partitionsToNodeMap.put("PRIMARY", tempMap)
Does this mean my above code is not thread safe? How can I resolve this issue?
The above code is not thread safe.
Does it need to be thread safe? (i.e., is partitionsToNodeMap used by more than one thread? Could more than one thread run this routine? Or could thread A update partitionsToNodeMap in some other routine while thread B runs this routine?)
If you answered "yes" to any of those questions, then you probably need to use some kind of synchronization.
partitionsToNodeMap is a ConcurrentHashMap. That will prevent the map structure itself from becoming corrupt if it is updated by more than one thread at a time; but the data in the map presumably aren't just random strings and integers. They probably mean something to your program. The fact that the map structure itself is protected from corruption will not prevent the higher-level meaning of the map contents from becoming corrupt.
Can you provide an example of how I can protect this?
Not a complete one, because thread-safety is a property of the whole program. You can't do thread-safety function-by-function.
Being thread-safe is all about protecting invariants. An invariant is an assertion about your data that must always be true. For example, if you were modeling a game of Monopoly, one invariant would say that the total amount of money in the game must always be $15,140.
If some thread in the Monopoly game processes a payment by taking X dollars away from one player, and returning it to the bank, that's a two step process, and in-between the two steps the invariant is broken. If the first thread were preempted in-between the two steps, and some other thread counted all of the money in the game, it would get the wrong total.
The main use-case for the Java synchronized keyword (or equivalently, for the java.util.concurrent.locks.ReentrantLock class) is to prevent other threads from seeing broken invariants.
Either way of locking is voluntary. To make it work, you must wrap every block of code that can temporarily break an invariant in a protected block:

synchronized (bankLock) {
    deductNDollarsFrom(N, player);
    giveNDollarsTo(N, bank);
}
AND every block of code that cares about the invariant must also be wrapped in a protected block.
synchronized (bankLock) {
    int totalDollars = countAllMoneyInGame(...);
    if (totalDollars != 15140) {
        throw new CheatingDetectedException(...);
    }
}
Java won't let the balance transfer and the audit happen at the same time because it never allows two threads to synchronize on the same object (bankLock, in this case) at the same time.
You will have to figure out what your invariants are. The static analyzer is telling you that the get()...put() sequence looks like a block of code that might care about an invariant. You have to figure out whether it really does or not. Is there something that some other thread could do in-between the get() and the put() that could cause things to go south? If so then both blocks of code should synchronize on the same object so that they can not both be executed at the same time.
Your static analysis tool is confused because what you're doing looks like a classic race condition.
Map<Integer, Integer> tempMap = partitionsToNodeMap.get("PRIMARY"); // GET
if (tempMap != null) {                                              // CHECK
    tempMap.putAll(m);
} else {
    tempMap = m;
}
partitionsToNodeMap.put("PRIMARY", tempMap);                        // PUT
If another thread were to call partitionsToNodeMap.put("PRIMARY", ...) after you assign tempMap from the get, you would overwrite that thread's work, among a myriad of other potential problems. It seems like you don't have multiple threads accessing it, though, so it isn't an issue here. However, it would be more clearly expressed as:
Map<Integer, Integer> primaryMap = partitionsToNodeMap.get("PRIMARY");
if (primaryMap != null) {
    primaryMap.putAll(m);
} else {
    partitionsToNodeMap.put("PRIMARY", m);
}
If you want to make the static analysis tool happy, swap out your concurrent map for a regular map. The code you've provided doesn't require a threadsafe data structure.

Declaring a hashmap inside a method

Local variables are thread safe in Java. Is using a hashmap declared inside a method thread safe?
For Example-
void usingHashMap() {
    HashMap<Integer, String> map = new HashMap<Integer, String>();
}
When two threads run the same method, here usingHashMap(), they are in no way related. Each thread will create its own copy of every local variable, and these variables will not interact with each other in any way.
If variables aren't local, then they are attached to the instance. In this case, two threads running the same method both see the one variable, and this isn't thread safe.
public class usingHashMapNotThreadSafe {
    HashMap<Integer, String> map = new HashMap<Integer, String>();

    public void work() {
        // manipulating the shared hashmap here
    }
}

public class usingHashMapThreadSafe {
    public void worksafe() {
        HashMap<Integer, String> map = new HashMap<Integer, String>();
        // manipulating the local hashmap here
    }
}
In usingHashMapNotThreadSafe, two threads running on the same instance will see the same map. This could be dangerous, because both threads are trying to change it! In usingHashMapThreadSafe, two threads running on the same instance will each see a totally separate map, and they can't affect each other.
As long as the reference to the HashMap object is not published (is not passed to another method), it is threadsafe.
The same applies to the keys/values stored in the map. They need to be either immutable (cannot change their states after being created) or used only within this method.
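A small illustrative sketch of that idea: the map stays confined to the method while it is mutated and is only published in a form that can no longer change (all names here are made up):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LocalMapExample {

    public Map<Integer, String> buildLabels(int n) {
        Map<Integer, String> labels = new HashMap<>();   // local: no other thread can see it yet
        for (int i = 0; i < n; i++) {
            labels.put(i, "label-" + i);                 // String values are immutable
        }
        // published only after mutation is finished, wrapped so it cannot change afterwards
        return Collections.unmodifiableMap(labels);
    }
}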
I think that to ensure complete concurrency, a ConcurrentHashMap should be used in any case, even if it is local in scope. ConcurrentHashMap implements ConcurrentMap. The internal partitioning is essentially an attempt, as explained in the documentation, to reduce contention between concurrent updates:
The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention.

Creating a ConcurrentHashMap that supports "snapshots"

I'm attempting to create a ConcurrentHashMap that supports "snapshots" in order to provide consistent iterators, and am wondering if there's a more efficient way to do this. The problem is that if two iterators are created at the same time then they need to read the same values, and the definition of the concurrent hash map's weakly consistent iterators does not guarantee this to be the case. I'd also like to avoid locks if possible: there are several thousand values in the map and processing each item takes several dozen milliseconds, and I don't want to have to block writers during this time as this could result in writers blocking for a minute or longer.
What I have so far:
The ConcurrentHashMap's keys are Strings, and its values are instances of ConcurrentSkipListMap<Long, T>
When an element is added to the hashmap with putIfAbsent, then a new skiplist is allocated, and the object is added via skipList.put(System.nanoTime(), t).
To query the map, I use map.get(key).lastEntry().getValue() to return the most recent value. To query a snapshot (e.g. with an iterator), I use map.get(key).lowerEntry(iteratorTimestamp).getValue(), where iteratorTimestamp is the result of System.nanoTime() called when the iterator was initialized.
If an object is deleted, I use map.get(key).put(timestamp, SnapShotMap.DELETED), where DELETED is a static final object.
Questions:
Is there a library that already implements this? Or barring that, is there a data structure that would be more appropriate than the ConcurrentHashMap and the ConcurrentSkipListMap? My keys are comparable, so maybe some sort of concurrent tree would better support snapshots than a concurrent hash table.
How do I prevent this thing from continually growing? I can delete all of the skip list entries with keys less than X (except for the last key in the map) once all iterators that were initialized on or before X have completed, but I don't know of a good way to determine when that has happened. I can flag that an iterator has completed when its hasNext method returns false, but not all iterators are necessarily going to run to completion; I can keep a WeakReference to an iterator so that I can detect when it's been garbage collected, but I can't think of a good way to detect this other than by using a thread that iterates through the collection of weak references and then sleeps for several minutes. Ideally the thread would block on the WeakReference and be notified when the wrapped reference is GC'd, but I don't think this is an option.
ConcurrentSkipListMap<Long, WeakReference<Iterator>> iteratorMap;

while (true) {
    long latestGC = 0;
    for (Map.Entry<Long, WeakReference<Iterator>> entry : iteratorMap.entrySet()) {
        if (entry.getValue().get() == null) {
            iteratorMap.remove(entry.getKey());
            latestGC = entry.getKey();
        } else break;
    }
    // remove ConcurrentHashMap entries with timestamps less than `latestGC`
    Thread.sleep(300000); // five minutes
}
Edit: To clear up some confusion in the answers and comments, I'm currently passing weakly consistent iterators to code written by another division in the company, and they have asked me to increase the strength of the iterators' consistency. They are already aware of the fact that it is infeasible for me to make 100% consistent iterators, they just want a best effort on my part. They care more about throughput than iterator consistency, so coarse-grained locks are not an option.
What is your actual use case that requires a special implementation? From the Javadoc of ConcurrentHashMap (emphasis added):
Retrievals reflect the results of the most recently completed update operations holding upon their onset. ... Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration. They do not throw ConcurrentModificationException. However, iterators are designed to be used by only one thread at a time.
So the regular ConcurrentHashMap.values().iterator() will give you a "consistent" iterator, but only for one-time use by a single thread. If you need to use the same "snapshot" multiple times and/or by multiple threads, I suggest making a copy of the map.
EDIT: With the new information and the insistence for a "strongly consistent" iterator, I offer this solution. Please note that the use of a ReadWriteLock has the following implications:
Writes will be serialized (only one writer at a time) so write performance may be impacted.
Concurrent reads are allowed as long as there is no write in progress, so read performance impact should be minimal.
Active readers block writers but only as long as it takes to retrieve the reference to the current "snapshot". Once a thread has the snapshot, it no longer blocks writers no matter how long it takes to process the information in the snapshot.
Readers are blocked while any write is active; once the write finishes then all readers will have access to the new snapshot until a new write replaces it.
Consistency is achieved by serializing the writes and making a copy of the current values on each and every write. Readers that hold a reference to a "stale" snapshot can continue to use the old snapshot without worrying about modification, and the garbage collector will reclaim old snapshots as soon as no one is using it any more. It is assumed that there is no requirement for a reader to request a snapshot from an earlier point in time.
Because snapshots are potentially shared among multiple concurrent threads, the snapshots are read-only and cannot be modified. This restriction also applies to the remove() method of any Iterator instances created from the snapshot.
import java.util.*;
import java.util.concurrent.locks.*;

public class StackOverflow16600019<K, V> {

    private final ReadWriteLock locks = new ReentrantReadWriteLock();
    private final HashMap<K, V> map = new HashMap<>();
    private Collection<V> valueSnapshot = Collections.emptyList();

    public V put(K key, V value) {
        locks.writeLock().lock();
        try {
            V oldValue = map.put(key, value);
            updateSnapshot();
            return oldValue;
        } finally {
            locks.writeLock().unlock();
        }
    }

    public V remove(K key) {
        locks.writeLock().lock();
        try {
            V removed = map.remove(key);
            updateSnapshot();
            return removed;
        } finally {
            locks.writeLock().unlock();
        }
    }

    public Collection<V> values() {
        locks.readLock().lock();
        try {
            return valueSnapshot; // read-only!
        } finally {
            locks.readLock().unlock();
        }
    }

    /** Callers MUST hold the WRITE LOCK. */
    private void updateSnapshot() {
        valueSnapshot = Collections.unmodifiableCollection(
                new ArrayList<V>(map.values())); // copy
    }
}
I've found that the Ctrie is the ideal solution - it's a concurrent hash array mapped trie with constant-time snapshots.
Solution 1) What about just synchronizing on the puts and on the iteration? That should give you a consistent snapshot.
Solution 2) Start iterating and set a boolean flag to say so, then override put and putAll so that incoming changes go into a queue; when the iteration is finished, simply apply those queued puts with the changed values.
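A rough sketch of Solution 1, assuming a simple map wrapper (names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynchronizedSnapshotMap<K, V> {

    private final Map<K, V> map = new HashMap<>();

    public synchronized void put(K key, V value) {
        map.put(key, value);
    }

    public synchronized List<V> snapshotValues() {
        return new ArrayList<>(map.values());   // iterate this copy outside the lock
    }
}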
