I have a data structure like Map<Key, Set<Value>>. I'm trying to implement the following scenario:
Several producers update this map, adding new values either to already existing keys or to new keys (in which case new map entries are created).
A consumer periodically polls a limited number of entries from the map and passes them to a processor.
Here's my take:
private static final int MAX_UPDATES_PER_PASS = 100;
private final ConcurrentHashMap<Key, Set<Value>> updates = new ConcurrentHashMap<Key, Set<Value>>();

@Override
public void updatesReceived(Key key, Set<Value> values) {
    Set<Value> valuesSet = updates.get(key);
    if (valuesSet == null) {
        valuesSet = Collections.newSetFromMap(new ConcurrentHashMap<Value, Boolean>());
        Set<Value> previousValues = updates.putIfAbsent(key, valuesSet);
        if (previousValues != null) {
            valuesSet = previousValues;
        }
    }
    valuesSet.addAll(values);
}
private class UpdatesProcessor implements Runnable {
    @Override
    public void run() {
        int updatesProcessed = 0;
        Map<Key, Set<Value>> valuesToProcess = new HashMap<Key, Set<Value>>();
        Iterator<Map.Entry<Key, Set<Value>>> iterator = updates.entrySet().iterator();
        while (iterator.hasNext() && updatesProcessed < MAX_UPDATES_PER_PASS) {
            Map.Entry<Key, Set<Value>> next = iterator.next();
            iterator.remove(); // <-- here
            Key key = next.getKey();
            Set<Value> values = valuesToProcess.get(key);
            if (values == null) {
                values = new HashSet<Value>();
                valuesToProcess.put(key, values);
            }
            values.addAll(next.getValue());
            updatesProcessed++;
        }
        if (!valuesToProcess.isEmpty()) {
            process(valuesToProcess);
        }
    }
}
The method updatesReceived() is called by producers of values from arbitrary threads. The UpdatesProcessor is scheduled for periodic execution through a ScheduledExecutorService, so it too can run on arbitrary threads.
Every single value should be processed exactly once, no more, no less. I don't care whether a value gets processed sooner or later, but eventually it must be.
I want it to be fast and furious, so I don't want to synchronize everything.
This clumsy code with the iterator in the UpdatesProcessor serves one single goal, which could easily be achieved if there were something like ConcurrentHashMap.poll(). But there isn't.
So, to the questions. First, is this guaranteed to work or not? After I call iterator.remove() the entry is removed from the map, and any values arriving afterwards would go into a fresh entry's set, right?
And second, am I complicating things? Is there a common approach to (data structure for) this kind of scenario?
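(An aside on the poll() gap: the closest atomic substitute is Map.remove(key), which removes an entry and returns its value in one step. Below is a minimal drain() sketch, with my naming, not part of the original code. Note it still leaves a hole for the exactly-once requirement: a producer that fetched the set reference just before the removal can addAll() into the now-dead set, and those values are silently lost.)

private Map<Key, Set<Value>> drain(int maxEntries) {
    Map<Key, Set<Value>> drained = new HashMap<Key, Set<Value>>();
    Iterator<Key> it = updates.keySet().iterator();
    while (it.hasNext() && drained.size() < maxEntries) {
        Key key = it.next();
        // remove(key) atomically detaches the entry and hands us its set
        Set<Value> values = updates.remove(key);
        if (values != null) {
            drained.put(key, values);
        }
    }
    return drained;
}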
So, I have one map inside another one; for example, it might be a per-account word counter:
Map<Long, Map<String, Long>>
What is the proper thread-safe way to increment the counter?
I guess it's possible to use ConcurrentHashMap and LongAdder like the following:
private Map<Long, Map<GovernorLimitName, LongAdder>> status = new ConcurrentHashMap<>();

public void count(Long accountId, GovernorLimitName limitName) {
    status.putIfAbsent(accountId, new ConcurrentHashMap<GovernorLimitName, LongAdder>());
    synchronized (status.get(accountId)) {
        status.get(accountId).computeIfAbsent(limitName, k -> new LongAdder()).increment();
    }
}
I believe that synchronization here is required because of the race condition between getting the inner map and performing computeIfAbsent() on it; is that correct?
Update: I assume that both submaps and adders might be removed, because there might be other methods accessing that map.
There's no point in using ConcurrentHashMap if you need to synchronize anyway, and you're right that you do, because you're getting the value Map<GovernorLimitName, LongAdder> (which is done concurrently) and then fetching the LongAdder and incrementing it (which is not).
Instead of using Long, use AtomicLong, and change the implementation to use a regular HashMap.
You don't need the synchronized block, as long as submaps are never removed from status and adders are never removed from the submaps.
Creating a new ConcurrentHashMap that you will usually throw away is too expensive, though. Using the data structures you already have, you can do it like this:
public void count(Long accountId, GovernorLimitName limitName) {
    Map<GovernorLimitName, LongAdder> submap = status.computeIfAbsent(accountId,
            a -> new ConcurrentHashMap<GovernorLimitName, LongAdder>());
    LongAdder adder = submap.computeIfAbsent(limitName, k -> new LongAdder());
    adder.increment();
}
Instead of using putIfAbsent on the main map, just use the compute function (and do everything inside that function, even the inner-map work). Whatever you do inside the function is thread-safe if the root map is a ConcurrentHashMap; there is no need for a synchronized block if you do it this way.
Obviously, you will still need a ConcurrentHashMap with this approach (for both maps), since I guess you will be doing gets at some other point of your code, and otherwise you would have concurrency issues while reading data.
Other approaches could be taken instead of using ConcurrentHashMaps, but that's out of the scope of the question, and it's fine to use those implementations.
Here is some code (it might have typos, and the code style can be improved):
private Map<Long, Map<GovernorLimitName, LongAdder>> status = new ConcurrentHashMap<>();

public void count(Long accountId, GovernorLimitName limitName) {
    status.compute(accountId, (k, v) -> {
        if (v == null) {
            v = new ConcurrentHashMap<>();
        }
        v.compute(limitName, (k2, v2) -> {
            if (v2 == null) {
                v2 = new LongAdder();
            }
            v2.increment();
            return v2;
        });
        return v;
    });
}
We've recently had a discussion at my work about whether we need to use ConcurrentHashMap or whether we can simply use a regular HashMap in our multithreaded environment. The arguments for HashMap are two: it is faster than ConcurrentHashMap, so we should use it if possible; and ConcurrentModificationException apparently only appears when you iterate over the map as it is modified, so "if we only PUT and GET from the map, what is the problem with a regular HashMap?" was the argument.
I thought that concurrent PUTs, or a concurrent PUT and READ, could lead to exceptions, so I put together a test to show this. The test is simple: create 10 threads, each of which writes the same 1000 key-value pairs into the map again and again for 5 seconds, then print the resulting map.
The results were actually quite confusing:
Length:1299
Errors recorded: 0
I thought each key-value pair was unique in a HashMap, but looking through the map, I can find multiple key-value pairs that are identical. I expected either some kind of exception or corrupted keys or values, but I did not expect this. How does this occur?
Here's the code I used, for reference:
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentErrorTest
{
    static final long runtime = 5000;
    static final AtomicInteger errCount = new AtomicInteger();
    static final int count = 10;

    public static void main(String[] args) throws InterruptedException
    {
        List<Thread> threads = new LinkedList<>();
        final Map<String, Integer> map = getMap();

        for (int i = 0; i < count; i++)
        {
            Thread t = getThread(map);
            threads.add(t);
            t.start();
        }

        for (int i = 0; i < count; i++)
        {
            threads.get(i).join(runtime + 1000);
        }

        for (String s : map.keySet())
        {
            System.out.println(s + " " + map.get(s));
        }

        System.out.println("Length:" + map.size());
        System.out.println("Errors recorded: " + errCount.get());
    }

    private static Map<String, Integer> getMap()
    {
        Map<String, Integer> map = new HashMap<>();
        return map;
    }

    private static Map<String, Integer> getConcMap()
    {
        Map<String, Integer> map = new ConcurrentHashMap<>();
        return map;
    }

    private static Thread getThread(final Map<String, Integer> map)
    {
        return new Thread(new Runnable() {
            @Override
            public void run()
            {
                long start = System.currentTimeMillis();
                long now = start;
                while (now - start < runtime)
                {
                    try
                    {
                        for (int i = 0; i < 1000; i++)
                            map.put("i=" + i, i);
                        now = System.currentTimeMillis();
                    }
                    catch (Exception e)
                    {
                        System.out.println("P - Error occurred: " + e.toString());
                        errCount.incrementAndGet();
                    }
                }
            }
        });
    }
}
What you're faced with seems to be a TOCTTOU-class problem: time of check to time of use. (Yes, this kind of bug happens so often that it has its own name. :))
When you insert an entry into a map, at least the following two things need to happen:
Check whether the key already exists.
If the check returned true, update the existing entry; if it didn't, add a new one.
If these two don't happen atomically (as they would in a correctly synchronized map implementation), then several threads can conclude in step 1 that the key doesn't exist yet, but by the time they reach step 2, that is no longer true. So multiple threads will happily insert an entry with the same key.
Please note that this isn't the only problem that can happen, and depending on the implementation and your luck with visibility, you can get all kinds of different and unexpected failures.
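The usual fix is to let the map perform the check and the act as a single atomic operation. A minimal sketch with ConcurrentHashMap (the key name is borrowed from the test above):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AtomicCheckThenAct {
    public static void main(String[] args) {
        ConcurrentMap<String, Integer> map = new ConcurrentHashMap<>();
        // Steps 1 and 2 happen as one atomic operation inside the map,
        // so no two threads can both conclude that the key is absent.
        map.putIfAbsent("i=1", 1);
        // Atomic read-modify-write for the "update the existing entry" case:
        map.merge("i=1", 1, Integer::sum);
        System.out.println(map); // prints {i=1=2}
    }
}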
In a multithreaded environment, you should always use ConcurrentHashMap if you are going to perform any operation other than get.
Most of the time you won't get an exception, but you definitely can get corrupt data, because each thread may work on its own cached view of the map's data. When several threads perform a put and check for the key's existence, each of them can find it absent, and they all enter the data.
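For what it's worth, the test above already defines an unused getConcMap(); if I'm reading it right, switching to it should make the duplicates disappear:

// Hypothetical one-line change in the test's main():
final Map<String, Integer> map = getConcMap(); // instead of getMap()
// Expected output with 10 threads writing the same 1000 unique keys:
// Length:1000, because ConcurrentHashMap makes each put atomic.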
tldr: How can I search for an entry in multiple (read-only) Java HashMaps at the same time?
The long version:
I have several dictionaries of various sizes stored as HashMap<String, String>. Once they are read in, they are never changed (strictly read-only).
I want to check whether, and in which dictionary, an entry with my key is stored.
My code was originally looking for a key like this:
public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        if (map.containsKey(key))
            return new DictionaryEntry(map.get(key), i);
    }
    return null;
}
Then it got a little more complicated: my search string could contain typos, or it might be a variant of the stored entry. For example, if the stored key is "banana", I might look up "bannana" or "a banana" but still want the entry for "banana" returned. Using the Levenshtein distance, I now loop through all dictionaries and each entry in them:
public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        for (Map.Entry<String, String> entry : map.entrySet()) {
            // Calculate Levenshtein distance, store closest match etc.
        }
    }
    // return closest match or null.
}
So far everything works as it should, and I'm getting the entry I want. Unfortunately, I have to look up around 7000 strings in five dictionaries of various sizes (~30-70k entries each), and it takes a while. From my processing output I have the strong impression that the lookup dominates the overall runtime.
My first idea to improve the runtime was to search all dictionaries in parallel. Since none of the dictionaries is ever changed and no more than one thread accesses a dictionary at the same time, I don't see any safety concerns.
The question is just: how do I do this? I have never used multithreading before. My search only came up with ConcurrentHashMap (but to my understanding, I don't need this) and the Runnable class, where I'd have to put my processing into the run() method. I think I could rewrite my current class to fit into Runnable, but I was wondering whether there is maybe a simpler way to do this (or how I can do it simply with Runnable; right now my limited understanding says I'd have to restructure a lot).
Since I was asked to share the Levenshtein logic: it's really nothing fancy, but here you go:
private int _maxLSDistance = 10;
private int _lsDistance;

public Map.Entry<String, String> getClosestMatch(String key) {
    Map.Entry<String, String> closestMatch = null;
    if (key == null) {
        return null;
    }
    for (Map.Entry<String, String> entry : _dictionary.entrySet()) {
        // Perfect match
        if (entry.getKey().equals(key)) {
            return entry;
        }
        // Similar match
        int dist = StringUtils.getLevenshteinDistance(entry.getKey(), key);
        // Keep it if "dist" is below the threshold and below the distance of the stored entry
        if (dist < _maxLSDistance && (closestMatch == null || dist < _lsDistance)) {
            closestMatch = entry;
            _lsDistance = dist;
        }
    }
    return closestMatch;
}
In order to use multithreading in your case, it could look something like this.
The "monitor" class, which basically stores the results and coordinates the threads:
public class Results {

    private int nrOfDictionaries = 4;
    private ArrayList<String> results = new ArrayList<String>();

    public void prepare() {
        nrOfDictionaries = 4;
        results = new ArrayList<String>();
    }

    public synchronized void oneDictionaryFinished() {
        nrOfDictionaries--;
        System.out.println("one dictionary finished");
        notifyAll();
    }

    public synchronized boolean isReady() throws InterruptedException {
        while (nrOfDictionaries != 0) {
            wait();
        }
        return true;
    }

    public synchronized void addResult(String result) {
        results.add(result);
    }

    public ArrayList<String> getAllResults() {
        return results;
    }
}
The thread itself, which can be set to search a specific dictionary:
public class ThreadDictionarySearch extends Thread {

    // the actual dictionary
    private String dictionary;
    private Results results;

    public ThreadDictionarySearch(Results results, String dictionary) {
        this.dictionary = dictionary;
        this.results = results;
    }

    @Override
    public void run() {
        for (int i = 0; i < 4; i++) {
            // search dictionary;
            results.addResult("result of " + dictionary);
            System.out.println("adding result from " + dictionary);
        }
        results.oneDictionaryFinished();
    }
}
And the main method for demonstration:
public static void main(String[] args) throws Exception {
    Results results = new Results();
    ThreadDictionarySearch threadA = new ThreadDictionarySearch(results, "dictionary A");
    ThreadDictionarySearch threadB = new ThreadDictionarySearch(results, "dictionary B");
    ThreadDictionarySearch threadC = new ThreadDictionarySearch(results, "dictionary C");
    ThreadDictionarySearch threadD = new ThreadDictionarySearch(results, "dictionary D");
    threadA.start();
    threadB.start();
    threadC.start();
    threadD.start();
    if (results.isReady()) {
        // it stays here until all dictionaries are searched,
        // because isReady() in "Results" wait()s while not finished
        for (String string : results.getAllResults()) {
            System.out.println("RESULT: " + string);
        }
    }
}
I think the easiest would be to use a stream over the entry set:
public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        map.entrySet().parallelStream().forEach(entry -> {
            // Calculate Levenshtein distance, store closest match etc.
        });
    }
    // return closest match or null.
}
Provided you are using Java 8, of course. You could also wrap the outer loop into an IntStream as well. And you could directly use Stream.reduce to get the entry with the smallest distance.
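A sketch of that last point (my code; it assumes Java 8 and the same Apache Commons StringUtils.getLevenshteinDistance the question uses, and leaves out the distance threshold for brevity): min() over a parallel stream collapses the per-dictionary search into one expression.

import java.util.Comparator;
import java.util.Map;
import java.util.Optional;
import org.apache.commons.lang3.StringUtils; // assumed origin of getLevenshteinDistance

public class StreamSearch {
    // Closest entry by Levenshtein distance, computed in one parallel pass.
    static Optional<Map.Entry<String, String>> closest(Map<String, String> map, String key) {
        return map.entrySet().parallelStream()
                .min(Comparator.comparingInt((Map.Entry<String, String> e) ->
                        StringUtils.getLevenshteinDistance(e.getKey(), key)));
    }
}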
Maybe try thread pools:
ExecutorService es = Executors.newFixedThreadPool(_numDictionaries);
for (int i = 0; i < _numDictionaries; i++) {
    // prepare a Runnable implementation that contains the logic of your search
    es.submit(prepared_runnable);
}
I believe you may also try to find a quick estimate for strings that cannot possibly match (e.g. a significant difference in length), and use it to finish with a candidate as soon as possible and move on to the next one.
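(A possible shape for that prepared_runnable, sketched with Callable instead so each task can return its best local match directly. Match and bestIn() are hypothetical: Match just holds the matched value, its dictionary index, and its Levenshtein distance; _numDictionaries and getDictionary() are from the question.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public DictionaryEntry getEntryParallel(String key) throws Exception {
    ExecutorService es = Executors.newFixedThreadPool(_numDictionaries);
    try {
        // One task per dictionary; each returns its locally closest entry.
        List<Future<Match>> futures = new ArrayList<>();
        for (int i = 0; i < _numDictionaries; i++) {
            final int index = i;
            futures.add(es.submit(() -> bestIn(getDictionary(index), key, index)));
        }
        // Pick the global winner from the per-dictionary results.
        Match best = null;
        for (Future<Match> f : futures) {
            Match m = f.get(); // blocks until that dictionary has been searched
            if (m != null && (best == null || m.distance < best.distance)) {
                best = m;
            }
        }
        return best == null ? null : new DictionaryEntry(best.value, best.dictionaryIndex);
    } finally {
        es.shutdown();
    }
}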
I have strong doubts that HashMaps are a suitable solution here, especially if you want fuzzy matching and stop words. You should use a proper full-text search solution like Elasticsearch or Apache Solr, or at least an embeddable engine like Apache Lucene.
That being said, you can use a poor man's version: create an array of your maps and a SortedMap; iterate over the array, take the keys of the current HashMap, and store them in the SortedMap together with the index of their HashMap. To retrieve a key, you first search the SortedMap for said key, get the respective HashMap from the array using the index position, and look the key up in only that one HashMap. This should be fast enough without the need for multiple threads to dig through the HashMaps. However, you could turn the code below into a Runnable and run multiple lookups in parallel.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class Search {

    public static void main(String[] arg) {
        if (arg.length == 0) {
            System.out.println("Must give a search word!");
            System.exit(1);
        }
        String searchString = arg[0].toLowerCase();

        /*
         * Populating our HashMaps.
         */
        HashMap<String, String> english = new HashMap<String, String>();
        english.put("banana", "fruit");
        english.put("tomato", "vegetable");

        HashMap<String, String> german = new HashMap<String, String>();
        german.put("Banane", "Frucht");
        german.put("Tomate", "Gemüse");

        /*
         * Now we create our ArrayList of HashMaps for fast retrieval
         */
        List<HashMap<String, String>> maps = new ArrayList<HashMap<String, String>>();
        maps.add(english);
        maps.add(german);

        /*
         * This is our index
         */
        SortedMap<String, Integer> index = new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);

        /*
         * Populating the index:
         */
        for (int i = 0; i < maps.size(); i++) {
            // We iterate through our HashMaps...
            HashMap<String, String> currentMap = maps.get(i);
            for (String key : currentMap.keySet()) {
                /* ...and populate our index with lowercase versions of the keys,
                 * referencing the map from which the key originates.
                 */
                index.put(key.toLowerCase(), i);
            }
        }

        // In case our index contains our search string...
        if (index.containsKey(searchString)) {
            /*
             * ...we find out in which of the maps stored in `maps`
             * the word in the index originated.
             */
            Integer mapIndex = index.get(searchString);
            /*
             * Next, we look up said map.
             */
            HashMap<String, String> origin = maps.get(mapIndex);
            /*
             * Last, we retrieve the value from the origin map.
             */
            String result = origin.get(searchString);
            /*
             * The above steps can be shortened to
             * String result = maps.get(index.get(searchString).intValue()).get(searchString);
             */
            System.out.println(result);
        } else {
            System.out.println("\"" + searchString + "\" is not in the index!");
        }
    }
}
Please note that this is a rather naive implementation only provided for illustration purposes. It doesn't address several problems (you can't have duplicate index entries, for example).
With this solution, you are basically trading startup speed for query speed.
Okay, since your concern is to get a faster response, I would suggest you divide the work between threads.
Say you have 5 dictionaries: keep three dictionaries on one thread, and let another thread take care of the remaining two. Then whichever thread finds the match first halts or terminates the other thread; a sketch of this follows below.
You may need some extra logic to do that division of work, but that won't affect your performance time.
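(One possible sketch of that division, my code: ExecutorService.invokeAny() returns the result of the first task that completes successfully and cancels the rest, which matches the "halt the other thread" behaviour. Here exactMatchIn() and the dict1..dict5 references are hypothetical; the helper returns the value for the key, or throws if none of its dictionaries holds an exact match so that invokeAny() keeps waiting on the other task.)

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

String findFirstMatch(String key) throws Exception {
    ExecutorService es = Executors.newFixedThreadPool(2);
    try {
        // 3/2 split of the five dictionaries across two tasks
        List<Callable<String>> tasks = Arrays.asList(
                () -> exactMatchIn(key, dict1, dict2, dict3),
                () -> exactMatchIn(key, dict4, dict5));
        return es.invokeAny(tasks); // first successful result wins, the rest are cancelled
    } finally {
        es.shutdown();
    }
}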
And maybe you need a few more changes in your code to get your closest match. You currently have:
for (Map.Entry entry : _dictionary.entrySet()) {
You are iterating over the entry set, but you never use the values, and obtaining the entry set seems a bit more expensive. I would suggest you just use keySet(), since you are not really interested in the values of that map:
for (String key : _dictionary.keySet()) {
For more details on the performance of maps, please read this link: Map performances
Iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.
I'm trying to multithread an import job, but I'm running into a problem where it causes duplicate data. I need to keep my map outside of the loop so all my threads can update and read from it, but I can't do this without it being final, and with it being final I can't update the map. Currently I need to put my Map object in the run method, but the problem comes when the values are not initially in the database and each thread creates a new one. This results in duplicate data in the database. Does anybody know how to do some sort of callback to update my map outside?
ExecutorService executorService = Executors.newFixedThreadPool(10);
final Map<Integer, Object> map = new HashMap<>();
map.putAll(/* populate from database */);

for (int i = 0; i < 10; i++) {
    executorService.execute(new Runnable() {
        public void run() {
            while ((line = br.readLine()) != null) {
                if (map.containsKey(123)) {
                    // read map object
                    session.update(object);
                } else {
                    map.put(123, someObject);
                    session.save(object);
                }
                if (rowCount % 250 == 0)
                    tx.commit();
            }
        }
    });
}
executorService.shutdown();
You need to use some synchronization technique.
The problematic part is when different threads try to put data into the map.
Example:
Thread 1 checks whether there is an object with key 123 in the map. Before thread 1 has added the new object to the map, thread 2 runs and also checks for key 123. Both checks find nothing, and then both threads add object 123 to the map. This causes the duplicates...
You can read more about synchronization here
http://docs.oracle.com/javase/tutorial/essential/concurrency/sync.html
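(A minimal sketch of closing that window with a ConcurrentHashMap, my code, reusing the 123 key and someObject from the question's snippet: putIfAbsent() performs the check and the insert as one atomic step and returns null only for the single thread that actually inserted, so only that thread saves and every other thread updates.)

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

ConcurrentMap<Integer, Object> map = new ConcurrentHashMap<>();
if (map.putIfAbsent(123, someObject) == null) {
    // we won the race: this thread created the entry -> session.save(...)
} else {
    // the entry already existed -> read it and session.update(...)
}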
Based on your problem description, it appears that you want a map where the data is consistent and you always see the latest up-to-date data without missing any updates.
In this case, wrap your map with Collections.synchronizedMap(). This ensures that all reads and writes to the map are synchronized, so you are guaranteed to find a key using the latest data in the map and also guaranteed to write to the map exclusively.
Refer to this SO discussion for a difference between the concurrency techniques used with maps.
Also, one more thing: declaring a Map as final does not mean you cannot modify the map; you can definitely add and remove elements. What you cannot do is change the variable to point to another map. This is illustrated by the simple snippet below:
private final Map<Integer, String> testMap = Collections.synchronizedMap(new HashMap<Integer, String>());

testMap.put(1, "Tom"); // OK
testMap.remove(1); // OK
testMap = new HashMap<Integer, String>(); // ERROR!! Cannot reassign a variable with the final modifier
I would suggest the following solution:
Use a ConcurrentHashMap.
Don't update and commit inside your crawling threads.
Trigger the save and commit in a separate thread once your map reaches a critical size.
Pseudocode sample:
final Object lock = new Object();
...
executorService.execute(new Runnable() {
    public void run() {
        ...
        synchronized (lock) {
            if (concurrentMap.size() > 250) {
                // pseudocode: hand the accumulated values to the persister and clear them
                saveInASeparateThread(concurrentMap.values());
            }
        }
    }
});
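(Fleshing that pseudocode out a little, with my own naming: the crawling threads only fill the map, and the one thread that trips the threshold drains it under the lock. Draining through the iterator removes exactly the entries that were copied, so inserts racing with the drain are either batched now or kept for the next batch.)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BatchBuffer {
    private final Object lock = new Object();
    private final Map<Integer, Object> buffer = new ConcurrentHashMap<>();

    void offer(Integer key, Object value) {
        buffer.putIfAbsent(key, value);
        List<Object> batch = null;
        synchronized (lock) {
            if (buffer.size() >= 250) {
                batch = new ArrayList<>();
                for (Iterator<Object> it = buffer.values().iterator(); it.hasNext(); ) {
                    batch.add(it.next());
                    it.remove(); // remove exactly what we copied into the batch
                }
            }
        }
        if (batch != null) {
            saveInASeparateThread(batch); // hypothetical persister, as in the pseudocode
        }
    }

    private void saveInASeparateThread(List<Object> batch) {
        // hand the batch off to a persistence executor (sketch only)
    }
}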
The following logic resolved my issue. The code below isn't tested.
ExecutorService executorService = Executors.newFixedThreadPool(10);
final Map<Integer, Object> map = new ConcurrentHashMap<>();
map.putAll(myObjectList);
List<Future> futures = new ArrayList<>();

for (int i = 0; i < 10; i++) {
    final int thread = i;
    Future future = executorService.submit(new Callable<String>() {
        public String call() throws Exception {
            List<MyObject> list;
            CSVReader reader = new CSVReader(new InputStreamReader(csvFile.getStream()));
            list = bean.parse(strategy, reader);
            int listSize = list.size();
            int rowCount = 0;
            for (MyObject myObject : list) {
                rowCount++;
                Integer key = myObject.getId();
                if (map.putIfAbsent(key, myObject) == null) {
                    session.save(myObject);
                } else {
                    myObject = map.get(key);
                    // Do something
                    session.update(myObject);
                }
                if (rowCount % 250 == 0 || rowCount == listSize) {
                    tx.flush();
                    tx.clear();
                }
            }
            tx.commit();
            return "Thread " + thread + " completed.";
        }
    });
    futures.add(future);
}

for (Future future : futures) {
    System.out.println(future.get());
}
executorService.shutdown();
I'm trying to support modification (the deactivate() call) of the following data structure in a thread-safe manner:
private static Map<String, Set<DBPartitionId>> dbPartitionStatus = new HashMap<String, Set<DBPartitionId>>();
public void deactivate(DBPartitionId partition) throws Exception {
    synchronized (dbPartitionStatus) {
        Set<DBPartitionId> partitions = dbPartitionStatus.get(serviceName);
        if (partitions == null) {
            partitions = new HashSet<DBPartitionId>();
        }
        partitions.add(partition);
        dbPartitionStatus.put(serviceName, partitions);
    }
}
If I were to replace the synchronization with the ConcurrentHashMap & ConcurrentSkipListSet duo, there would be a race condition.
I was wondering whether there is a cleaner way of achieving synchronization here (using java.util.concurrent).
There should be no race conditions with the following implementation:
private final static ConcurrentMap<String, Set<DBPartitionId>> dbPartitionStatus =
        new ConcurrentHashMap<String, Set<DBPartitionId>>();

public void deactivate(DBPartitionId partition) {
    Set<DBPartitionId> partitions = dbPartitionStatus.get(serviceName);
    if (partitions == null) {
        partitions = new ConcurrentSkipListSet<DBPartitionId>();
        Set<DBPartitionId> p = dbPartitionStatus.putIfAbsent(serviceName, partitions);
        if (p != null) {
            partitions = p;
        }
    }
    partitions.add(partition);
}
I personally cannot see any issues with this sort of approach:
private static ConcurrentHashMap<String, ConcurrentSkipListSet<DBPartitionId>> dbPartitionStatus = new ConcurrentHashMap<>();

public boolean deactivate(DBPartitionId partition) throws Exception {
    ConcurrentSkipListSet<DBPartitionId> partitions = dbPartitionStatus.get(serviceName);
    if (partitions == null) {
        // Create a new set
        partitions = new ConcurrentSkipListSet<DBPartitionId>();
        // Attempt to add it atomically; if we added it, ev will be null.
        ConcurrentSkipListSet<DBPartitionId> ev = dbPartitionStatus.putIfAbsent(serviceName, partitions);
        // If non-null, someone else added it first, so use theirs.
        if (ev != null) {
            partitions = ev;
        }
    }
    // will return true if added successfully...
    return partitions.add(partition);
}
You could also call putIfAbsent() on the map unconditionally, which does the get/put on the map in one "atomic" operation; however, it has the additional overhead in this case that you have to construct an empty set to pass in on every call.
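(On Java 8+, computeIfAbsent() gives the same atomicity without that overhead, because the factory lambda only runs when the key is actually absent. A sketch against the map declared above:)

// Java 8+: the set is created only when the key is really absent,
// and the lookup-or-insert happens atomically inside the map.
public boolean deactivate(DBPartitionId partition) {
    return dbPartitionStatus
            .computeIfAbsent(serviceName, s -> new ConcurrentSkipListSet<DBPartitionId>())
            .add(partition);
}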