ConcurrentHashMap: avoid extra object creation with "putIfAbsent"? - java

I am aggregating multiple values for keys in a multi-threaded environment. The keys are not known in advance. I thought I would do something like this:
class Aggregator {
protected ConcurrentHashMap<String, List<String>> entries =
new ConcurrentHashMap<String, List<String>>();
public Aggregator() {}
public void record(String key, String value) {
List<String> newList =
Collections.synchronizedList(new ArrayList<String>());
List<String> existingList = entries.putIfAbsent(key, newList);
List<String> values = existingList == null ? newList : existingList;
values.add(value);
}
}
The problem I see is that every time this method runs, I need to create a new instance of an ArrayList, which I then throw away (in most cases). This seems like unjustified abuse of the garbage collector. Is there a better, thread-safe way of initializing this kind of a structure without having to synchronize the record method? I am somewhat surprised by the decision to have the putIfAbsent method not return the newly-created element, and by the lack of a way to defer instantiation unless it is called for (so to speak).

Java 8 introduced an API to cater for this exact problem, making a 1-line solution:
public void record(String key, String value) {
entries.computeIfAbsent(key, k -> Collections.synchronizedList(new ArrayList<String>())).add(value);
}
For Java 7:
public void record(String key, String value) {
List<String> values = entries.get(key);
if (values == null) {
entries.putIfAbsent(key, Collections.synchronizedList(new ArrayList<String>()));
// At this point, there will definitely be a list for the key.
// We don't know or care which thread's new object is in there, so:
values = entries.get(key);
}
values.add(value);
}
This is the standard code pattern when populating a ConcurrentHashMap.
The special method putIfAbsent(K, V)) will either put your value object in, or if another thread got before you, then it will ignore your value object. Either way, after the call to putIfAbsent(K, V)), get(key) is guaranteed to be consistent between threads and therefore the above code is threadsafe.
The only wasted overhead is if some other thread adds a new entry at the same time for the same key: You may end up throwing away the newly created value, but that only happens if there is not already an entry and there's a race that your thread loses, which would typically be rare.

As of Java-8 you can create Multi Maps using the following pattern:
public void record(String key, String value) {
entries.computeIfAbsent(key,
k -> Collections.synchronizedList(new ArrayList<String>()))
.add(value);
}
The ConcurrentHashMap documentation (not the general contract) specifies that the ArrayList will only be created once for each key, at the slight initial cost of delaying updates while the ArrayList is being created for a new key:
http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html#computeIfAbsent-K-java.util.function.Function-

In the end, I implemented a slight modification of #Bohemian's answer. His proposed solution overwrites the values variable with the putIfAbsent call, which creates the same problem I had before. The code that seems to work looks like this:
public void record(String key, String value) {
List<String> values = entries.get(key);
if (values == null) {
values = Collections.synchronizedList(new ArrayList<String>());
List<String> values2 = entries.putIfAbsent(key, values);
if (values2 != null)
values = values2;
}
values.add(value);
}
It's not as elegant as I'd like, but it's better than the original that creates a new ArrayList instance at every call.

Created two versions based on Gene's answer
public static <K,V> void putIfAbsetMultiValue(ConcurrentHashMap<K,List<V>> entries, K key, V value) {
List<V> values = entries.get(key);
if (values == null) {
values = Collections.synchronizedList(new ArrayList<V>());
List<V> values2 = entries.putIfAbsent(key, values);
if (values2 != null)
values = values2;
}
values.add(value);
}
public static <K,V> void putIfAbsetMultiValueSet(ConcurrentMap<K,Set<V>> entries, K key, V value) {
Set<V> values = entries.get(key);
if (values == null) {
values = Collections.synchronizedSet(new HashSet<V>());
Set<V> values2 = entries.putIfAbsent(key, values);
if (values2 != null)
values = values2;
}
values.add(value);
}
It works well

This is a problem I also looked for an answer. The method putIfAbsent does not actually solve the extra object creation problem, it just makes sure that one of those objects doesn't replace another. But the race conditions among threads can cause multiple object instantiation. I could find 3 solutions for this problem (And I would follow this order of preference):
1- If you are on Java 8, the best way to achieve this is probably the new computeIfAbsent method of ConcurrentMap. You just need to give it a computation function which will be executed synchronously (at least for the ConcurrentHashMap implementation). Example:
private final ConcurrentMap<String, List<String>> entries =
new ConcurrentHashMap<String, List<String>>();
public void method1(String key, String value) {
entries.computeIfAbsent(key, s -> new ArrayList<String>())
.add(value);
}
This is from the javadoc of ConcurrentHashMap.computeIfAbsent:
If the specified key is not already associated with a value, attempts
to compute its value using the given mapping function and enters it
into this map unless null. The entire method invocation is performed
atomically, so the function is applied at most once per key. Some
attempted update operations on this map by other threads may be
blocked while computation is in progress, so the computation should be
short and simple, and must not attempt to update any other mappings of
this map.
2- If you cannot use Java 8, you can use Guava's LoadingCache, which is thread-safe. You define a load function to it (just like the compute function above), and you can be sure that it'll be called synchronously. Example:
private final LoadingCache<String, List<String>> entries = CacheBuilder.newBuilder()
.build(new CacheLoader<String, List<String>>() {
#Override
public List<String> load(String s) throws Exception {
return new ArrayList<String>();
}
});
public void method2(String key, String value) {
entries.getUnchecked(key).add(value);
}
3- If you cannot use Guava either, you can always synchronise manually and do a double-checked locking. Example:
private final ConcurrentMap<String, List<String>> entries =
new ConcurrentHashMap<String, List<String>>();
public void method3(String key, String value) {
List<String> existing = entries.get(key);
if (existing != null) {
existing.add(value);
} else {
synchronized (entries) {
List<String> existingSynchronized = entries.get(key);
if (existingSynchronized != null) {
existingSynchronized.add(value);
} else {
List<String> newList = new ArrayList<>();
newList.add(value);
entries.put(key, newList);
}
}
}
}
I made an example implementation of all those 3 methods and additionally, the non-synchronized method, which causes extra object creation: http://pastebin.com/qZ4DUjTr

Waste of memory (also GC etc.) that Empty Array list creation problem is handled with Java 1.7.40. Don't worry about creating empty arraylist.
Reference : http://javarevisited.blogspot.com.tr/2014/07/java-optimization-empty-arraylist-and-Hashmap-cost-less-memory-jdk-17040-update.html

The approach with putIfAbsent has the fastest execution time, it is from 2 to 50 times faster than the "lambda" approach in evironments with high contention. The Lambda isn't the reason behind this "powerloss", the issue is the compulsory synchronisation inside of computeIfAbsent prior to the Java-9 optimisations.
the benchmark:
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
public class ConcurrentHashMapTest {
private final static int numberOfRuns = 1000000;
private final static int numberOfThreads = Runtime.getRuntime().availableProcessors();
private final static int keysSize = 10;
private final static String[] strings = new String[keysSize];
static {
for (int n = 0; n < keysSize; n++) {
strings[n] = "" + (char) ('A' + n);
}
}
public static void main(String[] args) throws InterruptedException {
for (int n = 0; n < 20; n++) {
testPutIfAbsent();
testComputeIfAbsentLamda();
}
}
private static void testPutIfAbsent() throws InterruptedException {
final AtomicLong totalTime = new AtomicLong();
final ConcurrentHashMap<String, AtomicInteger> map = new ConcurrentHashMap<String, AtomicInteger>();
final Random random = new Random();
ExecutorService executorService = Executors.newFixedThreadPool(numberOfThreads);
for (int i = 0; i < numberOfThreads; i++) {
executorService.execute(new Runnable() {
#Override
public void run() {
long start, end;
for (int n = 0; n < numberOfRuns; n++) {
String s = strings[random.nextInt(strings.length)];
start = System.nanoTime();
AtomicInteger count = map.get(s);
if (count == null) {
count = new AtomicInteger(0);
AtomicInteger prevCount = map.putIfAbsent(s, count);
if (prevCount != null) {
count = prevCount;
}
}
count.incrementAndGet();
end = System.nanoTime();
totalTime.addAndGet(end - start);
}
}
});
}
executorService.shutdown();
executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
System.out.println("Test " + Thread.currentThread().getStackTrace()[1].getMethodName()
+ " average time per run: " + (double) totalTime.get() / numberOfThreads / numberOfRuns + " ns");
}
private static void testComputeIfAbsentLamda() throws InterruptedException {
final AtomicLong totalTime = new AtomicLong();
final ConcurrentHashMap<String, AtomicInteger> map = new ConcurrentHashMap<String, AtomicInteger>();
final Random random = new Random();
ExecutorService executorService = Executors.newFixedThreadPool(numberOfThreads);
for (int i = 0; i < numberOfThreads; i++) {
executorService.execute(new Runnable() {
#Override
public void run() {
long start, end;
for (int n = 0; n < numberOfRuns; n++) {
String s = strings[random.nextInt(strings.length)];
start = System.nanoTime();
AtomicInteger count = map.computeIfAbsent(s, (k) -> new AtomicInteger(0));
count.incrementAndGet();
end = System.nanoTime();
totalTime.addAndGet(end - start);
}
}
});
}
executorService.shutdown();
executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
System.out.println("Test " + Thread.currentThread().getStackTrace()[1].getMethodName()
+ " average time per run: " + (double) totalTime.get() / numberOfThreads / numberOfRuns + " ns");
}
}
The results:
Test testPutIfAbsent average time per run: 115.756501 ns
Test testComputeIfAbsentLamda average time per run: 276.9667055 ns
Test testPutIfAbsent average time per run: 134.2332435 ns
Test testComputeIfAbsentLamda average time per run: 223.222063625 ns
Test testPutIfAbsent average time per run: 119.968893625 ns
Test testComputeIfAbsentLamda average time per run: 216.707419875 ns
Test testPutIfAbsent average time per run: 116.173902375 ns
Test testComputeIfAbsentLamda average time per run: 215.632467375 ns
Test testPutIfAbsent average time per run: 112.21422775 ns
Test testComputeIfAbsentLamda average time per run: 210.29563725 ns
Test testPutIfAbsent average time per run: 120.50643475 ns
Test testComputeIfAbsentLamda average time per run: 200.79536475 ns

Related

Java Hashmap - Multiple thread put

We've recently had a discussion at my work about whether we need to use ConcurrentHashMap or if we can simply use regular HashMap, in our multithreaded environment. The argument for HashMaps are two: it is faster then the ConcurrentHashMap, so we should use it if possible. And ConcurrentModificationException apparently only appears as you iterate over the Map as it is modified, so "if we only PUT and GET from the map, what is the problem with the regular HashMap?" was the arguments.
I thought that concurrent PUT actions or concurrent PUT and READ could lead to exceptions, so I put together a test to show this. The test is simple; create 10 threads, each which writes the same 1000 key-value pairs into the map again-and-again for 5 seconds, then print the resulting map.
The results were quite confusing actually:
Length:1299
Errors recorded: 0
I thought each key-value pair was unique in a HashMap, but looking through the map, I can find multiple Key-Value pairs that are identical. I expected either some kind of exception or corrupted keys or values, but I did not expect this. How does this occur?
Here's the code I used, for reference:
public class ConcurrentErrorTest
{
static final long runtime = 5000;
static final AtomicInteger errCount = new AtomicInteger();
static final int count = 10;
public static void main(String[] args) throws InterruptedException
{
List<Thread> threads = new LinkedList<>();
final Map<String, Integer> map = getMap();
for (int i = 0; i < count; i++)
{
Thread t = getThread(map);
threads.add(t);
t.start();
}
for (int i = 0; i < count; i++)
{
threads.get(i).join(runtime + 1000);
}
for (String s : map.keySet())
{
System.out.println(s + " " + map.get(s));
}
System.out.println("Length:" + map.size());
System.out.println("Errors recorded: " + errCount.get());
}
private static Map<String, Integer> getMap()
{
Map<String, Integer> map = new HashMap<>();
return map;
}
private static Map<String, Integer> getConcMap()
{
Map<String, Integer> map = new ConcurrentHashMap<>();
return map;
}
private static Thread getThread(final Map<String, Integer> map)
{
return new Thread(new Runnable() {
#Override
public void run()
{
long start = System.currentTimeMillis();
long now = start;
while (now - start < runtime)
{
try
{
for (int i = 0; i < 1000; i++)
map.put("i=" + i, i);
now = System.currentTimeMillis();
}
catch (Exception e)
{
System.out.println("P - Error occured: " + e.toString());
errCount.incrementAndGet();
}
}
}
});
}
}
What you're faced with seems to be a TOCTTOU class problem. (Yes, this kind of bug happens so often, it's got its own name. :))
When you insert an entry into a map, at least the following two things need to happen:
Check whether the key already exists.
If the check returned true, update the existing entry, if it didn't, add a new one.
If these two don't happen atomically (as they would in a correctly synchronized map implementation), then several threads can come to the conclusion that the key doesn't exist yet in step 1, but by the time they reach step 2, that isn't true any more. So multiple threads will happily insert an entry with the same key.
Please note that this isn't the only problem that can happen, and depending on the implementation and your luck with visibility, you can get all kinds of different and unexpected failures.
In multi thread environment, you should always use CuncurrentHashMap, if you are going to perform any operation except get.
Most of the time you won't get an exception, but definitely get the corrupt data because of the thread local copy value.
Every thread has its own copy of the Map data when performing the put operation and when they check for key existence, multiple threads found it false and they enter the data.

Get an Object from a collection without looping in java

I need to repeatedly (hundred of thousands of times) retrieve an element (different each time) from a Collection which contains dozens of thousand of Objects.
What is the quickest way to do this retrieval operation? At the moment my Collection is a List and I iterate on it until I have found the element, but is there a quicker way? Using a Map maybe? I was thinking to do:
Putting the Objects in a Map, with the key being the id field of the Object, and the Object itself being the value.
Then doing get(id) on the Map should be much faster than looping through a List.
If that is a correct way to do it, should I use a HashMap or TreeMap? - my objects have no particular ordering.
Any advice on the matter would be appreciated!
Last note: if an external library provides a tool to answer this I'd take it gladly!
As per the documentation of the Tree Map (emphasis my own):
The map is sorted according to the natural ordering of its keys,
or by a Comparator provided at map creation time, depending on which
constructor is used.
In your case, you state that the items have no particular order and it does not seem that you are after any particular order, but rather just be able to retrieve data as fast as possible.
HashMaps provide constant read time but do not guarantee order, so I think that you should go with HashMaps:
This class makes no guarantees as to the order of the map; in
particular, it does not guarantee that the order will remain constant
over time. This implementation provides constant-time performance for
the basic operations (get and put), assuming the hash function
disperses the elements properly among the buckets.
As a side note, the memory footprint of this can get quite high quite fast, so it might also be a good idea to look into a database approach and maybe use a cache like mechanism to handle more frequently used information.
I've created code which tests the proformance of BinarySearch, TreeMap and HashMap for the given problem.
In case you are rebuilding the collection each time, HashMap is the fastest (even with standard Object's hashCode() implementation!), sort+array binary search goes second and a TreeMap is last (due to complex rebuilding procedure).
proc array: 2395
proc tree : 4413
proc hash : 1325
If you are not rebuilding the collection, HashMap is still the fastest, an array's binary search is second and a TreeMap is the slowest, but with only slightly lower speed than array.
proc array: 506
proc tree : 561
proc hash : 122
Test code:
public class SearchSpeedTest {
private List<DataObject> data;
private List<Long> ids;
private Map<Long, DataObject> hashMap;
private Map<Long, DataObject> treeMap;
private int numRep;
private int dataAmount;
private boolean rebuildEachTime;
static class DataObject implements Comparable<DataObject>{
Long id;
public DataObject(Long id) {
super();
this.id = id;
}
public DataObject() {
// TODO Auto-generated constructor stub
}
#Override
public final int compareTo(DataObject o) {
return Long.compare(id, o.id);
}
public Long getId() {
return id;
}
public void setId(Long id) {
this.id = id;
}
public void dummyCode() {
}
}
#FunctionalInterface
public interface Procedure {
void execute();
}
public void testSpeeds() {
rebuildEachTime = true;
numRep = 100;
dataAmount = 60_000;
data = new ArrayList<>(dataAmount);
ids = new ArrayList<>(dataAmount);
Random gen = new Random();
for (int i=0; i< dataAmount; i++) {
long id = i*7+gen.nextInt(7);
ids.add(id);
data.add(new DataObject(id));
}
Collections.sort(data);
treeMap = new TreeMap<Long, DataObject>();
populateMap(treeMap);
hashMap = new HashMap<Long, SearchSpeedTest.DataObject>();
populateMap(hashMap);
Procedure[] procedures = new Procedure[] {this::testArray, this::testTreeMap, this::testHashMap};
String[] names = new String[] {"array", "tree ", "hash "};
for (int n=0; n<procedures.length; n++) {
Procedure proc = procedures[n];
long startTime = System.nanoTime();
for (int i=0; i<numRep; i++) {
if (rebuildEachTime) {
Collections.shuffle(data);
}
proc.execute();
}
long endTime = System.nanoTime();
long diff = endTime - startTime;
System.out.println("proc "+names[n]+":\t"+(diff/1_000_000));
}
}
void testHashMap() {
if (rebuildEachTime) {
hashMap = new HashMap<Long, SearchSpeedTest.DataObject>();
populateMap(hashMap);
}
testMap(hashMap);
}
void testTreeMap() {
if (rebuildEachTime) {
treeMap = new TreeMap<Long, SearchSpeedTest.DataObject>();
populateMap(treeMap);
}
testMap(treeMap);
}
void testMap(Map<Long, DataObject> map) {
for (Long id: ids) {
DataObject ret = map.get(id);
ret.dummyCode();
}
}
void populateMap(Map<Long, DataObject> map) {
for (DataObject dataObj : data) {
map.put(dataObj.getId(), dataObj);
}
}
void testArray() {
if (rebuildEachTime) {
Collections.sort(data);
}
DataObject key = new DataObject();
for (Long id: ids) {
key.setId(id);
DataObject ret = data.get(Collections.binarySearch(data, key));
ret.dummyCode();
}
}
public static void main(String[] args) {
new SearchSpeedTest().testSpeeds();
}
}
HashMap will be more efficient in general, so use it whenever you don't care about the order of the keys.
when you want to keep your entries in Map sorted by key than use TreeMap but sorting will be overhead in your case as you dont want any particulate order.
You can use a map if you have a good way to define the key of the map. In the worst case you can use your object as key and value.
As ordering is not important use a HashMap. To maintain the order in a TreeMap there is additional cost when adding an element, as it must be added at the correct position.

Search multiple HashMaps at the same time

tldr: How can I search for an entry in multiple (read-only) Java HashMaps at the same time?
The long version:
I have several dictionaries of various sizes stored as HashMap< String, String >. Once they are read in, they are never to be changed (strictly read-only).
I want to check whether and which dictionary had stored an entry with my key.
My code was originally looking for a key like this:
public DictionaryEntry getEntry(String key) {
for (int i = 0; i < _numDictionaries; i++) {
HashMap<String, String> map = getDictionary(i);
if (map.containsKey(key))
return new DictionaryEntry(map.get(key), i);
}
return null;
}
Then it got a little more complicated: my search string could contain typos, or was a variant of the stored entry. Like, if the stored key was "banana", it is possible that I'd look up "bannana" or "a banana", but still would like the entry for "banana" returned. Using the Levenshtein-Distance, I now loop through all dictionaries and each entry in them:
public DictionaryEntry getEntry(String key) {
for (int i = 0; i < _numDictionaries; i++) {
HashMap<String, String> map = getDictionary(i);
for (Map.Entry entry : map.entrySet) {
// Calculate Levenshtein distance, store closest match etc.
}
}
// return closest match or null.
}
So far everything works as it should and I'm getting the entry I want. Unfortunately I have to look up around 7000 strings, in five dictionaries of various sizes (~ 30 - 70k entries) and it takes a while. From my processing output I have the strong impression my lookup dominates overall runtime.
My first idea to improve runtime was to search all dictionaries parallely. Since none of the dictionaries is to be changed and no more than one thread is accessing a dictionary at the same time, I don't see any safety concerns.
The question is just: how do I do this? I have never used multithreading before. My search only came up with Concurrent HashMaps (but to my understanding, I don't need this) and the Runnable-class, where I'd have to put my processing into the method run(). I think I could rewrite my current class to fit into Runnable, but I was wondering if there is maybe a simpler method to do this (or how can I do it simply with Runnable, right now my limited understanding thinks I have to restructure a lot).
Since I was asked to share the Levenshtein-Logic: It's really nothing fancy, but here you go:
private int _maxLSDistance = 10;
public Map.Entry getClosestMatch(String key) {
Map.Entry _closestMatch = null;
int lsDist;
if (key == null) {
return null;
}
for (Map.Entry entry : _dictionary.entrySet()) {
// Perfect match
if (entry.getKey().equals(key)) {
return entry;
}
// Similar match
else {
int dist = StringUtils.getLevenshteinDistance((String) entry.getKey(), key);
// If "dist" is smaller than threshold and smaller than distance of already stored entry
if (dist < _maxLSDistance) {
if (_closestMatch == null || dist < _lsDistance) {
_closestMatch = entry;
_lsDistance = dist;
}
}
}
}
return _closestMatch
}
In order to use multi-threading in your case, could be something like:
The "monitor" class, which basically stores the results and coordinates the threads;
public class Results {
private int nrOfDictionaries = 4; //
private ArrayList<String> results = new ArrayList<String>();
public void prepare() {
nrOfDictionaries = 4;
results = new ArrayList<String>();
}
public synchronized void oneDictionaryFinished() {
nrOfDictionaries--;
System.out.println("one dictionary finished");
notifyAll();
}
public synchronized boolean isReady() throws InterruptedException {
while (nrOfDictionaries != 0) {
wait();
}
return true;
}
public synchronized void addResult(String result) {
results.add(result);
}
public ArrayList<String> getAllResults() {
return results;
}
}
The Thread it's self, which can be set to search for the specific dictionary:
public class ThreadDictionarySearch extends Thread {
// the actual dictionary
private String dictionary;
private Results results;
public ThreadDictionarySearch(Results results, String dictionary) {
this.dictionary = dictionary;
this.results = results;
}
#Override
public void run() {
for (int i = 0; i < 4; i++) {
// search dictionary;
results.addResult("result of " + dictionary);
System.out.println("adding result from " + dictionary);
}
results.oneDictionaryFinished();
}
}
And the main method for demonstration:
public static void main(String[] args) throws Exception {
Results results = new Results();
ThreadDictionarySearch threadA = new ThreadDictionarySearch(results, "dictionary A");
ThreadDictionarySearch threadB = new ThreadDictionarySearch(results, "dictionary B");
ThreadDictionarySearch threadC = new ThreadDictionarySearch(results, "dictionary C");
ThreadDictionarySearch threadD = new ThreadDictionarySearch(results, "dictionary D");
threadA.start();
threadB.start();
threadC.start();
threadD.start();
if (results.isReady())
// it stays here until all dictionaries are searched
// because in "Results" it's told to wait() while not finished;
for (String string : results.getAllResults()) {
System.out.println("RESULT: " + string);
}
I think the easiest would be to use a stream over the entry set:
public DictionaryEntry getEntry(String key) {
for (int i = 0; i < _numDictionaries; i++) {
HashMap<String, String> map = getDictionary(i);
map.entrySet().parallelStream().foreach( (entry) ->
{
// Calculate Levenshtein distance, store closest match etc.
}
);
}
// return closest match or null.
}
Provided you are using java 8 of course. You could also wrap the outer loop into an IntStream as well. Also you could directly use the Stream.reduce to get the entry with the smallest distance.
Maybe try thread pools:
ExecutorService es = Executors.newFixedThreadPool(_numDictionaries);
for (int i = 0; i < _numDictionaries; i++) {
//prepare a Runnable implementation that contains a logic of your search
es.submit(prepared_runnable);
}
I believe you may also try to find a quick estimate of strings that completely do not match (i.e. significant difference in length), and use it to finish your logic ASAP, moving to next candidate.
I have my strong doubts that HashMaps are a suitable solution here, especially if you want to have some fuzzing and stop words. You should utilize a proper full text search solutions like ElaticSearch or Apache Solr or at least an available engine like Apache Lucene.
That being said, you can use a poor man's version: Create an array of your maps and a SortedMap, iterate over the array, take the keys of the current HashMap and store them in the SortedMap with the index of their HashMap. To retrieve a key, you first search in the SortedMap for said key, get the respective HashMap from the array using the index position and lookup the key in only one HashMap. Should be fast enough without the need for multiple threads to dig through the HashMaps. However, you could make the code below into a runnable and you can have multiple lookups in parallel.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
public class Search {
public static void main(String[] arg) {
if (arg.length == 0) {
System.out.println("Must give a search word!");
System.exit(1);
}
String searchString = arg[0].toLowerCase();
/*
* Populating our HashMaps.
*/
HashMap<String, String> english = new HashMap<String, String>();
english.put("banana", "fruit");
english.put("tomato", "vegetable");
HashMap<String, String> german = new HashMap<String, String>();
german.put("Banane", "Frucht");
german.put("Tomate", "Gemüse");
/*
* Now we create our ArrayList of HashMaps for fast retrieval
*/
List<HashMap<String, String>> maps = new ArrayList<HashMap<String, String>>();
maps.add(english);
maps.add(german);
/*
* This is our index
*/
SortedMap<String, Integer> index = new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);
/*
* Populating the index:
*/
for (int i = 0; i < maps.size(); i++) {
// We iterate through or HashMaps...
HashMap<String, String> currentMap = maps.get(i);
for (String key : currentMap.keySet()) {
/* ...and populate our index with lowercase versions of the keys,
* referencing the array from which the key originates.
*/
index.put(key.toLowerCase(), i);
}
}
// In case our index contains our search string...
if (index.containsKey(searchString)) {
/*
* ... we find out in which map of the ones stored in maps
* the word in the index originated from.
*/
Integer mapIndex = index.get(searchString);
/*
* Next, we look up said map.
*/
HashMap<String, String> origin = maps.get(mapIndex);
/*
* Last, we retrieve the value from the origin map
*/
String result = origin.get(searchString);
/*
* The above steps can be shortened to
* String result = maps.get(index.get(searchString).intValue()).get(searchString);
*/
System.out.println(result);
} else {
System.out.println("\"" + searchString + "\" is not in the index!");
}
}
}
Please note that this is a rather naive implementation only provided for illustration purposes. It doesn't address several problems (you can't have duplicate index entries, for example).
With this solution, you are basically trading startup speed for query speed.
Okay!!..
Since your concern is to get faster response.
I would suggest you to divide the work between threads.
Lets you have 5 dictionaries May be keep three dictionaries to one thread and rest two will take care by another thread.
And then witch ever thread finds the match will halt or terminate the other thread.
May be you need an extra logic to do that dividing work ... But that wont effect your performance time.
And may be you need little more changes in your code to get your close match:
for (Map.Entry entry : _dictionary.entrySet()) {
you are using EntrySet But you are not using values anyway it seems getting entry set is a bit expensive. And I would suggest you to just use keySet since you are not really interested in the values in that map
for (Map.Entry entry : _dictionary.keySet()) {
For more details on the proformance of map Please read this link Map performances
Iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.

Callback with java multi thread

I'm trying to multi thread an import job, but running into a problem where it's causing duplicate data. I need to keep my map outside of the loop so all my threads can update and read from it, but I can't do this without it being final and with it being final I can't update the map. Currently I need to put my Map object in the run method, but the problem comes when the values are not initially in the database and each thread creates a new one. This results in duplicate data in the database. Does anybody know how to do some sort of call back to update my map outside?
ExecutorService executorService = Executors.newFixedThreadPool(10);
final Map<Integer, Object> map = new HashMap<>();
map.putAll(populate from database);
for (int i = 0; i < 10; i++) {
executorService.execute(new Runnable() {
public void run() {
while ((line = br.readLine()) != null) {
if(map.containsKey(123)) {
//read map object
session.update(object);
} else {
map.put(123,someObject);
session.save(object);
}
if(rowCount % 250 == 0)
tx.commit;
});
}
executorService.shutdown();
You need to use some synchronization techniques.
Problematic part is when different threads are trying to put some data into map.
Example:
Thread 1 is checking if there is object with key 123 in map. Before thread 1 added new object to map, thread 2 is executed. Thread 2 also check if there is object with key 123. Then both threads added object 123 to map. This causes duplicates...
You can read more about synchronization here
http://docs.oracle.com/javase/tutorial/essential/concurrency/sync.html
Based on your problem description it appears that you want to have a map where the data is consistent and you always have the latest up-t-date data without having missed any updates.
In this case make you map as a Collections.synchronizedMap(). This will ensure that all read and write updates to the map are synchronized and hence you are guaranteed to find a key using the latest data in the map and also guaranteed to write exclusively to the map.
Refer to this SO discussion for a difference between the concurrency techniques used with maps.
Also, one more thing - defining a Map as final does not mean yu cannot modify the map - you can definitely add and remove elements from the map. What you cannot do however is change the variable to point to another map. This is illustrated by a simple code snippet below:
private final Map<Integer, String> testMap = Collections.synchronizedMap(new HashMap<Integer,String>());
testMap.add(1,"Tom"); //OK
testMap.remove(1); //OK
testMap = new HashMap<Integer,String>(); //ERROR!! Cannot modify a variable with the final modifier
I would suggest the following solution
Use ConcurrentHashmap
Don't use update and commit inside your crawling threads
Trigger save and commit when your map reaches a critical size in a separate thread.
Pseudocode sample:
final Object lock = new Object();
...
executorService.execute(new Runnable() {
public void run() {
...
synchronized(lock){
if(concurrentMap.size() > 250){
saveInASeparateThread(concurrentMap.values().removeAll()));
}
}
}
}
This following logic resolves my issue. The code below isn't tested.
ExecutorService executorService = Executors.newFixedThreadPool(10);
final Map<Integer, Object> map = new ConcurrentHashMap<>();
map.putAll(myObjectList);
List<Future> futures = new ArrayList<>();
for (int i = 0; i < 10; i++) {
final thread = i;
Future future = executorService.submit(new Callable() {
public void call() {
List<MyObject> list;
CSVReader reader = new CSVReader(new InputStreamReader(csvFile.getStream()));
list = bean.parse(strategy, reader);
int listSize = list.size();
int rowCount = 0;
for(MyObject myObject : list) {
rowCount++;
Integer key = myObject.getId();
if(map.putIfAbsent(key, myObject) == null) {
session.save(object);
} else {
myObject = map.get(key);
//Do something
session.update(myObject);
}
if(rowCount % 250 == 0 || rowCount == listSize) {
tx.flush();
tx.clear();
}
};
tx.commit();
return "Thread " + thread + " completed.";
});
futures.add(future);
}
for(Future future : futures) {
System.out.println(future.get());
}
executorService.shutdown();

Can a Java HashMap's size() be out of sync with its actual entries' size?

I have a Java HashMap called statusCountMap.
Calling size() results in 30.
But if I count the entries manually, it's 31
This is in one of my TestNG unit tests. These results below are from Eclipse's Display window (type code -> highlight -> hit Display Result of Evaluating Selected Text).
statusCountMap.size()
(int) 30
statusCountMap.keySet().size()
(int) 30
statusCountMap.values().size()
(int) 30
statusCountMap
(java.util.HashMap) {40534-INACTIVE=2, 40526-INACTIVE=1, 40528-INACTIVE=1, 40492-INACTIVE=3, 40492-TOTAL=4, 40513-TOTAL=6, 40532-DRAFT=4, 40524-TOTAL=7, 40526-DRAFT=2, 40528-ACTIVE=1, 40524-DRAFT=2, 40515-ACTIVE=1, 40513-DRAFT=4, 40534-DRAFT=1, 40514-TOTAL=3, 40529-DRAFT=4, 40515-TOTAL=3, 40492-ACTIVE=1, 40528-TOTAL=4, 40514-DRAFT=2, 40526-TOTAL=3, 40524-INACTIVE=2, 40515-DRAFT=2, 40514-ACTIVE=1, 40534-TOTAL=3, 40513-ACTIVE=2, 40528-DRAFT=2, 40532-TOTAL=4, 40524-ACTIVE=3, 40529-ACTIVE=1, 40529-TOTAL=5}
statusCountMap.entrySet().size()
(int) 30
What gives ? Anyone has experienced this ?
I'm pretty sure statusCountMap is not being modified at this point.
There are 2 methods (lets call them methodA and methodB) that modify statusCountMap concurrently, by repeatedly calling incrementCountInMap.
private void incrementCountInMap(Map map, Long id, String qualifier) {
String key = id + "-" + qualifier;
if (map.get(key) == null) {
map.put(key, 0);
}
synchronized (map) {
map.put(key, map.get(key).intValue() + 1);
}
}
methodD is where I'm getting the issue. methodD has a TestNG #dependsOnMethods = { "methodA", "methodB" } so when methodD is executing, statusCountMap is pretty much static already.
I'm mentioning this because it might be a bug in TestNG.
I'm using Sun JDK 1.6.0_24. TestNG is testng-5.9-jdk15.jar
Hmmm ... after rereading my post, could it be because of concurrent execution of outside-of-synchronized-block map.get(key) == null & map.put(key,0) that's causing this issue ?
I believe you can achieve this if you modify a key after it is added to a HashMap.
However in your case it appears to be just a case of modifying the same map in two threads without proper synchronization. e.g. in thread A, map.put(key, 0), thread B map.put(key2, 0) can results in a size of 1 or 2. If you do the same with remove you can end up with a size larger than you should.
Hmmm ... after rereading my post, could it be because of concurrent execution of outside-of-synchronized-block map.get(key) == null & map.put(key,0) that's causing this issue ?
In a word ... yes.
HashMap is not thread-safe. Therefore, if there is any point where two threads could update a HashMap without proper synchronization, the map could get into an inconsistent state. And even if one of the threads is only reading, that thread could see an inconsistent state for the map.
The correct way to write that method is:
private void incrementCountInMap(Map map, Long id, String qualifier) {
String key = id + "-" + qualifier;
synchronized (map) {
Integer count = map.get(key);
map.put(key, count == null ? 1 : count + 1);
}
}
If you using the default initial capacity of 16 and accessing them map in a non thread safe manner your ripe for an inconsistent state. Size is a state member in the Map getting updated as each item is entered(size++). This is because the map itself is an array of linked lists and cannot really return its actual size because its not indicative of the number of items it contains. Once the Map reaches a percentage(load_factor) of the intial capacity it has to resize itself to accomodate more items. If a rogue thread is attempting to add items as the map is resizing who knows what state the map will be in.
The problem that the first map.put(..) isn't synchronized. Either synchronize it, or use Collections.synchronizedMap(..). Test case:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
public class Test {
public static void main(String... args) throws InterruptedException {
final Random random = new Random();
final int max = 10;
for (int j = 0; j < 100000; j++) {
// final Map<String, Integer> map = Collections.synchronizedMap(new HashMap<String, Integer>());
final HashMap<String, Integer> map = new HashMap<String, Integer>();
Thread t = new Thread() {
public void run() {
for (int i = 0; i < 100; i++) {
incrementCountInMap(map, random.nextInt(max));
}
}
};
t.start();
for (int i = 0; i < 100; i++) {
incrementCountInMap(map, random.nextInt(max));
}
t.join();
if (map.size() != max) {
System.out.println("size: " + map.size() + " entries: " + map);
}
}
}
static void incrementCountInMap(Map<String, Integer> map, int id) {
String key = "k" + id;
if (map.get(key) == null) {
map.put(key, 0);
}
synchronized (map) {
map.put(key, map.get(key).intValue() + 1);
}
}
}
Some results I get:
size: 11 entries: {k3=24, k4=20, k5=16, k6=30, k7=16, k8=18, k9=11, k0=18, k1=16, k1=13, k2=18}
size: 11 entries: {k3=18, k4=19, k5=21, k6=20, k7=18, k8=26, k9=20, k0=16, k1=25, k2=15}
size: 11 entries: {k3=25, k4=20, k5=27, k6=15, k7=17, k8=17, k9=24, k0=21, k1=16, k1=1, k2=17}
size: 11 entries: {k3=13, k4=21, k5=18, k6=21, k7=13, k8=17, k9=25, k0=20, k1=23, k2=28}
size: 11 entries: {k3=21, k4=25, k5=19, k6=12, k7=17, k8=14, k9=23, k0=24, k1=26, k2=18}
size: 9 entries: {k3=13, k4=17, k5=23, k6=24, k7=18, k8=19, k9=28, k0=21, k1=17, k2=20}
size: 9 entries: {k3=15, k4=24, k5=21, k6=18, k7=21, k8=30, k9=20, k0=17, k1=15, k2=19}
size: 11 entries: {k3=15, k4=13, k5=21, k6=21, k7=15, k8=19, k9=23, k0=30, k1=15, k2=27}
size: 11 entries: {k3=29, k4=15, k5=19, k6=19, k7=15, k8=23, k9=14, k0=31, k1=18, k2=12}
size: 11 entries: {k3=17, k4=18, k5=20, k6=11, k6=13, k7=20, k8=22, k9=30, k0=12, k1=21, k2=16}

Categories