Safely publish array elements - java

This question is concerned with safe publication of array contents in multi-threaded Java programs.
Let's assume I have some arbitrary array of Objects:
Object[] myArray = ...
Now this array is handed over to another thread, maybe like this:
new Thread() {
    @Override
    public void run() {
        // ...
        Object o = myArray[0];
        // ...
    }
}.start();
My question is: will the new thread observe the values in the array as 'expected' if no further synchronization is in place? Does this depend on whether the array itself is a (final/volatile) field or a local variable? Are subsequent modifications of the array by the first thread immediately visible to the new thread?
What would be the most efficient way of safely publishing the array's elements?

The exact answer depends on the mutability of the array and its contents, the number of writers and readers, whether your only option is using an array (and not a thread-safe Collection of the same elements), etc.
If I were in your shoes, I'd probably spend an evening searching StackOverflow and trying out some fancy combination of patterns (maybe a final array reference plus immutable wrappers of the array elements, or a custom update scheme using Unsafe.putOrderedObject if there's a single writer).
Then, after a short internal struggle, I'd copy all of it into my "Playground" folder and use an AtomicReferenceArray or a CopyOnWriteArrayList (or another appropriate off-the-shelf solution).
I'd also try to remind myself that I need to worry about performance only once I've addressed all of my bigger concerns (like correctness), and once I have proof that this specific part of my program needs to be optimized. Hopefully a similar approach will work for you too.
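For concreteness, here is a minimal sketch of that off-the-shelf route, assuming a single writer publishing through an AtomicReferenceArray (the class and variable names are mine, not the question's):

import java.util.concurrent.atomic.AtomicReferenceArray;

public class SafePublishSketch {
    public static void main(String[] args) throws InterruptedException {
        // set() has volatile write semantics: a reader that sees the element
        // also sees every write the publisher made before calling set().
        AtomicReferenceArray<String> shared = new AtomicReferenceArray<>(10);

        Thread reader = new Thread(() -> {
            String s;
            while ((s = shared.get(0)) == null) {
                Thread.yield(); // not published yet
            }
            System.out.println("observed: " + s);
        });
        reader.start();

        shared.set(0, "hello"); // safely publishes element 0
        reader.join();
    }
}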

Related

Java software design - Looping, object creation VS modifying variables. Memory, performance & reliability comparison

Let's say we are trying to build a document scanner class in Java that takes one input argument, the log path (e.g. C:\document\text1.txt). Which of the following implementations would you prefer based on performance/memory/modularity?
ArrayList<String> fileListArray = new ArrayList<String>();
fileListArray.add("C:\\document\\text1.txt");
fileListArray.add("C:\\document\\text2.txt");
...

// Implementation A
for (int i = 0, j = fileListArray.size(); i < j; i++) {
    MyDocumentScanner ds = new MyDocumentScanner(fileListArray.get(i));
    ds.scanDocument();
    ds.resultOutput();
}

// Implementation B
MyDocumentScanner ds = new MyDocumentScanner();
for (int i = 0, j = fileListArray.size(); i < j; i++) {
    ds.setDocPath(fileListArray.get(i));
    ds.scanDocument();
    ds.resultOutput();
}
Personally I would prefer A due to its encapsulation, but it seems to use more memory because of the creation of multiple instances. I'm curious whether there is an answer to this, or whether it is another "it depends on the situation/circumstances" dilemma.
Although this is obviously opinion-based, I will try to give my opinion in an answer.
Your approach A is far better. Your document scanner obviously handles a file; that file should be set at construction time and saved in an instance field, so every method can refer to it. Moreover, the constructor can do some checks on the file reference (null check, existence, ...).
Your approach B has two very serious disadvantages:
After constructing a document scanner, clients could easily call any of its methods. If no file was set beforehand, you must handle that "illegal state", maybe with an IllegalStateException. This approach therefore increases the code size and complexity of the class.
There seems to be a series of method calls that a client should or can perform. It's easy to call the file-setting method again in the middle of such a series with a completely different file, breaking the whole scan facility. To avoid this, your setter (for the file) would have to remember whether a file was already set, and that nearly automatically leads back to approach A.
Regarding the creation of objects: modern JVMs are very fast at creating objects. Usually there is no measurable performance overhead for that; the processing time (here: the scan) is usually much higher.
If you don't need multiple instances of DocumentScanner to co-exist, I see no point in creating a new instance in each iteration of the loop. It just creates work for the garbage collector, which has to free each of those instances.
If the array is small, it doesn't make much difference which implementation you choose, but for large arrays implementation B is more efficient, both in terms of memory (fewer instances created that the GC hasn't freed yet) and CPU (less work for the GC).
Are you implementing DocumentScanner or using an existing class?
If the latter, and it was designed to parse multiple documents in a row, you can just reuse the object as in variant B.
However, if you are designing DocumentScanner, I would recommend designing it so that it handles a single document and does not even have a setDocPath method. This leaves less mutable state in the class, which makes its design much simpler and using an instance of it less error-prone.
As for performance, there won't be a measurable difference unless instantiating a DocumentScanner does a lot of work (like instantiating many other objects, too). Instantiating and freeing short-lived objects in Java is pretty cheap thanks to the generational garbage collector.
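As an illustration of the single-document design recommended above, a rough sketch (MyDocumentScanner and its method names come from the question; the validation details are my assumptions):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class MyDocumentScanner {
    private final Path docPath; // set once at construction, never changed

    public MyDocumentScanner(String docPath) {
        Path p = Paths.get(docPath); // throws if docPath is null or invalid
        if (!Files.exists(p)) {
            throw new IllegalArgumentException("No such file: " + docPath);
        }
        this.docPath = p;
    }

    public void scanDocument() {
        // ... scan this.docPath ...
    }

    public void resultOutput() {
        // ... output the results for this.docPath ...
    }
}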

Sharing array of bins between threads

I have an application that is multithreaded and working OK. However, it's hitting lock contention issues (checked by snapshotting the Java stack and seeing what's waiting).
Each thread consumes objects off a list and either rejects each or places it into a Bin.
The Bins are initially null, as each can be expensive (and there are potentially a lot of them).
The code that is causing the contention looks roughly like this:
public void addToBin(Bin[] bins, Item item) {
    Bin bin;
    int bin_index = item.bin_index;
    synchronized (bins) {
        bin = bins[bin_index];
        if (bin == null) {
            bin = new Bin();
            bins[bin_index] = bin;
        }
    }
    synchronized (bin) {
        bin.add(item);
    }
}
It is the synchronization on the bins array that is the bottleneck.
It was suggested to me by a colleague to use double checked locking to solve this, but we're unsure exactly what would be involved to make it safe. The suggested solution looks like this:
public void addToBin(Bin[] bins, Item item) {
    int bin_index = item.bin_index;
    Bin bin = bins[bin_index];
    if (bin == null) {
        synchronized (bins) {
            bin = bins[bin_index];
            if (bin == null) {
                bin = new Bin();
                bins[bin_index] = bin;
            }
        }
    }
    synchronized (bin) {
        bin.add(item);
    }
}
Is this safe and/or is there a better/safer/more idiomatic way to do this?
As Malt's answer already states, Java provides many lock-free data structures and concepts that can be used to solve this problem. I'd like to add a more detailed example using AtomicReferenceArray:
Assuming bins is an AtomicReferenceArray, the following code performs a lock-free update in case of a null entry:
Bin bin = bins.get(index);
while (bin == null) {
    bin = new Bin();
    if (!bins.compareAndSet(index, null, bin)) {
        // some other thread already set the bin in the meantime
        bin = bins.get(index);
    }
}
// use bin as usual
Since Java 8, there is a more elegant solution for that:
Bin bin = bins.updateAndGet(index, oldBin -> oldBin == null ? new Bin() : oldBin);
// use bin as usual
Note: the Java 8 version, while still non-blocking, is perceptibly slower than the Java 7 version above, because updateAndGet always writes the array element even when the value does not change. Whether this is negligible depends on the overall cost of the entire bin-update operation.
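If that unconditional write is a concern, a common variation (my sketch, not part of the original answer) is to pay for updateAndGet only when the slot is still empty:

Bin bin = bins.get(index); // plain volatile read, no write
if (bin == null) {
    // only threads racing on an empty slot take the update path
    bin = bins.updateAndGet(index, old -> old == null ? new Bin() : old);
}
// use bin as usual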
Another very elegant strategy is to simply pre-fill the entire bins array with newly created Bin instances before handing the array over to the worker threads. As the threads then don't have to modify the array, this reduces the need for synchronization to the Bin objects themselves. Filling the array can easily be done in parallel using Arrays.parallelSetAll (since Java 8):
Arrays.parallelSetAll(bins, i -> new Bin());
Update 2: Whether this is an option depends on the expected output of your algorithm: will the bins array end up filled completely, densely, or only sparsely? (In the first case pre-filling is advisable; in the second it depends, as so often; in the last it's probably a bad idea.)
Update 1: Don't use double-checked locking! It is not safe! The problem here is visibility, not atomicity. In your case, the reading thread might get a partly constructed (hence corrupt) Bin instance. For details see http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html.
Java has a variety of excellent lock-free concurrent data structures, so there's really no need to use arrays with synchronizations for this type of thing.
ConcurrentSkipListMap is a concurrent, sorted, key-value map.
ConcurrentHashMap is a concurrent, unsorted key-value map.
You can simply use one of these instead of the array. Just set the map key to be the Integer index you already use and you're good to go.
There's also Google's ConcurrentLinkedHashMap and Google's Guava Cache, which are excellent for keeping ordered data, and for removing old entries.
I would advise against the second solution because it reads the bins array outside of a synchronized block, so there is no guarantee that changes made by another thread are visible to the code reading an element unsynchronized.
In particular, a concurrently added Bin may not be seen, so a thread might create a new Bin for the same index again and overwrite a concurrently created and stored one, losing any items that were already placed in the discarded bin.
If none of the built-in Java classes helps you, you could just create 8 bin locks, say binsALock to binsHLock.
Then divide bin_index by 8 and use the remainder to choose the lock.
If you choose a larger number that exceeds the number of threads you have, and use a lock that is very fast when contended, you may do better than 8.
You may also get better results by reducing the number of threads you use.
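A sketch of that striping idea, reusing the Bin and Item types from the question (the class name and the stripe count are illustrative):

class StripedBins {
    // one lock per stripe; more stripes means less contention
    private final Object[] stripes = new Object[8];
    {
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new Object();
        }
    }

    public void addToBin(Bin[] bins, Item item) {
        int binIndex = item.bin_index;
        // all accesses to a given index go through the same stripe lock
        synchronized (stripes[binIndex % stripes.length]) {
            Bin bin = bins[binIndex];
            if (bin == null) {
                bin = new Bin();
                bins[binIndex] = bin;
            }
            bin.add(item);
        }
    }
}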

volatile array and multithreaded sorting

I'm considering implementations of multi-threaded sorting using one volatile array. Let's say I have an array of length N, and M threads that will sort sub-ranges of the array. These sub-ranges are disjoint. Then, in the main thread, I will merge the partially sorted array.
Example code:
final int N = ....
volatile MyClass[] array = new MyClass[N];
//... fill array with values

void sort() throws InterruptedException {
    MyThread[] workers = new MyThread[M];
    int len = N / M; // length of each sub-range
    for (int i = 0; i < M; ++i) {
        workers[i] = new MyThread(i * len, (i + 1) * len);
        workers[i].start();
    }
    for (int i = 0; i < M; ++i) workers[i].join();
    // now synchronization in memory using "happens before"
    // will it work?
    array = array;
    //...merge sorted sub-ranges into one sorted array
}

private class MyThread extends Thread {
    final int from;
    final int to;
    public MyThread(int from, int to) { ..... }
    public void run() {
        //...something like: quicksort(array, from, to);
        //...without synchronization, ranges <from, to> are exclusive
    }
}
I don't need synchronization in memory while the threads are running because the array sub-ranges are disjoint. I want to do the synchronization once, after the threads have finished. Will the updated version of the array (as seen in the main thread) contain all the changes made in the worker threads?
If this solution is valid, is it efficient for large arrays?
Thank you in advance for your help.
EDIT:
I ran the tests. I received correct results regardless of the use of the volatile keyword, but the execution time is a few times (about M times) longer for the volatile array.
Not an answer, just some thoughts:
There is no such thing as a volatile array. Only fields can be volatile. You have declared a volatile field named "array", and initialized it with a reference to an array object.
It looks like you are expecting the statement array = array to act as a full memory barrier. I don't know whether it will, or whether the answer depends on which compiler, JVM, and operating system you use. Maybe somebody more expert than I can answer.
I don't like it for two reasons, though. One: it looks like a no-op. It's an invitation for some other programmer, who doesn't understand what you're trying to do, to come along and "clean up" the code by deleting it. A tricky statement like that should be wrapped in a function whose name explains the trick.
Two: the function of that statement has nothing to do with the array that the field references. It would be better to use a volatile int field, or a volatile field of some other type that obviously has no connection to the array, thereby calling attention to the fact that what matters is something other than the value of the field.
Update: According to Brian Goetz, that one statement won't do what you want. What you need is for each worker thread to update the volatile field after finishing its work, and then you need the master thread to read the volatile field before it tries to look at the worker's results.
On the other hand... Do you need the barrier at all? Isn't it enough that the worker threads all terminated and the master join()ed them? Again, maybe somebody more expert than myself can answer.
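For reference, a sketch of the hand-off Goetz describes (SortBarrier is my name for it): each worker performs a volatile write after sorting, and the master performs a volatile read before merging, which establishes the happens-before edge. For what it's worth, Thread.join() is itself specified by the JLS to establish such an edge, so the question's join loop should already suffice without the volatile array.

import java.util.concurrent.atomic.AtomicInteger;

class SortBarrier {
    private final AtomicInteger finished = new AtomicInteger();

    // worker: call after sorting its sub-range;
    // incrementAndGet is a volatile write, publishing the worker's results
    void workerDone() {
        finished.incrementAndGet();
    }

    // master: call before merging; the volatile read on each iteration
    // makes every finished worker's writes visible
    void awaitAll(int workerCount) {
        while (finished.get() < workerCount) {
            Thread.yield();
        }
    }
}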
What you're doing looks very messy and, as suggested, probably won't work as expected.
If you use Java 8, then perhaps the parallel sort is for you. Otherwise:
Sorting a single array in place, in parallel, is a horror show. Sorting in parallel is rather simple if you create a new array of sorted elements.
Create objects for the sub-arrays (you'll need to do this eventually). Pass each object to a thread. Let the threads sort their objects in parallel. When all the sorts are done, merge the sorted objects into a new array.
That means more memory is required, but it's rather easy, and you don't need to worry about volatile or synchronization.
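For the Java 8 route mentioned above, the library call is a one-liner; a tiny self-contained example (the sample data is mine):

import java.util.Arrays;

public class ParallelSortDemo {
    public static void main(String[] args) {
        String[] words = { "pear", "apple", "plum", "fig" };
        // parallelSort splits the array into sub-ranges, sorts them in the
        // common ForkJoinPool and merges the results - the same scheme
        // described above, with memory visibility handled internally.
        Arrays.parallelSort(words);
        System.out.println(Arrays.toString(words)); // [apple, fig, pear, plum]
    }
}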

Thread-safe iteration over a collection

We all know when using Collections.synchronizedXXX (e.g. synchronizedSet()) we get a synchronized "view" of the underlying collection.
However, the documentation of these wrapper methods states that we have to synchronize explicitly on the collection when iterating over it using an iterator.
Which option do you choose to solve this problem?
I can only see the following approaches:
Do it as the documentation states: synchronize on the collection
Clone the collection before calling iterator()
Use a collection whose iterator is thread-safe (I am only aware of CopyOnWriteArrayList/Set)
And as a bonus question: when using a synchronized view - is the use of foreach/Iterable thread-safe?
You've already answered your bonus question really: no, using an enhanced for loop isn't safe - because it uses an iterator.
As for which is the most appropriate approach, it really depends on your context:
Are writes very infrequent? If so, CopyOnWriteArrayList may be most appropriate.
Is the collection reasonably small, and the iteration quick? (i.e. you're not doing much work in the loop) If so, synchronizing may well be fine - especially if this doesn't happen too often (i.e. you won't have much contention for the collection).
If you're doing a lot of work and don't want to block other threads working at the same time, the hit of cloning the collection may well be acceptable.
Depends on your access model. If you have low concurrency and frequent writes, option 1 will perform best. If you have high concurrency and infrequent writes, option 3 will perform best. Option 2 is going to perform badly in almost all cases.
foreach calls iterator(), so exactly the same things apply.
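For reference, a minimal sketch of option 1, the pattern the Collections.synchronizedSet documentation prescribes (names are illustrative):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SyncIterationDemo {
    public static void main(String[] args) {
        Set<String> words = Collections.synchronizedSet(new HashSet<String>());
        words.add("alpha");
        words.add("beta");

        // Iteration over a synchronized view must be manually synchronized
        // on the view itself, or concurrent writers may cause a
        // ConcurrentModificationException mid-loop.
        synchronized (words) {
            for (String w : words) {
                System.out.println(w);
            }
        }
    }
}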
You could use one of the newer collections added in Java 5.0, which support concurrent access while iterating. Another approach is to take a copy using toArray, which is thread-safe (during the copy).
Collection<String> words = ...
// enhanced for loop over an array snapshot
for (String word : words.toArray(new String[0])) {
    // ...
}
I might be totally off with your requirements, but if you are not aware of them, check out google-collections with "Favor immutability" in mind.
I suggest dropping Collections.synchronizedXXX and handle all locking uniformly in the client code. The basic collections don't support the sort of compound operations useful in threaded code, and even if you use java.util.concurrent.* the code is more difficult. I suggest keeping as much code as possible thread-agnostic. Keep difficult and error-prone thread-safe (if we are very lucky) code to a minimum.
All three of your options will work. Choosing the right one for your situation will depend on what your situation is.
CopyOnWriteArrayList will work if you want a list implementation and you don't mind the underlying storage being copied every time you write. This is pretty good for performance as long as you don't have very big collections.
ConcurrentHashMap or a "ConcurrentHashSet" (via Collections.newSetFromMap) will work if you need a Map or Set interface; obviously you don't get random access this way. One great thing about these two is that they work well with large data sets: when mutated, they just copy small bits of the underlying data storage.
It depends on the result one needs to achieve. Cloning/copying via toArray(), new ArrayList(...) and the like obtains a snapshot and does not lock the collection.
Using synchronized(collection) and iterating through it ensures that no modification occurs before the end of the iteration, i.e. it effectively locks the collection.
Side note: toArray() is usually preferred (with some exceptions, when internally it needs to create a temporary ArrayList). Also note that anything but toArray() should be wrapped in synchronized(collection) as well, provided you are using Collections.synchronizedXXX.
This question is rather old (sorry, I am a bit late), but I still want to add my answer.
I would choose your second option (i.e. clone the collection before calling iterator()), but with a major twist.
Assuming you want to iterate using an iterator, you do not have to copy the collection before calling .iterator(), sort of negating (I am using the term "negating" loosely) the idea of the iterator pattern; instead, you could write a "ThreadSafeIterator".
It would work on the same premise, copying the collection, but without letting the iterating class know that you did just that. Such an iterator might look like this:
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Queue;

class ThreadSafeIterator<T> implements Iterator<T> {
    private final Queue<T> clients;
    private T currentElement;
    private final Collection<T> source;

    ThreadSafeIterator(final Collection<T> collection) {
        this.clients = new LinkedList<>(collection); // snapshot copy
        this.source = collection;
    }

    @Override
    public boolean hasNext() {
        return clients.peek() != null;
    }

    @Override
    public T next() {
        currentElement = clients.poll();
        return currentElement;
    }

    @Override
    public void remove() {
        synchronized (source) {
            source.remove(currentElement);
        }
    }
}
Taking this a step further, you might use the Semaphore class to ensure thread safety or something like that. But take the remove method with a grain of salt.
The point is that, by using such an iterator, neither the iterating class nor the iterated class (is that a real word?) has to worry about thread safety.

Synchronizing on an Integer value [duplicate]

Possible Duplicate:
What is the best way to increase number of locks in java
Suppose I want to lock based on an integer id value. In this case, there's a function that pulls a value from a cache and does a fairly expensive retrieve/store into the cache if the value isn't there.
The existing code isn't synchronized and could potentially trigger multiple retrieve/store operations:
// pseudocode
public Page getPage(Integer id) {
    Page p = cache.get(id);
    if (p == null) {
        p = getFromDataBase(id);
        cache.store(p);
    }
    return p;
}
What I'd like to do is synchronize the retrieve on the id, e.g.
if (p == null) {
    synchronized (id) {
        // ...retrieve, store
    }
}
Unfortunately this won't work, because two separate calls can have the same Integer id value but different Integer objects, so they won't share the lock, and no synchronization will happen.
Is there a simple way of insuring that you have the same Integer instance? For example, will this work:
synchronized (Integer.valueOf(id.intValue())) {
The javadoc for Integer.valueOf() seems to imply that you're likely to get the same instance, but that doesn't look like a guarantee:
Returns an Integer instance representing the specified int value. If a new Integer instance is not required, this method should generally be used in preference to the constructor Integer(int), as this method is likely to yield significantly better space and time performance by caching frequently requested values.
So, any suggestions on how to get an Integer instance that's guaranteed to be the same, other than more elaborate solutions like keeping a WeakHashMap of Lock objects keyed to the int? (Nothing wrong with that, it just seems like there must be an obvious one-liner that I'm missing.)
You really don't want to synchronize on an Integer, since you don't have control over what instances are the same and what instances are different. Java just doesn't provide such a facility (unless you're using Integers in a small range) that is dependable across different JVMs. If you really must synchronize on an Integer, then you need to keep a Map or Set of Integer so you can guarantee that you're getting the exact instance you want.
Better would be to create a new object, perhaps stored in a HashMap that is keyed by the Integer, to synchronize on. Something like this:
public Page getPage(Integer id) {
    Page p = cache.get(id);
    if (p == null) {
        synchronized (getCacheSyncObject(id)) {
            p = getFromDataBase(id);
            cache.store(p);
        }
    }
    return p;
}

private ConcurrentMap<Integer, Integer> locks = new ConcurrentHashMap<Integer, Integer>();

private Object getCacheSyncObject(final Integer id) {
    locks.putIfAbsent(id, id);
    return locks.get(id);
}
To explain this code, it uses ConcurrentMap, which allows use of putIfAbsent. You could do this:
locks.putIfAbsent(id, new Object());
but then you incur the (small) cost of creating an Object for each access. To avoid that, I just save the Integer itself in the Map. What does this achieve? Why is this any different from just using the Integer itself?
When you do a get() from a Map, the keys are compared with equals() (or at least the method used is the equivalent of using equals()). Two different Integer instances of the same value will be equal to each other. Thus, you can pass any number of different Integer instances of "new Integer(5)" as the parameter to getCacheSyncObject and you will always get back only the very first instance that was passed in that contained that value.
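A tiny demonstration of that canonicalization (my sketch):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CanonicalLockDemo {
    public static void main(String[] args) {
        ConcurrentMap<Integer, Integer> locks = new ConcurrentHashMap<Integer, Integer>();
        Integer a = new Integer(5); // deliberately distinct instances...
        Integer b = new Integer(5); // ...that are equal to each other
        locks.putIfAbsent(a, a);
        locks.putIfAbsent(b, b); // no-op: an entry for 5 already exists
        System.out.println(locks.get(b) == a); // true - same canonical instance
    }
}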
There are reasons why you may not want to synchronize on Integer ... you can get into deadlocks if multiple threads are synchronizing on Integer objects and are thus unwittingly using the same locks when they want to use different locks. You can fix this risk by using the
locks.putIfAbsent(id, new Object());
version and thus incurring a (very) small cost to each access to the cache. Doing this, you guarantee that this class will be doing its synchronization on an object that no other class will be synchronizing on. Always a Good Thing.
Use a thread-safe map, such as ConcurrentHashMap. This will allow you to manipulate the map safely, but use a different lock to do the real computation. In this way you can have multiple computations running simultaneously with a single map.
Use ConcurrentMap.putIfAbsent, but instead of placing the actual value, use a Future with a computationally light construction, possibly the FutureTask implementation. Run the computation and then get the result, which will block thread-safely until done.
Integer.valueOf() only returns cached instances for a limited range. You haven't specified your range, but in general, this won't work.
However, I would strongly recommend you not take this approach, even if your values are in the correct range. Since these cached Integer instances are available to any code, you can't fully control the synchronization, which could lead to a deadlock. This is the same problem people have when trying to lock on the result of String.intern().
The best lock is a private variable. Since only your code can reference it, you can guarantee that no deadlocks will occur.
By the way, using a WeakHashMap won't work either. If the instance serving as the key is unreferenced, it will be garbage collected. And if it is strongly referenced, you could use it directly.
Using synchronized on an Integer sounds really wrong by design.
If you need to synchronize each item individually, only during retrieve/store, you can create a Set and store the currently locked items in it. In other words:
// this contains only those IDs that are currently locked, that is, this
// will contain only very few IDs most of the time
Set<Integer> activeIds = ...
Object retrieve(Integer id) {
    // acquire "lock" on item #id
    synchronized (activeIds) {
        while (activeIds.contains(id)) {
            try {
                activeIds.wait();
            } catch (InterruptedException e) {
                // ...
            }
        }
        activeIds.add(id);
    }
    try {
        // do the retrieve here...
        return value;
    } finally {
        // release lock on item #id
        synchronized (activeIds) {
            activeIds.remove(id);
            activeIds.notifyAll();
        }
    }
}
The same goes for the store.
The bottom line is: there is no single line of code that solves this problem exactly the way you need.
How about a ConcurrentHashMap with the Integer objects as keys?
You could have a look at this code for creating a mutex from an ID. The code was written for String IDs, but could easily be edited for Integer objects.
As you can see from the variety of answers, there are various ways to skin this cat:
Goetz et al's approach of keeping a cache of FutureTasks works quite well in situations like this, where you're "caching something anyway", so you don't mind building up a map of FutureTask objects (and if you do mind the map growing, at least it's easy to prune it concurrently).
As a general answer to "how to lock on an ID", the approach outlined by Antonio has the advantage that it's obvious when the map of locks is added to and removed from.
You may need to watch out for a potential issue with Antonio's implementation, namely that notifyAll() wakes up threads waiting on all IDs when any one of them becomes available, which may not scale well under high contention. In principle you can fix that by having a Condition object for each currently locked ID, which is then the thing you await/signal. Of course, if in practice there's rarely more than one ID being waited on at any given time, this isn't an issue.
Steve, your proposed code has a bunch of problems with synchronization (Antonio's does as well). To summarize, you need to:
cache an expensive object;
make sure that while one thread is doing the retrieval, another thread does not also attempt to retrieve the same object;
ensure that for n threads all attempting to get the object, only one object is ever retrieved and returned;
ensure that threads requesting different objects do not contend with each other.
Pseudo code to make this happen (using a ConcurrentHashMap as the cache):
ConcurrentMap<Integer, java.util.concurrent.Future<Page>> cache =
        new ConcurrentHashMap<Integer, java.util.concurrent.Future<Page>>();

public Page getPage(Integer id) {
    Future<Page> myFuture = new Future<Page>(); // illustrative - see notes below
    cache.putIfAbsent(id, myFuture);
    Future<Page> actualFuture = cache.get(id);
    if (actualFuture == myFuture) {
        // I am the first, w00t!
        Page page = getFromDataBase(id);
        myFuture.set(page);
    }
    return actualFuture.get();
}
Note:
java.util.concurrent.Future is an interface.
java.util.concurrent.Future does not actually have a set() method; look at the existing classes that implement Future to see how to implement your own (or use FutureTask).
Pushing the actual retrieval to a worker thread will almost certainly be a good idea.
See section 5.6 in Java Concurrency in Practice: "Building an efficient, scalable, result cache". It deals with the exact issue you are trying to solve. In particular, check out the memoizer pattern.
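A runnable version of that pseudo code using FutureTask (a sketch; Page and getFromDataBase are stand-ins for the question's types):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class PageCache {
    private final ConcurrentMap<Integer, FutureTask<Page>> cache =
            new ConcurrentHashMap<Integer, FutureTask<Page>>();

    public Page getPage(final Integer id)
            throws ExecutionException, InterruptedException {
        FutureTask<Page> task = cache.get(id);
        if (task == null) {
            FutureTask<Page> myTask =
                    new FutureTask<Page>(() -> getFromDataBase(id));
            task = cache.putIfAbsent(id, myTask);
            if (task == null) { // we won the race: run the load ourselves
                task = myTask;
                myTask.run();
            }
        }
        return task.get(); // losers block here until the winner finishes
    }

    private Page getFromDataBase(Integer id) {
        return new Page(); // stand-in for the expensive retrieval
    }

    static class Page { }
}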
