Can anyone please explain to me the consequences of mutating a collection in java that is not thread-safe and is being used by multiple threads?
The results are undefined and somewhat random.
With JDK collections that are designed to fail fast, you might receive a ConcurrentModificationException. This is really the only consequence that is specific to thread safety with collections, as opposed to any other class.
Problems that occur generally with thread-unsafe classes may occur:
The internal state of the collection might be corrupted.
The mutation may appear to be successful, but the changes may not, in fact, be visible to other threads at any given time. They might be invisible at first and become visible later.
The changes might actually be successful under light load, but fail randomly under heavy load with lots of threads in contention.
Race conditions might occur, as was mentioned in a comment above.
There are lots of other possibilities, none of them pleasant. Worst of all, these things tend to most commonly reveal themselves in production, when the system is stressed.
In short, you probably don't want to do that.
The most common outcome is it looks like it works, but doesn't work all the time.
This can mean you have a problem which
works on one machine but doesn't on another.
works for a while but something apparently unrelated changes and your program breaks.
whenever you have a bug you don't know if it's a multi-threading issue or not if you are not using thread safe data structures.
What can happen is;
you rarely/randomly get an error and strange behaviour
your code goes into an infinite loop and stops working (HashMap used to do this)
The only option is to;
limit the amount of state which is shared between threads, ideally none at all.
be very careful about how data is updated.
don't rely on unit tests, you have to understand what the code doing and be confident it will be behave correctly in all possible situations.
The invariants of the data structure will not be guaranteed.
For example:
If thread 2 does a read whilst thread 1 is adding to the DS thread 1 may consider this element added while thread 2 doesn't see that the element has been added yet.
There are plenty of data structures that aren't thread-safe that will still appear to function(i.e. not throw) in a multi threaded environment and they might even perform correctly under certain circumstances(like if you aren't doing any writes to the data structure).
To fully understand this topic exploring the different classes of bugs that occur in concurrent systems is recommended: this short document seems like a good start.
http://pages.cs.wisc.edu/~remzi/OSTEP/threads-bugs.pdf
Related
EDIT: This question might be appropriate for other languages as well - the overall theory behind it seems mostly language agnostic. However, as this will run in a JVM, I'm sure there's differences between JVM overheads/threading and those of other environments.
EDIT 2: To clarify a little better, I guess the main question is which is better for scalability: to have smaller threads that can return quicker to enable processing other chunks of work for other workloads, or try to get a single workload through as quickly as possible? The workloads are sequential and multithreading won't help speed up a single unit of work in this case; it's more in hopes of increasing the throughput of the system overall (thanks to Uri for leading me towards the clarification).
I'm working on a system that's replacing an existing system; the current system has a pretty heavy load, so we already know the replacement needs to be highly scalable. It communicates with several outside processes, such as email, other services, databases, etc., and I'm already planning on making it multithreaded to help with scaling. I've worked on multithreaded apps before, just nothing with this high of a performance/scalability requirement, so I don't have much experience when it comes to getting the absolute most out of concurrency.
The question I have is what's the best way to divide the work up between threads? I'm looking at two different versions, one that creates a single thread for the full workflow, and another that creates a thread for each of the individual steps, continuing on to the next step (in a new/different thread) when the previous step completes - probably with a NodeJS-style callback system, but not terribly concerned about the direct implementation details.
I don't know much about the nitty-gritty details of multithreading - things like context switching, for example - so I don't know if the overhead of multiple threads would swamp the execution time in each of the threads. On one hand, the single thread model seems like it would be fastest for an individual work flow compared to the multiple threads; however, it would also tie up a single thread for the entire workflow, whereas the multiple threads would be shorter lived and would return to the pool quicker (I imagine, at least).
Hopefully the underlying concept is easy enough to understand; here's a contrived pseudo-code example though:
// Single-thread approach
foo();
bar();
baz();
Or:
// Multiple Thread approach
Thread.run(foo);
when foo.isDone()
Thread.run(bar);
when bar.isDone()
Thread.run(baz);
UPDATE: Completely forgot. The reason I'm considering the multithreaded approach is the (possibly mistaken) belief that, since the threads will have smaller execution times, they'll be available for other instances of the overall workload. If each operation takes, say 5 seconds, then the single-thread version locks up a thread for 15 seconds; the multiple thread version would lock up a single thread for 5 seconds, and then it can be used for another process.
Any ideas? If there's anything similar out there in the interwebs, I'd love even a link - I couldn't think of how to search for this (I blame Monday for that, but it would probably be the same tomorrow).
Multithreading is not a silver bullet. It's means to an end.
Before making any changes, you need to ask yourself where your bottlenecks are, and what you're really trying to parallelize. I'm not sure that without more information that we can give good advice here.
If foo, bar, and baz are part of a pipeline, you're not necessarily going to improve the overall latency of a single sequence by using multiple threads.
What you might be able to do is to increase your throughput by letting multiple executions of the pipeline over different input pieces work in parallel, by letting later items to travel through the pipeline while earlier items are blocked on something (e.g., I/O). For instance, if bar() for a particular input is blocked and waiting on a notification, it's possible that you could do computationally heavy operations on another input, or have CPU resources to devote to foo(). A particularly important question is whether any of the external dependencies act as a limited shared resource. e.g., if one thread is accessing system X, is another thread going to be affected?
Threads are also very effective if you want to divide and conquer your problem - splitting your input into smaller parts, running each part through the pipeline, and then waiting on all the pieces to be ready. Is that possible with the kind of workflow you're looking at?
If you need to first do foo, then do bar, and then do baz, you should have one thread do each of these steps in sequence. This is simple and makes obvious sense.
The most common case where you're better off with the assembly line approach is when keeping the code in cache is more important than keeping the data in cache. In this case, having one thread that does foo over and over can keep the code for this step in cache, keep branch prediction information around, and so on. However, you will have data cache misses when you hand the results of foo to the thread that does bar.
This is more complex and should only be attempted if you have good reason to think it will work better.
Use a single thread for the full workflow.
Dividing up the workflow can't improve the completion time for one piece of work: since the parts of the workflow have to be done sequentially anyway, only one thread can work on the piece of work at a time. However, breaking up the stages can delay the completion time for one piece of work, because a processor which could have picked up the last part of one piece of work might instead pick up the first part of another piece of work.
Breaking up the stages into multiple threads is also unlikely to improve the time to completion of all your work, relative to executing all the stages in one thread, since ultimately you still have to execute all the stages for all the pieces of work.
Here's an example. If you have 200 of these pieces of work, each requiring three 5 second stages, and say a thread pool of two threads running on two processors, keeping the entire workflow in a single thread results in your first two results after 15 seconds. It will take 1500 seconds to get all your results, but you only need the working memory for two of the pieces of work at a time. If you break up the stages, then it may take a lot longer than 15 seconds to get your first results, and you potentially may need memory for all 200 pieces of work proceeding in parallel if you still want to get all the results in 1500 seconds.
In most cases, there are no efficiency advantages to breaking up sequential stages into different threads, and there may be substantial disadvantages. Threads are generally only useful when you can use them to do work in parallel, which does not seem to be the case for your work stages.
However, there is a huge disadvantage to breaking up the stages into separate threads. That disadvantage is that you now need to write multithreaded code that manages the stages. It's extremely easy to write bugs in such code, and such bugs can be very difficult to catch prior to production deployment.
The way to avoid such bugs is to keep the threading code as simple as possible given your requirements. In the case of your work stages, the simplest possible threading code is none at all.
I've been caught by yet another deadlock in our Java application and started thinking about how to detect potential deadlocks in the future. I had an idea of how to do this, but it seems almost too simple.
I'd like to hear people's views on it.
I plan to run our application for several hours in our test environment, using a typical data set.
I think it would be possible to perform bytecode manipulation on our application such that, whenever it takes a lock (e.g. entering a synchronized block), details of the lock are added to a ThreadLocal list.
I could write an algorithm that, at some later point, compares the lists for all threads and checks if any contain the same pair of locks in opposite order - this would be reported as a deadlock possibility. Again, I would use bytecode manipulation to add this periodic check to my application.
So my question is this: is this idea (a) original and (b) viable?
This is something that we talked about when I took a course in concurrency. I'm not sure if your implementation is original, but the concept of analysis to determine potential deadlock is not unique. There are dynamic analysis tools for Java, such as JCarder. There is also research into some analysis that can be done statically.
Admittedly, it's been a couple of years since I've looked around. I don't think JCarder was the specific tool we talked about (at least, the name doesn't sound familiar, but I couldn't find anything else). But the point is that analysis to detect deadlock isn't an original concept, and I'd start by looking at research that has produced usable tools as a starting point - I would suspect that the algorithms, if not the implementation, are generally available.
I have done something similar to this with Lock by supplying my own implementation.
These days I use the actor model, so there is little need to lock the data (as I have almost no shared mutable data)
In case you didn't know, you can use the Java MX bean to detect deadlocked threads programmatically. This doesn't help you in testing but it will help you at least better detect and recover in production.
ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
long[] deadLockedThreadIds = threadMxBean.findMonitorDeadlockedThreads();
// log the condition or even interrupt threads if necessary
...
That way you can find some deadlocks, but never prove their absence. I'd better develop static checking tool, a kind of bytecode analizer, feeded with annotations for each synchronized method. Annotations should show the place of the annotated method in the resource graph. The task is then to find loops in the graph. Each loop means deadlock.
I know that thread safety of java sockets has been discussed in several threads here on stackoverflow, but I haven't been able to find a clear answer to this question - Is it, in practice, safe to have multiple threads concurrently write to the same SocketOutputStream, or is there a risk that the data sent from one thread gets mixed up with the data from another tread? (For example the receiver on the other end first receives the first half of one thread's message and then some data from another thread's message and then the rest of the first thread's message)
The reason I said "in practice" is that I know the Socket class isn't documented as thread-safe, but if it actually is safe in current implementations, then that's good enough for me. The specific implementation I'm most curious about is Hotspot running on Linux.
When looking at the Java layer of hotspot's implementation, more specifically the implementation of socketWrite() in SocketOutputStream, it looks like it should be thread safe as long as the native implementation of socketWrite0() is safe. However, when looking at the implemention of that method (j2se/src/solaris/native/java/net/SocketOutputStream.c), it seems to split the data to be sent into chunks of 64 or 128kb (depending on whether it's a 64bit JVM) and then sends the chunks in seperate writes.
So - to me, it looks like sending more than 64kb from different threads is not safe, but if it's less than 64kb it should be safe... but I could very well be missing something important here. Has anyone else here looked at this and come to a different conclusion?
I think it's a really bad idea to so heavily depend on the implementation details of something that can change beyond your control. If you do something like this you will have to very carefully control the versions of everything you use to make sure it's what you expect, and that's very difficult to do. And you will also have to have a very robust test suite to verify that the multithreaded operatio functions correctly since you are depending on code inspection and rumors from randoms on StackOverflow for your solution.
Why can't you just wrap the SocketOutputStream into another passthrough OutputStream and then add the necessary synchronization at that level? It's much safer to do it that way and you are far less likely to have unexpected problems down the road.
According to this documentation http://www.docjar.com/docs/api/java/net/SocketOutputStream.html, the class does not claim to be thread safe, and thus assume it is not. It inherits from FileOutputStream, which normally file I/O is not inherently thread safe.
My advice is that if the class is related to hardware or communications, it is not thread safe or "blocking". The reason is thread safe operations consume more time, which you may not like. My background is not in Java but other libraries are similar in philosophy.
I notice you tested the class extensively, but you may test it all day for many days, and it may not prove anything, my 2-cents.
Good luck & have fun with it.
Tommy Kwee
All,
What should be the approach to writing a thread safe program. Given a problem statement, my perspective is:
1 > Start of with writing the code for a single threaded environment.
2 > Underline the fields which would need atomicity and replace with possible concurrent classes
3 > Underline the critical section and enclose them in synchronized
4 > Perform test for deadlocks
Does anyone have any suggestions on the other approaches or improvements to my approach. So far, I can see myself enclosing most of the code in synchronized blocks and I am sure this is not correct.
Programming in Java
Writing correct multi-threaded code is hard, and there is not a magic formula or set of steps that will get you there. But, there are some guidelines you can follow.
Personally I wouldn't start with writing code for a single threaded environment and then converting it to multi-threaded. Good multi-threaded code is designed with multi-threading in mind from the start. Atomicity of fields is just one element of concurrent code.
You should decide on what areas of the code need to be multi-threaded (in a multi-threaded app, typically not everything needs to be threadsafe). Then you need to design how those sections will be threadsafe. Methods of making one area of the code threadsafe may be different than making other areas different. For example, understanding whether there will be a high volume of reading vs writing is important and might affect the types of locks you use to protect the data.
Immutability is also a key element of threadsafe code. When elements are immutable (i.e. cannot be changed), you don't need to worry about multiple threads modifying them since they cannot be changed. This can greatly simplify thread safety issues and allow you to focus on where you will have multiple data readers and writers.
Understanding details of concurrency in Java (and details of the Java memory model) is very important. If you're not already familiar with these concepts, I recommend reading Java Concurrency In Practice http://www.javaconcurrencyinpractice.com/.
You should use final and immutable fields wherever possible, any other data that you want to change add inside:
synchronized (this) {
// update
}
And remember, sometimes stuff brakes, and if that happens, you don't want to prolong the program execution by taking every possible way to counter it - instead "fail fast".
As you have asked about "thread-safety" and not concurrent performance, then your approach is essentially sound. However, a thread-safe program that uses synchronisation probably does not scale much in a multi cpu environment with any level of contention on your structure/program.
Personally I like to try and identify the highest level state changes and try and think about how to make them atomic, and have the state changes move from one immutable state to another – copy-on-write if you like. Then the actual write can be either a compare-and-set operation on an atomic variable or a synchronised update or whatever strategy works/performs best (as long as it safely publishes the new state).
This can be a bit difficult to structure if your new state is quite different (requires updates to several fields for instance), but I have seen it very successfully solve concurrent performance issues with synchronised access.
Buy and read Brian Goetz's "Java Concurrency in Practice".
Any variables (memory) accessible by multiple threads potentially at the same time, need to be protected by a synchronisation mechanism.
I'm wondering what good ways there would be make assertions about synchronization or something so that I could detect synchronization violations (while testing).
That would be used for example for the case that I'd have a class that is not thread-safe and that isn't going to be thread-safe. With some way I would have some assertion that would inform me (log or something) if some method(s) of it was called from multiple threads.
I'm longing for something similar that could be made for AWT dispatch thread with the following:
public static void checkDispatchThread() {
if(!SwingUtilities.isEventDispatchThread()) {
throw new RuntimeException("GUI change made outside AWT dispatch thread");
}
}
I'd only want something more general. The problem description isn't so clear but I hope somebody has some good approaches =)
You are looking for the holy grail, I think. AFAIK it doesn't exist, and Java is not a language that allows such an approach to be easily created.
"Java Concurrency in Practice" has a section on testing for threading problems. It draws special attention to how hard it is to do.
When an issue arises over threads in Java it is usually related to deadlock detection, more than just monitoring what Threads are accessing a synchronized section at the same time. JMX extension, added to JRE since 1.5, can help you detect those deadlocks. In fact we use JMX inside our own software to automatically detect deadlocks an trace where it was found.
Here is an example about how to use it.
IntelliJ IDEA has a lot of useful concurrency inspections. For example, it warns you when you are accessing the same object from both synchronised and unsynchronised contexts, when you are synchronising on non-final objects and more.
Likewise, FindBugs has many similar checks.
As well as #Fernando's mention of thread deadlocking, another problem with multiple threads is concurrent modifications and the problems it can cause.
One thing that Java does internally is that a collection class keeps a count of how many times it's been updated. And then an iterator checks that value on every .next() against what it was when the interator was created to see if the collection has been updated while you were iterating. I think that principle could be used more generally.
Try ConTest or Covertity
Both tools analyze the code to figure out which parts of the data might be shared between threads and then they instrument the code (add extra bytecode to the compiled classes) to check if it breaks when two threads try to change some data at the same time. The two threads are then run over and over again, each time starting them with a slightly different time offset to get many possible combinations of access patterns.
Also, check this question: Unit testing a multithreaded application?
You might be interested in an approach Peter Veentjer blogged about, which he calls The Concurrency Detector. I don't believe he has open-sourced this yet, but as he describes it the basic idea is to use AOP to instrument code that you're interested in profiling, and record which thread has touched which field. After that it's a matter of manually or automatically parsing the generated logs.
If you can identify thread unsafe classes, static analysis might be able to tell you whether they ever "escape" to become visible to multiple threads. Normally, programmers do this in their heads, but obviously they are prone to mistakes in this regard. A tool should be able to use a similar approach.
That said, from the use case you describe, it sounds like something as simple as remembering a thread and doing assertions on it might suffice for your needs.
class Foo {
private final Thread owner = Thread.currentThread();
void x() {
assert Thread.currentThread() == owner;
/* Implement method. */
}
}
The owner reference is still populated even when assertions are disabled, so it's not entirely "free". I also wouldn't want to clutter many of my classes with this boilerplate.
The Thread.holdsLock(Object) method may also be useful to you.
For the specific example you give, SwingLabs has some helper code to detect event thread violations and hangs. https://swinghelper.dev.java.net/
A while back, I worked with the JProbe java profiling tools. One of their tools (threadalyzer?) looked for thread sync violations. Looking at their web page, I don't see a tool by that name or quite what I remember. But you might want to take a look. http://www.quest.com/jprobe/performance-home.aspx
You can use Netbeans profiler or JConsole to check the threads status in depth