Writing of primitives to stack or heap? - java

I had a job interview today.
One of the questions the interviewer asked me was: how much time will it take for a thread to read an integer value that another thread set? Microseconds? Milliseconds or even a second?
He told me that it can reach even a second in the case of a long value, because "the long value is written first to the stack but the reading threads read from the heap, so the value to be read should first be copied to the heap" and for long values it can take a lot of time.
Can someone please tell me if I understood it correctly and explain a bit more on that?
Thanks!

how much time will it take for a thread to read an integer value that another thread set? Microseconds? Milliseconds or even a second?
Depends on a lot of factors. If the question is about a volatile int versus long field then the answer typically on a modern CPU is still microseconds. It actually can be quite close to a normal read in terms of performance if the field does not have frequent accesses from other threads. However, it can be significantly more expensive depending on the cost of the cache invalidation and the locking of the memory line if the variable is heavily contended. Of course, if you are accessing the field inside of a synchronized block then it depends on the lock contention between threads and what other operations are in the block.
For example, on my 4 core Mac, running 10 threads all incrementing a volatile int, they can do 1 million ++ in ~190ms. That's ~0.19 microseconds per increment if my math is correct. Certainly not scientific in any manner but it should give you some idea of the scale. Changing to a volatile long didn't change the numbers much but I'm running on a native 64-bit system and JVM. Again, the performance hit inside of a larger application with a lot of cached memory might get closer to milliseconds but nowhere near seconds.
For comparison, it can do 1 million AtomicInteger increments in ~1300ms and 1 million AtomicLong increments in ~1400ms. Again, these numbers are approximations in a very simple test program.
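A test along these lines can be sketched as follows. This is not the program that produced the numbers above; the class name IncrementTiming and the thread and iteration counts are made up for illustration. It uses AtomicInteger because ++ on a volatile int is a separate read and write, so concurrent increments can be lost, while incrementAndGet() is a single atomic read-modify-write.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical micro-test: several threads increment one shared AtomicInteger.
final class IncrementTiming {

    // Starts `threads` threads that each perform `perThread` increments
    // and returns the final counter value.
    static int run(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // wait for every worker to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int total = run(10, 100_000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(total + " increments in " + elapsedMillis + " ms");
    }
}
```

The timing printed by main is only a rough indication; a single unwarmed run like this says little about steady-state cost.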
He told me that it can reach even a second in the case of a long value, because "the long value is written first to the stack but the reading threads read from the heap, so the value to be read should first be copied to the heap" and for long values it can take a lot of time.
This just does not make much sense. The only difference between an int and long is that a long can take multiple accesses depending on your runtime architecture. For example, if you are on a 32-bit architecture, it may take multiple accesses to read and update the 64-bit value. But the idea that accessing a long would take seconds just because it is double the width is not valid.
how much time will it take for a thread to read an integer value that another thread set?
As mentioned by others, if this instead is talking about when one thread will see an unsynchronized update of a variable made in another thread, the answer could certainly be seconds but that has little to do with the width of the variable and nothing to do with copying between stack and heap.
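To make the visibility point concrete, here is a minimal sketch (the class and method names are invented for this example). The reader thread spins on a flag that the main thread sets. Because the flag is volatile, the reader is guaranteed to observe the write; with a plain field the Java memory model gives no such guarantee and the loop is permitted to spin for an arbitrarily long time.

```java
// Hypothetical demo: a reader spins until a writer's update becomes visible.
final class VisibilityDemo {

    // volatile: a write by one thread is guaranteed to become visible to readers.
    private static volatile boolean ready = false;

    // Returns true if the reader thread saw the flag within the timeout.
    static boolean readerSeesWrite(long timeoutMillis) {
        ready = false;
        Thread reader = new Thread(() -> {
            while (!ready) {
                // busy-wait until the write is observed
            }
        });
        reader.setDaemon(true); // never keep the JVM alive because of this thread
        reader.start();
        ready = true;           // the update made by "another thread"
        try {
            reader.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !reader.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("reader saw the write: " + readerSeesWrite(5_000));
    }
}
```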

Related

Java How to make sure that a shared data write affects multiple readers

I am trying to code a processor intensive task, so I would like to use multithreading and share the calculation between the available processor cores.
Let's say I have thousands of iterations and all iterations have two phases:
1. Some worker threads scan through hundreds of thousands of options while reading data from a shared array (or some other data structure); the data is not modified in this phase.
2. One thread collects the results from all the worker threads (while they are waiting) and makes modifications to the shared array.
The phases are in sequence, so that there is no overlap (no concurrent writing and reading of the data). My problem is: How would I be sure that the data (cache) is updated for the working threads before the next phase, Phase 1, starts.
I am assuming that when people speak about cache or caching in this context, they mean the processor cache (correct me if I'm wrong).
As I understood, volatile can be used for nonreference types only, while there is no point to use synchronized, because the working threads will block each other at reading (there can be thousands of reads when processing an option).
What else can I use in this case?
Right now I have a few ideas, but I have no idea how costly they are (most probably they are):
create new working threads for all iterations
in a synchronized block make a copy of the array (can be up to 195kB in size) for each threads before a new iteration begins
I read about ReentrantReadWriteLock, but I can't understand how it is related to caching. Can acquiring a read lock force the reader's cache to update?
The thing I was searching for was mentioned in the Java Tutorial on concurrency; I just had to look deeper. In this case it was the AtomicIntegerArray class. Unfortunately it is not efficient enough for my needs. I ran some tests, which may be worth sharing.
I approximated the cost of the different memory access methods by running each of them many times, averaging the elapsed times, and breaking everything down to one average read or write.
I used an integer array of size 50000 and repeated every test method 100 times, then averaged the results. The read tests perform 50000 random(ish) reads. The results show the approximate time of one read/write access. This can't be taken as an exact measurement, but I believe it gives a good sense of the time costs of the different access methods. On different processors or with different parameters, however, the results may be completely different because of differing cache sizes and clock speeds.
So the results are:
1. Fill time with set is: 15.922673ns
2. Fill time with lazySet is: 4.5303152ns
3. Atomic read time is: 9.146553ns
4. Synchronized read time is: 57.858261399999996ns
5. Single threaded fill time is: 0.2879112ns
6. Single threaded read time is: 0.3152002ns
7. Immutable copy time is: 0.2920892ns
8. Immutable read time is: 0.650578ns
Points 1 and 2 show the write results on an AtomicIntegerArray with sequential writes. In some article I read about the good efficiency of the lazySet() method, so I wanted to test it. It usually outperforms the set() method by about 4 times, although different array sizes show different results.
Points 3 and 4 show the difference between "atomic" access and synchronized access (a synchronized getter) to one item of the array via random(ish) reads by four different threads simultaneously. This clearly indicates the benefit of the "atomic" access.
Since the first four values looked shockingly high, I really wanted to measure the access times without multithreading, so I got the results of points 5 and 6. I tried to copy and modify the methods from the previous tests to keep the code as close as possible. Of course there can be optimizations I can't affect.
Then, just out of curiosity, I came up with points 7 and 8, which imitate immutable access. Here one thread creates the array (by sequential writes) and passes its reference to another thread, which does the random(ish) read accesses on it.
The results vary heavily if the parameters are changed, like the size of the array or the number of times the methods are run.
The conclusion:
If an algorithm is extremely memory intensive (lots of reads from the same small array, interrupted by short calculations - which is my case), multithreading can slow down the calculation instead of speeding it up. But if it has many many reads, compared to the size of the array, it may be helpful to use an immutable copy of the array, and use multiple threads.
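For the two-phase setup described in the question, one option that the measurements above do not cover is to hand each phase to an ExecutorService and keep the shared data in a plain int[]. The java.util.concurrent package documentation states that actions before submitting a task happen-before the task runs, and that a task's actions happen-before whatever follows a successful Future.get(), so no volatile, atomic, or per-read locking is needed as long as the phases do not overlap. The sketch below assumes that structure; the class name PhasedSum, the summing workload, and the "increment every element" update are placeholders, not the asker's algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical two-phase loop: workers only read, then one thread writes.
final class PhasedSum {

    // Runs `iterations` rounds and returns the total computed in the last round.
    static long run(int[] shared, int workers, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            long total = 0;
            int chunk = (shared.length + workers - 1) / workers;
            for (int it = 0; it < iterations; it++) {
                // Phase 1: workers read disjoint slices of the plain array.
                List<Callable<Long>> tasks = new ArrayList<>();
                for (int w = 0; w < workers; w++) {
                    final int from = Math.min(shared.length, w * chunk);
                    final int to = Math.min(shared.length, from + chunk);
                    tasks.add(() -> {
                        long sum = 0;
                        for (int i = from; i < to; i++) {
                            sum += shared[i]; // plain, unsynchronized read
                        }
                        return sum;
                    });
                }
                total = 0;
                for (Future<Long> f : pool.invokeAll(tasks)) {
                    total += f.get(); // results are visible after get()
                }
                // Phase 2: only the calling thread modifies the shared array.
                for (int i = 0; i < shared.length; i++) {
                    shared[i]++;
                }
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new int[10], 3, 3));
    }
}
```

Reusing one pool across iterations also avoids the cost of creating new worker threads for every iteration, which was one of the ideas listed in the question.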

Java, editing array from multiple threads

I have many byte arrays of size 4096 (16x16x16), and I want to edit them from many threads at the same time. There is a small chance that any one element will be written by more than one thread at the same time, and it is almost impossible that more than 3 threads will access one element at the same time (write or read).
But a whole array can be accessed by many threads at once.
Can this cause any problems? If yes, then how to fix/avoid them?
I know that reading should be safe, and I have heard about some problems with writing.
The code needs to be fast (real-time based stuff), so I can't synchronize it, and I can't use any ArrayList, because that will cause problems with memory. (There will be around 1000-20000 (or even more) arrays like that.)
Every time someone says real time in the same sentence as Java it piques my interest, because real time has a specific meaning that most people don't understand (Oracle/Sun have a real-time JVM available for purchase).
But I digress: reads and writes of individual byte array elements are atomic, so a single read or write is thread safe. Two threads cannot corrupt an element by writing to it at the same time, because the operation cannot be broken down into anything smaller (which would allow a scheduler to interrupt halfway through). You do have to be careful with compound operations: do not read a value, do some math, and then write it back while expecting the value at the given index to have stayed the same in the meantime. Also keep in mind that atomic is not the same as visible: without some form of synchronization, a thread is not guaranteed to see another thread's write right away.
So in short there is nothing stopping you from doing this as long as your logic around it is also thread safe.
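The read, compute, write-back hazard can be avoided per element without a lock. The sketch below uses AtomicIntegerArray from java.util.concurrent.atomic; the class name SharedCells and the workload are invented for illustration. Note the assumption that int elements are acceptable: there is no byte variant of this class in the JDK, so this roughly quadruples the storage compared with byte[], which may matter given the memory concern in the question.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical demo: many threads do read-modify-write on shared elements.
final class SharedCells {

    // Each thread increments elements round-robin; returns the sum of all elements.
    static int hammer(int cells, int threads, int perThread) {
        AtomicIntegerArray array = new AtomicIntegerArray(cells);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    array.incrementAndGet(i % cells); // atomic per-element update
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        int sum = 0;
        for (int i = 0; i < array.length(); i++) {
            sum += array.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hammer(4096, 4, 100_000));
    }
}
```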

Timing a remote call in a multithreaded java program

I am writing a stress test that will issue many calls to a remote server. I want to collect the following statistics after the test:
1. Latency (in milliseconds) of the remote call.
2. Number of operations per second that the remote server can handle.
I can successfully get (2), but I am having problems with (1). My current implementation is very similar to the one shown in this other SO question. And I have the same problem described in that question: latency reported by using System.currentTimeMillis() is longer than expected when the test is run with multiple threads.
I analyzed the problem and I am positive the problem comes from the thread interleaving (see my answer to the other question that I linked above for details), and that System.currentTimeMillis() is not the way to solve this problem.
It seems that I should be able to do it using java.lang.management, which has some interesting methods like:
ThreadMXBean.getCurrentThreadCpuTime()
ThreadMXBean.getCurrentThreadUserTime()
ThreadInfo.getWaitedTime()
ThreadInfo.getBlockedTime()
My problem is that even though I have read the API, it is still unclear to me which of these methods will give me what I want. In the context of the other SO question that I linked, this is what I need:
long start_time = rightMethodToCall();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (rightMethodToCall() - start_time);
So that the difference gives me a very good approximation of the time that the remote call took, even in a multi-threaded environment.
Restriction: I'd like to avoid protecting that block of code with a synchronized block because my program has other threads that I would like to allow to continue executing.
EDIT: Providing more info.:
The issue is this: I want to time the remote call, and just the remote call. If I use System.currentTimeMillis or System.nanoTime(), AND if I have more threads than cores, then it is possible that I could have this thread interleaving:
Thread1: long start_time ...
Thread1: result = ...
Thread2: long start_time ...
Thread2: result = ...
Thread2: long difference ...
Thread1: long difference ...
If that happens, then the difference calculated by Thread2 is correct, but the one calculated by Thread1 is incorrect (it would be greater than it should be). In other words, for the measurement of the difference in Thread1, I would like to exclude the time of lines 4 and 5. Is this time that the thread was WAITING?
Summarizing the question in a different way in case it helps other people understand it better (this quote is how @jason-c put it in his comment):
[I am] attempting to time the remote call, but running the test with multiple threads just to increase testing volume.
Use System.nanoTime() (but see updates at end of this answer).
You definitely don't want to use the current thread's CPU or user time, as user-perceived latency is wall clock time, not thread CPU time. You also don't want to use the current thread's blocking or waiting time, as it measures per-thread contention times which also doesn't accurately represent what you are trying to measure.
System.nanoTime() reads a high-resolution clock with a fixed reference point and will measure exactly what you are trying to measure. Its results are relatively accurate: although the granularity is technically only guaranteed to be as good as or better than that of currentTimeMillis(), in practice it tends to be much better, since it is generally implemented with hardware clocks or other performance timers (e.g. QueryPerformanceCounter on Windows or clock_gettime on Linux).
long start_time = System.nanoTime();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (System.nanoTime() - start_time);
long milliseconds = difference / 1000000;
System.nanoTime() does have its own set of issues, but be careful not to get whipped up in paranoia; for most applications it is more than adequate. You just wouldn't want to use it for, say, precise timing when sending audio samples to hardware or something (which you wouldn't do directly in Java anyway).
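The timing pattern above can be wrapped in a small helper so that every thread measures its own call the same way. This is only a sketch: CallTimer and busyWait are hypothetical names, and busyWait stands in for the remote call so the example runs without a server.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper around the System.nanoTime() pattern.
final class CallTimer {

    // Returns the elapsed wall-clock time of one run of `call`, in nanoseconds.
    static long timeNanos(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return System.nanoTime() - start;
    }

    // Stand-in for the remote call: spins for at least `nanos` nanoseconds.
    static void busyWait(long nanos) {
        long start = System.nanoTime();
        while (System.nanoTime() - start < nanos) {
            // spin
        }
    }

    public static void main(String[] args) {
        long elapsed = timeNanos(() -> busyWait(TimeUnit.MILLISECONDS.toNanos(5)));
        System.out.println("took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + " ms");
    }
}
```

Only differences between two nanoTime() readings taken in the same JVM are meaningful; the absolute values are not tied to wall-clock time.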
Update 1:
More importantly, how do you know the measured values are longer than expected? If your measurements are showing true wall clock time, and some threads are taking longer than others, that is still an accurate representation of user-perceived latency, as some users will experience those longer delay times.
Update 2 (based on clarification in comments):
Much of my above answer is still valid then; but for different reasons.
Using per-thread time does not give you an accurate representation because a thread could be idle/inactive while the remote request is still processing, and you would therefore exclude that time from your measurement even though it is part of perceived latency.
Further inaccuracies are introduced by the remote server taking longer to process the simultaneous requests you are making - this is an extra variable that you are adding (although it may be acceptable as representative of the remote server being busy).
Wall time is also not completely accurate because, as you have seen, variances in local thread overhead may add extra latency that isn't typically present in single-request client applications (although this still may be acceptable as representative of a client application that is multi-threaded, but it is a variable you cannot control).
Of those two, wall time will still get you closer to the actual result than per-thread time, which is why I left the previous answer above. You have a few options:
You could do your tests on a single thread, serially -- this is ultimately the most accurate way to achieve your stated requirements.
You could not create more threads than cores, e.g. a fixed size thread pool with bound affinities (tricky: Java thread affinity) to each core and measurements running as tasks on each. Of course this still adds any variables due to synchronization of underlying mechanisms that are beyond your control. This may reduce the risk of interleaving (especially if you set the affinities) but you still do not have full control over e.g. other threads the JVM is running or other unrelated processes on the system.
You could measure the request handling time on the remote server; of course this does not take network latency into account.
You could continue using your current approach and do some statistical analysis on the results to remove outliers.
You could not measure this at all, and simply do user tests and wait for a comment on it before attempting to optimize it (i.e. measure it with people, who are what you're developing for anyways). If the only reason to optimize this is for UX, it could very well be the case that users have a pleasant experience and the wait time is totally acceptable.
Also, none of this makes any guarantees that other unrelated threads on the system won't be affecting your timings, but that is why it is important to both a) run your test multiple times and average (obviously) and b) set an acceptable requirement for timing errors that you are OK with (do you really need to know this to e.g. 0.1ms accuracy?).
Personally, I would either do the first, single-threaded approach and let it run overnight or over a weekend, or use your existing approach and remove outliers from the result and accept a margin of error in the timings. Your goal is to find a realistic estimate within a satisfactory margin of error. You will also want to consider what you are going to ultimately do with this information when deciding what is acceptable.
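One simple way to make the collected latencies robust against outliers is to report percentiles rather than a plain average. The sketch below uses the nearest-rank definition of a percentile; that choice, the class name LatencyStats, and the sample values in main are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical summary of measured latencies using nearest-rank percentiles.
final class LatencyStats {

    // Returns the p-th percentile (0 < p <= 100) of the samples.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up latencies in milliseconds; 250 plays the role of an outlier.
        long[] latencies = {12, 11, 13, 12, 250};
        System.out.println("median = " + percentile(latencies, 50));
        System.out.println("max    = " + percentile(latencies, 100));
    }
}
```

With these made-up samples the median stays at 12 even though the single slow call would pull the mean up to about 60.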

java multithreading performance scaling

can you explain this nonsense to me?
i have a method that basically fills up an array with mathematical operations. there's no I/O involved or anything. now, this method takes about 50 seconds to run, and the code is perfectly scalable (theoretically 100%), so i split it up into 4 threads, wait for them to complete, and reassemble the 4 arrays. now, i run the program on a quad core processor, expecting it to take about 15 seconds, and it actually takes 58 seconds. that's right: it takes longer! i see the cpu working 100%, and i know that each thread does 1/4 of the calculations, and creating threads and reassembling the arrays take about 1-2 ms in total.
what's causing such loss of performance? what the hell is the cpu doing all that time?
CODE: http://pastebin.com/cFUgiysw
Threads don't work that way.
Threads are still part of the same process (depending on the OS), so in terms of the operating system - CPU time will be scheduled the same for 4 threads in 1 process as it is for 1 thread in 1 process.
Also, with such a small number of values, you won't see the scalability in the midst of the overhead. Re-assembling the arrays in java will be costly.
Check out things like "Context switching overhead" - things like that always mess you up when you try to map theory to practise :P
I would stick to the single-threaded way :)
~ Dan
http://en.wikipedia.org/wiki/Context_switch
A lot depends on what you are doing and how you are dividing the work. There are many possible causes for this problem.
The most likely cause is that you are using all the bandwidth of your CPU to main memory bus with one thread. This can happen if your data set is larger than your CPU cache, especially if you have some random access behaviour. You could consider trying to reuse the original array, rather than taking multiple copies, to reduce cache churn.
Your locking overhead is greater than the performance gain. I suspect you have used very coarse locking, so this shouldn't be an issue.
Starting and stopping threads takes too long. As your code is multi second, I doubt this too.
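The suggestion to reuse one array instead of reassembling copies can be sketched like this. The code from the pastebin link is not reproduced here, so compute() is only a placeholder for the real math and ParallelFill is a made-up name. Each thread writes a disjoint index range of the same output array, and join() makes those writes visible to the caller.

```java
import java.util.Arrays;

// Hypothetical sketch: threads fill disjoint ranges of one shared array.
final class ParallelFill {

    // Placeholder for the per-element calculation.
    static double compute(int i) {
        return (double) i * i / (i + 1.0);
    }

    static double[] fillSequential(int n) {
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = compute(i);
        }
        return out;
    }

    static double[] fillParallel(int n, int threads) {
        double[] out = new double[n];
        Thread[] workers = new Thread[threads];
        int chunk = (n + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(n, t * chunk);
            final int to = Math.min(n, from + chunk);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    out[i] = compute(i); // no two threads touch the same index
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // after join(), this worker's writes are visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(fillSequential(1_000), fillParallel(1_000, 4)));
    }
}
```

Whether this actually runs faster than the single-threaded version still depends on the memory bandwidth issue described above.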
There is a cost associated with starting new threads. I don't think it should be as much as 8 seconds, but it depends on what your threads are doing. Some threads need to create a copy of the data they are handling to be thread safe, and that can take some time. This cost is commonly referred to as overhead. If part of the work cannot run in parallel, for instance because it reads the same file or needs access to a shared resource, the threads may have to wait on each other; this can take some time, and under suboptimal conditions it can take more time than serial execution. My tip is to check for these non-parallelizable parts and remove them from the threaded part if possible. Also try using fewer threads; 4 threads for 4 CPUs is not always optimal.
Hope it helps.
Unless you are constantly creating and killing threads, the thread overhead shouldn't be a problem. Four threads running simultaneously is no big deal for the scheduler.
As Peter Lawrey suggested the memory bandwidth could be the problem. Your 50-second code is running on a java engine and they both compete for the available memory bandwidth. The java engine needs memory bandwidth to execute your code and your code needs it to do its calculations.
You write "perfectly scalable" which would be the case if your code was compiled. Since it runs on a java engine this is not the case. So the 16% increase in overall time could be seen as the difference between the smoothness of one thread vs the chaos of four colliding over memory accesses.

Memory Consistency Errors vs Thread interference

What is the difference between memory consistency errors and thread interference?
How does the use of synchronization to avoid them differ or not? Please illustrate with an example. I couldn't get this from the sun Java tutorial. Any recommendation of reading material(s) to understand this purely in context of java would be helpful.
Memory consistency errors can't be understood purely in the context of java--the details of shared memory behavior on multi-cpu systems are highly architecture-specific, and to make it worse, x86 (where most people coding today learned to code) has pretty programmer-friendly semantics compared to architectures that were designed for multi-processor machines from the beginning (like POWER and SPARC), so most people really aren't used to thinking about memory access semantics.
I'll give a common example of where memory consistency errors can get you into trouble. Assume for this example, that the initial value of x is 3. Nearly all architectures guarantee that if one CPU executes the code:
STORE 4 -> x // x is a memory address
STORE 5 -> x
and another CPU executes
LOAD x
LOAD x
then the second CPU will see either 3,3, 3,4, 3,5, 4,4, 4,5, or 5,5 from the perspective of its two LOAD instructions. Basically, CPUs guarantee that the order of writes to a single memory location is maintained from the perspective of all CPUs, even if the exact time that each of the writes becomes known to other CPUs is allowed to vary.
Where CPUs differ from one another tends to be in the guarantees they make about LOAD and STORE operations involving different memory addresses. Assume for this example that the initial values of both x and y are 4. If one CPU executes
STORE 5 -> x // x is a memory address
STORE 5 -> y // y is a different memory address
then another CPU executes
LOAD x
LOAD y
In this example, on some architectures, the second thread can see 4,4, 5,5, 4,5, OR 5,4. The surprising case is 4,5, where the store to y is observed before the earlier store to x. Ouch!
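In Java terms, the two-address example corresponds to publishing data through a flag. A minimal sketch, with invented names: payload plays the role of x and published the role of y. Declaring the flag volatile is what rules out the surprising ordering, because a volatile write cannot be reordered before the plain write that precedes it, and a reader that sees the flag is guaranteed to see the payload too.

```java
// Hypothetical demo: publish a plain field through a volatile flag.
final class Publication {
    private int payload;                // plain field ("x")
    private volatile boolean published; // volatile flag ("y")

    void publish(int value) {
        payload = value;  // (1) plain write
        published = true; // (2) volatile write; (1) stays ordered before it
    }

    // Spins until the flag is set, then returns the payload.
    int awaitPayload() {
        while (!published) {
            // busy-wait
        }
        return payload; // sees the value written in (1)
    }

    // Publishes `value` from the calling thread and returns what a reader thread saw.
    static int roundTrip(int value) {
        Publication p = new Publication();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = p.awaitPayload());
        reader.start();
        p.publish(value);
        try {
            reader.join(); // makes the reader's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(5));
    }
}
```

If published were a plain boolean, the reader would have no guarantee of ever leaving the loop, nor of seeing the payload once it did.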
Most architectures deal with memory at the granularity of a 32 or 64 bit word--this means that on a 32 bit POWER/SPARC machine, you can't update a 64-bit integer memory location and safely read it from another thread ever without explicit synchronization. Goofy, huh?
Thread interference is much simpler. The basic idea is that java doesn't guarantee that a single statement of java code executes atomically. For example, incrementing a value requires reading the value, incrementing it, then storing it again. So if int x = 1 and two threads each execute x++, x can end up as 2 or 3 depending on how the lower-level code interleaved (the lower-level abstract code at work here presumably looks like LOAD x, INCREMENT, STORE x). The basic idea here is that java code is broken down into smaller atomic pieces and you don't get to make assumptions about how they interleave unless you use synchronization primitives explicitly.
For more information, check out this paper. It's long and dry and written by a notorious asshole, but hey, it's pretty good too. Also check out this (or just google for "double checked locking is broken"). These memory reordering issues reared their ugly heads for many C++/java programmers who tried to get a little bit too clever with their singleton initializations a few years ago.
Thread interference is about threads overwriting each other's updates (say, thread A incrementing a counter and thread B decrementing it at the same time), leading to a situation where the actual value of the counter is unpredictable. You avoid it by enforcing exclusive access, one thread at a time.
On the other hand, memory inconsistency is about visibility. Thread A may increment the counter, but thread B may not be aware of this change yet, so it might read some prior value. You avoid this by establishing a happens-before relationship, which
is simply a guarantee that memory writes by one specific statement are visible to another specific statement (per Oracle).
The article to read on this is "Memory Models: A Case for Rethinking Parallel Languages and Hardware" by Adve and Boehm in the August 2010 vol. 53 number 8 issue of Communications of the ACM. This is available online for Association for Computing Machinery members (http://www.acm.org). This deals with the problem in general and also discusses the Java Memory Model.
For more information on the Java Memory Model, see http://www.cs.umd.edu/~pugh/java/memoryModel/
Memory consistency problems normally manifest as broken happens-before relationships.
Time A: Thread 1 sets int i = 1
Time B: Thread 2 sets i = 2
Time C: Thread 1 reads i, but still sees a value of 1, because for any number of reasons it did not get the most recent stored value in memory.
You prevent this from happening either by using the volatile keyword on the variable, or by using the AtomicX classes from the java.util.concurrent.atomic package. Either of these measures makes sure that no second thread will see a partially modified value, and no one will ever see a value that isn't the most current real value in memory.
(Synchronizing the getter and setter would also fix the problem, but may look strange to other programmers who don't know why you did it, and can also break down in the face of things like binding frameworks and persistence frameworks that use reflection.)
--
Thread interleaves are when two threads munge an object up and see inconsistent states.
We have a PurchaseOrder object with an itemQuantity and an itemPrice; automatic logic generates the invoice total.
Time 0: Thread 1 sets itemQuantity 50
Time 1: Thread 2 sets itemQuantity 100
Time 2: Thread 1 sets itemPrice 2.50, invoice total is calculated at $250 (100 x 2.50, not the 50 x 2.50 = $125 that Thread 1 intended)
Time 3: Thread 2 sets itemPrice 3, invoice total is calculated at $300
Thread 1 performed an incorrect calculation because some other thread was messing with the object in between its operations.
You address this issue either by using the synchronized keyword, to make sure only one person can perform the entire process at a time, or alternately with a lock from the java.util.concurrent.locks package. Using java.util.concurrent is generally the preferred approach for new programs.
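A minimal sketch of the lock-based fix for the PurchaseOrder scenario follows. The field names itemQuantity and itemPrice come from the example above; everything else (the update method, storing prices as cents in a long, the race helper) is an assumption made for illustration. The point is that quantity, price, and total change together while the lock is held, so no thread can slip in between the steps.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical PurchaseOrder whose fields are only changed under one lock.
final class PurchaseOrder {
    private final ReentrantLock lock = new ReentrantLock();
    private int itemQuantity;
    private long itemPriceCents;
    private long totalCents;

    // Sets quantity and price and recomputes the total as one atomic step.
    void update(int quantity, long priceCents) {
        lock.lock();
        try {
            itemQuantity = quantity;
            itemPriceCents = priceCents;
            totalCents = itemQuantity * itemPriceCents;
        } finally {
            lock.unlock(); // always release, even if an exception is thrown
        }
    }

    long totalCents() {
        lock.lock();
        try {
            return totalCents;
        } finally {
            lock.unlock();
        }
    }

    // Two threads update the same order concurrently; returns the final total.
    static long race() {
        PurchaseOrder order = new PurchaseOrder();
        Thread a = new Thread(() -> order.update(50, 250));  // 50 x $2.50
        Thread b = new Thread(() -> order.update(100, 300)); // 100 x $3.00
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return order.totalCents();
    }

    public static void main(String[] args) {
        System.out.println(race()); // 12500 or 30000, never a mixed total
    }
}
```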
1. Thread Interference
class Counter {
    private int c = 0;

    public void increment() {
        c++;
    }

    public void decrement() {
        c--;
    }

    public int value() {
        return c;
    }
}
Suppose there are two threads, Thread-A and Thread-B, working on the
same counter instance. Say Thread-A invokes increment(), and at the
same time Thread-B invokes decrement(). Thread-A reads the value c
and increments it by 1. At the same time Thread-B reads the value
(which is 0 because the incremented value has not yet been stored by Thread-A),
decrements it and sets c to -1. Now Thread-A stores its value, 1, and Thread-B's decrement is lost.
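One way to rule out this interleaving is to make the three methods synchronized, so that the read, modify, and write steps of one call cannot overlap with another call on the same instance. The sketch below renames the class to SynchronizedCounter and adds a race helper purely for demonstration.

```java
// The Counter from above with synchronized methods.
final class SynchronizedCounter {
    private int c = 0;

    public synchronized void increment() {
        c++;
    }

    public synchronized void decrement() {
        c--;
    }

    public synchronized int value() {
        return c;
    }

    // One thread increments and one decrements `rounds` times; returns the final value.
    static int race(int rounds) {
        SynchronizedCounter counter = new SynchronizedCounter();
        Thread a = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                counter.increment();
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                counter.decrement();
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.value();
    }

    public static void main(String[] args) {
        System.out.println(race(100_000));
    }
}
```

Because value() is synchronized as well, a thread that reads the counter also gets the visibility guarantee, not just mutual exclusion.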
2. Memory Consistency Errors
Memory consistency errors occur when different threads have
inconsistent views of the shared data. In the Counter class above,
let's say two threads are working on the same counter instance and one
calls the increment method to increase the counter's value by 1. Here
there is no guarantee that the change made by one thread is visible to
the other.
For more visit this.
First, please note that your source is NOT the best place to learn what you're trying to learn. You will do well to read the papers from @blucz's answer (as well as his answer in general), even if they are out of scope of Java. The Oracle Trails aren't bad per se, but they simplify matters and gloss over them, hence you may find you don't understand what you've just learned, or whether it's useful or not and how much.
Now, trying to answer primarily within Java context.
Thread interference
happens when thread operations interleave, that is, mingle. We need two executors (threads) and shared data (place to interfere).
Image by Daniel Stori, from turnoff.us website:
In the image you see that two threads in a GNU/Linux process can interfere with each other. Java threads are essentially Java objects pointing to native threads and they also can interfere with each other, if they operate on same data (like here where "Rick" messes up the data - drawing - of his younger brother).
Memory Consistency Errors - MCE
Crucial points here are memory visibility, happens-before and, as brought up by @blucz, hardware.
MCE are - obviously - situations, where memory becomes inconsistent. Which actually is a term for humans - for computers the memory is consistent at all times (unless it's corrupted). The "inconsistencies" are something humans are "seeing", because they don't understand what exactly happened and were expecting something else. "Why is it 1? It should be 2?!"
This "perceived inconsistency", this "gap", relates to memory visibility, that is, what different threads see when they look at memory. And therefore what those threads operate on.
You see, while reads from and writes to memory are linear when we reason about code (especially when thinking about how it is executed line by line)... actually they are not. Especially when threads are involved. So, the tutorial you read gives you an example of a counter being incremented by two threads and how thread 2 reads the same value as thread 1. The actual reasons for memory inconsistencies might be optimizations done to your code by javac, the JIT, or the hardware memory consistency model (that is, something that CPU people did to speed up their CPU and make it more efficient). These optimizations include prescient stores and branch prediction; for now you may think of them as reordering code so that in the end it runs faster and uses/wastes fewer CPU cycles. However, to make sure optimizing doesn't go out of control (or too far), some guarantees are made. These guarantees form the "happens-before" relationship, which lets us tell that everything before a certain point "happened-before" everything after it. Imagine you are running a party and remember that Tom got here BEFORE Suzie, because you know that Rob came after Tom and before Suzie. Rob is the event you use to form the happens-before relationship between the events of Tom and Suzie coming in.
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html#MemoryVisibility
Link above tells you more about memory visibility and what establishes happens-before relationship in Java. It will not come as a surprise, but:
synchronization does
starting a Thread
joining a Thread
the volatile keyword, which guarantees that a write happens-before subsequent reads of the same variable, that is, reads AFTER a write will not be reordered to be "before" it, as that would break the "happens-before" relationship.
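The start and join items in the list above can be illustrated with plain fields only. In this sketch (all names are made up), no volatile or synchronized is used: the write to input is visible to the worker because it was made before start(), and the worker's write to output is visible to the caller because it is read after join().

```java
// Hypothetical demo: Thread.start() and Thread.join() as happens-before edges.
final class StartJoinDemo {
    private int input;  // plain fields, deliberately not volatile
    private int output;

    static int doubleIt(int value) {
        StartJoinDemo d = new StartJoinDemo();
        d.input = value;                  // written before start()
        Thread worker = new Thread(() -> d.output = d.input * 2);
        worker.start();                   // start() happens-before the worker's actions
        try {
            worker.join();                // the worker's actions happen-before join() returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return d.output;                  // guaranteed to be the worker's result
    }

    public static void main(String[] args) {
        System.out.println(doubleIt(21));
    }
}
```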
Since all that touches memory, hardware is essential. Your platform has its own rules, and while the JVM tries to make them universal by making all platforms behave similarly, just that alone means that on platform A there will be more memory barriers than on platform B.
Your questions
What is the difference between memory consistency errors and thread interference?
MCE are about the visibility of memory to program threads and about NOT having a happens-before relationship between reads and writes, which leaves a gap between what humans think "should be" and what "actually is".
Thread interference is about thread operations overlapping, mingling, interleaving and touching shared data, corrupting it in the process, which may lead to thread A having its nice drawing destroyed by thread B. Harmful interference usually marks a critical section, which is why synchronization works.
How does the use of synchronization to avoid them differ or not?
Please read also about thin locks, fat locks and thread contention.
Synchronization avoids thread interference by letting only one thread into the critical section at a time; the other threads are blocked (which is costly: thread contention). When it comes to MCE, synchronization establishes happens-before between unlocking a mutex and every subsequent locking of that same mutex; see the earlier link to the java.util.concurrent package description.
For examples: see both earlier sections.
