Java CAS operation performs faster than C equivalent, why?

Here I have Java and C code that each perform an atomic increment using CAS,
incrementing a long variable from 0 to 500,000,000.
C : Time taken : 7300ms
Java : Time Taken : 2083ms
Can anyone double-check these results? I just can't believe them.
Thanks
Java code:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SmallerCASTest {
    public static void main(String[] args) {
        final long MAX = 500L * 1000L * 1000L;
        final AtomicLong counter = new AtomicLong(0);
        long start = System.nanoTime();
        while (true) {
            if (counter.incrementAndGet() >= MAX) {
                break;
            }
        }
        long casTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("Time Taken=" + casTime + "ms");
    }
}
C code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NITER 500000000
int main (void){
    long val = 0;
    clock_t starttime = clock ();
    while (val < NITER){
        /* CAS retry loop: reread val and retry until the swap succeeds */
        while (1){
            long current = val;
            long next = current+1;
            if ( __sync_bool_compare_and_swap (&val, current, next))
                break;
        }
    }
    clock_t castime = (clock()-starttime)/ (CLOCKS_PER_SEC / 1000);
    /* clock_t is not an int, so cast it to match the format specifier */
    printf ("Time taken : %ld ms\n", (long) castime);
    return 0;
}
run.sh
#!/bin/bash
gcc -O3 test.c -o test.o
echo -e "\nC"
./test.o
javac SmallerCASTest.java
echo -e "\nJava"
java SmallerCASTest
Other details:
System : Linux XXXXXXXXX #1 SMP Thu Mar 22 08:00:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
gcc --version:
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
java -version:
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

You are comparing apples with oranges, as I am sure you expected. The Java version is a true CAS with retry on failure, while the C version uses what I'd call in Java a synchronized form.
See this question for more details.
See this answer to that question for supporting narrative, where it says "A full memory barrier is created when this function is invoked", i.e. in Java terms, this is a synchronized call.
Try using _compare_and_swap in the same way AtomicLong uses its Java equivalent, i.e. spin on the function until the value changes to what you want it to be.
Added:
I cannot find a definitive C++ equivalent of a Java AtomicLong, but that does not mean there isn't one. Essentially, an AtomicLong can be changed by any thread at any time, and when several threads attempt a change at once, just one of them succeeds. The change will be consistent, i.e. the result will be the value written by one or other of the threads, never a combination of the two. If thread A attempts to change the value to 0xffff0000 (or the equivalent 64-bit number) while thread B attempts a change to 0x0000ffff (ditto), the result will be one of those two values; specifically, it will not be 0x00000000 or 0xffffffff (unless of course a third thread gets involved).
Essentially, an AtomicLong has no synchronisation at all other than this.
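For reference, the "spin on the function" pattern mentioned above can be sketched in Java with AtomicLong.compareAndSet. This is only an illustration of the retry loop, not a claim about how any particular JDK implements incrementAndGet; the class and helper names (CasIncrementSketch, casIncrementAndGet) are made up for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

class CasIncrementSketch {
    // Hypothetical helper: increments the counter with an explicit CAS retry loop
    // and returns the new value.
    static long casIncrementAndGet(AtomicLong counter) {
        while (true) {
            long current = counter.get();
            long next = current + 1;
            // compareAndSet succeeds only if nobody changed the value in between
            if (counter.compareAndSet(current, next)) {
                return next;
            }
            // otherwise another thread won the race: reread and retry
        }
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong(0);
        final long max = 1_000_000L; // smaller than the question's 500,000,000, for a quick run
        while (casIncrementAndGet(counter) < max) {
            // spin until the counter reaches max
        }
        System.out.println(counter.get()); // prints 1000000
    }
}
```

With a single thread the compareAndSet never fails, so the loop body runs exactly once per increment, which matches the uncontended case being benchmarked in the question.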

EDIT: Indeed, Java seems to implement incrementAndGet using a CAS operation, as you point out.
My testing suggests that the C and Java versions have roughly equivalent performance (which makes sense, as the time-consuming part is the atomic operation itself rather than any optimization of the surrounding code that the Java or C compiler manages to do).
On my machine (Xeon X3450), the Java version takes ~4700 ms, the C version ~4600 ms, and a C version using __sync_add_and_fetch() ~3800 ms (suggesting that Java could be improved here, rather than implementing all the atomic operations on top of CAS).
java version is
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.4) (6b24-1.11.4-1ubuntu0.10.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
GCC is 4.4.3, x86_64.
OS is Ubuntu 10.04 x86_64.
So I can only conclude that something seems fishy in your tests.

Because Java is awesome?
The Java version takes about 4 ns per loop iteration. That is about right: an uncontended CAS is actually a CPU-local operation, so it should be very fast. (edit: probably not 4 ns fast!)
Java achieves that speed through aggressive runtime optimization: the code is inlined and becomes just a couple of machine instructions, i.e. as fast as one could hand-code in assembly.
If the gcc version couldn't inline the function call, that's a big overhead per loop iteration.

Related

Why is my Java program running faster than my C++ program , both doing the same thing [duplicate]

This question already has answers here:
Why is std::cout so time consuming?
(1 answer)
'printf' vs. 'cout' in C++
(16 answers)
Closed 2 months ago.
I wrote a program in both C++ and Java to print "Hello World" 100,000 times, but I noticed that the C++ code takes much longer than the Java code.
The Java code takes about 6 seconds on average and the C++ code about 18 seconds on average, both run from the command line.
Can someone please explain why? Thanks.
The programs are named first.java and first.cpp for Java and C++ respectively.
I ran them with java first.java and first.exe, both from the command line.
g++ --version
g++ (Rev6, Built by MSYS2 project) 11.2.0
java --version
java 13.0.2, 2020-01-14
Java Code
class first {
    public static void main(String... args) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            System.out.println("Hello World");
        }
        long end = System.currentTimeMillis();
        long dur = end - start;
        System.out.println(dur / 1000);
    }
}
C++ Code
#include <iostream>
#include <string>
#include <chrono>
using namespace std;
int main()
{
    auto start = std::chrono::system_clock::now();
    for (int i = 0; i < 100000; i++)
    {
        cout << "Hello World" << endl;
    }
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << elapsed_seconds.count() << endl;
}

There are several relevant differences between your C++ and Java code:
By default C++ IO streams synchronise their state with the underlying C streams. This takes time. To avoid this (which you can do only if you know that your code does not mix C and C++ IO operations!), add the following to the beginning of your main code:
std::ios_base::sync_with_stdio(false);
cout << endl; is equivalent to cout << "\n" << flush; (which, in turn, is equivalent to cout << "\n"; cout.flush();). The flush call is absent from your Java code. You could add it to your Java code or, better, remove it from your C++ code: you almost never need to use endl/flush. Instead, just use
cout << "Hello World\n";
As noted by Peter in the comments, most systems flush the stdout stream on newline anyway (at least when attached to a terminal) so one might expect this not to make a difference. However, it does make a (substantial!) difference e.g. when piping the output to a file.
Your Java benchmark code truncates fractional seconds. To show those fractions of seconds (relevant since the code runs in <1s!), change the relevant line to
System.out.println(dur / 1000.0);
Be sure to compile your C++ code with optimisations enabled; with GCC/clang/ICC, you do this by passing -O2. MSVC has a similar flag, /O2 (there are higher optimisation levels but they have particular issues; -O2 is pretty much the default setting people use).
Conversely, java first.java will first compile the code every time you invoke it. To make the comparison fair, be sure to run javac first.java ahead of time, and then execute the code via java first.
Making these changes causes the C++ code to overtake the Java code on my system. This is most noticeable when increasing the loop size from 100,000 to 1,000,000: the C++ code now runs in milliseconds, while the Java code takes several seconds (be sure to pipe the output to a file! Otherwise you will be purely measuring the latency/rendering speed of your terminal, not the performance of the code).
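The point about truncated fractional seconds comes down to integer versus floating-point division. A minimal sketch, using a made-up elapsed time of 742 ms and a hypothetical helper name (toSeconds):

```java
class ElapsedSecondsSketch {
    // Hypothetical helper: converts elapsed milliseconds to fractional seconds.
    static double toSeconds(long elapsedMillis) {
        return elapsedMillis / 1000.0; // floating-point division keeps the fraction
    }

    public static void main(String[] args) {
        long dur = 742; // made-up elapsed time in milliseconds
        System.out.println(dur / 1000);     // integer division: prints 0
        System.out.println(toSeconds(dur)); // prints 0.742
    }
}
```

Any run shorter than one second is reported as 0 by the integer version, which hides exactly the range in which the corrected benchmark finishes.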

Simple Go vs Java performance comparison in one particular application

I created a simple test that compares the performance of my Go and Java applications.
I do not know why, but it looks like my Java application is faster than the Go one.
I used:
~> go version
go version go1.15.6 darwin/amd64
and
~> java -version
openjdk version "15.0.1" 2020-10-20
OpenJDK Runtime Environment (build 15.0.1+9)
OpenJDK 64-Bit Server VM (build 15.0.1+9, mixed mode, sharing)
Go function mostly tested is:
func split(text string, occurrence map[string]int, separators []string) {
    words := strings.Split(text, separators[0])
    for _, w := range words {
        if len(w) > 0 {
            if len(separators) > 1 {
                split(w, occurrence, separators[1:])
            } else {
                occurrence[w] = occurrence[w] + 1
            }
        }
    }
}
Java equivalent:
private void split(String text, Map<String, Integer> occurrence, String[] separators) {
    StringTokenizer st = new StringTokenizer(text, separators[0]);
    while (st.hasMoreTokens()) {
        if (separators.length > 1) {
            split(st.nextToken(), occurrence, Arrays.copyOfRange(separators, 1, separators.length));
        } else {
            occurrence.compute(st.nextToken(),(k,v) -> v == null ? 1 : v+1);
        }
    }
}
The test starts 10 threads and runs this method in a loop against text loaded from the ulyss10.txt file (the text is loaded into memory once at the start of the application, so this is not an I/O test).
You can see all the files for the test here: https://github.com/TOlchawa/go-vs-java/tree/main/book_read_test
My expectation was that Go would be faster, but the results are the opposite.
It looks like Go is somewhat slower, by about 40% (which is unexpected).
I know this is not a very reliable test, but I'm surprised nevertheless.
Could you please give me a list of possible reasons why this happened?
As I understand it, the differences are between:
strings.Split | StringTokenizer
slider | HashMap
routine | Thread
go compiler | JVM
memory management | GC
differences in source code of application (IMHO it is not an issue)
what else ?
//edit
The GitHub repo contained the wrong version, but I used the correct one during my tests, so the question is still valid/open.

Rather than playing a guessing game, here is what I would do to understand what happened.
Profile both the Java and the Go program. Visualise the results as a memory allocation flow and a flame graph.
See where each program spends most of its time.
Go and Java have about the same runtime speed for more or less equivalent low-level tasks.
However, the underlying implementations of their string libraries can differ, which can lead to a very large performance difference.
For example, have you given any thought to how Go and Java each implement a map? How about string splitting?
Are there any excessive byte-copy operations?
In the real world, performance optimisation is mostly about CPU and memory management, not "flashy" algorithms with a better big O. See Kafka vs RMQ: a lot of the performance edge came from better socket buffer management and zero-copy techniques, which isn't rocket-science algorithm work at all.
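As one concrete thing to look for in a profile, the Java method in the question calls Arrays.copyOfRange for every token, and the Go function gets a freshly allocated slice of substrings from every strings.Split call. Whether either of these matters for the measured 40% is an assumption until a profiler confirms it. The sketch below is a variant of the question's Java method that passes an index into the separators array instead of copying its tail; the class name and the depth parameter are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class SplitWithoutCopySketch {
    // Variant of the question's split(): 'depth' selects the current separator,
    // so the separators array is never copied.
    static void split(String text, Map<String, Integer> occurrence, String[] separators, int depth) {
        StringTokenizer st = new StringTokenizer(text, separators[depth]);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (depth + 1 < separators.length) {
                split(token, occurrence, separators, depth + 1);
            } else {
                occurrence.merge(token, 1, Integer::sum); // count this word
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> occurrence = new HashMap<>();
        split("a b,a c", occurrence, new String[] {",", " "}, 0);
        System.out.println(occurrence.get("a")); // prints 2
    }
}
```

Note also that StringTokenizer treats its delimiter string as a set of single characters, while strings.Split treats the separator as one whole string, so the two versions are only equivalent when every separator is a single character.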

Why can I fill the stack more on successive calls to recursive method [duplicate]

This question already has answers here:
Why is the max recursion depth I can reach non-deterministic?
(4 answers)
Closed 5 years ago.
A simple class for demonstration purposes:
public class Main {
    private static int counter = 0;

    public static void main(String[] args) {
        try {
            f();
        } catch (StackOverflowError e) {
            System.out.println(counter);
        }
    }

    private static void f() {
        counter++;
        f();
    }
}
I executed the above program 5 times, the results are:
22025
22117
15234
21993
21430
Why are the results different each time?
I tried setting the max stack size (for example -Xss256k). The results were then a bit more consistent but again not equal each time.
Java version:
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
EDIT
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.

The observed variance is caused by background JIT compilation.
This is what the process looks like:
Method f() starts execution in the interpreter.
After a number of invocations (around 250), the method is scheduled for compilation.
The compiler thread works in parallel with the application thread. Meanwhile, the method continues execution in the interpreter.
As soon as the compiler thread finishes compilation, the method entry point is replaced, so the next call to f() will invoke the compiled version of the method.
There is basically a race between the application thread and the JIT compiler thread. The interpreter may perform a different number of calls before the compiled version of the method is ready. At the end there is a mix of interpreted and compiled frames.
No wonder the compiled frame layout differs from the interpreted one. Compiled frames are usually smaller; they don't need to store all the execution context on the stack (method reference, constant pool reference, profiler data, all arguments, expression variables etc.).
Furthermore, there are even more race possibilities with Tiered Compilation (default since JDK 8). There can be a combination of 3 types of frames: interpreter, C1 and C2 (see below).
Let's have some fun experiments to support the theory.
Pure interpreted mode. No JIT compilation.
No races => stable results.
$ java -Xint Main
11895
11895
11895
Disable background compilation. JIT is ON, but is synchronized with the application thread.
No races again, but the number of calls is now higher due to compiled frames.
$ java -XX:-BackgroundCompilation Main
23462
23462
23462
Compile everything with C1 before execution. Unlike previous case there will be no interpreted frames on the stack, so the number will be a bit higher.
$ java -Xcomp -XX:TieredStopAtLevel=1 Main
23720
23720
23720
Now compile everything with C2 before execution. This will produce the most optimized code with the smallest frame. The number of calls will be the highest.
$ java -Xcomp -XX:-TieredCompilation Main
59300
59300
59300
Since the default stack size is 1M, this should mean the frame now is only 16 bytes long. Is it?
$ java -Xcomp -XX:-TieredCompilation -XX:CompileCommand=print,Main.f Main
0x00000000025ab460: mov %eax,-0x6000(%rsp) ; StackOverflow check
0x00000000025ab467: push %rbp ; frame link
0x00000000025ab468: sub $0x10,%rsp
0x00000000025ab46c: movabs $0xd7726ef0,%r10 ; r10 = Main.class
0x00000000025ab476: addl $0x2,0x68(%r10) ; Main.counter += 2
0x00000000025ab47b: callq 0x00000000023c6620 ; invokestatic f()
0x00000000025ab480: add $0x10,%rsp
0x00000000025ab484: pop %rbp ; pop frame
0x00000000025ab485: test %eax,-0x23bb48b(%rip) ; safepoint poll
0x00000000025ab48b: retq
In fact, the frame here is 32 bytes, but JIT has inlined one level of recursion.
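The arithmetic behind the 16-byte estimate and the 32-byte frame can be checked with a small sketch. It assumes the frame consists of the 8-byte return address pushed by callq, the 8-byte saved %rbp, and the 0x10 bytes reserved by sub, as read from the listing above; the class and helper names are made up for this example.

```java
class FrameSizeSketch {
    // Hypothetical helper: stack bytes consumed per counted call to f(),
    // given the parts of one compiled frame and how many calls it covers.
    static long bytesPerCountedCall(long returnAddress, long savedRbp, long locals, long callsPerFrame) {
        return (returnAddress + savedRbp + locals) / callsPerFrame;
    }

    public static void main(String[] args) {
        // 8 + 8 + 16 = 32 bytes per frame; one inlined level means 2 counter increments per frame
        long perCall = bytesPerCountedCall(8, 8, 0x10, 2);
        System.out.println(perCall);          // prints 16
        System.out.println(59300L * perCall); // prints 948800
    }
}
```

So 59300 counted calls use about 948,800 bytes of the 1,048,576-byte stack. The remainder is presumably taken by guard pages, the stack-banging margin and the frames below f(), but that breakdown is an assumption rather than something the listing shows.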
Finally, let's look at the mixed stack trace. In order to get it, we'll crash JVM on StackOverflowError (option available in debug builds).
$ java -XX:AbortVMOnException=java.lang.StackOverflowError Main
The crash dump hs_err_pid.log contains the detailed stack trace where we can find interpreted frames at the bottom, C1 frames in the middle and lastly C2 frames on the top.
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5958 [0x00007f21251a5900+0x0000000000000058]
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
// ... repeated 19787 times ...
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
// ... repeated 1866 times ...
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
j Main.f()V+8
j Main.f()V+8
// ... repeated 1839 times ...
j Main.f()V+8
j Main.main([Ljava/lang/String;)V+0
v ~StubRoutines::call_stub

First of all, the following has not been researched. I have not "deep dived" the OpenJDK source code to validate any of the following, and I don't have access to any inside knowledge.
I tried to validate your results by running your test on my machine:
$ java -version
openjdk version "1.8.0_71"
OpenJDK Runtime Environment (build 1.8.0_71-b15)
OpenJDK 64-Bit Server VM (build 25.71-b15, mixed mode)
I get the "count" varying over a range of ~250. (Not as much as you are seeing)
First some background. A thread stack in a typical Java implementation is a contiguous region of memory that is allocated before the thread is started, and that is never grown or moved. A stack overflow happens when the JVM tries to create a stack frame to make a method call, and the frame goes beyond the limits of the memory region. The test could be done by testing the SP explicitly, but my understanding is that it is normally implemented using a clever trick with the memory page settings.
When a stack region is allocated, the JVM makes a syscall to tell the OS to mark a "red zone" page at the end of the stack region read-only or non-accessible. When a thread makes a call that overflows the stack, it accesses memory in the "red zone" which triggers a memory fault. The OS tells the JVM via a "signal", and the JVM's signal handler maps it to a StackOverflowError that is "thrown" on the thread's stack.
So here are a couple of possible explanations for the variability:
The granularity of hardware-based memory protection is the page boundary. So if the thread stack has been allocated using malloc, the start of the region is not going to be page aligned. Therefore the distance from the start of the stack frame to the first word of the "red zone" (which >is< page aligned) is going to be variable.
The "main" stack is potentially special, because that region may be used while the JVM is bootstrapping. That might lead to some "stuff" being left on the stack from before main was called. (This is not convincing ... and I'm not convinced.)
Having said this, the "large" variability that you are seeing is baffling. Page sizes are too small to explain a difference of ~7000 in the counts.
UPDATE
"When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907)."
Interesting. Among other things, that could cause stack limit checking to be done differently.
"This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions."
Plausible. The size of the stack frame could well be different after the f() method has been JIT compiled. Assuming f() was JIT compiled at some point, your stack will have a mixture of "old" and "new" frames. If the JIT compilation occurred at different points, then the ratio will be different ... and hence the count will be different when you hit the limit.
"Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes."
Little chance of that, I'm afraid ... unless you are prepared to PAY someone to do a few days research for you.
1) No such (public) reference documentation exists, AFAIK. At least, I've never been able to find a definitive source for this kind of thing ... apart from deep diving the source code.
2) Looking at the JIT compiled code tells you nothing of how the bytecode interpreter handled things before the code was JIT compiled. So you won't be able to see if the frame size has changed.

The exact functioning of the Java stack is undocumented, but it depends entirely on the memory allocated to that thread.
Just try using the Thread constructor that takes a stackSize argument and see if the result becomes constant. I have not tried it, so please share the results.
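A minimal sketch of that experiment, using the Thread(ThreadGroup, Runnable, String, long stackSize) constructor. The class name, the thread name and the 512 KiB stack size are arbitrary choices for illustration. The Javadoc describes stackSize as a hint that a platform may ignore, and since the JIT compiles in the background, the numbers may well still vary between runs, so no expected output is given.

```java
class StackDepthSketch {
    private static int counter = 0;

    private static void f() {
        counter++;
        f();
    }

    // Runs the recursion on a new thread created with the requested stack size
    // and returns the depth reached before StackOverflowError was thrown.
    static int measureDepth(long stackSizeBytes) {
        counter = 0;
        Thread t = new Thread(null, () -> {
            try {
                f();
            } catch (StackOverflowError e) {
                // expected: the recursion has exhausted this thread's stack
            }
        }, "deep-recursion", stackSizeBytes);
        t.start();
        try {
            t.join(); // join() also makes the worker's writes to counter visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the worker", e);
        }
        return counter;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(measureDepth(512 * 1024));
        }
    }
}
```

Running it with -Xint as well should show whether any remaining variation comes from the stack size or from compilation timing.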

Some JVM monitoring checks via SNMP return zero values in JVM 1.8

My monitoring tool uses SNMP to read several internal values from Java Virtual Machines 1.6 and 1.7. The problem is that on JVM 1.8 some of these values come back as zero.
These are NoHeapMemPoolMaxSize (OID: .3.163.1.1.2.23.0) and PoolMaxSize (OID: .3.163.1.1.2.110.1.13.2). The snmpwalk output:
SNMPv2-SMI::enterprises.42.2.900.3.163.1.1.2.23.0 = Counter64: 0
SNMPv2-SMI::enterprises.42.2.900.3.163.1.1.2.110.1.13.2 = Counter64: 0
Have the OIDs of these two values changed? I compared JVM-MANAGEMENT-MIB.mib for Java 6 and Java 8 and found no difference.
What is wrong here?

The problem was that no maximum Metaspace size had been defined for the JVM. After adding -XX:MaxMetaspaceSize=256m as a command line argument, snmpwalk shows that those values are no longer zero:
SNMPv2-SMI::enterprises.42.2.900.3.163.1.1.2.23.0 = Counter64: 1593835520
SNMPv2-SMI::enterprises.42.2.900.3.163.1.1.2.110.1.13.2 = Counter64: 268435456
Best regards.
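One way to cross-check this from inside the JVM is to list the maximum size each memory pool reports through java.lang.management. This is only a sketch: it assumes the SNMP agent draws on the same management data, which the thread does not confirm, and the class and method names are made up. MemoryUsage.getMax() returns -1 when a maximum is undefined, which is what I would expect for Metaspace on a Java 8 JVM started without -XX:MaxMetaspaceSize.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

class PoolMaxSketch {
    // Maps each memory pool's name to its reported maximum size in bytes.
    // A value of -1 means the maximum is undefined for that pool.
    static Map<String, Long> poolMaxima() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null) { // getUsage() returns null if the pool is no longer valid
                result.put(pool.getName(), usage.getMax());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        poolMaxima().forEach((name, max) -> System.out.println(name + " max=" + max));
    }
}
```

Comparing the output with and without -XX:MaxMetaspaceSize=256m should show whether the pool maximum changes in the same way as the SNMP values.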

Java vs. C Simple Performance Test [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 12 years ago.
I'm running a simple loop that prints the loop counter (i) 1.000.000 times in both Java and C.
I'm using NetBeans and Visual Studio respectively.
I don't care about precision, but after about 40 seconds:
NetBeans (Java) has printed about 500.000 numbers
while Windows (C) has printed about 75.000 numbers
-- why such a big difference?
I'm using a common Intel Core 2 Duo (2.0 GHz) PC with Windows 7.

That seems wrong. Could you provide your code?
My Versions:
C version compiled with gcc -std=c99 -o itr itr.c with gcc 4.5.1
#include <stdio.h>
int main( int argc, char **argv )
{
    for ( int i = 0; i < 1000000; i++ )
    {
        printf("%d\n", i);
    }
}
Java Version compiled as javac Itr.java with javac 1.6.0_20 and JVM being:
OpenJDK Runtime Environment (IcedTea6 1.9.1) (ArchLinux-6.b20_1.9.1-1-x86_64)
OpenJDK 64-Bit Server VM (build 17.0-b16, mixed mode)
code -
class Itr
{
    public static void main( String[] av )
    {
        for ( int i = 0; i < 1000000; i++ )
        {
            System.out.println(i);
        }
    }
}
and the times:
time ./itr
// Snip Output //
real 0m1.964s
user 0m0.330s
sys 0m1.477s
time java Itr
// Snip Output //
real 0m5.245s
user 0m2.337s
sys 0m3.023s
The test system is an Intel Core i5 M520 (@ 2.4GHz) running 64-bit ArchLinux.

One way to considerably speed up your example would be:
public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000000; i++)
        sb.append(i).append("\n");
    System.out.println(sb.toString());
}
String concatenation or output (in your case, printing to the standard output stream) inside a loop is bad by design and not the fault of Java; you generally want to avoid it.
It is much faster to minimize the number of output calls and use a local buffer. Concatenating Strings is also inefficient; Java has the StringBuilder class for that task.
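The same idea can be sketched with a buffered writer instead of one large StringBuilder, so the whole output does not have to be held in memory at once. The class name, the helper name and the 64 KiB buffer size are arbitrary choices for this illustration, and no timings are claimed.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

class BufferedOutputSketch {
    // Writes the integers 0 .. count-1, one per line, to the given writer.
    static void writeNumbers(PrintWriter out, int count) {
        for (int i = 0; i < count; i++) {
            out.print(i);
            out.print('\n');
        }
    }

    public static void main(String[] args) {
        // PrintWriter(Writer) does not auto-flush, so output is written in large chunks.
        PrintWriter out = new PrintWriter(
                new BufferedWriter(new OutputStreamWriter(System.out), 1 << 16));
        writeNumbers(out, 1_000_000);
        out.flush(); // push whatever is still buffered to standard output
    }
}
```

Forgetting the final flush() would silently drop the tail of the output, which is the usual price of taking control of buffering yourself.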

Without your code and environment settings, your test has no value.
Are you sure that the NetBeans console display isn't slowed down in the C case, or optimized for Java output?
Are you sure you ran both projects in optimized mode without debugging? C debug builds often generate a lot of debug information that clearly slows everything down if you're debugging. In any case, any benchmark should be run with optimization on AND debug mode off.
