I am running this test because I want to see the stack trace of a program.
Below is my program:
public class NanoTime {
    public static void main(String[] args) {
        long startTime = System.nanoTime();
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < 1000; i++) {
            buffer.append("a");
        }
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;
        System.out.println("Total time of calculation = " + totalTime);
    }
}
I am using two OpenJDK builds, one configured with debug-level slowdebug and another with fastdebug.
Running the program under gdb, I get this output:
[New Thread 0x7ffff7fd3700 (LWP 22532)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fd3700 (LWP 22532)]0x00007fffe10002b4 in ?? ()
(gdb) bt
#0 0x00007fffe10002b4 in ?? ()
#1 0x0000000000000202 in ?? ()
#2 0x00007fffe1000160 in ?? ()
#3 0x00007ffff66bd1a9 in execute_internal_vm_tests () at /home/bionix/Openjdk8/hotspot/src/share/vm/prims/jni.cpp:5128
#4 0x00007ffff7fd2550 in ?? ()
#5 0x00007ffff6b08bf8 in VM_Version::get_cpu_info_wrapper () at /home/bionix/Openjdk8/hotspot/src/cpu/x86/vm/vm_version_x86.cpp:395
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
I am confused by all the question marks; I expected to see native method names.
Note: I am even disabling the JIT compiler: gdb --args java -Xint Test
In ordinary code, gdb relies on debug information (and to a lesser extent the "linker" symbols) to find the names of functions as it unwinds. Debugging information is described in various standards, DWARF being the current best one for Linux. Compilers emit the debugging information that is then read by gdb.
For just-in-time compilers like OpenJDK, there is no agreed-upon solution to emitting debugging information for debuggers to read. And so, as you've found, gdb generally has no idea what is going on.
In fact, as you can see from your trace, gdb can't even really unwind the entire stack. That's what this means:
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Modern compilers and ABIs tend to require some extra debugging information to unwind as well -- and, again, there's no agreement on how this should work for JIT compilation. GDB has some heuristics it uses to try to unwind when this information isn't available, but as you can see, they sometimes fail.
So, that's the bad news.
The good news is that gdb provides some ways to write unwinders and debug info readers for JITs. And, someone is working on this for OpenJDK. I wasn't able to quickly find the source, but I did find this thread which explains it a little.
This is because the JVM uses the template interpreter: its bytecode handlers are translated into machine code for the current platform at VM startup, so gdb sees only anonymous generated code. You can still get the full stack trace; for more info see the template interpreter demo.
There is also the C++ interpreter, which is the interpreter you would expect to see in your stack traces, but it is no longer supported and is only available in the zero JVM variant.
There is also an article about the two interpreters.
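An aside that may help, since you are already building slowdebug/fastdebug: as far as I know (this is from memory of hotspot's debug.cpp, so verify it on your build), non-product HotSpot builds compile in debug helper functions that you can call directly from gdb, and ps() in particular decodes the Java frames that the template interpreter leaves on the stack:
(gdb) call help()    # lists the available debug helpers
(gdb) call ps()      # "print stack": prints the Java stack of the current thread
These helpers only work while the current thread is a JavaThread and the VM data structures are intact, so they may still fail on a badly corrupted stack.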
I created a simple test comparing the performance of my Go and Java applications.
I do not know why, but it looks like my Java application is faster than the Go one.
I used:
~> go version
go version go1.15.6 darwin/amd64
and
~> java -version
openjdk version "15.0.1" 2020-10-20
OpenJDK Runtime Environment (build 15.0.1+9)
OpenJDK 64-Bit Server VM (build 15.0.1+9, mixed mode, sharing)
The Go function that is mainly being tested is:
func split(text string, occurrence map[string]int, separators []string) {
	words := strings.Split(text, separators[0])
	for _, w := range words {
		if len(w) > 0 {
			if len(separators) > 1 {
				split(w, occurrence, separators[1:])
			} else {
				occurrence[w] = occurrence[w] + 1
			}
		}
	}
}
Java equivalent:
private void split(String text, Map<String, Integer> occurrence, String[] separators) {
    StringTokenizer st = new StringTokenizer(text, separators[0]);
    while (st.hasMoreTokens()) {
        if (separators.length > 1) {
            split(st.nextToken(), occurrence, Arrays.copyOfRange(separators, 1, separators.length));
        } else {
            occurrence.compute(st.nextToken(), (k, v) -> v == null ? 1 : v + 1);
        }
    }
}
I start 10 threads and execute this method in a loop against text loaded from the ulyss10.txt file (the text is loaded into memory once at the start of the application, so this is not an I/O test).
There you can see all files from test: https://github.com/TOlchawa/go-vs-java/tree/main/book_read_test
My expectation was that Go would be faster, but the results are the opposite.
It looks like Go is a little slower, about 40% slower, which is unexpected.
I know this is not a very reliable test, but I am nevertheless surprised.
Could you give me a list of possible reasons why this happened?
In my understanding, the difference comes down to:
strings.Split | StringTokenizer
map | HashMap
goroutine | Thread
Go compiler | JVM
memory management | GC
differences in the source code of the applications (IMHO this is not an issue)
What else?
Edit: there was a wrong version in the GitHub repo, but during my tests I used the correct one, and the question is still valid/open.
Rather than playing a guessing game, here is what I would do to understand what happened.
Run a profiler on both the Java and Go programs. Visualise the results as memory-allocation flows and flame graphs.
See where each program spends most of its time.
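To make this concrete, here is one possible tool pairing (my suggestion, not something from the question; any profiler that can produce flame graphs will do). JDK 11+ ships Flight Recorder, and Go has pprof built into its toolchain:
# Java: record a Flight Recorder profile, then open profile.jfr in JDK Mission Control
# (MainClass stands in for the real entry point)
java -XX:StartFlightRecording=duration=60s,filename=profile.jfr MainClass
# Go: write a CPU profile from a benchmark wrapping split(), then browse it as a flame graph
go test -bench=. -cpuprofile=cpu.out
go tool pprof -http=:8080 cpu.out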
Go and Java have roughly the same runtime speed for more or less equivalent low-level tasks.
However, the underlying implementations of their string libraries can be different, and that can lead to a very large performance difference.
For example, have you given any thought to how Go and Java each implement a map? How about string splitting?
Are there any excessive byte-copying operations?
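As an aside on copying (a sketch of a hypothesis, not a profiled claim): the Java version actually carries an allocation the Go version avoids, since Arrays.copyOfRange copies the separator array on every recursive call, while Go's separators[1:] merely re-slices the backing array. That Java wins despite this suggests the gap lies elsewhere, plausibly in strings.Split eagerly allocating a slice of all substrings on every call while StringTokenizer scans lazily. To rule the copy out, you could pass an index down instead (hypothetical rewrite of the asker's method; call it with sepIndex = 0):
import java.util.Map;
import java.util.StringTokenizer;

// sepIndex replaces the Arrays.copyOfRange call, mirroring Go's re-slicing.
private void split(String text, Map<String, Integer> occurrence,
                   String[] separators, int sepIndex) {
    StringTokenizer st = new StringTokenizer(text, separators[sepIndex]);
    while (st.hasMoreTokens()) {
        if (sepIndex < separators.length - 1) {
            split(st.nextToken(), occurrence, separators, sepIndex + 1);
        } else {
            occurrence.compute(st.nextToken(), (k, v) -> v == null ? 1 : v + 1);
        }
    }
}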
In the real world, performance optimisation is mostly about CPU and memory management, not "flashy" better big-O algorithms. See Kafka vs RabbitMQ: a lot of Kafka's performance edge comes from better socket-buffer management and zero-copy techniques, which is not rocket-science algorithmics at all.
I am running a Java program written by another person on more data than the program was initially designed for, e.g. input files 10 times longer with roughly quadratic runtime. I encountered various problems and now aim to solve them bit by bit.
During execution, when a lot of output has already been printed (redirected to a file), I get the following:
Exception in thread "main" java.lang.StackOverflowError
at java.io.PrintStream.write(PrintStream.java:480)
[...]
at java.io.PrintStream.write(PrintStream.java:480)
The stack trace is the first thing that confuses me, as it is a long repetition of the very same line. Furthermore, it gives no indication of where in the code or the execution the problem occurred.
My thoughts / research
StackOverflowError
may indicate too little memory. With the -Xmx110G flag I provided 110 GB of heap and monitored it during execution; only up to ~32 GB were used. So this is probably not the problem here.
may be thrown due to a programming mistake causing unbounded recursion. However, I can't really check this, since I am not familiar enough with the code and the stack trace does not help me locate the problem in the code.
[hypothesis] may be caused because the writing of output is slower than the execution and new print/write calls. Still, why is there no further stack trace? How could I check and fix this?
PrintStream
These are the only code fragments I found when searching for "PrintStream":
// reset output stream to suppress the annoying output of the Apache batik library. Gets reset after lib call.
OutputStream tmp=System.out;
System.setOut(new PrintStream(new org.apache.commons.io.output.NullOutputStream()));
drawRes.g2d.stream(new FileWriter(svgFilePath), false);
System.setOut(new PrintStream(tmp));
[hypothesis] the write to the null stream does not work
[workaround] if I skip the swapping of the output stream and just "live" with the big output, the program seems to run (into other problems, but that is another case). Any ideas why this is happening?
Call for advice
If you have any advice on what is going on or what this Java code specifically does, please help me understand it. The stack trace especially frustrates me, as it provides no place to begin the fixing. I am also thankful for a general approach on how to tackle this problem: getting a useful stack trace, fixing the code to avoid the StackOverflowError, etc.
Some system environment facts
Linux machine
128 GB memory
Java
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (IcedTea 3.3.0) (suse-28.1-x86_64)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Please ask if you need more information!
Notes
I am rather new to Java, so I am grateful for any advice (the program was not written by me)
This is my first post on Stack Overflow; please tell me where I can improve my style of asking and formatting
I am not a native English speaker, so please excuse mistakes and feel free to ask for clarification or to correct me
Thanks all for your replies!
These two lines look suspicious:
OutputStream tmp=System.out;
//...
System.setOut(new PrintStream(tmp));
System.out is already a PrintStream, so IMHO the lines should read
PrintStream tmp=System.out;
//...
System.setOut(tmp);
What happens otherwise is that you get a nearly endless wrapping of PrintStreams within PrintStreams. The number of nested PrintStreams is limited only by Java heap space, but the call-depth limit is hit much sooner.
To verify the hypothesis I created a small test program that first wraps System.out 20 times and prints a stack trace to verify the call chain. After that it wraps System.out 10_000 times and produces a StackOverflowError.
import java.io.OutputStream;
import java.io.PrintStream;

public class CheckPrintStream {

    public static void main(String[] args) {
        PrintStream originalSystemOut = System.out;
        // Tap System.out so every sizeable write also prints the current call chain.
        System.setOut(new PrintStream(System.out) {
            @Override
            public void write(byte[] buf, int off, int len) {
                originalSystemOut.write(buf, off, len);
                if (len > 2) {
                    new RuntimeException("Testing PrintStream nesting").printStackTrace(originalSystemOut);
                }
            }
        });
        for (int i = 0; i < 20; i++) {
            wrapSystemOut();
        }
        System.out.println("Hello World!");
        for (int i = 20; i < 10_000; i++) {
            wrapSystemOut();
        }
        System.out.println("crash!");
    }

    private static void wrapSystemOut() {
        // Mirrors the questioner's bug: wraps the current System.out in yet another PrintStream.
        OutputStream tmp = System.out;
        System.setOut(new PrintStream(tmp));
    }
}
A nesting of around 6000 to 7000 PrintStreams is sufficient to produce a stack overflow.
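Incidentally, this also explains why raising -Xmx did not help: -Xmx sizes the Java heap, while each thread's stack is sized separately with -Xss. Something like the following (4m is an arbitrary example value, and YourMainClass stands in for the real entry point) would postpone the overflow, but with the wrapping bug present it merely delays the crash instead of fixing it:
java -Xss4m YourMainClass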
A simple class for demonstration purposes:
public class Main {
private static int counter = 0;
public static void main(String[] args) {
try {
f();
} catch (StackOverflowError e) {
System.out.println(counter);
}
}
private static void f() {
counter++;
f();
}
}
I executed the above program 5 times, the results are:
22025
22117
15234
21993
21430
Why are the results different each time?
I tried setting the max stack size (for example -Xss256k). The results were then a bit more consistent but again not equal each time.
Java version:
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
EDIT
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
The observed variance is caused by background JIT compilation.
This is how the process looks:
Method f() starts executing in the interpreter.
After a number of invocations (around 250) the method is scheduled for compilation.
The compiler thread works in parallel with the application thread. Meanwhile the method continues executing in the interpreter.
As soon as the compiler thread finishes compilation, the method entry point is replaced, so the next call to f() will invoke the compiled version of the method.
There is basically a race between the application thread and the JIT compiler thread. The interpreter may perform a different number of calls before the compiled version of the method is ready. At the end there is a mix of interpreted and compiled frames.
No wonder the compiled frame layout differs from the interpreted one. Compiled frames are usually smaller; they don't need to store all the execution context on the stack (method reference, constant pool reference, profiler data, all arguments, expression variables etc.).
Furthermore, there are even more race possibilities with Tiered Compilation (the default since JDK 8). There can be a combination of three types of frames: interpreter, C1 and C2 (see below).
Let's have some fun experiments to support the theory.
Pure interpreted mode. No JIT compilation.
No races => stable results.
$ java -Xint Main
11895
11895
11895
Disable background compilation. JIT is ON, but is synchronized with the application thread.
No races again, but the number of calls is now higher due to compiled frames.
$ java -XX:-BackgroundCompilation Main
23462
23462
23462
Compile everything with C1 before execution. Unlike the previous case, there will be no interpreted frames on the stack, so the number will be a bit higher.
$ java -Xcomp -XX:TieredStopAtLevel=1 Main
23720
23720
23720
Now compile everything with C2 before execution. This will produce the most optimized code with the smallest frame. The number of calls will be the highest.
$ java -Xcomp -XX:-TieredCompilation Main
59300
59300
59300
Since the default stack size is 1M, this should mean the frame now is only 16 bytes long. Is it?
$ java -Xcomp -XX:-TieredCompilation -XX:CompileCommand=print,Main.f Main
0x00000000025ab460: mov %eax,-0x6000(%rsp) ; StackOverflow check
0x00000000025ab467: push %rbp ; frame link
0x00000000025ab468: sub $0x10,%rsp
0x00000000025ab46c: movabs $0xd7726ef0,%r10 ; r10 = Main.class
0x00000000025ab476: addl $0x2,0x68(%r10) ; Main.counter += 2
0x00000000025ab47b: callq 0x00000000023c6620 ; invokestatic f()
0x00000000025ab480: add $0x10,%rsp
0x00000000025ab484: pop %rbp ; pop frame
0x00000000025ab485: test %eax,-0x23bb48b(%rip) ; safepoint poll
0x00000000025ab48b: retq
In fact, the frame here is 32 bytes, but the JIT has inlined one level of recursion.
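So the arithmetic still works out: a 32-byte frame covering two calls means 16 bytes per call, and 1,048,576 / 16 = 65,536 is the right order of magnitude for the 59,300 calls observed; the guard pages and the few frames below f() account for the remainder.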
Finally, let's look at the mixed stack trace. In order to get it, we'll crash the JVM on StackOverflowError (an option available in debug builds).
$ java -XX:AbortVMOnException=java.lang.StackOverflowError Main
The crash dump hs_err_pid.log contains the detailed stack trace, where we can find interpreted frames at the bottom, C1 frames in the middle and C2 frames on the top.
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5958 [0x00007f21251a5900+0x0000000000000058]
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
// ... repeated 19787 times ...
J 164 C2 Main.f()V (12 bytes) @ 0x00007f21251a5920 [0x00007f21251a5900+0x0000000000000020]
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
// ... repeated 1866 times ...
J 163 C1 Main.f()V (12 bytes) @ 0x00007f211dca50ec [0x00007f211dca5040+0x00000000000000ac]
j Main.f()V+8
j Main.f()V+8
// ... repeated 1839 times ...
j Main.f()V+8
j Main.main([Ljava/lang/String;)V+0
v ~StubRoutines::call_stub
First of all, the following has not been researched. I have not "deep dived" the OpenJDK source code to validate any of the following, and I don't have access to any inside knowledge.
I tried to validate your results by running your test on my machine:
$ java -version
openjdk version "1.8.0_71"
OpenJDK Runtime Environment (build 1.8.0_71-b15)
OpenJDK 64-Bit Server VM (build 25.71-b15, mixed mode)
I get the "count" varying over a range of ~250. (Not as much as you are seeing)
First some background. A thread stack in a typical Java implementation is a contiguous region of memory that is allocated before the thread is started, and that is never grown or moved. A stack overflow happens when the JVM tries to create a stack frame to make a method call, and the frame goes beyond the limits of the memory region. The test could be done by testing the SP explicitly, but my understanding is that it is normally implemented using a clever trick with the memory page settings.
When a stack region is allocated, the JVM makes a syscall to tell the OS to mark a "red zone" page at the end of the stack region read-only or non-accessible. When a thread makes a call that overflows the stack, it accesses memory in the "red zone" which triggers a memory fault. The OS tells the JVM via a "signal", and the JVM's signal handler maps it to a StackOverflowError that is "thrown" on the thread's stack.
So here are a couple of possible explanations for the variability:
The granularity of hardware-based memory protection is the page boundary. So if the thread stack has been allocated using malloc, the start of the region is not going to be page aligned. Therefore the distance from the start of the stack frame to the first word of the "red zone" (which >is< page aligned) is going to be variable.
The "main" stack is potentially special, because that region may be used while the JVM is bootstrapping. That might lead to some "stuff" being left on the stack from before main was called. (This is not convincing ... and I'm not convinced.)
Having said this, the "large" variability that you are seeing is baffling. Page sizes are too small to explain a difference of ~7000 in the counts.
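To put rough numbers on that: the questioner's interpreted-only run gives 11,907 calls on the default 1 MB stack, i.e. roughly 88 bytes per interpreted frame, so a page-alignment shift of at most 4 KB could move the count by only ~46 calls, nowhere near ~7000.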
UPDATE
When JIT is disabled (-Djava.compiler=NONE) I always get the same number (11907).
Interesting. Among other things, that could cause stack limit checking to be done differently.
This makes sense as JIT optimizations are probably affecting the size of stack frames and the work done by JIT definitely has to vary between the executions.
Plausible. The size of the stack frame could well be different after the f() method has been JIT compiled. Assuming f() was JIT compiled at some point, your stack will have a mixture of "old" and "new" frames. If the JIT compilation occurred at a different point, then the ratio will be different ... and hence the count will be different when you hit the limit.
Nevertheless, I think it would be beneficial if this theory is confirmed with references to some documentation about the topic and/or concrete examples of work done by JIT in this specific example that leads to frame size changes.
Little chance of that, I'm afraid ... unless you are prepared to PAY someone to do a few days research for you.
1) No such (public) reference documentation exists, AFAIK. At least, I've never been able to find a definitive source for this kind of thing ... apart from deep diving the source code.
2) Looking at the JIT compiled code tells you nothing of how the bytecode interpreter handled things before the code was JIT compiled. So you won't be able to see if the frame size has changed.
The exact functioning of the Java stack is undocumented, but it depends entirely on the memory allocated to that thread.
Try using the Thread constructor that takes a stack size and see if the count becomes constant. I have not tried it, so please share the results.
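For reference, the constructor meant here is Thread(ThreadGroup group, Runnable target, String name, long stackSize); note its javadoc says the VM is free to treat stackSize as a hint or ignore it entirely. A minimal sketch of the suggested experiment:
public class FixedStackDepth {
    private static int counter = 0;

    private static void f() {
        counter++;
        f();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            try {
                f();
            } catch (StackOverflowError e) {
                System.out.println(counter);
            }
        };
        // Request a 1 MB stack explicitly; whether this makes the count repeatable
        // depends on the VM (and the JIT race described in the accepted answer remains).
        Thread t = new Thread(null, task, "deep-recursion", 1024 * 1024);
        t.start();
        t.join();
    }
}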
I would like to use Emacs's Org mode for taking notes of Java snippets.
I would like the java snippets to be syntax highlighted.
I tried running org-mode as a minor mode with java-mode as the major mode, but I found this loses a lot of Org-mode features (e.g. links).
I would prefer to run org-mode as the major mode and have some minor mode do Java syntax highlighting when it finds Java syntax.
I would rather avoid the #+begin_src business, as my file would be full of those blocks.
Is this possible?
[Edit] I was thinking along the lines of soft syntax highlighting for non-headings and non-Org items, i.e. the general paragraph body.
The only mechanism of which I'm aware that supports syntax-highlighted code blocks in Org-mode is the source code block feature you have already mentioned.
Setting org-src-fontify-natively to t should enable syntax highlighting for such blocks:
(setf org-src-fontify-natively t)
Code blocks should look something like this:
* Pretty sweet Org heading
This is an org-mode file, which is cool for lots of reasons, e.g.
- it's Emacs, and
- it supports syntax-highlighted blocks
- (note that this requires the variable ~org-src-fontify-natively~
to be set to ~t~)
#+BEGIN_SRC java
public class HelloWorld {
public static void main(String[] args) {
System.out.println("Hello, World");
}
}
#+END_SRC
A few tips:
The quickest way to start a new code block is to type <s and then hit Tab. This expands to
#+BEGIN_SRC |
#+END_SRC
with the cursor represented by |, so you can just type java and start editing.
With point inside such a block, org-edit-special, bound to C-c ' by default, will open the code block up in a separate buffer with the appropriate major mode active. You can use the full power of that mode, then type C-c ' again to update the embedded snippet.
Another developer on my team and I recently moved from a Core 2 Duo machine at work to a new Core 2 Quad 9505, both running Windows XP SP3 32-bit with JDK 1.6.0_18.
Upon doing so, a couple of our automated unit tests for some timing/statistics/metrics aggregation code promptly started failing, due to what appear to be ridiculous values coming back from System.nanoTime().
Test code that shows this behaviour, reliably, on my machine is:
import static org.junit.Assert.assertThat;

import org.hamcrest.Matchers;
import org.junit.Test;

public class NanoTest {

    @Test
    public void testNanoTime() throws InterruptedException {
        final long sleepMillis = 5000;

        long nanosBefore = System.nanoTime();
        long millisBefore = System.currentTimeMillis();

        Thread.sleep(sleepMillis);

        long nanosTaken = System.nanoTime() - nanosBefore;
        long millisTaken = System.currentTimeMillis() - millisBefore;

        System.out.println("nanosTaken=" + nanosTaken);
        System.out.println("millisTaken=" + millisTaken);

        // Check it slept within 10% of the requested time
        assertThat((double) millisTaken, Matchers.closeTo(sleepMillis, sleepMillis * 0.1));
        assertThat((double) nanosTaken, Matchers.closeTo(sleepMillis * 1000000, sleepMillis * 1000000 * 0.1));
    }
}
Typical output:
millisTaken=5001
nanosTaken=2243785148
Running it 100 times yields nano results between 33% and 60% of the actual sleep time, though usually around 40%.
I understand the weaknesses in accuracy of timers on Windows, and have read related threads like Is System.nanoTime() consistent across threads?; however, my understanding is that System.nanoTime() is intended for exactly the purpose we're using it for: measuring elapsed time more accurately than currentTimeMillis().
Does anyone know why it's returning such crazy results? Is this likely to be a hardware architecture problem (the only major change is the CPU/motherboard in this machine)? A problem with the Windows HAL on my current hardware? A JDK problem? Should I abandon nanoTime()? Should I log a bug somewhere, or does anyone have suggestions on how I could investigate further?
UPDATE 19/07 03:15 UTC: After trying finnw's test case below I did some more Googling, coming across entries such as bugid:6440250. It also reminded me of some other strange behaviour I noticed late on Friday, where pings were coming back negative. So I added /usepmtimer to my boot.ini, and now all the tests behave as expected, and my pings are normal too.
I'm a bit confused about why this was still an issue though; from my reading I thought TSC vs PMT issues were largely resolved in Windows XP SP3. Could it be because my machine was originally SP2 and was patched to SP3, rather than installed as SP3 from the start? I now also wonder whether I should be installing patches like the one at MS KB896256. Maybe I should take this up with the corporate desktop build team?
The problem was resolved (with some open suspicions about the suitability of nanoTime() on multi-core systems!) by adding /usepmtimer to the end of my C:\boot.ini string, forcing Windows to use the power management timer rather than the TSC. It's an open question why I needed to do this given that I'm on XP SP3; I understood that this was the default, but perhaps it was due to the manner in which my machine was patched to SP3.
On my system (Windows 7 64-Bit, Core i7 980X):
nanosTaken=4999902563
millisTaken=5001
System.nanoTime() uses OS-specific calls, so I expect that you are seeing a bug in your Windows/processor combination.
You probably want to read the answers to this other stack overflow question: Is System.nanoTime() completely useless?.
In summary, it would appear that nanoTime relies on operating system timers that may be affected by the presence of multiple core CPUs. As such, nanoTime may not be that useful on certain combinations of OS and CPU, and care should be taken when using it in portable Java code that you intend to run on multiple target platforms. There seems to be a lot of complaining on the web on this subject, but not much consensus on a meaningful alternative.
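Independent of the platform bug discussed here, the documented usage pattern is worth restating: nanoTime() values are meaningful only as differences within a single JVM, and since the value may wrap, you should subtract first and compare the delta rather than comparing raw timestamps. A minimal sketch:
public class ElapsedTime {
    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        Thread.sleep(100);                      // stand-in for the work being timed
        long elapsed = System.nanoTime() - t0;  // subtract first; raw values may wrap
        System.out.println("elapsed ns = " + elapsed);
    }
}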
It is difficult to tell whether this is a bug or just normal timer variation between cores.
An experiment you could try is to use native calls to force the thread to run on a specific core.
Also, to rule out power management effects, try spinning in a loop as an alternative to sleep():
import com.sun.jna.Native;
import com.sun.jna.NativeLong;
import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.W32API;

public class AffinityTest {

    private static void testNanoTime(boolean sameCore, boolean spin)
            throws InterruptedException {
        W32API.HANDLE hThread = kernel.GetCurrentThread();

        final long sleepMillis = 5000;

        // Pin to core 0, take the start readings, then optionally hop to core 1.
        kernel.SetThreadAffinityMask(hThread, new NativeLong(1L));
        Thread.yield();
        long nanosBefore = System.nanoTime();
        long millisBefore = System.currentTimeMillis();
        kernel.SetThreadAffinityMask(hThread, new NativeLong(sameCore ? 1L : 2L));
        if (spin) {
            Thread.yield();
            while (System.currentTimeMillis() - millisBefore < sleepMillis)
                ;
        } else {
            Thread.sleep(sleepMillis);
        }
        long nanosTaken = System.nanoTime() - nanosBefore;
        long millisTaken = System.currentTimeMillis() - millisBefore;
        System.out.println("nanosTaken=" + nanosTaken);
        System.out.println("millisTaken=" + millisTaken);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Sleeping, different cores");
        testNanoTime(false, false);
        System.out.println("\nSleeping, same core");
        testNanoTime(true, false);
        System.out.println("\nSpinning, different cores");
        testNanoTime(false, true);
        System.out.println("\nSpinning, same core");
        testNanoTime(true, true);
    }

    // JNA needs the library name here.
    private static final Kernel32Ex kernel =
            (Kernel32Ex) Native.loadLibrary("kernel32", Kernel32Ex.class);
}

interface Kernel32Ex extends Kernel32 {
    NativeLong SetThreadAffinityMask(HANDLE hThread, NativeLong dwAffinityMask);
}
If you get very different results depending on core selection (e.g. 5000ms on the same core but 2200ms on different cores) that would suggest that the problem is just natural timer variation between cores.
If you get very different results from sleeping vs. spinning, it is more likely due to power management slowing down the clocks.
If none of the four results are close to 5000ms, then it might be a bug.