Why do StringBuilders pop up when debugging String concatenation? - java

I am aware that Strings are immutable, and I know when to use a StringBuilder or a StringBuffer. I also read that the bytecode for these two snippets would end up being the same:
//Snippet 1
String variable = "text";
this.getClass().getResourceAsStream("string" + variable);
//Snippet 2
StringBuilder sb = new StringBuilder("string");
sb.append(variable);
this.getClass().getResourceAsStream(sb.toString());
But I obviously have something wrong. When debugging through Snippet 1 in Eclipse, I am actually taken to the StringBuilder constructor and to the append method. I suppose I'm missing details on how bytecode is interpreted and how the debugger refers back to the lines in the source code; if anyone could explain this a bit, I'd really appreciate it. Also, maybe you can point out what's JVM specific and what isn't (for example, I'm running Oracle's v6). Thanks!

Why do StringBuilders pop up when debugging String concatenation?
Because string concatenation (via the '+' operator) is typically compiled to code that uses a StringBuffer or StringBuilder to do the concatenation. The JLS explicitly permits this behaviour.
"An implementation may choose to perform conversion and concatenation in one step to avoid creating and then discarding an intermediate String object. To increase the performance of repeated string concatenation, a Java compiler may use the StringBuffer class or a similar technique to reduce the number of intermediate String objects that are created by evaluation of an expression." JLS 15.18.1.
(If your code is using a StringBuffer rather than a StringBuilder, it is probably because it was compiled using a really old Java compiler, or because you have specified a really old target JVM. The StringBuilder class is a relatively recent addition to Java. Older versions of the JLS used to mention StringBuffer instead of StringBuilder, IIRC.)
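For illustration, this is roughly what a pre-Java 9 javac turns Snippet 1 into, shown as equivalent source rather than bytecode; the exact shape (for instance whether the no-arg or the String-taking StringBuilder constructor is used) varies by compiler version, so treat this as a sketch:
import java.io.InputStream;

class ConcatDesugarDemo {
    InputStream load() {
        String variable = "text";
        // Equivalent of getResourceAsStream("string" + variable) after the
        // compiler has desugared the concatenation into StringBuilder calls:
        return getClass().getResourceAsStream(
                new StringBuilder().append("string").append(variable).toString());
    }
}
That generated StringBuilder code is what the debugger steps into, which is why you see its constructor and append even though your source never mentions them.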
Also, maybe you can point out what's JVM specific and what isn't.
The bytecodes produced for "string" + variable depend on how the Java compiler handles the concatenation. (In fact, all generated bytecodes are Java compiler dependent to some degree. The JLS and JVM specs do not dictate what bytecodes must be generated. The specifications are more about how the program should behave, and what individual bytecodes do.)
#supercat comments:
I wonder why string concatenation wouldn't use e.g. a String constructor overload which accepts two String objects, allocates a buffer of the proper combined size, and joins them? Or, when joining more strings, an overload which takes a String[]? Creating a String[] containing references to the strings to be joined should be no more expensive than creating a StringBuilder, and being able to create a perfect-sized backing store in one shot should be an easy performance win.
Maybe ... but I'd say probably not. This is a complicated area involving complicated trade-offs. The chosen implementation strategy for string concatenation has to work well across a wide range of different use-cases.
My understanding is that the original strategy was chosen after looking at a number of approaches, and doing some large-scale static code analysis and benchmarking to try to figure out which approach was best. I imagine they considered all of the alternatives that you proposed. (After all, they were / are smart people ...)
Having said that, the complete source code base for Java 6, 7 and 8 are available to you. That means that you could download it, and try some experiments of your own to see if your theories are right. If they are ... and you can gather solid evidence that they are ... then submit a patch to the OpenJDK team.

#StephenC I am still not convinced by the explanation. The compiler may do whatever optimization it wants, but when you debug in Eclipse the source view should be shielded from the compiler-generated code, and the debugger should not jump from one section of code to another within the same source file.
The following description in the question suggests that the source code and bytecode are not in sync, i.e., he is not running the latest code.
When debugging through Snippet 1 in Eclipse, I am actually taken to the StringBuilder constructor and to the append method
and
how the debugger refers back to the lines in the source code

Related

Tracking method implementation changes in class bytecode

I have the bytecode of some abstract project (let's call it The Project), for every one of its classes, inside some Kotlin code; each class's bytecode is stored as a ByteArray. The task is to tell which specific methods in each class have been modified from build to build of The Project. In other words, there are two ByteArrays of the same class of The Project, but they belong to different versions of it, and I need to compare them accurately. A simple example. Let's assume we have a trivial class:
class Rst {
    fun getjson(): String {
        abc("""ss""")
        return "jsonValid"
    }

    public fun abc(s: String) {
        println(s)
    }
}
Its bytecode is stored in oldByteCode. Now some changes have happened to the class:
class Rst {
    fun getjson(): String {
        abc("""ss""")
        return "someOtherValue"
    }

    public fun newMethod(s: String) {
        println("it's not abc anymore!")
    }
}
Its bytecode is stored in newByteCode.
That's the main goal: compare oldByteCode to newByteCode.
Here we have the following changes:
the getjson() method has been changed;
the abc() method has been removed;
newMethod() has been created.
So, a method counts as changed if its signature remains the same. If not, it's already a different method.
Now back to the actual problem. I have to know every method's exact status from its bytecode. What I have at the moment is the JaCoCo analyzer, which parses class bytecode into "bundles". In these bundles I have the hierarchy of packages, classes, and methods, but only with their signatures, so I can't tell whether a method's body has any changes. I can only track signature differences.
Are there any tools or libraries to split class bytecode into its methods' bytecode? With those I could, for example, calculate hashes and compare them. Maybe the ASM library can handle that?
Any ideas are welcome.
TL;DR your approach of just comparing bytecode or even hashes won't lead to a reliable solution; in fact, there is no solution to this kind of problem with reasonable effort at all.
I don't know how much of this applies to the Kotlin compiler, but as elaborated in Is the creation of Java class files deterministic?, Java compilers are not required to produce identical bytecode even if the same version is used to compile exactly the same source code. While they may have an implementation that tries to be as deterministic as possible, things change when looking at different versions or alternative implementations, as explained in Do different Java Compilers (where the vendor is different) produce different bytecode.
Even when we assume that the Kotlin compiler is outstandingly deterministic, even across versions, it can’t ignore the JVM evolution. E.g. the removal of the jsr/ret instructions could not be ignored by any compiler, even when trying to be conservative. But it’s rather likely that it will incorporate other improvements as well, even when not being forced¹.
So in short, even when the entire source code did not change, it’s not a safe bet to assume that the compiled form has to stay the same. Even with an explicitly deterministic compiler we would have to be prepared for changes when recompiling with newer versions.
Even worse, if one method changes, it may have an impact on the compiled form of others, as instructions refer to items of a constant pool whenever constants or linkage information are needed and these indices may change, depending on how the other methods use the constant pool. There’s also an optimized form for certain instructions when accessing one of the first 255 pool indices, so changes in the numbering may require changing the form of the instruction. This in turn may have an impact on other instructions, e.g. switch instructions have padding bytes, depending on their byte code position.
On the other hand, a simple change of a constant value used in only one method may have no impact on the method's bytecode at all, if the new constant happens to end up at the same place in the pool as the old constant.
So, to determine whether the code of two methods does actually the same, there is no way around parsing the instructions and understanding their meaning to some degree. Comparing just bytes or hashes won’t work.
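For the narrower question of splitting a class into per-method bytecode, the ASM library can do this. Below is a minimal sketch (class and method names are illustrative), assuming the asm-tree and asm-util modules are on the classpath; as explained above, hashing or diffing the resulting dumps will still report spurious differences, so it is only a starting point.
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.tree.ClassNode;
import org.objectweb.asm.tree.MethodNode;
import org.objectweb.asm.util.Textifier;
import org.objectweb.asm.util.TraceMethodVisitor;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

class MethodDumper {
    // Returns a textual dump of each method's instructions, keyed by name + descriptor.
    static Map<String, String> dumpMethods(byte[] classBytes) {
        ClassNode cn = new ClassNode();
        new ClassReader(classBytes).accept(cn, ClassReader.SKIP_DEBUG);
        Map<String, String> dumps = new HashMap<>();
        for (MethodNode mn : cn.methods) {
            Textifier textifier = new Textifier();
            mn.accept(new TraceMethodVisitor(textifier));
            StringWriter out = new StringWriter();
            textifier.print(new PrintWriter(out));
            dumps.put(mn.name + mn.desc, out.toString());
        }
        return dumps;
    }
}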
¹ to name some non-mandatory changes, the compilation of class literals changed, likewise string concatenation changed from using StringBuffer to use StringBuilder and changed again to use StringConcatFactory, the use of getClass() for intrinsic null checks changed to requireNonNull(…), etc. A compiler for a different language doesn’t have to follow, but no-one wants to be left behind…
There are also bugs to fix, like obsolete instructions, which no compiler would keep just to stay deterministic.

Implicit StringBuilder size

I almost always use StringBuilder(int) to avoid silly reallocations with the tiny default capacity. Does any javac implementation do anything special to use an appropriate capacity for implicit uses of StringBuilder in concatenation? Would "abcdefghijklmnopqrstuvwxyz" + numbers + "!.?:" use at least an initial capacity of 30 given the literal lengths?
Yes, it is going to be that smart -- smarter, actually -- in Java 9. (This is something I can loosely claim I was a part of -- I made the original suggestion to presize StringBuilders, and that inspired the better, more powerful approach they ended up with.)
http://openjdk.java.net/jeps/280 specifically mentions, among other issues, that "sometimes we need to presize the StringBuilder appropriately," and that's a specific improvement being made.
(And the moral of the story is, write your code as simply and readably as possible and let the compiler and JIT take care of the optimizations that are properly their job.)
This may depend on your compiler version, settings etc.
To test this, I created a simple method:
public static void main(String[] args) {
    System.out.println("abcdefghijklmnopqrstuvwxyz" + args[0] + "!.?:");
}
And after compiling and then running it through a decompiler, we get the following result:
System.out.println((new StringBuilder()).append("abcdefghijklmnopqrstuvwxyz").append(args[0]).append("!.?:").toString());
So it would seem that, at least with some javac versions, it just uses the default.
Manually inspecting the bytecode indicates the same thing.
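Until you can rely on a JDK 9+ compiler, the explicit presizing the question describes still has to be written by hand. A minimal sketch, with illustrative class and method names: the initial capacity covers the two literals (26 + 4 characters) plus the dynamic part, so the backing array never has to grow for this message.
class PresizeDemo {
    static String label(String numbers) {
        // 30 = combined length of the two literals; add the dynamic part's length.
        return new StringBuilder(30 + numbers.length())
                .append("abcdefghijklmnopqrstuvwxyz")
                .append(numbers)
                .append("!.?:")
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(label(args.length > 0 ? args[0] : "0123456789"));
    }
}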

Memory efficient, low overhead replacement for String in Java

After reading the answers to this old question, I'm a bit curious to know whether there are any frameworks now that provide for storing a large number (millions) of small (15-25 characters long) Strings more efficiently than java.lang.String.
If possible I would like to represent the strings using byte[] instead of char[].
My Strings are going to be constants, and I don't really require the numerous utility methods provided by the java.lang.String class.
Java 6 does this with -XX:+UseCompressedStrings which is on by default in some updates.
It's not in Java 5.0 or 7. It is still listed as on by default, but it's not actually supported in Java 7. :P
Depending on what you want to do, you could write your own classes, but if you only have a few hundred MBs of Strings, I suspect it's not worth it.
Most likely this optimization is not worth the effort and complexity it brings with it. Either live with what the VM offers you (as Peter Lawrey suggests), or go through great lengths to work your own solution (not using java.lang.String).
There is an interface CharSequence your own String class could implement. Unfortunately very few JRE methods accept a CharSequence, so be prepared that toString() will need to be used frequently on your class if you need to pass any of your 'Strings' to any other API.
You could also hack String to create your Strings in a more memory efficient (and less GC-friendly) way. String has a (package access level) constructor String(offset, count, char[]) that does not copy the chars but just takes the char[] as a direct reference. You could put all your strings into one big char[] array and construct the strings using reflection; this would avoid much of the overhead normally introduced by the char[] array in a string. I can't really recommend this method, since it relies on JRE private functionality.
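To make the CharSequence suggestion above concrete, here is a minimal sketch of a byte[]-backed string restricted to ISO-8859-1 characters, which halves the per-character memory compared to char[]. The class name and the charset restriction are illustrative assumptions, not a drop-in replacement for java.lang.String.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteString implements CharSequence {
    private final byte[] bytes;

    ByteString(String s) {
        // Assumes the content fits in ISO-8859-1; other characters would be mangled.
        this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    private ByteString(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override public int length() { return bytes.length; }

    @Override public char charAt(int index) { return (char) (bytes[index] & 0xFF); }

    @Override public CharSequence subSequence(int start, int end) {
        return new ByteString(Arrays.copyOfRange(bytes, start, end));
    }

    @Override public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
As noted, few JRE methods accept a CharSequence, so expect to call toString() whenever one of these has to be handed to an API that wants a String.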

Why use char[] instead of String?

In Thread.java, line 146, I have noticed that the author used a char[] instead of a String for the name field. Are there any performance reasons that I am not aware of? getName() also wraps the characters in a String before returning the name. Isn't it better to just use a String?
In general, yes. I suspect char[] was used in Thread for performance reasons, back in the days when such things in Java required every effort to get decent performance. With the advent of modern JVMs, such micro-optimizations have long since become unimportant, but it's just been left that way.
There's a lot of weird code in the old Java 1.0-era source, I wouldn't pay too much attention to it.
Hard to say. Perhaps they had some optimizations in mind, perhaps the person who wrote this code was simply more used to C-style char* arrays for strings, or perhaps by the time this code was written they were not sure whether strings would be immutable or not. But with this code, any time Thread.getName() is called, a new char array is created, so this code is actually heavier on the GC than just using a String.
Maybe the reason was security protection? String can be changed with reflection, so the author wanted copy-on-read and copy-on-write. If you are doing that, you might as well use a char array for faster copying.

What's a real-world example of using StringBuffer?

I'm using Java 6.
I've only written a couple of multi-threaded applications so I've never encountered a time when I had several threads accessing the same StringBuffer.
Could somebody give me a real world example when StringBuffer might be useful?
Thanks.
EDIT: Sorry I think I wasn't clear enough. I always use StringBuilder because in my applications, only one thread accesses the string at a time. So I was wondering what kind of scenario would require multiple threads to access StringBuffer at the same time.
The only real world example I can think of is if you are targeting Java versions before 1.5. The StringBuilder class was introduced in 1.5, so for older versions you have to use StringBuffer instead.
In most other cases StringBuilder should be preferred to StringBuffer for performance reasons; the extra thread safety provided by StringBuffer is rarely required. I can't think of any obvious situation where a StringBuffer would make more sense. Perhaps there are some, but none come to mind right now.
In fact it seems that even the Java library authors admit that StringBuffer was a mistake:
Evaluation by the libraries team:
It is by design that StringBuffer and StringBuilder share no common public supertype. They are not intended to be alternatives: one is a mistake (StringBuffer), and the other (StringBuilder) is its replacement.
If StringBuilder had been added to the library first, StringBuffer would probably never have been added. If you are in a situation where multiple threads appending to the same string seems like a good idea, you can easily get thread safety by synchronizing access to a StringBuilder, as sketched below. There's no need for a whole extra class and all the confusion it causes.
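A minimal sketch of that approach, with illustrative names: external synchronization around a plain StringBuilder gives the same guarantees StringBuffer would, and additionally makes compound operations (such as appending a whole line) atomic.
class SharedLog {
    private final StringBuilder sb = new StringBuilder();

    void append(String line) {
        synchronized (sb) {
            sb.append(line).append('\n');
        }
    }

    String snapshot() {
        synchronized (sb) {
            return sb.toString();
        }
    }
}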
It also might be worth noting that the .NET base class library, which is heavily inspired by Java's libraries, has a StringBuilder class but no StringBuffer, and I've never seen anyone complain about that.
A simple case can be when you have a log file and multiple threads are logging errors or warnings and writing to that log file.
In general, these types of buffered string objects are useful when you are dynamically building strings. They attempt to minimize the memory allocation and deallocation that occurs when you repeatedly append fixed-size strings together.
So, as a real world example, imagine you are manually building the HTML for a page, doing roughly 100 string appends. If you did this with immutable strings, the Java virtual machine would do quite a bit of memory allocation and deallocation, whereas with a StringBuffer it would do far less.
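A hedged sketch of that HTML example, with illustrative markup and names: one buffer accumulates every append instead of producing an intermediate String per concatenation.
import java.util.List;

class HtmlDemo {
    static String renderList(List<String> items) {
        // Rough presize: avoids most growth for typical short items.
        StringBuffer html = new StringBuffer(16 + items.size() * 32);
        html.append("<ul>");
        for (String item : items) {
            html.append("<li>").append(item).append("</li>");
        }
        return html.append("</ul>").toString();
    }
}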
StringBuffer is a very popular choice with programmers.
It has the advantage over standard String objects in that it is not an immutable object. Therefore, if a value is appended to the StringBuffer, a new object is not created (as it would be with String); the value is simply appended to the end.
This gives StringBuffers (under certain situations that cannot be compensated by the compiler) a performance advantage.
I tend to use StringBuffers anywhere that I dynamically add data to a string output, such as a log file writer, or other file generation.
The other alternative is StringBuilder. However, this is not thread-safe, as it was designed not to be in order to offer even better performance in single-threaded applications. Apart from StringBuffer's methods being declared synchronized, the classes are almost identical.
StringBuilder is recommended over StringBuffer in single threaded applications however, due to the performance gains (or if you look at it the other way around, due to the performance overheads of StringBuffer).
