Automatic interning of string literals

In the source code of
com.sun.org.apache.xerces.internal.impl.XMLScanner at lines 183 and 186:
183 protected final static String fVersionSymbol = "version".intern();
186 protected final static String fEncodingSymbol = "encoding".intern();
Why are "version" and "encoding" explicitly interned with intern() when they are string literals and would be interned automatically?

I've tracked down the change to revision 318617 in the Apache Xerces SVN Repository (this is the project where this XML parser was initially developed, as the package name suggests).
The relevant part of the commit message is:
Trying to improve the use of symbol tables. Many predefined Strings are
added to symbol tables every time the parser is reset. For small documents,
this would be a significant cost. Now since we call String#intern for Strings
in the symbol table, it's sufficient to use String#intern for those predefined
symbols. This only needs to be performed once.
As you noted, the .intern() should not be necessary (and should have no visible effect) on a conforming JVM implementation.
My guess is that
either the author was not aware of the fact that string literals will always be interned
or it was a conscious decision to ward against a misbehaving JVM implementation
In the second case I'd expect some note of that in a code comment or in the commit message, however.
One side effect of that .intern() call is that the initializers are no longer constant expressions, so the fields will not be inlined by other classes referencing them. That ensures that the class XMLScanner is loaded and its field is actually read. I don't think this is relevant here, however.
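
A minimal sketch of that side effect, assuming a compiler that treats constant variables the way the JLS describes. The class names Inlined and NotInlined and the LOG buffer are hypothetical, made up for illustration; nothing here is taken from Xerces.

```java
final class ConstantInliningDemo {
    // Records which holder classes were initialized, in order.
    static final StringBuilder LOG = new StringBuilder();

    static class Inlined {
        static { LOG.append("Inlined;"); }
        // Constant expression initializer: NAME is a constant variable.
        static final String NAME = "version";
    }

    static class NotInlined {
        static { LOG.append("NotInlined;"); }
        // Method call in the initializer: NAME is not a constant variable.
        static final String NAME = "version".intern();
    }

    static String run() {
        String a = Inlined.NAME;     // value is copied into this class at compile time
        String b = NotInlined.NAME;  // real field read; triggers class initialization
        return LOG + "same=" + (a == b);
    }

    public static void main(String[] args) {
        System.out.println(run());   // prints "NotInlined;same=true"
    }
}
```

Only NotInlined is initialized, because reading a constant variable does not count as a use of the field for class initialization purposes. Both fields still refer to the same interned String.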

I don't believe there's any good reason for that, for the reason you identified: Literals are always automatically interned, as defined by the String class:
All literal strings and string-valued constant expressions are interned. String literals are defined in section 3.10.5 of The Java™ Language Specification.
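
A small sketch of what that guarantee means in practice. The class and method names are hypothetical and exist only for this illustration.

```java
final class LiteralInternDemo {
    // A literal is already the pooled instance, so intern() returns the same object.
    static boolean literalIsAlreadyInterned() {
        String literal = "encoding";
        return literal == literal.intern();
    }

    // new String(...) always creates a fresh object; intern() maps it back to the pooled one.
    static boolean copyIsDistinctUntilInterned() {
        String copy = new String("encoding");
        return copy != "encoding" && copy.intern() == "encoding";
    }

    public static void main(String[] args) {
        System.out.println(literalIsAlreadyInterned());     // true
        System.out.println(copyIsDistinctUntilInterned());  // true
    }
}
```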

Related

In which memory area do the method area and the string constant pool reside in Java 8?

I have read the Oracle documentation, but it says nothing about the method area and the string constant pool. Where do the method area and the string constant pool reside in memory in JDK 8 and later?

The Java Language Specification does not specify where this lives.
It also doesn't matter: these objects end up being created, and there is no way to access the underlying memory directly.
That's sort of how Java works: the spec says what you can and cannot rely on, which gives JVM implementations room to do whatever they want, so long as they fulfill the contract. "Where in memory..." is a question that doesn't matter in Java, because you can't manipulate memory directly at all.
Go back to why you think you need to know and find another way; any answer to this question would be specific to some implementation of the JVM, and therefore your code wouldn't be portable. That is, any version update to the JVM, or some alternative JVM implementation such as OpenJ9 rolls along and your code just breaks, probably with a raw core dump. That doesn't sound like a good idea.

In Java 8 and later:
the method area is in metaspace
the string pool is in the regular heap.
This is an implementation detail for Oracle and OpenJDK JVMs. Other implementations may be different. But it really doesn't matter where strings and code are stored. Your application doesn't need to know.
By the way, it is called the "string pool", not the "string constant pool".
All strings are constant in the sense that they are immutable.
String variables that are declared as static final (and are constant in that sense) are not necessarily in the string pool.
Not all strings in the string pool are static final.
Not all strings in the string pool are string literals or other compile-time constant values.
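
Two of these points can be sketched in code. The names below are hypothetical; the only API assumed is String.intern() and the standard String constructors.

```java
final class StringPoolFacts {
    // static final, but created with new: a distinct heap object, not the pooled instance.
    static final String COPY = new String("pooled");

    static boolean staticFinalNotPooled() {
        return COPY != "pooled" && COPY.equals("pooled");
    }

    // A string assembled at run time is not a literal, yet intern() gives a pooled instance.
    static boolean runtimeStringCanBePooled() {
        char[] chars = {'d', 'y', 'n', 'a', 'm', 'i', 'c'};
        String built = new String(chars);
        String pooled = built.intern();
        return pooled == pooled.intern() && pooled.equals(built);
    }

    public static void main(String[] args) {
        System.out.println(staticFinalNotPooled());      // true
        System.out.println(runtimeStringCanBePooled());  // true
    }
}
```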

Where is the description of Constant Folding in the Java Language Specification, Java SE 11 Edition (JLS SE 11)?

As far as I know, Java handles constant variables (§4.12.4) by constant folding at compile time. I've tried my best, but I couldn't find a description of it in the JLS. Could anybody tell me where I can find an official description of the constant folding process for Java 11?

The specification does not use the term Constant Folding.
It does have the definition of Constant Expressions:
A constant expression is an expression denoting a value of primitive type or a String that does not complete abruptly and is composed using only the following:
[…]
Constant expressions of type String are always "interned" so as to share unique instances, using the method String.intern.
A constant expression is always treated as FP-strict (§15.4), even if it occurs in a context where a non-constant expression would not be considered to be FP-strict.
Constant expressions are used as case labels in switch statements (§14.11) and have a special significance in assignment contexts (§5.2) and the initialization of a class or interface (§12.4.2). They may also govern the ability of a while, do, or for statement to complete normally (§14.21), and the type of a conditional operator ? : with numeric operands.
The last part already points out where precalculation of constant expressions is mandatory. When it comes to case labels, the compiler is required to report duplicates, hence it must calculate the values at compile time. When compiling loops, it must evaluate constant boolean expressions to determine code reachability.
Likewise, initializers need precalculation to determine the correctness. E.g. short s = 'a' * 2; is a correct declaration, but short s = Short.MAX_VALUE + 1; is not.
A well-known use case of constant expressions is the initializer of constant variables. When reading a constant variable, the constant value will be used instead of reading the variable; compare with the Q&A “Does the JLS require inlining of final String constants?”
But this does not imply that “constant folding” is mandatory. In theory, a conforming implementation still could perform the calculation of the constant expression as written in the variable initializer at every place where the variable is used. In practice, the bytecode format leads to a constant folding behavior. The ConstantValue attribute which is used to record the value of a constant variable in bytecode can only hold a precalculated value. When compiling against an already compiled class file, the original expression of a constant variable is not available to the compiler. It can only use the precalculated value.
Likewise, compiling a switch statement is normally done using either the tableswitch or the lookupswitch instruction, both requiring precalculated int values for the case labels. A compiler would have to go to great lengths to implement a different strategy.
Also, the compiled format for annotation values can only hold precalculated expressions.
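
A short sketch of the places mentioned above where a compiler has to evaluate constant expressions. The class, field, and method names are hypothetical and chosen only for illustration.

```java
final class ConstantExpressionDemo {
    static final int BASE = 10;
    static final int LIMIT = BASE * 2 + 1;        // constant expression, value 21
    static final String SYMBOL = "ver" + "sion";  // constant expression of type String

    static String classify(int n) {
        switch (n) {
            // Case labels must be constant expressions; duplicates are compile-time errors.
            case BASE:  return "base";
            case LIMIT: return "limit";
            default:    return "other";
        }
    }

    static short narrowed() {
        short s = 'a' * 2;  // 194 fits in a short, so the implicit narrowing is allowed
        return s;
    }

    static boolean concatenationIsInterned() {
        return SYMBOL == "version";  // both sides are interned constant expressions
    }

    public static void main(String[] args) {
        System.out.println(classify(21));               // limit
        System.out.println(narrowed());                 // 194
        System.out.println(concatenationIsInterned());  // true
    }
}
```

Replacing the initializer of narrowed() with Short.MAX_VALUE + 1 would be rejected at compile time, which is the check described above.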

The Java Language Specification defines the semantics of the language; constant folding is a compiler optimisation which does not change the behaviour of a Java program, so it is not specified in the JLS, and does not need to be. It is allowed for an implementation of Java not to do it, or to do it in some circumstances but not others, so long as the compiled program does what the JLS says it should.
That said, the JLS does define the language semantics in such a way as to allow constant folding in more cases without changing the behaviour of the program. The most relevant paragraph is what you presumably refer to in §4.12.4:
A constant variable is a final variable of primitive type or type String that is initialized with a constant expression (§15.28). Whether a variable is a constant variable or not may have implications with respect to class initialization (§12.4.1), binary compatibility (§13.1), reachability (§14.21), and definite assignment (§16.1.1).
The referenced sections about class initialisation, binary compatibility, reachability and definite assignment specifically define the semantics of constant variables differently to other variables; specifically, the behaviours they define are the behaviours you would expect from a compiler which does fold constants. This allows those implementing the specification to do the optimisation without overly constraining how they do it.

The process for fields is linked from the page linked in the question: https://docs.oracle.com/javase/specs/jls/se11/html/jls-13.html#jls-13.1
A reference to a field that is a constant variable (§4.12.4) must be
resolved at compile time to the value V denoted by the constant
variable's initializer.
If such a field is static, then no reference to the field should be
present in the code in a binary file, including the class or interface
which declared the field. Such a field must always appear to have been
initialized (§12.4.2); the default initial value for the field (if
different than V) must never be observed.
If such a field is non-static, then no reference to the field should
be present in the code in a binary file, except in the class
containing the field. (It will be a class rather than an interface,
since an interface has only static fields.) The class should have code
to set the field's value to V during instance creation (§12.5).

How does the JVM look up the String in the String constant pool?

I want to understand the string pool more deeply. Please help me get to the source class file containing this implementation in Java.
The question is mainly about finding the source code or implementation of the String pool, so that I can dig deeper into the concept and learn about its less obvious details. That way we could use strings more efficiently, or think of some other way to implement our own garbage collection in case we have an application that creates a great many literals and String objects.

I am sorry to disappoint you, but the Java string pool is not an actual Java class; it is implemented inside the JVM, i.e. it is written as C++ code.
If you look at the source code of the String class (pretty much all the way down) you see that the intern() method is native.
You will have to go through some JVM code to get more information.
Edit:
Some implementation can be found here (C++ header, C++ implementation). Search for StringTable.
Edit2: As Holger pointed out in the comments, this is not a hard requirement of the JVM implementation. So it is possible to have a JVM that implements the String pool differently, e.g. using an actual Java class. That said, all commonly used JVMs I am aware of implement it in the JVM's C++ code.

You can go through this article: Strings, Literally
When a .java file is compiled into a .class file, any String literals
are noted in a special way, just as all constants are. When a class is
loaded (note that loading happens prior to initialization), the JVM
goes through the code for the class and looks for String literals.
When it finds one, it checks to see if an equivalent String is already
referenced from the heap. If not, it creates a String instance on the
heap and stores a reference to that object in the constant table. Once
a reference is made to that String object, any references to that
String literal throughout your program are simply replaced with the
reference to the object referenced from the String Literal Pool.
So, in the example shown above, there would be only one entry in the
String Literal Pool, which would refer to a String object that
contained the word "someString". Both of the local variables, one and
two, would be assigned a reference to that single String object. You
can see that this is true by looking at the output of the above
program. While the equals() method checks to see if the String objects
contain the same data ("someString"), the == operator, when used on
objects, checks for referential equality - that means that it will
return true if and only if the two reference variables refer to the
exact same object. In such a case, the references are equal. From the
above output, you can see that the local variables, one and two, not
only refer to Strings that contain the same data, they refer to the
same object.
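
The example the excerpt refers to is not reproduced here. A minimal reconstruction, assuming only the variable names one and two and the literal "someString" mentioned in the text, might look like this (the class and method names are hypothetical):

```java
final class SomeStringDemo {
    // Returns { one.equals(two), one == two } for two identical literals.
    static boolean[] compare() {
        String one = "someString";
        String two = "someString";
        return new boolean[] { one.equals(two), one == two };
    }

    public static void main(String[] args) {
        boolean[] result = compare();
        System.out.println(result[0]);  // true: same character data
        System.out.println(result[1]);  // true: same pooled object
    }
}
```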

Why do StringBuilders pop up when debugging String concatenation?

I am aware that Strings are immutable, and of when to use a StringBuilder or a StringBuffer. I also read that the bytecode for these two snippets would end up being the same:
// Snippet 1
String variable = "text";
getClass().getResourceAsStream("string" + variable);
// Snippet 2
StringBuilder sb = new StringBuilder("string");
sb.append("text");
getClass().getResourceAsStream(sb.toString());
But I obviously have something wrong. When debugging through Snippet 1 in Eclipse, I am actually taken to the StringBuilder constructor and to the append method. I suppose I'm missing details on how bytecode is interpreted and how the debugger refers back to the lines in the source code; if anyone could explain this a bit, I'd really appreciate it. Also, maybe you can point out what's JVM specific and what isn't (I'm running Oracle's v6, for example). Thanks!

Why do StringBuilders pop up when debugging String concatenation?
Because string concatenation (via the '+' operator) is typically compiled to code that uses a StringBuffer or StringBuilder to do the concatenation. The JLS explicitly permits this behaviour.
"An implementation may choose to perform conversion and concatenation in one step to avoid creating and then discarding an intermediate String object. To increase the performance of repeated string concatenation, a Java compiler may use the StringBuffer class or a similar technique to reduce the number of intermediate String objects that are created by evaluation of an expression." JLS 15.18.1.
(If your code is using a StringBuffer rather than a StringBuilder, it is probably because it was compiled using a really old Java compiler, or because you have specified a really old target JVM. The StringBuilder class is a relatively recent addition to Java. Older versions of the JLS used to mention StringBuffer instead of StringBuilder, IIRC.)
Also, maybe you can point out what's JVM specific and what isn't.
The bytecodes produced for "string" + variable depend on how the Java compiler handles the concatenation. (In fact, all generated bytecodes are Java compiler dependent to some degree. The JLS and JVM specs do not dictate what bytecodes must be generated. The specifications are more about how the program should behave, and what individual bytecodes do.)
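
As a rough sketch, the + form and its hand-written StringBuilder counterpart look like this at the source level. The class and method names are hypothetical. The builder form mirrors what older javac versions typically emit; javac in JDK 9 and later typically emits an invokedynamic call instead, so treat the equivalence as one about results, not bytecode.

```java
final class ConcatDemo {
    // Source-level concatenation; the compiler decides how to implement it.
    static String withOperator(String variable) {
        return "string" + variable;
    }

    // Roughly what a StringBuilder-based translation of the line above does.
    static String withBuilder(String variable) {
        return new StringBuilder().append("string").append(variable).toString();
    }

    public static void main(String[] args) {
        System.out.println(withOperator("text"));  // stringtext
        System.out.println(withBuilder("text"));   // stringtext
    }
}
```

With a StringBuilder-based translation, stepping into the first method lands in StringBuilder code even though the source never mentions it.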
@supercat comments:
I wonder why string concatenation wouldn't use e.g. a String constructor overload which accepts two String objects, allocates a buffer of the proper combined size, and joins them? Or, when joining more strings, an overload which takes a String[]? Creating a String[] containing references to the strings to be joined should be no more expensive than creating a StringBuilder, and being able to create a perfect-sized backing store in one shot should be an easy performance win.
Maybe ... but I'd say probably not. This is a complicated area involving complicated trade-offs. The chosen implementation strategy for string concatenation has to work well across a wide range of different use-cases.
My understanding is that the original strategy was chosen after looking at a number of approaches, and doing some large-scale static code analysis and benchmarking to try to figure out which approach was best. I imagine they considered all of the alternatives that you proposed. (After all, they were / are smart people ...)
Having said that, the complete source code base for Java 6, 7 and 8 is available to you. That means that you could download it and try some experiments of your own to see if your theories are right. If they are ... and you can gather solid evidence that they are ... then submit a patch to the OpenJDK team.

@StephenC I am still not convinced by the explanation. The compiler may do whatever optimization it wants, but when you debug in Eclipse, compiler-generated code should be hidden from the source view, and the debugger should not jump from one section of code to another within the same source file.
The following description in the question suggests that the source code and byte code are not in sync. i.e., he is not running the latest code.
When debugging through Snippet 1 in eclipse, I am actually taken to the StringBuffer constructor and to the append method
and
how the debugger refers back to the lines in the source code

When are Java Strings interned?

Inspired by the comments on this question, I'm pretty sure that Java Strings are interned at runtime rather than compile time - surely just the fact that classes can be compiled at different times, but would still point to the same reference at runtime.
I can't seem to find any evidence to back this up. Can anyone justify this?

The optimization happens (or at least can happen) in both places:
If two references to the same string constant appear in the same class, I'd expect the class file to only contain one constant pool entry. This isn't strictly required in order to ensure that there's only one String object created in the JVM, but it's an obvious optimization to make. This isn't actually interning as such - just constant optimization.
When classes are loaded, the string pool for the class is added to the intern pool. This is "real" interning.
(I have a vague recollection that one of the bits of work for Java 7 around "small jar files" included a single string pool for the whole jar file... but I could be very wrong.)
EDIT: Section 5.1 of the JVM spec, "The Runtime Constant Pool" goes into details of this:
To derive a string literal, the Java
virtual machine examines the sequence
of characters given by the
CONSTANT_String_info structure.
If the method String.intern has
previously been called on an instance
of class String containing a sequence
of Unicode characters identical to
that given by the CONSTANT_String_info
structure, then the result of string
literal derivation is a reference to
that same instance of class String.
Otherwise, a new instance of class
String is created containing the
sequence of Unicode characters given
by the CONSTANT_String_info structure;
that class instance is the result of
string literal derivation. Finally,
the intern method of the new String
instance is invoked.
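
One observable consequence of this rule is that equal literals in separately compiled classes resolve to the same object. A minimal sketch, with hypothetical class names (each nested class compiles to its own class file with its own constant pool):

```java
final class CrossClassLiteralDemo {
    static class First  { static String get() { return "shared"; } }
    static class Second { static String get() { return "shared"; } }

    // Each class has its own constant pool entry, yet both resolve to one String instance.
    static boolean sameInstanceAcrossClasses() {
        return First.get() == Second.get();
    }

    // A string built at run time is a different object until it is interned.
    static boolean runtimeStringJoinsSameInstance() {
        String built = new StringBuilder("sha").append("red").toString();
        return built != First.get() && built.intern() == First.get();
    }

    public static void main(String[] args) {
        System.out.println(sameInstanceAcrossClasses());       // true
        System.out.println(runtimeStringJoinsSameInstance());  // true
    }
}
```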

Runtime.
JLS and JVM specifications specify javac compilation to class files which contain constant declarations (in the Constant Pool) and constant usage in code (which javac can inline as primitive / object reference values). For compile-time String constants, the compiler generates code to construct String instances and to call String.intern() for them, so that the JVM interns String constants automatically. This is a behavioural requirement from JLS:
http://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.28
Compile-time constant expressions of type String are always "interned" so as to share unique instances, using the method String.intern.
But these specs have neither the concept nor the definition of any particular String intern pool structures/references/handles whether compile time or runtime. (Of course, in general, the JVM spec does not mandate any particular internal structure for objects: http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html#jvms-2.7)
The reason that no intern pool structures are mentioned is because they're handled entirely with the String class. The intern pool is a private static/class-level structure of the String class (unspecified by JLS & JVM specs & javadoc).
Objects are added to the intern pool when String.intern() is called at runtime. The intern pool is leveraged privately by the String class: when code creates new String instances and calls String.intern(), the String class determines whether to reuse existing internal data. Optimisation can be carried out by the JIT compiler at runtime.
There's no compile-time contribution here, bar the vanilla inlining of constant values.
