Java's native character set for Strings

I am utterly confused by the answers I have seen on Stack Overflow and in the Java docs:
https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.1
What is the character encoding of String in Java?
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
While all the theory in the docs and the posts linked above seems to point to UTF-16 as the native character set of Java, there is another claim that it depends on the JVM/OS. For example, this link says:
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
Then in the same link in another section it says
The native character encoding of the Java programming language is UTF-16.
I am finding it difficult to reconcile these apparently contradictory statements:
one says it is dependent on the OS
the other (I infer) says that, regardless of the OS, UTF-16 is the charset for Java (this is also what all of the links mentioned above say)
Now, when I execute the following piece of code:
package org.sheel.classes;

import java.nio.charset.Charset;

public class Test {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset());
    }
}
...in an online editor I see UTF-8. On my local system I see windows-1252.
And lastly, there is a JDK Enhancement Proposal (JEP) which talks about changing the default to UTF-8
Could there be an explanation for this confusion?

A String is conceptually an array of char (see toCharArray()), each char being a UTF-16 code unit. When you convert the string to an array of bytes without specifying a charset, as with getBytes(), the platform default is used.
PS: as noted by VGR, recent implementations may not actually store a String as an array of char, but as programmers we usually interact with it through char values, which are always UTF-16 code units.
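A minimal sketch of that difference (the byte count from the no-argument getBytes() depends on your platform's default charset; the UTF-8 result is fixed):

import java.nio.charset.StandardCharsets;

public class CharsVsBytes {
    public static void main(String[] args) {
        String s = "\u03c0";                              // "π", one code point, one char
        char[] chars = s.toCharArray();                   // UTF-16 code units: [0x03C0]
        byte[] platform = s.getBytes();                   // uses Charset.defaultCharset()
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // always two bytes: CF 80
        System.out.println(chars.length + " char(s), "
                + platform.length + " default-charset byte(s), "
                + utf8.length + " UTF-8 byte(s)");
    }
}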

The internal encoding used by String has nothing to do with the platform’s default charset. They are completely independent of each other.
String internals
Internally, a String may store its data as anything. As programmers, we don’t interact with the private implementation; we can only use public methods. The public methods usually return a String’s data as UTF-16 (char values), though some, like the codePoints() method, can return full UTF-32 int values. None of those methods indicate how String data is stored internally, only the forms in which a programmer may examine that data.
So, rather than saying that String stores data internally as UTF-16 or any other encoding, it’s correct to say that String stores a sequence of Unicode code points, and makes them available in various forms, most commonly as char values.
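As a small illustration (a sketch, using an arbitrary emoji code point outside the BMP), the char-based and code-point-based views of the same String differ:

public class CodePointsDemo {
    public static void main(String[] args) {
        String s = "A\uD83D\uDE00"; // "A" followed by U+1F600, stored as a surrogate pair
        System.out.println(s.length());             // 3 -- counts char values (UTF-16 code units)
        System.out.println(s.codePoints().count()); // 2 -- counts Unicode code points
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp)); // U+0041, U+1F600
    }
}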
Default charset
The default charset is something Java obtains from the underlying system.
As roberto pointed out, the default charset matters when you use certain (outdated) methods and constructors. Converting a String to bytes, or converting bytes to a String, without explicitly specifying a charset, will make use of the default charset. Similarly, creating an InputStreamReader or OutputStreamWriter without specifying a charset will use the default charset.
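For example, a hedged sketch (the file names are made up) of passing the charset explicitly instead of falling back on the default:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetCopy {
    public static void main(String[] args) throws IOException {
        // new InputStreamReader(in) and new OutputStreamWriter(out) would use the
        // default charset; naming the charset makes the behaviour platform-independent.
        try (Reader in = new InputStreamReader(
                     new FileInputStream("in.txt"), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        }
    }
}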
It is usually unwise to rely on the default charset, as it will make your code behave differently on different platforms. Also, some charsets can represent all known characters, but some charsets can represent only a small subset of the total Unicode repertoire. In particular, Windows usually has a default charset which uses a single byte to represent each character (windows-1252 in US versions of Windows), and obviously that isn’t enough space for the hundreds of thousands of available characters.
If you rely on the default charset, there is indeed a chance that you will lose information:
String s = "\u03c0\u22603"; // "π≠3"
byte[] bytes = s.getBytes();
for (byte b : bytes) {
    System.out.printf("%02x ", b);
}
System.out.println();
On most systems, this will print:
cf 80 e2 89 a0 33
On Windows, this will probably print:
3f 3f 33
The pi and not-equal characters aren’t represented in the windows-1252 charset, so on Windows, the getBytes method replaces them with question marks (byte value 3f).
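Passing the charset explicitly avoids the loss; a sketch of the same loop with UTF-8 named, which should print cf 80 e2 89 a0 33 regardless of the platform:

String s = "\u03c0\u22603"; // "π≠3"
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // requires java.nio.charset.StandardCharsets
for (byte b : bytes) {
    System.out.printf("%02x ", b);
}
System.out.println();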
If conversion to or from bytes is not involved, String objects will never lose information, because regardless of how they store their data internally, the String class guarantees that every character will be preserved.

Related

Java internal String representation: is it UTF-16?

I have found on SO that Java strings are represented as UTF-16 internally. Out of curiosity I developed and ran the following snippet (Java 7):
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class StringExperiment {
    public static void main(String... args) throws UnsupportedEncodingException {
        System.out.println(Arrays.toString("ABC".getBytes()));
    }
}
which resulted in:
[65, 66, 67]
being printed to the console output.
How does it match with UTF-16?
Update. Is there a way to write a program that prints internal bytes of the string as is?
Java's internal string representation is based on its char type and is thus UTF-16.
Unless it isn't: a modern VM (since the Java 6 Update 21 Performance Release) might try to save space by using a single-byte encoding where plain ASCII suffices.
And serialization / the Java Native Interface use modified UTF-8 (a CESU-8-like, surrogate-agnostic variant of UTF-8), with NUL represented as two bytes to avoid embedded zeroes.
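You can see the modified encoding at work with DataOutputStream.writeUTF, which is documented to use modified UTF-8 (a rough sketch; the first two bytes are the length prefix writeUTF adds):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new DataOutputStream(buffer).writeUTF("A\u0000B");
        for (byte b : buffer.toByteArray()) {
            System.out.printf("%02x ", b); // 00 04 41 c0 80 42
        }
        System.out.println();
        // U+0000 comes out as c0 80, not as a single 00 byte.
    }
}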
All of that is irrelevant for your "test" though:
You are asking Java to encode the string in the platform's default charset, and that's not the internal charset:
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
You seem to be misunderstanding something.
For all the system cares, and, MOST OF THE TIME, the developer cares, chars might as well be carrier pigeons, and Strings sequences of said carrier pigeons. Although yes, internally, strings are sequences of chars (more precisely, of UTF-16 code units), that is not the problem at hand here.
You don't write chars into files, neither do you read chars from files. You write, and read, bytes.
And in order to read a sequence of bytes as a sequence of chars/carrier pigeons, you need a decoder; similarly (and this is what you do here), in order to turn chars/carrier pigeons into bytes, you need an encoder. In Java, both of these are available from a Charset.
String.getBytes() just happens to use an encoder with the platform's default character encoding (obtained using Charset.defaultCharset()), and it happens that for your input string "ABC" and your JRE implementation, the sequence of bytes generated is 65, 66, 67. Hence the result.
Now try String.getBytes(Charset.forName("UTF-32LE")) and you'll get a different result.
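A sketch of that comparison (UTF-32LE is not in StandardCharsets, so its availability via Charset.forName is assumed here, though it ships with common JREs):

import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodingsCompared {
    public static void main(String[] args) {
        String s = "ABC";
        System.out.println(Arrays.toString(s.getBytes())); // platform default, e.g. [65, 66, 67]
        System.out.println(Arrays.toString(
                s.getBytes(Charset.forName("UTF-32LE")))); // [65, 0, 0, 0, 66, 0, 0, 0, 67, 0, 0, 0]
    }
}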
Java Strings are indeed represented as UTF-16 internally, but you are calling the getBytes method, which does the following (my emphasis)
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's
default charset, storing the result into a new byte array.
And your platform's default encoding is probably not UTF-16.
If you use the variant that lets you specify an encoding, you can see how the string would look in other encodings:
public byte[] getBytes(Charset charset)
If you look at the source code for java.lang.String, you can see that the String is stored internally as an array of (16-bit) chars (in Java 8 and earlier; Java 9's compact strings changed the internal storage to a byte array).
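To see the two-bytes-per-char view the question was expecting, ask for a UTF-16 encoding explicitly (a minimal sketch; UTF-16BE is used here to avoid a byte-order mark in the output):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16View {
    public static void main(String[] args) {
        System.out.println(Arrays.toString("ABC".getBytes(StandardCharsets.UTF_16BE)));
        // [0, 65, 0, 66, 0, 67] -- two big-endian bytes per char
    }
}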

Will String.getBytes("UTF-16") return the same result on all platforms?

I need to create a hash from a String containing a user's password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with a specified encoding (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly), and therefore on such a platform I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?
Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(Note that the encoder does write the big-endian byte-order mark, so String.getBytes(...) with "UTF-16" will start with the bytes FE FF; use UTF-16BE or UTF-16LE if you don't want a BOM in the output.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
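For the hashing use case in the question, a hedged sketch (real password storage should use a dedicated password-hashing algorithm rather than a bare digest) that fixes the charset so the bytes, and therefore the hash, are identical on every platform:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StableHash {
    static byte[] sha256(String password) throws NoSuchAlgorithmException {
        byte[] bytes = password.getBytes(StandardCharsets.UTF_16); // explicit, platform-independent
        return MessageDigest.getInstance("SHA-256").digest(bytes);
    }
}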
It is true, Java uses Unicode internally so it can combine any script/language. String and char use UTF-16BE, but .class files store their String constants in (modified) UTF-8. In general it is irrelevant what String does internally, as there is always a conversion to bytes that specifies the encoding the bytes have to be in.
If that byte encoding cannot represent some of the Unicode characters, a placeholder character or question mark is substituted. Fonts, too, might not have all Unicode characters; 35 MB is a normal size for a full Unicode font. You might then see a square with 2x2 hex digits for missing code points, or on Linux another font might be substituted for the missing glyph.
Hence UTF-8 is a perfectly fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // add a Unicode BOM
    s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
Both UTF-16 (in both byte orders) and UTF-8 are always present in the JRE, whereas some charsets are not. Hence you can use a constant from StandardCharsets and never have to handle an UnsupportedEncodingException.
Above I added a BOM especially for Windows Notepad, so that it recognizes the text as UTF-8. It is certainly not good practice, but it is a small help here.
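To illustrate the StandardCharsets point above, a small sketch: the String-named overload forces you to handle a checked exception for a charset that is in fact guaranteed to exist, while the StandardCharsets constant does not:

import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class NoCheckedException {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "text";
        byte[] a = s.getBytes("UTF-8");                // declares UnsupportedEncodingException (checked)
        byte[] b = s.getBytes(StandardCharsets.UTF_8); // no checked exception to handle
        System.out.println(a.length + " " + b.length);
    }
}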
There is no real disadvantage to UTF-16LE or UTF-16BE. I think UTF-8 is a bit more universally used, since UTF-16 also cannot store all Unicode code points in 16 bits. Text in Asian scripts would be more compact in UTF-16, but HTML pages are already more compact in UTF-8 because of the HTML tags and other Latin-script content.
For Windows, UTF-16LE might be more native.
Problems with placeholder characters can occur on platforms whose default charset is not a Unicode encoding, especially Windows.
I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer your question in the negative.
As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.

Unicode code points to bytes and reverse: how do you do that in C++?

As an introduction, I do Java and have done quite a bit of C in the past.
In Java, a String literal can contain any set of graphemes as long as you can input them in your editing environment; said editing environment will then save your source file in whatever character encoding is used at the time.
At runtime, and as long as the compiler supports the encoding, the byte code represents all String literals as a set of chars, where a char represents one UTF-16 code unit. (Unicode code points outside the BMP therefore require two chars; you can obtain an array of chars necessary to represent a Unicode code point outside the BMP using Character.toChars()).
You have classes for a character encoding (Charset), the process of encoding a sequence of chars to a sequence of bytes (CharsetEncoder) and also the reverse (CharsetDecoder). Therefore, whatever the character encoding used by your source/destination, whether it be a file, a socket or whatever, you can encode/decode as appropriate.
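For reference, a minimal Java sketch of that round trip (an arbitrary non-BMP code point is used to show the surrogate pair):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        char[] pair = Character.toChars(0x1F600);           // non-BMP code point -> two chars
        String s = new String(pair);

        Charset utf8 = StandardCharsets.UTF_8;
        ByteBuffer bytes = utf8.encode(CharBuffer.wrap(s)); // CharsetEncoder under the hood
        CharBuffer back = utf8.decode(bytes);               // CharsetDecoder under the hood
        System.out.println(s.contentEquals(back));          // true
    }
}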
Now, let us suppose C++11. It introduces std::u32string, std::u16string; those are "aliases", as far as I understand, to std::basic_string<char32_t> and std::basic_string<char16_t>, and the net effect of them is that at runtime, the string constants you declare (using u"" and U"") are made of 16bit or 32bit entities representing a UTF-16 or UTF-32 code unit respectively. There is also u8"" (what is the basic_string type for the latter if any, since it has no fixed length?).
Other important point: UTF-16 has two variants, LE and BE; Java uses BE since, at the bytecode level, everything is BE. Does char{16,32}_t depend on endianness in your code?
But even after hours of searching, I cannot find an answer: can C++11, as standard, do what the standard JDK does, that is, convert any string constant into a suitable byte sequence and the reverse, given a character encoding? I suspect this is made more difficult since there are basically three representations of a string literal at runtime, without even getting into char *, which is basically a byte array...
(edit: added links to the relevant javadoc)
You can convert through using a codecvt locale facet.
The usage is somewhat unintuitive, but this is what I did:
#include <string>
#include <locale>
#include <codecvt>

/** Convert utf8 stream to UCS-4 stream */
std::u32string decode(std::string utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.from_bytes(utf8);
}

/** Convert UCS-4 stream to utf8 stream */
std::string encode(std::u32string ucs4)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(ucs4);
}
It requires a decent compiler though, for me only clang worked correctly, gcc compiled but generated invalid results (newer versions of gcc may be ok).
C++ does not specify a source file encoding. In fact, it supports EBCDIC. All C++11 compilers support UTF-8, and many support other encodings by passing appropriate flags.
The standard specifies an escape code syntax for characters outside the basic source character set, which essentially comprises the characters used by the language. Characters outside the basic source character set are called "extended characters" and they are replaced by the corresponding code before the source is compiled, or even preprocessed. This ensures that the meaning of source code is independent of its encoding.
char32_t and char16_t do not have endianness built in. They are simply equivalent to uint32_t and uint16_t. You could say that they inherit the native endianness, but directly serializing object representations as bytes is an abuse.
To reliably specify UTF-8 literals, and override any compiler settings to the contrary, use u8"" which is ready for serialization. u"" and U"" do not have endianness because the values are already baked into the program.
To serialize, you can use the codecvt_utf8 and codecvt_utf16 class templates, which take compile-time template flags specifying the file format:
enum codecvt_mode {
    consume_header = 4,
    generate_header = 2,
    little_endian = 1
};
To set a stream file (in binary mode) to encode char32_t strings into UTF-16LE with a byte-order mark, you would use
std::basic_ofstream< char32_t > file( path, std::ios::binary );
file.imbue( std::locale( file.getloc(), new std::codecvt_utf16<
    char32_t,
    0x10FFFF, // the Maxcode template parameter comes before the mode flags
    std::codecvt_mode( std::generate_header | std::little_endian )
>{} ) );
This is preferable to translating before outputting.
#include <string>
#include <codecvt>
#include <locale>

template<typename Facet>
struct usable_facet : Facet {
    using Facet::Facet;
    ~usable_facet() = default;
};

int main() {
    using utf16_codecvt = usable_facet<std::codecvt<char16_t, char, std::mbstate_t>>;
    using utf32_codecvt = usable_facet<std::codecvt<char32_t, char, std::mbstate_t>>;

    std::wstring_convert<utf16_codecvt, char16_t> u16convert; // bidirectional UTF-16/UTF-8 conversion
    std::wstring_convert<utf32_codecvt, char32_t> u32convert; // bidirectional UTF-32/UTF-8

    std::string utf8 = u16convert.to_bytes(u"UTF-16 data");
    std::u16string utf16 = u16convert.from_bytes(u8"UTF-8 data");

    utf8 = u32convert.to_bytes(U"UTF-32 data");
    std::u32string utf32 = u32convert.from_bytes(u8"UTF-8 data");
}
You can also use other facets, but be careful because they don't all do what they sound like or what it seems like they should. codecvt_utf8 won't convert to UTF-16 if you use char16_t, codecvt_utf16 uses UTF-16 as the narrow encoding, etc. The names make sense given their intended usage, but they're confusing with wstring_convert.
You can also use wstring_convert with whatever encodings are used by supported locales using codecvt_byname (However you can only convert between that locale's char encoding and its own wchar_t encoding, not between the locale narrow encoding and a fixed Unicode encoding. Locales specify their own wchar_t encoding and it's not necessarily a Unicode encoding or the same as the wchar_t encoding used by another locale.)
using locale_codecvt = usable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
std::wstring_convert<locale_codecvt, wchar_t> legacy_russian(new locale_codecvt("ru_RU")); // non-portable locale name
std::string legacy_russian_data = /* ... some source of legacy encoded data */
std::wstring w = legacy_russian.from_bytes(legacy_russian_data);
The only standard way to convert between arbitrary locale encoded text and any Unicode encoding is the poorly supported <cuchar> header with low level functions like c16rtomb and c32rtomb.

JNI strings and C strings

In the documentation of the JNI function FindClass I can read about the argument name:
name: a fully-qualified class name (...) The string is encoded in modified UTF-8.
According to the documentation, modified UTF-8 has to end with double '\0' chars:
the null character (char)0 is encoded using the two-byte format rather than the one-byte format
Does it mean that I should invoke FindClass from C in this way:
FindClass("java/lang/String\0")
i.e. with double '\0' at the end?
Character set, encoding and termination are three different things. Obviously, an encoding is designed for a specific character set but a character set can be encoded in multiple ways. And, often, a terminator (if used) is an encoded character, but with modified UTF-8, this is not the case.
Java uses the Unicode character set. For string and char types, it uses the UTF-16 encoding. The string type is counted; It doesn't use a terminator.
In C, terminated strings are common, as well as single-byte encodings of various character sets. C and C++ compilers terminate literal strings with the NUL character. In the destination character set encoding of the compiler, this is either one or two 0x00 bytes. Almost all common character sets and their encodings have the same byte representation for the non-control ASCII characters. This is true of the UTF-8 encoding of the Unicode character set. (But, note that is not true for characters outside of the limited subset.)
The JNI designers opted to use this limited "interoperability" with C strings. Many JNI functions accept 0x00-terminated modified UTF-8 strings. These are compatible with what a C compiler would produce from a literal string in the source code, again provided that the characters are limited to non-control ASCII characters. This covers the use case of writing Java package & class, method and field strings in JNI. (Well, almost: Java allows any Unicode currency symbol in an identifier.)
So, you can pass C string literals to JNI functions in a WYSIWYG style. No need to add a terminator—the compiler does that. The C compiler would encode extra '\0' characters as 0x00 so it wouldn't do any harm but isn't necessary.
There are a couple of modifications from the standard UTF-8 encoding. One is that, to allow C functions that expect a 0x00 terminator to "handle" modified UTF-8 strings, the NUL character (U+0000) is encoded in two bytes rather than the single 0x00 byte that standard UTF-8 would use. That allows modified UTF-8 strings to be laid into a buffer with a 0x00 terminator beyond the bytes of the original encoded string. The other modification (how supplementary characters are encoded) is a bit esoteric, but both modifications make a modified UTF-8 string incompatible with a strictly compliant UTF-8 function.
You didn't ask, but there is another use of 0x00-terminated, modified UTF-8 strings in JNI: the GetStringUTFChars and NewStringUTF functions. (The JNI documentation doesn't actually say that GetStringUTFChars returns a 0x00-terminated string, but there are no known JVM implementations that don't. Check your JVM implementor's documentation or source code.) These functions are designed on the same "interoperability" basis. However, the use cases are different, making them dangerous. They are generally used to pass Java strings between C functions. The C functions, generally, would have no idea what modified UTF-8 is, or possibly not even what UTF-8 or Unicode are.
It is much more direct to use the Java String and Charset classes to convert to and from the character sets and encodings that the C functions are designed for. Often, it is a system setting, user setting, application setting or thread setting that determines which encoding a C function is using. The Java String class attempts to conform to such settings when not given a specific encoding for a conversion. But in many cases the desired encoding is fixed and can be specified with clear intent.
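A hedged sketch of that Java-side approach (the native method name and its byte[]-based signature are made up for illustration):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NativeBridge {
    // Hypothetical native method that takes raw bytes plus the charset name,
    // so the C side never has to interpret modified UTF-8.
    private static native void consume(byte[] data, String charsetName);

    static void send(String s) {
        Charset cs = StandardCharsets.UTF_8; // or whatever encoding the C code expects
        consume(s.getBytes(cs), cs.name());
    }
}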
No, you don't encode the terminating zero, it is not part of the class name.
No, according to the first reference I found, it means it should be encoded like this:
FindClass("java/lang/String\xc0\x80");
                           ^^^^^^^^
                           |
                           This is not the shortest way to encode
                           the code point U+0000, which is why it's
                           "modified" UTF-8.
Note that this assumes that you're really looking for a class whose name ends in U+0000, which is rather unlikely. The C string should be terminated just like normal, with a single 0 byte, as you get from just:
FindClass("java/lang/String");
The special 2-byte encoding of U+0000 provided by Modified UTF-8 only matters if you want to put U+0000 in a string, and still be able to differentiate it from the C terminator.

Java safeguards for when UTF-16 doesn't cut it

My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.
So I ask:
Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like to have some way of making sure that this garbage-producing character gets caught somehow before making it into the file that I am writing. How do I safeguard against this?
Thanks in advance.
Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are sequences of UTF-16 code units. By default, char values are considered to be big-endian (as any Java primitive type is). You should, however, not use char values to write strings to files or memory; you should use the character encoding/decoding facilities in the Java API (see below).
UTF-16 is not a major superset of encodings. Actually, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode does define almost any character that you possibly want to use in modern communication.
If you read a file from disk and assume UTF-16, you will quickly run into trouble. Most text files use ASCII or an extension of ASCII that uses all 8 bits of a byte. Examples of these extensions are UTF-8 (which can be used to read any ASCII text) and ISO 8859-1 (Latin-1). Then there are a lot of encodings, e.g. used by Windows, that are extensions of those extensions. UTF-16 is not compatible with ASCII, so it should not be used as the default for most applications.
So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII. You choose UTF-8 (which is a pretty safe bet). There is no way of knowing whether the program that opens the file will correctly guess that it should use UTF-8. It may try to use Latin-1 or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately there are no smart tricks to make sure this never happens.
Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
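A hedged sketch of using them strictly, so that malformed input and unencodable characters raise an exception instead of silently becoming replacement characters:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictCoding {
    public static void main(String[] args) {
        // Reading: reject bytes that are not valid UTF-8.
        byte[] input = {(byte) 0xC3, (byte) 0x28}; // malformed UTF-8 sequence
        try {
            CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            System.out.println(chars);
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8: " + e);
        }

        // Writing: reject characters the target charset cannot represent.
        try {
            StandardCharsets.ISO_8859_1.newEncoder()
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap("\u03c0")); // "π" has no Latin-1 representation
        } catch (CharacterCodingException e) {
            System.out.println("Cannot be written as ISO-8859-1: " + e);
        }
    }
}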
Whenever a conversion between bytes and characters takes place, Java allows you to specify the character encoding to be used. If it is not specified, a machine-dependent default encoding is used. In some encodings the bit pattern representing a certain character has no similarity with the bit pattern used for the same character in the UTF-16 encoding.
To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.
It depends on the used encoding which characters are representable.
