In documentation of JNI function FindClass I can read about argument name:
name: a fully-qualified class name (...) The string is encoded in modified UTF-8.
According to documentation modified UTF-8 has to end with double '\0' chars:
the null character (char)0 is encoded using the two-byte format rather than the one-byte format
Does it mean that I should invoke FindClass from C in this way:
FindClass("java/lang/String\0")
i.e. with double '\0' at the end?
Character set, encoding and termination are three different things. Obviously, an encoding is designed for a specific character set but a character set can be encoded in multiple ways. And, often, a terminator (if used) is an encoded character, but with modified UTF-8, this is not the case.
Java uses the Unicode character set. For string and char types, it uses the UTF-16 encoding. The string type is counted; It doesn't use a terminator.
In C, terminated strings are common, as well as single-byte encodings of various character sets. C and C++ compilers terminate literal strings with the NUL character. In the destination character set encoding of the compiler, this is either one or two 0x00 bytes. Almost all common character sets and their encodings have the same byte representation for the non-control ASCII characters. This is true of the UTF-8 encoding of the Unicode character set. (But, note that is not true for characters outside of the limited subset.)
The JNI designers opted to use this limited "interoperability" between C strings. Many JNI functions accept 0x00-terminated modified UTF-8 strings. These are compatible what a C compiler would produce from a literal string in the source code, again provided that the characters are limited to non-control ASCII characters. This covers the use case of writing Java package & class, method and field strings in JNI. (Well, almost: Java allows any Unicode currency symbol in an identifier.)
So, you can pass C string literals to JNI functions in a WYSIWYG style. No need to add a terminator—the compiler does that. The C compiler would encode extra '\0' characters as 0x00 so it wouldn't do any harm but isn't necessary.
There are a couple modifications from the standard UTF-8 encoding. One is to allow C functions that expect a 0x00 terminator to "handle" modified UTF-8 strings, the NUL character (U+00000) is encoded to avoid 0x00, which would be the standard. That allows modified UTF-8 strings to be laid into a buffer with a 0x00 terminator beyond the bytes of the original encoded string. The other modification is a bit esoteric but both modifications make a modified UTF-8 string incompatible with a strictly compliant UTF-8 function.
You didn't ask, but there is another use of 0x00 terminated, modified UTF-8 strings in JNI. It is with the GetStringUTFChars and NewStringUTF functions. (The JNI documentation doesn't actually say that GetStringUTFChars returns a 0x00 terminated string but there are no known JVM implementations that don't. Check your JVM implementor's documentation or source code.) These functions are designed on the same "interoperability" basis. However, the use cases are different, making them dangerous. They are generally used to pass Java strings between C functions. The C functions, generally, would have no idea what modified UTF-8 is, or possibly not even what UTF-8 or Unicode are. It is much more direct to use the Java String and Charset classes to convert to and from character sets and encodings that the C functions are designed for. Often, it is a system setting, user setting, application setting or thread setting that determines which a C function is using. The Java String class attempts to conform to such settings when not given a specific encoding for a conversion. But, it many cases, the desired encoding is fixed and can be specified with clear intent.
No, you don't encode the terminating zero, it is not part of the class name.
No, according to the first reference I found, it means it should be encoded like this:
FindChar("java/lang/String\xc0\x80");
^
|
|
This is not the shortest
way to encode the codepoint
U+0000, which is why it's
"modified" UTF-8.
Note that this assumes that you're really looking for class names whose names end in U+0000, which is rather unlikely. The C string should be terminated just like normal, with a single 0-byte as you get from just:
FindChar("java/lang/String");
The special 2-byte encoding of U+0000 provided by Modified UTF-8 only matters if you want to put U+0000 in a string, and still be able to differentiate it from the C terminator.
Related
//non-utf source file encoding
char ch = 'ё'; // some number within 0..65535 is stored in char.
System.out.println(ch); // the same number output to
"java internal encoding is UTF16". Where does it meanfully come to play in that?
Besides, I can perfectly put into char one utf16 codeunit from surrogate range (say '\uD800') - making this char perfectly invalid Unicode. And let us stay within BMP, so to avoid thinking that we might have 2 chars (codeunits) for a supplementary symbol (thinking this way sounds to me that "char internally uses utf16" is complete nonsense). But maybe "char internally uses utf16" makes sense within BMP?
I could undersand it if were like this: my source code file is in windows-1251 encoding, char literal is converted to number according to windows-1251 encoding (what really happens), then this number is automatically converted to another number (from windows-1251 number to utf-16 number) - which is NOT taking place (am I right?! this I could understand as "internally uses UTF-16"). And then that stored number is written to (really it is written as given, as from win-1251, no my "imaginary conversion from internal utf16 to output\console encoding" taking place), console shows it converting from number to glyph using console encoding (what really happens)
So this "UTF16 encoding used internally" is NEVER USED ANYHOW ??? char just stores any number (in [0..65535]), and besides specific range and being "unsigned" has NO DIFFERENCE FROM int (in scope of my example of course)???
P.S. Experimentally, code above with UTF-8 encoding of source file and console outputs
й
1081
with win-1251 encoding of source file and UTF-8 in console outputs
�
65533
Same output if we use String instead of char...
String s = "й";
System.out.println(s);
In API, all methods taking char as argument usually never take encoding as argument. But methods taking byte[] as argument often take encoding as another argument. Implying that with char we don't need encoding (meaning that we know this encoding for sure). But **how on earth do we know in what encoding something was put into char???
If char is just a storage for a number, we do need to understand what encoding this number originally came from?**
So char vs byte is just that char has two bytes of something with UNKNOWN encoding (instead of one byte of UNKNOWN encoding for a byte).
Given some initialized char variable, we don't know what encoding to use to correctly display it (to choose correct console encoding for output), we cannot tell what was encoding of source file where it was initialized with char literal (not counting cases where various encodings and utf would be compatilble).
Am I right, or am I a big idiot? Sorry for asking in latter case :)))
SO research shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object
correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with it's numeric value.
Surrogate pairs are only meaningful in a character sequence. A single char can not hold a character value outside the BMP. This is where the character abstraction breaks down.
Unicode is system of expressing textual data as codepoints. These are typically characters, but not always. A Unicode codepoint is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a codeunit. (For example UTF-8 is encoded as 8-bit bytes, and UTF-16 is encoded as 16-bit words.)
While the first version of Unicode only allowed code points in the range 0hex ... FFFFhex, in Unicode 2.0, they changed the range to 0hex to 10FFFFhex.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less or equal to FFFFhex. For larger codepoints, the UTF-16 representation consists of 2 16-bit values; a so-called surrogate pair. And that will fit into 2 Java chars. So in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (0hex through FFFFhex). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. A common case is the code points for Emojis.)
If you look at the javadoc for the String class, you will see a bunch of methods line codePointAt, codePointCount and so on. These allow you to handle text data properly ... that is to deal with the surrogate pair cases.
So how does this relate to UTF-8, windows-1251 and so on?
Well these are 8-bit character encodings that are used at the OS level in text files and so on. When you read a file using a Java Reader your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer) you transcode in the other direction.
This doesn't always work.
Many character encodings such as windows-1251 are not capable of representing the full range of Unicode codepoints. So, if you attempt to write (say) a CJK character via a Writer configured a windows-1251, you will get ? characters instead.
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa) then the trancoding is liable to give garbage. This phenomenon is so common it has a name: Mojibake).
You asked:
Does that mean that in char ch = 'й'; literal 'й' is always converted to utf16 from whatever encoding source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
I am creating a set of tests for the size of a String to do so I am using something like this myString.getBytes("UTF-8").length > MAX_SIZE for which java has a checked exception UnsupportedEncodingException.
Just for curiosity, and to further consider other possible test scenarios, is there a text that cannot be represented by UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) specifies that indeed UTF-8/Unicode includes ALL the characters which are possible. I know that its size is 2^32 and many of them are still empty, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an
algorithmic mapping from every Unicode code point (except surrogate
code points) to a unique byte sequence. The ISO/IEC 10646 standard
uses the term “UCS transformation format” for UTF; the two terms are
merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round
tripping: mapping from any Unicode coded character sequence S to a
sequence of bytes and back will produce S again. To ensure round
tripping, a UTF mapping must map all code points (except surrogate
code points) to unique byte sequences. This includes reserved
(unassigned) code points and the 66 noncharacters (including U+FFFE
and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyways).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, the specified range of characters covers the whole assigned Unicode range (excluding the surrogate character range of course).
is there a text that cannot be represented by UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that does have some differences/limitations from standard UTF-8.
I need to create a hash from a String containing users password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with specified encoding, (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly) and therefore on such platform, I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?
Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(The BOM isn't relevant when the caller doesn't ask for it, so String.getBytes(...) won't include it.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
It is true, java uses Unicode internally so it may combine any script/language. String and char use UTF-16BE but .class files store there String constants in UTF-8. In general it is irrelevant what String does, as there is a conversion to bytes specifying the encoding the bytes have to be in.
If this encoding of the bytes cannot represent some of the Unicode characters, a placeholder character or question mark is given. Also fonts might not have all Unicode characters, 35 MB for a full Unicode font is a normal size. You might then see a square with 2x2 hex codes or so for missing code points. Or on Linux another font might substitute the char.
Hence UTF-8 is a perfect fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
Both UTF-16 (in both byte orders) and UTF-8 always are present in the JRE, whereas some Charsets are not. Hence you can use a constant from StandardCharsets not needing to handle any UnsupportedEncodingException.
Above I added a BOM for Windows Notepad esoecially, to recognize UTF-8. It certainly is not good practice. But as a small help here.
There is no disadvantage to UTF16-LE or UTF-16BE. I think UTF-8 is a bit more universally used, as UTF-16 also cannot store all Unicode code points in 16 bits. Text is Asian scripts would be more compressed, but already HTML pages are more compact in UTF-8 because of the HTML tags and other latin script.
For Windows UTF-16LE might be more native.
Problem with placeholders for non-Unicode platforms, especially Windows, might happen.
I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer negatively your question.
As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.
As an introduction, I do Java and have done quite a bit of C in the past.
In Java, a String literal can contain any set of graphemes as long as you can input them in your editing environment; said editing environment will then save your source file in whatever character encoding is used at the time.
At runtime, and as long as the compiler supports the encoding, the byte code represents all String literals as a set of chars, where a char represents one UTF-16 code unit. (Unicode code points outside the BMP therefore require two chars; you can obtain an array of chars necessary to represent a Unicode code point outside the BMP using Character.toChars()).
You have classes for a character encoding (Charset), the process of encoding a sequence of chars to a sequence of bytes (CharsetEncoder) and also the reverse (CharsetDecoder). Therefore, whatever the character encoding used by your source/destination, whether it be a file, a socket or whatever, you can encode/decode as appropriate.
Now, let us suppose C++11. It introduces std::u32string, std::u16string; those are "aliases", as far as I understand, to std::basic_string<char32_t> and std::basic_string<char16_t>, and the net effect of them is that at runtime, the string constants you declare (using u"" and U"") are made of 16bit or 32bit entities representing a UTF-16 or UTF-32 code unit respectively. There is also u8"" (what is the basic_string type for the latter if any, since it has no fixed length?).
Other important point: UTF-16 has two variants, LE and BE; java does BE since at the bytecode level, everything is BE. Does char{16,32}_t depend on endianness in your code?
But even after hours of searching, I cannot find an answer: can C++11, as standard, do what the standard JDK does, that is convert any string constant into a suitable byte sequence and the reverse, given a character coding? I suspect this is made more difficult since there are basically three representations of a string literal at runtime, without even going to char * which is basically a byte array...
(edit: added links to the relevant javadoc)
You can convert through using a codecvt locale facet.
The usage is somewhat unintuitive, but this is what I did:
/** Convert utf8 stream to UCS-4 stream */
u32string decode(string utf8)
{
std::wstring_convert<std::codecvt_utf8<char32_t>,char32_t> convert;
return convert.from_bytes(utf8);
}
/** Convert UCS-4 stream to utf8 stream */
string encode(u32string ucs4)
{
std::wstring_convert<std::codecvt_utf8<char32_t>,char32_t> convert;
return convert.to_bytes(ucs4);
}
It requires a decent compiler though, for me only clang worked correctly, gcc compiled but generated invalid results (newer versions of gcc may be ok).
C++ does not specify a source file encoding. In fact, it supports EBCDIC. All C++11 compilers support UTF-8, and many support other encodings by passing appropriate flags.
The standard specifies an escape code syntax for characters outside the basic source character set, which essentially comprises the characters used by the language. Characters outside the basic source character set are called "extended characters" and they are replaced by the corresponding code before the source is compiled, or even preprocessed. This ensures that the meaning of source code is independent of its encoding.
char32_t and char16_t do not have endianness built in. They are simply equivalent to uint32_t and uint16_t. You could say that they inherit the native endianness, but directly serializing object representations as bytes is an abuse.
To reliably specify UTF-8 literals, and override any compiler settings to the contrary, use u8"" which is ready for serialization. u"" and U"" do not have endianness because the values are already baked into the program.
To serialize, you can use the codecvt_utf8 and codecvt_utf16 class templates, which take compile-time template flags specifying the file format:
enum codecvt_mode {
consume_header = 4,
generate_header = 2,
little_endian = 1
};
To set a stream file (in binary mode) to encode char32_t strings into UTF-16LE with a byte-order mark, you would use
std::basic_ofstream< char32_t > file( path, std::ios::binary );
file.imbue( std::locale( file.locale(), new std::codecvt_utf16<
char32_t,
std::codecvt_mode::generate_header | std::codecvt_mode::little_endian
>{} ) );
This is preferable to translating before outputting.
#include <string>
#include <codecvt>
#include <locale>
template<typename Facet>
struct usable_facet : Facet {
using Facet::Facet;
~usable_facet() = default;
};
int main() {
using utf16_codecvt = usable_facet<std::codecvt<char16_t, char, std::mbstate_t>>;
using utf32_codecvt = usable_facet<std::codecvt<char32_t, char, std::mbstate_t>>;
std::wstring_convert<utf16_codecvt, char16_t> u16convert; // bidirectional UTF-16/UTF-8 conversion
std::wstring_convert<utf32_codecvt, char32_t> u32convert; // bidirectional UTF-32/UTF-8
std::string utf8 = u16convert.to_bytes(u"UTF-16 data");
std::u16string utf16 = u16convert.from_bytes(u8"UTF-8 data");
utf8 = u32convert.to_bytes(U"UTF-32 data");
std::u32string utf32 = u32convert.from_bytes(u8"UTF-8 data");
}
You can also use other facets, but be careful because they don't all do what they sound like or what it seems like they should. codecvt_utf8 won't convert to UTF-16 if you use char16_t, codecvt_utf16 uses UTF-16 as the narrow encoding, etc. The names make sense given their intended usage, but they're confusing with wstring_convert.
You can also use wstring_convert with whatever encodings are used by supported locales using codecvt_byname (However you can only convert between that locale's char encoding and its own wchar_t encoding, not between the locale narrow encoding and a fixed Unicode encoding. Locales specify their own wchar_t encoding and it's not necessarily a Unicode encoding or the same as the wchar_t encoding used by another locale.)
using locale_codecvt = usable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
std::wstring_convert<locale_codecvt, wchar_t> legacy_russian(new locale_codecvt("ru_RU")); // non-portable locale name
std::string legacy_russian_data = /* ... some source of legacy encoded data */
std::wstring w = legacy_russian.from_bytes(legacy_russian_data);
The only standard way to convert between arbitrary locale encoded text and any Unicode encoding is the poorly supported <cuchar> header with low level functions like c16rtomb and c32rtomb.
Is the Java char type guaranteed to be stored in any particular encoding?
Edit: I phrased this question incorrectly. What I meant to ask is are char literals guaranteed to use any particular encoding?
"Stored" where? All Strings in Java are represented in UTF-16. When written to a file, sent across a network, or whatever else, it's sent using whatever character encoding you specify.
Edit: Specifically for the char type, see the Character docs. Specifically: "The char data type ... are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities." Therefore, casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it obviously won't necessarily be a valid UTF-16 character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, and if you're operating at this level, it might be important to get familiar with those semantics.
A Java char is conventionally used to hold a Unicode code unit; i.e. a 16 bit unit that is part of a valid UTF-16 sequence. However, there is nothing to prevent an application from putting any 16 bit unsigned value into a char, irrespective of what it actually means.
So you could say that a Unicode code unit can be represented by a char and a char can represent a Unicode code unit ... but neither of these is necessarily true, in the general case.
Your question about how a Java char is stored cannot be answered. Simply said, it depends on what you mean by "stored":
If you mean "represented in an executing program", then the answer is JVM implementation specific. (The char data type is typically represented as a 16 bit machine integer, though it may or may not be machine word aligned, depending on the specific context.)
If you mean "stored in a file" or something like that, then the answer is entirely dependent on how the application chooses to store it.
Is the Java char type guaranteed to be stored in any particular encoding?
In the light of what I said above the answer is "No". In an executing application, it is up to the application to decide what a char means / contains. When a char is stored to a file, the application decides how it wants to store it and what on-disk representation it will use.
FOLLOWUP
What about char literals? For example, 'c' must have some value that is defined by the language.
Java source code is required (by the language spec) to be Unicode text, represented in some character encoding that the tool chain understands; see the javac -encoding option. In theory, a character encoding could map the c in 'c' in your source code to something unexpected.
In practice though, the c will map to the Unicode lower-case C code-point (U+0063) and will be represented as the 16-bit unsigned value 0x0063.
To the extent that char literals have a meaning ascribed by the Java language, they represent (and are represented as) UTF-16 code units. Note that they may or may not be assigned Unicode code points ("characters"). Some Unicode code points in the range U+0000 to U+FFFF are unassigned.
Originally, Java used UCS-2 internally; now it uses UTF-16. The two are virtually identical, except for D800 - DFFF, which are used in UTF-16 as part of the extended representation for larger characters.