Why does a character in Java take twice as much space to store as a character in C?
In Java, characters are 16-bit; in C, they are 8-bit.
A more general question is why is this so?
To find out why you need to look at history and come to conclusions/opinions on the subject.
When C was developed in the USA, ASCII was pretty standard there, and you only really needed 7 bits, but with 8 you could handle some non-ASCII characters as well. It might seem more than enough. Many text-based protocols like SMTP (email), XML and FIX still only use ASCII characters; email and XML encode non-ASCII characters. Binary files, sockets and streams are still natively 8-bit bytes.
BTW: C can support wider characters, but that is not plain char
When Java was developed, 16 bits seemed like enough to support most languages. Since then Unicode has been extended to characters above 65535, and Java has had to add support for code points, which are encoded in UTF-16 as one or two 16-bit chars.
So making a byte a byte and char an unsigned 16-bit value made sense at the time.
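To see what "one or two 16-bit chars" means in practice, here is a minimal sketch (the emoji is just an arbitrary supplementary character, written via its surrogate pair):
String s = "A\uD83D\uDE00";                            // "A" followed by U+1F600
System.out.println(s.length());                        // 3 chars (16-bit code units)
System.out.println(s.codePointCount(0, s.length()));   // 2 code points
System.out.println(Character.charCount(0x1F600));      // 2: this code point needs two chars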
BTW: If your JVM supports -XX:+UseCompressedStrings it can use bytes instead of chars for Strings which only use 8-bit characters.
Because Java uses Unicode, while C generally uses ASCII by default.
There are various flavours of Unicode encoding, but Java uses UTF-16, which uses either one or two 16-bit code units per character. ASCII always uses one byte per character.
http://java.about.com/od/programmingconcepts/a/unicode.htm
http://www.joelonsoftware.com/articles/Unicode.html
http://en.wikipedia.org/wiki/UTF-16
The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.
java.lang.Character
java.lang.String
Java is a modern language that came up around the early Unicode era (at the beginning of the 90s), so it supports Unicode by default as a first-class citizen, like many other contemporary languages (Python, Visual Basic, JavaScript...), OSes (Windows, Symbian, BREW...) and frameworks/interfaces/specifications (Qt, NTFS, Joliet...). By the time those were designed, Unicode was a fixed-width 16-bit charset encoded in UCS-2, so it made sense for them to use 16-bit values for characters.
In contrast, C is an "ancient" language that was invented decades before Java, when Unicode was far from a thing. That was the age of 7-bit ASCII and 8-bit EBCDIC, so C uses an 8-bit char1, as that's enough for a char variable to contain all basic characters. When the Unicode era arrived, to avoid breaking old code they introduced a different character type, wchar_t, in C90. Again, this was the 90s, when Unicode began its life. In any case, char had to keep its old size, because you still need to access individual bytes even if you use wider characters (Java has a separate byte type for this purpose).
Of course, the Unicode Consortium quickly realized that 16 bits are not enough and had to fix it somehow. They widened the code-point range by changing UCS-2 to UTF-16 to avoid breaking old code that uses wide chars, making Unicode a 21-bit charset (actually up to U+10FFFF instead of U+1FFFFF, because of UTF-16). Unfortunately it was too late, and the old implementations that use a 16-bit char got stuck with it.
Later we saw the advent of UTF-8, which proved to be far superior to UTF-16 because it is independent of endianness, generally takes up less space, and, most importantly, requires no changes in the standard C string functions. Most user functions that receive a char* will continue to work without special Unicode support.
Unix systems were lucky because they migrated to Unicode later, after UTF-8 was introduced, and therefore continue to use 8-bit char. OTOH all modern Win32 APIs work on 16-bit wchar_t by default, because Windows was also an early adopter of Unicode. As a result, the .NET framework and C# go the same way, with char as a 16-bit type.
Talking about wchar_t, it was so unportable that both the C and C++ standards needed to introduce new character types, char16_t and char32_t, in their 2011 revisions:
Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined
https://en.wikipedia.org/wiki/Wide_character#Programming_specifics
That said, most implementations are working on improving the wide-string situation. Java experimented with compressed strings in Java 6 and introduced compact strings in Java 9. Python moved to a more flexible internal representation in 3.3, compared to the wchar_t*-based one used before. Firefox and Chrome have separate internal 8-bit char representations for simple strings. There are also discussions about that for the .NET framework. And more recently, Windows has been gradually introducing UTF-8 support for the old ANSI APIs.
1 Strictly speaking char in C is only required to have at least 8 bits. See What platforms have something other than 8-bit char?
A Java char is a UTF-16 code unit (which, for characters in the BMP, is the same as the Unicode code point), while C uses ASCII encoding in most cases.
Related
I have been reading about how Unicode code points have evolved over time, including this article by Joel Spolsky, which says:
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
But despite all this reading, I couldn't find the real reason that Java uses UTF-16 for a char.
Isn't UTF-8 far more efficient than UTF-16? For example, if I had a string which contains 1024 ASCII-range characters, UTF-16 would take 1024 * 2 bytes (2 KB) of memory.
But if Java used UTF-8, it would be just 1 KB of data. Even if the string has a few characters which need 2 or 3 bytes, it will still only take about a kilobyte. For example, suppose in addition to the 1024 characters there were 10 characters of "字" (code point U+5B57, UTF-8 encoding e5 ad 97). In UTF-8, this would still take only (1024 * 1 byte) + (10 * 3 bytes) = 1 KB + 30 bytes.
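(As a quick sanity check of that arithmetic in Java; String.repeat is Java 11+, and the exact strings are only illustrative:)
import java.nio.charset.StandardCharsets;

String s = "a".repeat(1024) + "字".repeat(10);
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);     // 1054 = 1024 + 10 * 3
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 2068 = (1024 + 10) * 2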
So memory efficiency doesn't explain it: 1 KB + 30 bytes for UTF-8 is clearly less memory than 2 KB for UTF-16.
Of course it makes sense that Java doesn't use ASCII for a char, but why does it not use UTF-8, which has a clean mechanism for handling arbitrary multi-byte characters when they come up? UTF-16 looks like a waste of memory in any string which has lots of characters that would fit in a single byte.
Is there some good reason for UTF-16 that I'm missing?
Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:
Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.
This, and the birth of UTF-16, is further explained by the Unicode FAQ page:
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. This then left UTF-16 as the easiest natural progression beyond that.
Historically, one reason was the performance characteristics of random access or iterating over the characters of a String:
UTF-8 encoding uses a variable number (1-4) of bytes to encode a Unicode character. Therefore, accessing a character by index, as String.charAt(i) does, would be far more complicated to implement and slower than the array access used by java.lang.String.
Even today, Python uses a fixed-width format for Strings internally, storing either 1, 2, or 4 bytes per character depending on the maximum size of a character in that string.
Of course, this is no longer a pure benefit in Java, since, as nj_ explains, Java no longer uses a fixed-width format. But at the time the language was developed, Unicode was a fixed-width format (now called UCS-2), and this would have been an advantage.
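A minimal sketch of that difference, using only the standard String API (charAt indexes 16-bit code units directly, while code-point iteration has to decode as it goes):
String s = "a\uD834\uDD1Eb";                              // 'a', U+1D11E (a musical symbol), 'b'
System.out.println(s.length());                           // 4 code units
System.out.println(Integer.toHexString(s.charAt(1)));     // d834: half of a surrogate pair
s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp))); // 61, 1d11e, 62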
I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method.
Now, in my Java code, I take that response and do some processing on it, and later pass it on to a different service.
Now, I googled a bit and found out that by default the encoding in Java for strings is UTF-16.
In my response XML, one of the elements had the character É. This got mangled in the post-processing request that I make to the other service.
Instead of sending É, it sent some gibberish. Now I wanted to know: will there really be a lot of difference between the two encodings? And if I wanted to know what É converts to between UTF-8 and UTF-16, how can I do that?
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to the US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
Main UTF-8 cons:
Many common characters have different lengths, which slows down indexing and calculating a string length terribly.
Main UTF-16 pros:
Most reasonable characters, like Latin, Cyrillic, Chinese, Japanese, can be represented with 2 bytes. Unless really exotic characters are needed, this means that the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
In general, UTF-16 is usually better for in-memory representation, while UTF-8 is extremely good for text files and network protocols.
There are two things:
the encoding in which you exchange data;
the internal string representation of Java.
You should not be preoccupied with the second point ;) The thing is to use the appropriate methods to convert from your data (byte arrays) to Strings (char arrays, ultimately), and to convert from Strings back to your data.
The most basic classes you can think of are CharsetDecoder and CharsetEncoder, but there are plenty of others. String.getBytes(charset), and all Readers and Writers, are other possible ways. And there are the static methods of Character as well.
If you see gibberish at some point, it means you failed to decode or encode from the original byte data to Java strings. But again, the fact that Java strings use UTF-16 is not relevant here.
In particular, you should be aware that when you create a Reader or Writer, you should specify the encoding; if you fail to do so, the default JVM encoding will be used, and it may, or may not, be UTF-8.
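As a hedged sketch of that advice (the byte values are simply the UTF-8 and UTF-16BE forms of É, U+00C9): decode with the charset the sender used, and encode with the charset the receiver expects.
import java.nio.charset.StandardCharsets;

byte[] fromService = { (byte) 0xC3, (byte) 0x89 };                 // "É" encoded as UTF-8
String text = new String(fromService, StandardCharsets.UTF_8);     // decode with the sender's charset
byte[] toOtherService = text.getBytes(StandardCharsets.UTF_16BE);  // 0x00 0xC9
// The same applies to streams: new InputStreamReader(in, StandardCharsets.UTF_8)
// and new OutputStreamWriter(out, StandardCharsets.UTF_8) rather than the no-charset overloads.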
This website provides UTF-to-UTF conversion:
http://www.fileformat.info/convert/text/utf2utf.htm
UTF-32 is arguably the most human-readable of the Unicode encoding forms, because its big-endian hexadecimal representation is simply the Unicode scalar value without the "U+" prefix, zero-padded to eight digits. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.
HOWEVER
UTF-32 is the same as the old UCS-4 encoding and remains fixed-width. Why can this remain fixed-width? Because UTF-16 is now the format that can encode the fewest characters, it sets the limit for all formats: it was defined that 1,112,064 is the total number of code points that will ever be defined by either Unicode or ISO 10646. Since Unicode is now only defined from 0 to 10FFFF, UTF-32 sounds like a bit of a pointless encoding now: it is 32 bits wide, but only about 21 bits are ever used, which makes it very wasteful.
UTF-8: Generally speaking, you should use UTF-8. Most HTML documents use this encoding.
It uses at least 8 bits of data to store each character. This can lead to more efficient storage, especially when the text contains mostly English ASCII characters. But higher-order characters, such as non-ASCII characters, may require up to 4 bytes (32 bits) each!
UTF-16:
This encoding uses at least 16 bits to encode characters, including lower-order ASCII characters and higher-order non-ASCII characters.
If you are encoding text consisting of mostly non-English or non-ASCII characters, UTF-16 may result in smaller file size. But if you use UTF-16 to encode mostly ASCII text, it will use up more space.
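A small sketch of those trade-offs (UTF-32 is included for comparison; its charset ships with common JDKs but is not among the charsets every Java implementation is required to support, so treat that last line as an assumption):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String english = "hello";        // ASCII only
String japanese = "こんにちは";   // five BMP characters outside ASCII
System.out.println(english.getBytes(StandardCharsets.UTF_8).length);       // 5
System.out.println(english.getBytes(StandardCharsets.UTF_16BE).length);    // 10
System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);      // 15 (3 bytes each)
System.out.println(japanese.getBytes(StandardCharsets.UTF_16BE).length);   // 10 (2 bytes each)
System.out.println(english.getBytes(Charset.forName("UTF-32BE")).length);  // 20 (4 bytes each)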
I searched for Java's internal representation of String, but I've found two sources which look reliable but inconsistent.
One is:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451
and it says:
Java uses UTF-16 for the internal text representation and supports a non-standard modification of UTF-8 for string serialization.
The other is:
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
and it says:
Tcl also uses the same modified UTF-8[25] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.
Modified UTF-8? Or UTF-16? Which one is correct? And how many bytes does Java use for a char in memory?
Please let me know which one is correct and how many bytes it uses.
Java uses UTF-16 for the internal text representation
The representation for String and StringBuilder etc in Java is UTF-16
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/overview.html
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.
At the JVM level, if you are using -XX:+UseCompressedStrings (which was the default in some updates of Java 6), the actual in-memory representation can be 8-bit ISO-8859-1, but only for strings which do not need UTF-16 encoding.
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
and supports a non-standard modification of UTF-8 for string serialization.
Serialized Strings use modified UTF-8 by default.
And how many bytes does Java use for a char in memory?
A char is always two bytes, if you ignore the need for padding in an Object.
Note: a code point (which allows characters > 65535) can use one or two chars, i.e. 2 or 4 bytes.
You can confirm the following by looking at the source code of the relevant version of the java.lang.String class in OpenJDK. (For some really old versions of Java, String was partly implemented in native code. That source code is not publicly available.)
Prior to Java 9, the standard in-memory representation for a Java String is UTF-16 code-units held in a char[].
With Java 6 update 21 and later, there was a non-standard option (-XX:+UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7.
For Java 9 and later, the implementation of String has been changed to use a compact representation by default. The java command documentation now says this:
-XX:-CompactStrings
Disables the Compact Strings feature. By default, this option is enabled. When this option is enabled, Java Strings containing only single-byte characters are internally represented and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding. This reduces, by 50%, the amount of space required for Strings containing only single-byte characters. For Java Strings containing at least one multibyte character: these are represented and stored as 2 bytes per character using UTF-16 encoding. Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings.
Note that neither classical, "compressed", nor "compact" strings ever used UTF-8 encoding as the String representation. Modified UTF-8 is used in other contexts; e.g. in class files, and in the object serialization format.
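A hedged illustration of that last point, using DataOutput's writeUTF (which its Javadoc describes as modified UTF-8): the NUL character comes out as the two bytes C0 80 rather than a single 00 byte. (This snippet belongs in a method that is allowed to throw IOException.)
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

ByteArrayOutputStream bos = new ByteArrayOutputStream();
new DataOutputStream(bos).writeUTF("\u0000");
for (byte b : bos.toByteArray()) {
    System.out.printf("%02X ", b);   // prints: 00 02 C0 80  (2-byte length, then modified UTF-8)
}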
See also:
Java Platform, Standard Edition What’s New in Oracle JDK 9
JEP 254: Compact Strings
Difference between compact strings and compressed strings in Java 9
To answer your specific questions:
Modified UTF-8? Or UTF-16? Which one is correct?
Either UTF-16 or an adaptive representation that depends on the actual data; see above.
And how many bytes does Java use for a char in memory?
A single char uses 2 bytes. There might be some "wastage" due to possible padding, depending on the context.
A char[] is 2 bytes per character plus the object header (typically 12 bytes including the array length) padded to (typically) a multiple of 8 bytes.
Please let me know which one is correct and how many bytes it uses.
If we are talking about a String now, it is not possible to give a general answer. It will depend on the Java version and hardware platform, as well as the String length and (in some cases) what the characters are. Indeed, for some versions of Java it even depends on how you created the String.
Having said all of the above, the API model for String is that it is both a sequence of UTF-16 code-units and a sequence of Unicode code-points. As a Java programmer, you should be able to ignore everything that happens "under the hood". The internal String representation is (should be!) irrelevant.
UTF-16.
From http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp :
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.
The size of a char is 2 bytes.
Therefore, I would say that Java uses UTF-16 for internal String representation.
Java stores strings internally as UTF-16 and uses 2 bytes for each character.
Java follows the Unicode character set, which contains the characters of many international languages; in its original 16-bit form it defines 65,536 code points. And since Java uses UTF-16, the size of a char in Java is 2 bytes.
Is the Java char type guaranteed to be stored in any particular encoding?
Edit: I phrased this question incorrectly. What I meant to ask is are char literals guaranteed to use any particular encoding?
"Stored" where? All Strings in Java are represented in UTF-16. When written to a file, sent across a network, or whatever else, it's sent using whatever character encoding you specify.
Edit: Specifically for the char type, see the Character docs. Specifically: "The char data type ... are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities." Therefore, casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it obviously won't necessarily be a valid UTF-16 character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, and if you're operating at this level, it might be important to get familiar with those semantics.
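A short sketch of that cast (the printed numbers are simply the UTF-16 code-unit values of these characters):
char a = 'A';
char e = '\u00C9';   // É
char hi = '\uD83D';  // an unpaired high surrogate: legal to store, but not a character on its own
System.out.println((int) a);   // 65  (0x0041)
System.out.println((int) e);   // 201 (0x00C9)
System.out.println((int) hi);  // 55357 (0xD83D)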
A Java char is conventionally used to hold a Unicode code unit; i.e. a 16 bit unit that is part of a valid UTF-16 sequence. However, there is nothing to prevent an application from putting any 16 bit unsigned value into a char, irrespective of what it actually means.
So you could say that a Unicode code unit can be represented by a char and a char can represent a Unicode code unit ... but neither of these is necessarily true, in the general case.
Your question about how a Java char is stored cannot be answered. Simply said, it depends on what you mean by "stored":
If you mean "represented in an executing program", then the answer is JVM implementation specific. (The char data type is typically represented as a 16 bit machine integer, though it may or may not be machine word aligned, depending on the specific context.)
If you mean "stored in a file" or something like that, then the answer is entirely dependent on how the application chooses to store it.
Is the Java char type guaranteed to be stored in any particular encoding?
In the light of what I said above the answer is "No". In an executing application, it is up to the application to decide what a char means / contains. When a char is stored to a file, the application decides how it wants to store it and what on-disk representation it will use.
FOLLOWUP
What about char literals? For example, 'c' must have some value that is defined by the language.
Java source code is required (by the language spec) to be Unicode text, represented in some character encoding that the tool chain understands; see the javac -encoding option. In theory, a character encoding could map the c in 'c' in your source code to something unexpected.
In practice though, the c will map to the Unicode lower-case C code-point (U+0063) and will be represented as the 16-bit unsigned value 0x0063.
To the extent that char literals have a meaning ascribed by the Java language, they represent (and are represented as) UTF-16 code units. Note that they may or may not be assigned Unicode code points ("characters"). Some Unicode code points in the range U+0000 to U+FFFF are unassigned.
Originally, Java used UCS-2 internally; now it uses UTF-16. The two are virtually identical, except for D800 - DFFF, which are used in UTF-16 as part of the extended representation for larger characters.
Difference between UTF-8 and UTF-16?
Why do we need these?
MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";
md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issues.
Main UTF-8 cons:
Many common characters have different lengths, which slows down indexing by code point and calculating a code-point count terribly.
Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.
Main UTF-16 pros:
BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory) and most Japanese, can be represented with 2 bytes. This speeds up indexing and calculating code-point counts in case the text does not contain supplementary characters.
Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of 16-bit char as the primitive component of the string.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
It's variable-length, so counting or indexing code points is costly, though less so than with UTF-8.
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP, and up to 4 for characters in the current Unicode range of U+0000 to U+10FFFF (the original design could be extended up to U+7FFFFFFF if that ever became necessary)... but notably all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that all Java characters are UTF-16 code units within the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
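To make the "same option" point concrete, here is a hedged sketch showing that the two encodings feed different bytes into the digest, so the resulting hashes differ (getInstance may throw NoSuchAlgorithmException):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";
byte[] d8  = md.digest(text.getBytes(StandardCharsets.UTF_8));
byte[] d16 = md.digest(text.getBytes(StandardCharsets.UTF_16));   // UTF-16 output starts with a BOM
System.out.println(Base64.getEncoder().encodeToString(d8));
System.out.println(Base64.getEncoder().encodeToString(d16));      // different from the UTF-8 digest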
Security: Use only UTF-8
Difference between UTF-8 and UTF-16? Why do we need these?
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
CVE-2008-2938
CVE-2012-2135
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.
This is unrelated to UTF-8 vs UTF-16 in general (although the code does convert to UTF-16, and the BE/LE part can be set with a single line), but below is the fastest way to convert a String to byte[]. For instance, it works well for exactly the case provided (hashing). String.getBytes(enc) is relatively slow.
static byte[] toBytes(String s) {
    // Copies the raw UTF-16 code units into a big-endian byte array,
    // 2 bytes per char (requires java.nio.ByteBuffer).
    byte[] b = new byte[s.length() * 2];
    ByteBuffer.wrap(b).asCharBuffer().put(s);
    return b;
}
A simple way to differentiate UTF-8 and UTF-16 is to identify what they have in common and how they differ.
Other than sharing the same Unicode number for a given character, each is its own format.
UTF-8 tries to represent each Unicode number with one byte (if it is ASCII), otherwise with 2, 3 or 4 bytes.
UTF-16 tries to represent each Unicode number with two bytes to start with. If two bytes are not sufficient, it uses 4 bytes; UTF-16 never needs more than 4 bytes per character.
Theoretically, UTF-16 is more space-efficient, but in practice UTF-8 is more space-efficient, as most of the characters being processed (around 98% of data) are ASCII, which UTF-8 represents with a single byte and UTF-16 represents with 2 bytes.
Also, UTF-8 is a superset of the ASCII encoding, so every app that expects ASCII data will also accept UTF-8 data. This is not true for UTF-16: UTF-16 is not ASCII-compatible, and this is a big hurdle for UTF-16 adoption.
Another point to note is that all of Unicode as of now fits in at most 4 bytes of UTF-8 (considering all languages of the world). This is the same as UTF-16, so there is no real saving in space compared to UTF-8 ( https://stackoverflow.com/a/8505038/3343801 ).
So, people use UTF-8 wherever possible.