I have a question about Charset.forName(String charsetName). Is there a list of charsetNames I can refer to? For example, for UTF-8, we use "utf8" for the charsetName. What about WINDOWS-1252, GB18030, etc.?
Every implementation of the Java platform is required to support these standard charsets:
US-ASCII: Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1: ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8: Eight-bit UCS Transformation Format
UTF-16BE: Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16: Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
Reference: http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
The charset names in Java are platform dependent; there are only six constants in the StandardCharsets class.
To view all the charsets, you should look at IANA. Check the Preferred MIME Name and Aliases columns.
To list all character sets installed in your JVM, you can use the following
code snippet (Java SE 8 or higher):
import java.nio.charset.Charset;
import java.util.SortedMap;
SortedMap<String, Charset> map = Charset.availableCharsets();
map.keySet().stream().forEach(System.out::println);
On my system, this lists around 170 character sets.
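If you just need to look up one of the names from the question (for example windows-1252 or GB18030), a small sketch like the following works; note that the availability of non-standard charsets is implementation dependent, although typical desktop JREs ship both of these:

import java.nio.charset.Charset;

public class CharsetLookup {
    public static void main(String[] args) {
        // Lookups are case-insensitive and accept aliases ("utf8" maps to "UTF-8").
        for (String name : new String[] {"utf8", "windows-1252", "GB18030"}) {
            if (Charset.isSupported(name)) {
                Charset cs = Charset.forName(name);
                System.out.println(name + " -> " + cs.name() + ", aliases: " + cs.aliases());
            } else {
                System.out.println(name + " is not supported on this JVM");
            }
        }
    }
}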
The Java Charset library is only required to accept a few basic encodings: ASCII, Latin-1 (ISO-8859-1), and a handful of UTF variants that you can see listed in this answer. That's a pretty useless list for any practical purpose, unless your scope is limited to Latin-1. In reality, Java classes can handle a large number of encodings, which you can read about in the Supported Encodings page. Quoting from it:
The java.io.InputStreamReader, java.io.OutputStreamWriter, java.lang.String classes, and classes in the java.nio.charset package can convert between Unicode and a number of other character encodings. The supported encodings vary between different implementations of Java SE 8. The class description for java.nio.charset.Charset lists the encodings that any implementation of Java SE 8 is required to support.
JDK 8 for all platforms (Solaris, Linux, and Microsoft Windows) and JRE 8 for Solaris and Linux support all encodings shown on this page. JRE 8 for Microsoft Windows may be installed as a complete international version or as a European languages version. [...]
The rest of the page consists of an extensive table of encoding names and synonyms, which is what the OP was after all those years ago...
Related
I understand that Java follows UTF-16 for the char data type. But there is a new update in Java 18: the default charset for the standard Java APIs is now UTF-8.
Does this update have any impact on the char data type's encoding format? I also understand that UTF-8 is a variable-width encoding that can accommodate characters up to 4 bytes. What is the size of the char data type after Java 18? Does it still adhere to UTF-16, or has it moved to UTF-8?
No, the change in default character encoding does not affect the internals of char/Character, nor does it affect the internals of String.
I suggest you read the official documentation, JEP 400: UTF-8 by Default. The motivation and the details are explained thoroughly.
The change made in Java 18 affects mainly input/output. So this includes the older APIs and classes for reading and writing files. Some of the newer APIs and classes were already defaulting to UTF-8. JEP 400 seeks to make this default consistent throughout the bundled libraries.
One particular issue called out in the JEP regards Java source code files that were saved with a non-UTF-8 encoding and compiled with an earlier JDK. Recompiling on JDK 18 or later may cause problems.
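If you want behavior that does not depend on the default charset at all, the usual remedy is to pass an explicit charset to the I/O APIs. A minimal sketch (the file name is just a placeholder; Files.writeString and readAllLines require Java 11 or later):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetIo {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("notes.txt"); // placeholder path

        // With an explicit charset, reading and writing behave the same
        // before and after the JEP 400 change to the default.
        Files.writeString(file, "héllo wörld\n", StandardCharsets.UTF_8);
        Files.readAllLines(file, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}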
By the way, let me remind you that char/Character has been legacy since Java 5, essentially broken since Java 2. As a 16-bit type, char is physically incapable of representing most characters.
To work with individual characters, use code point integer numbers. Look for codePoint methods added to classes including String, StringBuilder, Character, etc.
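For instance, a rough sketch of working with code points instead of chars:

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "G\uD83D\uDE00"; // "G" followed by U+1F600, a supplementary character

        System.out.println(s.length());                       // 3 char code units
        System.out.println(s.codePointCount(0, s.length()));  // 2 actual characters

        // Iterate over code points rather than chars
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));

        // Build a string from a code point
        System.out.println(new StringBuilder().appendCodePoint(0x1F600));
    }
}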
The application I am developing will be used by folks in Western & Eastern Europe as well as in the US. I am encoding my input and decoding my output with the UTF-8 character set.
My confusion arises because when I use the method String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.
Does this mean that if, in the US, my Java application creates an output text file using Cp1252 as my charset encoding and UTF-8 as my charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?
They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.
Actually, by Unicode terminology they're probably most accurately character encoding schemes:
A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Where a character encoding form is:
Mapping from a character set definition to the actual code units used to represent the data.
Yes, the fact that Unicode defines only seven character encoding schemes makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
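A small sketch of that mapping in both directions:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetMapping {
    public static void main(String[] args) {
        String text = "café";

        // Text -> binary: encode the String with a specific charset
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // 5 bytes, 'é' takes 2
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes, 'é' takes 1
        System.out.println(Arrays.toString(utf8));
        System.out.println(Arrays.toString(latin1));

        // Binary -> text: decode the bytes with the charset they were encoded in
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}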
I think those two things are not directly related.
The Eclipse setting decides how your Eclipse editor will save the text files (typically source code) you create or edit. You can use other editors, and the file may therefore be saved with some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.
The String(byte[] bytes, String charsetName) constructor is part of your own application logic that deals with how you want to interpret data you read, either from a file or from the network. Different charsetNames (essentially different character encoding schemes) may lead to different interpretations of the same byte array.
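To make that concrete, here is a sketch of the same byte array decoded under two different charset names (windows-1252 is assumed to be present, which it is on typical desktop JREs):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentCharsets {
    public static void main(String[] args) {
        byte[] bytes = "café".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset that produced the bytes gives the original text
        System.out.println(new String(bytes, StandardCharsets.UTF_8));          // café

        // Decoding the same bytes as windows-1252 (Cp1252) produces mojibake
        System.out.println(new String(bytes, Charset.forName("windows-1252"))); // cafÃ©
    }
}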
A "charset" does implies the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the days, everybody were inventing their own character sets and encoding schemes, and the two were almost 1-to-1 mapping, therefore one name can be used to refer to both character set and encoding scheme.
I searched for Java's internal representation of String, but I've found two sources that look reliable yet are inconsistent.
One is:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451
and it says:
Java uses UTF-16 for the internal text representation and supports a non-standard modification of UTF-8 for string serialization.
The other is:
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
and it says:
Tcl also uses the same modified UTF-8[25] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.
Modified UTF-8? Or UTF-16? Which one is correct? And how many bytes does Java use for a char in memory?
Please let me know which one is correct and how many bytes it uses.
Java uses UTF-16 for the internal text representation
The representation for String and StringBuilder etc in Java is UTF-16
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/overview.html
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.
At the JVM level, if you are using -XX:+UseCompressedStrings (which was the default for some updates of Java 6), the actual in-memory representation can be 8-bit ISO-8859-1, but only for strings which do not need UTF-16 encoding.
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
and supports a non-standard modification of UTF-8 for string serialization.
Serialized Strings use (modified) UTF-8 by default.
And how many bytes does Java use for a char in memory?
A char is always two bytes, if you ignore the need for padding inside an Object.
Note: a code point (which allows characters > 65535) can use one or two chars, i.e. 2 or 4 bytes.
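A brief illustration, using a character above U+FFFF:

public class CharCount {
    public static void main(String[] args) {
        int a = 'A';        // U+0041 fits in one char (2 bytes)
        int clef = 0x1D11E; // U+1D11E MUSICAL SYMBOL G CLEF needs a surrogate pair

        System.out.println(Character.charCount(a));    // 1, i.e. 2 bytes in UTF-16
        System.out.println(Character.charCount(clef)); // 2, i.e. 4 bytes in UTF-16

        String s = new String(Character.toChars(clef));
        System.out.println(s.length());                // 2 chars for a single code point
    }
}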
You can confirm the following by looking at the source code of the relevant version of the java.lang.String class in OpenJDK. (For some really old versions of Java, String was partly implemented in native code. That source code is not publicly available.)
Prior to Java 9, the standard in-memory representation for a Java String is UTF-16 code-units held in a char[].
With Java 6 update 21 and later, there was a non-standard option (-XX:+UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7.
For Java 9 and later, the implementation of String has been changed to use a compact representation by default. The java command documentation now says this:
-XX:-CompactStrings
Disables the Compact Strings feature. By default, this option is enabled. When this option is enabled, Java Strings containing only single-byte characters are internally represented and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding. This reduces, by 50%, the amount of space required for Strings containing only single-byte characters. For Java Strings containing at least one multibyte character: these are represented and stored as 2 bytes per character using UTF-16 encoding. Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings.
Note that neither classical, "compressed", nor "compact" strings ever used UTF-8 encoding as the String representation. Modified UTF-8 is used in other contexts; e.g. in class files and in the object serialization format.
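For example, DataOutputStream.writeUTF, which uses the same format as serialization and class files, shows the "modified" part: U+0000 is written as the two bytes 0xC0 0x80 rather than the single 0x00 byte of standard UTF-8. A quick sketch:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF("\u0000"); // a single NUL character
        }
        // Prints: 00 02 C0 80 (2-byte length prefix, then the modified UTF-8 encoding)
        for (byte b : buffer.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}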
See also:
Java Platform, Standard Edition What’s New in Oracle JDK 9
JEP 254: Compact Strings
Difference between compact strings and compressed strings in Java 9
To answer your specific questions:
Modified UTF-8? Or UTF-16? Which one is correct?
Either UTF-16 or an adaptive representation that depends on the actual data; see above.
And how many bytes does Java use for a char in memory?
A single char uses 2 bytes. There might be some "wastage" due to possible padding, depending on the context.
A char[] is 2 bytes per character plus the object header (typically 12 bytes including the array length) padded to (typically) a multiple of 8 bytes.
Please let me know which one is correct and how many bytes it uses.
If we are talking about a String now, it is not possible to give a general answer. It will depend on the Java version and hardware platform, as well as the String length and (in some cases) what the characters are. Indeed, for some versions of Java it even depends on how you created the String.
Having said all of the above, the API model for String is that it is both a sequence of UTF-16 code-units and a sequence of Unicode code-points. As a Java programmer, you should be able to ignore everything that happens "under the hood". The internal String representation is (should be!) irrelevant.
UTF-16.
From http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp :
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.
The size of a char is 2 bytes.
Therefore, I would say that Java uses UTF-16 for internal String representation.
Java stores strings internally as UTF-16 and uses 2 bytes for each character.
Java is available in 18 international languages and follows the Unicode character set, which contains all the characters available in those 18 international languages; its Basic Multilingual Plane contains 65,536 characters. Java follows UTF-16, so the size of char in Java is 2 bytes.
Why does a character in Java take twice as much space to store as a character in C?
In Java, characters are 16-bit; in C, they are 8-bit.
A more general question is why is this so?
To find out why, you need to look at history and come to your own conclusions/opinions on the subject.
When C was developed in the USA, ASCII was pretty standard there, and you only really needed 7 bits, but with 8 you could handle some non-ASCII characters as well. It might seem more than enough. Many text-based protocols like SMTP (email), XML, and FIX still only use ASCII characters. Email and XML encode non-ASCII characters. Binary files, sockets, and streams are still only 8-bit-byte native.
BTW: C can support wider characters, but that is not plain char
When Java was developed, 16 bits seemed like enough to support most languages. Since then Unicode has been extended to characters above 65535, and Java has had to add support for code points, which are represented in UTF-16 as one or two 16-bit chars.
So making a byte a byte and char an unsigned 16-bit value made sense at the time.
BTW: If your JVM supports -XX:+UseCompressedStrings it can use bytes instead of chars for Strings which only use 8-bit characters.
Because Java uses Unicode, while C generally uses ASCII by default.
There are various flavours of Unicode encoding, but Java uses UTF-16, which uses either one or two 16-bit code units per character. ASCII always uses one byte per character.
http://java.about.com/od/programmingconcepts/a/unicode.htm
http://www.joelonsoftware.com/articles/Unicode.html
http://en.wikipedia.org/wiki/UTF-16
The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.
java.lang.Character
java.lang.String
Java is a modern language that came up around the early Unicode era (in the beginning of the 90s), so it supports Unicode by default as a first-class citizen, like many other contemporary languages (such as Python, Visual Basic or JavaScript...), OSes (Windows, Symbian, BREW...) and frameworks/interfaces/specifications (like Qt, NTFS, Joliet). By the time those were designed, Unicode was a fixed 16-bit charset encoded in UCS-2, so it made sense for them to use 16-bit values for characters.
In contrast, C is an "ancient" language that was invented decades before Java, when Unicode was far from a thing. That was the age of 7-bit ASCII and 8-bit EBCDIC, thus C uses an 8-bit char1, as that's enough for a char variable to contain all basic characters. When the Unicode era came, to refrain from breaking old code they decided to introduce a different character type in C90, which is wchar_t. Again, this was the 90s, when Unicode began its life. In any case, char had to keep its old size, because you still need to access individual bytes even if you use wider characters (Java has a separate byte type for this purpose).
Of course, later the Unicode Consortium quickly realized that 16 bits are not enough and had to fix it somehow. They widened the code-point range by changing UCS-2 to UTF-16, to avoid breaking old code that uses wide chars, and made Unicode a 21-bit charset (actually up to U+10FFFF instead of U+1FFFFF because of UTF-16). Unfortunately it was too late, and the old implementations that use a 16-bit char got stuck.
Later we saw the advent of UTF-8, which proved to be far superior to UTF-16 because it is independent of endianness, generally takes up less space, and most importantly requires no changes in the standard C string functions. Most user functions that receive a char* will continue to work without special Unicode support.
Unix systems are lucky because they migrated to Unicode later, when UTF-8 had been introduced, and therefore continue to use an 8-bit char. OTOH all modern Win32 APIs work on 16-bit wchar_t by default, because Windows was also an early adopter of Unicode. As a result, the .NET framework and C# go the same way by having char as a 16-bit type.
Talking about wchar_t, it was so unportable that both the C and C++ standards needed to introduce the new character types char16_t and char32_t in their 2011 revisions:
Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined
https://en.wikipedia.org/wiki/Wide_character#Programming_specifics
That said, most implementations are working on improving the wide-string situation. Java experimented with compressed strings in Java 6 and introduced compact strings in Java 9. Python moved to a more flexible internal representation in 3.3, compared with the wchar_t*-based one used before. Firefox and Chrome have separate internal 8-bit char representations for simple strings. There are also discussions on that for the .NET framework. And more recently, Windows has been gradually introducing UTF-8 support for the old ANSI APIs.
1 Strictly speaking char in C is only required to have at least 8 bits. See What platforms have something other than 8-bit char?
A Java char is a UTF-16 code unit (a Unicode code point in the BMP, or half of a surrogate pair), while C uses ASCII encoding in most cases.
Our requirement is to send EBCDIC text to a mainframe. We have some Chinese characters, hence the UTF-8 format.
So, is there a way to convert the UTF-8 characters to EBCDIC?
Thanks,
Raj Mohan
Assuming your target system is an IBM mainframe or midrange, it has full support for all of the EBCDIC encodings built into its JVM, as encodings named CPxxxx, corresponding to the IBM CCSIDs (CP stands for code page). You will need to do the translations on the host side, since the client side will not have the necessary encoding support.
Since Unicode is DBCS-and-greater and supports every known character, you will likely be targeting multiple EBCDIC encodings, so you will probably need to configure those encodings in some way. Try to have your client handle Unicode (UTF-8, UTF-16, etc.) only, with the translations being done as data arrives on the host and/or leaves the host system.
Other than needing to do translations host-side, the mechanics are the same as for any Java translation; e.g. new String(bytes, encoding) and String.getBytes(encoding), and the various NIO and writer classes. There's really no magic; it's no different from translating between, say, ISO 8859-x and Unicode, or any other SBCS (or limited DBCS).
For example:
byte[] ebcdta = "Hello World".getBytes("CP037"); // get bytes for EBCDIC code page 37
You can find more information on IBM's documentation website.
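A round-trip sketch using the plain JDK charsets (CP037 handles Latin text only; for the Chinese characters in the question you would need a Chinese-capable CCSID, and whether the corresponding charset is installed depends on your JRE):

import java.nio.charset.Charset;

public class EbcdicRoundTrip {
    public static void main(String[] args) {
        Charset ebcdic = Charset.forName("CP037"); // EBCDIC code page 37

        String text = "Hello World";
        byte[] hostBytes = text.getBytes(ebcdic);         // Unicode String -> EBCDIC bytes
        String roundTrip = new String(hostBytes, ebcdic); // EBCDIC bytes -> Unicode String

        System.out.println(roundTrip.equals(text));       // true
    }
}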
EBCDIC has many 8-bit code pages, and many of them are supported by the JVM. Have a look at Charset.availableCharsets().keySet(); the EBCDIC pages are named IBM... (there are aliases such as cp500 for IBM500, as you can see from Charset.forName("IBM500").aliases()). A sketch of that lookup follows after the list below.
There are two problems:
if you have characters from different EBCDIC code pages in the same text, this will not help
I am not sure if these charsets are available in any VM outside Windows
For the first, have a look at this approach. For the second, have a try on the desired target runtime ;-)
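A sketch of that lookup (the exact set of IBM* charsets you see depends on the JRE):

import java.nio.charset.Charset;

public class ListEbcdicCharsets {
    public static void main(String[] args) {
        // Print every installed charset whose canonical name starts with "IBM",
        // together with its aliases (e.g. cp500 for IBM500).
        Charset.availableCharsets().forEach((name, cs) -> {
            if (name.startsWith("IBM")) {
                System.out.println(name + " " + cs.aliases());
            }
        });
    }
}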
You can always make use of the IBM Toolbox for Java (JTOpen), specifically the com.ibm.as400.access.AS400Text class in the jt400.jar.
It goes as follows:
import com.ibm.as400.access.AS400Text;

int codePageNumber = 420;               // IBM CCSID 420 (Arabic EBCDIC)
String codePage = "CP420";              // the matching Java charset name
String sourceUtfText = "أحمد يوسف صالح";

AS400Text converter = new AS400Text(sourceUtfText.length(), codePageNumber);
byte[] bytesData = converter.toBytes(sourceUtfText);          // Unicode -> EBCDIC bytes
String resultedEbcdicText = new String(bytesData, codePage);  // decode the EBCDIC bytes back
I used code page 420 and its corresponding Java charset name CP420. This code page is used for Arabic text, so you should pick the code page suitable for Chinese text.
For the midrange AS/400 (IBM i these days), the best bet is to use the IBM Toolbox for Java (jt400.jar), which does all these things transparently (perhaps with slight hinting).
Please note that inside Java a character is a 16-bit value, not UTF-8 (which is an encoding).