Java - UTF8/16 is a Charset Name or Character Encoding?

The application I am developing will be used by people in Western and Eastern Europe as well as in the US. I am encoding my input and decoding my output with the UTF-8 character set.
My confusion arises because when I use the constructor String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it is really a character encoding. And my default encoding is set in Eclipse as Cp1252.
Does this mean that if, in the US, my Java application creates an output text file using Cp1252 as my charset encoding and UTF-8 as my charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?

They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.
Actually, by Unicode terminology they're probably most accurately character encoding schemes:
A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Where a character encoding form is:
Mapping from a character set definition to the actual code units used to represent the data.
Yes, the fact that Unicode only defines seven character encoding schemes makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).

I think those two things are not directly related.
The Eclipse setting decides how the Eclipse editor saves the text files (typically source code) you create or edit. You can use other editors, and the file may then be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.
The constructor
String(byte[] bytes, String charsetName)
is part of your own application logic, which decides how you want to interpret data you read from a file or the network. A different charsetName (essentially a different character encoding scheme) may interpret the same byte array differently.

A "charset" does imply the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the day, everybody was inventing their own character sets and encoding schemes, and the two mapped almost 1-to-1, so one name could be used to refer to both the character set and the encoding scheme.
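To make the "mapping between text and bytes" point concrete, here is a minimal sketch. The class and method names are made up for illustration, and ISO-8859-1 stands in for Cp1252 only because it is guaranteed to exist on every JVM. It suggests that what matters is the charset passed when encoding and decoding, not the Eclipse workspace setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class CharsetNameDemo {
    // Text -> bytes, using the given charset (an "encoding" in Unicode terms).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes -> text; this only round-trips if the charset matches the one used to encode.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Bjørn", written with an escape so the source file encoding does not matter.
        String original = "Bj\u00F8rn";
        byte[] utf8 = encode(original, StandardCharsets.UTF_8);

        // Same charset on both sides: the text survives.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original));

        // Different charset on the reading side: the two UTF-8 bytes of 'ø'
        // are decoded as two separate Latin-1 characters.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```

So a file written as UTF-8 in the US should be readable in Europe as long as the reading code also asks for UTF-8.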

Related

Point of other encoding rather than UTF-8

I have been working with strings in various programming languages for a long time, and I haven't come across a situation where I needed any encoding other than UTF-8.
The question might feel opinion-based, but I don't understand why other encodings should be available.
Wouldn't it make everyone's life (especially programmers') easier to have one single standard?
I take Java as an example:
A set of currently available encodings for Java can be found here:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

UTF-8: Advantages and disadvantages
The typical argument is:
Asian languages have many more characters and would require oversized
encoding for their languages.
However, the pros outweigh the cons in my opinion:
UTF-8 is, in general, much more powerful due to its compatibility with ASCII
The fact that it is Unicode
UTF-16 is not fixed-length either (only UTF-32 is)
Others that are not Unicode are extremely complex
I would take a gander over here: Why don't people use other encodings.

Strings in Java are internally represented as UTF-16. When you build a String you don't have to say which encoding to use for the internal representation (but you do have to pass the encoding if you are building a String from an array of bytes).
The link you provided shows the encodings available for read and write operations. If you want to correctly read a text file encoded in ISO-8859-1 on a platform where the default encoding is UTF-8, you must specify the correct encoding, and your language (Java in this case) must be able to convert automatically from one encoded form to another.
Java handles many encodings and the conversion from one to another. Internally it represents Strings as UTF-16, but you don't have to worry about that; you only need to specify the encoding when transforming a String to a sequence of bytes, or vice versa.
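As a rough illustration of "specify the encoding at the byte boundary", the sketch below writes text to a temporary file in one charset and reads it back with an explicitly chosen charset. The class name, method names and temp-file prefix are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of any library.
class ReadWithExplicitCharset {
    // Read the first line of a file, decoding with an explicitly chosen charset
    // instead of the platform default.
    static String firstLine(Path file, Charset charset) {
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write text to a temp file in one charset, then read it back in another (or the same).
    static String writeThenRead(String text, Charset writeAs, Charset readAs) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(writeAs));
                return firstLine(tmp, readAs);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"
        // Matching charsets: the text comes back intact, whatever the platform default is.
        System.out.println(writeThenRead(text, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1).equals(text));
        // Mismatched charsets: the UTF-8 bytes of 'é' are misread as two Latin-1 characters.
        System.out.println(writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```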

Determining ISO-8859-1 vs US-ASCII charset

I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading All about character sets to determine the character set of an example file which I must create in the same encoding via java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practise the same as US-ASCII if there are no "European" characters in it?
Should I just use a charset "ISO-8859-1" and everything will be ok?

If the file contains only the 7-bit US-ASCII characters it can be read as US-ASCII. It doesn't tell anything about what was intended as the charset. It may be just a coincidence that there were no characters that would require a different coding.
ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, with the first 128 characters being the same as in US-ASCII (as is often the case, for convenience).
However you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains the US-ASCII charset, but it will encode for example äöå characters as two bytes instead of ISO-8859-1's one byte.
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
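The points above can be checked with a small sketch (the class and method names are made up for illustration). It compares the bytes produced for an ASCII-only string and for a string containing ø under US-ASCII, ISO-8859-1 and UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class AsciiSubsetDemo {
    // True if every char is in the 7-bit US-ASCII range (0..127).
    static boolean isPureAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String plain = "Bjorn";
        String norwegian = "Bj\u00F8rn"; // "Bjørn"

        // ASCII-only text: the three encodings produce identical bytes,
        // which is why a detector can only report "us-ascii" for it.
        byte[] ascii = plain.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(ascii, plain.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.equals(ascii, plain.getBytes(StandardCharsets.UTF_8)));

        // With 'ø' the encodings diverge: one byte in ISO-8859-1, two in UTF-8.
        System.out.println(norwegian.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(norwegian.getBytes(StandardCharsets.UTF_8).length);

        // US-ASCII cannot represent 'ø'; getBytes substitutes '?' (0x3F).
        System.out.println(norwegian.getBytes(StandardCharsets.US_ASCII)[2] == 0x3F);
    }
}
```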

It depends on the types of characters used in the document. ASCII is a 7-bit charset and ISO-8859-1 is an 8-bit charset which supports some additional characters. But mostly, if you are going to reproduce the document from an InputStream, I recommend the ISO-8859-1 charset. It will work for text files opened in programs like Notepad and MS Word.
If you are using other international characters, you need to check for a charset that supports those particular characters, such as UTF-8.

Will String.getBytes("UTF-16") return the same result on all platforms?

I need to create a hash from a String containing a user's password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with a specified encoding (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly), and therefore on such a platform I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?

Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(Note that the BOM is part of the encoded output: the result of String.getBytes("UTF-16") starts with the bytes FE FF. Use "UTF-16BE" or "UTF-16LE" if you want the bytes without a BOM.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
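A quick way to see what the different UTF-16 variants actually produce is to dump the bytes. This is only a sketch; the class name and the hex helper are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Utf16BomDemo {
    // Format bytes as space-separated upper-case hex, e.g. "FE FF 00 61".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a";
        // "UTF-16": big-endian, preceded by a big-endian byte-order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 61
        // Explicit byte order: no byte-order mark is written.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 61
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 61 00
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 61
    }
}
```

For hashing, any of these should work as long as the same charset is named explicitly everywhere the hash is computed; the no-argument getBytes() is the one that depends on the platform default.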

It is true that Java uses Unicode internally, so it can combine any script or language. String and char use UTF-16BE, but .class files store their String constants in UTF-8. In general it is irrelevant what String does internally, because the conversion to bytes specifies the encoding the bytes have to be in.
If this encoding cannot represent some of the Unicode characters, a placeholder character or question mark is substituted. Fonts might also lack some Unicode characters; 35 MB is a normal size for a full Unicode font. You might then see a square with 2x2 hex codes or similar for missing code points, or on Linux another font might substitute the character.
Hence UTF-8 is a perfectly fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
UTF-16 (in both byte orders) and UTF-8 are always present in the JRE, whereas some other Charsets are not. Hence you can use a constant from StandardCharsets and need not handle any UnsupportedEncodingException.
Above I added a BOM especially for Windows Notepad, so that it recognizes UTF-8. It is certainly not good practice, but it is a small help here.
There is no disadvantage to UTF-16LE or UTF-16BE. I think UTF-8 is a bit more universally used, as UTF-16 also cannot store all Unicode code points in 16 bits. Text in Asian scripts would be more compact in UTF-16, but HTML pages are already more compact in UTF-8 because of the HTML tags and other Latin script.
For Windows, UTF-16LE might be more native.
Problems with placeholders might occur on non-Unicode platforms, especially Windows.
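The remark that UTF-16 cannot store every code point in 16 bits can be seen directly from a String holding one character outside the Basic Multilingual Plane. A minimal sketch, with a made-up class name and U+1F600 chosen arbitrarily as the example code point:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class SurrogatePairDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        String grinning = "\uD83D\uDE00";
        System.out.println(codeUnits(grinning));  // 2
        System.out.println(codePoints(grinning)); // 1
        // Four bytes in both encodings for this particular code point.
        System.out.println(grinning.getBytes(StandardCharsets.UTF_16BE).length); // 4
        System.out.println(grinning.getBytes(StandardCharsets.UTF_8).length);    // 4
    }
}
```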

I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer your question negatively.
As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.

C++ and Java encodings

I am trying to make a Java application and a VS C++ application communicate and send different messages to each other using Sockets. The only problem that I have so far - I am absolutely lost in their encodings.
By default Java uses UTF-8. As far as I know, this is a Unicode charset. In my VS project I have the settings set to Unicode. Yet for some reason, when I debug my code I always see my strings encoded as CP1252 in memory.
Furthermore, if I try to use CP1252 in Java, it works fine for English letters, but whenever I try some Russian letters I get a 3f byte for every letter.
If, on the other hand, I try to use UTF-8 in Java, each English letter is 1 byte long but every Russian letter is 2 bytes long. Isn't it a multibyte encoding?
Some docs on C++ say that std::string (char) uses the UTF-8 code page, and std::wstring (wchar_t) uses UTF-16. When I debug my application I see CP1252 encoding for both of them, though wstring has empty bytes between each letter.
Could you please explain how encodings behave in both Java and C++, and how my two apps should communicate?

UTF-8 uses a variable number of bytes per character. Common characters take less space by using fewer bytes per character. Less common characters take more space because they have to be encoded in more bytes. Since most of this was invented in the US, guess which characters are shorter and which are longer?
If you want Sockets to work, then you will have to get both sides to agree on the encoding. Otherwise, you are fighting a losing battle.
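To see the variable length in practice, the sketch below (hypothetical class and method names) counts the UTF-8 bytes for a few single-code-point strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Utf8LengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: ASCII letter
        System.out.println(utf8Length("\u044F"));       // 2: Cyrillic letter ya
        System.out.println(utf8Length("\u20AC"));       // 3: euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: U+1F600, outside the BMP
    }
}
```

This matches the observation in the question: one byte per English letter and two per Russian letter is exactly what UTF-8 is expected to produce.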

It's not true that Java uses UTF-8 encoding internally. You can write your source code in UTF-8 and compile it with some weird signs in attributes (sometimes really annoying).
The internal representation of strings in Java is UTF-16 (see What is the Java's internal represention for String? Modified UTF-8? UTF-16?).

Unicode is a character set, UTF-8 and UTF-16 are encodings of Unicode. For English (actually ASCII) characters UTF-8 results in the same value as CP1252 and UTF-16 adds a zero byte. As you want to use Russian (Cyrillic) you can use UTF-8, UTF-16 or CP1251. But both applications must agree on the encoding.
For example, if you agreed on UTF-8, the following will convert a Java String s to an array of bytes using UTF-8:
byte[] b = s.getBytes("UTF-8");
Then:
outputStream.write(b);
will send the data on the socket.
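Since a socket is just a byte stream, the receiver also needs to know where one message ends. One possible approach, sketched below, is to prefix each UTF-8 payload with its length. The 4-byte big-endian length prefix is an assumption made for this example, not something from the question, and the C++ side would have to implement the same framing. In-memory streams stand in for the socket streams so the sketch runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical framing helper, not part of any library.
class Utf8Framing {
    // Write one message: 4-byte big-endian length, then the UTF-8 bytes.
    static void send(OutputStream out, String message) {
        try {
            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            DataOutputStream data = new DataOutputStream(out);
            data.writeInt(payload.length); // DataOutputStream writes ints big-endian
            data.write(payload);
            data.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read one message written by send(): length first, then exactly that many bytes.
    static String receive(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            byte[] payload = new byte[data.readInt()];
            data.readFully(payload);
            return new String(payload, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The raw bytes that would go on the wire for one message.
    static byte[] frame(String message) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        send(wire, message);
        return wire.toByteArray();
    }

    public static void main(String[] args) {
        String message = "Hi \u043F\u0440\u0438\u0432\u0435\u0442"; // "Hi привет"
        byte[] wire = frame(message);
        // With a real socket these would be socket.getOutputStream() / getInputStream().
        System.out.println(receive(new ByteArrayInputStream(wire)).equals(message));
    }
}
```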

Java safeguards for when UTF-16 doesn't cut it

My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.
So I ask:
Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like to have some way of making sure that this garbage-producing character gets caught somehow before making it into the file that I am writing. How do I safeguard against this?
Thanks in advance.

Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are sequences of UTF-16 code units. By default char values are considered to be big-endian (as any Java basic type is). However, you should not write raw char values to files or memory to store strings. Instead, use the character encoding/decoding facilities in the Java API (see below).
UTF-16 is not a major superset of encodings. Actually, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode does define almost any character that you possibly want to use in modern communication.
If you read a file from disk and assume UTF-16, you will quickly run into trouble. Most text files use ASCII or an extension of ASCII that uses all 8 bits of a byte. Examples of these extensions are UTF-8 (which can be used to read any ASCII text) and ISO 8859-1 (Latin-1). Then there are many encodings, e.g. those used by Windows, that are extensions of those extensions. UTF-16 is not compatible with ASCII, so it should not be used as the default for most applications.
So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII. You choose UTF-8 (which is a pretty safe bet). There is no way of knowing whether the program that opens the file will correctly guess that it should use UTF-8. It may try Latin-1 or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately there are no smart tricks to make sure this never happens.
Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
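As a rough sketch of how those two classes can act as safeguards (the class and method names are made up for illustration): a decoder configured with CodingErrorAction.REPORT rejects malformed input instead of silently substituting a replacement character, and an encoder's canEncode tells you beforehand whether a string fits the target charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class StrictCodec {
    // Decode strictly: return null if the bytes are not valid in the given charset,
    // rather than silently inserting replacement characters.
    static String decodeOrNull(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Check before writing: can every character of the text be represented in the charset?
    static boolean canWrite(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        // The lone byte 0xE9 is not valid UTF-8, so strict decoding fails.
        System.out.println(decodeOrNull(latin1, StandardCharsets.UTF_8) == null);
        // A Cyrillic letter cannot be written as ISO-8859-1, but it can as UTF-8.
        System.out.println(canWrite("\u044F", StandardCharsets.ISO_8859_1));
        System.out.println(canWrite("\u044F", StandardCharsets.UTF_8));
    }
}
```

A strict decode that succeeds does not prove the guess was right (ISO-8859-1 accepts any byte sequence), so this complements a detector rather than replacing one.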

Whenever a conversion between bytes and characters takes place, Java lets you specify the character encoding to be used. If it is not specified, a machine-dependent default encoding is used. In some encodings the bit pattern representing a certain character has no similarity to the bit pattern used for the same character in UTF-16.
To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.
Which characters are representable depends on the encoding used.
