Java 8 UTF-16 isn't default charset but UTF-8

Java 8 UTF-16 isn't default charset but UTF-8 - java

I been doing some coding with String in Java8,Java 11 but this question is based on Java 8. I have this little snippet.
final char e = (char)200;//È
I just thought that the characters between 0.255[Ascii+extended Ascii] would always fit in a byte just because 2^8=256 but this seems not to be true i have try on the website https://mothereff.in/byte-counter and states that the character is taking 2 bytes can somebody please explain to me.
Another question in a lot of post states that Java is UTF-16 but in my machine running Windows 7 is returning UTF-8 in this snippet.
String csn = Charset.defaultCharset().name();
Is this platform depent?
Other questions i have try this snippet.
final List<Charset>charsets = Arrays.asList(StandardCharsets.ISO_8859_1,StandardCharsets.US_ASCII,StandardCharsets.UTF_16,StandardCharsets.UTF_8);
charsets.forEach(a->print(a,"È"));
System.out.println("getBytes");
System.out.println(Arrays.toString("È".getBytes()));
charsets.forEach(a->System.out.println(a+" "+Arrays.toString(sb.toString().getBytes(a))));
private void print(final Charset set,final CharSequence sb){
byte[] array = new byte[4];
set.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
final String buildedString = new String(array,set);
System.out.println(set+" "+Arrays.toString(array)+" "+buildedString+"<<>>"+buildedString.length());
}
And prints
run:
ISO-8859-1 [-56, 0, 0, 0] È//PERFECT USING 1 BYTE WHICH IS -56
US-ASCII [0, 0, 0, 0] //DONT GET IT SEE THIS ITEM FOR LATER
UTF-16 [-2, -1, 0, -56] È<<>>1 //WHAT IS -2,-1 BYTE USED FOR? I HAVE TRY WITH OTHER EXAMPLES AND THEY ALWAYS APPEAR AM I LOSING TWO BYTES HERE??
UTF-8 [-61, -120, 0, 0] 2 È //SEEMS TO MY CHARACTER NEEDS TWO BYTES?? I THOUGHT THAT CODE=200 WOULD REQUIRE ONLY ONE
getBytes
[-61, -120]//OK MY UTF-8 REPRESENTATION
ISO-8859-1 [-56]//OK
US-ASCII [63]//OK BUT WHY WHEN I ENCODE IN ASCCI DOESNT GET ANY BYTE ENCODED?
UTF-16 [-2, -1, 0, -56]//AGAIN WHAT ARE -2,-1 IN THE LEADING BYTES?
UTF-8 [-61, -120]//OK
I have try
System.out.println(new String(new byte[]{-1,-2},"UTF-16"));//SIMPLE "" I AM WASTING THIS 2 BYTES??
In resume.
Why UTF-16 always has two leading bytes are they wasted? new byte[]{-1,-2}
Why when i encode "È" i dont get any bytes in ASCCI Charset but when i do È.getBytes(StandardCharsets.US_ASCII) i get {63}?
Java uses UTF-16 but in my case UTF-8 is platform depend??
Sorry if this post is confussing
Environment
Windows 7 64 Bits Netbeans 8.2 with Java 1.8.0_121

First question
For your first question: those bytes are the BOM code and they specify the byte order (whether the least or most significant comes first) of multibyte encoding such as UTF-16.
Second question
Every ASCII character can be encoded as a single byte in UTF-8. But ASCII is not an 8-bit encoding, it uses 7 bits for every character. And in fact, all Unicode character with code points >= 128 require at least two bytes. (The reason is that you need a way to distinguish between 200 and a multibyte code point whose first byte happens to be 200. UTF-8 solves this by using the bytes >= 128 to represent multibyte codepoints.)
'È' is not an ASCII character, so it cannot be represented in ASCII. This explains the second output: 63 is ASCII for the character '?'. And indeed, the Javadoc for the getBytes(Charset) method specifies that unmappable input is mapped to "the default replacement byte array", in this case '?'. On the other hand, to obtain the first ASCII byte array you used the CharsetEncoder directly, which is a more low-level API and does not perform such automatic replacements. (When you would have checked the result of the encode method, you would have found it to have returned a CoderResult instance representing an error.)
Third question
Java 8 Strings use UTF-16 internally, but when communicating with other software, different encodings may be expected, such as UTF-8. The Charset.defaultCharset() method returns the default character set of the virtual machine, which depends on the locale and character set of the operating system, not on the encoding used internally by Java strings.

Let's back up a bit…
Java's text datatypes use the UTF-16 character encoding of the Unicode character set. (As do, VB4/5/6/A/Script, JavaScript, .NET, ….) You can see this in the various operations you do with the string API: indexing, length, ….
Libraries support converting between the text datatypes and byte arrays using various encodings. Some of them are categorized as "Extended ASCII", but stating that is a very poor substitute for naming the character encoding actually being used.
Some operating systems allow the user to designate a default character encoding. (Most users don't know or care, though.) Java attempts to pick this up. It is only useful when the program understands that input from the user is that character encoding or that output should be. This century, users dealing in text files prefer to use a specific encoding, communicate them unchanged across systems, don't appreciate lossy conversions and therefore don't have any use for this concept. From a program's point of view, it is never what you want unless it is exactly what you want.
Where a conversion would be lossy, you have the choice of a replacement character (such a '?'), omitting it, or throwing an exception.
A character encoding is a map between a codepoint (integer) of a character set and one or more code units, according to the definition of the encoding. A code unit is a fixed size and the number of code units needed for a codepoint, might vary by codepoint.
In libraries, it is not generally useful to have an array of code units so they take the further step of converting to/from an array of bytes. byte values do range from -128 to 127, however, that's the Java interpretation as two's complement 8-bit integers. As the bytes are understood to be encoding text, the values would be interpret according to the rules of the character encoding.
Because some Unicode encodings, have code units more than one byte long, byte order becomes important. So, at the byte array level, there is UTF-16 Big Endian and UTF-16 Little Endian. When communicating a text file or stream, you would send the bytes and well as having a shared knowledge of the encoding. This "metadata" is required for understanding. So, UTF-16BE or UTF-16LE, for example. To make that a bit easier, Unicode allows some metadata beginning of the file or stream to indicate the byte order. It is called the byte-order mark (BOM) So, the external metadata can share the encoding (say, UTF-16), while the internal metadata shares the byte order. Unicode allows the BOM to be present even when byte order is not relevant, such as UTF-8. So, if the understanding is that the bytes are text encoded with any Unicode encoding and a BOM is present, then it's a very simple matter to figure out which Unicode encoding it is and what the byte order is, if relavent.
1) You are seeing the BOM in some of your Unicode encoding outputs.
2) È is not in the ASCII character set. What would want to happen in this case? I often prefer an exception.
3) The system you were using, for your account, at the time of your tests, may have had UTF-8 as the default character encoding, Is that important to the way you want and have encoded your text files on that system?

Related

Will String.getBytes("UTF-16") return the same result on all platforms?

I need to create a hash from a String containing users password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with specified encoding, (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly) and therefore on such platform, I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?

Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(The BOM isn't relevant when the caller doesn't ask for it, so String.getBytes(...) won't include it.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.

It is true, java uses Unicode internally so it may combine any script/language. String and char use UTF-16BE but .class files store there String constants in UTF-8. In general it is irrelevant what String does, as there is a conversion to bytes specifying the encoding the bytes have to be in.
If this encoding of the bytes cannot represent some of the Unicode characters, a placeholder character or question mark is given. Also fonts might not have all Unicode characters, 35 MB for a full Unicode font is a normal size. You might then see a square with 2x2 hex codes or so for missing code points. Or on Linux another font might substitute the char.
Hence UTF-8 is a perfect fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
Both UTF-16 (in both byte orders) and UTF-8 always are present in the JRE, whereas some Charsets are not. Hence you can use a constant from StandardCharsets not needing to handle any UnsupportedEncodingException.
Above I added a BOM for Windows Notepad esoecially, to recognize UTF-8. It certainly is not good practice. But as a small help here.
There is no disadvantage to UTF16-LE or UTF-16BE. I think UTF-8 is a bit more universally used, as UTF-16 also cannot store all Unicode code points in 16 bits. Text is Asian scripts would be more compressed, but already HTML pages are more compact in UTF-8 because of the HTML tags and other latin script.
For Windows UTF-16LE might be more native.
Problem with placeholders for non-Unicode platforms, especially Windows, might happen.

I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer negatively your question.
As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.

Which charset should I use to encode and decode 8 bit values?

I have a problem with encoding and decoding specific byte values. I'm implementing an application, where I need to get String data, make some bit manipulation on it and return another String.
I'm currently getting byte[] values by String.getbytes(), doing proper manipulation and then returning String by constructor String(byte[] data). The issue is, when some of bytes have specific values e.g. -120, -127, etc., the coding in the constructor returns ? character, that is byte value 63. As far as I know, these values are ones, that can't be printed on Windows, concerning the fact, that -120 in Java is 10001000, that is \b character according to ASCII table
Is there any charset, that I could use to properly code and decode every byte value (from -128 to 127)?
EDIT: I shall also say, that ISO-8859-1 charset works pretty fine, but does not code language specific characters, such as ąęćśńźżół

You seem to have some confusion regarding encodings, not specific to Java, so I'll try to help clear some of that up.
There do not exist any charsets nor encodings which use the code points from -128 to 0. If you treat the byte as an unsigned integer, then you get the range 0-255 which is valid for all the cp-* and isoo-8859-* charsets.
ASCII characters are in the range 0-127 and so appear valid whether you treat the int as signed or unsigned.
UTF-8 characters are either in the range 0-127 or double-byte characters with the first byte in the range 128-255.
You mention some Polish characters, so instead of ISO-8859-1 you should encode as ISO-8859-2 or (preferably) UTF-8.

Why I need use encoding in String.getBytes(charsetName)

Ususally when I need to convert my string to byte[] I use getBytes() without param. I was checked it is not save I should use charset. Why I shoud do so - letter 'A' will always be parsed to 0x41? Is't it?

Ususally when I need to convert my string to byte[] I use getBytes() without param.
Stop doing that right now. I would suggest that you always specify an encoding. If you want to use the platform default encoding (which is what you'll get if you don't specify one), then do that explicitly so that it's clearer. But that should very rarely be the approach anyway. Personally I use UTF-8 in almost all cases.
Why I shoud do so - letter 'A' will always be parsed to 0x41? Is't it?
Nope. For example, using UTF-16, 'A' will be two bytes - 0x41 0x00 or 0x00 0x41 (depending on the endianness). In EBCDIC encodings it could be something completely different.
Most encodings treat ASCII characters in the same way - but characters outside ASCII are represented very differently in different encodings (and many encodings only support a subset of Unicode).
See my article on Unicode (C#-focused, but the principles are the same) for a few more details - and links to more information than you're ever likely to want.

Different character encodings lead to different ways characters get parsed. In Ascii, sure 'A' will parse to 0x41. In other encodings, this will be different.
This is why when you go to some webpages, you may see a bunch of weird characters. The browser doesn't know how to decode it, so it just decodes to the default.

Some background: When text is stored in files or sent between computers over a socket, the text characters are stored or sent as a sequence of bits, almost always grouped in 8-bit bytes. The characters all have defined numeric values in Unicode, so that 'A' always has the value 0x41 (well, there are actually two other A's in the Unicode character set, in the Greek and Russian alphabets, but that's not relevant). But there are many mechanisms for how those numeric codes are translated to a sequence of bits when storing in a file or sending to another computer. In UTF-8, 0x41 is represented as 8 bits (the byte 0x41), but other numeric values (code points) will be converted to 16 or more bits with an algorithm that rearranges the bits; in UTF-16, 0x41 is represented as 16 bits; and there are other encodings like JIS and some which are capable of representing some but not all of the Unicode characters. Since String.getBytes() was intended to return a byte array that contains the bytes to be sent to a file or socket, the method needs to know what encoding it's supposed to use when creating those bytes. Basically the encoding will have to be the same one that a program later reading a file, or a computer at the other end of the socket, expects it to be.

Isn't the size of character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) {
byte[] cbuff = new byte[1];
fr.read(cbuff,0,1);
System.out.println(new String(cbuff));
}
Why am I seeing one full character being read by this?

A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8 and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way)

Java stores all it's "chars" internally as two bytes. However, when they become strings etc, the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).

The constructor String(byte[] bytes) takes the bytes from the buffer and encodes them to characters.
It uses the platform default charset to encode bytes to characters. If you know, your file contains text, that is encoded in a different charset, you can use the String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).

In ASCII text file each character is just one byte

Looks like your file contains ASCII characters, which are encoded in just 1 byte. If text file was containing non-ASCII character, e.g. 2-byte UTF-8, then you get just the first byte, not whole character.

There are some great answers here but I wanted to point out the jvm is free to store a char value in any size space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that java characters are often misused. People don't realize they are writing code that won't properly handle codepoints over 16 bits in length.

Java allocates 2 of 2 bytes for character as it follows UTF-16. It occupies minimum 2 bytes while storing a character, and maximum of 4 bytes. There is no 1 byte or 3 bytes of storage for character.

The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded, then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little endian. For example, the little endian UTF-16 for A is [65, 0]. Then when you read the first byte, it returns 65. After padding with 0 for the second byte, you will get A.

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16?
Why do we need these?
MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";
md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();

I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows to use null-terminated strings, this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.
Main UTF-8 cons:
Many common characters have different length, which slows indexing by codepoint and calculating a codepoint count terribly.
Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.
Main UTF-16 pros:
BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside BMP mandatory), most Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows to use 16-bit char as the primitive component of the string.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
It's variable length, so counting or indexing codepoints is costly, though less than UTF-8.
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.

They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP, up to 4 for characters in the current Unicode range of U+0000 to U+1FFFFF, and is extensible up to U+7FFFFFFF if that ever becomes necessary... but notably all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that all Java characters are UTF-16 code points within the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)

Security: Use only UTF-8
Difference between UTF-8 and UTF-16? Why do we need these?
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
CVE-2008-2938
CVE-2012-2135
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.

This is unrelated to UTF-8/16 (in general, although it does convert to UTF16 and the BE/LE part can be set w/ a single line), yet below is the fastest way to convert String to byte[]. For instance: good exactly for the case provided (hash code). String.getBytes(enc) is relatively slow.
static byte[] toBytes(String s){
byte[] b=new byte[s.length()*2];
ByteBuffer.wrap(b).asCharBuffer().put(s);
return b;
}

Simple way to differentiate UTF-8 and UTF-16 is to identify commonalities between them.
Other than sharing same unicode number for given character, each one is their own format.
UTF-8 try to represent, every unicode number given to character with one byte(If it is ASCII), else 2 two bytes, else 4 bytes and so on...
UTF-16 try to represent, every unicode number given to character with two byte to start with. If two bytes are not sufficient, then uses 4 bytes. IF that is also not sufficient, then uses 6 bytes.
Theoretically, UTF-16 is more space efficient, but in practical UTF-8 is more space efficient as most of the characters(98% of data) for processing are ASCII and UTF-8 try to represent them with single byte and UTF-16 try to represent them with 2 bytes.
Also, UTF-8 is superset of ASCII encoding. So every app that expects ASCII data would also accepted by UTF-8 processor. This is not true for UTF-16. UTF-16 could not understand ASCII, and this is big hurdle for UTF-16 adoption.
Another point to note is, all UNICODE as of now could be fit in 4 bytes of UTF-8 maximum(Considering all languages of world). This is same as UTF-16 and no real saving in space compared to UTF-8 ( https://stackoverflow.com/a/8505038/3343801 )
So, people use UTF-8 where ever possible.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.