I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method.
Now, in my Java code, I take that response, do some processing on it, and later pass it on to a different service.
Now, I googled a bit and found out that by default the encoding for strings in Java is UTF-16.
In my response XML, one of the elements had the character É. This got mangled in the post-processing request that I make to the different service.
Instead of sending É, it sent some gibberish. Now I want to know: is there really a big difference between these two encodings? And if I want to see what É converts to when going from UTF-8 to UTF-16, how can I do that?
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to the US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
Main UTF-8 cons:
Many common characters have different lengths, which slows down indexing and calculating a string length terribly.
Main UTF-16 pros:
Most reasonable characters, like Latin, Cyrillic, Chinese, and Japanese, can be represented with 2 bytes. Unless really exotic characters are needed, this means that the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.
There are two things:
the encoding in which you exchange data;
the internal string representation of Java.
You should not be preoccupied with the second point ;) The thing is to use the appropriate methods to convert from your data (byte arrays) to Strings (char arrays, ultimately), and to convert from Strings back to your data.
The most basic classes you can think of are CharsetDecoder and CharsetEncoder, but there are plenty of others: String.getBytes() and all the Readers and Writers are but two possible routes, and there are all the static methods of Character as well.
If you see gibberish at some point, it means you failed to decode or encode from the original byte data to Java strings. But again, the fact that Java strings use UTF-16 is not relevant here.
In particular, you should be aware that when you create a Reader or Writer, you should specify the encoding; if you fail to do so, the default JVM encoding will be used, and it may, or may not, be UTF-8.
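As a quick illustration of that last point, here is a minimal sketch, assuming the caller hands you the raw response stream (the method names readAll and writeAll are placeholders, not part of any API), of decoding a UTF-8 response into a String and encoding it back to UTF-8 with the charset stated explicitly instead of relying on the JVM default:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetIo {
    // Decode the raw response bytes as UTF-8 into a Java String (UTF-16 internally).
    public static String readAll(InputStream responseBody) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(responseBody, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Encode the String back to UTF-8 bytes when sending it on to the next service.
    public static void writeAll(OutputStream target, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}

If both sides agree on the charset like this, a character such as É survives the round trip intact.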
This website provides UTF-to-UTF conversion:
http://www.fileformat.info/convert/text/utf2utf.htm
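If you would rather do the same conversion in code than on that site, here is a small sketch (the class name UtfToUtf is only illustrative) that decodes UTF-8 bytes and re-encodes them as UTF-16:

import java.nio.charset.StandardCharsets;

public class UtfToUtf {
    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0x89};                   // "É" encoded as UTF-8
        String decoded = new String(utf8, StandardCharsets.UTF_8);  // decode into a Java String
        byte[] utf16 = decoded.getBytes(StandardCharsets.UTF_16BE); // 0x00 0xC9 as UTF-16BE
        System.out.println(decoded + " -> " + utf16.length + " bytes in UTF-16BE");
    }
}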
UTF-32 is arguably the most human-readable of the Unicode encoding forms, because its big-endian hexadecimal representation is simply the Unicode scalar value without the “U+” prefix, zero-padded to eight digits. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.
HOWEVER
UTF-32 is the same as the old UCS-4 encoding and remains fixed-width. Why can it remain fixed-width? Because UTF-16 is now the format that can encode the fewest characters, and it set the limit for all formats: it was defined that 1,112,064 is the total number of code points that will ever be defined by either Unicode or ISO 10646. Since Unicode is now only defined from 0 to 10FFFF, UTF-32 sounds like a somewhat pointless encoding: it is 32 bits wide, but only about 21 bits are ever used, which makes it very wasteful.
UTF-8: Generally speaking, you should use UTF-8. Most HTML documents use this encoding.
It uses at least 8 bits of data to store each character. This can lead to more efficient storage, especially when the text contains mostly English ASCII characters. But higher-order characters, such as non-ASCII characters, may require up to 32 bits (four bytes) each!
UTF-16:
This encoding uses at least 16 bits to encode characters, including lower-order ASCII characters and higher-order non-ASCII characters.
If you are encoding text consisting of mostly non-English or non-ASCII characters, UTF-16 may result in smaller file size. But if you use UTF-16 to encode mostly ASCII text, it will use up more space.
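To make that concrete, here is a quick sketch (the sample strings are arbitrary): for mostly-ASCII text UTF-8 is about half the size of UTF-16, while for CJK text UTF-16 comes out smaller.

import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) {
        String english = "The quick brown fox jumps over the lazy dog"; // ASCII: 1 byte per char in UTF-8
        String japanese = "すばやい茶色の狐";                            // kana/kanji: 3 bytes per char in UTF-8, 2 in UTF-16

        report("English ", english);
        report("Japanese", japanese);
    }

    private static void report(String label, String s) {
        System.out.println(label
                + " UTF-8=" + s.getBytes(StandardCharsets.UTF_8).length + " bytes,"
                + " UTF-16=" + s.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
    }
}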
Related
I am creating a set of tests for the size of a String. To do so I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java has a checked exception, UnsupportedEncodingException.
Just for curiosity, and to further consider other possible test scenarios, is there a text that cannot be represented by UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) does it specify that UTF-8/Unicode indeed includes ALL possible characters. I know the code space holds over a million code points and that many of them are still unassigned, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyway).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, the specified ranges cover the whole Unicode code space (excluding the surrogate range, of course).
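If you want to convince yourself empirically, here is a minimal sketch (the class name Utf8RoundTrip is only illustrative) that round-trips every non-surrogate code point through UTF-8 and checks that nothing is lost:

import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // surrogate code points cannot be encoded on their own
            }
            String original = new String(Character.toChars(cp));
            String roundTripped = new String(
                    original.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
            if (!original.equals(roundTripped)) {
                System.out.println("Round trip failed for U+" + Integer.toHexString(cp).toUpperCase());
            }
        }
        System.out.println("Done: every non-surrogate code point survives UTF-8.");
    }
}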
is there a text that cannot be represented by UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that does have some differences/limitations from standard UTF-8.
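For example, DataOutputStream.writeUTF() and the class file format use Modified UTF-8, which encodes U+0000 as two bytes instead of one. A small sketch of the difference (the class name ModifiedUtf8Demo is only illustrative):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        String s = "A\u0000B"; // contains a NUL character

        // Standard UTF-8: NUL is a single 0x00 byte, so 3 bytes in total.
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8: NUL becomes the two bytes 0xC0 0x80.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s); // writes a 2-byte length prefix followed by modified UTF-8
        }
        byte[] modified = buffer.toByteArray();

        System.out.println("standard UTF-8 length: " + standard.length);       // 3
        System.out.println("modified UTF-8 length: " + (modified.length - 2)); // 4 (length prefix excluded)
    }
}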
I have been reading about how Unicode code points have evolved over time, including this article by Joel Spolsky, which says:
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
But despite all this reading, I couldn't find the real reason that Java uses UTF-16 for a char.
Isn't UTF-8 far more efficient than UTF-16? For example, if I had a string containing 1024 ASCII-range characters, UTF-16 would take 1024 * 2 bytes (2 KB) of memory.
But if Java used UTF-8, it would be just 1 KB of data. Even if the string had a few characters that need more bytes, it would still only take about a kilobyte. For example, suppose in addition to the 1024 characters there were 10 occurrences of "字" (code point U+5B57, UTF-8 encoding e5 ad 97). In UTF-8, this would still take only (1024 * 1 byte) + (10 * 3 bytes) = 1 KB + 30 bytes.
So this doesn't answer my question. 1KB + 30 bytes for UTF-8 is clearly less memory than 2KB for UTF-16.
Of course it makes sense that Java doesn't use ASCII for a char, but why does it not use UTF-8, which has a clean mechanism for handling arbitrary multi-byte characters when they come up? UTF-16 looks like a waste of memory in any string which has lots of non-multibyte chars.
Is there some good reason for UTF-16 that I'm missing?
Java used UCS-2 before transitioning over to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:
Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.
This, and the birth of UTF-16, is further explained by the Unicode FAQ page:
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So, all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. This then left UTF-16 as the easiest natural progression beyond that.
Historically, one reason was the performance characteristics of random access or iterating over the characters of a String:
UTF-8 encoding uses a variable number (1-4) of bytes to encode a Unicode character. Therefore, accessing a character by index, as String.charAt(i) does, would be far more complicated to implement and slower than the plain array access used by java.lang.String.
Even today, Python uses a fixed-width format for Strings internally, storing either 1, 2, or 4 bytes per character depending on the maximum size of a character in that string.
Of course, this is no longer a pure benefit in Java, since, as nj_ explains, Java no longer uses a fixed-width format. But at the time the language was developed, Unicode was a fixed-width format (now called UCS-2), and this would have been an advantage.
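To see why indexing by character becomes awkward once the encoding is variable-length, here is a short sketch: charAt() is constant-time access to UTF-16 code units, but indexing by code point has to walk the string once surrogate pairs are involved.

public class CodePointIndexing {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // "a𝄞b": U+1D11E lies outside the BMP, stored as a surrogate pair

        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // but only 3 code points

        char unit = s.charAt(1);                             // constant time, yet only a high surrogate
        System.out.println(Integer.toHexString(unit));       // d834

        int index = s.offsetByCodePoints(0, 2);              // walks the string to find the 3rd code point
        System.out.println(Character.toChars(s.codePointAt(index))); // b
    }
}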
I need to create a hash from a String containing a user's password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with a specified encoding (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly), and therefore on such a platform I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?
Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(Note that the encoder itself writes the byte-order mark, so the output of String.getBytes("UTF-16") starts with the bytes 0xFE 0xFF; since that prefix is the same everywhere, it doesn't affect the guarantee.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
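As a minimal sketch of what that buys you (the class name StableHashDemo is only illustrative), hashing explicitly-encoded bytes yields the same digest on every platform, which is exactly what the password use case needs:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class StableHashDemo {
    public static void main(String[] args) throws Exception {
        String password = "pässwörd";

        // StandardCharsets.UTF_16 (or UTF_8) produces the same bytes on every platform,
        // so this digest is reproducible everywhere.
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_16));

        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}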
It is true, Java uses Unicode internally, so it can combine any script/language. String and char use UTF-16, and .class files store their String constants in (modified) UTF-8. In general it is irrelevant what String does internally, since there is always a conversion to bytes that specifies the encoding the bytes have to be in.
If this encoding cannot represent some of the Unicode characters, a placeholder character or question mark is substituted. Also, fonts might not have all Unicode characters; 35 MB for a full Unicode font is a normal size. You might then see a square with 2x2 hex codes or so for missing code points. Or, on Linux, another font might substitute the char.
Hence UTF-8 is a perfectly fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // add a Unicode BOM
    s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
Both UTF-16 (in both byte orders) and UTF-8 are always present in the JRE, whereas some charsets are not. Hence you can use a constant from StandardCharsets and do not need to handle any UnsupportedEncodingException.
Above I added a BOM especially for Windows Notepad, so that it recognizes the text as UTF-8. It is certainly not good practice, but it serves as a small help here.
There is no real disadvantage to UTF-16LE or UTF-16BE. I think UTF-8 is a bit more universally used, since UTF-16 also cannot store all Unicode code points in 16 bits. Text in Asian scripts would be more compact, but HTML pages are already smaller in UTF-8 because of the HTML tags and other Latin-script content.
For Windows, UTF-16LE might be more native.
Problems with placeholder characters might occur on non-Unicode platforms, especially Windows.
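On the StandardCharsets point, here is a small sketch of the two ways of selecting a charset; only the lookup by name forces you to deal with the checked exception:

import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetLookupDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "example";

        // Lookup by name: the compiler forces you to handle UnsupportedEncodingException,
        // even though "UTF-8" is guaranteed to be present in every JRE.
        byte[] byName = s.getBytes("UTF-8");

        // StandardCharsets constant: no checked exception and no typo risk in the name.
        byte[] byConstant = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(byName.length == byConstant.length); // true
    }
}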
I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer your question in the negative.
As per Jon Skeet's answer: the specification is clear. But I guess the Android/Mac implementations of Dalvik/JVM don't agree.
Usually when I need to convert my string to byte[] I use getBytes() without a parameter. I have checked that this is not safe and that I should specify a charset. Why should I do so? Won't the letter 'A' always be encoded as 0x41?
Usually when I need to convert my string to byte[] I use getBytes() without a parameter.
Stop doing that right now. I would suggest that you always specify an encoding. If you want to use the platform default encoding (which is what you'll get if you don't specify one), then do that explicitly so that it's clearer. But that should very rarely be the approach anyway. Personally I use UTF-8 in almost all cases.
Why should I do so? Won't the letter 'A' always be encoded as 0x41?
Nope. For example, using UTF-16, 'A' will be two bytes - 0x41 0x00 or 0x00 0x41 (depending on the endianness). In EBCDIC encodings it could be something completely different.
Most encodings treat ASCII characters in the same way - but characters outside ASCII are represented very differently in different encodings (and many encodings only support a subset of Unicode).
See my article on Unicode (C#-focused, but the principles are the same) for a few more details - and links to more information than you're ever likely to want.
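Here is a minimal sketch of the difference (the class name GetBytesDemo is only illustrative): with an explicit charset the result is the same everywhere, while the no-argument overload depends on whatever the platform default happens to be.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "É";

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // always 0xC3 0x89
        byte[] utf16be = s.getBytes(StandardCharsets.UTF_16BE);  // always 0x00 0xC9
        byte[] platformDefault = s.getBytes();                   // depends on the machine's default charset

        System.out.println("default charset : " + Charset.defaultCharset());
        System.out.println("UTF-8 bytes     : " + utf8.length);
        System.out.println("UTF-16BE bytes  : " + utf16be.length);
        System.out.println("default bytes   : " + platformDefault.length);
    }
}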
Different character encodings lead to different byte representations of the same characters. In ASCII, sure, 'A' maps to 0x41. In other encodings, this will be different.
This is why, when you go to some web pages, you may see a bunch of weird characters: the browser doesn't know how to decode the bytes, so it just falls back to a default encoding.
Some background: When text is stored in files or sent between computers over a socket, the text characters are stored or sent as a sequence of bits, almost always grouped in 8-bit bytes. The characters all have defined numeric values in Unicode, so that 'A' always has the value 0x41 (well, there are actually two other A's in the Unicode character set, in the Greek and Russian alphabets, but that's not relevant).
But there are many mechanisms for how those numeric codes are translated to a sequence of bits when storing in a file or sending to another computer. In UTF-8, 0x41 is represented as 8 bits (the byte 0x41), but other numeric values (code points) will be converted to 16 or more bits with an algorithm that rearranges the bits; in UTF-16, 0x41 is represented as 16 bits; and there are other encodings like JIS and some which are capable of representing some but not all of the Unicode characters.
Since String.getBytes() was intended to return a byte array that contains the bytes to be sent to a file or socket, the method needs to know what encoding it's supposed to use when creating those bytes. Basically the encoding will have to be the same one that a program later reading a file, or a computer at the other end of the socket, expects it to be.
Difference between UTF-8 and UTF-16?
Why do we need these?
MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";
md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.
Main UTF-8 cons:
Many common characters have different lengths, which slows down indexing by code point and calculating a code point count terribly.
Even though byte order doesn't matter, sometimes UTF-8 still has a BOM (byte order mark) which serves to signal that the text is encoded in UTF-8, and which also breaks compatibility with ASCII software even if the text contains only ASCII characters. Microsoft software (like Notepad) especially likes to add a BOM to UTF-8.
Main UTF-16 pros:
BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside BMP mandatory), most Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of a 16-bit char as the primitive component of the string.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
It's variable length, so counting or indexing codepoints is costly, though less than UTF-8.
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
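To illustrate the BE/LE point, here is a small sketch (the class name ByteOrderDemo is only illustrative): the JDK exposes three UTF-16 flavours that produce different byte sequences for the same string, while UTF-8 has only one form.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderDemo {
    public static void main(String[] args) {
        String s = "É"; // U+00C9

        // UTF-16: big-endian BOM (0xFE 0xFF) followed by 0x00 0xC9
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));
        // UTF-16BE: 0x00 0xC9, no BOM
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));
        // UTF-16LE: 0xC9 0x00, no BOM
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE)));

        // UTF-8: a single byte sequence (0xC3 0x89), independent of byte order
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
    }
}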
They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP, and up to 4 for characters in the current Unicode range of U+0000 to U+10FFFF; the scheme is extensible up to U+7FFFFFFF if that ever becomes necessary... but notably all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that a Java char is a UTF-16 code unit, covering only code points within the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
Security: Use only UTF-8
Difference between UTF-8 and UTF-16? Why do we need these?
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
CVE-2008-2938
CVE-2012-2135
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.
This is unrelated to UTF-8 vs UTF-16 in general (although it does convert to UTF-16, and the BE/LE part can be set with a single line), but below is the fastest way to convert a String to byte[]. For instance, it is good exactly for the case provided (a hash code); String.getBytes(enc) is relatively slow.
static byte[] toBytes(String s) {
    byte[] b = new byte[s.length() * 2];
    // ByteBuffer's default order is big-endian, so this copies the chars out as UTF-16BE
    ByteBuffer.wrap(b).asCharBuffer().put(s);
    return b;
}
A simple way to differentiate UTF-8 and UTF-16 is to identify what they have in common and where they differ.
Other than sharing the same Unicode number for a given character, each one is its own format.
UTF-8 tries to represent every Unicode number given to a character with one byte (if it is ASCII), otherwise with two, three, or four bytes.
UTF-16 starts by representing every Unicode number given to a character with two bytes. If two bytes are not sufficient, it uses four bytes (a surrogate pair); four bytes are always enough.
Theoretically, UTF-16 is more space-efficient, but in practice UTF-8 is more space-efficient, as most of the characters being processed (around 98% of data) are ASCII, and UTF-8 represents them with a single byte while UTF-16 needs two.
Also, UTF-8 is a superset of the ASCII encoding, so any ASCII data is also valid UTF-8 and will be accepted by a UTF-8 processor. This is not true for UTF-16: a UTF-16 processor cannot understand plain ASCII, and this is a big hurdle for UTF-16 adoption.
Another point to note is that all of Unicode, as of now, fits in at most 4 bytes of UTF-8 (considering all languages of the world). This is the same maximum as UTF-16, so UTF-16 gives no real saving in space in the worst case compared to UTF-8 ( https://stackoverflow.com/a/8505038/3343801 ).
So, people use UTF-8 wherever possible.