The reason for encoding with the standard Base64 format is to make sure the output won't contain any bytes that might be interpreted as control characters during network transfer. This ensures the same data is received on the other side of the network transfer.
In this scenario, does UTF-8 character encoding provide the same guarantee as Base64, i.e. no control characters in the output, so that we can send it over the network?
The reason for encoding with the standard Base64 format is to make sure the output won't contain any bytes that might be interpreted as control characters during network transfer.
The above statement is incorrect. Base64 is used specifically to encode binary data using 64 of the printable ASCII characters. It is only necessary in specific situations where you are embedding binary data in a protocol which was designed to transfer text (such as embedding attachments in email); it is not required in general for transmitting data over a network. HTTP, for instance, manages perfectly well without it.
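As a rough illustration of that point, here is a small Java sketch (using the java.util.Base64 class that ships with Java 8+); the byte values are arbitrary and chosen only to include bytes from the ASCII control range:
import java.util.Base64;

// Arbitrary binary data, including bytes that fall in the ASCII control range
byte[] binary = {0x00, 0x1F, (byte) 0x83, (byte) 0xFF};

// Base64 turns every 3 input bytes into 4 printable ASCII characters
String encoded = Base64.getEncoder().encodeToString(binary);
System.out.println(encoded); // AB+D/w==

// Decoding restores the original bytes exactly
byte[] decoded = Base64.getDecoder().decode(encoded);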
In this scenario, does UTF-8 character encoding provide the same guarantee as Base64, i.e. no control characters in the output, so that we can send it over the network?
No. UTF-8 is a Unicode string format. It cannot be used to encode arbitrary binary data.
Control characters (0-31 in ASCII) are not touched by UTF-8 encoding, so if your protocol cannot transmit them safely, switching to UTF-8 won't solve the issue.
UTF-8 is about encoding Unicode text into a stream of 8-bit bytes, not about escaping control characters. It solves a different problem.
Note that the input to UTF-8 encoding is Unicode text, not arbitrary bytes: for example, it's not possible to encode the raw byte 0x83 with UTF-8. What you can do is convert the Greek letter "Δ", which cp737 encodes as 0x83, into UTF-8, or convert the Russian letter "Ѓ", which cp855 encodes as 0x83, into UTF-8, but the results differ ("Δ" becomes 0xCE 0x94, while "Ѓ" becomes 0xD0 0x83).
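A small sketch of that last point in Java (the printed byte values should match the ones quoted above):
import java.nio.charset.StandardCharsets;

// The Greek letter Δ (U+0394) encodes to the two UTF-8 bytes 0xCE 0x94
byte[] delta = "Δ".getBytes(StandardCharsets.UTF_8);
System.out.printf("%02X %02X%n", delta[0] & 0xFF, delta[1] & 0xFF); // CE 94

// The Cyrillic letter Ѓ (U+0403) encodes to the two UTF-8 bytes 0xD0 0x83
byte[] gje = "Ѓ".getBytes(StandardCharsets.UTF_8);
System.out.printf("%02X %02X%n", gje[0] & 0xFF, gje[1] & 0xFF); // D0 83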
Related
I am trying to make a Java application and a VS C++ application communicate and send different messages to each other using sockets. The only problem I have so far is that I am absolutely lost in their encodings.
By default Java uses UTF-8, which, as far as I am concerned, is a Unicode charset. In my VS project the settings are set to Unicode. Yet for some reason, when I debug my code, I always see my strings encoded as CP1252 in memory.
Furthermore, if I try to use CP1252 in Java it works fine for English letters, but whenever I try some Russian letters I get a 0x3F byte for every letter.
If, on the other hand, I try to use UTF-8 in Java, each English letter is 1 byte long, but every Russian letter is 2 bytes long. Isn't it a multibyte encoding?
Some C++ docs say that std::string (char) uses the UTF-8 code page, and std::wstring (wchar_t) uses UTF-16. When I debug my application I see CP1252 encoding for both of them, though the wstring has empty bytes between the letters.
Could you please explain how encodings behave in both Java and C++, and how I should make my two apps communicate?
UTF-8 uses a variable number of bytes per character. Common characters take less space because they use fewer bytes per character; less common characters take more space because they have to be encoded in more bytes. Since most of this was invented in the US, guess which characters are shorter and which are longer?
If you want sockets to work, then you will have to get both sides to agree on the encoding. Otherwise, you are fighting a losing battle.
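A quick way to see the variable length for yourself on the Java side (a small sketch; the byte counts are a property of UTF-8 itself, not of Java):
import java.nio.charset.StandardCharsets;

// ASCII letters take one byte each in UTF-8
System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1
// Cyrillic letters take two bytes each
System.out.println("Я".getBytes(StandardCharsets.UTF_8).length);  // 2
// CJK characters take three bytes each
System.out.println("哈".getBytes(StandardCharsets.UTF_8).length); // 3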
It's not true that Java does UTF-8 encoding internally. You can write your source code in UTF-8 and compile it even with some weird characters in attributes (sometimes really annoying).
The internal representation of strings in Java is UTF-16 (see What is the Java's internal represention for String? Modified UTF-8? UTF-16?).
Unicode is a character set, UTF-8 and UTF-16 are encodings of Unicode. For English (actually ASCII) characters UTF-8 results in the same value as CP1252 and UTF-16 adds a zero byte. As you want to use Russian (Cyrillic) you can use UTF-8, UTF-16 or CP1251. But both applications must agree on the encoding.
For example, if you agreed on UTF-8, the following will convert a Java String s to an array of bytes using UTF-8:
byte[] b = s.getBytes("UTF-8");
Then:
outputStream.write(b);
will send the data on the socket.
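On the receiving end, the same charset has to be used to turn the bytes back into a String; a minimal sketch, assuming you have already read the raw bytes from the socket into b (framing/length handling omitted here):
byte[] b = ...; // bytes read from the socket
String s = new String(b, "UTF-8");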
Usually when I need to convert my string to byte[] I use getBytes() without a parameter. I have read that this is not safe and that I should specify a charset. Why should I do so? Won't the letter 'A' always be encoded as 0x41?
Usually when I need to convert my string to byte[] I use getBytes() without a parameter.
Stop doing that right now. I would suggest that you always specify an encoding. If you want to use the platform default encoding (which is what you'll get if you don't specify one), then do that explicitly so that it's clearer. But that should very rarely be the approach anyway. Personally I use UTF-8 in almost all cases.
Why should I do so? Won't the letter 'A' always be encoded as 0x41?
Nope. For example, using UTF-16, 'A' will be two bytes - 0x41 0x00 or 0x00 0x41 (depending on the endianness). In EBCDIC encodings it could be something completely different.
Most encodings treat ASCII characters in the same way - but characters outside ASCII are represented very differently in different encodings (and many encodings only support a subset of Unicode).
See my article on Unicode (C#-focused, but the principles are the same) for a few more details - and links to more information than you're ever likely to want.
Different character encodings lead to different byte representations for the same characters. In ASCII, sure, 'A' will map to 0x41. In other encodings this can be different.
This is why when you go to some webpages, you may see a bunch of weird characters. The browser doesn't know how to decode it, so it just decodes to the default.
Some background: When text is stored in files or sent between computers over a socket, the text characters are stored or sent as a sequence of bits, almost always grouped in 8-bit bytes. The characters all have defined numeric values in Unicode, so that 'A' always has the value 0x41 (well, there are actually two other A's in the Unicode character set, in the Greek and Russian alphabets, but that's not relevant).

But there are many mechanisms for how those numeric codes are translated to a sequence of bits when storing in a file or sending to another computer. In UTF-8, 0x41 is represented as 8 bits (the byte 0x41), but other numeric values (code points) will be converted to 16 or more bits with an algorithm that rearranges the bits; in UTF-16, 0x41 is represented as 16 bits; and there are other encodings like JIS and some which are capable of representing some but not all of the Unicode characters.

Since String.getBytes() was intended to return a byte array that contains the bytes to be sent to a file or socket, the method needs to know what encoding it's supposed to use when creating those bytes. Basically the encoding will have to be the same one that a program later reading a file, or a computer at the other end of the socket, expects it to be.
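To make that concrete, here is a small sketch showing how the same one-character String produces different byte sequences depending on the charset passed to getBytes():
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

String a = "A";
// UTF-8 (like ASCII and ISO-8859-1) uses a single byte, 0x41
System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));    // [65]
// UTF-16 writes a byte-order mark followed by 0x00 0x41
System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, 65]
// UTF-16BE (big-endian, no BOM) writes 0x00 0x41
System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_16BE))); // [0, 65]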
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was converted to 哈瓦那 (ISO-8859-1).
How do I convert it back to 哈瓦那, considering that all strings in Java are UTF-16? Service B has to compare it as 哈瓦那, not 哈瓦那.
Thanks.
When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to store every Unicode character, but ISO-8859-1 can encode only a small subset of them (Western-language characters). The characters you posted look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.
I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to 哈瓦那 (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
Can you unpick this? It depends where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8, and then decode it correctly. On the other hand, if the first mistaken decoding happens in service A, then you could be out of luck. A mistaken decoding can result in loss of data as unrecognized codes are replaced with some other character. If that happens, the original data will be gone forever.
In either case, the best way to deal with this is to figure out what is getting the wrong character encoding mixed up, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps, the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.
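If the bytes themselves survived (ISO-8859-1 and Windows-1252 map every byte to some character, so nothing is necessarily lost), the damage can sometimes be reversed by re-encoding with the wrong charset and decoding again with the right one. A sketch, assuming the garbled text is still intact in a Java String; note that the sample above actually matches a Windows-1252 reading of the UTF-8 bytes (labels often conflate Windows-1252 with ISO-8859-1), so the sketch uses windows-1252:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String garbled = "哈瓦那"; // UTF-8 bytes of 哈瓦那 mis-decoded as Windows-1252

// Recover the original UTF-8 bytes, then decode them with the right charset
byte[] originalBytes = garbled.getBytes(Charset.forName("windows-1252"));
String repaired = new String(originalBytes, StandardCharsets.UTF_8);

System.out.println(repaired); // 哈瓦那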
Use writers and readers to encode/decode your input/output streams:
String yourText = "...";
OutputStream yourOutputStream = ...;
Writer out = new OutputStreamWriter(yourOutputStream, "UTF-8");
out.write(yourText);
out.flush();
Same for reader.
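For the reading side, the mirror image might look like this (a sketch; yourInputStream stands in for whatever stream you are reading from):
InputStream yourInputStream = ...;
Reader in = new InputStreamReader(yourInputStream, "UTF-8");

// Optionally wrap it to read whole lines of decoded text
BufferedReader reader = new BufferedReader(in);
String line = reader.readLine();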
I have search on my site: we frame the query, send it in the request, and the response comes back from the vendor as JSON. The vendor crawls our site, captures the data from our site, and sends the response. In our design we convert the JSON into a Java object using GSON. We use UTF-8 as the charset in the meta tag.
I have a situation where the response sometimes contains Unicode escapes for special characters, depending on the request. The browser renders these escaped special characters in a strange way. How should I decode these Unicode escapes?
For example, for the special character 'ndash' I see it encoded in the response as '\u2013'.
To clarify the difference between Unicode and a character encoding:
Unicode
is an abstract standard aiming to assign a number (code point) to every character (currently more than 110,000 of them).
Character encoding
defines how a character can be represented by a sequence of bytes;
one such encoding is UTF-8, which uses 1-4 bytes to represent a Unicode character.
A Java String is always UTF-16 internally. Hence, when you construct a String from bytes, you can use the following String constructor:
new String(byte[], encoding)
The second argument should be the encoding the bytes are in when the client sends them. If you don't explicitly define an encoding, you will get the platform default encoding, which you can examine using Charset.defaultCharset().
You can manually set the default encoding as an argument when starting the JVM
-Dfile.encoding="utf-8"
Although rarely needed, you can also employ CharsetDecoder/CharsetEncoder.
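As a small sketch of the constructor and the default charset mentioned above (the byte values are the UTF-8 encoding of the en dash U+2013 from the question; StandardCharsets avoids the checked exception thrown by the charset-name string overloads):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

byte[] fromClient = {(byte) 0xE2, (byte) 0x80, (byte) 0x93}; // UTF-8 bytes of '–' (U+2013)

// Decode with the encoding the client actually used
String s = new String(fromClient, StandardCharsets.UTF_8);
System.out.println(s); // –

// The encoding you would get implicitly if you specified none
System.out.println(Charset.defaultCharset());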
Currently I am using UTF-8 for URL encoding. I want to convert it to UTF-16.
How can I achieve this?
When encoding Unicode characters in URLs, it's necessary to encode them in such a fashion that all URL parsers and consumers can understand your URLs.
To that end, when URLs were extended by RFCs in the wake of the development of Unicode and related standards and tools, it was decided that the encoding to employ for percent-escaping characters was to be UTF-8, as this would mean that established ASCII escapes would Just Work™.
Consequently, even if you could generate URLs with UTF-16-based percent escapes, no other program would be able to understand them, making them useless. In fact, by matter of definition, they wouldn't even be URLs.
There's also the question of why on earth you would want to use UTF-16 for anything, it being silly and all.
Remember: Never Don't Use UTF-8! (N'DUUH!)
URL escapes, as in %nn hex values, encode bytes: 8-bit bytes. If for some very nonstandard reason you want to encode the bytes of UTF-16 instead of UTF-8, you must first pick a byte order (BE or LE). Then you have to write code in your program to take the two bytes of each 16-bit UTF-16 code unit and represent each as a %nn hex escape.
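To illustrate the difference, a sketch (the manual UTF-16BE loop is shown only to make the point; as noted above, nothing standard will decode it):
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

String s = "Δ";

// The standard, interoperable way: percent-escaped UTF-8 bytes
// (URLEncoder.encode with a Charset argument exists since Java 10)
System.out.println(URLEncoder.encode(s, StandardCharsets.UTF_8)); // %CE%94

// Manually percent-escaping UTF-16BE bytes instead; not a valid URL encoding
StringBuilder sb = new StringBuilder();
for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) {
    sb.append(String.format("%%%02X", b & 0xFF));
}
System.out.println(sb); // %03%94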