Character encoding issues?

We have a CLOB column in the database. When we extract this CLOB and try to display it (as plain text, not HTML), some characters show up as junk on the HTML page. When streamed directly to a file, the character looks like ” (not the usual double quote on a regular keyboard).
One more observation:
System.out.println("”".getBytes()[0]);
prints -108.
Why would a character's byte be in the negative range? Is there any way to display it correctly on an HTML page?

Re: your final observation - Java bytes are always signed. To interpret them as unsigned, you can bitwise AND them with an int:
byte[] bytes = "”".getBytes("UTF-8");
for(byte b: bytes)
{
System.out.println(b & 0xFF);
}
which outputs:
226
128
157
Note that your string is actually three bytes long in UTF-8.
As pointed out in the comments, it depends on the encoding. For UTF-16 you get:
254
255
32
29
and for US-ASCII or ISO-8859-1 you get
63
which is a question-mark (i.e. "I dunno, some new-fangled character"). Note that:
The behavior of this method [getBytes()] when this string cannot be
encoded in the given charset is unspecified. The CharsetEncoder class
should be used when more control over the encoding process is
required.
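To make that last quote concrete, here is a minimal sketch of using CharsetEncoder instead of getBytes(). The class and method names (StrictEncodeSketch, fitsInAscii, encodeAsciiStrict) are made up for illustration. The idea is that the encoder can report an unmappable character instead of silently turning it into a question mark:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
public class StrictEncodeSketch {

    // Returns true only if every character of the text exists in US-ASCII.
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Encodes to US-ASCII and throws instead of substituting '?'.
    static byte[] encodeAsciiStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String quote = "\u201D"; // the right double quotation mark from the question
        System.out.println("fits in ASCII: " + fitsInAscii(quote));
        try {
            encodeAsciiStrict(quote);
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```

Whether you want an exception or a replacement depends on the use case; CodingErrorAction.REPLACE and CodingErrorAction.IGNORE are the other available actions.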

I think it is better to print the character code like this:
System.out.println((int) '”'); // result is 8221
This link explains this unusual double quote (including its HTML code).

To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document or entity-ize the non-ascii characters.
To set the encoding you have two options.
1. Update your web server to send an appropriate charset argument in the Content-Type header. The correct header would be Content-Type: text/html; charset=UTF-8.
2. Add a <meta charset="UTF-8" /> tag to the head section of your page.
Keep in mind that option 1 takes precedence over option 2, i.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.
The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.
Generally if you are going to entity-ize dynamic content out of a database that contains unknown characters you're best off just using the code point versions of the entities as you can easily write a method to convert any character >127 to its appropriate code point.
One of the systems I currently work on actually ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset which could convert a stream of Java characters into an ASCII-encoded byte stream with all non-ASCII characters converted to entities. Then we just wrapped the output stream in a Writer with that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but simply doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
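A minimal sketch of the "do the encoding yourself" approach, assuming you only want decimal code point entities. The class and method names (EntityEscapeSketch, toNumericEntities) are hypothetical and this is not the custom Charset described above. It walks the string by code point so that surrogate pairs come out as a single entity:

```java
// Hypothetical helper, shown only to illustrate the idea.
public class EntityEscapeSketch {

    // Replaces every code point above 127 with a decimal numeric entity.
    // Note: this does NOT escape '<', '>' or '&'; that is a separate step.
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i); // joins a surrogate pair into one value
            if (codePoint > 127) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            i += Character.charCount(codePoint); // 1 or 2 chars consumed
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("say \u201Dhi\u201D"));
    }
}
```

Iterating with charAt() instead of codePointAt() would emit two bogus entities for each character outside the Basic Multilingual Plane, which is the surrogate pair gotcha mentioned above.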

Related

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

I have read the other posts on this issue, but the solutions they presented did not work for me. Actually, the official Java documentation also did not work as intended (I am using Java 11) : https://docs.oracle.com/javase/tutorial/i18n/text/string.html
My problem is that I am reading one byte at a time from a byte buffer, putting that in a byte array, and making a String out of that byte array. The bytes I read are from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array with ISO-8859-1 bytes and the Java String I end up getting is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this :
The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replaced by the "C3A9" bytes.
The first step works correctly, I do this : maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get the right bytes that I want, the UTF-8 encoding of the string, that's good :
The problem comes in here , when I try to make a STRING out of these new (and GOOD) bytes, like this :
new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
The old bytes are back?! It's as if "getBytes(UTF-8)" never actually happened. That is NOT what the documentation says should happen... what am I missing here? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on here. Where are the bytes from "getBytes"?
How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes? I'm out of alternatives and I badly need to get this done for a professional project... this should be easy!
Note : I have tried alternatives like
ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();
But the exact same thing happens.
Thank you in advance for your help.
EDIT :
The comments explained that Strings in Java 9+ are no longer always represented internally as UTF-16, but may be stored as Latin-1 (why...). I think that is what made me believe the Strings were "internally encoded in Latin-1", when it is just the default representation of the String if we don't specify the encoding we want to use when displaying it.
From what I understand now, the String itself is not bound to any encoding, and you can CHOOSE the encoding you want when it gets written.
Actually my issue is that the String ends up written to an XML file via JAXB marshalling in LATIN-1, and I now think the issue lies over there... I will dig further when I can access my work computer again and report back here.

It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but can be UTF-16, depends on the content of the String).
Quote from the JEP-254 :
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But the internal storage encoding doesn't actually matter. When it is time to be written, the String will use whatever encoding you ask for at the time of writing.
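To illustrate the point, here is a small sketch (the class name TranscodeSketch and its helpers are made up). The encoding only comes into play at the two boundaries: when bytes are decoded into a String and when a String is encoded back to bytes. As an example it uses é, which is the single byte E9 in ISO-8859-1 and the two bytes C3 A9 in UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class TranscodeSketch {

    // Decode with the charset the bytes really are in, then encode with the one you want.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fromDevice = {(byte) 0xE9};            // 'é' as sent in ISO-8859-1
        System.out.println(hex(latin1ToUtf8(fromDevice))); // prints C3 A9

        // Encoding and decoding with the SAME charset is a round trip:
        // you get an equal String back, so nothing appears to change.
        String s = new String(fromDevice, StandardCharsets.ISO_8859_1);
        String roundTrip = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(s.equals(roundTrip));      // prints true
    }
}
```

The round trip in main is why new String(s.getBytes(UTF_8), UTF_8) looked like "the old bytes are back": the String content never changed; only the debugger's view of its internal storage was being inspected.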
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to @VGR, @Eugene and @skomisa for the help.

Need help identifying type of UTF Encoding

I'm having a hard time figuring out which type of Unicode encoding I need to convert to in order to pass data in a POST request. The data would mostly be Chinese characters.
Example String:
的事故事务院治党派驻地是不是
Expected Unicode: %u7684%u4E8B%u6545%u4E8B%u52A1%u9662%u6CBB%u515A%u6D3E%u9A7B%u5730%u662F%u4E0D%u662F
Tried to encode to UTF16-BE:
%76%84%4E%8B%65%45%4E%8B%52%A1%5C%40%5C%40%95%7F%67%1F%8D%27%7B%49%5F%85%62%08%59%1A
Encoded text in UTF-16: %FF%FE%84%76%8B%4E%45%65%8B%4E%A1%52%62%96%BB%6C%5A%51%3E%6D%7B%9A%30%57%2F%66%0D%4E%2F%66
Encoded text in UTF-8: %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF
As you can see, UTF16-BE is the closest, but it only takes 2 bytes and there should be an additional %u in front of every character as shown in the expected unicode.
I've been using URLEncoder method to get the encoded text, with the standard charset encodings but it doesn't seem to return the expected unicode.
Code:
String text = "的事故事务院治党派驻地是不是";
URLEncoder.encode(text, "UTF-16BE");

As Kayaman said in a comment: Your expectation is wrong.
That is because %uNNNN is not a valid URL encoding of Unicode text. As Wikipedia says it:
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C.
So unless your server expects non-standard input, your expectation is wrong.
Instead, use UTF-8. As Wikipedia says it:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
That is however for sending data in a URL, e.g. as part of a GET.
For sending text data as part of an application/x-www-form-urlencoded POST, see the HTML5 documentation:
If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.
Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding.
Otherwise, let the selected character encoding be UTF-8.
Since most web pages ("the document") are presented in UTF-8 these days, that would likely mean UTF-8.
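In Java that boils down to something like the following sketch (the class and method names are made up; URLEncoder.encode with a Charset argument needs Java 10 or later, on older versions pass the charset name "UTF-8" instead):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class FormEncodeSketch {

    // Percent-encodes a value for application/x-www-form-urlencoded using UTF-8.
    static String formEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Each Chinese character becomes three percent-encoded UTF-8 bytes.
        System.out.println(formEncode("的事故事务院治党派驻地是不是"));
    }
}
```

This should give the same output as the "Encoded text in UTF-8" line in the question, which is the form a standards-compliant server expects.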

I think you are overthinking this. The encoding of a text doesn't need to "resemble" the string of Unicode code points of that text in any way. These are two different things.
To send the string 的事故事务院治党派驻地是不是 in a POST request, just write the entire POST request and encode it with UTF-8; the resulting bytes are what is sent to the server as the body of the POST request.
As pointed out by @Andreas, UTF-8 is the default encoding of HTML5, so it's not even necessary to set the accept-charset attribute, because the server will automatically use UTF-8 to decode the body of your request if accept-charset is not set.

When text copied from MS Word is sent to Java via HTML form, strange characters appear and text length increases

I copied the following text from MS Word and pasted it on the HTML input text field
Test…. !! Wow
It appeared correctly on the browser and the length was also 13 characters. But when I submit the form, the text received in Java code is
Testâ¦. !! Wow
with a count of 15. I have a max text field length check in Javascript and in the Java code. Because the text's length increases in Java code, the text might validate in Javascript but fail in Java code. I want the same format in both cases (or at least the same length, so that the validation is consistent)

What we see here as “â¦” results from the three bytes 0xE2 0x80 0xA6, which constitute the UTF-8 encoded representation of “…” U+2026 HORIZONTAL ELLIPSIS. The byte 0xE2 is “â” when interpreted as Latin-1 (ISO-8859-1 or windows-1252) encoded, and similarly 0xA6 is “¦”. What happens to the 0x80 byte is unclear, but maybe it has been filtered out, because in ISO-8859-1 it is a control character.
Thus, apparently the form data is sent as UTF-8 encoded (this normally depends on the encoding of the page containing the form, though it can also be set with the accept-charset attribute in the <form> tag). This is all fine, because UTF-8 is the only way to ensure that all characters are sent properly.
So the problem is on the receiving side. The Java code apparently reads the data assuming it to be in an 8-bit encoding (one byte = one character), but it isn't.
(The reason why the text contains U+2026 is probably autocorrection in Word: by default, Word turns, in keyboard input, three consecutive periods “...” to one character, the ellipsis “…”.)
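The effect can be reproduced in a few lines. This is only a sketch with made-up names (MojibakeSketch, misdecodeAsLatin1, repair), and it assumes the receiving side decodes with ISO-8859-1, which is what the symptoms suggest:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class MojibakeSketch {

    // Simulates a receiver that treats UTF-8 bytes as one-byte-per-character Latin-1.
    static String misdecodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes the damage by recovering the original bytes and decoding them as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "Test\u2026. !! Wow";        // \u2026 is the ellipsis from Word
        String received = misdecodeAsLatin1(typed);
        System.out.println(typed.length());          // prints 13
        System.out.println(received.length());       // prints 15
        System.out.println(repair(received).equals(typed)); // prints true
    }
}
```

The ellipsis takes three bytes in UTF-8, so one character turns into three, which accounts for the jump from 13 to 15. The repair method is only a diagnostic; the proper fix is to make the receiving side decode the request as UTF-8 in the first place.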

This is almost certainly an encoding problem. The characters you're pasting will be UTF-8 (or similar), but they are being treated as ANSI characters when sent. You need to set the encoding for the submit.

Unicode issue with an HTML Title, question mark? 65533;

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/
When I use the apache.commons.lang StringEscapeUtils.escapeHtml method on the title element I get the following:
Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010
However, when I display that in my webpage with UTF-8 encoding it just shows a question mark.
Using the following code:
String title = StringEscapeUtils.escapeHtml(myTitle);
If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output which seems correct
TITLE:
<title>Das hermetische Café: Rock & Wrestling 2010</title>
BECOMES (which I was expecting the escapeHtml method to do):
<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>
any ideas? thanks

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.
One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.
The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
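A small sketch of what happens, using a made-up class name and a shortened version of the title. The bytes are what an ISO-8859-1 page would contain, decoded once with the wrong charset and once with the right one:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class ReplacementCharSketch {

    // Decodes raw page bytes with the given charset.
    static String decode(byte[] pageBytes, Charset charset) {
        return new String(pageBytes, charset);
    }

    public static void main(String[] args) {
        // "Café: Rock" as it is stored in an ISO-8859-1 page (é is the single byte 0xE9).
        byte[] pageBytes = "Caf\u00E9: Rock".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decode(pageBytes, StandardCharsets.UTF_8);
        String right = decode(pageBytes, StandardCharsets.ISO_8859_1);

        // In UTF-8 the lone byte 0xE9 is an invalid sequence, so it becomes U+FFFD (65533).
        System.out.println((int) wrong.charAt(3)); // prints 65533
        System.out.println((int) right.charAt(3)); // prints 233
    }
}
```

Once the é has been replaced by U+FFFD the original byte is gone, so the fix has to happen at decoding time, not in escapeHtml.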

The decoder (charset) can also be specified on Java stream readers such as InputStreamReader, which has constructors that let you state which character encoding the incoming stream uses. I agree with the answer Erickson gave.
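For example, something along these lines (a sketch only; the class and method names are made up, and in real code the InputStream would come from the HTTP connection rather than a byte array):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class ReaderCharsetSketch {

    // Reads the whole stream as text, decoding with the charset the source really uses.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] latin1Page = {'C', 'a', 'f', (byte) 0xE9}; // stand-in for the fetched page
        String title = readAll(new ByteArrayInputStream(latin1Page), StandardCharsets.ISO_8859_1);
        System.out.println(title.length()); // prints 4
    }
}
```

Without the charset argument, InputStreamReader falls back to the platform default charset, which is how this kind of bug tends to show up on one machine and not on another.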

how can Weblogic send page with Unicode charset

99.9% of the pages in my application are using UTF-8 encoding.
However for some special usecase in the client side, I need one of them to use Unicode (2 bytes for each character)
For that matter the header of this page is:
<%@ page language="java" contentType="text/html; charset=unicode"%>
...<my content>...
This implementation works fine and does the job when the application runs on Tomcat and WebSphere. However, when it is deployed on WebLogic, I get the server error:
unsupported encoding: 'unicode': java.io.UnsupportedEncodingException: unicode
Does someone know how I can force Weblogic to send pages in 'Unicode' encoding?

UTF-8 is Unicode. "Unicode" is not a character encoding on its own; it is a character mapping standard (a charset). Your problem lies somewhere else. Maybe you've had problems with GET request encoding. This is often overlooked. You may find this article useful for more background information and complete solutions for getting Unicode to work in a Java EE web application: Unicode - How to get the characters right?
Good luck.
By the way, "2 bytes per character" is characteristic of most of UTF-16 (code points 0x0000 through 0xFFFF are represented in 2 bytes, whereas UTF-8 uses 1, 2 or 3 bytes depending on the subrange). Maybe you just meant to use UTF-16 instead?

Unicode is not a charset, but there are charsets that can represent the characters of the Unicode system. You already know the UTF-8 charset, which encodes each character with 1, 2, 3 or 4 bytes, depending on the position of the character in the system. It seems that you want to use the UTF-16 charset, which encodes each character with 2 or 4 bytes.
Note related to the answer provided by BalusC: here I use the word "charset" as "denominator for the character set encoding part in the Content-Type MIME header". Strictly speaking, the Universal Character Set provided by Unicode is a character set, but we don't strictly specify a character set with the charset moniker.
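The byte counts are easy to check. A minimal sketch with a made-up class name; it uses UTF-16BE rather than plain UTF-16 so that the byte order mark Java prepends for "UTF-16" does not inflate the numbers:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class EncodedLengthSketch {

    // Number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // One character each: U+0041, U+00E9, U+7684, U+1F600 (outside the BMP).
        String[] samples = {"A", "\u00E9", "\u7684", "\uD83D\uDE00"};
        for (String sample : samples) {
            System.out.println("U+" + Integer.toHexString(sample.codePointAt(0)).toUpperCase()
                    + " UTF-8: " + byteLength(sample, StandardCharsets.UTF_8)
                    + " UTF-16: " + byteLength(sample, StandardCharsets.UTF_16BE));
        }
    }
}
```

So "2 bytes for each character" holds for UTF-16 only as long as the text stays within the Basic Multilingual Plane. In the page directive from the question, that would presumably mean charset=UTF-16 rather than charset=unicode, though whether the client copes with it is a separate matter.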
