We're doing the following:
Open a Reader for a file, using some specified encoding.
Read in each line, parsing it as CSV.
For certain columns in the CSV data, pass the value to JSoup to strip out HTML, as below:
public String apply(@Nullable String input) {
    Document document = Jsoup.parse(input);
    return document.text();
}
This works great, except in the presence of numeric character references, such as &#160;. What seems to be happening is that since we necessarily must do the JSoup call after we've figured out the encoding (to get the CSV parsing to work), by the time JSoup converts hard-coded bytes into characters we're working with the wrong character set. Byte 160 (0xA0) is a non-breaking space in windows-1252, but is not valid on its own in a UTF-8 stream, so we get bad data when JSoup replaces the numeric character reference with a byte.
Is there a way around this? It would require JSoup to be given a 'source encoding' for numeric character references, or something like that.
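The byte-level claim in the question can be checked with a short sketch (plain JDK, no JSoup; the class name is just for illustration). The reference &#160; denotes U+00A0, which is a single byte in windows-1252 but two bytes in UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NbspBytes {
    public static void main(String[] args) {
        // &#160; denotes U+00A0, the no-break space
        String nbsp = "\u00A0";

        // In windows-1252 it is the single byte 0xA0 ...
        byte[] win1252 = nbsp.getBytes(Charset.forName("windows-1252"));
        System.out.println(win1252[0] & 0xFF); // 160

        // ... but in UTF-8 it is the two-byte sequence C2 A0
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2
    }
}
```

So if JSoup's output charset does not match the charset you later use to interpret the bytes, this character is exactly the kind that breaks.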
Try calling the following before text():
document.outputSettings().charset("windows-1252");
For more output settings see the javadoc.
While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1 = "\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5, symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding that doesn't match the character you have, or the characters supported by the screen.
I would check which encoding you are using throughout, and try to determine whether you are reading, storing, or printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8, or as ISO 8859-1 if you are dealing with European characters.
Java's default behaviour when reading an invalid byte sequence is to replace it with the replacement character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as Unicode; it's encoded as something else (windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
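A minimal sketch of that substitution (the byte value 0x94 is just an example: a curly quote in windows-1252, but an invalid lone byte in UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // 0x94 is a right double quote in windows-1252, but by itself
        // it is a malformed sequence in UTF-8 (a stray continuation byte)
        byte[] win1252Bytes = { (byte) 0x94 };

        // Decoding it as UTF-8 silently substitutes U+FFFD
        String decoded = new String(win1252Bytes, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0)); // 65533 = U+FFFD
    }
}
```

Many fonts and consoles render U+FFFD as a question mark (or a question mark in a diamond), which is why the symptom looks like '?'.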
I wrote an open-source library that has a utility to convert any String to a Unicode sequence and vice versa. It helps to diagnose such issues. For instance, to print your String you can use something like this:
String str = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
        StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it, and how to use it in the article "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison". See the paragraph "String Unicode converter".
So I'm currently using the Apache Commons Lang library.
When I tried unescaping this string: &#128512;
it returned the same string: &#128512;
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a String with a smaller code point, it worked:
String characters = "&#12531;";
StringEscapeUtils.unescapeHtml(characters);
Output: γ³
Any ideas? When I tried unescaping the String "&#128512;" with an online unescaping utility, it worked, so maybe it's a bug in the Apache Commons Lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the String successfully. The problem now is that when I try to escape the result of that unescape, it won't bring back the original String (&#128512;).
unescapeHtml() leaves &#128512; untouched because, as the documentation says, it only unescapes HTML 4.0 entities, which are limited to the first 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML supports up to 1,114,111 (10FFFFh) character entities (link).
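That difference can be demonstrated with a plain-JDK sketch (no Commons Lang; the helper name is mine) that decodes a decimal numeric reference the way an XML unescaper would, using Character.toChars to handle supplementary code points:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericRefDemo {

    // Decodes decimal numeric character references such as &#128512;
    // (roughly what an XML unescaper does for this form).
    static String decodeDecimalRefs(String input) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // toChars emits a surrogate pair for code points above U+FFFF
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(sb, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String grin = decodeDecimalRefs("&#128512;");
        System.out.println(grin.codePointAt(0)); // 128512
        System.out.println(grin.length());       // 2: one code point, two chars
    }
}
```

The resulting String is one code point but two Java chars, which matters when you measure or slice it.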
This is the Unicode character U+1F600 (128512), GRINNING FACE.
Refer to the URL for details.
The String you have mentioned is the HTML escape of U+1F600. If you unescape it using Apache Commons Lang, it will draw the required smiley, as shown in the screenshot.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
Regarding your update that it's not converting back to &#128512;:
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
A good site for this
I have added a few System.out.println calls to help you understand this Unicode handling better.
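To make the surrogate-pair point above concrete, here is a small sketch (the class name is illustrative) showing how U+1F600 is actually stored in a Java String:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Build a String from the supplementary code point U+1F600
        String grin = new String(Character.toChars(0x1F600));

        System.out.println(grin.length());        // 2: stored as a surrogate pair
        System.out.println((int) grin.charAt(0)); // 55357 = 0xD83D, high surrogate
        System.out.println((int) grin.charAt(1)); // 56832 = 0xDE00, low surrogate
        System.out.println(grin.codePointCount(0, grin.length())); // 1 code point
    }
}
```

Any escaping or unescaping code that walks the String char by char, rather than code point by code point, will mishandle such characters.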
Well, the solution is pretty easy:
use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead (unless you're using Java < 1.5, which you probably aren't):
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml4(characters);
I think the problem is that unescapeHtml() does not recognise the numeric reference in "&#128512;" (it is outside the HTML 4.0 entity range), so the method simply returns the string unchanged.
The doc of the function only says:
Returns: a new unescaped String, null if null string input
If it's a HTML specific question, then you can just use JavaScript for this purpose.
You can do:
escape("&#128512;"), which gives you %26%23128512%3B
unescape("%26%23128512%3B"), which gives you back &#128512;
We had a CLOB column in the DB. When we extract this CLOB and try to display it (as plain text, not HTML), it prints some junk characters on the HTML screen. When the character is streamed directly to a file it looks like ” (not the usual double quote on a regular keyboard).
One more observation:
System.out.println("\u201D".getBytes()[0]); // "\u201D" is the ” character
prints -108.
Why should a character's byte be in the negative range? And is there any way to display it correctly on an HTML screen?
Re: your final observation: Java bytes are always signed. To interpret them as unsigned, you can bitwise-AND them with an int:
byte[] bytes = "\u201D".getBytes("UTF-8");
for (byte b : bytes) {
    System.out.println(b & 0xFF); // mask off the sign extension
}
which outputs:
226
128
157
Note that your string is actually three bytes long in UTF-8.
As pointed out in the comments, it depends on the encoding. For UTF-16 you get:
254
255
32
29
and for US-ASCII or ISO-8859-1 you get
63
which is a question-mark (i.e. "I dunno, some new-fangled character"). Note that:
"The behavior of this method [getBytes()] when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required."
I think it is better to print the character code, like this:
System.out.println((int) '\u201D'); // result is 8221
This link can help explain this extraordinary double quote (including its HTML code).
To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document or entity-ize the non-ascii characters.
To set the encoding you have two options.
1. Update your web server to send an appropriate charset argument in the Content-Type header. The correct header would be Content-Type: text/html; charset=UTF-8.
2. Add a <meta charset="UTF-8" /> tag to the head section of your page.
Keep in mind that Option 1 will take precedence over option 2. I.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.
The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.
Generally, if you are going to entity-ize dynamic content out of a database that contains unknown characters, you're best off just using the code-point versions of the entities, as you can easily write a method to convert any character above 127 to its appropriate code point.
One of the systems I currently work on actually ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset which could convert a stream of Java characters into an ASCII-encoded byte stream with all non-ASCII characters converted to entities. Then we just wrapped the output stream in a Writer with that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but simply doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
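The "convert any character above 127" approach mentioned above can be sketched like this (a minimal version; the method and class names are mine, and it emits decimal references only):

```java
public class EntityizeDemo {

    // Replaces every code point above 127 with its decimal
    // numeric character reference, e.g. U+201D becomes &#8221;
    static String entityize(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(entityize("a\u201Db")); // a&#8221;b
    }
}
```

Iterating with codePoints() rather than charAt() is what takes care of surrogate pairs, the gotcha mentioned above: a supplementary character comes through as one code point, not two broken halves.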
My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, the text shows extra weird symbols like 'Γ'.
I've tried converting the String to ASCII, but that messes up Γ₯, Γ€, ΓΆ and Γ , which I use.
I've tried food = food.replace("Γ", ""); and indexOf(),
but the String won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText() (Android), the output looks fine with no weird symbols, but when I save the text to *.txt I get about four 'Γ' characters. I do not want ASCII, because I use other non-ASCII characters.
The 'Γ' is displayed as whitespace on my Android and in Notepad.
Thanks!
Have a great weekend!
EDIT:
Solved it by replacing all non-breaking spaces with regular spaces:
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
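A sketch of extracting the encoding from the content type (the helper and class names are mine; getContentType() is the standard URLConnection method):

```java
public class CharsetFromContentType {

    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=ISO-8859-1", falling back to UTF-8
    // when the response doesn't supply one.
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                // case-insensitive match on the "charset=" prefix
                if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                    return param.substring(8).trim();
                }
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetOf("text/html"));                     // UTF-8
        System.out.println(charsetOf(null));                            // UTF-8
    }
}
```

You would then build the reader as new InputStreamReader(conn.getInputStream(), charsetOf(conn.getContentType())) instead of hard-coding "UTF-8".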
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it only detects a file as UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting Γ characters. The Unicode NON-BREAKING SPACE character is U+00A0. When you encode that as UTF-8, you get C2 A0. But C2 in Latin-1 is CAPITAL A WITH CIRCUMFLEX (Γ), and A0 in Latin-1 is NON-BREAKING SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
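That explanation can be verified with a short sketch (the class name is illustrative): encode a non-breaking space as UTF-8, then misread the bytes as Latin-1:

```java
import java.nio.charset.StandardCharsets;

public class NbspDemo {
    public static void main(String[] args) {
        // U+00A0 NON-BREAKING SPACE encodes to two bytes in UTF-8
        byte[] utf8 = "\u00A0".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF); // C2 A0

        // Reading those UTF-8 bytes back as Latin-1 yields two characters
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println((int) misread.charAt(0)); // 194 = 'Γ'
        System.out.println((int) misread.charAt(1)); // 160 = non-breaking space
    }
}
```

One character in, two characters out: exactly the stray Γ followed by what looks like whitespace.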