Unicode issue with an HTML Title, question mark? &#65533; - java

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/
When I use the StringEscapeUtils.escapeHtml method from apache.commons.lang on the title element, I get the following:
Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010
However, when I display that in my webpage with UTF-8 encoding, it just shows a question mark.
Using the following code:
String title = StringEscapeUtils.escapeHtml(myTitle);
If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27, I get the following output, which seems correct:
TITLE:
<title>Das hermetische Café: Rock & Wrestling 2010</title>
BECOMES (which I was expecting the escapeHtml method to do):
<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>
Any ideas? Thanks.

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.
One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).
So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.
The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
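To make the mismatch concrete, here is a minimal sketch (the class and method names are made up for illustration) that decodes the same ISO-8859-1 bytes twice, once with the wrong decoder and once with the right one:

```java
import java.nio.charset.StandardCharsets;

public class DecoderMismatchDemo {
    // "Cafe:" with an e-acute, as ISO-8859-1 bytes; the single byte 0xE9 is the e-acute.
    static final byte[] LATIN1_BYTES = {'C', 'a', 'f', (byte) 0xE9, ':'};

    // Wrong decoder: 0xE9 followed by ':' is not a valid UTF-8 sequence,
    // so the decoder substitutes U+FFFD for it.
    static String decodeAsUtf8() {
        return new String(LATIN1_BYTES, StandardCharsets.UTF_8);
    }

    // Right decoder: every byte maps directly to the code point with the same value.
    static String decodeAsLatin1() {
        return new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println((int) decodeAsUtf8().charAt(3));   // 65533
        System.out.println((int) decodeAsLatin1().charAt(3)); // 233
    }
}
```

Once the string contains 65533, no later escaping step can recover the original character; the fix has to happen at decode time.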

The decoder (charset) can also be specified on Java stream readers such as InputStreamReader, which has constructors that take a charset so the reader knows how to decode the bytes entering the stream. I agree with the answer Erickson gave.
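A rough sketch of passing the charset to InputStreamReader follows. The helper name readFirstLine is hypothetical, and a ByteArrayInputStream stands in for the fetched page so the example runs without a network connection:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderCharsetDemo {
    // Reads the first line of a byte stream, decoding it with the given charset.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the fetched page: a title encoded as ISO-8859-1.
        byte[] page = "<title>Das hermetische Caf\u00E9</title>".getBytes(StandardCharsets.ISO_8859_1);
        String line = readFirstLine(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);
        System.out.println(line.indexOf('\uFFFD')); // -1, no replacement character
    }
}
```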

Related

? is the only output for all unicode above U+0080 in java [duplicate]

When printing certain Unicode characters in Java, we get '?' as output. Why is that, and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using throughout and try to determine whether you are reading, storing or printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
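As a sketch of what explicitly encoding the output can look like: the helper printWith below is hypothetical and writes to a byte buffer so the effect is visible, but the same PrintStream constructor can wrap System.out. Whether the character then shows up still depends on the console's own encoding and font.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class ExplicitOutputEncoding {
    // Prints text through a PrintStream using the named encoding and returns the bytes written.
    static byte[] printWith(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encoding)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // An encoding that cannot represent U+200D substitutes '?' (byte 63).
        System.out.println(Arrays.toString(printWith("\u200d", "US-ASCII"))); // [63]
        // UTF-8 can represent it, as three bytes.
        System.out.println(printWith("\u200d", "UTF-8").length); // 3

        // Wrapping System.out the same way:
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("unico\u200ddecharacter");
    }
}
```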
Java's default behaviour when decoding an invalid byte sequence is to replace it with the Replacement Character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is probably not encoded in the Unicode encoding you are decoding it with; it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
I wrote an open source library that has a utility that converts any String to a Unicode sequence and vice versa. It helps to diagnose such issues. So, for instance, to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it and how to use it in the article "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison". See the paragraph "String Unicode converter".
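If adding a library is not an option, the same diagnostic idea can be sketched with the plain JDK. This is not the library's API; toEscapes is a made-up helper that shows every character outside printable ASCII as a backslash-u escape, so invisible characters become visible:

```java
public class UnicodeEscapeDump {
    // Keeps printable ASCII as-is and renders every other UTF-16 code unit
    // as a backslash-u escape with four hex digits.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The zero-width joiner from the question shows up as an escape
        // between "unico" and "decharacter".
        System.out.println(toEscapes("unico\u200ddecharacter"));
    }
}
```

The output is pure ASCII, so it prints the same regardless of the console encoding.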

Encoding a string in 128c barcode symbology

I am having some trouble with encoding this string into barcode symbology - Code 128.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo" the character between the semi-colon and o (let us call it special A) - is not part of the extended character set used in Code128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output with the Special A on a webpage with a font face for barcodes, the special A character does not show up as the barcode (and that seems correct since the special A is not part of the character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string to 128c symbology. If you can share code to do the encoding (in Java/Python/C/Perl) that would help too.
There are multiple fonts for Code128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one: http://www.jtbarton.com/Barcodes/Code128.aspx (the site also has sample code showing how to encode it, but you have to translate it from VB). The font works for all three encodings (A, B and C).
Sorry, this is very late.
When you are dealing with the encoding of code 128, in any subset, it's a good idea to think of that coding in terms of numbers, not characters. At this level, when you have shifts, code-changes, checksums and stuff, intermixed with the data, the whole concept of "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27"
The lowercase o corresponds to "79" and the P to "48"
The "Â" (strictly an A with circumflex) corresponds to your "00" sequence. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00? That would be a space or NUL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to make the value 'visible' by mapping it to a character in the upper half of the 8-bit range. If you look at charmap, you will see that Â is code 0xC2.
The rest (indeed all) of your encoded string looks correct for Set C encoding.
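Since the question asked for code, here is a minimal Java sketch of Set C encoding worked in numbers first and characters last. The start code (105), stop code (106) and the modulo-103 weighted checksum are standard Code 128. The value-to-character mapping in toFontChar is an assumption inferred from the sample output in the question; other Code 128 fonts use different mappings, so check it against your font.

```java
public class Code128CSketch {
    // Maps a Code 128 symbol value (0..106) to a font character.
    // ASSUMPTION: inferred from the sample output in the question
    // (0 -> U+00C2, 1..94 -> value + 32, 95..106 -> value + 100).
    static char toFontChar(int value) {
        if (value == 0) return '\u00C2';
        if (value <= 94) return (char) (value + 32);
        return (char) (value + 100);
    }

    // Encodes an even-length digit string as: Start C, digit pairs, checksum, Stop.
    static String encode(String digits) {
        if (digits.length() % 2 != 0 || !digits.matches("[0-9]+")) {
            throw new IllegalArgumentException("Set C needs an even number of digits");
        }
        final int startC = 105;
        final int stop = 106;
        StringBuilder out = new StringBuilder().append(toFontChar(startC));
        int sum = startC; // the start code enters the checksum with weight 1
        for (int i = 0, pos = 1; i < digits.length(); i += 2, pos++) {
            int value = Integer.parseInt(digits.substring(i, i + 2));
            sum += pos * value; // each data symbol is weighted by its 1-based position
            out.append(toFontChar(value));
        }
        out.append(toFontChar(sum % 103)).append(toFontChar(stop));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("1021448642241082212700794828592311"));
    }
}
```

For the string in the question the weighted sum is 5576, so the checksum value is 5576 mod 103 = 14, which this mapping renders as '.'; that matches the second-to-last character of the encoder output quoted above.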

Character encoding issues?

We have a CLOB column in our DB. When we extract this CLOB and try to display it (as plain text, not HTML), it prints some junk characters on the HTML screen. The character, when streamed directly to a file, looks like ” (not the usual double quote on a regular keyboard).
One more observation:
System.out.println("”".getBytes()[0]);
prints -108.
Why should a character's byte be in the negative range? Is there any way to display it correctly on an HTML screen?
Re: your final observation - Java bytes are always signed. To interpret them as unsigned, you can bitwise AND them with an int:
byte[] bytes = "”".getBytes("UTF-8");
for (byte b : bytes) {
    System.out.println(b & 0xFF);
}
which outputs:
226
128
157
Note that your string is actually three bytes long in UTF-8.
As pointed out in the comments, it depends on the encoding. For UTF-16 you get:
254
255
32
29
and for US-ASCII or ISO-8859-1 you get
63
which is a question-mark (i.e. "I dunno, some new-fangled character"). Note that:
The behavior of this method [getBytes()] when this string cannot be
encoded in the given charset is unspecified. The CharsetEncoder class
should be used when more control over the encoding process is
required.
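As a sketch of the extra control the CharsetEncoder class gives you (the class and helper names here are made up), you can ask up front whether a string fits a charset, or encode strictly so that an unmappable character raises an error instead of silently becoming '?':

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncodingDemo {
    // True if every character of the text can be represented in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Encodes strictly: fails loudly on characters ISO-8859-1 cannot represent.
    static byte[] encodeLatin1Strict(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Not encodable as ISO-8859-1", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("\u201D"));     // false: the curly quote is not in Latin-1
        System.out.println(fitsLatin1("caf\u00E9"));  // true
    }
}
```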
I think it would be better to print the character code, like this:
System.out.println((int)'”');//result is 8221
This link can help you to explain this extraordinary double quote (include html code).
To answer your question about displaying the character correctly in an HTML document, you need to do one of two things: either set the encoding of the document or entity-ize the non-ascii characters.
To set the encoding you have two options:
1. Update your web server to send an appropriate charset argument in the Content-Type header. The correct header would be Content-Type: text/html; charset=UTF-8.
2. Add a <meta charset="UTF-8" /> tag to the head section of your page.
Keep in mind that Option 1 will take precedence over option 2. I.e. if you are already setting an incorrect charset in the header, you can't override it with a meta tag.
The other option is to entity-ize the non-ASCII characters. For the quote character in your question you could use &rdquo; or &#8221; or &#x201D;. The first is a user-friendly named entity, the second specifies the Unicode code point of the character in decimal, and the third specifies the code point in hex. All are valid and all will work.
Generally if you are going to entity-ize dynamic content out of a database that contains unknown characters you're best off just using the code point versions of the entities as you can easily write a method to convert any character >127 to its appropriate code point.
One of the systems I currently work on actually ran into this issue: we took data from a UTF-8 source and had to serve HTML pages with no control over the Content-Type header. We ended up writing a custom Java Charset which could convert a stream of Java characters into an ASCII-encoded byte stream with all non-ASCII characters converted to entities. Then we just wrapped the output stream in a Writer with that Charset and output everything as usual. There are a few gotchas in implementing a Charset correctly, but simply doing the encoding yourself is pretty straightforward; just be sure to handle surrogate pairs correctly.
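A minimal sketch of the "do the encoding yourself" approach follows. It is not the custom Charset described above; toNumericEntities is a hypothetical helper that only handles non-ASCII code points and deliberately leaves &, < and > alone, so it assumes the input is already HTML-escaped.

```java
public class HtmlNumericEntities {
    // Replaces every code point above 127 with a decimal numeric character reference.
    static String toNumericEntities(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt reads a surrogate pair as a single code point.
            int cp = text.codePointAt(i);
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("\u201D"));        // &#8221;
        System.out.println(toNumericEntities("\uD83D\uDE00"));  // &#128512;
    }
}
```

Iterating by code point rather than by char is what keeps a surrogate pair from being emitted as two invalid entities.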

String Encoding doesn't output all characters

My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird special symbols like 'Â'.
I've tried converting the String to ASCII, but that messes up å, ä, ö, Ø, which I use.
I've tried food = food.replace("Â", ""); and indexOf();
but the string search won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText (Android), the output looks fine with NO weird symbols, but when I save the text to *.txt I get about 4 of 'Â'. I do not want ASCII because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android and in Notepad.
Thanks!
Have A great Weekend!
EDIT:
Solved it by removing all Non-breaking-spaces:
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
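A rough sketch of extracting the encoding from a Content-Type value might look like this. The helper charsetFrom is hypothetical and the parsing is deliberately simple (it does not cover every legal header form); it falls back to UTF-8 when no usable charset is present.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or UTF-8 if none is usable.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive check for a "charset=" parameter.
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetFrom("text/html"));                     // UTF-8
    }
}
```

With a connection object conn, the reader would then be created as new InputStreamReader(conn.getInputStream(), charsetFrom(conn.getContentType())).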
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it only detects that a file is UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting Â characters. A Unicode NON-BREAKING-SPACE character is U+00A0. When you encode that as UTF-8, you get C2 A0. But C2 in Latin-1 is CAPITAL-A-CIRCUMFLEX, and A0 in Latin-1 is NON-BREAKING-SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
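The byte-level explanation can be reproduced in a few lines. This sketch (names are illustrative) encodes a string containing a non-breaking space as UTF-8 and then decodes those bytes as Latin-1, which is what the viewing tool is presumably doing:

```java
import java.nio.charset.StandardCharsets;

public class NbspMojibakeDemo {
    // Encodes text as UTF-8, then (wrongly) decodes those bytes as Latin-1.
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = utf8ReadAsLatin1("a\u00A0b");
        // Prints the code points one per line: U+0061, U+00C2, U+00A0, U+0062.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The single non-breaking space turns into two characters, an A with circumflex followed by a non-breaking space, which matches the symptom in the question.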
