I have been working on a scenario that does the following:
Get input data in Unicode format; [UTF-8]
Convert to ISO-8559;
Detect and replace characters unsupported by the target encoding; [based on user-defined key-value pairs]
My question is: I have been trying to find in-depth information on ISO-8559, with no luck so far. Does anybody happen to know more about it? How different is it from ISO-8859? Any details would be much appreciated.
Secondly, setting the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I am able to achieve what is needed using character-based replacement, it is obviously time-consuming when the data size is huge. [in MBs]
I am sure there must be a better way to do this. Can someone advise me, please?
I assume you want to convert UTF-8 to ISO-8859-1, that is, Western Latin-1. There are many charset tables on the net.
In general, for web browsers and Windows, it is better to convert to Windows-1252, an extension of Latin-1 that redefines the range 0x80 - 0x9F, among other things with the special typographic quotes seen in MS Word. Browsers are de facto capable of interpreting these codes even in content declared as ISO-8859-1, even on a Mac.
Java's standard conversion, like new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252"), already does much. You can either write a kind of filter, or look for the '?' characters that the conversion introduces for untranslatable special characters. You could translate accented Latin letters that are not in Windows-1252 into plain ASCII letters:
import java.text.Normalizer;

String s = ...
s = Normalizer.normalize(s, Normalizer.Form.NFD); // decompose: é -> e + combining acute accent
return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", ""); // strip the combining marks
For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
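Coming back to the filter idea: below is a minimal single-pass sketch (the class name and the replacement map are invented for illustration, standing in for the question's user-defined key-value pairs). It keeps every character the target charset can encode and substitutes a user-defined replacement, or '?', for the rest:

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class Latin1Filter {
    // Hypothetical user-defined replacements for characters ISO-8859-1 lacks
    private static final Map<Character, String> REPLACEMENTS =
            Map.of('\u20ac', "EUR",    // euro sign
                   '\u2026', "...");   // horizontal ellipsis

    public static String toLatin1(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (encoder.canEncode(c)) {
                out.append(c);
            } else {
                out.append(REPLACEMENTS.getOrDefault(c, "?"));
            }
        }
        return out.toString();
    }
}

A single pass like this should also address the performance concern from the question: repeated String.replace() calls rescan the whole input once per replacement pair, while this visits each character exactly once.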
When printing certain Unicode characters in Java, we get '?' as output. Why is that, and is there any way to print these characters?
This is my code
String symbol1 = "\u200d"; // U+200D ZERO WIDTH JOINER
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5, symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you have, or the characters supported on the screen.
I would check which encoding you are using throughout, and try to determine whether you are reading, storing, and printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
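If the output stream's encoding is the culprit, one option is to wrap System.out with an explicit charset. A minimal sketch, assuming a terminal that itself renders UTF-8 (this PrintStream constructor requires Java 10+):

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class PrintUnicodeDemo {
    public static void main(String[] args) {
        // System.out's default encoding may not cover U+200D and prints '?' instead
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        StringBuilder strg = new StringBuilder("unicodecharacter");
        strg.insert(5, "\u200d");
        out.println(strg);
    }
}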
Java's default behaviour when reading an invalid Unicode character is to replace it with the replacement character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as unicode, it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
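A quick way to see this behaviour for yourself (a self-contained demo; the byte 0xE9 is 'é' in ISO-8859-1 but is not valid UTF-8 on its own):

import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};                          // 'é' in ISO-8859-1
        String s = new String(latin1, StandardCharsets.UTF_8);  // invalid as UTF-8
        System.out.println((int) s.charAt(0));                  // 65533, i.e. U+FFFD
    }
}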
I wrote an open-source library that has a utility that converts any String to a Unicode sequence and vice versa. It helps to diagnose such issues. For instance, to print your String you could use something like this:
String str = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
        StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it, and how to use it in Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison; see the paragraph "String Unicode converter".
I am having some trouble with encoding this string into barcode symbology - Code 128.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo" the character between the semi-colon and o (let us call it special A) - is not part of the extended character set used in Code128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output with the Special A on a webpage with a font face for barcodes, the special A character does not show up as the barcode (and that seems correct since the special A is not part of the character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string to 128c symbology. If you can share code to do the encoding (in Java/Python/C/Perl) that would help too.
There are multiple fonts for Code128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one: http://www.jtbarton.com/Barcodes/Code128.aspx (there is also sample code on the site showing how to encode it, but you will have to translate it from VB). The font works for all three encodings (A, B, and C).
Sorry, this is very late.
When you are dealing with the encoding of code 128, in any subset, it's a good idea to think of that coding in terms of numbers, not characters. At this level, when you have shifts, code-changes, checksums and stuff, intermixed with the data, the whole concept of "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27"
The lowercase o corresponds to "79" and the P to "48"
The "A with Macron" corresponds to your "00" sequence. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00? That would be a space or a NUL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to make the character 'visible' by adding 0x80 to it. If you look at charmap, you will see that code 0x80 is indeed A with macron.
The rest (indeed all) of your encoded string looks correct for a set-C encodation.
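Since the question also asks for code: here is a minimal Java sketch of set-C encodation that stays at the level of numbers, as suggested above. It returns the raw symbol values (start code, digit pairs, checksum, stop); mapping those values to font glyphs is left out deliberately, because it differs between fonts (a common convention is value + 32 for values 0-94, with font-specific glyphs above that):

public class Code128C {
    static final int START_C = 105, STOP = 106;

    // Returns the Code 128 symbol values for an even-length digit string
    public static int[] encode(String digits) {
        if (digits.length() % 2 != 0) {
            throw new IllegalArgumentException("set C needs an even number of digits");
        }
        int n = digits.length() / 2;
        int[] values = new int[n + 3]; // start + data + checksum + stop
        values[0] = START_C;
        int checksum = START_C;
        for (int i = 0; i < n; i++) {
            int v = Integer.parseInt(digits.substring(2 * i, 2 * i + 2)); // digit pair -> 0..99
            values[i + 1] = v;
            checksum += v * (i + 1); // data weights are 1..n (the start code counts once)
        }
        values[n + 1] = checksum % 103;
        values[n + 2] = STOP;
        return values;
    }
}

For the string in the question, the data values come out as 10, 21, 44, 86, ..., 27, 00, 79, 48, ..., which matches the ";ÂoP" region discussed above.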
I'm dealing with an external web service that is giving me incorrectly encoded (and/or corrupted) Strings (UTF-8) that were most likely either ISO Latin or Windows-1252 but are now UTF-8 (and/or a mixture of ISO/Windows/UTF-8). Lovely A-hats (Â) abound.
I obviously cannot fix how the external web service stores its strings, so the information is lost; I know that hopes of a 100% translation are not realistic.
But I was hoping that someone had written a heuristic character-mapping library in Java (it's unlikely someone would type A-hats).
If not, I guess I can port this guy's PHP code: https://stackoverflow.com/a/3521340/318174
UPDATE and explanation: a simple conversion like the one @VGR answered with will not work. I do not have the original bytes. The data was converted incorrectly at the endpoint (on the SOAP server, perhaps a getBytes() without the correct encoding was done, or perhaps the data is stored in the incorrect format). When you convert bytes to Strings and back in Java, the data is not retained unless the encoding is the same everywhere. This is easy to understand if you think of something like ASCII <-> UTF-8. With Windows-1252 or ISO Latin it is much more complicated, because data is not so much lost as confused: characters from those encodings can become two bytes in UTF-8, and their upper ranges are not a subset of UTF-8.
If you don't believe me, try round-tripping getBytes() with various encodings and you will see data corruption and data loss.
I may be misunderstanding the nature of the incorrectly encoded data, but that PHP code seems like overkill to me. If you have UTF-8 bytes that were passed as individual characters, you should be able to just do:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

static String fix(String s) {
    // Recover the mis-decoded windows-1252 bytes, then re-decode them as UTF-8
    byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
    return new String(bytes, StandardCharsets.UTF_8);
}
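For example, under that assumption, fix("Ã©") returns "é": the UTF-8 encoding of "é" is the byte pair 0xC3 0xA9, and those two bytes read as windows-1252 are exactly "Ã©".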
My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird special symbols like 'Â'.
I've tried converting the String to ASCII, but that messes up å, ä, ö and Ø, which I use.
I've tried food = food.replace("Â", ""); and indexOf(), but the String won't find it, even though it's there in a hex editor.
So, in summary: when I use text.setText() on Android, the output looks fine with no weird symbols, but when I save the text to a *.txt file I get about four 'Â' characters. I do not want ASCII, because I use other non-ASCII characters.
The 'Â' is displayed as whitespace on my Android and in Notepad.
Thanks!
Have a great weekend!
EDIT:
Solved it by replacing all non-breaking spaces with regular spaces:
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
in = new BufferedReader(new InputStreamReader(url.openStream(),"UTF-8"));
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume "UTF-8" if the response doesn't supply a content type with a valid encoding.
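A sketch of that content-type check (the header parsing here is deliberately naive and ignores quoted parameter values; adapt as needed):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

static BufferedReader openReader(URL url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String contentType = conn.getContentType(); // e.g. "text/html; charset=ISO-8859-1"
    Charset charset = StandardCharsets.UTF_8;   // only assumed when the server says nothing
    if (contentType != null) {
        for (String param : contentType.split(";")) {
            param = param.trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                charset = Charset.forName(param.substring(8));
            }
        }
    }
    return new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));
}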
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
It is not clear what encoding you are using to write the file. Perhaps you have chosen the wrong one.
It is possible that the display tool is assuming that the file has a different encoding. Maybe it detects that a file is UTF-8 or UTF-16 if there is a BOM.
It is possible that the display tool is plain broken, and doesn't understand non-breaking spaces.
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won't deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
Finally, I think I understand why you might be getting 'Â' characters. The Unicode NON-BREAKING SPACE character is U+00A0. When you encode that as UTF-8, you get C2 A0. But 0xC2 in Latin-1 is CAPITAL A WITH CIRCUMFLEX, and 0xA0 in Latin-1 is NON-BREAKING SPACE. So the "confusion" is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
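That mix-up is easy to reproduce (a self-contained demo of the explanation above):

import java.nio.charset.StandardCharsets;

public class NbspDemo {
    public static void main(String[] args) {
        // U+00A0 (non-breaking space) encodes to the UTF-8 bytes C2 A0 ...
        byte[] utf8 = "\u00a0".getBytes(StandardCharsets.UTF_8);
        // ... which, misread as Latin-1, become 'Â' followed by a non-breaking space
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread); // prints Â plus an (invisible) NBSP
    }
}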