Java library to fix incorrectly encoded text using heuristics


I'm dealing with an external web service that is giving me incorrectly encoded (and/or corrupted) strings that were most likely originally ISO Latin or Windows-1252 but are now UTF-8 (and/or a mixture of ISO/Windows/UTF-8). Lovely A hats (Â) abound.
I obviously cannot fix how the external web service stores its strings, so the information is lost. I know that a 100% accurate translation is not possible.
But I was hoping that someone had written a heuristic character-mapping library in Java (it's unlikely someone would type A hats on purpose).
If not, I guess I can port this guy's PHP code: https://stackoverflow.com/a/3521340/318174
UPDATE and explanation: A simple conversion like the one @VGR answered with will not work. I do not have the original bytes. The data was converted incorrectly at the endpoint (maybe the SOAP server called getBytes() without the correct encoding, or maybe the data is stored in the wrong format). When you convert back and forth between bytes and Strings in Java, the data is not retained unless the same encoding is used everywhere. This is easy to understand if you think of something like ASCII <-> UTF-8. With Windows-1252 or ISO Latin it's much more complicated, because the data is not always lost but is often confused. That is because those encodings are not a subset of UTF-8: each of their single bytes above 0x7F becomes a two-byte sequence in UTF-8.
If you don't believe me, try calling getBytes() back and forth with various encodings and you will see data corruption and data loss.

I may be misunderstanding the nature of the incorrectly encoded data, but that PHP code seems like overkill to me. If you have UTF-8 bytes that were passed as individual characters, you should be able to just do:
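import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String fix(String s) {
    // Recover the bytes that were wrongly decoded as windows-1252, then decode them as UTF-8.
    byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
    return new String(bytes, StandardCharsets.UTF_8);
}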
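For the heuristic angle the question asks about, one cheap check is to make that round trip strict: accept the repaired string only if the text encodes cleanly to windows-1252 and the resulting bytes are well-formed UTF-8, otherwise return the input untouched. This is only a sketch, not an existing library. The names MojibakeFixer and fixIfMojibake are made up for illustration, and windows-1252 as the wrong charset is an assumption carried over from the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: repairs UTF-8 text that was mis-decoded as windows-1252.
final class MojibakeFixer {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns the repaired text, or the input unchanged if the strict round trip fails. */
    static String fixIfMojibake(String s) {
        try {
            // Strict encode: fails if s contains characters that windows-1252 cannot represent.
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Strict decode: fails if those bytes are not well-formed UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            return s; // not recognisably double-encoded, so leave it alone
        }
    }

    public static void main(String[] args) {
        // "\u00C3\u00A9" is what the UTF-8 bytes of "\u00E9" look like when read as windows-1252.
        System.out.println(fixIfMojibake("\u00C3\u00A9").equals("\u00E9"));
    }
}
```

This will not help with strings where bytes were already replaced or dropped, and it can misfire on text that legitimately contains a sequence such as "Ã©", so it is a heuristic in the same spirit as the PHP code linked in the question.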

Related

When is encoding being relevant in Java?

This might be a bit of a beginner question, but it's fairly relevant when debugging encoding in Java: at what point is an encoding relevant to a String object?
Consider that I have a String object that I want to save to a file. Does the String object itself use some sort of encoding that I should manipulate, or is the encoding only specified by me when I create a stream of bytes to save?
The same applies to importing: when I open a file and get its bytes, I assume there's no encoding at hand, only bytes. When I parse these bytes into a String, I have to use an encoding to understand which characters they are. After I parse those bytes, does the String (in memory) carry some sort of meta information about the encoding, or is this only handled by the JVM?
This is vital, considering I'm having file import/export issues and I need to understand at which point I should worry about getting the right encoding.
I hope I explained my question well, and thank you in advance!

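Java strings do not have explicit encoding information. They don't know where they came from, and they don't know where they are going. All Java strings are exposed to your code as sequences of UTF-16 code units (since Java 9 the JVM may store Latin-1-only strings more compactly, but that is invisible to the program).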
You (optionally) specify what encoding to use whenever you want to turn a String into a sequence of bytes (e.g., to save to a file), or when you want to turn a sequence of bytes (e.g., read from a file) into a String.
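A small sketch of that point: the same String yields different bytes depending on the charset chosen at conversion time, and decoding with the matching charset gives back an equal String. The class name StringBytesDemo and the helper roundTrip are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: the charset matters at the String <-> byte[] boundary, nowhere else.
final class StringBytesDemo {
    /** Encodes s with the given charset and decodes the bytes with the same charset. */
    static String roundTrip(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // the String itself carries no charset
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 5 bytes
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 4 bytes
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s));
    }
}
```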

Encoding matters to a String when you are serializing it to, or deserializing it from, disk or the web. There are multiple text encodings: ASCII, Latin-1, UTF-8 and UTF-16 (UTF-16 comes in big-endian and little-endian variants).
See InputStreamReader for how to load a String from text stored in a non-default encoding.
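As a rough illustration of the InputStreamReader suggestion, the sketch below reads a whole stream into a String using a charset named by the caller instead of the platform default. ExplicitCharsetReader and readAll are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode a byte stream with a caller-supplied charset.
final class ExplicitCharsetReader {
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // "café" encoded as ISO-8859-1
        String text = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00E9"));
    }
}
```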

ISO-8859-1 to UTF-8 in Java

An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was encoded to å“ˆç“¦é‚£ (ISO-8859-1).
How do I encode it back to 哈瓦那? Considering that all strings in Java are UTF-16. Service B has to compare it as 哈瓦那 not å“ˆç“¦é‚£.
Thanks.

When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to store every unicode character. But ISO-8859-1 is able to encode only a small subset of the unicode characters (western languages characters). The characters you posted literally look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.

I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to å“ˆç“¦é‚£ (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
Can you unpick this? It depends where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8, and then decode it correctly. On the other hand, if the first mistaken decoding happens in service A, then you could be out of luck. A mistaken decoding can result in loss of data as unrecognized codes are replaced with some other character. If that happens, the original data will be gone forever.
In either case, the best way to deal with this is to figure out what is getting the wrong character encoding mixed up, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps, the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.
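To make the "decoded with the wrong character encoding" diagnosis concrete, here is a sketch that reproduces the mistake and then undoes it. It assumes the wrong charset really was ISO-8859-1, which Java maps one-to-one onto all 256 byte values, so nothing is lost in this particular case. (The garbled sample in the question contains characters such as “ and ˆ, which suggests the actual culprit may have been windows-1252; the same idea applies with that charset as long as no byte was replaced.) The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: reproduce and reverse a UTF-8-read-as-ISO-8859-1 mistake.
final class WrongDecodingDemo {
    /** Simulates the bug: UTF-8 bytes interpreted as ISO-8859-1. */
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses it: get the original bytes back, then decode them as UTF-8. */
    static String undo(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u54C8\u74E6\u90A3"; // the three Chinese characters from the question
        String garbled = misdecode(original);
        System.out.println(garbled.length());               // 9: one char per UTF-8 byte
        System.out.println(undo(garbled).equals(original));
    }
}
```

This is a workaround for Service B only; fixing the component that applies the wrong charset remains the better option, as the answer says.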

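Use writers and readers to encode/decode your input/output streams:
String yourText = "...";
OutputStream yourOutputStream = ...;
// A Writer encodes characters to bytes, so it must wrap an OutputStream.
Writer out = new OutputStreamWriter(yourOutputStream, "UTF-8");
out.write(yourText);
out.flush();
The same applies to readers (InputStreamReader wrapping an InputStream).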

Can a file be encoded in multiple charsets in Java?

I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they would wish to use. However, I was confused as to how I would encode multiple encodings in a single file. For example, suppose that A characters come from one charset and B characters come from another, would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset since tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs for various reasons, so the scope of this question is purely in the standard java packages/code.
Thanks a lot!
N.S.

You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or to use some special separator character/byterange which indicates the start and end of the character group. This way you can get the bytes of the specific character group and finally decode it using the desired character encoding.
This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically every character known to mankind.

Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
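One possible "mechanism to keep track" is to prefix every segment with the charset name and the byte length of its payload. The sketch below uses only the standard library; the record layout and the names MultiCharsetFile, writeSegment and readSegment are invented for this example and are not any standard format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Invented layout: [charset name][payload length][payload bytes], repeated per segment.
final class MultiCharsetFile {
    static void writeSegment(DataOutputStream out, String text, Charset cs) throws IOException {
        byte[] payload = text.getBytes(cs);
        out.writeUTF(cs.name());      // which charset the payload uses
        out.writeInt(payload.length); // how many payload bytes follow
        out.write(payload);
    }

    static String readSegment(DataInputStream in) throws IOException {
        Charset cs = Charset.forName(in.readUTF());
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return new String(payload, cs);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(file)) {
            writeSegment(out, "AAAAA", StandardCharsets.ISO_8859_1);
            writeSegment(out, "\u0411\u0411\u0411\u0411\u0411", StandardCharsets.UTF_16BE);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            System.out.println(readSegment(in).equals("AAAAA"));
            System.out.println(readSegment(in).equals("\u0411\u0411\u0411\u0411\u0411"));
        }
    }
}
```

A file written this way is no longer a plain text file that an editor can open, which is one more reason the answers recommend a single encoding such as UTF-8.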

I was wondering about this as well, because my client just asked a similar question. As BalusC mentioned, this is not a Java-specific problem.
After some back and forth, I found that the real question might be about "multiple encodings of information" rather than a file with multiple encodings.
For example, say we have an XML string whose text needs to be encoded as ISO-8859-1. If we save it as a file, we need to encode it. The default encoding for XML is UTF-8, and it is not necessarily required to encode the whole XML document as ISO-8859-1, because the XML node is just a vehicle for passing information to another system; it is the content (the value of the XML node) that needs to be persisted as ISO-8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it along. Once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as ISO-8859-1.

What could be the possible consequences of default encoding to UTF-8 for a String to Stream conversion?

I need to convert Strings obtained from some APIs into InputStreams consumed by other APIs. The only way is to convert the String to a stream without knowing the exact encoding. So I assume it to be UTF-8, and it works fine for now. However, I would like to know what a better solution would be, given that I have no way of identifying the encoding of the source of the string.

There is no good solution to the problem of not knowing the encoding.
Because of this, you must demand that the encoding be explicitly specified, or else use one single agreed-upon encoding that is strictly adhered to.
Also, make sure you use the rare form of the constructor to InputStreamReader that condescends to raise an exception on an encoding error. That is InputStreamReader(InputStream in, CharsetDecoder dec). The other three are either broken or else infelicitously designed, depending on your point of view or purposes, because they suppress encoding errors and render your program unreliable and nonportable.
Be very careful about missing errors, especially when you do not know for sure what you are getting — and even if you think you do :).
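A minimal sketch of that constructor in use, with the decoder explicitly told to report errors rather than substitute replacement characters. StrictReader and readStrict are hypothetical names; the charset calls themselves are from java.nio.charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decoding fails loudly instead of silently inserting U+FFFD.
final class StrictReader {
    static String readStrict(InputStream in, Charset cs) throws IOException {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] notUtf8 = {'c', (byte) 0xE9, 'x'}; // 0xE9 followed by 'x' is not valid UTF-8
        try {
            readStrict(new ByteArrayInputStream(notUtf8), StandardCharsets.UTF_8);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```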

The possible consequence of applying the incorrect encoding is getting the wrong data out the other end.
The specific consequences will depend on the specific encodings. For example, if you receive a stream of ISO-8859-1 characters, and try to decode using UTF-8, you'll probably get errors due to incorrect sequences. If you start with UTF-16 and assume that it's ISO-8859-1, you'll get twice as many characters as you expect, and every other one will be garbage.
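Both effects can be reproduced in a few lines. This is only an illustration with made-up names; note that new String(bytes, charset) does not throw on bad input but substitutes the replacement character U+FFFD, so the "errors" in the first case show up as replaced characters unless a strict decoder is used.

```java
import java.nio.charset.StandardCharsets;

// Illustration of two wrong-charset outcomes described above.
final class WrongCharsetEffects {
    /** UTF-16 bytes read as ISO-8859-1: two chars per original char. */
    static String utf16ReadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_16BE), StandardCharsets.ISO_8859_1);
    }

    /** ISO-8859-1 bytes read as UTF-8: invalid sequences become U+FFFD. */
    static String latin1ReadAsUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf16ReadAsLatin1("Hi").length());                       // 4, not 2
        System.out.println(latin1ReadAsUtf8("\u00E9t\u00E9").contains("\uFFFD"));   // true
    }
}
```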

Encodings are not a property of Strings in Java, they're only relevant when you convert between Strings and bytes. If those APIs give you Strings, there is only one point where your program needs to use an encoding, which is when you convert the String back to bytes to be returned by the InputStream. And those "other APIs" of course need to know which encoding to use if they're going to interpret the contents as text data.

To add to the other answers, your deployed application will no longer be portable between Windows and Linux, since these usually have different default encodings.

How can I find the byte encoding of a TIBCO Rendezvous message?

In my Java application, I am archiving TIBCO RV messages to a file as bytes.
I am writing a small utility app that will play the messages back. This way I can just create a TibrvMsg object from the bytes without having to parse the file and construct the object manually.
The problem I am having is that I am reading a file that was created on a Linux box, and attempting to run my app on a Windows machine. I get an error due to the different charset the file was written in.
So now, what I want to do is log each message in a specific charset (UTF-8), so that I don't care what platform I run my playback app on. The app should just read in the file knowing beforehand the charset the file is written in. I am planning on using the java.nio packages for this, to transform the bytes from one charset to another.
Do I need to know what charset the TIBRV message bytes are encoded in to do the transformation? If so, how can I find this out?

You are taking opaque data and, it would appear, attempting to write it to a file as textual data without escaping the non textual portions of it (alternatively you are writing it as raw bytes and then trying to read it as if it were character based which is much the same problem).
This is flawed from the very start.
Opaque data should be treated as meaningless and simply stored without modification to give back to an API that does know how to deal with it. If the data must be stored in a textual form then you must losslessly convert the bytes into text. Appropriate encodings are things like base64. Encoding in the sense of character set encoding is NOT lossless if you apply it to raw binary data.
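The difference between a lossless text encoding such as base64 and a character set encoding can be seen with three arbitrary bytes. The sketch uses java.util.Base64 from the standard library; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Illustration: base64 round-trips arbitrary bytes, a charset round trip does not.
final class OpaqueBytesDemo {
    static byte[] viaBase64(byte[] raw) {
        String text = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(text);
    }

    static byte[] viaCharset(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8); // invalid bytes become U+FFFD
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, (byte) 0x81, (byte) 0xFF}; // not valid UTF-8 text
        System.out.println(Arrays.equals(viaBase64(raw), raw));  // true
        System.out.println(Arrays.equals(viaCharset(raw), raw)); // false
    }
}
```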
Simply storing the bytes in a file as bytes (not characters) along with a fixed length prefix indicating the length of the message and the subject it was sent on is sufficient to replay RV messages through the system.
As for any text-based fields inside the message: if the encoding matters (I strongly suggest designing the app so that it doesn't), then you have the same problem on replay as you had at the original receipt time, namely converting from the source encoding to the desired encoding (hopefully using exactly the same code). So replaying should not introduce any new issue.

This is probably related to Java string encoding, not TIBRV. Though there's this in the documentation:
Strings and Character Encodings
--------------------------------------------------------------------------------
Rendezvous software uses strings in several roles:
* String data inside message fields
* Field names
* Subject names (and other associated strings that are not
strictly inside the message)
* Certified delivery correspondent names
* Group names (fault tolerance)
All these strings (both in C and in wire format) use the character
encoding appropriate to the ISO locale of the sender. For example,
the United States is locale en_US, and uses the Latin-1 character
encoding (also called ISO 8859-1); Japan is locale ja_JP, and uses
the Shift-JIS character encoding.
When two programs exchange messages within the same locale, strings
are always correct. However, when a message sender and receiver use
different character encodings, the receiving program must convert
between encodings as needed. Rendezvous software does not convert
automatically.
EBCDIC
For information about string encoding in EBCDIC environments,
see tibrv_SetCodePages() .
So you might want to look at the locale of the machines.

As this (admittedly rather old) mailing list message indicates, little is known about the internal structure of that network protocol. This might make it quite a challenge to do what you're after.
That said, if the messages are just binary blocks of data (as captured from the network), they shouldn't even have a charset. Charsets are for textual data, where they matter because a single character can be encoded in many different ways. Binary data is not composed of characters, so there cannot be an encoding in that sense.

Do I need to know what charset the TIBRV message bytes are encoded in to do the transformation?
Yes. A charset is a method of transforming text into a byte stream and vice versa. Your network data is a byte stream, so when you interpret parts of it as text, you ARE (implicitly or explicitly) using a charset - the question is whether it is the correct one.
Transforming bytes from one charset to another basically means converting them to text using one charset and then back to bytes using another. Note that this can change the length of the data, since many charsets use more than one byte for some characters. In the context of network messages, this could be problematic when it invalidates length fields or causes text fields to overflow. It's probably better not to do any transformation and instead teach the reading app how to deal with varying charsets.
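A short sketch of such a transformation and of the length change it can cause. Transcoder and transcode are hypothetical names, and nothing here is specific to the TIBCO wire format.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: bytes -> text with one charset -> bytes with another.
final class Transcoder {
    static byte[] transcode(byte[] data, Charset from, Charset to) {
        String text = new String(data, from); // decode with the source charset
        return text.getBytes(to);             // re-encode with the target charset
    }

    public static void main(String[] args) {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // 4 bytes in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // The same text now needs 5 bytes, so a 4-byte field or length prefix no longer fits.
        System.out.println(latin1.length + " -> " + utf8.length);
    }
}
```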
If so, how can I find this out?
Look at the protocol specification.

Read everything into a byte[] from an InputStream, then write the byte[] to a FileOutputStream.
NO Reader or Writer should be involved, they do character conversion and that is wrong.
Stay away from java.nio until you understand java.io.
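A bare-bones version of that advice, using only java.io streams for the copy so that no character conversion can happen. RawCopy and copy are illustrative names; where the message bytes come from is left to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative: copy bytes verbatim, with no Reader or Writer involved.
final class RawCopy {
    static void copy(InputStream in, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // bytes go out exactly as they came in
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("rawcopy", ".bin");
        byte[] message = {0x00, (byte) 0x81, (byte) 0xFF, 'A'};
        copy(new ByteArrayInputStream(message), target);
        System.out.println(target.length()); // 4
        target.delete();
    }
}
```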
