Handling UTF-8 encoding - java

We have a Java application running on a WebLogic server that picks up XML messages from a JMS or MQ queue and writes them to another JMS queue. The application doesn't modify the XML content in any way. We use BEA's XMLObject to read and write the messages.
The XML messages contain the encoding type declarations as UTF-8.
We have an issue when the XML contains characters that are outside the normal ASCII range (like the £ symbol, for example). When the message is read from the queue we can see that the £ symbol is intact; however, once we write it to the destination queue, the £ symbol is lost and is replaced with Â£ instead.
I have checked the OS level settings (locale settings) and everything seems to be fine. What else should I be checking to make sure that this doesn't happen?

once we write it to the destination queue, the £ symbol is lost and is replaced with Â£ instead
That tells me the character is being written as UTF-8, but it's being read as if it were in a single-byte encoding like ISO-8859-1. (For any character in the range U+00A0..U+00BF, if you encode it as UTF-8 and decode it as ISO-8859-1, you end up with the two-character sequence ÂX, where X is the original character.) I would look at the encoding settings of the receiving JMS queue.
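You can reproduce the effect in a couple of lines (a minimal sketch; the class name is mine):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // £ is U+00A3; encoded as UTF-8 it becomes the two bytes 0xC2 0xA3.
        byte[] utf8 = "£".getBytes(StandardCharsets.UTF_8);
        // Decoding those bytes as ISO-8859-1 turns each byte into one character.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // prints Â£
    }
}

If the destination queue shows exactly that two-character output, this mismatch is almost certainly what's happening.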

You should use InputStream, OutputStream, and byte[] to handle XML documents, not Reader, Writer, and String. In the world of JMS, BytesMessage is a better fit for XML payloads than TextMessage.
Every XML document specifies its character encoding internally, and all XML processing APIs are oriented to take byte streams and where necessary figure out the correct character encoding to use themselves. The text-based APIs are only there… to confuse people, I guess! Anyway, applications should let the XML processor deal with character encoding issues, rather than trying to manage it themselves (or using a text-oriented API without a solid understanding of character-encoding issues).
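As a rough sketch of what that looks like with the standard JMS API (the session/producer setup and the xmlBytes variable are assumed, not from the original code):

import javax.jms.BytesMessage;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;

// Forward the XML document verbatim as raw bytes: no charset conversion
// ever happens, and the receiving XML parser reads the document's own
// encoding declaration to decode it correctly.
public static void forwardXml(Session session, MessageProducer producer,
                              byte[] xmlBytes) throws JMSException {
    BytesMessage message = session.createBytesMessage();
    message.writeBytes(xmlBytes);
    producer.send(message);
}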

Without a few more specifics, I'd guess that somewhere there is a method that optionally takes an encoding, the encoding isn't being specified, and it's defaulting to ISO-8859-1. In particular, check anything that converts between an InputStream/OutputStream and a Reader/Writer.
For instance, an OutputStreamWriter takes an optional encoding that you could be leaving out.
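A minimal sketch of the fix (the helper class is mine):

import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public final class Utf8Writers {
    // new OutputStreamWriter(out) would silently use the platform default
    // encoding; naming the charset makes the behaviour identical everywhere.
    public static Writer utf8WriterFor(OutputStream out) {
        return new OutputStreamWriter(out, StandardCharsets.UTF_8);
    }
}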

Related

How to read text files with unknown encoding?

I want to read several text files (e.g. CSV), but I don't know the encoding.
As the text files may contain special characters like umlauts, choosing the right encoding seems crucial.
new BufferedReader(new InputStreamReader(resource.getInputStream(), encoding));
I tried reading with ISO_8859_1, which did not handle the umlauts properly. So I tried UTF-8, which works.
But I don't know whether this might also cause problems with different files in the future, and I never know before reading a file which encoding it is in.
So how should I best read files with encoding unknown?
Strictly speaking the other two answers are right - you just have to know what the encoding is to be guaranteed of anything. However, there are libraries out there that will allow you to make educated guesses about the encoding. Check out ICU4J or jchardet, for example.
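For example, ICU4J's CharsetDetector can be used roughly like this (a sketch; keep in mind the result is a statistical guess, not a guarantee):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public final class EncodingGuesser {
    public static String guessEncoding(byte[] data) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect(); // best statistical match
        // match.getConfidence() is 0..100; treat low values with suspicion.
        return match.getName();
    }
}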
You have to know the encoding; you cannot read the files correctly if you don't. Since UTF-8 works, just keep using it. Also check with the producer of the files whether they will keep producing them in UTF-8; they should document this.
It is impossible to programmatically recognize the encoding of a text file. The only way is to try opening it in a text editor with different encodings until you can read the text.

Handle multiple language encoding

In my application I read tweets from Twitter, but the tweets are not language-restricted. So when I try to send a response for a Chinese/Japanese tweet, the content is not displayed correctly. I have currently set
response.setContentType("text/html;charset=UTF-8");
before sending the response.
How can we handle multiple languages?
I can see the message being sent:
{"lastPost":{"lastUpdate":"毋成金口","pubDate":"Fri Aug 12 00:39:09 UTC 2011","message_id":101814948329562112}}
This is a JSON string added to the response.
On my client (an iPhone), lastPost shows up as "????".
Telling the browser that the page is UTF-8 is a good thing, but useless unless you make sure that you are actually writing only UTF-8 in the page.
To make sure this happens:
1. Whenever you read, from Twitter or wherever, always require UTF-8 data and make sure you are actually receiving UTF-8 bytes.
2. When you create a string from raw bytes, Java by default uses the "platform default encoding", which could be anything. Bytes-to-string conversion happens when creating a new String from a byte array or when using a Reader. Both of these let you explicitly specify the encoding you expect the bytes to be in. Once point 1 is checked and you are receiving UTF-8 bytes, make sure that everywhere in your application you specify UTF-8 when converting bytes to strings.
3. When using a Writer to convert strings to bytes, sent for example to the browser (the servlet writer), the same rules apply: be explicit and always specify UTF-8 (see the sketch after this list).
4. If you store stuff in databases, you have two encoding problems. The first is which encoding your database uses when talking to your application (the connection encoding); the second is which encoding the database actually stores strings in (the storage encoding). Usually you can specify only the connection encoding from Java, while the storage encoding is specified in the database when it is created (search for "collation" if you are using MySQL).
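A minimal sketch of points 2 and 3, with the charset named explicitly in both directions (the helper names are mine):

import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public final class Utf8Io {
    // Point 2: bytes to string. Never use new String(raw) on its own,
    // which silently picks the platform default encoding.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Point 3: string to bytes on the way out, e.g. to the servlet writer.
    public static void writeUtf8(OutputStream out, String text) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush();
    }
}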
Detecting where a string that is supposed to be UTF-8 gets re-encoded badly is a hard task. 99% of the time it is being converted to ISO Latin or a similar encoding somewhere, which causes special characters like à or ì to appear as two characters of garbage. Often debugging is the only way to find out where this happens.
The problem was with the client encoding; it was set to ISO-

File upload-download in its actual format

I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not preserved and some binary characters are automatically inserted. I'm also not able to save the file in its actual format; I have to save it as "filename.ser". I'm using Java's serialization/deserialization concept.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary-equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default charset (not very useful), then accept that they're not going to be binary-equal as files.
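If binary-equal output is what you want, a plain byte-stream copy is all it takes (a sketch):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class StreamCopy {
    // Copies byte for byte: no character decoding and no newline
    // translation, so the output is binary-identical to the input.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.flush();
    }
}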
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)

How to encode special characters for a POST with Spring/Roo

I'm using Spring/Roo for an app server and need to be able to post some special characters, specifically characters like the yen or euro symbol. When I receive these characters on my server and display them in the console, they appear as "?". How can they be properly encoded and received?
Try configuring src/main/resources/META-INF/spring/database.properties like this:
database.url=jdbc:mysql://[YOUR_DB_SERVER]:3306/[YOUR_DB_NAME]?autoReconnect=true&useUnicode=true&characterEncoding=UTF-8
There are a couple of possible failure points here.
First, I'd check to see if the console supports the characters in question:
if the default encoding used by the JVM does not support the characters, they will be turned into question marks by System.out
if the console font does not support the characters, they will not be rendered properly
if the console is decoding the bytes using a different encoding to the one System.out is encoding them to, the characters will not display correctly
Instead of trying to print the characters as literals, cast to int and print the hex value, then check the value against the Unicode charts.
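Something like this (a sketch):

public final class CodePointDump {
    // Dump each character's code point instead of its glyph; a yen sign
    // should appear as U+00A5 and a euro sign as U+20AC, regardless of
    // what the console font or encoding does to the glyphs themselves.
    public static void dumpCodePoints(String text) {
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}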
Lossy or incorrect conversions can also happen between the browser and the server. Ideally, the server should use UTF-8 for encoding and decoding. If the encoding used by the browser when it encodes the data does not support the characters, they will be lossily encoded; the browser usually picks an encoding based on the encoding sent by the server for the GET request (or more rarely from a form attribute). Inspect the Accept-Charset header being sent with your data (you can do this with something like Firebug or Fiddler). I don't know anything about Roo, but there's bound to be some mechanism to configure encodings.
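If you're on plain Spring MVC underneath, the usual fix is a servlet filter that forces the encoding before anything reads the request; Spring ships org.springframework.web.filter.CharacterEncodingFilter for this, and a hand-rolled equivalent looks roughly like this (the class name is mine):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Forces UTF-8 for request parameter decoding and the response body,
// before anything else touches the request.
public class ForceUtf8Filter implements Filter {
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        req.setCharacterEncoding("UTF-8");
        res.setCharacterEncoding("UTF-8");
        chain.doFilter(req, res);
    }
    public void init(FilterConfig config) {}
    public void destroy() {}
}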

How would I robustly log binary or XML using slf4j/log4j/java.util.logging?

After doing logging for many years, I have now reached a point where I need to be able to post-process the log files, with the long-term goal of using them as a "transport medium" that lets me persist objects and the like so I can replay the backend requests. In order to do so I need to persist objects in a loggable form.
The way recommended by Sun is to use java.beans.XMLEncoder to create an XML snippet, which is quite agreeable to me, but the problem is that the output is sent to a UTF-8 encoded OutputStream (including a UTF-8 XML header), and OutputStreams are byte-oriented. Log files are character-oriented (strings) and are typically encoded in the platform's default encoding. Our XML may include any Unicode character.
I need a robust way of handling this, with a preference for an approach that generates human-readable files.
I have thought about converting the XML OutputStream to a String, removing the unusable header, and flattening to ASCII (with any non-ASCII character encoded as a numeric entity). I have also thought about using XML transformations, but I have a gut feeling that this would require more resources than I want a logger to use.
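For concreteness, here is roughly what I have in mind (a rough sketch; the class and method names are mine):

import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public final class LoggableXml {
    // Serialize a bean with XMLEncoder, then flatten the result to pure
    // ASCII so it survives any logger/platform encoding: drop the XML
    // declaration and turn non-ASCII characters into numeric entities.
    public static String toLoggableXml(Object bean) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(buffer);
        encoder.writeObject(bean);
        encoder.close();
        String xml = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        xml = xml.replaceFirst("<\\?xml[^?]*\\?>\\s*", ""); // remove the header
        StringBuilder ascii = new StringBuilder(xml.length());
        for (int i = 0; i < xml.length(); ) {
            int cp = xml.codePointAt(i);
            if (cp < 0x80) {
                ascii.append((char) cp);
            } else {
                ascii.append("&#").append(cp).append(';'); // numeric entity
            }
            i += Character.charCount(cp);
        }
        return ascii.toString();
    }
}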
Suggestions?
More a hint than a real answer: maybe have a look at this thread on the logback-dev mailing list, and especially the messages from Joern Huxhorn (who is the author of Lilith). More generally, I think you should look at logback, the "successor" of log4j from the same author, Ceki Gülcü. That's where things are happening, in my opinion.
