Handle multiple language encoding

In my application I read tweets from Twitter, and the tweets are not restricted to any language. When I send a response containing a Chinese or Japanese tweet, the content is not displayed correctly. I currently set
response.setContentType("text/html;charset=UTF-8");
before sending the response.
How can we handle multiple languages?
I can see the message that is sent:
{"lastPost":{"lastUpdate":"毋成金口","pubDate":"Fri Aug 12 00:39:09 UTC 2011","message_id":101814948329562112}
This is a JSON string that is added to the response.
On my client, i.e. an iPhone, the lastPost content appears as "????".

Telling the browser that the page is UTF-8 is a good thing, but it is useless unless you make sure that you are actually writing only UTF-8 to the page.
To make sure this happens:
1. Whenever you read data, from Twitter or anywhere else, always request UTF-8 and make sure you are actually receiving UTF-8 bytes.
2. When you create a string from raw bytes, Java by default uses the "platform default encoding", which could be anything (it is UTF-8 by default only since Java 18). Bytes-to-string conversion happens when you create a new String from a byte array or when you use a Reader. Both let you explicitly state the encoding you expect the bytes to be in. Once point 1 is checked and you are receiving UTF-8 bytes, make sure that everywhere in your application you specify UTF-8 when converting bytes to strings.
3. When you use a Writer to convert strings to bytes, for example the servlet writer that sends data to the browser, the same rule applies: be explicit and always specify UTF-8.
4. If you store data in a database, you have two encoding problems. The first is which encoding your database uses when talking to your application (connection encoding); the second is which encoding the database actually stores strings in (storage encoding). Usually you can specify only the connection encoding from Java, while the storage encoding is specified in the database when it is created (search for "collation" if you are using MySQL).
Detecting where a string that is supposed to be UTF-8 gets re-encoded badly is a hard task. Most of the time it is being converted to ISO Latin or a similar encoding somewhere, which causes special characters like à or ì to appear as two characters of garbage. Often debugging is the only way to find out where this happens.
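As a minimal sketch of points 2 and 3, the following shows explicit UTF-8 conversion in both directions and reproduces the "two characters of garbage" symptom by deliberately decoding UTF-8 bytes as ISO-8859-1. The class and method names are made up for illustration, and the text is written with \u escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8RoundTrip {

    // String -> bytes, always UTF-8, never the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String, always UTF-8, never the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Simulates the bug: UTF-8 bytes wrongly interpreted as ISO-8859-1.
    static String misdecodeAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The four CJK characters of the tweet text in the question.
        String tweet = "\u6bcb\u6210\u91d1\u53e3";
        byte[] wire = encodeUtf8(tweet);

        // The round trip is lossless when both sides agree on UTF-8.
        System.out.println(decodeUtf8(wire).equals(tweet));

        // U+00E0 (a with grave) is two bytes in UTF-8, so the wrong
        // decoder turns one character into two.
        System.out.println(misdecodeAsLatin1(encodeUtf8("\u00e0")).length());
    }
}
```

Comparing string lengths or code points like this, rather than looking at rendered output, is one way to find the step where the conversion goes wrong.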

The problem was with the client encoding; it was set to ISO-

Related

Does javascript by default have support for UTF-8

I want to know whether JavaScript is UTF-8 compliant by default and, if not, how to make it UTF-8 compliant.
I entered German characters (αβγδεζηθ) in an input of type email. They look fine when entered, but after the request is sent to Java to save the value in the database, the field shows unknown characters.
Is this a JavaScript issue or a Java issue?

JavaScript supports Unicode.
Proof:
ﾟωﾟﾉ=/｀ｍ´）ﾉ~┻━┻//*´∇｀*/['_'];o=(ﾟｰﾟ)=_=3;c=(ﾟΘﾟ)=(ﾟｰﾟ)-(ﾟｰﾟ);(ﾟДﾟ)=(ﾟΘﾟ)=(o^_^o)/(o^_^o);(ﾟДﾟ)={ﾟΘﾟ:'_',ﾟωﾟﾉ:((ﾟωﾟﾉ==3)+'_')[ﾟΘﾟ],ﾟｰﾟﾉ:(ﾟωﾟﾉ+'_')[o^_^o-(ﾟΘﾟ)],ﾟДﾟﾉ:((ﾟｰﾟ==3)+'_')[ﾟｰﾟ]};(ﾟДﾟ)[ﾟΘﾟ]=((ﾟωﾟﾉ==3)+'_')[c^_^o];(ﾟДﾟ)['c']=((ﾟДﾟ)+'_')[(ﾟｰﾟ)+(ﾟｰﾟ)-(ﾟΘﾟ)];(ﾟДﾟ)['o']=((ﾟДﾟ)+'_')[ﾟΘﾟ];(ﾟoﾟ)=(ﾟДﾟ)['c']+(ﾟДﾟ)['o']+(ﾟωﾟﾉ+'_')[ﾟΘﾟ]+((ﾟωﾟﾉ==3)+'_')[ﾟｰﾟ]+((ﾟДﾟ)+'_')[(ﾟｰﾟ)+(ﾟｰﾟ)]+((ﾟｰﾟ==3)+'_')[ﾟΘﾟ]+((ﾟｰﾟ==3)+'_')[(ﾟｰﾟ)-(ﾟΘﾟ)]+(ﾟДﾟ)['c']+((ﾟДﾟ)+'_')[(ﾟｰﾟ)+(ﾟｰﾟ)]+(ﾟДﾟ)['o']+((ﾟｰﾟ==3)+'_')[ﾟΘﾟ];(ﾟДﾟ)['_']=(o^_^o)[ﾟoﾟ][ﾟoﾟ];(ﾟεﾟ)=((ﾟｰﾟ==3)+'_')[ﾟΘﾟ]+(ﾟДﾟ).ﾟДﾟﾉ+((ﾟДﾟ)+'_')[(ﾟｰﾟ)+(ﾟｰﾟ)]+((ﾟｰﾟ==3)+'_')[o^_^o-ﾟΘﾟ]+((ﾟｰﾟ==3)+'_')[ﾟΘﾟ]+(ﾟωﾟﾉ+'_')[ﾟΘﾟ];(ﾟｰﾟ)+=(ﾟΘﾟ);(ﾟДﾟ)[ﾟεﾟ]='\\';(ﾟДﾟ).ﾟΘﾟﾉ=(ﾟДﾟ+ﾟｰﾟ)[o^_^o-(ﾟΘﾟ)];(oﾟｰﾟo)=(ﾟωﾟﾉ+'_')[c^_^o];(ﾟДﾟ)[ﾟoﾟ]='\"';(ﾟДﾟ)['_']((ﾟДﾟ)['_'](ﾟεﾟ+(ﾟДﾟ)[ﾟoﾟ]+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+(ﾟｰﾟ)+(ﾟΘﾟ)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟｰﾟ)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+(ﾟｰﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((o^_^o)+(o^_^o))+((o^_^o)-(ﾟΘﾟ))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((o^_^o)+(o^_^o))+(ﾟｰﾟ)+(ﾟДﾟ)[ﾟεﾟ]+((ﾟｰﾟ)+(ﾟΘﾟ))+(c^_^o)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟｰﾟ)+((o^_^o)-(ﾟΘﾟ))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+(ﾟΘﾟ)+(c^_^o)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+(ﾟｰﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟｰﾟ)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟｰﾟ)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+((ﾟｰﾟ)+(o^_^o))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟｰﾟ)+(c^_^o)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((o^_^o)-(ﾟΘﾟ))+((ﾟｰﾟ)+(o^_^o))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+((ﾟｰﾟ)+(o^_^o))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((o^_^o)+(o^_^o))+((o^_^o)-(ﾟΘﾟ))+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟｰﾟ)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟΘﾟ)+(ﾟｰﾟ)+(ﾟｰﾟ)+(ﾟДﾟ)[ﾟεﾟ]+(ﾟｰﾟ)+((o^_^o)-(ﾟΘﾟ))+(ﾟДﾟ)[ﾟεﾟ]+((ﾟｰﾟ)+(ﾟΘﾟ))+(ﾟΘﾟ)+(ﾟДﾟ)[ﾟoﾟ])(ﾟΘﾟ))('_');
JsFiddle
The above code proves that JS does support Unicode: it alerts "Hello World". The code is taken from another source.
So it is not JS's fault. The fault is somewhere else, or the system is unable to display those characters.

JavaScript is definitely UTF-8 compliant.
However, there are many points of failure for UTF-8. You will need to make sure that your server-side code can handle UTF-8. Then check that your database columns use UTF-8 as their character set. Then ensure that the connection between your server-side code and your database is using UTF-8.
The "conversion" of characters to ?s suggests to me that it's the database that's missing encoding information.
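To illustrate where literal question marks typically come from, here is a small sketch (the class and method names are invented for the example). When Java encodes a string into a charset that cannot represent a character, String.getBytes substitutes the charset's replacement byte, which is '?' for ISO-8859-1. A database connection or column that uses such a charset can lose data in a similar way, although the exact behavior depends on the database and driver.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class QuestionMarkDemo {

    // Push text through an ISO-8859-1 bottleneck and read it back.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Greek alpha, beta, gamma do not exist in ISO-8859-1.
        System.out.println(throughLatin1("\u03b1\u03b2\u03b3"));
        // "cafe" with e-acute (U+00E9) does exist there and survives.
        System.out.println(throughLatin1("caf\u00e9").equals("caf\u00e9"));
    }
}
```

Once a character has been replaced by '?', the original cannot be recovered, so the fix has to happen before the lossy step.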

JavaScript (the language) comes with Unicode support (all strings are inherently Unicode), but not with default UTF-8 support. UTF-8 handling is part of the host APIs and their implementations. For example, in a browser the DOM does support it.
When you send data from a browser to a server, you need to (a) set the correct headers and (b) send the correct data, and the browser will do as expected. Then (c) the server needs to understand the request and use the correct encoding when deserializing the data. If something does not work, ask a question that shows your code.

File upload-download in its actual format

I have to write code to upload and download a file on a remote machine. When I upload a file, the newlines are not preserved and some binary characters are inserted automatically. I am also unable to save the file in its original format; I have to save it as "filename.ser". I am using Java's serialization/deserialization mechanism.
Thanks in advance.

How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
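The two approaches described above could look roughly like this. It is only a sketch using in-memory streams in place of the real file and network streams, and all names in it are hypothetical. copyBytes never interprets the data, so newlines and binary content arrive unchanged; copyText decodes and re-encodes with charsets that are stated explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the asker's code.
final class TransferSketch {

    // Binary-equal copy: bytes in, the same bytes out.
    static void copyBytes(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Text copy: decode with one explicit charset, encode with another.
    static void copyText(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    // In-memory stand-in for an upload, so the sketch runs without files.
    static byte[] transferBinary(byte[] payload) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            copyBytes(new ByteArrayInputStream(payload), out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // In-memory stand-in for a text transfer that changes the charset.
    static byte[] transcode(byte[] payload, Charset from, Charset to) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            copyText(new ByteArrayInputStream(payload), from, out, to);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, '\r', '\n', 'A'};
        System.out.println(transferBinary(data).length);
        byte[] utf8 = "caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(
            transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length);
    }
}
```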
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)

How to encode special characters for a POST with Spring/Roo

I'm using Spring/Roo for an app server, and need to be able to post some special characters. Specifically, characters like the Yen symbol, or Euro symbol. When I receive these characters on my server, and display them in console, they appear as "?". How can they be properly encoded and received?

Try configuring src/main/resources/META-INF/spring/database.properties as follows:
database.url=jdbc:mysql://[YOUR_DB_SERVER]:3306/[YOUR_DB_NAME]?autoReconnect=true&useUnicode=true&characterEncoding=UTF-8

There are a couple of possible failure points here.
First, I'd check to see if the console supports the characters in question:
if the default encoding used by the JVM does not support the characters, they will be turned into question marks by System.out
if the console font does not support the characters, they will not be rendered properly
if the console is decoding the bytes using a different encoding to the one System.out is encoding them to, the characters will not display correctly
Instead of trying to print the characters as literals, cast each one to int and print the hex value, then check the value against the Unicode charts.
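A small sketch of that check follows (the class and method names are invented for the example). It prints each code point in U+XXXX notation, which uses only ASCII and therefore survives any console encoding or font.

```java
// Hypothetical debugging helper, not part of any library.
final class CodePointDump {

    // Renders every code point of the string as U+XXXX, space separated.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Yen sign and euro sign, written as escapes.
        System.out.println(toHex("\u00a5\u20ac"));
        // A literal question mark, for comparison.
        System.out.println(toHex("?"));
    }
}
```

If the received string dumps as U+003F, the character was already lost before it reached your code. If it dumps as U+00A5 or U+20AC, the data is intact and only the console display is at fault.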
Lossy or incorrect conversions can also happen between the browser and the server. Ideally, the server should use UTF-8 for encoding and decoding. If the encoding used by the browser when it encodes the data does not support the characters, they will be lossily encoded; the browser usually picks an encoding based on the encoding sent by the server for the GET request (or more rarely from a form attribute). Inspect the Accept-Charset header being sent with your data (you can do this with something like Firebug or Fiddler). I don't know anything about Roo, but there's bound to be some mechanism to configure encodings.

a question related to URL

Dear all, I have a question about my Java program. I think it should be classified as a URL problem, but I am not 100% sure. If you think I am wrong, feel free to recategorize it. Thanks.
I will state my problem as simply as possible.
I searched the famous Chinese search engine baidu.com for the Chinese keyword "奥巴马" ("Obama" in English). I do this by passing a URL (from a Java program) to the browser, like:
http://news.baidu.com/ns?word=奥巴马
and it works perfectly, just as if I had typed the keyword "奥巴马" into the text field on baidu.com.
However, my advisor now wants something else. He cannot read Chinese web pages, but he wants to make sure the pages I get from baidu.com are related to "Obama", so he asked me to run them through Google Translate, i.e. to translate the Chinese web page into English.
This sounds straightforward, but this is where I ran into my problem.
If I simply pass the URL "http://news.baidu.com/ns?word=奥巴马" to Google Translate and select the "Chinese to English" option, the result looks awful. (I do not know why; maybe it is related to Chinese character encoding.)
Alternatively, if my browser opens the "http://news.baidu.com/ns?word=奥巴马" page and I click the "百度一下" button (which simply means "search"), the URL changes. If I pass this new URL to Google Translate and do the same thing, the result is much better.
I hope I am not making this problem sound too complicated, and I apologize for the Chinese words involved, but I really need your help here. Because I do all of this in a Java program, I could not figure out how to reproduce the "百度一下" step (pressing the search button) and then get the new URL. If I could get that new URL, things would be easy: I could just call Google Translate from my Java code and open a new window to show my advisor.
Please share any ideas or thoughts. Thanks a lot.
Robert

You could use
URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")
then pass the resulting URL to Google Translate like:
http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=YOUR_URL
Cheers

When you press the search button, the browser encodes the search term into %E5%A5%A5%E5%B7%B4%E9%A9%AC, which is the UTF-8 encoding for 奥巴马. It does this because UTF-8 is the default encoding for HTML forms.
Java uses a UTF-16 encoding internally, so it’s possible that the URL library builds a request in that encoding if you do not specify anything.
However, I could not reproduce your problem with Google translate — pasting that URL appeared to work correctly no matter how I did it.
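The percent-encoded form quoted above can be reproduced in Java with URLEncoder. This is a minimal sketch; the class and method names are made up for illustration, and the search term is written with \u escapes so that the source file encoding does not matter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of any library.
final class SearchTermEncoding {

    // Percent-encodes a query value using UTF-8 bytes.
    static String encodeUtf8(String term) {
        try {
            return URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+5965 U+5DF4 U+9A6C is the search term from the question.
        System.out.println(encodeUtf8("\u5965\u5df4\u9a6c"));
    }
}
```

Note that only the parameter value is encoded here, not the whole URL; encoding the complete URL is appropriate only when that URL is itself passed as a parameter value to another service.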

Try calling
URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")
(or utf-16; I'm not quite familiar with the Chinese characters representation)

URLs can contain only ASCII characters. All other characters must be converted to bytes then %-encoded in ASCII. However there is no mandate on what charset is used to convert chars to bytes. UTF-8 is recommended, but not required. As long as a server expresses its preference on charset, the client should respect that and use the same charset for encoding.
You can see from the page info that Baidu uses the gb2312 encoding. The characters 奥巴马 in a form on its page are converted to bytes in gb2312 (B0C2 B0CD C2ED) and then %-encoded to %B0%C2%B0%CD%C2%ED. That is what is actually sent to the Baidu server: http://www.baidu.com/s?wd=%B0%C2%B0%CD%C2%ED
Your OS happens to be configured to use gb2312 by default, so when you paste http://news.baidu.com/ns?word=奥巴马 into the browser, the browser does the same thing and Baidu gets the correct characters. When I paste that URL into my browser, it breaks, because my OS uses UTF-8 and the browser encodes the Chinese characters in UTF-8, which is not what Baidu expects. (When you enter a URL directly in a browser, the browser may not yet have communicated with the server and does not know which charset the server prefers, so it uses the platform default charset.)
Google, on the other hand, uses UTF-8. That is why pasting the URL into the Google form breaks just as it does on my OS: the characters are encoded in UTF-8, Baidu tries to parse them as gb2312, and it gets completely wrong words.
Solution is easy. Just encode the parameter in the way that the server expects:
"http://news.baidu.com/ns?word=" + URLEncoder.encode("奥巴马", "gb2312")
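Wrapped into a runnable sketch, the line above could look like this. The class and method names are hypothetical, and the sketch assumes that the JRE ships the GB2312 charset, which is usual for a full JDK but not guaranteed for trimmed-down runtimes, hence the check in main.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of any library.
final class BaiduUrl {

    // Builds the news search URL with the term percent-encoded as GB2312.
    static String build(String term) {
        try {
            return "http://news.baidu.com/ns?word=" + URLEncoder.encode(term, "GB2312");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("GB2312 is not available in this JRE", e);
        }
    }

    public static void main(String[] args) {
        if (Charset.isSupported("GB2312")) {
            // U+5965 U+5DF4 U+9A6C is the search term from the question.
            System.out.println(build("\u5965\u5df4\u9a6c"));
        } else {
            System.out.println("GB2312 charset not installed");
        }
    }
}
```

The resulting URL contains only ASCII, so it can in turn be percent-encoded once more and passed as a parameter to a translation service without the charset ambiguity described above.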

CPP to Java conversion

Here's my scenario. I have an application written in C++. I do not have the complete source, but the "meat" of it is there. I also have a compiled exe of this application, which communicates with a server somewhere on our network. I am trying to replicate the C++ code in Java, but it uses DWORDs, memory references, sizeof and so on, all things that do not exist in Java since it manages its own memory. The application builds a large, complex message and then fires it over the network. So I am basically sniffing the traffic, inspecting the packet, and trying to hardcode the data it sends to see if I can get a response from the server this way. However, I cannot seem to replicate the message perfectly. Some of it, such as the license code it sends, is in "clear hex", that is, hex that translates into ASCII, whereas other portions of the data are not "clear hex", such as "aa", which does not translate into ASCII (or at least not into a common character set, if that makes any sense).
Ideally I'd like to not do it like this, but it's a stepping stone to see if can get the server to respond to me. One of the functions is for the application to register itself and that's the one I am trying to replicate.
Some of my assumptions above may be wrong, so I apologize in advance. Thanks for your time.

In Java, all "character" data is encoded as Unicode (and not ASCII). So when you talk to something outside, you need to map the internal strings to the outside world. There are several ways to do it:
Use a ByteArrayOutputStream. This is basically a growing buffer of bytes to which you can append. This allows you to build the message using bytes.
Use getBytes(encoding) where encoding is the encoding the other side understands. In your case, that would be "ASCII" for the text parts.
In your case, you probably need both. Create a byte buffer, append strings and bytes to it, and then send the final result (toByteArray()) via the socket API.
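A rough sketch of combining the two follows. The message layout here (a 0xAA marker byte, a 4-byte little-endian length standing in for a C++ DWORD, then the ASCII license code) is purely an assumption for illustration; the real layout has to come from the C++ source or the captured packets. The class and method names are also made up. Note that the method that returns the accumulated bytes of a ByteArrayOutputStream is toByteArray().

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the wire format is assumed, not taken from the real protocol.
final class MessageBuilder {

    static byte[] build(String licenseCode) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();

        // A raw byte that has no ASCII meaning, like the "aa" in the capture.
        buf.write(0xAA);

        // The text part, encoded explicitly as ASCII.
        byte[] text = licenseCode.getBytes(StandardCharsets.US_ASCII);

        // A 32-bit length in little-endian order, as a C++ DWORD would
        // typically be laid out in memory on x86.
        byte[] length = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(text.length)
                .array();

        // write(byte[], int, int) on ByteArrayOutputStream throws no IOException.
        buf.write(length, 0, length.length);
        buf.write(text, 0, text.length);
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] message = build("ABC");
        StringBuilder hex = new StringBuilder();
        for (byte b : message) {
            hex.append(String.format("%02x ", b & 0xFF));
        }
        // Print as hex so it can be compared with the sniffed packet.
        System.out.println(hex.toString().trim());
    }
}
```

Printing the bytes as hex makes it straightforward to compare the Java output with the sniffed traffic byte by byte.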
