UTF Encoding in java - java

I need to encode a message from a request and write it to a file. Currently I am using the URLEncoder.encode() method for encoding, but it does not give the expected result for special characters in French and Dutch.
I have also tried URLEncoder.encode("msg", "UTF-8").
Example:
Original message: Pour gérer votre GSM
After encoding: Pour g?rer votre GSM
Can anyone tell me which method I should use for this purpose?

URL encoding is not the right thing to do to preserve UTF-8 characters. See
What character set should I assume the encoded characters in a URL to be in?

Try doing something like:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(file), "UTF-8"));
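A slightly fuller sketch of the same idea, assuming the goal is simply to write a string to a file as UTF-8 and read it back unchanged. The class name Utf8FileDemo and its helper methods are made up for this illustration; the temporary file stands in for the real target file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original answer.
class Utf8FileDemo {
    // Writes text to the given path, encoding it explicitly as UTF-8.
    static void writeUtf8(Path path, String text) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(path), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    // Reads the file back, decoding its bytes as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        String message = "Pour g\u00e9rer votre GSM"; // \u00e9 is e-acute
        writeUtf8(tmp, message);
        // The accented character survives the round trip.
        System.out.println(message.equals(readUtf8(tmp)));
        Files.delete(tmp);
    }
}
```

Whatever later reads the file (an editor, another program) also has to be told it is UTF-8, otherwise the accented characters can still be displayed wrongly.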

Have you tried specifying the encoding with the [OutputStreamWriter(OutputStream, Charset)](http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html#OutputStreamWriter(java.io.OutputStream,%20java.nio.charset.Charset)) constructor?

There are many possible causes for the problem you have observed. The primary cause is that the request is not giving you UTF-8 in the first place. I imagine that this situation will change over time, but currently there are many weak links that could be to blame: MySQL, PHP5, HTML and browsers do not all use UTF-8 by default, even if the data was originally UTF-8.
See stackoverflow: how-do-i-set-character-encoding-to-utf-8-for-default-html
and
java.sun.com: technicalArticles--HTTPCharset
I experienced this problem with Chinese, and for that I'd recommend herongyang.com

If you are using Tomcat, please see my post on the subject here: http://nirlevy.blogspot.com/2009/02/utf8-and-hebrew-in-tomcat.html
I had the problem with Hebrew, but it's the same for every non-English language.

Use an explicit encoding when creating the string you want to send:
final String input = ...;
final String utf8 = new String( input.getBytes( "UTF-8" ) , "UTF-8" );
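As a hedged illustration of where the ? in the question can come from: encoding and decoding with the same charset returns the original string when that charset can represent every character, while a charset that cannot represent é substitutes a ? byte. The class name QuestionMarkDemo and the choice of US-ASCII as the "wrong" charset are assumptions made for this sketch only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class QuestionMarkDemo {
    // Encodes the string to bytes and decodes it again with the same charset.
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String input = "Pour g\u00e9rer votre GSM"; // \u00e9 is e-acute

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals(input));

        // US-ASCII cannot represent e-acute; getBytes substitutes '?'.
        System.out.println(roundTrip(input, StandardCharsets.US_ASCII));
    }
}
```

If the string already contains ? when it arrives, the loss happened earlier (for example when the request was decoded), and no conversion at this point can bring the original character back.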

It seems to me that every single web developer in the world stumbles over this. I'd like to point to an article that helped me a lot:
http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
And if you use DB2: this IBM developerWorks article
By the way, I think the browsers don't support Unicode in addresses, because one could easily set up a phishing page when you use characters from one language that look similar to characters in another language.

Related

URL Decode Difference between C# and Java

I have a URL-encoded string, %B9q.
When I use this C# code:
string res = HttpUtility.UrlDecode("%B9q", Encoding.GetEncoding("Big5"));
It outputs 電, which is the correct answer that I want.
But when I use Java's decode function:
String res = URLDecoder.decode("%B9q", "Big5");
I get the output ?q instead.
Does anyone know why this happens and how I should solve it?
Thanks for any suggestions and help!
As far as I can tell from the relevant spec, it looks like Java's way of handling things is correct.
Especially the example presented when discussing URI to IRI conversion seems meaningful:
Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".
Maybe Java's URLDecoder ignores some rules of the Big5 encoding standard. C# does the same thing as browsers like Chrome, but Java's URLDecoder doesn't. See the related answer: https://stackoverflow.com/a/27635806/1321255
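One possible workaround, sketched under the assumption that the difference comes from how literal characters are handled: in Big5 the character 電 is the two bytes 0xB9 0x71, and 0x71 is also the ASCII letter q, so the encoded form leaves the second byte unescaped. The decoder below turns both %XX escapes and literal ASCII characters into bytes and decodes the whole byte sequence at the end. BytewisePercentDecoder is a made-up name, not a JDK or library class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper; a sketch, not a drop-in replacement for URLDecoder.
class BytewisePercentDecoder {
    // Collects escaped and literal ASCII characters as raw bytes,
    // then decodes the complete byte sequence with the given charset.
    static String decode(String s, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("Incomplete escape at index " + i);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                bytes.write((hi << 4) | lo);
                i += 2; // skip the two hex digits
            } else if (c == '+') {
                bytes.write(' '); // form encoding uses '+' for a space
            } else if (c < 0x80) {
                bytes.write(c);   // literal ASCII is kept as the same byte value
            } else {
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Big5 is an extended charset and may be missing on minimal runtimes.
        if (Charset.isSupported("Big5")) {
            System.out.println(decode("%B9q", Charset.forName("Big5")));
        }
    }
}
```

This only makes sense for ASCII-compatible multi-byte charsets such as Big5, where the second byte of a character can fall in the ASCII range.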

Not able to decode traditional chinese using java

I want to display a string which is in traditional Chinese language into my application GUI.
While debugging eclipse showed this string as some mixture of English alphabets and square boxes.
This is the java code which I used to decode it. The string 'str' I am getting from a traditional Chinese .mpg stream.
String TRADITIONAL_CHINESE_ENC = "Big5";
byte[] tmp = str.getBytes();
String decodedString=new String(tmp,TRADITIONAL_CHINESE_ENC);
But the result I am getting in decodedString is also a mixture of letters, square boxes, question marks embedded in diamond-shaped boxes, and so on.
This happens only with traditional Chinese. The same code works fine for simplified Chinese, Korean, etc.
What could be wrong in my code when dealing with traditional Chinese?
I am using UTF-8 encoding for eclipse.
I can't see anything wrong with that code.
According to this Wikipedia page, there are 3 common encodings for traditional Chinese characters: Guobiao, UTF-8 and Big5. I suggest you try the two alternatives that you haven't tried, and if that fails try some of the less common alternatives listed.
(It is also possible that the real problem is in the way you are displaying the String ... but the fact that you are displaying Simplified Chinese and Korean correctly suggests that this is not the problem.)
I am using UTF-8 encoding for eclipse.
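I don't think that is relevant. The code you showed us doesn't depend on the encoding setting of the IDE. (It does depend on the platform default charset through the no-argument str.getBytes() call, so it is worth passing an explicit charset there.)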
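A small sketch of one thing that may be worth checking, assuming the stream really delivers Big5 bytes: if those bytes are turned into a String with the wrong charset before the code in the question runs, the damage cannot be undone by calling getBytes() afterwards. The class name Big5BytesDemo and the sample bytes (Big5 for 電腦) are chosen for this illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not taken from the question or the answer.
class Big5BytesDemo {
    // Decodes raw bytes directly with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Simulates reading the bytes as UTF-8 text first and converting back.
    static byte[] throughWrongCharset(byte[] raw) {
        String mangled = new String(raw, StandardCharsets.UTF_8);
        return mangled.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xB9, 0x71, (byte) 0xB8, (byte) 0xA3};

        // The detour through a wrongly decoded String does not give the bytes back.
        System.out.println(Arrays.equals(raw, throughWrongCharset(raw)));

        // Decoding the original bytes directly works, if Big5 is available.
        if (Charset.isSupported("Big5")) {
            System.out.println(decode(raw, "Big5"));
        }
    }
}
```

If this is what is happening, the fix would be to keep the data as a byte[] from the stream until it is decoded exactly once with the correct charset.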

Java: Advise on Charset Conversion

I have been working on a scenario that does the following:
Get input data in Unicode format; [UTF-8]
Convert to ISO-8559;
Detect & replace unsupported characters for encoding; [Based on user-defined key-value pairs]
My question is: I have been trying to find in-depth information on ISO-8559, with no luck so far. Does anybody happen to know more about it? How does it differ from ISO-8859? Any details would be very helpful.
Secondly, setting the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I am able to achieve what is needed using character-based replacement, it obviously becomes time-consuming when the data size is huge (in MBs).
I am sure there must be a better way to do this. Can someone advise me, please?
I assume you want to convert UTF-8 to ISO-8859-1, that is, Western Latin-1. There are many charset tables on the net.
In general, for web browsers and Windows, it would be better to convert to Windows-1252, which is an extension that redefines the range 0x80 - 0x9F, among other things with the typographic quotes seen in MS Word. Browsers are de facto able to interpret these codes in a page declared as ISO-8859-1, even on a Mac.
Java's standard conversion, such as new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252"), already does much of the work. You can either write a kind of filter, or look for the ? characters that the conversion introduces for untranslatable special characters. You could translate Latin letters with accents that are not in Windows-1252 into ASCII letters:
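// Requires: import java.text.Normalizer;
String s = ...
// Decompose accented letters into base letter + combining mark.
s = Normalizer.normalize(s, Normalizer.Form.NFD);
// Drop the combining marks, leaving the plain base letters.
return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");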
For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
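For the "detect and replace unsupported characters from user-defined key-value pairs" step in the question, one approach is a single pass over the code points that asks a CharsetEncoder whether each one is representable. This is only a sketch: the class UnmappableReplacer, the replacement map and the fallback string are made up here, and ISO-8859-1 is assumed as the target.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class for illustration.
class UnmappableReplacer {
    // Replaces every code point the target charset cannot encode with the
    // user-defined replacement, or with the fallback if none is defined.
    static String replaceUnmappable(String input, Charset target,
                                    Map<Integer, String> replacements, String fallback) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            int len = Character.charCount(cp);
            if (encoder.canEncode(input.subSequence(i, i + len))) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacements.getOrDefault(cp, fallback));
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> replacements = new HashMap<>();
        replacements.put(0x20AC, "EUR"); // euro sign is not in ISO-8859-1
        String cleaned = replaceUnmappable("caf\u00e9 \u20ac5",
                StandardCharsets.ISO_8859_1, replacements, "?");
        byte[] latin1 = cleaned.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(cleaned + " (" + latin1.length + " bytes)");
    }
}
```

For input in the megabyte range it would probably be better to run this over chunks read from a Reader than to hold the whole text in one String, but that is left out of the sketch.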

URL decoding Japanese characters etc. in Java

I have a servlet that receives some POST data. Because this data is x-www-form-urlencoded, a string such as サボテン would be encoded to &#12469;&#12508;&#12486;&#12531;.
How would I unencode this string back to the correct characters? I have tried using URLDecoder.decode("encoded string", "UTF-8"); but it doesn't make a difference.
The reason I would like to unencode them is that, before I display this data on a webpage, I escape & to &amp; and at the moment it is escaping the &s in the encoded string, so the characters are not showing up properly.
Those are not URL encodings. It would have looked like %E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3. Those are decimal HTML/XML entities. To unescape HTML/XML entities, use Apache Commons Lang StringEscapeUtils.
Update as per the comments: you will get question marks when the response encoding is not UTF-8. If you're using JSP, just add the following line to top of the page:
<%@ page pageEncoding="UTF-8" %>
For more detail, see the solutions about halfway through this article. I would prefer using UTF-8 all the way over fiddling with regexps, since regexps don't prepare you for world domination.
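If pulling in Apache Commons Lang is not an option, decimal entities of the form &#NNNN; can also be decoded with the standard library. The following is a minimal sketch that handles only decimal numeric entities (not named or hexadecimal ones); the class name NumericEntities is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; handles decimal numeric entities only.
class NumericEntities {
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Replaces each &#NNNN; with the character for that code point.
    static String unescape(String s) {
        Matcher m = DECIMAL_ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the entity untouched if the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&#12469;&#12508;&#12486;&#12531;"));
    }
}
```

As noted further down in this thread, the server cannot tell an entity created by the browser from one the user typed literally, so serving the form page as UTF-8 remains the more robust fix.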
This is a feature/bug of browsers. If a web page is in a limited charset, say ASCII, and users type characters outside that charset into a form field, browsers will send these characters in the form &#xxxx;
This can be a problem because if users actually type &#xxxx; it will be sent as is, so the server has no way to distinguish the two cases.
The best way is to use a charset that covers all characters, like UTF-8, so browsers won't do this trick.
Just a wild guess, but are you using Tomcat?
If so, make sure you have set up the Connector in Tomcat with a URIEncoding of UTF-8. Google that on the web and you will find a ton of hits such as
How to get UTF-8 working in Java webapps?
How about a regular expression?
Pattern pattern = Pattern.compile("&([^a][^m][^p][^;])?");
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll("&amp;$1");

How do you unescape URLs in Java?

When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?
This is not unescaped XML, this is URL encoded text. Looks to me like you want to use the following on the URL strings.
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this:
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder object.
Starting from Java 10, use
URLDecoder.decode(url, StandardCharsets.UTF_8).
For Java 7/8/9, use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5.
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
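Putting the above together, a small sketch using the overload that takes a charset name, so that it also compiles on older Java versions. The wrapper class UrlUnescapeDemo is made up for the example; the URL is the one from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical wrapper class for illustration.
class UrlUnescapeDemo {
    // Decodes %XX escapes as UTF-8 bytes.
    static String unescape(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        System.out.println(unescape(url));
    }
}
```

Note that URLDecoder implements form decoding, so it also turns a literal + into a space; for URLs that may contain a real + this may not be what you want.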
I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
In my case the URL contained escaped html entities, so StringEscapeUtils.unescapeHtml4() from apache-commons did the trick
