URL Decode Difference between C# and Java - java

I got a URL-encoded string %B9q.
When I use this C# code:
string res = HttpUtility.UrlDecode("%B9q", Encoding.GetEncoding("Big5"));
it outputs 電, which is the correct answer that I want.
But when I use Java's decode function:
String res = URLDecoder.decode("%B9q", "Big5");
I get the output ?q.
Does anyone know why this happens and how I can solve it?
Thanks for any suggestions and help!

As far as I can tell from the relevant spec, Java's way of handling things is correct.
The example presented when discussing URI-to-IRI conversion seems especially meaningful:
Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".

Maybe Java's URLDecoder ignores some rules of the Big5 encoding standard: it decodes the lone %B9 escape on its own, and a single 0xB9 byte is an incomplete Big5 sequence, so it becomes ?. C# does the same thing as browsers like Chrome, decoding the literal q together with the escaped byte, but Java's URLDecoder doesn't. See the related answer: https://stackoverflow.com/a/27635806/1321255
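If C#-like behaviour is needed in Java, one workaround is to collect all of the bytes (literal characters and %XX escapes alike) into one buffer before applying the Big5 decoding. A minimal sketch; the class and method names are my own, not a standard API:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class Big5Decode {
    // Decode like C#'s HttpUtility.UrlDecode: buffer ALL bytes,
    // both literal characters and %XX escapes, and only then
    // run the charset decoder over the whole byte sequence.
    static String urlDecode(String s, Charset cs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                buf.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                buf.write(' ');
            } else {
                buf.write((byte) c);
            }
        }
        return new String(buf.toByteArray(), cs);
    }

    public static void main(String[] args) {
        // 0xB9 (from %B9) and 0x71 (the literal 'q') together form
        // the Big5 byte pair for 電.
        System.out.println(urlDecode("%B9q", Charset.forName("Big5")));
    }
}
```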

Related

Java: Advise on Charset Conversion

I have been working on a scenario that does the following:
Get input data in Unicode format; [UTF-8]
Convert to ISO-8559;
Detect & replace unsupported characters for encoding; [Based on user-defined key-value pairs]
My question is: I have been trying to find in-depth information on ISO-8559, with no luck yet. Does anybody happen to know more about it? How different is it from ISO-8859? Any details would be very helpful.
Secondly, keeping the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I am able to achieve what is needed using character-based replacement, it is obviously time-consuming when the data size is huge [in MBs].
I am sure there must be a better way to do this. Can someone advise me, please?
I assume you want to convert UTF-8 to ISO-8859-1, that is Western Latin-1. There are many charset tables on the net.
In general, for web browsers and Windows, it would be better to convert to Windows-1252, an extension that redefines the range 0x80 - 0xBF, among other things with the special quotes seen in MS Word. Browsers are de facto capable of interpreting these codes in a page declared as ISO-8859-1, even on a Mac.
A Java standard conversion like new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252") already does much. You can either write a kind of filter, or look for the ? characters introduced for untranslatable special characters. You could transliterate accented Latin letters that are not in Windows-1252 to plain ASCII letters:
String s = ...;
s = Normalizer.normalize(s, Normalizer.Form.NFD); // java.text.Normalizer
return s.replaceAll("\\p{InCombiningDiacriticalMarks}", "");
For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
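The "find introduced ?" idea above can also be handled up front with java.nio's CharsetEncoder, which lets you state explicitly what happens to unmappable characters. A sketch under the assumption that '?' is an acceptable replacement; the class name and sample string are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class ToWindows1252 {
    // Encode to Windows-1252, replacing unmappable characters with '?'
    // so they are easy to find (or post-process) afterwards.
    static byte[] encode(String s) throws Exception {
        CharsetEncoder enc = Charset.forName("windows-1252").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { '?' });
        ByteBuffer bb = enc.encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Curly quotes and the euro sign exist in Windows-1252;
        // Greek sigma does not, so it comes out as '?'.
        byte[] b = encode("“€” Σ");
        System.out.println(new String(b, "windows-1252")); // “€” ?
    }
}
```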

How to "fix" broken Java Strings (charset-conversion)

I'm running a Servlet that takes POST requests from websites that aren't necessarily encoded in UTF-8. These requests get parsed with GSON and information (mainly strings) end up in objects.
Client side charset doesn't seem to be used for any of this, as Java just stores Strings in Unicode internally.
Now if a page sending a request has a non-unicode-charset, the information in the strings is garbled up and doesn't represent what was sent - it seems to be misinterpreted somewhere either in the process of being stringified by the servlet, or parsed by gson.
Assuming there is no easy way of fixing the root of the issue, is there a way of recovering that information, given the (misinterpreted) Java Strings and the charset identifier (i.e. "Shift_JIS", "Windows-1255") used to display it on the client's side?
I haven't had the need to do this before, but I believe that
final String realCharsetName = "Shift_JIS"; // for example
String fixed = new String(brokenString.getBytes(), realCharsetName);
stands a good chance of doing the trick.
(This does however assume that encoding issues were entirely ignored while reading, and so the platform's default character set was used (a likely assumption since if people thought about charsets they probably would have got it right). It also assumes you're decoding on a machine with the same default charset as the one that originally read the bytes and created the String.)
If you happen to know exactly which charset was incorrectly used to read the string, you can pass it into the getBytes() call to make this 100% reliable.
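A runnable sketch of that repair, simulating the mojibake by decoding Shift_JIS bytes with the wrong charset (ISO-8859-1 here, since it round-trips every byte value):

```java
public class FixBroken {
    public static void main(String[] args) throws Exception {
        byte[] original = "テスト".getBytes("Shift_JIS");
        // Simulate the bug: the bytes were decoded with the wrong charset.
        String broken = new String(original, "ISO-8859-1");
        // Repair: re-encode with the charset that was wrongly used,
        // then decode with the charset the client really sent.
        String fixed = new String(broken.getBytes("ISO-8859-1"), "Shift_JIS");
        System.out.println(fixed); // テスト
    }
}
```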
Assuming that it's obtained as a POST request parameter the following way
String string = request.getParameter("name");
then you need to URL-encode the string back to the original query string parameter value using the charset which the server itself was using to decode the parameter value
String original = URLEncoder.encode(string, "UTF-8");
and then URL-decode it using the intended charset
String fixed = URLDecoder.decode(original, "Shift_JIS");
As the better alternative, you could also just instruct the server to use the given charset directly before obtaining any request parameter by ServletRequest#setCharacterEncoding().
request.setCharacterEncoding("Shift_JIS");
String string = request.getParameter("name");
There is, by the way, no way to know which charset the client used to URL-encode the POST request body. Almost none of the clients specify it in the Content-Type request header; otherwise the ServletRequest#setCharacterEncoding() call would already be done implicitly by the servlet API based on it. You can check for it with getCharacterEncoding(): if that returns null, then the client specified none.
However, this of course does not work if the client has already properly encoded the value as UTF-8 or any other charset; the Shift_JIS massaging would then break it again. There are tools/APIs to guess the original charset based on the obtained byte sequence, but that's not 100% reliable. If your servlet is a public API, then you should properly document that it only accepts UTF-8-encoded parameters whenever the charset is not specified in the request header. You can then move the problem to the client side and point out their mistake.
Am I correct that what you get is a string that was parsed as if it were UTF-8 but was encoded in Windows-1255? The solution would be to encode your string in UTF-8 and decode the result as Windows-1255.
The correct way to fix the problem is to ensure that when you read the content, you do so using the correct character encoding. Most frameworks and libraries will take care of this for you, but if you're manually writing servlets, it's something you need to be aware of. This isn't a shortcoming of Java. You just need to pay attention to the encodings. Specifically, the Content-Type header should contain useful information.
Any time you convert from a byte stream to a character stream in Java, you should supply a character encoding so that the bytes can be properly decoded into characters. See for example the InputStreamReader constructors.
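As a sketch of that advice, here is a helper that decodes a raw byte stream with the charset the client declared, falling back to UTF-8 when none was given (the fallback and the method names are assumptions for illustration):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ReadWithCharset {
    // Decode the stream with the declared charset; fall back to UTF-8
    // when the Content-Type header named none.
    static String readAll(InputStream in, String declared) throws IOException {
        Charset cs = Charset.forName(declared != null ? declared : "UTF-8");
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in, cs))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "テスト".getBytes("Shift_JIS");
        System.out.println(readAll(new ByteArrayInputStream(body), "Shift_JIS"));
    }
}
```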

URL decoding in Javascript

I want to decode a string that has been encoded using the java.net.URLEncoder.encode() method.
I tried using the unescape() function in JavaScript, but a problem occurs for blank spaces, because java.net.URLEncoder.encode() converts a blank space to '+', while unescape() won't convert '+' back to a blank space.
Try decodeURI("") or decodeURIComponent("") !-)
Using JavaScript's escape/unescape functions is almost always the wrong thing: they are incompatible with URL encoding and every other standard encoding on the web. Non-ASCII characters and spaces are treated unexpectedly, and older browsers don't necessarily behave the same way.
As mentioned by roenving, the method you want is decodeURIComponent(). This is a newer addition which you won't find on IE 5.0, so if you need to support that browser (let's hope not, nowadays!) you'd need to implement the function yourself. And for non-ASCII characters that means you need to implement a UTF-8 encoder. Code is available if you need it.
decodeURI[Component] doesn't handle + as space either (at least on FF3, where I tested).
Simple workaround:
alert(decodeURIComponent('http://foo.com/bar+gah.php?r=%22a+b%22&d=o%e2%8c%98o'.replace(/\+/g, '%20')))
Indeed, unescape chokes on this URL: it knows only UTF-16 chars like %u2318 which are not standard (see Percent-encoding).
Try
var decoded = decodeURIComponent(encoded.replace(/\+/g," "));

UTF Encoding in java

I need to encode a message from request and write it into a file. Currently I am using the URLEncoder.encode() method for encoding. But it is not giving the expected result for special characters in French and Dutch.
I have tried using URLEncoder.encode("msg", "UTF-8") also.
Example:
Original message: Pour gérer votre GSM
After encoding: Pour g?rer votre GSM
Can any one tell me which method I can use for this purpose?
URL encoding is not the right thing to do to preserve UTF-8 characters. See
What character set should I assume the encoded characters in a URL to be in?
Try doing something like:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(file), "UTF-8"));
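A complete sketch of that approach, writing the sample string from the question through an explicit UTF-8 encoder (the file name is arbitrary):

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class WriteUtf8 {
    public static void main(String[] args) throws IOException {
        // The explicit "UTF-8" argument is the important part; without
        // it the platform default charset is used and accented
        // characters can be mangled into '?'.
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("out.txt"), "UTF-8"))) {
            out.write("Pour gérer votre GSM");
        }
    }
}
```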
Have you tried specifying the encoder for the OutputStream using the [OutputStreamWriter(OutputStream, Charset)](http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html#OutputStreamWriter(java.io.OutputStream,%20java.nio.charset.Charset)) constructor?
There are a lot of possible causes for the problem you have observed. The primary cause is that the request is not giving you UTF-8 in the first place. I imagine that this situation will change over time, but currently there are many weak links that could be to blame: neither MySQL nor PHP 5, neither HTML nor browsers use UTF-8 by default, even though the data may originally be UTF-8.
See stackoverflow: how-do-i-set-character-encoding-to-utf-8-for-default-html
and
java.sun.com: technicalArticles--HTTPCharset
I experienced this problem with Chinese, and for that I'd recommend herongyang.com
if you are using tomcat then please see my post on the subject here http://nirlevy.blogspot.com/2009/02/utf8-and-hebrew-in-tomcat.html
I had the problem with hebrew but it's the same for every non english language
Use an explicit encoding when creating the string you want to send:
final String input = ...;
final String utf8 = new String(input.getBytes("UTF-8"), "UTF-8");
It seems to me like every single web developer in the world stumbles over this. I'd like to point to an article that helped me a lot:
http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
And if you use db2: this IBM developer works Article
By the way, I think browsers don't support Unicode in addresses because one could easily set up a phishing page by using characters from one language that look similar to characters in another language.

How do you unescape URLs in Java?

When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?
This is not unescaped XML; this is URL-encoded text. It looks to me like you want to use the following on the URL strings:
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this:
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder class.
Starting from Java 10, use
URLDecoder.decode(url, StandardCharsets.UTF_8).
For earlier versions, use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5.
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
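Putting that together with the URL from the question (using the Charset overload available since Java 10):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class Unescape {
    public static void main(String[] args) {
        String url = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        // Every %20 becomes a plain space; note that decode() would
        // also turn a literal '+' into a space.
        String decoded = URLDecoder.decode(url, StandardCharsets.UTF_8);
        System.out.println(decoded);
    }
}
```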
I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is that wide chars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
In my case the URL contained escaped html entities, so StringEscapeUtils.unescapeHtml4() from apache-commons did the trick
