How do you unescape URLs in Java? - java

When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?

This is not unescaped XML, this is URL encoded text. Looks to me like you want to use the following on the URL strings.
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this.
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder class.

Starting from Java 11, use
URLDecoder.decode(url, StandardCharsets.UTF_8).
For Java 7/8/9, use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5.
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
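As a minimal sketch of the two-argument form applied to the URL from the question (the class name UrlUnescape is made up for illustration, and the Charset overload assumes Java 10 or later):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class UrlUnescape {
    // Decodes %XX escapes as UTF-8. URLDecoder follows the HTML form rules,
    // so a literal '+' in the input is also turned into a space.
    static String decode(String url) {
        return URLDecoder.decode(url, StandardCharsets.UTF_8);
    }
}
```

Calling UrlUnescape.decode() on the URL above should give back the version with plain spaces. Because '+' is also decoded to a space, this is best suited to query strings and form data rather than to paths that may contain a literal '+'.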

I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4

In my case the URL contained escaped HTML entities, so StringEscapeUtils.unescapeHtml4() from Apache Commons did the trick.

Related

URL Decode Difference between C# and Java

I got a URL-encoded string %B9q.
When I use this C# code:
string res = HttpUtility.UrlDecode("%B9q", Encoding.GetEncoding("Big5"));
it outputs 電, which is the correct answer that I want.
But when I use the Java decode function:
String res = URLDecoder.decode("%B9q", "Big5");
I get the output ?q.
Does anyone know why this happens and how I should solve it?
Thanks for any suggestions and help!
As far as I can tell from the relevant spec, it looks like Java's way of handling things is correct.
Especially the example presented when discussing URI to IRI conversion seems meaningful:
Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".
Maybe Java's URLDecoder ignores some rules of the Big5 encoding standard. C# does the same thing as browsers like Chrome, but Java's URLDecoder doesn't. See the related question: https://stackoverflow.com/a/27635806/1321255
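One possible workaround, sketched here under the assumption that the unescaped q is meant to be the second byte of a two-byte Big5 character: turn the whole string into bytes first (both %XX escapes and literal ASCII characters), and only then apply the charset. The class and method names are hypothetical, and whether a Big5 charset is available depends on the Java runtime.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

class RawPercentDecoder {
    // Converts %XX escapes to bytes and copies every other character through
    // as a single byte. Assumes the input contains only ASCII characters.
    static byte[] toBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c); // literal character, kept as one byte
            }
        }
        return out.toByteArray();
    }

    // Interprets the collected bytes in the given charset in one step, so a
    // multi-byte character may span an escape and a literal character.
    static String decode(String s, Charset charset) {
        return new String(toBytes(s), charset);
    }
}
```

With this approach, RawPercentDecoder.decode("%B9q", Charset.forName("Big5")) hands the byte pair 0xB9 0x71 to the charset decoder as a unit instead of decoding 0xB9 on its own.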

encoding different in Coldfusion and Java

When I encode a string with charset UTF-8, it gives different results in Java and ColdFusion.
String to encode:
ONE TWO`< newline >`THREE FOUR
Result in Android (Java):
ONE+TWO%0ATHREE+FOUR
Result in ColdFusion:
ONE%20TWO%0D%0ATHREE%20FOUR
I thought UTF-8 defines a standard and every technology follows that while encoding/decoding using UTF-8. But it doesn't seem to be the case. Which charset should I rely on?
Edit:
ColdFusion code to encode the string:
<cfset encodedString = URLEncodedFormat(str,"UTF-8")>
Java Code to encode String:
URLEncoder.encode(str,"UTF-8");
Your problem is not related to UTF-8, because there are only plain ASCII characters here.
What you are doing is URL encoding, and there are indeed multiple versions of this.
In the HTTP query string, a space is encoded as +.
The percent (%) encoding of a space is %20.
Sometimes you can use either encoding, sometimes you can't. In my experience, using + for spaces, as the Android class did, is usually more compatible, because there is a lot of broken code out there that doesn't decode properly.
https://en.wikipedia.org/wiki/Query_string#URL_encoding
https://en.wikipedia.org/wiki/Percent-encoding
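If the %20 form is needed on the Java side, a small sketch along these lines may help (the class and method names are illustrative, and the Charset overload assumes Java 10 or later). The remaining %0A versus %0D%0A difference presumably comes from the line break in the input itself (LF versus CRLF), not from the encoder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SpaceEncoding {
    // Form-style encoding as produced by URLEncoder: spaces become '+'.
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Percent-style variant. A literal '+' in the input has already been
    // encoded as %2B, so every '+' left in the output stands for a space.
    static String percentEncode(String s) {
        return formEncode(s).replace("+", "%20");
    }
}
```

Either form should decode back to the same string with URLDecoder, which accepts both + and %20 for a space.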

URL decoding Japanese characters etc. in Java

I have a servlet that receives some POST data. Because this data is x-www-form-urlencoded, a string such as サボテン would be encoded to &#12469;&#12508;&#12486;&#12531;.
How would I unencode this string back to the correct characters? I have tried using URLDecoder.decode("encoded string", "UTF-8"); but it doesn't make a difference.
The reason I would like to unencode them is that, before I display this data on a webpage, I escape & to &amp;, and at the moment it is escaping the &s in the encoded string, so the characters are not showing up properly.
Those are not URL encodings. It would have looked like %E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3. Those are decimal HTML/XML entities. To unescape HTML/XML entities, use Apache Commons Lang StringEscapeUtils.
Update as per the comments: you will get question marks when the response encoding is not UTF-8. If you're using JSP, just add the following line to the top of the page:
<%@ page pageEncoding="UTF-8" %>
For more detail, see the solutions about halfway through this article. I would prefer using UTF-8 all the way over fiddling with regexps, since regexps don't prepare you for world domination.
This is a feature/bug of browsers. If a web page is in a limited charset, say ASCII, and users type characters outside that charset into a form field, browsers will send these characters in the form &#xxxx;.
It can be a problem because if users actually type &#xxxx;, it is sent as is, so the server has no way to distinguish the two cases.
The best way is to use a charset that covers all characters, like UTF-8, so browsers won't do this trick.
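If the data has already arrived as decimal entities of the &#xxxx; form and no third-party library is available, a rough sketch like the following could turn them back into characters. The class name is hypothetical, only decimal numeric entities are handled, and the ambiguity described above (a user literally typing such a sequence) remains.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntities {
    // Matches decimal numeric character references such as &#12469;
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    static String unescape(String s) {
        Matcher m = DECIMAL_ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the text untouched if the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Named entities such as &amp; and hexadecimal references are outside the scope of this sketch; a library such as the StringEscapeUtils class mentioned above covers those.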
Just a wild guess, but are you using Tomcat?
If so, make sure you have set up the Connector in Tomcat with a URIEncoding of UTF-8. Google that on the web and you will find a ton of hits such as
How to get UTF-8 working in Java webapps?
How about a regular expression?
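import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("&([^a][^m][^p][^;])?");
Matcher matcher = pattern.matcher(inputStr);
// Emit &amp; followed by whatever the optional group captured
String output = matcher.replaceAll("&amp;$1");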
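A variation on the same idea, offered only as a sketch: a negative lookahead can skip any & that already starts something shaped like an entity, so existing references are not escaped a second time. The class name and the exact entity pattern are assumptions, and text that merely looks like an entity is left alone too.

```java
import java.util.regex.Pattern;

class AmpersandEscaper {
    // Matches '&' unless it is followed by a decimal reference (#123;),
    // a hexadecimal reference (#x1F;), or a named entity (amp;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)");

    static String escape(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }
}
```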

URI encoding in UNICODE for apache httpclient 4

I am working with apache http client 4 for all of my web accesses.
This means that every query that I need to do has to pass the URI syntax checks.
One of the sites that I am trying to access uses UNICODE as the url GET params encoding, i.e:
http://maya.tase.co.il/bursa/index.asp?http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support UNICODE encoding (it only supports UTF-8).
The really big issue here is that this site expects its params to be encoded in UNICODE, so any attempt to convert the url using String.format("http://...srh_txt=%s&...",URLEncoder.encode( "ניב" , "UTF8"))
results in a url which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding that it expects.
By the way, a URL object can be created and even used to connect to the web site using the non-converted url.
Is there any way of creating URI in non UTF-8 encoding?
Is there any way of working with apache httpclient 4 with regular URL(and not URI)?
thanks,
Niv
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
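To make the difference between the two escape styles concrete, here is a small Java sketch. The legacyEscape helper is hypothetical and only imitates the %u#### output of JavaScript's escape as described above; it is shown for comparison, not as a recommendation. The Charset overload of URLEncoder assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class EscapeStyles {
    // Characters that JavaScript's escape() leaves alone besides letters and digits.
    private static final String UNESCAPED = "@*_+-./";

    // Imitation of escape(): %XX for code units below 256, %uXXXX otherwise.
    static String legacyEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean asciiAlnum = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (asciiAlnum || UNESCAPED.indexOf(c) >= 0) {
                sb.append(c);
            } else if (c < 256) {
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(String.format("%%u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // Standard percent-encoding of the UTF-8 bytes.
    static String utf8Encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

For the three Hebrew letters from the question, the first method yields the %u05E0%u05D9%u05D1 form and the second the %D7%A0%D7%99%D7%91 form.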

URL decoding in Javascript

I want to decode a string that has been encoded using the java.net.URLEncoder.encode() method.
I tried using the unescape() function in javascript, but a problem occurs for blank spaces because java.net.URLEncoder.encode() converts a blank space to '+' but unescape() won't convert '+' to a blank space.
Try decodeURI("") or decodeURIComponent("") !-)
Using JavaScript's escape/unescape functions is almost always the wrong thing: they are incompatible with URL-encoding or any other standard encoding on the web. Non-ASCII characters are treated unexpectedly, as are spaces, and older browsers don't necessarily have the same behaviour.
As mentioned by roenving, the method you want is decodeURIComponent(). This is a newer addition which you won't find on IE 5.0, so if you need to support that browser (let's hope not, nowadays!) you'd need to implement the function yourself. And for non-ASCII characters that means you need to implement a UTF-8 encoder. Code is available if you need it.
decodeURI[Component] doesn't handle + as space either (at least on FF3, where I tested).
Simple workaround:
alert(decodeURIComponent('http://foo.com/bar+gah.php?r=%22a+b%22&d=o%e2%8c%98o'.replace(/\+/g, '%20')))
Indeed, unescape chokes on this URL: it knows only UTF-16 chars like %u2318 which are not standard (see Percent-encoding).
Try
var decoded = decodeURIComponent(encoded.replace(/\+/g," "));
