URL decoding Japanese characters etc. in Java - java

I have a servlet that receives some POST data. Because this data is x-www-form-urlencoded, a string such as サボテン would be encoded to &#12469;&#12508;&#12486;&#12531;.
How would I unencode this string back to the correct characters? I have tried using URLDecoder.decode("encoded string", "UTF-8"); but it doesn't make a difference.
The reason I would like to unencode them is that, before I display this data on a webpage, I escape & to &amp;. At the moment, this also escapes the &s in the encoded string, so the characters do not show up properly.

Those are not URL encodings. It would have looked like %E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3. Those are decimal HTML/XML entities. To unescape HTML/XML entities, use Apache Commons Lang StringEscapeUtils.
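If adding Commons Lang is not an option, numeric character references can also be decoded by hand. The following is only a minimal sketch, not what StringEscapeUtils does internally: the class name NumericEntityDecoder and its decode method are made up for illustration, and it handles only numeric references (decimal and hexadecimal), not named entities such as &amp;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class NumericEntityDecoder {
    // Group 1 captures a hexadecimal reference (&#x30B5;), group 2 a decimal one (&#12469;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // References outside the Unicode range are left untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The four decimal references from the question.
        System.out.println(decode("&#12469;&#12508;&#12486;&#12531;"));
    }
}
```

Named entities and anything that is not a numeric reference pass through unchanged, so the usual HTML escaping can still be applied afterwards.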
Update as per the comments: you will get question marks when the response encoding is not UTF-8. If you're using JSP, just add the following line to the top of the page:
<%@ page pageEncoding="UTF-8" %>
For more detail, see the solutions about halfway through this article. I would prefer using UTF-8 all the way over fiddling with regexps, since regexps don't prepare you for world domination.

This is a feature/bug of browsers. If a web page is in a limited charset, say ASCII, and users type characters outside that charset into a form field, browsers will send those characters in the form &#xxxx;
This can be a problem because if users actually type &#xxxx; themselves, it is sent as is, so the server has no way to distinguish the two cases.
The best way is to use a charset that covers all characters, like UTF-8, so browsers won't do this trick.

Just a wild guess, but are you using Tomcat?
If so, make sure you have set up the Connector in Tomcat with a URIEncoding of UTF-8. Google that on the web and you will find a ton of hits such as
How to get UTF-8 working in Java webapps?

How about a regular expression?
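import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Escape every '&' that does not already start "&amp;".
Pattern pattern = Pattern.compile("&(?!amp;)");
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll("&amp;");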

Related

Cleaning a String from html code and accents with java

I need to clean an HTML string of accents and HTML accent codes. I have found a lot of code that does this, but none of it seems to work with the file I need to clean.
This file contains words like Postulación Ayudantías and also Gestión or Árbol
I found many snippets that use text.normalize and regexes to clean the string. They work well with short strings, but I'm using very long strings, and they don't work on those.
I am really lost here and I need help, please!
These are the approaches I tried that didn't work:
Easy way to remove UTF-8 accents from a string? (returns "?" for every accent in the string)
I also used regular expressions to remove the HTML accent codes, but that isn't working either:
string=string.replaceAll("&aacute;","a");
string=string.replaceAll("&eacute;","e");
string=string.replaceAll("&iacute;","i");
string=string.replaceAll("&oacute;","o");
string=string.replaceAll("&uacute;","u");
string=string.replaceAll("&ntilde;","n");
Edit: nvm, the replaceAll is working; I wrote it wrong ("/&aacute; instead of "&aacute;)
Any help or ideas?
I think there are several options that would work. I would suggest that you first use StringEscapeUtils.unescapeHtml4(String) to unescape your HTML entities (that is, convert them to the actual characters in a Java String).
Then you could use Lucene's ASCIIFoldingFilter to fold them to their ASCII equivalents.
You need to differentiate whether you're talking about a whole HTML document containing tags and so forth or just a string containing HTML encoded data.
If you're working with an entire HTML document, say, something returned by fetching a web page, then the solution is really more than could fit into a stack overflow answer, since you basically need an HTML parser to navigate the data.
However, if you're just dealing with a string that's HTML encoded, then you first need to decode it. There are lots of utilities to do so, such as the Apache Commons Lang library StringEscapeUtils class. See this question for an example.
Once you've decoded the string, you need to iterate over it character by character and replace anything that's unwanted. Your current method won't work for hex encoded items, and you're going to end up having to build a huge table to cover all the possible HTML entities.
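For the accent-removal step on an already decoded string, one common approach using only the JDK is Unicode normalization. This is a sketch under assumptions rather than a replacement for the tools named above: the class name AccentStripper and its stripAccents method are invented for this example, and the sample words are taken from the question.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
public class AccentStripper {
    static String stripAccents(String input) {
        // NFD decomposes a letter such as U+00F3 into 'o' plus a combining acute accent.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String sample = "Postulaci\u00F3n Ayudant\u00EDas Gesti\u00F3n \u00C1rbol";
        System.out.println(stripAccents(sample)); // Postulacion Ayudantias Gestion Arbol
    }
}
```

Letters that have no canonical decomposition, such as ø or ß, are not affected by this approach and would still need an explicit mapping.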

Convert string from UTF-8 to ISO 8859-1 in Java

I want to encode a UTF-8 string to an ISO 8859-1 string in Java
I have this:
String title = new String(item.getTitle().getText().getBytes("ISO-8859-1"));
But it isn't working, the output is Sørensen for example
There's no such thing as a "UTF-8 string" in Java... there are just strings, which are always in Unicode. (They're effectively always UTF-16.)
You can have a byte array which is an ISO-8859-1 encoded form of a string (or UTF-8 or whatever) but it doesn't make sense to have a string with an encoding.
If you've read a string with the incorrect encoding somewhere, the correct thing to do is fix the code which reads the string, rather than trying to decode/encode the data from the string form later.
If you could give more information about the problem, we can probably give some more useful advice.
This problem isn't to be solved that way. Strings in Java are always in the same encoding (UTF-16), you've basically only changed the content. You need to set the encoding in the destination of this string. If it's the stdout, you need to set its encoding. If it's a file, you need to set its Writer encoding. If it's a HTML page, you need to set the response encoding. If it's a database, you need to set the DB/table/connection encoding. Etcetera.
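To make the point concrete: an encoding only comes into play when a String is turned into bytes or back. The sketch below is illustrative only; the class name CharsetBytesDemo and the helper decodeUtf8BytesAsLatin1 are made up, and the sample name is the one from the question. It shows how decoding UTF-8 bytes with the wrong charset produces garbage.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class CharsetBytesDemo {
    // Simulates a reader that wrongly treats UTF-8 bytes as ISO-8859-1.
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "S\u00F8rensen"; // a String has no encoding of its own

        // The same String yields different byte sequences per charset.
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);         // 9 bytes (U+00F8 takes two)
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);  // 8 bytes
        System.out.println(utf8.length + " vs " + latin1.length);    // 9 vs 8

        // Decoding with the charset that produced the bytes restores the String.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(name)); // true

        // Decoding with a different charset does not.
        System.out.println(decodeUtf8BytesAsLatin1(name).equals(name)); // false
    }
}
```

Whichever destination is involved (stdout, a Writer, an HTTP response), the charset used to produce the bytes has to match the one the consumer is told to expect.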
Update: as per the comments:
The string is from an RSS feed that is in UTF-8, and I want to show it in an HTML page that uses ISO 8859 encoding
You'll need to upgrade the HTML page's encoding from vintage ISO 8859 encoding to the modern and world-domination-prepared UTF-8 encoding.
Update 2: as per the comments:
Firefox shows it in the right encoding by default (utf-8) but Internet Explorer for example doesn't
Then the text is actually fine. You don't need to massage the string into another encoding. The symptoms suggest that the character encoding information is missing from the response headers. Firefox actually has a pretty smart encoding detector, while IE will use the platform default encoding when the encoding is unknown. But IE will also fail if the HTML is (drastically) malformed in the doctype and head.
Thus, either the HTML response is syntactically invalid, or the response content type wasn't set correctly. Assuming that your website validates and that you're using JSP/Servlet (after judging your post history here), you basically need to add the following line to the top of your JSP:
<%@ page pageEncoding="UTF-8" %>
That's all. It will automatically set both the response encoding (so that the server knows which encoding to use to write the characters to the byte stream of the response) and the encoding in the Content-Type response header (so that the client knows which encoding to use to read/display those characters from the byte stream of the response). For more background information you may find this article useful.

URI encoding in UNICODE for apache httpclient 4

I am working with apache http client 4 for all of my web accesses.
This means that every query that I need to do has to pass the URI syntax checks.
One of the sites that I am trying to access uses UNICODE as the url GET params encoding, i.e:
http://maya.tase.co.il/bursa/index.asp?http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support UNICODE encoding (it only supports UTF-8).
The really big issue here is that this site expects its params to be encoded in UNICODE, so any attempt to convert the URL using String.format("http://...srh_txt=%s&...",URLEncoder.encode( "ניב" , "UTF8"))
results in a URL which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding that it expects.
By the way, a URL object can be created and even used to connect to the web site using the unconverted URL.
Is there any way of creating URI in non UTF-8 encoding?
Is there any way of working with apache httpclient 4 with regular URL(and not URI)?
thanks,
Niv
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
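On the Java side, the difference between the two percent-encoded forms comes down to the charset name passed to URLEncoder.encode. The following is a rough sketch: the class name QueryEncodingDemo and its encode wrapper are invented for this example, and windows-1255 is an optional charset that a given JVM may not ship, so the code checks for it first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only.
public class QueryEncodingDemo {
    // Wraps the checked exception so callers can pass any charset name.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String term = "\u05E0\u05D9\u05D1"; // the search term from the question

        // Conventional form: percent-encoded UTF-8 bytes.
        System.out.println(encode(term, "UTF-8")); // %D7%A0%D7%99%D7%91

        // Code page 1255 form; the answer above gives this as %F0%E9%E1.
        if (Charset.isSupported("windows-1255")) {
            System.out.println(encode(term, "windows-1255"));
        }
    }
}
```

Both results contain only ASCII characters and valid percent escapes, so either one can be placed in a query string and handed to a URI constructor.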

URL decoding in Javascript

I want to decode a string that has been encoded using the java.net.URLEncoder.encode() method.
I tried using the unescape() function in javascript, but a problem occurs for blank spaces because java.net.URLEncoder.encode() converts a blank space to '+' but unescape() won't convert '+' to a blank space.
Try decodeURI("") or decodeURIComponent("") !-)
Using JavaScript's escape/unescape function is almost always the wrong thing, it is incompatible with URL-encoding or any other standard encoding on the web. Non-ASCII characters are treated unexpectedly as well as spaces, and older browsers don't necessarily have the same behaviour.
As mentioned by roenving, the method you want is decodeURIComponent(). This is a newer addition which you won't find on IE 5.0, so if you need to support that browser (let's hope not, nowadays!) you'd need to implement the function yourself. And for non-ASCII characters that means you need to implement a UTF-8 encoder. Code is available if you need it.
decodeURI[Component] doesn't handle + as space either (at least on FF3, where I tested).
Simple workaround:
alert(decodeURIComponent('http://foo.com/bar+gah.php?r=%22a+b%22&d=o%e2%8c%98o'.replace(/\+/g, '%20')))
Indeed, unescape chokes on this URL: it knows only UTF-16 chars like %u2318 which are not standard (see Percent-encoding).
Try
var decoded = decodeURIComponent(encoded.replace(/\+/g," "));
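If the Java side is under your control, another option is to emit %20 instead of '+' there, so that decodeURIComponent() needs no workaround. This is a sketch only; the class name JsFriendlyEncoder and the method encodeForDecodeURIComponent are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
public class JsFriendlyEncoder {
    static String encodeForDecodeURIComponent(String value) {
        try {
            // URLEncoder writes a space as '+', while a literal '+' becomes %2B.
            // Every '+' left in the output is therefore a space and can be
            // rewritten as %20 without ambiguity.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is one of the charsets every JVM must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeForDecodeURIComponent("\"a b\"")); // %22a%20b%22
        System.out.println(encodeForDecodeURIComponent("a+b"));     // a%2Bb
    }
}
```

With this in place the JavaScript side can call decodeURIComponent(encoded) directly.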

How do you unescape URLs in Java?

When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?
This is not unescaped XML, this is URL encoded text. Looks to me like you want to use the following on the URL strings.
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this.
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder object.
Starting from Java 10, use URLDecoder.decode(url, StandardCharsets.UTF_8).
For Java 7/8/9, use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
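Put together, a minimal sketch using only the JDK might look like the following. The class name UrlUnescapeDemo and its two helpers are invented for illustration. One caveat worth knowing: URLDecoder implements form decoding, so it also turns '+' into a space, whereas java.net.URI's getPath() decodes percent escapes in the path and leaves '+' alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class, for illustration only.
public class UrlUnescapeDemo {
    // Form-style decoding: %XX escapes are decoded and '+' becomes a space.
    static String decodeForm(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is one of the charsets every JVM must support.
            throw new IllegalStateException(e);
        }
    }

    // Path-only decoding: %XX escapes are decoded, '+' is kept as is.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        System.out.println(decodeForm(url));
        System.out.println(decodedPath(url));
    }
}
```

For the URL in the question both approaches give the same file name, since it contains no '+' characters.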
I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
In my case the URL contained escaped html entities, so StringEscapeUtils.unescapeHtml4() from apache-commons did the trick
