I have a RESTful service that expects a string in the request. When the string is passed from the browser, the accented characters are garbled (�), because the default browser encoding is ISO-8859-1. If I change the browser encoding to UTF-8, the accented characters are preserved in the request string.
Is there a way to change the string encoding and reconstruct the string in UTF-8 on the server side, so that I don't need to change the browser encoding every time?
Thanks
I've found that most browsers' default encodings depend on the system they're installed on. Most of mine (especially on Windows) default to either ISO-8859-1 or CP1252, which matches this original post. Make sure that your HTTP headers and HTML meta tags specify UTF-8 encoding, and ensure your servlet container is set to use UTF-8 by default (see http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 if you're using Tomcat).
Sometimes you will still get bitten by a copy-paste from an application using (e.g.) CP1252 being pasted bit-for-bit into a textarea on a UTF-8 page. I have never gotten that to work without garbled characters.
The UTF-8 encoding can represent any Unicode code point, while ISO-8859-1 can handle only a tiny fraction of them. So transcoding from ISO-8859-1 to UTF-8 is not a problem. Going backwards from UTF-8 to ISO-8859-1 will cause "replacement characters" (�) to appear in your text when unsupported characters are found. To transcode your text, you can do it like this:
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
OR
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
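For instance, a minimal self-contained sketch of the round trip (the class name is just for illustration; byte values are shown in the comments):

import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    public static void main(String[] args) {
        // "é" in ISO-8859-1 is the single byte 0xE9
        byte[] latin1 = { (byte) 0xE9 };

        // Latin-1 -> UTF-8 is always safe: every Latin-1 byte maps to a code point
        byte[] utf8 = new String(latin1, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);              // 0xC3 0xA9

        // UTF-8 -> Latin-1 is lossy outside Latin-1: unmappable characters
        // (e.g. "€") come out of getBytes as '?'
        byte[] back = new String(utf8, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);         // 0xE9 again
        System.out.println(back.length);                        // prints 1
    }
}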
Say I have a string with "é" in it, and I want to send it over a URL to the next controller. The character gets encoded to %C3%A9, and when it's received in the other controller, it gets decoded to "Ã©".
My question is: how do I encode the "é" over the URL so that when it's received in the other controller it gets decoded back to "é"? For now, I'm replacing the characters manually. I need a way to do it automatically, for any special character (éèà...).
Thank you.
Unfortunately, there is no way to declare the encoding of URL data. The common encoding used to be ISO-8859-1 (Latin-1), but nowadays UTF-8 is often used in new developments. The reason is that the servlet specification says that when the charset is not specified, ISO-8859-1 is implied, but HTML 4.0 recommends UTF-8 for URLs.
The problem is that the URL is composed of bytes, and the servlet container converts it to Java characters before passing it to the application, so you must declare the charset used at the servlet container level. For compatibility reasons, Tomcat before version 8.0 assumed a Latin-1 charset for the URL by default. Since 8.0.0, the default depends on the "strict servlet compliance" setting: it is ISO-8859-1 when true and UTF-8 when false.
References:
What character set should I assume the encoded characters in a URL to be in?
Tomcat FAQ/CharacterEncoding
For your precise question, you have two ways to correctly process the é character (a sketch of both follows):
leave your servlet container configuration unchanged and encode the URL in ISO-8859-1 (é will be %E9)
stick to UTF-8 in the URL, but declare it to the servlet container
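A rough sketch of the two options (the parameter name term is made up; option 2 also requires telling the container, e.g. URIEncoding="UTF-8" on Tomcat's Connector):

import java.net.URLEncoder;

public class UrlEncodingDemo {
    public static void main(String[] args) throws Exception {
        String term = "é";
        // Option 1: Latin-1 percent-encoding, matching a Latin-1 container default
        System.out.println("term=" + URLEncoder.encode(term, "ISO-8859-1")); // term=%E9
        // Option 2: UTF-8 percent-encoding; the container must be configured to expect it
        System.out.println("term=" + URLEncoder.encode(term, "UTF-8"));      // term=%C3%A9
    }
}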
I have a problem where the decoding of a URL causes some major issues. The request URL contains %C3%BC for the letter 'ü'. The decoding on the server side should now produce ü, but it does this: Ã¼
decoding is done like this:
decoded = URLDecoder.decode(value, "UTF-8");
while value contains '%C3%BC', and decoded should now contain 'ü', but that's where the problem is. What's going wrong here? I use this method in more than one application, and it works fine in all other cases...
I don't have enough reputation yet to comment, so I'll have to make this as close to an answer as possible.
If you're using a servlet, and "value" is something that you got from calling getParameter() on the request, then it has already been decoded (rightly or wrongly) by the servlet container. (Tomcat?)
Likewise if it's part of the path. Your servlet container probably decoded it assuming that the percent-encoded bytes were ISO-8859-1, which is the default setting for Tomcat. See the documentation for the URIEncoding attribute of the Connector element in Tomcat's server.xml file, if that's the app server you're using. If you set it to UTF-8, Tomcat will assume that percent-encoded bytes represent UTF-8 text.
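If Tomcat is indeed the app server, the setting goes on the Connector element in server.xml. A sketch with placeholder attributes (only URIEncoding matters here):

<!-- server.xml: tell Tomcat that percent-encoded URI bytes are UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />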
You are probably outputting the value incorrectly. First, decoded.length() (presumably 1) gives a fair indication; you could also dump the characters with Arrays.toString(decoded.toCharArray()).
In the IDE console under Windows you can see exactly that kind of mess when the console uses a Windows single-byte ANSI encoding.
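For example, a quick diagnostic (value is assumed to be the string from your snippet):

import java.net.URLDecoder;
import java.util.Arrays;

// Dump what the decoded value really contains:
String decoded = URLDecoder.decode(value, "UTF-8");
System.out.println(decoded.length());                       // expect 1 for a single "ü"
System.out.println(Arrays.toString(decoded.toCharArray())); // [ü] if correct, [Ã, ¼] if mangled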
For the rest, take care to replace the charset-less calls with explicit ones:
String s;
byte[] b;
s.getBytes() -> s.getBytes(StandardCharsets.UTF_8)
s.getBytes("Cp1252") // Windows Latin-1
new String(b) -> new String(b, StandardCharsets.UTF_8)
I've got a strange issue with wrong URI encoding and would appreciate any help!
The project uses JSPs, servlets, jQuery, and Tomcat 6.
The charset in the JSPs is set to UTF-8, all Tomcat connectors use URIEncoding=UTF-8, and I also use a character encoding filter as described here.
Also, I set the contentType in the meta tag, and my browser detects it correctly.
In Ajax calls with jQuery I use encodeURIComponent() on the terms I want to use as URL parameters and then serialize the whole parameter set with $.param(). In the called servlet, these parameters are decoded correctly with java.net.URLDecoder.decode(term, "UTF-8").
In some places I generate URLs for href elements from a parameter map in the JSPs. Each parameter value is encoded with java.net.URLEncoder.encode(value, "UTF-8") on the JSP side, but then decoding it the same way as before results in broken special characters. Instead, I have to encode it as "ISO-8859-2" in the JSP, which is then decoded correctly as "UTF-8" in the servlet.
An example to clarify:
The term "überfall" is URI-encoded via JavaScript (%C3%BCberfall) and sent to the servlet for decoding and processing, which works. After passing it back to a JSP, I would encode it as UTF-8 and build the URL, which results for instance in:
<a href="...?term=%C3%83%C2%BCberfall">Click here</a>
However, clicking this link sends the parameter as "%C3%83%C2%BCberfall" to the servlet, which decodes to "Ã¼berfall". The same occurs when no encoding takes place.
When using "ISO-8859-2" for encoding I get:
<a href="...?term=%C3%BCberfall">Click here</a>
When clicking this link, I can observe in Wireshark that %C3%BCberfall is sent as the parameter, which decodes correctly to "überfall"!
Can anyone tell me what I'm missing?
EDIT:
While observing the Network Tab in Firebug I realized that by using
$.param({term : encodeURIComponent(term)});
the term is UTF-8 encoded twice, resulting in "%25C3%25BCberfall", i.e. the percent symbols are also percent-encoded. Analogously, it works for me if I call encode(term, "UTF-8") twice on each value from the parameter map.
Encoding once and not decoding the String results in "Ã¼berfall" again.
What encoding is Java using internally? Did you start your application with
-Dfile.encoding=utf-8
Please clarify where the "parameter map in the JSPs" is defined. Does it come from some persistent data storage, or are the strings given as literals in your code?
Some thoughts on what is going on, which might help:
Ã¼ is what comes out when a UTF-8 encoded ü is read as ISO-8859-1, with each byte decoded on its own. %C3%BC is the URI-encoded representation of the two UTF-8 bytes of a ü. I think this is what's happening:
%C3%BC gets wrongly decoded to → Ã¼, which gets encoded to → %C3%83%C2%BC, which then gets decoded again to → Ã¼, so you end up with "Ã¼berfall".
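You can reproduce that whole chain in plain Java; a small sketch using only the standard library:

import java.nio.charset.StandardCharsets;

public class MojibakeChain {
    public static void main(String[] args) {
        byte[] utf8 = "ü".getBytes(StandardCharsets.UTF_8);        // 0xC3 0xBC -> %C3%BC
        // Step 1: the container wrongly decodes the UTF-8 bytes as ISO-8859-1
        String mojibake = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake);                              // Ã¼
        // Step 2: the mojibake is encoded again as UTF-8 (two bytes per character now)
        byte[] doubled = mojibake.getBytes(StandardCharsets.UTF_8);
        // doubled is 0xC3 0x83 0xC2 0xBC -> %C3%83%C2%BC
        // Step 3: decoding that as UTF-8 just gives the mojibake back, not "ü"
        System.out.println(new String(doubled, StandardCharsets.UTF_8)); // Ã¼
    }
}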
So I guess you are using the wrong encoding when decoding the URI-encoded string. This might have something to do with the internal encoding used by Java/the JVM:
By default, the JRE 7 installer installs a European languages version if it recognizes that the host operating system only supports European languages.
I think I have now definitively fixed the problem.
Following Jontro's comment I encoded all URL parameter values once and removed the manual servlet-side decoding.
Sending a ü should look like %C3%BC in Firebug's Network tab, yet it gave me Ã¼ in the servlet.
Java was definitely set to "UTF-8" internal encoding with the -Dfile.encoding parameter.
I traced the problem to the request.getParameter() method, as follows: request.getQueryString() was OK, but extracting the actual parameters failed:
request.getCharacterEncoding() => UTF-8
request.getContentType() => null
request.getQueryString() => from=0&resultCount=10&sortAsc=true&searchType=quick&term=%C3%BC
request.getParameter("term") => ü
Charset.defaultCharset() => UTF-8
OutputStreamWriter.getEncoding() => UTF8
new String(request.getParameter("term").getBytes(), "UTF-8") => ü
System.getProperty("file.encoding") => UTF-8
By looking into the sources of Tomcat and Coyote, which implement request.getParameter(), I found the problem: the URIEncoding from the connector was always null, and in that case it defaults to org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING, which is "ISO-8859-1", like Wolfram said.
Long story short: my mistake was editing the server.xml in Tomcat's conf directory, which is only loaded ONCE into Eclipse, when a new server is created in the Servers view! After that, the separate server.xml in the Servers project has to be edited. After doing so, the connector setting is loaded correctly and everything works as it should.
Thanks for the comments! Hope this helps someone...
I would really like to find out whether a file is Windows-1256 or not. Is there a way to recognize whether a text file is Windows-1256 in Java?
You could use this API to check the encoding:
http://jchardet.sourceforge.net/
And have a look at this question:
Java : How to determine the correct charset encoding of a stream
Add an encoding header to the file. Many text editors do this:
# -*- coding: cp1256 -*-
Other than that, there is no reliable way to do this.
The problem is that the cp12xx encodings aren't very different from each other. They look different on screen, but in the raw bytes of the files there is nothing which says whether 0x8A means Arabic ٹ (1256), Š (1250 and 1252), or nothing at all (1255).
PS: the last sentence looks wrong because of right-to-left rendering issues. The code "(1256)" actually comes after the Arabic character.
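You can see that ambiguity directly from Java, assuming a JDK that ships the extended Windows charsets (the normal case):

import java.nio.charset.Charset;

public class AmbiguousByte {
    public static void main(String[] args) {
        byte[] b = { (byte) 0x8A };
        // The same byte decodes to a different character under each code page
        // (and to the replacement character where it is unassigned):
        for (String cs : new String[] { "windows-1256", "windows-1250",
                                        "windows-1252", "windows-1255" }) {
            System.out.println(cs + " -> " + new String(b, Charset.forName(cs)));
        }
    }
}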
Say you have to choose between Windows-1256 (Arabic), UTF-8, and Windows-1252 (Western Europe). Then you can register proofs of a wrong encoding, say for UTF-8 (a nonsensical byte sequence) and for Windows-1252. Many Windows-1252 byte sequences are simply not parsable as UTF-8 anyway:
try {
    readInUTF8(file);
} catch (IsWindows1256Exception e) {
    readInWindows1256(file);
}
(Pseudo-code)
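A runnable version of that idea, using only the standard library: decode strictly as UTF-8 and fall back to Windows-1256 when the bytes are not valid UTF-8. This is a heuristic sketch, not a proof, and readDetecting is a made-up name:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetGuesser {
    static String readDetecting(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        // A strict decoder REPORTs malformed input instead of silently replacing it
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume Windows-1256 instead
            return new String(bytes, Charset.forName("windows-1256"));
        }
    }
}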
Elasticsearch is a search server which accepts data only in UTF-8.
When I try to give Elasticsearch the following text:
Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application (basically, my Java application takes this info from a webpage and gives it to Elasticsearch), ES complains it can't understand £ and it fails. After filtering it through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy it to a file in my home directory using bash, it goes in fine. Any pointers will help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 byte (a lone continuation-range byte is not valid UTF-8) and replaces it with the substitution character.
To fix this, you have to construct the string with the encoding it actually uses, and then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
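A sketch of that order of operations, assuming the webpage actually declares ISO-8859-1 (use whatever charset it really sends):

// bytes as fetched from the webpage, which declared ISO-8859-1
String s = new String(bytes, "ISO-8859-1"); // decode with the source charset
byte[] utf8 = s.getBytes("UTF-8");          // re-encode as UTF-8 for Elasticsearch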
UTF-8 is easier than one thinks. In a String, everything is Unicode characters.
Byte/string conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; it's usually better to use that one.)
// Read Cp1252 text and write it back out as UTF-8:
BufferedReader in = new BufferedReader(
    new InputStreamReader(new FileInputStream(inFile), "Cp1252"));
PrintWriter out = new PrintWriter(
    new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8"));

// In a servlet, declare the response encoding:
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");

String s = "20 \u00A3"; // "20 £", with the pound sign escaped as a Unicode literal
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String s is a series of characters that are basically independent of any character encoding (OK, not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR; never use the system default encoding, trust me, I have over 10 years of experience dealing with bugs related to wrong default encodings) or the encoding you explicitly specified when you loaded the data.
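In other words, prefer the overloads that take an explicit charset; a tiny illustration:

import java.nio.charset.StandardCharsets;

byte[] data = "£".getBytes(StandardCharsets.UTF_8);     // 0xC2 0xA3 on every JVM
String text = new String(data, StandardCharsets.UTF_8); // "£" on every JVM
// "£".getBytes() and new String(data) depend on file.encoding and vary per machine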
When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You create a string from a byte array that was encoded in UTF-8, yet just above you encoded it in ISO-8859-1; that is your error.
What you want to do is:
byte bytes[] = s.getBytes("UTF-8"); // encode the characters as UTF-8 octets
s = new String(bytes, "UTF-8");     // decode them back with the same charset: a lossless round trip