I have a default installation of Tomcat 8.5.6. UTF-8 encoded requests do not seem to be interpreted correctly, even though the docs say the default (if not in strict mode) should be UTF-8 everywhere these days. My Java POST requests look like this:
HttpPost post = new HttpPost(url);
post.setEntity(new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8));
...
When testing, I see that the character ñ (n with a tilde) is not decoded correctly in my servlet handler:
public class MyServlet extends HttpServlet {
protected void doPost(HttpServletRequest request, ...) {
String tildeTest = request.getParameter("foo"); // no good.
}
}
If I explicitly set the encoding on the request before accessing the parameter, it decodes properly:
protected void doPost(HttpServletRequest request, ...) {
request.setCharacterEncoding("UTF-8");
String tildeTest = request.getParameter("foo"); // works!
...
}
So I'm not sure whether:
Tomcat 8.5.6 is not really using UTF-8 everywhere, and I need to set that manually in the config files somewhere.
My http request is missing some header which tells Tomcat which encoding to use - perhaps the http post is defaulting to some other encoding which Tomcat is just honoring.
Anyone know which one?
Thanks
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding
POST requests should specify the encoding of the parameters and values
they send. Since many clients fail to set an explicit encoding, the
default is used (ISO-8859-1).
What can you recommend to just make everything work? (How to use UTF-8 everywhere).
There are six ways listed to ensure this; for servlet requests, the first two should be relevant:
Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP Connector, AJP Connector.
Use a character encoding filter with the default encoding set to UTF-8
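To see why ñ comes out wrong when the container falls back to ISO-8859-1, here is a minimal standalone sketch. It does not involve Tomcat at all, and the class and method names are made up for illustration. It encodes the value as UTF-8, as the client in the question does, and then decodes the same bytes once with the wrong charset and once with the right one:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tomcat or the Servlet API.
final class MojibakeDemo {

    // Encode text with one charset and decode the bytes with another,
    // mimicking a charset mismatch between client and server.
    static String transcode(String text, Charset sentAs, Charset readAs) {
        return new String(text.getBytes(sentAs), readAs);
    }

    public static void main(String[] args) {
        // U+00F1 is ñ; written as an escape so the source file encoding does not matter.
        String original = "\u00F1";

        String wrong = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        String right = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // The two UTF-8 bytes C3 B1 become two separate Latin-1 characters.
        System.out.println("length when read as ISO-8859-1: " + wrong.length());
        System.out.println("length when read as UTF-8: " + right.length());
        System.out.println("round trip intact: " + right.equals(original));
    }
}
```

The body bytes are identical in both cases; only the charset used to read them differs, which is why calling request.setCharacterEncoding("UTF-8") before the first parameter access fixes the result.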
I have a Java RESTlet v2.1.2 method like this:
@Post("json")
public Representation doPost(Representation entity) throws UnsupportedEncodingException {
Request request = getRequest();
String entityAsText = request.getEntityAsText();
logger.info("entityAsText = " + entityAsText + " Üüÿê");
in the Cygwin console it prints:
2015-04-19 22:07:27 INFO BaseResource:46 - entityAsText = {
"Id":"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx",
"Field1":"John?????????",
"Field2":"Johnson??????????"
} ▄³ Û
As you can see, the Üüÿê is printed as ▄³ Û. The characters Üüÿê are also in the POST body sent from SOAP UI, but they are printed as ???. I have an implementation that does not use RESTlet where this works, so the settings in SOAP UI are not the problem. (The POST body is application/json, by the way.)
How can I extract the unicode chars Üüÿê from the POST body without getting them as ??? ?
I made a test and it works for me but perhaps I don't have same configuration regarding charset / encoding. I used a standalone Restlet application (no servlet) from Postman. Can you give us more details about the version of Restlet and the different editions / extensions you use (for example, Jackson, Servlet, ...)?
Here is what I have for Java (you can have a look at this link: How to Find the Default Charset/Encoding in Java?):
Default Charset=UTF-8
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=UTF8
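Values like these can be printed with a few lines of plain Java. The sketch below (the class name is only illustrative) also shows one common way to end up with question marks: encoding text with a charset that cannot represent the characters replaces each of them with ?. Whether that is what happens in the Restlet/Cygwin setup from the question is an assumption, not something I verified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting the JVM's charset settings.
final class CharsetCheck {

    // Encode to US-ASCII and decode again; characters that US-ASCII
    // cannot represent are replaced with '?' during encoding.
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));

        // U+00DC U+00FC U+00FF U+00EA are the four test characters from the question,
        // written as escapes so the source file encoding does not matter.
        String sample = "John\u00DC\u00FC\u00FF\u00EA";
        System.out.println(throughAscii(sample)); // prints John????
    }
}
```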
You can also specify the charset you sent for your content at the level of the header Content-Type: application/json;charset=utf-8.
I wrote a post some years ago about such issues when using a servlet container. Perhaps you can find some hints there: https://templth.wordpress.com/2011/06/05/does-your-java-based-web-applications-really-support-utf8/.
Hope it helps you,
Thierry
I am using Spring MVC's charset filter. This is the URL that I use to invoke my servlet from my applet
http://192.168.0.67/MyServlet?p1=団
As you can see, the parameter has a unicode character 団. So I use
URLEncoder.encode("団", "UTF-8");
and now my URL becomes
http://192.168.0.67/MyServlet?p1=%E5%9B%A3
However, from the servlet, calling
request.getParameter("p1");
already returns some gibberish that cannot be decoded with URLDecoder. BTW, invoking
URLDecoder.decode("%E5%9B%A3", "UTF-8");
does give me the original unicode character. It's just that the servlet has garbled the parameter before it can even be decoded. Does anyone know why? Does request.getParameter() not decode parameters as UTF-8?
The Spring MVC charset filter will only set the request body encoding, not the request URI encoding. You need to set the charset for the URI encoding in the servlet container configuration. A lot of servlet containers default to ISO-8859-1 to decode the URI. It's unclear which servlet container you're using, so here's just an example for Tomcat: edit the <Connector> entry of /conf/server.xml to add URIEncoding="UTF-8":
<Connector ... URIEncoding="UTF-8">
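The effect of the URI charset can be reproduced outside a container. The following is only a sketch: it uses URLDecoder as a stand-in for the container's own query-string decoding (which is internal to the container and not shown here), and the class and helper names are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo: decode the same percent-encoded value with two charsets.
final class QueryDecodingDemo {

    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    static String decode(String percentEncoded, String charsetName) {
        try {
            return URLDecoder.decode(percentEncoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String original = "\u56E3"; // the character from the question, as an escape
        String encoded = encode(original, "UTF-8");
        System.out.println(encoded); // prints %E5%9B%A3

        // Same bytes, two interpretations: one character vs. three unrelated ones.
        System.out.println(decode(encoded, "UTF-8").equals(original));  // prints true
        System.out.println(decode(encoded, "ISO-8859-1").length());     // prints 3
    }
}
```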
If you can't edit the server's configuration for some reason (e.g. 3rd party hosting and such), then you should consider using POST instead of GET:
String query = "p1=" + URLEncoder.encode("団", "UTF-8");
URLConnection connection = new URL(getCodeBase(), "MyServlet").openConnection();
connection.setDoOutput(true); // This sets request method to POST.
connection.getOutputStream().write(query.getBytes("UTF-8"));
// ...
This way you can use ServletRequest#setCharacterEncoding() in doPost() to tell the Servlet API what charset to use to parse the request body (or just rely on the Spring MVC charset filter to do this job):
request.setCharacterEncoding("UTF-8");
String p1 = request.getParameter("p1"); // You don't need to decode yourself!
// ...
See also:
Unicode - How to get the characters right?
My Servlet just won't use UTF-8 for JSON responses.
MyServlet.java:
public class MyServlet extends HttpServlet {
protected void doPost(HttpServletRequest req, HttpServletResponse res) throws Exception {
PrintWriter writer = res.getWriter();
res.setCharacterEncoding("UTF-8");
res.setContentType("application/json; charset=UTF-8");
writer.print(getSomeJson());
}
}
But special characters aren't showing up, and when I check the headers that I'm getting back in Firebug, I see Content-Type: application/json;charset=ISO-8859-1.
I did a grep -ri iso . in my Servlet directory, and came up with nothing, so nowhere am I explicitly setting the type to ISO-8859-1.
I should also specify that I'm running this on Tomcat 7 in Eclipse with a J2EE target as a development environment, with Solaris 10 and whatever they call their web server environment (somebody else admins this) as the production environment, and the behavior is the same.
I've also confirmed that the request submitted is UTF-8, and only the response is ISO-8859-1.
Update
I have amended the code to reflect that I am calling PrintWriter before I set the character encoding. I omitted this from my original example, and now I realize that this was the source of my problem. I read here that you have to set character encoding before you call HttpServletResponse.getWriter(), or getWriter will set it to ISO-8859-1 for you.
This was my problem. So the above example should be adjusted to
public class MyServlet extends HttpServlet {
protected void doPost(HttpServletRequest req, HttpServletResponse res) throws Exception {
res.setCharacterEncoding("UTF-8");
res.setContentType("application/json");
PrintWriter writer = res.getWriter();
writer.print(getSomeJson());
}
}
Once the encoding is set for a response, it cannot be changed.
The easiest way to force UTF-8 is to create your own filter which is the first to peek at the response and set the encoding.
Take a look at how Spring 3.0 does this. Even if you can't use Spring in your project, maybe you can get some inspiration (make sure your company policy allows you to get inspiration from open source licenses).
The code looks fine. Either you're not running the code you think you're running, or there's some Filter or proxy somewhere in the request-response chain which modifies the content type like that.
Aside from the specific problem, you really should consider getting the output stream and using a JSON library to write the contents directly as UTF-8 encoded JSON; there is no benefit to using writers.
Some JSON packages only work with strings, which is unfortunate, but most allow using more efficient streams (safer and more efficient as parser/generator can handle escaping and encoding aspects together).
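As a rough sketch of the stream-based approach: the helper below writes JSON text to an OutputStream as UTF-8 bytes, so no writer (and no writer default encoding) is involved. It does not use a JSON library; the JSON is assumed to be serialized already, and the class and method names are made up. In a servlet you would pass res.getOutputStream() where the example uses a ByteArrayOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, shown with an in-memory stream instead of a servlet response.
final class Utf8JsonWriter {

    // Writes already-serialized JSON text to the stream as UTF-8 bytes.
    static void writeJson(OutputStream out, String json) throws IOException {
        out.write(json.getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for res.getOutputStream() in a real servlet.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        String json = "{\"name\":\"\u00F1\"}";
        writeJson(sink, json);

        // Decoding the bytes as UTF-8 gives back the original text.
        String readBack = new String(sink.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(json));
    }
}
```

The Content-Type header still has to declare charset=UTF-8 so the client knows how to read the bytes.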
I recently had a problem with the encoding of pages generated by servlets, which occurred when the servlets were deployed under Tomcat, but not under Jetty. I did a little bit of research and simplified the problem to the following servlet:
public class TestServlet extends HttpServlet implements Servlet {
@Override
public void service(HttpServletRequest request, HttpServletResponse response) throws IOException {
response.setContentType("text/plain");
Writer output = response.getWriter();
output.write("öäüÖÄÜß");
output.flush();
output.close();
}
}
If I deploy this under Jetty and direct the browser to it, it returns the expected result. The data is returned as ISO-8859-1 and if I take a look into the headers, then Jetty returns:
Content-Type: text/plain; charset=iso-8859-1
The browser detects the encoding from this header. If I deploy the same servlet in Tomcat, the browser shows up strange characters. But Tomcat also returns the data as ISO-8859-1, the difference is, that no header tells about it. So the browser has to guess the encoding, and that goes wrong.
My question is: is that behaviour of Tomcat correct, or is it a bug? And if it is correct, how can I avoid this problem? Sure, I can always add response.setCharacterEncoding("UTF-8"); to the servlet, but that means I set a fixed encoding that the browser might or might not understand. The problem is even more relevant if the servlet is accessed not by a browser but by another service. So how should I deal with the problem in the most flexible way?
If you don't specify an encoding, the Servlet specification requires ISO-8859-1. However, AFAIK it does not require the container to set the encoding in the content type, at least not if you set it to "text/plain". This is what the spec says:
Calls to setContentType set the
character encoding only if the given
content type string provides a value
for the charset attribute.
In other words, only if you set the content type like this
response.setContentType("text/plain; charset=XXXX")
Tomcat is required to set the charset. I haven't tried whether this works though.
In general, I would recommend to always set the encoding to UTF-8 (as it causes the least amount of trouble, at least in browsers) and then, for text/plain, state the encoding explicitly, to prevent browsers from using a system default.
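To make the "browser has to guess" problem concrete, here is a small sketch in plain Java, with no servlet API involved and with illustrative names. It encodes the test string from the question as ISO-8859-1, as the container does by default, and then decodes the bytes as UTF-8, which is one guess a client might make when no charset is declared:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of a response written in one charset and read in another.
final class ResponseCharsetDemo {

    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Bytes written as ISO-8859-1 but decoded by the receiver as UTF-8.
    static String misread(String text) {
        byte[] sent = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(sent, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The seven umlaut/sharp-s characters from the test servlet, as escapes.
        String body = "\u00F6\u00E4\u00FC\u00D6\u00C4\u00DC\u00DF";

        System.out.println(byteLength(body, StandardCharsets.ISO_8859_1)); // prints 7
        System.out.println(byteLength(body, StandardCharsets.UTF_8));      // prints 14

        // The Latin-1 bytes are not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(misread(body).contains("\uFFFD")); // prints true
    }
}
```

Declaring the charset in the Content-Type header removes the guesswork, whichever encoding you pick.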
In support of Jesse Barnum's answer, the apache Wiki suggests that a filter can be used to control the character encoding of the request and the response. However, Tomcat 5.5 and up come bundled with a SetCharacterEncodingFilter so it may be better to use apache's implementation than to use Jesse's (no offense Jesse). The tomcat implementations only set the character encoding on the request, so modification may be necessary to use the filter as a means of setting the character set on the response of all servlets.
Specifically, Tomcat ships example implementations here:
5.x
webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
6.x
webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
7.x
Since 7.0.20 the filter has been a first-class citizen: it was moved from the examples into core Tomcat and is available to any web application without the need to compile and bundle it separately. See the documentation for the list of filters provided by Tomcat. The class name is:
org.apache.catalina.filters.SetCharacterEncodingFilter
This page tells more: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q3
Here's a filter that I wrote to force UTF-8 encoding:
public class CharacterEncodingFilter implements Filter {
private static final Logger log = Logger.getLogger( CharacterEncodingFilter.class.getName() );
boolean isConnectorConfigured = false;
public void init( FilterConfig filterConfig ) throws ServletException {}
public void doFilter( ServletRequest request, ServletResponse response, FilterChain chain ) throws IOException, ServletException {
request.setCharacterEncoding( "utf-8" );
response.setCharacterEncoding( "utf-8" );
if( ! isConnectorConfigured ) {
isConnectorConfigured = true;
try { //I need to do all of this with reflection, because I get NoClassDefErrors otherwise. --jsb
Field f = request.getClass().getDeclaredField( "request" ); //Tomcat wraps the real request in a facade, need to get it
f.setAccessible( true );
Object req = f.get( request );
Object connector = req.getClass().getMethod( "getConnector", new Class[0] ).invoke( req ); //Now get the connector
connector.getClass().getMethod( "setUseBodyEncodingForURI", new Class[] {boolean.class} ).invoke( connector, Boolean.TRUE );
} catch( NoSuchFieldException e ) {
log.log( Level.WARNING, "Servlet container does not seem to be Tomcat, cannot programmatically alter character encoding. Do this in the server.xml <Connector> attribute instead." );
} catch( Exception e ) {
log.log( Level.WARNING, "Could not setUseBodyEncodingForURI to true on connector" );
}
}
chain.doFilter( request, response );
}
public void destroy() {}
}
If you don't specify the encoding, Tomcat is free to encode your characters however it feels, and the browser is free to guess what encoding Tomcat picked. You are correct in that the way to solve the problem is response.setCharacterEncoding("UTF-8").
You shouldn't worry about the chance that the browser won't understand the encoding, as virtually all browsers released in the past 10 years support UTF-8. Though if you're really worried, you can inspect the "Accept-Charset" header provided by the user agent (not "Accept-Encoding", which is about compression such as gzip).
I have a link like this in a JSP page that is encoded as Big5:
http://hello/world?name=婀ㄉ
And when I input it in browser's URL bar, it will be changed to something like
http://hello/world?name=%23%24%23
When we read this parameter in the JSP page, all the characters are corrupted.
We have set this:
request.setCharacterEncoding("UTF-8"), so all the requests should be converted to UTF-8.
Why doesn't it work in this case?
Thanks in advance!
When you enter the URL in the browser's address bar, the browser may convert the character encoding before URL-encoding. However, this behavior is not well defined; see my question,
Handling Character Encoding in URI on Tomcat
We mostly get UTF-8 and Latin-1 on newer browsers but we get all kinds of encodings (including Big5) in old ones. So it's best to avoid non-ASCII characters in URL entered by user directly.
If the URL is embedded in JSP, you can force it into UTF-8 by generating it like this,
String link = "http://hello/world?name=" + URLEncoder.encode(name, "UTF-8");
On Tomcat, the encoding needs to be specified on Connector like this,
<Connector port="8080" URIEncoding="UTF-8"/>
You also need to use request.setCharacterEncoding("UTF-8") for the body encoding, but it's not safe to set this in a servlet: it only works if the parameters have not been parsed yet, and another filter or valve may trigger the parsing first. So you should do it in a filter. Tomcat comes with such a filter in the source distribution.
To avoid fiddling with the server.xml, use:
protected static final String CHARSET_FOR_URL_ENCODING = "UTF-8";
protected String encodeString(String baseLink, String parameter)
throws UnsupportedEncodingException {
return String.format(baseLink + "%s",
URLEncoder.encode(parameter, CHARSET_FOR_URL_ENCODING));
}
// Used in the servlet code to generate GET requests
response.sendRedirect(encodeString("userlist?name=", name));
To actually get those parameters on Tomcat you need to do something like:
final String name =
new String(request.getParameter("name").getBytes("iso-8859-1"), "UTF-8");
This is apparently (?) because request.getParameter() URL-decodes the string and interprets the bytes as ISO-8859-1, or as whatever URIEncoding is set to in the server.xml. For an example of how to get the URIEncoding charset from the server.xml for Tomcat 7 see here
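Why the getBytes("iso-8859-1") trick works can be checked without a container. The sketch below simulates the garbling (UTF-8 bytes read as ISO-8859-1) and then reverses it; the class and method names are invented, and the assumption is that the container really decoded the parameter as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of re-interpreting a parameter that was decoded with the wrong charset.
final class ParameterRepair {

    // Simulates a container that reads UTF-8 bytes as ISO-8859-1.
    static String garble(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes and decodes them as UTF-8.
    // Only valid if the value really was decoded as ISO-8859-1.
    static String reinterpret(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u5A40\u3109"; // the two characters from the question's URL
        String garbled = garble(original);

        System.out.println(garbled.length());                       // prints 6
        System.out.println(reinterpret(garbled).equals(original));  // prints true
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value to exactly one character. Note that once URIEncoding is set to UTF-8, the parameter arrives correctly decoded and applying this workaround on top would corrupt it.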
You cannot have non-ASCII characters in an URL - you always need to percent-encode them. When doing so, browsers have difficulties rendering them. Rendering works best if you encode the URL in UTF-8, and then percent-encode it. For your specific URL, this would give http://hello/world?name=%E5%A9%80%E3%84%89 (check your browser what it gives for this specific link). When you get the parameter in JSP, you need to explicitly unquote it, and then decode it from UTF-8, as the browser will send it as-is.
I had a problem with JBoss 7.0, and I think this filter solution also works with Tomcat:
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
HttpServletRequest httpRequest = (HttpServletRequest) request;
HttpServletResponse httpResponse = (HttpServletResponse) response;
try {
httpRequest.setCharacterEncoding(MyAppConfig.getAppSetting("System.Character.Encoding"));
String appServer = MyAppConfig.getAppSetting("System.AppServer");
if(appServer.equalsIgnoreCase("JBOSS7")) {
Field requestField = httpRequest.getClass().getDeclaredField("request");
requestField.setAccessible(true);
Object requestValue = requestField.get(httpRequest);
Field coyoteRequestField = requestValue.getClass().getDeclaredField("coyoteRequest");
coyoteRequestField.setAccessible(true);
Object coyoteRequestValue = coyoteRequestField.get(requestValue);
Method getParameters = coyoteRequestValue.getClass().getMethod("getParameters");
Object parameters = getParameters.invoke(coyoteRequestValue);
Method setQueryStringEncoding = parameters.getClass().getMethod("setQueryStringEncoding", String.class);
setQueryStringEncoding.invoke(parameters, MyAppConfig.getAppSetting("System.Character.Encoding"));
Method setEncoding = parameters.getClass().getMethod("setEncoding", String.class);
setEncoding.invoke(parameters, MyAppConfig.getAppSetting("System.Character.Encoding"));
}
} catch (NoSuchMethodException nsme) {
System.err.println(nsme.getLocalizedMessage());
nsme.printStackTrace();
MyLogger.logException(nsme);
} catch (InvocationTargetException ite) {
System.err.println(ite.getLocalizedMessage());
ite.printStackTrace();
MyLogger.logException(ite);
} catch (IllegalAccessException iae) {
System.err.println(iae.getLocalizedMessage());
iae.printStackTrace();
MyLogger.logException(iae);
} catch(Exception e) {
MyLogger.logException(e);
}
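try {
httpResponse.setCharacterEncoding(MyAppConfig.getAppSetting("System.Character.Encoding"));
} catch(Exception e) {
MyLogger.logException(e);
}
// Pass the request on to the rest of the chain; without this call the filter would block every request.
chain.doFilter(request, response);
}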
I did quite a bit of searching on this issue so this might help others who are experiencing the same problem on tomcat. This is taken from http://wiki.apache.org/tomcat/FAQ/CharacterEncoding.
(How to use UTF-8 everywhere).
Set URIEncoding="UTF-8" on your <Connector> in server.xml. References: HTTP Connector, AJP Connector.
Use a character encoding filter with the default encoding set to UTF-8
Change all your JSPs to include charset name in their contentType.
For example, use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for the pages in XML syntax (aka JSP Documents).
Change all your servlets to set the content type for responses and to include charset name in the content type to be UTF-8.
Use response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8").
Change any content-generation libraries you use (Velocity, Freemarker, etc.) to use UTF-8 and to specify UTF-8 in the content type of the responses that they generate.
Disable any valves or filters that may read request parameters before your character encoding filter or jsp page has a chance to set the encoding to UTF-8.