How to generate Chinese characters from a Java servlet? - java

My servlet looks like this
protected void processRequest(HttpServletRequest request,HttpServletResponse response) throws ServletException,IOException
{
PrintWriter out=response.getWriter();
out.println("<Html><Head><Title>Signup</Title></Head>\n<Body>\n");
out.println("\u5982 电话\n");
out.println("</Body>\n</Html>");
}
My browser can display Chinese characters from other websites.
I've tried two different ways of writing the Chinese characters, but both show up as ???
What's the correct way to do it?

No explicit encoding has been set for the response, so the container writes it using the default encoding, ISO-8859-1, which cannot represent Chinese characters.
You therefore need to specify an appropriate character encoding using HttpServletResponse.setCharacterEncoding() or HttpServletResponse.setContentType(), and you need to do so before calling getWriter(). Either of the following would do:
response.setCharacterEncoding("GB18030");
response.setContentType("text/html; charset=GB18030");
You may also use UTF-8 as the explicit encoding.
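The ??? in the question is what an encoder typically emits for characters its charset cannot represent. The following is a minimal sketch using only the JDK, with no servlet container involved; the class name ResponseEncodingDemo and the method roundTrip are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the servlet API.
final class ResponseEncodingDemo {

    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The three Chinese characters from the question, written as escapes.
        String text = "\u5982 \u7535\u8bdd";

        // ISO-8859-1 has no Chinese characters, so each one is replaced.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // prints "? ??"

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints "true"
    }
}
```

In a servlet, the equivalent choice is made by the encoding that is in effect when getWriter() is called.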

Try adding
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

You need to send Unicode text: have your servlet send UTF-8, and make sure the browser is set up to interpret the characters correctly.

Just setting the character encoding as UTF-8 worked for me.
response.setCharacterEncoding("UTF-8");

Related

JSP not showing correct UTF-8 contents for HTML form POST

I'm using Java 11 with Tomcat 9 with the latest JSP/JSTL. I'm testing in Chrome 71 and Firefox 64.0 on Windows 10. I have the following test document:
<%@ page contentType="text/html; charset=UTF-8" %>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8"/>
<title>Hello</title>
</head>
<body>
<c:if test="${not empty param.fullName}">
<p>Hello, ${param.fullName}.</p>
</c:if>
<form>
<div>
<label>Full name: <input name="fullName" /></label>
</div>
<button>Say Hello</button>
</form>
</body>
</html>
This is perhaps the simplest form possible. As you know the form method defaults to get, the form action defaults to "" (submitting to the same page), and the form enctype defaults to application/x-www-form-urlencoded.
If I enter the name "Flávio José" (a famous Brazilian forró singer and musician) in the field and submit, the form is submitted via HTTP GET to the same page using hello.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. This is correct, and the page says:
Hello, Flávio José.
If I change the form method to post and enter the same name "Flávio José", the form contents are instead submitted via POST, with HTTP request contents:
fullName=Fl%C3%A1vio+Jos%C3%A9
This also appears correct. But this time the page says:
Hello, FlÃ¡vio JosÃ©.
Rather than seeing %C3%A1 as a sequence of UTF-8 octets, JSP seems to think that these are a series of ISO-8859-1 octets (or code page 1252 octets), and is therefore decoding them to the wrong character sequence.
But where is it getting ISO-8859-1? What is my JSP page lacking to indicate the correct encoding?
I'll note also that WHATWG specification says that application/x-www-form-urlencoded octets should be parsed as UTF-8 by default. Is the Java servlet specification simply broken? How do I work around this?
This is caused by Tomcat, but the root problem is the Java Servlet 4 specification, which is incorrect and outdated.
Originally, HTML 4.01 said that application/x-www-form-urlencoded encoded octets should be decoded as US-ASCII. The servlet specification changed this to say that, if the request encoding is not specified, the octets should be decoded as ISO-8859-1. Tomcat is simply following the servlet specification.
There are two problems with the Java servlet specification. The first is that the modern interpretation of application/x-www-form-urlencoded is that encoded octets should be decoded using UTF-8. The second problem is that tying the octet decoding to the resource charset confuses two levels of decoding.
Take another look at this POST content:
fullName=Fl%C3%A1vio+Jos%C3%A9
You'll notice that it is ASCII! It doesn't matter if you consider the POST HTTP request charset to be ISO-8859-1, UTF-8, or US-ASCII; you'll still wind up with exactly the same Unicode characters before decoding the octets! Which encoding is used to decode the encoded octets is a completely separate matter.
As a further example, let's say I download a text file instructions.txt that is clearly marked as ISO-8859-1, and it contains the URI https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9. Just because the text file has a charset of ISO-8859-1, does that mean I need to decode %C3%A1 using ISO-8859-1? Of course not! The charset used for decoding URI characters is a separate level of decoding on top of the resource content type charset! Similarly, the octets of values encoded in application/x-www-form-urlencoded should be decoded using UTF-8, regardless of the underlying charset of the resource.
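The two levels of decoding can be tried out with the JDK alone. This is a small sketch, assuming Java 10 or later for the URLDecoder.decode(String, Charset) overload; the class name FormDecodingDemo and the method decodeValue are invented for the example.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the same ASCII form value with two charsets.
final class FormDecodingDemo {

    // Percent-decode a form value, interpreting the resulting octets in the given charset.
    static String decodeValue(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        // The value from the POST body. It contains only ASCII characters.
        String encoded = "Fl%C3%A1vio+Jos%C3%A9";

        // Octets C3 A1 and C3 A9 read as UTF-8: one accented letter each.
        System.out.println(decodeValue(encoded, StandardCharsets.UTF_8));

        // The same octets read as ISO-8859-1: two characters each, i.e. mojibake.
        System.out.println(decodeValue(encoded, StandardCharsets.ISO_8859_1));
    }
}
```

The input string is identical in both calls; only the charset applied to the percent-decoded octets differs, which is the second level of decoding described above.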
There are several workarounds, some of them found in the Tomcat character encoding FAQ under "use UTF-8 everywhere".
Set the request character encoding in your web.xml file.
Add the following to your WEB-INF/web.xml file:
<request-character-encoding>UTF-8</request-character-encoding>
This setting is agnostic of the servlet container implementation, and is set forth in the servlet specification. (Alternatively, you should be able to put it in Tomcat's conf/web.xml file, if you want a global setting and don't mind changing the Tomcat configuration.)
Set the SetCharacterEncodingFilter in your web.xml file.
Tomcat has a proprietary equivalent: use the org.apache.catalina.filters.SetCharacterEncodingFilter in the WEB-INF/web.xml file, as the Tomcat FAQ above mentions, and as illustrated by https://stackoverflow.com/a/37833977/421049, excerpted below:
<filter>
<filter-name>setCharacterEncodingFilter</filter-name>
<filter-class>org.apache.catalina.filters.SetCharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>setCharacterEncodingFilter</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
This will make your web application only work on Tomcat, so it's better to put this in the Tomcat installation's conf/web.xml file instead, as the post above mentions. In fact, the conf/web.xml shipped with Tomcat already has these two sections, commented out; simply uncomment them and things should work.
Force the request character encoding to UTF-8 in the JSP or servlet.
You can force the character encoding of the servlet request to UTF-8, somewhere early in the JSP:
<% request.setCharacterEncoding("UTF-8"); %>
But that is ugly, unwieldy, error-prone, and goes against modern best practices—JSP scriptlets shouldn't be used anymore.
Hopefully we can get a newer Java servlet specification to remove any relationship between the resource charset and the decoding of application/x-www-form-urlencoded octets, and simply state that application/x-www-form-urlencoded octets must be decoded as UTF-8, as is modern practice as clarified by the latest W3C and WHATWG specifications.
Update: I've updated the Tomcat FAQ on Character Encoding Issues with this information.

Apache struts internationalization and localization issue

I am working on a Struts 1 project which supports two languages, English and Turkish. To display messages we use the internationalization feature of Struts 1, so we have two property files (ApplicationResources_en.properties and ApplicationResources_tr.properties) that store the messages to be displayed to the user.
For the English version, ApplicationResources_en.properties, the key and value are
farequoteautomatic.entry-area.gen.emd.fareamount=Fare Amount
For the Turkish version, ApplicationResources_tr.properties, the key and value are
farequoteautomatic.entry-area.gen.emd.fareamount=Ücret Miktarı
Everything works fine when the locale is English: the output for that key is the expected Fare Amount.
But when the locale is changed to Turkish, the output is wrong. It displays special characters rather than the actual characters written in the property file.
In the property file the message is Ücret Miktarı, but the output in the browser is �cret Miktar�.
Note: I have checked that my Firefox browser defaults to Unicode (UTF-8) encoding, and we have a header.jsp, included in each page, which contains a META tag like <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I don't understand what I am doing wrong here. Please help me.
Check your browser encoding and set it to UTF-8.
Try this in web.xml:
<filter>
<filter-name>CharacterEncodingFilter</filter-name>
<filter-class>bt.gov.g2c.framework.common.CharacterEncodingFilter</filter-class>
<init-param>
<param-name>requestEncoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
</filter>
I followed the mkyong tutorial, which says:
For UTF-8 or non-English characters, for example Chinese, you should encode it with native2ascii tool.
With the help of the native2ascii tool,
farequoteautomatic.entry-area.gen.emd.fareamount=Ücret Miktarı
was converted to
farequoteautomatic.entry-area.gen.emd.fareamount=\ufeff\u00dccret Miktar\u0131
And in the browser I got the desired output, Ücret Miktarı.
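What the conversion does can be approximated in a few lines of Java. This is a rough sketch of the escaping step, not the native2ascii tool itself; the class name PropertiesEscapeDemo, the method escapeNonAscii, and the short key fareamount are made up for the example. It also loads the escaped line with plain java.util.Properties to show that the escapes turn back into the original characters (how Struts 1 loads its bundles is not shown here).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class: a simplified stand-in for the escaping that native2ascii performs.
final class PropertiesEscapeDemo {

    // Replace every non-ASCII character with a backslash, the letter u and four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                out.append(c); // plain ASCII is kept as-is
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // The Turkish message from the question, written with escapes in the source.
        String escaped = escapeNonAscii("\u00dccret Miktar\u0131");
        System.out.println(escaped); // an ASCII-only line that is safe in a .properties file

        // Properties understands these escapes and restores the original characters.
        Properties props = new Properties();
        props.load(new StringReader("fareamount=" + escaped));
        System.out.println(props.getProperty("fareamount").equals("\u00dccret Miktar\u0131")); // prints "true"
    }
}
```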

Pass data of non english character to web server

From HTML, I am sending data containing non-English characters to a controller,
e.g. passing multi章byte in the URL as a POST request.
But the controller receives the data as
multiç« byte
Can anyone help me fix this issue?
First, you need to tell the browser that you're serving UTF-8 content and expect UTF-8 form submissions. You can do that either by placing the following line at the very top of your JSP,
<%@page pageEncoding="UTF-8" %>
or by adding the following entry to web.xml so that you don't need to repeat @page in every JSP:
<jsp-config>
<jsp-property-group>
<url-pattern>*.jsp</url-pattern>
<page-encoding>UTF-8</page-encoding>
</jsp-property-group>
</jsp-config>
Then, you need to tell the servlet API to use UTF-8 to parse POST request parameters. You can do that by creating a servlet filter which does basically the following:
@WebFilter("/*")
public class CharacterEncodingFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain) throws IOException, ServletException {
        req.setCharacterEncoding("UTF-8");
        chain.doFilter(req, res);
    }
    // ...
}
That's all. You do not need <form accept-charset> (that would make things worse when using the MSIE browser) and you do not necessarily need <meta http-equiv="content-type"> (that's only interpreted when the page is served from the local disk file system instead of over HTTP).
Beware that when you use System.out.println() or a logger to verify the submitted value, it must in turn be configured to use UTF-8 to present the characters; otherwise you will be misled by still being presented with mojibake.
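One way to check a value without being fooled by the console is to print through a stream whose charset is set explicitly. The sketch below assumes Java 10 or later (for the PrintStream constructor that takes a Charset); the class name Utf8ConsoleDemo and the method printTo are invented for illustration, and whether the terminal then renders the bytes correctly still depends on the terminal's own settings.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows which bytes a PrintStream emits for a given charset.
final class Utf8ConsoleDemo {

    // Print the text through a PrintStream using the given charset and return the raw bytes.
    static byte[] printTo(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charset);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String value = "multi\u7ae0byte"; // the submitted value from the question

        // Three bytes (E7 AB A0) for the Chinese character when printed as UTF-8.
        System.out.println(printTo("\u7ae0", StandardCharsets.UTF_8).length); // prints "3"

        // A single '?' byte when the stream uses ISO-8859-1, even though the String is fine.
        System.out.println(printTo("\u7ae0", StandardCharsets.ISO_8859_1).length); // prints "1"

        // Wrapping System.out makes the output charset explicit instead of platform-dependent.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println(value);
    }
}
```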
See also:
Unicode - How to get the characters right?
You can use this in your action method:
request.setCharacterEncoding("UTF-8");
This must be called prior to any request.getParameter() call.
Change the encoding depending on your needs.
You can also refer to these:
HttpServletRequest - setCharacterEncoding seems to do nothing
request.getQueryString() seems to need some encoding

Special Characters In Webapp being saved differently

I'm creating a webapp using Spring MVC, and some of the information I'm pulling comes from a database, so it was edited elsewhere. Some of the imported text contains what I consider special characters, such as
“_blank”
as opposed to the straight quotes from a standard keyboard:
"_blank".
When I display this in a textarea on my website, it displays fine, but when I save it back into a String on submitting the form with the Spring textArea, the string has ? where the 'special' characters were. They were obviously imported into a String fine, but somewhere in the save process they are not preserved. Any idea what is causing this, or why?
Sounds like a character encoding problem. Try setting the character set of the page containing the form to UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Why does POST not honor charset, but an AJAX request does? tomcat 6

I have a tomcat based application that needs to submit a form capable of handling utf-8 characters. When submitted via ajax, the data is returned correctly from getParameter() in utf-8. When submitting via form post, the data is returned from getParameter() in iso-8859-1.
I used fiddler, and have determined the only difference in the requests, is that charset=utf-8 is appended to the end of the Content-Type header in the ajax call (as expected, since I send the content type explicitly).
ContentType from ajax:
"application/x-www-form-urlencoded; charset=utf-8"
ContentType from form:
"application/x-www-form-urlencoded"
I have the following settings:
ajax post (outputs chars correctly):
$.ajax( {
type : "POST",
url : "blah",
async : false,
contentType: "application/x-www-form-urlencoded; charset=utf-8",
data : data,
success : function(data) {
}
});
form post (outputs chars in iso)
<form id="leadform" enctype="application/x-www-form-urlencoded; charset=utf-8" method="post" accept-charset="utf-8" action="{//app/path}">
xml declaration:
<?xml version="1.0" encoding="utf-8"?>
Doctype:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
jvm parameters:
-Dfile.encoding=UTF-8
I have also tried using request.setCharacterEncoding("UTF-8"); but it seems as if tomcat simply ignores it. I am not using the RequestDumper valve.
From what I've read, POST data encoding is mostly dependent on the page encoding where the form is. As far as I can tell, my page is correctly encoded in utf-8.
The sample JSP from this page works correctly. It simply uses setCharacterEncoding("UTF-8"); and echoes the data you post. http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
So to summarize, the post request does not send the charset as being utf-8, despite the page being in utf-8, the form parameters specifying utf-8, the xml declaration or anything else. I have spent the better part of three days on this and am running out of ideas. Can anyone help me?
form post (outputs chars in iso)
<form id="leadform" enctype="application/x-www-form-urlencoded; charset=utf-8" method="post" accept-charset="utf-8" action="{//app/path}">
You don't need to specify the charset there. The browser will use the charset which is specified in the HTTP response header.
Just
<form id="leadform" method="post" action="{//app/path}">
is enough.
xml declaration:
<?xml version="1.0" encoding="utf-8"?>
Irrelevant. It's only relevant for XML parsers. Web browsers don't parse text/html as XML. This is only relevant for the server side (if you're using an XML-based view technology like Facelets or JSPX; on plain JSP this is superfluous).
Doctype:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Irrelevant. It's only relevant for HTML parsers. Besides, it doesn't specify any charset. Instead, the one in the HTTP response header will be used. If you aren't using an XML-based view technology like Facelets or JSPX, this can just as well be <!DOCTYPE html>.
meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Irrelevant. It's only relevant when the HTML page is being viewed from local disk or is to be parsed locally. Instead, the one in the HTTP response header will be used.
jvm parameters:
-Dfile.encoding=UTF-8
Irrelevant. It's only relevant to Sun/Oracle(!) JVM to parse the source files.
I have also tried using request.setCharacterEncoding("UTF-8"); but it seems as if tomcat simply ignores it. I am not using the RequestDumper valve.
This will only work when the request body has not been parsed yet (i.e. you haven't called getParameter() and so on beforehand). You need to call this as early as possible. A Filter is a perfect place for this. Otherwise it will be ignored.
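Why a late call is ignored is easier to see with a toy model. The class below is NOT Tomcat's code and not the servlet API; ToyRequest and everything in it are invented for illustration. It only imitates the behaviour described above, under the assumption that parameters are parsed once, on first access, with whatever encoding is set at that moment. It needs Java 10 or later for URLDecoder.decode(String, Charset).

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only: imitates "parse once on first access", it is not a real request class.
final class ToyRequest {
    private final String body;
    private Charset encoding = StandardCharsets.ISO_8859_1; // the default described in this thread
    private Map<String, String> parameters;                 // stays null until first access

    ToyRequest(String body) {
        this.body = body;
    }

    void setCharacterEncoding(Charset charset) {
        this.encoding = charset; // has no effect on parameters that were already parsed
    }

    String getParameter(String name) {
        if (parameters == null) {
            // First access: parse the whole body with the encoding in effect right now.
            parameters = new HashMap<>();
            for (String pair : body.split("&")) {
                String[] kv = pair.split("=", 2);
                String value = kv.length > 1 ? kv[1] : "";
                parameters.put(URLDecoder.decode(kv[0], encoding), URLDecoder.decode(value, encoding));
            }
        }
        return parameters.get(name);
    }

    public static void main(String[] args) {
        ToyRequest early = new ToyRequest("fullName=Fl%C3%A1vio+Jos%C3%A9");
        early.setCharacterEncoding(StandardCharsets.UTF_8); // set before anything is read
        System.out.println(early.getParameter("fullName")); // decoded as UTF-8

        ToyRequest late = new ToyRequest("fullName=Fl%C3%A1vio+Jos%C3%A9");
        late.getParameter("anything");                      // triggers parsing as ISO-8859-1
        late.setCharacterEncoding(StandardCharsets.UTF_8);  // too late in this model
        System.out.println(late.getParameter("fullName"));  // still the mis-decoded value
    }
}
```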
From what I've read, POST data encoding is mostly dependent on the page encoding where the form is. As far as I can tell, my page is correctly encoded in utf-8.
It's dependent on the HTTP response header.
All you need to do is the following three things:
Add the following to the top of your JSP:
<%@page pageEncoding="UTF-8" %>
This will set the response encoding to UTF-8 and declare UTF-8 as the charset in the response header.
Create a Filter which does the following in doFilter() method:
if (request.getCharacterEncoding() == null) {
request.setCharacterEncoding("UTF-8");
}
chain.doFilter(request, response);
This causes the POST request body to be processed as UTF-8.
Change the <Connector> entry in Tomcat/conf/server.xml as follows:
<Connector (...) URIEncoding="UTF-8" />
This causes GET query strings to be processed as UTF-8.
See also:
Unicode - How to get characters right? - contains practical background information and detailed solutions for Java EE web developers.
Try this :
How do I change how POST parameters are interpreted?
POST requests should specify the encoding of the parameters and values they send. Since many clients fail to set an explicit encoding, the default is used (ISO-8859-1). In many cases this is not the preferred interpretation so one can employ a javax.servlet.Filter to set request encodings. Writing such a filter is trivial. Furthermore Tomcat already comes with such an example filter.
Please take a look at:
5.x
webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
webapps/jsp-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
6.x
webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java
For more info , refer to the below URL
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
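The difference between the two requests in the question comes down to whether the Content-Type header carries a charset parameter. The following is a simplified sketch of that lookup with a fallback, written against the JDK only; ContentTypeCharset and charsetOf are hypothetical names, and a real container's header parsing is more thorough than this.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: picks the charset out of a Content-Type header value.
final class ContentTypeCharset {

    // Return the charset named in the header value, or the fallback if none is given.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1; // the default described in the FAQ text above

        // The AJAX request from the question: the charset is stated, so it is used.
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=utf-8", fallback)); // prints "UTF-8"

        // The plain form POST: no charset parameter, so the fallback applies.
        System.out.println(charsetOf("application/x-www-form-urlencoded", fallback)); // prints "ISO-8859-1"
    }
}
```

A filter that calls setCharacterEncoding("UTF-8") when no encoding is present effectively replaces that fallback.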
Have you tried accept-charset="UTF-8"? As you said, the data should be encoded according to the encoding of the page itself; it seems strange that tomcat is ignoring that. What browser are you trying this out on?
Have you tried specifying useBodyEncodingForURL="true" for the HTTP connector in your conf/server.xml?
I implemented a filter based on the information in this post and it is now working. However, this still doesn't explain why, even though the page was UTF-8, the charset used by Tomcat to interpret it was ISO-8859-1.
