I have a simple Jersey REST web service:
@POST
@Path("/label")
@Consumes(MediaType.TEXT_HTML)
public Response setLabels(String requestBody) {
    System.out.println(requestBody);
    ......
}
The request passes some text with "special" non-English characters:
[{"За обекта"}]
I can see in Firebug that request passed with correct UTF-8 content and charset
Content-Type text/plain; charset=UTF-8
However, the server output does not show the desired characters:
[{"?? ??????"}]
Any idea what went wrong, and where? How can I capture the text in the correct charset on the server side?
System.out is a PrintStream. It uses the platform default encoding, which is typically not UTF-8. So you are getting the correct data in, it's just getting mangled when you print it to the console.
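To check this, the stream can be rewrapped with an explicit charset before printing; a minimal sketch (the `PrintStream(OutputStream, boolean, Charset)` constructor used here exists since Java 10):

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    public static void main(String[] args) {
        // System.out uses the platform default encoding, so rewrap it with
        // an explicit UTF-8 charset before printing non-Latin text.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println("[{\"За обекта\"}]");
    }
}
```

Note that if the console itself is not set to UTF-8 (e.g. a default Windows cmd window), the characters will still display wrong there, even though the bytes written are now correct.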
I had the exact same problem a few weeks ago - drove me nuts until I figured it out. What made it worse is that I actually had an encoding-related bug in another part of the code.
Related
I am trying to build a basic downloaded-file scanning extension for the popular open-source security application ZAP. Using the built-in sniffer, I can access the HTTP response messages, but I am unable to determine the type of the file being downloaded. Although the Mozilla blog about HTTP talks about using the MIME type in the 'Content-Type' header to determine the file type, none of the response messages I get have anything other than application/json, text/html, or application/octet-stream. How do I determine whether the corresponding HTTP response body contains a particular file type? I am stuck at a dead end!
I am a beginner in this field and there might be something I am overlooking. Any help or pointers would be greatly appreciated.
The Content-Type entity-header field indicates the media type of the entity-body sent to the recipient or, in the case of the HEAD method, the media type that would have been sent had the request been a GET.
Taken from https://www.rfc-editor.org/rfc/rfc2616 under "14.17 Content-Type"
They give this as an example:
Content-Type: text/html; charset=ISO-8859-4
This says the HTTP request or response contains a body of HTML text, encoded in ISO-8859-4.
If you do not trust this header (though most of the time you can), the next step is to analyze the file contents. For example, if the file contains opening and closing HTML tags, there is a good chance it is an HTML file. If the file begins with a [ or { and ends with a ] or }, there is a good chance it is a JSON file. A real analysis would, of course, need to be much more detailed.
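A minimal sketch of such content analysis, checking a few well-known magic numbers (PDF, PNG, ZIP) and then falling back to the trivial JSON/HTML heuristics described above (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class FileTypeSniffer {
    // Hypothetical helper: guess a file type from the first bytes of a response body.
    public static String sniff(byte[] body) {
        // Magic numbers: "%PDF", 0x89 "PNG", "PK" (ZIP container).
        if (body.length >= 4 && body[0] == '%' && body[1] == 'P' && body[2] == 'D' && body[3] == 'F') return "pdf";
        if (body.length >= 4 && (body[0] & 0xFF) == 0x89 && body[1] == 'P' && body[2] == 'N' && body[3] == 'G') return "png";
        if (body.length >= 2 && body[0] == 'P' && body[1] == 'K') return "zip";
        // Fall back to crude text heuristics.
        String text = new String(body, StandardCharsets.UTF_8).trim();
        if (text.startsWith("{") || text.startsWith("[")) return "json";
        if (text.toLowerCase().contains("<html")) return "html";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.7 ...".getBytes(StandardCharsets.UTF_8))); // pdf
        System.out.println(sniff("{\"a\":1}".getBytes(StandardCharsets.UTF_8)));    // json
    }
}
```

A production scanner would consult a real signature database rather than a handful of hard-coded prefixes, but the principle is the same.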
A GET call is made on a REST API with Accept as text/csv and Content-Type as application/json. In what format should the response be?
Should the response be in JSON format or in CSV format?
In HTTP, the Accept header is used by the client to tell the server what content types they'll accept.
The server will then send back the response and will set the Content-type header telling the client the type of the content actually returned.
You might have noticed that Content-Type is also included in some HTTP requests. This is because some types of HTTP requests, like POST or PUT, can send data to the server. In that case the client tells the server the type of the content data, using the Content-Type header.
Now to your question: a GET request should not have any Content-Type header, because it sends no body. I hope this is clear after my explanation above.
As you correctly note, the Accept header is used by HTTP clients to tell the server what content types they'll accept. The server will then send back a response, which will include a Content-Type header telling the client what the content type of the returned content actually is.
However, as you may have noticed, HTTP requests can also contain Content-Type headers. Why? Well, think about POST or PUT requests. With those request types, the client is actually sending a bunch of data to the server as part of the request, and the Content-Type header tells the server what the data actually is (and thus determines how the server will parse it).
In particular, for a POST request resulting from an HTML form submission, the Content-Type of the request will (normally) be one of the standard form content types below, as specified by the enctype attribute on the <form> tag:
application/x-www-form-urlencoded (default, older, simpler, slightly less overhead for small amounts of simple ASCII text, no file upload support)
multipart/form-data (newer, adds support for file uploads, more efficient for large amounts of binary data or non-ASCII text)
source: https://webmasters.stackexchange.com/users/12578/ilmari-karonen
Based on that, you should go by the Accept header: respond with CSV and set Content-Type: text/csv accordingly.
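On the client side, a sketch of setting the Accept header with the java.net.http client available since Java 11 (the endpoint URL is hypothetical; the request is only built and inspected here, not sent):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AcceptCsvRequest {
    public static void main(String[] args) {
        // The Accept request header asks the server for CSV. The server then
        // reports the format it actually returned in the Content-Type
        // *response* header.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/report"))
                .header("Accept", "text/csv")
                .GET()
                .build();
        System.out.println(request.headers().firstValue("Accept").orElse(""));
    }
}
```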
I have a Java RESTlet v2.1.2 method like this:
@Post("json")
public Representation doPost(Representation entity) throws UnsupportedEncodingException {
Request request = getRequest();
String entityAsText = request.getEntityAsText();
logger.info("entityAsText = " + entityAsText + " Üüÿê");
in the Cygwin console it prints:
2015-04-19 22:07:27 INFO BaseResource:46 - entityAsText = {
"Id":"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx",
"Field1":"John?????????",
"Field2":"Johnson??????????"
} ▄³ Û
As you can see, Üüÿê is printed as ▄³ Û. The characters Üüÿê are also in the POST body in SoapUI, but they're printed as ???. I have an implementation that does not use RESTlet where this works, so the settings in SoapUI are not the problem. (The POST body is application/json, btw.)
How can I extract the Unicode characters Üüÿê from the POST body without getting them as ????
I made a test and it works for me, but perhaps I don't have the same configuration regarding charset/encoding. I used a standalone Restlet application (no servlet) and sent the request from Postman. Can you give us more details about the version of Restlet and the different editions/extensions you use (for example, Jackson, Servlet, ...)?
Here is what I have for Java (you can have a look at this link: How to Find the Default Charset/Encoding in Java?):
Default Charset=UTF-8
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=UTF8
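For reference, values like those above can be printed with a few standard-library calls; a minimal sketch along the lines of the linked answer:

```java
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharsetInfo {
    public static void main(String[] args) {
        // Three ways of asking for the "default" encoding; on a misconfigured
        // JVM they can disagree, as in the output listed above.
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset in Use=" + new OutputStreamWriter(System.out).getEncoding());
    }
}
```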
You can also specify the charset you send for your content at the level of the Content-Type header: application/json;charset=utf-8.
I wrote a post some years ago about such issues when using a servlet container. Perhaps you can find some hints there to help you: https://templth.wordpress.com/2011/06/05/does-your-java-based-web-applications-really-support-utf8/.
Hope it helps you,
Thierry
I'm trying to stream a CSV file. I set the encoding to windows-1252, but it seems it is still streamed as a UTF-8 file.
final String encoding = "windows-1252";
exportResolution = new StreamingResolution(builder.getContentType() + ";charset=" + encoding.toLowerCase()) {
@Override
public void stream(HttpServletResponse response) throws Exception {
// Set response headers
response.setHeader("Cache-control", "private, max-age=0");
response.setCharacterEncoding(encoding);
OutputStream os = response.getOutputStream();
writeExportStream(os,builder);
}
}.setFilename(filename);
writeExportStream just streams the content to the output stream (with pagination and DB calls, it takes some time).
It doesn't work locally (Jetty plugin) or on dev (Tomcat), with either Firefox or Chrome.
I haven't tested it myself, but people at work told me it works better when we don't stream the content, and instead write the whole file at once after loading all the objects we want from the DB.
Does anybody know what is happening? Thanks
Btw my headers:
HTTP/1.1 200 OK
Content-Language: fr-FR
Content-Type: text/csv;charset=windows-1252
Content-Disposition: attachment;filename="export_rshop_01-02-11.csv"
Cache-Control: private, max-age=0
Transfer-Encoding: chunked
Server: Jetty(6.1.14)
I want the file to be importable into Excel as windows-1252, but I can't get that: it just opens as UTF-8, even though my header says windows-1252.
The problem lies in the writeExportStream(os,builder); method. We can't see what encoding operations it is performing, but I'm guessing it is writing UTF-8 data.
The output operation needs to perform two encoding tasks:
Tell the client what encoding the response text is in (via the headers)
Encode the data written to the client in a matching encoding (e.g. via a writer)
Step 1 is being done correctly. Step 2 is probably the source of the error.
If you use the provided writer, it will encode character data in the appropriate response encoding.
If pre-encoded data is written via the raw byte stream (getOutputStream()), you need to make sure this process uses the same encoding.
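To see why this matters, note that the same text produces different bytes in the two encodings; a small self-contained demonstration (the string is just an example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String s = "prénom"; // "é" is the interesting character here
        byte[] cp1252 = s.getBytes(Charset.forName("windows-1252")); // é -> one byte, 0xE9
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);            // é -> two bytes, 0xC3 0xA9
        System.out.println(cp1252.length + " vs " + utf8.length);    // 6 vs 7
    }
}
```

So if writeExportStream emits UTF-8 bytes, wrapping the raw stream in something like new OutputStreamWriter(os, "windows-1252") (assuming os is the response output stream, as in the code above) and writing characters through that writer would make the bytes match the declared header.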
An HTTP POST request is made to my servlet. The request contains a posted form parameter named "payload" that my code in the servlet retrieves for further processing. When the value of the payload includes the windows-1252 character "’" (byte value 146, which is outside the ASCII range), the HttpServletRequest instance method getParameter("payload") returns null. There is nothing in the server.log related to the problem. We think the character encoding used to produce this character is windows-1252. The character encoding Glassfish defaults to for HTTP requests appears to be ISO-8859-1, and byte 146 is a control character in ISO-8859-1.
Does anyone have any suggestions as to how I could solve this problem?
The http request headers in the post that showed the problem are:
POST /dbxchange/TechAnywhere HTTP/1.1
CONTENT_LENGTH: 13117
Content-type: application/x-www-form-urlencoded
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Mozilla/4.0 (Windows Vista 6.0) Java/1.6.0_16
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Content-Length: 13117
Java doesn't care about the differences between Cp1252 and Latin-1. Since there are no invalid byte sequences in either encoding, you wouldn't get null with either one. I think your server is using UTF-8 and the browser is using Cp1252 or Latin-1.
Try putting the following attribute in the form to see if it helps (the standard HTML attribute for this is accept-charset):
<form action="..." method="post" accept-charset="UTF-8" ...>
We think the character encoding used to produce this character is windows-1252.
Yes, very probably. Even when browsers claim to be using ISO-8859-1, they are usually actually using windows-1252.
The character encoding glassfish defaults to for http requests appears to be ISO-8859-1
Most likely it is defaulting to your system's Java ‘default encoding’. This is rarely what you want, as it makes your application break when you redeploy it.
For reading POST request bodies, you should be able to fix the encoding by calling setCharacterEncoding on the request object, as long as you can do it early enough so that no-one has already caused it to read the body by calling methods such as getParameter. Try setting the encoding to "Cp1252". Although really you ought to be aiming for UTF-8 for everything in the long run.
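The difference between the two decodings can be seen directly; a small sketch of what the single byte 0x92 becomes under each charset:

```java
import java.nio.charset.Charset;

public class Cp1252Demo {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0x92 }; // the byte the browser sends for ’
        String latin1 = new String(raw, Charset.forName("ISO-8859-1"));
        String cp1252 = new String(raw, Charset.forName("windows-1252"));
        System.out.println((int) latin1.charAt(0)); // 146: an unprintable C1 control character
        System.out.println((int) cp1252.charAt(0)); // 8217: U+2019, right single quotation mark
    }
}
```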
Unfortunately there is not a standard J2EE way to specify what encoding your application expects for all requests (including query string parameters, which are not affected by setCharacterEncoding). Each server has its own way, which creates annoying deployment issues. But for Glassfish, set a <parameter-encoding> in your sun-web.xml.
We have found that the problem is in the JavaScript code that sends the POST request. The JavaScript code was URL-encoding the value of the payload before sending the request, using the built-in JavaScript function escape(). This encoded the character in the non-standard form %u2019. It appears that Glassfish does not support this non-standard form of encoding.
See http://en.wikipedia.org/wiki/Percent-encoding#Non-standard_implementations
The fix was to use the built-in JavaScript function encodeURI(), which returns "%E2%80%99" for ’.
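On the server side, that standard percent-encoded UTF-8 form decodes cleanly; a minimal check (the URLDecoder.decode(String, Charset) overload used here exists since Java 10):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PercentDecodeDemo {
    public static void main(String[] args) {
        // encodeURI() produces standard percent-encoded UTF-8, which
        // URLDecoder understands; the %u2019 form produced by escape()
        // does not follow the percent-encoding grammar at all.
        String decoded = URLDecoder.decode("%E2%80%99", StandardCharsets.UTF_8);
        System.out.println(decoded); // ’
    }
}
```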