Bad encoding of streamed CSV with Stripes / Tomcat - java

Actually i'm trying to stream a CSV file. I set the encoding to windows-1252 but it seems it is still streamed as UTF-8 file.
final String encoding = "windows-1252";
exportResolution = new StreamingResolution(builder.getContentType() + ";charset=" + encoding.toLowerCase()) {
#Override
public void stream(HttpServletResponse response) throws Exception {
// Set response headers
response.setHeader("Cache-control", "private, max-age=0");
response.setCharacterEncoding(encoding);
OutputStream os = response.getOutputStream();
writeExportStream(os,builder);
}
}.setFilename(filename);
writeExportStream just streams the content to the outputstream (with pagination and db calls, it takes some time)
It doesn't work in local (jetty plugin) + dev (tomcat) Neither with firefox / chrome
I've not tested but people at work told me that it works better when we don't stream the content but we write the file in one time after having loaded all the objets we want from db.
Anybody know what is happening? Thanks
Btw my headers:
HTTP/1.1 200 OK
Content-Language: fr-FR
Content-Type: text/csv;charset=windows-1252
Content-Disposition: attachment;filename="export_rshop_01-02-11.csv"
Cache-Control: private, max-age=0
Transfer-Encoding: chunked
Server: Jetty(6.1.14)
I want the file to be able to be imported in excel in windows-1252 but i can't, it just open in utf8 while my header is windows-1252

The problem lies in the writeExportStream(os,builder); method. We can't see what encoding operations it is performing, but I'm guessing it is writing UTF-8 data.
The output operation needs to perform two encoding tasks:
Tell the client what encoding the response text is in (via the headers)
Encode the data writen to the client in a matching encoding (e.g. via a writer)
Step 1 is being done correctly. Step 2 is probably the source of the error.
If you use the provided writer, it will encode character data in the appropriate response encoding.
If pre-encoded data is written via the raw byte stream (getOutputStream()), you need to make sure this process uses the same encoding.

Related

How do I send an HTTP response without Transfer Encoding: chunked?

I have a Java Servlet that responds to the Twilio API. It appears that Twilio does not support the chunked transfer that my responses are using. How can I avoid using Transfer-Encoding: chunked?
Here is my code:
// response is HttpServletResponse
// xml is a String with XML in it
response.getWriter().write(xml);
response.getWriter().flush();
I am using Jetty as the Servlet container.
I believe that Jetty will use chunked responses when it doesn't know the response content length and/or it is using persistent connections. To avoid chunking you either need to set the response content length or to avoid persistent connections by setting "Connection":"close" header on the response.
Try setting the Content-length before writing to the stream. Don't forget to calculate the amount of bytes according to the correct encoding, e.g.:
final byte[] content = xml.getBytes("UTF-8");
response.setContentLength(content.length);
response.setContentType("text/xml"); // or "text/xml; charset=UTF-8"
response.setCharacterEncoding("UTF-8");
final OutputStream out = response.getOutputStream();
out.write(content);
The container will decide itself to use Content-Length or Transfer-Encoding basing on the size of data to be written by using Writer or outputStream. If the size of the data is larger than the HttpServletResponse.getBufferSize(), then the response will be trunked. If not, Content-Length will be used.
In your case, just remove the 2nd flushing code will solve your problem.

HTTPClient MultipartEntity seems to be adding garbage text to StringBody parts

I am trying to use Apache Commons's HttpClient to send a multipart POST request with a binary file and a couple of string parameters.
However, it seems that somewhere along the line, some garbage text is making its way into my string parameters. For instance, as confirmed through the debugger, the sizeBody variable here is indeed holding the value "100":
StringBody sizeBody = new StringBody("100", Charset.forName("UTF-8"));
However, if I listen to the request with Wireshark, I see this:
--o2mm51iGsng9w0Pb-Guvf8XDwXgG7BPcupLnaa
Content-Disposition: form-data; name="x"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
100
a5
--o2mm51iGsng9w0Pb-Guvf8XDwXgG7BPcupLnaa
Note the a5 after the 100.
What could be causing this? Where should I look?
What you are seeing are likely to be chunk headers used by so the called chunk transfer encoding [1]. See if the message head has a Transfer-Encoding: chunked field.
[1] http://en.wikipedia.org/wiki/Chunked_transfer_encoding
I had this same issue testing my POSTs with NanoHTTPD receiving them. It's indeed that HttpClient is using chunked transfer encoding, which NanoHTTPD doesn't support. It did that in my case because the binary file was supplied via an InputStreamBody, and since that cannot determine its own content length (it just sends back -1), the client uses chunked encoding.
I switched to using a ByteArrayBody for the file contents, and since that and StringBody can supply content lengths, the requests now do not use chunked encoding.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
IOUtils.copy (fileInputStream, baos); // from Apache Commons IO, or roll your own
ContentBody filePart = new ByteArrayBody (baos.toByteArray(), fileName);
Of course, if your file is huge, loading the whole thing into a byte array as above could cause memory problems.

Jersey REST WS - request body UTF-8

I have simple Jersey REST webServices:
#POST
#Path("/label")
#Consumes(MediaType.TEXT_HTML)
public Response setLabels(String requestBody) {
System.out.println(requestBody);
......
}
Request passes some text with "special" non-English characters
[{"За обекта"}]
I can see in Firebug that request passed with correct UTF-8 content and charset
Content-Type text/plain; charset=UTF-8
Though on on server output does not present desirable charset:
[{"?? ??????"}]
Any Idea what and were went wrong? How can I capture text in correct charset on server side?
System.out is a PrintStream. It uses the platform default encoding, which is typically not UTF-8. So you are getting the correct data in, it's just getting mangled when you print it to the console.
I had the exact same problem a few weeks ago - drove me nuts until I figured it out. What made it worse is that I actually had an encoding-related bug in another part of the code.

How to get non-latin characters from website?

I try to get data from latata.pl/pl.php and view all sign (polish - iso-8859-2)
final URL url = new URL("http://latata.pl/pl.php");
final URLConnection urlConnection = url.openConnection();
final BufferedReader in = new BufferedReader(new InputStreamReader(
urlConnection.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
}
in.close();
It doesn't work. :( Any ideas?
InputStream reader has multiple constructors and you can (should/have to) specify encoding in such case in one of these constructors.
Your InputStreamReader will be attempting to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.
Assuming the web server is doing a good job, you can find the correct encoding in one of the HTTP headers (I forget which one). Or you can just assume it's iso-8859-2, but that might break later.
This is too long for a comment but who set that webpage? You? From what I can see it doesn't look correct.
Here's what you get back:
$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl
HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html
����ʣ��Connection closed by foreign host.
The HTML is simply:
<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>
And that's how your page appears from a browser. Is there a valid reason why no charset is specified in that HTML page?
The output of your php-script pl.php is faulty. There is a HTTP-header Content-Type: text/html set without a declared charset. Without a declared charset, the client has to assume that it is ISO-8859-1 regarding to the HTTP-specifications. The sent body is ±ê³ó¿¡Ê£¯¬ if interpreted as ISO-8859-1.
The bytes sended by the php-script are representing ąęłóżĄĘŁŻŹ if it were declared as
Content-Type: text/html; charset=ISO-8859-2
You can check this with the simple code fragment, which will transform the faulty ISO-8859-1 encoding to ISO-8859-2:
final String test="±ê³ó¿¡Ê£¯¬";
String corrupt=new String(test.getBytes("ISO-8859-1"),"ISO-8859-2");
System.out.println(corrupt);
The output will be ąęłóżĄĘŁŻŹ, which are some polish characters.
As a quick fix, set the charset in your php-script to output Content-Type: text/html; charset=ISO-8859-2 as HTTP-Header.
But you should think about to switch to UTF-8 encoded output anyway.
As someone has already stated there is no charset encoding specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in central Europe) results in legitimate polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed as this is the default.
The response headers need to include the header Content-Type: text/html; charset=ISO-8859-2 for the character code points to be interpreted correctly. This charset will be used when constructing the response InputStream.

How to gzip ajax requests with Struts 2?

How to gzip an ajax response with Struts2? I tried to create a filter but it didn't work. At client-side I'm using jQuery and the ajax response I'm expecting is in json.
This is the code I used on server:
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(out);
gz.write(json.getBytes());
gz.close();
I'm redirecting the response to dummy jsp page defined at struts.xml.
The reason why I want to gzip the data back is because there's a situation where I must send a relatively big sized json back to the client.
Any reference provided will be appreciated.
Thanks.
You shouldn't randomly gzip responses. You can only gzip the response when the client has notified the server that it accepts (understands) gzipped responses. You can do that by determining if the Accept-Encoding request header contains gzip. If it is there, then you can safely wrap the OutputStream of the response in a GZIPOutputStream. You only need to add the Content-Encoding header beforehand with a value of gzip to inform the client what encoding the content is been sent in, so that the client knows that it needs to ungzip it.
In a nutshell:
response.setContentType("application/json");
response.setCharacterEncoding("UTF-8");
OutputStream output = response.getOutputStream();
String acceptEncoding = request.getHeader("Accept-Encoding");
if (acceptEncoding != null && acceptEncoding.contains("gzip")) {
response.setHeader("Content-Encoding", "gzip");
output = new GZIPOutputStream(output);
}
output.write(json.getBytes("UTF-8"));
(note that you would like to set the content type and character encoding as well, this is taken into account in the example)
You could also configure this at appserver level. Since it's unclear which one you're using, here's just a Tomcat-targeted example: check the compression and compressableMimeType attributes of the <Connector> element in /conf/server.xml: HTTP connector reference. This way you can just write to the response without worrying about gzipping it.
If your response is JSON I would recommend using the struts2-json plugin http://struts.apache.org/2.1.8/docs/json-plugin.html and setting the
enableGZIP param to true.

Categories