I'm writing a small proxy in Java that basically picks out 2 specific files and does some extra processing on them. One URL it just grabs some info out of the content before passing it along. The other file I want to filter the response content, which is just xml deflate encoded (I want to remove some child elements).
Now, the proxy works fine when I just pass though all content. However, when I try to filter the xml file it doesn't actually send the content to the client ???
Here is some code:
Within the Thread run() method that is spawned when accepting a Socket connection, once I determine the request is for the file I want to filter, I call:
filterRaceFile(serverIn, clientOut); // This won't send content
//streamHTTPData(serverIn, clientOut, true); // This passes through fine (but all content of course).
and here is the filtering method itself:
private void filterRaceFile(InputStream is, OutputStream os) {
// Pass through headers before using deflate and filtering xml
processHeader(is, os, new StringBuilder(), new StringBuilder());
// Seems to be 1 line left, inflater doesn't like it if we don't do this anyway...?
try {
os.write(readLine(is, false).getBytes());
} catch (IOException e) {
e.printStackTrace();
}
InflaterInputStream inflate = new InflaterInputStream(is);
DeflaterOutputStream deflate = new DeflaterOutputStream(os);
int c = 0;
try {
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xdoc = db.parse(inflate);
Node id = xdoc.getElementsByTagName("id").item(0);
Node msg = xdoc.getElementsByTagName("message").item(0);
StringBuilder xml_buf = new StringBuilder("<race>\n");
xml_buf.append("<id>").append(id.getTextContent()).append("</id>\n");
xml_buf.append("<message>").append(msg.getTextContent()).append("</message>\n");
xml_buf.append("<boats>\n");
NodeList allBoats = xdoc.getElementsByTagName("boat");
int N = allBoats.getLength();
for (int i = 0; i < N; ++i)
{
Node boat = allBoats.item(i);
Element boat_el = (Element)boat;
double lat = Double.parseDouble(boat_el.getElementsByTagName("lat").item(0).getTextContent());
double lon = Double.parseDouble(boat_el.getElementsByTagName("lon").item(0).getTextContent());
double dist = Geodesic.vincenty_earth_dist(proxy.userLat, proxy.userLon, lat, lon)[0];
if (dist <= LOS_DIST)
{
String boatXML = xmlToString(boat);
//<?xml version="1.0" encoding="UTF-8"?> is prepended to the xml
int pos = boatXML.indexOf("?>")+2;
boatXML = boatXML.substring(pos);
xml_buf.append(boatXML);
++c;
}
}
System.out.printf("%d boats within LOS distance\n", c);
xml_buf.append("</boats>\n");
xml_buf.append("</race>");
byte[] xml_bytes = xml_buf.toString().getBytes("UTF-8");
deflate.write(xml_bytes);
} catch (Exception e) {
e.printStackTrace();
}
// flush the OutputStream and return
try {
os.flush();
} catch (Exception e) {
e.printStackTrace();
}
}
I also have a simple pass through method that simply writes the server's InputStream to the client's OutputStream, uses readLine also and works fine - ie any other url shows up in the browser no problems, so readLine is ok. The boolean parameter is to let it know it is reading from a deflate stream, as it uses mark and read internally, which isn't supported on deflate streams.
XML is very simple:
<race>
{some data to always pass thru}
<boats>
<boat>
<id>1234</id>
....
<lat>-23.3456</lat>
<lon>17.2345</lon>
</boat>
{more boat elements}
</boats>
</race>
And it produces the xml I want it to send to the client fine, but the client just doesn't receive it (shows content-length of 0 in a web-debugger, although there is no Content-Length header in the original response either)
Any ideas as to what is going on, or what I should be doing that I am not ??
I don't see any error, but if you are already using a Document, can't you just modify the document and then use your XML library to write the whole Document out? This seems more robust than manual XML formatting.
You need to add a call to deflate.close() after deflate.write() in order to flush the output and correctly close the stream.
Related
Apparently for excel to open CSV files nicely, it should have the Byte Order Mark at the start. The download of CSV is implemented by writing into HttpServletResponse's output stream in the controller, as the data is generated during request. I get an exception when I try to write the BOM bytes - java.io.CharConversionException: Not an ISO 8859-1 character: [] (even though the encoding I specified is UTF-8).
The controller's method in question
#RequestMapping("/monthly/list")
public List<MonthlyDetailsItem> queryDetailsItems(
MonthlyDetailsItemQuery query,
#RequestParam(value = "format", required = false) String format,
#RequestParam(value = "attachment", required = false, defaultValue="false") Boolean attachment,
HttpServletResponse response) throws Exception
{
// load item list
List<MonthlyDetailsItem> list = detailsSvc.queryMonthlyDetailsForList(query);
// adjust format
format = format != null ? format.toLowerCase() : "json";
if (!Arrays.asList("json", "csv").contains(format)) format = "json";
// modify common response headers
response.setCharacterEncoding("UTF-8");
if (attachment)
response.setHeader("Content-Disposition", "attachment;filename=duomenys." + format);
// build csv
if ("csv".equals(format)) {
response.setContentType("text/csv; charset=UTF-8");
response.getOutputStream().print("\ufeff");
response.getOutputStream().write(buildMonthlyDetailsItemCsv(list).getBytes("UTF-8"));
return null;
}
return list;
}
I have just come across, this same problem. The solution which works for me is to get the output stream from the response Object and write to it as follows
// first create an array for the Byte Order Mark
final byte[] bom = new byte[] { (byte) 239, (byte) 187, (byte) 191 };
try (OutputStream os = response.getOutputStream()) {
os.write(bom);
final PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
w.print(data);
w.flush();
w.close();
} catch (IOException e) {
// logit
}
So UTF-8 is specified on the OutputStreamWriter.
As an addendum to this, I should add, the same application needs to allow users to upload files, these may or may not have BOM's. This may be dealt with by using the class org.apache.commons.io.input.BOMInputStream, then using that to construct a org.apache.commons.csv.CSVParser.
The BOMInputStream includes a method hasBOM() to detect if the file has a BOM or not.
One gotcha that I first fell into was that the hasBOM() method reads (obviously!) from the underlying stream, so the way to deal with this is to first mark the stream, then after the test if it doesn't have a BOM, reset the stream. The code I use for this looks like the following:
try (InputStream is = uploadFile.getInputStream();
BufferedInputStream buffIs = new BufferedInputStream(is);
BOMInputStream bomIn = new BOMInputStream(buffIs);) {
buffIs.mark(LOOKAHEAD_LENGTH);
// this should allow us to deal with csv's with or without BOMs
final boolean hasBOM = bomIn.hasBOM();
final BufferedReader buffReadr = new BufferedReader(
new InputStreamReader(hasBOM ? bomIn : buffIs, StandardCharsets.UTF_8));
// if this stream does not have a BOM, then we must reset the stream as the test
// for a BOM will have consumed some bytes
if (!hasBOM) {
buffIs.reset();
}
// collect the validated entity details
final CSVParser parser = CSVParser.parse(buffReadr,
CSVFormat.DEFAULT.withFirstRecordAsHeader());
// Do stuff with the parser
...
// Catch and clean up
Hope this helps someone.
It doesn't make much sense: the BOM is for UTF-16; there is no byte order with UTF-8. The encoding You've set with setCharacterEncoding is used for getWriter, not for getOutputStream.
UPDATE:
OK, try this:
if ("csv".equals(format)) {
response.setContentType("text/csv; charset=UTF-8");
PrintWriter out = response.getWriter();
out.print("\uFEFF");
out.print(buildMonthlyDetailsItemCsv(list));
return null;
}
I'm assuming that method buildMonthlyDetailsItemCsv returns a String.
My application downloads xml files that happen to be either encoded in UTF-8 or ISO-8859-1 (the software that generates those files is crappy so it does that). I'm from Germany so we're using Umlauts (ä,ü,ö) so it really makes a difference how those files are encoded.
I know that the XmlPullParser has a method .getInputEncoding() which correctly detects how my files are encoded. However I have to set the encoding in my FileInputStream already (which is before I get to call .getInputEncoding()). So far I'm just using a BufferedReader to read the XML file and search for the entry that specifies the encoding and then instantiate my PullParser afterwards.
private void setFileEncoding() {
try {
bufferedReader.reset();
String firstLine = bufferedReader.readLine();
int start = firstLine.indexOf("encoding=") + 10; // +10 to actually start after "encoding="
String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));
// now set the encoding to the reader to be used for parsing afterwards
bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
bufferedReader.mark(0);
} catch (IOException e) {
e.printStackTrace();
}
}
Is there a different way to do this? Can I take advantage of the .getInputEncoding method? Right now the method seems kinda useless to me because how does my encoding matter if I've already had to set it before being able to check for it.
If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you're doing. However, be aware that it can be wrong; it can disagree with the actual encoding.
If you want to detect the encoding directly, independently of the (potentially wrong) XML declaration encoding setting, use a library such as ICU CharsetDetector or the older jChardet.
ICU CharsetDetector:
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
jChardet:
// Initalize the nsDetector() ;
int lang = (argv.length == 2)? Integer.parseInt(argv[1])
: nsPSMDetector.ALL ;
nsDetector det = new nsDetector(lang) ;
// Set an observer...
// The Notify() will be called when a matching charset is found.
det.Init(new nsICharsetDetectionObserver() {
public void Notify(String charset) {
HtmlCharsetDetector.found = true ;
System.out.println("CHARSET = " + charset);
}
});
URL url = new URL(argv[0]);
BufferedInputStream imp = new BufferedInputStream(url.openStream());
byte[] buf = new byte[1024] ;
int len;
boolean done = false ;
boolean isAscii = true ;
while( (len=imp.read(buf,0,buf.length)) != -1) {
// Check if the stream is only ascii.
if (isAscii)
isAscii = det.isAscii(buf,len);
// DoIt if non-ascii and not done yet.
if (!isAscii && !done)
done = det.DoIt(buf,len, false);
}
det.DataEnd();
if (isAscii) {
System.out.println("CHARSET = ASCII");
found = true ;
}
You may be able to get the correct character-set from the content-type header, if your server sends it correctly.
I am creating a web application using the Spark Java framework. The front-end is developed using AngularJS.
I want to generate a .docx file on the server (in-memory) and send this to the client for download.
To achieve this I created an angular service with the following function being called after the user clicks on a download button:
functions.generateWord = function () {
$http.post('/api/v1/surveys/genword', data.currentSurvey).success(function (response) {
var element = angular.element('<a/>');
element.attr({
href: 'data:attachment;charset=utf-8;application/vnd.openxmlformats-officedocument.wordprocessingml.document' + response,
target: '_blank',
download: 'test.docx'
})[0].click();
});
};
On the server, this api call gets forwarded to the following method:
public Response exportToWord(Response response) {
try {
File file = new File("src/main/resources/template.docx");
FileInputStream inputStream = new FileInputStream(file);
byte byteStream[] = new byte[(int)file.length()];
inputStream.read(byteStream);
response.raw().setContentType("data:attachment;chatset=utf-8;application/vnd.openxmlformats-officedocument.wordprocessingml.document");
response.raw().setContentLength((int) file.length());
response.raw().getOutputStream().write(byteStream);
response.raw().getOutputStream().flush();
response.raw().getOutputStream().close();
return response;
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
I have tried to solve this in MANY different ways and I always end up with a corrupted 'test.docx' that looks like this:
Solved it by using blobs and specifying the response type as 'arraybuffer' in the $http.post api call. The only bad thing with this solution (as far as I know) is that it doesn't play well with IE, but that's a problem for another day.
functions.generateWord = function () {
$http.post('/api/v1/surveys/genword', data.currentSurvey, {responseType: 'arraybuffer'})
.success(function (response) {
var blob = new Blob([response], {type: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'});
var url = (window.URL || window.webkitURL).createObjectURL(blob);
var element = angular.element('<a/>');
element.attr({
href: url,
target: '_blank',
download: 'survey.docx'
})[0].click();
});
};
I think what went wrong was that the byte stream got encoded as plain text when I tried to create a URL with:
href: 'data:attachment;charset=utf-8;application/vnd.openxmlformats-officedocument.wordprocessingml.document' + response
thus corrupting it.
When using blobs instead, I get a "direct" link to the generated byte stream and no encoding is done on it since the response type is set to 'arraybuffer'.
Note that this is just my own reasoning of why things went wrong with the original code. I might be terribly wrong, so feel free to correct me if that's the case.
I posted this question to the CXF list, without any luck. So here we go. I am trying to upload large files to a remote server (think of them virtual machine disks). So I have a restful service that accepts upload requests. The handler for the upload looks like:
#POST
#Consumes(MediaType.MULTIPART_FORM_DATA)
#Path("/doupload")
public Response receiveStream(MultipartBody multipart) {
List<Attachment> allAttachments = body.getAllAttachments();
Attachment att = null;
for (Attachment b : allAttachments) {
if (UPLOAD_FILE_DESCRIPTOR.equals(b.getContentId())) {
att = b;
}
}
Assert.notNull(att);
DataHandler dh = att.getDataHandler();
if (dh == null) {
throw new WebApplicationException(HTTP_BAD_REQUEST);
}
try {
InputStream is = dh.getInputStream();
byte[] buf = new byte[65536];
int n;
OutputStream os = getOutputStream();
while ((n = is.read(buf)) > 0) {
os.write(buf, 0, n);
}
ResponseBuilder rb = Response.status(HTTP_CREATED);
return rb.build();
} catch (IOException e) {
log.error("Got exception=", e);
throw new WebApplicationException(HTTP_INTERNAL_ERROR);
} catch (NoSuchAlgorithmException e) {
log.error("Got exception=", e);
throw new WebApplicationException(HTTP_INTERNAL_ERROR);
} finally {}
}
The client for this code is fairly simple:
public void sendLargeFile(String filename) {
WebClient wc = WebClient.create(targetUrl);
InputStream is = new FileInputStream(new File(filename));
Response r = wc.post(new Attachment(Constants.UPLOAD_FILE_DESCRIPTOR,
MediaType.APPLICATION_OCTET_STREAM, is));
}
The code works fine in terms of functionality. In terms of performance, I noticed that before my handler (receiveStream() method) gets the first byte out of the stream, the whole stream actually gets persisted into a temporary file (using a CachedOutputStream). Unfortunately, this is not acceptable for my purposes.
My handler simply passes the incoming bytes to a backend storage system (virtual machine disk repository), and waiting for the whole disk to be written to a cache only to be read again takes a lot of time, tying up a lot of resources, and reducing throughput.
There is a cost associated with writing the blocks and reading them again, since the app is running in the cloud, and the cloud provider charges per block read/written.
Since every byte is written to the local disk, my service VM must have enough disk space to accommodate the total sizes of all the streams being uploaded (i.e., if I have 10 uploads of 100GB each, I must have 1TB of disk just to cache the content). That again is extra money, as the size of the service VM grows dramatically, and the cloud provider charges for the provisioned disk size as well.
Given all of this, I am looking for a way to use the HTTP InputStream (or as close to it as possible) to read the attachment directly from there and handle it afterwards. I guess the question translates into one of:
- Is there a way to tell CXF not do caching
- OR - is there a way to pass CXF an output stream (one I write) to use, rather than using CachedOutputStream
I found a similar question here. The resolution says use CXF 2.2.3 or later, I am using 2.4.4 (and tried with 2.7.0) with no luck.
Thanks.
I think it's logically not possible (neither in CXF or anywhere else). You're calling getAllAttachements(), which means that the server should collect information about them from the HTTP input stream. It means that the entire stream has to go into memory for MIME parsing.
In your case you should work directly with the stream, and do the MIME parsing yourself:
public Response receiveStream(InputStream input) {
Now you have full control of the input and can consume it into memory byte-by-byte.
I ended up fixing the problem in an unelegant way, but it works, so I wanted to share my experience. Please do let me know if there are some "standard" or better ways.
Since I am writing the server side, I knew I was accessing all the attachments in the order they were sent, and process them as they are streamed in. So, to reflect that behavior of the handler method (receiveStream() method above), I created a new annotation on the server side called "#SequentialAttachmentProcessing" and annotatated my above method with it.
Also, wrote a subclass of Attachment, called SequentialAttachment that acts like a linked list. It has a skip() method that skips over the current attachment, and when an attachment ends, hasMore() method tells you whether there is another one.
Then I wrote a custom multipart/form-data provider which behaves as follows: If the target method is annotated as above, handle the attachment, otherwise call the default provider to do the handling. When it is handled by my provider, it always returns at most one attachment. Hence it could be misleading to a non-suspecting handling method. However, I think it is acceptable since the writer of the server must have annotated the method as "#SequentialAttachmentProcessing" and therefore must know what that entails.
As a result the implementation of the receiveStream() method is now something like:
#POST
#SequentialAttachmentProcessing
#Consumes(MediaType.MULTIPART_FORM_DATA)
#Path("/doupload")
public Response receiveStream(MultipartBody multipart) {
List<Attachment> allAttachments = body.getAllAttachments();
Assert.isTrue(allAttachments.size() <= 1);
if (allAttachment.size() > 0) {
Attachment head = allAttachments.get(0);
Assert.isTrue(head instanceof SequentialAttachment);
SequentialAttachment att = (SequentialAttachment) head;
while (att != null) {
DataHandler dh = att.getDataHandler();
InputStream is = dh.getInputStream();
byte[] buf = new byte[65536];
int n;
OutputStream os = getOutputStream();
while ((n = is.read(buf)) > 0) {
os.write(buf, 0, n);
}
if (att.hasMore()) {
att = att.next();
}
}
}
}
While this solved my immediate problem, I still believe there has to be a standard way of doing this. I hope this helps someone.
I'm trying to get data from website which is encoded in UTF-8 and insert them into the database (MYSQL). Database is also encoded in UTF-8.
This is the method I use to download data from specific site.
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s, "UTF-8");
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252"); ) everything works fine and I am getting Côte d'Ivoire on my website (), but in java this title looks like 'C?´te d'Ivoire' what breaks other things, such as for example links. What does it mean ?
I would consider using commons-io, they have a function doing what you want to do:link
That is replace your code with this:
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
String content = null;
try {
s = (java.io.InputStream)new URL(url).getContent();
content = IOUtils.toString(s, "UTF-8")
}
finally {
if (s != null) s.close();
}
return content.toString();
}
if that nots doing start looking into if you can store it to file correctly to eliminate the possibility that your db isn't set up correctly.
Java
The problem seems to lie in the HttpServletResponse , if you have a servlet or jsp page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a jsp page or in the doGet or doPost of a servlet, before any content is sent to the response, just do :
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try to use the utf8-encode function after retrieving from the database.
Is your database encoding set to UTF-8 for both server, client, connection and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding during the display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
Write them to HTTP response output using the same encoding, thus UTF-8.
Set content type to UTF-8 so that the webbrowser knows which encoding to use to display text.
As per the comments, you have apparently already done 2. Left behind 1, in PHP you need to install mb_string and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.