I am using Apache HttpComponents (4.1-alpha2) to upload files to Dropbox. This is done using multipart form data. What is the correct way to encode filenames in a multipart form that contain international (non-ASCII) characters?
If I use their standard API, the server returns an HTTP status Forbidden. If I modify the upload code so the file name is URL-encoded:
MultipartEntity entity = new MultipartEntity(HttpMultipartMode.BROWSER_COMPATIBLE);
FileBody bin = new FileBody(file_obj, URLEncoder.encode(file_obj.getName(), HTTP.UTF_8), HTTP.UTF_8, HTTP.OCTET_STREAM_TYPE );
entity.addPart("file", bin);
req.setEntity(entity);
The file is uploaded, but I end up with a filename that is still encoded. E.g. %D1%82%D0%B5%D1%81%D1%82.txt
To solve this issue specifically for the Dropbox server I had to encode the filename in UTF-8. To do this I had to declare my multipart entity as follows:
MultipartEntity entity = new MultipartEntity(HttpMultipartMode.BROWSER_COMPATIBLE, null, Charset.forName(HTTP.UTF_8));
I was getting the Forbidden response because the OAuth-signed entity did not match the actual entity sent (it was being URL-encoded).
For those interested in what the standards have to say on this, I did some reading of the RFCs.
If the standard is strictly adhered to, then all headers should be encoded 7-bit, which would make UTF-8 encoding of the filename illegal. However, RFC 2388 states:
The original local file name may be supplied as well, either as a "filename" parameter either of the "content-disposition: form-data" header or, in the case of multiple files, in a "content-disposition: file" header of the subpart. The sending application MAY supply a file name; if the file name of the sender's operating system is not in US-ASCII, the file name might be approximated, or encoded using the method of RFC 2231.
Many posts mention using either RFC 2231 or RFC 2047 for encoding non-US-ASCII headers in 7 bits. However, RFC 2047 explicitly states (section 5) that encoded words MUST NOT be used in a Content-Disposition field. That would leave only RFC 2231; it is an extension, however, and cannot be relied upon to be implemented in all servers. The reality is that most major browsers send non-US-ASCII characters in UTF-8 (hence the HttpMultipartMode.BROWSER_COMPATIBLE mode in the Apache HTTP client), and because of this most web servers support it. Another thing to note is that if you use HttpMultipartMode.STRICT on the multipart entity, the library will substitute a question mark (?) for each non-ASCII character in the filename.
I would have thought that the implementation of the FileBody would take responsibility for applying the appropriate rules from RFC 2047 itself. The filename would then be encoded as =?UTF-8?Q?=D1=82=D0=B5=D1=81=D1=82.txt?= or something very similar.
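For illustration, an RFC 2047 encoded-word in the B (base64) form is easy to produce with the JDK alone. This is a sketch of the format only: as noted above, RFC 2047 itself forbids encoded words in Content-Disposition parameters, so servers are under no obligation to decode it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncodedWord {
    // Produce an RFC 2047 "encoded-word" using B (base64) encoding.
    // Shown purely to illustrate the format; RFC 2047 disallows
    // encoded-words inside Content-Disposition parameters.
    static String bEncode(String text) {
        String b64 = Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }

    public static void main(String[] args) {
        System.out.println(bEncode("тест.txt"));
    }
}
```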
Quick fix:
new String(multipartFile.getOriginalFilename().getBytes("iso-8859-1"), "UTF-8");
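The one-liner works because the container decoded the incoming UTF-8 bytes as ISO-8859-1, a byte-preserving mapping that can be reversed. A minimal round-trip demonstration (the garbling step simulates what the container did):

```java
import java.nio.charset.StandardCharsets;

public class FilenameRoundTrip {
    // Reverse a name that was mis-decoded as ISO-8859-1 (the quick fix above)
    static String fix(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "тест.txt";
        // Simulate a container that decoded the UTF-8 bytes as ISO-8859-1
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(fix(garbled).equals(original)); // prints true
    }
}
```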
I have a project that fetches XML files from URLs, scrapes them, pulls the data, and then processes it. I am also building the URL from user input, so I need to check whether the URL actually points to an XML file before scraping. Any ideas how to do that? Basically: how can I tell whether a URL returns an XML file or not?
Ways to know whether GETing a URL will retrieve XML...
Before retrieving the file
Have an out-of-band guarantee.
Inspect the Content-Type HTTP header of the response to a HEAD request.¹
After retrieving the file
Inspect the Content-Type HTTP header of the response.¹
Sniff root element.
Files.probeContentType(path)
Parse via conforming XML parser without getting any well-formedness errors.
Note: Only parsing via a conforming XML parser is guaranteed to provide 100% determination.
¹ MIME assignments for XML data:
application/xml (RFC 7303, previously RFC 3023)
text/xml (RFC 7303, previously RFC 3023)
Other MIME assignments used with XML applications.
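The HEAD-request check can be sketched with nothing but the JDK. The MIME test below also accepts the +xml suffix convention (e.g. application/rss+xml), which goes beyond the two registrations listed above:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

public class XmlProbe {
    // Does a Content-Type header value look like XML?
    static boolean looksLikeXml(String contentType) {
        if (contentType == null) return false;
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("application/xml")
                || mime.equals("text/xml")
                || mime.endsWith("+xml"); // e.g. application/rss+xml
    }

    // Issue a HEAD request and inspect the declared Content-Type.
    // Remember: only parsing with a conforming XML parser is a guarantee.
    static boolean headSaysXml(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return looksLikeXml(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```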
I want to get clarity on these headers in my application:
response.setHeader("Content-Security-Policy", "frame-ancestors 'self'");
response.setHeader("X-Content-Type-Options", "nosniff");
response.setHeader("X-XSS-Protection", "1; mode=block");
response.setHeader("Strict-Transport-Security", "max-age=7776000; includeSubDomains");
String contextPath = ((HttpServletRequest) request).getContextPath();
response.setHeader("SET-COOKIE", "JSESSIONID=" +
((HttpServletRequest)request).getSession().getId() +
";Path="+contextPath+";Secure;HttpOnly");
response.setHeader("Cache-control", "no-cache, no-store, max-age=0, must-revalidate");
response.setHeader("Pragma", "no-cache");
response.setHeader("X-Frame-Options", "SAMEORIGIN");
As of now I know:
Content Security Policy is an added layer of security that helps to
detect and mitigate certain types of attacks, including Cross Site
Scripting (XSS) and data injection attacks.
The X-Content-Type-Options response HTTP header is a marker used by the server to indicate that the MIME types advertised in the Content-Type headers should be followed and not changed (i.e. no MIME sniffing).
X-XSS-Protection enables the browser's built-in XSS filter.
Strict-Transport-Security is an opt-in security enhancement that is specified by a web application through the use of a special response header. Once a supported browser receives this header that browser will prevent any communications from being sent over HTTP to the specified domain and will instead send all communications over HTTPS.
Cache-control general-header field is used to specify directives for caching mechanisms in both, requests and responses.
Pragma is meant to prevent the client from caching the response. Cache-Control and Pragma do the same job, except that Pragma is the HTTP/1.0 implementation and Cache-Control is the HTTP/1.1 implementation of the same concept.
X-Frame-Options is used to indicate whether or not a browser should be allowed to render a page in a frame, iframe or object.
Now I have this code in CrossSiteScriptingFilter, which is mapped in web.xml and does XSS filtering. But as a result it changes the encoding of .png files and removes the ? characters, which corrupts the PNG data.
Please check the screenshot: the ? characters are gone, replaced by an empty string, and as a result the .png files do not render.
I analysed the code and found that removing the X-Content-Type-Options response header does the job (.png files render properly).
I am still not sure why this problem occurs, or why X-Content-Type-Options caused the ? characters to be replaced with an empty string. Can somebody explain?
Thanks in advance :)
It sounds to me like you're pretty close to your answer: XSS filtering of special characters is a bad idea with binary files which may validly use characters that would be out of place in (x)html, js, or similar interpreted files.
Normally, web apps split such resources into their own directory that will have a different process applied to its contents, say, not running an XSS protection filter over it. When you configure the filter, you should exclude paths known to exclusively contain binary data, such as the aforementioned resource directories.
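That exclusion can be sketched as a simple path check invoked at the top of the filter. The path prefixes here are assumptions; adapt them to your app layout, and call this from your CrossSiteScriptingFilter.doFilter before wrapping the response:

```java
public class XssFilterPaths {
    // Paths that serve binary resources and must bypass the sanitising
    // wrapper. These prefixes are assumptions; adjust for your app.
    private static final String[] BINARY_PREFIXES = {
            "/images/", "/static/", "/resources/"
    };

    static boolean shouldSkipSanitising(String servletPath) {
        for (String prefix : BINARY_PREFIXES) {
            if (servletPath.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
    // In doFilter:
    // if (shouldSkipSanitising(request.getServletPath())) {
    //     chain.doFilter(request, response); // pass PNGs through untouched
    //     return;
    // }
}
```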
What seems likely is that the headers are either triggering or suppressing the filter's MIME-type guessing, so it misinterprets your binary as HTML or similar (probably based on the text in the PNG header), or simply falls back to filtering by default, and then sanitises it. It could also be that your Content-Type headers are wrong and the sniffer normally fixes them (hence telling it not to sniff prevents that recovery).
I am creating a Java 7 REST service using Spring and Apache CXF.
public SuccessfulResponse uploadFile(@Multipart("report") Attachment attachment)
I use the "Content-Disposition" parameter to retrieve the file name. I've read some solutions that are used for downloading files (for example, URL-encoding). But how do I cope with non-ASCII filenames on upload? Is it a client-side or a server-side solution? The signature of the method above can be changed. The client side uses the HTML5 File API + an iframe.
My experience is that Content-Disposition doesn't handle UTF-8 well in general. You can simply add another multipart field for the filename: multipart fields support charset indication and handle UTF-8 characters if done correctly.
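On the wire, that extra field is just another form-data part whose body is UTF-8 text. A sketch of what such a part looks like (the boundary and the field name "filename" are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class FilenameField {
    // Build the extra text part that carries the real filename, declaring
    // its charset explicitly. Boundary and field name are illustrative.
    static String filenamePart(String boundary, String filename) {
        return "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"filename\"\r\n"
                + "Content-Type: text/plain; charset=UTF-8\r\n"
                + "\r\n"
                + filename + "\r\n";
    }

    public static void main(String[] args) {
        String part = filenamePart("----boundary42", "репорт.txt");
        byte[] wire = part.getBytes(StandardCharsets.UTF_8); // bytes to send
        System.out.println(wire.length > 0);
    }
}
```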
You can use UTF-8 for a filename (according to https://www.rfc-editor.org/rfc/rfc6266 and https://www.rfc-editor.org/rfc/rfc5987). For Spring the easiest way is to use the org.springframework.http.ContentDisposition class. For example:
ContentDisposition disposition = ContentDisposition
        .builder("attachment")
        .filename("репорт.export", StandardCharsets.UTF_8)
        .build();
return ResponseEntity
        .ok()
        .contentType(MediaType.APPLICATION_OCTET_STREAM)
        .header(HttpHeaders.CONTENT_DISPOSITION, disposition.toString())
        .body((out) -> messageService.exportMessages(out));
That is an example of sending a file from the server (a download). To upload a file you can follow the same RFCs; the Content-Disposition header just has to be prepared in the browser, by JavaScript for example, and looks like:
Content-Disposition: attachment;
filename="EURO rates";
filename*=utf-8''%e2%82%ac%20rates
The filename parameter is optional in this case and is a fallback for systems that don't support RFC 6266 (it contains the ASCII file name). The value of filename* must be URL-encoded (https://www.url-encode-decode.com).
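On the server, decoding the filename* extended value is mechanical: split off the charset, skip the optional language tag, and percent-decode. A JDK-only sketch (note that URLDecoder also turns a literal + into a space, which a strict RFC 5987 decoder would not):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ExtValue {
    // Decode an RFC 5987 ext-value such as "utf-8''%e2%82%ac%20rates".
    static String decode(String extValue) {
        String[] parts = extValue.split("'", 3); // charset ' language ' value
        if (parts.length != 3) {
            throw new IllegalArgumentException("not an ext-value: " + extValue);
        }
        try {
            return URLDecoder.decode(parts[2], parts[0]);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("utf-8''%e2%82%ac%20rates")); // € rates
    }
}
```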
I am calling a restful service that returns JSON using the Apache HttpClient.
The problem is I am getting different results in the encoding of the response when I run the code on different platforms.
Here is my code:
GetMethod get = new GetMethod("http://urltomyrestservice");
get.addRequestHeader("Content-Type", "text/html; charset=UTF-8");
...
HttpResponse response = httpexecutor.execute(request, conn, context);
response.setParams(params);
httpexecutor.postProcess(response, httpproc, context);
StringWriter writer = new StringWriter();
IOUtils.copy(response.getEntity().getContent(), writer);
When I run this on OS X, Asian characters such as 張惠妹 come back fine in the response. But when I run the same code on a Linux server, the characters display as ???
The Linux server is an Amazon EC2 instance running Java 1.6.0_26-b03.
My local OS X machine is running 1.6.0_29-b11.
Any ideas really appreciated!!!!!
If you look at the javadoc of org.apache.commons.io.IOUtils.copy(InputStream, Writer):
Copy bytes from an InputStream to chars on a Writer using the default
character encoding of the platform.
So that will give different results depending on the platform's default encoding (which is what you're seeing).
Also, Content-Type is usually a response header (unless you're using POST or PUT). The server is likely to ignore it (though you might have more luck with the Accept-Charset request header).
You need to parse the charset parameter of the response's Content-Type header and use that to convert the response into a String (if a String is what you're actually after). I expect Commons HTTP has code that will do that automatically for you. If it doesn't, Spring's RestTemplate definitely does.
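If you end up doing it by hand, pulling the charset out of the Content-Type value is only a few lines. The UTF-8 fallback here is an assumption (HTTP's historical default for text/* was ISO-8859-1):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ResponseCharset {
    // Extract the charset parameter from a Content-Type header value,
    // falling back to UTF-8 when none is declared (assumption).
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim().toLowerCase(Locale.ROOT);
                if (param.startsWith("charset=")) {
                    String name = param.substring("charset=".length())
                            .replace("\"", "");
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

You can then pass the resolved charset to the IOUtils.copy overload that takes an encoding, instead of relying on the platform default.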
I believe the problem is not in the HTTP encoding but elsewhere (e.g. while reading or forming the answer). Where do you get the content from, and how? Is it stored in a DB or a file?
I am trying to use the latest Apache HttpClient (v4.x) to send a multipart POST request. The example code provided with the docs gives the following sample (somewhat modified) for making a POST request:
FileBody bin = new FileBody(new File(args[0]));
StringBody comment = new StringBody("A binary file of some kind");
MultipartEntity reqEntity = new MultipartEntity();
reqEntity.addPart("bin", bin);
reqEntity.addPart("comment", comment);
httppost.setEntity(reqEntity);
What I am confused about is, if I have multiple files to be added, then in the code
reqEntity.addPart("bin", bin);
what does the first string represent? Is it the name of the file which is being sent as part of multi part post?
Multipart Form requests can have several parts, and each part is given a name (similar to a regular form request). This name can be used on the server side to retrieve a specific part, given the name. Good details are available in RFC 2388:
3. Definition of multipart/form-data

The media-type multipart/form-data follows the rules of all multipart MIME data streams as outlined in [RFC 2046]. In forms, there are a series of fields to be supplied by the user who fills out the form. Each field has a name. Within a given form, the names are unique.

"multipart/form-data" contains a series of parts. Each part is expected to contain a content-disposition header [RFC 2183] where the disposition type is "form-data", and where the disposition contains an (additional) parameter of "name", where the value of that parameter is the original field name in the form. For example, a part might contain a header:

Content-Disposition: form-data; name="user"
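So in reqEntity.addPart("bin", bin), "bin" is the form-field name the server uses to look the part up, not the filename (the filename travels separately as the filename parameter). Extracting that name from a part's Content-Disposition value can be sketched as follows (a minimal parse for illustration, not a full header parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormDataName {
    // Matches the name="..." parameter (but not filename="...") because
    // a parameter must start at the beginning or right after a ';'.
    private static final Pattern NAME =
            Pattern.compile("(?:^|;)\\s*name=\"([^\"]*)\"");

    static String nameOf(String contentDisposition) {
        Matcher m = NAME.matcher(contentDisposition);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nameOf(
                "form-data; name=\"bin\"; filename=\"photo.jpg\"")); // bin
    }
}
```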