On the web site I am trying to help with, users can type a URL containing non-ASCII characters into the browser, like the following Chinese characters:
http://localhost:8080?a=测试
On server, we get
GET /?a=%E6%B5%8B%E8%AF%95 HTTP/1.1
As you can see, it's UTF-8 encoded, then URL-encoded. We can handle this correctly by setting the URI encoding to UTF-8 in Tomcat.
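For reference, the setting in question is the URIEncoding attribute on the HTTP Connector in conf/server.xml (a minimal sketch; other connector attributes omitted):

<!-- conf/server.xml: decode %-escaped bytes in the URI as UTF-8 -->
<Connector port="8080" URIEncoding="UTF-8" />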
However, on certain browsers we sometimes get Latin-1 encoding instead:
http://localhost:8080?a=ß
turns into
GET /?a=%DF HTTP/1.1
Is there any way to handle this correctly in Tomcat? It looks like the server has to do some intelligent guessing. We don't expect to handle Latin-1 correctly 100% of the time, but anything is better than what we are doing now, which is assuming everything is UTF-8.
The server is Tomcat 5.5. The supported browsers are IE 6+, Firefox 2+ and Safari on iPhone.
Unfortunately, UTF-8 encoding is a "should" in the URI specification, which seems to assume that the origin server will generate all URLs in such a way that they will be meaningful to the destination server.
There are a couple of techniques I would consider; all involve parsing the query string yourself (although you may know better than I whether setting the request encoding affects the query-string-to-parameter mapping or just the body).
First, examine the query string for lone "high bytes": a byte above 0x7F can only appear in UTF-8 as part of a multi-byte sequence, so a single one on its own cannot be valid UTF-8 (the Wikipedia entry on UTF-8 has a nice table of valid and invalid bytes).
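Here is a rough sketch of that check in Java (you would %-decode the query string to raw bytes yourself first; the Latin-1 fallback is an assumption, and CharsetDecoder is used because it reports malformed input instead of silently replacing it):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Decode raw query-string bytes as UTF-8 when they form valid UTF-8,
// otherwise fall back to ISO-8859-1, which can decode any byte sequence.
static String decodeHeuristically(byte[] raw) {
    try {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw)).toString();
    } catch (CharacterCodingException e) {
        return Charset.forName("ISO-8859-1").decode(ByteBuffer.wrap(raw)).toString();
    }
}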
Less reliable would be to look at the Accept-Charset header in the request. I don't think this header is required (I haven't checked the HTTP spec to verify), and I know that Firefox, at least, will send a whole list of acceptable values. Picking the first value in the list might work, or it might not.
Finally, have you done any analysis of the logs to see whether a particular user agent consistently uses this encoding?
Related
I have an application which is a set of Java web services and some static content (HTML, XML, JavaScript, etc.). I know that JavaScript is limited in which character encodings it can use, but HTML and XML can use various character encodings. I happen to know that all of these files are UTF-8 encoded. The WebSphere application server I am using properly sets the Content-Type to 'text/html; charset=utf-8' for the HTML, but not for JavaScript or XML; they get Content-Type headers of 'application/javascript' and 'text/xml' respectively. My security folks are telling me that not specifying the charset for the XML files is a vulnerability. Remember, these are static files.
On an IBM HTTP Server (in front of the WebSphere application server), is there a directive I can use to add the character encoding to the Content-Type of 'text' types? On WebSphere, is there a directive I can use to set the default character encoding for text types? I assume that after I "fix" this for the XML files I will then be asked to fix it for CSS files, JavaScript files, etc. I would rather fix it once and be done.
If this question has been asked before, please provide the URL. I did find this question, but it is not the same. I am looking into the feasibility of this answer, but there are many folders and I would rather not have to remember to add a .htaccess file with this directive to each one.
You can just append AddDefaultCharset utf-8 to httpd.conf and everything will go out with that charset appended to it, even content generated by the application server. A .htaccess file is not necessary, and not useful for appserver content.
If you find you need to blacklist context roots, extensions, or anything else, use <LocationMatch> with AddDefaultCharset off.
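For instance (the path is purely illustrative):

<LocationMatch "^/binary-downloads/">
    AddDefaultCharset off
</LocationMatch>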
Unfortunately, Header edit Content-Type... will not work in IBM HTTP Server prior to V9. In V9 it allows you to easily cherry-pick the current Content-Type:
Header always edit Content-Type ^(text/html)$ "$1 ; charset=utf8"
Header always edit Content-Type ^(application/javascript)$ "$1 ; charset=utf8"
Much the same as covener described:
Add the following lines into the conf/httpd.conf file:
AddDefaultCharset utf-8
AddCharset utf-8 .html .js .css
<Location />
Header always edit Content-Type ^(text/html)$ "$1; charset=utf8"
Header always edit Content-Type ^(application/javascript)$ "$1; charset=utf8"
RewriteRule ^(.*)$ $1 [R=200,L]
</Location>
and it should work.
I am using a Java library (edtftpj) to FTP a file from a web app hosted on a Tomcat server to an MVS system.
The FTP transfer mode is ASCII and the transfer is done using FTP streams. The data from a String variable is stored into an MVS dataset.
The problem is that all the ampersand characters get converted to &amp;. I have tried various escape characters, including \&, ^& and X'50' (hex value), but none of them helped.
Does anyone have any idea how to escape the ampersands, please?
Nothing in the FTP protocol would cause this encoding behavior.
Representing & as &amp; is an XML-based escaping scheme. Other systems might use the same convention, but as a standard, this is XML character escaping.
Something in the chain that reads and writes the data thinks it should be escaping this information and is doing the encoding.
If anything on the MVS system is using Java, it is probably communicating via SOAP with some other connector, which implies XML, and that could be where the escaping is happening.
Either way, the FTP protocol itself is not the problem: ASCII transfer should only translate things like line endings, and & is already a valid ASCII character that would not be affected. If anything, it is the MVS system doing this escaping.
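One quick way to locate the culprit is to inspect the string immediately before it enters the FTP stream (buildStatement below is a hypothetical stand-in for however the data is produced):

String payload = buildStatement(); // hypothetical producer of the data
// If the escaped form is already present here, the escaping happened
// upstream in the web app, not in FTP or on the MVS side.
if (payload.contains("&amp;")) {
    // Undo it before upload if the dataset should hold literal ampersands
    // (take care not to double-unescape legitimate text).
    payload = payload.replace("&amp;", "&");
}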
Binary transfer is preferred in almost every case, since it doesn't do any interpretation or encoding of the raw bytes.
Using FTP in ASCII mode to/from MVS (z/OS) will always perform code page conversions (i.e. ASCII <-> EBCDIC) on the data connection. It is therefore very important to set up the connection with the appropriate parameters for the dataset type and code pages. Example:
site SBD=(IBM-037,ISO8859-1)
site TRAck
site RECfm=FB
site LRECL=80
site PRImary=5
site SECondary=5
site BLKsize=6233
site Directory=50
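As a sketch of issuing those commands from Java, here is roughly what it looks like with Apache Commons Net (the question used edtftpj, whose API differs — check its documentation for the raw/site-command equivalent; the host, credentials and dataset name below are placeholders):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

static void putToMvs(String data) throws IOException {
    FTPClient ftp = new FTPClient();
    ftp.connect("mvs.example.com");
    ftp.login("user", "pass");
    ftp.sendSiteCommand("SBD=(IBM-037,ISO8859-1)"); // code page pair for conversion
    ftp.sendSiteCommand("RECfm=FB");                // dataset attributes, as listed above
    ftp.sendSiteCommand("LRECL=80");
    ftp.setFileType(FTP.ASCII_FILE_TYPE);           // server converts ASCII <-> EBCDIC
    ftp.storeFile("'MY.DATA.SET'",
            new ByteArrayInputStream(data.getBytes("ISO-8859-1")));
    ftp.logout();
    ftp.disconnect();
}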
As an alternative, use BINARY mode and perform the conversions manually with standard tools or libraries on the receiving end.
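A minimal sketch of that manual conversion in Java, assuming the US/Canada EBCDIC code page (Cp037 — substitute the code page your system actually uses):

// Bytes fetched in BINARY mode arrive as raw EBCDIC; decode them with an
// EBCDIC charset, then re-encode as Latin-1/ASCII if that is what you need.
static byte[] ebcdicToAscii(byte[] ebcdic) throws java.io.UnsupportedEncodingException {
    String text = new String(ebcdic, "Cp037");  // EBCDIC -> Unicode
    return text.getBytes("ISO-8859-1");         // Unicode -> Latin-1
}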
Ref links:
1. Preset commands to tackle codepage problem.
2. Converting ASCII to EBCDIC via FTP on MVS Host.
3. Transferring Files to and from MVS.
4. FTP code page conversion.
5. FTP File Transfer Protocol and z/OS (pdf).
HTML1114: Codepage iso-8859-1 from (HTTP header) overrides conflicting codepage utf-8 from (META tag)
getQuotes?zip=20190&county=FAIRFAX&eff=01%2F13%2F2012&fam_income=30000.0&a0_dob=11%2F11%2F1981&a0_g=M&a0_t=true&a0_rel=self&appId=30&planId=4&changedSubsidy=%24100.98
SEC7111: HTTPS security is compromised by http://www.startssl.com/img/secured.gif
getQuotes?zip=20190&county=FAIRFAX&eff=01%2F13%2F2012&fam_income=30000.0&a0_dob=11%2F11%2F1981&a0_g=M&a0_t=true&a0_rel=self&appId=30&planId=4&changedSubsidy=%24100.98
What do these errors mean?
There are two errors here.
The HTTP header says that the encoding is iso-8859-1, whereas the meta tag in the HTML page says it is UTF-8. Both should say the same thing, and both should name the character encoding actually used.
You have an HTTPS page that contains an image downloaded over plain HTTP, so IE does not consider the page as a whole secure.
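A sketch of the first fix on the server side, assuming a Java servlet generates the page (the servlet name is hypothetical; for static files, configure the equivalent in the web server instead):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class QuotesServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Make the HTTP header agree with the META tag in the markup
        resp.setContentType("text/html; charset=utf-8");
        // For the second error, reference the image over HTTPS in the markup:
        // <img src="https://www.startssl.com/img/secured.gif">
        resp.getWriter().println("...");
    }
}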
I'm downloading various files through I/O streaming in my Java application. Receiving and saving those files works well as long as I have a full URL path including the file name, but how can I find out the name of the index file (as defined in, for example, Apache's DirectoryIndex) of a domain? The HTTP headers don't provide this information, and neither does URLConnection.
As far as I know there is no way of retrieving this information. The HTTP specification doesn't provide it, and I don't think this is a bad thing. Your client requests the URL "/"; it's up to the web server how to handle that, and there is no obligation to reveal a filename.
It's also worth pointing out (I'm sure you're aware of it, but just in case) that just because a URL looks like /somedir/somefile.html, it doesn't mean that is the actual file being served. It could be served via a proxy to another host, rewritten by mod_rewrite, etc. In other words, the name is arbitrary and doesn't necessarily bear any relation to the physical name on disk.
In short, I think your best bet would be to pick a default filename, e.g. index.html, for those cases and stick to it.
The only way out is to:
Inspect the Content-Disposition header and use it to generate the filename. If the server is serving a file, it will often set this header. E.g. a http://server:port/DownLoadServlet URL might set this header to indicate the name "statement.pdf" (see the sketch after this list).
If this header is missing, use heuristics to generate a filename. This is what browsers do to generate filenames like Doc[10].pdf, Doc[12].pdf, etc.
Use the Content-Type header (if available) to guess the file extension.
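A rough sketch of steps 1 and 3 with URLConnection (the filename handling is deliberately simplified; real Content-Disposition parsing has more edge cases, and the extension mapping is illustrative only):

import java.net.URLConnection;

static String guessFilename(URLConnection conn, String fallback) {
    // 1. Content-Disposition: attachment; filename="statement.pdf"
    String cd = conn.getHeaderField("Content-Disposition");
    if (cd != null) {
        int i = cd.indexOf("filename=");
        if (i >= 0) {
            return cd.substring(i + "filename=".length()).trim().replace("\"", "");
        }
    }
    // 3. Fall back to a default name, guessing the extension from Content-Type.
    String type = conn.getContentType();
    if (type != null && type.startsWith("application/pdf")) {
        return fallback + ".pdf";
    }
    return fallback;
}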
I am trying to download a file from my Java application, but because URLConnection uses the HTTP 1.1 protocol I get a Transfer-Encoding: chunked response, in which case I cannot find out the file size (Content-Length is not set). From what I could find, the HTTP version is hard-coded in the class and there is no way to change it. Is it somehow possible to change the version back to 1.0, or to tell the server not to use chunked encoding when sending a file?
Edit: I am not trying to retrieve dynamic content; my application is a download manager, and the files I am downloading are static. Other downloaders I checked (wget, igetter, curl) use HTTP 1.0 and get the size info from most servers, but my application and Firefox, issuing HTTP 1.1, always get chunked encoding. I understand that Content-Length is not always present, but I would like to get it most of the time.
The Jakarta Commons HTTP Client contains a "preference architecture" that allows some fine-grained control over the particulars of the HTTP connection. See http://hc.apache.org/httpclient-3.x/preference-api.html
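For example, something along these lines should force HTTP/1.0 with HttpClient 3.x (an untested sketch; the URL is a placeholder):

import java.io.IOException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpVersion;
import org.apache.commons.httpclient.methods.GetMethod;

static void download() throws IOException {
    HttpClient client = new HttpClient();
    // Speak HTTP/1.0 so the server cannot reply with chunked encoding
    client.getParams().setVersion(HttpVersion.HTTP_1_0);
    GetMethod get = new GetMethod("http://example.com/file.zip");
    try {
        client.executeMethod(get);
        long length = get.getResponseContentLength(); // -1 if still unknown
        // read get.getResponseBodyAsStream() and save it ...
    } finally {
        get.releaseConnection();
    }
}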
It's very likely that the server can't specify a valid Content-Length, even if you specify HTTP/1.0. When content is dynamically produced, the server has to buffer it all to measure its total length. Not all servers are going to be able to fall back to this less efficient behavior.
If buffering the response is reasonable, why not do it in your client, where you have full control? This is safer than relying on the server.
Read the response without processing, just stuffing the data into a ByteArrayOutputStream. When you are done, measure the length of the resulting byte array. Then create a ByteArrayInputStream with it and process that stream in place of the stream you got from the URLConnection.
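A minimal sketch of that buffering approach (error handling trimmed for brevity):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

static InputStream bufferResponse(URLConnection conn) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    InputStream in = conn.getInputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);       // accumulate the whole body in memory
    }
    in.close();
    byte[] body = buf.toByteArray();
    System.out.println("Content length: " + body.length); // now known exactly
    return new ByteArrayInputStream(body);
}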