I am trying to get data from latata.pl/pl.php and display all the characters (Polish, ISO-8859-2):
final URL url = new URL("http://latata.pl/pl.php");
final URLConnection urlConnection = url.openConnection();
final BufferedReader in = new BufferedReader(new InputStreamReader(
        urlConnection.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
in.close();
It doesn't work. :( Any ideas?
InputStreamReader has multiple constructors, and in a case like this you can (should, even have to) specify the encoding in one of them.
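For instance, a minimal sketch of the questioner's reader with the charset passed explicitly (assuming the page really is ISO-8859-2):
final BufferedReader in = new BufferedReader(new InputStreamReader(
        urlConnection.getInputStream(), "ISO-8859-2")); // explicit charset, not the platform default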
Your InputStreamReader will be attempting to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.
Assuming the web server is doing a good job, you can find the correct encoding in the charset parameter of the Content-Type HTTP header. Or you can just assume it's iso-8859-2, but that might break later.
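A rough sketch of pulling the charset out of that header via URLConnection (the ISO-8859-2 fallback is an assumption for this particular page):
// Content-Type looks like "text/html; charset=ISO-8859-2"
String contentType = urlConnection.getContentType();
String charset = "ISO-8859-2"; // fallback assumption
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = param.substring("charset=".length());
        }
    }
}
BufferedReader in = new BufferedReader(
        new InputStreamReader(urlConnection.getInputStream(), charset));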
This is too long for a comment, but who set up that web page? You? From what I can see, it doesn't look correct.
Here's what you get back:
$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl
HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html
����ʣ��Connection closed by foreign host.
The HTML is simply:
<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>
And that's how your page appears from a browser. Is there a valid reason why no charset is specified in that HTML page?
The output of your PHP script pl.php is faulty. The HTTP header Content-Type: text/html is sent without a declared charset. Without a declared charset, the client has to assume ISO-8859-1 according to the HTTP specification. Interpreted as ISO-8859-1, the body that is sent reads ±ê³ó¿¡Ê£¯¬.
The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as
Content-Type: text/html; charset=ISO-8859-2
You can check this with a simple code fragment that reverses the faulty ISO-8859-1 decoding and re-decodes the bytes as ISO-8859-2:
final String test = "±ê³ó¿¡Ê£¯¬"; // the body as mis-decoded under ISO-8859-1
// Recover the original bytes, then decode them as ISO-8859-2
String repaired = new String(test.getBytes("ISO-8859-1"), "ISO-8859-2");
System.out.println(repaired);
The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.
As a quick fix, change your PHP script to send Content-Type: text/html; charset=ISO-8859-2 as the HTTP header.
But you should think about switching to UTF-8 encoded output anyway.
As someone has already stated, there is no charset encoding specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in Central Europe) results in legitimate Polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed, as this is the default.
The response headers need to include the header Content-Type: text/html; charset=ISO-8859-2 for the character code points to be interpreted correctly. This charset should then be used when constructing the reader over the response InputStream.
Related
I have an OutputStreamWriter in my servlet that uses a particular encoding scheme, i.e., I have to use this constructor:
OutputStreamWriter(OutputStream out, String charsetName)
Also, I have used the following line of code to set the encoding scheme of the response
response.setContentType("text/html;charset=UTF-8")
Using this output stream I am sending response to the client.
Now, which scheme will the browser use for decoding: UTF-8 or charsetName?
Can someone explain why?
The line
OutputStreamWriter(OutputStream out, String charsetName)
tells the writer which charset to use for encoding.
The line
response.setContentType("text/html;charset=UTF-8")
sets the Content-Type header in the HTTP response and tells the browser which encoding to use when displaying the content.
The browser will handle the content based on the Content-Type header. The charset you use for the OutputStreamWriter only affects how characters written to it are encoded into bytes.
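In other words, the two must agree. A minimal sketch (the sample text is illustrative):
// Declare the charset to the browser in the Content-Type header...
response.setContentType("text/html;charset=UTF-8");
// ...and encode the characters you actually send with that same charset
Writer out = new OutputStreamWriter(response.getOutputStream(), "UTF-8");
out.write("résumé \u4f60\u597d");
out.flush();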
I am using the javax.mail API to send email to my Outlook. There are Chinese and French characters in my body.
I am properly setting body as
MimeMessage.setText(body, "UTF-8");
Also, in the email I am checking the headers. They are properly coming through as:
Content-type: text/plain;
charset="UTF-8"
Content-transfer-encoding: quoted-printable
The funny thing is that from the other machine the email comes up fine, but when I try it from my desktop, it doesn't encode properly.
I am also checking the logs by printing the body. It properly comes up in Chinese and French.
Any help?
Does it have anything to do with Sendmail?
It should have worked; you only forgot to encode the subject too. Especially as you checked the header. The encoding calls:
MimeMessage message = new MimeMessage(session);
message.setSubject(subject, "UTF-8");
message.setText(body, "UTF-8");
//message.setHeader("Content-Type", "text/plain; charset=UTF-8");
I think your email settings on the desktop force the wrong encoding.
Paranoia: Check the body string, via a hard-coded u-escaped string:
message.setText("\u00e9\u00f4\u5837" + body, "UTF-8"); // éô堷
I faced the following problem: when URLConnection is used via a proxy, the content length is always set to -1.
First I checked that the proxy really returns the Content-Length (lynx and wget also work via the proxy; there is no other way to reach the internet from the local network):
$ lynx -source -head ftp://ftp.wipo.int/pub/published_pct_sequences/publication/2003/1218/WO03_104476/WO2003-104476-001.zip
HTTP/1.1 200 OK
Last-Modified: Mon, 09 Jul 2007 17:02:37 GMT
Content-Type: application/x-zip-compressed
Content-Length: 30745
Connection: close
Date: Thu, 02 Feb 2012 17:18:52 GMT
$ wget -S -X HEAD ftp://ftp.wipo.int/pub/published_pct_sequences/publication/2003/1218/WO03_104476/WO2003-104476-001.zip
--2012-04-03 19:36:54-- ftp://ftp.wipo.int/pub/published_pct_sequences/publication/2003/1218/WO03_104476/WO2003-104476-001.zip
Resolving proxy... 10.10.0.12
Connecting to proxy|10.10.0.12|:8080... connected.
Proxy request sent, awaiting response...
HTTP/1.1 200 OK
Last-Modified: Mon, 09 Jul 2007 17:02:37 GMT
Content-Type: application/x-zip-compressed
Content-Length: 30745
Connection: close
Age: 0
Date: Tue, 03 Apr 2012 17:36:54 GMT
Length: 30745 (30K) [application/x-zip-compressed]
Saving to: `WO2003-104476-001.zip'
In Java I wrote:
URL url = new URL("ftp://ftp.wipo.int/pub/published_pct_sequences/publication/2003/1218/WO03_104476/WO2003-104476-001.zip");
int length = url.openConnection().getContentLength();
logger.debug("Got length: " + length);
and I get -1. I started to debug FtpURLConnection, and it turned out that the necessary information is in the underlying HttpURLConnection.responses field; however, it is never properly populated from there (the debugger shows Content-Length: 30745 in the headers). The content length is not updated when you start reading the stream, or even after the stream has been read. Code:
URL url = new URL("ftp://ftp.wipo.int/pub/published_pct_sequences/publication/2003/1218/WO03_104476/WO2003-104476-001.zip");
URLConnection connection = url.openConnection();
logger.debug("Got length (1): " + connection.getContentLength());
InputStream input = connection.getInputStream();
byte[] buffer = new byte[4096];
int count = 0, len;
while ((len = input.read(buffer)) > 0) {
    count += len;
}
logger.debug("Got length (2): " + connection.getContentLength() + " but wanted " + count);
Output:
Got length (1): -1
Got length (2): -1 but wanted 30745
It seems like a bug in JDK 6, so I have opened bug #7168608.
If somebody can help me write code that returns the correct content length for a direct FTP connection, an FTP connection via proxy, and local file: URLs, I would appreciate it.
If this problem cannot be worked around in JDK 6, suggest any other library that definitely works for all the cases I've mentioned (Apache HttpClient?).
Remember that proxies will often change the representation of the underlying entity. In your case I suspect the proxy is probably altering the transfer encoding, which in turn makes the Content-Length meaningless even if supplied.
You are falling afoul of the following two sections of the HTTP 1.1 spec:
4.4 Message Length
...
...
If a Content-Length header field (section 14.13) is present, its decimal value in OCTETs represents both the entity-length and the transfer-length. The Content-Length header field MUST NOT be sent if these two lengths are different (i.e., if a Transfer-Encoding header field is present). If a message is received with both a Transfer-Encoding header field and a Content-Length header field, the latter MUST be ignored.
14.41 Transfer-Encoding
The Transfer-Encoding general-header field indicates what (if any) type of transformation has been applied to the message body in order to safely transfer it between the sender and the recipient. This differs from the content-coding in that the transfer-coding is a property of the message, not of the entity.
Transfer-Encoding = "Transfer-Encoding" ":" 1#transfer-coding
Transfer-codings are defined in section 3.6. An example is:
Transfer-Encoding: chunked
If multiple encodings have been applied to an entity, the transfer- codings MUST be listed in the order in which they were applied. Additional information about the encoding parameters MAY be provided by other entity-header fields not defined by this specification.
Many older HTTP/1.0 applications do not understand the Transfer- Encoding header.
So the URLConnection ignores the Content-Length header, as per the spec, because it is meaningless in the presence of chunked transfers.
In your debugger screenshot it's not clear whether the Transfer-Encoding header is present. Please let us know...
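If the debugger view is ambiguous, you can dump every response header the URLConnection actually parsed; a quick sketch:
// Print all response headers; the null key holds the status line
for (Map.Entry<String, List<String>> e : connection.getHeaderFields().entrySet()) {
    System.out.println(e.getKey() + ": " + e.getValue());
}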
On further investigation, it seems that lynx does not show all the headers returned when you issue lynx -head; it is not showing the Transfer-Encoding header critical to this discussion.
Here's proof of the discrepancy with a publicly visible website:
Ξ▶ lynx -useragent='dummy' -source -head http://www.bbc.co.uk
HTTP/1.1 302 Found
Server: Apache
X-Cache-Action: PASS (non-cacheable)
X-Cache-Age: 0
Content-Type: text/html; charset=iso-8859-1
Date: Tue, 03 Apr 2012 13:33:06 GMT
Location: http://www.bbc.co.uk/mobile/
Connection: close
Ξ▶ wget -useragent='dummy' -S -X HEAD http://www.bbc.co.uk
--2012-04-03 14:33:22-- http://www.bbc.co.uk/
Resolving www.bbc.co.uk... 212.58.244.70
Connecting to www.bbc.co.uk|212.58.244.70|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: Apache
Cache-Control: private, max-age=15
Etag: "7e0f292b2e5e4c33cac1bc033779813b"
Content-Type: text/html
Transfer-Encoding: chunked
Date: Tue, 03 Apr 2012 13:33:22 GMT
Connection: keep-alive
X-Cache-Action: MISS
X-Cache-Age: 0
X-LB-NoCache: true
Vary: Cookie
Since I am obviously not inside your network I can't replicate your exact circumstances, but please validate that you really aren't getting a Transfer-Encoding header when passing through a proxy.
I think it's a "bug" in the JDK related to handling FTP connections that are proxied. The FtpURLConnection delegates to an HttpURLConnection when a proxy is in use. However, the FtpURLConnection doesn't seem to delegate any of the header management to this HttpURLConnection in this situation. Thus, you can correctly get the streams, but I don't think you can access any "header" values like content length or content type. (This is based on a quick glance over the OpenJDK source for 1.6; I could have missed something.)
One thing I would check is what happens after actually reading the response (written off the top of my head, so expect mistakes):
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();
byte[] buffer = new byte[4096];
// Drain the whole response body
while (input.read(buffer) > 0)
    ;
logger.debug("Got length: " + connection.getContentLength());
If the size you are getting is good, then look for a way to make URLConnection read the headers but not the data, to avoid fetching the whole response.
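For plain http: URLs, one way to read the headers without the body is a HEAD request (note this won't help the ftp: case from the question); a sketch:
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("HEAD"); // headers only, no body transfer
int length = connection.getContentLength();
connection.disconnect();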
Actually I'm trying to stream a CSV file. I set the encoding to windows-1252, but it seems it is still streamed as a UTF-8 file.
final String encoding = "windows-1252";
exportResolution = new StreamingResolution(builder.getContentType() + ";charset=" + encoding.toLowerCase()) {
    @Override
    public void stream(HttpServletResponse response) throws Exception {
        // Set response headers
        response.setHeader("Cache-control", "private, max-age=0");
        response.setCharacterEncoding(encoding);
        OutputStream os = response.getOutputStream();
        writeExportStream(os, builder);
    }
}.setFilename(filename);
writeExportStream just streams the content to the outputstream (with pagination and db calls, it takes some time)
It doesn't work locally (Jetty plugin) or in dev (Tomcat), nor with Firefox / Chrome.
I've not tested it, but people at work told me that it works better when we don't stream the content but instead write the file in one go after loading all the objects we want from the db.
Anybody know what is happening? Thanks
Btw my headers:
HTTP/1.1 200 OK
Content-Language: fr-FR
Content-Type: text/csv;charset=windows-1252
Content-Disposition: attachment;filename="export_rshop_01-02-11.csv"
Cache-Control: private, max-age=0
Transfer-Encoding: chunked
Server: Jetty(6.1.14)
I want the file to be importable into Excel as windows-1252, but I can't; it just opens as UTF-8 even though my header says windows-1252.
The problem lies in the writeExportStream(os,builder); method. We can't see what encoding operations it is performing, but I'm guessing it is writing UTF-8 data.
The output operation needs to perform two encoding tasks:
1. Tell the client what encoding the response text is in (via the headers)
2. Encode the data written to the client in a matching encoding (e.g. via a writer)
Step 1 is being done correctly. Step 2 is probably the source of the error.
If you use the provided writer, it will encode character data in the appropriate response encoding.
If pre-encoded data is written via the raw byte stream (getOutputStream()), you need to make sure this process uses the same encoding.
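For instance, writeExportStream could wrap the raw stream in a writer built with the declared charset; a sketch under that assumption (the sample row is made up):
OutputStream os = response.getOutputStream();
// Encode character data with the same charset announced in the Content-Type header
Writer writer = new OutputStreamWriter(os, "windows-1252");
writer.write("prénom;café;1,50\n"); // round-trips cleanly in windows-1252
writer.flush();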
An HTTP POST request is made to my servlet. There is a posted form parameter named "payload" in the HTTP request that my code in the servlet retrieves for further processing. When the value of the payload includes the windows-1252 character "’" (code 146), the HttpServletRequest instance method getParameter("payload") returns null. There is nothing in server.log related to the problem. We think the character encoding used to produce this character is windows-1252. The character encoding Glassfish defaults to for HTTP requests appears to be ISO-8859-1, and code 146 is a control character in ISO-8859-1.
Does anyone have any suggestions as to how I could solve this problem?
The http request headers in the post that showed the problem are:
POST /dbxchange/TechAnywhere HTTP/1.1
CONTENT_LENGTH: 13117
Content-type: application/x-www-form-urlencoded
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Mozilla/4.0 (Windows Vista 6.0) Java/1.6.0_16
Host: localhost:8080
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Content-Length: 13117
Java doesn't care about the differences between Cp1252 and Latin-1. Since there are no invalid byte sequences in either encoding, you wouldn't get null with either one. I think your server is using UTF-8 and the browser is using Cp1252 or Latin-1.
Try putting the following attribute on the form to see if it helps (accept-charset is the standard HTML attribute for this):
<form action="..." method="post" accept-charset="UTF-8" ...>
We think the character encoding used to produce this character is windows-1252.
Yes, very probably. Even when browsers claim to be using iso-8859-1, they are usually actually using windows-1252.
The character encoding glassfish defaults to for http requests appears to be ISO-8859-1
Most likely it is defaulting to your system's Java ‘default encoding’. This is rarely what you want, as it makes your application break when you redeploy it.
For reading POST request bodies, you should be able to fix the encoding by calling setCharacterEncoding on the request object, as long as you can do it early enough so that no-one has already caused it to read the body by calling methods such as getParameter. Try setting the encoding to "Cp1252". Although really you ought to be aiming for UTF-8 for everything in the long run.
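A sketch of where that call has to sit (the servlet skeleton is illustrative):
protected void doPost(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    // Must run before anything reads the body, e.g. getParameter(...)
    request.setCharacterEncoding("Cp1252");
    String payload = request.getParameter("payload");
    // ... process payload ...
}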
Unfortunately there is not a standard J2EE way to specify what encoding your application expects for all requests (including query string parameters, which are not affected by setCharacterEncoding). Each server has its own way, which creates annoying deployment issues. But for Glassfish, set a <parameter-encoding> in your sun-web.xml.
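For reference, that sun-web.xml entry looks roughly like this (syntax from memory, so check it against the sun-web-app DTD):
<sun-web-app>
    <parameter-encoding default-charset="windows-1252"/>
</sun-web-app>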
We found that the problem was in the JavaScript code that sends the POST request: it was URL-encoding the value of the payload before sending, using the built-in JavaScript function escape(). That encodes the character in the non-standard form %u2019, and it appears Glassfish does not support this non-standard encoding.
See http://en.wikipedia.org/wiki/Percent-encoding#Non-standard_implementations
The fix was to use the built-in JavaScript function encodeURI(), which returns "%E2%80%99" for ’.
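On the Java side you can confirm that this standard form decodes cleanly; a small check:
// %E2%80%99 is the UTF-8 percent-encoding of U+2019 (’)
String decoded = URLDecoder.decode("%E2%80%99", "UTF-8");
System.out.println(decoded.equals("\u2019")); // prints true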