Newly-developed encoding issue with an Oracle Java program - java

My Java program (or rather, a part of it) sends a request to a webservice and receives rdf-strings including ancient Greek words in Unicode. I wrote the program in NetBeans and so far, there has not been a problem during run-time, both in the NetBeans environment and outside as a standalone jar under Linux and Windows XP. Now, all of a sudden the Greek words in the rdf come back garbled like this:
á¼€
At first, I thought this was a Windows XP problem, but when checking under Windows 7 the problem persisted. I found out that I was running OpenJDK under Linux, and was since able to reproduce the issue using Oracle Java.
This is the relevant code (of course, I may have tunnel vision, so please tell me if you need more):
try {
HttpClient client = new DefaultHttpClient();
HttpGet get;
get = new HttpGet(URL+URLEncoder.encode(form, "UTF-8"));
HttpResponse response = client.execute(get);
if (200 == response.getStatusLine().getStatusCode()) {
HttpEntity respEnt = response.getEntity();
BufferedReader reader = new BufferedReader(new InputStreamReader(respEnt.getContent()));
StringBuilder sb = new StringBuilder();
char[] cbuffer = new char[256];
int read;
while ((read = reader.read(cbuffer)) != -1) {
sb.append(cbuffer,0,read);
}
//System.out.println(sb.toString());
rdf = new String(sb.toString().getBytes("UTF-8"),"UTF-8");
} else {
System.err.println("HTTP Request fehlgeschlagen.");
}
} catch (IOException e) {
System.err.println("Problem beim HTTP Request.");
}
The webservice is the Perseus morphology service, it can be found here: http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=grc&engine=morpheusgrc&word=. Try "word=μῆνιν", for example. How or when the rdf is generated, I really don't know.
I would be very grateful for further insights!

Make sure the encoding of your strings is consistent from client to server and back again. In your case, of course, the server's response (the rdf-strings) matters most (encoding on the server side, decoding in your client code).
One thing concerning the client code you posted :
You are using the one argument constructor of InputStreamReader in this line:
BufferedReader reader = new BufferedReader(new InputStreamReader(respEnt.getContent()));
It will read from the input stream using the JVM's (and so the system's) default charset, so the outcome will depend on the machine/VM you are running your client application on.
Try explicitly setting the charset using this constructor:
new InputStreamReader(respEnt.getContent(), "UTF-8")
See also API-doc.
Search your code for more uses of the one-argument constructors of both InputStreamReader and OutputStreamWriter, which also use the default encoding.
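A minimal sketch of the difference, using only the JDK. The class and method names (CharsetDemo, readAll) are made up for illustration and a byte array stands in for the HTTP response; the point is only that the same bytes decode differently depending on the charset handed to InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: decode the whole stream with an explicitly chosen charset.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[256];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Greek test word from the question, written with escapes so that
        // the encoding of this source file does not matter.
        String word = "\u03bc\u1fc6\u03bd\u03b9\u03bd";
        byte[] wire = word.getBytes(StandardCharsets.UTF_8); // what a UTF-8 server would send

        String good = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String bad = readAll(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(word)); // true
        System.out.println(bad.equals(word));  // false: one char per byte, i.e. mojibake
    }
}
```

If the default charset of a JVM happens to be a single-byte one, the one-argument constructor behaves like the second call, which would explain output that differs between machines.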
If you have no control over the server code (the webservice implementation), you can try to find out the response's charset like this:
Header contentType = response.getFirstHeader("Content-Type");
// The whole header value, e.g. "text/xml; charset=UTF-8"; the charset is one parameter of it
String charset = contentType.getValue();
(This is from the apache HttpClient API you seem to be using).
see also this Q on SO.
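Note that getValue() returns the full header value, so the charset parameter still has to be picked out of it. A rough sketch with plain JDK string handling follows; ContentTypeCharset and charsetOf are hypothetical names, and the fallback charset is whatever you decide to assume when the server does not declare one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: pull the charset parameter out of a Content-Type value,
    // falling back to the given default when it is absent or unknown to this JVM.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/rdf+xml; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The resulting Charset can then be passed to the two-argument InputStreamReader constructor.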

Related

How can I be sure that an HttpClient PostMethod contains UTF-8 strings in the parameters?

In our webapp, we have to send a POST request via HttpClient to an endpoint on our network, which will receive this and do some work with it. We are having trouble with character encoding, and I am having difficulties finding an answer to my question.
We have used the postMethod.getParams().setContentCharset("UTF-8") method when sending the request, but on the receiving end, it seems like the characters are still encoded in ISO 8859-1. I have determined this because when I inspect the String on the receiving side, it has garbage characters in it that go away once I follow the steps found at https://stackoverflow.com/a/16549329/1130549. Are there any extra steps I need to take on the sending end to ensure that I am actually writing characters in UTF-8 as expected? All we are doing now is using postMethod.addParameter(paramKey, paramValue) with native String objects.
Edit: Here is a very simple example of how we're sending the POST request. For what it's worth, the values are being taken from an XMLBeans object.
PostMethod postMethod = new PostMethod(url);
postMethod.getParams().setContentCharset("UTF-8");
postMethod.addParameter("key1", "value1");
postMethod.addParameter("key2", "value2");
HttpClient httpClient = new HttpClient();
int status = httpClient.executeMethod(postMethod);
EDIT
Simpler solution is to encode the value
postMethod.addParameter("key1", URLEncoder.encode("value1","UTF-8"));
To encode UTF-8 properly, you can execute the request differently, using StringEntity and NameValuePair, e.g.:
try (CloseableHttpClient httpClient = HttpClients.custom().build()) {
    URIBuilder uriBuilder = new URIBuilder(url);
    HttpHost target = new HttpHost(uriBuilder.getHost(), uriBuilder.getPort(), uriBuilder.getScheme());
    // Request object for the same URL
    HttpPost post = new HttpPost(url);
    List<NameValuePair> nameValuePairs = new ArrayList<>();
    nameValuePairs.add(new BasicNameValuePair("key1", "value1"));
    nameValuePairs.add(new BasicNameValuePair("key2", "value2"));
    // Percent-encode the pairs as UTF-8 and send them as the request body
    String entityValue = URLEncodedUtils.format(nameValuePairs, StandardCharsets.UTF_8.name());
    StringEntity entity = new StringEntity(entityValue, StandardCharsets.UTF_8.name());
    // Mark the body as form data so the receiver parses it as parameters
    entity.setContentType("application/x-www-form-urlencoded; charset=UTF-8");
    post.setEntity(entity);
    httpClient.execute(target, post);
}
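To see what "encoded as UTF-8" means on the wire, here is a small JDK-only sketch that builds a form body by hand. FormBodyDemo and formBody are illustrative names, not part of HttpClient; the same parameter yields different percent-escapes depending on the charset, which is a quick way to check what a sender actually produced.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodyDemo {

    // Hypothetical helper: build an application/x-www-form-urlencoded body,
    // percent-encoding names and values with the given charset.
    public static String formBody(Map<String, String> params, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), charsetName))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), charsetName));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("key1", "caf\u00e9"); // "cafe" with an acute accent on the e
        System.out.println(formBody(params, "UTF-8"));      // key1=caf%C3%A9
        System.out.println(formBody(params, "ISO-8859-1")); // key1=caf%E9
    }
}
```

If the receiving side shows two garbage characters where one accented character was sent, the bytes were most likely UTF-8 and were decoded as a single-byte charset on arrival.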
First of all, you do need to make sure that the string that you are actually writing is encoded in UTF-8. I realize that you already know that, but still double-check that it is so, as it would be the prime suspect of your problem. Also, I would recommend trying a much simpler HTTP client. Apache HTTP client (I believe that's the library that you are using) is an excellent library. But because it covers a very wide range of options, it tends to be a bit bulky. So, for simple requests I would suggest a lightweight HTTP client that may not be as comprehensive as the Apache library but offers simplicity as a trade-off. Here is how your code might look:
private static void testHttpClient() {
HttpClient client = new HttpClient();
// client.setContentType("text/html; charset=utf-8");
client.setContentType("application/json; charset=utf-8");
client.setConnectionUrl("http://www.my-url.com");
String content = null;
try {
String myMessage = getMyMessage(); // get the string that you want to send
content = client.sendHttpRequest(HttpMethod.POST, myMessage);
} catch (IOException e) {
content = client.getLastResponseMessage() + TextUtils.getStacktrace(e, false);
}
System.out.println(content);
}
It looks much simpler, I think. Also in the same library, there is another utility that allows you to convert any string in any language into a sequence of Unicode escapes and vice versa. This has helped me numerous times to diagnose thorny encoding issues, for instance to tell whether some gibberish symbols are just a wrong display of a valid character or an actual character loss. Here is an example of how it works:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
That might help you to check whether the string you passed is valid or not. The library is called MgntUtils and can be found at Maven Central or on GitHub. It comes as a Maven artifact with sources and Javadoc. The Javadoc can also be found separately here
Disclaimer: The MgntUtils library is written by me
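For readers who only want the diagnostic and not the dependency, the encoding direction can be approximated with the plain JDK. This is a hypothetical stand-in (UnicodeEscapeDemo.toUnicodeSequence), not the MgntUtils implementation; it assumes that one escape per UTF-16 code unit with lowercase hex digits is what is wanted, which matches the sample output above.

```java
public class UnicodeEscapeDemo {

    // Hypothetical stand-in for the library call: render every UTF-16 code unit
    // as a backslash-u escape with four lowercase hex digits.
    public static String toUnicodeSequence(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeSequence("Hello World"));
    }
}
```

Printing a received string this way shows the actual code units, independent of how the console or log viewer renders them.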

Java HTTP server not working

I am hosting a webpage from home. I made my own HTTP server using Java. This is an SSCCE:
if(command.startsWith("GET"))
{
//client is a socket on which I reply.
PrintWriter pw = new PrintWriter(client.getOutputStream(), true);
String[] commule = command.split(" ");
if(commule[0].equals("GET"))
{
if(commule[1].contains("."))
{
File file = new File(GEQO_SERVER_ROOT + commule[1].substring(1).replaceAll("%20", " "));
if(file.exists())
{
OutputStream out = client.getOutputStream();
InputStream stream = new FileInputStream(file);
String response = new String();
response += "HTTP/1.1 200 OK\r\n";
response += "Date: Thu, 08 Aug 2013 08:49:37 GMT\r\n";
response += "Content-Type: text/html\r\n";
response += "Content-Length: " + file.length() + "\r\n";
response += "Connection: keep-alive\r\n";
response += "\r\n";
pw.write(response); //Assume I already initialized pw as a PrintWriter
pw.flush();
copy(stream, out);
stream.close();
out.close();
}
else
{
pw.write("<html><h1>The request 404ed.</h1>");
pw.write("<body>The requested URL <b>" + commule[1] + "</b> could not be found on this server.</body></html>");
pw.flush();
}
}
else
{
BufferedReader br = new BufferedReader(new FileReader(GEQO_SERVER_ROOT + commule[1].substring(1) + "main.html"));
String sCurrentLine;
while ((sCurrentLine = br.readLine()) != null)
{
pw.print(sCurrentLine);
}
br.close();
}
}
else
{
pw.println("Unrecognized HTTP command.");
}
}
This is the main.html source :
<html>
<title>Geqo Server</title>
<body>Geqo server online and functioning!</body>
</html>
The issue is that when I try to access this page using Chrome, it displays correctly (At least when using 127.0.0.1). But when I tried accessing it on Firefox on 127.0.0.1, it works, but just gives me the html source. IE also only gives me the source. Can anyone tell me why Firefox and IE only show the source, instead of parsing it?
I think this contains some clues (Firebug screenshot) :
My source seems to be coming in a <pre> tag. I donno why, but isn't that sort of the problem?
I port-forwarded. Here's the page guys : http://110.172.170.83:17416/ (Sorry, Stackoverflow doesn't allow numerical links.)
EDIT : I found the problem. But before I explain, thanks to Bart for the SSCCE, which I used to compare with my code. This is the problem : The if statement on the eighth line, if(commule[1].contains(".")), causes most of the code here to be skipped. In the corresponding else block, there is not even a command to send the headers. Thanks to artbristol for pointing that out.
Thanks in advance.
Your printwriter isn't flushing (as Ernest pointed out), so no HTTP headers are being sent. Look at the result of connecting directly - it just returns the raw data, with no headers.
nc 110.172.170.83 17416
GET /
<html><title>Geqo Server</title><body>Geqo server online and functioning!</body></html>
Writing an HTTP server is hard work. Unless this is for an exercise, you should use a lightweight existing one, such as Jetty, or the built in Sun HTTP server in the JDK.
Edit - A PrintWriter really isn't appropriate for doing HTTP. It's designed to deal with line-by-line data such as a file being written to disk. It's also dependent on platform-specific settings for text encoding and line endings. Check the HTTP spec for more details on how a proper HTTP server ought to work.
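One way to avoid the PrintWriter entirely is to build the response head as a string and write it as bytes to the same OutputStream as the body. The sketch below is only that, a sketch: RawResponseDemo and writeHtml are made-up names, a ByteArrayOutputStream stands in for client.getOutputStream(), and the header set is a bare minimum rather than a complete HTTP implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RawResponseDemo {

    // Hypothetical helper: write status line, headers and body as bytes to one
    // stream, with explicit CRLF line endings and explicit charsets.
    public static void writeHtml(OutputStream out, String html) {
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        String head = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=utf-8\r\n"
                + "Content-Length: " + body.length + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
        try {
            out.write(head.getBytes(StandardCharsets.US_ASCII));
            out.write(body);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stands in for the socket's output stream.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeHtml(sink, "<html><body>Geqo server online and functioning!</body></html>");
        System.out.print(new String(sink.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Because both branches of the request handler would go through one helper like this, neither can forget to send the headers.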
There would appear to be some potential issues with buffering. You write some of your output to a PrintWriter wrapper around out, and other output directly to out. I would definitely add a call to pw.flush() after the pw.write() call.
You enabled autoFlush with the second argument to
new PrintWriter(client.getOutputStream(), true)
http://docs.oracle.com/javase/7/docs/api/java/io/PrintWriter.html
Unlike the PrintStream class, if automatic flushing is enabled it will be done only when one of the println, printf, or format methods is invoked, rather than whenever a newline character happens to be output. These methods use the platform's own notion of line separator rather than the newline character.
So basically your pw.write() did not flush to the output stream. So all you need to do is replace
pw.write(response);
with
pw.println(response);
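The quoted Javadoc behaviour is easy to observe with an in-memory stream. In this sketch (AutoFlushDemo and bytesSeen are illustrative names) a ByteArrayOutputStream takes the place of the socket stream, so the number of bytes that have really left the PrintWriter can be counted.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

public class AutoFlushDemo {

    // Returns how many bytes have reached the underlying stream after write()
    // and after println(), for a PrintWriter created with autoFlush = true.
    public static int[] bytesSeen() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter pw = new PrintWriter(sink, true);

        pw.write("HTTP/1.1 200 OK\r\n"); // buffered: autoFlush does not apply to write()
        int afterWrite = sink.size();

        pw.println();                    // println triggers the automatic flush
        int afterPrintln = sink.size();

        return new int[] { afterWrite, afterPrintln };
    }

    public static void main(String[] args) {
        int[] seen = bytesSeen();
        System.out.println(seen[0] + " bytes after write, " + seen[1] + " bytes after println");
    }
}
```

An explicit pw.flush() after write() has the same effect as println() without appending the platform line separator to the response.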
You do not send any response header.
I can't find the definition of pw in your source code?

HTTP post body lost spuriously (HttpURLConnection & Jetty)

I have a home grown protocol which uses HttpURLConnection (from Java 1.6) & Jetty (6.1.26) to POST a block of xml as a request and receive a block of xml as a response. The amounts of xml are approx. 5KB.
When running both sender and receiver on Linux EC2 instances in different parts of the world I'm finding that in about 0.04% of my requests the Jetty handler sees the xml request (the post body) as an empty string. I've checked and the client outputs that it's consistently trying to send the correct (> 0 length) xml request string.
I have also reproduced this by looping my JUnit tests on my local (Win 8) box.
I assume the error must be something like:
Misuse of buffers
An HttpURLConnection bug
A network error
A Jetty bug
A random head slapping stupid thing I've done in the code
The relevant code is below:
CLIENT
connection = (HttpURLConnection) (new URL (url)).openConnection();
connection.setReadTimeout(readTimeoutMS);
connection.setConnectTimeout(connectTimeoutMS);
connection.setRequestMethod("POST");
connection.setAllowUserInteraction(false);
connection.setDoOutput(true);
// Send request
byte[] postBytes = requestXML.getBytes("UTF-8");
connection.setRequestProperty("Content-length", "" + postBytes.length);
OutputStream os = connection.getOutputStream();
os.write(postBytes);
os.flush();
os.close();
// Read response
InputStream is = connection.getInputStream();
StringWriter writer = new StringWriter();
IOUtils.copy(is, writer, "UTF-8");
is.close();
connection.disconnect();
return writer.toString();
SERVER (Jetty handler)
public void handle(java.lang.String target, javax.servlet.http.HttpServletRequest request, javax.servlet.http.HttpServletResponse response, int dispatch) {
InputStream is = request.getInputStream();
StringWriter writer = new StringWriter();
IOUtils.copy(is, writer, "UTF-8");
is.close();
String requestXML = writer.toString();
// requestXML is 0 length string about 0.04% of time
Can anyone think of why I'd randomly get the request as an empty string?
Thanks!
EDIT
I introduced some more trace and getContentLength() returns -1 when the error occurs, but the client output still shows it's sending the right amount of bytes.
I can't think of why you are getting an empty string. The code looks correct. If you update your code to check for an empty string and, when one is found, report the content-length and transfer-encoding of the request, that would help to identify the culprit. A Wireshark trace of the network data would also be good.
But the bad news is that jetty-6 is really end of life, and we are unlikely to be updating it. If you are writing the code today, then you really should be using jetty-7 or 8. Perhaps even a jetty-9 milestone release if you are brave. If you find such an error in jetty-9, I'd be all over it like a rash trying to fix it for you!
Make sure you set connection.setRequestProperty("Content-Type", "application/xml"); It's possible POST data may be discarded without some Content-type. This was the case when I replicated your problem locally (against a Grails embedded Tomcat instance), and supplying this fixed it.
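A sketch of how the client setup might look with the header added. XmlPostDemo and prepare are hypothetical names, and the URL is a placeholder. The setFixedLengthStreamingMode call is an optional extra that was not suggested in this thread; it declares the body length up front instead of setting the Content-length header by hand. Nothing is sent until the output stream is opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class XmlPostDemo {

    // Hypothetical helper: configure (but do not yet send) a POST carrying XML.
    public static HttpURLConnection prepare(String url, byte[] body) {
        try {
            HttpURLConnection c = (HttpURLConnection) URI.create(url).toURL().openConnection();
            c.setRequestMethod("POST");
            c.setDoOutput(true);
            c.setRequestProperty("Content-Type", "application/xml; charset=UTF-8");
            // Optional assumption: announce the exact body length and stream it unbuffered.
            c.setFixedLengthStreamingMode(body.length);
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "<request/>".getBytes(StandardCharsets.UTF_8);
        // Placeholder endpoint; openConnection() alone does not touch the network.
        HttpURLConnection c = prepare("http://localhost:8080/endpoint", body);
        System.out.println(c.getRequestMethod() + " " + c.getRequestProperty("Content-Type"));
        // The body would then be written to c.getOutputStream() and the response read as before.
    }
}
```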

Unable to download more than 16144 characters when requesting json from API

I have an android application which downloads its information as JSON.
A typical JSON download is about 2,000-3,000 characters. But I wanted to stress it, so I created a larger file (~48,000 characters). As files go this is still small, under 50kb.
The problem I have is that when downloading I am only getting 16144 characters of data. That is, reader.readLine() returns just one line containing 16144 characters, as does client.execute(request, new BasicResponseHandler());. Obviously, with only part of the file, my JSON parsing code fails quickly as it's not a valid JSON object.
There are no exceptions raised, so it's not an out of memory error. And the problem is repeatable on an HTC Desire (2.2) and a Galaxy Nexus (4.1.1), so it's not OS specific either. I've tested the URL in a web browser and it works fine, all the JSON is available, so it's not a server error.
Question
Can anyone point out why it is downloading only 16144 characters, and how to make it download the whole file?
Method #1
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(uri);
HttpResponse response = client.execute(request);
InputStream in = response.getEntity().getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
StringBuilder str = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null)
{
str.append(line);
}
in.close();
result.setJSONResult(str.toString());
Method #2
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(uri);
// Execute once; the response handler reads the body and releases the connection
String json = client.execute(request, new BasicResponseHandler());
result.setJSONResult(json);
Note - The url is on a LAN network (http://192.168.0.99:8080...), so I've not included it as it won't be useful.
Update - Fixed
Fixed the problem. In the end I put it down to a file transfer issue rather than memory limits of the phone. Whilst it worked on a PC (Chrome), I found it was broken in places other than Android: on the website, and other browsers (Safari) didn't work with the raw API call. The underlying problem was that the web server's proxy, nginx, wanted to buffer larger responses (over 32kb) but never had write permissions on the server folders it used for buffering. This meant it sent part of the file, started to buffer, and hit a critical error because it was unable to write. When it errored, it stopped sending the rest of the file, hence the stop at an unusual number of bytes. Thanks for all your help!
It's because of the maximum size a String can hold: either 2147483647 (2^31 - 1) characters (by the Java specification, the maximum size of an array, which the String class uses for internal storage) or half your maximum heap size (since each character is two bytes), whichever is smaller.
And the heap size will probably be less than 40 KB.
You can use a JSON reader instead of storing the data from the web in a String; please refer to http://developer.android.com/reference/android/util/JsonReader.html
You are using a line-based reader to read data that is not line-based. When you call readLine, you are asking it to forcefully convert whatever it read into a line of text. This mangles the data if it's not in fact a line of text.
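What readLine() demonstrably does is drop the line terminators, so a readLine()-and-append loop returns a slightly different string than the one that was sent. The sketch below compares that loop with a chunked read; ReadWholeBodyDemo, byLines and byChunks are illustrative names, and a StringReader stands in for the response stream. Neither variant can recover data that the server never sent.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadWholeBodyDemo {

    // Mirrors the loop from the question: readLine() strips line terminators.
    public static String byLines(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Reads fixed-size chunks, so every character, newlines included, is kept.
    public static String byChunks(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = in) {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String json = "{\n  \"a\": 1\n}";
        System.out.println(byLines(new StringReader(json)).length());
        System.out.println(byChunks(new StringReader(json)).length());
    }
}
```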

Downloading a web page with Android

I'm downloading a web page then extracting some data out of it, using regex (don't yell at me, I know a proper parser would be better, but this is a very simple machine generated page). This works fine in the emulator, and on my phone when connected by wi-fi, but not on 3G - the string returned is not the same, and I don't get a match. I can imagine it has something to do with packet size or latency, but I can't figure it out.
My code:
public static String getPage(URL url) throws IOException {
final URLConnection connection = url.openConnection();
HttpGet httpRequest = null;
try {
httpRequest = new HttpGet(url.toURI());
} catch (URISyntaxException e) {
e.printStackTrace();
}
HttpClient httpclient = new DefaultHttpClient();
HttpResponse response = (HttpResponse) httpclient.execute(httpRequest);
HttpEntity entity = response.getEntity();
BufferedHttpEntity bufHttpEntity = new BufferedHttpEntity(entity);
InputStream stream = bufHttpEntity.getContent();
String ct = connection.getContentType();
final BufferedReader reader;
if (ct.indexOf("charset=") != -1) {
ct = ct.substring(ct.indexOf("charset=") + 8);
reader = new BufferedReader(new InputStreamReader(stream, ct));
}else {
reader = new BufferedReader(new InputStreamReader(stream));
}
final StringBuilder sb = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
sb.append(line);
}
stream.close();
return sb.toString();
}
Is it my poor connection causing this, or is there a bug in there? Either way, how do I solve it?
Update:
The file downloaded over 3G is 201 bytes smaller than the one over wi-fi. While they are obviously both downloading the correct page, the 3G one is missing a whole bunch of whitespace, and also some HTML comments that are present in the original page, which I find a little strange. Does Android fetch pages differently on 3G so as to reduce file size?
The User-Agent (UA) shouldn't change whether you access the web page over 3G or Wi-Fi.
As mentioned before, get rid of URLConnection, because the code is obviously already complete using the HttpClient method, and you can set the UA using:
httpclient.getParams().setParameter(CoreProtocolPNames.USER_AGENT, userAgent);
One last thing: it might be silly, but maybe the web page is dynamic? Is that possible?
Here are some hints, some of them silly, but just in case:
Review your mobile connection: try to open the web browser, surf the web, and make sure it actually works.
I don't know which web page you are trying to access, but take into account that, depending on your phone's User-Agent (UA), the rendered content might be different (web pages specially designed for mobile phones), or no content might be rendered at all. Is it a web page of your own?
Try to access the same web page from Firefox, changing the UA (use the User Agent Switcher for Firefox), and review the code returned.
That will be a good starting point for figuring out what your problem is.
Ger
You may want to check if your provider has a transparent proxy in place with 3G.
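If something on the path does strip whitespace and comments, one pragmatic workaround is to normalize the page the same way before applying the regex, so that both variants look alike. This is only a sketch: PageNormalizer and normalize are hypothetical names, the sample strings are invented, and it assumes the differences are limited to HTML comments and runs of whitespace, as described in the question's update.

```java
import java.util.regex.Pattern;

public class PageNormalizer {

    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: drop HTML comments, then collapse whitespace runs to one space.
    public static String normalize(String html) {
        String noComments = COMMENT.matcher(html).replaceAll("");
        return WHITESPACE.matcher(noComments).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Invented samples standing in for the Wi-Fi and 3G downloads.
        String wifi = "<td>  42 </td>\n<!-- generated -->\n<td>7</td>";
        String mobile = "<td> 42 </td> <td>7</td>";
        System.out.println(normalize(wifi).equals(normalize(mobile))); // true
    }
}
```

A regex written against the normalized form then no longer depends on which network the page came through.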
