Garbage when loading xml content with URLConnection

I'm trying to load the content of an XML page using URLConnection but I'm getting back garbage characters. The same code works for me for pretty much any other site so I'm not sure what's the issue.
Here's the relevant code:
String urlString = "http://myUrl";
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
conn.setConnectTimeout(60*2000); // 120,000 ms: wait up to 2 minutes to connect
conn.setReadTimeout(60*2000); // and up to 2 minutes for each read
InputStreamReader isr = new InputStreamReader(conn.getInputStream(), encoding);
BufferedReader in = new BufferedReader(isr);
String inputLine;
while ((inputLine = in.readLine()) != null) {
wholeDocument += inputLine;
}
Printing out wholeDocument produces a bunch of characters like this: er���;�pI.���$6
I am using encoding = "UTF-8".
I also tried using XML libraries, for example:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL(baseUrl).openStream());
System.out.println("doc = " + doc);
But the result is the same. When using curl in a terminal app (I'm on a mac) the result is similar although the characters look like this: ???0??KZV??????0N6?aH:$?X9v???$>???`
Any idea how to solve this?

If you check the headers of the response, you will see Content-Encoding: gzip, which indicates that the body has been compressed. You need to decompress it first; that is why you get those weird characters. See HTTP compression for more details.
A good way to check the headers with curl is the verbose option -v. In this case, curl -v http://sites.one.co.il/XML/VOD/ | more let me see the response headers quickly.

Expanding on the other answer, you can check whether the response is gzip-encoded and decode it if so:
InputStream stream = conn.getInputStream();
// getHeaderField returns null when the header is absent
if ("gzip".equalsIgnoreCase(conn.getHeaderField("Content-Encoding"))) {
stream = new GZIPInputStream(stream);
}
// declared outside the if so the reader stays in scope afterwards
InputStreamReader isr = new InputStreamReader(stream, encoding);
Alternatively, you can tell the server that you do not want gzip-encoded data:
conn.setRequestProperty("Accept-Encoding", "identity");
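To try the header-based approach without a network connection, here is a minimal sketch. The class name GzipAwareReader and its helper methods are made up for illustration, UTF-8 is assumed as the body charset, and InputStream.readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of any library.
final class GzipAwareReader {

    // Reads the whole body as UTF-8 text, unwrapping gzip when the
    // Content-Encoding value says so. contentEncoding may be null.
    static String readBody(InputStream body, String contentEncoding) {
        try (InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body) : body) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip-compresses a string, standing in for a server.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<root>hello</root>");
        System.out.println(readBody(new ByteArrayInputStream(compressed), "gzip"));
    }
}
```

With a real connection, the call would look something like readBody(conn.getInputStream(), conn.getContentEncoding()).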

Related

HTTP URL connection response

I am trying to request a URL and read the response from my Java code.
I am using URLConnection to get the response and writing it to an HTML file.
When I open this HTML file in a browser after running the Java class, I only get the Google home page, without the search results.
What's wrong with my code? Here it is:
FileWriter fWriter = null;
BufferedWriter writer = null;
URL url = new URL("https://www.google.co.in/?gfe_rd=cr&ei=aS-BVpPGDOiK8Qea4aKIAw&gws_rd=ssl#q=google+post+request+from+java");
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("User-Agent", "Mozilla/5.0");
connection.setRequestProperty("Accept-Charset", "UTF-8");
connection.setDoInput(true);
connection.setRequestProperty("Authorization", "Basic " + encoding);
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
try {
fWriter = new FileWriter(new File("f:\\fileName.html"));
writer = new BufferedWriter(fWriter);
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
}
The same code worked a couple of days ago, but not now.

The reason is that this URL does not return the search results itself. To see why, you have to understand how Google works: open this URL in your browser and view its source. You will only see lots of JavaScript there.
In short, Google uses Ajax requests to process search queries.
To do what you want, you either have to use a headless browser that can execute JavaScript/Ajax (the hard way) or, better, use the Google search API as directed by anand.

This method of searching is not advised and is bound to fail; you must use the Google search APIs for this kind of work.
Note: Google uses redirection and tokens, so even if you find a clever way to handle them, it is bound to fail in the long run.
Edit:
This is a sample of how you can get the job done reliably using the Google search APIs; please refer to the source for more information.
public static void main(String[] args) throws Exception {
String google = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=";
String search = "stackoverflow";
String charset = "UTF-8";
URL url = new URL(google + URLEncoder.encode(search, charset));
Reader reader = new InputStreamReader(url.openStream(), charset);
GoogleResults results = new Gson().fromJson(reader, GoogleResults.class);
// Show title and URL of 1st result.
System.out.println(results.getResponseData().getResults().get(0).getTitle());
System.out.println(results.getResponseData().getResults().get(0).getUrl());
}

Encoding ignored while reading InputStream

I'm having some encoding problems in a Java application that makes HTTP requests to an IIS server.
Iterating over the headers of the URLConnection object I can see the following (relevant) headers:
Transfer-Encoding: [chunked]
Content-Encoding: [utf-8]
Content-Type: [text/html; charset=utf-8]
The URLConnection.getContentEncoding() method returns utf-8 as the document encoding.
This is how my HTTP request, and stream read is being made:
OutputStreamWriter sw = null;
BufferedReader br = null;
char[] buffer = null;
URL url;
url = new URL(this.URL);
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
sw = new OutputStreamWriter(connection.getOutputStream());
sw.write(postData);
sw.flush();
br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF8"));
StringBuilder totalResponse = new StringBuilder();
String line;
while((line = br.readLine()) != null) {
totalResponse.append(line);
}
buffer = totalResponse.toString().toCharArray();
if (sw != null)
sw.close();
if (br != null)
br.close();
return buffer;
However the following string sent by the server "ÃÃÃção" is received by the client as "�����o".
What am I doing wrong?

Based on your comments, you are trying to receive a FIX message from an IIS server, and FIX uses ASCII. Only a small subset of tags supports other encodings, and they have to be treated in a special manner (the non-ASCII tags in the standard FIX spec are 349, 351, 353, 355, 357, 359, 361, 363 and 365). If such tags are present, you will get a tag 347 whose value specifies the encoding (for example UTF-8), and each encoded tag is preceded by a tag giving the length of the coming encoded value (for tag 349, you will always get 348 first, with an integer value).
In your case, it looks like the server is sending a custom tag 10411 (in the 10xxx range) in some other encoding. By convention, the preceding tag 10410 should give the length of the value in 10411, but it contains "0000" instead, which may have some other meaning.
Note that although FIX messages are very readable, they should still be treated as binary data. Tags and values are mostly ASCII characters, but the delimiter (SOH) is 0x01 and, as mentioned above, certain tags may be encoded with another encoding. The IIS service should really return the data as application/octet-stream so that it can be received properly. Attempting to return it as text/html is asking for trouble :).
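As a rough illustration of treating the message as bytes, the sketch below splits a raw message at each SOH byte and decodes every field as ISO-8859-1, which maps each byte to exactly one char so nothing is lost. The class name FixFieldSplitter and the sample message are made up, and the sketch deliberately ignores the length-prefixed encoded tags described above, whose values would need to be cut by length rather than by delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a FIX library.
final class FixFieldSplitter {

    private static final byte SOH = 0x01;

    // Splits a raw message into "tag=value" strings at each SOH byte.
    static List<String> split(byte[] message) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < message.length; i++) {
            if (message[i] == SOH) {
                fields.add(new String(message, start, i - start,
                        StandardCharsets.ISO_8859_1));
                start = i + 1;
            }
        }
        if (start < message.length) { // trailing field without a final SOH
            fields.add(new String(message, start, message.length - start,
                    StandardCharsets.ISO_8859_1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up two-field message; \u0001 is the SOH delimiter.
        byte[] raw = "8=FIX.4.2\u000135=0\u0001"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(split(raw));
    }
}
```

A field whose bytes are really UTF-8 could then be re-decoded with new String(field.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8).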

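If the server really sends a Content-Encoding of "UTF-8" then it is very confused: Content-Encoding names a content coding such as gzip or deflate, while the character set belongs in the charset parameter of Content-Type. See http://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7231.html#header.content-encoding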

For good measure, a couple of corrections:
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
connection.connect();
try (Writer sw = new OutputStreamWriter(connection.getOutputStream(),
StandardCharsets.UTF_8)) {
sw.write(postData);
sw.flush();
try (BufferedReader br = new BufferedReader(
new InputStreamReader(connection.getInputStream(),
StandardCharsets.UTF_8))) {
StringBuilder totalResponse = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
totalResponse.append(line).append("\r\n");
}
return totalResponse.toString().toCharArray();
} // Close br.
} // Close sw.
Maybe also set a request header, before calling connect():
connection.setRequestProperty("Accept-Charset", "utf-8");
At this point totalResponse.toString() should contain everything, read correctly.
But when the text is displayed, the String is converted to bytes again, and that is where the encoding can fail. For instance, System.out.println may not work, as it probably uses the Windows default encoding.
You can test the String by dumping its bytes:
String s = totalResponse.toString();
Logger.getLogger(getClass().getName()).log(Level.INFO, "{0}",
Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
In some rare cases the font will not contain the special characters.
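The point about output encodings can be reproduced in isolation. The sketch below (class and method names are made up) encodes a correctly read String to bytes and decodes it again: with UTF-8 the text survives, while with US-ASCII, standing in here for a console charset that lacks the characters, the accented letters are replaced by question marks even though the String itself was fine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class EncodingRoundTrip {

    // Encodes text to bytes and decodes it again, both in the same charset.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for characters it cannot represent.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // c-cedilla, a-tilde, o; written with escapes so the example does
        // not depend on the source file encoding.
        String text = "\u00e7\u00e3o";
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));
    }
}
```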

Can you try putting the stream in a request attribute and then printing it out on the client side? A request attribute will be received as-is, without any encoding issues.

HttpURLConnection response is incorrect

When using this code below to make a get request:
private String get(String inurl, Map headers, boolean followredirects) throws MalformedURLException, IOException {
URL url = new URL(inurl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setInstanceFollowRedirects(followredirects);
// Add headers to request.
Iterator entries = headers.entrySet().iterator();
while (entries.hasNext()) {
Entry thisEntry = (Entry) entries.next();
Object key = thisEntry.getKey();
Object value = thisEntry.getValue();
connection.addRequestProperty((String)key, (String)value);
}
// Attempt to parse
InputStream stream = connection.getInputStream();
InputStreamReader isReader = new InputStreamReader(stream );
BufferedReader br = new BufferedReader(isReader );
System.out.println(br.readLine());
// Disconnect
connection.disconnect();
return connection.getHeaderField("Location");
}
The resulting response is completely nonsensical (e.g ���:ks�6��﯐9�rђ� e��u�n�qש�v���"uI*�W��s)
However I can see in Wireshark that the response is HTML/XML and nothing like the string above. I've tried a myriad of different methods for parsing the InputStream but I get the same result each time.
Please note: this only happens when it's HTML/XML, plain HTML works.
Why is the response coming back in this format?
Thanks in advance!
=== SOLVED ===
Gah, got it!
The server is compressing the response when it contains XML, so I needed to use GZIPInputStream instead of InputStream.
GZIPInputStream stream = new GZIPInputStream(connection.getInputStream());
Thanks anyway!
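Since the server here compresses only some responses, wrapping every response in GZIPInputStream would fail on the uncompressed ones. Checking the Content-Encoding header is the usual way to decide; when the header cannot be trusted, one alternative is to peek at the first two bytes for the gzip magic number 0x1f 0x8b. The following is a sketch of that alternative with made-up class and method names, not the asker's final code; readAllBytes needs Java 9 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of any library.
final class GzipSniffer {

    // Wraps the stream in a GZIPInputStream only if it starts with the
    // gzip magic bytes; otherwise returns the buffered stream unchanged.
    static InputStream maybeGunzip(InputStream raw) {
        try {
            BufferedInputStream in = new BufferedInputStream(raw);
            in.mark(2);               // remember the start of the stream
            int first = in.read();
            int second = in.read();
            in.reset();               // rewind so no bytes are consumed
            boolean gzipped = first == 0x1f && second == 0x8b;
            return gzipped ? new GZIPInputStream(in) : in;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a stream fully as UTF-8 text (UTF-8 is an assumption).
    static String readText(InputStream in) {
        try (InputStream s = in) {
            return new String(s.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip-compresses a string, standing in for a server.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(readText(maybeGunzip(new ByteArrayInputStream(gzip("<xml/>")))));
        System.out.println(readText(maybeGunzip(
                new ByteArrayInputStream("<html/>".getBytes(StandardCharsets.UTF_8)))));
    }
}
```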

Use UTF-8 encoding in the input stream reader, like below:
InputStreamReader isReader = new InputStreamReader(stream, "UTF-8");

How can I send POST data through url.openStream()?

I'm looking for a tutorial or a quick example of how to send POST data through openStream.
My code is:
URL url = new URL("http://localhost:8080/test");
InputStream response = url.openStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(response, "UTF-8"));
Could you help me?

URL url = new URL(urlSpec);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod(method);
connection.setDoOutput(true);
connection.setDoInput(true);
// important: get output stream before input stream
OutputStream out = connection.getOutputStream();
out.write(content);
out.close();
// now you can get input stream and read.
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line = null;
while ((line = reader.readLine()) != null) {
writer.println(line);
}
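The snippet above writes a byte array called content without showing how it is built. For a typical HTML form POST, the body could be produced along these lines; the class name FormBody and the parameter names are made up for illustration, and the URLEncoder.encode(String, Charset) overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of any library.
final class FormBody {

    // Builds an application/x-www-form-urlencoded body, for example
    // "q=a+b&name=J%C3%BCrgen", with keys and values encoded as UTF-8.
    static byte[] encode(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        // The encoded form contains only ASCII characters.
        return joiner.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", "a b");
        params.put("name", "J\u00fcrgen");
        System.out.println(new String(encode(params), StandardCharsets.US_ASCII));
    }
}
```

Before calling getOutputStream(), the request would normally also declare the body type with connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded").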

Use Apache HTTP Components: http://hc.apache.org/httpcomponents-client-ga/
Tutorial: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
Look for HttpPost; there are some examples of sending dynamic data, text, files and form data.

Apache HTTP Components, in particular the Client, would be the best way to go.
It abstracts away a lot of the nasty coding you would normally have to do by hand.

Wrong encoding with Java HttpURLConnection

I'm trying to read generated XML from an MS web service:
URL page = new URL(address);
StringBuffer text = new StringBuffer();
HttpURLConnection conn = (HttpURLConnection) page.openConnection();
conn.connect();
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent());
BufferedReader buff = new BufferedReader(in);
box.setText("Getting data ...");
String line;
while ((line = buff.readLine()) != null) { // stop at end of stream instead of appending "null"
text.append(line).append("\n");
}
box.setText(text.toString());
or
URL u = new URL(address);
URLConnection uc = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
inputLine = java.net.URLDecoder.decode(inputLine, "UTF-8");
System.out.println(inputLine);
}
in.close();
Any page reads fine except the web service output.
It reads the greater-than and less-than signs strangely:
it reads < as "& lt;" and > as "& gt;" (without the spaces; if I type them here without spaces, Stack Overflow renders them as < and >).
Please help.
Thanks.

First, there seems to be some confusion on this line:
inputLine = java.net.URLDecoder.decode(inputLine, "UTF-8");
This effectively says that you expect every line in the document that your server provides to be URL-encoded. URL encoding is not the same as document encoding.
http://en.wikipedia.org/wiki/Percent-encoding
http://en.wikipedia.org/wiki/Character_encoding
Looking at your code snippet, I think URL encoding (percent encoding) is not what you're after.
In terms of document character encoding, you are making a conversion on this line:
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent());
conn.getContent() returns an InputStream that operates on bytes, whilst the reader operates on chars, so the character encoding conversion is done here. Check out the other constructors of InputStreamReader, which take the encoding as a second argument. Without the second argument you fall back on whatever your platform default is in Java.
InputStreamReader(InputStream in, String charsetName)
for instance, lets you change your code to:
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent(), "utf-8");
But the real question will be "what encoding is your server providing the content in?" If you own the server code too, you may just hard code it to something reasonable such as utf-8. But if it can vary, you need to look at the http header Content-Type to figure it out.
String contentType = conn.getHeaderField("Content-Type");
The contents of contentType will look like
text/plain; charset=utf-8
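A shorthand way of getting this field is:
String contentType = conn.getContentType();
(Note that conn.getContentEncoding() returns the Content-Encoding header, for example gzip, not the charset.)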
Notice that it's entirely possible that no charset is provided, or no Content-Type header, in which case you must fall back on reasonable defaults.
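One way to pull the charset out of a Content-Type value and fall back to a default could look like the sketch below. The class name ContentTypeCharset is made up, and the parsing is intentionally simple: it splits on semicolons and does not handle every quoting rule in the HTTP grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value, or the fallback
    // when the header is missing, has no charset parameter, or names a
    // charset this JVM does not know.
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/plain; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(parse("text/plain", StandardCharsets.ISO_8859_1));
    }
}
```

The result can be passed straight to the InputStreamReader constructor that takes a Charset.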

Mark Rotteveel is correct: the web service is the culprit here. For some reason it is sending the greater-than and less-than signs in the & lt; and & gt; format.
Thanks, Martin Algesten, but as I already stated, I have worked around it; I was just looking for why it was this way.
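The workaround itself is not shown in the thread. If the escaped text has to be turned back into markup by hand, a minimal sketch (not the asker's actual code, and with a made-up class name) could replace the five predefined XML entities as below. It does not handle numeric character references, and an XML parser does the same unescaping automatically when asked for the text content of the wrapping element.

```java
// Hypothetical helper, not part of any library.
final class XmlEntities {

    // Replaces the five predefined XML entities with their characters.
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;", not "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescape("&lt;item id=&quot;1&quot;&gt;a &amp; b&lt;/item&gt;"));
    }
}
```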
