How to read an InputStream with UTF-8? - java

Welcome all,
I'm developing a Java app that calls a PHP script over the internet, which returns an XML response.
The response contains the word "Próximo", but when I parse the XML nodes and store the result in a String variable, I receive the word as "Pr&oacute;ximo".
I'm fairly sure the problem is that my Java app uses a different encoding than the PHP script, so I suppose I must set the encoding to match the PHP XML: UTF-8.
This is the code I'm using to get the XML file from the PHP script.
What should I change in this code to set the encoding to UTF-8? (Note that I'm not using a BufferedReader; I'm using an InputStream.)
InputStream in = null;
String url = "http://www.myurl.com";
try {
URL formattedUrl = new URL(url);
URLConnection connection = formattedUrl.openConnection();
HttpURLConnection httpConnection = (HttpURLConnection) connection;
httpConnection.setAllowUserInteraction(false);
httpConnection.setInstanceFollowRedirects(true);
httpConnection.setRequestMethod("GET");
httpConnection.connect();
if (httpConnection.getResponseCode() == HttpURLConnection.HTTP_OK) {
in = httpConnection.getInputStream();
// Parse only when the request succeeded and the stream is available
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(in);
doc.getDocumentElement().normalize();
NodeList myNodes = doc.getElementsByTagName("myNode");
}
} catch (Exception e) {
e.printStackTrace();
}

When you get your InputStream, read byte[]s from it. When you create your Strings, pass in the charset name "UTF-8". Example:
byte[] buffer = new byte[contentLength];
int bytesRead = inputStream.read(buffer);
String page = new String(buffer, 0, bytesRead, "UTF-8");
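Note that you will probably want to give your buffer a sane size (like 1024) and call inputStream.read(buffer) repeatedly until it returns -1. Collect all the bytes before decoding, because a multi-byte UTF-8 character can be split across two reads.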
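The read loop described above could look roughly like the following. This is a minimal sketch, not the answerer's code: the class name Utf8StreamReader and the method readUtf8 are made up for illustration, and it assumes the whole response fits in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named here only for illustration.
class Utf8StreamReader {

    // Reads the stream to the end in 1024-byte chunks, then decodes all
    // collected bytes as UTF-8 in a single step.
    public static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            collected.write(buffer, 0, bytesRead);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "Pr\u00f3ximo" is "Próximo"; the escape avoids source-encoding issues.
        byte[] sample = "Pr\u00f3ximo".getBytes(StandardCharsets.UTF_8);
        String decoded = readUtf8(new ByteArrayInputStream(sample));
        System.out.println(decoded.length()); // 7 characters from 8 bytes
    }
}
```

Decoding once at the end, rather than per chunk, avoids corrupting a character whose bytes straddle a chunk boundary.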
@Amir Pashazadeh
Yes, you can also use an InputStreamReader. Try changing the parse() line to:
Document doc = db.parse(new InputSource(new InputStreamReader(in, "UTF-8")));
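As a self-contained illustration of the InputStreamReader approach, here is a small sketch using only JDK classes. The class name Utf8XmlParse, the method parseUtf8, and the sample XML are assumptions made for the example. Keep in mind that when the parser receives a Reader, the encoding you chose is the one used and any encoding named in the XML declaration is not consulted.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, named here only for illustration.
class Utf8XmlParse {

    // Decodes the byte stream as UTF-8 and hands the characters to the DOM parser.
    public static Document parseUtf8(InputStream in) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new InputStreamReader(in, StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Sample document standing in for the PHP response.
        byte[] xml = "<root><myNode>Pr\u00f3ximo</myNode></root>".getBytes(StandardCharsets.UTF_8);
        Document doc = parseUtf8(new ByteArrayInputStream(xml));
        String text = doc.getElementsByTagName("myNode").item(0).getTextContent();
        System.out.println(text.equals("Pr\u00f3ximo"));
    }
}
```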

Related

Protocol error trying to parse XML response in Java

I am successfully making an API call that is a SOAP request with an account number in the body. I connected using HttpURLConnection and I am reading the results with a BufferedReader:
if (responseCode == HttpURLConnection.HTTP_OK) { // success
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer sb = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
sb.append(inputLine).append("\n");
}
String xml2String = sb.toString();
}
Then I use DocumentBuilderFactory to build the document to pass to the parser:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbFactory.newDocumentBuilder();
Document xmlDom = docBuilder.parse(new InputSource(inputLine));
And then try to parse:
NodeList returnList = xmlDom.getElementsByTagName("DATA");
// Get the DATA
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(returnList.item(0).getTextContent())));
Document doc = parser.getDocument();
NodeList responsedata = doc.getDocumentElement().getChildNodes();
This is the error I get (which includes the output from the API request):
Exception,no protocol:
{"d":"<DATA><BussFlds><FieldName>FirstName</FieldName><Value><![CDATA[TESTY]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>LastName</FieldName><Value><![CDATA[TESTER]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>TYPE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>DATE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>CUSTCODE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PREMCODE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ADDRESS</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>CITY</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>STATE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ZIP</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ZIP4</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ACCTBALANCE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PASTDUE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PHONE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds></DATA>"}
I suspect that it is that curly bracket data on the first row or missing header information but I am not sure if that is the issue or how to fix it. Thanks!
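In
docBuilder.parse(new InputSource(inputLine))
you are passing inputLine, the loop variable, rather than the text you collected. Also, InputSource(String) treats its argument as a system ID (a URL), not as XML content, which is the likely source of the "no protocol" message. Pass your xml2String variable wrapped in a StringReader instead.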
This response:
{"d":"<DATA><BussFlds>…
is not XML, so you cannot read it directly with a DocumentBuilder.
The response is in a format known as JSON, so you will want to pass it to a JSON parser, not an XML parser.
A JSON “object” is basically a dictionary (that is, a lookup table) with string keys. Your response has exactly one entry, whose key is "d". So you first need to parse the response as JSON:
String xml;
try (JsonParser jsonParser = Json.createParser(con.getInputStream())) {
jsonParser.next(); // advance to START_OBJECT; getObject() requires that parser state
xml = jsonParser.getObject().getString("d");
}
(There are other JSON parsing libraries available. I chose the one that is part of Java EE for the above example.)
Notice that the code does not attempt to read con.getInputStream() as a string first. There is no benefit to doing that. The parser accepts an InputStream directly. Which means there is no need to use InputStreamReader, or BufferedReader, or StringBuffer.
Now that you have XML content in the xml variable, you can parse it with DocumentBuilder:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbFactory.newDocumentBuilder();
Document xmlDom = docBuilder.parse(new InputSource(new StringReader(xml)));
Side note: You should never use StringBuffer. Use StringBuilder instead. StringBuffer is a 26-year-old class that was part of Java 1.0, and it is designed for multithreaded use, which is almost never needed, and which adds a lot of overhead.
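Once the XML has been extracted from the JSON wrapper, pulling the FieldName/Value pairs out of the DATA document might look like the sketch below. It uses only the JDK's DOM classes. The class name BussFieldsReader and the method readFields are invented for this example, and it assumes every BussFlds element contains exactly one FieldName and one Value child, as in the response shown in the question.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, named here only for illustration.
class BussFieldsReader {

    // Parses the XML string and returns field names mapped to their values,
    // in document order.
    public static Map<String, String> readFields(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // StringReader makes the parser treat the string as content, not as a URL.
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList entries = doc.getElementsByTagName("BussFlds");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String name = entry.getElementsByTagName("FieldName").item(0).getTextContent();
            // getTextContent() also returns text held in CDATA sections.
            String value = entry.getElementsByTagName("Value").item(0).getTextContent();
            fields.put(name, value);
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DATA><BussFlds><FieldName>FirstName</FieldName>"
                + "<Value><![CDATA[TESTY]]></Value></BussFlds></DATA>";
        System.out.println(readFields(xml).get("FirstName"));
    }
}
```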

XML parsing of a .txt file printing nothing on the console [duplicate]

This question already has answers here:
How can I efficiently parse HTML with Java?
(3 answers)
Closed 6 years ago.
I want to parse a simple web site and scrape information from it.
I used to parse XML files with DocumentBuilderFactory, and I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
URL url = new URL("http://www.deneme.com");
URLConnection uc = url.openConnection();
InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;
FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);
while ((inputLine = in.readLine()) != null) {
out.println(inputLine);
}
in.close();
out.close();
File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
NodeList prelist = doc.getElementsByTagName("body");
System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?
There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
Elements body = doc.select("body");
Or if you want all links:
Elements links = doc.select("body a");
You no longer need to get connections or handle streams. Simple. If you have ever used jQuery then it is very similar to that.
Definitely JSoup is the answer. ;-)
HTML is not always valid, well-formatted XML. Try a special HTML parser instead of an XML parser. There are a couple of different ones available:
http://java-source.net/open-source/html-parsers

Set timeout when loading xml from a URL?

Is it possible to set a timeout when loading an XML document directly from a URL?
Builder parser = new Builder();
Document doc = parser.build("http://somehost");
This can sometimes take minutes, and it would be really handy to be able to time it out directly in the library.
You need to use the build(InputStream inStream) API instead of build(String systemID).
URL url = new URL("http://somehost");
URLConnection con = url.openConnection();
con.setConnectTimeout(connectTimeout); // milliseconds
con.setReadTimeout(readTimeout); // milliseconds
InputStream inStream = con.getInputStream();
Builder parser = new Builder();
Document doc = parser.build(inStream);
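The same idea can be wrapped in a small helper. The sketch below is an assumption-laden illustration: it swaps the Builder class from the question for the JDK's DocumentBuilder so that it runs without extra libraries, and the class name TimedXmlLoader and the method load are made up. If a timeout expires, the connection or read call throws java.net.SocketTimeoutException, which callers can catch.

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, named here only for illustration.
class TimedXmlLoader {

    // Opens the URL with explicit timeouts (in milliseconds) and parses the
    // response as XML. The stream is closed when parsing finishes or fails.
    public static Document load(URL url, int connectTimeoutMs, int readTimeoutMs) throws Exception {
        URLConnection con = url.openConnection();
        con.setConnectTimeout(connectTimeoutMs);
        con.setReadTimeout(readTimeoutMs);
        try (InputStream in = con.getInputStream()) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // A local temp file stands in for the remote host so the example runs offline.
        Path tmp = Files.createTempFile("sample", ".xml");
        Files.write(tmp, "<root><item/></root>".getBytes(StandardCharsets.UTF_8));
        Document doc = load(tmp.toUri().toURL(), 2000, 5000);
        System.out.println(doc.getDocumentElement().getTagName());
        Files.delete(tmp);
    }
}
```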

size or length of a Document object

How can I get the size of a Document object?
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream iStr = urlConnection.getInputStream();
doc = db.parse(iStr);
???? --> Log.i("Bytes",String.valueOf(doc.get????));
One way to implement this would be to write a custom subclass of FilterInputStream that counts the bytes that are read from the stream. (The class would be analogous to LineNumberInputStream ...)
Then wrap the URL stream with your filter:
InputStream iStr = new ByteNumberInputStream(urlConnection.getInputStream());
Finally when you are done reading/parsing the stream, call some method on your filter object to fetch the byte count.
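A counting filter along those lines might be sketched as follows. The class name ByteNumberInputStream comes from the answer above; the getByteCount() accessor is a name chosen for this example. The sketch ignores mark()/reset(), so the count would be off if a consumer rewinds the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Counts the bytes that pass through the wrapped stream.
class ByteNumberInputStream extends FilterInputStream {
    private long byteCount;

    public ByteNumberInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            byteCount++;
        }
        return b;
    }

    // read(byte[]) in FilterInputStream delegates to this method,
    // so overriding it covers both array variants.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            byteCount += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        byteCount += skipped;
        return skipped;
    }

    // Accessor name is an assumption for this sketch.
    public long getByteCount() {
        return byteCount;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<root><a>text</a></root>".getBytes(StandardCharsets.UTF_8);
        ByteNumberInputStream counted = new ByteNumberInputStream(new ByteArrayInputStream(xml));
        DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(counted);
        System.out.println(counted.getByteCount() + " of " + xml.length + " bytes read");
    }
}
```

The count reflects the bytes the parser consumed from the stream, which is the transferred size of the XML rather than the in-memory size of the Document.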
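How about converting the document to a string using asXML() (a dom4j method; org.w3c.dom.Document does not have it) and using String.length()? Note that this gives the number of characters, not the number of bytes.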
