XML parsing of a .txt file printing nothing on the console [duplicate] - java

This question already has answers here:
How can I efficiently parse HTML with Java?
(3 answers)
Closed 6 years ago.
I want to parse a simple web site and scrape information from it.
I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
URL url = new URL("http://www.deneme.com");
URLConnection uc = url.openConnection();
InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;
FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);
while ((inputLine = in.readLine()) != null) {
out.println(inputLine);
}
in.close();
out.close();
File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
NodeList prelist = doc.getElementsByTagName("body");
System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?

There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
Elements body = doc.select("body");
Or if you want all links:
Elements links = doc.select("body a");
You no longer need to get connections or handle streams. Simple. If you have ever used jQuery then it is very similar to that.

Definitely JSoup is the answer. ;-)

HTML is not always valid, well-formed XML. Try a dedicated HTML parser instead of an XML parser. There are several available:
http://java-source.net/open-source/html-parsers
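To see the difference in practice, here is a minimal sketch that feeds two snippets to the JDK's XML parser. The class name WellFormedCheck and the method isWellFormedXml are made up for this illustration; the second snippet contains an unclosed <br>, which is ordinary HTML but not well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a piece of markup is well-formed XML.
class WellFormedCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. because of a mismatched tag.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<html><body><p>hi</p></body></html>"));
        System.out.println(isWellFormedXml("<html><body><p>hi<br></body></html>"));
    }
}
```

Real pages almost always look like the second snippet somewhere, which is why an HTML-aware parser is the safer choice.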

Related

Protocol error trying to parse XML response in Java

I am successfully making an API call that is a SOAP request with an account number in the body. I connect using HttpURLConnection and read the results using BufferedReader:
if (responseCode == HttpURLConnection.HTTP_OK) { // success
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
{
sb.append(inputLine).append("\n");
String xml2String = sb.toString();
Then I use DocumentBuilderFactory to build the document to read into the parser:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbFactory.newDocumentBuilder();
Document xmlDom = docBuilder.parse(new InputSource(inputLine));
And then try to parse:
NodeList returnList = xmlDom.getElementsByTagName("DATA");
// Get the DATA
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(returnList.item(0).getTextContent())));
Document doc = parser.getDocument();
NodeList responsedata = doc.getDocumentElement().getChildNodes();
This is the error I get (which includes the output from the API request):
Exception,no protocol:
{"d":"<DATA><BussFlds><FieldName>FirstName</FieldName><Value><![CDATA[TESTY]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>LastName</FieldName><Value><![CDATA[TESTER]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>TYPE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>DATE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>CUSTCODE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PREMCODE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ADDRESS</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>CITY</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>STATE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ZIP</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ZIP4</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ACCTBALANCE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PASTDUE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PHONE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds></DATA>"}
I suspect the cause is the curly-bracket data on the first row or missing header information, but I am not sure whether that is the issue or how to fix it. Thanks!
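In
docBuilder.parse(new InputSource(inputLine))
you are passing inputLine rather than the accumulated response. Note also that InputSource(String) treats its argument as a system identifier (a URL), not as XML text, which is why the error says "no protocol". Use your variable xml2String instead and wrap it in a StringReader: new InputSource(new StringReader(xml2String)).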
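A small sketch of the difference between the two InputSource constructors, using only JDK classes. The class name InputSourceDemo and the method rootName are invented for this example, and it assumes the string really is XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the code in the question.
class InputSourceDemo {

    // Parses XML held in a String and returns the name of its root element.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The StringReader makes the parser read the text itself.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // InputSource(String) only records a location; no character stream is attached.
        InputSource asLocation = new InputSource("<DATA/>");
        System.out.println("systemId = " + asLocation.getSystemId());
        System.out.println("characterStream = " + asLocation.getCharacterStream());

        System.out.println("root = " + rootName("<DATA><BussFlds/></DATA>"));
    }
}
```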
This response:
{"d":"<DATA><BussFlds>…
is not XML, so you cannot read it with a DocumentBuilder.
That response is in a format known as JSON, and it has to go through a JSON parser before any XML parser gets involved.
A JSON “object” is basically a dictionary (that is, a lookup table) with string keys. Your response has exactly one entry, whose key is "d". So you first need to parse the response as JSON:
String xml;
try (JsonParser jsonParser = Json.createParser(con.getInputStream())) {
jsonParser.next(); // advance to START_OBJECT; getObject() requires that state
xml = jsonParser.getObject().getString("d");
}
(There are other JSON parsing libraries available. I chose the one that is part of Java EE for the above example.)
Notice that the code does not attempt to read con.getInputStream() as a string first. There is no benefit to doing that. The parser accepts an InputStream directly. Which means there is no need to use InputStreamReader, or BufferedReader, or StringBuffer.
Now that you have XML content in the xml variable, you can parse it with DocumentBuilder:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbFactory.newDocumentBuilder();
Document xmlDom = docBuilder.parse(new InputSource(new StringReader(xml)));
Side note: You should never use StringBuffer. Use StringBuilder instead. StringBuffer is a 26-year-old class that was part of Java 1.0, and it is designed for multithreaded use, which is almost never needed, and which adds a lot of overhead.
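For what it's worth, once the XML text is out of the JSON wrapper, reading the individual fields might look roughly like the sketch below. The class name BussFldsReader and the method readFields are made up for illustration; the element names (BussFlds, FieldName, Value) are taken from the response in the question, and the sample string in main is a shortened copy of it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: turns the <DATA> document into a name-to-value map.
class BussFldsReader {

    static Map<String, String> readFields(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // LinkedHashMap keeps the fields in document order.
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList entries = doc.getElementsByTagName("BussFlds");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String name = entry.getElementsByTagName("FieldName").item(0).getTextContent();
                // getTextContent() also returns the text inside CDATA sections.
                String value = entry.getElementsByTagName("Value").item(0).getTextContent();
                fields.put(name, value);
            }
            return fields;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<DATA>"
                + "<BussFlds><FieldName>FirstName</FieldName><Value><![CDATA[TESTY]]></Value></BussFlds>"
                + "<BussFlds><FieldName>TYPE</FieldName><Value><![CDATA[]]></Value></BussFlds>"
                + "</DATA>";
        System.out.println(readFields(xml));
    }
}
```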

Convert html String to org.w3c.dom.Document in Java

To convert an HTML String to an org.w3c.dom.Document I'm using jtidy-r938.jar. Here is my code:
public static Document getDoc(String html) {
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setWraplen(Integer.MAX_VALUE);
// tidy.setPrintBodyOnly(true);
tidy.setXmlOut(false);
tidy.setShowErrors(0);
tidy.setShowWarnings(false);
// tidy.setForceOutput(true);
tidy.setQuiet(true);
Writer out = new StringWriter();
PrintWriter dummyOut = new PrintWriter(out);
tidy.setErrout(dummyOut);
tidy.setSmartIndent(true);
ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes());
Document doc = tidy.parseDOM(inputStream, null);
return doc;
}
But sometimes the library works incorrectly and some tags are lost.
Please recommend a good open-source library for this task.
Thanks very much!
You don't say in which cases the library gives a wrong result.
Nevertheless, I regularly work with HTML files that I must extract data from, and the main problem I encounter is that some tags are not valid, for example because they are not closed.
The best solution I have found is the HtmlCleaner API (HtmlCleaner website).
It lets you make your HTML well formed.
Transforming it into a W3C document or another strict format is then easier.
With HtmlCleaner, you could do something like:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);
The DomSerializer used here is the one from HtmlCleaner.

How to read a InputStream with UTF-8?

Hello all,
I'm developing a Java app that calls a PHP script on the internet, which returns an XML response.
The response contains the word "Próximo", but when I parse the XML nodes and read the response into a String variable, I receive the word like this: "Pr& oacute;ximo".
I'm sure the problem is that the Java app uses a different encoding than the PHP script, so I suppose I must set the encoding to the same one as the PHP XML, UTF-8.
This is the code I'm using to get the XML file from the PHP script.
What should I change in this code to set the encoding to UTF-8? (Note that I'm not using a BufferedReader; I'm using an InputStream.)
InputStream in = null;
String url = "http://www.myurl.com";
try {
URL formattedUrl = new URL(url);
URLConnection connection = formattedUrl.openConnection();
HttpURLConnection httpConnection = (HttpURLConnection) connection;
httpConnection.setAllowUserInteraction(false);
httpConnection.setInstanceFollowRedirects(true);
httpConnection.setRequestMethod("GET");
httpConnection.connect();
if (httpConnection.getResponseCode() == HttpURLConnection.HTTP_OK)
in = httpConnection.getInputStream();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(in);
doc.getDocumentElement().normalize();
NodeList myNodes = doc.getElementsByTagName("myNode");
When you get your InputStream, read byte[]s from it. When you create your Strings, pass in the charset name "UTF-8". Example:
byte[] buffer = new byte[contentLength];
int bytesRead = inputStream.read(buffer);
String page = new String(buffer, 0, bytesRead, "UTF-8");
Note that you will probably want to give your buffer a sane size (like 1024) and call inputStream.read(buffer) repeatedly until it returns -1.
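A minimal sketch of that read loop, assuming the whole response fits in memory. The class name Utf8StreamReader and the method readAll are invented for this example. The bytes are collected first and decoded once at the end, because decoding each chunk separately could split a multi-byte UTF-8 character across two reads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads an entire stream and decodes it as UTF-8.
class Utf8StreamReader {

    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int bytesRead;
            // read() returns -1 at end of stream; it may return fewer bytes than the buffer holds.
            while ((bytesRead = in.read(buffer)) != -1) {
                collected.write(buffer, 0, bytesRead);
            }
            // Decode once, after all bytes have arrived.
            return new String(collected.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for httpConnection.getInputStream().
        byte[] bytes = "Pr\u00f3ximo".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```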
@Amir Pashazadeh
Yes, you can also use an InputStreamReader. Try changing the parse() line to:
Document doc = db.parse(new InputSource(new InputStreamReader(in, "UTF-8")));

Use a Remote XML as File [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How to read XML response from a URL in java?
I'm trying to read an XML file from my web server and display the contents of it on a ListView, so I'm reading the file like this:
File xml = new File("http://example.com/feed.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(xml);
doc.getDocumentElement().normalize();
NodeList mainNode = doc.getElementsByTagName("article");
// for loop to populate the list...
The problem is that I'm getting this error:
java.io.FileNotFoundException: /http:/mydomainname.com/feed.xml (No such file or directory)
Why am I having this problem, and how can I correct it?
File is meant to point to local files.
If you want to point to a remote URI, the easiest way is to use the URL class:
//modified code
URL url = new URL("http://example.com/feed.xml");
URLConnection urlConnection = url.openConnection();
InputStream in = new BufferedInputStream(urlConnection.getInputStream());
//your code
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse( in );
As you can see, thanks to the Java stream APIs, the rest of your code logic works unchanged with the content of the remote file. This works because DocumentBuilder has an overload of parse that accepts an InputStream.
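Because parse accepts any InputStream, the parsing logic can be kept separate from where the bytes come from. Here is a rough sketch: the class name ArticleCounter and the method countArticles are made up, the "article" tag comes from the question, and an in-memory stream stands in for the network stream so the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper: counts <article> elements in XML read from any stream.
class ArticleCounter {

    static int countArticles(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            doc.getDocumentElement().normalize();
            return doc.getElementsByTagName("article").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // In the real app this would be the stream obtained from the URL connection.
        byte[] feed = "<feed><article/><article/></feed>".getBytes(StandardCharsets.UTF_8);
        System.out.println(countArticles(new ByteArrayInputStream(feed)));
    }
}
```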
You need to fetch the XML over HTTP as an input stream and pass it to DocumentBuilder; from there you can use the logic you already have. For example, with Apache HttpClient:
DefaultHttpClient client = new DefaultHttpClient();
// execute() takes a request object, so wrap the URL in an HttpGet
HttpResponse resp = client.execute(new HttpGet(yourURL));
if (resp.getStatusLine().getStatusCode() == 200)
{
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(resp.getEntity().getContent());
}
Note: I typed this here directly, so there may be syntax errors.
You need to read the file using a URL object. For instance, try something like this:
URL facultyURL = new URL("http://example.com/feed.xml");
InputStream is = facultyURL.openStream();
