Simplest example - how to parse html with saxon using java? - java

Could anyone explain why the following code does not return any results?
The HTML is valid and contains a lot of "div" elements.
Processor proc = new Processor(false);
proc.setConfigurationProperty("http://saxon.sf.net/feature/sourceParserClass", "org.ccil.cowan.tagsoup.Parser");
XPathCompiler xpath = proc.newXPathCompiler();
DocumentBuilder builder = proc.newDocumentBuilder();
XdmNode doc = builder.build(new File("/tmp/test.html"));
XPathSelector selector = xpath.compile("//div").load();
selector.setContextItem(doc);
for (XdmItem item : selector)
{
System.out.println(((XdmNode)item).getNodeName());
}
I took that code from the Saxon samples and added the "proc.setConfigurationProperty..." call in order to parse HTML input.
All I want is to:
1) submit an HTML string
2) get a document node
3) run some queries with XPath 3
Thank you.
P.s. I don't want to use xslt.

Changing "//div" to "//*[name()="div"]" solved the problem.
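A likely explanation for why this works: TagSoup, by default, reports HTML elements in the XHTML namespace (http://www.w3.org/1999/xhtml), and an unprefixed name test such as //div only matches elements in no namespace. The sketch below reproduces that effect with the JDK's own DOM and XPath 1.0 classes rather than Saxon and TagSoup, purely so that it runs without extra libraries; the class name NamespaceDemo and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses JDK XPath 1.0, not Saxon.
final class NamespaceDemo {

    /** Counts the nodes an XPath expression selects in a namespace-aware parse of the XML. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep the XHTML namespace on each element
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><div>a</div><div>b</div></body></html>";
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count(xhtml, "//div"));                   // prints 0
        // local-name() ignores the namespace entirely.
        System.out.println(count(xhtml, "//*[local-name()='div']")); // prints 2
    }
}
```

With Saxon itself, an alternative to the name() workaround would probably be to bind a prefix with XPathCompiler.declareNamespace("h", "http://www.w3.org/1999/xhtml") and query //h:div instead.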

Related

jsoup select() method not found

I want to write a web parser in Java. I'm using jsoup, but I get an error like this:
How can I fix it? Is it because of my classpath?
Also, these are my imports:
Use the code below. Note that first() returns a single Element, not Elements:
Document doc = Jsoup.connect(I).get();
Element firstParagraph = doc.select("div#content > p").first();
You have to get the Document from the Connection:
Connection con = Jsoup.connect(url);
Document doc = con.get();

What is a good approach to verify XML response from RESTful service in Java?

I am performing simple RESTful service API verification in Java.
Handling a response in JSON format is very convenient. Using the org.json library, it's easy to convert the JSON string from a RESTful response into a JSON object and compare it with the expected JSON.
JSONObject response = new JSONObject(json_response_str);
JSONObject expected = new JSONObject(json_expected_str);
JSONAssert.assertEquals(expected, response, JSONCompareMode.LENIENT);
If only some element of the JSON response needs to be compared, that is also easy, because a sub-element can be extracted from a JSONObject using APIs like:
Object element_in_response = response.get("..."); or
JSONObject element_in_response = response.getJSONObject("...");
However, handling a response in XML format is more difficult. Comparing the whole XML response with the expected XML is not bad; I can use XMLUnit to do it:
String xml_response_str = ...
String xml_expected_str = ...
assertXMLEqual(xml_expected_str, xml_response_str);
However, there is no XML counterpart to JSONObject.
So what do I do if I want to compare just one element of the XML response with the expected value?
I've searched forums, and JAXB is sometimes mentioned. I checked, and it is about binding XML to Java objects. So am I supposed to parse both the response XML string and the expected XML string, extract the element as a Java object from each, and then compare them? That seems complicated, not to mention that I would need the XML schema to start with.
What is an effective way to do this? Is there anything as convenient as in the JSON case?
Thanks,
You can try XPath.
Here is a short example.
Given this XML string:
<?xml version="1.0" encoding="UTF-8"?>
<resp>
<status>good</status>
<msg>hi</msg>
</resp>
The following code will get the status and the message:
String xml = "<resp><status>good</status><msg>hi</msg></resp>";
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
InputSource source = new InputSource(new StringReader(xml));
Document doc = (Document) xpath.evaluate("/", source, XPathConstants.NODE);
String status = xpath.evaluate("/resp/status", doc);
String msg = xpath.evaluate("/resp/msg", doc);
System.out.println("status=" + status);
System.out.println("Message=" + msg);
Here are more examples of how to use XPath:
http://viralpatel.net/blogs/java-xml-xpath-tutorial-parse-xml/
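To compare a single element of the response with the expected document, as the question asks, the same XPath call can be applied to both strings. The following is only a sketch under that assumption; the class XmlElementCheck and its methods valueAt and sameValue are hypothetical helpers, not part of any library.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for element-level comparison of two XML strings.
final class XmlElementCheck {

    /** Returns the string value of the first node matched by the expression ("" if none). */
    static String valueAt(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    /** True when both documents hold the same text at the given path. */
    static boolean sameValue(String expectedXml, String actualXml, String expression) {
        return valueAt(expectedXml, expression).equals(valueAt(actualXml, expression));
    }
}
```

In a test this might read assertTrue(XmlElementCheck.sameValue(xml_expected_str, xml_response_str, "/resp/status")). Note that a path matching nothing yields an empty string in both documents, so a mistyped expression would compare as equal; checking that the value is non-empty guards against that.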
There are a number of ways of testing XML. Converting the XML to JSON is not one of them, although it can be done.
Testing XML is usually performed using XPath-style comparisons, which focus on elements, attributes and content rather than on comparing whole chunks.
From looking at your code, you're already familiar with the XML assertions from http://xmlunit.sourceforge.net/api/org/custommonkey/xmlunit/XMLAssert.htm, but you might also want to look at http://www.w3schools.com/xsl/xpath_intro.asp.
XML validation is not that easy and does require a lot of effort to begin with. Once you've got your testing tools in order, it gets a whole lot easier.
Verifying (or extracting from) XML is independent of the protocol (RESTful service, etc.), although XML is most commonly used in SOAP services.
The comparison with JSON is interesting. JSON is easier to use with PHP, JavaScript, ...
If you want to connect two Java servers, XML is sufficient, or plain Java objects (a solution that is not portable to other languages).
The stronger argument for XML: it is more powerful, well standardized, and you have a lot of tools to process it.
As for what you are asking: the XML equivalent of JSONObject has existed for a while. It is a DOM Document, or a Node.
1) Read your XML:
String xml="<root>content</root>";
DocumentBuilderFactory builderFactory =DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
// PARSE
Document document = builder.parse(new InputSource(new StringReader(xml)));
2) The best way to get at particular data is XPath: you give a path to your data (root/group/class1/other_group/...), and you can use wildcards (*), select on attributes, values, etc.
See this:
How to read XML using XPath in Java
XPath xpath = XPathFactory.newInstance().newXPath();
String expression="/root";
3) You can get values directly:
expression="/root/text()";
String value = xpath.evaluate(expression, document);
4) Or you can get all the matching nodes (if there are several):
XPathExpression expr = xpath.compile(expression) ;
NodeList nodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++)
{
Node nodeSegment = nodes.item(i);
if (nodeSegment.getNodeType() == Node.ELEMENT_NODE)
{
Element eElement = (Element) nodeSegment;
System.out.println("TAG="+eElement.getTagName());
// getNodeValue() is always null for an element; getTextContent() returns its text
System.out.println("VALUE="+eElement.getTextContent());
}
}

How to append element in a XML file using dom?

My XML file looks like this:
<Messages>
<Contact Name="Robin" Number="8775454554">
<Message Date="24 Jan 2012" Time="04:04">this is report1</Message>
</Contact>
<Contact Name="Tobin" Number="546456456">
<Message Date="24 Jan 2012" Time="04:04">this is report2</Message>
</Contact>
</Messages>
I need to check whether the 'Number' attribute of Contact element is equal to 'somenumber' and if it is, I'm required to insert one more Message element inside Contact element.
How can it be achieved using DOM? And what are the drawbacks of using DOM?
The main drawback of using DOM is that the whole model has to be loaded into memory at once, whereas if you are simply streaming through the document you can limit how much data you keep in memory at any one point. Of course, this isn't really an issue until you are processing very large XML documents.
As for the processing side of things, something like the following should work:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse(is);
NodeList contacts = dom.getElementsByTagName("Contact");
for(int i = 0; i < contacts.getLength(); i++) {
Element contact = (Element) contacts.item(i);
String contactNumber = contact.getAttribute("Number");
if(contactNumber.equals(somenumber)) {
Element newMessage = dom.createElement("Message");
// Configure the message element
contact.appendChild(newMessage);
}
}
DOM has two main disadvantages:
It requires reading of the complete XML into a Java representation in memory. That can be both time and memory consuming
It is a pretty verbose API, so you need to write a lot of code to achieve simple things like you're asking for.
If time and memory consumption is OK for you, but verbosity is not, you could still use jOOX, a library that I have created to wrap standard Java DOM objects to simplify manipulation of XML. These are some examples of how you would implement your requirement with jOOX:
// With css-style selectors
String result1 = $(file).find("Contact[Number=somenumber]").append(
$("<Message Date=\"25 Jan 2012\" Time=\"23:44\">this is report2</Message>")
).toString();
// With XPath
String result2 = $(file).find("//Contact[@Number = somenumber]").append(
$("<Message Date=\"25 Jan 2012\" Time=\"23:44\">this is report2</Message>")
).toString();
// Instead of file, you can also provide your source XML in various other forms
Note that jOOX only wraps standard Java DOM. The underlying operations (find() and append(), as well as $()) actually perform various DOM operations.
You would do something along these lines:
1) Get the NodeList of Contact elements.
2) Iterate through the NodeList and get each Contact element.
3) Get Number through contact.getAttribute("Number"), where contact is of type Element.
4) If the number equals someNumber, add the Message by calling contact.appendChild(). Message must be an element.
Use the Document to create a new element:
Element message = doc.createElement("Message");
message.setAttribute("message", strMessage);
Now add this element after whatever element you want using
elem.getParentNode().insertBefore(message, elem.getNextSibling());
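The steps described in the answers above can be put together into one runnable sketch. It assumes the XML arrives as a string and that the new message should carry Date and Time attributes like the existing ones; the class MessageAppender, the method appendMessage and its parameter names are invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; names and signature are assumptions, not from the question.
final class MessageAppender {

    /** Adds a Message to every Contact whose Number matches and returns the resulting XML. */
    static String appendMessage(String xml, String number, String date, String time, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList contacts = doc.getElementsByTagName("Contact");
            for (int i = 0; i < contacts.getLength(); i++) {
                Element contact = (Element) contacts.item(i);
                if (number.equals(contact.getAttribute("Number"))) {
                    Element message = doc.createElement("Message");
                    message.setAttribute("Date", date);
                    message.setAttribute("Time", time);
                    message.setTextContent(text);
                    contact.appendChild(message); // becomes the last child of this Contact
                }
            }
            // Serialize the modified tree back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not update the document", e);
        }
    }
}
```

To write the result back to a file instead, the StreamResult could wrap a File rather than a StringWriter.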
You might want to take a look at this tutorial; it is about exactly what you want to do.

Some help scraping a page in Java

I need to scrape a web page using Java and I've read that regex is a pretty inefficient way of doing it and one should put it into a DOM Document to navigate it.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
1) Transform the web page you are trying to scrape into an XHTML document. There are several options for doing this in Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
2) Extract the required information using XPath expressions.
Here is a working example using JTidy and the web page you provided; it extracts all file names from the table.
public static void main(String[] args) throws Exception {
// Create a new JTidy instance and set options
Tidy tidy = new Tidy();
tidy.setXHTML(true);
// Parse an HTML page into a DOM document
URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
Document doc = tidy.parseDOM(url.openStream(), System.out);
// Use XPath to obtain whatever you want from the (X)HTML
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
List<String> filenames = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
filenames.add(nodes.item(i).getNodeValue());
}
System.out.println(filenames);
}
The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above but using an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
If all you are doing is scraping a table into a datafile, regex will be just fine, and may be even better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables) so you probably want a SAX parser for large documents.
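For the SAX route mentioned above, a streaming handler might look roughly like the sketch below. It assumes the page has already been converted to well-formed XHTML (for example by one of the tidying tools named earlier), because the JDK's SAX parser rejects ordinary malformed HTML; it also does not handle nested tables. The class TableCellReader and the method readCells are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams through the document without building a DOM.
final class TableCellReader {

    /** Returns the trimmed text of every td element in well-formed XHTML. */
    static List<String> readCells(String xhtml) {
        final List<String> cells = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a td

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("td".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length); // text may arrive in several chunks
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("td".equals(qName) && current != null) {
                    cells.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return cells;
    }
}
```

Only the text of the cell currently being read is buffered, which is the memory advantage over DOM for very large tables.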

JDOM, XPath and Namespace Interactions

I'm having a very frustrating time extracting some elements from a JDOM document using an XPath expression. Here's a sample XML document - I'd like to remove the ItemCost elements from the document altogether, but I'm having trouble getting an XPath expression to evaluate to anything at the moment.
<srv:getPricebookByCompanyResponse xmlns:srv="http://ess.com/ws/srv">
<srv:Pricebook>
<srv:PricebookName>Demo Operator Pricebook</srv:PricebookName>
<srv:PricebookItems>
<srv:PricebookItem>
<srv:ItemName>Demo Wifi</srv:ItemName>
<srv:ProductCode>DemoWifi</srv:ProductCode>
<srv:ItemPrice>15</srv:ItemPrice>
<srv:ItemCost>10</srv:ItemCost>
</srv:PricebookItem>
<srv:PricebookItem>
<srv:ItemName>1Mb DIA</srv:ItemName>
<srv:ProductCode>Demo1MbDIA</srv:ProductCode>
<srv:ItemPrice>20</srv:ItemPrice>
<srv:ItemCost>15</srv:ItemCost>
</srv:PricebookItem>
</srv:PricebookItems>
</srv:Pricebook>
</srv:getPricebookByCompanyResponse>
I would normally just use an expression such as //srv:ItemCost to identify these elements, which works fine on other documents, however here it continually returns 0 nodes in the List. Here's the code I've been using:
Namespace ns = Namespace.getNamespace("srv","http://ess.com/ws/srv");
XPath filterXpression = XPath.newInstance("//ItemCost");
filterXpression.addNamespace(ns);
List nodes = filterXpression.selectNodes(response);
Where response is a JDOM element containing the above XML snippet (verified with an XMLOutputter). nodes continually has size()==0 whenever parsing this document. Using the XPath parser in Eclipse on the same document, this expression does not work either. After some digging, I got the Eclipse evaluator to work with the following expression: //*[local-name() = 'ItemCost'], however replacing the //srv:ItemCost in the Java code with this still produced no results. Another thing I noticed is that if I remove the namespace declaration from the XML, //srv:ItemCost will resolve correctly in the Eclipse parser, but I can't remove it from the XML. I've been scratching my head for hours on this one now, and would really appreciate some nudging in the right direction.
Many thanks
Edit: fixed code:
Document build = new Document(response);
XPath filterXpression = XPath.newInstance("//srv:ItemCost");
List nodes = filterXpression.selectNodes(build);
Strange, indeed... I tested on my side with JDOM, and your snippet produced an empty list. The following works as intended:
public static void main(String[] args) throws JDOMException, IOException {
File xmlFile = new File("sample.xml");
SAXBuilder builder = new SAXBuilder();
Document build = builder.build(xmlFile);
XPath filterXpression = XPath.newInstance("//srv:ItemCost");
System.out.println(filterXpression.getXPath());
List nodes = filterXpression.selectNodes(build);
System.out.println(nodes.size());
}
It produces the output:
//srv:ItemCost
2
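For comparison, here is how the same query and the removal of the ItemCost elements could look with the JDK's javax.xml.xpath API instead of JDOM. This is a sketch of a swapped-in technique, not the JDOM API used in the question: with the JDK, the srv prefix has to be bound explicitly through a NamespaceContext, and the parser must be namespace-aware. The class ItemCostRemover and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper using JDK DOM/XPath rather than JDOM.
final class ItemCostRemover {

    /** Minimal context that knows only the srv prefix from the question. */
    static final NamespaceContext SRV = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "srv".equals(prefix) ? "http://ess.com/ws/srv" : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return null; // reverse lookup is not needed for XPath evaluation
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            return Collections.emptyIterator();
        }
    };

    /** Parses XML with namespace support switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Removes every srv:ItemCost element and returns how many were removed. */
    static int removeItemCosts(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(SRV);
            NodeList nodes = (NodeList) xpath.evaluate("//srv:ItemCost", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                node.getParentNode().removeChild(node); // detach from its PricebookItem
            }
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath expression", e);
        }
    }
}
```

Evaluating against the Document rather than a detached element mirrors the fix in the question, where wrapping response in a Document made //srv:ItemCost resolve.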
