Parsing xml with dom4j or jdom or anyhow - java

I wanna read feed entries and I'm just stuck now. Take this for example : https://stackoverflow.com/feeds/question/2084883 lets say I wanna read all the summary node value inside each entry node in document. How do I do that? I've changed many variations of code this one is closest to what I want to achieve I think :
Element entryPoint = document.getRootElement();
Element elem;
for(Iterator iter = entryPoint.elements().iterator(); iter.hasNext();){
elem = (Element)iter.next();
System.out.println(elem.getName());
}
It goes trough all nodes in xml file and writes their name. Now what I wanted to do next is
if(elem.getName().equals("entry"))
to get only the entry nodes, how do I get elements of the entry nodes, and how to get let say summary and its value? tnx
Question: how to get values of summary nodes from this link

Have you tried jdom? I find it simpler and convenient.
http://www.jdom.org/
To get all children of an xml element, you can just do
SAXBuilder sb = new SAXBuilder();
StringReader sr = new StringReader(xmlDocAsString);
Document doc = sb.build(sr);
Element root = doc.getRootElement();
List l = root.getChildren("entry");
for (Iterator iter = l.iterator(); iter.hasNext();) {
...//do whatever...
}

Here's how you'd do it using vanilla Java:
//read the XML into a DOM
StreamSource source = new StreamSource(new StringReader("<theXml></theXml>"));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();
//make XPath object aware of namespaces
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext(){
#Override
public String getNamespaceURI(String prefix) {
if ("atom".equals(prefix)){
return "http://www.w3.org/2005/Atom";
}
return null;
}
#Override
public String getPrefix(String namespaceURI) {
return null;
}
#Override
public Iterator getPrefixes(String namespaceURI) {
return null;
}
});
//get all summaries
NodeList summaries = (NodeList) xpath.evaluate("/atom:feed/atom:entry/atom:summary", root, XPathConstants.NODESET);
for (int i = 0; i < summaries.getLength(); ++i) {
Node summary = summaries.item(i);
//print out all the attributes
for (int j = 0; j < summary.getAttributes().getLength(); ++j) {
Node attr = summary.getAttributes().item(j);
System.out.println(attr.getNodeName() + "=" + attr.getNodeValue());
}
//print text content
System.out.println(summaries.item(i).getTextContent());
}

if(elem.getName() == "entry")
I have no idea whether this is your problem (you don't really state what your problem is), but never test string equality with --. Instead, use equals():
if(elem.getName().equals("entry"))

A bit late but it might be useful for people googling...
There is a specialized API for dealing with RSS and Atom feeds in Java. It's called Rome, can be found here :
http://java.net/projects/rome/
It is really quite useful, it makes easy to read feed whatever the RSS or Atom version. You can also build feeds and generate the XML with it though I have no experience with this feature.
Here is a simple example that reads a feed and prints out the description nodes of all the entries in the feed :
URL feedSource = new URL("http://....");
feed = new SyndFeedInput().build(new XmlReader(feedSource));
List<SyndEntryImpl> entries = (List<SyndEntryImpl>)feed.getEntries();
for(SyndEntryImpl entry : entries){
System.out.println(entry.getDescription().getValue());
}
Simple enough.

Related

How do I parse answers from AWS Mechanical Turk in Java?

The answer() method for an Assignment returns a String like this:
<?xml version="1.0" encoding="ASCII"?><QuestionFormAnswers xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionFormAnswers.xsd"><Answer><QuestionIdentifier>blah</QuestionIdentifier><FreeText>toplevel</FreeText></Answer></QuestionFormAnswers>
How am I supposed to parse this to get the actual answers? I see in older versions of the API there's a QuestionFormAnswers type. This is also referenced in the documentation, which states:
public String getAnswer()
The Worker's answers submitted for the HIT contained in a QuestionFormAnswers document, if the Worker provides an answer. If the Worker does not provide any answers, Answer may contain a QuestionFormAnswers document, or Answer may be empty.
Returns:
The Worker's answers submitted for the HIT contained in a QuestionFormAnswers document, if the Worker provides an answer. If the Worker does not provide any answers, Answer may contain a QuestionFormAnswers document, or Answer may be empty.
But it actually returns a String and not a QuestionFormAnswers. How do I parse this string XML result? Can I just use any standard method of parsing XML documents?
The answer appears to be yes, you can use any standard XML parsing technique.
Here is what worked for me:
private static Map<String, String> parseXML(String answerXML) {
try {
List<String> identifierList = new ArrayList<>();
List<String> answerList = new ArrayList<>();
InputSource is = new InputSource(new StringReader(answerXML));
Document document = null;
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(is);
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList identifiers = null;
try {
identifiers = (NodeList) xpath.evaluate("//Answer/QuestionIdentifier", document,
XPathConstants.NODESET);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
for (int i = 0; i < identifiers.getLength(); i++) {
Node identifier = identifiers.item(i);
String relation = identifier.getTextContent();
identifierList.add(relation);
}
NodeList texts = (NodeList) xpath.evaluate("//Answer/FreeText", document, XPathConstants.NODESET);
for (int i = 0; i < texts.getLength(); i++) {
Node text = texts.item(i);
String answer = text.getTextContent();
answerList.add(answer);
}
Map<String, String> result = new HashMap<>();
for (int k = 0; k < identifierList.size(); k++) {
result.put(identifierList.get(k), answerList.get(k));
}
return result;
} catch (Exception e) {
log.error("Failed to parse XML " + answerXML, e);
}
return null;
}
This creates a map from the input ids to the answers.

How to get html from a org.w3c.dom.Node in java?

I've build a method which extracts data from an html document using the xpath components of saxon-he. I'm using w3c dom object model for this.
I already created a method which returns the text-value, similar like the text value method from jsoup (jsoupElement.text()):
protected String getNodeValue(Node node) {
NodeList childNodes = node.getChildNodes();
for (int x = 0; x < childNodes.getLength(); x++) {
Node data = childNodes.item(x);
if (data.getNodeType() == Node.TEXT_NODE)
return data.getNodeValue();
}
return "";
}
This works fine but i now i need the underlying html of a selected node (with jsoup it would be jsoupElement.html()). Using the w3c dom object model i have org.w3c.dom.Node. How can i get the html from a org.w3c.dom.Node as String? I couldn't find anything regarding this in the documentation.
Just for clarification: I need the inner html (with or without the node element/tag) as String. Similar like http://api.jquery.com/html/ or http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#html--
To serialize a W3C DOM Node's child nodes to HTML with Saxon you can use a default Transformer where you set the output method to html:
public static String getInnerHTML(Node node) throws TransformerConfigurationException, TransformerException
{
StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();
Transformer proc = factory.newTransformer();
proc.setOutputProperty(OutputKeys.METHOD, "html");
for (int i = 0; i < node.getChildNodes().getLength(); i++)
{
proc.transform(new DOMSource(node.getChildNodes().item(i)), result);
}
return sw.toString();
}
But as said, this is a serialization of the tree, the original XML or HTML is not stored in a DOM tree or Saxon's tree model, there is no way to access it.

Xpath approach in case of large files

The class you're gonna see right now is the classic approach to parse an XML document via XPath in Java:
public class Main {
private Document createXMLDocument(String fileName) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(fileName);
return doc;
}
private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpathExpression);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
}
public static void main(String[] args) throws Exception {
Main m = new Main();
Document doc = m.createXMLDocument("tv.xml");
NodeList nodes = m.readXMLNodes(doc, "//serie/eason/#id");
int n = nodes.getLength();
Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();
for (int i = 1; i <= n; i++) {
nodes = m.readXMLNodes(doc, "//serie/eason[#id='" + i + "']/episode/text()");
List<String> episodes = new ArrayList<String>();
for (int j = 0; j < nodes.getLength(); j++) {
episodes.add(nodes.item(j).getNodeValue());
}
series.put(i, episodes);
}
for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
System.out.println("Season: " + entry.getKey());
for (String ep : entry.getValue()) {
System.out.println("Episodio: " + ep);
}
System.out.println("+------------------------------------+");
}
}
}
In there I find some methods to be worrying in case of a huge xml file. Like the use of
Document doc = builder.parse(fileName);
return doc;
or
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
I'm worried because the xml document I need to handle is created by the customer and inside you can basically have an indefinite number of records describing emails and their contents (every user has its own personal email, so lots of html in there). I know it's not the smartest approach but it's one of the possibilities and it was already up and running before I arrived here.
My question is: how can I parse and evaluate huge xml files using xpath?
You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html
First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.
What you are really looking for is something that searches the XML on the fly, while it is being parsed.
One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.
If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where events such as startElement and endElement are notified by the XML parser to the application, without building a tree in memory.

How can I automate the XML Parsing using JDOM

I have to parse an XML file using JDOM and get some infos from all his elements.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element1>something</element1>
<element2>
<subelement21>moo</subelement21>
<subelement22>
<subelement221>toto</subelement221>
<subelement222>tata</subelement222>
</subelement22>
</element2>
</root>
So, for the element1 it's easy. But for the element2 I have to go through his children and if the children has children go through them too and so on.
public static void getInfos(Vector<String> files) {
Document document = null;
Element root = null;
SAXBuilder sxb = new SAXBuilder();
for (int i =0 ; i< files.size() ; i++)
{
System.out.println("n°" + i + " : " + files.elementAt(i));
try
{
document = sxb.build(files.elementAt(i));
root = document.getRootElement();
List<?> listElements = root.getChildren();
Iterator<?> it = listElements.iterator();
while(it.hasNext())
{
Element courant = (Element)it.next();
System.out.println(courant.getName());
if(courant.getChildren().size() > 0)
{
// here is the problem -> the element has a children
}
}
}
catch (Exception e) {
e.printStackTrace();
}
}
}
What do you suggest in this case, like a recursive call or something else so I can use the same function.
Thanks.
I would use SAX. I'd keep a stack in the contenthandler that tracked what my current path was in the document, and keep a buffer that my characters method appended to. In endElement I'd get the content from the buffer and clear it out, then use the current path to decide what to do with it.
(this is assuming this document has no mixed-content.)
Here's a link to an article on using SAX to process complex XML documents, it expands on what I briefly described into an approach that handles recursive data structures. (It also has a predecessor article that is an introduction to SAX.)
You could consider using XPath to get the exact elements you want. The example here uses namespaces but the basic idea holds.

A question on how to solve the string problem in Java

I've created a simple xml file here:
http://roberthan.host56.com/productsNew.xml
which is quite simple, the root node is [products] while all other element nodes are [product]. Under each [product] node, there are two child nodes, [code] and [name], so it basically looks like:
[product]
[code]ddd[/code]
[name]ssss[/name]
[/product]
I've also written up the following Java code to parse this XML file and take out the text content of the [product] node, and add it to a JComboBox.
docBuilder = docFactory.newDocumentBuilder();
doc = docBuilder.parse("http://roberthan.host56.com/productsNew.xml");
NodeList productNodes = doc.getElementsByTagName("product");
productlist.clear();
for (i = 0; i < productNodes.getLength(); i++)
{
Node childNode = productNodes.item(i);
if (childNode.hasChildNodes()) {
NodeList nl = childNode.getChildNodes();
Node nameNode = nl.item(2);
productlist.add(nameNode.getTextContent());
}
}
final JComboBox productComboB = new JComboBox();
Iterator iterator = productlist.iterator();
while(iterator.hasNext())
{
productComboB.addItem(iterator.next().toString());
}
The code is quite straightforward, I firstly parse the xml and get all the product nodes and put them into a nodelist, and the productList is an arrayList. I loop through the all the [product] nodes, for each of them, if it has child nodes, then I take the second child node (which is the [name] node) and put the text content of it in the array list, and finally, I loop through the arrayList and add each item to the combo box.
The problem I got is, if I select the [code] child node, which means "Node nameNode = nl.item(1)", it will work perfectly; however, if I change that item(1) to item(2) to extract all the [name] nodes, the combo box will have a drop down list, but all the items are blank, like I have inserted 10 empty strings.
Also, if I try to add a "Hello World" string into the combo box after the above code, the "Hello World" item will appear after the 10 empty items.
I have spent the whole afternoon debugging this but still no breakthrough, the XML is actually quite simple and the Java is straightforward too. Could anyone share some thoughts with me on this please. Thanks a lot!
It is because the node list contains text nodes also.
If you add the following snippet to your code you will find that
for(int j = 0;j<nl.getLength();j++){
System.out.println(nl.item(j).getNodeName());
}
It will give the following output for each iteration of the product
#text
code
#text
name
#text
This means you have to get the 3rd element to get the name node.
Node nameNode = nl.item(3);
But I'll suggest you to use XPath to solve this problem.
NodeList nodelist = XPathAPI.selectNodeList(doc, "//products/product/name");
for (int i = 0; i < nodelist.getLength(); i++) {
productlist.add(nodelist.item(i).getTextContent());
}
XPath using this expression will easily solve your problem:
String XPATH_EXPRESSION1 = "//name/text()";
e.g.,
public static final String PRODUCTS_NEW = "http://roberthan.host56.com/productsNew.xml";
public static final String XPATH_EXPRESSION1 = "//name/text()";
public XmlFun() {
URL productsUrl;
try {
productsUrl = new URL(PRODUCTS_NEW);
List<String> nameList = xPathExtract(productsUrl.openStream());
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (XPathExpressionException e) {
e.printStackTrace();
}
}
private List<String> xPathExtract(InputStream inStream) throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document domDoc = builder.parse(inStream);
XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression xExpr = xpath.compile(XPATH_EXPRESSION1);
NodeList nodes = (NodeList)xExpr.evaluate(domDoc, XPathConstants.NODESET);
List<String> resultList = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
String node = nodes.item(i).getNodeValue();
resultList.add(node);
}
return resultList;
}

Categories