Parsing Tree-structure into Relational-style data-store - java

Could someone help with how I could implement this, or at least suggest an algorithm to use?
What I am trying to do is parse a hierarchical/tree-structured file into a relational store. I explain further below with an example.
This is a sample source file, just a simple, unrealistic example for the purposes of this question.
<title text="title1">
    <comment id="comment1">
        <data> this is part of comment one</data>
        <data> this is some more of comment one</data>
    </comment>
    <comment id="comment2">
        <data> this is part of comment two</data>
        <data> this is some more of comment two</data>
        <data> this is even some more of comment two</data>
    </comment>
</title>
The main thing to note here is that the number of <comment> elements, and the number of <data> elements within each comment, may be arbitrary. Given the above, I would want to transform it into something looking like:
title  | comment  | data
------------------------------------------------------------------------
title1 | comment1 | this is part of comment one
title1 | comment1 | this is some more of comment one
title1 | comment2 | this is part of comment two
title1 | comment2 | this is some more of comment two
title1 | comment2 | this is even some more of comment two
To make this happen, let's say I have specified the relational schema in the following manner, using XPath expressions that can be evaluated on the source file.
attribute1: title = /title/@text
attribute2: comment = /title/comment/@id
attribute3: data = /title/comment/data/text()
Suggested Data-structures:
ResultSet is a List<Map<String,String>> (where: each map represents a single row)
Schema is a Map<String,String> (where: we map attribute-name --> path expression)
Source file, some DOM Document

I'm not sure whether you're asking how to implement the XML parser itself or how, given a parse tree for the XML, to flatten it into a relational structure. I'm guessing it's the latter (there are many good XML parsers out there and I doubt that's the bottleneck), so I'll answer that here. Let me know if you're actually interested in the XML parsing details and I can update the answer.
I believe that the way you want to think about this is with a recursive descent over the tree. The idea is as follows: your naming system consists of the concatenation of all the nodes above you in the tree followed by your own name. Given that, you could run a recursive DFS over the tree using something like this:
FlattenXML(XMLDocument x) {
    for each top-level XML node t, whose name is n:
        RecFlattenTree(t, "/" + n);
}
RecFlattenTree(Tree t, String prefix) {
    if t is a leaf with data d:
        update the master table by adding (prefix, d) to the list of entries
    else
        for each child c of t, whose name is x:
            RecFlattenTree(c, prefix + "/" + x)
}
For example, if you were to trace this over the XML document you had up top, it might go something like this:
RecFlattenTree(title1, "/title1")
    RecFlattenTree(comment1, "/title1/comment1")
        RecFlattenTree(data node 1, "/title1/comment1/data")
            Add /title1/comment1/data, value = "this is part of comment one"
        RecFlattenTree(data node 2, "/title1/comment1/data")
            Add /title1/comment1/data, value = "this is some more of comment one"
    RecFlattenTree(comment2, "/title1/comment2")
        RecFlattenTree(data node 1, "/title1/comment2/data")
            Add /title1/comment2/data, value = "this is part of comment two"
        RecFlattenTree(data node 2, "/title1/comment2/data")
            Add /title1/comment2/data, value = "this is some more of comment two"
        RecFlattenTree(data node 3, "/title1/comment2/data")
            Add /title1/comment2/data, value = "this is even some more of comment two"
Which ends up generating the list
/title1/comment1/data, value = "this is part of comment one"
/title1/comment1/data, value = "this is some more of comment one"
/title1/comment2/data, value = "this is part of comment two"
/title1/comment2/data, value = "this is some more of comment two"
/title1/comment2/data, value = "this is even some more of comment two"
Which is exactly what you want.
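For what it's worth, here is a minimal Java sketch of the same depth-first idea using the JDK's DOM API. Instead of building a path string, it carries the ancestors' attribute values down the tree and emits one row per leaf element. The class name TreeFlattener and the convention "the first attribute of an element becomes the column named after its tag" are assumptions made for this illustration; they are not part of the question's schema mechanism.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class TreeFlattener {

    /** Parses the XML text and returns one row (column name -> value) per leaf element. */
    static List<Map<String, String>> flatten(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<Map<String, String>> rows = new ArrayList<>();
        walk(doc.getDocumentElement(), new LinkedHashMap<>(), rows);
        return rows;
    }

    private static void walk(Element e, Map<String, String> inherited,
                             List<Map<String, String>> rows) {
        // Copy the ancestors' columns so that siblings do not see each other's values.
        Map<String, String> current = new LinkedHashMap<>(inherited);
        NamedNodeMap attrs = e.getAttributes();
        if (attrs.getLength() > 0) {
            // Assumption: the first attribute identifies the element (text="...", id="...").
            current.put(e.getTagName(), attrs.item(0).getNodeValue());
        }
        boolean hasChildElement = false;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                walk((Element) c, current, rows);
            }
        }
        if (!hasChildElement) {
            // Leaf element: its trimmed text is the last column and completes a row.
            current.put(e.getTagName(), e.getTextContent().trim());
            rows.add(current);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<title text=\"title1\">"
                + "<comment id=\"comment1\"><data> this is part of comment one</data></comment>"
                + "<comment id=\"comment2\"><data> this is part of comment two</data></comment>"
                + "</title>";
        for (Map<String, String> row : flatten(xml)) {
            System.out.println(row);
        }
    }
}
```

Each returned map has the shape of the ResultSet rows described in the question (title, comment, data). Driving the columns from the XPath schema instead of from tag names would need an extra mapping step that this sketch leaves out.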
Hope this helps! Let me know if I misinterpreted your question!

Related

How can I select several XML elements using XPath?

Assuming the following XML :
<response>
<header>
<resultCode>0000</resultCode>
<resultMsg>OK</resultMsg>
</header>
<body>
<items>
<item>
<addr1>America</addr1>
<addr2>(Atlanta)</addr2>
</item>
<item>
<addr1>Canada</addr1>
<addr2>(Toronto)</addr2>
</item>
<item>
<addr1>France</addr1>
<addr2>(Paris)</addr2>
</item>
</items>
</body>
</response>
I want to select several XML elements using XPath.
So I wrote the Java code below.
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(urlBuilder.toString());
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList items = (NodeList) xpath.evaluate("//item", doc, XPathConstants.NODESET);
NodeList addrAll = (NodeList) xpath.evaluate("//item/addr1 | //item/addr2", doc, XPathConstants.NODESET);
System.out.println("length:" + addrAll.getLength());
for (int tmp = 0; tmp < addrAll.getLength(); tmp++) {
    System.out.println(addrAll.item(tmp).getTextContent());
}
The result is:
length:6
America
(Atlanta)
Canada
(Toronto)
France
(Paris)
But, this is not what I wanted.
My expected output:
length:3
America (Atlanta)
Canada (Toronto)
France (Paris)
I hope you understand my question.
How can I edit my code to do that?
That's not how xpath works; it retrieves the information it designates, but won't concatenate several data points.
To do that, you'll have to either use XSLT, or create two XPath expressions, one for each of the addrX parts, and then have the Java client code combine them.
How you need to update your Java code depends on several things, like if each item will always contain both an addr1 and addr2, for example.
If you can rely on that, you can do this:
// Each item contributes two nodes (addr1 and addr2), so the item count is half the node count.
System.out.println("length:" + addrAll.getLength() / 2);
for (int tmp = 0; tmp < addrAll.getLength(); tmp += 2) {
    String country = addrAll.item(tmp).getTextContent();
    String city = addrAll.item(tmp + 1).getTextContent();
    System.out.printf("%s %s\n", country, city);
}
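If you cannot rely on every item having both children, a variant that stays within XPath 1.0 is to select the item nodes first and then evaluate a relative string expression against each one. The sketch below assumes the XML is available as a String; the class and method names (ItemAddresses, addresses) are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class ItemAddresses {

    /** Returns one "addr1 addr2" string per item element. */
    static List<String> addresses(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Step 1: one node per item, so the length is the number of items.
        NodeList items = (NodeList) xpath.evaluate("//item", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            // Step 2: a relative expression, evaluated with the item as context node.
            // concat() yields a single string; normalize-space() trims the
            // separator if one of the two children is missing.
            String line = xpath.evaluate(
                    "normalize-space(concat(addr1, ' ', addr2))", items.item(i));
            result.add(line);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items>"
                + "<item><addr1>America</addr1><addr2>(Atlanta)</addr2></item>"
                + "<item><addr1>Canada</addr1><addr2>(Toronto)</addr2></item>"
                + "</items>";
        List<String> lines = addresses(xml);
        System.out.println("length:" + lines.size());
        lines.forEach(System.out::println);
    }
}
```

This works because a string-valued expression such as concat() is fine in XPath 1.0 as long as it produces one string per evaluation; the loop over the items supplies the "sequence" that XPath 1.0 itself cannot return.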
XPath 1.0 has a limited set of data types available: string, boolean, number, and node-set. Your desired answer is a sequence of three strings, which don't correspond to existing nodes, and there's no such thing in XPath 1.0 as a sequence of three strings.
If you're in the Java world, there's really no reason to restrict yourself to XPath 1.0. XPath 2.0 extends the type system to allow a sequence of strings, so you can get your answer with an expression such as //item/concat(addr1, ' ', addr2) or //item/string-join(*, ' ').
XPath 2.0 has been around for more than ten years - time to move forward! You might also consider using a more modern tree model than DOM: JDOM2 and XOM are vastly easier to use.
List<WebElement> items = wd.findElements(By.xpath("//items/item"));
System.out.println("length: " + items.size());
items.forEach(item -> System.out.println(item.getText()));
Output:
length: 3
America (Atlanta)
Canada (Toronto)
France (Paris)
You could also collect the results into a List or a Map.

Why is DOM doing this? (Wrong nodeName XML)

I have this XML (just a little part.. the complete xml is big)
<Root>
    <Products>
        <Product ID="307488">
            <ClassificationReference ClassificationID="AR" Type="AgencyLink"/>
            <ClassificationReference ClassificationID="AM" Type="AgencyLink">
                <MetaData>
                    <Value AttributeID="tipoDeCompra" ID="C">Compra Centralizada</Value>
                </MetaData>
            </ClassificationReference>
        </Product>
    </Products>
</Root>
Well... I want to get the data from the line
<Value AttributeID="tipoDeCompra" ID="C">Compra Centralizada</Value>
I'm using DOM, and when I call nodoValue.getTextContent() I get "Compra Centralizada", which is fine...
But when I call nodoValue.getNodeName() I get "MetaData", whereas I was expecting "Value".
What is the explanation for this behaviour?
Thanks!
Your nodoValue variable most likely points to the MetaData node, so the returned name is correct.
Note that for an element node Node.getTextContent() returns the concatenation of the text content of all child nodes. Therefore in your example the text content of the MetaData element is equal to the text content of the Value element, namely Compra Centralizada.
I guess you are getting the Node object using getElementsByTagName("MetaData"). In this case nodoValue.getTextContent() will return the text content correctly, but to get the node name you need to get the child node.
Your current node must be MetaData, and getTextContent() gives all the text within its opening and closing tags. This is why you are getting
Compra Centralizada
as the value. To reach the Value tag, get the child element via getChildNodes(), skipping any whitespace-only text nodes that come before it.
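A small sketch of the difference, using only the JDK's DOM API. The class and method names (NodeNameDemo, describe) are made up for the example, and the XML is reduced to the MetaData fragment from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class NodeNameDemo {

    /** Returns {parent name, parent text, child name, child text}. */
    static String[] describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // The parent element: its text content is the concatenation of all descendant text.
        Element metaData = (Element) doc.getElementsByTagName("MetaData").item(0);

        // The child element actually wanted; searching by tag name below the parent
        // avoids having to skip whitespace text nodes in getChildNodes().
        Element value = (Element) metaData.getElementsByTagName("Value").item(0);

        return new String[] {
            metaData.getNodeName(), metaData.getTextContent().trim(),
            value.getNodeName(), value.getTextContent()
        };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<MetaData>\n"
                + "  <Value AttributeID=\"tipoDeCompra\" ID=\"C\">Compra Centralizada</Value>\n"
                + "</MetaData>";
        for (String s : describe(xml)) {
            System.out.println(s);
        }
    }
}
```

Both elements report the same text (apart from surrounding whitespace on the parent), which is why getTextContent() alone cannot tell you which node you are holding.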

XStream doesn't show CData tags

When I read an XML with XStream, it doesn't show tag <![CDATA[ and ]]>.
I'd like XStream to show it.
For example:
This is a part of "test.xml"
<![CDATA[<b>]]>
If I show it in a browser, the browser shows it correctly:
<![CDATA[ <b> ]]>
But when I read and show XML with XStream I see only:
<b>
If I'm not mistaken, each element should have a name and a value (if they're being read in as XppDom objects). I'm guessing what you're looking at is the value. With the CDATA section it might be a little different, because it is unparsed data, so the name may be "!CDATA" or it may not have one at all. In the normal case, if you have <node attr1='val1'> text </node>, then once it is read in, calling .getName() will return "node", .getValue() will return the text, and .getAttribute("attr1") will return "val1".
If you wanted to print everything with their tags you could make a method String formatXppDom(XppDom elem) to format a printable string with the tags.

How to check for opening and closing tags in xml file using java?

I have an XML file like the following:
<file>
<students>
<student>
<name>Arthur</name>
<height>168</height>
</student>
<student>
<name>John</name>
<height>176</height>
</student>
</students>
</file>
How do I check whether for each opening tag, there is an ending tag? For example, if I do not provide the ending tag as:
<file>
    <students>
        <student>
            <name>Arthur</name>
            <height>168</height>
        // Ending tag for student missing here
        <student>
            <name>John</name>
            <height>176</height>
        </student>
    </students>
</file>
How do I continue parsing the rest of the file?
I tried with a SAX parser as explained here, but it's not very suitable for me, as it throws an exception when a closing tag is missing, as in the second XML sample above.
An XML file that does not satisfy your condition "for each opening tag, there is an ending tag" is not well formed.
Checking that an XML file is well formed is the first job of an XML parser. Hence, you need an XML parser.
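As a rough illustration of letting the parser do that check, the sketch below parses a String with the JDK's DocumentBuilder and reports the position of the first fatal error instead of letting the exception propagate. The class and method names (WellFormedCheck, firstError) are made up for the example. Note that this only reports the first problem: a conforming parser stops normal processing at a well-formedness error, so it does not help with continuing past it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
class WellFormedCheck {

    /** Returns null if the text is well-formed XML, otherwise a description of the first error. */
    static String firstError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr first.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstError("<file><student><name>Arthur</name></student></file>"));
        System.out.println(firstError("<file><student><name>Arthur</name></file>"));
    }
}
```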
The tutorial you found has a bug in it: characters() may be called multiple times for the same element (source). The proper way to mark the end of an element is to reset the respective boolean states inside endElement(). The comments section has code that shows the required change.
With that issue fixed, you can do error checking in startElement() to ensure that the file is not trying to start an invalid element given the current state. This will also allow you to ensure that a name element is only found inside of a student element.
You can implement the following algorithm (pseudo-code):
String xml = ...
stack = new Stack()
while True:
    tag = extractNextTag(xml)
    // no new tag is found
    if tag == null:
        break
    if (tag.isOpening()):
        stack.push(tag.name)
    else:
        // a closing tag with nothing open is also an error
        if stack.isEmpty():
            error("Open/close tag error")
        oldTagName = stack.pop()
        if (oldTagName != tag.name):
            error("Open/close tag error")
if ! stack.isEmpty():
    error("Open/close tag error")
You can implement the extractNextTag function in 10-20 lines of code, using some knowledge about parsers or just a simple regular expression.
Of course, when you search for the next tag you need to start searching from the character that follows the last tag you found.
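Here is one way the stack algorithm could look in Java, as a sketch rather than a real parser. It uses a regular expression in place of extractNextTag and, unlike the pseudo-code, tries to recover from a missing end tag so that the rest of the file is still checked. The class name TagBalanceChecker, the error message wording, and the recovery rule (pop unclosed inner elements when an outer end tag arrives) are assumptions of this example; comments, CDATA sections and processing instructions are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class TagBalanceChecker {

    // Matches <name ...>, </name> and <name ... />.
    // Group 1: "/" for an end tag, group 2: tag name, group 3: "/" for a self-closing tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.\\-]*)[^<>]*?(/?)>");

    /** Returns a list of problems; the list is empty if every tag is balanced. */
    static List<String> check(String xml) {
        List<String> errors = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        // find() resumes after the previous match, i.e. after the last tag found.
        while (m.find()) {
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue; // opens and closes at once, nothing to track
            }
            if (!closing) {
                stack.push(name);
            } else if (!stack.isEmpty() && stack.peek().equals(name)) {
                stack.pop(); // the normal, balanced case
            } else if (stack.contains(name)) {
                // Recovery: everything opened after the matching start tag was never closed.
                while (!stack.peek().equals(name)) {
                    errors.add("Missing </" + stack.pop() + ">");
                }
                stack.pop();
            } else {
                errors.add("Unexpected </" + name + ">");
            }
        }
        // Anything still open at the end of the input was never closed.
        while (!stack.isEmpty()) {
            errors.add("Missing </" + stack.pop() + ">");
        }
        return errors;
    }

    public static void main(String[] args) {
        String broken = "<file><students>"
                + "<student><name>Arthur</name><height>168</height>"
                + "<student><name>John</name><height>176</height></student>"
                + "</students></file>";
        check(broken).forEach(System.out::println);
    }
}
```

Because the missing end tag is only noticed when an outer element closes, the checker can say that a student element was left open but not which of the two it was; recording the match position of each start tag would be a natural extension.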

java xpath parsing

Is there a way to retrieve from a XML file all the nodes that are not empty using XPath? The XML looks like this:
<workspace>
<light>
<activeFlag>true</activeFlag>
<ambientLight>0.0:0.0:0.0:0.0</ambientLight>
<diffuseLight>1.0:1.0;1.0:1.0</diffuseLight>
<specularLight>2.0:2.0:2.0:2.0</specularLight>
<position>0.1:0.1:0.1:0.1</position>
<spotDirection>0.2:0.2:0.2:0.2</spotDirection>
<spotExponent>1.0</spotExponent>
<spotCutoff>2.0</spotCutoff>
<constantAttenuation>3.0</constantAttenuation>
<linearAtenuation>4.0</linearAtenuation>
<quadricAttenuation>5.0</quadricAttenuation>
</light>
<camera>
<activeFlag>true</activeFlag>
<position>2:2:2</position>
<normal>1:1:1</normal>
<direction>0:0:0</direction>
</camera>
<object>
<material>lemn</material>
<Lu>1</Lu>
<Lv>2</Lv>
<unit>metric</unit>
<tip>tip</tip>
<origin>1:1:1</origin>
<normal>2:2:2</normal>
<parent>
<object>null</object>
</parent>
<leafs>
<object>null</object>
</leafs>
</object>
After each tag the parser "sees" another empty node that I don't need.
I guess what you want is all element nodes that have an immediate text node child that does not consist solely of white space:
//*[string-length(normalize-space(text())) > 0]
If you're using XSLT, use <xsl:strip-space elements="*"/>. If you're not, it depends what technology you are using (you haven't told us), eg. DOM, JDOM, etc.
You want:
//*[normalize-space()]
The expression:
//*[string-length(normalize-space(text())) > 0]
is a wrong answer. It selects all elements in the document whose first text node child's text isn't whitespace-only.
Therefore, this wouldn't select:
<p><b>Hello </b><i>World!</i></p>
although this paragraph contains quite a lot of text...
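To see the difference between the two expressions from Java, a short sketch with the JDK's XPath API follows. The class and method names (NonEmptyElements, select) are made up for the example; the input is the paragraph shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class NonEmptyElements {

    /** Returns the names of all nodes selected by the given XPath expression. */
    static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p><b>Hello </b><i>World!</i></p>";
        // Tests the whole string value of each element, so p is included.
        System.out.println(select(xml, "//*[normalize-space()]"));
        // Tests only the first text node child, so p (which has none) is left out.
        System.out.println(select(xml, "//*[string-length(normalize-space(text())) > 0]"));
    }
}
```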
