Serializing Java DOM Document to XML: Add CData Elements - java

I am constructing an XML DOM Document with a SAX parser. I have written handlers for the startCDATA and endCDATA events, and in the endCDATA handler I construct a new CDATA section like this:
public void onEndCData() {
xmlStructure.cData = false;
Document document = xmlStructure.xmlResult.document;
Element element = (Element) xmlStructure.xmlResult.stack.peek();
CDATASection section = document.createCDATASection(xmlStructure.stack.peek().characters);
element.appendChild(section);
}
When I serialize this to an XML file I use the following line to configure the transformer:
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "con:setting");
Nevertheless, no <![CDATA[ markers appear in my XML file; instead all angle brackets are escaped to &gt; and &lt;. This is no problem for other tools, but it is a problem for humans who need to read the file as well. I am positive that "con:setting" is the right tag. So is there maybe a problem with the namespace prefix?
Also this question indicates that it is not possible to omit the CDATA_SECTION_ELEMENTS property and generally serialize all CDATA nodes without escaping the data. Is that information correct, or are there maybe other methods that the author of the answer was not aware of?
Update: It seems I had a mistake in my code. When using document.createCDATASection() and then serializing the document with the Transformer, it DOES output CDATA sections, even without setting the CDATA_SECTION_ELEMENTS property on the transformer.
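The behaviour described in the update can be checked with a small, self-contained sketch. It assumes the JDK's built-in DocumentBuilder and identity Transformer; the class name, the element name "setting" and the sample text are made up for illustration, and other transformer implementations may behave differently.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CDataSerializationSketch {

    // Builds <setting> with a single CDATASection child and serializes it
    // with an identity Transformer. CDATA_SECTION_ELEMENTS is NOT set.
    static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("setting"); // hypothetical element name
        doc.appendChild(root);
        root.appendChild(doc.createCDATASection(text));

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The angle bracket should appear unescaped inside a CDATA section.
        System.out.println(serialize("a < b"));
    }
}
```

If the text had been added with createTextNode() instead, the same transformer would escape the angle bracket, which is the symptom described in the question.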

It looks like you have a namespace-aware DOM. The docs say you need to provide the Qualified Name Representation of the element:
private static String qualifiedNameRepresentation(Element e) {
String ns = e.getNamespaceURI();
String local = e.getLocalName();
return (ns == null) ? local : '{' + ns + '}' + local;
}
So the value of the property will be of the form {http://your.conn.namespace}setting.

In this line
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "con:setting");
try replacing "con:setting" with "{http://con.namespace/}setting"
using the appropriate namespace

Instead of using a no-op Transformer to serialize your DOM tree you could try using the DOM-native "load and save" mechanism, which should preserve the CDATASection nodes from the DOM tree and write them as CDATA sections in the resulting XML.
DOMImplementationLS ls = (DOMImplementationLS)document.getImplementation();
LSOutput output = ls.createLSOutput();
LSSerializer ser = ls.createLSSerializer();
try (FileOutputStream outStream = new FileOutputStream(...)) {
output.setByteStream(outStream);
output.setEncoding("UTF-8");
ser.write(document, output);
}

Related

How to get all XML branches

How can I get all XML branches using Java?
For example, if I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<addresses xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation='test.xsd'>
<address>
<name>Joe Tester</name>
<street>Baker street 5</street>
</address>
<person>
<name>Joe Tester</name>
<age>44</age>
</person>
</addresses>
I want to obtain the following branches:
addresses
addresses_address
addresses_address_name
addresses_address_street
addresses_person
addresses_person_name
addresses_person_age
Thanks.
You can easily get the names of the XML root, its child nodes and their sub-nodes using a template engine such as Velocity or FreeMarker. FreeMarker has powerful facilities for XML processing: you can drop XML documents into the data model, and templates can pull data from them in a variety of ways, such as with XPath expressions. As an XML transformation tool, FreeMarker is an alternative to the much better-known XSLT stylesheet approach promulgated by the World Wide Web Consortium (W3C).
FreeMarker's XPath support relies on an XPath engine; one option is Jaxen, which is a separate download.
FreeMarker will use Xalan, unless you choose Jaxen by calling freemarker.ext.dom.NodeModel.useJaxenXPathSupport() from Java.
You need just one template, which generates the branches for whatever XML is put into the data model at run time. If your XML structure changes, there is no need to change your Java code. If you want to change the output, the change goes into the template file, so the Java code does not need to be recompiled.
Just change the template and the changes take effect on the fly.
FTL file (one template for creating the branch names of multiple XML documents):
<#list doc ['/*' ] as rootNode>
<#assign rootNodeValue="${rootNode?node_name}">
${rootNodeValue}
<#list doc ['/*/*' ] as childNodes>
<#if childNodes?is_node==true>
${rootNodeValue}-${childNodes?node_name}
<#list doc ['/*/${childNodes?node_name}/*' ] as subNodes>
${rootNodeValue}-${childNodes?node_name}-${subNodes?node_name}
</#list>
</#if>
</#list>
</#list>
XMLTest.java, which processes the template:
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import freemarker.ext.dom.NodeModel;
import freemarker.template.Configuration;
import freemarker.template.DefaultObjectWrapper;
import freemarker.template.ObjectWrapper;
import freemarker.template.Template;
import freemarker.template.TemplateException;
public class XMLTest {
public static void main(String[] args) throws SAXException, IOException,
ParserConfigurationException, TemplateException {
Configuration config = new Configuration();
config.setClassForTemplateLoading(XMLTest.class, "");
config.setObjectWrapper(new DefaultObjectWrapper());
config.setObjectWrapper(ObjectWrapper.BEANS_WRAPPER);
Map<String, Object> dataModel = new HashMap<String, Object>();
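// load the XML; xml_path is the classpath location of the input file and must be defined by you
InputStream stream = XMLTest.class.getClassLoader().getResourceAsStream(xml_path);
// if you have the XML as a String, pass a StringReader to the InputSource constructor instead; there is no need to load it from a directory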
InputSource source = new InputSource(stream);
NodeModel xmlNodeModel = NodeModel.parse(source);
dataModel.put("doc", xmlNodeModel);
Template template = config.getTemplate("test.ftl");
StringWriter out = new StringWriter();
template.process(dataModel, out);
System.out.println(out.getBuffer().toString());
}
}
Final output:
addresses
addresses-address
addresses-address-name
addresses-address-street
addresses-person
addresses-person-name
addresses-person-age
See the FreeMarker documentation on the XML node model. FreeMarker and Jaxen can be downloaded from their respective project sites.
There are many ways that you can extract data from XML and use it in Java. The one you choose will depend on how you want to use the data.
Some scenarios are:
1. You might want to manipulate nodes (order, remove and add others) and transform the XML.
2. You might just want to read (and possibly change) the text contained in elements and attributes.
3. You might have a very large file and you just want to find some particular data and ignore the rest of the file.
For scenario #3, the best option is a memory-efficient stream-based parser, such as SAX or an XMLStreamReader from the StAX API.
You can also use that for scenario #2, if you do mostly reading (and not writing), but DOM-based APIs might be easier to work with. You can use the standard DOM org.w3c.dom API or a more Java-like API such as JDOM or DOM4J. If you wish to synchronize XML files with Java objects you also might want to use a full Java-XML mapping framework such as JAXB.
DOM APIs are also great for scenario #1, but in many cases it might be simpler to use XSLT (via the javax.xml.transform TrAX API in Java). If you use DOM you can also use XPath to select the nodes.
I will show you an example on how to extract the individual nodes of your file using the standard DOM API (org.w3c.dom) and also using XPath (javax.xml.xpath).
1. Setup
Initialize the parser:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Parse file into a Document Object Model:
Document source = builder.parse(new File("src/main/resources/addresses.xml"));
2. Selecting nodes with J2SE DOM
You get the root element using getDocumentElement():
Element addresses = source.getDocumentElement();
From there you can get the child nodes using getChildNodes() but that will return all child nodes, which includes text nodes (the whitespace between elements). addresses.getChildNodes().item(0) returns the whitespace after the <addresses> tag and before the <address> tag. To get the element you would have to go for the second item. An easier way to do that is use getElementsByTagName, which returns a node-set and get the first item:
Element addresses_address = (Element)addresses.getElementsByTagName("address").item(0);
Many of the DOM methods return org.w3c.dom.Node objects, which you have to cast. Sometimes they might not be Element objects so you have to check. Node sets are not automatically converted into arrays. They are org.w3c.dom.NodeList so you have to use .item(0) and not [0] (if you use other DOM APIs such as JDOM or DOM4J, it will seem more intuitive).
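The type check and the NodeList iteration mentioned above can be wrapped in a small helper. This is only a sketch: the class and method names are invented for illustration, and the sample XML in main is a shortened version of the file from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildElements {

    // Returns only the Element children of a node, skipping text nodes
    // (such as the whitespace between tags), comments and so on.
    static List<Element> of(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) child);
            }
        }
        return result;
    }

    // Convenience wrapper: names of the root element's child elements.
    static List<String> childElementNames(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        List<String> names = new ArrayList<>();
        for (Element e : of(root)) {
            names.add(e.getTagName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<addresses>\n  <address/>\n  <person/>\n</addresses>";
        System.out.println(childElementNames(xml)); // [address, person]
    }
}
```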
You could use addresses.getElementsByTagName to get all the elements you need, but you would have to deal with the context for the two <name> elements. So a better way is to call it in the appropriate context:
Element addresses_address = (Element)addresses.getElementsByTagName("address").item(0);
Element addresses_address_name = (Element)addresses_address.getElementsByTagName("name").item(0);
Element addresses_address_street = (Element)addresses_address.getElementsByTagName("street").item(0);
Element addresses_person = (Element)addresses.getElementsByTagName("person").item(0);
Element addresses_person_name = (Element)addresses_person.getElementsByTagName("name").item(0);
Element addresses_person_age = (Element)addresses_person.getElementsByTagName("age").item(0);
That will give you all the Element nodes (or branches as you called them) for your file. If you want the text nodes (as actual Node objects) you need to get it as the first child:
Node textNode = addresses_address_street.getFirstChild();
And if you want the String contents you can use:
String street = addresses_address_street.getTextContent();
3. Selecting nodes with XPath
Another way to select nodes is using XPath. You will need the DOM source and you also need to initialize the XPath processor:
XPath xPath = XPathFactory.newInstance().newXPath();
You can extract the root node like this:
Element addresses = (Element)xPath.evaluate("/addresses", source, XPathConstants.NODE);
And all the other nodes using a path-like syntax:
Element addresses_address = (Element)xPath.evaluate("/addresses/address", source, XPathConstants.NODE);
Element addresses_address_name = (Element)xPath.evaluate("/addresses/address/name", source, XPathConstants.NODE);
Element addresses_address_street = (Element)xPath.evaluate("/addresses/address/street", source, XPathConstants.NODE);
You can also use relative paths, choosing a different element as the root:
Element addresses_person = (Element)xPath.evaluate("person", addresses, XPathConstants.NODE);
Element addresses_person_name = (Element)xPath.evaluate("person/name", addresses, XPathConstants.NODE);
Element addresses_person_age = (Element)xPath.evaluate("age", addresses_person, XPathConstants.NODE);
You can get the text contents as before, since you have Element objects:
String addressName = addresses_address_name.getTextContent();
But you can also do it directly using the same methods above without the last argument (which defaults to string). Here I'm using different relative and absolute XPath expressions:
String addressName = xPath.evaluate("name", addresses_address);
String addressStreet = xPath.evaluate("address/street", addresses);
String personName = xPath.evaluate("name", addresses_person);
String personAge = xPath.evaluate("/addresses/person/age", source);
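The lookups above name each element by hand. If the goal is the list of branch names from the question for an arbitrary document, a recursive walk over the DOM is one possible approach. The sketch below assumes the underscore-joined naming from the question; the class and method names are invented, and repeated sibling elements would simply produce repeated entries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BranchLister {

    // Parses the XML and returns one entry per element, in document order,
    // with ancestor names joined by '_' (e.g. addresses_address_name).
    static List<String> branches(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        List<String> out = new ArrayList<>();
        collect(root, root.getNodeName(), out);
        return out;
    }

    // Depth-first walk; only element nodes contribute to the path.
    private static void collect(Element element, String path, List<String> out) {
        out.add(path);
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, path + "_" + child.getNodeName(), out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<addresses>"
                + "<address><name>Joe Tester</name><street>Baker street 5</street></address>"
                + "<person><name>Joe Tester</name><age>44</age></person>"
                + "</addresses>";
        for (String branch : branches(xml)) {
            System.out.println(branch);
        }
    }
}
```

For the sample document this prints the seven branches listed in the question, from addresses down to addresses_person_age.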

dom4j XML declaration in document

I need to remove the XML declaration from a dom4j Document.
I am creating the document with
doc = (Document) DocumentHelper.parseText(someXMLstringWithoutXMLDeclaration);
The string parsed into doc by DocumentHelper contains no XML declaration (it comes from an XML => XSL => XML transformation).
I think DocumentHelper is adding the declaration to the document body.
Is there any way to remove the XML declaration from the body of doc?
The simplest solution, which I chose, is:
doc.getRootElement().asXML();
I'm not sure where exactly the declaration is a problem in your code.
I had this once when I wanted to write an XML file without a declaration (using dom4j).
So if this is your use case, "omit declaration" is what you're looking for:
http://dom4j.sourceforge.net/dom4j-1.6.1/apidocs/org/dom4j/io/OutputFormat.html
Google says this can be set as a property as well, not sure what it does though.
You need to interact with the root element instead of the document.
For example, using the default, compact OutputFormat mentioned by PhilW:
Document doc = (Document) DocumentHelper.parseText(someXMLstringWithoutXMLDeclaration);
final Writer writer = new StringWriter();
new XMLWriter(writer).write(doc.getRootElement());
String out = writer.toString();

How to parse advanced XML files in Java

I've seen numerous examples about how to read XML files in Java. But they only show simple XML files. For example they show how to extract first and last names from an XML file. However I need to extract data from a collada XML file. Like this:
<library_visual_scenes>
<visual_scene id="ID1">
<node name="SketchUp">
<instance_geometry url="#ID2">
<bind_material>
<technique_common>
<instance_material symbol="Material2" target="#ID3">
<bind_vertex_input semantic="UVSET0" input_semantic="TEXCOORD" input_set="0" />
</instance_material>
</technique_common>
</bind_material>
</instance_geometry>
</node>
</visual_scene>
</library_visual_scenes>
This is only a small part of a collada file. Here I need to extract the id of visual_scene, and then the url of instance_geometry and last the target of instance_material. Of course I need to extract much more, but I don't understand how to use it really and this is a place to start.
I have this code so far:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;
try {
builder = factory.newDocumentBuilder();
}
catch( ParserConfigurationException error ) {
Log.e( "Collada", error.getMessage() ); return;
}
Document document = null;
try {
document = builder.parse( string );
}
catch( IOException error ) {
Log.e( "Collada", error.getMessage() ); return;
}
catch( SAXException error ) {
Log.e( "Collada", error.getMessage() ); return;
}
NodeList library_visual_scenes = document.getElementsByTagName( "library_visual_scenes" );
It seems like most examples on the web are similar to this one: http://www.easywayserver.com/blog/java-how-to-read-xml-file/
I need help figuring out what to do when I want to extract deeper tags or find a good tutorial on reading/parsing XML files.
Really, your parsing per se is already done when you call builder.parse(string). What you need to know now is how to select/query information from the parsed XML document.
I would agree with @khachik regarding how to do that. Elaborating a little (since no one else has posted an answer):
XPath is the most convenient way to extract information, and if your input document is not huge, XPath is fast enough. Here is a good starting tutorial on XPath in Java. XPath is also recommended if you need random access to the XML data (i.e. if you have to go back and forth extracting data from the tree in a different order than it appears in the source document), since SAX is designed for linear access.
Some sample XPath expressions:
extract the id of visual_scene: /*/visual_scene/@id
the url of instance_geometry: /*/visual_scene/node/instance_geometry/@url
the url of instance_geometry for node whose name is SketchUp: /*/visual_scene/node[@name = 'SketchUp']/instance_geometry/@url
the target of instance_material: /*/visual_scene/node/instance_geometry/bind_material/technique_common/instance_material/@target
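Expressions of this kind, with @ selecting an attribute, can be evaluated with the standard javax.xml.xpath API. The following is a minimal sketch against the fragment from the question only; the class name and the SAMPLE constant are made up for illustration, and a complete COLLADA file has more levels above library_visual_scenes, so the leading /* would need adjusting there.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ColladaXPathSketch {

    // The fragment from the question, used here as the whole document.
    static final String SAMPLE =
            "<library_visual_scenes>"
            + "<visual_scene id=\"ID1\">"
            + "<node name=\"SketchUp\">"
            + "<instance_geometry url=\"#ID2\">"
            + "<bind_material><technique_common>"
            + "<instance_material symbol=\"Material2\" target=\"#ID3\"/>"
            + "</technique_common></bind_material>"
            + "</instance_geometry>"
            + "</node>"
            + "</visual_scene>"
            + "</library_visual_scenes>";

    // Evaluates an XPath expression against SAMPLE and returns the string value.
    static String evaluate(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evaluate("/*/visual_scene/@id")); // ID1
        System.out.println(evaluate(
                "/*/visual_scene/node[@name = 'SketchUp']/instance_geometry/@url")); // #ID2
        System.out.println(evaluate(
                "/*/visual_scene/node/instance_geometry/bind_material"
                + "/technique_common/instance_material/@target")); // #ID3
    }
}
```

Note that XPath string comparisons are case-sensitive, so the predicate must use the exact attribute value from the file.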
Since COLLADA models can be really large, you might need to do a SAX-based filter, which will allow you to process the document in stream mode without having to keep it all in memory at once. But if your existing code to parse the XML is already performing well enough, you may not need SAX. SAX is more complicated to use for extracting specific data than XPath.
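For comparison, a streaming version of the same extraction might look like the sketch below. It is an illustration rather than a full SAX filter: the class name and the "key=value" output format are invented, and it relies on the default (non-namespace-aware) SAXParserFactory so that qName carries the plain element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ColladaSaxSketch {

    // Streams through the XML once and records the attributes of interest
    // as they are encountered; nothing else is kept in memory.
    static List<String> extract(String xml) throws Exception {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                switch (qName) {
                    case "visual_scene":
                        found.add("id=" + attrs.getValue("id"));
                        break;
                    case "instance_geometry":
                        found.add("url=" + attrs.getValue("url"));
                        break;
                    case "instance_material":
                        found.add("target=" + attrs.getValue("target"));
                        break;
                    default:
                        break; // ignore every other element
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library_visual_scenes><visual_scene id=\"ID1\">"
                + "<node name=\"SketchUp\"><instance_geometry url=\"#ID2\">"
                + "<instance_material symbol=\"Material2\" target=\"#ID3\"/>"
                + "</instance_geometry></node></visual_scene></library_visual_scenes>";
        System.out.println(extract(xml)); // [id=ID1, url=#ID2, target=#ID3]
    }
}
```

The handler has no notion of where in the tree an element sits; if the same element name can occur in different contexts, you would have to track the open elements yourself, which is the extra complexity mentioned above.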
You are using DOM in your code.
DOM creates a tree structure of the xml file it parsed, and you have to traverse the tree to get the information in various nodes.
In your code all you did is create the tree representation. I.e.
document = builder.parse( string );//document is loaded in memory as tree
Now you should reference the DOM apis to see how to get the information you need.
NodeList library_visual_scenes = document.getElementsByTagName( "library_visual_scenes" );
For instance this method returns a NodeList of all elements with the specified name.
Now you should loop over the NodeList
for (int i = 0; i < library_visual_scenes.getLength(); i++) {
    Element element = (Element) library_visual_scenes.item(i);
    // getFirstChild() may be a whitespace text node, so walk all children
    for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
        if (child.getNodeType() == Node.ELEMENT_NODE) {
            String id = ((Element) child).getAttribute("id");
            System.out.println("id=" + id);
        }
    }
}
DISCLAIMER: This is a sample code. Have not compiled it. It shows you the concept. You should look into DOM apis.
EclipseLink JAXB (MOXy) has a useful @XmlPath extension for leveraging XPath to populate an object. It may be what you are looking for. Note: I am the MOXy tech lead.
The following example maps a simple address object to Google's representation of geocode information:
package blog.geocode;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;
@XmlRootElement(name="kml")
@XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {
@XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
private String street;
@XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
private String city;
@XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
private String state;
@XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
private String country;
@XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
private String postalCode;
}
For the rest of the example see:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
Nowadays, several java RAD tools have java code generators from given DTDs, so you can use them.

How to preserve newlines in CDATA when generating XML?

I want to write some text that contains whitespace characters such as newline and tab into an xml file so I use
Element element = xmldoc.createElement("TestElement");
element.appendChild(xmldoc.createCDATASection(somestring));
but when I read this back in using
Node vs = xmldoc.getElementsByTagName("TestElement").item(0);
String x = vs.getFirstChild().getNodeValue();
I get a string that has no newlines anymore.
When I look directly at the XML on disk, the newlines seem to be preserved, so the problem occurs when reading the XML file back in.
How can I preserve the newlines?
Thanks!
I don't know how you parse and write your document, but here's an enhanced code example based on yours:
// creating the document in-memory
Document xmldoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element element = xmldoc.createElement("TestElement");
xmldoc.appendChild(element);
element.appendChild(xmldoc.createCDATASection("first line\nsecond line\n"));
// serializing the xml to a string
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl =
(DOMImplementationLS)registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
String str = writer.writeToString(xmldoc);
// printing the xml for verification of whitespace in cdata
System.out.println("--- XML ---");
System.out.println(str);
// de-serializing the xml from the string
final Charset charset = Charset.forName("utf-16");
final ByteArrayInputStream input = new ByteArrayInputStream(str.getBytes(charset));
Document xmldoc2 = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input);
Node vs = xmldoc2.getElementsByTagName("TestElement").item(0);
final Node child = vs.getFirstChild();
String x = child.getNodeValue();
// print the value, yay!
System.out.println("--- Node Text ---");
System.out.println(x);
The serialization using LSSerializer is the W3C way to do it (see here). The output is as expected, with line separators:
--- XML ---
<?xml version="1.0" encoding="UTF-16"?>
<TestElement><![CDATA[first line
second line ]]></TestElement>
--- Node Text ---
first line
second line
You need to check the type of each node using node.getNodeType(). If the type is CDATA_SECTION_NODE, you need to add the CDATA delimiters (<![CDATA[ and ]]>) around node.getNodeValue() yourself.
You don't necessarily have to use CDATA to preserve white space characters.
The XML specification specifies how to encode these characters.
So, for example, if you have an element whose value contains a line break, you can encode it with a numeric character reference:
Carriage return: &#13;
And so forth
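A short sketch of how numeric character references survive parsing: a parser normalizes literal line breaks, but characters written as references such as &#13; (carriage return), &#10; (line feed) or &#9; (tab) are passed through as written. The class and method names are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefSketch {

    // Parses the XML and returns the text content of the root element.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The references decode to CR, LF and TAB in the parsed text.
        String text = textOf("<a>one&#13;&#10;two&#9;three</a>");
        System.out.println(text.equals("one\r\ntwo\tthree")); // true
    }
}
```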
EDIT: cut all the irrelevant stuff
I'm curious to know what DOM implementation you're using, because it doesn't mirror the default behaviour of the one in a couple of JVMs I've tried (they ship with a Xerces impl). I'm also interested in what newline characters your document has.
I'm not sure if whether CDATA should preserve whitespace is a given. I suspect that there are many factors involved. Don't DTDs/schemas affect how whitespace is processed?
You could try using the xml:space="preserve" attribute.
xml:space='preserve' is not it. That is only for "all whitespace" nodes. That is, if you want the whitespace nodes in
<this xml:space='preserve'> <has/>
<whitespace/>
</this>
But see that those whitespace nodes are ONLY whitespace.
I have been struggling to get Xerces to generate events allowing isolation of CDATA content as well. I have no solution as yet.

Java+DOM: How do I elegantly rename an xmlns:xyz attribute?

I have something like that as input:
<root xmlns="urn:my:main"
xmlns:a="urn:my:a" xmlns:b="urn:my:b">
...
</root>
And want to have something like that as output:
<MY_main:root xmlns:MY_main="urn:my:main"
xmlns:MY_a="urn:my:a" xmlns:MY_b="urn:my:b">
...
</MY_main:root>
... or the other way round.
How do I achieve this using DOM in an elegant way?
That is, without searching for attribute names starting with "xmlns".
You will not find the xmlns attributes in your DOM, they are not part of the DOM.
You may have some success if you find the nodes you want (getElementsByTagNameNS) and set their qualifiedName (qname) to a new value containing the prefix you like. Then re-generate the XML document.
By the way, the namespace prefix (which is what you are trying to change) is largely irrelevant when using any sane XML parser. The namespace URI is what counts. Why would you want to set the prefix to a specific value?
I have used the following jdom stub to remove all the namespace references:
Element rootElement = new SAXBuilder().build(contents).getRootElement();
for (Iterator i = rootElement.getDescendants(new ElementFilter()); i.hasNext();) {
Element el = (Element) i.next();
if (el.getNamespace() != null) el.setNamespace(null);
}
return rootElement;
Reading and writing the xml is done as normal. If you are just after human readable output that should do the job. If however you need to convert back you may have a problem.
The following may work to replace the namespaces with a more friendly version based on your example (untested):
rootElement.setNamespace(Namespace.getNamespace("MY_main", "urn:my:main"));
rootElement.addNamespaceDeclaration(Namespace.getNamespace("MY_a", "urn:my:a"));
rootElement.addNamespaceDeclaration(Namespace.getNamespace("MY_b", "urn:my:b"));
