How to create a formatted string from an XML node in Java

I'm trying to create a formatted string from an XML Node. See this example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <parent>
        <foo>
            <bar>foo</bar>
        </foo>
    </parent>
</root>
The Node I want to create a formatted string for is "foo". I expected a result like this:
<foo>
    <bar>foo</bar>
</foo>
But the actual result is:
<foo>
            <bar>foo</bar>
        </foo>
My approach looks like this:
public String toXmlString(Node node) throws TransformerException {
final Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
final Writer writer = new StringWriter();
final StreamResult streamResult = new StreamResult(writer);
transformer.transform(new DOMSource(node), streamResult);
return writer.toString();
}
What am I doing wrong?

It is doing exactly what it's supposed to do. indent="yes" allows the transform to add whitespace to indent elements, but not to remove whitespace, since it cannot know which whitespace in the input is significant.
In the input you provide, the <foo> and </foo> element lines have 8 leading blanks, and the <bar> line has 12.
The reason the <foo> opening tag is not indented is that the preceding whitespace actually belongs to the containing <parent> element and is not present in the subtree you passed to the transform.
Whitespace stripping behavior is covered in detail in the standards (XSLT 1, XSLT 2). In summary
A whitespace text node is preserved if either of the following apply:
The element name of the parent of the text node is in the set of whitespace-preserving element names
...
and
(XSLT 2) The set of whitespace-preserving element names is specified by xsl:strip-space and xsl:preserve-space declarations. Whether an element name is included in the set of whitespace-preserving names is determined by the best match among all the xsl:strip-space or xsl:preserve-space declarations: it is included if and only if there is no match or the best match is an xsl:preserve-space element.
stated more simply in the XSLT 1 spec:
Initially, the set of whitespace-preserving element names contains all element names.
Unfortunately, using xsl:strip-space does not produce the results you want. With <xsl:strip-space elements="*"> (and indent="yes") I get the following output:
<foo><bar>foo</bar>
</foo>
Which makes sense. Whitespace is stripped, and then the </foo> tag is made to line up under its opening tag.
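Since indent="yes" only adds whitespace and never removes it, one workaround is to drop the whitespace-only text nodes from the subtree yourself before serializing. The following is a minimal sketch of that idea, not the only way to do it. The class name StripThenIndent and the helper formatFoo are made up for illustration, the indent-amount property is specific to the Xalan-derived transformer bundled with the JDK, and the code modifies the DOM in place, so it assumes none of that whitespace is significant.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StripThenIndent {

    // Recursively removes text nodes that contain nothing but whitespace.
    static void stripWhitespaceNodes(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so that removals do not shift the indices still to visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceNodes(child);
            }
        }
    }

    static String toXmlString(Node node) throws Exception {
        stripWhitespaceNodes(node); // modifies the DOM in place
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        // Xalan-specific property (assumption: the JDK's built-in transformer is in use).
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    // Hypothetical helper: parses the XML and formats its first <foo> element.
    static String formatFoo(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return toXmlString(doc.getElementsByTagName("foo").item(0));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n    <parent>\n        <foo>\n"
                + "            <bar>foo</bar>\n        </foo>\n    </parent>\n</root>";
        System.out.println(formatFoo(xml));
    }
}
```

With the whitespace-only nodes gone, the serializer is free to pick the indentation itself, so the 12 leading blanks in front of <bar> from the input no longer show up.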

This will work better with the third party library JDOM 2, which also makes everything easier about manipulating DOM documents.
Its "pretty format" output will indent as expected, removing existing indentation, as long as the text nodes removed/altered were whitespace-only. When one wants to preserve whitespace, one doesn't ask for indented output.
Will look like this:
public String toXmlString(Element element) {
return new XMLOutputter(Format.getPrettyFormat()).outputString(element);
}

Saxon gives your desired output provided you strip whitespace on input:
public void testIndentation() {
try {
String in = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
+ "<root>\n"
+ " <parent>\n"
+ " <foo>\n"
+ " <bar>foo</bar>\n"
+ " </foo> \n"
+ " </parent>\n"
+ "</root>";
Processor proc = new Processor(false);
DocumentBuilder builder = proc.newDocumentBuilder();
builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL); //XX
XdmNode doc = builder.build(new StreamSource(new StringReader(in)));
StringWriter sw = new StringWriter();
Serializer serializer = proc.newSerializer(sw);
serializer.setOutputProperty(Serializer.Property.METHOD, "xml");
serializer.setOutputProperty(Serializer.Property.INDENT, "yes");
XdmNode foo = doc.axisIterator(Axis.DESCENDANT, new QName("foo")).next();
serializer.serializeNode(foo);
System.err.println(sw);
} catch (SaxonApiException err) {
fail();
}
}
But if you don't strip whitespace (comment out line XX), you get the ragged output shown in your post. The spec, from XSLT 2.0 onwards, allows the processor to be smarter than this, but Saxon doesn't take advantage of this. One reason is that the serialization is entirely streamed: it's looking at each event (start element, end element, etc) in isolation rather than considering the document as a whole.

Based on kumesana's answer, I've found an acceptable solution:
public String toXmlString(Node node) throws TransformerException {
final DOMBuilder builder = new DOMBuilder();
final Element element = (Element) node;
final org.jdom2.Element jdomElement = builder.build(element);
final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
final String output = xmlOutputter.outputString(jdomElement);
return output;
}

Related

&quot; is auto converting to " through Document & Transformer API

I am loading an XML file (pom.xml) through org.w3c.dom.Document and editing some nodes' values (basically changing the version of some dependency), then writing it back through javax.xml.transform.Transformer, javax.xml.transform.TransformerFactory and javax.xml.transform.dom.DOMSource.
The problem is that this also converts every occurrence of &quot; to the " character, which I don't want. See the sample below:
<Export-Package>!${bundle.namespace}.internal.*,${bundle.namespace}.*;version=&quot;${project.version}&quot;</Export-Package>
converted to:
<Export-Package>!${bundle.namespace}.internal.*,${bundle.namespace}.*;version="${project.version}"</Export-Package>
How can I prevent this automatic conversion with the API I am currently using?
Code Sample:
public void writeDocument(File filePath)
{
TransformerFactory transformerFactory = TransformerFactory.newInstance();
this.thisDoc.getDocumentElement().normalize();
Transformer transformer;
try
{
transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(thisDoc);
StreamResult result = new StreamResult(filePath);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
}
catch (TransformerException e)
{
VersionUpdateExceptions.throwException(e, LOG);
}
}
This is the required behavior by the Document Object Model (DOM) Level 3 Load and Save Specification:
Within the character data of a document (outside of markup), any
characters that cannot be represented directly are replaced with
character references. Occurrences of '<' and '&' are replaced by the
predefined entities &lt; and &amp;. The other predefined entities
(&gt;, &apos;, and &quot;) might not be used, except where needed
(e.g. using &gt; in cases such as ']]>').
For example, if you use &quot; inside an attribute:
<Export-Package id="&quot;test&quot;">
&quot; will be preserved. Otherwise, it won't.
If absolutely necessary, you could preserve "&quot;" with an ugly hack:
1) Read the pom.xml as a String and replace occurrences of &quot; with some "marker" string.
2) To parse the document, use a StringReader to create an InputSource.
3) Execute your method, but create the StreamResult with a StringWriter.
4) Get the content from the StringWriter as a String and replace your marker string with &quot;.
5) Save the content to the file.
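The steps of that hack could be sketched roughly as follows. This is only an illustration under assumptions: the class name QuotPreserver and the marker text are invented here, the marker must not occur anywhere in the real pom.xml, and the final "save to file" step is left out so the example stays self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class QuotPreserver {

    // Assumed placeholder: must not appear in the document and must not need escaping.
    static final String MARKER = "__QUOT_MARKER__";

    static String roundTrip(String xml) throws Exception {
        // Hide every &quot; behind the marker before the parser can resolve it.
        String marked = xml.replace("&quot;", MARKER);

        // Parse from a String via StringReader and InputSource.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(marked)));

        // ... edit the DOM here (for example, change a dependency version) ...

        // Serialize into a StringWriter instead of directly into the file.
        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));

        // Put the original entity back in place of the marker.
        return writer.toString().replace(MARKER, "&quot;");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Export-Package>version=&quot;1.0&quot;</Export-Package>";
        System.out.println(roundTrip(xml));
    }
}
```

The returned String is what you would then write to the file. Note that the marker also replaces any &quot; inside attribute values, which is harmless because it is restored at the end.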

xquery transformation creates empty namespace in element

I'm sorry but I guess I just don't see the mistake I'm making here.
I have a camel route which returns an XML and to be able to test the output I wrote a JUnit Test which runs with SpringRunner. There I get the XML Stream from the exchange which I validate against an XSD. This works great because the XSD throws an exception because the output XML is not valid, but I don't understand why the following xquery generates an element with EMPTY NAMESPACE?
See the xquery snippet (I'm sorry again I cannot provide more code):
declare default element namespace "http://www.dppgroup.com/XXXPMS";
let $cmmdoc := $doc/*:cmmdoc
, $partner := $doc/*:cmmdoc/*:information/*:partner_gruppe/*:partner
, $sequence:= fn:substring($cmmdoc/@unifier,3)
return <ClientMMS xmlns:infra="http://www.dppgroup.com/InfraNS">
{
for $x in $partner
where $x[@partnerStatusCode = " "]
return
element {"DataGroup" } {
<Client sequenceNumber="{$sequence}" />
}
}
My problem is, that with this code the resulting XML contains the DataGroup-element with the following namespace definition:
<?xml version="1.0" encoding="UTF-8"?>
<ClientMMS xmlns="http://www.dppgroup.com/XXXPMS"
xmlns:infra="http://www.dppgroup.com/InfraNS">
<DataGroup xmlns="">
<Client sequenceNumber="170908065609671475"/>
</DataGroup>
</ClientMMS>
The snippet from the Unit-Test: I'm using jdk1.8_102
String xml = TestDataReader.readXML("/input/info/info_in.xml", PROJECT_ENCODING);
quelle.sendBody(xml);
boolean valid = false;
try {
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream((byte[]) archiv.getExchanges().get(1).getIn().getBody());
Document document = documentBuilder.parse(byteArrayInputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(document);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
System.out.println(xmlString);
In no XQuery introduction/tutorial/explanation I can find a reason why this happens. Can you guys please explain why the DataGroup element is not in the default namespace?
The XQuery you posted should create the result fine without the namespace undeclaration you show.
If you want to work with namespaced XML in your Java code, make sure you use a namespace-aware DocumentBuilder. The default DocumentBuilderFactory is not namespace aware, so call setNamespaceAware(true) on the factory before creating a DocumentBuilder with it.
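To see the difference the setting makes, here is a small sketch that parses the same document with and without namespace awareness and reports the namespace of an element. The class and method names are invented for the example, and the sample XML is a cut-down version of the output in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceAwareParsing {

    // Returns the namespace URI of the first element with the given tag name.
    static String namespaceOfFirst(String xml, String tagName, boolean namespaceAware)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The factory default is false, so this has to be switched on explicitly.
        factory.setNamespaceAware(namespaceAware);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node node = doc.getElementsByTagName(tagName).item(0);
        return node.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ClientMMS xmlns=\"http://www.dppgroup.com/XXXPMS\">"
                + "<DataGroup/></ClientMMS>";
        System.out.println("aware:     " + namespaceOfFirst(xml, "DataGroup", true));
        System.out.println("not aware: " + namespaceOfFirst(xml, "DataGroup", false));
    }
}
```

With the non-namespace-aware builder the elements carry no namespace information at all, which is why later processing steps can no longer tell that DataGroup belongs to the default namespace.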

How can I escape special characters using DOM

This issue has been bugging me a lot lately and i can't seem to find out a possible solution.
I am dealing with a web-server that receives an XML document to do some processing. The server's parser has issues with &,',",<,>. I know this is bad, i didn't implement the xml parser on that server. But before waiting for a patch i need to circumvent.
Now, before uploading my XML document to this server, i need to parse it and escape the xml special characters. I am currently using DOM. The issue is, if i iterate through the TEXT_NODES and replaces all the special characters with their escaped versions, when I save this document,
for d'ex i get d&amp;apos;ex but i need d&apos;ex
It makes sense, since DOM escapes "&". But obviously this is not what i need.
So if DOM is already capable of escaping "&" to "&amp;", how can i make it escape other characters, like " to &quot;?
If it can't, how can i save the already parsed and escaped texts in it's nodes without it having to re-escape them when saving ?
This is how i escape the special characters i used apache StringEscapeUtils class:
public String xMLTransform() throws Exception
{
String xmlfile = FileUtils.readFileToString(new File(filepath));
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xmlfile.trim().replaceFirst("^([\\W]+)<", "<"))));
NodeList nodeList = doc.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node currentNode = nodeList.item(i);
if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
Node child = currentNode.getFirstChild();
while(child != null) {
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
//Escaping works here. But when saving the final document, the "&" used in escaping gets escaped as well by DOM.
}
child = child.getNextSibling();
}
}
}
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
FileOutputStream fop = null;
File file;
file = File.createTempFile("escapedXML"+UUID.randomUUID(), ".xml");
fop = new FileOutputStream(file);
String xmlString = writer.toString();
byte[] contentInBytes = xmlString.getBytes();
fop.write(contentInBytes);
fop.flush();
fop.close();
return file.getPath();
}
I think the solution you're looking for is a customized XSLT parser that you can configure for your additional HTML escaping.
I'm not able to say for certain how to configure the xslt file to do what you want, but I am fairly confident it can be done. I've stubbed out the basic Java setup below:
@Test
public void testXSLTTransforms () throws Exception {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element el = doc.createElement("Container");
doc.appendChild(el);
Text e = doc.createTextNode("Character");
el.appendChild(e);
//e.setNodeValue("\'");
//e.setNodeValue("\"");
e.setNodeValue("&");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(System.out);
//This prints the original document to the command line.
transformer.transform(source, result);
InputStream xsltStream = getClass().getResourceAsStream("/characterswap.xslt");
Source xslt = new StreamSource(xsltStream);
transformer = transformerFactory.newTransformer(xslt);
//This one is the one you'd pipe to a file
transformer.transform(source, result);
}
And I've got a simple XSLT I used for proof of concept that shows the default character encoding you mentioned:
characterswap.xslt
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:text>
Original VALUE : </xsl:text>
<xsl:copy-of select="."/>
<xsl:text>
OUTPUT ESCAPING DISABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="yes"/>
<xsl:text>
OUTPUT ESCAPING ENABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="no"/>
</xsl:template>
</xsl:stylesheet>
And the console out is pretty basic:
<?xml version="1.0" encoding="UTF-8"?>
<Container>&amp;</Container>
Original VALUE : <Container>&amp;</Container>
OUTPUT ESCAPING DISABLED : &
OUTPUT ESCAPING ENABLED : &amp;
You can take the active node from the XSLT execution and perform specific character replacements. There are multiple examples I was able to find, but I'm having difficulty getting them working in my context.
XSLT string replace
is a good place to start.
This is about the extent of my knowledge with XSLT, I hope it helps you solve your issue.
Best of luck.
I was considering this further, and the solution may not only be XSLT. From your description, I have the impression that rather than xml10 encoding, you're kind of looking for a full set of html encoding.
Along those lines, if we take your current node text transformation:
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
}
And explicitly expect that we want the HTML Encoding:
if (child.getNodeType() == Node.TEXT_NODE) {
//Capture the current node value
String nodeValue = child.getNodeValue();
//Decode for XML10 to remove existing escapes
String decodedNode = StringEscapeUtils.unescapeXml10(nodeValue);
//Then Re-encode for HTML (3/4/5)
String fullyEncodedHTML = StringEscapeUtils.escapeHtml3(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml4(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml5(decodedNode);
//Then place the fully-encoded HTML back to the node
child.setNodeValue(fullyEncodedHTML);
}
I would think that the xml would now be fully encoded with all of the
HTML escapes you were wanting.
Now combine this with the XSLT for output escaping (from above), and the document will not undergo any further transformations when written out to the file.
I like this solution because it limits the logic held in the XSLT file. Rather than managing the entire String find/replace, you would just need to ensure that you copy your entire node and copy the text() with output escaping disabled.
In theory, that seems like it would fulfill my understanding of your objective.
Caveat again is that I'm weak with XSLT, so the example xslt file may
still need some tweaking. This solution reduces that unknown work
quantity, in my opinion.
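For what it's worth, the "copy everything, but write text() with output escaping disabled" idea could look roughly like the sketch below. Treat it as an assumption-laden illustration: disable-output-escaping is an optional XSLT feature (the JDK's built-in processor is assumed here), the class name and the hand-written escape helper are invented so the example needs no third-party library, and the resulting file is only useful for a consumer that actually wants this non-standard escaping.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PreEscapedTextWriter {

    // Identity transform, except that text nodes are written without further escaping.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"text()\" priority=\"1\">"
      + "<xsl:value-of select=\".\" disable-output-escaping=\"yes\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical stand-in for StringEscapeUtils.escapeXml10; '&' must be handled first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }

    static String write(String rawText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element el = doc.createElement("Container");
        doc.appendChild(el);
        // The node value now literally contains the escaped form, e.g. d&apos;ex.
        el.appendChild(doc.createTextNode(escape(rawText)));

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(write("d'ex"));
    }
}
```

The explicit priority on the text() template is there so that it wins over the generic node() rule regardless of template order.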
I've seen people use regex to do something similar
Copied from (Replace special character with an escape preceded special character in Java)
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
That whacky regex is a "look ahead" - a non capturing assertion that the following char match something - in this case a character class.
Notice how you don't need to escape chars in a character class, except a ] (even the minus don't need escaping if first or last).
The \\\\ is how you code a regex literal \ (escape once for java, once for regex)
Here's a test of this working:
public static void main(String[] args) {
String search = "code:xy";
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
System.out.println(newSearch);
}
Output:
code\:xy
this is very closely related to this question (how to Download a XML file from a URL by Escaping Special Characters like < > $amp; etc?).
This post has a similar case where the code downloads XML's with parsed / escaped content.
As I understand it, you read the file, parse it and escape the characters. During saving, the XML gets "escaped" again. While you can use the DOM for checking well-formedness or validating against a schema, file-based operations can help you escape XML and HTML special characters. The code sample in the post refers to usage of IOUtils and StringUtils to do it. Hope this helps!
I would use StringEscapeUtils.escapeXml10()... details here. https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#ESCAPE_XML10

Work with raw text in javax.xml.transform.Transformer

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This &mdash; That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that escaped CR/LF character references were being converted into a literal CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "&mdash;" is outputting "&amp;mdash;".
I cannot use the solution to post-convert all "&amp;" -> "&" because not all nodes represent html.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);
DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &amp;mdash; That</Test>"
The fact is, once you have a DOM tree, there's no longer a string with &mdash;: it's instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions including Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat .
To parse a single node, there is LSParser.parseWithContext.
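To illustrate the point about the DOM holding plain Unicode, here is a small sketch that parses a character reference and looks at the resulting text. The class name is invented, and the numeric reference &#8212; is used instead of &mdash; because &mdash; is not predefined in XML and would need a DTD to be parsed at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityInDom {

    // Parses the XML and returns the text content of the document element.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = textOf("<Test>This &#8212; That</Test>");
        // The reference has been resolved by the parser: the text holds the single
        // character U+2014, and no '&' is left in the string.
        System.out.println(text.equals("This \u2014 That"));
        System.out.println(text.indexOf('&'));
    }
}
```

Going the other way, createTextNode("This &mdash; That") stores a literal ampersand followed by the letters "mdash;", which is why the serializer has to escape that ampersand on output.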

How to unformat xml file

I have a method which returns a String with formatted XML. The method reads the XML from a file on the server into the string.
Essentially, what the method currently does is:
private ServletConfig config;
InputStream xmlIn = null ;
xmlIn = config.getServletContext().getResourceAsStream(filename + ".xml") ;
String xml = IOUtils.toString(xmlIn);
IOUtils.closeQuietly(xmlIn);
return xml;
What I need to do is add a new input argument, and based on that value, continue returning the formatted xml, or return unformatted xml.
What I mean with formatted xml is something like:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
And what I mean with unformatted xml is something like:
<xml><root><elements><elem1/><elem2/></elements></root></xml>
or:
<xml>
<root>
<elements>
<elem1/>
<elem2/>
</elements>
</root>
</xml>
Is there a simple way to do this?
Strip all newline characters with String xml = IOUtils.toString(xmlIn).replace("\n", ""). Or replace "\t" instead, to keep the separate lines but without indentation (this assumes the file is indented with tabs).
If you are sure that the formatted XML looks like this:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
you can replace group 1 of every match of ^(\s*)< with "". That way, the text content of the XML won't be changed.
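In Java, that regex idea might be sketched as follows. Two details here are my own choices rather than part of the answer: the (?m) flag, so that ^ matches at the start of every line, and [ \t]+ instead of \s*, so that the line breaks themselves are kept. The class and method names are invented for the example.

```java
class Unindent {

    // Removes spaces and tabs between the start of a line and the next '<'.
    static String stripLeadingWhitespace(String xml) {
        // (?m) turns on multiline mode: ^ matches after every line terminator.
        return xml.replaceAll("(?m)^[ \\t]+<", "<");
    }

    public static void main(String[] args) {
        String formatted = "<xml>\n    <root>\n        <elem1/>\n    </root>\n</xml>";
        System.out.println(stripLeadingWhitespace(formatted));
    }
}
```

Because only whitespace directly in front of a tag at the start of a line is touched, text content such as <a>some  text</a> is left alone. Multi-line text content whose continuation lines happen to start with a tag would still be affected.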
An identity transformer, with the indent output property set from a parameter, like so:
public static String getStringFromDocument(Document dom, boolean indented) {
String signedContent = null;
try {
StringWriter sw = new StringWriter();
DOMSource domSource = new DOMSource(dom);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
trans.transform(domSource, new StreamResult(sw));
sw.flush();
signedContent = sw.toString();
} catch (TransformerException e) {
e.printStackTrace();
}
return signedContent;
}
works for me.
the key lies in this line
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
Try something like the following:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new StreamSource(new StringReader(
"<xsl:stylesheet version=\"1.0\"" +
" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
"<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
" <xsl:strip-space elements=\"*\"/>" +
" <xsl:template match=\"@*|node()\">" +
" <xsl:copy>" +
" <xsl:apply-templates select=\"@*|node()\"/>" +
" </xsl:copy>" +
" </xsl:template>" +
"</xsl:stylesheet>"
))
);
Source source = new StreamSource(new StringReader("xml string here"));
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
Instead of source being StreamSource in the second instance, it can also be DOMSource if you have an in-memory Document, if you want to modify the DOM before saving.
DOMSource source = new DOMSource(document);
To read an XML file into a Document object:
File file = new File("c:\\MyXMLFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
Enjoy :)
If you fancy trying your hand with JAXB then the marshaller has a handy property for setting whether to format (use new lines and indent) the output or not.
JAXBContext jc = JAXBContext.newInstance(packageName);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
m.marshal(element, outputStream);
Quite an overhead to get to that stage though... perhaps a good option if you already have a solid xsd
You can:
1) Remove all runs of consecutive whitespace (but not single whitespace characters), then replace every >(whitespace)< with ><.
This is applicable only if the useful content does not contain multiple consecutive significant whitespace characters.
2) read it in some dom tree and serialize it using some nonpretty serialization
SAXReader reader = new SAXReader();
Reader r = new StringReader(data);
Document document = reader.read(r);
OutputFormat format = OutputFormat.createCompactFormat();
StringWriter sw = new StringWriter();
XMLWriter writer = new XMLWriter(sw, format);
writer.write(document);
String string = sw.toString();
3) use Canonicalization (but you must somehow explain to it that those whitespaces you want to remove are insignificant)
Kotlin.
An indentation will usually come after new line and formatted as one space or more. Hence, to make everything in the same column, we will replace all of the new lines, following one or more spaces:
xmlTag = xmlTag.replace("(\n +)".toRegex(), " ")
