How can i escape special characters with using DOM - java

This issue has been bugging me a lot lately and i can't seem to find out a possible solution.
I am dealing with a web-server that receives an XML document to do some processing. The server's parser has issues with &,',",<,>. I know this is bad, i didn't implement the xml parser on that server. But before waiting for a patch i need to circumvent.
Now, before uploading my XML document to this server, i need to parse it and escape the xml special characters. I am currently using DOM. The issue is, if i iterate through the TEXT_NODES and replaces all the special characters with their escaped versions, when I save this document,
for d'ex i get d&apos;ex but i need d&apos;ex
It makes sense since, DOM escapes "&". But obviously this is not what i need.
So if DOM is already capable of escaping "&" to "&" how can i make it escape other characters like " to " ?
If it can't, how can i save the already parsed and escaped texts in it's nodes without it having to re-escape them when saving ?
This is how i escape the special characters i used apache StringEscapeUtils class:
public String xMLTransform() throws Exception
{
String xmlfile = FileUtils.readFileToString(new File(filepath));
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xmlfile.trim().replaceFirst("^([\\W]+)<", "<"))));
NodeList nodeList = doc.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node currentNode = nodeList.item(i);
if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
Node child = currentNode.getFirstChild();
while(child != null) {
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
//Escaping works here. But when saving the final document, the "&" used in escaping gets escaped as well by DOM.
}
child = child.getNextSibling();
}
}
}
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
FileOutputStream fop = null;
File file;
file = File.createTempFile("escapedXML"+UUID.randomUUID(), ".xml");
fop = new FileOutputStream(file);
String xmlString = writer.toString();
byte[] contentInBytes = xmlString.getBytes();
fop.write(contentInBytes);
fop.flush();
fop.close();
return file.getPath();
}

I think the solution you're looking for is a customized XSLT parser that you can configure for your additional HTML escaping.
I'm not able to say for certain how to configure the xslt file to do what you want, but I am fairly confident it can be done. I've stubbed out the basic Java setup below:
#Test
public void testXSLTTransforms () throws Exception {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element el = doc.createElement("Container");
doc.appendChild(el);
Text e = doc.createTextNode("Character");
el.appendChild(e);
//e.setNodeValue("\'");
//e.setNodeValue("\"");
e.setNodeValue("&");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(System.out);
//This prints the original document to the command line.
transformer.transform(source, result);
InputStream xsltStream = getClass().getResourceAsStream("/characterswap.xslt");
Source xslt = new StreamSource(xsltStream);
transformer = transformerFactory.newTransformer(xslt);
//This one is the one you'd pipe to a file
transformer.transform(source, result);
}
And I've got a simple XSLT I used for proof of concept that shows the default character encoding you mentioned:
characterswap.xslt
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|#*">
<xsl:text>
Original VALUE : </xsl:text>
<xsl:copy-of select="."/>
<xsl:text>
OUTPUT ESCAPING DISABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="yes"/>
<xsl:text>
OUTPUT ESCAPING ENABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="no"/>
</xsl:template>
</xsl:stylesheet>
And the console out is pretty basic:
<?xml version="1.0" encoding="UTF-8"?>
<Container>&</Container>
Original VALUE : <Container>&</Container>
OUTPUT ESCAPING DISABLED : &
OUTPUT ESCAPING ENABLED : &
You can take the active node from the XSLT execution and perform specific character replacments. There are multiple examples I was able to find, but I'm having difficulty getting them working in my context.
XSLT string replace
is a good place to start.
This is about the extent of my knowledge with XSLT, I hope it helps you solve your issue.
Best of luck.
I was considering this further, and the solution may not only be XSLT. From your description, I have the impression that rather than xml10 encoding, you're kind of looking for a full set of html encoding.
Along those lines, if we take your current node text transformation:
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
}
And explicitly expect that we want the HTML Encoding:
if (child.getNodeType() == Node.TEXT_NODE) {
//Capture the current node value
String nodeValue = child.getNodeValue();
//Decode for XML10 to remove existing escapes
String decodedNode = StringEscapeUtils.unescapeXml10(nodeValue);
//Then Re-encode for HTML (3/4/5)
String fullyEncodedHTML = StringEscapeUtils.escapeHtml3(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml4(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml5(decodedNode);
//Then place the fully-encoded HTML back to the node
child.setNodeValue(fullyEncodedHTML);
}
I would think that the xml would now be fully encoded with all of the
HTML escapes you were wanting.
Now combine this with the XSLT for output escaping (from above), and the document will not undergo any further transformations when written out to the file.
I like this solution because it limits the logic held in the XSLT file. Rather than managing the entire String find/replace, you would just need to ensure that you copy your entire node and copy the text() with output escaping disabled.
In theory, that seems like it would fulfill my understanding of your objective.
Caveat again is that I'm weak with XSLT, so the example xslt file may
still need some tweaking. This solution reduces that unknown work
quantity, in my opinion.

I've seen people use regex to do something similar
Copied from (Replace special character with an escape preceded special character in Java)
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
That whacky regex is a "look ahead" - a non capturing assertion that the following char match something - in this case a character class.
Notice how you don't need to escape chars in a character class, except a ] (even the minus don't need escaping if first or last).
The \\\\ is how you code a regex literal \ (escape once for java, once for regex)
Here's a test of this working:
public static void main(String[] args) {
String search = "code:xy";
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
System.out.println(newSearch);
}
Output:
code\:xy

this is very closely related to this question (how to Download a XML file from a URL by Escaping Special Characters like < > $amp; etc?).
This post has a similar case where the code downloads XML's with parsed / escaped content.
As i understand , you read file , parse it and escape characters . During saving the XML gets "escaped" again. While you can use the DOM for checking well-formed XML or schema, file based operations to escape can help you escape XML and HTML special characters. The code sample in the post refers to usage of IOUtils and StringUtils to do it. Hope this helps !

I would use StringEscapeUtils.escapeXml10()... details here. https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#ESCAPE_XML10

Related

How to created a formatted string from xml node in java

I'm trying to create a formatted string from an XML Node. See this example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<parent>
<foo>
<bar>foo</bar>
</foo>
</parent>
</root>
The Node I want to create a formatted string for is "foo". I expected a result like this:
<foo>
<bar>foo</bar>
</foo>
But the actual result is:
<foo>
<bar>foo</bar>
</foo>
My approach looks like this:
public String toXmlString(Node node) throws TransformerException {
final Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
final Writer writer = new StringWriter();
final StreamResult streamResult = new StreamResult(writer);
transformer.transform(new DOMSource(node), streamResult);
return writer.toString();
}
What am I doing wrong?
It is doing exactly what it's supposed to do. indent="yes" allows the transform to add whitespace to indent elements, but not to remove whitespace, since it cannot know which whitespace in the input is significant.
In the input you provide, the <foo> and </foo> element lines have 8 leading blanks, and the <bar> line has 12.
The reason the <foo> opening tag is not indented is that the preceding whitespace actually belongs to the containing <parent> element and is not present in the subtree you passed to the transform.
Whitespace stripping behavior is covered in detail in the standards (XSLT 1, XSLT 2). In summary
A whitespace text node is preserved if either of the following apply:
The element name of the parent of the text node is in the set of whitespace-preserving element names
...
and
(XSLT 2) The set of whitespace-preserving element names is specified by xsl:strip-space and xsl:preserve-space declarations. Whether an element name is included in the set of whitespace-preserving names is determined by the best match among all the xsl:strip-space or xsl:preserve-space declarations: it is included if and only if there is no match or the best match is an xsl:preserve-space element.
stated more simply in the XSLT 1 spec:
Initially, the set of whitespace-preserving element names contains all element names.
Unfortunately, using xsl:strip-space does not produce the results you want. With <xsl:strip-space elements="*"> (and indent="yes") I get the following output:
<foo><bar>foo</bar>
</foo>
Which makes sense. Whitespace is stripped, and then the </foo> tag is made to line up under its opening tag.
This will work better with the third party library JDOM 2, which also makes everything easier about manipulating DOM documents.
Its "pretty format" output will indent as expected, removing existing indentation, as long as the text nodes removed/altered were whitespace-only. When one wants to preserve whitespace, one doesn't ask for indented output.
Will look like this:
public String toXmlString(Element element) {
return new XMLOutputter(Format.getPrettyFormat()).outputString(element);
}
Saxon gives your desired output provided you strip whitespace on input:
public void testIndentation() {
try {
String in = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
+ "<root>\n"
+ " <parent>\n"
+ " <foo>\n"
+ " <bar>foo</bar>\n"
+ " </foo> \n"
+ " </parent>\n"
+ "</root>";
Processor proc = new Processor(false);
DocumentBuilder builder = proc.newDocumentBuilder();
builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL); //XX
XdmNode doc = builder.build(new StreamSource(new StringReader(in)));
StringWriter sw = new StringWriter();
Serializer serializer = proc.newSerializer(sw);
serializer.setOutputProperty(Serializer.Property.METHOD, "xml");
serializer.setOutputProperty(Serializer.Property.INDENT, "yes");
XdmNode foo = doc.axisIterator(Axis.DESCENDANT, new QName("foo")).next();
serializer.serializeNode(foo);
System.err.println(sw);
} catch (SaxonApiException err) {
fail();
}
}
But if you don't strip whitespace (comment out line XX), you get the ragged output shown in your post. The spec, from XSLT 2.0 onwards, allows the processor to be smarter than this, but Saxon doesn't take advantage of this. One reason is that the serialization is entirely streamed: it's looking at each event (start element, end element, etc) in isolation rather than considering the document as a whole.
Based on kumesana's answer, I've found an acceptable solution:
public String toXmlString(Node node) throws TransformerException {
final DOMBuilder builder = new DOMBuilder();
final Element element = (Element) node;
final org.jdom2.Element jdomElement = builder.build(element);
final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
final String output = xmlOutputter.outputString(jdomElement);
return output;
}

xquery transformation creates empty namespace in element

I'm sorry but I guess I just don't see the mistake I'm making here.
I have a camel route which returns an XML and to be able to test the output I wrote a JUnit Test which runs with SpringRunner. There I get the XML Stream from the exchange which I validate against an XSD. This works great because the XSD throws an exception because the output XML is not valid, but I don't understand why the following xquery generates an element with EMPTY NAMESPACE?
See the xquery snippet (I'm sorry again I cannot provide more code):
declare default element namespace "http://www.dppgroup.com/XXXPMS";
let $cmmdoc := $doc/*:cmmdoc
, $partner := $doc/*:cmmdoc/*:information/*:partner_gruppe/*:partner
, $sequence:= fn:substring($cmmdoc/#unifier,3)
return <ClientMMS xmlns:infra="http://www.dppgroup.com/InfraNS">
{
for $x in $partner
where $x[#partnerStatusCode = " "]
return
element {"DataGroup" } {
<Client sequenceNumber="{$sequence}" />
}
}
My problem is, that with this code the resulting XML contains the DataGroup-element with the following namespace definition:
<?xml version="1.0" encoding="UTF-8"?>
<ClientMMS xmlns="http://www.dppgroup.com/XXXPMS"
xmlns:infra="http://www.dppgroup.com/InfraNS">
<DataGroup xmlns="">
<Client sequenceNumber="170908065609671475"/>
</DataGroup>
</ClientMMS>
The snippet from the Unit-Test: I'm using jdk1.8_102
String xml = TestDataReader.readXML("/input/info/info_in.xml", PROJECT_ENCODING);
quelle.sendBody(xml);
boolean valid = false;
try {
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream((byte[]) archiv.getExchanges().get(1).getIn().getBody());
Document document = documentBuilder.parse(byteArrayInputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(document);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
System.out.println(xmlString);
In no XQuery introduction/tutorial/explanation I can find a reason why this happens. Can you guys please explain why the DataGroup element is not in the default namespace?
The XQuery you posted should create the result fine without the namespace undeclaration you show.
In your Java code if you want to work with XML with namespaces make sure you use a namespace aware DocumentBuilder, as the default DocumentBuilderFactory is not namespace aware make sure you set setNamespaceAware(true) on the factory before creating a DocumentBuilder with it.

Work with raw text in javax.xml.transform.Transformer

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This — That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This — That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "—".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that "
" was being converted into CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "—" is outputting "&mdash;".
I cannot use the solution to post-convert all "&" -> "&" because not all nodes represent html.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This — That");
document.appendChild(rootElement);
DOMImplementation domImpl = bgDoc.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &mdash; That</Test>"
The fact is, once you have a DOM tree, there's no longer a string with —: it's instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions including Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat .
To parse a single node, there is LSParser.parseWithContext.

Keep numeric character entity characters such as ` ` when parsing XML in Java

I am parsing XML that contains numeric character entity characters such as (but not limited to)
< > (line feed carriage return < >) in Java. While parsing, I am appending text content of nodes to a StringBuffer to later write it out to a textfile.
However, these unicode characters are resolved or transformed into newlines/whitespace when I write the String to a file or print it out.
How can I keep the original numeric character entity characters symbols when iterating over nodes of an XML file in Java and storing the text content nodes to a String?
Example of demo xml file:
<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">
<Field attributeWithChar="A string followed by special symbols
" />
</ABCD>
Example Java code. It loads the XML, iterates over the nodes and collects the text content of each node to a StringBuffer. After the iteration is over, it writes the StringBuffer to the console and also to a file (but no
) symbols.
What would be a way to keep these symbols when storing them to a String? Could you please help me? Thank you.
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
Document document = null;
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
document = documentBuilder.parse(new File("path/to/demo.xml"));
StringBuilder sb = new StringBuilder();
NodeList nodeList = document.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
NamedNodeMap nnp = node.getAttributes();
for (int j = 0; j < nnp.getLength(); j++) {
sb.append(nnp.item(j).getTextContent());
}
}
}
System.out.println(sb.toString());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("path/to/demo_output.xml"), "UTF-8"))) {
writer.write(sb.toString());
}
}
You need to escape all the XML entities before parsing the file into a Document. You do that by escaping the ampersand & itself with its corresponding XML entity &. Something like,
DocumentBuilder documentBuilder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");
Document document = documentBuilder.parse(
new InputSource(new StringReader(xmlContents.replaceAll("&", "&"))
));
Output :
2A string followed by special symbols
P.S. This is complement of Ravi Thapliyal's answer, not an alternative.
I am having the same problem with handling an XML file which is exported from 2003 format Excelsheet. This XML file stores line-breaks in text contents as
along with other numeric character references. However, after reading it with Java DOM parser, manipulating the content of some elements and transforming it back to the XML file, I see that all the numeric character references are expanded (i.e. The line-break is converted to CRLF) in Windows with J2SE1.6. Since my goal is to keep the content format unchanged as much as possible while manipulating some elements (i.e. retain numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.
When writing the XML content back to the file, it is necessary to replace all & with &, right? To do that, I had to give a StringWriter to the transformer as StreamResult and obtain String from it, replace all and dump the string to the xml file.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);
//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);
t.transform(source, result);
//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&" back to "&" to retain numeric character references.
xmlContent = xmlContent.replaceAll("&", "&");
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
wr.write(xmlContent);
wr.close();

XML Canonical form in Java

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
Have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I realy need to do this?
doc.normalize();
// Canonize using Apache's Xml security
org.apache.xml.security.Init.init(); // Doesnt work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
.canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse reccomended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
"{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code, would output:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.
Having such an example, I have a few questions:
For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again within my intent of have a canonical and pretty printed xml).
Why do the blank spaces within the xml seem to screw the formatting? Would I have to trim the text of each xml node to make it work? It just sounds wrong, nonetheless if the input xml is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa> the xml is perfectly formatted.
Why after asking for the canonical form, tags such as <ccc/> weren't expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.

Categories