Work with raw text in javax.xml.transform.Transformer

Work with raw text in javax.xml.transform.Transformer - java

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This — That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This — That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "—".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that "
" was being converted into CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "—" is outputting "&mdash;".
I cannot use the solution to post-convert all "&" -> "&" because not all nodes represent html.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This — That");
document.appendChild(rootElement);
DOMImplementation domImpl = bgDoc.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &mdash; That</Test>"

The fact is, once you have a DOM tree, there's no longer a string with —: it's instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions including Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat .
To parse a single node, there is LSParser.parseWithContext.

Related

How to created a formatted string from xml node in java

I'm trying to create a formatted string from an XML Node. See this example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<parent>
<foo>
<bar>foo</bar>
</foo>
</parent>
</root>
The Node I want to create a formatted string for is "foo". I expected a result like this:
<foo>
<bar>foo</bar>
</foo>
But the actual result is:
<foo>
<bar>foo</bar>
</foo>
My approach looks like this:
public String toXmlString(Node node) throws TransformerException {
final Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
final Writer writer = new StringWriter();
final StreamResult streamResult = new StreamResult(writer);
transformer.transform(new DOMSource(node), streamResult);
return writer.toString();
}
What am I doing wrong?

It is doing exactly what it's supposed to do. indent="yes" allows the transform to add whitespace to indent elements, but not to remove whitespace, since it cannot know which whitespace in the input is significant.
In the input you provide, the <foo> and </foo> element lines have 8 leading blanks, and the <bar> line has 12.
The reason the <foo> opening tag is not indented is that the preceding whitespace actually belongs to the containing <parent> element and is not present in the subtree you passed to the transform.
Whitespace stripping behavior is covered in detail in the standards (XSLT 1, XSLT 2). In summary
A whitespace text node is preserved if either of the following apply:
The element name of the parent of the text node is in the set of whitespace-preserving element names
...
and
(XSLT 2) The set of whitespace-preserving element names is specified by xsl:strip-space and xsl:preserve-space declarations. Whether an element name is included in the set of whitespace-preserving names is determined by the best match among all the xsl:strip-space or xsl:preserve-space declarations: it is included if and only if there is no match or the best match is an xsl:preserve-space element.
stated more simply in the XSLT 1 spec:
Initially, the set of whitespace-preserving element names contains all element names.
Unfortunately, using xsl:strip-space does not produce the results you want. With <xsl:strip-space elements="*"> (and indent="yes") I get the following output:
<foo><bar>foo</bar>
</foo>
Which makes sense. Whitespace is stripped, and then the </foo> tag is made to line up under its opening tag.

This will work better with the third party library JDOM 2, which also makes everything easier about manipulating DOM documents.
Its "pretty format" output will indent as expected, removing existing indentation, as long as the text nodes removed/altered were whitespace-only. When one wants to preserve whitespace, one doesn't ask for indented output.
Will look like this:
public String toXmlString(Element element) {
return new XMLOutputter(Format.getPrettyFormat()).outputString(element);
}

Saxon gives your desired output provided you strip whitespace on input:
public void testIndentation() {
try {
String in = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
+ "<root>\n"
+ " <parent>\n"
+ " <foo>\n"
+ " <bar>foo</bar>\n"
+ " </foo> \n"
+ " </parent>\n"
+ "</root>";
Processor proc = new Processor(false);
DocumentBuilder builder = proc.newDocumentBuilder();
builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL); //XX
XdmNode doc = builder.build(new StreamSource(new StringReader(in)));
StringWriter sw = new StringWriter();
Serializer serializer = proc.newSerializer(sw);
serializer.setOutputProperty(Serializer.Property.METHOD, "xml");
serializer.setOutputProperty(Serializer.Property.INDENT, "yes");
XdmNode foo = doc.axisIterator(Axis.DESCENDANT, new QName("foo")).next();
serializer.serializeNode(foo);
System.err.println(sw);
} catch (SaxonApiException err) {
fail();
}
}
But if you don't strip whitespace (comment out line XX), you get the ragged output shown in your post. The spec, from XSLT 2.0 onwards, allows the processor to be smarter than this, but Saxon doesn't take advantage of this. One reason is that the serialization is entirely streamed: it's looking at each event (start element, end element, etc) in isolation rather than considering the document as a whole.

Based on kumesana's answer, I've found an acceptable solution:
public String toXmlString(Node node) throws TransformerException {
final DOMBuilder builder = new DOMBuilder();
final Element element = (Element) node;
final org.jdom2.Element jdomElement = builder.build(element);
final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
final String output = xmlOutputter.outputString(jdomElement);
return output;
}

How to Parse Complete XML into a string using DOM PARSER?

For example:
If I pass tagValue=1 then it should return complete xml1 as a string to me String input is in some function.
String input = "<1><xml1></xml1></1><2><xml2></xml2><2>.......<10000><xml10000></xml10000></10000‌>";
String output = "<xml1></xml1>"; // for tagValue=1;

If the xml has a root element then it is doable
1. Parse the XML using dom parser.
2. Iterate through each node
3. Find the desired node.
4. Write the node in a different xml using transform
Sample code
Step 1: I used XML as a string, you can read from file.
DocumentBuilder dBuilder = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(uri));
Document doc = dBuilder.parse(is);
Iterate through each node
if (doc.hasChildNodes()) {
printNote(doc.getChildNodes(), doc);
}
Please put in your logic to iterate thorugh the nodes and find the right child node which you want to process.
Write back as xml. Here assumption is that tempNode is the one you want to write as XML.
TransformerFactory tFactory =
TransformerFactory.newInstance();
Transformer transformer;
try {
transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(tempNode);
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);

You are not parsing XML, you are parsing a String, a replace should be enough in your particular case
String input = "<xml1><note><to>Tove</to><from>Jani</from<heading>Reminder</heading><body>Don't forget me this weekend</body></note><xml1>";
input = input.replace("<xml1>", "");

Import and parse an xml file without FileOutputStream

Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However I prefer not to create a temp file first after converting the data to UTF-8 and parsing from the temp file. Is there anyway I can do this?
I've tried using a ByteArrayOutputStream in place of OutputStreamWriter, and calling toString() on the ByteArrayOutputStream as such:
doc = db.parse(bos.toString("UTF-8");
But then my Map ends up being empty.

From the API docs (the ability of its meticulous studying is a valuable asset for any programmer) - the parse method with the String argument seems to take something different from what you feed to it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM >Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));

Update
#user2496748 sorry I should have searched for the API but instead I was looking at the source code through a decompiler which tells me the parameter is arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte to char or vice versa a little more now.
After some review I was able to simply my code to this and achieve what I wanted to do. Since I am able to get the email attachment as a InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
Works as well and tested with non-english characters.

You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.

How can i escape special characters with using DOM

This issue has been bugging me a lot lately and i can't seem to find out a possible solution.
I am dealing with a web-server that receives an XML document to do some processing. The server's parser has issues with &,',",<,>. I know this is bad, i didn't implement the xml parser on that server. But before waiting for a patch i need to circumvent.
Now, before uploading my XML document to this server, i need to parse it and escape the xml special characters. I am currently using DOM. The issue is, if i iterate through the TEXT_NODES and replaces all the special characters with their escaped versions, when I save this document,
for d'ex i get d&apos;ex but i need d&apos;ex
It makes sense since, DOM escapes "&". But obviously this is not what i need.
So if DOM is already capable of escaping "&" to "&" how can i make it escape other characters like " to " ?
If it can't, how can i save the already parsed and escaped texts in it's nodes without it having to re-escape them when saving ?
This is how i escape the special characters i used apache StringEscapeUtils class:
public String xMLTransform() throws Exception
{
String xmlfile = FileUtils.readFileToString(new File(filepath));
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xmlfile.trim().replaceFirst("^([\\W]+)<", "<"))));
NodeList nodeList = doc.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node currentNode = nodeList.item(i);
if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
Node child = currentNode.getFirstChild();
while(child != null) {
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
//Escaping works here. But when saving the final document, the "&" used in escaping gets escaped as well by DOM.
}
child = child.getNextSibling();
}
}
}
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
FileOutputStream fop = null;
File file;
file = File.createTempFile("escapedXML"+UUID.randomUUID(), ".xml");
fop = new FileOutputStream(file);
String xmlString = writer.toString();
byte[] contentInBytes = xmlString.getBytes();
fop.write(contentInBytes);
fop.flush();
fop.close();
return file.getPath();
}

I think the solution you're looking for is a customized XSLT parser that you can configure for your additional HTML escaping.
I'm not able to say for certain how to configure the xslt file to do what you want, but I am fairly confident it can be done. I've stubbed out the basic Java setup below:
#Test
public void testXSLTTransforms () throws Exception {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element el = doc.createElement("Container");
doc.appendChild(el);
Text e = doc.createTextNode("Character");
el.appendChild(e);
//e.setNodeValue("\'");
//e.setNodeValue("\"");
e.setNodeValue("&");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(System.out);
//This prints the original document to the command line.
transformer.transform(source, result);
InputStream xsltStream = getClass().getResourceAsStream("/characterswap.xslt");
Source xslt = new StreamSource(xsltStream);
transformer = transformerFactory.newTransformer(xslt);
//This one is the one you'd pipe to a file
transformer.transform(source, result);
}
And I've got a simple XSLT I used for proof of concept that shows the default character encoding you mentioned:
characterswap.xslt
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|#*">
<xsl:text>
Original VALUE : </xsl:text>
<xsl:copy-of select="."/>
<xsl:text>
OUTPUT ESCAPING DISABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="yes"/>
<xsl:text>
OUTPUT ESCAPING ENABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="no"/>
</xsl:template>
</xsl:stylesheet>
And the console out is pretty basic:
<?xml version="1.0" encoding="UTF-8"?>
<Container>&</Container>
Original VALUE : <Container>&</Container>
OUTPUT ESCAPING DISABLED : &
OUTPUT ESCAPING ENABLED : &
You can take the active node from the XSLT execution and perform specific character replacments. There are multiple examples I was able to find, but I'm having difficulty getting them working in my context.
XSLT string replace
is a good place to start.
This is about the extent of my knowledge with XSLT, I hope it helps you solve your issue.
Best of luck.
I was considering this further, and the solution may not only be XSLT. From your description, I have the impression that rather than xml10 encoding, you're kind of looking for a full set of html encoding.
Along those lines, if we take your current node text transformation:
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
}
And explicitly expect that we want the HTML Encoding:
if (child.getNodeType() == Node.TEXT_NODE) {
//Capture the current node value
String nodeValue = child.getNodeValue();
//Decode for XML10 to remove existing escapes
String decodedNode = StringEscapeUtils.unescapeXml10(nodeValue);
//Then Re-encode for HTML (3/4/5)
String fullyEncodedHTML = StringEscapeUtils.escapeHtml3(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml4(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml5(decodedNode);
//Then place the fully-encoded HTML back to the node
child.setNodeValue(fullyEncodedHTML);
}
I would think that the xml would now be fully encoded with all of the
HTML escapes you were wanting.
Now combine this with the XSLT for output escaping (from above), and the document will not undergo any further transformations when written out to the file.
I like this solution because it limits the logic held in the XSLT file. Rather than managing the entire String find/replace, you would just need to ensure that you copy your entire node and copy the text() with output escaping disabled.
In theory, that seems like it would fulfill my understanding of your objective.
Caveat again is that I'm weak with XSLT, so the example xslt file may
still need some tweaking. This solution reduces that unknown work
quantity, in my opinion.

I've seen people use regex to do something similar
Copied from (Replace special character with an escape preceded special character in Java)
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
That whacky regex is a "look ahead" - a non capturing assertion that the following char match something - in this case a character class.
Notice how you don't need to escape chars in a character class, except a ] (even the minus don't need escaping if first or last).
The \\\\ is how you code a regex literal \ (escape once for java, once for regex)
Here's a test of this working:
public static void main(String[] args) {
String search = "code:xy";
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
System.out.println(newSearch);
}
Output:
code\:xy

this is very closely related to this question (how to Download a XML file from a URL by Escaping Special Characters like < > $amp; etc?).
This post has a similar case where the code downloads XML's with parsed / escaped content.
As i understand , you read file , parse it and escape characters . During saving the XML gets "escaped" again. While you can use the DOM for checking well-formed XML or schema, file based operations to escape can help you escape XML and HTML special characters. The code sample in the post refers to usage of IOUtils and StringUtils to do it. Hope this helps !

I would use StringEscapeUtils.escapeXml10()... details here. https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#ESCAPE_XML10

XML Canonical form in Java

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
Have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I realy need to do this?
doc.normalize();
// Canonize using Apache's Xml security
org.apache.xml.security.Init.init(); // Doesnt work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
.canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse reccomended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
"{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code, would output:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.
Having such an example, I have a few questions:
For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again within my intent of have a canonical and pretty printed xml).
Why do the blank spaces within the xml seem to screw the formatting? Would I have to trim the text of each xml node to make it work? It just sounds wrong, nonetheless if the input xml is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa> the xml is perfectly formatted.
Why after asking for the canonical form, tags such as <ccc/> weren't expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Work with raw text in javax.xml.transform.Transformer - java

Related

How to created a formatted string from xml node in java

How to Parse Complete XML into a string using DOM PARSER?

Import and parse an xml file without FileOutputStream

How can i escape special characters with using DOM

XML Canonical form in Java

Categories

Resources