&quot; is auto converting to " through Document & Transformer API - java

I am loading an XML file (pom.xml) into an org.w3c.dom.Document, editing some node values (basically changing the version of some dependencies), and writing it back with javax.xml.transform.Transformer, javax.xml.transform.TransformerFactory and javax.xml.transform.dom.DOMSource.
The problem is that this also converts every occurrence of &quot; to the " character, which I don't want. See the sample below:
<Export-Package>!${bundle.namespace}.internal.*,${bundle.namespace}.*;version=&quot;${project.version}&quot;</Export-Package>
converted to:
<Export-Package>!${bundle.namespace}.internal.*,${bundle.namespace}.*;version="${project.version}"</Export-Package>
Please help: how can I prevent this automatic conversion with the API I am currently using?
Code Sample:
public void writeDocument(File filePath)
{
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    this.thisDoc.getDocumentElement().normalize();
    Transformer transformer;
    try
    {
        transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(thisDoc);
        StreamResult result = new StreamResult(filePath);
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(source, result);
    }
    catch (TransformerException e)
    {
        VersionUpdateExceptions.throwException(e, LOG);
    }
}

This is the behavior required by the Document Object Model (DOM) Level 3 Load and Save Specification:
Within the character data of a document (outside of markup), any
characters that cannot be represented directly are replaced with
character references. Occurrences of '<' and '&' are replaced by the
predefined entities &lt; and &amp;. The other predefined entities
(&gt;, &apos;, and &quot;) might not be used, except where needed
(e.g. using &gt; in cases such as ']]>').
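The rule quoted above can be checked with a small experiment. The sketch below is not from the original answer; it assumes the JDK's built-in transformer, and the class name QuoteEscapingDemo is made up for illustration. It writes the same double-quote character once into an attribute value and once into element text, so the two serializations can be compared.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: serializes one element that carries a double
// quote both in an attribute value and in its text content.
final class QuoteEscapingDemo {

    static String serialize() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element e = doc.createElement("Export-Package");
            e.setAttribute("id", "\"test\"");      // quote inside an attribute
            e.setTextContent("version=\"1.0\"");   // quote inside character data
            doc.appendChild(e);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            // Wrapped so that callers of this sketch need no checked exceptions.
            throw new IllegalStateException(ex);
        }
    }
}
```

With the default JDK implementation, the attribute value should come out with &quot; while the text content keeps the literal quote character; other transformer implementations may differ in details.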
For example, if you use &quot; inside an attribute:
<Export-Package id="&quot;test&quot;">
&quot; will be preserved. Otherwise, it won't.
If absolutely necessary, you could preserve "&quot;" with an ugly hack:
Read the pom.xml as a String and replace occurrences of &quot; with some "marker" string.
To parse the document, use a StringReader to create an InputSource.
Execute your method, but create the StreamResult with a StringWriter.
Get the content from the StringWriter as a String and replace your marker string with &quot;.
Save the content to the file.
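The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name QuotPreserver, the marker value @@QUOT@@ and the edit callback are invented here, and the marker is assumed never to occur in the real pom.xml. Reading and writing the file itself is left to the caller.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating the "marker string" hack.
final class QuotPreserver {

    // Assumed to be a string that never appears in the document.
    private static final String MARKER = "@@QUOT@@";

    static String roundTrip(String xml, Consumer<Document> edit) {
        try {
            // 1. Hide every &quot; behind the marker before parsing.
            String masked = xml.replace("&quot;", MARKER);

            // 2. Parse from a String through a StringReader / InputSource.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(masked)));

            // 3. Apply the caller's changes (e.g. the version update).
            edit.accept(doc);

            // 4. Serialize into a StringWriter instead of a file.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));

            // 5. Put the entity back where the marker was.
            return out.toString().replace(MARKER, "&quot;");
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

A caller would read pom.xml into a String, pass it through roundTrip together with the node edits, and write the returned String back to disk.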

Related

How to created a formatted string from xml node in java

I'm trying to create a formatted string from an XML Node. See this example:
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <parent>
        <foo>
            <bar>foo</bar>
        </foo>
    </parent>
</root>
The Node I want to create a formatted string for is "foo". I expected a result like this:
<foo>
    <bar>foo</bar>
</foo>
But the actual result is:
<foo>
            <bar>foo</bar>
        </foo>
My approach looks like this:
public String toXmlString(Node node) throws TransformerException {
    final Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    final Writer writer = new StringWriter();
    final StreamResult streamResult = new StreamResult(writer);
    transformer.transform(new DOMSource(node), streamResult);
    return writer.toString();
}
What am I doing wrong?
It is doing exactly what it's supposed to do. indent="yes" allows the transform to add whitespace to indent elements, but not to remove whitespace, since it cannot know which whitespace in the input is significant.
In the input you provide, the <foo> and </foo> element lines have 8 leading blanks, and the <bar> line has 12.
The reason the <foo> opening tag is not indented is that the preceding whitespace actually belongs to the containing <parent> element and is not present in the subtree you passed to the transform.
Whitespace stripping behavior is covered in detail in the standards (XSLT 1, XSLT 2). In summary
A whitespace text node is preserved if either of the following apply:
The element name of the parent of the text node is in the set of whitespace-preserving element names
...
and
(XSLT 2) The set of whitespace-preserving element names is specified by xsl:strip-space and xsl:preserve-space declarations. Whether an element name is included in the set of whitespace-preserving names is determined by the best match among all the xsl:strip-space or xsl:preserve-space declarations: it is included if and only if there is no match or the best match is an xsl:preserve-space element.
stated more simply in the XSLT 1 spec:
Initially, the set of whitespace-preserving element names contains all element names.
Unfortunately, using xsl:strip-space does not produce the results you want. With <xsl:strip-space elements="*"> (and indent="yes") I get the following output:
<foo><bar>foo</bar>
</foo>
Which makes sense. Whitespace is stripped, and then the </foo> tag is made to line up under its opening tag.
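Since the transform will not remove existing whitespace, one JDK-only workaround is to drop the whitespace-only text nodes from a copy of the subtree before serializing it. The sketch below is an assumption-laden illustration, not part of the original answers: the class SubtreePrettyPrinter and its methods are invented names, and the indent-amount property is specific to Xalan-derived transformers such as the JDK default. It only makes sense when whitespace-only text is known to be insignificant.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pretty-prints a subtree after removing the
// indentation whitespace inherited from the source document.
final class SubtreePrettyPrinter {

    // Recursively removes text nodes that contain only whitespace.
    static void removeBlankText(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getTextContent().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeBlankText(child);
            }
        }
    }

    static String toXmlString(Node node) {
        try {
            Node copy = node.cloneNode(true); // leave the original tree untouched
            removeBlankText(copy);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            // Xalan-specific property; assumed to be supported here.
            t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(copy), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Runs the helper on the sample document from the question.
    static String demo() {
        String xml = "<root>\n    <parent>\n        <foo>\n"
                + "            <bar>foo</bar>\n        </foo>\n"
                + "    </parent>\n</root>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return toXmlString(doc.getElementsByTagName("foo").item(0));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

The exact line breaks in the result depend on the transformer implementation, but the 8 and 12 leading blanks carried over from the source document no longer appear.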
This will work better with the third party library JDOM 2, which also makes everything easier about manipulating DOM documents.
Its "pretty format" output will indent as expected, removing existing indentation, as long as the text nodes removed/altered were whitespace-only. When one wants to preserve whitespace, one doesn't ask for indented output.
Will look like this:
public String toXmlString(Element element) {
    return new XMLOutputter(Format.getPrettyFormat()).outputString(element);
}
Saxon gives your desired output provided you strip whitespace on input:
public void testIndentation() {
    try {
        String in = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<root>\n"
                + "    <parent>\n"
                + "        <foo>\n"
                + "            <bar>foo</bar>\n"
                + "        </foo> \n"
                + "    </parent>\n"
                + "</root>";
        Processor proc = new Processor(false);
        DocumentBuilder builder = proc.newDocumentBuilder();
        builder.setWhitespaceStrippingPolicy(WhitespaceStrippingPolicy.ALL); //XX
        XdmNode doc = builder.build(new StreamSource(new StringReader(in)));
        StringWriter sw = new StringWriter();
        Serializer serializer = proc.newSerializer(sw);
        serializer.setOutputProperty(Serializer.Property.METHOD, "xml");
        serializer.setOutputProperty(Serializer.Property.INDENT, "yes");
        XdmNode foo = doc.axisIterator(Axis.DESCENDANT, new QName("foo")).next();
        serializer.serializeNode(foo);
        System.err.println(sw);
    } catch (SaxonApiException err) {
        fail();
    }
}
But if you don't strip whitespace (comment out line XX), you get the ragged output shown in your post. The spec, from XSLT 2.0 onwards, allows the processor to be smarter than this, but Saxon doesn't take advantage of this. One reason is that the serialization is entirely streamed: it's looking at each event (start element, end element, etc) in isolation rather than considering the document as a whole.
Based on kumesana's answer, I've found an acceptable solution:
public String toXmlString(Node node) throws TransformerException {
    final DOMBuilder builder = new DOMBuilder();
    final Element element = (Element) node;
    final org.jdom2.Element jdomElement = builder.build(element);
    final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
    final String output = xmlOutputter.outputString(jdomElement);
    return output;
}

How to unescape string in XML using Transformer?

I have a function that takes an XML document as a parameter and writes it to a file. The document contains an element such as <tag>"some text & some text": <text> text</tag>, but in the output file it is written as <tag>"some text &amp; some text": &lt;text&gt; text</tag>. I don't want the string to be escaped when it is written to the file.
The function is:
public static void function(Document doc, String fileUri, String randomId){
    DOMSource source = new DOMSource(doc,ApplicationConstants.ENC_UTF_8);
    FileWriterWithEncoding writer = null;
    try {
        File file = new File(fileUri+File.separator+randomId+".xml");
        if (!new File(fileUri).exists()){
            new File(fileUri).mkdirs();
        }
        writer = new FileWriterWithEncoding(new File(file.toString()),ApplicationConstants.ENC_UTF_8);
        StreamResult result = new StreamResult(writer);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = null;
        transformer = transformerFactory.newTransformer();
        // INDENT is an output property, not a stylesheet parameter
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(source, result);
        writer.close();
        transformer.clearParameters();
    }catch (IOException | TransformerException e) {
        log.error("convert Exception is :"+ e);
    }
}
There are five escape characters in XML ("'<>&). According to XML grammar, they must be escaped in certain places in XML, please see this question:
What characters do I need to escape in XML documents?
So you can't do much, for instance, to avoid escaping & or < in text content.
You could use CDATA sections if you want to retain "unescaped" content. Please see this question:
Add CDATA to an xml file
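As a rough illustration of the CDATA suggestion, the standard output property OutputKeys.CDATA_SECTION_ELEMENTS asks the transformer to write the text of the named elements inside CDATA sections. The sketch below is an assumption on top of the answer: the class name CdataDemo is invented, and the element name "tag" is taken from the question.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo: writes the text of <tag> as a CDATA section so that
// & and < appear unescaped in the output file.
final class CdataDemo {

    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element tag = doc.createElement("tag");
            tag.setTextContent(text);
            doc.appendChild(tag);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Whitespace-separated list of element names whose text
            // children should be emitted as CDATA.
            t.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "tag");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

The content is still the same character data to any XML parser; only its spelling in the file changes, so consumers that parse the file see identical text either way.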

Work with raw text in javax.xml.transform.Transformer

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This &mdash; That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That", which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text, or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The referenced question had the problem that character references for CR and LF (for example "&#13;&#10;") were being converted into a literal CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping it: "&mdash;" is output as "&amp;mdash;".
I cannot use the solution of post-converting all "&amp;" -> "&" because not all nodes represent HTML.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);
DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
        "-//Company//program//language",
        "test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &amp;mdash; That</Test>"
The fact is, once you have a DOM tree, there is no longer a string containing &mdash;: the text is instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions including Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat .
To parse a single node, there is LSParser.parseWithContext.
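For the input side, one way to "parse it to a Node" without relying on LSParser support is to wrap the escaped string in a tiny document whose internal DTD subset declares the entity, parse that, and import the resulting nodes. This is only a sketch under assumptions: the class EntityFragmentParser, the wrapper element w and the single mdash declaration are all invented for illustration, and a real input would need a declaration for every named entity it uses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: turns already-escaped text into DOM nodes.
final class EntityFragmentParser {

    static DocumentFragment parseFragment(Document target, String escaped) {
        try {
            // Internal subset declaring the one entity this sketch knows about.
            String wrapped = "<!DOCTYPE w [<!ENTITY mdash \"&#8212;\">]>"
                    + "<w>" + escaped + "</w>";
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document tmp = builder.parse(new InputSource(new StringReader(wrapped)));

            // Copy every child of the wrapper into the target document.
            DocumentFragment fragment = target.createDocumentFragment();
            for (Node n = tmp.getDocumentElement().getFirstChild();
                    n != null; n = n.getNextSibling()) {
                fragment.appendChild(target.importNode(n, true));
            }
            return fragment;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Builds <Test> from the question's string and returns its text content.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("Test");
            root.appendChild(parseFragment(doc, "This &mdash; That"));
            doc.appendChild(root);
            return root.getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

After this step the tree holds the actual dash character rather than the seven characters of the entity reference, so the serializer no longer has an ampersand to escape. Writing the named entity back out is a separate serialization concern.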

XML Node to String Conversion for Large Sized XML

Till now I was using DOMSource to transform the XML file into string, in my Android App.
Here's my code :
public String convertElementToString (Node element) throws TransformerConfigurationException, TransformerFactoryConfigurationError
{
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    //initialize StreamResult with a StringWriter to collect the output in memory
    StreamResult result = new StreamResult(new StringWriter());
    DOMSource source = new DOMSource(element);
    try {
        transformer.transform(source, result);
    }
    catch (TransformerException e) {
        Log.e("CONVERT_ELEMENT_TO_STRING", "converting element to string failed. Aborting", e);
    }
    String xmlString = result.getWriter().toString();
    xmlString = xmlString.replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "");
    xmlString = xmlString.replace("\n", "");
    return xmlString;
}
This was working fine for small xml files.
But for large sized xml this code started throwing OutOfMemoryError.
What may be the reason behind it and how to rectify this problem?
First off: if you just need the XML as a string, and aren't using the Node for anything else, you should use StAX (Streaming API for XML) instead, as that has a much lower memory footprint. You'll find StAX in the javax.xml.stream package of the standard libraries.
One improvement to your current code would be to change the line
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
to
transformer.setOutputProperty(OutputKeys.INDENT, "no");
Since you're stripping newlines anyway at the end of the method, it's not very useful to request additional indentation in the output. It's a small thing, but might reduce your memory requirements a bit if there are a lot of tags (hence, newlines and whitespace for indentation) in your XML.
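To make the StAX suggestion concrete, here is a minimal sketch that copies XML from a Reader to a Writer one event at a time, so no DOM tree is ever held in memory. It assumes the javax.xml.stream package is available on the target platform, and the class name StaxCopy is invented. It drops the XML declaration and whitespace-only text, which is close to, but not identical to, the newline stripping in the question's code.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams XML from 'in' to 'out' without building a tree.
final class StaxCopy {

    static void copy(Reader in, Writer out) {
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter writer =
                    XMLOutputFactory.newInstance().createXMLEventWriter(out);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Skip the declaration, mirroring the string replace in the question.
                if (event.isStartDocument() || event.isEndDocument()) {
                    continue;
                }
                // Skip indentation and other whitespace-only text.
                if (event.getEventType() == XMLStreamConstants.SPACE
                        || (event.isCharacters()
                            && event.asCharacters().isWhiteSpace())) {
                    continue;
                }
                writer.add(event);
            }
            writer.flush();
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Passing a FileReader and a FileWriter (or any other streaming pair) keeps memory use roughly constant; collecting a very large result in a StringWriter would still require the whole string to fit in memory.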

XML Canonical form in Java

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
Have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I really need to do this?
doc.normalize();
// Canonicalize using Apache's XML Security
org.apache.xml.security.Init.init(); // Doesn't work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
        Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
        .canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse recommended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
        "{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code, would output:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.
Having such an example, I have a few questions:
For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again, given my intent of having a canonical and pretty-printed XML).
Why do the blank spaces within the XML seem to break the formatting? Would I have to trim the text of each XML node to make it work? That just sounds wrong; nonetheless, if the input XML is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa>, the output is perfectly formatted.
Why, after asking for the canonical form, weren't tags such as <ccc/> expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.
