Transforming XML and preserving Unicode characters with XSLT - java

My XSLT transformations have been successful for months until I ran across an XML file with Unicode characters (most likely emoji). I need to preserve the Unicode but XSLT is converting it to HTML Entities. I thought that setting the encoding to UTF-8 would solve my problem but I'm still having issues.
Any help appreciated. Code:
private byte[] transform(InputStream stream) throws Exception{
System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl");
Transformer xmlTransformer;
xmlTransformer = (TransformerImpl) TransformerFactory.newInstance().newTransformer(new StreamSource(createXsltStylesheet()));
xmlTransformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(stream,"UTF-8");
Source staxSource = new StAXSource(reader, true);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(outputStream, "UTF-8");
xmlTransformer.transform(staxSource, new StreamResult(writer));
return outputStream.toByteArray();
}
If I add
xmlTransformer.setOutputProperty(OutputKeys.METHOD, "text");
the Unicode is preserved but the XML is not.

I just ran across this same issue, and after far too long researching it, here's what I've concluded.
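Java XSLT processors escape characters such as emoji into numeric character references even if the output method is xml... but only if the characters occur in a text() node that is not wrapped in CDATA. (Emoji lie outside the Basic Multilingual Plane, so Java stores each one as a UTF-16 surrogate pair, and each half is escaped separately.) If the characters are wrapped in CDATA on output, they are preserved.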
My Problem:
I had an xml file that looked like this, complete with emoji.
<events>
<event>
<id>RANDOMID</id>
<blah>
<blahId>FOOONE</blahId>
</blah>
<blah>
<blahId>FOOTWO</blahId>
</blah>
<eventComment>Did some things. Had some Fun. 👍</eventComment>
</event>
</events>
I started with an XSL stylesheet that looked like this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict"
>
<xsl:output method = "xml" version="1.0" encoding = "UTF-8" omit-xml-declaration="no" indent="yes" />
<xsl:template match="/">
<events>
<xsl:for-each select="/events/event">
<event>
<xsl:copy-of select="./*[name() != 'blah']"/>
<xsl:for-each select="./blah">
<blahId><xsl:copy-of select="./blahId/text()"/></blahId>
</xsl:for-each>
</event>
</xsl:for-each>
</events>
</xsl:template>
</xsl:stylesheet>
Running this with a Java Transformer consistently produced &#55357;&#56397; where my emoji should be. Subsequent attempts to parse the resulting document failed with the following exception message:
org.xml.sax.SAXParseException; lineNumber: y; columnNumber: x; Character reference "&#55357" is an invalid XML character.
HOGWASH!
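The numbers in that error message are not arbitrary. A minimal sketch, assuming the emoji is U+1F44D (thumbs up), showing that 55357 is the first half of the UTF-16 surrogate pair Java uses to store it, and what a single valid character reference for the whole code point would look like. The class and method names are illustrative only.

```java
public class SurrogateDemo {
    /** Returns the UTF-16 code units of s as decimal numbers, separated by spaces. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append((int) c);
        }
        return sb.toString();
    }

    /** Builds one numeric character reference for the first code point of s. */
    static String charRef(String s) {
        return "&#" + s.codePointAt(0) + ";";
    }

    public static void main(String[] args) {
        String thumbsUp = "\uD83D\uDC4D"; // U+1F44D, assumed to be the emoji in question
        System.out.println(codeUnits(thumbsUp)); // 55357 56397
        System.out.println(charRef(thumbsUp));   // &#128077;
    }
}
```

A reference to a lone surrogate such as &#55357; is not a legal XML character, which is why the parser rejects the output, whereas &#128077; (or the raw UTF-8 bytes) would be accepted.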
Testing this with xsltproc on the command line was useless, since xsltproc isn't stupid when it comes to multibyte characters. I got the output I expected.
A SOLUTION
Having the XSLT wrap eventComment in CDATA, by listing its QName in the cdata-section-elements attribute of the xsl:output tag, preserves the bytes and works with both xsltproc and the Java Transformer.
The magic here is the cdata-section-elements attribute of the <xsl:output> tag. https://www.w3.org/TR/xslt#output
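The same output property can also be set from Java through OutputKeys.CDATA_SECTION_ELEMENTS. The following is only a sketch, assuming an identity transform with the JDK's default TransformerFactory rather than the Xalan setup from the question; the class name, method name and sample XML are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CdataOutputProperty {
    /** Copies xml unchanged, writing the text of the named elements as CDATA sections. */
    static String copyWithCdata(String xml, String elementNames) throws Exception {
        // newTransformer() without a stylesheet gives an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Whitespace-separated list of element names, like the XSLT attribute.
        identity.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, elementNames);
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<event><eventComment>Had some fun.</eventComment></event>";
        System.out.println(copyWithCdata(xml, "eventComment"));
    }
}
```

Whether a given processor then writes the emoji itself correctly inside the CDATA section still needs to be checked against that processor.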
I updated my XSL template to be:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict"
>
<xsl:output cdata-section-elements="eventComment" method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:template match="/">
<events>
<xsl:for-each select="/events/event">
<event>
<xsl:copy-of select="./*[name() != 'blah' and name() != 'eventComment']"/>
<!-- For cdata-section-elements to take effect, eventComment needs to be emitted as CDATA
(so we don't get Java doing stupid things with Unicode escaping),
so it needs to be explicitly referenced here.
-->
<eventComment><xsl:copy-of select="./eventComment/text()"/></eventComment>
<xsl:for-each select="./blah">
<blahId><xsl:copy-of select="./blahId/text()"/></blahId>
</xsl:for-each>
</event>
</xsl:for-each>
</events>
</xsl:template>
</xsl:stylesheet>
And now my output from both xsltproc and a java Transformer looks like this, and parses happily with java DocumentBuilders.
<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://www.w3.org/TR/xhtml1/strict">
<event>
<id xmlns="">RANDOMID</id>
<eventComment><![CDATA[Did some things. Had some Fun. 👍]]></eventComment>
<blahId>FOOONE</blahId>
<blahId>FOOTWO</blahId>
</event>
</events>

This line is suspicious:
stream = IOUtils.toInputStream(outputStream.toString(),"UTF-8");
You are converting a ByteArrayOutputStream to a String using the default encoding of your platform, which is probably not UTF-8. Change it to
stream = IOUtils.toInputStream(outputStream.toString("UTF-8"),"UTF-8");
or, for better performance, just wrap the byte array in a ByteArrayInputStream :
return new ByteArrayInputStream(outputStream.toByteArray());
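A small sketch of both suggestions, using only java.io and java.nio.charset. The class and method names are hypothetical, and the sample text is an assumption chosen to include an emoji.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    /** Decodes the buffered bytes as UTF-8 instead of the platform default charset. */
    static String decode(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Exposes the buffered bytes as a stream without going through a String at all. */
    static InputStream toInputStream(ByteArrayOutputStream out) {
        return new ByteArrayInputStream(out.toByteArray());
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] bytes = "Had some fun. \uD83D\uDC4D".getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        System.out.println(decode(out).equals("Had some fun. \uD83D\uDC4D")); // true
    }
}
```

Skipping the String entirely, as in toInputStream, avoids any charset decision on the way back in.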

Try to convert to String the XML using Apache Serializer.
//Serialize DOM
OutputFormat format = new OutputFormat (doc);
// as a String
StringWriter stringOut = new StringWriter ();
XMLSerializer serial = new XMLSerializer (stringOut,
format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());

I just solved a similar problem by adding the line below to the original XML document:
document.appendChild(document.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, ""));
Refer to: Writing emoji to XML file in JAVA
Perhaps a similar setting can be used for the transformer...

Related

XSLT in Java: CDATA section split

I want to replace some items in a huge XML file, and I thought I'll do it with XSLT. I have absolutely no experience with it, so if you think there would be better ways to do this, please tell me.
Anyway, as a first step I just wanted to copy the whole XML over. This is my xsl file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="no" cdata-section-elements="script"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The relevant Java code is:
Source xmlInput = new StreamSource(oldProjectStream);
Source xsl = new StreamSource("test.xsl");
Transformer transformer = TransformerFactory.newInstance().newTransformer(xsl);
StreamResult xmlOutput = new StreamResult("output/project.xml");
transformer.transform(xmlInput, xmlOutput);
Most of the output is fine, also the order of the elements is not changed (this could turn out quite important).
The XML contains some Lua code in CDATA sections. At some (seemingly random) points, however, the CDATA section is closed and reopened again. It seems to have to do with brackets in the code, but only rarely: there are about 5 points in a 1.4 MB XML file looking like this:
<script><![CDATA[
...
html_encoding["Otilde" ] = string.char(213)
html_encoding["Ouml" ]]]><![CDATA[ = string.char(214)
html_encoding["Oslash" ] = string.char(216)
...
]]></script>
In the original file, the middle line looks just like the other ones. There are thousands of lines where I've put the dots. What's going on here?
The (proprietary) application that should handle the XML isn't able to load it.
It's useful to tell us which XSLT processor you are using.
The serializer has to close and reopen a CDATA section if it encounters "]]>" in the data, because that sequence cannot legally appear in a CDATA section. It shouldn't need to do so under any other circumstances, though the spec probably doesn't disallow it.
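To illustrate the one case where splitting is required, here is a minimal sketch of how text containing "]]>" can be written as two adjacent CDATA sections. It is plain string handling, not what any particular serializer does internally, and the class and method names are hypothetical.

```java
public class CdataSplit {
    /**
     * Wraps text in CDATA. Every "]]>" in the text is broken up so that "]]"
     * ends one section and ">" starts the next one.
     */
    static String wrap(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(wrap("a]]>b")); // <![CDATA[a]]]]><![CDATA[>b]]>
    }
}
```

A parser reading the two sections back concatenates them into the original text, so the split is invisible to a conforming XML consumer.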

XSLT: extract the last x digit of a sibling node with xpath expression

I am trying to extract the last 4 numbers of the "red" sibling with xpath.
The source xml looks like:
...
<node2>
<key><![CDATA[RED]]></key>
<value><![CDATA[98472978241908]]></value>
... more key value pairs here...
</node2>
...
And when I use the following xpath:
/nodelevelX/nodelevelY/node2/key[text()='RED']/following-sibling::value
I get the full number in the output. Then I tried to extract the digits with this xpath expression:
/nodelevelX/nodelevelY/node2/key[text()='RED']/following-sibling::value/text()[substring(., string-length(.)-4)]
I still get the full number. The substring function does not seem to work.
my xsl header is:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
I think there is a small error, but I cannot see where. I followed many discussions on SO and elsewhere (w3schools) and tried to follow the advice, without success.
UPDATE: The context:
I use the following identity template, which copies all the nodes from my source XML to the destination XML,
and then I apply specific rules for some nodes inside an xsl:template:
<!-- This copy the whole source XML in destination -->
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*" />
</xsl:copy>
</xsl:template>
<!-- specific rules for some nodes -->
<xsl:template match="/nodeDetails">
<mynewnode>
<!-- here I take the whole value and it s working -->
<someVal><xsl:value-of select="/nodeDetails/nodeX/key[text()='ANOTHER_KEY']/following-sibling::value" /></someVal>
<!-- FIXME substring does not work now -->
<redVal><xsl:value-of select="/nodeDetails/nodeX/key[text()='RED']/following-sibling::value/text()[substring(.,string-length(.)-4)]" /></redVal>
</mynewnode>
</xsl:template>
And for the transformation I use the following code from a junit class in Java (JDK 6):
@Test
public void transformXml() throws TransformerException {
TransformerFactory factory = TransformerFactory.newInstance();
Source xslt = new StreamSource(getClass().getResourceAsStream("contract.xsl"));
Transformer transformer = factory.newTransformer(xslt);
Source input = new StreamSource(getClass().getResourceAsStream("source.xml"));
Writer output = new StringWriter();
transformer.transform(input, new StreamResult(output));
System.out.println("output=" + output.toString());
}
Your current XPath will evaluate to a nodeset, but what you need is a string. Please try something like this:
<xsl:variable name="value"
select="/nodelevelX/nodelevelY/node2/key[. = 'RED']
/following-sibling::value[1]" />
<xsl:value-of select="substring($value, string-length($value) - 3)" />
Though to be sure about an answer, I'd need to see the portion of your XSLT where you are trying to output this value.
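The same string-based approach can be checked outside XSLT with the JDK's XPath 1.0 API. This is a sketch under the assumption that the document has the /nodelevelX/nodelevelY/node2 shape used in the expressions above; the class name, method name and the shortened sample document are illustrative.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class LastFourDigits {
    /** Returns the last four characters of the value that follows the RED key. */
    static String lastFour(String xml) throws Exception {
        String value = "/nodelevelX/nodelevelY/node2/key[. = 'RED']/following-sibling::value[1]";
        // substring() takes a string and a 1-based start position,
        // so "length - 3" selects the last four characters.
        String expr = "substring(" + value + ", string-length(" + value + ") - 3)";
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nodelevelX><nodelevelY><node2>"
                + "<key>RED</key><value>98472978241908</value>"
                + "</node2></nodelevelY></nodelevelX>";
        System.out.println(lastFour(xml)); // 1908
    }
}
```

Using substring() as the outermost function, rather than inside a predicate, is what makes the expression return the shortened string instead of the whole node.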
Use this XPath 2.0 expression:
/nodelevelX/nodelevelY/node2/key[text()='RED']
/following-sibling::*[1][self::value]
/substring(., string-length() -3)
XSLT 2.0 - based verification:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"/nodelevelX/nodelevelY/node2/key[text()='RED']
/following-sibling::*[1][self::value]
/substring(., string-length() -3)"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<nodelevelX>
<nodelevelY>
<node2>
<key>GREEN</key>
<value>0123456789</value>
<key>RED</key>
<value>98472978241908</value>
<key>BLACK</key>
<value>987654321</value>
</node2>
</nodelevelY>
</nodelevelX>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
1908

Java Render XML Document as PDF

I have an XML document currently stored as an in-memory string & want to render it as a PDF. In other words, the PDF content will be an XML document. The XML being rendered by the method is generic -- multiple types of XML documents might be sent in.
I'm having a bit of difficulty figuring out how to accomplish this using various Java-based frameworks.
Apache FOP
It appears as if this framework requires a specific transformation of the XML elements in the document into FOP entities. Since the method in question must accept generic XML, I don't think this framework fits my requirement.
iText
I've tried rendering a document using a combination of iText/Flying Saucer (org.xhtmlrenderer) and while it does render a PDF, the content only contains space-separated data values and no XML elements or attributes. Using the code and test data below:
File
<?xml version="1.0" encoding="UTF-8"?>
<root>
<elem1>value1</elem1>
<elem2>value2</elem2>
</root>
Code
File inputFile = new File(PdfGenerator.class.getResource("test.xml").getFile());
OutputStream os = new FileOutputStream("c:\\temp\\Sample.pdf");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(inputFile);
renderer.layout();
renderer.createPDF(os);
os.close();
Results in a PDF that contains the content values value1 value2, but no tags.
My question is
Can someone provide a code snippet for rendering a PDF containing XML content using one of the frameworks above, or is there another framework better suited to my needs?
Edit:
I realize the same question was asked here, but it seems the solution presented requires intimate knowledge of the structure of the incoming XML doc in the css file.
Just for the sake of giving an example using fop - here you have it. For everyone to be able to follow this I'm using the fop command line tool.
The same can easily be performed within Java code and then you don't need to have the xml as a file at any time.
XSLT that produce a PDF
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:template match="/">
<fo:root>
<fo:layout-master-set>
<fo:simple-page-master master-name="content"
page-width="210mm" page-height="297mm" margin="20mm 20mm 20mm 20mm">
<fo:region-body/>
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="content">
<fo:flow flow-name="xsl-region-body">
<fo:block>
<xsl:apply-templates />
</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
</xsl:template>
<xsl:template match="@*">
<xsl:text> </xsl:text>
<xsl:value-of select="name()" />
<xsl:text>="</xsl:text>
<xsl:value-of select="." />
<xsl:text>"</xsl:text>
</xsl:template>
<xsl:template match="*">
<xsl:param name="indent">0</xsl:param>
<fo:block margin-left="{$indent}">
<xsl:text>&lt;</xsl:text>
<xsl:value-of select="name()" />
<xsl:apply-templates select="@*" />
<xsl:text>&gt;</xsl:text>
<xsl:apply-templates>
<xsl:with-param name="indent" select="$indent+10" />
</xsl:apply-templates>
<xsl:text>&lt;/</xsl:text>
<xsl:value-of select="name()" />
<xsl:text>&gt;</xsl:text>
</fo:block>
</xsl:template>
</xsl:stylesheet>
We call this file xml2pdf.xsl
Short explanation of the code
The template match="/" builds the overall PDF structure; its <xsl:apply-templates /> row hands the document content to the other templates, or more precisely to the template match="*".
The template match="*" writes the element's start and end tags. In between it calls <xsl:apply-templates select="@*" />, which in turn invokes the template match="@*" for each attribute of the element (if any). Finally it calls <xsl:apply-templates> to process the child nodes.
The indent parameter gets increased by 10 for each level the template reaches with the select="$indent+10" attribute in the with-param statement.
Using the code
# fop -xsl xml2pdf.xsl -xml sample.xml -pdf result.pdf
This is a solution using iText. Your HTML content is in the request. Note that iText is not free: check its licensing requirements, which have changed in recent years, although it is not very expensive.
public class MyPDFGeneratorService {
public byte[] generatePdf(final XhtmlPDFGenerationRequest request) {
try {
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(this.getDocument(request.getContent()), null);
renderer.layout();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
renderer.createPDF(baos);
return this.toByteArray(baos);
}
catch (Exception e) {
throw new PDFGenerationException(
"Unable to generate PDF.", e);
}
}
private Document getDocument(final String content) {
InputSource is = new InputSource(new BufferedReader(new StringReader(
content)));
return XMLResource.load(is).getDocument();
}
private byte[] toByteArray(final ByteArrayOutputStream baos)
throws IOException {
byte[] bytes = baos.toByteArray();
baos.close();
return bytes;
}
}
Try Googling, there are a number of code snippets. For example: http://www.vogella.com/articles/JavaPDF/article.html
I recommend iText rather than FOP, it's faster, less memory-intensive and you have more control over the result.

Unusual output for XSL transformations

I have an xml document and a style sheet to convert the document into another useful xml.
For the reference the xml document is somewhat like this:
<root>
<element1>value1</element1>
<element2>value2</element2>
<element3>value3</element3>
<element4>..some more levels of data</element4>
</root>
The style sheet looks somewhat like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:include href="errorResponse.xsl"/>
<xsl:template match="root/element4">
<xsl:element name="myRoot">
<xsl:element name="myElement">
<xsl:apply-templates select="./someElement/someOtherElement"/>
</xsl:element>
</xsl:element>
</xsl:template>
The output xml string which I am getting is like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
value1
value2
value3
<myRoot><myElement> some data </myElement></myRoot>
The code snippet which I am using for transformation is this:
InputStream styleSheet = new FileUtil().getFileStream("xsltFileName");
StreamSource xslStream = new StreamSource(styleSheet);
DOMSource in = new DOMSource(inputXMLDoc);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
TransformerFactory transFact = TransformerFactory.newInstance();
transFact.setURIResolver(new XsltURIResolver());
Transformer trans = transFact.newTransformer(xslStream);
trans.transform(in, new StreamResult(baos));
System.out.println(baos.toString()); // displays the above output
However, the output is not in the desired format. I don't want value1, value2, value3. They also cause problems when the newly generated XML is processed further.
I have seen a lot of questions around the transformations. This is bugging me for a long time. Appreciate a lot if someone could point out where I am going wrong.
Also point out if I am following any incorrect conventions during the entire process.
Thanks and regards.
You are getting that output because of the Default Template Rule, which outputs the text nodes. If you don't want those nodes you need to exclude them explicitly by matching them and replacing them with nothing (i.e. an empty template).
Try adding this template to your stylesheet:
<xsl:template match="/">
<xsl:apply-templates select="root/element4"/>
</xsl:template>
It matches the root and discards everything except for root/element4.
What happens here is that the XSLT built-in templates are applied to any node not matched explicitly by a template. The net effect of the built-in templates is to copy any text node (on which they are applied) to the output.
One of the simplest and shortest ways to suppress this unwanted output is to add the following template:
<xsl:template match="text()"/>
which causes any text-node for which this template is selected for execution, not to be copied to the output.
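The effect can be reproduced with the JDK's built-in XSLT processor. The sketch below uses a cut-down, made-up version of the question's document and stylesheet (the names BuiltInTemplateDemo, stylesheet and transform are hypothetical) and runs it once without and once with the empty text() template.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class BuiltInTemplateDemo {
    static final String XML = "<root><element1>value1</element1>"
            + "<element4><someElement>data</someElement></element4></root>";

    /** Builds the stylesheet, optionally adding the empty template for text nodes. */
    static String stylesheet(boolean suppressText) {
        return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output omit-xml-declaration='yes'/>"
                + "<xsl:template match='root/element4'>"
                + "<myRoot><xsl:value-of select='someElement'/></myRoot>"
                + "</xsl:template>"
                + (suppressText ? "<xsl:template match='text()'/>" : "")
                + "</xsl:stylesheet>";
    }

    static String transform(boolean suppressText) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet(suppressText))));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(false)); // still contains the stray text "value1"
        System.out.println(transform(true));  // only the myRoot element remains
    }
}
```

Note that xsl:value-of inside the element4 template still emits its text, because the empty template only affects text nodes reached through xsl:apply-templates.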

How to put String text without converting content to xml file in Java?

I need to put String content to xml in Java. I use this kind of code to insert information in xml:
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File ("file.xml"));
DOMSource source = new DOMSource (doc);
Node cards = doc.getElementsByTagName ("cards").item (0);
Element card = doc.createElement ("card");
cards.appendChild(card);
Element question = doc.createElement("question");
question.appendChild(doc.createTextNode("This <b>is</b> a test."));
card.appendChild (question);
StreamResult result = new StreamResult (new File (file));
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.transform(source, result);
But the string is escaped in the XML like this:
<cards>
<card>
<question>This &lt;b&gt;is&lt;/b&gt; a test.</question>
</card>
</cards>
It should be like this:
<cards>
<card>
<question>This <b>is</b> a test.</question>
</card>
</cards>
I tried the CDATA approach, changing the code like this:
// I changed this code
question.appendChild(doc.createTextNode("This <b>is</b> a test."));
// to this
question.appendChild(doc.createCDATASection("This <b>is</b> a test."));
This code produces an XML file that looks like:
<cards>
<card>
<question><![CDATA[This <b>is</b> a test.]]></question>
</card>
</cards>
I hope that somebody can help me to put String content in the xml file exactly with same content.
Thanks in advance!
This would be expected behaviour.
Consider if the brackets were kept as you put them, the end result would essentially be:
<cards>
<card>
<question>
This
<b>
is
</b>
a test.
</question>
</card>
</cards>
Basically, it would result in the <b> being an additional node in the XML tree. Encoding the brackets as &lt; and &gt; ensures that, when the text is displayed by any XML parser, the brackets are shown literally and not mistaken for an additional node.
If you really wanted them to display as you say you do, you will need to create elements named b. This will not only be awkward, it will also not display quite as you've written above - it would display as additional nested nodes as I've shown above. So you would need to amend the xml writer to output inline for those tags.
Nasty.
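The escaping described in this answer is easy to observe in isolation. A minimal sketch, assuming the JDK's default DOM and Transformer implementations; the class and method names are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapeDemo {
    /** Puts text into a <question> element as a text node and serializes the result. */
    static String serializeQuestion(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element question = doc.createElement("question");
        // A text node is character data, so markup characters in it get escaped on output.
        question.appendChild(doc.createTextNode(text));
        doc.appendChild(question);
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeQuestion("This <b>is</b> a test."));
    }
}
```

Parsing that output again yields the original string, angle brackets included, which is the round trip the serializer is protecting.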
Check this solution: how to unescape XML in java
Maybe you could solve it in this way (code only for <question> tag part):
Element question = doc.createElement("question");
question.appendChild(doc.createTextNode("This "));
Element b = doc.createElement("b");
b.appendChild(doc.createTextNode("is"));
question.appendChild(b);
question.appendChild(doc.createTextNode(" a test."));
card.appendChild(question);
What you are effectively trying to do is to insert XML into the middle of a DOM without parsing it. You can't do this since the DOM APIs don't support it.
You have three choices:
You could serialize the DOM and then insert the String at the appropriate point. The end result may or may not be well-formed XML ... depending on what is in the String that you inserted.
You could create and insert DOM nodes representing the text and the <b>...</b> element. This requires you to know the XML structure of the stuff that you are inserting. @bluish's answer gives an example.
You could wrap the String in some container element, parse it using an XML parser to give a second DOM, find the nodes of interest, and add them to the original DOM. This requires that the String is well-formed XML when wrapped in the container element.
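The third choice can be sketched with the standard DOM API as follows. The wrapper element name and the class and method names are arbitrary choices made for this illustration, and the fragment is assumed to be well-formed once wrapped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FragmentImport {
    /** Parses fragment inside a temporary wrapper and appends its nodes to target. */
    static void appendFragment(Element target, String fragment) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document parsed = builder.parse(
                new InputSource(new StringReader("<wrapper>" + fragment + "</wrapper>")));
        Document owner = target.getOwnerDocument();
        // importNode copies each node into the target document; the parsed tree is untouched.
        for (Node n = parsed.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            target.appendChild(owner.importNode(n, true));
        }
    }

    /** Creates a new document with a <question> root filled from fragment. */
    static Element buildQuestion(String fragment) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element question = doc.createElement("question");
        doc.appendChild(question);
        appendFragment(question, fragment);
        return question;
    }

    public static void main(String[] args) throws Exception {
        Element question = buildQuestion("This <b>is</b> a test.");
        System.out.println(question.getChildNodes().getLength()); // 3: text, <b>, text
    }
}
```

If the string is not well-formed XML, builder.parse throws a SAXException, which gives a natural place to fall back to a plain text node.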
Or, since you're already using a Transformation, why not go all the way?
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="cards">
<card>
<question>This <b>is</b> a test</question>
</card>
</xsl:template>
</xsl:stylesheet>
