I want to replace some items in a huge XML file, and I thought I'd do it with XSLT. I have absolutely no experience with it, so if you think there are better ways to do this, please tell me.
Anyway, as a first step I just wanted to copy the whole XML over. This is my xsl file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="no" cdata-section-elements="script"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The relevant Java code is:
Source xmlInput = new StreamSource(oldProjectStream);
Source xsl = new StreamSource("test.xsl");
Transformer transformer = TransformerFactory.newInstance().newTransformer(xsl);
StreamResult xmlOutput = new StreamResult("output/project.xml");
transformer.transform(xmlInput, xmlOutput);
Most of the output is fine, and the order of the elements is unchanged (this could turn out to be quite important).
The XML contains some Lua code in CDATA sections. At some (seemingly random) points, however, the CDATA section is closed and reopened again. It seems to be related to brackets in the code, but only rarely: there are about 5 such places in a 1.4 MB XML file, looking like this:
<script><![CDATA[
...
html_encoding["Otilde" ] = string.char(213)
html_encoding["Ouml" ]]]><![CDATA[ = string.char(214)
html_encoding["Oslash" ] = string.char(216)
...
]]></script>
In the original file, the middle line looks just like the other ones. There are thousands of lines where I've put the dots. What's going on here?
The (proprietary) application that should handle the XML isn't able to load it.
It's useful to tell us which XSLT processor you are using.
The serializer has to close and reopen a CDATA section if it encounters "]]>" in the data, because that sequence cannot legally appear in a CDATA section. It shouldn't need to do so under any other circumstances, though the spec probably doesn't disallow it.
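To make the "]]>" rule concrete, here is a minimal sketch of the split a serializer has to perform. The class and method names are made up for illustration; this is not how any particular XSLT processor implements it.

```java
public class CdataSplit {
    /**
     * Wraps text in CDATA markup. Wherever "]]>" occurs, the section is closed
     * after "]]" and a new one is opened before ">", so the terminator never
     * appears inside a single section.
     */
    public static String toCdata(String text) {
        String safe = text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }

    public static void main(String[] args) {
        // Contains "]]>", so it has to be split into two sections.
        System.out.println(toCdata("if a[b[0]]>c then"));
        // Contains no "]]>", so a single section is enough.
        System.out.println(toCdata("html_encoding[\"Ouml\" ] = string.char(214)"));
    }
}
```

A parser that reads the result back concatenates adjacent sections, so the character data is unchanged. The line from the question contains no "]]>", which is why a split there is legal but unnecessary.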
Related
I want to create a new element in the target XML if and only if the element value in the source XML is not empty. I can do this with the code below, but my problem is that I have around 5k fields to wrap with a similar condition. Is there a better way to handle this?
<xsl:if test="edi:po-num"> <!-- I want to avoid repeating this for each element -->
<xsl:element name="element">
<xsl:attribute name="name">order_reference_number</xsl:attribute>
<xsl:value-of select="edi:po-num"/>
</xsl:element>
</xsl:if>
Java code to transform:
Transformer trans = StylesheetCache.newTransformer(xslFilePath);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
trans.transform(source, new StreamResult(outputStream));
Your options in XSLT 1.0 are limited - XSLT 1.0 code tends to be verbose. But if it's really repetitive, then you could consider writing a meta-stylesheet - an XSLT stylesheet that generates your stylesheet from some higher-level description of what it needs to do.
Note also, your code will be a lot less verbose if you use literal result elements and attribute value templates rather than xsl:element and xsl:attribute.
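As a rough illustration of the second point, the sketch below runs the same conditional with a literal result element in place of xsl:element and xsl:attribute. The namespace URI `urn:example:edi`, the `order` root element, and the `record` wrapper are placeholders, since the question does not show them. If an attribute value had to be computed, an attribute value template such as `name="{...}"` would go in the same place.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class LiteralResultDemo {
    // "urn:example:edi", "order" and "record" are assumptions, not taken from the question.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:edi='urn:example:edi' exclude-result-prefixes='edi'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='/edi:order'>"
      + "<record>"
      + "<xsl:if test='edi:po-num'>"
      // Literal result element with a literal attribute: one line instead of three.
      + "<element name='order_reference_number'><xsl:value-of select='edi:po-num'/></element>"
      + "</xsl:if>"
      + "</record>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet above to the given XML and returns the serialized result. */
    public static String run(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<order xmlns='urn:example:edi'><po-num>42</po-num></order>"));
        System.out.println(run("<order xmlns='urn:example:edi'/>"));
    }
}
```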
I want to transform an XML file using XSLT.
I wrote:
TransformerFactory factory = TransformerFactory.newInstance();
InputStream is =
this.getClass().getResourceAsStream(getPathToXSLTFile());
Source xslt = new StreamSource(is);
Transformer transformer = factory.newTransformer(xslt);
Source text = new StreamSource(new File(getInputFileName()));
transformer.transform(text, new StreamResult(new File(getOutputFileName())));
When the input file has about 10,000,000 lines, I get this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xml.internal.utils.FastStringBuffer.append(FastStringBuffer.java:682)
at com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM.characters(SAX2DTM.java:2111)
at com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:863)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.characters(AbstractSAXParser.java:546)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:455)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:421)
at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:215)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.getDOM(TransformerImpl.java:556)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:739)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:351)
at ru.magnit.task.utils.AbstractXmlUtil.transformXML(AbstractXmlUtil.java:66)
at ru.magnit.task.EntryPoint.main(EntryPoint.java:72)
In this line:
transformer.transform(text, new StreamResult(new File(getOutputFileName())));
What is the reason for this, and can it be optimized somehow without increasing the heap size?
UPDATE:
My XSLT file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="entries">
<entries>
<xsl:apply-templates/>
</entries>
</xsl:template>
<xsl:template match="entry">
<entry>
<xsl:attribute name="field">
<xsl:apply-templates select="*"/>
</xsl:attribute>
</entry>
</xsl:template>
</xsl:stylesheet>
In general XSLT 1.0 and 2.0 work with a data model which pulls the complete XML input into a tree model to allow full XPath navigation, resulting in a memory usage that increases with the size of the input document.
So unless you increase the heap space, there is not much you can do in general when your document size leads to a memory shortage. There might be processor-specific and XSLT-specific optimizations, depending on your concrete XSLT code, but you can't avoid the processor first pulling in the complete document. We would need to see your XSLT to tell whether it can be optimized. Profiling a stylesheet can help identify areas to optimize; I am not sure whether Xalan supports that. I am also not sure whether that stack trace simply means that Xalan already runs out of memory while building the DTM (its tree model) for your large input; in that case, optimizing the XSLT code obviously does not help, as it is not even executed.
A Java-specific approach you could try is to use SAXTransformerFactory (https://docs.oracle.com/javase/8/docs/api/javax/xml/transform/sax/SAXTransformerFactory.html) to create a SAX filter from your stylesheet and chain it with a default Transformer that serializes the filter's output. I think I once tried that and found it can consume less memory than the traditional approach with a Transformer.
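A minimal sketch of that filter chain, assuming the configured TransformerFactory supports the XMLFilter feature (the JDK's built-in one normally does). The class name and the tiny stylesheet in `main` are made up for illustration, and whether this actually saves memory depends on the processor, so treat it as something to measure rather than a guaranteed fix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;

public class SaxFilterTransform {
    /** Runs the stylesheet as a SAX filter and serializes its output with an identity Transformer. */
    public static String transform(String xslt, String xml) {
        try {
            TransformerFactory tf = TransformerFactory.newInstance();
            if (!tf.getFeature(SAXTransformerFactory.FEATURE_XMLFILTER)) {
                throw new IllegalStateException("This TransformerFactory cannot create XMLFilters");
            }
            SAXTransformerFactory stf = (SAXTransformerFactory) tf;

            // The stylesheet becomes a SAX filter on top of a namespace-aware parser.
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLFilter filter = stf.newXMLFilter(new StreamSource(new StringReader(xslt)));
            filter.setParent(spf.newSAXParser().getXMLReader());

            // A default (identity) Transformer serializes whatever the filter emits.
            Transformer serializer = stf.newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(
                new SAXSource(filter, new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative stylesheet only; it is not the one from the question.
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='entry'><entry field='{field}'/></xsl:template>"
          + "</xsl:stylesheet>";
        System.out.println(transform(xslt, "<entries><entry><field>1</field></entry></entries>"));
    }
}
```

For a large input you would pass file-based `StreamSource`, `InputSource`, and `StreamResult` objects instead of strings, so that neither the input nor the output is held in memory by your own code.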
XSLT 3.0 tries to address the memory problem with the new approach of streaming (https://www.w3.org/TR/xslt-30/#streaming-concepts), however so far there is only one implementation with Saxon 9 EE, a commercial product. And in general a stylesheet is not necessarily streamable, instead you have to rewrite it to make it streamable (if that is at all possible, for instance sorting input nodes is not possible with streaming).
For instance, your posted stylesheet converted to XSLT 3.0 to use streaming (no rewrite necessary, only needed to set up the default mode as streamable) is
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
exclude-result-prefixes="xs math"
version="3.0">
<xsl:mode streamable="yes"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="entries">
<entries>
<xsl:apply-templates/>
</entries>
</xsl:template>
<xsl:template match="entry">
<entry>
<xsl:attribute name="field">
<xsl:apply-templates select="*"/>
</xsl:attribute>
</entry>
</xsl:template>
</xsl:stylesheet>
and Saxon 9.8 EE and the beta of Exselt assess that as streamable.
My XSLT transformations have been successful for months until I ran across an XML file with Unicode characters (most likely emoji). I need to preserve the Unicode but XSLT is converting it to HTML Entities. I thought that setting the encoding to UTF-8 would solve my problem but I'm still having issues.
Any help appreciated. Code:
private byte[] transform(InputStream stream) throws Exception{
System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl");
Transformer xmlTransformer;
xmlTransformer = (TransformerImpl) TransformerFactory.newInstance().newTransformer(new StreamSource(createXsltStylesheet()));
xmlTransformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(stream,"UTF-8");
Source staxSource = new StAXSource(reader, true);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(outputStream, "UTF-8");
xmlTransformer.transform(staxSource, new StreamResult(writer));
return outputStream.toByteArray();
}
If I add
xmlTransformer.setOutputProperty(OutputKeys.METHOD, "text");
the Unicode is preserved but the XML is not.
I just ran across this same issue, and after far too long researching it, here's what I've concluded.
Java XSLT processors escape multi-byte UTF-8 characters into HTML entities even if the output mode is XML... if multibyte chars occur in a text() node that's not wrapped in CDATA. If the characters are wrapped in CDATA (for output) the multibyte character will be preserved.
My Problem:
I had an xml file that looked like this, complete with emoji.
<events>
<event>
<id>RANDOMID</id>
<blah>
<blahId>FOOONE</blahId>
</blah>
<blah>
<blahId>FOOTWO</blahId>
</blah>
<eventComment>Did some things. Had some Fun. 👍</eventComment>
</event>
</events>
I started with an XSL stylesheet that looked like this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict"
>
<xsl:output method = "xml" version="1.0" encoding = "UTF-8" omit-xml-declaration="no" indent="yes" />
<xsl:template match="/">
<events>
<xsl:for-each select="/events/event">
<event>
<xsl:copy-of select="./*[name() != 'blah']"/>
<xsl:for-each select="./blah">
<blahId><xsl:copy-of select="./blahId/text()"/></blahId>
</xsl:for-each>
</event>
</xsl:for-each>
</events>
</xsl:template>
</xsl:stylesheet>
Running this with a java Transformer consistently produced &#55357;&#56397; where my emoji should be. Subsequent attempts to parse the resultant Document failed with the following exception message:
org.xml.sax.SAXParseException; lineNumber: y; columnNumber: x; Character reference "&#55357" is an invalid XML character.
HOGWASH!
Testing this with xsltproc on the command line was useless, since xsltproc isn't stupid when it comes to multibyte characters. I got the output I expected.
A SOLUTION
Having the XSLT wrap the eventComment in CDATA by specifying the QName in the xsl:output tag cdata-section-elements attribute will preserve the bytes and works with xsltproc and the java Transformer.
The magic here is the cdata-section-elements attribute of the <xsl:output> tag. https://www.w3.org/TR/xslt#output
I updated my XSL template to be:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict"
>
<xsl:output cdata-section-elements="eventComment" method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:template match="/">
<events>
<xsl:for-each select="/events/event">
<event>
<xsl:copy-of select="./*[name() != 'blah' and name() != 'eventComment']"/>
<!-- For the cdata-section-elements to resolve that eventComment needs to be preserved as CDATA
(so we don't get java doing stupid things with unicode escaping)
it needs to be explicitly referenced here.
-->
<eventComment><xsl:copy-of select="./eventComment/text()"/></eventComment>
<xsl:for-each select="./blah">
<blahId><xsl:copy-of select="./blahId/text()"/></blahId>
</xsl:for-each>
</event>
</xsl:for-each>
</events>
</xsl:template>
</xsl:stylesheet>
And now my output from both xsltproc and a java Transformer looks like this, and parses happily with java DocumentBuilders.
<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://www.w3.org/TR/xhtml1/strict">
<event>
<id xmlns="">RANDOMID</id>
<eventComment><![CDATA[Did some things. Had some Fun. 👍]]></eventComment>
<blahId>FOOONE</blahId>
<blahId>FOOTWO</blahId>
</event>
</events>
This line is suspicious:
stream = IOUtils.toInputStream(outputStream.toString(),"UTF-8");
You are converting a ByteArrayOutputStream to a String using the default encoding of your platform, which is probably not UTF-8. Change it to
stream = IOUtils.toInputStream(outputStream.toString("UTF-8"),"UTF-8");
or, for better performance, just wrap the byte array in a ByteArrayInputStream :
return new ByteArrayInputStream(outputStream.toByteArray());
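The two variants can be sketched as follows. The class and method names are only illustrative. The point is that the bytes are either decoded with an explicit charset or never decoded at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    /** Re-exposes the written bytes as a stream, with no decode/encode step in between. */
    public static ByteArrayInputStream asStream(ByteArrayOutputStream out) {
        return new ByteArrayInputStream(out.toByteArray());
    }

    /** Decodes the written bytes explicitly as UTF-8 instead of the platform default. */
    public static String asString(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Had some Fun. \uD83D\uDC4D"; // thumbs-up emoji as a surrogate pair
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(utf8, 0, utf8.length);

        System.out.println(asString(out).equals(text));
        System.out.println(asStream(out).available() == utf8.length);
    }
}
```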
Try serializing the XML to a String with the Apache (Xerces) serializer.
//Serialize DOM
OutputFormat format = new OutputFormat (doc);
// as a String
StringWriter stringOut = new StringWriter ();
XMLSerializer serial = new XMLSerializer (stringOut,
format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());
I just solved a similar problem by adding the line below to the original XML document:
document.appendChild(document.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, ""));
Refer to: Writing emoji to XML file in JAVA
Perhaps a similar setting can be used for the transformer...
I have an xml document and a style sheet to convert the document into another useful xml.
For reference, the XML document is somewhat like this:
<root>
<element1>value1</element1>
<element2>value2</element2>
<element3>value3</element3>
<element4>..some more levels of data</element4>
</root>
The style sheet looks somewhat like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:include href="errorResponse.xsl"/>
<xsl:template match="root/element4">
<xsl:element name="myRoot">
<xsl:element name="myElement">
<xsl:apply-templates select="./someElement/someOtherElement"/>
</xsl:element>
</xsl:element>
</xsl:template>
The output xml string which I am getting is like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
value1
value2
value3
<myRoot><myElement> some data </myElement></myRoot>
The code snippet which I am using for transformation is this:
InputStream styleSheet = new FileUtil().getFileStream("xsltFileName");
StreamSource xslStream = new StreamSource(styleSheet);
DOMSource in = new DOMSource(inputXMLDoc);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
TransformerFactory transFact = TransformerFactory.newInstance();
transFact.setURIResolver(new XsltURIResolver());
Transformer trans = transFact.newTransformer(xslStream);
trans.transform(in, new StreamResult(baos));
System.out.println(baos.toString()); // displays the above output
However, the output is not in the desired format. I don't want value1, value2, value3. They also cause problems further on, when the newly generated XML is processed.
I have seen a lot of questions about transformations. This has been bugging me for a long time. I would appreciate it a lot if someone could point out where I am going wrong.
Also, please point out if I am following any incorrect conventions in the process.
Thanks and regards.
You are getting that output because of the Default Template Rule, which outputs the text nodes. If you don't want those nodes you need to exclude them explicitly by matching them and replacing them with nothing (i.e. an empty template).
Try adding this template to your stylesheet:
<xsl:template match="/">
<xsl:apply-templates select="root/element4"/>
</xsl:template>
It matches the root and discards everything except for root/element4.
What happens here is that the XSLT built-in templates are applied to any node not matched explicitly by a template. The net effect of the built-in templates is to copy any text node (to which they are applied) to the output.
One of the simplest and shortest ways to suppress this unwanted output is to add the following template:
<xsl:template match="text()"/>
which causes any text-node for which this template is selected for execution, not to be copied to the output.
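A small Java harness can be used to check the effect. The stylesheet below is a simplified stand-in for the one in the question (the `xsl:include` and the `someElement/someOtherElement` path are left out), so it is a sketch rather than a drop-in replacement.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SuppressTextDemo {
    // Simplified stand-in for the question's stylesheet, plus the empty text() template.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='root/element4'>"
      + "<myRoot><myElement><xsl:value-of select='.'/></myElement></myRoot>"
      + "</xsl:template>"
      // Overrides the built-in rule that would copy text nodes to the output.
      + "<xsl:template match='text()'/>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet above to the given XML and returns the serialized result. */
    public static String run(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(
            "<root><element1>value1</element1><element2>value2</element2>"
          + "<element4>some data</element4></root>"));
    }
}
```

Note that the empty `text()` template also applies to text reached through `xsl:apply-templates` inside `element4`, which is why this sketch uses `xsl:value-of` to pull the text it does want.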
I need to put String content into XML in Java. I use this kind of code to insert information into the XML:
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File ("file.xml"));
DOMSource source = new DOMSource (doc);
Node cards = doc.getElementsByTagName ("cards").item (0);
Element card = doc.createElement ("card");
cards.appendChild(card);
Element question = doc.createElement("question");
question.appendChild(doc.createTextNode("This <b>is</b> a test."));
card.appendChild (question);
StreamResult result = new StreamResult (new File (file));
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.INDENT, "yes");
tf.transform(source, result);
But the string ends up escaped in the XML, like this:
<cards>
<card>
<question>This &lt;b&gt;is&lt;/b&gt; a test.</question>
</card>
</cards>
It should be like this:
<cards>
<card>
<question>This <b>is</b> a test.</question>
</card>
</cards>
I tried using a CDATA section instead:
// I changed this code
question.appendChild(doc.createTextNode("This <b>is</b> a test."));
// to this
question.appendChild(doc.createCDATASection("This <b>is</b> a test."));
This code produces an XML file that looks like this:
<cards>
<card>
<question><![CDATA[This <b>is</b> a test.]]></question>
</card>
</cards>
I hope that somebody can help me to put String content in the xml file exactly with same content.
Thanks in advance!
This would be expected behaviour.
Consider if the brackets were kept as you put them, the end result would essentially be:
<cards>
<card>
<question>
This
<b>
is
</b>
a test.
</question>
</card>
</cards>
Basically, it would result in the <b> being an additional node in the XML tree. Encoding the brackets as &lt; and &gt; ensures that, when the text is read back by any XML parser, the brackets are treated as literal characters and not mistaken for an additional node.
If you really wanted them to display as you say you do, you will need to create elements named b. This will not only be awkward, it will also not display quite as you've written above - it would display as additional nested nodes as I've shown above. So you would need to amend the xml writer to output inline for those tags.
Nasty.
Check this solution: how to unescape XML in java
Maybe you could solve it in this way (code only for <question> tag part):
Element question = doc.createElement("question");
question.appendChild(doc.createTextNode("This "));
Element b = doc.createElement("b");
b.appendChild(doc.createTextNode("is"));
question.appendChild(b);
question.appendChild(doc.createTextNode(" a test."));
card.appendChild(question);
What you are effectively trying to do is to insert XML into the middle of a DOM without parsing it. You can't do this since the DOM APIs don't support it.
You have three choices:
You could serialize the DOM and then insert the String at the appropriate point. The end result may or may not be well-formed XML ... depending on what is in the String that you inserted.
You could create and insert DOM nodes representing the text and the <b>...</b> element. This requires you to know the XML structure of the stuff that you are inserting. @bluish's answer gives an example.
You could wrap the String in some container element, parse it using an XML parser to give a second DOM, find the nodes of interest, and add them to the original DOM. This requires that the String is well-formed XML when wrapped in the container element.
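The third choice could look roughly like the sketch below. The wrapper element name and the helper names are made up for illustration, and the markup string is assumed to be well-formed once wrapped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class FragmentInsert {
    /** Parses markup inside a temporary wrapper element and appends copies of its nodes to target. */
    public static void appendMarkup(Element target, String markup, DocumentBuilder builder)
            throws SAXException, IOException {
        Document fragment = builder.parse(
            new InputSource(new StringReader("<wrapper>" + markup + "</wrapper>")));
        Document owner = target.getOwnerDocument();
        // importNode copies each node into the owning document; the parsed fragment is left untouched.
        for (Node n = fragment.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            target.appendChild(owner.importNode(n, true));
        }
    }

    /** Builds a small cards document, inserts the markup, and returns the serialized XML. */
    public static String demo() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element cards = doc.createElement("cards");
            doc.appendChild(cards);
            Element card = doc.createElement("card");
            cards.appendChild(card);
            Element question = doc.createElement("question");
            card.appendChild(question);
            appendMarkup(question, "This <b>is</b> a test.", builder);

            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the string is not well-formed inside the wrapper, `builder.parse` throws a `SAXException`, which is usually the behaviour you want for untrusted or hand-written markup.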
Or, since you're already using a Transformation, why not go all the way?
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
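<xsl:template match="cards">
<xsl:copy>
<!-- keep the existing cards, then append the new one -->
<xsl:apply-templates select="@*|node()" />
<card>
<question>This <b>is</b> a test</question>
</card>
</xsl:copy>
</xsl:template>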
</xsl:stylesheet>