OutOfMemoryError: Java heap space using XSLT transform - java

I want to transform an XML file using XSLT.
This is my code:
TransformerFactory factory = TransformerFactory.newInstance();
InputStream is =
this.getClass().getResourceAsStream(getPathToXSLTFile());
Source xslt = new StreamSource(is);
Transformer transformer = factory.newTransformer(xslt);
Source text = new StreamSource(new File(getInputFileName()));
transformer.transform(text, new StreamResult(new File(getOutputFileName())));
When the input file has about 10,000,000 lines, I get this error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xml.internal.utils.FastStringBuffer.append(FastStringBuffer.java:682)
at com.sun.org.apache.xml.internal.dtm.ref.sax2dtm.SAX2DTM.characters(SAX2DTM.java:2111)
at com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.characters(SAXImpl.java:863)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.characters(AbstractSAXParser.java:546)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:455)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:421)
at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:215)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.getDOM(TransformerImpl.java:556)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:739)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:351)
at ru.magnit.task.utils.AbstractXmlUtil.transformXML(AbstractXmlUtil.java:66)
at ru.magnit.task.EntryPoint.main(EntryPoint.java:72)
The error occurs at this line:
transformer.transform(text, new StreamResult(new File(getOutputFileName())));
What causes this, and can it be optimized somehow without increasing the heap size?
UPDATE:
My XSLT file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="entries">
<entries>
<xsl:apply-templates/>
</entries>
</xsl:template>
<xsl:template match="entry">
<entry>
<xsl:attribute name="field">
<xsl:apply-templates select="*"/>
</xsl:attribute>
</entry>
</xsl:template>
</xsl:stylesheet>

In general, XSLT 1.0 and 2.0 work with a data model that pulls the complete XML input into a tree to allow full XPath navigation, so memory usage grows with the size of the input document.
So if your current document size exhausts the heap, there is not much you can do in general other than increase the heap space. There may be processor-specific and stylesheet-specific optimizations, depending on your concrete XSLT code, but you cannot avoid the processor first pulling in the complete document. We would need to see your XSLT to tell whether it can be optimized. Profiling a stylesheet can help identify areas to optimize, though I am not sure whether Xalan supports that. Note also that the stack trace may simply mean that Xalan already runs out of memory while building the DTM (its tree model) for your large input; in that case optimizing the XSLT code does not help, because it is never executed.
A Java-specific approach you could try is to use SAXTransformerFactory (https://docs.oracle.com/javase/8/docs/api/javax/xml/transform/sax/SAXTransformerFactory.html) to create a SAX filter from your stylesheet and chain it with a default Transformer that serializes the filter's result. I think I once tried that and found it can consume less memory than the traditional approach with a Transformer.
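The filter chain described above can be sketched roughly as follows. This is a minimal sketch assuming the JDK's built-in JAXP implementation; the class and method names (SaxFilterTransform, transform) are made up for illustration, and whether it actually lowers memory use depends on the processor, which may still build its tree model internally.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of the original question or answer.
class SaxFilterTransform {

    static void transform(Source xslt, InputSource input, Result output) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        // Not every TransformerFactory can build XMLFilters, so check first.
        if (!tf.getFeature(SAXTransformerFactory.FEATURE_XMLFILTER)) {
            throw new IllegalStateException("This TransformerFactory cannot create XMLFilters");
        }
        SAXTransformerFactory stf = (SAXTransformerFactory) tf;

        // Namespace-aware SAX parser that reads the input document.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // The stylesheet becomes a SAX filter sitting on top of the parser.
        XMLFilter filter = stf.newXMLFilter(xslt);
        filter.setParent(reader);

        // A default (identity) Transformer serializes the events the filter emits.
        Transformer serializer = stf.newTransformer();
        serializer.transform(new SAXSource(filter, input), output);
    }
}
```

With the code from the question, the call might look like transform(new StreamSource(is), new InputSource(new File(getInputFileName()).toURI().toString()), new StreamResult(new File(getOutputFileName()))). Note that the xsl:output settings of the stylesheet are not applied by the serializing Transformer in this setup; output properties would have to be set on it explicitly.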
XSLT 3.0 tries to address the memory problem with the new approach of streaming (https://www.w3.org/TR/xslt-30/#streaming-concepts); however, so far there is only one implementation, Saxon 9 EE, a commercial product. And in general a stylesheet is not necessarily streamable; you have to rewrite it to make it streamable (if that is possible at all; for instance, sorting input nodes is not possible with streaming).
For instance, your posted stylesheet converted to XSLT 3.0 to use streaming (no rewrite necessary, only needed to set up the default mode as streamable) is
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
exclude-result-prefixes="xs math"
version="3.0">
<xsl:mode streamable="yes"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="entries">
<entries>
<xsl:apply-templates/>
</entries>
</xsl:template>
<xsl:template match="entry">
<entry>
<xsl:attribute name="field">
<xsl:apply-templates select="*"/>
</xsl:attribute>
</entry>
</xsl:template>
</xsl:stylesheet>
and Saxon 9.8 EE and the beta of Exselt assess that as streamable.

Related

XSLT in Java: CDATA section split

I want to replace some items in a huge XML file, and I thought I'd do it with XSLT. I have absolutely no experience with it, so if you think there are better ways to do this, please tell me.
Anyway, as a first step I just wanted to copy the whole XML over. This is my xsl file:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="no" cdata-section-elements="script"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The relevant Java code is:
Source xmlInput = new StreamSource(oldProjectStream);
Source xsl = new StreamSource("test.xsl");
Transformer transformer = TransformerFactory.newInstance().newTransformer(xsl);
StreamResult xmlOutput = new StreamResult("output/project.xml");
transformer.transform(xmlInput, xmlOutput);
Most of the output is fine, also the order of the elements is not changed (this could turn out quite important).
The XML contains some Lua code in CDATA sections. At some (seemingly random) points, however, the CDATA section is closed and reopened again. It seems to have to do with brackets in the code, but only rarely: there are about 5 such points in a 1.4 MB XML file, looking like this:
<script><![CDATA[
...
html_encoding["Otilde" ] = string.char(213)
html_encoding["Ouml" ]]]><![CDATA[ = string.char(214)
html_encoding["Oslash" ] = string.char(216)
...
]]></script>
In the original file, the middle line looks just like the other ones. There are thousands of lines where I've put the dots. What's going on here?
The (proprietary) application that should handle the XML isn't able to load it.
It's useful to tell us which XSLT processor you are using.
The serializer has to close and reopen a CDATA section if it encounters "]]>" in the data, because that sequence cannot legally appear in a CDATA section. It shouldn't need to do so under any other circumstances, though the spec probably doesn't disallow it.
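To illustrate the rule above, here is a minimal sketch of how text containing "]]>" can be written as CDATA by closing and reopening the section. The class and method names (CdataWriter, toCdata) are hypothetical, and this is not a description of how any particular XSLT serializer is implemented.

```java
// Hypothetical helper, for illustration only.
class CdataWriter {

    /**
     * Wraps text in CDATA sections. Every "]]>" in the text is split so that
     * "]]" ends one section and ">" starts the next; the terminator sequence
     * therefore never appears inside a single section.
     */
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

For example, toCdata("a]]>b") yields <![CDATA[a]]]]><![CDATA[>b]]>, which a parser reads back as the original text a]]>b. The Lua line in the question contains no "]]>" at the split point, which is why the split there looks unnecessary.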

XSLT: extract the last x digit of a sibling node with xpath expression

I am trying to extract the last 4 numbers of the "red" sibling with xpath.
The source xml looks like:
...
<node2>
<key><![CDATA[RED]]></key>
<value><![CDATA[98472978241908]]></value>
... more key value pairs here...
</node2>
...
When I use the following XPath:
/nodelevelX/nodelevelY/node2/key[text()='RED']/following-sibling::value
I get the full number in the output. Then I tried to extract the digits with this XPath expression:
/nodelevelX/nodelevelY/node2/key[text()='RED']/following-sibling::value/text()[substring(., string-length(.)-4)]
I still get the full number. The substring function does not seem to work.
My XSL header is:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
I think there is a small error, but I cannot see where. I followed many discussions on SO and elsewhere (w3schools) and tried to follow the advice, without success.
UPDATE: The context:
I use the following identity template, which copies all the nodes from my source XML to the destination XML,
and then I apply specific rules for some nodes inside an xsl:template:
<!-- This copies the whole source XML to the destination -->
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*" />
</xsl:copy>
</xsl:template>
<!-- specific rules for some nodes -->
<xsl:template match="/nodeDetails">
<mynewnode>
<!-- here I take the whole value and it's working -->
<someVal><xsl:value-of select="/nodeDetails/nodeX/key[text()='ANOTHER_KEY']/following-sibling::value" /></someVal>
<!-- FIXME substring does not work now -->
<redVal><xsl:value-of select="/nodeDetails/nodeX/key[text()='RED']/following-sibling::value/text()[substring(.,string-length(.)-4)]" /></redVal>
</mynewnode>
</xsl:template>
And for the transformation I use the following code from a junit class in Java (JDK 6):
@Test
public void transformXml() throws TransformerException {
TransformerFactory factory = TransformerFactory.newInstance();
Source xslt = new StreamSource(getClass().getResourceAsStream("contract.xsl"));
Transformer transformer = factory.newTransformer(xslt);
Source input = new StreamSource(getClass().getResourceAsStream("source.xml"));
Writer output = new StringWriter();
transformer.transform(input, new StreamResult(output));
System.out.println("output=" + output.toString());
}
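Your current XPath evaluates to a node-set, but what you need is a string. The predicate [substring(...)] only acts as a filter: any non-empty string counts as true, so the whole text node is still selected. Please try something like this: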
<xsl:variable name="value"
select="/nodelevelX/nodelevelY/node2/key[. = 'RED']
/following-sibling::value[1]" />
<xsl:value-of select="substring($value, string-length($value) - 3)" />
Though to be sure about an answer, I'd need to see the portion of your XSLT where you are trying to output this value.
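The same XPath 1.0 idea can be checked directly from Java with the JDK's javax.xml.xpath API. This is a minimal sketch; the class and method names (LastFourDigits, lastFour) are made up for illustration, and the element path is the one from the question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class LastFourDigits {

    // Path to the value that follows the RED key.
    private static final String RED_VALUE =
        "/nodelevelX/nodelevelY/node2/key[. = 'RED']/following-sibling::value[1]";

    static String lastFour(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // substring() is applied to the string value of the node, not used as a predicate.
        // XPath positions are 1-based, so string-length - 3 is the start of the last 4 characters.
        String expr = "substring(" + RED_VALUE + ", string-length(" + RED_VALUE + ") - 3)";
        return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
    }
}
```

Note the offset: string-length(.) - 4, as in the question, would return the last five characters rather than the last four.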
Use this XPath 2.0 expression:
/nodelevelX/nodelevelY/node2/key[text()='RED']
/following-sibling::*[1][self::value]
/substring(., string-length() -3)
XSLT 2.0 - based verification:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"/nodelevelX/nodelevelY/node2/key[text()='RED']
/following-sibling::*[1][self::value]
/substring(., string-length() -3)"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<nodelevelX>
<nodelevelY>
<node2>
<key>GREEN</key>
<value>0123456789</value>
<key>RED</key>
<value>98472978241908</value>
<key>BLACK</key>
<value>987654321</value>
</node2>
</nodelevelY>
</nodelevelX>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
1908

Java SAX parsing: What's wrong with this XML?

I'm trying to validate an XML file, but I get the following error:
Can not find declaration of element
'xsl:stylesheet'.
This is the XML:
<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:msxsl='urn:schemas-microsoft-com:xslt' exclude-result-prefixes='msxsl' xmlns:ns='http://www.ibm.com/wsla'>
<xsl:strip-space elements='*'/>
<xsl:output method='xml' indent='yes'/>
<xsl:template match='@* | node()'>
<xsl:copy>
<xsl:apply-templates select='@* | node()'/>
</xsl:copy>
</xsl:template>
<xsl:template match="/ns:SLA/ns:ServiceDefinition/ns:WSDLSOAPOperation/ns:SLAParameter/@name[.='TotalMemoryConsumption']">
<xsl:attribute name='{name()}'>
<xsl:text>MemConsumption</xsl:text>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
Where is the mistake?
EDIT: I want to parse this XML in Java with SAX, but I get the following error:
Element type "xsl:template" must be followed by either attribute specifications, ">" or "/>".
How to get rid of it?
Assuming you are actually trying to validate your XSL as an XML document, it looks like that website requires you to point to a schema or DTD in order to validate the XML against it. You can get a non-normative schema here: http://www.w3.org/TR/xslt20/#schema-for-xslt. Here are instructions on how to reference a schema from an XML file: http://www.ibm.com/developerworks/xml/library/x-tipsch.html
You could also check "Well-Formedness only," and check the document for well-formedness, if not actually validity.
Generally, any XSL engine will report any errors in your XSL document, so you don't need to validate it separately.
Your XSL is OK, don't worry. It's just that there is no DTD/XSD for XSLT 1.0, and no one bothers checking XSLT 1.0 stylesheets for validity. Well-formedness is enough.
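Since well-formedness is what matters here, a check with a non-validating SAX parser could look like the sketch below. The class and method names (WellFormedCheck, isWellFormed) are hypothetical; the parser is the JDK default obtained through SAXParserFactory.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    static boolean isWellFormed(String xml) throws ParserConfigurationException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // No DTD or schema validation: only well-formedness is checked.
        factory.setValidating(false);
        try {
            // DefaultHandler ignores all content but rethrows fatal (well-formedness) errors.
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

With validation switched off, the parser never looks for a declaration of xsl:stylesheet, so the "Can not find declaration of element" error should not appear for a well-formed stylesheet.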

Removing elements from an XML document, XSLT and JAXB

This question is a follow up to my earlier question:
Creating a valid XSD that is open using <all> and <any> elements
Given that I have a Java String containing an XML document of the following form:
<TRADE>
<TIME>12:12</TIME>
<MJELLO>12345</MJELLO>
<OPTIONAL>12:12</OPTIONAL>
<DATE>25-10-2011</DATE>
<HELLO>hello should be ignored</HELLO>
</TRADE>
How can I use XSLT or something similar (in Java, using JAXB) to remove all elements that are not contained in a given set of elements?
In the above example I am only interested in (TIME, OPTIONAL, DATE), so I would like to transform it into:
<TRADE>
<TIME>12:12</TIME>
<OPTIONAL>12:12</OPTIONAL>
<DATE>25-10-2011</DATE>
</TRADE>
The order of the elements is not fixed.
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:param name="pNames" select="'|TIME|OPTIONAL|DATE|'"/>
<xsl:template match="node()|@*" name="identity">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*/*">
<xsl:if test="contains($pNames, concat('|', name(), '|'))">
<xsl:call-template name="identity"/>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<TRADE>
<TIME>12:12</TIME>
<MJELLO>12345</MJELLO>
<OPTIONAL>12:12</OPTIONAL>
<DATE>25-10-2011</DATE>
<HELLO>hello should be ignored</HELLO>
</TRADE>
produces the wanted, correct result:
<TRADE>
<TIME>12:12</TIME>
<OPTIONAL>12:12</OPTIONAL>
<DATE>25-10-2011</DATE>
</TRADE>
Explanation:
The identity rule (template) copies every node "as-is".
The identity rule is overridden by a template matching any element that is not the top element of the document. Inside the template a check is made if the name of the matched element is one of the names specified in the external parameter $pNames in a pipe-delimited string of wanted names.
See the documentation of your XSLT processor on how to pass a parameter to a transformation -- this is implementation-dependent and differs from processor to processor.
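With the standard JAXP API in Java, a parameter is passed through Transformer.setParameter. The sketch below assumes a condensed copy of the stylesheet above (indentation setting dropped) embedded as a string; the class and method names (FilterElements, keepOnly) are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
class FilterElements {

    // Condensed version of the stylesheet above.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output omit-xml-declaration='yes'/>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:param name='pNames' select=\"'|TIME|OPTIONAL|DATE|'\"/>"
        + "<xsl:template match='node()|@*' name='identity'>"
        + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='*/*'>"
        + "<xsl:if test=\"contains($pNames, concat('|', name(), '|'))\">"
        + "<xsl:call-template name='identity'/>"
        + "</xsl:if>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Keeps only the child elements whose names appear in the pipe-delimited list. */
    static String keepOnly(String xml, String pipeDelimitedNames) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        // Overrides the default value of the top-level xsl:param named pNames.
        transformer.setParameter("pNames", pipeDelimitedNames);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Calling keepOnly(tradeXml, "|TIME|DATE|") would then keep only the TIME and DATE children of TRADE, without touching the stylesheet itself.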
I haven't tried it yet, but maybe the javax.xml.transform package can help:
http://download.oracle.com/javase/6/docs/api/javax/xml/transform/package-summary.html
JAXB & XSLT
JAXB integrates very cleanly with XSLT. For an example, see:
How to get jaxb to Ignore certain data during unmarshalling
Your Other Question
Based on your previous question (see link below), the transform is really unnecessary as JAXB will just ignore attributes and elements that are not mapped to fields/properties in your domain object.
Creating a valid XSD that is open using <all> and <any> elements

How can I transform a functional language in XML to Java?

I'm working with a DSL based on an XML schema that supports functional language features such as loops, variable state with context, and calls to external Java classes. I'd like to write a tool which takes the XML document and converts it to, at the very least, something that looks like Java, where the <set> tags get converted to variable assignments, loops get converted to for loops, and so on.
I've been looking into ANTLR as well as standard XML parsers, and I'm wondering whether there's a recommended way to go about this. Can such an XML document be converted to something that's convertible to Java, if not directly?
I'm willing to write the parsing through SAX that writes an intermediate language based on each tag, if that's the recommended way, but the part that's giving me pause is the fact that it's context-based in the same way a language like Scheme is, with child elements of any tag being fully evaluated before the parent.
You can do it with XSLT: use it to generate the code snippets you need.
(remember to set the output format to plain text)
EDIT: Sample XSLT script
Input - a.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="b.xsl"?>
<set name='myVar'>
<concat>
<s>newText_</s>
<ref>otherVar</ref>
</concat>
</set>
Script - b.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text" />
<xsl:template match="set">
<xsl:value-of select="@name"/>=<xsl:apply-templates/>
</xsl:template>
<xsl:template match="concat">
<xsl:for-each select="*">
<xsl:if test="position() > 1">+</xsl:if>
<xsl:apply-templates select="."/>
</xsl:for-each>
</xsl:template>
<xsl:template match="ref">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="s">
<xsl:text>"</xsl:text>
<xsl:apply-templates/>
<xsl:text>"</xsl:text>
</xsl:template>
</xsl:stylesheet>
Note that a.xml contains a processing instruction that lets XSLT-capable browsers render it with the stylesheet b.xsl. Firefox is such a browser. Open a.xml in Firefox and you will see
myVar="newText_"+otherVar
Note that XSLT is a quite capable programming language, so there is a lot you can do.
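The same transformation can also be run from Java instead of a browser. The sketch below embeds a condensed copy of b.xsl as a string and uses the JDK's default Transformer; the class and method names (XsltCodeGen, generate) are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
class XsltCodeGen {

    // Condensed copy of b.xsl from the answer above.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='set'>"
        + "<xsl:value-of select='@name'/>=<xsl:apply-templates/>"
        + "</xsl:template>"
        + "<xsl:template match='concat'>"
        + "<xsl:for-each select='*'>"
        + "<xsl:if test='position() &gt; 1'>+</xsl:if>"
        + "<xsl:apply-templates select='.'/>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "<xsl:template match='ref'><xsl:apply-templates/></xsl:template>"
        + "<xsl:template match='s'>"
        + "<xsl:text>\"</xsl:text><xsl:apply-templates/><xsl:text>\"</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Turns a DSL document such as a.xml into Java-like source text. */
    static String generate(String dslXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(dslXml)), new StreamResult(out));
        return out.toString();
    }
}
```

Because templates are applied recursively to child elements, nested constructs are handled by adding one template per DSL tag, which fits the "children are evaluated before the parent" structure described in the question.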
