Extremely slow XSLT transformation in Java - java

I am trying to transform an XML document using XSLT. The input is the XHTML source of www.wordpress.org, and the XSLT is a dummy example that retrieves the site's title (it could just as well do nothing; the stylesheet makes no difference).
With every API or library I have tried, the transformation takes about 2 minutes! If you look at the wordpress.org source, you will notice that it is only 183 lines. From what I found by googling, this is probably due to DOM tree building. No matter how simple the XSLT is, it always takes 2 minutes, which supports the idea that it is related to DOM building, but in my opinion it still should not take 2 minutes.
Here is an example code (nothing special):
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = null;
try {
transformer = tFactory.newTransformer(
new StreamSource("/home/pd/XSLT/transf.xslt"));
} catch (TransformerConfigurationException e) {
e.printStackTrace();
}
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
System.out.println("START");
try {
transformer.transform(new SAXSource(new InputSource(
new FileInputStream("/home/pd/XSLT/wordpress.xml"))),
new StreamResult(outputStream));
} catch (TransformerException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("STOP");
System.out.println(new String(outputStream.toByteArray()));
It is between START and STOP that Java "pauses" for 2 minutes. If I look at processor or memory usage, nothing increases. It really looks as if the JVM has stopped...
Do you have any experience in transforming XML documents longer than 50 lines (this is a random number ;))? From what I have read, XSLT always needs to build a DOM tree in order to do its work. Fast transformation is crucial for me.
Thanks in advance,
Piotr

Does the sample HTML file use namespaces? If so, your XML parser may be attempting to retrieve contents (a schema, perhaps) from the namespace URIs. This is likely if each run takes exactly two minutes -- it's likely one or more TCP timeouts.
You can verify this by timing the parse step separately from the transform (for example, by parsing the WordPress XML into a DOM first), as the parse is likely where the delay occurs; constructing the InputSource itself does not parse anything. After reviewing the sample file you posted, it does include a declared namespace (xmlns="http://www.w3.org/1999/xhtml").
To work around this, you can implement your own EntityResolver which essentially disables the URL-based resolution. You may need to use a DOM -- see DocumentBuilder's setEntityResolver method.
Here's a sample using DOM and disabling resolution (note -- this is untested):
try {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbFactory.newDocumentBuilder();
db.setEntityResolver(new EntityResolver() {
@Override
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
return new InputSource(new java.io.StringReader("")); // Empty source: returning null would let the parser fetch the URL itself
}
});
System.out.println("BUILDING DOM");
Document doc = db.parse(new FileInputStream("/home/pd/XSLT/wordpress.xml"));
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer(
new StreamSource("/home/pd/XSLT/transf.xslt"));
System.out.println("RUNNING TRANSFORM");
transformer.transform(
new DOMSource(doc.getDocumentElement()),
new StreamResult(outputStream));
System.out.println("TRANSFORMED CONTENTS BELOW");
System.out.println(outputStream.toString());
} catch (Exception e) {
e.printStackTrace();
}
If you want to use SAX, you would have to use a SAXSource with an XMLReader which uses your custom resolver.
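The SAX route mentioned above might look roughly like the sketch below. It assumes that an empty InputSource is an acceptable stand-in for every external entity, which only holds if the document does not rely on entities declared in the DTD (such as &nbsp;). The class name OfflineSaxTransform and the inline sample document are made up for illustration, and the identity transformer stands in for the real stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: transforms XML through SAX without fetching external DTDs.
final class OfflineSaxTransform {

    static String transform(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Answer every external entity request with an empty document,
        // so the parser never opens a network connection.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

        // The SAXSource carries both the configured reader and the input.
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xml)));

        // Identity transformer; a real stylesheet would be passed to newTransformer(...).
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body/></html>";
        long start = System.nanoTime();
        System.out.println(transform(xml));
        System.out.println("ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

Timing the call as in main is a quick way to check whether the two-minute pause really comes from entity resolution.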

The commenters who've posted that the answer likely resides with the EntityResolver are probably correct. However, the solution may not be to simply not load the schemas but rather load them from the local file system.
So you could do something like this
db.setEntityResolver(new EntityResolver() {
@Override
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
try {
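FileInputStream fis = new FileInputStream(new File("xsd/" + systemId)); // java.io.File does not understand a "classpath:" prefix; this assumes a local "xsd" directory and a bare file name in systemId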
InputSource is = new InputSource(fis);
return is;
} catch (FileNotFoundException ex) {
logger.error("File Not found", ex);
return null;
}
}
});
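A slightly fuller version of this idea could look like the sketch below. It is only an illustration under some assumptions: local copies of the DTDs live in one directory and are named after the last path segment of the system ID, and anything not found locally is answered with an empty source instead of a network fetch. The class name LocalDtdResolver, the sample.dtd file and the example.invalid URL are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: serves DTDs from a local directory, never from the network.
final class LocalDtdResolver implements EntityResolver {
    private final Path localDir;

    LocalDtdResolver(Path localDir) {
        this.localDir = localDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        // Assumption: the local copy is named like the last segment of the system ID.
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        Path local = localDir.resolve(fileName);
        if (Files.isRegularFile(local)) {
            InputSource source = new InputSource(Files.newInputStream(local));
            // Lets relative references inside the DTD resolve next to the local copy.
            source.setSystemId(local.toUri().toString());
            return source;
        }
        // Not available locally: hand back an empty entity rather than null,
        // because null tells the parser to open the URL itself.
        return new InputSource(new StringReader(""));
    }

    static String parseText(Path localDir, String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        db.setEntityResolver(new LocalDtdResolver(localDir));
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("dtd");
        Files.write(dir.resolve("sample.dtd"),
                "<!ENTITY greeting \"hello\">".getBytes(StandardCharsets.UTF_8));
        String xml = "<!DOCTYPE r SYSTEM \"http://example.invalid/dtd/sample.dtd\">"
                + "<r>&greeting;</r>";
        System.out.println(parseText(dir, xml));
    }
}
```

Unlike an always-empty resolver, this keeps entities declared in the DTD usable, as long as the local copies are present.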

Chances are the problem isn't with the call to transformer.transform. It's more likely that you are doing something in your XSLT that is taking forever. My suggestion would be to use a tool like Oxygen or XML Spy to profile your XSLT and find out which templates are taking the longest to execute. Once you've determined this, you can begin to optimize the template.

If you are debugging your code on an Android device, make sure you try it without Eclipse attached to the process. When I was debugging my app, XSLT transformations were taking 8 seconds, where the same process took a tenth of a second on iOS in native code. Once I ran the code without Eclipse attached, the process took a comparable amount of time to its C-based counterpart.

Related

Link XML and XSD using java

I'm trying to write the header for an XML file so that it looks something like this:
<file xmlns="http://my_namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://my_namespace file.xsd">
however, I can't seem to find how to do it using the Document class in java. This is what I have:
public void exportToXML() {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
try {
dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.newDocument();
doc.setXmlStandalone(true);
doc.createTextNode("<file xmlns=\"http://my_namespace\"\n" +
"xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n" +
"xsi:schemaLocation=\"http://my_namespace file.xsd\">");
Element mainRootElement = doc.createElement("MainRootElement");
doc.appendChild(mainRootElement);
for(int i = 0; i < tipoDadosParaExportar.length; i++) {
mainRootElement.appendChild(criarFilhos(doc, tipoDadosParaExportar[i]));
}
Transformer tr = TransformerFactory.newInstance().newTransformer();
tr.transform(new DOMSource(doc),
new StreamResult(new FileOutputStream(filename)));
} catch (Exception e) {
e.printStackTrace();
}
}
I tried writing it on the file using the createTextNode but it didn't work either, it only writes the version before showing the elements.
Would appreciate if you could help me. Have a nice day
createTextNode() is only suitable for creating text nodes, not elements; for elements you need createElement(). If you are building a tree, you need to build nodes; you can't write lexical markup.
I'm not sure what MainRootElement is supposed to be; you've only given a fragment of your desired output so it's hard to tell.
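For the header in the question, building the root as nodes could look like the sketch below. It is a guess at the intent rather than a drop-in fix: it assumes the root element is meant to be file in the namespace http://my_namespace, uses the namespace-aware createElementNS and setAttributeNS variants, and the class name NamespacedRoot is made up.

```java
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example: a namespaced root element with xsi:schemaLocation.
final class NamespacedRoot {
    static final String NS = "http://my_namespace";
    static final String XSI = XMLConstants.W3C_XML_SCHEMA_INSTANCE_NS_URI;

    static String build() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();

        // The element lives in the default namespace.
        Element root = doc.createElementNS(NS, "file");
        // Declare the xsi prefix explicitly, then add the namespaced attribute.
        root.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "xmlns:xsi", XSI);
        root.setAttributeNS(XSI, "xsi:schemaLocation", NS + " file.xsd");
        doc.appendChild(root);
        // Child elements would be created with createElementNS(NS, ...) as well,
        // otherwise they end up in no namespace.

        Transformer tr = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        tr.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```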
Creating a DOM tree and then serializing it is a pretty laborious way of constructing an XML file. Using something like an XMLEventWriter is easier. But to be honest, I got frustrated by all the existing approaches and wrote a new library for the purpose as part of Saxon 10. It's called simply "Push", and looks something like this:
Processor proc = new Processor();
Serializer serializer = proc.newSerializer(new File(fileName));
Push push = proc.newPush(serializer);
Document doc = push.document(true);
doc.setDefaultNamespace("http://my_namespace");
Element root = doc.element("root")
.attribute(new QName("xsi", "http://www.w3.org/2001/XMLSchema-instance", "schemaLocation"),
"http://my_namespace file.xsd");
doc.close();

Getting NULL pointer exception net.sf.saxon.event.ReceivingContentHandler.startElement in DaisyDiff

I'm using the DaisyDiff library to compare two HTML files. I wrote Java code that calls DaisyDiff, but when running it I get a NullPointerException in net.sf.saxon.event.ReceivingContentHandler.startElement
I have tried multiple approaches with SAXTransformerFactory, but I couldn't figure it out.
public static void daisyDiffTest() throws Exception {
String html1 = "<html><body>var v2</body></html>";
String html2 = "<html> \n <body> \n Hello world \n </body> \n </html>";
try {
StringWriter finalResult = new StringWriter();
SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler result = tf.newTransformerHandler();
result.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
result.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
result.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
result.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
result.setResult(new StreamResult(finalResult));
ContentHandler postProcess = result;
Locale val = Locale.ENGLISH;
DaisyDiff.diffHTML(new InputSource(new StringReader(html1)), new InputSource(new StringReader(html2)),
postProcess, "test", val);
System.out.println(finalResult.toString());
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Expected result would be diff in the HTML file.
It's hard to know without knowing what DaisyDiff is, or what calls it makes. It's quite possible that it's not tested or supported for use with Saxon.
The format of data passed to the startElement() event in a SAX ContentHandler depends on the configuration options of the XML parser, and the problem when Saxon is invoked as a ContentHandler in this way is that it has no way of discovering what configuration options the parser is using.
As stated in the Javadoc documentation here: http://www.saxonica.com/documentation/index.html#!javadoc/net.sf.saxon.event/ReceivingContentHandler#startElement if the events emitted by the parser don't correspond to what an appropriately configured parser would emit, the ReceivingContentHandler will fail in unpredictable ways.
Posting the stack trace of the exception might be useful.
I had the same issue. The good news first: thanks to this post, I found a way to solve it:
Instead of SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance(); instantiate directly SAXTransformerFactory tf = new org.apache.xalan.processor.TransformerFactoryImpl(); (make sure it's the apache xalan one!).
My stacktrace for completeness:
Exception in thread "main" java.lang.NullPointerException
at net.sf.saxon.event.ReceivingContentHandler.startElement(ReceivingContentHandler.java:310)
at org.outerj.daisy.diff.html.HtmlSaxDiffOutput.generateOutput(HtmlSaxDiffOutput.java:147)

Storing an XML resource as an object rather than a file

Please turn your phasers to "noob".
As a part of my Java Servlet, I make a call to a REST resource and accept the text file returned, as below:
// check to see if the file really exists (i.e. a session is in
// progress) or we need to create one
// this should save constantly hitting the server for a new file for
// every transaction.
if (fXmlFile.exists()) {
} else {
File collectionTree = new File(bscConnector.GetCollection());
PrintWriter xmlfile = new PrintWriter(directoryName + "/outputString.xml");
xmlfile.println(collectionTree);
xmlfile.close();
}
From there I run a search and replace on it to make it valid XML file so that I can actually run xpath queries against it:
SearchAndReplace sAndR = new SearchAndReplace();
// Swap the slashes so we can actually
// query the freakin' document.
sAndR.readFiles(fXmlFile, "\\", "/");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
Document doc = null;
try {
dBuilder = dbFactory.newDocumentBuilder();
doc = dBuilder.parse(fXmlFile);
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// optional, but recommended
// read this -
// http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
doc.getDocumentElement().normalize();
// Create an instance of an xpath object
XPath xPath = XPathFactory.newInstance().newXPath();
And then I go to town on it with various xpath queries that create the interface, yadda yadda.
My question is this; while this approach works, it seems freakishly weird to be creating and querying an actual file on the server rather than doing all this in a session object, but I can't find the correct way of doing this; what object/set of objects should I be using instead of this serialize-to-disk-and-read approach?
Thanks.
This question turned out to be so simple I'm considering deleting it just to prevent polluting stackoverflow; it was a basic misunderstanding of what Java could do with a String. I replaced all the file manipulation stuff with:
String fXmlFile = null;
if (fXmlFile != null) {} else {
File collectionTree = new File(bscConnector.GetCollection());
fXmlFile = collectionTree.toString();
fXmlFile = fXmlFile.replace("\\", "/");
}
and other than that left my code unchanged. All works, much faster too since it's not serializing and deserializing a large text file any more.
I'm going to move the initialization of the fXmlFile out of the JSP and into the servlet, define it as a session object, and pass it in as a part of the request because right now I'm having to declare it as null right before I test to see if it's null, which seems self-defeating. Other than that, it's all good.
Thanks eldjon.
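For reference, parsing the String directly (no file involved) and running XPath against it could look like this sketch. The element and attribute names (collection, item, path) and the class name InMemoryXml are invented for the example; the real response format is not shown in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: keeps the XML in memory instead of writing it to disk.
final class InMemoryXml {

    // Parses XML held in a String by wrapping it in a StringReader.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Swaps the slashes as in the question, then queries the in-memory document.
    static String firstPath(String rawXml) throws Exception {
        Document doc = parse(rawXml.replace("\\", "/"));
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate("/collection/item[1]/@path", doc);
    }

    public static void main(String[] args) throws Exception {
        String raw = "<collection><item path=\"a\\b\\c\"/></collection>";
        System.out.println(firstPath(raw));
    }
}
```

The parsed Document (or just the String) can then be stored as a session attribute, so the REST call is made only once per session.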

reading and updating a large xml file in java

I have an XML file about 400 MB
I need to find a specific element and then reformat its date attribute from mm-dd-yyyy to dd-mm-yyyy
Here is the code that I am using
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(inputXML);
doc.getDocumentElement().normalize();
//format the date
NodeList nodes = doc.getElementsByTagName("empDetails");
for (int i = 0; i < nodes.getLength(); i++){
String oldDate =nodes.item(i).getAttributes().getNamedItem("doj").getNodeValue();
String newValue = oldDate.substring(3, 5) + "-" + oldDate.substring(0, 2) + oldDate.substring(5); // mm-dd-yyyy -> dd-mm-yyyy
nodes.item(i).getAttributes().getNamedItem("doj").setTextContent(newValue);
}
//now write back to file
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer;
transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File(fileName));
transformer.transform(source, result);
However this is throwing out of memory
On windows 32 bit - it fails
So I tried this on a unix box and set the memory to :
java -Xmx3072m -classpath . MyTest
It did run for some time but failed again
Question - is it possible to handle a 400 MB file where I want to selectively update and save? (I am sure the answer is yes.)
Is my code bad - anything that I should change ? ( no unix shell scripts as an alternate solution please - my intent is to use java )
should I be bumping up the heap size further ?
Thanks,
satish
It would probably be better to use the StAX API to read the document as a stream while immediately writing out (again using StAX) the parts you don't want to change to a temporary file. When you get to a part you are interested in, change the values before writing it to the temporary file. When you are done, you can rename the temporary file over the old one.
I'd recommend the XMLEventReader and XMLEventWriter. XMLEvents you don't care about you can pass directly through from reader to writer. This will only keep small parts of the document you are working on in memory.
XMLEventReader reader = ...;
XMLEventWriter writer = ...;
XMLEvent cursor;
while(reader.hasNext()){
cursor = reader.nextEvent();
if(doICareAboutThisEvent(cursor)){
writer.add(changeEvent(cursor));
}else{
writer.add(cursor);
}
}
Obviously the implementation can be more complicated and your decisions about which elements to care about and edit can be more complicated than the state of a single element. This is just a very simple example.
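Filled in for the case from the question, the loop might look like the following sketch. It assumes the dates are always well-formed mm-dd-yyyy strings and that only the doj attribute of empDetails elements needs changing; the class and method names (DojRewriter, swapMonthAndDay, rewrite) are made up.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical streaming rewriter for the doj attribute.
final class DojRewriter {

    // "12-31-2020" -> "31-12-2020"; assumes a well-formed mm-dd-yyyy input.
    static String swapMonthAndDay(String mmddyyyy) {
        String[] parts = mmddyyyy.split("-");
        return parts[1] + "-" + parts[0] + "-" + parts[2];
    }

    static void rewrite(Reader in, Writer out) throws XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "empDetails".equals(event.asStartElement().getName().getLocalPart())) {
                StartElement start = event.asStartElement();
                // Copy the attributes, replacing only doj.
                List<Attribute> attributes = new ArrayList<>();
                Iterator<?> it = start.getAttributes();
                while (it.hasNext()) {
                    Attribute attribute = (Attribute) it.next();
                    if ("doj".equals(attribute.getName().getLocalPart())) {
                        attribute = events.createAttribute(
                                attribute.getName(), swapMonthAndDay(attribute.getValue()));
                    }
                    attributes.add(attribute);
                }
                event = events.createStartElement(
                        start.getName(), attributes.iterator(), start.getNamespaces());
            }
            // Everything else passes straight through from reader to writer.
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        rewrite(new StringReader("<emps><empDetails doj=\"12-31-2020\"/></emps>"), out);
        System.out.println(out);
    }
}
```

For the 400 MB file, the Reader and Writer would wrap the input file and a temporary output file instead of Strings; memory use then stays roughly constant regardless of file size.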

How to remove extra empty lines from XML file?

In short: I have many empty lines generated in an XML file, and I am looking for a way to remove them as a way of cleaning the file. How can I do that?
For detailed explanation; I currently have this XML file :
<recent>
<paths>
<path>path1</path>
<path>path2</path>
<path>path3</path>
<path>path4</path>
</paths>
</recent>
And I use this Java code to delete all tags and add new ones instead:
public void savePaths( String recentFilePath ) {
ArrayList<String> newPaths = getNewRecentPaths();
Document recentDomObject = getXMLFile( recentFilePath ); // Get the <recent> element.
NodeList pathNodes = recentDomObject.getElementsByTagName( "path" ); // Get all <path> nodes.
//1. Remove all old path nodes :
for ( int i = pathNodes.getLength() - 1; i >= 0; i-- ) {
Element pathNode = (Element)pathNodes.item( i );
pathNode.getParentNode().removeChild( pathNode );
}
//2. Save all new paths :
Element pathsElement = (Element)recentDomObject.getElementsByTagName( "paths" ).item( 0 ); // Get the first <paths> node.
for( String newPath: newPaths ) {
Element newPathElement = recentDomObject.createElement( "path" );
newPathElement.setTextContent( newPath );
pathsElement.appendChild( newPathElement );
}
//3. Save the XML changes :
saveXMLFile( recentFilePath, recentDomObject );
}
After executing this method a number of times I get an XML file with the right results, but with many empty lines after the "paths" tag and before the first "path" tag, like this:
<recent>
<paths>
<path>path5</path>
<path>path6</path>
<path>path7</path>
</paths>
</recent>
Does anyone know how to fix that?
------------------------------------------- Edit: Add the getXMLFile(...), saveXMLFile(...) code.
public Document getXMLFile( String filePath ) {
File xmlFile = new File( filePath );
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
return domObject;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public void saveXMLFile( String filePath, Document domObject ) {
File xmlOutputFile = null;
FileOutputStream fos = null;
try {
xmlOutputFile = new File( filePath );
fos = new FileOutputStream( xmlOutputFile );
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
transformer.setOutputProperty( "{http://xml.apache.org/xslt}indent-amount", "2" );
DOMSource xmlSource = new DOMSource( domObject );
StreamResult xmlResult = new StreamResult( fos );
transformer.transform( xmlSource, xmlResult ); // Save the XML file.
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (TransformerConfigurationException e) {
e.printStackTrace();
} catch (TransformerException e) {
e.printStackTrace();
} finally {
if (fos != null)
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
First, an explanation of why this happens — which might be a bit off since you didn't include the code that is used to load the XML file into a DOM object.
When you read an XML document from a file, the whitespace between tags actually constitutes valid DOM nodes, according to the DOM specification. Therefore, the XML parser treats each such sequence of whitespace as a DOM node (of type TEXT).
To get rid of it, there are three approaches I can think of:
1) Associate the XML with a schema, and then use setValidating(true) along with setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory. (Note: setIgnoringElementContentWhitespace will only work if the parser is in validating mode, which is why you must use setValidating(true).)
2) Write an XSL to process all nodes, filtering out whitespace-only TEXT nodes.
3) Use Java code to do this: use XPath to find all whitespace-only TEXT nodes, iterate through them and remove each one from its parent (using getParentNode().removeChild()). Something like this would do (doc would be your DOM document object):
XPath xp = XPathFactory.newInstance().newXPath();
NodeList nl = (NodeList) xp.evaluate("//text()[normalize-space(.)='']", doc, XPathConstants.NODESET);
for (int i=0; i < nl.getLength(); ++i) {
Node node = nl.item(i);
node.getParentNode().removeChild(node);
}
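The first approach (validating mode plus setIgnoringElementContentWhitespace) can be tried out with a small self-contained sketch like the one below. It assumes the "schema" is a DTD, here an internal subset written just for this example, since the parser needs the content model to know which whitespace is ignorable; the class name IgnoreWhitespaceDemo is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: a validating parser drops whitespace in element-only content.
final class IgnoreWhitespaceDemo {

    static final String XML =
            "<!DOCTYPE recent [\n"
            + "  <!ELEMENT recent (paths)>\n"
            + "  <!ELEMENT paths (path*)>\n"
            + "  <!ELEMENT path (#PCDATA)>\n"
            + "]>\n"
            + "<recent>\n"
            + "  <paths>\n"
            + "    <path>path1</path>\n"
            + "    <path>path2</path>\n"
            + "  </paths>\n"
            + "</recent>";

    // Returns how many child nodes <paths> has after parsing.
    static int childCountOfPaths(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        dbf.setIgnoringElementContentWhitespace(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Fail loudly on validity errors instead of silently continuing.
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }
        });
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("paths").item(0).getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        // Only the two <path> elements should remain as children of <paths>.
        System.out.println(childCountOfPaths(XML));
    }
}
```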
I was able to fix this by using this code after removing all the old "path" nodes :
while( pathsElement.hasChildNodes() )
pathsElement.removeChild( pathsElement.getFirstChild() );
This will remove all the generated empty spaces in the XML file.
Special thanks to MadProgrammer for commenting with the helpful link mentioned above.
You could look at something like this if you only need to "clean" your xml quickly.
Then you could have a method like:
public static String cleanUp(String xml) {
final StringReader reader = new StringReader(xml.trim());
final StringWriter writer = new StringWriter();
try {
XmlUtil.prettyFormat(reader, writer);
return writer.toString();
} catch (IOException e) {
e.printStackTrace();
}
return xml.trim();
}
Also, to compare and check differences, if you need it: XMLUnit
I faced the same problem and had no idea what caused it for a long time, but after Brad's question and his own answer to it, I figured out where the trouble is.
I have to add my own answer because Brad's isn't really perfect; as Isaac said:
I wouldn't be a huge fan of blindly removing child nodes without knowing what they are
So a better "solution" (quoted because it is more of a workaround) is:
pathsElement.setTextContent("");
This completely removes the useless blank lines. It is definitely better than removing all the child nodes. Brad, this should work for you too.
But this addresses the effect, not the cause.
The cause is: when we call removeChild(), it removes the child, but it leaves the removed child's indent and line break behind. And this indent_and_line_break is treated as text content.
So, to remove the cause, we need to figure out how to remove a child together with its indent. Welcome to my question about this.
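One way to remove a child together with its indent is sketched below. It assumes the indent is the whitespace-only text node directly before the element, which is the usual layout of pretty-printed XML but not guaranteed; the class and method names (RemoveWithIndent, removeWithIndent, remainingText) are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes an element and the whitespace that indented it.
final class RemoveWithIndent {

    static void removeWithIndent(Node node) {
        Node parent = node.getParentNode();
        Node previous = node.getPreviousSibling();
        // The indent and line break before the element form a whitespace-only text node.
        if (previous != null
                && previous.getNodeType() == Node.TEXT_NODE
                && previous.getNodeValue().trim().isEmpty()) {
            parent.removeChild(previous);
        }
        parent.removeChild(node);
    }

    // Removes every <path> element and returns what is left inside the root.
    static String remainingText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList paths = doc.getElementsByTagName("path");
        // Iterate backwards because the NodeList is live.
        for (int i = paths.getLength() - 1; i >= 0; i--) {
            removeWithIndent(paths.item(i));
        }
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<paths>\n  <path>a</path>\n  <path>b</path>\n</paths>";
        // Only the final line break before </paths> should remain.
        System.out.println("[" + remainingText(xml) + "]");
    }
}
```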
There is a very simple way to get rid of the empty lines if using a DOM handling API (for example DOM4J):
place the text you want to keep in a variable(ie text)
set the node text to "" using node.setText("")
set the node text to text using node.setText(text)
et voila! there are no more empty lines. The other answers delineate very well how the extra empty lines in the xml output are actually extra nodes of type text.
This technique can be used with any DOM parsing system, so long as the name of the text setting function is changed to suit the one in your API, hence the way of representing it slightly more abstractly.
Hope this helps:)
When I used dom4j to remove some elements, I ran into the same problem, and the solutions above did not work without adding some other required jars. Finally, I found a simple solution that only needs the JDK io package:
Use BufferedReader to read the XML file and filter out empty lines.
StringBuilder stringBuilder = new StringBuilder();
FileInputStream fis = new FileInputStream(outFile);
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader br = new BufferedReader(isr);
String s;
while ((s = br.readLine()) != null) {
if (s.trim().length() > 0) {
stringBuilder.append(s).append("\n");
}
}
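Write the string to the XML file:
FileOutputStream fou = new FileOutputStream(outFile); // assumption: fou was not shown in the original; here it writes back to the same file
OutputStreamWriter osw = new OutputStreamWriter(fou);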
BufferedWriter bw = new BufferedWriter(osw);
String str = stringBuilder.toString();
bw.write(str);
bw.flush();
Remember to close all the streams.
In my case, I converted it to a string then just did a regex:
//save as String
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
tr.transform(new DOMSource(document), result);
strResult = writer.toString();
//remove empty lines
strResult = strResult.replaceAll("\\n\\s*\\n", "\n");
Couple of remarks:
1) When you are manipulating XML (removing elements / adding new ones), I strongly advise you to use XSLT (and not DOM).
2) When you transform an XML document with XSLT (as you do in your save method), set OutputKeys.INDENT to "no".
3) For simple post-processing of your XML (removing whitespace, comments, etc.) you can use a simple SAX2 filter.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringElementContentWhitespace(true);
I am using below code:
System.out.println("Start remove textnode");
i=0;
while (parentNode.getChildNodes().item(i)!=null) {
System.out.println(parentNode.getChildNodes().item(i).getNodeName());
if (parentNode.getChildNodes().item(i).getNodeName().equalsIgnoreCase("#text")) {
parentNode.removeChild(parentNode.getChildNodes().item(i));
System.out.println("text node removed");
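} else {
i=i+1; // advance only when nothing was removed, otherwise the node after a removed one is skipped
}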
}
Very late answer, but maybe it is still helpful to someone.
I had this code in my class, where the document is built after transformation (Just like you):
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
Change the last line to
transformer.setOutputProperty(OutputKeys.INDENT, "no");
