Getting NULL pointer exception net.sf.saxon.event.ReceivingContentHandler.startElement in DaisyDiff - java

I'm using DaizyDIff library to compare two html files. I wrote a java code to implement the DaizyDiff. but while running I'm getting NULL pointer exception on net.sf.saxon.event.ReceivingContentHandler.startElement
I have tries multiple approach on SAXTransformerFactory , but I couldn't figure out
public static void daisyDiffTest() throws Exception {
String html1 = "<html><body>var v2</body></html>";
String html2 = "<html> \n <body> \n Hello world \n </body> \n </html>";
try {
StringWriter finalResult = new StringWriter();
SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler result = tf.newTransformerHandler();
result.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
result.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
result.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
result.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
result.setResult(new StreamResult(finalResult));
ContentHandler postProcess = result;
Locale val = Locale.ENGLISH;
DaisyDiff.diffHTML(new InputSource(new StringReader(html1)), new InputSource(new StringReader(html2)),
postProcess, "test", val);
System.out.println(finalResult.toString());
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Expected result would be diff in the HTML file.

It's hard to know without knowing what DaisyDiff is, or what calls it makes. It's quite possible that it's not tested or supported for use with Saxon.
The format of data passed to the startElement() event in a SAX ContentHandler depends on the configuration options of the XML parser, and the problem when Saxon is invoked as a ContentHandler in this way is that it has no way of discovering what configuration options the parser is using.
As stated in the Javadoc documentation here: http://www.saxonica.com/documentation/index.html#!javadoc/net.sf.saxon.event/ReceivingContentHandler#startElement if the events emitted by the parser don't correspond to what an appropriately configured parser would emit, the ReceivingContentHandler will fail in unpredictable ways.
Posting the stack trace of the exception might be useful.

I had the same issue. The good news first, found a way thanks to this post to solve the issue:
Instead of SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance(); instantiate directly SAXTransformerFactory tf = new org.apache.xalan.processor.TransformerFactoryImpl(); (make sure it's the apache xalan one!).
My stacktrace for completeness:
Exception in thread "main" java.lang.NullPointerException
at net.sf.saxon.event.ReceivingContentHandler.startElement(ReceivingContentHandler.java:310)
at org.outerj.daisy.diff.html.HtmlSaxDiffOutput.generateOutput(HtmlSaxDiffOutput.java:147)

Related

Storing an XML resource as an object rather than a file

Please turn your phasers to "noob".
As a part of my Java Servlet, I make a call to a REST resource and accept the text file returned, as below:
// check to see if the file really exists (i.e. a session is in
// progress) or we need to create one
// this should save constantly hitting the server for a new file for
// every transaction.
if (fXmlFile.exists()) {
} else {
File collectionTree = new File(bscConnector.GetCollection());
PrintWriter xmlfile = new PrintWriter(directoryName + "/outputString.xml");
xmlfile.println(collectionTree);
xmlfile.close();
}
From there I run a search and replace on it to make it valid XML file so that I can actually run xpath queries against it:
SearchAndReplace sAndR = new SearchAndReplace();
// Swap the slashes so we can actually
// query the freakin' document.
sAndR.readFiles(fXmlFile, "\\", "/");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
Document doc = null;
try {
dBuilder = dbFactory.newDocumentBuilder();
doc = dBuilder.parse(fXmlFile);
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// optional, but recommended
// read this -
// http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
doc.getDocumentElement().normalize();
// Create an instance of an xpath object
XPath xPath = XPathFactory.newInstance().newXPath();
And then I go to town on it with various xpath queries that create the interface, yadda yadda.
My question is this; while this approach works, it seems freakishly weird to be creating and querying an actual file on the server rather than doing all this in a session object, but I can't find the correct way of doing this; what object/set of objects should I be using instead of this serialize-to-disk-and-read approach?
Thanks.
This question turned out to be so simple I'm considering deleting it just to prevent polluting stackoverflow; it was a basic misunderstanding of what Java could do with a String. I replaced all the file manipulation stuff with:
String fXmlFile = null;
if (fXmlFile != null) {} else {
File collectionTree = new File(bscConnector.GetCollection());
fXmlFile = collectionTree.toString();
fXmlFile = fXmlFile.replace("\\", "/");
}
and other than that left my code unchanged. All works, much faster too since it's not serializing and deserializing a large text file any more.
I'm going to move the initialization of the fXmlFile out of the JSP and into the servlet, define it as a session object, and pass it in as a part of the request because right now I'm having to declare it as null right before I test to see if it's null, which seems self-defeating. Other than that, it's all good.
Thanks eldjon.

How to remove extra empty lines from XML file?

In short; i have many empty lines generated in an XML file, and i am looking for a way to remove them as a way of leaning the file. How can i do that ?
For detailed explanation; I currently have this XML file :
<recent>
<paths>
<path>path1</path>
<path>path2</path>
<path>path3</path>
<path>path4</path>
</paths>
</recent>
And i use this Java code to delete all tags, and add new ones instead :
public void savePaths( String recentFilePath ) {
ArrayList<String> newPaths = getNewRecentPaths();
Document recentDomObject = getXMLFile( recentFilePath ); // Get the <recent> element.
NodeList pathNodes = recentDomObject.getElementsByTagName( "path" ); // Get all <path> nodes.
//1. Remove all old path nodes :
for ( int i = pathNodes.getLength() - 1; i >= 0; i-- ) {
Element pathNode = (Element)pathNodes.item( i );
pathNode.getParentNode().removeChild( pathNode );
}
//2. Save all new paths :
Element pathsElement = (Element)recentDomObject.getElementsByTagName( "paths" ).item( 0 ); // Get the first <paths> node.
for( String newPath: newPaths ) {
Element newPathElement = recentDomObject.createElement( "path" );
newPathElement.setTextContent( newPath );
pathsElement.appendChild( newPathElement );
}
//3. Save the XML changes :
saveXMLFile( recentFilePath, recentDomObject );
}
After executing this method a number of times i get an XML file with right results, but with many empty lines after the "paths" tag and before the first "path" tag, like this :
<recent>
<paths>
<path>path5</path>
<path>path6</path>
<path>path7</path>
</paths>
</recent>
Anyone knows how to fix that ?
------------------------------------------- Edit: Add the getXMLFile(...), saveXMLFile(...) code.
public Document getXMLFile( String filePath ) {
File xmlFile = new File( filePath );
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
return domObject;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public void saveXMLFile( String filePath, Document domObject ) {
File xmlOutputFile = null;
FileOutputStream fos = null;
try {
xmlOutputFile = new File( filePath );
fos = new FileOutputStream( xmlOutputFile );
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
transformer.setOutputProperty( "{http://xml.apache.org/xslt}indent-amount", "2" );
DOMSource xmlSource = new DOMSource( domObject );
StreamResult xmlResult = new StreamResult( fos );
transformer.transform( xmlSource, xmlResult ); // Save the XML file.
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (TransformerConfigurationException e) {
e.printStackTrace();
} catch (TransformerException e) {
e.printStackTrace();
} finally {
if (fos != null)
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
First, an explanation of why this happens — which might be a bit off since you didn't include the code that is used to load the XML file into a DOM object.
When you read an XML document from a file, the whitespaces between tags actually constitute valid DOM nodes, according to the DOM specification. Therefore, the XML parser treats each such sequence of whitespaces as a DOM node (of type TEXT);
To get rid of it, there are three approaches I can think of:
Associate the XML with a schema, and then use setValidating(true) along with setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory.
(Note: setIgnoringElementContentWhitespace will only work if the parser is in validating mode, which is why you must use setValidating(true))
Write an XSL to process all nodes, filtering out whitespace-only TEXT nodes.
Use Java code to do this: use XPath to find all whitespace-only TEXT nodes, iterate through them and remove each one from its parent (using getParentNode().removeChild()). Something like this would do (doc would be your DOM document object):
XPath xp = XPathFactory.newInstance().newXPath();
NodeList nl = (NodeList) xp.evaluate("//text()[normalize-space(.)='']", doc, XPathConstants.NODESET);
for (int i=0; i < nl.getLength(); ++i) {
Node node = nl.item(i);
node.getParentNode().removeChild(node);
}
I was able to fix this by using this code after removing all the old "path" nodes :
while( pathsElement.hasChildNodes() )
pathsElement.removeChild( pathsElement.getFirstChild() );
This will remove all the generated empty spaces in the XML file.
Special thanks to MadProgrammer for commenting with the helpful link mentioned above.
You could look at something like this if you only need to "clean" your xml quickly.
Then you could have a method like:
public static String cleanUp(String xml) {
final StringReader reader = new StringReader(xml.trim());
final StringWriter writer = new StringWriter();
try {
XmlUtil.prettyFormat(reader, writer);
return writer.toString();
} catch (IOException e) {
e.printStackTrace();
}
return xml.trim();
}
Also, to compare anche check differences, if you need it: XMLUnit
I faced the same problem, and I had no idea for the long time, but now, after this Brad's question and his own answer on his own question, I figured out where is the trouble.
I have to add my own answer, because Brad's one isn't really perfect, how Isaac said:
I wouldn't be a huge fan of blindly removing child nodes without knowing what they are
So, better "solution" (quoted because it is more likely workaround) is:
pathsElement.setTextContent("");
This completely removes useless blank lines. It is definitely better than removing all the child nodes. Brad, this should work for you too.
But, this is an effect, not the cause, and we got how to remove this effect, not the cause.
Cause is: when we call removeChild(), it removes this child, but it leaves indent of removed child, and line break too. And this indent_and_like_break is treated as a text content.
So, to remove the cause, we should figure out how to remove child and its indent. Welcome to my question about this.
There is a very simple way to get rid of the empty lines if using an DOM handling API (for example DOM4J):
place the text you want to keep in a variable(ie text)
set the node text to "" using node.setText("")
set the node text to text using node.setText(text)
et voila! there are no more empty lines. The other answers delineate very well how the extra empty lines in the xml output are actually extra nodes of type text.
This technique can be used with any DOM parsing system, so long as the name of the text setting function is changed to suit the one in your API, hence the way of representing it slightly more abstractly.
Hope this helps:)
When i used dom4j to remove some elements and i met the same question,the solution above not useful without adding some other required jars.Finally,i find out a simple solution only need to use JDK io pakage:
use BufferedReader to read the xml file and filter empty lines.
StringBuilder stringBuilder = new StringBuilder();
FileInputStream fis = new FileInputStream(outFile);
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader br = new BufferedReader(isr);
String s;
while ((s = br.readLine()) != null) {
if (s.trim().length() > 0) {
stringBuilder.append(s).append("\n");
}
}
write the string to the xml file
OutputStreamWriter osw = new OutputStreamWriter(fou);
BufferedWriter bw = new BufferedWriter(osw);
String str = stringBuilder.toString();
bw.write(str);
bw.flush();
remember to close all the stream
In my case, I converted it to a string then just did a regex:
//save as String
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
tr.transform(new DOMSource(document), result);
strResult = writer.toString();
//remove empty lines
strResult = strResult.replaceAll("\\n\\s*\\n", "\n");
Couple of remarks:
1) When your are manipulating XML (removing elements / adding new one) I strongly advice you to use XSLT (and not DOM)
2) When you tranform a XML Document by XSLT (as you do in your save method), set the OutputKeys.INDENT to "no"
3) For simple post processing of your xml (removing white space, comments, etc.) you can use a simple SAX2 filter
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringElementContentWhitespace(true);
I am using below code:
System.out.println("Start remove textnode");
i=0;
while (parentNode.getChildNodes().item(i)!=null) {
System.out.println(parentNode.getChildNodes().item(i).getNodeName());
if (parentNode.getChildNodes().item(i).getNodeName().equalsIgnoreCase("#text")) {
parentNode.removeChild(parentNode.getChildNodes().item(i));
System.out.println("text node removed");
}
i=i+1;
}
Very late answer, but maybe it is still helpful to someone.
I had this code in my class, where the document is built after transformation (Just like you):
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
Change the last line to
transformer.setOutputProperty(OutputKeys.INDENT, "no");

Validating Xml, Xsd and transform to other Xml

I was trying to input an xml, xsd in java and validate them. Next if i found them true, then i have to transform the input xml to other xml. I have done my program in such a way that though the xml and xsd are not validating true... I could transform the xml to other xml using xsl which i dont want to. A help would be needed.
Code is below: Here in this code, XsdFile, XslFile all are predefined to the folder where i would like to test. They are just objects. alternative, to think like a file "c:\abc.xml". I am only concerned at the condition if(doc!=null). What should i give there to work fine. The following code is working though the validation fails.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true);
factory.setAttribute(
"http://java.sun.com/xml/jaxp/properties/schemaLanguage",
"http://www.w3.org/2001/XMLSchema");
factory
.setAttribute(
"http://java.sun.com/xml/jaxp/properties/schemaSource",
XsdFile);
try {
System.out.println();
DocumentBuilder parser = factory.newDocumentBuilder();
Document doc = parser.parse(XmlFile);
if (doc != null) {
Source xmlInput = new StreamSource(new File(XmlFile));
Source xsl = new StreamSource(new File(inputXslFile));
Result xmlOutput = new StreamResult(new File(transformedXml));
try {
Transformer transformer = TransformerFactory.newInstance()
.newTransformer(xsl);
transformer.transform(xmlInput, xmlOutput);
System.out.println("The transformed xml is:" + xmlOutput);
} catch (TransformerException e) {
// Handle.
}
}
} catch (ParserConfigurationException e) {
System.out.println("Parser not configured: " + e.getMessage());
} catch (SAXException e) {
System.out.print("Parsing XML failed due to a "
+ e.getClass().getName() + ":");
System.out.println(e.getMessage());
} catch (IOException e) {
e.printStackTrace();
}
It's kind of annoying. the default ErrorHandler for XML parsing just prints out the errors. So, when validation fails, you'll just get an error message. you should specify a custom ErrorHandler on the DocumentBuilderFactory which throws an exception when the validation fails.
as a side note, since you've already parsed the xml into DOM, you should use a DOMSource for your transformation (so you don't need to re-parse the xml in order to transform it).

Extremely slow XSLT transformation in Java

I try to transform XML document using XSLT. As an input I have www.wordpress.org XHTML source code, and XSLT is dummy example retrieving site's title (actually it could do nothing - it doesn't change anything).
Every single API or library I use, transformation takes about 2 minutes! If you take a look at wordpress.org source, you will notice that it is only 183 lines of code. As I googled it is probably due to DOM tree building. No matter how simple XSLT is, it is always 2 minutes - so it confirms idea that it's related to DOM building, but anyway it should not take 2 minutes in my opinion.
Here is an example code (nothing special):
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = null;
try {
transformer = tFactory.newTransformer(
new StreamSource("/home/pd/XSLT/transf.xslt"));
} catch (TransformerConfigurationException e) {
e.printStackTrace();
}
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
System.out.println("START");
try {
transformer.transform(new SAXSource(new InputSource(
new FileInputStream("/home/pd/XSLT/wordpress.xml"))),
new StreamResult(outputStream));
} catch (TransformerException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("STOP");
System.out.println(new String(outputStream.toByteArray()));
It's between START and STOP where java "pauses" for 2 minutes. If I take a look at the processor or memory usage, nothing increases. It looks like really JVM stopped...
Do you have any experience in transforming XMLs that are longer than 50 (this is random number ;)) lines? As I read XSLT always needs to build DOM tree in order to do its work. Fast transformation is crucial for me.
Thanks in advance,
Piotr
Does the sample HTML file use namespaces? If so, your XML parser may be attempting to retrieve contents (a schema, perhaps) from the namespace URIs. This is likely if each run takes exactly two minutes -- it's likely one or more TCP timeouts.
You can verify this by timing how long it takes to instantiate your InputSource object (where the WordPress XML is actually parsed), as this is likely the line which is causing the delay. After reviewing the sample file you posted, it does include a declared namespace (xmlns="http://www.w3.org/1999/xhtml").
To work around this, you can implement your own EntityResolver which essentially disables the URL-based resolution. You may need to use a DOM -- see DocumentBuilder's setEntityResolver method.
Here's a sample using DOM and disabling resolution (note -- this is untested):
try {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbFactory.newDocumentBuilder();
db.setEntityResolver(new EntityResolver() {
#Override
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
return null; // Never resolve any IDs
}
});
System.out.println("BUILDING DOM");
Document doc = db.parse(new FileInputStream("/home/pd/XSLT/wordpress.xml"));
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer(
new StreamSource("/home/pd/XSLT/transf.xslt"));
System.out.println("RUNNING TRANSFORM");
transformer.transform(
new DOMSource(doc.getDocumentElement()),
new StreamResult(outputStream));
System.out.println("TRANSFORMED CONTENTS BELOW");
System.out.println(outputStream.toString());
} catch (Exception e) {
e.printStackTrace();
}
If you want to use SAX, you would have to use a SAXSource with an XMLReader which uses your custom resolver.
The commenters who've posted that the answer likely resides with the EntityResolver are probably correct. However, the solution may not be to simply not load the schemas but rather load them from the local file system.
So you could do something like this
db.setEntityResolver(new EntityResolver() {
#Override
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
try {
FileInputStream fis = new FileInputStream(new File("classpath:xsd/" + systemId));
InputSource is = new InputSource(fis);
return is
} catch (FileNotFoundException ex) {
logger.error("File Not found", ex);
return null;
}
}
});
Chances are the problem isn't with the call transfomer.transform. It's more likely that you are doing something in your xslt that is taking forever. My suggestion would be use a tool like Oxygen or XML Spy to profile your XSLT and find out which templates are taking the longest to execute. Once you've determined this you can begin to optimize the template.
If you are debugging your code on an android device, make sure you try it without eclipse attached to the process. When I was debugging my app xslt transformations were taking 8 seconds, where the same process took a tenth of a second on ios in native code. Once I ran the code without eclipse attached to it, the process took a comparable amount of time to the c based counterpart.

Transformation error: "The current event is not START_ELEMENT but 2"

Similarly to an older post I'm trying to access a web service with JAX-WS using:
Dispatch<Source> sourceDispatch = null;
sourceDispatch = service.createDispatch(portQName, Source.class, Service.Mode.PAYLOAD);
Source result = sourceDispatch.invoke(new StreamSource(new StringReader(req)));
System.out.println(sourceToXML(result));
where:
private static String sourceToXML(Source result) {
Node rootNode= null;
try {
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer();
DOMResult domResult = new DOMResult();
transformer.transform(result, domResult );
rootNode = (Node) domResult.getNode();
} catch (TransformerException e) {
e.getMessage();
}
return rootNode.getFirstChild().getNodeValue();
}
but I get the error 'The current event is not START_ELEMENT null but 2' (I think on the transformer)
What am I doing wrong :(
Presumably from the parser. I would say a stack trace would be helpful, but Xerces/Xalan has a tendency for screwing those up.
Obvious steps to take:
Try looking at the result as a string.
Try parsing with a parser, ignoring the transformer for now.
Try finding out exactly what the error was.
You should amend your statement
e.getMessage()
to actually print the error message :-) That should help.
System.err.println(e.getMessage());
or preferably
e.printStackTrace();

Categories