Optimized way to normalize XML - Java

What is the best and most efficient way to normalize XML in Java?
We persist XML in a database. Before persisting it, we want to normalize it: remove the indentation and store the whole document as a single line, because the original XML takes a lot of space. We currently use the Java DocumentBuilder to remove the indentation, and under heavy load it consumes a lot of memory and causes high CPU usage.
We persist different types of XML documents to the DB, and some of them are very large. Here is a sample of the code we are using. Any suggestions on how we can optimize it?
ByteArrayInputStream payloadStream = new ByteArrayInputStream(payload.getBytes(XML_ENCODING));
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
DocumentBuilder dBuilder = factory.newDocumentBuilder();
Document doc = dBuilder.parse(payloadStream);
doc.getDocumentElement().normalize();
Transformer trans = TransformerFactory.newInstance().newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, STRING_YES);
trans.setOutputProperty(OutputKeys.INDENT, STRING_NO);
trans.setOutputProperty(INDENT_PROP, INDENT_AMOUNT);
StringWriter sw = new StringWriter();
trans.transform(new DOMSource(doc), new StreamResult(sw));
String xmlString = sw.toString();

Don't use DocumentBuilder; use a StAX or SAX parser. They need hardly any memory because they don't build a model of the document. You get an element and you write it out.
Instead of (or in addition to) whitespace removal and normalization, consider compression. It makes the document much smaller, and the cost of the indentation becomes close to zero.
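As a rough illustration of the compression suggestion, the sketch below gzips the XML text before it is stored and restores it on read. The class name XmlCompressor and the choice of GZIP and UTF-8 are assumptions made for this example; the answer does not name a particular algorithm, and the database column would have to be a binary type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original question or answer.
final class XmlCompressor {

    // Compress the XML text into a gzip byte array suitable for a BLOB column.
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Restore the original XML text from the stored bytes (readAllBytes needs Java 9+).
    static String decompress(byte[] stored) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Repeated indentation and tag names compress very well, so this may save more space than stripping whitespace alone, at the price of the stored value no longer being readable or searchable as text.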
Personally, I find SAX simpler to use than StAX (though the majority would disagree). You extend DefaultHandler with a couple of methods, as in this example. Since you don't care about the content, all you need to do is write it out, e.g., using an XMLStreamWriter.
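A minimal sketch of the SAX suggestion, assuming that whitespace-only text between tags is insignificant and that comments, processing instructions and the XML declaration may be dropped. The class name SaxMinifier is made up for this example, and the parser is left non-namespace-aware so that prefixes and xmlns attributes are copied through verbatim.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: copies SAX events to an XMLStreamWriter, dropping whitespace-only text.
final class SaxMinifier extends DefaultHandler {
    private final XMLStreamWriter out;
    private final StringBuilder text = new StringBuilder(); // text seen since the last tag

    private SaxMinifier(XMLStreamWriter out) {
        this.out = out;
    }

    static String minify(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance(); // not namespace-aware by default
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        StringWriter result = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(result);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new SaxMinifier(writer));
        writer.close();
        return result.toString();
    }

    // characters() may deliver one text node in several chunks, so decide only at the next tag.
    private void flushText() throws XMLStreamException {
        if (!text.toString().trim().isEmpty()) {
            out.writeCharacters(text.toString());
        }
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        try {
            flushText();
            out.writeStartElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.writeAttribute(atts.getQName(i), atts.getValue(i));
            }
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        try {
            flushText();
            out.writeEndElement();
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }
}
```

Calling SaxMinifier.minify(payload) would replace the DocumentBuilder and Transformer pair from the question. For really large payloads it would be better to pass an InputStream and an OutputStream instead of Strings, so that the document is never held in memory as a whole.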

You could run a streaming XSLT 3.0 transformation that simply does
<xsl:strip-space elements="*"/>
<xsl:mode on-no-match="deep-copy" streamable="yes"/>
<xsl:output method="xml" indent="no"/>

Related

Java edit XML file with DOM

I have hit somewhat of a roadblock.
My goal is to filter out everything except the number.
Here is the xml file
<?xml version="1.0" encoding="utf-8" ?>
<orders>
<order>
<stuff>"Some random information and # 123456"</stuff>
</order>
</orders>
Here is my incomplete code. I don't know how to find it nor how to go about making the change I want.
public static void main(String argv[]) {
try {
// Read the file
File inputFile = new File("C:\\filepath...\\asdf.xml");
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(inputFile);
// I don't know where to go from there
NodeList filter = doc.getChildNodes();
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult consoleResult = new StreamResult(System.out);
transformer.transform(source, consoleResult);
} catch (Exception e) {
e.printStackTrace();
}
}
When you use
Transformer transformer = transformerFactory.newTransformer();
the transformer is an "identity transformer" - it copies the input to the output with no change. In effect you're using the identity transformer here for serialization only, to convert the DOM to lexical XML.
If you want to make actual changes to the XML content, you have two choices: either write Java code to modify the in-memory DOM tree before serialising it, or write XSLT code so your Transformer is doing a real transformation not just an identity transformation. XSLT is almost certainly the better approach except that it involves more of a learning curve.
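For the first option (modifying the in-memory DOM in Java), a sketch could look like the following. It assumes the goal is to keep only the digits in every stuff element; the class name DomDigitFilter and that reading of the requirement are assumptions, not something the question confirms.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: edits the DOM first, then serializes it with the identity transformer.
final class DomDigitFilter {

    static String keepDigits(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Replace the text of every <stuff> element with just its digits.
        NodeList stuffNodes = doc.getElementsByTagName("stuff");
        for (int i = 0; i < stuffNodes.getLength(); i++) {
            Node stuff = stuffNodes.item(i);
            stuff.setTextContent(stuff.getTextContent().replaceAll("[^0-9]", ""));
        }

        // The identity transformer only serializes the modified tree.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```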
I'm not sure exactly what output you want, which makes it difficult to give you working code. The phrase "filter out" is unfortunately ambiguous: when people say "I want to filter out X" they sometimes mean they want to remove X, and sometimes they mean they want to remove everything except X. Also, "removing the number" isn't a complete specification unless we know all possibilities of what might appear in your document. For example, is the number always preceded by "#", or is that only the case in this one example input? But one approach would be to remove all digits, which you could do with a call on translate(., '0123456789', '').
Note that if you're using XSLT you don't need to construct a DOM first; in fact, it's a waste of time and space. Just supply the lexical XML as input to the transformer, in the form of a StreamSource.
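A sketch of the XSLT route, assuming XSLT 1.0 as supported by the JDK's built-in transformer. The stylesheet is an identity copy plus one template that applies the translate() call mentioned above to the text of stuff elements; the class name DigitStripper is hypothetical. The input is fed in as a StreamSource, so no DOM is built by hand.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: runs a real transformation instead of the identity transformer.
final class DigitStripper {

    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Identity template: copy everything that no other template matches.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Text inside <stuff>: remove all digits.
      + "<xsl:template match='stuff/text()'>"
      + "<xsl:value-of select=\"translate(., '0123456789', '')\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String removeDigits(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

If the intent is the opposite (keep only the digits), the usual XSLT 1.0 idiom is the nested call translate(., translate(., '0123456789', ''), ''), which deletes every character that is not a digit.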

Disable automatic ampersand escaping in XML?

Consider:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("list");
doc.appendChild(root);
for(CorrectionEntry correction : dictionary){
Element elem = doc.createElement("elem");
elem.setAttribute("from", correction.getEscapedFrom());
elem.setAttribute("to", correction.getEscapedTo());
root.appendChild(elem);
}
(then follows the writing of the document into an XML file)
where getEscapedFrom and getEscapedTo return (in my code) something like fink&#xE9; if the originating word is finké, so as to perform a Unicode escape for the characters that are bigger than 127.
The problem is that the final XML has the following line <elem from="finke" to="fink&amp;#xE9;" /> (from is finke, to is finké) where I would like it to be <elem from="finke" to="fink&#xE9;" />
I've tried, following another response in StackOverflow, to disable escaping of ampersands putting the line doc.appendChild(doc.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, "&")); after the creation of the doc but without success.
How could I "tell XML" not to escape ampersands? Or, conversely, how could I get "XML" to convert é, or \u00E9, to &#xE9;?
Update
I managed to narrow down the problem: up until the writing of the file, the node (as seen in the debugger) seems to contain the right string. Once I call transformer.transform(domSource, streamResult); everything goes wild.
DOMSource domSource = new DOMSource(doc);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
StreamResult streamResult = new StreamResult(baos);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(domSource, streamResult);
System.out.println(baos.toString());
The problem seems to be the transformer.
Try setting setOutputProperty("encoding", "us-ascii") on the transformer. That tells the serializer to produce the output using ASCII characters only, which means any non-ASCII character will be escaped. But you can't control whether it will be a decimal or hex escape (unless you use Saxon-PE or higher as your Transformer, in which case there's a serialization option to control this).
It's never a good idea to try to do the serialization "by hand". For at least three reasons: (a) you'll get it wrong (we see a lot of SO questions caused by people producing bad XML this way), (b) you should be working with the tools, not against them, (c) the people who wrote the serializers understand XML better than you do, and they know what's expected of them. You're probably working to requirements written by someone whose understanding of XML is very superficial.
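To illustrate the us-ascii suggestion, here is a small sketch in which the attribute holds the plain character é and the serializer is left to do the escaping. The class name AsciiSerializer and the sample values are assumptions for this example; as noted above, whether the reference comes out in decimal or hex depends on the serializer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: no manual escaping, the output encoding drives the character references.
final class AsciiSerializer {

    // Build a document like the one in the question, storing the raw character.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);
        Element elem = doc.createElement("elem");
        elem.setAttribute("from", "finke");
        elem.setAttribute("to", "fink\u00E9"); // plain é, not a hand-made "&#xE9;" string
        root.appendChild(elem);
        return doc;
    }

    // Serialize with an ASCII output encoding so non-ASCII characters become references.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(baos));
        return new String(baos.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

The point of the design is that the DOM always holds real characters; the ampersand in a pre-escaped string is what gets escaped a second time.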

TransformerFactory - avoiding network lookups to verify DTDs

I need to program an offline transformation of XML documents.
I have been able to stop DTD network lookups when loading the original XML file with the following :
DocumentBuilderFactory factory;
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setNamespaceAware(true);
factory.setFeature("http://xml.org/sax/features/namespaces", false);
factory.setFeature("http://xml.org/sax/features/validation", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
// open up the xml document
docbuilder = factory.newDocumentBuilder();
doc = docbuilder.parse(new FileInputStream(m_strFilePath));
However, I am unable to apply this to the TransformerFactory object.
The DTDs are available locally, but I do not know how to direct the transformer to look at the local files as opposed to trying to do a network lookup.
From what I can see, the transformer needs these documents to correctly do the transformation.
For information, I am transforming MusicXML documents from Partwise to Timewise.
As you have probably guessed, XSLT is not my strong point (far from it).
Do I need to modify the XSLT files to reference local files, or can this be done differently ?
Further to the comments below, here is an excerpt of the xsl file. It is the only place that I see which refers to an external file :
<!--
XML output, with a DOCTYPE refering the timewise DTD.
Here we use the full Internet URL.
-->
<xsl:output method="xml" indent="yes" encoding="UTF-8"
omit-xml-declaration="no" standalone="no"
doctype-system="http://www.musicxml.org/dtds/timewise.dtd"
doctype-public="-//Recordare//DTD MusicXML 2.0 Timewise//EN" />
Is the mentioned technique valid for this also ?
The DTD file contains references to a number of MOD files like this :
<!ENTITY % layout PUBLIC
"-//Recordare//ELEMENTS MusicXML 2.0 Layout//EN"
"layout.mod">
I presume that these files will also be imported in turn.
Ok, here is the answer which works for me.
1st step : load the original document, turning off validation and dtd loading within the factory.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// stop the network loading of DTD files
factory.setValidating(false);
factory.setNamespaceAware(true);
factory.setFeature("http://xml.org/sax/features/namespaces", false);
factory.setFeature("http://xml.org/sax/features/validation", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
// open up the xml document
DocumentBuilder docbuilder = factory.newDocumentBuilder();
Document doc = docbuilder.parse(new FileInputStream(m_strFilePath));
2nd step : Now that I have got the document in memory ... and after having detected that I need to transform it -
TransformerFactory transformfactory = TransformerFactory.newInstance();
Templates xsl = transformfactory.newTemplates(new StreamSource(new FileInputStream((String)m_XslFile)));
Transformer transformer = xsl.newTransformer();
Document newdoc = docbuilder.newDocument();
Result XmlResult = new DOMResult(newdoc);
// now transform
transformer.transform(
new DOMSource(doc.getDocumentElement()),
XmlResult);
I needed to do this as I have further processing going on afterwards and did not want the overhead of outputting to file and reloading.
Little explanation :
The trick is to use the original DOM object which has had all the validation features turned off. You can see this here :
transformer.transform(
new DOMSource(doc.getDocumentElement()), // <<-----
XmlResult);
This has been tested with network access TURNED OFF.
So I know that there are no more network lookups.
However, if the DTDs, MODs, etc. are available locally, then, as per the suggestions, the use of an EntityResolver is the answer. This has to be applied, again, to the original docbuilder object.
I now have a transformed document stored in newdoc, ready to play with.
I hope this will help others.
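The EntityResolver mentioned in the answer above could be sketched as follows. It assumes all DTD and MOD files sit in one local directory and are looked up by the last path segment of the system ID; the class name LocalDtdResolver and that lookup rule are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: serves external DTDs and modules from a local directory only.
final class LocalDtdResolver implements EntityResolver {
    private final Path dtdDir;

    LocalDtdResolver(Path dtdDir) {
        this.dtdDir = dtdDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        // "http://www.musicxml.org/dtds/timewise.dtd" -> "timewise.dtd"
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        Path local = dtdDir.resolve(fileName);
        if (!Files.isRegularFile(local)) {
            // Unknown entity: hand back an empty source instead of going to the network.
            return new InputSource(new StringReader(""));
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        // A local system ID lets relative references such as "layout.mod" resolve locally too.
        source.setSystemId(local.toUri().toString());
        return source;
    }

    // Parse with the resolver installed on the DocumentBuilder.
    static Document parse(String xml, Path dtdDir) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new LocalDtdResolver(dtdDir));
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Unlike switching load-external-dtd off, this keeps entity declarations and attribute defaults from the DTD available to the parser.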
You can use a library like Apache xml-commons-resolver and write a catalog file to map web URLs to your local copy of the relevant files. To wire this catalog up to the transformer mechanism you would need to use a SAXSource instead of a StreamSource as the source of your stylesheet:
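CatalogResolver resolver = new CatalogResolver();
// A SAXSource built from an InputSource alone has no XMLReader (getXMLReader() returns null),
// so create the reader explicitly and attach the resolver to it
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
XMLReader reader = spf.newSAXParser().getXMLReader();
reader.setEntityResolver(resolver);
SAXSource styleSource = new SAXSource(reader, new InputSource("file:/path/to/stylesheet.xsl"));
TransformerFactory tf = TransformerFactory.newInstance();
tf.setURIResolver(resolver);
Transformer transformer = tf.newTransformer(styleSource);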
The usual way to do this in Java is to use an LSResourceResolver to resolve the system ID (and/or public ID) to your local file. This is documented at http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/ls/LSResourceResolver.html. You shouldn't need anything outside of standard Java XML parser features to get this working.

Sort xml attributes for pretty print using javax.xml.transform.Transformer

Is there a way I could tell the XML transformer to sort alphabetically all the attributes for the tags of a given XML? So let's say...
<MyTag paramter1="lol" andTheOtherThing="potato"/>
Would turn into
<MyTag andTheOtherThing="potato" paramter1="lol"/>
I saw how to format it from the examples I found here and here, but sorting the tag attributes would be the last issue I have.
I was hoping there was something like:
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.SORTATT, "yes"); // <-- no such thing
Which seems to be what they say:
http://docs.oracle.com/javase/1.4.2/docs/api/javax/xml/transform/OutputKeys.html
As mentioned by forty-two, you can make canonical XML from the XML and that will order the attributes alphabetically for you.
In Java we can use something like Apache's Canonicalizer
org.apache.xml.security.c14n.Canonicalizer
Something like this (assuming that the Document inXMLDoc is already a DOM):
Document retDoc;
byte[] c14nOutputbytes;
DocumentBuilderFactory factory;
DocumentBuilder parser;
// CANONICALIZE THE ORIGINAL DOM
c14nOutputbytes = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_WITH_COMMENTS).canonicalizeSubtree(inXMLDoc.getDocumentElement());
// PARSE THE CANONICALIZED BYTES (IF YOU WANT ANOTHER DOM) OR JUST USE THE BYTES
factory = DocumentBuilderFactory.newInstance();
factory.set ... // SETUP THE FACTORY
parser = factory.newDocumentBuilder();
// REPARSE TO GET ANOTHER DOM WITH THE ATTRIBUTES IN ALPHA ORDER
ByteArrayInputStream bais = new ByteArrayInputStream(c14nOutputbytes);
retDoc = parser.parse(bais);
Other things will of course change during canonicalization (the result is Canonical XML, http://en.wikipedia.org/wiki/Canonical_XML), so expect some changes other than the attribute order.

Java: need help with optimizing a part of code

I have some simple code for transforming XML, but it is very time-consuming (I have to repeat it many times). Does anyone have a recommendation for how to optimize this code? Thanks.
EDIT: This is a new version of the code. Unfortunately, I can't reuse the Transformer, since XSLTRule is different in most cases. I'm now reusing the TransformerFactory. I'm not reading from files before this, so I can't use StreamSource. The largest amount of time is spent on initialization of the Transformer.
private static TransformerFactory tFactory = TransformerFactory.newInstance();
public static String transform(String XML, String XSLTRule) throws TransformerException {
Source xmlInput = new StreamSource(new StringReader(XML));
Source xslInput = new StreamSource(new StringReader(XSLTRule));
Transformer transformer = tFactory.newTransformer(xslInput);
StringWriter resultWriter = new StringWriter();
Result result = new StreamResult(resultWriter);
transformer.transform(xmlInput, result);
return resultWriter.toString();
}
The first thing you should do is to skip the unnecessary conversion of the XML string to bytes (especially with a hardcoded, potentially incorrect encoding). You can use a StringReader and pass that to the StreamSource constructor. The same for the result: use a StringWriter and avoid the conversion.
Of course, if you call the method after converting your XML from a file (bytes) to a String in the first place (again with a potentially wrong encoding), it would be even better to have the StreamSource read from the file directly.
It seems like you apply an XSLT to an XML file. To speed things up, you can try compiling the XSLT, like with XSLTC.
I can only think of a couple of minor things:
The TransformerFactory could be reused.
The Transformer could be reused if it is thread confined, and the XSL input is the same each time.
If you can estimate the output size reasonably accurately, you could create the ByteArrayOutputStream with an initial size hint.
As stated in Michael's answer, you could potentially speed things up by not loading either the input or the output XML entirely into memory yourself, and by making your API stream-based.
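One way to combine the "compile the XSLT" and "reuse" suggestions is to cache compiled stylesheets as Templates objects, keyed by the stylesheet text. This is only a sketch: the class name CachedXslt is made up, the cache is unbounded, and it helps only if the same XSLTRule string recurs across calls, which the question says is not always the case.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: compiles each distinct stylesheet once and reuses the result.
final class CachedXslt {
    private static final TransformerFactory FACTORY = TransformerFactory.newInstance();
    private static final Map<String, Templates> CACHE = new ConcurrentHashMap<>();

    static String transform(String xml, String xsltRule) throws TransformerException {
        Templates templates = CACHE.get(xsltRule);
        if (templates == null) {
            // TransformerFactory is not guaranteed to be thread-safe, so guard the compilation.
            synchronized (FACTORY) {
                templates = FACTORY.newTemplates(new StreamSource(new StringReader(xsltRule)));
            }
            CACHE.put(xsltRule, templates);
        }
        // Templates is thread-safe; a Transformer is not, so create a fresh one per call.
        StringWriter out = new StringWriter();
        templates.newTransformer()
                 .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // Number of distinct stylesheets compiled so far.
    static int cachedStylesheets() {
        return CACHE.size();
    }
}
```

Creating a Transformer from a cached Templates object is cheap compared with parsing and compiling the stylesheet again, which is where the question reports most of the time being spent.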
