I have some simple code for transforming XML, but it is very time-consuming (I have to run it many times). Does anyone have a recommendation for how to optimize this code? Thanks.
EDIT: This is a new version of the code. Unfortunately, I can't reuse the Transformer, since XSLTRule is different in most cases. I'm now reusing the TransformerFactory. I'm not reading from files before this, so I can't have StreamSource read from a file. Most of the time is spent on initializing the Transformer.
private static TransformerFactory tFactory = TransformerFactory.newInstance();
public static String transform(String XML, String XSLTRule) throws TransformerException {
Source xmlInput = new StreamSource(new StringReader(XML));
Source xslInput = new StreamSource(new StringReader(XSLTRule));
Transformer transformer = tFactory.newTransformer(xslInput);
StringWriter resultWriter = new StringWriter();
Result result = new StreamResult(resultWriter);
transformer.transform(xmlInput, result);
return resultWriter.toString();
}
The first thing you should do is to skip the unnecessary conversion of the XML string to bytes (especially with a hardcoded, potentially incorrect encoding). You can use a StringReader and pass that to the StreamSource constructor. The same for the result: use a StringWriter and avoid the conversion.
Of course, if you call the method after converting your XML from a file (bytes) to a String in the first place (again with a potentially wrong encoding), it would be even better to have the StreamSource read from the file directly.
It seems like you apply an XSLT to an XML file. To speed things up, you can try compiling the XSLT, like with XSLTC.
I can only think of a couple of minor things:
The TransformerFactory could be reused.
The Transformer could be reused if it is thread confined, and the XSL input is the same each time.
If you can estimate the output size reasonably accurately, you could create the ByteArrayOutputStream with an initial size hint.
As stated in Michael's answer, you could potentially speed things up by not loading the input or output XML entirely into memory yourself, and by making your API stream-based instead.
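The reuse suggestions above can be sketched with a cache of compiled stylesheets. This is a minimal sketch, not a drop-in replacement: the class name CachedXslt and the unbounded map are assumptions made for illustration, and the cache only pays off when the same XSLT text recurs across calls. It relies on javax.xml.transform.Templates, which is documented as safe to share between threads, while each Transformer created from it stays confined to one call.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: caches the compiled form of each distinct stylesheet.
final class CachedXslt {
    private static final TransformerFactory FACTORY = TransformerFactory.newInstance();
    // Key: the stylesheet text. Value: its compiled, thread-safe representation.
    private static final Map<String, Templates> CACHE = new ConcurrentHashMap<>();

    static String transform(String xml, String xslt) throws TransformerException {
        Templates templates = CACHE.get(xslt);
        if (templates == null) {
            // TransformerFactory is not guaranteed to be thread-safe, so guard it.
            synchronized (FACTORY) {
                templates = FACTORY.newTemplates(new StreamSource(new StringReader(xslt)));
            }
            // Two threads may compile the same stylesheet once each; that is harmless.
            CACHE.put(xslt, templates);
        }
        // Creating a Transformer from Templates is cheap compared to compiling the XSLT.
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

If the number of distinct stylesheets is large, a bounded cache would be a safer choice than the plain map used here.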
I have hit somewhat of a roadblock.
My goal is to filter out everything except the number.
Here is the xml file
<?xml version="1.0" encoding="utf-8" ?>
<orders>
<order>
<stuff>"Some random information and # 123456"</stuff>
</order>
</orders>
Here is my incomplete code. I don't know how to find it nor how to go about making the change I want.
public static void main(String argv[]) {
try {
// Read the file
File inputFile = new File("C:\\filepath...\\asdf.xml");
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(inputFile);
// I don't know where to go from there
NodeList filter = doc.getChildNodes();
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StreamResult consoleResult = new StreamResult(System.out);
transformer.transform(source, consoleResult);
} catch (Exception e) {
e.printStackTrace();
}
}
When you use
Transformer transformer = transformerFactory.newTransformer();
the transformer is an "identity transformer" - it copies the input to the output with no change. In effect you're using the identity transformer here for serialization only, to convert the DOM to lexical XML.
If you want to make actual changes to the XML content, you have two choices: either write Java code to modify the in-memory DOM tree before serialising it, or write XSLT code so your Transformer is doing a real transformation not just an identity transformation. XSLT is almost certainly the better approach except that it involves more of a learning curve.
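The first choice (modifying the DOM in Java before serialising) might look like the sketch below. It assumes the goal is to keep only the digits inside each stuff element; the class and method names are hypothetical, and a real document may need a narrower selection than getElementsByTagName.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating in-memory DOM modification.
final class KeepDigitsDom {
    // Replaces the text of every <stuff> element with only its digits.
    static void keepDigits(Document doc) {
        NodeList nodes = doc.getElementsByTagName("stuff");
        for (int i = 0; i < nodes.getLength(); i++) {
            Node stuff = nodes.item(i);
            // Assumption: "the number" means every ASCII digit in the text.
            stuff.setTextContent(stuff.getTextContent().replaceAll("[^0-9]", ""));
        }
    }

    // Parses XML held in a String; a File could be passed to parse() instead.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

After calling keepDigits(doc), the identity transformer from the question can serialise the modified tree unchanged.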
I'm not sure exactly what output you want, which makes it difficult to give you working code. The phrase "filter out" is unfortunately ambiguous: when people say "I want to filter out X" they sometimes mean they want to remove X, and sometimes they mean they want to remove everything except X. Also, "removing the number" isn't a complete specification unless we know all the possibilities of what might appear in your document. For example, is the number always preceded by "#", or is that only the case in this one example input? But one approach would be to remove all digits, which you could do with a call on translate(., '0123456789', '').
Note that if you're using XSLT you don't need to construct a DOM first, in fact, it's a waste of time and space. Just supply the lexical XML as input to the transformer, in the form of a StreamSource.
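As a rough sketch of the XSLT route, the following feeds the lexical XML to the transformer through a StreamSource, with no DOM in between. It assumes the goal is to keep only the digits in stuff, which is done with the nested-translate idiom (the inner call collects the non-digit characters, the outer call deletes them); swapping in translate(., '0123456789', '') would remove the digits instead. The class name and the inline stylesheet are illustrative only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: identity transform plus one rule for text inside <stuff>.
final class KeepDigitsXslt {
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        // Identity template: copies everything that no other rule matches.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Text directly inside <stuff>: delete every character that is not a digit.
        + "<xsl:template match='stuff/text()'>"
        + "<xsl:value-of select=\"translate(., translate(., '0123456789', ''), '')\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        // The input goes in as a StreamSource; no DOM is built by the caller.
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

For a file on disk, new StreamSource(inputFile) could replace the StringReader-based source.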
Consider:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("list");
doc.appendChild(root);
for(CorrectionEntry correction : dictionary){
Element elem = doc.createElement("elem");
elem.setAttribute("from", correction.getEscapedFrom());
elem.setAttribute("to", correction.getEscapedTo());
root.appendChild(elem);
}
(then follows the writing of the document into an XML file)
where getEscapedFrom and getEscapedTo return (in my code) something like fink&#233; if the originating word is finké, so as to perform a Unicode escape for characters above 127.
The problem is that the final XML has the following line <elem from="finke" to="fink&amp;#233;" /> (from is finke, to is fink&amp;#233;), where I would like it to be <elem from="finke" to="fink&#233;" />
I've tried, following another response in StackOverflow, to disable escaping of ampersands putting the line doc.appendChild(doc.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, "&")); after the creation of the doc but without success.
How could I "tell XML" not to escape ampersands? Or, conversely, how could I get "XML" to convert from é, or \u00E9, to &#233;?
Update
I managed to narrow down the problem: up until the writing of the file, the node (inspected in the debugger) seems to contain the right string. Once I call transformer.transform(domSource, streamResult); everything goes wild.
DOMSource domSource = new DOMSource(doc);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
StreamResult streamResult = new StreamResult(baos);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(domSource, streamResult);
System.out.println(baos.toString());
The problem seems to be the transformer.
Try setting setOutputProperty("encoding", "us-ascii") on the transformer. That tells the serializer to produce the output using ASCII characters only, which means any non-ASCII character will be escaped. But you can't control whether it will be a decimal or hex escape (unless you use Saxon-PE or higher as your Transformer, in which case there's a serialization option to control this).
It's never a good idea to try to do the serialization "by hand". For at least three reasons: (a) you'll get it wrong (we see a lot of SO questions caused by people producing bad XML this way), (b) you should be working with the tools, not against them, (c) the people who wrote the serializers understand XML better than you do, and they know what's expected of them. You're probably working to requirements written by someone whose understanding of XML is very superficial.
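A small sketch of that suggestion: the attribute holds the raw character and the serializer is asked for ASCII output, so it produces the character reference itself. The class name and sample document are assumptions for illustration; with the JDK's built-in transformer the reference typically comes out in decimal form, but as noted above that choice belongs to the serializer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: lets the serializer escape non-ASCII characters.
final class AsciiOutput {
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // ASCII-only output forces character references for anything above 127.
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(bytes));
        // Decode with the same charset that was requested for the output.
        return new String(bytes.toByteArray(), StandardCharsets.US_ASCII);
    }

    // Builds a one-entry document like the one in the question.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);
        Element elem = doc.createElement("elem");
        elem.setAttribute("from", "finke");
        elem.setAttribute("to", "fink\u00E9"); // raw character, no manual escaping
        root.appendChild(elem);
        return doc;
    }
}
```

With this approach getEscapedFrom and getEscapedTo are no longer needed; the plain strings can be stored in the DOM.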
I am getting garbled characters when transforming a DOM Document to a ByteArrayOutputStream. The Document object is fine in terms of content; I mean, tag content with Latin characters (e.g. "ç", "ú", "Ú", "ã", etc.) is right. However, the transformation to ByteArrayOutputStream results in weird characters. For instance, the Latin character "Ú" is presented as "Ãš" (two bytes). Clearly, this is a problem related to encoding conversion, but I didn't expect to get this at this point.
The conversion from DOM Document to ByteArrayOutputStream is performed by the following method:
private String write(final Document doc) throws TransformerException {
ByteArrayOutputStream os = new ByteArrayOutputStream();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
trans.transform(new DOMSource(doc), new StreamResult(os)); //Here is the problem
String xmlData = os.toString(); //The conversion error is passed on
return xmlData;
}
A bit more detail regarding this issue. This very same method works fine when the application is running on Linux/UNIX. When the application runs on the Windows OS family, the transformation doesn't work at all. Yet, while debugging this method, I noticed that the "os" object already contains the wrong representation of the character "Ú", which leads me to believe that the problem occurs during the trans.transform execution.
Can someone help me to fix this problem, please? What should I do in order to avoid this encoding issue?
Thank you in advance for the help.
Regards,
Anderson
UPDATE:
Hi Makaveli84! Here is the outcome. Based on your suggestion I have made the change in the code, making sure that the encoding is "ISO-8859-1". This fixed the special characters problem; however, the XML declaration was changed from
<?xml version="1.0" encoding="UTF-8"?>
to
<?xml version="1.0" encoding="ISO-8859-1"?>
That makes sense, but the business case restricts the XML declaration to
<?xml version="1.0" encoding="UTF-8"?>
Having said that, what I decided to do was to get rid of the XML declaration by doing this
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
Afterwards, I just added the string <?xml version="1.0" encoding="UTF-8"?> via concatenation, as follows:
xmlData = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>".concat(os.toString());
This was the way I found to keep the Latin characters in the XML during the conversion to String; in my humble opinion, it is a workaround.
Clearly, this is not an elegant way to tackle the original problem posted here. So my question is: is there some way to achieve my goal without using the above solution?
Here is the complete method after the adjustments made:
private String write(final Document doc) throws TransformerException {
ByteArrayOutputStream os = new ByteArrayOutputStream();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
trans.transform(new DOMSource(doc), new StreamResult(os));
String xmlData = null;
xmlData = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>".concat(os.toString());
return xmlData;
}
Findings on smoke test:
I tested the new approach on Windows and the problem is fixed there. On Linux, on the other hand, the code now produces garbled characters: Latin characters are misrepresented (e.g. 'Ó' became '¿' in the XML).
Should I test which operating system is running the application in order to handle Latin characters properly?
Any suggestion is welcome...
Half of your problem can be solved by encoding (transforming) into a byte array using a single-byte charset such as ISO-8859-1.
Change this
trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
to this
trans.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, there is no escape when going from byte array to String: you need to know which encoding the bytes were written in and pass that encoding when converting the byte array to a String. For example:
String xmlData = os.toString("UTF-8");
or
String xmlData = os.toString("UTF-16");
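Putting the second half of that advice into the original method gives roughly the sketch below: the output encoding stays UTF-8, and the bytes are decoded with that same charset instead of the platform default, which differs between Windows and Linux. The class name Utf8Writer is made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical variant of the write(...) method from the question.
final class Utf8Writer {
    static String write(Document doc) throws TransformerException {
        Transformer trans = TransformerFactory.newInstance().newTransformer();
        trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        trans.transform(new DOMSource(doc), new StreamResult(os));
        // Decode with the charset the serializer wrote, never the platform default.
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because the declaration and the bytes now agree on UTF-8, there should be no need to omit the XML declaration and concatenate it by hand, and no need to check the operating system.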
I am transforming a DOM document (org.w3c.dom.Document) to a Stream using
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, UTF_8.name());
ByteArrayOutputStream out = new ByteArrayOutputStream();
StreamResult output = new StreamResult(out);
Source input = new DOMSource(document);
transformer.transform(input, output);
The document contains text nodes with linefeeds ('\n'). In the output they are replaced with CRLF ("\r\n"), which is not desired. Is there a way to control this (besides replacing them afterwards, of course)?
I have no control over the documents DTD (-> XML whitespace handling).
(Remark: OutputKeys.INDENT is not the correct answer.)
Remark: Why this question is different from question 19102804 (Ensure Unix-style line endings):
This question refers explicitly to javax.xml.transform.Transformer and to the possibilities of influencing its treatment of line endings. Question 19102804 asks for any solution, not only for one using javax.xml.transform.Transformer.
Question 19102804 is limited to the task of getting "Unix-style line endings". In my case the ideal solution would be a component that just puts out the DOM model instance as it is, not touching any node (which everything tried so far does).
Changing the line.separator system property is not an option (see comment).
If all you want to do is serialize a DOM node then in the Java world you can use LSSerializer (https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/ls/LSSerializer.html) instead of a default Transformer and then you have the method setNewLine (https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/ls/LSSerializer.html#setNewLine(java.lang.String)) to define or control your preferred line ending.
Working solution based on Martin Honnen's answer. (But this is not exactly an answer to the question, which explicitly refers to Transformer. So probably the correct answer is "No.", but I'll leave that open for the moment.):
final DOMImplementationLS dom =
    (DOMImplementationLS) DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
final LSSerializer serializer = dom.createLSSerializer();
serializer.setNewLine("\n");
final LSOutput destination = dom.createLSOutput();
destination.setEncoding(UTF_8.name());
final ByteArrayOutputStream bos = new ByteArrayOutputStream();
destination.setByteStream(bos);
serializer.write(document, destination);
One difference between Transformer and LSSerializer is that the Transformer writes
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
without inserting whitespace after, while the LSSerializer writes
<?xml version="1.0" encoding="UTF-8"?>
and inserts a newline after.
I am producing compiled .class files (translets) from XSL transformation files using a TransformerFactory implemented by org.apache.xalan.xsltc.trax.TransformerFactoryImpl.
Unfortunately, despite searching for hours, I couldn't find out how to use these translet classes for an XML transformation.
Is there a code example or reference documentation you could point me to? This document is insufficient and complicated.
Thanks.
A standard transformation in XSLT looks like this:
public void translate(InputStream xmlStream, InputStream styleStream, OutputStream resultStream) {
Source source = new StreamSource(xmlStream);
Source style = new StreamSource(styleStream);
Result result = new StreamResult(resultStream);
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer t = tFactory.newTransformer(style);
t.transform(source, result);
}
So, given that you don't use a TransformerFactory but a ready-made Java class (which is an additional maintenance headache and doesn't give you much better performance, since you can keep your transformer object after the initial compilation), the same function would look like this:
public void translate(InputStream xmlStream, OutputStream resultStream) {
Source source = new StreamSource(xmlStream);
Result result = new StreamResult(resultStream);
Translet t = new YourTransletClass();
t.transform(source, result);
}
In your search you missed typing the interface specification into Google, where the third link shows the interface definition, which has the same call signature as Transformer. So you can swap a Transformer object for your custom object (or keep your Transformer objects in memory for reuse).
Hope that helps