XML Node to String Conversion for Large Sized XML - java

Until now I have been using DOMSource to transform XML into a string in my Android app.
Here's my code:
public String convertElementToString(Node element) throws TransformerConfigurationException, TransformerFactoryConfigurationError
{
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    // Collect the serialized XML in memory via a StringWriter
    StreamResult result = new StreamResult(new StringWriter());
    DOMSource source = new DOMSource(element);
    try {
        transformer.transform(source, result);
    }
    catch (TransformerException e) {
        Log.e("CONVERT_ELEMENT_TO_STRING", "converting element to string failed. Aborting", e);
    }
    String xmlString = result.getWriter().toString();
    // Strip the XML declaration and all newlines from the output
    xmlString = xmlString.replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "");
    xmlString = xmlString.replace("\n", "");
    return xmlString;
}
This worked fine for small XML files.
But for large XML files, this code started throwing OutOfMemoryError.
What could be the reason, and how can I fix this problem?

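The likely reason for the error is that the DOM tree, the StringWriter buffer, and every copy of the string created by replace() all have to fit in the heap at the same time.
First off: if you just need the XML as a string and aren't using the Node for anything else, consider a streaming API such as StAX (Streaming API for XML) instead of DOM, as it has a much lower memory footprint. In Java SE, StAX lives in the javax.xml.stream package; as far as I know that package is not part of the Android platform, where XmlPullParser (org.xmlpull.v1) fills a similar role.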
One improvement to your current code would be to change the line
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
to
transformer.setOutputProperty(OutputKeys.INDENT, "no");
Since you're stripping newlines anyway at the end of the method, it's not very useful to request additional indentation in the output. It's a small thing, but might reduce your memory requirements a bit if there are a lot of tags (hence, newlines and whitespace for indentation) in your XML.
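The suggestion above can be taken one step further. The following is only a sketch, not the asker's actual method: it assumes the stock JAXP transformer, and the class name NodeToString is made up for illustration. It turns indentation off and asks the transformer to omit the XML declaration, so the two replace() calls (each of which copies the whole string) are no longer needed.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, named here only for illustration.
class NodeToString {

    // Serializes a DOM node without indentation and without the XML declaration.
    static String convert(Node node) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document just to exercise convert().
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        root.appendChild(doc.createElement("child"));
        doc.appendChild(root);
        System.out.println(convert(root));
    }
}
```

Note that the complete result still has to fit in memory as one String. If the XML only needs to end up in a file or on a socket, passing a StreamResult that wraps a FileOutputStream or another OutputStream avoids building the string at all.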

Related

&quot; is auto converting to " through Document & Transformer API

I am loading an XML file (pom.xml) into an org.w3c.dom.Document, editing the values of some nodes (basically changing the version of some dependencies), and writing it back with javax.xml.transform.Transformer, javax.xml.transform.TransformerFactory, and javax.xml.transform.dom.DOMSource.
The problem is that this also converts every occurrence of &quot; to a literal " character, which I don't want. See the sample below:
<Export-Package>!${bundle.namespace}.internal.*,${bundle.namespace}.*;version=&quot;${project.version}&quot;</Export-Package>
is converted to:
<Export-Package>!${bundle.namespace}.internal.*,${bundle.namespace}.*;version="${project.version}"</Export-Package>
Please help: how can I prevent this automatic conversion with the API I am currently using?
Code Sample:
public void writeDocument(File filePath)
{
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    this.thisDoc.getDocumentElement().normalize();
    Transformer transformer;
    try
    {
        transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(thisDoc);
        StreamResult result = new StreamResult(filePath);
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(source, result);
    }
    catch (TransformerException e)
    {
        VersionUpdateExceptions.throwException(e, LOG);
    }
}
This is the behavior required by the Document Object Model (DOM) Level 3 Load and Save Specification:
Within the character data of a document (outside of markup), any
characters that cannot be represented directly are replaced with
character references. Occurrences of '<' and '&' are replaced by the
predefined entities &lt; and &amp;. The other predefined entities
(&gt;, &apos;, and &quot;) might not be used, except where needed
(e.g. using &gt; in cases such as ']]>').
For example, if you use &quot; inside an attribute value:
<Export-Package id="&quot;test&quot;">
the &quot; will be preserved. Otherwise, it won't.
If absolutely necessary, you could preserve &quot; with an ugly hack:
1. Read the pom.xml as a String and replace occurrences of &quot; with some "marker" string.
2. To parse the document, use a StringReader to create an InputSource.
3. Execute your method, but create the StreamResult with a StringWriter.
4. Get the content from the StringWriter as a String and replace your marker string with &quot;.
5. Save the content to the file.
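The marker hack described above could look roughly like the sketch below. The class name QuotPreserver and the marker text are made up for illustration, and the marker is assumed never to occur in the real file. Editing the document and saving the result to disk are left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here only for illustration.
class QuotPreserver {

    // Assumption: this text never appears in the original document.
    static final String MARKER = "__QUOT_MARKER__";

    static String roundTrip(String xml) throws Exception {
        // Hide every &quot; behind the marker so the parser never resolves it.
        String marked = xml.replace("&quot;", MARKER);

        // Parse from a String via StringReader and InputSource.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(marked)));

        // ... edit the document here (for example, change a version value) ...

        // Serialize into a StringWriter instead of directly into the file.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource((Node) doc), new StreamResult(out));

        // Put the entity back; the caller can then save the string to the file.
        return out.toString().replace(MARKER, "&quot;");
    }
}
```

For a real pom.xml the XML declaration would normally be kept; it is omitted here only to keep the returned string short.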

How to unescape string in XML using Transformer?

I have a function which takes an XML document as a parameter and writes it to a file. The document contains an element such as <tag>"some text & some text": <text> text</tag>, but in the output file it is written as <tag>&quot;some text &amp; some text&quot;: &lt;text&gt; text</tag>. I don't want the string to be escaped when it is written to the file.
The function is:
public static void function(Document doc, String fileUri, String randomId){
    DOMSource source = new DOMSource(doc,ApplicationConstants.ENC_UTF_8);
    FileWriterWithEncoding writer = null;
    try {
        File file = new File(fileUri+File.separator+randomId+".xml");
        if (!new File(fileUri).exists()){
            new File(fileUri).mkdirs();
        }
        writer = new FileWriterWithEncoding(new File(file.toString()),ApplicationConstants.ENC_UTF_8);
        StreamResult result = new StreamResult(writer);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = null;
        transformer = transformerFactory.newTransformer();
        // INDENT is an output property, not a stylesheet parameter
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(source, result);
        writer.close();
        transformer.clearParameters();
    }catch (IOException | TransformerException e) {
        log.error("convert Exception is :"+ e);
    }
}
There are five escape characters in XML ("'<>&). According to the XML grammar, they must be escaped in certain places in an XML document; please see this question:
What characters do I need to escape in XML documents?
So you can't do much, for instance, to avoid escaping & or < in text content.
You could use CDATA sections if you want to retain "unescaped" content. Please see this question:
Add CDATA to an xml file
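As a rough sketch of the CDATA suggestion (the class name CdataExample is made up, and the stock JAXP transformer is assumed), the text can be added as a CDATASection node instead of a plain text node. The serializer then writes it inside a CDATA section rather than escaping & and <.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, named here only for illustration.
class CdataExample {

    static String serialize() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element tag = doc.createElement("tag");
        // A CDATA section instead of a text node: its content is written verbatim.
        tag.appendChild(doc.createCDATASection("\"some text & some text\": <text> text"));
        doc.appendChild(tag);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```

A CDATA section cannot itself contain the sequence ]]>, so text that may include it needs extra care. For text nodes that already exist in a document, OutputKeys.CDATA_SECTION_ELEMENTS is the standard output property for asking the serializer to wrap the text of named elements in CDATA.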

java XML to string issue

I tried to solve my XML issue but I am stuck. Printing the XML to System.out on the console works well. However, when I try to get it as a returned string value, I only get half of the XML (the returned string contains no error, but the XML is cut off). The code is below (IDE: Android Studio; tried JDK 1.7 and 1.8 with the same result).
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
// Collect the serialized XML in memory via a StringWriter
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);
// Output to console for testing
StreamResult consoleResult = new StreamResult(System.out);
transformer.transform(source, consoleResult); // this give all data to as System.Out as correct every line
xmlString = result.getWriter().toString();
Log.v("XML OUTPUT xmlString", xmlString); // but this shows only part of the XML, cut off about halfway
Update:
Solved: there is nothing wrong with the code; Log.v simply cuts off long strings, so the conversion itself works. Thanks to @wero for the tip.
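Logcat limits the size of a single log entry (around 4 KB, as far as I know), which is why the long string appears cut off. One common workaround is to split the message and log it piece by piece. The sketch below covers only the splitting, so it runs without Android; the class name LogChunker and the chunk size are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class LogChunker {

    // Assumed chunk size, chosen to stay below the per-entry logcat limit.
    static final int DEFAULT_CHUNK_SIZE = 3000;

    // Splits a message into consecutive pieces of at most chunkSize characters.
    static List<String> split(String message, int chunkSize) {
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < message.length(); start += chunkSize) {
            int end = Math.min(message.length(), start + chunkSize);
            parts.add(message.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // On Android, each part would be passed to Log.v(tag, part) instead.
        for (String part : split("abcdefg", 3)) {
            System.out.println(part);
        }
    }
}
```

In the app, the loop body would become Log.v("XML OUTPUT xmlString", part), with DEFAULT_CHUNK_SIZE as the chunk size.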

java xml document.getTextContent() stays empty

I'm trying to build an xml document in a JUnit test.
doc=docBuilder.newDocument();
Element root = doc.createElement("Settings");
doc.appendChild(root);
Element label0 = doc.createElement("label_0");
root.appendChild(label0);
String s=doc.getTextContent();
System.out.println(s);
Yet the document appears to stay empty (i.e. the println yields null). I don't have a clue why that is. The actual problem is that a subsequent XPath expression throws the error: Unable to evaluate expression using this context.
The return value of getTextContent() on a Document is defined to be null; see the documentation of Node.
To retrieve the text content, call getTextContent() on the root element instead (for example via doc.getDocumentElement()).
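A small sketch of that difference, reusing the element names from the question. The class name TextContentDemo and the text "hello" are additions for illustration, since the original document contains no text at all.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, named here only for illustration.
class TextContentDemo {

    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("Settings");
        doc.appendChild(root);
        Element label0 = doc.createElement("label_0");
        // Assumption for illustration: give the element some text to read back.
        label0.setTextContent("hello");
        root.appendChild(label0);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        // Defined to be null for Document nodes.
        System.out.println(doc.getTextContent());
        // Concatenated text of all descendants of the root element.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

So a null from doc.getTextContent() says nothing about whether the document is empty.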
I imagine you want to serialize the document to pass it to the test case.
To do this, pass your document through an identity transformer (a Transformer created without a stylesheet), like this:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
// Collect the serialized XML in memory via a StringWriter
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
System.out.println(xmlString);
See also: How to pretty print XML from Java?

XML Canonical form in Java

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
I have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I really need to do this?
doc.normalize();
// Canonicalize using Apache's XML Security library
org.apache.xml.security.Init.init(); // Doesn't work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
        Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
        .canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse recommended to get attributes in alphabetical order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
"{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code outputs:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
This might be canonicalized, but it doesn't seem to respect the indentation I asked the transformer to apply.
Given this example, I have a few questions:
For my purpose, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again, my goal is canonical and pretty-printed XML).
Why does the whitespace within the XML seem to break the formatting? Would I have to trim the text of each XML node to make it work? That sounds wrong; nonetheless, if the input XML is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa>, the XML is perfectly formatted.
Why, after asking for the canonical form, weren't tags such as <ccc/> expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.
