How to preserve newlines in CDATA when generating XML? - java

I want to write some text that contains whitespace characters such as newline and tab into an xml file so I use
Element element = xmldoc.createElement("TestElement");
element.appendChild(xmldoc.createCDATASection(somestring));
but when I read this back in using
Node vs = xmldoc.getElementsByTagName("TestElement").item(0);
String x = vs.getFirstChild().getNodeValue();
I get a string that has no newlines anymore.
When i look directly into the xml on disk, the newlines seem preserved. so the problem occurs when reading in the xml file.
How can I preserve the newlines?
Thanks!

I don't know how you parse and write your document, but here's an enhanced code example based on yours:
// creating the document in-memory
Document xmldoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element element = xmldoc.createElement("TestElement");
xmldoc.appendChild(element);
element.appendChild(xmldoc.createCDATASection("first line\nsecond line\n"));
// serializing the xml to a string
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl =
(DOMImplementationLS)registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
String str = writer.writeToString(xmldoc);
// printing the xml for verification of whitespace in cdata
System.out.println("--- XML ---");
System.out.println(str);
// de-serializing the xml from the string
final Charset charset = Charset.forName("utf-16");
final ByteArrayInputStream input = new ByteArrayInputStream(str.getBytes(charset));
Document xmldoc2 = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input);
Node vs = xmldoc2.getElementsByTagName("TestElement").item(0);
final Node child = vs.getFirstChild();
String x = child.getNodeValue();
// print the value, yay!
System.out.println("--- Node Text ---");
System.out.println(x);
The serialization using LSSerializer is the W3C way to do it (see here). The output is as expected, with line separators:
--- XML ---
<?xml version="1.0" encoding="UTF-16"?>
<TestElement><![CDATA[first line
second line ]]></TestElement>
--- Node Text ---
first line
second line

You need to check the type of each node using node.getNodeType(). If the type is CDATA_SECTION_NODE, you need to concat the CDATA guards to node.getNodeValue.

You don't necessarily have to use CDATA to preserve white space characters.
The XML specification specify how to encode these characters.
So for example, if you have an element with value that contains new space you should encode it with
Carriage return:
And so forth

EDIT: cut all the irrelevant stuff
I'm curious to know what DOM implementation you're using, because it doesn't mirror the default behaviour of the one in a couple of JVMs I've tried (they ship with a Xerces impl). I'm also interested in what newline characters your document has.
I'm not sure if whether CDATA should preserve whitespace is a given. I suspect that there are many factors involved. Don't DTDs/schemas affect how whitespace is processed?
You could try using the xml:space="preserve" attribute.

xml:space='preserve' is not it. That is only for "all whitespace" nodes. That is, if you want the whitespace nodes in
<this xml:space='preserve'> <has/>
<whitespace/>
</this>
But see that those whitespace nodes are ONLY whitespace.
I have been struggling to get Xerces to generate events allowing isolation of CDATA content as well. I have no solution as yet.

Related

How would you parse xml in java if the content of a tag contains > or <?

Currently, I'm using XMLInputFactory and XMLEventReader to parse XML from a rss data feed. In the description, it contains html tags in the using of > and <. Java reads this as actual tags and it thinks that the end of the description, so it cuts off and goes to the next element. How can I exclude the tags from parsing?
I don't use the pull parser (XMLEventReader) much, but I believe that, as with the SAX parser, it can report a text node as a sequence of Characters events, rather than as a single event, and it's up to the application to concatenate them. The most likely place the parser is likely to choose to split the content is at entity boundaries, to avoid doing bulk copying of character data when expanding entities.
You could temporary replace every > and < tags by a specific unique label you know. Then, do your parsing, and replace them with the > and < tags again when you are done with your parsing, like in the following code.
String original = "<container>>This< is a >test<</container>";
String newStr = original.replace(">", "_TMP_CHARACTER_G_").replace("<", "_TMP_CHARACTER_L_");
System.out.println(original + "\n" + newStr);
// Print <container>>This< is a >test<</container>
// and <container>_TMP_CHARACTER_G_This_TMP_CHARACTER_L_ is a _TMP_CHARACTER_G_test_TMP_CHARACTER_L_</container>
// [Do your parsing here]
String theTagYouWant = newStr;
String theConvertedTag = theTagYouWant.replace("_TMP_CHARACTER_G_", ">").replace("_TMP_CHARACTER_L_", "<");
System.out.println(theConvertedTag);
// Print the original String <container>>This< is a >test<</container>

preserve /t and /n in XML attribute with Java parser

In a XML file parsed to a Document I want to get a XML attribute that has embedded tabs and new lines.
I've googled and found that the XML parsing spec says the attribute text is "normalized", replacing white space characters with a blank.
I guess a have to replace the tabs and line breaks with an appropriate escaped character before I parse the XML.
In all of my googling I have not found a straightforward method to get from the File to a Document where the attribute text is returned with Tabs and Line breaks preserved.
The XML file is generated from a third party application so it may not be addressed there.
I want to use the JDK parser.
My initial attempts at reading the File into a string and parsing the String fail with a parse error on the first byte
Any suggestions on a straight forward approach?
An example element is at pastbin
Element example
[1]: https://pastebin.com/pc9uGbSD
I perform a XML Parse like this
public ReadPlexExport(Path xmlPath, ExportType exType) throws Exception {
this.xmlPath = xmlPath;
this.type = exType;
this.doc = DBF.newDocumentBuilder().parse(this.xmlPath.toFile());
}
The quick and dirty solution to my immediate problem was to read the XML file line by line as a text file, on each line replacing \t characters with the escaped tab value, writing the line to a new file, then appending an escaped line break.
The new XML files could be parsed. The original XML would always be in a form that allowed this hack as \t and line breaks would only ever occur in Attributes.

How to skip the validation of texts within an xml tag?

I am getting xml downloaded from bugzilla in this format:
<bugzilla>
<bug>
<bug_id>111</bug_id>
<short_desc>text 1 & 2</short_desc>
</bug>
<bug>
<bug_id>222</bug_id>
<short_desc>text 2 <this is a short desc> </short_desc>
</bug>
</bugzilla>
As you can see, when i am trying to parse this using jaxb parser, its failing with two reasons:
for & within the first tag (it needs to be changed to &
Error message: The entity name must immediately follow the '&' in the entity reference.
same case for <this is a short desc> text. Error message The entity name must immediately follow the '&' in the entity reference.
But what i dont understand is both these are contents of valid tags. So why validation logic is running for such contents. In the second case, its not just a single tag right as <thisisashortdesc>, which can throw actual valid error saying the closing tag missing. but this case has spaces between.
Find the code been used belowg:
File file = new File("C:\test\file.xml");
JAXBContext jaxbContext = JAXBContext.newInstance(Bugzilla.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
Bugzilla bugzillaReport = (Bugzilla) jaxbUnmarshaller.unmarshal(file);
Anyways to resolve this issue.
As you are aware, valid XML must be parsed, as there is no fuzzy matching as in HTML. The standard solution is to place a <![CDATA[....]]>. (CDATA stands for character data.)
<short_desc><![CDATA[text 1 & 2]]></short_desc>
<short_desc><![CDATA[text 2 <this is a short desc> ]]></short_desc>
This is cumbersome, and the question is whether usage still works, when a text instead of a CData is expected. And creating the correct XML is probably easier. Apache commons also has a StringEscapeUtils.escapeXml10(String) for that purpose.
Try it (CDATA) first.
String xml = new String(Files.readAllBytes(Paths.get("C:\\test\\file.xml")),
StandardCharsets.UTF_8);
xml = "<?xml version=\"1.0\">\n" + xml;
xml = xml.replace("<short_desc>", "<short_desc><![CDATA[");
xml = xml.replace("</short_desc>", "]]></short_desc>");
jaxbUnmarshaller.unmarshal(new StreamSource(new StringReader(xml)));
Notice that a backslash \ must be self-escaped in a java String.
A java 9 repair:
xml = xml.replaceAll("(?s)<short_desc>(.*)</short_desc>",
matchResult -> "<short_desc>"
+ StringEscapeUtils.escapeXml10(matchResult.group(1))
+ "</short_desc>");
or without apache common lang StringEscapeUtils:
xml = xml.replaceAll("(?s)<short_desc>(.*)</short_desc>",
matchResult -> "<short_desc>"
+ matchResult.group(1)
.replace("&", "&")
.replace("\"", """)
.replace("<", "<")
.replace(">", ">")
+ "</short_desc>");

Serializing Java DOM Document to XML: Add CData Elements

I am constructing an XML DOM Document with a SAX parser. I have written methods to handle the startCDATA and endCDATA methods and in the endCDATA method I construct a new CDATA section like this:
public void onEndCData() {
xmlStructure.cData = false;
Document document = xmlStructure.xmlResult.document;
Element element = (Element) xmlStructure.xmlResult.stack.peek();
CDATASection section = document.createCDATASection(xmlStructure.stack.peek().characters);
element.appendChild(section);
}
When I serialize this to an XML file I use the following line to configure the transformer:
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "con:setting");
Never the less no <![CDATA[ tags appear in my XML file and instead all backets are escaped to > and <, this is no problem for other tools but it is a problem for humans who need to read the file as well. I am positive that the "con:setting" tag is the right one. So is there maybe a problem with the namespace prefix?
Also this question indicates that it is not possible to omit the CDATA_SECTION_ELEMENTS property and generally serialize all CDATA nodes without escaping the data. Is that information correct, or are there maybe other methods that the author of the answer was not aware of?
Update: It seems I had a mistake in my code. When using the document.createCDATASection() function, and then serializing the code with the Transformer it DOES output CDATA tags, even without the use of the CDATA_SECTION_ELEMENTS property in the transformer.
It looks like you have a namespace-aware DOM. The docs say you need to provide the Qualified Name Representation of the element:
private static String qualifiedNameRepresentation(Element e) {
String ns = e.getNamespaceURI();
String local = e.getLocalName();
return (ns == null) ? local : '{' + ns + '}' + local;
}
So the value of the property will be of the form {http://your.conn.namespace}setting.
In this line
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "con:setting");
try replacing "con:setting" with "{http://con.namespace/}setting"
using the appropriate namespace
Instead of using a no-op Transformer to serialize your DOM tree you could try using the DOM-native "load and save" mechanism, which should preserve the CDATASection nodes from the DOM tree and write them as CDATA sections in the resulting XML.
DOMImplementationLS ls = (DOMImplementationLS)document.getImplementation();
LSOutput output = ls.createLSOutput();
LSSerializer ser = ls.createLSSerializer();
try (FileOutputStream outStream = new FileOutputStream(...)) {
output.setByteStream(outStream);
output.setEncoding("UTF-8");
ser.write(document, output);
}

SAXReader not re-ecape characters

I'm reading a XML file with dom4j. The file looks like this:
...
<Field>
hello, world...</Field>
...
I read the file with SAXReader into a Document. When I use getText() on a the node I obtain the followin String:
\r\n hello, world...
I do some processing and then write another file using asXml(). But the characters are not escaped as in the original file which results in error in the external system which uses the file.
How can I escape the special character and have
when writing the file?
You cannot easily. Those aren't 'escapes', they are 'character entities'. They are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the species that are defined in a DTD.
It depends on what you're getting and what you want (see my previous comment.)
The SAX reader is doing nothing wrong - your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters, you will need to insert a \ (backslash) character following by the "r" or "n" characters (or both.)
If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:
myString = myString.Replace("\r\n", "\\r\\n");
XML entities are abstracted away in DOM. Content is exposed with String without the need to bother about the encoding -- which in most of the case is what you want.
But SAX has some support for how entities are processed. You could try to create a XMLReader with a custom EntityResolver#resolveEntity, and pass it as parameter to the SAXReader. But I feat it may not work:
The Parser will call this method
before opening any external entity
except the top-level document entity
(including the external DTD subset,
external entities referenced within
the DTD, and external entities
referenced within the document
element)
Otherwise you could try to configure a LexicalHandler for SAX in a way to be notified when an entity is encountered. Javadoc for LexicalHandler#startEntity says:
Report the beginning of some internal
and external XML entities.
You will not be able to change the resolving, but that may still help.
EDIT
You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See reading a XML file and writing an XML file. Don't use asXml() and dump the file yourself.
FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.flush();
You can pre-process the input stream to replace & to e.g. [$AMPERSAND_CHARACTER$], then do the stuff with dom4j, and post-process the output stream making the back substitution.
Example (using streamflyer):
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;
// Pre-process
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");
// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...
// Post-process
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");
// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();
You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.

Categories