Handle invalid xml character while serializing - java

I have a requirement to serialize a document that contains a string like ンᅧᅭ%ンᅨ&. Serializing it throws the following exception:
java.io.IOException: The character '' is an invalid XML character
Is there a workaround that lets me serialize this string as is?
StringWriter stringOut = new StringWriter();
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("company");
doc.appendChild(rootElement);
String xml = "ンᅧᅭ%ンᅨ&";
//String xml = "ンᅧᅭ%ンᅨ&";
Element junk = doc.createElement("replyToQ");
junk.appendChild(doc.createCDATASection(xml));
//junk.appendChild(doc.createTextNode(stripNonValidXMLCharacters(xml)));
rootElement.appendChild(junk);
//org.w3c.dom.Document doc = this.toDOM();
//Serialize DOM
OutputFormat format = new OutputFormat(doc,"UTF-8",true);
format.setIndenting(false);
format.setLineSeparator("");
format.setPreserveSpace(true);
format.setOmitXMLDeclaration(false);
XMLSerializer serial = new XMLSerializer( stringOut, format );
// As a DOM Serializer
serial.asDOMSerializer();
serial.serialize( doc.getDocumentElement() );

EDIT: I read your question as a deserialization question, not a serialization one. Sorry.
The answer is that you need to escape them using Unicode character references.
Character ン becomes &#x30F3;. See Japanese Katakana chart
Also see here XML Escaping
You need to pre-process the file to correctly escape the XML characters:
Read each character in the file.
If the character is invalid XML, escape it appropriately.
Write the character to a temporary file.
At the end of the original file, overwrite the original with the temporary file.
Your file is now valid XML and can be parsed by standard means. It will most likely be bigger. Give the supplier of your file a telling off for writing a buggy XML writer ;)
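A minimal sketch of the per-character pass described above, working on an in-memory string rather than a temporary file. The class and method names are hypothetical. It assumes two things that the answer does not spell out: non-ASCII characters are written as hexadecimal character references, and characters outside the XML 1.0 Char range (most control characters, for example) are dropped, because XML 1.0 cannot represent them even as a character reference. Markup characters such as & and < are left untouched.

```java
class XmlCharEscaper {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Drops characters XML 1.0 cannot carry and rewrites everything above
     * printable ASCII as a hexadecimal character reference.
     */
    static String escape(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValidXmlChar(cp)) {
                continue; // assumption: silently drop invalid characters
            }
            if (cp > 0x7E) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

For example, XmlCharEscaper.escape("ン") yields the reference for U+30F3, and a string containing the control character U+0001 comes back without it. Whether dropping is acceptable depends on the data; replacing with a placeholder is an equally valid choice.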

Related

Correct xml escaping in Java

I need to convert CSV into XML and then write it to an OutputStream. The rule is to convert " into &quot; in my code.
Input CSV row:
{"Test":"Value"}
Expected output:
<root>
<child>{&quot;Test&quot;:&quot;Value&quot;}</child>
</root>
Current output:
<root>
<child>{&amp;quot;Test&amp;quot;:&amp;quot;Value&amp;quot;}</child>
</root>
Code:
File file = new File(FilePath);
BufferedReader reader = null;
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document newDoc = domBuilder.newDocument();
Element rootElement = newDoc.createElement("root");
newDoc.appendChild(rootElement);
reader = new BufferedReader(new FileReader(file));
String text = null;
while ((text = reader.readLine()) != null) {
Element rowElement = newDoc.createElement("child");
rootElement.appendChild(rowElement);
text = StringEscapeUtils.escapeXml(text);
rowElement.setTextContent(text);
}
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(newDoc);
Result outputTarget = new StreamResult(outputStream);
TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
System.out.println(new String(outputStream.toByteArray()));
Could you please help? What am I missing, and where does & get converted to &amp;?
The XML library will automatically escape strings that need to be XML-escaped, so you don't need to escape manually using StringEscapeUtils.escapeXml. Simply remove that line and you should get exactly what you're looking for: properly escaped XML.
XML doesn't require " characters to be escaped everywhere, only within attribute values. So this is valid XML already:
<root>
<child>{"Test":"Value"}</child>
</root>
You would escape the quotes if you had an attribute that contained a quote, such as: <child attr="properly &quot;ed"/>
This is one of the main reasons to use an XML library: the subtleties of quoting are already handled for you. No need to read the XML spec to make sure you got the quoting rules correct.
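As a rough illustration of that point, the sketch below builds the same root/child structure and lets the serializer do the escaping. The class name and the toXml helper are hypothetical; the XML declaration is omitted only to keep the returned string short.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AutoEscapeDemo {

    /** Wraps one CSV row in root/child and serializes it without manual escaping. */
    static String toXml(String rowText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);

            Element child = doc.createElement("child");
            child.setTextContent(rowText); // raw text, no StringEscapeUtils call
            root.appendChild(child);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // wrapped so callers of this sketch need no checked-exception handling
            throw new IllegalStateException(e);
        }
    }
}
```

Passing the row {"Test":"Value"} leaves the quotes as they are in element content, while a row such as a & b comes out with the ampersand escaped by the serializer.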

Work with raw text in javax.xml.transform.Transformer

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This &mdash; That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That", which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that "&#13;&#10;" was being converted into CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "&mdash;" is output as "&amp;mdash;".
I cannot use the solution to post-convert all "&amp;" -> "&" because not all nodes represent HTML.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);
DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &amp;mdash; That</Test>"
The fact is, once you have a DOM tree, there's no longer a string with &mdash;: it's instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions including Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat .
To parse a single node, there is LSParser.parseWithContext.
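To make the "parse it to a Node" point concrete, here is a small sketch using a plain DocumentBuilder rather than LSParser. It assumes the mdash entity is declared in an internal DTD subset, since XML itself does not predefine it; the class and method names are hypothetical. After parsing, the element's text holds the Unicode character U+2014, not the entity name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EntityToNode {

    /** Parses the XML string and returns the root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document parsed = builder.parse(new InputSource(new StringReader(xml)));
            return parsed.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses a fragment and adopts its root into the target document. */
    static Node importInto(Document target, String xml) {
        // importNode copies the parsed subtree so it can be appended to target
        return target.importNode(parseRoot(xml), true);
    }

    /** Text content of the parsed root, with entities already resolved. */
    static String textOf(String xml) {
        return parseRoot(xml).getTextContent();
    }
}
```

Called with a document such as <!DOCTYPE Test [<!ENTITY mdash "&#8212;">]><Test>This &mdash; That</Test>, textOf returns the text with a real dash character in it. Getting the entity name back on output is then a serialization concern, as the answer says.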

Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

I am parsing XML that contains numeric character entity characters such as (but not limited to) &#10; &#13; &lt; &gt; (line feed, carriage return, <, >) in Java. While parsing, I am appending text content of nodes to a StringBuffer to later write it out to a text file.
However, these unicode characters are resolved or transformed into newlines/whitespace when I write the String to a file or print it out.
How can I keep the original numeric character entity characters symbols when iterating over nodes of an XML file in Java and storing the text content nodes to a String?
Example of demo xml file:
<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">
<Field attributeWithChar="A string followed by special symbols &#13;&#10;" />
</ABCD>
Example Java code. It loads the XML, iterates over the nodes and collects the text content of each node to a StringBuffer. After the iteration is over, it writes the StringBuffer to the console and also to a file (but without the &#13;&#10; symbols).
What would be a way to keep these symbols when storing them to a String? Could you please help me? Thank you.
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
Document document = null;
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
document = documentBuilder.parse(new File("path/to/demo.xml"));
StringBuilder sb = new StringBuilder();
NodeList nodeList = document.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
NamedNodeMap nnp = node.getAttributes();
for (int j = 0; j < nnp.getLength(); j++) {
sb.append(nnp.item(j).getTextContent());
}
}
}
System.out.println(sb.toString());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("path/to/demo_output.xml"), "UTF-8"))) {
writer.write(sb.toString());
}
}
You need to escape all the XML entities before parsing the file into a Document. You do that by escaping the ampersand & itself with its corresponding XML entity &amp;. Something like,
DocumentBuilder documentBuilder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");
Document document = documentBuilder.parse(
new InputSource(new StringReader(xmlContents.replaceAll("&", "&amp;"))
));
Output :
2A string followed by special symbols &#13;&#10;
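A self-contained sketch of the same idea, reading from a string instead of a file so it can be tried in isolation. The class and method names are hypothetical. It uses String.replace rather than replaceAll because the pattern is a literal, not a regular expression. Note the side effect: every ampersand is escaped, so named entities such as &lt; also arrive in the DOM as literal text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class KeepCharRefs {

    /**
     * Escapes every ampersand before parsing, so character references
     * reach the DOM as literal text instead of being resolved.
     */
    static String rootAttribute(String xml, String attributeName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String escaped = xml.replace("&", "&amp;");
            Document document =
                    builder.parse(new InputSource(new StringReader(escaped)));
            return document.getDocumentElement().getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the input <Field a="symbols &#13;&#10;"/>, the attribute value read back still contains the two references as written.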
P.S. This is complement of Ravi Thapliyal's answer, not an alternative.
I am having the same problem with handling an XML file which is exported from a 2003-format Excel sheet. This XML file stores line breaks in text contents as &#10; along with other numeric character references. However, after reading it with the Java DOM parser, manipulating the content of some elements and transforming it back to the XML file, I see that all the numeric character references are expanded (i.e. the line break is converted to CRLF) on Windows with J2SE 1.6. Since my goal is to keep the content format unchanged as much as possible while manipulating some elements (i.e. retain numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.
When writing the XML content back to the file, it is necessary to replace all &amp; with &, right? To do that, I had to give a StringWriter to the transformer as StreamResult, obtain a String from it, replace all occurrences and dump the string to the XML file.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);
//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);
t.transform(source, result);
//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&amp;" back to "&" to retain numeric character references.
xmlContent = xmlContent.replaceAll("&amp;", "&");
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
wr.write(xmlContent);
wr.close();
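Putting both halves together, a hedged round-trip sketch: escape ampersands, parse, serialize to a string, then revert the escaping. The names are hypothetical and the XML declaration is omitted for brevity. One assumption worth stating: the final replace also turns any ampersand added to the DOM during manipulation back into a bare &, so new text containing & would need separate handling.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefRoundTrip {

    /** Parses and re-serializes XML while keeping character references as written. */
    static String roundTrip(String xml) {
        try {
            // 1. Hide the references from the parser.
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml.replace("&", "&amp;"))));

            // (manipulate the document here)

            // 2. Serialize to a string so the output can be post-processed.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            t.transform(new DOMSource(document), new StreamResult(writer));

            // 3. Undo the escaping added in step 1.
            return writer.toString().replace("&amp;", "&");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```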

Extract XML blocks as string in Java

I have an XML as below
<accountProducts>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
</accountProducts>
Now I want to extract each accountProduct block as a string. Is there an XML parsing technique to do that, or do I need to do string manipulation?
Any help please.
Using the DOM as suggested above, you will need to parse your XML with a DocumentBuilder.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
//if your document has namespaces, you can specify that in your builder.
DocumentBuilder db = dbf.newDocumentBuilder();
Using this object, you can call the parse() method.
Your XML input can be provided to a DOM parser as a file or as a stream.
As a file...
File f = new File("MyXmlFile.xml");
Document d = db.parse(f);
As a string...
String myXmlString = "...";
InputSource ss = new InputSource(new StringReader(myXmlString));
Document d = db.parse(ss);
Once you have a Document object, you can traverse the document with DOM functions or with XPATH. This example illustrates the DOM methods.
In your example, assuming that accountProduct nodes contain only text, the following should work.
NodeList nl = d.getElementsByTagName("accountProduct");
for(int i=0; i<nl.getLength(); i++) {
Element elem = (Element)nl.item(i);
System.out.println(elem.getTextContent());
}
If accountProduct contains mixed content (text and elements), you would need more code to extract what you need.
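For the mixed-content case, one possible approach (a sketch, not part of the answer above) is to serialize each matching element with an identity Transformer, which returns the whole block including its tags and child elements. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BlockExtractor {

    /** Returns each element named tagName, serialized as its own XML string. */
    static List<String> extract(String xml, String tagName) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Transformer t = TransformerFactory.newInstance().newTransformer();
            // No <?xml ...?> header in front of every block.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            NodeList nl = d.getElementsByTagName(tagName);
            List<String> blocks = new ArrayList<>();
            for (int i = 0; i < nl.getLength(); i++) {
                StringWriter w = new StringWriter();
                // A DOMSource built on an element serializes just that subtree.
                t.transform(new DOMSource(nl.item(i)), new StreamResult(w));
                blocks.add(w.toString());
            }
            return blocks;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

BlockExtractor.extract(xml, "accountProduct") would then give one string per accountProduct element, in document order.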
Use JAXP for this.
The Java API for XML Processing (JAXP) is for processing XML data using applications written in the Java programming language.

Convert a String to w3c.dom.Element: XMLParseException:Start of root element expected

I found the following piece of code on a blog. When I run it, I get an exception:
XMLParseException: Start of root element expected. at 9th line.
Can anyone explain why I get the exception and suggest another way of converting a String to an Element?
String s = "Hello DOM Parser";
java.io.InputStream sbis = new java.io.StringBufferInputStream(s);
javax.xml.parsers.DocumentBuilderFactory b = javax.xml.parsers.DocumentBuilderFactory.newInstance();
b.setNamespaceAware(false);
org.w3c.dom.Document doc = null;
javax.xml.parsers.DocumentBuilder db = null;
db = b.newDocumentBuilder();
doc = db.parse(sbis);
org.w3c.dom.Element e = doc.getDocumentElement();
To create a DOM Element with a custom tag (which I assume is what you want, but can't be sure), you can use the following approach:
String customTag = "HelloDOMParser";
Document doc = documentBuilder.newDocument();
String fullName = nameSpacePrefix + ":" + customTag;
Element customElement = doc.createElementNS(namespaceUri, fullName);
doc.appendChild(customElement);
I am assuming you know the namespace URI and your prefix (if any). If you don't use namespaces, just use the createElement() method instead.
As said in the comment, "Hello DOM Parser" is not an XML element, so the parser doesn't know what to do with it. I don't know what kind of document you are building, but if you want HTML you can embed the text in an HTML tag, for example:
<div>Hello DOM Parser</div>
<span>Hello DOM Parser</span>
If you are building XML, you can embed the text in any tag of your choosing:
<mytag>Hello DOM Parser</mytag>
Some explanation on DOM;
http://www.w3.org/DOM
To answer your question, to convert a String to a w3c Element, you can use createElement;
Element hello = document.createElement("hello");
hello.appendChild(document.createTextNode("Hello DOM Parser"));
This results in;
<hello>Hello DOM Parser</hello>
The parse method of DocumentBuilder accepts an input stream that contains the XML content. The following change will work for you:
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root><elem1>Java</elem1></root>";
Try to avoid using the deprecated classes such as StringBufferInputStream. You can refer to the document below to know more about Java XML parsing.
http://www.java-samples.com/showtutorial.php?tutorialid=152
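A short sketch of the non-deprecated route, assuming the string really is a well-formed XML document: wrap it in a StringReader and an InputSource instead of StringBufferInputStream. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class StringToElement {

    /** Parses a well-formed XML string and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilder db =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // InputSource over a Reader avoids any byte/charset conversion.
            return db.parse(new InputSource(new StringReader(xml)))
                     .getDocumentElement();
        } catch (Exception e) {
            // A plain string such as "Hello DOM Parser" ends up here,
            // because it has no root element.
            throw new IllegalStateException(e);
        }
    }
}
```

StringToElement.parse("<root><elem1>Java</elem1></root>") returns the root element, whereas the original "Hello DOM Parser" input still fails, for the reason given in the answers above.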
