Correct xml escaping in Java - java

I need to convert CSV into XML and then to OutputStream. Rule is to convert " into " in my code.
Input CSV row:
{"Test":"Value"}
Expected output:
<root>
<child>{"Test":"Value"}</child>
<root>
Current output:
<root>
<child>{&quot;Test&quot;:&quot;Value&quot;}</child>
<root>
Code:
File file = new File(FilePath);
BufferedReader reader = null;
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document newDoc = domBuilder.newDocument();
Element rootElement = newDoc.createElement("root");
newDoc.appendChild(rootElement);
reader = new BufferedReader(new FileReader(file));
String text = null;
while ((text = reader.readLine()) != null) {
Element rowElement = newDoc.createElement("child");
rootElement.appendChild(rowElement);
text = StringEscapeUtils.escapeXml(text);
rowElement.setTextContent(text);
}
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(newDoc);
Result outputTarget = new StreamResult(outputStream);
TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
System.out.println(new String(baos.toByteArray()))
Could you please help? What I miss and when & convert to &?

The XML library will automatically escape strings that need to be XML-escaped, so you don't need to manually escape using StringEscapeUtils.escapeXml. Simply remove that line and you should get exactly what you're looking for properly-escaped XML.
XML doesn't require " characters to be escaped everywhere, only within attribute values. So this is valid XML already:
<root>
<child>{"Test":"Value"}</child>
<root>
You would escape the quotes if you had an attribute that contained a quote, such as: <child attr="properly "ed"/>
This is one of the main reasons to use an XML library: the subtleties of quoting are already handled for you. No need to read the XML spec to make sure you got the quoting rules correct.

Related

Protocol error trying to parse XML response in Java

I am successfully making an API call that is a SOAP request with an account number in the body. I connected using Httpurlconnection and I am reading those results using BufferedReader:
if (responseCode == HttpURLConnection.HTTP_OK) {​​​​​ // success
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {​​​​​
{​​​​​
sb.append(inputLine).append("\n");
String xml2String = sb.toString();
Then using documentbuilderfactory to build the doc to read into the parser:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbFactory.newDocumentBuilder();
Document xmlDom = docBuilder.parse(new InputSource(inputLine));
And then try to parse:
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(returnList.item(0).getTextContent())));
Document doc = parser.getDocument();
NodeList responsedata = doc.getDocumentElement().getChildNodes();
NodeList returnList = xmlDom.getElementsByTagName("DATA");
// Get the DATA
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(returnList.item(0).getTextContent())));
Document doc = parser.getDocument();
NodeList responsedata = doc.getDocumentElement().getChildNodes();
This is the error I get (which includes the output from the API request):
Exception,no protocol:
{​​​​​"d":"<DATA><BussFlds><FieldName>FirstName</FieldName><Value><![CDATA[TESTY]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>LastName</FieldName><Value><![CDATA[TESTER]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>TYPE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>DATE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>CUSTCODE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PREMCODE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ADDRESS</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>CITY</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>STATE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ZIP</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ZIP4</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>ACCTBALANCE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PASTDUE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds><BussFlds><FieldName>PHONE</FieldName><Value><![CDATA[]]></Value><DataType>String</DataType><Format></Format><Editable>True</Editable></BussFlds></DATA>"}​​​​​
I suspect that it is that curly bracket data on the first row or missing header information but I am not sure if that is the issue or how to fix it. Thanks!
In
docBuilder.parse(new InputSource(inputLine))
You are using the stringbuffer. Replace it with your variable xml2String
This response:
{"d":"<DATA><BussFlds>…
is not XML. You cannot read it with a DocumentBuilder.
That response is in a format known as JSON. You cannot use an XML parser to read it.
So, you will want to pass the response to a JSON parser, not an XML parser.
A JSON “object” is basically a dictionary (that is, a lookup table) with string keys. Your response has exactly one entry, whose key is "d". So you first need to parse the response as JSON:
String xml;
try (JsonParser jsonParser = Json.createParser(con.getInputStream())) {
xml = jsonParser.getObject().getString("d");
}
(There are other JSON parsing libraries available. I chose the one that is part of Java EE for the above example.)
Notice that the code does not attempt to read con.getInputStream() as a string first. There is no benefit to doing that. The parser accepts an InputStream directly. Which means there is no need to use InputStreamReader, or BufferedReader, or StringBuffer.
Now that you have XML content in the xml variable, you can parse it with DocumentBuilder:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbFactory.newDocumentBuilder();
Document xmlDom = docBuilder.parse(new InputSource(new StringReader(xml)));
Side note: You should never use StringBuffer. Use StringBuilder instead. StringBuffer is a 26-year-old class that was part of Java 1.0, and it is designed for multithreaded use, which is almost never needed, and which adds a lot of overhead.

Import and parse an xml file without FileOutputStream

Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However I prefer not to create a temp file first after converting the data to UTF-8 and parsing from the temp file. Is there anyway I can do this?
I've tried using a ByteArrayOutputStream in place of OutputStreamWriter, and calling toString() on the ByteArrayOutputStream as such:
doc = db.parse(bos.toString("UTF-8");
But then my Map ends up being empty.
From the API docs (the ability of its meticulous studying is a valuable asset for any programmer) - the parse method with the String argument seems to take something different from what you feed to it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM >Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));
Update
#user2496748 sorry I should have searched for the API but instead I was looking at the source code through a decompiler which tells me the parameter is arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte to char or vice versa a little more now.
After some review I was able to simply my code to this and achieve what I wanted to do. Since I am able to get the email attachment as a InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
Works as well and tested with non-english characters.
You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.

Keep numeric character entity characters such as ` ` when parsing XML in Java

I am parsing XML that contains numeric character entity characters such as (but not limited to)
< > (line feed carriage return < >) in Java. While parsing, I am appending text content of nodes to a StringBuffer to later write it out to a textfile.
However, these unicode characters are resolved or transformed into newlines/whitespace when I write the String to a file or print it out.
How can I keep the original numeric character entity characters symbols when iterating over nodes of an XML file in Java and storing the text content nodes to a String?
Example of demo xml file:
<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">
<Field attributeWithChar="A string followed by special symbols
" />
</ABCD>
Example Java code. It loads the XML, iterates over the nodes and collects the text content of each node to a StringBuffer. After the iteration is over, it writes the StringBuffer to the console and also to a file (but no
) symbols.
What would be a way to keep these symbols when storing them to a String? Could you please help me? Thank you.
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
Document document = null;
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
document = documentBuilder.parse(new File("path/to/demo.xml"));
StringBuilder sb = new StringBuilder();
NodeList nodeList = document.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
NamedNodeMap nnp = node.getAttributes();
for (int j = 0; j < nnp.getLength(); j++) {
sb.append(nnp.item(j).getTextContent());
}
}
}
System.out.println(sb.toString());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("path/to/demo_output.xml"), "UTF-8"))) {
writer.write(sb.toString());
}
}
You need to escape all the XML entities before parsing the file into a Document. You do that by escaping the ampersand & itself with its corresponding XML entity &. Something like,
DocumentBuilder documentBuilder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");
Document document = documentBuilder.parse(
new InputSource(new StringReader(xmlContents.replaceAll("&", "&"))
));
Output :
2A string followed by special symbols
P.S. This is complement of Ravi Thapliyal's answer, not an alternative.
I am having the same problem with handling an XML file which is exported from 2003 format Excelsheet. This XML file stores line-breaks in text contents as
along with other numeric character references. However, after reading it with Java DOM parser, manipulating the content of some elements and transforming it back to the XML file, I see that all the numeric character references are expanded (i.e. The line-break is converted to CRLF) in Windows with J2SE1.6. Since my goal is to keep the content format unchanged as much as possible while manipulating some elements (i.e. retain numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.
When writing the XML content back to the file, it is necessary to replace all & with &, right? To do that, I had to give a StringWriter to the transformer as StreamResult and obtain String from it, replace all and dump the string to the xml file.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);
//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);
t.transform(source, result);
//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&" back to "&" to retain numeric character references.
xmlContent = xmlContent.replaceAll("&", "&");
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
wr.write(xmlContent);
wr.close();

Handle invalid xml character while serializing

I have requirement where I need to serialize a document which contains a string like ンᅧᅭ%ンᅨ&. While serializing it throws the following exception:
java.io.IOException: The character '' is an invalid XML character
Is there a way we can serialize this String as is with any workaround?
StringWriter stringOut = new StringWriter();
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("company");
doc.appendChild(rootElement);
String xml = "ンᅧᅭ%ンᅨ&";
//String xml = "ンᅧᅭ%ンᅨ&";
Element junk = doc.createElement("replyToQ");
junk.appendChild(doc.createCDATASection(xml));
//junk.appendChild(doc.createTextNode(stripNonValidXMLCharacters(xml)));
rootElement.appendChild(junk);
//org.w3c.dom.Document doc = this.toDOM();
//Serialize DOM
OutputFormat format = new OutputFormat(doc,"UTF-8",true);
format.setIndenting(false);
format.setLineSeparator("");
format.setPreserveSpace(true);
format.setOmitXMLDeclaration(false);
XMLSerializer serial = new XMLSerializer( stringOut, format );
// As a DOM Serializer
serial.asDOMSerializer();
serial.serialize( doc.getDocumentElement() );
EDIT: I read your question as a deserialisation question, not serialization. Sorry.
The answer is that you need to escape them using Uuicode entity escape strings.
Character ン becomes ソ. See Japanese Katakana chart
Also see here XML Escaping
You need to pre-process the file to correctly escape the xml characters.
read each character in the file
if the character is invalid xml, escape it appropriately
write character to temporary file
at the end of the original file, overwrite original with temporary file.
Your file is now valid xml and can be parsed by standard means. It will most likely be bigger. Give the supplier of your file a telling off for writing a buggy xml writer ;)

xml and percent symbols

I'm using XMLStreamReader to parse a piece of xml:
XMLStreamReader rd = XMLInputFactory.newInstance().createXMLStreamReader(io_xml, "UTF-8");
...
if (eventType == XMLStreamConstants.START_ELEMENT) {
String name = rd.getLocalName();
if (name.equals("key")) {
String val = rd.getElementText();
}
}
Problem is, I'm getting a bad read for the following string: "<key>cami%C3%B5es%2Babc</key>"
org.junit.ComparisonFailure:
expected:<cami[%C3%B5es%]2Babc> but was:<cami[ C3 B5es ]2Babc>
Do I neeed to do anything special within the XML? I already tried to put everything within a CDATA section but I get the same error.
Using a "regular" parser everything works:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
Document parse = builder.parse(is);
String value = parse.getFirstChild().getTextContent();
...
I figured it out. The problem was in a different section of the code. A setter that didn't just set.

Categories