Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

I am parsing XML in Java that contains numeric character references such as (but not limited to) &#10; and &#13; (line feed and carriage return). While parsing, I append the text content of nodes to a StringBuilder so that I can later write it out to a text file.
However, these character references are resolved into newlines/whitespace when I write the String to a file or print it.
How can I keep the original numeric character references when iterating over the nodes of an XML file in Java and storing the text content in a String?
Example of demo xml file:
<?xml version="1.0" encoding="UTF-8"?>
<ABCD version="2">
<Field attributeWithChar="A string followed by special symbols &#13;&#10;" />
</ABCD>
Example Java code. It loads the XML, iterates over the nodes and collects the text content of each attribute in a StringBuilder. After the iteration is over, it writes the result to the console and also to a file (but without the &#13;&#10; symbols).
What would be a way to keep these symbols when storing them in a String? Could you please help me? Thank you.
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
Document document = null;
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
document = documentBuilder.parse(new File("path/to/demo.xml"));
StringBuilder sb = new StringBuilder();
NodeList nodeList = document.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
NamedNodeMap nnp = node.getAttributes();
for (int j = 0; j < nnp.getLength(); j++) {
sb.append(nnp.item(j).getTextContent());
}
}
}
System.out.println(sb.toString());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("path/to/demo_output.xml"), "UTF-8"))) {
writer.write(sb.toString());
}
}

You need to escape all the XML entities before parsing the file into a Document. You do that by escaping the ampersand & itself with its corresponding XML entity, &amp;. Something like:
DocumentBuilder documentBuilder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
String xmlContents = new String(Files.readAllBytes(Paths.get("demo.xml")), "UTF-8");
Document document = documentBuilder.parse(
new InputSource(new StringReader(xmlContents.replaceAll("&", "&amp;"))
));
Output:
2A string followed by special symbols &#13;&#10;
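As a minimal sketch of the escaping approach above, the following compares the attribute value with and without the pre-parse escaping. The class and method names are made up for illustration, and the XML is inlined instead of being read from demo.xml. Note that this escapes every ampersand, so legitimate entities such as &lt; would also come back as literal text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class KeepCharRefsDemo {

    // Hypothetical helper: parses the XML text and returns the attribute value,
    // optionally turning every '&' into '&amp;' before the parser sees it.
    static String readAttribute(String xml, boolean escapeAmpersands) {
        // String.replace is a literal replacement, so no regex escaping is needed.
        String input = escapeAmpersands ? xml.replace("&", "&amp;") : xml;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(input)));
            Element field = (Element) doc.getElementsByTagName("Field").item(0);
            return field.getAttribute("attributeWithChar");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ABCD version=\"2\">"
                + "<Field attributeWithChar=\"symbols &#13;&#10;\" /></ABCD>";
        // With escaping, the reference survives as literal text.
        System.out.println(readAttribute(xml, true));
        // Without escaping, the parser resolves it to the CR and LF characters.
        System.out.println(readAttribute(xml, false).endsWith("\r\n"));
    }
}
```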

P.S. This complements Ravi Thapliyal's answer; it is not an alternative.
I am having the same problem with an XML file that is exported from an Excel 2003-format spreadsheet. This XML file stores line breaks in text content as &#10;,
along with other numeric character references. However, after reading it with the Java DOM parser, manipulating the content of some elements and transforming it back to an XML file, I see that all the numeric character references are expanded (i.e. the line break is converted to CRLF) on Windows with J2SE 1.6. Since my goal is to keep the content format as unchanged as possible while manipulating some elements (i.e. to retain the numeric character references), Ravi Thapliyal's suggestion seems to be the only working solution.
When writing the XML content back to the file, it is necessary to replace every &amp; with & again. To do that, I passed a StringWriter to the transformer as the StreamResult, obtained a String from it, replaced all occurrences and wrote the string to the XML file.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
DOMSource source = new DOMSource(document);
//write into a stringWriter for further processing.
StringWriter stringWriter = new StringWriter();
StreamResult result = new StreamResult(stringWriter);
t.transform(source, result);
//stringWriter stream contains xml content.
String xmlContent = stringWriter.getBuffer().toString();
//revert "&amp;" back to "&" to retain numeric character references.
xmlContent = xmlContent.replaceAll("&amp;", "&");
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
wr.write(xmlContent);
wr.close();
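Put together, the round trip described above could look roughly like this. It is only a sketch: roundTrip is a hypothetical helper, and it works on strings instead of files to stay self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class RoundTripDemo {

    // Hypothetical helper: escape '&', parse, serialize, then undo the escaping.
    static String roundTrip(String xml) {
        try {
            // 1. Hide every reference from the parser by escaping the ampersand.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml.replace("&", "&amp;"))));

            // (Any DOM manipulation would happen here.)

            // 2. Serialize into a StringWriter so the result can be post-processed.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));

            // 3. Turn "&amp;" back into "&" so the references reappear.
            return out.toString().replace("&amp;", "&");
        } catch (ParserConfigurationException | SAXException
                | IOException | TransformerException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<r a=\"x&#10;y\">1 &lt; 2 &#13;</r>"));
    }
}
```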

Related

Import and parse an xml file without FileOutputStream

Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However, I would prefer not to create a temp file after converting the data to UTF-8 and then parse from that temp file. Is there any way I can do this?
I've tried using a ByteArrayOutputStream in place of the OutputStreamWriter, and calling toString() on the ByteArrayOutputStream like this:
doc = db.parse(bos.toString("UTF-8"));
But then my Map ends up being empty.

From the API docs (studying them carefully is a valuable skill for any programmer): the parse method with a String argument takes something different from what you are passing to it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));
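A self-contained sketch of that idea, with the temp file replaced by an in-memory byte array. The class and method names are invented for this example, and the sample XML stands in for the email attachment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class InMemoryReparseDemo {

    // Hypothetical helper: parse, serialize to UTF-8 bytes in memory, parse again.
    static Document reparseInMemory(byte[] xmlBytes) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new ByteArrayInputStream(xmlBytes));

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.transform(new DOMSource(doc), new StreamResult(bos));

            // parse(InputStream) reads XML content from the stream;
            // parse(String) would treat its argument as a URI instead.
            return db.parse(new ByteArrayInputStream(bos.toByteArray()));
        } catch (ParserConfigurationException | SAXException
                | IOException | TransformerException e) {
            throw new IllegalStateException("In-memory re-parse failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<Email><To>bob</To></Email>".getBytes(StandardCharsets.UTF_8);
        Document doc = reparseInMemory(xml);
        System.out.println(doc.getElementsByTagName("To").item(0).getTextContent());
    }
}
```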

Update
@user2496748, sorry, I should have checked the API docs, but instead I was looking at the source code through a decompiler, which shows the parameter as arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte-to-char conversion (and vice versa) a little better now.
After some review I was able to simplify my code to this and achieve what I wanted to do, since I am able to get the email attachment as an InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
This works as well, and I tested it with non-English characters.

You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.
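For reference, here is a small runnable sketch of the DOMResult variant. The class and method names and the sample XML are assumptions made for the example; only the three lines around DOMResult come from the suggestion above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomResultDemo {

    // Hypothetical helper: runs the identity transform straight into a new DOM tree,
    // so nothing is written to disk or parsed a second time.
    static Document transformToDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Transformer t = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            t.transform(new DOMSource(doc), result);
            return (Document) result.getNode();
        } catch (ParserConfigurationException | SAXException
                | IOException | TransformerException e) {
            throw new IllegalStateException("Transform to DOM failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = transformToDom("<Email><To>bob</To></Email>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```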

Correct xml escaping in Java

I need to convert CSV into XML and then write it to an OutputStream. The rule is to convert " into &quot; in my code.
Input CSV row:
{"Test":"Value"}
Expected output:
<root>
<child>{&quot;Test&quot;:&quot;Value&quot;}</child>
</root>
Current output:
<root>
<child>{&amp;quot;Test&amp;quot;:&amp;quot;Value&amp;quot;}</child>
</root>
Code:
File file = new File(FilePath);
BufferedReader reader = null;
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document newDoc = domBuilder.newDocument();
Element rootElement = newDoc.createElement("root");
newDoc.appendChild(rootElement);
reader = new BufferedReader(new FileReader(file));
String text = null;
while ((text = reader.readLine()) != null) {
Element rowElement = newDoc.createElement("child");
rootElement.appendChild(rowElement);
text = StringEscapeUtils.escapeXml(text);
rowElement.setTextContent(text);
}
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(newDoc);
Result outputTarget = new StreamResult(outputStream);
TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
System.out.println(new String(outputStream.toByteArray()));
Could you please help? What am I missing, and at which point does & get converted to &amp;?

The XML library will automatically escape strings that need to be XML-escaped, so you don't need to escape manually using StringEscapeUtils.escapeXml. Simply remove that line and you should get exactly what you're looking for: properly escaped XML.
XML doesn't require " characters to be escaped everywhere, only within attribute values. So this is valid XML already:
<root>
<child>{"Test":"Value"}</child>
</root>
You would escape the quotes if you had an attribute that contained a quote, such as: <child attr="properly &quot;ed"/>
This is one of the main reasons to use an XML library: the subtleties of quoting are already handled for you. No need to read the XML spec to make sure you got the quoting rules correct.
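A short sketch of this behaviour, assuming the JDK's built-in Transformer. The serialize helper and its parameters are invented for the example; it builds one child element and returns the serialized document, so you can compare raw text against pre-escaped text.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class AutoEscapeDemo {

    // Hypothetical helper: <root><child attr="...">text</child></root> as a string.
    static String serialize(String text, String attrValue) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            Element child = doc.createElement("child");
            child.setAttribute("attr", attrValue);
            child.setTextContent(text); // raw text: the serializer escapes what it must
            root.appendChild(child);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Raw quotes in element text are written as-is.
        System.out.println(serialize("{\"Test\":\"Value\"}", "plain"));
        // Pre-escaped text gets its '&' escaped again, which is the double escaping
        // seen in the question.
        System.out.println(serialize("{&quot;Test&quot;}", "plain"));
        // A quote inside an attribute value is escaped by the serializer.
        System.out.println(serialize("x", "properly \"ed"));
    }
}
```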

Work with raw text in javax.xml.transform.Transformer

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:
String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.
I have no control over the input string and I need exactly the output "This &mdash; That".
If I use StringEscapeUtils.unescapeHtml, the output is "This — That", which is not what I need.
I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".
What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text, or how can I transform the input to get entities in the output?
Please explain how this is a duplicate.
The question referenced had the problem that "&#13;&#10;" was being converted into CRLF because the entities were being resolved. The solution was to escape the entities.
My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "&mdash;" is output as "&amp;mdash;".
I cannot use the solution of post-converting all "&amp;" -> "&" because not all nodes represent HTML.
More complete code:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);
DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
"-//Company//program//language",
"test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());
// outputs xml header, then "<Test>This &amp;mdash; That</Test>"

The fact is, once you have a DOM tree, there's no longer a string with &mdash; in it: the text is instead represented internally as a Unicode string.
So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
Regarding serialization, there are a few other questions, including "Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat".
To parse a single node, there is LSParser.parseWithContext.
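To illustrate the first point, here is a rough sketch that parses the already-escaped string instead of inserting it as a text node. It does not use LSParser; as a swapped-in technique it wraps the fragment in a throwaway document whose internal DTD declares mdash. The wrapper element w, the class name and the helper name are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EntityTextDemo {

    // Hypothetical helper: parses a text fragment that may contain &mdash;
    // and returns the text as the DOM stores it.
    static String parseFragmentText(String fragment) {
        // The internal DTD subset declares the one HTML entity this sketch handles.
        String wrapped = "<!DOCTYPE w [<!ENTITY mdash \"&#8212;\">]>"
                + "<w>" + fragment + "</w>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the fragment", e);
        }
    }

    public static void main(String[] args) {
        String text = parseFragmentText("This &mdash; That");
        // The DOM holds the actual character U+2014, not the entity name.
        System.out.println(text.indexOf('\u2014'));
    }
}
```

Once parsed this way, the nodes under w can be imported into the real document with Document.importNode. How the character is written back out (as itself, as a numeric reference, or as a named entity) is then purely a matter of the serializer.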

Edit a XML file in java

I have the following XML file:
<Tables>
<table>
<row></row>
</table>
</Tables>
and I want to edit it to:
<Tables>
<table>
<row>some value</row>
</table>
</Tables>
I write the XML file using a FileWriter. How can I edit it?
What I have found so far is to create a temp file that contains the edits, then delete the original file and rename the temp file. Is there any other way?
This is my code to write the file:
public boolean createTable(String path, String name, String[] properties) throws IOException {
FileWriter writer = new FileWriter(path);
writer.write("<Tables>");
writer.write("\t<" + name + ">");
for(int i=0; i<properties.length; i++){
writer.write("\t\t<" + properties[i] + "></" + properties[i] + ">");
}
writer.write("\t</" + name + ">");
writer.write("</Tables>");
writer.close();
return false;
}

Don't read and write XML yourself. Java comes with multiple APIs for parsing and generating XML, which take care of all the encoding and escaping issues for you:
DOM: XML is loaded into memory in a tree structure.
SAX: XML is processed as a sequence of events. This is a push parser, where the parser calls your code for each event.
StAX: XML is read as a sequence of events/tokens. This is a pull parser, where your code calls the parser to get the next value.
You can also find many third-party libraries for parsing XML, and Java itself also supports marshalling of XML to POJOs.
In your case I'd suggest DOM, since it's the easiest to use. Don't use DOM for huge XML files, since it loads the entire file into memory. For huge files, I'd suggest StAX.
Besides avoiding encoding issues, using an XML parser makes the code less susceptible to minor variations in the input. For example, the three empty row elements below all mean the same thing. With hand-written string handling you would also have to work out whether the row element is empty at all, and how to get rid of existing content such as that shown:
<!-- row is empty -->
<row></row>
<row/>
<row />
<!-- row has content -->
<row>5 + 7 &lt; 10</row>
<row><![CDATA[5 + 7 < 10]]></row>
<row><condition expr="5 + 7 &lt; 10"/></row>
Using DOM:
// Load XML from file
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
Document document = domBuilder.parse(file);
// Modify DOM tree (simple version)
NodeList rowNodes = document.getElementsByTagName("row");
for (int i = 0; i < rowNodes.getLength(); i++) {
Node rowNode = rowNodes.item(i);
// Remove existing content (if any)
while (rowNode.getFirstChild() != null)
rowNode.removeChild(rowNode.getFirstChild());
// Add text content
rowNode.appendChild(document.createTextNode("some value"));
}
// Save XML to file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(document),
new StreamResult(file));
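For the huge-file case mentioned above, a StAX version might look like the following. This is only a sketch of the pull-parser style, not a replacement for the DOM code: it reads the row values instead of editing them, and the class and method names are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxRowsDemo {

    // Hypothetical helper: pulls events one by one and collects the text of each <row>.
    static List<String> readRows(String xml) {
        List<String> rows = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // our code asks the parser for the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    // getElementText() handles <row></row>, <row/>, escapes and CDATA alike.
                    rows.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the sample XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<Tables><table><row></row><row/>"
                + "<row>5 + 7 &lt; 10</row><row><![CDATA[5 + 7 < 10]]></row>"
                + "</table></Tables>";
        System.out.println(readRows(xml));
    }
}
```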

If your XML is static you can use this; here input.xml is your XML file:
File file = new File("input.xml");
byte[] data;
try (FileInputStream fis = new FileInputStream(file)) {
data = new byte[(int) file.length()];
fis.read(data);
}
String input = new String(data, "UTF-8");
String tag = "<row>";
String newXML = input.substring(0, input.indexOf(tag) + tag.length()) + "your value" + input.substring(input.indexOf(tag) + tag.length(), input.length());
try (FileWriter fw = new FileWriter(file)) {
fw.write(newXML);
}
System.out.println("XML Updated");

How to remove extra empty lines from XML file?

In short: I have many empty lines generated in an XML file, and I am looking for a way to remove them in order to clean up the file. How can I do that?
In detail: I currently have this XML file:
<recent>
<paths>
<path>path1</path>
<path>path2</path>
<path>path3</path>
<path>path4</path>
</paths>
</recent>
And I use this Java code to delete all <path> tags and add new ones instead:
public void savePaths( String recentFilePath ) {
ArrayList<String> newPaths = getNewRecentPaths();
Document recentDomObject = getXMLFile( recentFilePath ); // Get the <recent> element.
NodeList pathNodes = recentDomObject.getElementsByTagName( "path" ); // Get all <path> nodes.
//1. Remove all old path nodes :
for ( int i = pathNodes.getLength() - 1; i >= 0; i-- ) {
Element pathNode = (Element)pathNodes.item( i );
pathNode.getParentNode().removeChild( pathNode );
}
//2. Save all new paths :
Element pathsElement = (Element)recentDomObject.getElementsByTagName( "paths" ).item( 0 ); // Get the first <paths> node.
for( String newPath: newPaths ) {
Element newPathElement = recentDomObject.createElement( "path" );
newPathElement.setTextContent( newPath );
pathsElement.appendChild( newPathElement );
}
//3. Save the XML changes :
saveXMLFile( recentFilePath, recentDomObject );
}
After executing this method a number of times I get an XML file with the right results, but with many empty lines after the "paths" tag and before the first "path" tag, like this:
<recent>
<paths>
<path>path5</path>
<path>path6</path>
<path>path7</path>
</paths>
</recent>
Does anyone know how to fix that?
------------------------------------------- Edit: Add the getXMLFile(...), saveXMLFile(...) code.
public Document getXMLFile( String filePath ) {
File xmlFile = new File( filePath );
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
return domObject;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public void saveXMLFile( String filePath, Document domObject ) {
File xmlOutputFile = null;
FileOutputStream fos = null;
try {
xmlOutputFile = new File( filePath );
fos = new FileOutputStream( xmlOutputFile );
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
transformer.setOutputProperty( "{http://xml.apache.org/xslt}indent-amount", "2" );
DOMSource xmlSource = new DOMSource( domObject );
StreamResult xmlResult = new StreamResult( fos );
transformer.transform( xmlSource, xmlResult ); // Save the XML file.
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (TransformerConfigurationException e) {
e.printStackTrace();
} catch (TransformerException e) {
e.printStackTrace();
} finally {
if (fos != null)
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

First, an explanation of why this happens — which might be a bit off since you didn't include the code that is used to load the XML file into a DOM object.
When you read an XML document from a file, the whitespace between tags actually constitutes valid DOM nodes, according to the DOM specification. Therefore, the XML parser treats each such sequence of whitespace as a DOM node (of type TEXT).
To get rid of it, there are three approaches I can think of:
1) Associate the XML with a schema, and then use setValidating(true) along with setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory. (Note: setIgnoringElementContentWhitespace will only work if the parser is in validating mode, which is why you must use setValidating(true).)
2) Write an XSL to process all nodes, filtering out whitespace-only TEXT nodes.
3) Use Java code to do this: use XPath to find all whitespace-only TEXT nodes, iterate through them and remove each one from its parent (using getParentNode().removeChild()). Something like this would do (doc would be your DOM document object):
XPath xp = XPathFactory.newInstance().newXPath();
NodeList nl = (NodeList) xp.evaluate("//text()[normalize-space(.)='']", doc, XPathConstants.NODESET);
for (int i=0; i < nl.getLength(); ++i) {
Node node = nl.item(i);
node.getParentNode().removeChild(node);
}
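The first approach can be sketched as follows. This example makes some assumptions: it uses an internal DTD (rather than an XSD) as the schema, and the DTD, sample document, class name and helper name are all invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class IgnoreWhitespaceDemo {

    // Sample document with a DTD that says <paths> may only contain <path> elements.
    static final String SAMPLE =
            "<!DOCTYPE recent [\n"
            + "  <!ELEMENT recent (paths)>\n"
            + "  <!ELEMENT paths (path*)>\n"
            + "  <!ELEMENT path (#PCDATA)>\n"
            + "]>\n"
            + "<recent>\n  <paths>\n    <path>path1</path>\n    <path>path2</path>\n  </paths>\n</recent>";

    // Hypothetical helper: returns how many child nodes <paths> has after parsing.
    static int countPathsChildren(String xml, boolean validateAndIgnore) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(validateAndIgnore);
            dbf.setIgnoringElementContentWhitespace(validateAndIgnore);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("paths").item(0).getChildNodes().getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Default parser: the indentation around each <path> shows up as text nodes.
        System.out.println(countPathsChildren(SAMPLE, false));
        // Validating parser that ignores element-content whitespace: only the elements remain.
        System.out.println(countPathsChildren(SAMPLE, true));
    }
}
```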

I was able to fix this by using this code after removing all the old "path" nodes :
while( pathsElement.hasChildNodes() )
pathsElement.removeChild( pathsElement.getFirstChild() );
This will remove all the generated empty spaces in the XML file.
Special thanks to MadProgrammer for commenting with the helpful link mentioned above.

You could look at something like this if you only need to "clean" your xml quickly.
Then you could have a method like:
public static String cleanUp(String xml) {
final StringReader reader = new StringReader(xml.trim());
final StringWriter writer = new StringWriter();
try {
XmlUtil.prettyFormat(reader, writer);
return writer.toString();
} catch (IOException e) {
e.printStackTrace();
}
return xml.trim();
}
Also, to compare and check differences, if you need it: XMLUnit

I faced the same problem and had no idea what caused it for a long time, but after Brad's question and his own answer to it, I figured out where the trouble is.
I have to add my own answer, because Brad's isn't really perfect. As Isaac said:
I wouldn't be a huge fan of blindly removing child nodes without knowing what they are
So, a better "solution" (quoted because it is more of a workaround) is:
pathsElement.setTextContent("");
This completely removes the useless blank lines. It is definitely better than removing all the child nodes. Brad, this should work for you too.
But this addresses the effect, not the cause.
The cause is: when we call removeChild(), it removes the child but leaves the removed child's indent and line break behind. This indent and line break are then treated as text content.
So, to remove the cause, we should figure out how to remove a child together with its indent. Welcome to my question about this.
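The cause described above is easy to see by counting child nodes. In this sketch (the class name, helper name and sample XML are assumptions for the example), the whitespace text nodes stay behind after the elements are removed, and a second pass deletes only those whitespace-only text nodes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class LeftoverWhitespaceDemo {

    // Hypothetical helper: returns the child count of <paths>
    // {before removal, after removing <path> elements, after removing leftover whitespace}.
    static int[] childCounts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element paths = (Element) doc.getElementsByTagName("paths").item(0);
            int before = paths.getChildNodes().getLength();

            // Remove the <path> elements, iterating backwards over the live list.
            NodeList pathNodes = doc.getElementsByTagName("path");
            for (int i = pathNodes.getLength() - 1; i >= 0; i--) {
                Node n = pathNodes.item(i);
                n.getParentNode().removeChild(n);
            }
            int afterRemoval = paths.getChildNodes().getLength();

            // Remove only the whitespace-only text nodes that were left behind.
            NodeList rest = paths.getChildNodes();
            for (int i = rest.getLength() - 1; i >= 0; i--) {
                Node n = rest.item(i);
                if (n.getNodeType() == Node.TEXT_NODE && n.getNodeValue().trim().isEmpty()) {
                    paths.removeChild(n);
                }
            }
            return new int[] { before, afterRemoval, paths.getChildNodes().getLength() };
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<paths>\n  <path>a</path>\n  <path>b</path>\n</paths>";
        System.out.println(java.util.Arrays.toString(childCounts(xml)));
    }
}
```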

There is a very simple way to get rid of the empty lines if you are using a DOM handling API (for example DOM4J):
1) Place the text you want to keep in a variable (i.e. text).
2) Set the node text to "" using node.setText("").
3) Set the node text to text using node.setText(text).
Et voilà! There are no more empty lines. The other answers explain very well how the extra empty lines in the XML output are actually extra nodes of type text.
This technique can be used with any DOM parsing system, so long as the name of the text setting function is changed to suit the one in your API, hence the way of representing it slightly more abstractly.
Hope this helps:)

When I used dom4j to remove some elements I ran into the same problem, and the solutions above were not usable without adding some other required jars. Finally, I found a simple solution that only needs the JDK io package:
Use a BufferedReader to read the XML file and filter out empty lines:
StringBuilder stringBuilder = new StringBuilder();
FileInputStream fis = new FileInputStream(outFile);
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader br = new BufferedReader(isr);
String s;
while ((s = br.readLine()) != null) {
if (s.trim().length() > 0) {
stringBuilder.append(s).append("\n");
}
}
Write the string to the XML file:
OutputStreamWriter osw = new OutputStreamWriter(fou);
BufferedWriter bw = new BufferedWriter(osw);
String str = stringBuilder.toString();
bw.write(str);
bw.flush();
Remember to close all the streams.

In my case, I converted it to a string then just did a regex:
//save as String
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
tr.transform(new DOMSource(document), result);
strResult = writer.toString();
//remove empty lines
strResult = strResult.replaceAll("\\n\\s*\\n", "\n");
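A tiny standalone check of that regex (the class and method names are invented for the example). Note that it works on the serialized string, not on the DOM, so the whitespace text nodes are still in the tree.

```java
final class BlankLineRegexDemo {

    // A newline, any run of whitespace, then another newline collapses to one newline.
    // The indentation of the following non-empty line is kept.
    static String removeBlankLines(String text) {
        return text.replaceAll("\\n\\s*\\n", "\n");
    }

    public static void main(String[] args) {
        String xml = "<paths>\n\n\n  \n  <path>a</path>\n</paths>";
        System.out.println(removeBlankLines(xml));
    }
}
```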

A couple of remarks:
1) When you are manipulating XML (removing elements / adding new ones) I strongly advise you to use XSLT (and not DOM).
2) When you transform an XML document with XSLT (as you do in your save method), set OutputKeys.INDENT to "no".
3) For simple post-processing of your XML (removing whitespace, comments, etc.) you can use a simple SAX2 filter.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setIgnoringElementContentWhitespace(true);

I am using the code below:
System.out.println("Start remove textnode");
i=0;
while (parentNode.getChildNodes().item(i)!=null) {
System.out.println(parentNode.getChildNodes().item(i).getNodeName());
if (parentNode.getChildNodes().item(i).getNodeName().equalsIgnoreCase("#text")) {
parentNode.removeChild(parentNode.getChildNodes().item(i));
System.out.println("text node removed");
// the next sibling has moved to index i, so do not advance here
} else {
i=i+1;
}
}

Very late answer, but maybe it is still helpful to someone.
I had this code in my class, where the document is built after transformation (Just like you):
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
Change the last line to
transformer.setOutputProperty(OutputKeys.INDENT, "no");
