How to disable/avoid Ampersand-Escaping in Java-XML? - java

I want to create an XML document in which blanks are replaced by &#160;. But the Java Transformer escapes the ampersand, so the output is &amp;#160;.
Here is my sample code:
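import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Test {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.newDocument();
        Element element = document.createElement("element");
        // The text is stored literally, so the serializer escapes its ampersand
        element.setTextContent("&#160;");
        document.appendChild(element);
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        StreamResult streamResult = new StreamResult(stream);
        transformer.transform(new DOMSource(document), streamResult);
        System.out.println(stream.toString());
    }
}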
And this is the output of my sample code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<element>&amp;#160;</element>
Any ideas to fix or avoid that? thanks a lot!

Disable output escaping with processing instructions as follows:
Node disableEscaping = document.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, "&");
Element element = document.createElement("element");
element.setTextContent("&#160;");
document.appendChild(disableEscaping);
document.appendChild(element);
Node enableEscaping = document.createProcessingInstruction(StreamResult.PI_ENABLE_OUTPUT_ESCAPING, "&");
document.appendChild(enableEscaping);

Set the text content directly to the character you want, and the serializer will escape it for you if necessary:
element.setTextContent("\u00A0");
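To illustrate the point above, here is a minimal sketch (the class and method names are made up for this example). It stores a real U+00A0 character in the element, serializes it in a chosen encoding, and parses the result back to check that the character survives. As far as I know, the JDK's built-in serializer writes a numeric character reference when the output encoding cannot represent the character (for example US-ASCII), but treat that as an assumption about the default implementation rather than a guarantee.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NbspDemo {
    // Builds <element> holding a real non-breaking space and serializes it
    // using the requested output encoding.
    static byte[] serialize(String encoding) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.newDocument();
        Element element = document.createElement("element");
        element.setTextContent("\u00A0"); // the character itself, not an entity string
        document.appendChild(element);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(stream));
        return stream.toByteArray();
    }

    // Parses the serialized bytes again and returns the root element's text.
    static String roundTrip(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document parsed = builder.parse(new ByteArrayInputStream(xml));
        return parsed.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] ascii = serialize("US-ASCII");
        System.out.println(new String(ascii, "US-ASCII"));
        System.out.println("round trip ok: " + roundTrip(ascii).equals("\u00A0"));
    }
}
```

Either way, no &amp; shows up in the output, because the DOM never contained a literal ampersand in the first place.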

Try to use
element.appendChild(document.createCDATASection("&#160;"));
instead of
element.setTextContent(...);
You'll get this in your XML:
<element><![CDATA[&#160;]]></element>
It may work, if I understand correctly what you're trying to do.

As an add-on to forty-two's answer:
If, like me, you're trying the code in a non-patched Eclipse IDE, you're likely to see some weird characters (such as Â) appearing instead of the non-breaking space. This is because the encoding of the Eclipse console does not match the UTF-8 output.
Adding -Dfile.encoding=UTF-8 to your eclipse.ini should solve this.
Cheers,
Wim

Related

xquery transformation creates empty namespace in element

I'm sorry but I guess I just don't see the mistake I'm making here.
I have a Camel route which returns an XML document, and to be able to test the output I wrote a JUnit test which runs with SpringRunner. There I get the XML stream from the exchange and validate it against an XSD. The validation itself works: it throws an exception because the output XML is not valid. What I don't understand is why the following XQuery generates an element with an EMPTY NAMESPACE.
See the xquery snippet (I'm sorry again I cannot provide more code):
declare default element namespace "http://www.dppgroup.com/XXXPMS";
let $cmmdoc := $doc/*:cmmdoc
, $partner := $doc/*:cmmdoc/*:information/*:partner_gruppe/*:partner
, $sequence:= fn:substring($cmmdoc/@unifier,3)
return <ClientMMS xmlns:infra="http://www.dppgroup.com/InfraNS">
{
for $x in $partner
where $x[@partnerStatusCode = " "]
return
element {"DataGroup" } {
<Client sequenceNumber="{$sequence}" />
}
}
My problem is, that with this code the resulting XML contains the DataGroup-element with the following namespace definition:
<?xml version="1.0" encoding="UTF-8"?>
<ClientMMS xmlns="http://www.dppgroup.com/XXXPMS"
xmlns:infra="http://www.dppgroup.com/InfraNS">
<DataGroup xmlns="">
<Client sequenceNumber="170908065609671475"/>
</DataGroup>
</ClientMMS>
The snippet from the Unit-Test: I'm using jdk1.8_102
String xml = TestDataReader.readXML("/input/info/info_in.xml", PROJECT_ENCODING);
quelle.sendBody(xml);
boolean valid = false;
try {
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream((byte[]) archiv.getExchanges().get(1).getIn().getBody());
Document document = documentBuilder.parse(byteArrayInputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(document);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
System.out.println(xmlString);
I can't find a reason for this in any XQuery introduction, tutorial, or explanation. Can you guys please explain why the DataGroup element is not in the default namespace?
The XQuery you posted should create the result fine without the namespace undeclaration you show.
In your Java code, if you want to work with XML that uses namespaces, make sure you use a namespace-aware DocumentBuilder. The default DocumentBuilderFactory is not namespace aware, so call setNamespaceAware(true) on the factory before creating a DocumentBuilder with it.
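A small sketch of what that setting changes (the class and method names are invented for the example, and the XML is a cut-down version of the output from the question). It parses the same document twice and reports the namespace URI the DOM records for the DataGroup element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceAwareDemo {
    // Parses the XML with or without namespace awareness and returns the
    // namespace URI that the DOM reports for the first DataGroup element.
    static String dataGroupNamespace(String xml, boolean namespaceAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // must be set before newDocumentBuilder()
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Element dataGroup = (Element) doc.getElementsByTagName("DataGroup").item(0);
        return dataGroup.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ClientMMS xmlns=\"http://www.dppgroup.com/XXXPMS\"><DataGroup/></ClientMMS>";
        System.out.println("not namespace aware: " + dataGroupNamespace(xml, false));
        System.out.println("namespace aware:     " + dataGroupNamespace(xml, true));
    }
}
```

Without namespace awareness the element has no namespace URI at all (null), which is the kind of DOM that can lead a serializer to emit xmlns="" on child elements. With it, DataGroup is reported in the default namespace declared on ClientMMS.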

Import and parse an xml file without FileOutputStream

Consider the code fragment that I have at the moment which works and the right elements are found and placed into my map:
public void importXml(InputSource emailAttach)throws Exception {
Map<String, String> hWL = new HashMap<String, String>();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(emailAttach);
FileOutputStream fos=new FileOutputStream("temp.xml");
OutputStreamWriter os = new OutputStreamWriter(fos,"UTF-8");
// Transform to XML UTF-8 format
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(new DOMSource(doc), new StreamResult(os));
os.close();
fos.close();
doc = db.parse(new File("temp.xml"));
NodeList nl = doc.getElementsByTagName("Email");
Element eE=(Element)nl.item(0);
int ctr=eE.getChildNodes().getLength();
String sNName;
String sNValue;
Node nTemp;
for (int i=0;i<ctr;i++){
nTemp=eE.getChildNodes().item(i);
sNName=nTemp.getNodeName().toUpperCase().trim();
if (nTemp.getChildNodes().item(0)!=null) {
sNValue=nTemp.getChildNodes().item(0).getNodeValue().trim();
hWL.put(sNName,sNValue);
}
}
}
However, I would prefer not to create a temp file after converting the data to UTF-8 and then parse from that temp file. Is there any way I can do this?
I've tried using a ByteArrayOutputStream in place of OutputStreamWriter, and calling toString() on the ByteArrayOutputStream as such:
doc = db.parse(bos.toString("UTF-8"));
But then my Map ends up being empty.
From the API docs (studying them meticulously is a valuable skill for any programmer): the parse method with the String argument takes something different from what you are feeding it:
Document parse(String uri)
Parse the content of the given URI as an XML document and return a new DOM Document object.
This might be your friend:
db.parse ( new ByteArrayInputStream( bos.toByteArray()));
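Put together, the in-memory variant might look like the sketch below. The class and method names are hypothetical, and the Email/To sample is only loosely modelled on the question's data. The document is serialized into a ByteArrayOutputStream as UTF-8 and re-parsed from the resulting bytes, so no temp file is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InMemoryReparse {
    // Serializes the document to UTF-8 bytes in memory and parses those bytes again.
    static Document reparse(Document doc) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(bos));

        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(InputStream) reads XML content; parse(String) would treat its argument as a URI
        return db.parse(new ByteArrayInputStream(bos.toByteArray()));
    }

    // Helper for the example: parses XML held in a String.
    static Document parseString(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document original = parseString("<Email><To>someone@example.com</To></Email>");
        Document copy = reparse(original);
        System.out.println(copy.getElementsByTagName("To").item(0).getTextContent());
    }
}
```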
Update
@user2496748 sorry, I should have searched the API docs, but instead I was looking at the source code through a decompiler, which tells me the parameter is arg0 instead of uri. Big difference.
I think I understand stream readers/writers and byte-to-char conversion (and vice versa) a little better now.
After some review I was able to simplify my code to this and achieve what I wanted to do, since I am able to get the email attachment as an InputSource:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
emailAttach.setEncoding("UTF-8");
Document doc = db.parse(emailAttach);
This works as well, and I tested it with non-English characters.
You don't need to write and re-read and re-parse the transformed document. Just change this:
t.transform(new DOMSource(doc), new StreamResult(os));
to this:
DOMResult result = new DOMResult();
t.transform(new DOMSource(doc), result);
doc = (Document)result.getNode();
and then continue from after your present doc = db.parse(new File("temp.xml"));.
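As a self-contained sketch of the same idea (class and method names are made up for illustration), the identity transform below writes straight into a DOMResult. When the DOMResult is created without a node, the transformer builds a new Document, which getNode() then returns.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomResultDemo {
    // Runs the identity transform from one DOM tree into a new one,
    // without serializing to bytes or touching the file system.
    static Document copyViaDomResult(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        t.transform(new DOMSource(doc), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document original = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<Email><Subject>Hi</Subject></Email>")));
        Document copy = copyViaDomResult(original);
        System.out.println(copy.getDocumentElement().getNodeName());
        System.out.println(copy.getElementsByTagName("Subject").item(0).getTextContent());
    }
}
```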

How can i escape special characters with using DOM

This issue has been bugging me a lot lately and i can't seem to find out a possible solution.
I am dealing with a web-server that receives an XML document to do some processing. The server's parser has issues with &,',",<,>. I know this is bad, i didn't implement the xml parser on that server. But before waiting for a patch i need to circumvent.
Now, before uploading my XML document to this server, I need to parse it and escape the XML special characters. I am currently using DOM. The issue is that if I iterate through the TEXT_NODEs and replace all the special characters with their escaped versions, then when I save this document,
for d'ex I get d&amp;apos;ex but I need d&apos;ex
It makes sense, since DOM escapes "&". But obviously this is not what I need.
So if DOM is already capable of escaping "&" to "&amp;", how can I make it escape other characters, like " to &quot;?
If it can't, how can I save the already parsed and escaped text in its nodes without DOM re-escaping it when saving?
This is how I escape the special characters; I used the Apache StringEscapeUtils class:
public String xMLTransform() throws Exception
{
String xmlfile = FileUtils.readFileToString(new File(filepath));
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xmlfile.trim().replaceFirst("^([\\W]+)<", "<"))));
NodeList nodeList = doc.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node currentNode = nodeList.item(i);
if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
Node child = currentNode.getFirstChild();
while(child != null) {
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
//Escaping works here. But when saving the final document, the "&" used in escaping gets escaped as well by DOM.
}
child = child.getNextSibling();
}
}
}
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
FileOutputStream fop = null;
File file;
file = File.createTempFile("escapedXML"+UUID.randomUUID(), ".xml");
fop = new FileOutputStream(file);
String xmlString = writer.toString();
byte[] contentInBytes = xmlString.getBytes();
fop.write(contentInBytes);
fop.flush();
fop.close();
return file.getPath();
}
I think the solution you're looking for is a customized XSLT parser that you can configure for your additional HTML escaping.
I'm not able to say for certain how to configure the xslt file to do what you want, but I am fairly confident it can be done. I've stubbed out the basic Java setup below:
@Test
public void testXSLTTransforms () throws Exception {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element el = doc.createElement("Container");
doc.appendChild(el);
Text e = doc.createTextNode("Character");
el.appendChild(e);
//e.setNodeValue("\'");
//e.setNodeValue("\"");
e.setNodeValue("&");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(System.out);
//This prints the original document to the command line.
transformer.transform(source, result);
InputStream xsltStream = getClass().getResourceAsStream("/characterswap.xslt");
Source xslt = new StreamSource(xsltStream);
transformer = transformerFactory.newTransformer(xslt);
//This one is the one you'd pipe to a file
transformer.transform(source, result);
}
And I've got a simple XSLT I used for proof of concept that shows the default character encoding you mentioned:
characterswap.xslt
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:text>
Original VALUE : </xsl:text>
<xsl:copy-of select="."/>
<xsl:text>
OUTPUT ESCAPING DISABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="yes"/>
<xsl:text>
OUTPUT ESCAPING ENABLED : </xsl:text>
<xsl:value-of select="." disable-output-escaping="no"/>
</xsl:template>
</xsl:stylesheet>
And the console out is pretty basic:
<?xml version="1.0" encoding="UTF-8"?>
<Container>&amp;</Container>
Original VALUE : <Container>&amp;</Container>
OUTPUT ESCAPING DISABLED : &
OUTPUT ESCAPING ENABLED : &amp;
You can take the active node from the XSLT execution and perform specific character replacements. There are multiple examples I was able to find, but I'm having difficulty getting them working in my context. "XSLT string replace" is a good place to start.
This is about the extent of my knowledge with XSLT, I hope it helps you solve your issue.
Best of luck.
I was considering this further, and the solution may not only be XSLT. From your description, I have the impression that rather than xml10 encoding, you're kind of looking for a full set of html encoding.
Along those lines, if we take your current node text transformation:
if (child.getNodeType() == Node.TEXT_NODE) {
child.setNodeValue(StringEscapeUtils.escapeXml10(child.getNodeValue()));
}
And explicitly expect that we want the HTML Encoding:
if (child.getNodeType() == Node.TEXT_NODE) {
//Capture the current node value
String nodeValue = child.getNodeValue();
//Decode for XML10 to remove existing escapes
String decodedNode = StringEscapeUtils.unescapeXml10(nodeValue);
//Then Re-encode for HTML (3/4/5)
String fullyEncodedHTML = StringEscapeUtils.escapeHtml3(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml4(decodedNode);
//String fullyEncodedHTML = StringEscapeUtils.escapeHtml5(decodedNode);
//Then place the fully-encoded HTML back to the node
child.setNodeValue(fullyEncodedHTML);
}
I would think that the XML would now be fully encoded with all of the HTML escapes you were wanting.
Now combine this with the XSLT for output escaping (from above), and the document will not undergo any further transformations when written out to the file.
I like this solution because it limits the logic held in the XSLT file. Rather than managing the entire String find/replace, you would just need to ensure that you copy your entire node and copy the text() with output escaping disabled.
In theory, that seems like it would fulfill my understanding of your objective.
The caveat, again, is that I'm weak with XSLT, so the example XSLT file may still need some tweaking. This solution reduces that unknown quantity of work, in my opinion.
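For what it's worth, here is a rough sketch of the "copy everything, but write text() with output escaping disabled" idea. The stylesheet, class, and method names are my own and were not part of the stub above. It assumes the JDK's built-in XSLT processor, which as far as I know honours disable-output-escaping when writing to a stream, and it assumes the text nodes already hold escaped strings.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RawTextCopyDemo {
    // Identity transform, except that text nodes are written without escaping.
    // The explicit priority avoids a conflict with the node() pattern.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='text()' priority='1'>"
        + "<xsl:value-of select='.' disable-output-escaping='yes'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Builds <Container> whose text node holds an already escaped string.
    static Document buildDocument(String alreadyEscapedText) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader("<Container/>")));
        doc.getDocumentElement().setTextContent(alreadyEscapedText);
        return doc;
    }

    // rawText = false uses the plain identity transformer (escapes "&" again);
    // rawText = true uses the stylesheet above (writes text nodes as they are).
    static String serialize(Document doc, boolean rawText) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = rawText
                ? tf.newTransformer(new StreamSource(new StringReader(XSLT)))
                : tf.newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDocument("d&apos;ex");
        System.out.println(serialize(doc, false));
        System.out.println(serialize(doc, true));
    }
}
```

Keep in mind that with escaping disabled nothing protects you any more: if a text node still contains a raw & or <, the output will not be well-formed XML.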
I've seen people use regex to do something similar
Copied from (Replace special character with an escape preceded special character in Java)
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
That whacky regex is a "look ahead" - a non capturing assertion that the following char match something - in this case a character class.
Notice how you don't need to escape chars in a character class, except a ] (even the minus doesn't need escaping if it is first or last).
The \\\\ is how you code a regex literal \ (escape once for java, once for regex)
Here's a test of this working:
public static void main(String[] args) {
String search = "code:xy";
String newSearch = search.replaceAll("(?=[]\\[+&|!(){}^\"~*?:\\\\-])", "\\\\");
System.out.println(newSearch);
}
Output:
code\:xy
This is very closely related to this question (how to Download a XML file from a URL by Escaping Special Characters like < > $amp; etc?).
That post has a similar case where the code downloads XML with parsed/escaped content.
As I understand it, you read the file, parse it, and escape the characters. During saving, the XML gets "escaped" again. While you can use the DOM for checking well-formedness or validating against a schema, file-based operations can help you escape XML and HTML special characters. The code sample in the post uses IOUtils and StringUtils to do it. Hope this helps!
I would use StringEscapeUtils.escapeXml10()... details here. https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringEscapeUtils.html#ESCAPE_XML10

Not able to parse XML file containing Chinese content

I have an XML file containing Chinese content. But while displaying I am getting question marks. Could somebody look into this issue?
My book.xml :
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<book>
<person>
<first>密码</first>
<last>Pai</last>
<age>22</age>
</person>
</book>
And my code is:
public static void main(String[] argv) throws Exception {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
    Document doc = docBuilder.parse(new File("book.xml"));
    String strDoc = getStringFromDocument(doc);
    System.out.println(strDoc);
}

public static String getStringFromDocument(Document doc) throws TransformerException {
    TransformerFactory transfac = TransformerFactory.newInstance();
    Transformer trans = transfac.newTransformer();
    trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
    trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    StringWriter sw = new StringWriter();
    StreamResult result = new StreamResult(sw);
    DOMSource source = new DOMSource(doc);
    trans.transform(source, result);
    return sw.toString();
}
After that I am getting ??:
<?xml version="1.0" encoding="UTF-8"?>
<book>
<person>
<first>??</first>
<last>Pai</last>
<age>22</age>
</person>
Your code runs fine on my system. I was able to create a books.xml with chinese characters, run your code on my system and get the correct output.
[update]
Previously I thought your books.xml file was suspect - but I was finally able to reproduce your problem on my system by setting -Dfile.encoding=ISO-8859-1.
Somewhere in your environment you have an incorrect character encoding setting. Perhaps in the JVM, perhaps in the console where you are displaying the characters.
One way to ensure that you are writing your String as a UTF-8 encoded byte stream is to change:
System.out.println(strDoc);
to
System.out.write(strDoc.getBytes("UTF-8"));
This may or may not fix what you are seeing on the screen. Your console must also be configured to properly handle UTF-8 encoded data. But if you write these bytes to a file or socket, you should be able to confirm that the bytes match those in your original file.
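An alternative to calling getBytes on every string is to wrap the target stream in a PrintStream with an explicit charset. The sketch below (class and method names are hypothetical) does that and, in main, prints the two characters from the question as Unicode escapes, so the example does not depend on the encoding of the source file.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;

class Utf8PrintDemo {
    // Wraps any OutputStream so that everything printed through it is
    // encoded as UTF-8, independent of the JVM's file.encoding setting.
    static PrintStream utf8Stream(OutputStream target) throws Exception {
        return new PrintStream(target, true, "UTF-8");
    }

    // Returns the bytes that printing the text through a UTF-8 PrintStream produces.
    static byte[] encodeViaPrintStream(String text) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = utf8Stream(sink);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        PrintStream out = utf8Stream(System.out);
        // \u5BC6\u7801 are the two characters from the <first> element in book.xml
        out.println("<first>\u5BC6\u7801</first>");
    }
}
```

Whether the console then shows the characters correctly still depends on the console's own encoding, as noted above.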

XML Canonical form in Java

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
Have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I really need to do this?
doc.normalize();
// Canonize using Apache's Xml security
org.apache.xml.security.Init.init(); // Doesn't work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
.canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse recommended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
"{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code, would output:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.
Having such an example, I have a few questions:
For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again, within my intent of having a canonical and pretty-printed XML).
Why do the blank spaces within the xml seem to screw the formatting? Would I have to trim the text of each xml node to make it work? It just sounds wrong, nonetheless if the input xml is <aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa> the xml is perfectly formatted.
Why after asking for the canonical form, tags such as <ccc/> weren't expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.
