jaxb marshaller characterEscapeHandler - java

I have the following problem. I've set the following properties to the marshaller:
marshaller.setProperty( Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE );
marshaller.setProperty( "com.sun.xml.bind.characterEscapeHandler", new CharacterEscapeHandler() {
public void escape(char[] ch, int start, int length, boolean isAttVal, Writer out) throws IOException {
String s = new String(ch, start, length);
System.out.println("Inside CharacterEscapeHandler...");
out.write(StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeXml(s)));
}
});
When I try to marshal an object into a SOAPBody with the following code:
SOAPMessage message = MessageFactory.newInstance().createMessage();
marshaller.marshal(request, message.getSOAPBody());
CharacterEscapeHandler.escape() is not invoked and the characters are not escaped. This code, however:
StringWriter writer = new StringWriter();
marshaller.marshal(request, writer);
does invoke CharacterEscapeHandler.escape(), and all the characters are escaped. Is this normal behaviour for JAXB? And how can I escape characters before placing them inside the SOAP body?
Update:
Our system has to communicate with another system, which expects the text to be escaped.
Example of a message sent by the other system:
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<env:Body xmlns:ac="http://www.ACORD.org/Standards/AcordMsgSvc/1">
<ac:CallRs xmlns:ac="http://www.ACORD.org/Standards/AcordMsgSvc/1">
<ac:Sender>
<ac:PartyId>urn:partyId</ac:PartyId>
<ac:PartyRoleCd/>
<ac:PartyName>PARTYNAME</ac:PartyName>
</ac:Sender>
<ac:Receiver>
<ac:PartyRoleCd>broker</ac:PartyRoleCd>
<ac:PartyName>Амарант България ООД</ac:PartyName>
</ac:Receiver>
<ac:Application>
<ac:ApplicationCd>applicationCd</ac:ApplicationCd>
<ac:SchemaVersion>schemaversion/</ac:SchemaVersion>
</ac:Application>
<ac:TimeStamp>2011-05-11T18:41:19</ac:TimeStamp>
<ac:MsgItem>
<ac:MsgId>30d63016-fa7d-4410-a19a-510e43674e70</ac:MsgId>
<ac:MsgTypeCd>Error</ac:MsgTypeCd>
<ac:MsgStatusCd>completed</ac:MsgStatusCd>
</ac:MsgItem>
<ac:RqItem>
<ac:MsgId>d8c2d9c4-3f1c-459f-abe1-0e9accbd176b</ac:MsgId>
<ac:MsgTypeCd>RegisterPolicyRq</ac:MsgTypeCd>
<ac:MsgStatusCd>completed</ac:MsgStatusCd>
</ac:RqItem>
<ac:WorkFolder>
<ac:MsgFile>
<ac:FileId>cid:28b8c9d1-9655-4727-bbb2-3107482e7f2e</ac:FileId>
<ac:FileFormatCd>text/xml</ac:FileFormatCd>
</ac:MsgFile>
</ac:WorkFolder>
</ac:CallRs>
</env:Body>
</env:Envelope>
So I need to escape all the text between the opening and closing tags, as in the ac:PartyName element inside ac:Receiver above.

When you marshal to a DOM node (a SOAPBody is one), JAXB is not in charge of the actual serialization and escaping; it just builds the DOM tree in memory. The serialization is then handled by the DOM implementation, which is why the CharacterEscapeHandler is never called in that case.
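To illustrate where escaping happens for a DOM target, here is a minimal sketch that uses only the JDK's DOM and JAXP APIs (no JAXB). The class name DomEscapingDemo and the sample text are made up for illustration. The tree stores the raw characters, and the markup characters are escaped only when a serializer writes the tree out.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original question.
class DomEscapingDemo {

    /** Builds a one-element DOM tree holding the raw text, then serializes it. */
    static String serialize(String rawText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element name = doc.createElement("PartyName");
        // The DOM stores the characters as-is; nothing is escaped at this point.
        name.setTextContent(rawText);
        doc.appendChild(name);

        // An identity transformer acts as the serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // Markup characters such as & and < are escaped here, by the serializer.
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("Food & Beverages <EU>"));
    }
}
```

Any custom escaping therefore has to be applied at the point where the SOAP message itself is written out, not on the marshaller that fills the tree.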
Needing additional escaping when writing XML is usually a sign of a design problem or of XML not being used correctly. If you can give some more context on why you need this escaping, I may be able to suggest an alternative solution.

Related

Encoding for unicode and & characters

I am trying to save the string below to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But when I print the value from the protobuf model, it is displayed like this:
STOXXÂ®Europe 600 Food&amp;BevNR ETF
I tried encoding the string as UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but neither worked. I get this string by parsing the XML response from a server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader
Correcting the XML parsing is better than having to unescape everything afterwards. The test case below shows this:
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class XmlTextDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing merges adjacent character data and entity references into one text event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        // Reading from a Reader avoids any byte-level charset mismatch.
        XMLStreamReader reader = factory.createXMLStreamReader(
                new StringReader("<tag>STOXX®Europe 600 Food&amp;BevNR ETF</tag>"));
        StringBuilder sb = new StringBuilder();
        while (reader.hasNext()) {
            reader.next();
            if (reader.hasText())
                sb.append(reader.getText());
        }
        System.out.println(sb);
    }
}
Output:
STOXX®Europe 600 Food&BevNR ETF
Actually, I found that protobuf's own ByteString can be used to solve this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString
As the text comes from XML use:
s = StringEscapeUtils.unescapeXml(s);
This is much better than unescaping HTML, which has hundreds of named entities (&...;).
The two garbage characters in place of the ® symbol come from reading UTF-8 encoded text (where special characters take multiple bytes) as a single-byte encoding, probably Latin-1.
This wrong conversion might be repaired with another conversion, but the best fix is to read the data as UTF-8 in the first place.
// Hack, just patching. Assumes the text was wrongly decoded as Latin-1
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe, if the platform default charset was used when reading:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
It is better to inspect the reading code and check whether an optional encoding argument went missing somewhere: InputStreamReader, OutputStreamWriter, new String, getBytes.
Using a proper XML reader would also solve the entire problem.
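The round trip described above can be reproduced in a few lines. This is only a sketch with made-up names (MojibakeDemo, misdecode, repair), and the repair step is valid only under the assumption that the wrong decoder really was ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the UTF-8 / Latin-1 mix-up.
class MojibakeDemo {

    /** Simulates the bug: UTF-8 bytes decoded with a Latin-1 decoder. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses the damage, assuming the wrong decoder was Latin-1. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "STOXX\u00AEEurope"; // \u00AE is the registered sign
        String garbled = misdecode(original);
        // The single sign has become two characters: U+00C2 followed by U+00AE.
        System.out.println(garbled.length() - original.length()); // prints 1
        System.out.println(repair(garbled).equals(original));     // prints true
    }
}
```

The registered sign is encoded in UTF-8 as the two bytes C2 AE, which is why exactly two characters show up when each byte is decoded on its own.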

JDOM Transformer - don't contract empty elements

I'm using JDOM 2.0.6 to run an XSLT transformation that produces HTML, but I'm coming across the following problem: sometimes the data is empty. That is, my XSLT contains the following:
<div class="someclass"><xsl:value-of select="somevalue"/></div>
and when somevalue is empty, the output I get is:
<div class="someclass"/>
which may be perfectly valid XML, but is not valid HTML, and causes problems when displaying the resulting page.
Similar problems occur for <span> or <script> tags.
So my question is - how can I tell JDOM not to contract empty elements, and leave them as <div></div>?
Edit
I suspect the problem is not in the actual XSLTTransformer, but later when using JDOM to write to html. Here is the code I use:
XMLOutputter htmlout = new XMLOutputter(Format.getPrettyFormat());
htmlout.getFormat().setEncoding("UTF-8");
Document htmlDoc = transformer.transform(model);
htmlDoc.setDocType(new DocType("html"));
try (OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outHtml), "UTF-8")) {
htmlout.output(htmlDoc, osw);
}
Currently the proposed solution of adding a zero-width space works for me, but I'm interested to know whether there is a way to tell JDOM to treat the document as HTML (be it in the transform stage or the output stage, though I'm guessing the problem lies in the output stage).
You can use a zero-width-space between the elements. This doesn't affect the HTML output, but keeps the open-close-tags separated because they have a non-empty content.
<div class="someclass">​<xsl:value-of select="somevalue"/></div>
Downside is: the tag is not really empty anymore. That would matter if your output would be XML. But for HTML - which is probably the last stage of processing - it should not matter.
In your case, the XML transform is happening directly to a file/stream, and it is no longer in the control of JDOM.
In JDOM, you can select whether the output from the JDOM document has expanded, or not-expanded output for empty elements. Typically, people have output from JDOM like:
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(document, System.out);
You can modify the output format, though, and expand the empty elements:
Format expanded = Format.getPrettyFormat().setExpandEmptyElements(true);
XMLOutputter xout = new XMLOutputter(expanded);
xout.output(document, System.out);
If you 'recover' the XSLT-transformed XML (assuming it is valid XHTML) as a new JDOM document, you can output the result with expanded empty elements.
If you want to transform to an HTML file, consider using a JAXP Transformer with a JDOMSource and a StreamResult. The Transformer will then serialize the transformation result as HTML if the output method is html (either as set in your code or as implied by a no-namespace root element named html).
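As a rough sketch of the JAXP route, the example below serializes a small tree with the output method set to html. It uses a plain W3C DOM document and a DOMSource so that it runs with the JDK alone; with JDOM one would pass a JDOMSource wrapping the JDOM document instead. The class and method names (HtmlOutputDemo, sample, toHtml) are invented for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; DOMSource stands in for JDOMSource here.
class HtmlOutputDemo {

    /** Builds <html><body><div class="someclass"/><br/></body></html> in memory. */
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element div = doc.createElement("div");
        div.setAttribute("class", "someclass");
        body.appendChild(div);
        body.appendChild(doc.createElement("br"));
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    /** Serializes the tree using the html output method. */
    static String toHtml(Document doc) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // With the html method the empty div keeps a separate closing tag.
        System.out.println(toHtml(sample()));
    }
}
```

The advantage over post-processing the XML output is that the serializer itself knows which HTML elements are void and which need an explicit closing tag.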
In addition to the "expandEmptyElements" option, you could create your own writer and pass it to the XMLOutputter:
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat().setExpandEmptyElements(true));
StringWriter writer = new HTML5Writer();
outputter.output(document, writer);
System.out.println(writer.toString());
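This writer collapses the HTML5 void elements back into self-closing form, while every other element stays expanded. Elements like "script", for example, won't be touched:
private static class HTML5Writer extends StringWriter {
    // HTML5 void elements: these must not get a separate closing tag.
    private static final String[] VOIDELEMENTS = { "area", "base", "br", "col", "command", "embed", "hr",
            "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr" };
    // True from the moment a void element name is seen until its closing tag has been swallowed.
    private boolean inVoidTag;
    // Collects the start tag of a void element until we know how it ends.
    private StringBuilder voidTagBuffer;

    // Relies on the element name, "></" and ">" arriving as separate strings.
    @Override
    public void write(String str) {
        if (voidTagBuffer != null) {
            if (str.equals("></")) {
                // Expanded empty element: emit " />" instead of "></name>".
                voidTagBuffer.append(" />");
                super.write(voidTagBuffer.toString());
                voidTagBuffer = null;
            } else {
                voidTagBuffer.append(str);
            }
        } else if (inVoidTag) {
            // Drop the closing tag's name and its final ">".
            if (str.equals(">")) {
                inVoidTag = false;
            }
        } else {
            for (String name : VOIDELEMENTS) {
                if (str.equals(name)) {
                    inVoidTag = true;
                    voidTagBuffer = new StringBuilder(str);
                    return;
                }
            }
            super.write(str);
        }
    }
}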
I know, this is dirty, but I had the same problem and didn't find any other way.

Transformation of multiple input files

Right now I am using this Java method (which receives one XML file as a parameter) to perform an XSLT transformation:
static public byte[] simpleTransform(byte[] sourcebytes, int ref_id) {
try {
StreamSource xmlSource = new StreamSource(new ByteArrayInputStream(sourcebytes));
StringWriter writer = new StringWriter();
transformations_list.get(ref_id).transformer.transform(xmlSource, new StreamResult(writer));
return writer.toString().getBytes("UTF-8");
} catch (Exception e) {
e.printStackTrace();
return new byte[0];
}
}
In my XSLT file I am using document('f2.xml') to refer to other files needed by the transformation.
I want my Java method to accept multiple XML files instead, like this:
static public byte[] simpleTransform(byte[] f1, byte[] f2, byte[] f3, int ref_id)
In my XSLT I don't want to call document('f2.xml'), but rather refer to the f2 object received by my Java method.
Is there a way to do this? How do I refer to f2.xml in my XSLT in this way?
I'm not entirely sure what is in f1, f2 etc. Is it the URL of a document? or the XML document content itself?
There are two possible approaches you could consider.
One is to write a URIResolver. When you call document('f2.xml'), Saxon will call your URIResolver to get the relevant document as a Source object. Your URIResolver could return a StreamSource initialized with a ByteArrayInputStream referring to the relevant byte[] value.
A second approach is to supply the documents as parameters to the stylesheet. You could declare a global parameter <xsl:param name="f2" as="document-node()"/> and then use Transformer.setParameter() to supply the actual document; within the stylesheet, replace document('f2.xml') by $f2. Saxon will accept a Source object as the value supplied to setParameter, so you could again create a StreamSource initialized with a ByteArrayInputStream referring to the relevant byte[] value; alternatively (and perhaps better) you could pre-build the tree by calling a Saxon DocumentBuilder.
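The first approach (a URIResolver) might look roughly like the sketch below. It uses only the standard JAXP API and should therefore also work with processors other than Saxon, although that is an assumption rather than something stated above. The class name InMemoryDocuments, the inline stylesheet and the sample documents are all invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: serves document('f2.xml') from memory instead of from disk.
class InMemoryDocuments {

    // Made-up stylesheet: prints the text of the main input and of document('f2.xml').
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select='/a'/>|<xsl:value-of select=\"document('f2.xml')/b\"/>"
      + "</xsl:template></xsl:stylesheet>";

    static String transform(byte[] f1, Map<String, byte[]> extraDocs) throws Exception {
        URIResolver resolver = (href, base) -> {
            // Match on the last path segment in case the processor passes an absolute URI.
            String name = href.substring(href.lastIndexOf('/') + 1);
            byte[] bytes = extraDocs.get(name);
            // Returning null asks the processor to resolve the URI itself.
            return bytes == null ? null : new StreamSource(new ByteArrayInputStream(bytes));
        };
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setURIResolver(resolver);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new ByteArrayInputStream(f1)), new StreamResult(out));
        return out.toString();
    }

    /** Runs the transformation with two tiny in-memory documents. */
    static String demo() throws Exception {
        Map<String, byte[]> extra = new HashMap<>();
        extra.put("f2.xml", "<b>two</b>".getBytes(StandardCharsets.UTF_8));
        return transform("<a>one</a>".getBytes(StandardCharsets.UTF_8), extra);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

With this design the stylesheet can keep calling document('f2.xml') unchanged; only the Java side decides where the bytes come from.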

How apply CDATA to transformer parameter with jdom

I have tried surrounding the parameters sExtraParameter, sExtraParameter2 and sExtraParameter3 with a <![CDATA[ ]]> string in order to get correctly rendered Latin characters. But every time I check the XML output, it still shows garbled characters.
So, is there another way to apply CDATA to these parameters?
public static Element xslTransformJDOM(File xmlFile, String xslStyleSheet, String sExtraParameter, String sExtraParameterValue, String sExtraParameter2, String sExtraParameterValue2, String sExtraParameter3,String sExtraParameterValue3 ) throws JDOMException, TransformerConfigurationException, FileNotFoundException, IOException{
try{
Transformer transformer = TransformerFactory.newInstance().newTransformer(new StreamSource(xslStyleSheet));
transformer.setParameter(sExtraParameter, sExtraParameterValue);
transformer.setParameter(sExtraParameter2, sExtraParameterValue2);
transformer.setParameter(sExtraParameter3, sExtraParameterValue3);
JDOMResult out = new JDOMResult();
transformer.transform(new StreamSource(xmlFile), out);
Element result = out.getDocument().detachRootElement();
setSize(new XMLOutputter().outputString(result).length());
return result;
}
catch (TransformerException e){
throw new JDOMException("XSLT Transformation failed", e);
}
}
edit:
I am taking over a project from my boss, so I do not have the entire code to show you here.
Maybe I have misunderstood the question, but the API documentation (http://docs.oracle.com/javaee/1.4/api/javax/xml/transform/Transformer.html#setParameter(java.lang.String, java.lang.Object)) for setParameter only says this about the value:
value - The value object. This can be any valid Java object. It is up to the processor to provide the proper object coersion or to simply pass the object on for use in an extension.
This could then vary by implementation, assuming you are using JDOM.
There may be a CDATA xml element that would then be processed correctly. Maybe: http://www.jdom.org/docs/apidocs/org/jdom2/CDATA.html
You could still think about setting the serializer settings to some sort of whitespace preservation. http://www.jdom.org/docs/apidocs.1.1/org/jdom/output/Format.TextMode.html

XML Canonicalizer Problem

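I'm using the package org.apache.xml.security.c14n for the canonicalization of XML documents. I use the following code:
private String CanonicalizeXML(String XML) throws InvalidCanonicalizerException, CanonicalizationException, ParserConfigurationException, IOException, SAXException {
    // Assumes org.apache.xml.security.Init.init() has already been called once.
    Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
    // Canonical XML is always UTF-8, so encode and decode explicitly rather than with the platform default.
    return new String(canon.canonicalize(XML.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8);
}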
However, it doesn't seem to work as I expected, since it doesn't delete any unnecessary whitespace between elements. Am I doing something wrong?
Thanks,
Ivan
I think it may be your expectation which is incorrect:
You don't say which version of XML Canonicalization, but both 1.0 and 1.1 say:
All whitespace in character content is
retained (excluding characters removed
during line feed normalization)
Is your XML document referencing a DTD or schema? Without one of those the parser has no way to know which whitespace is significant, so it has to preserve it.
The org.apache.xml.security.c14n package does not remove whitespace.
I resolved this by calling setIgnoringBoundaryWhitespace(true) on my SAXBuilder:
SAXBuilder builder = new SAXBuilder();
builder.setIgnoringBoundaryWhitespace(true);
org.jdom2.Document doc = builder.build(is);
DOMOutputter out = new DOMOutputter();
Document docW3 = out.output(doc);
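If JDOM is not available, a comparable effect can be approximated on a plain W3C DOM by dropping whitespace-only text nodes before canonicalizing. The sketch below is an alternative to the JDOM setting above, not an equivalent of it, and the names (WhitespaceStripper, stripWhitespaceOnlyText, demo) are made up. It assumes that whitespace-only text between elements is never significant in your documents.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: removes whitespace-only text nodes from a DOM tree.
class WhitespaceStripper {

    static void stripWhitespaceOnlyText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first, because removeChild detaches the current node.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespaceOnlyText(child);
            }
            child = next;
        }
    }

    /** Parses a small indented document, strips it and serializes it again. */
    static String demo() throws Exception {
        String xml = "<a>\n  <b>x</b>\n  <c> y </c>\n</a>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        stripWhitespaceOnlyText(doc);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Indentation between elements is gone; the spaces around "y" are kept.
        System.out.println(demo());
    }
}
```

Text that mixes whitespace with other characters, such as the content of the c element, is left exactly as it was, which matches what the canonicalization rules quoted above require.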
