XML Canonicalizer Problem - java

I'm using the package org.apache.xml.security.c14n for the canonicalization of XML documents. I use the following code:
private String CanonicalizeXML(String XML) throws InvalidCanonicalizerException, CanonicalizationException, ParserConfigurationException, IOException, SAXException {
    Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
    // Use an explicit charset rather than the platform default.
    return new String(canon.canonicalize(XML.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8);
}
However, it doesn't seem to work as I expected, since it doesn't remove any unnecessary whitespace between elements. Am I doing something wrong?
Thanks,
Ivan

I think it may be your expectation that is incorrect.
You don't say which version of XML Canonicalization you are using, but both 1.0 and 1.1 say:
"All whitespace in character content is retained (excluding characters removed during line feed normalization)."

Is your XML document referencing a DTD or schema? Without one of those the parser has no way to know which whitespace is significant, so it has to preserve it.
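To illustrate that point, a minimal sketch (assuming the document declares a DTD and is parsed with validation; the file name is hypothetical) in which the JAXP parser is allowed to drop element-content whitespace before canonicalization:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class IgnorableWhitespaceDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);                       // validate against the declared DTD
        dbf.setIgnoringElementContentWhitespace(true); // only effective when a DTD/schema is present
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse("input.xml");          // hypothetical file containing a DOCTYPE
        // The DOM now lacks the whitespace-only text nodes between elements and can be
        // handed to the canonicalizer (e.g. canonicalizeSubtree, depending on the
        // Santuario version in use).
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}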

The org.apache.xml.security.c14n canonicalizer does not remove whitespace.
I resolved this by calling setIgnoringBoundaryWhitespace(true) on my SAXBuilder:
SAXBuilder builder = new SAXBuilder();
builder.setIgnoringBoundaryWhitespace(true);
org.jdom2.Document doc = builder.build(is);
DOMOutputter out = new DOMOutputter();
Document docW3 = out.output(doc);
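Putting the two pieces together, a minimal sketch of the whole round trip (assuming JDOM 2 and a Santuario version whose canonicalizeSubtree overload still returns a byte[]):
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import org.apache.xml.security.c14n.Canonicalizer;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.DOMOutputter;

public class C14nWithoutBoundaryWhitespace {
    public static String canonicalize(String xml) throws Exception {
        org.apache.xml.security.Init.init(); // initialize Santuario once per JVM

        // Parse with JDOM, dropping whitespace-only text between elements.
        SAXBuilder builder = new SAXBuilder();
        builder.setIgnoringBoundaryWhitespace(true);
        org.jdom2.Document jdomDoc = builder.build(new StringReader(xml));

        // Convert to a W3C DOM document and canonicalize the subtree.
        org.w3c.dom.Document w3cDoc = new DOMOutputter().output(jdomDoc);
        Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
        byte[] canonical = canon.canonicalizeSubtree(w3cDoc); // byte[]-returning overload, pre-2.3 API
        return new String(canonical, StandardCharsets.UTF_8);
    }
}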

Related

Using vanilla W3C Document and Apache Batik SVG rasterizer

Since SVG is a regular XML file and the ImageTranscoder.transcode() API takes a TranscoderInput whose constructor accepts an org.w3c.dom.Document, one would expect that loading and parsing the file with the stock Java XML parser would work:
TranscoderInput input = new TranscoderInput(loadSvgDocument(new FileInputStream(svgFile)));
BufferedImageTranscoder t = new BufferedImageTranscoder();
t.transcode(input, null);
Where the loadSvgDocument() method is defined as:
Document loadSvgDocument(InputStream is) {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // using the stock Java 8 XML parser
    Document document;
    try {
        DocumentBuilder db = dbf.newDocumentBuilder();
        document = db.parse(is);
    } catch (...) {...}
    return document;
}
It does not work. I am getting strange casting exceptions.
Exception in thread "main" java.lang.ClassCastException: org.apache.batik.dom.GenericElement cannot be cast to org.w3c.dom.svg.SVGSVGElement
at org.apache.batik.anim.dom.SVGOMDocument.getRootElement(SVGOMDocument.java:235)
at org.apache.batik.transcoder.SVGAbstractTranscoder.transcode(SVGAbstractTranscoder.java:193)
at org.apache.batik.transcoder.image.ImageTranscoder.transcode(ImageTranscoder.java:92)
at org.apache.batik.transcoder.XMLAbstractTranscoder.transcode(XMLAbstractTranscoder.java:142)
at org.apache.batik.transcoder.SVGAbstractTranscoder.transcode(SVGAbstractTranscoder.java:156)
Note: class BufferedImageTranscoder is my class, created as per the Batik blueprints; it extends ImageTranscoder, which in turn extends the SVGAbstractTranscoder mentioned in the stack trace above.
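For context, a minimal sketch of what such a transcoder typically looks like (an assumed reconstruction following the usual Batik ImageTranscoder pattern, not the asker's actual class):
import java.awt.image.BufferedImage;

import org.apache.batik.transcoder.TranscoderException;
import org.apache.batik.transcoder.TranscoderOutput;
import org.apache.batik.transcoder.image.ImageTranscoder;

// Renders the SVG into an in-memory BufferedImage instead of writing to a stream.
class BufferedImageTranscoder extends ImageTranscoder {
    private BufferedImage image;

    @Override
    public BufferedImage createImage(int width, int height) {
        return new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
    }

    @Override
    public void writeImage(BufferedImage img, TranscoderOutput output) throws TranscoderException {
        this.image = img; // keep the rendered image for the caller
    }

    public BufferedImage getImage() {
        return image;
    }
}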
Unfortunately I cannot use Batik's own parser, SAXSVGDocumentFactory:
String parser = XMLResourceDescriptor.getXMLParserClassName();
SAXSVGDocumentFactory f = new SAXSVGDocumentFactory(parser);
svgDocument = (SVGDocument) f.createDocument(..);
I am trying to render Papirus SVG icons, but they all have <svg ... version="1"> and SAXSVGDocumentFactory does not like that: it fails in createDocument(..) with Unsupport SVG version '1' (they probably meant "unsupported").
Exception in thread "main" java.lang.RuntimeException: Unsupport SVG version '1'
at org.apache.batik.anim.dom.SAXSVGDocumentFactory.getDOMImplementation(SAXSVGDocumentFactory.java:327)
at org.apache.batik.dom.util.SAXDocumentFactory.startElement(SAXDocumentFactory.java:640)
. . .
at org.apache.batik.anim.dom.SAXSVGDocumentFactory.createDocument(SAXSVGDocumentFactory.java:225)
Changing version="1" to version="1.0" in the file itself fixes the problem and the icon is rendered nicely for me. But there are hundreds (thousands) of icons, fixing them all is tedious, and I would effectively be creating a fork of their project. This is not a way forward for me. Much easier is to make the fix at run time, using the DOM API:
Element svgDocumentNode = svgDocument.getDocumentElement();
String svgVersion = svgDocumentNode.getAttribute("version");
if (svgVersion.equals("1")) {
    svgDocumentNode.setAttribute("version", "1.0");
}
But that can be done only with the stock Java XML parser; the Batik parser blows up too early, before this code can be reached and before the Document is created. And when I use the stock XML parser and make the version fix, the Batik transcoder (rasterizer) does not like the result. So I hit a wall here.
Is there a converter from an org.w3c.dom.Document produced by the stock XML parser to a Batik-compatible org.w3c.dom.svg.SVGDocument?
OK, I found a solution that bypasses the problem. Luckily, SAXSVGDocumentFactory can easily be subclassed and the critical method getDOMImplementation() overridden.
protected Document loadSvgDocument(InputStream is) {
    String parser = XMLResourceDescriptor.getXMLParserClassName();
    SAXSVGDocumentFactory f = new LenientSaxSvgDocumentFactory(parser);
    SVGDocument svgDocument;
    try {
        svgDocument = (SVGDocument) f.createDocument("aaa", is);
    } catch (...) {
        ...
    }
    return svgDocument;
}

static class LenientSaxSvgDocumentFactory extends SAXSVGDocumentFactory {
    public LenientSaxSvgDocumentFactory(String parser) {
        super(parser);
    }

    @Override
    public DOMImplementation getDOMImplementation(String ver) {
        // code is mostly a rip-off from the original Apache Batik 1.9 code;
        // only the condition was extended to also accept the "1" string
        if (ver == null || ver.length() == 0
                || ver.equals("1.0") || ver.equals("1.1") || ver.equals("1")) {
            return SVGDOMImplementation.getDOMImplementation();
        } else if (ver.equals("1.2")) {
            return SVG12DOMImplementation.getDOMImplementation();
        }
        throw new RuntimeException("Unsupported SVG version '" + ver + "'");
    }
}
This time I got lucky; the main question remains, however: is there a converter from an org.w3c.dom.Document produced by the stock XML parser to a Batik-compatible org.w3c.dom.svg.SVGDocument?
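One possible answer to that remaining question, offered as an untested sketch rather than a confirmed solution: Batik ships DOMUtilities.deepCloneDocument, which can rebuild a generic DOM tree on top of Batik's own SVG DOM implementation:
import org.apache.batik.anim.dom.SVGDOMImplementation;
import org.apache.batik.dom.util.DOMUtilities;
import org.w3c.dom.Document;
import org.w3c.dom.svg.SVGDocument;

public final class SvgDocumentConverter {

    private SvgDocumentConverter() {
    }

    // Assumes the plain Document was produced by a namespace-aware parser
    // (DocumentBuilderFactory.setNamespaceAware(true)); without namespaces
    // Batik will not recognize the elements as SVG.
    public static SVGDocument toSvgDocument(Document plainDom) {
        return (SVGDocument) DOMUtilities.deepCloneDocument(
                plainDom, SVGDOMImplementation.getDOMImplementation());
    }
}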

Encoding for Unicode and & characters

I am trying to save the below string to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But when printing the proto model value it's displayed like:
STOXXÂ®Europe 600 Food&amp;BevNR ETF
I tried to encode the string to UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but it failed. I'm getting this string by parsing the XML response from the server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader
Correcting the XML parsing is better than having to unescape everything afterwards. Here is a test case showing this:
public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty("javax.xml.stream.isCoalescing", true);
    // ReaderInputStream is from Apache Commons IO; give it an explicit charset
    // so it matches the one passed to the StAX reader below.
    ReaderInputStream ris = new ReaderInputStream(
            new StringReader("<tag>STOXX®Europe 600 Food&amp;BevNR ETF</tag>"),
            StandardCharsets.UTF_8);
    XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
    StringBuilder sb = new StringBuilder();
    while (reader.hasNext()) {
        reader.next();
        if (reader.hasText()) {
            sb.append(reader.getText());
        }
    }
    System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF
Actually, I have a protobuf method with me that solves this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString
As the text comes from XML, use:
s = StringEscapeUtils.unescapeXml(s);
This is much better than unescaping HTML, which has hundreds of named entities &...;.
The two rubbish characters that appear instead of the registered trademark symbol (®) come from reading UTF-8 encoded text (multi-byte for special characters) as some single-byte encoding, probably Latin-1.
This wrong conversion might be repaired with another conversion, but it is best to read the text as UTF-8 in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better to inspect the reading code and check whether an optional encoding went missing somewhere: InputStreamReader, OutputStreamWriter, new String, getBytes.
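For example, a minimal sketch of what the corrected reading side could look like (the class, method and stream names are hypothetical, just to show where the explicit charset belongs):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class Utf8ResponseReader {
    // Read the server response with an explicit UTF-8 charset instead of the
    // platform default, so the multi-byte ® character survives intact.
    static String readResponse(InputStream responseStream) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(responseStream, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}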
Your entire problem would also be solved by using a proper XML reader, as the other answer shows.

ghost4j class cast exception when joining two PostScript files

I am trying to join two PostScript files to one with ghost4j 0.5.0 as follows:
final PSDocument[] psDocuments = new PSDocument[2];
psDocuments[0] = new PSDocument();
psDocuments[0].load("1.ps");
psDocuments[1] = new PSDocument();
psDocuments[1].load("2.ps");
psDocuments[0].append(psDocuments[1]);
psDocuments[0].write("3.ps");
During this simplified process I got the following exception message for the above "append" line:
org.ghost4j.document.DocumentException: java.lang.ClassCastException:
org.apache.xmlgraphics.ps.dsc.events.UnparsedDSCComment cannot be cast to
org.apache.xmlgraphics.ps.dsc.events.DSCCommentPage
So far I have not managed to find out what the problem is here - maybe some kind of problem within one of the PostScript files?
Any help would be appreciated.
EDIT:
I tested with the Ghostscript command-line tool:
gswin32.exe -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pswrite -sOutputFile="test.ps" --filename "1.ps" "2.ps"
which results in a document where 1.ps and 2.ps are merged into one(!) page (i.e. overlaid).
When I remove --filename, the resulting document is a PostScript file with two pages, as expected.
The exception occurs because one of the two documents does not follow the Adobe Document Structuring Conventions (DSC), which is mandatory if you want to use the Document append method.
Use the SafeAppenderModifier instead. There is an example here: http://www.ghost4j.org/highlevelapisamples.html (Append a PDF document to a PostScript document).
I think something is wrong in the document or in the XML Graphics library, as it seems it cannot parse a part of it.
Here you can see the code in ghost4j that I think is failing (link):
DSCParser parser = new DSCParser(bais);
Object tP = parser.nextDSCComment(DSCConstants.PAGES);
while (tP instanceof DSCAtend) {
    tP = parser.nextDSCComment(DSCConstants.PAGES);
}
DSCCommentPages pages = (DSCCommentPages) tP;
And here you can see why XML Graphics may be responsible (link):
private DSCComment parseDSCComment(String name, String value) {
    DSCComment parsed = DSCCommentFactory.createDSCCommentFor(name);
    if (parsed != null) {
        try {
            parsed.parseValue(value);
            return parsed;
        } catch (Exception e) {
            // ignore and fall back to unparsed DSC comment
        }
    }
    UnparsedDSCComment unparsed = new UnparsedDSCComment(name);
    unparsed.parseValue(value);
    return unparsed;
}
It seems parsed.parseValue(value) threw an exception; it was swallowed in the catch block, and an unparsed version that ghost4j didn't expect was returned.
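To narrow down which of the two input files trips that parser, a small diagnostic sketch (plain Java, independent of ghost4j; it only assumes the 1.ps and 2.ps names from the question) can print the %%Pages: DSC comments, which is exactly the comment the failing cast is about:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DscPagesCheck {
    public static void main(String[] args) throws IOException {
        for (String name : new String[] {"1.ps", "2.ps"}) {
            List<String> lines = Files.readAllLines(Paths.get(name), StandardCharsets.ISO_8859_1);
            // A well-formed comment looks like "%%Pages: 3", or "%%Pages: (atend)"
            // followed later by the real count in the trailer section.
            lines.stream()
                 .filter(line -> line.startsWith("%%Pages:"))
                 .forEach(line -> System.out.println(name + " -> " + line));
        }
    }
}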

How to apply CDATA to a transformer parameter with JDOM

I have tried to surround the parameters sExtraParameter, sExtraParameter2 and sExtraParameter3 with a <![CDATA[ ]]> wrapper in order to get properly rendered Latin characters. But every time I check the XML output, it still shows badly parsed characters.
So, is there another way to apply CDATA to these parameters?
public static Element xslTransformJDOM(File xmlFile, String xslStyleSheet,
        String sExtraParameter, String sExtraParameterValue,
        String sExtraParameter2, String sExtraParameterValue2,
        String sExtraParameter3, String sExtraParameterValue3)
        throws JDOMException, TransformerConfigurationException, FileNotFoundException, IOException {
    try {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(new StreamSource(xslStyleSheet));
        transformer.setParameter(sExtraParameter, sExtraParameterValue);
        transformer.setParameter(sExtraParameter2, sExtraParameterValue2);
        transformer.setParameter(sExtraParameter3, sExtraParameterValue3);

        JDOMResult out = new JDOMResult();
        transformer.transform(new StreamSource(xmlFile), out);

        Element result = out.getDocument().detachRootElement();
        setSize(new XMLOutputter().outputString(result).length());
        return result;
    } catch (TransformerException e) {
        throw new JDOMException("XSLT Transformation failed", e);
    }
}
Edit:
I am following up on a project from my boss; for that reason I do not have the entire code to show you here.
Maybe I have missed the point of the question, but the API documentation for setParameter (http://docs.oracle.com/javaee/1.4/api/javax/xml/transform/Transformer.html#setParameter(java.lang.String, java.lang.Object)) says about the value argument:
value - The value object. This can be any valid Java object. It is up to the processor to provide the proper object coercion or to simply pass the object on for use in an extension.
The behaviour can therefore vary by implementation. Assuming you are using JDOM, there is a CDATA class that might be passed through and processed correctly: http://www.jdom.org/docs/apidocs/org/jdom2/CDATA.html
You could also think about configuring the serializer for some form of whitespace/text preservation: http://www.jdom.org/docs/apidocs.1.1/org/jdom/output/Format.TextMode.html
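A short sketch of both ideas, plugged into the question's xslTransformJDOM method (whether a particular XSLT processor honours a JDOM CDATA parameter value is implementation-specific, so treat the first option as an experiment):
// Option 1 (experimental): pass the value wrapped in a JDOM CDATA object and
// let the processor decide how to coerce it.
transformer.setParameter(sExtraParameter, new org.jdom2.CDATA(sExtraParameterValue));

// Option 2: control the serialization instead - output the result with an
// explicit encoding and raw text mode, so the outputter neither re-escapes
// nor re-indents the text.
XMLOutputter outputter = new XMLOutputter(
        org.jdom2.output.Format.getRawFormat().setEncoding("UTF-8"));
String serialized = outputter.outputString(result);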

JAXB marshaller CharacterEscapeHandler

I have the following problem. I've set these properties on the marshaller:
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
marshaller.setProperty("com.sun.xml.bind.characterEscapeHandler", new CharacterEscapeHandler() {
    public void escape(char[] ch, int start, int length, boolean isAttVal, Writer out) throws IOException {
        String s = new String(ch, start, length);
        System.out.println("Inside CharacterEscapeHandler...");
        out.write(StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeXml(s)));
    }
});
When I try to marshal an object into the SOAPBody with the following code:
SOAPMessage message = MessageFactory.newInstance().createMessage();
marshaller.marshal(request, message.getSOAPBody());
the CharacterEscapeHandler.escape is not invoked, and the characters are not escaped, but this code:
StringWriter writer = new StringWriter();
marshaller.marshal(request, writer);
invokes CharacterEscapeHandler.escape(), and all the characters are escaped. Is this normal behaviour for JAXB, and how can I escape characters before placing them inside the SOAP body?
Update:
Our system has to communicate with another system, which expects the text to be escaped.
Example of a message sent by the other system:
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<env:Body xmlns:ac="http://www.ACORD.org/Standards/AcordMsgSvc/1">
<ac:CallRs xmlns:ac="http://www.ACORD.org/Standards/AcordMsgSvc/1">
<ac:Sender>
<ac:PartyId>urn:partyId</ac:PartyId>
<ac:PartyRoleCd/>
<ac:PartyName>PARTYNAME</ac:PartyName>
</ac:Sender>
<ac:Receiver>
<ac:PartyRoleCd>broker</ac:PartyRoleCd>
<ac:PartyName>Амарант България ООД</ac:PartyName>
</ac:Receiver>
<ac:Application>
<ac:ApplicationCd>applicationCd</ac:ApplicationCd>
<ac:SchemaVersion>schemaversion/</ac:SchemaVersion>
</ac:Application>
<ac:TimeStamp>2011-05-11T18:41:19</ac:TimeStamp>
<ac:MsgItem>
<ac:MsgId>30d63016-fa7d-4410-a19a-510e43674e70</ac:MsgId>
<ac:MsgTypeCd>Error</ac:MsgTypeCd>
<ac:MsgStatusCd>completed</ac:MsgStatusCd>
</ac:MsgItem>
<ac:RqItem>
<ac:MsgId>d8c2d9c4-3f1c-459f-abe1-0e9accbd176b</ac:MsgId>
<ac:MsgTypeCd>RegisterPolicyRq</ac:MsgTypeCd>
<ac:MsgStatusCd>completed</ac:MsgStatusCd>
</ac:RqItem>
<ac:WorkFolder>
<ac:MsgFile>
<ac:FileId>cid:28b8c9d1-9655-4727-bbb2-3107482e7f2e</ac:FileId>
<ac:FileFormatCd>text/xml</ac:FileFormatCd>
</ac:MsgFile>
</ac:WorkFolder>
</ac:CallRs>
</env:Body>
</env:Envelope>
So I need to escape all the text between the opening and closing tags, like the text inside ac:PartyName.
When you marshal to a DOM Document, JAXB is not in charge of the actual serialization and escaping; it just builds the DOM tree in memory. The serialization is then handled by the DOM implementation.
Needing additional escaping when writing XML is usually a sign of a design problem or of not using XML correctly. If you can give some more context about why you need this escaping, maybe I can suggest an alternative solution.
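To illustrate that split of responsibilities with the question's own marshaller and request objects (a sketch, not a fix): the special characters in the marshalled text are escaped only when the SOAP message itself is serialized, not when JAXB fills the body:
// JAXB only creates DOM nodes inside the SOAP body here; no escaping happens yet,
// which is why the CharacterEscapeHandler is never invoked.
SOAPMessage message = MessageFactory.newInstance().createMessage();
marshaller.marshal(request, message.getSOAPBody());

// The text content (&, <, > ...) is escaped at this point, by the SAAJ/DOM
// serializer that turns the tree back into bytes.
message.writeTo(System.out);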
