javax Transformer preserve the escaping entity - java

I finally gave up on the XSLT transformation after several failed attempts (if you are interested, you can check my question here: don't want xslt intepret the entity characters). Now I'm working with the javax Transformer to try to solve the problem.
The problem is that I would like to keep the escaping of the apostrophe &apos; in the XML and HTML:
<cell i='0' j='0' vi='0' parity='Odd' subType='-1'> "String&apos;</cell>
The output is what I don't want:
<td nowrap="true" class="gridCell" i="0" j="0">"String'</td>
I would like the result as follows:
<td nowrap="true" class="gridCell" i="0" j="0">"String&apos;</td>
I don't know whether the transform method can do this. I found a similar question, but it asks for the opposite: How Do You Prevent A javax Transformer From Escaping Whitespace?
I appreciate any help from you, thank you!

Strictly speaking, there's no difference to an XML parser between ', &#39;, &#x27; or &apos; in a text node. In an attribute it's slightly different, given that the value has to be enclosed between " or '; if you've enclosed an attribute's value in ', you MUST use an entity to specify an apostrophe within the value, and an XML serializer will do the same.
I'm a bit rusty with XML handling in Java, but I know in C# you can have your transform generate an XML document object, and create a class that inherits from the XmlWriter class to serialize it however you wish, by overriding the 'WriteString' method. Hopefully there's something similar you can do in Java, possibly by implementing the Result interface, or perhaps passing a DOMResult in that you can work with. I can't remember how you normally serialize a Document in Java, but it should be something you can manipulate or override.
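A minimal Java sketch of that idea, assuming it is acceptable to write every apostrophe in text as &apos; (the DOM no longer records which ones were escaped in the input). It runs an identity transform into a DOMResult and then serializes the tree by hand. The class name AposPreservingSerializer and its methods are hypothetical, and only elements, attributes and text are handled.

```java
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class AposPreservingSerializer {

    /** Escapes text content and writes every apostrophe as &apos;. */
    static String escapeText(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("'", "&apos;");
    }

    /** Escapes a value that will be wrapped in double quotes. */
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    /** Writes elements, attributes and text; other node types are skipped in this sketch. */
    static void write(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                write(((Document) node).getDocumentElement(), out);
                break;
            case Node.ELEMENT_NODE:
                out.append('<').append(node.getNodeName());
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    out.append(' ').append(a.getNodeName()).append("=\"")
                       .append(escapeAttribute(a.getNodeValue())).append('"');
                }
                out.append('>');
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    write(c, out);
                }
                out.append("</").append(node.getNodeName()).append('>');
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                out.append(escapeText(node.getNodeValue()));
                break;
            default:
                break; // comments, processing instructions etc. are ignored here
        }
    }

    /** Runs an identity transform into a DOMResult, then serializes the tree by hand. */
    static String serialize(String xml) throws Exception {
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer()
                .transform(new StreamSource(new StringReader(xml)), result);
        StringBuilder out = new StringBuilder();
        write(result.getNode(), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("<td class=\"gridCell\">\"String&apos;</td>"));
    }
}
```

With a real stylesheet, the Transformer would be created from it instead of the identity transformer; the hand-written write step stays the same.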

XML defines &apos; and ' to mean the same thing, so most XML tools are going to treat the distinction as irrelevant. I think you should question the requirement - why are you trying to do this? It's a bit like trying to preserve the spaces around the "=" sign in an attribute: it's pointless.

Related

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items is a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a lower-level interface and, via its LexicalHandler interface, can notify you when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
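With the restructured schema, reading the tags needs no special handling of commas. A minimal sketch, assuming the JDK's built-in DOM parser rather than dom4j; the class name TagReader and the method readTags are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagReader {

    /** Returns the text of every <tag> element, in document order. */
    static List<String> readTags(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("tag");
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            tags.add(nodes.item(i).getTextContent());
        }
        return tags;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tags><tag>Research skills</tag>"
                   + "<tag>Searching, evaluating and referencing</tag></tags>";
        // Each tag is its own element, so a comma inside a tag is just text.
        for (String tag : readTags(xml)) {
            System.out.println(tag);
        }
    }
}
```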
Using LexEv from http://andrewjwelch.com/lexev/, with xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run a short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
You can only distinguish a character from its entity-encoded counterpart with vtd-xml, by using VTDNav's toRawString() method.

Escaping an xml string in java

I read elements with CDATA sections from an RSS feed, which I need to convert to valid XML. The content in the CDATA section is mostly valid XHTML, but sometimes characters like the ampersand appear in attributes (URLs).
I can use .replaceAll("&", "&amp;") to solve this, but thinking ahead, other invalid characters may show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output. So, what are you doing to get broken output?
Rolf
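One caveat with the plain replaceAll("&", "&amp;") from the question is that it would also re-escape ampersands that already start an entity. A small sketch that escapes only stray ampersands, assuming the input uses just the five predefined XML entities and numeric character references; the class name StrayAmpersandEscaper is hypothetical.

```java
import java.util.regex.Pattern;

public class StrayAmpersandEscaper {

    // An ampersand that does NOT start a predefined entity or a numeric
    // character reference. HTML-only names such as &nbsp; are treated as
    // stray here, which is an assumption of this sketch.
    private static final Pattern STRAY_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    static String escapeStrayAmpersands(String s) {
        return STRAY_AMP.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayAmpersands(
                "<a href=\"x?a=1&b=2&amp;c=3\">R&D</a>"));
    }
}
```

This only repairs ampersands; a stray < inside an attribute value would still need a real fix at the source.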

Escaping Special Characters with JiBX (Un) Marshalling

I want that, during marshalling, special characters should escape.
Is there any way to do this?
alt="<i><b> image alt</b></i>"
This is saved as
&lt;b&gt;&lt;i&gt;image alt&lt;/b&gt;&lt;/i&gt;
I want to save the value as it is.
If you store something as XML, you HAVE to escape those characters. Otherwise your XML will become invalid:
<xml>text</xml>
If text == </xml>, the XML will clearly be invalid:
<xml></xml></xml>
This must be:
<xml>&lt;/xml&gt;</xml>
If you unmarshal it, it should become the correct value again.
You may also use CDATA
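The round trip described above can be checked with the JDK's own DOM and Transformer classes instead of JiBX. This is only a sketch under that assumption; the class name EscapeRoundTrip and its methods are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EscapeRoundTrip {

    /** Wraps the text in an <xml> element and serializes it; the serializer escapes markup characters. */
    static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(text);
        doc.appendChild(root);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Parses the serialized form and returns the root element's text, unescaped by the parser. */
    static String parseText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String stored = serialize("</xml>");
        System.out.println(stored);            // escaped form, still well-formed XML
        System.out.println(parseText(stored)); // original value restored
    }
}
```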
I thought I'd share my experience, because the answers I found weren't quite comprehensive (and I'm still not sure this is the most professional solution out there).
In our project we use maven-jibx-plugin to generate POJOs from XSDs (in two runs as usual: 1. *.xsd->binding.xml, then 2. binding.xml-> *.java).
Based on documentation of value node and Dennis Sosnoski's answer on jibx mailing list I added xml-maven-plugin to our project build process. I use it to apply an XSL file on generated binding.xml before POJO generation. The point is to change value of style attribute on appropriate value node from text to cdata.
So far it seems to have solved my encoding issue, and now I can return XML like this to the client:
<Description><![CDATA[<strong>Valuable content goes here</strong>...<br />]]></Description>
Hope this makes someone's life easier. :)

Java library to escape/clean XML?

I get some malformed xml text input like:
"<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"
I want to clean the input so as to get:
"<Tag>something</Tag> 8 &gt; 3, 2 &lt; 3, ... <Tag>something</Tag>"
That is, escape special symbols like < and > and yet keep the valid tags ("<Tag>something</Tag>"; note, with the same case).
Do you know of any Java library to do this? Probably an XML/HTML parser? (Though I don't really need a parser, simply a "clean" procedure.)
JTidy is "HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML"
But it can also be used with XML. Check the documentation. It's incredibly smart; it will probably work for you.
I don't know of any library that would do that. Your input is malformed XML, and no proper XML parser would accept it. More important, it is not always possible to distinguish an actual tag from something that looks-like-a-tag-but-is-really-text. Therefore any heuristic-based attempt that you make to solve the problem will be fragile; i.e. it could occasionally produce malformed XML.
The best approach is to address the problem before you assemble the XML.
If you generate the XML by (for example) unparsing a DOM, then the unparser will take care of the escaping for you.
If you are generating the XML by templating or string bashing, then you need to call something like StringEscapeUtils.escapeXml on the relevant text chunks ... before the XML tags get incorporated.
If you leave the problem until after the "XML" has been assembled, it cannot be properly fixed.
The best solution is to fix the program generating your text input. The easiest such fix would involve an escape utility like the other answers suggested. If that's not an option, I'd use a regular expression like
</?[a-zA-Z]+ */?>
to match the expected tags, and then split the string up into tags (which you want to pass through unchanged) and text between tags (against which you want to apply an escape method.)
I wouldn't count on an XML parser to be able to do it for you because what you're dealing with isn't valid XML. It is possible for the existing lack of escaping to produce ambiguities, so you might not be able to do a perfect job either.
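A rough Java sketch of the split-and-escape idea, using the regular expression above. The class name TagAwareEscaper is hypothetical, and the usual caveat applies: text that merely looks like a tag passes through unescaped, and an ampersand that already starts an entity would be escaped a second time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAwareEscaper {

    // Matches simple tags without attributes, as in the regular expression above.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z]+ */?>");

    /** Escapes the markup characters in a chunk of plain text. */
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Copies recognized tags unchanged and escapes everything between them. */
    static String clean(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(input.substring(last, m.start()))); // text before the tag
            out.append(m.group());                                    // the tag itself
            last = m.end();
        }
        out.append(escapeText(input.substring(last)));                // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean(
                "<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"));
    }
}
```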
Check out Guava's XmlEscaper. It is in pre-release for version 11 but the code is available.
Apache Commons Lang contains a class named StringEscapeUtils which does exactly what you want! The method you'd want to use is escapeXml, I presume.

Java: Ignoring escapes when parsing XML

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).
A previous similar question, Read escaped quote as escaped quote from xml, received one answer that seems to be specific to Apache, and another that appears to simply not do what it says it does. I'd love to be proven wrong on either count, however :)
For reference, here is some code:
// fileName and text are defined elsewhere
File file = new File(fileName);
DocumentBuilderFactory DocBderFac = DocumentBuilderFactory.newInstance();
DocumentBuilder DocBder = DocBderFac.newDocumentBuilder();
Document doc = DocBder.parse(file);
NodeList textElmntLst = doc.getElementsByTagName(text);
Element textElmnt = (Element) textElmntLst.item(0);
NodeList txts = textElmnt.getChildNodes();
String txt = txts.item(0).getNodeValue();
System.out.println(txt);
I would like that println() to produce things like
&quot;3&gt;2&quot;
instead of
"3>2"
which is what currently happens.
Thanks!
You can turn them back into xml-encoded form by
StringEscapeUtils.escapeXml(str);
(javadoc, commons-lang)
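If pulling in commons-lang is too much, the same re-escaping can be sketched with the JDK alone. The class name ReEscape and both methods are hypothetical. Note that this escapes every special character in the decoded text, because the parser does not record which ones were escaped in the file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReEscape {

    /** Turns the five XML special characters back into their predefined entities. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Parses the XML and returns the re-escaped text of the first element with the given name. */
    static String firstTextEscaped(String xml, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String decoded = doc.getElementsByTagName(tagName).item(0).getTextContent();
        return escapeXml(decoded);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstTextEscaped(
                "<doc><text>&quot;3&gt;2&quot;</text></doc>", "text"));
    }
}
```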
I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).
Bad requirement. Don't do that.
Or at least consider carefully why you think you want or need it.
CDATA sections and escapes are a tactic for allowing you to pass text like quotes and '<' characters through XML and not have XML confuse them with markup. They have no meaning in themselves and when you pull them out of the XML, you should accept them as the quotes and '<' characters they were intended to represent.
One approach might be to try dom4j, and to use the Node.asXML() method. It might return a deep structure, so it might need cloning to get just the node or text you want without any of its children.
Both good answers, but both a little too heavy-weight for this very small-scale application. I ended up going with the total hack of just stripping out all &s (I do this to &s that aren't part of escapes later anyway). It's ugly, but it's working.
Edit: I understand there's all kinds of things wrong with this, and that the requirement is stupid. It's for a school project, all that matters is that it work in one case, and the requirement is not my fault :)
