Write dom4j Document to xml-file with escaped '\r\n\' - java

Sorry for asking about much the same issue again, but now I would like to
write a dom4j Document which contains tags looking like this:
<Field>\r\n some text</Field>
to an XML file, but the \r\n should be escaped to
&#13;&#10;
org.dom4j.Document.asXML()
does not work.

Assuming you mean that's a CRLF sequence in the text node (and not merely a literal backslash-r-backslash-n), you won't be able to persuade an XML serialiser to write them as
&#13;&#10;, because XML says you don't have to. The document is absolutely equivalent in XML terms, whether you escape it or not. The only place you need to escape the CRLF sequence as
&#13;&#10; is in an attribute value.
If you really must produce this output, you would have to write your own XML serialiser that followed special rules for escaping control codes. But if you are doing this because an external tool can't read the XML element with CRLF sequences in, you should concentrate on fixing that tool, because if it can't deal with newlines in text content it's broken and not a proper XML parser.
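The "special rules for escaping control codes" mentioned above could look roughly like the following. This is only a minimal sketch of the text-escaping part of such a serialiser: the class and method names (CrLfEscaper, escapeText) are hypothetical and not part of dom4j, and the surrounding tree walk and element/attribute output are left out.

```java
// Hypothetical helper, not part of dom4j: escapes element text by hand so that
// CR and LF come out as numeric character references.
final class CrLfEscaper {
    private CrLfEscaper() {}

    /** Escapes markup characters and writes CR/LF as &#13; and &#10;. */
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;"); break;
                case '<':  out.append("&lt;");  break;
                case '>':  out.append("&gt;");  break;
                case '\r': out.append("&#13;"); break;
                case '\n': out.append("&#10;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The text node really contains CR LF, not backslash characters.
        System.out.println("<Field>" + escapeText("\r\n some text") + "</Field>");
    }
}
```

The result has to be written to the file directly. Passing it back through an ordinary XML serialiser would escape the ampersands a second time.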

Walk the tree, applying String.replace to the Text nodes.

Related

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items are a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a lower-level interface, and its LexicalHandler interface can notify you when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
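With the restructured schema, reading the tags needs no knowledge of how the comma was encoded. The following is a minimal sketch that uses the JDK's built-in DOM API rather than dom4j, purely so that it runs without extra jars. The class name TagReader is made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every <tag> element.
final class TagReader {
    private TagReader() {}

    static List<String> readTags(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("tag");
            List<String> tags = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Commas inside a tag are ordinary text here; no splitting is needed.
                tags.add(nodes.item(i).getTextContent().trim());
            }
            return tags;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse tags", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tags><tag>Research skills</tag>"
                + "<tag>Searching, evaluating and referencing</tag></tags>";
        for (String tag : readTags(xml)) {
            System.out.println(tag);
        }
    }
}
```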
Using LexEv from http://andrewjwelch.com/lexev/, putting xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run some short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
You can only distinguish a character from its entity-encoded counterpart with vtd-xml, by using VTDNav's toRawString() method.

Escaping an xml string in java

I read elements with CDATA sections from a rss-feed which I need to convert to valid xml. The content in the CDATA section is mostly valid xhtml, but some times characters like ampersand appear in attributes (url's).
I can use .replaceAll("&", "&amp;") to solve this, but thinking a bit ahead, other invalid characters may show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS...?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output.... so, what are you doing to have broken output?
Rolf

How to make SAXParser ignore escape codes

I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format.
I have managed to get round most obstacles that this format throws up, except when it encounters text containing the &. The XML file represents this ampersand as &amp; and I can only manage to read the text following the & in any particular section of text.
Is there a way to disable detection of escape codes? I am using SAXParser.
There is something fishy about what you are trying to do.
If the file format you are trying to parse contains bare ampersand (&) characters then it is not well-formed XML. Ampersands are represented as character entities (e.g. &amp;) in well-formed XML.
If it is really supposed to be real XML, then there is a bug in whatever wrote / generated the file.
If it is not supposed to be real XML (i.e. those ampersands are not a mistake), then you probably shouldn't be trying to parse it using an XML parser.
Ah, I see. The XML is actually correctly encoded, but you didn't get the SO markup right.
It would appear that your real problem is that your characters(...) callback is being called separately for the text before the &amp;, for the (decoded) &, and finally for the text after the &amp;. You simply have to deal with this by joining the text chunks back together.
The javadoc for ContentHandler.characters() says this:
"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks ...".
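Joining the chunks usually means appending to a buffer in characters() and reading it in endElement(). Here is a minimal sketch using the JDK's SAX parser. The class name TextCollector and the choice to collect every leaf element's text into a list are assumptions for illustration, and it does not handle mixed content (text interleaved with child elements).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates character chunks per element.
final class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start a fresh buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text; just append.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            values.add(text); // the complete, decoded text of the element
        }
        buffer.setLength(0);
    }

    static List<String> collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<dict><string>Simon &amp; Garfunkel</string></dict>"));
    }
}
```

However the parser splits the text around the decoded ampersand, the value seen in endElement() is the whole string.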
It's probably not the best general solution for escape characters, but I only had to take into account new lines so it was easy to just check for \n.
You could check for the backslash \ only to check for all escape characters or in your case &, although I think others will come with more elegant solutions.
@Override
public void characters(char[] ch, int start, int length)
{
    // Build a String from the chunk the parser handed us
    String elementData = new String(ch, start, length);
    boolean elementDataContainsNewLine = (elementData.indexOf("\n") != -1);
    if (!elementDataContainsNewLine)
    {
        // do what you want if it is not a new line
    }
}
Do you have an excerpt for us? Is the file itunes-generated? If so, it sounds like a bug in iTunes to me, that forgot to encode the ampersand correctly. I would not be surprised: they clearly didn't get XML in the first place, their schema of <name>[key]</name><string>[value]</string> must make the XML inventors puke.
You might want to use a different, more robust parser. SAX is great as long as the file is well-formed. I do not know, however, how robust dom4j and jdom are. Just give them a try. For Python, I know that I would recommend ElementTree or BeautifulSoup, which are very robust.
Also have a look at http://code.google.com/p/xmlwise/ which I found mentioned here in stackoverflow (did you use search?).
Update: (as per updated question) You need to understand the role of entities in XML and thus SAX. They are by default separate nodes, just like text nodes. So you will likely need to join them with adjacent text nodes to get the full value. Do you use a DTD in your parser? Using a proper DTD, with entity definitions, can help parsing a lot, as it can contain mappings from entities such as &amp; to the characters they represent (&), and the parser may be able to do the merging for you. (At least the Python XML-pull parser I like to use for large files does when materializing subtrees.)
I am parsing the below string using SAXParser
<xml>
<FirstTag>&amp;&lt;</FirstTag>
<SecondTag>test</SecondTag>
</xml>
I want the same string to be retained but it is getting converted to below
<xml>
<FirstTag>&<</FirstTag>
<SecondTag>test</SecondTag>
</xml>
Here is my code. How can I avoid this being converted?
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
MyHandler handler = new MyHandler();
values = handler.getValues();
saxParser.parse(x, handler);

problem in XML parser when a extra quote

I have written an XML parser which successfully parses an XML file given as input. But sometimes the input file given to my parser has a double quote in a text property, because of which my parser crashes.
Eg
<tag myprop=" this has a extra quote here like " some times" > </tag>
I know the tag that may /may not have the extra quote.I use a dom parser.
How can i handle this situation?
You won't be able to use an XML parser until you have actual XML. What you currently have is invalid (ie not XML). You should escape the quote-mark inside the attribute beforehand.
The escaped code would look like:
<tag myprop=" this has a extra quote here like &quot; some times" > </tag>
As to why your parser crashes, well there are dozens of XML libraries in existence - have you looked at any of those? I would personally expect to receive a ParseException or something like that.
I don't know for sure, but I think it's just invalid XML and so your parser should fail gracefully (rather than crashing) but I don't think it should successfully parse such a file.
You can not. That is not a valid XML, so the DOM parser will fail to parse.
see XML 1.0 specification, section 2.4:
http://www.w3.org/TR/xml/#attdecls
"To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as "&apos;",
and the double-quote character (") as "&quot;"."
so, since it's not valid XML your parser shouldn't be trying to handle the invalid value, it just needs to give an error.
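To illustrate both points (the parser rejects the raw input, and escaping the quote beforehand makes it parseable), here is a minimal sketch with the JDK's DOM parser. The class and method names are hypothetical, and it assumes you have access to the raw attribute value before the document is assembled.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
final class AttributeQuoteDemo {
    private AttributeQuoteDemo() {}

    /** Escapes a raw value for use inside a double-quoted attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")   // ampersand first, to avoid double escaping
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    /** Returns true if the DOM parser accepts the input, false on a parse error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed: report it, do not try to "handle" it
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = " this has a extra quote here like \" some times";
        System.out.println(isWellFormed("<tag myprop=\"" + raw + "\" > </tag>"));
        System.out.println(isWellFormed("<tag myprop=\"" + escapeAttribute(raw) + "\" > </tag>"));
    }
}
```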

XML parsing with SAX | how to handle special characters?

We have a JAVA application that pulls the data from SAP, parses it and renders to the users.
The data is pulled using JCO connector.
Recently we were thrown an exception:
org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions here are :
Is there any existing(open source) utility that does this job of replacing illegal characters in XML?
Or if I had to write such utility, how should i handle them?
Why is the above exception thrown?
Thank You.
From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
While replacing the '&' by '&amp;' can be done by a simple String.replaceAll(...) on the string from the toXML() call, other characters can be harder to replace (the '<' and '>' for example).
regards
Guillaume
It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character code, &#00, you might be able to get away with a replace all on it with the empty string:
String goodXml = badXml.replaceAll("&#00;", "");
I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.
I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.
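The replace-before-parsing idea can be generalised from &#00; to any numeric character reference that points at a code point XML 1.0 does not allow. The following is a rough sketch, not an existing utility: the class name is made up, and it works on the raw string, so it would also alter look-alike text inside CDATA sections or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parse filter: drops character references to illegal code points.
final class InvalidCharRefStripper {
    // Matches decimal (&#0;) and hexadecimal (&#x1F;) character references.
    private static final Pattern CHAR_REF = Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    private InvalidCharRefStripper() {}

    /** The Char production of XML 1.0. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String strip(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String digits = m.group(1);
            int cp;
            try {
                cp = digits.startsWith("x")
                        ? Integer.parseInt(digits.substring(1), 16)
                        : Integer.parseInt(digits);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point: treat as invalid
            }
            // Keep legal references untouched, replace illegal ones with nothing.
            String replacement = isXmlChar(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<a>x&#00;y&#10;z&#x1F;</a>"));
    }
}
```

Silently dropping characters loses data, so logging what was removed, or substituting a visible placeholder, may be the safer choice depending on what the SAP fields are used for.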
You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on wikipedia.
