Problem in XML parser when there is an extra quote - Java

I have written an XML parser that successfully parses the XML file given as input. But sometimes the input file given to my parser has a double quote inside an attribute value, which makes my parser crash.
E.g.:
<tag myprop=" this has a extra quote here like " some times" > </tag>
I know which tag may or may not have the extra quote. I use a DOM parser.
How can I handle this situation?

You won't be able to use an XML parser until you have actual XML. What you currently have is invalid (ie not XML). You should escape the quote-mark inside the attribute beforehand.
The escaped code would look like:
<tag myprop=" this has a extra quote here like &quot; some times" > </tag>
As to why your parser crashes, well there are dozens of XML libraries in existence - have you looked at any of those? I would personally expect to receive a ParseException or something like that.

I don't know for sure, but I think it's just invalid XML and so your parser should fail gracefully (rather than crashing) but I don't think it should successfully parse such a file.

You cannot. That is not valid XML, so the DOM parser will fail to parse it.

see XML 1.0 specification, section 2.4:
http://www.w3.org/TR/xml/#attdecls
"To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as "&quot;"."
so, since it's not valid XML your parser shouldn't be trying to handle the invalid value, it just needs to give an error.
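
To make the difference concrete, here is a minimal sketch using the DOM parser that ships with the JDK. The class and method names (QuoteInAttributeDemo, readMyProp) are made up for illustration and are not from the question. It only shows that the unescaped input is rejected, while the same attribute with the inner quote written as &quot; parses and gives back the original text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the asker's code.
class QuoteInAttributeDemo {

    // Returns the root element's "myprop" attribute,
    // or null when the input is not well-formed XML.
    static String readMyProp(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("myprop");
        } catch (SAXException e) {
            // The parser refuses the document; report that to the caller.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        String broken  = "<tag myprop=\" this has a extra quote here like \" some times\" > </tag>";
        String escaped = "<tag myprop=\" this has a extra quote here like &quot; some times\" > </tag>";
        System.out.println("broken parses:  " + (readMyProp(broken) != null));
        System.out.println("escaped parses: " + (readMyProp(escaped) != null));
    }
}
```

Catching the exception and returning null is just one way to "fail gracefully"; in a real application you would probably log the line and column carried by the SAXParseException and reject the file.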

Related

how to escape special characters in xml without escaping xml tags(<>) in java

I want to escape special characters in xml input.
I tried StringEscapeUtils.escapeXml10(xmlString) but it ends up escaping xml tags also(<>).
For example:
<Company>Test & Test</Company>
should be normalized to
<Company>Test&amp;Test</Company>
Not
&lt;Company&gt;Test&amp;Test&lt;/Company&gt;
You're basically asking how to automatically convert invalid XML to valid XML. That's not a tractable problem, in the general case (imagine for example that you had an embedded < in the actual data).
The correct solution to this problem is to identify why you're starting with invalid XML, and fix that issue.

ignoring & in DOM XML parser

I am trying to parse an XML file using a SAX parser. But when it finds an & it gives me the error "The entity name must immediately follow the '&' in the entity reference.". How can I make the parser ignore '&' while parsing, or if possible convert it into &amp; from the DTD itself?
Your input is not valid XML, since it seems to contain & characters which are not followed by an entity name or character reference.
The cleanest way to solve this is to make sure that the input is valid XML before you parse it, i.e. replace the offending & characters with &amp;.
I don't think you can convince any decent XML parser to silently ignore XML syntax errors.
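
If you really have to repair the text before parsing, one possible preprocessing step is sketched below. It is a heuristic, not a general fix: it assumes the only problem is a bare & that is not already the start of something shaped like a reference (&name;, &#38; or &#x26;). The class and method names are made up for this example. It will get inputs like "AT&T;" wrong, and it would also rewrite ampersands inside CDATA sections.

```java
import java.util.regex.Pattern;

// Hypothetical helper, shown only as an illustration of the idea.
class BareAmpersandFixer {

    // Matches an '&' that is NOT followed by "name;", "#digits;" or "#xhex;".
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z_:][A-Za-z0-9_.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare ampersand with the entity reference &amp;.
    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("<Company>Test & Test</Company>"));
    }
}
```

Existing references are left alone, so running the method twice gives the same result as running it once. It does nothing for a stray < in the data, which is why fixing the producer of the file is still the better answer.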
Find the person/entity responsible for producing the invalid XML input
Make sure that person/entity never again in his/her/its life will ever be capable of producing invalid XML again
Repeat for any new offender
Use of unnecessary violence in the apprehension of the XML villains HAS been approved
Or, you can just resign and use TagSoup or something similar.

Escaping an xml string in java

I read elements with CDATA sections from a rss-feed which I need to convert to valid xml. The content in the CDATA section is mostly valid xhtml, but some times characters like ampersand appear in attributes (url's).
I can use .replaceAll("&", "&amp;") to solve this but thinking a bit forward it may be that other invalid characters show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS...?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output.... so, what are you doing to have broken output?
Rolf
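
The same principle can be shown without any extra dependency. The sketch below deliberately uses the JDK's built-in DOM and Transformer rather than JDOM, so it is an analogy and not JDOM's API; the class and method names are invented for the example. The point is that you hand the serializer raw, unescaped strings and it escapes attribute values and text on output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class using the JDK's DOM, not JDOM.
class SerializerEscapingDemo {

    // Builds <a href="...">label</a> from raw strings and serializes it.
    static String buildLink(String href, String label) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element a = doc.createElement("a");
        a.setAttribute("href", href);   // raw value, no manual escaping
        a.setTextContent(label);        // raw text, no manual escaping
        doc.appendChild(a);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The ampersands in both the attribute and the text are escaped on output.
        System.out.println(buildLink("http://example.com/?a=1&b=2", "Tom & Jerry"));
    }
}
```

So the escaping problem only exists while the content is still a string of markup. Once it is in a document model, the output side takes care of itself.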

Write dom4j Document to xml-file with escaped '\r\n\'

Sorry for asking about quite the same issue, but now i would like to:
write a dom4j Document which contains tags looking like this:
<Field>\r\n some text</Field>
to a xml file, but the \r\n should be escaped to &#13;&#10;.
org.dom4j.Document.asXml()
does not work.
Assuming you mean that's a CRLF sequence in the text node (and not merely a literal backslash-r-backslash-n), you won't be able to persuade an XML serialiser to write them as &#13;&#10;, because XML says you don't have to. The document is absolutely equivalent in XML terms, whether you escape it or not. The only place you need to escape the CRLF sequence as &#13;&#10; is in an attribute value.
If you really must produce this output, you would have to write your own XML serialiser that followed special rules for escaping control codes. But if you are doing this because an external tool can't read the XML element with CRLF sequences in, you should concentrate on fixing that tool, because if it can't deal with newlines in text content it's broken and not a proper XML parser.
Walk the tree, applying String.replace to the Text nodes.

Handling special characters in XML when transforming with Saxon

I'm attempting to apply a stylesheet to an XML document using Saxon. Given an XML file that was generated in Microsoft Word and that has Microsoft Word-style quotes, such as around FOO in the following document
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<act>
<performer typeCode=“FOO“ />
<performer typeCode="BAR" />
</act>
</doc>
Saxon throws the following error:
SXXP0003: Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.
What is the best way to handle these type of "special" characters in XML that were intended to be valid but break in actual parsing/transformation?
Since the above is not valid XML, you will have to do some preprocessing of the input (say with a FilterReader), as just about any XML parser will indicate an error (and typically a fatal error, so you cannot handle the error and continue).
If the special quotes are only in the xml you can do a simple replace of the special quotes with plain quotes (a little more work if you have to check the preamble for the encoding type). If you want to keep special quotes elsewhere in the document you will have to do something a bit more complicated (mostly keep track as to whether you are in a tag or not).
The trouble is that those 'special' quotes are not valid XML. Saxon or any other XML parser is going to throw that stuff out and not parse the document.
Only thing I can suggest is a search and replace for those and replace them with the expected quotes.
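
A minimal sketch of that search and replace, assuming the file has already been read into a String with the right charset. (The "Invalid byte" error suggests the bytes may not actually be UTF-8, in which case the decoding has to be sorted out first.) The class and method names are invented for the example, and it replaces curly double quotes everywhere, so it only fits the simple case where they never appear as legitimate text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, shown only to illustrate the preprocessing step.
class SmartQuoteCleaner {

    // Replaces left and right curly double quotes with the plain ASCII quote.
    static String straightenQuotes(String xml) {
        return xml.replace('\u201C', '"').replace('\u201D', '"');
    }

    // Cleans the input, parses it, and returns the first performer's typeCode.
    static String firstTypeCode(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(straightenQuotes(xml))));
        Element performer = (Element) doc.getElementsByTagName("performer").item(0);
        return performer.getAttribute("typeCode");
    }

    public static void main(String[] args) throws Exception {
        String input = "<doc><act><performer typeCode=\u201CFOO\u201C />"
                + "<performer typeCode=\"BAR\" /></act></doc>";
        System.out.println(firstTypeCode(input));
    }
}
```

If curly quotes must survive in text content, the replacement has to be limited to the inside of tags, which means tracking whether you are between < and > as described above.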
