ignoring & in DOM XML parser - java

I am trying to parse an XML file using a SAX parser. But when it finds an & it gives me the error "The entity name must immediately follow the '&' in the entity reference.". How can I make the parser ignore '&' while parsing or, if possible, convert it into &amp; from the DTD itself?

Your input is not valid XML, since it seems to contain & characters which are not followed by an entity name or character reference.
The cleanest way to solve this is to make sure that the input is valid XML before you parse it, i.e. replace the offending & characters with &amp;.
I don't think you can convince any decent XML parser to silently ignore XML syntax errors.
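If fixing the producer is not an option, a pre-processing pass along these lines is one way to escape the stray ampersands before handing the text to the parser. This is only a sketch: the class and method names are hypothetical, and the regex is a heuristic that treats anything shaped like an entity or character reference as already escaped. It will not do the right thing for ampersands inside CDATA sections or for text that merely looks like a reference to an undefined entity.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any XML library.
final class AmpersandFixer {
    // Matches '&' that is NOT followed by a named entity (amp;),
    // a decimal reference (#38;) or a hex reference (#x26;).
    private static final Pattern STRAY_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z_:][A-Za-z0-9_:.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Returns the input with every stray '&' rewritten as "&amp;".
    static String escapeStrayAmpersands(String rawXml) {
        return STRAY_AMP.matcher(rawXml).replaceAll("&amp;");
    }
}
```

The result of escapeStrayAmpersands(...) can then be passed to the SAX or DOM parser in place of the original string.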

Find the person/entity responsible for producing the invalid XML input
Make sure that person/entity never again in his/her/its life will ever be capable of producing invalid XML again
Repeat for any new offender
Use of unnecessary violence in the apprehension of the XML villains HAS been approved
Or, you can just resign and use TagSoup or something similar.

Related

how to escape special characters in xml without escaping xml tags(<>) in java

I want to escape special characters in xml input.
I tried StringEscapeUtils.escapeXml10(xmlString) but it ends up escaping xml tags also(<>).
For example:
<Company>Test & Test</Company>
should be normalized to
<Company>Test &amp; Test</Company>
Not
&lt;Company&gt;Test &amp; Test&lt;/Company&gt;
You're basically asking how to automatically convert invalid XML to valid XML. That's not a tractable problem, in the general case (imagine for example that you had an embedded < in the actual data).
The correct solution to this problem is to identify why you're starting with invalid XML, and fix that issue.

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items is a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a lower-level interface and, via its LexicalHandler interface, can notify you when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
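With the restructured document, reading the tags no longer depends on telling an encoded comma from a plain one. A minimal sketch, assuming the `<tags>`/`<tag>` layout shown above and using the JDK's built-in DOM parser rather than dom4j (the class and method names are made up for illustration):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for the <tags><tag>...</tag></tags> layout.
final class TagReader {
    // Returns the text of every <tag> element, in document order.
    static List<String> readTags(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("tag");
            List<String> tags = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                tags.add(nodes.item(i).getTextContent().trim());
            }
            return tags;
        } catch (Exception e) {
            // Wrap parser/IO failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not parse tags", e);
        }
    }
}
```

Each tag arrives as its own string, so a comma inside a tag is just part of the text.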
Using LexEv from http://andrewjwelch.com/lexev/, and putting xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run a short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
With vtd-xml you can distinguish a character from its entity-encoded counterpart by using VTDNav's toRawString() method.

Escape sequences in XML

I'm getting an Exception due to special characters when Xml is accessed by client
Can any one help me...?
You need to set the correct encoding, and make sure the XML document is created with the same encoding.
<?xml version="1.0" encoding="INSERT ENCODING HERE"?>
You will need to ensure the special characters are enclosed within CDATA sections:
<![CDATA[
some special characters here
]]>
I found the mistake in my case: while opening/reading the XML I was getting the error because of three symbols. The characters <, > and & need to be replaced by their entity names &lt;, &gt; and &amp;. Only then does the parsing error go away.
In other scenarios it is the entity names, not the symbols, that need to be replaced; only then does the parsing exception go away.
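The replacement described above can be sketched as follows. This is a minimal illustration for element text only (attribute values would also need their quote characters escaped), and the class and method names are hypothetical:

```java
// Hypothetical helper for escaping element text before it is put into XML.
final class XmlTextEscaper {
    static String escapeText(String text) {
        // '&' must be replaced first; otherwise the '&' introduced by
        // "&lt;" and "&gt;" would itself be escaped a second time.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }
}
```

This should only be applied to the text content, never to the whole document, or the markup itself gets escaped too.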
You can include XML's special characters in XPL. XPL is structured exactly like XML but allows the special characters. The XPL to XML conversion utilities will take care of all the details for you. http://hll.nu

problem in XML parser when a extra quote

I have written an XML parser which successfully parses an XML file given as input. But sometimes the input file given to my parser has a double quote in a text property, because of which my parser crashes.
Eg
<tag myprop=" this has a extra quote here like " some times" > </tag>
I know which tag may or may not have the extra quote. I use a DOM parser.
How can I handle this situation?
You won't be able to use an XML parser until you have actual XML. What you currently have is invalid (ie not XML). You should escape the quote-mark inside the attribute beforehand.
The escaped code would look like:
<tag myprop=" this has a extra quote here like &quot; some times" > </tag>
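If you control the code that produces the file, one way to get this escaping for free is to let an XML writer build the element instead of concatenating strings. The sketch below uses the JDK's StAX XMLStreamWriter and then reads the attribute back with the DOM parser to show the round trip. The class and method names are hypothetical, and the exact serialized form may differ between StAX implementations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: writes <tag myprop="..."/> with proper escaping.
final class AttributeWriter {
    static String writeTag(String propValue) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("tag");
            w.writeAttribute("myprop", propValue); // the writer escapes the quote
            w.writeEndElement();
            w.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not write tag", e);
        }
    }

    // Parses the XML again and returns the myprop attribute of the root.
    static String readProp(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("myprop");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse tag", e);
        }
    }
}
```

Calling readProp(writeTag(value)) should give back the original value, extra quote included, because the parser now sees well-formed XML.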
As to why your parser crashes, well there are dozens of XML libraries in existence - have you looked at any of those? I would personally expect to receive a ParseException or something like that.
I don't know for sure, but I think it's just invalid XML and so your parser should fail gracefully (rather than crashing) but I don't think it should successfully parse such a file.
You can not. That is not a valid XML, so the DOM parser will fail to parse.
see XML 1.0 specification, section 2.4:
http://www.w3.org/TR/xml/#attdecls
To allow attribute values to contain both single and double quotes,
the apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as "&quot;".
so, since it's not valid XML your parser shouldn't be trying to handle the invalid value, it just needs to give an error.

XML parsing with SAX | how to handle special characters?

We have a JAVA application that pulls the data from SAP, parses it and renders to the users.
The data is pulled using JCO connector.
Recently we were thrown an exception:
org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions here are :
Is there any existing(open source) utility that does this job of replacing illegal characters in XML?
Or if I had to write such utility, how should i handle them?
Why is the above exception thrown?
Thank You.
From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
While replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string from the toXML() call, other characters can be harder to replace (the '<' and '>' for example).
regards
Guillaume
It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character code, &#00, you might be able to get away with a replace all on it with the empty string:
String goodXml = badXml.replaceAll("&#00;", "");
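A slightly more general version of the same idea is to drop every numeric character reference whose code point XML 1.0 does not allow, rather than only the one for zero. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, it only handles references terminated by a semicolon, and it does not touch raw control characters that appear unescaped in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: removes character references that XML 1.0 forbids.
final class InvalidCharRefStripper {
    // Decimal (&#0;) or hexadecimal (&#x0;) numeric character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // The Char production of XML 1.0.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String strip(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, treat as invalid
            }
            // Keep valid references as they are, delete invalid ones.
            String replacement = isValidXml10Char(cp) ? m.group() : "";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Silently deleting data this way hides the upstream bug, so it is best treated as a stopgap until the producer emits valid XML.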
I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.
I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.
You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on wikipedia.
