XML parsing with SAX | how to handle special characters? - java

We have a Java application that pulls data from SAP, parses it, and renders it to users.
The data is pulled using JCO connector.
Recently we were thrown an exception:
org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions here are :
Is there any existing (open source) utility that replaces illegal characters in XML?
Or, if I had to write such a utility, how should I handle them?
Why is the above exception thrown?
Thank You.

From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
While replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string returned by the toXML() call, other characters are harder to replace ('<' and '>', for example).
regards
Guillaume

It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character code, &#00, you might be able to get away with a replace all on it with the empty string:
String goodXml = badXml.replaceAll("&#00;", "");
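As for why the exception is thrown: XML 1.0 does not allow U+0000 (or most other control characters) anywhere in a document, not even written as a character reference, so a conformant parser has to reject it. If the producer cannot be fixed, a more general pre-parse filter than a single replace could look like the sketch below. `XmlSanitizer` and its methods are hypothetical names, not part of JCo or any library, and the sketch assumes that silently dropping the offending data is acceptable and that the input has no CDATA sections in which `&#...;` is meant literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: strips content that XML 1.0 forbids before parsing. */
final class XmlSanitizer {

    // Numeric character references, decimal (&#0;) or hexadecimal (&#x1F;).
    private static final Pattern CHAR_REF = Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes invalid raw characters and references that point at invalid characters. */
    static String sanitize(String xml) {
        // Pass 1: drop character references whose target is not a legal XML character.
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer kept = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            boolean valid;
            try {
                int cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body, 10);
                valid = isValidXmlChar(cp);
            } catch (NumberFormatException e) {
                valid = false; // number too large to be any code point
            }
            m.appendReplacement(kept, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(kept);

        // Pass 2: drop raw characters that are not legal in XML 1.0.
        StringBuilder out = new StringBuilder(kept.length());
        kept.toString().codePoints()
                .filter(XmlSanitizer::isValidXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<a>x&#00;y</a>"));
    }
}
```

Valid references such as `&#10;` are left untouched, so the parser still resolves them as usual.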

I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.
I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.

You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on wikipedia.

Related

Hexadecimal (&#xA7;) to Char (§) in Java

I have a huge XML file that contains many hexadecimal character references, e.g. &#xA7;. How can I convert every such reference in the file to the corresponding character, e.g. &#xA7; to §, in Java?
<md.first.line.cite>UT ST &#xA7;&#x2002;10-2-409</md.first.line.cite>
You mentioned that you want to convert to §, so this is one way to convert the hex value to your desired character:
System.out.println((char)0xA7);
Or
int hex=0xA7;
System.out.println((char)hex);
Let's start with what &#xA7; and &#x2002; mean. These are XML character references. They represent the Unicode code points U+00A7 and U+2002. (This is real XML syntax, not just some random nuisance escape sequence that needs to be dealt with.)
So, if you parse that XML with a conformant XML parser, the parser will automatically take care of translating the references to the corresponding Unicode code-points. Your application should not need to do any translating.
This implies that you are NOT using a proper XML parser in your application. Bad idea! Doing your own XML parsing by string bashing or using regexes tends to give inflexible code and/or unreliable results when faced with "variant" XML. So my main recommendation would be:
Use a standard off-the-shelf XML parser.
If your XML is non-compliant, consider using Jsoup or similar to extract information from the XML.
If you are already deep down the rabbit hole of string bashing, etc., the best thing to do would be to extract the entire encoded XML text segment and convert it to a String using existing library code. The standard Java SE class libraries don't provide this functionality, but you could use StringEscapeUtils.unescapeXml() from org.apache.commons.text. (The version from org.apache.commons.lang3 has been deprecated.)
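To illustrate the point about conformant parsers, here is a minimal sketch using the DOM parser that ships with Java SE. The class and method names are made up for the example, and it assumes the sample element contains the two references discussed above. The text returned by the parser already holds the real characters, with no extra decoding step.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical demo class: lets the parser resolve character references. */
final class CharRefDemo {

    /** Parses the XML and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<md.first.line.cite>UT ST &#xA7;&#x2002;10-2-409</md.first.line.cite>";
        // The references arrive as U+00A7 (section sign) and U+2002 (en space).
        System.out.println(rootText(xml));
    }
}
```

For a huge file, a streaming parser (SAX or StAX) resolves the references in the same way without loading the whole document into memory.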

Best way to validate non-printable ascii characters in XML

Our application needs to validate various input XML messages for non-printable ASCII characters. We currently know of two options to do this:
Change the XSD to include the restriction.
Validate the input XML string in the Java application using a regular expression.
Which approach is better in terms of performance as our application has to return the response within a few seconds? Is there any other option available to do this?
It's mainly a matter of opinion but if you have an XSD that seems to be the natural place to include the validations. The only thing you may need to consider is that via XSD you will either fail or pass, whereas with ad-hoc java validation you can ignore non-printable, or replace or take an action without failing the input completely.
The only characters that are (a) ASCII, (b) non-printable, and (c) allowed in XML 1.0 documents are CR, LF, and TAB, plus DEL (U+007F), which the specification permits but discourages. I find it hard to see why excluding the three whitespace characters is especially important, but if you already have an XSD schema, then it makes sense to add the restriction there.
The usual approach is not to make these three characters invalid, but to treat them as equivalent to space characters, which you can do by using a data type whose whiteSpace facet has the value "replace" or "collapse".
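For comparison, the regular-expression option from the question can be sketched as follows. The class name and the exact character set are assumptions: this version flags the ASCII control characters other than TAB, LF, and CR, plus DEL. A single precompiled pattern scans the string once, so the cost is linear in the message size.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: detects non-printable ASCII characters in a string. */
final class NonPrintableCheck {

    // C0 control characters except TAB (0x09), LF (0x0A), CR (0x0D), plus DEL (0x7F).
    private static final Pattern NON_PRINTABLE =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]");

    /** True if the input contains at least one flagged character. */
    static boolean containsNonPrintableAscii(String input) {
        return NON_PRINTABLE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("<a>bell\u0007</a>"));
    }
}
```

Note that most of these characters would already make an XML 1.0 parser reject the message, so a check like this mainly helps when you want to report or repair the input before parsing.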

how to escape special characters in xml without escaping xml tags(<>) in java

I want to escape special characters in xml input.
I tried StringEscapeUtils.escapeXml10(xmlString) but it ends up escaping xml tags also(<>).
For example:
<Company>Test & Test</Company>
should be normalized to
<Company>Test &amp; Test</Company>
not
&lt;Company&gt;Test &amp; Test&lt;/Company&gt;
You're basically asking how to automatically convert invalid XML to valid XML. That's not a tractable problem, in the general case (imagine for example that you had an embedded < in the actual data).
The correct solution to this problem is to identify why you're starting with invalid XML, and fix that issue.
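If the invalid XML turns out to come from code you control that builds documents by string concatenation, one way to fix it at the source is to let an XML writer do the escaping. The sketch below uses the StAX writer from Java SE. The class and method names are hypothetical, and it assumes the raw company name is available as a separate value before it is wrapped in markup.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical example: builds the element with a real XML writer. */
final class CompanyXml {

    /** Returns a Company element whose text content is escaped by the writer. */
    static String companyElement(String name) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("Company");
            writer.writeCharacters(name); // escapes markup characters such as & and <
            writer.writeEndElement();
            writer.flush();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(companyElement("Test & Test"));
    }
}
```

Because the tags are written through the API and only the text goes through writeCharacters, the markup stays intact while the data is escaped.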

Escaping XML using Java and Marklogic

I am loading XML into a Marklogic database using Java and Marklogic's XCC API. Before I do so, I use Apache Commons to escape the elements' contents (StringEscapeUtils.escapeXml). Upon loading the contents, though, I error out due to a curly brace character in the contents. escapeXml doesn't handle the curly brace. My questions are:
1) Is that a Marklogic specific issue (maybe with XCC) or is it an issue with XML in general?
2) Are there other characters that could also cause a problem (i.e. not escaped by the escapeXml routine)?
3) Is there a different routine that could be used to avoid this and any future undesired characters?
You should not escape contents when using XCC; it escapes them itself, so you would be double-escaping. However, curly braces are generally not something XML complains about. Perhaps you are using the Invoke methods instead of the Insert methods
(i.e. XCC would then try to interpret your content as XQuery).
Could you provide a sample of your content and a code snippet?

Replacing Java unicode encodings with actual characters

When I make web queries, for accented characters, I get special character encodings back as strings such as "\u00f3" , but I need to replace it with the actual character, like "ó" before making another query.
How would I find these cases without actually looking for each one, one by one?
It seems you're handling JSON formatted data.
Use any of the many freely available JSON libraries to handle this (and other parsing issues) for you instead of trying to do it manually.
The one from JSON.org is pretty widely used, but there are surely others that work just as well.
