Best way to validate non-printable ascii characters in XML - java

Our application needs to validate incoming XML messages for non-printable ASCII characters. We currently know of two options:
1. Change the XSD to include the restriction.
2. Validate the input XML string in the Java application using a regular expression.
Which approach is better in terms of performance, given that our application has to return the response within a few seconds? Is there any other option available?

It's mainly a matter of opinion, but if you have an XSD, that seems to be the natural place for the validation. The one thing to consider is that XSD validation can only pass or fail, whereas with ad-hoc Java validation you can ignore non-printable characters, replace them, or take some other action without rejecting the input completely.
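For the ad-hoc Java route, a minimal sketch might look like the following. It assumes "non-printable ASCII" means the control range 0x00-0x1F plus DEL (0x7F), which includes TAB, CR, and LF; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class ControlCharCheck {
    // Assumption: "non-printable ASCII" = control codes 0x00-0x1F and DEL (0x7F).
    // The pattern is compiled once and reused for every message.
    private static final Pattern NON_PRINTABLE = Pattern.compile("[\\x00-\\x1F\\x7F]");

    // Returns true if the input contains at least one non-printable ASCII character.
    static boolean containsNonPrintable(String input) {
        return NON_PRINTABLE.matcher(input).find();
    }

    // Returns the input with every non-printable ASCII character removed.
    static String stripNonPrintable(String input) {
        return NON_PRINTABLE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String message = "<msg>hello\u0007world</msg>"; // contains a BEL character
        System.out.println(containsNonPrintable(message));
        System.out.println(stripNonPrintable(message));
    }
}
```

A compiled Pattern is immutable and safe to share between threads, so the scan is a single linear pass over the string with no per-request setup.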

The only characters that are (a) ASCII, (b) non-printable, and (c) allowed in XML 1.0 documents are CR, NL, and TAB. I find it hard to see why excluding those three characters is especially important, but if you already have an XSD schema, then it makes sense to add the restriction there.
The usual approach is not to make these three characters invalid, but to treat them as equivalent to space characters, which you can do by using a data type whose whiteSpace facet has the value "replace" or "collapse".
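If the restriction does go into the schema, a rough sketch using the JDK's javax.xml.validation API could look like this. The schema is a hypothetical one with a single msg element limited to printable ASCII (space to tilde) via a pattern facet; a real schema would apply the same restriction to its own types.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class PrintableAsciiSchemaCheck {
    // Hypothetical schema: <msg> may only contain characters from space to tilde.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='msg'>"
        + "<xs:simpleType>"
        + "<xs:restriction base='xs:string'>"
        + "<xs:pattern value='[ -~]*'/>"
        + "</xs:restriction>"
        + "</xs:simpleType>"
        + "</xs:element>"
        + "</xs:schema>";

    // The compiled Schema is thread-safe, so it is built once and shared.
    private static final Schema SCHEMA = compile();

    private static Schema compile() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is invalid", e);
        }
    }

    // Returns true if the document is well-formed and valid against the schema.
    static boolean isValid(String xml) {
        try {
            Validator validator = SCHEMA.newValidator(); // not thread-safe: one per call
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // parse error or schema violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

As noted above, the outcome is only pass or fail: a message containing a TAB inside msg is rejected as a whole.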

Related

Unable to parse Special char Plus Sign (+) in XML

Is there any way to escape/avoid the special character (Plus sign +) in XML?
I am creating the XML at run time and it may contain special characters,
e.g. "Tag+" is a name which I receive at run time, and based on that
I have to create tags in the XML.
<Tag+>___</Tag+>
Kindly suggest a solution for this. How to handle this kind of scenario?
Thank you
One way to generate a valid XML Name from an arbitrary character string is to replace any character that's not valid in a name by _XXX_ where XXX is the hexadecimal code of the character in question. For a list of characters that are valid in names, see the XML specification. Or you could escape any character other than [0-9a-zA-Z] if you prefer.
This will have the effect of turning "Tag+" into "Tag_2B_".
If there's part of this algorithm that you don't know how to implement, please ask a more specific question.
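The simpler variant of this scheme (escape everything other than [0-9a-zA-Z]) could be sketched as follows. The class and method names are invented for the example. Escaping a leading digit is an extra assumption added here, because an XML name cannot start with a digit.

```java
import java.util.Locale;

final class XmlNameEncoder {
    // Replaces every character outside [0-9a-zA-Z] with _XXX_, where XXX is the
    // uppercase hexadecimal code point. A leading digit is escaped as well
    // (assumption of this sketch), since XML names may not begin with a digit.
    static String toXmlName(String raw) {
        if (raw.isEmpty()) {
            throw new IllegalArgumentException("an XML name cannot be empty");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            boolean letter = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
            boolean digit = cp >= '0' && cp <= '9';
            if (letter || (digit && i > 0)) {
                out.appendCodePoint(cp);
            } else {
                out.append('_')
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append('_');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXmlName("Tag+")); // Tag_2B_
    }
}
```

Because the underscore itself is escaped (as _5F_), the mapping stays reversible: any _XXX_ sequence in the output is always an escape.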

Escape special characters/Symbols in XML?

While creating XML from a table in my DB, I get many special characters such as registered trademark, trademark, degree, and various punctuation marks (present in symbol form, hexadecimal, named code, or numeric code), for example °.
Also, some characters show up as x99, xEA, etc. in my XML.
Is there a library/API to handle all of these while creating XML using Java code?
I am using "UTF-8" character encoding for my XML.
Also, I can't clean my DB to make the data consistent, since it's production data.
A potential option is to enclose your data in CDATA tags, which marks the data as character data that may include markup, but should not be processed as such.
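If you do build CDATA sections by hand, one detail to watch is that the text must not contain the terminator "]]>". A small sketch, with made-up names, that splits the terminator across two adjacent sections:

```java
final class CdataWrap {
    // Wraps text in a CDATA section. Any "]]>" inside the text is split so that
    // "]]" ends one section and ">" starts the next one.
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(wrapInCdata("5 < 6 & \u00AE"));
    }
}
```

CDATA only stops markup characters such as < and & from being interpreted. It does not make characters that are illegal in XML (most control characters) acceptable, and it does nothing about a mismatch between the declared and the actual file encoding.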
There is a free command line tool for transforming files with special characters in text to valid XML. It also assures that the file encoding matches what is specified in the declaration.
There is also a Java developer suite that allows you to use the parser to parse such files (called XPL) as an alternative to XML or a pre-process into XML. It uses a StAX-like process called StAX-PL.

Replacing Java unicode encodings with actual characters

When I make web queries, for accented characters, I get special character encodings back as strings such as "\u00f3" , but I need to replace it with the actual character, like "ó" before making another query.
How would I find these cases without actually looking for each one, one by one?
It seems you're handling JSON formatted data.
Use any of the many freely available JSON libraries to handle this (and other parsing issues) for you instead of trying to do it manually.
The one from JSON.org is pretty widely used, but there are surely others that work just as well.

XML parsing with SAX | how to handle special characters?

We have a JAVA application that pulls the data from SAP, parses it and renders to the users.
The data is pulled using JCO connector.
Recently we were thrown an exception:
org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions here are :
Is there any existing(open source) utility that does this job of replacing illegal characters in XML?
Or if I had to write such utility, how should i handle them?
Why is the above exception thrown?
Thank You.
From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
While replacing '&' with '&amp;' can be done by a simple String.replaceAll(...) on the string returned by the toXML() call, other characters can be harder to replace ('<' and '>', for example).
regards
Guillaume
It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character reference, &#00;, you might be able to get away with replacing every occurrence of it with the empty string:
String goodXml = badXml.replaceAll("&#00;", "");
I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.
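A pre-parse cleanup along those lines might be sketched like this. It replaces with a space both raw characters and numeric character references that fall outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class name and the choice of a space as replacement are assumptions of the example, not anything from JCo or SAP.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlSanitizer {
    // Matches decimal (&#0;) and hexadecimal (&#x0;) character references.
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String xml) {
        // Pass 1: replace references to illegal characters, keep legal ones as-is.
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer refsCleaned = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.charAt(0) == 'x'
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // number too large to be a code point
            }
            String replacement = isXml10Char(cp) ? m.group() : " ";
            m.appendReplacement(refsCleaned, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(refsCleaned);

        // Pass 2: replace raw illegal characters (including unpaired surrogates).
        StringBuilder out = new StringBuilder(refsCleaned.length());
        refsCleaned.codePoints()
                .forEach(cp -> out.appendCodePoint(isXml10Char(cp) ? cp : ' '));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<a>x&#00;y</a>"));
    }
}
```

This works on the raw string, so it also rewrites such sequences inside CDATA sections, where they would otherwise be literal text.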
I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.
You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on Wikipedia.

How to write ASCII extended characters(which has ascii code > 127) to XML file using java?

I read text from different sources, which can contain characters from different languages or extended characters like € ƒ „ … † ® ©. I then have to write it to an XML file, and I am using PrintWriter in Java to write whatever string I read. These extended characters, which have codes greater than 127, produce illegal character errors in the XML file, so how can I encode them properly while writing the XML?
First, there's no such thing as an ASCII code above 127. ASCII only defines values up to 127. "Extended ASCII" is an ambiguous term, as it's used to describe many different encodings.
Now, as for XML: use whichever XML API you want to write the string, without worrying about the contents (so long as they are representable in XML; various control characters in the range U+0000 to U+001F aren't representable, unfortunately). Don't try to create the XML from scratch yourself - that's what XML APIs are for. Make sure that your XML document uses an encoding that will cope with the characters you need (UTF-8 is normally a good choice, and is often the default), make sure that your Java strings have the right Unicode data in them, and you should be fine.
EDIT: I hadn't actually spotted this bit before:
I am using PrintWriter in Java to write to an XML file
Don't. Please use an XML API. There are plenty around, and you'll have a lot less to worry about. I'd also not recommend using PrintWriter anyway for the most part - suppressing exceptions isn't really a good idea in most cases.
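As one possible illustration of "use an XML API", here is a sketch with the JDK's built-in StAX writer (javax.xml.stream). The element name text and the class name are placeholders. The writer is given the output stream and the encoding, so the declared encoding and the actual bytes agree, and markup characters in the data are escaped for you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class Utf8XmlWriterDemo {
    // Writes a one-element UTF-8 document and returns it as a String for inspection.
    static String writeDoc(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(bytes, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("text");   // placeholder element name
            writer.writeCharacters(text);       // escapes & and < automatically
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
    }

    public static void main(String[] args) {
        // Euro sign, ampersand, copyright sign
        System.out.println(writeDoc("\u20AC & \u00A9"));
    }
}
```

In a real application the stream would be a FileOutputStream rather than a byte array; the point is that the bytes go through the XML writer and not through a separately configured PrintWriter.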
Use the &#value; syntax, where value is the character's decimal code point. A space, for example, would be &#32;.
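A minimal sketch of that idea, assuming every character above 127 should become a decimal numeric character reference (the names are invented for the example):

```java
final class NumericRefEncoder {
    // Replaces every code point above 127 with a decimal character reference.
    // ASCII is copied unchanged, so & and < still need normal XML escaping elsewhere.
    static String toNumericRefs(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericRefs("price: 5 \u20AC")); // price: 5 &#8364;
    }
}
```

The result is pure ASCII, so it survives whatever encoding the file ends up being written in.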
