JAXB skip invalid elements and continue processing

I have an application that needs to process XML files in the following format:
<records>
<record/>
<record/>
<record/>
...
</records>
I am using JAXB to parse these files. However, I want to prepare my application for the inevitable case where it cannot parse one of the records because of invalid data (for example, a character where an int should be).
The problem is that if JAXB cannot parse an individual record, it halts processing of the entire file. That is not acceptable: I need it to skip only the problematic record, report it, and move on. However, I can't find any way to do this. The only thing I've found is the ValidationEventHandler, which lets me return true to tell JAXB to continue processing the file after an error. The problem is that it doesn't actually SKIP the problematic record: JAXB still tries to parse it even though it is known to be invalid, which causes a NumberFormatException and halts processing.
I found this answer, How to skip a single jaxb element validation contained in a jaxb collection in Spring Batch Job?, but it doesn't actually answer the question; it just suggests using ValidationEventHandler even though that functionality is not sufficient.
How can I skip the invalid records and continue processing? How can I solve this problem?

Typically I wouldn't use JAXB if I knew that the input data was likely to contain errors and I needed to recover gracefully... StAX might be better suited. But JAXB does have a "catch all" you can use: https://docs.oracle.com/javase/7/docs/api/javax/xml/bind/annotation/XmlAnyElement.html
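To illustrate the StAX suggestion, here is a minimal sketch that walks the file record by record and skips the ones that fail to parse. It assumes each <record> contains an <id> child that must be an int; that layout, the class name RecordSkipper and the Result holder are made up for the example, since the question does not show the record contents. It uses only the StAX API that ships with the JDK and parses the value by hand; the try/catch around the per-record work is the place where a JAXB call could go instead.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class RecordSkipper {

    /** Hypothetical holder: ids that parsed, plus one message per rejected record. */
    public static final class Result {
        public final List<Integer> ids = new ArrayList<>();
        public final List<String> errors = new ArrayList<>();
    }

    public static Result parse(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        Result result = new Result();
        int recordNo = 0;
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String name = reader.getLocalName();
            if ("record".equals(name)) {
                recordNo++;
            } else if ("id".equals(name)) {
                // getElementText() consumes up to the closing </id>, so the
                // stream stays in sync whether or not the value is usable.
                String text = reader.getElementText().trim();
                try {
                    result.ids.add(Integer.parseInt(text));
                } catch (NumberFormatException e) {
                    // Report the bad record and carry on with the next one.
                    result.errors.add("record " + recordNo + ": invalid id '" + text + "'");
                }
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        Result r = parse("<records><record><id>1</id></record>"
                + "<record><id>x</id></record><record><id>3</id></record></records>");
        System.out.println(r.ids);
        System.out.println(r.errors);
    }
}
```

Note that this only recovers from bad values inside well-formed XML. If the file itself is not well-formed, the StAX reader will still throw.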

Related

"Strict" Avro Parsing Mode (No dropping additional fields)

This is tangentially related to avro json additional field
The issue I have is that JSON Avro decoding allows additional fields on the root-level record while disallowing them on inner records because of a parsing failure. In my current project we have a requirement that we cannot drop any data, which means I need to find a solution somehow.
See this code example https://gist.github.com/GrafBlutwurst/4d5c108b026b34ce83d2569bc8991b3d
(Avro 1.8.2)
Does anyone know if there's a "strict" mode for the AVRO Parser or something similar? This ticket also seems to link somewhat to it https://issues.apache.org/jira/browse/AVRO-2034
Thanks!
EDIT: After more researching it seems there's a PR open to fix this
https://github.com/apache/avro/pull/321 but only for ruby
EDIT II: It most likely is a parser bug. It occurs not only in nested objects but also when the string contains several JSON objects and the first one contains additional fields. There is a drain method that is supposed to pop leftover tokens from the stack, but it doesn't seem to work, because the current parsing position is always 1 (top of the stack) when it is entered. So far I haven't figured out why.

XSD validation for multiple entries in a XML

I am using the SAXON parser for XSD validation in Java. With an XML document containing a single element it works fine, and it also works with multiple elements, but we are unable to identify which element failed and which passed. To be more clear: we have an XSD that validates a simple XML file with a root element and the other elements within it, <person> <employment></employment></person>. The <employment> element is repeatable. I have an XML document like the one below, with errors.
<person>
<employment>correct elements inside</employment>
<employment>wrong elements inside </employment>
</person>
I am performing a XSD validation for above xml. It fails overall due to error in second <employment> entry. But what I need is to identify that first employment passed and second one failed.
How can I achieve this using SAXON?

You haven't said how you are running the validation: From the command line? from the JAXP validation API? From XSLT or XQuery? From the s9api API?
If you are running from the command line then all validation errors are output to System.err with location information about where they were found.
If you are running from an application (via any of the APIs) then errors are notified to an ErrorListener. The default ErrorListener behaves like the command line - it writes details to System.err. If it's a GUI application then you probably won't see this unless you redirect it to some window. Given what you say about your requirements, you would probably be advised to write your own ErrorListener that formats the output in the way you want it. Some of the APIs provide an option to supply a List object into which objects are written representing the validation errors found.
In the next release (9.7) we will have an option to produce all the validation errors in an XML report format.
MORE INFORMATION BASED ON YOUR RESPONSE:
I would recommend using SchemaValidator.setErrorListener() to set your own ErrorListener. The exception passed to the ErrorListener will typically be an instance of net.sf.saxon.type.ValidationException.
If you are validating an in-memory tree, then ValidationException.getNode() on this exception object should give you the node that's invalid, or perhaps the node where the invalidity was detected, which is not quite the same thing.
If you are validating a stream of events, e.g. a SAXSource, then ValidationException.getPath() should give you a path to the node in the form of a string, while ValidationException.getAbsolutePath() should give you a path in structured form.
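As a rough illustration of the "collect the errors yourself" idea, here is a sketch that uses the standard JAXP validation API with the schema processor built into the JDK, not Saxon's own SchemaValidator, so it runs without Saxon on the classpath. The schema (integer-valued <employment> elements), the class name CollectingValidation and the message format are assumptions made for the example. It locates each problem by line number rather than by node or path, which is what the JAXP ErrorHandler offers.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CollectingValidation {

    // Hypothetical schema: <person> holds one or more integer-valued <employment> elements.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='person'><xs:complexType><xs:sequence>"
        + "<xs:element name='employment' type='xs:int' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates the document and returns one entry per reported problem. */
    public static List<String> validate(String xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();

        List<String> problems = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add(describe(e));
            }

            @Override
            public void error(SAXParseException e) {
                // Not rethrowing lets validation continue past this error.
                problems.add(describe(e));
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // not well-formed: nothing sensible to continue with
            }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return problems;
    }

    private static String describe(SAXParseException e) {
        return "line " + e.getLineNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<person>\n<employment>1</employment>\n<employment>abc</employment>\n</person>";
        validate(xml).forEach(System.out::println);
    }
}
```

Any <employment> element whose line does not show up in the returned list can be treated as having passed. With Saxon, the same collecting pattern would apply to the ErrorListener set via SchemaValidator.setErrorListener().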

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing that this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot precisely identify where the ampersand appears in the document every time I read it.
The answers I have seen to similar questions advise replacing the entity when parsing the XML, but I am not sure how I would be able to do that, since the document doesn't even parse once the parser encounters the ampersand.
Any help will be appreciated.

As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the one introducing an entity or character reference) must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at source, or in hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream -- see Working with Filter Streams -- that executes as a filter prior to the actual DOM parser: whenever a & is encountered (that is not part of a character entity) it "fixes it" by inserting &amp; into the stream. Of course, if the XML source didn't get basic encoding correct...
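A minimal sketch of that pre-filter idea follows. For brevity it works on the whole document as a String with a regular expression rather than as a true filter stream, and the class name AmpersandFixer is made up for the example. It treats any & that is not followed by something shaped like an entity or character reference as a bare ampersand.

```java
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.regex.Pattern;

public class AmpersandFixer {

    // An '&' that does NOT start a named entity (&name;) or a
    // character reference (&#38; or &#x26;).
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes bare ampersands and leaves existing references untouched. */
    public static String escapeBareAmpersands(String rawXml) {
        return BARE_AMP.matcher(rawXml).replaceAll("&amp;");
    }

    /** Repairs the input first, then hands it to the regular DOM parser. */
    public static Document parse(String rawXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(escapeBareAmpersands(rawXml))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a>Tom & Jerry &amp; co</a>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This is a heuristic, not a real repair: an & inside a CDATA section would be escaped although it was legal there, and something like &nbsp; is left alone and will still fail unless the entity is declared.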
Happy coding.

"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

Schema validation, how to display user friendly validation messages?

Is there a way to configure schema validation so that it displays more user-friendly messages?
I am parsing the string and using reg ex to interpret them, but there might be a better way.
Ex.
"cvc-complex-type.2.4.b: The content of element 'node' is not complete. One of '{\"\":offer,\"\":links}' is expected."
Instead I want:
"The element 'node' is not complete. The child elements 'offer' and 'links' are expected."
Again, I've solved the problem by creating an extra layer that validates it. But when I have to use an XML tool with schema validation, the cryptic messages are the ones displayed.
Thanks

Not that I know of. You will probably have to create some custom code to adapt your error messages. One way might be to define a set of regular expressions that can pull out the relevant pieces of the validator's error messages and then plug them back into your own error messages. Something like this comes to mind (not optimized, doesn't handle general case, etc. but I think you'll get the idea):
// requires: import java.util.regex.Pattern;
String uglyMessage = "cvc-complex-type.2.4.b: The content of element 'node' is not complete. One of '{\"\":offer,\"\":links}' is expected.";
// Capture the element name ($1) and the two expected child names ($2, $3).
String findRegex = "cvc-complex-type\\.2\\.4\\.b: The content of element '(\\w+)' is not complete\\. One of '\\{\"\":(\\w+),\"\":(\\w+)}' is expected\\.";
String replaceRegex = "The element '$1' is not complete. The child elements '$2' and '$3' are expected.";
String userFriendlyMessage = Pattern.compile(findRegex).matcher(uglyMessage).replaceAll(replaceRegex);
System.out.println(userFriendlyMessage);
// OUTPUT:
// The element 'node' is not complete. The child elements 'offer' and 'links' are expected.
I suspect those validator error messages are vendor-specific so if you don't have control over the XML validator in your deployed app, this may not work for you.
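If you go down this road, the "set of regular expressions" could be kept in one table with a fallback for anything unrecognised. The sketch below is one way to arrange that; the class name FriendlyMessages is made up, and the second rule assumes the wording that Xerces-based validators typically use for cvc-datatype-valid.1.2.1, so check it against what your validator actually emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FriendlyMessages {

    // Validator message pattern -> friendlier template ($n refers to capture groups).
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        RULES.put(
            Pattern.compile("cvc-complex-type\\.2\\.4\\.b: The content of element '([^']+)' is not complete\\..*"),
            "The element '$1' is not complete.");
        // Assumed message wording; adjust to your validator's output.
        RULES.put(
            Pattern.compile("cvc-datatype-valid\\.1\\.2\\.1: '([^']*)' is not a valid value for '([^']+)'\\."),
            "'$1' is not a valid $2.");
    }

    /** Returns the friendly text for a known message, or the raw message otherwise. */
    public static String translate(String rawMessage) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            Matcher m = rule.getKey().matcher(rawMessage);
            if (m.matches()) {
                return m.replaceAll(rule.getValue());
            }
        }
        return rawMessage; // unknown message: show it unchanged rather than hide it
    }

    public static void main(String[] args) {
        System.out.println(translate(
            "cvc-complex-type.2.4.b: The content of element 'node' is not complete. "
            + "One of '{\"\":offer,\"\":links}' is expected."));
    }
}
```

Falling back to the raw message means a validator upgrade that changes the wording degrades to the old cryptic text instead of an empty or wrong message.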

We're using Schematron to display user-friendly error messages when the XML a user sends us is wrong. Our current implementation is a bit simplistic, notably in the following points:
Error message text is hardcoded into the Schematron rules
For each new XML type (i.e. new XSD schema) Schematron rules have to be added manually
This, however, can easily be fixed by the following rework:
Schematron rules should contain unique error message codes, while the actual message text selection (including I18n issues) should be done outside the validation framework
Basic rules can be generated from the XSD schema using an XSD to Schematron converter (available at http://www.schematron.com/resources.html)
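The "codes in the rules, text elsewhere" split could look roughly like the sketch below. Everything in it is hypothetical: the code ERR_MISSING_CHILD, the class name MessageCatalog and the in-memory map, which stands in for whatever resource bundles or database a real application would use.

```java
import java.util.Locale;
import java.util.Map;

public class MessageCatalog {

    // language -> (error code -> message template); stand-in for real resource bundles.
    private static final Map<String, Map<String, String>> TEXTS = Map.of(
        "en", Map.of("ERR_MISSING_CHILD", "The element '%s' is missing."),
        "de", Map.of("ERR_MISSING_CHILD", "Das Element '%s' fehlt."));

    /** Resolves a code emitted by a validation rule into text for the given locale. */
    public static String resolve(String code, Locale locale, Object... args) {
        Map<String, String> byCode = TEXTS.getOrDefault(locale.getLanguage(), TEXTS.get("en"));
        String template = byCode.get(code);
        // Unknown code: return the code itself so the problem stays visible.
        return template == null ? code : String.format(template, args);
    }

    public static void main(String[] args) {
        System.out.println(resolve("ERR_MISSING_CHILD", Locale.ENGLISH, "offer"));
        System.out.println(resolve("ERR_MISSING_CHILD", Locale.GERMAN, "offer"));
    }
}
```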

I asked a similar question a while ago.
My conclusion was there is no provided way of mapping the errors, and that it's something you need to do yourself.
Hope someone out there can do better!

What does the org.apache.xmlbeans.XmlException with a message of "Unexpected element: CDATA" mean?

I'm trying to parse and load an XML document, however I'm getting this exception when I call the parse method on the class that extends XmlObject. Unfortunately, it gives me no ideas of what element is unexpected, which is my problem.
I am not able to share the code for this, but I can try to provide more information if necessary.

Since you cannot share code or input data, you may consider the following approach. It is a very common dichotomic approach to diagnostics, I'm afraid, and indeed you may already have started it...
Try to reduce the size of the input XML by removing parts of it, ensuring that the underlying XML document remains well formed and possibly valid (if validity is required in your parser's setup). If you maintain validity, this may require altering [a copy of] the Schema (DTD or other), as mandatory elements might be removed during the cut-and-try approach... BTW, the error message seems to hint more at a validation issue than at a basic well-formedness issue.
Unless one has a particular hunch as to the area that triggers the parser's complaint, we typically remove (or re-add, when things start working) about half of what was previously cut or re-added.
You may also start by trying a mostly empty file, to check that the parser works at all... There again is the idea of "divide to prevail": is the issue in the XML input or in the parser? (Remember that there could be two issues, one in the input and one in the parser, and that such issues could even be unrelated...)
Sorry to belabor basic diagnostics techniques which you may well be fluent with...

You should check the arguments you are passing to the parse() method.
Check whether you are passing a String, a File, or an InputStream, and that the argument matches the parse overload you are calling.

The exception is caused by the length of the XML file. If you add or remove one character from the file, the parser will succeed.
The problem occurs within the 3rd party PiccoloLexer library that XMLBeans relies on. It has been fixed in revision 959082 but has not been applied to xbean 2.5 jar.
XMLBeans - Problem with XML files if length is exactly 8193bytes
Issue reported on XMLBean Jira
