This is tangentially related to the question "avro json additional field".
The issue I have is that JSON Avro decoding allows additional fields on the root-level record while rejecting them on inner records with a parsing failure. In the current project I work on, we have a requirement that we cannot drop any data, which means I need to find a solution somehow.
See this code example https://gist.github.com/GrafBlutwurst/4d5c108b026b34ce83d2569bc8991b3d
(Avro 1.8.2)
Does anyone know if there's a "strict" mode for the Avro parser or something similar? This ticket also seems somewhat related: https://issues.apache.org/jira/browse/AVRO-2034
Thanks!
EDIT: After more research, it seems there's an open PR to fix this,
https://github.com/apache/avro/pull/321, but only for Ruby.
EDIT II: It most likely is a parser bug. It's not only nested objects; it's also an issue if the string contains several JSON objects and the first one contains additional fields. There's a drain method that is supposed to pop leftover tokens from the stack, but it doesn't seem to work, as the current parsing position is always 1 (top of the stack) when it's entered. As of yet I haven't figured out why.
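For anyone who wants to reproduce this without the gist, below is a minimal sketch of the decoding path involved (the schema and field names here are made up, not the ones from the gist). On 1.8.2 the root-level extra field is tolerated, while the same kind of extra field inside the nested record is where the decoder fails.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class ExtraFieldRepro {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a root record containing one nested record.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Outer\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"inner\",\"type\":{\"type\":\"record\",\"name\":\"Inner\","
            + "\"fields\":[{\"name\":\"value\",\"type\":\"int\"}]}}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        // Extra field at the root level: decodes, the unknown field is simply dropped.
        String rootExtra = "{\"id\":\"a\",\"extra\":1,\"inner\":{\"value\":42}}";
        System.out.println(reader.read(null, DecoderFactory.get().jsonDecoder(schema, rootExtra)));

        // Extra field inside the nested record: this is the decode that fails on 1.8.2.
        String innerExtra = "{\"id\":\"a\",\"inner\":{\"value\":42,\"extra\":1}}";
        System.out.println(reader.read(null, DecoderFactory.get().jsonDecoder(schema, innerExtra)));
    }
}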
Related
I have an application that needs to process XML files in the following format:
<records>
<record/>
<record/>
<record/>
...
</records>
I am using JAXB to parse these files. However, I am trying to prepare my application for the inevitable case where it is unable to parse one of the records due to some invalid data (for example, a character where an int should be).
The problem is that if JAXB is unable to parse an individual record, it halts processing of the entire file. This is not good - I need it to skip only the problematic record, report it, and move on. However, I can't find any way to do this. The only thing I've found is the ValidationEventHandler, which lets me return true to tell JAXB to continue processing the file in the event of an error, but the problem is that it doesn't actually SKIP the problematic record - it still tries to parse it even though it's known to be invalid, which causes a NumberFormatException and halts processing.
I found this answer, How to skip a single jaxb element validation contained in a jaxb collection in Spring Batch Job?, but it doesn't actually answer the question; it just suggests using ValidationEventHandler even though that functionality is not sufficient.
How can I skip the invalid records and continue processing the rest of the file?
Typically I wouldn't use JAXB if I knew that the input data will likely contain errors and I need to recover gracefully... StAX might be better suited. But JAXB does have a "catch all" you can use: https://docs.oracle.com/javase/7/docs/api/javax/xml/bind/annotation/XmlAnyElement.html
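For what it's worth, here is a rough sketch of that catch-all idea (the class names are hypothetical, not from your code): bind the children of <records> as raw DOM elements first, then convert each one to your real record class individually, so a single bad record only costs you that record.

import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAnyElement;
import javax.xml.bind.annotation.XmlRootElement;
import org.w3c.dom.Element;

// Hypothetical root class: every <record> arrives as a DOM Element, so binding never fails here.
@XmlRootElement(name = "records")
@XmlAccessorType(XmlAccessType.FIELD)
public class Records {

    @XmlAnyElement
    private List<Element> records = new ArrayList<Element>();

    public List<Element> getRecords() {
        return records;
    }
}

After unmarshalling the file into Records once, each Element can then be passed to Unmarshaller.unmarshal(element, Record.class) inside a try/catch, so the records that throw (e.g. a NumberFormatException) can be reported and skipped while the rest are processed normally.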
I'm currently working on a MapReduce job processing XML data, and I think there's something about the data flow in Hadoop that I'm not getting.
I'm running on Amazon's ElasticMapReduce service.
Input data: large files (significantly above 64 MB, so they should be splittable), consisting of a lot of small XML files that were concatenated into one by a previous s3distcp operation.
I am using a slightly modified version of Mahout's XmlInputFormat to extract the individual XML snippets from the input.
As a next step I'd like to parse those XML snippets into business objects, which should then be passed to the mapper.
Now here is where I think I'm missing something: in order for that to work, my business objects need to implement the Writable interface, defining how to read/write an instance from/to a DataInput or DataOutput.
However, I don't see where this comes into play - the logic needed to read an instance of my object is already in the InputFormat's record reader, so why does the object have to be capable of reading/writing itself?
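For reference, the contract in question is only two methods; a minimal sketch for a hypothetical business object (the class and fields are made up) looks like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical business object built from one XML snippet.
public class ProductRecord implements Writable {
    private String id;
    private long timestamp;

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order...
        out.writeUTF(id);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and read them back in exactly the same order.
        id = in.readUTF();
        timestamp = in.readLong();
    }
}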
I have done quite some research already and I know (or rather assume) that WritableSerialization is used when transferring data between nodes in the cluster, but I'd like to understand the reasons behind that architecture.
The InputSplits are defined upon job submission - so if the name node sees that data needs to be moved to a specific node for a map task to work, would it not be sufficient to simply send the raw data as a byte stream? Why do we need to decode it into Writables if the RecordReader of our input format does the same thing anyway?
I really hope someone can show me the error in my thoughts above, many thanks in advance!
I've been using j8583 to parse and construct ISO 8583 messages in Java. All seemed well until one of the messages had field 128 in it. That field is always missing when I construct or parse a message that has bit 128, but the other bits (2...127) are fine.
I've double-checked the XML configuration, and nothing is wrong there.
Is it just me, or is there actually a bug in j8583? Does anybody know how to solve this? I'm on a really tight schedule, so changing libraries for ISO 8583 is very unlikely.
I'm the author of j8583. I just reviewed the code and there is indeed a problem with MessageFactory.newMessage() where it won't assign field 128 to new messages. I just committed the change, so you can get the latest source from the repository and your new messages will include field 128.
I also reviewed the parsing code and I couldn't find anything wrong there. If you parse a message with field 128 and it's in your parsing guide, the message should contain it.
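If it helps to double-check the parsing side, something along these lines should be enough (the config path, header length, and input helper are placeholders, and the exact factory/parse signatures can vary a bit between j8583 versions):

import com.solab.iso8583.IsoMessage;
import com.solab.iso8583.MessageFactory;
import com.solab.iso8583.parse.ConfigParser;

public class Field128Check {
    public static void main(String[] args) throws Exception {
        // "j8583-config.xml" is a placeholder for your own configuration file;
        // field 128 must be listed in the parsing guide for this message type.
        MessageFactory mf = ConfigParser.createFromClasspathConfig("j8583-config.xml");

        byte[] raw = readMessageFromSomewhere();   // hypothetical helper
        IsoMessage msg = mf.parseMessage(raw, 0);  // 0 = no ISO header in this example

        System.out.println("has field 128: " + msg.hasField(128));
        if (msg.hasField(128)) {
            System.out.println("value: " + msg.getObjectValue(128));
        }
    }

    static byte[] readMessageFromSomewhere() { return new byte[0]; }
}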
However, I've encountered certain ISO8583 implementations in which a message has the 128 field set in the bitmap but it's really not in the message. In these cases j8583 can't parse the message because there's missing data. I'm still trying to figure out how to handle this.
When you find any bugs in j8583, please post them on the project page, so I get notified and can fix them. I don't usually look for j8583-tagged questions on this site (but I should probably start doing so).
Is there a way to avoid these messages, or to set up the schema (or validator) so that it displays better, more user-friendly messages?
I am currently parsing the message strings and using regexes to interpret them, but there might be a better way.
Ex.
"cvc-complex-type.2.4.b: The content of element 'node' is not complete. One of '{\"\":offer,\"\":links}' is expected."
Instead I want:
"The element 'node' is not complete. The child elements 'offer' and 'links' are expected."
Again, I've solved the problem by creating an extra layer that validates it. But when I have to use an XML tool with schema validation, the cryptic messages are the ones displayed.
Thanks
Not that I know of. You will probably have to create some custom code to adapt the error messages. One way might be to define a set of regular expressions that pull out the relevant pieces of the validator's error messages and then plug them back into your own error messages. Something like this comes to mind (not optimized, doesn't handle the general case, etc., but I think you'll get the idea):
import java.util.regex.Pattern;

String uglyMessage = "cvc-complex-type.2.4.b: The content of element 'node' is not complete. One of '{\"\":offer,\"\":links}' is expected.";

// Capture the element name and the two expected child names, then substitute them into a friendlier template.
String findRegex = "cvc-complex-type\\.2\\.4\\.b: The content of element '(\\w+)' is not complete\\. One of '\\{\"\":(\\w+),\"\":(\\w+)}' is expected\\.";
String replaceRegex = "The element '$1' is not complete. The child elements '$2' and '$3' are expected.";

String userFriendlyMessage = Pattern.compile(findRegex).matcher(uglyMessage).replaceAll(replaceRegex);
System.out.println(userFriendlyMessage);
// OUTPUT:
// The element 'node' is not complete. The child elements 'offer' and 'links' are expected.
I suspect those validator error messages are vendor-specific so if you don't have control over the XML validator in your deployed app, this may not work for you.
We're using Schematron to display user-friendly error messages when the XML sent to us is wrong. Our current implementation is a bit simplistic, notably in the following points:
Error message text is hardcoded into the Schematron rules
For each new XML type (i.e. new XSD schema), Schematron rules have to be added manually
This, however, can easily be fixed with the following rework:
Schematron rules should contain unique error message codes, while the actual message text selection (including I18n) should be done outside the validation framework (see the sketch below)
Basic rules can be generated from the XSD schema using the XSD-to-Schematron converter (available at http://www.schematron.com/resources.html)
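For completeness, the runtime side of this is plain JAXP: the Schematron rules get compiled to an XSLT (by the converter above or the ISO Schematron skeleton), and validating a document is just a transform whose SVRL output lists the failed assertions, i.e. the message codes to translate. A rough sketch, with all file names made up:

import java.io.File;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SchematronCheck {
    public static void main(String[] args) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();

        // "rules-compiled.xsl" is the XSLT produced from the Schematron rules.
        Transformer validator = tf.newTransformer(new StreamSource(new File("rules-compiled.xsl")));

        // Running the transform over the instance document yields an SVRL report.
        StringWriter svrl = new StringWriter();
        validator.transform(new StreamSource(new File("input.xml")), new StreamResult(svrl));

        // The SVRL report lists the failed asserts; map their message codes to
        // localized, user-friendly text outside the validation framework.
        System.out.println(svrl);
    }
}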
I asked a similar question a while ago.
My conclusion was there is no provided way of mapping the errors, and that it's something you need to do yourself.
Hope someone out there can do better!
I'm trying to parse and load an XML document, but I'm getting this exception when I call the parse method on the class that extends XmlObject. Unfortunately, it gives me no idea of which element is unexpected, which is my problem.
I am not able to share the code for this, but I can try to provide more information if necessary.
Since you can't share code or input data, you may consider the following approach. It's a very common dichotomic approach to diagnosis, I'm afraid, and you may well have started down this path already...
Try to reduce the size of the input XML by removing parts of it, ensuring that the underlying XML document remains well formed and possibly valid (if validity is required in your parser's setup). If you maintain validity, this may require altering [a copy of] the schema (DTD or other), as mandatory elements might be removed during the cut-and-try approach... BTW, the error message seems to hint more at a validation issue than at a basic well-formedness issue.
Unless one has a particular hunch as to the area that triggers the parser's complaint, we typically remove (or re-add, when things start working) about half of what was previously cut or re-added.
You may also start by trying a mostly empty file, to assert that the parser works at all... There again the idea is to "divide and conquer": is the issue in the XML input or in the parser? (Remembering that there could be two issues, one in the input and one in the parser, and that such issues could even be unrelated...)
Sorry to belabor basic diagnostics techniques which you may well be fluent with...
You should check the arguments you are passing to the parse() method:
make sure the value you pass - a String, a File, or an InputStream - matches the overload you are calling (File/InputStream/String), etc.
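Also, XMLBeans can usually be made to say exactly which element it is unhappy about: an XmlException carries a collection of XmlError objects with locations, and validate() can collect errors into a listener. A sketch of the idea (the file name is a placeholder, and with generated types you would call YourDocument.Factory.parse instead of the generic XmlObject.Factory.parse):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.xmlbeans.XmlError;
import org.apache.xmlbeans.XmlException;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlOptions;

public class ParseDiagnostics {
    public static void main(String[] args) {
        try {
            // Generic parse; substitute your generated document type's Factory here.
            XmlObject doc = XmlObject.Factory.parse(new File("input.xml"));

            // Collect validation errors (messages plus locations) instead of a bare boolean.
            List<XmlError> errors = new ArrayList<XmlError>();
            XmlOptions opts = new XmlOptions();
            opts.setErrorListener(errors);
            if (!doc.validate(opts)) {
                for (XmlError e : errors) {
                    System.err.println(e);   // message, offending element, line number
                }
            }
        } catch (XmlException e) {
            // Parse-time failures also carry a collection of XmlError objects.
            System.err.println(e.getErrors());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}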
The exception is caused by the length of the XML file. If you add or remove one character from the file, the parser will succeed.
The problem occurs within the third-party PiccoloLexer library that XMLBeans relies on. It has been fixed in revision 959082, but the fix has not been applied to the xbean 2.5 jar.
XMLBeans - Problem with XML files if length is exactly 8193 bytes
Issue reported on the XMLBeans Jira