Avro schemas not compatible if field order changes - java

Scenario -
A client serializes a POJO using Avro's ReflectDatumWriter and writes the GenericRecord to a file.
The schema obtained through reflection is something like this (Note the ordering A, B, D, C) -
{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
....
....
{ "name": "A", "type": "string" },
{ "name": "B", "type": "string" },
{ "name": "D", "type": "string" },
{ "name": "C", "type": "string" },
....
....
]
}
An agent reads off the file and uses a default schema (note the ordering: A, B, C, D) to deserialize a subset of the record (the client is guaranteed to have these fields):
{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
{ "name": "A", "type": "string" },
{ "name": "B", "type": "string" },
{ "name": "C", "type": "string" },
{ "name": "D", "type": "string" }
]
}
The problem:
De-serialization with the above subset schema results in the following exception -
Caused by: java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:259)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
However, de-serialization succeeds if the subset schema also specifies the fields in the order A, B, D, C (the same as the client schema).
Is this behavior expected? I thought Avro only depends on field names to build the record, not on their ordering.
Are there any fixes for this? Different clients may have different field orders, and I have no way to enforce ordering because the schema is generated through reflection.

This is not necessarily expected behavior. You might be making the same mistake I made when I began using Avro.
Avro is able to have different versions of schemas (e.g., write with one but read into another) but one thing very easily missed (at least by myself) is that you must have the exact schema that wrote the message when trying to read it.
The documentation and information you read about Avro, at least at the surface level, don't make that very clear. Usually they focus on it being "backwards compatible." To be fair, it is in a sense, but when people see that phrase they usually take it to mean something a little different: that you can work with old messages using only a new schema, when in fact you need both the new schema and the schema that wrote the old messages.
As an example, see this pseudocode
Schema myUnsortedSchema has C B A order
Schema myAlphabeticalSchema has A B C order
Writer writer uses myUnsortedSchema
Reader badReader uses myAlphabeticalSchema only
writer writes message
badReader reads message
Error! I'm not sure what the error message will say exactly, but the problem is that badReader not only tries to read into myAlphabeticalSchema but also reads the message as if it were written by myAlphabeticalSchema. The solution is that there is a way to give it both schemas: the one that wrote the message and the one to read into (how depends on the language).
Reader goodReader reads messages written with myUnsortedSchema into myAlphabeticalSchema
goodReader reads message
No error! This is the correct usage.
If you are using an approach like goodReader then this behavior is unexpected, but if you are using an approach like badReader then the behavior is expected.
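In Java, the goodReader approach looks roughly like this; a minimal sketch assuming the data was written to an Avro data file as in the question (file names are made up). The data file embeds the writer schema in its header, so DataFileReader can supply it and you only pass the reader ("subset") schema:
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class GoodReader {
    public static void main(String[] args) throws Exception {
        // The agent's "subset" reader schema with fields A, B, C, D in alphabetical order.
        Schema readerSchema = new Schema.Parser().parse(new File("reader-schema.avsc"));

        // An Avro data file embeds the writer schema in its header, so passing the
        // reader schema is enough here; DataFileReader supplies the writer schema
        // (the reflected A, B, D, C one) and fields are then resolved by name.
        // If you were decoding raw bytes instead, you would pass both explicitly:
        //   new GenericDatumReader<>(writerSchema, readerSchema)
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readerSchema);

        try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File("events.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record.get("A") + " / " + record.get("C"));
            }
        }
    }
}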
Some services like Schema Registry help with this by prepending some metadata to the message bytes to identify which schema wrote the message (and stripping it off before reading, of course). It's out of the scope of the question, but it can help solve problems like this.

The ordering of fields may be different: fields are matched by name.
https://avro.apache.org/docs/1.8.1/spec.html
(Note: in your first schema there are other fields as well, which you haven't shown.)

Is this behavior expected?
The documentation says that "A record is encoded by encoding the values of its fields in the order that they are declared."
So I think it's the correct behavior: the binary encoding carries no field names, so the reader needs to know the writer's declaration order to decode the bytes, which is why reading with only a reordered schema fails.

Related

Document.parse() constructor not working for nested json array

I have an extended JSON string:
{"_id": {"oid": "59a47286cfa9a3a73e51e72c"}, "theaterId": {"numberInt": "101100"}, "location": {"address": {"street1": "340 XDW Market", "city": "Bloomington", "state": "MN", "zipcode": "12427"}, "geo": {"type": "Point", "coordinates": [{"$numberDouble": "-193.24565"}, {"$numberDouble": "144.85466"}]}}}
I am trying to convert the above JSON string to a Document in order to insert it into MongoDB. For this I am using the org.bson.Document.parse(json_string) constructor.
But the document I get after parsing doesn't preserve the datatype inside the geo.coordinates array (check the document below), while it does preserve the datatype of theaterId.
{
"_id": {
"oid": "59a47286cfa9a3a73e51e72c"
},
"theaterId": {
"numberInt": "101100"
},
"location": {
"address": {
"street1": "340 XDW Market",
"city": "Bloomington",
"state": "MN",
"zipcode": "12427"
},
"geo": {
"type": "Point",
"coordinates": [-193.24565, 144.85466]
}
}
}
Is this a potential issue in the Document.parse() API?
Your fields in geo.coordinates start with a dollar sign $. In theaterId you have numberInt, while in coordinates you have $numberDouble.
Check the docs and this question for how to handle it, depending on what you need. Considering that numberInt seems to satisfy your needs, you might just need to remove the dollar signs from the field names.
Edit: After digging somewhat deeper into those docs (including the one you provided), {"numberInt": "101100"} is not extended JSON with a datatype; it's just a normal JSON object with a property and a value for that property. It would need to be {"$numberInt": "101100"} to be extended JSON. On the other hand, {"$numberDouble": "-193.24565"} is extended. The datatype is not lost; it is parsed into a List<Double>, and since we know each element is of type Double, the datatype can be reconstructed.
If you take a look at Document.toJson(), under the hood it works with the RELAXED output mode, which will output coordinates as you are seeing them: [-193.24565, 144.85466]. If you provide the EXTENDED output mode, for example like this:
JsonWriterSettings settings = JsonWriterSettings.builder().outputMode(JsonMode.EXTENDED).build();
System.out.println(document.toJson(settings));
then the datatype will be reconstructed from the Java type, and coordinates will look like this:
[{"$numberDouble": "-193.24565"}, {"$numberDouble": "144.85466"}]
In conclusion, there is no problem with Document.parse("json"), but there might be a problem with the json you are supplying to it.
Edit 2:
As shown in the example, the datatypes can be reconstructed from the Java types. I am not familiar with how collection.insertOne(Document.parse(json_string)) works under the hood, but if you don't explicitly specify the mode, it might be using RELAXED by default instead of EXTENDED. The docs state that "This format prioritizes type preservation at the loss of human-readability and interoperability with older formats", so that would make sense. But this is just a guess on my part; you would need to dig into the docs to make sure.
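To see both output modes side by side, here is a minimal self-contained sketch (the input is a trimmed-down version of the document above, using the $-prefixed extended JSON keys):
import org.bson.Document;
import org.bson.json.JsonMode;
import org.bson.json.JsonWriterSettings;

public class ParseRoundTrip {
    public static void main(String[] args) {
        // Proper extended JSON: type wrappers use $-prefixed keys.
        String json = "{\"geo\": {\"type\": \"Point\", \"coordinates\": "
                + "[{\"$numberDouble\": \"-193.24565\"}, {\"$numberDouble\": \"144.85466\"}]}}";

        Document document = Document.parse(json);

        // Default toJson(): plain JSON numbers (relaxed-style output in recent driver versions).
        System.out.println(document.toJson());

        // EXTENDED output reconstructs the type wrappers from the Java types.
        JsonWriterSettings extended = JsonWriterSettings.builder()
                .outputMode(JsonMode.EXTENDED)
                .build();
        System.out.println(document.toJson(extended));
    }
}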

How to add an enum value to an AVRO schema in a FULL compatible way?

I have an enum in an AVRO schema like this:
{
"type": "record",
"name": "MySchema",
"namespace": "com.company",
"fields": [
{
"name": "color",
"type": {
"type": "enum",
"name": "Color",
"symbols": [
"UNKNOWN",
"GREEN",
"RED"
]
},
"default": "UNKNOWN"
}
]
}
When using the FULL compatibility mode (which means BACKWARD and FORWARD), how am I supposed to add a new symbol to the enum? Is this impossible?
I read Avro schema : is adding an enum value to existing schema backward compatible? but it doesn't help.
Whenever I try to add a new value to the symbols, it fails the compatibility check in the schema registry, even though I have a default value on the enum. After testing a bit, it seems that adding a new value is BACKWARD compatible but not FORWARD compatible. However, due to the default value I set, I expected it to also be FORWARD compatible. Indeed, the old reader schema should be able to read a value written by the new schema and default to the "UNKNOWN" enum value when it doesn't know the new symbol.
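For reference, this is roughly how the two directions can be checked locally with Avro's SchemaCompatibility helper; a sketch using stand-alone enum schemas instead of the full record, with BLUE standing in for the new symbol (the result depends on the Avro version in use):
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class EnumCompatCheck {
    public static void main(String[] args) {
        // Old schema without the new symbol; new schema with BLUE added.
        Schema oldSchema = new Schema.Parser().parse(
                "{\"type\":\"enum\",\"name\":\"Color\","
                + "\"symbols\":[\"UNKNOWN\",\"GREEN\",\"RED\"],\"default\":\"UNKNOWN\"}");
        Schema newSchema = new Schema.Parser().parse(
                "{\"type\":\"enum\",\"name\":\"Color\","
                + "\"symbols\":[\"UNKNOWN\",\"GREEN\",\"RED\",\"BLUE\"],\"default\":\"UNKNOWN\"}");

        // BACKWARD: can the new schema read data written with the old one?
        System.out.println(SchemaCompatibility
                .checkReaderWriterCompatibility(newSchema, oldSchema).getType());

        // FORWARD: can the old schema read data written with the new one?
        System.out.println(SchemaCompatibility
                .checkReaderWriterCompatibility(oldSchema, newSchema).getType());
    }
}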
It appears there is currently a bug in Avro which affects versions 1.9.0, 1.9.1, 1.9.2, 1.10.0, 1.10.1, 1.10.2, 1.11.0 and later until it is fixed.
The bug is in Avro's handling of the enum default value.
According to the documentation, on the reader side with an old schema we should be able to deserialize a payload containing an enum value that was generated by the writer side using the new schema. Since the value is unknown to the reader, it should be deserialized as the default value:
A default value for this enumeration, used during resolution when the reader encounters a symbol from the writer that isn't defined in the reader's schema
However, that's not what happens, and the deserializer on the reader side fails with the exception org.apache.avro.AvroTypeException: No match for C.
I have reported the bug here and pushed a reproduction test here.
Hope it attracts some attention from the maintainers :)
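For reference, the failing resolution can be sketched with plain Avro like this (BLUE stands in for the new symbol; on affected versions the read throws instead of returning the default, while per the spec it should print UNKNOWN):
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EnumDefaultResolution {
    public static void main(String[] args) throws IOException {
        // Writer (new) schema knows the extra symbol BLUE; reader (old) schema
        // does not, but declares "UNKNOWN" as the enum default.
        Schema writer = new Schema.Parser().parse(
                "{\"type\":\"enum\",\"name\":\"Color\","
                + "\"symbols\":[\"UNKNOWN\",\"GREEN\",\"RED\",\"BLUE\"],\"default\":\"UNKNOWN\"}");
        Schema reader = new Schema.Parser().parse(
                "{\"type\":\"enum\",\"name\":\"Color\","
                + "\"symbols\":[\"UNKNOWN\",\"GREEN\",\"RED\"],\"default\":\"UNKNOWN\"}");

        // Serialize the new symbol with the writer schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<Object>(writer)
                .write(new GenericData.EnumSymbol(writer, "BLUE"), encoder);
        encoder.flush();

        // Deserialize with the old schema. Per the spec this should resolve to
        // "UNKNOWN", but affected versions throw AvroTypeException: No match for BLUE.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        Object value = new GenericDatumReader<Object>(writer, reader).read(null, decoder);
        System.out.println(value);
    }
}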
We can use the enum-level default to achieve this, by moving "default" inside the type definition. Hope this helps.
{
"type": "record",
"name": "MySchema",
"namespace": "com.company",
"fields": [
{
"name": "color",
"type": {
"type": "enum",
"name": "Color",
"symbols": [
"UNKNOWN",
"GREEN",
"RED"
],
"default": "UNKNOWN"
}
}
]
}
Adding a new symbol to an enum is not FULL compatible, not even FORWARD compatible.
See https://github.com/confluentinc/schema-registry/issues/880

JSON Schema - Enum of Objects

I'm new to JSON Schema, so bear with me. My goal is to have a JSON property that is an object. Its keys relate to each other, meaning multiple keys always have the same values together. This will probably help make it clear; it's my attempt to do this with an enum:
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "Part",
"type": "object",
"properties": {
"relationship": {
"type": "object",
"enum": [
{
"code": "1",
"value": "MEMBER"
},
{
"code": "2",
"value": "SPOUSE"
},
{
"code": "3",
"value": "CHILD"
},
{
"code": "4",
"value": "STUDENT"
},
{
"code": "5",
"value": "DISABILITY_DEPENDENT"
},
{
"code": "6",
"value": "ADULT_DEPENDENT"
},
{
"code": "8",
"value": "DOMESTIC_PARTNER"
}
]
}
}
}
So using an enum like this works, even though I can't find it anywhere in the JSON Schema spec. However, the error message sucks. Normally I get extremely detailed error messages from schema validation; in this case I do not.
$.part.relationship: does not have a value in the enumeration [, , , , , , ]
I'm not sure what I'm doing wrong. I'm using a Java parser for JSON Schema:
<dependency>
<groupId>com.networknt</groupId>
<artifactId>json-schema-validator</artifactId>
<version>1.0.53</version>
</dependency>
Not sure if the error message is the fault of the parser or something I'm doing wrong with the schema. Help would be appreciated.
It was news to me, but according to the spec it does seem that objects are valid enum values. That said, your usage is quite unusual. I've not seen it used before.
the six primitive types ("null", "boolean", "object", "array", "number", or "string")
...
6.1.2. enum
...
Elements in the array might be of any type, including null.
Your problem is fundamentally that the library that you're using doesn't know how to convert those objects to printable strings. Even if it did give it a reasonable go, you might end up with
does not have a value in the enumeration [{"code": "1", "value":"MEMBER"}, {"code": "2" ...
which might be okay, but it's hardly amazing. If the code and value were both valid but didn't match, you might have to look quite closely at the list before you ever saw the problem.
JSON Schema in general is not very good at enforcing constraints between what it considers to be two unrelated fields. That's beyond the scope of what it aims to do. It's trying to validate the structure. Dependencies between fields are business constraints, not structural ones.
I think the best thing you could do to achieve readable error messages would be to have two sub-properties, each with its own enumeration: one for the codes, one for the values.
Then you'll get
$.part.relationship.code does not have a value in the enumeration [1,2,3,4 ...
or
$.part.relationship.value does not have a value in the enumeration ["MEMBER", "SPOUSE", ...
You can do some additional business validation on top of the schema validation if enforcing that constraint is important to you. Then generate your own error such as
code "1" does not match value "SPOUSE"
If code and value always have the same values relative to each other, why encode both in the JSON? Just encode a single value in the JSON and infer the other in the application.
This will be much easier to validate.

How to specify that a JSON instance is defined by a specific JSON Schema

My question is: how can I know which JSON Schema to use to validate a particular JSON document? I specified the URL of the schema in its id field. Is this enough? Should I put the id in the JSON? I am not sure I understand how to connect a JSON document to a specific JSON Schema.
Here is my schema
{
"$schema": "http://json-schema.org/draft-04/schema#",
"id": "url/schema.json",
"title": "title",
"definitions": {
"emailObject": {
"type": "object",
"properties":{
"name": {
"description": "The name of the customer",
"type": "string",
"maxLength": 200
},
"email": {
"description": "The email of the customer",
"type": "string",
"format": "email",
"maxLength": 100
}
}
}
}
}
To add to and clarify Tom's answer, here is an example of how you can link a JSON document to a JSON Schema. There is no standard way of doing this outside the context of an HTTP response. If that is something you need, you will have to come up with your own strategy.
GET /data/my-data HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
Link: </schema/my-schema> rel=describedby
{ "name": "Fake Fakerson", "email": "fake#fakerson.com" }
GET /schema/my-schema HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/schema+json
{
"$schema": "http://json-schema.org/draft-04/schema#",
"id": "url/schema.json",
"title": "title",
"definitions": {
"emailObject": {
"type": "object",
"properties":{
"name": {
"description": "The name of the customer",
"type": "string",
"maxLength": 200
},
"email": {
"description": "The email of the customer",
"type": "string",
"format": "email",
"maxLength": 100
}
}
}
}
}
According to 10.1 of the specification:
It is RECOMMENDED that instances described by a schema/profile provide
a link to a downloadable JSON Schema using the link relation
"describedby", as defined by Linked Data Protocol 1.0, section 8.1
[W3C.REC-ldp-20150226]. (emphasis mine)
This would appear to describe exactly the behaviour you require; however, a casual perusal of the Linked Data Protocol section 8.1 leaves us none the wiser:
The relationship A describedby B asserts that resource B provides a
description of resource A. There are no constraints on the format or
representation of either A or B, neither are there any further
constraints on either resource (emphasis mine)
After a quick Google search, I found this question, which at first glance your question would appear to duplicate. However, upon deeper inspection, that question is actually about inheritance within schemas, not the referencing of a schema from its supported instances.
One of the answers, rather intriguingly, provides a solution which draws on the JSON-Hyper-schema standard - an attempt to extend the JSON-schema standard to support the definition of application-level semantics.
The way it achieves this is by use of the links collection:
{
...
"links":[
{
"rel":"describedby",
"href":"{+fileType}"
}
]
}
It turns out that this is based on another standard, RFC 5988 (Web Linking), which happens to be the same standard that allows us to load CSS into HTML pages.
As @Jason points out in his comment:
Your first quote, the one from the spec, is the right way to do it.
The linked data definition of describedby does not contradict the JSON
Schema spec. It's a purposefully broad definition so it can be applied
to any media type that describes data. That includes JSON Schema, XML
Schema, or anything else.
So it would appear that including a links collection in your schema instance would be the correct way to reference the schema. In your specific case, you could do this:
{
...
"links":[
{
"rel":"describedby",
"href":"url/schema.json" // I assume!!
}
]
}
Even though this may be correct, I don't know how many JSON parsers will respect this when resolving to an actual schema via the link.
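If the instance is served over HTTP as in the example above, you can also read the Link header yourself instead of relying on the parser; a small sketch with java.net.http (JDK 11+; the URL is made up):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaLinkLookup {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.com/data/my-data")).build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // e.g. Link: </schema/my-schema>; rel=describedby
        String link = response.headers()
                .firstValue("Link")
                .orElseThrow(() -> new IllegalStateException("no Link header"));

        // Naive extraction of the target between '<' and '>'; a real client
        // should parse the header per RFC 8288 and check rel=describedby.
        String schemaUrl = link.substring(link.indexOf('<') + 1, link.indexOf('>'));
        System.out.println("Schema advertised at: " + schemaUrl);

        System.out.println(response.body()); // the JSON instance itself
    }
}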

Could avro's logical types be used to validate input data?

I'm trying to understand how Avro's logical types are supposed to be used.
First, let me give an example of what I'm trying to achieve: I want to write a new logical type (RegExLogicalType) that validates an input string and either accepts it or raises some exception.
Or, to speak about one of Avro's existing supported logical types (decimal), I was expecting to use it in this way:
If an invalid decimal logical type is specified, an exception must be raised; something like when a mandatory field was expected but nothing has been provided: org.apache.avro.AvroRuntimeException: Field test_decimal type:BYTES pos:2 not set and has no default value
If a valid decimal logical type is specified, no exception should be raised.
What I have found in the documentation only speaks about reading/de-serialization, and I don't know what happens for writing/serialization:
Language implementations must ignore unknown logical types when
reading, and should use the underlying Avro type. If a logical type is
invalid, for example a decimal with scale greater than its precision,
then implementations should ignore the logical type and use the
underlying Avro type.
I don't want the above-mentioned behavior for serialization/de-serialization; I need something equivalent to XSD restrictions (patterns) that is used to validate the data against the schema.
Here in Avro, if the schema is as follows:
{"namespace": "com.stackoverflow.avro",
"type": "record",
"name": "Request",
"fields": [
{"name": "caller_jwt", "type": "string", "logicalType": "regular-expression", "pattern": "[a-zA-Z0-9]*\\.[a-zA-Z0-9]*\\.[a-zA-Z0-9]*"},
{"name": "test_decimal", "type": "bytes", "logicalType": "decimal", "precision": 4, "scale": 2}
]
}
and if I try to build an object and serialize it like this:
DatumWriter<Request> userDatumWriter = new SpecificDatumWriter<>(Request.class);
DataFileWriter<Request> dataFileWriter = new DataFileWriter<>(userDatumWriter);
ByteBuffer badDecimal = ByteBuffer.wrap("bad".getBytes());
Request request = Request.newBuilder()
.setTestDecimal(badDecimal) // bad decimal
.setCallerJwt("qsdsqdqsd").build(); // bad value according to regEx
dataFileWriter.create(request.getSchema(), new File("users.avro"));
dataFileWriter.append(request);
dataFileWriter.close();
No exception is thrown and the object is serialized to the users.avro file.
So, can Avro's logical types be used to validate input data? Or is there something else that could be used to validate input data?
