Avro schema: Cannot make fields optional - Java

I have a field defined in my avro schema as follows.
{
"name": "currency",
"type": ["null","string"],
"default": null
},
I receive some data as JSON which does not contain the field currency, and it always throws this error:
Expected field name not found: currency
I use the following code to convert this to a generic object.
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, eventDto.toString());
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
GenericRecord genericRecord = reader.read(null, decoder);
Most of the Stack Overflow and GitHub answers suggest that what I did above should make the fields optional and should work fine. But this doesn't seem to work for me. Is there any way to solve this?

This is a pretty common misunderstanding. The Java JSON decoder does not use defaults when a field is not found. This is because the JSON encoder would have included that field when creating the JSON and so the decoder expects the field to be there.
If you would like to add your support for having it use the defaults in the way you expect, you can find a similar issue on their tracker here and add a comment.
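In the meantime, a common workaround is to pre-process the incoming JSON and explicitly add the missing nullable fields before handing it to the decoder. Below is a minimal sketch, assuming Jackson's ObjectMapper/ObjectNode are on the classpath and reusing the schema and eventDto variables from the question; this helper logic is illustrative, not part of the Avro API.
// Workaround sketch (not part of the Avro API): explicitly add missing
// nullable fields as null before handing the JSON to jsonDecoder.
ObjectMapper mapper = new ObjectMapper();
ObjectNode node = (ObjectNode) mapper.readTree(eventDto.toString());
for (Schema.Field field : schema.getFields()) {
    boolean nullable = field.schema().getType() == Schema.Type.UNION
            && field.schema().getTypes().stream()
                   .anyMatch(s -> s.getType() == Schema.Type.NULL);
    if (!node.has(field.name()) && nullable) {
        node.putNull(field.name()); // satisfies the null branch of ["null","string"]
    }
}
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, node.toString());
GenericRecord genericRecord = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);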

Related

Best way to parse JSON with an unknown structure for comparison with a known structure?

I have a YAML file which I convert to JSON, and then to a Java object using GSON. This will be used as the standard definition which I will compare other YAML files against. The YAML files which I will be validating should contain fields with identical structures to my definition. However, it is very possible that it might contain fields with different structure, and fields that don't exist within my definition, as it is ultimately up to the user to create these fields before I receive the file. A field in the YAML to be validated can look like this, with the option of as many levels of nesting as the user wishes to define.
LBU:
  type: nodes.Compute
  properties:
    name: LBU
    description: LBU
    configurable_properties:
      test: {"additional_configurable_properties":{"aaa":"1"}}
    vdu_profile:
      min_number_of_instances: 1
      max_number_of_instances: 4
  capabilities:
    virtual_compute:
      properties:
        virtual_memory:
          virtual_mem_size: 8096 MB
        virtual_cpu:
          cpu_architecture: x86
          num_virtual_cpu: 2
          virtual_cpu_clock: 1800 MHz
  requirements:
    - virtual_storage:
        capability: capabilities.VirtualStorage
        node: LBU_Storage
Currently, I receive this YAML file and convert it to a JsonObject with Gson. It is not possible to map this to a Java object because of any possible unknown fields. My goal is to run through this file and validate every single field against a matching one in my definition. If a field is present that does not exist in the definition, or does exist but has properties that differ, I need to inform the user with specific info about the field.
So far, I am going the route of getting fields like this.
for (String field : obj.get("topology_template").getAsJsonObject().get("node_template").getAsJsonObject().get(key).getAsJsonObject().get(
obj.get("topology_template").getAsJsonObject().get("node_templates").getAsJsonObject().get(key).getAsJsonObject().keySet().toArray()[i].toString()).getAsJsonObject().keySet()) {
However, it seems that this is rather excessive and is very hard to follow for some deeply nested fields.
What I want to know is if there is a simpler way to traverse every field of a JsonObject, without mapping it to a Java object, and without explicitly accessing each field by name?
I think you are looking for something like a streaming JSON parser. Here's an example using Jackson's streaming API:
String json
    = "{\"name\":\"Tom\",\"age\":25,\"address\":[\"Poland\",\"5th avenue\"]}";

JsonFactory jfactory = new JsonFactory();
JsonParser jParser = jfactory.createParser(json);

String parsedName = null;
Integer parsedAge = null;
List<String> addresses = new LinkedList<>();

while (jParser.nextToken() != JsonToken.END_OBJECT) {
    String fieldname = jParser.getCurrentName();
    if ("name".equals(fieldname)) {
        jParser.nextToken();
        parsedName = jParser.getText();
    }
    if ("age".equals(fieldname)) {
        jParser.nextToken();
        parsedAge = jParser.getIntValue();
    }
    if ("address".equals(fieldname)) {
        jParser.nextToken();
        while (jParser.nextToken() != JsonToken.END_ARRAY) {
            addresses.add(jParser.getText());
        }
    }
}
jParser.close();
Please find the documentation here:
https://github.com/FasterXML/jackson-docs/wiki/JacksonStreamingApi
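If you prefer to stay with Gson's tree model (as in the question) rather than switch to Jackson streaming, a recursive walk over the JsonElement tree avoids the long chains of get(...).getAsJsonObject() calls. A rough sketch; the walk helper below is hypothetical, not a Gson method, and the comparison against your definition would go where the println is:
// Hypothetical helper: visits every field of a JsonElement recursively and
// prints the path of each leaf value.
static void walk(String path, JsonElement element) {
    if (element.isJsonObject()) {
        for (Map.Entry<String, JsonElement> entry : element.getAsJsonObject().entrySet()) {
            walk(path + "/" + entry.getKey(), entry.getValue());
        }
    } else if (element.isJsonArray()) {
        for (int i = 0; i < element.getAsJsonArray().size(); i++) {
            walk(path + "[" + i + "]", element.getAsJsonArray().get(i));
        }
    } else {
        System.out.println(path + " = " + element);
    }
}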

How to convert protocol buffers binary to JSON using the descriptor in Java

I have a message with a field of the "Any" well known type which can hold a serialized protobuf message of any type.
I want to convert this field to its json representation.
I know the field names are required, and typically you would need the generated classes loaded in the app for this to work, but I am looking for a way to do it with the descriptors.
First, I parse the descriptors:
FileInputStream descriptorFile = new FileInputStream("/descriptor");
DescriptorProtos.FileDescriptorSet fdp = DescriptorProtos.FileDescriptorSet.parseFrom(descriptorFile);
Then, I loop through the contained messages and find the correct one (using the "Any" type's URL, which contains the package and message name). I add this to a TypeRegistry which is used to format the JSON.
JsonFormat.TypeRegistry.Builder typeRegistryBuilder = JsonFormat.TypeRegistry.newBuilder();
String messageNameFromUrl = member.getAny().getTypeUrl().split("/")[1];
for (DescriptorProtos.FileDescriptorProto file : fdp.getFileList()) {
    for (DescriptorProtos.DescriptorProto dp : file.getMessageTypeList()) {
        if (messageNameFromUrl.equals(String.format("%s.%s", file.getPackage(), dp.getName()))) {
            typeRegistryBuilder.add(dp.getDescriptorForType()); // Doesn't work.
            typeRegistryBuilder.add(MyConcreteGeneratedClass.getDescriptor()); // Works
            System.out.println(JsonFormat.printer()
                    .usingTypeRegistry(typeRegistryBuilder.build())
                    .preservingProtoFieldNames()
                    .print(member.getAny()));
            return;
        }
    }
}
The problem seems to be that parsing the descriptor gives me access to DescriptorProtos.DescriptorProto objects, but I see no way to get the Descriptors.Descriptor object needed for the type registry. I can access the concrete class's descriptor with getDescriptor(), and that works, but I am trying to format the JSON at runtime by reading a pre-generated descriptor file from outside the app, so I do not have the concrete class available to call getDescriptor().
What would be even better is if I could use the "Any" field's type URL to resolve the Type object and use that to generate the JSON, since it also appears to have the field numbers and names as required for this process.
Any help is appreciated, thanks!
If you convert a DescriptorProtos.FileDescriptorProto to Descriptors.FileDescriptor, the latter has a getMessageTypes() method that returns List<Descriptor>.
Following is a snippet of Kotlin code taken from an open-source library I'm developing called okgrpc. It's a first-of-its-kind attempt to create a dynamic gRPC client/CLI in Java.
private fun DescriptorProtos.FileDescriptorProto.resolve(
    index: Map<String, DescriptorProtos.FileDescriptorProto>,
    cache: MutableMap<String, Descriptors.FileDescriptor>
): Descriptors.FileDescriptor {
    if (cache.containsKey(this.name)) return cache[this.name]!!
    return this.dependencyList
        .map { (index[it] ?: error("Unknown dependency: $it")).resolve(index, cache) }
        .let {
            val fd = Descriptors.FileDescriptor.buildFrom(this, *it.toTypedArray())
            cache[fd.name] = fd
            fd
        }
}
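For reference, here is roughly the same idea in plain Java against the question's loop variables (file, dp, member), under the simplifying assumption that the matching .proto file has no dependencies outside itself; exception handling is omitted, and this is a sketch rather than the library's code:
// Build a real Descriptors.FileDescriptor from the parsed FileDescriptorProto
// (no imported .proto dependencies assumed), then look up the message by name.
Descriptors.FileDescriptor fd = Descriptors.FileDescriptor.buildFrom(
        file, new Descriptors.FileDescriptor[0]);
Descriptors.Descriptor messageDescriptor = fd.findMessageTypeByName(dp.getName());
JsonFormat.TypeRegistry registry = JsonFormat.TypeRegistry.newBuilder()
        .add(messageDescriptor)
        .build();
System.out.println(JsonFormat.printer()
        .usingTypeRegistry(registry)
        .preservingProtoFieldNames()
        .print(member.getAny()));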

Avro doesn't provide backward compatibility

I need to send my data through a stream, so I chose Avro for data serialization and deserialization. But the existing implementation using Avro readers doesn't support backward compatibility. Writing serialized data to a file and reading it back does support backward compatibility. How can I achieve backward compatibility without knowing the writer's schema? I found many Stack Overflow questions related to this, but I didn't find any solution for this issue. Can someone help me solve this?
Following are my serializer and deserializer methods.
public static byte[] serialize(String json, Schema schema) throws IOException {
    GenericDatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
    DatumReader<Object> reader = new GenericDatumReader<>(schema);
    Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
    Object datum = reader.read(null, decoder);
    writer.write(datum, encoder);
    encoder.flush();
    output.flush();
    return output.toByteArray();
}

public static String deserialize(byte[] avro, Schema schema) throws IOException {
    GenericDatumReader<Object> reader = new GenericDatumReader<>(schema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
    Object datum = reader.read(null, decoder);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, output);
    DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    writer.write(datum, encoder);
    encoder.flush();
    output.flush();
    return new String(output.toByteArray(), "UTF-8");
}
You may have to define the scope of backward compatibility you are looking for. Are you expecting new attributes to be added, or are you going to remove attributes? There are different options available to handle both of these scenarios.
As described on the Confluent blog, the addition of new attributes can be made backward compatible for Avro serialization/deserialization as long as you specify a default value for the new attribute. Something like below:
{"name": "size", "type": "string", "default": "XL"}
The other option is to specify the reader and writer schemas explicitly. But as described in your question, that doesn't seem to be the option you are looking for.
If you are planning to remove an attribute, you can continue to parse the attribute but not use it in the application. Note that this has to happen for a definite period, and consumers must be given enough time to change their programs before you completely retire the attribute. Make sure to log a statement indicating that the attribute was found when it was not supposed to be sent (or, better, send a notification to the client system with a warning).
Besides the above points, there is an excellent blog post which talks about backward/forward compatibility.
Backward compatibility means that you can encode data with an older schema and the data can still be decoded by a reader that knows the latest schema.
Explanation from Confluent's website
So in order to decode Avro data with backward compatibility, your reader needs access both to the schema the data was written with and to the latest schema it reads into. This can be done, for example, using a Schema Registry.
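In practice that means passing both schemas to the reader so Avro can perform schema resolution. A minimal sketch of the deserialize method above, assuming the writer's schema can be obtained (for example, from a schema registry):
public static String deserialize(byte[] avro, Schema writerSchema, Schema readerSchema) throws IOException {
    // Decode with the writer's schema and resolve into the reader's (latest) schema
    GenericDatumReader<Object> reader = new GenericDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
    Object datum = reader.read(null, decoder);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(readerSchema, output);
    DatumWriter<Object> writer = new GenericDatumWriter<>(readerSchema);
    writer.write(datum, encoder);
    encoder.flush();
    return new String(output.toByteArray(), "UTF-8");
}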

How to extract schema from an avro file in Java

How do you extract first the schema and then the data from an Avro file in Java? This is identical to this question, except in Java.
I've seen examples of how to get the schema from an avsc file, but not from an avro file. What direction should I be looking in?
Schema schema = new Schema.Parser().parse(
    new File("/home/Hadoop/Avro/schema/emp.avsc")
);
If you want to know the schema of an Avro file without having to generate the corresponding classes or care about which class the file belongs to, you can use GenericDatumReader:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("file.avro"), datumReader);
Schema schema = dataFileReader.getSchema();
System.out.println(schema);
And then you can read the data inside the file:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
    record = dataFileReader.next(record);
    System.out.println(record);
}
Thanks to @Helder Pereira for his answer. As a complement, the schema can also be fetched via getSchema() on a GenericRecord instance.
Here is a live demo about it; the link shows how to get the data and schema in Java for the Parquet, ORC and Avro data formats.
You can use the Databricks library as shown here https://github.com/databricks/spark-avro, which will load the Avro file into a DataFrame (Dataset<Row>).
Once you have a Dataset<Row>, you can directly get the schema using df.schema().
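A rough sketch of that approach, assuming the spark-avro package from the linked repository is on the classpath (newer Spark versions bundle Avro support under the built-in "avro" format instead, so the format name may differ):
SparkSession spark = SparkSession.builder().appName("avro-schema").getOrCreate();
// Load the Avro file into a DataFrame and print its schema
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("file.avro");
System.out.println(df.schema().treeString());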

Flume: Avro event deserializer To Elastic Search

I want to take a record created by the AVRO deserializer and send it to ElasticSearch. I realize I have to write custom code to do this.
Using the LITERAL option, I have the JSON schema, which is the first step in using a GenericRecord. However, looking through the Avro Java API, I see no way of using GenericRecord for a single record. All the examples use DataFileReader.
In short, I can't get the fields from the Flume event.
Has anyone done this before?
TIA.
I was able to figure it out. I did the following:
// Get the schema from the Flume event header
String strSchema = event.getHeaders().get("flume.avro.schema.literal");
// Get the body
byte[] body = event.getBody();
// Parse the Avro schema
Schema schema = new Schema.Parser().parse(strSchema);
// Get the decoder to use to get the "record" from the event stream in object form
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(body, null);
// Get the datum reader
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
// Get the Avro record in object form
GenericRecord record = reader.read(null, decoder);
// Now you can iterate over the fields
for (Schema.Field field : schema.getFields()) {
    Object value = record.get(field.name());
    // Code to add the field to the JSON sent to ElasticSearch not listed
    // ...
}
This works well.
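For the part left out above, one way to turn the GenericRecord into a JSON string for an Elasticsearch index request is Avro's own JSON encoder; this is a sketch, not the original author's code, and exception handling is omitted:
// Serialize the GenericRecord to JSON using the same schema
ByteArrayOutputStream out = new ByteArrayOutputStream();
JsonEncoder jsonEncoder = EncoderFactory.get().jsonEncoder(schema, out);
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
datumWriter.write(record, jsonEncoder);
jsonEncoder.flush();
String json = new String(out.toByteArray(), "UTF-8");
// json can now be used as the document body of an Elasticsearch index request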
