Avro doesn't provide backward compatibility - java

I need to send my data through a stream, so I chose Avro for serialization and deserialization. But the existing implementation using Avro readers doesn't support backward compatibility. Writing serialized data to a file and reading it back from the file does support backward compatibility. How can I achieve backward compatibility without knowing the writer's schema? I found many Stack Overflow questions related to this, but no solution for this issue. Can someone help me solve this?
Following are my serializer and deserializer methods.
public static byte[] serialize(String json, Schema schema) throws IOException {
    // Decode the JSON input into a generic datum using the schema...
    DatumReader<Object> reader = new GenericDatumReader<>(schema);
    Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
    Object datum = reader.read(null, decoder);
    // ...then re-encode that datum as Avro binary
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
    GenericDatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    writer.write(datum, encoder);
    encoder.flush();
    output.flush();
    return output.toByteArray();
}
public static String deserialize(byte[] avro, Schema schema) throws IOException {
    // Decode the Avro binary into a generic datum...
    GenericDatumReader<Object> reader = new GenericDatumReader<>(schema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
    Object datum = reader.read(null, decoder);
    // ...then re-encode that datum as JSON
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, output);
    DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    writer.write(datum, encoder);
    encoder.flush();
    output.flush();
    return new String(output.toByteArray(), "UTF-8");
}

You may have to define the scope in which you are looking for backward compatibility. Are you expecting new attributes to be added, or are you going to remove attributes? There are different options available to handle these two scenarios.
As described on the Confluent blog, adding new attributes can be made backward compatible for Avro serialization/deserialization as long as you specify a default value for the new attribute. Something like below:
{"name": "size", "type": "string", "default": "XL"}
The other option is to specify the reader and writer schemas explicitly, as in the sketch below. But as described in your question, this doesn't seem to be the option you are looking for.
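For reference, a minimal sketch of that option (assuming writerSchema and readerSchema are both available to the consumer, and avroBytes holds the serialized data) passes the pair to GenericDatumReader, which performs the schema resolution:

// Sketch: explicit writer/reader schema resolution; writerSchema, readerSchema and avroBytes are assumed inputs
GenericDatumReader<Object> reader = new GenericDatumReader<>(writerSchema, readerSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
Object datum = reader.read(null, decoder); // fields added in readerSchema are filled from their defaults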
If you are planning to remove an attribute, you can continue to parse the attribute but not use it in the application. Note that this has to happen for a definite period, and consumers must be given enough time to change their programs before you completely retire the attribute. Make sure to log a statement indicating that the attribute was found when it was not supposed to be sent (or, better, send a notification to the client system with a warning).
Besides the above points, there is an excellent blog post that talks about backward/forward compatibility.

Backward compatibility means that you can encode data with an older schema and the data can still be decoded by a reader that knows the latest schema.
Explanation from Confluent's website
So in order to decode Avro data with backward compatibility, your reader needs access to the latest schema. This can be done, for example, using a Schema Registry, as in the sketch below.
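As a hedged sketch (not from the question: the broker address, group id, topic name, and registry URL below are all placeholders), a Kafka consumer using Confluent's KafkaAvroDeserializer fetches the writer's schema from the registry via the schema ID embedded in each message:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
props.put("group.id", "demo-group"); // placeholder group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry URL
KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("events")); // placeholder topic
// Each record's value is decoded against the writer schema fetched from the registry.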

Related

Avro schema: Cannot make fields optional

I have a field defined in my Avro schema as follows.
{
    "name": "currency",
    "type": ["null", "string"],
    "default": null
},
I receive some data as JSON which does not contain the field currency, and it always throws this error:
Expected field name not found: currency
I use the following code to convert this to a generic object.
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, eventDto.toString());
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
GenericRecord genericRecord = reader.read(null, decoder);
Most of the Stack Overflow and GitHub answers suggest that what I did above should make the field optional and work fine. But this doesn't seem to work for me. Is there any way to solve this?
This is a pretty common misunderstanding. The Java JSON decoder does not use defaults when a field is not found, because the JSON encoder would have included that field when creating the JSON, so the decoder expects the field to be there.
If you would like to add your support for having it use defaults in the way you expect, you can find a similar issue on their tracker here and add a comment.
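A common workaround (a sketch under the assumption of a flat record of strings and numbers; the type mapping below is deliberately naive) is to parse the JSON yourself, e.g. with Jackson, and let GenericRecordBuilder fill in schema defaults for any missing fields:

ObjectMapper mapper = new ObjectMapper(); // Jackson
JsonNode node = mapper.readTree(eventDto.toString());
GenericRecordBuilder builder = new GenericRecordBuilder(schema);
for (Schema.Field field : schema.getFields()) {
    JsonNode value = node.get(field.name());
    if (value != null && !value.isNull()) {
        // Naive mapping; real code must cover the full range of Avro types
        builder.set(field, value.isTextual() ? value.asText() : value.numberValue());
    }
}
GenericRecord record = builder.build(); // unset fields fall back to their schema defaults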

java.io.IOException: Not a data file while reading Avro from file

The following code is used to serialize the data.
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(data.getSchema());
datumWriter.write(data, binaryEncoder);
binaryEncoder.flush();
byteArrayOutputStream.close();
result = byteArrayOutputStream.toByteArray();
I used the following command
FileUtils.writeByteArrayToFile(new File("D:/sample.avro"), result);
to write the Avro byte array to a file. But when I try to read the same using
File file = new File("D:/sample.avro");
try {
    dataFileReader = new DataFileReader<>(file, datumReader);
} catch (IOException exp) {
    System.out.println(exp);
    System.exit(1);
}
it throws the exception
java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)
What is the problem here? I referred to two other similar Stack Overflow questions (this and this), but they haven't been of much help to me. Can someone help me understand this?
The actual data is encoded in the Avro binary format, but typically what's passed around is more than just the encoded data.
What most people think of as an "avro file" is a format that includes a header (which contains things like the writer schema) followed by the actual data: https://avro.apache.org/docs/current/spec.html#Object+Container+Files. The first four bytes of an Avro container file should be "Obj" followed by the byte 1, i.e. 0x4F626A01. The error you are getting is because the binary you are trying to read as a data file doesn't start with these standard magic bytes.
Another standard format is the single object encoding: https://avro.apache.org/docs/current/spec.html#single_object_encoding. This type of binary format should start with 0xC301.
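A quick way to tell which case you have (a small sketch, assuming the serialized bytes are in a byte[] named bytes) is to inspect the leading bytes:

boolean isContainerFile = bytes.length >= 4
        && bytes[0] == 'O' && bytes[1] == 'b' && bytes[2] == 'j' && bytes[3] == 1;
boolean isSingleObject = bytes.length >= 2
        && bytes[0] == (byte) 0xC3 && bytes[1] == 0x01;
// Neither flag set suggests there is no header at all.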
But if I had to guess, the binary you have could just be the raw serialized data without any sort of header information. Though it's hard to know for sure without knowing how the byte array that you have was created.
You'd need to use Avro to write the data as well as read it; otherwise the schema isn't written (hence the "Not a data file" message). (See: https://cwiki.apache.org/confluence/display/AVRO/FAQ#FAQ-HowcanIserializedirectlyto/fromabytearray?)
If you're just looking to serialize an object, see: https://mkyong.com/java/how-to-read-and-write-java-object-to-a-file/
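Concretely, a minimal sketch that writes a proper container file with DataFileWriter (reusing the data record and schema from the question) would be:

DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(data.getSchema());
try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
    fileWriter.create(data.getSchema(), new File("D:/sample.avro")); // writes the magic bytes + schema header
    fileWriter.append(data);
}
// DataFileReader can now open D:/sample.avro without "Not a data file".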

Creating DataFrame in Java based on ByteArrayInputStream

I need to convert the following to a Spark DataFrame in Java, preserving the structure according to the Avro schema. Then I'm going to write it to S3 based on this Avro structure.
GenericRecord r = new GenericData.Record(inAvroSchema);
r.put("id", "1");
r.put("cnt", 111);

Schema enumTest = SchemaBuilder.enumeration("name1")
        .namespace("com.name")
        .symbols("s1", "s2");
GenericData.EnumSymbol symbol = new GenericData.EnumSymbol(enumTest, "s1");
r.put("type", symbol);

ByteArrayOutputStream bao = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<>(inAvroSchema);
Encoder e = EncoderFactory.get().jsonEncoder(inAvroSchema, bao);
w.write(r, e);
e.flush();
I can create an object based on the JSON structure:
Object o = reader.read(null, DecoderFactory.get().jsonDecoder(inAvroSchema, new ByteArrayInputStream(bao.toByteArray())));
But is there any way to create a DataFrame based on ByteArrayInputStream(bao.toByteArray())?
Thanks
No, you have to use a Data Source to read Avro data.
It's crucial for Spark to read Avro as files from a filesystem, because many optimizations and features (such as compression and partitioning) depend on it.
You have to add spark-avro (unless you are above Spark 2.4); see the sketch after this answer.
Note that the EnumType you are using will be a String in Spark's Dataset.
Also see this: Spark: Read an inputStream instead of File
Alternatively, you can consider deploying a bunch of tasks with SparkContext#parallelize and reading/writing the files explicitly with a DatumReader/DatumWriter.
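As a rough sketch of the Data Source route (assuming Spark 2.4+ with the org.apache.spark:spark-avro module on the classpath; the S3 paths are placeholders), you would write the records out as Avro files and read them back through the "avro" format:

SparkSession spark = SparkSession.builder().appName("avro-demo").getOrCreate();
// Read Avro container files into a DataFrame; the schema comes from the files themselves
Dataset<Row> df = spark.read().format("avro").load("s3a://my-bucket/input/"); // placeholder path
// Write back out, preserving the Avro structure
df.write().format("avro").save("s3a://my-bucket/output/"); // placeholder path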

How to generate schema-less avro files using apache avro?

I am using Apache Avro for data serialization. Since the data has a fixed schema, I do not want the schema to be part of the serialized data. In the following example, the schema is part of the Avro file "users.avro".
User user1 = new User();
user1.setName("Alyssa");
user1.setFavoriteNumber(256);

User user2 = new User("Ben", 7, "red");

User user3 = User.newBuilder()
        .setName("Charlie")
        .setFavoriteColor("blue")
        .setFavoriteNumber(null)
        .build();

// Serialize user1, user2 and user3 to disk
File file = new File("users.avro");
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
Can anyone please tell me how to store Avro files without the schema embedded in them?
Here you can find a comprehensive how-to in which I explain how to achieve schema-less serialization using Apache Avro.
A companion test campaign shows some figures on the performance you might expect.
The code is on GitHub: the example and test classes show how to use the Data Reader and Writer with a stub class generated by Avro itself.
Should be doable.
Given an encoder, you can use a DatumWriter to write data directly to a ByteArrayOutputStream (which you can then write to a java.io.File).
Here's how to get started in Scala (from Salat-Avro); note that the write call belongs to the DatumWriter, not the encoder:
val baos = new ByteArrayOutputStream
val encoder = EncoderFactory.get().binaryEncoder(baos, null)
writer.write(myRecord, encoder) // writer is the DatumWriter for myRecord's type
encoder.flush()
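For a Java equivalent (a minimal sketch reusing the generated User class from the question), write raw datum bytes with a BinaryEncoder and read them back by supplying the same schema out of band:

// Write: raw Avro binary, no container-file header, no embedded schema
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
writer.write(user1, encoder);
encoder.flush();
byte[] bytes = out.toByteArray();

// Read: the reader must already know the schema, since it is not in the bytes
DatumReader<User> reader = new SpecificDatumReader<>(User.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
User decoded = reader.read(null, decoder);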

How do you get the xml version of a document using org.jdom2?

I am using org.jdom2 to parse XML files. I need to know if the file is marked as version 1.1 or version 1.0. How do I access the XML declaration?
Also, how do I set the version when writing the output using the XMLOutputter?
The XML version is parsed and used by the XML parser (SAX). Some parsers support the SAX2 API, which allows some of them to supply extended parsing information. If the parser does this, the XML version may be available via the Locator2 implementation's getXMLVersion(). JDOM does not have a hook on this information, so the data is not yet available in JDOM. It would make a good feature request.
JDOM also outputs data as XML version 1.0. The differences between 1.0 and 1.1 from JDOM's perspective are slight; the most significant is the slightly different handling of supported characters.
If you want to specify a different XML version for your output, you can force the declaration by disabling the XMLOutputter's declaration (setOmitDeclaration(true)) and then dumping the declaration yourself onto the stream before outputting the XML, as in the sketch below.
Alternatively, you can extend the XMLOutputProcessor and override the processDeclaration() method to output the declaration you want.
None of these options is easy, and the support for XML 1.1 in JDOM is limited. Your mileage may vary, but please keep me updated on your success, and file issues on GitHub if you have suggestions/problems: https://github.com/hunterhacker/jdom/issues
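A minimal sketch of the omit-and-dump option (assuming a parsed org.jdom2.Document named document; the output file name is a placeholder):

XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat().setOmitDeclaration(true));
try (Writer writer = new FileWriter("out.xml")) { // placeholder output file
    writer.write("<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n"); // hand-written declaration
    outputter.output(document, writer);
}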
I fully believe that rolfl's answer is correct, but it isn't the approach I finally took. I decided to just do a quick parse of the declaration myself. This probably needs further testing with documents that contain a BOM.
private static final Pattern xmlDeclaration = Pattern.compile("<\\?xml.*? version=\"([\\d.]+)\".*?\\?>");

private static boolean isXml10(InputStream inputStream) throws IOException
{
    // No declaration (or no version match) is treated as XML 1.0
    boolean result = true;
    try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream)))
    {
        String line = br.readLine();
        if (line != null)
        {
            Matcher declarationMatch = xmlDeclaration.matcher(line);
            if (declarationMatch.find())
            {
                result = "1.0".equals(declarationMatch.group(1));
            }
        }
    }
    return result;
}
