I am using Apache Avro for data serialization. Since the data has a fixed schema, I do not want the schema to be part of the serialized data. In the following example, the schema is part of the Avro file "users.avro":
User user1 = new User();
user1.setName("Alyssa");
user1.setFavoriteNumber(256);
User user2 = new User("Ben", 7, "red");
User user3 = User.newBuilder()
.setName("Charlie")
.setFavoriteColor("blue")
.setFavoriteNumber(null)
.build();
// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
Can anyone please tell me how to store Avro files without the schema embedded in them?
Here you can find a comprehensive how-to in which I explain how to achieve schema-less serialization using Apache Avro.
A companion test campaign shows some figures on the performance you might expect.
The code is on GitHub: the example and test classes show how to use the Data Reader and Writer with a stub class generated by Avro itself.
Should be doable.
Given an encoder, you can use a DatumWriter to write data directly to a ByteArrayOutputStream (which you can then write to a java.io.File).
Here's how to get started in Scala (from Salat-Avro):
val baos = new ByteArrayOutputStream
val writer = new SpecificDatumWriter[MyRecord](MyRecord.getClassSchema) // DatumWriter for your generated record class
val encoder = EncoderFactory.get().binaryEncoder(baos, null)
writer.write(myRecord, encoder)
encoder.flush()
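In plain Java, the same idea looks roughly like this (a sketch only, reusing the generated User class and the user1/user2/user3 objects from the question, not the Salat-Avro code): the binary encoder writes only the record data, so no schema ends up in the output, and the reader must already know the schema.
try (FileOutputStream out = new FileOutputStream("users.avro")) {
    // DatumWriter + binary Encoder: just the record data, no schema, no container header
    DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(user1, encoder);
    writer.write(user2, encoder);
    writer.write(user3, encoder);
    encoder.flush();
}

// Reading it back: since the schema is no longer embedded, both sides must agree on it up front
try (FileInputStream in = new FileInputStream("users.avro")) {
    DatumReader<User> reader = new SpecificDatumReader<>(User.class);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(in, null);
    User first = reader.read(null, decoder);
}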
I need to convert the following to a Spark DataFrame in Java, preserving the structure according to the Avro schema, and then write it to S3 based on this Avro structure.
GenericRecord r = new GenericData.Record(inAvroSchema);
r.put("id", "1");
r.put("cnt", 111);
Schema enumTest =
SchemaBuilder.enumeration("name1")
.namespace("com.name")
.symbols("s1", "s2");
GenericData.EnumSymbol symbol = new GenericData.EnumSymbol(enumTest, "s1");
r.put("type", symbol);
ByteArrayOutputStream bao = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<>(inAvroSchema);
Encoder e = EncoderFactory.get().jsonEncoder(inAvroSchema, bao);
w.write(r, e);
e.flush();
I can create the object based on the JSON structure:
Object o = reader.read(null, DecoderFactory.get().jsonDecoder(inAvroSchema, new ByteArrayInputStream(bao.toByteArray())));
But maybe there is a way to create a DataFrame based on new ByteArrayInputStream(bao.toByteArray())?
Thanks
No, you have to use a Data Source to read Avro data.
And it's crucial for Spark to read Avro as files from the filesystem, because many optimizations and features depend on it (such as compression and partitioning).
You have to add spark-avro (unless you are on Spark 2.4 or above).
Note that the enum type you are using will be a String in Spark's Dataset.
Also see this: Spark: Read an inputStream instead of File
Alternatively, you can consider distributing the work with SparkContext#parallelize and reading/writing the files explicitly with a DatumReader/DatumWriter.
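A minimal sketch of the data-source approach (assuming Spark 2.4+ with the Avro module on the classpath, or the spark-avro package on older versions, and hypothetical paths): write the Avro records to files first, then load them as a DataFrame and save it to S3.
SparkSession spark = SparkSession.builder().appName("avro-to-dataframe").getOrCreate();

// Spark reads Avro as files from the filesystem / S3, not from an InputStream
Dataset<Row> df = spark.read()
        .format("avro")                    // use "com.databricks.spark.avro" before Spark 2.4
        .load("/tmp/records/");            // hypothetical directory containing .avro files

df.printSchema();                          // the enum field shows up as a string column

df.write()
        .format("avro")
        .save("s3a://my-bucket/output/");  // hypothetical S3 target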
I need to send my data through a stream, so I chose Avro for data serialization and deserialization. But the existing implementation using Avro readers doesn't support backward compatibility; writing serialized data into a file and reading it back from the file does. How can I achieve backward compatibility without knowing the writer's schema? I found many Stack Overflow questions related to this, but I didn't find any solution for the issue. Can someone help me solve this?
Following are my serializer and deserializer methods:
public static byte[] serialize(String json, Schema schema) throws IOException {
GenericDatumWriter<Object> writer = new GenericDatumWriter<>(schema);
ByteArrayOutputStream output = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
DatumReader<Object> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
Object datum = reader.read(null, decoder);
writer.write(datum, encoder);
encoder.flush();
output.flush();
return output.toByteArray();
}
public static String deserialize(byte[] avro, Schema schema) throws IOException {
GenericDatumReader<Object> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
Object datum = reader.read(null, decoder);
ByteArrayOutputStream output = new ByteArrayOutputStream();
JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, output);
DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
writer.write(datum, encoder);
encoder.flush();
output.flush();
return new String(output.toByteArray(), "UTF-8");
}
You may have to define the scope for which you need backward compatibility. Are you expecting new attributes to be added, or are you going to remove attributes? There are different options available to handle both of these scenarios.
As described on the Confluent blog, the addition of new attributes can be made backward compatible for Avro serialization/deserialization as long as you specify a default value for the new attribute, something like below:
{"name": "size", "type": "string", "default": "XL"}
The other option is to specify the reader and writer schemas explicitly, but as described in your question, that doesn't seem to be the option you are looking for.
If you are planning to remove an attribute, you can continue to parse it but simply not use it in the application. Note that this has to happen for a defined period, and consumers must be given enough time to change their programs before you completely retire the attribute. Make sure to log a statement when the attribute shows up even though it was no longer supposed to be sent (or, better, send a notification with a warning to the client system).
Besides the points above, there is an excellent blog post which talks about backward/forward compatibility.
Backward compatibility means that you can encode data with an older schema and the data can still be decoded by a reader that knows the latest schema.
Explanation from Confluent's website
So in order to decode Avro data with backward compatibility, your reader needs access to the latest schema (and, through schema resolution, to the writer's schema the data was encoded with). This can be done, for example, using a Schema Registry.
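As an illustration of what that looks like in code, here is a minimal sketch (not from either answer) of deserializing with an explicit writer schema and a newer reader schema; Avro's schema resolution then fills in defaults for added fields and skips removed ones.
public static GenericRecord deserialize(byte[] avro, Schema writerSchema, Schema readerSchema)
        throws IOException {
    // writerSchema: the schema the bytes were encoded with (e.g. fetched from a schema registry)
    // readerSchema: the latest schema your application knows about
    DatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(avro, null);
    return reader.read(null, decoder);
}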
How do you extract first the schema and then the data from an Avro file in Java? Identical to this question, except in Java.
I've seen examples of how to get the schema from an .avsc file but not from an .avro file. What direction should I be looking in?
Schema schema = new Schema.Parser().parse(
new File("/home/Hadoop/Avro/schema/emp.avsc")
);
If you want to know the schema of an Avro file without having to generate the corresponding classes or care about which class the file belongs to, you can use the GenericDatumReader:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("file.avro"), datumReader);
Schema schema = dataFileReader.getSchema();
System.out.println(schema);
And then you can read the data inside the file:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
record = dataFileReader.next(record);
System.out.println(record);
}
Thanks for @Helder Pereira's answer. As a complement, the schema can also be fetched from getSchema() of a GenericRecord instance.
Here is a live demo of it; the link above shows how to get the data and schema in Java for the Parquet, ORC and Avro data formats.
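A tiny sketch of that complement (reusing the dataFileReader from the answer above):
GenericRecord first = dataFileReader.next();
Schema recordSchema = first.getSchema();          // same schema as dataFileReader.getSchema()
System.out.println(recordSchema.toString(true));  // pretty-printed schema JSON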
You can use the Databricks library shown here, https://github.com/databricks/spark-avro, which will load the Avro file into a DataFrame (Dataset<Row>).
Once you have a Dataset<Row>, you can get the schema directly using df.schema().
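For example (a sketch, assuming an existing SparkSession named spark and the spark-avro data source on the classpath):
Dataset<Row> df = spark.read()
        .format("com.databricks.spark.avro")
        .load("users.avro");

StructType sparkSchema = df.schema();  // Spark SQL schema derived from the Avro schema
df.printSchema();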
I want to take a record created by the Avro deserializer and send it to Elasticsearch. I realize I have to write custom code to do this.
Using the LITERAL option, I have the JSON schema, which is the first step in using a GenericRecord. However, looking through the Avro Java API, I see no way of using GenericRecord for a single record; all the examples use DataFileReader.
In short, I can't get the fields from the Flume event.
Has anyone done this before?
TIA.
I was able to figure it out. I did the following:
// Get the schema from the Flume event header
String strSchema = event.getHeaders().get("flume.avro.schema.literal");
// Get the body
byte[] body = event.getBody();
// Parse the Avro schema
Schema schema = new Schema.Parser().parse(strSchema);
// Get a decoder to turn the event body back into a record in object form
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(body, null);
// Get the datum reader
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
// Read the Avro record in object form
GenericRecord record = reader.read(null, decoder);
// Now you can iterate over the fields
for (Schema.Field field : schema.getFields()) {
    Object value = record.get(field.name());
    // Code to add the field to the JSON sent to Elasticsearch not listed
    // ...
}
This works well.
The Avro website has an example:
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
What is the purpose of DatumWriter<User>? I mean, what does it provide? It provides a write method, but instead of using it we use DataFileWriter. Can someone explain the design purpose here?
The DatumWriter is responsible for translating the given data object into an Avro record with a given schema (which, in your case, is extracted from the User class).
Given this record, the DataFileWriter is responsible for writing it to a file.
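To make the separation of concerns concrete, here is a small sketch (reusing user1 from the example) that uses the DatumWriter on its own with a binary Encoder: the DatumWriter handles the object-to-Avro encoding, while the DataFileWriter adds the container-file concerns (header with the schema, blocks, sync markers, optional compression) on top of it.
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<User> datumWriter = new SpecificDatumWriter<>(User.class);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
datumWriter.write(user1, encoder);        // DatumWriter: User object -> Avro binary encoding
encoder.flush();
byte[] rawAvroBytes = out.toByteArray();  // just the record data, no schema, no file header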