Flume: Avro event deserializer to Elasticsearch - Java

I want to take a record created by the Avro deserializer and send it to Elasticsearch. I realize I have to write custom code to do this.
Using the LITERAL option, I have the JSON schema, which is the first step in using a GenericRecord. However, looking through the Avro Java API, I see no way of using a GenericRecord for a single record; all the examples use DataFileReader.
In short, I can't get the fields out of the Flume event.
Has anyone done this before?
TIA.

I was able to figure it out. I did the following:
// Get the schema from the Flume event headers
String strSchema = event.getHeaders().get("flume.avro.schema.literal");
// Get the body
byte[] body = event.getBody();
// Parse the Avro schema
Schema schema = new Schema.Parser().parse(strSchema);
// Get a decoder to read the "record" from the event body in object form
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(body, null);
// Get the datum reader
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
// Get the Avro record in object form
GenericRecord record = reader.read(null, decoder);
// Now you can iterate over the fields
for (Schema.Field field : schema.getFields()) {
    Object value = record.get(field.name());
    // Code to add the field to the JSON sent to Elasticsearch not listed
    // ...
}
This works well.
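For completeness, here is a hedged sketch of the part the snippet leaves out: turning the iterated fields into a JSON document with Jackson (com.fasterxml.jackson.databind), assuming flat records with primitive field types. The actual Elasticsearch index call is still omitted, since it depends on the client library in use.
ObjectMapper mapper = new ObjectMapper();
ObjectNode doc = mapper.createObjectNode();
for (Schema.Field field : schema.getFields()) {
    Object value = record.get(field.name());
    // Avro strings arrive as Utf8/CharSequence, so convert them to plain String;
    // putPOJO handles numbers, booleans and nulls directly.
    doc.putPOJO(field.name(), value instanceof CharSequence ? value.toString() : value);
}
String json = mapper.writeValueAsString(doc); // body of the Elasticsearch index request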

Related

Avro schema: Cannot make fields optional

I have a field defined in my avro schema as follows.
{
  "name": "currency",
  "type": ["null", "string"],
  "default": null
},
I receive some data as JSON which does not contain the field currency, and it always throws this error:
Expected field name not found: currency
I use the following code to convert this to a generic object.
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, eventDto.toString());
DatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
GenericRecord genericRecord = reader.read(null, decoder);
Most of the Stack Overflow and GitHub answers suggest that what I did above should make the field optional and work fine, but it doesn't work for me. Is there any way to solve this?
This is a pretty common misunderstanding. The Java JSON decoder does not use defaults when a field is not found, because the JSON encoder would have included that field when creating the JSON, so the decoder expects the field to be there.
If you would like to see support for applying the defaults in the way you expect, there is a similar issue on the Avro issue tracker where you can add a comment.
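One possible workaround (not part of the original answer, and assuming the missing fields are nullable unions like the one above) is to pre-process the incoming JSON with Jackson (com.fasterxml.jackson.databind) and insert an explicit null for any field the schema expects but the payload lacks, before handing it to the Avro JSON decoder. A minimal sketch:
ObjectMapper mapper = new ObjectMapper();
ObjectNode node = (ObjectNode) mapper.readTree(eventDto.toString());
for (Schema.Field field : schema.getFields()) {
    if (!node.has(field.name())) {
        node.putNull(field.name()); // a JSON null selects the "null" branch of the union
    }
}
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, node.toString());
GenericRecord genericRecord = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);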

How to extract schema from an avro file in Java

How do you extract first the schema and then the data from an Avro file in Java? This is identical to this question, except in Java.
I've seen examples of how to get the schema from an avsc file but not an avro file. What direction should I be looking in?
Schema schema = new Schema.Parser().parse(
new File("/home/Hadoop/Avro/schema/emp.avsc")
);
If you want to know the schema of an Avro file without having to generate the corresponding classes or care about which class the file belongs to, you can use GenericDatumReader:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("file.avro"), datumReader);
Schema schema = dataFileReader.getSchema();
System.out.println(schema);
And then you can read the data inside the file:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
    record = dataFileReader.next(record);
    System.out.println(record);
}
Thanks to Helder Pereira's answer. As a complement, the schema can also be fetched via getSchema() on a GenericRecord instance.
Here is a live demo of it; the link shows how to get the data and the schema in Java for the Parquet, ORC and Avro data formats.
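A quick illustration of that complement (assuming record is the GenericRecord read from the DataFileReader above):
Schema recordSchema = record.getSchema();
System.out.println(recordSchema.toString(true)); // pretty-printed JSON form of the schema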
You can use the Databricks spark-avro library, as shown at https://github.com/databricks/spark-avro, which will load the Avro file into a DataFrame (Dataset<Row>).
Once you have a Dataset<Row>, you can get the schema directly with df.schema().
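A minimal sketch of that approach, assuming Spark and the spark-avro package are on the classpath (the file name is a placeholder):
SparkSession spark = SparkSession.builder().appName("avro-schema").master("local[*]").getOrCreate();
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("file.avro");
System.out.println(df.schema()); // a Spark StructType derived from the Avro schema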

How to generate schema-less avro files using apache avro?

I am using Apache Avro for data serialization. Since the data has a fixed schema, I do not want the schema to be part of the serialized data. In the following example, the schema is part of the Avro file "users.avro".
User user1 = new User();
user1.setName("Alyssa");
user1.setFavoriteNumber(256);
User user2 = new User("Ben", 7, "red");
User user3 = User.newBuilder()
    .setName("Charlie")
    .setFavoriteColor("blue")
    .setFavoriteNumber(null)
    .build();
// Serialize user1, user2 and user3 to disk
File file = new File("users.avro");
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
Can anyone please tell me how to store avro-files without schema embedded in it?
Here you can find a comprehensive how-to in which I explain how to achieve schema-less serialization using Apache Avro.
A companion test campaign gives some figures on the performance you might expect.
The code is on GitHub: the example and test classes show how to use the data reader and writer with a stub class generated by Avro itself.
Should be doable.
Given an encoder, you can use a DatumWriter to write data directly to a ByteArrayOutputStream (which you can then write to a java.io.File).
Here's how to get started in Scala (from Salat-Avro):
val writer = new SpecificDatumWriter[MyRecord](classOf[MyRecord]) // DatumWriter for myRecord's (assumed) generated type
val baos = new ByteArrayOutputStream
val encoder = EncoderFactory.get().binaryEncoder(baos, null)
writer.write(myRecord, encoder)
encoder.flush()
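For a Java version of the same idea, here is a minimal sketch assuming the generated User class from the question. The raw bytes carry no schema, so the reader must be given the same schema out of band (here via the generated class):
// Write the record as raw Avro binary, with no schema and no container-file header.
DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user1, encoder);
encoder.flush();
byte[] schemaLessBytes = out.toByteArray();
// Reading it back, with the schema supplied by the generated class.
DatumReader<User> reader = new SpecificDatumReader<>(User.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(schemaLessBytes, null);
User roundTripped = reader.read(null, decoder);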

What is the purpose of DatumWriter in Avro

Avro website has an example:
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
What is the purpose of DatumWriter<User>? I mean, what does it provide? It provides a write method, but instead of using it we use DataFileWriter. Can someone explain its design purpose?
The DatumWriter class is responsible for translating the given data object into an Avro record with a given schema (which in your case is extracted from the User class).
Given that record, the DataFileWriter is responsible for writing it to the file.
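To see why the two are separate, here is a hedged sketch: the container-file handling in DataFileWriter stays exactly the same, while a different DatumWriter implementation, GenericDatumWriter, does the datum-to-Avro translation when no generated User class is available. The schema path and field names are assumptions based on the usual getting-started user schema.
Schema schema = new Schema.Parser().parse(new File("user.avsc")); // hypothetical schema file
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
dataFileWriter.create(schema, new File("users-generic.avro"));
GenericRecord user = new GenericData.Record(schema);
user.put("name", "Alyssa"); // field names assumed from the tutorial schema
user.put("favorite_number", 256);
dataFileWriter.append(user);
dataFileWriter.close();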

Thrift: Serialize + Deserialize changes object

I have a thrift struct something like this:
struct GeneralContainer {
  1: required string identifier;
  2: required binary data;
}
The idea is to be able to pass different types of thrift objects on a single "pipe", and still be able to deserialize at the other side correctly.
But serializing a GeneralContainer object and then deserializing it changes the contents of the data field. I am using the TBinaryProtocol:
TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
GeneralContainer container = new GeneralContainer();
container.setIdentifier("my-thrift-type");
container.setData(ByteBuffer.wrap(serializer.serialize(myThriftTypeObject)));
byte[] serializedContainer = serializer.serialize(container);
GeneralContainer testContainer = new GeneralContainer();
deserializer.deserialize(testContainer, serializedContainer);
Assert.assertEquals(container, testContainer); // fails
My guess is that some sort of markers are getting messed up when we serialize an object containing a binary field using TBinaryProtocol. Is that correct? If yes, what are my options for the protocol? My goal is to minimize the size of the resulting serialized byte array.
Thanks,
Aman
Tracked it down to a bug in Thrift 0.4's serialization. It works fine in Thrift 0.8.
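As a side note on the size question (an addition, not part of the original answer): once on a fixed Thrift version, the protocol factory can be swapped for TCompactProtocol with no other code changes if a smaller serialized form is wanted.
// TCompactProtocol uses variable-length integers and denser field headers,
// which usually yields a smaller payload than TBinaryProtocol.
TSerializer serializer = new TSerializer(new TCompactProtocol.Factory());
TDeserializer deserializer = new TDeserializer(new TCompactProtocol.Factory());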
