Convert Avro into bytes and store the byte data in MySQL - Java

I have an Avro schema file, customer.avsc. I have already successfully created the Avro object using its builder, and I can read the Avro object back. I am wondering how to convert the Customer Avro object into bytes and store them in the database. Thanks a lot!
public static void main(String[] args) {
    // we can now build a customer in a "safe" way
    Customer.Builder customerBuilder = Customer.newBuilder();
    customerBuilder.setAge(30);
    customerBuilder.setFirstName("Mark");
    customerBuilder.setLastName("Simpson");
    customerBuilder.setAutomatedEmail(true);
    customerBuilder.setHeight(180f);
    customerBuilder.setWeight(90f);
    Customer customer = customerBuilder.build();
    System.out.println(customer);
    // write it out to a file
    final DatumWriter<Customer> datumWriter = new SpecificDatumWriter<>(Customer.class);
    try (DataFileWriter<Customer> dataFileWriter = new DataFileWriter<>(datumWriter)) {
        dataFileWriter.create(customer.getSchema(), new File("customer-specific.avro"));
        dataFileWriter.append(customer);
        System.out.println("successfully wrote customer-specific.avro");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I used BinaryEncoder to solve this problem. With it, the Avro object can be converted into bytes and saved into the MySQL database. Then, when the data arrives from Kafka (bytes -> MySQL -> Debezium Connector -> Kafka -> Consumer API), I can decode the payload of that byte column back into an Avro/Java object with the same schema.
Here is the code.
// Build the Customer object.
Customer.Builder customerBuilder = Customer.newBuilder();
customerBuilder.setAge(20);
customerBuilder.setFirstName("first");
customerBuilder.setLastName("last");
customerBuilder.setAutomatedEmail(true);
customerBuilder.setHeight(180f);
customerBuilder.setWeight(90f);
Customer customer = customerBuilder.build();

// Serialize the object to raw Avro binary (no container file, no schema header).
DatumWriter<SpecificRecord> writer = new SpecificDatumWriter<>(customer.getSchema());
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(customer, encoder);
encoder.flush();
out.close();
byte[] serializedBytes = out.toByteArray();
System.out.println("Sending message in bytes : " + Arrays.toString(serializedBytes));
// String serializedHex = Hex.encodeHexString(serializedBytes);
// System.out.println("Serialized Hex String : " + serializedHex);
// KeyedMessage<String, byte[]> message = new KeyedMessage<String, byte[]>("page_views", serializedBytes);
// producer.send(message);
// producer.close();

// Deserialize the bytes back into a Customer using the same schema.
DatumReader<Customer> userDatumReader = new SpecificDatumReader<>(Customer.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(serializedBytes, null);
Customer datum = userDatumReader.read(null, decoder);
System.out.println(datum);
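Storing the resulting byte[] in MySQL is then plain JDBC against a binary column. A minimal sketch, assuming a hypothetical customers table with a BLOB/VARBINARY column named avro_payload and your own connection details (jdbcUrl, dbUser, dbPassword):

// Hypothetical table: CREATE TABLE customers (id INT PRIMARY KEY, avro_payload BLOB)
try (Connection conn = DriverManager.getConnection(jdbcUrl, dbUser, dbPassword);
     PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO customers (id, avro_payload) VALUES (?, ?)")) {
    ps.setInt(1, 1);
    ps.setBytes(2, serializedBytes); // raw Avro binary, no container header
    ps.executeUpdate();
}

Because binaryEncoder writes raw Avro with no embedded schema, the consumer on the Kafka side must decode that column with the exact same Customer schema, as in the snippet above.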

Related

Deserialize Avro Data from bytes

I am trying to deserialize, i.e., get an object of class org.apache.avro.generic.GenericRecord from byte array Avro data. This data contains a header with the full schema.
So far, I have tried this:
public List<GenericRecord> deserializeGenericWithSchema(byte[] message) throws IOException {
    List<GenericRecord> listOfRecords = new ArrayList<>();
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new SeekableByteArrayInput(message), reader);
    GenericRecord record = null;
    while (fileReader.hasNext()) {
        listOfRecords.add(fileReader.next(record));
    }
    return listOfRecords;
}
But I am getting an error:
java.io.IOException: Invalid int encoding
    at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
    at org.apache.avro.io.BinaryDecoder.readBytes(BinaryDecoder.java:282)
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:112)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
However, if I write the byte array message to disk and change my function like this:
public List<GenericRecord> deserializeGenericWithSchema(String fileName) throws IOException {
    File file = new File(fileName);
    List<GenericRecord> listOfRecords = new ArrayList<>();
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader = new DataFileReader<>(file, reader);
    GenericRecord record = null;
    while (fileReader.hasNext()) {
        listOfRecords.add(fileReader.next(record));
    }
    return listOfRecords;
}
It works flawlessly. I really don't want to write every Avro message I receive to disk, because this is intended to run in real time.
What am I doing wrong in my first approach?
Do you have any follow-up on the issue? My assumption is that it is an encoding issue. Where did the byte[] come from? Is it the exact byte[] you are writing to disk? The explanation may lie in the default encoding settings of both the file writer and the reader.
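For what it's worth, a minimal in-memory round trip (a sketch, assuming a schema and a populated record are at hand) shows that DataFileReader over a SeekableByteArrayInput works fine when the bytes are intact, so the corruption most likely happens before the method is called, e.g. by passing the bytes through a String:

// Write a container file into memory, then read it back from the same byte[].
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(writer)) {
    fileWriter.create(schema, out); // writes the header DataFileStream later validates
    fileWriter.append(record);
}
byte[] message = out.toByteArray(); // pass these bytes around untouched
List<GenericRecord> records = deserializeGenericWithSchema(message); // works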

How to deserialize avro files

I would like to read an HDFS folder containing Avro files with Spark, and then deserialize the Avro events contained in those files. I would like to do this without the com.databricks library (or any other library that makes it easy).
The problem is that I am having difficulties with the deserialization.
I assume that my Avro file is compressed with snappy, because at the beginning of the file (just after the schema) I have
avro.codecsnappy
written, followed by readable and unreadable characters.
My first attempt to deserialize the Avro events is the following:
public static String deserialize(String message) throws IOException {
    Schema.Parser schemaParser = new Schema.Parser();
    Schema avroSchema = schemaParser.parse(defaultFlumeAvroSchema);
    DatumReader<GenericRecord> specificDatumReader = new SpecificDatumReader<GenericRecord>(avroSchema);
    byte[] messageBytes = message.getBytes();
    Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
    GenericRecord genericRecord = specificDatumReader.read(null, decoder);
    return genericRecord.toString();
}
This function works when I want to deserialize an Avro file that doesn't have the avro.codecsnappy marker in it. When it does, I get the error:
Malformed data : length is negative : -50
So I tried another way of doing it:
private static void deserialize2(String path) throws IOException {
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new File(path), reader);
    System.out.println(fileReader.getSchema().toString());
    GenericRecord record = new GenericData.Record(fileReader.getSchema());
    int numEvents = 0;
    while (fileReader.hasNext()) {
        fileReader.next(record);
        ByteBuffer body = (ByteBuffer) record.get("body");
        CharsetDecoder decoder = Charsets.UTF_8.newDecoder();
        System.out.println("Position of the index " + body.position());
        System.out.println("Size of the array : " + body.array().length);
        String bodyStr = decoder.decode(body).toString();
        System.out.println("THE BODY STRING ---> " + bodyStr);
        numEvents++;
    }
    fileReader.close();
}
and it returns the following output:
Position of the index 0
Size of the array : 127482
THE BODY STRING --->
I can see that the array isn't empty, but it just returns an empty string.
How can I proceed?
Use this when converting to string:
String bodyStr = new String(body.array());
System.out.println("THE BODY STRING ---> " + bodyStr);
Source: https://www.mkyong.com/java/how-do-convert-byte-array-to-string-in-java/
Well, it seems that you are on the right track. However, your ByteBuffer might not have a proper backing byte[] array to decode, so let's try the following instead:
byte[] bytes = new byte[body.remaining()];
body.get(bytes);
String result = new String(bytes, "UTF-8"); // maybe you need to change the charset
This should work: you have shown in your question that the ByteBuffer contains actual data, but as noted in the code example you might have to change the charset.
List of charsets: https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
Also useful: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html

Creating objects from Primitive avro schema

Suppose I have an Avro schema like this:
{ "type" : "string" }
How should I create an object from this schema in Java?
I did not find a way to do it directly with the Java Avro library, but you can still do this:
public static byte[] jsonToAvro(String json, Schema schema) throws IOException {
    DatumReader<Object> reader = new GenericDatumReader<>(schema);
    GenericDatumWriter<Object> writer = new GenericDatumWriter<>(schema);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
    Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
    Object datum = reader.read(null, decoder);
    writer.write(datum, encoder);
    encoder.flush();
    return output.toByteArray();
}
Schema PRIMITIVE = new Schema.Parser().parse("{ \"type\" : \"string\" }");
byte[] b = jsonToAvro("\"" + mystring + "\"", PRIMITIVE);
from How to avro binary encode my json string to a byte array?
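For completeness, the reverse direction is symmetric: decode the bytes with the same schema. A sketch, assuming the PRIMITIVE schema and byte[] b from above (Avro returns string values as org.apache.avro.util.Utf8):

DatumReader<Object> reader = new GenericDatumReader<>(PRIMITIVE);
Decoder decoder = DecoderFactory.get().binaryDecoder(b, null);
Object datum = reader.read(null, decoder); // a Utf8 instance for a string schema
String value = datum.toString();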

Avro: ReflectDatumWriter does not output schema information

See the following sample code:
User datum = new User("a123456", "my.email#world.com");
Schema schema = ReflectData.get().getSchema(datum.getClass());
DatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
ByteArrayOutputStream output = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
writer.write(datum, encoder);
encoder.flush();
byte[] bytes = output.toByteArray();
System.out.println(new String(bytes));
which produces:
a123456$my.email#world.com
I had presumed that all Avro writers would publish the schema information as well as the data, but this one does not.
I can successfully get the schema written if I use a GenericDatumWriter in combination with a DataFileWriter, but I want to use the ReflectDatumWriter, as I don't want to construct a GenericRecord myself (I want the library to do this).
How do I get the schema serialized as well?
I solved this myself: you need to use a DataFileWriter, whose create() method writes the schema.
The solution is to use it in conjunction with a ByteArrayOutputStream:
Schema schema = ReflectData.get().getSchema(User.class);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
ByteArrayOutputStream output = new ByteArrayOutputStream();
dataFileWriter.create(schema, output); // create() writes the schema header
GenericRecord user = createGenericRecord(schema); // user-supplied helper that populates a GenericRecord
dataFileWriter.append(user);
dataFileWriter.close();
byte[] bytes = output.toByteArray();
System.out.println(new String(bytes));
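Note that DataFileWriter works with any DatumWriter, including ReflectDatumWriter, so the GenericRecord step can be skipped and the POJO appended directly. A sketch, assuming the same User class as above:

Schema schema = ReflectData.get().getSchema(User.class);
DatumWriter<User> datumWriter = new ReflectDatumWriter<>(schema);
ByteArrayOutputStream output = new ByteArrayOutputStream();
try (DataFileWriter<User> dataFileWriter = new DataFileWriter<>(datumWriter)) {
    dataFileWriter.create(schema, output); // create() writes the schema header
    dataFileWriter.append(new User("a123456", "my.email#world.com")); // append the POJO directly
}
byte[] bytes = output.toByteArray();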

Avro write and read works on one machine and not on other

Here is some Avro code that runs on one machine but fails on the other with an exception.
We are not able to figure out what is wrong here.
Here is the code that is causing the problem.
Class<?> clazz = obj.getClass();
ReflectData rdata = ReflectData.AllowNull.get();
Schema schema = rdata.getSchema(clazz);
ByteArrayOutputStream os = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(os, null);
DatumWriter<T> writer = new ReflectDatumWriter<T>(schema, rdata);
writer.write(obj, encoder);
encoder.flush();
byte[] bytes = os.toByteArray();
String binaryString = new String (bytes, "ISO-8859-1");
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(binaryString.getBytes("ISO-8859-1"), null);
GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (schema);
GenericRecord record = datumReader.read(null, decoder);
Exception is:
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -32
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:437)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:427)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:189)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:187)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:263)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:216)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:173)
Adding -Dfile.encoding=UTF-8 to the Tomcat JVM parameters helped us resolve the issue.
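That fix points at the underlying cause: a byte[] round-tripped through a String is only lossless when the charset maps every byte one-to-one, and any code path that falls back to the platform default charset will corrupt the data on machines with a different default. A safer sketch, under the assumption that the bytes can be handed over directly, avoids the String entirely:

byte[] bytes = os.toByteArray();
// Decode straight from the byte[]; no charset conversion is involved.
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
GenericRecord record = datumReader.read(null, decoder);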
