I have an Avro schema file, customer.avsc. I have already created the Avro object with the generated builder, and I can read it back. I am wondering how to convert the Customer Avro object into a byte array and store it in the database. Thanks a lot!
public static void main(String[] args) {
    // we can now build a customer in a "safe" way
    Customer.Builder customerBuilder = Customer.newBuilder();
    customerBuilder.setAge(30);
    customerBuilder.setFirstName("Mark");
    customerBuilder.setLastName("Simpson");
    customerBuilder.setAutomatedEmail(true);
    customerBuilder.setHeight(180f);
    customerBuilder.setWeight(90f);
    Customer customer = customerBuilder.build();
    System.out.println(customer);
    System.out.println(111111);

    // write it out to a file
    final DatumWriter<Customer> datumWriter = new SpecificDatumWriter<>(Customer.class);
    try (DataFileWriter<Customer> dataFileWriter = new DataFileWriter<>(datumWriter)) {
        dataFileWriter.create(customer.getSchema(), new File("customer-specific.avro"));
        dataFileWriter.append(customer);
        System.out.println("successfully wrote customer-specific.avro");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I am using a BinaryEncoder to solve this problem. With this approach the Avro object can be converted into a byte array and saved into the MySQL database. Then, when receiving the data from Kafka (bytes -> MySQL -> Debezium connector -> Kafka -> consumer API), I can decode the payload of that byte column back into an Avro / Java object with the same schema.
Here is the code.
Customer.Builder customerBuilder = Customer.newBuilder();
customerBuilder.setAge(20);
customerBuilder.setFirstName("first");
customerBuilder.setLastName("last");
customerBuilder.setAutomatedEmail(true);
customerBuilder.setHeight(180f);
customerBuilder.setWeight(90f);
Customer customer = customerBuilder.build();
DatumWriter<SpecificRecord> writer = new SpecificDatumWriter<SpecificRecord>(
customer.getSchema());
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(customer, encoder);
encoder.flush();
out.close();
byte[] serializedBytes = out.toByteArray();
System.out.println("Sending message in bytes : " + serializedBytes);
// String serializedHex = Hex.encodeHexString(serializedBytes);
// System.out.println("Serialized Hex String : " + serializedHex);
// KeyedMessage<String, byte[]> message = new KeyedMessage<String, byte[]>("page_views", serializedBytes);
// producer.send(message);
// producer.close();
DatumReader<Customer> userDatumReader = new SpecificDatumReader<Customer>(Customer.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(serializedBytes, null);
SpecificRecord datum = userDatumReader.read(null, decoder);
System.out.println(datum);
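For the database part, one option is to store the byte array in a BLOB / VARBINARY column. Below is a minimal JDBC sketch (not part of the original code); the customers table, its columns, and the connection details are placeholders you would adapt to your own setup.
// Minimal sketch: persist the serialized bytes with JDBC (requires java.sql.Connection,
// java.sql.DriverManager, java.sql.PreparedStatement). Table and connection details are placeholders.
String sql = "INSERT INTO customers (id, payload) VALUES (?, ?)";
try (Connection conn = DriverManager.getConnection(
         "jdbc:mysql://localhost:3306/mydb", "user", "password");
     PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setLong(1, 1L);                // illustrative primary key
    ps.setBytes(2, serializedBytes);  // the Avro-encoded payload from above
    ps.executeUpdate();
}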
I would like to read an HDFS folder containing Avro files with Spark, and then deserialize the Avro events contained in these files. I would like to do it without the com.databricks library (or any other library that makes it easy).
The problem is that I have difficulties with the deserialization.
I assume that my Avro file is compressed with Snappy because at the beginning of the file (just after the schema), I have
avro.codecsnappy
written. It is then followed by readable or unreadable characters.
My first attempt to deserialize the Avro event is the following:
public static String deserialize(String message) throws IOException {
    Schema.Parser schemaParser = new Schema.Parser();
    Schema avroSchema = schemaParser.parse(defaultFlumeAvroSchema);
    DatumReader<GenericRecord> specificDatumReader = new SpecificDatumReader<GenericRecord>(avroSchema);
    byte[] messageBytes = message.getBytes();
    Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
    GenericRecord genericRecord = specificDatumReader.read(null, decoder);
    return genericRecord.toString();
}
This function works when I want to deserialize an Avro file that doesn't have the avro.codecsnappy marker in it. When it does, I get the error:
Malformed data : length is negative : -50
So I tried another way of doing it, which is:
private static void deserialize2(String path) throws IOException {
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new File(path), reader);
    System.out.println(fileReader.getSchema().toString());
    GenericRecord record = new GenericData.Record(fileReader.getSchema());
    int numEvents = 0;
    while (fileReader.hasNext()) {
        fileReader.next(record);
        ByteBuffer body = (ByteBuffer) record.get("body");
        CharsetDecoder decoder = Charsets.UTF_8.newDecoder();
        System.out.println("Position of the index " + body.position());
        System.out.println("Size of the array : " + body.array().length);
        String bodyStr = decoder.decode(body).toString();
        System.out.println("THE BODY STRING ---> " + bodyStr);
        numEvents++;
    }
    fileReader.close();
}
and it returns the following output:
Position of the index 0
Size of the array : 127482
THE BODY STRING --->
I can see that the array isn't empty, but it just returns an empty string.
How can I proceed?
Use this when converting to string:
String bodyStr = new String(body.array());
System.out.println("THE BODY STRING ---> " + bodyStr);
Source: https://www.mkyong.com/java/how-do-convert-byte-array-to-string-in-java/
Well, it seems that you are on the right track. However, your ByteBuffer might not expose a proper byte[] array to decode, so let's try the following instead:
byte[] bytes = new byte[body.remaining()];
body.get(bytes);
String result = new String(bytes, "UTF-8"); // you may need to change the charset
This should work; you have shown in your question that the ByteBuffer contains actual data. As noted in the code example, you might have to change the charset.
List of charsets: https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
Also useful: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html
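Putting that together with the loop from deserialize2, a minimal sketch of the body extraction might look like this (the field name "body" and the UTF-8 charset are taken from the question; adjust them if your schema differs):
while (fileReader.hasNext()) {
    fileReader.next(record);
    ByteBuffer body = (ByteBuffer) record.get("body");
    byte[] bytes = new byte[body.remaining()];
    body.get(bytes);                                             // copy only the readable bytes
    String bodyStr = new String(bytes, StandardCharsets.UTF_8);  // java.nio.charset.StandardCharsets
    System.out.println("THE BODY STRING ---> " + bodyStr);
    numEvents++;
}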
See the following sample code:
User datum = new User("a123456", "my.email#world.com");
Schema schema = ReflectData.get().getSchema(datum.getClass());
DatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
ByteArrayOutputStream output = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(output, null);
writer.write(datum, encoder);
encoder.flush();
byte[] bytes = output.toByteArray();
System.out.println(new String(bytes));
which produces:
a123456$my.email#world.com
I had presumed that all Avro writers would publish the schema information as well as the data, but this one does not.
I can successfully get the schema printed if I use the GenericDatumWriter in combination with a DataFileWriter, but I wish to use the ReflectDatumWriter because I don't want to construct a GenericRecord myself (I want the library to do this).
How do I get the schema serialized as well?
I solved this myself: you need to use a DataFileWriter, as its create() method writes the schema.
The solution is to use it in conjunction with a ByteArrayOutputStream:
Schema schema = ReflectData.get().getSchema(User.class);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
ByteArrayOutputStream output = new ByteArrayOutputStream();
dataFileWriter.create(schema, output);
GenericRecord user = createGenericRecord(schema);
dataFileWriter.append(user);
dataFileWriter.close();
byte[] bytes = output.toByteArray();
System.out.println(new String(bytes));
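To confirm that the schema really is embedded in those bytes, a short sketch (not part of the original answer) can read them back through a DataFileStream over a ByteArrayInputStream and print the recovered schema along with the records:
DatumReader<GenericRecord> genericReader = new GenericDatumReader<>();
try (DataFileStream<GenericRecord> stream =
         new DataFileStream<>(new ByteArrayInputStream(bytes), genericReader)) {
    System.out.println(stream.getSchema());  // schema recovered from the container header
    while (stream.hasNext()) {
        System.out.println(stream.next());
    }
}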
Here is some Avro code that runs on one machine but fails on another with an exception.
We are not able to figure out what's wrong here.
Here is the code that is causing the problem:
Class<?> clazz = obj.getClass();
ReflectData rdata = ReflectData.AllowNull.get();
Schema schema = rdata.getSchema(clazz);
ByteArrayOutputStream os = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(os, null);
DatumWriter<T> writer = new ReflectDatumWriter<T>(schema, rdata);
writer.write(obj, encoder);
encoder.flush();
byte[] bytes = os.toByteArray();
String binaryString = new String (bytes, "ISO-8859-1");
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(binaryString.getBytes("ISO-8859-1"), null);
GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (schema);
GenericRecord record = datumReader.read(null, decoder);
Exception is:
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -32
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:437)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:427)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:189)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:187)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:263)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:216)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:173)
Adding -Dfile.encoding=UTF-8 to the Tomcat JVM parameters helped us resolve the issue.
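Another option (just a sketch, reusing the variables from the code above) is to avoid the String round trip altogether and hand the raw bytes straight to the decoder, which removes any dependence on the JVM's charset settings:
byte[] rawBytes = os.toByteArray();
BinaryDecoder rawDecoder = DecoderFactory.get().binaryDecoder(rawBytes, null);
GenericDatumReader<GenericRecord> rawReader = new GenericDatumReader<GenericRecord>(schema);
GenericRecord decodedRecord = rawReader.read(null, rawDecoder);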
I am trying to convert a JSON string into a generic Java object using an Avro schema.
Below is my code:
String json = "{\"foo\": 30.1, \"bar\": 60.2}";
String schemaLines = "{\"type\":\"record\",\"name\":\"FooBar\",\"namespace\":\"com.foo.bar\",\"fields\":[{\"name\":\"foo\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"bar\",\"type\":[\"null\",\"double\"],\"default\":null}]}";
InputStream input = new ByteArrayInputStream(json.getBytes());
DataInputStream din = new DataInputStream(input);
Schema schema = Schema.parse(schemaLines);
Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);
DatumReader<Object> reader = new GenericDatumReader<Object>(schema);
Object datum = reader.read(null, decoder);
I get "org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_FLOAT" Exception.
The same code works, if I don't have unions in the schema.
Can someone please explain and give me a solution.
For anyone who uses Avro 1.8.2: JsonDecoder is no longer directly instantiable outside the package org.apache.avro.io. You can use DecoderFactory for it, as shown in the following code:
String schemaStr = "<some json schema>";
String genericRecordStr = "<some json record>";
Schema.Parser schemaParser = new Schema.Parser();
Schema schema = schemaParser.parse(schemaStr);
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = decoderFactory.jsonDecoder(schema, genericRecordStr);
DatumReader<GenericData.Record> reader =
new GenericDatumReader<>(schema);
GenericRecord genericRecord = reader.read(null, decoder);
Thanks to Reza, I found this webpage, which shows how to convert a JSON string into an Avro object:
http://rezarahim.blogspot.com/2013/06/import-org_26.html
The key part of his code is:
static byte[] fromJsonToAvro(String json, String schemastr) throws Exception {
    InputStream input = new ByteArrayInputStream(json.getBytes());
    DataInputStream din = new DataInputStream(input);
    Schema schema = Schema.parse(schemastr);
    Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);
    DatumReader<Object> reader = new GenericDatumReader<Object>(schema);
    Object datum = reader.read(null, decoder);
    GenericDatumWriter<Object> w = new GenericDatumWriter<Object>(schema);
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null);
    w.write(datum, e);
    e.flush();
    return outputStream.toByteArray();
}
String json = "{\"username\":\"miguno\",\"tweet\":\"Rock: Nerf paper, scissors is fine.\",\"timestamp\": 1366150681 }";
String schemastr ="{ \"type\" : \"record\", \"name\" : \"twitter_schema\", \"namespace\" : \"com.miguno.avro\", \"fields\" : [ { \"name\" : \"username\", \"type\" : \"string\", \"doc\" : \"Name of the user account on Twitter.com\" }, { \"name\" : \"tweet\", \"type\" : \"string\", \"doc\" : \"The content of the user's Twitter message\" }, { \"name\" : \"timestamp\", \"type\" : \"long\", \"doc\" : \"Unix epoch time in seconds\" } ], \"doc:\" : \"A basic schema for storing Twitter messages\" }";
byte[] avroByteArray = fromJsonToAvro(json,schemastr);
Schema schema = Schema.parse(schemastr);
DatumReader<GenericRecord> reader1 = new GenericDatumReader<GenericRecord>(schema);
Decoder decoder1 = DecoderFactory.get().binaryDecoder(avroByteArray, null);
GenericRecord result = reader1.read(null, decoder1);
With Avro 1.4.1, this works:
private static GenericData.Record parseJson(String json, String schema)
        throws IOException {
    Schema parsedSchema = Schema.parse(schema);
    Decoder decoder = new JsonDecoder(parsedSchema, json);
    DatumReader<GenericData.Record> reader =
            new GenericDatumReader<>(parsedSchema);
    return reader.read(null, decoder);
}
Might need some tweaks for later Avro versions.
As was already mentioned in the comments, the JSON that the Avro libraries understand is a bit different from a normal JSON object. Specifically, a UNION type is wrapped in a nested object structure: "union_field": {"type": "value"}.
So if you want to convert "normal" JSON to Avro, you'll have to use a 3rd-party library, for now at least:
https://github.com/allegro/json-avro-converter - a Java project that claims to support unions; not sure about default values (a rough usage sketch follows this list).
https://github.com/agolovenko/json-to-avro-converter - this is my project; although written in Scala, it is still usable from Java. Supports unions, default values, base64 binary data...
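For the first library, usage looks roughly like the sketch below; the class and method names (JsonAvroConverter, convertToAvro, convertToGenericDataRecord) are given from memory, so treat them as assumptions and check the project's README before relying on them:
// Rough sketch, assuming the tech.allegro.schema.json2avro.converter API; json is a JSON string
// and schema a parsed org.apache.avro.Schema. Verify the names against the project's README.
JsonAvroConverter converter = new JsonAvroConverter();
byte[] avroBytes = converter.convertToAvro(json.getBytes(), schema);
GenericData.Record record = converter.convertToGenericDataRecord(json.getBytes(), schema);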
Your schema does not match the schema of the JSON string. You need a different schema that has a plain double instead of the union at the place where the error occurs. Such a schema should then be used as the writer schema, while you can freely use the other one as the reader schema.
The problem is not the code but the wrong format of the JSON; the union values have to be wrapped with their branch type:
String json = "{\"foo\": {\"double\": 30.1}, \"bar\": {\"double\": 60.2}}";