What is the purpose of DatumWriter in Avro (Java)?

The Avro website has an example:
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
What is the purpose of DatumWriter<User>? I mean, what does it provide? It provides a write method, but instead of using it we use DataFileWriter. Can someone explain the design purpose of it?

The DatumWriter class is responsible for translating a given data object into an Avro record with a given schema (which in your case is extracted from the User class).
Given this record, the DataFileWriter is responsible for writing it to a file.
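The split becomes clearer when you use the DatumWriter on its own. A minimal sketch (reusing User and user1 from the example above; no container file involved):
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<User> writer = new SpecificDatumWriter<User>(User.class);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user1, encoder);  // the DatumWriter translates the object into Avro binary
encoder.flush();
byte[] datumBytes = out.toByteArray();  // just the record bytes: no schema, no container header
DataFileWriter adds what is missing here: the file header with the embedded schema, sync markers, and optional compression, while delegating the per-record encoding to the DatumWriter you pass in.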

Related

Avro's GenericDatumReader writer's schema

As stated in the Avro Getting Started guide about deserialization without code generation: "The data will be read using the writer's schema included in the file, and the reader's schema provided to the GenericDatumReader". Here is how the GenericDatumReader is created in the example:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
But when you look at the GenericDatumReader constructor Javadoc, it states: "Construct where the writer's and reader's schemas are the same." (and the actual code corresponds to this).
So the writer's schema isn't taken from the serialized file but from a constructor parameter? If so, how do you read data using the writer's schema, as described on that page?
I've received an answer on the Avro mailing list:
...writer schema can be adjusted after creation. This is what the DataFileReader does.
So after the DataFileReader is initialised, the underlying GenericDatumReader uses the schema in the file as the writer's schema (to understand the data), and the schema you provided as the reader's schema (to give data to you via dataFileReader.next(user)).
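In code, that mechanism looks roughly like this (a sketch; the file names are assumptions):
Schema readerSchema = new Schema.Parser().parse(new File("user.avsc"));
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readerSchema);
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<>(new File("users.avro"), datumReader);
// At this point the writer's schema has been read from the file header and
// installed into datumReader; readerSchema remains the expected (reader's) schema.
GenericRecord user = null;
while (dataFileReader.hasNext()) {
    user = dataFileReader.next(user);
}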

Creating DataFrame in Java based on ByteArrayInputStream

I need to convert the following to a Spark DataFrame in Java, preserving the structure according to the Avro schema. Then I'm going to write it to S3 based on this Avro structure.
GenericRecord r = new GenericData.Record(inAvroSchema);
r.put("id", "1");
r.put("cnt", 111);
Schema enumTest = SchemaBuilder.enumeration("name1")
    .namespace("com.name")
    .symbols("s1", "s2");
GenericData.EnumSymbol symbol = new GenericData.EnumSymbol(enumTest, "s1");
r.put("type", symbol);
ByteArrayOutputStream bao = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<>(inAvroSchema);
Encoder e = EncoderFactory.get().jsonEncoder(inAvroSchema, bao);
w.write(r, e);
e.flush();
I can create the object based on the JSON structure (reader being a GenericDatumReader<GenericRecord> built from inAvroSchema):
Object o = reader.read(null, DecoderFactory.get().jsonDecoder(inAvroSchema, new ByteArrayInputStream(bao.toByteArray())));
But maybe there is a way to create a DataFrame based on ByteArrayInputStream(bao.toByteArray())?
Thanks
No, you have to use a Data Source to read Avro data.
And it's crucial for Spark to read Avro as files from the filesystem, because many optimizations and features depend on it (such as compression and partitioning).
You have to add spark-avro (unless you are on Spark 2.4 or above, where Avro support is built in).
Note that the Avro enum type you are using will be a String in Spark's Dataset.
Also see this: Spark: Read an inputStream instead of File
Alternatively, you can consider deploying a bunch of tasks with SparkContext#parallelize and reading/writing the files explicitly with DatumReader/DatumWriter.
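For illustration, a rough sketch of the file-based route, assuming Spark 2.4+ (built-in Avro source), an existing SparkSession named spark, and the record r and schema inAvroSchema from the question; the S3 path is hypothetical:
// Write the in-memory record to an .avro container file first...
File avroFile = new File("/tmp/records.avro");
DataFileWriter<GenericRecord> fileWriter =
    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(inAvroSchema));
fileWriter.create(inAvroSchema, avroFile);
fileWriter.append(r);
fileWriter.close();

// ...then let Spark load it through the Avro data source.
Dataset<Row> df = spark.read().format("avro").load(avroFile.getPath());
df.printSchema();  // the Avro enum field appears as a string column
df.write().format("avro").save("s3a://your-bucket/path/");  // hypothetical target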

Can I transform a JSON-LD to a Java object?

EDIT: I changed my mind. I would like to find a way to generate the Java class and load the JSON as an object of that class.
I just discovered that there is a variant of JSON called JSON-LD.
It seems to me a more structured way of defining JSON, which reminds me of XML with an associated schema, like XSD.
Can I create a Java class from JSON-LD, load it at runtime, and use it to convert JSON-LD into an instance of that class?
I read the documentation of both implementations but found nothing about it. Maybe I read them wrong?
Doing a Google search brought me to a library that will decode the JSON-LD into an "undefined" Object.
// Open a valid json(-ld) input file
InputStream inputStream = new FileInputStream("input.json");
// Read the file into an Object (The type of this object will be a List, Map, String, Boolean,
// Number or null depending on the root object in the file).
Object jsonObject = JsonUtils.fromInputStream(inputStream);
// Create a context JSON map containing prefixes and definitions
Map<String, Object> context = new HashMap<>();
// Customise context...
// Create an instance of JsonLdOptions with the standard JSON-LD options
JsonLdOptions options = new JsonLdOptions();
// Customise options...
// Call whichever JSONLD function you want! (e.g. compact)
Object compact = JsonLdProcessor.compact(jsonObject, context, options);
// Print out the result (or don't, it's your call!)
System.out.println(JsonUtils.toPrettyString(compact));
https://github.com/jsonld-java/jsonld-java
Apparently, it can take the input from a plain string as well, as if reading it from a file or some other source. How you access the contents of the resulting object, I can't tell. The documentation seems moderately decent, though.
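If you just need the values, one plausible way (not from the project docs; the "name" key below is made up) is to walk the returned collections, since JsonUtils hands back plain Lists and Maps:
if (compact instanceof Map) {
    Map<?, ?> map = (Map<?, ?>) compact;
    Object name = map.get("name");  // hypothetical key from your context
    System.out.println(name);
}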
It seems to be an active project: the last commit was only 4 days ago, and it has 30 contributors. The license is BSD 3-Clause, if that makes any difference to you.
I'm not in any way associated with this project. I'm not an author, nor have I made any pull requests. It's just something I found.
Good luck and I hope this helped!
See this page: JSON-LD Module for Jackson

How to extract schema from an avro file in Java

How do you extract first the schema and then the data from an Avro file in Java? Identical to this question, except in Java.
I've seen examples of how to get the schema from an avsc file, but not from an avro file. What direction should I be looking in?
Schema schema = new Schema.Parser().parse(
    new File("/home/Hadoop/Avro/schema/emp.avsc"));
If you want to know the schema of an Avro file without having to generate the corresponding classes or care about which class the file belongs to, you can use the GenericDatumReader:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("file.avro"), datumReader);
Schema schema = dataFileReader.getSchema();
System.out.println(schema);
And then you can read the data inside the file:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
    record = dataFileReader.next(record);
    System.out.println(record);
}
Thanks to @Helder Pereira for his answer. As a complement, the schema can also be fetched from getSchema() of a GenericRecord instance.
Here is a live demo about it; the link above shows how to get the data and schema in Java for the Parquet, ORC and Avro data formats.
You can use the Databricks library as shown here: https://github.com/databricks/spark-avro. It will load the Avro file into a DataFrame (Dataset<Row>).
Once you have a Dataset<Row>, you can directly get the schema using df.schema()
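For example, a quick sketch (assuming an existing SparkSession named spark and spark-avro on the classpath):
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.avro")
    .load("users.avro");
StructType sparkSchema = df.schema();  // Spark StructType derived from the Avro schema
System.out.println(sparkSchema.treeString());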

How to generate schema-less avro files using apache avro?

I am using Apache Avro for data serialization. Since the data has a fixed schema, I do not want the schema to be part of the serialized data. In the following example, the schema is part of the Avro file "users.avro".
User user1 = new User();
user1.setName("Alyssa");
user1.setFavoriteNumber(256);
User user2 = new User("Ben", 7, "red");
User user3 = User.newBuilder()
    .setName("Charlie")
    .setFavoriteColor("blue")
    .setFavoriteNumber(null)
    .build();
// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
Can anyone please tell me how to store Avro files without the schema embedded in them?
Here you can find a comprehensive how-to in which I explain how to achieve schema-less serialization using Apache Avro.
A companion test campaign shows some figures on the performance that you might expect.
The code is on GitHub: the example and test classes show how to use the Data Reader and Writer with a stub class generated by Avro itself.
Should be doable.
Given an encoder, you can use a DatumWriter to write data directly to a ByteArrayOutputStream (which you can then write to a java.io.File).
Here's how to get started in Scala (from Salat-Avro):
val baos = new ByteArrayOutputStream
val encoder = EncoderFactory.get().binaryEncoder(baos, null)
val writer = new SpecificDatumWriter[MyRecord](classOf[MyRecord]) // MyRecord: your Avro-generated class
writer.write(myRecord, encoder) // the DatumWriter (not the encoder) does the writing
encoder.flush()
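And for the read side in Java: with no file header to take the schema from, the reader has to be given the same fixed schema, e.g. via the generated class. A minimal sketch, where bytes is the baos.toByteArray() produced above and User is the generated class from the question:
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
DatumReader<User> reader = new SpecificDatumReader<User>(User.class);
User decoded = reader.read(null, decoder);  // schema comes from the class, not from the bytes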
