Avro's GenericDatumReader writer's schema - java

As stated in the Avro Getting Started guide about deserialization without code generation: "The data will be read using the writer's schema included in the file, and the reader's schema provided to the GenericDatumReader". Here is how the GenericDatumReader is created in the example:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
But when you look at this GenericDatumReader constructor Javadoc it states "Construct where the writer's and reader's schemas are the same." (and actual code corresponds to this).
So the writer's schema isn't taken from the serialized file but from the constructor parameter? If so, how do I read data using the writer's schema as described on that page?

I've received an answer on Avro mailing list:
...writer schema can be adjusted after creation. This is what the DataFileReader does.
So after the DataFileReader is initialised, the underlying GenericDatumReader uses the schema in the file as the write schema (to understand the data), and the schema you provided as the read schema (to give data to you via dataFileReader.next(user)).
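The behaviour described in that answer can be sketched as follows. This is a minimal illustration, not the example from the Avro documentation: the two schemas, the class name ReaderSchemaDemo and the temporary file are hypothetical, and it assumes the Avro Java library is on the classpath. The file is written with a one-field schema, then read with a GenericDatumReader that was constructed with only the reader's schema. DataFileReader then sets the writer's schema from the file header.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

class ReaderSchemaDemo {
    // Hypothetical writer's schema: the only field is "name".
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Hypothetical reader's schema: adds "age" with a default value.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

    static GenericRecord roundTrip() {
        try {
            File file = File.createTempFile("users", ".avro");
            file.deleteOnExit();

            // Write one record; the container file stores WRITER in its header.
            GenericRecord user = new GenericData.Record(WRITER);
            user.put("name", "Alyssa");
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(WRITER))) {
                writer.create(WRITER, file);
                writer.append(user);
            }

            // Only the reader's schema is passed here. DataFileReader replaces
            // the writer's schema with the one found in the file header.
            GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(READER);
            try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, datumReader)) {
                return reader.next();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned record follows the reader's schema: "name" comes from the file, while "age" is filled in from the default because the writer's schema does not contain it.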


How to extract schema from an avro file in Java

How do you extract first the schema and then the data from an avro file in Java? Identical to this question except in java.
I've seen examples of how to get the schema from an avsc file but not an avro file. What direction should I be looking in?
Schema schema = new Schema.Parser().parse(
new File("/home/Hadoop/Avro/schema/emp.avsc")
);
If you want to know the schema of an Avro file without having to generate the corresponding classes or care about which class the file belongs to, you can use the GenericDatumReader:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("file.avro"), datumReader);
Schema schema = dataFileReader.getSchema();
System.out.println(schema);
And then you can read the data inside the file:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
record = dataFileReader.next(record);
System.out.println(record);
}
Thanks to @Helder Pereira for the answer. As a complement, the schema can also be fetched with getSchema() on a GenericRecord instance.
Here is a live demo of it; the link above shows how to get the data and schema in Java for the Parquet, ORC and Avro data formats.
You can use the Databricks library, as shown here https://github.com/databricks/spark-avro, which will load the Avro file into a DataFrame (Dataset<Row>)
Once you have a Dataset<Row>, you can directly get the schema using df.schema()

EXI get JAXB unmarshaller

I wish to know the EXI equivalent of the JAXB unmarshaller.
I have looked at the EXI examples, and have successfully obtained an EXIFactory, set the grammar, and got the XMLReader.
The example then creates a transformer to transform EXI stream to XML stream.
However, I do not need the output stream. I just need the unmarshalled result to stay as in-memory POJOs. I need the result to be direct unmarshall of EXI. I am using EXI marshall/unmarshall as a faster alternative to text XML.
Forgot to say which library I was using. Here it is:
<groupId>com.siemens.ct.exi</groupId>
<artifactId>exificient</artifactId>
<version>0.9.6</version>
The JAXB Marshaller/Unmarshaller let you set various input/output mechanisms
e.g.
Unmarshaller.unmarshal( javax.xml.transform.Source source )
or
Marshaller.marshal( Object jaxbElement, javax.xml.transform.Result result )
EXIficient implements
javax.xml.transform.Source (see com.siemens.ct.exi.api.sax.EXISource)
javax.xml.transform.Result (see com.siemens.ct.exi.api.sax.EXIResult)
Both EXISource and EXIResult can be initialized with the EXIFactory.
Hope this helps,
-- Daniel

Reading an Avro Record which was written with a different writer schema

Let's say I'm deserializing a GenericRecord which was written with an Encoder (not a FileWriter, so the schema's not stored with the serialized Record). And I'm using a reader schema which is a superset of the writer schema (i.e., the reader schema contains all the fields of the writer schema, plus a few more).
When I attempt to read the Record, do I need to know what the writer schema was? Specifically, do I need to provide both the writer schema and the reader schema when I instantiate my GenericDatumReader?
Or can I create a GenericDatumReader, specifying only the reader schema, and have the reader deserialize whatever fields it finds in the encoded Record and provide defaults for the rest of the fields (as specified by the reader schema)?
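For context, Avro's binary encoding carries no field names or tags, so a record written with a bare Encoder can only be decoded by a reader that knows the schema it was written with. A minimal sketch of that case, passing both schemas to the two-argument GenericDatumReader constructor, might look like this. The schemas, the class name and the field names are hypothetical and chosen only for illustration, and the Avro Java library is assumed to be on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

class EncoderSchemaResolution {
    // Hypothetical writer's schema.
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Hypothetical reader's schema: a superset with a defaulted extra field.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // Serialize with a plain BinaryEncoder: no schema is stored with the bytes.
    static byte[] encode(String name) {
        try {
            GenericRecord record = new GenericData.Record(WRITER);
            record.put("name", name);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(WRITER).write(record, encoder);
            encoder.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize by supplying the writer's schema and the reader's schema.
    static GenericRecord decode(byte[] bytes) {
        try {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(WRITER, READER);
            return reader.read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With both schemas supplied, the fields present in the bytes are decoded using the writer's schema and the extra reader field takes its declared default.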

What is the purpose of DatumWriter in Avro

Avro website has an example:
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
What is the purpose of DatumWriter<User>? I mean, what does it provide? It provides a write method, but instead of using it we use DataFileWriter. Can someone explain the design purpose of it?
The DatumWriter class is responsible for translating the given data object into an Avro record with a given schema (which in your case is extracted from the User class).
Given this record, the DataFileWriter is responsible for writing it to a file.

java.net.ConnectException : Validating Xml against XSD : local machine

I need to validate an XML file against a local XSD, and I do not have an internet connection on the target machine (on which this process runs). The code looks like this:
SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
File schemaLocation = new File(xsd);
Schema schema = factory.newSchema(schemaLocation);
Validator validator = schema.newValidator();
Source source = new StreamSource(new BufferedInputStream(new FileInputStream(new File(xml))));
validator.validate(source);
I always get a java.net.ConnectException when validate() is called.
Can you please let me know what is not being done correctly?
Many Thanks.
Abhishek
Agreed with Mads' comment - there are likely many references here that will attempt outgoing connections to the Internet, and you will need to download local copies for them. However, I'd advise against changing references within the XML or schema files, etc. - but instead, provide an EntityResolver to return the contents of your local copies instead of connecting out to the Internet. (I previously wrote a little bit about this at http://blogger.ziesemer.com/2009/01/xml-and-xslt-tips-and-tricks-for-java.html#InputValidation.)
However, in your case, since you're using a Validator, call Validator.setResourceResolver(...) instead and pass in an LSResourceResolver before calling validate.
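A rough sketch of the LSResourceResolver approach, using only classes from the JDK, is shown here. Everything specific in it is an assumption made for illustration: the system ID http://example.invalid/note.dtd, the inline DTD, schema and document strings, and the class name OfflineValidation. It also assumes the outgoing connection is caused by an external DTD referenced from the XML document, which is only one of the possible causes. In a real setup the resolver would return the contents of your local copies of whichever resources are being fetched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

class OfflineValidation {
    // Hypothetical remote reference and the local content served in its place.
    static final String REMOTE_DTD = "http://example.invalid/note.dtd";
    static final String LOCAL_DTD = "<!ELEMENT note (#PCDATA)>";
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/></xs:schema>";
    static final String XML =
        "<!DOCTYPE note SYSTEM '" + REMOTE_DTD + "'><note>hi</note>";

    // Validates XML against XSD and returns the system IDs served locally.
    static List<String> validateOffline() {
        try {
            List<String> resolved = new ArrayList<>();
            DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");

            LSResourceResolver resolver = (type, namespaceURI, publicId, systemId, baseURI) -> {
                if (!REMOTE_DTD.equals(systemId)) {
                    return null; // not ours: fall back to default resolution
                }
                resolved.add(systemId);
                LSInput input = ls.createLSInput();
                input.setPublicId(publicId);
                input.setSystemId(systemId);
                input.setCharacterStream(new StringReader(LOCAL_DTD));
                return input;
            };

            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.setResourceResolver(resolver); // must be set before validate()
            validator.validate(new StreamSource(new StringReader(XML)));
            return resolved;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the failing lookups come from xs:import or xs:include elements inside the XSD rather than from the XML document, the same resolver can be set on the SchemaFactory with setResourceResolver before calling newSchema.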
