Merge two Avro schemas programmatically - Java

I have two similar schemas where only one nested field changes (it is called onefield in schema1 and anotherfield in schema2).
schema1
{
  "type": "record",
  "name": "event",
  "namespace": "foo",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "event",
        "namespace": "foo.metadata",
        "fields": [
          {
            "name": "onefield",
            "type": [
              "null",
              "string"
            ],
            "default": null
          }
        ]
      },
      "default": null
    }
  ]
}
schema2
{
  "type": "record",
  "name": "event",
  "namespace": "foo",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "event",
        "namespace": "foo.metadata",
        "fields": [
          {
            "name": "anotherfield",
            "type": [
              "null",
              "string"
            ],
            "default": null
          }
        ]
      },
      "default": null
    }
  ]
}
I am able to programmatically merge both schemas using Avro 1.8.0:
Schema s1 = new Schema.Parser().parse(schema1);
Schema s2 = new Schema.Parser().parse(schema2);
Schema[] schemas = {s1, s2};
Schema mergedSchema = null;
for (Schema schema: schemas) {
    mergedSchema = AvroStorageUtils.mergeSchema(mergedSchema, schema);
}
and use it to convert an input JSON into an Avro or JSON representation:
JsonAvroConverter converter = new JsonAvroConverter();
try {
    byte[] example = new String("{}").getBytes("UTF-8");
    byte[] avro = converter.convertToAvro(example, mergedSchema);
    byte[] json = converter.convertToJson(avro, mergedSchema);
    System.out.println(new String(json));
} catch (AvroConversionException e) {
    e.printStackTrace();
}
That code shows the expected output: {"metadata":{"onefield":null,"anotherfield":null}}. The issue is that I am not able to see the merged schema. If I do a simple System.out.println(mergedSchema) I get the following exception:
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: merged schema (generated by AvroStorage).merged
at org.apache.avro.Schema$Names.put(Schema.java:1127)
at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:561)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:689)
at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:715)
at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:700)
at org.apache.avro.Schema.toString(Schema.java:323)
at org.apache.avro.Schema.toString(Schema.java:313)
at java.lang.String.valueOf(String.java:2982)
at java.lang.StringBuilder.append(StringBuilder.java:131)
I call it the Avro uncertainty principle :). It looks like Avro is able to work with the merged schema, but it fails when it tries to serialize the schema to JSON. The merge works with simpler schemas, so it sounds like a bug in Avro 1.8.0 to me.
Do you know what could be happening or how to solve it? Any workaround (ex: alternative Schema serializers) is welcome.

I found the same issue with the Pig util class... actually there are two bugs here:
Avro allows serializing data through GenericDatumWriter using an invalid schema.
The piggybank util class generates invalid schemas because it uses the same name/namespace for all the merged fields (instead of keeping the original names).
The Kite SDK's SchemaUtil handles this properly, even for more complex scenarios: https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/SchemaUtil.java#L511
Schema mergedSchema = SchemaUtil.merge(s1, s2);
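For completeness, a self-contained sketch of the Kite-based merge (the class name and the file-path arguments are made up for illustration; it assumes org.kitesdk:kite-data-core and Avro are on the classpath):
import java.io.File;
import org.apache.avro.Schema;
import org.kitesdk.data.spi.SchemaUtil;

public class MergeWithKite {
    public static void main(String[] args) throws Exception {
        // args[0] / args[1]: paths to the two .avsc files from the question
        Schema s1 = new Schema.Parser().parse(new File(args[0]));
        Schema s2 = new Schema.Parser().parse(new File(args[1]));
        Schema mergedSchema = SchemaUtil.merge(s1, s2);
        // unlike the piggybank result, this schema serializes back to JSON without errors
        System.out.println(mergedSchema.toString(true));
    }
}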
From your example, I am getting the following output
{
  "type": "record",
  "name": "event",
  "namespace": "foo",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "event",
        "namespace": "foo.metadata",
        "fields": [
          {
            "name": "onefield",
            "type": [
              "null",
              "string"
            ],
            "default": null
          },
          {
            "name": "anotherfield",
            "type": [
              "null",
              "string"
            ],
            "default": null
          }
        ]
      },
      "default": null
    }
  ]
}
Hopefully this will help others.

Merging schemas is not supported for Avro files yet.
But let's say you have multiple Avro files with different schemas in one directory, e.g. /demo. You can read them through Spark and provide one master schema file (i.e. an .avsc file); Spark will internally read all the records from the files, and if any file is missing a column it will show a null value for it.
import java.io.File
import org.apache.avro.Schema
import org.apache.spark.sql.SparkSession

object AvroSchemaEvolution {
  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser()
      .parse(new File("C:\\Users\\murtazaz\\Documents\\Avro_Schema_Evolution\\schema\\emp_inserted.avsc"))
    val spark = SparkSession.builder().master("local").getOrCreate()
    spark.read
      .format("com.databricks.spark.avro")
      .option("avroSchema", schema.toString)
      .load("C:\\Users\\murtazaz\\Documents\\Avro_Schema_Evolution\\demo")
      .show()
  }
}

Related

How to insert null values for an Avro map

I have a use case where I need to allow null values for an Avro map, but it seems like Avro doesn't allow unions for map values. Basically, I need to implement the functionality of a POJO defined as Map<String, Optional<String>>.
How can I achieve this?
The following Avro schema throws a "No type" error:
Error:
org.apache.avro:avro-maven-plugin:1.10.0: schema failed:
No type: {"type":["null","string"]}
{
  "namespace": "com.testclass.avro",
  "name": "test",
  "type": "record",
  "fields": [
    {
      "name": "user",
      "type": {
        "name": "userdetails",
        "type": "record",
        "fields": [
          {
            "name": "isPresent",
            "type": "boolean"
          },
          {
            "name": "address",
            "type": {
              "type": "map",
              "name": "address",
              "values": {
                "type": ["null","string"]
              }
            }
          }
        ]
      }
    }
  ]
}
Specifying the value explicitly as a string within the JSON data helped solve the problem:
"address":{"test":{"string":"a"}, "test2":{"string":"a"}}

Avro schema for a given class

What would be the equivalent Avro schema for the following class?
class A {
    String s;
    List<String> l;
}
I have the following, but it doesn't work:
{
  "name" : "A",
  "type": "record",
  "fields": [
    {
      "name": "s",
      "type": "string"
    },
    {
      "name": "l"
      "type": "array",
      "items": "string"
    }
  ]
}
I believe the array type needs to be nested inside another object, like so:
{
  "name" : "A",
  "type": "record",
  "fields": [
    {
      "name": "s",
      "type": "string"
    },
    {
      "name": "l",
      "type": {
        "type": "array",
        "items": "string"
      }
    }
  ]
}
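For what it's worth, a quick way to sanity-check the corrected schema with plain Avro (hypothetical class name; the schema string is just the JSON above, collapsed onto one line):
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ClassASchemaCheck {
    public static void main(String[] args) {
        String schemaJson = "{\"name\":\"A\",\"type\":\"record\",\"fields\":["
                + "{\"name\":\"s\",\"type\":\"string\"},"
                + "{\"name\":\"l\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}]}";
        Schema schema = new Schema.Parser().parse(schemaJson); // parses without errors
        GenericRecord a = new GenericData.Record(schema);
        a.put("s", "hello");
        a.put("l", Arrays.asList("x", "y"));
        System.out.println(a); // {"s": "hello", "l": ["x", "y"]}
    }
}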
You can use Avro IDL to express basically the same thing:
protocol SampleProtocol {
  record A {
    string s;
    array<string> l;
  }
}
You can refer to the Avro documentation for how to actually get a Java ArrayList when generating the class; otherwise it's an array.

Query Elastic DSL - search query using Spring Boot Data

I have the following mapping, generated via Java and Spring Boot Data Elasticsearch. It is generated from a User.java class; the property "friends" is a List<Friends>, where Friends is defined in Friends.java, and both classes act as the model. Essentially I want to produce a SELECT statement, but in Query DSL, using Spring Boot Data. The index is called user.
So I am trying to achieve the following: SELECT * FROM User WHERE (userName="Tom" OR nickname="Tom" OR friendsNickname="Tom") AND userID="3793"
or (verbose-dsl)
match where (userName="Tom" OR nickname="Tom" OR friendsNickname="Tom") AND userID="3793"
"mappings": {
"properties": {
"_class": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"userName": {
"type": "text"
},
"userId": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"friends": {
"type": "nested",
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"age": {
"type": "text"
},
"friendsNickname": {
"type": "text"
}
}
},
"nickname": {
"type": "text"
}
}
}
I have tried the following code, but it returns 0 hits from Elasticsearch:
BoolQueryBuilder query =
QueryBuilders.boolQuery()
.must(
QueryBuilders.boolQuery()
.should(QueryBuilders.matchQuery("userName", "Tom"))
.should(QueryBuilders.matchQuery("nickname", "Tom"))
.should(
QueryBuilders.nestedQuery(
"friends",
QueryBuilders.matchQuery("friendsNickname", "Tom"),
ScoreMode.None)))
.must(QueryBuilders.boolQuery().must(QueryBuilders.matchQuery("userID", "3793")));
Apologies if this seems like a simple question; my knowledge of ES is quite thin, so sorry if there is an obvious answer.
Great start!!
You just have a tiny mistake on the following line: you need to prefix the field name with the nested field name, i.e. friends.friendsNickname
...
QueryBuilders.matchQuery("friends.friendsNickname", "Tom"),  // <-- note the "friends." prefix
...
Also you have another typo where the userID should read userId according to your mapping.
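Putting both corrections into your query, it would look roughly like this (a sketch; apart from the two field names nothing is changed, and it assumes the same QueryBuilders/ScoreMode imports as your code):
BoolQueryBuilder query =
    QueryBuilders.boolQuery()
        .must(
            QueryBuilders.boolQuery()
                .should(QueryBuilders.matchQuery("userName", "Tom"))
                .should(QueryBuilders.matchQuery("nickname", "Tom"))
                .should(
                    QueryBuilders.nestedQuery(
                        "friends",
                        QueryBuilders.matchQuery("friends.friendsNickname", "Tom"), // prefixed
                        ScoreMode.None)))
        .must(QueryBuilders.matchQuery("userId", "3793")); // userId, not userID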
Use friends.friendsNickname and also use termsQuery on userId.keyword:
.must(QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("userName", "Tom"))
        .should(QueryBuilders.matchQuery("nickname", "Tom"))
        .should(QueryBuilders.matchQuery("friends.friendsNickname", "Tom"))
)
.must(QueryBuilders.termsQuery("userId.keyword", "3793"));
Although I recommend changing userName and userId to keyword:
"userId": {
"type": "keyword",
"ignore_above": 256,
"fields": {
"text": {
"type": "text"
}
}
}
Then you don't have to append .keyword; you can just use userId instead of userId.keyword. If you want full-text search on the field, use userId.text. The disadvantage of the text type is that you can't use the field to sort your results, which is why I encourage ID fields to be of type keyword.
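With that remapping, a sketch of how the ID clause and a sort would look (assuming org.elasticsearch.search.builder.SearchSourceBuilder and org.elasticsearch.search.sort.SortOrder):
SearchSourceBuilder source = new SearchSourceBuilder()
    .query(QueryBuilders.boolQuery()
        .must(QueryBuilders.termQuery("userId", "3793")))  // no ".keyword" suffix needed any more
    .sort("userId", SortOrder.ASC);                        // keyword fields can be sorted on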

Elasticsearch nested sort - mismatch between document and nested object used for sorting

I've been developing a new search API with AWS Elasticsearch (version 6.2) as backend.
Right now, I'm trying to support "sort" options for the API.
My mapping is as follows (unrelated fields not included):
{
  "properties": {
    "id": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword"
        }
      }
    },
    "description": {
      "type": "text"
    },
    "materialDefinitionProperties": {
      "type": "nested",
      "properties": {
        "id": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          },
          "analyzer": "case_sensitive_analyzer"
        },
        "value": {
          "type": "nested",
          "properties": {
            "valueString": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}
I'm attempting to allow users to sort by property value (path: materialDefinitionProperties.value.valueString.raw).
Note that it's inside two levels of nested objects (materialDefinitionProperties and materialDefinitionProperties.value are both nested objects).
To sort the results by the value of the property with ID "PART NUMBER", my sort request is:
{
  "fieldName": "materialDefinitionProperties.value.valueString.raw",
  "nestedSort": {
    "path": "materialDefinitionProperties",
    "filter": {
      "fieldName": "materialDefinitionProperties.id",
      "value": "PART NUMBER",
      "slop": 0,
      "boost": 1
    },
    "nestedSort": {
      "path": "materialDefinitionProperties.value"
    }
  },
  "order": "ASC"
}
However, as I examined the response, the "sort" field does not match the document's property value:
{
  "_index": "material-definition-index-v2",
  "_type": "default",
  "_id": "development_LITL4ZCNE",
  "_source": {
    "id": "LITL4ZCNE",
    "description": [
      "CPU, Intel, Cascade Lake, 8259CL, 24C, 210W, B1 Prod"
    ],
    "materialDefinitionProperties": [
      {
        "id": "PART NUMBER",
        "description": [],
        "value": [
          {
            "valueString": "202-001193-001",
            "isOriginal": true
          }
        ]
      }
    ]
  },
  "sort": [
    "100-000018"
  ]
},
The document's PART NUMBER property is "202-001193-001", but the "sort" field says "100-000018", which is the part number of another document.
It seems that there's a mismatch between the master document and nested object used for sorting.
This request worked well when there was only a small number of documents in the cluster. But once I backfilled the cluster with ~1 million records, the symptom appeared. I've also tried creating a new ES cluster, but the results are the same.
Sorting by other non-nested attributes worked well.
Did I misunderstand the concept of nested objects, or misuse the nested sort feature?
Any ideas appreciated!
This is a bug in Elasticsearch. Upgrading to 6.4.0 fixed the issue.
Issue tracker: https://github.com/elastic/elasticsearch/pull/32204
Release note: https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-6.4.0.html

Jackson Parser for recursively parsing unknown input structure

I'm trying to recursively parse a JSON input structure in Java, in the format below, and rewrite the same structure to another JSON.
Meanwhile, I need to validate each and every JSON key/value while parsing.
{"Verbs":[{
"aaaa":"30d", "type":"ed", "rel":1.0, "id":"80", "spoken":"en", "ct":"on", "sps":null
},{
"aaaa":"31", "type":"cc", "rel":3.0, "id":"10", "spoken":"en", "ct":"off", "sps":null
},{
"aaaa":"81", "type":"nn", "rel":3.0, "id":"60", "spoken":"en", "ct":"on", "sps":null
}]}
Please advise on how I can use Jackson's JsonToken enums for reading and writing unknown JSON content.
You can use JSON Schema to validate your inputs.
Find the documentation for your data format, but from what I can read here, the schema would be something like this:
{
  "$schema": "http://json-schema.org/schema#",
  "type": "object",
  "required": [ "Verbs" ],
  "properties": {
    "Verbs": { "type": "array", "items": { "$ref": "#/definitions/verb" } }
  },
  "definitions": {
    "verb": {
      "type": "object",
      "required": [ "aaaa", "type", "rel", "id", "spoken", "ct", "sps" ],
      "additionalProperties": false,
      "properties": {
        "aaaa": { "type": "string" },
        "type": { "type": "string" },
        "rel": { "type": "number" },
        "id": { "type": "string", "pattern": "^[0-9]+$" },
        "spoken": { "type": "string" },
        "ct": { "enum": [ "on", "off" ] },
        "sps": { "enum": [ null ] }
      }
    }
  }
}
As you use Jackson, you can use this library which can validate your data for you.
Transforming your JSON after that can be done by creating a new JsonNode, for instance.
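As a rough sketch of that last suggestion (hypothetical class name; it reads the whole document into a tree, then copies each verb node into a new JsonNode tree, which is where per-field validation or rewriting would go):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class RewriteVerbs {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(System.in);      // the {"Verbs":[...]} document on stdin
        ObjectNode rewritten = mapper.createObjectNode();
        ArrayNode verbs = rewritten.putArray("Verbs");
        for (JsonNode verb : root.path("Verbs")) {
            // validate/transform individual fields here before copying
            verbs.add(verb.deepCopy());
        }
        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(rewritten));
    }
}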
