Extracting schema from Union Avro - java

I have the following union:
{
"name" : "price",
"type" : [ "null", {
"type" : "array",
"items" : {
"type" : "record",
"name" : "PriceType",
"fields" : [ {
"name" : "text_value",
"type" : [ "null", "double" ],
"source" : "element text_value"
}, {
"name" : "currency",
"type" : [ "null", "string" ],
"default" : null,
"source" : "attribute currency"
} ]
}
} ],
"default" : null,
"source" : "element price"
}
From this I get the schema of the price field using this code:
Schema new_schema=schema.getField("price").schema();
Now I want to obtain the array schema contained in the union:
{
"type" : "array",
"items" : {
"type" : "record",
"name" : "PriceType",
"fields" : [ {
"name" : "text_value",
"type" : [ "null", "double" ],
"source" : "element text_value"
}, {
"name" : "currency",
"type" : [ "null", "string" ],
"default" : null,
"source" : "attribute currency"
} ]
}
}
How can I do this? And how do I insert a union in a record?

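Since it's a union type, getTypes() returns its branches in declaration order. Here index 0 is the "null" branch, so the array schema is at index 1:
Schema arraySchema = new_schema.getTypes().get(1);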
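In this union the first branch is "null" and the array is the second, so picking a branch by position depends on the declaration order. A minimal sketch of selecting the branch by type instead, written against the plain schema JSON rather than the Avro Java API. The helper name nonNullBranches is hypothetical, and the record's fields are left out for brevity:

```javascript
// The "price" field from the question, reduced to the parts that matter here.
// The fields of PriceType are omitted (assumption: they do not affect branch selection).
const priceField = {
  name: "price",
  type: [
    "null",
    { type: "array", items: { type: "record", name: "PriceType", fields: [] } }
  ],
  default: null
};

// Hypothetical helper: keep every union branch that is not the "null" type.
function nonNullBranches(unionType) {
  return unionType.filter((branch) => branch !== "null");
}

const branches = nonNullBranches(priceField.type);
console.log(priceField.type.indexOf("null")); // 0
console.log(branches[0].type);                // array
```

The same idea carries over to the Java API by looping over getTypes() and skipping the branch whose getType() is Schema.Type.NULL.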


What is meant by processedWithError in the report task manager?

I ingested the file into Druid, and the task reports that the ingestion succeeded. However, the ingestion report shows that all rows were processed with errors, even though the datasource is displayed in the "Datasource" tab.
I have tried reducing the input from 20M rows to only 20 rows. Here is my configuration file:
{
"type" : "index",
"spec" : {
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "/home/data/Salutica",
"filter" : "outDashboard2RawV3.csv"
}
},
"dataSchema" : {
"dataSource": "DaTRUE2_Dashboard_V3",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "WEEK",
"queryGranularity" : "none",
"intervals" : ["2017-05-08/2019-05-17"],
"rollup" : false
},
"parser" : {
"type" : "string",
"parseSpec": {
"format" : "csv",
"timestampSpec" : {
"column" : "Date_Time",
"format" : "auto"
},
"columns" : [
"Main_ID","Parameter_ID","Date_Time","Serial_Number","Status","Station_ID",
"Station_Type","Parameter_Name","Failed_Date_Time","Failed_Measurement",
"Database_Name","Date_Time_Year","Date_Time_Month",
"Date_Time_Day","Date_Time_Hour","Date_Time_Weekday","Status_New"
],
"dimensionsSpec" : {
"dimensions" : [
"Date_Time","Serial_Number","Status","Station_ID",
"Station_Type","Parameter_Name","Failed_Date_Time",
"Failed_Measurement","Database_Name","Status_New",
{
"name" : "Main_ID",
"type" : "long"
},
{
"name" : "Parameter_ID",
"type" : "long"
},
{
"name" : "Date_Time_Year",
"type" : "long"
},
{
"name" : "Date_Time_Month",
"type" : "long"
},
{
"name" : "Date_Time_Day",
"type" : "long"
},
{
"name" : "Date_Time_Hour",
"type" : "long"
},
{
"name" : "Date_Time_Weekday",
"type" : "long"
}
]
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
}
]
},
"tuningConfig" : {
"type" : "index",
"partitionsSpec" : {
"type" : "hashed",
"targetPartitionSize" : 5000000
},
"jobProperties" : {}
}
}
}
Report:
{"ingestionStatsAndErrors":{"taskId":"index_DaTRUE2_Dashboard_V3_2019-09-10T01:16:47.113Z","payload":{"ingestionState":"COMPLETED","unparseableEvents":{},"rowStats":{"determinePartitions":{"processed":0,"processedWithError":0,"thrownAway":0,"unparseable":0},"buildSegments":{"processed":0,"processedWithError":20606701,"thrownAway":0,"unparseable":1}},"errorMsg":null},"type":"ingestionStatsAndErrors"}}
I'm expecting this:
{"processed":20606701,"processedWithError":0,"thrownAway":0,"unparseable":1}},"errorMsg":null},"type":"ingestionStatsAndErrors"}}
instead of this:
{"processed":0,"processedWithError":20606701,"thrownAway":0,"unparseable":1}},"errorMsg":null},"type":"ingestionStatsAndErrors"}}
Below is my input data from the CSV file:
"Main_ID","Parameter_ID","Date_Time","Serial_Number","Status","Station_ID","Station_Type","Parameter_Name","Failed_Date_Time","Failed_Measurement","Database_Name","Date_Time_Year","Date_Time_Month","Date_Time_Day","Date_Time_Hour","Date_Time_Weekday","Status_New"
1,3,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","1.8V","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
1,4,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","1.35V","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
1,5,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","Isc_VChrg","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
1,6,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","Isc_VBAT","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
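In Druid's row stats, processedWithError counts rows that were ingested but hit a parse error in at least one column, which usually means a value did not fit the declared column type. A quick sketch for checking the first sample row against the dimensions declared as long in the spec above. The CSV splitting is deliberately naive (it assumes no commas inside quoted values, which holds for the sample), and the helper and variable names are hypothetical:

```javascript
// Dimensions declared with "type" : "long" in the ingestion spec.
const longDimensions = [
  "Main_ID", "Parameter_ID", "Date_Time_Year", "Date_Time_Month",
  "Date_Time_Day", "Date_Time_Hour", "Date_Time_Weekday"
];

// Header line and first data row, copied from the sample input.
const header = '"Main_ID","Parameter_ID","Date_Time","Serial_Number","Status","Station_ID","Station_Type","Parameter_Name","Failed_Date_Time","Failed_Measurement","Database_Name","Date_Time_Year","Date_Time_Month","Date_Time_Day","Date_Time_Hour","Date_Time_Weekday","Status_New"';
const row = '1,3,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","1.8V","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"';

// Naive CSV split: break on commas, then strip one leading and one trailing quote.
function splitCsvLine(line) {
  return line.split(",").map((cell) => cell.replace(/^"|"$/g, ""));
}

const names = splitCsvLine(header);
const values = splitCsvLine(row);

// Collect long-typed dimensions whose value is not an integer literal.
const suspects = longDimensions.filter((dim) => {
  const value = values[names.indexOf(dim)];
  return !/^-?\d+$/.test(value);
});

console.log(suspects); // [ 'Date_Time_Weekday' ]
```

For the sample row this flags Date_Time_Weekday, whose value is "Friday" while the spec declares it as long, so that column is one candidate cause worth ruling out.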

AVRO - Complex Records with Union Record type support

I am trying to build an Avro complex record in which one field's type is a union of record types.
{
"namespace": "proj.avro",
"protocol": "app_messages",
"doc" : "application messages",
"types": [
{
"name": "record_request",
"type" : "record",
"fields":
[
{
"name" : "request_id",
"type" : "int"
},
{
"name" : "message_type",
"type" : int,
},
{
"name" : "users",
"type" : "string"
}
]
},
{
"name" : "request_response",
"type" : "record",
"fields" :
[
{
"name" : "request_id",
"type" : "int"
},
{
"name" : "response_code",
"type" : "string"
},
{
"name" : "response_count",
"type" : "int"
},
{
"name" : "reason_code",
"type" : "string"
}
]
}
]
"messages" :
{
"published_msgs" :
{
"doc" : "My Messages",
"fields" :
[
{
"name" : "message_type",
"type" : "int"
},
{
"name" : "message_body",
"type" :
[
"record_request", "request_response"
]
}
]
}
}
}
I get an error when I try to read this kind of schema.
I would like to know whether it is possible to declare an Avro schema in which one field's type is a union of complex user-defined message structures.
If it is possible, could you please tell me what I am doing wrong, or give an example of such a structure with a union-typed field?
I want to use Avro schemas dynamically: specify the schema file at run time and parse the incoming buffer as "request"/"response".
Thanks,
It is possible to define a union of complex types. The problem with your schema is that the union is not defined at field level. To get a union of complex types, your schema must look like this:
{
"namespace": "proj.avro",
"protocol": "app_messages",
"doc" : "application messages",
"name": "myRecord",
"type" : "record",
"fields": [
{
"name": "requestResponse",
"type": [
{
"name": "record_request",
"type" : "record",
"fields":
[
{
"name" : "request_id",
"type" : "int"
},
{
"name" : "message_type",
"type" : "int"
},
{
"name" : "users",
"type" : "string"
}
]
},
{
"name" : "request_response",
"type" : "record",
"fields" :
[
{
"name" : "request_id",
"type" : "int"
},
{
"name" : "response_code",
"type" : "string"
},
{
"name" : "response_count",
"type" : "int"
},
{
"name" : "reason_code",
"type" : "string"
}
]
}
]
}
]
}

How to include more than one record in Avro schema?

I am new to Apache Avro. I serialize the data by reading the schema with a parser. My schema is shown below. I need to include more than one record in the same schema.
{ "namespace": "tutorial.model",
"type": "record",
"name": "Employee",
"fields": [
{"name": "firstName", "type": "string"},
{"name": "lastName", "type": "string"},
{"name": "age", "type": "int"},
{"name": "id", "type": "string"},
{"name" : "company", "type" : "string"}
]
}
You can define embedded records, as explained in the GettingStartedGuide.
So your schema would be something like this:
{
"type" : "record",
"name" : "userInfo",
"namespace" : "my.example",
"fields" : [{"name" : "username",
"type" : "string",
"default" : "NONE"},
{"name" : "age",
"type" : "int",
"default" : -1},
{"name" : "phone",
"type" : "string",
"default" : "NONE"},
{"name" : "housenum",
"type" : "string",
"default" : "NONE"},
{"name" : "address",
"type" : {
"type" : "record",
"name" : "mailing_address",
"fields" : [
{"name" : "street",
"type" : "string",
"default" : "NONE"},
{"name" : "city",
"type" : "string",
"default" : "NONE"},
{"name" : "state_prov",
"type" : "string",
"default" : "NONE"},
{"name" : "country",
"type" : "string",
"default" : "NONE"},
{"name" : "zip",
"type" : "string",
"default" : "NONE"}
]},
"default" : {}
}
]
}

Retrieve Result from mongodb using $and and $in

This is my schema design. I want to retrieve some information from MongoDB:
{
"_id" : "23423q53q45345",
"value" : "5942178562002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "india"
},
{
"data_name" : "date",
"value" : "2011"
}
]
},
{
"_id" : "23423q53qdsfsd5",
"value" : "1234238562002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "india"
},
{
"data_name" : "date",
"value" : "2012"
}
]
},
{
"_id" : "213423q45345",
"value" : "6576867562002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "us"
},
{
"data_name" : "date",
"value" : "2011"
}
]
},
{
"_id" : "4564564545dsfsd5",
"value" : "2354353462002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "us"
},
{
"data_name" : "date",
"value" : "2012"
}
]
}
I want to get the data for India for 2011.
I used this query:
db.collection.find({
"data.value": {
"$in": [
"india","2011"
]
}
});
It returns two results:
{
"_id" : "23423q53q45345",
"value" : "5942178562002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "india"
},
{
"data_name" : "date",
"value" : "2011"
}
]
},
{
"_id" : "23423q53qdsfsd5",
"value" : "1234238562002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "india"
},
{
"data_name" : "date",
"value" : "2012"
}
]
}
It is supposed to return one result:
{
"_id" : "23423q53q45345",
"value" : "5942178562002.65",
"dataset" : "GDP (current US$)",
"data" : [
{
"data_name" : "country",
"value" : "india"
},
{
"data_name" : "date",
"value" : "2011"
}
]
}
I know that query is wrong, but how can I achieve this? Please help me out.
db.collection.find({
$and: [
{"data.value": "india"},
{"data.value": "2011"}
]
});
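The $and form requires both values to appear somewhere in data, but it does not check which data_name each value belongs to. A stricter sketch pairs each value with its data_name by putting $elemMatch conditions inside $all. The matchesQuery helper is hypothetical and only mirrors the query in memory for flat documents like the samples:

```javascript
// Each $elemMatch must be satisfied by a single element of the "data" array.
const query = {
  data: {
    $all: [
      { $elemMatch: { data_name: "country", value: "india" } },
      { $elemMatch: { data_name: "date", value: "2011" } }
    ]
  }
};

// In the mongo shell this would be run as: db.collection.find(query);

// Hypothetical in-memory check mirroring the query, for illustration only.
function matchesQuery(doc) {
  return query.data.$all.every((cond) =>
    doc.data.some((el) =>
      Object.entries(cond.$elemMatch).every(([key, expected]) => el[key] === expected)
    )
  );
}

const india2011 = {
  data: [
    { data_name: "country", value: "india" },
    { data_name: "date", value: "2011" }
  ]
};
const india2012 = {
  data: [
    { data_name: "country", value: "india" },
    { data_name: "date", value: "2012" }
  ]
};

console.log(matchesQuery(india2011)); // true
console.log(matchesQuery(india2012)); // false
```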

How to retrieve a document by its own sub document or array?

I have such structure of document:
{
"_id" : "4e76fd1e927e1c9127d1d2e8",
"name" : "***",
"embedPhoneList" : [
{
"type" : "家庭",
"number" : "00000000000"
},
{
"type" : "手机",
"number" : "00000000000"
}
],
"embedAddrList" : [
{
"type" : "家庭",
"addr" : "山东省诸城市***"
},
{
"type" : "工作",
"addr" : "深圳市南山区***"
}
],
"embedEmailList" : [
{
"email" : "********#gmail.com"
},
{
"email" : "********#gmail.com"
},
{
"email" : "********#gmail.com"
},
{
"email" : "********#gmail.com"
}
]
}
What I want to do is find the document by its subdocument, such as an email in the embedEmailList field.
Or if I have a structure like this:
{
"_id" : "4e76fd1e927e1c9127d1d2e8",
"name" : "***",
"embedEmailList" : [
"123#gmail.com" ,
"********#gmail.com" ,
]
}
embedEmailList is an array. How do I find whether it contains 123#gmail.com?
Thanks.
To search for a specific value in an array, MongoDB supports this syntax:
db.your_collection.find({embedEmailList : "foo#bar.com"});
See here for more information.
To search for a value in an embedded object, it supports this syntax:
db.your_collection.find({"embedEmailList.email" : "foo#bar.com"});
See here for more information.
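To make the difference between the two query shapes concrete, here is a small in-memory sketch of what each one checks. It does not talk to MongoDB. The helper names and the sample documents are hypothetical:

```javascript
// Array of plain strings, as in the second structure from the question.
const withStrings = { embedEmailList: ["123#gmail.com", "foo#bar.com"] };

// Array of embedded documents, as in the first structure.
const withSubdocs = { embedEmailList: [{ email: "foo#bar.com" }, { email: "123#gmail.com" }] };

// Mirrors {embedEmailList: email}: true when any array element equals the value.
function hasEmailString(doc, email) {
  return doc.embedEmailList.includes(email);
}

// Mirrors {"embedEmailList.email": email}: true when any embedded document has that email.
function hasEmailSubdoc(doc, email) {
  return doc.embedEmailList.some((entry) => entry.email === email);
}

console.log(hasEmailString(withStrings, "123#gmail.com")); // true
console.log(hasEmailSubdoc(withSubdocs, "foo#bar.com"));   // true
console.log(hasEmailSubdoc(withSubdocs, "nobody#bar.com")); // false
```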
