Kafka Avro To BigQuery using Apache Beam in Java

Here is the scenario:
Kafka to BigQuery using Apache Beam. This is an alternative to the BigQuerySinkConnector [WePay] in Kafka Connect.
I have been able to read Avro messages from a Kafka topic, and I can print their contents to the console accurately. I am looking for help with writing these KafkaRecords to a BigQuery table.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);

// Customer is an auto-generated class from the Avro schema, using the Eclipse Avro Maven plugin
// Read from the Kafka topic and get KafkaRecords
@SuppressWarnings("unchecked")
PTransform<PBegin, PCollection<KafkaRecord<String, Customer>>> input = KafkaIO.<String, Customer>read()
        .withBootstrapServers("http://server1:9092")
        .withTopic("test-avro")
        .withConsumerConfigUpdates(ImmutableMap.of("specific.avro.reader", (Object) "true"))
        .withConsumerConfigUpdates(ImmutableMap.of("auto.offset.reset", (Object) "earliest"))
        .withConsumerConfigUpdates(ImmutableMap.of("schema.registry.url", (Object) "http://server2:8181"))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(Customer.class));
// Print Kafka records to the console log
pipeline.apply(input)
        .apply("ExtractRecord", ParDo.of(new DoFn<KafkaRecord<String, Customer>, KafkaRecord<String, Customer>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                KafkaRecord<String, Customer> record = c.element();
                KV<String, Customer> log = record.getKV();
                System.out.println("Key Obtained: " + log.getKey());
                System.out.println("Value Obtained: " + log.getValue().toString());
                c.output(record);
            }
        }));
// Write each record to the BigQuery table
// The table already exists in BigQuery, so the create disposition is CREATE_NEVER
// Records are appended to the table, so the write disposition is WRITE_APPEND
// All fields in the Customer object have corresponding column names and data types, so it is a one-to-one mapping
// The connection to BigQuery uses a service account JSON file, set as an environment variable in the Eclipse run configuration
// Set the table specification for BigQuery
String bqTable = "my-project:my-dataset.my-table";
The examples currently available show how to set a schema manually and assign the values field by field. I am looking for an automated way to infer the schema from the Customer Avro object and map it to the columns directly, without such field-by-field assignment.
Is this possible?

After much trial and error I was able to make the following work.
I would welcome review comments sharing concerns or proposing better solutions.
SchemaRegistryClient registryClient = new CachedSchemaRegistryClient("http://server2:8181", 10);
SchemaMetadata latestSchemaMetadata;
Schema avroSchema = null;
try {
    // getLatestSchemaMetadata takes the subject name, which is in topic-value format ("-value" suffixed to the topic),
    // so if the topic is "test-avro" then the subject is "test-avro-value"
    latestSchemaMetadata = registryClient.getLatestSchemaMetadata("test-avro-value");
    avroSchema = new Schema.Parser().parse(latestSchemaMetadata.getSchema());
} catch (IOException e) {
    System.out.println("IO Exception while obtaining registry data");
    e.printStackTrace();
} catch (RestClientException e) {
    System.out.println("Client Exception while obtaining registry data");
    e.printStackTrace();
}

// Print the Avro schema obtained
System.out.println("---------------- Avro schema ----------- " + avroSchema.toString());
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);

// Read from the Kafka topic and get KafkaRecords
// Create KafkaIO.Read with the Avro schema deserializer
@SuppressWarnings("unchecked")
KafkaIO.Read<String, GenericRecord> read = KafkaIO.<String, GenericRecord>read()
        .withBootstrapServers("http://server1:9092")
        .withTopic(KafkaConfig.getInputTopic())
        .withConsumerConfigUpdates(ImmutableMap.of("schema.registry.url", "http://server2:8181"))
        .withConsumerConfigUpdates(ImmutableMap.of("specific.avro.reader", (Object) "true"))
        .withConsumerConfigUpdates(ImmutableMap.of("auto.offset.reset", (Object) "earliest"))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(avroSchema));

// Set the Beam schema
org.apache.beam.sdk.schemas.Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
// Print Kafka records to the console log, then write each record to the BigQuery table
// The table already exists in BigQuery, so the create disposition is CREATE_NEVER
// Records are appended to the table, so the write disposition is WRITE_APPEND
// All fields in the Avro record have corresponding column names and data types, so it is a one-to-one mapping
// The connection to BigQuery uses a service account JSON file, set as an environment variable in the Eclipse run configuration
// Set the table specification for BigQuery
String bqTable = "my-project:my-dataset.my-table";
pipeline.apply(read)
        .apply("ExtractRecord", ParDo.of(new DoFn<KafkaRecord<String, GenericRecord>, KV<String, GenericRecord>>() {
            private static final long serialVersionUID = 1L;

            @ProcessElement
            public void processElement(ProcessContext c) {
                KafkaRecord<String, GenericRecord> record = c.element();
                KV<String, GenericRecord> log = record.getKV();
                System.out.println("Key Obtained: " + log.getKey());
                System.out.println("Value Obtained: " + log.getValue().toString());
                c.output(log);
            }
        }))
        .apply(Values.<GenericRecord>create())
        .setSchema(beamSchema, TypeDescriptor.of(GenericRecord.class),
                AvroUtils.getToRowFunction(GenericRecord.class, avroSchema),
                AvroUtils.getFromRowFunction(GenericRecord.class))
        .apply(BigQueryIO.<GenericRecord>write()
                .to(bqTable)
                .useBeamSchema()
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
pipeline.run().waitUntilFinish();
The above works with CREATE_IF_NEEDED also.
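With useBeamSchema(), BigQueryIO derives the destination TableSchema from the Beam schema, which is why CREATE_IF_NEEDED can create the table. If the schema is ever needed explicitly (for example to tweak a column), a rough equivalent is sketched below using org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils and org.apache.beam.sdk.schemas.transforms.Convert; "records" is a stand-in for the schema-aware PCollection<GenericRecord> produced by the Values/setSchema step above, not a variable from the original code.
TableSchema tableSchema = BigQueryUtils.toTableSchema(beamSchema);   // Beam schema -> BigQuery TableSchema
records                                                              // hypothetical: PCollection<GenericRecord> with beamSchema attached
        .apply(Convert.toRows())                                     // GenericRecord -> Row, using the attached schema
        .apply(BigQueryIO.<Row>write()
                .to(bqTable)
                .withSchema(tableSchema)                             // explicit schema instead of useBeamSchema()
                .withFormatFunction(BigQueryUtils.toTableRow())      // Row -> TableRow
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
This is only a sketch under those assumptions, not tested against this pipeline.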

Related

Get only a subset of fields from a Kafka topic using Apache Beam

Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person, with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading streams using Apache Beam and intend to write the data to BigQuery afterwards. I tried to use Filter, but I cannot get it to work because of its Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
        pipeline
                .apply("read streams", dataIO.readStreams(topic))
                .apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
                    @Override
                    public Boolean apply(KV<String, Person> input) {
                        return input.getValue().get("address").equals(true);
                    }
                }));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(PersonAvroDeserializer.class)
        .withConsumerConfigUpdates(consumer)
        .withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
    'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
    'topics' = 'FILTERED_STREAM',
    …
);
You can also do this by creating a new TableSchema yourself with only the required fields. Later, when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();

TableFieldSchema id =
        new TableFieldSchema()
                .setName("id")
                .setType("STRING")
                .setMode("NULLABLE");
tableFields.add(id);

schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an Avro record to BigQuery's TableRow at some point, you may need to implement some checks there too.
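For example, that conversion could be a DoFn that builds a TableRow with only the required fields and a null check on each one. This is just a sketch, assuming the value is the Avro-generated Person class (which exposes get("field") like a GenericRecord) and that the target columns are id, name and address:
// Hypothetical projection DoFn: keeps only id, name and address and guards against nulls
static class ProjectPersonFn extends DoFn<KV<String, Person>, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Person person = c.element().getValue();
        TableRow row = new TableRow();
        Object id = person.get("id");
        Object name = person.get("name");
        Object address = person.get("address");
        if (id != null) {
            row.set("id", id.toString());
        }
        if (name != null) {
            row.set("name", name.toString());
        }
        if (address != null) {
            row.set("address", address.toString());
        }
        c.output(row);
    }
}
The resulting PCollection<TableRow> can then be written with BigQueryIO.writeTableRows().withSchema(schema), using the reduced schema built above.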

Update/replace tabledata in google bigquery via java coding

I am trying to update BigQuery table data using Java. WriteDisposition is an option according to my research. I am a bit of a novice and couldn't get it working; kindly help.
That being said, I have tried inserting using WriteChannelConfiguration, which worked fine. I need to make changes to this code so that it updates the table.
public class BigQryAPI {
    public static void explicit() {
        // Load credentials from a JSON key file. If you can't set the GOOGLE_APPLICATION_CREDENTIALS
        // environment variable, you can explicitly load the credentials file to construct the credentials.
        try {
            GoogleCredentials credentials;
            File credentialsPath = new File(BigQryAPI.class.getResource("/firstprojectkey.json").getPath()); // TODO: update to your key path.
            FileInputStream serviceAccountStream = new FileInputStream(credentialsPath);
            credentials = ServiceAccountCredentials.fromStream(serviceAccountStream);

            // Instantiate a client
            BigQuery bigquery =
                    BigQueryOptions.newBuilder().setCredentials(credentials).build().getService();
            System.out.println("Datasets:");
            for (Dataset dataset : bigquery.listDatasets().iterateAll()) {
                System.out.printf("%s%n", dataset.getDatasetId().getDataset());
            }

            // Load into the table
            TableId tableId = TableId.of("firstdataset", "firsttable");
            WriteChannelConfiguration writeChannelConfiguration =
                    WriteChannelConfiguration.newBuilder(tableId).setFormatOptions(FormatOptions.csv()).build();
            TableDataWriteChannel writer = bigquery.writer(writeChannelConfiguration);
            String csvdata = "zzzxyz,zzzxyz";

            // Write data to the writer
            try {
                writer.write(ByteBuffer.wrap(csvdata.getBytes(Charsets.UTF_8)));
            } finally {
                writer.close();
            }

            // Get the load job
            Job job = writer.getJob();
            job = job.waitFor();
            LoadStatistics stats = job.getStatistics();
            System.out.println("these are my stats: " + stats);

            String query = "SELECT Name,Phone FROM `firstproject-256319.firstdataset.firsttable`;";
            QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();

            // Print the results.
            for (FieldValueList row : bigquery.query(queryConfig).iterateAll()) {
                for (FieldValue val : row) {
                    System.out.printf("%s,", val.toString());
                }
                System.out.printf("\n");
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
We can set the write disposition while building the WriteChannelConfiguration:
WriteChannelConfiguration writeChannelConfiguration =
        WriteChannelConfiguration.newBuilder(tableId)
                .setFormatOptions(FormatOptions.csv())
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();
Details can be found in the BigQuery API docs.
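Plugging that into the code from the question, a minimal sketch might look like the following (inside the same try block as the original; WRITE_TRUNCATE replaces the existing rows, which is what "update/replace" means for a load job, while WRITE_APPEND would add to them):
TableId tableId = TableId.of("firstdataset", "firsttable");
WriteChannelConfiguration writeChannelConfiguration =
        WriteChannelConfiguration.newBuilder(tableId)
                .setFormatOptions(FormatOptions.csv())
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE) // replace current table data
                .build();
TableDataWriteChannel writer = bigquery.writer(writeChannelConfiguration);
try {
    writer.write(ByteBuffer.wrap(csvdata.getBytes(Charsets.UTF_8)));
} finally {
    writer.close();
}
Job job = writer.getJob().waitFor(); // when the load job finishes, the table contents have been replaced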

Load csv data with partition in spark 2.0

In Spark 2.0, I have the following method, which loads the data into a Dataset:
public Dataset<Employee> GetDataFrameFromTextFile() {
    // The schema is encoded in a string
    String schemaString = "id firstname lastname accountNo";

    // Generate the schema based on the string of schema
    List<StructField> fields = new ArrayList<>();
    for (String fieldName : schemaString.split(" ")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }
    StructType schema = DataTypes.createStructType(fields);

    return sparksession.read().schema(schema)
            .option("mode", "DROPMALFORMED")
            .option("sep", "|")
            .option("ignoreLeadingWhiteSpace", true)
            .option("ignoreTrailingWhiteSpace", true)
            .csv("D:\\HadoopDirectory\\Employee.txt")
            .as(Encoders.bean(Employee.class));
}
and in my driver code, a map operation is called on the dataset:
Dataset<Employee> rowDataset = ad.GetDataFrameFromTextFile();
Dataset<String> map = rowDataset.map(new MapFunction<Employee, String>() {
    @Override
    public String call(Employee emp) throws Exception {
        return TraverseRuleByADRow(emp);
    }
}, Encoders.STRING());
When I run the driver program in Spark local mode with 8 cores on my laptop, I see the input file split into 8 partitions. Is there a way to load the file into more than 8 partitions, say 100 or 1000 partitions?
I know this is achievable if the source data comes from a SQL Server table via JDBC:
sparksession.read().format("jdbc").option("url", urlCandi).option("dbtable", tableName).option("partitionColumn", partitionColumn).option("lowerBound", String.valueOf(lowerBound))
.option("upperBound", String.valueOf(upperBound))
.option("numPartitions", String.valueOf(numberOfPartitions))
.load().as(Encoders.bean(Employee.class));
Thanks
Use the repartition() method on the Dataset. According to the Scaladoc, there is no option to set the number of partitions while reading.
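A minimal sketch of that suggestion against the code from the question; 100 is just an example partition count:
Dataset<Employee> rowDataset = ad.GetDataFrameFromTextFile();

// Redistribute the rows into 100 partitions after the read, then run the map as before
Dataset<String> map = rowDataset.repartition(100)
        .map(new MapFunction<Employee, String>() {
            @Override
            public String call(Employee emp) throws Exception {
                return TraverseRuleByADRow(emp);
            }
        }, Encoders.STRING());
Note that repartition() introduces a shuffle; if you ever only need fewer partitions, coalesce() avoids that cost.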

unexpected multiple execution of mapper intended to run once

I tried to write a very simple job with only 1 mapper and no reducer to write some data to HBase. In the mapper I simply open a connection to HBase, write a few rows of data to a table and then close the connection. In the job driver I am using JobConf.setNumMapTasks(1); and JobConf.setNumReduceTasks(0); to specify that only 1 mapper and no reducers are to be executed. I am also setting the reducer class to IdentityReducer in the JobConf.
The strange behavior I am observing is that the job successfully writes the data to the HBase table, but after that I see in the logs that it continuously tries to open a connection to HBase and then closes it, which goes on for 20-30 minutes, after which the job is declared to have completed with 100% success. At the end, when I check the _success file created for the dummy data I put in OutputCollector.collect(...), I see hundreds of rows of dummy data when there should only be 1.
Following is the code for the job driver:
public int run(String[] arg0) throws Exception {
    Configuration config = HBaseConfiguration.create(getConf());
    ensureRequiredParametersExist(config);
    ensureOptionalParametersExist(config);

    JobConf jobConf = new JobConf(config, getClass());
    jobConf.setJobName(config.get(ETLJobConstants.ETL_JOB_NAME));

    // Set map-specific configuration
    jobConf.setNumMapTasks(1);
    jobConf.setMaxMapAttempts(1);
    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setMapperClass(SingletonMapper.class);
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(Text.class);

    // Set reducer-specific configuration
    jobConf.setReducerClass(IdentityReducer.class);
    jobConf.setOutputKeyClass(LongWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(TextOutputFormat.class);
    jobConf.setNumReduceTasks(0);

    // Set job-specific configuration details like the input file name etc.
    FileInputFormat.setInputPaths(jobConf, jobConf.get(ETLJobConstants.ETL_JOB_FILE_INPUT_PATH));
    System.out.println("setting output path to : " + jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH));
    FileOutputFormat.setOutputPath(jobConf,
            new Path(jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH)));

    JobClient.runJob(jobConf);
    return 0;
}
The driver class extends Configured and implements Tool (I used the sample from the Definitive Guide).
Following is the code in my Mapper's map method, where I simply open the connection to HBase, do some preliminary checks to make sure the table exists, then write the rows and close the table.
public void map(LongWritable arg0, Text arg1,
        OutputCollector<LongWritable, Text> arg2, Reporter arg3)
        throws IOException {
    HTable aTable = null;
    HBaseAdmin admin = null;

    try {
        arg3.setStatus("started");

        // Set up the HBase config
        admin = new HBaseAdmin(conf);

        // Open a connection to the table
        String tableName = conf.get(ETLJobConstants.ETL_JOB_TABLE_NAME);
        HTableDescriptor htd = new HTableDescriptor(toBytes(tableName));
        String colFamilyName = conf.get(ETLJobConstants.ETL_JOB_TABLE_COLUMN_FAMILY_NAME);
        byte[] tablename = htd.getName();
        /* call function to ensure table with 'tablename' exists */

        // Loop and put the file data into the table
        aTable = new HTable(conf, tableName);
        DataRow row = /* logic to generate data */
        while (row != null) {
            byte[] rowKey = toBytes(row.getRowKey());
            Put put = new Put(rowKey);
            for (DataNode node : row.getRowData()) {
                put.add(toBytes(colFamilyName), toBytes(node.getNodeName()),
                        toBytes(node.getNodeValue()));
            }
            aTable.put(put);
            arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added another data row to hbase");
            row = fileParser.getNextRow();
        }
        aTable.flushCommits();
        arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo Finished adding data to hbase");
    } finally {
        if (aTable != null) {
            aTable.close();
        }
        if (admin != null) {
            admin.close();
        }
    }

    arg2.collect(new LongWritable(10), new Text("something"));
    arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added some dummy data to the collector");
}
As you can see near the end, I am writing some dummy data to the collector (10, 'something'), and I see hundreds of rows of this data in the _success file after the job has terminated.
I can't figure out why the mapper code is restarted multiple times over and over instead of running just once. Any help would be greatly appreciated.
Using JobConf.setNumMapTasks(1) just tells Hadoop that you wish to use 1 mapper, if possible, unlike setNumReduceTasks, which actually fixes the number you specify.
That's why more mappers are run and you observe all these extra rows.
For more details, please read this post.
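One way to bias the job toward a single map task is to make the minimum split size larger than the input file, and to rule out duplicate speculative attempts. This is only a sketch, assuming the old mapred API from the question and an ordinary splittable text input:
// With one input split there is one map task; speculative execution could still launch
// duplicate attempts of that task, so disable it as well.
jobConf.setLong("mapred.min.split.size", Long.MAX_VALUE);
jobConf.setSpeculativeExecution(false);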

hbase java code returns null for a get but hbase shell get command returns record

I have just started using HBase and am also not a proficient Java programmer. I created a debug program to test the current HBase program, which puts and gets records and also acts as a deduping mechanism. The debug program checks whether certain ids that should have been inserted by the other program are present in the HBase table.
When I do a get, for the most part the records are there, but some are returned as null (not found). When I manually check from the HBase shell and request the same id, it returns the row with a timestamp. Is there something I am not understanding here? Are there multiple versions of a record kept in HBase? I assumed HBase made unique records based on the id provided.
// Code to get a record
public static byte[] getPreHbase(String provid, String commentId) throws IOException {
    provid = "98";
    commentId = commentId.trim();
    String rec = provid + "." + commentId;
    byte[] value = "test".getBytes();
    try {
        Get g = new Get(Bytes.toBytes(rec));
        Result r = htableII.get(g);
        value = r.getValue(Bytes.toBytes("cmmnttest"), Bytes.toBytes("cmmntposts"));
        String valueStr = Bytes.toString(value);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return value;
}
As I mentioned, this happens only sometimes, for some ids, while others are returned. This is the manual call in the shell:
get 'hb_test', '98.1010000000003_1asdfghjkl'
COLUMN                 CELL
 cmmnttest:cmmntposts  timestamp=1420659812914, value=1010000000003_1asdfghjkl
1 row(s) in 0.0140 seconds
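Not an answer, but a small diagnostic sketch that may help narrow this down: print the exact row key bytes the Java code asks for and whether the Result is empty, since a stray space or encoding difference in provid + "." + commentId would make the Get look up a different row than the one fetched in the shell.
Get g = new Get(Bytes.toBytes(rec));
Result r = htableII.get(g);
// Show exactly which key was requested and whether HBase returned anything for it
System.out.println("Row key requested: '" + rec + "' -> " + Bytes.toStringBinary(Bytes.toBytes(rec)));
System.out.println("Result empty? " + r.isEmpty());
if (!r.isEmpty()) {
    System.out.println("Value: " + Bytes.toString(r.getValue(Bytes.toBytes("cmmnttest"), Bytes.toBytes("cmmntposts"))));
}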
