Apache Beam Streaming unable to write to BigQuery column-based partition - java

I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using BigQueryIO write/writeTableRows. I've explored a couple of patterns, but none of them succeed; a few of them are below.
Using a SerializableFunction to determine the TableDestination:
.withSchema(TableSchemaFactory.buildLineageSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) // or CREATE_NEVER
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
and then calling this function inside the .to() method
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s", project, dataset, table);
    return new TableDestination(dest, null, timePartitioning);
}
I also tried formatting the partition column obtained from the input and appending it to the table string with the $ partition decorator, like below:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    // convert input.get("processingDate") to a String in MMddYYYY format
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s$%s", project, dataset, table, convertedDate);
    return new TableDestination(dest, null, timePartitioning);
}
However, none of them succeed, failing with errors such as:
invalid timestamp
timestamp field value out of range
You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.
The destination table's partition is not supported for streaming. You can only stream to meta-table of date partitioned tables.
Streaming to metadata partition of column based partitioning table is disallowed.
I can't seem to get the right combination. Has anyone encountered the same issue before? Can anyone point me in the right direction or give me some pointers? What I want to achieve is to load the streaming data based on the defined date column, not on processing time.
Thank you!

I expect most of these issues will be solved if you drop the partition decorator from dest. In most cases the BigQuery APIs for loading data will be able to figure out the right partition based on the messages themselves.
So try changing your definition of dest to:
String dest = String.format("%s.%s.%s", project, dataset, table);
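For reference, here is a minimal sketch of what the whole write could look like with the decorator dropped. The schema factory, the project/dataset/table variables, and the column name come from the question; the input PCollection<TableRow> rows is an assumption.

rows.apply("WriteToPartitionedTable",
    BigQueryIO.writeTableRows()
        .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
            @Override
            public TableDestination apply(ValueInSingleWindow<TableRow> input) {
                // Column-based partitioning: with no "$..." decorator, BigQuery routes each row
                // to the partition matching its processingdate value.
                TimePartitioning timePartitioning = new TimePartitioning();
                timePartitioning.setField("processingdate");
                String dest = String.format("%s.%s.%s", project, dataset, table);
                return new TableDestination(dest, null, timePartitioning);
            }
        })
        .withSchema(TableSchemaFactory.buildLineageSchema())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));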

Related

GCP - Bigquery to Kafka as streaming

I have a Dataflow application (Java) which runs in GCP and is able to read data from a BigQuery table and write to Kafka. But the application runs in batch mode, whereas I would like to make it a streaming application that reads continuously from the BigQuery table and writes to the Kafka topic.
BigQuery table: a partitioned table with an insert_time column (timestamp of when the record was inserted into the table) and a message column.
PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
        "select message,processed from `myprojectid.mydatasetname.mytablename` " +
        "where processed = false " +
        "order by insert_time desc ")
    .apply("Windowing", Window.into(FixedWindows.of(Duration.standardMinutes(1))));

tablesRows
    .apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
    .apply("Writing Messages", KafkaIO.<String, String>write()
        .withBootstrapServers(bootStrapURLs)
        .withTopic(options.getKafkaInputTopics())
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class)
        .withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected)));

pipeline.run();
Note: I have tried the options below, but no luck yet.
Option 1. I tried options.setStreaming(true); the pipeline runs in streaming mode, but it finishes after the first successful write.
Option 2. Applied a trigger:
Window.into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(2))
.accumulatingFiredPanes();
Option 3. Forcibly making the PCollection unbounded:
WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)
Any solution is appreciated.
Some of the advice in Side Input Patterns in the Beam Programming Guide may be helpful here, even though you aren't using this as a side input. In particular, that article discusses using GenerateSequence to periodically emit a value and trigger a read from a bounded source.
This could allow your one time query to become a repeated query that periodically emits new records. It will be up to your query logic to determine what range of the table to scan on each query, though, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.
Emitting into the global window would look like:
PCollectionView<Map<String, String>> map =
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
        .apply(Sum.longsGlobally().withoutDefaults())
        .apply(
            ParDo.of(
                new DoFn<Long, Map<String, String>>() {
                  @ProcessElement
                  public void process(
                      @Element Long input,
                      @Timestamp Instant timestamp,
                      OutputReceiver<Map<String, String>> o) {
                    // Read from BigQuery here and for each row output a record:
                    o.output(PlaceholderExternalService.readTestData(timestamp));
                  }
                }))
        .apply(
            Window.<Map<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
        .apply(View.asSingleton());
This assumes that the size of the query result is relatively small, since the read happens entirely within a DoFn invocation.
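To make that placeholder concrete, here is a hedged sketch of what the read inside the DoFn could look like, using the standalone google-cloud-bigquery client. The table and column names come from the question; restricting the query to only-new rows (for example by filtering on insert_time since the last tick) to avoid duplicates is left out. The emitted strings could then be windowed and written to Kafka the same way as in the original pipeline.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical DoFn: on every tick from GenerateSequence, re-query BigQuery and emit messages. */
class PeriodicBigQueryReadFn extends DoFn<Long, String> {
    @ProcessElement
    public void process(@Element Long tick, OutputReceiver<String> out) throws InterruptedException {
        BigQuery bq = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                "select message from `myprojectid.mydatasetname.mytablename` where processed = false")
            .build();
        for (FieldValueList row : bq.query(query).iterateAll()) {
            out.output(row.get("message").getStringValue());
        }
    }
}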

Consuming from multiple Kafka topics

I want to write a Kafka application that consumes from topics and saves something in a database. The topics are created by Debezium Kafka Connect based on the MySQL binlog, so I have one topic per table.
This is the code I am using for consuming from one topic:
KStream<GenericRecord,mysql.company.tiers.Envelope>[] tierStream = builder.stream("mysql.alopeyk.tiers",
Consumed.with(TierSerde.getGenericKeySerde(), TierSerde.getEnvelopeSerde()));
From an architectural point of view, I should create a KStream for each table and run them in parallel. But the number of tables is so big that having that many threads may not be the best option.
All the tables have a column called created_at (it is a Laravel app), so I am curious whether there is a way to have a generic Serde for the values that extracts this common column. Its value is the only thing I am interested in, besides the name of the table.
It is all about how your value is serialized by the application that produced the messages (the connector).
If the Deserializer (Serdes) can extract created_at from the different types of messages, it is possible.
So the answer is yes, but it depends on your message value and Deserializer.
Assuming all your messages after serialization have a format like the following:
create_at;name:position;...
create_at;city,country;...
create_at;product_name;...
In such a case, the Deserializer only needs to take the characters up to the first ;, parse them as a date, and drop the rest of the value.
Sample code:
import java.util.Date;
import org.apache.kafka.common.serialization.Deserializer;

public class CustomDeserializer implements Deserializer<Date> {
    @Override
    public Date deserialize(String topic, byte[] data) {
        // Everything before the first ';' is treated as an epoch-millis timestamp.
        String strDate = new String(data);
        return new Date(Long.parseLong(strDate.substring(0, strDate.indexOf(";"))));
    }
}
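If such a deserializer works across all the tables, one pattern-based stream can replace a KStream per table. A rough sketch, reusing the builder and TierSerde key serde from the question; the topic pattern and the no-op serializer (never invoked on a consume-only stream) are assumptions:

// Wrap the deserializer in a Serde; the serializer side is just a placeholder.
Serde<Date> createdAtSerde = Serdes.serdeFrom(
    (topic, date) -> { throw new UnsupportedOperationException("consume-only serde"); },
    new CustomDeserializer());

// One pattern-based stream over every table topic instead of one KStream per table.
KStream<GenericRecord, Date> allTables = builder.stream(
    Pattern.compile("mysql\\.alopeyk\\..*"),
    Consumed.with(TierSerde.getGenericKeySerde(), createdAtSerde));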

BigQueryIO read get TableSchema

What I want to do is read an existing table and generate a new table which has the same schema as the original plus a few extra columns (computed from some columns of the original table). The original table's schema can grow without notice to me (the fields I am using in my Dataflow job won't change), so I would like to always read the schema instead of defining some custom class which contains it.
In Dataflow SDK 1.x, I can get the TableSchema via
final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...
final TableSchema schema = new BigQueryServicesImpl()
.getDatasetService(options)
.getTable(projectId, dataset, table)
.getSchema();
For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.
I read the responses in Get TableSchema from BigQuery result PCollection<TableRow> but I'd prefer not to make a separate query to BigQuery. As that response is now almost 2 years old, are there other thoughts or ideas from the SO community?
Due to how BigQueryIO is set up now, it needs to know the table schema before the pipeline begins to run. This is a good feature idea, but it's not feasible in a single pipeline. In the example you linked, the table schema is queried before running the pipeline.
If new columns are added, then unfortunately a new pipeline must be relaunched.
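One workaround, assuming the lookup can happen at pipeline-construction time, is to fetch the schema with the standalone google-cloud-bigquery client (com.google.cloud.bigquery) instead of Beam's now package-private BigQueryServicesImpl. A rough sketch, reusing the projectId/dataset/table variables from the question; mapping the result to Beam's com.google.api.services.bigquery.model.TableSchema is then a field-by-field copy:

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
Schema schema = bigquery.getTable(TableId.of(projectId, dataset, table))
    .getDefinition()
    .getSchema();
for (Field field : schema.getFields()) {
    // Field name and legacy SQL type of every column, fetched before pipeline construction.
    System.out.println(field.getName() + " : " + field.getType());
}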

Can I force quotes for STRING fields in a JobConfigurationExtract for BigQuery?

There is a table we would like to export to a customer by means of a JobConfigurationExtract to Google Cloud Storage using Java. I ran into an issue when exchanging CSV information with the customer: they require CSV files separated by commas, and string fields should always have surrounding quotes.
I noticed that, by default, no quotes are added.
I also noticed that in the query explorer, quotes are added when a delimiter is present in one of the data values.
A small snippet of code showing how we configure this job:
Job exportJob = new Job();
JobConfiguration jobConfiguration = new JobConfiguration();
JobConfigurationExtract configurationExtract = new JobConfigurationExtract();
configurationExtract.setSourceTable(sourceTable);
configurationExtract.setFieldDelimiter(",");
configurationExtract.setPrintHeader(true);
configurationExtract.setDestinationUri(destinationUri);
//configurationExtract.setForcedQuotes(true) <=wish there was something like this.
jobConfiguration.setExtract(configurationExtract);
exportJob.setConfiguration(jobConfiguration);
Bigquery bigquery = getBigQuery();
Job resultJob = bigquery.jobs().insert(projectId, exportJob).execute();
Is there a way to achieve this, without making a very complicated query that concats quotes around strings?
There isn't a way to do this, other than, as you suggested, writing a query that writes out string fields with quotes. However, this is a reasonable feature request. Can you file it as a feature request at the bigquery public issue tracker here: https://code.google.com/p/google-bigquery/ so that we can prioritize it and you can keep track of progress?
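For completeness, a hedged sketch of that query workaround: materialize a quoted copy of the table with a query job (legacy SQL), then point the extract job at that copy. The column names name and value and the destination TableReference quotedCopyTable are placeholders.

JobConfigurationQuery queryConfig = new JobConfigurationQuery()
    .setQuery("SELECT CONCAT('\"', REPLACE(name, '\"', '\"\"'), '\"') AS name, value "
        + "FROM [myproject:mydataset.mytable]")
    .setDestinationTable(quotedCopyTable)   // hypothetical TableReference for the temporary copy
    .setWriteDisposition("WRITE_TRUNCATE");
Job quoteJob = new Job().setConfiguration(new JobConfiguration().setQuery(queryConfig));
bigquery.jobs().insert(projectId, quoteJob).execute();
// ...then run the JobConfigurationExtract above with quotedCopyTable as the source table.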

Read Data from HBase

I'm new to HBase. What's the best way to retrieve results from a table, row by row? I would like to read all of the data in the table. My table has two column families, say col1 and col2.
From the HBase shell, you can use the scan command to list the data in a table, or get to retrieve a single record. Reference here.
I think here is what you need: both through HBase shell and Java API: http://cook.coredump.me/post/19672191046/hbase-client-example
However, you should understand that the HBase shell's scan is very slow (it is not cached); it is intended only for debugging purposes.
Another useful part of information for you is here: http://hbase.apache.org/book/perf.reading.html
This chapter is specifically about reading from HBase, but it is somewhat harder to understand because it assumes some familiarity and contains more advanced advice. I'd recommend reading this guide from the beginning.
Use the Scan API of HBase; there you can specify a start row and an end row and retrieve the data from the table.
Here is an example:
http://eternaltechnology.blogspot.in/2013/05/hbase-scanner-example-scanning.html
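A minimal sketch of such a full-table scan with the HBase Java client, assuming a current client API; the table name mytable comes with the families col1/col2 from the question:

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("mytable"));
     ResultScanner scanner = table.getScanner(new Scan()
         .addFamily(Bytes.toBytes("col1"))
         .addFamily(Bytes.toBytes("col2")))) {
    for (Result result : scanner) {
        // Each Result holds one row; pull out whichever cells you need.
        System.out.println(Bytes.toString(result.getRow()));
    }
}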
I was looking for something like this!
Map function
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
String x1 = new String(value.getValue(Bytes.toBytes("ColumnFamily"), Bytes.toBytes("X1")));
String x2 = new String(value.getValue(Bytes.toBytes("ColumnFamily"), Bytes.toBytes("X2")));
}
Driver file:
Configuration config2 = new Configuration();
Job job2 = new Job(config2, "kmeans2");
//Configuration for job2
job2.setJarByClass(Converge.class);
job2.setMapperClass(Converge.Map.class);
job2.setReducerClass(Converge.Reduce.class);
job2.setInputFormatClass(TableInputFormat.class);
job2.setOutputFormatClass(NullOutputFormat.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.getConfiguration().set(TableInputFormat.INPUT_TABLE, "tablename");