GCP - BigQuery to Kafka as streaming - Java

I have a Dataflow application (Java) running in GCP that reads data from a BigQuery table and writes it to Kafka. However, the application runs in batch mode, whereas I would like it to run as a streaming job that continuously reads from the BigQuery table and writes to a Kafka topic.
BigQuery table: a partitioned table with an insert_time column (the timestamp at which the record was inserted into the table) and a message column.
PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
"select message,processed from `myprojectid.mydatasetname.mytablename` " +
"where processed = false " +
"order by insert_time desc ")
.apply("Windowing",Window.into(FixedWindows.of(Duration.standardMinutes(1))));
tablesRows.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
.apply("Writing Messages", KafkaIO.<String, String>write().
withBootstrapServers(bootStrapURLs).
withTopic(options.getKafkaInputTopics()).
withKeySerializer(StringSerializer.class).
withValueSerializer(StringSerializer.class).
withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected))
);
pipeline.run();
Note: I have tried the options below, but with no luck so far.
Option 1. I enabled streaming in the pipeline options (options.setStreaming(true)). The job runs in streaming mode, but it finishes after the first successful write.
Option 2. Applied a trigger:
Window.into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(2))
.accumulatingFiredPanes();
Option 3. Forcing the PCollection to be unbounded:
WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)
Any solution is appreciated.

Some of the advice in Side Input Patterns in the Beam Programming Guide may be helpful here, even though you aren't using this as a side input. In particular, that article discusses using GenerateSequence to periodically emit a value and trigger a read from a bounded source.
This could allow your one time query to become a repeated query that periodically emits new records. It will be up to your query logic to determine what range of the table to scan on each query, though, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.
Emitting into the global window would look like:
PCollectionView<Map<String, String>> map =
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(
ParDo.of(
new DoFn<Long, Map<String, String>>() {
@ProcessElement
public void process(
@Element Long input,
@Timestamp Instant timestamp,
OutputReceiver<Map<String, String>> o) {
// Read from BigQuery here and, for each row, output a record:
o.output(PlaceholderExternalService.readTestData(timestamp));
}
}))
.apply(
Window.<Map<String, String>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(View.asSingleton());
This assumes that the size of the query result is relatively small, since the read happens entirely within a DoFn invocation.
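To make the point about choosing the range each query scans more concrete, here is a minimal sketch (not a definitive implementation) that derives a half-open insert_time range from each tick's timestamp, so that consecutive queries do not overlap. The class name IncrementalQuery, the method forWindow, and the choice to filter on insert_time are assumptions for illustration; the table and column names come from the question. Rows that only become visible after their range has already been queried would still be missed, so this reduces duplicates rather than guaranteeing anything stronger.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: builds one query per tick over [tick - period, tick). */
final class IncrementalQuery {
    private IncrementalQuery() {}

    /**
     * Returns a query covering the half-open range [windowEnd - period, windowEnd).
     * Because the upper bound is exclusive, the next tick's range starts exactly
     * where this one ends and the two never overlap.
     */
    static String forWindow(Instant windowEnd, Duration period) {
        Instant windowStart = windowEnd.minus(period);
        return "SELECT message FROM `myprojectid.mydatasetname.mytablename` "
            + "WHERE insert_time >= TIMESTAMP('" + windowStart + "') "
            + "AND insert_time < TIMESTAMP('" + windowEnd + "')";
    }

    public static void main(String[] args) {
        Duration period = Duration.ofSeconds(5);
        Instant tick = Instant.parse("2024-01-01T00:00:05Z");
        // Two consecutive ticks produce adjacent, non-overlapping ranges.
        System.out.println(forWindow(tick, period));
        System.out.println(forWindow(tick.plus(period), period));
    }
}
```

Inside the DoFn above, the timestamp parameter could play the role of windowEnd, with the period matching the GenerateSequence rate.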

Related

Apache Beam Streaming unable to write to BigQuery column-based partition

I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using BigQueryIO write/writeTableRows. I explored a couple of patterns, but none of them succeeded; a few of them are shown below.
Using SerializableFunction to determine TableDestination
.withSchema(TableSchemaFactory.buildLineageSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) or CREATE_NEVER
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
and then calling this function inside the .to() method
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
TimePartitioning timePartitioning = new TimePartitioning();
timePartitioning.setField("processingdate");
String dest = String.format("%s.%s.%s", project, dataset, table);
return new TableDestination(dest, null, timePartitioning);
}
I also tried to format the partition column obtained from input and add it as part of the String location with $ annotation, like below:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
// input.get("processingDate")
// ...convert to string MMddYYYY format
TimePartitioning timePartitioning = new TimePartitioning();
timePartitioning.setField("processingdate");
String dest = String.format("%s.%s.%s$%s", project, dataset, table, convertedDate);
return new TableDestination(dest, null, timePartitioning);
}
However, none of them succeeded; they fail with one of the following errors:
invalid timestamp
timestamp field value out of range
You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.
The destination table's partition is not supported for streaming. You can only stream to meta-table of date partitioned tables.
Streaming to metadata partition of column based partitioning table is disallowed.
I can't seem to get the right combination. Has anyone encountered the same issue before? Can anyone point me in the right direction or give me some pointers? What I want to achieve is to load the streaming data based on the defined date column rather than on processing time.
Thank you!
I expect most of these issues will be solved if you drop the partition decorator from dest. In most cases the BigQuery APIs for loading data will be able to figure out the right partition based on the messages themselves.
So try changing your definition of dest to:
String dest = String.format("%s.%s.%s", project, dataset, table);
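As a rough illustration of that suggestion, the sketch below builds the destination string without a decorator. For contrast, it also shows the yyyyMMdd form that BigQuery partition decorators take, which differs from the MMddYYYY format mentioned in the question. The class and method names are hypothetical and exist only for this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for building BigQuery destination strings. */
final class BigQueryTableNames {
    // Partition decorators on day-partitioned tables take the form table$yyyyMMdd.
    // Note the lowercase "yyyy": in Java date patterns, uppercase "YYYY" means
    // week-based year, which can differ from the calendar year around New Year.
    private static final DateTimeFormatter PARTITION_ID =
        DateTimeFormatter.ofPattern("yyyyMMdd");

    private BigQueryTableNames() {}

    /** Destination without a decorator, as suggested above. */
    static String tableSpec(String project, String dataset, String table) {
        return String.format("%s.%s.%s", project, dataset, table);
    }

    /** The partition id a decorator would carry; shown only for comparison. */
    static String partitionId(LocalDate date) {
        return PARTITION_ID.format(date);
    }

    public static void main(String[] args) {
        System.out.println(tableSpec("myproject", "mydataset", "mytable"));
        System.out.println(partitionId(LocalDate.of(2021, 3, 9)));
    }
}
```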

How to change dataflow job graph during runtime with arguments?

I am using Dataflow to read data from a JDBC table and load results to a BigQuery table. There is one parameter "flag" that I want to pass during runtime and if the flag is set True, results should be loaded to an additional table in BigQuery.
To summarise:
If the flag is set False - Read table A from JDBC, write to table A in BigQuery
If the flag is set True - Read table A from JDBC, write to table A as well as B in BigQuery.
Please refer to the sample code of my pipeline:
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline pipeline = Pipeline.create(options);
ValueProvider<String> gcsFlag = options.getGcsFlag();
PCollection<TableRow> inputData = pipeline.apply("Reading JDBC Table",
JdbcIO.<TableRow>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
.create(options.getDriverClassName(), options.getJdbcUrl())
.withUsername(options.getUsername()).withPassword(options.getPassword()))
.withQuery(options.getSqlQuery())
.withCoder(TableRowJsonCoder.of())
.withRowMapper(new CustomRowMapper()));
inputData.apply(
"Write to BigQuery Table 1",
BigQueryIO.writeTableRows()
.withoutValidation()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
.to(options.getOutputTable()));
if (gcsFlag.get().equals("TRUE")) {
inputData.apply(
"Write to BigQuery Table 2",
BigQueryIO.writeTableRows()
.withoutValidation()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
.to(options.getOutputTable2()));
}
pipeline.run();
}
The challenge I am facing is that the ValueProvider has to be supplied when compiling and creating the Dataflow template. The job graph is constructed only at compile time, so I cannot reuse the same template for other cases.
Is there a way that I can pass the ValueProvider<String> flag at runtime and the job graph can be constructed during runtime? With this, I can reuse the same template for both cases. Similarly, I want to also provide sqlQuery (options.getSqlQuery()) at runtime. So that I can use the same template for all the tables that I want to read from Source.
Any help is appreciated.
Once the DAG has been created, it cannot change at runtime.
You can still solve your problem, though. Try the Beam Partition pattern:
https://beam.apache.org/documentation/transforms/java/elementwise/partition/
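The underlying idea is to keep both write branches in the graph at all times and to decide per element, at execution time, whether the second branch receives anything. The sketch below shows that idea in plain Java rather than Beam: a Supplier<String> stands in for the ValueProvider<String>, and the class RuntimeBranching and its route method are hypothetical names used only here. It is a simplified model of the routing logic, not the Partition transform itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical model: both branches always exist; the flag is read only when rows flow. */
final class RuntimeBranching {
    /** Rows destined for table A and for table B. */
    static final class Routed<T> {
        final List<T> tableA = new ArrayList<>();
        final List<T> tableB = new ArrayList<>();
    }

    private RuntimeBranching() {}

    /**
     * Sends every row to table A. Rows also go to table B only if the flag,
     * resolved here at execution time, equals "TRUE" (ignoring case).
     */
    static <T> Routed<T> route(List<T> rows, Supplier<String> flag) {
        Routed<T> out = new Routed<>();
        boolean alsoWriteB = "TRUE".equalsIgnoreCase(flag.get());
        for (T row : rows) {
            out.tableA.add(row);
            if (alsoWriteB) {
                out.tableB.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row1", "row2");
        Routed<String> routed = route(rows, () -> "TRUE");
        System.out.println(routed.tableA.size() + " rows for A, "
            + routed.tableB.size() + " rows for B");
    }
}
```

In a real pipeline the second branch would still run with empty input when the flag is false, so its write disposition would need to be checked for that case.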

Is it possible to import in Dataflow streaming pipeline written in Python the Java method `wrapBigQueryInsertError`?

I'm trying to create a Dataflow streaming pipeline with Python 3 that reads messages from a Pub/Sub topic and writes them to a BigQuery table "from scratch". In the Dataflow Java template named PubSubToBigQuery.java (which does what I'm looking for), I've seen a piece of code in the third step that handles the Pub/Sub messages, transformed into table rows, that fail when you try to insert them into the BigQuery table. Finally, in the code for steps 4 and 5, those are flattened and inserted into an error table:
Step 3:
PCollection<FailsafeElement<String, String>> failedInserts =
writeResult
.getFailedInsertsWithErr()
.apply(
"WrapInsertionErrors",
MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
.via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
.setCoder(FAILSAFE_ELEMENT_CODER);
Steps 4 & 5
PCollectionList.of(
ImmutableList.of(
convertedTableRows.get(UDF_DEADLETTER_OUT),
convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
.apply("Flatten", Flatten.pCollections())
.apply(
"WriteFailedRecords",
ErrorConverters.WritePubsubMessageErrors.newBuilder()
.setErrorRecordsTable(
ValueProviderUtils.maybeUseDefaultDeadletterTable(
options.getOutputDeadletterTable(),
options.getOutputTableSpec(),
DEFAULT_DEADLETTER_TABLE_SUFFIX))
.setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
.build());
failedInserts.apply(
"WriteFailedRecords",
ErrorConverters.WriteStringMessageErrors.newBuilder()
.setErrorRecordsTable(
ValueProviderUtils.maybeUseDefaultDeadletterTable(
options.getOutputDeadletterTable(),
options.getOutputTableSpec(),
DEFAULT_DEADLETTER_TABLE_SUFFIX))
.setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
.build());
In order to do this, I suspect that the key to making this possible lies in the first imported library in the template:
package com.google.cloud.teleport.templates;
import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;
Is this method available in Python?
If not, is there some way to do the same in Python other than checking that the structure and field data types of the records to be inserted match what the BigQuery table expects?
This kind of workaround slows down my streaming pipeline too much.
In Beam Python, when performing a streaming BigQuery write, the rows which failed during the BigQuery write are returned by the transform. See https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1248
So you can process these in the same way as the Java template.

Kafka Spark Streaming cache

I have been getting data from one Kafka topic in the form of a JavaPairInputDStream (with the Twitter streaming API). The plan is to get data from two topics, check for duplicates by tweet_id, and add a tweet to the package if it is not already there (the package is what gets sent back to Kafka). I also want to cache the data for x minutes and then work on it.
I can get data from kafka topic and output it with
stream.foreachRDD(rdd -> {
System.out.println("--- New RDD with " + rdd.partitions().size()
+ " partitions and " + rdd.count() + " records");
rdd.foreach(record -> System.out.println(record._2));});
But I can't manage to cache it. I tried rdd.cache() and persist() with count(), but that doesn't seem to do the trick, or I just wasn't able to understand it.
Can anyone guide me on how to do this?
Okay, so it seems it's impossible to cache the RDD like this. I created another RDD, and whenever the stream creates a new RDD I union() it into that one and cache the result.
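For the deduplication half of the question, the core logic (keep tweet_ids seen in the last x minutes and only accept new ones) can be sketched in plain Java. This is an illustration under assumptions, not Spark code: RecentIdCache and addIfNew are hypothetical names, the cache lives in a single JVM, and timestamps are assumed to be passed in non-decreasing order.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical single-JVM cache of ids seen within a time-to-live window. */
final class RecentIdCache {
    private final long ttlMillis;
    // Insertion order equals time order because existing ids are never re-inserted.
    private final LinkedHashMap<String, Long> seenAt = new LinkedHashMap<>();

    RecentIdCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true and records the id if it was not seen within the TTL. */
    boolean addIfNew(String tweetId, long nowMillis) {
        evictOlderThan(nowMillis - ttlMillis);
        if (seenAt.containsKey(tweetId)) {
            return false;
        }
        seenAt.put(tweetId, nowMillis);
        return true;
    }

    /** Drops entries recorded strictly before the cutoff, oldest first. */
    private void evictOlderThan(long cutoffMillis) {
        Iterator<Map.Entry<String, Long>> it = seenAt.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < cutoffMillis) {
                it.remove();
            } else {
                break; // remaining entries are newer
            }
        }
    }

    public static void main(String[] args) {
        RecentIdCache cache = new RecentIdCache(60_000L);
        long now = System.currentTimeMillis();
        System.out.println(cache.addIfNew("tweet-1", now));
        System.out.println(cache.addIfNew("tweet-1", now + 1_000L));
    }
}
```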

How can I append timestamp to rdd and push to elasticsearch

I am new to Spark Streaming and Elasticsearch. I am trying to read data from a Kafka topic using Spark and store the data as an RDD. I want to append a timestamp to the RDD as soon as new data arrives and then push it to Elasticsearch.
lines.foreachRDD(rdd -> {
if(!rdd.isEmpty()){
// rdd.collect().forEach(System.out::println);
String timeStamp = new
SimpleDateFormat("yyyy::MM::dd::HH::mm::ss").format(new Date());
List<String> myList = new ArrayList<String>(Arrays.asList(timeStamp.split("\\s+")));
List<String> f = rdd.collect();
Map<List<String>, ?> rddMaps = ImmutableMap.of(f, 1);
Map<List<String>, ?> myListrdd = ImmutableMap.of(myList, 1);
JavaRDD<Map<List<String>, ?>> javaRDD = sc.parallelize(ImmutableList.of(rddMaps));
JavaEsSpark.saveToEs(javaRDD, "sample/docs");
}
});
Spark?
As far as I understand, Spark Streaming is meant for real-time computation on streaming data, such as map, reduce, join, and window operations. Such a powerful tool seems unnecessary when all we need is to add a timestamp to each event.
Logstash?
If that is the case, Logstash may be more suitable.
Logstash records the timestamp when an event arrives, and it has a persistent queue and dead letter queues that provide data resiliency. It has native support for pushing data to Elasticsearch (after all, they belong to the same family of products), which makes pushing data there very easy.
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash-%{type}-%{+YYYY.MM.dd}"
}
}
More
For more about Logstash, here is an introduction.
Here is a sample Logstash config file.
Hope this is helpful.
Ref
Deploying and Scaling Logstash
If all you're using Spark Streaming for is getting the data from Kafka to Elasticsearch, a neater way, and one that needs no coding, would be to use Kafka Connect.
There is an Elasticsearch Kafka Connect sink. Depending on what you want to do with a timestamp (e.g. for index routing, or to add as a field), you can use Single Message Transforms (there's an example of them here).
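If the timestamp does have to be added in application code, as the question attempts, the usual shape is one document per record, each carrying its own timestamp field, rather than one map keyed by the whole list. The plain-Java sketch below shows that shape. The class TimestampedDocs and the field names "message" and "@timestamp" are assumptions for illustration, and in Spark the same mapping would be applied per element rather than on a collected list.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: wraps each line in a document with a timestamp field. */
final class TimestampedDocs {
    private TimestampedDocs() {}

    /** Builds one map per line with the line itself and an ISO-8601 timestamp. */
    static List<Map<String, Object>> withTimestamp(List<String> lines, Instant now) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (String line : lines) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("message", line);
            doc.put("@timestamp", now.toString()); // e.g. 2024-01-01T00:00:00Z
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("first event", "second event");
        for (Map<String, Object> doc : withTimestamp(lines, Instant.now())) {
            System.out.println(doc);
        }
    }
}
```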
