I have a Dataflow job that transforms data and writes out to BigQuery (batch job). Following the completion of the write operation I want to send a message to PubSub which will trigger further processing of the data in BigQuery. I have seen a few older questions/answers that hint at this being possible but only on streaming jobs:
Perform action after Dataflow pipeline has processed all data
Execute a process exactly after BigQueryIO.write() operation
How to notify when DataFlow Job is complete
I'm wondering if this is supported in any way for batch write jobs now? Unfortunately I can't use Apache Airflow to orchestrate all of this, so sending a Pub/Sub message seemed like the easiest way.
The design of Beam makes it impossible to do what you want inside the pipeline itself. Indeed, you write a PCollection to BigQuery, and by definition a PCollection is a bounded or unbounded collection. How can you trigger something after an unbounded collection? When do you know that you have reached the end?
So you have different ways to achieve this. In your code, you can wait for pipeline completion and then publish a Pub/Sub message.
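As a sketch of that first option (assuming a Java pipeline with Apache Beam and the google-cloud-pubsub client on the classpath; the project/topic names and buildPipeline are placeholders for your own code), waiting for the batch job and then publishing could look like:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunThenNotify {

  // Stand-in for your real pipeline construction (reads, transforms, BigQueryIO.write).
  static Pipeline buildPipeline(String[] args) {
    return Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  }

  public static void main(String[] args) throws Exception {
    Pipeline pipeline = buildPipeline(args);

    // For a batch job, waitUntilFinish() blocks until the job terminates.
    PipelineResult result = pipeline.run();
    PipelineResult.State state = result.waitUntilFinish();

    if (state == PipelineResult.State.DONE) {
      // Publish a notification so downstream processing of the BigQuery data can start.
      Publisher publisher = Publisher.newBuilder(
          TopicName.of("my-project", "pipeline-done")).build();  // placeholder names
      publisher.publish(PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("bigquery-write-complete"))
          .build());
      publisher.shutdown();
    }
  }
}
```

Note that this requires the launching program to stay alive until the Dataflow job finishes, which is why the log-based approach below can be more convenient.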
Personally, I prefer to base this on the logs: when the Dataflow job is finished, I pick up the end-of-job log entry and sink it into Pub/Sub. That decorrelates the pipeline code from the next step.
You can also have a look at Workflows. It's not really mature yet, but very promising for simple workflows like yours.
Related
I am using the STORAGE_WRITE_API method in Dataflow to write data into BigQuery through a batch pipeline. It is causing issues: sometimes it gets stuck and does not load data into BigQuery. It works with small tables, but with large tables it starts giving issues without throwing any errors.
I tried the same code with the default write method and it runs properly with small as well as large tables.
So I wanted to know whether the STORAGE_WRITE_API method is recommended for batch pipelines or not?
rows.apply(BigQueryIO.writeTableRows()
.withJsonSchema(tableJsonSchema)
.to(String.format("project:SampleDataset.%s", tableName))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
);
The STORAGE_WRITE_API method is recommended for both batch and streaming according to the documentation:
The BigQuery Storage Write API is a unified data-ingestion API for BigQuery. It combines streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery in real time or to batch process an arbitrarily large number of records and commit them in a single atomic operation.
Advantages of using the Storage Write API:
Exactly-once delivery semantics. The Storage Write API supports exactly-once semantics through the use of stream offsets. Unlike the tabledata.insertAll method, the Storage Write API never writes two messages that have the same offset within a stream, if the client provides stream offsets when appending records.
Stream-level transactions. You can write data to a stream and commit the data as a single transaction. If the commit operation fails, you can safely retry the operation.
Transactions across streams. Multiple workers can create their own streams to process data independently. When all the workers have finished, you can commit all of the streams as a transaction.
Efficient protocol. The Storage Write API is more efficient than the older insertAll method because it uses gRPC streaming rather than REST over HTTP. The Storage Write API also supports binary formats in the form of protocol buffers, which are a more efficient wire format than JSON. Write requests are asynchronous with guaranteed ordering.
Schema update detection. If the underlying table schema changes while the client is streaming, then the Storage Write API notifies the client. The client can decide whether to reconnect using the updated schema, or continue to write to the existing connection.
Lower cost. The Storage Write API has a significantly lower cost than the older insertAll streaming API. In addition, you can ingest up to 2 TB per month for free.
There are many advantages for both batch and streaming.
For batch mode, it's more efficient than the FILE_LOADS (batch load) method.
You need to check all the possible logs to understand this weird behaviour:
the job log in the Dataflow UI
the worker log in the Dataflow UI
the Diagnostics tab in the Dataflow UI
Cloud Logging with a filter on dataflow_step
Also, use the latest Apache Beam version if possible (2.43.0 at the time of writing).
We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that have finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The reason stated is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I would know for sure the increment was called only once? I want to understand what part I am missing.
Also, is CloudBigTableIO usable in streaming mode, or is it tied to batch mode only? I guess we could use the BigTable HBase client directly in the pipeline, but the connector seems to have nice properties like connection pooling which we would like to leverage, hence the question.
The way that Dataflow (and other systems) offer the appearance of exactly-once execution in the presence of failures and retries is by requiring that side effects (such as mutating BigTable) be idempotent. A "write" is idempotent because it is overwritten on retry. Inserts can be made idempotent by including a deterministic "insert ID" that deduplicates the insert.
For an increment, that is not the case. It is not supported because it would not be idempotent when retried, so it would not support exactly-once execution.
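A small self-contained illustration (plain Java, with a map standing in for the table) of why a retried write stays correct while a retried increment does not:

```java
import java.util.HashMap;
import java.util.Map;

public class RetrySemantics {
  // An idempotent write: replaying it leaves the cell in the same state.
  static void write(Map<String, Long> table, String key, long value) {
    table.put(key, value);
  }

  // A non-idempotent increment: replaying it changes the result.
  static void increment(Map<String, Long> table, String key) {
    table.merge(key, 1L, Long::sum);
  }

  public static void main(String[] args) {
    Map<String, Long> table = new HashMap<>();

    write(table, "counter", 5L);
    write(table, "counter", 5L);               // simulated retry: value is still 5
    System.out.println(table.get("counter"));  // 5

    table.clear();
    increment(table, "counter");
    increment(table, "counter");               // simulated retry: double-counted
    System.out.println(table.get("counter"));  // 2
  }
}
```

The runner cannot tell a legitimate second call from a retry of a failed first call, which is exactly why only the idempotent mutations are exposed.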
CloudBigTableIO is usable in streaming mode. We had to implement a DoFn rather than a Sink in order to support that via the Dataflow SDK.
I am using ActiveMQ Classic as a queue manager. My message consumer (@JmsListener using Spring) writes to MongoDB. If MongoDB is unavailable, then it sends the message to a different queue; let's call it a redelivery queue.
So, imagine that after MongoDB has been down for many hours, it's finally up. What is the best way to now read the messages from this redelivery queue?
I am wondering if I could do this by creating a batch job that runs once a day? If so, what options can be used to create a job like that, or are there any better options available?
There is no "batch" mode for JMS. A JMS consumer can only receive one message at a time. Typically the best way to boost message throughput when dealing with lots of messages is to increase the number of consumers. This should be fairly simple to do with a Spring JmsListener using the concurrency setting.
You can, of course, use something like cron to schedule a job to deal with these messages, or you can use something like the Quartz Job Scheduler instead.
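If you go the scheduled-job route, here is a minimal plain-Java sketch using ScheduledExecutorService as a stand-in for cron or Quartz. The actual JMS receive and MongoDB write calls are left as comments, since they need a live broker and database; the in-memory queue is only there to make the drain logic concrete:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RedeliveryDrainJob {

  // Drains whatever is currently in the redelivery queue and re-processes it.
  // In the real job this would be a JMS consumer on the redelivery queue
  // (e.g. consumer.receive(timeout) in a loop) writing each message to MongoDB.
  static int drain(Queue<String> redeliveryQueue) {
    int processed = 0;
    String message;
    while ((message = redeliveryQueue.poll()) != null) {
      // writeToMongo(message);  // hypothetical: your existing persistence logic
      processed++;
    }
    return processed;
  }

  public static void main(String[] args) {
    Queue<String> redeliveryQueue = new ArrayDeque<>();
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Run the drain once a day; an initial delay could align it with a quiet hour.
    scheduler.scheduleAtFixedRate(() -> drain(redeliveryQueue), 0, 1, TimeUnit.DAYS);
  }
}
```

A design note: rather than draining on a fixed schedule, you could also trigger the same drain logic as soon as your application detects that MongoDB is reachable again, which avoids messages sitting in the redelivery queue for up to a day.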
It's really impossible to give you the "best" way to deal with your situation on Stack Overflow. There are simply too many unknown variables.
I am trying to generate stream data to simulate a situation where I receive two Integer values in different time ranges, with timestamps, and Kafka as the connector.
I am using the Flink environment as a consumer, but I don't know what the best solution for the producer is. (Java syntax preferred over Scala if possible.)
Should I produce the data directly to Kafka? If yes, what is the best way to do it?
Or maybe it is better if I produce the data from Flink as a producer, send it to Kafka, and consume it at the end with Flink again? How can I do that from Flink?
Or perhaps there is another easy way to generate stream data and pass it to Kafka.
If yes, please put me on the right track to achieve it.
As David also mentioned, you can create a dummy producer in plain Java using the KafkaProducer API to schedule and send messages to Kafka as you wish. Similarly, you can do that with Flink if you want multiple simultaneous producers, but with Flink you will need to write separate jobs for the producer and the consumer. Kafka basically enables an asynchronous processing architecture, so it does not have queue mechanisms. It is therefore better to keep the producer and consumer jobs separate.
But think a little bit more about the intention of this test:
Are you trying to test Kafka's streaming durability, replication, and offset-management capabilities?
In this case, you need simultaneous producers for the same topic, with a null or non-null key in the message.
Or are you trying to test the Flink-Kafka connector's capabilities?
In this case, you need only one producer; one internal scenario could be a back-pressure test, making the producer push more messages than the consumer can handle.
Or are you trying to test topic partitioning and Flink streaming parallelism?
In this case, use a single producer or multiple producers, but the key of each message should be non-null; you can test how the Flink executors connect to individual partitions and observe their behavior.
There are more ideas you may want to test, and each of these will need something specific to be done (or not done) in the producer.
You can check out https://github.com/abhisheknegi/twitStream for pulling tweets using Java APIs in case needed.
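For the simple dummy-producer case, the event generation can be done in plain Java. The sketch below is self-contained; the wiring to KafkaProducer is left as a comment since it needs a running broker, and the topic name used there is a placeholder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DummyStreamGenerator {

  // One simulated event: an Integer value with an event timestamp.
  static final class Event {
    final long timestampMillis;
    final int value;
    Event(long timestampMillis, int value) {
      this.timestampMillis = timestampMillis;
      this.value = value;
    }
  }

  // Generates `count` events spaced `intervalMillis` apart, starting at `startMillis`.
  // A fixed seed keeps runs reproducible while testing.
  static List<Event> generate(long startMillis, long intervalMillis, int count, long seed) {
    Random random = new Random(seed);
    List<Event> events = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      events.add(new Event(startMillis + i * intervalMillis, random.nextInt(100)));
    }
    return events;
  }

  public static void main(String[] args) {
    for (Event e : generate(0L, 1_000L, 5, 42L)) {
      // With a broker available, each event would be sent roughly like:
      //   producer.send(new ProducerRecord<>("my-topic", null, e.timestampMillis,
      //                                      null, Integer.toString(e.value)));
      // ProducerRecord accepts an explicit timestamp, which Flink can then use
      // as the event time on the consumer side.
      System.out.println(e.timestampMillis + " -> " + e.value);
    }
  }
}
```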
We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...
Details about our pipeline:
We use PubsubIO as our data source (unbounded PCollection)
Intermediate transforms include a custom CombineFn and a very simple windowing/triggering strategy
Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j
Current testing approach:
Run Google Cloud's Pub/Sub emulator on the machine that the tests are running on
Build an in-memory Neo4j database and pass its URI into our pipeline options
Run pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
Use Google Cloud's Java Pub/Sub client library to publish messages to a test topic (using Pub/Sub emulator), which PubsubIO will read from
Data should flow through the pipeline and eventually hit our in-memory instance of Neo4j
Make simple assertions regarding the presence of this data in Neo4j
This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.
The issue we're currently having is that when we run our pipeline it is blocking. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is not reached.
So, I have a few questions:
1) Is there a way to run a pipeline and then stop it manually later?
2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.
3) Is this method of integration testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.
Let me know if I can provide any additional code/context - thanks!
You can run the pipeline asynchronously using the DirectRunner by setting the blockOnRun pipeline option to false. As long as you keep a reference to the returned PipelineResult available, calling cancel() on that result should stop the pipeline.
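As a sketch (assuming Apache Beam and the DirectRunner on the classpath; the transforms are elided and stand for your existing pipeline code):

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AsyncPipelineTestSketch {
  public static void main(String[] args) throws Exception {
    DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
    options.setBlockOnRun(false);  // run() returns immediately instead of blocking

    Pipeline pipeline = Pipeline.create(options);
    // ... apply the PubsubIO read, transforms, and JdbcIO write here ...

    PipelineResult result = pipeline.run();

    // Publish test messages to the emulator, poll Neo4j for the expected data,
    // make assertions, then stop the streaming pipeline:
    result.cancel();
  }
}
```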
For your third question, your setup seems reasonable. However, if you want to have a smaller-scale test of your pipeline (requiring fewer components), you can encapsulate all of your processing logic within a custom PTransform. This PTransform should take inputs that have been fully parsed from an input source, and produce outputs that are yet to be parsed for the output sink.
When this is done, you can use either Create (which will generally not exercise triggering) or TestStream (which may, depending on how you construct the TestStream) with the DirectRunner to generate a finite amount of input data, apply this processing PTransform to that PCollection, and use PAssert on the output PCollection to verify that the pipeline generated the outputs which you expect.
For more information about testing, the Beam website has information about these styles of tests in the Programming Guide and a blog post about testing pipelines with TestStream.