Apache Beam - Integration test with unbounded PCollection - java

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...
Details about our pipeline:
We use PubsubIO as our data source (unbounded PCollection)
Intermediate transforms include a custom CombineFn and a very simple windowing/triggering strategy
Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j
Current testing approach:
Run Google Cloud's Pub/Sub emulator on the machine that the tests are running on
Build an in-memory Neo4j database and pass its URI into our pipeline options
Run the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
Use Google Cloud's Java Pub/Sub client library to publish messages to a test topic (using Pub/Sub emulator), which PubsubIO will read from
Data should flow through the pipeline and eventually hit our in-memory instance of Neo4j
Make simple assertions regarding the presence of this data in Neo4j
This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.
The issue we're currently having is that when we run our pipeline it is blocking. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is not reached.
So, I have a few questions:
1) Is there a way to run a pipeline and then stop it manually later?
2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.
3) Is this method of integration testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.
Let me know if I can provide any additional code/context - thanks!

You can run the pipeline asynchronously using the DirectRunner by setting the isBlockOnRun pipeline option to false. As long as you keep a reference to the returned PipelineResult available, calling cancel() on that result should stop the pipeline.
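For example, a minimal sketch of that approach with the DirectRunner (the two commented-out helpers stand in for your own publishing and assertion code):

import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AsyncPipelineTest {
  public void runPipelineAsynchronously(String[] args) throws Exception {
    DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
    options.setRunner(DirectRunner.class);
    options.setBlockOnRun(false);       // run() now returns instead of blocking

    Pipeline pipeline = Pipeline.create(options);
    // ... apply PubsubIO.Read, your transforms, and the JdbcIO write here ...

    PipelineResult result = pipeline.run();  // does not wait for the unbounded pipeline

    // publishTestMessages();   // hypothetical helper: publish to the Pub/Sub emulator topic
    // assertNeo4jContents();   // hypothetical helper: assert on the in-memory Neo4j instance

    result.cancel();            // stop the streaming pipeline once the assertions are done
  }
}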
For your third question, your setup seems reasonable. However, if you want a smaller-scale test of your pipeline (requiring fewer components), you can encapsulate all of your processing logic within a custom PTransform. This PTransform should take inputs that have already been parsed from the input source, and produce outputs that have not yet been formatted for the output sink.
When this is done, you can use either Create (which will generally not exercise triggering) or TestStream (which may, depending on how you construct the TestStream) with the DirectRunner to generate a finite amount of input data, apply this processing PTransform to that PCollection, and use PAssert on the output PCollection to verify that the pipeline generated the outputs which you expect.
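A bare-bones JUnit sketch of that style of test, where YourProcessingTransform stands in for your encapsulated processing logic and the element values are placeholders:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class ProcessingTransformTest {
  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void producesExpectedOutput() {
    // Feed a finite stream of parsed inputs and close the watermark so triggers fire.
    TestStream<String> input = TestStream.create(StringUtf8Coder.of())
        .addElements("element-1", "element-2")
        .advanceWatermarkToInfinity();

    PCollection<String> output = pipeline
        .apply(input)
        .apply(new YourProcessingTransform()); // hypothetical PTransform wrapping your logic

    PAssert.that(output).containsInAnyOrder("expected-1", "expected-2");
    pipeline.run().waitUntilFinish();
  }
}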
For more information about testing, the Beam website covers these styles of tests in the Programming Guide and in a blog post about testing pipelines with TestStream.

Related

Dataflow send PubSub message after BigQuery write completion

I have a Dataflow job that transforms data and writes out to BigQuery (batch job). Following the completion of the write operation I want to send a message to PubSub which will trigger further processing of the data in BigQuery. I have seen a few older questions/answers that hint at this being possible but only on streaming jobs:
Perform action after Dataflow pipeline has processed all data
Execute a process exactly after BigQueryIO.write() operation
How to notify when DataFlow Job is complete
I'm wondering if this is supported in any way for batch write jobs now? I can't use Apache Airflow to orchestrate all this, unfortunately, so sending a PubSub message seemed like the easiest way.
Beam's model makes this hard to do directly. Indeed, you write a PCollection to BigQuery. By definition, a PCollection is a bounded or unbounded collection. How can you trigger something after an unbounded collection? When do you know that you have reached the end?
So, you have different ways to achieve this. In your code, you can wait for the pipeline to complete and then publish a PubSub message.
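For instance, a rough sketch of that first option, assuming the google-cloud-pubsub client library is on the classpath; the project ID, topic name, and message payload below are placeholders:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunThenNotify {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... apply your reads, transforms, and the BigQuery write here ...

    PipelineResult.State state = pipeline.run().waitUntilFinish(); // blocks until the batch job ends

    if (state == PipelineResult.State.DONE) {
      Publisher publisher =
          Publisher.newBuilder(TopicName.of("my-project", "post-bq-processing")).build();
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("bigquery-write-complete"))
          .build();
      publisher.publish(message).get(); // wait until the notification is actually sent
      publisher.shutdown();
    }
  }
}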
Personally, I prefer to base this on the logs: when the Dataflow job finishes, I capture the end-of-job log entry and sink it into PubSub. That decouples the pipeline code from the next step.
You can also have a look at Workflows. It's not fully mature yet, but it's very promising for a simple workflow like yours.

Is it possible to conditionally append to a cell in BigTable inside of a Google Cloud Dataflow pipeline step? [duplicate]

We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The reason stated is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I know for sure the increment was called only once? I want to understand what part I am missing.
Also, is CloudBigTableIO usable in Streaming mode or is it tied to Batch mode only? I guess we could use the BigTable HBase client directly in the pipeline but the connector seems to have nice properties like Connection-pooling which we would like to leverage and hence the question.
The way that Dataflow (and other systems) offer the appearance of exactly-once execution in the presence of failures and retries is by requiring that side effects (such as mutating BigTable) are idempotent. A "write" is idempotent because it is overwritten on retry. Inserts can be made idempotent by including a deterministic "insert ID" that deduplicates the insert.
For an increment, that is not the case. It is not supported because it would not be idempotent when retried, so it would not support exactly-once execution.
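To illustrate the distinction with the HBase API that the connector wraps (row, family, and qualifier names below are arbitrary):

import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

class IdempotencyExample {
  void buildMutations() {
    // Retrying this Put rewrites the identical cell (same row, column, timestamp), so the end state is unchanged.
    Put idempotentWrite = new Put(Bytes.toBytes("row-1"))
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 1234567890L, Bytes.toBytes(42L));

    // Retrying this Increment adds 1 again, so a retried bundle would double-count.
    Increment nonIdempotent = new Increment(Bytes.toBytes("row-1"))
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), 1L);
  }
}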
CloudBigTableIO is usable in streaming mode. We had to implement a DoFn rather than a Sink in order to support that via the Dataflow SDK.

How to provide a Flux with live time series data starting from a certain time in the past?

My goal is to develop a repository that provides a Flux of live time series data starting from a certain time in the past. The repository should provide an API as follows:
public interface TimeSeriesRepository {

    // returns a Flux with incoming live data, without considering past data
    public Flux<TimeSeriesData> getLiveData();

    // returns a Flux with incoming live data starting at startTime
    public Flux<TimeSeriesData> getLiveData(Instant startTime);
}
The assumptions and constraints are:
the application is using Java 11, Spring Boot 2/Spring 5
the data is stored in a relational database such as PostgreSQL and is timestamped
the data is regularly updated with new data from an external actor
a RabbitMQ broker is available and could be used (if appropriate)
should not include components that require a ZooKeeper cluster or similar, e.g. event logs such as Apache Kafka or Apache Pulsar, or stream processing engines such as Apache Storm or Apache Flink, because it is not a large-scale cloud application but should run on a regular PC (e.g. with 8 GB RAM)
My first idea was to use Debezium to forward incoming data to RabbitMQ and Reactor RabbitMQ to create a Flux. This was my initial plan before I understood that the second repository method, which considers historical data, is also required; that solution alone would not provide historical data.
Thus, I considered using an event log such as Kafka, so I could replay data from the past but found out the operational overhead is too high. So I dismissed this idea and did not even bother to figure out the details on how this could have worked or potential drawbacks.
Now I have considered using Spring Data R2DBC, but I could not figure out what a query that fulfills my goal should look like.
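To make the goal concrete, the rough shape I am aiming for looks like the untested sketch below, with a Spring Data R2DBC repository for the historical part and Reactor RabbitMQ for the live part (all type, queue, and column names are placeholders):

import java.time.Instant;
import com.rabbitmq.client.Delivery;
import org.springframework.data.repository.reactive.ReactiveCrudRepository;
import reactor.core.publisher.Flux;
import reactor.rabbitmq.Receiver;

interface TimeSeriesDataDao extends ReactiveCrudRepository<TimeSeriesData, Long> {
  // derived query; assumes the entity has a "timestamp" column
  Flux<TimeSeriesData> findByTimestampGreaterThanEqualOrderByTimestampAsc(Instant startTime);
}

class RabbitTimeSeriesRepository implements TimeSeriesRepository {
  private final TimeSeriesDataDao dao;
  private final Receiver receiver; // Reactor RabbitMQ receiver; the queue would be fed by Debezium

  RabbitTimeSeriesRepository(TimeSeriesDataDao dao, Receiver receiver) {
    this.dao = dao;
    this.receiver = receiver;
  }

  @Override
  public Flux<TimeSeriesData> getLiveData() {
    return receiver.consumeAutoAck("time-series-updates").map(this::toTimeSeriesData);
  }

  @Override
  public Flux<TimeSeriesData> getLiveData(Instant startTime) {
    // replay history from the database first, then keep streaming live updates
    return Flux.concat(dao.findByTimestampGreaterThanEqualOrderByTimestampAsc(startTime), getLiveData());
  }

  private TimeSeriesData toTimeSeriesData(Delivery delivery) {
    // placeholder: deserialize delivery.getBody(), e.g. with Jackson
    throw new UnsupportedOperationException("deserialization left out of this sketch");
  }
}

What I am unsure about is the seam between the two sources: with a plain concat, events arriving while the historical query is replaying could be missed or duplicated.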
How could I implement the Interface using any of the mentioned tools or maybe even with plain Java/Spring Data?
I will accept any answer that seems like a feasible approach.

Configuring storm cluster for production cluster

We have configured a Storm cluster with one Nimbus server and three supervisors, and published three topologies which perform different calculations, as follows:
Topology1: reads raw data from MongoDB, does some calculations, and stores the result back
Topology2: reads the result of Topology1, does some calculations, and publishes the results to a queue
Topology3: consumes the output of Topology2 from the queue, calls a REST service, gets the reply from the REST service, updates the result in a MongoDB collection, and finally sends an email
As newcomers to Storm, we are looking for expert advice on the following questions:
Is there a way to externalize all configuration, for example into a config.json, that can be referenced by all topologies?
Currently the configuration for connecting to MongoDB, MySQL, the MQ, and the REST URLs is hard-coded in Java files. It is not good practice to customize source files for each customer.
We want to log at each stage [spouts and bolts]. Where should we post/store a log4j.xml that can be used by the cluster?
Is it right to execute a blocking call, like a REST call, from a bolt?
Any help would be much appreciated.
Since each topology is just a Java program, simply pass the configuration into the Java Jar, or pass a path to a file. The topology can read the file at startup, and pass any configuration to components as it instantiates them.
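As a sketch, assuming Jackson for parsing the JSON file (the spout/bolt class names are hypothetical, and the org.apache.storm package prefix depends on your Storm version):

import java.io.File;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class Topology1Main {
  public static void main(String[] args) throws Exception {
    // load customer-specific settings from a config.json whose path is passed as the first argument
    Map<String, Object> settings = new ObjectMapper().readValue(new File(args[0]), Map.class);

    Config conf = new Config();
    conf.putAll(settings); // e.g. mongo.uri, mysql.url, rest.endpoint

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("mongo-spout", new MongoSpout());        // hypothetical spout
    builder.setBolt("calc-bolt", new CalculationBolt(), 4)    // hypothetical bolt, parallelism hint 4
           .shuffleGrouping("mongo-spout");

    StormSubmitter.submitTopology("topology1", conf, builder.createTopology());
    // each bolt can then read the values from the conf map passed to its prepare(...) method
  }
}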
Storm uses slf4j out of the box, and it should be easy to use within your topology as such. If you use the default configuration, you should be able to see logs either through the UI, or dumped to disk. If you can't find them, there are a number of guides to help, e.g. http://www.saurabhsaxena.net/how-to-find-storm-worker-log-directory/.
With storm, you have the flexibility to push concurrency out to the component level, and get multiple executors by instantiating multiple bolts. This is likely the simplest approach, and I'd advise you start there, and later introduce the complexity of an executor inside of your topology for asynchronously making HTTP calls.
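Continuing the sketch above, raising the parallelism of the bolt that makes the blocking REST calls is a one-line change (RestCallBolt is hypothetical):

builder.setBolt("rest-bolt", new RestCallBolt(), 8)  // parallelism hint: 8 executors
       .shuffleGrouping("calc-bolt");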
See http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html for the canonical overview of parallelism in storm. Start simple, and then tune as necessary, as with anything.

Running Google Dataflow with PubsubIO source for testing

I'm creating a data-processing application using Google Cloud Dataflow - it is going to stream data from Pub/Sub to BigQuery.
I'm somewhat bewildered by the infrastructure. I created my application prototype and can run it locally, using files (with TextIO) for the source and destination.
However, if I change the source to PubsubIO.Read.subscription(...) it fails with "java.lang.IllegalStateException: no evaluator registered for PubsubIO.Read" (I am not much surprised, since I see no way to pass authentication anyway).
But how am I supposed to run this? Should I create a virtual machine in Google Compute Engine and deploy everything there, or am I supposed to describe a job somehow and submit it to the Dataflow API (without caring about any explicit VMs)?
Could you please point me to some kind of step-by-step instructions on this topic - or rather briefly explain the workflow? I'm sorry, the question is probably silly.
You would need to run your pipeline on the Google Cloud infrastructure in order to access PubSub, see:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#CloudExecution
From their page:
// Create and set your PipelineOptions.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
// For Cloud execution, set the Cloud Platform project, staging location,
// and specify DataflowPipelineRunner or BlockingDataflowPipelineRunner.
options.setProject("my-project-id");
options.setStagingLocation("gs://my-bucket/binaries");
options.setRunner(DataflowPipelineRunner.class);
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
// Specify all the pipeline reads, transforms, and writes.
...
// Run the pipeline.
p.run();
