Running Google Dataflow with PubsubIO source for testing - java

I'm creating a data-processing application using Google Cloud Dataflow - it is going to stream data from Pub/Sub to BigQuery.
I'm somewhat bewildered by the infrastructure. I created my application prototype and can run it locally, using files (with TextIO) as the source and destination.
However, if I change the source to PubsubIO.Read.subscription(...), it fails with "java.lang.IllegalStateException: no evaluator registered for PubsubIO.Read" (which doesn't surprise me much, since I see no way to pass authentication anyway).
But how am I supposed to run this? Should I create a virtual machine in Google Compute Engine and deploy there, or am I supposed to describe a job somehow and submit it to the Dataflow API (without managing any explicit VMs)?
Could you please point me to some kind of step-by-step instruction on this topic - or briefly explain the workflow? I'm sorry, the question is probably silly.

You would need to run your pipeline on the Google Cloud infrastructure in order to access Pub/Sub; see:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#CloudExecution
From their page:
// Imports from the Dataflow SDK for Java (1.x).
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

// Create and set your PipelineOptions.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
// For Cloud execution, set the Cloud Platform project, staging location,
// and specify DataflowPipelineRunner or BlockingDataflowPipelineRunner.
options.setProject("my-project-id");
options.setStagingLocation("gs://my-bucket/binaries");
options.setRunner(DataflowPipelineRunner.class);
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
// Specify all the pipeline reads, transforms, and writes.
...
// Run the pipeline.
p.run();


Google Cloud scheduler Java job containerized with Selenium

I've got Java code that performs some interactions with web pages, using Selenium.
Now I'd like to get this code executed every hour, and I thought it was a great occasion to discover the cloud world.
I've created an account on Google Cloud.
Because my app needs a driver to use Selenium (geckodriver for Firefox), I'll have to create a Docker image with everything it needs set up inside.
Among the Google Cloud services there is "Cloud Scheduler", which allows me to run code whenever I want.
But here are my questions:
What kind of target should I configure (HTTP, Pub/Sub, HTTP App Engine)?
Because I'm not using Google Cloud Functions, my container will always be up, which doesn't seem like a great idea for pricing reasons. I would have liked to have my container up only for the duration of the execution.
Also, I was thinking of using the Quarkus framework to wrap my application, since I've seen it was made for the cloud and is very quick to start - is that the best option for me?
I'll be very glad if someone can help me see this a little more clearly. I'm not a total beginner - I've worked as a Java/JavaScript developer for 5 years now and have dockerized some applications - but the cloud is a big piece, and it's not easy to know where to start.
So you:
are using docker images
run your workload occasionally
aren't willing to use Cloud Functions
==> Cloud Run is your best bet. Here is the Google Cloud Run quickstart: https://cloud.google.com/run/docs/quickstarts/prebuilt-deploy
Keep in mind that your containerised application needs to be listening for HTTP requests, so take a look at the Cloud Run container runtime contract.
Finally, you can indeed trigger Cloud Run from Cloud Scheduler; here is detailed documentation on how to do it: https://cloud.google.com/run/docs/triggering/using-scheduler
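To illustrate the container runtime contract, here is a minimal sketch of an HTTP entry point that runs the job when Cloud Scheduler calls it. It uses only the JDK (no framework); the class name and the /run-job path are made up for the example:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class JobServer {

    // Starts the HTTP server; passing port 0 lets the OS pick a free port.
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/run-job", exchange -> {
            // This is where the Selenium routine would run.
            byte[] body = "job done".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        // Cloud Run tells the container which port to listen on via $PORT.
        int port = Integer.parseInt(System.getenv().getOrDefault("PORT", "8080"));
        System.out.println("Listening on port " + start(port).getAddress().getPort());
    }
}
```

Because Cloud Run scales the container to zero between requests, the job only runs (and only costs you) while a request is being handled, which addresses the pricing concern in the question.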
As @MBHAPhoenix says, Cloud Run is your best option. You can then trigger the job from Cloud Scheduler. We have this exact scenario currently running for one of our projects, but our container is Python. We wrote an article about it here.
You should note that to trigger your Cloud Run job from Cloud Scheduler, you'll have to secure it. This means you won't be able to just type the URL into a web browser. A service account will be responsible for running the Cloud Run job, and you'll then need to grant your Cloud Scheduler service access to this service account so it can invoke the Cloud Run job. I've been meaning to put up a post about the exact steps for doing this (will try to get it done this weekend).
In terms of cost, we have this snippet from our article
...Cloud Run only runs when it receives an HTTP request. It plays dead and comes alive to execute your code when an HTTP request comes in. When it is done executing the request, it goes 'dead' again till the next request comes in. This means you're not paying for time spent idling i.e. when it is not doing anything.....

GAE Datastore Backup using Java and Cron

I want to back up (and later restore) data from the GAE Datastore using the export facilities that went live this year. I want to use cron and Java. I have found this post, which points at this page, but it's just for Python.
Originally I wanted to do this automatically every day using the Google Cloud Platform console, but I can't find a way to do it. Now I am resorting to incorporating it into Java and a cron job. I need restore instructions as well as backup.
I'm not interested in using the Datastore Admin backup, as this will no longer be available next year.
According to the docs, the way to do it is, indeed, through Cron for GAE and having a GAE module call the API to export.
The point is not the code itself, but understanding why this is so.
Currently, the easiest way to schedule tasks in GCP is through Cron jobs in GAE, but those can only call GAE modules. Following the docs that you pointed out, the Cron will be quite similar to the one described there.
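For reference, the cron entry itself might look something like this in cron.yaml (the /cloud-datastore-export URL is a hypothetical handler path in your GAE module):

```yaml
cron:
- description: "daily Datastore export"
  url: /cloud-datastore-export
  schedule: every 24 hours
```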
Regarding the handler itself, you only need to call the Datastore Admin API authenticated with an account with the proper permissions.
Since the Cloud Client Library does not have admin capabilities for Datastore, you'll have to either construct the call manually, or use the Datastore API Client Library.
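As a rough sketch of constructing the call manually with just the JDK: the Datastore Admin v1 export method is a POST with a small JSON body. The project ID and bucket below are placeholders, and the OAuth access token you would need is left out:

```java
public class DatastoreExport {

    // URL of the Datastore Admin v1 export method for a given project.
    public static String exportUrl(String projectId) {
        return "https://datastore.googleapis.com/v1/projects/" + projectId + ":export";
    }

    // JSON body: where in Cloud Storage the export should be written.
    public static String exportRequestBody(String outputUrlPrefix) {
        return "{\"outputUrlPrefix\": \"" + outputUrlPrefix + "\"}";
    }

    public static void main(String[] args) {
        String url = exportUrl("my-project-id");
        String body = exportRequestBody("gs://my-backup-bucket/exports");
        System.out.println("POST " + url);
        System.out.println(body);
        // To actually send it, POST the body with a Content-Type of
        // application/json and an "Authorization: Bearer <access-token>"
        // header, using an account with the proper permissions.
    }
}
```

A restore goes through the corresponding :import method with the URL of a previously written export, following the same pattern.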
Notice that, for GCP APIs, there are usually two client libraries available: the Cloud Client Library and the API Client Library. The first one is hand-crafted while the second one is auto-generated from the discovery document of each API.
If one specific functionality is not available through the Cloud Client Library (the recommended way of interacting with GCP APIs), you can always check the API Client Library for that same functionality.

Building a File Polling/Ingest Task with Spring Batch and Spring Cloud Data Flow

We are planning to create a new processing mechanism which consists of listening to a few directories, e.g. /opt/dir1 ... /opt/dirN, and for each document created in these directories, starting a routine to process it, persist its records in a database (via REST calls to an existing CRUD API) and generate a protocol file in another directory.
For testing purposes, I am not using any modern (or even decent) framework/approach, just a regular Spring Boot app with a WatchService implementation that listens to these directories and polls the files to be processed as soon as they are created. It works, but clearly I will run into performance problems at some point when I move to production and start receiving dozens of files to be processed in parallel, which isn't a reality in my example.
After some research and some tips from a few colleagues, I found Spring Batch + Spring Cloud Data Flow to be the best combination for my needs. However, I have never dealt with either Batch or Data Flow before, and I'm kind of confused about what these blocks are and how I should build them in order to get this routine going in the simplest and most performant manner. I have a few questions regarding its added value and architecture and would really appreciate hearing your thoughts!
I managed to create and run a sample batch file ingest task based on this section of Spring Docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
How can I keep some variables externalized for my task, e.g. the directory paths? Can I have these streams and tasks read an application.yml somewhere other than inside their jar?
Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Just because it spawns parallel tasks for each file, or do I get any other benefit?
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation if you think only about the sequential processing scenario, where I'd receive only 1 file per hour or so?
Also, if any of you have any guide or sample on how to launch a task programmatically, I would really appreciate it! I am still searching for that, but it doesn't seem like I'm doing it right.
Thank you for your attention and any input is highly appreciated!
UPDATE
I managed to launch my task via the SCDF REST API, so I could keep my original Spring Boot app using WatchService and launch a new task via Feign or XXX. I still know this is far from what I should do here. After some more research I think creating a stream using a file source and sink would be the way to go, unless someone has another opinion, but I can't get the inbound channel adapter to poll from multiple directories, and I can't have multiple streams, because this platform is supposed to scale to the point where we have thousands of participants (or directories to poll files from).
Here are a few pointers.
I managed to create and run a sample batch file ingest task based on this section of Spring Docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If you'd have to launch it automatically upon an upstream event (e.g. a new file), yes, you could do that via a stream (see example). If the events are coming off of a message broker, you can directly consume them in the batch job, too (e.g. AmqpItemReader).
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
Hopefully, the above example clarifies it. If you want to programmatically launch the Task (not via DSL/REST/UI), you can do so with the new Java DSL support, which was added in 1.3.
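If you end up launching over REST from your existing Spring Boot app instead (as in the update to the question), a bare-bones sketch with just the JDK could look like this. The server address, task name, and the --input.file argument are all placeholders; the important detail is URL-encoding the arguments:

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TaskLauncher {

    // Builds the SCDF "launch task" URL: POST /tasks/executions?name=...&arguments=...
    public static String launchUrl(String scdfServer, String taskName, String filePath)
            throws IOException {
        String arguments = URLEncoder.encode("--input.file=" + filePath,
                StandardCharsets.UTF_8.name());
        return scdfServer + "/tasks/executions?name="
                + URLEncoder.encode(taskName, StandardCharsets.UTF_8.name())
                + "&arguments=" + arguments;
    }

    public static void main(String[] args) throws Exception {
        String url = launchUrl("http://localhost:9393", "fileIngestTask", "/opt/dir1/doc-1.txt");
        System.out.println("POST " + url);
        // With Spring on the classpath, one way to fire it would be:
        // new RestTemplate().postForObject(url, null, String.class);
    }
}
```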
How can I keep some variables externalized for my task, e.g. the directory paths? Can I have these streams and tasks read an application.yml somewhere other than inside their jar?
The recommended approach is to use Config Server. Depending on the platform where this is being orchestrated, you'd have to provide the config-server credentials to the Task and its sub-tasks including batch-jobs. In Cloud Foundry, we simply bind config-server service instance to each of the tasks and at runtime the externalized properties would be automatically resolved.
Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Just because it spawns parallel tasks for each file, or do I get any other benefit?
As a replacement for Spring Batch Admin, SCDF provides monitoring and management for Tasks/Batch-Jobs. The executions, steps, step progress, and stacktraces upon errors are persisted and available to explore from the Dashboard. You can also use SCDF's REST endpoints directly to examine this information.
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation if you think only about the sequential processing scenario, where I'd receive only 1 file per hour or so?
This is implementation specific. We do not have any benchmarks to share. However, if performance is a requirement, you could explore remote-partitioning support in Spring Batch. You can partition the ingest or data processing Tasks with "n" number of workers, so that way you can achieve parallelism.

Apache Beam - Integration test with unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...
Details about our pipeline:
We use PubsubIO as our data source (unbounded PCollection)
Intermediate transforms include a custom CombineFn and a very simple windowing/triggering strategy
Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j
Current testing approach:
Run Google Cloud's Pub/Sub emulator on the machine that the tests are running on
Build an in-memory Neo4j database and pass its URI into our pipeline options
Run the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
Use Google Cloud's Java Pub/Sub client library to publish messages to a test topic (using Pub/Sub emulator), which PubsubIO will read from
Data should flow through the pipeline and eventually hit our in-memory instance of Neo4j
Make simple assertions regarding the presence of this data in Neo4j
This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.
The issue we're currently having is that when we run our pipeline it is blocking. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is not reached.
So, I have a few questions:
1) Is there a way to run a pipeline and then stop it manually later?
2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.
3) Is this method of integration testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.
Let me know if I can provide any additional code/context - thanks!
You can run the pipeline asynchronously using the DirectRunner by setting the isBlockOnRun pipeline option to false. As long as you keep a reference to the returned PipelineResult, calling cancel() on that result should stop the pipeline.
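A sketch of how that might look, assuming Beam's DirectRunner is on the classpath (the elided transforms stand in for your real pipeline):

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AsyncPipelineRun {
    public static void main(String[] args) throws Exception {
        DirectOptions options = PipelineOptionsFactory.as(DirectOptions.class);
        options.setBlockOnRun(false); // run() returns instead of blocking

        Pipeline p = Pipeline.create(options);
        // ... apply the PubsubIO read, your transforms, and the JdbcIO write ...

        PipelineResult result = p.run(); // returns immediately now
        // ... publish test messages, then poll Neo4j for the expected data ...
        result.cancel(); // stop the streaming pipeline once assertions are done
    }
}
```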
For your third question, your setup seems reasonable. However, if you want to have a smaller-scale test of your pipeline (requiring fewer components), you can encapsulate all of your processing logic within a custom PTransform. This PTransform should take inputs that have been fully parsed from an input source, and produce outputs that are yet to be parsed for the output sink.
When this is done, you can use either Create (which will generally not exercise triggering) or TestStream (which may, depending on how you construct the TestStream) with the DirectRunner to generate a finite amount of input data, apply this processing PTransform to that PCollection, and use PAssert on the output PCollection to verify that the pipeline generated the outputs which you expect.
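For instance, such a smaller-scale test might look like the following, where a simple MapElements stands in for your real processing PTransform:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ProcessingTransformTest {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Finite input standing in for Pub/Sub; advancing the watermark to
        // infinity lets the pipeline terminate.
        TestStream<String> input = TestStream.create(StringUtf8Coder.of())
                .addElements("a", "b")
                .advanceWatermarkToInfinity();

        // Stand-in for the real processing PTransform.
        PCollection<String> output = p.apply(input)
                .apply(MapElements.into(TypeDescriptors.strings())
                        .via((String s) -> s.toUpperCase()));

        PAssert.that(output).containsInAnyOrder("A", "B");
        p.run().waitUntilFinish(); // PAssert fails the run on a mismatch
    }
}
```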
For more information about testing, the Beam website has information about these styles of tests in the Programming Guide and a blog post about testing pipelines with TestStream.

Deploy Apache Spark application from another application in Java, best practice

I am a new user of Spark. I have a web service that allows a user to request that the server perform a complex data analysis by reading from a database and pushing the results back to the database. I have moved those analyses into various Spark applications. Currently I use spark-submit to deploy these applications.
However, I am curious: when my web server (written in Java) receives a user request, what is considered the "best practice" way to initiate the corresponding Spark application? Spark's documentation seems to suggest using "spark-submit", but I would rather not pipe the command out to a terminal to perform this action. I saw an alternative, Spark-JobServer, which provides a RESTful interface to do exactly this, but my Spark applications are written in either Java or R, which seems not to interface well with Spark-JobServer.
Is there another best practice to kick off a Spark application from a web server (in Java) and wait for a status result on whether the job succeeded or failed?
Any ideas of what other people are doing to accomplish this would be very helpful! Thanks!
I've had a similar requirement. Here's what I did:
To submit apps, I use the hidden Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Using this same API you can query the status of a driver, or you can kill your job later.
There's also another hidden UI Json API: http://[master-node]:[master-ui-port]/json/ which exposes all information available on the master UI in JSON format.
Using the "Submission API" I submit a driver, and using the "Master UI API" I wait until my driver and app states are RUNNING.
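For illustration, here is a sketch of such a submission using just the JDK, in the spirit of the blog post above. All paths, class names, the master host/ports, and the Spark version are placeholders:

```java
public class SparkRestSubmit {

    // JSON body for the hidden submission endpoint (POST /v1/submissions/create).
    public static String submissionBody(String jar, String mainClass, String master) {
        return "{"
                + "\"action\": \"CreateSubmissionRequest\","
                + "\"appResource\": \"" + jar + "\","
                + "\"mainClass\": \"" + mainClass + "\","
                + "\"clientSparkVersion\": \"1.6.0\","
                + "\"appArgs\": [],"
                + "\"environmentVariables\": {\"SPARK_ENV_LOADED\": \"1\"},"
                + "\"sparkProperties\": {"
                + "\"spark.master\": \"" + master + "\","
                + "\"spark.app.name\": \"" + mainClass + "\","
                + "\"spark.jars\": \"" + jar + "\""
                + "}}";
    }

    public static void main(String[] args) {
        String body = submissionBody("hdfs:///apps/analysis.jar",
                "com.example.AnalysisJob", "spark://master-node:7077");
        System.out.println("POST http://master-node:6066/v1/submissions/create");
        System.out.println(body);
        // The response JSON carries a submissionId; poll
        // http://master-node:6066/v1/submissions/status/<submissionId>
        // until the driver state is RUNNING (or FINISHED/FAILED).
    }
}
```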
The web server can also act as the Spark driver. So it would have a SparkContext instance and contain the code for working with RDDs.
The advantage of this is that the Spark executors are long-lived. You save time by not having to start/stop them all the time. You can cache RDDs between operations.
A disadvantage is that since the executors are running all the time, they take up memory that other processes in the cluster could otherwise use. Another is that you cannot have more than one instance of the web server, since you cannot have more than one SparkContext for the same Spark application.
We are using Spark Job-server and it works fine with Java as well: just build a jar of the Java code and wrap it with Scala to work with Spark Job-server.
