Running function once Dataflow Batch-Job step has completed - java

I have a Dataflow job with a fan-out of steps, each of which writes its results to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to identify when the FileIO step has completed in order to run Java code that loads the entire content of the folder into a BigQuery table.
I know I can do it per written file with Cloud Functions and a PubSub notification, but I prefer doing so only once, at the completion of the entire folder.
Thanks!

There are two ways you could do this:
Execute it after your pipeline.
Run your pipeline and on your pipeline result, call waitUntilFinish (wait_until_finish in Python) to delay execution until after your pipeline is complete, as follows:
pipeline.run().waitUntilFinish();
You can check whether the pipeline completed successfully from the state returned by waitUntilFinish, and from there you can load the contents of the folders to BigQuery. The only caveat to this approach is that this code isn't part of the Dataflow pipeline, so if that step relies on the elements in your pipeline it will be harder to implement.
Add transforms after FileIO.Write
The result of the FileIO.Write transform is a WriteFilesResult, which lets you get a PCollection of the written files' names (keyed by destination) by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that load all those files into BigQuery. Here's an example in Java:
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...);
result.getPerDestinationOutputFilenames().apply(...);
The equivalent in Python seems to be called FileResult but I can't find good documentation for that one.

@Daniel Oliveira suggested an approach that you can follow, but in my opinion it is not the best way.
Two reasons why I beg to differ with him:
Narrow scope for handling job failures: Consider a situation where your Dataflow job succeeded but your BigQuery load failed. Because the two are tightly coupled, you won't be able to re-run the load on its own.
Performance of the second job will become a bottleneck: In a production scenario, as your file sizes grow, your load job will become a bottleneck for other dependent processes.
As you already mentioned that you cannot write directly to BigQuery in the same job, I suggest the following approach:
Create another Beam job for loading all the files to BigQuery. You can refer to this for reading multiple files in Beam.
Orchestrate both jobs with Cloud Composer using the Dataflow Java operator or the Dataflow template operator. Set the Airflow trigger rule to 'all_success' and make the load job depend on the first job, e.g. job2.set_upstream(job1) if job1 is the job that writes the files and job2 is the load job. Please refer to the Airflow documentation here.
I hope this helped.

Related

How to make execution of an external JAR inside Spark code sequential

There is a scenario where:
At step 1,
InvokeTakaraJar(parameter..) is called,
which updates a table with records, but this is a plain Java jar and not Spark code.
Then at step 2,
there is
var df = GetDBTable(parameter..), which should get the records from the table updated in the step above.
The problem is that since the first step just invokes the main method of an external Java jar, it runs on the driver,
and the 2nd step does not wait for step 1 to complete.
Ideally the 2nd step needs to wait for the first to complete.
How can I achieve this in Spark Scala code, where a separate Java jar must complete first and only then should the Spark step execute?
Spark doesn't really do guaranteed ordering very well; it wants to complete several tasks in parallel. I would be concerned about running a Java program because it may not scale to complete when you are working with data at scale. (So let's pretend, for the sake of argument, that the data Java is updating will always be small.)
That said, if you need to run this Java program and then run Spark, why not launch the Spark job from Java after you have completed your table update?
Or why not run a shell/Oozie/build script that runs your Java program first and then launches the Spark job?
If you are looking for performance, consider rewriting the Java job so it can be done using Spark tooling.
For the absolute best performance, see if you can rewrite the Java tooling so that it's triggered on data entry. Then you never need to run it as a batch job and can depend on the data already being updated.
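One way to get the "run the Java program first, then launch Spark" ordering suggested above is to start each program as a child process and block until it exits. The following is a minimal sketch using only the JDK's ProcessBuilder (which can also be called from Scala). The class name SequentialLauncher and its method names are made up for illustration, and the commands passed in are whatever you would normally type on the command line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

/** Sketch: run external programs strictly one after another. */
final class SequentialLauncher {

    /** Starts the command, blocks until the process exits, and returns its exit code. */
    static int runAndWait(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .inheritIO() // forward the child's output to this console
                    .start();
            return process.waitFor(); // blocks the calling thread until the child exits
        } catch (IOException e) {
            throw new UncheckedIOException("Could not start " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("Interrupted while waiting for " + command, e);
        }
    }

    /**
     * Runs the first command to completion and starts the second one only if
     * the first exited with code 0. Returns the exit code of the last command run.
     */
    static int runInOrder(List<String> firstCommand, List<String> secondCommand) {
        int firstExit = runAndWait(firstCommand);
        if (firstExit != 0) {
            return firstExit; // skip step two when step one failed
        }
        return runAndWait(secondCommand);
    }
}
```

For example, runInOrder(List.of("java", "-jar", "updater.jar"), List.of("spark-submit", "job.jar")) would start spark-submit only after the updater has exited with code 0; both jar names are placeholders.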

How to keep flink batch job running continuously on local

I am practicing file reading with Flink's batch processing mechanism on a Windows 10 machine.
I downloaded flink-1.7.2-bin-hadoop24-scala_2.12.tgz from Flink's official site and executed start-cluster.bat.
I uploaded the jar through Flink's UI and was able to execute the job, but the job finished in a matter of seconds.
I want to keep the job running continuously so that I can test my use case.
Can you guide me on possible ways to achieve this?
In Flink, batch jobs run until all of their input has been processed, at which point they have finished and are terminated. If you want continuous processing, then you should either
use some deployment automation (outside of Flink) to arrange for new batch jobs to be created as needed, or
implement a streaming job
In your case it sounds like you might be looking for the FileProcessingMode.PROCESS_CONTINUOUSLY option on StreamExecutionEnvironment.readFile -- see the docs for more info.

Spark Streaming Kafka Stream batch execution

I'm new to Spark Streaming and have a general question about its usage. I'm currently implementing an application which streams data from a Kafka topic.
Is it a common scenario to use the application to run a batch only once, for example at the end of the day, collecting all the data from the topic and doing some aggregation and transformation?
That means after starting the app with spark-submit, all of this would be performed in one batch and then the application would shut down. Or is Spark Streaming built to run endlessly and stream data permanently in continuous batches?
You can use the Kafka Streams API and set a time window to perform aggregation and transformation over the events in your topic one batch at a time. For more information about windowing, see https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#windowing

Apache Beam / Google Dataflow Final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on Google Cloud Storage, before running a load job on BigQuery.
This works fine, but as I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we might have had a similar problem in one of our dataflows: we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use wildcards instead of individual file names
i.e TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and as a consequence all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
Not sure if you can apply the same approach, though.

How to parallelize processing from S3 to S3

I have a process that identifies an object on an S3 bucket that must be converted using our (fairly simplistic) custom Java code. The output of this conversion is written to a different prefix on the S3 bucket. So it's a simple, isolated job:
Read the input stream of the S3 object
Convert the object
Write the output as a new S3 object or objects
Each S3 object is probably only a few thousand lines of data, but there are hundreds (maybe thousands) of objects. What is a good approach to running this process on several machines? It appears that I could use Kinesis, EMR, SWF, or something I cook up myself. Each approach has quite a learning curve. Where should I start?
Given that it is a batch process and volume will grow (for 'only' 100GB it can be overkill), Amazon Elastic MapReduce (EMR) seems like a nice tool for the job. Using EMR, you can process the data in your Hadoop MapReduce jobs, Hive queries or Pig scripts (and others), reading the data directly from S3. Also, you can use S3DistCp to transfer and compress data in parallel to and from the cluster, if necessary.
There is a free online introductory course to EMR and Hadoop at http://aws.amazon.com/training/course-descriptions/bigdata-fundamentals/
Also, you can take a free lab at https://run.qwiklabs.com/focuses/preview/1055?locale=en
You can try Amazon SQS to queue each job and then process them in parallel on different machines (it has a much easier learning curve than Amazon EMR / SWF).
Keep in mind though that with SQS you might receive the same message twice and thus process the same file twice if your code doesn't account for this (as opposed to SWF which guarantees that an activity is only performed once).
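To illustrate what "accounting for" duplicate deliveries could look like, here is a minimal sketch of a guard that remembers which object keys a worker has already handled. The class name DuplicateGuard is hypothetical, and the assumption is that the S3 object key identifies a unit of work. It only covers repeats seen by a single JVM; across several machines you would need a shared store, or a conversion whose output simply overwrites the same object when it runs twice.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: lets a worker skip object keys it has already processed. */
final class DuplicateGuard {

    // Thread-safe set, so several worker threads can share one guard.
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a key is offered and false on any repeat delivery. */
    boolean firstDelivery(String objectKey) {
        // add() is atomic and reports whether the key was newly inserted.
        return seenKeys.add(objectKey);
    }
}
```

A worker would call firstDelivery(key) for each received message and run the conversion only when it returns true.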
Also, if your processing code is not utilizing all the resources of the machine it's running on, you can download & process multiple files in parallel on the same machine as S3 will probably handle the load just fine (with multiple concurrent requests).
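Processing several objects in parallel on one machine can be sketched with a fixed-size thread pool from the JDK. This is only an outline under assumptions: ParallelConverter is a made-up name, and the convertOne function stands in for your own "read from S3, convert, write back" code, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: convert many objects concurrently on one machine. */
final class ParallelConverter {

    /**
     * Applies convertOne to every key using up to `threads` worker threads.
     * Results are returned in the same order as the input keys.
     */
    static <R> List<R> convertAll(List<String> objectKeys,
                                  Function<String, R> convertOne,
                                  int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (String key : objectKeys) {
                Callable<R> task = () -> convertOne.apply(key);
                pending.add(pool.submit(task)); // queued; runs when a thread is free
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                results.add(future.get()); // waits for that object's conversion
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while converting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Conversion failed", e.getCause());
        } finally {
            pool.shutdown(); // stop accepting work; idle threads exit
        }
    }
}
```

The pool size is the knob to tune: since the work is mostly waiting on S3 reads and writes, a value above the number of CPU cores may be reasonable, but that is something to measure rather than assume.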
