Apache Beam / Google Dataflow: final step to run only once (Java)

I have a pipeline where I download thousands of files, transform them, and store them as CSV on Google Cloud Storage before running a load job on BigQuery.
This works fine, but since I run thousands of load jobs (one per downloaded file), I hit the quota for imports.
I've changed my code so that it lists all the files in a bucket and runs one load job with all the files as parameters.
So basically I need the final step to run only once, when all the data has been processed. I guess I could use a GroupByKey transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach.

If I understood your question correctly, we may have had a similar problem in one of our dataflows: we were hitting the 'Load jobs per table per day' BigQuery limit because a Dataflow execution was triggered for each file in GCS separately, and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple: we modified our TextIO.read transform to use wildcards instead of individual file names,
i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
This way only one Dataflow job was executed, and as a consequence all the data written to BigQuery was treated as a single load job, despite there being multiple sources.
Not sure if you can apply the same approach, though.
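For illustration, a minimal sketch of what such a single-read pipeline can look like (bucket and folder names are placeholders, and the merged text write stands in for whatever BigQuery load you run afterwards):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WildcardReadPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A single wildcard source matches every file under the folder, so the
    // whole bucket is covered by one Dataflow job and the downstream
    // BigQuery write becomes a single load job.
    p.apply("ReadAllCsvFiles", TextIO.read().from("gs://my-bucket/csv-output/**"))
     .apply("WriteMerged", TextIO.write().to("gs://my-bucket/merged/part").withSuffix(".csv"));

    p.run().waitUntilFinish();
  }
}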


Running function once Dataflow Batch-Job step has completed

I have a Dataflow job with a fan-out of steps, each of which writes its result to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to identify when the FileIO step is complete in order to run Java code that loads the entire content of the folder into a BigQuery table.
I know I can do it per written file with Cloud Functions and a Pub/Sub notification, but I'd prefer to do it only once, at the completion of the entire folder.
Thanks!
There are two ways you could do this:
Execute it after your pipeline.
Run your pipeline and on your pipeline result, call waitUntilFinish (wait_until_finish in Python) to delay execution until after your pipeline is complete, as follows:
pipeline.run().waitUntilFinish();
You can verify whether the pipeline completed successfully based on the result of waitUntilFinish, and from there you can load the contents of the folders to BigQuery. The only caveat to this approach is that your code isn't part of the Dataflow pipeline, so if you rely on the elements in your pipeline for that step it will be tougher.
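A minimal sketch of that pattern, assuming a hypothetical loadFolderToBigQuery helper that wraps your load-job call:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;

// ... build your pipeline as usual ...
PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish(); // blocks until the job finishes

if (state == PipelineResult.State.DONE) {
  // Only load when the pipeline succeeded; on failure you can retry or alert.
  loadFolderToBigQuery("gs://my-bucket/output/"); // hypothetical helper, placeholder path
}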
Add transforms after FileIO.Write
The result of the FileIO.Write transform is a WriteFilesResult that allows you to get a PCollection containing all filenames of the written files by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that can write all those files to BigQuery. Here's an example in Java:
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...);
result.getPerDestinationOutputFilenames().apply(...);
The equivalent in Python seems to be called FileResult but I can't find good documentation for that one.
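A slightly fuller Java sketch, assuming plain text records and a hypothetical LoadFilesToBigQuery composite transform for the actual load:

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.Values;

// Write the records and keep the result so the filenames stay in the pipeline.
WriteFilesResult<Void> result = records.apply(
    FileIO.<String>write()
        .via(TextIO.sink())
        .to("gs://my-bucket/output/")); // placeholder location

// getPerDestinationOutputFilenames yields KV<DestinationT, String>; keep only
// the filename values and hand them to a transform that performs the load.
result.getPerDestinationOutputFilenames()
    .apply(Values.create())
    .apply(new LoadFilesToBigQuery()); // hypothetical composite transform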
@Daniel Oliveira suggested an approach that you can follow, but in my opinion it is not the best way.
Two reasons why I disagree:
Narrow scope for handling job failures: consider a situation where your Dataflow job succeeds but your BigQuery load job fails. Due to this tight coupling you won't be able to re-run the second job on its own.
The second job becomes a performance bottleneck: in a production scenario, as your file sizes grow, the load job will become a bottleneck for other dependent processes.
Since, as you already mentioned, you cannot write directly to BigQuery in the same job, I suggest the following approach:
Create another Beam job for loading all the files to BigQuery. You can refer to this for reading multiple files in Beam.
Orchestrate both jobs with Cloud Composer using the Dataflow Java Operator or the Dataflow Template Operator. Set the Airflow trigger rule to 'all_success' and make the BigQuery load job downstream of the Dataflow job (e.g. load_job.set_upstream(dataflow_job)). Please refer to the Airflow documentation here.
I hope this helps.

Can Apache Spark speed up the process of reading millions of records from Oracle DB and then writing these to a file?

I am new to Apache Spark.
I have a requirement to read millions (~5 million) of records from an Oracle database, do some processing on these records, and write the processed records to a file.
At present this is done in Java, and in this process:
- the records in the DB are categorized into different subsets, based on some data criteria
- in the Java process, 4 threads run in parallel
- each thread reads a subset of records, processes them, and writes the processed records to a new file
- finally it merges all these files into a single file.
Still, it takes around half an hour to complete the whole process.
So I would like to know whether Apache Spark could make this process faster: read millions of records from an Oracle DB, process them, and write to a file?
If Spark can make this process faster, what is the best approach to implement it? Will it also be effective in a non-clustered environment?
Appreciate the help.
Yes, you can do that using Spark; it's built for distributed processing! See http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
You should use a well-configured Spark cluster to achieve this. Performance is something you fine-tune by adding more worker nodes as required.
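As a rough sketch of the JDBC route (connection details, table, and the partitioning column are all placeholders; the partition bounds assume a numeric, indexed ID column):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleExportJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("oracle-export").getOrCreate();

    // Partitioned JDBC read: Spark opens numPartitions parallel connections,
    // each reading a slice of the ID range, instead of one serial cursor.
    Dataset<Row> records = spark.read()
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL") // placeholder
        .option("dbtable", "MY_SCHEMA.MY_TABLE")                // placeholder
        .option("user", "my_user")
        .option("password", "my_password")
        .option("partitionColumn", "ID")
        .option("lowerBound", "1")
        .option("upperBound", "5000000")
        .option("numPartitions", "8")
        .load();

    // ... your processing, expressed as DataFrame/Dataset transformations ...

    // coalesce(1) reproduces the original "merge into a single file" step,
    // at the cost of writing through a single task.
    records.coalesce(1).write().csv("/path/to/output");
  }
}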

Maintain list of processed files to prevent duplicate file processing

I am looking for guidance in the design approach for resolving one of the problems that we have in our application.
We have scheduled jobs in our Java application and we use Quartz scheduler for it. Our application can have thousands of jobs that do the following:
Scan a folder location for any new files.
If there is a new file, then kick off the associated workflow to process it.
The requirement is to:
Process only new files.
If any duplicate file arrives (file with the same name), then don't process it.
As of now, we persist the list of processed files in the Quartz job metadata. But this solution is not scalable: over the years (and depending on the number of files received per day, which could range up to 100K per day), the job metadata that persists the list of processed files grows very large, and it has started giving us data truncation errors (while persisting job metadata in the Quartz table) and slowness.
What is the best approach for implementing this requirement and ensuring that we don't process duplicate files that arrive with the same name? Should we persist the processed file list in an external database instead of the job metadata? If we use a single external database table to persist the list of processed files for all those thousands of jobs, the table may grow huge over the years, which doesn't look like the best approach (though proper indexing may help).
Any guidance here would be appreciated. It looks like a common use case for applications that continuously process new files, so I'm looking for the best possible approach to address this concern.
If not processing duplicate files is critical for you, the best way to do it is to store the file names in a database. Keep in mind that this could be slow, since you would either query for each file name or run one large query for all the new file names. A sketch of this approach follows the list of alternatives below.
That said, if you're willing to process new files which may be a duplicate, there are a number of things that can be done as an alternative:
Move processed files to another folder, so that your input folder always contains only unprocessed files.
Add a custom attribute to your processed files, and process only files that do not have that attribute. Be aware that this method is not supported by all file systems. See this answer for more information.
Keep a reference to the time when your last Quartz job started, and process only files created after that time.
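For the database route, one simple sketch is to let a unique constraint do the de-duplication, so the check and the claim are a single atomic insert (the table name and schema here are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ProcessedFileRegistry {
  // Assumes a table like: CREATE TABLE processed_files (file_name VARCHAR(512) PRIMARY KEY);
  private final Connection connection;

  public ProcessedFileRegistry(Connection connection) {
    this.connection = connection;
  }

  // Returns true if this is the first time the name is seen. The PRIMARY KEY
  // makes the insert fail atomically for duplicates, so two concurrent jobs
  // can never both claim the same file name.
  public boolean tryClaim(String fileName) {
    try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO processed_files (file_name) VALUES (?)")) {
      ps.setString(1, fileName);
      ps.executeUpdate();
      return true;  // new file: safe to process
    } catch (SQLException duplicate) {
      return false; // already processed: skip
    }
  }
}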

How does Hadoop run in "real-time" against non-stale data?

My abysmally rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MR jobs against data stored in structured files on its HDFS, and that these tools (again, Flume, Sqoop, etc.) are responsible for importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.
To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes/hours/etc. old, because importing big data from these disparate systems into HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.
Say we have an app that has real-time constraints of making a decision within 500ms of something occurring. Say we have a massive stream of data that is being imported into HDFS, and because the data is so big it takes, say, 3 seconds to even get the data on to HDFS. Then say that the MR job that is responsible for making the decision takes 200ms. Because the loading of the data takes so long, we've already blown our time constraint, even though the MR job processing the data would be able to finish inside the given window.
Is there a solution for this kind of big data problem?
Yes: tools like the Apache Spark Streaming API and Apache Storm can be used for real-time stream processing.
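For a feel of what that looks like in Java with Spark Streaming (the socket source and the 500 ms batch interval are just illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RealtimeDecisions {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("realtime-decisions");
    // Micro-batches every 500 ms: events are processed as they arrive
    // instead of being landed on HDFS first and read back later.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(500));

    JavaDStream<String> events = jssc.socketTextStream("event-host", 9999); // placeholder source

    events.foreachRDD(batch ->
        batch.foreach(event -> {
          // make the per-event decision inside the batch window
        }));

    jssc.start();
    jssc.awaitTermination();
  }
}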

How to parallelize processing from S3 to S3

I have a process that identifies an object on an S3 bucket that must be converted using our (fairly simplistic) custom Java code. The output of this conversion is written to a different prefix on the S3 bucket. So it's a simple, isolated job:
Read the input stream of the S3 object
Convert the object
Write the output as a new S3 object or objects
This process involves probably only a few thousand lines of data per S3 object, but hundreds (maybe thousands) of objects. What is a good approach to running this process on several machines? It appears that I could use Kinesis, EMR, SWF, or something I cook up myself. Each approach has quite a learning curve. Where should I start?
Given that it is a batch process and the volume will grow (for 'only' 100GB it may be overkill), Amazon Elastic MapReduce (EMR) seems like a nice tool for the job. Using EMR, you can process the data in Hadoop MapReduce jobs, Hive queries, or Pig scripts (among others), reading the data directly from S3. You can also use S3DistCp to transfer and compress data in parallel to and from the cluster, if necessary.
There is a free online introductory course to EMR and Hadoop at http://aws.amazon.com/training/course-descriptions/bigdata-fundamentals/
Also, you can take a free lab at https://run.qwiklabs.com/focuses/preview/1055?locale=en
You can try Amazon SQS to queue each job and then process them in parallel on different machines (it has a much easier learning curve than Amazon EMR / SWF).
Keep in mind though that with SQS you might receive the same message twice and thus process the same file twice if your code doesn't account for this (as opposed to SWF which guarantees that an activity is only performed once).
Also, if your processing code is not utilizing all the resources of the machine it's running on, you can download and process multiple files in parallel on the same machine, as S3 will probably handle the load just fine (with multiple concurrent requests).
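A minimal worker sketch with the AWS SDK for Java (the queue URL, bucket name, and the convert helper are placeholders), where each message is assumed to carry one S3 object key:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class ConversionWorker {
  public static void main(String[] args) {
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/convert-jobs"; // placeholder

    while (true) {
      for (Message msg : sqs.receiveMessage(queueUrl).getMessages()) {
        String key = msg.getBody();
        // Conversion must be idempotent: SQS may deliver a message twice.
        convert(s3, "my-bucket", key, "converted/" + key); // hypothetical helper
        // Delete only after success so failed conversions are redelivered.
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle());
      }
    }
  }

  private static void convert(AmazonS3 s3, String bucket, String inKey, String outKey) {
    // ... read with s3.getObject(bucket, inKey), transform, write with s3.putObject(...) ...
  }
}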
