There is a scenario where, at step 1, InvokeTakaraJar(parameter..) is called. It updates a table with records, but it is a plain Java jar, not Spark code.
Then at step 2 there is var df = GetDBTable(parameter..), which should read the records from the table updated in the previous step.
The problem is that the first step is just an invocation of the main method of an external Java jar, so it runs on the driver, and the second step does not wait for step 1 to complete.
Ideally the second step needs to wait for the first to finish.
How can this be achieved in Spark Scala code, when a separate Java jar must run to completion before the Spark step executes?
Spark doesn't really do guaranteed ordering very well; it wants to run several tasks in parallel. I would also be concerned about running a plain Java program, because it may not scale once you are working with data at scale. (So, for the sake of argument, let's assume the data that the Java jar updates will always be small.)
That said, if you need to run this Java program and then run Spark, why not launch the Spark job from Java after you have completed your table update? (See the sketch at the end of this answer.)
Alternatively, run a shell/Oozie/build script that runs your Java program first and then launches the Spark job.
If you are looking for performance, consider rewriting the Java job so the same work can be done with Spark tooling.
For the absolute best performance, see if you can rewrite the Java tooling so that it is triggered on data entry; then you never need to run it as a batch job and can rely on the data already being up to date.
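To make the "launch the Spark job from Java" suggestion concrete, here is a minimal sketch using Spark's SparkLauncher. The jar paths, class names, arguments and master are placeholders for your own setup, so treat it as an outline rather than a drop-in implementation:

import org.apache.spark.launcher.SparkLauncher;

public class RunAfterTableUpdate {
    public static void main(String[] args) throws Exception {
        // Step 1: run the plain Java jar synchronously and block until it finishes.
        Process update = new ProcessBuilder("java", "-jar", "/path/to/takara.jar", "someParameter")
                .inheritIO()
                .start();
        int exitCode = update.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("Table update failed with exit code " + exitCode);
        }

        // Step 2: only now launch the Spark job that reads the updated table.
        Process spark = new SparkLauncher()
                .setAppResource("/path/to/spark-job.jar")        // placeholder Spark application jar
                .setMainClass("com.example.ReadUpdatedTableJob") // placeholder main class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .launch();
        spark.waitFor();
    }
}

The waitFor() call on the table-update process is what provides the "step 2 waits for step 1" behaviour before the Spark job is ever started.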
I have a general design problem regarding Cucumber.
I'm trying to build some cucumber scenarios around a specific external process that takes some time. Currently, the tests look like this:
Given some setup
When I perform X action
And do the external process
Then validate some stuff
I have a number of these tests, and it would be massively more performant if I could do the external process just once for all these scenarios.
The problem I'm running into is that there doesn't seem to be any way to communicate between scenarios in Cucumber.
My first idea was to have the tests run concurrently and have each one wait, polling the external process to see if it's running before proceeding. But I have no way of triggering the process once all the tests are waiting, since they can't communicate.
My second idea was to persist data between tests: each test would stop at the point where the process needs to run, then somehow hand its CucumberContext over to a follow-up scenario that validates things after the process. However, I'd have to save this data to the file system and pick it up again, which is a very ugly way to handle it.
Does anyone have advice on either synchronizing steps in cucumber, or creating "continuation" scenarios? Or is there another approach I can take?
You can't communicate data between scenarios, nor should you try to. Each scenario is, by design, its own separate thing that sets up and resets everything.
What you can do instead is improve the way you execute your external process: run it once, capture the result, and re-use that result in future executions of the scenario.
You could change your scenarios to reflect this, e.g.
Given I have done x
And the external process has been run for x
Then y should have happened
You should also consider the user experience of waiting for the external process. For new behaviours you could do something like
When I do x
Then I should see I am waiting for the external process
and then later do another scenario
Given I have done x
And the external process has completed
Then I should see y
You can use something like VCR to record the results of executing your external process. (https://rubygems.org/gems/vcr/versions/6.0.0)
Note: VCR is Ruby-specific, but I am sure you can find a Java equivalent.
Once your external process executes pretty much instantly (a few milliseconds), you no longer have any need to share things between scenarios.
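If you do need to keep the external process as-is for now, one way to apply the "run it once and re-use the result" idea in Cucumber-JVM is a guarded hook. Everything named here (ExternalProcess, runOnceAndCaptureResult) is a hypothetical stand-in for your own code, not part of Cucumber:

import io.cucumber.java.Before; // cucumber.api.java.Before in older Cucumber-JVM versions

public class ExternalProcessHooks {

    // Shared by all scenarios executed in the same JVM run.
    private static String cachedResult;

    @Before
    public void ensureExternalProcessHasRun() {
        // Run the slow external process for the first scenario only;
        // later scenarios re-use the cached result.
        if (cachedResult == null) {
            cachedResult = ExternalProcess.runOnceAndCaptureResult(); // hypothetical helper
        }
    }

    public static String cachedResult() {
        return cachedResult;
    }
}

Note that this trades a little scenario isolation for speed, so make sure the cached result really is read-only from the scenarios' point of view.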
I am practicing file reading through Flink's batch processing mechanism on a Windows 10 machine.
I downloaded flink-1.7.2-bin-hadoop24-scala_2.12.tgz from Flink's official site and executed start-cluster.bat.
I uploaded the jar through Flink's UI and was able to execute the job, but the job finished in a matter of seconds.
I want to keep the job running continuously so that I can test my use case.
Can you suggest possible ways to achieve this?
In Flink, batch jobs run until all of their input has been processed, at which point they have finished and are terminated. If you want continuous processing, then you should either
use some deployment automation (outside of Flink) to arrange for new batch jobs to be created as needed, or
implement a streaming job
In your case it sounds like you are looking for the FileProcessingMode.PROCESS_CONTINUOUSLY option on StreamExecutionEnvironment.readFile -- see the docs for more info.
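For reference, a minimal Java sketch of that streaming option; the directory and the 10-second poll interval are just example values:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class ContinuousFileRead {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String inputDir = "file:///C:/data/input"; // example path on the Windows machine

        // Re-scan the directory every 10 seconds; the job keeps running
        // instead of finishing after a single pass over the input.
        DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path(inputDir)),
                inputDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                10_000L);

        lines.print();
        env.execute("continuous-file-read");
    }
}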
I have a Dataflow job with a fan-out of steps, each of which writes its results to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to identify when the FileIO step has completed, in order to run Java code that loads the entire content of the folder into a BigQuery table.
I know I can do it per written file with Cloud Functions and a Pub/Sub notification, but I would prefer to do it only once, when the entire folder is complete.
Thanks!
There are two ways you could do this:
Execute it after your pipeline.
Run your pipeline, and on the resulting pipeline result call waitUntilFinish (wait_until_finish in Python) to delay further execution until the pipeline is complete, as follows:
pipeline.run().waitUntilFinish();
You can verify whether the pipeline completed successfully based on the result of waitUntilFinish, and from there you can load the contents of the folders into BigQuery. The only caveat to this approach is that your code isn't part of the Dataflow pipeline, so if that step relies on the elements computed in your pipeline, it will be tougher.
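As a sketch of what that could look like in Java with the google-cloud-bigquery client library (imports from com.google.cloud.bigquery and org.apache.beam.sdk are omitted; the dataset, table and GCS path are placeholders):

PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish();

if (state == PipelineResult.State.DONE) {
    // Load everything the pipeline wrote under this folder into one BigQuery table.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of("my_dataset", "my_table");            // placeholder table
    LoadJobConfiguration loadConfig = LoadJobConfiguration
            .newBuilder(tableId, "gs://my-bucket/output/folder-a/*")   // placeholder folder
            .setFormatOptions(FormatOptions.csv())
            .build();
    Job loadJob = bigquery.create(JobInfo.of(loadConfig));
    loadJob.waitFor(); // block until the load job completes (or inspect its status)
}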
Add transforms after FileIO.Write
The result of the FileIO.Write transform is a WriteFilesResult that allows you to get a PCollection containing all filenames of the written files by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that can write all those files to BigQuery. Here's an example in Java:
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...)
result.getPerDestinationOutputFilenames().apply(...)
The equivalent in Python seems to be called FileResult but I can't find good documentation for that one.
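Building on that snippet, here is a hedged Java sketch of what "continue your pipeline with transforms" could look like; parseCsvLine is a hypothetical function that turns one CSV line into a TableRow, and the table spec is a placeholder:

PCollection<TableRow> rows =
    result.getPerDestinationOutputFilenames()
          .apply(Values.create())       // drop DestinationT, keep only the filenames
          .apply(FileIO.matchAll())
          .apply(FileIO.readMatches())
          .apply(TextIO.readFiles())    // one String element per line
          .apply(MapElements
                  .into(TypeDescriptor.of(TableRow.class))
                  .via((String line) -> parseCsvLine(line)));  // hypothetical parser
rows.setCoder(TableRowJsonCoder.of());  // make the TableRow coder explicit

rows.apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")                  // placeholder table spec
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));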
@Daniel Oliveira suggested an approach that you can follow, but in my opinion it is not the best way.
Two reasons why I beg to differ with him:
Narrow scope for handling job failures: consider a situation where your Dataflow job succeeded but the load into BigQuery failed. Because of this tight coupling, you won't be able to re-run just the second job.
The second job will become a performance bottleneck: in a production scenario, as your files grow, the load job will become a bottleneck for other dependent processes.
Since, as you already mentioned, you cannot write directly to BQ in the same job, I suggest the following approach:
Create another Beam job for loading all the files into BQ. You can refer to this for reading multiple files in Beam.
Orchestrate both jobs with Cloud Composer, using the Dataflow Java Operator or Dataflow Template Operator. Set the Airflow trigger rule to 'all_success' and set job1.setUpstream(job2). Please refer to the Airflow documentation here.
I hope this helps.
I have a pipeline in which I download thousands of files, transform them, and store them as CSV on Google Cloud Storage, before running a load job into BigQuery.
This works fine, but since I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so that it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we may have had a similar problem in one of our dataflows: we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately, and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple: we modified our TextIO.read transform to use wildcards instead of individual file names,
i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
This way only one dataflow job was executed, and as a consequence all the data written to BigQuery was treated as a single load job, despite there being multiple sources.
Not sure if you can apply the same approach, though.
I am using the TFS Java SDK (version 11.0) to create some wrapper functions for a website. I have code that queries Work Items to retrieve information about defects. When I run the code in Eclipse it takes about 8-10 seconds to retrieve all 1000 Work Items. That same code when it is run in a web container (Tomcat) takes twice as long. I cannot figure out why it runs slower in Tomcat vs just running it in Eclipse. Any ideas?
With this data I cannot figure out a reason, but you could try javOSize, in particular http://www.javosize.com/gettingStarted/slow.html. The tool is free and the authors are very willing to help you track down slowdown problems.
You can follow a similar procedure; based on your data I would execute the following.
Let's imagine your wrapper class is called com.acme.WrapperClass, then do:
cd REPOSITORY
exec FIND_SLOW_METHODS_EXECUTING_CLASS com.acme.WrapperClass 100 20000
This will block for 20 s and examine any method taking more than 100 ms. As soon as you run the exec command, execute your slow transaction and wait until javOSize returns the output; then repeat the same procedure for the process running in Eclipse.
Paste both outputs here and hopefully we will find the answer.