I am practicing file reading with Flink's batch processing mechanism on a Windows 10 machine.
I downloaded flink-1.7.2-bin-hadoop24-scala_2.12.tgz from Flink's official site and executed start-cluster.bat.
I uploaded the jar through Flink's UI and was able to execute the job, but the job finished in a matter of seconds.
I want to keep the job running continuously so that I can test my use case.
Can you suggest possible ways to achieve this?
In Flink, batch jobs run until all of their input has been processed, at which point they have finished and are terminated. If you want continuous processing, then you should either
use some deployment automation (outside of Flink) to arrange for new batch jobs to be created as needed, or
implement a streaming job
In your case it sounds like you might be looking for the FileProcessingMode.PROCESS_CONTINUOUSLY option on StreamExecutionEnvironment.readFile -- see the docs for more info.
I have a scenario with two steps.
At step 1, InvokeTakaraJar(parameter..) is called. It updates a table with records, but it is a normal Java jar and not Spark code.
At step 2, var df = GetDBTable(parameter..) should get the records from the table updated in step 1.
The problem is that step 1 is just an invocation of the main method of an external Java jar, so it runs on the driver, and step 2 does not wait for step 1 to complete.
Ideally, step 2 needs to wait for step 1 to complete.
How can I achieve this in Spark Scala code, where a separate Java jar has to run to completion before the Spark step executes?
Spark isn't really built around guaranteed ordering; it wants to complete several tasks in parallel. I would also be concerned about running a Java program, because it may not scale well enough to finish when you are working with data at scale. (So let's assume, for the sake of argument, that the data the Java program updates will always be small.)
That said, if you need to run this Java program and then run Spark, why not launch the Spark job from Java after you have completed your table update?
Or why not run a shell/Oozie/build script that runs your Java program first and then launches the Spark job?
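The "run the Java program first, then launch the Spark job" idea can be sketched as a small driver script. This is only an illustration in Python, not the asker's setup: the function name run_in_order is made up, and the two commands are stand-ins so that the sketch runs anywhere. subprocess.run blocks until the child process exits, and check=True raises an error on a non-zero exit code, so the second step never starts before the first has finished successfully.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command to completion, in order; stop at the first failure.

    Returns the number of commands that completed successfully.
    """
    completed = 0
    for command in commands:
        try:
            # Blocks until the process exits; check=True raises
            # CalledProcessError when the exit code is non-zero.
            subprocess.run(command, check=True)
        except subprocess.CalledProcessError:
            break
        completed += 1
    return completed


# Stand-in commands (assumption). In practice these would be something like
# a "java -jar ..." call for the table update followed by a spark-submit
# call; both are hypothetical and not taken from the question.
steps = [
    [sys.executable, "-c", "print('step 1: table updated')"],
    [sys.executable, "-c", "print('step 2: spark job launched')"],
]

if __name__ == "__main__":
    print(run_in_order(steps), "steps completed")
```

Because the loop stops at the first failing command, a failed table update prevents the Spark step from reading a half-updated table.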
If you are looking for performance, consider rewriting the Java job so that it can be done with Spark tooling.
For the absolute best performance, see if you can rewrite the Java tooling so that it is triggered on data entry. Then you never need to run it as a batch job, and you can depend on the data already being updated.
I have a Dataflow job with a fan-out of steps, each of which writes its results to a different folder on GCS. During a batch job execution, hundreds of files are written per folder.
I'd like to identify when the FileIO step is completed in order to run Java code that loads the entire content of the folder into a BigQuery table.
I know I can do it per written file with Cloud Functions and Pub/Sub notifications, but I would prefer to do it only once, when the entire folder is complete.
Thanks!
There are two ways you could do this:
Execute it after your pipeline.
Run your pipeline and on your pipeline result, call waitUntilFinish (wait_until_finish in Python) to delay execution until after your pipeline is complete, as follows:
pipeline.run().waitUntilFinish();
You can verify whether the pipeline completed successfully based on the result of waitUntilFinish and from there you can load the contents of the folders to BigQuery. The only caveat to this approach is that your code isn't part of the Dataflow pipeline so if you rely on the elements in your pipeline for that step it will be tougher.
Add transforms after FileIO.Write
The result of the FileIO.Write transform is a WriteFilesResult that allows you to get a PCollection containing all filenames of the written files by calling getPerDestinationOutputFilenames. From there you can continue your pipeline with transforms that can write all those files to BigQuery. Here's an example in Java:
WriteFilesResult<DestinationT> result = files.apply(FileIO.write()...);
result.getPerDestinationOutputFilenames().apply(...);
The equivalent in Python seems to be called FileResult but I can't find good documentation for that one.
@Daniel Oliveira suggested an approach that you can follow, but in my opinion it is not the best way.
Two reasons why I beg to differ with him:
Narrow scope for handling job failures: Consider a situation where your Dataflow job succeeded but your BigQuery load job failed. Because of this tight coupling, you won't be able to re-run the second job on its own.
Performance of the second job will become a bottleneck: In a production scenario, as your file sizes grow, your load job will become a bottleneck for other dependent processes.
Since you already mentioned that you cannot write directly to BQ in the same job, I suggest the following approach:
Create another Beam job for loading all the files to BQ. You can refer to this for reading multiple files in Beam.
Orchestrate both jobs with Cloud Composer using the Dataflow Java Operator or the Dataflow Template Operator. Set the Airflow trigger rule to 'all_success' and make the load job depend on the Dataflow job with set_upstream (for example, job2.set_upstream(job1), where job1 is the Dataflow job and job2 is the load job). Please refer to the Airflow documentation here.
I hope this helps.
Distributed CRON in Kubernetes is still a work in progress (https://github.com/kubernetes/kubernetes/issues/2156).
What do you use for CRON jobs in Kubernetes today?
Do you recommend any solution that works well with Spring/JVM-based services? Spring/JVM startup time is quite high and if CRON scheduler started a new JVM for each job, startup time might be much higher than time of actual work - is there any solution that could run the job in existing JVM?
Thank you,
Jakub
I think Mesos Chronos is still the ideal solution.
I wrote a small Go app that functions like cron but writes log info to stdout (no email!) and can be built into a static binary for easy containerization.
I built kubectl from source as a static binary and included it in the image (it may be a static binary in the most recent releases). Kubectl will automatically look for the service account token/certs in /var/run/secrets/kubernetes.io/serviceaccount/ so you should be good to go unless you're not using the default service account.
I then set up a crontab to run kubectl to create a job at the period that I wanted. The crontab and yaml files for the jobs can be mounted as a secret. You can either use conf2kube or some other way of generating the secrets. I wrote a simple python script.
It's totally a workaround until there is proper support but I hope that helps.
I'm using cron jobs in Kubernetes with Java, and each job launches a new JVM, so no, there is no reuse here.
To reuse a JVM you must have something like a webapp that is always running, and schedule jobs to run inside that already running app.
I have a parent job that triggers many downstream jobs dynamically.
I use Python code to generate the list of jobs to be triggered, write it to a properties file, inject the file using the EnvInject plugin, and then use the "Parameterized trigger plugin" with the job list variable (comma separated) to launch the jobs (if anyone knows an easier way of doing this, I would love to hear that also!).
It works great except when killing the parent job, the triggered jobs continue to run, and I want them dead also when killing the parent.
Is there a plugin or way to implement this? Maybe a hook that is called when a job is killed?
EDIT:
Sorry for the confusion, I wasn't clear about what I meant with "killing" the job. I mean clicking the red 'x' button in the Jenkins gui, not the Unix signal.
Thanks in advance.
Instead of killing the job, have another job that programmatically terminates all the required jobs. You could reuse the same properties file to determine which jobs to kill, and use a Groovy script to terminate them.
To catch SIGTERM inside your process you could use the following code (unix specific):
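import signal
children = []  # assumed: Popen objects for the child processes started earlier
def kill_children(*args, **kwargs):
    # terminate every child process in the stored list
    for child in children:
        child.terminate()
signal.signal(signal.SIGTERM, kill_children)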
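There are lots of other signals that a process can receive, so it is a matter of working out which signal is killing the parent and handling it. SIGKILL is the most obvious candidate in your situation, but note that SIGKILL cannot be caught or handled, so a handler like this only helps if the parent receives a catchable signal such as SIGTERM.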
What's the best/easiest way to run periodic tasks (like a daemon thread) on a tomcat/jetty server? How do I start the thread? Is there a simple mechanism or is this a bad idea at all?
If you want to keep everything on the Java side, take a look at Quartz.
It handles failover and fine-grained distribution of jobs, with the same flexibility as cron jobs.
It's okay and effective to stash a java.util.Timer (or better yet a ScheduledExecutorService) instance in your ServletContext. Create it in a Servlet's init() call and all your servlets can add tasks to it.
One general purpose way which works for many systems is simply to have a cron job which performs a periodic wget against your app.
I can't answer the tomcat/jetty stuff, but I've done similar things with Python based web apps.
I normally just run a separate app that does the periodic tasks needed. If interop is needed between the website and the app, that communication can happen through some sort of API (using something like XML-RPC/unix sockets/etc) or even just through the database layer, if that's adequate.
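A separate app of this kind can be very small. As a rough sketch in Python (the function name run_periodically, its parameters, and the cleanup task are made up for illustration and do not come from any particular framework), a loop can call a task at a fixed interval and use a threading.Event both as the stop signal and as an interruptible sleep:

```python
import threading


def run_periodically(task, interval_seconds, stop_event, max_runs=None):
    """Call task() repeatedly, pausing interval_seconds between calls.

    Stops when stop_event is set or, if max_runs is given, after that many
    calls. Returns the number of times task() was called.
    """
    runs = 0
    while not stop_event.is_set():
        task()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Event.wait returns early if the event is set, so a stop request
        # does not have to wait out the full interval.
        stop_event.wait(interval_seconds)
    return runs


def cleanup():
    # Hypothetical task; a real one might purge old rows or refresh a cache.
    print("running periodic cleanup")


if __name__ == "__main__":
    stop = threading.Event()
    # Run three times, 0.1 s apart, then return. With max_runs=None the
    # loop keeps going until another thread calls stop.set().
    run_periodically(cleanup, 0.1, stop, max_runs=3)
```

The pause starts after each call finishes, so the effective period is the task's run time plus the interval. That is usually fine for housekeeping work, but it is not a fixed-rate schedule.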
Hope that helps.
If you want to use a cron job but don't have administrative access to the development system, you can do a user crontab by executing the command:
crontab -e
It uses vi by default on most systems, but you can change it to the editor of your choice via:
export EDITOR=/usr/local/bin/my_editor
Then, executing the crontab -e command will launch your crontab file in your editor. Upon saving, the changes will be committed back into the system's cron.