How to schedule/trigger spark jobs in Cloudera? - java

Currently our project is on MR and we use Oozie to orchestrate our MR jobs. Now we are moving to Spark, and would like to know the recommended ways to schedule/trigger Spark jobs on the CDH cluster. Note that CDH Oozie does not support Spark 2 jobs, so please suggest an alternative.

Last time I looked, Hue had a Spark option in the Workflow editor. If Cloudera didn't support that, I'm not sure why it'd be there...
CDH Oozie does support plain shell actions, though; you just need to be sure the spark-submit command is available locally on every NodeManager.
If that doesn't work, it also supports Java actions for running a JAR, so you could write all of your Spark jobs with a main method that loads any configuration from there.
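For what it's worth, a minimal sketch of such a main method (the class name, argument layout and the read/write logic are just placeholders) could look like:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical entry point for an Oozie Java action (or plain java -jar):
// all configuration arrives through args, nothing is hard-coded.
public class MySparkJob {
    public static void main(String[] args) {
        String inputPath = args[0];   // e.g. an HDFS path passed by the workflow
        String outputPath = args[1];

        SparkSession spark = SparkSession.builder()
                .appName("MySparkJob")
                // no .master() here: let spark-submit / the launcher decide
                .getOrCreate();

        Dataset<Row> input = spark.read().parquet(inputPath);
        input.write().parquet(outputPath);

        spark.stop();
    }
}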

As soon as you submit the Spark job from the shell, like:
spark-submit <script_path> <arguments_list>
it gets submitted to the CDH cluster. You will immediately be able to see the Spark job and its progress in Hue. This is how we trigger our Spark jobs.
Further, to orchestrate a series of jobs, you can use a shell script wrapper around it, or a cron job to trigger it on a schedule.
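If you would rather trigger the submission from Java code than from a wrapper shell script, Spark also ships a programmatic launcher (org.apache.spark.launcher.SparkLauncher). A minimal sketch, assuming SPARK_HOME is set on the machine that triggers the job; the jar path, class name and arguments are placeholders:

import org.apache.spark.launcher.SparkLauncher;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setAppResource("/path/to/my-spark-job.jar")  // placeholder jar
                .setMainClass("com.example.MySparkJob")       // placeholder class
                .setMaster("yarn")
                .setDeployMode("cluster")
                .addAppArgs("/input/path", "/output/path")
                .launch();

        int exitCode = spark.waitFor();   // block until spark-submit finishes
        System.out.println("spark-submit exited with code " + exitCode);
    }
}

launch() simply spawns a spark-submit process under the hood, so the same Hue/YARN monitoring described above still applies.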

Related

How to organize an Apache Spark project

I am new to Spark and I would like to understand how to best set up a project. I will use Maven for building, including tests.
I wrote my first Spark application, but to launch it during development I had to run it in local mode:
SparkSession spark = SparkSession.builder()
.appName("RDDTest")
.master("local")
.getOrCreate();
However, if I want to submit it to a cluster, it would still run in local mode, which I do not want.
So I would have to change the code before deployment, build the jar and submit it to the cluster. Obviously this is not the best approach.
I was wondering what is the best practice? Do you externalize the master URL somehow?
Generally you only want to run Spark in local mode from test cases, so your main job shouldn't have any local mode hard-coded into it.
Also, all the parameters which Spark accepts should come from the command line; for example the app name, master, etc. should be taken from the command line instead of being hard-coded.
Try to keep the dataframe manipulations in small functions so they can be tested independently.
You need to use the spark-submit script.
You can find further documentation here: https://spark.apache.org/docs/latest/submitting-applications.html
I would have all methods take a SparkContext as a parameter (maybe even an implicit parameter). Next, I would either use Maven profiles to define the parameters for the SparkContext (test/prod), or alternatively program arguments.
An easy alternative would be to programmatically define one SparkContext for your (prod) main method (cluster mode), and a separate one for your tests (local mode).
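To make that concrete, here is a minimal Java sketch along those lines (the filter logic and input path are placeholders): the master is never hard-coded, and the transformation lives in a small function that a test can call with its own locally-built session.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RddTestJob {

    // Keep transformations in small functions that receive the data,
    // so unit tests can exercise them with a local SparkSession.
    static Dataset<Row> onlyAdults(Dataset<Row> people) {
        return people.filter("age >= 18");
    }

    public static void main(String[] args) {
        // No .master() here: supply it externally, e.g.
        //   spark-submit --master yarn --class RddTestJob app.jar <input>
        SparkSession spark = SparkSession.builder()
                .appName("RDDTest")
                .getOrCreate();

        Dataset<Row> people = spark.read().json(args[0]);  // input path from args
        onlyAdults(people).show();

        spark.stop();
    }
}

In a unit test you would build the session yourself with SparkSession.builder().master("local[2]").getOrCreate() and feed data built from it into the same small functions.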

Developing Scheduled tasks in Java and running on Linux server

I wanted to develop 'tasks' in Java which can be run periodically as per the schedule defined.
How do I run this on my Linux server? If it is a jar file, is it enough to create the jar and run it using a shell script, and schedule the script to run via cron?
I was planning to make use of the Spring Framework. Do I really need it, since I can schedule my Java program with cron?
How do I approach this?
You can build the app using Spring Boot and run it as a daemon:
https://docs.spring.io/spring-boot/docs/current/reference/html/deployment-install.html
And then use Quartz to schedule the tasks.
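If you don't need full Quartz, Spring's own @Scheduled support is often enough; here is a minimal Spring Boot sketch (the cron expression and task body are placeholders):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@SpringBootApplication
@EnableScheduling
public class SchedulerApplication {
    public static void main(String[] args) {
        SpringApplication.run(SchedulerApplication.class, args);
    }
}

@Component
class NightlyTask {
    // Spring cron has six fields (seconds first): run every day at 02:00.
    @Scheduled(cron = "0 0 2 * * *")
    public void run() {
        System.out.println("running nightly task...");  // placeholder work
    }
}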
You can use a cron job as well as a scheduler like Quartz to run your Java task. I think a cron job is a convenient way to run your jar file: you can simply schedule the jar in crontab.
Check out Quartz; it's an awesome scheduling library that you can include in any Java application.
Once the scheduler is started, it runs at the intervals defined in a cron expression, say
* * * * * ? (note that Quartz cron expressions include a leading seconds field).
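A minimal Quartz sketch, assuming the quartz dependency is on the classpath (the job/trigger names and the task body are placeholders):

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzExample {

    // The work to run on each firing; Quartz needs a public no-arg class.
    public static class MyTask implements Job {
        @Override
        public void execute(JobExecutionContext ctx) {
            System.out.println("task fired");   // placeholder work
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(MyTask.class)
                .withIdentity("myTask", "myGroup")
                .build();

        // "0 * * * * ?" fires at the top of every minute.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("myTrigger", "myGroup")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 * * * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}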

How to change logging levels for executors?

The default logging level for all classes executed by Spark executors seems to be INFO. I would like to change it to DEBUG, or WARN, etc. as needed.
I'm using Spark Streaming 1.3.1 but will switch to 1.4 soon.
I have the following line in the shell script that submits the Spark Streaming job:
export SPARK_JAVA_OPTS="-Dlog4j.configuration=file:./log4j.properties"
This allows me to change logging level for the classes running in driver, but not in the executors.
How can I control logging for classes that are run by executors?
Note: We are not running on Yarn. We're starting our own Spark cluster on EC2.
Note: Ideally, we would like to change logging level while Streaming process is still running. If that's not possible, at least we should be able to change some properties file. Recompiling code & redeploying is NOT an option.
tl;dr Change conf/log4j.properties to appropriate levels and distribute the file to workers.
You may have some luck with the --files command-line option for spark-submit when you submit your Spark applications. It lets you change logging levels per application. It's just a guess, as I haven't tried it out yet.
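For reference, a minimal conf/log4j.properties along those lines might look like the following (the package name is just an example); whether it reaches the executors via each worker's conf/ directory or via --files depends on your setup:

# Quieter default for executors
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Turn DEBUG on only for your own classes (example package name)
log4j.logger.com.example.streaming=DEBUG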

Is it possible to cache the application jar when running Apache Spark applications on a cluster?

I've got an Apache Spark MLlib Java application that should be run on a cluster a lot of times with different input values. Is it possible to cache the application jar on the cluster and reuse it to reduce startup time, network load and coupling of components?
Does the used cluster manager make any difference?
If the application jar is cached, is it possible to use the same RDD caches in different instances of my application?
Vanilla Spark is not able to do this (at the time of writing; Spark is evolving fast).
There's a Spark-JobServer contributed by Ooyala that fulfills exactly this need. It keeps a registry of jars for sequential job submission and provides additional facilities to cache RDDs by name. Note that on a Spark cluster, the Spark-JobServer acts as the Spark driver; the workers will still need to load the jars from the driver when executing a given task.
See docs here: https://github.com/ooyala/spark-jobserver

Running MapReduce job periodically without Oozie?

I have a MapReduce job as a 'jar' that should be run daily. Also, I need to run this jar from a remote Java application. How can I schedule it: i.e., I just want to run the job daily from my remote Java application.
I read about Oozie, but I don't think it is apt here.
Take a look at Quartz. It enables you to run standalone Java programs or run inside a web or application container (like JBoss or Apache Tomcat). There is good integration with Spring, and with Spring Batch in particular.
Quartz can be configured outside of the Java code, in XML, and the syntax is exactly like crontab's, so I found it very handy.
Some examples can be found here and here.
I am not clear about your requirement. You can use SSH command-execution libraries in your program:
SSH library for Java
If you are running your program in the Linux environment itself, you can set up a crontab entry for periodic execution.
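As a sketch of the SSH route, using the JSch library (the host, credentials and the hadoop jar command are all placeholders):

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RemoteHadoopJob {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("hadoopuser", "edge-node.example.com", 22);  // placeholders
        session.setPassword("secret");                       // or jsch.addIdentity(...) for key auth
        session.setConfig("StrictHostKeyChecking", "no");    // fine for a sketch, not for production
        session.connect();

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("hadoop jar /path/to/my-mr-job.jar com.example.MyDriver /input /output");
        channel.connect();

        // Wait for the remote command to finish, then clean up.
        while (!channel.isClosed()) {
            Thread.sleep(1000);
        }
        int exitStatus = channel.getExitStatus();
        channel.disconnect();
        session.disconnect();

        System.out.println("remote job exited with status " + exitStatus);
    }
}

You could then fire this class daily from Quartz or cron, as discussed above.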
If the trigger for your jar is your Java program, then you should schedule the Java program rather than the jar. And if they are separate, then you can schedule your jar in an Oozie workflow, where the Java code executes in step one of the workflow and the jar executes in the second step.
In Oozie, you can pass parameters from one level to another as well. Hope this helps.
-Dipika Harwani
