How to organize an Apache Spark project - Java

I am new to Spark and I would like to understand how best to set up a project. I will use Maven for building, including tests.
I wrote my first Spark application, but to launch it during development I had to run it in local mode:
SparkSession spark = SparkSession.builder()
        .appName("RDDTest")
        .master("local")
        .getOrCreate();
However, if I submit it to a cluster, it will still run in local mode, which I do not want.
So I would have to change the code before deployment, build the jar, and submit it to the cluster. Obviously this is not the best approach.
I was wondering what the best practice is. Do you externalize the master URL somehow?

Generally you only want to run Spark in local mode from test cases, so your main job shouldn't have any local mode associated with it.
Also, all the parameters that Spark accepts should come from the command line; for example, the app name, master, etc. should be taken from the command line instead of being hard-coded, as in the sketch below.
Try to keep the DataFrame manipulations in small functions so they can be tested independently.
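A minimal sketch of that idea (the class name and argument handling are illustrative, not prescriptive): the production entry point leaves the master unset, so whatever --master you pass to spark-submit decides where it runs.

import org.apache.spark.sql.SparkSession;

public class RDDTestApp {
    public static void main(String[] args) {
        // app name comes from the command line; no .master() here, so the
        // --master option given to spark-submit controls the deployment mode
        String appName = args.length > 0 ? args[0] : "RDDTest";
        SparkSession spark = SparkSession.builder()
                .appName(appName)
                .getOrCreate();
        try {
            // call small, independently testable functions here
        } finally {
            spark.stop();
        }
    }
}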

You need to use the spark-submit script.
You can find further documentation here: https://spark.apache.org/docs/latest/submitting-applications.html

I would have all methods take a SparkContext as a parameter (maybe even as an implicit parameter). Next, I would either use Maven profiles to define parameters for the SparkContext (test/prod) or, alternatively, program arguments.
An easy alternative would be to programmatically define one SparkContext for your (prod) main method (cluster mode) and a separate one for your tests (local mode).
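For example (a minimal sketch assuming JUnit 4; the names are illustrative), a test can build its own local session while the production main relies on spark-submit for the cluster one:

import org.apache.spark.sql.SparkSession;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class MyJobTest {
    private static SparkSession spark;

    @BeforeClass
    public static void setUp() {
        // local mode lives only in the test code, never in the production job
        spark = SparkSession.builder()
                .appName("MyJobTest")
                .master("local[2]")
                .getOrCreate();
    }

    @AfterClass
    public static void tearDown() {
        spark.stop();
    }

    // tests call the small DataFrame functions with this session
}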

Related

How to schedule/trigger Spark jobs in Cloudera?

Currently our project is on MR and we use Oozie to orchestrate our MR jobs. Now we are moving to Spark and would like to know the recommended ways to schedule/trigger Spark jobs on the CDH cluster. Note that CDH Oozie does not support Spark2 jobs, so please suggest an alternative.
Last time I looked, Hue had a Spark option in the Workflow editor. If Cloudera didn't support that, I'm not sure why it'd be there...
CDH Oozie does support plain shell scripts, though, but you need to be sure all NodeManagers have the spark-submit command available locally.
If that doesn't work, it also supports Java actions for running a JAR, so you could write your Spark jobs so that each starts with a main method that loads any configuration from there.
As soon as you submit the Spark job from the shell, like:
spark-submit <script_path> <arguments_list>
it gets submitted to the CDH cluster. You will immediately be able to see the Spark job and its progress in Hue. This is how we trigger our Spark jobs.
Further, to orchestrate a series of jobs, you can use a shell script wrapper around them. Or you can use a cron job to trigger them on a schedule.

Running a MapReduce job periodically without Oozie?

I have a MapReduce job as a jar that should be run daily. Also, I need to run this jar from a remote Java application. How can I schedule it? That is, I just want to run the job daily from my remote Java application.
I read about Oozie, but I don't think it is apt here.
Take a look at Quartz. It enables you to run standalone Java programs or run inside a web or application container (like JBoss or Apache Tomcat). There is good integration with Spring, and with Spring Batch in particular.
Quartz can be configured outside of the Java code, in XML, and the syntax is exactly like crontab's, so I found it very handy.
Some examples can be found here and here.
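As a rough illustration (the job class, names, and the 2 a.m. schedule are made up for the example), a Quartz cron trigger looks roughly like this:

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class DailyMrJobScheduler {

    // the job body would launch your MapReduce jar, e.g. via the Hadoop Job API
    public static class MrJarJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // submit the MapReduce job here
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(MrJarJob.class)
                .withIdentity("dailyMrJob")
                .build();
        // crontab-like syntax: every day at 02:00
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("dailyMrTrigger")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}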
I am not clear about your requirement. You can use SSH command-execution libraries in your program.
SSH library for Java
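For instance, a library such as JSch (my assumption; any SSH client library would do) could trigger the jar on the remote machine, roughly like this (host, credentials, and command are placeholders):

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RemoteJobTrigger {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("hadoopuser", "cluster-gateway-host", 22);
        session.setPassword("secret");                    // or use key-based auth
        session.setConfig("StrictHostKeyChecking", "no"); // relaxed only for this sketch
        session.connect();

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("hadoop jar /path/to/myjob.jar com.example.Main");
        channel.connect();
        // in real code you would read the channel's output and wait for completion
        channel.disconnect();
        session.disconnect();
    }
}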
If you are running your program in a Linux environment itself, you can set up a crontab entry for periodic execution.
If the trigger of your jar is your Java program, then you should schedule your Java program daily rather than the jar. And if they are separate, then you can schedule your jar in an Oozie workflow, where the Java code execution is step one of the Oozie workflow and the jar execution is the second step.
In Oozie, you can pass parameters from one level to another as well. Hope this helps.

Pentaho Kettle: how to set up tests for transformations/jobs?

I've been using Pentaho Kettle for quite a while, and previously the transformations and jobs I've made (using Spoon) have been quite simple: load from a db, rename, etc., output the stuff to another db. But now I've been doing transformations that involve somewhat more complex calculations that I would like to test somehow.
So what I would like to do is:
Setup some test data
Run the transformation
Verify result data
One option would probably be to make a Kettle test job that would test the transformation. But as my transformations relate to a Java project, I would prefer to run the tests from JUnit. So I've considered making a JUnit test that would:
Setup test data (using dbunit)
Run the transformation (using kitchen.sh from command line)
Verify result data (using dbunit)
This approach would, however, require test database(s), which are not always available (Oracle etc. are expensive/legacy dbs). What I would prefer is if I could mock or pass some stub test data to my input steps somehow.
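A rough sketch of that JUnit approach (the paths, job file name, and JUnit 4 style are assumptions):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TransformationIT {

    @Test
    public void transformationProducesExpectedData() throws Exception {
        // 1. set up test data (e.g. with dbunit) before running the job

        // 2. run the Kettle job via kitchen.sh from the command line
        Process kitchen = new ProcessBuilder(
                "/opt/pentaho/data-integration/kitchen.sh",
                "-file=/path/to/test-job.kjb")
                .inheritIO()
                .start();
        assertEquals("kitchen.sh should exit with code 0", 0, kitchen.waitFor());

        // 3. verify the result data (again e.g. with dbunit)
    }
}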
Any other ideas on how to test Pentaho Kettle transformations?
There is a JIRA somewhere on jira.pentaho.com (I don't have it to hand) that requests exactly this, but alas it is not yet implemented.
So you do have the right solution in mind. I'd also add Jenkins and an Ant script to tie it all together. I've done a similar thing with report testing: I actually had a Pentaho job load the data, then it executed the report, then it compared the output with known output and reported pass/failure.
If you separate out your Kettle jobs into two phases:
load data to stream
process and update data
You can use a "Copy rows to result" step at the end of your load-data-to-stream transformation, and a "Get rows from result" step at the start of your process step.
If you do this, then you can use any means to load data (a Kettle transform, dbunit called from an Ant script), and you can mock up any database tables you want.
I use this for testing some ETL scripts I've written and it works just fine.
You can use the Data Validator step. Of course it is not a full unit test suite, but I think it is sometimes useful to check the data integrity in a quick way.
You can run several tests at once.
For a more "serious" test I would recommend the #codek answer and executing your Kettle jobs under Jenkins.

Can I use PostgreSQL in a Maven build?

I am trying to create an integration test which requires a running PostgreSQL server. Is it possible to start the server in the Maven build and stop it when the tests are completed (in a separate process, I think)? Assume the PostgreSQL server is not installed on the machine.
You are trying to push Maven far beyond its intended envelope, so you'll be in for a fair amount of hurt before it works.
Luckily, PostgreSQL can be downloaded as a ZIP archive.
As already mentioned above, Maven can use Ant tasks to extend its reach. Ant has a large set of tasks to unzip files and run commands. The sequence would be as follows:
unzip postgresql-xxx.zip into a well-known directory --> INSTALL_DIR
create a data directory --> DATA_DIR
INSTALL_DIR/bin/initdb -D DATA_DIR
INSTALL_DIR/bin/postgres -D DATA_DIR
INSTALL_DIR/bin/createdb -E UNICODE test
This should give you a running server with a test database.
Further issues: creating a user, security (you likely want to connect via TCP/IP, but this is disabled by default if I recall correctly; this requires editing a config file before starting the database)
...
Good Luck.
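If it helps, the same sequence could also be driven from Java test code instead of Ant tasks; here is a rough sketch (the install and data paths are placeholders, and error handling and startup waiting are simplified):

public class EmbeddedPostgresStarter {
    public static void main(String[] args) throws Exception {
        String installDir = "target/pgsql";   // where the zip was unpacked
        String dataDir = "target/pgdata";     // freshly created data directory

        // initialize the cluster, start the server, then create the test database
        run(installDir + "/bin/initdb", "-D", dataDir);
        Process server = new ProcessBuilder(installDir + "/bin/postgres", "-D", dataDir)
                .inheritIO()
                .start();
        Thread.sleep(5000); // crude wait for the server to accept connections
        run(installDir + "/bin/createdb", "-E", "UNICODE", "test");

        // tests would run here; afterwards stop the server
        server.destroy();
    }

    private static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("Command failed: " + String.join(" ", cmd));
        }
    }
}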
I started writing a plugin for this purpose:
https://github.com/adrianboimvaser/postgresql-maven-plugin
It's in a very early stage and lacks documentation, but it mostly works.
I have already released version 0.1 to Maven Central.
I'm also releasing PostgreSQL binary distributions for all platforms as Maven artifacts.
You can find the usage pattern in the plugin's integration tests.
Cheers!
Not to my knowledge. However, you could run a remote command that starts the server.
I think the usual scenario is to have a running integration-test db and not to shut it down/restart it between builds.
But if you really want to, you could set up your continuous integration server to start/stop the db.
You sound like you are trying to build a full continuous integration environment. You should probably look into using a full CI tool such as Cruise Control or Bamboo.
How I've done it before is to set up a dedicated CI db that is accessible from the CI server, and then have a series of bash/python/whatever scripts run as an After Successful Build step, which can then run whatever extra integration tasks you like. Pair that with something like Liquibase and you can wipe out the CI db and make sure it is up to the latest schema on every build.
Just to bring some fresh perspective to this matter:
You could also start the PostgreSQL database as a Docker instance.
The plugin ecosystem for Docker still seems to be in flux, so you might need to decide for yourself which one fits. Here are a few links to speed up your search:
https://github.com/fabric8io/docker-maven-plugin
http://heidloff.net/article/23.09.2015102508NHEBVR.htm
https://dzone.com/articles/build-images-and-run-docker-containers-in-maven

How to use separate databases for production and testing in an Eclipse RCP app

I am writing an Eclipse RCP app and I am trying to use a separate db for tests to prevent corrupting my production db. During the setup of the test db I need to execute an SQL file to fill it with test data.
Is there a way to tell the app to use a different db and execute a specific SQL script (maybe via launch properties, fragments, or something else)?
Thank you
I found and am using a different approach now, which is more RCP-ish IMHO. I define a fragment that overrides the database props and replaces a dummy query file in the host plug-in. Then I define two features: one for testing with the fragment, and the production feature without the fragment. Then I use the features in different products: one for production, one for testing. Works fine.
Sounds like a perfect use-case for OSGi Services.
Your application will accept arguments just like the Eclipse executable does. You can specify the arguments in the .ini file of your app (in Eclipse it is eclipse.ini; you can rename it for your app) in the form of
-vmargs
-Dkey=value
These values can be read using System.getProperty
On some platforms you should be able to accept these arguments from the command line as well.
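A minimal sketch of reading such properties (the property names and the fallback value are purely illustrative, not part of any real API):

public class DatabaseConfig {
    public static String jdbcUrl() {
        // -Ddb.url=... can be set in the app's .ini file or on the command line;
        // the fallback value here is only an illustration
        return System.getProperty("db.url", "jdbc:h2:mem:testdb");
    }

    public static String initScript() {
        // optional: path to an SQL script used to fill the test db
        return System.getProperty("db.init.script");
    }
}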
For an RCP app I usually use some kind of property file. Within it I'd specify things such as the DB to use and the startup script (if necessary). This approach will be well worth it as your application grows; a small sketch follows.
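A small sketch of that idea (the file location and keys are assumptions):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class AppSettings {
    public static Properties load(String path) throws Exception {
        // e.g. path = "config/app-test.properties" or "config/app-prod.properties"
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            props.load(in);
        }
        return props;
    }
}
// usage: String dbUrl = AppSettings.load("config/app-test.properties").getProperty("db.url");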
