I am a new user of Spark. I have a web service that allows a user to request the server to perform a complex data analysis by reading from a database and pushing the results back to the database. I have moved those analyses into various Spark applications. Currently I use spark-submit to deploy these applications.
However, I am curious: when my web server (written in Java) receives a user request, what is considered the "best practice" way to initiate the corresponding Spark application? Spark's documentation seems to recommend "spark-submit", but I would rather not shell the command out to a terminal to perform this action. I saw an alternative, Spark-JobServer, which provides a RESTful interface to do exactly this, but my Spark applications are written in either Java or R, which do not seem to interface well with Spark-JobServer.
Is there another best practice for kicking off a Spark application from a web server (in Java) and waiting for a status result indicating whether the job succeeded or failed?
Any ideas of what other people are doing to accomplish this would be very helpful! Thanks!
I've had a similar requirement. Here's what I did:
To submit apps, I use the hidden Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Using this same API, you can query the status of a driver or kill your job later.
There's also another hidden UI JSON API: http://[master-node]:[master-ui-port]/json/ which exposes all the information available on the master UI in JSON format.
Using the Submission API I submit a driver, and using the Master UI API I wait until my driver and application state are RUNNING.
The web server can also act as the Spark driver. So it would have a SparkContext instance and contain the code for working with RDDs.
The advantage of this is that the Spark executors are long-lived. You save time by not having to start/stop them all the time. You can cache RDDs between operations.
A disadvantage is that since the executors are running all the time, they take up memory that other processes in the cluster could otherwise use. Another is that you cannot have more than one instance of the web server, since you cannot have more than one SparkContext connected to the same Spark application.
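A minimal sketch of the driver-in-the-web-server approach (class, path, and host names here are made up; the key point is that one SparkSession/SparkContext is created at startup and shared by all request handlers):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AnalysisServer {
    // One long-lived session/context for the whole web server process.
    private final SparkSession spark = SparkSession.builder()
        .master("spark://master-node:7077")
        .appName("web-driver")
        .getOrCreate();

    // Called from the web framework's request handler.
    public long handleRequest(String inputPath) {
        Dataset<Row> df = spark.read().parquet(inputPath);
        df.cache(); // cached data survives between requests
        return df.count();
    }
}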
We are using Spark Job-server and it is working fine with Java as well: just build a jar of the Java code and wrap it with Scala to work with Spark Job-server.
Related
We have several Java standalone applications (in the form of jar files) running on multiple servers. These applications mainly read and stream data between systems. We mainly use Java 8 in our development. I was recently put in charge; my main function is to manage and maintain these apps.
Currently, I check these apps manually by accessing the servers, checking whether each app is running, and sometimes running database queries to see if the app has started pulling data. My problem is that in many cases some of these apps fail and shut down due to data issues or edge cases without anyone noticing. We need some monitoring and application recovery in place.
We don't have Docker infrastructure in place. We plan to adopt Docker in the future, but for now this is not an option.
After research, the following are options I thought of or solutions I tried:
Have the apps create a socket client that sends a heartbeat to a monitoring app (which would need to be developed). I am keeping this as my last option.
I tried to use Eclipse Vert.x to wrap the apps into verticles, then create a web view that can show me status and other info. After several tries, the apps failed to parse the data correctly (possibly due to my lack of understanding of the Vert.x library).
Use a third-party solution that does this; however, I have no idea what solutions are out there. I am open to suggestions.
My requirements are:
Proper monitoring of the apps running and their status.
In case of failure, the app should start again while notifying the admin/developer.
I am willing to develop a solution or implement a third-party one. I need your guidance on this.
Thank you.
You could use spring-boot-actuator (see health). It comes with a built-in endpoint that provides some health checks (depending on your spring-boot project), but you can create your own as well.
Then, by making an HTTP request to http://{host}:{port}/{context}/actuator/health (replace with your values), you can see the status of those health checks and also use the response status code to monitor your application.
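For example, a custom indicator could check that the app is still pulling data. A minimal sketch (the DataFeed interface and the 10-minute staleness rule are hypothetical):

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical service that records when the app last pulled data successfully.
interface DataFeed {
    long lastPullTimestamp();
}

@Component
public class DataFeedHealthIndicator implements HealthIndicator {

    private final DataFeed feed;

    public DataFeedHealthIndicator(DataFeed feed) {
        this.feed = feed;
    }

    @Override
    public Health health() {
        long ageMs = System.currentTimeMillis() - feed.lastPullTimestamp();
        if (ageMs > 10 * 60 * 1000) { // no successful pull in the last 10 minutes
            return Health.down().withDetail("lastPullAgeMs", ageMs).build();
        }
        return Health.up().build();
    }
}

The aggregate /actuator/health endpoint reports DOWN (HTTP 503 by default) when any indicator is down, which an external monitor can poll.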
Have you heard of Java Service Wrappers? They don't provide full management functionality, but they do monitor for JVM crashes and out-of-memory conditions and restart your application. Alerting should also be possible.
There is a small comparison table here: https://yajsw.sourceforge.io/#mozTocId284533
So some basic monitoring and management is included already. If you need more, I suggest using JMX (https://www.oracle.com/java/technologies/javase/javamanagement.html) or Prometheus (https://prometheus.io/ and https://github.com/prometheus/client_java).
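To give an idea of the Prometheus route, here is a minimal sketch using the Java simpleclient libraries (the metric name and port are made up):

import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class AppMetrics {
    static final Counter RECORDS = Counter.build()
        .name("records_streamed_total")
        .help("Records streamed between systems.")
        .register();

    public static void main(String[] args) throws Exception {
        // Expose /metrics on port 9400 for the Prometheus server to scrape.
        HTTPServer metricsServer = new HTTPServer(9400);
        while (true) {
            Thread.sleep(1000); // placeholder for reading/streaming a record
            RECORDS.inc();
        }
    }
}

Prometheus can then alert when the counter stops increasing; restarts would still be handled by a service wrapper or init system.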
I have a question related to Apache Spark. I work with the Java language for writing client code, but my question can be answered in any language.
The title of the question may make it seem like a general question that can be found with a simple Google search, but my question is something else, and unfortunately every time I searched I found nothing on this exact topic and requirement. Similar topics that usually come up in a search, but are not my question, include:
Multiple SparkSession for one SparkContext
Multiple SparkSessions in single JVM
...
My question is not the above questions at all, although it may seem similar. I will first explain my question, and then state the higher-level requirement that led me to ask it. My goal is a requirement that will be solved if the question is answered, or if another solution to the requirement is provided.
The problem I am trying to solve
I wrote a REST server component in which I used the Spark Java library. This REST server receives requests in a specific format, forms a query based on them, and submits a job through the Spark library functions to the Spark cluster (my own cluster). It also returns the query answer as an asynchronous response (when it is ready and the user requests it).
I use some code like this to create the Spark session (a summary of it):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Point at the standalone master and name the application.
SparkConf sparkConf = new SparkConf()
        .setMaster("spark://localhost:7077")
        .setAppName("test");
// Reuses an existing session in this JVM, or creates a new one.
SparkSession session = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate();
...
As far as I know, when I run the above code, Spark creates an application named test for me and allocates some of the resources of my Spark cluster (I use Spark in standalone mode). For example, assume it uses all my resources, so there is no resource left for an extra new application.
Now I have just one REST server; it cannot be scaled at all, and if it goes down, the user can no longer work with the REST server API. So I want to scale it to two instances (at least) on different machines and in different JVMs. (This is the part where my question differs from the others.)
If I bring up another instance of my REST server with the same code as above, it will create a new Spark session (because it is a different JVM on another machine), and it also creates another application named test in Spark. But since, as I said, all my resources have been used by the first Spark session, this application sits on standby and can do nothing (until resources become free).
Notes about the problem:
I do not want to split the cluster resources, giving some to the first REST server and some to the second.
I want both instances (or however many instances there are) to share a single Spark application. In other words, I want the same SparkContext across different JVMs. Also note that I submit my Spark query in cluster mode, so my application is not the worker and one of the nodes in the cluster becomes the driver.
Requirement
As is clear from the description above, I want my REST server to be HA of the active-active type, so that both Spark clients are connected to the same application and requests to the REST servers can be given to either of them. This is my need at a higher level, and there may be another way to meet it.
I would be very grateful for any similar application, documentation, or experience, because my searches always ended with the questions I listed at the beginning, which had nothing to do with my problem. Apologies if there are typos in some parts due to my weak English. Thanks.
I like your idea a lot (probably because I had to implement quite a few similar things in the past).
In short, I am 95% sure that there is no way to share a JVM or a SparkContext between machines, executions, etc. I once tried to share dataframes between SparkContexts and it was a huge fiasco ;).
The way I would approach that:
If your REST server connects to a cluster, once the Spark session is available, register the server to a load balancer.
If you submit your REST server as a Spark job, you can have it register to the load balancer.
You can submit multiple jobs / start multiple servers. They can each pick an advertised port, which they share with the load balancer.
Your REST client would interact with the load balancer, not directly with the Spark REST server. Your REST server will have to have healthcheck endpoints so that the load balancer can do its job.
If one of your REST servers goes down, the load balancer can start a new one. You will lose the dataframes of that application, but not those of the other applications.
If multiple REST servers need to exchange data, I would use Delta as a "cache" or staging zone.
Does that make sense? It should not be too hard to implement and would provide good HA.
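As a rough sketch of the healthcheck part (using the JDK's built-in HTTP server just to keep it self-contained; the readiness condition is whatever tells you your SparkSession is usable):

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);
        server.createContext("/health", exchange -> {
            // Replace with a real check, e.g. the SparkSession exists and is not stopped.
            boolean ready = true;
            byte[] body = (ready ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(ready ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start(); // the load balancer polls this port and routes accordingly
    }
}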
I'm planning a web application where users will be able to upload and process their files. The specifics of the application are irrelevant to my questions, but let's assume that the application will deal with mp3 audio files. I'm going to split my application into two distinct parts: the front-end and the back-end.
The front-end application will be a usual web application serving html pages to users. Typically a user will upload his file and fill an html form to specify which operations he would like to perform on the file. The files will be initially uploaded to a storage facility, such as Amazon S3, and later processed by a back-end server. I'm using Play 2.0.4 framework to develop the front-end application and this is going very well for me. I managed to implement user authorization, drafted most of the UI and also implemented file upload to S3. The application is currently deployed on Heroku without any problems.
For my back-end server I'm considering using the Play 2 framework once again. The back-end server will receive a notification (HTTP request) from the front-end server about the creation of a new job. The job specification will include a link to the original user file in storage and arguments describing the job. The job should be added to a queue. Now the most important part is to delegate the actual processing to a third-party program, which will most certainly be a compiled command-line utility, such as SoX in the case of audio processing, written by good people using a programming language of their choice. As far as I know, it is possible to call an external program from Java, pass command-line arguments, and collect the result. After processing is done, the back-end server will upload the processed file back to storage and send a notification (HTTP request) to the front-end application, which will store a link to the processed file and display it to the user at some later time. To be able to use a command-line utility, I'm going to deploy the back-end application to an Amazon EC2 instance with a Typesafe stack installation.
Here are some questions about this basic plan:
Is Play 2 a reasonable choice for the back-end, or should I look into alternatives? One of them seems to be CGI, which according to Wikipedia "is a standard method for web server software to delegate the generation of web content to executable files." Unfortunately I don't have any experience with that.
Should I expect any problems implementing a job queue with Play?
Is it possible to install a command line utility on EC2 and call it from Play?
Should I expect any problems installing the Typesafe stack on EC2? This post briefly describes what I'm planning to do: https://www.assembla.com/spaces/bufferine/wiki/Typesafe_stack_on_Amazon_EC2
Assuming that in the future the application will grow, how would I split the jobs among multiple instances on EC2? Should I create a separate job-balancing application in between my front-end and back-end?
I would appreciate any advice! Thanks!
Note: I'm using the Java API for the Play 2 framework, since I'm not familiar with the Scala language.
You may consider Akka for processing; it's built into Play 2. It will help you manage tasks easily, and can even save hardware resources if used with its advanced features. There is a Java API that should cover all your needs. A separate back-end app isn't strictly necessary either: if you need more power, you can scale even better with two identical instances. Play and Akka are stateless, so you can just add new instances to scale. To make it run on EC2, just use the play dist command.
And yes, you can install whatever you want on EC2 and call it from your app.
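For example, with plain ProcessBuilder (a sketch; the sox arguments are just illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class SoxRunner {
    // Converts an input file to mono at 22.05 kHz; returns the process exit code.
    public static int convert(String inPath, String outPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
            "sox", inPath, outPath, "channels", "1", "rate", "22050");
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line); // log the tool's output
            }
        }
        return process.waitFor(); // 0 means success
    }
}

You could run this from a worker (e.g. inside an Akka actor) so long-running conversions don't block request threads.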
You may like:
http://akka.io/
http://www.playframework.com/documentation/2.1.0/JavaAkka
http://www.playframework.com/documentation/2.1.0/ProductionDist
Also, but in Scala:
http://blog.greweb.fr/2013/01/playcli-play-iteratees-unix-pipe/
http://blog.greweb.fr/2012/11/play-framework-enumerator-outputstream/
The premise is this: For asynchronous job processing I have a homemade framework that:
Stores jobs in a database
Has a simple Java API to create more jobs and processors for them
Processing can be embedded in a web application or run by itself on different machines for scaling out
A web UI for monitoring the queue and canceling queue items
I would like to replace this with some ready-made library, because I would expect more robustness from one and I don't want to maintain this myself. I've been researching the issue and figured JMS could be used for something similar. But I would still have to build a simple Java API, figure out a runtime to host the processing when I want to scale out, and build a monitoring UI. I feel like the only thing I would gain from JMS is not having to do the database stuff myself.
Is there something similar to this that is ready made?
UPDATE
Basically this is the setup I would want to do:
Web application runs in a Servlet container or Application Server
Web application uses a client api to create jobs
X machines process those jobs
Monitor and manage jobs from a UI
You can use Quartz:
http://www.quartz-scheduler.org/
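A minimal sketch of scheduling a job with Quartz (the job class and identity names are made up):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzExample {
    // A job is just a class with an execute method; Quartz instantiates and runs it.
    public static class ProcessingJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("processing " + context.getJobDetail().getKey());
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();
        JobDetail job = JobBuilder.newJob(ProcessingJob.class)
            .withIdentity("job-1", "processing")
            .build();
        Trigger trigger = TriggerBuilder.newTrigger().startNow().build();
        scheduler.scheduleJob(job, trigger);
    }
}

With a JDBC job store configured, Quartz can also cluster across machines, which would cover the "X machines process those jobs" requirement.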
Check out Spring Batch.
Link to the Spring Batch website: http://projects.spring.io/spring-batch/
Currently I have a Java app (and a half-ported Python version) that runs in the background with a queue of jobs (currently read out of a MySQL database). It handles thread sleeping/waking to share resources based on job priority and running time. A front-end PHP script posts jobs to the database, which the system polls at a set interval.
This is somewhat inefficient (though nicer than the locking issues of using a job file), but I can't help wondering whether there is some way to simplify it.
My thought was that the Java app (and/or Python app) could set up an HTTP service (Jetty?) with a web interface that pushes jobs directly to the queue without the middleman. Apache is serving other PHP sites, so this would have to run in tandem.
I'm really after some other input, as I'd prefer it to be a background service that is always running. Having cron execute jobs was painful: some jobs run for 20+ hours, so adding new ones was a pain, with new PHP (no threading) or Java calls having to check whether a service with outstanding jobs was already running to add to, instead of starting a new service. I'd also like a very simple web interface without too much resource wastage.
Thanks for your input.
Deploy a JSP using Tomcat (or similar) that lets the user post job requests to a job-scheduler web service from a web page. On the back end, use Quartz Scheduler to manage your jobs, and just have your web service add jobs to the Quartz queue.
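A sketch of what that web service could look like (hedged: uses the javax.servlet API plus Quartz; the servlet path and request parameters are made up):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

@WebServlet("/jobs")
public class JobSubmitServlet extends HttpServlet {

    // The actual work; Quartz instantiates this class for each run.
    public static class ProcessingJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) {
            System.out.println("running " + ctx.getMergedJobDataMap().getString("payload"));
        }
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            JobDetail job = JobBuilder.newJob(ProcessingJob.class)
                .withIdentity(req.getParameter("name"), "web")
                .usingJobData("payload", req.getParameter("payload"))
                .build();
            scheduler.scheduleJob(job, TriggerBuilder.newTrigger().startNow().build());
            resp.setStatus(HttpServletResponse.SC_ACCEPTED); // queued, not yet done
        } catch (SchedulerException e) {
            throw new ServletException(e);
        }
    }
}

This keeps the front end (JSP/PHP) a thin client posting over HTTP, while the always-running service owns the queue, which avoids the cron and job-file locking problems described above.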