Same SparkContext across different JVMs - Java

I have a question related to Apache Spark. I write my client code in Java, but the question can be answered in any language.
The title may look like a generic question that a simple Google search would answer, but my question is actually something else, and every time I searched I could not find anything covering this topic and my requirement. Similar topics that usually come up in a search, but that are not my question, are:
Multiple SparkSession for one SparkContext
Multiple SparkSessions in single JVM
...
My question is not any of the above, although it may seem similar. I will first explain the question itself, and afterwards state the higher-level requirement that led me to ask it. My goal is that requirement; it will be met either by answering the question or by some other solution.
The problem I am trying to solve
I wrote a REST server component that uses the Spark Java library. The REST server receives requests in a specific format, builds a query from them, and submits a job through the Spark library functions to my own Spark cluster. It returns the query result as an asynchronous response (once it is ready and the user requests it).
I create the Spark session with code like this (simplified):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Connect to the standalone master and register an application named "test"
SparkConf sparkConf = new SparkConf()
        .setMaster("spark://localhost:7077")
        .setAppName("test");
SparkSession session = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate();
...
As far as I know, when I run the code above, Spark creates an application named test for me and allocates some of my cluster's resources to it. (I use Spark in standalone mode.) For this example, assume it uses all my resources, so there is nothing left for an additional application.
Now I have just one REST server; it cannot be scaled at all, and if it goes down, users can no longer work with the REST API. So I want to scale it to at least two instances, on different machines and in different JVMs. (This is the part where my question differs from the others.)
If I bring up another instance of my REST server with the same code as above, it will create a new Spark session (because it is a different JVM on another machine), and it will also create another application named test in Spark. But since, as I said, all my resources are already used by the first Spark session, this application sits on standby and can do nothing until resources become free.
Notes about the problem:
I do not want to split the cluster resources and give some to the first REST server and some to the second.
I want both instances (or however many instances I run) to share a single Spark application. In other words, I want the same SparkContext across different JVMs. Also note that I submit my Spark queries in cluster mode, so my application is not the worker and one of the nodes in the cluster becomes the driver.
Requirement
As the description above makes clear, I want my REST server to be highly available in an active-active setup, so that both Spark clients are connected to the same application and requests can be routed to either of them. This is my need at a higher level, and there may be another way to meet it.
I would be very grateful for any similar application, documentation, or experience, because my searches always ended at the questions I listed at the beginning, which have nothing to do with my problem. Apologies if there are typos due to my limited English. Thanks.

I like your idea a lot (probably because I had to implement quite a few similar things in the past).
In short, I am 95% sure that there is no way to share a JVM or a SparkContext between machines, executions, etc. I tried to share dataframes between SparkContexts and it was a huge fiasco ;).
The way I would approach this:
If your REST server connects to a cluster, register the server with a load balancer once the Spark session is available.
If you submit your REST server as a Spark job, you can have it register with the load balancer as well.
You can submit multiple jobs / start multiple servers. Each can pick any available port, which it advertises to the load balancer.
Your REST client would interact with the load balancer, not directly with the Spark REST server. Your REST servers will have to expose health-check endpoints so that the load balancer can do its job (see the sketch below).
If one of your REST servers goes down, the load balancer could start a new one. You will lose that application's dataframes, but not those of the other applications.
If multiple REST servers need to exchange data, I would use Delta as a "cache" or staging zone.
Does that make sense? It should not be too hard to implement and would provide good HA.
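To make the health-check idea concrete, here is a minimal sketch of an endpoint the load balancer could poll. It assumes the server holds a SparkSession called session and uses the JDK's built-in HttpServer purely for illustration; the class name and port are hypothetical, and your real REST framework will have its own way to add such a route.

import com.sun.net.httpserver.HttpServer;
import org.apache.spark.sql.SparkSession;
import java.net.InetSocketAddress;

public class HealthCheck {
    // Starts a tiny /health endpoint; hypothetical helper, not part of Spark.
    public static void start(SparkSession session, int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            // Healthy only while this instance's SparkContext is still alive.
            boolean up = session != null && !session.sparkContext().isStopped();
            byte[] body = (up ? "UP" : "DOWN").getBytes();
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}

The load balancer then routes traffic only to instances that answer 200 on /health.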

Related

Manage running Java apps remotely

We have several standalone Java applications (in the form of jar files) running on multiple servers. These applications mainly read and stream data between systems. We mainly use Java 8 in our development. I was recently put in charge; my main responsibility is to manage and maintain these apps.
Currently I check these apps manually by accessing the servers, checking whether the app is running, and sometimes running database queries to see if the app has started pulling data. My problem is that in many cases some of these apps fail and shut down due to data issues or edge cases without anyone noticing. We need some monitoring and application recovery in place.
We don't have a Docker infrastructure in place. We plan to adopt Docker in the future, but for now it is not an option.
After research, the following are options I thought of or solutions I tried:
Have the apps create a socket client which sends a heartbeat to a monitoring app (which needs to be developed). I am keeping this as my last option.
I tried using Eclipse Vert.x to wrap the apps into verticles, then create a web view that shows me their status and other info. After several tries, the apps failed to parse the data correctly (possibly due to my lack of understanding of the Vert.x library).
Use a third-party solution that does this, but I have no idea what solutions are out there. I am open to suggestions.
My requirements are:
Proper monitoring of the apps running and their status.
In case of failure, the app should start again while notifying the admin/developer.
I am willing to develop a solution or implement a third-party one. I need your guidance on this.
Thank you.
You could use spring-boot-actuator (see health). It comes with a built-in endpoint that provides some health checks (depending on your Spring Boot project), but you can create your own as well.
Then, by making an HTTP request to http://{host}:{port}/{context}/actuator/health (replace with your values), you can see the status of those health checks and also use the response status code to monitor your application.
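If the built-in checks are not enough, a custom check is a small class. Here is a minimal sketch, assuming spring-boot-starter-actuator is on the classpath; the "dataFeed" name and the isStreamAlive() probe are hypothetical placeholders for your own logic.

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component("dataFeed")
public class DataFeedHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        if (isStreamAlive()) {
            return Health.up().withDetail("lastCheck", System.currentTimeMillis()).build();
        }
        return Health.down().withDetail("reason", "app stopped pulling data").build();
    }

    // Hypothetical probe; replace with a real check, e.g. a lightweight database query.
    private boolean isStreamAlive() {
        return true;
    }
}

The result is merged into the /actuator/health response, so the monitoring side stays a plain HTTP poll.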
Have you heard of Java Service Wrappers? They are not full management functionality, but they do monitor for JVM crashes and out-of-memory conditions and restart your application. Alerting should also be possible.
There is a small comparison table here: https://yajsw.sourceforge.io/#mozTocId284533
So some basic monitoring and management is included already. If you need more, I suggest using JMX (https://www.oracle.com/java/technologies/javase/javamanagement.html) or Prometheus (https://prometheus.io/ and https://github.com/prometheus/client_java)
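If you go the Prometheus route, the Java client is small enough to embed directly in each app. A minimal sketch, assuming the simpleclient and simpleclient_httpserver artifacts are on the classpath; the metric name, port, and loop are illustrative.

import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

public class MetricsExporter {
    static final Gauge lastHeartbeat = Gauge.build()
            .name("app_last_heartbeat_seconds")
            .help("Unix time of the last successful processing cycle.")
            .register();

    public static void main(String[] args) throws Exception {
        new HTTPServer(9400);                 // scrape endpoint on :9400/metrics
        while (true) {
            // ... the app's real work goes here ...
            lastHeartbeat.setToCurrentTime(); // alert in Prometheus if this stops advancing
            Thread.sleep(10_000);
        }
    }
}

An alert on the heartbeat going stale covers the "failed silently" case, and Alertmanager can handle notifying the admin/developer.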

Find out who's using Redis

We have one Redis for our company and multiple teams are using it. We are getting a surge of requests and nobody seems to know which application is causing it. We have only one password that goes around the whole company and our Redis is secured under a VPN so we know it's not coming from the outside.
Is there a way to know who's using Redis? Maybe we can pass some headers with the connection from every app to identify who makes the most requests, etc.
We use Spring Data Redis for our communication.
This question is too broad since different strategies can be used here:
Use the Redis MONITOR command. This is basically a built-in debugging tool that streams back every command executed by Redis.
Use some kind of intermediate proxy. Instead of routing all commands directly to Redis, route everything through a proxy that does some processing, such as counting commands per calling host or per command type, depending on what you want.
This is still only a configuration-level solution, so you won't need any changes in the applications.
Since you have Spring Boot, you can use the Micrometer metering integration. This way you can create a counter/gauge that gets updated on each request to Redis. If you also ship the metering data to a tool like Prometheus, you'll be able to build a dashboard, say in Grafana, to see the whole picture. Micrometer can integrate with other products as well; Prometheus/Grafana is only an example, and you can choose any other solution (maybe your organization already has something like that in place).
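As a minimal sketch of the Micrometer option, assuming a Spring Boot app with spring-boot-starter-actuator (which supplies a MeterRegistry bean) and Spring Data Redis; the wrapper class, metric name, and "inventory-service" tag are illustrative only.

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class CountingRedisClient {
    private final StringRedisTemplate redis;
    private final Counter redisCalls;

    public CountingRedisClient(StringRedisTemplate redis, MeterRegistry registry) {
        this.redis = redis;
        // Tag the counter with the owning application so a dashboard can break usage down per app.
        this.redisCalls = Counter.builder("redis.client.calls")
                .tag("app", "inventory-service")   // hypothetical application name
                .register(registry);
    }

    public String get(String key) {
        redisCalls.increment();
        return redis.opsForValue().get(key);
    }
}

Each team tags with its own application name, and the aggregated counters show who is generating the surge.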

Architecture on AWS: Running a distributed algorithm on dynamic nodes

As shown in the diagram, the pet project I am working on has the following two components.
a) The "RestAPI layer" (set of micro-services)
b) "Scalable Parallelized Algorithm" component.
I am planning to run this on AWS. I realized that I can use Elastic Beanstalk to deploy my RestAPI module (a Spring Boot JAR with embedded Tomcat).
I am thinking about how to architect the "Scalable Parallelized Algorithm" component. Here are some design details about it:
It consists of a couple of nodes which share the same data stored on S3.
Each node performs the "algorithm" on a chunk of the S3 data. One node works as the master node, and the rest of the nodes send their partial results to it (embarrassingly parallel, master-slave paradigm). The master node gets invoked by the RestAPI layer.
A "Node" is a Spring Boot application which communicates with other nodes over HTTP.
The number of "Nodes" is dynamic, which means I should be able to manually add a new node as the amount of data on S3 grows.
There is a "Node Registry" in Redis which contains the IPs of all the nodes. Each node registers itself and uses the list of IPs in the registry to communicate with the others.
My questions:
1) Should I use EC2 to deploy the "Nodes", or can I use Elastic Beanstalk to deploy these nodes as well? I know that with EC2 I can manage the number of nodes depending on the size of the S3 data, but is that possible with Elastic Beanstalk?
2) Can I use
Inet4Address.getLocalHost().getHostAddress()
to get the IP of each node? Do EC2 instances have more than one IP? This IP should allow the RestAPI layer to communicate with the "master" node.
3) What component should I use to expose my RestAPI layer to the external world? I don't want to expose my "Nodes".
Update:
I can't use MapReduce since the nodes have state: during initialization, each node reads its chunk of data from S3 and builds the "vector space" in memory. This is a time-consuming process, which is why it has to be kept in memory. Also, this system needs near-real-time responses, so I cannot use a "batch" system like MR.
1) I would look into CloudFormation to help you automate and orchestrate the Scalable Parallelized Algorithm. Read this FAQ
https://aws.amazon.com/cloudformation/faqs/
2) With regard to question #2, EC2 instances can have a private and a public IP, depending on how you configure them. You can query the EC2 instance metadata service from the instance to obtain this information, like this:
curl http://169.254.169.254/latest/meta-data/public-ipv4
or
curl http://169.254.169.254/latest/meta-data/local-ipv4
Full reference to EC2 instance metadata:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
3) Check out the API Gateway service, it might be what you are looking for:
https://aws.amazon.com/api-gateway/faqs/
Some general principles
Use infrastructure automation: CloudFormation or Troposphere over CloudFormation. This would make your system clean and easy to maintain.
Use tagging: it keeps your AWS account nice and tidy. You can also write handy scripts that describe all instances based on tags, which can be a one-liner CLI/SDK call returning all the IPs of your "slave" instances (see the sketch after this list).
Use more tags; they can be really powerful.
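For example, here is a minimal sketch of the tag-based lookup with the AWS SDK for Java (v1); the Role=slave tag and the class name are illustrative assumptions.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;
import java.util.ArrayList;
import java.util.List;

public class SlaveLookup {
    // Returns the private IPs of all running instances tagged Role=slave.
    public static List<String> slaveIps() {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        DescribeInstancesRequest request = new DescribeInstancesRequest()
                .withFilters(new Filter("tag:Role").withValues("slave"),
                             new Filter("instance-state-name").withValues("running"));
        List<String> ips = new ArrayList<>();
        for (Reservation r : ec2.describeInstances(request).getReservations()) {
            for (Instance i : r.getInstances()) {
                ips.add(i.getPrivateIpAddress());
            }
        }
        return ips;
    }
}

The RestAPI layer could use the same kind of call (for example with a Role=master tag) to locate the master node, alongside or instead of the Redis registry.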
Elastic Beanstalk vs. "manual" setup
Elastic Beanstalk sounds like a good choice to me, but it's important to see that it uses the same components I would recommend:
Create an AMI which contains your Slave Instance ready to go, or
Create an AMI and use UserData to configure your Slave, or
Create an AMI and/or use an orchestration tool like Chef or Puppet to configure your slave instance.
Use this AMI in an Auto Scaling launch configuration.
Create an Auto Scaling group which can run a fixed number of instances or scale based on a metric.
Pro setup: if you can somehow count the jobs waiting for execution, that can be a metric for scaling up or down automatically.
Pro+ tip: have the master node create the jobs and put them into an SQS queue. The length of the queue is a good metric for scaling. Failed jobs go back into the queue and are re-executed. (The SQS message contains only a reference, not the full data of the job; see the sketch after this list.)
Using a queue decouples your environment, which is highly recommended.
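A minimal sketch of that SQS hand-off with the AWS SDK for Java (v1); the queue URL, the S3 reference format, and processChunk() are hypothetical.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class JobQueue {
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/algorithm-jobs"; // hypothetical

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Master node: enqueue a job that only references the S3 chunk to process.
        sqs.sendMessage(QUEUE_URL, "s3://my-bucket/chunks/chunk-0042.json");

        // Slave node: poll for work; the queue length can also drive auto scaling.
        for (Message m : sqs.receiveMessage(QUEUE_URL).getMessages()) {
            processChunk(m.getBody());                       // hypothetical worker logic
            sqs.deleteMessage(QUEUE_URL, m.getReceiptHandle());
        }
    }

    private static void processChunk(String s3Reference) { /* ... */ }
}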
To be clear, Elastic Beanstalk does something similar. If you create a multi-node Beanstalk stack, it runs a CloudFormation template and creates an ELB, an ASG, a launch configuration, and instances. You just have a bit less control, but also less management overhead.
If you go with Beanstalk, you need a Worker Environment, which also creates the SQS queue for you. For Worker Environments you can find tutorials and working examples, which makes getting started easier.
Further reading:
Background Task Handling for AWS Elastic Beanstalk
Architectural Overview
2) You can use the CLI, which has some filtering capabilities, or you can pipe the output through tools like jq for filtering/formatting.
Here is a similar example.
Note: Use tags and then you can easily filter the instances. Or you can query based on the ELB/ASG.
3) Exposing your API via API Gateway sounds like a good solution. I assume you want to expose only the master node(s), since that is what manages the tasks.

Configuring a Storm cluster for production

We have configured a Storm cluster with one Nimbus server and three supervisors, and published three topologies which perform different calculations, as follows:
Topology1: Reads raw data from MongoDB, does some calculations, and stores the result back.
Topology2: Reads the result of topology1, does some calculations, and publishes the results to a queue.
Topology3: Consumes the output of topology2 from the queue, calls a REST service, gets the reply, updates the result in a MongoDB collection, and finally sends an email.
As a newcomer to Storm, I am looking for expert advice on the following questions:
Is there a way to externalize all configuration, for example into a config.json that can be referenced by all topologies?
Currently the configuration for connecting to MongoDB, MySQL, and the MQ, as well as the REST URLs, is hard-coded in Java files. It is not good practice to customize source files for each customer.
I want to log at each stage [spouts and bolts]. Where should I put the log4j.xml so it can be used by the cluster?
Is it all right to execute a blocking call, such as a REST call, from a bolt?
Any help would be much appreciated.
Since each topology is just a Java program, simply pass the configuration into the Java Jar, or pass a path to a file. The topology can read the file at startup, and pass any configuration to components as it instantiates them.
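A minimal sketch of that approach, assuming a config.json passed on the command line; the class, key names, and topology name are illustrative, and the packages are the current org.apache.storm ones (older releases used backtype.storm).

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import java.io.File;
import java.util.Map;

public class TopologyLauncher {
    public static void main(String[] args) throws Exception {
        // Launch with: storm jar mytopology.jar TopologyLauncher /path/to/config.json
        Map<String, Object> external =
                new ObjectMapper().readValue(new File(args[0]), Map.class);

        Config conf = new Config();
        conf.putAll(external);   // entries become part of the topology config and are
                                 // visible to every spout/bolt via open()/prepare()

        StormSubmitter.submitTopology("topology1", conf, buildTopology());
    }

    private static StormTopology buildTopology() {
        // ... build the topology with TopologyBuilder and return it ...
        throw new UnsupportedOperationException("sketch only");
    }
}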
Storm uses slf4j out of the box, and it should be easy to use within your topology as such. If you use the default configuration, you should be able to see logs either through the UI, or dumped to disk. If you can't find them, there are a number of guides to help, e.g. http://www.saurabhsaxena.net/how-to-find-storm-worker-log-directory/.
With Storm, you have the flexibility to push concurrency out to the component level and get multiple executors by instantiating multiple bolts (see the sketch below). This is likely the simplest approach, and I'd advise you to start there, and only later introduce the complexity of an executor inside your topology for making HTTP calls asynchronously.
See http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html for the canonical overview of parallelism in storm. Start simple, and then tune as necessary, as with anything.
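To illustrate per-component parallelism, here is a minimal fragment; MongoSpout, CalcBolt, and RestCallBolt are placeholders for your own components.

import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("mongo-spout", new MongoSpout(), 2);      // 2 executors
builder.setBolt("calc-bolt", new CalcBolt(), 4)            // 4 executors
       .shuffleGrouping("mongo-spout");
builder.setBolt("rest-bolt", new RestCallBolt(), 8)        // extra executors so the blocking
       .shuffleGrouping("calc-bolt");                      // REST calls can overlap

Giving the REST bolt more executors is the "start simple" way to hide the latency of the blocking call before reaching for an internal thread pool.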

Deploy Apache Spark application from another application in Java, best practice

I am a new user of Spark. I have a web service that allows a user to request the server to perform complex data analysis by reading from a database and pushing the results back to the database. I have moved those analyses into various Spark applications. Currently I use spark-submit to deploy these applications.
However, I am curious: when my web server (written in Java) receives a user request, what is considered the "best practice" way to start the corresponding Spark application? Spark's documentation seems to say to use spark-submit, but I would rather not shell out to a terminal to perform this action. I saw an alternative, Spark-JobServer, which provides a RESTful interface to do exactly this, but my Spark applications are written in either Java or R, which do not seem to interface well with Spark-JobServer.
Is there another best practice for kicking off a Spark application from a web server (in Java) and waiting for a status result indicating whether the job succeeded or failed?
Any ideas of what other people are doing to accomplish this would be very helpful! Thanks!
I've had a similar requirement. Here's what I did:
To submit apps, I use the hidden Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Using this same API you can query the status of a driver or kill your job later.
There's also another hidden UI Json API: http://[master-node]:[master-ui-port]/json/ which exposes all information available on the master UI in JSON format.
Using "Submission API" I submit a driver and using the "Master UI API" I wait until my Driver and App state are RUNNING
The web server can also act as the Spark driver. So it would have a SparkContext instance and contain the code for working with RDDs.
The advantage of this is that the Spark executors are long-lived. You save time by not having to start/stop them all the time. You can cache RDDs between operations.
A disadvantage is that since the executors are running all the time, they take up memory that other processes in the cluster could otherwise use. Another is that you cannot have more than one instance of the web server, since you cannot have more than one SparkContext for the same Spark application.
We are using Spark Job-server and it works fine with Java as well; just build a jar of the Java code and wrap it with Scala to work with Spark Job-server.
