I'm a Java developer, familiar with AWS and comfortable with Hazelcast on its own.
I have 2 AWS EC2 instances running and would like to run Hazelcast as an in-memory cluster across the nodes. I followed the linked guide to make the required changes, except for the taskdef.json configuration in the Task Definition.
I've read some documentation but couldn't understand what exactly a task definition is, or why it is needed.
How do I know if one has already been created? And if I create one now, would my production environment be disrupted?
The whole reason for the EC2 discovery is to resolve the issue of non-static IP addresses. The EC2 plugin performs a describe-instances call and pulls the IP addresses from the JSON response.
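For reference, the same discovery can be configured programmatically (a minimal sketch assuming Hazelcast 3.12+ with the hazelcast-aws module on the classpath; the region and tag values are placeholders for your own setup):

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ClusterBootstrap {
    public static void main(String[] args) {
        Config config = new Config();
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false); // multicast does not work on EC2
        join.getAwsConfig()
            .setEnabled(true)
            .setProperty("region", "us-east-1")          // placeholder region
            .setProperty("tag-key", "hazelcast-cluster") // placeholder tag
            .setProperty("tag-value", "prod");
        // Members then find each other via the describe-instances lookup described above.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}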
I have a question related to Apache Spark. I use Java for writing client code, but my question can be answered in any language.
The title may make it seem like there is already a general question on Google that can be found with a simple search, but my question is actually something else, and unfortunately my searches never turned up anything about this topic and my requirement. Similar topics that usually come up in a search, but which are not my question, are:
Multiple SparkSession for one SparkContext
Multiple SparkSessions in single JVM
...
My question is not any of the above, although it may seem similar. I will first explain my question; after stating it, I will describe my requirement at a higher level, which is the reason I asked. My goal is a requirement that will be met either by answering the question or by another solution.
The problem I am trying to solve
I wrote a REST server component that uses the Spark Java library. This REST server can receive a series of requests in a specific format, form a query based on them, and submit a job through the Spark library functions to the Spark cluster (my own cluster). It also returns the query answer as an asynchronous response (when it is ready and the user requests it).
I use code like this to create the Spark session (summarized):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf sparkConf = new SparkConf()
        .setMaster("spark://localhost:7077")
        .setAppName("test");

SparkSession session = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate();
...
As far as I know, when I run the above code, Spark creates the application "test" for me and allocates some of the resources from my Spark cluster (I use Spark in standalone mode). For example, assume it uses all my resources, so there is nothing left for an extra new application.
Right now I have just one REST server; it cannot be scaled at all, and if it goes down, users can no longer work with the REST API. So I want to scale it to at least two instances on different machines and in different JVMs. (This is the part where my question differs from the others.)
If I bring up another instance of my REST server with the same code as above, it will create a new Spark session (because it is a different JVM on another machine), and it will also create another application named "test" in Spark. But since, as I said, all my resources have been taken by the first Spark session, this second application stands by and can do nothing until resources become free.
Notes about the problem:
I do not want to split the cluster resources, giving some to the first REST server and some to the second.
I want both instances (or however many there are) to share a single Spark application. In other words, I want the same SparkContext across different JVMs. Also note that I submit my Spark queries in cluster mode, so my application does not host the driver; one of the nodes in the cluster becomes the driver.
Requirement
As the above description makes clear, I want my REST server to be highly available in active-active fashion, so that both Spark clients are connected to one and the same application, and requests to the REST servers can be given to either of them. This is my need at a higher level, and there may be another way to meet it.
I would be very grateful for a pointer to a similar application, specific documentation, or relevant experience, because my searches always ended with the questions I listed at the beginning, which had nothing to do with my problem. Apologies if there are typos in some parts due to my weak English. Thanks.
I like your idea a lot (probably because I had to implement quite a few similar things in the past).
In short, I am 95% sure that there is no way to share a JVM or a SparkContext between machines, executions, etc. I tried to share dataframes between SparkContexts and it was a huge fiasco ;).
The way I would approach it:
If your REST server connects to a cluster, once the Spark session is available, register the server to a load balancer.
If you submit your REST server as a Spark job, you can have it register to the load balancer.
You can submit multiple jobs/start multiple servers. They can pick any advertised port, which they will share with the load balancer.
Your REST client would interact with the load balancer, not directly with the Spark REST server. Your REST servers will need health-check endpoints so that the load balancer can do its job.
If one of your REST servers goes down, the load balancer could start a new one. You will lose the dataframes of that application, but not those of the other applications.
If multiple REST servers need to exchange data, I would use Delta as a "cache" or staging zone.
Does that make sense? It should not be too hard to implement and would provide good HA.
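As a rough illustration of the registration/health-check part (a minimal sketch using the JDK's built-in HttpServer; the class name and port are hypothetical), you would start something like this once the SparkSession is available and point the load balancer's health check at /health:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class HealthEndpoint {
    public static void start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        // The load balancer polls this endpoint; 200 means "Spark session is up, route traffic here".
        server.createContext("/health", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}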
I am new to AWS.
I have a MySQL RDS instance and I just created 2 read replicas. My application is written in Java; up until now I have used JDBC to connect to the single AWS instance, but how do I now distribute the work across the 3 servers?
You can set up an internal Elastic Load Balancer to round-robin requests to the slaves. Then configure two connections in your code: one that points directly to the master for writes and one that points to the ELB endpoint for reads.
Or if you're adventurous, you could set up your own internal load balancer using Nginx, HAProxy, or something similar. In either case, your LB will listen on port 3306.
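The two-connection setup might look like this (a sketch; the hostnames and database name are placeholders for your master endpoint and your internal LB endpoint):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class RdsConnections {
    // Writes go straight to the master; reads go through the internal load balancer.
    private static final String WRITE_URL = "jdbc:mysql://master.abc123.us-east-1.rds.amazonaws.com:3306/mydb";
    private static final String READ_URL  = "jdbc:mysql://read-lb.internal.example.com:3306/mydb";

    public static Connection writeConnection(String user, String password) throws SQLException {
        return DriverManager.getConnection(WRITE_URL, user, password);
    }

    public static Connection readConnection(String user, String password) throws SQLException {
        return DriverManager.getConnection(READ_URL, user, password);
    }
}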
AWS suggests setting up Route 53. Here is the official article on the subject: https://aws.amazon.com/premiumsupport/knowledge-center/requests-rds-read-replicas/
In case you have the option to use Spring Boot and spring-cloud-aws-jdbc, you can take a look at the working example and explanation in this post.
As shown in the diagram, the pet project I am working on has the following two components.
a) The "RestAPI layer" (set of micro-services)
b) "Scalable Parallelized Algorithm" component.
I am planning on running this on AWS. I realized that I can use Elastic Beanstalk to deploy my RestAPI module (a Spring Boot JAR with an embedded Tomcat).
I am thinking about how to architect the "Scalable Parallelized Algorithm" component. Here are some design details about it:
This consists of a couple of Nodes which share the same data stored on S3.
Each node performs the "algorithm" on a chunk of the S3 data. One node works as the master node, and the rest of the nodes send their partial results to it (embarrassingly parallel, master-slave paradigm). The master node gets invoked by the RestAPI layer.
A "Node" is a Spring Boot application which communicates with the other nodes through HTTP.
The number of "Nodes" is dynamic, which means I should be able to manually add a new Node depending on the growing size of the S3 data.
There is a "Node Registry" in Redis which contains the IPs of all the nodes. Each node registers itself and uses the list of IPs in the registry to communicate with the others.
My questions:
1) Shall I use EC2 to deploy the "Nodes", or can I use Elastic Beanstalk to deploy them as well? I know that with EC2 I can manage the number of nodes depending on the size of the S3 data, but is it possible to do this with Elastic Beanstalk?
2) Can I use
Inet4Address.getLocalHost().getHostAddress()
to get the IP of each Node? Do EC2 instances have more than one IP? This IP should allow the RestAPI layer to communicate with the "master" Node.
3) What component should I use to expose my RestAPI layer to the external world? I don't want to expose my "Nodes".
Update:
I can't use MapReduce since the nodes have state. That is, during initialization, each Node reads its chunk of data from S3 and creates the "vector space" in memory. This is a time-consuming process, which is why the data should be kept in memory. Also, this system needs near-real-time responses, so it cannot use a "batch" system like MR.
1) I would look into CloudFormation to help you automate and orchestrate the Scalable Parallelized Algorithm. Read this FAQ
https://aws.amazon.com/cloudformation/faqs/
2) With regard to question #2, EC2 instances can have a private and a public IP, depending on how you configure them. You can query the EC2 instance metadata service from the instance to obtain the information, like this:
curl http://169.254.169.254/latest/meta-data/public-ipv4
or
curl http://169.254.169.254/latest/meta-data/local-ipv4
Full reference to EC2 instance metadata:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
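Since the question is about Java, the same lookup can be done from code (a sketch using plain HttpURLConnection; note that newer instances may enforce IMDSv2, which needs a token header this sketch omits):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class InstanceIp {
    // Returns this instance's private IP from the EC2 metadata service.
    public static String localIpv4() throws Exception {
        URL url = new URL("http://169.254.169.254/latest/meta-data/local-ipv4");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2000);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            return in.readLine();
        }
    }
}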
3) Check out the API Gateway service, it might be what you are looking for:
https://aws.amazon.com/api-gateway/faqs/
Some general principles
Use infrastructure automation: CloudFormation or Troposphere over CloudFormation. This would make your system clean and easy to maintain.
Use Tagging: this keeps your AWS account nice and tidy. Also you can do funky scripts like describe all instances based on Tags, which can be a one-liner CLI/SDK call returning all the IPs of your "slave" instances.
Use more Tags, it can be really powerful.
Elastic Beanstalk vs. "manual" setup
Elastic Beanstalk sounds like a good choice to me, but it's important to see that it uses the same components I would recommend:
Create an AMI which contains your Slave Instance ready to go, or
Create an AMI and use UserData to configure your Slave, or
Create an AMI and/or use an orchestration tool like Chef or Puppet to configure your slave instance.
Use this AMI in an Auto Scaling launch configuration
Create an Auto Scaling group which can run a fixed number of instances or can scale based on a metric.
Pro setup: if you can somehow count the jobs waiting for execution, that can be a metric for scaling up or down automatically
Pro+ tip: use the Master node to create the jobs, put the jobs into an SQS queue. The length of the queue is a good metric for scaling. Failed jobs are back in the queue and will be re-executed. ( The SQS message contains only a reference, not the full data of the job.)
Using a queue would decouple your environment, which is highly recommended.
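Enqueueing a job reference from the master might look like this (a sketch using the AWS SDK for Java v1; the queue URL and message format are placeholders):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class JobQueue {
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/algorithm-jobs"; // placeholder
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    // The message carries only a reference (e.g. the S3 key of the chunk), not the data itself.
    public void submit(String s3ChunkKey) {
        sqs.sendMessage(new SendMessageRequest(QUEUE_URL, s3ChunkKey));
    }
}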
To be clear, Elastic Beanstalk does something similar. In fact, if you create a multi-node Beanstalk stack, it will run a CloudFormation template and create an ELB, an ASG, a launch configuration, and instances. You just have a bit less control, but also less management overhead.
If you go with Beanstalk, you need a Worker Environment, which also creates the SQS queue for you. For a Worker Environment you can find tutorials and working examples, which makes getting started easier.
Further reading:
Background Task Handling for AWS Elastic Beanstalk
Architectural Overview
2) You can use the CLI, which has some filtering capabilities, or you can pipe the output through other tools like jq for filtering/formatting.
Here is a similar example.
Note: Use tags and then you can easily filter the instances. Or you can query based on the ELB/ASG.
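The tag-based lookup could look like this with the AWS SDK for Java v1 (a sketch; the role=slave tag is a placeholder for whatever tagging scheme you pick):

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;

public class SlaveIps {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        // Only running instances carrying the (placeholder) role=slave tag.
        DescribeInstancesRequest request = new DescribeInstancesRequest().withFilters(
                new Filter("tag:role").withValues("slave"),
                new Filter("instance-state-name").withValues("running"));
        for (Reservation reservation : ec2.describeInstances(request).getReservations()) {
            for (Instance instance : reservation.getInstances()) {
                System.out.println(instance.getPrivateIpAddress());
            }
        }
    }
}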
3) Exposing your API via API Gateway sounds like a good solution. I assume you want to expose only the Master node(s), since that is what manages the tasks.
I want to use Vert.x 3 in cluster mode with Hazelcast, using Java. I have two types of verticles:
a verticle that handles HTTP requests (a simple HTTP server); this type of verticle should run on each node
a verticle with (non-local) event bus consumers that holds some data (I have N parts of data; each verticle holds one part, and I would like to run each part in HA mode with only one instance of each, so there are N verticles of this type in the cluster)
Verticles of type one communicate with verticles of type two.
I also have a fat jar with all the code.
I have a few questions about this.
How should I do this?
How do I run the cluster?
Do I run the same jar on each node, or do I need to do something else?
How do I run each type of verticle?
How do I guarantee that only one instance of a type-two verticle runs in the cluster?
Will I lose event bus messages?
Is this the correct way to use Vert.x for this task?
There are several questions here; I'll try to answer them all.
How should I do this?
The easiest way, imho, is to have a single fat jar per verticle type, and each verticle should have the dependency on the Hazelcast cluster manager:
<dependency>
<groupId>io.vertx</groupId>
<artifactId>vertx-hazelcast</artifactId>
<version>3.3.2</version>
</dependency>
And in your shade plugin specify the manifest attributes:
Main-Class: io.vertx.core.Launcher
Main-Verticle: vertx.bc.service.Main
How do I run the cluster?
Now you can run each fat jar as:
java -jar verticle1.jar -cluster
java -jar verticle2.jar -cluster
They should form an HZ cluster and be up and running. You can deploy on the same machine or across several machines; as long as your network supports multicast, the default config will work for you. If you have special needs, you will need to customize your HZ config.
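For the type-two verticles, deploying with the HA flag might look like this (a sketch; DataVerticle is a hypothetical name for your data verticle, and the jar must be started with both -cluster and -ha for failover to take effect):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.DeploymentOptions;

public class Main extends AbstractVerticle {
    @Override
    public void start() {
        // One HA instance: if this node dies, the cluster redeploys the verticle elsewhere.
        vertx.deployVerticle("vertx.bc.service.DataVerticle",
                new DeploymentOptions().setHa(true).setInstances(1));
    }
}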
How do I guarantee that only one instance of a type-two verticle runs in the cluster?
You can't. It is a distributed system; the network should be considered unreliable, so you cannot assume that you always know how many nodes of each type are running. To solve this you need monitoring tools. BTW, this is not Vert.x-specific but applies to any distributed system/microservice architecture.
Will I lose event bus messages?
Only if there are no consumers registered for a specific address at the time of sending will those messages be lost. This relates to the previous question: to reduce this chance, you should deploy more instances of a specific verticle, and the deployment should span several machines to reduce the chance of a network split.
Is this the correct way to use Vert.x for this task?
If you're using HA and only 1 instance, this should work fine for the consumer verticles. However, note that if the web server for some reason dies and respawns on another host, it will not give you what you're looking for, since the HTTP server has "moved" from host1 to hostN. This means that all your web clients will then get a "Cannot connect to host" error, since your application entry point is now using a different IP address.
I want to dynamically configure my API servers depending on the name of the "cluster".
So I'm using AmazonElastiCacheClient to discover the cluster names and need to extract the endpoint of the one that has a specific name.
The problem is that I can find it but there doesn't seem to be a way to get an endpoint.
foundCluster.getCacheNodes() returns an empty list, even if there is 1 Redis instance appearing in the AWS console, in-sync and running.
foundCluster.getConfigurationEndpoint() returns null.
Any idea?
Try adding
DescribeCacheClustersRequest.setShowCacheNodeInfo(true);
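In context, that might look like this (a sketch with the AWS SDK for Java v1; the cluster id is a placeholder):

import com.amazonaws.services.elasticache.AmazonElastiCache;
import com.amazonaws.services.elasticache.AmazonElastiCacheClientBuilder;
import com.amazonaws.services.elasticache.model.CacheCluster;
import com.amazonaws.services.elasticache.model.CacheNode;
import com.amazonaws.services.elasticache.model.DescribeCacheClustersRequest;

public class EndpointLookup {
    public static void main(String[] args) {
        AmazonElastiCache client = AmazonElastiCacheClientBuilder.defaultClient();
        // Without withShowCacheNodeInfo(true), getCacheNodes() comes back empty.
        DescribeCacheClustersRequest request = new DescribeCacheClustersRequest()
                .withCacheClusterId("my-redis-cluster") // placeholder
                .withShowCacheNodeInfo(true);
        for (CacheCluster cluster : client.describeCacheClusters(request).getCacheClusters()) {
            for (CacheNode node : cluster.getCacheNodes()) {
                System.out.println(node.getEndpoint().getAddress() + ":" + node.getEndpoint().getPort());
            }
        }
    }
}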
I am making a guess:
AWS ElastiCache with Redis currently supports only single-node clusters (so no auto discovery, etc.). I am not sure whether this is the cause. Memcached-based clusters are different.
"At this time, ElastiCache supports single-node Redis cache clusters." http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/CacheNode.Redis.html