I am trying to create an EMR cluster from Java, but I can neither find it in the EMR cluster list nor see the requested instances in EC2.
EMR roles do exist:
sqlInjection#VirtualBox:~$ aws iam list-roles | grep EMR
"RoleName": "EMR_DefaultRole",
"Arn": "arn:aws:iam::removed:role/EMR_DefaultRole"
"RoleName": "EMR_EC2_DefaultRole",
"Arn": "arn:aws:iam::removed:role/EMR_EC2_DefaultRole"
And here is my Java code:
AWSCredentials awsCredentials = new BasicAWSCredentials(awsKey, awsKeySecret);
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(awsCredentials);

StepFactory stepFactory = new StepFactory();
StepConfig enabledebugging = new StepConfig()
        .withName("Enable debugging")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

HadoopJarStepConfig hadoopConfig1 = new HadoopJarStepConfig()
        .withJar("s3://foo.bucket/hadoop_jar/2015-01-12/foo.jar")
        .withMainClass("com.strackoverflow.DriverFoo") // optional main class; can be omitted if the jar above has a manifest
        .withArgs("--input=s3://foo.bucket/logs/,s3://foo.bucket/morelogs/", "--output=s3://foo.bucket/myEMROutput", "--inputType=text"); // I have custom Java code to handle the --input, --output and --inputType parameters

StepConfig customStep = new StepConfig("Step1", hadoopConfig1);

Collection<StepConfig> steps = new ArrayList<StepConfig>();
steps.add(enabledebugging);
steps.add(customStep);

JobFlowInstancesConfig instancesConfig = new JobFlowInstancesConfig()
        .withEc2KeyName("fookey") // not fookey.pem
        .withInstanceCount(2)
        .withKeepJobFlowAliveWhenNoSteps(false) // in the AWS example this is set to true
        .withMasterInstanceType("m1.medium")
        .withSlaveInstanceType("m1.medium");

RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("java programatic request")
        .withAmiVersion("3.3.1")
        .withSteps(steps) // in the Amazon example debugging and Hive are launched; here it is debugging and a jar
        .withLogUri("s3://devel.rui/emr_clusters/pr01/")
        .withInstances(instancesConfig)
        .withVisibleToAllUsers(true);

RunJobFlowResult result = emr.runJobFlow(request);
System.out.println("toString " + result.toString());
System.out.println("getJobFlowId " + result.getJobFlowId());
System.out.println("hashCode " + result.hashCode());
Where is my cluster? I cannot see it in the cluster list, the output folder is not created, the logs folder stays empty, and no instances are visible in EC2.
But the program outputs this:
toString {JobFlowId: j-2xxxxxxU}
getJobFlowId j-2xxxxxU
hashCode -1xxxxx4
I followed the instructions here to create the cluster:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
And this to create the Java job:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html
In the Amazon example, the region is not configured.
After configuring the region, the cluster is launched properly:
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(awsCredentials);
emr.setRegion(Region.getRegion(Regions.EU_WEST_1));
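For reference, here is a minimal sketch (not part of the original answer) of the same idea with the SDK v1 client builder, which sets the region up front, followed by a cluster listing to confirm the new job flow is visible in that region. The credential placeholders and the EU_WEST_1 region are assumptions carried over from the snippets above, and the builder requires a reasonably recent AWS SDK for Java v1.
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ClusterSummary;
import com.amazonaws.services.elasticmapreduce.model.ListClustersRequest;

public class ListEmrClusters {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("awsKey", "awsKeySecret"))) // placeholders, as above
                .withRegion(Regions.EU_WEST_1) // region assumed from the fix above
                .build();

        // The job flow created above should show up here once the client region matches
        for (ClusterSummary summary : emr.listClusters(new ListClustersRequest()).getClusters()) {
            System.out.println(summary.getId() + " " + summary.getName() + " " + summary.getStatus().getState());
        }
    }
}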
I'm trying to trigger a CronJob manually (not scheduled) using the fabric8 library,
but I'm getting the following error:
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://172.20.0.1:443/apis/batch/v1/namespaces/engineering/jobs.
Message: Job.batch "app-chat-manual-947171" is invalid: spec.template.spec.containers[0].name: Required value.
Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.template.spec.containers[0].name, message=Required value, reason=FieldValueRequired, additionalProperties={})], group=batch, kind=Job, name=app-chat-manual-947171, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Job.batch "app-chat-manual-947171" is invalid: spec.template.spec.containers[0].name: Required value, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
My code runs inside the cluster.
Maven dependency:
<dependency>
<groupId>io.fabric8</groupId>
<artifactId>kubernetes-client</artifactId>
<version>6.3.1</version>
</dependency>
Java code:
public static void triggerCronjob(String cronjobName, String applicableNamespace) {
    KubernetesClient kubernetesClient = new KubernetesClientBuilder().build();

    final String podName = String.format("%s-manual-%s",
            cronjobName.length() > 38 ? cronjobName.substring(0, 38) : cronjobName,
            new Random().nextInt(999999));
    System.out.println("triggerCronjob method invoked, applicableNamespace: " + applicableNamespace
            + ", cronjobName: " + cronjobName + ", podName: " + podName);

    Job job = new JobBuilder()
            .withApiVersion("batch/v1")
            .withNewMetadata()
                .withName(podName)
            .endMetadata()
            .withNewSpec()
                .withBackoffLimit(4)
                .withNewTemplate()
                    .withNewSpec()
                        .addNewContainer()
                            .withName(podName)
                            .withImage("perl")
                            .withCommand("perl", "-Mbignum=bpi", "-wle", "print bpi(2000)")
                        .endContainer()
                        .withRestartPolicy("Never")
                    .endSpec()
                .endTemplate()
            .endSpec()
            .build();

    kubernetesClient.batch().v1().jobs().inNamespace(applicableNamespace).createOrReplace(job);
    kubernetesClient.close();

    System.out.println("CronJob triggered: applicableNamespace: " + applicableNamespace + ", cronjob name: " + cronjobName);
}
The code is executed on the Kubernetes cluster, but not from the application itself; it's an external program running in the cluster.
My goal is to trigger a given CronJob in a given namespace.
If you want to trigger an already existing CronJob, you need to provide an ownerReference to the existing CronJob in the Job:
// Get already existing CronJob
CronJob cronJob = kubernetesClient.batch().v1()
        .cronjobs()
        .inNamespace(namespace)
        .withName(cronJobName)
        .get();

// Create new Job object referencing CronJob
Job newJobToCreate = new JobBuilder()
        .withNewMetadata()
            .withName(jobName)
            .addNewOwnerReference()
                .withApiVersion("batch/v1")
                .withKind("CronJob")
                .withName(cronJob.getMetadata().getName())
                .withUid(cronJob.getMetadata().getUid())
            .endOwnerReference()
            .addToAnnotations("cronjob.kubernetes.io/instantiate", "manual")
        .endMetadata()
        .withSpec(cronJob.getSpec().getJobTemplate().getSpec())
        .build();

// Apply Job object to Kubernetes cluster
kubernetesClient.batch().v1()
        .jobs()
        .inNamespace(namespace)
        .resource(newJobToCreate)
        .create();
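As a follow-up (not part of the original answer), here is a small sketch of how the created Job could be checked for completion with the same fabric8 client; the class name and the namespace/job-name parameters are illustrative assumptions.
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class JobStatusCheck {
    // Returns the number of succeeded pods reported for the Job, or null if the status is not set yet
    public static Integer succeededPods(String namespace, String jobName) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            Job job = client.batch().v1().jobs()
                    .inNamespace(namespace)
                    .withName(jobName)
                    .get();
            return (job != null && job.getStatus() != null) ? job.getStatus().getSucceeded() : null;
        }
    }
}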
I'm creating clusters on AWS EMR (with the Console and the SDK), but these clusters always remain in the "Starting" state and never start. Why can this happen, and how can I solve it? Thanks.
bootstrap-actions log:
INFO i-062fab1a95f485684: new instance started
ERROR i-062fab1a95f485684: failed to start. bootstrap action 1 failed with non-zero exit code.
My code:
val emr = AmazonElasticMapReduceClientBuilder.standard()
  .withCredentials(new AWSStaticCredentialsProvider(awsCred))
  .withRegion(Regions.EU_WEST_1)
  .build()

val stepFactory = new StepFactory();
val enabledebugging = new StepConfig()
  .withName("Enable debugging")
  .withActionOnFailure("TERMINATE_JOB_FLOW")
  .withHadoopJarStep(stepFactory.newEnableDebuggingStep())

val spark = new Application().withName("Spark")
val hive = new Application().withName("Hive")
val ganglia = new Application().withName("Ganglia")
val zeppelin = new Application().withName("Zeppelin")

val request = new RunJobFlowRequest()
  .withName("Spark Cluster")
  .withReleaseLabel("emr-5.20.0")
  .withSteps(enabledebugging)
  .withApplications(spark)
  .withLogUri("s3://my-logs")
  .withServiceRole("EMR_DefaultRole")
  .withJobFlowRole("EMR_EC2_DefaultRole")
  .withInstances(new JobFlowInstancesConfig()
    .withEc2SubnetId("subnet-xxxxx")
    .withEc2KeyName("ec2test")
    .withInstanceCount(3)
    .withKeepJobFlowAliveWhenNoSteps(true)
    .withMasterInstanceType("m5.xlarge")
    .withSlaveInstanceType("m5.xlarge")
  );

val result = emr.runJobFlow(request);
System.out.println("The cluster ID is " + result.toString());
I'm writing an app that consists of 4 chained MapReduce jobs, which run on Amazon EMR. I'm using the JobFlow interface to chain the jobs. Each job is contained in its own class and has its own main method. All of these are packed into a .jar which is saved in S3, and the cluster is initialized from a small local app on my laptop, which configures the RunJobFlowRequest and submits it to EMR.
For most of my attempts to start the cluster, it fails with the error message "Terminated with errors: On the master instance (i-<cluster number>), bootstrap action 1 timed out executing." I looked up info on this issue, and all I could find is that this error is thrown if the combined bootstrap time of the cluster exceeds 45 minutes. However, this occurs only ~15 minutes after the request is submitted to EMR, regardless of the requested cluster size, be it 4 EC2 instances, 10, or even 20. This makes no sense to me at all; what am I missing?
Some tech specs:
- The project is compiled with Java 1.7.79
- The requested EMR image is 4.6.0, which uses Hadoop 2.7.2
- I'm using the AWS SDK for Java v. 1.10.64
This is my local main method, which sets up and submits the JobFlowRequest:
import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;

public class ExtractRelatedPairs {

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: ExtractRelatedPairs: <k>");
            System.exit(1);
        }
        int outputSize = Integer.parseInt(args[0]);
        if (outputSize < 0) {
            System.err.println("k should be positive");
            System.exit(1);
        }

        AWSCredentials credentials = null;
        try {
            credentials = new ProfileCredentialsProvider().getCredentials();
        } catch (Exception e) {
            throw new AmazonClientException(
                    "Cannot load the credentials from the credential profiles file. " +
                    "Please make sure that your credentials file is at the correct " +
                    "location (~/.aws/credentials), and is in valid format.",
                    e);
        }

        AmazonElasticMapReduce mapReduce = new AmazonElasticMapReduceClient(credentials);

        HadoopJarStepConfig jarStep1 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase1")
                .withArgs("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/5gram/data/", "hdfs:///output1/");

        StepConfig step1Config = new StepConfig()
                .withName("Phase 1")
                .withHadoopJarStep(jarStep1)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        HadoopJarStepConfig jarStep2 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase2")
                .withArgs("hdfs:///output1/", "hdfs:///output2/");

        StepConfig step2Config = new StepConfig()
                .withName("Phase 2")
                .withHadoopJarStep(jarStep2)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        HadoopJarStepConfig jarStep3 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase3")
                .withArgs("hdfs:///output2/", "hdfs:///output3/", args[0]);

        StepConfig step3Config = new StepConfig()
                .withName("Phase 3")
                .withHadoopJarStep(jarStep3)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        HadoopJarStepConfig jarStep4 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase4")
                .withArgs("hdfs:///output3/", "s3n://dsps162assignment2benasaf/output4");

        StepConfig step4Config = new StepConfig()
                .withName("Phase 4")
                .withHadoopJarStep(jarStep4)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(10)
                .withMasterInstanceType(InstanceType.M1Small.toString())
                .withSlaveInstanceType(InstanceType.M1Small.toString())
                .withHadoopVersion("2.7.2")
                .withEc2KeyName("AWS")
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withPlacement(new PlacementType("us-east-1a"));

        RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
                .withName("extract-related-word-pairs")
                .withInstances(instances)
                .withSteps(step1Config, step2Config, step3Config, step4Config)
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withServiceRole("EMR_DefaultRole")
                .withReleaseLabel("emr-4.6.0")
                .withLogUri("s3n://dsps162assignment2benasaf/logs/");

        System.out.println("Submitting the JobFlow Request to Amazon EMR and running it...");
        RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
        String jobFlowId = runJobFlowResult.getJobFlowId();
        System.out.println("Ran job flow with id: " + jobFlowId);
    }
}
A while back, I encountered a similar issue, where even a vanilla EMR 4.6.0 cluster was failing to get past startup and was therefore throwing a timeout error on the bootstrap step.
I ended up creating a cluster on a different/new VPC in a different region and it worked fine, which led me to believe there may be a problem with either the original VPC itself or the software in 4.6.0.
Also, regarding the VPC, it was specifically having an issue setting and resolving DNS names for the newly created cluster nodes, even though older versions of EMR did not have this problem.
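If the VPC's DNS settings are indeed the culprit, they can also be checked or enabled programmatically. A hedged sketch using the AWS SDK for Java v1 (the VPC ID is a placeholder; EMR clusters in a VPC generally need both DNS resolution and DNS hostnames enabled):
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.ModifyVpcAttributeRequest;

public class EnableVpcDns {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        String vpcId = "vpc-xxxxxxxx"; // placeholder: the VPC the EMR cluster launches into

        // The EC2 API accepts only one attribute per ModifyVpcAttribute call,
        // so DNS resolution and DNS hostnames are enabled in two separate requests.
        ec2.modifyVpcAttribute(new ModifyVpcAttributeRequest()
                .withVpcId(vpcId)
                .withEnableDnsSupport(true));
        ec2.modifyVpcAttribute(new ModifyVpcAttributeRequest()
                .withVpcId(vpcId)
                .withEnableDnsHostnames(true));
    }
}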
I am rather new to Spark and I am having issues running a simple word count application on a standalone cluster. I have a cluster consisting of one master node and one worker, launched on AWS using the spark-ec2 script. Everything works fine when I run the code locally using
./bin/spark-submit --class com.spark.SparkDataAnalysis --master local[*] ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount
This saves the output into the specified directory as it should.
When I try to run the application using
./bin/spark-submit --class com.spark.SparkDataAnalysis --master spark://server-ip:7077 ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount
it just keeps on running and never produces a final result. The directory gets created, but only a temporary file of 0 bytes is present.
According to the Spark UI it keeps on running the mapToPair function indefinitely.
Here is a picture of the Spark UI
Does anyone know why this is happening and how to solve it?
Here is the code:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import scala.Tuple2;

public class SparkDataAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkDataAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> input = sc.textFile(args[0]);
        JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" "))); // Spark 1.x Java API: flatMap returns an Iterable
        JavaPairRDD<String, Integer> counts = words.mapToPair(t -> new Tuple2<String, Integer>(t, 1)).reduceByKey((x, y) -> x + y);

        counts.saveAsTextFile(args[1]);
    }
}
I skipped using a standalone cluster via the spark-ec2 script and used Amazon EMR instead, where everything worked perfectly.
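For completeness, here is a rough sketch (not from the original answer) of how the same word-count jar could be submitted as a Spark step on an existing EMR cluster; the bucket paths, cluster ID, and class name are placeholders based on the question above, and it assumes the jar has been uploaded to S3.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SubmitSparkWordCountStep {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // command-runner.jar lets an EMR step invoke spark-submit on the cluster itself
        HadoopJarStepConfig sparkSubmit = new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit",
                        "--class", "com.spark.SparkDataAnalysis",
                        "s3://bucket/uber-ingestion-0.0.1-SNAPSHOT.jar", // jar uploaded to S3 (placeholder path)
                        "s3://bucket/textfile.txt",                      // input path (placeholder)
                        "s3://bucket/wordcount");                        // output path (placeholder)

        StepConfig step = new StepConfig()
                .withName("Spark word count")
                .withActionOnFailure("CONTINUE")
                .withHadoopJarStep(sparkSubmit);

        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX") // existing cluster ID (placeholder)
                .withSteps(step));
    }
}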
I am trying to attach a new volume to my instance using jclouds, but I can't find a way to do it.
final String POLL_PERIOD_TWENTY_SECONDS = String.valueOf(SECONDS.toMillis(20));

Properties overrides = new Properties();
overrides.setProperty(ComputeServiceProperties.POLL_INITIAL_PERIOD, POLL_PERIOD_TWENTY_SECONDS);
overrides.setProperty(ComputeServiceProperties.POLL_MAX_PERIOD, POLL_PERIOD_TWENTY_SECONDS);

Iterable<Module> modules = ImmutableSet.<Module>of(new SshjSshClientModule(), new SLF4JLoggingModule());
//Iterable<Module> modules = ImmutableSet.<Module>of(new SshjSshClientModule());

ComputeServiceContext context = ContextBuilder.newBuilder("aws-ec2")
        .credentials("valid user", "valid password")
        .modules(modules)
        .overrides(overrides)
        .buildView(ComputeServiceContext.class);
ComputeService computeService = context.getComputeService();

// Ubuntu AMI
Template template = computeService.templateBuilder()
        .locationId("us-east-1")
        .imageId("us-east-1/ami-7c807d14")
        .hardwareId("t1.micro")
        .build();

// This line creates the volume but does not attach it
template.getOptions().as(EC2TemplateOptions.class).mapNewVolumeToDeviceName("/dev/sdm", 100, true);

Set<? extends NodeMetadata> nodes = computeService.createNodesInGroup("m456", 1, template);

for (NodeMetadata nodeMetadata : nodes) {
    String publicAddress = nodeMetadata.getPublicAddresses().iterator().next();
    //String privateAddress = nodeMetadata.getPrivateAddresses().iterator().next();
    String username = nodeMetadata.getCredentials().getUser();
    String password = nodeMetadata.getCredentials().getPassword();
    // [...]
    System.out.println(String.format("ssh %s@%s %s", username, publicAddress, password));
    System.out.println(nodeMetadata.getCredentials().getPrivateKey());
}
How can I create and attach a volume mounted at the "/var" directory?
How could I create the instance with more hard disk space?
You can use the ElasticBlockStoreApi in jclouds for this. You have created the instance already, so create and attach the volume like this:
// Get node
NodeMetadata node = Iterables.getOnlyElement(nodes);

// Get AWS EC2 API
EC2Api ec2Api = computeServiceContext.unwrapApi(EC2Api.class);

// Create 100 GiB volume
Volume volume = ec2Api.getElasticBlockStoreApi().get()
        .createVolumeInAvailabilityZone(zoneId, 100);

// Attach to instance
Attachment attachment = ec2Api.getElasticBlockStoreApi().get()
        .attachVolumeInRegion(region, volume.getId(), node.getId(), "/dev/sdx");
Now you have an EBS volume attached and the VM has a block device created for it; you just need to run the right commands to mount it on the /var directory, which will depend on your particular operating system. You can run a script like this:
// Run script on instance
computeService.runScriptOnNode(node.getId(),
        Statements.exec("mount /dev/sdx /var"));