Submitting Spark application on standalone cluster - java

I am rather new at using Spark and I am having issues running a simple word count application on a standalone cluster. I have a cluster consisting of one master node and one worker, launched on AWS using the spark-ec2 script. Everything works fine when I run the code locally using
./bin/spark-submit --class com.spark.SparkDataAnalysis --master local[*] ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount
This saves the output into the specified directory as it should.
When I try to run the application using
./bin/spark-submit --class com.spark.SparkDataAnalysis --master spark://server-ip:7077 ./uber-ingestion-0.0.1-SNAPSHOT.jar file:///root/textfile.txt s3n://bucket/wordcount
it just keeps running and never produces a final result. The output directory gets created, but it contains only a temporary file of 0 bytes.
According to the Spark UI, it keeps running the mapToPair stage indefinitely.
Does anyone know why this is happening and how to solve it?
Here is the code:
public class SparkDataAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkDataAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> input = sc.textFile(args[0]);
        JavaRDD<String> words = input.flatMap(s -> Arrays.asList(s.split(" ")));
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
                .reduceByKey((x, y) -> x + y);

        counts.saveAsTextFile(args[1]);
    }
}

I skipped using a standalone cluster via the spark-ec2 script and used Amazon EMR instead. There everything worked perfectly.
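For anyone taking the same route, below is a minimal sketch of submitting the same word-count jar as a Spark step to an already running EMR cluster with the AWS SDK for Java. It assumes an EMR 4.x+ release, where spark-submit is invoked through command-runner.jar; the S3 paths and the cluster id are placeholders, not values from the question.

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SubmitWordCountStep {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(
                new ProfileCredentialsProvider().getCredentials());

        // On EMR 4.x+ a Spark job is typically submitted as a step that calls spark-submit
        // via command-runner.jar. All S3 paths below are placeholders.
        HadoopJarStepConfig sparkStep = new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit",
                        "--deploy-mode", "cluster",
                        "--class", "com.spark.SparkDataAnalysis",
                        "s3://my-bucket/jars/uber-ingestion-0.0.1-SNAPSHOT.jar",
                        "s3://my-bucket/textfile.txt",
                        "s3://my-bucket/wordcount");

        StepConfig stepConfig = new StepConfig()
                .withName("Spark word count")
                .withHadoopJarStep(sparkStep)
                .withActionOnFailure("CONTINUE");

        // Placeholder id of an already running EMR cluster.
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")
                .withSteps(stepConfig));
    }
}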

Related

Must the Spark Streaming developer install Hadoop on his computer?

I am trying to learn Spark Streaming. When my demo sets the master to "local[2]", it works normally. But when I set the master to a local cluster started in standalone mode, an error occurs:
lost an executor 2 (already removed): Unable to create executor due to java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
It should be noted that I submitted the code from IntelliJ IDEA.
@Component
public final class JavaNetworkWordCount {

    private static final String SPACE = " ";

    @Bean("test")
    public void test() throws Exception {
        // Create a StreamingContext pointing at the standalone cluster with a 1-second batch interval
        SparkConf conf = new SparkConf()
                .setJars(new String[]{"E:\\project\\spark-demo\\target\\spark-demo-0.0.1-SNAPSHOT.jar"})
                .setMaster("spark://10.4.41.93:7077")
                .set("spark.driver.host", "127.0.0.1")
                .setAppName("JavaWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("192.168.2.51", 9999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(SPACE)).iterator());

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
It turned out that after I downloaded Hadoop, set HADOOP_HOME to its location, and restarted the cluster, the error disappeared.
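The check behind that exception comes from Hadoop's Shell utility, which looks for either the HADOOP_HOME environment variable or the hadoop.home.dir system property (on Windows the directory must contain bin\winutils.exe). Installing Hadoop and setting HADOOP_HOME on the worker machines, as described above, is the robust fix. As a hedged sketch with hypothetical paths, the driver side can also be satisfied programmatically, and Spark's spark.executorEnv.* settings can forward the variable to the executor processes, provided Hadoop is actually unpacked at that path on every worker:

import org.apache.spark.SparkConf;

public class HadoopHomeWorkaround {
    public static SparkConf buildConf() {
        // Driver JVM: satisfy Hadoop's Shell check without a system-wide install (hypothetical path).
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.2");

        // Executors: spark.executorEnv.* sets an environment variable for the executor processes.
        // This only helps if Hadoop really exists at that path on each worker machine.
        return new SparkConf()
                .setAppName("JavaWordCount")
                .setMaster("spark://10.4.41.93:7077")
                .set("spark.executorEnv.HADOOP_HOME", "/opt/hadoop-2.7.2");
    }
}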

Accumulo scan/write not running in standalone Java main program in AWS EC2 master using Cloudera CDH 5.8.2

We are trying to run a simple write/scan against Accumulo (client jar 1.5.0) from a standalone Java main program (Maven Shade executable), as shown below, on the AWS EC2 master (environment described below) via PuTTY:
public class AccumuloQueryApp {

    private static final Logger logger = LoggerFactory.getLogger(AccumuloQueryApp.class);

    public static final String INSTANCE = "accumulo";            // miniInstance
    public static final String ZOOKEEPERS = "ip-x-x-x-100:2181"; // localhost:28076

    private static Connector conn;

    static {
        // Accumulo
        Instance instance = new ZooKeeperInstance(INSTANCE, ZOOKEEPERS);
        try {
            conn = instance.getConnector("root", new PasswordToken("xxx"));
        } catch (Exception e) {
            logger.error("Connection", e);
        }
    }

    public static void main(String[] args) throws TableNotFoundException, AccumuloException, AccumuloSecurityException, TableExistsException {
        System.out.println("connection with : " + conn.whoami());

        BatchWriter writer = conn.createBatchWriter("test", ofBatchWriter());
        for (int i = 0; i < 10; i++) {
            Mutation m1 = new Mutation(String.valueOf(i));
            m1.put("personal_info", "first_name", String.valueOf(i));
            m1.put("personal_info", "last_name", String.valueOf(i));
            m1.put("personal_info", "phone", "983065281" + i % 2);
            m1.put("personal_info", "email", String.valueOf(i));
            m1.put("personal_info", "date_of_birth", String.valueOf(i));
            m1.put("department_info", "id", String.valueOf(i));
            m1.put("department_info", "short_name", String.valueOf(i));
            m1.put("department_info", "full_name", String.valueOf(i));
            m1.put("organization_info", "id", String.valueOf(i));
            m1.put("organization_info", "short_name", String.valueOf(i));
            m1.put("organization_info", "full_name", String.valueOf(i));
            writer.addMutation(m1);
        }
        writer.close();
        System.out.println("Writing complete ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");

        Scanner scanner = conn.createScanner("test", new Authorizations());
        System.out.println("Step 1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");
        scanner.setRange(new Range("3", "7"));
        System.out.println("Step 2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");
        scanner.forEach(e -> System.out.println("Key: " + e.getKey() + ", Value: " + e.getValue()));
        System.out.println("Step 3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");
        scanner.close();
    }

    public static BatchWriterConfig ofBatchWriter() {
        // Batch writer properties
        final int MAX_LATENCY = 1;
        final int MAX_MEMORY = 10000000;
        final int MAX_WRITE_THREADS = 10;
        final int TIMEOUT = 10;

        BatchWriterConfig config = new BatchWriterConfig();
        config.setMaxLatency(MAX_LATENCY, TimeUnit.MINUTES);
        config.setMaxMemory(MAX_MEMORY);
        config.setMaxWriteThreads(MAX_WRITE_THREADS);
        config.setTimeout(TIMEOUT, TimeUnit.MINUTES);
        return config;
    }
}
The connection is established correctly, but when creating the BatchWriter it gets an error and keeps retrying in a loop with the same error:
[impl.ThriftScanner] DEBUG: Error getting transport to ip-x-x-x-100:10011 : NotServingTabletException(extent:TKeyExtent(table:21 30, endRow:21 30 3C, prevEndRow:null))
When we run the same code (writing to and reading from Accumulo) inside a Spark job and submit it to the YARN cluster, it runs perfectly. We are struggling to figure out why, but have no clue. The environment is described below.
Cloudera CDH 5.8.2 on AWS (4 EC2 instances: one master and 3 children).
The private IPs are like:
Master: x.x.x.100
Child1: x.x.x.101
Child2: x.x.x.102
Child3: x.x.x.103
We have the following installed in CDH:
Cluster (CDH 5.8.2)
Accumulo 1.6 (Tracer not installed, Garbage Collector on Child2, Master on Master, Monitor on Child3, Tablet Server on Master)
HBase
HDFS (master as NameNode, all 3 children as DataNodes)
Kafka
Spark
YARN (MR2 Included)
ZooKeeper
Hrm, it's very curious that it runs as a Spark-on-YARN job but not as a regular Java application. Usually it's the other way around :)
I would verify that the JARs on the classpath of the standalone Java app match the JARs used by the Spark-on-YARN job as well as the Accumulo server classpath.
If that doesn't help, try increasing the log4j level to DEBUG or TRACE and see if anything jumps out at you. If you have a hard time understanding what the logging is saying, feel free to send an email to user@accumulo.apache.org and you'll definitely have more eyes on the problem.
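If it helps, here is a minimal sketch of turning up the client-side logging programmatically before the connection is made, assuming the log4j 1.x API that the Accumulo 1.x client already depends on:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class DebugLogging {
    public static void enable() {
        // Raise client-side Accumulo and ZooKeeper logging before creating the ZooKeeperInstance.
        Logger.getLogger("org.apache.accumulo").setLevel(Level.TRACE);
        Logger.getLogger("org.apache.zookeeper").setLevel(Level.DEBUG);
    }
}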

EMR cluster bootstrap failure (timeout) occurs most of the times I initialize a cluster

I'm writing an app that consists of 4 chained MapReduce jobs, which run on Amazon EMR. I'm using the JobFlow interface to chain the jobs. Each job is contained in its own class and has its own main method. All of these are packed into a .jar stored in S3, and the cluster is initialized from a small local app on my laptop, which configures the JobFlowRequest and submits it to EMR.
For most of my attempts to start the cluster, it fails with the error message "Terminated with errors On the master instance (i-<cluster number>), bootstrap action 1 timed out executing". I looked up information on this issue, and all I could find is that this exception is thrown when the combined bootstrap time of the cluster exceeds 45 minutes. However, the error occurs only ~15 minutes after the request is submitted to EMR, regardless of the requested cluster size, be it 4 EC2 instances, 10, or even 20. This makes no sense to me at all; what am I missing?
Some tech specs:
- The project is compiled with Java 1.7.79
- The requested EMR image is 4.6.0, which uses Hadoop 2.7.2
- I'm using the AWS SDK for Java v. 1.10.64
This is my local main method, which sets up and submits the JobFlowRequest:
import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;

public class ExtractRelatedPairs {

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: ExtractRelatedPairs: <k>");
            System.exit(1);
        }
        int outputSize = Integer.parseInt(args[0]);
        if (outputSize < 0) {
            System.err.println("k should be positive");
            System.exit(1);
        }

        AWSCredentials credentials = null;
        try {
            credentials = new ProfileCredentialsProvider().getCredentials();
        } catch (Exception e) {
            throw new AmazonClientException(
                    "Cannot load the credentials from the credential profiles file. " +
                    "Please make sure that your credentials file is at the correct " +
                    "location (~/.aws/credentials), and is in valid format.",
                    e);
        }

        AmazonElasticMapReduce mapReduce = new AmazonElasticMapReduceClient(credentials);

        HadoopJarStepConfig jarStep1 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase1")
                .withArgs("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/5gram/data/", "hdfs:///output1/");

        StepConfig step1Config = new StepConfig()
                .withName("Phase 1")
                .withHadoopJarStep(jarStep1)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        HadoopJarStepConfig jarStep2 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase2")
                .withArgs("shdfs:///output1/", "hdfs:///output2/");

        StepConfig step2Config = new StepConfig()
                .withName("Phase 2")
                .withHadoopJarStep(jarStep2)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        HadoopJarStepConfig jarStep3 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase3")
                .withArgs("hdfs:///output2/", "hdfs:///output3/", args[0]);

        StepConfig step3Config = new StepConfig()
                .withName("Phase 3")
                .withHadoopJarStep(jarStep3)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        HadoopJarStepConfig jarStep4 = new HadoopJarStepConfig()
                .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
                .withMainClass("Phase4")
                .withArgs("hdfs:///output3/", "s3n://dsps162assignment2benasaf/output4");

        StepConfig step4Config = new StepConfig()
                .withName("Phase 4")
                .withHadoopJarStep(jarStep4)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(10)
                .withMasterInstanceType(InstanceType.M1Small.toString())
                .withSlaveInstanceType(InstanceType.M1Small.toString())
                .withHadoopVersion("2.7.2")
                .withEc2KeyName("AWS")
                .withKeepJobFlowAliveWhenNoSteps(false)
                .withPlacement(new PlacementType("us-east-1a"));

        RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
                .withName("extract-related-word-pairs")
                .withInstances(instances)
                .withSteps(step1Config, step2Config, step3Config, step4Config)
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withServiceRole("EMR_DefaultRole")
                .withReleaseLabel("emr-4.6.0")
                .withLogUri("s3n://dsps162assignment2benasaf/logs/");

        System.out.println("Submitting the JobFlow Request to Amazon EMR and running it...");
        RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
        String jobFlowId = runJobFlowResult.getJobFlowId();
        System.out.println("Ran job flow with id: " + jobFlowId);
    }
}
A while back, I encountered a similar issue, where even a vanilla EMR 4.6.0 cluster was failing to get past startup and was therefore throwing a timeout error on the bootstrap step.
I ended up creating a cluster in a different/new VPC in a different region and it worked fine, which led me to believe there may be a problem with either the original VPC itself or the software in 4.6.0.
Also, regarding the VPC: it was specifically having an issue setting and resolving DNS names for the newly created cluster nodes, even though older versions of EMR did not have this problem.
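If you hit the same DNS symptom, the VPC's enableDnsSupport and enableDnsHostnames attributes are worth checking, since the cluster nodes need to resolve each other by DNS name. Here is a hedged sketch with the AWS SDK for Java v1 EC2 client; the VPC id is a placeholder, and the EC2 API accepts only one attribute per ModifyVpcAttribute call:

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.ModifyVpcAttributeRequest;

public class EnableVpcDns {
    public static void main(String[] args) {
        AmazonEC2 ec2 = new AmazonEC2Client(new ProfileCredentialsProvider().getCredentials());
        String vpcId = "vpc-xxxxxxxx"; // placeholder: the VPC the EMR cluster launches into

        // The EC2 API only allows one attribute to be modified per request.
        ec2.modifyVpcAttribute(new ModifyVpcAttributeRequest()
                .withVpcId(vpcId)
                .withEnableDnsSupport(true));
        ec2.modifyVpcAttribute(new ModifyVpcAttributeRequest()
                .withVpcId(vpcId)
                .withEnableDnsHostnames(true));
    }
}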

where is my EMR cluster

I am trying to create an EMR cluster from Java, but I can neither find it in the EMR cluster list nor see the requested instances on EC2.
EMR roles do exist:
sqlInjection#VirtualBox:~$ aws iam list-roles | grep EMR
"RoleName": "EMR_DefaultRole",
"Arn": "arn:aws:iam::removed:role/EMR_DefaultRole"
"RoleName": "EMR_EC2_DefaultRole",
"Arn": "arn:aws:iam::removed:role/EMR_EC2_DefaultRole"
And here is my Java code:
AWSCredentials awsCredentials = new BasicAWSCredentials(awsKey, awsKeySecret);
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(awsCredentials);

StepFactory stepFactory = new StepFactory();
StepConfig enabledebugging = new StepConfig()
        .withName("Enable debugging")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

HadoopJarStepConfig hadoopConfig1 = new HadoopJarStepConfig()
        .withJar("s3://foo.bucket/hadoop_jar/2015-01-12/foo.jar")
        .withMainClass("com.strackoverflow.DriverFoo") // optional main class; can be omitted if the jar above has a manifest
        .withArgs("--input=s3://foo.bucket/logs/,s3://foo.bucket/morelogs/", "--output=s3://foo.bucket/myEMROutput", "--inputType=text"); // custom Java code handles the --input, --output and --inputType parameters

StepConfig customStep = new StepConfig("Step1", hadoopConfig1);

Collection<StepConfig> steps = new ArrayList<StepConfig>();
steps.add(enabledebugging);
steps.add(customStep);

JobFlowInstancesConfig instancesConfig = new JobFlowInstancesConfig()
        .withEc2KeyName("fookey") // not fookey.pem
        .withInstanceCount(2)
        .withKeepJobFlowAliveWhenNoSteps(false) // in the AWS example this is set to true
        .withMasterInstanceType("m1.medium")
        .withSlaveInstanceType("m1.medium");

RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("java programatic request")
        .withAmiVersion("3.3.1")
        .withSteps(steps) // the Amazon example launches debug and Hive; here it is debug and a jar
        .withLogUri("s3://devel.rui/emr_clusters/pr01/")
        .withInstances(instancesConfig)
        .withVisibleToAllUsers(true);

RunJobFlowResult result = emr.runJobFlow(request);
System.out.println("toString " + result.toString());
System.out.println("getJobFlowId " + result.getJobFlowId());
System.out.println("hashCode " + result.hashCode());
Where is my cluster? I cannot see it in the cluster list, the output folder is not created, the logs folder stays empty, and no instances are visible on EC2.
Yet the program outputs this:
toString {JobFlowId: j-2xxxxxxU}
getJobFlowId j-2xxxxxU
hashCode -1xxxxx4
I had followed the instructions from here to create the cluster:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
And these to create the Java job:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-common-programming-sample.html
In the Amazon example, the region is not configured. After configuring the region, the cluster launches properly:
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(awsCredentials);
emr.setRegion(Region.getRegion(Regions.EU_WEST_1));
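As a follow-up sketch, the job flow id returned by runJobFlow can also be polled with describeCluster (same AWS SDK v1 client, same region) to confirm the cluster is actually visible and starting:

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterResult;

public class CheckCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(
                new ProfileCredentialsProvider().getCredentials());
        emr.setRegion(Region.getRegion(Regions.EU_WEST_1)); // must match the region used for runJobFlow

        // Use the id printed by runJobFlow, e.g. "j-2xxxxxxU".
        DescribeClusterResult result = emr.describeCluster(
                new DescribeClusterRequest().withClusterId("j-2xxxxxxU"));
        System.out.println(result.getCluster().getStatus().getState()); // e.g. STARTING, RUNNING, TERMINATED
    }
}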

Java code or Oozie

I'm new to Hadoop, so I have some doubts about what to do in the following case.
I have an algorithm that includes multiple runs of different jobs and sometimes multiple runs of a single job (in a loop).
How should I achieve this: using Oozie, or using Java code? I was looking through the Mahout code and in the ClusterIterator class I found this:
public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
        throws IOException, InterruptedException, ClassNotFoundException {
    ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
    Path clustersOut = null;
    int iteration = 1;
    while (iteration <= numIterations) {
        conf.set(PRIOR_PATH_KEY, priorPath.toString());

        String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
        Job job = new Job(conf, jobName);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(ClusterWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(ClusterWritable.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(CIMapper.class);
        job.setReducerClass(CIReducer.class);

        FileInputFormat.addInputPath(job, inPath);
        clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);
        priorPath = clustersOut;
        FileOutputFormat.setOutputPath(job, clustersOut);

        job.setJarByClass(ClusterIterator.class);
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
        }
        ClusterClassifier.writePolicy(policy, clustersOut);
        FileSystem fs = FileSystem.get(outPath.toUri(), conf);
        iteration++;
        if (isConverged(clustersOut, conf, fs)) {
            break;
        }
    }
    Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
    FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
}
So, they have a loop in which they run MR jobs. Is this a good approach? I know that Oozie is used for DAGs and can be used with other components, such as Pig, but should I consider using it for something like this?
What if I want to run a clustering algorithm multiple times (using a specific driver)? Should I do that in a loop, or using Oozie?
Thanks
If you are looking to run MapReduce jobs only, then you can consider the following approaches:
Chain MR jobs using the MapReduce JobControl API (a short sketch follows the link below).
http://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/mapreduce/lib/jobcontrol/JobControl.html
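A hedged sketch of that JobControl approach, assuming two Job instances that have already been configured elsewhere (the Mapper/Reducer setup is omitted):

import java.util.Arrays;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TwoStepPipeline {
    public static void run(Job first, Job second) throws Exception {
        // Wrap each Job and declare that the second step depends on the first.
        ControlledJob step1 = new ControlledJob(first, null);
        ControlledJob step2 = new ControlledJob(second, Arrays.asList(step1));

        JobControl control = new JobControl("two-step-pipeline");
        control.addJob(step1);
        control.addJob(step2);

        // JobControl is a Runnable: run it in its own thread and poll until all jobs finish.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();

        if (!control.getFailedJobList().isEmpty()) {
            throw new RuntimeException("Pipeline failed: " + control.getFailedJobList());
        }
    }
}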
Submit multiple MR jobs from a single driver class.
Job job1 = new Job(getConf());
job1.waitForCompletion(true);
if (job1.isSuccessful()) {
    // start another job with a different Mapper
    // change config
    Job job2 = new Job(getConf());
    job2.waitForCompletion(true);
}
If you have a complex DAG or one involving multiple ecosystem tools like Hive or Pig, then Oozie suits well.
