I have a Spark Streaming application that uses Kinesis, running on EMR 6.0.0.
It runs fine locally, but when deployed to AWS EMR it keeps failing with a
NoClassDefFoundError exception:
20/11/17 15:26:56 INFO Client:
client token: N/A
diagnostics: User class threw exception: java.lang.NoClassDefFoundError: com/fasterxml/jackson/dataformat/cbor/CBORFactory
at com.amazonaws.protocol.json.SdkJsonProtocolFactory.getSdkFactory(SdkJsonProtocolFactory.java:123)
at com.amazonaws.protocol.json.SdkJsonProtocolFactory.createGenerator(SdkJsonProtocolFactory.java:54)
at com.amazonaws.protocol.json.SdkJsonProtocolFactory.createGenerator(SdkJsonProtocolFactory.java:74)
at com.amazonaws.protocol.json.SdkJsonProtocolFactory.createProtocolMarshaller(SdkJsonProtocolFactory.java:64)
at com.amazonaws.services.kinesis.model.transform.DescribeStreamRequestProtocolMarshaller.marshall(DescribeStreamRequestProtocolMarshaller.java:52)
at com.amazonaws.services.kinesis.AmazonKinesisClient.executeDescribeStream(AmazonKinesisClient.java:861)
at com.amazonaws.services.kinesis.AmazonKinesisClient.describeStream(AmazonKinesisClient.java:846)
at com.amazonaws.services.kinesis.AmazonKinesisClient.describeStream(AmazonKinesisClient.java:887)
at com.gartner.tn.datafeed.application.PositionStreamApplicationV4.getJavaDStream(PositionStreamApplicationV4.java:240)
I had the exact same issue and solved it by removing the dependency on CBOR from Kinesis. I'm not sure whether that is an option for you, but it worked for me.
There are a few ways to do this. When running in local mode, I put the following code at the beginning of the main class of my Spark Streaming application:
System.setProperty(SDKGlobalConfiguration.AWS_CBOR_DISABLE_SYSTEM_PROPERTY, "true");
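For context, here is a minimal sketch of where that line sits, assuming the AWS SDK for Java v1; the class name and surrounding structure are placeholders, not my actual application:

import com.amazonaws.SDKGlobalConfiguration;

public class MyStreamingApp {
    public static void main(String[] args) throws Exception {
        // Disable CBOR before any Kinesis client is created, so the SDK
        // marshals requests as plain JSON and never touches CBORFactory.
        System.setProperty(SDKGlobalConfiguration.AWS_CBOR_DISABLE_SYSTEM_PROPERTY, "true");

        // ... build the StreamingContext / Kinesis DStream as usual ...
    }
}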
When running in cluster mode, start your spark-submit as follows:
spark-submit --deploy-mode cluster \
--conf spark.driver.extraJavaOptions='-Dcom.amazonaws.sdk.disableCbor=true' \
--conf spark.executor.extraJavaOptions='-Dcom.amazonaws.sdk.disableCbor=true'
When running in client mode on the cluster, start it like this:
spark-submit --deploy-mode client \
--driver-java-options '-Dcom.amazonaws.sdk.disableCbor=true' \
--conf spark.executor.extraJavaOptions='-Dcom.amazonaws.sdk.disableCbor=true'
This question led me to the answer: Getting an AmazonKinesisException Status Code: 502 when using LocalStack from Java
I wrote a few Spark jobs in Java and submitted the jars with the submit script.
bin/spark-submit --class "com.company.spark.jobName.SparkMain" --master local[*] /tmp/spark-job-1.0.jar
There will be a service running on the same server. The service should stop the job when it receives a stop command.
I have these information about job in service:
SparkHome
AppName
AppResource
Master uri
app-id
status
Is there any way to stop a running Spark job from Java code?
Have you reviewed the REST server and the ability to use /submissions/kill/[submissionId]? That seems like it would work for your needs.
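For illustration, here is a rough Java sketch of calling that endpoint against the standalone master's REST server; the host, the default REST port 6066, and the submission id are assumptions you would replace with your own values:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KillSparkSubmission {
    public static void main(String[] args) throws Exception {
        String restUrl = "http://spark-master-host:6066";       // assumed master REST endpoint
        String submissionId = "driver-20200101000000-0000";     // assumed submission id (the app-id you stored)

        // The standalone REST server accepts a POST to /v1/submissions/kill/<submissionId>.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restUrl + "/v1/submissions/kill/" + submissionId))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response body is a JSON KillSubmissionResponse with a "success" flag.
        System.out.println(response.body());
    }
}

The service you describe could call something like this whenever it receives the stop command, using the driver/submission id it recorded at submit time.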
Hello, my Spark configuration in Java is:
SparkSession ss = SparkSession.builder()
.config("spark.driver.host", "192.168.0.103")
.config("spark.driver.port", "4040")
.config("spark.dynamicAllocation.enabled", "false")
.config("spark.cores.max","1")
.config("spark.executor.memory","471859200")
.config("spark.executor.cores","1")
//.master("local[*]")
.master("spark://kousik-pc:7077")
.appName("abc")
.getOrCreate();
Now when I submit any job from inside the code (not by submitting a jar), I get the warning:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The Spark UI is shown in the screenshot below.
The worker shown in the screenshot was started with the command:
~/spark/sbin/start-slave.sh
All four jobs that are in the WAITING state were submitted from Java code. I have tried all the solutions I could find. Any ideas, please?
As per my understanding, you want to run a Spark job using only one executor core; in that case you don't have to specify spark.executor.cores.
spark.cores.max should handle assigning only one core to each job, since its value is 1.
It's always good practice to provide configuration details such as the master and executor memory/cores in the spark-submit command, like below:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://xxx.xxx.xxx.xxx:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
If you want to explicitly cap the cores given to each job, use --total-executor-cores in your spark-submit command.
Check the documentation here
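Since the four waiting jobs here are launched from Java code rather than via spark-submit, a rough in-code equivalent of the same settings might look like the sketch below; the master URL is taken from the question, and the memory value is only an illustrative placeholder that must fit within what the worker offers:

import org.apache.spark.sql.SparkSession;

public class OneCoreJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("spark://kousik-pc:7077")
                .appName("abc")
                .config("spark.cores.max", "1")          // cap the total cores this application may take
                .config("spark.executor.memory", "512m") // keep this at or below the worker's advertised memory
                .getOrCreate();

        // ... job logic ...

        spark.stop(); // release the cores so the next waiting application can be scheduled
    }
}

Note that applications submitted this way hold their cores until the SparkSession is stopped, which is one reason several of them can end up in the WAITING state at once.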
I am running spark 1.6.0 on a small computing cluster and wish to kill a driver program. I've submitted a custom implementation of the out of the box Spark Pi calculation example with the following options:
spark-submit --class JavaSparkPi --master spark://clusterIP:portNum --deploy-mode cluster /path/to/jarfile/JavaSparkPi.jar 10
Note: 10 is a command line argument and is irrelevant for this question.
I've tried many methods of killing the driver program that was started on the cluster:
1. ./bin/spark-class org.apache.spark.deploy.Client kill
2. spark-submit --master spark://node-1:6066 --kill $driverid
3. Issue the kill command from the Spark administrative interface (web UI): http://my-cluster-url:8080
Number 2 yields a success JSON response:
{
"action" : "KillSubmissionResponse",
"message" : "Kill request for driver-xxxxxxxxxxxxxx-xxxx submitted",
"serverSparkVersion" : "1.6.0",
"submissionId" : "driver-xxxxxxxxxxxxxx-xxxx",
"success" : true
}
Where 'driver-xxxxxxxxxxxxxx-xxxx' is the actual driver id.
But the web UI http://my-cluster-url:8080/ still shows the driver program as running.
Is there anything else I can try?
I am trying to run a simple Map/Reduce Java program using Spark over YARN (Cloudera Hadoop 5.2 on CentOS). I have tried this in two different ways. The first way is the following:
YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/;
/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --jars /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar simplemr.jar
This method gives the following error:
diagnostics: Application application_1434177111261_0007 failed 2 times
due to AM Container for appattempt_1434177111261_0007_000002 exited
with exitCode: -1000 due to: Resource
hdfs://kc1ltcld29:9000/user/myuser/.sparkStaging/application_1434177111261_0007/spark-assembly-1.4.0-hadoop2.4.0.jar
changed on src filesystem (expected 1434549639128, was 1434549642191
Then I tried without the --jars:
YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/;
/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster simplemr.jar
diagnostics: Application application_1434177111261_0008 failed 2 times
due to AM Container for appattempt_1434177111261_0008_000002 exited
with exitCode: -1000 due to: File does not exist:
hdfs://kc1ltcld29:9000/user/myuser/.sparkStaging/application_1434177111261_0008/spark-assembly-1.4.0-hadoop2.4.0.jar
.Failing this attempt.. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.myuser
start time: 1434549879649
final status: FAILED
tracking URL: http://kc1ltcld29:8088/cluster/app/application_1434177111261_0008
user: myuser Exception in thread "main" org.apache.spark.SparkException: Application
application_1434177111261_0008 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:841)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:867)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 15/06/17 10:04:57 INFO util.Utils: Shutdown hook called 15/06/17
10:04:57 INFO util.Utils: Deleting directory
/tmp/spark-2aca3f35-abf1-4e21-a10e-4778a039d0f4
I tried deleting all the .jars from hdfs://users//.sparkStaging and resubmitting, but that didn't help.
The problem was solved by copying spark-assembly.jar into a directory on HDFS (reachable from every node) and then passing it to spark-submit via the --conf spark.yarn.jar parameter. The commands are listed below:
hdfs dfs -copyFromLocal /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar /user/spark/spark-assembly.jar
/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar simplemr.jar
If you are getting this error, it means you are uploading the assembly jar with the --jars option or manually copying it to HDFS on each node.
I have followed this approach and it works for me.
In yarn-cluster mode, spark-submit automatically uploads the assembly jar to a distributed cache that all executor containers read from, so there is no need to manually copy the assembly jar to all nodes (or pass it through --jars).
It seems there are two versions of the same jar in your HDFS.
Try removing all the old jars from your .sparkStaging directory and try again; it should work.
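If you prefer to do that cleanup from code rather than the hdfs CLI, a small sketch using the Hadoop FileSystem API might look like this; the NameNode URI and user name are placeholders taken from the logs above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanSparkStaging {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://kc1ltcld29:9000"), conf);

        // Recursively delete the stale staging directories left behind by failed submissions.
        Path staging = new Path("/user/myuser/.sparkStaging");
        if (fs.exists(staging)) {
            fs.delete(staging, true);
        }
        fs.close();
    }
}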
I have the following configuration for an Apache Spark 1.2.1 standalone cluster:
Hadoop 2.6.0
2 nodes - one master and one slave - in Standalone cluster
3-node Cassandra
total cores: 6 (2 master, 4 slaves)
total memory: 13 GB
I run Spark with the standalone cluster manager as follows:
./spark-submit --class com.b2b.processor.ProcessSampleJSONFileUpdate \
--conf num-executors=2 \
--executor-memory 2g \
--driver-memory 3g \
--deploy-mode cluster \
--supervise \
--master spark://abc.xyz.net:7077 \
hdfs://abc:9000/b2b/b2bloader-1.0.jar ds6_2000/*.json
My job executes successfully, i.e. it reads data from the files and inserts it into Cassandra.
The Spark documentation says that in a standalone cluster an application uses all available cores by default, but my cluster is using only 1 core per application. Also, after starting the application, the Spark UI shows Applications: 0 Running and Drivers: 1 Running.
My questions are:
Why is it not using all 6 available cores?
Why is the Spark UI showing Applications: 0 Running?
The code:
public static void main(String[] args) throws Exception {
String fileName = args[0];
System.out.println("----->Filename : "+fileName);
Long now = new Date().getTime();
SparkConf conf = new SparkConf(true)
.setMaster("local")
.setAppName("JavaSparkSQL_" +now)
.set("spark.executor.memory", "1g")
.set("spark.cassandra.connection.host", "192.168.1.65")
.set("spark.cassandra.connection.native.port", "9042")
.set("spark.cassandra.connection.rpc.port", "9160");
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaRDD<String> input = ctx.textFile("hdfs://abc.xyz.net:9000/dataLoad/resources/" + fileName,6);
JavaRDD<DataInput> result = input.mapPartitions(new ParseJson()).filter(new FilterLogic());
System.out.print("Count --> "+result.count());
System.out.println(StringUtils.join(result.collect(), ","));
javaFunctions(result).writerBuilder("ks","pt_DataInput",mapToRow(DataInput.class)).saveToCassandra();
}
If you're setting your master in your app to local (via .setMaster("local")), it will not connect to spark://abc.xyz.net:7077.
You don't need to set the master in the app if you are setting it with the spark-submit command.
What was happening is that you thought you were using standalone mode, for which the default is to use all available cores, but in reality, with "local" as the master, you were using local mode. In local mode Spark runs everything in a single JVM, and plain "local" uses only one worker thread, i.e. 1 core. This is also why everything went as you expected when you changed your master parameter to "spark://abc.xyz.net:7077".
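As an illustration, here is a minimal sketch of the question's configuration with the hard-coded master removed, assuming the master URL is supplied by spark-submit (--master spark://abc.xyz.net:7077) instead; only the Cassandra host setting from the example is kept:

import java.util.Date;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ProcessSampleJSONFileUpdate {
    public static void main(String[] args) {
        Long now = new Date().getTime();

        // No setMaster(...) here: the master comes from the spark-submit command line.
        SparkConf conf = new SparkConf(true)
                .setAppName("JavaSparkSQL_" + now)
                .set("spark.cassandra.connection.host", "192.168.1.65");

        JavaSparkContext ctx = new JavaSparkContext(conf);
        // ... read the JSON files and save to Cassandra as before ...
        ctx.stop();
    }
}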
Try setting the master to local[*]; this will use all the cores of the local machine.