I am trying to submit a Spark program from cmd on Windows 10 with the command below:
spark-submit --class abc.Main --master local[2] C:\Users\arpitbh\Desktop\AmdocsIDE\workspace\Line_Count_Spark\target\Line_Count_Spark-0.0.1-SNAPSHOT.jar
but after running it I am getting the following error:
17/05/02 11:56:57 INFO ShutdownHookManager: Deleting directory C:\Users\arpitbh\AppData\Local\Temp\spark-03f14dbe-1802-40ca-906c-af8de0f462f9
17/05/02 11:56:57 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\arpitbh\AppData\Local\Temp\spark-03f14dbe-1802-40ca-906c-af8de0f462f9
java.io.IOException: Failed to delete: C:\Users\arpitbh\AppData\Local\Temp\spark-03f14dbe-1802-40ca-906c-af8de0f462f9
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
I have also checked the Apache Spark JIRA; this defect has been marked as resolved, but no solution is mentioned. Please help.
package abc;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        SparkConf conf = new SparkConf().setAppName("Line_Count").setMaster("local[2]");
        JavaSparkContext ctx = new JavaSparkContext(conf);
        JavaRDD<String> textLoadRDD = ctx.textFile("C:/spark/README.md");
        System.out.println(textLoadRDD.count());
        System.getProperty("java.io.tmpdir");
    }
}
This is probably because you are instantiating the SparkContext without a SPARK_HOME or HADOOP_HOME set, so the program cannot find winutils.exe in the bin directory. I found that when I went from
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
to
JavaSparkContext sc = new JavaSparkContext("local[*]", "programname",
        System.getenv("SPARK_HOME"), System.getenv("JARS"));
the error went away.
I believe you are trying to execute the program without setting up the user variables HADOOP_HOME or SPARK_LOCAL_DIRS.
I had the same issue and resolved it by creating the variables, e.g. HADOOP_HOME -> C:\Hadoop, SPARK_LOCAL_DIRS -> C:\tmp\spark.
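For reference, a minimal sketch of pointing Spark and Hadoop at those locations from the program itself, before the context is created (the paths are placeholders; winutils.exe must exist under %HADOOP_HOME%\bin):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {
    public static void main(String[] args) {
        // Placeholder paths: adjust to where winutils.exe and a writable temp dir live on your machine.
        System.setProperty("hadoop.home.dir", "C:\\Hadoop");   // programmatic stand-in for HADOOP_HOME

        SparkConf conf = new SparkConf()
                .setAppName("Line_Count")
                .setMaster("local[2]")
                .set("spark.local.dir", "C:/tmp/spark");        // where Spark keeps its scratch/temp files

        JavaSparkContext ctx = new JavaSparkContext(conf);
        System.out.println(ctx.textFile("C:/spark/README.md").count());
        ctx.stop();
    }
}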
Related
I have a situation where, if a certain condition is not met, there is no need to create a Spark session inside the class and the application exits with a message.
I am submitting the job as below in "yarn-cluster" mode:
spark2-submit --class com.test.TestSpark --master yarn --deploy-mode client /home/test.jar false
The final status of the job is "failed".
But if the same is run in "yarn-client" mode, the Spark job completes successfully.
Below is the code:
package com.test;

import org.apache.spark.sql.SparkSession;

public class TestSpark {

    public static void main(String[] args) {
        boolean condition = false;
        condition = Boolean.parseBoolean(args[0]);
        if (condition) {
            SparkSession sparkSession = SparkSession.builder().appName("Data Ingestion Framework")
                    .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
                    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
                    .enableHiveSupport()
                    .getOrCreate();
        } else {
            System.out.println("coming out no processing required");
        }
    }
}
In the logs for "yarn-cluster" I can see two containers getting created, and one of them fails with the error below:
18/05/09 18:21:51 WARN security.UserGroupInformation: PriviledgedActionException as:*****<uername> (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: hdfs://hostname/user/*****<uername>/.sparkStaging/application_1525778267559_0054/__spark_conf__.zip
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://hostname/user/*****<uername>/.sparkStaging/application_1525778267559_0054/__spark_conf__.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1257)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$4$$anonfun$apply$3.apply(ApplicationMaster.scala:198)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$4$$anonfun$apply$3.apply(ApplicationMaster.scala:195)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$4.apply(ApplicationMaster.scala:195)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$4.apply(ApplicationMaster.scala:160)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:787)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
Could you please explain why this is happening and how Spark handles container creation?
Amit, this is a known issue that is still open.
https://issues.apache.org/jira/browse/SPARK-10795
The workaround is to initialize a SparkContext even in the branch that does no processing:
package com.test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class TestSpark {

    public static void main(String[] args) {
        boolean condition = false;
        condition = Boolean.parseBoolean(args[0]);
        if (condition) {
            SparkSession sparkSession = SparkSession.builder().appName("Data Ingestion Framework")
                    .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
                    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
                    .enableHiveSupport()
                    .getOrCreate();
        } else {
            // Initialize a Spark context to avoid the failure: https://issues.apache.org/jira/browse/SPARK-10795
            JavaSparkContext sparkContext = new JavaSparkContext(new SparkConf());
            System.out.println("coming out no processing required");
        }
    }
}
I am building a Gradle Java project (please refer below) using Apache Beam code and executing it on Eclipse Oxygen.
package com.xxxx.beam;

import java.io.IOException;

import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkPipelineResult;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineRunner;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileIO.ReadableFile;

public class ApacheBeamTestProject {

    public void modelExecution() {
        SparkContextOptions options = (SparkContextOptions) PipelineOptionsFactory.create();
        options.setSparkMaster("xxxxxxxxx");
        JavaSparkContext sc = options.getProvidedSparkContext();
        JavaLinearRegressionWithSGDExample.runJavaLinearRegressionWithSGDExample(sc);

        Pipeline p = Pipeline.create(options);
        p.apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
                // withCompression can be omitted - by default compression is detected from the filename.
                .apply(FileIO.readMatches())
                .apply(MapElements
                        // uses imports from TypeDescriptors
                        .via(
                                new SimpleFunction<ReadableFile, KV<String, String>>() {
                                    private static final long serialVersionUID = -5715607038612883677L;

                                    @SuppressWarnings("unused")
                                    public KV<String, String> createKV(ReadableFile f) {
                                        String temp = null;
                                        try {
                                            temp = f.readFullyAsUTF8String();
                                        } catch (IOException e) {
                                        }
                                        return KV.of(f.getMetadata().resourceId().toString(), temp);
                                    }
                                }))
                .apply(FileIO.write());

        SparkPipelineResult result = (SparkPipelineResult) p.run();
        result.getState();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Test log");

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);
        p.apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
                // withCompression can be omitted - by default compression is detected from the filename.
                .apply(FileIO.readMatches())
                .apply(MapElements
                        // uses imports from TypeDescriptors
                        .via(
                                new SimpleFunction<ReadableFile, KV<String, String>>() {
                                    private static final long serialVersionUID = -5715607038612883677L;

                                    @SuppressWarnings("unused")
                                    public KV<String, String> createKV(ReadableFile f) {
                                        String temp = null;
                                        try {
                                            temp = f.readFullyAsUTF8String();
                                        } catch (IOException e) {
                                        }
                                        return KV.of(f.getMetadata().resourceId().toString(), temp);
                                    }
                                }))
                .apply(FileIO.write());

        p.run();
    }
}
I am observing the following error when executing this project in Eclipse.
Test log
Exception in thread "main" java.lang.IllegalArgumentException: No Runner was specified and the DirectRunner was not found on the classpath.
Specify a runner by either:
Explicitly specifying a runner by providing the 'runner' property
Adding the DirectRunner to the classpath
Calling 'PipelineOptions.setRunner(PipelineRunner)' directly
at org.apache.beam.sdk.options.PipelineOptions$DirectRunner.create(PipelineOptions.java:291)
at org.apache.beam.sdk.options.PipelineOptions$DirectRunner.create(PipelineOptions.java:281)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:591)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:532)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:155)
at org.apache.beam.sdk.options.PipelineOptionsValidator.validate(PipelineOptionsValidator.java:95)
at org.apache.beam.sdk.options.PipelineOptionsValidator.validate(PipelineOptionsValidator.java:49)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:44)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:150)
This project doesn't contain a pom.xml file; Gradle is set up to pull in all the dependencies.
I am not sure how to fix this error. Could someone advise?
It seems that you are trying to use the DirectRunner and it is not on the classpath of your application. You can supply it by adding the beam-runners-direct-java dependency to your application:
https://mvnrepository.com/artifact/org.apache.beam/beam-runners-direct-java
EDIT (answered in a comment): you are trying to run this code on Spark, but didn't specify that in the PipelineOptions. Beam by default tries to run the code on the DirectRunner, so I think this is why you get this error. Calling
options.setRunner(SparkRunner.class); before creating the pipeline sets the correct runner and fixes the issue.
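As a minimal sketch of selecting the Spark runner explicitly (assuming beam-runners-spark is on the classpath; the master URL is a placeholder):

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelectionSketch {
    public static void main(String[] args) {
        // Create options typed for the Spark runner; the master URL below is a placeholder.
        SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
        options.setRunner(SparkRunner.class);   // pick SparkRunner instead of the default DirectRunner
        options.setSparkMaster("local[*]");     // assumption: run against a local Spark master

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline as in the question ...
        p.run().waitUntilFinish();
    }
}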
Downloading the beam-runners-direct-java-x.x.x.jar and adding it to the project classpath worked for me. Please refer to the Maven repository above to download the DirectRunner jar file.
Furthermore, if you need a specific Beam runner for your project, you can pass the runner name as a program argument (e.g. --runner=DataflowRunner) and add the corresponding jar to the project classpath; see the sketch below.
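A minimal sketch of taking the runner from program arguments (e.g. --runner=SparkRunner or --runner=DataflowRunner passed in the Eclipse run configuration; the chosen runner's jar still has to be on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ArgsRunnerSketch {
    public static void main(String[] args) {
        // Beam parses --runner=... (and other standard options) from the arguments;
        // withValidation() fails fast on unknown or invalid values.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline p = Pipeline.create(options);
        // ... apply the transforms from the question here ...
        p.run().waitUntilFinish();
    }
}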
I am trying to read a JSON file using Spark in Java. The few changes I tried were:
SparkConf conf = new SparkConf().setAppName("Search").setMaster("local[*]");
DataFrame df = sqlContext.read().json("../Users/pshah/Desktop/sample.json/*");
Code:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParseData {

    public static void main(String args[]) {
        SparkConf conf = new SparkConf().setAppName("Search").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        // Create the DataFrame
        DataFrame df = sqlContext.read().json("/Users/pshah/Desktop/sample.json");

        // Show the content of the DataFrame
        df.show();
    }
}
Error:
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
I wrote the same code and hit the same problem. I had put the people.json file under the project directory src/main/resources; the reason for the error is that the program could not find the file there. After I copied people.json to the program's working directory, the program worked fine. An alternative is sketched below.
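For reference, a sketch that resolves the file from the classpath instead of the working directory (it assumes sample.json sits under src/main/resources and ends up as a plain file on the runtime classpath, e.g. under target/classes, not inside a jar; the class name is hypothetical):

import java.net.URL;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParseDataFromClasspath {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Search").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Resolve the JSON file from the classpath rather than relying on the working directory.
        URL resource = ParseDataFromClasspath.class.getResource("/sample.json");
        if (resource == null) {
            throw new IllegalStateException("sample.json not found on the classpath");
        }

        DataFrame df = sqlContext.read().json(resource.getPath());
        df.show();

        sc.stop();
    }
}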
I am using Spark 1.5 on Windows. I haven't installed any separate binaries of Hadoop.
I am running a master and a single worker.
It's a simple HelloWorld program, as below:
package com.java.spark;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class HelloWorld implements Serializable {

    /**
     *
     */
    private static final long serialVersionUID = -7926281781224763077L;

    public static void main(String[] args) {
        // Local mode
        //SparkConf sparkConf = new SparkConf().setAppName("HelloWorld").setMaster("local");

        SparkConf sparkConf = new SparkConf().setAppName("HelloWorld").setMaster("spark://192.168.1.106:7077")
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "file:///D:/SparkEventLogsHistory");
        //.set("spark.eventLog.dir", "/work/");
        //tried many combinations above but all gives error.

        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        String[] arr = new String[] { "John", "Paul", "Gavin", "Rahul", "Angel" };
        List<String> inputList = Arrays.asList(arr);
        JavaRDD<String> inputRDD = ctx.parallelize(inputList);
        inputRDD.foreach(new VoidFunction<String>() {
            public void call(String input) throws Exception {
                System.out.println(input);
            }
        });
    }
}
The exception I am getting is:
Exception in thread "main" java.io.IOException: Cannot run program "cygpath": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:206)
at org.apache.hadoop.util.Shell.run(Shell.java:188)
at org.apache.hadoop.fs.FileUtil$CygPathCommand.<init>(FileUtil.java:412)
at org.apache.hadoop.fs.FileUtil.makeShellPath(FileUtil.java:438)
at org.apache.hadoop.fs.FileUtil.makeShellPath(FileUtil.java:465)
at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:592)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:584)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:420)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:130)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at com.java.spark.HelloWorld.main(HelloWorld.java:28)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(Unknown Source)
at java.lang.ProcessImpl.start(Unknown Source)
... 13 more
16/04/01 20:13:24 INFO ShutdownHookManager: Shutdown hook called
Does anyone have any idea how to resolve this exception, so that Spark can pick up the event logs from the local directory?
If I don't configure eventLog.dir at all, the exception changes to:
Exception in thread "main" java.io.FileNotFoundException: File file:/H:/tmp/spark-events does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:468)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:373)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:100)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at com.java.spark.HelloWorld.main(HelloWorld.java:28)
I have a Java client program that creates a directory, but when I execute the program it creates the directory on my local machine, even though I have configured fs.defaultFS to the VM URL that matches core-site.xml.
Here is the sample program that creates the directory:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Mkdir {

    public static void main(String ar[]) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://testing:8020");

        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path("/user/newuser");
        fileSystem.mkdirs(path);
        fileSystem.close();
    }
}
Add these two files in your code:
Configuration conf = new Configuration();
conf.addResource(new Path("/home/user17/BigData/hadoop/core-site.xml"));
conf.addResource(new Path("/home/user17/BigData/hadoop/hdfs-site.xml"));
FileSystem fileSystem = FileSystem.get(conf);
Give the paths according to your system.
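Putting it together, a minimal sketch of the Mkdir program with the configuration files loaded (the /home/user17/... paths are the example locations from above; substitute your own):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Mkdir {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Load the cluster configuration so the client talks to HDFS, not the local file system.
        // These paths are placeholders; use the location of your own config files.
        conf.addResource(new Path("/home/user17/BigData/hadoop/core-site.xml"));
        conf.addResource(new Path("/home/user17/BigData/hadoop/hdfs-site.xml"));

        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path("/user/newuser");
        fileSystem.mkdirs(path);
        fileSystem.close();
    }
}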