Deleting a filesystem from AWS - Hadoop - Java

I'm trying to launch an Amazon AWS EMR JAR MapReduce job, but I get the following exception:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3://bi/stuff already exists
In Hadoop I would enter a command like:
hadoop fs -rmr /bi
The thing is that I haven't found a similar command in the AWS command line yet.
So can somebody please tell me how to delete the Hadoop filesystem output in the Amazon S3 cloud?

From the AWS doc:
aws s3 rb s3://bucket-name
However, why don't you implement it in your jar via the AWS S3Client library?
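If you go the programmatic route, here is a minimal sketch using the AWS SDK for Java v1 S3 client; the bucket (bi) and prefix (stuff/) come from the question, and the credential setup is assumed to use the default provider chain:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class DeleteOutputPrefix {
    public static void main(String[] args) {
        // Credentials are picked up from the default provider chain (env vars, ~/.aws, instance profile).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "bi";
        String prefix = "stuff/";
        // S3 has no real directories, so delete every object under the output prefix.
        ObjectListing listing = s3.listObjects(bucket, prefix);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                s3.deleteObject(bucket, summary.getKey());
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}
Running something like this before the EMR step starts clears the output location, so the FileAlreadyExistsException is not thrown.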

Related

How to run Apache Spark applications with a JAR dependency from AWS S3?

I have a .jar file containing useful functions for my application in an AWS S3 bucket, and I want to use it as a dependency in Spark without having to download it locally first. Is it possible to reference the .jar file directly with the spark-submit (or pyspark) --jars option?
So far, I have tried the following:
spark-shell --packages com.amazonaws:aws-java-sdk:1.12.336,org.apache.hadoop:hadoop-aws:3.3.4 --jars s3a://bucket/path/to/jar/file.jar
The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are correctly set, since when running the same command without the --jars option, other files in the same bucket are successfully read. However, if the option is added, I get the following error:
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:364)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
... 27 more
I'm using Spark 3.3.1 pre-built for Apache Hadoop 3.3 and later.
This may be because, in client mode, Spark first distributes the JARs specified in --jars via Netty during startup. To download a remote JAR from a third-party file system (i.e. S3), it needs the right dependency (i.e. hadoop-aws) on the classpath already, before it prepares the final classpath.
But since it is yet to distribute the JARs, it has not prepared the classpath, so when it tries to download the JAR from S3 it fails with ClassNotFoundException (hadoop-aws is not yet on the classpath). Doing the same in the application code succeeds, because by that time the classpath has been resolved.
In other words, downloading the dependency depends on a library that is only loaded later.
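As a small illustration of the "works in application code" point, this hedged Java sketch adds the S3-hosted JAR after the session is up, by which time hadoop-aws is already on the resolved classpath (the bucket path is the placeholder one from the question):
import org.apache.spark.sql.SparkSession;

public class AddJarFromS3 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("add-jar-from-s3")
                .getOrCreate();
        // At this point the driver classpath (including hadoop-aws pulled in via --packages)
        // has been resolved, so the s3a filesystem class can be found when fetching the JAR.
        // Note: addJar ships the JAR to executors for task execution; it does not add
        // classes to the already-running driver.
        spark.sparkContext().addJar("s3a://bucket/path/to/jar/file.jar");
    }
}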
To run Apache Spark applications with a JAR dependency from Amazon S3, you can use the --jars command-line option to specify the S3 URL of the JAR file when submitting the Spark application.
For example, if your JAR file is stored in the my-bucket S3 bucket at the jars/my-jar.jar path, you can submit the Spark application as follows:
spark-submit --jars s3a://my-bucket/jars/my-jar.jar \
--class com.example.MySparkApp \
s3a://my-bucket/my-spark-app.jar
This will download the JAR file from S3 and add it to the classpath of the Spark application.
Note that you will need to include the s3a:// prefix in the S3 URL to use the s3a filesystem connector, which is the recommended connector for reading from and writing to S3. You may also need to configure the fs.s3a.access.key and fs.s3a.secret.key properties with your AWS access key and secret key in order to authenticate the connection to S3.
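If you would rather set those s3a credentials programmatically instead of via environment variables, here is a minimal sketch; the key values and the input path are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class S3ACredentialsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("s3a-credentials-example")
                .getOrCreate();
        // Placeholder credentials; in practice read them from the environment or an instance profile.
        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>");
        hadoopConf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>");
        // Hypothetical input path, just to show the s3a connector being exercised.
        spark.read().text("s3a://my-bucket/some/input.txt").show();
    }
}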

How do I avoid java.lang.NoClassDefFoundError when submitting Spark job on EMR cluster?

I have a Spark application that runs successfully on my local machine. I use an HBase Docker container, from which I load the data into my Spark app. Now I have created an EMR cluster with Spark and HBase installed. But when I try to submit my JAR file I get the following exception:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
When running my app locally, I was able to avoid this kind of error by adding the --jars flag to spark-submit, giving Spark the path to all the HBase JARs.
How can I overcome this error when running on EMR?
Should I point Spark to the HBase JARs on EMR as well? Where are those JARs located on the EMR cluster?
Configuration hBaseConf = HBaseConfiguration.create();
hBaseConf.set(TableInputFormat.INPUT_TABLE, "MyTable");
JavaRDD<String> myStrings = sparkContext.newAPIHadoopRDD(
        hBaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class)
        .keys()
        .map(key -> {
            String from = Bytes.toString(key.get());
            return from;
        });
...
I was able to locate the JARs from the EMR shell with the hbase classpath command. I then took the HBase JAR paths and added them to spark-submit with the --jars flag.

java.io.IOException: No FileSystem for scheme: abfs for ADLS Gen2 in Spark Java

I am trying to access ADLS Gen2 in Spark Java with the following configuration properties:
fs.azure.account.auth.type
fs.azure.account.oauth.provider.type
fs.azure.account.oauth2.client.endpoint
fs.azure.account.oauth2.client.id
fs.azure.account.oauth2.client.secret
I have created the blob container and uploaded the file, e.g. https://devbdstreamsv2.dfs.core.windows.net/gen2container/adlsgen2/flat.json, using the "Azure Storage Explorer" software, version 1.9. I am trying to access the abfs file path in the form described in the documentation: abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/
But my doubt is that we are not initialising the abfs file path anywhere in the runner code, so I am getting the exception "No FileSystem for scheme: abfs". How can I resolve this issue? I want to know how to initialise the abfs filesystem for ADLS Gen2 using Spark Java.
You need a distribution of Spark which has the abfs connector in the hadoop-azure JAR. The hadoop-2.7.x JARs in the normal ASF releases do not, as abfs came out later (2.9+)
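Once you are on a Spark build whose hadoop-azure JAR includes the abfs connector, the properties listed in the question can be set on the Hadoop configuration before the first read. A hedged sketch follows, where the tenant, client id and secret are placeholders and the container/account are taken from the question:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class AbfsOAuthExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("abfs-oauth-example")
                .getOrCreate();
        Configuration conf = spark.sparkContext().hadoopConfiguration();
        // Placeholder OAuth values; substitute your own tenant, client id and secret.
        conf.set("fs.azure.account.auth.type", "OAuth");
        conf.set("fs.azure.account.oauth.provider.type",
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
        conf.set("fs.azure.account.oauth2.client.endpoint",
                "https://login.microsoftonline.com/<tenant-id>/oauth2/token");
        conf.set("fs.azure.account.oauth2.client.id", "<client-id>");
        conf.set("fs.azure.account.oauth2.client.secret", "<client-secret>");
        // The abfs filesystem is initialised lazily the first time an abfs(s):// path is read.
        spark.read()
                .json("abfss://gen2container@devbdstreamsv2.dfs.core.windows.net/adlsgen2/flat.json")
                .show();
    }
}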

Azure specific reading files from local on spark

I am struggling with Azure wasb on Spark.
I am reading a .json.gz file from disk and loading it into HDFS. I have used the following code extensively on other systems:
val file_a_raw = sqlContext.read.json("/home/users/repo_test/file_a.json.gz")
However, on Azure, this returns:
java.io.FileNotFoundException: Filewasb://server-2017-03-07t08-13-41-314z#server.blob.core.windows.net/home/users/repo_test/file_a.json.gz does not exist.
I have checked this location and the file is there and correct.
I think there should be a : between .net and the file path, but I get a Java error when trying to manually add that in:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme name at index 0:
I've also tried:
Filewasb:///home/users/repo_test/file_a.json.gz
But that returns:
java.io.IOException: No FileSystem for scheme: Filewasb
This code works fine on non-Azure Spark.
For Azure, you'll need to configure Spark with the proper credentials. Databricks has documentation on this: https://docs.databricks.com/user-guide/faq/azure-blob-storage.html
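Concretely, the wasb connector needs the storage account key on the Hadoop configuration before the path is read. Here is a minimal Java sketch with placeholder account, container and key values (the question's snippet is Scala, but the configuration key is the same):
import org.apache.spark.sql.SparkSession;

public class WasbReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wasb-read-example")
                .getOrCreate();
        // Placeholder storage account and key; substitute your own.
        spark.sparkContext().hadoopConfiguration().set(
                "fs.azure.account.key.<storage-account>.blob.core.windows.net",
                "<storage-account-key>");
        // Fully qualified wasb path: wasb://<container>@<storage-account>.blob.core.windows.net/<path>
        spark.read()
                .json("wasb://<container>@<storage-account>.blob.core.windows.net/home/users/repo_test/file_a.json.gz")
                .show();
    }
}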

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

So I've installed the Hadoop file system on my machine and I'm using a Maven dependency to provide the Spark environment to my code (spark-mllib_2.10).
Now, my code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
With the following properties set:
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I am starting the DFS nodes externally through the command:
start-dfs.sh
Now, my code works perfectly fine when running from Eclipse. But if I export the whole project as an executable JAR, it gives me the following exception:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked different solutions given online for this issue, where people suggest adding the following (a fuller sketch of the programmatic variant appears after these snippets):
hadoopConfig.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
);
hadoopConfig.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName()
);
OR
<property>
<name>fs.file.impl</name>
<value>org.apache.hadoop.fs.LocalFileSystem</value>
<description>The FileSystem for file: uris.</description>
</property>
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
<description>The FileSystem for hdfs: uris.</description>
</property>
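For reference, a fuller sketch of how that programmatic variant would be wired into the code from the question; the HDFS address and data path are the ones given above, while the class name and job wiring are assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class HdfsSchemeExample {
    public static void main(String[] args) {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("hdfs-scheme-example"));
        // Register the filesystem implementations explicitly, so the executable JAR does not
        // depend on the merged META-INF/services/org.apache.hadoop.fs.FileSystem entries being intact.
        Configuration hadoopConfig = jsc.hadoopConfiguration();
        hadoopConfig.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        hadoopConfig.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName());
        // Same base path and data path as in the question.
        String finalData = "hdfs://127.0.0.1:54310" + "/data/finalInput.txt";
        JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
        System.out.println("Loaded " + data.count() + " labeled points");
    }
}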
But I don't use any Hadoop context or Hadoop config in my project; I simply load the data from Hadoop using the URL.
Can someone give an answer relevant to this issue?
Please note that this works totally fine from Eclipse, and only fails if I export the same project as an executable JAR.
Update
As suggested in the comment and from the solutions found online, I tried two things.
Added dependencies to my pom.xml for the hadoop-core, hadoop-hdfs and hadoop-client libraries.
Added the above properties configuration to Hadoop's core-site.xml as suggested here: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But still no luck in getting the error resolved. It gives the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop there the same way I did on my machine, using the link mentioned above.
