How to read a CSV file from GCS using spark-java?

I am trying to read a CSV file stored in GCS using Spark.
I have a simple Spark Java project that does nothing but read a CSV.
The following code is used in it:
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Hello world");
SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> dataset = sparkSession.read()
        .option("header", true)
        .option("sep", ",")
        .option("delimiter", "\"")
        .csv("gs://abc/WDC_age.csv");
but it throws an error which says:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: gs
Can anyone help me with this?
I just want to read a CSV from GCS using Spark.
Thanks In Advance :)

In my case, I just added the following dependency to my pom.xml file:
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop3-2.2.4</version>
</dependency>
and it worked for me.

No FileSystem for scheme: gs indicates that Spark couldn't find the GCS connector. I guess you are not running on a Dataproc cluster, so you might need to install the connector yourself:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
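Once the connector jar is on the classpath (e.g. via the pom.xml dependency above), Spark usually also needs the Hadoop configuration to map the gs:// scheme to the connector and to find credentials. A minimal sketch, assuming a service-account key file at a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("Hello world");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

// Map the gs:// scheme to the GCS connector's FileSystem implementations
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
hadoopConf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");

// Authenticate with a service-account key (the path is a placeholder, not from the question)
hadoopConf.set("google.cloud.auth.service.account.enable", "true");
hadoopConf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json");

Dataset<Row> dataset = spark.read().option("header", true).csv("gs://abc/WDC_age.csv");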

Related

Yarn Distributed cache, no mapper/reducer

I am unable to access files in the distributed cache in Hadoop 2.6. Below is a code snippet. I am attempting to place a file pattern.properties, whose path is in args[0], into the Yarn distributed cache:
Configuration conf1 = new Configuration();
Job job = Job.getInstance(conf1);
DistributedCache.addCacheFile(new URI(args[0]), conf1);
Also, I am trying to access the file in the cache using the code below:
Context context = null;
URI[] cacheFiles = context.getCacheFiles(); //Error at this line
System.out.println(cacheFiles);
But I am getting the below error at the line mentioned above:
java.lang.NullPointerException
I am not using a Mapper class. It's just Spark Streaming code that accesses a file on the cluster. I want the file to be distributed across the cluster, but I can't take it from HDFS.
I don't know whether I understood your question correctly.
We had some local files that we needed to access in Spark Streaming jobs. We used this option:
time spark-submit --files /user/dirLoc/log4j.properties#log4j.properties 'rest other options'
Another way we tried was SparkContext.addFile().
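For the SparkContext.addFile() route, a minimal sketch (the pattern.properties path comes from args[0] as in the question; the property loading is just an illustration):

import java.io.FileInputStream;
import java.util.Properties;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("file-distribution-demo"));

// Ship the file to every node in the cluster (works for local or HDFS paths)
jsc.addFile(args[0]); // e.g. /some/dir/pattern.properties

// On the driver or inside an executor task, resolve the local copy by file name
String localPath = SparkFiles.get("pattern.properties");
Properties props = new Properties();
props.load(new FileInputStream(localPath));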

How to append data to existing file in HDFS using Java API?

I am using Hadoop version 2.6.0 and trying to append data to an existing file in HDFS, but it doesn't seem to work for me. Here's my method for writing into HDFS using the FileSystem append function:
HdfsIO hdfsIO = new HdfsIO(hdfsCoreSite,hdfsSite);
FileSystem fs = FileSystem.get(hdfsIO.getConfiguration());
FSDataOutputStream out = fs.append(new Path("/test_dir_10/append_data_to_this_file.txt"));
out.writeUTF("Append demo...");
fs.close();
The write() and create() functions work well, but the append() function does not.
I got this error:
Failed to close file /test_dir_10/append_data_to_this_file.txt. Lease
recovery is in progress. Try again later.
I also added this property to hdfs-site.xml:
<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>
So does anyone have an idea what I'm missing or doing wrong?
Thanks.
The problem was solved by running the following on the file system:
hadoop dfs -setrep -R -w 2 /
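Independent of the replication fix, the output stream should be closed before the FileSystem; otherwise the lease on the file may not be released, and a later append can fail with the lease-recovery error. A minimal sketch of the append call (the NameNode address is an assumption; the file path is the one from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://localhost:54310"); // assumed NameNode address
conf.setBoolean("dfs.support.append", true);

FileSystem fs = FileSystem.get(conf);
Path file = new Path("/test_dir_10/append_data_to_this_file.txt");

FSDataOutputStream out = fs.append(file);
out.writeUTF("Append demo...");
out.close(); // close the stream first so the lease on the file is released
fs.close();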

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

So I've installed the Hadoop file system on my machine and I'm using a Maven dependency to provide the Spark environment for my code (spark-mllib_2.10).
Now, my code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
With the following properties set:
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I am starting the DFS nodes externally through the command:
start-dfs.sh
Now, my code works perfectly fine when running from Eclipse. But if I export the whole code to an executable jar, it gives me the following exception:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked different solutions given online for this issue, where people suggest adding the following:
hadoopConfig.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
);
hadoopConfig.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName()
);
OR
<property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
    <description>The FileSystem for file: uris.</description>
</property>
<property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
</property>
But I don't use any Hadoop context or Hadoop config in my project; I simply load the data from Hadoop using the URL.
Can someone give an answer relevant to this issue?
Please mind that this works totally fine from Eclipse, and only fails when I export the same project as an executable jar.
Update
As suggested in the comments and in the solutions found online, I tried two things:
1) Added dependencies to my pom.xml for the hadoop-core, hadoop-hdfs and hadoop-client libraries.
2) Added the above properties configuration to Hadoop's core-site.xml, as suggested here: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But still no luck in getting the error resolved. It gives the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop there the same way I did on my machine, using the link mentioned above.
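For reference, a minimal sketch of applying those fs.*.impl settings through Spark's Hadoop configuration rather than a separate Hadoop config object (jsc is assumed to be the JavaSparkContext from the question). In a fat jar this error is often caused by the META-INF/services/org.apache.hadoop.fs.FileSystem entries of hadoop-common and hadoop-hdfs overwriting each other, which these explicit settings work around:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

// Register the FileSystem implementations on the SparkContext's Hadoop configuration
Configuration hadoopConf = jsc.hadoopConfiguration();
hadoopConf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
hadoopConf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();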

Deleting filesystem from AWS- Hadoop

I'm trying to launch an Amazon AWS EMR JAR MapReduce job. However, I get the exception:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3://bi/stuff already exists
In Hadoop I would enter a command like:
hadoop fs -rmr /bi
The thing is that I haven't found a similar command in the AWS command line yet.
So can somebody please tell me how to delete the Hadoop filesystem data in the Amazon S3 cloud?
From the AWS doc:
aws s3 rb s3://bucket-name
However, why don't you implement it in your jar via the AWS S3Client library?
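Following that suggestion, a rough sketch with the AWS SDK for Java (v1) that removes the objects under the output prefix before the job runs; the bucket and prefix come from the exception in the question:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Delete every object under s3://bi/stuff so the job can recreate its output directory
ObjectListing listing = s3.listObjects("bi", "stuff/");
while (true) {
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        s3.deleteObject("bi", summary.getKey());
    }
    if (!listing.isTruncated()) {
        break;
    }
    listing = s3.listNextBatchOfObjects(listing);
}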

InvalidInputException When loading file into Hbase MapReduce

I am very new to Hadoop and MapReduce. To start with, I executed the word count program, and it executed well. But when I try to load a CSV file into an HTable, following [this CSV file guide](http://salsahpc.indiana.edu/ScienceCloud/hbase_hands_on_1.htm#shell_exercises), it throws me the following error, which I am not aware of. Please can anyone help me understand it?
12/09/07 05:47:31 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HadoopMaster:54310/user/hduser/csvtable
This error is really killing my time, please can anyone help me with this exception?
The reason why you are directed to the path hdfs://HadoopMaster:54310/user/hduser/csvtable instead of csvtable is that the HBase jars are missing from the Hadoop classpath:
1) Add your HBase jars to the Hadoop classpath, because MapReduce is not configured for the HBase jars by default.
2) Go to hadoop-env.sh, edit HADOOP_CLASSPATH, and add all your HBase jars to it. Hope it works now.
Your job is attempting to read an input file from:
hdfs://HadoopMaster:54310/user/hduser/csvtable
You should verify that this file exists on HDFS using the Hadoop shell tools:
hadoop fs -ls /user/hduser/csvtable
My guess is that your file hasn't been loaded onto HDFS.
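If you prefer to check from code, a small sketch that verifies the input path before submitting the job (the NameNode address and path are taken from the error above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://HadoopMaster:54310");

FileSystem fs = FileSystem.get(conf);
Path input = new Path("/user/hduser/csvtable");
if (!fs.exists(input)) {
    System.err.println("Input path does not exist: " + input);
}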
