Yarn Distributed cache, no mapper/reducer - java

I am unable to access files in the distributed cache in Hadoop 2.6. Below is a code snippet. I am attempting to place a file, pattern.properties (whose path is passed in args[0]), into the YARN distributed cache:
Configuration conf1 = new Configuration();
Job job = Job.getInstance(conf1);
DistributedCache.addCacheFile(new URI(args[0]), conf1);
I am then trying to access the file from the cache using the code below:
Context context =null;
URI[] cacheFiles = context.getCacheFiles(); //Error at this line
System.out.println(cacheFiles);
But I am getting the below error at the line mentioned above:
java.lang.NullPointerException
I am not using a Mapper class; this is just Spark Streaming code that needs to access a file on the cluster. I want the file to be distributed across the cluster, but I can't take it from HDFS.

I don't know whether I understood your question correctly.
We had some local files which we needed to access in Spark Streaming jobs.
We used this option:
time spark-submit --files /user/dirLoc/log4j.properties#log4j.properties 'rest other options'
Another way we tried was SparkContext.addFile().
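For example, a file shipped with --files (or added via SparkContext.addFile()) gets copied to every node and can be resolved through SparkFiles. A minimal sketch in Java, reusing the log4j.properties path from the command above (the app name is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("cache-file-demo");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Ship the file to every node; this is what --files does at submit time.
jsc.addFile("/user/dirLoc/log4j.properties");

// Resolve the node-local copy by file name, on the driver or inside executor code.
String localPath = SparkFiles.get("log4j.properties");
System.out.println("log4j.properties is available locally at: " + localPath);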

Related

Copy file from HDFS to local directory from within Java Code when run using spark-submit in cluster mode

I am working on a Java program in which some code generates a file and stores it at an HDFS path. I then need to bring that file onto local machine storage/NAS and store it there. I am using the code below for this:
Configuration hadoopConf = new Configuration();
FileSystem hdfs = FileSystem.get(hadoopConf);
Path srcPath = new Path("/some/hdfs/path/someFile.csv");
Path destPath = new Path("file:///data/output/files/");
hdfs.copyToLocalFile(false, srcPath, destPath, false);
This gives me below error:
java.io.IOException: Mkdirs failed to create file:/data/output (exists=false, cwd=file:/data7/yarn/some/other/path)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:926)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:907)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:368)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
...
Below is the command used to run the java application
spark-submit --master yarn --deploy-mode cluster ..............
I am new to Spark/Hadoop, but from a couple of other questions on SO and around the web, it seems that since this runs in cluster mode, any machine in the cluster can act as the driver, and FileSystem.copyToLocalFile will write to whichever machine happens to be the driver.
Any suggestions on how I can bring that csv file to local machine would be appreciated.
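That matches the stack trace: in cluster mode the driver runs on an arbitrary YARN node, so the file:/// destination is resolved against that node's local filesystem (note the cwd file:/data7/yarn/... in the error), where /data/output may not exist or be writable for the YARN user. A small diagnostic sketch, reusing the paths from the question, that logs which host actually performs the copy:

import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration hadoopConf = new Configuration();
FileSystem hdfs = FileSystem.get(hadoopConf);

// In --deploy-mode cluster this prints some worker node, not the submitting machine,
// so file:///data/output/files/ ends up on that worker's local disk.
System.out.println("Driver host: " + InetAddress.getLocalHost().getHostName());

Path srcPath = new Path("/some/hdfs/path/someFile.csv");
Path destPath = new Path("file:///data/output/files/");
hdfs.copyToLocalFile(false, srcPath, destPath, false);

One consequence is that submitting with --deploy-mode client keeps the driver on the launching machine, so a file:/// destination written from driver code resolves there instead.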

How does Spark know where the Yarn Resource Manager is running when not using spark-submit.sh?

I am quite new to Spark and I am trying to start a Spark job from inside my application (without using spark-submit.sh) in yarn-cluster mode and I am trying to figure out how the job gets to know where the Yarn ResourceManager is running.
I have done
SparkConf sConf = new SparkConf().setMaster("yarn-cluster").set("spark.driver.memory", "10g");
But what I am not able to configure is the location of the Yarn ResourceManager. Any ideas on how I go about doing it? I have a clustered setup where the Yarn RM does not run on the same machine as the application.
The properties can be found in yarn-site.xml, located in the directory pointed to by your HADOOP_CONF_DIR or YARN_CONF_DIR environment variable; these are set either at the OS level or in spark-env.sh.
In a non-HA deployment, you are looking for yarn.resourcemanager.address.
Look into Spark Launcher API - org.apache.spark.launcher Java Doc
Or read about it here - SparkLauncher — Launching Spark Applications
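Building on the pointers above, one way to tell an in-process launch where the ResourceManager lives is to hand the Launcher API a HADOOP_CONF_DIR containing the remote cluster's yarn-site.xml (where yarn.resourcemanager.address is set). A minimal sketch, assuming hypothetical paths /etc/hadoop/conf-remote and /path/to/app.jar and a hypothetical main class:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.launcher.SparkLauncher;

// Point the launcher at a directory holding the remote cluster's
// core-site.xml and yarn-site.xml (yarn.resourcemanager.address lives there).
Map<String, String> env = new HashMap<>();
env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf-remote"); // hypothetical path

Process spark = new SparkLauncher(env)
        .setAppResource("/path/to/app.jar")            // hypothetical jar
        .setMainClass("com.example.MyApp")             // hypothetical class
        .setMaster("yarn-cluster")
        .setConf(SparkLauncher.DRIVER_MEMORY, "10g")
        .launch();
spark.waitFor();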

hdfs java file system API: creating Configuration object

I was trying to create a Java program to write/read files from HDFS.
I saw some examples of the Java API, and with those the following code works for me:
Configuration mConfiguration = new Configuration();
mConfiguration.set("fs.default.name", "hdfs://NAME_NODE_IP:9000");
But my setup has to change to a Hadoop HA setup, so hard-coding the NameNode address is not possible.
I saw some examples where the paths of the configuration XMLs are provided instead, like below:
mConfiguration.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
mConfiguration.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));
This code also works when the application runs on the same system as Hadoop.
But it does not work when my application does not run on the same machine as Hadoop.
So, what approach should I take so that the system works without addressing the NameNode directly?
Any help would be appreciated.
When using the Hadoop High Availability setup, you need to set the following properties on the configuration object:
Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://nameservice1");
conf.set("fs.default.name", conf.get("fs.defaultFS"));
conf.set("dfs.nameservices","nameservice1");
conf.set("dfs.ha.namenodes.nameservice1", "namenode1,namenode2");
conf.set("dfs.namenode.rpc-address.nameservice1.namenode1","hadoopnamenode01:8020");
conf.set("dfs.namenode.rpc-address.nameservice1.namenode2", "hadoopnamenode02:8020");
conf.set("dfs.client.failover.proxy.provider.nameservice1","org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
Try it out!
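With that configuration in place, a FileSystem handle obtained from it resolves hdfs://nameservice1 through whichever NameNode is currently active. A minimal usage sketch (the /tmp path is just an example):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// FileSystem.get picks up fs.defaultFS (hdfs://nameservice1) from the conf above,
// and the failover proxy provider routes calls to the active NameNode.
FileSystem fs = FileSystem.get(conf);
System.out.println("/tmp exists? " + fs.exists(new Path("/tmp")));
fs.close();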

No FileSystem for scheme: webhdfs

I'm building a client which pushes some data into my HDFS. Because the HDFS is inside a cluster behind a firewall, I use HttpFS as a proxy to access it. The client exits with an IOException when I try to read/write to HDFS; the message is No FileSystem for scheme: webhdfs. The code is very simple:
String hdfsURI = "webhdfs://myhttpfshost:14000/";
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(new URI(hdfsURI), configuration);
It crashes on the last line. I'm building with Maven 3.0.4 and have added the hadoop-client dependency (version 2.2.0) to my project. Accessing HDFS via curl on the command line works fine.
Any ideas why this could be failing?
Similar to this question on SO, I had to add the following code prior to doing any FS activity:
configuration.set("fs.webhdfs.impl", org.apache.hadoop.hdfs.web.WebHdfsFileSystem.class.getName());
I don't know why, but there seems to be something wrong with the Maven build process... for now it works.
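Putting the two snippets together, the override just needs to happen before the FileSystem is created; a minimal sketch using the host from the question:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;

String hdfsURI = "webhdfs://myhttpfshost:14000/";
Configuration configuration = new Configuration();
// Register the webhdfs scheme explicitly before asking for the FileSystem.
configuration.set("fs.webhdfs.impl", WebHdfsFileSystem.class.getName());
FileSystem hdfs = FileSystem.get(new URI(hdfsURI), configuration);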

InvalidInputException When loading file into Hbase MapReduce

I am very new to Hadoop and MapReduce. To get started I ran the WordCount program, and it executed well. But when I try to load a CSV file into an HTable, following [Csv File][1], it throws the following error, which I do not understand:
12/09/07 05:47:31 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HadoopMaster:54310/user/hduser/csvtable
This error is really killing my time; can anyone please help me with this exception?
[1]: http://salsahpc.indiana.edu/ScienceCloud/hbase_hands_on_1.htm#shell_exercises
The problem is that your job is pointing at the path hdfs://HadoopMaster:54310/user/hduser/csvtable instead of csvtable. To fix it:
1) Add your HBase jars to the Hadoop classpath, because MapReduce is not configured to pick up the HBase jars by default.
2) Go to hadoop-env.sh, edit HADOOP_CLASSPATH, and add all your HBase jars to it (see the sketch after this list for an in-code alternative). Hope it works now.
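As an in-code alternative to editing hadoop-env.sh, HBase's TableMapReduceUtil can ship the HBase jars with the job itself; a minimal sketch, with the rest of the job setup (mapper, input path, table) omitted and the job name illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "csv-to-hbase");

// Ships the HBase jars (and their dependencies) with the job so the tasks can see them,
// instead of editing HADOOP_CLASSPATH in hadoop-env.sh.
TableMapReduceUtil.addDependencyJars(job);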
Your job is attempting to read an input file from:
hdfs://HadoopMaster:54310/user/hduser/csvtable
You should verify that this file exists on HDFS using the Hadoop shell tools:
hadoop fs -ls /user/hduser/csvtable
My guess is that your file hasn't been loaded onto HDFS.
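The same check can also be done from Java if that is more convenient; a small sketch assuming the cluster's Hadoop configuration is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
Path input = new Path("/user/hduser/csvtable");
// An InvalidInputException at job submission usually means this prints false.
System.out.println(input + " exists on HDFS? " + fs.exists(input));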
