HDFS Java file system API: creating a Configuration object

I was trying to create a Java program to write/read files from HDFS.
I saw some examples of the Java API, and with those, the following code works for me.
Configuration mConfiguration = new Configuration();
mConfiguration.set("fs.default.name", "hdfs://NAME_NODE_IP:9000");
But my setup has to change to a Hadoop HA setup, so hardcoding the NameNode address is not possible.
I saw some examples in which we provide the paths of the configuration XMLs, like below.
mConfiguration.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
mConfiguration.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));
This code also works when the application runs on the same system as Hadoop,
but it does not work when my application is not running on the same machine as Hadoop.
So, what approach should I take so that the system works without addressing the NameNode directly?
Any help would be appreciated.

When using Hadoop High Availability, you need to set the following properties in the configuration object:
Configuration conf = new Configuration(false);
conf.set("fs.defaultFS", "hdfs://nameservice1");
conf.set("fs.default.name", conf.get("fs.defaultFS"));
conf.set("dfs.nameservices","nameservice1");
conf.set("dfs.ha.namenodes.nameservice1", "namenode1,namenode2");
conf.set("dfs.namenode.rpc-address.nameservice1.namenode1","hadoopnamenode01:8020");
conf.set("dfs.namenode.rpc-address.nameservice1.namenode2", "hadoopnamenode02:8020");
conf.set("dfs.client.failover.proxy.provider.nameservice1","org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
Try it out!

Related

Error while trying to write on parquet file in datastage 11.7 (File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem)

We have recently upgraded DataStage from 9.1 to 11.7 on an AIX 7.1 server,
and I'm trying to use the new "File Connector" stage to write to a Parquet file. I created a simple job that reads from Teradata as the source and writes to a Parquet file as the target.
Image of the job
But I am facing the error below:
> File_Connector_20,0: java.lang.NoClassDefFoundError: org.apache.hadoop.fs.FileSystem
at java.lang.J9VMInternals.prepareClassImpl (J9VMInternals.java)
at java.lang.J9VMInternals.prepare (J9VMInternals.java: 304)
at java.lang.Class.getConstructor (Class.java: 594)
at com.ibm.iis.jis.utilities.dochandler.impl.OutputBuilder.<init> (OutputBuilder.java: 80)
at com.ibm.iis.jis.utilities.dochandler.impl.Registrar.getBuilder (Registrar.java: 340)
at com.ibm.iis.jis.utilities.dochandler.impl.Registrar.getBuilder (Registrar.java: 302)
at com.ibm.iis.cc.filesystem.FileSystem.getBuilder (FileSystem.java: 2586)
at com.ibm.iis.cc.filesystem.FileSystem.writeFile (FileSystem.java: 1063)
at com.ibm.iis.cc.filesystem.FileSystem.process (FileSystem.java: 935)
at com.ibm.is.cc.javastage.connector.CC_JavaAdapter.run (CC_JavaAdapter.java: 444)
I followed the steps in the link below:
https://www.ibm.com/support/knowledgecenter/SSZJPZ_11.7.0/com.ibm.swg.im.iis.conn.s3.usage.doc/topics/amaze_file_formats.html
1. I uploaded the jar files into "/ds9/IBM/InformationServer/Server/DSComponents/jars".
2. I added them to the CLASSPATH in agent.sh and then restarted DataStage.
3. I set the environment variable CC_USE_LATEST_FILECC_JARS to the value parquet-1.9.0.jar:orc-2.1.jar.
I also tried adding the CLASSPATH as an environment variable in the job, but that did not work.
Note that I am using Local as the File System mode.
Any hint is appreciated, as I have been searching for quite some time.
Thanks in advance,
Which File System mode are you using? If you are using Native HDFS as the File System mode, then you need to configure the CLASSPATH to include some third-party jars.
Perhaps these links will provide some guidance.
https://www.ibm.com/support/pages/node/301847
https://www.ibm.com/support/pages/steps-required-configure-file-connector-use-parquet-or-orc-file-format
Note: depending on the Hadoop distribution and version you are using, the versions of the jars may differ.
If the above information does not help resolve the issue, you may have to reach out to IBM Support to get this addressed.
To use the File Connector, there is no need to add the CLASSPATH in agent.sh unless you want to import HDFS files from IMAM.
If your requirement is reading Parquet files, then set
$CC_USE_LATEST_FILECC_JARS=parquet-1.9.0.jar
$FILECC_PARQUET_AVRO_COMPAT_MODE=TRUE
If you are still seeing the issue, run the job with $CC_MSG_LEVEL=2 and open an IBM support case, including the job design, the full job log, and the Version.xml file from the Engine tier.

Yarn Distributed cache, no mapper/reducer

I am unable to access files in the distributed cache in Hadoop 2.6. Below is a code snippet. I am attempting to place a file, pattern.properties (whose path is in args[0]), into the YARN distributed cache:
Configuration conf1 = new Configuration();
Job job = Job.getInstance(conf1);
DistributedCache.addCacheFile(new URI(args[0]), conf1);
Also, I am trying to access the file in the cache using the code below:
Context context = null;
URI[] cacheFiles = context.getCacheFiles(); //Error at this line
System.out.println(cacheFiles);
But I am getting the following error at the line indicated above:
java.lang.NullPointerException
I am not using a Mapper class. It's just Spark Streaming code that needs to access a file on the cluster. I want the file to be distributed across the cluster, but I can't take it from HDFS.
I don't know whether I understood your question correctly.
We had some local files which we needed to access in Spark Streaming jobs.
We used this option:
time spark-submit --files
/user/dirLoc/log4j.properties#log4j.properties 'rest other options'
Another way we tried was SparkContext.addFile().
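A minimal sketch of the SparkContext.addFile() approach in Java (the file path is illustrative; SparkFiles.get() resolves the node-local copy by file name):
import java.io.File;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("addFile-example");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Ship the properties file to every node that runs tasks for this job.
        jsc.addFile("/user/dirLoc/pattern.properties"); // illustrative path

        // On the driver or inside tasks, resolve the local copy by file name.
        File localCopy = new File(SparkFiles.get("pattern.properties"));
        System.out.println("Cached copy available at: " + localCopy.getAbsolutePath());

        jsc.stop();
    }
}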

How does Spark know where the Yarn Resource Manager is running when not using spark-submit.sh?

I am quite new to Spark, and I am trying to start a Spark job from inside my application (without using spark-submit.sh) in yarn-cluster mode. I am trying to figure out how the job knows where the YARN ResourceManager is running.
I have done
SparkConf sConf = new SparkConf().setMaster("yarn-cluster").set("spark.driver.memory", "10g");
But what I am not able to configure is the location of the YARN ResourceManager. Any ideas on how to go about it? I have a clustered setup where the YARN RM does not run on the same machine as the application.
The properties can be found in yarn-site.xml, located in the directory referenced by your HADOOP_CONF_DIR or YARN_CONF_DIR environment variable, which is set either at the OS level or in spark-env.sh.
In a non-HA deployment, you are looking for yarn.resourcemanager.address.
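For reference, the relevant entry in yarn-site.xml looks roughly like this (the host name is illustrative; 8032 is the usual default port):
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager-host.example.com:8032</value>
</property>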
Look into Spark Launcher API - org.apache.spark.launcher Java Doc
Or read about it here - SparkLauncher — Launching Spark Applications
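A minimal sketch of the SparkLauncher approach (the jar path, main class, and HADOOP_CONF_DIR value are illustrative; the directory must contain the cluster's yarn-site.xml so the YARN client can locate the ResourceManager; newer Spark versions use master "yarn" with deploy mode "cluster" instead of "yarn-cluster"):
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchOnYarn {
    public static void main(String[] args) throws Exception {
        // Point the launcher at the client-side Hadoop configuration directory.
        Map<String, String> env = new HashMap<>();
        env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf"); // illustrative path

        SparkAppHandle handle = new SparkLauncher(env)
                .setAppResource("/path/to/my-spark-app.jar") // illustrative
                .setMainClass("com.example.MySparkJob")      // illustrative
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setConf("spark.driver.memory", "10g")
                .startApplication();

        // Block until YARN reports a terminal state for the application.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}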

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

So I've installed the Hadoop file system on my machine, and I'm using a Maven dependency (spark-mllib_2.10) to provide the Spark environment for my code.
My code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
With the following properties set:
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I start the DFS nodes externally with the command
start-dfs.sh
Now, my code works perfectly fine when running from Eclipse, but if I export the whole project as an executable jar, it gives me the following exception.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked different solutions posted online for this issue, where people suggest adding the following:
hadoopConfig.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
);
hadoopConfig.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName()
);
OR
<property>
<name>fs.file.impl</name>
<value>org.apache.hadoop.fs.LocalFileSystem</value>
<description>The FileSystem for file: uris.</description>
</property>
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
<description>The FileSystem for hdfs: uris.</description>
</property>
But I don't use any Hadoop context or Hadoop configuration in my project; I simply load the data from Hadoop using the URL.
Can someone give an answer relevant to this issue?
Please note that this works completely fine from Eclipse, and only fails when I export the same project as an executable jar.
Update
As suggested in the comments and in the solutions found online, I tried two things.
Added dependencies to my pom.xml for the hadoop-core, hadoop-hdfs, and hadoop-client libraries.
Added the above properties configuration to Hadoop's core-site.xml, as suggested here: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But I still have had no luck getting the error resolved. I get the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop the same way I did on my machine, using the link mentioned above.

No FileSystem for scheme: webhdfs

I'm building a client which pushes some data into my HDFS. Because the HDFS is inside a cluster behind a firewall, I use HttpFS as a proxy to access it. The client exits with an IOException when I try to read from/write to HDFS. The message is No FileSystem for scheme: webhdfs. The code is very simple:
String hdfsURI = "webhdfs://myhttpfshost:14000/";
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(new URI(hdfsURI), configuration);
It crashes on the last line. I'm building with Maven 3.0.4 and have added the hadoop-client 2.2.0 dependency to my project. Accessing HDFS via curl on the command line works fine.
Any ideas why this could be failing?
Similar to this question on SO, I had to add the following code prior to doing any FS activities:
configuration.set("fs.webhdfs.impl", org.apache.hadoop.hdfs.web.WebHdfsFileSystem.class.getName());
I don't know why, but there seems to be something wrong with the Maven build process... for now it works.
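In context, a minimal sketch of the workaround (reusing the URI from the question; hadoop-hdfs must be on the classpath so that WebHdfsFileSystem is available):
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WebHdfsClient {
    public static void main(String[] args) throws Exception {
        String hdfsURI = "webhdfs://myhttpfshost:14000/";
        Configuration configuration = new Configuration();
        // Register the webhdfs implementation explicitly before any FileSystem call.
        configuration.set("fs.webhdfs.impl",
                org.apache.hadoop.hdfs.web.WebHdfsFileSystem.class.getName());

        FileSystem hdfs = FileSystem.get(new URI(hdfsURI), configuration);
        System.out.println("Home directory: " + hdfs.getHomeDirectory());
        hdfs.close();
    }
}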
