I'm trying to access my HDFS using Java code but I can't get it working... after 2 days of struggling I think it's time to ask for help.
This is my code:
Configuration conf = new Configuration();
conf.addResource(new Path("/HADOOP_HOME/conf/core-site.xml"));
conf.addResource(new Path("/HADOOP_HOME/conf/hdfs-site.xml"));
FileSystem hdfs = FileSystem.get(conf);
boolean success = hdfs.mkdirs(new Path("/user/cloudera/testdirectory"));
System.out.println(success);
I got this code from here and here.
Unfortunately the hdfs object is just a LocalFileSystem object, so something must be wrong. It looks like this is exactly what Rejeev wrote on his website:
[...] If you do not assign the configurations to conf object (using hadoop xml file) your HDFS operation will be performed on the local file system and not on the HDFS. [...]
With absolute paths I get the same result.
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
This is the library I'm currently using:
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
I heard that hadoop-core was split into multiple libs so I also tried the following libs:
hadoop-common-2.0.0-alpha.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I'm using Cloudera CDH 4.4.0, so Hadoop is already installed. From the console everything works fine.
For example:
hadoop fs -mkdir testdirectory
So everything should be set up correctly by default.
I hope that you guys can help me... this stuff is driving me nuts! It's extremely frustrating to fail with such a simple task.
Many thanks in advance for any help.
Try this:
conf.set("fs.defaultFS", "file:///");
conf.set("mapreduce.framework.name", "local");
1) You don't need to call conf.addResource unless you are overriding configuration variables.
2) Make sure you are building a jar file and running it from the command line, not from Eclipse.
If you execute it in Eclipse, it will run against the local file system.
3) I ran the code below and it worked.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Hmkdirs {
    public static void main(String[] args) throws IOException {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        boolean success = fs.mkdirs(new Path("/user/cloudera/testdirectory1"));
        System.out.println(success);
    }
}
4) To execute it, you need to create a jar file (either from Eclipse or from the command prompt) and run that jar file.
Command prompt jar file sample:
javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf WordCount.jar -C classes/ .
Jar file execution via hadoop at the command prompt:
hadoop jar hadoopfile.jar hadoop.sample.fileaccess.Hmkdirs
hadoop.sample.fileaccess is the package in which my class Hmkdirs exists. If your class is in the default package, you don't have to specify it; just the class name is fine.
Update: You can execute from Eclipse and still access HDFS; check the code below.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HmkdirsFromEclipse {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // addResource(Path) reads the file from the local file system;
        // addResource(String) would look it up on the classpath instead
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020/");
        conf.set("hadoop.job.ugi", "cloudera");
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        FileSystem fs = FileSystem.get(conf);
        boolean success = fs.mkdirs(new Path("/user/cloudera/testdirectory9"));
        System.out.println(success);
    }
}
This is indeed a tricky bit of configuration, but this is essentially what you need to do:
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
conf.set("fs.defaultFS", "hdfs://[your namenode]");
conf.set("hadoop.job.ugi", "[your user]");
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
Make sure you have hadoop-hdfs on your classpath, too.
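If the XML files still don't get picked up, another way to rule out classpath problems is to pass the namenode URI straight to FileSystem.get. A minimal sketch, reusing the QuickStart address shown in the update above (the host and port are placeholders; substitute your own namenode):

Configuration conf = new Configuration();
// java.net.URI; the namenode address is an assumption (CDH4 typically listens on port 8020)
FileSystem fs = FileSystem.get(URI.create("hdfs://quickstart.cloudera:8020"), conf);
// Should print org.apache.hadoop.hdfs.DistributedFileSystem rather than LocalFileSystem
System.out.println(fs.getClass().getName());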
Related
I'm going to run my MapReduce project on Hadoop 2.9.0. I'm using the xml-rpc package in my project as follows:
import org.apache.xmlrpc.*;
I put the additional jars into the lib folder, and when I run my project jar in Hadoop, it shows this error:
Error: java.lang.ClassNotFoundException: org.apache.xmlrpc.XmlRpcClient
I executed this command:
bin/hadoop jar MRV.jar SumMR /user/hadoop/input /user/hadoop/output -libjars lib/xmlrpc-2.0.1.jar: lib/commons-codec-1.10.jar
How can I execute this command without getting a ClassNotFoundException?
private static void addJarToDistributedCache(Class<?> classToAdd, Configuration conf)
        throws IOException {
    // Retrieve the jar file that contains classToAdd
    String jar = classToAdd.getProtectionDomain()
            .getCodeSource().getLocation().getPath();
    File jarFile = new File(jar);
    // Declare the new HDFS location
    Path hdfsJar = new Path("/user/hadoopi/lib/" + jarFile.getName());
    // Mount HDFS
    FileSystem hdfs = FileSystem.get(conf);
    // Copy (overwrite) the jar file to HDFS
    hdfs.copyFromLocalFile(false, true, new Path(jar), hdfsJar);
    // Add the jar to the distributed classpath
    DistributedCache.addFileToClassPath(hdfsJar, conf);
}
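With this helper in your driver you can ship the jar that actually contains the missing class before submitting the job. A minimal usage sketch, assuming the xmlrpc and commons-codec jars are on the client's own classpath (conf is your job Configuration):

// Ships the jar containing XmlRpcClient to HDFS and adds it to the task classpath
addJarToDistributedCache(org.apache.xmlrpc.XmlRpcClient.class, conf);
// Likewise for commons-codec, using any class that lives in that jar
addJarToDistributedCache(org.apache.commons.codec.binary.Base64.class, conf);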
I want to implement a REST API to submit Hadoop jobs for execution. This is done purely via Java code. If I compile a jar file and execute it via "hadoop jar", everything works as expected. But when I submit the Hadoop job via Java code in my REST API, the job is submitted but fails with a ClassNotFoundException.
Is it possible to somehow deploy the jar file (with the code of my jobs) to Hadoop (the node managers and their containers) so that Hadoop will be able to locate the jar file by class name? Should I copy the jar file to each node manager and set HADOOP_CLASSPATH there?
You can create a method that adds the jar file to Hadoop's distributed cache, so it will be available to the task trackers when needed.
private static void addJarToDistributedCache(String jarPath, Configuration conf)
        throws IOException {
    File jarFile = new File(jarPath);
    // Declare the new HDFS location
    Path hdfsJar = new Path(jarFile.getName());
    // Mount HDFS
    FileSystem hdfs = FileSystem.get(conf);
    // Copy (overwrite) the jar file to HDFS
    hdfs.copyFromLocalFile(false, true, new Path(jarPath), hdfsJar);
    // Add the jar to the distributed classpath
    DistributedCache.addFileToClassPath(hdfsJar, conf);
}
Then in your application, before submitting your job, call addJarToDistributedCache:
public static void main(String[] args) throws Exception {
    // Create the Hadoop configuration
    Configuration conf = new Configuration();
    // Add 3rd-party libraries
    addJarToDistributedCache("/tmp/hadoop_app/file.jar", conf);
    // Create my job
    Job job = new Job(conf, "Hadoop-classpath");
    .../...
}
You can find more details in this blog:
My goal is to run a simple MapReduce job on a Cloudera cluster that reads from a dummy HBase database and writes out to an HDFS file.
Some important notes:
- I've successfully run MapReduce jobs on this cluster before that took an HDFS file as input
and wrote to an HDFS file as output.
- I've already replaced the libraries used for compiling the project from "purely" HBase jars to the HBase-Cloudera jars.
- When I previously encountered this kind of issue, I used to simply copy a lib into the distributed cache (this worked for me with Google Guice):
JobConf conf = new JobConf(getConf(), ParseJobConfig.class);
DistributedCache.addCacheFile(new URI("/user/hduser/lib/3.0/guice-multibindings-3.0.jar"), conf);
but now it doesn't work because the HBaseConfiguration class is used to create a configuration (before the configuration exists)
- Cloudera version is 5.3.1, Hadoop version is 2.5.0
This is my driver code:
public class HbaseJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "ExampleSummaryToFile");
        job.setJarByClass(HbaseJobDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);

        TableMapReduceUtil.initTableMapperJob("Metrics",
                scan,
                HbaseJobMapper.class,
                Text.class,
                IntWritable.class,
                job);

        job.setReducerClass(HbaseJobReducer.class);
        job.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
    }
}
I am not sure if mapper/reducer classes are needed to solve this issue.
The exception that I am getting is:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
My colleague and I have just solved it; in our case we needed to update the .bashrc file:
nano ~/.bashrc
Add the libraries to the classpath like this:
HBASE_PATH=/opt/cloudera/parcels/CDH/jars
export HADOOP_CLASSPATH=${HBASE_PATH}/hbase-common-0.98.6-cdh5.3.1.jar:<ANY_OTHER_JARS_REQUIRED>
Don't forget to reload bashrc:
. .bashrc
Try this:
export HADOOP_CLASSPATH="/usr/lib/hbase/hbase.jar:$HADOOP_CLASSPATH"
Add the above export to your /etc/hadoop/conf/hadoop-env.sh file or set it from the command line.
The error Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration is due to the missing HBase jar.
If what #sravan said did not work, then try importing HBaseConfiguration in your driver code (import section) like this:
import org.apache.hadoop.hbase.HBaseConfiguration;
I am trying to add external jars to the hadoop classpath, but no luck so far.
I have the following setup
$ hadoop version
Hadoop 2.0.6-alpha
Subversion https://git-wip-us.apache.org/repos/asf/bigtop.git -r ca4c88898f95aaab3fd85b5e9c194ffd647c2109
Compiled by jenkins on 2013-10-31T07:55Z
From source with checksum 95e88b2a9589fa69d6d5c1dbd48d4e
This command was run using /usr/lib/hadoop/hadoop-common-2.0.6-alpha.jar
Classpath
$ echo $HADOOP_CLASSPATH
/home/tom/workspace/libs/opencsv-2.3.jar
I can see that the above HADOOP_CLASSPATH has been picked up by hadoop:
$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/home/tom/workspace/libs/opencsv-2.3.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*
Command
$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result
I tried with -libjars option as well
$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result -libjars /home/tom/workspace/libs/opencsv-2.3.jar
The stacktrace
14/11/04 16:43:23 INFO mapreduce.Job: Running job: job_1415115532989_0001
14/11/04 16:43:55 INFO mapreduce.Job: Job job_1415115532989_0001 running in uber mode : false
14/11/04 16:43:56 INFO mapreduce.Job: map 0% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: map 50% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: Task Id : attempt_1415115532989_0001_m_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: au.com.bytecode.opencsv.CSVParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:19)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:10)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:757)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
Any help is highly appreciated.
Your external jar is missing on the nodes running the map tasks. You have to add it to the distributed cache to make it available. Try:
DistributedCache.addFileToClassPath(new Path("pathToJar"), conf);
Not sure in which version DistributedCache was deprecated, but from Hadoop 2.2.0 onward you can use:
job.addFileToClassPath(new Path("pathToJar"));
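For example, a minimal sketch for this question's opencsv dependency (the HDFS path is hypothetical; the jar has to be copied to HDFS first, since the distributed cache expects a path on the shared file system):

// Hypothetical location, e.g. after: hadoop fs -put /home/tom/workspace/libs/opencsv-2.3.jar /user/tom/lib/
Job job = Job.getInstance(conf, "FlightsByCarrier");
job.addFileToClassPath(new Path("/user/tom/lib/opencsv-2.3.jar"));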
If you are adding the external JAR to the Hadoop classpath, then it's better to copy your JAR to one of the existing directories that hadoop is already looking at. On the command line run "hadoop classpath", find a suitable folder, and copy your jar file to that location; hadoop will pick up the dependencies from there. This won't work with Cloudera etc. as you may not have read/write rights to copy files to the hadoop classpath folders.
Looks like you tried the -libjars option as well; did you edit your driver class to implement the Tool interface? First make sure that you edit your driver class as shown below:
public class myDriverClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");
        ...
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Now edit your "hadoop jar" command as shown below:
hadoop jar YourApplication.jar [myDriverClass] -libjars path/to/jar/file args
Now let's understand what happens underneath. Basically, we handle the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes implementing Tool; it works in conjunction with GenericOptionsParser to parse the generic hadoop command-line arguments and modify the Configuration of the Tool. Note that the generic options (such as -libjars) must come before your application-specific arguments.
Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args) - this runs the given Tool via Tool.run(String[]) after parsing the given generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of the conf.
Now, within the run method, when we call getConf() we get that modified version of the Configuration. So make sure you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), nothing will work.
Configuration conf = getConf();
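With the driver above, a concrete command for this question would look roughly like this (a hedged example: the jar, class and paths are taken from the question, and -libjars is placed before the application arguments):
hadoop jar FlightsByCarrier.jar FlightsByCarrier -libjars /home/tom/workspace/libs/opencsv-2.3.jar /user/root/1987.csv /user/root/result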
I tried setting the opencsv jar in the hadoop classpath, but it didn't work. We need to explicitly copy the jar into the classpath for this to work. That worked for me.
Below are the steps I followed:
I did this on an HDP cluster. I copied my opencsv jar into the HBase libs and exported the classpath before running my jar.
Copy the external jars to the HDP lib directories:
To run the OpenCSV jar:
1. Copy the opencsv jar into the directories /usr/hdp/2.2.9.1-11/hbase/lib/ and /usr/hdp/2.2.9.1-11/hadoop-yarn/lib
sudo cp /home/sshuser/Amedisys/lib/opencsv-3.7.jar /usr/hdp/2.2.9.1-11/hbase/lib/
2. Give the file permissions:
sudo chmod 777 opencsv-3.7.jar
3. List the files:
ls -lrt
4. Export the hadoop classpath so that it includes the hbase classpath, e.g. export HADOOP_CLASSPATH=$(hbase classpath)
5. Now run your jar. It will pick up the opencsv jar and execute properly.
I found another workaround by implementing ToolRunner as shown below. With this approach hadoop accepts command-line options, so we can avoid hard-coding files into the DistributedCache.
public class FlightsByCarrier extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, FlightsByCarrier.class);

        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        FileInputFormat.setInputPaths(job, in);     // org.apache.hadoop.mapred.FileInputFormat
        FileOutputFormat.setOutputPath(job, out);   // org.apache.hadoop.mapred.FileOutputFormat
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new FlightsByCarrier(), args);
        System.exit(res);
    }
}
I found a very easy solution to the problem:
Log in as root:
cd /usr/lib
find . -name "opencsv.jar"
Pick up the location of the file. In my case I found it under /usr/lib/hive/lib/opencsv*.jar
Now submit the command
hadoop classpath
The result shows the directories where hadoop searches for jar files.
Pick one directory and copy opencsv*.jar to that directory.
In my case it worked.
I'm trying to run a Java application from a YARN application (in detail: from the ApplicationMaster in the YARN app). All examples I found deal with bash scripts that are run.
My problem seems to be that I distribute the JAR file wrongly to the nodes in my cluster. I specify the JAR as a local resource in the YARN client.
Path jarPath2 = new Path("/hdfs/yarn1/08_PrimeCalculator.jar");
jarPath2 = fs.makeQualified(jarPath2);

FileStatus jarStat2 = null;
try {
    jarStat2 = fs.getFileStatus(jarPath2);
    log.log(Level.INFO, "JAR path in HDFS is " + jarStat2.getPath());
} catch (IOException e) {
    e.printStackTrace();
}

LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.ARCHIVE);
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
res.put("package", packageResource);
So my JAR is supposed to be distributed to the ApplicationMaster and be unpacked since I specify the ResourceType to be an ARCHIVE. On the AM I try to call a class from the JAR like this:
String command = "java -cp './package/*' de.jofre.prime.PrimeCalculator";
The Hadoop logs tell me when running the application: "Could not find or load main class de.jofre.prime.PrimeCalculator". The class exists at exactly the path that is shown in the error message.
Any ideas what I am doing wrong here?
I found out how to start a Java process from an ApplicationMaster. In fact, my problem was the command used to start the process, even though it is the officially documented way provided by the Apache Hadoop project.
What I did instead was to specify the packageResource to be a file, not an archive:
packageResource.setType(LocalResourceType.FILE);
Now the node manager does not extract the resource but leaves it as a file, in my case a JAR.
To start the process I call:
java -jar primecalculator.jar
To start a JAR without specifying a main class on the command line, you have to specify the main class in the MANIFEST file (manually, or let Maven do it for you).
To sum it up: I did NOT add the resource as an archive but as a file, and I did not use the -cp option to add the symlink folder that hadoop creates for an extracted archive. I simply started the JAR via the -jar parameter, and that's it.
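For reference, a minimal sketch of the changed resource setup, reusing the variables from the question; the resource key is assumed to be the jar's file name so the launch command below can find it:

LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.FILE);   // FILE instead of ARCHIVE: localized as-is
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
// Assumption: key the resource by the jar name so it shows up under that name in the container
res.put("primecalculator.jar", packageResource);

// Container launch command; the main class (e.g. de.jofre.prime.PrimeCalculator) comes from the jar's MANIFEST
String command = "java -jar primecalculator.jar";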
Hope it helps you guys!