I want to run my MapReduce project on Hadoop 2.9.0. I'm using the xml-rpc package in my project as follows:
import org.apache.xmlrpc.*;
I put the additional jars into the lib folder, and when I ran my project jar on Hadoop, it showed this error:
Error: java.lang.ClassNotFoundException: org.apache.xmlrpc.XmlRpcClient
I executed this command:
bin/hadoop jar MRV.jar SumMR /user/hadoop/input /user/hadoop/output -libjars lib/xmlrpc-2.0.1.jar: lib/commons-codec-1.10.jar
How can I execute this command without the ClassNotFoundException?
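For reference, -libjars expects a comma-separated list (not colon-separated), it must appear before the job's own arguments, and it is only honored when the driver goes through ToolRunner/GenericOptionsParser. A minimal sketch of such a driver, assuming SumMR is the job class from the command above (mapper/reducer wiring omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SumMR extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects whatever -libjars and other generic options set
        Job job = Job.getInstance(getConf(), "SumMR");
        job.setJarByClass(SumMR.class);
        // mapper/reducer/output types would be configured here
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before handing args to run()
        System.exit(ToolRunner.run(new Configuration(), new SumMR(), args));
    }
}

With a driver like that, the invocation would look something like:

bin/hadoop jar MRV.jar SumMR -libjars lib/xmlrpc-2.0.1.jar,lib/commons-codec-1.10.jar /user/hadoop/input /user/hadoop/output

Alternatively, the helper below copies a dependency jar to HDFS and registers it on the distributed cache classpath: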
private static void addJarToDistributedCache(
        Class classToAdd, Configuration conf) throws IOException {
    // Retrieve the jar file for classToAdd
    String jar = classToAdd.getProtectionDomain()
            .getCodeSource().getLocation().getPath();
    File jarFile = new File(jar);

    // Declare the new HDFS location
    Path hdfsJar = new Path("/user/hadoopi/lib/" + jarFile.getName());

    // Mount HDFS
    FileSystem hdfs = FileSystem.get(conf);

    // Copy (overwrite) the jar file to HDFS
    hdfs.copyFromLocalFile(false, true, new Path(jar), hdfsJar);

    // Add the jar to the distributed classpath
    DistributedCache.addFileToClassPath(hdfsJar, conf);
}
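For the two jars from the question, the helper above could then be called from the driver before job submission; a hypothetical example (any class contained in each jar will do, the ones below are just well-known classes from xmlrpc-2.0.1 and commons-codec-1.10):

// Assumes a Configuration named conf is in scope in the driver
addJarToDistributedCache(org.apache.xmlrpc.XmlRpcClient.class, conf);
addJarToDistributedCache(org.apache.commons.codec.binary.Base64.class, conf);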
I am trying to add jars on the Hive shell. I am aware of the global option on the server, but my requirement is to add them per session on the Hive shell.
I have used the FileSystem class (the equivalent of the hdfs dfs commands) to add the jars to the HDFS file system.
This is what I have tried:
Created a folder /tmp on HDFS
Added the file to the HDFS file system using the FileSystem.copyFromLocalFile method
(equivalent to hdfs dfs -put myjar.jar /tmp)
Set permissions on the file on the fs file system
Checked that the jar was loaded to HDFS using the getFileSystem method
Listed the files on the fs FileSystem using listFiles to confirm the jars are there
This works, and I have the jars loaded to HDFS, but I cannot add the jars to the Hive session.
When I try to add a jar in the Hive shell, I do the following:
statement = setStmt(createStatement(getConnection()));
query = "add jar " + path;
statement.execute(query);
I am getting this error [for example, with a path of /tmp/myjar.jar]:
Error while processing statement: /tmp/myjar.jar does not exist
Other permutations of the path, such as
query = "add jar hdfs://<host>:<port>" + path;
query = "add jar <host>:<port>" + path;
also result in an error.
The command to list jars works (but returns no results):
query = "list jars";
ResultSet rs = statement.executeQuery(query);
I managed to solve this issue.
The process failed because of the configuration of the FileSystem object, which is where we upload the jars to before adding them to the session.
This is how you initialize the FileSystem:
FileSystem fs = FileSystem.newInstance(conf);
The conf object should have the properties of the Hive server.
In order for the process to work, I needed to set the following parameter on the Configuration object:
conf.set("fs.defaultFS", hdfsDstStr);
I want to implement a REST API to submit Hadoop jobs for execution. This is done purely via Java code. If I compile a jar file and execute it via "hadoop jar", everything works as expected. But when I submit a Hadoop job via Java code from my REST API, the job is submitted but fails with a ClassNotFoundException.
Is it possible to somehow deploy the jar file (with the code of my jobs) to Hadoop (the NodeManagers and their containers) so that Hadoop will be able to locate the jar file by class name? Should I copy the jar file to each NodeManager and set HADOOP_CLASSPATH there?
You can create a method that adds the jar file to Hadoop's distributed cache, so it will be available to the task trackers when needed.
private static void addJarToDistributedCache(
        String jarPath, Configuration conf) throws IOException {
    File jarFile = new File(jarPath);

    // Declare the new HDFS location
    Path hdfsJar = new Path(jarFile.getName());

    // Mount HDFS
    FileSystem hdfs = FileSystem.get(conf);

    // Copy (overwrite) the jar file to HDFS
    hdfs.copyFromLocalFile(false, true, new Path(jarPath), hdfsJar);

    // Add the jar to the distributed classpath
    DistributedCache.addFileToClassPath(hdfsJar, conf);
}
and then in your application, before submitting your job, call addJarToDistributedCache:
public static void main(String[] args) throws Exception {
    // Create the Hadoop configuration
    Configuration conf = new Configuration();

    // Add 3rd-party libraries
    addJarToDistributedCache("/tmp/hadoop_app/file.jar", conf);

    // Create my job
    Job job = new Job(conf, "Hadoop-classpath");
    .../...
}
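As a side note, DistributedCache is deprecated on Hadoop 2.x; the same effect can be had from the Job object itself, roughly like this (hdfsJar being the HDFS path the helper uploaded the jar to):

// Non-deprecated equivalent on Hadoop 2.x: add the uploaded HDFS jar to the task classpath
Job job = Job.getInstance(conf, "Hadoop-classpath");
job.addFileToClassPath(hdfsJar);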
You can find more details in this blog:
I'm trying to access HDFS using Java code, but I can't get it working... After two days of struggling, I think it's time to ask for help.
This is my code:
Configuration conf = new Configuration();
conf.addResource(new Path("/HADOOP_HOME/conf/core-site.xml"));
conf.addResource(new Path("/HADOOP_HOME/conf/hdfs-site.xml"));
FileSystem hdfs = FileSystem.get(conf);
boolean success = hdfs.mkdirs(new Path("/user/cloudera/testdirectory"));
System.out.println(success);
I got this code from here and here.
Unfortunately, the hdfs object is just a LocalFileSystem object, so something must be wrong. It looks like this is exactly what Rejeev wrote on his website:
[...] If you do not assign the configurations to conf object (using hadoop xml file) your HDFS operation will be performed on the local file system and not on the HDFS. [...]
With absolute paths I get the same result.
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
This is the library I'm currently using:
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
I heard that hadoop-core was split into multiple libraries, so I also tried the following:
hadoop-common-2.0.0-alpha.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I'm using Cloudera CDH 4.4.0, so Hadoop is already installed. From the console everything works fine.
For example:
hadoop fs -mkdir testdirectory
So everything should be set up correctly by default.
I hope that you guys can help me... this stuff is driving me nuts! It's extremely frustrating to fail with such a simple task.
Many thanks in advance for any help.
Try this:
conf.set("fs.defaultFS", "file:///");
conf.set("mapreduce.framework.name", "local");
1) You don't need to call conf.addResource unless you are overriding any configuration variables.
2) I hope you are creating a jar file and running it from the command window, not from Eclipse.
If you execute it in Eclipse, it will execute against the local file system.
3) I ran the code below and it worked.
public class Hmkdirs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        boolean success = fs.mkdirs(new Path("/user/cloudera/testdirectory1"));
        System.out.println(success);
    }
}
4) To execute it, you need to create a jar file; you can do that either from Eclipse or from the command prompt, and then run the jar file.
Command-prompt jar build sample:
javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar:/usr/local/hadoop/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf WordCount.jar -C classes/ .
Jar file execution via hadoop at the command prompt:
hadoop jar hadoopfile.jar hadoop.sample.fileaccess.Hmkdirs
hadoop.sample.fileaccess is the package in which my class Hmkdirs exists. If your class is in the default package, you don't have to specify it; just the class name is fine.
Update: You can execute from Eclipse and still access HDFS; check the code below.
public class HmkdirsFromEclipse {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource("/etc/hadoop/conf/core-site.xml");
        conf.addResource("/etc/hadoop/conf/hdfs-site.xml");
        conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020/");
        conf.set("hadoop.job.ugi", "cloudera");
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        FileSystem fs = FileSystem.get(conf);
        boolean success = fs.mkdirs(new Path("/user/cloudera/testdirectory9"));
        System.out.println(success);
    }
}
This is indeed a tricky bit of configuration, but this is essentially what you need to do:
Configuration conf = new Configuration();
conf.addResource("/etc/hadoop/conf/core-site.xml");
conf.addResource("/etc/hadoop/conf/hdfs-site.xml");
conf.set("fs.defaultFS", hdfs://[your namenode]);
conf.set("hadoop.job.ugi", [your user]
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
Make sure you have hadoop-hdfs in your classpath, too.
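A quick sanity check is to print which implementation and URI the FileSystem actually resolved to; a small throwaway class (the name is arbitrary) along these lines:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhichFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);
        // If this prints file:/// and LocalFileSystem, the *-site.xml files were not
        // picked up and every operation will silently hit the local disk instead of HDFS.
        System.out.println(fs.getUri());
        System.out.println(fs.getClass().getName());
    }
}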
Hi, I have a Java Eclipse project which I want to execute from the command line. I made a jar of it and am running it from the command line.
I figured out how to access a file inside a jar by using getResourceAsStream.
Ex:
InputStream is = Extractor.class.getResourceAsStream("CompanyNameListModified.txt");
BufferedReader in = new BufferedReader(new InputStreamReader(is));
What I want to know now is how to access a directory from the jar.
Currently:
Runtime.getRuntime().exec("hadoop fs -get /tmp/stockmarkets/ localfile");
File dir = new File("/home/hadoop/project/localfile");`
This gives a FileNotFoundException.
What I want to do is
File[] directoryListing = dir.listFiles();
if (directoryListing != null) {
for (File child : directoryListing) {
....
}
}
That is, go to the directory and loop over each file in that directory. How should I do this so that it works from my jar?
So I tried this:
Runtime.getRuntime().exec("hadoop fs -get /tmp/stockmarkets/ localfile");
File dir =new File(Extractor.class.getResource("/home/hadoop/project/localfile").getPath());
error:
Exception in thread "main" java.lang.NullPointerException
I checked my directory, and it does contain the localfile directory.
You have to provide a filesystem path, either as a system property,
java -DworkDir=/home/hadoop/project -jar yourprogram.jar
or as a command-line argument, and then use that value in both the call to hadoop and the File declaration.
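A minimal sketch of reading that property and using it for both steps; the property name workDir is just an example, and the copy is waited on before the listing (the original snippet lists immediately after exec, which can race the copy):

import java.io.File;

public class Extractor {
    public static void main(String[] args) throws Exception {
        // -DworkDir=... must come before -jar so the JVM sees it as a system property
        String workDir = System.getProperty("workDir", "/home/hadoop/project");

        // Copy the HDFS directory to the local work directory and wait for it to finish
        Process p = Runtime.getRuntime()
                .exec("hadoop fs -get /tmp/stockmarkets/ " + workDir + "/localfile");
        p.waitFor();

        // Now the local directory can be listed as usual
        File dir = new File(workDir + "/localfile");
        File[] directoryListing = dir.listFiles();
        if (directoryListing != null) {
            for (File child : directoryListing) {
                System.out.println(child.getName());
            }
        }
    }
}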
I'm trying to run a Java application from a YARN application (specifically, from the ApplicationMaster in the YARN app). All the examples I found deal with bash scripts that are run.
My problem seems to be that I distribute the JAR file incorrectly to the nodes in my cluster. I specify the JAR as a local resource in the YARN client.
Path jarPath2 = new Path("/hdfs/yarn1/08_PrimeCalculator.jar");
jarPath2 = fs.makeQualified(jarPath2);
FileStatus jarStat2 = null;
try {
    jarStat2 = fs.getFileStatus(jarPath2);
    log.log(Level.INFO, "JAR path in HDFS is " + jarStat2.getPath());
} catch (IOException e) {
    e.printStackTrace();
}
LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.ARCHIVE);
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);
Map<String, LocalResource> res = new HashMap<String, LocalResource>();
res.put("package", packageResource);
So my JAR is supposed to be distributed to the ApplicationMaster and be unpacked since I specify the ResourceType to be an ARCHIVE. On the AM I try to call a class from the JAR like this:
String command = "java -cp './package/*' de.jofre.prime.PrimeCalculator";
When running the application, the Hadoop logs tell me: "Could not find or load main class de.jofre.prime.PrimeCalculator". The class exists at exactly the path that is shown in the error message.
Any ideas what I am doing wrong here?
I found out how to start a Java process from an ApplicationMaster. In fact, my problem was caused by the command used to start the process, even though it is the officially documented way provided by the Apache Hadoop project.
What I did now was to specify the packageResource to be a file, not an archive:
packageResource.setType(LocalResourceType.FILE);
Now the NodeManager does not extract the resource but leaves it as a file, in my case a JAR.
To start the process, I call:
java -jar primecalculator.jar
To start a JAR without specifying the main class on the command line, you have to specify the main class in the MANIFEST file (manually, or let Maven do it for you).
To sum it up: I added the resource NOT as an archive but as a file, and I did not use -cp to add the symlink folder that Hadoop creates for an extracted archive. I simply started the JAR via the -jar parameter, and that's it.
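Putting the pieces together, the working variant described above looks roughly like this (abbreviated; the surrounding YARN client boilerplate and the ContainerLaunchContext setup are omitted, and the resource name is illustrative):

// Register the jar as a plain FILE so the NodeManager localizes it unmodified
LocalResource packageResource = Records.newRecord(LocalResource.class);
packageResource.setResource(ConverterUtils.getYarnUrlFromPath(jarPath2));
packageResource.setSize(jarStat2.getLen());
packageResource.setTimestamp(jarStat2.getModificationTime());
packageResource.setType(LocalResourceType.FILE);           // FILE instead of ARCHIVE
packageResource.setVisibility(LocalResourceVisibility.PUBLIC);

Map<String, LocalResource> res = new HashMap<String, LocalResource>();
res.put("primecalculator.jar", packageResource);           // the key becomes the file name in the container

// Launch command: no -cp needed, the main class comes from the jar's MANIFEST
String command = "java -jar primecalculator.jar"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";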
Hope it helps you guys!