Run Hadoop MR Job via IntelliJ IDEA - java

I have a map-only job configured to run in distributed mode. When I run it through the CLI, the job runs successfully. The launch string looks like:
hadoop jar FileHandy.jar com.company.MainRun arg1 arg2
But if I run it via the IDE (IntelliJ IDEA), it fails with an error (could not find the Mapper class):
14/07/30 01:07:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/07/30 01:07:34 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/07/30 01:07:35 INFO input.FileInputFormat: Total input paths to process : 1
14/07/30 01:07:36 INFO mapred.JobClient: Running job: job_201407300013_0001
14/07/30 01:07:37 INFO mapred.JobClient: map 0% reduce 0%
14/07/30 01:07:55 INFO mapred.JobClient: Task Id : attempt_201407300013_0001_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.expedia.eww.FileMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1617)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:191)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:631)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.ClassNotFoundException: Class com.expedia.eww.FileMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1523)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1615)
... 8 more
I've set up the IDE and use a Maven pom.xml with dependencies only (I'm using the jar file generated by IDEA's build process instead of the Maven jar, but the results are the same with the Maven jar). My run configuration in the IDE is the following:
Main class: org.apache.hadoop.util.RunJar
Program args: /path/to/jar/FileHandy.jar com.company.FileRun arg1 arg2
The working directory is set.
Code snippet:
Job job = new Job(conf, "File2Hdfs");
job.setJarByClass(FileRun.class);
job.setMapperClass(FileMapper.class);
job.setInputFormatClass(NLineInputFormat.class);
job.setNumReduceTasks(0);
//FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost/user/cloudera/out111"));
FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
FileInputFormat.addInputPath(job, new Path(fileForMapper));
return job.waitForCompletion(true) ? 0 : 1;
FileRun.class (with the main method) and FileMapper.class (the mapper) are in the com.company package.
IDEA launches the following command when running the project:
/usr/java/jdk1.6.0_32/bin/java -Didea.launcher.port=7547 -Didea.launcher.bin.path=/home/cloudera/Downloads/idea-IC-135.909/bin -Dfile.encoding=UTF-8 -classpath /usr/java/jdk1.6.0_32/jre/lib/rt.jar:/usr/java/jdk1.6.0_32/jre/lib/deploy.jar:/usr/java/jdk1.6.0_32/jre/lib/resources.jar:/usr/java/jdk1.6.0_32/jre/lib/jsse.jar:/usr/java/jdk1.6.0_32/jre/lib/management-agent.jar:/usr/java/jdk1.6.0_32/jre/lib/jce.jar:/usr/java/jdk1.6.0_32/jre/lib/plugin.jar:/usr/java/jdk1.6.0_32/jre/lib/charsets.jar:/usr/java/jdk1.6.0_32/jre/lib/javaws.jar:/usr/java/jdk1.6.0_32/jre/lib/ext/sunpkcs11.jar:/usr/java/jdk1.6.0_32/jre/lib/ext/dnsns.jar:/usr/java/jdk1.6.0_32/jre/lib/ext/localedata.jar:/usr/java/jdk1.6.0_32/jre/lib/ext/sunjce_provider.jar:/home/cloudera/IdeaProjects/MavenFileHandy/target/classes:/home/cloudera/.m2/repository/org/apache/hadoop/hadoop-client/2.0.0-mr1-cdh4.4.0/hadoop-client-2.0.0-mr1-cdh4.4.0.jar:/home/cloudera/.m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.4.0/hadoop-common-2.0.0-cdh4.4.0.jar:/home/cloudera/.m2/repository/org/apache/hadoop/hadoop-annotations/2.0.0-cdh4.4.0/hadoop-annotations-2.0.0-cdh4.4.0.jar:/usr/java/jdk1.6.0_32/lib/tools.jar:/home/cloudera/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar:/home/cloudera/.m2/repository/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar:/home/cloudera/.m2/repository/org/apache/commons/commons-math/2.1/commons-math-2.1.jar:/home/cloudera/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/cloudera/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/home/cloudera/.m2/repository/commons-io/commons-io/2.1/commons-io-2.1.jar:/home/cloudera/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/home/cloudera/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/home/cloudera/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:/home/cloudera/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/home/cloudera/.m2/repository/junit/junit/4.8.2/junit-4.8.2.jar:/home/cloudera/.m2/repository/commons-lang/commons-lang/2.5/commons-lang-2.5.jar:/home/cloudera/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/cloudera/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/home/cloudera/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/cloudera/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/cloudera/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/home/cloudera/.m2/repository/org/slf4j/slf4j-api/1.6.1/slf4j-api-1.6.1.jar:/home/cloudera/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar:/home/cloudera/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar:/home/cloudera/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar:/home/cloudera/.m2/repository/org/mockito/mockito-all/1.8.5/mockito-all-1.8.5.jar:/home/cloudera/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.jar:/home/cloudera/.m2/repository/com/thoughtworks/paranamer/paranamer/2.3/paranamer-2.3.jar:/home/cloudera/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.jar:/home/cloudera/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/home/cloudera/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/home/cloudera/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/prot
obuf-java-2.4.0a.jar:/home/cloudera/.m2/repository/org/apache/hadoop/hadoop-auth/2.0.0-cdh4.4.0/hadoop-auth-2.0.0-cdh4.4.0.jar:/home/cloudera/.m2/repository/com/jcraft/jsch/0.1.42/jsch-0.1.42.jar:/home/cloudera/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.4.0/zookeeper-3.4.5-cdh4.4.0.jar:/home/cloudera/.m2/repository/jline/jline/0.9.94/jline-0.9.94.jar:/home/cloudera/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.0.0-cdh4.4.0/hadoop-hdfs-2.0.0-cdh4.4.0.jar:/home/cloudera/.m2/repository/com/sun/jersey/jersey-core/1.8/jersey-core-1.8.jar:/home/cloudera/.m2/repository/com/sun/jersey/jersey-server/1.8/jersey-server-1.8.jar:/home/cloudera/.m2/repository/asm/asm/3.1/asm-3.1.jar:/home/cloudera/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/cloudera/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.4.0/hadoop-core-2.0.0-mr1-cdh4.4.0.jar:/home/cloudera/.m2/repository/hsqldb/hsqldb/1.8.0.10/hsqldb-1.8.0.10.jar:/home/cloudera/Downloads/idea-IC-135.909/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain org.apache.hadoop.util.RunJar /home/cloudera/IdeaProjects/MavenFileHandy/target/FileHandy.jar com.company.FileRun arg1 arg2
Why does the job throw an exception and fail to find the Mapper class when run via the IDE, yet the same code completes successfully via the hadoop jar ... command?
Thanks

I've found the reason. The TaskTrackers can't run the map tasks because the jar file is not in the Distributed Cache. To solve the problem, it's necessary to add the jar file to the project classpath. The steps are:
File -> Project Structure -> Libraries, click '+' in the bottom pane and add the jar file
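Alternatively, the job jar can be named explicitly in the driver so the IDE launch no longer matters. A minimal sketch, assuming the jar IDEA builds at the path shown in the run command above; mapred.jar is the MR1 property that the hadoop jar wrapper would normally set for you:
// Point the job at the built jar so the TaskTrackers can load FileMapper;
// when launching from the IDE there is no "hadoop jar" wrapper to set this.
Configuration conf = new Configuration();
conf.set("mapred.jar", "/home/cloudera/IdeaProjects/MavenFileHandy/target/FileHandy.jar");
Job job = new Job(conf, "File2Hdfs");
job.setMapperClass(FileMapper.class);
// ... remaining setup unchanged from the snippet above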

Related

Java Hadoop-lzo Found interface but class was expected LzoTextInputFormat

I'm trying to use the Hadoop-LZO package (built using the steps here). Everything seems to have worked, since I was able to convert my lzo files to indexed files (this returns big_file.lzo.index as expected):
hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo
Then I go to use these files in my mapreduce jobs (with big_file.lzo.index as the input):
import com.hadoop.mapreduce.LzoTextInputFormat;
....
Job jobConverter = new Job(conf, "conversion");
jobConverter.setJar("JsonConverter.jar");
jobConverter.setInputFormatClass(LzoTextInputFormat.class);
....
and I get the following error:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at com.hadoop.mapreduce.LzoTextInputFormat.listStatus(LzoTextInputFormat.java:62)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:389)
at com.hadoop.mapreduce.LzoTextInputFormat.getSplits(LzoTextInputFormat.java:101)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:304)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:321)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:199)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.wwbp.JsonConverter.run(JsonConverter.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.wwbp.JsonConverter.main(JsonConverter.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I've seen other questions answering this, and they say to rebuild against Hadoop v2. So I redownloaded everything from GitHub and ran:
% hadoop version
Hadoop 2.7.0-mapr-1607
Compiled by root on 2016-07-18T07:56Z
Compiled with protoc 2.5.0
This command was run using /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1607.jar
% ant clean compile-native tar -Dhadoopversion=27
....
tar:
[tar] Building tar: ../jars/hadoop-lzo/build/hadoop-lzo-0.4.15.tar.gz
BUILD SUCCESSFUL
Total time: 15 seconds
When building, my paths are as follows:
C_INCLUDE_PATH=../jars/lzo-2.09/include
LIBRARY_PATH=../jars/lzo-2.09/lib
JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
I'm really not sure what I am doing wrong. How do I get ant to see Hadoop v2?
Edit 1: Possibly of note: when I run both my mapreduce job (calling LzoTextInputFormat.class) and the lzo converter (on big_file.lzo) my classpath is as follows
CLASS_PATH=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/opt/mapr/lib/kvstore*.jar:/opt/mapr/lib/libprotodefs*.jar:/opt/mapr/lib/baseutils*.jar:/opt/mapr/lib/maprutil*.jar:/opt/mapr/lib/json-20080701.jar:/opt/mapr/lib/flexjson-2.1.jar:/jars/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar
Edit 2: If I index the lzo file as follows (i.e. try to index via a mapreduce job with DistributedLzoIndexer instead of LzoIndexer) I get a similar error:
> hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
16/12/09 13:06:24 INFO mapreduce.Job: map 0% reduce 0%
16/12/09 13:06:29 INFO mapreduce.Job: Task Id : attempt_1472572940387_0370_m_000000_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
No idea why the above isn't working, so I started from scratch using this repo:
https://github.com/twitter/hadoop-lzo
instead of the one linked above, and used Maven to build instead of ant (with all of the same settings as above).

How to make 'hadoop jar' command pick up new build of a same named jar

I'm new to Hadoop and started with the examples shipped with Hadoop 2.6.0.
First, I recompiled the source code of hadoop-mapreduce-examples-2.6.0.jar and built a new jar file, MapReduce-0.0.1.jar.
Then I ran the terasort example with this command line:
jjin:hadoop$ bin/hadoop jar ~/shared/MapReduce-0.0.1.jar terasort /input /output
15/01/07 12:27:44 INFO terasort.TeraSort: starting
15/01/07 12:27:46 INFO input.FileInputFormat: Total input paths to process : 1
...
After terasort was done, I updated the source of TeraSort.java by modifying the first log message, like this:
public int run(String[] args) throws Exception {
LOG.info("starting...");
// Update log message by adding '...' to the end of previous one.
Job job = Job.getInstance(getConf());
But after re-running this terasort job, I found the log message didn't change to 'starting...', which means the change I made to TeraSort.java didn't take effect.
The question is how to make hadoop pick up the new MapReduce-0.0.1.jar I built.
Thanks
You have to recompile and rebuild the jar, and then specify a different output directory from the one used in the first job.
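As an alternative to picking a new output directory for every run, the old one can be cleared from the driver before the job is submitted; a hedged sketch inside TeraSort's run() method, assuming it is acceptable to delete the previous results:
// Delete the previous output directory so the rebuilt jar can be re-run
// with the same arguments instead of failing on an existing path.
FileSystem fs = FileSystem.get(getConf());
Path outputDir = new Path(args[1]);   // the output path from the command line
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true);       // true = recursive
}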

Setting external jars to hadoop classpath

I am trying to add external jars to the hadoop classpath, but no luck so far.
I have the following setup
$ hadoop version
Hadoop 2.0.6-alpha
Subversion https://git-wip-us.apache.org/repos/asf/bigtop.git -r ca4c88898f95aaab3fd85b5e9c194ffd647c2109
Compiled by jenkins on 2013-10-31T07:55Z
From source with checksum 95e88b2a9589fa69d6d5c1dbd48d4e
This command was run using /usr/lib/hadoop/hadoop-common-2.0.6-alpha.jar
Classpath
$ echo $HADOOP_CLASSPATH
/home/tom/workspace/libs/opencsv-2.3.jar
I can see that the above HADOOP_CLASSPATH has been picked up by hadoop:
$ hadoop classpath
/etc/hadoop/conf:/usr/lib/hadoop/lib/:/usr/lib/hadoop/.//:/home/tom/workspace/libs/opencsv-2.3.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/:/usr/lib/hadoop-hdfs/.//:/usr/lib/hadoop-yarn/lib/:/usr/lib/hadoop-yarn/.//:/usr/lib/hadoop-mapreduce/lib/:/usr/lib/hadoop-mapreduce/.//
Command
$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result
I tried with the -libjars option as well:
$ sudo hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/result -libjars /home/tom/workspace/libs/opencsv-2.3.jar
The stacktrace
14/11/04 16:43:23 INFO mapreduce.Job: Running job: job_1415115532989_0001
14/11/04 16:43:55 INFO mapreduce.Job: Job job_1415115532989_0001 running in uber mode : false
14/11/04 16:43:56 INFO mapreduce.Job: map 0% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: map 50% reduce 0%
14/11/04 16:45:27 INFO mapreduce.Job: Task Id : attempt_1415115532989_0001_m_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: au.com.bytecode.opencsv.CSVParser
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:19)
at FlightsByCarrierMapper.map(FlightsByCarrierMapper.java:10)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:757)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
Any help is highly appreciated.
Your external jar is missing on the nodes running the map tasks. You have to add it to the cache to make it available. Try:
DistributedCache.addFileToClassPath(new Path("pathToJar"), conf);
Not sure in which version DistributedCache was deprecated, but from Hadoop 2.2.0 onward you can use:
job.addFileToClassPath(new Path("pathToJar"));
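For context, a hedged sketch of how that call might sit in a Tool-based driver (Hadoop 2.2.0+); the HDFS path to the jar is hypothetical, and the jar must already be uploaded somewhere the cluster can read:
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "FlightsByCarrier");
    job.setJarByClass(FlightsByCarrier.class);
    // Ship the external jar to every task's classpath via the distributed cache.
    // Hypothetical HDFS location of the opencsv jar.
    job.addFileToClassPath(new Path("/user/tom/libs/opencsv-2.3.jar"));
    // ... the usual input/output and mapper/reducer setup goes here
    return job.waitForCompletion(true) ? 0 : 1;
}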
If you are adding the external JAR to the Hadoop classpath, then it's better to copy your JAR to one of the existing directories that hadoop already looks at. On the command line, run "hadoop classpath", find a suitable folder, and copy your jar file to that location; hadoop will pick up the dependency from there. This won't work with Cloudera etc., as you may not have read/write rights to copy files to the hadoop classpath folders.
It looks like you tried the -libjars option as well; did you edit your driver class to implement the Tool interface? First, make sure that you edit your driver class as shown below:
public class myDriverClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");
        ...
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Now edit your "hadoop jar" command as shown below:
hadoop jar YourApplication.jar [myDriverClass] args -libjars path/to/jar/file
Now let's understand what happens underneath. Basically, we are handling the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic hadoop command-line arguments and modify the Configuration of the Tool.
Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args) - this runs the given Tool via Tool.run(String[]) after parsing the generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of the conf.
Now, within the run method, when we call getConf() we get the modified version of the Configuration. So make sure that you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), nothing will work.
Configuration conf = getConf();
I tried setting the opencsv jar in the Hadoop classpath, but it didn't work. We need to explicitly copy the jar into the classpath for this to work. It worked for me.
Below are the steps I followed on an HDP cluster: I copied my opencsv jar into the HBase libs and exported it before running my jar.
Copy the external jars to the HDP libs.
To run the opencsv jar:
1. Copy the opencsv jar into the directories /usr/hdp/2.2.9.1-11/hbase/lib/ and /usr/hdp/2.2.9.1-11/hadoop-yarn/lib:
sudo cp /home/sshuser/Amedisys/lib/opencsv-3.7.jar /usr/hdp/2.2.9.1-11/hbase/lib/
2. Give the file permissions using
sudo chmod 777 opencsv-3.7.jar
3. List the files:
ls -lrt
4. Export the hadoop classpath so it includes the HBase classpath, e.g. export HADOOP_CLASSPATH=$(hbase classpath)
5. Now run your jar. It will pick up the opencsv jar and execute properly.
I found another workaround, by implementing ToolRunner as shown below. With this approach hadoop accepts the command-line options, so we can avoid hard-coding the addition of files to the DistributedCache.
public class FlightsByCarrier extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, FlightsByCarrier.class);

        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        // JobConf no longer has setInputPath/setOutputPath; use the old-API helpers
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new FlightsByCarrier(), args);
        System.exit(res);
    }
}
I found a very easy solution to the problem:
Log in as root:
cd /usr/lib
find . -name "opencsv.jar"
Pick up the location of the file. In my case I found it under /usr/lib/hive/lib/opencsv*.jar.
Now submit the command
hadoop classpath
The result shows the directories where hadoop searches for jar files.
Pick one directory and copy opencsv*.jar to that directory.
In my case it worked.

Exception when Servlet tries to run Hadoop 2.2.0 MapReduce Job

SOLVED (the solution is in the comments)
I'm using Hadoop 2.2.0 (in pseudo-distributed mode) on Ubuntu 13.10 and Eclipse Kepler v4.3 to develop my Hadoop program and a Dynamic Web Project (without Maven).
My Hadoop jar project, called "WorkTest.jar", works correctly when I run the job from the command line with "hadoop jar WorkTest.jar", and I can see the job progress correctly on the terminal.
Hadoop project contains four elements:
DriverJob.java (class that configures and starts the job)
Mapper.java
Combiner.java
Reducer.java
Now I have written a new Dynamic Web Project with a ServletTest.java into which I copied the DriverJob class code; the other classes (Mapper.java, Combiner.java, Reducer.java) are placed in the same package as the servlet (the main package). The WebContent/lib folder contains all the necessary Hadoop jar dependencies.
I have successfully deployed my application on the WildFly 8 server with Eclipse, but when I try to run the MapReduce job (the job configuration runs successfully and I managed to delete and write a folder on HDFS), it keeps failing with the following exception, visible in the Hadoop job log file:
FATAL [IPC Server handler 5 on 46834] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1396015900746_0023_m_000002_0 - exited : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
and from the WildFly log file:
WARN [org.apache.hadoop.mapreduce.JobSubmitter] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
But the WEB-INF/classes/ deploy folder on WildFly does contain Mapper.class, Combiner.class and Reducer.class.
I also tried putting the class code of the Mapper, Combiner and Reducer inside the servlet, but it fails with the same error...
What am I doing wrong?
I believe you need to have your .class files in an archive (jar) that can be distributed to the nodes in the cluster.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
This error is the key. Generally you would use job.setJarByClass(DriverJob.class) to tell the mapreduce client which jar file has the Mapper/Reducer classes. You don't have a jar, so that method of distributing the proper classes falls apart.
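A hedged sketch of what the servlet-side fix could look like once the MapReduce classes are packaged into a jar by the build (the jar path below is hypothetical):
// With no "hadoop jar" launcher, name the job jar explicitly so YARN can ship
// it to the task containers; WEB-INF/classes alone is never distributed.
Job job = Job.getInstance(conf, "WorkTest");
job.setJar("/opt/wildfly/standalone/deployments/WorkTest.jar"); // hypothetical path to the built jar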

Why doesn't hadoop recognize my Map class?

I am trying to run my PDFWordCount map-reduce program on hadoop 2.2.0 but I get this error:
13/12/25 23:37:26 INFO mapreduce.Job: Task Id : attempt_1388041362368_0003_m_000009_2, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class PDFWordCount$MyMap not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class PDFWordCount$MyMap not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
It says that my map class is not known. I have a cluster with a namenode and 2 datanodes on 3 VMs.
My main function is this:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    @SuppressWarnings("deprecation")
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(MyMap.class);
    job.setReducerClass(MyReduce.class);
    job.setInputFormatClass(PDFInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setJarByClass(PDFWordCount.class);
    job.waitForCompletion(true);
}
If I run my jar using this command:
yarn jar myjar.jar PDFWordCount /in /out
it takes /in as the output path and gives me the error, even though I have job.setJarByClass(PDFWordCount.class); in my main function, as you can see above.
I have run a simple WordCount project with a main function exactly like this one; to run it I used yarn jar wc.jar MyWordCount /in2 /out2 and it ran flawlessly.
I can't understand what the problem is!
UPDATE: I tried to move my work from this project into the wordcount project I had used successfully. I created a package, copied the related files from the pdfwordcount project into it, and exported the project (my main was not changed to use PDFInputFormat, so I did nothing except move the java files to the new package). It didn't work. I deleted the files from the other project, but it didn't work. I moved the java files back to the default package, but it didn't work!
What's wrong?!
I found a way to overcome this problem, even though I couldn't figure out what the actual problem was.
When I want to export my java project as a jar file in Eclipse, I have two options:
Extract required libraries into generated JAR
Package required libraries into generated JAR
I don't know exactly what the difference is, or whether it's a big deal. I used to choose the second option, but if I choose the first option, I can run my job using this command:
yarn jar pdf.jar /in /out
