I have a Hadoop job which requires several third-party jars. I have put them on the classpath via conf/hadoop-env.sh:
export HADOOP_CLASSPATH=hdfs://name.node.private.ip:9000/home/ec2-user/hadoop-gremlin-libs/
When I run $ bin/hadoop classpath, this path is included. However, when I go to run a job, it fails during initialization with:
Error: java.lang.ClassNotFoundException: com.google.common.collect.Lists
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.giraph.conf.AllOptions.<clinit>(AllOptions.java:37)
at org.apache.giraph.conf.ClassConfOption.<init>(ClassConfOption.java:47)
at org.apache.giraph.conf.ClassConfOption.create(ClassConfOption.java:60)
at org.apache.giraph.conf.GiraphConstants.<clinit>(GiraphConstants.java:62)
at org.apache.giraph.conf.GiraphClasses.readFromConf(GiraphClasses.java:152)
at org.apache.giraph.conf.GiraphClasses.<init>(GiraphClasses.java:142)
at org.apache.giraph.conf.ImmutableClassesGiraphConfiguration.<init>(ImmutableClassesGiraphConfiguration.java:93)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
This particular class should be packaged in guava, which is included on the classpath:
[ec2-user]$ bin/hadoop dfs -ls /home/ec2-user/hadoop-gremlin-libs | grep guava
-rw-r--r-- 3 ec2-user supergroup 0 2017-04-20 17:57 /home/ec2-user/hadoop-gremlin-libs/guava-18.0.jar
I am submitting the job from gremlin as follows:
graph = GraphFactory.open('conf/hadoop.properties')
result = graph.compute().program(MyVertexProgram.build().create()).submit().get()
I have also tried putting the jars on the local filesystem and receive the same error. Does anyone know how to solve this issue?
I can't tell exactly what kind of job you are doing, but looking at those classes it appears to be a MapReduce2 map task being set up when you hit that exception.
I think you are probably updating the wrong classpath value: you are updating the Hadoop classpath, not the MapReduce classpath.
More than likely you need to update the cluster's YARN/MapReduce2 application classpath values, either in your cluster manager application or in the site XML files the cluster is using. You should have a mapred-site.xml file with a property named mapreduce.application.classpath, which holds the classpath MapReduce uses to find the jars it needs to execute its jobs; add your path to the value of mapreduce.application.classpath instead.
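For illustration, a minimal sketch of such a mapred-site.xml entry, assuming the jars are also staged at that same local path on every NodeManager (the leading entries mirror common defaults; check your distribution's existing value and append to it rather than replacing it):
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/home/ec2-user/hadoop-gremlin-libs/*</value>
</property>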
The same goes for YARN: update the yarn.application.classpath property if YARN needs any other jars, since that classpath points to the jars that help YARN run. You can update this easily in a cluster manager application if you have one, or edit yarn-site.xml manually to add the path.
The only other option is if your client software has its own dedicated mapred-site.xml file that it reads mapreduce.application.classpath from. If so, you may be able to modify mapreduce.application.classpath on the client side, if your software supports it. Some client programs have their own classpaths, or read the cluster's site XML files to connect to the cluster.
From what the exception shows, I am fairly sure you need this jar on the mapreduce.application.classpath, not the Hadoop classpath.
I'm using a Docker image of Flume from probablyfine/flume.
I'm trying to configure an HDFS sink and I'm getting an error about missing dependencies. Google search results say I need to include the Hadoop libs, but many of the results are old, from when Hadoop 1.0 had a single hadoop-core-1.0.jar that I could include in my Docker image.
I'm trying to include the jars straight from the Hadoop 2.9 binary download's /share/hadoop/common/ directory, but including these jars in my FLUME_CLASSPATH is not working:
FLUME_CLASSPATH=/var/tmp/hadoop/common/
I've also tried going one level up and just using the /hadoop/ directory, but it's the same error as before:
2018-01-22 21:49:21,643 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:146)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:235)
at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:...)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:...)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:...)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.SequenceFile$CompressionType
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
Which specific jar dependency files do I need in my Docker image to set up an HDFS sink?
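One note on the FLUME_CLASSPATH attempt above: a directory entry on a Java classpath only picks up loose .class files, never the jars sitting inside it, so pointing FLUME_CLASSPATH at /var/tmp/hadoop/common/ will not load hadoop-common and friends. A hedged sketch of a wildcard classpath that would pull the jars in (the extra directories are assumptions mirroring the extracted Hadoop tarball layout, and this assumes the flume-ng launcher passes FLUME_CLASSPATH through to java -cp, which does expand * entries):
export FLUME_CLASSPATH="/var/tmp/hadoop/common/*:/var/tmp/hadoop/common/lib/*:/var/tmp/hadoop/hdfs/*"
At a minimum, the HDFS sink typically needs hadoop-common, hadoop-auth and hadoop-hdfs plus their lib/ dependencies reachable this way.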
I've installed the Hadoop file system on my machine and I'm using the spark-mllib_2.10 Maven dependency to provide the Spark environment for my code.
My code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
with the following properties set:
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I am starting the DFS nodes externally with the command
start-dfs.sh
Now, my code works perfectly fine when run from Eclipse. But if I export the whole project to an executable jar, it gives me the following exception:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked different solutions posted online for this issue, where people suggest adding the following:
hadoopConfig.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
);
hadoopConfig.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName()
);
OR
<property>
<name>fs.file.impl</name>
<value>org.apache.hadoop.fs.LocalFileSystem</value>
<description>The FileSystem for file: uris.</description>
</property>
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
<description>The FileSystem for hdfs: uris.</description>
</property>
But I don't use any Hadoop context or Hadoop configuration in my project; I simply load the data from Hadoop using the URL.
Can someone give an answer relevant to this issue?
Please note that this works completely fine from Eclipse and only fails when I export the same project as an executable jar.
Update
As suggested in the comment and in the solutions found online, I tried two things.
Added dependencies to my pom.xml for the hadoop-core, hadoop-hdfs and hadoop-client libraries.
Added the above properties configuration to hadoop's site-core.xml as suggested here http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But still no luck in getting the error resolved. I get the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop there the same way I did on my machine, using the link mentioned above.
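Since this works from Eclipse but fails only from the exported jar, a usual suspect is that the merged jar's META-INF/services/org.apache.hadoop.fs.FileSystem file no longer registers DistributedFileSystem, which is exactly what the fs.hdfs.impl / fs.file.impl settings above work around. As a minimal sketch (the class and method names are placeholders, and jsc is the JavaSparkContext from the loadLibSVMFile call), those settings can be applied without a standalone Hadoop Configuration by going through the configuration Spark already carries:
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.spark.api.java.JavaSparkContext;

public class FsSchemeFix {
    // Illustrative only: register the hdfs:// and file:// implementations on the
    // Hadoop configuration that Spark passes along when it opens the input path.
    public static void registerFileSystems(JavaSparkContext jsc) {
        jsc.hadoopConfiguration().set("fs.hdfs.impl", DistributedFileSystem.class.getName());
        jsc.hadoopConfiguration().set("fs.file.impl", LocalFileSystem.class.getName());
    }
}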
I found this example of how to write to a local file system, but it throws this exception:
Exception in thread "main" java.io.IOException: (null) entry in command string: null chmod 0644 C:\temp\test.seq
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:778)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1168)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
I'm running this on a Windows 10 box. I even tried using the MSYS Git Bash shell, thinking maybe that would help the JVM simulate a chmod operation, but it didn't change anything. Any suggestions on how to do this on Windows?
I too faced this error, and it was resolved after the following steps. (Note: I am using Spark 2.0.2 and Hadoop 2.7.)
Verify whether you are getting "java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries." You can check by running the spark-shell command.
I got the above-mentioned error because I hadn't added HADOOP_HOME to my environment variables. After adding HADOOP_HOME (in my case the same as SPARK_HOME), the issue was resolved.
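If editing system environment variables isn't convenient, the same hint can usually be supplied programmatically, as long as it runs before any Hadoop class is loaded; a minimal sketch with an example path:
// Hadoop's Shell utility falls back to the hadoop.home.dir system property
// when HADOOP_HOME is not set in the environment (the path here is an example).
System.setProperty("hadoop.home.dir", "C:\\hadoop-2.7.3");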
Running a Hadoop program using only jars on Windows requires a few steps beyond just referencing the jars.
Credit to Professor Lu at University of Helsinki for posting a Hadoop on Windows guide for his students.
Here is a rundown of steps I had to take using Windows 10 and Hadoop 2.7.3:
Download and extract Hadoop binaries to somewhere like C:\hadoop-2.7.3.
Download patch files from https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip and extract them to your %HADOOP_HOME%\bin directory.
Set a HADOOP_HOME environment variable. For example, C:\hadoop-2.7.3.
Download the Hadoop source code, copy hadoop-common-project\hadoop-common\src\main\java\org\apache\hadoop\io\nativeio\NativeIO.java to your project, and modify line 609 from
return access0(path, desiredAccess.accessRight());
to
return true;
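With those steps in place, a minimal local-filesystem SequenceFile write along the lines the question describes might look roughly like this (the output path and key/value types are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalSeqFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // no cluster required
        Path path = new Path("file:///C:/temp/test.seq");  // explicit local URI
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("hello"));
        }
    }
}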
One possible solution is as follows.
In the Project Structure (IntelliJ), under SDKs, ensure there is no other version of Hadoop referenced. In my case, I had been running Spark earlier and it was referencing Hadoop JARs, which was causing access issues. Once I removed them and ran the MR job, it ran fine.
I added KafkaLog4JAppender functionality to my MR job.
Locally the job runs and sends the formatted logs into my Kafka cluster.
When I try to run it from the YARN server, using:
jar [jar-name].jar [DriverClass].class [job-params] -Dlog4j.configuration=log4j.xml -libjars
I get the following exception:
log4j:ERROR Could not create an Appender. Reported error follows.
java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender
The KafkaLog4JAppender class is in the path.
Running
jar tvf [my-jar].jar | grep KafkaLog4J
finds the class.
I'm kind of lost and would appreciate any helpful input.
Thanks in advance!
If it works in local mode but not in YARN/distributed mode, then the problem could be that the jar is not being distributed properly. You might want to check "Using third party jars and files in your MapReduce application (Distributed cache)" for details on how to distribute the jar containing KafkaLog4jAppender.class.
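For reference, a hedged sketch of the invocation order that -libjars expects: generic options go right after the driver class and before the job's own arguments, and they only take effect if the driver runs through ToolRunner/GenericOptionsParser (jar and class names below are placeholders):
hadoop jar my-job.jar com.example.MyDriver \
  -libjars /path/to/jar-containing-KafkaLog4jAppender.jar \
  [job-params]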
SOLVED (the solution is in the comments)
I'm using Hadoop 2.2.0 (in pseudo-distributed mode) on Ubuntu 13.10 and Eclipse Kepler v4.3 to develop my Hadoop program and a Dynamic Web Project (without Maven).
My Hadoop jar project, called "WorkTest.jar", works correctly when I run the job from the command line with "hadoop jar WorkTest.jar", and I can see the job's progress on the terminal.
The Hadoop project contains four elements:
DriverJob.java (class that configures and starts the job)
Mapper.java
Combiner.java
Reducer.java
Now I have written a new Dynamic Web Project with a ServletTest.java into which I put the DriverJob class code; the other classes (Mapper.java, Combiner.java, Reducer.java) are placed in the same package as the servlet (the main package). The WebContent/lib folder contains all the necessary Hadoop jar dependencies.
I have successfully deployed my application on a WildFly 8 server with Eclipse, but when I try to run the MapReduce job (the job configuration runs successfully, and I managed to delete and write a folder on HDFS), it keeps failing with the following exception, visible in the Hadoop job log file:
FATAL [IPC Server handler 5 on 46834] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1396015900746_0023_m_000002_0 - exited : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
and from the WildFly log file:
WARN [org.apache.hadoop.mapreduce.JobSubmitter] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
But the WEB-INF/classes/ deploy folder on WildFly does contain Mapper.class, Combiner.class and Reducer.class.
I also tried putting the Mapper, Combiner and Reducer class code inside the servlet, but it fails with the same error...
What am I doing wrong?
I believe you need to have your .class files in an archive (jar) that can be distributed to the nodes in the cluster.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
This warning is the key. Generally you would use job.setJarByClass(DriverJob.class) to tell the MapReduce client which jar file contains the Mapper/Reducer classes. You don't have a jar, so that mechanism for distributing the proper classes falls apart.
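A hedged sketch of the driver-side alternative, assuming you package the Mapper/Combiner/Reducer classes into a jar the webapp can reach (class names and the jar path are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ServletDriver {
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "WorkTest");
        // When the driver class itself is loaded from a jar, this is sufficient:
        // job.setJarByClass(DriverJob.class);
        // Inside a webapp, where classes sit unpacked under WEB-INF/classes, point
        // the job at an actual jar containing the Mapper/Combiner/Reducer classes:
        job.setJar("/path/to/WorkTest.jar");
        return job;
    }
}
Either call gives the client a concrete jar to ship to the task containers, which is what the "No job jar file set" warning is complaining about.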