Exception when Servlet tries to run Hadoop 2.2.0 MapReduce Job - java

SOLVED (the solution is in the comments)
I'm using Hadoop 2.2.0 (in pseudo-distributed mode) on Ubuntu 13.10 and Eclipse Kepler v4.3 to develop my Hadoop program and a Dynamic Web Project (without Maven).
My Hadoop jar project, called "WorkTest.jar", works correctly when I run the job from the command line with "hadoop jar WorkTest.jar", and I can follow the job progress in the terminal.
The Hadoop project contains four elements:
DriverJob.java (class that configures and starts the job)
Mapper.java
Combiner.java
Reducer.java
Now I have written a new Dynamic Web Project with a ServletTest.java into which I copied the DriverJob class code; the other classes (Mapper.java, Combiner.java, Reducer.java) are placed in the same package as the servlet (main package). The WebContent/lib folder contains all the necessary Hadoop jar dependencies.
I have successfully deployed my application on a WildFly 8 server with Eclipse, but when I try to run the MapReduce job (the job configuration runs successfully and I managed to delete and write a folder on HDFS), it keeps failing with the following exception, visible in the Hadoop job log file:
FATAL [IPC Server handler 5 on 46834] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1396015900746_0023_m_000002_0 - exited : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
and from the WildFly log file:
WARN [org.apache.hadoop.mapreduce.JobSubmitter] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
But the WEB-INF/classes/ deploy folder on WildFly contains the Mapper.class, Combiner.class and Reducer.class.
I also tried putting the Mapper, Combiner and Reducer class code inside the servlet, but it fails with the same error...
What am I doing wrong?

I believe you need to have your .class files in an archive (jar) that can be distributed to the nodes in the cluster.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
This warning is the key. Generally you would use job.setJarByClass(DriverJob.class) to tell the MapReduce client which jar file contains the Mapper/Reducer classes. You don't have a jar, so that method for distributing the proper classes falls apart.
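As a rough illustration (not the asker's exact code), here is a minimal sketch of the two ways to point the job at a jar; the jar path is an assumption. The idea is that when the classes only live in WEB-INF/classes, setJarByClass has no jar to resolve, so setJar with an explicit jar that bundles the Mapper/Combiner/Reducer classes is the usual workaround:
// Hypothetical driver excerpt, for illustration only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverJobSketch {
    public static Job configureJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "WorkTest");

        // Option 1: works only when this class was itself loaded from a jar.
        job.setJarByClass(DriverJobSketch.class);

        // Option 2: point at an explicit jar containing the Mapper/Combiner/Reducer
        // classes, e.g. a copy of WorkTest.jar shipped with the web application
        // (the path below is an assumption).
        job.setJar("/path/to/WorkTest.jar");

        // ... set mapper/combiner/reducer classes and input/output paths
        // exactly as in the command-line DriverJob ...
        return job;
    }
}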

Related

Elastic Beanstalk fails creating

I finally reached the point where my Elastic Beanstalk instance / environment got launched (Java Corretto 11 platform). Now it fails to start the provided .jar file.
In the eb-engine.log file, I am not able to find any error beyond this:
2021/05/27 11:36:25.889735 [INFO] Executing instruction: StageJavaApplication
2021/05/27 11:36:25.889871 [ERROR] An error occurred during execution of command [app-deploy] - [StageJavaApplication]. Stop running the command. Error: staging java app failed due to invalid zip file
The jar file is a Spring Boot application built with mvn -B package.
Locally the whole thing starts, but then crashes because of missing environment variables (expected behaviour).
But it seems AWS is not even starting the application...
Any suggestions on this?
Spring Boot apps run nicely on Elastic Beanstalk. However, you do need to set some variables. For example, have you set the server.port variable to 5000?
And as you stated, to successfully use a Service Client, you can set environment variables for your creds. Here is an end-to-end walkthrough that shows how to successfully put a Spring Boot app that invokes several AWS services on Elastic Beanstalk:
Creating your first AWS Java web application
PS - your log file mentions a ZIP file. Be sure to create the JAR properly as discussed in the above example.
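If you would rather fix the port in code than through Beanstalk environment properties, a minimal sketch (the class name is hypothetical) is to pass server.port as a default property when bootstrapping Spring Boot; on the Java platform, Elastic Beanstalk's reverse proxy forwards traffic to port 5000 by default:
// Hypothetical main class; any existing @SpringBootApplication class works the same way.
import java.util.HashMap;
import java.util.Map;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Application {
    public static void main(String[] args) {
        SpringApplication app = new SpringApplication(Application.class);
        // Listen on 5000, the port Elastic Beanstalk's nginx proxy forwards to by default.
        Map<String, Object> defaults = new HashMap<>();
        defaults.put("server.port", "5000");
        app.setDefaultProperties(defaults);
        app.run(args);
    }
}
Setting the SERVER_PORT (or server.port) environment property in the Beanstalk console achieves the same thing without touching the code.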
Just in case someone arrives here looking for an answer to this error:
Error: staging java app failed due to invalid zip file
I was renaming my service jar in Gradle, using:
tasks.withType<org.springframework.boot.gradle.tasks.bundling.BootJar> {
archiveFileName.set("service.jar")
launchScript()
}
And Elastic Beanstalk was not happy about the renaming.
When I let it keep the default name, there were no zip issues and everything worked like a charm.

Hadoop 1.2.1: Put jars in hdfs on classpath

I have a Hadoop job which requires several third-party jars. I have put them on the classpath in conf/hadoop-env.sh:
export HADOOP_CLASSPATH=hdfs://name.node.private.ip:9000/home/ec2-user/hadoop-gremlin-libs/
When I run $ bin/hadoop classpath, this path is included. However, when I run a job, it throws an error during initialization:
Error: java.lang.ClassNotFoundException: com.google.common.collect.Lists
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.giraph.conf.AllOptions.<clinit>(AllOptions.java:37)
at org.apache.giraph.conf.ClassConfOption.<init>(ClassConfOption.java:47)
at org.apache.giraph.conf.ClassConfOption.create(ClassConfOption.java:60)
at org.apache.giraph.conf.GiraphConstants.<clinit>(GiraphConstants.java:62)
at org.apache.giraph.conf.GiraphClasses.readFromConf(GiraphClasses.java:152)
at org.apache.giraph.conf.GiraphClasses.<init>(GiraphClasses.java:142)
at org.apache.giraph.conf.ImmutableClassesGiraphConfiguration.<init>(ImmutableClassesGiraphConfiguration.java:93)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
This particular class should be packaged in guava, which is included on the classpath:
[ec2-user]$ bin/hadoop dfs -ls /home/ec2-user/hadoop-gremlin-libs | grep guava
-rw-r--r-- 3 ec2-user supergroup 0 2017-04-20 17:57 /home/ec2-user/hadoop-gremlin-libs/guava-18.0.jar
I am submitting the job from gremlin as follows:
graph = GraphFactory.open('conf/hadoop.properties')
result = graph.compute().program(MyVertexProgram.build().create()).submit().get()
I have also tried putting the jars on the local filesystem and received the same error. Does anyone know how to solve this issue?
I can't tell exactly what kind of job you are running, but judging from those classes it appears to be a MapReduce2 map task that is being set up when you hit that exception.
I think you are probably updating the wrong classpath value: you are updating the Hadoop classpath, not the MapReduce classpath.
More than likely you need to update the cluster's YARN/MapReduce2 application classpath values in the cluster manager application, or in the site XML files the cluster is using. You should have a mapred-site.xml file with a property named mapreduce.application.classpath, which holds the classpath pointing to the jars MapReduce needs to execute its jobs; add your path to the value of mapreduce.application.classpath instead.
The same goes for YARN: update the yarn.application.classpath property if YARN needs any other jars, since that classpath points to the jars YARN itself needs to run. You can update it easily in a cluster manager application if you have one, or edit yarn-site.xml manually to add the path.
The only other option is if your client program has its own dedicated mapred-site.xml file it reads to get mapreduce.application.classpath from. If so, you may be able to modify mapreduce.application.classpath on the client side, if your software supports it. Some client programs have their own classpaths, or read the Hadoop cluster's site XML files to connect to the cluster.
I am pretty sure, from what the exception shows, that you need this jar in mapreduce.application.classpath, not the Hadoop classpath.
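As an alternative to editing the site XML files, the jars can also be shipped with the job itself through the distributed cache. This is not what the answer above recommends, just a hedged sketch against the Hadoop 1.x API, with the HDFS path taken from the question:
// Hypothetical helper for Hadoop 1.2.1: adds every jar in an HDFS directory
// to the task classpath via the distributed cache.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JobClasspathHelper {
    public static void addLibsToJobClasspath(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/home/ec2-user/hadoop-gremlin-libs"))) {
            if (status.getPath().getName().endsWith(".jar")) {
                DistributedCache.addFileToClassPath(status.getPath(), conf);
            }
        }
    }
}
Whether this is usable here depends on being able to hook into the job configuration that Gremlin builds from conf/hadoop.properties.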

Flink Streamer in Apache Ignite

I am working on real-time streaming data analysis using Apache Flink 1.1.3. My system consists of a Kafka cluster for the message queue and a Flink cluster which reads the messages from the Kafka partitions and performs some analysis on them; finally, I want to dump the generated data into an Ignite cache. I am using the IgniteSink class to sink the data into the Ignite cache. The versions are as follows:
Flink 1.1.3,
Kafka 2.10,
Ignite 2.0.0
When I try to run the job on the Flink cluster, it gives me the following error:
Exception in thread "main" org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:409)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:95)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:382)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:374)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:209)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:173)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1429)
at flink_ignite_sink_remote.main(flink_ignite_sink_remote.java:77)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$8.apply$mcV$sp(JobManager.scala:822)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$8.apply(JobManager.scala:768)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$8.apply(JobManager.scala:768)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.ignite.sink.flink.IgniteSink$SinkContext$Holder
at org.apache.ignite.sink.flink.IgniteSink$SinkContext.getStreamer(IgniteSink.java:201)
at org.apache.ignite.sink.flink.IgniteSink$SinkContext.access$100(IgniteSink.java:175)
at org.apache.ignite.sink.flink.IgniteSink.invoke(IgniteSink.java:165)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at flink_ignite_sink_remote$Splitter.flatMap(flink_ignite_sink_remote.java:177)
at flink_ignite_sink_remote$Splitter.flatMap(flink_ignite_sink_remote.java:1)
at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:48)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
at org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:161)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:225)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.run(Kafka09Fetcher.java:253)
at java.lang.Thread.run(Thread.java:745)
I have included all the Ignite-Flink libraries in my project, but I still get the java.lang.NoClassDefFoundError.
I suspect you are using a simple jar instead of an uber/fat jar.
If you are using Maven, try the shade plugin; for sbt, use sbt-assembly. You can also create your project as described in the quickstart guide.
This is also discussed on Apache Ignite user forum: http://apache-ignite-users.70518.x6.nabble.com/Flink-Streamer-td10650.html
When you work with Flink jobs that you deploy on a Flink cluster, you have two options:
Generate a jar file with all your dependencies bundled inside and deploy that jar.
Add all your dependencies to the classpath of your Flink server, so it can find those dependencies outside of the jar file.
Your problem looks like you are not generating a jar file with all the dependencies inside it, and those dependencies aren't located on the classpath of the Flink server.
Try running the following mvn command to generate your jars:
mvn clean package -Pbuild-jar
It may generate multiple jars; pick the fat jar (the bigger one).

Hadoop log4j cannot find KafkaLog4JAppender.class

I added KafkaLog4JAppender functionality to my MR job.
Locally the job runs and sends the formatted logs to my Kafka cluster.
When I try to run it from the YARN server, using:
jar [jar-name].jar [DriverClass].class [job-params] -Dlog4j.configuration=log4j.xml -libjars
I get the following exception:
log4j:ERROR Could not create an Appender. Reported error follows.
java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender
The KafkaLog4JAppender class is on the path.
Running
jar tvf [my-jar].jar | grep KafkaLog4J
finds the class.
I'm kind of lost and would appreciate any helpful input.
Thanks in advance!
If it works in local mode but not in YARN/distributed mode, it could be a problem of the jar not being distributed properly. You might want to check "Using third party jars and files in your MapReduce application (Distributed cache)" for details on how to distribute the jar containing KafkaLog4jAppender.class.
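One related detail worth checking: -libjars is only honoured when the driver lets GenericOptionsParser see the arguments, which usually means implementing Tool and launching through ToolRunner, with the generic options placed before the job parameters. A minimal sketch (class and job names are hypothetical):
// Hypothetical driver showing the Tool/ToolRunner pattern so that
// -libjars and -D options are parsed before the job arguments.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LoggingJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects whatever -libjars / -D set on the command line.
        Job job = Job.getInstance(getConf(), "kafka-logging-job");
        job.setJarByClass(LoggingJobDriver.class);
        // ... mapper/reducer setup as in the real job ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new LoggingJobDriver(), args));
    }
}
Launched as, for example: hadoop jar my-job.jar LoggingJobDriver -libjars kafka-log4j-appender.jar -Dlog4j.configuration=log4j.xml <input> <output> (generic options before the job parameters; the appender jar name here is illustrative).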

Some problems with Flume

I have 2 CDH4 clusters.
One runs CentOS 6.4 (real hardware), the other Ubuntu 12.04 (Amazon EC2).
I made all the configuration files manually identical (with Cloudera Manager).
I am trying to start the Cloudera twitter example. When I start Flume on the CentOS cluster it works without any problems, but on the Ubuntu cluster Flume gives the following error in the log file:
2013-09-11 15:04:54,491 INFO org.apache.flume.instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
2013-09-11 15:04:54,527 ERROR org.apache.flume.lifecycle.LifecycleSupervisor: Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource
{name:Twitter,state:IDLE} } - Exception follows.
java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
After some googling I found this solution in the comments, posted by Suresh E Gopalan on August 20, 2013 at 2:43 AM:
So we have another JAR file, search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar, with the same class, conflicting with the correct one in FLUME_CLASSPATH. Temporarily rename it to a .org extension, so that it will be excluded from the classpath at startup.
After renaming this jar, Flume started to work on the Ubuntu cluster.
On the CentOS cluster I have the same jar with the same classes, but it doesn't need to be renamed.
Why does this happen, and what should I change on the Ubuntu cluster to get the same behaviour without renaming?
Rebuild the flume-source yourself; don't download the prebuilt snapshot.jar.