Flink Streamer in Apache Ignite - java

I am working on real-time streaming data analysis using Apache Flink 1.1.3. My system consists of a Kafka cluster as the message queue and a Flink cluster that reads the messages from the Kafka partitions and analyzes them. Finally, I want to write the generated data into an Ignite cache, and I am using the IgniteSink class to do so. The versions are as follows:
Flink 1.1.3,
Kafka 2.10,
Ignite 2.0.0
When I try to run the job on the Flink cluster, it gives me the following error:
Exception in thread "main" org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:409)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:95)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:382)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:374)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:209)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:173)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1429)
at flink_ignite_sink_remote.main(flink_ignite_sink_remote.java:77)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$8.apply$mcV$sp(JobManager.scala:822)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$8.apply(JobManager.scala:768)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$8.apply(JobManager.scala:768)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.ignite.sink.flink.IgniteSink$SinkContext$Holder
at org.apache.ignite.sink.flink.IgniteSink$SinkContext.getStreamer(IgniteSink.java:201)
at org.apache.ignite.sink.flink.IgniteSink$SinkContext.access$100(IgniteSink.java:175)
at org.apache.ignite.sink.flink.IgniteSink.invoke(IgniteSink.java:165)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at flink_ignite_sink_remote$Splitter.flatMap(flink_ignite_sink_remote.java:177)
at flink_ignite_sink_remote$Splitter.flatMap(flink_ignite_sink_remote.java:1)
at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:48)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
at org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:161)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:225)
at org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.run(Kafka09Fetcher.java:253)
at java.lang.Thread.run(Thread.java:745)
I have included all the Ignite-Flink libraries in my project, but I still get a java.lang.NoClassDefFoundError.

I suspect you are using a plain jar instead of an uber/fat jar.
If you are using Maven, try the shade plugin; for sbt, try sbt-assembly. You can also create your project as described in the quickstart guide.

This is also discussed on Apache Ignite user forum: http://apache-ignite-users.70518.x6.nabble.com/Flink-Streamer-td10650.html

When you deploy Flink jobs to a Flink cluster, you have two options:
Generate a jar file with all your dependencies bundled inside and deploy that jar.
Add all your dependencies to the classpath of your Flink server, so it can find the dependencies that are not in the jar file.
Your problem looks like you are not generating a jar file with all the dependencies inside it, and those dependencies are not on the classpath of the Flink server either.
Try running the following mvn command to generate your jars:
mvn clean package -Pbuild-jar
It may generate multiple jars; pick the fat jar (the larger one).
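A quick way to confirm which jar actually bundles the Ignite sink is to look inside it for the class named in the stack trace. The following is a minimal sketch using only the JDK; the class name JarClassCheck and the jar path in the usage note are illustrative and not part of Flink or Ignite.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {

    /** Returns true if the jar contains a .class entry for the given fully qualified class name. */
    public static boolean containsClass(String jarPath, String className) throws IOException {
        // Class files are stored under their package path, using '/' as separator.
        // Nested classes keep the '$' in the entry name.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java JarClassCheck <jar> <fully.qualified.ClassName>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]) ? "present" : "MISSING");
    }
}
```

For example, java JarClassCheck target/my-job.jar org.apache.ignite.sink.flink.IgniteSink (with a hypothetical jar name) should report "present" for the fat jar. If it reports "MISSING", that jar was built without its dependencies.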


Submitting a Storm topology fails (getting started)

I'm trying to get started with Storm. I set up a hosted cluster and followed all the steps listed here for getting started. Everything works fine until I submit the topology:
storm jar target/storm-starter-*.jar org.apache.storm.starter.RollingTopWords production-topology
which fails with:
Running: java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/usr/local/Cellar/storm/1.2.2/libexec -Dstorm.log.dir=/usr/local/Cellar/storm/1.2.2/libexec/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /usr/local/Cellar/storm/1.2.2/libexec/*:/usr/local/Cellar/storm/1.2.2/libexec/lib/*:/usr/local/Cellar/storm/1.2.2/libexec/extlib/*:target/storm-starter-2.0.1-SNAPSHOT.jar:/usr/local/Cellar/storm/1.2.2/libexec/conf:/usr/local/Cellar/storm/1.2.2/libexec/bin -Dstorm.jar=target/storm-starter-2.0.1-SNAPSHOT.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} org.apache.storm.starter.RollingTopWords production-topology
Error: Could not find or load main class org.apache.storm.starter.RollingTopWords
Caused by: java.lang.NoClassDefFoundError: org/apache/storm/topology/ConfigurableTopology
I'm not familiar with Java or Storm, and getting started doesn't feel good so far.
ConfigurableTopology doesn't exist in Storm 1.2.2. Most likely you are trying to use a storm-starter jar built from Storm 2.x with a 1.2.2 cluster. This will not work. Build storm-starter from the 1.x sources instead, and it should work.
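To check this kind of version mismatch directly, one can probe whether a class is visible on a given classpath. The sketch below uses only the JDK; ClasspathProbe is a hypothetical helper name, not part of Storm.

```java
public class ClasspathProbe {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class reported as missing in the error above.
        String name = args.length > 0 ? args[0] : "org.apache.storm.topology.ConfigurableTopology";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

Running it with the same -cp value shown in the "Running:" line (plus the directory containing ClasspathProbe.class) shows whether the installed Storm libraries provide the class that the storm-starter jar expects.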

What Hadoop jar dependencies do I need to set up an HDFS sink in Flume?

I'm using a Docker image of Flume from probablyfine/flume.
I'm trying to configure an HDFS sink and I'm getting the error below about dependencies. Google search results say I need to include the Hadoop libs, but many of the results are old, from when Hadoop 1.0 had a single hadoop-core-1.0.jar that I could include in my Docker image.
I'm trying to include the jars straight from the Hadoop 2.9 binary download in /share/hadoop/common/. But including these jars in my FLUME_CLASSPATH is not working.
I've also tried going one level up and using the whole /hadoop/ directory. But I get the same error as before:
2018-01-22 21:49:21,643 (conf-file-poller-0) [ERROR
- org.apache.flume.node.Poll
FileConfigurationProvider.java:146)] Failed to start agent because dependencies
were not found in classpath. Error follows.
java.lanat org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java
at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(Abstrac
tConfiguat org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(
Abstractat org.apache.flume.node.PollingPropertiesFileConfigurationProvider$File
WatcherRat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:51
1) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
Caused bat java.lang.ClassLoader.loadClass(ClassLoader.java:424)o.SequenceFile$C
ompressiat sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
... 12 moreg.ClassLoader.loadClass(ClassLoader.java:357)
Which specific jar dependency files do I need in my Docker image to set up an HDFS sink?

Spark-submit job throws NoSuchMethodError at redis.clients.jedis.JedisPoolConfig.setFairness(Z)V

My Spark Streaming job (Spark 1.6) stores data in a Redis cluster. It works fine when I run it locally, but when I deploy it on the cluster I get the stack trace below:
Caused by: java.lang.NoSuchMethodError: redis.clients.jedis.JedisPoolConfig.setFairness(Z)V
at com.xyz.utils.redis.RedisClient.<init>(RedisClient.java:87)
at com.xzy.redis.Cache.<init>(Cache.java:76)
at com.xyz.spark.broadcastvariables.RedisCacheWrapper$RedisCacheHolder.getDevelopmentInstance(RedisCacheWrapper.java:29)
at com.zyx.spark.broadcastvariables.RedisCacheWrapper$RedisCacheHolder.<clinit>(RedisCacheWrapper.java:18)
at com.zyx.spark.broadcastvariables.RedisCacheWrapper.getCache(RedisCacheWrapper.java:74)
at com.zyx.spark.sparkjob.ProcessArticles.lambda$null$0(ProcessArticles.java:380)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
I am using redis.clients:jedis 2.9.0, which depends on org.apache.commons:commons-pool2 2.4.2, and that version contains the setFairness() method.
I know this is a dependency conflict between my uber jar's dependencies and Spark's dependencies, because I found that the Spark classpath contains a different version of commons-pool.
I've tried:
using the --jar option to add my commons-pool2
setting spark.executor.extraClassPath=true
Both attempts failed.
Could anyone help?
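When diagnosing a conflict like this, it can help to print which jar a class was actually loaded from at runtime and whether the expected method exists on it. The following is a minimal sketch using only JDK reflection; WhereLoaded is a hypothetical helper, not part of Spark or Jedis.

```java
import java.security.CodeSource;

public class WhereLoaded {

    /** Returns the location (usually a jar URL) the class was loaded from. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the JDK's bootstrap loader have no code source.
        return src == null ? "<bootstrap>" : String.valueOf(src.getLocation());
    }

    /** Returns true if the class exposes a public method with this name and parameter types. */
    public static boolean hasMethod(Class<?> cls, String name, Class<?>... paramTypes) {
        try {
            cls.getMethod(name, paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass the class to inspect, e.g. redis.clients.jedis.JedisPoolConfig.
        Class<?> cls = Class.forName(args[0]);
        System.out.println("loaded from: " + locationOf(cls));
        System.out.println("has setFairness(boolean): " + hasMethod(cls, "setFairness", boolean.class));
    }
}
```

Calling locationOf and hasMethod from inside the job, on an executor, and logging the result would show whether the class comes from the uber jar or from a jar on the cluster's own classpath.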

Hadoop 1.2.1: Put jars in hdfs on classpath

I have a hadoop job which requires several 3rd party jars. I have put them on the classpath with conf/hadoop-env.sh
export HADOOP_CLASSPATH=hdfs://name.node.private.ip:9000/home/ec2-user/hadoop-gremlin-libs/
When I run $ bin/hadoop classpath this path is included, as you can see here. However, when I go to run a job, it throws an error in initialization:
Error: java.lang.ClassNotFoundException: com.google.common.collect.Lists
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.giraph.conf.AllOptions.<clinit>(AllOptions.java:37)
at org.apache.giraph.conf.ClassConfOption.<init>(ClassConfOption.java:47)
at org.apache.giraph.conf.ClassConfOption.create(ClassConfOption.java:60)
at org.apache.giraph.conf.GiraphConstants.<clinit>(GiraphConstants.java:62)
at org.apache.giraph.conf.GiraphClasses.readFromConf(GiraphClasses.java:152)
at org.apache.giraph.conf.GiraphClasses.<init>(GiraphClasses.java:142)
at org.apache.giraph.conf.ImmutableClassesGiraphConfiguration.<init>(ImmutableClassesGiraphConfiguration.java:93)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
This particular class should be packaged in guava, which is included on the classpath:
[ec2-user]$ bin/hadoop dfs -ls /home/ec2-user/hadoop-gremlin-libs | grep guava
-rw-r--r-- 3 ec2-user supergroup 0 2017-04-20 17:57 /home/ec2-user/hadoop-gremlin-libs/guava-18.0.jar
I am submitting the job from gremlin as follows:
graph = GraphFactory.open('conf/hadoop.properties')
result = graph.compute().program(MyVertexProgram.build().create()).submit().get()
I have also tried putting the jars on the local filesystem and I get the same error. Does anyone know how to solve this issue?
I can't tell exactly what kind of job you are running, but judging by those classes, it appears to be a MapReduce2 map task that is being set up when you hit that exception.
I think you are probably updating the wrong classpath: you are updating the Hadoop classpath, not the MapReduce classpath.
More than likely you need to update the Hadoop cluster's YARN/MapReduce2 application classpath values, either in the cluster manager application or in the site XML files the cluster uses. You should have a mapred-site.xml file with a property named mapreduce.application.classpath, which holds the classpath MapReduce uses to find the jars it needs to execute its jobs. Add your path to the value of mapreduce.application.classpath instead.
The same goes for YARN: update the yarn.application.classpath property if YARN needs any other jars, since that classpath points to the jars YARN itself needs to run. You can update this easily in a cluster manager application if you have one, or edit yarn-site.xml manually to add the path.
The only other option applies if your client program reads its own dedicated mapred-site.xml to get mapreduce.application.classpath. If so, you may be able to modify mapreduce.application.classpath on the client side, provided your software supports it. Some client programs have their own classpaths; others read the Hadoop cluster's site XML files to connect to the cluster.
From what the exception shows, I am fairly sure you need this jar in mapreduce.application.classpath, not in the Hadoop classpath.

Exception when a Servlet tries to run a Hadoop 2.2.0 MapReduce job

SOLVED (the solution is in the comments)
I'm using Hadoop 2.2.0 (in pseudo-distributed mode) on Ubuntu 13.10 and Eclipse Kepler v4.3 to develop my Hadoop program and a Dynamic Web Project (without Maven).
My Hadoop jar project, called "WorkTest.jar", works correctly when I run the job from the command line with "Hadoop jar WorkTest.jar", and I can see the job's progress in the terminal.
Hadoop project contains four elements:
DriverJob.java (class that configures and starts the job)
Now I have written a new Dynamic Web Project with a ServletTest.java into which I copied the DriverJob class code. The other classes (Mapper.java, Combiner.java, Reducer.java) are placed in the same package as the servlet (the main package). The WebContent/lib folder contains all the necessary Hadoop jar dependencies.
I have successfully deployed my application on WildFly 8 Server with Eclipse, but when I try to run the MapReduce job (the job configuration runs successfully, and I managed to delete and write a folder on HDFS), it keeps failing with the following exception, visible in the Hadoop job log file:
FATAL [IPC Server handler 5 on 46834] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1396015900746_0023_m_000002_0 - exited : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class Mapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
and from the WildFly log file:
WARN [org.apache.hadoop.mapreduce.JobSubmitter] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
But the WEB-INF/classes/ deploy folder on WildFly does contain Mapper.class, Combiner.class and Reducer.class.
I also tried putting the code of the Mapper, Combiner and Reducer classes inside the servlet, but it fails with the same error.
What am I doing wrong?
I believe you need to have your .class files in an archive (jar) that can be distributed to the nodes in the cluster.
WARN [org.apache.hadoop.mapreduce.JobSubmitter] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
This warning is the key. Generally you would use job.setJarByClass(DriverJob.class) to tell the MapReduce client which jar file contains the Mapper/Reducer classes. You don't have a jar, so that mechanism for distributing the classes to the cluster falls apart.
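One possible workaround, sketched here under the assumption that the compiled classes sit in a directory such as WEB-INF/classes, is to package them into a jar at runtime and point the job at that file through Job#setJar(String), the method named in the warning. The helper below uses only the JDK; ClassesToJar and the paths are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ClassesToJar {

    /** Packages every .class file under classesDir into jarFile, preserving package paths. */
    public static Path jarDirectory(Path classesDir, Path jarFile) throws IOException {
        List<Path> classFiles;
        try (Stream<Path> walk = Files.walk(classesDir)) {
            classFiles = walk
                    .filter(p -> Files.isRegularFile(p) && p.toString().endsWith(".class"))
                    .collect(Collectors.toList());
        }
        try (OutputStream fileOut = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(fileOut)) {
            for (Path file : classFiles) {
                // Jar entry names always use '/' separators, also on Windows.
                String entryName = classesDir.relativize(file).toString().replace('\\', '/');
                jar.putNextEntry(new JarEntry(entryName));
                Files.copy(file, jar);
                jar.closeEntry();
            }
        }
        return jarFile;
    }
}
```

In the servlet, the returned path could then be passed to the job before submission, for example job.setJar(jarPath.toString()), so that the Mapper, Combiner and Reducer classes are shipped to the task nodes.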
