Kafka Spark-Streaming offset issue - java

Working with Kafka Spark-Streaming. Able to read and process the data sent from Producer. I have a scenario here, lets assume Producer is producing messages and Consumer is turned down for a while and switched on. Now the Conumser is only reading live data. Instead, it should have also retained the data from where it stopped reading.
Here is the pom.xml I have been using.
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spark.version>2.0.1</spark.version>
<kafka.version>0.8.2.2</kafka.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>${kafka.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>${kafka.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.json/json -->
<dependency>
<groupId>org.json</groupId>
<artifactId>json</artifactId>
<version>20160810</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.json4s/json4s-ast_2.11 -->
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-ast_2.11</artifactId>
<version>3.2.11</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.2.0</version>
</dependency>
I have tried working with Kafka-v0.10.1.0 Producer and Conumser. The behaviour is as expected(consumer reads data from where it left). So, in this version offset is picked up correctly.
Have tried using the same version in above pom.xml too, but failed with java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker.
I understand the compatability of versions, but I'm also looking for continuous stream.

The different behavior likely arises from the fact that Kafka underwent some rather large changes between versions 0.8 and 0.10.
Unless you absolutely have to use the old version, I suggest switching to newer ones.
Take a look at this link:
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
The Kafka project introduced a new consumer api between versions 0.8 and 0.10, so there are 2 separate corresponding Spark Streaming packages available.
If you want to use Kafka v0.10.1.0, you must thus specify some kafka spark streaming integration dependency at
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11.
Something like this for example:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
Additional note: you're using hadoop 2.2.0 which was released in Oct, 2013 and is thus ancient in Hadoop terms, you should consider changing it to a newer version.
Let me know if this helps.

Related

hadoop-aws compile dependencies on jackson library

I am trying to write a Spark program using Java to fetch records from oracle database and write it into S3 bucket. To access bucket using s3a:// convention, I have added hadoop-aws as dependency in the pom.xml.
I have added below dependencies in the pom.xml file :
<properties>
<java.version>1.8</java.version>
<scala.version>2.12</scala.version>
<spark.version>3.0.2</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.github.noraui</groupId>
<artifactId>ojdbc7</artifactId>
<version>12.1.0.2</version>
</dependency>
<dependency>
<groupId>com.github.noraui</groupId>
<artifactId>ojdbc7</artifactId>
<version>12.1.0.2</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk</artifactId>
<version>1.11.690</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
</dependency>
I tried to run the program but got dependency error :
Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/databind/type/ReferenceType
at com.fasterxml.jackson.module.scala.modifiers.ScalaTypeModifierModule.$init$(ScalaTypeModifier.scala:35)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:18)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:36)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
After bit of research, I got to know the hadoop-aws has compilation dependency on com.fasterxml.jackson.core as mentioned here :
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.8.5
So, I updated my pom.xml with the below dependencies :
<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-annotations -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
</dependency>
But, I am still getting the same error. I am not sure what else I am missing here.

Spring boot 2.2 activemq jetty conflict

I am trying to upgrade Spring boot to the version 2.2 together with jetty starter.
I get these errors due to jetty version conflict
The following method did not exist:
'org.eclipse.jetty.websocket.server.NativeWebSocketConfiguration org.eclipse.jetty.websocket.server.NativeWebSocketServletContainerInitializer.initialize(org.eclipse.jetty.servlet.ServletContextHandler)'
The method's class, org.eclipse.jetty.websocket.server.NativeWebSocketServletContainerInitializer, is available from the following locations:
jar:file:/some-dir/target/p3.0.166-SNAPSHOT.war!/WEB-INF/lib/jetty-all-9.4.19.v20190610-uber.jar!/org/eclipse/jetty/websocket/server/NativeWebSocketServletContainerInitializer.class
jar:file:/some-dir/target/p3.0.166-SNAPSHOT.war!/WEB-INF/lib/websocket-server-9.4.20.v20190813.jar!/org/eclipse/jetty/websocket/server/NativeWebSocketServletContainerInitializer.class
I have activemq dependency which brings in it's own jetty-all versioned 9.4.19 dependency which is in conflict with spring-boot 2.2 jetty (9.4.20)
And part of my pom.xml is:
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-tomcat</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jetty</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<!--
Jsp-api isn't standard in spring boot
-->
<dependency>
<groupId>javax.servlet.jsp</groupId>
<artifactId>javax.servlet.jsp-api</artifactId>
<version>${jsp.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<scope>provided</scope>
</dependency>
<!-- artefacts enable JSP running in spring-boot -->
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>jstl</artifactId>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>apache-jsp</artifactId>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>apache-jstl</artifactId>
</dependency>
<!--
Used to be a single artifact.
<groupId>javax.servlet</groupId>
<artifactId>jstl</artifactId>
<version>1.2</version>
Newer versions splits the interface and implementation.
This suggests to use a Glassfish implementation.
https://www.andygibson.net/blog/quickbyte/jstl-missing-from-maven-repositories/
The one we used had an Apache implementation, so going with that.
https://stackoverflow.com/a/24444342
-->
<dependency>
<groupId>org.apache.taglibs</groupId>
<artifactId>taglibs-standard-spec</artifactId>
<version>${taglibs.version}</version>
</dependency>
<dependency>
<groupId>org.apache.taglibs</groupId>
<artifactId>taglibs-standard-impl</artifactId>
<version>${taglibs.version}</version>
</dependency>
<!-- Unit test dependencies -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.easytesting</groupId>
<artifactId>fest-assert</artifactId>
<version>1.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<scope>test</scope>
</dependency>
<!-- html compressing is used by hrmanager in the JSP -->
<dependency>
<groupId>com.googlecode.htmlcompressor</groupId>
<artifactId>htmlcompressor</artifactId>
<version>1.5.2</version>
</dependency>
<!-- ApacheMQ HTTP jarfile set -->
<dependency>
<groupId>org.apache.activemq</groupId>
<artifactId>activemq-client</artifactId>
</dependency>
<dependency>
<groupId>org.apache.activemq</groupId>
<artifactId>activemq-http</artifactId>
</dependency>
</dependencies>
Any idea how I can fix this?
ActiveMQ is wrong here.
jetty-all is not meant to be used as a dependency in a project.
See https://www.eclipse.org/lists/jetty-users/msg06030.html
It only exists as a command line tool for the documentation to educate folks about basic featureset of Jetty.
It does not, and cannot, contain all of Jetty.
A single artifact with everything that Jetty produces is impossible.
As #Shilan pointed out, excluding jetty-all is the correct solution.
The use of the other appropriate dependencies (by spring-boot-starter-jetty it seems) will already pull in the correct Jetty transitive dependencies that you need.
You can use $ mvn dependency:tree from the command line to see this (before and after excluding jetty-all)
You might want to run one of the duplicate class finder maven plugins to see what other duplicate classes you have going on and correct for those as well.
https://github.com/ning/maven-duplicate-finder-plugin
https://github.com/basepom/duplicate-finder-maven-plugin

NoClassDefFoundError: org/apache/spark/sql/DataFrame in spark-cassandra-connector

I'm trying to upgrade spark-cassandra-connector from 1.4 to 1.5.
Everything seems fine but when I run test cases then It stuck between the process and log some error message saying:
Exception in thread "dag-scheduler-event-loop"
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
My pom file looks like:
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10 -->
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>16.0.1</version>
</dependency>
<!-- Scala Library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.5</version>
</dependency>
<!--Spark Cassandra Connector-->
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector-java_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>3.0.2</version>
</dependency>
<!--Spark-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.0</version>
<exclusions>
<exclusion>
<groupId>net.java.dev.jets3t</groupId>
<artifactId>jets3t</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>
</project>
Thank you in advance!!
Can anyone please help me with this ?
If you need more info please let me know!!
Try to add dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
Also make sure that your version spark-cassandra-connector is compatible with version of Spark you're using. I had the same error message even with all proper dependencies when was trying to use older spark-cassandra-connector with newer Spark version. Refer to this table: https://github.com/datastax/spark-cassandra-connector#version-compatibility

What maven artifacts do I need for spring hibernate and mysql support?

I have a IDEA project using maven2.
I want to use hibernate + mysql, what dependancies do I need?
first of all, I separate the versions from the artifacts:
<properties>
<spring.version>3.0.3.RELEASE</spring.version>
<hibernate.version>3.5.3-Final</hibernate.version>
<mysql.version>5.1.13</mysql.version>
<junit.version>4.7</junit.version>
</properties>
then I reference them like this:
<dependencies>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-orm</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
<!-- perhaps using scope = provided, as this will often
be present on the app server -->
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-core</artifactId>
<!-- or hibernate-entitymanager if you use jpa -->
<version>${hibernate.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
That way you keep the versions all in one place and can easily update them, especially if you reference e.g. multiple spring artifacts.
BTW: these should be the current versions, but you can always look up current versions using MvnRepository.com
Pasting these dependencies into pom.xml after <depdendencies> should work:
<!-- MySQL database driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.9</version>
</dependency>
<!-- Hibernate framework -->
<dependency>
<groupId>hibernate</groupId>
<artifactId>hibernate3</artifactId>
<version>3.2.3.GA</version>
</dependency>
<!-- Hibernate library dependecy start -->
<dependency>
<groupId>dom4j</groupId>
<artifactId>dom4j</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>cglib</groupId>
<artifactId>cglib</artifactId>
<version>2.2</version>
</dependency>
<dependency>
<groupId>javax.transaction</groupId>
<artifactId>jta</artifactId>
<version>1.1</version>
</dependency>
<!-- Hibernate library dependecy end -->
Shamelessly cloned from http://www.mkyong.com/hibernate/quick-start-maven-hibernate-mysql-example/ (with the addition of jta as recommended by a commenter)
You may want to tweak the version numbers on the dependencies.
IntelliJ IDEA 9 can find Maven dependencies based on class name. If you start using a class which isn't available in the current dependencies you can get IntelliJ to help find it by using Alt-Enter.
I used this to great effect with a Java-base Subversion hook implementation I am building at work. I was able to get SVNKit and Google Guice dependencies into my project fairly easily this way.
MySQL in your case may be trickier since it is more of a runtime dependency when using Hibernate.

Java - getting jar dependencies right

I'm relatively new to Java & maven, and so to get to know my way around, I decided to do a project as a means for learning.
I picked a pretty common stack :
Java 1.6
Hibernate (with annotations)
Spring (with annotations)
JUnit 4
Tomcat
Oracle XE / In-mem hsqldb
By far one of the biggest problems I've experienced is getting the correct combination of jar versions to get a stable environment. It's an issue I'm still fighting with over two months later.
Quite often I get noSuchMethod or classNotFound exceptions thrown, and it turns out to be that Spring module A x.x.x is not compatible with Hibernate module B y.y.y. Or even, just as commonly, spring module A x.x.x is not compatible with spring module B y.y.y
I expected in starting from a clean slate, version dependencies should be minimal -- just grab the latest version and everything should work... but that has not been the case.
I expected that using maven would simplify this process, and no doubt it has.
But it's certainly be far from painless. I'd have thought that if module A requires a specific version of module B, that it be enforced somewhere along the line, and certinaly provide more meaningful messages that just "noSuchMethod".
Additionally, it seems that the only way I discover these problems is to try a new method call, get the dreaded noSuchMethod error, and start googling.
Have I missed something along the way here that has made this more difficult on myself than it needed to be?
For reference, here's the dependencies section of my pom...if you notice anything horrendously non-standard, please let me know!
<dependencies>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.5.6</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.5.6</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.3</version>
</dependency>
<dependency>
<groupId>ojdbc</groupId>
<artifactId>ojdbc</artifactId>
<version>14</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>1.4</version>
</dependency>
<dependency><!-- java bytecode processor -->
<groupId>javassist</groupId>
<artifactId>javassist</artifactId>
<version>3.8.0.GA</version>
</dependency>
<dependency>
<groupId>commons-dbcp</groupId>
<artifactId>commons-dbcp</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>hsqldb</groupId>
<artifactId>hsqldb</artifactId>
<version>1.8.0.7</version>
</dependency>
<dependency>
<groupId>org.dbunit</groupId>
<artifactId>dbunit</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring</artifactId>
<version>2.5.6</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>2.5.6</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-orm</artifactId>
<version>2.5.6</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-annotations</artifactId>
<version>3.4.0.GA</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-commons-annotations</artifactId>
<version>3.3.0.ga</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-core</artifactId>
<version>3.3.1.GA</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-validator</artifactId>
<version>3.1.0.GA</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-entitymanager</artifactId>
<version>3.4.0.GA</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.3</version>
</dependency>
</dependencies>
Thanks
Marty
One thing that I've found challenging is determining what is in each package, especially from Spring.
To that end, I've found Netbeans' support for maven to be outstanding in how it lets you know what libraries are pulled in by each requirement. 6.7 Beta contains a graphical tree which is outstanding, and m2eclipse also has a very nice graphical dependency tree. How else would you know that spring-orm includes, spring-beans, spring-core, spring-context, and spring-tx? You can ask maven for the dependencies using the dependency plugin from the command line, but the graphical representation is quite handy. dependency:tree is the goal you want to run. Obviously you can also run that from Netbeans or Eclipse.
So, as an example of one of your collisions:
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-annotations</artifactId>
<version>3.4.0.GA</version>
</dependency>
actually includes hibernate-commons-annotations-3.1.0.GA not 3.3. It also includes hibernate-core-3.3.0.SP1, not 3.3.1.GA.
I would start at your "biggest" component, and start to see what parts that already includes and only add what is missing. Even then, double check that you don't have a duplicate dependency and if need be, exclude the duplicate as shown in the answer to this question.
If you are using eclipse, then you should download the maven plugin from sonatype here http://m2eclipse.sonatype.org/.
This comes with a useful graphic visualisation of your dependencies (in particular transitive dependencies - dependencies you have not explicitly defined in your POM), and also shows conflicting dependencies.
Update: from the comments below, your mileage may vary.

Categories