Apache Flink integration with Elasticsearch - java

I am trying to integrate Flink with Elasticsearch 2.1.1. I am using the Maven dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch2_2.10</artifactId>
    <version>1.1-SNAPSHOT</version>
</dependency>
Here is the Java code where I am reading the events from a Kafka queue (which works fine), but somehow the events are not getting posted to Elasticsearch and there is no error either. In the code below, if I change any of the settings related to the port, hostname, cluster name, or index name of Elasticsearch, then I immediately see an error, but currently it shows no error and no new documents get created in Elasticsearch.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// parse user parameters
ParameterTool parameterTool = ParameterTool.fromArgs(args);

DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer082<>(
        parameterTool.getRequired("topic"), new SimpleStringSchema(), parameterTool.getProperties()));
messageStream.print();

Map<String, String> config = new HashMap<>();
config.put(ElasticsearchSink.CONFIG_KEY_BULK_FLUSH_MAX_ACTIONS, "1");
config.put(ElasticsearchSink.CONFIG_KEY_BULK_FLUSH_INTERVAL_MS, "1");
config.put("cluster.name", "FlinkDemo");

List<InetSocketAddress> transports = new ArrayList<>();
transports.add(new InetSocketAddress(InetAddress.getByName("localhost"), 9300));

messageStream.addSink(new ElasticsearchSink<String>(config, transports, new TestElasticsearchSinkFunction()));

env.execute();
}

private static class TestElasticsearchSinkFunction implements ElasticsearchSinkFunction<String> {
    private static final long serialVersionUID = 1L;

    public IndexRequest createIndexRequest(String element) {
        Map<String, Object> json = new HashMap<>();
        json.put("data", element);
        return Requests.indexRequest()
                .index("flink").id("hash" + element).source(json);
    }

    @Override
    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}

I was indeed running it on the local machine and debugging as well, but the only thing I was missing was to properly configure logging, as most Elasticsearch issues are only reported in log.warn statements. The issue was an exception inside BulkRequestHandler.java in the elasticsearch-2.2.1 client API, which was throwing "org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: type is missing;". I had created the index but not a type, which I find pretty strange, as Elasticsearch should primarily be concerned with the index and create the type by default.
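For reference, a minimal sketch of that fix applied to the createIndexRequest method above; the type name "flink-log" is only an illustrative choice, not something from the original post.

public IndexRequest createIndexRequest(String element) {
    Map<String, Object> json = new HashMap<>();
    json.put("data", element);
    return Requests.indexRequest()
            .index("flink")
            .type("flink-log") // explicit type avoids the "type is missing" validation error in the ES 2.x client
            .id("hash" + element)
            .source(json);
}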

I have found a very good example of the Flink & Elasticsearch connector.
First, the Maven dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch2_2.10</artifactId>
    <version>1.1-SNAPSHOT</version>
</dependency>
Second, the example Java code:
public static void writeElastic(DataStream<String> input) {
    Map<String, String> config = new HashMap<>();

    // This instructs the sink to emit after every element, otherwise they would be buffered
    config.put("bulk.flush.max.actions", "1");
    config.put("cluster.name", "es_keira");

    try {
        // Add elasticsearch hosts on startup
        List<InetSocketAddress> transports = new ArrayList<>();
        transports.add(new InetSocketAddress("127.0.0.1", 9300)); // port is 9300 not 9200 for ES TransportClient

        ElasticsearchSinkFunction<String> indexLog = new ElasticsearchSinkFunction<String>() {
            public IndexRequest createIndexRequest(String element) {
                String[] logContent = element.trim().split("\t");
                Map<String, String> esJson = new HashMap<>();
                esJson.put("IP", logContent[0]);
                esJson.put("info", logContent[1]);

                return Requests
                        .indexRequest()
                        .index("viper-test")
                        .type("viper-log")
                        .source(esJson);
            }

            @Override
            public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
                indexer.add(createIndexRequest(element));
            }
        };

        ElasticsearchSink esSink = new ElasticsearchSink(config, transports, indexLog);
        input.addSink(esSink);
    } catch (Exception e) {
        System.out.println(e);
    }
}

Related

Problem creating tests with Spring Cloud Stream Kafka Streams using EmbeddedKafka with MockSchemaRegistryClient

I'm trying to figure out how I can test my Spring Cloud Stream Kafka Streams application.
The application looks like this:
Stream 1: Topic1 > Topic2
Stream 2: Topic2 + Topic3 joined > Topic4
Stream 3: Topic4 > Topic5
I tried different approaches, like the TestChannelBinder, but this approach only works with simple functions, not with streams and Avro.
I decided to use EmbeddedKafka with MockSchemaRegistryClient. I can produce to a topic and also consume from that same topic again (topic1), but I'm not able to consume from topic2.
In my test application.yaml I put the following configuration (I'm only testing the first stream for now; I want to extend it once this works):
spring.application.name: processingapp
spring.cloud:
  function.definition: stream1 # not now ;stream2;stream3
  stream:
    bindings:
      stream1-in-0:
        destination: topic1
      stream1-out-0:
        destination: topic2
    kafka:
      binder:
        min-partition-count: 1
        replication-factor: 1
        auto-create-topics: true
        auto-add-partitions: true
      bindings:
        default:
          consumer:
            autoRebalanceEnabled: true
            resetOffsets: true
            startOffset: earliest
        stream1-in-0:
          consumer:
            keySerde: io.confluent.kafka.streams.serdes.avro.PrimitiveAvroSerde
            valueSerde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
        stream1-out-0:
          producer:
            keySerde: io.confluent.kafka.streams.serdes.avro.PrimitiveAvroSerde
            valueSerde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
      streams:
        binder:
          configuration:
            schema.registry.url: mock://localtest
            specivic.avro.reader: true
My test looks like the following:
@RunWith(SpringRunner.class)
@SpringBootTest
public class Test {

    private static final String INPUT_TOPIC = "topic1";
    private static final String OUTPUT_TOPIC = "topic2";

    @ClassRule
    public static EmbeddedKafkaRule embeddedKafka = new EmbeddedKafkaRule(1, true, 1, INPUT_TOPIC, OUTPUT_TOPIC);

    @BeforeClass
    public static void setup() {
        System.setProperty("spring.cloud.stream.kafka.binder.brokers", embeddedKafka.getEmbeddedKafka().getBrokersAsString());
    }

    @org.junit.Test
    public void testSendReceive() throws IOException {
        Map<String, Object> senderProps = KafkaTestUtils.producerProps(embeddedKafka.getEmbeddedKafka());
        senderProps.put("key.serializer", LongSerializer.class);
        senderProps.put("value.serializer", SpecificAvroSerializer.class);
        senderProps.put("schema.registry.url", "mock://localtest");
        AvroFileParser fileParser = new AvroFileParser();
        DefaultKafkaProducerFactory<Long, Test1> pf = new DefaultKafkaProducerFactory<>(senderProps);
        KafkaTemplate<Long, Test1> template = new KafkaTemplate<>(pf, true);
        Test1 test1 = fileParser.parseTest1("src/test/resources/mocks/test1.json");

        template.send(INPUT_TOPIC, 123456L, test1);
        System.out.println("produced");

        Map<String, Object> consumer1Props = KafkaTestUtils.consumerProps("testConsumer1", "false", embeddedKafka.getEmbeddedKafka());
        consumer1Props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        consumer1Props.put("key.deserializer", LongDeserializer.class);
        consumer1Props.put("value.deserializer", SpecificAvroDeserializer.class);
        consumer1Props.put("schema.registry.url", "mock://localtest");
        DefaultKafkaConsumerFactory<Long, Test1> cf = new DefaultKafkaConsumerFactory<>(consumer1Props);
        Consumer<Long, Test1> consumer1 = cf.createConsumer();
        consumer1.subscribe(Collections.singleton(INPUT_TOPIC));

        ConsumerRecords<Long, Test1> records = consumer1.poll(Duration.ofSeconds(10));
        consumer1.commitSync();
        System.out.println("records count?");
        System.out.println("" + records.count());

        Test1 fetchedTest1;
        fetchedTest1 = records.iterator().next().value();
        assertThat(records.count()).isEqualTo(1);
        System.out.println("found record");
        System.out.println(fetchedTest1.toString());

        Map<String, Object> consumer2Props = KafkaTestUtils.consumerProps("testConsumer2", "false", embeddedKafka.getEmbeddedKafka());
        consumer2Props.put("key.deserializer", StringDeserializer.class);
        consumer2Props.put("value.deserializer", TestAvroDeserializer.class);
        consumer2Props.put("schema.registry.url", "mock://localtest");
        DefaultKafkaConsumerFactory<String, Test2> consumer2Factory = new DefaultKafkaConsumerFactory<>(consumer2Props);
        Consumer<String, Test2> consumer2 = consumer2Factory.createConsumer();
        consumer2.subscribe(Collections.singleton(OUTPUT_TOPIC));

        ConsumerRecords<String, Test2> records2 = consumer2.poll(Duration.ofSeconds(30));
        consumer2.commitSync();

        if (records2.iterator().hasNext()) {
            System.out.println("has next");
        } else {
            System.out.println("has no next");
        }
    }
}
I receive the following exception when trying to consume and deserialize from topic2:
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro unknown schema for id 0
Caused by: java.io.IOException: Cannot get schema from schema registry!
at io.confluent.kafka.schemaregistry.client.MockSchemaRegistryClient.getSchemaBySubjectAndIdFromRegistry(MockSchemaRegistryClient.java:193) ~[kafka-schema-registry-client-6.2.0.jar:na]
at io.confluent.kafka.schemaregistry.client.MockSchemaRegistryClient.getSchemaBySubjectAndId(MockSchemaRegistryClient.java:249) ~[kafka-schema-registry-client-6.2.0.jar:na]
at io.confluent.kafka.schemaregistry.client.MockSchemaRegistryClient.getSchemaById(MockSchemaRegistryClient.java:232) ~[kafka-schema-registry-client-6.2.0.jar:na]
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer$DeserializationContext.schemaFromRegistry(AbstractKafkaAvroDeserializer.java:307) ~[kafka-avro-serializer-6.2.0.jar:na]
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:107) ~[kafka-avro-serializer-6.2.0.jar:na]
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:86) ~[kafka-avro-serializer-6.2.0.jar:na]
at io.confluent.kafka.serializers.KafkaAvroDeserializer.deserialize(KafkaAvroDeserializer.java:55) ~[kafka-avro-serializer-6.2.0.jar:na]
at org.apache.kafka.common.serialization.Deserializer.deserialize(Deserializer.java:60) ~[kafka-clients-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.SourceNode.deserializeKey(SourceNode.java:54) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.RecordDeserializer.deserialize(RecordDeserializer.java:65) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.RecordQueue.updateHead(RecordQueue.java:176) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.RecordQueue.addRawRecords(RecordQueue.java:112) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.PartitionGroup.addRawRecords(PartitionGroup.java:185) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamTask.addRecords(StreamTask.java:895) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.TaskManager.addRecordsToTasks(TaskManager.java:1008) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.pollPhase(StreamThread.java:812) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:625) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:564) ~[kafka-streams-2.7.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:523) ~[kafka-streams-2.7.1.jar:na]
No message is consumed.
So I tried to override the SpecificAvroSerde, register the schemas directly, and use this deserializer:
public class TestAvroDeserializer<T extends org.apache.avro.specific.SpecificRecord>
        extends SpecificAvroDeserializer<T> implements Deserializer<T> {

    private final KafkaAvroDeserializer inner;

    public TestAvroDeserializer() throws IOException, RestClientException {
        MockSchemaRegistryClient mockedClient = new MockSchemaRegistryClient();
        Schema.Parser parser = new Schema.Parser();
        Schema test2Schema = parser.parse(new File("./src/main/resources/avro/test2.avsc"));
        mockedClient.register("test2-value", test2Schema, 1, 0);
        inner = new KafkaAvroDeserializer(mockedClient);
    }

    /**
     * For testing purposes only.
     */
    TestAvroDeserializer(final SchemaRegistryClient client) throws IOException, RestClientException {
        MockSchemaRegistryClient mockedClient = new MockSchemaRegistryClient();
        Schema.Parser parser = new Schema.Parser();
        Schema test2Schema = parser.parse(new File("./src/main/resources/avro/test2.avsc"));
        mockedClient.register("test2-value", test2Schema, 1, 0);
        inner = new KafkaAvroDeserializer(mockedClient);
    }
}
It does not work with this deserializer either. Does anyone have experience with how to do these tests with EmbeddedKafka and MockSchemaRegistry? Or is there another approach I should use?
I would be very glad if someone could help. Thank you in advance.
I found an appropriate way of integration-testing my topology.
I use the TopologyTestDriver from the kafka-streams-test-utils package.
Include this dependency in Maven:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams-test-utils</artifactId>
    <scope>test</scope>
</dependency>
For the application described in the question, setting up the TopologyTestDriver would look like the following. The code is written sequentially just to show how it works.
@Test
void test() {
    keySerde.configure(Map.of(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "mock://schemas"), true);
    valueSerdeTopic1.configure(Map.of(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "mock://schemas"), false);
    valueSerdeTopic2.configure(Map.of(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "mock://schemas"), false);

    final StreamsBuilder builder = new StreamsBuilder();
    Configuration config = new Configuration(); // class where you declare your spring cloud stream functions
    KStream<String, Topic1> input = builder.stream("topic1", Consumed.with(keySerde, valueSerdeTopic1));
    KStream<String, Topic2> output = config.stream1().apply(input);
    output.to("topic2");
    Topology topology = builder.build();

    Properties streamsConfig = new Properties();
    streamsConfig.putAll(Map.of(
            org.apache.kafka.streams.StreamsConfig.APPLICATION_ID_CONFIG, "topology-test-driver",
            org.apache.kafka.streams.StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "ignored",
            KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "mock://schemas",
            org.apache.kafka.streams.StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, PrimitiveAvroSerde.class.getName(),
            org.apache.kafka.streams.StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class.getName()
    ));
    TopologyTestDriver testDriver = new TopologyTestDriver(topology, streamsConfig);

    TestInputTopic<String, Topic1> inputTopic = testDriver.createInputTopic("topic1", keySerde.serializer(), valueSerdeTopic1.serializer());
    TestOutputTopic<String, Topic2> outputTopic = testDriver.createOutputTopic("topic2", keySerde.deserializer(), valueSerdeTopic2.deserializer());

    inputTopic.pipeInput("key", topic1AvroModel); // Write to the input topic, which applies the topology processor of your spring-cloud-stream app
    KeyValue<String, Topic2> outputRecord = outputTopic.readKeyValue(); // Read from the output topic
}
If you write more tests, I recommend abstracting the setup code so you do not repeat yourself for each test; a sketch of one way to do that follows below.
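As a minimal sketch (not from the original answer) of that abstraction, the driver construction can move into JUnit 5 lifecycle methods; buildTopology() and buildStreamsConfig() are hypothetical helpers standing in for the topology and Properties setup shown in the test above.

class StreamTopologyTest {

    private TopologyTestDriver testDriver;

    @BeforeEach
    void setUp() {
        // buildTopology()/buildStreamsConfig() are hypothetical helpers containing
        // the StreamsBuilder and Properties setup from the test above
        testDriver = new TopologyTestDriver(buildTopology(), buildStreamsConfig());
    }

    @AfterEach
    void tearDown() {
        // closing the driver releases its state stores so each test starts clean
        testDriver.close();
    }
}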
I highly recommend this example from the spring-cloud-streams-samples repository; it led me to the solution of using the TopologyTestDriver.

Kafka Consumer: Stop processing messages when exception was raised

I'm a bit confused about the poll() behaviour of (Spring) Kafka after/when stopping the ConcurrentMessageListenerContainer.
What I want to achieve:
Stop the consumer after an exception is raised (for example, a message could not be saved to the database), do not commit the offset, restart the consumer after a given time, and start processing again from the previously failed message.
I read this issue (https://github.com/spring-projects/spring-kafka/issues/451), which says that the container will call the listener with the remaining records from the poll. That means there is no guarantee that a later message from the batch, processed successfully after the failed one, will not commit an offset past the failed message. This could end up in lost/skipped messages.
Is this really the case, and if so, is there a solution without upgrading to newer versions? (A DLQ is not a solution in my case.)
What I already did:
Setting setErrorHandler() and setAckOnError(false):
private Map<String, Object> getConsumerProps(CustomKafkaProps kafkaProps, Class keyDeserializer) {
    Map<String, Object> props = new HashMap<>();

    // Set common props
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaProps.getBootstrapServers());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, kafkaProps.getConsumerGroupId());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // Start with the first message when a new consumer group (app) arrives at the topic
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // We will use "RECORD" AckMode in the Spring Listener Container
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, keyDeserializer);

    if (kafkaProps.isSslEnabled()) {
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
        props.put("ssl.keystore.location", kafkaProps.getKafkaKeystoreLocation());
        props.put("ssl.keystore.password", kafkaProps.getKafkaKeystorePassword());
        props.put("ssl.key.password", kafkaProps.getKafkaKeyPassword());
    }
    return props;
}
Consumer
public ConcurrentMessageListenerContainer<String, byte[]> kafkaReceiverContainer(CustomKafkaProps kafkaProps) throws Exception {
    StoppingErrorHandler stoppingErrorHandler = new StoppingErrorHandler();

    ContainerProperties containerProperties = new ContainerProperties(...);
    containerProperties.setAckMode(AbstractMessageListenerContainer.AckMode.RECORD);
    containerProperties.setAckOnError(false);
    containerProperties.setErrorHandler(stoppingErrorHandler);

    ConcurrentMessageListenerContainer<String, byte[]> container = ...
    container.setConcurrency(1); // use only one container
    stoppingErrorHandler.setConcurrentMessageListenerContainer(container);

    return container;
}
Error Handler
public class StoppingErrorHandler implements ErrorHandler {

    @Setter
    private ConcurrentMessageListenerContainer concurrentMessageListenerContainer;

    @Value("${backends.kafka.consumer.halt.timeout}")
    int consumerHaltTimeout;

    @Override
    public void handle(Exception thrownException, ConsumerRecord<?, ?> record) {
        if (concurrentMessageListenerContainer != null) {
            concurrentMessageListenerContainer.stop();
        }
        new Timer().schedule(new TimerTask() {
            @Override
            public void run() {
                if (concurrentMessageListenerContainer != null && !concurrentMessageListenerContainer.isRunning()) {
                    concurrentMessageListenerContainer.start();
                }
            }
        }, consumerHaltTimeout);
    }
}
What I'm using:
<dependency>
    <groupId>org.springframework.integration</groupId>
    <artifactId>spring-integration-kafka</artifactId>
    <version>2.1.2.RELEASE</version>
</dependency>
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
    <version>1.1.7.RELEASE</version>
</dependency>
"without upgrading the newer versions?"
Spring Kafka 2.1 introduced the ContainerStoppingErrorHandler, which is a ContainerAwareErrorHandler; the remaining unconsumed messages are discarded (and will be re-fetched when the container is restarted).
With earlier versions, your listener will need to reject (fail) the remaining messages in the batch (or set max.poll.records=1).
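For illustration, here is a minimal sketch of wiring the 2.1+ ContainerStoppingErrorHandler into a container setup like the one in the question; the exact place where the error handler is registered has moved between spring-kafka versions, so treat this as an outline rather than a drop-in replacement.

ContainerProperties containerProperties = new ContainerProperties(...); // same topics as in the question
containerProperties.setAckMode(AbstractMessageListenerContainer.AckMode.RECORD);
containerProperties.setAckOnError(false);
// Stops the container on the first listener exception; the unconsumed remainder of the
// current poll is discarded and re-fetched after the container is restarted.
containerProperties.setErrorHandler(new ContainerStoppingErrorHandler());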

Way to read data from Kafka headers in Apache Flink

I have a project where I am consuming data from Kafka. Apparently, there are a couple of fields that are going to be included in the headers that I will need to read as well for each message. Is there a way to do this in Flink currently?
Thanks!
@Jicaar, Kafka has actually had the header notion since version 0.11.0.0: https://issues.apache.org/jira/browse/KAFKA-4208
The problem is that flink-connector-kafka-0.11_2.11, which comes with flink-1.4.0 and supposedly supports kafka-0.11.0.0, simply ignores message headers when reading from Kafka.
So unfortunately there is no way to read those headers unless you implement your own KafkaConsumer in Flink.
I'm also interested in reading Kafka message headers and hope the Flink team will add support for this.
I faced a similar issue and found a way to do this in Flink 1.8. Here is what I wrote:
FlinkKafkaConsumer<ObjectNode> consumer = new FlinkKafkaConsumer("topic", new JSONKeyValueDeserializationSchema(true) {
    ObjectMapper mapper = new ObjectMapper();

    @Override
    public ObjectNode deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        ObjectNode result = super.deserialize(record);
        if (record.headers() != null) {
            Map<String, JsonNode> headers = StreamSupport.stream(record.headers().spliterator(), false)
                    .collect(Collectors.toMap(h -> h.key(), h -> (JsonNode) this.mapper.convertValue(new String(h.value()), JsonNode.class)));
            result.set("headers", mapper.convertValue(headers, JsonNode.class));
        }
        return result;
    }
}, kafkaProps);
Hope this helps!
Here's the code for newer versions of Flink:
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(ParameterConfig.parameters.getRequired(ParameterConstant.KAFKA_ADDRESS))
        .setTopics(ParameterConfig.parameters.getRequired(ParameterConstant.KAFKA_SOURCE_TOPICS))
        .setGroupId(ParameterConfig.parameters.getRequired(ParameterConstant.KAFKA_SOURCE_GROUPID))
        .setStartingOffsets(OffsetsInitializer.latest())
        .setDeserializer(new KafkaRecordDeserializationSchema<String>() {
            @Override
            public void deserialize(ConsumerRecord<byte[], byte[]> consumerRecord, Collector<String> collector) {
                try {
                    Map<String, String> headers = StreamSupport
                            .stream(consumerRecord.headers().spliterator(), false)
                            .collect(Collectors.toMap(Header::key, h -> new String(h.value())));
                    collector.collect(new JSONObject(headers).toString());
                } catch (Exception e) {
                    e.printStackTrace();
                    log.error("Headers Not found in Kafka Stream with consumer record : {}", consumerRecord);
                }
            }

            @Override
            public TypeInformation<String> getProducedType() {
                return TypeInformation.of(new TypeHint<>() {});
            }
        })
        .build();

Getting error when invoking Elasticsearch from Spark

I have a use case where I need to read messages from Kafka and, for each message, extract data and invoke an Elasticsearch index. The response will then be used for further processing.
I am getting the below error when invoking JavaEsSpark.esJsonRDD:
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below
public static void main(String[] args) {
    if (args.length < 4) {
        System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
        System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");

    // Setting when using JavaEsSpark.esJsonRDD
    sparkConf.set("es.nodes", <NODE URL>);
    sparkConf.set("es.nodes.wan.only", "true");

    context = new JavaSparkContext(sparkConf);

    // Create the context with 2 seconds batch size
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

    int numThreads = Integer.parseInt(args[3]);
    Map<String, Integer> topicMap = new HashMap<>();
    String[] topics = args[2].split(",");
    for (String topic : topics) {
        topicMap.put(topic, numThreads);
    }

    // Receive messages from Kafka
    JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

    JavaDStream<String> jsons = messages
            .map(new Function<Tuple2<String, String>, String>() {
                private static final long serialVersionUID = 1L;

                @Override
                public String call(Tuple2<String, String> tuple2) {
                    JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>, <search string>).values();
                    return null;
                }
            });

    jsons.print();
    jssc.start();
    jssc.awaitTermination();
}
I am getting the error when invoking JavaEsSpark.esJsonRDD. Is this the correct way to do it? How do I successfully invoke ES from Spark?
I am running Kafka and Spark on Windows and invoking an external Elasticsearch index.

Spark Streaming + kafka "JobGenerator" java.lang.NoSuchMethodError

I'm new to Spark Streaming and Kafka and I don't understand this runtime exception. I've already set up the Kafka server.
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.streaming.scheduler.InputInfoTracker.reportInfo(Lorg/apache/spark/streaming/Time;Lorg/apache/spark/streaming/scheduler/StreamInputInfo;)V
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:166)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
and this is my code
public class TwitterStreaming {

    // setup kafka :
    public static final String ZKQuorum = "localhost:2181";
    public static final String ConsumerGroupID = "ingi2145-analytics";
    public static final String ListTopics = "newTweet";
    public static final String ListBrokers = "localhost:9092"; // I'm not sure about ...

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        // Location of the Spark directory
        String sparkHome = "usr/local/spark";

        // URL of the Spark cluster
        String sparkUrl = "local[4]";

        // Location of the required JAR files
        String jarFile = "target/analytics-1.0.jar";

        // Generating spark's streaming context
        JavaStreamingContext jssc = new JavaStreamingContext(
                sparkUrl, "Streaming", new Duration(1000), sparkHome, new String[]{jarFile});

        // Start kafka stream
        HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(ListTopics.split(",")));
        HashMap<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", ListBrokers);

        //JavaPairReceiverInputDStream<String, String> kafkaStream = KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroupID, mapPartitionsPerTopics);

        // Create direct kafka stream with brokers and topics
        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
                jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams,
                topicsSet
        );

        // get the json file :
        JavaDStream<String> json = messages.map(
                new Function<Tuple2<String, String>, String>() {
                    public String call(Tuple2<String, String> tuple2) {
                        return tuple2._2();
                    }
                });
The aim of this project is to compute the 10 best hashtags from a Twitter stream using a Kafka queue. The code was working without Kafka.
Do you have an idea of what the problem is?
I had the same issue, and it was the version of Spark I was using. I was using 1.5, then tried 1.4, and ultimately the version that worked for me was 1.6.
So, please make sure that the Kafka version you are using is compatible with the Spark version.
In my case, I'm using Kafka version 2.10-0.10.1.1 with spark-1.6.0-bin-hadoop2.3.
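For example, one matching combination for a Spark 1.6 setup like the one above is the Kafka 0.8 connector built for the same Spark version; the versions below are illustrative, so check them against your own build.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.0</version>
</dependency>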
Also (very important), make sure you are not getting any "forbidden" errors in your log files. You have to assign the proper security grants to the folders used by Spark; otherwise you may receive a lot of errors that have nothing to do with the application itself but with an improper security setup.
