I have a use case, where I need to read messages from kafka and for each message, extract data and invoke elasticsearch Index. The response will be further used to do further processing.
I am getting below error when invoking JavaEsSpark.esJsonRDD
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below
public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
//Setting when using JavaEsSpark.esJsonRDD
sparkConf.set("es.nodes",<NODE URL>);
sparkConf.set("es.nodes.wan.only","true");
context = new JavaSparkContext(sparkConf);
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
//Receive Message From kafka
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc,args[0], args[1], topicMap);
JavaDStream<String> jsons = messages
.map(new Function<Tuple2<String, String>, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public String call(Tuple2<String, String> tuple2){
JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>,<search string> ).values() ;
return null;
}
});
jsons.print();
jssc.start();
jssc.awaitTermination();
}
I am getting error when invoking JavaEsSpark.esJsonRDD. Is it correct way to do it? How do I successfully invoke ES from spark?
I am running kafka and spark on windows and invoking external elastic search index.
Related
I'm using Apache Kafka Stream where I added a transform in my stream
final StreamsBuilder streamsBuilder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<String, byte[]>> correlationStore =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(STORE_NAME),
Serdes.String(),
Serdes.ByteArray());
streamsBuilder.addStateStore(correlationStore);
streamsBuilder.stream(topicName, inputConsumed)
.peek(InboundPendingMessageStreamer::logEntries)
.transform(() -> new CleanerTransformer<String, byte[], KeyValue<String, byte[]>>(Duration.ofMillis(5000), STORE_NAME), STORE_NAME)
.toTable();
I'm having difficulties to understand the CleanerTransformer Transformer class that I create, where in the init method, I set a schedule with a scanFrequency and a PunctuationType.
#Override
public void init(ProcessorContext context) {
this.stateStore = context.getStateStore(purgeStoreName);
context.schedule(scanFrequency, PunctuationType.STREAM_TIME, timestamp -> {
try (final KeyValueIterator<K, byte[]> all = stateStore.all()) {
while (all.hasNext()) {
final var headers = context.headers();
final KeyValue<K, byte[]> record = all.next();
}
}
});
}
Adding an event in the stream, I got the message in the schedule callback, but it's only executed once.
My understanding was, that it should be executed every time configured in the scanFrequency.
Any idea what I'm doing wrong here?
I have a Kafka producer class which works fine. The producer fills the Kafka topic. Its code is in following:
public class kafka_test {
private final static String TOPIC = "flinkTopic";
private final static String BOOTSTRAP_SERVERS = "10.32.0.2:9092,10.32.0.3:9092,10.32.0.4:9092";
public FlinkKafkaConsumer<String> createStringConsumerForTopic(
String topic, String kafkaAddress, String kafkaGroup) {
// ************************** KAFKA Properties ******
Properties props = new Properties();
props.setProperty("bootstrap.servers", kafkaAddress);
props.setProperty("group.id", kafkaGroup);
FlinkKafkaConsumer<String> myconsumer = new FlinkKafkaConsumer<>(
topic, new SimpleStringSchema(), props);
myconsumer.setStartFromLatest();
return myconsumer;
}
private static Producer<Long, String> createProducer() {
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
props.put(ProducerConfig.CLIENT_ID_CONFIG, "MyKafkaProducer");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
return new KafkaProducer<>(props);
}
public void runProducer(String msg) throws Exception {
final Producer<Long, String> producer = createProducer();
try {
final ProducerRecord<Long, String> record = new ProducerRecord<>(TOPIC, msg );
RecordMetadata metadata = producer.send(record).get();
System.out.printf("sent record(key=%s value='%s')" + " metadata(partition=%d, offset=%d)\n",
record.key(), record.value(), metadata.partition(), metadata.offset());
} finally {
producer.flush();
producer.close();
}
}
}
public class producerTest {
public static void main(String[] args) throws Exception{
kafka_test objKafka=new kafka_test();
String pathFile="/home/cfms11/IdeaProjects/pooyaflink2/KafkaTest/quickstart/lastDay4.csv";
String delimiter="\n";
objKafka.createStringProducer("flinkTopic",
"10.32.0.2:9092,10.32.0.3:9092,10.32.0.4:9092");
Scanner scanner = new Scanner(new File(pathFile));
scanner.useDelimiter(delimiter);
int i=0;
while(scanner.hasNext()){
if (i==0)
TimeUnit.MINUTES.sleep(1);
objKafka.runProducer(scanner.next());
i++;
}
scanner.close();
}
}
Because, I want to provide data for my Flink program, so, I use Kafka. In fact, I have this part code to consume data from Kafka topic:
Properties props = new Properties();
props.setProperty("bootstrap.servers",
"10.32.0.2:9092,10.32.0.3:9092,10.32.0.4:9092");
props.setProperty("group.id", kafkaGroup);
FlinkKafkaConsumer<String> myconsumer = new FlinkKafkaConsumer<>(
"flinkTopic", new SimpleStringSchema(), props);
DataStream<String> text = env.addSource(myconsumer).setStartFromEarliest());
I want to run Producer code at the same time that my program is running. My goal is that Producer send one record to the topic and consumer can poll that record from topic at the same time.
Would you please tell me how it is possible and how to manage it.
I think you need create two class file, one is the producer, the other is the consumer. Create topic first and then run the consumer, or run the producer directly.
I am new in spark streaming programming please someone explain for me what is the problem
I thing that that i iterate a null structure but i have a producer class which works normally
my source code :
public class Main3 implements java.io.Serializable {
public static JavaDStream<Double> pr;
public void consumer() throws Exception{
// Configure Spark to connect to Kafka running on local machine
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG,"group1");
kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"latest");
kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG,true);
Collection<String> topics = Arrays.asList("testing");
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka10WordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
final JavaInputDStream<ConsumerRecord<String, String>> receiver=
KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String,String>Subscribe(topics,kafkaParams));
JavaDStream<String> stream = receiver.map(new Function<ConsumerRecord<String,String>, String>() {
#Override
public String call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
return kafkaRecord.value();
}
});
stream.foreachRDD( x->x.saveAsTextFile("/home/khouloud/Desktop/exemple/b")); //that does no do any thing
stream.foreachRDD( x-> {
x.collect().stream().forEach(n-> System.out.println("item of list: "+n));
}); // also this i see any thing in the console
stream.foreachRDD( rdd -> {
if (rdd.isEmpty()) System.out.println("its empty"); }); //nothing`
JavaPairDStream<Integer, List<Double>> points= stream.mapToPair(new PairFunction<String, Integer, List<Double>>(){
#Override
public Tuple2<Integer, List<Double>> call(String x) throws Exception {
String[] item = x.split(" ");
List<Double> l = new ArrayList<Double>();
for (int i= 1 ; i < item.length ; i++)
{
l.add(new Double(item[i]));
}
return new Tuple2<>(new Integer(item[0]), l);
}}
);`
Error -
`org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294) at
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:546)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$class.mapToPair(JavaDStreamLike.scala:163)
at
org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapToPair(JavaDStreamLike.scala:42)
at Min.calculDegSim(Min.java:43) at SkyRule.execute(SkyRule.java:34)
at Main3.consumer(Main3.java:159) at
Executer$2.run(Executer.java:27) at
java.lang.Thread.run(Thread.java:748) Caused by:
java.io.NotSerializableException: Graph is unexpectedly null when
DStream is being serialized. Serialization stack:
at
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)
I think I have an interesting question for all of you today. In the code below you will notice I have two SparkContexts one for SparkStreaming and the other one which is a normal SparkContext. According to best practices you should only have one SparkContext in a Spark application even though its possible to circumvent this via allowMultipleContexts in the configuration.
Problem is, I need to retrieve data from hive and from a Kafka topic to do some logic, and whenever I submit my application it obviously returns "Cannot have 2 Spark Contexts Running on JVM".
My question is, is there a correct way to do this than how I am doing it right now?
public class MainApp {
private final String logFile= Properties.getString("SparkLogFileDir");
private static final String KAFKA_GROUPID = Properties.getString("KafkaGroupId");
private static final String ZOOKEEPER_URL = Properties.getString("ZookeeperURL");
private static final String KAFKA_BROKER = Properties.getString("KafkaBroker");
private static final String KAFKA_TOPIC = Properties.getString("KafkaTopic");
private static final String Database = Properties.getString("HiveDatabase");
private static final Integer KAFKA_PARA = Properties.getInt("KafkaParrallel");
public static void main(String[] args){
//set settings
String sql="";
//START APP
System.out.println("Starting NPI_TWITTERAPP...." + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Configuring Settings...."+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
SparkConf conf = new SparkConf()
.setAppName(Properties.getString("SparkAppName"))
.setMaster(Properties.getString("SparkMasterUrl"));
//Set Spark/hive/sql Context
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
JavaHiveContext HiveSqlContext = new JavaHiveContext(sc);
//Check if Twitter Hive Table Exists
try {
HiveSqlContext.sql("DROP TABLE IF EXISTS "+Database+"TWITTERSTORE");
HiveSqlContext.sql("CREATE TABLE IF NOT EXISTS "+Database+".TWITTERSTORE "
+" (created_at String, id String, id_str String, text String, source String, truncated String, in_reply_to_user_id String, processed_at String, lon String, lat String)"
+" STORED AS TEXTFILE");
}catch(Exception e){
System.out.println(e);
}
//Check if Ivapp Table Exists
sql ="CREATE TABLE IF NOT EXISTS "+Database+".IVAPPGEO AS SELECT DISTINCT a.LATITUDE, a.LONGITUDE, b.ODNCIRCUIT_OLT_CLLI, b.ODNCIRCUIT_OLT_TID, a.CITY, a.STATE, a.ZIP FROM "
+Database+".T_PONNMS_SERVICE B, "
+Database+".CLLI_LATLON_MSTR A WHERE a.BID_CLLI = substr(b.ODNCIRCUIT_OLT_CLLI,0,8)";
try {
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
HiveSqlContext.sql(sql);
sql = "SELECT LATITUDE, LONGITUDE, ODNCIRCUIT_OLT_CLLI, ODNCIRCUIT_OLT_TID, CITY, STATE, ZIP FROM "+Database+".IVAPPGEO";
JavaSchemaRDD RDD_IVAPPGEO = HiveSqlContext.sql(sql).cache();
}catch(Exception e){
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
}
//JavaHiveContext hc = new JavaHiveContext();
System.out.println("Retrieve Data from Kafka Topic: "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put(KAFKA_TOPIC,KAFKA_PARA);
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
jssc, KAFKA_GROUPID, ZOOKEEPER_URL, topicMap);
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
private static final long serialVersionUID = 42l;
#Override
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
System.out.println("Completed Kafka Messages... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaPairDStream<Long, String> tweets = json.mapToPair(
new TwitterFilterFunction());
JavaPairDStream<Long, String> filtered = tweets.filter(
new Function<Tuple2<Long, String>, Boolean>() {
private static final long serialVersionUID = 42l;
#Override
public Boolean call(Tuple2<Long, String> tweet) {
return tweet != null;
}
}
);
JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
new TextFilterFunction());
tweetsFiltered = tweetsFiltered.map(
new StemmingFunction());
System.out.println("Finished Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
//calculate postive tweets
JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
tweetsFiltered.mapToPair(new PositiveScoreFunction());
//calculate negative tweets
JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
tweetsFiltered.mapToPair(new NegativeScoreFunction());
JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
positiveTweets.join(negativeTweets);
//Score tweets
JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
joined.map(new Function<Tuple2<Tuple2<Long, String>,
Tuple2<Float, Float>>,
Tuple4<Long, String, Float, Float>>() {
private static final long serialVersionUID = 42l;
#Override
public Tuple4<Long, String, Float, Float> call(
Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
{
return new Tuple4<Long, String, Float, Float>(
tweet._1()._1(),
tweet._1()._2(),
tweet._2()._1(),
tweet._2()._2());
}
});
System.out.println("Finished Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Outputting Tweets Data to flat file "+Properties.getString("HdfsOutput")+" ... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
scoredTweets.map(new ScoreTweetsFunction());
result.foreachRDD(new FileWriter());
System.out.println("Outputting Sentiment Data to Hive... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
jssc.start();
jssc.awaitTermination();
}
}
Creating SparkContext
You can create a SparkContext instance with or without creating a SparkConf object first.
Getting Existing or Creating New SparkContext (getOrCreate methods)
getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext
SparkContext.getOrCreate methods allow you to get the existing SparkContext or create a new one.
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)
Refer Here - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html
Apparently if I use sc.close() to close the original SparkContext before executing JavaStreaming Context it works perfectly, no errors or issues.
you can use a singleton object ContextManager which would handle which context to provide.
public class ContextManager {
private static JavaSparkContext context;
private static String currentType;
private ContextManager() {}
public static JavaSparkContext getContext(String type) {
if(type == currentType && context != null) {
return context;
}
else if (type == "streaming"){
.. clean up the current context ..
.. initialize the context to streaming context ..
currentType = type;
}
else {
..clean up the current context..
... initialize the context to normal context ..
currentType = type;
}
return context;
}
}
There are some issues like in projects where you switch context quite rapidly the overhead would be quite large.
You can access the SparkContext from your JavaStreamingSparkContext, and use that reference when creating additional contexts.
SparkConf sparkConfig = new SparkConf().setAppName("foo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConfig, Duration.seconds(30));
SqlContext sqlContext = new SqlContext(jssc.sparkContext());
I'm new in spark streaming and kafka and I don't understand this runtime exception. I've already setup the kafka server.
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.streaming.scheduler.InputInfoTracker.reportInfo(Lorg/apache/spark/streaming/Time;Lorg/apache/spark/streaming/scheduler/StreamInputInfo;)V
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:166)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
and this is my code
public class TwitterStreaming {
// setup kafka :
public static final String ZKQuorum = "localhost:2181";
public static final String ConsumerGroupID = "ingi2145-analytics";
public static final String ListTopics = "newTweet";
public static final String ListBrokers = "localhost:9092"; // I'm not sure about ...
#SuppressWarnings("deprecation")
public static void main(String[] args) throws Exception {
// Location of the Spark directory
String sparkHome = "usr/local/spark";
// URL of the Spark cluster
String sparkUrl = "local[4]";
// Location of the required JAR files
String jarFile = "target/analytics-1.0.jar";
// Generating spark's streaming context
JavaStreamingContext jssc = new JavaStreamingContext(
sparkUrl, "Streaming", new Duration(1000), sparkHome, new String[]{jarFile});
// Start kafka stream
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(ListTopics.split(",")));
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", ListBrokers);
//JavaPairReceiverInputDStream<String, String> kafkaStream = KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroupID, mapPartitionsPerTopics);
// Create direct kafka stream with brokers and topics
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
// get the json file :
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
The aim of this project is to compute the 10 bests hashtags from a twitter stream by using a kafka queue. The code was working without kakfa.
Have you got an idea of what's the problem ?
I had the same issue and it was the version of spark I was using. I was using 1.5, then used 1.4 and ultimately the version that worked for me was 1.6.
So, please make sure that the Kafka version you are using is compatible with with the Spark version.
In my case, I'm using Kafka version 2.10-0.10.1.1 with spark-1.6.0-bin-hadoop2.3.
Also, (very important) make sure you are not getting any forbidden error in your log files. You have to assign the proper security grants to folders used by spark, otherwise you may receive a lot of errors that has nothing to do with the application itself but with improper security setup.