Kafka streams KTable creation from a given topic - java

I'm doing a project and I'm stuck on the KTable.
I want to take records from a topic and put them in a KTable(store), so that I have 1 record for 1 key.
static KafkaStreams streams;
final Serde<Long> longSerde = Serdes.Long();
final Serde<byte[]> byteSerde = Serdes.ByteArray();
static String topicName;
static String storeName;
final StreamsBuilder builder = new StreamsBuilder();
KStream<Long, byte[]> streamed = builder.stream(topicName, Consumed.with(longSerde, byteSerde));
KTable<Long, byte[]> records = streamed.groupByKey().reduce(
new Reducer<Long>() {
#Override
public Long apply(Long aggValue, Long newValue) {
return newValue;
}
},
storeName);
This is the closest I got to the answer I think.

Your approach is correct, but you need to use the correct serdes.
In .reduce() function, value type should be byte[].
KStream<Long, byte[]> streamed = builder.stream(topicName, Consumed.with(longSerde, byteSerde));
KTable<Long, byte[]> records = streamed.groupByKey().reduce(
new Reducer<byte[]>() {
#Override
public byte[] apply(byte[] aggValue, byte[] newValue) {
return newValue;
}
},
Materialized.as(storename).with(longSerde,byteSerde));

Related

ProcessorContext schedule only executed once

I'm using Apache Kafka Stream where I added a transform in my stream
final StreamsBuilder streamsBuilder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<String, byte[]>> correlationStore =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(STORE_NAME),
Serdes.String(),
Serdes.ByteArray());
streamsBuilder.addStateStore(correlationStore);
streamsBuilder.stream(topicName, inputConsumed)
.peek(InboundPendingMessageStreamer::logEntries)
.transform(() -> new CleanerTransformer<String, byte[], KeyValue<String, byte[]>>(Duration.ofMillis(5000), STORE_NAME), STORE_NAME)
.toTable();
I'm having difficulties to understand the CleanerTransformer Transformer class that I create, where in the init method, I set a schedule with a scanFrequency and a PunctuationType.
#Override
public void init(ProcessorContext context) {
this.stateStore = context.getStateStore(purgeStoreName);
context.schedule(scanFrequency, PunctuationType.STREAM_TIME, timestamp -> {
try (final KeyValueIterator<K, byte[]> all = stateStore.all()) {
while (all.hasNext()) {
final var headers = context.headers();
final KeyValue<K, byte[]> record = all.next();
}
}
});
}
Adding an event in the stream, I got the message in the schedule callback, but it's only executed once.
My understanding was, that it should be executed every time configured in the scanFrequency.
Any idea what I'm doing wrong here?

Trigger one Kafka consumer by using values of another consumer In Spring Kafka

I have one scheduler which produces one event. My consumer consumes this event. The payload of this event is a json with below fields:
private String topic;
private String partition;
private String filterKey;
private long CustId;
Now I need to trigger one more consumer which will take all this information which I get a response from first consumer.
#KafkaListener(topics = "<**topic-name-from-first-consumer-response**>", groupId = "group" containerFactory = "kafkaListenerFactory")
public void consumeJson(List<User> data, Acknowledgment acknowledgment,
#Header(KafkaHeaders.RECEIVED_PARTITION_ID) List<Integer> partitions,
#Header(KafkaHeaders.OFFSET) List<Long> offsets) {
// consumer code goes here...}
I need to create some dynamic variable which I can pass in place of topic name.
similarly, I am using the filtering in the configuration file and I need to pass key dynamically in the configuration.
factory.setRecordFilterStrategy(new RecordFilterStrategy<String, Object>() {
#Override
public boolean filter(ConsumerRecord<String, Object> consumerRecord) {
if(consumerRecord.key().equals("**Key will go here**")) {
return false;
}
else {
return true;
}
}
});
How can we dynamically inject these values from the response of first consumer and trigger the second consumer. Both the consumers are in same application
You cannot do that with an annotated listener, the configuration is only used during initialization; you would need to create a listener container yourself (using the ConcurrentKafkaListenerContainerFactory) to dynamically create a listener.
EDIT
Here's an example.
#SpringBootApplication
public class So69134055Application {
public static void main(String[] args) {
SpringApplication.run(So69134055Application.class, args);
}
#Bean
public NewTopic topic() {
return TopicBuilder.name("so69134055").partitions(1).replicas(1).build();
}
}
#Component
class Listener {
private static final Logger log = LoggerFactory.getLogger(Listener.class);
private static final Method otherListen;
static {
try {
otherListen = Listener.class.getDeclaredMethod("otherListen", List.class);
}
catch (NoSuchMethodException | SecurityException ex) {
throw new IllegalStateException(ex);
}
}
private final ConcurrentKafkaListenerContainerFactory<String, String> factory;
private final MessageHandlerMethodFactory methodFactory;
private final KafkaAdmin admin;
private final KafkaTemplate<String, String> template;
public Listener(ConcurrentKafkaListenerContainerFactory<String, String> factory, KafkaAdmin admin,
KafkaTemplate<String, String> template, KafkaListenerAnnotationBeanPostProcessor<?, ?> bpp) {
this.factory = factory;
this.admin = admin;
this.template = template;
this.methodFactory = bpp.getMessageHandlerMethodFactory();
}
#KafkaListener(id = "so69134055", topics = "so69134055")
public void listen(String topicName) {
try (AdminClient client = AdminClient.create(this.admin.getConfigurationProperties())) {
NewTopic topic = TopicBuilder.name(topicName).build();
client.createTopics(List.of(topic)).all().get(10, TimeUnit.SECONDS);
}
catch (Exception e) {
log.error("Failed to create topic", e);
}
ConcurrentMessageListenerContainer<String, String> container =
this.factory.createContainer(new TopicPartitionOffset(topicName, 0));
BatchMessagingMessageListenerAdapter<String, String> adapter =
new BatchMessagingMessageListenerAdapter<>(this, otherListen);
adapter.setHandlerMethod(new HandlerAdapter(
this.methodFactory.createInvocableHandlerMethod(this, otherListen)));
FilteringBatchMessageListenerAdapter<String, String> filtered =
new FilteringBatchMessageListenerAdapter<>(adapter, record -> !record.key().equals("foo"));
container.getContainerProperties().setMessageListener(filtered);
container.getContainerProperties().setGroupId("group.for." + topicName);
container.setBeanName(topicName + ".container");
container.start();
IntStream.range(0, 10).forEach(i -> this.template.send(topicName, 0, i % 2 == 0 ? "foo" : "bar", "test" + i));
}
void otherListen(List<String> others) {
log.info("Others: {}", others);
}
}
spring.kafka.consumer.auto-offset-reset=earliest
Output - showing that the filter was applied to the records with bar in the key.
Others: [test0, test2, test4, test6, test8]

Process and check event using kafka-streams during some period

I have a KStream eventsStream, which is get data from a topic "events".
There is two type of events, their keys:
1. {user_id = X, event_id = 1} {..value, include time_event...}
2. {user_id = X, event_id = 2} {..value, include time_event...}
I need to migrate events with event_id = 1 to a topic "results" if during 10 minutes there is not given an event with event_id = 2 by user.
For example,
1. First case: we get data {user_id = 100, event_id = 1} {.. time_event = xxxx ...} and no events during 10 minutes {user_id = 100, event_id = 2} {.. time_event = xxxx + 10 minutes...}, so we'll write it to results-topic
2. Second case: we get data {user_id = 100, event_id = 1} {.. time_event = xxxx ...} and an event during 10 minutes {user_id = 100, event_id = 2} {.. time_event = xxxx + 5 minutes...}, so we'll not write it to results-topic
How does it possible to realise in java code this behavior using kafka-streams?
My code:
public class ResultStream {
public static KafkaStreams newStream() {
Properties properties = Config.getProperties("ResultStream");
Serde<String> stringSerde = Serdes.String();
StreamsBuilder builder = new StreamsBuilder();
StoreBuilder<KeyValueStore<String, String>> store =
Stores.keyValueStoreBuilder(
Stores.inMemoryKeyValueStore("inmemory"),
stringSerde,
stringSerde
);
builder.addStateStore(store);
KStream<String, String> resourceEventStream = builder.stream(EVENTS.topicName(), Consumed.with(stringSerde, stringSerde));
resourceEventStream.print(Printed.toSysOut());
resourceEventStream.process(() -> new CashProcessor("inmemory"), "inmemory");
resourceEventStream.process(() -> new FilterProcessor("inmemory", resourceEventStream), "inmemory");
Topology topology = builder.build();
return new KafkaStreams(topology, properties);
}
}
public class FilterProcessor implements Processor {
private ProcessorContext context;
private String eventStoreName;
private KeyValueStore<String, String> eventStore;
private KStream<String, String> stream;
public FilterProcessor(String eventStoreName, KStream<String, String> stream) {
this.eventStoreName = eventStoreName;
this.stream = stream;
}
#Override
public void init(ProcessorContext processorContext) {
this.context = processorContext;
eventStore = (KeyValueStore) processorContext.getStateStore(eventStoreName);
}
#Override
public void process(Object key, Object value) {
this.context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
System.out.println("Scheduler is working");
stream.filter((k, v) -> {
JsonObject events = new Gson().fromJson(k, JsonObject.class);
if (***condition***) {
return true;
}
return false;
}).to("results");
});
}
#Override
public void close() {
}
}
CashProcessor's role only to put events to local store, and delete record with event_id = 1 by user if there is given an event_id = 2 with the same user.
FilterProcess should filter events using local store every minute. But I can't invoke correctly this processing (as I do it in fact)...
I'm really need help.
Why do you pass KStream into your processor? That is not how the DSL works.
As you "connect" your processors via resourceEventStream.process() already, your FilterProcessor#process(key, value) method will be called for each record in the stream automatically -- however, a KStream#process() is a terminal operation and thus does not allow you to send any data downstream. Instead, you might want to use transform() (that is basically the same as process() plus an output KStream).
To actually forward data downstream in your punctuation, you should use context.forward() using the ProcessorContext that is provided via init() method.

Getting error when invoking elasticSearch from spark

I have a use case, where I need to read messages from kafka and for each message, extract data and invoke elasticsearch Index. The response will be further used to do further processing.
I am getting below error when invoking JavaEsSpark.esJsonRDD
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below
public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
//Setting when using JavaEsSpark.esJsonRDD
sparkConf.set("es.nodes",<NODE URL>);
sparkConf.set("es.nodes.wan.only","true");
context = new JavaSparkContext(sparkConf);
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
//Receive Message From kafka
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc,args[0], args[1], topicMap);
JavaDStream<String> jsons = messages
.map(new Function<Tuple2<String, String>, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public String call(Tuple2<String, String> tuple2){
JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>,<search string> ).values() ;
return null;
}
});
jsons.print();
jssc.start();
jssc.awaitTermination();
}
I am getting error when invoking JavaEsSpark.esJsonRDD. Is it correct way to do it? How do I successfully invoke ES from spark?
I am running kafka and spark on windows and invoking external elastic search index.

Spark: Two SparkContexts in a single Application Best Practice

I think I have an interesting question for all of you today. In the code below you will notice I have two SparkContexts one for SparkStreaming and the other one which is a normal SparkContext. According to best practices you should only have one SparkContext in a Spark application even though its possible to circumvent this via allowMultipleContexts in the configuration.
Problem is, I need to retrieve data from hive and from a Kafka topic to do some logic, and whenever I submit my application it obviously returns "Cannot have 2 Spark Contexts Running on JVM".
My question is, is there a correct way to do this than how I am doing it right now?
public class MainApp {
private final String logFile= Properties.getString("SparkLogFileDir");
private static final String KAFKA_GROUPID = Properties.getString("KafkaGroupId");
private static final String ZOOKEEPER_URL = Properties.getString("ZookeeperURL");
private static final String KAFKA_BROKER = Properties.getString("KafkaBroker");
private static final String KAFKA_TOPIC = Properties.getString("KafkaTopic");
private static final String Database = Properties.getString("HiveDatabase");
private static final Integer KAFKA_PARA = Properties.getInt("KafkaParrallel");
public static void main(String[] args){
//set settings
String sql="";
//START APP
System.out.println("Starting NPI_TWITTERAPP...." + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Configuring Settings...."+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
SparkConf conf = new SparkConf()
.setAppName(Properties.getString("SparkAppName"))
.setMaster(Properties.getString("SparkMasterUrl"));
//Set Spark/hive/sql Context
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
JavaHiveContext HiveSqlContext = new JavaHiveContext(sc);
//Check if Twitter Hive Table Exists
try {
HiveSqlContext.sql("DROP TABLE IF EXISTS "+Database+"TWITTERSTORE");
HiveSqlContext.sql("CREATE TABLE IF NOT EXISTS "+Database+".TWITTERSTORE "
+" (created_at String, id String, id_str String, text String, source String, truncated String, in_reply_to_user_id String, processed_at String, lon String, lat String)"
+" STORED AS TEXTFILE");
}catch(Exception e){
System.out.println(e);
}
//Check if Ivapp Table Exists
sql ="CREATE TABLE IF NOT EXISTS "+Database+".IVAPPGEO AS SELECT DISTINCT a.LATITUDE, a.LONGITUDE, b.ODNCIRCUIT_OLT_CLLI, b.ODNCIRCUIT_OLT_TID, a.CITY, a.STATE, a.ZIP FROM "
+Database+".T_PONNMS_SERVICE B, "
+Database+".CLLI_LATLON_MSTR A WHERE a.BID_CLLI = substr(b.ODNCIRCUIT_OLT_CLLI,0,8)";
try {
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
HiveSqlContext.sql(sql);
sql = "SELECT LATITUDE, LONGITUDE, ODNCIRCUIT_OLT_CLLI, ODNCIRCUIT_OLT_TID, CITY, STATE, ZIP FROM "+Database+".IVAPPGEO";
JavaSchemaRDD RDD_IVAPPGEO = HiveSqlContext.sql(sql).cache();
}catch(Exception e){
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
}
//JavaHiveContext hc = new JavaHiveContext();
System.out.println("Retrieve Data from Kafka Topic: "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put(KAFKA_TOPIC,KAFKA_PARA);
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
jssc, KAFKA_GROUPID, ZOOKEEPER_URL, topicMap);
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
private static final long serialVersionUID = 42l;
#Override
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
System.out.println("Completed Kafka Messages... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaPairDStream<Long, String> tweets = json.mapToPair(
new TwitterFilterFunction());
JavaPairDStream<Long, String> filtered = tweets.filter(
new Function<Tuple2<Long, String>, Boolean>() {
private static final long serialVersionUID = 42l;
#Override
public Boolean call(Tuple2<Long, String> tweet) {
return tweet != null;
}
}
);
JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
new TextFilterFunction());
tweetsFiltered = tweetsFiltered.map(
new StemmingFunction());
System.out.println("Finished Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
//calculate postive tweets
JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
tweetsFiltered.mapToPair(new PositiveScoreFunction());
//calculate negative tweets
JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
tweetsFiltered.mapToPair(new NegativeScoreFunction());
JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
positiveTweets.join(negativeTweets);
//Score tweets
JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
joined.map(new Function<Tuple2<Tuple2<Long, String>,
Tuple2<Float, Float>>,
Tuple4<Long, String, Float, Float>>() {
private static final long serialVersionUID = 42l;
#Override
public Tuple4<Long, String, Float, Float> call(
Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
{
return new Tuple4<Long, String, Float, Float>(
tweet._1()._1(),
tweet._1()._2(),
tweet._2()._1(),
tweet._2()._2());
}
});
System.out.println("Finished Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Outputting Tweets Data to flat file "+Properties.getString("HdfsOutput")+" ... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
scoredTweets.map(new ScoreTweetsFunction());
result.foreachRDD(new FileWriter());
System.out.println("Outputting Sentiment Data to Hive... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
jssc.start();
jssc.awaitTermination();
}
}
Creating SparkContext
You can create a SparkContext instance with or without creating a SparkConf object first.
Getting Existing or Creating New SparkContext (getOrCreate methods)
getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext
SparkContext.getOrCreate methods allow you to get the existing SparkContext or create a new one.
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)
Refer Here - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html
Apparently if I use sc.close() to close the original SparkContext before executing JavaStreaming Context it works perfectly, no errors or issues.
you can use a singleton object ContextManager which would handle which context to provide.
public class ContextManager {
private static JavaSparkContext context;
private static String currentType;
private ContextManager() {}
public static JavaSparkContext getContext(String type) {
if(type == currentType && context != null) {
return context;
}
else if (type == "streaming"){
.. clean up the current context ..
.. initialize the context to streaming context ..
currentType = type;
}
else {
..clean up the current context..
... initialize the context to normal context ..
currentType = type;
}
return context;
}
}
There are some issues like in projects where you switch context quite rapidly the overhead would be quite large.
You can access the SparkContext from your JavaStreamingSparkContext, and use that reference when creating additional contexts.
SparkConf sparkConfig = new SparkConf().setAppName("foo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConfig, Duration.seconds(30));
SqlContext sqlContext = new SqlContext(jssc.sparkContext());

Categories