Null value in spark streaming from Kafka - java

I have a simple program that tries to receive data from Kafka. When I start a Kafka producer and send data, for example "Hello", I get this when I print the message: (null, Hello), and I don't know why this null appears. Is there any way to avoid it? I think it comes from the first parameter of Tuple2<String, String>, but I only want to print the second one. Another thing: when I print using System.out.println("inside map " + message); no message appears at all. Does someone know why? Thanks.
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("org.kakfa.spark.ConsumerData").setMaster("local[4]");
    // Substitute 127.0.0.1 with the actual address of your Spark Master (or use "local" to run in local mode)
    sparkConf.set("spark.cassandra.connection.host", "127.0.0.1");

    // Create the context with 2 seconds batch size
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

    Map<String, Integer> topicMap = new HashMap<>();
    String[] topics = KafkaProperties.TOPIC.split(",");
    for (String topic : topics) {
        topicMap.put(topic, KafkaProperties.NUM_THREADS);
    }

    /* connection to cassandra */
    CassandraConnector connector = CassandraConnector.apply(sparkConf);
    System.out.println("+++++++++++ cassandra connector created ++++++++++++++++++++++++++++");

    /* Receive kafka inputs */
    JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, KafkaProperties.ZOOKEEPER, KafkaProperties.GROUP_CONSUMER, topicMap);
    System.out.println("+++++++++++++ streaming-kafka connection done +++++++++++++++++++++++++++");

    JavaDStream<String> lines = messages.map(
            new Function<Tuple2<String, String>, String>() {
                public String call(Tuple2<String, String> message) {
                    System.out.println("inside map " + message);
                    return message._2();
                }
            }
    );

    messages.print();
    jssc.start();
    jssc.awaitTermination();
}

Q1) Null values:
Messages in Kafka are keyed, meaning they all have a (key, value) structure.
You see (null, Hello) because the producer published a (null, "Hello") record to the topic.
If you want to omit the key in your processing, map the original DStream to remove it: kafkaDStream.map(new Function<Tuple2<String, String>, String>() {...})
Q2) System.out.println("inside map "+ message); does not print. A couple of classical reasons:
Transformations are applied on the executors, so when running on a cluster that output will appear in the executor logs and not on the driver.
Operations are lazy and DStreams need to be materialized for operations to be applied.
In this specific case, the JavaDStream<String> lines is never materialized i.e. not used for an output operation. Therefore the map is never executed.
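A minimal sketch of the fix, keeping the rest of the program above unchanged: call the output operation on lines (the value-only DStream) instead of on messages, so the map runs and only the value is printed.

JavaDStream<String> lines = messages.map(
        new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> message) {
                // runs on the executor that processes the task, so look for this in the executor logs
                System.out.println("inside map " + message);
                return message._2();   // keep only the value, dropping the (null) key
            }
        }
);

// output operation on 'lines' (not 'messages') materializes the DStream and triggers the map
lines.print();

jssc.start();
jssc.awaitTermination();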

Related

Side input in global window as slowly changing cache questions

Context:
We have some schema files in Cloud Storage. In our Dataflow job, we need to refer to these schema files to transform our data. The schema files change on a daily/weekly basis. Our data source is PubSub, and we window PubSub messages into a fixed window of 1 minute. The schema files we need fit well into memory; they are about 90 MB.
What I have tried:
Referring to this doc from Apache Beam, we created a side input that writes into a global window with a GenerateSequence like so:
// Creates a side input that refreshes the schema every minute
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, byte[]>>() {
            @ProcessElement
            public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();

                Map<String, byte[]> dataBlobs = ImmutableMap.of(
                    "version", Longs.toByteArray(ctx.element().byteValue()),
                    "avroSchemaBlob", avroSchemaBlob,
                    "fileDescriptorSetBlob", fileDescriptorSetBlob,
                    "depsBlob", depsBlob);
                ctx.output(dataBlobs);
            }
        }))
        .apply(View.asSingleton());
"getAvroSchema", "getFileDescriptorSet" and "getFileDescriptorDeps" read files as byte[] from Cloud Storage.
However, this approach failed with the exception:
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.
I then tried writing my own Combine Globally function like so:
static class GetLatestVersion implements SerializableFunction<Iterable<Map<String, byte[]>>, Map<String, byte[]>> {
    @Override
    public Map<String, byte[]> apply(Iterable<Map<String, byte[]>> versions) {
        Map<String, byte[]> result = Maps.newHashMap();
        Long maxVersion = Long.MIN_VALUE;
        for (Map<String, byte[]> version : versions) {
            Long currentVersion = Longs.fromByteArray(version.get("version"));
            logger.info("Side input version: " + currentVersion);
            if (currentVersion > maxVersion) {
                result = version;
                maxVersion = currentVersion;
            }
        }
        return result;
    }
}
But it still triggers the same exception.
I then came across this and this in the Beam email archives, and it seems like what's suggested in the Beam doc does not work: I have to use a MultiMap to avoid the exception I ran into above. With a MultiMap, I will also have to iterate through the values and implement my own logic to pick the desired (latest) value.
My questions:
Why do I still get the exception "PCollection with more than one element accessed as a singleton view" even after I globally combine everything into 1 result?
If I go with the MultiMap approach, wouldn't the job eventually run out of memory? Every day we are basically growing the MultiMap by 90 MB (the size of our data blob), unless Dataflow has some smart MultiMap implementation behind the scenes.
What is the recommended way to do this?
Thanks
Use .apply(View.asMap()) instead of .apply(View.asSingleton());
This is the full example:
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, KV<String, byte[]>>() {
            @ProcessElement
            public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();

                ctx.output(KV.of("version", Longs.toByteArray(ctx.element().byteValue())));
                ctx.output(KV.of("avroSchemaBlob", avroSchemaBlob));
                ctx.output(KV.of("fileDescriptorSetBlob", fileDescriptorSetBlob));
                ctx.output(KV.of("depsBlob", depsBlob));
            }
        }))
        .apply(View.asMap());
You can then use the map from the side input as described in the documentation.
Apache Beam version 2.34.0
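As a sketch of how the side input can be consumed downstream (the mainInput collection and the transform body are hypothetical placeholders; dataBlobView is the view defined above):

mainInput.apply("TransformWithSchemas", ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // fetch the latest schema blobs from the side input map
        Map<String, byte[]> blobs = c.sideInput(dataBlobView);
        byte[] avroSchemaBlob = blobs.get("avroSchemaBlob");
        // ... transform c.element() using the schema ...
        c.output(c.element());
    }
}).withSideInputs(dataBlobView));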

Is it possible to consume kafka messages using key and partition?

I am using kafka_2.12 version 2.3.0, where I am publishing data into a Kafka topic using a partition and key. I need a way to consume a particular message from the topic using the key and partition combination, so that I don't have to consume all the messages and iterate to find the correct one.
Right now I am only able to do this
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props)
consumer.subscribe(Collections.singletonList("topic"))
ConsumerRecords<String, String> records = consumer.poll(100)
def data = records.findAll {
    it -> it.key().equals(key)
}
You can't "get messages by key from Kafka".
One solution, if practical, would be to have as many partitions as keys and always route messages for a key to the same partition.
Message Key as Partition
kafkaConsumer.assign(topicPartitions);
kafkaConsumer.seekToBeginning(topicPartitions);

// Pull records from kafka, keep polling until we get nothing back
final List<ConsumerRecord<byte[], byte[]>> allRecords = new ArrayList<>();
ConsumerRecords<byte[], byte[]> records;
do {
    // Grab records from kafka
    records = kafkaConsumer.poll(2000L);
    logger.info("Found {} records in kafka", records.count());

    // Add to our array list
    records.forEach(allRecords::add);
} while (!records.isEmpty());
Access messages of a Topic using Topic Name only
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(<Topic Name>, <Topic Name>));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}
There are two ways to consume topics/partitions:
KafkaConsumer.assign() : Document link
KafkaConsumer.subscribe() : Document link
So, you can't get messages by key.
If you don't plan to expand partitions, consider using the assign() method, because all messages with a specific key will go to the same partition.
How to use:
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(properties);
TopicPartition partition = new TopicPartition("some-topic", 0);
consumer.assign(Arrays.asList(partition));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // 'key' is the message key you are looking for
        if (record.key() != null && record.key().equals(key)) {
            // process the matching record
        }
    }
}
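Additionally, if the producer used Kafka's default partitioner, you can compute the partition a given key lands on and assign only that partition. This is a hedged sketch, assuming a String key serialized with the default StringSerializer and that numPartitions matches the topic's actual partition count:

// same hash the default producer partitioner applies to keyed records
byte[] keyBytes = key.getBytes(java.nio.charset.StandardCharsets.UTF_8);
int targetPartition = org.apache.kafka.common.utils.Utils.toPositive(
        org.apache.kafka.common.utils.Utils.murmur2(keyBytes)) % numPartitions;

consumer.assign(Collections.singletonList(new TopicPartition("some-topic", targetPartition)));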

Extracting Timestamp from producer message

I really need help!
I can't extract the timestamp of a message sent by a producer. In my project I work with JSON: I have a class in which I define the keys and one in which I define the values of the message that I send via a producer to a "Raw" topic. I have two other classes that do the same thing for the output message that my consumer will read on the topic called "Tdt". In the main class KafkaStreams.java I define the stream and map the keys and values. Starting Kafka locally, I start a producer that writes a message on the "raw" topic with keys and values, then in another shell the consumer starts reading the output message on the "tdt" topic. How do I get the event timestamp? I need to know the timestamp at which the message was sent by the producer. Do I need a TimestampExtractor?
Here is my main KafkaStreams class (my application works fine, I just need the timestamp):
@Bean("app1StreamTopology")
public KStream<LibAssIbanRawKey, LibAssIbanRawValue> kStream() throws ParseException {
    JsonSerde<Dwsitspr4JoinValue> Dwsitspr4JoinValueSerde = new JsonSerde<>(Dwsitspr4JoinValue.class);

    KStream<LibAssIbanRawKey, LibAssIbanRawValue> stream = defaultKafkaStreamsBuilder.stream(inputTopic);
    stream.peek((k, v) -> logger.info("Debug3 Chiave descrizione -> ({})", v.getCATRAPP()));

    GlobalKTable<Integer, Dwsitspr4JoinValue> categoriaRapporto = defaultKafkaStreamsBuilder
        .globalTable(temptiptopicname,
            Consumed.with(Serdes.Integer(), Dwsitspr4JoinValueSerde)
            // .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST)
        );
    logger.info("Debug3 Chiave descrizione -> ({})", categoriaRapporto.toString());
    stream.peek((k, v) -> logger.info("Debug4 Chiave descrizione -> ({})", v.getCATRAPP()));

    stream
        .join(categoriaRapporto, (k, v) -> v.getCATRAPP(), (valueStream, valueGlobalKtable) -> {
            // Value mapping
            LibAssIbanTdtValue newValue = new LibAssIbanTdtValue();
            newValue.setDescrizioneRidottaCodiceCategoriaDelRapporto(valueGlobalKtable.getDescrizioneRidotta());
            newValue.setDescrizioneEstesaCodiceCategoriaDelRapporto(valueGlobalKtable.getDescrizioneEstesa());
            newValue.setIdentificativo(valueStream.getAUD_CCID());
            // ... other values mapped ...
            return newValue;
        })
        .map((key, value) -> {
            // Key mapping
            LibAssIbanTdtKey newKey = new LibAssIbanTdtKey();
            newKey.setData(dtf.format(localDate));
            newKey.setIdentificatoreUnivocoDellaRigaDiTabella(key.getTABROWID());
            return KeyValue.pair(newKey, value);
        })
        .to(outputTopic, Produced.with(new JsonSerde<>(LibAssIbanTdtKey.class), new JsonSerde<>(LibAssIbanTdtValue.class)));

    return stream;
}
Yes you need a TimestampExtractor.
public class YourTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> consumerRecord, long l) {
        // do whatever you want with the timestamp available via consumerRecord.timestamp()
        ...
        // return the timestamp you want to use (here the default)
        return consumerRecord.timestamp();
    }
}
You'll need to tell Kafka Streams which extractor to use via the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG key, as in the sketch below.
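A minimal sketch of that configuration (the property values are placeholders):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "your-application-id");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// register the custom extractor so Kafka Streams uses it for every record
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        YourTimestampExtractor.class.getName());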

Spark SQL failed in Spark Streaming (KafkaStream)

I use Spark SQL in a Spark Streaming job to look up data in a Hive table.
Kafka streaming works fine without problems. If I run hiveContext.runSqlHive(sqlQuery); outside directKafkaStream.foreachRDD it also works fine. But I need the Hive table lookup inside the streaming job. Using JDBC (jdbc:hive2://) would work, but I want to use Spark SQL.
The significant parts of my source code look as follows:
// set context
SparkConf sparkConf = new SparkConf().setAppName(appName).set("spark.driver.allowMultipleContexts", "true");
SparkContext sparkSqlContext = new SparkContext(sparkConf);
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(batchDuration));
HiveContext hiveContext = new HiveContext(sparkSqlContext);

// Initialize Direct Spark Kafka Stream. Starts from top
JavaPairInputDStream<String, String> directKafkaStream =
        KafkaUtils.createDirectStream(streamingContext,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams,
                topicsSet);

// work on stream
directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
    rdd.foreachPartition(tuple2Iterator -> {
        // get message
        Tuple2<String, String> item = tuple2Iterator.next();

        // lookup
        String sqlQuery = "SELECT something FROM somewhere";
        Seq<String> resultSequence = hiveContext.runSqlHive(sqlQuery);
        List<String> result = scala.collection.JavaConversions.seqAsJavaList(resultSequence);
    });
    return null;
});

// Start the computation
streamingContext.start();
streamingContext.awaitTermination();
I get no meaningful error, even if I surround the code with try-catch.
I hope someone can help. Thanks.
Edit: the solution looks like this:

// work on stream
directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
    // driver
    Map<String, String> lookupMap = getResult(hiveContext); // something with hiveContext.runSqlHive(sqlQuery);
    rdd.foreachPartition(tuple2Iterator -> {
        // worker
        while (tuple2Iterator != null && tuple2Iterator.hasNext()) {
            // get message
            Tuple2<String, String> item = tuple2Iterator.next();

            // lookup
            String result = lookupMap.get(item._2());
        }
    });
    return null;
});
Just because you want to use Spark SQL doesn't make it possible here. Spark's rule number one is no nested actions, transformations, or distributed data structures.
If you can express your query, for example, as a join, you can push it one level higher into foreachRDD, and that pretty much exhausts your options for using Spark SQL here:
directKafkaStream.foreachRDD(rdd -> {
    hiveContext.runSqlHive(sqlQuery);
    rdd.foreachPartition(...);
});
Otherwise, a direct JDBC connection can be a valid option.
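A related refinement of the asker's edit above, sketched under the assumption that lookupMap can get large: broadcast it once per batch so each executor receives a single copy instead of one copy per task (getResult is the same placeholder helper as in the edit; Broadcast is org.apache.spark.broadcast.Broadcast).

directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
    // driver side: run the Hive query once per batch
    Map<String, String> lookupMap = getResult(hiveContext);
    Broadcast<Map<String, String>> broadcastMap =
            JavaSparkContext.fromSparkContext(rdd.context()).broadcast(lookupMap);

    rdd.foreachPartition(tuple2Iterator -> {
        // worker side: read from the broadcast copy
        while (tuple2Iterator != null && tuple2Iterator.hasNext()) {
            Tuple2<String, String> item = tuple2Iterator.next();
            String result = broadcastMap.value().get(item._2());
        }
    });
    return null;
});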

Kafka: Java client that blocks read and doesn't poll

I was wondering if there is Java client code for a Kafka consumer that allows reading data via push notifications / a blocking read, instead of the current poll:
final KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("test"));

new Thread() {
    @Override
    public void run() {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100); // poll
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(),
                        record.value());
                System.out.println();
                callback.onMessage(record.value());
            }
        }
    }
}.start();
If I understand your question correctly, you want data to be pushed to the consumer when it becomes available, instead of having the consumer responsible for checking for new data and pulling it.
At https://kafka.apache.org/08/design.html they discuss push vs. pull and the choice made in Kafka: the producer pushes messages to the broker and the consumer pulls from the broker. They also mention what was done to mitigate the downsides of a pull-based approach. If you require a push-based pub/sub messaging system, you may want to look at Scribe or Flume, which are also mentioned in that link :)
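One practical note in addition: poll() itself blocks until records arrive or the timeout expires, so a very long timeout already gives blocking-read semantics that you can hide behind a callback. A sketch, assuming a client version with poll(Duration) and reusing the asker's callback:

while (true) {
    // blocks until records are available or the (very long) timeout elapses
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(Long.MAX_VALUE));
    for (ConsumerRecord<String, String> record : records) {
        callback.onMessage(record.value()); // push-style handoff to the application
    }
}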
