I am working on a Spark-based Kafka consumer that reads data in Avro format.
Following is the try/catch code that reads and processes the input.
import java.util.*;
import java.io.*;
import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;
import kafka.serializer.StringDecoder;
import kafka.serializer.DefaultDecoder;
import scala.Tuple2;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import kafka.producer.KeyedMessage;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
public class myKafkaConsumer{
/**
* Main function, entry point to the program.
* @param args takes the user-ids as the parameters, which
* will be treated as topics in our case.
*/
private String [] topics;
private SparkConf sparkConf;
private JavaStreamingContext jssc;
public static final String USER_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"str1\", \"type\":\"string\" },"
+ " { \"name\":\"int1\", \"type\":\"int\" }"
+ "]}";
public static void main(String [] args){
if(args.length < 1){
System.err.println("Usage : myKafkaConsumber <topics/user-id>");
System.exit(1);
}
myKafkaConsumer bKC = new myKafkaConsumer(args);
bKC.run();
}
/**
* Constructor
*/
private myKafkaConsumer(String [] topics){
this.topics = topics;
sparkConf = new SparkConf();
sparkConf = sparkConf.setAppName("JavaDirectKafkaFilterMessages");
jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
}
/**
* run function, runs the entire program.
* @param topics, a string array containing the topics to be read from
* @return void
*/
private void run(){
HashSet<String> topicSet = new HashSet<String>();
for(String topic : topics){
topicSet.add(topic);
System.out.println(topic);
}
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "128.208.244.3:9092");
kafkaParams.put("auto.offset.reset", "smallest");
try{
JavaPairInputDStream<String, byte[]> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
byte[].class,
StringDecoder.class,
DefaultDecoder.class,
kafkaParams,
topicSet
);
JavaDStream<String> avroRows = messages.map(new Function<Tuple2<String, byte[]>, String>(){
public String call(Tuple2<String, byte[]> tuple2){
return testFunction(tuple2._2().toString());
}
});
avroRows.print();
jssc.start();
jssc.awaitTermination();
}catch(Exception E){
System.out.println(E.toString());
E.printStackTrace();
}
}
private static String testFunction(String str){
System.out.println("Input String : " + str);
return "Success";
}
}
The code compiles correctly; however, when I try to run it on a Spark cluster I get a Task not serializable error. I tried removing the function and simply printing some text, but the error persists.
P.S. I have checked by printing the messages and found that they are read correctly.
The print statement collects your RDD to the driver in order to print the records on the screen. Such a task triggers serialization/deserialization of your data.
In order for your code to work, the records in the avroRows DStream must be of a serializable type.
For example, it should work if you replace the avroRows definition with this:
JavaDStream<String> avroRows = messages.map(new Function<Tuple2<String, byte[]>, String>(){
public String call(Tuple2<String, byte[]> tuple2){
return tuple2._2().toString();
}
});
I just added a toString call to your records because the String type is serializable (of course, it is not necessarily what you need; it is just an example).
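If the eventual goal is to decode the Avro payload rather than print the raw byte array, a minimal sketch using the Bijection classes already imported in the question could replace the map function. This assumes the producer serialized records with the same USER_SCHEMA; the field names str1 and int1 come from that schema.
JavaDStream<String> avroRows = messages.map(new Function<Tuple2<String, byte[]>, String>() {
    public String call(Tuple2<String, byte[]> tuple2) {
        // Build the schema and injection inside call() so nothing non-serializable
        // is captured by the closure that Spark ships to the executors.
        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);
        GenericRecord record = recordInjection.invert(tuple2._2()).get();
        return record.get("str1") + "," + record.get("int1");
    }
});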
I am trying to implement S3 Select in a Spring Boot app to query a Parquet file in an S3 bucket, but I am only getting a partial result from the S3 Select output. Please help me identify the issue; I have used the AWS Java SDK v2.
Upon checking the JSON output (printed in the console), the total output is around 65k characters.
I am using Eclipse and tried unchecking "Limit console output" in the console preferences, which did not help.
Code is here:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.async.SdkPublisher;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.CompressionType;
import software.amazon.awssdk.services.s3.model.EndEvent;
import software.amazon.awssdk.services.s3.model.ExpressionType;
import software.amazon.awssdk.services.s3.model.InputSerialization;
import software.amazon.awssdk.services.s3.model.JSONOutput;
import software.amazon.awssdk.services.s3.model.OutputSerialization;
import software.amazon.awssdk.services.s3.model.ParquetInput;
import software.amazon.awssdk.services.s3.model.RecordsEvent;
import software.amazon.awssdk.services.s3.model.SelectObjectContentEventStream;
import software.amazon.awssdk.services.s3.model.SelectObjectContentEventStream.EventType;
import software.amazon.awssdk.services.s3.model.SelectObjectContentRequest;
import software.amazon.awssdk.services.s3.model.SelectObjectContentResponse;
import software.amazon.awssdk.services.s3.model.SelectObjectContentResponseHandler;
public class ParquetSelect {
private static final String BUCKET_NAME = "<bucket-name>";
private static final String KEY = "<object-key>";
private static final String QUERY = "select * from S3Object s";
public static S3AsyncClient s3;
public static void selectObjectContent() {
Handler handler = new Handler();
SelectQueryWithHandler(handler).join();
RecordsEvent recordsEvent = (RecordsEvent) handler.receivedEvents.stream()
.filter(e -> e.sdkEventType() == EventType.RECORDS)
.findFirst()
.orElse(null);
System.out.println(recordsEvent.payload().asUtf8String());
}
private static CompletableFuture<Void> SelectQueryWithHandler(SelectObjectContentResponseHandler handler) {
InputSerialization inputSerialization = InputSerialization.builder()
.parquet(ParquetInput.builder().build())
.compressionType(CompressionType.NONE)
.build();
OutputSerialization outputSerialization = OutputSerialization.builder()
.json(JSONOutput.builder().build())
.build();
SelectObjectContentRequest select = SelectObjectContentRequest.builder()
.bucket(BUCKET_NAME)
.key(KEY)
.expression(QUERY)
.expressionType(ExpressionType.SQL)
.inputSerialization(inputSerialization)
.outputSerialization(outputSerialization)
.build();
return s3.selectObjectContent(select, handler);
}
private static class Handler implements SelectObjectContentResponseHandler {
private SelectObjectContentResponse response;
private List<SelectObjectContentEventStream> receivedEvents = new ArrayList<>();
private Throwable exception;
@Override
public void responseReceived(SelectObjectContentResponse response) {
this.response = response;
}
@Override
public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
publisher.subscribe(receivedEvents::add);
}
@Override
public void exceptionOccurred(Throwable throwable) {
exception = throwable;
}
@Override
public void complete() {
}
}
}
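One detail worth flagging in the handler usage above: S3 Select delivers its result as a stream of events, usually several RECORDS events for larger outputs, and the code only prints the first one, which would explain a truncated result around 65k characters. A hedged sketch that concatenates every RECORDS payload instead (the class and method names here are illustrative, not part of the SDK):
import java.util.List;
import java.util.stream.Collectors;
import software.amazon.awssdk.services.s3.model.RecordsEvent;
import software.amazon.awssdk.services.s3.model.SelectObjectContentEventStream;
import software.amazon.awssdk.services.s3.model.SelectObjectContentEventStream.EventType;
public class SelectResultAssembler {
    // Join the payloads of all RECORDS events instead of reading only the first one.
    public static String assemble(List<SelectObjectContentEventStream> receivedEvents) {
        return receivedEvents.stream()
                .filter(e -> e.sdkEventType() == EventType.RECORDS)
                .map(e -> ((RecordsEvent) e).payload().asUtf8String())
                .collect(Collectors.joining());
    }
}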
I see you are using selectObjectContent(). Have you tried calling the s3AsyncClient.getObject() method? Does that work for you?
For example, here is a code example that gets a PDF file from an Amazon S3 bucket and writes it to a local file.
package com.example.s3.async;
// snippet-start:[s3.java2.async_stream_ops.complete]
// snippet-start:[s3.java2.async_stream_ops.import]
import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;
// snippet-end:[s3.java2.async_stream_ops.import]
// snippet-start:[s3.java2.async_stream_ops.main]
/**
* Before running this Java V2 code example, set up your development environment, including your credentials.
*
* For more information, see the following documentation topic:
*
* https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
*/
public class S3AsyncStreamOps {
public static void main(String[] args) {
final String usage = "\n" +
"Usage:\n" +
" <bucketName> <objectKey> <path>\n\n" +
"Where:\n" +
" bucketName - The name of the Amazon S3 bucket (for example, bucket1). \n\n" +
" objectKey - The name of the object (for example, book.pdf). \n" +
" path - The local path to the file (for example, C:/AWS/book.pdf). \n" ;
if (args.length != 3) {
System.out.println(usage);
System.exit(1);
}
String bucketName = args[0];
String objectKey = args[1];
String path = args[2];
ProfileCredentialsProvider credentialsProvider = ProfileCredentialsProvider.create();
Region region = Region.US_EAST_1;
S3AsyncClient s3AsyncClient = S3AsyncClient.builder()
.region(region)
.credentialsProvider(credentialsProvider)
.build();
GetObjectRequest objectRequest = GetObjectRequest.builder()
.bucket(bucketName)
.key(objectKey)
.build();
CompletableFuture<GetObjectResponse> futureGet = s3AsyncClient.getObject(objectRequest,
AsyncResponseTransformer.toFile(Paths.get(path)));
futureGet.whenComplete((resp, err) -> {
try {
if (resp != null) {
System.out.println("Object downloaded. Details: "+resp);
} else {
err.printStackTrace();
}
} finally {
// Only close the client when you are completely done with it.
s3AsyncClient.close();
}
});
futureGet.join();
}
}
We are running a Flink application on AWS Kinesis Analytics.
We are using Kafka as our source and sink and event time for watermark generation. We have a window of 5 seconds. We are performing an inner join on a common field.
Kafka topics have 12 partitions and Flink has a parallelism of 3.
Issues observed: for some windows we are missing records. Records should join based on event time but do not, and for other windows we are seeing duplicate records.
sample records
{"empName":"ted","timestamp":"0","uuid":"f2c2e48a44064d0fa8da5a3896e0e42a","empId":"23698"}
{"empName":"ted","timestamp":"1","uuid":"069f2293ad144dd38a79027068593b58","empId":"23145"}
{"empName":"john","timestamp":"2","uuid":"438c1f0b85154bf0b8e4b3ebf75947b6","empId":"23698"}
{"empName":"john","timestamp":"0","uuid":"76d1d21ed92f4a3f8e14a09e9b40a13b","empId":"23145"}
{"empName":"ted","timestamp":"0","uuid":"bbc3bad653aa44c4894d9c4d13685fba","empId":"23698"}
{"empName":"ted","timestamp":"0","uuid":"530871933d1e4443ade447adc091dcbe","empId":"23145"}
{"empName":"ted","timestamp":"1","uuid":"032d7be009cb448bb40fe5c44582cb9c","empId":"23698"}
{"empName":"john","timestamp":"1","uuid":"e5916821bd4049bab16f4dc62d4b90ea","empId":"23145"}
{"empId":"23698","timestamp":"0","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"34"}
{"empId":"23698","timestamp":"1","expense":"234"}
{"empId":"23145","timestamp":"2","expense":"234"}
{"empId":"23698","timestamp":"2","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"234"}
{"empId":"23145","timestamp":"0","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"34"}
{"empId":"23145","timestamp":"1","expense":"34"}
Below is the code for your reference.
As you can see, for the two streams there are many event timestamps that can repeat. There can be thousands of employee and empId combinations (in the real data there are many more dimensions), and they all arrive on a single Kafka topic.
import java.text.SimpleDateFormat;
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.std.StringDeserializer;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
private static final Logger LOG = LoggerFactory.getLogger(Main.class);
// static String TOPIC_IN = "event_hub_all-mt-partitioned";
static String TOPIC_ONE = "kafka_one_multi";
static String TOPIC_TWO = "kafka_two_multi";
static String TOPIC_OUT = "final_join_topic_multi";
static String BOOTSTRAP_SERVER = "localhost:9092";
public static void main(String[] args) {
Producer<String> emp = new Producer<String>(BOOTSTRAP_SERVER, StringSerializer.class.getName());
Producer<String> dept = new Producer<String>(BOOTSTRAP_SERVER, StringSerializer.class.getName());
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
Properties props = new Properties();
props.put("bootstrap.servers", BOOTSTRAP_SERVER);
props.put("client.id", "flink-example1");
FlinkKafkaConsumer<Employee> kafkaConsumerOne = new FlinkKafkaConsumer<>(TOPIC_ONE, new EmployeeSchema(),
props);
LOG.info("Coming to main function");
//Commenting event timestamp for watermark generation!!
var empDebugStream = kafkaConsumerOne.assignTimestampsAndWatermarks(
WatermarkStrategy.<Employee>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((employee, timestamp) -> employee.getTimestamp().getTime())
.withIdleness(Duration.ofSeconds(1)));
// for allowing Flink to handle late elements
kafkaConsumerOne.setStartFromLatest();
FlinkKafkaConsumer<EmployeeExpense> kafkaConsumerTwo = new FlinkKafkaConsumer<>(TOPIC_TWO,
new DepartmentSchema(), props);
//Commenting event timestamp for watermark generation!!
kafkaConsumerTwo.assignTimestampsAndWatermarks(
WatermarkStrategy.<EmployeeExpense>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((employeeExpense, timestamp) -> employeeExpense.getTimestamp().getTime())
.withIdleness(Duration.ofSeconds(1)));
kafkaConsumerTwo.setStartFromLatest();
// EventSerializationSchema<EmployeeWithExpenseAggregationStats> employeeWithExpenseAggregationSerializationSchema = new EventSerializationSchema<EmployeeWithExpenseAggregationStats>(
// TOPIC_OUT);
EventSerializationSchema<EmployeeWithExpense> employeeWithExpenseSerializationSchema = new EventSerializationSchema<EmployeeWithExpense>(
TOPIC_OUT);
// FlinkKafkaProducer<EmployeeWithExpenseAggregationStats> sink = new FlinkKafkaProducer<EmployeeWithExpenseAggregationStats>(
// TOPIC_OUT,
// employeeWithExpenseAggregationSerializationSchema,props,
// FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
FlinkKafkaProducer<EmployeeWithExpense> sink = new FlinkKafkaProducer<EmployeeWithExpense>(TOPIC_OUT,
employeeWithExpenseSerializationSchema, props, FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
DataStream<Employee> empStream = env.addSource(kafkaConsumerOne)
.transform("debugFilter", empDebugStream.getProducedType(), new StreamWatermarkDebugFilter<>())
.keyBy(emps -> emps.getEmpId());
DataStream<EmployeeExpense> expStream = env.addSource(kafkaConsumerTwo).keyBy(exps -> exps.getEmpId());
// DataStream<EmployeeWithExpense> aggInputStream = empStream.join(expStream)
empStream.join(expStream).where(new KeySelector<Employee, Tuple1<Integer>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple1<Integer> getKey(Employee value) throws Exception {
return Tuple1.of(value.getEmpId());
}
}).equalTo(new KeySelector<EmployeeExpense, Tuple1<Integer>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple1<Integer> getKey(EmployeeExpense value) throws Exception {
return Tuple1.of(value.getEmpId());
}
}).window(TumblingEventTimeWindows.of(Time.seconds(5))).allowedLateness(Time.seconds(15))
.apply(new JoinFunction<Employee, EmployeeExpense, EmployeeWithExpense>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public EmployeeWithExpense join(Employee first, EmployeeExpense second) throws Exception {
return new EmployeeWithExpense(second.getTimestamp(), first.getEmpId(), second.getExpense(),
first.getUuid(), LocalDateTime.now()
.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'+0000'")));
}
}).addSink(sink);
// KeyedStream<EmployeeWithExpense, Tuple3<Integer, Integer,Long>> inputKeyedByGWNetAccountProductRTG = aggInputStream
// .keyBy(new KeySelector<EmployeeWithExpense, Tuple3<Integer, Integer,Long>>() {
//
// /**
// *
// */
// private static final long serialVersionUID = 1L;
//
// #Override
// public Tuple3<Integer, Integer,Long> getKey(EmployeeWithExpense value) throws Exception {
// return Tuple3.of(value.empId, value.expense,Instant.ofEpochMilli(value.timestamp.getTime()).truncatedTo(ChronoUnit.SECONDS).toEpochMilli());
// }
// });
//
// inputKeyedByGWNetAccountProductRTG.window(TumblingEventTimeWindows.of(Time.seconds(2)))
// .aggregate(new EmployeeWithExpenseAggregator()).addSink(sink);
// streamOne.print();
// streamTwo.print();
// DataStream<KafkaRecord> streamTwo = env.addSource(kafkaConsumerTwo);
//
// streamOne.connect(streamTwo).flatMap(new CoFlatMapFunction<KafkaRecord, KafkaRecord, R>() {
// })
//
// // Create Kafka producer from Flink API
// Properties prodProps = new Properties();
// prodProps.put("bootstrap.servers", BOOTSTRAP_SERVER);
//
// FlinkKafkaProducer<KafkaRecord> kafkaProducer =
//
// new FlinkKafkaProducer<KafkaRecord>(TOPIC_OUT,
//
// ((record, timestamp) -> new ProducerRecord<byte[], byte[]>(TOPIC_OUT, record.key.getBytes(), record.value.getBytes())),
//
// prodProps,
//
// Semantic.EXACTLY_ONCE);;
//
// DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);
//
// stream.filter((record) -> record.value != null && !record.value.isEmpty()).keyBy(record -> record.key)
// .timeWindow(Time.seconds(15)).allowedLateness(Time.milliseconds(500))
// .reduce(new ReduceFunction<KafkaRecord>() {
// /**
// *
// */
// private static final long serialVersionUID = 1L;
// KafkaRecord result = new KafkaRecord();
// #Override
// public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
// {
// result.key = "outKey";
//
// result.value = record1.value + " " + record2.value;
//
// return result;
// }
// }).addSink(kafkaProducer);
// produce a number as string every second
new MessageGenerator(emp, TOPIC_ONE, "EMP").start();
new MessageGenerator(dept, TOPIC_TWO, "EXP").start();
// for visual topology of the pipeline. Paste the below output in
// https://flink.apache.org/visualizer/
// System.out.println(env.getExecutionPlan());
// start flink
try {
env.execute();
LOG.debug("Starting flink application!!");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Questions
How do we debug when the window is emitted? Is there a way to add both streams to a sink (Kafka) and see when the records are emitted, window by window?
Can we put the late-arriving records into a sink to inspect them further?
What is the cause of the duplicates, and how do we debug them?
Any help in this direction is greatly appreciated. Thanks in advance.
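On the duplicates: allowedLateness(Time.seconds(15)) makes a window fire again whenever a late element arrives for it, and with an AT_LEAST_ONCE Kafka producer every re-firing is emitted downstream, so that is one plausible (though not the only possible) source of duplicate joined records. For inspecting the late arrivals themselves, the windowed join API above does not expose them directly, but a single keyed, windowed stream does via a side output. A hedged sketch, reusing the Employee POJO and empStream from the post (the OutputTag name and the placeholder reduce are illustrative; it additionally needs org.apache.flink.util.OutputTag and SingleOutputStreamOperator imports):
// Route records that arrive after the watermark has passed their window to a side output.
OutputTag<Employee> lateTag = new OutputTag<Employee>("late-employees") {};
SingleOutputStreamOperator<Employee> windowedEmps = empStream
        .keyBy(e -> e.getEmpId())
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .sideOutputLateData(lateTag)
        .reduce((a, b) -> a); // placeholder aggregation, only needed so the window produces output
// Late records can now be printed or written to their own Kafka topic for debugging.
DataStream<Employee> lateEmployees = windowedEmps.getSideOutput(lateTag);
lateEmployees.print();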
I am attempting to build out a Kafka Streams app that takes in records from an input topic carrying a simple JSON payload (id and timestamp included; the key is a simple 3-digit string, and no schema is required). For the output topic I wish to produce only the records that have been abandoned for 30 minutes or more (session window). Based on this link, I have begun to develop a Kafka Streams app:
package io.confluent.developer;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.SessionWindows;
import java.io.FileInputStream;
import java.io.IOException;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
public class SessionWindow {
private final DateTimeFormatter timeFormatter = DateTimeFormatter.ofLocalizedTime(FormatStyle.LONG)
.withLocale(Locale.US)
.withZone(ZoneId.systemDefault());
public Topology buildTopology(Properties allProps) {
final StreamsBuilder builder = new StreamsBuilder();
final String inputTopic = allProps.getProperty("input.topic.name");
final String outputTopic = allProps.getProperty("output.topic.name");
builder.stream(inputTopic, Consumed.with(Serdes.String(), Serdes.String()))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(Duration.ofMinutes(5), Duration.ofSeconds(10)))
.count()
.toStream()
.map((windowedKey, count) -> {
String start = timeFormatter.format(windowedKey.window().startTime());
String end = timeFormatter.format(windowedKey.window().endTime());
String sessionInfo = String.format("Session info started: %s ended: %s with count %s", start, end, count);
return KeyValue.pair(windowedKey.key(), sessionInfo);
})
.to(outputTopic, Produced.with(Serdes.String(), Serdes.String()));
return builder.build();
}
public Properties loadEnvProperties(String fileName) throws IOException {
Properties allProps = new Properties();
FileInputStream input = new FileInputStream(fileName);
allProps.load(input);
input.close();
return allProps;
}
public static void main(String[] args) throws Exception {
if (args.length < 1) {
throw new IllegalArgumentException("This program takes one argument: the path to an environment configuration file.");
}
SessionWindow tw = new SessionWindow();
Properties allProps = tw.loadEnvProperties(args[0]);
allProps.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
allProps.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, ClickEventTimestampExtractor.class);
Topology topology = tw.buildTopology(allProps);
ClicksDataGenerator dataGenerator = new ClicksDataGenerator(allProps);
dataGenerator.generate();
final KafkaStreams streams = new KafkaStreams(topology, allProps);
final CountDownLatch latch = new CountDownLatch(1);
// Attach shutdown handler to catch Control-C.
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close(Duration.ofSeconds(5));
latch.countDown();
}
});
try {
streams.cleanUp();
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
static class ClicksDataGenerator {
final Properties properties;
public ClicksDataGenerator(final Properties properties) {
this.properties = properties;
}
public void generate() {
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
}
}
}
package io.confluent.developer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;
public class ClickEventTimestampExtractor implements TimestampExtractor {
@Override
public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
System.out.println(record.value());
return record.getTimestamp();
}
}
I am having issues with the following:
Getting the code to compile - I keep getting this error (I am new to Java, so please bear with me). What is the correct way to call getTimestamp?:
error: cannot find symbol
return record.getTimestamp();
^
symbol: method getTimestamp()
location: variable record of type ConsumerRecord<Object,Object>
1 error
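ConsumerRecord in the Kafka clients API does not have a getTimestamp() accessor; the timestamp embedded in each record is exposed as timestamp(). A corrected extractor, assuming the record's Kafka timestamp (rather than the timestamp field inside the JSON value) is what should drive event time, could look like this:
package io.confluent.developer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;
public class ClickEventTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        // ConsumerRecord exposes its timestamp via timestamp(), not getTimestamp().
        return record.timestamp();
    }
}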
Not sure if the timestamp extractor will work for this particular scenario. I read here that 'The Timestamp extractor can only give you one timestamp'. Does that mean that if there are multiple messages with different keys this won't work? Some clarification or examples would help.
thanks!
I am running a Spark Streaming job which continuously reads data from a Kafka topic with 12 partitions in 30-second batches and uploads it to an S3 bucket.
The jobs are running extremely slowly. Please check the code below.
package com.example.main;
import com.example.Util.TableSchema;
import com.example.config.KafkaConfig;
import com.example.monitoring.MicroMeterMetricsCollector;
import org.apache.commons.lang3.StringUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.log4j.Level;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.*;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import scala.Tuple2;
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import static com.example.Util.Parser.*;
import static com.example.Util.SchemaHelper.checkSchema;
import static com.example.Util.SchemaHelper.mapToSchema;
public class StreamApp {
private static final String DATE = "dt";
private static final String HOUR = "hr";
private static final Logger LOG = LoggerFactory.getLogger(StreamApp.class);
private static Collection<String> kafkaTopics;
private static MicroMeterMetricsCollector microMeterMetricsCollector;
public static void main(String[] args) {
if(StringUtils.isEmpty(System.getenv("KAFKA_BOOTSTRAP_SERVER"))) {
System.err.println("Alert mail id is empty");
return;
}
final String KAFKA_BOOTSTRAP_SERVER = "localhost:9092";
kafkaTopics = Arrays.asList("capp_event");
LOG.info("Initializing the metric collector");
microMeterMetricsCollector = new MicroMeterMetricsCollector();
org.apache.log4j.Logger.getLogger("org.apache").setLevel(Level.WARN);
/**
* Spark configurations
*/
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local[*]");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("fs.s3a.access.key", "AKIAWAZAF7LRUXGCJTX3");
sparkConf.set("fs.s3a.secret.key", "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");
sparkConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
sparkConf.set("mapreduce.fileoutputcommitter.algorithm.version","2");
sparkConf.setAppName("Fabric-Streaming");
/**
* Kafka configurations
*/
KafkaConfig kafkaConfig = new KafkaConfig();
kafkaConfig.setKafkaParams(KAFKA_BOOTSTRAP_SERVER);
Map<String, Object> kafkaParamsMap = kafkaConfig.getKafkaParams();
/**
* Connect Kafka topic to the JavaStreamingContext
*/
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(30));
streamingContext.checkpoint("/Users/ritwik.raj/desktop/checkpoint");
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(streamingContext.sparkContext());
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(kafkaTopics, kafkaParamsMap));
stream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));
/**
* Iterate over JavaDStream object
*/
stream.foreachRDD( rdd -> {
/**
* Offset range for this batch of RDD
*/
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
JavaRDD<TableSchema> rowRdd = rdd
.filter( value -> checkSchema(value.value()) )
.map( json -> mapToSchema(json.value()) )
.map( row -> addPartitionColumn(row) );
Dataset<Row> df = sqlContext.createDataFrame(rowRdd, TableSchema.class);
df.write().mode(SaveMode.Append)
.partitionBy(DATE, HOUR)
.option("compression", "snappy")
.orc("s3a://data-qritwik/capp_event/test/");
/**
* Offset committed for this batch of RDD
*/
((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
LOG.info("Offset committed");
});
streamingContext.start();
try {
streamingContext.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
In the Spark UI a new job is created every 30 seconds, but the first job is still in the processing state and the others are getting queued.
I checked inside the first job to find out why it is still processing and found that all of its tasks have SUCCEEDED, yet the job remains in the processing state, and because of this the other jobs are getting queued.
Please let me know why the first job is still processing and how to optimise this so that the upload to S3 becomes fast.
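Without profiling it is hard to say what dominates here, but one pattern visible in the code is that every 30-second batch writes a partitioned ORC dataset straight to S3, which tends to produce many small files per dt/hr partition. A hedged sketch of one common mitigation, coalescing the RDD before the write (the partition count of 4 is purely illustrative, not a recommendation for this workload):
JavaRDD<TableSchema> rowRdd = rdd
        .filter(value -> checkSchema(value.value()))
        .map(json -> mapToSchema(json.value()))
        .map(row -> addPartitionColumn(row))
        .coalesce(4); // fewer output partitions => fewer, larger ORC files per batch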
I am not getting any data from the queue using the Kafka direct stream. In my code I put a System.out.println(); that statement does not run, which means I am not getting any data from that topic.
I am pretty sure the data is available in the queue, yet it does not show up in the console.
I didn't see any error in the console either.
Can anyone please suggest something?
Here is my Java code,
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount11").setMaster("local[*]");
sparkConf.set("spark.streaming.concurrentJobs", "3");
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(3000));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "x.xx.xxx.xxx:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", true);
Collection<String> topics = Arrays.asList("topicName");
final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
JavaPairDStream<String, String> lines = stream
.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
@Override
public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
return new Tuple2<>(record.key(), record.value());
}
});
lines.print();
// System.out.println(lines.count());
lines.foreachRDD(rdd -> {
rdd.values().foreachPartition(p -> {
while (p.hasNext()) {
System.out.println("Value of Kafka queue" + p.next());
}
});
});
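One thing to check that the snippet above does not show (it may simply have been trimmed) is whether the streaming context is ever started; a direct stream produces nothing until jssc.start() is called, roughly as in the working example further below:
jssc.start();            // nothing is consumed until the context is started
jssc.awaitTermination(); // keep the driver alive so batches keep firing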
I am able to print the strings fetched from the Kafka queue using a direct Kafka stream.
Here is my code,
import java.util.HashMap;
import java.util.HashSet;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Collection;
import java.util.Currency;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Pattern;
import scala.Tuple2;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import org.json.JSONObject;
import org.omg.CORBA.Current;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
public final class KafkaConsumerDirectStream {
public static void main(String[] args) throws Exception {
try {
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount11").setMaster("local[*]");
sparkConf.set("spark.streaming.concurrentJobs", "30");
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(200));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "x.xx.xxx.xxx:9091");
Set<String> topics = new HashSet();
topics.add("PartWithTopic02Queue");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(jssc, String.class,
String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topics);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
lines.foreachRDD(rdd -> {
if (rdd.count() > 0) {
List<String> strArray = rdd.collect();
// Print string here
}
});
jssc.start();
jssc.awaitTermination();
} catch (Exception e) {
e.printStackTrace();
}
}
}
@Vimal Here is a link to the working version of creating direct streams in Scala.
I believe after reviewing it in Scala, you can easily convert it.
Please make sure that reading only from the latest offsets in Kafka is turned off; otherwise the stream might not pick up anything that was produced before it started.
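With the consumer configuration in the first snippet, that behaviour is controlled by auto.offset.reset. A minimal sketch of starting from the beginning of the topic instead, applied to the kafkaParams map shown in the question (it only takes effect while the consumer group has no committed offsets):
kafkaParams.put("auto.offset.reset", "earliest"); // instead of "latest"
kafkaParams.put("enable.auto.commit", false);     // avoid committing offsets while debugging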