We are running a Flink application on AWS Kinesis Data Analytics.
We use Kafka as our source and sink, and event time for watermark generation. We have a 5-second window and perform an inner join on a common field.
The Kafka topics have 12 partitions and the Flink job runs with a parallelism of 3.
Issues observed: for some windows we are missing records (records that should join based on event time do not join), and for other windows we see duplicate records.
Sample records:
{"empName":"ted","timestamp":"0","uuid":"f2c2e48a44064d0fa8da5a3896e0e42a","empId":"23698"}
{"empName":"ted","timestamp":"1","uuid":"069f2293ad144dd38a79027068593b58","empId":"23145"}
{"empName":"john","timestamp":"2","uuid":"438c1f0b85154bf0b8e4b3ebf75947b6","empId":"23698"}
{"empName":"john","timestamp":"0","uuid":"76d1d21ed92f4a3f8e14a09e9b40a13b","empId":"23145"}
{"empName":"ted","timestamp":"0","uuid":"bbc3bad653aa44c4894d9c4d13685fba","empId":"23698"}
{"empName":"ted","timestamp":"0","uuid":"530871933d1e4443ade447adc091dcbe","empId":"23145"}
{"empName":"ted","timestamp":"1","uuid":"032d7be009cb448bb40fe5c44582cb9c","empId":"23698"}
{"empName":"john","timestamp":"1","uuid":"e5916821bd4049bab16f4dc62d4b90ea","empId":"23145"}
{"empId":"23698","timestamp":"0","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"34"}
{"empId":"23698","timestamp":"1","expense":"234"}
{"empId":"23145","timestamp":"2","expense":"234"}
{"empId":"23698","timestamp":"2","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"234"}
{"empId":"23145","timestamp":"0","expense":"234"}
{"empId":"23698","timestamp":"0","expense":"34"}
{"empId":"23145","timestamp":"1","expense":"34"}
As you can see, many of the event timestamps repeat across the two streams. There can be thousands of employee and empId combinations (in the real data there are many more dimensions), and they all arrive on a single Kafka topic.
Below is the code for your reference.
import java.text.SimpleDateFormat;
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.std.StringDeserializer;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
private static final Logger LOG = LoggerFactory.getLogger(Main.class);
// static String TOPIC_IN = "event_hub_all-mt-partitioned";
static String TOPIC_ONE = "kafka_one_multi";
static String TOPIC_TWO = "kafka_two_multi";
static String TOPIC_OUT = "final_join_topic_multi";
static String BOOTSTRAP_SERVER = "localhost:9092";
public static void main(String[] args) {
Producer<String> emp = new Producer<String>(BOOTSTRAP_SERVER, StringSerializer.class.getName());
Producer<String> dept = new Producer<String>(BOOTSTRAP_SERVER, StringSerializer.class.getName());
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
Properties props = new Properties();
props.put("bootstrap.servers", BOOTSTRAP_SERVER);
props.put("client.id", "flink-example1");
FlinkKafkaConsumer<Employee> kafkaConsumerOne = new FlinkKafkaConsumer<>(TOPIC_ONE, new EmployeeSchema(),
props);
LOG.info("Coming to main function");
//Commenting event timestamp for watermark generation!!
var empDebugStream = kafkaConsumerOne.assignTimestampsAndWatermarks(
WatermarkStrategy.<Employee>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((employee, timestamp) -> employee.getTimestamp().getTime())
.withIdleness(Duration.ofSeconds(1)));
// for allowing Flink to handle late elements
kafkaConsumerOne.setStartFromLatest();
FlinkKafkaConsumer<EmployeeExpense> kafkaConsumerTwo = new FlinkKafkaConsumer<>(TOPIC_TWO,
new DepartmentSchema(), props);
//Commenting event timestamp for watermark generation!!
kafkaConsumerTwo.assignTimestampsAndWatermarks(
WatermarkStrategy.<EmployeeExpense>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((employeeExpense, timestamp) -> employeeExpense.getTimestamp().getTime())
.withIdleness(Duration.ofSeconds(1)));
kafkaConsumerTwo.setStartFromLatest();
// EventSerializationSchema<EmployeeWithExpenseAggregationStats> employeeWithExpenseAggregationSerializationSchema = new EventSerializationSchema<EmployeeWithExpenseAggregationStats>(
// TOPIC_OUT);
EventSerializationSchema<EmployeeWithExpense> employeeWithExpenseSerializationSchema = new EventSerializationSchema<EmployeeWithExpense>(
TOPIC_OUT);
// FlinkKafkaProducer<EmployeeWithExpenseAggregationStats> sink = new FlinkKafkaProducer<EmployeeWithExpenseAggregationStats>(
// TOPIC_OUT,
// employeeWithExpenseAggregationSerializationSchema,props,
// FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
FlinkKafkaProducer<EmployeeWithExpense> sink = new FlinkKafkaProducer<EmployeeWithExpense>(TOPIC_OUT,
employeeWithExpenseSerializationSchema, props, FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
DataStream<Employee> empStream = env.addSource(kafkaConsumerOne)
.transform("debugFilter", empDebugStream.getProducedType(), new StreamWatermarkDebugFilter<>())
.keyBy(emps -> emps.getEmpId());
DataStream<EmployeeExpense> expStream = env.addSource(kafkaConsumerTwo).keyBy(exps -> exps.getEmpId());
// DataStream<EmployeeWithExpense> aggInputStream = empStream.join(expStream)
empStream.join(expStream).where(new KeySelector<Employee, Tuple1<Integer>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple1<Integer> getKey(Employee value) throws Exception {
return Tuple1.of(value.getEmpId());
}
}).equalTo(new KeySelector<EmployeeExpense, Tuple1<Integer>>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Tuple1<Integer> getKey(EmployeeExpense value) throws Exception {
return Tuple1.of(value.getEmpId());
}
}).window(TumblingEventTimeWindows.of(Time.seconds(5))).allowedLateness(Time.seconds(15))
.apply(new JoinFunction<Employee, EmployeeExpense, EmployeeWithExpense>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public EmployeeWithExpense join(Employee first, EmployeeExpense second) throws Exception {
return new EmployeeWithExpense(second.getTimestamp(), first.getEmpId(), second.getExpense(),
first.getUuid(), LocalDateTime.now()
.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'+0000'")));
}
}).addSink(sink);
// KeyedStream<EmployeeWithExpense, Tuple3<Integer, Integer,Long>> inputKeyedByGWNetAccountProductRTG = aggInputStream
// .keyBy(new KeySelector<EmployeeWithExpense, Tuple3<Integer, Integer,Long>>() {
//
// /**
// *
// */
// private static final long serialVersionUID = 1L;
//
// @Override
// public Tuple3<Integer, Integer,Long> getKey(EmployeeWithExpense value) throws Exception {
// return Tuple3.of(value.empId, value.expense,Instant.ofEpochMilli(value.timestamp.getTime()).truncatedTo(ChronoUnit.SECONDS).toEpochMilli());
// }
// });
//
// inputKeyedByGWNetAccountProductRTG.window(TumblingEventTimeWindows.of(Time.seconds(2)))
// .aggregate(new EmployeeWithExpenseAggregator()).addSink(sink);
// streamOne.print();
// streamTwo.print();
// DataStream<KafkaRecord> streamTwo = env.addSource(kafkaConsumerTwo);
//
// streamOne.connect(streamTwo).flatMap(new CoFlatMapFunction<KafkaRecord, KafkaRecord, R>() {
// })
//
// // Create Kafka producer from Flink API
// Properties prodProps = new Properties();
// prodProps.put("bootstrap.servers", BOOTSTRAP_SERVER);
//
// FlinkKafkaProducer<KafkaRecord> kafkaProducer =
//
// new FlinkKafkaProducer<KafkaRecord>(TOPIC_OUT,
//
// ((record, timestamp) -> new ProducerRecord<byte[], byte[]>(TOPIC_OUT, record.key.getBytes(), record.value.getBytes())),
//
// prodProps,
//
// Semantic.EXACTLY_ONCE);;
//
// DataStream<KafkaRecord> stream = env.addSource(kafkaConsumer);
//
// stream.filter((record) -> record.value != null && !record.value.isEmpty()).keyBy(record -> record.key)
// .timeWindow(Time.seconds(15)).allowedLateness(Time.milliseconds(500))
// .reduce(new ReduceFunction<KafkaRecord>() {
// /**
// *
// */
// private static final long serialVersionUID = 1L;
// KafkaRecord result = new KafkaRecord();
// @Override
// public KafkaRecord reduce(KafkaRecord record1, KafkaRecord record2) throws Exception
// {
// result.key = "outKey";
//
// result.value = record1.value + " " + record2.value;
//
// return result;
// }
// }).addSink(kafkaProducer);
// produce a number as string every second
new MessageGenerator(emp, TOPIC_ONE, "EMP").start();
new MessageGenerator(dept, TOPIC_TWO, "EXP").start();
// for visual topology of the pipeline. Paste the below output in
// https://flink.apache.org/visualizer/
// System.out.println(env.getExecutionPlan());
// start flink
try {
env.execute();
LOG.debug("Starting flink application!!");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Questions
How do we debug when a window is emitted? Is there a way to send both streams to a sink (Kafka) and see which records are emitted per window? (A rough debugging sketch follows the questions below.)
Can we route the late-arriving records to a sink so we can inspect them?
What is the cause of the duplicates, and how do we debug them?
Any help in this direction is greatly appreciated. Thanks in advance.
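For reference, the kind of debugging tap we have in mind looks roughly like the sketch below. It is only an illustration, not something we have verified: as far as we can tell the windowed join API does not expose side outputs, so the sketch taps just the employee stream (empStream from the code above) on its own keyed window, logs when each window fires, and routes late records to a side output. The OutputTag name and the print sinks are placeholders; the same could be done for the expense stream.
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.OutputTag;
// Side output for records that arrive after the allowed lateness.
final OutputTag<Employee> lateEmployees = new OutputTag<Employee>("late-employees") {};
SingleOutputStreamOperator<String> windowDebug = empStream
        .keyBy(emp -> emp.getEmpId())
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .allowedLateness(Time.seconds(15))
        .sideOutputLateData(lateEmployees)
        .process(new ProcessWindowFunction<Employee, String, Integer, TimeWindow>() {
            @Override
            public void process(Integer key, Context ctx, Iterable<Employee> elements,
                    Collector<String> out) {
                long count = 0;
                for (Employee ignored : elements) {
                    count++;
                }
                // One line per window firing, so re-firings caused by late data
                // (and hence potential duplicates downstream) become visible.
                out.collect("window [" + ctx.window().getStart() + ", " + ctx.window().getEnd()
                        + ") key=" + key + " count=" + count
                        + " watermark=" + ctx.currentWatermark());
            }
        });
windowDebug.print();                                  // or .addSink(new FlinkKafkaProducer<>(...))
windowDebug.getSideOutput(lateEmployees).print();     // inspect the late arrivals separately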
Related
I am running a Spark Streaming job that continuously reads data from a Kafka topic with 12 partitions in 30-second batches and uploads it to an S3 bucket.
The jobs are running extremely slowly. Please check the code below.
package com.example.main;
import com.example.Util.TableSchema;
import com.example.config.KafkaConfig;
import com.example.monitoring.MicroMeterMetricsCollector;
import org.apache.commons.lang3.StringUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.log4j.Level;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.*;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import scala.Tuple2;
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import static com.example.Util.Parser.*;
import static com.example.Util.SchemaHelper.checkSchema;
import static com.example.Util.SchemaHelper.mapToSchema;
public class StreamApp {
private static final String DATE = "dt";
private static final String HOUR = "hr";
private static final Logger LOG = LoggerFactory.getLogger(StreamApp.class);
private static Collection<String> kafkaTopics;
private static MicroMeterMetricsCollector microMeterMetricsCollector;
public static void main(String[] args) {
if(StringUtils.isEmpty(System.getenv("KAFKA_BOOTSTRAP_SERVER"))) {
System.err.println("Alert mail id is empty");
return;
}
final String KAFKA_BOOTSTRAP_SERVER = "localhost:9092";
kafkaTopics = Arrays.asList("capp_event");
LOG.info("Initializing the metric collector");
microMeterMetricsCollector = new MicroMeterMetricsCollector();
org.apache.log4j.Logger.getLogger("org.apache").setLevel(Level.WARN);
/**
* Spark configurations
*/
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local[*]");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("fs.s3a.access.key", "AKIAWAZAF7LRUXGCJTX3");
sparkConf.set("fs.s3a.secret.key", "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");
sparkConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
sparkConf.set("mapreduce.fileoutputcommitter.algorithm.version","2");
sparkConf.setAppName("Fabric-Streaming");
/**
* Kafka configurations
*/
KafkaConfig kafkaConfig = new KafkaConfig();
kafkaConfig.setKafkaParams(KAFKA_BOOTSTRAP_SERVER);
Map<String, Object> kafkaParamsMap = kafkaConfig.getKafkaParams();
/**
* Connect Kafka topic to the JavaStreamingContext
*/
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(30));
streamingContext.checkpoint("/Users/ritwik.raj/desktop/checkpoint");
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(streamingContext.sparkContext());
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(kafkaTopics, kafkaParamsMap));
stream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));
/**
* Iterate over JavaDStream object
*/
stream.foreachRDD( rdd -> {
/**
* Offset range for this batch of RDD
*/
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
JavaRDD<TableSchema> rowRdd = rdd
.filter( value -> checkSchema(value.value()) )
.map( json -> mapToSchema(json.value()) )
.map( row -> addPartitionColumn(row) );
Dataset<Row> df = sqlContext.createDataFrame(rowRdd, TableSchema.class);
df.write().mode(SaveMode.Append)
.partitionBy(DATE, HOUR)
.option("compression", "snappy")
.orc("s3a://data-qritwik/capp_event/test/");
/**
* Offset committed for this batch of RDD
*/
((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
LOG.info("Offset committed");
});
streamingContext.start();
try {
streamingContext.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
In the Spark UI a new job is created every 30 seconds, but the first job is still in the processing state and the others are getting queued.
I looked inside the first job to find out why it is still processing; all of its tasks are SUCCEEDED, yet the job remains in the processing state, and because of this the other jobs queue up.
Please let me know why the first job is still processing and how to optimise this so that the upload to S3 becomes faster.
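One thing worth checking (a sketch with example values only, not a diagnosis of why the first job never finishes) is Spark Streaming's rate limiting for the direct Kafka stream, so that one slow batch does not cause an ever-growing queue behind it:
// Example settings only; the values are placeholders, not recommendations.
sparkConf.set("spark.streaming.backpressure.enabled", "true");       // adapt ingestion rate to processing speed
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "1000");  // cap records/sec per Kafka partition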
I am trying to write a proxy server with SparkJava that queries the Google Maps Directions API with parameters from a client (i.e. location data, traffic model preference, departure time, etc.) and returns various routing details such as distance, duration, and duration in traffic.
The server stalls when it tries to send a request to the API on behalf of the client. I placed print statements throughout the code to confirm that the hang is due to the API query. I have tried different ports (4567, 443, 80, and 8080) via the port() method, but the problem persists. I am confident the server-side code that performs the API query is not the issue: everything works fine (proper route information is generated, i.e. DirectionsApiRequest.await() returns properly) when I cut the client out, disable the endpoints, and run everything manually from the main method on the (deactivated) server's side.
Does anyone know why this could be happening?
(I use maven for dependency management)
The following shows the client trying to get the distance of the default route and the aforementioned error:
Server-side code:
Main class
package com.mycompany.app;
//import
// data structures
import java.util.ArrayList;
// google maps
import com.google.maps.model.DirectionsRoute;
import com.google.maps.model.LatLng;
// gson
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.lang.reflect.Type;
// static API methods
import com.mycompany.app.DirectionsUtility;
import static spark.Spark.*;
// exceptions
import com.google.maps.errors.ApiException;
import java.io.IOException;
public class App
{
private static ArrayList<LatLng> locationsDatabase = new ArrayList<LatLng>();
private static DirectionsRoute defaultRoute = null;
public static void main( String[] args ) throws ApiException, InterruptedException, IOException
{
// client posts location data
post("routingEngine/sendLocations", (request,response) -> {
response.type("application/json");
ArrayList<LatLng> locations = new Gson().fromJson(request.body(),new TypeToken<ArrayList<LatLng>>(){}.getType());
locationsDatabase = locations;
return "OK";
});
// before any default route queries, the default route must be generated
before("routingEngine/getDefaultRoute/*",(request,response) ->{
RequestParameters requestParameters = new Gson().fromJson(request.body(),(java.lang.reflect.Type)RequestParameters.class);
defaultRoute = DirectionsUtility.getDefaultRoute(locationsDatabase,requestParameters);
});
// client gets default route distance
get("routingEngine/getDefaultRoute/distance", (request,response) ->{
response.type("application/json");
return new Gson().toJson(new Gson().toJson(DirectionsUtility.getDefaultRouteDistance(defaultRoute)));
});
DirectionsUtility.context.shutdown();
}
}
DirectionsUtility is the class responsible for consulting with Google Maps' API:
package com.mycompany.app;
// import
// data structures
import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
// Google Directions API
import com.google.maps.GeoApiContext;
// request parameters
import com.google.maps.DirectionsApiRequest;
import com.google.maps.model.Unit;
import com.google.maps.model.TravelMode;
import com.google.maps.model.TrafficModel;
import com.google.maps.DirectionsApi.RouteRestriction;
import com.google.maps.model.Distance;
// result parameters
import com.google.maps.model.DirectionsResult;
import com.google.maps.model.LatLng;
import com.google.maps.model.DirectionsRoute;
import com.google.maps.model.DirectionsLeg;
// exceptions
import com.google.maps.errors.ApiException;
import java.io.IOException;
// time constructs
import java.time.Instant;
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Call;
import okhttp3.Response;
import okhttp3.MediaType;
import okhttp3.HttpUrl;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
public final class DirectionsUtility{
/**
* Private constructor to prevent instantiation.
*/
private DirectionsUtility(){}
/**
* API key.
*/
private static final String API_KEY = "YOUR PERSONAL API KEY";
/**
* Queries per second limit (50 is max).
*/
private static int QPS = 50;
/**
* Singleton that facilitates Google Geo API queries; must be shutdown() for program termination.
*/
protected static GeoApiContext context = new GeoApiContext.Builder()
.apiKey(API_KEY)
.queryRateLimit(QPS)
.build();
// TESTING
// singleton client
private static final OkHttpClient httpClient = new OkHttpClient.Builder()
.connectTimeout(700,TimeUnit.SECONDS)
.writeTimeout(700, TimeUnit.SECONDS)
.readTimeout(700, TimeUnit.SECONDS)
.build();
/**
* Generates the route judged by the Google API as being the most optimal. The main purpose of this method is to provide a fallback
* for the optimization engine should it ever find the traditional processes of this server (i.e. generation of all possible routes)
* too slow for its taste. In other words, if this server delays to an excessive degree in providing the optimization engine with the
* set of all possible routes, the optimization engine can terminate those processes and instead entrust the decision to the Google
* Maps API. This method suffers from a minor caveat; the Google Maps API refuses to compute the duration in traffic for any journey
* involving multiple locations if the intermediate points separating the origin and destination are assumed to be stopover points (i.e.
* if it is assumed that the driver will stop at each point) therefore this method assumes that the driver will not stop at the intermediate
* points. This may introduce some inaccuracies into the predictions.
* (it should be noted that this server has not yet been equipped with the ability to generate all possible routes, so this method is, at the
* moment, the only option)
*
* @param requestParameters the parameters required for a Google Maps API query; see the RequestParameters class for more information
*
* @return the default route
*/
public static DirectionsRoute getDefaultRoute(ArrayList<LatLng> locations,RequestParameters requestParameters) throws ApiException, InterruptedException, IOException
{
LatLng origin = locations.get(0);
LatLng destination = locations.get(locations.size() - 1);
// separate waypoints
int numWaypoints = locations.size() - 2;
DirectionsApiRequest.Waypoint[] waypoints = new DirectionsApiRequest.Waypoint[numWaypoints];
for(int i = 0; i < waypoints.length; i++)
{
// ensure that each waypoint is not designated as a stopover point
waypoints[i] = new DirectionsApiRequest.Waypoint(locations.get(i + 1),false);
}
// send API query
// store API query response
DirectionsResult directionsResult = null;
try
{
// create DirectionsApiRequest object
DirectionsApiRequest directionsRequest = new DirectionsApiRequest(context);
// set request parameters
directionsRequest.units(requestParameters.getUnit());
directionsRequest.mode(TravelMode.DRIVING);
directionsRequest.trafficModel(requestParameters.getTrafficModel());
if(requestParameters.getRestrictions() != null)
{
directionsRequest.avoid(requestParameters.getRestrictions());
}
directionsRequest.region(requestParameters.getRegion());
directionsRequest.language(requestParameters.getLanguage());
directionsRequest.departureTime(requestParameters.getDepartureTime());
// do not request alternative routes; only the single optimized route is needed
directionsRequest.alternatives(false);
directionsRequest.origin(origin);
directionsRequest.destination(destination);
directionsRequest.waypoints(waypoints);
directionsRequest.optimizeWaypoints(requestParameters.optimizeWaypoints());
// send request and store result
// testing - notification that a new api query is being sent
System.out.println("firing off API query...");
directionsResult = directionsRequest.await();
// testing - notification that api query was successful
System.out.println("API query successful");
}
catch(Exception e)
{
System.out.println(e);
}
// directionsResult.routes contains only a single, optimized route
// return the default route
return directionsResult.routes[0];
} // end method
/**
* Returns the distance of the default route.
*
* @param defaultRoute the default route
*
* @return the distance of the default route
*/
public static Distance getDefaultRouteDistance(DirectionsRoute defaultRoute)
{
// testing - simple notification
System.out.println("Computing distance...");
// each route has only 1 leg since all the waypoints are non-stopover points
return defaultRoute.legs[0].distance;
}
}
Here is the client-side code:
package com.mycompany.app;
import java.util.ArrayList;
import java.util.Arrays;
import com.google.maps.model.LatLng;
import com.google.maps.model.TrafficModel;
import com.google.maps.DirectionsApi.RouteRestriction;
import com.google.maps.model.TransitRoutingPreference;
import com.google.maps.model.TravelMode;
import com.google.maps.model.Unit;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonElement;
import com.google.gson.JsonArray;
import com.google.gson.reflect.TypeToken;
import java.lang.reflect.Type;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Call;
import okhttp3.Response;
import okhttp3.MediaType;
import okhttp3.HttpUrl;
// time constructs
import java.time.LocalDateTime;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.concurrent.TimeUnit;
import com.google.maps.model.Distance;
import com.google.maps.model.Duration;
import java.io.IOException;
public class App
{
// model database
private static LatLng hartford_ct = new LatLng(41.7658,-72.6734);
private static LatLng loretto_pn = new LatLng(40.5031,-78.6303);
private static LatLng chicago_il = new LatLng(41.8781,-87.6298);
private static LatLng newyork_ny = new LatLng(40.7128,-74.0060);
private static LatLng newport_ri = new LatLng(41.4901,-71.3128);
private static LatLng concord_ma = new LatLng(42.4604,-71.3489);
private static LatLng washington_dc = new LatLng(38.8951,-77.0369);
private static LatLng greensboro_nc = new LatLng(36.0726,-79.7920);
private static LatLng atlanta_ga = new LatLng(33.7490,-84.3880);
private static LatLng tampa_fl = new LatLng(27.9506,-82.4572);
// singleton client
private static final OkHttpClient httpClient = new OkHttpClient.Builder()
.connectTimeout(700,TimeUnit.SECONDS)
.writeTimeout(700, TimeUnit.SECONDS)
.readTimeout(700, TimeUnit.SECONDS)
.build();
private static final MediaType JSON
= MediaType.parse("application/json; charset=utf-8");
public static void main( String[] args ) throws IOException
{
// post location data
// get locations from database
ArrayList<LatLng> locations = new ArrayList<LatLng>();
// origin
LatLng origin = hartford_ct;
locations.add(origin);
// waypoints
locations.add(loretto_pn);
locations.add(chicago_il);
locations.add(newyork_ny);
locations.add(newport_ri);
locations.add(concord_ma);
locations.add(washington_dc);
locations.add(greensboro_nc);
locations.add(atlanta_ga);
// destination
LatLng destination = tampa_fl;
locations.add(destination);
// serialize locations list to json
Gson gson = new GsonBuilder().create();
String locationsJson = gson.toJson(locations);
// post to routing engine
RequestBody postLocationsRequestBody = RequestBody.create(JSON,locationsJson);
Request postLocationsRequest = new Request.Builder()
.url("http://localhost:4567/routingEngine/sendLocations")
.post(postLocationsRequestBody)
.build();
Call postLocationsCall = httpClient.newCall(postLocationsRequest);
Response postLocationsResponse = postLocationsCall.execute();
// get distance of default route
// generate parameters
Unit unit = Unit.METRIC;
LocalDateTime temp = LocalDateTime.now();
Instant departureTime= temp.atZone(ZoneOffset.UTC)
.withYear(2025)
.withMonth(8)
.withDayOfMonth(18)
.withHour(10)
.withMinute(12)
.withSecond(10)
.withNano(900)
.toInstant();
boolean optimizeWaypoints = true;
String optimizeWaypointsString = (optimizeWaypoints == true) ? "true" : "false";
TrafficModel trafficModel = TrafficModel.BEST_GUESS;
// restrictions
RouteRestriction[] restrictions = {RouteRestriction.TOLLS,RouteRestriction.FERRIES};
String region = "us"; // USA
String language = "en-EN";
RequestParameters requestParameters = new RequestParameters(unit,departureTime,true,trafficModel,restrictions,region,language);
// build url
HttpUrl url = new HttpUrl.Builder()
.scheme("http")
.host("127.0.0.1")
.port(4567)
.addPathSegment("routingEngine")
.addPathSegment("getDefaultRoute")
.addPathSegment("distance")
.build();
// build request
Request getDefaultRouteDistanceRequest = new Request.Builder()
.url(url)
.post(RequestBody.create(JSON,gson.toJson(requestParameters)))
.build();
// send request
Call getDefaultRouteDistanceCall = httpClient.newCall(getDefaultRouteDistanceRequest);
Response getDefaultRouteDistanceResponse = getDefaultRouteDistanceCall.execute();
// store and print response
Distance defaultRouteDistance = gson.fromJson(getDefaultRouteDistanceResponse.body().string(),Distance.class);
System.out.println("Default Route Distance: " + defaultRouteDistance);
}
}
Both classes use the following RequestParameters class to package the request parameters (unit, departure time, region, language, etc.) together, just for convenience:
package com.mycompany.app;
import com.google.maps.model.Unit;
import java.time.Instant;
import com.google.maps.model.TrafficModel;
import com.google.maps.DirectionsApi.RouteRestriction;
public class RequestParameters
{
private Unit unit;
private Instant departureTime;
private boolean optimizeWaypoints;
private TrafficModel trafficModel;
private RouteRestriction[] restrictions;
private String region;
private String language;
public RequestParameters(Unit unit, Instant departureTime, boolean optimizeWaypoints, TrafficModel trafficModel, RouteRestriction[] restrictions, String region, String language)
{
this.unit = unit;
this.departureTime = departureTime;
this.optimizeWaypoints = optimizeWaypoints;
this.trafficModel = trafficModel;
this.restrictions = restrictions;
this.region = region;
this.language = language;
}
// getters
public Unit getUnit()
{
return this.unit;
}
public Instant getDepartureTime()
{
return this.departureTime;
}
public boolean optimizeWaypoints()
{
return this.optimizeWaypoints;
}
public TrafficModel getTrafficModel()
{
return this.trafficModel;
}
public RouteRestriction[] getRestrictions()
{
return this.restrictions;
}
public String getRegion()
{
return this.region;
}
public String getLanguage()
{
return this.language;
}
// setters
public void setTrafficModel(TrafficModel trafficModel)
{
this.trafficModel = trafficModel;
}
public void setRegion(String region)
{
this.region = region;
}
public void setLanguage(String language)
{
this.language = language;
}
}
Hopefully this provides the information necessary to investigate the problem.
In the server-side App class, the last line of the main method reads
DirectionsUtility.context.shutdown();
This effectively shuts down the ExecutorService that the Maps Services API uses (inside its RateLimitExecutorService) and that is responsible for actually executing requests to Google. So your request is enqueued, but never actually executed.
Also, instead of doing System.out.println(e) (inside the DirectionsUtility class), it may be better to do something like e.printStackTrace() so you'll have access to the whole error and its stack trace.
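A minimal sketch of the fix implied above; the shutdown-hook placement is just one option, the important part is not shutting the context down while the SparkJava routes are still serving requests:
public static void main(String[] args) throws ApiException, InterruptedException, IOException
{
    // ... route definitions (post, before, get) exactly as before ...

    // Do NOT call DirectionsUtility.context.shutdown() here: the embedded Jetty server keeps
    // running after main() returns, and the routes still need the GeoApiContext.
    // If the context should be released cleanly, defer it until the JVM exits:
    Runtime.getRuntime().addShutdownHook(new Thread(() -> DirectionsUtility.context.shutdown()));
}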
There is code that is supposed to load test some function that performs the HTTP call (we call it callInit here) and collect some data in LoaTestMetricsData:
the collected responses
and the total duration of the execution.
import io.reactivex.Observable;
import io.reactivex.Scheduler;
import io.reactivex.Single;
import io.reactivex.observers.TestObserver;
import io.reactivex.schedulers.Schedulers;
import io.reactivex.subjects.PublishSubject;
import io.reactivex.subjects.Subject;
import io.restassured.internal.RestAssuredResponseImpl;
import io.restassured.response.Response;
import org.junit.jupiter.api.Test;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import static java.lang.Thread.sleep;
import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.allOf;
import static org.hamcrest.Matchers.greaterThanOrEqualTo;
import static org.hamcrest.Matchers.lessThan;
public class TestRx {
@Test
public void loadTest() {
int CALL_N_TIMES = 10;
final long CALL_NIT_EVERY_MILLISECONDS = 100;
final LoaTestMetricsData loaTestMetricsData = loadTestHttpCall(
this::callInit,
CALL_N_TIMES,
CALL_NIT_EVERY_MILLISECONDS
);
assertThat(loaTestMetricsData.responseList.size(), is(equalTo(Long.valueOf(CALL_N_TIMES).intValue())));
long errorCount = loaTestMetricsData.responseList.stream().filter(x -> x.getStatusCode() != 200).count();
long executionTime = loaTestMetricsData.duration.getSeconds();
//assertThat(errorCount, is(equalTo(0)));
assertThat(executionTime , allOf(greaterThanOrEqualTo(1L),lessThan(3L)));
}
// --
private Single<Response> callInit() {
try {
return Single.fromCallable(() -> {
System.out.println("...");
sleep(1000);
Response response = new RestAssuredResponseImpl();
return response;
});
} catch (Exception ex) {
throw new RuntimeException(ex.getMessage());
}
}
// --
private LoaTestMetricsData loadTestHttpCall(final Supplier<Single<Response>> restCallFunction, long callnTimes, long callEveryMilisseconds) {
long startTimeMillis = System.currentTimeMillis();
final LoaTestMetricsData loaDestMetricsData = new LoaTestMetricsData();
final AtomicInteger atomicInteger = new AtomicInteger(0);
final TestObserver<Response> testObserver = new TestObserver<Response>() {
public void onNext(Response response) {
loaDestMetricsData.responseList.add(response);
super.onNext(response);
}
public void onComplete() {
loaDestMetricsData.duration = Duration.ofMillis(System.currentTimeMillis() - startTimeMillis);
super.onComplete();
}
};
final Subject<Response> subjectInitCallResults = PublishSubject.create(); // Memo: Subjects are hot so if you don't observe them the right time, you may not get events. Thus: subscribe first then emit (onNext)
final Scheduler schedulerIo = Schedulers.io();
subjectInitCallResults
.subscribeOn(schedulerIo)
.subscribe(testObserver); // subscribe first
final Observable<Long> source = Observable.interval(callEveryMilisseconds, TimeUnit.MILLISECONDS).take(callnTimes);
source.subscribe(x -> {
final Single<Response> singleResult = restCallFunction.get();
singleResult
.subscribeOn(schedulerIo)
.subscribe( result -> {
int count = atomicInteger.incrementAndGet();
if(count == callnTimes) {
subjectInitCallResults.onNext(result); // then emit
subjectInitCallResults.onComplete();
} else {
subjectInitCallResults.onNext(result);
}
});
});
testObserver.awaitTerminalEvent();
testObserver.assertComplete();
testObserver.assertValueCount(Long.valueOf(callnTimes).intValue()); // !!!
return loaDestMetricsData;
}
}
The LoaTestMetricsData class is defined as:
import io.restassured.response.Response;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
public class LoaTestMetricsData {
    public List<Response> responseList = new ArrayList<>();
    public Duration duration;
}
Sometimes the test fails with this error:
java.lang.AssertionError: Value counts differ; expected: 10 but was: 9 (latch = 0, values = 9, errors = 0, completions = 1)
Expected :10
Actual :9 (latch = 0, values = 9, errors = 0, completions = 1)
Could someone tell me why?
It looks as if some of the subjectInitCallResults.onNext() calls were not executed or not consumed. But why? I understand that PublishSubject is a hot observable, so I subscribe to it before emitting anything to it via onNext.
UPDATE:
What fixes it is this ugly code, which waits for the subject to fill up:
while(subjectInitCallResults.count().blockingGet() != callnTimes) {
Thread.sleep(100);
}
..
testObserver.awaitTerminalEvent();
But is there a proper / better way of doing it?
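One alternative sketch (assuming it is acceptable to let RxJava collect the responses itself; restCallFunction, loaDestMetricsData, callnTimes and callEveryMilisseconds are the names from the code above) that avoids hand-rolling the counter and racing onNext/onComplete across io threads:
// Fire callnTimes requests, one every callEveryMilisseconds, and wait until every inner
// Single has emitted before completing; no Subject or AtomicInteger is needed.
long startTimeMillis = System.currentTimeMillis();
List<Response> responses = Observable
        .interval(callEveryMilisseconds, TimeUnit.MILLISECONDS)
        .take(callnTimes)
        .flatMapSingle(tick -> restCallFunction.get().subscribeOn(Schedulers.io()))
        .toList()          // completes only after all callnTimes responses have arrived
        .blockingGet();
loaDestMetricsData.responseList.addAll(responses);
loaDestMetricsData.duration = Duration.ofMillis(System.currentTimeMillis() - startTimeMillis);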
Thanks.
I'm not able to find a way to read messages from Pub/Sub using Java.
I'm using this Maven dependency in my pom:
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-pubsub</artifactId>
<version>0.17.2-alpha</version>
</dependency>
I implemented this main method to create a new topic:
public static void main(String... args) throws Exception {
// Your Google Cloud Platform project ID
String projectId = ServiceOptions.getDefaultProjectId();
// Your topic ID
String topicId = "my-new-topic-1";
// Create a new topic
TopicName topic = TopicName.create(projectId, topicId);
try (TopicAdminClient topicAdminClient = TopicAdminClient.create()) {
topicAdminClient.createTopic(topic);
}
}
The above code works well and, indeed, I can see the new topic I created in the Google Cloud console.
I implemented the following main method to write a message to my topic:
public static void main(String a[]) throws InterruptedException, ExecutionException{
String projectId = ServiceOptions.getDefaultProjectId();
String topicId = "my-new-topic-1";
String payload = "Hellooooo!!!";
PubsubMessage pubsubMessage =
PubsubMessage.newBuilder().setData(ByteString.copyFromUtf8(payload)).build();
TopicName topic = TopicName.create(projectId, topicId);
Publisher publisher;
try {
publisher = Publisher.defaultBuilder(
topic)
.build();
publisher.publish(pubsubMessage);
System.out.println("Sent!");
} catch (IOException e) {
System.out.println("Not Sended!");
e.printStackTrace();
}
}
Now I'm not able to verify if this message was really sent.
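A minimal sketch for verifying the publish (assuming the client version above still returns a future of the server-assigned message id from publish()):
// Blocks until Pub/Sub acknowledges the message; get() throws if the publish failed.
String messageId = publisher.publish(pubsubMessage).get();
System.out.println("Published with message id: " + messageId);
publisher.shutdown();   // flush and release resources when done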
I would like to implement a message reader using a subscription to my topic.
Could someone show me a correct, working Java example of reading messages from a topic?
Can anyone help me?
Thanks in advance!
Here is the version using the Google Cloud client libraries.
package com.techm.data.client;
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.ProjectTopicName;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.PushConfig;
/**
* A snippet for Google Cloud Pub/Sub showing how to create a Pub/Sub pull
* subscription and asynchronously pull messages from it.
*/
public class CreateSubscriptionAndConsumeMessages {
private static String projectId = "projectId";
private static String topicId = "topicName";
private static String subscriptionId = "subscriptionName";
public static void createSubscription() throws Exception {
ProjectTopicName topic = ProjectTopicName.of(projectId, topicId);
ProjectSubscriptionName subscription = ProjectSubscriptionName.of(projectId, subscriptionId);
try (SubscriptionAdminClient subscriptionAdminClient = SubscriptionAdminClient.create()) {
subscriptionAdminClient.createSubscription(subscription, topic, PushConfig.getDefaultInstance(), 0);
}
}
public static void main(String... args) throws Exception {
ProjectSubscriptionName subscription = ProjectSubscriptionName.of(projectId, subscriptionId);
createSubscription();
MessageReceiver receiver = new MessageReceiver() {
@Override
public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
System.out.println("Received message: " + message.getData().toStringUtf8());
consumer.ack();
}
};
Subscriber subscriber = null;
try {
subscriber = Subscriber.newBuilder(subscription, receiver).build();
subscriber.addListener(new Subscriber.Listener() {
@Override
public void failed(Subscriber.State from, Throwable failure) {
// Handle failure. This is called when the Subscriber encountered a fatal error
// and is
// shutting down.
System.err.println(failure);
}
}, MoreExecutors.directExecutor());
subscriber.startAsync().awaitRunning();
// In this example, we will pull messages for one minute (60,000ms) then stop.
// In a real application, this sleep-then-stop is not necessary.
// Simply call stopAsync().awaitTerminated() when the server is shutting down,
// etc.
Thread.sleep(60000);
} finally {
if (subscriber != null) {
subscriber.stopAsync().awaitTerminated();
}
}
}
}
This is working fine for me.
The Cloud Pub/Sub Pull Subscriber Guide has sample code for reading messages from a topic.
I haven't used the Google Cloud client libraries, but I have used the API client libraries. Here is how I created a subscription.
package com.techm.datapipeline.client;
import java.io.IOException;
import java.security.GeneralSecurityException;
import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import com.google.api.client.http.HttpStatusCodes;
import com.google.api.services.pubsub.Pubsub;
import com.google.api.services.pubsub.Pubsub.Projects.Subscriptions.Create;
import com.google.api.services.pubsub.Pubsub.Projects.Subscriptions.Get;
import com.google.api.services.pubsub.Pubsub.Projects.Topics;
import com.google.api.services.pubsub.model.ExpirationPolicy;
import com.google.api.services.pubsub.model.Subscription;
import com.google.api.services.pubsub.model.Topic;
import com.techm.datapipeline.factory.PubsubFactory;
public class CreatePullSubscriberClient {
private final static String PROJECT_NAME = "yourProjectId";
private final static String TOPIC_NAME = "yourTopicName";
private final static String SUBSCRIPTION_NAME = "yourSubscriptionName";
public static void main(String[] args) throws IOException, GeneralSecurityException {
Pubsub pubSub = PubsubFactory.getService();
String topicName = String.format("projects/%s/topics/%s", PROJECT_NAME, TOPIC_NAME);
String subscriptionName = String.format("projects/%s/subscriptions/%s", PROJECT_NAME, SUBSCRIPTION_NAME);
Topics.Get listReq = pubSub.projects().topics().get(topicName);
Topic topic = listReq.execute();
if (topic == null) {
System.err.println("Topic doesn't exist...run CreateTopicClient...to create the topic");
System.exit(0);
}
Subscription subscription = null;
try {
Get getReq = pubSub.projects().subscriptions().get(subscriptionName);
subscription = getReq.execute();
} catch (GoogleJsonResponseException e) {
if (e.getStatusCode() == HttpStatusCodes.STATUS_CODE_NOT_FOUND) {
System.out.println("Subscription " + subscriptionName + " does not exist...will create it");
}
}
if (subscription != null) {
System.out.println("Subscription already exists ==> " + subscription.toPrettyString());
System.exit(0);
}
subscription = new Subscription();
subscription.setTopic(topicName);
subscription.setPushConfig(null); // indicating a pull
ExpirationPolicy expirationPolicy = new ExpirationPolicy();
expirationPolicy.setTtl(null); // never expires;
subscription.setExpirationPolicy(expirationPolicy);
subscription.setAckDeadlineSeconds(null); // so defaults to 10 sec
subscription.setRetainAckedMessages(true);
Long _week = 7L * 24 * 60 * 60;
subscription.setMessageRetentionDuration(String.valueOf(_week)+"s");
subscription.setName(subscriptionName);
Create createReq = pubSub.projects().subscriptions().create(subscriptionName, subscription);
Subscription createdSubscription = createReq.execute();
System.out.println("Subscription created ==> " + createdSubscription.toPrettyString());
}
}
And once you create the (pull-type) subscription, this is how you pull messages from the topic.
package com.techm.datapipeline.client;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.List;
import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import com.google.api.client.http.HttpStatusCodes;
import com.google.api.client.util.Base64;
import com.google.api.services.pubsub.Pubsub;
import com.google.api.services.pubsub.Pubsub.Projects.Subscriptions.Acknowledge;
import com.google.api.services.pubsub.Pubsub.Projects.Subscriptions.Get;
import com.google.api.services.pubsub.Pubsub.Projects.Subscriptions.Pull;
import com.google.api.services.pubsub.model.AcknowledgeRequest;
import com.google.api.services.pubsub.model.Empty;
import com.google.api.services.pubsub.model.PullRequest;
import com.google.api.services.pubsub.model.PullResponse;
import com.google.api.services.pubsub.model.ReceivedMessage;
import com.techm.datapipeline.factory.PubsubFactory;
public class PullSubscriptionsClient {
private final static String PROJECT_NAME = "yourProjectId";
private final static String SUBSCRIPTION_NAME = "yourSubscriptionName";
private final static String SUBSCRIPTION_NYC_NAME = "test";
public static void main(String[] args) throws IOException, GeneralSecurityException {
Pubsub pubSub = PubsubFactory.getService();
String subscriptionName = String.format("projects/%s/subscriptions/%s", PROJECT_NAME, SUBSCRIPTION_NAME);
//String subscriptionName = String.format("projects/%s/subscriptions/%s", PROJECT_NAME, SUBSCRIPTION_NYC_NAME);
try {
Get getReq = pubSub.projects().subscriptions().get(subscriptionName);
getReq.execute();
} catch (GoogleJsonResponseException e) {
if (e.getStatusCode() == HttpStatusCodes.STATUS_CODE_NOT_FOUND) {
System.out.println("Subscription " + subscriptionName
+ " does not exist...run CreatePullSubscriberClient to create");
}
}
PullRequest pullRequest = new PullRequest();
pullRequest.setReturnImmediately(false); // wait until you get a message
pullRequest.setMaxMessages(1000);
Pull pullReq = pubSub.projects().subscriptions().pull(subscriptionName, pullRequest);
PullResponse pullResponse = pullReq.execute();
List<ReceivedMessage> msgs = pullResponse.getReceivedMessages();
List<String> ackIds = new ArrayList<String>();
int i = 0;
if (msgs != null) {
for (ReceivedMessage msg : msgs) {
ackIds.add(msg.getAckId());
//System.out.println(i++ + ":===:" + msg.getAckId());
String object = new String(Base64.decodeBase64(msg.getMessage().getData()));
System.out.println("Decoded object String ==> " + object );
}
//acknowledge all the received messages
AcknowledgeRequest content = new AcknowledgeRequest();
content.setAckIds(ackIds);
Acknowledge ackReq = pubSub.projects().subscriptions().acknowledge(subscriptionName, content);
Empty empty = ackReq.execute();
}
}
}
Note: this client only waits until it receives at least one message, then terminates after that single pull (receiving up to the maximum set via setMaxMessages at once).
Let me know if this helps. I'm going to try the cloud client libraries soon and will post an update once I get my hands on them.
And here's the missing factory class, if you plan to run it:
package com.techm.datapipeline.factory;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.logging.Level;
import java.util.logging.Logger;
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.pubsub.Pubsub;
import com.google.api.services.pubsub.PubsubScopes;
public class PubsubFactory {
private static Pubsub instance = null;
private static final Logger logger = Logger.getLogger(PubsubFactory.class.getName());
public static synchronized Pubsub getService() throws IOException, GeneralSecurityException {
if (instance == null) {
instance = buildService();
}
return instance;
}
private static Pubsub buildService() throws IOException, GeneralSecurityException {
logger.log(Level.FINER, "Start of buildService");
HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
JsonFactory jsonFactory = new JacksonFactory();
GoogleCredential credential = GoogleCredential.getApplicationDefault(transport, jsonFactory);
// Depending on the environment that provides the default credentials (for
// example: Compute Engine, App Engine), the credentials may require us to
// specify the scopes we need explicitly.
if (credential.createScopedRequired()) {
Collection<String> scopes = new ArrayList<>();
scopes.add(PubsubScopes.PUBSUB);
credential = credential.createScoped(scopes);
}
logger.log(Level.FINER, "End of buildService");
// TODO - Get the application name from outside.
return new Pubsub.Builder(transport, jsonFactory, credential).setApplicationName("Your Application Name/Version")
.build();
}
}
The message receiver is injected into the subscriber. This part of the code handles the messages (a sketch of how the receiver is wired into the subscriber follows the snippet):
MessageReceiver receiver =
new MessageReceiver() {
@Override
public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
// handle incoming message, then ack/nack the received message
System.out.println("Id : " + message.getMessageId());
System.out.println("Data : " + message.getData().toStringUtf8());
consumer.ack();
}
};
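For completeness, a minimal sketch of how that receiver is typically handed to the subscriber (the project and subscription ids are placeholders):
ProjectSubscriptionName subscriptionName =
        ProjectSubscriptionName.of("your-project-id", "your-subscription-id");
Subscriber subscriber = Subscriber.newBuilder(subscriptionName, receiver).build();
subscriber.startAsync().awaitRunning();
// ... keep the process alive while messages are handled, then:
subscriber.stopAsync().awaitTerminated();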
I am working on a Spark-based Kafka consumer that reads the data in Avro format.
Following is the try/catch code that reads and processes the input.
import java.util.*;
import java.io.*;
import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;
import kafka.serializer.StringDecoder;
import kafka.serializer.DefaultDecoder;
import scala.Tuple2;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import kafka.producer.KeyedMessage;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
public class myKafkaConsumer{
/**
* Main function, entry point to the program.
* @param args takes the user-ids as the parameters, which
* will be treated as topics in our case.
*/
private String [] topics;
private SparkConf sparkConf;
private JavaStreamingContext jssc;
public static final String USER_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"str1\", \"type\":\"string\" },"
+ " { \"name\":\"int1\", \"type\":\"int\" }"
+ "]}";
public static void main(String [] args){
if(args.length < 1){
System.err.println("Usage : myKafkaConsumber <topics/user-id>");
System.exit(1);
}
myKafkaConsumer bKC = new myKafkaConsumer(args);
bKC.run();
}
/**
* Constructor
*/
private myKafkaConsumer(String [] topics){
this.topics = topics;
sparkConf = new SparkConf();
sparkConf = sparkConf.setAppName("JavaDirectKafkaFilterMessages");
jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
}
/**
* run function, runs the entire program.
* #param topics, a string array containing the topics to be read from
* #return void
*/
private void run(){
HashSet<String> topicSet = new HashSet<String>();
for(String topic : topics){
topicSet.add(topic);
System.out.println(topic);
}
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "128.208.244.3:9092");
kafkaParams.put("auto.offset.reset", "smallest");
try{
JavaPairInputDStream<String, byte[]> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
byte[].class,
StringDecoder.class,
DefaultDecoder.class,
kafkaParams,
topicSet
);
JavaDStream<String> avroRows = messages.map(new Function<Tuple2<String, byte[]>, String>(){
public String call(Tuple2<String, byte[]> tuple2){
return testFunction(tuple2._2().toString());
}
});
avroRows.print();
jssc.start();
jssc.awaitTermination();
}catch(Exception E){
System.out.println(E.toString());
E.printStackTrace();
}
}
private static String testFunction(String str){
System.out.println("Input String : " + str);
return "Success";
}
}
The code compiles correctly; however, when I try to run it on a Spark cluster I get a Task not Serializable error. I tried removing the function and simply printing some text, but the error persists.
P.S. I have checked by printing the messages and found that they are read correctly.
The print statement collects the records of your RDD to the driver in order to print them on the screen. Such a task triggers serialization/deserialization of your data.
In order for your code to work, the records in the avroRows DStream must be of a serializable type.
For example, it should work if you replace the avroRows definition with this:
JavaDStream<String> avroRows = messages.map(new Function<Tuple2<String, byte[]>, String>(){
public String call(Tuple2<String, byte[]> tuple2){
return tuple2._2().toString();
}
});
I just added a toString to your records because the String type is serializable (of course, it is not necessarily what you need, it is just an example).
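If the goal is to actually decode the Avro payload (the question already imports bijection's Injection and defines USER_SCHEMA), a hedged sketch could look like the following. It assumes the producer serialized the records with the same schema through a binary Injection; the Injection is built in a static initializer so the non-serializable Schema is never captured in the closure, and a lambda is used so the enclosing myKafkaConsumer instance is not captured either:
// Assumed static field in myKafkaConsumer:
private static final Injection<GenericRecord, byte[]> recordInjection;
static {
    Schema.Parser parser = new Schema.Parser();
    recordInjection = GenericAvroCodecs.toBinary(parser.parse(USER_SCHEMA));
}
// Inside run(): decode each Kafka value instead of calling toString() on the byte[].
JavaDStream<String> avroRows = messages.map(tuple2 -> {
    GenericRecord record = recordInjection.invert(tuple2._2()).get();
    return record.get("str1") + "," + record.get("int1");
});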