Window data on an hourly (clock-aligned) basis in Apache Beam - Java

I am trying to aggregate streaming data for each clock hour (like 12:00 to 12:59 and 01:00 to 01:59) in a Dataflow/Apache Beam job.
My use case is as follows:
Data is streaming from Pub/Sub, and each message has a timestamp (the order date). I want to count the number of orders received in each hour, and I also want to allow data to arrive up to 5 hours late. Here is the sample code I am using:
LOG.info("Start Running Pipeline");
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
PCollection<String> tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));
PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
.apply("Flatten Data from PubSub", Flatten.<String>pCollections());
flattenData
.apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
// Adding Window
.apply(
Window.<SalesAndUnits>into(
SlidingWindows.of(Duration.standardMinutes(15))
.every(Duration.standardMinutes(1)))
)
// Data Enrich with Dimensions
.apply(ParDo.of(new DataEnrichWithDimentions()))
// Group And Hourly Sum
.apply(new GroupAndSumSales())
.apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
pipeline.run();
LOG.info("Finish Running Pipeline");

I'd use a window that matches the requirements you have. Something along the lines of:
Window.into(FixedWindows.of(Duration.standardHours(1)))
        .withAllowedLateness(Duration.standardHours(5))
Possibly followed by a count, as that's what I understood you need.
Hope it helps.
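To make that concrete, here is a minimal sketch of how it could slot into the pipeline above. The trigger, accumulation mode, and the use of Count.combineFn() are my assumptions, not part of the original answer; they simply make late orders (within the 5 hours) re-fire the hourly count instead of being dropped.

// Hedged sketch: clock-aligned hourly windows, 5 hours of allowed lateness,
// then one order count per window.
flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
        .apply("Hourly Windows",
                Window.<SalesAndUnits>into(FixedWindows.of(Duration.standardHours(1)))
                        .withAllowedLateness(Duration.standardHours(5))
                        .triggering(AfterWatermark.pastEndOfWindow()
                                .withLateFirings(AfterPane.elementCountAtLeast(1)))
                        .accumulatingFiredPanes())
        // withoutDefaults() is required when combining globally in non-global windows
        .apply("Count Orders Per Hour",
                Combine.globally(Count.<SalesAndUnits>combineFn()).withoutDefaults());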

Related

How to reduce task size in Spark MLlib?

I'm trying to implement a Random Forest Classifier using Apache Spark (2.2.0) and Java.
Basically, I've followed the example from the Spark documentation.
For test purposes I'm using a local cluster:
SparkSession spark = SparkSession
        .builder()
        .master("local[*]")
        .appName(appName)
        .getOrCreate();
My training/test data includes 30k rows. The data is fetched from REST APIs and transformed into a Spark Dataset.
List<PreparedWUMLogFile> logs = //... get from REST API
Dataset<PreparedWUMLogFile> dataSet = spark.createDataset(logs, Encoders.bean(PreparedWUMLogFile.class));
Dataset<Row> data = dataSet.toDF();
For many stages I get the following warning message:
[warn] o.a.s.s.TaskSetManager - Stage 0 contains a task of very large size (3002 KB). The maximum recommended task size is 100 KB.
How can I reduce the task size in this case?
Edit:
To be more concrete: 5 of the 30 stages produce these warning messages:
rdd at StringIndexer.scala:111 (two times)
take at VectorIndexer.scala:119
rdd at VectorIndexer.scala:122
rdd at Classifier.scala:82
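One thing that often drives this warning is that the Dataset is built from a driver-side List, so each task's serialized binary carries a slice of the records. A hedged sketch of one possible mitigation, assuming the data must stay in driver memory (the slice count of 64 is purely illustrative): spread the list over more partitions so each task carries a smaller chunk, or alternatively write the REST-fetched data to storage first and read it back through Spark.

// Hedged sketch: parallelize the driver-side List over more slices so that
// each serialized task carries fewer records.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<PreparedWUMLogFile> logsRdd = jsc.parallelize(logs, 64); // 64 is an example value
Dataset<Row> data = spark.createDataFrame(logsRdd, PreparedWUMLogFile.class);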

Apache Beam reads data from 2 input sources and unable to join data correctly in some cases

I want to read messages from 2 Pub/Sub topics and associate the second message with the first.
For example, say we read user scores from a Pub/Sub topic "scores" and who the winner is from another Pub/Sub topic "results". Note that I use .withTimestampAttribute("timestamp") when I read from Pub/Sub so that event time is used for processing; also, when a scores message is associated with a results message, both messages are published to Pub/Sub with the same event time. The streaming pipeline should output a result saying who won and who lost among the users.
The pipeline writes output every minute to Google Cloud Storage.
Below is my code snippet that tries to do that, but while associating these 2 messages a few edge cases do not work. For instance, I know for a fact that the result saying who the winner is will only be available after the scores for that event occur, so I want to allow an extra 5 minutes for it to arrive.
But when I CoGroupByKey these 2 messages, one doesn't wait for the other.
// Read from PubSub
PCollection<String> scoresInput = pipeline.apply(PubsubIO.readStrings()
        .withTimestampAttribute("timestamp")
        .fromTopic("projects/test/topics/scores"));
PCollection<String> winInput = pipeline.apply(PubsubIO.readStrings()
        .withTimestampAttribute("timestamp")
        .fromTopic("projects/test/topics/results"));

// Apply some transformation on the scores message
PCollection<KV<String, User>> scoreToTrx = scoresInput
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
                .discardingFiredPanes()
                .withAllowedLateness(Duration.ZERO))
        .apply(ParDo.of(new ExtractScoresToTransactionIdFn()));

// Group by all common trx
PCollection<KV<String, Iterable<User>>> scoresToTrxGrouped =
        scoreToTrx.apply(GroupByKey.<String, User>create());

// Create one object with an array of users per transaction:
// Users(transactionId, array of all user names, array of all scores,
// array of win info (defaulted to false))
PCollection<KV<String, Users>> users = scoresToTrxGrouped
        .apply(ParDo.of(new ProcessAllUsersAndScoresToTransactionIdFn()));

// Apply some transformation on the results message
PCollection<KV<String, Winner>> winToTrx = winInput
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
                .discardingFiredPanes()
                .withAllowedLateness(Duration.standardMinutes(5)))
        .apply(ParDo.of(new ExtractWinToTransactionIdFn()));

// Now associate the user score to the winner
final TupleTag<Users> scoreTag = new TupleTag<Users>();
final TupleTag<Winner> winTag = new TupleTag<Winner>();
PCollection<KV<String, CoGbkResult>> trxIdToUserAndWinnerCoGbkResult =
        KeyedPCollectionTuple.of(scoreTag, users)
                .and(winTag, winToTrx)
                .apply(CoGroupByKey.<String>create());

// The DoFn here will check if the winner is present in the users list and then
// update the winInfo of that user to "winner".
PCollection<String> joinUserToWinner = trxIdToUserAndWinnerCoGbkResult
        .apply(ParDo.of(new MapWinnerToUserForTransactionFn(scoreTag, winTag)));
Say for example I posted the below score message at event time Oct 12, 2017 at 4.12.04 pm
{
"transactionId": "1234",
"userName": "Amy",
"score": "10"
}
Then another one at Oct 12, 2017 at 4.12.20 pm
{
"transactionId": "1234",
"userName": "Becca",
"score": "7"
}
Finally the winner is posted at Oct 12, 2017 at 4.15.20 pm
{
"transactionId": "1234",
"winner": "Amy"
}
Since it is a fixed window of 1 minute, the window for this case will be [4.12.00, 4.13.00); the winner falls outside this window, so it is not considered and the output shows no winner:
{
"transactionId": "1234",
"users": ["Amy","Becca"],
"scores": ["10","7"],
"winInfo": ["loser","loser"]
}
Another case that can happen is when the results message reaches the pipeline before the scores, for example because the score messages are still waiting to be triggered after the GroupByKey. In that case the results message gets discarded because it cannot be associated with any scores, and the output again has no winner even though there was one.
Note that it is OK in some cases where data is genuinely too late; I don't want to wait for it.
But simple cases of 1:1 association, say one user score and one winner, are not honored.
If I were reading from one input source I understand I could tweak my triggers to achieve what I want, but in my case the data comes from 2 different sources and each depends on the other to produce correct results.
My question is: how can we handle cases like these, where the triggering of one input depends on the other input's data?
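One possible direction, offered as a hedged sketch rather than a verified fix: give both inputs the same windowing strategy, allow the same lateness on each, and re-fire the window when late elements arrive instead of discarding earlier panes, so CoGroupByKey can still match a late winner against the scores it has already seen. The trigger and accumulation choices below are assumptions.

// Hedged sketch: identical windowing on both inputs, with late firings so a
// winner arriving within the allowed lateness re-triggers the joined window.
Window<String> sharedWindow = Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(5))
        .accumulatingFiredPanes();

PCollection<KV<String, User>> scoreToTrx = scoresInput
        .apply("Window Scores", sharedWindow)
        .apply(ParDo.of(new ExtractScoresToTransactionIdFn()));

PCollection<KV<String, Winner>> winToTrx = winInput
        .apply("Window Results", sharedWindow)
        .apply(ParDo.of(new ExtractWinToTransactionIdFn()));

Note that this only helps when the late winner's event time still falls inside the same one-minute window as the scores. If the winner can legitimately carry a later event time, a session window or a stateful DoFn keyed by transactionId (buffering scores until the winner arrives or a timer fires) would likely be a more robust alternative.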

Process binary data in Spark Structured Streaming

I am using Kafka and Spark Structured Streaming. I am receiving Kafka messages in the following format.
{"deviceId":"001","sNo":1,"data":"aaaaa"}
{"deviceId":"002","sNo":1,"data":"bbbbb"}
{"deviceId":"001","sNo":2,"data":"ccccc"}
{"deviceId":"002","sNo":2,"data":"ddddd"}
I am reading it like below.
Dataset<String> data = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option(subscribeType, topics)
        .load()
        .selectExpr("CAST(value AS STRING)")
        .as(Encoders.STRING());

Dataset<DeviceData> ds = data.as(ExpressionEncoder.javaBean(DeviceData.class))
        .orderBy("deviceId", "sNo");

ds.foreach(event ->
        processData(event.getDeviceId(), event.getSNo(), event.getData().getBytes())
);
private void processData(String deviceId,int SNo, byte[] data)
{
//How to check previous processed Dataset???
}
In my JSON message, "data" is the String form of a byte[]. I have a requirement to process the binary "data" for a given "deviceId" in order of "sNo". So for "deviceId"="001", I have to process the binary data for "sNo"=1, then "sNo"=2, and so on. How can I check the state of the previously processed Dataset in Structured Streaming?
If you are looking for state management like DStream.mapWithState, it is not supported yet in Structured Streaming. Work is in progress; please check
https://issues.apache.org/jira/browse/SPARK-19067.
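For reference, once that work lands (Spark 2.2+ exposes mapGroupsWithState/flatMapGroupsWithState), per-device state such as the last processed sNo can be kept across micro-batches. The following is a minimal sketch under that assumption; the state handling is illustrative only, and events inside a micro-batch are not guaranteed to be ordered, so real code would buffer out-of-order records rather than skip them.

// Hedged sketch, assuming Spark 2.2+ where mapGroupsWithState exists.
// The Integer state holds the last sNo processed for each deviceId.
Dataset<DeviceData> ds = data.as(ExpressionEncoder.javaBean(DeviceData.class));
// no orderBy here: ordering is handled per key via the state below

Dataset<String> processed = ds
        .groupByKey((MapFunction<DeviceData, String>) DeviceData::getDeviceId, Encoders.STRING())
        .mapGroupsWithState(
                (MapGroupsWithStateFunction<String, DeviceData, Integer, String>) (deviceId, events, state) -> {
                    int lastSNo = state.exists() ? state.get() : 0;
                    while (events.hasNext()) {
                        DeviceData event = events.next();
                        // only process the next expected sequence number
                        if (event.getSNo() == lastSNo + 1) {
                            processData(deviceId, event.getSNo(), event.getData().getBytes());
                            lastSNo = event.getSNo();
                        }
                    }
                    state.update(lastSNo);
                    return deviceId;
                },
                Encoders.INT(),
                Encoders.STRING());

// mapGroupsWithState requires the Update output mode
processed.writeStream()
        .outputMode("update")
        .format("console")
        .start();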

Input streaming data not distributed equally among tasks

I have written a Spark Streaming job which reads data from S3.
The job has a series of mapWithState calls followed by mapToPair calls, like below:
JavaDStream<String> cdrLines = ssc.textFileStream(cdrInputFile);

JavaDStream<CDR> cdrRecords = cdrLines.map(x -> cdrStreamParser.parse(x));

JavaDStream<CDR> cdrRecordsFiltered = cdrRecords
        .filter(t -> t != null);

JavaPairDStream<String, CDR> sTripletStream = cdrRecordsFiltered
        .mapToPair(s -> new Tuple2<String, CDR>(s.gettNumber(), s));

JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream1 = sTripletStream
        .mapWithState(StateSpec.function(hsMappingFunc).initialState(tripletRDD))
        .mapToPair(s -> s);

JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream2 = stateDstream1
        .mapWithState(StateSpec.function(cfMappingFunc).initialState(cfHistoryRDD))
        .mapToPair(s -> s);

JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream3 = stateDstream2
        .mapWithState(StateSpec.function(imeiMappingFunc).initialState(imeiRDD))
        .mapToPair(s -> s);
I have spark.default.parallelism set to 6. The first and last mapToPair stages are fast enough, but the second and third mapToPair stages are very slow.
Each of these stages runs 6 tasks. In the second and third mapToPair stages, 5 of the tasks finish in about 2s, but one task takes a very long time, ~3-4 min. The shuffle data for that task is very high compared to the other tasks, which causes the bottleneck.
Is there a way to distribute the load among all tasks more uniformly?
This is a use case for CDR processing. Each CDR event has the fields telno, imei, imsi, callforward, and timestamp.
I maintain 3 kinds of info in Spark state: 1. the last known CDR event (record) for a given telephone number, 2. the callforward number list for each telephone, 3. the list of all known IMEIs.
The three mapWithState calls correspond to the following steps:
Step 1: As CDR events come in, I need to do some field comparisons with the last known CDR event for the same telephone number. I keep the latest event for a given telno in Spark state so I can do those comparisons as new CDR events come in.
Step 2: For a given telno, I want to check whether the callforward number is a known number or not, so I need to maintain a history of telno -> list of callforward numbers in the state.
Step 3: I need to maintain a list of all IMEI numbers seen so far in the state, so that for each imei in a CDR event we can say whether it is a known or a new imei.
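One thing that sometimes helps here, offered as a hedged sketch rather than a verified fix: if the imbalance comes from too few state partitions rather than a single hot key, mapWithState lets you supply an explicit partitioner (and partition count) via StateSpec so the stateful stages are spread over more tasks. If one key genuinely holds most of the records, only salting or splitting that key will help. The partition count below is illustrative.

// Hedged sketch: spread the mapWithState work over more state partitions.
int statePartitions = 24; // illustrative; tune to cluster cores and data volume
JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream1 = sTripletStream
        .mapWithState(StateSpec.function(hsMappingFunc)
                .initialState(tripletRDD)
                .partitioner(new HashPartitioner(statePartitions)))
        .mapToPair(s -> s);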

Does Storm Trident newValuesStream after persistentAggregate maintain the partitioning from groupBy?

I am currently trying to scale a Trident topology that does some post-processing after a groupBy and persistentAggregate, using newValuesStream to stream values after the aggregate step. I was wondering whether the tuples remain partitioned as they were during the groupBy step, or whether they are redistributed in some other fashion.
relevant code:
.groupBy(new Fields("key"))
.name("GroupBy")
.persistentAggregate(new MemoryMapState.Factory(), new Fields("foo", "bar"), new Aggregator(), new Fields("foobar"))
.newValuesStream()
.name("NewValueStream")
