I have a Beam pipeline based on the Java SDK that reads Avro messages from Kafka. The pipeline receives the messages and tries to create a sliding window:
PCollection<AvroMessage> message_timestamped =
    messageValues
        .apply(
            "append event time for PCollection records",
            WithTimestamps.of(
                (AvroMessage rec) -> new Instant(rec.getTime())));

PCollection<AvroMessage> messages_Windowed =
    message_timestamped
        .apply(
            Window
                .<AvroMessage>into(
                    SlidingWindows
                        .of(Duration.standardMinutes(2))
                        .every(Duration.standardMinutes(1)))
                .discardingFiredPanes());
Does the window get invoked after 2 minutes, or is a trigger configuration necessary? I tried to access the window pane information in a ParDo, but it gets triggered for each received message and doesn't wait to accumulate messages for the configured 2 minutes. What kind of trigger is required (after 2 minutes, process only the current window's messages)?
Do I need to include any specific configuration to run with unbounded Kafka messages?
I have used a timestamp policy to use the message timestamp during the KafkaIO read operation:
.withTimestampPolicyFactory(
(tp, previousWaterMark) -> new CustomFieldTimePolicy(previousWaterMark))
It is important to consider that windows and triggers have very different purposes:
Windows are based on the timestamps in the data, not on when they arrive or when they are processed. I find the best way to think about "windows" is as a secondary key. When data is unbounded/infinite, you need one of the grouping keys to have an "end" - a timestamp when you can say they are "done". Windows provide this "end". If you want to control how your data is aggregated, use windows.
Triggers are a way to try to control how output flows through your pipeline. They are not closely related to your business logic. If you want to manage the flow of data, use triggers.
To answer your specific questions:
Windows do not wait. An element that arrives may be assigned to a window that is "done" 1ms after it arrives. This is just fine.
Since you have not changed the default trigger, you will get one output with all of the elements for a window.
You also do not need discardingFiredPanes. Your configuration only produces one output per aggregation, so this has no effect.
But there is actually a problem that you will want to fix: the watermark (this controls when a window is "done") is determined by the source. Using WithTimestamps does not change the watermark. You will need to specify the timestamp in the KafkaIO transform, using withTimestampPolicyFactory. Otherwise, the watermark will move according to the publish time and may declare data late or drop data.
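To make that concrete, here is a minimal sketch of what a custom timestamp policy like the CustomFieldTimePolicy referenced above could look like. It assumes AvroMessage.getTime() returns the event time in epoch milliseconds; the watermark simply tracks the largest event timestamp seen so far.
import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

// Sketch only: assumes AvroMessage.getTime() returns epoch millis.
class CustomFieldTimePolicy extends TimestampPolicy<String, AvroMessage> {

  private Instant currentWatermark;

  CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
    this.currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
  }

  @Override
  public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, AvroMessage> record) {
    Instant eventTime = new Instant(record.getKV().getValue().getTime());
    // Advance the watermark monotonically with the event times seen so far.
    if (eventTime.isAfter(currentWatermark)) {
      currentWatermark = eventTime;
    }
    return eventTime;
  }

  @Override
  public Instant getWatermark(PartitionContext ctx) {
    return currentWatermark;
  }
}
Wired up with withTimestampPolicyFactory as in the question, this moves the watermark according to the embedded event time rather than the Kafka publish time.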
I'm building a Kafka Streams application where I want to make use of Session Windows.
Say my session is configured as follows:
// Inactivity gap is 5 seconds
// Grace period is 1 second
Duration inactivityGapDuration = Duration.ofSeconds(5);
Duration graceDuration = Duration.ofSeconds(1);
KStream<Windowed<String>, EventData> windowedListKStream = groupedStream.windowedBy(
        SessionWindows.ofInactivityGapAndGrace(inactivityGapDuration, graceDuration))
    .aggregate(...)
    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
    .toStream();
And given the following stream events:
Event Key    Time
A            10
B            12
Based on reading the docs and experiments I expect this will create 2 session windows: one with key A and one with key B.
Now say I receive this next event:
Event Key    Time
B            20
This will close the window with key B, but the window with key A will remain open. That is to say, when an event for a given key is received, only the stream time for the windows that have that key will advance. Is my understanding here correct?
If so, then this behavior is not exactly what I need. What I need is if I never see another event with key A then for the key A window to eventually close.
I think this is where the Punctuator can come in. However, if I read the docs correctly then I would need to basically re-implement the Session Window logic using the Processor API if I want to add a Punctuator. As far as I can tell I can't inject a Punctuator event into the session window DSL implementation in order to move the stream time along.
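For reference, here is a rough sketch (not a drop-in replacement for the DSL session window) of how a wall-clock punctuator is scheduled in the Processor API. The store name and the expiry logic are placeholders; a real implementation would have to track and forward the open sessions itself.
import java.time.Duration;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch only: closes sessions based on wall-clock time instead of stream time.
public class SessionExpiryProcessor implements Processor<String, EventData, String, EventData> {

  private ProcessorContext<String, EventData> context;
  private KeyValueStore<String, Long> lastSeenStore;

  @Override
  public void init(ProcessorContext<String, EventData> context) {
    this.context = context;
    this.lastSeenStore = context.getStateStore("last-seen-store"); // hypothetical store name
    // Fire once per second on wall-clock time, regardless of whether new events arrive.
    context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
      // Scan lastSeenStore, close any session whose key has been inactive longer than the
      // 5-second gap, and forward the aggregated result downstream (omitted here).
    });
  }

  @Override
  public void process(Record<String, EventData> record) {
    // Remember the last time each key was seen so the punctuator can expire idle sessions.
    lastSeenStore.put(record.key(), record.timestamp());
  }
}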
If all of the above is correct, then this seems like a big lift for what seems like a simple operation. Am I missing some other feature that would make this a simpler implementation?
Thank you!
I need to create an Apache Beam (Java) streaming job that should start once (and only once) every 60 seconds.
I got it working correctly using DirectRunner by using GenerateSequence, Window, and Combine.
However when I run it on Google Dataflow, sometimes it is triggered more than once within the 60 seconds window. I am guessing it has something to do with delays and out of order messages.
Pipeline pipeline = Pipeline.create(options);
pipeline
    // Generate a tick every 15 seconds
    .apply("Ticker", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(15)))
    // Just to check if individual ticks are being generated once every 15 seconds
    .apply(ParDo.of(new DoFn<Long, Long>() {
      @ProcessElement
      public void processElement(@Element Long tick, OutputReceiver<Long> out) {
        ZonedDateTime currentInstant = Instant.now().atZone(ZoneId.of("Asia/Jakarta"));
        LOG.warn("-" + tick + "-" + currentInstant.toString());
        out.output(tick);
      }
    }))
    // 60 second window
    .apply("Window", Window.<Long>into(FixedWindows.of(Duration.standardSeconds(60))))
    // Emit once per 60 seconds
    .apply("Combine window into one", Combine.globally(Count.<Long>combineFn()).withoutDefaults())
    .apply("START", ParDo.of(new DoFn<Long, ZonedDateTime>() {
      @ProcessElement
      public void processElement(@Element Long count, OutputReceiver<ZonedDateTime> out) {
        ZonedDateTime currentInstant = Instant.now().atZone(ZoneId.of("Asia/Jakarta"));
        // LOG just to check
        // This log is sometimes printed more than once within 60 seconds
        LOG.warn("x" + count + "-" + currentInstant.toString());
        out.output(currentInstant);
      }
    }));
It works most of the time, except once every 5 or 10 minutes at random I see two outputs in the same minute. How do I ensure "START" above runs once every 60 seconds? Thanks.
Short answer: you can't currently; the Beam model is focused on event-time processing and correct handling of late data.
Workaround: you can define a processing-time timer, but you will have to deal with outputs and handling of the timer and late data manually, see this or this.
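For illustration, here is a minimal sketch of that workaround (not the asker's pipeline): a stateful DoFn that buffers keyed elements and flushes a count on a processing-time timer. It assumes the ticks have been keyed first, e.g. to a single dummy key.
import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Sketch only: buffers keyed elements and emits a count once per minute of processing time.
class FlushEveryMinuteFn extends DoFn<KV<String, Long>, Long> {

  @StateId("buffer")
  private final StateSpec<BagState<Long>> bufferSpec = StateSpecs.bag(VarLongCoder.of());

  @StateId("timerArmed")
  private final StateSpec<ValueState<Boolean>> timerArmedSpec = StateSpecs.value(BooleanCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      @Element KV<String, Long> element,
      @StateId("buffer") BagState<Long> buffer,
      @StateId("timerArmed") ValueState<Boolean> timerArmed,
      @TimerId("flush") Timer flush) {
    buffer.add(element.getValue());
    // Arm the timer only once per pending flush, so it fires one minute after the first element.
    if (!Boolean.TRUE.equals(timerArmed.read())) {
      flush.offset(Duration.standardMinutes(1)).setRelative();
      timerArmed.write(true);
    }
  }

  @OnTimer("flush")
  public void onFlush(
      @StateId("buffer") BagState<Long> buffer,
      @StateId("timerArmed") ValueState<Boolean> timerArmed,
      OutputReceiver<Long> out) {
    long count = 0;
    for (Long ignored : buffer.read()) {
      count++;
    }
    out.output(count);
    buffer.clear();
    timerArmed.write(false);
  }
}
The caveats below still apply: with a processing-time timer you handle late data and flush bookkeeping yourself, and you give up the event-time guarantees described next.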
More details:
Windows and triggers in Beam are usually defined in event time, not in processing time. This way if you have late data coming after you already emitted the results for a window, late data still ends up in the correct window and results can be re-calculated for that window. Beam model allows you to express that logic and most of its functionality is tailored for that.
This also means that usually there is no requirement for a Beam pipeline to emit results at some specific real-world time; e.g. it doesn't make sense to say things like "aggregate the events that belong to some window based on the data in the events themselves, and then output that window every minute". The runner aggregates the data for the window, possibly waits for late data, and then emits results as soon as it deems right. The condition under which the data is ready to be emitted is specified by a trigger. But that's just that - a condition for when the window data is ready to be emitted; it doesn't actually force the runner to emit it. So the runner can emit at any point in time after the trigger condition is met and the results are going to be correct, i.e. if more events have arrived since the trigger condition was met, only the ones that belong to the concrete window will be processed in that window.
Event-time windowing doesn't work with processing-time triggering, and there are no convenient primitives (triggers/windows) in Beam to deal with processing time in the presence of late data. In this model, if you use a trigger that only fires once, you lose the late data, and you still don't have a way to define a robust processing-time trigger. To build something like that you have to be able to specify things like the real-life point in time from which to start measuring the processing time, and you have to deal with differing processing times and delays across a large fleet of worker machines. This just is not part of Beam at the moment.
There are efforts in the Beam community that will enable this use case, e.g. sink triggers and retractions, which will allow you to define your pipeline in event-time space but remove the need for complex event-time triggers. The results could either be immediately updated/recalculated and emitted, or the trigger could be specified at a sink, like "I want the output table to be updated every minute". The results would then be updated and recalculated for late data automatically without your involvement. These efforts are far from completion at this point though, so your best bet currently is either using one of the existing triggers or manually handling everything with timers.
I have an issue with KStreams aggregation and windows. I want to aggregate a record into a list of records which have the same key as long as it falls inside a time window.
I have chosen SessionWindows because I have to work with a moving window inside a session: let's say record A arrives at 10:00:00; then every other record with the same key that arrives inside the 10 second window time (until 10:00:10) will fall into the same session, bearing in mind that if it arrives at 10:00:03, the window will move until 10:00:13 (+10s). That leads us to have a moving window of +10s from the last record received for a given key.
Now the problem: I want to obtain the last aggregated result. I have used .suppress() to indicate that I don't want any intermediate results, just the last one when the window closes. This is not working as expected: while it doesn't send any intermediate aggregated results, when the time window ends I don't get any result at all. I have noticed that in order to receive it I need to publish another message to the topic, which in my case is impossible.
Reading about .suppress() I have come to the conclusion that it may not be the way to achieve what I want; that's why my question is: how can I force the window to close and send the latest aggregated calculated result?
@StreamListener(ExtractContractBinding.RECEIVE_PAGE)
@SendTo(ExtractCommunicationBinding.AGGREGATED_PAGES)
public KStream<String, List<Records>> aggregatePages(KStream<?, Record> input) {
  return input.map(this::getRecord)
      .groupBy(keyOfElement)
      .windowedBy(SessionWindows.with(Duration.ofSeconds(10L)).grace(Duration.ofSeconds(10L)))
      .aggregate(...do stuff...)
      .suppress(Suppressed.untilWindowCloses(unbounded()))
      .toStream()
      .map(this::createAggregatedResult);
}
In short, the reason this happens is that in KStreams, and most other stream processing engines that compute aggregations, time works based on event time.
https://kafka.apache.org/0101/documentation/streams#streams_time
In other words, the window cannot close until a new message arrives beyond your time window plus the grace period that accounts for late-arriving messages.
Moreover, based on some unit tests I’ve been writing recently I’m inclined to believe that the second message needs to land in the same partition as the previous message for event time to move forward. In practice, when you run in production and presumably process hundreds of messages per second this becomes unnoticeable.
Let me also add that you can implement a custom timestamp extractor, which gives you fine-grained control over which time window a particular message lands in.
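As a rough sketch (assuming the Record value type from your code exposes a hypothetical getEventTime() accessor returning epoch milliseconds), a custom extractor looks like this and is registered via the default.timestamp.extractor config (StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG):
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Sketch only: getEventTime() is a hypothetical accessor on the question's Record type.
public class RecordFieldTimestampExtractor implements TimestampExtractor {

  @Override
  public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
    Object value = record.value();
    if (value instanceof Record) {
      // Use the event time embedded in the payload instead of the broker/producer timestamp.
      return ((Record) value).getEventTime();
    }
    // Fall back to the previously observed partition time for unexpected payloads.
    return partitionTime;
  }
}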
how can I force the window to close and send the latest aggregated calculated result?
To finally answer your question, it’s not possible to force the time window to close without emitting an extra message to the source topic.
I have an SQS queue that will receive a huge number of messages. The messages keep coming to the queue.
I have a use case where, once the number of messages in the queue reaches a certain number X (such as 1,000), the system needs to trigger an event to process 1,000 messages at a time.
The system will fire a series of such triggers, each covering a thousand messages.
For example, if we have 2,300 messages in the queue, we expect 3 triggers to a Lambda function: the first 2 triggers correspond to 1,000 messages each, and the last one contains 300 messages.
I'm researching and see that a CloudWatch alarm can hook into the SQS "NumberOfMessagesReceived" metric and send to SNS, but I don't know how I can configure an alarm for each chunk of 1,000 messages.
Please advise me whether AWS can support this use case or what customization we can make to achieve it.
So after going through some clarifications in the comments section with the OP, here's my answer (combined with @ChrisPollard's comment):
Achieving what you want with SQS is impossible, because every batch can only contain up to 10 messages. Since you need to process 1000 messages at once, this is definitely a no-go.
@ChrisPollard suggested creating a new record in DynamoDB every time a new file is pushed to S3. This is a very good approach. Increment the partition key by 1 every time and trigger a Lambda through DynamoDB Streams. In your function, run a check against your partition key and, if it equals 1000, run a query against your DynamoDB table filtering the last 1000 updated items (you'll need a Global Secondary Index on your CreatedAt field). Map these items (or use Projections) to create a very minimal JSON that contains only the necessary information. Something like:
[
  {
    "key": "my-amazing-key",
    "bucket": "my-super-cool-bucket"
  },
  ...
]
A JSON like this is only 87 bytes long (if you take the square brackets out of the game because they won't be repeated, you're left with 83 bytes). If you round it up to 100 bytes, you can still successfully send all 1,000 of them as one SQS message, as that is only around 100 KB of data, well under the 256 KB SQS message size limit.
Then have one Lambda function subscribe to your SQS queue and finally concatenate the 1,000 files.
Things to keep in mind:
Make sure you really create the createdAt field in DynamoDB. By the time it hits one thousand, new items could have been inserted, so this way you make sure you are reading the 1000 items that you expected.
In your Lambda check, just test batchId % 1000 == 0 (see the sketch after this list); this way you don't need to delete anything, saving DynamoDB operations.
Watch out for the execution time of your Lambda. Concatenating 1,000 files at once may take a while to run, so I'd run a couple of tests and put 1 minute of overhead on top of it, i.e., if it usually takes 5 minutes, set your function's timeout to 6 minutes.
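To illustrate the modulo check from the list above, here is a hedged sketch of a DynamoDB Streams handler using the aws-lambda-java-events model classes; the batchId attribute name and the triggerBatch step are hypothetical.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

// Sketch only: fires the batch processing once per 1,000 inserted counter items.
public class BatchTriggerHandler implements RequestHandler<DynamodbEvent, Void> {

  @Override
  public Void handleRequest(DynamodbEvent event, Context context) {
    for (DynamodbStreamRecord record : event.getRecords()) {
      if (record.getDynamodb().getNewImage() == null) {
        continue; // ignore REMOVE events
      }
      // "batchId" is the hypothetical ever-increasing counter attribute.
      long batchId = Long.parseLong(record.getDynamodb().getNewImage().get("batchId").getN());
      if (batchId % 1000 == 0) {
        triggerBatch(batchId);
      }
    }
    return null;
  }

  private void triggerBatch(long batchId) {
    // Placeholder: query the last 1,000 items via the CreatedAt GSI and send the minimal JSON to SQS.
  }
}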
If you have new info to share I am happy to edit my answer.
You can add alarms at 1k, 2k, 3k, etc., but that seems clunky.
Is there a reason you're letting the messages batch up? You can make this trigger event-based (when a queue message is added fire my lambda) and get rid of the complications of batching them.
I handled a very similar situation recently: process A puts objects in an S3 bucket, and every time it does, it puts a message in SQS with the key and bucket details. I have a Lambda that is triggered every hour, but it can be any trigger, like your CloudWatch alarm. Here is what you can do on every trigger:
Read the messages from the queue. SQS allows you to read only 10 messages at a time, so every time you read a batch, keep adding the messages to a list in your Lambda; you also get a receipt handle for every message, which you can use to delete it. Repeat this process until you have read all 1,000 messages in the queue. Then you can perform whatever operations are required on your list and feed the result to process B in a number of different ways, such as a file in S3 and/or a new queue that process B can read from.
Alternate approach to reading messages: since SQS allows you to read only 10 messages at a time, you can send the optional parameter VisibilityTimeout=60, which hides the read messages from the queue for 60 seconds, and recursively read until you no longer see any messages, all while adding them to a list in the Lambda for processing. This can be tricky, since you have to try out different visibility timeout values based on how long it takes to read 1,000 messages. Once you know you have read all the messages, you can simply take the receipt handles and delete all of them. You could also purge the queue, but then you may delete messages that arrived during this process and were not read at least once.
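A minimal sketch of that drain loop with the AWS SDK for Java v2 (queue URL and downstream handling are placeholders):
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

// Sketch only: collects everything currently visible in the queue, then deletes it.
public class QueueDrainer {

  public List<Message> drain(SqsClient sqs, String queueUrl) {
    List<Message> collected = new ArrayList<>();
    while (true) {
      ReceiveMessageRequest request = ReceiveMessageRequest.builder()
          .queueUrl(queueUrl)
          .maxNumberOfMessages(10)   // SQS hard limit per receive call
          .visibilityTimeout(60)     // hide read messages while the rest are collected
          .build();
      List<Message> batch = sqs.receiveMessage(request).messages();
      if (batch.isEmpty()) {
        break;                       // nothing left that is currently visible
      }
      collected.addAll(batch);
    }
    // Once the whole batch has been processed, delete each message by its receipt handle.
    for (Message message : collected) {
      sqs.deleteMessage(DeleteMessageRequest.builder()
          .queueUrl(queueUrl)
          .receiptHandle(message.receiptHandle())
          .build());
    }
    return collected;
  }
}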
Good day, guys!
We have a pretty straightforward application adapter: once every 30 seconds it reads records from a database (which we can't write to) of one system, converts each of these records into an internal format, performs filtering, enrichment, ..., and finally transforms the resulting, let's say, entities into XML and sends them via JMS to another system. Nothing new.
Let's add some spice here: records in the database are sequential (that means their identifiers are generated by a sequence), and when it is time to read a new bunch of records, we get a last-processed-sequence-number -- which is stored in our internal database and updated each time the next record is processed (sent to JMS) -- and start reading from that record (+1).
The problem is that our customers gave us an NFR: processing of a read record bunch must not last longer than 30 seconds. Since there are a lot of steps in the workflow (some of them pretty long-running), it is possible to get a pretty big bunch of records, and since we process them one by one, it can take more than 30 seconds.
Because of all the above I want to ask 2 questions:
1) Is there an approach to parallel processing of sequential data - maybe with one or several intermediate stores, the Disruptor pattern, something CQRS-like, something notification-based, or ... - that makes it possible to work in such a system?
2) A general one. I need to store a last-processed-number and send an entity to JMS. If I save the number to the database and then some problem arises with JMS, then on an application restart my adapter will think that it successfully sent the entity, which is not true, and it will never be received. If I send the entity and after that try to save the number to the database and get an exception, then on an application restart a reprocessing will be performed, which will lead to duplicates in JMS. I'm not sure whether XA transactions will help here, or some kind of last-resource gambit...
Could somebody, please, share experience or ideas?
Thanks in advance!
1) 30 seconds is a long time and you can do a lot in that time, especially with more than one CPU. Without specifics I can only say it is likely you can make it faster if you profile it and use more CPUs.
2) You can update the database before you send, and listen to the JMS queue yourself to see that the message was received by the broker.
Dimitry - I don't know the details around your problem so I'm just going to make a set of assumptions. I hope it will trigger an idea that leads to the solution at least.
Here goes:
Grab your list of items to process.
Store the last id (and maybe the starting id)
Process each item on a different thread (suggest using Tasks).
Record any failed item in a local failed queue.
When you grab the next bunch, ensure you process the failed queue first.
Have a way of determining a max number of retries and a way of moving/marking it as permanently failed.
Not sure if that was what you were after. NServiceBus has a retry process where the gap between each retry gets longer up to a point, then it is marked as failed.
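A minimal Java sketch of that outline (the answer mentions .NET Tasks; an ExecutorService plays that role here, and Record, handle(), and the pool size are placeholders):
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: process a bunch of records in parallel, retrying previously failed ones first.
class BatchProcessor {

  private final ExecutorService pool = Executors.newFixedThreadPool(8);
  private final Queue<Record> failedQueue = new ConcurrentLinkedQueue<>();

  void processBunch(List<Record> records) throws InterruptedException {
    // Failed items from the previous run go first; a real system would cap their retry count.
    List<Record> work = new ArrayList<>(failedQueue);
    failedQueue.clear();
    work.addAll(records);

    List<Callable<Void>> tasks = new ArrayList<>();
    for (Record record : work) {
      tasks.add(() -> {
        try {
          handle(record); // filtering, enrichment, transformation, JMS send
        } catch (Exception e) {
          failedQueue.add(record); // picked up again on the next run
        }
        return null;
      });
    }
    pool.invokeAll(tasks); // blocks until the whole bunch is done
  }

  private void handle(Record record) {
    // Placeholder for the per-record workflow.
  }
}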
Folks, we finally ended up with the following solution. We implemented a kind of Actor Model. The idea is the following.
There are two main (internal) database tables for our application; let's call them READ_DATA_INFO, which contains the last-read-record-number of the 'source' external system, and DUMPED_DATA, which stores metadata about each record read from the source system. This is how it all works: every n seconds (a configurable property) a service bus reads the last processed identifier of the source system and sends a request to the source system to get new records from it. If there are new records, they are wrapped in a DumpRecordBunchMessage message and sent to a DumpActor class. This class begins a transaction comprising two operations: update the last-read-record-number (the READ_DATA_INFO table) and save metadata about each record (the DUMPED_DATA table); each dumped record gets the 'NEW' status. When a record is successfully processed, it gets the 'COMPLETED' status; otherwise, the 'FAILED' status. If the transaction commits successfully, each of those records is wrapped in a RecordMessage message class and sent to the next processing actor; otherwise those records are just skipped - they will be re-read after the next n seconds.
There are three interesting points:
Application disaster recovery. What if our application is stopped somehow in the middle of processing? No problem: at application startup (in a @PostConstruct-annotated method) we find all records with the 'NEW' status in the DUMPED_DATA table and, with the help of the stored metadata, restore them from the source system.
Parallel processing. After all records are successfully dumped, they become independent, which means that they can be processed in parallel. We introduced several mechanisms for parallelism and load balancing. The simplest one is a round-robin approach: each processing actor consists of a parent actor (load balancer) and a configurable set of its child actors (workers). When a new message arrives in the parent actor's queue, it dispatches it to the next worker.
Duplicate record prevention. This is the most interesting one. Let's assume that we read data every 5 seconds. If there is an actor with a long-running operation, it is possible to have several attempts to read from the source system's database starting from the same last-read-record number. Thus there would potentially be a lot of duplicate records dumped and processed. In order to prevent this we added a CAS-like check on the DumpActor's messages: if the last-read-record number from a message is equal to the one in the READ_DATA_INFO table, the message is processed (no messages were processed before it); otherwise the message is rejected. Rather simple, but powerful.
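A minimal, self-contained sketch of that CAS-like check (an AtomicLong stands in for the stored last-read-record-number; in the real application the check and the update run inside the DumpActor's database transaction):
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: accept a dumped bunch only if it starts right after the current last-read number.
class DumpActor {

  private final AtomicLong lastReadRecordNumber = new AtomicLong(0);

  /** Returns true if the bunch was accepted, false if it was stale and rejected. */
  boolean onDumpRecordBunch(long bunchStartsAfter, long bunchEndsAt) {
    // compareAndSet makes the "check and advance" step atomic, mirroring the DB transaction.
    return lastReadRecordNumber.compareAndSet(bunchStartsAfter, bunchEndsAt);
  }
}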
I hope this overview will help somebody. Have a good time!