Kafka Streams Session Windows with Punctuator - java

I'm building a Kafka Streams application where I want to make use of Session Windows.
Say my session is configured as follows:
// Inactivity gap is 5 seconds, grace period is 1 second
Duration inactivityGapDuration = Duration.ofSeconds(5);
Duration graceDuration = Duration.ofSeconds(1);

KStream<Windowed<String>, EventData> windowedListKStream = groupedStream
    .windowedBy(SessionWindows.ofInactivityGapAndGrace(inactivityGapDuration, graceDuration))
    .aggregate(...)
    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
    .toStream();
And given the following stream events:
Event Key | Time
----------+-----
A         |  10
B         |  12
Based on reading the docs and experiments I expect this will create 2 session windows: one with key A and one with key B.
Now say I receive this next event:
Event Key | Time
----------+-----
B         |  20
This will close the window with key B, but the window with key A will remain open. That is to say, when an event for a given key is received, only the stream time for the windows with that key advances. Is my understanding here correct?
If so, then this behavior is not exactly what I need. What I need is for the key A window to eventually close even if I never see another event with key A.
I think this is where a Punctuator can come in. However, if I read the docs correctly, I would basically need to re-implement the session window logic using the Processor API in order to add a Punctuator. As far as I can tell, I can't inject a Punctuator into the session window DSL implementation to move the stream time along.
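For reference, here is roughly the shape I think that re-implementation would take (only a sketch: the "last-seen" store would need to be registered on the topology, EventData is my value type, and the actual session aggregation is omitted). A wall-clock punctuator fires even when no events arrive, which is exactly what the DSL session windows seem to lack:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class IdleSessionCloser implements Processor<String, EventData, String, EventData> {
    private ProcessorContext<String, EventData> context;
    private KeyValueStore<String, Long> lastSeen;

    @Override
    public void init(ProcessorContext<String, EventData> context) {
        this.context = context;
        this.lastSeen = context.getStateStore("last-seen");
        // Check every second, on wall-clock time, for keys that have gone quiet.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, Long> it = lastSeen.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (now - entry.value > 6_000L) { // inactivity gap (5s) + grace (1s)
                        // ...emit/close the session for entry.key here...
                        lastSeen.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public void process(Record<String, EventData> record) {
        lastSeen.put(record.key(), record.timestamp());
        // ...aggregate the record into its session here...
    }
}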
If all of the above is correct, then this seems like a big lift for such a seemingly simple operation. Am I missing some other feature that would make this a simpler implementation?
Thank you!

Related

Apache Beam - KafkaIO Sliding Window processing

I have a Beam pipeline that reads Kafka Avro messages using the Java SDK. The pipeline receives the messages and tries to create a sliding window:
PCollection<AvroMessage> message_timestamped =
    messageValues.apply(
        "append event time for PCollection records",
        WithTimestamps.of((AvroMessage rec) -> new Instant(rec.getTime())));

PCollection<AvroMessage> messages_Windowed =
    message_timestamped.apply(
        Window.<AvroMessage>into(
                SlidingWindows.of(Duration.standardMinutes(2))
                    .every(Duration.standardMinutes(1)))
            .discardingFiredPanes());
Does the window get invoked after 2 minutes, or is a trigger configuration necessary? I tried to access the window pane information as part of a ParDo, but it gets triggered for each received message and doesn't wait to accumulate messages for the configured 2 minutes. What kind of trigger is required (after 2 minutes, process only the current window's messages)?
Do I need to include any specific configuration to run with unbounded Kafka messages?
I have used the timestamp policy to use the message timestamp during the KafkaIO read operation:
.withTimestampPolicyFactory(
(tp, previousWaterMark) -> new CustomFieldTimePolicy(previousWaterMark))
It is important to consider that windows and triggers have very different purposes:
Windows are based on the timestamps in the data, not on when they arrive or when they are processed. I find the best way to think about "windows" is as a secondary key. When data is unbounded/infinite, you need one of the grouping keys to have an "end" - a timestamp when you can say they are "done". Windows provide this "end". If you want to control how your data is aggregated, use windows.
Triggers are a way to try to control how output flows through your pipeline. They are not closely related to your business logic. If you want to manage the flow of data, use triggers.
To answer your specific questions:
Windows do not wait. An element that arrives may be assigned to a window that is "done" 1ms after it arrives. This is just fine.
Since you have not changed the default trigger, you will get one output with all of the elements for a window.
You also do not need discardingFiredPanes. Your configuration only produces one output per aggregation, so this has no effect.
But there is actually a problem that you will want to fix: the watermark (which controls when a window is "done") is determined by the source. Using WithTimestamps does not change the watermark. You will need to specify the timestamp in the KafkaIO transform, using withTimestampPolicyFactory. Otherwise, the watermark will move according to the publish time and may declare data late or drop data.
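For reference, a minimal sketch of what such a policy might look like, reusing the question's CustomFieldTimePolicy name and the AvroMessage.getTime() field; the watermark handling here is deliberately simplified:

import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

public class CustomFieldTimePolicy extends TimestampPolicy<String, AvroMessage> {

    private Instant currentWatermark;

    public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
        // Resume from the checkpointed watermark, or start at the minimum.
        currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }

    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, AvroMessage> record) {
        // Event time comes from the message payload, not the publish time.
        Instant eventTime = new Instant(record.getKV().getValue().getTime());
        if (eventTime.isAfter(currentWatermark)) {
            currentWatermark = eventTime; // advance the watermark monotonically
        }
        return eventTime;
    }

    @Override
    public Instant getWatermark(PartitionContext ctx) {
        return currentWatermark;
    }
}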

How to keep data temporarily somewhere and store it in the database every 10 minutes?

I want to solve a design problem: I receive several kinds of event data, but for one kind I do not want to store each event as soon as it happens. Instead I want to wait 10 minutes, letting those events happen while keeping a count of them, so that once the 10-minute interval is reached I can store the cumulative count with the event. This will reduce the number of database operations involved.
For example
Let's say I have three events, namely EV1, EV2 and EV3. Event EV3 happens in my application in very large volume; the other two events are less frequent. I am not concerned with capturing every trigger of the event, but I am interested in how many times EV3 happened. So I'm thinking of capturing the EV3 count over a 10-minute interval by keeping the count somewhere on the fly and dumping it every 10 minutes.
Please suggest a good and simple design for this that can be used in Java. Thanks in advance.
You can use a cache (like Redis) to store your data temporarily and then save it to your DB. I'm not sure what your event is, but this is the general design used to store data temporarily and reduce DB writes.
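If a single JVM is enough (no cache server), a minimal in-process sketch: counts accumulate in a concurrent map and a scheduled task flushes them every 10 minutes. saveToDatabase() is a hypothetical persistence hook.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class EventCountBuffer {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Flush the accumulated counts every 10 minutes.
        scheduler.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.MINUTES);
    }

    // Called for every occurrence of an event, e.g. record("EV3").
    public void record(String eventType) {
        counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
    }

    private void flush() {
        for (Map.Entry<String, LongAdder> e : counts.entrySet()) {
            long n = e.getValue().sumThenReset(); // read and reset atomically
            if (n > 0) {
                saveToDatabase(e.getKey(), n); // hypothetical DB write
            }
        }
    }

    private void saveToDatabase(String eventType, long count) {
        // e.g. a single INSERT/UPDATE per event type per interval
    }
}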

Kafka KStreams Issue in Aggregation with Time Window

I have an issue with KStreams aggregation and windows. I want to aggregate records into a list of records which have the same key, as long as they fall inside a time window.
I have chosen SessionWindows because I have to work with a moving window inside a session: let's say record A arrives at 10:00:00; then every other record with the same key that arrives inside the 10-second window (until 10:00:10) will fall into the same session, bearing in mind that if one arrives at 10:00:03, the window will move to 10:00:13 (+10s).
That leaves us with a moving window of +10s from the last record received for a given key.
Now the problem: I want to obtain the final aggregated result. I have used .suppress() to indicate that I don't want any intermediate results; I just want the last one, when the window closes. This is not working: while it doesn't send any intermediate aggregated results, when the time window ends I don't get any result at all. I have noticed that in order to receive it I need to publish another message to the topic, which in my case is impossible.
Reading about .suppress() I have come to the conclusion that it may not be the way to achieve what I want, that's why my question is: how can I force the window to close and send the latest aggregated calculated result?
@StreamListener(ExtractContractBinding.RECEIVE_PAGE)
@SendTo(ExtractCommunicationBinding.AGGREGATED_PAGES)
public KStream<String, List<Records>> aggregatePages(KStream<?, Record> input) {
    return input.map(this::getRecord)
        .groupBy(keyOfElement)
        .windowedBy(SessionWindows.with(Duration.ofSeconds(10L)).grace(Duration.ofSeconds(10L)))
        .aggregate(...do stuff...)
        .suppress(Suppressed.untilWindowCloses(unbounded()))
        .toStream()
        .map(this::createAggregatedResult);
}
In short, the reason this happens is that in KStreams, as in most other stream processing engines that compute aggregations, time works based on event time.
https://kafka.apache.org/0101/documentation/streams#streams_time
In other words, the window cannot close until a new message arrives beyond your time window plus the grace period that accounts for late-arriving messages.
Moreover, based on some unit tests I've been writing recently, I'm inclined to believe that the second message needs to land in the same partition as the previous message for event time to move forward. In practice, when you run in production and presumably process hundreds of messages per second, this becomes unnoticeable.
Let me also add that you can implement a custom timestamp extractor, which gives you fine-grained control over which time window a particular message lands in.
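For example, a minimal sketch of such an extractor, assuming the Record payload carries its event time in a hypothetical getEventTime() accessor; it would be registered via StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class RecordFieldTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof Record) {
            return ((Record) value).getEventTime(); // hypothetical event-time field
        }
        return record.timestamp(); // fall back to the broker/producer timestamp
    }
}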
how can I force the window to close and send the latest aggregated calculated result?
To finally answer your question: it's not possible to force the time window to close without emitting an extra message to the source topic.

Kafka Stream count on time window not reporting zero values

I'm using Kafka Streams to calculate how many events occurred in the last 3 minutes using a hopping time window:
public class ViewCountAggregator {

    void buildStream(KStreamBuilder builder) {
        final Serde<String> stringSerde = Serdes.String();
        final Serde<Long> longSerde = Serdes.Long();

        KStream<String, String> views = builder.stream(stringSerde, stringSerde, "streams-view-count-input");
        KStream<String, Long> viewCount = views
            .groupBy((key, value) -> value)
            .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(3)).advanceBy(TimeUnit.MINUTES.toMillis(1)))
            .toStream()
            .map((key, value) -> new KeyValue<>(key.key(), value));

        viewCount.to(stringSerde, longSerde, "streams-view-count-output");
    }

    public static void main(String[] args) throws Exception {
        // some not so important initialization code
        ...
    }
}
When running a consumer and pushing some messages to the input topic, it receives the following updates as time passes:
single 1
single 1
single 1
five 1
five 4
five 5
five 4
five 1
Which is almost correct, but it never receives updates for:
single 0
five 0
Without them, my consumer that updates a counter will never set it back to zero when there are no events for a longer period of time. I expect the consumed messages to look like this:
single 1
single 1
single 1
single 0
five 1
five 4
five 5
five 4
five 1
five 0
Is there some configuration option or argument I'm missing that would help me achieve such behavior?
Which is almost correct, but it never receives updates for:
First, the computed output is correct.
Second, why is it correct:
If you apply a windowed aggregate, only those windows that have actual content are created (all other systems I am familiar with produce the same output). Thus, if for some key there is no data for a time period longer than the window size, no window is instantiated, and thus there is no count at all.
The reason not to instantiate windows without content is quite simple: the processor cannot know all keys. In your example, you have two keys, but maybe later on a third key might show up. Would you expect to get <thirdKey,0> from the very beginning? Also, as data streams are infinite in nature, keys might go away and never reappear. If you remember all seen keys and emit <key,0> when there is no data for a key that disappeared, would you emit <key,0> forever?
I don't want to say that your expected result/semantics does not make sense. It's just a very specific use case of yours and not applicable in general. Hence, stream processors don't implement it.
Third: What can you do?
There are multiple options:
1. Your consumer can keep track of which keys it has seen and, using the embedded record timestamps, figure out if a key is "missing", then set the counter to zero for that key (for this, it might also help to remove the map step and preserve the Windowed<K> type for the key, so that the consumer knows which window a record belongs to).
2. Add a stateful #transform() step in your Streams application that does the same thing as described in (1). For this, it might be helpful to register a punctuation callback (see the sketch below).
Approach (2) should make it easier to track keys, as you can attach a state store to your transform step and thus don't need to deal with state (and failure/recovery) in your downstream consumer.
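A sketch of approach (2), assuming a newer Kafka Streams version than the question's code (one with the Duration-based schedule()): the transform step remembers in a state store (named "last-seen-store" here, which you would register on the topology) when each key was last seen, and a wall-clock punctuator emits <key,0> once a key has been silent longer than the 3-minute window:

import java.time.Duration;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class ZeroCountEmitter implements Transformer<String, Long, KeyValue<String, Long>> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> lastSeen;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.lastSeen = (KeyValueStore<String, Long>) context.getStateStore("last-seen-store");
        // Wall-clock punctuation fires even when no new records arrive.
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, Long> it = lastSeen.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (now - entry.value > TimeUnit.MINUTES.toMillis(3)) {
                        context.forward(entry.key, 0L); // key went quiet: emit a zero
                        lastSeen.delete(entry.key);     // stop emitting until it reappears
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, Long> transform(String key, Long count) {
        lastSeen.put(key, context.timestamp()); // remember when this key was last seen
        return KeyValue.pair(key, count);       // pass the real count through unchanged
    }

    @Override
    public void close() {}
}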
However, the tricky part for both approaches is still deciding when a key is missing, i.e., how long you wait until you produce <key,0>. Note that data might arrive late (aka out-of-order), and even if you did emit <key,0>, a late-arriving record might produce a <key,1> message after your code emitted the <key,0> record. But maybe this is not really an issue in your case, as it seems you use the latest window only anyway.
Last but not least, one more comment: it seems that you use only the latest count, and newer windows overwrite older windows in your downstream consumer. Thus, it might be worth exploring "Interactive Queries" to tap into the state of your count operator directly, instead of consuming the topic and updating some other state. This might allow you to redesign and simplify your downstream application significantly. Check out the docs and a very good blog post about Interactive Queries for more details.
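For illustration, a sketch of the Interactive Queries approach, assuming the count is materialized in a window store named "view-counts" (the question's code does not name a store, so that name is hypothetical):

import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class ViewCountQuery {
    // Prints the counts for one key across the windows of the last 3 minutes.
    static void printRecentCounts(KafkaStreams streams, String key) {
        ReadOnlyWindowStore<String, Long> store =
            streams.store("view-counts", QueryableStoreTypes.windowStore());
        long now = System.currentTimeMillis();
        try (WindowStoreIterator<Long> it = store.fetch(key, now - TimeUnit.MINUTES.toMillis(3), now)) {
            while (it.hasNext()) {
                KeyValue<Long, Long> entry = it.next(); // window start timestamp -> count
                System.out.println("window " + entry.key + ": " + entry.value);
            }
        }
    }
}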

How to auto execute a function at server side

Recently I was creating an auction site. I want to make it so that when a user bids on an item, there is an AI bidder that outbids the user. Say a user bids on item1; after 5 seconds the AI bidder should automatically bid on item1 as well. Any idea how I can execute this automatically after 5 seconds?
A simple and efficient solution could be to store all future bids, with a "due date" and all the information needed to place the bid, in a list. Then every 5 seconds or so you loop through the list and place any bids that are due. This system is extensible and would work for a large number of bids. Of course, ideally this would run in a separate thread.
It's a bit like re-implementing cron-like job management in your servlet, but I can't see any solution that would fit your needs out of the box.
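A rough sketch of that polling loop, assuming a single server; PendingBid and placeBid() are hypothetical:

import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class BidScheduler implements Runnable {

    static class PendingBid {
        final long itemId;
        final long dueAtMillis; // when the AI counter-bid becomes due
        PendingBid(long itemId, long dueAtMillis) {
            this.itemId = itemId;
            this.dueAtMillis = dueAtMillis;
        }
    }

    private final Queue<PendingBid> pending = new ConcurrentLinkedQueue<>();

    // Called when a user bids: schedule the AI bid 5 seconds later.
    public void enqueue(long itemId) {
        pending.add(new PendingBid(itemId, System.currentTimeMillis() + 5_000L));
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long now = System.currentTimeMillis();
            for (Iterator<PendingBid> it = pending.iterator(); it.hasNext(); ) {
                PendingBid bid = it.next();
                if (bid.dueAtMillis <= now) {
                    placeBid(bid.itemId); // hypothetical AI bidding logic
                    it.remove();
                }
            }
            try {
                Thread.sleep(5_000L); // poll every 5 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void placeBid(long itemId) {
        // call into the auction/bidding service here
    }
}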
I am not sure I answered your question, hope so.
Regards,
Stéphane
Depending on what technology you actually use, you could use EJB timers for this, for example: just start the timer EJB when a new bid occurs; on timer timeout (after some time) the method executes and updates the bid.
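A sketch of that EJB timer approach, assuming a Java EE container; BidService.placeAiBid() is a hypothetical bidding hook:

import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

@Stateless
public class AiBidTimerBean {

    @Resource
    private TimerService timerService;

    // Call this when the user's bid has been recorded.
    public void scheduleCounterBid(long itemId) {
        // One-shot, non-persistent timer that fires once, 5 seconds from now.
        timerService.createSingleActionTimer(5_000L, new TimerConfig(itemId, false));
    }

    @Timeout
    public void onTimeout(Timer timer) {
        long itemId = (Long) timer.getInfo();
        BidService.placeAiBid(itemId); // hypothetical bidding logic
    }
}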
Standard servlet solution
Create a Filter and map it to the URL pattern of your bid servlet.
In your doFilter(), after the filterChain.doFilter() call (i.e., after the request has been processed by the servlet/JSP), schedule an action for 5 seconds in the future (you can use the standard Java ScheduledExecutorService).
In the Runnable implementation you schedule (your task), place the AI bid.
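A minimal sketch of that filter; the /bid mapping, the itemId parameter and BidService.placeAiBid() are hypothetical:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;

@WebFilter("/bid")
public class AiBidFilter implements Filter {

    private ScheduledExecutorService scheduler;

    @Override
    public void init(FilterConfig config) {
        scheduler = Executors.newSingleThreadScheduledExecutor();
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        chain.doFilter(req, res); // let the user's bid be processed first
        String itemId = req.getParameter("itemId");
        // 5 seconds later, place the AI counter-bid.
        scheduler.schedule(() -> BidService.placeAiBid(itemId), 5, TimeUnit.SECONDS);
    }

    @Override
    public void destroy() {
        scheduler.shutdown();
    }
}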
In my opinion:
If the user bids and the request is sent to the server after 5 seconds, I'd prefer JavaScript with setTimeout(). (Of course this requires the browser's JS; you can read more about it on W3Schools.)
Otherwise, you can use an array (or something like it) acting as a queue on the server side: every 5 seconds, lock the queue (synchronize), check for entries inserted 5 seconds ago, and process them (or use a thread each time an event sends a request to the server). Basically, you can use a thread to do the trick. (Is that what you meant?)
