Understanding Kafka stream groupBy and window - java

I am not able to understand the concepts of groupBy/groupByKey and windowing in Kafka Streams. My goal is to aggregate stream data over some time period (e.g. 5 seconds). My streaming data looks something like:
{"value":0,"time":1533875665509}
{"value":10,"time":1533875667511}
{"value":8,"time":1533875669512}
The time is in epoch milliseconds. Here my timestamp is in the message, not in the key, and I want to average the values over a 5-second window.
Here is the code that I am trying, but I cannot get it to work:
builder.<String, String>stream("my_topic")
    .map((key, val) -> {
        TimeVal tv = TimeVal.fromJson(val);
        return new KeyValue<Long, Double>(tv.time, tv.value);
    })
    .groupByKey(Serialized.with(Serdes.Long(), Serdes.Double()))
    .windowedBy(TimeWindows.of(5000))
    .count()
    .toStream()
    .foreach((key, val) -> System.out.println(key + " " + val));
This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C then it prints something like
[1533877059029#1533877055000/1533877060000] 1
[1533877061031#1533877060000/1533877065000] 1
[1533877063034#1533877060000/1533877065000] 1
[1533877065035#1533877065000/1533877070000] 1
[1533877067039#1533877065000/1533877070000] 1
This output does not make sense to me.
Related code:
public class MessageTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        String str = (String) record.value();
        TimeVal tv = TimeVal.fromJson(str);
        return tv.time;
    }
}
public class TimeVal {
    public final long time;
    public final double value;

    public TimeVal(long tm, double val) {
        this.time = tm;
        this.value = val;
    }

    public static TimeVal fromJson(String val) {
        Gson gson = new GsonBuilder().create();
        return gson.fromJson(val, TimeVal.class);
    }
}
Questions:
Why do you need to pass a serializer/deserializer to groupBy? Some of the overloads also take a ValueStore; what is that? When grouped, what does the data look like in the grouped stream?
How is the windowed stream related to the grouped stream?
I was expecting the above to print in a streaming way, i.e. buffer for 5 seconds, then count, then print. It only prints once I press Ctrl+C on the command prompt, i.e. it prints and then exits.

It seems you don't have keys in your input data (correct me if this is wrong), and it further seems that you want to do a global aggregation?
In general, grouping is for splitting a stream into sub-streams. Those sub-streams are built by key (i.e., one logical sub-stream per key). In your code snippet you set the timestamp as the key and thus generate a sub-stream per timestamp. I assume this is not intended.
If you want to do a global aggregation, you will need to map all records to a single sub-stream, i.e., assign the same key to all records in groupBy(). Note that global aggregations don't scale, as the aggregation must be computed by a single thread. Thus, this will only work for small workloads.
Windowing is applied to each generated sub-stream to build the windows, and the aggregation is computed per window. The windows are built based on the timestamp returned by the TimestampExtractor. It seems you already have an implementation that extracts the timestamp from the value for this purpose.
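For illustration, a minimal sketch of such a global 5-second average, sticking to the older API used in the question (Serialized.with, TimeWindows.of(long)). SumCount is a hypothetical little POJO holding a running sum and count (with add() and average() methods), and SumCountSerde is a serde you would have to provide for it; neither is part of Kafka Streams:
// Sketch only: a constant key maps every record into one global sub-stream.
builder.<String, String>stream("my_topic")
    .map((key, val) -> {
        TimeVal tv = TimeVal.fromJson(val);
        return new KeyValue<String, Double>("all", tv.value);  // same key for every record
    })
    .groupByKey(Serialized.with(Serdes.String(), Serdes.Double()))
    .windowedBy(TimeWindows.of(5000))                          // 5-second tumbling windows
    .aggregate(
        SumCount::new,                                         // initializer: sum = 0, count = 0
        (key, value, agg) -> agg.add(value),                   // fold each value into the aggregate
        Materialized.with(Serdes.String(), new SumCountSerde()))
    .toStream()
    .foreach((windowedKey, sumCount) ->
        System.out.println(windowedKey + " avg=" + sumCount.average()));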
This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C then it prints something like
By default, Kafka Streams uses some internal caching, and the cache is flushed on commit -- this happens every 30 seconds by default, or when you stop your application. You would need to disable caching to see results earlier (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html)
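For example (a sketch; these are the standard StreamsConfig property names, added to whatever properties you already set):
Properties props = new Properties();
// ... existing configs (application id, bootstrap servers, default serdes, ...)
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);  // disable the record cache
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);      // commit (and thus flush) every second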
Why do you need to pass serializer/deserializer to group by.
Because data needs to be redistributed, and this happens via a topic in Kafka. Note that Kafka Streams is built for a distributed setup, with multiple instances of the same application running in parallel to scale out horizontally.
Btw: you might also be interested in this blog post about the execution model of Kafka Streams: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/

It seems like you misunderstand the nature of window DSL.
It works on the internal message timestamps handled by the Kafka platform, not on arbitrary properties in your specific message type that encode time information. Also, this window does not group into intervals - it is a sliding window. That means any aggregation you get is for the last 5 seconds before the current message.
Also, you need the same key for all elements that should be combined into the same group, for example null. In your example the key is a timestamp, which is essentially unique per record, so there will be only a single element in each group.

Related

Filtering duplicates out of an infinite DataStream with windows

I want to filter out duplicates in Flink from an infinite DataStream. I know the duplicates arise only in a small time window (max 10 seconds). I found a promising approach that is pretty simple here. But it doesn't work. It uses a keyed DataStream and returns only the first message of every window.
This is my window code:
DataStream<Row> outputStream = inputStream
.keyBy(new MyKeySelector())
.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.minutes(5)))
.process(new DuplicateFilter());
MyKeySelector() is just a class that selects the first two attributes of the Row message as the key. This key works as a primary key and ensures that only messages with the same key are assigned to the same window (classic keyed-stream behaviour).
This is the DuplicateFilter class, which is very similar to the proposed answer to the above-mentioned question. I only used the newer process() function instead of apply().
public class DuplicateFilter extends ProcessWindowFunction<Row, Row, Tuple2<String, String>, TimeWindow> {

    private static final Logger LOG = LoggerFactory.getLogger(DuplicateFilter.class);

    @Override
    public void process(Tuple2<String, String> key, Context context, Iterable<Row> iterable, Collector<Row> collector) throws Exception {
        // this is just for debugging and can be ignored
        int count = 0;
        for (Row record : iterable) {
            LOG.info("Row number {}: {}", count, record);
            count++;
        }
        LOG.info("first Row: {}", iterable.iterator().next());
        collector.collect(iterable.iterator().next()); // output only the first message in this window
    }
}
My messages arrive at intervals of at most one second, so a 30-second window should handle that well. But messages which arrive less than one second apart are assigned to different windows. From the logs I can see that it only works correctly very rarely.
Has someone got an idea or another approach for this task? Please let me know if you need more information.
Flink's time windows are aligned to the clock, rather than to the events, so two events that are close together in time can be assigned to different windows. Windows are often not very well suited for deduplication, but you might get good results if you use session windows.
Personally, I would use a keyed flatmap (or a process function), and use state TTL (or timers) to clear the state for keys once it's no longer needed.
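A rough sketch of that idea, assuming your key stays the Tuple2<String, String> produced by MyKeySelector and that duplicates only show up within about 10 seconds; state TTL clears the per-key flag automatically (the class name DedupFunction is made up here):
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.types.Row;
import org.apache.flink.util.Collector;

public class DedupFunction extends KeyedProcessFunction<Tuple2<String, String>, Row, Row> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        // forget a key 10 seconds after it was last written, so state does not grow forever
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.seconds(10))
                .cleanupFullSnapshot()
                .build();
        ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        descriptor.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Row value, Context ctx, Collector<Row> out) throws Exception {
        if (seen.value() == null) {   // first occurrence of this key within the TTL
            seen.update(true);
            out.collect(value);       // emit only the first message, drop later duplicates
        }
    }
}
It would be wired in as inputStream.keyBy(new MyKeySelector()).process(new DedupFunction()) instead of the window/process pair above.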
You can also do deduplication with Flink SQL: https://ci.apache.org/projects/flink/flink-docs-stable/docs/dev/table/sql/queries/deduplication/ (but you would need to set an idle state retention interval).
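If you go that route, a rough sketch could look like the following; events, k1/k2, payload and proc_time are placeholder names for a table you would first have to register from your stream, and env is your StreamExecutionEnvironment:
// requires the flink-table dependencies; java.time.Duration is used for the retention interval
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(30));   // drop per-key dedup state after 30s

Table deduplicated = tableEnv.sqlQuery(
    "SELECT k1, k2, payload "
  + "FROM ("
  + "  SELECT *, ROW_NUMBER() OVER (PARTITION BY k1, k2 ORDER BY proc_time ASC) AS rownum"
  + "  FROM events"
  + ") WHERE rownum = 1");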

Apache storm reliability: would ack/fail a tuple with changed values guarantee reliability?

I'm working on a project that inherits from a custom-made legacy storm bolt. The bolt is supposed to be reliable and it acks or fails the tuple based on the success or failure of some operations. The problem is that in the transformation, the tuple values get changed. Sample code:
public void execute(Tuple tuple) {
    Object newValues = transformTuple(tuple);
    tuple.getValues().set(0, newValues);
    try {
        // some other operation
        ...
        collector.ack(tuple);
    } catch (Exception e) {
        collector.fail(tuple);
    }
}
This is suspicious since it acks/fails a tuple with changed values. I couldn't find any documentation on whether only the key of the tuple is used in acking, or both the key and the values. So my question is: would such an ack/fail work to guarantee reliability (retry on failure)?
It doesn't matter. Storm doesn't track the tuple contents. If you want to know how Storm tracks tuples, take a look at https://storm.apache.org/releases/2.0.0-SNAPSHOT/Guaranteeing-message-processing.html, in particular the section "How does Storm implement reliability in an efficient way?".
The tl;dr for that section is that Storm keeps track of one 64-bit number (the "ack val") for the entire tuple tree. When you emit a new tuple (i.e. a new edge in the tree), Storm generates a random id for that edge and XORs it onto the ack val. When the edge gets acked by the receiving bolt, the same id is XORed onto the ack val again. Since a XOR b XOR b is a, Storm ends up with an ack val of 0 when all the edges have been processed.
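A tiny stand-alone illustration of that XOR bookkeeping (just the arithmetic, not Storm code; the class name is made up):
public class AckValDemo {
    public static void main(String[] args) {
        java.util.Random random = new java.util.Random();
        long ackVal = 0L;
        long edge1 = random.nextLong();   // id Storm would assign when the edge is emitted
        long edge2 = random.nextLong();

        ackVal ^= edge1;                  // edge1 emitted
        ackVal ^= edge2;                  // edge2 emitted
        ackVal ^= edge1;                  // edge1 acked by the receiving bolt
        ackVal ^= edge2;                  // edge2 acked

        System.out.println(ackVal == 0);  // true -> the whole tuple tree is fully processed
    }
}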
Since acking only depends on the tuple ids (which you have no reason to want to change), it doesn't matter what you do with the tuple before acking it.

SlidingWindows for slow data (big intervals) on Apache Beam

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, it represents records that are off by 10-15 minutes from "real time" (for example, look at _last_updt).
For example, at 00:20 I get data timestamped 00:10; at 00:35 I get data from 00:20; at 00:50 I get data from 00:40. So the interval at which I get new data is fixed (every 15 minutes), although the interval between the timestamps changes slightly.
I am trying to consume this data on Dataflow (Apache Beam), and for that I am playing with sliding windows. My idea is to collect and work on 4 consecutive datapoints (4 x 15 min = 60 min), and ideally update my calculation of sums/averages as soon as a new datapoint is available. For that, I've started with the code:
PCollection<TrafficData> trafficData = input
    .apply("MapIntoSlidingWindows", Window.<TrafficData>into(
            SlidingWindows.of(Duration.standardMinutes(60))      // (4x15)
                          .every(Duration.standardMinutes(15)))  // interval to get new data
        .triggering(AfterWatermark
            .pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes());
Unfortunately, it looks like when I receive a new datapoint from my input, I do not get a new (updated) result from the GroupByKey that follows.
Is this something wrong with my SlidingWindows? Or am I missing something else?
One issue may be that the watermark is going past the end of the window and dropping all later elements. You may try allowing a few minutes of lateness after the watermark passes:
PCollection<TrafficData> trafficData = input
    .apply("MapIntoSlidingWindows", Window.<TrafficData>into(
            SlidingWindows.of(Duration.standardMinutes(60))      // (4x15)
                          .every(Duration.standardMinutes(15)))  // interval to get new data
        .triggering(AfterWatermark
            .pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane())
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(15))
        .accumulatingFiredPanes());
Let me know if this helps at all.
So @Pablo (from my understanding) gave the correct answer. But I had some suggestions that would not fit in a comment.
I wanted to ask whether you need sliding windows at all. From what I can tell, fixed windows would do the job for you and be computationally simpler as well. Since you are using accumulating fired panes, you don't need a sliding window, since your next DoFn will already be doing an average from the accumulated panes.
As for the code, I made changes to the early and late firing logic. I also suggest increasing the window size. Since you know the data comes every 15 minutes, you should be closing the window after 15 minutes rather than exactly on 15 minutes. But you also don't want to pick a window length that eventually lines up with multiples of 15 (like 20, which collides at 60 minutes), because then you'll have the same problem again. So pick a number that is co-prime to 15, for example 19. Also allow for late entries.
PCollection<TrafficData> trafficData = input
    .apply("MapIntoFixedWindows", Window.<TrafficData>into(
            FixedWindows.of(Duration.standardMinutes(19)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // fire the moment you see an element
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
            // this line is optional since you already have past-end-of-window and an early firing, but just in case
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(60))
        .accumulatingFiredPanes());
Let me know if that solves your issue!
EDIT
So, I could not understand how you computed the above example, so I am using a generic one. Below is a generic averaging function:
public class AverageFn extends CombineFn<Integer, AverageFn.Accum, Double> {

    public static class Accum {
        int sum = 0;
        int count = 0;
    }

    @Override
    public Accum createAccumulator() { return new Accum(); }

    @Override
    public Accum addInput(Accum accum, Integer input) {
        accum.sum += input;
        accum.count++;
        return accum;
    }

    @Override
    public Accum mergeAccumulators(Iterable<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum accum : accums) {
            merged.sum += accum.sum;
            merged.count += accum.count;
        }
        return merged;
    }

    @Override
    public Double extractOutput(Accum accum) {
        return ((double) accum.sum) / accum.count;
    }
}
In order to run it you would add the line:
PCollection<Double> average = trafficData.apply(Combine.globally(new AverageFn()));
Since you are currently using accumulating firing triggers, this would be the simplest way to code a solution.
HOWEVER, if you want to use discarding fired panes, you would need to use a PCollectionView to store the previous average and pass it as a side input to the next window in order to keep track of the values. This is a little more complex to code but would definitely improve performance, since constant work is done every window, unlike with accumulating firing.
Does this make enough sense for you to write your own function for the discarding-fired-panes case?

Apache Flink: Windowed ReduceFunction is never executed

Below is a code snippet where I'm using a tumbling event-time window:
DataStream<OHLC> ohlcStream = stockStream
    .assignTimestampsAndWatermarks(new TimestampExtractor())
    .map(new mapStockToOhlc())
    .keyBy((KeySelector<OHLC, Long>) o -> o.getMinuteKey())
    .timeWindow(Time.seconds(60))
    .reduce(new myAggFunction());
Unfortunately, it looks like the reduce function is never executed. If I use the code above without windowing, the reduce function works fine. Below is the code for the TimestampExtractor. The 30-second watermark delay serves just as a testing value, but the one-minute tumbling window is m
public static class TimestampExtractor implements AssignerWithPeriodicWatermarks<StockTrade> {

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(System.currentTimeMillis() - 30000);
    }

    @Override
    public long extractTimestamp(StockTrade stockTrade, long l) {
        BigDecimal bd = new BigDecimal(stockTrade.getTime());
        // bd contains a milliseconds timestamp, e.g. 1498658629.036
        return bd.longValue();
    }
}
bd.longValue() returns the seconds timestamp 1498658629, as my window is also defined in seconds.
When I used bd.longValue()/60, which returns a minute timestamp, the reduce function is called. My output file then contains all records for each reduce operation:
{time=1498717692.000, minuteTime=24978628, n=1, open=2248.0}
{time=1498717692.000, minuteTime=24978628, n=2, open=2248.0}
...
{time=1498717692.000, minuteTime=24978628, n=8, open=2248.0}
So, can anyone explain to me what is happening? Thanks a lot.
Normally watermarks should be relative to the timestamps in your data, and should not be based on the system clock. One of the great things about working with event time is that the same application can be used to reprocess historic data or to process current data, but that's not possible if you compare your timestamps to the system clock, as you've done here.
A watermark can be thought of as a statement that all data with timestamps smaller than the watermark have already arrived. Or in other words, any data with a timestamp less than the current watermark will be considered late. My guess is that you are not seeing any results because your watermarks are causing all of your data to be considered late, and the window operator is dropping all this late data.
I suggest you use a BoundedOutOfOrdernessTimestampExtractor instead. It works by keeping track of the max timestamp seen so far in the data stream, and subtracts the delay from that max timestamp, rather than the system clock. The source code, in case you're curious.
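For illustration, a sketch of such an extractor, assuming getTime() carries a seconds-precision epoch value as in the question (so it is converted to the milliseconds Flink expects); the class name here is made up:
public static class StockTradeTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<StockTrade> {

    public StockTradeTimestampExtractor() {
        super(Time.seconds(30));   // allow up to 30 seconds of out-of-orderness
    }

    @Override
    public long extractTimestamp(StockTrade stockTrade) {
        // e.g. 1498658629.036 (seconds) -> 1498658629036 (milliseconds)
        return new BigDecimal(stockTrade.getTime())
                .multiply(BigDecimal.valueOf(1000))
                .longValue();
    }
}
It would replace the assigner above: stockStream.assignTimestampsAndWatermarks(new StockTradeTimestampExtractor()).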

How to generate incremental identifier in java

I have a requirement in which I continuously receive messages that need to be written to files. Every time a new message is received, it needs to be written to a separate file. What I want is to generate a unique identifier to be used as the file name. I also want to preserve the order of the messages: the identifier generated as a file name should always be incremental.
I was using UUID.randomUUID() to generate file names, but the problem with this approach is that UUID only assures randomness of the identifier and is not incremental. As a result I am losing the ordering of the files (I want the file generated first to appear first in the list).
Approaches known
1. I can use System.currentTimeMillis(), but I can receive multiple messages at the same timestamp.
2. Another approach could be to keep a static long value and increment it whenever a file is to be created, using the long value as the file name. But I am not sure about this approach, and it doesn't seem to be a proper solution to my problem. I think there could be far better solutions than this one.
If someone could suggest a better solution to this problem, it would be highly appreciated.
If you want your id value to uniformly rise even between server restarts, then you must either base it on the system time or have some elaborately robust logic that persists the last ID used. Note that achieving robustness on its own is not hard, but achieving it in a performant and scalable way is.
If you additionally need the id to be unique across multiple nodes in a redundant server cluster, then you need even more elaborate logic, which definitely involves a persistent store to which all the boxes synchronize access. Making this performant is, of course, even harder.
The best option I can see is to have a quite long ID so there's room for these parts:
System.currentTimeMillis() for long-term uniqueness (across restarts);
System.nanoTime() for finer granularity;
a unique id for each server node (determined in a platform-specific way).
The method will still have to remember the last value generated and retry in case of a duplicate. It won't have to retry too many times, though, just until the next nanoTime clock tick; it could even busy-wait for it.
Sketch of code without point 3 (single-node implementation):
private static long lastNanos;

public static synchronized String uniqueId() {
    for (;/*ever*/;) {
        final long n = System.nanoTime();
        if (n == lastNanos) continue;
        lastNanos = n;
        return "" + System.currentTimeMillis() + n;
    }
}
Ok, my hands up. My last answer was fairly flaky and I've deleted it.
Keeping with the spirit of the site, I thought I'd try a different tack.
If you are keeping these messages in a single file, then you could try creating a unique id from the size of the file.
Before you write the message to the file, its id could be the current size of the file.
You could use filename + size as the id if these messages need to be unique across a number of files.
I'll leave the hot potato of synchronization to another day, but you could wrap all of this up in a synchronized object that keeps track of things.
Also, I am assuming that any messages written to the file will not be removed in the future.
ADDITIONAL NOTE:
You could create a message-processing object that opens the file on construction (or via a create method).
This object will get the initial size of the file, and this will be used as the unique id.
As each message is added (in a synchronized manner), the id is incremented by the size of the message.
This would address the performance issues. It will not work if more than one JVM/node accesses the same file.
Skeletal Idea:
public class MessageSink {

    private long id = 0;

    public MessageSink(String filename) {
        id = new File(filename).length(); // initial id = current file size
    }

    public synchronized void addMessage(Message msg) {
        msg.setId(id);
        // .. write to file + flush ..
        // .. or add to a stack of messages that need to be written to the file at a later stage
        id = id + msg.getSize();
    }

    public void flushMessages() {
        // .. open the file
        // .. for each message in the stack, write ...
        // .. flush and close the file
    }
}
I had the same requirement and found a suitable solution. Twitter Snowflake uses a simple algorithm to generate sortable 64-bit (long) ids. Snowflake is written in Scala, but the approach is simple and can easily be used in Java code.
id is composed of:
timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years);
machine id - 10 bits (MAC address could be used as a hardware id);
sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
Formula looks like: ((timestamp - customEpoch) << timestampShift) | (machineId << machineIdShift) | sequenceNumber;
The shift for each component depends on its bit position in the id.
A detailed description and source code can be found on GitHub:
Twitter Snowflake
Basic Java implementation of the Snowflake algorithm
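For illustration, a rough single-class sketch of that bit layout (the custom epoch, machine id and class name below are placeholders, and a clock moving backwards is not handled):
public class SnowflakeIdSketch {
    private static final long CUSTOM_EPOCH = 1420070400000L;  // example epoch: 2015-01-01
    private static final long MACHINE_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MACHINE_ID_SHIFT = SEQUENCE_BITS;                   // 12
    private static final long TIMESTAMP_SHIFT = SEQUENCE_BITS + MACHINE_ID_BITS;  // 22

    private final long machineId;      // 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdSketch(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & ((1L << SEQUENCE_BITS) - 1);  // rolls over at 4096
            if (sequence == 0) {
                // sequence exhausted for this millisecond: busy-wait for the next one
                while (timestamp <= lastTimestamp) {
                    timestamp = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        return ((timestamp - CUSTOM_EPOCH) << TIMESTAMP_SHIFT)
                | (machineId << MACHINE_ID_SHIFT)
                | sequence;
    }
}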
