SlidingWindows for slow data (big intervals) on Apache Beam - java

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, its records are off by 10-15 minutes from "real time" (for example, look at _last_updt).
For example, at 00:20 I get data timestamped 00:10; at 00:35, data from 00:20; at 00:50, data from 00:40. So the interval at which I get new data is fixed (every 15 minutes), although the timestamps themselves shift slightly.
I am trying to consume this data on Dataflow (Apache Beam) and for that I am playing with Sliding Windows. My idea is to collect and work on 4 consecutive datapoints (4 x 15min = 60min), and ideally update my calculation of sum/averages as soon as a new datapoint is available. For that, I've started with the code:
PCollection<TrafficData> trafficData = input
    .apply("MapIntoSlidingWindows", Window.<TrafficData>into(
            SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
                .every(Duration.standardMinutes(15)))       // interval to get new data
        .triggering(AfterWatermark
            .pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes());
Unfortunately, it looks like when I receive a new datapoint from my input, I do not get a new (updated) result from the GroupByKey that follows.
Is something wrong with my SlidingWindows? Or am I missing something else?

One issue may be that the watermark is going past the end of the window and dropping all later elements. You may try allowing a few minutes of lateness after the watermark passes:
PCollection<TrafficData> trafficData = input
    .apply("MapIntoSlidingWindows", Window.<TrafficData>into(
            SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
                .every(Duration.standardMinutes(15)))       // interval to get new data
        .triggering(AfterWatermark
            .pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane())
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(15))
        .accumulatingFiredPanes());
Let me know if this helps at all.

So @Pablo (from my understanding) gave the correct answer. But I had some suggestions that would not fit in a comment.
I wanted to ask whether you need sliding windows at all. From what I can tell, fixed windows would do the job and be computationally simpler. Since you are using accumulating fired panes, you don't need a sliding window: your next DoFn will already be computing an average over the accumulated panes.
As for the code, I changed the early and late firing logic. I also suggest increasing the window size. Since you know the data comes every 15 minutes, you should close the window after 15 minutes rather than exactly on 15 minutes. But you also don't want a window length that eventually collides with multiples of 15 (like 20, which collides at 60 minutes), because then you'll have the same problem. So pick a number that is co-prime to 15, for example 19. Also allow for late entries.
PCollection<TrafficData> trafficData = input
    .apply("MapIntoFixedWindows", Window.<TrafficData>into(
            FixedWindows.of(Duration.standardMinutes(19)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // fire the moment you see an element
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
            // this line is optional since you already have past-end-of-window and an early firing, but just in case
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(60))
        .accumulatingFiredPanes());
Let me know if that solves your issue!
EDIT
So, since I could not tell exactly how you compute your averages, I am using a generic example. Below is a generic averaging function:
public class AverageFn extends CombineFn<Integer, AverageFn.Accum, Double> {
    public static class Accum {
        int sum = 0;
        int count = 0;
    }

    @Override
    public Accum createAccumulator() { return new Accum(); }

    @Override
    public Accum addInput(Accum accum, Integer input) {
        accum.sum += input;
        accum.count++;
        return accum;
    }

    @Override
    public Accum mergeAccumulators(Iterable<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum accum : accums) {
            merged.sum += accum.sum;
            merged.count += accum.count;
        }
        return merged;
    }

    @Override
    public Double extractOutput(Accum accum) {
        return ((double) accum.sum) / accum.count;
    }
}
In order to run it you would add the line:
PCollection<Double> average = trafficData.apply(Combine.globally(new AverageFn()));
Since you are currently using accumulating fired panes, this is the simplest way to code a solution.
HOWEVER, if you want to use discarding fired panes, you would need to use a PCollectionView to store the previous average and pass it as a side input to the next window in order to keep track of the values. This is a little more complex to code, but it would definitely improve performance, since constant work is done per window, unlike with accumulating firing.
Does this make enough sense for you to write your own function for the discarding-fired-panes case? A rough sketch of a starting point is below.
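A hypothetical sketch of the discarding variant. TrafficData.getValue() is an assumed accessor (not shown in your post), and the DoFn that combines the side input with the current pane is left out:
PCollection<TrafficData> perPane = input
    .apply("MapIntoFixedWindows", Window.<TrafficData>into(
            FixedWindows.of(Duration.standardMinutes(19)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(60))
        .discardingFiredPanes());

// each pane's average, exposed as a side input for a downstream DoFn
PCollectionView<Double> paneAverage = perPane
    .apply(MapElements.into(TypeDescriptors.integers())
        .via((TrafficData td) -> td.getValue()))
    .apply(Combine.globally(new AverageFn()).asSingletonView());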

Related

Why is a particular Guava Stopwatch.elapsed() call much later than others? (output in post)

I am working on a small game project and want to track time in order to process physics. After scrolling through different approaches, I had at first decided to use Java's Instant and Duration classes, and have now switched to Guava's Stopwatch implementation. However, in my snippet, both approaches show a big gap at the second call of runtime.elapsed(). That doesn't seem like a big problem in the long run, but why does it happen?
I have tried running the code below both in focus and as a Thread, on Windows and on Linux (Ubuntu 18.04), and the result stays the same: the exact values differ, but the gap occurs. I am using the IntelliJ IDEA environment with JDK 11.
Snippet from Main:
public static void main(String[] args) {
    MassObject[] planets = {
        new Spaceship(10, 0, 6378000)
    };
    planets[0].run();
}
This is part of my class MassObject extends Thread:
public void run() {
    // I am using StringBuilder to eliminate flushing delays.
    StringBuilder output = new StringBuilder();
    Stopwatch runtime = Stopwatch.createStarted();
    // massObjectList = static List<MassObject>;
    for (MassObject b : massObjectList) {
        if (b != this) calculateGravity(this, b);
    }
    for (int i = 0; i < 10; i++) {
        output.append(runtime.elapsed().getNano()).append("\n");
    }
    System.out.println(output);
}
Stdout:
30700
1807000
1808900
1811600
1812400
1813300
1830200
1833200
1834500
1835500
Thanks for your help.
You're calling Duration.getNano() on the Duration returned by elapsed(), which isn't what you want.
The internal representation of a Duration is a number of seconds plus a nano offset for whatever additional fraction of a whole second there is in the duration. Duration.getNano() returns that nano offset, and should almost never be called unless you're also calling Duration.getSeconds().
The method you probably want to be calling is toNanos(), which converts the whole duration to a number of nanoseconds.
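For example, with hypothetical values:
Duration d = Duration.ofSeconds(2, 345_000_000); // 2.345 seconds
long secs  = d.getSeconds(); // 2
int nanos  = d.getNano();    // 345000000 -- only the fractional-second offset
long total = d.toNanos();    // 2345000000 -- the whole duration in nanoseconds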
Edit: In this case that doesn't explain what you're seeing because it does appear that the nano offsets being printed are probably all within the same second, but it's still the case that you shouldn't be using getNano().
The actual issue is probably some combination of classloading or extra work that has to happen during the first call, and/or JIT improving performance of future calls (though I don't think looping 10 times is necessarily enough that you'd see much of any change from JIT).
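One cheap way to test that hypothesis (a sketch to drop into the run() method from the question, reusing its runtime and output variables): make a throwaway call before the measured loop and see whether the gap moves or disappears.
runtime.elapsed(); // warm-up call: forces classloading of Duration and friends
for (int i = 0; i < 10; i++) {
    output.append(runtime.elapsed().toNanos()).append("\n");
}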

Understanding Kafka stream groupBy and window

I am not able to understand the concepts of groupBy/groupByKey and windowing in Kafka Streams. My goal is to aggregate stream data over some time period (e.g. 5 seconds). My streaming data looks something like:
{"value":0,"time":1533875665509}
{"value":10,"time":1533875667511}
{"value":8,"time":1533875669512}
The time is in milliseconds (epoch). Here my timestamp is in my message and not in the key. And I want to average the value over a 5-second window.
Here is the code that I am trying, but I cannot get it to work:
builder.<String, String>stream("my_topic")
    .map((key, val) -> {
        TimeVal tv = TimeVal.fromJson(val);
        return new KeyValue<Long, Double>(tv.time, tv.value);
    })
    .groupByKey(Serialized.with(Serdes.Long(), Serdes.Double()))
    .windowedBy(TimeWindows.of(5000))
    .count()
    .toStream()
    .foreach((key, val) -> System.out.println(key + " " + val));
This code does not print anything even though the topic is receiving messages every two seconds. When I press Ctrl+C, it prints something like:
[1533877059029#1533877055000/1533877060000] 1
[1533877061031#1533877060000/1533877065000] 1
[1533877063034#1533877060000/1533877065000] 1
[1533877065035#1533877065000/1533877070000] 1
[1533877067039#1533877065000/1533877070000] 1
This output does not make sense to me.
Related code:
public class MessageTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        String str = (String) record.value();
        TimeVal tv = TimeVal.fromJson(str);
        return tv.time;
    }
}
public class TimeVal {
    public final long time;
    public final double value;

    public TimeVal(long tm, double val) {
        this.time = tm;
        this.value = val;
    }

    public static TimeVal fromJson(String val) {
        Gson gson = new GsonBuilder().create();
        return gson.fromJson(val, TimeVal.class);
    }
}
Questions:
Why do you need to pass a serializer/deserializer to groupBy? Some of the overloads also take a ValueStore; what is that? When grouped, how does the data look in the grouped stream?
How is a windowed stream related to a grouped stream?
I was expecting the above to print in a streaming way, i.e. buffer for every 5 seconds, then count, then print. It only prints when I press Ctrl+C on the command prompt, i.e. it prints and then exits.
It seems you don't have keys in your input data (correct me if this is wrong), and it further seems that you want to do a global aggregation?
In general, grouping is for splitting a stream into sub-streams. Those sub-streams are built by key (i.e., one logical sub-stream per key). In your code snippet you set the timestamp as the key, and thus generate a sub-stream per timestamp. I assume this is not intended.
If you want a global aggregation, you will need to map all records to a single sub-stream, i.e., assign the same key to all records in groupBy(). Note that global aggregations don't scale, as the aggregation must be computed by a single thread. Thus, this will only work for small workloads.
Windowing is applied to each generated sub-stream to build the windows, and the aggregation is computed per window. The windows are built based on the timestamp returned by the TimestampExtractor. It seems you already have an implementation that extracts the timestamp from the value for this purpose. A minimal sketch combining both points follows.
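A rough sketch based on the code in the question (the constant key "all" is an arbitrary choice, and count() is kept from the original; an average would use aggregate() instead):
builder.<String, String>stream("my_topic")
    .map((key, val) -> {
        TimeVal tv = TimeVal.fromJson(val);
        // same key for every record => one global sub-stream
        return new KeyValue<String, Double>("all", tv.value);
    })
    .groupByKey(Serialized.with(Serdes.String(), Serdes.Double()))
    .windowedBy(TimeWindows.of(5000))
    .count()
    .toStream()
    .foreach((key, val) -> System.out.println(key + " " + val));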
This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C then it prints something like
By default, Kafka Streams uses some internal caching, and the cache is flushed on commit -- this happens every 30 seconds by default, or when you stop your application. You would need to disable caching to see results earlier (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html).
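For example, a minimal sketch of how you might disable the cache (and, optionally, commit more often) via the streams configuration:
Properties props = new Properties();
// setting the cache size to zero disables record caches entirely
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
// optionally commit (and thus flush) more often than the 30-second default
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);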
Why do you need to pass a serializer/deserializer to groupBy?
Because the data needs to be redistributed, and this happens via a topic in Kafka. Note that Kafka Streams is built for a distributed setup, with multiple instances of the same application running in parallel to scale out horizontally.
Btw: you might also be interested in this blog post about the execution model of Kafka Streams: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
It seems like you misunderstand the nature of the windowing DSL.
It works on internal message timestamps handled by the Kafka platform, not on arbitrary properties in your specific message type that encode time information. Also, this window does not group into fixed intervals -- it is a sliding window, meaning any aggregation you get is for the last 5 seconds before the current message.
Also, you need the same key for all elements that should be combined into the same group, for example null. In your example the key is a timestamp, which is essentially entry-unique, so there will be only a single element in each group.

Apache Flink: Windowed ReduceFunction is never executed

Below is a code snippet where I'm using a tumbling event-time window:
DataStream<OHLC> ohlcStream = stockStream
    .assignTimestampsAndWatermarks(new TimestampExtractor())
    .map(new mapStockToOhlc())
    .keyBy((KeySelector<OHLC, Long>) o -> o.getMinuteKey())
    .timeWindow(Time.seconds(60))
    .reduce(new myAggFunction());
Unfortunately, it looks like the reduce function is never executed. If I use the code above without windowing, the reduce function works fine. Below is the code for the TimestampExtractor. The 30-second watermark delay serves just as a testing value, but the one-minute tumbling window is intended.
public static class TimestampExtractor implements AssignerWithPeriodicWatermarks<StockTrade> {
    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(System.currentTimeMillis() - 30000);
    }

    @Override
    public long extractTimestamp(StockTrade stockTrade, long l) {
        BigDecimal bd = new BigDecimal(stockTrade.getTime());
        // bd contains a milliseconds timestamp, e.g. 1498658629.036
        return bd.longValue();
    }
}
bd.longValue() returns the seconds timestamp 1498658629, and my window is also defined in seconds.
When I use bd.longValue()/60, which returns a minutes timestamp, the reduce function is called. My output file then contains all records for each reduce operation:
{time=1498717692.000, minuteTime=24978628, n=1, open=2248.0}
{time=1498717692.000, minuteTime=24978628, n=2, open=2248.0}
...
{time=1498717692.000, minuteTime=24978628, n=8, open=2248.0}
So, can anyone explain to me what is happening? Thanks a lot.
Normally watermarks should be relative to the timestamps in your data and should not be based on the system clock. One of the great things about working with event time is that the same application can be used to reprocess historic data or to process current data, but that's not possible if you compare your timestamps to the system clock, as you've done here.
A watermark can be thought of as a statement that all data with timestamps smaller than the watermark has already arrived. Or, in other words, any data with a timestamp less than the current watermark will be considered late. My guess is that you are not seeing any results because your watermarks cause all of your data to be considered late, and the window operator is dropping all of this late data.
I suggest you use a BoundedOutOfOrdernessTimestampExtractor instead. It works by keeping track of the max timestamp seen so far in the data stream and subtracting the delay from that max timestamp rather than from the system clock. The source code, in case you're curious.
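A minimal sketch under the same assumptions as the question (StockTrade.getTime() returns a seconds-based timestamp like 1498658629.036, and Flink expects timestamps in milliseconds):
public static class TradeTimestampExtractor
        extends BoundedOutOfOrdernessTimestampExtractor<StockTrade> {

    public TradeTimestampExtractor() {
        super(Time.seconds(30)); // tolerate up to 30 seconds of out-of-orderness
    }

    @Override
    public long extractTimestamp(StockTrade stockTrade) {
        // convert the seconds-based timestamp (with fractional part) to whole milliseconds
        return new BigDecimal(stockTrade.getTime())
            .multiply(BigDecimal.valueOf(1000))
            .longValue();
    }
}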

Impose order in Jsprit with HardActivityConstraint

In a scenario of re-solving a previously solved problem (with some new data, of course), it's typically impossible to re-assign a vehicle's very first assignment once it has been given. The driver is already on his way, and any new solution has to take into account that:
the job must remain his (it can't be assigned to another vehicle)
the activity that was assigned to him as the very first must remain so in future solutions
For the sake of simplicity, I'm using a single vehicle scenario, and only trying to impose the second bullet (i.e. ensure that a certain activity will be the first in the solution).
This is how I defined the constraint:
new HardActivityConstraint()
{
    @Override
    public ConstraintsStatus fulfilled(JobInsertionContext iFacts, TourActivity prevAct,
            TourActivity newAct, TourActivity nextAct, double prevActDepTime)
    {
        String locationId = newAct.getLocation().getId();
        // we want to make sure that any solution will have "C1" as its first activity
        boolean activityShouldBeFirst = locationId.equals("C1");
        boolean attemptingToInsertFirst = (prevAct instanceof Start);
        if (activityShouldBeFirst && !attemptingToInsertFirst)
            return ConstraintsStatus.NOT_FULFILLED_BREAK;
        if (!activityShouldBeFirst && attemptingToInsertFirst)
            return ConstraintsStatus.NOT_FULFILLED;
        return ConstraintsStatus.FULFILLED;
    }
}
This is how I build the algorithm:
VehicleRoutingAlgorithmBuilder vraBuilder;
vraBuilder = new VehicleRoutingAlgorithmBuilder(vrpProblem, "schrimpf.xml");
vraBuilder.addCoreConstraints();
vraBuilder.addDefaultCostCalculators();
StateManager stateManager = new StateManager(vrpProblem);
ConstraintManager constraintManager = new ConstraintManager(vrpProblem, stateManager);
constraintManager.addConstraint(new HardActivityConstraint() { ... }, Priority.HIGH);
vraBuilder.setStateAndConstraintManager(stateManager, constraintManager);
VehicleRoutingAlgorithm algorithm = vraBuilder.build();
The results are not good. I'm only getting solutions with a single job assigned (the one with the required activity). In debugging, it's clear that the job-insertion iterations consider many viable options that appear to solve the problem entirely, but in the end, the best solution returned by the algorithm doesn't include the other jobs.
UPDATE: even more surprising is that when I use the constraint in scenarios with over 5 vehicles, it works fine (the worst results are with 1 vehicle).
I'll gladly attach more information if needed.
Thanks
Zach
First, you can use initial routes to ensure that certain jobs are assigned to specific vehicles right from the beginning (see example).
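A rough sketch of the initial-route idea (the identifiers vehicle and jobAlreadyUnderway are hypothetical; see the linked example for the exact API):
VehicleRoute initialRoute = VehicleRoute.Builder.newInstance(vehicle)
    .addService(jobAlreadyUnderway) // the job that must stay with this vehicle
    .build();
VehicleRoutingProblem vrpProblem = VehicleRoutingProblem.Builder.newInstance()
    .addInitialVehicleRoute(initialRoute)
    // ... vehicles, remaining jobs, costs ...
    .build();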
Second, to ensure that no activity will be inserted between the start and your initial job location (e.g. "C1" in your example), you need to prohibit it the way you defined your HardActivityConstraint: just modify it so that a newAct can never sit between prevAct = Start and nextAct = act(C1).
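A sketch of that modification, reusing the identifiers from the question:
new HardActivityConstraint()
{
    @Override
    public ConstraintsStatus fulfilled(JobInsertionContext iFacts, TourActivity prevAct,
            TourActivity newAct, TourActivity nextAct, double prevActDepTime)
    {
        boolean nextIsC1 = nextAct.getLocation() != null
                && "C1".equals(nextAct.getLocation().getId());
        // nothing may be inserted between Start and the C1 activity
        if ((prevAct instanceof Start) && nextIsC1)
            return ConstraintsStatus.NOT_FULFILLED;
        return ConstraintsStatus.FULFILLED;
    }
}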
Third, with regard to your update, keep in mind that the essence of the algorithm is to ruin part of the solution (remove a number of jobs) and recreate the solution again (insert the unassigned jobs). Currently, the schrimpf algorithm ruins a number of jobs relative to the total number of jobs, i.e. noJobs = 0.5 * totalNoJobs for the random ruin and 0.3 * totalNoJobs for the radial ruin. If your problem is very small, the share of jobs to be removed might not be sufficient. This is going to change with the next release, where you can use an algorithm out of the box which defines an absolute minimum of jobs that need to be removed. For the time being, modify the shares in your algorithmConfig.xml.

Sampling function produces the same result every time

I am generating weighted random numbers (sampling with replacement) with the following code:
Object[] population = { 0, 1 };
double[] weights = { p1, p2 };
Sampling randsamp = new Sampling(population, weights);
X = (Integer) randsamp.next();
I have tried different values of p1 and p2, which are the probabilities, and 0 and 1 are the population (the numbers to be generated based on p1 and p2).
However, running the code multiple times produces the same result. For example, if I make 10 iterations and store the result in an array X[], I get the same array every time the code is executed. Can someone tell me why this is happening? Should I not get different arrays/numbers on each run?
Thanks
If you search Google for jpsgcs.alun.random.Sampling, you get some broken links about this Sampling class. Moreover, if you browse here, you can see that the jar you can download no longer even contains a package called random. So it was probably removed for some reason... maybe the Sampling class was removed because it was not working properly? I can only suggest you get in touch with whoever wrote this library.
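In the meantime, weighted sampling with replacement is simple to implement directly. A minimal sketch with java.util.Random, which is time-seeded by default so repeated runs produce different sequences (the class and method names here are my own, not the jpsgcs API):

import java.util.Random;

public class WeightedSampler {
    private final Object[] population;
    private final double[] cumulative; // running sums of the weights
    private final Random rng = new Random(); // time-seeded by default

    public WeightedSampler(Object[] population, double[] weights) {
        this.population = population;
        this.cumulative = new double[weights.length];
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            cumulative[i] = sum;
        }
    }

    public Object next() {
        // draw uniformly in [0, totalWeight) and find the matching bucket
        double r = rng.nextDouble() * cumulative[cumulative.length - 1];
        for (int i = 0; i < cumulative.length; i++) {
            if (r < cumulative[i]) return population[i];
        }
        return population[population.length - 1]; // guard against rounding
    }
}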
