Is it possible in Flink to compute over aggregated output of a keyed window?
We have a DataStream on which we call keyBy(), specifying a field composed of a char and a number (for example A01, A02, ..., A10, B01, B02, ..., B10, etc.), like the squares of a chessboard.
After the keyBy() we call window(TumblingEventTimeWindows.of(Time.days(7))), so we create a weekly window.
After this, we call reduce() and as result we obtain SingleOutputStreamOperator<Result>.
Now, we want to group the SingleOutputStreamOperator<Result> based on a field of each Result object and iterate over each group to extract the top 3 based on another field of the Result objects in that group. Is it possible to do this without creating another weekly window and performing an aggregation function on it?
Obviously the second window works, but I don't like the thought of having a second weekly window right after the first. I would like to be able to merge all the elements of the SingleOutputStreamOperator<Result> from the first window and execute a function on them without having to use a new window that receives all the elements together.
This is my code, as you can see:
We use keyBy() with a Tuple2<String, Integer> built from fields of the Query2IntermediateOutcome object. The String in the tuple is the code A01, ..., A10 I mentioned before.
The code window(timeIntervalConstructor.newInstance()) basically creates a weekly window.
We call reduce() so for each key we have an aggregated value.
Now we use another keyBy(); this time the key is computed from the number in the code A01, ..., A10: if it's greater than 5 we have one sea type, if it's less than or equal to 5 we have the other.
Again, window(timeIntervalConstructor.newInstance()) for the second weekly window.
Finally, in the aggregate() we compute the top3 for each group.
.keyBy(new KeySelector<Query2IntermediateOutcome, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> getKey(Query2IntermediateOutcome intermediateOutcome) throws Exception {
return new Tuple2<String, Integer>(intermediateOutcome.getCellId(), intermediateOutcome.getHourInDate());
}
})
.window(timeIntervalConstructor.newInstance())
.reduce(new ReduceFunction<Query2IntermediateOutcome>() {
@Override
public Query2IntermediateOutcome reduce(Query2IntermediateOutcome t1, Query2IntermediateOutcome t2) throws Exception {
t1.setAttendance(t1.getAttendance()+t2.getAttendance());
return t1;
}
})
.keyBy(new KeySelector<Query2IntermediateOutcome, String>() {
@Override
public String getKey(Query2IntermediateOutcome query2IntermediateOutcome) throws Exception {
return query2IntermediateOutcome.getSeaType().toString();
}
})
.window(timeIntervalConstructor.newInstance())
.aggregate(new Query2FinalAggregator(), new Query2Window())
This solution works, but I don't really like it: the second window receives all the data when the previous one fires, and since that happens weekly, the second window receives all the data at once and must immediately run the aggregate().
I think it would be reasonably straightforward to collapse all of this business logic into one KeyedProcessFunction. Then you could avoid the burst of activity at the end of the week.
Take a look at this tutorial in the Flink docs for an example of how to replace a keyed window with a KeyedProcessFunction.
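Whichever operator ends up hosting it, the top-3 step itself is just a sort-and-limit. Here is a plain-Java sketch of that piece alone; the Result shape and its field names are invented for illustration and are not the asker's actual class:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class Top3Demo {
    // Hypothetical minimal stand-in for the Result objects of the first window.
    record Result(String seaType, long attendance) {}

    // Top 3 results in a group, ranked by attendance (descending).
    static List<Result> top3(List<Result> group) {
        return group.stream()
                .sorted(Comparator.comparingLong(Result::attendance).reversed())
                .limit(3)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Result> group = List.of(
                new Result("SEA_A", 10),
                new Result("SEA_A", 40),
                new Result("SEA_A", 25),
                new Result("SEA_A", 5));
        // Prints the three highest-attendance results, best first.
        System.out.println(top3(group));
    }
}
```

Inside a KeyedProcessFunction this logic would run over the per-key state collected during the week instead of over an in-memory list.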
My classes.
class MyLoan {
private Long loanId;
private BigDecimal loanAmount;
private BigDecimal totalPaid;
....
}
class Customer {
private Long loanId;
private List<MyLoan> myLoan;
}
I want to iterate over the myLoan from a Customer and calculate the totalPaid amount.
My logic is: "If loanId is 23491L or 23492L, then add the loanAmount of those two loanIds and set the value in the totalPaid amount of loanId 23490L". The totalPaid amount is always showing as zero with my logic below.
I want to use Java 8 streams, but I am unable to write multiple conditions when using streams.
BigDecimal spreadAmount = BigDecimal.ZERO; // must be initialized before use
for (MyLoan myloan : customer.getMyLoan()) {
    if (myloan.getLoanId() == 23491L || myloan.getLoanId() == 23492L) {
        spreadAmount = spreadAmount.add(myloan.getLoanAmount());
    }
    if (myloan.getLoanId() == 23490L) {
        myloan.setTotalPaid(spreadAmount);
    }
}
The totalPaid field is not modified because your MyLoan instance with id 23490L is encountered before the other two MyLoans.
As @Silvio Mayolo has suggested in the comments, you should first compute the total amount with a temp variable and then assign it to the totalPaid field of the MyLoan instance with id 23490L.
This is a stream implementation of what you were trying to do:
// Check first that the MyLoan element invoking the setter is actually present
if (myLoan.stream().map(MyLoan::getLoanId).anyMatch(value -> value == 23490L)) {
    myLoan.stream()
            .filter(loan -> loan.getLoanId() == 23490L)
            .findFirst()
            .get()
            .setTotalPaid(myLoan.stream()
                    .filter(loan -> loan.getLoanId() == 23491L || loan.getLoanId() == 23492L)
                    .map(MyLoan::getLoanAmount)
                    .reduce(BigDecimal.ZERO, BigDecimal::add));
}
WARNING
The method get(), invoked on the Optional retrieved with the terminal operation findFirst(), could throw a NoSuchElementException if a MyLoan with id 23490L is not present within the list. You should first make sure that the element is present, as I've done with my if statement.
A second approach (bad practice) could involve catching the NoSuchElementException thrown by get(), in case the desired MyLoan is not present. As it has been pointed out in the comments, catching a RuntimeException (NoSuchElementException is a subclass of it) is a bad practice, as we should investigate the origin of the problem rather than simply catching the exception. This second approach was honestly a (lazy) last resort only to show another possible way of handling the case.
Firstly, you need to fetch a loan for which you want to define a total paid amount. If this step succeeds, then calculate a total.
In order to find a loan with a particular id using streams, you need to create a stream over the customer's loans and apply filter() in conjunction with findFirst() on it. It'll give you the first element from the stream that matches the predicate passed into the filter. Because the result might not be present in the stream, findFirst() returns an Optional object.
The Optional class offers a wide range of methods to interact with it, like orElse(), ifPresent(), map(), etc. Avoid blindly using get() unless you have checked that the value is present; in many cases it isn't the most convenient way to deal with an Optional. In the code below, ifPresent() is used to proceed with the logic only if a value is present.
So if the required loan was found, the next step is to calculate the total. This is done by keeping only the target ids with filter(), extracting the amounts with map(), and adding them together using reduce() as the terminal operation.
public static void setTotalPaid(Customer customer, Long idToSet, Long... idsToSumUp) {
List<MyLoan> loans = customer.getMyLoan();
getLoanById(loans, idToSet).ifPresent(loan -> loan.setTotalPaid(getTotalPaid(loans, idsToSumUp)));
}
public static Optional<MyLoan> getLoanById(List<MyLoan> loans, Long id) {
return loans.stream()
.filter(loan -> loan.getLoanId().equals(id))
.findFirst();
}
public static BigDecimal getTotalPaid(List<MyLoan> loans, Long... ids) {
Set<Long> targetLoans = Set.of(ids); // wrapping with set to improve performance
return loans.stream()
.filter(loan -> targetLoans.contains(loan.getLoanId()))
.map(MyLoan::getLoanAmount)
.reduce(BigDecimal.ZERO, BigDecimal::add);
}
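For completeness, here is a self-contained, runnable sketch of the approach above. The MyLoan stand-in is simplified (no Customer wrapper, and the helpers take the loan list directly) purely for the demo:

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class LoanDemo {
    // Minimal stand-in for the MyLoan class in the question.
    static class MyLoan {
        final Long loanId;
        final BigDecimal loanAmount;
        BigDecimal totalPaid = BigDecimal.ZERO;

        MyLoan(Long loanId, BigDecimal loanAmount) {
            this.loanId = loanId;
            this.loanAmount = loanAmount;
        }

        Long getLoanId() { return loanId; }
        BigDecimal getLoanAmount() { return loanAmount; }
        void setTotalPaid(BigDecimal totalPaid) { this.totalPaid = totalPaid; }
        BigDecimal getTotalPaid() { return totalPaid; }
    }

    static Optional<MyLoan> getLoanById(List<MyLoan> loans, Long id) {
        return loans.stream()
                .filter(loan -> loan.getLoanId().equals(id))
                .findFirst();
    }

    static BigDecimal getTotalPaid(List<MyLoan> loans, Long... ids) {
        Set<Long> targetLoans = Set.of(ids);
        return loans.stream()
                .filter(loan -> targetLoans.contains(loan.getLoanId()))
                .map(MyLoan::getLoanAmount)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    static void setTotalPaid(List<MyLoan> loans, Long idToSet, Long... idsToSumUp) {
        getLoanById(loans, idToSet)
                .ifPresent(loan -> loan.setTotalPaid(getTotalPaid(loans, idsToSumUp)));
    }

    public static void main(String[] args) {
        List<MyLoan> loans = List.of(
                new MyLoan(23490L, BigDecimal.ZERO),
                new MyLoan(23491L, new BigDecimal("100.50")),
                new MyLoan(23492L, new BigDecimal("49.50")));
        setTotalPaid(loans, 23490L, 23491L, 23492L);
        System.out.println(getLoanById(loans, 23490L).get().getTotalPaid()); // 150.00
    }
}
```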
I have an object, Bill, with a number of fields. In the method below, I get the bill with a function. I want to validate it with a list of Predicate<Bill>, which are paired with the appropriate error message to be applied if the predicate test fails. How can I accumulate the error messages given a list of tests, given that I can have more than eight conditions, and therefore won't be able to use Validation.combine?
default Validation<Seq<String>, Long> validate(
        Long id,
        Function<Long, Bill> getBill,
        List<Pair<Predicate<Bill>, String>> tests) {
Bill bill = getBill.apply(id);
//I want to do the same thing
//but using the list I passed in,
//without the limitation of eight validations.
return Validation.combine(
validateBill(bill, Pair.of(hasDateInsurerReceivedBill, "Date Insurer Received Bill absent")),
validateBill(bill, Pair.of(EventValidation.hasEmployeeIdNumber, "Employee ID Number absent"))
).ap((x, y) -> id);
}
default Validation<String,Long> validateBill(
Bill bill, Pair<Predicate<Bill>, String> condition)
{
return condition.getFirst().test(bill) ?
Validation.valid(bill.getIntId())
: Validation.invalid(condition.getSecond());
}
I'm brand new to this library and I'm not terribly familiar with functional programming yet, so please use examples and the simplest terminology possible in any explanations.
I would do a nested combine and then flatten the results.
In our project we always have Seq<ValidationError> on the left side of a Validation, you don't have to but it is good to understand the code I'll show you.
With the first 8 Validations you return a new Validation in the .ap
When you return a Validation inside .ap you will end up with something like this:
Validation<Seq<ValidationError>, Validation<Seq<ValidationError>, String>> x = ...
This needs to be flattened with the following piece of code:
Validation
.combine(step1, step2, step3, step4, step5, step6, step7, step8)
.ap((a, b, c, d, e, f, g, h) -> {
// do important stuff and
return Validation......
})
.mapError(Util::flattenErrors)
.fold(Validation::invalid, Function.identity());
The Util class:
public static Seq<ValidationError> flattenErrors(final Seq<Seq<ValidationError>> nested) {
return nested
.flatMap(Function.identity())
.distinct(); //optional duplicate filtering
}
With this new validation you can do the same trick again (you can add 7 new validations every time or create a few and do another combine, depends a bit on the number of validations you have).
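Vavr aside, the underlying idea — run every predicate and accumulate every failure message, with no eight-argument limit — can be sketched in plain Java. The Bill and Rule shapes below are invented for illustration and are not the asker's actual types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ValidationDemo {
    // Hypothetical stand-in for the Bill object being validated.
    record Bill(Long id, String receivedDate, String employeeId) {}

    // A predicate paired with the message to report when it fails.
    record Rule(Predicate<Bill> test, String message) {}

    // Runs every rule and accumulates all failure messages,
    // instead of stopping at the first failure.
    static List<String> validate(Bill bill, List<Rule> rules) {
        List<String> errors = new ArrayList<>();
        for (Rule rule : rules) {
            if (!rule.test().test(bill)) {
                errors.add(rule.message());
            }
        }
        return errors; // an empty list means the bill is valid
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(b -> b.receivedDate() != null, "Date Insurer Received Bill absent"),
                new Rule(b -> b.employeeId() != null, "Employee ID Number absent"));
        // Both messages are reported, not just the first one.
        System.out.println(validate(new Bill(1L, null, null), rules));
    }
}
```

The list of rules can grow freely, which is exactly what the fixed-arity Validation.combine cannot do on its own.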
I have this particular problem now - I have a grid that I am trying to have the data filtered through multiple filters. For that, I am using textboxes that serve as input fields for my filtering criterion.
My grid has three columns (First Name, Last Name, Address) and I would like to be able to chain the filtering operations one after the other. All of the values are taken from a MySQL database.
Essentially the filter process should go like this:
FirstName ^ LastName ^ Address
For example, grid with three columns:
And in the filter for First Name column, I input the variables Aa, which would result in the table looking like this:
However, if I decide to input D into the Last Name filter, it returns results like this (ignoring the modifications by the first filter):
Instead of the expected result which would look like this:
The way I am filtering through the grid is like this:
firstNameFilter.addValueChangeListener( e->
{
Notification.show(e.getValue());
ldp.setFilter(desc ->
{
return StringUtils.containsIgnoreCase(desc.getFName(), firstNameFilter.getValue());
});
});
firstNameFilter.setValueChangeMode(ValueChangeMode.EAGER);
What would be the best way to filter through multiple columns whilst taking into consideration previous filter actions?
listDataProvider.setFilter(...) will overwrite any existing filter.
I have written an answer about this very topic, with a complete example code ready for copy paste, and screenshots showing that the multiple filters work as expected.
The most important takeaway from it is this:
Every time that any filter value changes, I reset the current filter using setFilter. But within that new Filter, I will check the values of ALL filter fields, and not only the value of the field whose value just changed. In other words, I always have only one single filter active, but that filter accounts for all defined filter-values.
Here is how it could look with your code:
firstNameFilter.addValueChangeListener( e-> this.onFilterChange());
lastNameFilter.addValueChangeListener( e-> this.onFilterChange());
addressFilter.addValueChangeListener( e-> this.onFilterChange());
// sidenote: all filter fields need ValueChangeMode.EAGER to work this way
private void onFilterChange(){
ldp.setFilter(desc -> {
boolean fNameMatch = true;
boolean lNameMatch = true;
boolean addressMatch = true;
if(!firstNameFilter.isEmpty()){
fNameMatch = StringUtils.containsIgnoreCase(desc.getFName(), firstNameFilter.getValue());
}
if(!lastNameFilter.isEmpty()){
lNameMatch = StringUtils.containsIgnoreCase(desc.getLName(), lastNameFilter.getValue());
}
if(!addressFilter.isEmpty()){
addressMatch = StringUtils.containsIgnoreCase(desc.getAddress(), addressFilter.getValue());
}
return fNameMatch && lNameMatch && addressMatch;
});
}
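The same "one filter that checks all fields" idea can be illustrated outside Vaadin with java.util.function.Predicate; the Person record and the sample rows below are made up for the demo:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterDemo {
    // Hypothetical stand-in for the grid rows.
    record Person(String firstName, String lastName, String address) {}

    static boolean containsIgnoreCase(String haystack, String needle) {
        return haystack.toLowerCase().contains(needle.toLowerCase());
    }

    // Builds ONE predicate that accounts for all currently set filter values;
    // an empty filter string means "no restriction on this column".
    static Predicate<Person> combinedFilter(String fName, String lName, String addr) {
        Predicate<Person> p = x -> true;
        if (!fName.isEmpty()) p = p.and(x -> containsIgnoreCase(x.firstName(), fName));
        if (!lName.isEmpty()) p = p.and(x -> containsIgnoreCase(x.lastName(), lName));
        if (!addr.isEmpty())  p = p.and(x -> containsIgnoreCase(x.address(), addr));
        return p;
    }

    public static void main(String[] args) {
        List<Person> rows = List.of(
                new Person("Aaron", "Doe", "1 Main St"),
                new Person("Aaliyah", "Smith", "2 Oak Ave"),
                new Person("Bob", "Davis", "3 Elm Rd"));
        // Both filters apply at once: first name contains "Aa" AND last name contains "D",
        // so only Aaron Doe survives.
        System.out.println(rows.stream()
                .filter(combinedFilter("Aa", "D", ""))
                .collect(Collectors.toList()));
    }
}
```

This mirrors the answer above: every value change rebuilds the single combined predicate, so earlier filters are never forgotten.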
Trying to create some mechanism of alerting system, I am looking to find a drop in an average between two windows.
I was happy to find TrafficRoutes example, specifically when I saw it says:
A 'slowdown' occurs if a supermajority of speeds in a sliding window
are less than the reading of the previous window.
I looked in the code, but failed to understand why this means we get the previous value from the previous window. Since I had no experience with sliding windows until now, I thought I might be missing something.
Implementing this kind of mechanism, with or without sliding windows, does not get data from previous windows, as I suspected.
Any idea what I am missing?
Is there a certain way to get values from a previous window?
I am executing on GCP Dataflow, with SDK 1.9.0.
Please advise,
Shushu
My assumptions:
Your alerting system has data partitioned into "metrics" identified by "metric ids".
The value of a metric at a given time is Double.
You are receiving the metric data as a PCollection<KV<String, Double>> where the String is metric id, the Double is the metric value, and each element has the appropriate implicit timestamp (if it doesn't, you can assign one using the WithTimestamps transform).
You want to compute sliding averages of each metric for each 5-minute interval starting at every 1 minute, and want to do something in case the average for the interval starting at T+1min is smaller than the average for the interval starting at T.
You can accomplish it like this:
PCollection<KV<String, Double>> metricValues = ...;
// Collection of (metric, timestamped 5-minute average)
// windowed into the same 5-minute windows as the input,
// where timestamp is assigned as the beginning of the window.
PCollection<KV<String, TimestampedValue<Double>>>
metricSlidingAverages = metricValues
.apply(Window.<KV<String, Double>>into(
SlidingWindows.of(Duration.standardMinutes(5))
.every(Duration.standardMinutes(1))))
.apply(Mean.<String, Double>perKey())
.apply(ParDo.of(new ReifyWindowFn()));
// Rewindow the previous collection into global window so we can
// do cross-window comparisons.
// For each metric, an unsorted list of (timestamp, average) pairs.
PCollection<KV<String, Iterable<TimestampedValue<Double>>>>
metricAverageSequences = metricSlidingAverages
.apply(Window.<KV<String, TimestampedValue<Double>>>into(
new GlobalWindows()))
// We need to group the data by key again since the grouping key
// has changed (remember, GBK implicitly groups by key and window)
.apply(GroupByKey.<String, TimestampedValue<Double>>create());
metricAverageSequences.apply(ParDo.of(new DetectAnomaliesFn()));
...
class ReifyWindowFn extends DoFn<
KV<String, Double>, KV<String, TimestampedValue<Double>>> {
@ProcessElement
public void process(ProcessContext c, BoundedWindow w) {
// This DoFn makes the implicit window of the element be explicit
// and extracts the starting timestamp of the window.
c.output(KV.of(
c.element().getKey(),
TimestampedValue.of(c.element().getValue(), w.minTimestamp())));
}
}
class DetectAnomaliesFn extends DoFn<
KV<String, Iterable<TimestampedValue<Double>>>, Void> {
@ProcessElement
public void process(ProcessContext c) {
String metricId = c.element().getKey();
// Sort the (timestamp, average) pairs by timestamp.
List<TimestampedValue<Double>> averages = Ordering.natural()
.onResultOf(TimestampedValue::getTimestamp)
.sortedCopy(c.element().getValue());
// Scan for anomalies.
for (int i = 1; i < averages.size(); ++i) {
if (averages.get(i).getValue() < averages.get(i-1).getValue()) {
// Detected anomaly! Could do something with it,
// e.g. publish to a third-party system or emit into
// a PCollection.
}
}
}
}
Note that I did not test this code, but it should provide enough conceptual guidance for you to accomplish the task.
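Stripped of the Beam plumbing, the sort-and-scan inside DetectAnomaliesFn is ordinary Java. Here is a runnable sketch of just that part; the TimestampedAvg stand-in (a window-start timestamp paired with that window's average) is invented for the demo:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

public class DropScanDemo {
    // Hypothetical stand-in for Beam's TimestampedValue<Double>:
    // the start of a window (millis) paired with that window's average.
    record TimestampedAvg(long timestampMillis, double average) {}

    // Sorts the per-window averages by timestamp and reports the
    // timestamps whose average dropped relative to the previous window.
    static List<Long> detectDrops(Collection<TimestampedAvg> averages) {
        List<TimestampedAvg> sorted = new ArrayList<>(averages);
        sorted.sort(Comparator.comparingLong(TimestampedAvg::timestampMillis));
        List<Long> drops = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).average() < sorted.get(i - 1).average()) {
                drops.add(sorted.get(i).timestampMillis());
            }
        }
        return drops;
    }

    public static void main(String[] args) {
        // Unsorted, as the averages would arrive from the GroupByKey.
        List<TimestampedAvg> avgs = List.of(
                new TimestampedAvg(120_000, 4.0),
                new TimestampedAvg(0, 5.0),
                new TimestampedAvg(60_000, 6.0));
        System.out.println(detectDrops(avgs)); // [120000]
    }
}
```

In the pipeline above, this scan runs per metric id inside the DoFn, and "do something with it" would be publishing the alert instead of returning a list.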
I'm having trouble properly implementing the following scenario using RxJava (v1.2.1):
I need to handle a request for some data object. I have a meta-data copy of this object which I can return immediately, while making an API call to a remote server to retrieve the whole object data. When I receive the data from the API call I need to process the data before emitting it.
My solution currently looks like this:
return Observable.just(localDataCall())
.concatWith(externalAPICall().map(new DataProcessFunction()));
The first Observable, localDataCall(), should emit the local data, which is then concatenated with the remote API call, externalAPICall(), mapped to the DataProcessFunction.
This solution works but it has a behavior that is not clear to me. When the local data call returns its value, this value goes through the DataProcessFunction even though it's not connected to the first call.
Any idea why this is happening? Is there a better implementation for my use case?
I believe that the issue lies in some part of your code that has not been provided. The data returned from localDataCall() is independent of the new DataProcessFunction() object, unless somewhere within localDataCall you use another DataProcessFunction.
To prove this to you I will create a small example using io.reactivex:rxjava:1.2.1:
public static void main(String[] args){
Observable.just(foo())
.concatWith(bar().map(new IntMapper()))
.subscribe(System.out::println);
}
static int foo() {
System.out.println("foo");
return 0;
}
static Observable<Integer> bar() {
System.out.println("bar");
return Observable.just(1, 2);
}
static class IntMapper implements Func1<Integer, Integer>
{
@Override
public Integer call(Integer integer)
{
System.out.println("IntMapper " + integer);
return integer + 5;
}
}
This prints to the console:
foo
bar
0
IntMapper 1
6
IntMapper 2
7
As can be seen, the value 0 created in foo never gets processed by IntMapper; IntMapper#call is only called twice for the values created in bar. The same can be said for the value created by localDataCall. It will not be mapped by the DataProcessFunction object passed to your map call. Just like bar and IntMapper, only values returned from externalAPICall will be processed by DataProcessFunction.
.concatWith() concatenates all items emitted by one observable with all items emitted by the other observable, so no wonder that .map() is being called twice.
But I do not understand why you need localDataCall() at all in this scenario. Perhaps you might want to use .switchIfEmpty() or .switchOnNext() instead.