Getting previous window data in DataFlow


I am building an alerting mechanism and want to detect a drop in an average between two consecutive windows.
I was happy to find the TrafficRoutes example, specifically where it says:
A 'slowdown' occurs if a supermajority of speeds in a sliding window
are less than the reading of the previous window.
I looked at the code but could not see how it obtains a value from the previous window. Since I have no experience with sliding windows, I thought I might be missing something.
When I implemented this kind of mechanism, with or without sliding windows, I did not get data from previous windows, as I suspected.
What am I missing?
Is there a way to get values from the previous window?
I am executing on GCP Dataflow, with SDK 1.9.0.
Please advise,
Shushu

My assumptions:
Your alerting system has data partitioned into "metrics" identified by "metric ids".
The value of a metric at a given time is Double.
You are receiving the metric data as a PCollection<KV<String, Double>> where the String is metric id, the Double is the metric value, and each element has the appropriate implicit timestamp (if it doesn't, you can assign one using the WithTimestamps transform).
You want to compute sliding averages of each metric for each 5-minute interval starting at every 1 minute, and want to do something when the average for the interval starting at T+1min is smaller than the average for the interval starting at T.
You can accomplish this like so:
PCollection<KV<String, Double>> metricValues = ...;
// Collection of (metric, timestamped 5-minute average)
// windowed into the same 5-minute windows as the input,
// where timestamp is assigned as the beginning of the window.
PCollection<KV<String, TimestampedValue<Double>>>
    metricSlidingAverages = metricValues
        .apply(Window.<KV<String, Double>>into(
            SlidingWindows.of(Duration.standardMinutes(5))
                .every(Duration.standardMinutes(1))))
        .apply(Mean.<String, Double>perKey())
        .apply(ParDo.of(new ReifyWindowFn()));
// Rewindow the previous collection into global window so we can
// do cross-window comparisons.
// For each metric, an unsorted list of (timestamp, average) pairs.
PCollection<KV<String, Iterable<TimestampedValue<Double>>>>
    metricAverageSequences = metricSlidingAverages
        .apply(Window.<KV<String, TimestampedValue<Double>>>into(
            new GlobalWindows()))
        // We need to group the data by key again since the grouping key
        // has changed (remember, GBK implicitly groups by key and window)
        .apply(GroupByKey.<String, TimestampedValue<Double>>create());
metricAverageSequences.apply(ParDo.of(new DetectAnomaliesFn()));
...
class ReifyWindowFn extends DoFn<
    KV<String, Double>, KV<String, TimestampedValue<Double>>> {
  @ProcessElement
  public void process(ProcessContext c, BoundedWindow w) {
    // This DoFn makes the implicit window of the element be explicit
    // and extracts the starting timestamp of the window.
    // Sliding windows are IntervalWindows, so the cast exposes start().
    c.output(KV.of(
        c.element().getKey(),
        TimestampedValue.of(c.element().getValue(), ((IntervalWindow) w).start())));
  }
}
class DetectAnomaliesFn extends DoFn<
    KV<String, Iterable<TimestampedValue<Double>>>, Void> {
  @ProcessElement
  public void process(ProcessContext c) {
    String metricId = c.element().getKey();
    // Sort the (timestamp, average) pairs by timestamp.
    List<TimestampedValue<Double>> averages = Ordering.natural()
        .onResultOf(TimestampedValue::getTimestamp)
        .sortedCopy(c.element().getValue());
    // Scan for anomalies.
    for (int i = 1; i < averages.size(); ++i) {
      if (averages.get(i).getValue() < averages.get(i-1).getValue()) {
        // Detected anomaly! Could do something with it,
        // e.g. publish to a third-party system or emit into
        // a PCollection.
      }
    }
  }
}
Note that I did not test this code, but it should provide enough conceptual guidance for you to accomplish the task.
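To make the comparison step easier to try out in isolation, here is a minimal plain-Java sketch of what DetectAnomaliesFn does with one metric's averages: sort by window start, then compare each average with its predecessor. It uses no Dataflow classes; DropDetector and WindowAverage are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java sketch of the cross-window comparison; not Dataflow code. */
class DropDetector {

    /** Hypothetical stand-in for TimestampedValue<Double>: a window start and its average. */
    static final class WindowAverage {
        final long windowStartMillis;
        final double average;

        WindowAverage(long windowStartMillis, double average) {
            this.windowStartMillis = windowStartMillis;
            this.average = average;
        }
    }

    /** Returns the start times of windows whose average is lower than the preceding window's. */
    static List<Long> findDrops(List<WindowAverage> unsorted) {
        // The grouped values arrive in no particular order, so sort by window start first.
        List<WindowAverage> sorted = new ArrayList<>(unsorted);
        sorted.sort(Comparator.comparingLong((WindowAverage w) -> w.windowStartMillis));

        List<Long> drops = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).average < sorted.get(i - 1).average) {
                drops.add(sorted.get(i).windowStartMillis);
            }
        }
        return drops;
    }
}
```

A real alert would probably want a threshold (for example a drop of more than some percentage) rather than any decrease at all.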

Related

Interactive Broker Java API

Every time I place a new order with IB, I request the next valid orderId and then call Thread.sleep(500) to wait 0.5 seconds for the IB API's nextValidId callback to return the latest orderId. To place multiple orders I have to sleep multiple times. This is not a good approach: the orderId may have been updated earlier, in which case the order could have been placed sooner, and if the update takes longer than the sleep, the order fails with an error.
Is there a more efficient and elegant way to do this?
Ideally, the program would hold off running placeNewOrder until the latest orderId is available, and then be notified to run it.
I do not know much about Java synchronization, but I reckon there might be a better solution using synchronized, wait/notify, locking, or blocking.
My code:
// place first order
ib_client.reqIds(-1);
Thread.sleep(500);
int currentOrderId = ib_wrapper.getCurrentOrderId();
placeNewOrder(currentOrderId, orderDetails); // my order placement method
// place 2nd order
ib_client.reqIds(-1);
Thread.sleep(500);
currentOrderId = ib_wrapper.getCurrentOrderId(); // already declared above
placeNewOrder(currentOrderId, orderDetails); // my order placement method
IB EWrapper:
public class EWrapperImpl implements EWrapper {
    ...
    protected int currentOrderId = -1;
    ...
    public int getCurrentOrderId() {
        return currentOrderId;
    }
    public void nextValidId(int orderId) {
        System.out.println("Next Valid Id: ["+orderId+"]");
        currentOrderId = orderId;
    }
    ...
}

You never need to ask for IDs. Just increment by one for every order.
When you first connect, nextValidId is the first or second message you receive; just keep track of that ID and keep incrementing.
The only rules for orderId are that it must be an integer and must always increase by some amount. This is tracked per clientId, so if you connect with a new clientId the last orderId will be something else.
I always use max(1000, nextValidId) so that my order IDs start at 1000 or more, since I use IDs below 1000 for data requests. It just helps with errors that carry IDs.
You can also reset the sequence somehow.
https://interactivebrokers.github.io/tws-api/order_submission.html
This means that if there is a single client application submitting
orders to an account, it does not have to obtain a new valid
identifier every time it needs to submit a new order. It is enough to
increase the last value received from the nextValidId method by one.
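One way to replace the Thread.sleep calls along these lines is to seed a counter from the first nextValidId callback and hand out IDs locally from then on. The sketch below assumes a single client application per clientId; OrderIdAllocator is a hypothetical helper, not part of the IB API, and the floor of 1000 simply mirrors the max(1000, nextValidId) convention mentioned above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper; hands out order IDs locally once the API has reported a starting value. */
class OrderIdAllocator {
    private final AtomicInteger nextId = new AtomicInteger();
    private final CountDownLatch seeded = new CountDownLatch(1);

    /** Call this from EWrapper.nextValidId(int). The sequence never moves backwards. */
    void onNextValidId(int orderId) {
        nextId.accumulateAndGet(Math.max(1000, orderId), Math::max);
        seeded.countDown();
    }

    /** Blocks only until the first nextValidId has arrived, then returns a fresh ID. */
    int nextOrderId(long timeout, TimeUnit unit) {
        try {
            if (!seeded.await(timeout, unit)) {
                throw new IllegalStateException("nextValidId was not received in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for nextValidId", e);
        }
        return nextId.getAndIncrement();
    }
}
```

After the first callback, nextOrderId returns immediately, so consecutive orders no longer wait on the API at all.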

You should not mess around with the order ID; it is tracked and set automatically by the API. Otherwise you will get the annoying "Duplicate order id" error 103. From the ApiController class:
public void placeOrModifyOrder(Contract contract, final Order order, final IOrderHandler handler) {
    if (!checkConnection())
        return;
    // when placing new order, assign new order id
    if (order.orderId() == 0) {
        order.orderId( m_orderId++);
        if (handler != null) {
            m_orderHandlers.put( order.orderId(), handler);
        }
    }
    m_client.placeOrder( contract, order);
    sendEOM();
}

Flink how to compute over aggregated output of a keyed window

Is it possible in Flink to compute over aggregated output of a keyed window?
We have a DataStream on which we call keyBy(), specifying a field composed of a letter and a number (for example A01, A02, ... A10, B01, B02, ... B10, etc.), like the squares of a chessboard.
After the keyBy() we call window(TumblingEventTimeWindows.of(Time.days(7))), so we create a weekly window.
After this, we call reduce(), and as a result we obtain a SingleOutputStreamOperator<Result>.
Now we want to group the SingleOutputStreamOperator<Result> by a field of each Result object and iterate over each group to extract the top 3 based on a field of the Result objects in that group. Is it possible to do this without creating another weekly window and performing an aggregation function on it?
Obviously this works, but I don't like having a second weekly window after the first one. I would like to merge all the SingleOutputStreamOperator<Result> elements of the first window and execute a function on them without a new window that receives all the elements together.
This is my code, as you can see:
We use keyBy() based on a Tuple2<String, Integer> based on fields of the object Query2IntermediateOutcome. The String in the tuple is the code A01,...,A10 which I had mentioned before.
The code window(timeIntervalConstructor.newInstance()) basically creates a weekly window.
We call reduce() so for each key we have an aggregated value.
Now we use another keyBy(), this time the key is basically computed looking at the number of the code A01,...,A10: if it's greater than 5 we have a sea type, if it's less or equal we have another.
Again, window(timeIntervalConstructor.newInstance()) for the second weekly window.
Finally, in the aggregate() we compute the top3 for each group.
.keyBy(new KeySelector<Query2IntermediateOutcome, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> getKey(Query2IntermediateOutcome intermediateOutcome) throws Exception {
        return new Tuple2<String, Integer>(intermediateOutcome.getCellId(), intermediateOutcome.getHourInDate());
    }
})
.window(timeIntervalConstructor.newInstance())
.reduce(new ReduceFunction<Query2IntermediateOutcome>() {
    @Override
    public Query2IntermediateOutcome reduce(Query2IntermediateOutcome t1, Query2IntermediateOutcome t2) throws Exception {
        t1.setAttendance(t1.getAttendance()+t2.getAttendance());
        return t1;
    }
})
.keyBy(new KeySelector<Query2IntermediateOutcome, String>() {
    @Override
    public String getKey(Query2IntermediateOutcome query2IntermediateOutcome) throws Exception {
        return query2IntermediateOutcome.getSeaType().toString();
    }
})
.window(timeIntervalConstructor.newInstance())
.aggregate(new Query2FinalAggregator(), new Query2Window())
This solution works, but I don't really like it: the second window receives all its data only when the first one fires, which happens weekly, so it receives everything at once and must immediately run the aggregate().

I think it would be reasonably straightforward to collapse all of this business logic into one KeyedProcessFunction. Then you could avoid the burst of activity at the end of the week.
Take a look at this tutorial in the Flink docs for an example of how to replace a keyed window with a KeyedProcessFunction.
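As a rough illustration of the logic such a KeyedProcessFunction would run once the week's totals are complete (for instance when a timer fires), here is the per-group top-3 step in plain Java with no Flink classes. CellTotal and its fields are hypothetical stand-ins for the question's reduced Query2IntermediateOutcome values; keeping the totals in Flink state and registering the timer are left out of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of "top 3 per group"; not Flink code. */
class TopThreePerGroup {

    /** Hypothetical stand-in for one cell's weekly total. */
    static final class CellTotal {
        final String cellId;
        final String seaType;
        final long attendance;

        CellTotal(String cellId, String seaType, long attendance) {
            this.cellId = cellId;
            this.seaType = seaType;
            this.attendance = attendance;
        }
    }

    /** Groups totals by sea type and keeps the three highest attendances in each group. */
    static Map<String, List<CellTotal>> topThree(List<CellTotal> totals) {
        Map<String, List<CellTotal>> byGroup = new HashMap<>();
        for (CellTotal t : totals) {
            byGroup.computeIfAbsent(t.seaType, k -> new ArrayList<>()).add(t);
        }
        Map<String, List<CellTotal>> result = new HashMap<>();
        for (Map.Entry<String, List<CellTotal>> e : byGroup.entrySet()) {
            List<CellTotal> sorted = new ArrayList<>(e.getValue());
            // Highest attendance first.
            sorted.sort(Comparator.comparingLong((CellTotal t) -> t.attendance).reversed());
            result.put(e.getKey(), new ArrayList<>(sorted.subList(0, Math.min(3, sorted.size()))));
        }
        return result;
    }
}
```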

Computing statistics over a stream for a given window

I have a ticker KStream that ticks frequently (think seconds), and I want to compute various statistics over a 24-hour window, for example the 24-hour change: the difference in price between a given point and the one 24 hours before it.
My desired output for each input is:
t1 -> t1c1
t2 -> t1c2
t3 -> t1c3
Where t1 is the input ticker, and t1c1 is the input ticker with additional statistics computed for the 24 hour window preceding it.
I've considered a few ways of doing this that haven't worked:
* Window my ticker stream by size 24 hours with 1 second hops.
builder.stream(rawPriceTickerTopic, ...)
    .groupByKey()
    .windowedBy(
        TimeWindows.of(TimeUnit.DAYS.toMillis(1))
            .advanceBy(TimeUnit.SECONDS.toMillis(1)))
    .reduce((value1, value2) ->
        value1.tickerWithStatsFrom(value2), ...)
    .toStream();
However, this generates an immense number of output points, as each input ticker generates an output ticker for each window it is a member of.
* Keep some kind of time series store up to date, get the value from 24 hours earlier from the store, and compute my statistics ticker from that. However, this seems to go against the point of streams.

My final solution here was to abandon windowing and simply aggregate over my tickers, maintaining my own 24-hour window in the aggregator. This still doesn't feel like the best way, and there's a nagging feeling that I could have solved it with Kafka's built-in windowing concepts.
As said above, I use simple aggregation with my aggregator:
streamBuilder.stream(tickerTopic, Consumed.with(...))
    .groupByKey()
    .aggregate(MyAggregator::new,
        (key, value, aggregate) -> aggregate.addTicker(value),
        Materialized.with(...))
    .toStream()
The result is that for every record in the original ticker stream, I get an aggregated value in my output stream. My aggregator's logic is simple:
Add a new ticker to the ordered collection.
Discard any tickers that are more than 24 hours older than this new latest ticker.
Compute the new 24 hour change.
(This technique could be used for any kind of calculation over a given window, for example a moving average.)
Sample code for the aggregator:
import java.math.BigDecimal;
import java.util.Date;
import java.util.Iterator;
import java.util.TreeSet;

public class MyAggregator {
    private final long windowMilis;
    private BigDecimal change;
    // Ordered by timestamp; note that a TreeSet treats two tickers with the
    // same timestamp as duplicates and keeps only the first.
    private TreeSet<Ticker> orderedTickers = new TreeSet<>(MyAggregator::tickerTimeComparator);
    public MyAggregator () {
        this.windowMilis = 86400000; // 24 hours
    }
    public MyAggregator addTicker(Ticker ticker) {
        orderedTickers.add(ticker);
        cleanOldTickers();
        change = getLatest().getAsk().subtract(getEarliest().getAsk());
        return this;
    }
    public BigDecimal getChange() {
        return change;
    }
    public Ticker getEarliest() {
        return orderedTickers.first();
    }
    public Ticker getLatest() {
        return orderedTickers.last();
    }
    private void cleanOldTickers() {
        Date endOfWindow = latestWindow();
        Iterator<Ticker> iterator = orderedTickers.iterator();
        while(iterator.hasNext()) {
            Ticker next = iterator.next();
            if (next.getTimestamp().before(endOfWindow)) {
                iterator.remove();
            } else {
                // The collection is sorted by time, so the first ticker
                // inside the window means everything after it is too.
                break;
            }
        }
    }
    private Date latestWindow() {
        return new Date(getLatest().getTimestamp().getTime() - windowMilis);
    }
    private static int tickerTimeComparator(Ticker t1, Ticker t2) {
        return t1.getTimestamp().compareTo(t2.getTimestamp());
    }
}
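The same add, evict, recompute pattern carries over to the moving average mentioned above. The following is a minimal sketch outside of Kafka Streams, assuming samples arrive in non-decreasing timestamp order; MovingAverageAggregator and Sample are hypothetical names, and inside a real aggregate() call the class would also need a Serde, which is omitted here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a time-bounded moving average; assumes in-order timestamps. */
class MovingAverageAggregator {

    private static final class Sample {
        final long timestampMillis;
        final double value;

        Sample(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();
    private double sum;

    MovingAverageAggregator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Adds a sample, then evicts everything older than the window ending at this sample. */
    MovingAverageAggregator add(long timestampMillis, double value) {
        samples.addLast(new Sample(timestampMillis, value));
        sum += value;
        long cutoff = timestampMillis - windowMillis;
        // The newest sample is never older than the cutoff, so the deque cannot empty here.
        while (samples.peekFirst().timestampMillis < cutoff) {
            sum -= samples.removeFirst().value;
        }
        return this;
    }

    /** Average of the samples currently in the window; NaN before the first sample. */
    double getAverage() {
        return sum / samples.size();
    }
}
```

Keeping a running sum avoids rescanning the window on every record, at the cost of some floating-point drift over very long runs.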

Listener for counter until variable changing

Is there a way to build a listener that detects whether data is still being written to a variable, doing one thing while it is and another when it is not?
For example:
While "int counter1" is increasing, set a boolean to true, print something, or set another int to 1.
When the counter is no longer increasing or decreasing, set the boolean to false, print something different, or set the other int to 2.
Basically: while the variable changes (up or down), do one thing; when it stops changing, do another; when it starts changing again, go back to the first thing, and so on.
Is there a way to do this
without the obvious approach of comparing values in a chain of if statements?

Handmade
The simplest way is to access the variable through getters and setters. You can put whatever logic you like into the setter and track all mutations from there.
public class Main {
    static int observable = 0;
    static void setObservable(int newValue) {
        if (observable != newValue) {
            System.out.printf("Observable int has been changed from %d to %d.%n", observable, newValue);
            observable = newValue;
        }
    }
    public static void main(String[] args) {
        observable = 1; // Direct assignment: nothing notifies us that the value has changed
        setObservable(2); // Console output: 'Observable int has been changed from 1 to 2.'
    }
}
Built-in solutions
There are plenty of other ways to implement the same functionality: create an actual Java bean with getters and setters, implement the observable and observer interfaces on your own, or use a ready-made solution such as JavaFX's IntegerProperty:
IntegerProperty intProperty = new SimpleIntegerProperty();
intProperty.addListener((observable, oldValue, newValue) -> {
    if (!oldValue.equals(newValue) ) {
        System.out.printf("Value has been changed from %d to %d!%n", oldValue.intValue(), newValue.intValue());
    }
});
intProperty.setValue(1); // Output: Value has been changed from 0 to 1!
intProperty.setValue(2); // Output: Value has been changed from 1 to 2!
intProperty.setValue(2); // No output
System.out.println(intProperty.intValue()); // Output: 2
Stopped changing
As for a "stopped changing" listener, that is a slightly more complex issue. Depending on the exact situation, there are several possible solutions I can think of:
1) If your loop is predictable and determined by you, just code the logic manually as required:
/* listening for changes up there */
System.out.println("I'll go get some coffee");
Thread.sleep(60000); // stopped changing, eh?
/* do your stuff */
/* Continue listening for changes below */
2) If your loop is unpredictable but designed by you, you can try to make it a little more predictable by designing a set of rules and protocols to follow; for example, if the new value is exactly zero, the system pauses and switches to another task.
3) You can also run a background task that periodically checks the last-updated time to determine whether the system is idle.
There are a lot of possible solutions to suggest, but I can't come up with something more specific without knowing more details.
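Option 3 could look roughly like the sketch below: the setter records when the value last actually changed, and a periodic check compares that time with the current time. IdleWatcher is a hypothetical class, and the clock is passed in as a parameter (an assumption made so the behaviour is easy to test); in an application you would pass System.currentTimeMillis().

```java
/** Sketch of a "stopped changing" detector driven by last-change timestamps. */
class IdleWatcher {
    private final long idleAfterMillis;
    private long lastChangeMillis;
    private int value;

    IdleWatcher(int initialValue, long idleAfterMillis, long nowMillis) {
        this.value = initialValue;
        this.idleAfterMillis = idleAfterMillis;
        this.lastChangeMillis = nowMillis;
    }

    /** Records a write; only an actual change of value (up or down) resets the idle clock. */
    synchronized void set(int newValue, long nowMillis) {
        if (newValue != value) {
            value = newValue;
            lastChangeMillis = nowMillis;
        }
    }

    /** True once the value has gone unchanged for at least idleAfterMillis. */
    synchronized boolean isIdle(long nowMillis) {
        return nowMillis - lastChangeMillis >= idleAfterMillis;
    }
}
```

The background task itself can be a ScheduledExecutorService calling isIdle at a fixed rate and switching between the "changing" and "stopped" behaviours whenever the result flips.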

Simple Java String cache with expiration possibility

I am looking for a concurrent Set with expiration functionality for a Java 1.5 application. It would be used as a simple way to store / cache names (i.e. String values) that expire after a certain time.
The problem I'm trying to solve is that two threads should not be able to use the same name value within a certain time (so this is sort of a blacklist ensuring the same "name", which is something like a message reference, can't be reused by another thread until a certain time period has passed). I do not control name generation myself, so there's nothing I can do about the actual names / strings to enforce uniqueness, it should rather be seen as a throttling / limiting mechanism to prevent the same name to be used more than once per second.
Example:
Thread #1 does cache.add("unique_string", 1), which stores the name "unique_string" for 1 second.
If any thread is looking for "unique_string" by doing e.g. cache.get("unique_string") within 1 second it will get a positive response (item exists), but after that the item should be expired and removed from the set.
The container would at times handle 50-100 inserts / reads per second.
I have really been looking around at different solutions but have not found anything that really suits my needs. It feels like an easy problem, but all the solutions I find are too complex or overkill.
A simple idea would be to have a ConcurrentHashMap object with key set to "name" and value to the expiration time then a thread running every second and removing all elements whose value (expiration time) has passed, but I'm not sure how efficient that would be? Is there not a simpler solution I'm missing?

Google's Guava library contains exactly such a cache: CacheBuilder.

How about creating a Map where each item expires via a scheduled executor?
// Declare your Map and executor service.
// ConcurrentHashMap because the expiry task runs on the executor thread.
final Map<String, ScheduledFuture<String>> cacheNames = new ConcurrentHashMap<String, ScheduledFuture<String>>();
ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor();
You can then have a method that adds the cache name to your collection and removes it after it has expired; in this example that is one second. I know it seems like quite a bit of code, but it can be quite an elegant solution in just a couple of methods.
ScheduledFuture<String> task = executorService.schedule(new Callable<String>() {
    @Override
    public String call() {
        cacheNames.remove("unique_string");
        return "unique_string";
    }
}, 1, TimeUnit.SECONDS);
cacheNames.put("unique_string", task);

A simple unique string pattern which doesn't repeat
private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis()*1000);
public static String generateId() {
return Long.toString(COUNTER.getAndIncrement(), 36);
}
This won't repeat even if you restart your application.
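Note: it will repeat if
you restart after generating more than one million ids per second, or
the counter overflows a long, which with the factor of 1000 happens roughly 292,000 years after 1970. Reducing the 1000 to 100 stretches that tenfold, but then a restart is only safe below 100,000 ids per second.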

It depends on whether you need strict timing or soft timing (like 1 sec +/- 20 ms),
and on whether you need discrete cache invalidation or invalidation 'by call'.
For strict conditions I would suggest adding a separate thread that invalidates the cache every 20 milliseconds.
You can also store a timestamp with the key and check on access whether it has expired.
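The last suggestion, storing an expiry timestamp with each key and checking it on access, can be sketched with a ConcurrentHashMap and its atomic putIfAbsent/replace operations, both available since Java 1.5. ExpiringNameSet is a hypothetical class, and the current time is passed in as a parameter only to keep the example deterministic; real callers would pass System.currentTimeMillis().

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Sketch of a concurrent name set with lazy, check-on-access expiry. */
class ExpiringNameSet {
    private final ConcurrentMap<String, Long> expiryByName =
            new ConcurrentHashMap<String, Long>();

    /** Reserves name until nowMillis + ttlMillis; returns false if it is still reserved. */
    boolean tryReserve(String name, long ttlMillis, long nowMillis) {
        Long newExpiry = Long.valueOf(nowMillis + ttlMillis);
        while (true) {
            Long current = expiryByName.putIfAbsent(name, newExpiry);
            if (current == null) {
                return true; // name was free
            }
            if (current.longValue() > nowMillis) {
                return false; // still reserved by someone else
            }
            // Expired entry: swap it atomically; loop again if another thread changed it first.
            if (expiryByName.replace(name, current, newExpiry)) {
                return true;
            }
        }
    }

    /** True if name is present and has not yet expired. */
    boolean contains(String name, long nowMillis) {
        Long expiry = expiryByName.get(name);
        return expiry != null && expiry.longValue() > nowMillis;
    }
}
```

Expired entries stay in the map until the same name is reserved again, so a workload with many distinct names would still need an occasional sweep to bound memory.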

Why not store the time for which the key is blacklisted in the map (as Konoplianko hinted)?
Something like this:
private final Map<String, Long> _blacklist = new LinkedHashMap<String, Long>() {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        return size() > 1000;
    }
};
public boolean isBlacklisted(String key, long timeoutMs) {
    synchronized (_blacklist) {
        long now = System.currentTimeMillis();
        Long blacklistUntil = _blacklist.get(key);
        if (blacklistUntil != null && blacklistUntil >= now) {
            // still blacklisted
            return true;
        } else {
            // not blacklisted, or blacklisting has expired
            _blacklist.put(key, now + timeoutMs);
            return false;
        }
    }
}
