Computing statistics over a stream for a given window - java

I have a ticker KStream that ticks frequently (think seconds), and I want to compute various statistics over a 24-hour window. For example, the 24-hour change: the difference in price between a given point and the one 24 hours before it.
My desired output for my input is:
t1 -> t1c1
t2 -> t2c2
t3 -> t3c3
Where t1 is an input ticker, and t1c1 is that ticker with additional statistics computed over the 24-hour window preceding it.
I've considered a few ways of doing this that haven't worked:
* Window my ticker stream with a 24-hour size and 1-second hops:
builder.stream(rawPriceTickerTopic, ...)
    .groupByKey()
    .windowedBy(TimeWindows.of(TimeUnit.DAYS.toMillis(1))
        .advanceBy(TimeUnit.SECONDS.toMillis(1)))
    .reduce((value1, value2) ->
        value1.tickerWithStatsFrom(value2), ...)
    .toStream();
However, this generates an immense number of output points, as each input ticker generates an output ticker for each window it is a member of.
* Keep some kind of time series store up to date, get the value from 24 hours earlier out of the store, and compute my statistics ticker from that; however, this seems to go against the point of streams (a rough sketch of this idea follows below).
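For reference, a rough sketch of that second idea (not the original poster's code, and exact store APIs vary between Kafka Streams versions): a ValueTransformerWithKey backed by a window store, so each ticker can be compared with the earliest ticker in the trailing 24 hours. It assumes a window store named "ticker-history" has already been registered on the builder, and that a tickerSerde plus the Ticker accessors used elsewhere in this question exist.
builder.stream(rawPriceTickerTopic, Consumed.with(Serdes.String(), tickerSerde))
    .transformValues(() -> new ValueTransformerWithKey<String, Ticker, Ticker>() {
        private WindowStore<String, Ticker> store;

        @Override
        public void init(ProcessorContext context) {
            store = (WindowStore<String, Ticker>) context.getStateStore("ticker-history");
        }

        @Override
        public Ticker transform(String key, Ticker ticker) {
            long now = ticker.getTimestamp().getTime();
            store.put(key, ticker, now);
            long dayAgo = now - TimeUnit.DAYS.toMillis(1);
            // earliest ticker within the trailing 24 hours; falls back to the
            // current ticker while less than 24 hours of history exists
            try (WindowStoreIterator<Ticker> it = store.fetch(key, dayAgo, now)) {
                Ticker earliest = it.hasNext() ? it.next().value : ticker;
                return ticker.tickerWithStatsFrom(earliest);
            }
        }

        @Override
        public void close() { }
    }, "ticker-history");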

My final solution here was to abandon windowing and simply aggregate over my tickers, maintaining my own 24-hour window in the aggregator. This still doesn't feel like the best way, and there's a nagging feeling that I could have solved it with Kafka's built-in windowing concepts.
As said above, I use simple aggregation with my aggregator:
streamBuilder.stream(tickerTopic, Consumed.with(...))
    .groupByKey()
    .aggregate(MyAggregator::new,
        (key, value, aggregate) -> aggregate.addTicker(value),
        Materialized.with(...))
    .toStream();
The result is that for every record in the original ticker stream, I get an aggregated value in my output stream. My aggregator's logic is simple:
Add a new ticker to the ordered collection.
Discard any tickers that are more than 24 hours older than this new latest ticker.
Compute the new 24 hour change.
(This technique could be used for any kind of calculation over a given window, for example a moving average; see the sketch after the aggregator code below.)
Sample code for the aggregator:
public class MyAggregator {

    private final long windowMilis;
    private BigDecimal change;
    private TreeSet<Ticker> orderedTickers = new TreeSet<>(MyAggregator::tickerTimeComparator);

    public MyAggregator() {
        this.windowMilis = 86400000; // 24 hours
    }

    public MyAggregator addTicker(Ticker ticker) {
        orderedTickers.add(ticker);
        cleanOldTickers();
        change = getLatest().getAsk().subtract(getEarliest().getAsk());
        return this;
    }

    public BigDecimal getChange() {
        return change;
    }

    public Ticker getEarliest() {
        return orderedTickers.first();
    }

    public Ticker getLatest() {
        return orderedTickers.last();
    }

    private void cleanOldTickers() {
        Date endOfWindow = latestWindow();
        Iterator<Ticker> iterator = orderedTickers.iterator();
        while (iterator.hasNext()) {
            Ticker next = iterator.next();
            if (next.getTimestamp().before(endOfWindow)) {
                iterator.remove();
            } else {
                // The collection is sorted by time, so once a ticker is inside
                // the window, all the following ones are too.
                break;
            }
        }
    }

    private Date latestWindow() {
        return new Date(getLatest().getTimestamp().getTime() - windowMilis);
    }

    private static int tickerTimeComparator(Ticker t1, Ticker t2) {
        return t1.getTimestamp().compareTo(t2.getTimestamp());
    }
}
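As a hedged illustration of the moving-average remark above (this method is not part of the original post), the same aggregator could expose a 24-hour moving average of the ask price, since the window contents are already held in orderedTickers; the scale and rounding mode are arbitrary choices for the sketch:
// Hypothetical addition to MyAggregator; requires java.math.RoundingMode.
public BigDecimal getMovingAverageAsk() {
    BigDecimal sum = BigDecimal.ZERO;
    for (Ticker ticker : orderedTickers) {
        sum = sum.add(ticker.getAsk());
    }
    return sum.divide(BigDecimal.valueOf(orderedTickers.size()), 8, RoundingMode.HALF_UP);
}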

Related

Flink how to compute over aggregated output of a keyed window

Is it possible in Flink to compute over aggregated output of a keyed window?
We have a DataStream, and we call keyBy() specifying a field that is composed of a letter and a number (for example A01, A02, ... A10, B01, B02, ... B10, etc.), like the squares of a chessboard.
After the keyBy() we call window(TumblingEventTimeWindows.of(Time.days(7))), so we create a weekly window.
After this, we call reduce(), and as a result we obtain a SingleOutputStreamOperator<Result>.
Now, we want to group the SingleOutputStreamOperator<Result> based on a field of each Result object and iterate over each group to extract the top 3 based on a field in the Result objects of that group. Is it possible to do this without creating another weekly window and having to perform an aggregation function on it?
Obviously this works, however I don't like the thought of having this second weekly window after another weekly window. I would like to be able to merge all the SingleOutputStreamOperator<Result> of the first window and execute a function on them without having to use a new window that receives all the elements together.
This is my code, as you can see:
We use keyBy() with a Tuple2<String, Integer> built from fields of the Query2IntermediateOutcome object. The String in the tuple is the code A01, ..., A10 which I mentioned before.
The code window(timeIntervalConstructor.newInstance()) basically creates a weekly window.
We call reduce() so for each key we have an aggregated value.
Now we use another keyBy(); this time the key is computed from the number in the code A01, ..., A10: if it's greater than 5 we have one sea type, if it's less than or equal we have another.
Again, window(timeIntervalConstructor.newInstance()) for the second weekly window.
Finally, in the aggregate() we compute the top3 for each group.
.keyBy(new KeySelector<Query2IntermediateOutcome, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> getKey(Query2IntermediateOutcome intermediateOutcome) throws Exception {
        return new Tuple2<String, Integer>(intermediateOutcome.getCellId(), intermediateOutcome.getHourInDate());
    }
})
.window(timeIntervalConstructor.newInstance())
.reduce(new ReduceFunction<Query2IntermediateOutcome>() {
    @Override
    public Query2IntermediateOutcome reduce(Query2IntermediateOutcome t1, Query2IntermediateOutcome t2) throws Exception {
        t1.setAttendance(t1.getAttendance() + t2.getAttendance());
        return t1;
    }
})
.keyBy(new KeySelector<Query2IntermediateOutcome, String>() {
    @Override
    public String getKey(Query2IntermediateOutcome query2IntermediateOutcome) throws Exception {
        return query2IntermediateOutcome.getSeaType().toString();
    }
})
.window(timeIntervalConstructor.newInstance())
.aggregate(new Query2FinalAggregator(), new Query2Window())
This solution works, but I don't really like it because the second window receives all the data when the previous one fires, and since that happens weekly, the second window receives all the data at once and must immediately run the aggregate().
I think it would be reasonably straightforward to collapse all of this business logic into one KeyedProcessFunction. Then you could avoid the burst of activity at the end of the week.
Take a look at this tutorial in the Flink docs for an example of how to replace a keyed window with a KeyedProcessFunction.
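For illustration, here is a rough, untested sketch of that idea (not the poster's code, with types simplified and attendance treated as Double): key the reduced stream by sea type, accumulate attendance per cell in keyed state as elements arrive, and let an event-time timer emit the weekly top 3, so no second window is needed. endOfWeek() and toTop3Result() are placeholders for logic the poster already has (week boundaries and the Query2FinalAggregator computation).
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class WeeklyTop3 extends KeyedProcessFunction<String, Query2IntermediateOutcome, Result> {

    // running attendance total per cell, scoped to the current key (sea type)
    private transient MapState<String, Double> attendanceByCell;

    @Override
    public void open(Configuration parameters) {
        attendanceByCell = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("attendanceByCell", String.class, Double.class));
    }

    @Override
    public void processElement(Query2IntermediateOutcome value, Context ctx, Collector<Result> out) throws Exception {
        // accumulate as elements arrive instead of waiting for a window to fire
        Double current = attendanceByCell.get(value.getCellId());
        attendanceByCell.put(value.getCellId(),
                (current == null ? 0.0 : current) + value.getAttendance());
        // timers with the same timestamp are deduplicated, so this is cheap
        ctx.timerService().registerEventTimeTimer(endOfWeek(ctx.timestamp()));
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) throws Exception {
        // emit the top 3 cells for this sea type for the week that just ended
        out.collect(toTop3Result(ctx.getCurrentKey(), attendanceByCell.entries(), timestamp));
        attendanceByCell.clear();
    }

    // placeholders for logic the original job already has
    private long endOfWeek(long timestamp) {
        throw new UnsupportedOperationException("week boundary logic");
    }

    private Result toTop3Result(String seaType, Iterable<java.util.Map.Entry<String, Double>> totals, long ts) {
        throw new UnsupportedOperationException("top-3 logic from Query2FinalAggregator");
    }
}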

RxJava - Cache Observable Updates and Emit Largest Values

I currently have an Observable<ProductIDUpdate> emitting an object that represents an update of a product ID. The update indicates either that the ID is a new ADDITION or that it has expired and requires DELETION.
public class ProductIDUpdate {
    enum UpdateType {
        ADDITION, DELETION;
    }

    private int id;
    private UpdateType type;

    public ProductIDUpdate(int id) {
        this(id, UpdateType.ADDITION);
    }

    public ProductIDUpdate(int id, UpdateType type) {
        this.id = id;
        this.type = type;
    }

    // accessors used by the answer below
    public int getId() {
        return id;
    }

    public UpdateType getType() {
        return type;
    }
}
I want to track the update with the largest ID value, hence I want to modify the stream so that the current highest ID is emitted. How would I cache the update items in the stream such that if the current highest ID is deleted, the next highest available ID is emitted?
I don't know anything about Rx, but here's my understanding:
you have a bunch of product ids. It's not clear to me whether you receive them over time as part of some messages being sent to your class or if you know all the ids from the beginning
you want to create a stream on top of your source of product ids that emits the highest available id at any point in time
If my understanding is correct, how about using a PriorityQueue? You cache ids in the queue with a reverse comparator (it keeps the smallest element at the top of the heap by default) and when you want to emit a new value you just pop the top value.
Can something like that meet your requirements?
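As a minimal sketch of that PriorityQueue idea (not tested; written in the same RxJava 1.x style as the example below, and mutating the accumulator inside scan() for brevity): the queue keeps the largest ID at its head, a DELETION removes the matching entry, and the stream emits the highest ID still present after each update.
Observable<Integer> highestId = products
        .scan(new PriorityQueue<Integer>(11, Collections.reverseOrder()),
                (queue, update) -> {
                    if (update.getType() == ADDITION) {
                        queue.add(update.getId());
                    } else {
                        queue.remove(update.getId()); // removes one matching entry, if present
                    }
                    return queue;
                })
        .filter(queue -> !queue.isEmpty())
        .map(PriorityQueue::peek)
        .distinctUntilChanged();
In production code you would probably copy the queue in each step rather than mutate it, so that downstream subscribers never observe half-updated state.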
public static void main(String[] args) {
    Observable<ProductIDUpdate> products =
            Observable.just(new ProductIDUpdate(1, ADDITION),
                    new ProductIDUpdate(4, ADDITION),
                    new ProductIDUpdate(2, ADDITION),
                    new ProductIDUpdate(5, ADDITION),
                    new ProductIDUpdate(1, DELETION),
                    new ProductIDUpdate(5, DELETION),
                    new ProductIDUpdate(3, ADDITION),
                    new ProductIDUpdate(6, ADDITION));

    products.distinctUntilChanged((prev, current) -> prev.getId() > current.getId())
            .filter(p -> p.getType().equals(ADDITION))
            .subscribe(System.out::println,
                    Throwable::printStackTrace);

    Observable.timer(1, MINUTES) // just for blocking the main thread
            .toBlocking()
            .subscribe();
}
This prints:
ProductIDUpdate{id=1, type=ADDITION}
ProductIDUpdate{id=4, type=ADDITION}
ProductIDUpdate{id=5, type=ADDITION}
ProductIDUpdate{id=6, type=ADDITION}
If you remove the filter(), this prints:
ProductIDUpdate{id=1, type=ADDITION}
ProductIDUpdate{id=4, type=ADDITION}
ProductIDUpdate{id=5, type=ADDITION}
ProductIDUpdate{id=5, type=DELETION}
ProductIDUpdate{id=6, type=ADDITION}

Getting previous window data in DataFlow

While building an alerting mechanism, I am trying to detect a drop in the average between two windows.
I was happy to find the TrafficRoutes example, specifically when I saw that it says:
A 'slowdown' occurs if a supermajority of speeds in a sliding window
are less than the reading of the previous window.
I looked in the code, but failed to understand why this means we get the previous value from the previous window. Since I had no experience with sliding windows until now, I thought I might be missing something.
As I suspected, implementing this kind of mechanism, with or without sliding windows, does not get data from previous windows.
Any idea what I am missing?
Is there a way to get values from the previous window?
I am executing on GCP Dataflow, with SDK 1.9.0.
Please advise,
Shushu
My assumptions:
Your alerting system has data partitioned into "metrics" identified by "metric ids".
The value of a metric at a given time is Double.
You are receiving the metric data as a PCollection<KV<String, Double>> where the String is the metric id, the Double is the metric value, and each element has the appropriate implicit timestamp (if it doesn't, you can assign one using the WithTimestamps transform).
You want to compute sliding averages of each metric over 5-minute intervals starting every 1 minute, and want to do something in case the average for the interval starting at T+1min is smaller than the average for the interval starting at T.
You can accomplish it like this:
PCollection<KV<String, Double>> metricValues = ...;

// Collection of (metric, timestamped 5-minute average)
// windowed into the same 5-minute windows as the input,
// where the timestamp is assigned as the beginning of the window.
PCollection<KV<String, TimestampedValue<Double>>>
    metricSlidingAverages = metricValues
        .apply(Window.<KV<String, Double>>into(
            SlidingWindows.of(Duration.standardMinutes(5))
                .every(Duration.standardMinutes(1))))
        .apply(Mean.<String, Double>perKey())
        .apply(ParDo.of(new ReifyWindowFn()));

// Rewindow the previous collection into the global window so we can
// do cross-window comparisons.
// For each metric, an unsorted list of (timestamp, average) pairs.
PCollection<KV<String, Iterable<TimestampedValue<Double>>>>
    metricAverageSequences = metricSlidingAverages
        .apply(Window.<KV<String, TimestampedValue<Double>>>into(
            new GlobalWindows()))
        // We need to group the data by key again since the grouping key
        // has changed (remember, GBK implicitly groups by key and window)
        .apply(GroupByKey.<String, TimestampedValue<Double>>create());

metricAverageSequences.apply(ParDo.of(new DetectAnomaliesFn()));
...
class ReifyWindowFn extends DoFn<
        KV<String, Double>, KV<String, TimestampedValue<Double>>> {
    @ProcessElement
    public void process(ProcessContext c, BoundedWindow w) {
        // This DoFn makes the implicit window of the element explicit
        // and extracts the starting timestamp of the window.
        c.output(KV.of(
            c.element().getKey(),
            TimestampedValue.of(c.element().getValue(), w.minTimestamp())));
    }
}
class DetectAnomaliesFn extends DoFn<
        KV<String, Iterable<TimestampedValue<Double>>>, Void> {
    @ProcessElement
    public void process(ProcessContext c) {
        String metricId = c.element().getKey();

        // Sort the (timestamp, average) pairs by timestamp.
        List<TimestampedValue<Double>> averages = Ordering.natural()
            .onResultOf(TimestampedValue::getTimestamp)
            .sortedCopy(c.element().getValue());

        // Scan for anomalies.
        for (int i = 1; i < averages.size(); ++i) {
            if (averages.get(i).getValue() < averages.get(i - 1).getValue()) {
                // Detected anomaly! Could do something with it,
                // e.g. publish to a third-party system or emit into
                // a PCollection.
            }
        }
    }
}
Note that I did not test this code, but it should provide enough conceptual guidance for you to accomplish the task.

Accumulating streams in java

Recently I've been trying to reimplement my data parser using streams in Java, but I can't figure out how to do one specific thing:
Consider object A with a timestamp.
Consider object B, which is made up of various A objects.
Consider some metric which tells us the time range for an object B.
What I have now is a stateful method which goes through the list of A objects; if an A fits into the last B object it goes there, otherwise the method creates a new B instance and starts putting A objects there.
I would like to do this the streams way:
Take the whole list of A objects and turn it into a stream. Now I need to figure out a function which will create "chunks" and accumulate them into B objects. How do I do that?
Thanks
EDIT:
A and B are complex, but I will try to post here some simplified version.
class A {
    private final long time;

    A(long time) {
        this.time = time;
    }

    long getTime() {
        return time;
    }
}

class B {
    // not important, built from a "full" TemporaryB instance
    // result of the accumulation
    B(TemporaryB temporaryB) {
        // copy whatever is needed from temporaryB
    }
}

class TemporaryB {
    private static final long THRESHOLD = 1000; // window length; the answer below assumes 1000

    private final long startingTime;
    private int counter;

    public TemporaryB(A a) {
        this.startingTime = a.getTime();
    }

    boolean fits(A a) {
        return a.getTime() - startingTime < THRESHOLD;
    }

    void add(A a) {
        counter++;
    }
}

class Accumulator {
    private List<B> accumulatedB = new ArrayList<>();
    private TemporaryB temporaryBParameters;

    public void addA(A a) {
        if (temporaryBParameters.fits(a)) {
            temporaryBParameters.add(a);
        } else {
            accumulatedB.add(new B(temporaryBParameters));
            temporaryBParameters = new TemporaryB(a);
        }
    }
}
OK, so this is a very simplified version of how I do it now. I don't like it; it's ugly.
In general such a problem is a bad fit for the Stream API, as you may need non-local knowledge, which makes parallel processing harder. Imagine that you have new A(1), new A(2), new A(3) and so on up to new A(1000) with the threshold set to 10. So you basically need to combine the input into batches of 10 elements. Here we have the same problem as discussed in this answer: when we split the task into subtasks, the suffix part may not know exactly how many elements are in the prefix part, so it cannot even start combining data into batches until the whole prefix is processed. Your problem is essentially serial.
On the other hand, there's a solution provided by the new headTail method in my StreamEx library. This method parallelizes badly, but having it you can define almost any operation in just a few lines.
Here's how to solve your problem with headTail:
static StreamEx<TemporaryB> combine(StreamEx<A> input, TemporaryB tb) {
    return input.headTail((head, tail) ->
            tb == null ? combine(tail, new TemporaryB(head)) :
            tb.fits(head) ? combine(tail, tb.add(head)) :
            combine(tail, new TemporaryB(head)).prepend(tb),
        () -> StreamEx.ofNullable(tb));
}
Here I modified your TemporaryB method this way:
TemporaryB add(A a) {
    counter++;
    return this;
}
Sample (assuming Threshold = 1000):
List<A> input = Arrays.asList(new A(1), new A(10), new A(1000), new A(1001),
        new A(1002), new A(1003), new A(2000), new A(2002), new A(2003), new A(2004));
Stream<B> streamOfB = combine(StreamEx.of(input), null).map(B::new);
streamOfB.forEach(System.out::println);
Output (I wrote simple B.toString()):
B [counter=2, startingTime=1]
B [counter=3, startingTime=1001]
B [counter=2, startingTime=2002]
So here you actually have a lazy Stream of B.
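For reference, a B that would produce this output could look roughly like the following; the question leaves B unspecified, so the two accessors on TemporaryB are assumptions:
class B {
    private final int counter;
    private final long startingTime;

    B(TemporaryB temporaryB) {
        this.counter = temporaryB.getCounter();           // assumed accessor
        this.startingTime = temporaryB.getStartingTime(); // assumed accessor
    }

    @Override
    public String toString() {
        return "B [counter=" + counter + ", startingTime=" + startingTime + "]";
    }
}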
Explanation:
The StreamEx.headTail parameters are two lambdas. The first is called at most once, when the input stream is non-empty. It receives the first stream element (head) and a stream containing all the other elements (tail). The second is called at most once, when the input stream is empty, and receives no parameters. Both should produce an output stream which is used instead. So what we have here:
return input.headTail((head, tail) ->
tb == null is the starting case, create new TemporaryB from the head and call self with the tail:
tb == null ? combine(tail, new TemporaryB(head)) :
tb.fits(head) ? Ok, just add the head into existing tb and call self with the tail:
tb.fits(head) ? combine(tail, tb.add(head)) :
Otherwise again create new TemporaryB(head), but also prepend the output stream with the current tb (actually emitting a new element into target stream):
combine(tail, new TemporaryB(head)).prepend(tb),
Input stream is exhausted? Ok, return the last gathered tb if any:
() -> StreamEx.ofNullable(tb));
Note that the headTail implementation guarantees that such a solution, while looking recursive, consumes no more than a constant amount of stack and heap. You can check it on thousands of input elements if you have doubts:
Stream<B> streamOfB = combine(LongStreamEx.range(100000).mapToObj(A::new), null).map(B::new);
streamOfB.forEach(System.out::println);

Simple Java String cache with expiration possibility

I am looking for a concurrent Set with expiration functionality for a Java 1.5 application. It would be used as a simple way to store / cache names (i.e. String values) that expire after a certain time.
The problem I'm trying to solve is that two threads should not be able to use the same name value within a certain time (so this is sort of a blacklist ensuring the same "name", which is something like a message reference, can't be reused by another thread until a certain time period has passed). I do not control name generation myself, so there's nothing I can do about the actual names / strings to enforce uniqueness; it should rather be seen as a throttling / limiting mechanism to prevent the same name from being used more than once per second.
Example:
Thread #1 does cache.add("unique_string", 1) which stores the name "unique_string" for 1 second.
If any thread is looking for "unique_string" by doing e.g. cache.get("unique_string") within 1 second it will get a positive response (item exists), but after that the item should be expired and removed from the set.
The container would at times handle 50-100 inserts / reads per second.
I have really been looking around at different solutions but am not finding anything that I feel really suits my needs. It feels like an easy problem, but all the solutions I find are too complex or overkill.
A simple idea would be to have a ConcurrentHashMap with the key set to the name and the value to the expiration time, plus a thread running every second and removing all elements whose value (expiration time) has passed, but I'm not sure how efficient that would be. Is there not a simpler solution I'm missing?
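For illustration only (this is not from the original post), a rough sketch of that idea, kept to Java 5 syntax since that is the stated target runtime; class and method names are made up:
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExpiringNameSet {

    // name -> absolute expiry time in millis
    private final ConcurrentMap<String, Long> names = new ConcurrentHashMap<String, Long>();
    private final ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();

    public ExpiringNameSet() {
        // sweep out expired entries once per second
        sweeper.scheduleAtFixedRate(new Runnable() {
            public void run() {
                long now = System.currentTimeMillis();
                for (Iterator<Long> it = names.values().iterator(); it.hasNext();) {
                    if (it.next() <= now) {
                        it.remove();
                    }
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    public void add(String name, int seconds) {
        names.put(name, System.currentTimeMillis() + seconds * 1000L);
    }

    // true while the name is still blacklisted (also covers entries the
    // sweeper has not removed yet)
    public boolean contains(String name) {
        Long expiresAt = names.get(name);
        return expiresAt != null && expiresAt > System.currentTimeMillis();
    }
}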
Google's Guava library contains exactly such a cache: CacheBuilder.
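For example, a minimal sketch of the Guava approach (assuming a Guava version that provides CacheBuilder is available on the project's JDK), where entries simply expire a fixed time after being written:
Cache<String, Boolean> recentNames = CacheBuilder.newBuilder()
        .expireAfterWrite(1, TimeUnit.SECONDS)
        .build();

recentNames.put("unique_string", Boolean.TRUE);

// within one second this returns TRUE; afterwards it returns null
Boolean stillBlocked = recentNames.getIfPresent("unique_string");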
How about creating a Map where each item expires, using a scheduled executor?
//Declare your Map and executor service
final Map<String, ScheduledFuture<String>> cacheNames = new HashMap<String, ScheduledFuture<String>>();
ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor();
You can then have a method that adds the cache name to your collection and removes it after it has expired; in this example it's one second. I know it seems like quite a bit of code, but it can be an elegant solution in just a couple of methods.
ScheduledFuture<String> task = executorService.schedule(new Callable<String>() {
    @Override
    public String call() {
        cacheNames.remove("unique_string");
        return "unique_string";
    }
}, 1, TimeUnit.SECONDS);

cacheNames.put("unique_string", task);
A simple unique string pattern which doesn't repeat
private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis() * 1000);

public static String generateId() {
    return Long.toString(COUNTER.getAndIncrement(), 36);
}
This won't repeat even if you restart your application.
Note: it will repeat if:
* you restart and you have been generating over one million ids per second, or
* 293 years have passed. If this is not long enough you can reduce the 1000 to 100 and get 2930 years.
It depends on whether you need a strict time condition or a soft one (like 1 sec +/- 20 ms), and also whether you need discrete cache invalidation or invalidation "by call".
For strict conditions I would suggest adding a distinct thread which invalidates the cache every 20 milliseconds.
Alternatively, you can store a timestamp with each key and check whether it has expired when you look it up.
Why not store the time for which the key is blacklisted in the map (as Konoplianko hinted)?
Something like this:
private final Map<String, Long> _blacklist = new LinkedHashMap<String, Long>() {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        return size() > 1000;
    }
};

public boolean isBlacklisted(String key, long timeoutMs) {
    synchronized (_blacklist) {
        long now = System.currentTimeMillis();
        Long blacklistUntil = _blacklist.get(key);
        if (blacklistUntil != null && blacklistUntil >= now) {
            // still blacklisted
            return true;
        } else {
            // not blacklisted, or blacklisting has expired
            _blacklist.put(key, now + timeoutMs);
            return false;
        }
    }
}
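A hypothetical usage of this helper, with the name and timeout taken from the original example:
// first caller: returns false and blacklists "unique_string" for the next second
if (!isBlacklisted("unique_string", 1000)) {
    // safe to use "unique_string"
}

// any other thread asking within that second gets true and should back off
boolean taken = isBlacklisted("unique_string", 1000);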
