rxJava buffer() with time that honours backpressure


The versions of buffer operator that don't operate on time honour backpressure as per JavaDoc:
http://reactivex.io/RxJava/2.x/javadoc/io/reactivex/Flowable.html#buffer-int-
However, any version of buffer that involves time based buffers doesn't support backpressure, like this one
http://reactivex.io/RxJava/2.x/javadoc/io/reactivex/Flowable.html#buffer-long-java.util.concurrent.TimeUnit-int-
I understand this comes from the fact that once the clock is ticking you can't stop it, much like the interval operator, which doesn't support backpressure either for the same reason.
What I want is a buffering operator that is both size and time based and fully supports backpressure by propagating the backpressure signals to BOTH the upstream AND the time ticking producer, something like this:
someFlowable()
    .buffer(
        Flowable.interval(1, SECONDS).onBackpressureDrop(),
        10
    );
So now I could drop the tick on backpressure signals.
Is this something currently achievable in rxJava2? How about Project-Reactor?

I've encountered the problem recently and here is my implementation. It can be used like this:
Flowable<List<T>> bufferedFlow = (some flowable of T)
.compose(new BufferTransformer(1, TimeUnit.MILLISECONDS, 8))
It supports backpressure by the count you've specified.
Here is the implementation: https://gist.github.com/driventokill/c49f86fb0cc182994ef423a70e793a2d

I had problems with the solution from https://stackoverflow.com/a/55136139/6719538 when I used DisposableSubscriber instances as subscribers, and as far as I can see that transformer doesn't take into account Subscription#request calls from downstream subscribers (so it could overflow them). I created my own version, which was tested in production: BufferTransformerHonorableToBackpressure.java. fang-yang, great respect for the idea.

It's been a while, but I had a look at this again and somehow it struck me that this:
public static <T> FlowableTransformer<T, List<T>> buffer(
        int n, long period, TimeUnit unit)
{
    return o ->
        o.groupBy(__ -> 1)
         .concatMapMaybe(
             gf ->
                 gf.take(n)
                   .take(period, unit)  // use the caller's unit rather than a hard-coded SECONDS
                   .toList()
                   .filter(l -> !l.isEmpty())
         );
}
is pretty much doing what I described.
That, if I am correct, is fully backpressured and will emit a buffer either once n items have been collected or once the specified time has elapsed without enough items arriving.

I had another go at it, which led to a quite overengineered solution that seems to be working (TM)
The requirements:
A buffering operator that releases a buffer after a time interval elapses, or the buffer reaches maximum size, whichever happens first
The operator has to be fully backpressured: if requests cease from downstream, the buffer operator should neither emit data nor raise an exception (as the standard Flowable.buffer(interval, TimeUnit) operator does). Nor should the operator consume its source/upstream in unbounded mode.
Do this with composing existing/implemented operators.
Why would anyone want it?:
The need for such an operator came up when I wanted to implement buffering on an infinite/long-running stream. I wanted to buffer for efficiency, but the standard Flowable.buffer(n) is not suitable here, since an "infinite" stream can emit k < n elements and then not emit anything for a long time. Those k elements are trapped in buffer(n). Adding a timeout would do the job, but in RxJava2 the buffer operator with a timeout doesn't honor backpressure, and buffering/dropping or any other built-in strategy is not good enough.
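To make the "size or time, whichever comes first" requirement concrete, here is a minimal plain-JDK sketch. It is not RxJava code and not the solution from this answer: it assumes a blocking consumer thread and a bounded BlockingQueue, so that a producer calling put() is slowed down when the consumer falls behind. The class and method names (TimedBatcher, nextBatch) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class TimedBatcher {
    /**
     * Waits for a first element, then keeps collecting until either maxSize
     * elements are gathered or the timeout (measured from the first element)
     * elapses, whichever happens first. Returns what was collected so far if
     * the calling thread is interrupted.
     */
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxSize, long timeout, TimeUnit unit) {
        List<T> batch = new ArrayList<>(maxSize);
        try {
            batch.add(queue.take());                   // the timer starts with the first element
            long deadline = System.nanoTime() + unit.toNanos(timeout);
            while (batch.size() < maxSize) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    break;                             // time is up
                }
                T next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (next == null) {
                    break;                             // nothing arrived before the deadline
                }
                batch.add(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();        // preserve the interrupt for the caller
        }
        return batch;
    }
}
```

With a queue created as new ArrayBlockingQueue<>(16), the k < n elements mentioned above are released once the timeout expires instead of staying trapped, and the queue never holds more than 16 items. The trade-off is that this blocks a thread, which the reactive operators are designed to avoid.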
The solution outline:
The solution is based on the generateAsync and partialCollect operators, both implemented in the https://github.com/akarnokd/RxJava2Extensions project. The rest is standard RxJava2.
First wrap all the values from upstream in a container class C
Then merge that stream with a stream whose source uses generateAsync. That stream uses switchMap to emit instances of C that are effectively timeout signals.
The two merged streams flow into partialCollect, which holds a reference to an "API" object used to emit items into the generateAsync upstream. This forms a feedback loop that goes from partialCollect, via the "API" object, to generateAsync, which feeds back into partialCollect. That way, on receiving the first element of a buffer, partialCollect can emit a signal that effectively starts a timeout. If the buffer doesn't fill before the timeout, an empty instance of C (one not containing any value) flows back into partialCollect, which detects it as a timeout signal and releases the aggregated buffer downstream. If the buffer is instead released because it reached its maximum size, the next item kicks off another timeout. Any timeout signal (an empty instance of C) that arrives late, that is, after the buffer has already been released for reaching its maximum size, is ignored. This is possible because it is partialCollect that instantiates and sends out the timeout signal item that may later flow back to it; checking the identity of that item distinguishes a late timeout signal from a legitimate one.
The code:
https://gist.github.com/artur-jablonski/5eb2bb470868d9eeeb3c9ee247110d4a
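The late-versus-legitimate timeout check described in the outline can be sketched on its own, without any RxJava operators. The following is only an illustration of the identity check, assuming single-threaded use; TokenBuffer and its methods are hypothetical names and do not appear in the gist.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenBuffer<T> {
    private final int maxSize;
    private List<T> buffer = new ArrayList<>();
    private Object currentToken;   // identity of the timeout armed for the buffer being filled

    TokenBuffer(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Adds an item; returns the full buffer if maxSize was reached, otherwise null. */
    List<T> onItem(T item) {
        if (buffer.isEmpty()) {
            currentToken = new Object();   // first item of a buffer: arm a fresh timeout token
        }
        buffer.add(item);
        return buffer.size() == maxSize ? release() : null;
    }

    /** The token a timer should hand back via onTimeout(), or null if no buffer is open. */
    Object currentToken() {
        return currentToken;
    }

    /** Releases the buffer only if the token belongs to the buffer currently being filled. */
    List<T> onTimeout(Object token) {
        if (token == null || token != currentToken) {
            return null;                   // late signal for a buffer that is already gone
        }
        return release();
    }

    private List<T> release() {
        List<T> out = buffer;
        buffer = new ArrayList<>();
        currentToken = null;
        return out;
    }
}
```

A timeout that fires after its buffer was already released by size carries a token that no longer matches, so onTimeout returns null and nothing is emitted twice.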

Related

How to pool Akka substreams

I am trying to maintain a long-running Akka stream that fans out into substreams (let's say to conflate/throttle/writeToDB records with a particular ID).
Because the stream should be kept alive for a long time, at some point the stream will run out of available substreams (and I'd like to clear unused memory anyway).
How can I cleanup the 'idle' substreams? (The doc points to idleTimeout and recoverWithRetries, but to me it does not seem to actually liberate a substream. Am I not using it properly? I can see that recoverWithRetries is called at the right time, but the next MAX_SUBSTREAMS + 1th key that arrives later still fails (Cannot open a new substream as there are too many substreams open))
How to handle the case that maybe, there is no substream to clean? (can I / how do I slow down the upstream?)
This post says that
groupBy removes inputs for subflows that have already been closed
This is not what I want, I need the substream to just be re-created in that case. Also I cannot find any mention of this behaviour in the doc.
In the end, what I need is to fan out a stream into a pool of substreams. If all substreams are used, slow down upstream. If a substream does not receive any new record for x seconds, emit, clear it and move it back to the pool.
Flow.of(Record.class)
    .groupBy(MAX_SUBSTREAMS, Record::getKey)
    .via(conflateThenThrottleThenCommitRecord)
    .idleTimeout(Duration.of(2, ChronoUnit.SECONDS))
    .recoverWithRetries(1, new PFBuilder()
        .matchAny(ex -> Source.empty())
        .build())
    .mergeSubstreams();

Can a queue of events be persisted when using Reactor Project or RxJava

The main benefit of reactive programming is that it is fault-tolerant and can process a lot more events than a blocking implementation, despite the fact that the processing will usually happen slower.
What I don't fully understand is how and where the events are stored. I know that there is an event buffer and that it can be tweaked, but that buffer can easily overload the memory if the queue is unbounded, can't it? Can this buffer flush onto disk? Isn't it a risk to have it in memory? Can it be configured similarly to Lagom event sourcing or persistent Akka actors, where events can be stored in a DB?

The short answer is no, this buffer cannot be persisted, at least not in the reference implementation.
The internal in-memory buffer can hold up to 128 emitted values by default, but there are some caveats. First of all, there is backpressure: the situation where the source emits items faster than the observer or operator consumes them. When this internal buffer overflows you get a MissingBackpressureException, and there is no disk storage or any other way to persist it. However, you can tweak the behavior, for instance by keeping only the latest emission or by just dropping new emissions. There are special operators for that: onBackpressureBuffer, onBackpressureDrop, onBackpressureLatest.
RxJava2 introduces a new type, Flowable, which supports backpressure by default and gives more ways to tweak the internal buffer.
Rx is a way to process data streams, and you should care about whether you can consume all the items and how to store them if you can't.
One of the main advantages of RxJava is its contract, and there are ways to create your own operators or use extensions like rxjava-extras
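As a rough illustration of what the overflow choices amount to, here is a plain-JDK sketch of a bounded in-memory buffer with three overflow behaviours. It is not how RxJava implements onBackpressureBuffer, onBackpressureDrop or onBackpressureLatest; the class, the enum and the way LATEST is modelled (overwriting the newest buffered element) are simplifying assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BoundedBuffer<T> {
    /** What to do with a new item when the buffer is already full. */
    enum Overflow { ERROR, DROP, LATEST }

    private final Deque<T> queue = new ArrayDeque<>();
    private final int capacity;
    private final Overflow overflow;

    BoundedBuffer(int capacity, Overflow overflow) {
        this.capacity = capacity;
        this.overflow = overflow;
    }

    /** Called by the producer; never blocks and never grows past capacity. */
    void offer(T item) {
        if (queue.size() < capacity) {
            queue.addLast(item);
            return;
        }
        switch (overflow) {
            case ERROR:
                throw new IllegalStateException("buffer full: consumer is too slow");
            case DROP:
                return;                      // discard the new item
            case LATEST:
                queue.removeLast();          // replace the newest buffered item
                queue.addLast(item);
                return;
        }
    }

    /** Called by the consumer; returns null when the buffer is empty. */
    T poll() {
        return queue.pollFirst();
    }
}
```

Either way the data lives only in memory: when the buffer is full, something is lost or an error is raised, which is why persisting events has to be handled outside the stream itself.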

Observable vs Flowable rxJava2

I have been looking at new rx java 2 and I'm not quite sure I understand the idea of backpressure anymore...
I'm aware that we have Observable that does not have backpressure support and Flowable that has it.
So, as an example, let's say I have a Flowable with an interval:
Flowable.interval(1, TimeUnit.MILLISECONDS, Schedulers.io())
    .observeOn(AndroidSchedulers.mainThread())
    .subscribe(new Consumer<Long>() {
        @Override
        public void accept(Long aLong) throws Exception {
            // do smth
        }
    });
This is going to crash after around 128 values, and that's pretty obvious: I am consuming items more slowly than I am getting them.
But then we have the same with Observable
Observable.interval(1, TimeUnit.MILLISECONDS, Schedulers.io())
    .observeOn(AndroidSchedulers.mainThread())
    .subscribe(new Consumer<Long>() {
        @Override
        public void accept(Long aLong) throws Exception {
            // do smth
        }
    });
This will not crash at all; even when I put some delay on consuming, it still works. To make the Flowable work, let's say I add the onBackpressureDrop operator: the crash is gone, but not all values are emitted either.
So the basic question I currently can't answer is: why should I care about backpressure when I can use a plain Observable and still receive all values without managing the buffer? Or, from the other side, what advantages does backpressure give me when it comes to managing and handling the consumption?

In practice, backpressure manifests as bounded buffers: Flowable.observeOn has a buffer of 128 elements that gets drained as fast as the downstream can take it. You can increase this buffer size individually to handle a bursty source, and all the backpressure-management practices from 1.x still apply. Observable.observeOn has an unbounded buffer that keeps collecting the elements, and your app may run out of memory.
You may use Observable for example:
handling GUI events
working with short sequences (less than 1000 elements total)
You may use Flowable for example:
cold and non-timed sources
generator like sources
network and database accessors

Backpressure is when your observable (publisher) creates more events than your subscriber can handle. You can end up with subscribers missing events, or with a huge queue of events that eventually leads to running out of memory. Flowable takes backpressure into consideration. Observable does not. That's it.
It reminds me of a funnel that overflows when too much liquid is poured into it. Flowable can help prevent that: with tremendous backpressure the funnel overflows, but with Flowable there is much less backpressure.
RxJava2 has a few backpressure strategies you can use depending on your use case. By strategy I mean the way RxJava2 handles the objects that cannot be processed because of the overflow (backpressure).
The strategies are MISSING, ERROR, BUFFER, DROP and LATEST.
I won't go through them all, but for example, if you don't want to worry about the items that overflow, you can use a drop strategy like this:
observable.toFlowable(BackpressureStrategy.DROP)
As far as I know there should be a 128-item limit on the queue; after that there can be an overflow (backpressure). Even if it's not exactly 128, it's close to that number. Hope this helps someone.
If you need to change the buffer size from 128, it looks like it can be done
like this (but watch any memory constraints):
myObservable.toFlowable(BackpressureStrategy.MISSING).onBackpressureBuffer(256); // buffer(256) would batch items into lists of 256 instead; using MISSING might be slower.
In software development, a backpressure strategy usually means you're telling the emitter to slow down a bit because the consumer cannot handle the velocity at which it is emitting events.

The fact that your Flowable crashed after emitting 128 values without backpressure handling doesn't mean it will always crash after exactly 128 values: sometimes it will crash after 10, and sometimes it will not crash at all. I believe this is what happened when you tried the example with Observable - there happened to be no backpressure, so your code worked normally, next time it may not. The difference in RxJava 2 is that there is no concept of backpressure in Observables anymore, and no way to handle it. If you're designing a reactive sequence that will probably require explicit backpressure handling - then Flowable is your best choice.

Observable to batch like Lmax Disruptor

Those who are familiar with the LMAX ring buffer (Disruptor) know that one of the biggest advantages of that data structure is that it batches incoming events. When we have a consumer that can take advantage of batching, that makes the system automatically adjust to the load: the more events you throw at it, the better.
I wonder whether we could achieve the same effect with an Observable (targeting the batching feature). I've tried out Observable.buffer, but this is very different: buffer will wait and not emit the batch until the expected number of events has arrived. What we want is quite different.
Given that the subscriber is waiting for a batch from Observable<Collection<Event>>: when a single item arrives on the stream, it emits a single-element batch, which gets processed by the subscriber. While it is processing, other elements arrive and get collected into the next batch. As soon as the subscriber finishes, it gets the next batch with as many events as have arrived since it started the last round of processing...
So, as a result, if our subscriber is fast enough to process one event at a time it will do so; if the load gets higher it will still process at the same frequency but with more events each time (thus solving the backpressure problem)... unlike buffer, which will stick and wait for the batch to fill up.
Any suggestions? Or shall I go with the ring buffer?

RxJava and Disruptor represent two different programming approaches.
I'm not experienced with Disruptor but based on video talks, it is basically a large buffer where producer emit data like a firehose and consumers spin/yield/block until data is available.
RxJava, on the other hand, aims at non-blocking event delivery. We too have ring buffers, notably in observeOn, which acts as the async boundary between producers and consumers, but these are much smaller, and we avoid buffer overflows and buffer bloat by applying the co-routines approach. Co-routines boil down to callbacks sent to your callbacks so you can call back our callbacks to send you some data at your pace. The frequency of such requests determines the pacing.
There are data sources that don't support such co-op streaming and require one of the onBackpressureXXX operators that will buffer/drop values if the downstream doesn't request fast enough.
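The request-driven pacing can be illustrated with the JDK's own java.util.concurrent.Flow interfaces, which mirror the Reactive Streams contract. This is a simplified, synchronous and single-threaded sketch, not RxJava's implementation; RequestPacing, range and run are hypothetical names, and rule checks such as signalling an error for non-positive requests are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

final class RequestPacing {
    /** A cooperative source: it emits only as many items as have been requested. */
    static Flow.Publisher<Integer> range(int start, int count) {
        return subscriber -> subscriber.onSubscribe(new Flow.Subscription() {
            int next = start;
            final int end = start + count;
            long requested;
            boolean emitting;
            boolean done;

            @Override
            public void request(long n) {
                if (n <= 0 || done) {
                    return;
                }
                requested += n;
                if (emitting) {
                    return;                 // re-entrant call from onNext: the loop below picks it up
                }
                emitting = true;
                while (requested > 0 && next < end && !done) {
                    requested--;
                    subscriber.onNext(next++);
                }
                emitting = false;
                if (next == end && !done) {
                    done = true;
                    subscriber.onComplete();
                }
            }

            @Override
            public void cancel() {
                done = true;
            }
        });
    }

    /** Consumes range(1, count), asking for `batch` more items each time a batch is used up. */
    static List<String> run(int count, int batch) {
        List<String> log = new ArrayList<>();
        range(1, count).subscribe(new Flow.Subscriber<Integer>() {
            Flow.Subscription subscription;
            int received;

            @Override
            public void onSubscribe(Flow.Subscription s) {
                subscription = s;
                log.add("request " + batch);
                s.request(batch);
            }

            @Override
            public void onNext(Integer item) {
                log.add("item " + item);
                if (++received % batch == 0) {
                    log.add("request " + batch);
                    subscription.request(batch);
                }
            }

            @Override
            public void onError(Throwable t) {
                log.add("error");
            }

            @Override
            public void onComplete() {
                log.add("complete");
            }
        });
        return log;
    }
}
```

The source never gets ahead of the consumer: nothing is emitted between one batch being used up and the next request call, so no intermediate buffer is needed at all.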
If you think you can process data in batches more efficiently than one-by-one, you can use the buffer operator which has overloads to specify time duration for the buffers: you can have, for example, 10 ms worth of data, independent of how many values arrive in this duration.
Controlling the batch size via request frequency is tricky and may have unforeseen consequences. The problem, generally, is that if you request(n) from a batching source, you indicate you can process n elements, but the source now has to create n buffers of size 1 (because the type is Observable<List<T>>). In contrast, if request is not called, the operator buffers the data, resulting in longer buffers. These behaviors introduce extra overhead in the processing if you really could keep up, and they also require turning the cold source into a firehose (because otherwise what you have is essentially buffer(1)), which itself can now lead to buffer bloat.
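For comparison, the batching behaviour asked about in the question (hand the consumer whatever has piled up while it was busy) is straightforward with a blocking queue outside RxJava. This is a minimal sketch under that assumption, not an RxJava operator; DrainBatcher and nextBatch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class DrainBatcher {
    /**
     * Waits for one element, then takes everything else that has already
     * arrived, up to maxBatch elements in total. It never waits for a batch
     * to fill up. Returns an empty list if interrupted while waiting.
     */
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatch) {
        List<T> batch = new ArrayList<>();
        try {
            batch.add(queue.take());            // block only while the queue is empty
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return batch;
        }
        queue.drainTo(batch, maxBatch - 1);     // grab what accumulated during the last round
        return batch;
    }
}
```

A consumer loop that calls nextBatch repeatedly gets single-element batches under light load and larger ones as the load grows, at the cost of blocking the consumer thread and turning the producer into exactly the kind of firehose described above.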

Why do we need Publish and RefCount Rx operators in this case?

I'm trying to familiarise myself with the problem of reactive backpressure handling, specifically by reading this wiki: https://github.com/ReactiveX/RxJava/wiki/Backpressure
In the buffer paragraph, we have this more involved example code:
// we have to multicast the original bursty Observable so we can use it
// both as our source and as the source for our buffer closing selector:
Observable<Integer> burstyMulticast = bursty.publish().refCount();
// burstyDebounced will be our buffer closing selector:
Observable<Integer> burstyDebounced = burstyMulticast.debounce(10, TimeUnit.MILLISECONDS);
// and this, finally, is the Observable of buffers we're interested in:
Observable<List<Integer>> burstyBuffered = burstyMulticast.buffer(burstyDebounced);
If I understand correctly, we're effectively debouncing the bursty source stream by generating a debounced signal stream for the buffer operator.
But why do we need to use the publish and refcount operators here? What problem would it cause if we'd just drop them? The comments don't make it much clearer for me, aren't RxJava Observables up to multicasting by default?

The answer lies in the difference between hot and cold observables.
The buffer operator combines the 2 streams and has no way to know they have a common source (in your case). When activated (subscribed to), it'll subscribe to them both, which in turn will trigger 2 distinct subscriptions to your original input.
Now 2 things can happen: either the input is a hot observable, in which case the subscription has no effect other than registering the listener and everything will work as expected, or it's a cold observable, in which case each subscription will result in potentially distinct and desynchronized streams.
For instance, a cold observable can be one that performs a network request when subscribed to and then emits the result. Not calling publish on it means 2 requests will be made.
Publish+refCount/connect is the usual way to transform a cold observable into a hot one, making sure a single subscription happens and all streams behave identically.
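The difference can be shown without RxJava. In the sketch below, Source stands in for a cold observable whose work runs once per subscription, and Shared is a hypothetical multicast hub that subscribes upstream only once when connect() is called. It is closer in spirit to publish() followed by connect() than to refCount(), and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class ColdVsShared {
    /** A cold source: every subscribe() call re-runs the producer from scratch. */
    interface Source<T> {
        void subscribe(Consumer<T> consumer);
    }

    /** Collects consumers and subscribes to the upstream a single time on connect(). */
    static final class Shared<T> implements Source<T> {
        private final Source<T> upstream;
        private final List<Consumer<T>> consumers = new ArrayList<>();

        Shared(Source<T> upstream) {
            this.upstream = upstream;
        }

        @Override
        public void subscribe(Consumer<T> consumer) {
            consumers.add(consumer);            // only registers the listener
        }

        void connect() {
            upstream.subscribe(value -> consumers.forEach(c -> c.accept(value)));
        }
    }

    /** Returns how many "network requests" two subscribers cause, with or without sharing. */
    static int requestsForTwoSubscribers(boolean share) {
        AtomicInteger requests = new AtomicInteger();
        // Stand-in for a cold observable that performs a request on each subscription.
        Source<Integer> cold = consumer -> consumer.accept(requests.incrementAndGet());
        List<Integer> seen = new ArrayList<>();
        if (share) {
            Shared<Integer> hub = new Shared<>(cold);
            hub.subscribe(seen::add);
            hub.subscribe(seen::add);
            hub.connect();
        } else {
            cold.subscribe(seen::add);
            cold.subscribe(seen::add);
        }
        return requests.get();
    }
}
```

Subscribing twice to the cold source triggers the work twice, while going through the hub triggers it once and delivers the same value to both consumers, which is what buffer and its closing selector need in order to stay in sync.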
