Let's say I have:
list.stream()
.map(someService::someRateLimitedApiCall) //does not implement Runnable
.filter(Optional::isPresent)
.map(Optional::get)
.sleep(1000) //is something like this possible?
.min...;
The API service only allows a limited number of transactions per second, so I want to introduce a delay between calls.
If not, is there a way to add an executor with a fixed delay within the iteration of the stream?
(To be clear, I am not violating the terms of the external API and will not abuse the service.)
Rather than use peek, why not just put the delay in the map operation which calls the API?
.map(e -> {
try {
Thread.sleep(1000);
} catch (InterruptedException ex) {
Thread.currentThread().interrupt(); // restore the interrupt flag before giving up
return Optional.empty();
}
return someService.someRateLimitedApiCall(e);
})
The simple solution (without parallel streams) was to use peek, as multiple commenters suggested. It takes a Consumer:
.peek(i -> {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
})
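As a self-contained sketch of the same idea, the wrapper below spaces calls at least a given interval apart instead of always sleeping a fixed time. The names Throttle, throttled and fakeApiCall are hypothetical stand-ins (fakeApiCall plays the role of someRateLimitedApiCall), and the sketch assumes a sequential stream.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Throttle {
    /** Wraps fn so that consecutive calls start at least minIntervalMs apart. */
    public static <T, R> Function<T, R> throttled(Function<T, R> fn, long minIntervalMs) {
        return new Function<T, R>() {
            // Pretend the previous call happened one full interval ago, so the first call runs at once.
            private long lastStart = System.nanoTime() - minIntervalMs * 1_000_000L;

            @Override
            public synchronized R apply(T input) {
                long waitMs = minIntervalMs - (System.nanoTime() - lastStart) / 1_000_000L;
                if (waitMs > 0) {
                    try {
                        Thread.sleep(waitMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
                    }
                }
                lastStart = System.nanoTime();
                return fn.apply(input);
            }
        };
    }

    /** Hypothetical stand-in for the rate-limited API call: only even inputs yield a value. */
    public static Optional<Integer> fakeApiCall(Integer x) {
        return x % 2 == 0 ? Optional.of(x * 10) : Optional.empty();
    }

    public static void main(String[] args) {
        Function<Integer, Optional<Integer>> call = throttled(Throttle::fakeApiCall, 1000);
        List<Integer> values = List.of(1, 2, 3, 4).stream()
                .map(call)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
        System.out.println(values);
    }
}
```

Measuring the time since the previous call means that time spent inside the API call itself counts towards the interval, which a fixed Thread.sleep(1000) does not do.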
I need to copy data from one source (in parallel) to another in batches.
I did this:
Flux.generate((SynchronousSink<String> sink) -> {
try {
String val = dataSource.getNextItem();
if (val == null) {
sink.complete();
return;
}
sink.next(val);
} catch (InterruptedException e) {
sink.error(e);
}
})
.parallel(4)
.runOn(Schedulers.parallel())
.doOnNext(dataTarget::write)
.sequential()
.blockLast();
class dataSource{
public Item getNextItem(){
//...
}
}
class dataTarget{
public void write(List<Item> items){
//...
}
}
It receives data in parallel, but writes one item at a time.
I need to collect the items into batches (say, 10 items each) and then write each batch.
How can I do that?
UPDATE:
The main idea is that the source is a messaging system (e.g. RabbitMQ or NATS) that is efficient at sending messages one by one, while the target is a database, which is more efficient at inserting a batch.
So the final result should be: I receive messages in parallel until the buffer fills up, then I write the whole buffer to the database in one shot.
It's easy to do in regular Java, but with streams I don't see how to do it: how to buffer the data, and how to pause the reader until the writer is ready for the next part.
All you need is the Flux#buffer(int maxSize) operator:
Flux.generate((SynchronousSink<String> sink) -> {
try {
String val = dataSource.getNextItem();
if (val == null) {
sink.complete();
return;
}
sink.next(val);
} catch (InterruptedException e) {
sink.error(e);
}
})
.buffer(10) //Flux<List<String>>
.flatMap(dataTarget::write)
.blockLast();
class DataTarget{
public Mono<Void> write(List<String> items){
return reactiveDbClient.insert(items);
}
}
Here, buffer collects the items into Lists of 10 items each (batches). You do not need to use a parallel scheduler. flatMap will run these operations asynchronously. See Understanding Reactive’s .flatMap() Operator.
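To make the batching semantics concrete without Reactor, here is a plain-Java sketch of what buffering by size does to a sequence of items. Batches and partition are hypothetical names, and this only illustrates the grouping, not how Reactor implements it.

```java
import java.util.ArrayList;
import java.util.List;

public class Batches {
    /** Splits the items into consecutive lists of at most maxSize elements. */
    public static <T> List<List<T>> partition(Iterable<T> items, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(maxSize);
        for (T item : items) {
            current.add(item);
            if (current.size() == maxSize) { // batch is full: hand it over and start a new one
                batches.add(current);
                current = new ArrayList<>(maxSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // the last batch may be smaller than maxSize
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            items.add(i);
        }
        for (List<Integer> batch : partition(items, 10)) {
            System.out.println("writing batch of " + batch.size());
        }
    }
}
```

With 25 items and a batch size of 10, the writer would be called three times, the last time with the 5 leftover items.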
You need to do your heavy work in individual Publishers, which are materialized in parallel inside flatMap(). Like this:
Flux.generate((SynchronousSink<String> sink) -> {
try {
String val = dataSource.getNextItem();
if (val == null) {
sink.complete();
return;
}
sink.next(val);
} catch (InterruptedException e) {
sink.error(e);
}
})
.parallel(4)
.runOn(Schedulers.parallel())
.flatMap(item -> Mono.fromCallable(() -> dataTarget.write(item)))
.sequential()
.blockLast();
The best approach (from an algorithmic point of view) is to have a ring buffer and use the microbatching technique. Writes to the ring buffer come from RabbitMQ, one by one (or several in parallel). A single reading thread takes all the messages present at the moment a batch starts, inserts them into the database, and repeats. "All at once" means a single message if only one is waiting, or a bunch of them if more accumulated while the previous insert was running.
This technique is also used in JDBC (if I remember correctly) and can be implemented easily with the LMAX Disruptor library in Java.
A sample project (using Reactor's Flux and System.out.println) can be found at https://github.com/luvarqpp/reactorBatch
Core code:
final Flux<String> stringFlux = Flux.interval(Duration.ofMillis(1)).map(x -> "Msg number " + x);
final Flux<List<String>> stringFluxMicrobatched = stringFlux
.bufferTimeout(100, Duration.ofNanos(1));
stringFluxMicrobatched.subscribe(strings -> {
// Batch insert into DB
System.out.print("Inserting in batch " + strings.size() + " strings.");
try {
// Inserting into db is simulated by 10 to 40 ms sleep here...
Thread.sleep(rnd.nextInt(30) + 10);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(" ... Done");
});
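The microbatching loop described above can also be sketched with the JDK alone. This version uses a BlockingQueue rather than a ring buffer or the Disruptor, so it shows the batching logic only. MicroBatch and nextBatch are hypothetical names, and the sleep stands in for the database insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MicroBatch {
    /** Blocks for the first message, then takes whatever else is already queued (up to maxBatch in total). */
    public static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatch) throws InterruptedException {
        List<T> batch = new ArrayList<>();
        batch.add(queue.take());            // wait until at least one message exists
        queue.drainTo(batch, maxBatch - 1); // add what accumulated meanwhile, without waiting
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 50; i++) {
                queue.add("Msg number " + i);
            }
        });
        producer.start();

        int written = 0;
        while (written < 50) {
            List<String> batch = nextBatch(queue, 100);
            written += batch.size();
            System.out.println("Inserting in batch " + batch.size() + " strings.");
            Thread.sleep(5); // simulated database insert
        }
        producer.join();
    }
}
```

The batch size adapts by itself: a slow insert lets more messages pile up, so the next batch is larger, while an idle system writes messages one at a time.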
Please feel welcome to edit and improve this post with name of technique and references. This is community wiki...
I have a Stream<Item> which I'm mapping to a CompletableFuture<ItemResult>.
What I'd like to do is to know when all the futures are completed.
One may suggest to:
Collect all the futures into an array and use CompletableFuture.allOf(). This is somewhat problematic since there could be hundreds of thousands of items.
Just continue with forEach(CompletableFuture::join). This is problematic too, as calling forEach with join will just block the stream, so processing becomes essentially serial rather than concurrent.
Inject a poisoned item at the end of the stream. This could work, but it's not that elegant in my view.
Check whether the executor queue is empty. This is quite limiting because I might use more than one executor in the future. Also, the queue can be momentarily empty.
Monitor the database instead and check the number of new items.
I feel like all the suggested solutions aren't good enough.
What is the appropriate way to monitor the futures?
Thanks
EDIT:
Another (vague) idea I had in mind is to use a counter and wait for it to go down to zero. But again, I'd need to make sure it's not just momentarily 0.
Disclaimer: I'm not sure whether Phaser is the right tool here, and if yes, whether it's better to have one root with multiple children or to chain them like I'm proposing below, so feel free to correct me.
Here's one approach that uses Phaser.
A Phaser has a limited number of parties, so we need to create a new child Phaser if that limit is about to be reached:
private Phaser register(Phaser phaser) {
if (phaser.getRegisteredParties() < 65534) {
// warning: side-effect,
// conflicts with AtomicReference#updateAndGet recommendation,
// might not fit well if the Stream is parallel:
phaser.register();
return phaser;
} else {
return new Phaser(phaser, 1);
}
}
Register each CompletableFuture against that Phaser chain, and deregister once done:
private void register(CompletableFuture<?> future, AtomicReference<Phaser> phaser) {
Phaser registeredPhaser = phaser.updateAndGet(this::register);
future
.thenRun(registeredPhaser::arriveAndDeregister)
.exceptionally(e -> {
// log e?
registeredPhaser.arriveAndDeregister();
return null;
});
}
Wait for all futures to be finished:
private <T> void await(Stream<CompletableFuture<T>> futures) {
Phaser rootPhaser = new Phaser(1);
AtomicReference<Phaser> phaser = new AtomicReference<>(rootPhaser);
futures.forEach(future -> register(future, phaser));
rootPhaser.arriveAndAwaitAdvance();
rootPhaser.arriveAndDeregister();
}
Example:
ExecutorService executor = Executors.newFixedThreadPool(500);
// creating fake stream with 500,000 futures:
Stream<CompletableFuture<Integer>> stream = IntStream
.rangeClosed(1, 500_000)
.mapToObj(i -> CompletableFuture.supplyAsync(() -> {
try {
TimeUnit.MILLISECONDS.sleep(10);
if (i % 50_000 == 0) {
System.out.println(Thread.currentThread().getName() + ": " + i);
}
return i;
} catch (InterruptedException e) {
throw new IllegalStateException(e);
}
}, executor));
// usage:
await(stream);
System.out.println("Done");
Outputs:
pool-1-thread-348: 50000
pool-1-thread-395: 100000
pool-1-thread-333: 150000
pool-1-thread-30: 200000
pool-1-thread-120: 250000
pool-1-thread-10: 300000
pool-1-thread-241: 350000
pool-1-thread-340: 400000
pool-1-thread-283: 450000
pool-1-thread-176: 500000
Done
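For a smaller number of futures, the same register/arrive idea fits in one method. The sketch below is a simplified variant, not the code above: it uses a single Phaser (so it only works below the party limit) and whenComplete instead of thenRun plus exceptionally. PhaserAwait and awaitAll are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class PhaserAwait {
    /** Blocks until every future in the stream has completed, normally or exceptionally. */
    public static void awaitAll(Stream<? extends CompletableFuture<?>> futures) {
        Phaser phaser = new Phaser(1); // one party for the calling thread
        futures.forEach(future -> {
            phaser.register(); // one party per future
            // Runs on both success and failure, so a failed future still arrives.
            future.whenComplete((result, error) -> phaser.arriveAndDeregister());
        });
        phaser.arriveAndAwaitAdvance(); // returns once all registered parties have arrived
        phaser.arriveAndDeregister();
    }

    public static void main(String[] args) {
        AtomicInteger completed = new AtomicInteger();
        Stream<CompletableFuture<Integer>> futures = IntStream.rangeClosed(1, 1000)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> {
                    completed.incrementAndGet();
                    return i;
                }));
        awaitAll(futures);
        System.out.println("Done, completed = " + completed.get());
    }
}
```

The caller's own party is what keeps the phase from advancing while futures are still being registered, so futures that finish early cannot release the waiter prematurely.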
Well, backpressure in RxJava is not real backpressure, but only ignoring some sets of elements.
But what if I cannot lose any elements and I need to slow down emission somehow?
RxJava cannot affect element emission, so the developer needs to implement it himself. But how?
The simplest way that comes to mind is to use a counter that is incremented on emission and decremented on completion.
Like this:
public static void sleep(int ms) {
try {
Thread.sleep(ms);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
public static void main(String[] args) throws InterruptedException {
AtomicInteger counter = new AtomicInteger();
Scheduler sA = Schedulers.from(Executors.newFixedThreadPool(1));
Scheduler sB = Schedulers.from(Executors.newFixedThreadPool(5));
Observable.create(s -> {
while (!s.isUnsubscribed()) {
if (counter.get() < 100) {
s.onNext(Math.random());
counter.incrementAndGet();
} else {
sleep(100);
}
}
}).subscribeOn(sA)
.flatMap(r ->
Observable.just(r)
.subscribeOn(sB)
.doOnNext(x -> sleep(1000))
.doOnNext(x -> counter.decrementAndGet())
)
.subscribe();
}
But I think this approach is very poor. Are there any better solutions?
Well, backpressure in RxJava is not real backpressure
RxJava's backpressure implementation is a non-blocking cooperation between subsequent producers and consumers through a request channel. The consumer asks for some number of elements via request(), and the producer creates/generates/emits at most that number of items via onNext, sometimes with delays between onNext calls.
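The request protocol can be illustrated with the JDK's own java.util.concurrent.Flow interfaces (Java 9+), which follow the same request/onNext contract. This is not RxJava code: RangePublisher and CollectingSubscriber are hypothetical, and the sketch is single-threaded and deliberately skips details a real implementation has to handle, such as reentrant or non-positive requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

public class RequestDemo {
    /** Emits 1..count, but never more items than the subscriber has requested. */
    public static final class RangePublisher implements Flow.Publisher<Integer> {
        private final int count;

        public RangePublisher(int count) {
            this.count = count;
        }

        @Override
        public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
            subscriber.onSubscribe(new Flow.Subscription() {
                private int next = 1;
                private boolean done;

                @Override
                public void request(long n) {
                    // Emit at most n items, stopping early when the range is exhausted.
                    for (long i = 0; i < n && next <= count && !done; i++) {
                        subscriber.onNext(next++);
                    }
                    if (next > count && !done) {
                        done = true;
                        subscriber.onComplete();
                    }
                }

                @Override
                public void cancel() {
                    done = true;
                }
            });
        }
    }

    /** Records what it receives and only asks for more when told to. */
    public static final class CollectingSubscriber implements Flow.Subscriber<Integer> {
        public final List<Integer> received = new ArrayList<>();
        public boolean completed;
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) { subscription = s; }
        @Override public void onNext(Integer item) { received.add(item); }
        @Override public void onError(Throwable t) { throw new IllegalStateException(t); }
        @Override public void onComplete() { completed = true; }

        public void request(long n) { subscription.request(n); }
    }

    public static void main(String[] args) {
        CollectingSubscriber consumer = new CollectingSubscriber();
        new RangePublisher(5).subscribe(consumer);
        System.out.println(consumer.received); // nothing has been requested yet
        consumer.request(3);
        System.out.println(consumer.received); // only the three requested items
        consumer.request(10);                  // more than remain: the rest arrives, then onComplete
        System.out.println(consumer.received + " completed=" + consumer.completed);
    }
}
```

No element is dropped here: the producer simply does nothing until the consumer signals demand.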
but only ignoring some sets of elements.
This happens only when you explicitly tell RxJava to drop any overflow.
RxJava cannot affect element emission, so the developer needs to implement it himself. But how?
Using Observable.create requires advanced knowledge of how non-blocking backpressure is implemented, and in practice it is not recommended for library users. RxJava has plenty of ways to give you backpressure-enabled flows without complications:
Observable.range(1, 100)
.map(v -> Math.random())
.subscribeOn(sA)
.flatMap(v ->
Observable.just(v).subscribeOn(sB)
.doOnNext(x -> sleep(1000))
)
.subscribe();
or
Observable.create(SyncOnSubscribe.createStateless(
o -> o.onNext(Math.random())
))
.subscribeOn(sA)
...
As you noted yourself, this actually has nothing to do with RxJava.
If you must process all events eventually, but you want to do that at your own pace, use queues:
ExecutorService emiter = Executors.newSingleThreadExecutor();
ScheduledExecutorService workers = Executors.newScheduledThreadPool(4);
BlockingQueue<String> events = new LinkedBlockingQueue<>();
emiter.submit(() -> {
System.out.println("I'll send 100 events as fast as I can");
for (int i = 0; i < 100; i++) {
try {
events.put(UUID.randomUUID().toString());
} catch (InterruptedException e) {
e.printStackTrace();
}
}
});
workers.scheduleWithFixedDelay(
() -> {
String result = null;
try {
result = events.take();
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(String.format("I don't care, got %s only now", result));
}, 0, 1, TimeUnit.SECONDS
);
I'm looking for a better way to "close" some resource (here, to destroy an external Process) in a CompletableFuture chain. Right now my code looks roughly like this:
public CompletableFuture<ExecutionContext> createFuture()
{
final Process[] processHolder = new Process[1];
return CompletableFuture.supplyAsync(
() -> {
try {
processHolder[0] = new ProcessBuilder(COMMAND)
.redirectErrorStream(true)
.start();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
return PARSER.parse(processHolder[0].getInputStream());
}, SCHEDULER)
.applyToEither(createTimeoutFuture(DURATION), Function.identity())
.exceptionally(throwable -> {
processHolder[0].destroyForcibly();
if (throwable instanceof TimeoutException) {
throw new DatasourceTimeoutException(throwable);
}
Throwables.propagateIfInstanceOf(throwable, DatasourceException.class);
throw new DatasourceException(throwable);
});
}
The problem I see is the "hacky" one-element array that holds a reference to the process so that it can be closed in case of error. Is there some CompletableFuture API that allows passing some "context" to exceptionally (or some other way to achieve that)?
I was considering a custom CompletionStage implementation, but that looks like a big task just to get rid of the "holder" variable.
There is no need to have a linear chain of CompletableFutures. Actually, you already don't have one, because of createTimeoutFuture(DURATION), which is a rather convoluted way to implement a timeout. You can simply put it this way:
public CompletableFuture<ExecutionContext> createFuture() {
CompletableFuture<Process> proc=CompletableFuture.supplyAsync(
() -> {
try {
return new ProcessBuilder(COMMAND).redirectErrorStream(true).start();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}, SCHEDULER);
CompletableFuture<ExecutionContext> result
=proc.thenApplyAsync(process -> PARSER.parse(process.getInputStream()), SCHEDULER);
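proc.thenAcceptAsync(process -> {
try {
if(!process.waitFor(DURATION, TimeUnit.WHATEVER_DURATION_REFERS_TO)) {
process.destroyForcibly();
result.completeExceptionally(
new DatasourceTimeoutException(new TimeoutException()));
}
} catch (InterruptedException e) {
// waitFor is interruptible and a Consumer cannot throw checked exceptions
Thread.currentThread().interrupt();
process.destroyForcibly();
result.completeExceptionally(e);
}
});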
return result;
}
If you want to keep the timeout future, perhaps because you consider the process startup time to be significant, you could use
public CompletableFuture<ExecutionContext> createFuture() {
CompletableFuture<Throwable> timeout=createTimeoutFuture(DURATION);
CompletableFuture<Process> proc=CompletableFuture.supplyAsync(
() -> {
try {
return new ProcessBuilder(COMMAND).redirectErrorStream(true).start();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}, SCHEDULER);
CompletableFuture<ExecutionContext> result
=proc.thenApplyAsync(process -> PARSER.parse(process.getInputStream()), SCHEDULER);
timeout.exceptionally(t -> new DatasourceTimeoutException(t))
.thenAcceptBoth(proc, (x, process) -> {
if(process.isAlive()) {
process.destroyForcibly();
result.completeExceptionally(x);
}
});
return result;
}
I've used the one item array myself to emulate what would be proper closures in Java.
Another option is using a private static class with fields. The advantages are that it makes the purpose clearer and has a bit less impact on the garbage collector with big closures, i.e. one object with N fields versus N arrays of length 1. It also becomes useful if you need to close over the same fields in other methods.
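A minimal sketch of the holder-class option, with a plain string standing in for the external Process so the example runs anywhere. HolderDemo, Context and run are hypothetical names, and unlike the question's code the error handler returns a value instead of rethrowing, purely to make the effect visible.

```java
import java.util.concurrent.CompletableFuture;

public class HolderDemo {
    /** Named holder that plays the role of the one-element array. */
    private static final class Context {
        volatile String resource; // written by the first stage, read by the error handler
    }

    public static CompletableFuture<String> run(boolean fail) {
        Context ctx = new Context();
        return CompletableFuture.supplyAsync(() -> {
            ctx.resource = "resource-1"; // stands in for starting the process
            if (fail) {
                throw new IllegalStateException("parse failed");
            }
            return "parsed";
        }).exceptionally(t -> "cleaned up " + ctx.resource); // the handler can reach the resource
    }

    public static void main(String[] args) {
        System.out.println(run(false).join());
        System.out.println(run(true).join());
    }
}
```

The captured ctx variable itself is effectively final; only its field is mutated, which is exactly what the array trick does with element 0.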
This is a de facto pattern, even outside the scope of CompletableFuture and it has been (ab)used long before lambdas were a thing in Java, e.g. anonymous classes. So, don't feel so bad, it's just that Java's evolution didn't provide us with proper closures (yet? ever?).
If you want, you may return values from CompletableFutures inside .handle(), so you can wrap the whole completion result and return a wrapper. In my opinion, this is not any better than manual closures, especially given that you'll create such a wrapper for every future.
Subclassing CompletableFuture is not necessary. You're not interested in altering its behavior, only in attaching data to it, which you can do with current Java's final variable capturing. That is, unless you profile and see that creating these closures is actually affecting performance somehow, which I highly doubt.
EDITED: see this question which is more clear and precise:
RxJava flatMap and backpressure strange behavior
I'm currently writing a data synchronization job with RxJava, and I'm quite new to reactive programming and especially to the RxJava library.
My job is quite simple: I have a list of element IDs, I call a web service to get each element by ID, do some processing, and make multiple calls to push data to the DB.
I load the data from the WS with 1 io thread and push the data to the DB with multiple io threads.
However, I always end up with an OutOfMemory error.
At first I thought that loading the data from the WS was faster than storing it in the DB.
But since both the WS call and the DB call are synchronous, shouldn't they exert backpressure on each other?
Thank you for your help.
My code pretty much look like this:
@Test
public void test() {
int MAX_CONCURRENT_LOAD = 1;
int MAX_CONCURRENT_STORE = 2;
List<Integer> ids = IntStream.range(0, 10000).boxed().collect(Collectors.toList());
Observable.from(ids)
.flatMap(this::produce, MAX_CONCURRENT_LOAD)
.flatMap(this::consume, MAX_CONCURRENT_STORE)
.toBlocking().forEach(s -> System.out.println("Value " + s));
System.out.println("Finished");
}
private Observable<Integer> produce(final int value) {
return Observable.<Integer>create(s -> {
try {
if (!s.isUnsubscribed()) {
Thread.sleep(500); //Here I call WS to retrieve data
s.onNext(value);
s.onCompleted();
}
} catch (Exception e) {
s.onError(e);
}
}).subscribeOn(Schedulers.io());
}
private Observable<Boolean> consume(Integer value) {
return Observable.<Boolean>create(s -> {
try {
if (!s.isUnsubscribed()) {
Thread.sleep(10000); //Here I call DB to store data
s.onNext(true);
s.onCompleted();
}
} catch (Exception e) {
s.onNext(false);
s.onCompleted();
}
}).subscribeOn(Schedulers.io());
}
It seems your WS is poll based so if you use fromCallable instead of your custom Observable, you get proper backpressure:
return Observable.<Integer>fromCallable(() -> {
Thread.sleep(500); //Here I call WS to retrieve data
return value;
}).subscribeOn(Schedulers.io());
Otherwise, if you have blocking WS and blocking database, you can use them to backpressure each other:
ids.map(id -> db.store(ws.get(id))).subscribeOn(Schedulers.io())
.toBlocking().subscribe(...)
and potentially leave off subscribeOn and toBlocking as well.