I need to copy date from one source (in parallel) to another with batches.
I did this:
Flux.generate((SynchronousSink<String> sink) -> {
try {
String val = dataSource.getNextItem();
if (val == null) {
sink.complete();
return;
}
sink.next(val);
} catch (InterruptedException e) {
sink.error(e);
}
})
.parallel(4)
.runOn(Schedulers.parallel())
.doOnNext(dataTarget::write)
.sequential()
.blockLast();
class dataSource{
public Item getNextItem(){
//...
}
}
class dataTarget{
public void write(List<Item> items){
//...
}
}
It receives data in parallel, but writes one at a time.
I need to collect them in batches (like by 10 items) and then write the batch.
How can I do that?
UPDATE:
The main idea that the source is the messaging system (i.e. rabbitmq or nats) that's suitable to efficiently send messages one by one, but the target is the database which is more efficient on inserting a batch.
So the final result should be like — I receive messages in parallel until buffer is not filled up, then I write all the buffer into database by one shot.
It's easy to do in regular java, but in case of streams — I don't get how to do it. How to buffer the data and how to pause the reader till the writer is not ready to get next part.
All you need is Flux#buffer(int maxSize) operator:
Flux.generate((SynchronousSink<String> sink) -> {
try {
String val = dataSource.getNextItem();
if (val == null) {
sink.complete();
return;
}
sink.next(val);
} catch (InterruptedException e) {
sink.error(e);
}
})
.buffer(10) //Flux<List<String>>
.flatMap(dataTarget::write)
.blockLast();
class DataTarget{
public Mono<Void> write(List<String> items){
return reactiveDbClient.insert(items);
}
}
Here, buffer collects items into multiple List's of 10 items(batches). You do not need to use parallel scheduler. The flatmap will run these operations asynchronously. See Understanding Reactive’s .flatMap() Operator.
You need to do your heavy work in individual Publisher-s which will be materialized in flatMap() in parallel. Like this
Flux.generate((SynchronousSink<String> sink) -> {
try {
String val = dataSource.getNextItem();
if (val == null) {
sink.complete();
return;
}
sink.next(val);
} catch (InterruptedException e) {
sink.error(e);
}
})
.parallel(4)
.runOn(Schedulers.parallel())
.flatMap(item -> Mono.fromCallable(() -> dataTarget.write(item)))
.sequential()
.blockLast();
Best approach (from algorithmic view) is to have ringbuffer and use microbatching technique. Writes to ringbuffer is done from rabbitmq, one-by-one (or multiple in parallel). Reading thread (single only) would get all messages at once (presented at a time of batch start), insert them into database and do it again... All at once means single message (if there is only one), or bunch of them (if they have been accumulated while duration of last insert was long enough to).
This technique is used also in jdbc (if I remember correctly) and can be implemented easily using lmax disruptor library in java.
Sample project (using ractor /Flux/ and System.out.println) can be found on https://github.com/luvarqpp/reactorBatch
Core code:
final Flux<String> stringFlux = Flux.interval(Duration.ofMillis(1)).map(x -> "Msg number " + x);
final Flux<List<String>> stringFluxMicrobatched = stringFlux
.bufferTimeout(100, Duration.ofNanos(1));
stringFluxMicrobatched.subscribe(strings -> {
// Batch insert into DB
System.out.print("Inserting in batch " + strings.size() + " strings.");
try {
// Inserting into db is simulated by 10 to 40 ms sleep here...
Thread.sleep(rnd.nextInt(30) + 10);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(" ... Done");
});
Please feel welcome to edit and improve this post with name of technique and references. This is community wiki...
Related
Let's say I have:
list.stream()
.map(someService::someRateLimitedApiCall) //does not implement Runnable
.filter(Optional::isPresent)
.map(Optional::get)
.sleep(1000) //is something like this possible?
.min...;
The API service only allows limited number of transactions per second, and I am seeking to introduce a delay in between calls.
If not, is there a way to add an executor with a fixed delay within the iteration of the stream?
(To be clear, I am not violating the terms of the external API and will not abuse the service.)
Rather than use peek, why not just put the delay in the map operation which calls the API?
.map(e -> {
try {
Thread.sleep(1000);
} catch (InterruptedException ex) {
return Optional.empty();
}
return someRateLimitedApiCall(e);
})
The simple solution (without parallel streams) was to use peek as multiple commenters suggested. Since it requires a Consumer:
.peek(i -> {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
})
I have a Flowable that we are returning in a function that will continually read from a database and add it to a Flowable.
public void scan() {
Flowable<String> flow = Flowable.create((FlowableOnSubscribe<String>) emitter -> {
Result result = new Result();
while (!result.hasData()) {
result = request.query(skip, limit);
partialResult.getResult()
.getFeatures().forEach(feature -> emmitter.emit(feature));
}
}, BackpressureStrategy.BUFFER)
.subscribeOn(Schedulers.io());
return flow;
}
Then I have another object that can call this method.
myObj.scan()
.parallel()
.runOn(Schedulers.computation())
.map(feature -> {
//Heavy Computation
})
.sequential()
.blockingSubscribe(msg -> {
logger.debug("Successfully processed " + msg);
}, (e) -> {
logger.error("Failed to process features because of error with scan", e);
});
My heavy computation section could potentially take a very long time. So long in fact that there is a good chance that the database requests will load the whole database into memory before the consumer finishes the first couple entries.
I have read up on backpressure with rxjava but the only 4 options essentially make me drop data or replace it with the last.
Is there a way to make it so that when I call emmitter.emit(feature) the call blocks until there is more room in the Flowable?
I.E I want to treat the Flowable as a blocking queue where push will sleep if the queue is past the capacity.
Well, backpressure in RxJava is not real backpressure, but only ignoring some sets of elements.
But what if I cannot loose any elements and I need to slow emition somehow?
RxJava cannot affect element emition, so developer needs to implement it by himself. But how?
The simpliest way comes to mind is to use some counter with incrementing on emition and decrementing on finishing.
Like that:
public static void sleep(int ms) {
try {
Thread.sleep(ms);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
public static void main(String[] args) throws InterruptedException {
AtomicInteger counter = new AtomicInteger();
Scheduler sA = Schedulers.from(Executors.newFixedThreadPool(1));
Scheduler sB = Schedulers.from(Executors.newFixedThreadPool(5));
Observable.create(s -> {
while (!s.isUnsubscribed()) {
if (counter.get() < 100) {
s.onNext(Math.random());
counter.incrementAndGet();
} else {
sleep(100);
}
}
}).subscribeOn(sA)
.flatMap(r ->
Observable.just(r)
.subscribeOn(sB)
.doOnNext(x -> sleep(1000))
.doOnNext(x -> counter.decrementAndGet())
)
.subscribe();
}
But I think this way is very poor. Is there any better solutions?
Well, backpressure in RxJava is not real backpressure
RxJava's backpressure implementation is a non-blocking cooperation between subsequent producers and consumers through a request channel. The consumer asks for some amount of elements via request() and the producers creates/generates/emits at most that amount of items via onNext, sometimes with delays between onNexts.
but only ignoring some sets of elements.
This happens only when you explicitly tell RxJava to drop any overflow.
RxJava cannot affect element emition, so developer needs to implement it by himself. But how?
Using Observable.create requires advanced knowledge of how non-blocking backpressure can be implemented and practically it is not recommended to library users. RxJava has plenty of ways to give you backpressure-enabled flows without complications:
Observable.range(1, 100)
.map(v -> Math.random())
.subscribeOn(sA)
.flatMap(v ->
Observable.just(v).subscribeOn(sB)
.doOnNext(x -> sleep(1000))
)
.subscribe();
or
Observable.create(SyncOnSubscribe.createStateless(
o -> o.onNext(Math.random())
)
.subscribeOn(sA)
...
As you noted yourself, this actually has nothing to do with RxJava.
If you must process all events eventually, but you want to do that at your own pace, use queues:
ExecutorService emiter = Executors.newSingleThreadExecutor();
ScheduledExecutorService workers = Executors.newScheduledThreadPool(4);
BlockingQueue<String> events = new LinkedBlockingQueue<>();
emiter.submit(() -> {
System.out.println("I'll send 100 events as fast as I can");
for (int i = 0; i < 100; i++) {
try {
events.put(UUID.randomUUID().toString());
} catch (InterruptedException e) {
e.printStackTrace();
}
}
});
workers.scheduleWithFixedDelay(
() -> {
String result = null;
try {
result = events.take();
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(String.format("I don't care, got %s only now", result));
}, 0, 1, TimeUnit.SECONDS
);
EDITED: see this question which is more clear and precise:
RxJava flatMap and backpressure strange behavior
I'm currently writing a data synchronization job with RxJava and I'm quite novice with reactive programming and especialy RxJava library.
My job is quite simple I have a list of element IDs, I call a webservice to get each element by ID, do some processing and do multiple call to push data to DB.
I load the data from WS with 1 io thread and push the data to DB with multiple io threads.
However I always end-up with OutOfMemory error.
I thought first that loading the data from the WS was faster than storing them in the DBs.
But as both WS call and DB call synchronous call should they exert backpressure on each other?
Thank you for your help.
My code pretty much look like this:
#Test
public void test() {
int MAX_CONCURRENT_LOAD = 1;
int MAX_CONCURRENT_STORE = 2;
List<Integer> ids = IntStream.range(0, 10000).boxed().collect(Collectors.toList());
Observable.from(ids)
.flatMap(this::produce, MAX_CONCURRENT_LOAD)
.flatMap(this::consume, MAX_CONCURRENT_STORE)
.toBlocking().forEach(s -> System.out.println("Value " + s));
System.out.println("Finished");
}
private Observable<Integer> produce(final int value) {
return Observable.<Integer>create(s -> {
try {
if (!s.isUnsubscribed()) {
Thread.sleep(500); //Here I call WS to retrieve data
s.onNext(value);
s.onCompleted();
}
} catch (Exception e) {
s.onError(e);
}
}).subscribeOn(Schedulers.io());
}
private Observable<Boolean> consume(Integer value) {
return Observable.<Boolean>create(s -> {
try {
if (!s.isUnsubscribed()) {
Thread.sleep(10000); //Here I call DB to store data
s.onNext(true);
s.onCompleted();
}
} catch (Exception e) {
s.onNext(false);
s.onCompleted();
}
}).subscribeOn(Schedulers.io());
}
It seems your WS is poll based so if you use fromCallable instead of your custom Observable, you get proper backpressure:
return Observable.<Integer>fromCallabe(s -> {
Thread.sleep(500); //Here I call WS to retrieve data
return value;
}).subscribeOn(Schedulers.io());
Otherwise, if you have blocking WS and blocking database, you can use them to backpressure each other:
ids.map(id -> db.store(ws.get(id)).subscribeOn(Schedulers.io())
.toBlocking().subscribe(...)
and potentially leave off subscribeOn and toBlocking as well.
I'm using Stream.generate to get data from Instagram. As instagram limits calls per hour I want generate to run less frequent then every 2 seconds.
I've chosen such title because I moved from ScheduledExecutorService.scheduleAtFixedRate and that's what I was searching for. I do realise that stream intermediate operations are lazy and cannot be called on schedule. If you have better idea for title let me know.
So again I want to have at least 2 second delay between genations.
My attempt wich doesn't take into consideration time consumed by operations after generate, which might take longer then 2s:
Stream.generate(() -> {
List<MediaFeedData> feedDataList = null;
while (feedDataList == null) {
try {
Thread.sleep(2000);
feedDataList = newData();
} catch (InstagramException e) {
notifyError(e.getMessage());
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
return feedDataList;
})
A solution would be to decouple the generator from the Stream, for example using a BlockingQueue
final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(100);
ScheduledExecutorService scheduler = new ScheduledThreadPoolExecutor(1);
scheduler.scheduleAtFixedRate(() -> {
// Generate new data every 2s, regardless of their processing rate
ThreadLocalRandom random = ThreadLocalRandom.current();
queue.offer(random.nextInt(10));
}, 0, 2, TimeUnit.SECONDS);
Stream.generate(() -> {
try {
// Accept new data if ready, or wait for some more to be generated
return queue.take();
} catch (InterruptedException e) {}
return -1;
}).forEach(System.out::println);
If the data processing takes more than 2s, new data will be enqueued and wait to be consumed. If it takes less than 2s, the take method in the generator will wait for new data to be produced by the scheduler.
This way, you are guaranteed to make less than N calls per hour to Instagram !
As far as I understand, your question is about solving two problems:
waiting at a fixed rate rather than a fixed delay
creating a stream for an unknown number of items which allows processing until some point of time (i.e. is not infinite)
You can solve the first task by using a deadline-based waiting and the second by implementing a Spliterator:
Stream<List<MediaFeedData>> stream = StreamSupport.stream(
new Spliterators.AbstractSpliterator<List<MediaFeedData>>(Long.MAX_VALUE, 0) {
long lastTime=System.currentTimeMillis();
#Override
public boolean tryAdvance(Consumer<? super List<MediaFeedData>> action) {
if(quitCondition()) return false;
List<MediaFeedData> feedDataList = null;
while (feedDataList == null) {
lastTime+=TimeUnit.SECONDS.toMillis(2);
while(System.currentTimeMillis()<lastTime)
LockSupport.parkUntil(lastTime);
try {
feedDataList=newData();
} catch (InstagramException e) {
notifyError(e.getMessage());
if(QUIT_ON_EXCEPTION) return false;
}
}
action.accept(feedDataList);
return true;
}
}, false);
Make a Timer and a semaphore. The timer raises the semaphore every 2 seconds, and in the stream you wait on every call for the semaphore.
This keeps the waits to the specified minimum (2 s), and - funnily - would even work with .parallel().
private final volatile Semaphore tickingSemaphore= new Semaphore(1, true);
In its own thread:
Stream.generate(() -> {
tickingSemaphore.acquire();
...
};
In the timer:
tickingSemaphore.release();