I'm picking up Java/Reactor after moving over from C#. I'm well versed in the C# async-await approach to non-blocking calls and am struggling to adapt to Flux/Mono.
I'm implementing a solution where I need to make a call to ElasticSearch via the Java SDK, get results, apply additional filters to strip out ES results, and keep paging through ES until my final collection of results is complete.
The ES SDK doesn't support Reactor, but there are examples of Java adapter code that takes the ES callback and converts it to a Mono (I see a direct correlation to the C# async-await here, as this is a non-blocking call to ES). What I then struggle with is the next bit: I need to take the results from the ES Mono and filter them.
I do this by calling out to other external services to get additional data based on the results from the ES call, so I need to know the ids of each page of content in the ES Mono result before I can apply the filtering (effectively a kind of block), then apply the in-memory filters, and if I don't have enough content, go back to ES for the next page... repeating until I have enough data or there are no more results from ES.
This appears to be very difficult to achieve compared to C# but I probably just don't understand the Java paradigm correctly.
My problem is that I can't use "block()" as this throws an error in Reactor 3.2, so I don't really know how to "wait" until the Mono calls to ES and the external services are complete before continuing. In C#, this would be as simple as calling an async method with an await to handle the implicit callbacks.
My blocking version (works in IntelliJ, fails when published via Maven and then run in a web server) is effectively:
do {
    var sr = GetSearchRequest(xxxx);

    this.elasticsearch.results(sr)
        .map(r -> chunk.add(r))
        .block();

    if (chunk.size() == 0) {
        isComplete = true;
    }
    else {
        var filtered = postFilterResults(chunk);
        finalResults.add(filtered);
        if (finalResults.size() >= MAXIMUM_RESULTS) {
            isComplete = true;
        }
        esPage = esPage + 1;
    }
} while (isComplete == false);
If I try to subscribe() or other non-blocking Reactor calls, then (obviously) the code skips over the "get ES" bit and hits the do-while, looping repeatedly until the callback from ES finally happens and the subscribed map is invoked.
I think I need to perform an "async block" for each ES call but I don't know how.
To answer my own question... The underlying issue IMO is that Flux/Mono simply is not like any existing programming style, in that it absolutely forces you to work within the fluent style that Reactor mandates. This is very similar to C# LINQ, but it's almost a "false friend" as even things like loops need to be expressed in Reactor's operators.
In this case, the key issue to solve is one of paging and to keep doing this within a loop. It is very unclear how to achieve this, as a subscription to a Flux "locks in" the original parameters, so repeating the subscription call simply gets the same page again. The solution is to use the Flux.defer method, which forces lazy building of the subscription on each repeated invocation. You then need AtomicIntegers to keep track of the page counter across the different calls. Again, this is something that C# handles for you, so it can catch a .NET developer out.
Something like:
//The response from the elasticsearch adapter is a Flux<T> but we do not want to filter
//results on a row by row basis as this incurs one call for each row to the DB/Network
//(as appropriate). We choose to batch these up
var result = new SearchResult();
var page = new AtomicInteger();
var chunkSize = new AtomicInteger();
//Use a defer so we recalculate the subscription to the search with the new page count
var results = Flux.defer(() -> elasticsearch.results(GetSearchRequest(request, lc, pf, page.get()))
        .doOnComplete(() -> {
            chunkSize.set(0);
            page.getAndAdd(1);
        })
        .collectList()
        .map(chunk -> {
            chunkSize.set(chunk.size());
            return chunk;
        })
        .map(chunk -> postFilterResults(request, chunk, pf))
        .map(filtered -> result.getDocuments().addAll(filtered)));
//Repeat the deferred flux (recalculating each time) until we have enough content or we don't get anything from the search engine
return results
        .repeat()
        .takeUntil(r -> chunkSize.get() == 0 || result.getDocuments().size() >= this.elasticsearch.getMaximumSearchResults())
        .take(this.elasticsearch.getMaximumSearchResults())
        .collectList()
        .flatMap(r -> {
            result.setTotalHits(result.getDocuments().size());
            return Mono.just(result);
        });
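For completeness, the Mono<SearchResult> coming out of this chain can be handed straight back to the framework rather than blocked on. A minimal sketch, assuming a Spring WebFlux controller (searchService, SearchRequest and the mapping are hypothetical names, not part of the original code):
@GetMapping("/search")
public Mono<SearchResult> search(SearchRequest request) {
    // Return the reactive chain as-is; WebFlux subscribes to it,
    // so no block() is ever needed in application code.
    return searchService.search(request);
}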
In my Spring Boot application I have a component that is supposed to monitor the health status of another, external system. This component also offers a public method that reactive chains can subscribe to in order to wait for the external system to be up.
@Component
public class ExternalHealthChecker {

    private static final Logger LOG = LoggerFactory.getLogger(ExternalHealthChecker.class);

    private final WebClient externalSystemWebClient = WebClient.builder().build(); // config omitted

    private volatile boolean isUp = true;
    private volatile CompletableFuture<String> completeWhenUp = new CompletableFuture<>();

    @Scheduled(cron = "0/10 * * ? * *")
    private void checkExternalSystemHealth() {
        this.externalSystemWebClient.get() //
                .uri("/health") //
                .retrieve() //
                .bodyToMono(Void.class) //
                .doOnError(this::handleHealthCheckError) //
                .doOnSuccess(nothing -> this.handleHealthCheckSuccess()) //
                .subscribe(); //
    }

    private void handleHealthCheckError(final Throwable error) {
        if (this.isUp) {
            LOG.error("External System is now DOWN. Health check failed: {}.", error.getMessage());
        }
        this.isUp = false;
    }

    private void handleHealthCheckSuccess() {
        // the status changed from down -> up, which has to complete the future that might be currently waited on
        if (!this.isUp) {
            LOG.warn("External System is now UP again.");
            this.isUp = true;
            this.completeWhenUp.complete("UP");
            this.completeWhenUp = new CompletableFuture<>();
        }
    }

    public Mono<String> waitForExternalSystemUPStatus() {
        if (this.isUp) {
            LOG.info("External System is already UP!");
            return Mono.empty();
        } else {
            LOG.warn("External System is DOWN. Requesting process can now wait for UP status!");
            return Mono.fromFuture(completeWhenUp);
        }
    }
}
The method waitForExternalSystemUPStatus is public and may be called from many different threads. The idea behind this is to give some of the reactive flux chains in the application a way of pausing their processing until the external system is up. These chains cannot process their elements while the external system is down.
someFlux
    .doOnNext(record -> LOG.info("Next element"))
    .delayUntil(record -> externalHealthChecker.waitForExternalSystemUPStatus())
    ... // starting processing
The issue here is that I can't really wrap my head around which part of this code needs to be synchronised. I think there should not be an issue with multiple threads calling waitForExternalSystemUPStatus at the same time, as this method is not writing anything, so I feel like this method does not need to be synchronised. However, the method annotated with @Scheduled will also run on its own thread and will in fact write the value of isUp and also potentially change the reference of completeWhenUp to a new, uncompleted future instance. I have marked these two mutable attributes with volatile because, from reading about this keyword in Java, it feels to me like it would help with guaranteeing that the threads reading these two values see the latest value. However, I am unsure if I also need to add synchronized keywords to parts of the code. I am also unsure if the synchronized keyword plays well with Reactor code; I have a hard time finding information on this. Maybe there is also a way of providing the functionality of the ExternalHealthChecker in a more complete, reactive way, but I cannot think of any.
I'd strongly advise against this approach. The problem with threaded code like this is it becomes immensely difficult to follow & reason about. I think you'd at least need to synchronise the parts of handleHealthCheckSuccess() and waitForExternalSystemUPStatus() that reference your completeWhenUp field, otherwise you could have a race hazard on your hands (only one writes to it, but it might be read out-of-order after that write) - but there could well be something else I'm missing, and if so it may show up as one of those annoying "one in a million" type bugs that's almost impossible to pin down.
There should be a much more reliable & simple way of achieving this though. Instead of using the Spring scheduler, I'd create a flux when your ExternalHealthChecker component is created as follows:
healthCheckStream = Flux.interval(Duration.ofMinutes(10))
        .flatMap(i ->
                webClient.get().uri("/health")
                        .retrieve()
                        .bodyToMono(String.class)
                        .map(s -> true)                        // any successful response counts as "up"
                        .onErrorResume(e -> Mono.just(false))) // any error counts as "down"
        .cache(1);                                             // always replay the latest result to new subscribers
...where healthCheckStream is a field of type Flux<Boolean>. (Note it doesn't need to be volatile, as you'll never replace it so cross-thread worries don't apply - it's the same stream that will be updated with different results every 10 minutes based on the healthcheck status, whatever thread you'll access it from.)
This essentially creates a stream of healthcheck response values every 10 minutes, always caches the latest response, and turns it into a hot source. This means that the "nothing happens until you subscribe" rule doesn't apply in this case - the flux will start executing immediately, and any new subscribers that come in on any thread will always get the latest result, be that a pass or a fail. handleHealthCheckSuccess(), handleHealthCheckError(), isUp, and completeWhenUp are then all redundant and can go - and your waitForExternalSystemUPStatus() can just become a single line:
return healthCheckStream.filter(x -> x).next();
...then job done, you can call that from anywhere and you'll have a Mono that will only complete when the system is up.
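Putting the pieces together, a minimal sketch of the slimmed-down component under these assumptions (WebClient configuration still omitted as in the question; note the return type becomes Mono<Boolean> rather than Mono<String>):
import java.time.Duration;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Component
public class ExternalHealthChecker {

    private final WebClient externalSystemWebClient = WebClient.builder().build(); // config omitted

    // Poll /health on an interval (10 minutes here, as in the answer; the question's cron polled every 10 seconds),
    // map a successful response to true and any error to false, and cache the latest result for all subscribers.
    private final Flux<Boolean> healthCheckStream = Flux.interval(Duration.ofMinutes(10))
            .flatMap(i -> externalSystemWebClient.get()
                    .uri("/health")
                    .retrieve()
                    .bodyToMono(String.class)
                    .map(body -> true)
                    .onErrorResume(e -> Mono.just(false)))
            .cache(1);

    public Mono<Boolean> waitForExternalSystemUPStatus() {
        // Completes with the first "up" result observed after subscription.
        return healthCheckStream.filter(up -> up).next();
    }
}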
I'm pretty new to Rx in general, and RxJava in particular, so pardon any mistakes.
This operation depends on two async operations.
The first uses a filter function to attempt to get a single entity from a list returned by an async Observable.
The second is an async operation that communicates with a device and produces an Observable of status updates.
I want to take the Single that is created from the filter function, apply that to pairReader(...), and subscribe to its Observable for updates. I can get this to work as shown, but only if I include the take(1) marked in the comment; otherwise I get an exception because the chain tries to pull another value from the Single.
Observable<DeviceCredential> getCredentials() {
return deviceCredentialService()
.getCredentials()
.flatMapIterable(event -> event.getData());
}
Single<Organization> getOrgFromCreds(String orgid) {
return getCredentials()
// A device is logically constrained to only have a single cred per org
.map(DeviceCredential::getOrganization)
.filter(org -> org.getId().equals(orgid))
.take(1) // Without this I get an exception
.singleOrError();
}
Function<Organization, Observable<Reader.EnrollmentState>> pairReader(String name) {
return org -> readerService().pair(name, org);
}
getOrgFromCreds(orgid)
        .flatMapObservable(pairReader(readerid))
        .subscribe(state -> {
            switch (state) {
                case BEGUN:
                    LOG.d(TAG, "Pairing begun");
                    break;
                case PAIRED:
                    LOG.d(TAG, "Pairing success");
                    callback.success();
                    break;
                case NOTIFIED_SERVER:
                    LOG.d(TAG, "Pairing server notified");
                    break;
            }
        },
        error -> {
            Crashlytics.logException(error);
            callback.error(error.getLocalizedMessage());
        });
If the source stream emits more than one item, singleOrError() is supposed to emit an error (see the docs).
For your case, use either first() or firstOrError() instead.
Single<Organization> getOrgFromCreds(String orgid) {
return getCredentials()
.map(DeviceCredential::getOrganization)
.filter(org -> org.getId().equals(orgid))
.firstOrError();
}
If I got you right, you need to perform some action using previously retrieved async data, so you could use the .zip() operator.
Here is an example:
Observable.zip(
getOrgFromCreds().toObservable(),
getCredentials(),
(first, second) -> /*create output object here*/
)
.subscribe(
(n) -> /*do onNext*/,
(e) -> /*do onError*/
);
Note that the .zip() operator will wait for an emission from both streams, and then it will create the outer emission using the function you provided in "create output object here".
If you don't want to wait for both items, you can use .combineLatest().
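For reference, a small self-contained sketch of the difference, using hypothetical interval sources (nothing to do with the credential types above): zip pairs the n-th emission of each stream, while combineLatest re-emits whenever either stream emits, combined with the latest value seen from the other.
import java.util.concurrent.TimeUnit;
import io.reactivex.Observable;

public class CombineLatestDemo {
    public static void main(String[] args) throws InterruptedException {
        Observable<Long> fast = Observable.interval(100, TimeUnit.MILLISECONDS);
        Observable<Long> slow = Observable.interval(250, TimeUnit.MILLISECONDS);

        // Emits on every tick of either source, pairing it with the other source's latest value.
        Observable.combineLatest(fast, slow, (f, s) -> "fast=" + f + " slow=" + s)
                .take(10)
                .subscribe(System.out::println);

        Thread.sleep(2000); // interval() runs on the computation scheduler; keep the JVM alive
    }
}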
The problem here turned out to be that the API was designed in an odd way (and unfortunately has extremely poor documentation). I couldn't figure out why I was getting duplicates, and thought I was using flatMapIterable incorrectly.
What the deviceCredentialService.getCredentials() call actually creates is an observable that emits DataEvent objects which are simple wrappers over a list of results, and with a flag of where the results came from.
The API designer wanted to allow the user to use locally cached data to fill the UI immediately while a longer request to a REST API executes. The DataEvent.from property is an enum that flags the source, either from the local device cache or from the remote API call.
The way I solved this was to simply ignore the results coming from local cache and only emit results from the API:
Observable<DeviceCredential> getCredentials() {
return deviceCredentialService()
.getCredentials()
// Only get creds from network
.filter(e -> e.getFrom() == SyncedDataSourceObservableFactory.From.SOURCE)
.flatMapIterable(e -> e.getData());
}
Single<Organization> getOrgFromCreds(String orgid) {
return getCredentials()
// A device is logically constrained to only have a single cred per org
.map(DeviceCredential::getOrganization)
.filter(org -> org.getId().equals(orgid))
.singleOrError();
}
The plan then is to use memoization to cache entities in a way that gives the implementing app access to cache invalidation. Since the provided interface doesn't allow squelching the API call, there is no way to work only with the cache if the app feels it is fresh.
I have an issue while processing a flux that is built from a Stream.generate construct.
The Java stream is fetching some data from a remote source, hence I implemented a custom supplier that has the data fetching logic embedded, and then used it to populate the Stream.
Stream.generate(new SearchSupplier(...))
My idea is to detect an empty list and use the Java 9 takeWhile feature:
Stream.generate(new SearchSupplier(this, queryBody))
.takeWhile(either -> either.isRight() && either.get().nonEmpty())
(using Vavr's Either construct)
The repository layer Flux will then do:
return Flux.fromStream (
this.searchStream(...) //this is where the stream gets generated
)
.map(Either::get)
.flatMap(Flux::fromIterable);
The "service" layer is composed of some transformation steps on the flux, but the method signature is something like Flux<JsonObject> search(...).
Finally, the controller layer has a GetMapping:
@GetMapping(produces = "application/stream+json")
public Flux search(...) {
return searchService.search(...) //this is the Flux<JsonObject> parth
.subscriberContext(...) //stuff I need available during processing
.doOnComplete(() -> log.debug("DONE"));
}
My problem is that the Flux seems to never terminate.
Doing a call from Postman, for example, just shows the 'Loading...' part in the response section. When I terminate the process from my IDE, the results are then flushed to Postman and I see what I'm expecting. Also, the doOnComplete lambda never gets called.
What I noticed is that if I change the source of the Flux:
Flux.fromArray(...) //hardcoded array of lists of jsons
the doOnComplete lambda is called, the HTTP connection closes, and the results are displayed in Postman.
Any idea of what might be the issue?
Thanks.
You could create the Flux directly using code that looks like this. Note that I'm adding some assumed methods which you would need to implement based on how your SearchSupplier works:
Flux<SearchResultType> flux = Flux.generate(
        () -> new SearchSupplier(this, queryBody),
        (supplier, sink) -> {
            SearchResultType current = supplier.next();
            if (isNotLast(current)) {
                sink.next(current);
            } else {
                sink.complete();
            }
            return supplier;
        },
        supplier -> anyCleanupOperations(supplier)
);
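Assuming the element type is still the Vavr Either from the question, the repository layer could then return this generated Flux directly, instead of wrapping a Java Stream:
return flux
        .map(Either::get)
        .flatMap(Flux::fromIterable);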
I'm trying to get to grips with Spark Streaming but I'm having difficulty. I've read the documentation and analysed the examples, but I want to do something more than a word count on a text file/stream/Kafka queue, which is the only thing we seem to be allowed to understand from the docs.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);

groupByKeyList.foreachRDD(rdd -> {
    List<MyThing> myThingsList = new ArrayList<>();
    MyCalculationCode myCalc = new MyCalculationCode();

    rdd.foreachPartition(partition -> {
        while (partition.hasNext()) {
            Tuple2<String, byte[]> keyAndMessage = partition.next();
            MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
            myThingsList.add(aSingleMyThing);
        }
    });

    List<MyResult> results = myCalc.doTheStuff(myThingsList);
    //other code here to write results to file
});
When debugging I see that in the while (partition.hasNext()) the myThingsList has a different memory address than the declared List<MyThing> myThingsList in the outer forEachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called there are no results because the myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it myself (I don't mean a link to the single page of Spark documentation, but a section/paragraph or, better still, a link to JavaDoc that doesn't just provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is because Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the Executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first - is that really what you want to do? That means you're effectively dropping parts of the data. Perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>> (a rough sketch follows below).
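For illustration, a combineByKey along those lines might look roughly like this (numPartitions is a placeholder you would choose yourself, and HashPartitioner is org.apache.spark.HashPartitioner):
JavaPairDStream<String, List<byte[]>> groupedByKey = kafkaStream.combineByKey(
        value -> {                         // createCombiner: first value seen for a key
            List<byte[]> list = new ArrayList<>();
            list.add(value);
            return list;
        },
        (list, value) -> {                 // mergeValue: add another value for an existing key
            list.add(value);
            return list;
        },
        (left, right) -> {                 // mergeCombiners: merge partial lists from different partitions
            left.addAll(right);
            return left;
        },
        new HashPartitioner(numPartitions));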
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD for that - you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
Since I'm using Vert.x 3.1 in my stack, I was thinking of using the Future feature that the toolkit brings, but after reading the API it seems pretty limited to me. I cannot even find a way to make the future wait for an Observable.
Here is my code:
public Observable<CommitToOrderCommand> validateProductRestrictions(CommitToOrderCommand cmd) {
    Future<Observable<CommitToOrderCommand>> future = Future.future();

    orderRepository.getOrder(cmd, cmd.orderId)
            .flatMap(order -> validateOrderProducts(cmd, order))
            .subscribe(map -> checkMapValues(map, future, cmd));

    Observable<CommitToOrderCommand> result = future.result();

    if (errorFound) {
        throw MAX_QUANTITY_PRODUCT_EXCEED.create("Fail"/*restrictions.getBulkBuyLimit().getDescription())*/);
    }

    return result;
}

private void checkMapValues(Multimap<String, BigDecimal> totalUnitByRestrictions, Future<Observable<CommitToOrderCommand>> future,
        CommitToOrderCommand cmd) {
    for (String restrictionName : totalUnitByRestrictions.keySet()) {
        Restrictions restrictions = Restrictions.valueOf(restrictionName);
        if (totalUnitByRestrictions.get(restrictionName)
                .stream()
                .reduce(BigDecimal.ZERO, BigDecimal::add)
                .compareTo(restrictions.getBulkBuyLimit()
                        .getMaxQuantity()) == 1) {
            errorFound = true;
        }
    }
    future.complete(Observable.just(cmd));
}
In the onComplete of my first Observable I'm checking the results, and after it finishes I complete the future to unblock the operation.
But I'm seeing that future.result() does not block until future.complete() is invoked, as I was expecting. Instead, it just returns null.
Any idea what's wrong here?
Regards.
The Vert.x future doesn't block but rather works with a handler that is invoked when a result has been injected (see setHandler and isComplete).
If the outer layer of code requires an Observable, you don't need to wrap it in a Future, just return Observable<T>. Future<Observable<T>> doesn't make much sense, you're mixing two ways of doing async results.
Note that there are ways to collapse an Observable into a Future, but the difficulty is that an Observable may emit several items whereas a Future can hold only a single item. You already took care of that by collecting your results into a single emission of map.
Since this Observable only ever emits one item, if you want a Future out of it you should subscribe to it and call future.complete(yourMap) in the onNext handler. Also define an onError handler that will call future.fail, as sketched below.
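For example, collapsing that single map emission into a Vert.x Future (rather than a Future<Observable<...>>) could look roughly like this, reusing the chain from the question:
Future<Multimap<String, BigDecimal>> future = Future.future();

orderRepository.getOrder(cmd, cmd.orderId)
        .flatMap(order -> validateOrderProducts(cmd, order))
        .subscribe(
                map -> future.complete(map),   // onNext: the single collected map completes the future
                err -> future.fail(err));      // onError: propagate the failure to the future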