I'm building a data pipeline using Akka Streams and Akka HTTP. The use case is quite simple: receive a web request from a user, which triggers two things. First, create a session by calling a 3rd-party API; second, commit this session to persistent storage. Once the session has been received, the pipeline proxies the original user request with the session data added.
I have started working on the first branch of the data pipeline, the session processing, but I'm wondering whether there is a more elegant way of unmarshalling the HTTP response from the 3rd-party API into a POJO. Currently I'm using Jackson.unmarshaller(...).unmarshal, which returns a CompletionStage<T> that I then have to unwrap into T. It's not very elegant, and I'm guessing Akka HTTP has cleverer ways of doing this.
Here is the code I have right now:
private final Source<Session, NotUsed> session =
    Source.fromCompletionStage(
            getHttp().singleRequest(getSessionRequest(), getMat()))
        .map(r -> Jackson.unmarshaller(Session.class).unmarshal(r.entity(), getMat()))
        .map(f -> f.toCompletableFuture().get())
        .alsoTo(storeSession);
Akka Streams offers mapAsync, a stage to handle asynchronous computation in your pipeline in a configurable, non-blocking way.
Your code would become something like:
Source.fromCompletionStage(
        getHttp().singleRequest(getSessionRequest(), getMat()))
    .mapAsync(4, r -> Jackson.unmarshaller(Session.class).unmarshal(r.entity(), getMat()))
    .alsoTo(storeSession);
Note that:
it is not just a matter of elegance in this case: CompletableFuture.get is a blocking call, which can cause dreadful issues in your pipeline.
the int parameter required by mapAsync (parallelism) allows fine-tuning of how many parallel async operations can run at the same time.
More info on mapAsync can be found in the docs.
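The same blocking-vs-composing distinction can be seen with plain JDK types: mapAsync plays the role for a stream that thenCompose plays for a single stage. A minimal sketch (parseAsync is a hypothetical async step of my own, not Akka API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

class MapAsyncSketch {
    // Hypothetical async unmarshalling step, analogous to
    // Jackson.unmarshaller(...).unmarshal(...) returning a CompletionStage<T>.
    static CompletionStage<Integer> parseAsync(String body) {
        return CompletableFuture.supplyAsync(() -> Integer.parseInt(body.trim()));
    }

    public static void main(String[] args) {
        // Blocking style (what .toCompletableFuture().get() does): the calling
        // thread stalls until the result arrives -- avoid inside a stream stage.
        // int blocked = parseAsync("42").toCompletableFuture().get();

        // Non-blocking composition: the continuation runs when the stage
        // completes, without parking a thread. This is what mapAsync gives you.
        CompletionStage<String> pipeline =
            CompletableFuture.completedFuture(" 42 ")
                .thenCompose(MapAsyncSketch::parseAsync)
                .thenApply(n -> "session-" + n);

        System.out.println(pipeline.toCompletableFuture().join()); // session-42
    }
}
```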
Related
I'm implementing a GET method in Quarkus that should send large amounts of data to the client. The data is read from the database using JPA/Hibernate, serialized to JSON, and then sent to the client. How can this be done efficiently without having the whole data set in memory? I tried the following three possibilities, all without success:
Use getResultList from JPA and return a Response with the list as the body. A MessageBodyWriter will take care of serializing the list to JSON. However, this will pull all data into memory which is not feasible for a larger number of records.
Use getResultStream from JPA and return a Response with the stream as the body. A MessageBodyWriter will take care of serializing the stream to JSON. Unfortunately this doesn't work because it seems the EntityManager is closed after the JAX-RS method has been executed and before the MessageBodyWriter is invoked. This means that the underlying ResultSet is also closed and the writer cannot read from the stream any more.
Use a StreamingOutput as Response body. The same problem as in 2. occurs.
So my question is: what's the trick for sending large data read via JPA with Quarkus?
Do your results have to be all in one response? How about making the client request the next results page until there is no next page, a typical REST API pagination exercise? The JPA backend will then fetch only that page from the database, so there is never a moment when everything sits in memory.
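The page-until-done loop described above can be sketched in plain Java (fetchPage is a hypothetical stand-in for a JPA query using setFirstResult/setMaxResults):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PagingSketch {
    // Simulated table contents; in reality this lives in the database.
    static final List<String> DB = IntStream.rangeClosed(1, 10)
        .mapToObj(i -> "row-" + i).collect(Collectors.toList());

    // Stand-in for a JPA query with setFirstResult/setMaxResults.
    static List<String> fetchPage(int page, int size) {
        int from = Math.min(page * size, DB.size());
        int to = Math.min(from + size, DB.size());
        return DB.subList(from, to);
    }

    public static void main(String[] args) {
        int size = 4, page = 0;
        List<String> all = new ArrayList<>();
        List<String> batch;
        do {
            batch = fetchPage(page++, size);   // only one page in memory at a time
            all.addAll(batch);                 // here: stream each page to the client
        } while (batch.size() == size);        // a short page means we are done
        System.out.println(all.size());        // 10
    }
}
```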
Based on your requirements you have two options:
Option 1:
Take the HATEOAS approach (https://restfulapi.net/hateoas/), one of the standard patterns for exchanging large data sets over REST. With this approach the server quickly responds with a set of HATEOAS URIs in the first response, where each URI represents one group of elements. You generate these URIs based on the data size, and the client code takes responsibility for calling them individually as REST APIs to get the actual data. With this option you can also consider a reactive style to gain the advantage of streaming processing with a small memory footprint.
Option 2:
As suggested by @Serkan above, continuously stream the result set from the database to the client as the REST response. Here you need to watch the timeout settings of any gateway between the client and the service; if there is no gateway, you are fine. You can take advantage of reactive programming at all layers to achieve continuous streaming: "DAO/data access layer" -> "Service layer" -> "REST controller" -> "Client". Spring Reactor is compliant with JAX-RS as well (https://quarkus.io/guides/getting-started-reactive). This is the best architectural style for dealing with large data processing.
Here you have some resources that can help you with this:
Using reactive Hibernate: https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/Large.20datasets.20using.20reactive.20SQL.20clients
Paging vs Forward only ResultSets: https://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
The last article is for SpringBoot, but the idea can also be implemented with Quarkus.
------------Edit:
OK, I've worked out an example where I do a batch select. I did it with Panache, but you can do it easily also without it.
I'm returning a ScrollableResults, then using it in the Rest resource to stream the rows to the client via SSE (server-sent events).
------------Edit 2:
I've added setFetchSize to the query. You should play with this number and set it between 1 and 50. With a value of 1 the db rows are fetched one by one, which mimics streaming most closely and uses the least memory, but the I/O between the db and the app becomes more frequent.
And the usage of a StatelessSession is highly recommended when doing bulk operations like this.
@Entity
public class Fruit extends PanacheEntity {
    public String name;
    // I've moved the logic from here to the Rest resource,
    // otherwise you cannot close the session
}
@Path("/fruits")
public class FruitResource {
    @GET
    @Produces(SERVER_SENT_EVENTS)
    public void fruitsStream(@Context Sse sse, @Context SseEventSink sink) {
        var sf = Fruit.getEntityManager().getEntityManagerFactory().unwrap(SessionFactory.class);
        try (var session = sf.openStatelessSession();
             var scrollableResults = session.createQuery("select f from Fruit f")
                     .setFetchSize(1)
                     .scroll(ScrollMode.FORWARD_ONLY)) {
            while (scrollableResults.next()) {
                sink.send(sse.newEventBuilder()
                        .data(scrollableResults.get(0))
                        .mediaType(APPLICATION_JSON_TYPE)
                        .build());
            }
            sink.close();
        }
    }
}
Then I call this Rest endpoint like this (via httpie):
> http :8080/fruits --stream
data: {"id":9996,"name":"applecfcdd592-1934-4f0e-a6a8-2f88fae5d14c"}
data: {"id":9997,"name":"apple7f5045a8-03bd-4bf5-9809-03b22069d9f3"}
data: {"id":9998,"name":"apple0982b65a-bc74-408f-a6e7-a165ec3250a1"}
data: {"id":9999,"name":"apple2f347c25-d0a1-46b7-bcb6-1f1fd5098402"}
data: {"id":10000,"name":"apple65d456b8-fb04-41da-bf07-73c962930629"}
Hope this helps you.
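As a side note on Edit 2: the fetch-size trade-off is easy to quantify, since the number of db round trips for a forward-only scroll is roughly ceil(rows / fetchSize). A tiny sketch:

```java
class FetchSizeSketch {
    // Approximate number of db round trips for a forward-only scroll.
    static long roundTrips(long rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize;  // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(10_000, 1));   // 10000 -- closest to streaming, most I/O
        System.out.println(roundTrips(10_000, 50));  // 200   -- fewer trips, more rows buffered
    }
}
```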
I am currently on a project that builds microservices, and we are trying to move from the more traditional Spring Boot RestClient to the reactive stack, using Netty and WebClient as the HTTP client to connect to backend systems.
This is going well for backends with REST APIs; however, I'm still having difficulties using WebClient in services that connect to SOAP backends and Oracle databases, which still use traditional JDBC.
I managed to find a workaround online for the JDBC calls, which uses parallel schedulers to publish the result of the blocking JDBC call:
// the method that is called by the @Service
@Override
public Mono<TransactionManagerModel> checkTransaction(String transactionId, String channel, String msisdn) {
    return asyncCallable(() -> checkTransactionDB(transactionId, channel, msisdn))
        .onErrorResume(error -> Mono.error(error));
}
...
// the actual JDBC call
private TransactionManagerModel checkTransactionDB(String transactionId, String channel, String msisdn) {
    ...
    List<TransactionManagerModel> result =
        jdbcTemplate.query(CHECK_TRANSACTION, paramMap, new BeanPropertyRowMapper<>(TransactionManagerModel.class));
    ...
}
// generic async callable
private <T> Mono<T> asyncCallable(Callable<T> callable) {
    return Mono.fromCallable(callable).subscribeOn(Schedulers.parallel()).publishOn(transactionManagerJdbcScheduler);
}
and I think this works quite well.
For the SOAP calls, what I did was encapsulate the call in a Mono, while the SOAP call itself uses a CloseableHttpClient, which is obviously a blocking HTTP client.
// The method that is being 'reactive'
public Mono<OfferRs> addOffer(String transactionId, String channel, String serviceId, OfferRq request) {
    ...
    OfferRs result = adapter.addOffer(transactionId, channel, generateRequest(request));
    ...
}
// The SOAP adapter that uses a blocking HTTP client
public OfferRs addOffer(String transactionId, String channel, JAXBElement<OfferRq> request) {
    ...
    response = (OfferRs) getWebServiceTemplate().marshalSendAndReceive(url, request, webServiceMessage -> {
        try {
            SoapHeader soapHeader = ((SoapMessage) webServiceMessage).getSoapHeader();
            ObjectFactory headerFactory = new ObjectFactory();
            AuthenticationHeader authHeader = headerFactory.createAuthenticationHeader();
            authHeader.setUserName(username);
            authHeader.setPassWord(password);
            JAXBContext headerContext = JAXBContext.newInstance(AuthenticationHeader.class);
            Marshaller marshaller = headerContext.createMarshaller();
            marshaller.marshal(authHeader, soapHeader.getResult());
        } catch (Exception ex) {
            log.error("Failed to marshal SOAP header!", ex);
        }
    });
    return response;
    ...
}
My question is: is this implementation of the SOAP calls "reactive" enough that I won't have to worry about calls being blocked in some part of the microservice? I have already implemented the reactive stack; calling block() explicitly throws an exception, as it is not permitted when using Netty.
Or should I adopt the use of parallel Schedulers for the SOAP calls as well?
After some discussion I'll write an answer.
The Reactor documentation states that you should place blocking calls on their own schedulers. That's basically to keep the non-blocking part of Reactor going; if something comes in that blocks, Reactor falls back to traditional servlet behaviour, which means assigning one thread to each request.
Reactor has very good documentation about schedulers, their types, etc.
But short:
subscribeOn
When someone subscribes, Reactor enters what is called the assembly phase: starting from the subscription point it walks the operators backwards upstream until it finds a producer of data (for example a database, or another service). If it finds a subscribeOn operator anywhere during this phase, it places the entire chain on the Scheduler that operator defines. One good thing to know is that the placement of subscribeOn does not really matter: as long as it is found during the assembly phase, the entire chain is affected.
Example usage could be:
blocking calls to a database, slow calls using a blocking rest client, reading a file from the system in a blocking manner, etc.
publishOn
If you have publishOn somewhere in the chain, the chain will switch from the current scheduler to the designated scheduler at that specific point. So the placement of publishOn DOES matter: it switches schedulers exactly where it is placed. This operator is for when you want to run a specific part of the code on a specific scheduler.
Example usage could be:
you are doing some heavy, blocking CPU calculations at a specific point; you could switch to Schedulers.parallel(), which guarantees that the calculations are placed on separate cores to do the heavy CPU work, and when you are done you could switch back to the default scheduler.
The example above
Your SOAP calls should be placed on their own Scheduler if they are blocking; I think subscribeOn with Schedulers.boundedElastic() will be enough to get traditional servlet behaviour. If you are worried about having every blocking call on the same Scheduler, you could pass the Scheduler into the asyncCallable function and split the calls across different Schedulers.
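If it helps to see the idea outside Reactor: the plain-JDK analogue of subscribeOn with a blocking-work scheduler is handing the blocking call to a dedicated executor so the caller's thread never parks. A minimal sketch (all names are mine, not Reactor API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BlockingOffloadSketch {
    // Dedicated pool for blocking work, analogous to Schedulers.boundedElastic().
    static final ExecutorService BLOCKING_POOL = Executors.newFixedThreadPool(4);

    // Stand-in for a blocking SOAP or JDBC call.
    static String blockingSoapCall(String id) {
        try {
            Thread.sleep(50); // simulated network latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response-" + id;
    }

    public static void main(String[] args) {
        // The blocking call runs on BLOCKING_POOL; the caller composes on the
        // result without ever parking a "reactive" thread.
        CompletableFuture<String> result =
            CompletableFuture.supplyAsync(() -> blockingSoapCall("42"), BLOCKING_POOL)
                .thenApply(String::toUpperCase);

        System.out.println(result.join()); // RESPONSE-42
        BLOCKING_POOL.shutdown();
    }
}
```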
I have a use case where I need to send email to the users.
First I create the email body:
Mono<String> emailBody = ...cache();
And then I select users and send the email to them:
Flux.fromIterable(userRepository.findAllByRole(Role.USER))
    .map(User::getEmail)
    .doOnNext(email -> sendEmail(email, emailBody.block(), massSendingSubject))
    .subscribe();
What I don't like:
Without the cache() method, the emailBody Mono is recomputed on each iteration.
To get the emailBody value I use emailBody.block(), but maybe there's a reactive way to do this without calling block inside the Flux flow?
There are several issues in this code sample.
I'll assume that this is a reactive web application.
First, it's not clear how you are creating the email body; are you fetching things from a database or a remote service? If it is mostly CPU-bound (and not I/O), then you don't need to wrap it in a reactive type. If it should be wrapped in a Publisher and the email content is the same for all users, using the cache operator is not a bad choice.
Also, Flux.fromIterable(userRepository.findAllByRole(Role.USER)) suggests that you're calling a blocking repository from a reactive context.
You should never do heavy I/O operations in a doOn*** operator; those are meant for logging or light side effects. The fact that you need to .block() on it is another clue that you'll block your whole reactive pipeline.
Last one: you should not call subscribe anywhere in a web application. If this is bound to an HTTP request, you're triggering the reactive pipeline with no guarantee about resources or completion; subscribe triggers the pipeline but does not wait until it completes (it returns a Disposable).
A more "typical" sample of that would look like:
Flux<User> users = userRepository.findAllByRole(Role.USER);
String emailBody = emailContentGenerator.createEmail();
// sendEmail() should return Mono<Void> to signal when the send operation is done
Mono<Void> sendEmailsOperation = users
    .flatMap(user -> sendEmail(user.getEmail(), emailBody, subject))
    .then();
// something else should subscribe to that reactive type,
// you could plug that as a return value of a Controller for example
If you're somehow stuck with blocking components (the sendEmail one, for example), you should schedule those blocking operations on a specific scheduler to avoid blocking your whole reactive pipeline. For that, look at the Schedulers section of the Reactor reference documentation.
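To make the shape of that sample concrete with plain JDK types (a sketch only; sendEmail here is a hypothetical stand-in returning a completion signal), fanning out all the sends and completing when every one is done corresponds to the flatMap(...).then() above:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class MassMailSketch {
    // Hypothetical stand-in for a sendEmail returning an async completion signal.
    static CompletableFuture<Void> sendEmail(String address, String body) {
        // body is ignored here; a real sender would use it, of course.
        return CompletableFuture.runAsync(
            () -> System.out.println("sent to " + address));
    }

    public static void main(String[] args) {
        String body = "Hello!";                       // computed once, reused for everyone
        List<String> addresses = List.of("a@x.io", "b@x.io", "c@x.io");

        // Fan out all sends, then complete when every one is done:
        // the CompletableFuture analogue of flatMap(...).then().
        CompletableFuture<Void> allSent = CompletableFuture.allOf(
            addresses.stream()
                .map(a -> sendEmail(a, body))
                .toArray(CompletableFuture[]::new));

        allSent.join(); // a caller (or the framework) waits on the completion signal
    }
}
```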
I have some legacy code with SOAP services. Now I am building a REST API for some objects that may call one or more SOAP operations. I was looking into Spring Integration. From the docs:
In addition to wiring together fine-grained components, Spring Integration provides a wide selection of channel adapters and gateways to communicate with external systems.
The above statement sounds enticing. I was writing a REST microservice controller, a validation service, a REST-request-to-SOAP-request mapper, and a SOAP client. In some cases where there are multiple calls there is even more code to write, and I did write it in many cases.
At a high level, Spring Integration looks like a framework oriented toward async messages. My problem is that the call needs to be more or less synchronous, and performance is critical. Has anyone used Spring Integration for this problem, and can you share your experiences?
To complement Artem's answer, it's worth noting that if you're going to use one of the Spring Integration DSLs (Java, Groovy or Scala), then the (synchronous) DirectChannel will be picked by default by Spring Integration to wire up the endpoints of your integration flow. This means that as long as your endpoints stay synchronous and you rely on the default channels between them, the whole integration flow stays synchronous as well.
For instance (in Java DSL):
@Bean
public IntegrationFlow syncFlow() {
    return IntegrationFlows
        .from(/* get a REST message from microservice */)
        // here the DirectChannel is used by default
        .filter(/* validate (and filter out) incorrect messages */)
        // here the DirectChannel is used by default too
        .transform(/* map REST to SOAP */)
        // guess what would be here?
        .handle(/* send a message with SOAP client */)
        .get();
}
This absolutely doesn't mean you are tied to a synchronous flow forever. At any step you can go async or parallel. For example, if you decide to send SOAP messages in parallel, all you need to do is specify an appropriate channel before the SOAP client invocation:
@Bean
public IntegrationFlow syncFlow() {
    // ... the same as above ...
        .transform(/* map REST to SOAP */)
        .channel(c -> c.executor(Executors.newCachedThreadPool())) // see (1)
        .handle(/* send a message with SOAP client */)
        .get();
}
(1) From this point on, the downstream flow will be processed in parallel thanks to the use of an ExecutorChannel.
Note that message endpoints may also behave asynchronously depending on their logic.
I've used Spring Integration for building synchronous integration flows in my home and work projects and it's proven to be a very powerful yet flexible solution.
One of the first class citizens in Spring Integration is MessageChannel abstraction. The simplest, synchronous, and therefore direct method invocation is DirectChannel.
Not sure what makes you think that everything in Spring Integration is async. Actually, it is always direct unless you tell it to be async.
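The direct-vs-executor distinction in the two answers above boils down to "call the handler on the sender's thread" versus "hand the message to an executor". A minimal plain-Java sketch of the idea (not Spring Integration API; the names are mine):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ChannelSketch {
    // DirectChannel-like: the handler runs synchronously on the sender's thread.
    static void directSend(String msg, Consumer<String> handler) {
        handler.accept(msg);
    }

    // ExecutorChannel-like: the handler runs on the channel's executor.
    static void executorSend(String msg, Consumer<String> handler, ExecutorService pool) {
        pool.execute(() -> handler.accept(msg));
    }

    public static void main(String[] args) throws InterruptedException {
        Consumer<String> soapClient =
            m -> System.out.println(Thread.currentThread().getName() + " handles " + m);

        directSend("rest-msg-1", soapClient);          // prints the main thread's name

        ExecutorService pool = Executors.newCachedThreadPool();
        executorSend("rest-msg-2", soapClient, pool);  // prints a pool thread's name
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```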
UPDATE: I upgraded the code to Java 8 without too much of a hassle. So I would like answers tied to Spring 4/Java 8.
I am working on a task to fix performance issues (Tomcat max thread count of 200 reached at a request rate of just 400/s, request latencies building up periodically, etc) in a Tomcat/Spring 4.2.4/Java 8 web mvc application.
It is a typical web application which looks up Mysql via Hibernate for small but frequent things like user info per request, then does actual data POST/GET to another web service via RestTemplate.
The code is in Java 7 style, as I have only just migrated to Java 8 and no new code has been written in the new style yet. (I am also back to using Spring after ages, so I'm not sure what would be best.)
As expected in a typical web application like this, the service layer calls other services for info and then passes that along to a call to the DAO, so I have some dependent callbacks here.
Setup
#EnableAsync is set
The flow of our Http requests goes from Controller -> Service -> DAO -> REST or Hibernate
Sample flow
Say Controller receives POST request R and expects a DeferredResult
Controller calls entityXService.save()
EntityXService calls userService.findUser(id)
UserService calls UserDAO.findUser(id) which in turn talks to Hibernate
UserService returns a Spring ListenableFuture to the caller
EntityXService awaits the user info (using a callback) and then calls EntityXDAO.save(user, R)
EntityXDAO calls AsyncRestTemplate.postForEntity(user, R)
EntityXDAO receives DeferredResult<...>, which is our data abstraction for the response.
EntityXDAO processes the response and converts to EntityXDTO
Eventually somehow the DeferredResult is sent back through the same chain as a response.
I am getting lost at step 3: how does EntityXService asynchronously call UserService.findUser(id) with an onSuccess callback into EntityXDAO.save(user, R)? And EntityXDAO.save(user, R) itself now returns a DeferredResult from the AsyncRestTemplate.
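To make my confusion concrete: in plain CompletableFuture terms (all names below are hypothetical stand-ins for the services above), I believe the dependency chain I want is something like this, but I don't see how to express it with DeferredResult:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

class ChainSketch {
    // Stand-in for UserService.findUser(id) (a ListenableFuture in the real code).
    static CompletionStage<String> findUser(long id) {
        return CompletableFuture.supplyAsync(() -> "user-" + id);
    }

    // Stand-in for EntityXDAO.save(user, request) via AsyncRestTemplate.
    static CompletionStage<String> save(String user, String request) {
        return CompletableFuture.supplyAsync(() -> "saved:" + user + ":" + request);
    }

    public static void main(String[] args) {
        // thenCompose chains the dependent async calls; the final stage could
        // then feed a DeferredResult via whenComplete(...).
        CompletionStage<String> result =
            findUser(7L).thenCompose(user -> save(user, "R"));

        System.out.println(result.toCompletableFuture().join()); // saved:user-7:R
    }
}
```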
Questions:
Is using DeferredResult a good way to get concurrency going in this application?
Is using Guava's ListenableFuture or Java 8 CompletableFuture going to help make it better in anyway, rather than using DeferredResult?
My BIGGEST question and confusion is how to arrange the DeferredResult from one service lookup to be used by another, and then finally set a DeferredResult of a completely different return type for the final response?
Is there an example of how to chain such callbacks and what is the recommended way to build such a flow? If this sounds completely wrong, is Java 7 going to be the right choice for this?
Thanks in advance!