How to stream large data from a database via REST in Quarkus - java

I'm implementing a GET method in Quarkus that should send large amounts of data to the client. The data is read from the database using JPA/Hibernate, serialized to JSON, and then sent to the client. How can this be done efficiently without having the whole data in memory? I tried the following three possibilities, all without success:
Use getResultList from JPA and return a Response with the list as the body. A MessageBodyWriter will take care of serializing the list to JSON. However, this pulls all data into memory, which is not feasible for a larger number of records.
Use getResultStream from JPA and return a Response with the stream as the body. A MessageBodyWriter will take care of serializing the stream to JSON. Unfortunately this doesn't work because it seems the EntityManager is closed after the JAX-RS method has been executed and before the MessageBodyWriter is invoked. This means that the underlying ResultSet is also closed and the writer cannot read from the stream any more.
Use a StreamingOutput as the Response body. The same problem as in option 2 occurs.
So my question is: what's the trick for sending large data read via JPA with Quarkus?

Do your results have to be all in one response? How about making the client request the next page of results until there is no next page - a typical REST API pagination exercise? The JPA backend will then only fetch that page from the database, so there is never a moment when everything sits in memory.
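For illustration, a minimal sketch of such a paged endpoint using Panache (the Fruit entity from the answer further down; the paths, resource name and page size are just placeholders):

import java.util.List;
import javax.ws.rs.DefaultValue;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;
import io.quarkus.panache.common.Page;

@Path("/fruits-paged")
public class PagedFruitResource {

    // Each call loads only one page from the database, so memory usage
    // stays bounded by the page size regardless of the table size.
    @GET
    public List<Fruit> page(@QueryParam("page") @DefaultValue("0") int pageIndex,
                            @QueryParam("size") @DefaultValue("50") int pageSize) {
        return Fruit.findAll().page(Page.of(pageIndex, pageSize)).list();
    }
}

The client keeps incrementing page until it receives an empty list.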

Based on your requirements, you have two options:
Option 1:
Take the HATEOAS approach (https://restfulapi.net/hateoas/), one of the standard patterns for exchanging large data sets over REST. With this approach the server quickly responds to the first request with a set of HATEOAS URIs, where each URI represents one group of elements. You generate these URIs based on the data size, and the client code then takes responsibility for calling each URI individually as a REST API to get the actual data (a sketch follows below). In this option too you can consider a reactive style to get more advantage of streaming processing with a small memory footprint.
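One possible shape for that first response, sketched with JAX-RS (the chunk size, paths and reuse of the Fruit entity are illustrative assumptions, not from the question):

import java.util.ArrayList;
import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.UriInfo;

@Path("/fruits")
public class FruitChunkResource {

    private static final long CHUNK_SIZE = 1000; // illustrative group size

    // First, cheap response: no data, just one URI per group of elements.
    // The client then calls each URI individually to fetch the actual data.
    @GET
    @Path("/chunks")
    public Response chunks(@Context UriInfo uriInfo) {
        long total = Fruit.count(); // Panache-style count, as in the answer below
        List<String> links = new ArrayList<>();
        for (long offset = 0; offset < total; offset += CHUNK_SIZE) {
            links.add(uriInfo.getBaseUriBuilder()
                             .path("fruits-paged")
                             .queryParam("page", offset / CHUNK_SIZE)
                             .queryParam("size", CHUNK_SIZE)
                             .build()
                             .toString());
        }
        return Response.ok(links).build();
    }
}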
Option 2:
As suggested by @Serkan above, continuously stream the result set from the database to the client as the REST response. Make sure any gateway between the client and the service is configured with appropriate timeout settings; if there is no gateway you are fine. You can take advantage of reactive programming at all layers to achieve continuous streaming: "DAO/data access layer" --> "service layer" --> REST controller --> client. Reactive support integrates with JAX-RS as well: https://quarkus.io/guides/getting-started-reactive. This is the best architectural style when dealing with large data processing (a sketch follows below).
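A minimal sketch of such a reactive SSE endpoint in Quarkus with Mutiny (the data source here is a stand-in; in a real service the Multi would come from a reactive data layer such as Hibernate Reactive or the reactive SQL clients linked below):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import io.smallrye.mutiny.Multi;

@Path("/fruits-reactive")
public class ReactiveFruitResource {

    // The JAX-RS runtime subscribes to the Multi and writes each item as a
    // server-sent event, so items flow to the client one at a time and the
    // full result set is never held in memory.
    @GET
    @Produces(MediaType.SERVER_SENT_EVENTS)
    public Multi<String> stream() {
        // Stand-in source; replace with a Multi backed by your database.
        return Multi.createFrom().range(0, 10_000).map(i -> "fruit-" + i);
    }
}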

Here you have some resources that can help you with this:
Using reactive Hibernate: https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/Large.20datasets.20using.20reactive.20SQL.20clients
Paging vs Forward only ResultSets: https://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
The last article is for Spring Boot, but the idea can also be implemented with Quarkus.
------------Edit:
OK, I've worked out an example where I do a batch select. I did it with Panache, but you can easily do it without it as well.
I return a ScrollableResults and use it in the REST resource to stream the rows via SSE (server-sent events) to the client.
------------Edit 2:
I've added setFetchSize to the query. You should play with this number and set it somewhere between 1 and 50. With a value of 1 the database rows are fetched one by one, which mimics streaming the most and uses the least memory, but the I/O between the database and the application becomes more frequent.
Using a StatelessSession is also highly recommended when doing bulk operations like this.
import static javax.ws.rs.core.MediaType.APPLICATION_JSON_TYPE;
import static javax.ws.rs.core.MediaType.SERVER_SENT_EVENTS;

import javax.persistence.Entity;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.sse.Sse;
import javax.ws.rs.sse.SseEventSink;

import org.hibernate.ScrollMode;
import org.hibernate.SessionFactory;

import io.quarkus.hibernate.orm.panache.PanacheEntity;

@Entity
public class Fruit extends PanacheEntity {
    public String name;
    // I've moved the query logic from here into the REST resource,
    // otherwise you cannot close the session
}

@Path("/fruits")
public class FruitResource {

    @GET
    @Produces(SERVER_SENT_EVENTS)
    public void fruitsStream(@Context Sse sse, @Context SseEventSink sink) {
        var sf = Fruit.getEntityManager().getEntityManagerFactory().unwrap(SessionFactory.class);
        try (var session = sf.openStatelessSession();
             var scrollableResults = session.createQuery("select f from Fruit f")
                                            .setFetchSize(1) // fetch rows one by one
                                            .scroll(ScrollMode.FORWARD_ONLY)) {
            while (scrollableResults.next()) {
                sink.send(sse.newEventBuilder()
                             .data(scrollableResults.get(0))
                             .mediaType(APPLICATION_JSON_TYPE)
                             .build());
            }
            sink.close();
        }
    }
}
Then I call this REST endpoint like this (via httpie):
> http :8080/fruits --stream
data: {"id":9996,"name":"applecfcdd592-1934-4f0e-a6a8-2f88fae5d14c"}
data: {"id":9997,"name":"apple7f5045a8-03bd-4bf5-9809-03b22069d9f3"}
data: {"id":9998,"name":"apple0982b65a-bc74-408f-a6e7-a165ec3250a1"}
data: {"id":9999,"name":"apple2f347c25-d0a1-46b7-bcb6-1f1fd5098402"}
data: {"id":10000,"name":"apple65d456b8-fb04-41da-bf07-73c962930629"}
Hope this helps you.

Related

How to make $batch POST request using Olingo v4 and Java?

We have to implement batch requests for OData in Java. I'm new to OData. Of the two following references, which one should be followed? Do we have to construct a batch request ourselves, or will it be done using the OData batch APIs? Can anyone please help on how to proceed with the implementation?
https://olingo.apache.org/doc/odata4/tutorials/batch/tutorial_batch.html
https://olingo.apache.org/doc/odata4/tutorials/od4_basic_batch_client.html
The batch request will be created automatically by the OData Client.
TL;DR:
A batch request is a REST call to a special endpoint $batch, with a well-defined payload type.
The payload consists of batch requests and, as a subtype, changesets. Both are used to club multiple requests into one, except that the requests in one changeset are expected to be atomic: either all the requests execute, or if one or more fails there should be a rollback (or similar) to prevent the others from persisting.
https://olingo.apache.org/doc/odata4/tutorials/od4_basic_batch_client.html
This link has the example for creating the client: create an entity, set some properties, put it in a changeset, and execute. In the background it will send a batch request in the OData $batch format as documented in
https://olingo.apache.org/doc/odata4/tutorials/batch/tutorial_batch.html
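Roughly, the client-side flow from that tutorial looks like this sketch (untested; the service root and entity/property names are placeholders):

import java.net.URI;
import org.apache.olingo.client.api.ODataClient;
import org.apache.olingo.client.api.communication.request.batch.BatchManager;
import org.apache.olingo.client.api.communication.request.batch.ODataBatchRequest;
import org.apache.olingo.client.api.communication.request.batch.ODataChangeset;
import org.apache.olingo.client.api.communication.request.cud.ODataEntityCreateRequest;
import org.apache.olingo.client.api.communication.response.ODataBatchResponse;
import org.apache.olingo.client.api.domain.ClientEntity;
import org.apache.olingo.client.api.domain.ClientObjectFactory;
import org.apache.olingo.client.core.ODataClientFactory;
import org.apache.olingo.commons.api.edm.FullQualifiedName;

public class BatchClientSketch {

    public static void main(String[] args) {
        String serviceRoot = "http://localhost:8080/odata.svc"; // placeholder
        ODataClient client = ODataClientFactory.getClient();

        // One batch request targeting the special $batch endpoint.
        ODataBatchRequest batch = client.getBatchRequestFactory().getBatchRequest(serviceRoot);
        BatchManager manager = batch.payloadManager();

        // Build an entity to create (names are placeholders).
        ClientObjectFactory factory = client.getObjectFactory();
        ClientEntity product = factory.newEntity(new FullQualifiedName("OData.Demo", "Product"));
        product.getProperties().add(factory.newPrimitiveProperty("Name",
                factory.newPrimitiveValueBuilder().buildString("Notebook")));

        URI target = client.newURIBuilder(serviceRoot).appendEntitySetSegment("Products").build();
        ODataEntityCreateRequest<ClientEntity> create =
                client.getCUDRequestFactory().getEntityCreateRequest(target, product);

        // Requests added to one changeset are executed atomically.
        ODataChangeset changeset = manager.addChangeset();
        changeset.addRequest(create);

        // Fires the $batch call; the client serializes the payload for you.
        ODataBatchResponse response = manager.getResponse();
        System.out.println(response.getStatusCode());
    }
}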

Akka HTTP using a response unmarshaller

I'm constructing a data pipeline using Akka Streams and Akka HTTP. The use case is quite simple: receive a web request from a user, which will do two things. First, create a session by calling a 3rd-party API; second, commit this session to some persistent storage. Once we have received the session, it will then proxy the original user request, adding the session data.
I have started working on the first branch of the data pipeline, the session processing, but I'm wondering if there is a more elegant way of unmarshalling the HTTP response from the 3rd-party API to a POJO. Currently I'm using Jackson.unmarshaller.unmarshal, which returns a CompletionStage<T> that I then have to unwrap into T. It's not very elegant, and I'm guessing that Akka HTTP has cleverer ways of doing this.
Here is the code I have right now
private final Source<Session, NotUsed> session =
    Source.fromCompletionStage(
            getHttp().singleRequest(getSessionRequest(), getMat()))
        .map(r -> Jackson.unmarshaller(Session.class).unmarshal(r.entity(), getMat()))
        .map(f -> f.toCompletableFuture().get())
        .alsoTo(storeSession);
Akka Streams offers you mapAsync, a stage to handle asynchronous computation in your pipeline in a configurable, non-blocking way.
Your code should become something like
Source.fromCompletionStage(
        getHttp().singleRequest(getSessionRequest(), getMat()))
    .mapAsync(4, r -> Jackson.unmarshaller(Session.class).unmarshal(r.entity(), getMat()))
    .alsoTo(storeSession);
Note that:
it is not just a matter of elegance in this case, as CompletableFuture.get is a blocking call. This can cause dreadful issues in your pipeline.
the int parameter required by mapAsync (parallelism) allows fine-tuning of how many parallel async operations can run at the same time.
More info on mapAsync can be found in the docs.

Jersey web service proxy

I am trying to implement a web service that proxies another service that I want to hide from external users of the API. Basically, I want to play middle man so I have the ability to add functionality to the hidden API, which is Solr.
I have the following code:
@POST
@Path("/update/{collection}")
public Response update(@PathParam("collection") String collection,
                       @Context Request request) {
    // extract URL params
    // update URL to target internal web service
    // put body from incoming request to outgoing request
    // send request and relay response back to original requestor
}
I know that I need to rewrite the URL to point to the internally available service, adding the parameters coming from either the URL or the body.
This is where I am confused: how can I access the original request body and pass it to the internal web service without having to unmarshal the content? The Request object does not seem to give me the methods to perform those actions.
I am looking for the objects I should be using, with potential methods that would help me. I would also like some documentation if someone knows of any; I have not really found anything targeting similar or portable behaviour.
Per section 4.2.4 of the JSR-311 spec, all JAX-RS implementations must provide access to the request body as byte[], String, or InputStream.
You can use UriInfo to get information on the query parameters. It would look something like this:
@POST
@Path("/update/{collection}")
public Response update(@PathParam("collection") String collection,
                       @Context UriInfo info, InputStream inputStream) {
    String fullPath = info.getAbsolutePath().toASCIIString();
    System.out.println("full request path: " + fullPath);

    // query params are also available from a map. query params can be repeated,
    // so the Map values are actually Lists. getFirst is a convenience method
    // to get the value of the first occurrence of a given query param
    String foo = info.getQueryParameters().getFirst("bar");

    // do the rewrite...
    String newURL = SomeOtherClass.rewrite(fullPath);

    // the InputStream will have the body of the request. use your favorite
    // HTTP client to make the request to Solr.
    String solrResponse = SomeHttpLibrary.post(newURL, inputStream);

    // send the response back to the client
    return Response.ok(solrResponse).build();
}
One other thought: it looks like you're simply rewriting the requests and passing them through to Solr. There are a few other ways you could do this.
If you happen to have a web server in front of your Java app server or Servlet container, you could potentially accomplish your task without writing any Java code. Unless the rewrite conditions were extremely complex, my personal preference would be to try doing this with Apache mod_proxy and mod_rewrite.
There are also libraries for Java that will rewrite URLs after they hit the app server but before they reach your code. For instance, https://code.google.com/p/urlrewritefilter/. With something like that, you'd only need to write a very simple method that invokes Solr, because the URL would be rewritten before it hits your REST resource. For the record, I haven't actually tried using that particular library with Jersey.
1/ For the question of the gateway that will hide the database or index, I would rather use an endpoint that is configured with @Path({regex}) (instead of rebuilding a regexp analyser in your endpoint).
Use this regex directly in the @Path; this is good practice.
Please take a look at another post that is close to this: @Path and regular expression (Jersey/REST)
For example, you can have a regexp like this one:
@Path("/user/{name : [a-zA-Z][a-zA-Z_0-9]*}")
2/ Second, in order to process all the requests from one endpoint, you will need a dynamic parameter. I would use a MultivaluedMap, which gives you the possibility to add params to the request without modifying your endpoint:
@POST
@Path("/search")
@Consumes(MediaType.APPLICATION_FORM_URLENCODED)
@Produces({"application/json"})
public Response search(MultivaluedMap<String, String> params) {
    // perform search operations (doSearch is a stand-in for your search logic)
    return doSearch(params);
}
3/ My third piece of advice is reuse: economy of code means fewer bugs.
It would be a pity to rewrite a REST API in order to perform Solr searches. You can hide the params and the endpoint, but it would be great to keep Solr's REST formatting of the URI params so you can reuse all of Solr's search logic directly in your API. This saves you a great deal of code, even while hiding your Solr instance behind your REST gateway server.
In this case you can imagine (see the sketch below):
1. Receive a query on the search gateway endpoint.
2. Transform the query to add your params, controls...
3. Execute the REST query on Solr (behind your gateway).
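A rough sketch of such a gateway using the standard JAX-RS client to forward the Solr-style params (the internal Solr URL and the parameter tweak are placeholders):

import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.Form;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.MultivaluedMap;
import javax.ws.rs.core.Response;

@Path("/search")
public class SearchGatewayResource {

    // Internal Solr URL, hidden from API consumers (placeholder).
    private static final String SOLR_SELECT =
            "http://internal-solr:8983/solr/collection1/select";

    private final Client client = ClientBuilder.newClient();

    @POST
    @Consumes(MediaType.APPLICATION_FORM_URLENCODED)
    @Produces(MediaType.APPLICATION_JSON)
    public Response search(MultivaluedMap<String, String> params) {
        // Step 2: adjust/control the parameters before forwarding.
        params.putSingle("wt", "json");

        // Step 3: forward the Solr-formatted params unchanged and relay the body.
        Response solr = client.target(SOLR_SELECT)
                              .request(MediaType.APPLICATION_JSON)
                              .post(Entity.form(new Form(params)));
        return Response.status(solr.getStatus())
                       .entity(solr.readEntity(String.class))
                       .build();
    }
}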

"Sessions" with Google Cloud Endpoints

This question is only to confirm that I'm clear about this concept.
As far as I understand, Google Cloud Endpoints is kind of Google's implementation of REST services, so it can't keep any "session" data in memory, therefore:
Users must send authentication data with each request.
All the data I want to use later on must be persisted; namely, with each API request I receive, I have to access the Datastore, do something, and store the data again.
Is this correct? And if so, is this actually good in terms of performance?
Yes, you can use the session; just add another parameter in your API method for the HttpServletRequest:
@ApiMethod
public MyResponse getResponse(HttpServletRequest req, @Named("infoId") String infoId) {
    // Use 'req' as you would in a servlet, e.g.
    String ipAddress = req.getRemoteAddr();
    ...
}
The Datastore is pretty quick, especially if you do a key lookup (as opposed to a query). If you use NDB, you get the benefit of automatic memcache caching of your lookups.
Yes, your Cloud Endpoints API backend code (Java or Python) is still running on App Engine, so you have the same access to all resources you would have on App Engine.
Though you can't set client-side cookies for sessions, you can still obtain a user for a request and store user-specific data in the datastore. As @Shay Erlichmen mentioned, if you couple the datastore with memcache and an in-context cache (as ndb does), you can make these lookups very quick.
To do this in either Python or Java, either allowed_client_ids or audiences will need to be specified in the annotation/decorator on the API and/or on the method(s). See the docs for more info.
Python:
If you want to get a user in Python, call
endpoints.get_current_user()
from within a request that has been annotated with allowed_client_ids or audiences. If this returns None, then there is no valid user (and you should return a 401).
Java:
To get a user on an annotated method (or a method contained in an annotated API), simply specify a user object in the request:
import com.google.appengine.api.users.User;
...
public Model insert(Model model, User user) throws OAuthRequestException, IOException {
    if (user == null) {
        // as in Python: no valid OAuth 2.0 token was sent with the request
        throw new OAuthRequestException("Invalid user.");
    }
    ...
}
and, as in Python, check whether user is null to determine if a valid OAuth 2.0 token was sent with the request.

GWT RequestFactory client scenarios

My understanding is that the GWT RequestFactory (RF) API is for building data-oriented services whereby a client-side entity can communicate directly with its server-side DAO.
My understanding is that when you fire a RF method from the client-side, a RequestFactoryServlet living on the server is what first receives the request. This servlet acts like a DispatchServlet and routes the request on to the correct service, which is tied to a single entity (model) in the data store.
I'm used to writing servlets that might pass the request on to some business logic (like an EJB), and then compute some response to send back. This might be a JSP view, some complicated JSON (Jackson) object, or anything else.
In all the RF examples, I see no sign of such servlets, and I'm wondering if they even exist in GWT-RF land. If the RequestFactoryServlet is automagically routing requests to the correct DAO and method, and the DAO method's result is what is returned in the response, then I can see a scenario where GWT RF doesn't even utilize traditional servlets. (1) Is this the case?
Regardless, there are times in my GWT application where I want to hit a specific URL, such as http://www.example.com?foo=bar. (2) Can I use RF for this, and if so, how?
I think if I could see two specific examples, side-by-side of GWT RF in action, I'd be able to connect all the dots:
Scenario #1 : I have a Person entity with methods like isHappy(), isSad(), etc. that would require interaction with a server-side DAO; and
Scenario #2 : I want to fire an HTTP request to http://www.example.com?foo=bar and manually inspect the HTTP response
If it's possible to accomplish both with the RF API, that would be my first preference. If the latter scenario can't be accomplished with RF, then please explain why and what is the GWT-preferred alternative. Thanks in advance!
1.- RequestFactory works not only for entities but also for services, so you can define any service on the server side with methods that you call from the client. Of course, when you use RF services they have to deal with certain types (primitives, boxed primitives, sets, lists and RF proxies).
@Service(value = RfService.class, locator = RfServiceLocator.class)
public interface TwService extends RequestContext {
    Request<String> parse(String value);
}

public class RfService {
    public String parse(String value) {
        return value.replace("a", "b");
    }
}
2.- RF is not designed to receive message payloads other than those the RF servlet produces, and the most you can do on the client side with RF is call those services when they are hosted on a different site (when you deploy your server and client sides on different hosts).
You can use other mechanisms in the GWT world to get data from other URLs; take a look at gwtquery Ajax and data-binding, or this article. A sketch with plain GWT RequestBuilder follows below.
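For scenario #2, a plain GWT RequestBuilder call (outside RequestFactory) lets you hit an arbitrary URL and inspect the raw HTTP response yourself; note that the browser's same-origin policy applies to such calls. A minimal sketch:

import com.google.gwt.http.client.Request;
import com.google.gwt.http.client.RequestBuilder;
import com.google.gwt.http.client.RequestCallback;
import com.google.gwt.http.client.RequestException;
import com.google.gwt.http.client.Response;

public void fetchFooBar() {
    RequestBuilder builder = new RequestBuilder(RequestBuilder.GET,
            "http://www.example.com?foo=bar");
    try {
        builder.sendRequest(null, new RequestCallback() {
            @Override
            public void onResponseReceived(Request request, Response response) {
                // Manually inspect status code and body.
                if (response.getStatusCode() == 200) {
                    String body = response.getText();
                    // ... use the response body
                }
            }

            @Override
            public void onError(Request request, Throwable exception) {
                // handle the failure
            }
        });
    } catch (RequestException e) {
        // sendRequest can throw before the call is made
    }
}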
