Server-side paging possible? - java

In a Java application, I am using Spring-Data to access a Neo4j database via the REST binding.
The spring.xml used as a context contains the following lines:
<neo4j:config graphDatabaseService="graphDatabaseService" />
<neo4j:repositories base-package="org.example.graph.repositories" />

<bean id="graphDatabaseService"
      class="org.springframework.data.neo4j.rest.SpringRestGraphDatabase">
    <constructor-arg index="0" value="http://example.org:1234/db/data" />
</bean>
My repository is very simple:
public interface FooRepository extends GraphRepository<Foo> {
}
Now, I would like to loop through some Foos:
for (Foo foo : fooRepository.findAll(new PageRequest(0, 5))) //...
However, the performance of this request is awful: It takes over 400 seconds (!) to complete.
After a bit of debugging, I found out that Spring-data generates the following query:
START `foo`=node:__types__(className="org.example.Foo") RETURN `foo`
It looks as if paging is done on the client, and all Foos (more than 100,000) are transferred to the client first. Issuing the above query to the Neo4j server via the web interface takes around 60 seconds. If I manually append "LIMIT 5", however, the execution time drops to around 0.5 seconds.
What am I doing wrong, so that Spring Data does not use server-side Cypher pagination?
According to the Programming Model documentation,
"the expensive operations like traversals and querying are executed efficiently on the server side by using the REST API to forward those calls."
Or does this exclude pagination?
What other options do I have in this case?

You can handle this on the server side as follows:
Provide your own query method in the repository.
The Cypher query should use ORDER BY, SKIP and LIMIT, parameterized so that you can pass in the skip and limit values on a per-page basis.
E.g.
start john=node:users("name:pangea")
match john-[:HAS_SEEN]-(movie)
return movie
order by movie.name?
skip 20
limit 10
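Wired into the repository, that could look roughly like this (a sketch only, not tested: the @Query placeholder syntax and the exact index lookup depend on your Spring Data Neo4j version, and findPaged is a made-up method name):

import org.springframework.data.neo4j.annotation.Query;
import org.springframework.data.neo4j.repository.GraphRepository;

public interface FooRepository extends GraphRepository<Foo> {

    // SKIP and LIMIT are executed in Cypher on the server, so only one
    // page of nodes is transferred over the REST binding.
    @Query("START `foo`=node:__types__(className=\"org.example.Foo\") "
         + "RETURN `foo` SKIP {0} LIMIT {1}")
    Iterable<Foo> findPaged(int skip, int limit);
}

Calling fooRepository.findPaged(0, 5) should then push the paging into the server-side query instead of fetching all nodes to the client.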

Related

How to stream large data from database via REST in Quarkus

I'm implementing a GET method in Quarkus that should send large amounts of data to the client. The data is read from the database using JPA/Hibernate, serialized to JSON, and then sent to the client. How can this be done efficiently without holding the whole data set in memory? I tried the following three possibilities, all without success:
Use getResultList from JPA and return a Response with the list as the body. A MessageBodyWriter will take care of serializing the list to JSON. However, this will pull all data into memory which is not feasible for a larger number of records.
Use getResultStream from JPA and return a Response with the stream as the body. A MessageBodyWriter will take care of serializing the stream to JSON. Unfortunately this doesn't work because it seems the EntityManager is closed after the JAX-RS method has been executed and before the MessageBodyWriter is invoked. This means that the underlying ResultSet is also closed and the writer cannot read from the stream any more.
Use a StreamingOutput as Response body. The same problem as in 2. occurs.
So my question is: what's the trick for sending large data read via JPA with Quarkus?
Do your results have to be all in one response? How about making the client request the next page of results until there is no next page - a typical REST API pagination exercise? The JPA backend will then fetch only that page from the database, so there is no moment when everything sits in memory.
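As an illustration of that comment, a minimal paged endpoint could look like this (a sketch, assuming the Fruit entity used in the answer below; the parameter names and defaults are arbitrary):

import java.util.List;
import javax.inject.Inject;
import javax.persistence.EntityManager;
import javax.ws.rs.DefaultValue;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/fruits")
public class PagedFruitResource {

    @Inject
    EntityManager em;

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<Fruit> page(@QueryParam("page") @DefaultValue("0") int page,
                            @QueryParam("size") @DefaultValue("50") int size) {
        // Only one page is ever materialized in memory; the client keeps
        // requesting the next page until an empty list comes back.
        return em.createQuery("select f from Fruit f order by f.id", Fruit.class)
                 .setFirstResult(page * size)
                 .setMaxResults(size)
                 .getResultList();
    }
}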
Based on your requirements you have two options:
Option 1:
Take the HATEOAS approach (https://restfulapi.net/hateoas/), a standard pattern for exchanging large data sets over REST. With this approach, the server quickly responds to the first request with a set of HATEOAS URIs, where each URI represents one group of elements. You generate these URIs based on the data size, and the client code takes on the responsibility of calling each URI individually, as a REST API, to get the actual data. Here too you can consider a reactive style to benefit from stream processing with a small memory footprint.
Option 2:
As suggested by @Serkan above, continuously stream the result set from the database to the client as the REST response. You need to check the timeout settings of any gateway sitting between the client and the service; if there is no gateway, you are good. You can take advantage of reactive programming at all layers to achieve continuous streaming: "DAO/data access layer" --> "service layer" --> REST controller --> client. Spring Reactor is compliant with JAX-RS as well (https://quarkus.io/guides/getting-started-reactive). This is the best architectural style when dealing with large data processing.
Here you have some resources that can help you with this:
Using reactive Hibernate: https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/Large.20datasets.20using.20reactive.20SQL.20clients
Paging vs Forward only ResultSets: https://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
The last article is for SpringBoot, but the idea can also be implemented with Quarkus.
------------Edit:
OK, I've worked out an example where I do a batch select. I did it with Panache, but you can easily do it without it as well.
I'm returning a ScrollableResults and using it in the REST resource to stream the rows via SSE (server-sent events) to the client.
------------Edit 2:
I've added setFetchSize to the query. You should play with this number and set it between 1 and 50. With a value of 1, the database rows are fetched one by one, which mimics streaming the most closely and uses the least memory, but the I/O between the database and the application happens more often.
And the usage of a StatelessSession is highly recommended when doing bulk operations like this.
import static javax.ws.rs.core.MediaType.APPLICATION_JSON_TYPE;
import static javax.ws.rs.core.MediaType.SERVER_SENT_EVENTS;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import javax.persistence.Entity;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.sse.Sse;
import javax.ws.rs.sse.SseEventSink;
import org.hibernate.ScrollMode;
import org.hibernate.SessionFactory;

@Entity
public class Fruit extends PanacheEntity {
    public String name;
    // I've moved the logic from here to the REST resource,
    // otherwise you cannot close the session.
}

@Path("/fruits")
public class FruitResource {

    @GET
    @Produces(SERVER_SENT_EVENTS)
    public void fruitsStream(@Context Sse sse, @Context SseEventSink sink) {
        var sf = Fruit.getEntityManager().getEntityManagerFactory()
                .unwrap(SessionFactory.class);
        // The stateless session bypasses the persistence context, and the
        // forward-only scroll keeps only the current row in memory.
        try (var session = sf.openStatelessSession();
             var scrollableResults = session.createQuery("select f from Fruit f")
                     .setFetchSize(1)
                     .scroll(ScrollMode.FORWARD_ONLY)) {
            while (scrollableResults.next()) {
                sink.send(sse.newEventBuilder()
                        .data(scrollableResults.get(0))
                        .mediaType(APPLICATION_JSON_TYPE)
                        .build());
            }
            sink.close();
        }
    }
}
Then I call this REST endpoint like this (via httpie):
> http :8080/fruits --stream
data: {"id":9996,"name":"applecfcdd592-1934-4f0e-a6a8-2f88fae5d14c"}
data: {"id":9997,"name":"apple7f5045a8-03bd-4bf5-9809-03b22069d9f3"}
data: {"id":9998,"name":"apple0982b65a-bc74-408f-a6e7-a165ec3250a1"}
data: {"id":9999,"name":"apple2f347c25-d0a1-46b7-bcb6-1f1fd5098402"}
data: {"id":10000,"name":"apple65d456b8-fb04-41da-bf07-73c962930629"}
Hope this helps you.

Spring JPA: How to run multi sql query in one round?

In Spring, if we define a method in a repository like findByName(String name), we can call this method to retrieve data. What I want is a way to call two or more such methods and have Spring send the queries to the database in one round instead of two. I would like to optimize performance in cases where I am certain that several SQL queries will be sent together.
Update: one round means that we send multiple SQL queries over one connection. The objective is to avoid more than one round-trip time when more than one SQL query is about to be sent.
e.g., query 1 is select * from table where xx=bb
query 2 is select * from another_table where zz=cc
In the trivial way, we would send the two queries like this:
1. send query 1 by calling the repository's findByXx method
2. send query 2 by calling the repository's findByZz method
In the above case, query 2 is only sent after query 1's response has come back. This is a waste, IMHO. I am seeking a way to send these two queries at once and get the answers at once.
If you want to keep the database connection open between these two queries, you must set up a transaction manager in your JPA configuration:
<bean id="txManager" class="org.springframework.orm.jpa.JpaTransactionManager">
<property name="entityManagerFactory" ref="yourEntityManagerFactory" />
</bean>
<tx:annotation-driven transaction-manager="txManager" />
This implies that when you annotate a @Service's method with @Transactional (or the whole class), the same session/connection will be kept between your queries.
For more info: https://www.baeldung.com/transaction-configuration-with-jpa-and-spring
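For illustration, a service along these lines keeps both queries on one connection, although they are still executed as two sequential statements rather than one batched round trip (a sketch: the repositories, finder methods and the Results holder are hypothetical):

import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class CombinedQueryService {

    private final FooRepository fooRepository; // hypothetical repository
    private final BarRepository barRepository; // hypothetical repository

    public CombinedQueryService(FooRepository fooRepository,
                                BarRepository barRepository) {
        this.fooRepository = fooRepository;
        this.barRepository = barRepository;
    }

    @Transactional(readOnly = true)
    public Results loadBoth(String bb, String cc) {
        // Both finders run on the connection bound to this transaction;
        // no connection is acquired and released between the two queries.
        List<Foo> foos = fooRepository.findByXx(bb);
        List<Bar> bars = barRepository.findByZz(cc);
        return new Results(foos, bars); // hypothetical holder class
    }
}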

Spring Integration and Transaction Management - how difficult need it be?

Using Spring Integration, I am trying to build a simple message-producing component. Basically something like this:
<jdbc:inbound-channel-adapter
        channel="from.database"
        data-source="dataSource"
        query="SELECT * FROM my_table"
        update="DELETE FROM my_table WHERE id IN (:id)"
        row-mapper="someRowMapper">
    <int:poller fixed-rate="5000">
        <int:transactional/>
    </int:poller>
</jdbc:inbound-channel-adapter>

<int:splitter
        id="messageProducer"
        input-channel="from.database"
        output-channel="to.mq" />

<jms:outbound-channel-adapter
        channel="to.mq"
        destination="myMqQueue"
        connection-factory="jmsConnectionFactory"
        extract-payload="true" />
<beans:bean id="myMqQueue" class="com.ibm.mq.jms.MQQueue">
<!-- properties omitted --!>
</beans:bean>
The "messageProducer" may produce several messages per poll but not necessarily one per row.
My concern is that I want to make sure that rows are not deleted from my_table unless the messages produced have been committed to the MQ channel.
On the other hand, I will accept that in case of a database or network failure the rows are not deleted, causing duplicate messages to be produced. In other words, I will settle for a non-XA one-phase commit with possible duplicates.
When trying to figure out what to put in my Spring configuration, I quickly get lost in endless discussions about transaction managers, AOP and transaction advice chains, which I find difficult to understand - although I know I ought to.
But I fear that I will spend a lot of time cooking up a configuration that is not really necessary for my problem at hand.
So - my question is: Can it be that simple - or do I need to provide explicit configuration for transaction synchronization?
And can I do something similar with a JDBC/JMS mix?
I'd say "Yes".
Please, read Dave Syer's article about Best effort 1PC, where the ChainedTransactionManager came from.
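For reference, a best-effort 1PC setup for the configuration above might look like this (a sketch: the bean ids are arbitrary; ChainedTransactionManager starts transactions in list order and commits in reverse order, so the JMS send commits before the JDBC delete, which gives exactly the "duplicates possible, no loss" behaviour you describe):

<beans:bean id="jdbcTransactionManager"
    class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <beans:property name="dataSource" ref="dataSource" />
</beans:bean>

<beans:bean id="jmsTransactionManager"
    class="org.springframework.jms.connection.JmsTransactionManager">
    <beans:property name="connectionFactory" ref="jmsConnectionFactory" />
</beans:bean>

<!-- Commits in reverse order: JMS first, then JDBC. -->
<beans:bean id="chainedTransactionManager"
    class="org.springframework.data.transaction.ChainedTransactionManager">
    <beans:constructor-arg>
        <beans:list>
            <beans:ref bean="jdbcTransactionManager" />
            <beans:ref bean="jmsTransactionManager" />
        </beans:list>
    </beans:constructor-arg>
</beans:bean>

<!-- Then reference it from the poller: -->
<int:poller fixed-rate="5000">
    <int:transactional transaction-manager="chainedTransactionManager" />
</int:poller>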

Spring JDBC Adapter in Cluster mode

I am using the Spring JDBC inbound channel adapter in my web application. If I deploy this application in a clustered environment, two or more instances pick up the same job and run it.
Can anybody help me overcome this issue by changing the Spring configuration?
I have attached my Spring configuration.
<int-jdbc:inbound-channel-adapter
        query="SELECT JOBID, JOBKEY, JOBPARAM
               FROM BATCHJOB
               WHERE JOBSTATUS = 'A'"
        max-rows-per-poll="1"
        channel="inboundAdhocJobTable"
        data-source="dataSource"
        row-mapper="adhocJobMapper"
        update="delete from BATCHJOB where JOBKEY in (:jobKey)">
    <int:poller fixed-rate="1000">
        <int:advice-chain>
        </int:advice-chain>
    </int:poller>
</int-jdbc:inbound-channel-adapter>
Unfortunately this will not be possible without some sort of synchronization. Additionally, using the database as a message queue is not a good idea (http://mikehadlow.blogspot.de/2012/04/database-as-queue-anti-pattern.html). I'd try different approaches:
Use some sort of message bus plus a message store to hold the job objects, rather than executing SQL directly. In this case you'll have to change the way jobs are stored: either use a message-store-backed channel (Spring Integration only; see the sketch after this list) or push the jobs to a message queue like RabbitMQ.
I'm not 100% sure, but I remember that Spring Batch offers something similar, like master/slave job splitting and synchronization. Maybe have a look there.
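A rough sketch of the message-store idea from the first approach (an assumption-laden sketch: JdbcMessageStore ships with the spring-integration-jdbc module, but its package and the required schema scripts vary across Spring Integration versions):

<bean id="jdbcMessageStore"
    class="org.springframework.integration.jdbc.store.JdbcMessageStore">
    <constructor-arg ref="dataSource" />
</bean>

<!-- A queue channel backed by the shared store: each job message is
     delivered to exactly one consumer across the cluster. -->
<int:channel id="inboundAdhocJobTable">
    <int:queue message-store="jdbcMessageStore" />
</int:channel>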

Spring integration - Appropriate pattern for collating/batching service calls

I have a remote service that I'm calling to load pricing data for a product, when a specific event occurs. Once loaded, the product pricing is then broadcast for another consumer to process elsewhere.
Rather than call the remote service on every event, I'd like to batch the events into small groups, and send them in one go.
I've cobbled together the following pattern based on an Aggregator. Although it works, a lot of it 'smells' -- especially my SimpleCollatingAggregator. I'm new to Spring Integration and EIP in general, and suspect I'm misusing components.
The Code
My code is triggered elsewhere by calling a method on the following @Gateway:

public interface ProductPricingGateway {
    @Gateway(requestChannel="product.pricing.outbound.requests")
    public void broadcastPricing(ProductIdentifier productIdentifier);
}
This is then wired to an aggregator, as follows:
<int:channel id="product.pricing.outbound.requests" />
<int:channel id="product.pricing.outbound.requests.batch" />
<int:aggregator input-channel="product.pricing.outbound.requests"
output-channel="product.pricing.outbound.requests.batch" release-strategy="releaseStrategy"
ref="collatingAggregator" method="collate"
correlation-strategy-expression="0"
expire-groups-upon-completion="true"
send-partial-result-on-expiry="true"/>
<bean id="collatingAggregator" class="com.mangofactory.pricing.SimpleCollatingAggregator" />
<bean id="releaseStrategy" class="org.springframework.integration.aggregator.TimeoutCountSequenceSizeReleaseStrategy">
<!-- Release when: 10 Messages ... or ... -->
<constructor-arg index="0" value="10" />
<!-- ... 5 seconds since first request -->
<constructor-arg index="1" value="5000" />
</bean>
Here's the aggregator implementation:
public class SimpleCollatingAggregator {
    public List<?> collate(List<?> input)
    {
        return input;
    }
}
Finally, this gets consumed by the following @ServiceActivator:

@ServiceActivator(inputChannel="product.pricing.outbound.requests.batch")
public void fetchPricing(List<ProductIdentifier> identifiers)
{
    // omitted
}
Note: In practice, I'm also using @Async to keep the calling code as quick to return as possible. I have a bunch of questions about that too, which I'll move to a separate question.
Question 1:
Given what I'm trying to achieve, is an aggregator pattern an appropriate choice here? This feels like a lot of boilerplate -- is there a better way?
Question 2:
I'm using a fixed correlation value of 0, to effectively say: 'It doesn't matter how you group these messages, take 'em as they come.'
Is this an appropriate way of achieving this?
Question 3:
SimpleCollatingAggregator simply looks wrong to me.
I want it to receive my individual inbound ProductIdentifier objects, group them into batches, and pass them along. This works, but is it appropriate? Are there better ways of achieving the same thing?
Q1: Yes, but see Q3 and the further discussion below.
Q2: That is the correct way to say 'no correlation needed' (but you need expire-groups-upon-completion, which you have).
Q3: In this case, you don't need a custom Aggregator, just use the default (remove the ref and method attributes).
Note that the aggregator is a passive component; the release is triggered by the arrival of a new message; hence the second part of your release strategy will only kick in when a new message arrives (it won't spontaneously release the group after 5 seconds).
However, you can configure a MessageGroupStoreReaper for that purpose: http://static.springsource.org/spring-integration/reference/html/messaging-routing-chapter.html#aggregator
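A sketch of that reaper wiring, following the reference docs (assumptions: you also give the aggregator a message-store attribute pointing at the same store, and the task namespace is declared; combined with your send-partial-result-on-expiry="true", expired groups are released as partial batches):

<bean id="messageStore" class="org.springframework.integration.store.SimpleMessageStore" />

<bean id="reaper" class="org.springframework.integration.store.MessageGroupStoreReaper">
    <property name="messageGroupStore" ref="messageStore" />
    <!-- expire groups older than 5 seconds -->
    <property name="timeout" value="5000" />
</bean>

<task:scheduler id="scheduler" pool-size="1" />

<task:scheduled-tasks scheduler="scheduler">
    <!-- check for expired groups once per second -->
    <task:scheduled ref="reaper" method="run" fixed-rate="1000" />
</task:scheduled-tasks>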
