spring jdbc RowCallbackHandler nightmare

I'm having trouble retrieving data from my database using Spring Jdbc. Here's my issue:
I have a getData() method on my DAO which is supposed to return ONE row from the result of some select statement. When invoked again, the getData() method should return the second row in a FIFO-like manner. I'm aiming for having only one result in memory at a time, since my table will get potentially huge in the future and bringing everything to memory would be a disaster.
If I were using regular jdbc code with a result set I could set its fetch size to 1 and everything would be fine. However I recently found out that Spring Jdbc operations via the JdbcTemplate object don't allow me to achieve such a behaviour (as far as I know... I'm not really knowledgeable about the Spring framework's features). I've heard of the RowCallbackHandler interface, and this post in the java ranch said I could somehow expose the result set to be used later (though using this method it stores the result set as many times over as there are rows, which is pretty dumb).
I have been playing with implementing the RowCallbackHandler interface for a day now and I still can't find a way to get it to retrieve one row from my select at a time. If anyone could enlighten me on this matter I'd greatly appreciate it.

JdbcTemplate.setFetchSize(int fetchSize):
Set the fetch size for this JdbcTemplate. This is important for processing large result sets: Setting this higher than the default value will increase processing speed at the cost of memory consumption; setting this lower can avoid transferring row data that will never be read by the application.
Default is 0, indicating to use the JDBC driver's default.

After a lot of searching and consulting with the rest of my team, we have come to the conclusion that this is not the best implementation path for our project. As Boris suggested, a different approach is the way to go. However, I'm doing something different: using SimpleJdbcTemplate instead and splitting my query so it fits in memory better. A "status" field in my records table will be responsible for telling whether the record was successfully processed or read, so I know which records to fetch next.
The question of whether Spring Jdbc is capable of the behaviour I mentioned in my OP is, however, still open. If anyone has an answer to that question I'm sure it would help someone else out there.
Cheers!

You can take a different approach. Create a query which returns just the IDs of the rows that you want to read. Keep this collection of IDs in memory. You would need a truly huge data set for the IDs alone to consume a lot of memory. Iterate over the collection and load the rows one by one, each referenced by its ID.
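The ID-first approach above could be sketched roughly as follows. This is only a minimal sketch and is not Spring-specific: IdQueueReader and the loadById function are hypothetical names introduced for illustration. In a real DAO the ID collection would presumably come from an ID-only select, and loadById would run a single-row query by primary key.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Optional;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical helper: keeps only the IDs in memory and loads one row per call.
final class IdQueueReader<I, T> {
    private final Queue<I> pendingIds;
    private final Function<I, T> loadById; // assumed to run a single-row lookup

    IdQueueReader(Collection<I> ids, Function<I, T> loadById) {
        this.pendingIds = new ArrayDeque<>(ids);
        this.loadById = loadById;
    }

    // Returns the next row in FIFO order; empty once every ID has been consumed
    // (or if the loader finds no row for that ID any more).
    Optional<T> getData() {
        I nextId = pendingIds.poll();
        if (nextId == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(loadById.apply(nextId));
    }
}
```

Each getData() call then holds at most one loaded row, at the cost of one extra query per row.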

We have the same issue:
- Fetching fetchSize records at a time with a raw JDBC PreparedStatement works well: when we stop the database after one batch of fetchSize records has been fetched, a JDBC connection error is thrown when resultSet.next() runs.
- Test fetchSize with JdbcTemplate:
PreparedStatementSetter preparedStatementSetter = ps -> { ps.setFetchSize(_exportParams.getFetchSize()); };
RowCallbackHandler rowCallbackHandler = _rs -> { /* do something here */ };
this.jdbcTemplate.query(_exportParams.getSqlscript(), preparedStatementSetter, rowCallbackHandler);
After getting the first record, we stop Postgres. The row callback handler can still handle the rest of the records without error.

Related

How to batch the fetches when pulling huge number of records from Oracle using JPA?

I have a Java application where we use Spring Data JPA to query our Oracle database. For one use case, I need to fetch all the records present in a table. The table currently has a record count of 400,000 thousand, and it might grow in the near future. I don't feel comfortable pulling all records into the JVM since we don't know how large they can be. So I want to configure the code to fetch a specific number of records at a time, say 50,000, and process them before moving on to the next 50,000. Is there a way I can achieve this with JPA? I came across the JDBC property hibernate.jdbc.fetch_size, which can be used with Hibernate. What I am trying to understand is: if I use repository.findAll() returning List<Entity>, how can a fetch size work, given that the List will hold all the entities? I was also looking into repository methods returning Stream<>, but I'm not sure whether I have to use that. Please advise if there is a better solution for this use case.
Thanks

With JPA you can use the pagination feature: you tell the repository how many results each page should contain (e.g. 50,000).
For more information, see https://www.baeldung.com/jpa-pagination

ScrollableResults with Hibernate/Oracle pulling everything into memory

I want a page of filtered data from an Oracle database table, but I have a query that might return tens of millions of records, so it's not feasible to pull it all into memory. I need to filter records out in a way that cannot be done via SQL, and return back a page of records. In other words, the pagination part must be done after the filtering.
So, I attempted to use Hibernate's ScrollableResults, thinking it would be a way to pull in only chunks at a time and iterate through them. So, I created it:
ScrollableResults results = query.setReadOnly(true)
.setFetchSize(500)
.setCacheable(false)
.scroll();
... and yet, it appears to pull everything into memory (2.5GB pulled in per query). I've seen another question and I've tried some of the suggestions, but most seem MySQL specific, and I'm using an Oracle 19 driver (e.g. Integer.MIN_VALUE is rejected outright as a fetch size in the Oracle driver).
There was a suggestion to use a stateless session (I'm using the EntityManager which has no stateless option), but my thought is that if we don't fetch many records (because we only want the first page of 200 filtered records), why would Hibernate have millions of records in memory anyway, even though we never scrolled over them?
It's clear to me that I don't understand how/why Hibernate pulls things into memory, or how to get it to stop doing so. Any suggestions on how to prevent it from doing so, given the constraints above?
Some things I'm going to try:
Different scroll modes. Maybe insensitive or forward only prevents Hibernate's need to pull everything in?
Clearing the session after we have our page. I'm closing the session (both using close() in the ScrollableResults and the EntityManager), but maybe an explicit clear() will help?

We were scrolling through the entire ScrollableResults to get the total count. This caused two things:
The Hibernate session cached entities.
The ResultSet in the driver kept rows that it has scrolled past.
Fixing this is specific to my case, really, but I did two things:
As we scroll, periodically clear the Hibernate session. Since we use the EntityManager, I had to do entityManager.unwrap(Session.class).clear(). Not sure if entityManager.clear() would do the job or not.
Make the ScrollableResults forward-only so the Oracle driver doesn't have to keep records in memory as it scrolls. This was as simple as doing .scroll(ScrollMode.FORWARD_ONLY). Only possible since we're only moving forward, though.
This allowed us to maintain a smaller memory footprint, even while scrolling through literally every single record (tens of millions).

Why would you scroll through all results just to get the count? Why not just execute a count query?

Processing large amount of data from PostgreSQL

I am looking for a way to process a large amount of data loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M of rows) and then process them in Java. The processing itself is not the problem but fetching the data from the database is. The fetching generally takes from 1-2 minutes. However, I need it to be much faster than that. I am loading the data from db straight to DTO using following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Where id is primary key, id_post and id_comment are foreign keys to respective tables and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were
Ditch hibernate for this query and try to use plain jdbc connection hoping that it will be faster.
Completely rewrite the processing algorithm from Java to SQL procedure.
Did I miss something, or are these my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"

Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while (sr.next()) {
    MyClass myObject = (MyClass) sr.get()[0];
    // ... process row for myObject ...
}
This will still remember every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you might detach the object from the entity manager after you're done. This can only be done if the objects are not modified. If they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while (sr.next()) {
    MyClass myObject = (MyClass) sr.get()[0];
    // ... process row for myObject ...
    entityManager.detach(myObject); // stop tracking the entity once it has been processed
}

If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and set some large fetch size (of the order of thousands) or else postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)
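The advice above (autocommit off, a large fetch size) might look roughly like this in plain JDBC. This is a sketch under assumptions rather than a tuned implementation: StreamingQuery, forEachRow and RowHandler are hypothetical names, and the caller is assumed to own and close the Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class StreamingQuery {
    // Hypothetical callback type, invoked once per row while the cursor is open.
    @FunctionalInterface
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    // Reads the result in batches of fetchSize rows; returns the number of rows handled.
    static long forEachRow(Connection conn, String sql, int fetchSize, RowHandler handler)
            throws SQLException {
        // As the answer notes, without this Postgres hands over the whole result at once.
        conn.setAutoCommit(false);
        long count = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(fetchSize); // e.g. a few thousand rows per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs);
                    count++;
                }
            }
        }
        conn.commit(); // end the read-only transaction
        return count;
    }
}
```

A call such as StreamingQuery.forEachRow(conn, "select id, id_post, id_comment, col_a, col_b from post_comment", 5000, row -> { ... }) would then only keep roughly one batch of rows in the driver at a time.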

Since you asked for ideas, I have seen this problem solved with the options below, depending on what fits your environment:
1) First try plain JDBC and Java: the code is simple, and you can do a test run on your database and data to see whether the improvement is enough. You will have to give up the other benefits of Hibernate here.
2) Building on point 1, use multi-threading with multiple connections pulling data into one queue, and then use that queue to process further or print as you need. You may also consider Kafka.
3) If the data is going to keep growing, you can consider Spark, which can process it all in memory and will be much faster.
These are some of the options; please like if these ideas help you.

Why do you keep 30M rows in memory?
It's better to rewrite it in pure SQL and use pagination based on id.
For example, if 5 is the id of the last comment you received, you issue:
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 limit 20
If you need to update the entire table, then you need to put the task in cron, but process it in parts there as well.
Memory is expensive, and downloading 30M rows is very costly; you need to process them in parts: 0-20, 20-n, n+20.
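The id-based pagination loop described above could be sketched as follows. KeysetPager and PageFetcher are hypothetical names; the fetcher is assumed to run the query from the answer with an added "order by id" (my assumption, so that "id > last id" reliably returns the next rows).

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.ToLongFunction;

final class KeysetPager {
    // Hypothetical page source, e.g. backed by:
    //   select ... from post_comment where id > ? order by id limit ?
    @FunctionalInterface
    interface PageFetcher<T> {
        List<T> fetchAfter(long lastId, int limit);
    }

    // Walks the table in id order, one page at a time; returns the number of rows processed.
    static <T> long processAll(PageFetcher<T> fetcher, ToLongFunction<T> idOf,
                               int pageSize, long startAfterId, Consumer<T> processor) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be positive: " + pageSize);
        }
        long lastId = startAfterId;
        long processed = 0;
        while (true) {
            List<T> page = fetcher.fetchAfter(lastId, pageSize);
            for (T row : page) {
                processor.accept(row);
                processed++;
            }
            if (page.size() < pageSize) {
                return processed; // a short page means there are no more rows
            }
            lastId = idOf.applyAsLong(page.get(page.size() - 1));
        }
    }
}
```

Only one page of rows is held in memory at a time, and each query can use the primary key index instead of an offset scan.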

JDBC Pagination: vendor specific sql versus result set fetchSize

There are a lot of different tutorials across the internet about pagination with JDBC/iterating over huge result set.
So, basically there are a number of approaches I've found so far:
Vendor specific sql
Scrollable result set (?)
Holding plain result set in a memory and map the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly, or by default equal
to the statement fetch size that was passed to it, determines the
number of rows that are retrieved in any subsequent trips to the
database for that result set. This includes any trips that are still
required to complete the original query, as well as any refetching of
data into the result set. Data can be refetched, either explicitly or
implicitly, to update a scroll-sensitive or
scroll-insensitive/updatable result set.
Cursor (?)
Custom seek method paging implemented by jooq
Sorry for mixing all of these together, but I need someone to clear this up for me.
I have a simple task where service consumer asks for results with a pageNumber and pageSize. Looks like I have two options:
Use vendor specific sql
Hold the connection/statement/result set in the memory and rely on jdbc fetchSize
In the latter case I use rxJava-jdbc, and if you look at the producer implementation, it holds the result set; then all you do is call request(long n) and another n rows are processed. Of course everything is hidden under the Observable sugar of rxJava. What I don't like about this approach is that you have to hold the resultSet between different service calls and have to clean up that resultSet if the client forgets to exhaust or close it. (Note: resultSet here is the Java ResultSet class, not the actual data.)
So, what is recommended way of doing pagination? Is vendor specific sql considered slow compared to holding the connection?
I am using oracle, ScrollableResultSet is not recommended to be used with huge result sets as it caches the whole result set data on the client side. proof

Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor specific SQL to fetch a "page" and I would do it just like that.

There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed/ every row is seen exactly once/ the results across pages are consistent? Multiple fetches from a single cursor guarantees that every row in the result is seen exactly once. Separate paginated queries might not-- new data could have been added or removed between queries being executed, your sort might not be fully deterministic, etc.

ScrollableResultSet caches result on client side - this requires memory resources. But for example PostgreSQL does it by default and nobody complains. Some databases simply use client's memory to hold the whole resultset. In most cases the database has to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution using rownum, as implemented by Hibernate, does not guarantee correct (consistent) results if data are modified between executions and the default isolation level is used.
It really depends on use case. Changing Oracle's init parameter for max. connections and also for open cursors requires database restart.
So ScrollableResultSet and cursors can be used only when you can predict amount of (concurrent) users.

How to limit number of rows returned from Oracle at the JDBC data source level?

Is there a way to limit the rows returned at the Oracle datasource level in a Tomcat application?
It seems maxRows is only available if you set it on the datasource in the Java code. Putting maxRows="2" on the datasource doesn't apply.
Is there any other way to limit the rows returned, without a code change?

It isn't something that is available at the configuration level. You may want to double check that it does what you want it to do anyway: see the javadoc for setMaxRows. With Oracle it is still going to fetch every row back for the query and then just drop the ones outside the range. You would really need to use rownum to make it work well with Oracle and you can't do that either in the configuration.

The question is why you want to limit the number of rows returned. There could be many reasons to do this. The first would be simply to limit the data returned by the database. In my opinion this makes no sense in most cases: if I wanted only certain data, I would use a different statement or add a filter condition. E.g. if you use Oracle's rownum, you don't know exactly which data is in the rows you get and which is not, as you just tell the database that you want rows x to y.
The second would be to limit memory usage and increase performance, so that the ResultSet you get from the JDBC driver does not hold all the data. You can limit the number of rows held by the ResultSet using Statement.setFetchSize(). If you move the cursor in the ResultSet beyond the number of rows fetched, the JDBC driver will fetch the missing data from the database. (In the case of Oracle, the database will store the data in a ref cursor which is directly accessed by the JDBC driver.)

Beware: the code below is provided purely as an example. It has not been tested. It thus may harm yourself or your computer or even punch you in the face.
If you want to avoid modifying your SQL queries but still want to have clean code (which means that your code stay maintainable), you may design the solution using wrappers. That is, by using a small set of classes wrapping existing ones, you may achieve what you want seamlessly for the rest of the application which will still think it is working with real DataSource, Connection and Statement.
1 - implement a StatementWrapper or PreparedStatementWrapper class, depending on what your application already uses. Those classes are wrappers around normal Statement or PreparedStatement instances. They are implemented simply by using the inner statement as a delegate which does all the work, except when this is a QUERY statement (Statement.executeQuery() method). Only in that precise situation, the wrapper surrounds the query with the two following strings: "SELECT * FROM (" and ") WHERE ROWNUM < "+maxRowLimit. For basic wrapper code, see how it looks for the DataSourceWrapper below.
2 - write one more wrapper : ConnectionWrapper which wraps a Connection which returns StatementWrapper in createStatement() and PreparedStatementWrapper in prepareStatement(). Those are the previously coded classes taking ConnectionWrapper's delegateConnection.createStatement()/prepareStatement() as construction arguments.
3 - repeat the step with a DataSourceWrapper. Here is a simple code example.
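import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;
public class DataSourceWrapper implements DataSource
{
    private final DataSource mDelegate;
    public DataSourceWrapper( DataSource delegate )
    {
        if( delegate == null ) { throw new NullPointerException( "Delegate cannot be null" ); }
        mDelegate = delegate;
    }
    public Connection getConnection( String username, String password ) throws SQLException
    {
        // ConnectionWrapper is the class written in step 2
        return new ConnectionWrapper( mDelegate.getConnection( username, password ) );
    }
    public Connection getConnection() throws SQLException
    {
        return new ConnectionWrapper( mDelegate.getConnection() );
    }
    // ... the remaining DataSource methods simply delegate to mDelegate ...
}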
4 - Finally, use that DataSourceWrapper as your application's DataSource. If you're using JNDI (NamingContext), this change should be trivial.
Coding all this is quick and very straightforward, especially if you're using a smart IDE like Eclipse or IntelliJ, which will implement the delegating methods automagically.
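The query rewriting from step 1 could be sketched as below. Note the swap: instead of a hand-written StatementWrapper class, this sketch uses a JDK dynamic proxy so it stays short. RowLimitingStatements, limitRows and wrap are hypothetical names, and like the code above it has not been run against a real Oracle database.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.Statement;

final class RowLimitingStatements {
    // Surrounds the query as described in step 1.
    // Note that "ROWNUM < n" yields at most n - 1 rows.
    static String limitRows(String query, int maxRowLimit) {
        if (maxRowLimit < 1) {
            throw new IllegalArgumentException("maxRowLimit must be positive: " + maxRowLimit);
        }
        return "SELECT * FROM (" + query + ") WHERE ROWNUM < " + maxRowLimit;
    }

    // Returns a Statement that rewrites executeQuery(String) and delegates everything else.
    static Statement wrap(Statement delegate, int maxRowLimit) {
        return (Statement) Proxy.newProxyInstance(
                Statement.class.getClassLoader(),
                new Class<?>[] {Statement.class},
                (proxy, method, args) -> {
                    Object[] actualArgs = args;
                    if (method.getName().equals("executeQuery") && args != null && args.length == 1) {
                        actualArgs = new Object[] {limitRows((String) args[0], maxRowLimit)};
                    }
                    try {
                        return method.invoke(delegate, actualArgs);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the delegate's own exception
                    }
                });
    }
}
```

A ConnectionWrapper's createStatement() could then return RowLimitingStatements.wrap(delegateConnection.createStatement(), maxRowLimit); a PreparedStatement would need the rewrite at prepareStatement() time instead, since its SQL is fixed when it is created.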

If you know you will be dealing with only one table, then define a view with rownum in the where statement to limit the number of rows. In this way, the number of rows is controlled at the DB and does not need to be specified as part of any query from a client application. If you want to change the number of rows returned, then redefine the view prior to executing query.
A more dynamic method would be to develop a procedure and pass in a number of rows, and have the procedure return a ref_cursor to your client. This would have the advantage of avoiding hard parsing on the DB, and increase performance.

Ok, a code change it'll have to be then.
The scenario is limiting an ad hoc reporting tool so that the end user doesn't pull back too many records and generate a report which is unusable.
We already use oracle cost based resource management.

Take a look at this page with a description of limiting how much is sucked into the Java App at a time. As another post points out, the DB will still pull all of the data, this is more for controlling network use, and memory on the Java side.
