ScrollableResults with Hibernate/Oracle pulling everything into memory - java

I want a page of filtered data from an Oracle database table, but I have a query that might return tens of millions of records, so it's not feasible to pull it all into memory. I need to filter records out in a way that cannot be done via SQL, and return back a page of records. In other words, the pagination part must be done after the filtering.
So I attempted to use Hibernate's ScrollableResults, thinking it would let me pull in only a chunk at a time and iterate through it. I created it like this:
ScrollableResults results = query.setReadOnly(true)
.setFetchSize(500)
.setCacheable(false)
.scroll();
... and yet, it appears to pull everything into memory (2.5GB pulled in per query). I've seen another question and I've tried some of the suggestions, but most seem MySQL specific, and I'm using an Oracle 19 driver (e.g. Integer.MIN_VALUE is rejected outright as a fetch size in the Oracle driver).
There was a suggestion to use a stateless session (I'm using the EntityManager which has no stateless option), but my thought is that if we don't fetch many records (because we only want the first page of 200 filtered records), why would Hibernate have millions of records in memory anyway, even though we never scrolled over them?
It's clear to me that I don't understand how/why Hibernate pulls things into memory, or how to get it to stop doing so. Any suggestions on how to prevent it from doing so, given the constraints above?
Some things I'm going to try:
Different scroll modes. Maybe an insensitive or forward-only mode removes Hibernate's need to pull everything in?
Clearing the session after we have our page. I'm already calling close() on both the ScrollableResults and the EntityManager, but maybe an explicit clear() will help?

It turned out we were scrolling through the entire ScrollableResults to get the total count. This caused two things:
The Hibernate session cached entities.
The ResultSet in the driver kept rows it had already scrolled past.
Fixing this is specific to my case, really, but I did two things:
As we scroll, periodically clear the Hibernate session. Since we use the EntityManager, I had to do entityManager.unwrap(Session.class).clear(). Not sure if entityManager.clear() would do the job or not.
Make the ScrollableResults forward-only so the Oracle driver doesn't have to keep records in memory as it scrolls. This was as simple as doing .scroll(ScrollMode.FORWARD_ONLY). It's only possible because we only move forward, though (a sketch of the combined approach follows below).
This allowed us to maintain a smaller memory footprint, even while scrolling through literally every single record (tens of millions).
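Roughly what the scrolling loop ended up looking like, as a hedged sketch: the Hibernate 5-style query API comes from the question, while the entity name, the filter step, and the clear interval of 500 are illustrative assumptions.
// Sketch: forward-only scroll with periodic session clearing.
// MyEntity, the filter, and the clear interval are illustrative.
Session session = entityManager.unwrap(Session.class);
ScrollableResults results = query.setReadOnly(true)
        .setFetchSize(500)
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY); // driver can discard rows already passed
try {
    int scanned = 0;
    while (results.next()) {
        MyEntity row = (MyEntity) results.get(0);
        // ... apply the non-SQL filter here, collect up to a page of matches ...
        if (++scanned % 500 == 0) {
            session.clear(); // drop accumulated entities from the persistence context
        }
    }
} finally {
    results.close();
}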

Why would you scroll through all results just to get the count? Why not just execute a count query?

Related

JDBC Pagination: vendor specific sql versus result set fetchSize

There are a lot of different tutorials across the internet about pagination with JDBC and iterating over huge result sets.
So, basically there are a number of approaches I've found so far:
Vendor specific sql
Scrollable result set (?)
Holding a plain result set in memory and mapping the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly, or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.
Cursor (?)
Custom seek method paging implemented by jooq
Sorry for mixing all of these together, but I need someone to clear this up for me.
I have a simple task where a service consumer asks for results with a pageNumber and pageSize. Looks like I have two options:
Use vendor specific sql
Hold the connection/statement/result set in memory and rely on the JDBC fetchSize
In the latter case I use rxjava-jdbc, and if you look at the producer implementation it holds the result set; all you then do is call request(long n) and another n rows are processed. Of course everything is hidden under the Observable sugar of RxJava. What I don't like about this approach is that you have to hold the ResultSet between different service calls and have to clean that ResultSet up if the client forgets to exhaust or close it. (Note: the ResultSet here is the Java ResultSet class, not the actual data.)
So, what is recommended way of doing pagination? Is vendor specific sql considered slow compared to holding the connection?
I am using Oracle; a scrollable ResultSet is not recommended for huge result sets, as it caches the whole result set's data on the client side. proof
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you run in parallel, the more resources will be occupied, and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors that can be open at a time).
Hibernate, for example, uses vendor specific SQL to fetch a "page" and I would do it just like that.
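For comparison, letting Hibernate generate that vendor-specific paging SQL is just a matter of setFirstResult/setMaxResults; a minimal sketch, where the entity, ordering, and page numbers are illustrative assumptions:
// Sketch: Hibernate's dialect turns first/max results into the vendor's paging SQL.
int pageNumber = 3;   // zero-based, illustrative
int pageSize = 100;
List<MyEntity> page = session.createQuery("from MyEntity e order by e.id", MyEntity.class)
        .setFirstResult(pageNumber * pageSize) // offset
        .setMaxResults(pageSize)               // page size
        .getResultList();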
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed, every row is seen exactly once, and the results across pages are consistent? Multiple fetches from a single cursor guarantee that every row in the result is seen exactly once. Separate paginated queries might not: new data could have been added or removed between queries being executed, your sort might not be fully deterministic, etc.
A scrollable ResultSet caches the result on the client side, which costs memory. But PostgreSQL, for example, does this by default and nobody complains. Some databases simply use the client's memory to hold the whole result set; in most cases the database would have to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution (using ROWNUM), as implemented by Hibernate, does not guarantee correct (consistent) results if data are modified between executions and the default isolation level is used.
It really depends on the use case. Changing Oracle's init parameters for the maximum number of connections and open cursors requires a database restart.
So scrollable result sets and cursors can be used only when you can predict the number of (concurrent) users.
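For reference, the "seek method" mentioned in the question avoids both offsets and long-lived cursors by filtering on the last key seen on the previous page. A hedged JDBC sketch follows; the table, columns, the indexed unique id, and the connection variable are illustrative assumptions, and the FETCH FIRST syntax needs Oracle 12c or later:
// Sketch: keyset ("seek") pagination - no offset, no cursor held open between calls.
String sql = "select id, data from items where id > ? order by id fetch first 100 rows only";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setLong(1, lastSeenId); // 0 for the first page; remember the last id returned
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            lastSeenId = rs.getLong("id");
            // ... map and return the row ...
        }
    }
}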

AJAX/JavaScript search performance better than Java/Oracle

I work with a very large, enterprise application written in Java which queries an Oracle SQL database. We use JavaScript on the front end, and are always looking for ways to improve upon the performance of the application with increased use.
The issue we're having right now is that we are sending a query, via Java, that results in 39,000 records. This is putting a significant load on the server and causes the browser to hang. I should mention that the data is relatively static (only changes about once a year) and we could use an xml map or something similar (flat file) since we know the exact results that will be returned each time.
The query, however, is still taking 1.5 - 2 minutes to load, which is unacceptable. I wanted to see if there were any suggestions as to how this scenario can be optimized, especially if it can be done any quicker with JavaScript (or jQuery) and using AJAX for the db connection. Or, are we going about this problem all wrong?
You want to determine if the slowness is due to:
the query executing in the database
the network being slow returning 39k records
the JavaScript working with the 39k records after the AJAX call completes
If you can run the query in SQL*Plus or Toad, this will eliminate the web tier and network altogether. If this is slow, then tune the query by checking indexes.
If after adding the appropriate indexes, the query is still slow, then you could prebuild the query's results and store the results in a table or you could create a materialized view.
Once you have the query performing well from sqlplus, then add the network back into the equation. Run it from your web browser and see what overhead is being added.
If it is still slow, then you need to determine if the problem is the act of AJAXing the data or if the slowness occurs after the page does something with the data (i.e. populating a data grid via JavaScript).
If the slowness is because the browser is waiting for the data, then you want to make sure it's only ever fetched once. You can do this by setting the cache headers on the AJAX response so the result is cached for a year, or you can store the results in localStorage.
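A hedged sketch of the response-header approach, assuming the data endpoint is a plain servlet; the servlet class and the prebuilt payload are hypothetical:
// Sketch: tell the browser to cache this rarely-changing response for a year.
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ReportDataServlet extends HttpServlet {
    private static final String PREBUILT_JSON = "[]"; // placeholder for the prebuilt result

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("application/json");
        resp.setHeader("Cache-Control", "public, max-age=31536000"); // one year, in seconds
        resp.getWriter().write(PREBUILT_JSON);
    }
}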
If the slowness is due to the browser working with the 39k rows (i.e. moving the data into a data grid), then you have a few options.
find a better approach or library
use pagination
You may find performance issues in each of these areas. Most likely the query just needs to be tuned; adding indexes, or pre-querying the data and storing it, will solve the problem.
Another thing to consider is if you really need 39k rows at one time. If you can, paginate at the db level so you're returning 100 rows per page.

Hibernate Session.flush() efficiency problems

Sorry in advance if someone has already answered this specific question but I have yet to find an answer to my problem so here goes.
I am working on an application (no, I cannot give the code, as it is for a job, so I'm sorry about that one) which uses DAOs and Hibernate and POJOs and all that stuff for communicating and writing to the database. This works well for the application, assuming I don't have a ton of data to check when I call Session.flush(). That being said, there is a page where a user can add any number of items to a product, and there is one particular case where there are something along the lines of 25 items. Each item has about eight fields apiece that are all stored in the database. When I call the flush it does save everything to the database, but it takes FOREVER to complete. The three lines I am calling are:
session.merge(myObject);    // schedules an UPDATE for any changed fields
session.flush();            // pushes the pending SQL to the database
session.refresh(myObject);  // re-reads the row back into the object
I have tried a number of different combinations of things to fix this problem and a number of different solutions, so coming back and saying "Don't use flush()" isn't much help, as saveOrUpdate() and the other Hibernate session methods don't seem to work. The only solution I can think of is to scrap the entire project (the code we got was inherited and poorly written, to say the least) or tell the user community to suck it up.
It is my understanding from the Hibernate API that if you want to write the data to the database, it runs a check on every item; if there is a difference, it creates a queue of update queries and then runs them. It seems as though this data is being updated every time, because the "DATE_CREATED" column in my database is different even if the other values are unchanged.
What I was wondering is if there was another way to prevent such a large committing of data or a way of excluding that particular column from the "check" hibernate does so I don't have to commit all 25 items if I only made a change to 1?
Thanks in advance.
Mike
Well, you really cannot avoid the dirty checking in Hibernate unless you use a StatelessSession. Of course, you lose a lot of features (lazy loading, etc.) with that, but it's up to you to make this decision.
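For reference, a rough sketch of what the StatelessSession route looks like; the session factory and entity names are illustrative, and note there is no persistence context, so no dirty checking and no lazy loading:
// Sketch: a StatelessSession issues the UPDATE directly, with no snapshot comparison.
StatelessSession ss = sessionFactory.openStatelessSession();
Transaction tx = ss.beginTransaction();
try {
    ss.update(myObject); // no dirty checking, no first-level cache
    tx.commit();
} catch (RuntimeException e) {
    tx.rollback();
    throw e;
} finally {
    ss.close();
}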
Another option: I would definitely try to use dynamic-update=true in your entity. Like:
@Entity
@DynamicUpdate // org.hibernate.annotations.DynamicUpdate; older Hibernate used @org.hibernate.annotations.Entity(dynamicUpdate = true)
public class MyClass { ... }
Using that, Hibernate will update the modified columns only. In small tables, with few columns, it's not so effective, but in your case maybe it can help make the whole process faster as you cannot avoid dirty checking with a regular Hibernate Session. Updating a few columns instead of the whole object is always better, right?
This post talks more about the dynamic-update attribute.
What I was wondering is if there was another way to prevent such a large committing of data or a way of excluding that particular column from the "check" hibernate does so I don't have to commit all 25 items if I only made a change to 1?
I would profile the application to ensure that the dirty checking on flush is actually the problem. If you find that this is indeed the case you can use evict to manage the session size.
session.update(myObject);  // reattach the detached object and schedule its UPDATE
session.flush();           // push pending changes to the database
session.evict(myObject);   // detach it again so it no longer adds to dirty-checking cost

spring jdbc RowCallbackHandler nightmare

I'm having trouble retrieving data from my database using Spring Jdbc. Here's my issue:
I have a getData() method on my DAO which is supposed to return ONE row from the result of some select statement. When invoked again, the getData() method should return the second row in a FIFO-like manner. I'm aiming for having only one result in memory at a time, since my table will get potentially huge in the future and bringing everything to memory would be a disaster.
If I were using regular jdbc code with a result set I could set its fetch size to 1 and everything would be fine. However I recently found out that Spring Jdbc operations via the JdbcTemplate object don't allow me to achieve such a behaviour (as far as I know... I'm not really knowledgeable about the Spring framework's features). I've heard of the RowCallbackHandler interface, and this post in the java ranch said I could somehow expose the result set to be used later (though using this method it stores the result set as many times over as there are rows, which is pretty dumb).
I have been playing with implementing the RowCallbackHandler interface for a day now and I still can't find a way to get it to retrieve one row from my select at a time. If anyone could enlighten me in this matter I'd greatly appreciate it.
JdbcTemplate.setFetchSize(int fetchSize):
Set the fetch size for this JdbcTemplate. This is important for processing large result sets: Setting this higher than the default value will increase processing speed at the cost of memory consumption; setting this lower can avoid transferring row data that will never be read by the application.
Default is 0, indicating to use the JDBC driver's default.
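A minimal sketch of how that is typically wired up; the query, the process() call, and the data source are illustrative assumptions. Note that whether the driver actually streams in batches also depends on driver specifics, e.g. the PostgreSQL driver only uses a cursor when auto-commit is off and the fetch size is non-zero.
// Sketch: stream rows through a RowCallbackHandler with a small fetch size,
// so the driver pulls rows from the open cursor in batches instead of all at once.
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.setFetchSize(100);
jdbcTemplate.query("select id, payload from big_table", rs -> {
    // called once per row; only the current batch needs to be in memory
    process(rs.getLong("id"), rs.getString("payload"));
});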
After a lot of searching and consulting with the rest of my team, we have come to the conclusion that this is not the best implementation path for our project. As Boris suggested, a different approach is the way to go. However, I'm doing something different and using SimpleJdbcTemplate instead, and splitting my query so it'll fit in memory better. A "status" field in my records table will be responsible for telling whether the record was successfully processed or read, so I know what records to fetch next.
The question of whether Spring JDBC is capable of the behaviour I mentioned in my OP is, however, still open. If anyone has an answer to that question, I'm sure it would help someone else out there.
Cheers!
You can take a different approach. Create a query which returns just the IDs of the rows that you want to read, and keep this collection of IDs in memory; you would need a really huge data set for the IDs alone to consume a lot of memory. Then iterate over it and load the rows one by one, each referenced by its ID.
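A rough sketch of that approach with JdbcTemplate; the table, columns, and recordRowMapper are illustrative assumptions:
// Sketch: load only the IDs up front, then fetch one full row at a time on demand.
List<Long> ids = jdbcTemplate.queryForList(
        "select id from records order by id", Long.class);
for (Long id : ids) {
    MyRecord record = jdbcTemplate.queryForObject(
            "select * from records where id = ?", recordRowMapper, id);
    // ... hand one record to the caller, remember where we stopped ...
}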
We have the same issue:
- Testing fetching fetchSize records with a raw JDBC PreparedStatement works as expected: if we stop the database after a fetchSize worth of records has been fetched, a JDBC connection error is thrown as soon as resultSet.next() runs.
- Testing fetchSize with JdbcTemplate:
PreparedStatementSetter preparedStatementSetter = ps -> ps.setFetchSize(_exportParams.getFetchSize());
RowCallbackHandler rowCallbackHandler = rs -> { /* process a row here */ };
this.jdbcTemplate.query(_exportParams.getSqlscript(), preparedStatementSetter, rowCallbackHandler);
After getting the first record, we stop Postgres. The row callback handler can still process the rest of the records without error, which suggests the rows had already been buffered on the client.

Does Oracle support Server-Side Scrollable Cursors via JDBC?

Currently working on the deployment of an OFBiz-based ERP, we've come up against the following problem: some of the framework's code calls resultSet.last() to find the total number of rows in the result set. With the Oracle JDBC driver (v10 and v11), this causes all of the rows to be cached in client memory, crashing the JVM because it doesn't have enough heap space.
After researching, the problem seems to be that the Oracle JDBC driver implements the scrollable cursor on the client side, instead of on the server, by using a cache. Using the DataDirect driver solves that issue, but the call to resultSet.last() takes too long to complete, so the application server aborts the transaction.
Is there any way to implement scrollable cursors via JDBC in Oracle without resorting to the DataDirect driver?
And what is the fastest way to know the length of a given ResultSet?
Thanks in advance
Ismael
"what is the fastest way to know the length of a given resultSet"
The ONLY way to really know is to count them all. You want to know how many 'SMITH's are in the phone book. You count them.
If it is a small result set, and quickly arrived at, it is not a problem. E.g. there won't be many Gandalfs in the phone book, and you probably want to get them all anyway.
If it is a large result set, you might be able to do an estimate, though that's not generally something that SQL is well-designed for.
To avoid caching the entire result set on the client, you can try
select id, count(1) over () n from junk;
Then each row will have an extra column (in this case n) with the count of rows in the result set. But it will still take the same amount of time to arrive at the count, so there's still a strong chance of a timeout.
A compromise is get the first hundred (or thousand) rows, and don't worry about the pagination beyond that.
your proposed "workaround" with count basically doubles the work done by DB server. It must first walk through everything to count number of results and then do the same + return results. Much better is the method mentioned by Gary (count(*) over() - analytics). But even here the whole result set must be created before first output is returned to the client. So it is potentially slow a memory consuming for large outputs.
The best way in my opinion is to select only the page you want on the screen (plus one extra row to determine that a next page exists), e.g. rows 21 to 41, and have another button (use case) to count them all in the (rare) case someone needs it.
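With Oracle 12c or later, the "page size + 1" trick can be expressed directly with OFFSET/FETCH; a hedged sketch, where the table, columns, paging values, and connection variable are illustrative assumptions:
// Sketch: fetch one extra row just to find out whether a next page exists.
int pageSize = 20;
int offset = 20; // rows 21..41, as in the example above
String sql = "select id, data from items order by id offset ? rows fetch next ? rows only";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setInt(1, offset);
    ps.setInt(2, pageSize + 1); // ask for 21 rows
    try (ResultSet rs = ps.executeQuery()) {
        List<String> page = new ArrayList<>();
        while (rs.next()) {
            page.add(rs.getString("data"));
        }
        boolean hasNextPage = page.size() > pageSize;
        if (hasNextPage) {
            page = page.subList(0, pageSize); // drop the sentinel row before rendering
        }
        // ... render 'page' and enable the "next" button only if hasNextPage ...
    }
}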
