I have a Java ResultSet which contains a big data amount (but still may be stored in RAM), retrieved from the database. I'm going to work with this data. From the performance point of view, should I copy result set content to some data structure in order to be able to close the result set as soon as possible and to work with the data from a new container or it's better to do not waste a time on copy the content and work with the data directly from the result set?
ResultSet's fetch a specified number of rows at a time, dependent on the JDBC driver and database being used. So if your query has a million rows returned, not all million are resident in memory, unless you iterate over the ResultSet and put them in memory of course. To answer your question directly, you close the ResultSet when you are finished reading the rows you need, usually when the ResultSet.next() returns false.
JDBC Developers Guide
Related
Assuming that I have to go through all the entries., does anyone know how the results for ResultSet is fetched?
Can I call SELECT * FROM MyTable instead of SELECT TOP 100 * FROM MyTable ORDER BY id ASC OFFSET 0; and just call resultSet.next() as needed to fetch the results, and process them on a program level, or are the results already in memory and not putting in TOP is bad?
The ResultSet class exposes a
void setFetchSize(int rows)
method, which, per JavaDoc
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database when more rows are needed for this ResultSet
object.
That means if we have a result set of 200 rows from the database, and we set the fetch size to 100, ~100 rows will loaded from the database at a time, and two trips to the database might be required.
The default fetch size is driver dependant, but for example, Oracle set it to 10 rows.
Depends on the DB engine and JDBC driver. Generally, the IDEA behind the JDBC API is that the DB engine creates a cursor (this is also why ResultSets are resources that must be closed), and thus, you can do a SELECT * FROM someTableWithBillionsOfRows without a LIMIT, and yet it can be fast.
Whether it actually is, well, that depends. In my experience, which is primarily interacting with postgres, it IS fast (as in, cursor based with limited data transfer from DB to VM even if the query would match billions of rows), and thus your plan (select without limits, keep calling next until you have what you want and then close the resultset) should work fine.
NB: Some DB engines meet you halfway and transfer results in batches, for the best of both worlds: Latency overhead is limited (a single latency overhead is shared by batchsize results), and yet the total transfer between DB and VM is limited to only rowsize times batchsize, even if you only read a single row and then close the resultset.
I need an advice :)
I have a database with almost 70 tables, many of them have over a dozen million big records. I want to split it into a few smaller ones. One for every big client data and one main database for the rest of the client's data(while also moving some of the data into NoSQL database). Because of many complicated relations between tables, before copying the data, I was disabling the triggers, that were checking the correctness of the foreign keys and then, just before a commit I was enabling them again.
It was all working with a small amount of data, but now, when I'm trying to copy one of the big client data I have a problem with the java heap size/GC out of memory.
I could increase the heap size, but it's not the point here.
I'm selecting data by some specific id from every table that has any relation to client data and copy it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I was trying to select portions of data(something like select parts with 5000 rows instead of all 50 000 in one) but it fails in the exact same position.
And here I am asking for an advice, how to manage this problem. I think it is all because I am trying to copy all data in one big fatty commit. The reason of it is that I have to disable triggers while copying but also I must enable them before I can commit my changes.
When I'm trying to copy one of the big client data I have a problem with the java heap size/GC out of memory.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So, add a stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver will not exhaust the heap.
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
We are working in our team pretty tight up with an Oracle DB server using jdbc. In one of our changes, I'm calling a Stored Procedure which returns me two different ResultSets. At first my implementation assumed default Scroll-ability.
After that failed, I looked it up in the Internet.
Everything I could read about it said basically the same thing: use prepareStatement or prepareCall methods with the appropriate TYPE_SCROLL_INSENSITIVE and CONCUR_READ_ONLY. None of these worked.
The Stored Procedure I use, again, return me two different result sets and they are extracted through a (ResultSet) rs.getObject("name"). Generally in examples, their ResultSet are coming back instantly from a .executeQuery.
My Question is, Do the Scrollablility/Updatability types in the prepareCall methods affecting these sort of ResultSets? if so, how do I get them?
I know that the JDBC driver can degrade my request for ScrollableResultSet. How can I tell if my ResultSet was degraded?
On that note, Why aren't ResultSets scrollable by default? What are the best practices and what is "the cost" of their flexibility?
In Oracle, a cursor is a forward-only structure. All the database knows how to do is fetch the next row (well, technically the next n rows). In order to make a ResultSet seem scrollable, you rely on the JDBC driver.
The JDBC driver has two basic approaches to making ResultSet seem scrollable. The first is to save the entire result set in memory as you fetch data just in case you want to go backwards. Functionally, that works but it has potentially catastrophic results on performance and scalability when a query potentially returns a fair amount of data. The first time some piece of code starts chewing up GB of RAM on app servers because a query returned thousands of rows that included a bunch of long comment fields, that JDBC driver will get rightly pilloried as a resource hog.
The more common approach is to for the driver to add a key to the query and to use that key to manage the data the driver caches. So, for example, the driver might keep the last 1000 rows in memory in their entirety but only cache the key for the earlier rows so it can go back and re-fetch that data later. That's more complicated to code but it also requires that the ResultSet has a unique key. Normally, that's done by trying to add a ROWID to the query. That's why, for example, the Oracle JDBC driver specifies that a scrollable or updatable ResultSet cannot use a SELECT * but can use SELECT alias.*-- the latter makes it possible for the driver to potentially be able to blindly add a ROWID column to the query.
A ResultSet coming from a stored procedure, however, is completely opaque to the driver-- it has no way of getting the query that was used to open the ResultSet so it has no way to add an additional key column or to go back and fetch the data again. If the driver wanted to make the ResultSet scrollable, it would have to go back to caching the entire ResultSet in memory. Technically, that is entirely possible to do but very few drivers will do so since it tends to lead to performance problems. It's much safer to downgrade the ResultSet. Most of the time, the application is in a better position to figure out whether it is reasonable to cache the entire ResultSet because you know it is only ever going to return a small amount of data or to be able to go back and fetch rows again by their natural key.
You can use the getType() and getConcurrency() methods on your ResultSet to determine whether your ResultSet has been downgraded by the driver.
There are a lot of different tutorials across the internet about pagination with JDBC/iterating over huge result set.
So, basically there are a number of approaches I've found so far:
Vendor specific sql
Scrollable result set (?)
Holding plain result set in a memory and map the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly, or by default equal
to the statement fetch size that was passed to it, determines the
number of rows that are retrieved in any subsequent trips to the
database for that result set. This includes any trips that are still
required to complete the original query, as well as any refetching of
data into the result set. Data can be refetched, either explicitly or
implicitly, to update a scroll-sensitive or
scroll-insensitive/updatable result set.
Cursor (?)
Custom seek method paging implemented by jooq
Sorry for messing all these but I need someone to clear that out for me.
I have a simple task where service consumer asks for results with a pageNumber and pageSize. Looks like I have two options:
Use vendor specific sql
Hold the connection/statement/result set in the memory and rely on jdbc fetchSize
In the latter case I use rxJava-jdbc and if you look at producer implementation it holds the result set, then all you do is calling request(long n) and another n rows are processed. Of course everything is hidden under Observable suggar of rxJava. What I don't like about this approach is that you have to hold the resultSet between different service calls and have to clear that resultSet if client forgets to exhaust or close it. (Note: resultSet here is java ResultSet class, not the actual data)
So, what is recommended way of doing pagination? Is vendor specific sql considered slow compared to holding the connection?
I am using oracle, ScrollableResultSet is not recommended to be used with huge result sets as it caches the whole result set data on the client side. proof
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor specific SQL to fetch a "page" and I would do it just like that.
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed/ every row is seen exactly once/ the results across pages are consistent? Multiple fetches from a single cursor guarantees that every row in the result is seen exactly once. Separate paginated queries might not-- new data could have been added or removed between queries being executed, your sort might not be fully deterministic, etc.
ScrollableResultSet caches result on client side - this requires memory resources. But for example PostgreSQL does it by default and nobody complains. Some databases simply use client's memory to hold the whole resultset. In most cases the database has to process much more data to re-evaluate the query.
Also you usually have much more clients, than database instances.
Also note that query re-execution - using rownum - as implemented by Hibernate does not guarantee correct(consistent) results. If data are modified between executions and default isolation level is used.
It really depends on use case. Changing Oracle's init parameter for max. connections and also for open cursors requires database restart.
So ScrollableResultSet and cursors can be used only when you can predict amount of (concurrent) users.
I am using Java to read from a SQL RDBMS and return the results to the user. The problem is that the database table has 155 Million rows, which make the wait time really long.
I wanted to know if it is possible to retrieve results as they come from the database and present them incrementaly to the user (in batches).
My query is a simple SELECT * FROM Table_Name query.
Is there a mechanism or technology that can give me callbacks of DB records, in batches until the SELECT query finishes?
The RDBMS that is used is MS SQL Server 2008.
Thanks in advance.
Methods Statement#setFetchSize and Statement#getMoreResults are supposed to allow you to manage incremental fetches from the database. Unfortunately, this is the interface spec and vendors may or may not implement these. Memory management during a fetch is really down to the vendor (which is why I wouldn't strictly say that "JDBC just works like this").
From the JDBC documentation on Statement :
setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database when more rows are needed for ResultSet
objects genrated by this Statement.
getMoreResults()
Moves to this Statement object's next result, returns true if it is a
ResultSet object, and implicitly closes any current ResultSet object(s)
obtained with the method getResultSet.
getMoreResults(int current)
Moves to this Statement object's next result, deals with any current
ResultSet object(s) according to the instructions specified by the given
flag, and returns true if the next result is a ResultSet object.
current param indicates Keep or close current ResultSet?
Also, this SO response answers about the use of setFetchSize with regards to SQLServer 2005 and how it doesn't seem to manage batched fetches. The recommendation is to test this using the 2008 driver or moreover, to use the jTDS driver (which gets thumbs up in the comments)
This response to the same SO post may also be useful as it contains a link to SQLServer driver settings on MSDN.
There's also some good info on the MS technet website but relating more to SQLServer 2005. Couldn't find the 2008 specific version in my cursory review. Anyway, it recommends creating the Statement with:
com.microsoft.sqlserver.jdbc.SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY (2004) scrollability for forward-only, read-only access, and then use the setFetchSize method to tune performance
Using pagination (LIMIT pageno, rows / TOP) might create holes and duplicates, but might be used in combination with checking the last row ID (WHERE id > ? ORDER BY id LIMIT 0, 100).
You may use TYPE_FORWARD_ONLY or FETCH_FORWARD_ONLY.
This is exactly how is JDBC driver supposed to work (I remember the bug in old PostgreSQL driver, that caused all fetched records to be stored in memory).
However, it enables you to read record when the query starts to fetch them. This is where I would start to search.
For example, Oracle optimizes SELECT * queries for fetching the whole set. It means it can take a lot of time before first results will appear. You can give hints to optimize for fetching first results, so you can show first rows to your user quite fast, but the whole query can take longer to execute.
You should test your query on console first, to check when it starts to fetch results. Then try with JDBC and monitor the memory usage while you iterate through ResultSet. If the memory usage grows fast, check if you have opened ResultSet in forward-only and read-only mode, if necessary update driver.
If such solution is not feasible because of memory usage, you can still use cursors manually and fetch N rows (say, 100) in each query.
Cursor documentation for MSSQL: for example here: http://msdn.microsoft.com/en-us/library/ms180152.aspx