I want to select the Top 10 records for a given query. So, I can use one of the following options:
Using the JDBC Statement.setMaxRows() method
Using LIMIT and OFFSET in the SQL query
What are the advantages and disadvantages of these two options?
SQL-level LIMIT
To restrict the SQL query result set size, you can use the SQL:2008 syntax:
SELECT title
FROM post
ORDER BY created_on DESC
OFFSET 50 ROWS
FETCH NEXT 50 ROWS ONLY
which works on Oracle 12c, SQL Server 2012, and PostgreSQL 8.4 or newer versions.
For MySQL, you can use the LIMIT and OFFSET clauses:
SELECT title
FROM post
ORDER BY created_on DESC
LIMIT 50
OFFSET 50
The advantage of SQL-level pagination is that the database optimizer can use this information when building the execution plan.
So, if we have an index on the created_on column:
CREATE INDEX idx_post_created_on ON post (created_on DESC)
And we execute the following query that uses the LIMIT clause:
EXPLAIN ANALYZE
SELECT title
FROM post
ORDER BY created_on DESC
LIMIT 50
We can see that the database engine uses the index since the optimizer knows that only 50 records are to be fetched:
Execution plan:
Limit (cost=0.28..25.35 rows=50 width=564)
(actual time=0.038..0.051 rows=50 loops=1)
-> Index Scan using idx_post_created_on on post p
(cost=0.28..260.04 rows=518 width=564)
(actual time=0.037..0.049 rows=50 loops=1)
Planning time: 1.511 ms
Execution time: 0.148 ms
JDBC Statement maxRows
According to the setMaxRows Javadoc:
If the limit is exceeded, the excess rows are silently dropped.
That's not very reassuring!
So, if we execute the following query on PostgreSQL:
try (PreparedStatement statement = connection
.prepareStatement("""
SELECT title
FROM post
ORDER BY created_on DESC
""")
) {
statement.setMaxRows(50); // the limit is applied by the JDBC driver, not in the SQL statement
ResultSet resultSet = statement.executeQuery();
int count = 0;
while (resultSet.next()) {
String title = resultSet.getString(1);
count++;
}
}
We get the following execution plan in the PostgreSQL log:
Execution plan:
Sort (cost=65.53..66.83 rows=518 width=564)
(actual time=4.339..5.473 rows=5000 loops=1)
Sort Key: created_on DESC
Sort Method: quicksort Memory: 896kB
-> Seq Scan on post p (cost=0.00..42.18 rows=518 width=564)
(actual time=0.041..1.833 rows=5000 loops=1)
Planning time: 1.840 ms
Execution time: 6.611 ms
Because the database optimizer has no idea that we need to fetch only 50 records, it assumes that all 5000 rows need to be scanned. If a query needs to fetch a large number of records, the cost of a full-table scan is lower than that of an index scan, hence the execution plan will not use the index at all.
I ran this test on Oracle, SQL Server, PostgreSQL, and MySQL, and it looks like the Oracle and PostgreSQL optimizers don't use the maxRows setting when generating the execution plan.
However, on SQL Server and MySQL, the maxRows JDBC setting is taken into consideration, and the execution plan is equivalent to an SQL query that uses TOP or LIMIT. You can run the tests for yourself, as they are available in my High-Performance Java Persistence GitHub repository.
Conclusion
Although setMaxRows looks like a portable solution for limiting the size of the ResultSet, SQL-level pagination is much more efficient when the database server optimizer does not use the JDBC maxRows property.
For most cases, you want to use the LIMIT clause, but at the end of the day both will achieve what you want. This answer is targeted at JDBC and PostgreSQL, but is applicable to other languages and databases that use a similar model.
The JDBC documentation for Statement.setMaxRows says
If the limit is exceeded, the excess rows are silently dropped.
i.e. the database server may return more rows, but the client will just ignore them. The PostgreSQL JDBC driver enforces the limit on both the client and the server side. For the client side, have a look at the usage of maxRows in AbstractJdbc2ResultSet. For the server side, have a look at the usage of maxRows in QueryExecutorImpl.
Server side, the PostgreSQL LIMIT documentation says:
The query optimizer takes LIMIT into account when generating a query
plan
So as long as the query is sensible, it will load only the data it needs to fulfill the query.
setFetchSize: Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by this Statement.
setMaxRows: Sets the limit for the maximum number of rows that any ResultSet object generated by this Statement object can contain to the given number.
Using the above two JDBC APIs, you can first try setFetchSize and see whether it works for your 100K records. Otherwise, you can fetch in batches, build an ArrayList, and return it to your Jasper report.
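As a minimal sketch of that idea (the table, columns, and chunk sizes here are placeholders, not anything from the original question):
// Hypothetical example: cap the result at 100K rows and fetch in chunks
// so the driver does not have to buffer everything at once.
try (PreparedStatement ps = connection.prepareStatement(
        "SELECT id, title FROM report_data")) { // assumed table/columns
    ps.setMaxRows(100_000); // upper bound on rows the ResultSet may contain
    ps.setFetchSize(1_000); // hint: ~1000 rows per database round trip
    try (ResultSet rs = ps.executeQuery()) {
        List<Object[]> rows = new ArrayList<>();
        while (rs.next()) {
            rows.add(new Object[] { rs.getLong(1), rs.getString(2) });
        }
        // hand `rows` to the Jasper report data source here
    }
}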
I'm not sure if I am right, but I remember being involved in a big project in the past where we changed all queries that were expected to return one row to use 'TOP 1' or numrows=1. The reason was that the DB would stop searching for the next possible matches when this 'hint' was used, and in high-volume environments this really made a difference. Ignoring superfluous records in the client or in the result set is not enough; you should avoid unnecessary reads as early as possible. I have no idea whether the JDBC methods add those DB-specific hints to the query, so I may need to test it. I am not a DB specialist and may be wrong, but "speed-wise it seems like no difference" can be a wrong assumption. For example, if you are asked to search a box for red balls and you only need one, it adds no value to keep searching for all of them when one is enough; then it matters to specify 'TOP 1'.
Related
Assuming that I have to go through all the entries, does anyone know how the results for a ResultSet are fetched?
Can I call SELECT * FROM MyTable instead of SELECT TOP 100 * FROM MyTable ORDER BY id ASC OFFSET 0; and just call resultSet.next() as needed to fetch the results and process them at the program level, or are the results already in memory, so that leaving out TOP is bad?
The ResultSet class exposes a
void setFetchSize(int rows)
method, which, per JavaDoc
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database when more rows are needed for this ResultSet
object.
That means that if we have a result set of 200 rows from the database and we set the fetch size to 100, ~100 rows will be loaded from the database at a time, and two trips to the database might be required.
The default fetch size is driver dependent, but, for example, Oracle sets it to 10 rows.
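As a rough sketch (the connection, table, and per-row handling are placeholders), raising the fetch size cuts down on the number of round trips:
try (Statement stmt = connection.createStatement()) {
    stmt.setFetchSize(100); // hint: ~100 rows per round trip instead of the driver default
    try (ResultSet rs = stmt.executeQuery("SELECT id FROM some_table")) { // assumed table
        while (rs.next()) {
            long id = rs.getLong(1); // per-row handling goes here
        }
    }
}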
Depends on the DB engine and JDBC driver. Generally, the IDEA behind the JDBC API is that the DB engine creates a cursor (this is also why ResultSets are resources that must be closed), and thus, you can do a SELECT * FROM someTableWithBillionsOfRows without a LIMIT, and yet it can be fast.
Whether it actually is, well, that depends. In my experience, which is primarily interacting with postgres, it IS fast (as in, cursor based with limited data transfer from DB to VM even if the query would match billions of rows), and thus your plan (select without limits, keep calling next until you have what you want and then close the resultset) should work fine.
NB: Some DB engines meet you halfway and transfer results in batches, for the best of both worlds: Latency overhead is limited (a single latency overhead is shared by batchsize results), and yet the total transfer between DB and VM is limited to only rowsize times batchsize, even if you only read a single row and then close the resultset.
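For PostgreSQL specifically, a sketch of that plan (note that the driver only streams with autocommit off, a forward-only ResultSet, and a non-zero fetch size; otherwise it buffers the whole result set in the VM):
connection.setAutoCommit(false); // required for cursor-based streaming in PostgreSQL
try (Statement stmt = connection.createStatement()) {
    stmt.setFetchSize(500); // stream in batches of ~500 rows
    try (ResultSet rs = stmt.executeQuery(
            "SELECT * FROM someTableWithBillionsOfRows")) {
        int taken = 0;
        while (taken < 100 && rs.next()) { // stop as soon as you have enough
            taken++;
        }
    } // closing the ResultSet releases the server-side cursor
}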
I have to query a database, and the result set is very big. I am using MySQL as the database. To avoid an "OutOfMemoryError", after a lot of searching I found two options: one using LIMIT (specific to the database), and the other using the JDBC fetchSize attribute.
I have tested option 1 (LIMIT) and it works, but it is not the desired solution; I do not want to do it that way.
Using JDBC, I found out that the fetch size is set to 0 by default. How can I change this to some other value? I tried the following:
a) First Try:
rs = preparedStatement.executeQuery();
rs.setFetchSize(1000); // Too late: the exception occurs in executeQuery() above.
b) Second Try:
rs.setFetchSize(1000); // NullPointerException (rs is null).
rs = preparedStatement.executeQuery();
c) Third Try:
preparedStatement = dbConnection.prepareStatement(query);
preparedStatement.setFetchSize(1000);
None of this is working. Any help appreciated!
Edit:
I do not want a solution using limit because:
a) I have millions of rows in my result set, and doing multiple queries is slow. My assumption is that the database treats queries like
SELECT * FROM a LIMIT 0, 1000
SELECT * FROM a LIMIT 1000, 1000
as two different queries.
b) The code looks messy because you need additional counters.
The MySQL JDBC driver always fetches all rows, unless the fetch size is set to Integer.MIN_VALUE.
See the MySQL Connector/J JDBC API Implementation Notes:
By default, ResultSets are completely retrieved and stored in memory.
In most cases this is the most efficient way to operate, and due to
the design of the MySQL network protocol is easier to implement. If
you are working with ResultSets that have a large number of rows or
large values, and cannot allocate heap space in your JVM for the
memory required, you can tell the driver to stream the results back
one row at a time.
To enable this functionality, create a Statement instance in the
following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch
size of Integer.MIN_VALUE serves as a signal to the driver to stream
result sets row-by-row. After this, any result sets created with the
statement will be retrieved row-by-row.
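Put together as a runnable sketch (the connection URL, credentials, and table name are placeholders):
// Stream a large MySQL result set row by row instead of buffering it.
try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/test", "user", "pass"); // placeholder URL/credentials
     Statement stmt = conn.createStatement(
         java.sql.ResultSet.TYPE_FORWARD_ONLY,
         java.sql.ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(Integer.MIN_VALUE); // signal: stream results row by row
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) { // assumed table
        while (rs.next()) {
            // process one row at a time; only the current row is held in memory
        }
    }
}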
Besides that, you should change your query, e.g.:
SELECT * FROM RandomStones LIMIT 1000;
Or
PreparedStatement stmt = connection.prepareStatement(qry);
stmt.setFetchSize(1000);
stmt.executeQuery();
To set the fetch size for a query, call setFetchSize() on the statement object prior to executing the query. If you set the fetch size to N, then N rows are fetched with each trip to the database.
In the MySQL Connector/J 5.1 implementation notes, they describe two ways to handle this situation.
The first is to make the statement retrieve data row by row; the second supports batch retrieval.
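The batch (cursor-based) variant can be enabled with the useCursorFetch connection property plus a positive fetch size. A sketch, with placeholder URL, credentials, and table name:
try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/test?useCursorFetch=true", "user", "pass");
     PreparedStatement ps = conn.prepareStatement("SELECT * FROM big_table")) {
    ps.setFetchSize(1000); // server-side cursor returns 1000 rows per batch
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // process rows; memory holds at most one batch
        }
    }
}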
Can I limit the number of rows in the result set?
My table contains some 800,000 rows; if I fetch them into a result set, this will definitely lead to an OOM exception. Each row has 40 columns.
I do not want to work on them all at the same time; each row only needs to be filtered for some data.
Thank you in advance.
Something like the following should be a SQL solution, albeit a rather ineffective one, since each time you will have to fetch an increasing amount of rows.
Assuming that your ORDER BY is based on a unique int and that you will be fetching 1000 rows at a time:
SET @currentid = 0;
SELECT * FROM YourTable t
WHERE t.ID > @currentid AND (@currentid := t.ID) IS NOT NULL
ORDER BY t.ID
LIMIT 1000;
Of course, you can choose to handle the variable from your Java code instead.
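Handled from Java, a keyset-pagination sketch (assuming the same unique integer ID column) could look like:
long lastId = 0;
boolean more = true;
while (more) {
    more = false;
    try (PreparedStatement ps = connection.prepareStatement(
            "SELECT * FROM YourTable WHERE ID > ? ORDER BY ID LIMIT 1000")) {
        ps.setLong(1, lastId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lastId = rs.getLong("ID"); // remember the last key seen
                more = true;
                // filter each row here
            }
        }
    }
}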
You could use the JDBC fetch size to control how many rows are held in memory at a time. It is better than the SQL LIMIT, as it will work for other databases as well without changing the query. The JDBC driver will not read the whole result from the database; each time it will retrieve the number of records specified by the fetch size, and there will be no memory issue anymore.
You can use the LIMIT keyword in your SQL query, as in the following:
SELECT * FROM tbl LIMIT 5,10; # Retrieve rows 6-15
You can read more about using limit here
Cheers !!
You have a couple of options. First, add a limit to the SQL query. Second, you could use JdbcTemplate.query() with a RowCallbackHandler to process one row at a time; the template will handle memory issues with a large result set.
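A sketch of the second option with Spring's JdbcTemplate (the DataSource, table, and per-row handling are assumptions):
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

JdbcTemplate template = new JdbcTemplate(dataSource); // dataSource configured elsewhere
template.setFetchSize(1000); // keep only a window of rows in memory
template.query("SELECT id, payload FROM big_table", (RowCallbackHandler) rs -> {
    // called once per row, row by row; no full ResultSet is materialized
    long id = rs.getLong("id");
    String payload = rs.getString("payload");
});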
I am using Java to read from a SQL RDBMS and return the results to the user. The problem is that the database table has 155 million rows, which makes the wait time really long.
I wanted to know if it is possible to retrieve results as they come from the database and present them incrementally to the user (in batches).
My query is a simple SELECT * FROM Table_Name query.
Is there a mechanism or technology that can give me callbacks of DB records, in batches until the SELECT query finishes?
The RDBMS that is used is MS SQL Server 2008.
Thanks in advance.
Methods Statement#setFetchSize and Statement#getMoreResults are supposed to allow you to manage incremental fetches from the database. Unfortunately, this is the interface spec and vendors may or may not implement these. Memory management during a fetch is really down to the vendor (which is why I wouldn't strictly say that "JDBC just works like this").
From the JDBC documentation on Statement :
setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database when more rows are needed for ResultSet
objects generated by this Statement.
getMoreResults()
Moves to this Statement object's next result, returns true if it is a
ResultSet object, and implicitly closes any current ResultSet object(s)
obtained with the method getResultSet.
getMoreResults(int current)
Moves to this Statement object's next result, deals with any current
ResultSet object(s) according to the instructions specified by the given
flag, and returns true if the next result is a ResultSet object.
The current parameter indicates whether to keep or close the current ResultSet.
Also, this SO response discusses the use of setFetchSize with regard to SQL Server 2005 and how it doesn't seem to manage batched fetches. The recommendation is to test this using the 2008 driver or, moreover, to use the jTDS driver (which gets a thumbs up in the comments).
This response to the same SO post may also be useful as it contains a link to SQLServer driver settings on MSDN.
There's also some good info on the MS TechNet website, but relating more to SQL Server 2005; I couldn't find the 2008-specific version in my cursory review. Anyway, it recommends creating the Statement with:
com.microsoft.sqlserver.jdbc.SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY (2004) scrollability for forward-only, read-only access, and then use the setFetchSize method to tune performance
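As a sketch of that recommendation (the connection is assumed to exist; the query reuses the question's SELECT * FROM Table_Name, and the fetch size value is an assumption to tune):
import com.microsoft.sqlserver.jdbc.SQLServerResultSet;

try (Statement stmt = conn.createStatement(
        SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(1000); // rows transferred per server round trip
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM Table_Name")) {
        while (rs.next()) {
            // present each batch of rows to the user incrementally
        }
    }
}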
Using pagination (LIMIT pageno, rows / TOP) might create holes and duplicates, but might be used in combination with checking the last row ID (WHERE id > ? ORDER BY id LIMIT 0, 100).
You may use TYPE_FORWARD_ONLY or FETCH_FORWARD_ONLY.
This is exactly how a JDBC driver is supposed to work (I remember a bug in an old PostgreSQL driver that caused all fetched records to be stored in memory).
It enables you to read records as soon as the query starts to fetch them. This is where I would start looking.
For example, Oracle optimizes SELECT * queries for fetching the whole set, which means it can take a long time before the first results appear. You can give hints to optimize for fetching the first results instead, so you can show the first rows to your user quite fast, even though the whole query may take longer to execute.
You should test your query on the console first, to check when it starts to fetch results. Then try it with JDBC and monitor the memory usage while you iterate through the ResultSet. If the memory usage grows fast, check whether you have opened the ResultSet in forward-only, read-only mode and, if necessary, update the driver.
If such a solution is not feasible because of memory usage, you can still use cursors manually and fetch N rows (say, 100) in each query.
Cursor documentation for MSSQL can be found, for example, here: http://msdn.microsoft.com/en-us/library/ms180152.aspx
I execute a native query using
Query query = entityManager.createNativeQuery(sqlQuery);
query.setMaxResults(maxResults);
List<Object[]> resultList = query.getResultList();
To speed up the query, I thought to include the FIRST_ROWS(n) hint or limiting using WHERE ROWNUM > n.
Using instrumentation, I see that OraclePreparedStatement.executeQuery is indeed faster, but a lot more time is spent in EJBQueryImpl.getResultList, leading to overall very poor performance. Looking into it in more detail, I see that every 10th call of ResultSet.next() takes about as long as executeQuery() itself. This strange behaviour stops when I leave out the query hint or the ROWNUM condition; then every 10th call of ResultSet.next() is still somewhat slower than the others, but only 2 ms instead of 3 seconds.
Do you get different query plans when you include the hint? My assumption is that you do based on your description of the problem.
When you execute a query in Oracle, the database does not generally materialize the entire result set at any point in time (obviously, it may have to if you specify an ORDER BY clause that requires all the data to be materialized before the sort happens). Oracle doesn't actually start materializing data until the client starts fetching data. It runs enough of the query to generate however many rows the client has asked to fetch (which it sounds like is 10 in your case), returns those results to the client, and waits for the client to request more data before continuing to process the query.
It sounds like when the FIRST_ROWS hint is included, the query plan is changing in a way that makes it more expensive to execute. Obviously, that's not the goal of the FIRST_ROWS hint. The goal is to tell the optimizer to generate a plan that makes fetching the first N rows more efficient even if it makes fetching all the rows from the query less efficient. That tends to cause the optimizer to favor things like index scans over table scans where a table scan might be more efficient overall. It sounds like in your case, however, the optimizer's estimates are incorrect and it ends up picking a plan that is just generally less efficient. That frequently implies that some of the statistics on some of the objects your query is referencing are incomplete or incorrect.
Sounds like you made the JDBC executeQuery faster but the JDBC ResultSet next slower: you made executing the query faster but fetching the data slower. This seems to be a JDBC issue, not an EclipseLink one; you would get the same result through raw JDBC if you actually fetched the data.
10 is the default fetch size, so you could try setting that to be bigger.
See,
http://www.eclipse.org/eclipselink/api/2.3/org/eclipse/persistence/config/QueryHints.html#JDBC_FETCH_SIZE
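For instance, a sketch using that hint (the value 100 is just an assumption to tune):
import javax.persistence.Query;
import org.eclipse.persistence.config.QueryHints;

Query query = entityManager.createNativeQuery(sqlQuery);
query.setHint(QueryHints.JDBC_FETCH_SIZE, 100); // e.g. raise from the Oracle default of 10
query.setMaxResults(maxResults);
List<Object[]> resultList = query.getResultList();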
Try adding the max rows limit to the SQL directly instead of using setMaxResults, i.e. add WHERE ROWNUM <= maxResults to the SQL string. EclipseLink will use ROWNUM in the query for max rows when it creates the SQL, but since you are using your own SQL, it will use the result set to limit rows. A sketch of this follows.
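A sketch of that approach (wrapping the original query string is an assumption; adapt it to your actual SQL):
// Embed the limit in the native SQL so Oracle can plan for it.
String limitedSql =
    "SELECT * FROM (" + sqlQuery + ") WHERE ROWNUM <= ?";
Query query = entityManager.createNativeQuery(limitedSql);
query.setParameter(1, maxResults);
List<Object[]> resultList = query.getResultList();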