How does fetchLazy work in jooq? - java

How does fetchLazy work in jooq?
Is it equivalent to doing paginated select with limit and offset?

They're different.
fetchLazy()
... returns a Cursor type, which is jOOQ's equivalent of the JDBC ResultSet type. The query will fully materialise in the database, but jOOQ (JDBC) will fetch rows one-by-one. This is useful
when large result sets need to be fetched without waiting for the data transfer between server and client to finish - as opposed to a simple fetch(), which loads all rows from the server in one go.
when the client doesn't know in advance how many rows they really want to fetch from the server.
LIMIT .. OFFSET
... will reduce the number of returned rows already in the database, without them ever surfacing in the client. This can heavily improve execution speed in the server, as the server
May choose a different execution plan - e.g. using nested loops instead of hash joins for low values of LIMIT
Doesn't need to keep an open cursor for a long data transfer time, as only a few rows are transferred over the wire.
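To make the contrast concrete, here is a hedged jOOQ sketch (not from the original answer); the DSLContext, the BOOK table and the ID column are placeholders for your own schema:

import org.jooq.Cursor;
import org.jooq.DSLContext;
import org.jooq.Record;
import org.jooq.Result;

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

public class FetchLazyVsLimit {

    // fetchLazy(): the query runs fully on the server, but rows are streamed
    // to the client as the Cursor is iterated. Cursor is AutoCloseable, so
    // try-with-resources releases the underlying JDBC ResultSet.
    static void streamAllRows(DSLContext ctx) {
        try (Cursor<Record> cursor = ctx.select().from(table("BOOK")).fetchLazy()) {
            for (Record record : cursor) {
                System.out.println(record);
                // breaking out early is fine; rows not yet fetched never cross the wire
            }
        }
    }

    // LIMIT .. OFFSET: the server itself trims the result to one page, so only
    // those rows are ever transferred (and possibly ever computed).
    static Result<Record> fetchOnePage(DSLContext ctx, int pageSize, int offset) {
        return ctx.select()
                  .from(table("BOOK"))
                  .orderBy(field("ID"))
                  .limit(pageSize)
                  .offset(offset)
                  .fetch();
    }
}

The first method keeps the client's memory footprint small but still makes the server produce the full result; the second one makes the server do the trimming.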

Related

ResultSet implementation - is it fetched as next() is called, or the results are already in memory?

Assuming that I have to go through all the entries, does anyone know how the results for a ResultSet are fetched?
Can I call SELECT * FROM MyTable instead of SELECT TOP 100 * FROM MyTable ORDER BY id ASC OFFSET 0; and just call resultSet.next() as needed to fetch the results and process them at the program level, or are the results already in memory, so that leaving out TOP is bad?
The ResultSet class exposes a
void setFetchSize(int rows)
method, which, per JavaDoc
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database when more rows are needed for this ResultSet
object.
That means that if we have a result set of 200 rows from the database and we set the fetch size to 100, ~100 rows will be loaded from the database at a time, and two trips to the database might be required.
The default fetch size is driver-dependent; Oracle, for example, sets it to 10 rows.
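As an illustration (not the asker's code), a minimal JDBC sketch; the URL, credentials and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/service", "user", "pass");
             Statement stmt = con.createStatement()) {

            // Hint to the driver: pull roughly 100 rows per round trip instead
            // of its default (e.g. 10 for the Oracle driver).
            stmt.setFetchSize(100);

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM MyTable")) {
                while (rs.next()) {
                    // rows arrive in blocks of about the fetch size, so a
                    // 200-row result needs roughly two round trips here
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}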
Depends on the DB engine and JDBC driver. Generally, the IDEA behind the JDBC API is that the DB engine creates a cursor (this is also why ResultSets are resources that must be closed), and thus, you can do a SELECT * FROM someTableWithBillionsOfRows without a LIMIT, and yet it can be fast.
Whether it actually is, well, that depends. In my experience, which is primarily interacting with postgres, it IS fast (as in, cursor based with limited data transfer from DB to VM even if the query would match billions of rows), and thus your plan (select without limits, keep calling next until you have what you want and then close the resultset) should work fine.
NB: Some DB engines meet you halfway and transfer results in batches, for the best of both worlds: Latency overhead is limited (a single latency overhead is shared by batchsize results), and yet the total transfer between DB and VM is limited to only rowsize times batchsize, even if you only read a single row and then close the resultset.

Split database into smaller ones. Too much data in single commit

I need an advice :)
I have a database with almost 70 tables, many of them with over a dozen million big records. I want to split it into a few smaller ones: one for each big client's data and one main database for the rest of the clients' data (while also moving some of the data into a NoSQL database). Because of the many complicated relations between tables, before copying the data I was disabling the triggers that check the correctness of the foreign keys, and then, just before the commit, enabling them again.
It was all working with a small amount of data, but now, when I'm trying to copy one of the big clients' data, I have a problem with the Java heap size/GC running out of memory.
I could increase the heap size, but it's not the point here.
I'm selecting data by a specific id from every table that has any relation to the client data and copying it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I was trying to select the data in portions (something like selecting parts of 5,000 rows instead of all 50,000 at once), but it fails in the exact same place.
And here I am, asking for advice on how to manage this problem. I think it is all because I am trying to copy all the data in one big fat commit. The reason for it is that I have to disable the triggers while copying, but I must also enable them before I can commit my changes.
When I'm trying to copy one of the big clients' data, I have a problem with the Java heap size/GC running out of memory.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So, adding stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver does not exhaust the heap.
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
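A hedged sketch of that change, assuming a PostgreSQL source database (the connection details and the client_data table are placeholders); note that, per the same documentation, cursor mode also requires autocommit to be off and a forward-only ResultSet:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CursorBasedCopy {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/sourcedb", "user", "pass")) {

            // Required for cursor-based fetching in the PostgreSQL driver
            con.setAutoCommit(false);

            try (PreparedStatement stmt = con.prepareStatement(
                         "SELECT * FROM client_data WHERE client_id = ?",
                         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                stmt.setLong(1, 42L);
                stmt.setFetchSize(1000); // fetch 1000 rows per round trip

                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // write the row to the target database here, batching
                        // the inserts so nothing large accumulates on the heap
                    }
                }
            }
            con.commit();
        }
    }
}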

Does MySQL use server-side pre-fetching when streaming a ResultSet

The MySQL JDBC connector defines two fetch modes:
the default one fetches the whole ResultSet at once
streaming, when the statement fetchSize is set to Integer.MIN_VALUE
According to the documentation, the streaming will fetch each row individually, one at a time.
Is it true that, when using streaming, each row is fetched in a separate database roundtrip?
Does the MySQL server prefetch the result set in advance, or does it traverse the server-side cursor one row at a time as well?
I believe the short answer is yes. I don't know the nuances as they apply to mysql_use_result/mysql_store_result, but there are a few types of prefetch:
The InnoDB storage engine underneath has read-ahead, so it will start fetching pages in advance.
Some queries do need to be materialized in full before they can be streamed row by row (think of a sort that can't use an index, or a GROUP BY without a loose index scan). If this happens, the temporary table will show up in the show profiles output.
Finally, in MySQL 5.6+ the retrieval from the storage engine can be batched (Batched Key Access, BKA). This is probably the case you were hinting at; the buffer that gets filled is sized by join_buffer_size.
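For context, a minimal sketch of the streaming mode the question refers to (URL, credentials and big_table are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlStreamingExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/test", "user", "pass");
             PreparedStatement stmt = con.prepareStatement(
                     "SELECT * FROM big_table",
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

            // Connector/J only switches to row-by-row streaming for this exact
            // value (with a forward-only, read-only statement); positive fetch
            // sizes are a different mechanism that needs useCursorFetch=true.
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // process each row as it arrives; the client never buffers
                    // the whole result set
                }
            }
        }
    }
}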

Does using Limit in query using JDBC, have any effect in performance?

If we use the Limit clause in a query which also has ORDER BY clause and execute the query in JDBC, will there be any effect in performance? (using MySQL database)
Example:
SELECT modelName from Cars ORDER BY manuDate DESC Limit 1
I read in one of the threads on this forum that, by default, a set number of rows is fetched at a time. How can I find the default fetch size?
I want only one record. Originally, I was using as follows:
SQL Query:
SELECT modelName from Cars ORDER BY manuDate DESC
In the JAVA code, I was extracting as follows:
if(resultSett.next()){
//do something here.
}
Definitely, the LIMIT 1 will have a positive effect on performance. Instead of the entire data set of matches (well, depending on the default fetch size) being returned from the DB server to the Java code, only one row will be returned. This saves a lot of network bandwidth and Java memory usage.
Always delegate constraints like LIMIT, ORDER BY, WHERE, etc. to the SQL language as much as possible instead of doing it on the Java side. The DB will do it much better than your Java code ever can (if the table is properly indexed, of course). You should try to write the SQL query so that it returns exactly the information you need.
The only disadvantage of writing DB-specific SQL queries is that the SQL language is not entirely portable among different DB servers, which would require you to change the SQL queries every time you change DB servers. But in the real world it's very rare to switch to a completely different DB make anyway. Externalizing the SQL strings to XML or properties files should help a lot in any case.
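As a hedged illustration of pushing LIMIT 1 down into the query (the connection details are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LatestModelExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/test", "user", "pass");
             PreparedStatement stmt = con.prepareStatement(
                     "SELECT modelName FROM Cars ORDER BY manuDate DESC LIMIT 1");
             ResultSet rs = stmt.executeQuery()) {

            // Only one row ever leaves the server, so a single next() suffices
            if (rs.next()) {
                System.out.println(rs.getString("modelName"));
            }
        }
    }
}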
There are two ways the LIMIT could speed things up:
by producing less data, which means less data gets sent over the wire and processed by the JDBC client
by potentially having MySQL itself look at fewer rows
The second one of those depends on how MySQL can produce the ordering. If you don't have an index on manuDate, MySQL will have to fetch all the rows from Cars, then order them, then give you the first one. But if there's an index on manuDate, MySQL can just look at the first entry in that index, fetch the appropriate row, and that's it. (If the index also contains modelName, MySQL doesn't even need to fetch the row after it looks at the index -- it's a covering index.)
With all that said, watch out! If manuDate isn't unique, the ordering is only partially deterministic (the order for all rows with the same manuDate is undefined), and your LIMIT 1 therefore doesn't have a single correct answer. For instance, if you switch storage engines, you might start getting different results.
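One hedged way to address that, assuming Cars has a unique primary key column id (an assumption, not something stated in the question), is to add it as a tiebreaker:

// Deterministic even when several cars share the same manuDate
String sql = "SELECT modelName FROM Cars ORDER BY manuDate DESC, id DESC LIMIT 1";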

Storing result set for later fetch

I have some queries that run for quite a long time (20-30 minutes). If a lot of queries are started simultaneously, the connection pool is drained quickly.
Is it possible to wrap a long-running query into a statement (procedure) that will store the result of a generic query into a temp table, terminating the connection, and then fetch (poll) the results later on demand?
EDIT: the queries and data structures are optimized, so tips like 'check your indices and execution plan' don't work for me. I'm looking for a way to store [maybe a] byte representation of a generic result set, for later retrieval.
First of all, 20-30 minutes is an extremely long time for a query - are you sure you aren't missing any indexes for the query? Do check your execution plan - you could get a huge performance gain from a well-placed index.
In MySQL, you could do
INSERT INTO `cached_result_table` (
SELECT your_query_here
)
(of course, cached_result_table needs to have the exact same column structure as your SELECT returns, otherwise you'll get an error).
Then, you could query these cached results (instead of the original tables), and only run the above query from time to time - to update the cached_result_table.
Of course, the query will need to run at least once initially, which will take the 20-30 minutes you mentioned. I suggest pre-populating the cached table before the data is requested, and keeping some locking mechanism to prevent the update query from running several times simultaneously. Pseudocode:
init:
    insert select your_big_query

work:
    if your_big_query cached table is empty or nearing expiration:
        refresh in the background:
            check flag to see if there's another "refresh" process running
            if yes
                end // don't run two your_big_queries at the same time
            else
                set flag
                re-run your_big_query, save to cached table
                clear flag
    serve data to clients always from cached table
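A hedged Java sketch of the same flag-guarded refresh; the cached_result_table name and the DataSource are placeholders, and the in-memory flag only coordinates refreshes within a single application instance:

import java.sql.Connection;
import java.sql.Statement;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sql.DataSource;

public class CachedQueryRefresher {

    private final AtomicBoolean refreshRunning = new AtomicBoolean(false);
    private final DataSource dataSource;

    public CachedQueryRefresher(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Rebuilds cached_result_table from the big query unless a refresh is
    // already in progress (the "flag" from the pseudocode above).
    public void refreshInBackground(String bigQuery) {
        if (!refreshRunning.compareAndSet(false, true)) {
            return; // don't run two copies of the big query at the same time
        }
        new Thread(() -> {
            try (Connection con = dataSource.getConnection();
                 Statement stmt = con.createStatement()) {
                stmt.executeUpdate("TRUNCATE TABLE cached_result_table");
                stmt.executeUpdate("INSERT INTO cached_result_table " + bigQuery);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                refreshRunning.set(false); // clear flag
            }
        }).start();
    }
}

Clients keep reading from cached_result_table while the refresh runs in the background.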
An easy way to do that in Oracle is "CREATE TABLE sometempname AS SELECT...". That will create a new table using the result columns from the select.
Not quite sure what you are requesting.
Currently you have 50 database sessions. Say 40 of them end up running long-running queries; that leaves 10 to service the rest.
What you seem to be asking is for those 40 queries to run asynchronously (in the background) without clogging up the connection pool of 50. The question is, do you want those 40 running concurrently with (potentially) another 50 queries from the connection pool, or do you want them queued up in some way?
Queuing can be done (look into DBMS_SCHEDULER and DBMS_JOB). But you will need to deliver those results into some other table and know how to deliver that result set. The old fashioned way is simply to generate reports on request that get delivered to a directory on a shared drive or by email. Could be PDF or CSV or Excel.
If you want the 40 running concurrently alongside the 50 'connection pool' settings, then you may be best off setting up a separate connection pool for the long-running queries.
You can look into Resource Manager for terminating calls that take too long or too many resources. That way the quickie pool can't get bogged down in long running requests.
The most generic approach I can think of in Oracle is creating a stored procedure that converts a result set into XML and stores it as a CLOB/XMLType in a table holding the results of your long-running queries.
You can find more on generating XML from generic result sets here.
select dbms_xmlgen.getxml('select employee_id, first_name, last_name, phone_number from employees where rownum < 6') xml
from dual;
