The MySQL Connector/J documentation (here) mentions two ways in which the JDBC driver retrieves results from the MySQL database. One is the default mode, in which the entire result set is loaded into memory and made accessible to the code. The second is row-by-row streaming.
I would like to know whether the latest versions of MySQL/MySQL JDBC support server-side cursors. Specifically, I would like to know whether the options useCursorFetch=true and defaultFetchSize>0 can be used to ensure that the result set is retrieved from the database in batches of a certain size (the fetch size). MySQL describes server-side cursors in its C API (here), and I would like to know whether similar support exists in the MySQL JDBC driver.
If this support exists, what are the constraints of such an operation? I understand that a temporary table would be created in the server's memory from which results would be fetched. But what are the other things to look out for (such as table/row locks, restrictions on updates/insertions, and result set/connection closing)?
The most recent version of the documentation you linked to has this note:
By default, ResultSets are completely retrieved and stored in memory. [...] you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
Statement stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                                      java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This sounds like what you're looking for.
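For the batched behaviour asked about in the question (rather than one-row-at-a-time streaming), Connector/J also supports server-side cursor fetching via the useCursorFetch connection property combined with a positive fetch size. A minimal sketch, assuming a local server and a hypothetical big_table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorFetchSketch {
    public static void main(String[] args) throws Exception {
        // useCursorFetch=true asks the driver to use a server-side cursor
        String url = "jdbc:mysql://localhost/test?useCursorFetch=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(100); // retrieve rows in batches of 100
            try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM big_table")) {
                while (rs.next()) {
                    // process the current row
                }
            }
        }
    }
}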
Our team works closely with an Oracle DB server using JDBC. In one of our changes, I'm calling a stored procedure that returns two different ResultSets. At first, my implementation assumed default scrollability.
After that failed, I looked it up on the Internet.
Everything I could read said basically the same thing: use the prepareStatement or prepareCall methods with the appropriate TYPE_SCROLL_INSENSITIVE and CONCUR_READ_ONLY. None of these worked.
The stored procedure I use, again, returns two different result sets, and they are extracted through a (ResultSet) rs.getObject("name"). In most examples, the ResultSet comes back directly from an executeQuery call.
My question is: do the scrollability/updatability types passed to the prepareCall methods affect this sort of ResultSet? If so, how do I get them?
I know that the JDBC driver can downgrade my request for a scrollable ResultSet. How can I tell whether my ResultSet was downgraded?
On that note, why aren't ResultSets scrollable by default? What are the best practices, and what is "the cost" of their flexibility?
In Oracle, a cursor is a forward-only structure. All the database knows how to do is fetch the next row (well, technically the next n rows). In order to make a ResultSet seem scrollable, you rely on the JDBC driver.
The JDBC driver has two basic approaches to making a ResultSet seem scrollable. The first is to save the entire result set in memory as you fetch the data, just in case you want to go backwards. Functionally, that works, but it has potentially catastrophic results on performance and scalability when a query returns a fair amount of data. The first time some piece of code starts chewing up gigabytes of RAM on app servers because a query returned thousands of rows that included a bunch of long comment fields, that JDBC driver will rightly get pilloried as a resource hog.
The more common approach is for the driver to add a key to the query and to use that key to manage the data the driver caches. So, for example, the driver might keep the last 1000 rows in memory in their entirety but cache only the key for earlier rows, so it can go back and re-fetch that data later. That's more complicated to code, and it also requires that the ResultSet have a unique key. Normally, that's done by trying to add a ROWID to the query. That's why, for example, the Oracle JDBC driver specifies that a scrollable or updatable ResultSet cannot use a SELECT * but can use SELECT alias.*: the latter makes it possible for the driver to blindly add a ROWID column to the query.
A ResultSet coming from a stored procedure, however, is completely opaque to the driver: it has no way of getting the query that was used to open the ResultSet, so it has no way to add an additional key column or to go back and fetch the data again. If the driver wanted to make the ResultSet scrollable, it would have to go back to caching the entire ResultSet in memory. Technically, that is entirely possible, but very few drivers will do so, since it tends to lead to performance problems. It is much safer to downgrade the ResultSet. Most of the time, the application is in a better position to decide whether it is reasonable to cache the entire ResultSet, because you know whether it will only ever return a small amount of data, or whether rows can be re-fetched by their natural key.
You can use the getType() and getConcurrency() methods on your ResultSet to determine whether your ResultSet has been downgraded by the driver.
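A minimal sketch of that check (cstmt is assumed to be the CallableStatement the result sets came from; per the JDBC spec, a downgrade should also be reported as an SQLWarning):

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLWarning;
import java.sql.Statement;

class DowngradeCheck {
    // Prints whether the driver honoured the requested type/concurrency.
    static void report(Statement cstmt, ResultSet rs) throws SQLException {
        if (rs.getType() == ResultSet.TYPE_FORWARD_ONLY) {
            System.out.println("Scrollability was downgraded to forward-only");
        }
        if (rs.getConcurrency() == ResultSet.CONCUR_READ_ONLY) {
            System.out.println("ResultSet is read-only");
        }
        for (SQLWarning w = cstmt.getWarnings(); w != null; w = w.getNextWarning()) {
            System.out.println("Driver warning: " + w.getMessage());
        }
    }
}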
I'm trying to access data in multiple databases on a single running instance. The table structures of these databases are all the same. As far as I know, creating a new connection using JDBC is very expensive, but the JDBC connection string requires a format like this:
jdbc:mysql://hostname/databaseName, which needs to specify a particular database.
So I'm wondering: is there any way to query data in multiple databases using one connection?
The MySQL documentation is badly written on this topic.
The SELECT Syntax page refers to the JOIN Syntax page for how a table name can be written, even if you don't use JOIN clauses. The JOIN Syntax page simply says tbl_name, without further defining what that is. There is even a comment at the bottom calling this out:
This page needs to make it explicit that a table reference can be of the form schema_name.tbl_name, and that joins between databases are therefore possible.
The Schema Object Names page says nothing about qualifying names, but does have a sub-page called Identifier Qualifiers, which says that a table column can be referred to using the syntax db_name.tbl_name.col_name. The page says nothing about the ability to refer to tables using db_name.tbl_name.
But, if you can refer to a column using db_name.tbl_name.col_name, it would only make sense if you can also refer to a table using db_name.tbl_name, which means that you can access all your databases using a single Connection, if you're ok with having to qualify the table names in the SQL statements.
As mentioned by @MarkRotteveel in a comment, you can also switch databases using the Connection.setCatalog(String catalog) method.
This is documented in the MySQL Connector/J 5.1 Developer Guide:
Initial Database for Connection
If the database is not specified, the connection is made with no default database. In this case, either call the setCatalog() method on the Connection instance, or fully specify table names using the database name (that is, SELECT dbname.tablename.colname FROM dbname.tablename...) in your SQL. Opening a connection without specifying the database to use is generally only useful when building tools that work with multiple databases, such as GUI database managers.
Note: Always use the Connection.setCatalog() method to specify the desired database in JDBC applications, rather than the USE database statement.
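A sketch combining both approaches (the database, table, and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MultiDbSketch {
    public static void main(String[] args) throws Exception {
        // No default database is specified in the URL.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/", "user", "password")) {

            // Option 1: fully qualify table names in the SQL.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT a.col, b.col FROM db1.tbl a JOIN db2.tbl b ON a.id = b.id")) {
                while (rs.next()) { /* ... */ }
            }

            // Option 2: switch the default database, then use unqualified names.
            conn.setCatalog("db1");
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT col FROM tbl")) {
                while (rs.next()) { /* ... */ }
            }
        }
    }
}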
The MySQL JDBC connector defines two fetch modes:
the default one fetches the whole ResultSet at once
streaming, when the statement fetchSize is set to Integer.MIN_VALUE
According to the documentation, the streaming will fetch each row individually, one at a time.
Is it true that, when using streaming, each row is fetched in a separate database roundtrip?
Does the MySQL server prefetch the result set in advance, or does it traverse the server-side cursor one row at a time as well?
I believe the short answer is yes. I don't know the nuances as they apply to mysql_use_result/mysql_store_result, but there are a few types of prefetch:
The InnoDB storage engine underneath has read-ahead, so it will start fetching pages in advance.
Some queries do need to be materialized in full before they can be streamed row at a time (think of a sort without using an index, or a group by without loose index scan). If this happens, the temporary table will show up using the show profiles feature.
Finally, in MySQL 5.6+ retrieval from the storage engine can be batched (BKA, Batched Key Access). This is probably the case you were hinting at; the buffer that gets filled is sized by join_buffer_size.
I am having difficulty using the CachedRowSetImpl class in Java.
I want to analyze the data of a huge Postgres table that contains ~35,000,000 rows and 3 integer columns.
I cannot load everything into my computer's physical memory, so I want to read the rows in batches of 100,000.
When executing the corresponding query (select col1,col2,col3 from theTable limit 100000) in a psql prompt or in a graphical interface such as pgAdmin, it takes around 4000 ms to load the 100,000 rows and uses a few megabytes of memory.
I tried to do the same operation with the following Java code:
CachedRowSet rowset = new CachedRowSetImpl(); // com.sun.rowset.CachedRowSetImpl
int pageSize = 100000; // the intended batch size
rowset.setCommand("select pk_lib_scaf_a,pk_lib_scaf_b,similarity_evalue from from_to_scaf");
rowset.setPageSize(pageSize);
rowset.setReadOnly(true);
rowset.setFetchSize(pageSize);
rowset.setFetchDirection(ResultSet.FETCH_FORWARD);
rowset.execute(myConnection); // myConnection is an open java.sql.Connection
System.out.println("start !");
while (rowset.nextPage()) {
    while (rowset.next()) {
        // treatment of the current data page
    }
    rowset.release();
}
When running the above code, the "start !" message is never displayed in the console, and execution seems to be stuck at the rowset.execute() line.
Moreover, memory consumption goes crazy and reaches the limit of my computer's physical memory (8 GB).
That's strange; it looks like the program tries to fill the rowset with all ~35,000,000 rows, without considering the pageSize configuration.
Has anybody experienced such a problem with Java JDBC and the Postgres driver? What am I missing?
postgres 9.1
java jdk 1.7
From the CachedRowSet Javadoc (emphasis mine):
A CachedRowSet object is a disconnected rowset, which means that it makes use of a connection to its data source only briefly. It connects to its data source while it is reading data to populate itself with rows and again while it is propagating changes back to its underlying data source. The rest of the time, a CachedRowSet object is disconnected, including while its data is being modified.
From your question:
it looks like the program tries to fill the rowset with all ~35,000,000 rows, without considering the pageSize configuration
Yes, CachedRowSet will retrieve all 35M rows from your database at once, and only after that will it apply the pagination and the other properties you set. A possible solution would be to process the data in small chunks, with a flag on each row marking it as processed, as sketched below.
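A rough sketch of that idea, reusing the column names from the question and assuming a boolean processed column has been added to the table:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

class ChunkedProcessing {
    // Processes the table in chunks, flagging finished rows as processed.
    static void run(Connection conn, int chunkSize) throws Exception {
        String pick = "select pk_lib_scaf_a, pk_lib_scaf_b, similarity_evalue"
                    + " from from_to_scaf where not processed limit " + chunkSize;
        String mark = "update from_to_scaf set processed = true"
                    + " where pk_lib_scaf_a = ? and pk_lib_scaf_b = ?";
        while (true) {
            int seen = 0;
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(pick);
                 PreparedStatement upd = conn.prepareStatement(mark)) {
                while (rs.next()) {
                    seen++;
                    // ... treat the current row ...
                    upd.setInt(1, rs.getInt(1));
                    upd.setInt(2, rs.getInt(2));
                    upd.addBatch();
                }
                upd.executeBatch();
            }
            if (seen == 0) break; // nothing left to process
        }
    }
}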
I would recommend using an ETL tool like Pentaho that already handles this kind of problem.
In fact, cursor support is implicitly built into the PostgreSQL JDBC driver, as described in its documentation; a cursor is, however, only created automatically when certain conditions are met.
http://jdbc.postgresql.org/documentation/head/query.html
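Concretely, those conditions are: autocommit disabled, a forward-only ResultSet, and a positive fetch size. A sketch of the plain-JDBC equivalent for the query above:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class PgCursorFetch {
    // Streams the table through a server-side cursor, 100,000 rows per round trip.
    static void run(Connection conn) throws Exception {
        conn.setAutoCommit(false); // required, or the driver fetches everything
        try (Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(100000);
            try (ResultSet rs = stmt.executeQuery(
                    "select pk_lib_scaf_a, pk_lib_scaf_b, similarity_evalue from from_to_scaf")) {
                while (rs.next()) {
                    // treatment of the current row
                }
            }
        }
        conn.commit();
    }
}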
I am using Java to read from a SQL RDBMS and return the results to the user. The problem is that the database table has 155 million rows, which makes the wait time very long.
I wanted to know whether it is possible to retrieve results as they come from the database and present them incrementally to the user (in batches).
My query is a simple SELECT * FROM Table_Name query.
Is there a mechanism or technology that can give me callbacks of DB records, in batches until the SELECT query finishes?
The RDBMS that is used is MS SQL Server 2008.
Thanks in advance.
Methods Statement#setFetchSize and Statement#getMoreResults are supposed to allow you to manage incremental fetches from the database. Unfortunately, this is the interface spec and vendors may or may not implement these. Memory management during a fetch is really down to the vendor (which is why I wouldn't strictly say that "JDBC just works like this").
From the JDBC documentation on Statement:
setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by this Statement.
getMoreResults()
Moves to this Statement object's next result, returns true if it is a ResultSet object, and implicitly closes any current ResultSet object(s) obtained with the method getResultSet.
getMoreResults(int current)
Moves to this Statement object's next result, deals with any current ResultSet object(s) according to the instructions specified by the given flag, and returns true if the next result is a ResultSet object.
The current parameter indicates whether to keep or close the current ResultSet.
Also, this SO response discusses the use of setFetchSize with regard to SQL Server 2005 and how it doesn't seem to manage batched fetches. The recommendation is to test this using the 2008 driver or, better, to use the jTDS driver (which gets a thumbs-up in the comments).
This response to the same SO post may also be useful, as it contains a link to the SQL Server driver settings on MSDN.
There's also some good info on the MS TechNet website, though relating more to SQL Server 2005; I couldn't find the 2008-specific version in my cursory review. Anyway, it recommends creating the Statement with:
com.microsoft.sqlserver.jdbc.SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY (2004) scrollability for forward-only, read-only access, and then using the setFetchSize method to tune performance.
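Putting that together, a sketch under those recommendations (the URL, credentials, and table are placeholders; TYPE_SS_SERVER_CURSOR_FORWARD_ONLY is the Microsoft driver's vendor constant, with the value 2004):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ServerCursorSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=mydb;user=sa;password=secret";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement(
                 com.microsoft.sqlserver.jdbc.SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY,
                 ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(1000); // rows per server round trip
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM Table_Name")) {
                while (rs.next()) {
                    // hand each row (or batch of rows) to the user incrementally
                }
            }
        }
    }
}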
Using plain pagination (LIMIT pageno, rows / TOP) might create holes and duplicates while the data changes, but it can be combined with checking the last row ID seen (WHERE id > ? ORDER BY id LIMIT 0, 100), as sketched below.
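A sketch of that last-row-ID (keyset) approach against SQL Server, assuming an indexed id column (the payload column is made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class KeysetPagination {
    // Pages through the table in stable batches of 100, keyed on the last id seen.
    static void run(Connection conn) throws Exception {
        long lastId = -1;
        while (true) {
            int rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT TOP 100 id, payload FROM Table_Name WHERE id > ? ORDER BY id")) {
                ps.setLong(1, lastId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        lastId = rs.getLong("id");
                        // present this row to the user
                    }
                }
            }
            if (rows == 0) break; // no more pages
        }
    }
}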
You may use ResultSet.TYPE_FORWARD_ONLY together with a forward fetch direction (ResultSet.FETCH_FORWARD).
This is exactly how a JDBC driver is supposed to work (I remember a bug in an old PostgreSQL driver that caused all fetched records to be stored in memory).
Crucially, it enables you to read records as soon as the query starts returning them. This is where I would start looking.
For example, Oracle optimizes SELECT * queries for fetching the whole set, which means it can take a long time before the first results appear. You can give hints to optimize for fetching the first results instead, so you can show the first rows to your user quite quickly, even though the whole query may take longer to execute; for instance:
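A sketch with Oracle's FIRST_ROWS hint (conn is assumed to be an open java.sql.Connection to Oracle; the table name is a placeholder):

// Ask the optimizer to favour quick delivery of the first rows.
PreparedStatement ps = conn.prepareStatement(
        "SELECT /*+ FIRST_ROWS(100) */ * FROM big_table");
ps.setFetchSize(100); // and fetch them in batches of 100
ResultSet rs = ps.executeQuery();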
You should test your query in a console first to check when it starts to return results. Then try it with JDBC and monitor the memory usage while you iterate through the ResultSet. If the memory usage grows quickly, check whether you have opened the ResultSet in forward-only, read-only mode, and update the driver if necessary.
If such a solution is not feasible because of memory usage, you can still use cursors manually and fetch N rows (say, 100) in each query.
Cursor documentation for MSSQL is available, for example, here: http://msdn.microsoft.com/en-us/library/ms180152.aspx