Strange behavior for java class CachedRowSetImpl

I am having difficulty using the CachedRowSetImpl class in Java.
I want to analyse the data of a huge Postgres table that contains ~35,000,000 rows and 3 integer columns.
I cannot load everything into my computer's physical memory, so I want to read these rows in batches of 100,000.
When executing the corresponding query (select col1,col2,col3 from theTable limit 100000) at the psql prompt or in a graphical interface such as pgAdmin, it takes around 4000 ms and a few megabytes of memory to load the 100,000 rows.
I tried to do the same operation with the following Java code:
CachedRowSet rowset = new CachedRowSetImpl();
int pageSize = 1000000;
rowset.setCommand("select pk_lib_scaf_a,pk_lib_scaf_b,similarity_evalue from from_to_scaf");
rowset.setPageSize(pageSize);
rowset.setReadOnly(true);
rowset.setFetchSize(pageSize);
rowset.setFetchDirection(ResultSet.FETCH_FORWARD);
rowset.execute(myConnection);
System.out.println("start !");
while (rowset.nextPage()) {
    while (rowset.next()) {
        // treatment of current data page
    } // end of inner while
    rowset.release();
}
When running the above code, the "start !" message is never displayed in the console and execution seems to be stuck on the rowset.execute() line.
Moreover, memory consumption goes crazy and reaches the limit of my computer's physical memory (8 GB).
That's strange; it looks like the program tries to fill the rowset with all ~35,000,000 rows, without considering the pageSize configuration.
Has anybody experienced such a problem with Java JDBC and the Postgres driver? What am I missing?
postgres 9.1
java jdk 1.7

From the CachedRowSet Javadoc (emphasis mine):
A CachedRowSet object is a disconnected rowset, which means that it makes use of a connection to its data source only briefly. It connects to its data source while it is reading data to populate itself with rows and again while it is propagating changes back to its underlying data source. The rest of the time, a CachedRowSet object is disconnected, including while its data is being modified.
From your question:
it looks like the program tries to fill the rowset with the ~35,000,000 lines, without considering the pageSize configuration
Yes, CachedRowSet will retrieve the 35M rows from your database at once, and only after that will it apply the pagination and other defined properties. A possible solution would be to process the data in small chunks, keeping a flag on each row to mark it as processed (a sketch follows below).
I would recommend using an ETL tool like Pentaho that already handles this kind of problems.
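For illustration, here is a rough, untested sketch of that chunk-and-flag idea. It assumes a boolean column named processed (defaulting to false) has been added to from_to_scaf and that (pk_lib_scaf_a, pk_lib_scaf_b) identifies a row; both of these are assumptions, not part of the original schema:

final int batchSize = 100000;
try (Statement select = myConnection.createStatement();
     PreparedStatement markDone = myConnection.prepareStatement(
         "UPDATE from_to_scaf SET processed = true " +
         "WHERE pk_lib_scaf_a = ? AND pk_lib_scaf_b = ?")) {
    boolean more = true;
    while (more) {
        more = false;
        try (ResultSet rs = select.executeQuery(
                "SELECT pk_lib_scaf_a, pk_lib_scaf_b, similarity_evalue " +
                "FROM from_to_scaf WHERE NOT processed LIMIT " + batchSize)) {
            while (rs.next()) {
                more = true;
                // treat the current row here ...
                markDone.setInt(1, rs.getInt(1));
                markDone.setInt(2, rs.getInt(2));
                markDone.addBatch();
            }
        }
        markDone.executeBatch(); // flag the whole chunk as processed
    }
}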

In fact, cursor support is implicitly built into the PostgreSQL JDBC driver, as described in its documentation. A cursor is, however, only created automatically when certain conditions are met.
http://jdbc.postgresql.org/documentation/head/query.html
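In practice this means a plain JDBC Statement can stream the table, provided the documented conditions are met: autocommit is off, the statement is forward-only and read-only, and a fetch size has been set. A minimal sketch against the table from the question (the fetch size is an arbitrary choice):

myConnection.setAutoCommit(false); // required, otherwise the driver fetches everything at once
try (Statement stmt = myConnection.createStatement(
         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(100000); // rows per round trip, tune to taste
    try (ResultSet rs = stmt.executeQuery(
            "SELECT pk_lib_scaf_a, pk_lib_scaf_b, similarity_evalue FROM from_to_scaf")) {
        while (rs.next()) {
            // treat the current row; only ~100,000 rows are held in memory at a time
        }
    }
}
myConnection.commit();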

Related

How to efficiently export/import database data with JDBC

I have a Java application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later on in a different instance of the application. The size of the DB is pretty big so there are many rows in there. The export and import process has to be done from inside the Java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of its size, and secondly, executing a lot of inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process, or is there no "elegant" solution?
Why not do the export/import in a single step, with batching (for performance) and chunking (to avoid errors and provide a checkpoint from which to start off after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk size, which might need tuning for the specific target database (big enough to speed things up but small enough not to exceed some database limit and cause failures).
As already proposed, each of these chunks should be executed in a transaction, and your application should remember which chunk it successfully executed last, so that it can continue from there on the next run if an error occurs.
For the chunks themselves, you really should use LIMIT OFFSET.
This way, you can repeat any chunk at any time, each chunk by itself is atomic and it should perform much better than with single row statements.
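To make this concrete, here is a rough, untested sketch of such a chunked copy over JDBC. The table and column names, the two connections and the resumeOffset checkpoint are all placeholders, the VALUES list is built naively on the assumption that the columns are numeric, and ordering by a unique column keeps the chunks stable:

final int chunkSize = 1000;
long offset = resumeOffset;          // restored from the last successful run
targetConn.setAutoCommit(false);
while (true) {
    StringBuilder values = new StringBuilder();
    int rows = 0;
    try (PreparedStatement ps = sourceConn.prepareStatement(
            "SELECT col_a, col_b FROM source_table ORDER BY col_a LIMIT ? OFFSET ?")) {
        ps.setInt(1, chunkSize);
        ps.setLong(2, offset);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                if (rows++ > 0) values.append(", ");
                values.append('(').append(rs.getLong(1)).append(", ")
                      .append(rs.getLong(2)).append(')');
            }
        }
    }
    if (rows == 0) break;            // nothing left to copy
    try (Statement st = targetConn.createStatement()) {
        // one multi-row INSERT per chunk, committed as a single transaction
        st.executeUpdate("INSERT INTO target_table (col_a, col_b) VALUES " + values);
        targetConn.commit();         // checkpoint: persist offset + rows somewhere durable here
    }
    offset += rows;
}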
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTs will perform well if:
- you run them all in a single transaction
- you use a PreparedStatement for the INSERT
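Putting both points together, a minimal untested sketch (table and column names are placeholders, and the fetch and batch sizes are arbitrary):

sourceConn.setAutoCommit(false);     // required for the PostgreSQL driver to use a cursor
targetConn.setAutoCommit(false);     // one transaction for all the inserts
try (Statement read = sourceConn.createStatement();
     PreparedStatement write = targetConn.prepareStatement(
         "INSERT INTO target_table (col_a, col_b) VALUES (?, ?)")) {
    read.setFetchSize(10000);        // stream instead of materialising the whole table
    try (ResultSet rs = read.executeQuery("SELECT col_a, col_b FROM source_table")) {
        int pending = 0;
        while (rs.next()) {
            write.setLong(1, rs.getLong(1));
            write.setLong(2, rs.getLong(2));
            write.addBatch();
            if (++pending == 10000) { // flush the batch periodically
                write.executeBatch();
                pending = 0;
            }
        }
        if (pending > 0) write.executeBatch();
    }
    targetConn.commit();
}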
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
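For reference, the COPY statement can be issued through a plain JDBC Statement; a minimal, untested sketch, where the table name, file path and delimiter are assumptions:

// Bulk load a previously exported CSV into Vertica in a single ROS container.
try (Statement stmt = verticaConn.createStatement()) {
    stmt.execute("COPY my_table FROM LOCAL '/tmp/export.csv' DELIMITER ',' ABORT ON ERROR");
}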

Server side cursor support in MySQL JDBC

MySQL Connector/J documentation (here) mentions two ways in which the JDBC driver retrieves results from the MySQL database. One is the default operation, in which the entire result set is loaded into memory and made accessible in the code. The second is row-by-row streaming.
I would like to know whether the latest versions of MySQL/MySQL JDBC support server side cursors. Specifically, I would like to know whether the options useCursorFetch=True and defaultFetchSize>0 can be used to ensure that the result set is retrieved from the database in batches of certain size (fetch size). MySQL describes server side cursors in its C API (here), and I would like to know whether similar support is there with MySQL JDBC.
If this support exists, what are the constraints of such an operation? I understand that a temporary table would be created in the server's memory from which results would be fetched. But what are the other things to look out for (such as table/row locks, restrictions on update/insertions, and result set/connection closing)?
The most recent version of the documentation you linked to has this note:
By default, ResultSets are completely retrieved and stored in memory. [...] you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This sounds like what you're looking for.
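If you want batched fetching rather than row-by-row streaming, Connector/J also supports server-side cursors through the useCursorFetch connection property combined with a positive fetch size, which is what the question asks about. A rough, untested sketch (URL, table name and fetch size are placeholders):

// useCursorFetch=true asks the server to create a cursor; setFetchSize controls the batch size.
Connection conn = DriverManager.getConnection(
    "jdbc:mysql://localhost/test?useCursorFetch=true", "user", "password");
try (Statement stmt = conn.createStatement(
         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(1000);          // rows retrieved per round trip
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
        while (rs.next()) {
            // process one row; only ~1000 rows are buffered client-side at a time
        }
    }
}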

Read SQL Database in batches

I am using Java to read from a SQL RDBMS and return the results to the user. The problem is that the database table has 155 million rows, which makes the wait time really long.
I wanted to know if it is possible to retrieve results as they come from the database and present them incrementally to the user (in batches).
My query is a simple SELECT * FROM Table_Name query.
Is there a mechanism or technology that can give me callbacks of DB records, in batches until the SELECT query finishes?
The RDBMS that is used is MS SQL Server 2008.
Thanks in advance.
Methods Statement#setFetchSize and Statement#getMoreResults are supposed to allow you to manage incremental fetches from the database. Unfortunately, this is the interface spec and vendors may or may not implement these. Memory management during a fetch is really down to the vendor (which is why I wouldn't strictly say that "JDBC just works like this").
From the JDBC documentation on Statement:
setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database when more rows are needed for ResultSet
objects generated by this Statement.
getMoreResults()
Moves to this Statement object's next result, returns true if it is a
ResultSet object, and implicitly closes any current ResultSet object(s)
obtained with the method getResultSet.
getMoreResults(int current)
Moves to this Statement object's next result, deals with any current
ResultSet object(s) according to the instructions specified by the given
flag, and returns true if the next result is a ResultSet object.
The current parameter indicates whether to keep or close the current ResultSet.
Also, this SO response discusses the use of setFetchSize with regard to SQL Server 2005 and how it doesn't seem to manage batched fetches. The recommendation is to test this using the 2008 driver or, better, to use the jTDS driver (which gets a thumbs up in the comments).
This response to the same SO post may also be useful as it contains a link to SQLServer driver settings on MSDN.
There's also some good info on the MS technet website but relating more to SQLServer 2005. Couldn't find the 2008 specific version in my cursory review. Anyway, it recommends creating the Statement with:
com.microsoft.sqlserver.jdbc.SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY (value 2004) scrollability for forward-only, read-only access, and then using the setFetchSize method to tune performance.
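For illustration, a rough, untested sketch of that recommendation with the Microsoft JDBC driver (the table name and fetch size are placeholders):

// Server-side forward-only cursor: rows are fetched in batches of the fetch size.
try (Statement stmt = conn.createStatement(
         com.microsoft.sqlserver.jdbc.SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY,
         ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(1000);          // rows per server round trip
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM Table_Name")) {
        while (rs.next()) {
            // hand this row (or a collected batch) to the user incrementally
        }
    }
}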
Using pagination (LIMIT pageno, rows / TOP) might create holes and duplicates, but it can be used in combination with checking the last row ID (WHERE id > ? ORDER BY id LIMIT 0, 100).
You may use TYPE_FORWARD_ONLY or FETCH_FORWARD_ONLY.
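A minimal, untested sketch of the last-row-ID approach mentioned above, assuming Table_Name has an indexed, monotonically increasing id column (id and col_a are placeholder names; TOP is used since the target is SQL Server):

long lastId = 0;
try (PreparedStatement ps = conn.prepareStatement(
        "SELECT TOP 100 id, col_a FROM Table_Name WHERE id > ? ORDER BY id")) {
    while (true) {
        ps.setLong(1, lastId);
        int rows = 0;
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rows++;
                lastId = rs.getLong("id");
                // present this row to the user as part of the current batch
            }
        }
        if (rows < 100) break;        // last (possibly partial) batch reached
    }
}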
This is exactly how the JDBC driver is supposed to work (I remember a bug in an old PostgreSQL driver that caused all fetched records to be stored in memory).
However, it only enables you to read records once the query starts to return them. This is where I would start to search.
For example, Oracle optimizes SELECT * queries for fetching the whole set. This means it can take a long time before the first results appear. You can give hints to optimize for fetching the first results, so you can show the first rows to your user quite fast, even though the whole query takes longer to execute.
You should test your query in the console first to check when it starts to return results. Then try it with JDBC and monitor the memory usage while you iterate through the ResultSet. If memory usage grows fast, check that you have opened the ResultSet in forward-only, read-only mode and, if necessary, update the driver.
If such a solution is not feasible because of memory usage, you can still use cursors manually and fetch N rows (say, 100) in each query.
Cursor documentation for MSSQL: for example here: http://msdn.microsoft.com/en-us/library/ms180152.aspx

what's the fastest way to get a large volume of data from an Oracle database into Java objects

What's the fastest way to get a large volume of data from an Oracle database into Java objects.
Are there any Oracle tricks as to the way the data should be organised?
I was thinking of using plain JDBC rather than any Hibernate style libraries?
Would it be better to get Oracle to produce a file and then read from the file - although this has to be done programatically.
All thoughts appreciated.
I am not a Java or JDBC expert, but if you plan on pulling a lot of rows down from a database, you will likely benefit from increasing the row prefetch on the connection.
// Requires the Oracle JDBC driver (oracle.jdbc.OracleConnection) on the classpath
Connection conn = DriverManager.getConnection("jdbc:oracle:", "user", "password");
// Set the default row prefetch setting for this connection
((OracleConnection) conn).setDefaultRowPrefetch(100);
I believe the default for JDBC is to fetch one row at a time, so you're paying for a round trip to the database for each row fetched. (Note, I've seen documentation that suggests the default is 10 rows per round trip). Setting prefetch to a larger number will fetch more rows per round trip to the database. Speed increases can be dramatic depending on the number of rows and the performance of your network.
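The same effect can also be had per statement with the standard JDBC setFetchSize hint, which the Oracle driver honours; a minimal sketch (table name and fetch size are placeholders):

try (Statement stmt = conn.createStatement()) {
    stmt.setFetchSize(500);           // rows fetched per round trip for this statement
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
        while (rs.next()) {
            // map the row to your Java object here
        }
    }
}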
Depending on how far you want to go with this, I'd imagine that dropping JDBC and writing a custom application residing on the same machine as the DB, using the Oracle Call Interface and JNI, would be the fastest...
It's probably much simpler to just use a plain prepared statement with JDBC and then, if that's not enough (and depending on where the bottleneck is), try making a stored procedure. The caching done by ORMs like Hibernate should not be discounted though, so I guess you'd have to do some benchmarks. Also, if the bottleneck is the database and you write a stored procedure which improves the read performance, then you could still use Hibernate to marshal the data to Java objects. See Using stored procedures for querying.
Whatever you wind up doing, design for/implement "lazy initialization" [this really only applies to complex object hierarchies/networks; you said Java objects (plural), so I'm imagining something more than just a single table that maps to a single object]. So basically, you only read in the objects that are needed at that time; when you run a getter method, it makes additional DB calls for just that data.
Another trick sometimes overlooked in the Java world: if you have some complex SQL coming from the code, you can instead create a view on the Oracle side, embedding that complexity there, and then map your object to the view; so if you can flatten your object like the view, then you're in business.

How to limit number of rows returned from Oracle at the JDBC data source level?

Is there a way to limit the rows returned at the Oracle datasource level in a Tomcat application?
It seems maxRows is only available if you set it on the datasource in the Java code. Putting maxRows="2" on the datasource configuration has no effect.
Is there any other way limit the rows returned? Without a code change?
It isn't something that is available at the configuration level. You may want to double check that it does what you want it to do anyway: see the javadoc for setMaxRows. With Oracle it is still going to fetch every row back for the query and then just drop the ones outside the range. You would really need to use rownum to make it work well with Oracle and you can't do that either in the configuration.
The question is why you want to limit the number of rows returned. There could be many reasons to do this. The first would be simply to limit the data returned by the database. In my opinion this makes little sense in most cases: if I wanted only certain data, I would use a different statement or add a filter condition or something. E.g. if you use Oracle's rownum, you don't know exactly which data is in the rows you get and which is left out, as you only tell the database that you want rows x to y.
The second approach is to limit memory usage and increase performance, so that the ResultSet you get from the JDBC driver will not include all the data. You can limit the number of rows held by the ResultSet using Statement.setFetchSize(). If you move the cursor in the ResultSet beyond the number of rows fetched, the JDBC driver will fetch the missing data from the database. (In the case of Oracle, the database stores the data in a ref cursor, which is accessed directly by the JDBC driver.)
Beware: the code below is provided as a pure example. It has not been tested. It thus may harm you or your computer, or even punch you in the face.
If you want to avoid modifying your SQL queries but still want to have clean code (which means that your code stays maintainable), you may design the solution using wrappers. That is, by using a small set of classes wrapping existing ones, you may achieve what you want seamlessly for the rest of the application, which will still think it is working with a real DataSource, Connection and Statement.
1 - Implement a StatementWrapper or PreparedStatementWrapper class, depending on what your application already uses. These classes are wrappers around normal Statement or PreparedStatement instances. They simply use the inner statement as a delegate that does all the work, except when it is a query (the Statement.executeQuery() method). Only in that precise situation does the wrapper surround the query with the two following strings: "SELECT * FROM (" and ") WHERE ROWNUM < " + maxRowLimit. For basic wrapper code, see how it looks for the DataSourceWrapper below; a sketch of the executeQuery wrapping itself is given after step 4.
2 - Write one more wrapper: ConnectionWrapper, which wraps a Connection and returns a StatementWrapper from createStatement() and a PreparedStatementWrapper from prepareStatement(). These are the previously coded classes, taking the ConnectionWrapper's delegateConnection.createStatement()/prepareStatement() as construction arguments.
3 - Repeat the step with a DataSourceWrapper. Here is a simple code example:
public class DataSourceWrapper implements DataSource
{
    private DataSource mDelegate;

    public DataSourceWrapper( DataSource delegate )
    {
        if( delegate == null ) { throw new NullPointerException( "Delegate cannot be null" ); }
        mDelegate = delegate;
    }

    public Connection getConnection(String username, String password) throws SQLException
    {
        return new ConnectionWrapper( mDelegate.getConnection( username, password ) );
    }

    public Connection getConnection() throws SQLException
    {
        return new ConnectionWrapper( mDelegate.getConnection() );
    }
}
4 - Finally, use that DataSourceWrapper as your application's DataSource. If you're using JNDI (NamingContext), this change should be trivial.
Coding all this is quick and very straightforward, especially if you're using a smart IDE like Eclipse or IntelliJ, which will implement the delegating methods automagically.
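For reference, the heart of step 1 could look something like the sketch below. Only executeQuery() is shown; in real code the class would implement java.sql.Statement and delegate every other method to the inner statement (an IDE can generate those delegations):

public class StatementWrapper
{
    private final Statement mDelegate;
    private final int mMaxRowLimit;

    public StatementWrapper( Statement delegate, int maxRowLimit )
    {
        mDelegate = delegate;
        mMaxRowLimit = maxRowLimit;
    }

    public ResultSet executeQuery( String sql ) throws SQLException
    {
        // Only queries are rewritten; ROWNUM caps the number of rows Oracle returns.
        return mDelegate.executeQuery(
                "SELECT * FROM (" + sql + ") WHERE ROWNUM < " + mMaxRowLimit );
    }
}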
If you know you will be dealing with only one table, then define a view with rownum in the WHERE clause to limit the number of rows. In this way, the number of rows is controlled at the DB and does not need to be specified as part of any query from a client application. If you want to change the number of rows returned, then redefine the view prior to executing the query.
A more dynamic method would be to develop a procedure, pass in a number of rows, and have the procedure return a ref_cursor to your client. This would have the advantage of avoiding hard parsing on the DB and increasing performance.
OK, a code change it'll have to be then.
The scenario is limiting an ad hoc reporting tool so that the end user doesn't pull back too many records and generate a report which is unusable.
We already use Oracle cost-based resource management.
Take a look at this page with a description of limiting how much is pulled into the Java app at a time. As another post points out, the DB will still pull all of the data; this is more about controlling network use and memory on the Java side.
