This question came up while I was modifying some Java code that does batch updates and retrieval against MySQL.
My understanding is that fetch size is the maximum number of rows in a ResultSet object, and batch limit is the number of select/insert/update queries that can be added to a batch for batch execution. Can anyone correct me if I am wrong here?
Thanks.
You are almost correct. To add to it, the Javadoc of Statement#setFetchSize() says:
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database
The batch limit, on the other hand, relates to how many rows you can insert or update in one batch, which in MySQL is in turn related to max_allowed_packet.
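To make the batch side concrete, here is a minimal sketch using the standard JDBC addBatch()/executeBatch() calls. The class name, the method name and the single-column INSERT it assumes are hypothetical, and the right batch size depends on your driver and server settings.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInsertSketch {
    /**
     * Queues one parameter set per name and submits them in groups of batchSize.
     * Returns how many times executeBatch() was called.
     * Assumes the statement has exactly one String parameter (hypothetical schema).
     */
    static int insertInBatches(PreparedStatement insert, List<String> names, int batchSize)
            throws SQLException {
        int flushes = 0;
        int pending = 0;
        for (String name : names) {
            insert.setString(1, name);
            insert.addBatch();             // queued on the client, nothing sent yet
            if (++pending == batchSize) {
                insert.executeBatch();     // submit the whole group at once
                pending = 0;
                flushes++;
            }
        }
        if (pending > 0) {                 // submit the final, partial group
            insert.executeBatch();
            flushes++;
        }
        return flushes;
    }
}
```

On the read side, statement.setFetchSize(n) is only a hint about how many rows the driver pulls per round trip; it does not cap the number of rows the ResultSet will eventually return.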
On a side note:
You may also find the JDBC API implementation notes a good read:
ResultSet
By default, ResultSets are completely retrieved and stored in memory.
In most cases this is the most efficient way to operate, and due to
the design of the MySQL network protocol is easier to implement. If
you are working with ResultSets that have a large number of rows or
large values, and cannot allocate heap space in your JVM for the
memory required, you can tell the driver to stream the results back
one row at a time.
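As a sketch of the streaming mode those notes describe: MySQL Connector/J streams row by row when the statement is forward-only, read-only, and has a fetch size of Integer.MIN_VALUE. The class and method names below are hypothetical.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class MySqlStreaming {
    /** Creates a statement whose results Connector/J streams one row at a time. */
    static Statement streamingStatement(Connection conn) throws SQLException {
        // Forward-only and read-only are required for streaming.
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        // Integer.MIN_VALUE is the signal Connector/J uses for row-by-row streaming.
        stmt.setFetchSize(Integer.MIN_VALUE);
        return stmt;
    }
}
```

With a streaming result set you have to read all rows (or close the result set) before issuing another query on the same connection.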
Related
I have a Java application that can use a SQL database from any vendor. So far we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later into a different instance of the application. The DB is pretty big, so the table has many rows. The export and import have to be done from inside the Java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of the size of the result, and secondly, executing a lot of inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process, or is there no "elegant" solution?
Why not do the export/import in a single step, with batching (for performance) and chunking (to avoid errors and to provide a checkpoint from which to resume after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk-size, which might need tuning for the specific target database (big enough to speed things up but small enough to make the chunk not exceed some database limit and create failures).
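A minimal sketch of generating such a statement for a given chunk size, using ? placeholders so it can be run through a PreparedStatement. The class and method names are hypothetical, and the table and column names are concatenated as-is, so they must come from trusted code rather than user input.

```java
import java.util.Collections;
import java.util.List;

final class MultiRowInsert {
    /** Builds e.g. "INSERT INTO t (a, b) VALUES (?, ?), (?, ?)" for rowCount rows. */
    static String build(String table, List<String> columns, int rowCount) {
        if (columns.isEmpty() || rowCount < 1) {
            throw new IllegalArgumentException("need at least one column and one row");
        }
        // One "(?, ?, ...)" group with a placeholder per column.
        String oneRow = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rowCount, oneRow));
    }
}
```

The last chunk of a table is usually shorter than the others, so it needs its own statement built with the smaller row count.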
As already proposed, each of these chunks should then be executed in its own transaction, and your application should remember which chunk it last executed successfully, so that after an error it can continue from there on the next run.
For the chunks themselves, you really should use LIMIT and OFFSET (with an ORDER BY on a unique key, so that the chunks are stable between runs).
This way you can repeat any chunk at any time, each chunk is atomic by itself, and it should perform much better than single-row statements.
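The chunked read could look roughly like this. It is only a sketch: the class and method names are hypothetical, and it assumes the table has a unique column to order by, since LIMIT/OFFSET without a deterministic ORDER BY does not guarantee that the same rows land in the same chunk on a retry.

```java
final class ChunkQuery {
    /** Builds the SELECT for the chunk with the given zero-based index. */
    static String forChunk(String table, String orderColumn, int chunkSize, long chunkIndex) {
        long offset = chunkIndex * chunkSize;   // rows to skip before this chunk
        return "SELECT * FROM " + table
                + " ORDER BY " + orderColumn
                + " LIMIT " + chunkSize
                + " OFFSET " + offset;
    }
}
```

If the application stores the index of the last committed chunk, resuming after a failure means calling forChunk with that index plus one.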
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTs will perform well if you run them all in a single transaction and use a PreparedStatement for the INSERT.
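Put together, this could look roughly like the sketch below. The table names, the two-column (id, name) layout, the batch size of 1000 and the class and method names are all hypothetical. Note that the PostgreSQL JDBC driver only uses a server-side cursor when autocommit is off on the reading connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class PgCopySketch {
    /** Pumps (id, name) rows from an open ResultSet into a batched INSERT. */
    static long pump(ResultSet rows, PreparedStatement insert, int batchSize) throws SQLException {
        long copied = 0;
        while (rows.next()) {
            insert.setLong(1, rows.getLong(1));
            insert.setString(2, rows.getString(2));
            insert.addBatch();
            if (++copied % batchSize == 0) {
                insert.executeBatch();
            }
        }
        if (copied % batchSize != 0) {
            insert.executeBatch();       // flush the final, partial batch
        }
        return copied;
    }

    static long copy(Connection source, Connection target) throws SQLException {
        source.setAutoCommit(false);     // needed for cursor-based fetching in pgjdbc
        target.setAutoCommit(false);     // all INSERTs share one transaction
        try (PreparedStatement select =
                     source.prepareStatement("SELECT id, name FROM src_table");
             PreparedStatement insert =
                     target.prepareStatement("INSERT INTO dst_table (id, name) VALUES (?, ?)")) {
            select.setFetchSize(10_000); // rows per round trip, as suggested above
            long copied;
            try (ResultSet rows = select.executeQuery()) {
                copied = pump(rows, insert, 1_000);
            }
            target.commit();
            return copied;
        } catch (SQLException e) {
            target.rollback();
            throw e;
        }
    }
}
```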
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
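From JDBC, the bulk load can be issued as an ordinary statement. The sketch below assumes a comma-delimited file on the client machine; the class and method names are hypothetical, and the exact COPY options you need (delimiter, error handling and so on) should be checked against the COPY reference linked above.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class VerticaBulkLoad {
    /** Builds a COPY ... FROM LOCAL statement for a comma-delimited client-side file. */
    static String copyStatement(String table, String localPath) {
        // Double any single quotes so the path stays a valid SQL string literal.
        String quotedPath = "'" + localPath.replace("'", "''") + "'";
        return "COPY " + table + " FROM LOCAL " + quotedPath + " DELIMITER ','";
    }

    /** Runs the bulk load on an open Vertica connection. */
    static void load(Connection conn, String table, String localPath) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(copyStatement(table, localPath));
        }
    }
}
```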
I have to read a huge amount of data from the database (for example, let's say more than 500,000 records). Then I have to save the data I read to a file. I have many issues with using a cursor (not only memory issues).
Is it possible to do this without a cursor, for example using a stream? If so, how can I achieve it?
I have experience working with huge data sets (almost 500 million records). I simply used a PreparedStatement query, a ResultSet and, of course, some buffer tweaking through:
setFetchSize(int)
In my case, I split the program into threads because the huge table was partitioned (each thread processed one partition), but I don't think that applies to you.
It is pointless to fetch the data through a cursor. I would rather use a database view or a plain SQL query. Do not use an ORM for this purpose.
According to your comment, your best option is to have JDBC fetch only a specific number of rows at a time instead of all of them (this lets processing start sooner and avoids loading the entire table into the ResultSet). Save your data into a collection and write it to the file using a BufferedWriter. You can also take advantage of a multi-core CPU by running more threads, for example one thread for the first block of fetched rows and a second thread for the next. If you use threads, use synchronized collections and be aware that you may run into ordering problems.
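A single-threaded version of that idea might look like the sketch below. It is a variation rather than exactly what is described above: it writes each row as soon as it is read instead of collecting rows first, which keeps memory use flat. The class name, method name and tab-separated output format are hypothetical; the caller is expected to have set a fetch size on the statement that produced the ResultSet.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.sql.ResultSet;
import java.sql.SQLException;

final class ResultSetExport {
    /** Writes every row as one tab-separated line and returns the number of rows written. */
    static long writeRows(ResultSet rows, Writer out) throws SQLException, IOException {
        int columns = rows.getMetaData().getColumnCount();
        BufferedWriter writer = new BufferedWriter(out);   // batches small writes
        long count = 0;
        while (rows.next()) {
            for (int i = 1; i <= columns; i++) {
                if (i > 1) {
                    writer.write('\t');
                }
                String value = rows.getString(i);
                writer.write(value == null ? "" : value);  // SQL NULL becomes an empty field
            }
            writer.write('\n');
            count++;
        }
        writer.flush();   // push buffered data to the underlying writer; caller closes it
        return count;
    }
}
```

In real use the Writer would typically come from Files.newBufferedWriter(path), opened in a try-with-resources block around the call.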
I wish to read all rows of a large table from PostgreSQL in Java. I am processing the rows one by one in the Java software.
By default the JDBC PostgreSQL driver reads all rows into memory, meaning my program runs out of memory.
The documentation talks of "Getting results based on a cursor" using st.setFetchSize(50); I have implemented that and it works well.
Is there any disadvantage to this approach? If not, I would enable it for all our queries, big and small, or is that a bad idea?
Well, if you have a fetchsize of 50 and you get 1000 results, it will result in 20 round-trips to the database. So no, it's not a good idea to enable it blindly without thinking of the actual queries being run.
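The arithmetic behind that number is a ceiling division, sketched here with a hypothetical helper. It counts only the fetches that return rows; a driver may make an additional trip to discover that the result is exhausted.

```java
final class FetchRoundTrips {
    /** Fetches needed to read totalRows when each fetch returns up to fetchSize rows. */
    static long roundTrips(long totalRows, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        return (totalRows + fetchSize - 1) / fetchSize;   // ceiling of totalRows / fetchSize
    }
}
```

For 1000 rows, a fetch size of 50 gives 20 fetches, while a fetch size of 1000 gives a single one.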
A bigger question is why your ResultSets are so big that you run out of memory. Are you only loading data you are going to use and simply don't have much memory, or are there perhaps poorly designed queries that return excessive results?
I am using a Java JDBC application to fetch about 500,000 records from a DB. The database is Oracle. I write the data to a file as soon as each row is fetched. Since it takes about an hour to fetch all the data, I am trying to increase the fetch size of the result set. I have read in multiple places that one should be careful about memory consumption when increasing the fetch size. Does increasing the fetch size actually increase the heap memory used by the JVM?
Suppose the fetch size is 10 and the query returns 100 rows in total. During the first fetch the result set contains 10 records. Once I have read the first 10 records, the result set fetches the next 10. Does this mean that after the second fetch the result set will contain 20 records? Are the earlier 10 records still kept in memory, or are they removed when the newer batch is fetched?
Any help is appreciated.
It depends. Different drivers may behave differently and different ResultSet settings may behave differently.
If you have a CONCUR_READ_ONLY, FETCH_FORWARD, TYPE_FORWARD_ONLY ResultSet, the driver will almost certainly actively store in memory the number of rows that corresponds to your fetch size (of course data for earlier rows will remain in memory for some period of time until it is garbage collected). If you have a TYPE_SCROLL_INSENSITIVE ResultSet, on the other hand, it is very likely that the driver would store all the data that was fetched in memory in order to allow you to scroll backwards and forwards through the data. That's not the only possible way to implement this behavior, so different drivers (and different versions of drivers) may have different behaviors but it is the simplest and the way that most drivers I've come across behave.
While increasing the fetch size may help performance a bit, I would also look into tuning the SDU size, which controls the size of the packets at the sqlnet layer. Increasing the SDU size can speed up data transfers.
Of course the time it takes to fetch these 500,000 rows largely depends on how much data you're fetching. If it takes an hour I'm guessing you're fetching a lot of data and/or you're doing it from a remote client over a WAN.
To change the SDU size:
First change the default SDU size on the server to 32k (starting in 11.2.0.3 you can even use 64kB and up to 2MB starting in 12c) by changing or adding this line in sqlnet.ora on the server:
DEFAULT_SDU_SIZE=32767
Then modify your JDBC URL:
jdbc:oracle:thin:@(DESCRIPTION=(SDU=32767)(HOST=...)(PORT=...))(CONNECT_DATA=
My java app fetches about 200,000 records in its result set.
While trying to fetch the data from Oracle DB, the server throws java.lang.OutOfMemoryError: Java heap space
One way to solve this, IMO, is to fetch the records from the DB in smaller chunks (say 100,000 records in each fetch or even smaller count). How can I do this (meaning what API method to use)?
Kindly suggest how to do this or if you think there's a better way to overcome this memory space problem, do suggest. I do not want to use JVM params like -Xmx because I read that that's not a good way to handle OutOfMemory errors.
If you are using Oracle DB, you can add AND ROWNUM < XXX to your SQL query, so that at most XXX-1 rows are fetched by the query.
Another way is to call statement.setFetchSize(xxx) before executing the statement.
Setting a larger JVM memory pool is a poor idea, because a larger data set in the future could still cause an OOM.
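The two options do different things, and a small sketch may help: a ROWNUM filter caps how many rows the query returns in total, whereas setFetchSize keeps all the rows but transfers them in smaller groups. The helper below is hypothetical; it wraps the query in a subquery so that the cap is applied after any ORDER BY in the original query.

```java
final class OracleRowLimit {
    /** Wraps a query so that Oracle returns at most maxRows rows from it. */
    static String limitRows(String query, int maxRows) {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be at least 1");
        }
        // ROWNUM is assigned as rows leave the inner query, i.e. after its ORDER BY.
        return "SELECT * FROM (" + query + ") WHERE ROWNUM <= " + maxRows;
    }
}
```

For the out-of-memory case, a fetch size only helps if the application also processes rows as it reads them instead of accumulating all of them in a collection.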