I am using a Java JDBC application to fetch about 500,000 records from the database. The database being used is Oracle. I write the data into a file as soon as each row is fetched. Since it takes about an hour to fetch all the data, I am trying to increase the fetch size of the result set. I have seen in multiple links that when increasing the fetch size one should be careful about memory consumption. Does increasing the fetch size actually increase the heap memory used by the JVM?
Suppose the fetch size is 10 and the query returns 100 rows in total. During the first fetch the result set contains 10 records. Once I read the first 10 records, the result set fetches the next 10. Does this mean that after the second fetch the result set will contain 20 records? Are the earlier 10 records still maintained in memory, or are they removed while fetching the newer batch?
Any help is appreciated.
It depends. Different drivers may behave differently and different ResultSet settings may behave differently.
If you have a CONCUR_READ_ONLY, FETCH_FORWARD, TYPE_FORWARD_ONLY ResultSet, the driver will almost certainly keep only about as many rows in memory as your fetch size (data for earlier rows will, of course, remain in memory for some period of time until it is garbage collected). If you have a TYPE_SCROLL_INSENSITIVE ResultSet, on the other hand, the driver will very likely store all the data fetched so far in memory, in order to let you scroll backwards and forwards through it. That's not the only possible way to implement this behavior, so different drivers (and different versions of drivers) may behave differently, but it is the simplest approach and the way most drivers I've come across behave.
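For illustration, here is a minimal sketch of that forward-only, read-only case with an explicit fetch size, in the spirit of the original question (Oracle thin driver assumed; connection details, table and column names are made up):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter("export.csv"))) {

            // Ask the driver to pull 500 rows per round trip instead of the Oracle
            // default of 10; only roughly that many rows are buffered on the client
            // at any one time for this ResultSet type.
            stmt.setFetchSize(500);

            try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM big_table")) {
                while (rs.next()) {
                    out.write(rs.getLong("id") + "," + rs.getString("payload"));
                    out.newLine();
                }
            }
        }
    }
}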
While increasing the fetch size may help performance a bit, I would also look into tuning the SDU size, which controls the size of the packets at the SQL*Net layer. Increasing the SDU size can speed up data transfers.
Of course the time it takes to fetch these 500,000 rows largely depends on how much data you're fetching. If it takes an hour I'm guessing you're fetching a lot of data and/or you're doing it from a remote client over a WAN.
To change the SDU size:
First change the default SDU size on the server to 32k (starting in 11.2.0.3 you can even use 64kB and up to 2MB starting in 12c) by changing or adding this line in sqlnet.ora on the server:
DEFAULT_SDU_SIZE=32767
Then modify your JDBC URL:
jdbc:oracle:thin:@(DESCRIPTION=(SDU=32767)(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=...))(CONNECT_DATA=(SERVICE_NAME=...)))
I need some advice :)
I have a database with almost 70 tables, many of them holding over a dozen million large records. I want to split it into a few smaller databases: one for each big client's data and one main database for the rest of the clients' data (while also moving some of the data into a NoSQL database). Because of the many complicated relations between tables, before copying the data I was disabling the triggers that check the correctness of the foreign keys, and then, just before the commit, enabling them again.
This all worked with a small amount of data, but now, when I try to copy one of the big clients' data, I run into a problem with the Java heap size / GC out of memory.
I could increase the heap size, but it's not the point here.
I'm selecting data by a specific id from every table that has any relation to the client data and copying it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I tried selecting portions of the data (something like selecting parts of 5,000 rows instead of all 50,000 in one go), but it fails at exactly the same point.
And here I am, asking for advice on how to handle this problem. I think it all comes down to my trying to copy all the data in one big fat commit. The reason is that I have to disable the triggers while copying, but I must also enable them again before I can commit my changes.
When I try to copy one of the big clients' data, I run into a problem with the Java heap size / GC out of memory.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So, adding a stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver does not exhaust the heap.
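A minimal sketch of that cursor-based setup (connection details and the query are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorFetchSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password")) {

            // Cursor mode only applies when autocommit is off and the ResultSet
            // is TYPE_FORWARD_ONLY (the createStatement() default).
            conn.setAutoCommit(false);

            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(1000); // fetch 1000 rows per cursor reposition
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT * FROM client_data WHERE client_id = 42")) {
                    while (rs.next()) {
                        // copy this row to the target database; only ~1000 rows
                        // are held on the client at any one time
                    }
                }
            }
            conn.commit();
        }
    }
}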
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
I am using Sybase ASE and have a table in which I save results calculated by Java. This table has 10 columns: one column of type INT (but not an ID column) and 9 columns of type VARCHAR(50).
There's no index or trigger on this table (in fact this table is completely independent). I need to insert around 160K rows into it. I tried to split the work into batches of 10,000 insertions each. I used two different approaches: one is Spring's JdbcTemplate.batchUpdate, the other is the native JDBC PreparedStatement.executeBatch API.
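For reference, the plain JDBC variant looks roughly like this (connection URL, table and column names are placeholders, and only 3 of the 10 columns are shown):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sybase:Tds:dbhost:5000/mydb", "user", "password")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO results (num_col, text_col1, text_col2) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 1; i <= 160_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "value-" + i);
                    ps.setString(3, "other-" + i);
                    ps.addBatch();
                    if (i % 10_000 == 0) {   // send 10K rows per batch
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();           // flush any remainder
                conn.commit();
            }
        }
    }
}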
However, there is no clear winner in terms of performance. Both of them take around 25 to 30 seconds for 10K insertions.
Then I thought it could be related to the JDBC driver, so I tried two different drivers: jConnect and jTDS. No real impact on insertion performance.
Finally I decided to compare Sybase with another database, namely PostgreSQL, in my test. I kept the same Java code, and surprisingly PostgreSQL takes only 0.3 seconds for every 10K insertions, while Sybase took 25 to 30 seconds (75 to 100 times longer).
The DBA support team explains the difference by the fact that PostgreSQL is installed on my local machine, while Sybase is installed on our enterprise's server. However, I am not convinced by this explanation at all.
Does anyone know if there's a configuration in Sybase which could considerably impact the insertion speed? Or are there any other possible causes for my above scenario?
The delay you see on the Sybase end comes down to a number of factors that need to be checked, and comparing it to a different database, one running on a local machine at that, is not a fair comparison.
To start with, we need to check the network latency and the storage used by the Sybase database. We need to check the Sybase server configuration, the page size, and the locking scheme of the table you are inserting into. We also need to do a basic health check of the server while you are inserting the data. Since you have used two different ways to insert the data, it is also important to check that both of them are configured appropriately for the Sybase client installed on your system.
To sum it up, it may be something as simple as blocking on the Sybase instance, or it could be related to storage that cannot keep up with the writes. If Sybase is configured properly, the performance should be very good.
Whether the DB server is local or not may indeed make a significant difference. Until you cut out this factor, comparison with a local DB makes little sense.
But that aside, there are many aspects that affect insert performance in ASE. First off, make sure the overall memory configuration (e.g. data cache and procedure cache) is not too small -- leaving it at the installation defaults is a guarantee for disappointing results. Then there is network packet size that can play a role. And the batch size (#rows before you commit). And the table's lock scheme.
Trying to use minimally logged inserts will help (requires config setting changes), especially since the table has no indexes (and no UNIQUE or PK constraints either?)
The ASE server page size (which you choose when you create the server) also makes a difference: bigger is basically better for inserts.
Set the ENABLE_BULK_LOAD parameter to True. It will speed it up.
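If you go down that route, it is usually passed as a connection property. A hedged sketch (assuming the jConnect driver; double-check the exact property name and accepted values in your driver's documentation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class BulkLoadConnectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "user");
        props.put("password", "password");
        // Assumed property name/value; verify against your driver's docs.
        props.put("ENABLE_BULK_LOAD", "true");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sybase:Tds:dbhost:5000/mydb", props)) {
            // ... run the batched inserts over this connection ...
        }
    }
}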
I got this doubt when I was modifying code for doing batch updates for MySQL retrieval using Java.
My understanding is that the fetch size is the maximum number of rows in a ResultSet object, and the batch limit is the number of select/insert/update queries that can be added to a batch for batch execution. Can anyone correct me if I am wrong here?
Thanks.
You are almost correct. However, to add to it, the javadoc of Statement#setFetchSize() says:
Gives the JDBC driver a hint as to the number of rows that should be
fetched from the database
The batch limit, on the other hand, relates to how many rows you can insert or update at once, which in turn is related to max_allowed_packet.
On a side note:
You may also check the JDBC API implementation notes as a good read
ResultSet
By default, ResultSets are completely retrieved and stored in memory.
In most cases this is the most efficient way to operate, and due to
the design of the MySQL network protocol is easier to implement. If
you are working with ResultSets that have a large number of rows or
large values, and cannot allocate heap space in your JVM for the
memory required, you can tell the driver to stream the results back
one row at a time.
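For example, with MySQL Connector/J that streaming mode is typically enabled like this (connection details and query are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlStreamingSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

            // Connector/J streams rows one at a time only for a forward-only,
            // read-only statement combined with this special fetch-size value.
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                while (rs.next()) {
                    // process one row at a time; the full result set is never
                    // materialized in the JVM heap
                }
            }
        }
    }
}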
My java app fetches about 200,000 records in its result set.
While trying to fetch the data from Oracle DB, the server throws java.lang.OutOfMemoryError: Java heap space
One way to solve this, IMO, is to fetch the records from the DB in smaller chunks (say 100,000 records per fetch, or even fewer). How can I do this (i.e., which API method should I use)?
Kindly suggest how to do this, or if you think there's a better way to overcome this memory problem, please suggest that instead. I do not want to use JVM parameters like -Xmx because I read that that's not a good way to handle OutOfMemory errors.
If you are using Oracle DB you can add AND ROWNUM < XXX to your SQL query. This ensures that only XXX-1 rows are fetched per query.
The other way is to call statement.setFetchSize(xxx) before executing the statement.
Setting a larger JVM memory pool is a poor idea, because in the future there may be a larger data set that will cause an OOM again.
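A rough sketch of the chunked approach (placeholder connection details, table and column names; the chunk size is arbitrary):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ChunkedFetchSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {

            // Order by the key inside the subquery, cap the chunk with ROWNUM,
            // and remember the last id seen so the next query continues from there.
            String sql = "SELECT id, payload FROM ("
                       + " SELECT id, payload FROM big_table WHERE id > ? ORDER BY id"
                       + ") WHERE ROWNUM < 100000";
            long lastId = 0;
            boolean more = true;
            while (more) {
                more = false;
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setLong(1, lastId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            // process/write this row here
                            more = true;
                        }
                    }
                }
            }
        }
    }
}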
I’ve been experiencing a performance problem with deleting blobs in derby, and was wondering if anyone could offer any advice.
This is primarily with 10.4.2.0 under windows and solaris, although I’ve also tested with the new 10.5.1.1 release candidate (as it has many lob changes), but this makes no significant difference.
The problem is that with a table containing many large blobs, deleting a single row can take a long time (often over a minute).
I’ve reproduced this with a small test that creates a table, inserts a few rows with blobs of differing sizes, then deletes them.
The table schema is simple, just:
create table blobtest( id integer generated BY DEFAULT as identity, b blob )
and I’ve then created 7 rows with the following blob sizes : 1024 bytes, 1Mb, 10Mb, 25Mb, 50Mb, 75Mb, 100Mb.
I’ve read the blobs back, to check they have been created properly and are the correct size.
They have then been deleted using the sql statement ( “delete from blobtest where id = X” ).
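A minimal version of such a test might look like the following sketch (database name is a placeholder; the read-back check mentioned above is omitted for brevity):

import java.io.ByteArrayInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class BlobDeleteTest {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:derby:blobtestdb;create=true")) {
            try (Statement st = conn.createStatement()) {
                st.execute("create table blobtest( id integer generated BY DEFAULT as identity, b blob )");
            }

            int[] sizes = {1024, 1 << 20, 10 << 20, 25 << 20, 50 << 20, 75 << 20, 100 << 20};
            try (PreparedStatement ins = conn.prepareStatement("insert into blobtest(b) values (?)")) {
                for (int size : sizes) {
                    ins.setBinaryStream(1, new ByteArrayInputStream(new byte[size]), size);
                    ins.executeUpdate();
                }
            }

            // Delete the rows one at a time and report how long each delete takes.
            try (PreparedStatement del = conn.prepareStatement("delete from blobtest where id = ?")) {
                for (int id = 1; id <= sizes.length; id++) {
                    long start = System.nanoTime();
                    del.setInt(1, id);
                    del.executeUpdate();
                    System.out.printf("deleted id %d in %.2f s%n", id, (System.nanoTime() - start) / 1e9);
                }
            }
        }
    }
}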
If I delete the rows in the order I created them, average timings to delete a single row are:
1024 bytes: 19.5 seconds
1Mb: 16 seconds
10Mb: 18 seconds
25Mb: 15 seconds
50Mb: 17 seconds
75Mb: 10 seconds
100Mb: 1.5 seconds
If I delete them in reverse order, the average timings to delete a single row are:
100Mb: 20 seconds
75Mb: 10 seconds
50Mb: 4 seconds
25Mb: 0.3 seconds
10Mb: 0.25 seconds
1Mb: 0.02 seconds
1024 bytes: 0.005 seconds
If I create seven small blobs, delete times are all instantaneous.
It thus appears that the delete time is related more to the overall size of the rows in the table than to the size of the blob being removed.
I’ve run the tests a few times, and the results seem reproducible.
So, does anyone have any explanation for the performance, and any suggestions on how to work around it or fix it? It does make using large blobs quite problematic in a production environment…
I have exactly the same issue you have.
I found that when I do a DELETE, Derby actually "reads through" the large segment file completely. I used Filemon.exe to observe how it runs.
My file size is 940MB, and it takes 90s to delete just a single row.
I believe that Derby stores the table data in a single file internally, and somehow a design/implementation bug causes it to read everything rather than use a proper index.
I do batch deletes to work around this problem.
I rewrote a part of my program. It used "where id=?" in auto-commit mode.
Then I rewrote many things, and it now uses "where ID IN(?,.......?)" enclosed in a transaction.
The total time was reduced to about 1/1000 of what it was before.
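A sketch of what that batched delete can look like (table and column names, plus the chunk size, are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

class BatchedDelete {
    // Delete the given ids in IN-list chunks, all inside one transaction.
    static void deleteInBatches(Connection conn, List<Integer> ids, int chunkSize) throws SQLException {
        conn.setAutoCommit(false);
        for (int from = 0; from < ids.size(); from += chunkSize) {
            List<Integer> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
            String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
            try (PreparedStatement ps = conn.prepareStatement(
                    "delete from blobtest where id in (" + placeholders + ")")) {
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setInt(i + 1, chunk.get(i));
                }
                ps.executeUpdate();
            }
        }
        conn.commit();
    }
}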
I suggest you add a "mark as deleted" column, with a scheduled job that does the actual deletion in batches.
As far as I can tell, Derby will only store BLOBs inline with the other database data, so you end up with the BLOB split up over a ton of separate DB page files. This BLOB storage mechanism is good for ACID, and good for smaller BLOBs (say, image thumbnails), but breaks down with larger objects. According to the Derby docs, turning autocommit off when manipulating BLOBs may also improve performance, but this will only go so far.
I strongly suggest you migrate to H2 or another DBMS if good performance on large BLOBs is important, and the BLOBs must stay within the DB. You can use the SQuirrel SQL client and its DBCopy plugin to directly migrate between DBMSes (you just need to point it to the Derby/JavaDB JDBC driver and the H2 driver). I'd be glad to help with this part, since I just did it myself, and haven't been happier.
Failing this, you can move the BLOBs out of the database and into the filesystem. To do this, you would replace the BLOB column in the database with a BLOB size (if desired) and location (a URI or platform-dependent file string). When creating a new blob, you create a corresponding file in the filesystem. The location could be based off of a given directory, with the primary key appended. For example, your DB is in "DBFolder/DBName" and your blobs go in "DBFolder/DBName/Blob" and have filename "BLOB_PRIMARYKEY.bin" or somesuch. To edit or read the BLOBs, you query the DB for the location, and then do read/write to the file directly. Then you log the new file size to the DB if it changed.
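A rough sketch of that file-backed approach (table and column names, the directory layout, and the helper methods are all made up for illustration):

import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class FileBackedBlobs {
    // Root directory for the blob files, e.g. DBFolder/DBName/Blob.
    static final Path BLOB_DIR = Path.of("DBFolder/DBName/Blob");

    // Write the bytes to disk and record size + location for an existing row.
    static void writeBlob(Connection conn, long primaryKey, byte[] data) throws Exception {
        Files.createDirectories(BLOB_DIR);
        Path file = BLOB_DIR.resolve("BLOB_" + primaryKey + ".bin");
        Files.write(file, data);
        try (PreparedStatement ps = conn.prepareStatement(
                "update blobtest set blob_size = ?, blob_location = ? where id = ?")) {
            ps.setLong(1, data.length);
            ps.setString(2, file.toString());
            ps.setLong(3, primaryKey);
            ps.executeUpdate();
        }
    }

    // Look up the location in the table, then read the bytes from disk.
    static byte[] readBlob(Connection conn, long primaryKey) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "select blob_location from blobtest where id = ?")) {
            ps.setLong(1, primaryKey);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalArgumentException("no row for id " + primaryKey);
                }
                return Files.readAllBytes(Path.of(rs.getString(1)));
            }
        }
    }
}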
I'm sure this isn't the answer you want, but for a production environment with throughput requirements I wouldn't use Java DB. MySQL is just as free and will handle your requirements a lot better. I think you are really just beating your head against a limitation of the solution you've chosen.
I generally only use Derby as a test case, and especially only when my entire DB can fit easily into memory. YMMV.
Have you tried increasing the page size of your database?
There's information about this and more in the Tuning Java DB manual which you may find useful.