Performance problem on Java DB Derby Blobs & Delete - java

I’ve been experiencing a performance problem with deleting blobs in derby, and was wondering if anyone could offer any advice.
This is primarily with 10.4.2.0 under Windows and Solaris. I’ve also tested with the new 10.5.1.1 release candidate (as it has many LOB changes), but this makes no significant difference.
The problem is that with a table containing many large blobs, deleting a single row can take a long time (often over a minute).
I’ve reproduced this with a small test that creates a table, inserts a few rows with blobs of differing sizes, then deletes them.
The table schema is simple, just:
create table blobtest( id integer generated BY DEFAULT as identity, b blob )
and I’ve then created 7 rows with the following blob sizes: 1024 bytes, 1 MB, 10 MB, 25 MB, 50 MB, 75 MB, 100 MB.
I’ve read the blobs back, to check they have been created properly and are the correct size.
They have then been deleted using the SQL statement “delete from blobtest where id = X”.
If I delete the rows in the order I created them, average timings to delete a single row are:
1024 bytes: 19.5 seconds
1 MB: 16 seconds
10 MB: 18 seconds
25 MB: 15 seconds
50 MB: 17 seconds
75 MB: 10 seconds
100 MB: 1.5 seconds
If I delete them in reverse order, the average timings to delete a single row are:
100 MB: 20 seconds
75 MB: 10 seconds
50 MB: 4 seconds
25 MB: 0.3 seconds
10 MB: 0.25 seconds
1 MB: 0.02 seconds
1024 bytes: 0.005 seconds
If I create seven small blobs, delete times are all instantaneous.
It thus appears that the delete time is related more to the overall size of the rows in the table than to the size of the blob being removed.
I’ve run the tests a few times, and the results seem reproducible.
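For reference, the test is roughly the following (a trimmed-down sketch, not the exact code; it assumes the embedded driver and omits the read-back check):

Connection conn = DriverManager.getConnection("jdbc:derby:blobtestdb;create=true");
Statement s = conn.createStatement();
s.executeUpdate("create table blobtest( id integer generated BY DEFAULT as identity, b blob )");

// insert blobs of 1 KB, 1 MB, 10 MB, 25 MB, 50 MB, 75 MB and 100 MB
int[] sizes = { 1024, 1 << 20, 10 << 20, 25 << 20, 50 << 20, 75 << 20, 100 << 20 };
PreparedStatement ins = conn.prepareStatement("insert into blobtest (b) values (?)");
for (int size : sizes) {
    ins.setBinaryStream(1, new ByteArrayInputStream(new byte[size]), size);
    ins.executeUpdate();
}

// delete the rows one at a time and time each statement
PreparedStatement del = conn.prepareStatement("delete from blobtest where id = ?");
for (int id = 1; id <= sizes.length; id++) {
    long start = System.currentTimeMillis();
    del.setInt(1, id);
    del.executeUpdate();
    System.out.println("delete id=" + id + " took " + (System.currentTimeMillis() - start) + " ms");
}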
So, does anyone have any explanation for the performance, and any suggestions on how to work around it or fix it? It does make using large blobs quite problematic in a production environment…

I have exactly the same issue you have.
I found that when I do a DELETE, Derby actually reads through the large segment file completely. I used Filemon.exe to observe how it runs.
My file size is 940MB, and it takes 90s to delete just a single row.
I believe Derby stores the table data in a single file internally, and somehow a design/implementation bug causes it to read through everything rather than using a proper index.
I do batch deletes to work around this problem.
I rewrote part of my program. It used "where id=?" in auto-commit mode.
I then rewrote it so that it now uses "where ID IN(?,.......?)" enclosed in a transaction.
The total time was reduced to 1/1000 of what it was before.
I suggest you add a "mark as deleted" column, with a scheduled job that does the actual deletion in batches.
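To make the workaround concrete, here is a rough sketch of the IN-list delete inside one transaction (conn is an open connection to the same database; the batch size of 100 is just an arbitrary example):

conn.setAutoCommit(false);
List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5, 6, 7);   // rows marked for deletion
int batchSize = 100;
for (int from = 0; from < ids.size(); from += batchSize) {
    List<Integer> chunk = ids.subList(from, Math.min(from + batchSize, ids.size()));
    String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
    try (PreparedStatement ps = conn.prepareStatement(
            "delete from blobtest where id in (" + placeholders + ")")) {
        for (int i = 0; i < chunk.size(); i++) {
            ps.setInt(i + 1, chunk.get(i));
        }
        ps.executeUpdate();
    }
}
conn.commit();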

As far as I can tell, Derby will only store BLOBs inline with the other database data, so you end up with the BLOB split up over a ton of separate DB page files. This BLOB storage mechanism is good for ACID, and good for smaller BLOBs (say, image thumbnails), but breaks down with larger objects. According to the Derby docs, turning autocommit off when manipulating BLOBs may also improve performance, but this will only go so far.
I strongly suggest you migrate to H2 or another DBMS if good performance on large BLOBs is important, and the BLOBs must stay within the DB. You can use the SQuirrel SQL client and its DBCopy plugin to directly migrate between DBMSes (you just need to point it to the Derby/JavaDB JDBC driver and the H2 driver). I'd be glad to help with this part, since I just did it myself, and haven't been happier.
Failing this, you can move the BLOBs out of the database and into the filesystem. To do this, you would replace the BLOB column in the database with a BLOB size (if desired) and location (a URI or platform-dependent file string). When creating a new blob, you create a corresponding file in the filesystem. The location could be based off of a given directory, with the primary key appended. For example, your DB is in "DBFolder/DBName" and your blobs go in "DBFolder/DBName/Blob" and have filename "BLOB_PRIMARYKEY.bin" or somesuch. To edit or read the BLOBs, you query the DB for the location, and then do read/write to the file directly. Then you log the new file size to the DB if it changed.
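A rough sketch of that pattern (the table, column and directory names here are just made up for illustration):

// assumed schema:
//   create table blobtest( id integer generated BY DEFAULT as identity,
//                          blob_size bigint, blob_location varchar(260) )
File blobDir = new File("DBFolder/DBName/Blob");
blobDir.mkdirs();
byte[] data = new byte[10 * 1024 * 1024];   // example payload

// 1. insert the metadata row and fetch the generated primary key
PreparedStatement ins = conn.prepareStatement(
        "insert into blobtest (blob_size, blob_location) values (?, ?)",
        Statement.RETURN_GENERATED_KEYS);
ins.setLong(1, data.length);
ins.setString(2, "");                       // filled in once the key is known
ins.executeUpdate();
ResultSet keys = ins.getGeneratedKeys();
keys.next();
int id = keys.getInt(1);

// 2. write the payload to BLOB_<primary key>.bin and record its location
File blobFile = new File(blobDir, "BLOB_" + id + ".bin");
Files.write(blobFile.toPath(), data);
PreparedStatement upd = conn.prepareStatement(
        "update blobtest set blob_location = ? where id = ?");
upd.setString(1, blobFile.getPath());
upd.setInt(2, id);
upd.executeUpdate();

// 3. to read or edit later: query blob_location for the id, then read/write
//    the file directly, and update blob_size if it changed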

I'm sure this isn't the answer you want, but for a production environment with throughput requirements I wouldn't use Java DB. MySQL is just as free and will handle your requirements a lot better. I think you are really just beating your head against a limitation of the solution you've chosen.
I generally only use Derby as a test case, and especially only when my entire DB can fit easily into memory. YMMV.

Have you tried increasing the page size of your database?
There's information about this and more in the Tuning Java DB manual which you may find useful.
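For what it's worth, one way to try it (a sketch; 32768 is the largest page size Derby supports, and the property only affects tables created after it is set):

Statement s = conn.createStatement();
// set the page size for subsequently created tables/indexes
s.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.storage.pageSize', '32768')");
s.executeUpdate("create table blobtest( id integer generated BY DEFAULT as identity, b blob )");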

Related

How to efficiently export/import database data with JDBC

I have a JAVA application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later on in a different instance of the application. The size of the DB is pretty big so there are many rows in there. The export and import process has to be done from inside the java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of its size, and secondly, executing a lot of inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process or there is no "elegant" solution?
Why not do the export/import in a single step with batching (for performance) and chunking (to avoid errors and provide a checkpoint from which to start off after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk-size, which might need tuning for the specific target database (big enough to speed things up but small enough to make the chunk not exceed some database limit and create failures).
As already proposed, each of these chunks should then be executed in a transaction, and your application should remember which chunk it last executed successfully, so that it can continue from there on the next run if an error occurs.
To form the chunks themselves, you really should use LIMIT/OFFSET.
This way, you can repeat any chunk at any time, each chunk by itself is atomic, and it should perform much better than single-row statements.
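Putting that together, a rough sketch (src and dst are assumed to be open JDBC connections, and the table/column names are just the placeholders from above):

int chunkSize = 1000;                       // tune for the target database
src.setAutoCommit(false);
dst.setAutoCommit(false);
for (int offset = 0; ; offset += chunkSize) {
    // read one chunk from the source; order by something stable so paging is deterministic
    List<Object[]> rows = new ArrayList<>();
    try (PreparedStatement sel = src.prepareStatement(
            "select col_a, col_b from table_a order by col_a limit ? offset ?")) {
        sel.setInt(1, chunkSize);
        sel.setInt(2, offset);
        try (ResultSet rs = sel.executeQuery()) {
            while (rs.next()) {
                rows.add(new Object[] { rs.getObject(1), rs.getObject(2) });
            }
        }
    }
    if (rows.isEmpty()) break;

    // build one multi-row INSERT for the whole chunk and run it in its own transaction
    StringBuilder sql = new StringBuilder("insert into table_a (col_a, col_b) values ");
    for (int i = 0; i < rows.size(); i++) {
        sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
    }
    try (PreparedStatement ins = dst.prepareStatement(sql.toString())) {
        int p = 1;
        for (Object[] row : rows) {
            ins.setObject(p++, row[0]);
            ins.setObject(p++, row[1]);
        }
        ins.executeUpdate();
    }
    dst.commit();
    // persist 'offset' somewhere so a failed run can resume from the last good chunk
}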
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
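A minimal sketch of both points (src and dst are assumed to be open PostgreSQL connections, with the table/column names from the earlier example):

// server-side cursor on the export side: autocommit off plus a fetch size > 0
src.setAutoCommit(false);
Statement sel = src.createStatement();
sel.setFetchSize(10000);
ResultSet rs = sel.executeQuery("select col_a, col_b from table_a");

// single transaction plus a batched PreparedStatement on the import side
dst.setAutoCommit(false);
PreparedStatement ins = dst.prepareStatement(
        "insert into table_a (col_a, col_b) values (?, ?)");
int pending = 0;
while (rs.next()) {
    ins.setObject(1, rs.getObject(1));
    ins.setObject(2, rs.getObject(2));
    ins.addBatch();
    if (++pending == 10000) {               // flush periodically to bound memory
        ins.executeBatch();
        pending = 0;
    }
}
if (pending > 0) {
    ins.executeBatch();
}
dst.commit();
src.commit();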
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
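The JDBC side is just an ordinary statement; a sketch (file path, delimiter and table name are placeholders, see the second link for the full option list):

// dump the exported rows to a delimited file first, then bulk load it;
// COPY ... FROM LOCAL streams the client-side file through the JDBC connection
// and lands the data in a single ROS container instead of thousands of inserts
try (Statement stmt = verticaConn.createStatement()) {
    stmt.execute("COPY table_a (col_a, col_b) FROM LOCAL '/tmp/table_a.csv' "
            + "DELIMITER ',' ABORT ON ERROR");
}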

Is there a parameter that accelerates Sybase insertion

I am using Sybase ASE, with a table in which I save results calculated by Java. This table has 10 columns: one column is of type INT (but it is not an ID column), and the other 9 columns are all of type VARCHAR(50).
There's no index or trigger on this table (in fact this table is really independent). I need to insert around 160K rows into this table. I tried to split the work into batches of 10,000 insertions each. I used two different approaches: Spring's JdbcTemplate.batchUpdate and the native JDBC PreparedStatement.executeBatch API.
However, there was no clear winner performance-wise. Both of them take around 25 to 30 seconds for 10K insertions.
Then I thought it could be related to the JDBC driver, so I tried two different drivers: jConnect and jTDS. No real impact on insertion performance.
Finally I decided to compare Sybase with another database, i.e. PostgreSQL in my test. I kept the same Java code, and surprisingly PostgreSQL takes only 0.3 seconds for every 10K insertions, while Sybase took 25 to 30 seconds (75 to 100 times longer).
The DBA support team explains that the difference is because PostgreSQL is installed on my local machine, while Sybase is installed on our enterprise server. However, I am not convinced by this explanation at all.
Does anyone know if there's a configuration in Sybase which could considerably impact the insertion speed? Or are there any other possible causes for my above scenario?
The delay you see on the Sybase end is down to a number of factors that need to be checked, and comparing it to a different database, especially one on a local machine, is not a fair comparison.
To start with, check the network latency and the storage used by the Sybase database. Check the Sybase server configuration, the page size, and the locking scheme of the table you are inserting into, and do a basic health check of the server while the data is being inserted. Since you have used two different ways to insert the data, it is also worth checking whether both are up to date with the Sybase client installed on your system.
To sum up, it may be something as simple as blocking on the Sybase instance, or it could be related to storage that cannot write quickly enough. With Sybase configured properly, the performance should be very good.
Whether the DB server is local or not may indeed make a significant difference. Until you cut out this factor, comparison with a local DB makes little sense.
But that aside, there are many aspects that affect insert performance in ASE. First off, make sure the overall memory configuration (e.g. data cache and procedure cache) is not too small -- leaving it at the installation defaults is a guarantee for disappointing results. Then there is network packet size that can play a role. And the batch size (#rows before you commit). And the table's lock scheme.
Trying to use minimally logged inserts will help (requires config setting changes), especially since the table has no indexes (and no UNIQUE or PK constraints either?)
The ASE server page size (which you choose when you create the server) also makes a difference: bigger is basically better for inserts.
Set the ENABLE_BULK_LOAD parameter to True. It will speed it up.
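With jConnect this is a connection property, so something along these lines should work (a sketch; the URL, credentials and the exact property values your driver version accepts are things to check in the jConnect docs):

Properties props = new Properties();
props.put("user", "myuser");
props.put("password", "mypassword");
props.put("ENABLE_BULK_LOAD", "true");      // check the driver docs for the supported values
Connection conn = DriverManager.getConnection(
        "jdbc:sybase:Tds:dbhost:5000/mydb", props);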

H2 performance recommendations

I'm currently working with a somewhat larger database, and though I have no specific issues, I would like some recommendations, if anyone has any.
The database is 2.2 gigabyte (after recreation/compacting). It contains about 50 tables. One of those tables contains a blob plus some metadata. It currently has about 22000 rows. If I remove the blobs from the table (UPDATE table SET blob = null), the database size is reduced to about 200 megabyte (after recreation/compacting). The metadata is accessed a lot, the blobs however are not that often needed.
The database URL I currently use is:
jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288
It runs in our Java VM which has 4GB max heap.
Some things I was wondering:
Would running H2 in a separate process have any impact on performance (for better or for worse)?
Would it help to have the blobs in a separate table with a 1-1 relation to the metadata? I could imagine it would help with the caching, not having the blobs in the way?
The internet seems divided on whether to include blobs in a database or write them to files on a filesystem with a link in the DB. Any H2-specific advice here?
The answer for you depends on the growth rate of your blob data. If, for example, your data set is going to grow at 10% per week, then there is little point in trying to extend the use of H2 to store blob data (as it will quickly outpace the available heap memory). If instead the blob data is the biggest it will ever be, then attempting to use H2 might make sense.
To answer your questions about H2:
1) Running H2 in a separate process will allow H2 to claim the majority of heap space - making controlling the available heap space for H2 much more manageable. However, you'll also be adding the maintenance overhead of having a separate process to maintain and monitor. So the answer is "it depends on your operating environment and goals". If you have the people and time, running H2 in a separate process might make sense. But if that's true - then you should probably consider just running an appropriate blob storage platform instead.
2) Yes, you're correct that storing the blobs in a separate table would help with caching - in the case that you don't often need the blobs. It should also help with retrieval times, as H2 won't have to read past the blobs to find the metadata.
3) Note that "the internet" represents many thousands of people with almost as many different specific use cases. You'll need to filter down your use case into requirements, and then apply the logic you glean from others.
4) My personal advice is, if you're trying to make a scalable and maintainable platform - use the right tools. H2, or any other relational database, is most often not the right tool for storing many large blobs. I'd recommend that you investigate using a key/value store.
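To make point 2 concrete, the split could look something like this (a sketch only; table and column names are invented):

Statement s = conn.createStatement();
// metadata stays in the small, frequently read table
s.executeUpdate("create table document ( id bigint primary key, "
        + "name varchar(255), content_length bigint )");
// blobs live in a 1-1 side table and are only read when actually needed
s.executeUpdate("create table document_blob ( document_id bigint primary key "
        + "references document(id), data blob )");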

H2 Database File Size is too big (7x larger than expected) [duplicate]

I have an H2 database that has ballooned to several Gigabytes in size, causing all sorts of operational problems. The database size didn't seem right. So I took one little slice of it, just one table, to try to figure out what's going on.
I brought this table into a test environment:
The columns add up to 80 bytes per row, per my calculations.
The table has 280,000 rows.
For this test, all indexes were removed.
The table should occupy approximately
80 bytes per row * 280,000 rows = 22.4 MB on disk.
However, it is physically taking up 157 MB.
I would expect to see some overhead here and there, but why is this database a full 7x larger than can be reasonably estimated?
UPDATE
Output from CALL DISK_SPACE_USED
There are always indices, etc. to be taken into account.
Can you try:
CALL DISK_SPACE_USED('my_table');
I would also recommend running SHUTDOWN DEFRAG and calculating the size again.
Setting MV_STORE=FALSE on database creation solves the problem. Whole database (not the test slice from the example) is now approximately 10x smaller.
Update
I had to revisit this topic recently and had to run a comparison to MySQL. On my test dataset, when MV_STORE=FALSE, the H2 database takes up 360MB of disk space, while the same data on MySQL 5.7 InnoDB with default-ish configurations takes up 432MB. YMMV.
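For reference, the setting goes into the JDBC URL when the database is first created, e.g. (the path is just an example):

jdbc:h2:~/test;MV_STORE=FALSE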

Why does a Derby database take up so much space?

I am new to databases and I love how easy it is to get data from a relational database (such as a Derby database). What I don't like is how much space one takes up; I have made a database with two tables, written a total of 130 records to these tables (each table has 6 columns), and the whole relational database gets saved in the system directory as a folder totalling approximately 1,914,014 bytes! (Assuming that I did the arithmetic right....) What the heck is going on to cause such a huge request of memory?! I also notice that there is a log1.dat file in the log folder that takes up exactly 1MB. I looked into this file via Notepad++, and saw that it was mostly NULL characters. What is that all about?
Derby needs to keep track of your database data, the redo logs and transactions so your database is in a consistent state and can recover even from PC crashes.
It also creates most files with a fixed size (like 1MB) to ensure it does not need to grow them later on (for performance reasons, and to avoid fragmenting its files too much).
Over time, or when stopping, Derby will clean up some of these files or regroup them and free space.
So overall, the space and the files are the trade-offs you get for using a database.
Maybe you can change some of this behaviour via some Derby configs (I did not find a suitable one in the docs :( ).
When last checked in 2011, an empty Derby database takes about 770 K of disk space: http://apache-database.10148.n7.nabble.com/Database-size-larger-than-expected-td104630.html
The log1.dat file is your transaction log, and records database changes so that the database can be recovered if there is a crash (or if your program calls ROLLBACK).
Note that log1.dat is disk space, not memory.
If you'd like to learn about the basics of Derby's transaction log, start here: http://db.apache.org/derby/papers/recovery.html
