Why does a Derby database take up so much space? - java

I am new to databases and I love how easy it is to get data from a relational database (such as a Derby database). What I don't like is how much space one takes up; I have made a database with two tables, written a total of 130 records to these tables (each table has 6 columns), and the whole relational database gets saved in the system directory as a folder that houses a total of approximately 1914014 bytes! (Assuming that I did the arithmetic right....) What the heck is going on to cause such a huge request of memory?!
I also notice that there is a log1.dat file in the log folder that takes up exactly 1 MB of data. I looked into this file via Notepad++ and saw that it was mostly NULL characters. What is that all about?

Derby needs to keep track of your database data, the redo logs and transactions so that your database stays in a consistent state and can recover even from PC crashes.
It also creates most files with a fixed size (like 1 MB) so that it does not need to grow them later on (growing costs performance and fragments the files).
Over the course of a run, or when shutting down, Derby will clean up some of these files or regroup them and free space.
So overall, the space and the files are the trade-off you get for using a database.
You may be able to change some of this behaviour via Derby configuration properties (I did not find a suitable one in the docs, though :().
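If you want to reclaim space explicitly, one thing worth trying is Derby's built-in SYSCS_UTIL.SYSCS_COMPRESS_TABLE procedure, which returns unused pages in a table's files to the operating system. A minimal sketch, assuming an embedded database called myDB and a table MYTABLE in the default APP schema (both placeholders):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class CompressDerbyTable {
    public static void main(String[] args) throws Exception {
        // Adjust the JDBC URL to your own (embedded) database.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:myDB");
             CallableStatement cs = conn.prepareCall(
                 "CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE(?, ?, ?)")) {
            cs.setString(1, "APP");      // schema name (Derby's default)
            cs.setString(2, "MYTABLE");  // placeholder table name
            cs.setShort(3, (short) 1);   // 1 = sequential compression, uses less temporary space
            cs.execute();
        }
    }
}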

When last checked in 2011, an empty Derby database takes about 770 K of disk space: http://apache-database.10148.n7.nabble.com/Database-size-larger-than-expected-td104630.html
The log1.dat file is your transaction log, and records database changes so that the database can be recovered if there is a crash (or if your program calls ROLLBACK).
Note that log1.dat is disk space, not memory.
If you'd like to learn about the basics of Derby's transaction log, start here: http://db.apache.org/derby/papers/recovery.html

Related

How to efficiently export/import database data with JDBC

I have a JAVA application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later on in a different instance of the application. The size of the DB is pretty big so there are many rows in there. The export and import process has to be done from inside the java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of its size, and secondly, executing a lot of inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process or there is no "elegant" solution?
Why not do the export/import in a single step, with batching (for performance) and chunking (to avoid errors and to provide a checkpoint from which to resume after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you put into a single such INSERT statement is then your chunk size, which might need tuning for the specific target database (big enough to speed things up, but small enough that the chunk does not exceed some database limit and cause failures).
As already proposed, each of these chunks should then be executed in a transaction, and your application should remember which chunk it last executed successfully, so that it can continue from there on the next run if an error occurs.
For selecting the chunks themselves, you really should use LIMIT ... OFFSET with a stable ORDER BY, so that each chunk is reproducible.
This way, you can repeat any chunk at any time, each chunk by itself is atomic, and it should perform much better than single-row statements.
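A minimal sketch of that chunked export/import, assuming the table_a/col_a/col_b names from the example above and two already-open JDBC connections (everything else is a placeholder to adapt):

import java.sql.*;

public class ChunkedCopy {
    private static final int CHUNK_SIZE = 1000;   // tune per target database

    public static void copy(Connection source, Connection target) throws SQLException {
        target.setAutoCommit(false);              // one transaction per chunk
        long offset = 0;                          // persist this somewhere to resume after a failure
        while (true) {
            StringBuilder insert = new StringBuilder(
                    "INSERT INTO table_a (col_a, col_b) VALUES ");
            int rows = 0;
            try (PreparedStatement ps = source.prepareStatement(
                     "SELECT col_a, col_b FROM table_a ORDER BY col_a LIMIT ? OFFSET ?")) {
                ps.setInt(1, CHUNK_SIZE);
                ps.setLong(2, offset);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        if (rows++ > 0) insert.append(", ");
                        // Real code should escape the values (or use a batched
                        // PreparedStatement); string building is shown only to mirror
                        // the multi-row INSERT above.
                        insert.append("('").append(rs.getString(1)).append("', ")
                              .append(rs.getLong(2)).append(")");
                    }
                }
            }
            if (rows == 0) break;                 // no more data to copy
            try (Statement st = target.createStatement()) {
                st.executeUpdate(insert.toString());
            }
            target.commit();                      // this chunk is now durable
            offset += rows;                       // checkpoint for the next run
        }
    }
}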
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
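Putting both points together with the server-side cursor on the SELECT side, a hedged sketch (same placeholder table and column names as above) might look like this:

import java.sql.*;

public class PostgresCopy {
    public static void copy(Connection source, Connection target) throws SQLException {
        // Server-side cursor on the source requires autocommit off and a fetch size > 0.
        source.setAutoCommit(false);
        target.setAutoCommit(false);              // single transaction for all the INSERTs

        try (PreparedStatement select = source.prepareStatement(
                 "SELECT col_a, col_b FROM table_a");
             PreparedStatement insert = target.prepareStatement(
                 "INSERT INTO table_a (col_a, col_b) VALUES (?, ?)")) {

            select.setFetchSize(10000);           // stream rows instead of loading them all

            try (ResultSet rs = select.executeQuery()) {
                int pending = 0;
                while (rs.next()) {
                    insert.setString(1, rs.getString(1));
                    insert.setLong(2, rs.getLong(2));
                    insert.addBatch();
                    if (++pending == 10000) {     // flush the batch periodically
                        insert.executeBatch();
                        pending = 0;
                    }
                }
                if (pending > 0) insert.executeBatch();
            }
        }
        target.commit();
    }
}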
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
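Through JDBC, the COPY ... FROM LOCAL form can be sent as a plain SQL statement, so the client-side file is streamed to the server and loaded in one pass. A rough sketch; the URL, credentials, file path, delimiter and table name are all assumptions to adapt:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VerticaBulkLoad {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:vertica://host:5433/db", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Loads the whole file in one pass, creating a single ROS container.
            stmt.execute("COPY table_a FROM LOCAL '/tmp/table_a.csv' DELIMITER ','");
        }
    }
}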

Split database into smaller ones. Too much data in single commit

I need some advice :)
I have a database with almost 70 tables, many of them with over a dozen million big records. I want to split it into a few smaller ones: one for each big client's data and one main database for the rest of the clients' data (while also moving some of the data into a NoSQL database). Because of the many complicated relations between tables, before copying the data I was disabling the triggers that check the correctness of the foreign keys, and then, just before the commit, enabling them again.
It was all working with a small amount of data, but now, when I'm trying to copy one of the big clients' data, I have a problem with the Java heap size/GC out of memory.
I could increase the heap size, but it's not the point here.
I'm selecting data by some specific id from every table that has any relation to client data and copy it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I was trying to select portions of data (something like selecting parts of 5000 rows instead of all 50 000 at once), but it fails at the exact same position.
And here I am asking for advice on how to manage this problem. I think it is all because I am trying to copy all the data in one big fat commit. The reason for that is that I have to disable triggers while copying, but I also must enable them again before I can commit my changes.
When I'm trying to copy one of the big clients' data I have a problem with the Java heap size/GC out of memory.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So adding a stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver does not exhaust the heap.
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
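A minimal sketch of that change (the query and fetch size are placeholders); note that the PostgreSQL driver only uses a cursor when autocommit is off and the ResultSet is forward-only:

import java.sql.*;

public class CursorRead {
    public static void streamRows(Connection conn) throws SQLException {
        conn.setAutoCommit(false);                      // required for cursor-based fetching
        try (Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(1000);                    // fetch 1000 rows at a time
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM some_client_table")) {
                while (rs.next()) {
                    // copy the current row to the target database, then let it go;
                    // do not accumulate rows in a collection, or the heap fills up anyway
                }
            }
        }
        conn.commit();
    }
}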

Split the JDBC Oracle Resultset to avoid OOM error

I have a program that connects through JDBC to an oracle database and extracts 3+ Million records. If I load everything into memory I am getting an out of memory error. I want to load the data into memory into parts of 50000.
There are two ways that I am approaching the issue:
a) Keep the connection open and process the data in groups of 50 000 as they come from the result set.
I do not really like this approach because there is a risk of leaving the connection open when everything is done, and the connection could also be open for a long time (risking timeouts and tying up the connection pool) while each group of 50 000 records is being processed (and by "being processed" I mean each of these could cause other connections to be opened and closed quickly based on derived data that may be needed).
b) Process based on row numbers, but I am not sure what the impact might be if the underlying data changes, and I also cannot really afford to do a sort every time I process 50 000 records.
This seems to be a common problem and I would like to know what are some industry standards/ best approaches/ design patterns to this issue.
If you need a durable transaction that spans the entire read (i.e. no one changing the data out from under you, which you allude to), you might want to investigate moving this problem into the RDBMS and coding it as a stored procedure that you can call from JDBC/JPA/whatever.
I know it doesn't solve it from the Java side, but sometimes moving the problem IS the proper solution, depending on context and details.
Cheers
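As an aside on option (a) from the question: the risk of leaving the connection or cursor open can be contained with try-with-resources and a fetch size, so rows are streamed in and handled in groups without ever being fully loaded. A rough sketch (query, group size and row handling are placeholders):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class GroupedOracleRead {
    private static final int GROUP_SIZE = 50_000;

    public static void process(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(10_000);              // stream from Oracle instead of buffering all rows
            try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM big_table")) {
                List<String> group = new ArrayList<>(GROUP_SIZE);
                while (rs.next()) {
                    group.add(rs.getString("payload"));
                    if (group.size() == GROUP_SIZE) {
                        handleGroup(group);         // placeholder for the per-group work
                        group.clear();              // only one group is ever held in memory
                    }
                }
                if (!group.isEmpty()) handleGroup(group);
            }
        }                                           // statement and result set are closed even on error
    }

    private static void handleGroup(List<String> group) { /* ... */ }
}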

MySQL memory exhausted error

Today I was using a simple Java application to load a large amount of data into a MySQL DB, and got the error below:
java.sql.SQLException: Syntax error or access violation message from server: "memory exhausted near ''Q1',2.34652631E10,'000','000',5.0519608E9,5.8128358E9,'000','000',8.2756818E9,2' at line 5332"
I've tried modifying the my.ini file to increase some settings, but it doesn't work at all, and the file isn't actually that large; it's just a 14 MB xls file. I'm almost out of ideas and would appreciate any suggestion. Thanks for your help!
(Without the relevant parts of your code I can only guess, but here we go...)
From the error message, I will take a shot in the dark and guess that you are trying to load all of 300,000 rows in a single query, which is probably produced by concatenating a whole bunch of INSERT statements in a single string. A 14MB XLS file can become a lot bigger when translated into SQL statements and your server runs out of memory trying to parse the query.
To resolve this (in order of preference):
Convert your file to CSV and use mysqlimport.
Convert your file to CSV and use LOAD DATA INFILE.
Use multiple transactions of moderate size with only a few thousand INSERT statements each. This is the recommended option if you cannot simply import the file.
Use a single transaction - InnoDB MySQL databases should handle transaction sizes in this size range.
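For option 2, LOAD DATA can be driven from Java as an ordinary statement. A hedged sketch: the CSV path, table name and the allowLoadLocalInfile connection property are assumptions to check against your Connector/J version, and LOCAL loads must also be permitted on the server:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MysqlBulkLoad {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile must be enabled for client-side LOCAL loads.
        String url = "jdbc:mysql://localhost:3306/mydb?allowLoadLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // Loads the whole CSV in one pass instead of parsing thousands of INSERTs.
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/data.csv' INTO TABLE mytable " +
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
        }
    }
}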

Performance problem on Java DB Derby Blobs & Delete

I’ve been experiencing a performance problem with deleting blobs in derby, and was wondering if anyone could offer any advice.
This is primarily with 10.4.2.0 under windows and solaris, although I’ve also tested with the new 10.5.1.1 release candidate (as it has many lob changes), but this makes no significant difference.
The problem is that with a table containing many large blobs, deleting a single row can take a long time (often over a minute).
I’ve reproduced this with a small test that creates a table, inserts a few rows with blobs of differing sizes, then deletes them.
The table schema is simple, just:
create table blobtest( id integer generated BY DEFAULT as identity, b blob )
and I’ve then created 7 rows with the following blob sizes : 1024 bytes, 1Mb, 10Mb, 25Mb, 50Mb, 75Mb, 100Mb.
I’ve read the blobs back, to check they have been created properly and are the correct size.
They have then been deleted using the sql statement ( “delete from blobtest where id = X” ).
If I delete the rows in the order I created them, average timings to delete a single row are:
1024 bytes: 19.5 seconds
1Mb: 16 seconds
10Mb: 18 seconds
25Mb: 15 seconds
50Mb: 17 seconds
75Mb: 10 seconds
100Mb: 1.5 seconds
If I delete them in reverse order, the average timings to delete a single row are:
100Mb: 20 seconds
75Mb: 10 seconds
50Mb: 4 seconds
25Mb: 0.3 seconds
10Mb: 0.25 seconds
1Mb: 0.02 seconds
1024 bytes: 0.005 seconds
If I create seven small blobs, delete times are all instantaneous.
It thus appears that the delete time is related more to the overall size of the rows in the table than to the size of the blob being removed.
I’ve run the tests a few times, and the results seem reproducible.
So, does anyone have any explanation for the performance, and any suggestions on how to work around it or fix it? It does make using large blobs quite problematic in a production environment…
I have exactly the same issue you have.
I found that when I do a DELETE, Derby actually "reads through" the large segment file completely. I used Filemon.exe to observe how it runs.
My file size is 940 MB, and it takes 90 s to delete just a single row.
I believe that Derby stores the table data in a single file, and somehow a design/implementation bug causes it to read through everything rather than use a proper index.
I do batch deletes to work around this problem.
I rewrote part of my program. It used to be "where id=?" in auto-commit.
I then rewrote many things, and now it is "where ID IN(?,.......?)" enclosed in a transaction.
The total time dropped to 1/1000 of what it was before.
I suggest you add a "mark as deleted" column, with a scheduled job that does the actual deletion in batches.
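A rough sketch of that batched-delete rewrite against the blobtest table from the question (the batch size is a placeholder to tune):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedBlobDelete {
    public static void deleteByIds(Connection conn, List<Integer> ids) throws SQLException {
        int batchSize = 100;                          // how many ids per DELETE statement
        conn.setAutoCommit(false);                    // one transaction instead of auto-commit per row
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<Integer> batch = ids.subList(from, Math.min(from + batchSize, ids.size()));
            // Build "DELETE FROM blobtest WHERE id IN (?, ?, ..., ?)" for this batch.
            StringBuilder sql = new StringBuilder("DELETE FROM blobtest WHERE id IN (");
            for (int i = 0; i < batch.size(); i++) {
                sql.append(i == 0 ? "?" : ", ?");
            }
            sql.append(")");
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                for (int i = 0; i < batch.size(); i++) {
                    ps.setInt(i + 1, batch.get(i));
                }
                ps.executeUpdate();
            }
        }
        conn.commit();
    }
}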
As far as I can tell, Derby will only store BLOBs inline with the other database data, so you end up with the BLOB split up over a ton of separate DB page files. This BLOB storage mechanism is good for ACID, and good for smaller BLOBs (say, image thumbnails), but breaks down with larger objects. According to the Derby docs, turning autocommit off when manipulating BLOBs may also improve performance, but this will only go so far.
I strongly suggest you migrate to H2 or another DBMS if good performance on large BLOBs is important, and the BLOBs must stay within the DB. You can use the SQuirrel SQL client and its DBCopy plugin to directly migrate between DBMSes (you just need to point it to the Derby/JavaDB JDBC driver and the H2 driver). I'd be glad to help with this part, since I just did it myself, and haven't been happier.
Failing this, you can move the BLOBs out of the database and into the filesystem. To do this, you would replace the BLOB column in the database with a BLOB size (if desired) and location (a URI or platform-dependent file string). When creating a new blob, you create a corresponding file in the filesystem. The location could be based off of a given directory, with the primary key appended. For example, your DB is in "DBFolder/DBName" and your blobs go in "DBFolder/DBName/Blob" and have filename "BLOB_PRIMARYKEY.bin" or somesuch. To edit or read the BLOBs, you query the DB for the location, and then do read/write to the file directly. Then you log the new file size to the DB if it changed.
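A minimal sketch of that filesystem approach; the directory layout follows the example above, while the blob_location/blob_size column names on the blobtest table are just illustrative:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FileBackedBlobs {
    private static final Path BLOB_DIR = Paths.get("DBFolder", "DBName", "Blob");

    // Store the blob bytes on disk and only their location/size in the database.
    public static void storeBlob(Connection conn, int primaryKey, byte[] data) throws Exception {
        Files.createDirectories(BLOB_DIR);
        Path file = BLOB_DIR.resolve("BLOB_" + primaryKey + ".bin");
        Files.write(file, data);
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE blobtest SET blob_location = ?, blob_size = ? WHERE id = ?")) {
            ps.setString(1, file.toString());
            ps.setLong(2, data.length);
            ps.setInt(3, primaryKey);
            ps.executeUpdate();
        }
    }

    // Read a blob back by looking up its location first.
    public static byte[] readBlob(Connection conn, int primaryKey) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT blob_location FROM blobtest WHERE id = ?")) {
            ps.setInt(1, primaryKey);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return Files.readAllBytes(Paths.get(rs.getString(1)));
            }
        }
    }
}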
I'm sure this isn't the answer you want, but for a production environment with throughput requirements I wouldn't use Java DB. MySQL is just as free and will handle your requirements a lot better. I think you are really just beating your head against a limitation of the solution you've chosen.
I generally only use Derby as a test case, and especially only when my entire DB can fit easily into memory. YMMV.
Have you tried increasing the page size of your database?
There's information about this and more in the Tuning Java DB manual which you may find useful.
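For reference, a hedged example of what raising the page size can look like, assuming the derby.storage.pageSize property described in the tuning manual (32768 is, as far as I know, the largest supported value); note that only tables created after the property is set pick up the new page size:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DerbyPageSize {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:derby:myDB;create=true");
             Statement stmt = conn.createStatement()) {
            // Larger pages can help tables whose rows carry big blobs.
            stmt.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY("
                       + "'derby.storage.pageSize', '32768')");
            // The table must be created after the property is set to use the new page size.
            stmt.execute("create table blobtest( id integer generated BY DEFAULT as identity, b blob )");
        }
    }
}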
