I need some advice :)
I have a database with almost 70 tables, many of which have over a dozen million large records. I want to split it into a few smaller databases: one for each big client's data and one main database for the rest of the clients' data (while also moving some of the data into a NoSQL database). Because of the many complicated relations between tables, before copying the data I was disabling the triggers that check the correctness of the foreign keys, and then, just before the commit, enabling them again.
It all worked with a small amount of data, but now, when I try to copy one of the big clients' data, I run into a Java heap size / GC out-of-memory problem.
I could increase the heap size, but it's not the point here.
I select data by a specific id from every table that has any relation to the client's data and copy it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I tried selecting the data in portions (e.g., chunks of 5,000 rows instead of all 50,000 at once), but it fails at exactly the same position.
And here I am, asking for advice on how to manage this problem. I think it is all because I am trying to copy all the data in one big fat commit. The reason for that is that I have to disable the triggers while copying, but I must also re-enable them before I can commit my changes.
When I try to copy one of the big clients' data, I run into a Java heap size / GC out-of-memory problem.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So, adding a stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver does not exhaust the heap.
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
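A minimal sketch of what cursor-based reading looks like with the PostgreSQL JDBC driver. Table and column names here are hypothetical; note that, per the driver documentation, cursor mode also requires autocommit to be off:

```java
import java.sql.*;

public class CursorRead {
    // With fetch size N, the driver retrieves ceil(totalRows / N) blocks
    // instead of materializing the whole result set on the heap.
    static int roundTrips(int totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize;
    }

    static void copyClient(Connection src, long clientId) throws SQLException {
        src.setAutoCommit(false); // required for cursor-based fetching in PostgreSQL
        try (PreparedStatement ps = src.prepareStatement(
                "SELECT * FROM orders WHERE client_id = ?")) {
            ps.setFetchSize(1000);            // stream in blocks of 1000 rows
            ps.setLong(1, clientId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // insert the current row into the target database here,
                    // without accumulating rows in any collection
                }
            }
        }
    }
}
```

With 50,000 rows and a fetch size of 1,000, the driver makes 50 trips and holds at most one block of rows in memory at a time.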
Related
I have two tables with millions of rows. They share a common email address. They don't share any other fields.
I have a join operation that works fine.
select r.*,l.* from righttable r full outer join lefttable l on r.email=l.email
However, the result set contains millions of rows, which overwhelms my server's memory. How can I run consecutive queries that only pull a limited number of rows from each table at a time and ultimately visit all of the rows in the two tables?
Furthermore, after receiving a result set, our server may make some inserts into one or both of the tables. I'm afraid this may complicate keeping track of the offset in each consecutive query. Maybe it's not a problem; I can't wrap my head around it.
I don't think you can do this in batches, because it won't know what rows to fabricate to fulfill the "FULL OUTER" without seeing all of the data. You might be able to get around that if you know that no one is making changes to the tables while you work, by selecting the left-only tuples, right-only tuples, and inner tuples in separate queries.
But, it should not consume all your memory (assuming you mean RAM, not disk space) on the server, because it should use temp files instead of RAM for the bulk of the storage needed (though there are some problems with memory usage for huge hash joins, so you might try set enable_hashjoin=off).
The client might use too much memory, as it might try to read the entire result set into the client RAM at once. There are ways around this, but they probably do not involve manipulating the JOIN itself. You can use a cursor to read in batches from a single result stream, or you could just spool the results out to disk using \copy, and then use something like GNU split on it.
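The three-way split suggested above (left-only, right-only, and inner tuples) could be written like this, using the table and column names from the question. This is only a sketch, and only safe under the stated assumption that nobody modifies the tables while you work:

```java
public class OuterJoinSplit {
    // Rows matched on both sides (the inner part of the full outer join).
    static String innerRows() {
        return "SELECT r.*, l.* FROM righttable r JOIN lefttable l ON r.email = l.email";
    }

    // Rows that exist only in lefttable (no matching email in righttable).
    static String leftOnly() {
        return "SELECT l.* FROM lefttable l LEFT JOIN righttable r ON r.email = l.email"
             + " WHERE r.email IS NULL";
    }

    // Rows that exist only in righttable (no matching email in lefttable).
    static String rightOnly() {
        return "SELECT r.* FROM righttable r LEFT JOIN lefttable l ON r.email = l.email"
             + " WHERE l.email IS NULL";
    }
}
```

Each of the three queries can then be consumed in batches independently, e.g. via a cursor or keyset pagination on email.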
I have a JAVA application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later on in a different instance of the application. The size of the DB is pretty big so there are many rows in there. The export and import process has to be done from inside the java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part gives us problems because of its size, and secondly, executing a lot of inserts one after another causes problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections).
What would be a more efficient way of doing this? Are there any tools that can help with the process, or is there no "elegant" solution?
Why not do the export/import in a single step, with batching (for performance) and chunking (to avoid errors and provide a checkpoint from which to resume after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk size, which might need tuning for the specific target database (big enough to speed things up, but small enough that the chunk does not exceed some database limit and cause failures).
As already proposed, each of these chunks should be executed in a transaction, and your application should remember which chunk it last executed successfully, so that the next run can continue from there if an error occurs.
For the chunks themselves, you really should use LIMIT/OFFSET together with a stable ORDER BY, so that every run sees the rows in the same order.
This way, you can repeat any chunk at any time, each chunk by itself is atomic and it should perform much better than with single row statements.
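A sketch of the chunking described above, building on the table_a example. The helper generates the multi-row INSERT with JDBC placeholders for one chunk, and the matching SELECT for chunk i (table and column names are illustrative):

```java
public class Chunker {
    // Builds a multi-row INSERT with JDBC placeholders, e.g. for chunkSize = 2:
    // INSERT INTO table_a (col_a, col_b) VALUES (?, ?), (?, ?)
    static String buildChunkInsert(String table, String[] cols, int chunkSize) {
        String row = "(" + "?, ".repeat(cols.length - 1) + "?)";
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table)
                .append(" (").append(String.join(", ", cols)).append(") VALUES ");
        for (int i = 0; i < chunkSize; i++) {
            if (i > 0) sql.append(", ");
            sql.append(row);
        }
        return sql.toString();
    }

    // The SELECT side of chunk i (0-based); the stable ORDER BY keeps
    // chunks repeatable across runs.
    static String chunkSelect(String table, String orderCol, int chunkSize, int chunk) {
        return "SELECT * FROM " + table + " ORDER BY " + orderCol
                + " LIMIT " + chunkSize + " OFFSET " + ((long) chunk * chunkSize);
    }
}
```

After each chunk commits, the application persists the chunk index; on restart it resumes at the last unfinished chunk.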
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
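Both points together might look like the following sketch: one transaction, one PreparedStatement, and JDBC batching to cut down round trips (table and column names are placeholders):

```java
import java.sql.*;
import java.util.List;

public class BatchInsert {
    // Number of executeBatch() round trips needed for n rows at a given batch size.
    static int batchCount(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    static void insertAll(Connection target, List<Object[]> rows) throws SQLException {
        target.setAutoCommit(false); // one transaction for all inserts
        try (PreparedStatement ps = target.prepareStatement(
                "INSERT INTO table_a (col_a, col_b) VALUES (?, ?)")) {
            int pending = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();
                if (++pending % 1000 == 0) ps.executeBatch(); // flush every 1000 rows
            }
            ps.executeBatch();   // flush the remainder
            target.commit();
        } catch (SQLException e) {
            target.rollback();
            throw e;
        }
    }
}
```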
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
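Based on the COPY LOCAL documentation linked above, issuing the bulk load from JDBC could look roughly like this. The table name, file path, and the ABORT ON ERROR option are illustrative, not taken from the question:

```java
import java.sql.*;

public class VerticaBulkLoad {
    // Builds a Vertica COPY ... FROM LOCAL statement; one load produces
    // a single ROS container instead of thousands from row-by-row INSERTs.
    static String copyLocalSql(String table, String path, char delimiter) {
        return "COPY " + table + " FROM LOCAL '" + path
                + "' DELIMITER '" + delimiter + "' ABORT ON ERROR";
    }

    static void bulkLoad(Connection vertica, String table, String path) throws SQLException {
        try (Statement stmt = vertica.createStatement()) {
            stmt.execute(copyLocalSql(table, path, ','));
        }
    }
}
```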
I have to read huge amounts of data from a database (for example, let's consider more than 500,000 records). Then I have to save the read data to a file. I have many issues with cursors (not only memory issues).
Is it possible to do it without a cursor, for example using a stream? If so, how can I achieve it?
I have experience working with huge data sets (almost 500 million records). I simply used a PreparedStatement query, a ResultSet, and of course some buffer tweaking through:
setFetchSize(int)
In my case, I split the program into threads because the huge table was partitioned (each thread processed one partition), but I think that this is not your case.
It is pointless to fetch data through a cursor. I would rather use a database view or a plain SQL query. Do not use an ORM for this purpose.
According to your comment, your best option is to limit JDBC to fetching only a specific number of rows at a time instead of fetching all of them (this lets processing begin sooner and does not load the entire table into the ResultSet). Save your data into a collection and write it to a file using a BufferedWriter. You can also benefit from a multi-core CPU by running it in more threads, e.g., the first batch of fetched rows processed in one thread and the next batch in a second thread. In case of threading, use synchronized collections and be aware that you might face ordering problems.
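The fetch-and-write loop described above might look like the following sketch. The CSV formatting helper and the fetch size of 10,000 are assumptions, and the autocommit note applies to PostgreSQL specifically:

```java
import java.io.*;
import java.sql.*;

public class ExportToFile {
    // Formats one row as a CSV line, quoting fields that contain commas or quotes.
    static String toCsvLine(String[] fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) line.append(',');
            String f = fields[i] == null ? "" : fields[i];
            if (f.contains(",") || f.contains("\"")) {
                f = "\"" + f.replace("\"", "\"\"") + "\"";
            }
            line.append(f);
        }
        return line.toString();
    }

    static void export(Connection conn, String sql, File out) throws Exception {
        conn.setAutoCommit(false);       // needed for cursor mode in PostgreSQL
        try (PreparedStatement ps = conn.prepareStatement(sql);
             BufferedWriter w = new BufferedWriter(new FileWriter(out))) {
            ps.setFetchSize(10_000);     // stream instead of loading everything
            try (ResultSet rs = ps.executeQuery()) {
                int cols = rs.getMetaData().getColumnCount();
                String[] fields = new String[cols];
                while (rs.next()) {
                    for (int i = 0; i < cols; i++) fields[i] = rs.getString(i + 1);
                    w.write(toCsvLine(fields));
                    w.newLine();         // each row is written out, never retained
                }
            }
        }
    }
}
```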
There are a lot of different tutorials across the internet about pagination with JDBC/iterating over huge result set.
So, basically there are a number of approaches I've found so far:
Vendor specific sql
Scrollable result set (?)
Holding a plain result set in memory and mapping the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly, or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.
Cursor (?)
Custom seek method paging implemented by jooq
Sorry for the mess, but I need someone to clear this up for me.
I have a simple task where service consumer asks for results with a pageNumber and pageSize. Looks like I have two options:
Use vendor specific sql
Hold the connection/statement/result set in the memory and rely on jdbc fetchSize
In the latter case I use rxjava-jdbc, and if you look at the producer implementation, it holds the result set; all you do is call request(long n) and another n rows are processed. Of course everything is hidden under the Observable sugar of RxJava. What I don't like about this approach is that you have to hold the ResultSet between different service calls, and you have to clear that ResultSet if the client forgets to exhaust or close it. (Note: ResultSet here is the Java ResultSet class, not the actual data.)
So, what is the recommended way of doing pagination? Is vendor-specific SQL considered slow compared to holding the connection?
I am using Oracle; ScrollableResultSet is not recommended for huge result sets, as it caches the whole result set data on the client side. proof
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor specific SQL to fetch a "page" and I would do it just like that.
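A sketch of the vendor-specific approach for Oracle 12c and later, which supports the standard OFFSET/FETCH clause (the base query and ordering column are placeholders; older Oracle versions would need the classic ROWNUM wrapping instead):

```java
public class Pagination {
    // Builds a paginated query using the SQL:2008 OFFSET/FETCH clause
    // (supported by Oracle 12c+); pageNumber is 1-based.
    static String pageQuery(String baseQuery, String orderBy,
                            int pageNumber, int pageSize) {
        long offset = (long) (pageNumber - 1) * pageSize;
        return baseQuery + " ORDER BY " + orderBy
                + " OFFSET " + offset + " ROWS FETCH NEXT " + pageSize + " ROWS ONLY";
    }
}
```

Each page is then a short, self-contained query, so no connection, statement, or result set has to survive between service calls.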
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed, every row is seen exactly once, and the results across pages are consistent? Multiple fetches from a single cursor guarantee that every row in the result is seen exactly once. Separate paginated queries might not: new data could have been added or removed between query executions, your sort might not be fully deterministic, etc.
ScrollableResultSet caches the result on the client side, which requires memory. But PostgreSQL, for example, does this by default and nobody complains. Some databases simply use the client's memory to hold the whole result set. In most cases the database would have to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution (using rownum), as implemented by Hibernate, does not guarantee correct (consistent) results if data are modified between executions and the default isolation level is used.
It really depends on the use case. Changing Oracle's init parameters for maximum connections and open cursors requires a database restart.
So ScrollableResultSet and cursors can be used only when you can predict the number of concurrent users.
I am new to databases, and I love how easy it is to get data from a relational database (such as a Derby database). What I don't like is how much space one takes up; I have made a database with two tables and written a total of 130 records to them (each table has 6 columns), and the whole relational database gets saved in the system directory as a folder totalling approximately 1914014 bytes! (Assuming that I did the arithmetic right....) What the heck is going on to cause such a huge request of memory?! I also notice that there is a log1.dat file in the log folder that takes up exactly 1 MB. I looked into this file via Notepad++ and saw that it was mostly NULL characters. What is that all about?
Derby needs to keep track of your database data, the redo logs, and transactions so that your database is in a consistent state and can recover even from PC crashes.
It also creates most files with a fixed size (like 1 MB) to ensure it does not need to increase the file size later on (for performance reasons and so as not to fragment its files too much).
Over its runtime, or when stopping, Derby will clean up some of these files or regroup them and free space.
So overall, the space and the files are the trade-offs you get for using a database.
Maybe you can change some of this behaviour via some Derby configuration options (I did not find a suitable one in the docs :().
When last checked in 2011, an empty Derby database took about 770 K of disk space: http://apache-database.10148.n7.nabble.com/Database-size-larger-than-expected-td104630.html
The log1.dat file is your transaction log, and records database changes so that the database can be recovered if there is a crash (or if your program calls ROLLBACK).
Note that log1.dat is disk space, not memory.
If you'd like to learn about the basics of Derby's transaction log, start here: http://db.apache.org/derby/papers/recovery.html