I have been struggling with an architectural problem.
I have a table in a DB2 v9.7 database into which I need to insert ~250000 rows, with 13 columns each, in a single transaction. I specifically need this data to be inserted as one unit of work.
A simple INSERT INTO with executeBatch gives me:
The transaction log for the database is full. SQL Code: -964, SQL State: 57011
I don't have the rights to change the size of the transaction log, so I need to resolve this problem on the application side.
My second thought was to use a savepoint before all the inserts, but then I found out that savepoints only work within the current transaction, so that doesn't help me.
Any ideas?
You want to perform a large insert as a single transaction, but you don't have enough log space for such a transaction and no permission to increase it.
This means you need to break the insert up into multiple database transactions and manage the higher-level commit or rollback on the application side. There is nothing in the driver, either JDBC or CLI, to help with that, so you will have to write custom code that records all committed rows and manually deletes them if you need to roll back.
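For illustration, here is a minimal JDBC sketch of that approach, assuming an illustrative table my_table with an id primary key: commit in small chunks, remember which keys are already committed, and delete them by hand if a later chunk fails. The chunk size must be tuned so that a single chunk (and the compensating deletes) fits in the available log space.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedInsert {

    private static final int CHUNK_SIZE = 5000; // tune so one chunk fits the transaction log

    /** Inserts all rows in small transactions; undoes already-committed chunks on failure. */
    public static void insertAll(Connection conn, List<Object[]> rows) throws SQLException {
        conn.setAutoCommit(false);
        List<Object> committedIds = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (id, col_b) VALUES (?, ?)")) { // illustrative columns
            for (int start = 0; start < rows.size(); start += CHUNK_SIZE) {
                int end = Math.min(start + CHUNK_SIZE, rows.size());
                for (int i = start; i < end; i++) {
                    ps.setObject(1, rows.get(i)[0]);
                    ps.setObject(2, rows.get(i)[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit(); // each chunk is its own transaction, so the log stays small
                for (int i = start; i < end; i++) {
                    committedIds.add(rows.get(i)[0]); // these rows are now permanent
                }
            }
        } catch (SQLException e) {
            conn.rollback();                   // undo the failed, uncommitted chunk
            undoCommitted(conn, committedIds); // manual "rollback" of the earlier chunks
            throw e;
        }
    }

    private static void undoCommitted(Connection conn, List<Object> ids) throws SQLException {
        try (PreparedStatement del = conn.prepareStatement(
                "DELETE FROM my_table WHERE id = ?")) {
            int pending = 0;
            for (Object id : ids) {
                del.setObject(1, id);
                del.addBatch();
                if (++pending == CHUNK_SIZE) { // the undo must respect the log limit, too
                    del.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {
                del.executeBatch();
                conn.commit();
            }
        }
    }
}

Note that this only approximates atomicity: other sessions can observe the partially inserted data between chunks, so it is best combined with a status flag or a staging table if readers must never see an incomplete load.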
Another alternative might be to use the LOAD command by means of the ADMIN_CMD() system stored procedure. LOAD requires less log space. However, for this to work you will need to write rows that you want to insert into a file on the database server or to a shared filesystem or drive accessible from the server.
You can use the EXPORT/LOAD commands to export/import large tables; this should be very fast. The LOAD command should not use the transaction log. You may have a problem if your user has no privilege to write files on the server filesystem.
call SYSPROC.ADMIN_CMD('EXPORT TO /export/location/file.txt OF DEL MODIFIED BY COLDEL0x09 DECPT, select * from some_table ' )
call SYSPROC.ADMIN_CMD('LOAD FROM /export/location/file.txt OF DEL MODIFIED BY COLDEL0x09 DECPT, KEEPBLANKS INSERT INTO other_table COPY NO');
I have a Java application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later into a different instance of the application. The database is pretty big, so there are many rows in there. The export and import process has to be done from inside the Java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of its size, and secondly, executing a lot of inserts one after another causes problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections).
What would be a more efficient way of doing this? Are there any tools that can help with the process or there is no "elegant" solution?
Why not do the export/import in a single step, with batching (for performance) and chunking (to avoid errors and provide a checkpoint to start from after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk size, which may need tuning for the specific target database (big enough to speed things up, but small enough that the chunk doesn't exceed a database limit and cause failures).
As already proposed, each of these chunks should be executed in its own transaction, and your application should remember which chunk it last executed successfully in case an error occurs, so the next run can continue from there.
To read the chunks themselves, you really should use LIMIT ... OFFSET, with a stable ORDER BY so that chunk boundaries are deterministic.
This way, you can repeat any chunk at any time; each chunk is atomic by itself, and it should perform much better than single-row statements.
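A rough sketch of this in plain JDBC, assuming an illustrative table table_a with columns col_a and col_b, and a chunk counter your application persists so a failed run can resume:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedCopy {

    static final int CHUNK_SIZE = 1000; // rows per multi-row INSERT; tune for the target DB

    /** Copies table_a chunk by chunk; restart by passing the last successful chunk index. */
    public static void copy(Connection src, Connection dst, long startChunk) throws SQLException {
        dst.setAutoCommit(false);
        for (long chunk = startChunk; ; chunk++) {
            List<Object[]> rows = fetchChunk(src, chunk);
            if (rows.isEmpty()) break; // no more data
            insertChunk(dst, rows);
            dst.commit();              // one transaction per chunk = the restart checkpoint
            // persist `chunk` somewhere durable so the next run can skip finished chunks
        }
    }

    private static List<Object[]> fetchChunk(Connection src, long chunk) throws SQLException {
        List<Object[]> rows = new ArrayList<>();
        try (PreparedStatement ps = src.prepareStatement(
                "SELECT col_a, col_b FROM table_a ORDER BY col_a LIMIT ? OFFSET ?")) {
            ps.setInt(1, CHUNK_SIZE);
            ps.setLong(2, chunk * CHUNK_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new Object[] { rs.getObject(1), rs.getObject(2) });
                }
            }
        }
        return rows;
    }

    private static void insertChunk(Connection dst, List<Object[]> rows) throws SQLException {
        // build INSERT ... VALUES (?,?),(?,?),... with one tuple per row of the chunk
        StringBuilder sql = new StringBuilder("INSERT INTO table_a (col_a, col_b) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?,?)" : ",(?,?)");
        }
        try (PreparedStatement ps = dst.prepareStatement(sql.toString())) {
            int p = 1;
            for (Object[] row : rows) {
                ps.setObject(p++, row[0]);
                ps.setObject(p++, row[1]);
            }
            ps.executeUpdate();
        }
    }
}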
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTs will perform well if (see the sketch after this list):
you run them all in a single transaction
you use a PreparedStatement for the INSERT
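A minimal sketch combining both points, assuming the PostgreSQL JDBC driver (which only uses a server-side cursor when autocommit is off on the source connection); table and column names are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PgTableCopy {

    /** Streams the source table via a server-side cursor; inserts in one transaction. */
    public static void copy(Connection src, Connection dst) throws SQLException {
        src.setAutoCommit(false); // required, or pgJDBC fetches the whole result at once
        dst.setAutoCommit(false);
        try (Statement sel = src.createStatement()) {
            sel.setFetchSize(10000); // fetch 10k rows at a time instead of the entire table
            try (ResultSet rs = sel.executeQuery("SELECT col_a, col_b FROM table_a");
                 PreparedStatement ins = dst.prepareStatement(
                         "INSERT INTO table_a (col_a, col_b) VALUES (?, ?)")) {
                int pending = 0;
                while (rs.next()) {
                    ins.setObject(1, rs.getObject(1));
                    ins.setObject(2, rs.getObject(2));
                    ins.addBatch();
                    if (++pending == 1000) { // flush batches, but do not commit yet
                        ins.executeBatch();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    ins.executeBatch();
                }
            }
            dst.commit(); // all inserts run inside one single transaction
        }
    }
}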
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
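As a hedged sketch of the second link's approach: with the Vertica JDBC driver you can issue COPY ... FROM LOCAL as an ordinary statement, and the driver streams the client-side file to the server. The table name, delimiter, and file format below are assumptions; fields containing the delimiter would need escaping:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.Statement;

public class VerticaBulkLoad {

    /** Dumps rows to a delimited file, then loads it with one COPY (one ROS container). */
    public static void load(Connection vertica, Iterable<String[]> rows) throws Exception {
        Path file = Files.createTempFile("bulk", ".csv");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (String[] row : rows) {
                out.println(String.join("|", row)); // pipe-delimited; assumes no embedded '|'
            }
        }
        try (Statement stmt = vertica.createStatement()) {
            // FROM LOCAL makes the JDBC driver stream the client-side file to the server
            stmt.execute("COPY table_a FROM LOCAL '" + file + "' DELIMITER '|' ABORT ON ERROR");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}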
I have a table that records its row insert/update timestamps in a column.
I want to synchronize the data in this table with another table on another db server. The two db servers are not connected, and the synchronization is one way (master/slave). Using table triggers is not suitable.
My workflow:
I use a global last_sync_date parameter and query table Master for the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts
The complexity of the problem rises when dealing with deleted records in the Master table. To catch the deleted records, I think I would have to maintain a log table of the previously inserted records and use a SQL NOT IN. This becomes a performance problem with large datasets.
What would be an alternative workflow dealing with this scenario?
It sounds like you need a transactional message queue.
How this works is simple. When you update the master db, you send a message to the message broker (describing whatever the update was), which can route it to any number of queues. Each slave db can have its own queue, and because queues preserve order, the process should eventually synchronize correctly (ironically, this is roughly how most RDBMSs implement replication internally).
Think of the message queue as a sort of SCM change-list or patch-list database. That is, for the most part the same (or roughly the same) SQL statements sent to the master should eventually be replicated to the other databases. Don't worry about losing messages, as most message queues support durability and transactions.
I recommend you look at spring-amqp and/or spring-integration especially since you tagged this question with spring-batch.
Based on your comments:
See Spring Integration: http://static.springsource.org/spring-integration/reference/htmlsingle/ .
Google SEDA. Whether you go this route or not, you should know about message queues, as they go hand-in-hand with batch processing.
RabbitMQ's documentation has a good diagram of how messaging works.
The contents of your message might be the entire row plus whether it is a CREATE, UPDATE, or DELETE. You can use whatever format (e.g. JSON; see the Spring Integration docs for recommendations).
You could even send the direct SQL statements as a message!
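For illustration, a small sketch of publishing such a change message with spring-amqp's RabbitTemplate; the exchange name, routing key, and JSON shape are all assumptions, and the exchange and queues must already be declared on the broker:

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class ChangePublisher {

    private final RabbitTemplate rabbit;

    public ChangePublisher(String brokerHost) {
        this.rabbit = new RabbitTemplate(new CachingConnectionFactory(brokerHost));
    }

    /** Publishes one change event; each slave db consumes its own queue in order. */
    public void publish(String op, String rowAsJson) {
        String payload = "{\"op\":\"" + op + "\",\"row\":" + rowAsJson + "}";
        rabbit.convertAndSend("master.changes", "table_a", payload); // exchange, routing key
    }
}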
By the way, your concern about NOT IN being a performance problem is not a strong one, as there are a plethora of workarounds; but given that you don't want to do DB-specific things (like triggers and replication), I still feel a message queue is your best option.
EDIT - Non MQ route
Since I gave you a tough time about asking this question, I will continue to try to help.
Besides the message queue, you can use some sort of XML file like you were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database, so that you can do the batch processing while the system is up and running (otherwise you would have to stop the system). If you go this route, you will want to SELECT * WHERE CREATE_TIME < ?, where ? is the time the batch run started. Basically you are only getting the rows as of a snapshot.
Now, on your other database, you handle the deletes by joining against an ID table and keeping only the non-matches (that is, you can use JOINs instead of the slow NOT IN). Luckily you only need the ids for the delete, not the other columns. For the other columns you can use a delta based on the update timestamp column (for update, and for create a.k.a. insert).
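A hedged sketch of that delete step, assuming the master's ids were bulk-loaded into a staging table master_ids on the slave side; NOT EXISTS typically plans as an anti-join, which is the fast alternative to the slow NOT IN:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SlaveDelete {

    /** Deletes slave rows whose id is absent from the freshly loaded master_ids table. */
    public static int deleteMissing(Connection slave) throws SQLException {
        try (Statement stmt = slave.createStatement()) {
            return stmt.executeUpdate(
                "DELETE FROM slave_table"
              + " WHERE NOT EXISTS"
              + "   (SELECT 1 FROM master_ids m WHERE m.id = slave_table.id)");
        }
    }
}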
I am not sure about the solution, but I hope these links may help you.
http://knowledgebase.apexsql.com/2007/09/how-to-synchronize-data-between.htm
http://www.codeproject.com/Tips/348386/Copy-Synchronize-Table-Data-between-databases
Have a look at Oracle GoldenGate:
Oracle GoldenGate is a comprehensive software package for enabling the replication of data in heterogeneous data environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems.
SymmetricDS:
SymmetricDS is open source software for multi-master database replication, filtered synchronization, or transformation across the network in a heterogeneous environment. It supports multiple subscribers with one direction or bi-directional asynchronous data replication.
Daffodil Replicator:
Daffodil Replicator is a Java tool for data synchronization, data migration, and data backup between various database servers.
Why don't you just add a TIMESTAMP column that records the last update/insert/delete time? Then add a deleted column, i.e. mark the row as deleted instead of actually deleting it immediately, and delete it only after the delete action has been exported.
In case you cannot alter the schema usage in an existing app:
Can't you use triggers at all? How about a second ("hidden") table that gets populated on every insert/update/delete and constitutes the content of the next xml export file to be generated? That is a common concept: a history (or "log") table. It would have its own increasing id column, which can be used as an export marker.
Very interesting question.
In my case I had enough RAM to load all the ids from the master and slave tables and diff them.
If the ids in the master table are sequential, you may try to maintain a set of fully-filled ranges in the master table (ranges in which every id is used, with no gaps, like 100, 101, 102, 103).
To find removed ids without loading all of them into memory, you can execute a SQL query that counts the records with id >= full_region.start and id <= full_region.end for each fully-filled region. If the result of the query == (full_region.end - full_region.start) + 1, no record in the region has been deleted. Otherwise, split the region into two parts and run the same check on both (in many cases only one side contains removed records).
Below some range length (about 5000, I think) it becomes faster to load all the present ids and check for absent ones using a Set.
It also makes sense to load all ids into memory for a batch of small (10-20 record) regions.
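A sketch of that recursive narrowing in Java, assuming a numeric id column on an illustrative master_table; the COUNT probe, the binary split, and the small-range fallback follow the description above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.BitSet;

public class DeletedIdFinder {

    static final long SMALL = 5000; // below this size, load the ids instead of recursing

    /** Recursively narrows down the ranges that contain deleted ids using COUNT probes. */
    public static void findGaps(Connection conn, long start, long end) throws SQLException {
        long expected = end - start + 1;
        if (countInRange(conn, start, end) == expected) {
            return; // the range is fully populated, nothing was deleted here
        }
        if (expected <= SMALL) {
            reportMissing(conn, start, end); // small gappy range: diff the ids in memory
            return;
        }
        long mid = start + (end - start) / 2;
        findGaps(conn, start, mid);     // often only one half actually contains deletions
        findGaps(conn, mid + 1, end);
    }

    static long countInRange(Connection conn, long start, long end) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM master_table WHERE id BETWEEN ? AND ?")) {
            ps.setLong(1, start);
            ps.setLong(2, end);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }

    static void reportMissing(Connection conn, long start, long end) throws SQLException {
        BitSet present = new BitSet((int) (end - start + 1));
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM master_table WHERE id BETWEEN ? AND ?")) {
            ps.setLong(1, start);
            ps.setLong(2, end);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    present.set((int) (rs.getLong(1) - start));
                }
            }
        }
        for (long i = start; i <= end; i++) {
            if (!present.get((int) (i - start))) {
                System.out.println("deleted id: " + i); // candidate for deletion on the slave
            }
        }
    }
}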
Make a history table for the table that needs to be synchronized (basically a duplicate of that table, perhaps with a few extra fields) and insert the entire row every time something is inserted/updated/deleted in the active table.
Write a Spring Batch job to sync the data to the Slave machine based on the history table's extra fields.
hope this helps..
A potential option for allowing deletes within your current workflow:
In case the trigger restriction only applies to triggers with references across databases, a possible solution within your current workflow would be to create a helper table in your Master database that stores just the unique identifiers of the deleted rows (or whatever unique key lets you delete the deleted rows most efficiently).
Those ids would need to be inserted by a delete trigger on your master table.
Using the same mechanism as your inserts/updates, create a task that runs after your inserts and updates. You could export your helper table to xml, as you noted in your current workflow.
This task would simply delete the rows out of the slave table, then delete all data from your helper table once it completes. Log any errors from the task so that you can troubleshoot, since there is no audit trail.
If your database has a transaction dump log, just ship that one.
It is possible with MySQL and should be possible with PostgreSQL.
I would agree with another comment: this requires the use of triggers. I think another table should hold the history of your sql statements. See this answer about using 2008 extended events. Then you can capture the entire sql and store the resulting query in the history table. It's up to you whether you store it as a mysql query or a mssql query.
Here's my take: do you really need to deal with this? I assume the slave is for reporting purposes, so the question I would ask is how up to date it needs to be. Is it OK if the data is one day old? Do you plan a nightly refresh?
If so, forget about this online sync process: download the full tables, ship them to the mysql server, and batch load them. The processing time might be a lot quicker than you think.
Recently my team ran into a situation in which some records in our shared test database disappeared for no clear reason. Because it is a shared database (utilized by many teams), we can't track down whether it was a programming mistake or someone running a bad sql script.
So I'm looking for a way to be notified (at the database level) when a row of a specific table A gets deleted. I have looked at Postgres TRIGGERs, but they failed to give me the specific sql that caused the deletion.
Is there any way I can log the sql statement that caused the deletion of rows in table A?
You could use something like this.
It allows you to create special triggers for PostgreSQL tables that log all changes to the chosen tables.
These triggers can log the query that caused the change (via current_query()).
Using this as a base, you can add more fields/information to the log.
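As a minimal sketch of such a trigger installed from Java, assuming table_a has an id column; the audit table, function, and trigger names are illustrative, and current_query() returns the top-level statement that fired the trigger:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class DeleteAudit {

    /** Installs an audit table and a delete trigger that records current_query(). */
    public static void install(Connection pg) throws SQLException {
        try (Statement stmt = pg.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS table_a_delete_log ("
              + " deleted_id bigint, deleted_at timestamptz DEFAULT now(), causing_query text)");
            stmt.execute(
                "CREATE OR REPLACE FUNCTION log_table_a_delete() RETURNS trigger AS $$"
              + " BEGIN"
              + "   INSERT INTO table_a_delete_log(deleted_id, causing_query)"
              + "   VALUES (OLD.id, current_query());" // the SQL text that caused the delete
              + "   RETURN OLD;"
              + " END; $$ LANGUAGE plpgsql");
            stmt.execute(
                "CREATE TRIGGER table_a_delete_audit BEFORE DELETE ON table_a"
              + " FOR EACH ROW EXECUTE PROCEDURE log_table_a_delete()");
        }
    }
}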
You would do this in the actual postgres config files:
http://www.postgresql.org/docs/9.0/static/runtime-config-logging.html
log_statement (enum)
Controls which SQL statements are logged. Valid values are none (off), ddl, mod, and all (all statements). ddl logs all data definition statements, such as CREATE, ALTER, and DROP statements. mod logs all ddl statements, plus data-modifying statements such as INSERT, UPDATE, DELETE, TRUNCATE, and COPY FROM. PREPARE, EXECUTE, and EXPLAIN ANALYZE statements are also logged if their contained command is of an appropriate type. For clients using extended query protocol, logging occurs when an Execute message is received, and values of the Bind parameters are included (with any embedded single-quote marks doubled).
The default is none. Only superusers can change this setting.
Since DELETEs are data-modifying statements, you want either mod or all as the setting. This is what you need to alter:
In your data/postgresql.conf file, change the log_statement setting to 'all'. Furthermore, the following may also need to be validated:
1) make sure you have turned on the log_destination variable
2) make sure you turn on the logging_collector
3) also make sure that pg_log actually exists relative to your data directory, and that the postgres user can write to it.
taken from here
I have to implement a requirement for a Java CRUD application where users want to keep their search results intact even if they perform actions that affect the criteria by which the returned rows are matched.
Confused? OK, let me give you a familiar example. In Gmail, if you do an advanced search on unread emails, you are presented with a list of matching results. Click on an entry and then go back to the search list. What happens is that you have just read that entry, but it hasn't disappeared from the original result set; only that line has changed from bold to normal.
I need to implement exactly the same behaviour, but the application is designed in such a way that any transaction is persisted first and then the UI re-queries the db to stay in sync. The complexity of the application and the size of the database prevent me from doing a simple in-memory cache of the matching rows and making the changes both in the db and in memory.
I'm thinking of solving the problem at the database level by creating an intermediate table in the Oracle database that holds pointers to the matching records and re-querying only those records to keep the UI in sync with the data. Any ideas?
In Oracle, if you open a cursor, the results of that cursor are static, regardless of whether another transaction inserts a row that would appear in your cursor, or updates or deletes a row that does exist in your cursor.
The challenge then is to not close the cursor if you want results consistent from when the cursor was opened.
If the UI maintains a single session on the database, one solution is to use Global Temporary Tables in Oracle. When you execute a search, insert the unique IDs into the GTT, then the UI just queries the GTT.
If the UI doesn't keep the session open, you could do the same thing but with an ordinary table. Then, of course, you'd just have to add some cleanup code to remove old search results from the table.
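A small sketch of the GTT variant, with illustrative names; the predicate is concatenated only for brevity (use bind variables in real code), and stored ROWIDs go stale if rows are moved or deleted:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SearchSnapshot {

    /** One-time setup: a GTT whose rows survive commits but vanish when the session ends. */
    public static void createGtt(Connection oracle) throws SQLException {
        try (Statement stmt = oracle.createStatement()) {
            stmt.execute("CREATE GLOBAL TEMPORARY TABLE search_hits (row_id ROWID)"
                       + " ON COMMIT PRESERVE ROWS");
        }
    }

    /** Pins the current matches; the UI then re-queries only these rows. */
    public static void snapshotSearch(Connection oracle, String predicate) throws SQLException {
        try (Statement stmt = oracle.createStatement()) {
            stmt.execute("DELETE FROM search_hits"); // clear the previous search
            stmt.execute("INSERT INTO search_hits"
                       + " SELECT rowid FROM emails WHERE " + predicate);
        }
    }
}

The UI would then refresh with something like SELECT e.* FROM emails e JOIN search_hits h ON e.rowid = h.row_id, so rows stay in the result set even when they no longer match the original criteria.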
You can use a flashback query to read data from the past. For example, select * from employee as of timestamp to_timestamp('01-MAY-2011 070000', 'DD-MON-YYYY HH24MISS');
Oracle only stores this historical information for a limited period of time. You'll need to look into your retention settings: the UNDO_RETENTION parameter, the UNDO tablespace retention guarantee and proper sizing; LOBs also have their own retention setting.
Create two connections to the database.
Set the first one to READ ONLY (using SET TRANSACTION READ ONLY), do your searching from that connection, and make sure you never end that transaction by issuing a commit or rollback.
Since a read-only transaction only sees the data as it was at the time the transaction started, the first connection will never see any changes to the database, not even committed ones.
Then you can do your updates in the second connection without affecting the results in the first connection.
If you cannot use two connections, you could implement the updates through stored procedures that use autonomous transactions, then you can keep the read only transaction open in the single connection you have.
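A minimal sketch of the two-connection variant; the essential point is that the reader connection must never commit or roll back while the results are on screen:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class FrozenView {

    /** Opens a connection whose view of the data is frozen at the moment this runs. */
    public static Connection openFrozenReader(String url, String user, String pw)
            throws SQLException {
        Connection reader = DriverManager.getConnection(url, user, pw);
        reader.setAutoCommit(false); // the SET TRANSACTION must start the transaction
        try (Statement stmt = reader.createStatement()) {
            stmt.execute("SET TRANSACTION READ ONLY"); // Oracle read-only snapshot
        }
        return reader; // do all searching here; never commit or roll back this connection
    }
}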
So I have a database into which a lot of data is inserted from a Java application. Usually I insert into table1 and get the last id, then insert into table2 and get the last id from there, and finally insert into table3 and get that id as well and work with it within the application. And I insert around 1000-2000 rows of data every 10-15 minutes.
Using a lot of small inserts and selects on a production webserver is not really good, because it sometimes bogs down the server.
My question is: is there a way to insert multiple rows into table1, table2, and table3 without such a huge number of selects and inserts? Is there some sql-fu technique I'm missing?
Since you're probably relying on auto_increment primary keys, you have to do the inserts one at a time, at least for table1 and table2, because MySQL won't give you more than the very last key generated.
You should never have to select. You can get the last inserted id from the Statement using the getGeneratedKeys() method. See an example showing this in the MySQL manual for the Connector/J:
http://dev.mysql.com/doc/refman/5.1/en/connector-j-usagenotes-basic.html#connector-j-examples-autoincrement-getgeneratedkeys
Other recommendations:
Use multi-row INSERT syntax for table3.
Use ALTER TABLE DISABLE KEYS while you're importing, and re-enable them when you're finished.
Use explicit transactions. I.e. begin a transaction before your data-loading routine, and commit at the end. I'd probably also commit after every 1000 rows of table1.
Use prepared statements.
Unfortunately, you can't use the fastest method for bulk load of data, LOAD DATA INFILE, because that doesn't allow you to get the generated id values per row.
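A sketch of the whole flow with getGeneratedKeys, with illustrative table and column names and an explicit transaction around the three inserts:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ThreeTableInsert {

    /** Inserts one parent chain; keys come from getGeneratedKeys, not from SELECTs. */
    public static void insertChain(Connection conn, Object v1, Object v2, Object v3)
            throws SQLException {
        conn.setAutoCommit(false); // explicit transaction around the whole unit of work
        long id1;
        long id2;
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO table1 (col_a) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            ps.setObject(1, v1);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                id1 = keys.getLong(1); // the auto_increment key, no extra SELECT needed
            }
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO table2 (table1_id, col_b) VALUES (?, ?)",
                Statement.RETURN_GENERATED_KEYS)) {
            ps.setLong(1, id1);
            ps.setObject(2, v2);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                id2 = keys.getLong(1);
            }
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO table3 (table2_id, col_c) VALUES (?, ?)")) {
            ps.setLong(1, id2);
            ps.setObject(2, v3);
            ps.executeUpdate(); // table3 could also use the multi-row VALUES syntax
        }
        conn.commit();
    }
}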
There's a lot to talk about here:
It's likely that network latency is killing you if each of those INSERTs is another network roundtrip. Try batching your requests so they only require a single roundtrip for the entire transaction.
Speaking of transactions, you don't mention them. If all three of those INSERTs need to be a single unit of work you'd better be handling transactions properly. If you don't know how, better research them.
Try caching requests if they're reused a lot. The fastest roundtrip is the one you don't make.
You could redesign your database so that the primary key is not a database-generated, auto-incremented value, but rather a client-generated UUID. Then you could generate all the keys for every record upfront and batch the inserts however you like.
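A sketch of the UUID variant; since every key is generated client-side, each table can be filled with batched inserts (only two tables shown for brevity, names illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.UUID;

public class UuidBatchInsert {

    /** Generates all keys up front, so every table can be inserted in one batch each. */
    public static void insertAll(Connection conn, List<String> payloads) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement t1 = conn.prepareStatement(
                 "INSERT INTO table1 (id, col_a) VALUES (?, ?)");
             PreparedStatement t2 = conn.prepareStatement(
                 "INSERT INTO table2 (id, table1_id, col_b) VALUES (?, ?, ?)")) {
            for (String payload : payloads) {
                String id1 = UUID.randomUUID().toString(); // key known before any insert
                String id2 = UUID.randomUUID().toString();
                t1.setString(1, id1);
                t1.setString(2, payload);
                t1.addBatch();
                t2.setString(1, id2);
                t2.setString(2, id1); // the foreign key is already known client-side
                t2.setString(3, payload);
                t2.addBatch();
            }
            t1.executeBatch(); // one roundtrip per table instead of one per row
            t2.executeBatch();
            conn.commit();
        }
    }
}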