Update two databases daily so both contain the same valid data set - java

I have the task to update two databases daily. Simplified, an entry looks like this:
service_id; id_service_provider; valid_from; valid_to;
I get the data in the form of a CSV file. To give you some examples of how to interpret the lines of the file, here are some entries:
114; 20; 2011-12-06; 2017-10-16 //service terminated in 2017
211; 65; 2015-04-09; 9999-12-31 //service still valid
322; 57; 2019-08-22; 9999-12-31 //new service as of today
336; 20; 2009-08-20; 2019-07-11 //change provider, see next line
336; 37; 2019-07-11; 9999-12-31 //new provider for the above services
The files can have several thousand entries because new entries and changes are simply appended; I don't get a daily delta, I always get the whole file.
I only have full access to the first database, which contains all entries (both current and historical). The second database should, for faster queries, contain only the currently valid services and not the terminated ones. For this second database, to which I don't have access, I have to create a file every day containing the commands to:
add new services
delete terminated services
update provider changes
My current approach looks like this:
Create a List<Service> from each line of the file.
Make a database query for each entry in the list.
If an identical service exists and nothing has changed, delete the service from this list.
If the service exists but the end date or provider id differs, terminate the old service and simultaneously insert a new service valid as of today. Additionally, for the second database, prepare a new list toUpdate and add this service to it.
If the service is not found, insert it into the first database and add it to a list toInsert.
Send the lists toInsert and toUpdate to the second database.
Since my datasets in the databases are constantly diverging, I want to rethink my approach and reimplement the whole thing. How would you proceed with this task?
Edit
The database I have full access to is Oracle; the second one is DB2. I can't use database features that keep the data synchronized. I am limited to creating a CSV file with Java to keep the second database synchronized.

For this kind of thing, I like to keep a separate table of what I think the remote database looks like. That way, I can:
Generate deltas easily by comparing my source data with my copy of what should be in the remote database.
Correct errors in PROD by updating the copy to force the process to resend (e.g., if the team managing the other database misses a file or something).
Here is a working example to illustrate the process.
Cast of characters:
SO_SERVICES --> your source table
SO_SERVICES_EXPORTED --> a copy of what the remote database should currently look like, if they've processed all our command .csv files correctly.
SO_SERVICES_EXPORT_CMDS --> the set of deltas generated by comparing SO_SERVICES and SO_SERVICES_EXPORTED. You would generate your .csv file from this table and then delete from it.
SYNC_SERVICES --> a procedure to generate the deltas
Setup Tables
CREATE TABLE so_services
( service_id NUMBER NOT NULL,
id_service_provider NUMBER NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE DEFAULT DATE '9999-12-31' NOT NULL,
CONSTRAINT so_services_pk PRIMARY KEY ( service_id, id_service_provider ),
CONSTRAINT so_services_c1 CHECK ( valid_from <= valid_to ) );
CREATE TABLE so_services_exported
( service_id NUMBER NOT NULL,
id_service_provider NUMBER NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE DEFAULT DATE '9999-12-31' NOT NULL,
CONSTRAINT so_services_exported_pk PRIMARY KEY ( service_id ),
CONSTRAINT so_services_exported_c1 CHECK ( valid_from <= valid_to ) );
CREATE TABLE so_services_export_cmds
( service_id NUMBER NOT NULL,
id_service_provider NUMBER,
cmd VARCHAR2(30) NOT NULL,
valid_from DATE,
valid_to DATE,
CONSTRAINT so_services_export_cmds_pk PRIMARY KEY ( service_id, cmd ) );
Procedure to process synchronization
-- You would put this in a package, for real code
CREATE OR REPLACE PROCEDURE sync_services IS
BEGIN
LOCK TABLE so_services IN EXCLUSIVE MODE;
-- Note the deltas between the current active services and what we've exported so far
-- CAVEAT: I am not sweating your exact business logic here. I am just trying to illustrate the approach.
-- The logic here assumes that the target database wants only one row for each service_id, so we will send an
-- "ADD" if the target database should insert a new service ID, "UPDATE", if it should modify an existing service ID,
-- or "DELETE" if it should delete it.
-- Also assuming, for "DELETE" command, we only need the service_id, no other fields.
INSERT INTO so_services_export_cmds
( service_id, id_service_provider, cmd, valid_from, valid_to )
SELECT nvl(so.service_id, soe.service_id) service_id,
so.id_service_provider id_service_provider,
CASE WHEN so.service_id IS NOT NULL AND soe.service_id IS NULL THEN 'ADD'
WHEN so.service_id IS NULL AND soe.service_id IS NOT NULL THEN 'DELETE'
WHEN so.service_id IS NOT NULL AND soe.service_id IS NOT NULL THEN 'UPDATE'
ELSE NULL -- this will fail and should.
END cmd,
so.valid_from valid_from,
so.valid_to valid_to
FROM ( SELECT * FROM so_services WHERE SYSDATE BETWEEN valid_from AND valid_to ) so
FULL OUTER JOIN so_services_exported soe ON soe.service_id = so.service_id
-- Exclude any UPDATES that don't change anything
WHERE NOT ( soe.service_id IS NOT NULL
AND so.service_id IS NOT NULL
AND so.id_service_provider = soe.id_service_provider
AND so.valid_from = soe.valid_from
AND so.valid_to = soe.valid_to);
-- Update the snapshot of what the remote database should now look like after processing the above commands.
-- (i.e., it should have all the current records from the service table)
DELETE FROM so_services_exported;
INSERT INTO so_services_exported
( service_id, id_service_provider, valid_from, valid_to )
SELECT service_id, id_service_provider, valid_from, valid_to
FROM so_services so
WHERE SYSDATE BETWEEN so.valid_from AND so.valid_to;
-- For testing (12c only)
DECLARE
c SYS_REFCURSOR;
BEGIN
OPEN c FOR SELECT * FROM so_services_export_cmds ORDER BY service_id;
DBMS_SQL.RETURN_RESULT(c);
END;
COMMIT; -- Release exclusive lock on services table
END sync_services;
Insert Test Data from OP
DELETE FROM so_services;
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 114, 20, DATE '2011-12-06', DATE '2017-10-16' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 211, 65, DATE '2015-04-09', DATE '9999-12-31' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 322, 57, DATE '2019-08-22', DATE '9999-12-31' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 336, 20, DATE '2009-08-20', DATE '2019-07-11' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 336, 37, DATE '2019-07-11', DATE '9999-12-31' );
Test #1 -- Nothing exported yet, so all latest records should be sent
exec sync_services;
SERVICE_ID ID_SERVICE_PROVIDER CMD VALID_FRO VALID_TO
---------- ------------------- ------------------------------ --------- ---------
211 65 ADD 09-APR-15 31-DEC-99
322 57 ADD 22-AUG-19 31-DEC-99
336 37 ADD 11-JUL-19 31-DEC-99
Test #2 -- no additional updates, no additional commands
DELETE FROM so_services_export_cmds; -- You would do this after generating your .csv file
exec sync_services;
no rows selected
Test #3 - Add some changes to the source table
-- Add a new service #400
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 400, 20, DATE '2019-08-29', DATE '9999-12-31' );
-- Terminate service 322
UPDATE so_services
SET valid_to = DATE '2019-08-29'
WHERE service_id = 322
AND valid_to = DATE '9999-12-31';
-- Update service 336
UPDATE so_services
SET valid_to = DATE '2019-08-29'
WHERE service_id = 336
AND id_service_provider = 37
AND valid_to = DATE '9999-12-31';
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 336, 88, DATE '2019-08-29', DATE '9999-12-31' );
exec sync_services;
SERVICE_ID ID_SERVICE_PROVIDER CMD VALID_FRO VALID_TO
---------- ------------------- ------------------------------ --------- ---------
322 DELETE
336 88 UPDATE 29-AUG-19 31-DEC-99
400 20 ADD 29-AUG-19 31-DEC-99
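To tie this back to the Java side, here is a rough sketch (plain JDBC, untested, with placeholder connection details and file name) of how the daily job could call SYNC_SERVICES, dump SO_SERVICES_EXPORT_CMDS to the .csv file for the DB2 team, and then clear the command table:
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExportCommands {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {
            con.setAutoCommit(false);
            // 1. Generate today's deltas into SO_SERVICES_EXPORT_CMDS.
            try (CallableStatement cs = con.prepareCall("{ call sync_services }")) {
                cs.execute();
            }
            // 2. Write the command table to the .csv file for the DB2 team.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT service_id, id_service_provider, cmd, valid_from, valid_to "
                   + "FROM so_services_export_cmds ORDER BY service_id");
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                     Paths.get("services_" + java.time.LocalDate.now() + ".csv")))) {
                while (rs.next()) {
                    // DELETE rows carry only the service_id; provider and dates are null.
                    out.printf("%d;%s;%s;%s;%s%n",
                        rs.getLong("service_id"),
                        rs.getString("id_service_provider"),
                        rs.getString("cmd"),
                        rs.getDate("valid_from"),
                        rs.getDate("valid_to"));
                }
            }
            // 3. Clear the command table only after the file was written successfully.
            try (Statement st = con.createStatement()) {
                st.executeUpdate("DELETE FROM so_services_export_cmds");
            }
            con.commit();
        }
    }
}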

Since you have full access to the Oracle DB, could you do this:
Add two new columns: Last_Updated_Time and Flag.
Last_Updated_Time should contain the date on which the row was inserted or updated. You can create a trigger on this table to populate the column; no other modification is needed.
The second column, Flag, can contain various values depending on the business scenario and can also be populated through the trigger. For example: when a service id is created for the first time, set it to 1; a terminated service to 2; on a provider change, the terminated row to 3 and the row with the new provider to 4; and so on.
The Oracle query that fetches the data should add the condition Last_Updated_Time > sysdate - 1 at the end of the reporting query; this fetches the updated data only.
As is Oracle DB values :
114; 20; 2011-12-06; 2017-10-16 //service terminated in 2017
211; 65; 2015-04-09; 9999-12-31 //service still valid
322; 57; 2019-08-22; 9999-12-31 //new service as of today
336; 20; 2009-08-20; 2019-07-11 //change provider, see next line
336; 37; 2019-07-11; 9999-12-31 //new provider for the above services
Updated (for existing records you can populate the last update date with Valid_To for terminated records and Valid_From for the rest):
114; 20; 2011-12-06; 2017-10-16; 2017-10-17; 2 //service terminated in 2017; last update date is old
211; 65; 2015-04-09; 9999-12-31; 2015-04-09; 1 //service still valid; last update date is old
322; 57; 2019-08-22; 9999-12-31; 2019-08-28; 1 //new service as of today; last update date is today
336; 20; 2009-08-20; 2019-07-11; 2019-08-28; 3 //change provider, see next line; assumed : updated today
336; 37; 2019-07-11; 9999-12-31; 2019-08-28; 4 //new provider for the above services; assumed : updated today
Now you can have two separate queries, one building the list of new records and one the list of records to be updated, and send the CSVs accordingly (e.g. records with flag 1 or 4 go to the toInsert list and records with flag 2 or 3 to the toUpdate list).
tl;dr:
Add two columns to the Oracle table for the last update date and a record status flag, and then, based on these values, create two CSV files daily with the previous day's inserted/updated data.
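A rough sketch of the Java side under this scheme, assuming the table is called SERVICES, the new columns are LAST_UPDATED_TIME and FLAG, and the connection details and file names are placeholders:
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FlagBasedExport {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT service_id, id_service_provider, valid_from, valid_to, flag "
                   + "FROM services WHERE last_updated_time > SYSDATE - 1";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery();
             PrintWriter toInsert = new PrintWriter(Files.newBufferedWriter(Paths.get("toInsert.csv")));
             PrintWriter toUpdate = new PrintWriter(Files.newBufferedWriter(Paths.get("toUpdate.csv")))) {
            while (rs.next()) {
                String line = String.format("%d;%d;%s;%s",
                        rs.getLong("service_id"),
                        rs.getLong("id_service_provider"),
                        rs.getDate("valid_from"),
                        rs.getDate("valid_to"));
                int flag = rs.getInt("flag");
                // Flags 1 (new service) and 4 (new provider) become inserts on DB2,
                // flags 2 (terminated) and 3 (old provider row closed) become updates/deletes.
                if (flag == 1 || flag == 4) {
                    toInsert.println(line);
                } else {
                    toUpdate.println(line);
                }
            }
        }
    }
}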

This problem can be solved in multiple ways, as others have already answered in the thread. I'm assuming this is a work-related problem, which means it has to be reliable, available and fault-tolerant. I don't see many constraints on processing time (e.g. must the entire run finish in 30 minutes? This is indirectly a latency question), throughput (we have a few thousand records; can it grow, and if so by how much? Could it ever grow to unmanageable proportions?) or security (who can access the data, how will they access it, etc.).
Based on the above assumptions, we can solve this in different ways. I'm presenting three of them here.
Approach1
A partitioned master Oracle table (MASTER_SERVICES_TABLE). The table definition contains all the columns from the CSV plus any additional columns needed (created/modified date fields). The partitioning scheme can be determined by the retention requirement; in both cases below, the partition key depends on the created column.
Is a maximum retention of one year enough? Then use the DAY_OF_YEAR number as the partition key.
Is multi-year retention expected? Use the day in DD-MM-YYYY format as the partition key.
Use the SQLLDR command-line tool from Oracle to load the data into a temporary table on a daily basis. After a successful load, execute a partition exchange between the temporary table and the current date's partition.
Create another table (SERVICE_TABLE) that contains all the columns from the incoming file plus a few extra columns (primary key, status, service_expired_on, created, modified, etc.).
Have one or more cron jobs depending on system load and throughput requirements. If the load (number of records) is small (a few thousand records), one cron job is enough; a higher load calls for more. If opting for a multi-cron-job model, it's better to have a two-step process.
A master cron job wakes once a day and creates as many slave jobs as needed. Based on system capacity we can set criteria for the slave jobs, e.g. each slave processes only 100k records; with 1 million records the master creates 1,000,000 / 100,000 = 10 slave jobs.
Slave jobs can be configured in two ways:
Wake up more often (every hour, or even more frequently, based on system throughput).
Have the master spawn the slave jobs once it has finished its own work.
Slave cron jobs contain the business logic: a new service was onboarded, a service was decommissioned, a new service provider started, etc. This part must be covered by unit tests to document the expected behavior. The end result of a slave job is to update SERVICE_TABLE, which contains exactly one working copy of all services (historical/active/decommissioned, or whatever the business needs).
The slave cron jobs keep updating their status in an Oracle table.
Another cron job (the active-service generator), outside of the master/slave jobs, is triggered by the last exiting slave.
This active-service generator reads data from SERVICE_TABLE and dumps it into a predefined file format (CSV/JSON/TSV/PSV, etc.) if we really want a file-based approach for the second database, or it can update the second database directly from this cron job.
If the generated file is huge, loading its data into the secondary database can be done in parallel (depending on the capabilities of that DB).
Cron jobs on traditional UNIX systems are not reliable; it's better to use a Chronos/Mesos cluster for maintaining them.
Have monitoring/alerting on the above jobs.
MASTER_SERVICES_TABLE acts as the source of truth in case of discrepancies.
Have archiving/cleanup implemented on all the tables involved.
Approach2
Dump the above file to HDFS on a daily basis, e.g. /projects/servicedata/DDMMYYYY/.
Use a Pig Latin script to read the file contents.
Write a UDF that takes care of merging changes in service providers, etc. (basically the business logic). Unit-test this UDF for all possible use cases.
Output the final result of the Pig Latin script to a file.
Write a program to read the generated file and load it into any database you want.
Use an Oozie workflow to load the above-generated file into the database.
Approach3
Assuming this is just a personal project, we don't care about all the industry standards.
You can use a simple version of the pipes-and-filters architecture pattern.
A standard Java program (or any other language) reads the file and splits the records across a predefined number of threads/processes. Each record is hashed on some key (service_id); range 1 goes to thread/process 1, range 2 to thread/process 2, and so on (see the sketch after this list).
Each of these threads/processes depends on a library that contains your business logic. The library can implement state management using a state machine.
Each of the threads/processes has access to the data sources it needs to write to.
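A minimal sketch of the hashing/fan-out step (Java 17 syntax; the Service record and the processPartition body are placeholders for your parsing and business logic):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionedProcessor {
    record Service(long serviceId, long providerId, String validFrom, String validTo) {}

    public static void main(String[] args) throws Exception {
        List<Service> records = List.of(); // parsed from the daily CSV file
        int partitions = 4;
        List<List<Service>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) buckets.add(new ArrayList<>());
        // Hash on the key so all rows of one service end up in the same partition.
        for (Service s : records) {
            int bucket = Math.floorMod(Long.hashCode(s.serviceId()), partitions);
            buckets.get(bucket).add(s);
        }
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        for (List<Service> bucket : buckets) {
            pool.submit(() -> processPartition(bucket));
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES);
    }

    static void processPartition(List<Service> bucket) {
        // business logic / state machine for this partition goes here
    }
}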
Finally, apologies if this does not solve your problem; I haven't paid too much attention to the Add/Delete/Update business logic because it can change on a case-by-case basis. My thinking is that if the framework/architecture is robust enough, we can plug in whatever business logic we need.

Solution 1
Assumptions
you don't care about the commit log
you don't have any history table maintained over the table
For Oracle, this operation will be performed when there is no load on the database.
From the way you are currently doing it, it seems there is enough memory available on the DB servers to insert all the data in one go.
Solution
I would truncate the tables and then insert the data.
TRUNCATE/INSERT has many benefits over DELETE/UPDATE/INSERT. The biggest one is sequential writes.
I would generate multi-row SQL statements like the following:
Oracle
TRUNCATE TABLE MyTable;
INSERT ALL
INTO MyTable(service_id, id_service_provider, valid_from, valid_to) VALUES (114, 20, DATE '2011-12-06', DATE '2017-10-16')
INTO MyTable(service_id, id_service_provider, valid_from, valid_to) VALUES (211, 65, DATE '2015-04-09', DATE '9999-12-31')
...
SELECT 1 FROM DUAL;
DB2
BEGIN TRANSACTION;
TRUNCATE TABLE MyTable IMMEDIATE;
INSERT INTO MyTable(service_id, id_service_provider, valid_from, valid_to) VALUES
(114, 20, '2011-12-06', '2017-10-16'),
(211, 65, '2015-04-09', '9999-12-31')
...
;
COMMIT;
For Oracle, I would generate the SQL statements for all the rows since it's a replica.
For DB2, I would generate the SQL statements for all the rows which have end date '9999-12-31'.
Solution 2
Database 1
Assumptions
The data is extracted after day end (midnight); e.g. the data was extracted on 26 Aug but does not contain any entry for 26 Aug.
There is no update performed on this table.
Solution:
I would create the delta myself with the help of a cursor. I would generate the SQL statements for all the rows which come after that cursor.
I would maintain a single-value table holding the cursor. The value of this cursor could be an auto-incremented serial id (if any) or the maximum date in either the fromDate or toDate column, excluding '9999-12-31'. This date is essentially the collection date minus one day (see the assumption).
The value of the cursor can be maintained in two ways:
A trigger on every insert in the database.
Updating it from the Java code after every insert.
For insertion: I would fetch this cursor from the database and then generate SQL statements for all the lines in the file which come after my cursor, i.e. where
(fromDate > max-date || (toDate > max-date && toDate != '9999-12-31'))
A sketch of this filter follows.
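A rough sketch of that filter in Java, assuming the cursor is stored as a max-date and '9999-12-31' marks an open-ended service; the file name and cursor value are placeholders:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeltaFilter {
    private static final LocalDate OPEN_END = LocalDate.of(9999, 12, 31);

    public static void main(String[] args) throws Exception {
        // Read the cursor value from the single-value table; hard-coded here for the sketch.
        LocalDate maxDate = LocalDate.parse("2019-08-27");
        try (Stream<String> lines = Files.lines(Paths.get("services.csv"))) {
            List<String> newOrChanged = lines.filter(line -> {
                String[] f = line.split(";");
                LocalDate from = LocalDate.parse(f[2].trim());
                LocalDate to = LocalDate.parse(f[3].trim());
                // A line is part of the delta if it starts after the cursor, or was
                // terminated after it; the open-ended 9999-12-31 marker is ignored.
                return from.isAfter(maxDate)
                    || (!to.equals(OPEN_END) && to.isAfter(maxDate));
            }).collect(Collectors.toList());
            // generate the SQL / csv commands from newOrChanged, then advance the cursor
            System.out.println(newOrChanged.size() + " delta lines");
        }
    }
}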
Database 2
I would write UPSERT queries for all the valid rows (rows having the end date '9999-12-31') and then delete all the rows from the table whose end date is not '9999-12-31', i.e.:
MERGE INTO MyTable AS mt
USING ( VALUES
(114, 20, '2011-12-06', '2017-10-16'),
(211, 65, '2015-04-09', '9999-12-31')
...
) AS sh (service_id, id_service_provider, valid_from, valid_to)
ON (mt.service_id = sh.service_id)
WHEN MATCHED THEN
UPDATE SET
id_service_provider = sh.id_service_provider,
valid_from = sh.valid_from,
valid_to = sh.valid_to
WHEN NOT MATCHED THEN
INSERT (service_id, id_service_provider, valid_from, valid_to)
VALUES (sh.service_id, sh.id_service_provider, sh.valid_from, sh.valid_to)

Since my datasets in the databases are constantly diverging, I want to rethink my approach and reimplement the whole thing. How would you proceed with this task?
You didn't specify which databases you're using, but if you're open to changing that along with rethinking the approach, I would consider using whatever database replication mechanisms are available. If no replication feature is available, I would consider switching to databases that support replication.
As you have found, keeping two databases in sync is complicated, and quite likely not what you want to spend your time doing.

Given the requirements and constraints you provided, here is the approach I would take to solve this problem:
Parse the original file and store the data in, e.g., a List (not sure how big the file is; assume the server has enough memory to accommodate the data).
Get the unique list of service IDs out of the List (assume it's a unique key) and query Oracle, in chunks of up to 1000 IDs (Oracle's IN-list limit), for info such as the current service provider and from_date/to_date.
Compare the two lists (what's in the List vs. what came from Oracle) to determine the action for each service (e.g. new, deleted, provider changed, etc.).
Use a batch update to insert/update each service in Oracle.
Generate the CSV file for DB2 based on the action.
Consider using a lightweight JDBC framework like MyBatis. Also consider using the List stream() API when manipulating the List. A sketch of the chunked lookup from step 2 follows.
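A rough sketch of the chunked Oracle lookup from step 2, with placeholder table and column names and a simplified string value per service:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ChunkedLookup {
    static Map<Long, String> loadCurrentState(Connection con, List<Long> serviceIds) throws SQLException {
        Map<Long, String> current = new HashMap<>();
        // Query in chunks of at most 1000 IDs to stay under Oracle's IN-list limit.
        for (int i = 0; i < serviceIds.size(); i += 1000) {
            List<Long> chunk = serviceIds.subList(i, Math.min(i + 1000, serviceIds.size()));
            String placeholders = chunk.stream().map(id -> "?").collect(Collectors.joining(","));
            String sql = "SELECT service_id, id_service_provider, valid_from, valid_to "
                       + "FROM so_services WHERE service_id IN (" + placeholders + ")";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int p = 0; p < chunk.size(); p++) {
                    ps.setLong(p + 1, chunk.get(p));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Keep the current provider and validity dates for the comparison step.
                        current.put(rs.getLong("service_id"),
                                rs.getLong("id_service_provider") + ";"
                                + rs.getDate("valid_from") + ";" + rs.getDate("valid_to"));
                    }
                }
            }
        }
        return current;
    }
}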

Related

SQL Query for Order Status report from Oracle DB

I have a table Order_Status in Oracle DB 11, which stores an order id and all its statuses, for example:
order id status date
100 at warehouse 01/01/18
100 dispatched 02/01/18
100 shipped 03/01/18
100 at customer doorstep 04/01/18
100 delivered 05/01/18
A few days back some of the orders were stuck in the warehouse, but it is not possible to check the status of each order every day, so no one noticed until we received a big escalation mail from the business. This raised the requirement for a system or daily report that tells us the present status of all orders, with a condition such as: if more than 2 days have passed and no new status has been updated in the DB for an order, mark it in red or highlight it.
We already have cron schedules for some of our reports, but even if I create a SQL query for the status report, it won't highlight pending orders.
Note: SQL, Java, or other tool suggestions are all welcome, but SQL is preferred, then a tool, then Java.
I am assuming that your requirement is "the status should always change within 2 days; if not, something is wrong":
select * from (
select order_id,
status,
update_date,
RANK() OVER (PARTITION BY order_id ORDER BY update_date DESC) as rank
from Order_Status
where status != 'delivered'
)
where update_date < sysdate - 2 and rank = 1

Missing column that was just inserted in cassandra column family

We are constantly getting a problem on our test cluster.
Cassandra configuration:
cassandra version: 2.2.12
node count: 6, seed nodes: 3, non-seed nodes: 3
replication factor 1 (of course for prod we will use 3)
Table configuration where we get problem:
CREATE TABLE "STATISTICS" (
key timeuuid,
column1 blob,
column2 blob,
column3 blob,
column4 blob,
value blob,
PRIMARY KEY (key, column1, column2, column3, column4)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC, column2 ASC, column3 ASC, column4 ASC)
AND caching = {
'keys':'ALL', 'rows_per_partition':'100'
}
AND compaction = {
'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
};
Our java code details
java 8
cassandra driver: astyanax
app-nodes count: 4
So, whats happening:
Under high load our application does many inserts into Cassandra tables from all nodes.
During this we have one workflow where we do the following with one row in the STATISTICS table:
do insert 3 columns from app-node-1
do insert 1 column from app-node-2
do insert 1 column from app-node-3
do read all columns from row on app-node-4
At the last step (4), when we read all the columns, we are sure that the insert of all columns is done (this is guaranteed by other checks that we have).
The problem is that sometimes (2-5 times per 100,000) at step 4, when we read all the columns, we get 4 columns instead of 5, i.e. we are missing a column that was inserted at step 2 or 3.
We even started re-reading these columns every 100 ms in a loop and still don't get the expected result. During this time we also check the columns using cqlsh, with the same result, i.e. 4 instead of 5.
BUT if we add any new column to this row, we immediately get the expected result, i.e. we then get 6 columns: 5 columns from the workflow and 1 dummy.
So after inserting the dummy column we get the missing column that was inserted at step 2 or 3.
Moreover, the timestamp of the missing (and then reappeared) column is very close to the time when the column was actually added from our app node.
Basically, the insertions from app-node-2 and app-node-3 are done at nearly the same time, so these two columns always have nearly the same timestamp, even if we insert the dummy column a minute after the first read of all columns at step 4.
With replication factor 3 we cannot reproduce this problem.
So open questions are:
Maybe this is expected behavior of Cassandra when the replication factor is 1?
If it's not expected, what could be the potential reason?
UPDATE 1:
The following code is used to insert a column:
UUID uuid = <some uuid>;
short shortV = <some short>;
int intVal = <some int>;
String strVal = <some string>;
ColumnFamily<UUID, Composite> statisticsCF = ColumnFamily.newColumnFamily(
"STATISTICS",
UUIDSerializer.get(),
CompositeSerializer.get()
);
MutationBatch mb = keyspace.prepareMutationBatch();
ColumnListMutation<Composite> clm = mb.withRow(statisticsCF, uuid);
clm.putColumn(new Composite(shortV, intVal, strVal, null), true);
mb.execute();
UPDATE 2:
Proceeding with testing/investigating.
When we caught this situation again, we immediately stopped (killed) our Java apps, and then could consistently see in cqlsh that the particular row does not contain the inserted column.
To make it appear, we first tried nodetool flush on every Cassandra node:
pssh -h cnodes.txt /path-to-cassandra/bin/nodetool flush
Result: the same, the column did not appear.
Then we just restarted the Cassandra cluster, and the column appeared.
UPDATE 3:
We tried to disable the Cassandra row cache by setting the row_cache_size_in_mb property to 0 (before, it was 2 GB):
row_cache_size_in_mb: 0
After that, the problem was gone.
So the problem may be in OHCProvider, which is used as the default cache provider.

Spring JDBC Template batchUpdate to update thousands of records in a table

I have an update query which I am trying to execute through the batchUpdate method of the Spring JDBC template. This update query can potentially match thousands of rows in the EVENT_DYNAMIC_ATTRIBUTE table which need to be updated. Will updating thousands of rows in a table cause any issue in the production database apart from a timeout? For example, will it crash the database or slow down the entire database engine for other connections, etc.?
Is there a better way to achieve this than firing a single update query through the Spring JDBC template or JPA? I have the following settings for the JDBC template:
this.jdbc = new JdbcTemplate(ds);
jdbc.setFetchSize(1000);
jdbc.setQueryTimeout(0); // zero means there is no limit
The update query:
UPDATE EVENT_DYNAMIC_ATTRIBUTE eda
SET eda.ATTRIBUTE_VALUE = 'claim',
eda.LAST_UPDATED_DATE = SYSDATE,
eda.LAST_UPDATED_BY = 'superUsers'
WHERE eda.DYNAMIC_ATTRIBUTE_NAME_ID = 4002
AND eda.EVENT_ID IN
(WITH category_data
AS ( SELECT c.CATEGORY_ID
FROM CATEGORY c
START WITH CATEGORY_ID = 495984
CONNECT BY PARENT_ID = PRIOR CATEGORY_ID)
SELECT event_id
FROM event e
WHERE EXISTS
(SELECT 't'
FROM category_data cd
WHERE cd.CATEGORY_ID = e.PRIMARY_CATEGORY_ID))
If it is a one-time thing, I normally first select the records which need to be updated and put them into a temporary table or a CSV, making sure that I save the primary key of those records in that table or CSV. Then I read the records in batches from the temporary table or CSV and do the update in the target table using the primary key. This way the table is not locked for a long time, each batch contains a fixed set of records that need updating, and the updates are done by primary key, so they are very fast. If any update fails, you know which records failed by logging the failed records' primary keys to a log file or an error table. I have followed this approach many times for updating millions of records in the PROD database, as it is a very safe approach. A sketch of the batched update follows.
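A rough sketch of that batched-by-primary-key approach with JdbcTemplate.batchUpdate; the eventIds list would come from the prior SELECT (or the temporary table / CSV), and the batch size is an assumption:
import org.springframework.jdbc.core.JdbcTemplate;

import java.util.List;
import java.util.stream.Collectors;

public class BatchedAttributeUpdate {
    private static final int BATCH_SIZE = 1000;

    public void updateInBatches(JdbcTemplate jdbc, List<Long> eventIds) {
        String sql = "UPDATE EVENT_DYNAMIC_ATTRIBUTE "
                   + "SET ATTRIBUTE_VALUE = 'claim', LAST_UPDATED_DATE = SYSDATE, "
                   + "    LAST_UPDATED_BY = 'superUsers' "
                   + "WHERE DYNAMIC_ATTRIBUTE_NAME_ID = 4002 AND EVENT_ID = ?";
        for (int i = 0; i < eventIds.size(); i += BATCH_SIZE) {
            List<Long> chunk = eventIds.subList(i, Math.min(i + BATCH_SIZE, eventIds.size()));
            List<Object[]> params = chunk.stream()
                    .map(id -> new Object[] { id })
                    .collect(Collectors.toList());
            // Each batch commits a bounded amount of work, so locks are held briefly
            // and a failure can be retried starting from the failed chunk's IDs.
            jdbc.batchUpdate(sql, params);
        }
    }
}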

JDBC - PostgreSQL - batch insert + unique index

I have a table with unique constraint on some field. I need to insert a large number of records in this table. To make it faster I'm using batch update with JDBC (driver version is 8.3-603).
Is there a way to do the following:
on every batch execution, write into the table all the records from the batch that don't violate the unique index;
on every batch execution, receive the records from the batch that were not inserted into the DB, so I can save the "wrong" records?
The most efficient way of doing this would be something like this:
create a staging table with the same structure as the target table but without the unique constraint
batch insert all rows into that staging table. The most efficient way is to use COPY or the CopyManager (although I don't know whether that is already supported in your ancient driver version; see the sketch at the end of this answer).
Once that is done you copy the valid rows into the target table:
insert into target_table(id, col_1, col_2)
select id, col_1, col_2
from staging_table
where not exists (select *
from target_table
where target_table.id = staging_table.id);
Note that the above is not concurrency safe! If other processes do the same thing you might still get unique key violations. To prevent that you need to lock the target table.
If you want to remove the copied rows, you could do that using a writeable CTE:
with inserted as (
insert into target_table(id, col_1, col_2)
select id, col_1, col_2
from staging_table
where not exists (select *
from target_table
where target_table.id = staging_table.id)
returning id
)
delete from staging_table
where id in (select id from inserted);
A (non-unique) index on the staging_table.id should help for the performance.
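If you can use a current driver, here is a rough sketch of loading the staging table with the CopyManager; the connection details, table, column and file names are placeholders:
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

public class StagingLoad {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader csv = new FileReader("rows.csv")) {
            CopyManager copy = con.unwrap(PGConnection.class).getCopyAPI();
            // COPY streams the whole file in one round trip, much faster than row-by-row INSERTs.
            long rows = copy.copyIn(
                "COPY staging_table (id, col_1, col_2) FROM STDIN WITH (FORMAT csv)", csv);
            System.out.println("Loaded " + rows + " rows into staging_table");
        }
    }
}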

Mysql Duplicate entry 'xxxxxxxx' for key(unique) 'xxxxxxxxxx'

I have a problem updating a row. I have a column called serialNum defined as varchar(50) not null unique default null.
When I get the response data from the partner company, I update the row according to the unique serial_num (our company's serial number).
Sometimes the update fails because of:
Duplicate entry 'xxxxxxxx' for key 'serialNum'
But the value to update does not exist when I search the whole table. It happens sometimes, not always, about 10 times out of 300.
Why does this happen and how can I solve it?
Below is the query I use to update:
String updateQuery = "update phone set serialNum=?, Order_state=?, Balance=? where Serial_num=?";
PreparedStatement presta = con.prepareStatement(updateQuery);
presta.setString(1, resultSet.get("oid_goodsorder"));
presta.setString(2, "order success");
presta.setFloat(3, Float.valueOf(resultSet.get("leftmoney")));
presta.setString(4, resultSet.get("jno_cli")); // bind the key instead of concatenating it into the SQL
presta.executeUpdate();
I think the reason is in resultSet.get("oid_goodsorder"). Where did you get this result? Is 'oid_goodsorder' unique? Do you always update the whole table?
Even if oid_goodsorder is unique, it is possible to hit duplicates in serialNum, because you don't use a bulk update; instead you update every record separately, so the following is possible:
Before:
serialNum=11,22,33,44
oid_goodsorder=44,11,22,33
It tries to update the first serialNum to 44, but 44 already exists!
But once all the updates have finished, serialNum will be unique again...
If you want to capture the error rows, you could temporarily drop the unique constraint on serialNum and check the table for duplicate serialNum values.
If you don't have duplicate values, try to use a bulk update (see the sketch below).
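A rough sketch of a fully parameterized, batched version of the update; the column names follow the question, and "rows" stands for the partner responses already read into memory:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.Map;

public class BatchedPhoneUpdate {
    public void update(Connection con, List<Map<String, String>> rows) throws Exception {
        String sql = "UPDATE phone SET serialNum = ?, Order_state = ?, Balance = ? "
                   + "WHERE Serial_num = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (Map<String, String> row : rows) {
                ps.setString(1, row.get("oid_goodsorder"));
                ps.setString(2, "order success");
                ps.setFloat(3, Float.parseFloat(row.get("leftmoney")));
                ps.setString(4, row.get("jno_cli")); // bound parameter, not concatenated SQL
                ps.addBatch();
            }
            // One round trip; rows are still applied in statement order, so transient
            // uniqueness clashes between rows remain possible without further checks.
            ps.executeBatch();
        }
    }
}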