My task is to update two databases daily. Simplified, an entry looks like this:
service_id; id_service_provider; valid_from; valid_to;
I get the data in the form of a CSV file. To give you some examples of how to interpret the lines of the file, here are some entries:
114; 20; 2011-12-06; 2017-10-16 //service terminated in 2017
211; 65; 2015-04-09; 9999-12-31 //service still valid
322; 57; 2019-08-22; 9999-12-31 //new service as of today
336; 20; 2009-08-20; 2019-07-11 //change provider, see next line
336; 37; 2019-07-11; 9999-12-31 //new provider for the above services
The files can contain several thousand entries, because new entries and changes are simply appended; I don't get a daily delta, I always get the whole file.
I only have full access to the first database, which contains all entries (both current and historical). For faster queries, the second database should contain only the currently valid services, not the terminated ones. For this second database, which I don't have access to, I have to create a file every day containing the commands to:
add new services
delete terminated services
update provider changes
My current approach looks like this (a sketch follows this list):
Create a List<Service> from the lines of the file.
Make a database query for each entry in the list:
If an identical service exists with no changes, remove that service from the list.
If the service exists but the end date or provider ID differs, terminate the existing service and simultaneously insert a new service valid as of today; additionally, add this service to a new list toUpdate for the second database.
If the service is not found, insert it into the first database and add it to a list toInsert.
Send the lists toInsert and toUpdate to the second database.
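A minimal Java sketch of that flow (the Service record, the ServiceDao interface, and the method names are illustrative, not my real code):

import java.time.LocalDate;
import java.util.*;

// ServiceDao stands in for the real Oracle access; all names here are illustrative.
interface ServiceDao {
    Optional<Service> findCurrent(long serviceId);
    void insert(Service s);
    void terminate(Service s, LocalDate endDate);
}

record Service(long serviceId, long providerId, LocalDate validFrom, LocalDate validTo) {}

class CurrentApproach {
    static void sync(List<Service> fromFile, ServiceDao oracle) {
        List<Service> toInsert = new ArrayList<>();
        List<Service> toUpdate = new ArrayList<>();
        for (Service s : fromFile) {
            Optional<Service> existing = oracle.findCurrent(s.serviceId()); // one query per CSV line
            if (existing.isEmpty()) {
                oracle.insert(s);                                  // brand-new service
                toInsert.add(s);
            } else if (!existing.get().equals(s)) {
                oracle.terminate(existing.get(), LocalDate.now()); // close the old record
                oracle.insert(s);                                  // open the new one, valid from today
                toUpdate.add(s);
            }                                                      // identical rows: nothing to do
        }
        // toInsert and toUpdate are then written to the command .csv for the DB2 side
    }
}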
Since my datasets in the databases are constantly diverging, I want to rethink my approach and reimplement the whole thing. How would you proceed with this task?
Edit
The database I have access to is Oracle; the second one is DB2. I can't use database features that keep the data synchronized. I am limited to creating a CSV file with Java to keep the second database in sync.
For this kind of thing, I like to keep a separate table of what I think the remote database looks like. That way, I can:
Generate deltas easily by comparing my source data with my copy of what should be in the remote database.
Correct errors in PROD by updating the copy to force the process to resend (e.g., if the team managing the other database misses a file or something).
Here is a working example to illustrate the process.
Cast of characters:
SO_SERVICES --> your source table
SO_SERVICES_EXPORTED --> a copy of what the remote database should currently look like, if they've processed all our command .csv files correctly.
SO_SERVICES_EXPORT_CMDS --> the set of deltas generated by comparing SO_SERVICES and SO_SERVICES_EXPORTED. You would generate your .csv file from this table and then delete from it (see the JDBC sketch after this list).
SYNC_SERVICES --> a procedure to generate the deltas
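For completeness, the .csv generation and cleanup step could look like the following JDBC sketch (the connection string, file name, and semicolon delimiter are assumptions, not part of the example below):

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.*;

public class ExportCmds {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//host:1521/ORCL", "user", "password");
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("service_cmds.csv")))) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT cmd, service_id, id_service_provider, valid_from, valid_to"
                   + " FROM so_services_export_cmds ORDER BY service_id")) {
                while (rs.next()) {
                    out.printf("%s;%d;%s;%s;%s%n",
                        rs.getString("cmd"), rs.getLong("service_id"),
                        rs.getString("id_service_provider"), // null for DELETE rows
                        rs.getDate("valid_from"), rs.getDate("valid_to"));
                }
            }
            // Clear the command table only after the file has been written.
            try (Statement st = con.createStatement()) {
                st.executeUpdate("DELETE FROM so_services_export_cmds");
            }
            con.commit();
        }
    }
}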
Setup Tables
CREATE TABLE so_services
( service_id NUMBER NOT NULL,
id_service_provider NUMBER NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE DEFAULT DATE '9999-12-31' NOT NULL,
CONSTRAINT so_services_pk PRIMARY KEY ( service_id, id_service_provider ),
CONSTRAINT so_services_c1 CHECK ( valid_from <= valid_to ) );
CREATE TABLE so_services_exported
( service_id NUMBER NOT NULL,
id_service_provider NUMBER NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE DEFAULT DATE '9999-12-31' NOT NULL,
CONSTRAINT so_services_exported_pk PRIMARY KEY ( service_id ),
CONSTRAINT so_services_exported_c1 CHECK ( valid_from <= valid_to ) );
CREATE TABLE so_services_export_cmds
( service_id NUMBER NOT NULL,
id_service_provider NUMBER,
cmd VARCHAR2(30) NOT NULL,
valid_from DATE,
valid_to DATE,
CONSTRAINT so_services_export_cmds_pk PRIMARY KEY ( service_id, cmd ) );
Procedure to process synchronization
-- You would put this in a package, for real code
CREATE OR REPLACE PROCEDURE sync_services IS
BEGIN
LOCK TABLE so_services IN EXCLUSIVE MODE;
-- Note the deltas between the current active services and what we've exported so far
-- CAVEAT: I am not sweating your exact business logic here. I am just trying to illustrate the approach.
-- The logic here assumes that the target database wants only one row for each service_id, so we will send an
-- "ADD" if the target database should insert a new service ID, "UPDATE", if it should modify an existing service ID,
-- or "DELETE" if it should delete it.
-- Also assuming that, for the "DELETE" command, we only need the service_id and no other fields.
INSERT INTO so_services_export_cmds
( service_id, id_service_provider, cmd, valid_from, valid_to )
SELECT nvl(so.service_id, soe.service_id) service_id,
so.id_service_provider id_service_provider,
CASE WHEN so.service_id IS NOT NULL AND soe.service_id IS NULL THEN 'ADD'
WHEN so.service_id IS NULL AND soe.service_id IS NOT NULL THEN 'DELETE'
WHEN so.service_id IS NOT NULL AND soe.service_id IS NOT NULL THEN 'UPDATE'
ELSE NULL -- this will fail and should.
END cmd,
so.valid_from valid_from,
so.valid_to valid_to
FROM ( SELECT * FROM so_services WHERE SYSDATE BETWEEN valid_from AND valid_to ) so
FULL OUTER JOIN so_services_exported soe ON soe.service_id = so.service_id
-- Exclude any UPDATES that don't change anything
WHERE NOT ( soe.service_id IS NOT NULL
AND so.service_id IS NOT NULL
AND so.id_service_provider = soe.id_service_provider
AND so.valid_from = soe.valid_from
AND so.valid_to = soe.valid_to);
-- Update the snapshot of what the remote database should now look like after processing the above commands.
-- (i.e., it should have all the current records from the service table)
DELETE FROM so_services_exported;
INSERT INTO so_services_exported
( service_id, id_service_provider, valid_from, valid_to )
SELECT service_id, id_service_provider, valid_from, valid_to
FROM so_services so
WHERE SYSDATE BETWEEN so.valid_from AND so.valid_to;
-- For testing (12c only)
DECLARE
c SYS_REFCURSOR;
BEGIN
OPEN c FOR SELECT * FROM so_services_export_cmds ORDER BY service_id;
DBMS_SQL.RETURN_RESULT(c);
END;
COMMIT; -- Release exclusive lock on services table
END sync_services;
Insert Test Data from OP
DELETE FROM so_services;
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 114, 20, DATE '2011-12-06', DATE '2017-10-16' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 211, 65, DATE '2015-04-09', DATE '9999-12-31' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 322, 57, DATE '2019-08-22', DATE '9999-12-31' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 336, 20, DATE '2009-08-20', DATE '2019-07-11' );
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 336, 37, DATE '2019-07-11', DATE '9999-12-31' );
Test #1 -- Nothing exported yet, so all latest records should be sent
exec sync_services;
SERVICE_ID ID_SERVICE_PROVIDER CMD VALID_FRO VALID_TO
---------- ------------------- ------------------------------ --------- ---------
211 65 ADD 09-APR-15 31-DEC-99
322 57 ADD 22-AUG-19 31-DEC-99
336 37 ADD 11-JUL-19 31-DEC-99
Test #2 -- no additional updates, no additional commands
DELETE FROM so_services_export_cmds; -- You would do this after generating your .csv file
exec sync_services;
no rows selected
Test #3 - Add some changes to the source table
-- Add a new service #400
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 400, 20, DATE '2019-08-29', DATE '9999-12-31' );
-- Terminate service 322
UPDATE so_services
SET valid_to = DATE '2019-08-29'
WHERE service_id = 322
AND valid_to = DATE '9999-12-31';
-- Update service 336
UPDATE so_services
SET valid_to = DATE '2019-08-29'
WHERE service_id = 336
AND id_service_provider = 37
AND valid_to = DATE '9999-12-31';
INSERT INTO so_services ( service_id, id_service_provider, valid_from, valid_to )
VALUES ( 336, 88, DATE '2019-08-29', DATE '9999-12-31' );
exec sync_services;
SERVICE_ID ID_SERVICE_PROVIDER CMD VALID_FRO VALID_TO
---------- ------------------- ------------------------------ --------- ---------
322 DELETE
336 88 UPDATE 29-AUG-19 31-DEC-99
400 20 ADD 29-AUG-19 31-DEC-99
Since you have full access to the Oracle DB, you could do this:
Add two new columns: Last_Updated_Time and Flag.
Last_Updated_Time should contain the date on which the row was inserted or updated. You can create a trigger on the table to populate this column; no other modification is needed.
The second column, Flag, can contain various values depending on the business scenario and can also be populated by the trigger. For example: 1 when a service ID is first created, 2 when a service is terminated, 3 for the terminated row of a provider change, 4 for the new-provider row, and so on.
The Oracle query that fetches the data should add the condition AND Last_Updated_Time > SYSDATE - 1 at the end of the reporting query; this fetches only the data changed since the previous run.
As-is values in the Oracle DB:
114; 20; 2011-12-06; 2017-10-16 //service terminated in 2017
211; 65; 2015-04-09; 9999-12-31 //service still valid
322; 57; 2019-08-22; 9999-12-31 //new service as of today
336; 20; 2009-08-20; 2019-07-11 //change provider, see next line
336; 37; 2019-07-11; 9999-12-31 //new provider for the above services
Updated (for existing records you can backfill the last update date with Valid_To for terminated records and Valid_From for the rest):
114; 20; 2011-12-06; 2017-10-16; 2017-10-17; 2 //service terminated in 2017; last update date is old
211; 65; 2015-04-09; 9999-12-31; 2015-04-09; 1 //service still valid; last update date is old
322; 57; 2019-08-22; 9999-12-31; 2019-08-28; 1 //new service as of today; last update date is recent
336; 20; 2009-08-20; 2019-07-11; 2019-08-28; 3 //provider change, see next line; assumed: updated today
336; 37; 2019-07-11; 9999-12-31; 2019-08-28; 4 //new provider for the above service; assumed: updated today
Now you can run two separate queries, one to build the list of new records and one for the records to be updated, and generate the CSVs accordingly (e.g., records with flag 1 or 4 go to the toInsert list, records with flag 2 or 3 go to the toUpdate list); see the JDBC sketch below.
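A minimal JDBC sketch of those two extracts (the table name services, the connection details, and the file names are assumptions):

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.*;

public class DailyExtract {
    public static void main(String[] args) throws Exception {
        // Connection details, the table name "services" and the file names are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:1521/ORCL", "user", "password")) {
            dump(con, "toInsert.csv", "1,4"); // new service or new-provider row
            dump(con, "toUpdate.csv", "2,3"); // terminated or provider-change row
        }
    }

    static void dump(Connection con, String file, String flags) throws Exception {
        String sql = "SELECT service_id, id_service_provider, valid_from, valid_to"
                   + " FROM services WHERE flag IN (" + flags + ")"
                   + " AND last_updated_time > SYSDATE - 1";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(file)))) {
            while (rs.next()) {
                out.printf("%d; %d; %s; %s%n", rs.getLong(1), rs.getLong(2),
                           rs.getDate(3), rs.getDate(4));
            }
        }
    }
}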
tl;dr:
Add two columns to the Oracle table to track the last update date and a record status flag, then use these values to create two CSV files daily containing the previous day's inserted/updated data.
This problem can be solved in multiple ways, as others in the thread have already answered. I'm assuming this is a work-related problem you are solving, which means the solution has to be reliable, available, and fault-tolerant. I don't see many constraints on processing time (does the entire run have to finish within, say, 30 minutes? this is indirectly a latency requirement), throughput (you have a few thousand records today; can that grow, and by how much? could it ever reach unmanageable proportions?), or security (who can access the data, how will they access it, etc.).
Based on the above assumptions, we can solve this in different ways. I'm presenting 3 of them here.
Approach1
Create a partitioned master Oracle table (MASTER_SERVICES_TABLE). Its definition contains all the columns from the CSV plus any additional columns needed (created/modified date fields). Partitioning can be determined by your retention requirements; in both cases below, the partition key is derived from the created column.
Is at most one year of retention good enough? Then use the DAY_OF_YEAR number as the partition key.
Is multi-year retention expected? Then use a full date (DD-MM-YYYY) as the partition key.
Use Oracle's SQL*Loader (sqlldr) command-line tool to load the data into a temporary table on a daily basis. After a successful load, execute a partition exchange between the temporary table and the current date's partition.
Create another table (SERVICE_TABLE) that contains all the columns from the incoming file plus a few extra columns (primary key, status, service_expired_on, created, modified, etc.).
Have one or more cron jobs depending on system load and throughput requirements. If the load (number of records) is small (a few thousand), a single cron job is enough; a higher load calls for more. If you opt for the multi-cron-job model, it's better to have a two-step process (see the sketch after this list):
A master cron job wakes up once a day and creates as many slave jobs as needed, based on system capacity. For example, if each slave should process only 100k records and there are 1 million records, the master creates 1,000,000 / 100,000 = 10 slave jobs.
Slave jobs can be scheduled in two ways:
Wake up more often (every hour, or even more frequently, based on system throughput).
Have the master spawn the slave jobs once it has finished its own work.
The slave cron jobs contain the business logic: a new service onboarded, a service decommissioned, a new service provider started, etc. This part must be covered by unit tests to document the expected behavior. The end result of a slave job is to update SERVICE_TABLE, which holds exactly one working copy of all services (historical/active/decommissioned, or whatever the business needs).
Slave cron jobs keep updating their status in an Oracle table.
Another cron job (the active-service generator), outside of the master/slave jobs, is triggered by the last exiting slave.
This generator reads the data from SERVICE_TABLE and dumps it into a file in a predefined format (CSV/JSON/TSV/PSV etc.), if you really want a file-based approach for the second database; alternatively, it can update the second database directly.
If the generated file is huge, loading it into the secondary database can be done in parallel (depending on the capabilities of that DB).
Cron jobs on traditional UNIX systems are not reliable; it's better to use a Chronos/Mesos cluster to run them.
Have monitoring/alerting on the above jobs.
MASTER_SERVICES_TABLE acts as the source of truth in case of a discrepancy.
Have archiving/cleanup implemented on all the tables involved.
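A rough in-process sketch of the master/slave split, using a thread pool in place of real cron jobs (the batch size and the stubbed methods are assumptions):

import java.util.List;
import java.util.concurrent.*;

public class MasterJob {
    static final int BATCH_SIZE = 100_000; // assumption: records per slave

    public static void main(String[] args) throws Exception {
        List<String> records = loadTodaysPartition();   // stub for the staged rows
        // e.g. 1,000,000 records / 100,000 per slave = 10 slave jobs
        int slaves = Math.max(1, (records.size() + BATCH_SIZE - 1) / BATCH_SIZE);
        ExecutorService pool = Executors.newFixedThreadPool(slaves);
        for (int i = 0; i < slaves; i++) {
            List<String> slice = records.subList(i * BATCH_SIZE,
                    Math.min((i + 1) * BATCH_SIZE, records.size()));
            pool.submit(() -> processSlice(slice));     // the "slave job"
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static List<String> loadTodaysPartition() { return List.of(); } // stub

    static void processSlice(List<String> slice) {
        // business logic: update SERVICE_TABLE and record the slave's status
    }
}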
Approach2
Dump the file to HDFS on a daily basis, e.g. /projects/servicedata/DDMMYYYY/.
Use a Pig Latin script to read the file contents.
Write a UDF that handles merging provider changes etc., i.e. the business logic. Unit-test this UDF for all possible use cases.
Output the final result of the Pig Latin script to a file.
Write a program to read the generated file and load it into any database you want.
Alternatively, use an Oozie workflow to load the generated file into the database.
Approach3
Assuming this is just a personal project and we don't care about all the industry standards:
You can use a simple version of the pipes-and-filters architecture pattern.
A standard Java program (or any other language) reads the file and splits the records across a predefined number of threads/processes. Each record is hashed on some key (service_id): range 1 goes to thread/process 1, range 2 goes to thread/process 2, and so on (see the sketch after this list).
Each of these threads/processes depends on a library that contains your business logic. The library can implement state management using a state machine.
Each thread/process has access to the data sources it needs to write to.
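A rough sketch of the hashing/routing step (the worker count, the CSV parsing, and the business-logic library are stubbed assumptions):

import java.util.*;
import java.util.concurrent.*;

public class HashedPipeline {
    static final int WORKERS = 4; // assumption: tune to the machine

    public static void main(String[] args) throws Exception {
        // One single-threaded executor per worker = one ordered "pipe" per key range.
        ExecutorService[] workers = new ExecutorService[WORKERS];
        for (int i = 0; i < WORKERS; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
        }
        for (String line : readCsvLines()) {
            long serviceId = Long.parseLong(line.split(";")[0].trim());
            int worker = Math.floorMod(Long.hashCode(serviceId), WORKERS); // hash -> fixed worker
            workers[worker].submit(() -> applyBusinessLogic(line));        // the "filter"
        }
        for (ExecutorService w : workers) {
            w.shutdown();
            w.awaitTermination(10, TimeUnit.MINUTES);
        }
    }

    static List<String> readCsvLines() {                 // stub: read the daily file instead
        return List.of("336; 37; 2019-07-11; 9999-12-31");
    }

    static void applyBusinessLogic(String line) {        // stub: state machine / DB writes go here
    }
}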
Finally, apologies if this does not solve your exact problem; I haven't paid much attention to the Add/Delete/Update business logic because it can change case by case. My thinking is that if the framework/architecture is robust enough, the business logic can be swapped for whatever you need.
Solution 1
Assumptions
you don't care about the commit log
you don't have any history table maintained over the table
For Oracle, this operation will be performed when there is no load on the database.
From the way you are currently doing it, it seems there will be enough memory available on the DB servers to insert all the data in one go.
Solution
I would truncate the tables and then insert the data.
TRUNCATE/INSERT has many benefits over DELETE/UPDATE/INSERT. The biggest one is sequential writes.
I would generate multi-row SQL statements like the following:
Oracle
TRUNCATE TABLE MyTable;
INSERT ALL
  INTO MyTable (service_id, id_service_provider, valid_from, valid_to) VALUES (114, 20, DATE '2011-12-06', DATE '2017-10-16')
  INTO MyTable (service_id, id_service_provider, valid_from, valid_to) VALUES (211, 65, DATE '2015-04-09', DATE '9999-12-31')
  ...
SELECT 1 FROM DUAL;
DB2
-- TRUNCATE must be the first statement in the unit of work
TRUNCATE TABLE MyTable IMMEDIATE;
INSERT INTO MyTable (service_id, id_service_provider, valid_from, valid_to) VALUES
  (114, 20, '2011-12-06', '2017-10-16'),
  (211, 65, '2015-04-09', '9999-12-31'),
  ...
;
COMMIT;
For Oracle, I would generate the SQL statements for all the rows, since that database is a full replica.
For DB2, I would generate the SQL statements only for the rows whose end date is '9999-12-31'. A sketch of generating both scripts from the file follows.
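A sketch of generating both scripts from the parsed file (the Row record, the file names, and the sample row are placeholders, not part of the answer above):

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ScriptGenerator {
    record Row(long serviceId, long providerId, String validFrom, String validTo) {}

    public static void main(String[] args) throws Exception {
        List<Row> rows = List.of(new Row(211, 65, "2015-04-09", "9999-12-31")); // stand-in for the parsed CSV

        // Oracle script: full replica, every row.
        try (PrintWriter ora = new PrintWriter(Files.newBufferedWriter(Paths.get("oracle_load.sql")))) {
            ora.println("TRUNCATE TABLE MyTable;");
            ora.println("INSERT ALL");
            for (Row r : rows) {
                ora.printf("  INTO MyTable (service_id, id_service_provider, valid_from, valid_to)"
                         + " VALUES (%d, %d, DATE '%s', DATE '%s')%n",
                         r.serviceId(), r.providerId(), r.validFrom(), r.validTo());
            }
            ora.println("SELECT 1 FROM DUAL;");
        }

        // DB2 script: only services that are still valid.
        try (PrintWriter db2 = new PrintWriter(Files.newBufferedWriter(Paths.get("db2_load.sql")))) {
            db2.println("TRUNCATE TABLE MyTable IMMEDIATE;");
            for (Row r : rows) {
                if (!"9999-12-31".equals(r.validTo())) continue;
                db2.printf("INSERT INTO MyTable (service_id, id_service_provider, valid_from, valid_to)"
                         + " VALUES (%d, %d, '%s', '%s');%n",
                         r.serviceId(), r.providerId(), r.validFrom(), r.validTo());
            }
            db2.println("COMMIT;");
        }
    }
}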
Solution 2
Database 1
Assumptions
The data is extracted after day end (midnight); e.g., data extracted on 26 Aug does not contain any entries for 26 Aug.
There is no update performed on this table.
Solution:
I would create the delta myself with the help of a cursor (a stored watermark value) and generate the SQL statements only for the rows that come after that cursor.
I would maintain a single-value table holding the cursor. Its value could be an auto-incremented serial ID (if one exists) or the maximum date found in either fromDate or toDate, ignoring '9999-12-31'. This date is essentially the day before the data was collected (see the assumption above).
The value of the cursor can be maintained in two ways:
A trigger on every insert in the database.
Inserting it from the Java code after every insert.
For insertion: I would fetch this cursor from the database and then generate SQL statements for all the lines in the file that come after my cursor (see the Java sketch below), i.e.
(fromDate > max-date || toDate > max-date)
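A minimal Java sketch of that filter (the Row record and the way the cursor value is loaded are assumed to exist elsewhere):

import java.time.LocalDate;
import java.util.List;

public class DeltaFilter {
    // Illustrative row type for one CSV line.
    record Row(long serviceId, long providerId, LocalDate validFrom, LocalDate validTo) {}

    static final LocalDate OPEN_END = LocalDate.of(9999, 12, 31);

    // Keep only the lines that are newer than the stored cursor (max-date) value.
    static List<Row> newerThanCursor(List<Row> fileRows, LocalDate cursor) {
        return fileRows.stream()
                .filter(r -> r.validFrom().isAfter(cursor)
                          || (!OPEN_END.equals(r.validTo()) && r.validTo().isAfter(cursor)))
                .toList();
    }
}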
Database 2
I would write UPSERT queries for all the valid rows (rows with endDate '9999-12-31') and then delete from the table all rows that don't have endDate '9999-12-31', i.e.:
MERGE INTO MyTable AS mt
USING (VALUES
(114, 20, '2011-12-06', '2017-10-16'),
(211, 65, '2015-04-09', '9999-12-31')
...
) AS sh (service_id, id_service_provider, valid_from, valid_to)
ON (mt.service_id = sh.service_id)
WHEN MATCHED THEN
UPDATE SET
id_service_provider = sh.id_service_provider,
valid_from = sh.valid_from,
valid_to = sh.valid_to
WHEN NOT MATCHED THEN
INSERT (service_id, id_service_provider, valid_from, valid_to)
VALUES (sh.service_id, sh.id_service_provider, sh.valid_from, sh.valid_to);
Since my datasets in the databases are constantly diverging, I want to rethink my approach and reimplement the whole thing. How would you proceed with this task?
You didn't specify which databases you're using, but if you're open to changing that along with rethinking the approach, I would consider using whatever database replication mechanisms are available. If no replication feature is available, I would consider switching to a database that supports replication.
As you have found, keeping two databases in sync is complicated, and quite likely not what you want to spend your time doing.
Given the requirements and constraints you provided, here is the approach I would take to solve this problem:
Parse the original file and store the data in e.g. a List (not sure how big the file is; assume the server has enough memory to hold it).
Get the unique list of service IDs out of the List (assume service_id is a unique key; remember Oracle's limit of 1,000 items per IN list) and query Oracle for each service's current provider and from_date/to_date.
Compare the two lists (what's in the List and what's in Oracle) to determine the action for each service (e.g. new, deleted, SP changed, etc.).
Use a batch update to insert/update each service in Oracle.
Generate the CSV file for DB2 based on the action.
Consider using a lightweight JDBC framework like MyBatis, and the List stream() API when manipulating the lists. A sketch of steps 2 and 4 follows.
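A sketch of steps 2 and 4 under those assumptions (the Oracle table name services and the record type are placeholders; plain JDBC is shown instead of MyBatis for brevity):

import java.sql.*;
import java.time.LocalDate;
import java.util.*;

public class CompareAndBatch {
    // Illustrative value object; "services" below is an assumed table name.
    record Service(long serviceId, long providerId, LocalDate validFrom, LocalDate validTo) {}

    // Step 2: load the current state from Oracle in chunks of 1,000 IDs (IN-list limit).
    static Map<Long, Service> loadCurrent(Connection con, List<Long> ids) throws SQLException {
        Map<Long, Service> current = new HashMap<>();
        for (int i = 0; i < ids.size(); i += 1000) {
            List<Long> chunk = ids.subList(i, Math.min(i + 1000, ids.size()));
            String in = String.join(",", chunk.stream().map(String::valueOf).toList());
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT service_id, id_service_provider, valid_from, valid_to FROM services"
                   + " WHERE valid_to = DATE '9999-12-31' AND service_id IN (" + in + ")")) {
                while (rs.next()) {
                    current.put(rs.getLong(1), new Service(rs.getLong(1), rs.getLong(2),
                            rs.getDate(3).toLocalDate(), rs.getDate(4).toLocalDate()));
                }
            }
        }
        return current;
    }

    // Step 4: insert the "new" services with a single batched statement.
    static void batchInsert(Connection con, List<Service> toInsert) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO services (service_id, id_service_provider, valid_from, valid_to)"
              + " VALUES (?, ?, ?, ?)")) {
            for (Service s : toInsert) {
                ps.setLong(1, s.serviceId());
                ps.setLong(2, s.providerId());
                ps.setDate(3, java.sql.Date.valueOf(s.validFrom()));
                ps.setDate(4, java.sql.Date.valueOf(s.validTo()));
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip instead of one statement per row
        }
    }
}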
I am working on a personal project to help develop my SQL skills. The current problem I am having is trying to have my SQL database automatically populate the desired table, based on a certain column value, from 5 other tables.
Is it more efficient to do this on the backend like this, or to query the information from the frontend GUI that is accessing the database and output it into a table?
Just wondering if someone can point me in the right direction and not necessarily a solution, I want to figure this out on my own if possible.
This is basically an inventory reporting & tracking tool as of right now:
Database schema:
App
Source Tables for query:
Customer
Demo
Loaner
Training
Other
Target output Table from schema dbo:
Out
Table columns (all the same):
Serial
Model
Date
Category
Status
Skin
Fidelity
Responsibility
OutDate
The intended target value is any row within these source tables that contain the value "Out" within the Status column.
You want to select all rows from five tables with the same structure, where column Status has value 'Out'.
A UNION ALL query should do the trick:
SELECT * FROM Customer WHERE Status = 'Out'
UNION ALL
SELECT * FROM Demo WHERE Status = 'Out'
UNION ALL
SELECT * FROM Loaner WHERE Status = 'Out'
UNION ALL
SELECT * FROM Training WHERE Status = 'Out'
UNION ALL
SELECT * FROM Other WHERE Status = 'Out'
You could use something like:
insert into out (serial, model, date, category, status, skin, fidelity, responsibility, outdate)
select serial, model, date, category, status, skin, fidelity, responsibility, outdate
from customer
where status = 'Out'
union all
select serial, model, date, category, status, skin, fidelity, responsibility, outdate
from demo
where status = 'Out'
union all
select serial, model, date, category, status, skin, fidelity, responsibility, outdate
from loaner
where status = 'Out';
I have an eCommerce app with an Item entity. Whenever an item's end date/time is equal to the current time, the item's status should change (I also need to execute other SQL operations, such as inserting a row into a table).
Basically, I want to execute an SQL operation that checks the database and changes entities every minute.
I have a few ideas on how to implement this:
Schedule a job in my linux server that checks the db every minute
Use sp_executesql (Transact-SQL) or DBMS Scheduler
Have a thread running in my Java backend to check db and execute operations.
I am very new to this, so I don't have any idea how to implement it. What is the most efficient implementation, taking scalability and performance into account?
Other information: database is SQL Server, server is Linux, backend is Java Spring Boot.
If you need to run a script after an insert or update, you can consolidate all that complex logic (e.g. insert rows in other tables, update the status column, etc.) in a trigger:
Here's a sample table schema:
CREATE TABLE t1 (id INT IDENTITY(1,1), start_time DATETIME, end_time DATETIME,
status VARCHAR(25))
And a sample insert/update trigger for that table:
CREATE TRIGGER u_t1
ON t1
AFTER INSERT,UPDATE
AS
BEGIN
UPDATE t1
SET status = CASE WHEN inserted.end_time = inserted.start_time
THEN 'same' ELSE 'different' END
FROM t1
INNER JOIN inserted ON t1.id = inserted.id
-- do anything else you want!
-- e.g.
-- INSERT INTO t2 (id, status) SELECT id, status FROM inserted
END
GO
Insert a couple test records:
INSERT INTO t1 (start_time, end_time)
VALUES
(GETDATE(), GETDATE() - 1), -- different
(GETDATE(), GETDATE()) -- same
Query the table after the inserts:
SELECT * FROM t1
See that the status is calculated correctly:
id start_time end_time status
1 2018-07-17 02:53:24.577 2018-07-16 02:53:24.577 different
2 2018-07-17 02:53:24.577 2018-07-17 02:53:24.577 same
If your only goal is to update the status column based on other values in the table, then a computed column is the simplest approach; you just supply the formula:
create table t1 (id int identity(1,1), start_time datetime, end_time datetime,
status as
case
when start_time is null then 'start null'
when end_time is null then 'end null'
when start_time < end_time then 'start less'
when end_time < start_time then 'end less'
when start_time = end_time then 'same'
else 'what?'
end
)