We have a source table which is updated from various external systems. I require the destination table (on a different server) to be in sync with this source table. The destination table is not an exact replica of the source table; some data processing has to be done before the data is inserted/updated into the destination table.
I have thought of the following logic:
Every 15 minutes we run a Java consumer which fetches the records whose timestamp is greater than that of the previous update, stores them in a CachedRowSet, and calls a stored procedure with the CachedRowSet as a parameter; the stored procedure does the data processing and inserts/updates the data into the destination table.
Do you believe the above is an efficient way, given that we are dealing with over a million records on every update?
Also, when a record is deleted in the source table it would not be replicated by the above method! Can you suggest what to do in such a scenario?
Something similar to the technique used by databases for savepoints and rollback.
Whenever there is some change in the source table (any CRUD operation), keep a script of the change in the format required by the target table. Periodically you can push those changes to the target server. As your source table is updated by various external systems, you'll need a trigger on your source table to keep the script log.
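A minimal sketch of the consumer side of that idea, assuming the trigger writes into a hypothetical change_log(id, change_sql, changed_at) table whose scripts have already been rewritten for the target schema:

```java
import java.sql.*;

public class ChangeLogApplier {
    // change_log(id, change_sql, changed_at) is a made-up table that a trigger
    // on the source would populate with one ready-to-run statement per change.
    public static void apply(Connection source, Connection target, Timestamp lastSync)
            throws SQLException {
        String pick = "SELECT id, change_sql FROM change_log " +
                      "WHERE changed_at > ? ORDER BY id";
        try (PreparedStatement ps = source.prepareStatement(pick)) {
            ps.setTimestamp(1, lastSync);
            target.setAutoCommit(false);
            try (ResultSet rs = ps.executeQuery();
                 Statement run = target.createStatement()) {
                while (rs.next()) {
                    run.executeUpdate(rs.getString("change_sql")); // replay in original order
                }
            }
            target.commit();
        }
    }
}
```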
You might want to check out mk-table-sync from Maatkit tools:
http://www.maatkit.org/doc/mk-table-sync.html
You'd need to be careful around your table differences.
Here are some existing solutions:
https://www.symmetricds.org/
http://opensource.replicator.daffodilsw.com/
I'll be direct about my situation right now. I'm working on a project which performs a "base load" procedure based on an Excel (xlsx, xls) file. It has been developed in Java with JDBC drivers. Right now this project is working: it takes an Excel file and, based on a configuration, it performs the inserts into different tables. The point is: it's taking too long doing the job, which makes it inefficient (it takes around 2 hours to insert 3000 records into the DB). In the future, this software will be inserting around 30k records and it will be painfully slow. So I need to improve its efficiency, and I was thinking: instead of inserting from Java via JDBC drivers, I will generate control files and data files to be loaded into the DB using SQL*Loader.
The point I'm facing right now: I need to insert this data into several tables, and these tables are related to each other. That means, if I insert a person into "Person_table", I will need the primary key generated by a database sequence to insert the "Address, Phone, email, etc." into other tables, and I do not know how to get the primary keys generated in the first insert via SQL*Loader.
I'm not sure yet if SQL*Loader is the best way to do this, but I guess it is, because the DBMS is Oracle.
Can you point me in the right direction on how to do what I described? Any suggestion is welcome and well received. It does not matter if your suggestions are not about how to do this with SQL*Loader.
I'm kind of stuck at this point right now, and I really appreciate any help you can give me.
SQL*Loader can't read native Excel files (at least, as far as I know). Therefore, you'll have to save the result as a CSV file.
As you need to manipulate foreign key constraints, consider switching to the external tables feature - basically, the background is still SQL*Loader, but you can write (PL/)SQL against those files/tables (yes - a CSV file, stored on a hard disk, acts as if it were an Oracle table).
So, you'd "load" one table, populate the primary key values, then populate another (child) table - possibly into a "temporary" table (not necessarily a global temporary table) which doesn't have any constraints enabled, populate the foreign key values, and move the data into the "real" target table whose constraints now won't fail.
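For example (everything here is a sketch with made-up names: an external table EXT_PERSON over the CSV, PERSON/ADDRESS target tables, and PERSON_SEQ/ADDRESS_SEQ sequences), the load could be driven from your existing Java code:

```java
import java.sql.*;

public class ExternalTableLoad {
    // EXT_PERSON is the external table over the CSV; PERSON, ADDRESS, PERSON_SEQ
    // and ADDRESS_SEQ are placeholders for your real objects.
    public static void load(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            // 1) Load the parent rows straight from the external table,
            //    taking primary keys from the sequence.
            st.executeUpdate(
                "INSERT INTO person (person_id, first_name, last_name, national_id) " +
                "SELECT person_seq.NEXTVAL, e.first_name, e.last_name, e.national_id " +
                "FROM ext_person e");
            // 2) Load the child rows by joining the external table back to PERSON on a
            //    business key (here: national_id) to pick up the generated primary keys.
            st.executeUpdate(
                "INSERT INTO address (address_id, person_id, street, city) " +
                "SELECT address_seq.NEXTVAL, p.person_id, e.street, e.city " +
                "FROM ext_person e JOIN person p ON p.national_id = e.national_id");
        }
        conn.commit();
    }
}
```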
Possible drawback: the CSV files have to reside in a directory that is accessible to the database server, as you'll have to create a directory (an Oracle object) and grant the required privileges (usually read, write) to the user who will be using it. The directory is usually created on the server itself; if not, you'll have to use a UNC path while creating it.
Now you have something to read about/research; see if it makes sense to you.
I need a little help here because I'm struggling a little bit to find the best solution for my problem. I googled and haven't found any enlightening answer.
So, first of all, I'll explain the idea.
1 - I've got a Java application that inserts data into my database (Oracle DB) using JDBC.
2 - My database is logically split in two: one part contains tables with exported information (from another application) and the other part contains tables that represent some reports.
3 - My Java app only inserts information into the export tables.
4 - I've developed some packages that transform the data from the export tables into the report tables (generating some reports).
5 - These packages are scheduled to execute 2 or 3 times a day.
So, my problem is that when the transformation task starts, I want to prevent new DML operations. Then, when the transformation stops, all new data that was supposed to be inserted/updated during that time shall be inserted into the export tables.
I thought of two approaches:
1 - During the transformation, divert the DML operations to a temporary table.
2 - Lock the tables, but I don't have much experience with this. My main question is: can I force DML operations in JDBC to wait until the lock is released? I haven't tried it yet, but I've read here and there that after some time a lock wait timeout exception (or something like that) is thrown.
Can anyone more experienced give me some advice?
If you have any doubts about what I'm trying to do, just ask.
Do not try locking tables as a solution. Sadly, that is common but rarely necessary. Just a few ideas:
At the start of the transformation, select the data from the export table into a global temporary table, then execute your transformation packages on that temp table (see the sketch after this list).
Create a materialized view over the export table. Investigate the option to refresh on commit, but it seems you'd want to refresh it just before your transformation.
Analyze your exported data. If it is like many other cases, most of the data will never change once imported; only new data needs to be analyzed. To aid in processing, add a timestamp field called date_last_modified and a trigger on the table: when a row is updated, update date_last_modified. This allows you to choose the smallest possible data set of "only changed records".
You should also investigate using BULK COLLECT to optimize your cursor. This will allow you to get a group of records all at once, sort of a snapshot of the data at a point in time.
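A minimal sketch of the first idea, assuming a hypothetical global temporary table EXPORT_GTT with the same columns as the export table and a placeholder transformation procedure:

```java
import java.sql.*;

public class SnapshotAndTransform {
    // EXPORT_GTT is a hypothetical global temporary table with the same columns as
    // EXPORT_TABLE; REPORT_PKG.TRANSFORM_FROM_GTT is a placeholder for your package.
    public static void run(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            // Take a read-consistent snapshot; incoming DML on EXPORT_TABLE is never blocked.
            st.executeUpdate("INSERT INTO export_gtt SELECT * FROM export_table");
            // Run the transformation against the snapshot instead of the live table.
            try (CallableStatement cs = conn.prepareCall("{call report_pkg.transform_from_gtt}")) {
                cs.execute();
            }
        }
        conn.commit(); // a GTT defined ON COMMIT DELETE ROWS cleans itself up here
    }
}
```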
I believe you are overthinking this. If you fetch records one at a time, Oracle will give you the state of each record as of the last commit by any user. If you bulk collect a group of records, they go into memory and will, again, represent the state as of a point in time.
The best way to feel more comfortable about this is to set up a test case. Set up a cursor that sleeps during every processing cycle, open another session and change the data that is being processed, and see what happens...
I have some large data in one table and a small amount of data in another table. Is there any way to run a GoldenGate initial load so that the data that is the same in both tables won't be changed, and the rest of the data gets transferred from one table to the other?
Initial loads are typically for when you are setting up the replication environment; however, you can do this on single tables as well. Everything in the Oracle database is driven by System Change Numbers, which GoldenGate exposes as Commit Sequence Numbers (SCN/CSN).
By using the SCN/CSN, you can identify what the starting point in the table should be and start CDC from there. Anything prior to the SCN/CSN will not get captured and would require you to move that data manually in some fashion. That can be done by using Oracle Data Pump (Export/Import).
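For example, a quick way to grab the current SCN to use as that starting point (plain JDBC against the source database; not GoldenGate-specific):

```java
import java.sql.*;

public class CurrentScn {
    // Requires SELECT privilege on V$DATABASE. The returned SCN can serve as the
    // boundary between the initial load (data as of this SCN) and change data capture.
    public static long fetch(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT current_scn FROM v$database")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```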
Oracle GoldenGate also provides a parameter called SQLPREDICATE that allows you to use a "where" clause against a table. This is handy with initial load extracts because you would do something like TABLE <owner>.<table>, SQLPREDICATE "AS OF <scn>". The data as of that point would then be captured and moved to the target side for a Replicat to apply into the table. You can reference that here:
https://www.dbasolved.com/2018/05/loading-tables-with-oracle-goldengate-and-rest-apis/
Official Oracle Doc: https://docs.oracle.com/en/middleware/goldengate/core/19.1/admin/loading-data-file-replicat-ma-19.1.html
On the Replicat side, you would use HANDLECOLLISIONS to kick out any duplicates. Then once the load is complete, remove it from the parameter file.
Lots of details, but I'm sure this is a good starting point for you.
That would require programming in Java (a rough sketch follows below):
1) First you would read your source database.
2) Decide which data has to be added to which table on the basis of the data that was read.
3) Execute update/insert queries to submit the data to the tables.
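A rough JDBC sketch of those three steps, using made-up SRC_TABLE/DEST_TABLE names and an Oracle-flavored MERGE for the upsert:

```java
import java.sql.*;

public class ReadAndUpsert {
    // SRC_TABLE and DEST_TABLE are placeholders; the MERGE statement is Oracle-flavored.
    public static void sync(Connection src, Connection dest) throws SQLException {
        String merge =
            "MERGE INTO dest_table d USING (SELECT ? id, ? payload FROM dual) s " +
            "ON (d.id = s.id) " +
            "WHEN MATCHED THEN UPDATE SET d.payload = s.payload " +
            "WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)";
        try (Statement read = src.createStatement();
             ResultSet rs = read.executeQuery("SELECT id, payload FROM src_table");
             PreparedStatement up = dest.prepareStatement(merge)) {
            while (rs.next()) {
                // Step 2: any per-row decision or transformation would go here.
                up.setLong(1, rs.getLong("id"));
                up.setString(2, rs.getString("payload"));
                up.addBatch();           // step 3: submit as a batch, not row by row
            }
            up.executeBatch();
        }
    }
}
```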
If you want to run Initial Load using GoldenGate:
Target tables should be empty
Data: Make certain that the target tables are empty. Otherwise, there may be duplicate-row errors or conflicts between existing rows and rows that are being loaded. Link to Oracle Documentation
If they are not empty, you have to handle conflicts. For instance, if the row you are inserting already exists in the target table (INSERTROWEXISTS), you could discard it, if that's what you want to do. Link to Oracle Documentation
I have one table that records its rows' insert/update timestamps in a field.
I want to synchronize the data in this table with another table on another DB server. The two DB servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable.
My workflow:
I use a global last_sync_date parameter and query table Master for the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts
The complexity of the problem rises when dealing with deleted records of the Master table. To catch the deleted records I think I have to maintain a log table of the previously inserted records and use SQL NOT IN. This becomes a performance problem when dealing with large datasets.
What would be an alternative workflow dealing with this scenario?
It sounds like you need a transactional message queue.
How this works is simple: when you update the master DB you can send a message to the message broker (describing whatever the update was), which can go to any number of queues. Each slave DB can have its own queue, and because queues preserve order the process should eventually synchronize correctly (ironically this is sort of how most RDBMS do replication internally).
Think of the message queue as a sort of SCM change-list or patch-list database. That is, for the most part the same (or roughly the same) SQL statements sent to the master should eventually be replicated to the other databases. Don't worry about losing messages, as most message queues support durability and transactions.
I recommend you look at spring-amqp and/or spring-integration especially since you tagged this question with spring-batch.
Based on your comments:
See Spring Integration: http://static.springsource.org/spring-integration/reference/htmlsingle/ .
Google SEDA. Whether you go this route or not, you should know about message queues, as they go hand-in-hand with batch processing.
RabbitMQ has a good picture diagram of how messaging works
The contents of your message might be the entire row plus whether it's a CREATE, UPDATE, or DELETE. You can use whatever format you like (e.g. JSON; see Spring Integration for recommendations).
You could even send the direct SQL statements as a message!
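For illustration, publishing such a change message with the RabbitMQ Java client might look like this (the queue name and JSON shape are made up):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class ChangePublisher {
    public static void publish(String op, long id, String rowJson) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // broker host is an assumption
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            // Durable queue + persistent messages so changes survive a broker restart.
            channel.queueDeclare("slave-sync", true, false, false, null);
            String body = "{\"op\":\"" + op + "\",\"id\":" + id + ",\"row\":" + rowJson + "}";
            channel.basicPublish("", "slave-sync",
                    MessageProperties.PERSISTENT_TEXT_PLAIN, body.getBytes("UTF-8"));
        }
    }
}
```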
BTW, your concern about NOT IN being a performance problem is not a very strong one, as there are a plethora of workarounds; but given that you don't want to do DB-specific things (like triggers and replication), I still feel a message queue is your best option.
EDIT - Non MQ route
Since I gave you a tough time about asking this question, I will continue to try to help.
Besides the message queue, you can do some sort of XML file like you were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). If you go this route you will want to SELECT * WHERE CREATE_TIME < ?, with ? bound to a fixed cutoff time; basically you're only getting the rows as of a snapshot.
Now on the other database, for the deletes you're going to remove rows by joining against a table of the master's ids with an anti-join (that is, you can use JOINs - e.g. LEFT JOIN ... IS NULL, or NOT EXISTS - instead of a slow NOT IN). Luckily you only need all the ids for the deletes and not the other columns. For the other columns you can use a delta based on the update timestamp column (for updates, and create aka insert).
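A sketch of that delete, assuming the current master ids have already been bulk-loaded into a staging table master_ids(id) on the slave:

```java
import java.sql.*;

public class DeleteMissingRows {
    // Anti-join: delete every slave row whose id no longer appears in master_ids.
    public static int purge(Connection slave) throws SQLException {
        String sql =
            "DELETE FROM slave_table s " +
            "WHERE NOT EXISTS (SELECT 1 FROM master_ids m WHERE m.id = s.id)";
        try (Statement st = slave.createStatement()) {
            return st.executeUpdate(sql); // returns the number of rows removed
        }
    }
}
```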
I am not sure about the solution. But I hope these links may help you.
http://knowledgebase.apexsql.com/2007/09/how-to-synchronize-data-between.htm
http://www.codeproject.com/Tips/348386/Copy-Synchronize-Table-Data-between-databases
Have a look at Oracle GoldenGate:
Oracle GoldenGate is a comprehensive software package for enabling the replication of data in heterogeneous data environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems.
SymmetricDS:
SymmetricDS is open source software for multi-master database replication, filtered synchronization, or transformation across the network in a heterogeneous environment. It supports multiple subscribers with one direction or bi-directional asynchronous data replication.
Daffodil Replicator:
Daffodil Replicator is a Java tool for data synchronization, data migration, and data backup between various database servers.
Why don't you just add a TIMESTAMP column that indicates the last update/insert/delete time? Then add a deleted column - i.e. mark the row as deleted instead of actually deleting it immediately, and delete it only after having exported the delete action.
In case you cannot alter the schema used by an existing app:
Can't you use triggers at all? How about a second ("hidden") table that gets populated with every insert/update/delete and which would constitute the content of the next to-be-generated XML export file? That is a common concept: a history (or "log") table. It would have its own progressing id column which can be used as an export marker.
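A MySQL-flavored sketch of such a history table and the delete trigger that feeds it, created through JDBC (all names are made up):

```java
import java.sql.*;

public class HistoryTableSetup {
    public static void create(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(
                "CREATE TABLE master_history (" +
                "  hist_id BIGINT AUTO_INCREMENT PRIMARY KEY, " +  // progressing id = export marker
                "  op CHAR(1) NOT NULL, " +                         // 'I', 'U' or 'D'
                "  row_id BIGINT NOT NULL, " +
                "  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)");
            // One trigger per operation; the DELETE trigger is what catches removals.
            st.executeUpdate(
                "CREATE TRIGGER master_ad AFTER DELETE ON master_table " +
                "FOR EACH ROW INSERT INTO master_history (op, row_id) VALUES ('D', OLD.id)");
        }
    }
}
```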
Very interesting question.
In my case I had enough RAM to load all the ids from the master and slave tables and diff them.
If the ids in the master table are sequential, you may try to maintain a set of completely filled ranges in the master table (ranges with all ids used, without gaps, like 100, 101, 102, 103).
To find removed ids without loading all of them into memory, you may execute a SQL query that counts the number of records with id >= full_region.start and id <= full_region.end for each completely filled region. If the result of the query == (full_region.end - full_region.start) + 1, it means no record in the region has been deleted. Otherwise, split the region into two parts and do the same check for both of them (in a lot of cases only one side contains removed records).
Below some range length (about 5000, I think) it becomes faster to load all the present ids and check for the absent ones using a Set.
It also makes sense to load all the ids into memory for a batch of small (10-20 record) regions.
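A Java sketch of that recursive range check; master_table is a placeholder, and SMALL_RANGE is the threshold below which it is cheaper to just load the ids:

```java
import java.sql.*;
import java.util.*;

public class DeletedIdFinder {
    private static final long SMALL_RANGE = 5_000; // below this, just load the ids

    // Collects ids missing from master_table for a range known to have been fully filled.
    public static void findDeleted(Connection conn, long start, long end,
                                   Set<Long> deleted) throws SQLException {
        long expected = end - start + 1;
        if (expected <= SMALL_RANGE) {
            Set<Long> present = new HashSet<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id FROM master_table WHERE id BETWEEN ? AND ?")) {
                ps.setLong(1, start);
                ps.setLong(2, end);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) present.add(rs.getLong(1));
                }
            }
            for (long id = start; id <= end; id++) {
                if (!present.contains(id)) deleted.add(id);
            }
            return;
        }
        if (count(conn, start, end) == expected) return; // nothing deleted in this range
        long mid = start + (end - start) / 2;            // otherwise split and recurse
        findDeleted(conn, start, mid, deleted);
        findDeleted(conn, mid + 1, end, deleted);
    }

    private static long count(Connection conn, long start, long end) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM master_table WHERE id BETWEEN ? AND ?")) {
            ps.setLong(1, start);
            ps.setLong(2, end);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```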
Make a history table for the table that needs to be synchronized (basically a duplicate of that table, with a few extra fields perhaps) and insert the entire row every time something is inserted/updated/deleted in the active table.
Write a Spring Batch job to sync the data to the slave machine based on the history table's extra fields.
Hope this helps.
A potential option for allowing deletes within your current workflow:
If the trigger restriction is limited to triggers with references across databases, a possible solution within your current workflow would be to create a helper table in your Master database that stores only the unique identifiers of the deleted rows (or whatever unique key lets you delete the corresponding slave rows most efficiently).
Those ids would need to be inserted by a delete trigger on your master table.
Using the same mechanism as your inserts/updates, create a task that runs after your inserts and updates. You could export your helper table to XML, as you noted in your current workflow.
This task would simply delete the rows out of the slave table, then delete all data from your helper table once it completes. Log any errors from the task so that you can troubleshoot, since there is no audit trail.
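A sketch of that cleanup step on the slave, assuming the deleted ids have already been parsed out of the exported XML:

```java
import java.sql.*;
import java.util.List;

public class ApplyDeletes {
    public static void apply(Connection slave, List<Long> deletedIds) throws SQLException {
        slave.setAutoCommit(false);
        try (PreparedStatement ps =
                 slave.prepareStatement("DELETE FROM slave_table WHERE id = ?")) {
            for (Long id : deletedIds) {
                ps.setLong(1, id);
                ps.addBatch();          // batch the deletes instead of one round trip per row
            }
            ps.executeBatch();
        }
        slave.commit(); // only after this would you purge the helper table on the master
    }
}
```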
If your database has a transaction dump log, just ship that one.
It is possible with MySQL and should be possible with PostgreSQL.
I would agree with another comment - this requires the use of triggers. I think another table should hold the history of your SQL statements. See this answer about using SQL Server 2008 extended events... Then you can get the entire SQL statement and store the resulting query in the history table. It's up to you whether you want to store it as a MySQL query or an MSSQL query.
Here's my take: do you really need to deal with this? I assume the slave is for reporting purposes, so the question I would ask is how up to date it needs to be. Is it OK if the data is one day old? Do you plan a nightly refresh?
If so, forget about this online sync process: download the full tables, ship them to the MySQL server, and batch load them. Processing time might be a lot quicker than you think.
I made Java/JDBC code which performs simple/basic operations on a database.
I want to add code which helps me keep track of when a particular database was accessed, updated, modified, etc. by this program.
I am thinking of creating another database inside my DBMS where these details or logs will be stored for each database involved.
Is this the best way to do it? Are there any other (preferably simple) ways to do this?
EDIT:
For now, I am using MySQL. But I also want my code to work with at least Oracle SQL and MS-SQL as well.
It is pretty standard to add a "last_modified" column to a table and then add an update trigger on the table to set it to the DB's current time. Then your apps don't need to worry about it. Also, a "create_time" column is often used as well, populated by an insert trigger.
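Since you are on MySQL right now, note that MySQL (5.6+) can maintain such columns with plain column defaults, no trigger needed; on Oracle or MS-SQL you would use the triggers described above. A sketch via JDBC:

```java
import java.sql.*;

public class AuditColumns {
    public static void addTo(Connection conn, String table) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // MySQL keeps last_modified up to date by itself via ON UPDATE;
            // create_time is filled on insert.
            st.executeUpdate("ALTER TABLE " + table +
                " ADD COLUMN create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP," +
                " ADD COLUMN last_modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP" +
                " ON UPDATE CURRENT_TIMESTAMP");
        }
    }
}
```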
Update after comment:
Seems you are looking for audit logs. Some people write apps where data manipulation only happens through stored procedures and not through direct inserts and updates - a fixed API. So when you want to add an item to a table, you call the stored proc:
addItem(itemName, itemDescription)
Then the proc inserts into the item table and does whatever logging is necessary.
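Calling such a procedure from JDBC would look roughly like this (addItem here is the hypothetical proc from above):

```java
import java.sql.*;

public class ItemDao {
    public static void addItem(Connection conn, String name, String description)
            throws SQLException {
        // The procedure itself inserts into the item table and writes the audit log.
        try (CallableStatement cs = conn.prepareCall("{call addItem(?, ?)}")) {
            cs.setString(1, name);
            cs.setString(2, description);
            cs.execute();
        }
    }
}
```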
Another technique, if you are using some kind of framework for your JDBC access (say, Spring), might be to intercept at that layer.
In almost all tables, I have the following columns:
CreatedBy
CreatedAt
These columns have default values of the current user and current time, respectively. They are populated when a row is added.
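For example, an Oracle-flavored version of those defaults (the table and its other columns are made up):

```java
import java.sql.*;

public class DefaultAuditColumns {
    public static void create(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // USER and SYSTIMESTAMP are evaluated at insert time, so the
            // application never has to set these columns explicitly.
            st.executeUpdate(
                "CREATE TABLE orders (" +
                "  order_id   NUMBER PRIMARY KEY, " +
                "  amount     NUMBER, " +
                "  CreatedBy  VARCHAR2(30) DEFAULT USER, " +
                "  CreatedAt  TIMESTAMP    DEFAULT SYSTIMESTAMP)");
        }
    }
}
```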
This solves only part of your problem. You can start adding triggers, but that gets complicated. Another method is to force modification access to the database through stored procedures, and then do the logging in the stored procedures. This has other advantages in terms of controlling what users can do. But you might want more flexibility.
A third possibility is auditing tools, which keep track of all queries being run on the database. I think most databases have a way of turning on internal auditing, although these are very specific to the database. There are also third-party tools that let you see what has happened. Note, though, that these methods will affect performance if your database is handling high transaction volumes.
For more information, you should revise your question to specify which database you are using or planning on using.