Sync data bethen two JPA applications

Sync data bethen two JPA applications - java

I wrote an application that uses JPA (and hibernate as persistence provider).
It works on a database with several tables.
I need to create an "offline mode", where a copy of the programa, which acts as a client, allows the same functionality while keeping their data synchronized with the server when it is reachable.
The aim is to get a client that you can "detach" from the server, make changes on the data and then merge changes back. A bit like a revision control system.
It is not important to manage conflicts, in case the user will decide which version to keep.
My idea, but it can't work, was to assign to each row in the database the last edit timestamp. The client initially download a copy of the entire database and also records a second timestamp when it modify a row while non connected to the server. In this way, it knows what data has changed and the last timestamp where it is synchronized with the server. When you reconnect to the server, he will have to ask what are the data that have been changed since the last synchronization from the server and sends the data it has changed. (a bit simplified, but the management of conflicts should not be a big problem)
This, of course, does not work in case of deleting a row. If both the server or the client deletes a row they will not notice it and the other will never know.
The solution would be to maintain a table with the list of deleted rows, but it seems too expensive.
Does anyone know a method that works? there is already something similar?

Enver:
If you like to have a simple solution, you can create Version-Fields that acts like your "Timestamp".
Audit:
If you like to have a complex, powerfull solution, you should use the Hibernateplugin

Related

How to synchronize data within programs using a database?

I have a very small program that is going to be used by two or more computers at the same time. It's a small program that can add string to the list and remove it from the list. I save all these strings in a remote postgres database. When I delete string from the list, it gets deleted from the database however if the program is running on other computer you still can see this string. Currently I have only one option in mind which is refreshing data in program every x time ? Are there better options ? Database is very small, only one column and shouldnt be more than 100 rows.

You should never allow client programs to interact with a remote database directly. Exposing your database to the clients is a huge security problem. You should always have a server-program in between which communicates with the clients, validates their input, communicate with the database, and then tells the clients what they want to know.
This would also give you the ability to add push-updates to your network protocol (when one client makes a change, update the database and also inform the other client about the change).
But when you really want to take the risk and you consider a server-program too much complexity, you could add a timestamp to every row when it gets changed. That way you could refresh the clients at regular intervals by querying only for the rows which got changed since the last refresh.
Another option would be to allow the clients to communicate with each other in a peer-to-peer manner. When a client makes a change, it doesn't just notify the database, it also notifies the other clients via separate network connections. In order to do that, the clients need to know each others IP addresses or at least each others hostnames. When these aren't known, you could have the clients write their IPs to another table of the database when they connect, so the other client can query for it. Just make sure the entries get deleted, so you don't annoy every single IP address some client ever had.

Synchronizing table data across databases

I have one table that records its row insert/update timestamps on a field.
I want to synchronize data in this table with another table on another db server. Two db servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable
My workflow:
I use a global last_sync_date parameter and query table Master for
the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts
The complexity of the problem rises when dealing with deleted records of Master table. To catch the deleted records I think I have to maintain a log table for the previously inserted records and use sql "NOT IN". This becomes a performance problem when dealing with large datasets.
What would be an alternative workflow dealing with this scenario?

It sounds like you need a transactional message queue.
How this works is simple. When you update the master db you can send a message to the message broker (of whatever the update was) which can go to any number of queues. Each slave db can have its own queue and because queue's preserve order the process should eventually synchronize correctly (ironically this is sort of how most RDBMS do replication internally).
Think of the Message Queue as a sort of SCM change-list or patch-list database. That is for the most part the same (or roughly the same) SQL statements sent to master should be replicated to the other databases eventually. Don't worry about loosing messages as most message queues support durability and transactions.
I recommend you look at spring-amqp and/or spring-integration especially since you tagged this question with spring-batch.
Based on your comments:
See Spring Integration: http://static.springsource.org/spring-integration/reference/htmlsingle/ .
Google SEDA. Whether you go this route or not you should know about Message queues as it goes hand-in-hand with batch processing.
RabbitMQ has a good picture diagram of how messaging works
The contents of your message might be the entire row and whether its a CRUD, UPDATE, DELETE. You can use whatever format (e.g. JSON. See spring integration on recommendations).
You could even send the direct SQL statements as a message!
BTW your concern of NOT IN being a performance problem is not a very good one as there are a plethora of work-arounds but given your not wanting to do DB specific things (like triggers and replication) I still feel a message queue is your best option.
EDIT - Non MQ route
Since I gave you a tough time about asking this quesiton I will continue to try to help.
Besides the message queue you can do some sort of XML file like you we were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). Now if you go this route you will want to SELECT * WHERE CREATE_TIME < ? is less than the current time. Basically your only getting the rows at a snapshot.
Now on your other database for the delete your going to remove rows by inner joining on a ID table but with != (that is you can use JOINS instead of slow NOT IN). Luckily you only need all the ids for delete and not the other columns. The other columns you can use a delta based on the the update time stamp column (for update, and create aka insert).

I am not sure about the solution. But I hope these links may help you.
http://knowledgebase.apexsql.com/2007/09/how-to-synchronize-data-between.htm
http://www.codeproject.com/Tips/348386/Copy-Synchronize-Table-Data-between-databases

Have a look at Oracle GoldenGate:
Oracle GoldenGate is a comprehensive software package for enabling the
replication of data in heterogeneous data environments. The product
set enables high availability solutions, real-time data integration,
transactional change data capture, data replication, transformations,
and verification between operational and analytical enterprise
systems.
SymmetricDS:
SymmetricDS is open source software for multi-master database
replication, filtered synchronization, or transformation across the
network in a heterogeneous environment. It supports multiple
subscribers with one direction or bi-directional asynchronous data
replication.
Daffodil Replicator:
Daffodil Replicator is a Java tool for data synchronization, data
migration, and data backup between various database servers.

Why don't you just add a TIMESTAMP column that indicates the last update/insert/delete time? Then add a deleted column -- ie. mark the row as deleted instead of actually deleting it immediately. Delete it after having exported the delete action.
In case you cannot alter schema usage in an existing app:
Can't you use triggers at all? How about a second ("hidden") table that gets populated with every insert/update/delete and which would constitute the content of the next to be generated xml export file? That is a common concept: a history (or "log") table: it would have its own progressing id column which can be used as an export marker.

Very interesting question.
In may case I was having enough RAM to load all ids from master and slave tables to diff them.
If ids in master table are sequential you try to may maintain a set of full filled ranges in master table (ranges with all ids used, without blanks, like 100,101,102,103).
To find removed ids without loading all of them to the memory you may execute SQL query to count number of records with id >= full_region.start and id <= full_region.end for each full filled region. If result of query == (full_region.end - full_region.end) + 1 it means all record in region are not deleted. Otherwise - split region into 2 parts and do the same check for both of them (in a lot of cases only one side contains removed records).
After some length of range (about 5000 I think) it will faster to load all present ids and check for absent using Set.
Also there is a sense to load all ids to the memory for a batch of small (10-20 records) regions.

Make a history table for the table that needs to be synchronized (basically a duplicate of that table, with a few extra fields perhaps) and insert the entire row every time something is inserted/updated/deleted in the active table.
Write a Spring batch job to sync the data to Slave machine based on the history table's extra fields
hope this helps..

A potential option for allowing deletes within your current workflow:
In the case that the trigger restriction is limited to triggers with references across databases, a possible solution within your current workflow would be to create a helper table in your Master database to store only the unique identifiers of the deleted rows (or whatever unique key would enable you to most efficiently delete your deleted rows).
Those ids would need to be inserted by a trigger on your master table on delete.
Using the same mechanism as your insert/updates, create a task following your inserts and updates. You could export your helper table to xml, as you noted in your current workflow.
This task would simply delete the rows out of the slave table, then delete all data from your helper table following completion of the task. Log any errors from the task so that you can troubleshoot this since there is no audit trail.

If your database has a transaction dump log, just ship that one.
It is possible with MySQL and should be possible with PostgreSQL.

I would agree with another comment - this requires the usage of triggers. I think another table should hold the history of your sql statements. See this answer about using 2008 extended events... Then, you can get the entire sql, and store the result query in the history table. Its up to you if you want to store it as a mysql query or a mssql query.

Here's my take. Do you really need to deal with this? I assume that the slave is for reporting purposes. So the question I would ask is how up to date should it be? Is it ok if the data is one day old? Do you plan a nightly refresh?
If so, forget about this online sync process, download the full tables; ship it to the mysql and batch load it. Processing time might be a lot quicker than you think.

Postgres trigger to update Java cache

I have a Java web app (WAR deployed to Tomcat) that keeps a cache (Map<Long,Widget>) in memory. I have a Postgres database that contains a widgets table:
widget_id | widget_name | widget_value
(INT) (VARCHAR 50) (INT)
To O/R map between Widget POJOs and widgets table records, I am using MyBatis. I would like to implement a solution whereby the Java cache (the Map) is updated in real-time whenever a value in the widgets table changes. I could have a polling component that checks the table every, say, 30 seconds, but polling just doesn't feel like the right solution here. So here's what I'm proposing:
Write a Postgres trigger that calls a stored procedure (run_cache_updater())
The procedure in turns runs a shell script (run_cache_updater.sh)
The script base-64 encodes the changed widgets record and then cURLs the encoded record to an HTTP URL
The Java WAR has a servlet listening on the cURLed URL and handles any HttpServletRequests sent to it. It base-64 decodes the record and somehow transforms it into a Widget POJO.
The cache (Map<Long,Widget>) is updated with the correct key/value.
This solution feels awkward, and so I am first wondering how any Java/Postgres gurus out there would handle such a situation. Is polling the better/simpler choice here (am I just being stubborn?) Is there another/better/more standard solution I am overlooking?
If not, and this solution is the standard way of pushing changed records from Postgres to the application layer, then I'm choking on how to write the trigger, stored procedure, and shell script so that the entire widgets record gets passed into the cURL statement. Thanks in advance for any help here.

I can't speak to MyBatis, but I can tell you that PostgreSQL has a publish/subscribe system baked in, which would let you do this with much less hackery.
First, set up a trigger on widgets that runs on every insert, update, and delete operation. Have it extract the primary key and NOTIFYwidgets_changed, id. (Well, from PL/pgSQL, you'd probably want PERFORM pg_notify(...).) PostgreSQL will broadcast your notification if and when that transaction commits, making both the notification and the corresponding data changes visible to other connections.
In the client, you'd want to run a thread dedicated to keeping this map up-to-date. It would connect to PostgreSQL, LISTENwidgets_changed to start queueing notifications, SELECT * FROM widgets to populate the map, and wait for notifications to arrive. (Checking for notifications apparently involves polling the JDBC driver, which sucks, but not as bad as you might think. See PgNotificationPoller for a concrete implementation.) Once you see a notification, look up the indicated record and update your map. Note that it's important to LISTEN before the initial SELECT *, since records could be changed between SELECT * and LISTEN.
This approach doesn't require PostgreSQL to know anything about your application. All it has to do is send notifications; your application does the rest. There's no shell scripts, no HTTP, and no callbacks, letting you reconfigure/redeploy your application without also having to reconfigure the database. It's just a database, and it can be backed up, restored, replicated, etc. with no extra complications. Similarly, your application has no extra complexities: all it needs is a connection to PostgreSQL, which you already have.

Simple ways to note/log the last date/time when a database was updated, accessed, modified etc

I made Java/JDBC code which performs simple/basic operations on a database.
I want to add code which helps me to keep a track of when a particular database was accessed, updated, modified etc by this program.
I am thinking of creating another database inside my DBMS where these details or logs will be stored for each database involved.
Is this the best way to do it ? Are there any other ways (preferably simple) to do this ?
EDIT-
For now, I am using MySQL. But, I also want my code to work with at least
Oracle SQL and MS-SQL as well.

It is pretty standard to add a "last_modified" column to a table and then add an update trigger on the table to set it to the db current time. Then your apps don't need to worry about it. Also, a "create_time" is often used as well, populated by an insert trigger.
Update after comment:
Seems you are looking for audit logs. Some write apps where data manipulation only happens through stored procedures and not through inserts and updates. A fixed api. So you want to add an item to a table, you call the stored proc:
addItem(itemName, itemDescription)
Then the proc inserts into the item table and does what ever logging is necessary.
Another technique, if you are using some kind of framework for your jdbc access (say Spring) might be to intercept at that layer.

In almost all tables, I have the following columns:
CreatedBy
CreatedAt
These columns have default values of the current user and current time, respectively. They are populated when a row is added.
This solves only part of your problem. You can start adding triggers, but that gets complicated. Another method is to force modification access to the database through stored procedures, and then log the stored procedures. This has other advantages, in terms of controlling what users can do. But, you might want more flexibility.
A third possibility are auditing tools, that keep track of all queries being run on the database. I think most databases have a way of turning on internal auditing, although these are very specific to the database. There are also third party tools that allow you to see what has happened. Note, though, that these methods will affect performance if your database is doing high volume transactions.
For more information, you should revise your question to specify which database you are using or planning on using.

Justification of the need for an in-memory database

My use case is as follows --
I have a database table with around 1000+ entries and this table is updated/edited infrequently but i expect this to change in future. Some of the columns in the table contain strings that are of considerable length.
Now I am in the process of writing a UI application that will have some mouseover events that will display texts derived from the aforementioned database table.
I have, for my use case, decided to write a backend 'server' that will host an in-memory database that will have all the data that was present in the aforementioned table. The UI app will now, on startup, cache the required data from the in-memory database present or hosted by the backend server.
Does my use case justify using an in-memory database ? If not, what are the alternatives I should consider ?
EDIT 1 --
My use case also involves running multiple searches of varying complexity on the database very frequently.
Thanks
p1ng

Seems like an excellent use-case for an in-memory database. Writing it yourself, on the other hand, is probably not the way to go.
There are plenty of existing options for just about any imaginable scenario: http://en.wikipedia.org/wiki/In-memory_database
If you're doing complex searches on text data, Lucene is quite excellent. It has special in-memory storage backends, but really, it doesn't matter for such a tiny dataset - it will always be quickly cached anyway.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.