I receive several thousand records from a web service call. (Each has an id and a version number; the payload is a list of objects.)
I am required to check whether a record exists for each id in the database. If it does and the version number differs, I need to update the table;
if it does not exist, I need to insert a new record.
Which of these do you think is the optimal solution?
1. Fetch the records from the DB and cache them. Remove the matching records from the list. Prepare one list of records that require an update and
another of records that require an insert, and then call a procedure to insert and update accordingly.
(Once I have prepared the lists, they could hold relatively few records.)
2. Loop through each record I receive from the web service and pass the id and version to a procedure that carries out the insert or update
as needed.
(I am using a connection pool, but I would be calling the procedure once per record.)
Which of the two do you think is the better approach? Or can you think of a better solution than these two?
Limitations on the technologies I can use:
Spring JDBC 2.x, Java 1.7, Sybase database
No ORM technologies available.
Can I use jdbcTemplate.batchUpdate() to call a procedure?
The first option is better than option 2.
Few operations cost more than a network round trip between the application server and the database server.
The rule of thumb is: the fewer the calls, the better the performance.
I am not sure about the constraints with Sybase, but even if you can only process 5-10 records in each stored procedure call, that will still be better than processing a single record every time.
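To make the first option concrete, here is a minimal sketch of the in-memory split, assuming both the web service data and the cached DB rows can be held as id-to-version maps. The class name UpsertPlanner and its fields are made up for illustration; it is written to compile on Java 1.7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Spring or JDBC): splits incoming records
// into the ids that need an INSERT and the ids that need an UPDATE.
final class UpsertPlanner {
    final List<Long> toInsert = new ArrayList<Long>();
    final List<Long> toUpdate = new ArrayList<Long>();

    // incoming: id -> version received from the web service
    // existing: id -> version fetched once from the database and cached
    static UpsertPlanner plan(Map<Long, Integer> incoming, Map<Long, Integer> existing) {
        UpsertPlanner plan = new UpsertPlanner();
        for (Map.Entry<Long, Integer> e : incoming.entrySet()) {
            Integer currentVersion = existing.get(e.getKey());
            if (currentVersion == null) {
                plan.toInsert.add(e.getKey());      // id not in the database yet
            } else if (!currentVersion.equals(e.getValue())) {
                plan.toUpdate.add(e.getKey());      // id present, version differs
            }
            // same version: the record is dropped from further processing
        }
        return plan;
    }
}
```

The two resulting lists can then be sent to the database in a small number of batched calls instead of one call per record.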
I have a MySQL database where I need to do 1k or so updates, and I am contemplating whether it would be more appropriate to use executeBatch or executeUpdate. The PreparedStatement is to be built from an ArrayList of 1k or more ids (which are PKs of the table to be updated). For each update to the table I need to check whether it was updated or not (it's possible that the id is not in the table). If the id doesn't exist, I need to add that id to a separate ArrayList which will be used to do batch inserts.
Given the above, is it more appropriate to do:
Several separate executeUpdate() calls, storing the id whenever nothing was updated, or
Simply create a batch and use executeBatch(), which will return an array of either a 0 or 1 for each separate statement/id.
In case two, the overhead would be an additional array to hold all the 0 or 1 return values. In case one, the overhead would be due to executing each UPDATE separately.
Definitely executeBatch(), and make sure that you add "rewriteBatchedStatements=true" to your jdbc connection string.
The increase in throughput is hard to exaggerate. Your 1K updates will likely take barely longer than a single update, assuming that you have proper indexes and a WHERE clause that makes use of them.
Without the extra setting on the connection string, the time to do the batch update is going to be about the same as to do each update individually.
I'd go with batch since network latency is something to consider unless you are somehow running it on the same box.
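For the "which ids were not updated" part, a minimal sketch of reading the executeBatch() result could look like the following. It assumes the ids were added to the batch in the same order as the list, and that the driver reports a real row count per statement; the JDBC API allows a driver to return Statement.SUCCESS_NO_INFO instead, and this sketch simply refuses to guess in that case. The class and method names are made up for illustration.

```java
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps the int[] returned by executeBatch() back to ids.
final class BatchResultSplitter {

    // Returns the ids whose UPDATE touched no row, i.e. candidates for the batch INSERT.
    static List<Long> idsNotUpdated(List<Long> ids, int[] updateCounts) {
        if (ids.size() != updateCounts.length) {
            throw new IllegalArgumentException("expected one update count per id");
        }
        List<Long> missing = new ArrayList<Long>();
        for (int i = 0; i < updateCounts.length; i++) {
            int count = updateCounts[i];
            if (count == Statement.SUCCESS_NO_INFO || count == Statement.EXECUTE_FAILED) {
                // No usable row count from the driver: this sketch cannot decide.
                throw new IllegalStateException("no row count for id " + ids.get(i));
            }
            if (count == 0) {
                missing.add(ids.get(i)); // id was not in the table
            }
        }
        return missing;
    }
}
```

The extra int[] is tiny next to the cost of 1k separate round trips, so the array overhead in case two should not be a concern.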
I have one table that records its row insert/update timestamps on a field.
I want to synchronize data in this table with another table on another db server. The two db servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable.
My workflow:
I use a global last_sync_date parameter and query table Master for the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts
The complexity of the problem rises when dealing with deleted records of Master table. To catch the deleted records I think I have to maintain a log table for the previously inserted records and use sql "NOT IN". This becomes a performance problem when dealing with large datasets.
What would be an alternative workflow dealing with this scenario?
It sounds like you need a transactional message queue.
How this works is simple. When you update the master db you can send a message to the message broker (describing whatever the update was), which can go to any number of queues. Each slave db can have its own queue, and because queues preserve order the process should eventually synchronize correctly (ironically this is sort of how most RDBMS do replication internally).
Think of the message queue as a sort of SCM change-list or patch-list database. That is, for the most part the same (or roughly the same) SQL statements sent to master should eventually be replicated to the other databases. Don't worry about losing messages, as most message queues support durability and transactions.
I recommend you look at spring-amqp and/or spring-integration especially since you tagged this question with spring-batch.
Based on your comments:
See Spring Integration: http://static.springsource.org/spring-integration/reference/htmlsingle/ .
Google SEDA. Whether you go this route or not you should know about Message queues as it goes hand-in-hand with batch processing.
RabbitMQ has a good diagram of how messaging works.
The contents of your message might be the entire row and whether it is a CREATE, UPDATE, or DELETE. You can use whatever format you like (e.g. JSON; see the Spring Integration documentation for recommendations).
You could even send the direct SQL statements as a message!
BTW your concern about NOT IN being a performance problem is not a very strong one, as there are plenty of workarounds, but given that you don't want to do DB-specific things (like triggers and replication) I still feel a message queue is your best option.
EDIT - Non MQ route
Since I gave you a tough time about asking this question I will continue to try to help.
Besides the message queue you can use some sort of XML file, like you were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). If you go this route you will want to SELECT * WHERE CREATE_TIME < ?, with the current time as the parameter. Basically you're only getting the rows as of a snapshot.
Now on your other database, for the delete you're going to remove rows by joining on an ID table but with != (that is, you can use JOINs instead of the slow NOT IN). Luckily you only need all the ids for the delete and not the other columns. For the other columns you can use a delta based on the update timestamp column (for update, and create aka insert).
I am not sure about the solution. But I hope these links may help you.
http://knowledgebase.apexsql.com/2007/09/how-to-synchronize-data-between.htm
http://www.codeproject.com/Tips/348386/Copy-Synchronize-Table-Data-between-databases
Have a look at Oracle GoldenGate:
Oracle GoldenGate is a comprehensive software package for enabling the
replication of data in heterogeneous data environments. The product
set enables high availability solutions, real-time data integration,
transactional change data capture, data replication, transformations,
and verification between operational and analytical enterprise
systems.
SymmetricDS:
SymmetricDS is open source software for multi-master database
replication, filtered synchronization, or transformation across the
network in a heterogeneous environment. It supports multiple
subscribers with one direction or bi-directional asynchronous data
replication.
Daffodil Replicator:
Daffodil Replicator is a Java tool for data synchronization, data
migration, and data backup between various database servers.
Why don't you just add a TIMESTAMP column that indicates the last update/insert/delete time? Then add a deleted column, i.e. mark the row as deleted instead of actually deleting it immediately. Delete it after having exported the delete action.
In case you cannot alter schema usage in an existing app:
Can't you use triggers at all? How about a second ("hidden") table that gets populated with every insert/update/delete and which would constitute the content of the next to be generated xml export file? That is a common concept: a history (or "log") table: it would have its own progressing id column which can be used as an export marker.
Very interesting question.
In my case I had enough RAM to load all ids from the master and slave tables and diff them.
If the ids in the master table are sequential, you may try to maintain a set of fully filled ranges in the master table (ranges with all ids used, without gaps, like 100,101,102,103).
To find removed ids without loading all of them into memory, you can execute a SQL query that counts the records with id >= full_region.start and id <= full_region.end for each fully filled region. If the result of the query == (full_region.end - full_region.start) + 1, no record in the region has been deleted. Otherwise, split the region into 2 parts and do the same check for both of them (in a lot of cases only one side contains removed records).
Once a range shrinks to a certain length (about 5000, I think) it will be faster to load all present ids and check for the absent ones using a Set.
It also makes sense to load all ids into memory for a batch of small (10-20 record) regions.
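The range-splitting check could be sketched roughly like this, taking the expected count of a full region as end - start + 1. The RangeCounter callback stands in for the SQL COUNT query and is purely hypothetical, as are the other names; the threshold parameter plays the role of the "about 5000" cut-off, below which a range is handed back to be loaded and diffed with a Set.

```java
import java.util.List;

// Hypothetical sketch of the divide-and-check search for deleted ids.
final class DeletedIdFinder {

    // Stand-in for: SELECT COUNT(*) FROM master WHERE id >= ? AND id <= ?
    interface RangeCounter {
        long count(long start, long end);
    }

    // Appends to 'out' every sub-range [start, end] that is known to miss at least
    // one id and is no longer than 'threshold'; those are cheap enough to load fully.
    static void findGaps(long start, long end, long threshold,
                         RangeCounter counter, List<long[]> out) {
        if (threshold < 1 || start > end) {
            throw new IllegalArgumentException("need threshold >= 1 and start <= end");
        }
        long expected = end - start + 1;
        if (counter.count(start, end) == expected) {
            return; // region still fully filled, nothing was deleted here
        }
        if (expected <= threshold) {
            out.add(new long[] {start, end}); // small enough: load ids and diff with a Set
            return;
        }
        long mid = start + (end - start) / 2;
        findGaps(start, mid, threshold, counter, out);
        findGaps(mid + 1, end, threshold, counter, out);
    }
}
```

Each level of splitting costs up to two COUNT queries per damaged region, so the cut-off trades round trips against the number of ids loaded into memory.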
Make a history table for the table that needs to be synchronized (basically a duplicate of that table, with a few extra fields perhaps) and insert the entire row every time something is inserted/updated/deleted in the active table.
Write a Spring Batch job to sync the data to the Slave machine based on the history table's extra fields.
hope this helps..
A potential option for allowing deletes within your current workflow:
In the case that the trigger restriction is limited to triggers with references across databases, a possible solution within your current workflow would be to create a helper table in your Master database to store only the unique identifiers of the deleted rows (or whatever unique key would enable you to most efficiently delete your deleted rows).
Those ids would need to be inserted by a trigger on your master table on delete.
Using the same mechanism as your insert/updates, create a task following your inserts and updates. You could export your helper table to xml, as you noted in your current workflow.
This task would simply delete the rows out of the slave table, then delete all data from your helper table following completion of the task. Log any errors from the task so that you can troubleshoot this since there is no audit trail.
If your database has a transaction dump log, just ship that one.
It is possible with MySQL and should be possible with PostgreSQL.
I would agree with another comment: this requires the usage of triggers. I think another table should hold the history of your sql statements. See this answer about using 2008 extended events... Then you can get the entire sql and store the resulting query in the history table. It's up to you whether you want to store it as a mysql query or a mssql query.
Here's my take. Do you really need to deal with this? I assume that the slave is for reporting purposes. So the question I would ask is how up to date should it be? Is it ok if the data is one day old? Do you plan a nightly refresh?
If so, forget about this online sync process: download the full tables, ship them to MySQL and batch load them. Processing time might be a lot quicker than you think.
I'm using SELECT GEN_ID(TABLE,1) FROM MON$DATABASE from a PreparedStatement to generate an ID that will be used in several tables.
I'm going to do a great number of INSERTs with PreparedStatements batches and I'm looking for a way to fetch a lot of new IDs at once from Firebird.
Using a trigger seems to be out of the question, since I have to INSERT into other tables at another time with this ID in the Java code. Also, getGeneratedKeys() for batches seems not to have been implemented yet in (my?) Firebird JDBC driver.
I'm answering from memory here, but I remember that I once had to load a bunch of transactions from a Quicken file into my Firebird database. I loaded an array with the transactions and set a variable named say iCount to the number. I then did SELECT GEN_ID(g_TABLE, iCount) from RDB$DATABASE. This gave me the next ID and incremented the generator by the number of records that I was going to insert. Then I started a transaction, stepped through the array and inserted the records one after the other and closed the transaction. I was surprised how fast this went. I think, at the time, I was working with about 28,000 transactions and the time was like a couple of seconds. Something like this might work for you.
As jrodenhi says, you can reserve a range of values using
SELECT GEN_ID(<generator>, <count>) FROM RDB$DATABASE
This will return a value of <count> higher than the previously generated key, so you can use all values from (value - count, value] (where ( signifies exclusive, ] inclusive). Say generator currently has value 10, calling GEN_ID(generator, 10) will return 20, you can then use 11...20 for ids.
This does assume that you normally use generators to generate ids for your table, and that no application makes up its own ids without using the generator.
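On the Java side, turning the returned value into the usable ids is simple arithmetic. A minimal sketch, where the class and method names are made up for illustration and the value is assumed to come from a single GEN_ID(generator, count) call:

```java
// Hypothetical helper: expands the result of GEN_ID(generator, count)
// into the ids of the reserved range (value - count, value].
final class IdRangeReserver {

    static long[] reservedIds(long returnedValue, int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        long[] ids = new long[count];
        long first = returnedValue - count + 1; // lower bound is exclusive, so add 1
        for (int i = 0; i < count; i++) {
            ids[i] = first + i;
        }
        return ids;
    }
}
```

With the example above, reservedIds(20, 10) yields 11 through 20, which can then be bound to the PreparedStatement parameters for each INSERT.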
As you noticed, getGeneratedKeys() has not been implemented for batches in Jaybird 2.2.x. Support for this option will be available in Jaybird 3.0.0, see JDBC-452.
Unless you are also targeting other databases, there is no real performance advantage to using batched updates (in Jaybird). Firebird does not support update batches, so the internal implementation in Jaybird does essentially the same as preparing a statement and executing it yourself repeatedly. This might change in the future, as there are plans to add this to Firebird 4.
Disclosure: I am one of the Jaybird developers
I have a producer thread in Java pulling items from an Oracle table every n milliseconds.
The current implementation relies on a Java timestamp in order to retrieve data and never re-retrieve them again.
My objective is to get rid of the timestamp pattern and directly update the same items I'm pulling from the database.
Is there a way to SELECT a set of items and UPDATE them at the same time to mark them as "Being processed"?
If not, would a separate UPDATE query relying on the IN clause be a major performance hit?
I tried using a temporary table for that purpose, but I've seen that performance was severely affected.
Don't know if it helps, but the application is using iBatis.
If you are using Oracle 10g or higher, you can use the RETURNING clause of the UPDATE statement. If you wish to retrieve more than one row you can use the BULK COLLECT clause.
Here is a link to some examples:
http://psoug.org/snippet/UPDATE-with-RETURNING-clause_604.htm
So I have a database into which a lot of data is inserted from a Java application. Usually I insert into table1 and get the last id, then insert into table2 and get the last id from there, and finally insert into table3 and get that id as well, and work with it within the application. I insert around 1000-2000 rows of data every 10-15 minutes.
And using a lot of small inserts and selects on a production webserver is not really good, because it sometimes bogs down the server.
My question is: is there a way to insert multiple rows into table1, table2 and table3 without using such a huge number of selects and inserts? Is there a sql-fu technique I'm missing?
Since you're probably relying on auto_increment primary keys, you have to do the inserts one at a time, at least for table1 and table2, because MySQL won't give you more than the very last key generated.
You should never have to select. You can get the last inserted id from the Statement using the getGeneratedKeys() method. See an example showing this in the MySQL manual for the Connector/J:
http://dev.mysql.com/doc/refman/5.1/en/connector-j-usagenotes-basic.html#connector-j-examples-autoincrement-getgeneratedkeys
Other recommendations:
Use multi-row INSERT syntax for table3.
Use ALTER TABLE DISABLE KEYS while you're importing, and re-enable them when you're finished.
Use explicit transactions. I.e. begin a transaction before your data-loading routine, and commit at the end. I'd probably also commit after every 1000 rows of table1.
Use prepared statements.
Unfortunately, you can't use the fastest method for bulk load of data, LOAD DATA INFILE, because that doesn't allow you to get the generated id values per row.
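For the "commit after every 1000 rows" recommendation, one simple way to structure the loader is to cut the incoming rows into fixed-size chunks and run one transaction per chunk. A minimal sketch of the chunking step only; the helper name is made up, and the actual inserts and commit calls are left out:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the rows to load into chunks,
// so that each chunk can be inserted and committed as one transaction.
final class BatchChunker {

    static <T> List<List<T>> chunks(List<T> rows, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("chunk size must be positive");
        }
        List<List<T>> out = new ArrayList<List<T>>();
        for (int from = 0; from < rows.size(); from += size) {
            int to = Math.min(from + size, rows.size());
            out.add(new ArrayList<T>(rows.subList(from, to))); // copy, independent of 'rows'
        }
        return out;
    }
}
```

A loader would then loop over chunks(rows, 1000), doing the inserts for one chunk and committing before moving on to the next.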
There's a lot to talk about here:
It's likely that network latency is killing you if each of those INSERTs is another network roundtrip. Try batching your requests so they only require a single roundtrip for the entire transaction.
Speaking of transactions, you don't mention them. If all three of those INSERTs need to be a single unit of work you'd better be handling transactions properly. If you don't know how, better research them.
Try caching requests if they're reused a lot. The fastest roundtrip is the one you don't make.
You could redesign your database so that the primary key is not a database-generated, auto-incremented value, but rather a client-generated UUID. Then you could generate all the keys for every record upfront and batch the inserts however you like.
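Generating the keys upfront needs nothing beyond java.util.UUID. A minimal sketch, with a made-up helper name, assuming the key columns are changed to a type that can hold a UUID:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical helper: creates primary keys on the client before any INSERT runs.
final class UuidKeys {

    static List<UUID> generate(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must not be negative");
        }
        List<UUID> keys = new ArrayList<UUID>(n);
        for (int i = 0; i < n; i++) {
            keys.add(UUID.randomUUID()); // random (version 4) UUID
        }
        return keys;
    }
}
```

Because the table1 key is known before the row is written, the table2 and table3 rows that reference it can be prepared in the same pass and all three tables loaded in batches, with no read-back of generated ids.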