Reading and writing CSV File into database - java

I am implementing an application-specific data import feature from one database to another.
I have a CSV file containing, say, 10000 rows. These rows need to be inserted into or updated in the database.
I am using a MySQL database and inserting from Java.
Some of the rows may already be present in the database, in which case they need to be updated. Rows that are not present need to be inserted.
One possible solution is to read the file line by line, check for each entry in the database, and build insert/update queries accordingly. But creating and executing these queries one at a time may take a long time. Sometimes my CSV file may have millions of records.
Is there a faster way to achieve this?

I don't know how you determine "is already present", but if it's any kind of database-level constraint (probably a primary key?) you can make use of the REPLACE INTO statement, which inserts the record unless that would violate a primary or unique key, in which case it deletes the conflicting row and inserts the new one in its place.
It works just like INSERT basically:
REPLACE INTO table ( id, field1, field2 )
VALUES ( 1, 'value1', 'value2' )
If a row with ID 1 exists, it's replaced with these values (columns not listed fall back to their defaults); otherwise it's created.

Given that you're using MySQL you could use the INSERT ... ON DUPLICATE KEY UPDATE ... statement, which functions similarly to the SQL standard MERGE statement. See the MySQL documentation for this statement and the Wikipedia article on the SQL MERGE statement for reference. The statement would look something like
INSERT INTO MY_TABLE
(PRIMARY_KEY_COL, COL2, COL3, COL4)
VALUES
(1, 2, 3, 4)
ON DUPLICATE KEY
UPDATE COL2 = 2,
COL3 = 3,
COL4 = 4
In this example I'm assuming that PRIMARY_KEY_COL is a primary or unique key on MY_TABLE. If the INSERT statement would fail due to a duplicate value on the primary or unique key, then the UPDATE clause is executed instead. Also note (on the MySQL doc page) that there are some gotchas associated with auto-increment columns on an InnoDB table.
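For the CSV import in the question, this statement can be combined with JDBC batching so that rows are sent in groups instead of one round trip per row. What follows is a minimal sketch, not a definitive implementation: the names CsvUpsert, buildUpsertSql and upsertRows are made up for illustration, the rows are assumed to be already parsed into string arrays with the key column first, at least one non-key column is assumed, and the batch size is something you would tune.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.StringJoiner;

final class CsvUpsert {

    // Builds: INSERT INTO t (k, a, b) VALUES (?, ?, ?)
    //         ON DUPLICATE KEY UPDATE a = VALUES(a), b = VALUES(b)
    // Assumes otherColumns is not empty.
    static String buildUpsertSql(String table, String keyColumn, List<String> otherColumns) {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        StringJoiner updates = new StringJoiner(", ");
        cols.add(keyColumn);
        marks.add("?");
        for (String c : otherColumns) {
            cols.add(c);
            marks.add("?");
            updates.add(c + " = VALUES(" + c + ")");
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")"
                + " ON DUPLICATE KEY UPDATE " + updates;
    }

    // Sends the rows in batches inside one transaction.
    // Each row must hold one value per column, key column first.
    static void upsertRows(Connection conn, String sql, List<String[]> rows, int batchSize)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]);
                }
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

With MySQL Connector/J, setting rewriteBatchedStatements=true in the JDBC URL may make such batches noticeably faster, but that is worth measuring on your own data.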
Share and enjoy.

Do you need to do this often or just once in a while?
I need to load CSV files into a database for analysis from time to time, so I created an SSIS solution with a Data Flow task that loads the CSV file into a table on SQL Server.
For more information, see this blog post:
http://blog.sqlauthority.com/2011/05/12/sql-server-import-csv-file-into-database-table-using-ssis/

Add a stored procedure in SQL for inserting. In the stored procedure, use a try/catch block to do the insert. If the insert fails, do an update. Then you can simply call this procedure from your program.
Alternatively:
UPDATE Table1 SET (...) WHERE Column1='SomeValue'
IF @@ROWCOUNT=0
INSERT INTO Table1 VALUES (...)

Related

Transfer mysql binary log into select in CDC

I would like to do real-time reading from MySQL.
The idea is simple. I use the binary log to trigger the select statement.
Meanwhile I'd like to read only the new rows on every change.
Currently I only consider inserts.
So when someone does
insert into sometable(uid,somecolumn) values(uid,something)
My code will be triggered and do
select * from sometable where uid=uid
Of course I have already recorded which columns are the primary key, because the binlog does not seem to contain that information.
I cannot find a tool to parse MySQL insert statements, so I use a regex to find out which column equals which value and then extract the primary keys.
BUT the real problem is what happens if I do
Insert into `table` (`col`) values (select 0 as `col` from `dummy`);
How can I find out that col=0?
Is it impossible to write a select statement, triggered by the insert statement, that selects only the newly changed rows?
In a TRIGGER, you have access to the OLD and NEW row values (NEW in INSERT and UPDATE triggers, OLD in UPDATE and DELETE triggers). With them, you can write code (in the TRIGGER) to log, for example, just the changes. Something like...
IF NEW.col1 != OLD.col1 THEN INSERT INTO LOG ...; END IF;
IF NEW.col2 != OLD.col2 THEN INSERT INTO LOG ...; END IF;

Locking Tables with postgres in JDBC

Just a quick question about locking tables in a postgres database using JDBC. I have a table to which I want to add a new record; for the primary key, I use an increasing integer value.
I want to be able to retrieve the max value of this column in Java and store it as a variable to be used as a new primary key when adding a new row.
This gives me a small problem: as this is going to be modelled as a multi-user system, what happens when 2 locations request the same max value? This will of course create a problem when both try to add a row with the same primary key.
I realise that I should be using an EXCLUSIVE lock on the table to prevent reading or writing while getting the key and adding a new row. However, I can't seem to find any way to deal with table locking in JDBC, just standard transactions.
Pseudocode as such:
primaryKey = "SELECT MAX(id) FROM table1;";
primaryKey++;
//id retrieved again from 2nd source
"INSERT INTO table1 VALUES (primaryKey, value1, value2);"
You're absolutely right: if two locations request the max value at around the same time, you'll run into a race condition.
The way to handle this is to create a sequence in postgres and select its nextval as the primary key.
I don't know exactly what direction you're heading in and how you handle your data, but you could also declare the column as serial and not even include it in your insert query. The column will then auto-increment.
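From JDBC, the serial-column variant might look roughly like the sketch below. It assumes that table1 has a serial (or sequence-backed) column named id and two text columns named value1 and value2; those column names, the class name SequenceInsert and the method insertRow are illustrative assumptions, not taken from the question. The database assigns the key, and the INSERT hands it back through PostgreSQL's RETURNING clause, so no table lock and no SELECT MAX(id) is needed.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class SequenceInsert {

    // "id" is omitted from the column list so that the database fills it in
    // from its sequence; RETURNING id sends the generated key back.
    static final String INSERT_SQL =
            "INSERT INTO table1 (value1, value2) VALUES (?, ?) RETURNING id";

    // Inserts one row and returns the key the database generated for it.
    static long insertRow(Connection conn, String value1, String value2) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, value1);
            ps.setString(2, value2);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("INSERT returned no generated id");
                }
                return rs.getLong(1);
            }
        }
    }
}
```

Because each call to nextval hands out a distinct value even under concurrency, two clients inserting at the same time get different keys. Gaps in the numbering are possible (for example after a rollback), which is usually acceptable for a surrogate key.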

Making changes to my program after updating my database?

Sorry if my question is not specific or if it has been answered before. I tried looking for it and for a better way to ask but this is the most accurate way.
I have developed a program in Java in which I insert a new row into my database in the following way:
INSERT INTO table_name VALUES (?,?,?)
The thing is that I have this query in many parts of the program, and now I have decided to add a fourth column to my table. Do I have to update EVERY SINGLE query with a new question mark in the program? If I don't, it crashes.
What is the best way to proceed in these cases?
YES.
You need to add an extra ? (parameter placeholder) because you are using an implicit INSERT statement. That means that you didn't specify the names of the columns into which the values will be inserted.
INSERT INTO table_name VALUES (?,?,?)
// the server assumes that you are inserting values for all
// columns in your table
// if you fail to supply a value for one column, an exception will be thrown
The next time you create an INSERT statement, make sure that you specify the column names in it, so that when you alter the table by adding an extra column, you won't have to update all your placeholders.
INSERT INTO table_name (Col1, col2, col3) VALUES (?,?,?)
// the server knows that you are inserting values for specific columns
Do I have to update EVERY SINGLE query with a new question mark in the program?
Probably. What you should do, while you're updating every single one of those queries, is to encapsulate them into an object, probably using a Data Source pattern such as a Table Data Gateway or a Row Data Gateway. That way you Don't Repeat Yourself and the next time you update the table, you only have one place to update the query.
Because of the syntax you've used, you might run into some issues. I'm referring to the lack of column names. Your INSERT queries will start failing as soon as you change your table structure.
If you had used the following syntax:
INSERT INTO table_name (C1, C2, C3) VALUES (?,?,?)
then, assuming your new column has a proper default value, it would have worked fine.
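As a rough illustration of the Table Data Gateway idea combined with explicit column names, the sketch below keeps the column list and the INSERT statement in exactly one class. The class name TableNameGateway, the column names col1 to col3 and the three-argument insert method are placeholders chosen for this example, not something from the question.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;

final class TableNameGateway {

    // The only place that knows the column list. A new column with a default
    // value needs no change here; a new required column changes only this class.
    private static final String[] COLUMNS = { "col1", "col2", "col3" };

    static final String INSERT_SQL = buildInsertSql("table_name", COLUMNS);

    // Produces one "?" placeholder per named column.
    static String buildInsertSql(String table, String... columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.length, "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    private final Connection conn;

    TableNameGateway(Connection conn) {
        this.conn = conn;
    }

    // Callers use this method instead of writing their own INSERT strings.
    void insert(Object v1, Object v2, Object v3) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setObject(1, v1);
            ps.setObject(2, v2);
            ps.setObject(3, v3);
            ps.executeUpdate();
        }
    }
}
```

The rest of the program would call something like new TableNameGateway(conn).insert(a, b, c), so a later schema change touches one file rather than every query.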

how to enable multi-thread/connection modify the same mysql table?

I have a program that has 2 threads running, and each thread has its own database JDBC connection, and they will access/modify the same database table A like below. Table A only has 2 columns (id, name), and the primary key is the combination of id and name.
Statement stmt;
// first delete the record if it already exists in the table
stmt.addBatch("delete from A where id='arg_id' and name='arg_name';");
// then insert it into the table
stmt.addBatch("insert into A values (arg_id, arg_name);");
stmt.executeBatch();
The 2 threads may insert the same data into the table, and I got the following exception:
java.sql.BatchUpdateException: Duplicate entry '0001-joey' for key 1
at com.mysql.jdbc.Statement.executeBatch(Statement.java:708)
at com.mchange.v2.c3p0.impl.NewProxyStatement.executeBatch(NewProxyStatement.java:743)
at proc.Worker.norD(NW.java:450)
Do you have any idea how I can fix this issue? Thank you.
Regards,
Joey
Why not introduce a simple optimistic locking mechanism on the database?
Add a version column and track the version number when performing delete or update transactions.
Your table would look like
create table test(
id int not null primary key,
name varchar(255),
rowversion int default 0);
Every time you retrieve a row you should also retrieve the row version, so you can do
update test set name='new name', rowversion=rowversion+1 where id=id and rowversion=retrievedRowVersion;
The same with delete
delete from test where id=id and rowversion=retrievedRowVersion;
This is a simple mechanism that exploits the DBMS's concurrency management features. Check this link for more information on optimistic locking: http://en.wikipedia.org/wiki/Optimistic_concurrency_control#Examples
This is obviously only a very simple implementation of concurrency management, but your problem has to take these issues into account.
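On the Java side, the version check comes down to looking at the update count. The following is a minimal sketch against the test table above; the class name OptimisticLock and the method rename are invented for this example, and what to do after a conflict (re-read and retry, or report an error) is left to the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class OptimisticLock {

    // The WHERE clause only matches if nobody bumped rowversion since we read it.
    static final String UPDATE_SQL =
            "UPDATE test SET name = ?, rowversion = rowversion + 1 "
            + "WHERE id = ? AND rowversion = ?";

    // Returns true if the row was updated, false if another writer got there
    // first (or the row no longer exists), in which case nothing was changed.
    static boolean rename(Connection conn, int id, String newName, int retrievedRowVersion)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPDATE_SQL)) {
            ps.setString(1, newName);
            ps.setInt(2, id);
            ps.setInt(3, retrievedRowVersion);
            return ps.executeUpdate() == 1;
        }
    }
}
```

A false return value is the signal to reload the row, including its current rowversion, before trying again.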
Also, for the double insert, the fact that your transaction is rejected is good: it means that no duplicate keys are inserted. You should just handle the exception and notify the user of what happened.
Wrap both statements in a transaction:
BEGIN;
DELETE FROM a WHERE ...;
INSERT INTO a VALUES (...);
COMMIT;
Note that as long as the table consists of only the primary key, this conflict arises only when both threads write an identical row, so the table ends up the same either way; I presume you want to add more columns, in which case you should use the UPDATE ... WHERE syntax to change values.
Are you using any kind of synchronization? First you will need to wrap the code that modifies the table in:
synchronized(obj)
{
// code
}
where obj is an object that both threads can access.
I don't know the exact semantics of your table modifications, but if they both insert ids, you will also need to hold a "global" id and atomically increment it in each thread, such that they don't both get the same value.
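The "global" id idea can be sketched with an AtomicLong shared by the threads, as below. The class SharedIdGenerator and the helper generateConcurrently are illustrative names. Note the assumption: this only coordinates threads inside a single JVM, so if several processes write to the table, the database itself still has to hand out or enforce the keys.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class SharedIdGenerator {

    private final AtomicLong next;

    SharedIdGenerator(long firstId) {
        this.next = new AtomicLong(firstId);
    }

    // Atomic: no two callers can ever receive the same value.
    long nextId() {
        return next.getAndIncrement();
    }

    // Demo helper: several threads draw ids from one generator and
    // record every id they received in a thread-safe set.
    static Set<Long> generateConcurrently(int threadCount, int idsPerThread)
            throws InterruptedException {
        SharedIdGenerator generator = new SharedIdGenerator(1);
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < idsPerThread; i++) {
                    seen.add(generator.nextId());
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return seen;
    }
}
```

If any id had been handed out twice, the resulting set would contain fewer than threadCount * idsPerThread elements.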

Insert fail then update OR Load and then decide if insert or update

I have a webservice in Java that receives a list of information to be inserted or updated in a database. I don't know in advance which entries need an insert and which need an update.
Which one is the best approach to obtain better performance:
1. Iterate over the list (an object list, with the table PK in each entry) and try to insert each entry into the database. If the insert fails, run an update.
2. Try to load the entry from the database. If a result is retrieved, update; if not, insert the entry.
3. Another option? Tell me about it :)
In the first calls, I believe that most of the entries will be new DB entries, but there will be a saturation point after which most of the entries will be updates.
I'm talking about a DB table that could reach over 100 million entries in a mature form.
What will be your approach? Performance is my most important goal.
If your database supports MERGE, I would have thought that was most efficient (and treats all the data as a single set).
See:
http://www.oracle.com/technology/products/oracle9i/daily/Aug24.html
https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=194
If performance is your goal, then first get rid of the word iterate from your vocabulary! Learn to do things in sets.
If you need to update or insert, always do the update first. Otherwise it is easy to find yourself updating the record you just inserted by accident. If you are doing this it helps to have an identifier you can look at to see if the record exists. If the identifier exists, then do the update otherwise do the insert.
The important thing is to understand the ratio between the number of inserts and the number of updates in the list you receive. IMHO you should implement an abstract strategy that says "persist this to the database". Then create concrete strategies that (for example):
check for the primary key and, if zero records are found, do the insert, else update
do the update and, if it fails, do the insert
others
And then pull the strategy to use (the fully qualified class name, for example) from a configuration file. This way you can switch from one strategy to another easily. If it is feasible, which may depend on your domain, you can add a heuristic that selects the best strategy based on the input entities in the set.
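A bare-bones version of that strategy setup could look like the sketch below. Everything here is hypothetical scaffolding: RowStore stands in for whatever data-access layer you have, InMemoryStore exists only so that the example runs without a database, and Strategies.byName uses short names where a real setup might load a fully qualified class name from configuration.

```java
import java.util.HashMap;
import java.util.Map;

// The minimal storage operations the strategies need (hypothetical interface).
interface RowStore {
    boolean exists(int id);
    void insert(int id, String value);
    int update(int id, String value); // returns the number of rows affected
}

// The abstract "persist this to the database" strategy.
interface PersistStrategy {
    void persist(RowStore store, int id, String value);
}

// Looks the key up first, then inserts or updates.
final class CheckThenWrite implements PersistStrategy {
    public void persist(RowStore store, int id, String value) {
        if (store.exists(id)) {
            store.update(id, value);
        } else {
            store.insert(id, value);
        }
    }
}

// Optimistically updates and falls back to an insert when no row matched.
final class UpdateThenInsert implements PersistStrategy {
    public void persist(RowStore store, int id, String value) {
        if (store.update(id, value) == 0) {
            store.insert(id, value);
        }
    }
}

// Stand-in for a real table so that the sketch is runnable.
final class InMemoryStore implements RowStore {
    final Map<Integer, String> rows = new HashMap<>();

    public boolean exists(int id) {
        return rows.containsKey(id);
    }

    public void insert(int id, String value) {
        if (rows.putIfAbsent(id, value) != null) {
            throw new IllegalStateException("duplicate key " + id);
        }
    }

    public int update(int id, String value) {
        return rows.replace(id, value) != null ? 1 : 0;
    }
}

// Picks a strategy by a name that could come from a configuration file.
final class Strategies {
    static PersistStrategy byName(String name) {
        switch (name) {
            case "check-then-write":
                return new CheckThenWrite();
            case "update-then-insert":
                return new UpdateThenInsert();
            default:
                throw new IllegalArgumentException("unknown strategy: " + name);
        }
    }
}
```

Both strategies leave the store in the same final state; they differ only in how many calls they make for new versus existing rows, which is what the insert/update ratio decides.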
MySQL supports this:
INSERT INTO foo
SET bar='baz', howmanybars=1
ON DUPLICATE KEY UPDATE howmanybars=howmanybars+1
Option 2 is not going to be the most efficient. The database will already be making this check for you when you do the actual insert or update in order to enforce the primary key. By making this check yourself you are incurring the overhead of a table lookup twice as well as an extra round trip from your Java code. Choose which case is the most likely and code optimistically.
Expanding on option 1, you can use a stored procedure to handle the insert/update. This example with PostgreSQL syntax assumes the insert is the normal case.
CREATE FUNCTION insert_or_update(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
INSERT INTO
my_table (id, col1)
SELECT
_id, _col1;
EXCEPTION WHEN unique_violation THEN
UPDATE
my_table
SET
col1 = _col1
WHERE
id = _id;
END;
$$
LANGUAGE plpgsql;
You could also make the update the normal case and then check the number of rows affected by the update statement to determine if the row is actually new and you need to do an insert.
As alluded to in some other answers, the most efficient way to handle this operation is in one batch:
Take all of the rows passed to the web service and bulk insert them into a temporary table
Update rows in the master table from the temp table
Insert new rows in the master table from the temp table
Dispose of the temp table
The type of temporary table to use and most efficient way to manage it will depend on the database you are using.
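To make the four steps concrete, here is one possible shape of the merge step from Java. It is only a sketch under loud assumptions: the SQL uses MySQL syntax, the tables are called master and staging, and they have the columns id and col1, none of which comes from the question. Loading the staging table (step 1) is assumed to have been done already, for instance with batched inserts.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

final class StagingMerge {

    // Step 0: create the temp table (MySQL syntax, hypothetical table names).
    static final String CREATE_STAGING = "CREATE TEMPORARY TABLE staging LIKE master";

    // Step 4: dispose of the temp table once the merge is done.
    static final String DROP_STAGING = "DROP TEMPORARY TABLE staging";

    // Steps 2 and 3, in this order: update the rows that already exist,
    // then insert the rows that have no match in the master table.
    static List<String> mergeStatements() {
        return Arrays.asList(
                "UPDATE master m JOIN staging s ON m.id = s.id SET m.col1 = s.col1",
                "INSERT INTO master (id, col1) "
                        + "SELECT s.id, s.col1 FROM staging s "
                        + "LEFT JOIN master m ON m.id = s.id WHERE m.id IS NULL");
    }

    // Runs the merge in one transaction; assumes staging is already loaded.
    static void merge(Connection conn) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            for (String sql : mergeStatements()) {
                st.executeUpdate(sql);
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The whole list is handled in two set-based statements regardless of how many rows the web service received, which is the point of the batch approach.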
