Transferring millions of rows from Teradata to MySQL - Java

I have to transfer around 5 million rows of data from Teradata to MySQL. Can anyone please suggest the fastest way to do this over the network, without using the filesystem? I am new to Teradata and MySQL. I want to run this transfer as a batch job on a weekly basis, so I am looking for a solution that can be fully automated. Any suggestions or hints will be greatly appreciated.
I have already written code using JDBC to get the records from Teradata and insert them into MySQL, but it is very slow, so I am looking to make that code more efficient. I kept the question generic because I didn't want the solution to be constrained by my implementation; along with making the existing code more efficient, I am open to other alternatives. But I don't want to use the filesystem, since that makes the scripts harder to maintain and update.
My implementation:
Getting records from Teradata:
connection = DBConnectionFactory.getDBConnection(SOURCE_DB);
statement = connection.createStatement();
rs = statement.executeQuery(QUERY_SELECT);
while (rs.next()) {
    Offer offer = new Offer();
    offer.setExternalSourceId(rs.getString("EXT_SOURCE_ID"));
    offer.setClientOfferId(rs.getString("CLIENT_OFFER_ID"));
    offer.setUpcId(rs.getString("UPC_ID"));
    offers.add(offer);
}
Inserting the records into MySQL:
int count = 0;
if (isUpdated) {
    for (Offer offer : offers) {
        count++;
        stringBuilderUpdate = new StringBuilder();
        stringBuilderUpdate = stringBuilderUpdate.append(QUERY_INSERT);
        stringBuilderUpdate = stringBuilderUpdate.append("'" + offer.getExternalSourceId() + "'");
        statement.addBatch(stringBuilderUpdate.toString());
        queryBuilder = queryBuilder.append(stringBuilderUpdate.toString() + SEMI_COLON);
        if (count > LIMIT) {
            countUpdate = statement.executeBatch();
            LOG.info("DB update count : " + countUpdate.length);
            count = 0;
        }
    }
    if (count > 0) {
        // Execute the remaining batch
        countUpdate = statement.executeBatch();
    }
}
Can anybody please tell me how I can make this code more efficient?
Thanks
PS: Please ignore any syntax errors in the above code; it works fine. Some parts may be missing because of copy and paste.

The fastest way to import data into MySQL is to use LOAD DATA INFILE, or mysqlimport, which is a command-line interface to LOAD DATA INFILE. Both load data from a file, preferably one residing on a local filesystem.
When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times faster than using INSERT statements.
Therefore, despite the fact that you don't want to use the filesystem, I'd suggest considering a dump to a file, transferring it to the MySQL server, and using the above-mentioned means to load the data.
All these tasks can be fully automated via scripting.
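For example, here is a rough Java sketch of such a weekly job, reusing the DBConnectionFactory, SOURCE_DB and QUERY_SELECT from your code; TARGET_DB, the offers table name and the tab-separated format are assumptions to adapt, and LOCAL loading needs local_infile enabled on the MySQL server plus allowLoadLocalInfile=true on the JDBC URL:
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class WeeklyOfferTransfer {
    public static void main(String[] args) throws Exception {
        Path csv = Files.createTempFile("offers", ".tsv");

        // 1. Stream the Teradata result set straight to a temporary file
        //    (no Offer objects held in memory).
        try (Connection teradata = DBConnectionFactory.getDBConnection(SOURCE_DB);
             Statement stmt = teradata.createStatement();
             ResultSet rs = stmt.executeQuery(QUERY_SELECT);
             BufferedWriter out = Files.newBufferedWriter(csv)) {
            while (rs.next()) {
                // Assumes the values contain no tabs or newlines.
                out.write(rs.getString("EXT_SOURCE_ID") + "\t"
                        + rs.getString("CLIENT_OFFER_ID") + "\t"
                        + rs.getString("UPC_ID"));
                out.newLine();
            }
        }

        // 2. Bulk-load the file into MySQL. Needs local_infile enabled on the
        //    server and allowLoadLocalInfile=true on the MySQL JDBC URL.
        try (Connection mysql = DBConnectionFactory.getDBConnection(TARGET_DB); // TARGET_DB is assumed
             Statement stmt = mysql.createStatement()) {
            stmt.execute("LOAD DATA LOCAL INFILE '"
                    + csv.toAbsolutePath().toString().replace('\\', '/')
                    + "' INTO TABLE offers"
                    + " FIELDS TERMINATED BY '\\t'"
                    + " (EXT_SOURCE_ID, CLIENT_OFFER_ID, UPC_ID)");
        }

        Files.delete(csv);
    }
}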

Related

Jdbc Batch Insert Optimization

I'm trying to insert data in batches into a PostgreSQL db. With a batch size of 30 it takes ~17 seconds to insert each batch, which seems incredibly slow. If I persist 10,000 records it will take well over an hour. I need help speeding this up; I have added ?reWriteBatchedInserts=true onto the end of my db connection URL. I have used prepared statements, which were faster, but very clunky and manual. I've also tried Hibernate, and that was just as slow. I would like a more Spring-based approach to batch inserting, hence the auto-generated SQL statements along with the BeanMap that maps to the insert statement, without having to manually set all fields using statement.setString(1, position.getCurrency) etc. The reason I don't want to use those prepared statements with all the manual setup (even though it is faster) is that I have some tables with hundreds of rows, which will become a pain to maintain if changes occur.
Here is my DB Structure:
My PG Database is version 11.16, and my postgres dependency in gradle is 42.5.0.
Any thoughts on why this is taking so long to insert into the DB? I am using NamedParameterJdbcTemplate. If you need any more information please let me know.
CODE:
String positionSql = createInsert("INSERT INTO %s(%s) VALUES(%s)", positionTableMapping, positionColumnMappings);
List<Map<String, Object>> posBuffer = new ArrayList<>();
for (BPSPositionTable position : bpsPositions) {
    posBuffer.add(BeanMap.create(position));
    if ((count + 1) % batchSize == 0 || (count + 1) == bpsPositions.size()) {
        jdbcTemplate.batchUpdate(positionSql, SqlParameterSourceUtils.createBatch(posBuffer));
        posBuffer.clear();
        count = 0;
    }
    count++;
}

How to insert records into RDS MySQL more efficiently and quickly using the JDBC API

I'm trying to insert nearly 200,000 records read from a CSV file into RDS (MySQL) using a Lambda function. The complete insert takes nearly 10 minutes, which is very concerning. I would like to know how to speed up the insertion.
Techniques I tried:
Using a PreparedStatement for batch insertion, as in the code below:
BufferedReader lineReader =
        new BufferedReader(new InputStreamReader(inputStream, Charset.defaultCharset())); // inputStream is data from the CSV file
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) { // connection is a JDBC connection instance
    LOGGER.debug("Processing Insert");
    Stream<String> lineStream = lineReader.lines().skip(1);
    List<String> collect = lineStream.collect(Collectors.toList());
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // remaining code that sets the other columns
        batchStatement.addBatch();
        batchStatement.executeBatch();
        batchStatement.clearBatch();
    }
    batchStatement.executeBatch();
    connection.commit();
} catch (Exception e) {
    // exception handling code
} finally {
    lineReader.close();
    connection.close();
}
Implemented rewriteBatchedStatements=true in the connection URL.
Please suggest anything feasible in this case for inserting data into RDS (MySQL) faster.
Only execute the batch in chunks, such as 100 at a time, not one at a time as you have it now:
int rows = 0; // outside the loop
...
if ((++rows % 100) == 0) {
    batchStatement.executeBatch();
}
// Don't reset the batch, as this would wipe the 99 previous rows:
// batchStatement.clearBatch();
Also: turning off auto-commit mode will improve bulk updates; remember to reset it afterwards if you are not using addBatch or if connections are re-used:
connection.setAutoCommit(false);
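Putting both points together with the loop from your question (INSERT_QUERY, connection, lineReader and the column mapping are taken from it; the batch size of 1000 is just a starting point to tune):
connection.setAutoCommit(false);
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) {
    List<String> lines = lineReader.lines().skip(1).collect(Collectors.toList());
    int rows = 0;
    for (String line : lines) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // ... set the remaining columns ...
        batchStatement.addBatch();
        if ((++rows % 1000) == 0) {
            batchStatement.executeBatch(); // one round trip per 1000 rows
        }
    }
    batchStatement.executeBatch();         // flush the final partial batch
    connection.commit();
} finally {
    connection.setAutoCommit(true);        // reset if the connection is re-used
}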
1. LOAD DATA INFILE into a separate table, t1.
2. Cleanse the data. That is, fix anything that needs modification, perform normalization, etc.
3. INSERT INTO the real table (...) SELECT ... FROM t1.
If you need further discussion, please provide, in SQL, the table schema and any transforms needed by my step 2. Also, a few rows of sample data may help.
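A rough JDBC sketch of that pattern, with the file path, the staging table t1, the cleansing statement, and the real table's columns all placeholders to replace once the schema is known (LOCAL loading also needs local_infile enabled on the server and allowLoadLocalInfile=true on the JDBC URL):
try (Statement stmt = connection.createStatement()) {
    // 1. Bulk-load the raw CSV into the staging table.
    stmt.execute("LOAD DATA LOCAL INFILE '/tmp/input.csv' INTO TABLE t1"
            + " FIELDS TERMINATED BY ',' IGNORE 1 LINES");

    // 2. Cleanse / normalize in place (placeholder transform).
    stmt.executeUpdate("UPDATE t1 SET some_column = TRIM(some_column)");

    // 3. Move the cleansed rows into the real table in a single statement.
    stmt.executeUpdate("INSERT INTO real_table (col1, col2) SELECT col1, col2 FROM t1");
}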

How to efficiently load data from CSV into Database?

I have a CSV/TSV file with data and want to load that CSV data into Database. I am using Java or Python and PostgreSQL to do that (I can't change that).
The problem is that I make an INSERT query for each row, and that is not efficient if I have, let's say, 600,000 rows. Is there a more efficient way to do it?
I was wondering if I could take multiple rows and create just one big query and execute it on my database, but I'm not sure if that helps at all, or whether I should divide the data into, say, 100 pieces and execute 100 queries.
If the CSV file is compatible with the format required by copy from stdin, then the most efficient way is to use the CopyManager API.
See this answer or this answer for example code.
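A minimal sketch of the CopyManager approach, assuming a plain (non-pooled) PostgreSQL driver connection and placeholder database, table, and file names:
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CsvCopy {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader reader = new FileReader("/path/to/data.csv")) {
            // The cast works for a raw driver connection; with a pool, unwrap it first.
            CopyManager copyManager = new CopyManager((BaseConnection) conn);
            long rows = copyManager.copyIn(
                    "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", reader);
            System.out.println("Copied " + rows + " rows");
        }
    }
}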
If your input file isn't compatible with Postgres' copy command, you will need to write the INSERT yourself. But you can speed up the process by using JDBC batching:
Something along these lines:
PreparedStatement insert = connection.prepareStatement("insert into ...");
int batchSize = 1000;
int batchRow = 0;
// iterate over the lines from the file
while (...) {
    // ... parse the line, extract the columns ...
    insert.setInt(1, ...);
    insert.setString(2, ...);
    insert.setXXX(...);
    insert.addBatch();
    batchRow++;
    if (batchRow == batchSize) {
        insert.executeBatch();
        batchRow = 0;
    }
}
insert.executeBatch();
Using reWriteBatchedInserts=true in your JDBC URL will improve performance even more.
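For example (the host, port, database name, and credentials are placeholders):
String url = "jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true";
Connection connection = DriverManager.getConnection(url, "user", "password");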
Assuming the server can access the file directly, you could try using the COPY FROM command. If your CSV is not in the right format, it might still be faster to transform it into something the COPY command will handle (e.g. while copying it to a location that the server can access).

Writing data to remote mongodb database

I am reading a CSV file line by line and upserting data into a MongoDB database. It takes approximately 2 minutes to read, process, and write the data from all files to MongoDB when the db and the files are on the same machine, whereas when the db is located on another machine on my network, it takes around 5 minutes. It takes even more time on a remote machine. Can anyone please help me reduce the time? Thanks.
An approach to your problem that reduces processing time:
To read the CSV file and put it into MongoDB, use an ETL tool such as Kettle.
http://wiki.pentaho.com/display/BAD/Write+Data+To+MongoDB
This will improve the time it takes to go from reading the CSV to writing into MongoDB.
The simplest way to get the data onto the remote machine:
Export the data from your local db and import it on the remote machine.
https://docs.mongodb.com/v2.6/core/import-export/
Hope it helps!
I saw that you are using Java to load your Mongo database.
Recent versions of the Java driver allow bulk operations, so you can send a batch of inserts to Mongo instead of sending them one by one. This speeds up inserts into MongoDB a lot.
DBCollection collection = db.getCollection("my_collection");
List<DBObject> list = new ArrayList<>();
for (int i = 0; i < 100; i++) {
    // generate your data
    BasicDBObject obj = new BasicDBObject("key", "value");
    list.add(obj);
}
collection.insert(list); // bulk insert of 100 objects!
This is available since Mongo 2.6 : https://docs.mongodb.com/manual/reference/method/Bulk.insert/

How do I persist data to disk, and both randomly update it, and stream it efficiently back into RAM?

I need to store up to tens or even hundreds of millions of pieces of data on-disk. Each piece of data contains information like:
id=23425
browser=firefox
ip-address=10.1.1.1
outcome=1.0
New pieces of data may be added at a rate of up to 1 per millisecond.
So it's a relatively simple set of key-value pairs, where the values can be strings, integers, or floats. Occasionally I may need to update the piece of data with a particular id, changing the flag field from 0 to 1. In other words, I need to be able to do random key lookups by id and modify the data (actually only the floating point "outcome" field, so I'll never need to modify the size of the value).
The other requirement is that I need to be able to stream this data off disk (the order isn't particularly important) efficiently. This means that the hard disk head should not need to jump around the disk to read the data; rather, it should be read in consecutive disk blocks.
I'm writing this in Java.
I've thought about using an embedded database, but DB4O is not an option as it is GPL and the rest of my code is not. I also worry about the efficiency of using an embedded SQL database, given the overhead of translating to and from SQL queries.
Does anyone have any ideas? Might I have to build a custom solution to this (where I'm dealing directly with ByteBuffers, and handling the id lookup)?
How about H2? The License should work for you.
You can use H2 for free. You can integrate it into your application (including commercial applications), and you can distribute it. Files containing only your code are not covered by this license (it is 'commercial friendly'). Modifications to the H2 source code must be published. You don't need to provide the source code of H2 if you did not modify anything.
I get
1000000 insert in 22492ms (44460.252534234394 row/sec)
100000 updates in 9565ms (10454.783063251438 row/sec)
from
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Random;

/**
 * @author clint
 */
public class H2Test {

    static int testrounds = 1000000;

    public static void main(String[] args) {
        try {
            Class.forName("org.h2.Driver");
            Connection conn = DriverManager.getConnection("jdbc:h2:/tmp/test.h2", "sa", "");
            // add application code here
            conn.createStatement().execute("DROP TABLE IF EXISTS TEST");
            conn.createStatement().execute("CREATE TABLE IF NOT EXISTS TEST(id INT PRIMARY KEY, browser VARCHAR(64),ip varchar(16), outcome real)");
            //conn.createStatement().execute("CREATE INDEX IDXall ON TEST(id,browser,ip,outcome)");

            PreparedStatement ps = conn.prepareStatement("insert into TEST (id, browser, ip, outcome) values (?,?,?,?)");
            long time = System.currentTimeMillis();
            for (int i = 0; i < testrounds; i++) {
                ps.setInt(1, i);
                ps.setString(2, "firefox");
                ps.setString(3, "000.000.000.000");
                ps.setFloat(4, 0);
                ps.execute();
            }
            long last = System.currentTimeMillis();
            System.out.println(testrounds + " insert in " + (last - time) + "ms (" + ((testrounds) / ((last - time) / 1000d)) + " row/sec)");

            ps.close();
            ps = conn.prepareStatement("update TEST set outcome = 1 where id=?");
            Random random = new Random();
            time = System.currentTimeMillis();
            // randomly update 10% of the entries
            for (int i = 0; i < testrounds / 10; i++) {
                ps.setInt(1, random.nextInt(testrounds));
                ps.execute();
            }
            last = System.currentTimeMillis();
            System.out.println((testrounds / 10) + " updates in " + (last - time) + "ms (" + ((testrounds / 10) / ((last - time) / 1000d)) + " row/sec)");

            conn.close();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (SQLException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
JDBM is a great embedded database for Java (and not as encumbered with licensing as the Java edition of Berkeley DB). It would be worth trying. If you don't need ACID guarantees (i.e. you are OK with the database getting corrupted in the event of a crash), turn off the transaction manager (this significantly increases speed).
I think you'd have a lot more success writing something that caches the most active records in memory and queues data changes as low-priority inserts into the DB.
I understand there's a slight increase in IO using this method, but if you're talking about millions of records, I think it would still be faster, because any search algorithm you create is going to be greatly outperformed by a full-fledged database engine.
You could try Berkeley DB, which is now owned by Oracle. They have open source and commercial licenses. It uses a key/value model (with an option to create indexes if other forms of queries are required). There is a pure Java version and a native version with Java bindings.
http://www.zentus.com/sqlitejdbc/
SQLite database (public domain), JDBC connector with BSD license, native for a whole bunch of platforms (OSX, Linux, Windows), emulation for the rest.
You can use Apache Derby (or JavaDB), which is bundled with the JDK. However, if a DBMS doesn't provide the required speed, you may implement a specific file structure yourself. If just exact key lookup is required, you can use a hash file to implement it. A hash file is the fastest file structure for such requirements (much faster than general-purpose file structures such as B-trees and grids, which are used in DBs). It also provides acceptable streaming efficiency.
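If you do end up rolling your own, here is a rough sketch of the fixed-size-record idea with an in-memory id-to-offset index, random in-place updates of the outcome field, and sequential streaming; the record layout and field sizes are made up for illustration, and for hundreds of millions of ids the index itself would need a more compact structure:
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class RecordStore {
    // Fixed record layout: int id, 64-byte browser, 16-byte ip, float outcome.
    private static final int RECORD_SIZE = 4 + 64 + 16 + 4;
    private final RandomAccessFile file;
    private final Map<Integer, Long> offsets = new HashMap<>(); // id -> file offset

    public RecordStore(String path) throws Exception {
        file = new RandomAccessFile(path, "rw");
    }

    public void append(int id, String browser, String ip, float outcome) throws Exception {
        long offset = file.length();
        ByteBuffer buf = ByteBuffer.allocate(RECORD_SIZE);
        buf.putInt(id);
        buf.put(pad(browser, 64));
        buf.put(pad(ip, 16));
        buf.putFloat(outcome);
        file.seek(offset);
        file.write(buf.array());
        offsets.put(id, offset);
    }

    // Random update: only the fixed-position outcome field is rewritten
    // (assumes the id was appended earlier).
    public void updateOutcome(int id, float outcome) throws Exception {
        long offset = offsets.get(id);
        file.seek(offset + 4 + 64 + 16);
        file.writeFloat(outcome);
    }

    // Sequential streaming: records are read back in consecutive blocks.
    public void streamAll() throws Exception {
        file.seek(0);
        byte[] record = new byte[RECORD_SIZE];
        while (file.getFilePointer() + RECORD_SIZE <= file.length()) {
            file.readFully(record);
            // ... decode and process the record ...
        }
    }

    private static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        byte[] src = s.getBytes();
        System.arraycopy(src, 0, out, 0, Math.min(src.length, len));
        return out;
    }
}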
In the end I decided to log the data to disk as it comes in, and also keep it in memory where I can update it. After a period of time I write the data out to disk and delete the log.
Have you taken a look at Oracle's 'TimesTen' database? It's an in-memory db that is supposed to be very high-performance. I don't know about costs/licenses, etc., but take a look at Oracle's site and search for it. An eval download should be available.
I'd also take a look to see if there's anything existing based on either EHCache or JCS that might help.
