I have a CSV/TSV file with data and want to load that data into a database. I am using Java or Python and PostgreSQL to do that (I can't change that).
The problem is that I issue an INSERT query for each row, which is not efficient when I have, say, 600,000 rows. Is there a more efficient way to do it?
I was wondering if I could gather more rows into one big query and execute that against my database, but I'm not sure whether that helps at all, or whether I should divide the data into, say, 100 pieces and execute 100 queries.
If the CSV file is compatible with the format required by COPY ... FROM STDIN, then the most efficient way is to use the CopyManager API.
See this answer or this answer for example code.
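For reference, a minimal sketch of the CopyManager approach with the pgjdbc driver; the table name, file name, and connection details are placeholders you would replace with your own:
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvCopyLoader {
    public static void main(String[] args) throws Exception {
        try (Connection connection = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader reader = new FileReader("data.csv")) {

            CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();

            // COPY ... FROM STDIN streams the file through the JDBC connection,
            // so the file does not need to be visible to the database server.
            long rowsCopied = copyManager.copyIn(
                "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", reader);

            System.out.println("Rows copied: " + rowsCopied);
        }
    }
}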
If your input file isn't compatible with Postgres' COPY command, you will need to write the INSERT statements yourself. But you can speed up the process by using JDBC batching.
Something along these lines:
PreparedStatement insert = connection.prepareStatement("insert into ...");
int batchSize = 1000;
int batchRow = 0;
// iterate over the lines from the file
while (...) {
    ... parse the line, extract the columns ...
    insert.setInt(1, ...);
    insert.setString(2, ...);
    insert.setXXX(...);
    insert.addBatch();
    batchRow++;
    if (batchRow == batchSize) {
        insert.executeBatch();
        batchRow = 0;
    }
}
// flush the final, partially filled batch
insert.executeBatch();
Using reWriteBatchedInserts=true in your JDBC URL will improve performance even more.
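For example (host, port, database, and credentials are placeholders):
// reWriteBatchedInserts=true lets the PostgreSQL JDBC driver rewrite batched INSERTs
// into multi-row INSERT statements, cutting down on round trips.
String url = "jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true";
Connection connection = DriverManager.getConnection(url, "user", "password");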
Assuming the server can access the file directly, you could try using the COPY FROM command. If your CSV is not in the right format, it might still be faster to transform it into something the COPY command will accept (e.g. while copying it to a location that the server can access).
Related
I'm trying to insert nearly 200,000 records read from a CSV file into RDS (MySQL) using a Lambda function. The insert takes nearly 10 minutes to complete, which is very concerning. I would like to know how to increase the insertion speed.
Techniques I tried:
Using a PreparedStatement for batch insertion, as in the code below:
BufferedReader lineReader =
    new BufferedReader(new InputStreamReader(inputStream, Charset.defaultCharset())); // inputStream is data from the CSV file
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) { // connection is a JDBC connection instance
    LOGGER.debug("Processing Insert");
    Stream<String> lineStream = lineReader.lines().skip(1);
    List<String> collect = lineStream.collect(Collectors.toList());
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // remaining code of setting data
        batchStatement.addBatch();
        batchStatement.executeBatch();
        batchStatement.clearBatch();
    }
    batchStatement.executeBatch();
    connection.commit();
} catch (Exception e) {
    // throw exception code
} finally {
    lineReader.close();
    connection.close();
}
Implemented rewriteBatchedStatements=true in the connection URL.
Please suggest anything feasible in this case for inserting data into RDS (MySQL) faster.
Only execute the batch in chunks, such as 100 rows at a time, not one row at a time as you have it now:
int rows = 0; // outside the loop
...
if ((++rows % 100) == 0) {
    batchStatement.executeBatch();
}
// Don't reset the batch as this will wipe the 99 previous rows:
//batchStatement.clearBatch();
Also: changing the auto-commit mode will improve bulk updates; remember to reset it back afterwards if connections are re-used or you are not using addBatch everywhere:
connection.setAutoCommit(false);
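Putting both pieces together, a minimal sketch of the corrected loop (it reuses INSERT_QUERY and the collect list from your code; the column mapping and chunk size are placeholders, and it assumes rewriteBatchedStatements=true stays in your URL):
connection.setAutoCommit(false); // one commit at the end instead of one per statement

try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) {
    int rows = 0;
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // ... set the remaining columns ...
        batchStatement.addBatch();

        // send the accumulated rows to the server in chunks, not one at a time
        if ((++rows % 100) == 0) {
            batchStatement.executeBatch();
        }
    }
    batchStatement.executeBatch(); // flush the final, partially filled chunk
    connection.commit();
    connection.setAutoCommit(true); // restore the default if the connection is re-used
}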
1. LOAD DATA INFILE into a separate table, t1.
2. Cleanse the data. That is, fix anything that needs modification, perform normalization, etc.
3. INSERT INTO the real table (...) SELECT ... FROM t1.
If you need further discussion, please provide, in SQL, the table schema and any transforms needed by my step 2. Also, a few rows of sample data may help.
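For illustration, a rough JDBC sketch of the three steps above, assuming LOAD DATA LOCAL INFILE is permitted (allowLoadLocalInfile=true on the Connector/J URL and local_infile enabled on the server); the staging table t1, the real table, the file path, and the columns are placeholders:
try (Statement stmt = connection.createStatement()) {
    // 1. Bulk-load the raw CSV into the staging table
    stmt.execute("LOAD DATA LOCAL INFILE '/tmp/input.csv' INTO TABLE t1 "
            + "FIELDS TERMINATED BY ',' IGNORE 1 LINES");

    // 2. Cleanse / normalize inside the database (placeholder transform)
    stmt.executeUpdate("UPDATE t1 SET col1 = TRIM(col1)");

    // 3. Move the cleansed rows into the real table
    stmt.executeUpdate("INSERT INTO real_table (col1, col2) SELECT col1, col2 FROM t1");
}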
My use case is that I have to run a query on an RDS instance and it returns 2 million records. Now I want to copy the result directly to disk instead of bringing it into memory first and then copying it to disk.
The following statement will bring all the records into memory; I want to stream the results directly to a file on disk.
SelectQuery<Record> abc = dslContext.selectQuery().fetch();
Can anyone suggest a pointer?
Update1:
I found the following way to read it:
try (Cursor<BookRecord> cursor = create.selectFrom(BOOK).fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
How many records does it fetch at once, and are those records brought into memory first?
Update2:
The MySQL driver fetches all the records at once by default. If the fetch size is set to Integer.MIN_VALUE, it fetches one record at a time. If you want to fetch the records in batches, set useCursorFetch=true in the connection properties.
Related wiki: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html
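To make that concrete, a minimal plain-JDBC sketch (host, database, credentials, and the query are placeholders):
// useCursorFetch=true makes Connector/J use a server-side cursor,
// so setFetchSize(n) really does fetch n rows per round trip.
String url = "jdbc:mysql://localhost:3306/mydb?useCursorFetch=true";

try (Connection connection = DriverManager.getConnection(url, "user", "password");
     PreparedStatement stmt = connection.prepareStatement("SELECT * FROM book")) {
    stmt.setFetchSize(1000);
    try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
            // write each row straight to the file on disk instead of keeping it in memory
        }
    }
}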
Your approach using the ResultQuery.fetchLazy() method is the way to go for jOOQ to fetch records one at a time from JDBC. Note that you can use Cursor.fetchNext(int) to fetch a batch of records from JDBC as well.
There's a second thing you might need to configure, and that's the JDBC fetch size, see Statement.setFetchSize(int). This configures how many rows are fetched by the JDBC driver from the server in a single batch. Depending on your database / JDBC driver (e.g. MySQL), the default would again be to fetch all rows in one go. In order to specify the JDBC fetch size on a jOOQ query, use ResultQuery.fetchSize(int). So your loop would become:
try (Cursor<BookRecord> cursor = create
        .selectFrom(BOOK)
        .fetchSize(size)
        .fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
Please read your JDBC driver's manual about how it interprets the fetch size, noting that MySQL is "special".
I am working on a monitoring tool developed in Spring Boot, using Hibernate as the ORM.
I need to compare each row (already-persisted rows of sent messages) in my table and see whether a MailId (unique) has received feedback (status: OPENED, BOUNCED, DELIVERED...) or not.
I get the feedback by reading CSV files from a network folder. Parsing and reading the files is very fast, but updating my database is very slow. My algorithm is not very efficient because I loop through a list that can hold hundreds of thousands of objects and look each one up in my table.
This is the method that makes the update in my table by updating the "target" object (a row in the database table):
@Override
public void updateTargetObjectFoo() throws CSVProcessingException, FileNotFoundException {
    // Here I call performProcessing, which reads files from a folder, parses them into Java objects,
    // and maps them into a feedBackList of type Foo
    List<Foo> feedBackList = performProcessing(env.getProperty("foo_in"), EXPECTED_HEADER_FIELDS_STATUS, Foo.class, ".LETTERS.STATUS.");
    for (Foo foo : feedBackList) {
        // findByKey does a simple SELECT in MySQL where MailId = foo.getMailId()
        Foo persistedFoo = fooDao.findByKey(foo.getMailId());
        if (persistedFoo != null) {
            persistedFoo.setStatus(foo.getStatus());
            persistedFoo.setDnsCode(foo.getDnsCode());
            persistedFoo.setReturnDate(foo.getReturnDate());
            persistedFoo.setReturnTime(foo.getReturnTime());
            // saveAccount here does a MySQL UPDATE on the table
            fooDao.saveAccount(foo);
        }
    }
}
What if I do this selection/comparison and update on the Java side and then re-update the whole list in the database?
Would that be faster?
Thanks to all for your help.
Hibernate is not particularly well-suited for batch processing.
You may be better off using Spring's JdbcTemplate to do JDBC batch processing.
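For illustration, a minimal JdbcTemplate sketch of that idea; the foo table, its column names, and the getter types are assumptions based on your entity:
// One UPDATE per MailId, sent to MySQL in batches of 1000 instead of
// one SELECT plus one UPDATE per row.
jdbcTemplate.batchUpdate(
    "UPDATE foo SET status = ?, dns_code = ?, return_date = ?, return_time = ? WHERE mail_id = ?",
    feedBackList,
    1000, // batch size
    (ps, foo) -> {
        ps.setString(1, foo.getStatus());
        ps.setString(2, foo.getDnsCode());
        ps.setObject(3, foo.getReturnDate());
        ps.setObject(4, foo.getReturnTime());
        ps.setString(5, foo.getMailId());
    });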
However, if you must do this via Hibernate, this may help: https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/chapters/batch/Batching.html
I have to transfer around 5 million rows of data from Teradata to MySQL. Can anyone please suggest the fastest way to do this over the network, without using the filesystem? I am new to Teradata and MySQL. I want to run this transfer as a batch job on a weekly basis, so I am looking for a solution that can be fully automated. Any suggestions or hints will be greatly appreciated.
I have already written code using JDBC to get the records from Teradata and insert them into MySQL, but it is very slow, so I am looking to make that code more efficient. I kept the question generic because I didn't want the solution to be constrained by my implementation; along with making the existing code more efficient, I am open to other alternatives. But I don't want to use the filesystem, since the scripts would not be easy to maintain or update.
My implementation:
Getting records from Teradata:
connection = DBConnectionFactory.getDBConnection(SOURCE_DB);
statement = connection.createStatement();
rs = statement.executeQuery(QUERY_SELECT);
while (rs.next()) {
    Offer offer = new Offer();
    offer.setExternalSourceId(rs.getString("EXT_SOURCE_ID"));
    offer.setClientOfferId(rs.getString("CLIENT_OFFER_ID"));
    offer.setUpcId(rs.getString("UPC_ID"));
    offers.add(offer);
}
Inserting the records into MySQL:
int count = 0;
if (isUpdated) {
    for (Offer offer : offers) {
        count++;
        stringBuilderUpdate = new StringBuilder();
        stringBuilderUpdate = stringBuilderUpdate.append(QUERY_INSERT);
        stringBuilderUpdate = stringBuilderUpdate.append("'" + offer.getExternalSourceId() + "'");
        statement.addBatch(stringBuilderUpdate.toString());
        queryBuilder = queryBuilder.append(stringBuilderUpdate.toString() + SEMI_COLON);
        if (count > LIMIT) {
            countUpdate = statement.executeBatch();
            LOG.info("DB update count : " + countUpdate.length);
            count = 0;
        }
    }
    if (count > 0) {
        // Execute the remaining batch
        countUpdate = statement.executeBatch();
    }
}
Can anybody please tell me how to make this code more efficient?
Thanks
PS: Please ignore any syntax errors in the above code; it works fine. Some info might be missing because of copy and paste.
The fastest method of importing data into MySQL is LOAD DATA INFILE, or mysqlimport (a command-line interface to LOAD DATA INFILE); both involve loading data from a file, preferably one residing on the local filesystem.
When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times faster than using INSERT statements.
Therefore, despite the fact that you don't want to use the filesystem, I'd suggest considering creating a dump to a file, transferring it to the MySQL server, and using the above-mentioned means to load the data.
All these tasks can be fully automated via scripting.
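A rough sketch of the first half of that pipeline in Java, assuming teradataConnection is an open JDBC connection and the columns match your Offer mapping; quoting and escaping are left out for brevity:
// Stream the Teradata result set straight to a CSV file on disk,
// row by row, without building up a list of Offer objects in memory.
try (Statement statement = teradataConnection.createStatement();
     ResultSet rs = statement.executeQuery(QUERY_SELECT);
     BufferedWriter writer = Files.newBufferedWriter(Paths.get("/tmp/offers.csv"))) {
    while (rs.next()) {
        writer.write(rs.getString("EXT_SOURCE_ID") + ","
                + rs.getString("CLIENT_OFFER_ID") + ","
                + rs.getString("UPC_ID"));
        writer.newLine();
    }
}
// Then load /tmp/offers.csv with LOAD DATA [LOCAL] INFILE or mysqlimport as described above.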
I have a .data file given in the above format. I am writing a program in Java that will take the values from the .data file and put them in a buffer. My Java program is connected to MySQL (Windows) via JDBC. So I need to read the values from the file, given in the above format, and put them in the buffer like:
Insert Into building values ("--", "---",----)
In this way I store these values, and JDBC will populate the database tables on MySQL (Windows). Please tell me the best way.
Check out the answers to this question for reading file lines and splitting them into chunks. I know the question says Groovy, but most answers are Java. Then insert the values you retrieved via JDBC.
Actually, since your data file is obviously CSV, you could also use a CSV library like OpenCSV to read the values.
The data is in CSV format, so use a CSV library to parse the file and then just add some JDBC code to insert it into the database.
Or just call MySQL CSV import command from Java:
try {
    // Execute the mysqlimport command with arguments
    String command = "mysqlimport [options] db_name textfile1 [textfile2 ...]";
    Process child = Runtime.getRuntime().exec(command);
} catch (IOException e) {
    // handle or at least log a failure to launch mysqlimport
}
This is the fourth question for the same task... If your data file is well formatted like in the example you provided, then you don't have to split the line into values:
Source: "AAH196","Austin","TX","Virginia Beach","VA"
Target: INSERT INTO BUILDING VALUES("AAH196","Austin","TX","Virginia Beach","VA");
<=> "INSERT INTO BUILDING VALUES(" + Source + ");"
Just take a complete row from your CSV file and concatenate an SQL expression.
(See my answer to question 1 of 4. BTW, if SQL injection is a potential problem, splitting the line into values is not a solution either.)
You can bind your CSV to Java beans using OpenCSV:
http://opencsv.sourceforge.net/
You can make these beans persistent using an ORM framework like Hibernate, Cayenne, or plain JPA; these are annotation-based and map your fields to tables easily without your writing any SQL statements.
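A minimal OpenCSV sketch of the bean-binding idea, assuming a hypothetical Building bean and a header row in the CSV (use @CsvBindByPosition instead if the file has no header); the class, columns, and file name are placeholders:
public class Building {
    @CsvBindByName(column = "id")   // maps the "id" header column onto this field
    private String id;

    @CsvBindByName(column = "city")
    private String city;

    // getters and setters omitted
}

// Parse the whole file into beans, which an ORM (Hibernate, JPA, ...) can then persist.
List<Building> buildings = new CsvToBeanBuilder<Building>(new FileReader("building.csv"))
        .withType(Building.class)
        .build()
        .parse();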
This would be a perfect job for Groovy. Here's a gist with a small skeleton script to build upon.