I have written the program below to achieve this:
try {
    PreparedStatement statement = connection.prepareStatement(
            "SELECT * FROM some_table WHERE some_timestamp < ?");
    statement.setTimestamp(1, new java.sql.Timestamp(dt.getTime()));
    ResultSet resultSet = statement.executeQuery();
    CSVWriter csvWriter = new CSVWriter(new FileWriter(activeDirectory + "/archive_data" + timeStamp + ".csv"), ',');
    csvWriter.writeAll(resultSet, true);
    csvWriter.flush();
} catch (Exception e) {
    e.printStackTrace();
}
// delete from table
try {
    PreparedStatement statement = connection.prepareStatement(
            "DELETE FROM some_table WHERE some_timestamp < ?");
    statement.setTimestamp(1, new java.sql.Timestamp(dt.getTime()));
    statement.executeUpdate();
} catch (Exception e) {
    e.printStackTrace();
}
}
dbUtil.close(connection);
The above program works fine for an average scenario, but I would like to know how I can improve it so that it:
Works smoothly for a million records without overloading the application server
Archives and then purges exactly the same records, considering that many records will be getting inserted into the same table while this program runs
Update: I'm using opencsv http://opencsv.sourceforge.net/
I would like to suggest several things:
Refrain from using time as the cut-off point; it can cause unpredictable bugs. Time can differ between machines and environments, so be careful with it. Instead of time, use a sequence (see the sketch after these suggestions).
Use a connection pool to get data from the database.
Save the information from the DB in several files. You can store them on different drives and concatenate the information from them afterwards.
Use memory-mapped files.
Use a multi-threaded model for reading and storing/restoring the information. Note: a single JDBC connection should not be shared across many threads, so a connection pool is your helper here.
And these steps cover only the Java part. You also need a good design on the DB side. Not easy, right? But that is the price of working with large data.
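To make the advice about archiving and then purging exactly the same set of rows concrete, here is a minimal sketch. It assumes some_table has a monotonically increasing key column (called id here, which is an assumption) and that connection, dt, activeDirectory and timeStamp are set up as in the question; the opencsv package name depends on the version you use:

import java.io.FileWriter;
import java.sql.*;

import au.com.bytecode.opencsv.CSVWriter; // package name differs in newer opencsv versions

// Sketch: pin an upper bound on the hypothetical "id" column once, then archive
// and delete exactly that bounded set inside a single transaction.
void archiveAndPurge(Connection connection, java.util.Date dt,
                     String activeDirectory, String timeStamp) throws Exception {
    connection.setAutoCommit(false);
    try {
        long maxId;
        try (PreparedStatement bound = connection.prepareStatement(
                "SELECT MAX(id) FROM some_table WHERE some_timestamp < ?")) {
            bound.setTimestamp(1, new Timestamp(dt.getTime()));
            try (ResultSet rs = bound.executeQuery()) {
                rs.next();
                maxId = rs.getLong(1); // 0 if nothing matches, so nothing is archived or deleted
            }
        }

        // Archive everything up to the pinned bound; rows inserted later get larger ids.
        try (PreparedStatement select = connection.prepareStatement(
                 "SELECT * FROM some_table WHERE id <= ?");
             CSVWriter csvWriter = new CSVWriter(
                 new FileWriter(activeDirectory + "/archive_data" + timeStamp + ".csv"), ',')) {
            select.setLong(1, maxId);
            try (ResultSet rs = select.executeQuery()) {
                csvWriter.writeAll(rs, true);
            }
        }

        // Purge exactly the same bounded set.
        try (PreparedStatement delete = connection.prepareStatement(
                "DELETE FROM some_table WHERE id <= ?")) {
            delete.setLong(1, maxId);
            delete.executeUpdate();
        }
        connection.commit();
    } catch (Exception e) {
        connection.rollback();
        throw e;
    }
}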
Related
I'm trying to insert nearly 200,000 records, read from a CSV file, into RDS (MySQL) using a Lambda function. The insert takes nearly 10 minutes to complete, which is very concerning. I would like to know how to increase the insertion speed.
Techniques I tried:
Using a PreparedStatement for batch insertion, as in the code below:
BufferedReader lineReader =
    new BufferedReader(new InputStreamReader(inputStream, Charset.defaultCharset())); // inputStream is data from the CSV file
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) { // connection is a JDBC connection instance
    LOGGER.debug("Processing Insert");
    Stream<String> lineStream = lineReader.lines().skip(1);
    List<String> collect = lineStream.collect(Collectors.toList());
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // remaining code of setting data
        batchStatement.addBatch();
        batchStatement.executeBatch();
        batchStatement.clearBatch();
    }
    batchStatement.executeBatch();
    connection.commit();
} catch (Exception e) {
    // throw exception code
} finally {
    lineReader.close();
    connection.close();
}
Set rewriteBatchedStatements=true in the connection URL
Please suggest anything feasible in this case for inserting data into RDS (MySQL) faster.
Only execute the batch in chunks, such as 100 rows at a time, not one at a time as you have it now:
int rows = 0; // outside the loop
...
if((++rows % 100) == 0) {
batchStatement.executeBatch();
}
// Don't reset the batch as this will wipe the 99 previous rows:
//batchStatement.clearBatch();
Also: turning off auto-commit mode will improve bulk updates; remember to reset it afterwards if you are not using addBatch or if connections are re-used:
connection.setAutoCommit(false);
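Putting both together, a sketch of the loop with chunked flushes (the chunk size of 100 is just the example figure above; INSERT_QUERY, connection and collect are as in the question):

// Sketch: add rows to the batch and flush every 100 rows; common drivers such as
// MySQL Connector/J reset the batch after executeBatch(), so no clearBatch() is needed.
connection.setAutoCommit(false);
int rows = 0;
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) {
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // ... set the remaining columns ...
        batchStatement.addBatch();
        if ((++rows % 100) == 0) {
            batchStatement.executeBatch();
        }
    }
    batchStatement.executeBatch(); // flush the final partial chunk
    connection.commit();
} finally {
    connection.setAutoCommit(true); // reset if the connection is re-used
}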
LOAD DATA INFILE into a separate table, t1.
Cleanse the data. That is, fix anything that needs modification, perform normalization, etc.
INSERT INTO real table (...) SELECT ... FROM t1.
If you need further discussion, please provide, in SQL, the table schema and any transforms needed by my step 2. Also, a few rows of sample data may help.
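For reference, a sketch of how the LOAD DATA route might look from the Java side. It assumes the CSV has been written to a temporary file first (the path, table and column names here are made up), that allowLoadLocalInfile=true is set on the Connector/J URL, and that local_infile is enabled on the MySQL/RDS side:

// Sketch: stage the CSV with LOAD DATA LOCAL INFILE, then move it into the real table.
try (Statement stmt = connection.createStatement()) {
    stmt.execute(
        "LOAD DATA LOCAL INFILE '/tmp/data.csv' " +
        "INTO TABLE t1 " +
        "FIELDS TERMINATED BY ',' " +
        "IGNORE 1 LINES");

    // Cleanse/normalize t1 here (step 2 above), then:
    stmt.executeUpdate("INSERT INTO real_table (col1, col2) SELECT col1, col2 FROM t1");
}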
We have to read a lot of data from an HDD (~50 GB) into our database, but our multithreaded procedure is pretty slow (~2 h for ~10 GB) because of a thread lock inside org.sqlite.core.NativeDB.reset[native] (see the thread sampler).
We read our data relatively fast and use our insert method to execute a prepared statement, but we only commit all these statements to the database once we have collected about 500,000 data sets. Currently we use JDBC as the interface to our SQLite database.
Everything works fine so far if you use one thread in total. But if you want to use multiple threads, you do not see much of a performance/speed increase, because only one thread can run at a time, not in parallel.
We already reuse our prepared statement, and all threads use one instance of our Database class to prevent file locks (there is one connection to the database).
Unfortunately we have no clue how to improve our insert method any further. Is anyone able to give us some tips/solutions, or a way to avoid this NativeDB.reset method?
We do not have to use SQLite, but we would like to use Java.
(Threads are named 1,2,...,15)
private String INSERT = "INSERT INTO urls (url) VALUES (?);";

public void insert(String urlFromFile) {
    try {
        preparedStatement.setString(1, urlFromFile);
        preparedStatement.executeUpdate();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
Updated insert method as suggested by @Andreas, but it is still throwing some exceptions:
public void insert(String urlFromFile) {
    try {
        preparedStatement.setString(1, urlFromFile);
        preparedStatement.addBatch();
        ++callCounter;
        if (callCounter % 500000 == 0 && callCounter > 0) {
            preparedStatement.executeBatch();
            commit();
            System.out.println("Exec");
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
java.lang.ArrayIndexOutOfBoundsException: 9
at org.sqlite.core.CorePreparedStatement.batch(CorePreparedStatement.java:121)
at org.sqlite.jdbc3.JDBC3PreparedStatement.setString(JDBC3PreparedStatement.java:421)
at UrlDatabase.insert(UrlDatabase.java:85)
Most databases have some sort of bulk insert functionality, though there's no standard for it, AFAIK.
PostgreSQL has COPY and MySQL has LOAD DATA, for instance.
I don't think that SQLite has this facility, though - it might be worth switching to a database that does.
SQLite has no write concurrency.
The fastest way to load a large amount of data is to use a single thread (and a single transaction) to insert everything into the DB (and not to use WAL).
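A minimal sketch of that single-writer approach, assuming the reader threads hand their URLs to a single writer (the method below stands in for whatever passes the strings over; the table name is taken from the question):

import java.sql.*;

// Sketch: one thread, one connection, one big transaction, flushed in chunks.
static void insertAll(Iterable<String> urls) throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:urls.db");
         PreparedStatement ps = conn.prepareStatement("INSERT INTO urls (url) VALUES (?)")) {
        conn.setAutoCommit(false);          // a single transaction for the whole load
        long count = 0;
        for (String url : urls) {
            ps.setString(1, url);
            ps.addBatch();
            if (++count % 500_000 == 0) {   // same chunk size as in the question
                ps.executeBatch();
            }
        }
        ps.executeBatch();
        conn.commit();
    }
}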
So, I'm trying to insert data to a database completely dynamically, meaning the data will be inserted without any knowledge of what table we're inserting into. The same goes for the different attributes.
My problem is that for some reason, my table names get wrapped in '' when preparing the statement. It goes as follows:
String query = "INSERT INTO ? VALUES(?)";
try {
    Connection conn = Connection.create();
    PreparedStatement st = conn.prepareStatement(query);
    st.setString(1, "test");
    st.setString(2, "1");
    st.addBatch();
    //... (executes batch later on...)
} catch (Exception e) {
    e.printStackTrace();
} finally {
    DbUtils.closeQuietly(rs);
    DbUtils.closeQuietly(conn);
}
Now for some reason, and completely beyond me, if I print this statement I get:
"INSERT INTO 'test' VALUES("1");
while I would expect
"INSERT INTO test VALUES("1");
Could anyone explain why this happens and how I can solve it, or suggest a better way to achieve what I'm trying to do? If it matters, I'm using DbUtils to handle closing, and c3p0 for connection pooling.
I have been looking all over without any luck. I would also appreciate it if anyone could tell me whether inserting data dynamically into different tables is generally a bad thing.
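For what it's worth, JDBC bind parameters can only stand in for values, never for identifiers such as table or column names, which is why setString(1, "test") ends up as the string literal 'test'. A hedged sketch of the usual workaround is below: validate the table name against a known list, then concatenate it into the SQL text, keeping only the values as ? placeholders (the names here are made up):

import java.sql.*;
import java.util.*;

// Sketch: the table name is checked against a whitelist and concatenated;
// only the column values remain as bind parameters.
private static final Set<String> ALLOWED_TABLES =
        new HashSet<>(Arrays.asList("test", "other_table"));

void insertInto(Connection conn, String table, String value) throws SQLException {
    if (!ALLOWED_TABLES.contains(table)) {
        throw new IllegalArgumentException("Unknown table: " + table);
    }
    try (PreparedStatement st = conn.prepareStatement(
            "INSERT INTO " + table + " VALUES (?)")) {
        st.setString(1, value);
        st.executeUpdate();
    }
}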
Solved, of course after posting here it hit me... Now using different drivers from http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC#Download that don't need extensive configuration.
Original question below the break.
I'm fooling around with a SQLite database containing OpenStreetMap data, and I'm having some trouble with JDBC.
The query below is the one I'd like to use to get a location close to the location of my user quickly (the numbers are from my test data and are added by the Java code).
SELECT roads.nodeID, lat, lon
FROM roads
INNER JOIN nodes
ON roads.nodeID=nodes.nodeID
ORDER BY (ABS(lat - (12.598418)) + ABS(lon - (-70.043514))) ASC
LIMIT 1
'roads' and 'nodes' both contain approximately 130,000 rows.
This specific query is one of the most intensive, but it's only used twice, so that should be OK for my needs. It executes in about 281 ms when using Firefox SQLite, but in Java using sqlitejdbc-v056 it takes between 12 and 14 seconds (with full processor load).
Any clues on how to fix this?
public Node getNodeClosestToLocation(Location loc) {
    try {
        Class.forName("org.sqlite.JDBC");
        Statement stat = conn.createStatement();
        String q = "SELECT roads.nodeID, lat, lon " +
                   "FROM roads " +
                   "INNER JOIN nodes " +
                   "ON roads.nodeID=nodes.nodeID " +
                   "ORDER BY (ABS(lat - (" + loc.getLat() + ")) + " +
                   "ABS(lon - (" + loc.getLon() + "))) ASC " +
                   "LIMIT 1";
        long start = System.currentTimeMillis();
        System.out.println(q);
        rs = stat.executeQuery(q);
        if (rs.next()) {
            System.out.println("Done. " + (System.currentTimeMillis() - start));
            return new Node(rs.getInt("nodeID"), rs.getFloat("lat"), rs.getFloat("lon"));
        }
    } catch (SQLException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
    return null;
}
Select queries in JDBC can be painfully slow if they are not used correctly. A few points:
Make sure you index the proper columns in your table. A simple line such as:
Statement stat = connection.createStatement();
stat.executeUpdate("create index {index_name} on orders({column_name});");
stat.close();
Creating an index: http://www.w3schools.com/sql/sql_create_index.asp
Insertion takes longer with indices, since each index needs to be updated as new records are inserted. Creating an index is therefore best done after all the insert statements have been executed (better performance). Indexed columns take a hit on insertion performance but have significantly faster selection performance.
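Applied to the tables from the question, that might look like the sketch below (assuming roads.nodeID and nodes.nodeID are not indexed yet; run it once, after the bulk load):

// Sketch: index the join keys the query uses; IF NOT EXISTS keeps it idempotent.
try (Statement stat = conn.createStatement()) {
    stat.executeUpdate("CREATE INDEX IF NOT EXISTS idx_roads_nodeID ON roads(nodeID);");
    stat.executeUpdate("CREATE INDEX IF NOT EXISTS idx_nodes_nodeID ON nodes(nodeID);");
}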
Changing the JDBC driver may help slightly but overall should not be the underlying issue. Also make sure you're running in native mode. Pure-java mode is significantly slower, at least from what I've noticed. The following code segment will tell you which mode you're running, assuming you're using SQLite JDBC.
System.out.println(String.format("%s mode", SQLiteJDBCLoader.isNativeMode() ? "native" : "pure-java"));
I experienced the same issue with slow selections in a database with more than 500K records. The run time of my application would have been 9.9 days if I had not indexed. Now it is a blazing-fast 2 minutes to do the exact same thing. SQLite is very fast when proper, optimized SQL is used.
Using a PreparedStatement might give you slightly better performance, but nothing of the magnitude described here.
Perhaps Firefox SQLite is using some hints. You could try to get the execution plan to see where the query is doing the hard work and create some index if required.
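With SQLite you can get that plan directly over JDBC, for example (a sketch; conn and q are the connection and query string from the question, and the detail column is present in both the old and the new plan output formats):

// Sketch: print SQLite's plan to see whether the join uses an index or scans the table.
try (Statement stat = conn.createStatement();
     ResultSet plan = stat.executeQuery("EXPLAIN QUERY PLAN " + q)) {
    while (plan.next()) {
        System.out.println(plan.getString("detail"));
    }
}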
Have you tried to log any timing information to make sure it is not getting the connection which is expensive?
Can anyone out there provide an example of bulk inserts via JConnect (with ENABLE_BULK_LOAD) to Sybase ASE?
I've scoured the internet and found nothing.
I got in touch with one of the engineers at Sybase and they provided me a code sample. So, I get to answer my own question.
Basically, here is a rundown, as the code sample is pretty large... This assumes a lot of pre-initialized variables; otherwise it would be a few hundred lines. Anyone interested should get the idea. This can yield up to 22K insertions a second in a perfect world (as per Sybase, anyway).
SybDriver sybDriver = (SybDriver) Class.forName("com.sybase.jdbc3.jdbc.SybDriver").newInstance();
sybDriver.setVersion(com.sybase.jdbcx.SybDriver.VERSION_6);
DriverManager.registerDriver(sybDriver);

// DB props (after including the normal login/password etc.)
props.put("ENABLE_BULK_LOAD", "true");

// open connection here for sybDriver

dbConn.setAutoCommit(false);

String SQLString = "insert into batch_inserts (row_id, colname1, colname2)\n values (?,?,?) \n";

PreparedStatement pstmt;
try {
    pstmt = dbConn.prepareStatement(SQLString);
} catch (SQLException sqle) {
    displaySQLEx("Couldn't prepare statement", sqle);
    return;
}

for (String[] val : valuesToInsert) {
    pstmt.setString(1, val[0]); // row_id varchar(30)
    pstmt.setString(2, val[1]); // colname1 varchar(30)
    pstmt.setString(3, val[2]); // colname2 varchar(30)
    try {
        pstmt.addBatch();
    } catch (SQLException sqle) {
        displaySQLEx("Failed to build batch", sqle);
        break;
    }
}

try {
    pstmt.executeBatch();
    dbConn.commit();
    pstmt.close();
} catch (SQLException sqle) {
    // handle
}

try {
    if (dbConn != null)
        dbConn.close();
} catch (Exception e) {
    // handle
}
After following most of your advice, we didn't see any improvement over simply creating a massive string and sending that across in batches of ~100-1000 rows with a surrounding transaction. We got around:
Big String Method [5000 rows in 500 batches]: 1716 ms = ~2914 rows per second.
(this is shit!)
Our db is sitting on a virtual host with one CPU (i7 underneath) and the table schema is:
CREATE TABLE
archive_account_transactions
(
account_transaction_id INT,
entered_by INT,
account_id INT,
transaction_type_id INT,
DATE DATETIME,
product_id INT,
amount float,
contract_id INT NULL,
note CHAR(255) NULL
)
with four indexes on account_transaction_id (pk), account_id, DATE, contract_id.
Just thought I would post a few comments. First, we're connecting using:
jdbc:sybase:Tds:40.1.1.2:5000/ikp?EnableBatchWorkaround=true;ENABLE_BULK_LOAD=true
We also tried the .addBatch syntax described above, but it was marginally slower than just using a Java StringBuilder to build the batch SQL manually and pushing it across in one execute statement. Removing the column names from the insert statement gave us a surprisingly large performance boost; it seemed to be the only thing that actually affected the performance. The ENABLE_BULK_LOAD param didn't seem to affect it at all, nor did EnableBatchWorkaround. We also tried DYNAMIC_PREPARE=false, which sounded promising but also didn't seem to do anything.
Any help getting these parameters actually functioning would be great! In other words, are there any tests we could run to verify that they are in effect? I'm still convinced that this performance isn't close to pushing the boundaries of Sybase, as MySQL out of the box does more like 16,000 rows per second using the same "big string method" with the same schema.
Cheers
Rod
In order to get the sample provided by Chris Kannon working, do not forget to disable auto commit mode first:
dbConn.setAutoCommit(false);
And place the following line before dbConn.commit():
pstmt.executeBatch();
Otherwise this technique will only slow down the insertion.
Don't know how to do this in Java, but you can bulk-load text files with the LOAD TABLE SQL statement. We did it with Sybase ASA over jConnect.
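From Java it would just be an ordinary statement execution, roughly like the sketch below (LOAD TABLE is SQL Anywhere/ASA syntax; the file path is resolved on the database server, and the table, columns and path here are made up):

// Sketch: LOAD TABLE runs on the server, so the file must be readable by the
// database server process, not by the JDBC client.
try (Statement stmt = dbConn.createStatement()) {
    stmt.executeUpdate(
        "LOAD TABLE batch_inserts (row_id, colname1, colname2) " +
        "FROM '/data/batch_inserts.txt' " +
        "DELIMITED BY ','");
}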
Support for Batch Updates
Batch updates allow a Statement object to submit multiple update commands
as one unit (batch) to an underlying database for processing together.
Note: To use batch updates, you must refresh the SQL scripts in the sp directory
under your jConnect installation directory.
See BatchUpdates.java in the sample (jConnect 4.x) and sample2 (jConnect
5.x) subdirectories for an example of using batch updates with Statement,
PreparedStatement, and CallableStatement.
jConnect also supports dynamic PreparedStatements in batch.
Reference:
http://download.sybase.com/pdfdocs/jcg0420e/prjdbc.pdf
http://manuals.sybase.com/onlinebooks/group-jcarc/jcg0520e/prjdbc/#ebt-link;hf=0;pt=7694?target=%25N%14_4440_START_RESTART_N%25#X
Other Batch Update Resources
http://java.sun.com/j2se/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html
http://www.jguru.com/faq/view.jsp?EID=5079