Multiple statements in one SQL-query, or use Batch? - java

I'm sending a long list of updates to a database from a Java program. I'm wondering if there's a speed difference between:
Putting all the updates in one query and executing that
Making a PreparedStatement, adding every update to the batch and executing the batch

Using PreparedStatement and batching would be the preferred approach. It reduces the network traffic between the client and the database server.

Yes, it does!
It will reduce the traffic between the application and the server.
I know you mentioned Java, but just as an example, here is a hint from a C# book I'm reading:
NOTE: FREE PERFORMANCE UPGRADE!
Setting UpdateBatchSize to 0 is a quick way to boost the update performance of the DbDataAdapter object. (Setting the value to 0 instructs the DbDataAdapter object to create the largest possible batch size for changes.)
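For the Java case in the question, the batching approach might look roughly like the sketch below. The table and column names (item, status, id) and the class name are made up for illustration; only the java.sql calls are real JDBC API, and how much the driver actually gains from the batch depends on the driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchUpdateSketch {
    // Hypothetical table and columns, for illustration only.
    static final String SQL = "UPDATE item SET status = ? WHERE id = ?";

    /**
     * Queues one parameter set per row and sends them all with a single
     * executeBatch() call. Each row is {status, id}.
     * Returns the number of update counts the driver reported.
     */
    public static int updateAll(Connection con, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch(); // queued on the client, nothing is sent yet
            }
            return ps.executeBatch().length; // the whole batch is handed to the driver here
        }
    }
}
```

The statement text is prepared once and only the parameter values change per row, which is what distinguishes this from concatenating many UPDATE statements into one string.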

Related

IO With Callback to set Database status

Given this:
public void do(RequestObject request, Callback<RequestObject> callback);
Where Callback is called when the request is processed. One client has to set the status of the request in the database. The client fetches some items, passes them to the above method, and the callback sets the status.
It was working OK for a small number of items and slower IO. But now the IO has been sped up and the status is written to the database very frequently. This is causing my database (MySQL) to make a large number of disk read and write calls. My disk usage goes through the roof.
I was thinking of aggregating the status updates, but power is not reliable, so that is not a plausible solution. How should I redesign this?
EDIT
When the process is started I insert a value, and when there is an update, I fetch the item and update it. @user2612030 Your question led me to believe that using Hibernate might be what is causing more reads than necessary.
I can upgrade my disk drive to SSD but that would only do so much. I want a solution that scales.
An SSD is a good starting point, more RAM to MySQL should also help. It can't get rid of the writes, but with enough RAM (and MySQL configured to use it!) there should be few physical reads. If you are using the default configuration, tune it. See for example https://www.percona.com/blog/2016/10/12/mysql-5-7-performance-tuning-immediately-after-installation/ or just search for MySQL memory configuration.
You could also add disks and spread the writes to multiple disks with multiple controllers. That should also help a bit.
It is hard to give good advice without knowing how you record status values. Inserts or updates? How many records are there? Data model? However, to really scale you need to shard the data somehow. That way one server can handle data in one range and another server data in another range and so on.
For write-heavy applications that is non-trivial to set up with MySQL unless you do the sharding in the application code. Most solutions with replication work best for read-mostly applications. You may want to look into a NoSQL database, for example MongoDB, that has been designed for distributing writes from the outset. MongoDB has other challenges (eventual consistency), but it can deliver scalable writes.
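To make the "sharding in the application code" idea a bit more concrete, here is a minimal sketch of routing by id range. The class name, the server names and the range boundaries are all invented for the example; a real setup would also need a plan for rebalancing ranges.

```java
import java.util.Map;
import java.util.TreeMap;

public class RangeShardRouter {
    // Maps the lowest id of each range to the server responsible for that range.
    private final TreeMap<Long, String> lowerBounds = new TreeMap<>();

    /** Registers a server for all ids from firstId up to the next registered boundary. */
    public void addShard(long firstId, String server) {
        lowerBounds.put(firstId, server);
    }

    /** Returns the server whose range contains the given id. */
    public String serverFor(long id) {
        Map.Entry<Long, String> entry = lowerBounds.floorEntry(id);
        if (entry == null) {
            throw new IllegalArgumentException("no shard covers id " + id);
        }
        return entry.getValue();
    }
}
```

With shards registered at 0 ("db-a") and 1000000 ("db-b"), a status write for id 42 goes to db-a and one for id 1000000 goes to db-b, so each server only sees the writes for its own range.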

Apply Batch keyword to select statements

Is it possible to execute a batch of select statements using DSE Cassandra, or should I consider a design change?
The reason is that I have a lot of select queries I wish to execute against my DB cluster and I'm not sure how to go about it. I have deleted all my secondary indexes, so I'm not using those anymore.
That won't work, and even if it did, it wouldn't be advisable.
You won't receive the results in a way that you can use; there is no result set.
Even if that worked, the batch query would be much less performant than doing them serially due to the way Cassandra batching is implemented.
Batching only works well if the keys (write executions) are distributed in an equal way, and this is only worth it if you want to do all the updates as a transaction.
So in summary, you should definitely consider a design change.
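As a rough illustration of running the selects as individual queries instead of a batch: the sketch below starts one query per key and collects the results. It deliberately does not use the Cassandra driver; the query parameter is a stand-in for whatever asynchronous per-key call your driver offers, and starting the queries concurrently (rather than strictly one after another) is my own assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class IndividualSelects {
    /**
     * Starts one query per key, then waits for each and collects the results by key.
     * 'query' is a placeholder for the driver's own async select call.
     */
    public static <K, V> Map<K, V> fetchAll(List<K> keys,
                                            Function<K, CompletableFuture<V>> query) {
        Map<K, CompletableFuture<V>> inFlight = new LinkedHashMap<>();
        for (K key : keys) {
            inFlight.put(key, query.apply(key)); // each select is its own request
        }
        Map<K, V> results = new LinkedHashMap<>();
        // join() blocks until that key's query has completed.
        inFlight.forEach((key, future) -> results.put(key, future.join()));
        return results;
    }
}
```

Each key still gets its own usable result, which is exactly what a batch would not give you.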

PreparedStatement.addBatch and thousands of rows from a file and a confusion

Hi, I am trying to write to Sybase IQ using JDBC from a file which contains thousands of rows. People say that I should use batch updates. So I am reading the file with NIO and adding it to PreparedStatement batches. But I don't see any advantage here; for all the rows I need to do the following:
PreparedStatement prepStmt = con.prepareStatement(
"UPDATE DEPT SET MGRNO=? WHERE DEPTNO=?");
prepStmt.setString(1,mgrnum1);
prepStmt.setString(2,deptnum1);
prepStmt.addBatch();
I don't understand what the advantage of batches is. I have to execute addBatch() thousands of times anyway, once for every record in the file. Or should I even be using addBatch() to write records from a file to Sybase IQ? Please guide me. Thanks a lot.
With batch updates, basically, you're cutting down on your Network I/O overhead. It's providing the benefits analogous to what a BufferedWriter provides you while writing to the disk. That's basically what this is: buffering of database updates.
Any kind of I/O has a cost; be it disk I/O or network. By buffering your inserts or updates in a batch and doing a bulk update you're minimizing the performance hit incurred every time you hit the database and come back.
The performance hit becomes even more obvious in case of a real world application where the database server is almost always under some load serving other clients as opposed to development where you're the only one.
When paired with a PreparedStatement, bulk updates are even more efficient because the statement is pre-compiled and the execution plan is cached throughout the execution of the batch. So the binding of variables happens as per your chosen batch size, and then a single executeBatch() call persists all the values in one go.
The advantage of addBatch is that it allows the JDBC driver to write chunks of data instead of sending single insert statements to the database.
This can be faster in certain situations, but real-life performance may vary.
It should also be noted that batches of 50-100 rows are recommended, rather than adding all the data to a single batch.
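A minimal sketch of that chunking, using the UPDATE statement from the question. The class and method names are mine, and the return value (the number of executeBatch() calls) is only there to make the effect visible; a batch size of 50-100 is the starting point suggested above, not a measured optimum.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChunkedBatchSketch {
    /**
     * Updates DEPT in batches of at most batchSize rows. Each row is {mgrno, deptno}.
     * Returns the number of executeBatch() calls that were made.
     */
    public static int updateInChunks(Connection con, List<String[]> rows, int batchSize)
            throws SQLException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        int flushes = 0;
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE DEPT SET MGRNO=? WHERE DEPTNO=?")) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch(); // send this chunk
                    flushes++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // send the final, partially filled chunk
                flushes++;
            }
        }
        return flushes;
    }
}
```

So addBatch() is still called once per record, but with 250 records and a batch size of 100 the driver is asked to send data only 3 times instead of 250.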

Multi threaded insert using ORM?

I have one application where "persisting to database" is consuming 85% time of the entire application flow.
I was thinking of using multiple threads to do the inserts, because the inserts are mostly independent here. Is there any way to achieve multi-threaded inserts using any JPA implementation? Or is it even worth doing multi-threaded inserts from a performance perspective?
Note: Inserts are in the range of 10K to 100K records in a single run. Also performance is very very critical here.
Thanks.
Multi-threading insert statements on a database won't really make it perform any faster, because in most databases the table requires a lock for an insert. So your threads will just be waiting for the one before it to finish up and unlock the table before the next can insert, which really doesn't make it any more multi-threaded than with a single thread. If you were to do it, it would most likely slow it down.
If you are inserting 10k-100k records, you should consider using either batch insert statements or the bulk insert commands that are native to the database you're using. The fastest way would be the native bulk insert commands, but that would require you to bypass JPA and work directly with JDBC calls for the inserts you want to use bulk commands on.
If you don't want to play around with native bulk commands, I recommend using Spring's JdbcTemplate, which has templated batch insert commands. It is very fast; I use it to batch insert 10k-20k entities every 30 seconds on a high-transaction system and I am very pleased with the performance.
Lastly, make sure your database tables are optimized with the correct indexes, keys and options. Since your database is the bottleneck this should be one of the first places you look to increase performance.
"Multi-threading insert statements on database won't really make it perform any faster because in most databases the table requires a lock for an insert. So your threads will just be waiting for the one before it to finish up and unlock the table before the next can insert - which really doesn't make it any more multi-threaded than with a single thread. If you where to do it, it would most likely slow it down."
Are you saying concurrent inserts from different db connections on the same table require exclusive locks to complete? I tested this on Oracle, and I didn't find this to be the case. Do you actually have a test case to back up what you wrote here?
Anyway, bulk insert is of course a lot faster than one insert at a time.
Are you periodically flushing your session when doing this? If not, you can hit nasty slowdowns that have nothing to do with the database. Generally, you want to "batch" the inserts by periodically calling flush() and then clear() on your session (assuming you are using some variant of JPA).
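The flush-then-clear pattern can be sketched like this. To keep the example compilable without a JPA dependency, Session here is a small stand-in interface I made up; it only mirrors the persist(), flush() and clear() methods that a JPA EntityManager has, and in real code you would pass the EntityManager itself.

```java
import java.util.List;

public class FlushClearSketch {
    /** Stand-in for the three EntityManager methods used below (an assumption, not JPA itself). */
    public interface Session {
        void persist(Object entity);
        void flush();
        void clear();
    }

    /**
     * Persists all entities, flushing and clearing the session every batchSize entities
     * so the persistence context does not keep growing. Returns the number of flushes.
     */
    public static int persistAll(Session session, List<?> entities, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        int flushes = 0;
        int count = 0;
        for (Object entity : entities) {
            session.persist(entity);
            if (++count % batchSize == 0) {
                session.flush(); // push pending inserts to the database
                session.clear(); // detach the entities that were just written
                flushes++;
            }
        }
        if (count % batchSize != 0) {
            session.flush(); // write the remaining entities
            session.clear();
            flushes++;
        }
        return flushes;
    }
}
```

Without the clear(), every persisted entity stays managed, and the provider has more and more objects to check on each flush, which is where the slowdown comes from.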
This article has many tips to improve batch writing performance with JPA. I'll quote the two that should give you the best result for fast reference.
Optimization #6 - Sequence Pre-allocation
We have optimized the first part of the application, reading from the MySQL database. The second part is to optimize the writing to Oracle.
The biggest issue with the writing process is that the Id generation is using an allocation size of 1. This means that for every insert there will be an update and a select for the next sequence number. This is a major issue, as it is effectively doubling the amount of database access. By default JPA uses a pre-allocation size of 50 for TABLE and SEQUENCE Id generation, and 1 for IDENTITY Id generation (a very good reason to never use IDENTITY Id generation). But frequently applications are unnecessarily paranoid of holes in their Id values and set the pre-allocation value to 1. By changing the pre-allocation size from 1 to 500, we reduce about 1000 database accesses per page.
Optimization #8 - Batch Writing
Many databases provide an optimization that allows a batch of write operations to be performed as a single database access. There is both parametrized and dynamic batch writing. For parametrized batch writing a single parametrized SQL statement can be executed with a batch of parameter values instead of a single set of parameter values. This is very optimal as the SQL only needs to be executed once, and all of the data can be passed optimally to the database.
Dynamic batch writing requires dynamic (non-parametrized) SQL that is batched into a single big statement and sent to the database all at once. The database then needs to process this huge string and execute each statement. This requires the database do a lot of work parsing the statement, so is not always optimal. It does reduce the database access, so if the database is remote or poorly connected with the application, this can result in an improvement.
In general parametrized batch writing is much more optimal, and on Oracle it provides a huge benefit, whereas dynamic does not. JDBC defines the API for batch writing, but not all JDBC drivers support it, some support the API but then execute the statements one by one, so it is important to test that your database supports the optimization before using it. In EclipseLink batch writing is enabled using the persistence unit property "eclipselink.jdbc.batch-writing"="JDBC".
Another important aspect of using batch writing is that you must have the same SQL (DML actually) statement being executed in a grouped fashion in a single transaction. Some JPA providers do not order their DML, so you can end up ping-ponging between two statements such as the order insert and the order-line insert, making batch writing ineffective. Fortunately EclipseLink orders and groups its DML, so usage of batch writing reduces the database access from 500 order inserts and 5000 order-line inserts to 55 (default batch size is 100). We could increase the batch size using "eclipselink.jdbc.batch-writing.size", so increasing the batch size to 1000 reduces the database accesses to 6 per page.
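The access counts in that quote are just a ceiling division per statement type, which is easy to check. The sketch below also collects the two property names from the quote into a map; the class and method names are mine, and how you hand the properties to your persistence unit is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchWritingMath {
    /** Database accesses needed to send 'rows' statements in batches of 'batchSize'. */
    public static int accesses(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize; // ceiling division
    }

    /** The two EclipseLink persistence unit properties named in the quoted article. */
    public static Map<String, String> batchProperties(int batchSize) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("eclipselink.jdbc.batch-writing", "JDBC");
        props.put("eclipselink.jdbc.batch-writing.size", String.valueOf(batchSize));
        return props;
    }
}
```

With the default batch size, accesses(500, 100) + accesses(5000, 100) is 5 + 50 = 55, and with a batch size of 1000 it is 1 + 5 = 6, matching the numbers in the article. This only holds if the provider groups the order inserts and the order-line inserts, as described above.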

Multiple insert speed up

I'm parsing large log files (5+ GB) and extracting ad hoc profiling lines (call name and execution time). I want to insert those lines into a MySQL DB.
My question is: should I execute the insert statement every time I get a line while parsing, or is there some best practice to speed everything up?
If there is any way that you could do a bulk insert, that would help a lot (or at least send your data to the database in batches, instead of making separate calls each time).
Edit
LOAD DATA INFILE sounds even faster ;o)
https://web.archive.org/web/20150413042140/http://jeffrick.com/2010/03/23/bulk-insert-into-a-mysql-database
There are better options.
See http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
In your case, I think writing the relevant records to a file and then using LOAD DATA INFILE is the best approach.
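A rough sketch of the "write the records to a file, then LOAD DATA INFILE" route. The table profile_line and its columns call_name and exec_ms are hypothetical, as are the class and method names. The file format relies on the LOAD DATA defaults (tab-separated fields, newline-terminated lines, backslash as escape character), and the statement uses the LOCAL variant, which has to be allowed on both client and server.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LoadDataFileSketch {
    /** Writes one tab-separated, newline-terminated line per record: {callName, execMillis}. */
    public static void writeTsv(Path file, List<String[]> records) throws IOException {
        StringBuilder out = new StringBuilder();
        for (String[] record : records) {
            out.append(escape(record[0])).append('\t').append(escape(record[1])).append('\n');
        }
        Files.write(file, out.toString().getBytes(StandardCharsets.UTF_8));
    }

    // Escapes backslash, tab and newline so they are not read as separators.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    /** Builds the LOAD DATA statement for the file (hypothetical table and columns). */
    public static String loadStatement(Path file) {
        String path = file.toAbsolutePath().toString()
                .replace('\\', '/')    // forward slashes also work on Windows
                .replace("'", "\\'");  // keep the quoted path literal intact
        return "LOAD DATA LOCAL INFILE '" + path + "'"
                + " INTO TABLE profile_line (call_name, exec_ms)";
    }
}
```

You would write the file while parsing the log and then execute the returned statement once through a plain java.sql.Statement, so the parser never talks to the database line by line.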
For small updates, the number of transactions is critical for performance. So if you can perform a number of inserts in the same transaction, it will go much faster. I would try 100 inserts per transaction first.
If you don't want to follow the recommendations in Galz's link (which is excellent, BTW), then try to open the connection and prepare the statement once, then loop over your log files carrying out the inserts (using the prepared statement), then finally close the statement and connection once at the end. It's not the fastest way of doing the inserts, but it's the fastest way that sticks to a "normal" JDBC approach.
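Preparing the statement once and grouping the inserts into transactions can be combined as in the sketch below. The table profile_line, its columns and the Line class are hypothetical, and 100 inserts per transaction is only the starting value suggested above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class GroupedInsertSketch {
    /** One parsed profiling line: call name and execution time. */
    public static final class Line {
        final String callName;
        final long execMillis;

        public Line(String callName, long execMillis) {
            this.callName = callName;
            this.execMillis = execMillis;
        }
    }

    // Hypothetical table and columns, for illustration only.
    static final String SQL = "INSERT INTO profile_line (call_name, exec_ms) VALUES (?, ?)";

    /** Inserts all lines, committing once every perTransaction inserts. Returns the number of commits. */
    public static int insertAll(Connection con, List<Line> lines, int perTransaction)
            throws SQLException {
        if (perTransaction < 1) {
            throw new IllegalArgumentException("perTransaction must be at least 1");
        }
        int commits = 0;
        con.setAutoCommit(false); // otherwise every insert is its own transaction
        try (PreparedStatement ps = con.prepareStatement(SQL)) { // prepared once, reused below
            int pending = 0;
            for (Line line : lines) {
                ps.setString(1, line.callName);
                ps.setLong(2, line.execMillis);
                ps.executeUpdate();
                if (++pending == perTransaction) {
                    con.commit();
                    commits++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                con.commit(); // commit the last, partially filled group
                commits++;
            }
        } catch (SQLException e) {
            con.rollback(); // discard the uncommitted group before reporting the error
            throw e;
        }
        return commits;
    }
}
```

This still sends one statement per line, so it is slower than a JDBC batch or LOAD DATA INFILE, but it avoids paying for a commit on every single insert.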
From java
JDBC batch insert
Example:
You do this with every insert: http://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/Indexer.java#232
You do this with every batch http://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/lucene/Indexer.java#371
The size of the batch can be determined by the available memory.
Aside from insert speed, the other problem you may run into is memory. Whatever approach you use, you will still need to consider your memory usage as the records are loaded from the file. Unless you have a hard requirement on processing speed, it may be better to use an approach with a predictable memory footprint.
