Process large amount data using hibernate


I am using Hibernate to process data in my application. The application works fine, but I am facing a time-related performance problem. The scenario: I have one remote table that contains around 100,000 rows. I have to insert that data into a local database table (with a different structure) using a mapping (so that I know which remote column corresponds to which local column). Processing that data takes 9 hours. I am executing native SQL queries. Is that causing the performance issue? Any suggestions will be appreciated.

Set the following Hibernate properties to enable batching:
You need to clear the Session once a batch is processed to clear memory. This allows you to use a smaller Heap size, therefore reducing the chance of long GC runs:
session.flush();
session.clear();
Use the new identifier generators and in case you use DB sequences you can choose the pooled-lo optimizer. Using a hi/lo algorithm will reduce sequence calls and increase performance.
Don't use the identity generator, because it makes Hibernate disable JDBC batching for inserts.
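The flush-and-clear pattern above can be sketched as follows. This is a minimal sketch, not the answerer's code: `SessionLike` is a hypothetical stand-in for the three `org.hibernate.Session` methods involved, so the snippet compiles with only the JDK, and the batch size is an assumed value to tune. The three property names are standard Hibernate settings, but which ones the answer originally listed is not known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class BatchInsertSketch {

    /** Hypothetical stand-in for the org.hibernate.Session methods used here. */
    interface SessionLike {
        void persist(Object entity);
        void flush();
        void clear();
    }

    /** Standard Hibernate property names; the values are assumptions to tune. */
    static Properties batchingProperties(int batchSize) {
        Properties props = new Properties();
        props.setProperty("hibernate.jdbc.batch_size", Integer.toString(batchSize));
        props.setProperty("hibernate.order_inserts", "true");
        props.setProperty("hibernate.order_updates", "true");
        return props;
    }

    /**
     * Persists every entity, flushing and clearing the session after each
     * full batch and once more for a trailing partial batch.
     * Returns the number of flush/clear cycles performed.
     */
    static int persistInBatches(SessionLike session, List<?> entities, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int cycles = 0;
        int pending = 0;
        for (Object entity : entities) {
            session.persist(entity);
            if (++pending == batchSize) {
                session.flush();  // send the queued INSERTs to the database
                session.clear();  // detach the entities so they can be garbage collected
                pending = 0;
                cycles++;
            }
        }
        if (pending > 0) {
            session.flush();
            session.clear();
            cycles++;
        }
        return cycles;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        SessionLike fake = new SessionLike() {
            public void persist(Object e) { log.add("persist"); }
            public void flush() { log.add("flush"); }
            public void clear() { log.add("clear"); }
        };
        int cycles = persistInBatches(fake, List.of("a", "b", "c", "d", "e"), 2);
        System.out.println(cycles + " flush/clear cycles, " + log.size() + " calls");
    }
}
```

With a real Hibernate `Session` the loop body stays the same. Keeping the loop's batch size equal to `hibernate.jdbc.batch_size` is a common choice, so each flush corresponds to whole JDBC batches.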

Related

How to efficiently export/import database data with JDBC

I have a Java application that can use a SQL database from any vendor. So far we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later into a different instance of the application. The DB is pretty big, so there are many rows in it. The export and import process has to be done from inside the Java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of its size, and secondly, executing a lot of inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process or there is no "elegant" solution?

Why not do the export/import in a single step with batching (for performance) and chunking (to avoid errors and to provide a checkpoint from which to resume after a failure)?
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk-size, which might need tuning for the specific target database (big enough to speed things up but small enough to make the chunk not exceed some database limit and create failures).
As already proposed, each of these chunks should then be executed in a transaction, and your application should remember which chunk it last executed successfully, so that after an error it can continue from there on the next run.
For the chunks themselves, you really should use LIMIT and OFFSET, together with an ORDER BY on a unique key so that the chunk boundaries stay stable between queries.
This way, you can repeat any chunk at any time, each chunk by itself is atomic and it should perform much better than with single row statements.
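The multi-row INSERT and the chunking described above might look like this in Java. It is a sketch under assumptions: the class and method names are hypothetical, and it generates `?` placeholders for use with a `PreparedStatement` rather than the literal values shown in the SQL above. Table and column names are concatenated into the SQL, so they must come from trusted configuration, never from user input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ChunkedInsertSketch {

    /** Builds e.g. "INSERT INTO t (a, b) VALUES (?, ?), (?, ?)" for the given row count. */
    static String multiRowInsert(String table, List<String> columns, int rows) {
        if (rows < 1 || columns.isEmpty()) {
            throw new IllegalArgumentException("need at least one row and one column");
        }
        // One "(?, ?, ...)" group per row, one placeholder per column.
        String tuple = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rows, tuple));
    }

    /** Splits the rows into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> chunks(List<T> rows, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            result.add(new ArrayList<>(rows.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("col_a", "col_b");
        System.out.println(multiRowInsert("table_a", columns, 3));
    }
}
```

The last chunk is usually smaller than the others, so its statement has to be built with its own row count. The index of the last committed chunk is what the application would persist as its checkpoint.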

I can only speak about PostgreSQL.
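The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement. With the PostgreSQL JDBC driver this only takes effect when autocommit is turned off on the connection.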
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
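Put together, the advice above could be sketched with plain JDBC as follows. The class name, the batch size parameter, and the table names in `main` are hypothetical; the sketch also assumes that source and target column types are compatible enough for `getObject`/`setObject`. Sending the inserts with `addBatch`/`executeBatch` goes slightly beyond what the answer says and is an optional addition here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class StreamingCopySketch {

    /** Streams rows from source to target in one target transaction; returns the row count. */
    static long copy(Connection source, Connection target, String selectSql,
                     String insertSql, int columnCount, int fetchSize, int batchSize)
            throws SQLException {
        source.setAutoCommit(false); // needed by the PostgreSQL driver for cursor-based fetching
        target.setAutoCommit(false); // all inserts go into a single transaction
        long copied = 0;
        try (Statement select = source.createStatement();
             PreparedStatement insert = target.prepareStatement(insertSql)) {
            select.setFetchSize(fetchSize); // fetch rows in blocks instead of all at once
            try (ResultSet rs = select.executeQuery(selectSql)) {
                while (rs.next()) {
                    for (int col = 1; col <= columnCount; col++) {
                        insert.setObject(col, rs.getObject(col));
                    }
                    insert.addBatch();
                    if (++copied % batchSize == 0) {
                        insert.executeBatch(); // send one full batch
                    }
                }
            }
            if (copied % batchSize != 0) {
                insert.executeBatch(); // send the trailing partial batch
            }
            target.commit();
            source.commit(); // end the read transaction
        } catch (SQLException e) {
            target.rollback();
            throw e;
        }
        return copied;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 2) {
            System.out.println("usage: <sourceJdbcUrl> <targetJdbcUrl>");
            return;
        }
        // Hypothetical table and column names, for illustration only.
        try (Connection src = DriverManager.getConnection(args[0]);
             Connection dst = DriverManager.getConnection(args[1])) {
            long rows = copy(src, dst, "SELECT a, b FROM src_table",
                    "INSERT INTO dst_table (a, b) VALUES (?, ?)", 2, 10000, 1000);
            System.out.println("copied " + rows + " rows");
        }
    }
}
```

For very large tables a single transaction may be undesirable; committing per chunk and recording a checkpoint is the alternative, at the cost of losing all-or-nothing behaviour.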

Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm

Not able to run select query after setting TTL in cassandra

I already have records in a Cassandra DB. Using a Java class, I retrieve each row, update it with a TTL, and store it back in Cassandra. If I run a SELECT query after that, it executes and shows the records. Once the TTL has expired, a SELECT query should return zero records, but instead the query fails with the error "Cassandra failure during read query at consistency ONE". SELECT queries work properly for other tables, but not for the table whose rows I applied the TTL to.

You are using common anti-patterns.
1) You are using batches to load data into two single tables, separately. I don't know whether you already have a cluster or are on your local machine, but this is not the way to load data into a C* cluster, and it is going to put a lot of stress on the cluster. You should use batches only when you need to keep two or more tables in sync, not to load a bunch of records at a time. I suggest the following readings on the topic:
DataStax documentation on BATCH
Ryan Svihla Blog
2) You are using synchronous writes to insert your pretty independent records into your cluster. You should use asynchronous writes to speed up your data processing.
DataStax Java Driver Async Queries
3) You are using the TTL feature in your tables, which per se is not that bad. However, an expired TTL is a tombstone, and that means that when you run your SELECT, C* will have to read all those tombstones.
4) You bind your prepared statement multiple times:
BoundStatement bound = phonePrepared.bind(macAddress, ...
and that should be
BoundStatement bound = new BoundStatement(phonePrepared).bind(macAddress, ...
in order to use different bound statements. This is not an anti-pattern, this is a problem with your code.
Now, if you run your program multiple times, your tables accumulate a lot of tombstones due to the TTL feature, and that means C* is trying hard to read all of them in order to find what you wrote "the last time" you ran successfully, and it takes so long that the queries time out.
Just for fun, you can try to increase your timeouts for the SELECT, say to 2 minutes, and have a coffee; in the meantime C* will get your records back.
I don't know what you are trying to achieve, but short TTLs are your enemies. If you just want to refresh your records, then try to keep the TTLs long enough that they don't hurt your performance. Or, probably a better solution, add a new column EXPIRED, written "manually" only when you need to delete a record. That depends on your requirements.

Is there a parameter that accelerates Sybase insertion?

I am using Sybase ASE, and I have a table in which I save results calculated in Java. The table has 10 columns: one is an INT column (but not an ID column), and the other 9 are all VARCHAR(50).
There's no index or trigger on this table (in fact the table is completely independent). I need to insert around 160K rows into it. I tried to split the work into batches of 10,000 insertions each. I used two different approaches: Spring's JdbcTemplate.batchUpdate and the native JDBC PreparedStatement.executeBatch API.
However, there is no clear winner regarding performance. Both take around 25 to 30 seconds per 10K insertions.
Then I thought it could be related to the JDBC driver, so I tried two different drivers: jConnect and jTDS. No real impact on insertion performance.
Finally I decided to compare Sybase with another database, PostgreSQL, in my test. I kept the same Java code, and surprisingly PostgreSQL takes only 0.3 seconds per 10K insertions, while Sybase takes 25 to 30 seconds (roughly 80 to 100 times longer).
The DBA support team explains the difference by the fact that PostgreSQL is installed on my local machine, while Sybase is installed on our enterprise server. However, I am not convinced by this explanation at all.
Does anyone know if there's a configuration in Sybase which could considerably impact the insertion speed? Or are there any other possible causes for my above scenario?

The delay that you see on the Sybase end can be caused by many factors that need to be checked, and comparing it to a different database, one that moreover runs on a local machine, is not a valid comparison.
To start with, we need to check the network latency and the storage used by the Sybase database. We need to check the Sybase server configuration, the page size, and the locking scheme of the table that you are inserting into. We also need to do a basic health check of the server while you are inserting the data. Since you mention that you used two different ways to insert the data, it is important to check that both are up to date with the Sybase client installed on your system.
To sum up, it may be a simple issue such as blocking on the Sybase instance, or it could be related to storage that is not able to write quickly enough. If Sybase is configured properly, the performance should be very good.

Whether the DB server is local or not may indeed make a significant difference. Until you cut out this factor, comparison with a local DB makes little sense.
But that aside, there are many aspects that affect insert performance in ASE. First off, make sure the overall memory configuration (e.g. data cache and procedure cache) is not too small -- leaving it at the installation defaults is a guarantee for disappointing results. Then there is network packet size that can play a role. And the batch size (#rows before you commit). And the table's lock scheme.
Trying to use minimally logged inserts will help (requires config setting changes), especially since the table has no indexes (and no UNIQUE or PK constraints either?)
The ASE server page size (which you choose when you create the server) also makes a difference: bigger is basically better for inserts.

Set the ENABLE_BULK_LOAD parameter to True. It will speed it up.

Slow tomcat application performance using SQL Server 2008 as the app database

I built a Java web application and it works fine with SQL Server 2008 if the queried table holds about 100 records. But when I increased it to 1.3 million records, it takes about 4-8 minutes to execute a single query. My application uses Hibernate.
I have deployed this application on a server with 6 GB of RAM and on one with 12 GB of RAM, and have also increased my Java heap size to 4 GB and 8 GB respectively, but I still encounter the same problem.
What can I do to improve performance?
UPDATE:
This is one of the queries that is really slow on SQL Server but runs fast on PostgreSQL:
select distinct c.company from Affiliates c where c.portalUser.userId = 'user.getUserId()' and lower(c.company.classification.name) = lower('" + companyClass + "') order by c.company.dateOfReservation desc";

I think the most painful part of your example query is the part:
lower(c.company.classification.name) = lower('" + companyClass + "')
Here you are forcing a table scan because for each row the names have to be lowercased and compared. If your database already compares strings case-insensitively, you might be able to omit the lower() calls. If not, you could consider adding an extra column with a lower-case copy of the string and using this copy for the query.
How many companyClasses are there? Could you create a separate table with all the company classes and refer to it by index? This would probably speed up this query a lot since you would not have to do any varchar comparisons anymore.
Just some ideas.

You haven't provided any code or sufficient specific details but the usual causes are:
Emitting millions of ORM queries. Use SQL Profiler (or turn on Hibernate's logging) to determine if you are emitting many selects from code when a single stored procedure to do all the work would be more appropriate. This SO answer shows you how to enable Hibernate SQL logging.
Poorly indexed tables causing lots of large table scans. See How Can I Log and Find the Most Expensive Queries?. Determine your expensive queries and create indexes to improve their performance.
Note: @Adriaan Koster pointed out the lower case conversion performed in your query. By default, SQL Server is case insensitive (I've only come across one that wasn't in 20 years, and that was set up mistakenly), so you can almost certainly drop the conversions to lowercase. This would allow the query to use an appropriate index if one exists.

Multi threaded insert using ORM?

I have one application where "persisting to database" is consuming 85% time of the entire application flow.
I was thinking of using multiple threads to do the inserts, because the inserts are mostly independent here. Is there any way to achieve multi-threaded inserts using any JPA implementation? And is multi-threaded insertion even worth doing from a performance perspective?
Note: Inserts are in the range of 10K to 100K records in a single run. Also performance is very very critical here.
Thanks.

Multi-threading insert statements on a database won't really make it perform any faster, because in most databases the table requires a lock for an insert. So your threads will just be waiting for the one before them to finish and unlock the table before the next can insert, which really doesn't make it any more multi-threaded than with a single thread. If you were to do it, it would most likely slow things down.
If you are inserting 10k-100k records, you should consider using either batch insert statements or the bulk insert commands that are native to the database you're using. The fastest way would be the native bulk insert commands, but that would require you to bypass JPA and work directly with JDBC calls for the inserts you want to run as bulk commands.
If you don't want to play around with native bulk commands, I recommend using Spring's JdbcTemplate, which has templated batch insert commands. It is very fast; I use it to batch insert 10k-20k entities every 30 seconds on a high-transaction system and I am very pleased with the performance.
Lastly, make sure your database tables are optimized with the correct indexes, keys and options. Since your database is the bottleneck this should be one of the first places you look to increase performance.

"Multi-threading insert statements on database won't really make it perform any faster because in most databases the table requires a lock for an insert. So your threads will just be waiting for the one before it to finish up and unlock the table before the next can insert - which really doesn't make it any more multi-threaded than with a single thread. If you were to do it, it would most likely slow it down."
Are you saying concurrent inserts from different db connections on the same table require exclusive locks to complete? I tested this on Oracle, and I didn't find this to be the case. Do you actually have a test case to back up what you wrote here?
Anyway, bulk insert is of course a lot faster than one insert at a time.

Are you periodically flushing your session when doing this? If not, you can hit nasty slowdowns that have nothing to do with the database. Generally, you want to "batch" the inserts by periodically calling flush() and then clear() on your session (assuming you are using some variant of JPA).

This article has many tips to improve batch writing performance with JPA. I'll quote the two that should give you the best result for fast reference.
Optimization #6 - Sequence Pre-allocation
We have optimized the first part of the application, reading from the MySQL database. The second part is to optimize the writing to Oracle.
The biggest issue with the writing process is that the Id generation is using an allocation size of 1. This means that for every insert there will be an update and a select for the next sequence number. This is a major issue, as it is effectively doubling the amount of database access. By default JPA uses a pre-allocation size of 50 for TABLE and SEQUENCE Id generation, and 1 for IDENTITY Id generation (a very good reason to never use IDENTITY Id generation). But frequently applications are unnecessarily paranoid of holes in their Id values and set the pre-allocation value to 1. By changing the pre-allocation size from 1 to 500, we reduce about 1000 database accesses per page.
Optimization #8 - Batch Writing
Many databases provide an optimization that allows a batch of write operations to be performed as a single database access. There is both parametrized and dynamic batch writing. For parametrized batch writing a single parametrized SQL statement can be executed with a batch of parameter values instead of a single set of parameter values. This is very optimal as the SQL only needs to be executed once, and all of the data can be passed optimally to the database.
Dynamic batch writing requires dynamic (non-parametrized) SQL that is batched into a single big statement and sent to the database all at once. The database then needs to process this huge string and execute each statement. This requires the database to do a lot of work parsing the statement, so is not always optimal. It does reduce the database access, so if the database is remote or poorly connected with the application, this can result in an improvement.
In general parametrized batch writing is much more optimal, and on Oracle it provides a huge benefit, whereas dynamic does not. JDBC defines the API for batch writing, but not all JDBC drivers support it, some support the API but then execute the statements one by one, so it is important to test that your database supports the optimization before using it. In EclipseLink batch writing is enabled using the persistence unit property "eclipselink.jdbc.batch-writing"="JDBC".
Another important aspect of using batch writing is that you must have the same SQL (DML actually) statement being executed in a grouped fashion in a single transaction. Some JPA providers do not order their DML, so you can end up ping-ponging between two statements such as the order insert and the order-line insert, making batch writing ineffective. Fortunately EclipseLink orders and groups its DML, so usage of batch writing reduces the database access from 500 order inserts and 5000 order-line inserts to 55 (default batch size is 100). We could increase the batch size using "eclipselink.jdbc.batch-writing.size", so increasing the batch size to 1000 reduces the database accesses to 6 per page.
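The access counts in the quoted batch-writing passage follow from a ceiling division per statement type, since order inserts and order-line inserts are different SQL statements and are batched separately. A small sketch of that arithmetic (the class and method names are hypothetical):

```java
class BatchMath {

    /** Round trips needed to send `statements` in batches of `batchSize` (ceiling division). */
    static int roundTrips(int statements, int batchSize) {
        if (statements < 0 || batchSize < 1) {
            throw new IllegalArgumentException("statements must be >= 0 and batchSize >= 1");
        }
        return (statements + batchSize - 1) / batchSize;
    }

    /** Orders and order lines use different SQL, so each group is batched on its own. */
    static int roundTripsPerPage(int orders, int orderLines, int batchSize) {
        return roundTrips(orders, batchSize) + roundTrips(orderLines, batchSize);
    }

    public static void main(String[] args) {
        System.out.println(roundTripsPerPage(500, 5000, 1));    // 5500: no batching
        System.out.println(roundTripsPerPage(500, 5000, 100));  // 55: 5 + 50
        System.out.println(roundTripsPerPage(500, 5000, 1000)); // 6: 1 + 5
    }
}
```

This also shows why ungrouped DML defeats batching: if order and order-line inserts alternate, each change of statement ends the current batch, and the count drifts back toward the unbatched 5500.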
