Hibernate Relationship Mapping/Speed up batch inserts - java

I have 5 MySQL InnoDB tables: Test, InputInvoice, InputLine, OutputInvoice, OutputLine, and each is mapped and functioning in Hibernate. I have played with using StatelessSession/Session and the JDBC batch size. I have removed any generator classes to let MySQL handle the id generation, but it is still performing quite slowly.
Each of those tables is represented by a Java class and mapped in Hibernate accordingly. Currently, when it comes time to write the data out, I loop through the objects and do a session.save(Object), or session.insert(Object) if I'm using StatelessSession. I also do a flush and clear (when using Session) when my line count reaches the max JDBC batch size (50).
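Roughly, the Session-based write loop looks like this (a sketch only; the entity name, the lines collection and the sessionFactory variable are illustrative):

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
int count = 0;
for (InputLine line : lines) {
    session.save(line);
    if (++count % 50 == 0) {   // 50 = hibernate.jdbc.batch_size
        session.flush();       // push the queued INSERTs to the database
        session.clear();       // detach the saved objects so the session stays small
    }
}
tx.commit();
session.close();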
Would it be faster if I had these in a 'parent' class that held the objects and did a session.save(master) instead of each one?
If I had them in a master/container class, how would I map that in Hibernate to reflect the relationship? The container class wouldn't actually be a table of its own, but a relationship all based on two indexes, run_id (int) and line (int).
Another direction would be: How do I get Hibernate to do a multi-row insert?

The ID generation strategy is critical for batch insertion in Hibernate. In particular, IDENTITY generation will usually not work (note that AUTO typically maps to IDENTITY as well). This is because during batch inserts Hibernate has a flag called "requiresImmediateIdAccess" that says whether generated IDs are required immediately; if so, batch processing is disabled.
You can easily spot this in the DEBUG-level logs when it says "executing identity-insert immediately" - this means it has skipped batch processing because it was told that generated IDs are required immediately after insertion.
Generation strategies that typically do work are TABLE and SEQUENCE, because Hibernate can pre-generate the IDs, thereby allowing for batch insertion.
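For example, a mapping along these lines (shown here with JPA annotations; the entity, generator name and allocation size are purely illustrative) lets Hibernate pre-allocate IDs instead of reading them back after every insert:

@Entity
@Table(name = "INPUT_LINE")
public class InputLine {

    @Id
    // TABLE (or SEQUENCE on databases that have sequences; MySQL does not)
    // lets Hibernate hand out IDs up front, so JDBC batching stays enabled.
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "input_line_gen")
    @TableGenerator(name = "input_line_gen", allocationSize = 50)
    private Long id;

    // ... remaining fields and mappings ...
}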
A quick way to spot whether your batch insertion works is to activate DEBUG-level logs as BatchingBatcher will explicitly tell you the batch size it's executing ("Executing batch size: " + batchSize ).
Additionally, the following properties are important for achieving batch insertion. I daren't say they're required, as I'm not enough of a Hibernate expert to do so - perhaps it's just my particular configuration - but in my experience they were needed nonetheless:
hibernate.order_inserts = true
hibernate.order_updates = true
These properties are pretty poorly documented, but I believe they allow the SQL INSERT and UPDATE statements to be properly grouped for batch execution; I think this might be the multi-row insert behaviour you're after. Don't shoot me if I'm wrong on this, I'm recalling from memory.
I'll also go ahead and assume that you set the following property; if not, this should serve as a reminder:
hibernate.jdbc.batch_size = xx
Where xx is your desired batch size, naturally.
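For completeness, here's one way to set all three (a sketch using the programmatic Configuration API; the same keys work equally well in hibernate.cfg.xml or hibernate.properties):

Configuration cfg = new Configuration().configure();
cfg.setProperty("hibernate.jdbc.batch_size", "50");
cfg.setProperty("hibernate.order_inserts", "true");
cfg.setProperty("hibernate.order_updates", "true");
SessionFactory sessionFactory = cfg.buildSessionFactory();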

The final solution for me was to use voetsjoeba's response as a jumping off point.
My hibernate config uses the following options:
hibernate.order_inserts = true
hibernate.order_updates = true
I changed from using Session to StatelessSession.
I re-ordered the Java code to process all the elements in a batch one table at a time: all of table x, then table y, etc. (see the sketch after this list).
I removed the <generator> from each class; Java now creates the id and assigns it to the object.
I created logic that lets me determine whether only an id was being set, so I don't write 'empty' lines to the database.
Finally, I turned on dynamic-insert for my classes in their Hibernate definitions, like so: <class name="com.my.class" table="MY_TABLE" dynamic-insert="true">
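Put together, the write path ended up looking roughly like this (a sketch; the entity names, the nextId counter and the isEmpty() check are illustrative):

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

// One table at a time, so the batched INSERTs are not interleaved.
for (InputInvoice invoice : inputInvoices) {
    invoice.setId(nextId++);      // id assigned in Java; no <generator> involved
    session.insert(invoice);
}
for (InputLine line : inputLines) {
    if (line.isEmpty()) {         // skip rows where only the id would be written
        continue;
    }
    line.setId(nextId++);
    session.insert(line);
}
// ... OutputInvoice, OutputLine and Test in the same fashion ...

tx.commit();
session.close();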

Related

How to optimize one big insert with hibernate

For my website, I'm creating a book database. I have a catalog with a root node; each node has subnodes, each subnode has documents, each document has versions, and each version is made of several paragraphs.
In order to create this database as fast as possible, I first build the entire tree model in memory, and then I call session.save(rootNode)
This single save populates my entire database (when I do a mysqldump on the database at the end, it weighs 1 GB)
The save costs a lot (more than an hour), and since the database grows with new books and new versions of existing books, it costs more and more. I would like to optimize this save.
I've tried to increase the batch_size, but it changes nothing since it's a single save. When I mysqldump the database and insert the dump back into MySQL, the operation takes 2 minutes or less.
And when I run "htop" on the Ubuntu machine, I can see that MySQL is only using 2 or 3% CPU, which means that it's Hibernate that's slow.
If someone could give me possible techniques that I could try, or possible leads, it would be great... I already know some of the reasons why it takes time. If someone wants to discuss it with me, thanks for your help.
Here are some of my problems (I think): For example, I have self-assigned ids for most of my entities. Because of that, Hibernate checks each time whether the row exists before it saves it. I don't need this because the batch I'm executing is executed only once, when I create the database from scratch. The best would be to tell Hibernate to ignore the primary-key checks (like mysqldump does) and re-enable the key checking once the database has been created. It's just a one-shot batch to initialize my database.
The second problem is again about foreign keys. Hibernate inserts lines with null values, then makes an update in order to make the foreign keys work.
About using another technology: I would like to make this batch work with Hibernate because afterwards my whole website works very well with Hibernate, and if it's Hibernate that creates the database, I'm sure the naming rules and all the foreign keys will be created correctly.
Finally, it's a read-only database. (I have a user database, which uses InnoDB, where I do updates and inserts while my website is running, but the document database is read-only and MyISAM.)
Here is an example of what I'm doing:
TreeNode rootNode = new TreeNode();
recursiveLoadSubNodes(rootNode); // This method creates my big tree, in memory only.
hibernateSession.beginTransaction();
hibernateSession.save(rootNode); // during more than an hour, it saves 1 GB of data: hundreds of sub tree nodes, thousands of documents, tens of thousands of paragraphs.
hibernateSession.getTransaction().commit();
It's a little hard to guess what could be the problem here but I could think of 3 things:
Increasing batch_size only might not help because - depending on your model - inserts might be interleaved (i.e. A B A B ...). You can allow Hibernate to reorder inserts and updates so that they can be batched (i.e. A A ... B B ...). Depending on your model this might not work because the inserts might not be batchable. The necessary properties are hibernate.order_inserts and hibernate.order_updates, and a blog post that describes the situation can be found here: https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
If the entities don't already exist (which seems to be the case) then the problem might be the first-level cache. This cache will cause Hibernate to get slower and slower, because each time it wants to flush changes it will check all entries in the cache by iterating over them and calling equals() (or something similar). As you can see, that will take longer with each new entity that's created. To fix that you could either try to disable the first-level cache (I'd have to look up whether that's possible for write operations and how this is done - or you do that :) ) or try to keep the cache small, e.g. by inserting the books yourself and evicting each book from the first-level cache after the insert (you could also go deeper and do that on the document or paragraph level); see the sketch at the end of this answer.
It might not actually be Hibernate (or at least not alone) but your DB as well. Note that restoring dumps often removes/disables constraint checks and indices along with other optimizations so comparing that with Hibernate isn't that useful. What you'd need to do is create a bunch of insert statements and then just execute those - ideally via a JDBC batch - on an empty database but with all constraints and indices enabled. That would provide a more accurate benchmark.
Assuming that comparison shows that the plain SQL insert isn't that much faster then you could decide to either keep what you have so far or refactor your batch insert to temporarily disable (or remove and re-create) constraints and indices.
Alternatively you could try not to use Hibernate at all or change your model - if that's possible given your requirements which I don't know. That means you could try to generate and execute the SQL queries yourself, use a NoSQL database or NoSQL storage in a SQL database that supports it - like Postgres.
We're doing something similar, i.e. we have Hibernate entities that contain some complex data which is stored in a JSONB column. Hibernate can read and write that column via a custom usertype but it can't filter (Postgres would support that but we didn't manage to enable the necessary syntax in Hibernate).
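Coming back to the second point, here is a minimal sketch of keeping the first-level cache small by saving and evicting one book at a time (the getBooks() accessor, the entity names and the cascade setup are assumptions about your model):

Transaction tx = session.beginTransaction();
for (TreeNode book : rootNode.getBooks()) {
    session.save(book);   // cascades to documents/paragraphs if the mapping cascades
    session.flush();      // push the pending INSERTs for this book
    session.evict(book);  // drop it from the first-level cache
    // session.clear() is the blunter alternative: it empties the whole cache
}
tx.commit();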

In MySQL is it advisable to rely on uniqueness constraints or to manually check if the row is already present?

I have a table:
userId | subject
with a uniqueness constraint on both columns combined.
Now I am writing thousands of rows to this table every few minutes. The data stream is coming from a queue and it might repeat. I have to however make sure that there is only one unique combination of userId,subject in the table.
Currently I rely on MySQL's uniqueness constraint, which throws an exception.
Another approach is run a SELECT count(*) query to check if this row is already present and then skip it if need be.
Since I want to write on average 4 rows per second, what is advisable?
Programming language: Java
EDIT:
Just in case I am not clear: the question here is whether relying on MySQL to throw an exception is better, or running a select query before the insert operation is better, in terms of performance.
I thought a SELECT query is less CPU/IO intensive than an INSERT query. If I run too many INSERTs, wouldn't that create many locks?
MySQL is ACID and employs transactional locking, so relying on its uniqueness constraints is very standard. Note that you can do this either via PRIMARY KEY or UNIQUE KEY (but favour the former if you can).
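A minimal JDBC sketch of the "rely on the constraint" approach (the user_subject table name is made up; userId and subject come from the question, and most JDBC 4 drivers, including MySQL Connector/J, map duplicate-key errors to SQLIntegrityConstraintViolationException):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public class SubjectDao {
    /** Returns true if the row was inserted, false if the (userId, subject) pair already existed. */
    public boolean insertIfAbsent(Connection connection, long userId, String subject) throws SQLException {
        String sql = "INSERT INTO user_subject (userId, subject) VALUES (?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setLong(1, userId);
            ps.setString(2, subject);
            return ps.executeUpdate() == 1;
        } catch (SQLIntegrityConstraintViolationException e) {
            return false; // duplicate key: the unique constraint did the check for us
        }
    }
}

Alternatively, MySQL's INSERT IGNORE (or INSERT ... ON DUPLICATE KEY UPDATE) avoids the exception altogether and simply reports 0 affected rows for duplicates.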
A unique constraint is unique for the complete committed dataset.
There are several databases which allow you to set the "transaction isolation level".
userId | subject
A | 1
B | 2
-------------------------
A | 2
A | 3
The two rows above the line are committed; every connection can read them. The two rows under the line are currently being written within your transaction, and within this connection all four rows are visible.
If another thread / connection / transaction tries to store A-2, there will be an exception in one of the two transactions (the first one can commit its transaction, the second one can't).
Other isolation levels may fail earlier, but it is not possible to violate the unique-key constraint.

What is the difference between hibernate.jdbc.fetch_size and hibernate.jdbc.batch_size?

I am trying to tune my application and came across some blogs speaking about batch fetching and batch inserting; my understanding is as follows.
hibernate.jdbc.fetch_size - Used to specify number of rows to be fetched in a select query.
hibernate.jdbc.batch_size - Used to specify number of inserts or updates to be carried out in a single database hit.
Please let me know whether my understanding is correct. Also, what are the optimal values for the above parameters?
Both of these options set properties within the JDBC driver. In the first case, hibernate.jdbc.fetch_size sets the statement's fetch size within the JDBC driver, that is, the number of rows fetched per roundtrip when a select statement returns more than one row.
In the second case, hibernate.jdbc.batch_size determines the number of updates (inserts, updates and deletes) that are sent to the database at one time for execution. This parameter is necessary to do batch inserts, but must be coupled with the ordered inserts parameter and the JDBC driver's capability to rewrite the inserts into a batch insert statement.
See this link
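At the plain-JDBC level the two settings roughly correspond to the following (a sketch; the item table, its columns, and the connection and names variables are made up):

// fetch_size: how many rows the driver pulls per roundtrip while reading results
try (PreparedStatement select = connection.prepareStatement("SELECT id, name FROM item")) {
    select.setFetchSize(100);               // what hibernate.jdbc.fetch_size maps to
    try (ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
            // ... read rows ...
        }
    }
}

// batch_size: how many statements are queued before one executeBatch() call
try (PreparedStatement insert = connection.prepareStatement("INSERT INTO item (name) VALUES (?)")) {
    for (String name : names) {
        insert.setString(1, name);
        insert.addBatch();                  // queued, not yet sent
    }
    insert.executeBatch();                  // sent in one database hit
}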
Your assumptions are correct.
hibernate.jdbc.fetch_size
The hibernate.jdbc.fetch_size Hibernate configuration property is used for setting the JDBC Statement#setFetchSize property for every statement that Hibernate uses during the currently running Persistence Context.
Usually, you don't need to set this property as the default is fine, especially for MySQL and PostgreSQL which fetch the entire ResultSet in a single database roundtrip. Because Hibernate traverses the entire ResultSet, you are better off fetching all rows in a single shot instead of using multiple roundtrips.
Only for Oracle, you might want to set it since the default fetchSize is just 10.
hibernate.jdbc.batch_size
The hibernate.jdbc.batch_size property is used to batch multiple INSERT, UPDATE, and DELETE statements together so that they can be sent in a single database call.
If you set this property, you are better off setting these two as well:
hibernate.order_inserts to true
hibernate.order_updates to true
Fetch size corresponds to Statement.setFetchSize(), while batch size is for Hibernate batching. Both configuration parameters are explained here. For Hibernate batching, refer here.
Your understanding seems quite correct. I would refer you to the JBoss documentation on Hibernate; the following chapter is on batch processing, and this one is on tweaking performance.
It's a good, easy to read source. It gives some suggestions on optimal values, but as CodeChimp mentioned, tuning is best done case by case and is a repeatable process over time.

BatchUpdate with ACID properties in Spring + mysql

Currently our application uses Spring, SimpleJdbcTemplate & MySQL.
DataSource used is org.apache.commons.dbcp.BasicDataSource with url properties "rewriteBatchedStatements=true".
During the batch insert with SimpleJdbcTemplate.batchUpdate(List<Object[]>), there will be duplicate records (based on the primary key of the table we are inserting into) in the input batch.
Under the duplicate record scenario is it possible to
1) Insert all the non-duplicate records and get the response about the number of successful inserts?
OR
2) Completely rollback the batchInsert, no record should be inserted?
We are able to partially achieve the first requirement using MySQL's "INSERT IGNORE". But SimpleJdbcTemplate.batchUpdate() returns every record as updated (we are not able to capture only the inserted record count, ignoring the duplicates).
And to achieve the second requirement we have to turn off "rewriteBatchedStatements". But this parameter was fine-tuned after performance testing, so we can't set it to "false".
Is it possible to achieve one of the two cases above, within the constraints of the components we are using as mentioned in the first line?
I am new to the Spring and JDBC world, so a detailed explanation will help us better.
Thanks
1) Insert all the non-duplicate records and get the response about the number of successful inserts?
=> Well, you can insert all non-duplicate records and get the response. batchUpdate() returns int[], i.e. an array of integers which represents the number of rows affected by each update in the batch (inserted/updated).
2) Completely rollback the batchInsert no record should be inserted?
=> As the batch insert will be within a single transaction, either all records will be inserted or no record is inserted. If the transaction gets committed, all records will be inserted. If any exception occurs, the transaction will be rolled back automatically if you are using Spring transaction management (either using the @Transactional annotation or Spring AOP based tx advice). Here, make sure you set BasicDataSource.defaultAutoCommit = false.
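Under those assumptions, here is a sketch of option 1 using INSERT IGNORE and summing the per-row counts returned by batchUpdate() (shown with plain JdbcTemplate, which SimpleJdbcTemplate wraps; the class, table and column names are illustrative, and note that with rewriteBatchedStatements=true the driver may not report accurate per-row counts, which matches what you observed):

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

public class BatchInsertDao {

    private final JdbcTemplate jdbcTemplate;

    public BatchInsertDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Transactional  // any exception rolls back the whole batch (requirement 2)
    public int insertBatch(List<Object[]> rows) {
        // INSERT IGNORE skips duplicates instead of failing (requirement 1).
        int[] counts = jdbcTemplate.batchUpdate(
                "INSERT IGNORE INTO my_table (id, value) VALUES (?, ?)", rows);

        // Sum the per-row counts to estimate how many rows were actually inserted.
        int inserted = 0;
        for (int c : counts) {
            if (c > 0) {
                inserted += c;
            }
        }
        return inserted;
    }
}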

Alright to truncate database tables when also using Hibernate?

Is it OK to truncate tables while at the same time using Hibernate to insert data?
We parse a big XML file with many relationships into Hibernate POJO's and persist to the DB.
We are now planning on purging existing data at certain points in time by truncating the tables. Is this OK?
It seems to work fine. We don't use Hibernate's second level cache. One thing I did notice, which is fine, is that when inserting we generate primary keys using Hibernate's @GeneratedValue, where Hibernate just uses a key value one greater than the highest value in the table - and even though we are truncating the tables, Hibernate remembers the prior value and uses prior value + 1 as opposed to starting over at 1. This is fine, just unexpected.
Note that the reason we do truncate as opposed to calling delete() on the Hibernate POJO's is for speed. We have gazillions of rows of data, and truncate is just so much faster.
We are now planning on purging existing data at certain points in time by truncating the tables. Is this OK?
If you're not using the second level cache and if you didn't load Entities from the table you're going to truncate in the Session, the following should work (assuming it doesn't break integrity constraints):
Session s = sf.openSession();
// Session#connection() is the old Hibernate 3.x API; newer versions use Session#doWork() instead
PreparedStatement ps = s.connection().prepareStatement("TRUNCATE TABLE XXX");
ps.executeUpdate();
And you should be able to persist entities after that, either in the same transaction or another one.
Of course, such a TRUNCATE won't generate any Hibernate event or trigger any callback, if this matters.
(...) when inserting we generate primary keys using Hibernate's @GeneratedValue (...)
If you are using the default strategy for @GeneratedValue (i.e. AUTO), then it should default to a sequence with Oracle, and a sequence won't be reset if you truncate a table or delete records.
We truncate tables like jdbcTemplate.execute("TRUNCATE TABLE abc")
This should be equivalent (you'll end up using the same underlying JDBC connection as Hibernate).
What sequence would Hibernate use for the inserts?
AFAIK, Hibernate generates a default "hibernate_sequence" sequence for you if you don't declare your own.
I thought it was just doing a max(field) + 1 on the table?
I don't think so and the fact that Hibernate doesn't start over from 1 after the TRUNCATE seems to confirm that it doesn't. I suggest to activate SQL logging to see the exact statements performed against your database on INSERT.
The generator we specify for @GeneratedValue is just a "dummy" generator (doesn't correspond to any sequence that we've created).
I'm not 100% sure, but if you didn't declare any @SequenceGenerator (or @TableGenerator), I don't think that specifying a generator changes anything.
Depends on your application. If deleting rows in the database is okay, then truncate is okay, too.
As long as you don't have any Pre- or PostRemove listeners on your entities, there should be no problems.
On the other hand... is it possible that there are still entities loaded in an EntityManager at truncate time, or is this a write-only table (like a logging table)? In the latter case you won't have any problem at all.
