I need to consume a rather large amount of data from a daily CSV file. The CSV contains around 120K records. This slows to a crawl when using Hibernate: for every instance persisted with saveOrUpdate(), a SELECT is issued before the actual INSERT or UPDATE. I can understand why it does this, but it's terribly inefficient for bulk processing, and I'm looking for alternatives.
I'm confident that the performance issue lies with the way I'm using Hibernate for this, since I got another version working with native SQL (that parses the CSV in the exact same manner) and it literally runs circles around this new version.
So, to the actual question: does a Hibernate equivalent of MySQL's "INSERT ... ON DUPLICATE KEY UPDATE" syntax exist?
Or, if I choose to use native SQL for this, can I execute native SQL within a Hibernate transaction? In other words, will it support commit/rollback?
There are many possible bottlenecks in bulk operations. The best approach depends heavily on what your data looks like. Have a look at the Hibernate Manual section on batch processing.
At a minimum, make sure you are using the following pattern (copied from the manual):
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i = 0; i < 100000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);
    if ( i % 20 == 0 ) { // 20, same as the JDBC batch size
        // flush a batch of inserts and release memory:
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
If you are mapping a flat file to a very complex object graph you may have to get more creative, but the basic principle is that you have to find a balance between pushing good-sized chunks of data to the database with each flush/commit and avoiding exploding the size of the session-level cache.
Lastly, if you don't need Hibernate to handle any collections or cascading for your data to be correctly inserted, consider using a StatelessSession.
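For illustration, a minimal StatelessSession sketch for the CSV case (csvRows and mapToCustomer are hypothetical placeholders for your parsing code, not part of the original example):
StatelessSession statelessSession = sessionFactory.openStatelessSession();
Transaction tx = statelessSession.beginTransaction();
for ( String[] row : csvRows ) {            // csvRows: placeholder for the parsed CSV lines
    Customer customer = mapToCustomer(row); // mapToCustomer: hypothetical mapping helper
    statelessSession.insert(customer);      // executes the INSERT directly; no first-level cache to flush or clear
}
tx.commit();
statelessSession.close();
A stateless session performs no dirty checking and no cascading, so it only fits when each row maps to a single entity insert.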
From Hibernate Batch Processing
For updates I used the following:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
ScrollableResults employeeCursor = session.createQuery("FROM Employee")
    .scroll();
int count = 0;
while ( employeeCursor.next() ) {
    Employee employee = (Employee) employeeCursor.get(0);
    employee.updateEmployee();
    session.update(employee);
    if ( ++count % 50 == 0 ) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
But for inserts I would go with jcwayne's answer.
According to an answer to a similar question, it can be done by configuring Hibernate to insert objects using a custom stored procedure which uses your database's upsert functionality. It's not pretty, though.
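A variation on that idea (overriding the SQL Hibernate generates for inserts, here with MySQL's upsert syntax directly rather than a stored procedure) is a custom @SQLInsert. This is only a sketch: the DailyRecord entity and its columns are made up, and the column order in the custom statement must match the order in which Hibernate binds parameters (non-id columns first, id last), so verify it against the SQL Hibernate logs before relying on it.
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

import org.hibernate.annotations.SQLInsert;

@Entity
@Table(name = "daily_record")
// custom INSERT used by Hibernate for this entity; the ON DUPLICATE KEY UPDATE clause turns it into an upsert
@SQLInsert(sql = "INSERT INTO daily_record (amount, id) VALUES (?, ?) "
        + "ON DUPLICATE KEY UPDATE amount = VALUES(amount)")
public class DailyRecord {

    @Id
    @Column(name = "id")
    private Long id;

    @Column(name = "amount")
    private Long amount;

    // constructors, getters and setters omitted
}
With this in place you call save() and let the database resolve the duplicate, instead of letting saveOrUpdate() issue its extra SELECT.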
High-throughput data import
If you only want to import data without doing any processing or transformation, then a tool like PostgreSQL COPY is the fastest way to import data.
Batch processing
However, if you need to transform, aggregate, or correlate/merge the incoming data with existing data, then you need application-level batch processing.
In this case, you want to flush-clear-commit regularly:
int entityCount = 50;
int batchSize = 25;

EntityManager entityManager = entityManagerFactory()
    .createEntityManager();

EntityTransaction entityTransaction = entityManager
    .getTransaction();

try {
    entityTransaction.begin();

    for (int i = 0; i < entityCount; i++) {
        if (i > 0 && i % batchSize == 0) {
            entityTransaction.commit();
            entityTransaction.begin();

            entityManager.clear();
        }

        Post post = new Post(
            String.format("Post %d", i + 1)
        );

        entityManager.persist(post);
    }

    entityTransaction.commit();
} catch (RuntimeException e) {
    if (entityTransaction.isActive()) {
        entityTransaction.rollback();
    }
    throw e;
} finally {
    entityManager.close();
}
Also, make sure you enable JDBC batching using the following configuration properties:
<property name="hibernate.jdbc.batch_size" value="25"/>
<property name="hibernate.order_inserts" value="true"/>
<property name="hibernate.order_updates" value="true"/>
Bulk processing
Bulk processing is suitable when all rows match pre-defined filtering criteria, so you can use a single UPDATE to change all records.
However, using bulk updates that modify millions of records can increase the size of the redo log or end up taking lots of locks on database systems that still use 2PL (Two-Phase Locking), like SQL Server.
So, while the bulk update is the most efficient way to change many records, you have to pay attention to how many records are to be changed to avoid a long-running transaction.
Also, you can combine bulk updates with optimistic locking so that other OLTP transactions won't lose the updates done by the bulk processing job.
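For illustration, a bulk update in JPQL might look like this (the Post entity, its status and version attributes, and the status values are assumptions for this sketch; incrementing version keeps the optimistic-locking checks of concurrently loaded entities honest):
int updatedCount = entityManager.createQuery(
    "update Post " +
    "set status = :newStatus, " +
    "    version = version + 1 " +
    "where status = :oldStatus")
.setParameter("newStatus", "APPROVED")
.setParameter("oldStatus", "PENDING")
.executeUpdate();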
If you use the sequence or native generator, Hibernate will issue a SELECT to get the id:
<id name="id" column="ID">
    <generator class="native"/>
</id>
You should use the hilo or seqhilo generator instead:
<id name="id" type="long" column="id">
    <generator class="seqhilo">
        <param name="sequence">SEQ_NAME</param>
        <param name="max_lo">100</param>
    </generator>
</id>
The "extra" select is to generate the unique identifier for your data.
Switch to HiLo sequence generation and you can reduce the sequence round trips to the database by a factor of the allocation size. Note that there will be gaps in the primary keys unless you adjust your sequence value for the HiLo generator.
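With JPA annotations the same idea looks roughly like this (a sketch assuming a database sequence named SEQ_NAME; with allocationSize = 100, Hibernate only needs one round trip to the sequence per 100 generated ids):
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "entity_seq")
@SequenceGenerator(name = "entity_seq", sequenceName = "SEQ_NAME", allocationSize = 100)
private Long id;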
Related
I'm trying to persist many records in the database, read from a file with many lines.
I'm using a forEach to iterate over the list of objects loaded from the file:
logs.stream().forEach(log -> save(log));

private LogData save(LogData log) {
    return repository.persist(log);
}
But the inserts are slow.
Is there a way to speed up the inserts?
Your approach takes a long time because you persist element by element, so you make N trips to the database. Use batch processing instead, with one transaction rather than N transactions, so the persist method can be:
public void persist(List<Logs> logs) {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    logs.forEach(log -> session.save(log)); // from the comment of @shmosel
    tx.commit();
    session.close();
}
Use a batch insert: Google "Hibernate Batch Insert", or substitute the name of your ORM if it's not Hibernate.
https://www.tutorialspoint.com/hibernate/hibernate_batch_processing.htm
Inserting at every line makes this program slow; why not collect n lines and insert those n lines together at once, as in the sketch below?
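A minimal sketch of that idea, assuming the same LogData entity and an available Hibernate SessionFactory (the batch size of 50 is arbitrary and should match hibernate.jdbc.batch_size):
public void persistInBatches(List<LogData> logs) {
    int batchSize = 50; // should match hibernate.jdbc.batch_size
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    for (int i = 0; i < logs.size(); i++) {
        session.save(logs.get(i));
        if (i > 0 && i % batchSize == 0) {
            // push the current batch to the database and detach it from the session
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}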
We will migrate a large amount of data (a single type of entity) from Amazon's DynamoDB into a MySQL DB. We are using Hibernate to map this class to a MySQL table. There are around 3 million entities (excluding the rows of the list property). Here is a summary of our class mapping:
@Entity
@Table(name = "CUSTOMER")
public class Customer {

    @Id
    @Column(name = "id")
    private String id;

    // Other properties, all of them primitive types/String

    @ElementCollection
    @CollectionTable(name = "CUSTOMER_USER", joinColumns = @JoinColumn(name = "customer_id"))
    @Column(name = "userId")
    private List<String> users;

    // CONSTRUCTORS, GETTERS, SETTERS, etc.
}
users is a list of Strings. We have created two MySQL tables as follows:
CREATE TABLE CUSTOMER(id VARCHAR(100), PRIMARY KEY(id));
CREATE TABLE CUSTOMER_USER(customer_id VARCHAR(100), userId VARCHAR(100), PRIMARY KEY(customer_id, userId), FOREIGN KEY (customer_id) REFERENCES CUSTOMER(id));
Note: we do not make Hibernate generate any id values; we assign our own IDs to the Customer entities, and they are guaranteed to be unique.
Here is our hibernate.cfg.xml:
<hibernate-configuration>
    <session-factory>
        <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
        <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
        <property name="hibernate.connection.url">jdbc:mysql://localhost/xxx</property>
        <property name="hibernate.connection.username">xxx</property>
        <property name="hibernate.connection.password">xxx</property>
        <property name="hibernate.connection.provider_class">org.hibernate.c3p0.internal.C3P0ConnectionProvider</property>
        <property name="hibernate.jdbc.batch_size">50</property>
        <property name="hibernate.cache.use_second_level_cache">false</property>
        <property name="c3p0.min_size">30</property>
        <property name="c3p0.max_size">70</property>
    </session-factory>
</hibernate-configuration>
We create a number of threads, each reading data from DynamoDB and inserting it into our MySQL DB via Hibernate. Here is what each thread does:
// Each single thread brings resultItems from DynamoDB
Session session = factory.openSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < resultItems.size(); i++) {
    Customer cust = new Customer(resultItems.get(i));
    session.save(cust);
    if (i % BATCH_SIZE == 0) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
We have our own performance-monitoring functions and we continuously log the overall read/write throughput. The problem is that the migration starts at around 1500 items/sec (on average), but keeps slowing down as the number of rows in the CUSTOMER and CUSTOMER_USER tables increases (after a few minutes, the r/w speed was around 500 items/sec). I am not experienced with Hibernate, so here are my questions:
What should hibernate.cfg.xml look like for a multi-threaded task like ours? Is the content I gave above suitable for such a task, or is anything wrong or missing?
There are exactly 50 threads and each does the following: read from DynamoDB, then insert the results into the MySQL DB, then read from DynamoDB again, and so on. Therefore, communication with Hibernate is not happening 100% of the time. Under these circumstances, what min_size and max_size do you recommend for the c3p0 connection pool? To understand the concept better, should I also set the remaining c3p0-related properties in hibernate.cfg.xml?
What can be done to maximize the speed of bulk inserting?
NOTE 1: I did not list all of the properties, because the remaining ones (other than the list of users) are all int, boolean, String, etc.
NOTE 2: All of these points have been tested and have no negative effect on performance. When we don't insert anything into the MySQL DB, the read speed stays stable for hours.
NOTE 3: Any recommendation/guidance about the structure of the MySQL tables, configuration settings, sessions/transactions, number of connection pools, batch sizes, etc. would be really helpful!
Assuming you are not doing anything else in the Hibernate transaction than just inserting the data into these two tables, you can use StatelessSession session = sessionFactory.openStatelessSession(); instead of a normal session, which reduces the overhead of maintaining the caches. But then you will have to save the nested collection objects separately.
Refer to https://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html
So it could be something like this:
// Each single thread brings resultItems from DynamoDB
StatelessSession session = factory.openStatelessSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < resultItems.size(); i++) {
    Customer cust = new Customer(resultItems.get(i));
    // StatelessSession exposes insert() rather than save(); it returns the entity identifier
    // (here the assigned String id) and maintains no first-level cache, so no flush()/clear() is needed.
    Serializable id = session.insert(cust);
    // TODO: Create a list of related customer users, assign the id to all of them,
    // and save those customer user objects in the same transaction (see the sketch below).
}
tx.commit();
session.close();
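One way to handle that TODO for the String element collection (a sketch, not from the original answer; it assumes getters for id and users on Customer) is to insert the CUSTOMER_USER rows with a native query inside the same transaction, since collection elements are not entities that StatelessSession.insert() can handle:
for (String userId : cust.getUsers()) {
    session.createSQLQuery(
            "INSERT INTO CUSTOMER_USER (customer_id, userId) VALUES (:customerId, :userId)")
        .setParameter("customerId", cust.getId())
        .setParameter("userId", userId)
        .executeUpdate();
}
Note that these per-row statements are not grouped by hibernate.jdbc.batch_size, so for large user lists batching them at the JDBC level may be worth considering.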
In your scenario there are 50 threads batch-inserting data into one table simultaneously. MySQL has to maintain ACID properties while 50 transactions, each covering many records in one table, remain open or are being committed. That can cause huge overhead.
While migrating data between databases, network latency can cause significant delays when there is a lot of back-and-forth communication with the database. In that case, using multiple threads can be beneficial. But when doing batch fetches and batch inserts, there is little to gain, as the database drivers will (or should) transfer the data without much back-and-forth communication.
In the batch scenario, start with 1 thread that reads data, prepares a batch, and puts it in a queue for 1 thread that writes data from the prepared batches. Keep the batches small (100 to 1,000 records) and commit often (every 100 records or so). This will minimize the overhead of maintaining the table. If network latency is a problem, try using 2 threads for reading and 2 for writing (but any performance gain might be offset by the overhead of maintaining a table used by 2 threads simultaneously).
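A rough sketch of that reader/writer split (fetchNextBatchFromDynamo is a hypothetical placeholder for the DynamoDB reading code, sessionFactory is the existing Hibernate SessionFactory, and the queue types come from java.util.concurrent):
BlockingQueue<List<Customer>> queue = new ArrayBlockingQueue<>(10);
List<Customer> endMarker = new ArrayList<>(); // an empty batch signals "no more data"

Thread reader = new Thread(() -> {
    try {
        List<Customer> batch;
        // fetchNextBatchFromDynamo(n) is a placeholder: it returns up to n items, or an empty list when done
        while (!(batch = fetchNextBatchFromDynamo(100)).isEmpty()) {
            queue.put(batch); // blocks when the queue is full, throttling the reader
        }
        queue.put(endMarker);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

Thread writer = new Thread(() -> {
    try {
        List<Customer> batch;
        while (!(batch = queue.take()).isEmpty()) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            for (Customer customer : batch) {
                session.save(customer);
            }
            tx.commit(); // commit each small batch to keep transactions short
            session.close();
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});

reader.start();
writer.start();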
Since there is no generated ID, you should benefit from the hibernate.jdbc.batch_size option already in your hibernate configuration. The hibernate.jdbc.fetch_size option (set this to 250 or so) might also be of interest.
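For example, alongside the existing batch_size property in hibernate.cfg.xml:
<property name="hibernate.jdbc.fetch_size">250</property>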
As @hermant1900 mentions, using the StatelessSession is also a good idea. But by far the fastest method is mentioned by @Rob in the comments: use database tools to export the data to a file and import it in MySQL. I'm quite sure this is also the preferred method: it takes less time, less processing and there are fewer variables involved - overall a lot more reliable.
I have the equivalent of the following code and Hibernate configuration (basically, a StreamRef belongs to a Tape, and has to be unique on that tape):
<class name="StreamRef" table="StreamRefToTape">
    <composite-id>
        <key-property name="UUID"/>
        <key-many-to-one class="Tape" name="tape">
            <column name="Tape_TapeId" not-null="true"/>
        </key-many-to-one>
    </composite-id>
    ...
</class>

<class name="Tape" table="Tape">
    <id column="TapeId" name="tapeId"/>
</class>
I have millions of these StreamRef's, and I want to save them all within the same transaction, but I also want to save on RAM during this transaction.
So I attempted the following code, my assumption being that if I turn off the CacheMode, Hibernate won't track objects internally and will therefore use much less RAM (this seems to help, to some degree). But I tested this hypothesis like this:
session = sessionFactory.openSession();
session.setCacheMode(CacheMode.IGNORE); // attempt to disable the first-level cache
Transaction tx = session.beginTransaction();
Tape t = new Tape();
StreamRef s1 = new StreamRef("same uuid");
StreamRef s2 = new StreamRef("same uuid"); // force a primary key collision
t.getStreams().add(s1);
t.getStreams().add(s2);
session.saveOrUpdate(t);
for (StreamRef s : t.getStreams()) {
    session.save(s);
}
tx.commit();
I would have expected this not to raise an exception because I turned off the CacheMode, but it raises a NonUniqueObjectException (https://gist.github.com/4542569). Could somebody please confirm that 1) the Hibernate internal cache cannot be disabled, and 2) this exception has nothing to do with CacheMode? Is there any way to accomplish what I want here (to not use up tons of Hibernate RAM within a transaction)?
somewhat related: https://stackoverflow.com/a/3543740/32453
(As a side question: does the order in which setCacheMode is called relative to beginTransaction matter? I assume it doesn't?)
Many thanks.
The exception makes sense. You're violating the rules you told Hibernate you were going to play by. If you really want to do what you've coded you'll need to use the StatelessSession API or createSQLQuery API. As it stands, Session.setCacheMode is for interacting with the second level cache, not the session cache.
Regarding memory usage, you'll want to incrementally flush batches of records to disk so Hibernate can purge its ActionQueue.
Here is an example from the batch processing chapter of the user's guide:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i = 0; i < 100000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);
    if ( i % 20 == 0 ) { // 20, same as the JDBC batch size
        // flush a batch of inserts and release memory:
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
You can also read about stateless sessions in the same chapter.
Hibernate will save all of the session's objects at once... The cache stores objects for other sessions... So you can't disable it for just a single session... You can't do it; try to use merge()...
I need to insert a lot of data into a database using Hibernate. I was looking at batch inserts with Hibernate; what I am using is similar to the example in the manual:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i = 0; i < 100000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);
    if ( i % 20 == 0 ) { // 20, same as the JDBC batch size
        // flush a batch of inserts and release memory:
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
but I see that flush doesn't write the data to the database.
Reading about it, if the code is inside a transaction then nothing will be committed to the database until the transaction performs a commit.
So what is the need for flush/clear? It seems useless: if the data is not written to the database, then it is still in memory.
How can I force Hibernate to write the data to the database?
Thanks
The data is sent to the database, and is not in memory anymore. It's just not made definitively persistent until the transaction commits. It's exactly the same as if you executed the following sequence of statements in any database tool:
begin;
insert into ...
insert into ...
insert into ...
-- here, three inserts have been done on the database. But they will only be made
-- definitively persistent at commit time
...
commit;
The flush consists in executing the insert statements.
The commit consists in executing the commit statement.
The data will be written to the database, but depending on the transaction isolation level, you will not see it (from other transactions) until the transaction is committed.
Use an SQL statement logger that prints the statements transported over the database connection; then you will see that the statements are sent to the database.
For best performance you also have to commit transactions. Flushing and clearing the session clears Hibernate's caches, but the data is moved to the JDBC connection's caches and is still uncommitted (different RDBMSs/drivers show different behaviour) - you are just shifting the problem to another place without a real improvement in performance.
Having flush() at the location mentioned also saves you memory, as your session will be cleared regularly. Otherwise you would have 100000 objects in memory and might run out of memory for larger counts. Check out this article.
I have a program that is used to replicate/mirror the main tables (around 20) from Oracle to MSSQL 2005 via webservice (REST).
The program periodically reads XML data from the webservice and converts it to a list of JPA entities. This list of entities is stored to MSSQL via JPA.
All JPA entities are provided by the team who created the webservice.
There are two issues that I have noticed and that seem unsolvable after some searching.
1st issue: The performance of inserting/updating via JPA (JDBC) is very slow; it takes around 0.1 s per row...
Doing the same via C# -> DataTable -> bulk insert into a new table in the DB -> call a stored procedure to do the mass insert/update based on joins takes 0.01 s for 4000 records.
(Each table will have around 500-5000 records every 5 minutes)
Below is a snapshot of the Java code that does the task (persistence library: EclipseLink JPA 2.0):
private void GetEntityA(OurClient client, EntityManager em, DBWriter dbWriter) {
    // code to log time and others
    List<EntityA> response = client.findEntityA_XML();
    em.setFlushMode(FlushModeType.COMMIT);
    em.getTransaction().begin();
    int count = 0;
    for (EntityA object : response) {
        count++;
        em.merge(object);

        // Batch commit
        if (count % 1000 == 0) {
            try {
                em.getTransaction().commit();
                em.getTransaction().begin();
                commitRecords = count;
            } catch (Exception e) {
                em.getTransaction().rollback();
            }
        }
    }
    try {
        em.getTransaction().commit();
    } catch (Exception e) {
        em.getTransaction().rollback();
    }
    // dbWriter writes the log to the DB
}
Is anything done wrong that causes the slowness? How can I improve the insert/update speed?
2nd issue: There are around 20 tables to replicate and I have created the same number of methods similar to the above, basically copying the above method 20 times and replacing EntityA with EntityB and so on; you get the idea...
Is there any way to generalize the method so that I can throw in any entity?
The performance of inserting/updating via JPA (JDBC) is very slow,
OR mappers are generally slow for bulk inserts. By definition. You want speed? Use another approach.
In general an ORM will not cater for the bulk insert / stored procedure approach and thus gets slaughtered here. You are using the wrong approach for high-performance inserts.
There are around 20 tables to replicate and I have created the same number of methods similar to
the above, basically copying the above method 20 times and replacing EntityA with EntityB and so on; you get
the idea...
Generics. Part of Java for some time now.
You can execute SQL, stored procedures, or JPQL update-all queries through JPA as well. I'm not sure where these objects are coming from, but if you are just migrating one table to another in the same database, you can do the same thing in Java with JPA that you were doing in C#.
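For example, a native bulk statement can be executed directly (a sketch with hypothetical table and column names):
em.getTransaction().begin();
int rows = em.createNativeQuery(
        "INSERT INTO target_table (id, name) SELECT id, name FROM source_table")
    .executeUpdate();
em.getTransaction().commit();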
If you want to process the objects in JPA, then see,
http://java-persistence-performance.blogspot.com/2011/06/how-to-improve-jpa-performance-by-1825.html
For #2, change EntityA to Object, and you have a generic method.
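A sketch of such a generic method (the commitRecords bookkeeping from the original snippet is omitted, and the error handling is reduced to the essentials):
private <T> void mergeAll(List<T> entities, EntityManager em) {
    em.setFlushMode(FlushModeType.COMMIT);
    em.getTransaction().begin();
    int count = 0;
    for (T entity : entities) {
        count++;
        em.merge(entity);
        // batch commit every 1000 records, as in the original method
        if (count % 1000 == 0) {
            em.getTransaction().commit();
            em.getTransaction().begin();
        }
    }
    em.getTransaction().commit();
}
Each of the 20 replication methods can then delegate to mergeAll with its own entity list.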