We will migrate a large amount of data (a single entity type) from Amazon DynamoDB into a MySQL DB. We are using Hibernate to map this class to a MySQL table. There are around 3 million entities (excluding the rows of the list property). Here is our class mapping summary:
@Entity
@Table(name = "CUSTOMER")
public class Customer {
    @Id
    @Column(name = "id")
    private String id;

    // Other properties, all of which are primitive types/String

    @ElementCollection
    @CollectionTable(name = "CUSTOMER_USER", joinColumns = @JoinColumn(name = "customer_id"))
    @Column(name = "userId")
    private List<String> users;

    // CONSTRUCTORS, GETTERS, SETTERS, etc.
}
users is a list of Strings. We have created two MySQL tables as follows:
CREATE TABLE CUSTOMER(id VARCHAR(100), PRIMARY KEY(id));
CREATE TABLE CUSTOMER_USER(customer_id VARCHAR(100), userId VARCHAR(100), PRIMARY KEY(customer_id, userId), FOREIGN KEY (customer_id) REFERENCES CUSTOMER(id));
Note: We do not make Hibernate generate any ID values; we assign our own IDs to Customer entities, and they are guaranteed to be unique.
Here is our hibernate.cfg.xml:
<hibernate-configuration>
<session-factory>
<property name="hibernate.dialect"> org.hibernate.dialect.MySQLDialect </property>
<property name="hibernate.connection.driver_class"> com.mysql.jdbc.Driver </property>
<property name="hibernate.connection.url"> jdbc:mysql://localhost/xxx </property>
<property name="hibernate.connection.username"> xxx </property>
<property name="hibernate.connection.password"> xxx </property>
<property name="hibernate.connection.provider_class">org.hibernate.c3p0.internal.C3P0ConnectionProvider</property>
<property name="hibernate.jdbc.batch_size"> 50 </property>
<property name="hibernate.cache.use_second_level_cache">false</property>
<property name="c3p0.min_size">30</property>
<property name="c3p0.max_size">70</property>
</session-factory>
</hibernate-configuration>
We create a number of threads, each reading data from DynamoDB and inserting it into our MySQL DB via Hibernate. Here is what each thread does:
// Each single thread brings resultItems from DynamoDB
Session session = factory.openSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < resultItems.size(); i++) {
    Customer cust = new Customer(resultItems.get(i));
    session.save(cust);
    if (i % BATCH_SIZE == 0) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
We have our own performance-monitoring functions and we continuously log the overall read/write performance. The problem is that the migration starts at about 1500 items/sec read/write (on average), but keeps slowing down as the number of rows in the CUSTOMER and CUSTOMER_USER tables grows (after a few minutes, the r/w speed was around 500 items/sec). I am not experienced with Hibernate, so here are my questions:
What should hibernate.cfg.xml look like for a multi-threaded task like ours? Does the configuration I gave above fit such a task, or is anything wrong or missing?
There are exactly 50 threads and each does the following: read from DynamoDB first, then insert the results into the MySQL DB, then read from DynamoDB again, and so on. Therefore, the threads are not talking to Hibernate 100% of the time. Under these circumstances, what do you recommend for the c3p0 min_size and max_size connection pool settings? To understand the concept, should I also set the remaining c3p0-related properties in hibernate.cfg.xml?
What can be done to maximize the speed of bulk inserting?
NOTE 1: I did not list all of the properties, because the remaining ones (other than the list of users) are all int, boolean, String, etc.
NOTE 2: All of these points have been tested and have no negative effect on performance. When we don't insert anything into the MySQL DB, the read speed stays stable for hours.
NOTE 3: Any recommendation/guidance about the structure of the MySQL tables, configuration settings, sessions/transactions, connection pool sizes, batch sizes, etc. will be really helpful!
Assuming you are not doing anything else in the Hibernate transaction than inserting the data into these two tables, you can use StatelessSession session = sessionFactory.openStatelessSession(); instead of a normal session, which removes the overhead of maintaining the first-level cache. But then you will have to save the nested collection objects separately.
Refer https://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html
So it could be something like this (note that a StatelessSession uses insert() rather than save(), bypasses the first-level cache entirely, and therefore needs no flush()/clear() calls):
// Each single thread brings resultItems from DynamoDB
StatelessSession session = factory.openStatelessSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < resultItems.size(); i++) {
    Customer cust = new Customer(resultItems.get(i));
    session.insert(cust); // goes straight to JDBC; the assigned id is already set on cust
    // TODO: create the related CUSTOMER_USER rows for cust and insert() them in the same
    // transaction, since a stateless session does not cascade to the @ElementCollection.
}
tx.commit();
session.close();
In your scenario there are 50 threads batch-inserting data into the same tables simultaneously. MySQL has to maintain ACID properties while 50 transactions, each touching many records in one table, remain open or are being committed. That can cause a huge overhead.
While migrating data between databases, network latency can cause significant delays when there is a lot of back-and-forth communication with the database. In that case, using multiple threads can be beneficial. But when doing batch fetches and batch inserts, there is little to gain, as the database drivers will (or should) transfer data without much back-and-forth communication.
In the batch scenario, start with one thread that reads data, prepares a batch, and puts it in a queue for one thread that writes data from the prepared batches. Keep the batches small (100 to 1,000 records) and commit often (every 100 records or so). This will minimize the overhead of maintaining the table. If network latency is a problem, try using two threads for reading and two for writing (but any performance gain might be offset by the overhead of maintaining the table used by two threads simultaneously).
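A rough sketch of that single-reader/single-writer split, assuming the Customer mapping from the question; MigrationPipeline, readBatchFromDynamo(), the queue capacity, and the batch size are hypothetical placeholders:
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class MigrationPipeline {

    private static final int BATCH_SIZE = 500;                          // illustrative
    private static final List<Customer> END = Collections.emptyList();  // empty batch signals "no more data"

    private final BlockingQueue<List<Customer>> queue = new ArrayBlockingQueue<>(4);
    private final SessionFactory factory;

    public MigrationPipeline(SessionFactory factory) {
        this.factory = factory;
    }

    // Reader thread: pages through DynamoDB and queues prepared batches.
    public Runnable reader() {
        return () -> {
            try {
                List<Customer> batch;
                while (!(batch = readBatchFromDynamo(BATCH_SIZE)).isEmpty()) {
                    queue.put(batch);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Writer thread: one transaction per batch; flush/clear keeps the session small.
    public Runnable writer() {
        return () -> {
            try {
                List<Customer> batch;
                while (!(batch = queue.take()).isEmpty()) {
                    Session session = factory.openSession();
                    Transaction tx = session.beginTransaction();
                    for (Customer c : batch) {
                        session.save(c);
                    }
                    session.flush();
                    session.clear();
                    tx.commit();
                    session.close();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Hypothetical helper: reads up to 'size' items from DynamoDB and maps them to Customer.
    private List<Customer> readBatchFromDynamo(int size) {
        throw new UnsupportedOperationException("replace with your DynamoDB scan/page logic");
    }
}
Starting one thread with reader() and one with writer() keeps only a single transaction open against CUSTOMER/CUSTOMER_USER at a time, which is the point of the exercise.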
Since there is no generated ID, you should benefit from the hibernate.jdbc.batch_size option already in your hibernate configuration. The hibernate.jdbc.fetch_size option (set this to 250 or so) might also be of interest.
As @hermant1900 mentions, using the StatelessSession is also a good idea. But by far the fastest method is the one mentioned by @Rob in the comments: use database tools to export the data to a file and import it into MySQL. I'm quite sure this is also the preferred method: it takes less time, less processing, and there are fewer variables involved - overall a lot more reliable.
Related
I have the equivalent of the following code and Hibernate configuration (basically, a StreamRef belongs to a tape, and has to be unique on that tape):
<class name="StreamRef" table="StreamRefToTape">
<composite-id> <key-property name="UUID"/>
<key-many-to-one class="Tape" name="tape">
<column name="Tape_TapeId" not-null="true"/>
</key-many-to-one>
</composite-id>
...</class>
<class name="Tape" table="Tape">
<id column="TapeId" name="tapeId"/></class>
I have millions of these StreamRef's, and I want to save them all within the same transaction, but I also want to save on RAM during this transaction.
So I attempted the following code, my assumption being that if I turn off the cache mode, Hibernate won't track objects internally, which should save a lot of RAM (this does seem to help, to some degree). But when I test this hypothesis like this:
session = sessionFactory.openSession();
session.setCacheMode(CacheMode.IGNORE); // disable the first level cache (my assumption)
session.beginTransaction();
Tape t = new Tape();
StreamRef s1 = new StreamRef("same uuid");
StreamRef s2 = new StreamRef("same uuid"); // force a primary key collision
// (assume the StreamRef constructor registers each ref on the tape)
session.saveOrUpdate(t);
for (StreamRef s : t.getStreams()) {
    session.save(s);
}
session.getTransaction().commit();
I would have expected this not to throw, because I turned off the cache mode, but it raises a NonUniqueObjectException (https://gist.github.com/4542569). Could somebody please confirm that 1) the Hibernate internal (first-level) cache cannot be disabled, and 2) this exception has nothing to do with CacheMode? Is there any way to accomplish what I want here, i.e., not use up tons of Hibernate RAM within a transaction?
somewhat related: https://stackoverflow.com/a/3543740/32453
(As a side question... does the order in which setCacheMode is called relative to beginTransaction matter? I assume it doesn't?)
Many thanks.
The exception makes sense. You're violating the rules you told Hibernate you were going to play by. If you really want to do what you've coded you'll need to use the StatelessSession API or createSQLQuery API. As it stands, Session.setCacheMode is for interacting with the second level cache, not the session cache.
Regarding memory usage, you'll want to incrementally flush batches of records to disk so Hibernate can purge its ActionQueue.
Here is an example from the section on batch updates in the user's guide:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
Customer customer = new Customer(.....);
session.save(customer);
if ( i % 20 == 0 ) { //20, same as the JDBC batch size
//flush a batch of inserts and release memory:
session.flush();
session.clear();
}
}
tx.commit();
session.close();
You can also read about stateless sessions in the same chapter.
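Applied to the mapping in this question, a stateless-session variant might look roughly like the sketch below (illustrative only; a StatelessSession uses insert() and keeps no first-level cache, so there is nothing to flush or clear):
StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();
try {
    session.insert(tape);                   // StatelessSession uses insert(), not save()
    for (StreamRef s : tape.getStreams()) {
        session.insert(s);                  // each insert goes straight to JDBC, no session cache
    }
    tx.commit();
} catch (RuntimeException e) {
    tx.rollback();
    throw e;
} finally {
    session.close();
}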
Hibernate keeps every object you save in the session's first-level cache until it is flushed and cleared; the cache you are disabling with CacheMode is the second-level cache, which stores objects for other sessions. So you can't disable the first-level cache for a single session. If that doesn't work for you, try using merge().
I have an application using Hibernate. One of its modules calls a native SQL stored procedure in a batch process. Roughly, every time it writes a file it updates a field in the database. Right now I am not sure how many files will need to be written, as this depends on the number of transactions per day, so it could be anywhere from zero to a million.
If I use this code snippet in a loop, will I have any problems?
@Transactional
public void test()
{
    // The for loop represents a list of records that needs to be processed.
    for (int i = 0; i < 1000000; i++)
    {
        // Process the record and write the information into a file.
        ...
        // Update a field(s) in the database using a stored procedure based on the processed information.
        updateField(String.valueOf(i));
    }
}

@Transactional(propagation = Propagation.MANDATORY)
public void updateField(String value)
{
    Session session = getSession();
    SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
    sqlQuery.setParameter("value", value);
    sqlQuery.executeUpdate();
}
Will I need any other configurations for my data source and transaction manager?
Will I need to set hibernate.jdbc.batch_size and hibernate.cache.use_second_level_cache?
Will I need to use session flush and clear for this? The samples in the Hibernate tutorial use POJOs rather than native SQL, so I am not sure whether they also apply here.
Please note that another part of the application already uses Hibernate, so as much as possible I would like to stick with Hibernate.
Thank you for your time. If possible, a code snippet would be really useful for me.
Application Work Flow
1) Query Database for the transaction information. (Transaction date, Type of account, currency, etc..)
2) For each account process transaction information. (Discounts, Current Balance, etc..)
3) Write the transaction information and processed information to a file.
4) Update a database field based on the process information
5) Go back to step 2 while there are still accounts (assuming that no exceptions are thrown).
The code snippet will open and close a session for each iteration, which is definitely not a good practice.
Is it possible to have a job which checks how many new files were added to the folder?
The job would run, say, every 15-25 minutes, check which files were changed/added in the last 15-25 minutes, and update the database in a batch.
Something like that will lower the number of sessions being opened and closed, and it should be much faster than the current approach.
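A rough sketch of that idea, assuming a hypothetical findProcessedSinceLastRun() helper, a 15-minute schedule, and a placeholder chunk size of 50:
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.hibernate.SQLQuery;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class FileUpdateJob {

    private final SessionFactory sessionFactory;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public FileUpdateJob(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(this::runBatch, 15, 15, TimeUnit.MINUTES);
    }

    private void runBatch() {
        List<String> processedIds = findProcessedSinceLastRun(); // hypothetical helper
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            int count = 0;
            for (String id : processedIds) {
                SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
                sqlQuery.setParameter("value", id);
                sqlQuery.executeUpdate();
                if (++count % 50 == 0) { // commit in chunks to keep transactions short
                    tx.commit();
                    tx = session.beginTransaction();
                }
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }

    // Hypothetical: returns the records that were written to files since the last run.
    private List<String> findProcessedSinceLastRun() {
        throw new UnsupportedOperationException("replace with your file/record tracking");
    }
}
This reuses one session per run and commits in chunks, rather than opening a transaction per record.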
I have a program that is used to replicate/mirror the main tables (around 20) from Oracle to MSSQL 2005 via webservice (REST).
The program periodically reads XML data from the web service and converts it to a list of JPA entities. This list of entities is then stored to MSSQL via JPA.
All JPA entities are provided by the team that created the web service.
There are two issues that I have noticed, and they seem unsolvable after some searching.
1st issue: The performance of inserting/updating via JPA is very slow; it takes around 0.1 s per row...
Doing the same via C# -> DataTable -> bulk insert into a new table in the DB -> call a stored procedure to do a mass insert/update based on joins takes 0.01 s for 4000 records.
(Each table will have around 500-5000 records every 5 minutes)
Below is a snapshot of the Java code that does the task (persistence library: EclipseLink JPA 2.0):
private void GetEntityA(OurClient client, EntityManager em, DBWriter dbWriter){
//code to log time and others
List<EntityA> response = client.findEntityA_XML();
em.setFlushMode(FlushModeType.COMMIT);
em.getTransaction().begin();
int count = 0;
for (EntityA object : response) {
count++;
em.merge(object);
//Batch commit
if (count % 1000 == 0){
try{
em.getTransaction().commit();
em.getTransaction().begin();
commitRecords = count;
} catch (Exception e) {
em.getTransaction().rollback();
}
}
}
try{
em.getTransaction().commit();
} catch (Exception e) {
em.getTransaction().rollback();
}
//dbWriter write log to DB
}
Am I doing anything wrong that causes the slowness? How can I improve the insert/update speed?
2nd issue: There are around 20 tables to replicate, and I have created the same number of methods similar to the one above, basically copying the method 20 times and replacing EntityA with EntityB and so on; you get the idea...
Is there any way to generalize the method so that I can throw in any entity?
The performance of inserting/updating via JPA is very slow
OR mappers are generally slow for bulk inserts, by definition. You want speed? Use another approach. In general, an ORM will not cater for the bulk-insert / stored-procedure approach and thus gets slaughtered here. You are using the wrong approach for high-performance inserts.
There are around 20 tables to replicate and I have created the same number of methods similar to the one above, basically copying the method 20 times and replacing EntityA with EntityB and so on; you get the idea...
Generics. They have been part of Java for quite some time now.
You can execute SQL, stored procedures, or JPQL update-all queries through JPA as well. I'm not sure where these objects are coming from, but if you are just migrating one table to another in the same database, you can do the same thing in Java with JPA that you were doing in C#.
If you want to process the objects in JPA, then see,
http://java-persistence-performance.blogspot.com/2011/06/how-to-improve-jpa-performance-by-1825.html
For #2, change EntityA to Object, and you have a generic method.
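For illustration, a hedged sketch of such a generic method, keeping the structure of the method in the question; the Supplier stands in for the per-entity web-service calls (client.findEntityA_XML(), client.findEntityB_XML(), and so on):
import java.util.List;
import java.util.function.Supplier;

import javax.persistence.EntityManager;
import javax.persistence.FlushModeType;

public class Replicator {

    // One method for all 20 entity types.
    private <T> void replicate(Supplier<List<T>> fetcher, EntityManager em) {
        List<T> response = fetcher.get();
        em.setFlushMode(FlushModeType.COMMIT);
        em.getTransaction().begin();
        int count = 0;
        for (T object : response) {
            em.merge(object);
            if (++count % 1000 == 0) {
                em.getTransaction().commit();
                em.clear(); // detach the merged instances so memory stays flat
                em.getTransaction().begin();
            }
        }
        em.getTransaction().commit();
    }

    // Usage, assuming the client from the question:
    // replicate(client::findEntityA_XML, em);
    // replicate(client::findEntityB_XML, em);
}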
I need to consume a rather large amount of data from a daily CSV file. The CSV contains around 120K records. This slows to a crawl when using Hibernate: it seems Hibernate does a SELECT before every single INSERT (or UPDATE) when using saveOrUpdate(), so for every instance being persisted a SELECT is issued before the actual INSERT or UPDATE. I can understand why it does this, but it's terribly inefficient for bulk processing, and I'm looking for alternatives.
I'm confident that the performance issue lies in the way I'm using Hibernate for this, since I got another version working with native SQL (that parses the CSV in the exact same manner), and it literally runs circles around this new version.
So, to the actual question: does a Hibernate alternative to MySQL's "INSERT ... ON DUPLICATE" syntax exist?
Or, if I choose to use native SQL for this, can I run native SQL within a Hibernate transaction? Meaning, will it support commits/rollbacks?
There are many possible bottlenecks in bulk operations. The best approach depends heavily on what your data looks like. Have a look at the Hibernate Manual section on batch processing.
At a minimum, make sure you are using the following pattern (copied from the manual):
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
Customer customer = new Customer(.....);
session.save(customer);
if ( i % 20 == 0 ) { //20, same as the JDBC batch size
//flush a batch of inserts and release memory:
session.flush();
session.clear();
}
}
tx.commit();
session.close();
If you are mapping a flat file to a very complex object graph you may have to get more creative, but the basic principle is that you have to find a balance between pushing good-sized chunks of data to the database with each flush/commit and avoiding exploding the size of the session-level cache.
Lastly, if you don't need Hibernate to handle any collections or cascading for your data to be correctly inserted, consider using a StatelessSession.
From Hibernate Batch Processing
For updates I used the following:
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

ScrollableResults employeeCursor = session.createQuery("FROM Employee").scroll();
int count = 0;

while (employeeCursor.next()) {
    Employee employee = (Employee) employeeCursor.get(0);
    employee.updateEmployee();
    session.update(employee);
    if (++count % 50 == 0) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();
But for inserts I would go with jcwayne's answer.
According to an answer to a similar question, it can be done by configuring Hibernate to insert objects using a custom stored procedure which uses your database's upsert functionality. It's not pretty, though.
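For illustration, a hedged sketch of a related route using Hibernate's @SQLInsert annotation to point the generated INSERT at MySQL's ON DUPLICATE KEY UPDATE syntax instead of a stored procedure; the entity, table, and columns are made up, and the parameter order must match the order in which Hibernate binds your mapped properties:
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

import org.hibernate.annotations.SQLInsert;

// Sketch only: table and columns are illustrative.
@Entity
@Table(name = "IMPORTED_RECORD")
@SQLInsert(sql = "INSERT INTO IMPORTED_RECORD (name, id) VALUES (?, ?) "
        + "ON DUPLICATE KEY UPDATE name = VALUES(name)")
public class ImportedRecord {

    @Id
    private String id;

    private String name;

    // getters and setters omitted
}
With this in place, saving the entity issues the custom statement, so an existing row is updated instead of raising a duplicate-key error.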
High-throughput data import
If you only want to import data without doing any processing or transformation, then a tool like PostgreSQL COPY is the fastest way to import data.
Batch processing
However, if you need to do the transformation, data aggregation, correlation/merging between existing data and the incoming one, then you need application-level batch processing.
In this case, you want to flush-clear-commit regularly:
int entityCount = 50;
int batchSize = 25;
EntityManager entityManager = entityManagerFactory()
.createEntityManager();
EntityTransaction entityTransaction = entityManager
.getTransaction();
try {
entityTransaction.begin();
for (int i = 0; i < entityCount; i++) {
if (i > 0 && i % batchSize == 0) {
entityTransaction.commit();
entityTransaction.begin();
entityManager.clear();
}
Post post = new Post(
String.format("Post %d", i + 1)
);
entityManager.persist(post);
}
entityTransaction.commit();
} catch (RuntimeException e) {
if (entityTransaction.isActive()) {
entityTransaction.rollback();
}
throw e;
} finally {
entityManager.close();
}
Also, make sure you enable JDBC batching using the following configuration properties:
<property
name="hibernate.jdbc.batch_size"
value="25"
/>
<property
name="hibernate.order_inserts"
value="true"
/>
<property
name="hibernate.order_updates"
value="true"
/>
Bulk processing
Bulk processing is suitable when all rows match pre-defined filtering criteria, so you can use a single UPDATE to change all records.
However, using bulk updates that modify millions of records can increase the size of the redo log or end up taking lots of locks on database systems that still use 2PL (Two-Phase Locking), like SQL Server.
So, while the bulk update is the most efficient way to change many records, you have to pay attention to how many records are to be changed to avoid a long-running transaction.
Also, you can combine bulk update with optimistic locking so that other OLTP transactions won't lose the update done by the bulk processing process.
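A brief sketch of such a bulk update through JPA; the Post entity, its status and version fields, and the PostStatus values are assumed for illustration:
// Single JPQL statement: one round trip updates every matching row.
// Bumping the version column keeps optimistic locking consistent for
// entities that other transactions may have already loaded.
int updatedRows = entityManager.createQuery(
        "UPDATE Post p " +
        "SET p.status = :newStatus, p.version = p.version + 1 " +
        "WHERE p.status = :oldStatus")
    .setParameter("newStatus", PostStatus.SPAM)
    .setParameter("oldStatus", PostStatus.PENDING)
    .executeUpdate();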
If you use a sequence or native generator, Hibernate will use a SELECT to get the id:
<id name="id" column="ID">
<generator class="native" />
</id>
You should use a hilo or seqhilo generator instead:
<id name="id" type="long" column="id">
<generator class="seqhilo">
<param name="sequence">SEQ_NAME</param>
<param name="max_lo">100</param>
</generator>
</id>
The "extra" select is to generate the unique identifier for your data.
Switch to HiLo sequence generation and you can reduce the number of sequence round trips to the database by a factor of the allocation size. Please note that there will be gaps in the primary keys unless you adjust your sequence value for the HiLo generator.
I want to ask what the flush method actually does in the following case:
for (int i = 0; i < myList.size(); i++) {
Car c = new Car( car.get(i).getId(),car.get(i).getName() );
getCurrentSession().save(c);
if (i % 20 == 0)
getCurrentSession().flush();
}
Does this mean that after iteration 20, the cache is flushed and the 20 objects held in memory are actually saved to the database?
Can someone please explain to me what happens when the condition is true?
From the javadoc of Session#flush:
Force this session to flush. Must be called at the end of a unit of work, before committing the transaction and closing the session (depending on flush-mode, Transaction.commit() calls this method).
Flushing is the process of synchronizing the underlying persistent store with persistable state held in memory.
In other words, flush tells Hibernate to execute the SQL statements needed to synchronize the JDBC connection's state with the state of the objects held in the session-level cache. And the condition if (i % 20 == 0) will make it happen for every i that is a multiple of 20.
But, still, the new Car instances will be held in the session-level cache and, for a big myList.size(), you're going to eat all the memory and ultimately get an OutOfMemoryError. To avoid this situation, the pattern described in the documentation is to flush AND clear the session at regular intervals (same size as the JDBC batch size) to persist the changes and then detach the instances so that they can be garbage collected:
13.1. Batch inserts
When making new objects persistent, flush() and then clear() the session regularly in order to control the size of the first-level cache.
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
Customer customer = new Customer(.....);
session.save(customer);
if ( i % 20 == 0 ) { //20, same as the JDBC batch size
//flush a batch of inserts and release memory:
session.flush();
session.clear();
}
}
tx.commit();
session.close();
The documentation mentions in the same chapter how to set the JDBC batch size.
See also
10.10. Flushing the Session
Chapter 13. Batch processing
It depends on how the FlushMode is set up.
In the default configuration, Hibernate tries to sync up with the database at three points:
1. before querying data
2. on committing a transaction
3. on an explicit call to flush()
If the FlushMode is set to FlushMode.MANUAL, the programmer is informing Hibernate that they will decide when to push the data to the database. Under this configuration, the session.flush() call will save the object instances to the database, and a session.clear() call can be used to clear the persistence context.
// Assume the list holds 50 elements
for (int i = 0; i < 50; i++) {
    Car c = new Car(car.get(i).getId(), car.get(i).getName());
    getCurrentSession().save(c);
    // every 20 iterations, the Car objects held in memory are synchronized with the DB
    if (i % 20 == 0)
        getCurrentSession().flush();
}
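If you do want the manual mode mentioned above, here is a minimal sketch (setFlushMode(FlushMode.MANUAL) is standard Session API in this Hibernate generation; the rest mirrors the loop above, plus a final flush() because commit() will not flush for you in MANUAL mode):
Session session = sessionFactory.openSession();
session.setFlushMode(FlushMode.MANUAL); // SQL is only sent where flush() is called explicitly

Transaction tx = session.beginTransaction();
for (int i = 0; i < 50; i++) {
    Car c = new Car(car.get(i).getId(), car.get(i).getName());
    session.save(c);
    if (i % 20 == 0) {
        session.flush(); // push this batch of INSERTs to the database
        session.clear(); // detach the flushed Car instances to free memory
    }
}
session.flush(); // in MANUAL mode, commit() does not flush the remaining objects for you
tx.commit();
session.close();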
A few more pointers regarding why the flushing should match the batch size.
To enable batching you need to set the JDBC batch size:
// In your case
hibernate.jdbc.batch_size=20
One common pitfall with batching: if you are inserting or updating a single entity type, everything works fine. But if you are persisting multiple entity types, leading to interleaved inserts/updates, then you will have to explicitly enable statement ordering.
For example
// Assume the list holds 50 elements
for (int i = 0; i < 50; i++) {
    Car c = new Car(car.get(i).getId(), car.get(i).getName());
    // Adding an accessory to the car here as well
    Accessories a = new Accessories("I am a new one");
    c.add(a);
    // Now you have two entities to be persisted: the car and the accessory
    // -> two SQL inserts
    getCurrentSession().save(c);
    // every 20 iterations the pending inserts are flushed to the DB and the
    // Car objects are cleared from the first-level (JVM) cache
    if (i % 20 == 0) {
        getCurrentSession().flush();
        getCurrentSession().clear();
    }
}
In this case two SQL inserts are generated:
1 for the insert into car
1 for the insert into accessory
For proper batching you will have to set
<prop key="hibernate.order_inserts">true</prop>
so that all the inserts for cars are grouped together and all inserts for accessories are grouped together. By doing so you will have 20 inserts firing in a batch rather than one SQL statement firing at a time.
For the different operations performed during a flush within one transaction, you can have a look at http://docs.jboss.org/hibernate/core/3.2/api/org/hibernate/event/def/AbstractFlushingEventListener.html
Yes, every 20th loop iteration, SQL is generated and executed for the unsaved objects. You should also set the JDBC batch size to 20 to increase performance.