I have 1M rows in a table and I want to get all of them. But when I try to fetch all rows with JPA using pagination, I get a Java heap error. Am I missing something? Any advice?
int counter = 0;
while (counter >= 0) {
    javax.persistence.EntityManager em = javax.persistence.Persistence
            .createEntityManagerFactory("MyPU")
            .createEntityManager();
    Query query = em.createQuery("select m from mytable m");
    java.util.Collection<MyEntity> data = query
            .setFirstResult(counter).setMaxResults(1000).getResultList();
    for (MyEntity obj : data) {
        System.out.println(obj);
    }
    counter += 1000;
    data.clear();
    em.clear();
    em.close();
}
Since you use native SQL anyway, can't you specify the LIMIT :counter, 1000 (or ROWNUM BETWEEN :counter AND 1000 if using Oracle) directly in your SQL statement?
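For example (a minimal sketch, assuming MySQL-style LIMIT syntax and the table/entity names from the question):
// Hypothetical native-SQL pagination; the "LIMIT offset, count" syntax is MySQL-specific.
Query query = em.createNativeQuery(
        "select * from mytable limit ?1, 1000", MyEntity.class);
query.setParameter(1, counter);
List<MyEntity> data = query.getResultList();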
Note that you create a new instance of EntityManagerFactory at each iteration, but never close it. It would be better to create a single factory and reuse it:
int counter = 0;
EntityManagerFactory emf = javax.persistence.Persistence.createEntityManagerFactory("MyPU");
while (counter >= 0) {
    javax.persistence.EntityManager em = emf.createEntityManager();
    ...
}
emf.close();
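For completeness, a fuller sketch of the fixed loop (assuming the MyPU persistence unit and MyEntity entity from the question; the loop now also exits once a page comes back empty, which the original never did):
EntityManagerFactory emf = Persistence.createEntityManagerFactory("MyPU");
EntityManager em = emf.createEntityManager();
try {
    int counter = 0;
    while (true) {
        List<MyEntity> data = em
                .createQuery("select m from MyEntity m", MyEntity.class)
                .setFirstResult(counter)
                .setMaxResults(1000)
                .getResultList();
        if (data.isEmpty()) {
            break; // no more rows to page through
        }
        for (MyEntity obj : data) {
            System.out.println(obj);
        }
        counter += 1000;
        em.clear(); // detach the processed page so the persistence context stays small
    }
} finally {
    em.close();
    emf.close();
}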
Related
I am using this StringBuilder to add content to a query:
Integer lastEntryInEntityId = 1; // acquired through another query
Integer tmpValueForEntityId;
Integer lastEntryInEntity2Id = 1; // acquired through another query
StringBuilder queryString = new StringBuilder(
        "insert into entity(column,column_1,column_2,column_3) values");
StringBuilder queryString2 = new StringBuilder(
        "insert into entity2(column,column_1,column_2,column_3) values");
for (Object[] entityToCopy : entitiesToCopy) {
    Entity entity = (Entity) entityToCopy[0];
    tmpValueForEntityId = lastEntryInEntityId;
    queryString.append("(" + lastEntryInEntityId++ + "," + entity.getProperty() + "," + entityToCopy[1] + "," + entity.getProperty2() + "),");
    for (Entity2 entity2 : entity.getEntity2Collection()) {
        queryString2.append("(" + lastEntryInEntity2Id++ + "," + tmpValueForEntityId + "," + entity.getProperty2() + "," + entity.getProperty3() + "),");
    }
}
This code takes both too much time and too much memory. It actually throws an OutOfMemoryError after some time when appending to the second StringBuilder (when there are too many entitiesToCopy).
How else can I write this code in order to make it faster and use less memory?
NOTE: A Java 8 solution would be preferred.
NOTE 2: I use EntityManager.
You should chain append() calls instead of using + concatenation inside StringBuilder:
for (Object[] entityToCopy : entitiesToCopy) {
    Entity entity = (Entity) entityToCopy[0];
    tmpValueForEntityId = lastEntryInEntityId;
    queryString.append("(").append(lastEntryInEntityId++).append(",").append(entity.getProperty()).append(",").append(entityToCopy[1]).append(",").append(entity.getProperty2()).append("),");
    for (Entity2 entity2 : entity.getEntity2Collection()) {
        queryString2.append("(").append(lastEntryInEntity2Id++).append(",").append(tmpValueForEntityId).append(",").append(entity.getProperty2()).append(",").append(entity.getProperty3()).append("),");
    }
}
For better performance, use a PreparedStatement inside a transaction:
dbCon.setAutoCommit(false);
var pst = dbCon.prepareStatement("insert into entity (columnID, column_1, column_2, column_3) values (?, ?, ?, ?)");
for (Object[] entityToCopy : entitiesToCopy) {
    var entity = (Entity) entityToCopy[0];
    tmpValueForEntityId = lastEntryInEntityId;
    pst.setInt(1, lastEntryInEntityId++);
    pst.setString(2, entity.getProperty());
    pst.setObject(3, entityToCopy[1]);
    pst.setString(4, entity.getProperty2());
    pst.addBatch();
}
pst.executeBatch();
dbCon.commit();
dbCon.setAutoCommit(true);
Each ? represents a column. The first one represents the ID, the second one represents column_1, etc. Keep the order of each one.
Note: If you are using Java prior to 10 (var was introduced in Java 10), replace var with the explicit types PreparedStatement and Entity.
With concurrent connections (more than one thread inserting into the database):
Don't close the database connection after each commit (close it on program exit)
The method that inserts data should be synchronized
Don't use prepareStatement(); instead use createStatement() with a Pattern (regex) to guard against SQL injection.
Note: PreparedStatement is good, fast and secure. The database keeps a pool of prepared statements to avoid creating a new one every time. But under concurrency, after one thread holds a reference to an existing PreparedStatement, another thread can use it and the transaction is slow (it waits for a new instance or for a new reference to an existing one). Under concurrency this happens many, many times.
EntityManager example:
var em = emf.createEntityManager();
EntityTransaction transaction = null;
try {
    transaction = em.getTransaction();
    transaction.begin();
    for (Object[] entityToCopy : entitiesToCopy) {
        var entity = (Entity) entityToCopy[0];
        ... // insert here
    }
    transaction.commit();
} catch (RuntimeException e) {
    if (transaction != null && transaction.isActive()) {
        transaction.rollback();
    }
    e.printStackTrace();
} finally {
    em.close();
}
I executed a query every x iterations so that the query doesn't get too big. This solved my problem.
int count = 0;
for (Object[] entityToCopy : entitiesToCopy) {
    Entity entity = (Entity) entityToCopy[0];
    tmpValueForEntityId = lastEntryInEntityId;
    queryString.append("(" + lastEntryInEntityId++
            + "," + entity.getProperty() + "," + entityToCopy[1] + "," + entity.getProperty2() + "),");
    for (Entity2 entity2 : entity.getEntity2Collection()) {
        queryString2.append("(" + lastEntryInEntity2Id++
                + "," + tmpValueForEntityId + "," + entity.getProperty2() + "," + entity.getProperty3() + "),");
    }
    count++;
    if (count % 2000 == 0 || entitiesToCopy.size() == count) {
        queryString.deleteCharAt(queryString.length() - 1); // drop the trailing comma
        em.createNativeQuery(queryString.toString()).executeUpdate();
        queryString = new StringBuilder("insert into entity(column,column_1,column_2,column_3) values");
        queryString2.deleteCharAt(queryString2.length() - 1); // drop the trailing comma
        em.createNativeQuery(queryString2.toString()).executeUpdate();
        queryString2 = new StringBuilder("insert into entity2(column,column_1,column_2,column_3) values");
    }
}
I have this setup
#Table(name ="A")
EntityA {
Long ID;
List<EntityB> children;
}
#Table(name ="B")
EntityB {
Long ID;
EntityA parent;
EntityC grandchild;
}
#Table(name ="C")
EntityC {
Long ID;
}
The SQL query is this (I omitted irrelevant details):
select top 300 * from A where ... and ID in (select parent from B where ... and grandchild in (select ID from C where ...)) order by ...
The SQL query, run directly against the database or passed to Hibernate (3.5) as native SQL, runs about 1000× faster than expressing the same query via Criteria or HQL.
The SQL generated from HQL and Criteria is identical to the SQL I posted above.
[EDIT]: Correction - the SQL was not identical. I didn't try the Hibernate-style parameter setting on the Management Studio side because I did not realize this until later - see my answer.
If I separate out the subqueries into separate queries, then it is fast again.
I tried
removing all mappings of child, parent, etc. and just using Long ID references - same thing, so it's not related to fetching or lazy/eager loading.
using joins instead of subqueries, and got the same slow behaviour with all combinations of fetching and loading.
setting a projection on ID instead of retrieving entities, so there is no object conversion - still slow
I looked at the Hibernate code and it is doing something astounding. It has a loop through all 300 results that ends up hitting the database.
private List doQuery(
final SessionImplementor session,
final QueryParameters queryParameters,
final boolean returnProxies) throws SQLException, HibernateException {
final RowSelection selection = queryParameters.getRowSelection();
final int maxRows = hasMaxRows( selection ) ?
selection.getMaxRows().intValue() :
Integer.MAX_VALUE;
final int entitySpan = getEntityPersisters().length;
final ArrayList hydratedObjects = entitySpan == 0 ? null : new ArrayList( entitySpan * 10 );
final PreparedStatement st = prepareQueryStatement( queryParameters, false, session );
final ResultSet rs = getResultSet( st, queryParameters.hasAutoDiscoverScalarTypes(), queryParameters.isCallable(), selection, session );
// would be great to move all this below here into another method that could also be used
// from the new scrolling stuff.
//
// Would need to change the way the max-row stuff is handled (i.e. behind an interface) so
// that I could do the control breaking at the means to know when to stop
final EntityKey optionalObjectKey = getOptionalObjectKey( queryParameters, session );
final LockMode[] lockModesArray = getLockModes( queryParameters.getLockOptions() );
final boolean createSubselects = isSubselectLoadingEnabled();
final List subselectResultKeys = createSubselects ? new ArrayList() : null;
final List results = new ArrayList();
try {
handleEmptyCollections( queryParameters.getCollectionKeys(), rs, session );
EntityKey[] keys = new EntityKey[entitySpan]; //we can reuse it for each row
if ( log.isTraceEnabled() ) log.trace( "processing result set" );
int count;
for ( count = 0; count < maxRows && rs.next(); count++ ) {
if ( log.isTraceEnabled() ) log.debug("result set row: " + count);
Object result = getRowFromResultSet(
rs,
session,
queryParameters,
lockModesArray,
optionalObjectKey,
hydratedObjects,
keys,
returnProxies
);
results.add( result );
if ( createSubselects ) {
subselectResultKeys.add(keys);
keys = new EntityKey[entitySpan]; //can't reuse in this case
}
}
if ( log.isTraceEnabled() ) {
log.trace( "done processing result set (" + count + " rows)" );
}
}
finally {
session.getBatcher().closeQueryStatement( st, rs );
}
initializeEntitiesAndCollections( hydratedObjects, rs, session, queryParameters.isReadOnly( session ) );
if ( createSubselects ) createSubselects( subselectResultKeys, queryParameters, session );
return results; //getResultList(results);
}
In this code
final ResultSet rs = getResultSet( st, queryParameters.hasAutoDiscoverScalarTypes(), queryParameters.isCallable(), selection, session );
it hits the database with the full SQL, but there are no results collected anywhere.
Then it proceeds to go through this loop
for ( count = 0; count < maxRows && rs.next(); count++ ) {
where, for every one of the expected 300 results, it ends up hitting the database to get the actual result.
This seems insane, since it should already have all the results after 1 query. Hibernate logs do not show any additional SQL being issued during all that time.
Anyone have any insight? The only option I have is to go to native SQL query through Hibernate.
I finally managed to get to the bottom of this. The problem was caused by Hibernate setting the parameters separately from the actual SQL query that involved subqueries. So native SQL or not, the performance will be slow if this is done. For example, this will be slow:
String sql = "select ... where parameter = :value"; // SQL with a named parameter
SQLQuery sqlQuery = session.createSQLQuery(sql);
sqlQuery.setParameter ("value", someValue);
List<Object[]> list = (List<Object[]>)sqlQuery.list();
And this will be fast:
String sql = "select ... where parameter = 'actualValue'"; // native SQL with the value inlined
SQLQuery sqlQuery = session.createSQLQuery(sql);
List<Object[]> list = (List<Object[]>)sqlQuery.list();
It seems that, for some reason, when Hibernate is left to take care of the parameters, it gets stuck in result-set fetching. This is probably because the underlying query on the database takes much longer when parameterized. I ended up writing the equivalent of my Hibernate Criteria and Restrictions code so that it sets the parameter values directly, as above.
We noticed similar behaviour in our system, and also found that writing the query with hardcoded parameters instead of using setParameter() fixed the issue.
We are using MS SQL Server, and after further investigation we noticed that the root cause of our issue was a default configuration of the SQL Server driver that transmits the query parameters as Unicode. This led to our indices being ignored, since they were based on the ASCII values of the queried columns.
The solution was to set this property in the JDBC URL: sendStringParametersAsUnicode=false
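For example (a sketch; the host, port and database name are placeholders):
// Hypothetical connection URL for the Microsoft SQL Server JDBC driver;
// only the last property is the relevant fix.
String url = "jdbc:sqlserver://localhost:1433;databaseName=mydb;sendStringParametersAsUnicode=false";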
More details can be found here: https://stackoverflow.com/a/32867579
The following code:
EntityManagerFactory emf = Persistence.createEntityManagerFactory("test.odb");
EntityManager em = emf.createEntityManager();
em.getTransaction().begin();
Point p = new Point(0, 0);
em.persist(p);
em.getTransaction().commit();
em.getTransaction().begin();
Query query = em.createQuery("UPDATE Point SET x = 1001 where x = 0");
int updateCount = query.executeUpdate();
em.getTransaction().commit();
TypedQuery<Point> myquery = em.createQuery("SELECT p from Point p where p.x = 1001", Point.class);
List<Point> results = myquery.getResultList();
System.out.println("X coordinate is: " + results.get(0).getX());
em.close();
prints out: X coordinate is: 0
which is wrong because the X coordinate should be 1001.
But if I change the code to:
EntityManagerFactory emf = Persistence.createEntityManagerFactory("test.odb");
EntityManager em = emf.createEntityManager();
em.getTransaction().begin();
Point p = new Point(0, 0);
em.persist(p);
em.getTransaction().commit();
em.getTransaction().begin();
Query query = em.createQuery("UPDATE Point SET x = 1001 where x = 0");
int updateCount = query.executeUpdate();
em.getTransaction().commit();
em.close();
em = emf.createEntityManager();
TypedQuery<Point> myquery = em.createQuery("SELECT p from Point p where p.x = 1001", Point.class);
List<Point> results = myquery.getResultList();
System.out.println("X coordinate is: " + results.get(0).getX());
em.close();
The result is as expected:
X coordinate is: 1001
What have I done wrong in the first code snippet?
UPDATE queries bypass the EntityManager, which means that the EntityManager may not have an up to date view of the real objects in the database.
As explained in the UPDATE queries page in the ObjectDB Manual:
"Updating entity objects in the database using an UPDATE query may be slightly more efficient than retrieving entity objects and then updating them, but it should be used cautiously because bypassing the EntityManager may break its synchronization with the database. For example, the EntityManager may not be aware that a cached entity object in its persistence context has been modified by an UPDATE query. Therefore, it is a good practice to use a separate EntityManager for UPDATE queries."
Using a separate EntityManager is exactly what you did by closing and opening a new EntityManager in your revised code.
Alternatively, if you want to use the same EntityManager, you may clear its persistence context (i.e. its cache), after running the UPDATE query and before running the SELECT query.
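For example, in the first snippet (same entities and queries as above, with one added line):
em.getTransaction().begin();
Query query = em.createQuery("UPDATE Point SET x = 1001 where x = 0");
query.executeUpdate();
em.getTransaction().commit();

em.clear(); // drop stale cached entities so the SELECT sees the updated rows

TypedQuery<Point> myquery = em.createQuery("SELECT p from Point p where p.x = 1001", Point.class);
List<Point> results = myquery.getResultList();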
I've noticed that instantiation using the RepositoryConnection add method was slower than instantiation by modifying the model with a SPARQL update. Despite the difference, even the SPARQL update method takes a long time (3.4 minutes for 10,000 triples). Executing multiple inserts (one query per triple) or one big insert query does not change the performance of either method; it is still slow. Is there another method appropriate for adding 1 million triples, or are there any special configurations that can help?
Code for RepositoryConnection
Repository myRepository = new HTTPRepository(serverURL, repositoryId);
myRepository.initialize();
RepositoryConnection con = myRepository.getConnection();
ValueFactory f = myRepository.getValueFactory();
int i = 0;
int j = 1000000;
while (i < j) {
    URI event = f.createURI(ontologyIRI + "event" + i);
    URI hasTimeStamp = f.createURI(ontologyIRI + "hasTimeStamp");
    Literal timestamp = f.createLiteral(fields.get(0));
    con.add(event, hasTimeStamp, timestamp);
    i++;
}
Code for SPARQL
Repository myRepository = new HTTPRepository(serverURL, repositoryId);
myRepository.initialize();
RepositoryConnection con = myRepository.getConnection();
int i = 0;
int j = 1000000;
while (i < j) {
    String query = "INSERT {";
    query += "st:event" + i + " st:hasTimeStamp '" + fields.get(0) + "'^^<http://www.w3.org/2001/XMLSchema#float> .\n"
            + "}"
            + "WHERE { ?x ?y ?z }";
    Update update = con.prepareUpdate(QueryLanguage.SPARQL, query);
    update.execute();
    i++;
}
Edit
I've experimented with both the in-memory and native store Sesame repositories, with the synchronization value set to 0.
(I only just noticed that you added the requested additional info, hence this rather late reply)
The problem is, as I suspected, that you are not using transactions to batch your update operations together. Effectively, each add operation you do becomes a single transaction (a Sesame repository connection runs in autocommit mode by default), and this is slow and inefficient.
To change this, start a transaction (using RepositoryConnection.begin()), then add your data, and finally call RepositoryConnection.commit() to finalize the transaction.
Here's how you should modify your first code example:
Repository myRepository = new HTTPRepository(serverURL, repositoryId);
myRepository.initialize();
RepositoryConnection con = myRepository.getConnection();
ValueFactory f = myRepository.getValueFactory();
i = 0;
j = 1000000;
try {
con.begin(); // start the transaction
while(i < j) {
URI event = f.createURI(ontologyIRI + "event"+i);
URI hasTimeStamp = f.createURI(ontologyIRI + "hasTimeStamp");
Literal timestamp = f.createLiteral(fields.get(0));
con.add(event, hasTimeStamp, timestamp);
i++;
}
con.commit(); // finish the transaction: commit all our adds in one go.
}
finally {
// always close the connection when you're done with it.
con.close();
}
The same applies to your code with the SPARQL update. For more information on how to work with transactions, have a look at the Sesame manual, particularly the chapter about using the Repository API.
As an aside: since you're working over HTTP, there is a risk that if your transaction becomes too large, it will start consuming a lot of memory in your client. If this starts happening you may want to break up your update into several transactions. But with an update consisting of a million triples you should still be alright, I think.
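If that happens, a sketch of a chunked variant of the first example (the 100,000-triple chunk size is an arbitrary choice):
URI hasTimeStamp = f.createURI(ontologyIRI + "hasTimeStamp");
int i = 0;
con.begin();
while (i < 1000000) {
    URI event = f.createURI(ontologyIRI + "event" + i);
    con.add(event, hasTimeStamp, f.createLiteral(fields.get(0)));
    i++;
    if (i % 100000 == 0) {
        con.commit(); // flush this chunk to the server
        con.begin();  // start the next chunk's transaction
    }
}
con.commit(); // commit any final partial chunk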
Does the JPA/EJB3 framework provide a standard way to do batch inserts...?
We use Hibernate as our persistence framework, so I can fall back to the Hibernate Session and use a combination of session.save()/session.flush() to achieve batch inserts. But I would like to know if EJB3 has support for this...
Neither JPA nor Hibernate provides particular support for batch inserts, and the idiom for batch inserts with JPA is the same as with Hibernate:
EntityManager em = ...;
EntityTransaction tx = em.getTransaction();
tx.begin();
for (int i = 0; i < 100000; i++) {
    Customer customer = new Customer(.....);
    em.persist(customer);
    if (i % 20 == 0) { // 20, same as the JDBC batch size
        // flush a batch of inserts and release memory:
        em.flush();
        em.clear();
    }
}
tx.commit();
em.close();
Using Hibernate's proprietary API in this case doesn't provide any advantage IMO.
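Note that the flush()/clear() pattern only produces real JDBC batching if the provider's batch size is configured; a sketch of how that might look with Hibernate as the provider (the property name is Hibernate-specific, and the "MyPU" persistence unit name is a placeholder):
// Passed when creating the EntityManagerFactory; 20 matches the flush interval above.
Map<String, String> props = new HashMap<>();
props.put("hibernate.jdbc.batch_size", "20");
EntityManagerFactory emf = Persistence.createEntityManagerFactory("MyPU", props);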
References
JPA 1.0 Specification
Section 4.10 "Bulk Update and Delete Operations"
Hibernate Core reference guide
Chapter 13. Batch processing
For Hibernate specifically, the whole of chapter 13 of the core manual explains the batch-processing methods.
But you say that you want the EJB approach through Hibernate, so the Entity Manager documentation also has a chapter on that. I suggest that you read both (the core manual and the Entity Manager one).
In EJB it is simply a matter of using EJB-QL (with some limitations). Hibernate provides more mechanics, though, if you need more flexibility.
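One of those extra mechanics covered in chapter 13 is the StatelessSession, which bypasses the first-level cache and dirty checking entirely; a sketch (assuming a SessionFactory is available, with the Customer constructor elided as in the example above):
// A StatelessSession does no caching or dirty checking, which suits bulk inserts.
StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < 100000; i++) {
    Customer customer = new Customer(.....);
    session.insert(customer); // issues the insert directly, nothing is held in memory
}
tx.commit();
session.close();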
With a medium number of records you can do it this way:
em.getTransaction().begin();
for (int i = 1; i <= 100000; i++) {
Point point = new Point(i, i);
em.persist(point);
if ((i % 10000) == 0) {
em.flush();
em.clear();
}
}
em.getTransaction().commit();
But with a large number of records you should perform this task in multiple transactions:
em.getTransaction().begin();
for (int i = 1; i <= 1000000; i++) {
Point point = new Point(i, i);
em.persist(point);
if ((i % 10000) == 0) {
em.getTransaction().commit();
em.clear();
em.getTransaction().begin();
}
}
em.getTransaction().commit();
Ref: JPA Batch Store
Yes, you can fall back to your JPA provider's native API if you wish, in order to have the control you described.
JPA 1.0 is rich in query language (EJB-QL/JPQL) support but light on Criteria API support; however, this has been addressed in 2.0.
Session session = (Session) entityManager.getDelegate();
session.setFlushMode(FlushMode.MANUAL);
Pascal, in your example inserting 100,000 records, everything is done within a single transaction, as commit() is only called at the end. Doesn't that put a lot of pressure on the database? Furthermore, if there is a rollback, the cost will be too high.
Will the following approach be better?
EntityManager em = ...;
for (int i = 0; i < 100000; i++) {
    if (!em.getTransaction().isActive()) {
        em.getTransaction().begin();
    }
    Customer customer = new Customer(.....);
    em.persist(customer);
    if ((i + 1) % 20 == 0) { // 20, same as the JDBC batch size
        // flush and commit the batch of inserts and release memory:
        em.getTransaction().commit();
        em.clear();
    }
}
if (em.getTransaction().isActive()) {
    em.getTransaction().commit(); // commit any final partial batch
}
em.close();