Spring Data JPA - concurrent Bulk inserts/updates

I am currently developing a Spring Boot application that mainly pulls product review data from a message queue (~5 concurrent consumers) and stores it in a MySQL database. Each review is uniquely identified by its reviewIdentifier (String), which is the primary key, and can belong to one or more products (e.g. products with different colors). Here is an excerpt of the data model:
public class ProductPlacement implements Serializable {
    private static final long serialVersionUID = 1L;
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(name = "product_placement_id")
    private long id;
    @ManyToMany(fetch = FetchType.LAZY, cascade = CascadeType.ALL, mappedBy = "productPlacements")
    private Set<CustomerReview> customerReviews;
}
public class CustomerReview implements Serializable {
    private static final long serialVersionUID = 1L;
    @Id
    @Column(name = "customer_review_id")
    private String reviewIdentifier;
    @ManyToMany(fetch = FetchType.LAZY, cascade = CascadeType.ALL)
    @JoinTable(
        name = "tb_miner_review_to_product",
        joinColumns = @JoinColumn(name = "customer_review_id"),
        inverseJoinColumns = @JoinColumn(name = "product_placement_id")
    )
    private Set<ProductPlacement> productPlacements;
}
One message from the queue contains 1 - 15 reviews and a productPlacementId. Now I want an efficient method to persist the reviews for the product. There are basically two cases which need to be considered for each incoming review:
The review is not in the database -> insert the review with a reference to the product that is contained in the message.
The review is already in the database -> just add the product reference to the Set productPlacements of the existing review.
Currently my method for persisting the reviews is not optimal. It looks as follows (it uses Spring Data JpaRepositories):
@Override
@Transactional
public void saveAllReviews(List<CustomerReview> customerReviews, long productPlacementId) {
    ProductPlacement placement = productPlacementRepository.findOne(productPlacementId);
    for (CustomerReview review : customerReviews) {
        CustomerReview cr = customerReviewRepository.findOne(review.getReviewIdentifier());
        if (cr != null) {
            // existing review: only add the product reference
            cr.getProductPlacements().add(placement);
            customerReviewRepository.saveAndFlush(cr);
        } else {
            // new review: insert it with a reference to the product
            Set<ProductPlacement> productPlacements = new HashSet<>();
            productPlacements.add(placement);
            review.setProductPlacements(productPlacements);
            cr = review;
            customerReviewRepository.saveAndFlush(cr);
        }
    }
}
Questions:
I sometimes get ConstraintViolationExceptions because the unique constraint on "reviewIdentifier" is violated. This is obviously because I (concurrently) check whether the review is already present and then insert or update it. How can I avoid that?
Is it better to use save() or saveAndFlush() in my case? I get ~50-80 reviews per second. Will Hibernate flush automatically if I just use save(), or will it result in greatly increased memory usage?
Update to question 1: Would a simple @Lock on my review repository prevent the unique-constraint exception?
@Lock(LockModeType.PESSIMISTIC_WRITE)
CustomerReview findByReviewIdentifier(String reviewIdentifier);
What happens when findByReviewIdentifier returns null? Can Hibernate lock the reviewIdentifier for a potential insert even if the method returns null?
Thank you!

From a performance point of view, I would consider evaluating the solution with the following changes.
Changing from bidirectional ManyToMany to bidirectional OneToMany
I had the same question about which one is more efficient in terms of the DML statements that get executed. Quoting from "Typical ManyToMany mapping versus two OneToMany":
The option one might be simpler from a configuration perspective, but it yields less efficient DML statements.
Use the second option because whenever the associations are controlled by @ManyToOne associations, the DML statements are always the most efficient ones.
Enable the batching of DML statements
Enabling batching support results in fewer round trips to the database to insert/update the same number of records.
Quoting from "batch INSERT and UPDATE statements":
hibernate.jdbc.batch_size = 50
hibernate.order_inserts = true
hibernate.order_updates = true
hibernate.jdbc.batch_versioned_data = true
Reduce the number of saveAndFlush calls
The current code loads the ProductPlacement and calls saveAndFlush for each review, which prevents any batching of DML statements.
Instead, I would consider loading the ProductPlacement entity, adding the List<CustomerReview> customerReviews to the Set<CustomerReview> customerReviews field of the ProductPlacement entity, and finally calling the merge method once at the end, with these two changes:
Making the ProductPlacement entity the owner of the association, i.e. by moving the mappedBy attribute onto the Set<ProductPlacement> productPlacements field of the CustomerReview entity.
Making the CustomerReview entity implement the equals and hashCode methods using the reviewIdentifier field. I believe reviewIdentifier is unique and user-assigned.
Finally, as you do performance tuning with these changes, baseline your performance with the current code. Then make the changes and check whether they really result in a significant performance improvement for your solution.
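To illustrate the equals/hashCode point, here is a minimal sketch in plain Java. The CustomerReview class below is only a stand-in with the JPA annotations and all other fields left out, and ReviewKeyDemo and addAll are hypothetical names used for the demonstration. It shows that once equality is based on reviewIdentifier, adding an already known review to a Set is a no-op:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class ReviewKeyDemo {

    // Stand-in for the CustomerReview entity (annotations and other fields omitted).
    static class CustomerReview {
        private final String reviewIdentifier;

        CustomerReview(String reviewIdentifier) {
            this.reviewIdentifier = reviewIdentifier;
        }

        // Equality is based only on the user-assigned, unique identifier.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof CustomerReview)) return false;
            return Objects.equals(reviewIdentifier, ((CustomerReview) o).reviewIdentifier);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(reviewIdentifier);
        }
    }

    // Adds the incoming reviews to the set and returns how many were actually new.
    static int addAll(Set<CustomerReview> target, Iterable<CustomerReview> incoming) {
        int added = 0;
        for (CustomerReview review : incoming) {
            if (target.add(review)) {
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Set<CustomerReview> reviews = new HashSet<>();
        reviews.add(new CustomerReview("R1"));
        int added = addAll(reviews,
                Arrays.asList(new CustomerReview("R1"), new CustomerReview("R2")));
        System.out.println(added + " new, " + reviews.size() + " total"); // 1 new, 2 total
    }
}
```

This only covers the in-memory Set semantics; it does not by itself prevent two concurrent transactions from inserting the same identifier.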

Related

JPA Many-To-Many Relationship with same DB and extra Attribute

I need a Many-To-Many Relationship within the same Database. I don't mind creating mapping databases, but I want to have only one Entity in the end.
Let's say I've got a resource which can have many resources (Sub-Resources). What I need is a Resource with its Sub-Resources and also the count of them, because one Resource can have x resources.
Essentially, I need this with the extra Attribute of the count of Sub resources needed for the Resource.
@Table(name = "resources")
public class Resources {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private long id;
    @Column
    private String name;
    @ManyToMany
    private Collection<Resources> subResources;
}
To clarify that a bit, at best I would have something like that:
@Table(name = "resources")
public class Resources {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private long id;
    @Column
    private String name;
    @ManyToMany
    private HashMap<Resources, Integer /* count */> subResources;
}
I know how it works with two tables (Resources & Sub Resources) and a mapping type, but I couldn't figure out how to do it as described above, since Resources can be Sub-Resources at the same time.
Thanks in advance
EDIT: I need an extra Attribute in the mapping table where I can set the amount of sub resources as an Integer

The configuration you have will work for a unidirectional relationship. There is no technical problem, except that you will not be able to specify the multiple parents of a subresource, so in the end it is not many-to-many.
To make it truly many-to-many, you need another field on the Resources class to define the inverse side of the relationship. I have added the @JoinTable annotation to make the names in the join table explicit, but it is optional if the defaults are good enough for you. I also switched from the very basic Collection to List; I would prefer Set, but then you would have to provide equals and hashCode on the entity. Finally, I always initialize the collection-valued fields (ArrayList here; HashSet if you go for Set) to avoid silly NullPointerExceptions or complex initialization code:
@ManyToMany
@JoinTable(
    name = "RESOURCE_SUBRESOURCE",
    joinColumns = @JoinColumn(name = "resource_id"),
    inverseJoinColumns = @JoinColumn(name = "subresource_id")
)
private List<Resources> subResources = new ArrayList<>();
// the mappedBy signals that this is the inverse side of the relation, not a new relation altogether
@ManyToMany(mappedBy = "subResources")
private List<Resources> parentResources = new ArrayList<>();
Use as:
Resources r1 = new Resources();
r1.setName("alpha");
em.persist(r1);
Resources r2 = new Resources();
r2.setName("beta");
r2.getSubResources().add(r1);
em.persist(r2);
Resources r3 = new Resources();
r3.setName("gama");
em.persist(r3);
Resources r4 = new Resources();
r4.setName("delta");
// won't work, you need to set the owning side of the relationship, not the inverse:
r4.setParentResources(Arrays.asList(r2, r3));
// will work like this:
r2.getSubResources().add(r4);
r3.getSubResources().add(r4);
// I believe that the order of the following operations is important, unless you set cascade on the relationship
em.persist(r4);
r2 = em.merge(r2);
r3 = em.merge(r3);
As for the count: in the question you mention that you want a count of related objects. While specific JPA providers (Hibernate, EclipseLink) may allow you to accomplish this (using a read-only field that is populated by an aggregate query, e.g. SELECT COUNT(*) FROM JoinTable WHERE resource_id=?), it is not standard. You can always call resource.getSubResources().size(), but that would fetch all the subresources into memory, which is not a good thing and might in fact be a really bad thing if you call it frequently or there are many sub/parent resources.
I would prefer to run a separate count query, perhaps even for a set of resource ids, whenever I really need this.
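As a rough illustration of what such a count query for a set of resource ids would return, here is a sketch in plain Java that runs the same grouping over an in-memory list standing in for the RESOURCE_SUBRESOURCE join table. The class names, the method countSubResources and the ids used are assumptions made for the example; in a real application this would be a single grouped query against the database.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SubResourceCountDemo {

    // One row of the RESOURCE_SUBRESOURCE join table.
    static final class Link {
        final long resourceId;
        final long subResourceId;

        Link(long resourceId, long subResourceId) {
            this.resourceId = resourceId;
            this.subResourceId = subResourceId;
        }
    }

    // Mirrors a grouped count such as:
    //   SELECT resource_id, COUNT(*) FROM RESOURCE_SUBRESOURCE
    //   WHERE resource_id IN (:ids) GROUP BY resource_id
    // Unlike the SQL, ids without any rows are reported with a count of 0.
    static Map<Long, Long> countSubResources(List<Link> joinTable, Set<Long> ids) {
        Map<Long, Long> counts = new TreeMap<>();
        for (Long id : ids) {
            counts.put(id, 0L);
        }
        for (Link link : joinTable) {
            if (ids.contains(link.resourceId)) {
                counts.merge(link.resourceId, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Resource 2 has sub-resources 1 and 4; resource 3 has sub-resource 4.
        List<Link> joinTable = Arrays.asList(new Link(2, 1), new Link(2, 4), new Link(3, 4));
        Set<Long> ids = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println(countSubResources(joinTable, ids)); // {2=2, 3=1, 4=0}
    }
}
```

The point of the design is that one query answers the question for many resources at once, without loading any sub-resource entities into memory.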

N + 1 when ID is string (JpaRepository)

I have an entity with string id:
@Table
@Entity
public class Stock {
    @Id
    @Column(nullable = false, length = 64)
    private String index;
    @Column(nullable = false)
    private Integer price;
}
And JpaRepository for it:
public interface StockRepository extends JpaRepository<Stock, String> {
}
When I call stockRepository::findAll, I run into the N + 1 problem:
logs are simplified
select s.index, s.price from stock s
select s.index, s.price from stock s where s.index = ?
The last query is executed about 5K times (the size of the table). Also, when I update prices, I do the following:
stockRepository.save(listOfStocksWithUpdatedPrices);
In logs I have N inserts.
I haven't seen similar behavior when id was numeric.
P.S. Changing the id's type to numeric is not the best solution in my case.
UPDATE1:
I forgot to mention that there is also a Trade class that has a many-to-many relation with Stock:
@Table
@Entity
public class Trade {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Integer id;
    @Column
    @Enumerated(EnumType.STRING)
    private TradeType type;
    @Column
    @Enumerated(EnumType.STRING)
    private TradeState state;
    @MapKey(name = "index")
    @ManyToMany(fetch = FetchType.EAGER)
    @JoinTable(name = "trade_stock",
        joinColumns = { @JoinColumn(name = "id", referencedColumnName = "id") },
        inverseJoinColumns = { @JoinColumn(name = "stock_index", referencedColumnName = "index") })
    private Map<String, Stock> stocks = new HashMap<>();
}
UPDATE2:
I added the many-to-many relation on the Stock side:
@ManyToMany(cascade = CascadeType.ALL, mappedBy = "stocks") // lazy by default
Set<Trade> trades = new HashSet<>();
But now it left-joins trades (although they're lazy) and all of the trades' collections (which are lazy too). However, the generated Stock::toString method throws a LazyInitializationException.

Related answer: JPA eager fetch does not join
You basically need to set @Fetch(FetchMode.JOIN), because fetch = FetchType.EAGER only specifies that the relationship will be loaded, not how.
What might also help with your problem is the @BatchSize annotation, which specifies how many lazy collections will be loaded when the first one is requested. For example, if you have 100 trades in memory (with stocks not initialized), @BatchSize(size=50) will make sure that only 2 queries are used to load them, effectively changing n+1 queries to roughly 1 + n/50.
https://docs.jboss.org/hibernate/orm/4.3/javadocs/org/hibernate/annotations/BatchSize.html
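The arithmetic behind the @BatchSize example can be sketched as a small calculation. This is only a back-of-the-envelope model that assumes every batch is filled completely; the class and method names are made up for the illustration, and the exact number of statements Hibernate issues can differ depending on its version and batch-fetch strategy.

```java
final class BatchFetchMath {

    // Queries needed to initialize `owners` lazy collections when up to
    // `batchSize` of them are loaded per SELECT (ceiling division).
    static int collectionQueries(int owners, int batchSize) {
        if (owners < 0 || batchSize < 1) {
            throw new IllegalArgumentException("owners must be >= 0 and batchSize >= 1");
        }
        return (owners + batchSize - 1) / batchSize;
    }

    // Total including the one query that loaded the owning entities themselves.
    static int totalQueries(int owners, int batchSize) {
        return 1 + collectionQueries(owners, batchSize);
    }

    public static void main(String[] args) {
        System.out.println(collectionQueries(100, 50)); // 2
        System.out.println(totalQueries(100, 1));       // 101, the plain N+1 case
        System.out.println(totalQueries(100, 50));      // 3
    }
}
```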
Regarding inserts, you may want to set the hibernate.jdbc.batch_size property and set hibernate.order_inserts and hibernate.order_updates to true as well.
https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/

However, the generated Stock::toString method throws a LazyInitializationException.
Okay, from this I am assuming you have generated toString() (and most likely the equals() and hashCode() methods) using either Lombok or an IDE generator based on all fields of your class.
Do not override equals(), hashCode() and toString() in this way in a JPA environment, as it has the potential to (a) trigger the exception you have seen if toString() accesses a lazily loaded collection outside of a transaction and (b) trigger the loading of extremely large volumes of data when used within a transaction. Write a sensible toString() that does not involve associations, and implement equals() and hashCode() using (a) some business key if one is available, (b) the ID (being aware of possible issues with this approach), or (c) do not override them at all.
So firstly, remove these generated methods and see if that improves things a bit.
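A minimal sketch of that advice in plain Java follows. The Stock class here is a stand-in without JPA annotations, and UnloadedSet is a hypothetical helper that fails on any access, loosely imitating an uninitialized lazy collection outside of a transaction (it throws IllegalStateException, not Hibernate's LazyInitializationException). The hand-written toString() only uses basic fields, while the "all fields" variant fails as soon as it touches the association:

```java
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.Objects;
import java.util.Set;

final class SafeToStringDemo {

    // Hypothetical stand-in for an uninitialized lazy collection: any access fails.
    static final class UnloadedSet<T> extends AbstractSet<T> {
        @Override
        public Iterator<T> iterator() {
            throw new IllegalStateException("collection not loaded");
        }

        @Override
        public int size() {
            throw new IllegalStateException("collection not loaded");
        }
    }

    // Stand-in for the Stock entity (annotations omitted).
    static class Stock {
        private final String index;   // primary key, used as the business key
        private final Integer price;
        private final Set<Object> trades = new UnloadedSet<>(); // the association

        Stock(String index, Integer price) {
            this.index = index;
            this.price = price;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Stock)) return false;
            return Objects.equals(index, ((Stock) o).index);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(index);
        }

        // Only basic fields: never touches the association.
        @Override
        public String toString() {
            return "Stock{index='" + index + "', price=" + price + "}";
        }

        // What an "all fields" generator would produce; it iterates over trades.
        String allFieldsToString() {
            return "Stock{index='" + index + "', price=" + price + ", trades=" + trades + "}";
        }
    }

    public static void main(String[] args) {
        Stock stock = new Stock("ACME", 10);
        System.out.println(stock); // Stock{index='ACME', price=10}
        try {
            stock.allFieldsToString();
        } catch (IllegalStateException e) {
            System.out.println("all-fields toString failed: " + e.getMessage());
        }
    }
}
```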

With regard to the inserts, I notice one thing that is often overlooked in JPA. I don't know which database you use, but you have to be careful with
@GeneratedValue(strategy = GenerationType.AUTO)
For MySQL, I think all JPA implementations map this to an auto-increment column, and once you know how JPA works, this has two implications:
Every insert will consist of two queries: first the insert, then a select query (LAST_INSERT_ID for MySQL) to get the generated primary key.
It also prevents any batch query optimization, because each row needs to be inserted with its own statement.
If you insert a large number of objects and want good performance, I would recommend using table-generated sequences, where you let JPA pre-allocate IDs in large chunks; this also allows the SQL driver to do batch INSERT INTO (...) VALUES (...) optimizations.
Another recommendation (not everyone agrees with me on this one). Personally I never use ManyToMany, I always decompose it into OneToMany and ManyToOne with the join table as a real entity. I like the added control it gives over cascading and fetch, and you avoid some of the ManyToMany traps that exist with bi-directional relations.
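To show why pre-allocating IDs in chunks helps, here is a small simulation in plain Java. It is not how any JPA provider is implemented; GeneratorTable and IdAllocator are hypothetical classes, and each call to reserve() stands for one round trip to a generator table in the database. With a chunk size of 50, handing out 120 IDs costs only 3 round trips, and all IDs are known before the inserts are sent:

```java
final class ChunkedIdAllocatorDemo {

    // Stand-in for the generator table; each reserve() call models one database round trip.
    static final class GeneratorTable {
        private long nextFree = 1;
        private int roundTrips = 0;

        // Reserves `allocationSize` consecutive ids and returns the first one.
        synchronized long reserve(int allocationSize) {
            roundTrips++;
            long first = nextFree;
            nextFree += allocationSize;
            return first;
        }

        synchronized int roundTrips() {
            return roundTrips;
        }
    }

    // Hands out ids from a locally held chunk and refills it only when it is used up.
    static final class IdAllocator {
        private final GeneratorTable table;
        private final int allocationSize;
        private long next = 0;
        private long endExclusive = 0; // empty chunk at start, forces the first reserve()

        IdAllocator(GeneratorTable table, int allocationSize) {
            if (allocationSize < 1) {
                throw new IllegalArgumentException("allocationSize must be >= 1");
            }
            this.table = table;
            this.allocationSize = allocationSize;
        }

        synchronized long nextId() {
            if (next == endExclusive) {
                next = table.reserve(allocationSize);
                endExclusive = next + allocationSize;
            }
            return next++;
        }
    }

    public static void main(String[] args) {
        GeneratorTable table = new GeneratorTable();
        IdAllocator allocator = new IdAllocator(table, 50);
        long last = 0;
        for (int i = 0; i < 120; i++) {
            last = allocator.nextId();
        }
        System.out.println("last id = " + last + ", round trips = " + table.roundTrips());
        // last id = 120, round trips = 3
    }
}
```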

Hibernate creating N+1 queries for @ManyToOne JPA annotated property

I have these classes:
@Entity
public class Invoice implements Serializable {
    @Id
    @Basic(optional = false)
    private Integer number;
    private BigDecimal value;
    //Getters and setters
}
@Entity
public class InvoiceItem implements Serializable {
    @EmbeddedId
    protected InvoiceItemPK invoiceItemPk;
    @ManyToOne
    @JoinColumn(name = "invoice_number", insertable = false, updatable = false)
    private Invoice invoice;
    //Getters and setters
}
When I run this query:
session.createQuery("select i from InvoiceItem i").list();
It executes one query to select the records from InvoiceItem, and if I have 10000 invoice items, it generates 10000 additional queries to select the Invoice from each InvoiceItem.
I think it would be a lot better if all the records could be fetched in a single sql. Actually, I find it weird why it is not the default behavior.
So, how can I do it?

The problem here is not related to Hibernate but to JPA.
Prior to JPA 1.0, Hibernate 3 used lazy loading for all associations.
However, the JPA 1.0 specification uses FetchType.LAZY by default only for collection associations:
@OneToMany
@ManyToMany
@ElementCollection
The @ManyToOne and @OneToOne associations use FetchType.EAGER by default, and that's very bad from a performance perspective.
The behavior described here is called the N+1 query issue, and it happens because Hibernate needs to make sure that the @ManyToOne association is initialized prior to returning the result to the user.
Now, if you are using direct fetching via entityManager.find, Hibernate can use a LEFT JOIN to initialize the FetchType.EAGER associations.
However, when executing a query that does not explicitly use a JOIN FETCH clause, Hibernate will not use a JOIN to fetch the FetchType.EAGER associations, as it cannot alter a query whose structure you have already specified. So, it can only use secondary queries.
The fix is simple. Just use FetchType.LAZY for all associations:
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "invoice_number", insertable = false, updatable = false)
private Invoice invoice;
Moreover, you can use Hypersistence Utils to assert the number of statements executed by JPA and Hibernate.

Try with
session.createQuery("select i from InvoiceItem i join fetch i.invoice inv").list();
It should get all the data in a single SQL query by using joins.

Yes, there is a setting you need: @BatchSize(size=25). Check it here:
20.1.5. Using batch fetching
A short quote:
Using batch fetching, Hibernate can load several uninitialized proxies if one proxy is accessed. Batch fetching is an optimization of the lazy select fetching strategy. There are two ways you can configure batch fetching: on the class level and the collection level.
Batch fetching for classes/entities is easier to understand. Consider the following example: at runtime you have 25 Cat instances loaded in a Session, and each Cat has a reference to its owner, a Person. The Person class is mapped with a proxy, lazy="true". If you now iterate through all cats and call getOwner() on each, Hibernate will, by default, execute 25 SELECT statements to retrieve the proxied owners. You can tune this behavior by specifying a batch-size in the mapping of Person:
<class name="Person" batch-size="10">...</class>
With this batch-size specified, Hibernate will now execute queries on demand when need to access the uninitialized proxy, as above, but the difference is that instead of querying the exactly proxy entity that being accessed, it will query more Person's owner at once, so, when accessing other person's owner, it may already been initialized by this batch fetch with only a few ( much less than 25) queries will be executed.
So, we can use that annotation on both:
collections/sets
classes/Entities
Check it also here:
@BatchSize but many round trip in @ManyToOne case

With this approach, multiple SQL statements are fired. The first one retrieves all the records in the parent table, and the remaining ones retrieve the child records for each parent record. If the first query returns M parent records, M additional queries are executed, one per parent.

Bulk Insert via Spring/Hibernate where ids are needed

I have to do bulk inserts, and need the ids of what's being added. This is a basic example that shows what I am doing (which is obviously horrible for performance). I am looking for a much better way to do this.
public void omgThisIsSlow(final Set<ObjectOne> objOneSet,
                          final Set<ObjectTwo> objTwoSet) {
    for (final ObjectOne objOne : objOneSet) {
        persist(objOne);
        for (final ObjThree objThree : objOne.getObjThreeSet()) {
            objThree.setObjOne(objOne);
            persist(objThree);
        }
        for (final ObjectTwo objTwo : objTwoSet) {
            // copy each ObjectTwo so that every ObjectOne gets its own instance
            final ObjectTwo objTwoCopy = new ObjectTwo();
            objTwoCopy.setFoo(objTwo.getFoo());
            objTwoCopy.setBar(objTwo.getBar());
            persist(objTwoCopy);
            final ObjectFour objFour = new ObjectFour();
            objFour.setObjOne(objOne);
            objFour.setObjTwo(objTwoCopy);
            persist(objFour);
        }
    }
}
In the case above persist is a method which internally calls
sessionFactory.getCurrentSession().saveOrUpdate();
Is there any optimized way of getting back the ids and bulk inserting based upon that?
Thanks!
Update: Got it working with the following additions and help from JustinKSU
import javax.persistence.*;
@Entity
public class ObjectFour {
    @ManyToOne(cascade = CascadeType.ALL)
    private ObjectOne objOne;
    @ManyToOne(cascade = CascadeType.ALL)
    private ObjectTwo objTwo;
}
// And similar for other classes and their objects that need to be persisted

If you define the relationships using annotations and define appropriate cascading, you should be able to set the object relationships on the objects in Java and persist it all in one call. Hibernate will handle setting the foreign keys for you.
Documentation -
http://docs.jboss.org/hibernate/annotations/3.5/reference/en/html/entity.html#entity-mapping-association
An example annotation on a parent object would be
@OneToMany(mappedBy = "foo", fetch = FetchType.LAZY, cascade = CascadeType.ALL)
On the child object you would do the following
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "COLUMN_NAME", nullable = false)

I'm not sure, but I believe Hibernate can do bulk inserts/updates. The problem, as I understand it, is that you need to persist the parent object in order to assign the reference to the child object.
I would try to persist all the "one" objects first, and then iterate over all their "three" objects and persist them in a second bulk insertion.
If your tree has three levels, you can achieve all the insertions in 3 batches. Pretty decent, I think.

Assuming that you are just looking at getting a large amount of data persisted in one go and your problem is that you don't know what the IDs are going to be as the various related objects are persisted, one possible solution for this is to run all your inserts (as bulk inserts) into ancillary tables (one per real table) with temporary IDs (and some session ID) and have a stored procedure perform the inserts into the real tables whilst resolving the IDs.

JPA EclipseLink 2 query performance

APPLICATION and ENVIRONMENT
Java EE / JSF 2.0 / JPA enterprise application, which contains a web and an EJB module. I am generating PDF documents which contain evaluated data queried via JPA.
I am using MySQL as database, with MyISAM engine on all tables. JPA Provider is EclipseLink with cache set to ALL. FetchType.EAGER is used at relationships.
AFTER RUNNING NETBEANS PROFILER
Profiler results show that the following method is called the most. In this session it was 3858 invocations, with ~80 seconds from request to response. This takes up 80% of CPU time. There are 680 entries in the Question table.
public Question getQuestionByAzon(String azon) {
try {
return (Question) em.createQuery("SELECT q FROM Question q WHERE q.azonosito=:a").setParameter("a", azon).getSingleResult();
} catch (NoResultException e) {
return null;
}
}
The Question entity:
@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
public abstract class Question implements Serializable {
    private static final long serialVersionUID = 1L;
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;
    @Column(unique = true)
    private String azonosito;
    @Column(nullable = false)
    @Basic(optional = false)
    private String label;
    @Lob
    @Column(columnDefinition = "TEXT")
    private String help;
    private int quizNumber;
    private String type;
    @ManyToOne
    private Category parentQuestion;
    ...
    //getters and setters, equals() and hashCode() function implementations
}
There are four entities extending Question.
The column azonosito should be used as primary key, but I don't see this as the main reason for low performance.
I am interested in suggestions for optimization. Feel free to ask if you need more information!
EDIT See my answer summarizing the best results
Thanks in advance!

Using LAZY is a good start, I would recommend you always make everything LAZY if you are at all concerned about performance.
Also ensure that you are using weaving, (Java SE agent, or Java EE/Spring, or static), as LAZY OneToOne and ManyToOne depend on this.
Changing the Id to your other field would be a good idea, if you always query on it and it is unique. You should also check why your application keeps executing the same query over and over.
You should make the query a NamedQuery instead of using a dynamic query.
In EclipseLink you could also enable the query cache on the query (once it is a named query), this will enable cache hits on the query result.

Do you have a unique index on the azonosito column in your database? That might help.
I would also suggest fetching only the fields you really need, so maybe some of them could be lazy, e.g. Category.

Since changing fetch type of relationship to LAZY dramatically improved performance of your application, perhaps you don't have an index for foreign key of that relationship. If so, you need to create it.

In this answer I will summarize what was the best solution for that particular query.
First of all, I set azonosito column as primary key, and modified my entities accordingly. This is necessary because EclipseLink object cache works with em.find:
public Question getQuestionByAzon(String azon) {
    // em.find returns null when no entity with this primary key exists,
    // so no NoResultException handling is needed here
    return em.find(Question.class, azon);
}
Now, instead of using a QUERY_RESULT_CACHE on a @NamedQuery, I configured the Question entity like this:
@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
@Cache(size=1000, type=CacheType.FULL)
public abstract class Question implements Serializable { ... }
This means an object cache of maximum size 1000 will be maintained of all Question entities.
Profiler Results ~16000 invocations
QUERY_RESULT_CACHE: ~28000ms
@Cache(size=1000, type=CacheType.FULL): ~7500ms
Of course execution time gets shorter after the first execution.
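The reason a lookup by primary key benefits so much from an object cache can be sketched with a small keyed cache in plain Java. This is only an illustration of the idea and not EclipseLink's cache; KeyedCache and its loader function are hypothetical, the loader stands in for the database query, and the sketch is not thread-safe. Repeated lookups for the same key reach the loader only once:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class PrimaryKeyCacheDemo {

    // Caches loaded values by key; the loader stands in for a database query.
    static final class KeyedCache<K, V> {
        private final Map<K, V> entries = new HashMap<>();
        private final Function<K, V> loader;
        private int loads = 0;

        KeyedCache(Function<K, V> loader) {
            this.loader = loader;
        }

        // Returns the cached value, or loads and caches it on a miss.
        V find(K key) {
            V cached = entries.get(key);
            if (cached != null) {
                return cached;
            }
            loads++;
            V loaded = loader.apply(key);
            if (loaded != null) {
                entries.put(key, loaded);
            }
            return loaded;
        }

        // Number of times the loader was actually invoked.
        int loads() {
            return loads;
        }
    }

    public static void main(String[] args) {
        KeyedCache<String, String> cache = new KeyedCache<>(key -> "Question " + key);
        cache.find("Q1");
        cache.find("Q1");
        cache.find("Q1");
        cache.find("Q2");
        System.out.println("loader calls: " + cache.loads()); // loader calls: 2
    }
}
```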
