Bulk Insert via Spring/Hibernate where ids are needed - java

I have to do bulk inserts, and need the ids of what's being added. This is a basic example that shows what I am doing (which is obviously horrible for performance). I am looking for a much better way to do this.
public void omgThisIsSlow(final Set<ObjectOne> objOneSet,
        final Set<ObjectTwo> objTwoSet) {
    for (final ObjectOne objOne : objOneSet) {
        persist(objOne);
        for (final ObjThree objThree : objOne.getObjThreeSet()) {
            objThree.setObjOne(objOne);
            persist(objThree);
        }
        for (final ObjectTwo objTwo : objTwoSet) {
            final ObjectTwo objTwoCopy = new ObjectTwo();
            objTwoCopy.setFoo(objTwo.getFoo());
            objTwoCopy.setBar(objTwo.getBar());
            persist(objTwoCopy);
            final ObjectFour objFour = new ObjectFour();
            objFour.setObjOne(objOne);
            objFour.setObjTwo(objTwoCopy);
            persist(objFour);
        }
    }
}
In the case above persist is a method which internally calls
sessionFactory.getCurrentSession().saveOrUpdate();
Is there any optimized way of getting back the ids and bulk inserting based upon that?
Thanks!
Update: Got it working with the following additions and help from JustinKSU
import javax.persistence.*;
@Entity
public class ObjectFour {
    @ManyToOne(cascade = CascadeType.ALL)
    private ObjectOne objOne;
    @ManyToOne(cascade = CascadeType.ALL)
    private ObjectTwo objTwo;
}
// And similar for other classes and their objects that need to be persisted

If you define the relationships using annotations and configure appropriate cascading, you should be able to set the object relationships in Java and persist everything in one call. Hibernate will handle setting the foreign keys for you.
Documentation -
http://docs.jboss.org/hibernate/annotations/3.5/reference/en/html/entity.html#entity-mapping-association
An example annotation on a parent object would be
@OneToMany(mappedBy = "foo", fetch = FetchType.LAZY, cascade = CascadeType.ALL)
On the child object you would do the following
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "COLUMN_NAME", nullable = false)

I'm not sure, but I believe Hibernate can batch inserts and updates. As I understand it, the problem is that you need to persist the parent object before you can assign its reference to the child object.
I would try to persist all the "one" objects first, and then iterate over all their "three" objects and persist them in a second bulk insertion.
If your tree has three levels, you can do all the insertions in 3 batches. Pretty decent, I think.

Assuming that you are just looking at getting a large amount of data persisted in one go and your problem is that you don't know what the IDs are going to be as the various related objects are persisted, one possible solution for this is to run all your inserts (as bulk inserts) into ancillary tables (one per real table) with temporary IDs (and some session ID) and have a stored procedure perform the inserts into the real tables whilst resolving the IDs.

Related

N + 1 when ID is string (JpaRepository)

I have an entity with string id:
@Table
@Entity
public class Stock {
    @Id
    @Column(nullable = false, length = 64)
    private String index;
    @Column(nullable = false)
    private Integer price;
}
And JpaRepository for it:
public interface StockRepository extends JpaRepository<Stock, String> {
}
When I call stockRepository::findAll, I get an N + 1 problem:
logs are simplified
select s.index, s.price from stock s
select s.index, s.price from stock s where s.index = ?
The second query runs about 5K times (the size of the table). Also, when I update prices, I do the following:
stockRepository.save(listOfStocksWithUpdatedPrices);
In logs I have N inserts.
I haven't seen similar behavior when id was numeric.
P.S. Changing the id's type to numeric is not a good solution in my case.
UPDATE1:
I forgot to mention that there is also Trade class that has many-to-many relation with Stock:
@Table
@Entity
public class Trade {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Integer id;
    @Column
    @Enumerated(EnumType.STRING)
    private TradeType type;
    @Column
    @Enumerated(EnumType.STRING)
    private TradeState state;
    @MapKey(name = "index")
    @ManyToMany(fetch = FetchType.EAGER)
    @JoinTable(name = "trade_stock",
        joinColumns = { @JoinColumn(name = "id", referencedColumnName = "id") },
        inverseJoinColumns = { @JoinColumn(name = "stock_index", referencedColumnName = "index") })
    private Map<String, Stock> stocks = new HashMap<>();
}
UPDATE2:
I added many-to-many relation for the Stock side:
@ManyToMany(cascade = CascadeType.ALL, mappedBy = "stocks") //lazy by default
Set<Trade> trades = new HashSet<>();
But now it left joins trades (but they're lazy), and all trade's collections (they are lazy too). However, generated Stock::toString method throws LazyInitializationException exception.
Related answer: JPA eager fetch does not join
You basically need to set @Fetch(FetchMode.JOIN), because fetch = FetchType.EAGER only specifies that the relationship will be loaded, not how.
What might also help with your problem is the @BatchSize annotation, which specifies how many lazy collections are loaded when the first one is requested. For example, if you have 100 trades in memory (with stocks not initialized), @BatchSize(size=50) makes sure that only 2 queries are used to load the collections, effectively changing n + 1 queries into roughly n/50 + 1.
https://docs.jboss.org/hibernate/orm/4.3/javadocs/org/hibernate/annotations/BatchSize.html
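To make the arithmetic behind @BatchSize concrete, here is a minimal sketch that only counts queries. It assumes every batch query initializes exactly batchSize collections; the class and method names are made up for illustration and are not part of Hibernate.

```java
// Illustration only: this counts SELECT statements, it does not talk to Hibernate.
final class BatchFetchMath {

    /** Queries needed to initialise `parents` lazy collections, `batchSize` collections per query. */
    static int collectionQueries(int parents, int batchSize) {
        if (parents < 0 || batchSize < 1) {
            throw new IllegalArgumentException("parents must be >= 0 and batchSize >= 1");
        }
        return (parents + batchSize - 1) / batchSize; // ceiling division
    }

    /** One query for the parent rows plus the batched collection queries. */
    static int totalQueries(int parents, int batchSize) {
        return 1 + collectionQueries(parents, batchSize);
    }
}
```

For 100 trades, totalQueries(100, 1) models the plain N + 1 case (101 queries), while totalQueries(100, 50) gives 3: one query for the trades and two for their collections.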
Regarding inserts, you may want to set the hibernate.jdbc.batch_size property, and set hibernate.order_inserts and hibernate.order_updates to true as well.
https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
However, generated Stock::toString method throws
LazyInitializationException exception.
Okay, from this I am assuming you have generated toString() (and most likely the equals() and hashCode() methods) using either Lombok or an IDE generator, based on all fields of your class.
Do not override equals(), hashCode() and toString() in this way in a JPA environment, as it has the potential to (a) trigger the exception you have seen if toString() accesses a lazily loaded collection outside of a transaction and (b) trigger the loading of extremely large volumes of data when used within a transaction. Write a sensible toString() that does not involve associations, and implement equals() and hashCode() using (a) some business key if one is available, or (b) the ID (being aware of possible issues with this approach), or (c) do not override them at all.
So firstly, remove these generated methods and see if that improves things a bit.
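As a rough sketch of that advice, the class below is a plain-Java stand-in for the Stock entity. The JPA annotations are left out so it compiles on its own, the name StockSketch is hypothetical, and it assumes the index is an application-assigned business key that never changes.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Plain-Java stand-in for the Stock entity; annotations omitted on purpose.
final class StockSketch {
    private final String index;   // business key, assumed to be assigned by the application
    private Integer price;
    private final Set<Object> trades = new HashSet<>(); // stands in for the lazy Trade collection

    StockSketch(String index, Integer price) {
        this.index = Objects.requireNonNull(index);
        this.price = price;
    }

    void setPrice(Integer price) {
        this.price = price;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StockSketch)) return false;
        return index.equals(((StockSketch) o).index); // business key only
    }

    @Override
    public int hashCode() {
        return index.hashCode(); // does not change while the object sits in a HashSet
    }

    @Override
    public String toString() {
        // Scalar fields only: never touches `trades`, so nothing lazy gets initialised.
        return "Stock{index='" + index + "', price=" + price + "}";
    }
}
```

Two instances with the same index compare equal regardless of price, and printing one never walks into the association.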
With regard to the inserts, I notice one thing that is often overlooked in JPA. I don't know what database you use, but you have to be careful with
@GeneratedValue(strategy = GenerationType.AUTO)
For MySQL I think all JPA implementations map this to an auto_increment column, and once you know how JPA works, this has two implications:
Every insert will consist of two queries: first the insert, and then a select query (LAST_INSERT_ID for MySQL) to get the generated primary key.
It also prevents any batch optimization, because each row needs to be inserted in its own statement.
If you insert a large number of objects and want good performance, I would recommend using table-generated sequences, where you let JPA pre-allocate IDs in large chunks. This also allows the SQL driver to do batched INSERT INTO (...) VALUES (...) optimizations.
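The idea of pre-allocating IDs in chunks can be illustrated without a database. The sketch below is a hypothetical in-memory model (ChunkedIdAllocator is not a JPA class): a counter field stands in for the generator table, and roundTrips counts how often that "table" would be hit.

```java
// Hypothetical illustration of ID pre-allocation; a JPA provider does something
// comparable internally when a generator is configured with an allocation size.
final class ChunkedIdAllocator {
    private final int allocationSize;
    private long nextBlockStart = 1; // stands in for the counter row kept in the database
    private long next;               // next ID to hand out from the current block
    private long blockEnd;           // exclusive end of the current block
    private int roundTrips;          // how often the "database" was contacted

    ChunkedIdAllocator(int allocationSize) {
        if (allocationSize < 1) {
            throw new IllegalArgumentException("allocationSize must be >= 1");
        }
        this.allocationSize = allocationSize;
    }

    long nextId() {
        if (next >= blockEnd) {
            // In a real setup this would be one round trip to the generator table.
            next = nextBlockStart;
            blockEnd = nextBlockStart + allocationSize;
            nextBlockStart = blockEnd;
            roundTrips++;
        }
        return next++;
    }

    int roundTrips() {
        return roundTrips;
    }
}
```

With an allocation size of 50, handing out 120 IDs costs three round trips instead of 120, and because the IDs are known before the INSERT runs, the inserts themselves remain free to be batched.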
Another recommendation (not everyone agrees with me on this one). Personally I never use ManyToMany, I always decompose it into OneToMany and ManyToOne with the join table as a real entity. I like the added control it gives over cascading and fetch, and you avoid some of the ManyToMany traps that exist with bi-directional relations.

JPA Collection of objects with Lazy loaded field

What is a good way to force initialization of a lazily loaded field in each object of a collection?
At the moment, the only thing that comes to my mind is to use a for-each loop to iterate through the collection and call the getter of that field, but that's not very efficient. The collection can have as many as 1k objects, and in that case every iteration will fire a query to the DB.
I can't change the way I fetch the objects from the DB.
Example of code.
public class TransactionData {
    @ManyToOne(fetch = FetchType.LAZY)
    private CustomerData customer;
    ...
}
List<TransactionData> transactions = getTransactions();
You may define Entity Graphs to overrule the default fetch types, as they are defined in the Mapping.
See the example below
@Entity
@NamedEntityGraph(
    name = "Person.addresses",
    attributeNodes = @NamedAttributeNode("addresses")
)
public class Person {
    ...
    @OneToMany(fetch = FetchType.LAZY) // default fetch type
    private List<Address> addresses;
    ...
}
In the following query the addresses will now be loaded eagerly.
EntityGraph entityGraph = entityManager.getEntityGraph("Person.addresses");
TypedQuery<Person> query = entityManager.createNamedQuery("Person.findAll", Person.class);
query.setHint("javax.persistence.loadgraph", entityGraph);
List<Person> persons = query.getResultList();
That way you can define specific fetch behaviour for each different use case.
See also:
http://www.thoughts-on-java.org/jpa-21-entity-graph-part-1-named-entity/
https://docs.oracle.com/javaee/7/tutorial/persistence-entitygraphs001.htm
By the way: as far as I know, most JPA providers load @XxxToOne relations eagerly even if you define the mapping as lazy. The JPA spec allows this behaviour, as lazy loading is always just a hint that the data may or may not be loaded immediately. Eager loading, on the other hand, has to be performed immediately.
What you can do is something like this:
// lazily loaded
List<Child> childList = parent.getChild();
// calling size() initializes the collection, loading all children of that particular parent into memory
Integer childListSize = childList.size();
But if you load eagerly, all the children will be loaded for each of the parents. This should be your best bet.

Spring Data JPA - concurrent Bulk inserts/updates

At the moment I am developing a Spring Boot application which mainly pulls product review data from a message queue (~5 concurrent consumers) and stores it in a MySQL DB. Each review can be uniquely identified by its reviewIdentifier (String), which is the primary key, and can belong to one or more products (e.g. products with different colors). Here is an excerpt of the data model:
public class ProductPlacement implements Serializable {
    private static final long serialVersionUID = 1L;
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @Column(name = "product_placement_id")
    private long id;
    @ManyToMany(fetch = FetchType.LAZY, cascade = CascadeType.ALL, mappedBy = "productPlacements")
    private Set<CustomerReview> customerReviews;
}
public class CustomerReview implements Serializable {
    private static final long serialVersionUID = 1L;
    @Id
    @Column(name = "customer_review_id")
    private String reviewIdentifier;
    @ManyToMany(fetch = FetchType.LAZY, cascade = CascadeType.ALL)
    @JoinTable(
        name = "tb_miner_review_to_product",
        joinColumns = @JoinColumn(name = "customer_review_id"),
        inverseJoinColumns = @JoinColumn(name = "product_placement_id")
    )
    private Set<ProductPlacement> productPlacements;
}
One message from the queue contains 1 - 15 reviews and a productPlacementId. Now I want an efficient method to persist the reviews for the product. There are basically two cases which need to be considered for each incoming review:
The review is not in the database -> insert the review with a reference to the product that is contained in the message.
The review is already in the database -> just add the product reference to the Set productPlacements of the existing review.
Currently my method for persisting the reviews is not optimal. It looks as follows (it uses Spring Data JpaRepositories):
@Override
@Transactional
public void saveAllReviews(List<CustomerReview> customerReviews, long productPlacementId) {
    ProductPlacement placement = productPlacementRepository.findOne(productPlacementId);
    for (CustomerReview review : customerReviews) {
        CustomerReview cr = customerReviewRepository.findOne(review.getReviewIdentifier());
        if (cr != null) {
            cr.getProductPlacements().add(placement);
            customerReviewRepository.saveAndFlush(cr);
        } else {
            Set<ProductPlacement> productPlacements = new HashSet<>();
            productPlacements.add(placement);
            review.setProductPlacements(productPlacements);
            cr = review;
            customerReviewRepository.saveAndFlush(cr);
        }
    }
}
Questions:
I sometimes get ConstraintViolationExceptions because the unique constraint on the "reviewIdentifier" is violated. This is obviously because I (concurrently) check whether the review is already present and then insert or update it. How can I avoid that?
Is it better to use save() or saveAndFlush() in my case? I get ~50-80 reviews per second. Will Hibernate flush automatically if I just use save(), or will it result in greatly increased memory usage?
Update to question 1: Would a simple @Lock on my review repository prevent the unique-constraint exception?
@Lock(LockModeType.PESSIMISTIC_WRITE)
CustomerReview findByReviewIdentifier(String reviewIdentifier);
What happens when findByReviewIdentifier returns null? Can Hibernate lock the reviewIdentifier for a potential insert even if the method returns null?
Thank you!
From a performance point of view, I would consider evaluating the solution with the following changes.
Changing from bidirectional ManyToMany to bidirectional OneToMany
I had the same question about which one results in more efficient DML statements. Quoting from Typical ManyToMany mapping versus two OneToMany:
The option one might be simpler from a configuration perspective, but it yields less efficient DML statements.
Use the second option because whenever the associations are controlled by @ManyToOne associations, the DML statements are always the most efficient ones.
Enable the batching of DML statements
Enabling batching support results in fewer round trips to the database to insert/update the same number of records.
Quoting from batch INSERT and UPDATE statements
hibernate.jdbc.batch_size = 50
hibernate.order_inserts = true
hibernate.order_updates = true
hibernate.jdbc.batch_versioned_data = true
Reduce the number of saveAndFlush calls
The current code gets the ProductPlacement and calls saveAndFlush for each review, which results in no batching of DML statements.
Instead, I would consider loading the ProductPlacement entity, adding the List<CustomerReview> customerReviews to the Set<CustomerReview> customerReviews field of the ProductPlacement entity, and finally calling the merge method once at the end, with these two changes:
Making the ProductPlacement entity the owner of the association, i.e., by moving the mappedBy attribute onto the Set<ProductPlacement> productPlacements field of the CustomerReview entity.
Making the CustomerReview entity implement the equals and hashCode methods using the reviewIdentifier field. I believe reviewIdentifier is unique and user assigned.
Finally, as you do performance tuning with these changes, baseline your performance with the current code first. Then make the changes and check whether they really result in a significant performance improvement for your solution.

What is a best practice to store 'large' data, represented by List in Java, in database?

What is a best practice to store 'large' data, represented by List in Java, in database?
I'm considering 3 variants:
Use '@OneToMany' to store data in a separate table.
Serialize data and store it in parent table.
Store data as files(naming conventions? same as id?).
To be more specific
'Large' data entities:
class SingleSleeper{
private Double startPositionOnLeft;
private Double endPositionOnLeft;
private Double startPositionOnRight;
private Double endPositionOnRight;
....
}
class RutEntry{
private Double width;
private Double position;
...
}
There are about 50 instances of SingleSleeper class and about 25000 instances of RutEntry class in one parent instance. Parent instances are generated about 40 times every day.
I'm using EclipseLink (JPA 2.1) with Derby.
Addition
Most of all I'm interested in the best readability in Java. But I'm afraid that database speed will decrease significantly if I store too much data in the database. The overwhelming majority of requests will select all instances of the SingleSleeper or RutEntry classes of a particular parent entity. I don't need support for different database types, but I can move to another database if needed.
I think I would do neither of your variants.
I would add a ManyToOne to the child entities (which is somehow the opposite of your first variant):
public class SingleSleeper {
    @ManyToOne(optional = false, fetch = FetchType.LAZY)
    private ParentEntity parent;
    ...
}
public class RutEntry {
    @ManyToOne(optional = false, fetch = FetchType.LAZY)
    private ParentEntity parent;
}
This ensures that you have a mapping and that you never load all 25000 entities for a parent object if you don't need them (the lazy fetch ensures that you don't even need to load the parent entity).
You can create a OneToMany in the parent object with a mappedBy link, if you really want to. For example because you always need all child objects in the parent entity:
class ParentEntity {
    @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY)
    Collection<SingleSleeper> singleSleepers;
    @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY)
    Collection<RutEntry> rutEntries;
}
But I don't know how EclipseLink works here. For Hibernate you need at least an additional BatchSize annotation to indicate that it should load as many child entities as possible at once. It can't fetch both collections together with the parent instance (e.g. by defining both as FetchType.EAGER), as only one is allowed to be fetched eagerly (and otherwise you would have 25000 * 50 result rows in the result set of the corresponding SQL select statement).
The best way to load all child entities for a parent entity is to load them separately, either using JPQL (easier to read, faster to write) or the Criteria API (typesafe, but you need a metamodel):
ParentEntity parent = entityManager.find(ParentEntity.class, id);
// JPQL (named parameters are prefixed with a colon):
List<SingleSleeper> singleSleepers = entityManager.createQuery(
        "SELECT s FROM SingleSleeper s WHERE s.parent = :parent", SingleSleeper.class
    ).setParameter("parent", parent).getResultList();
// Or Criteria API (SingleSleeper_ is the generated static metamodel class):
CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
CriteriaQuery<SingleSleeper> query = criteriaBuilder.createQuery(SingleSleeper.class);
Root<SingleSleeper> s = query.from(SingleSleeper.class);
query.select(s).where(criteriaBuilder.equal(s.get(SingleSleeper_.parent), parent));
List<SingleSleeper> singleSleepers = entityManager.createQuery(query).getResultList();
That approach has three advantages:
Still easy to read - if you put the loading into its own method.
You are flexible to decide when to load the 25050 children.
You can load a subset of the children as well (by modifying the result of createQuery with Query.setFirstResult and Query.setMaxResults).
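Loading a subset via Query.setFirstResult and Query.setMaxResults comes down to computing an offset and a page size per request. The helper below is a hypothetical sketch (PageWindows is not a JPA class) that only produces those pairs; each pair would then be passed to the two setters, and the query would need a stable ORDER BY for the pages to be consistent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the arguments for setFirstResult / setMaxResults.
final class PageWindows {

    /** Returns {firstResult, maxResults} pairs covering `total` rows in pages of `pageSize`. */
    static List<int[]> windows(int total, int pageSize) {
        if (total < 0 || pageSize < 1) {
            throw new IllegalArgumentException("total must be >= 0 and pageSize >= 1");
        }
        List<int[]> result = new ArrayList<>();
        for (int first = 0; first < total; first += pageSize) {
            // The last page may be shorter than pageSize.
            result.add(new int[] { first, Math.min(pageSize, total - first) });
        }
        return result;
    }
}
```

For the 25000 RutEntry rows of one parent and a page size of 10000, this yields three windows: two full pages and a final one of 5000 rows.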

Orphan deletion in Hibernate (when have multiple mapped objects)

I've got this structure of project:
class UserServiceSettingsImpl {
    ...
    @ManyToOne
    private UserImpl user;
    @ManyToOne
    private ServiceImpl service;
    ...
}
class ServiceImpl {
    ....
    @OneToMany(fetch = FetchType.LAZY, cascade = CascadeType.ALL, mappedBy = "service", orphanRemoval = true)
    private Set<UserServiceSettingsImpl> userServiceSettings;
    ....
}
class UserImpl {
    ....
    @OneToMany(fetch = FetchType.EAGER, cascade = CascadeType.ALL, mappedBy = "user", orphanRemoval = true)
    private Set<UserServiceSettingsImpl> serviceSettings;
    ....
}
I am trying to delete a Service and everything that belongs to it (UserServiceSettingsImpl), but unexpectedly these settings are not being removed (I suppose because they are not orphans, since UserImpl has them too). So the question is: is there a way to delete the settings without removing them from each user manually (there could be a lot of users with a lot of settings, and iterating through them could take a lot of time)?
You are correct in why the UserServiceSettings are not being deleted when deleting a service if they are also referenced by a User. They are not orphans and will have to be deleted explicitly per your business logic.
Three ideas:
Use the ORM to batch delete entities.
It's not much different than iterating, but might be optimized while still using the ORM.
List settingsCopy = new ArrayList<>(service.getSettings());
service.getSettings().clear();
myDao.deleteAll(settingsCopy);
Use direct HQL/SQL to batch delete.
This depends on what framework you are using, but generally it would be something like this, probably in your repository/dao class:
delete from UserServiceSettingsImpl o where o.service.id = ?
However, Hibernate does not support JOINs when deleting, afaik, so this doesn't work as written. It's generally necessary to rework the HQL to use a "delete where id IN(...)" type format.
Set up CASCADE DELETEs and CASCADE UPDATEs in your database DDL, outside of the ORM framework. (Not recommended.)
However, the last two options have problems if there is a chance that the service's and the user's UserServiceSettings can be modified at the same time by multiple threads (even with correct transaction boundaries), or if those entities will be used within the ORM context after the delete without a reload. In that case, you will likely run into unexpected and sporadic errors with the last two approaches, and should instead iterate the settings and delete them via the ORM, even if it is inefficient.
Even with the first approach, it can be tricky to avoid errors in highly concurrent environments when deleting shared entities.
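For the "delete where id IN(...)" format mentioned in the second idea, the IDs usually have to be split into chunks, since databases limit the size of an IN list. The sketch below shows only that chunking step. The class name, the `id` property in the query string and the chunk size are assumptions; each chunk would be bound to :ids (for example with setParameterList on a Hibernate query) and executed as one bulk delete.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for chunked "delete ... where id in (:ids)" statements.
final class InClauseBatches {

    // Entity name taken from the question; the `id` property is an assumption.
    static final String DELETE_BY_IDS =
            "delete from UserServiceSettingsImpl s where s.id in (:ids)";

    /** Splits `ids` into consecutive chunks of at most `chunkSize` elements. */
    static <T> List<List<T>> partition(List<T> ids, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, ids.size());
            chunks.add(new ArrayList<>(ids.subList(from, to))); // copy, so chunks are independent
        }
        return chunks;
    }
}
```

The concurrency caveats above still apply: entities deleted this way bypass the persistence context, so anything already loaded would need to be reloaded or evicted afterwards.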
You're correct that you cannot delete them in any kind of automatic way - they will never be orphans. I think the best you can do is just write yourself a helper method. e.g. if you have a ServiceDao class, you would just add a helper as:
public void deleteServiceAndSettings(Service service) {
    for (UserServiceSettings setting : service.getUserServiceSettings()) {
        session.delete(setting);
    }
    session.delete(service);
}
