Optimal way to deal with huge one to many collections with JPA - java

Suppose we have an entity "Something", and this Something has a one-to-many relationship (on the order of millions) to some "Data".
Something 1 -> * Data
Data 1 -> 1 Something
Now if we want to add a Data object, we would do this:
something.getDataList().add(data)
This will actually pull all Data objects from the database, which is not optimal, IMHO.
However, if I remove the relationship from Something and leave it only on Data, I'll be able to add and retrieve exactly the objects I ask for using a DAO:
Something
Data 1 -> 1 Something
Now the data access interface will look like this:
Something.addData(Data) // will use DataDAO to save object
or
Something.addData(List<Data>) // will use same DataDAO batch insert
I need some guidance on this; maybe I lack some knowledge of JPA and there is no need for any of this? I'm also unsure because entities become less natural this way: the data is provided by their methods but is not actually contained in the entity. (If this reasoning is right, I would have to remove every one-to-many relationship involved in a performance-critical operation, which also feels unnatural.)
In my particular case I have a lot of REST consumers that will periodically update the database. I'm using ObjectDB and JPA, but the question is more abstract than that.
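For reference, the unidirectional mapping described above, where only Data keeps the association, could look like the following sketch (field names and the lazy fetch setting are illustrative, not from the original code; on older stacks the import would be javax.persistence instead):

```java
import jakarta.persistence.*;

@Entity
class Something {
    @Id
    private Long id;
    // deliberately no @OneToMany Set<Data>: with millions of rows, the
    // collection is only ever reached through queries, never loaded wholesale
}

@Entity
class Data {
    @Id
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY) // the only mapped side of the association
    private Something something;
}
```

With this shape, adding a Data row is a single INSERT, and counting or filtering goes through dedicated queries rather than a materialized collection.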

I believe that having something.getDataList() around is a ticking bomb if there are millions of Data records related to a Something. Just as you said, calling something.getDataList().add(data) would fetch the whole data set from the DB in order to perform a single insert. Also, anyone could be tempted to use something.getDataList().size() to get the number of records, resulting in the same overhead.
My suggestion is that you use the DataDAO for such operations (i.e. adding or counting) like:
void addData(Something something, Data data) {
    // something similar can be used for a batch insert
    data.setSomething(something);
    em.persist(data);
}

List<Data> findData(Something something, MyFilter filter) {
    // use a NamedQuery (name is illustrative) with parameters populated from something and your filters
    return em.createNamedQuery("Data.findBySomething", Data.class)
            .setParameter("something", something)
            .getResultList();
}

Long countData(Something something) {
    // use a 'SELECT COUNT(d) FROM Data d WHERE d.something = :something' NamedQuery
    return em.createNamedQuery("Data.countBySomething", Long.class)
            .setParameter("something", something)
            .getSingleResult();
}

Hint HINT_PASS_DISTINCT_THROUGH reduces the amount of Entities returned per page for a PageRequest down to below the configured page size (PostgreSQL)

I'm setting up a JPA Specification based repository implementation that utilizes JPA Specifications (constructed from RSQL filter strings) to filter the results, define result ordering and remove any duplicates via "distinct" that would otherwise be returned due to joined tables. The JPA Specification builder method joins several tables and sets the "distinct" flag:
final Join<Object, Object> rootJoinedTags = root.join("tags", JoinType.LEFT);
final Join<Object, Object> rootJoinedLocations = root.join("location", JoinType.LEFT);
...
query.distinct(true);
To allow sorting by joined table columns, I've applied the "HINT_PASS_DISTINCT_THROUGH" hint to the relevant repository method (otherwise, sorting by joined table columns returns an error along the lines of "sort column must be included in the SELECT DISTINCT query").
@QueryHints(value = {
    @QueryHint(name = org.hibernate.jpa.QueryHints.HINT_PASS_DISTINCT_THROUGH, value = "false")
})
Page<SomeEntity> findAll(@Nullable Specification<SomeEntity> spec, Pageable pageable);
The arguments for said repository method are constructed as such:
final Sort sort = getSort(searchFilter);
final Specification spec = getSpecificationIfPresent(searchFilter);
final PageRequest pageRequest = PageRequest.of(searchFilter.getPageNumber(), searchFilter.getLimit(), sort);
return eventRepository.findAll(spec, pageRequest);
After those changes, filtering and sorting seem to work as expected. However, the hint seems to cause "distinct" filtering to be applied after the result page is already constructed, thus reducing the number of returned entities in the page from the configured "size" PageRequest argument to whatever is left after the duplicates are filtered out. For example, if we make a PageRequest with "page=0" and "pageSize=10", the resulting Page may contain only 5 "SomeEntity" instances, although the database contains far more entries (177 entities, to be exact, in this case). If I remove the hint, the number of returned entities is correct again.
Question: is there a way to make the same Specification query setup work with correctly sized Pages(some other hints that might be added to have duplicate filtering performed before the Page object is constructed)? If not, then is there another approach I could use to achieve the required Specification-based filtering, with joined-column sorting and duplicate removal as with "distinct"?
PS: PostgreSQL is the database behind the application in question
The problem you are experiencing has to do with the way you are using the HINT_PASS_DISTINCT_THROUGH hint.
This hint allows you to tell Hibernate that the DISTINCT keyword should not be used in the SELECT statement issued against the database.
You are taking advantage of this fact to allow your queries to be sorted by a field that is not included in the DISTINCT column list.
But that is not how this hint should be used.
This hint must only be used when you are sure that applying the DISTINCT keyword to the SQL SELECT statement makes no difference, because the SELECT statement already fetches all distinct values per se. The idea is to improve the performance of the query by avoiding an unnecessary DISTINCT.
This is usually what happens when you use the query.distinct method in your criteria queries and you are join fetching child relationships. This great article by @VladMihalcea explains how the hint works in detail.
On the other hand, when you use paging, Hibernate will set OFFSET and LIMIT (or something similar, depending on the underlying database) in the SQL SELECT statement issued against the database, limiting your query to a maximum number of results.
As stated, if you use the HINT_PASS_DISTINCT_THROUGH hint, the SELECT statement will not contain the DISTINCT keyword and, because of your joins, it can return duplicate records of your main entity. These records will then be processed by Hibernate to remove duplicates, because you are using query.distinct. I think this is the reason why you may get fewer records than requested in your Pageable.
If you remove the hint, the DISTINCT keyword is included in the SQL statement sent to the database, so as long as you only project information of the main entity, it will fetch all the records indicated by LIMIT, and this is why you always get the requested number of records.
You can try to join fetch your child entities (instead of only joining with them). That eliminates the problem of not being able to include your sort field in the DISTINCT column list and, in addition, you will now be able to apply the hint legitimately.
But if you do so, you will run into another problem: when you use join fetch and pagination to return the main entities and their collections, Hibernate will no longer apply pagination at the database level. It will not include the OFFSET or LIMIT keywords in the SQL statement and will instead try to paginate the results in memory. This is the famous Hibernate HHH000104 warning:
HHH000104: firstResult/maxResults specified with collection fetch; applying in memory!
@VladMihalcea explains this in great detail in the last part of this article.
He also proposes one possible solution to your problem: window functions.
In your use case, instead of using Specifications, the idea is to implement your own DAO. This DAO only needs access to the EntityManager, which is not a big deal, as you can inject it via @PersistenceContext:
@PersistenceContext
protected EntityManager em;
Once you have this EntityManager, you can create native queries and use window functions to build, based on the provided Pageable information, the right SQL statement to issue against the database. This gives you a lot more freedom regarding which fields to use for sorting or whatever else you need.
As the last cited article indicates, window functions are supported by all major databases.
In the case of PostgreSQL, you can easily come across them in the official documentation.
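As an illustration of that idea, a DENSE_RANK()-based query can number the joined rows so that duplicated parents share one rank, and a page is then simply a range of ranks. The following is only a sketch: the table and column names (event, tag, name) and the id tiebreaker are assumptions, not taken from the question:

```java
class WindowedQueries {
    // Build a native PostgreSQL query: rank joined rows by parent columns so
    // duplicates of one parent share a DENSE_RANK value, then keep one page
    // of ranks. Pages are zero-based, ranks are one-based.
    static String pagedSql(int page, int size) {
        long from = (long) page * size + 1;
        long to = from + size - 1;
        return "SELECT * FROM ("
             + " SELECT e.*, DENSE_RANK() OVER (ORDER BY e.name, e.id) AS rnk"
             + " FROM event e LEFT JOIN tag t ON t.event_id = e.id"
             + ") ranked WHERE rnk BETWEEN " + from + " AND " + to;
    }
}
```

The resulting string would be handed to em.createNativeQuery(...); ranking by a unique tiebreaker (the id) keeps the ordering deterministic across duplicate sort-key values.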
Finally, one more option, suggested in fact by @nickshoe and explained in great detail in the article he cited, is to perform the sorting and paging process in two phases: in the first phase, you create a query that references your child entities and to which you apply paging and sorting; this query identifies the ids of the main entities. In the second phase, those ids are used to fetch the main entities themselves.
You can take advantage of the aforementioned custom DAO to accomplish this process.
This may be an off-topic answer, but it may help you.
You could try to tackle this problem (pagination of parent-child entities) by separating the query in two parts:
a query for retrieving the ids that match the given criteria
a query for retrieving the actual entities by the resulting ids of the previous query
I came across this solution in this blog post: https://vladmihalcea.com/fix-hibernate-hhh000104-entity-fetch-pagination-warning-message/

Is there an equivalent of @BatchSize in spring-data-jdbc

Hi, I used Spring Data to map my entity and repository. The mapping is very simple:
public class Car {
Set<Part> parts;
}
public class Part {
}
I use the findAllById(Iterable) method of my Spring Data repository, and it generates a nice SQL statement of the form:
select * from CAR where id in (?, ?, ?, ?)
My problem starts when the related parts are fetched: it appears to fetch them one by one, executing exactly one SQL statement per Car:
select * from PART where car_id = ?
Is there something in Spring Data JDBC equivalent to batch fetching in Hibernate?
If the answer is negative, is there some relatively easy way to implement it?
Unfortunately, the short answer is "No" to both questions right now.
If you want to implement batching for selects, what you would need to do is come up with:
a) a new implementation of the DataAccessStrategy which essentially implements all the CRUD functionality, and/or
b) a new EntityRowMapper which converts ResultSet rows into entities.
The first one is needed if you want to execute a different SQL statement to start with.
The second one if you consider changing subsequent SQL sufficient.
There are issues around batching that you might want to track, or if the exact variant you are looking for doesn't exist, feel free to create another one.
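Until such support exists, a common workaround is to load the parts for a whole set of cars with a single "where car_id in (...)" query and reassemble the collections in memory. The SQL side would go through something like NamedParameterJdbcTemplate; the in-memory regrouping step is sketched below (the Part shape and its carId foreign-key field are assumptions, not from the question):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical child row carrying the foreign key back to its Car.
record Part(long id, long carId) {}

class PartBatchLoader {
    // Given parts fetched with one "WHERE car_id IN (...)" query,
    // distribute them back onto their cars in memory.
    static Map<Long, List<Part>> groupByCar(List<Part> parts) {
        return parts.stream().collect(Collectors.groupingBy(Part::carId));
    }
}
```

This trades N per-car queries for one IN query plus a cheap in-memory grouping pass, which is essentially what Hibernate's batch fetching does under the hood.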

JPA setting referenced property without retrieving it. Best practices

Let's assume I have an entity that has a nested entity inside it.
For example (please ignore missing annotations, getters/setters, etc.):
@Entity
class User {
    private String userId;
    private Set<UserOperation> userOperations;
}
@Entity
class UserOperation {
    private String someString;
    // this is the nested referenced entity
    private User user;
}
Let's assume that I want to insert a new UserOperation and all I have is the userId.
Can I do something like:
// We just create a new user; there is no DB round trip to fetch the existing one. In total: only 1 insert
User user = new User();
user.setId("someId");
UserOperation uOp = new UserOperation();
uOp.setUser(user);
uOp.setSomeString("just op");
em.persist(uOp);
Or I should go that way only:
// We retrieve the existing user; there is a DB round trip to fetch it. In total: 1 select and 1 insert
User user = em.find(User.class, "someId");
UserOperation uOp = new UserOperation();
uOp.setUser(user);
uOp.setSomeString("just op");
em.persist(uOp);
What is the right way of doing it?
Because from the DB perspective the userOperation table just has a String reference to the user, the ID should be enough. Java, however, requires an object.
When calling "new User()" I would like to avoid the properties of the existing user being flushed (as they are all unset), or JPA trying to insert a new user and the operation failing due to a primary key violation.
Some examples are welcomed.
For your use case there is the method getReference() in EntityManager. It gives you an entity object for an id, but does not access the DB to create it. Therefore the best solution is a slightly modified version of your 2nd snippet:
// We retrieve a stub user for the given id. There is no interaction with the DB
User user = em.getReference(User.class, "someId");
UserOperation uOp = new UserOperation();
uOp.setUser(user);
uOp.setSomeString("just op");
em.persist(uOp);
Explanation:
getReference() has the same logical meaning as find(), with the exception that it does not call the DB. The consequence is that it does not check whether there is a row in the DB table with the given id, and that the object you get does not yet contain the data. However, the object is fully capable of loading additional data when a getter is called. Therefore the object is fully usable even if retrieved by getReference(); in fact it works the same way as lazy loading.
A side note to your first solution:
The first solution would not work: it would create a new user entity and then fail either when storing it to the DB if the operation is cascaded (persist always issues an INSERT, so it would try to insert a user with the same ID as the one already in the DB), or because UserOperation would be persisted while its user is not managed. To fix that solution, you would need to call em.merge(user) before em.persist(userOperation), but merge issues a SELECT against the DB in the same way em.find() does.
The best way to do this is the second example. We should always try to use the actual object fetched from the DB; working with only the DB reference will be much harder to maintain.
Now speaking specifically about Hibernate, it makes even more sense to work with whole objects, especially because of Hibernate's cascade, that can and will (if cascade is set) update the child entities of the one you are persisting to database.
Well, I have to admit that always fetching objects from the database may cause performance issues, especially once the database holds a huge amount of data, so it's always important to implement clean and coherent model entities, keep track of the database hits your application produces, and try to keep the number of generated queries as low as possible.
As for example, your own example (the second) is clean and easy to understand, I would stick with this approach, since it's really simple.
Hope it solves your questions :)

Spring JpaRepository delete vs deleteInBatch

What is the difference between the delete(...) and deleteInBatch(...) methods of JpaRepository in Spring? The second one "deletes items in one SQL statement", but what does that mean from the application/database perspective? Why do two different methods with similar results exist, and when is it better to use one or the other?
EDIT:
The same applies also for deleteAll() and deleteAllInBatch() ...
The answers here are not complete!
First off, let's check the documentation!
void deleteInBatch(Iterable<T> entities)
Deletes the given entities in a batch which means it will create a single Query.
So the "delete[All]InBatch" methods will use a JPA Batch delete, like "DELETE FROM table [WHERE ...]". That may be WAY more efficient, but it has some caveats:
This will not call any JPA/Hibernate lifecycle hooks you might have (e.g. @PreRemove)
It will NOT CASCADE to other entities
You have to clear your persistence context or just assume it is invalidated.
That's because JPA issues a bulk DELETE statement to the database, bypassing the cache etc., and thus can't know which entities were affected.
See Hibernate Docs
The actual code in Spring Data JPA
And while I can't find a specific article here, I'd recommend everything Vlad Mihalcea has written, to gain a deeper understanding of JPA.
TLDR: The "inBatch" methods use bulk DELETE statements, which can be drastically faster but have some caveats because they bypass the JPA cache. You should really know how they work and when to use them to benefit.
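To illustrate the persistence-context caveat, Spring Data can clear the context for you after a bulk statement. A hedged sketch, with a made-up repository and entity (none of these names are from the question):

```java
// Sketch: a bulk delete plus automatic context clearing, so stale
// first-level-cache entries don't survive the bulk DELETE.
public interface OrderRepository extends JpaRepository<Order, Long> {

    @Modifying(clearAutomatically = true) // clears the persistence context after the bulk DELETE
    @Transactional
    @Query("DELETE FROM Order o WHERE o.status = :status")
    int deleteByStatusInBulk(@Param("status") String status);
}
```

Without clearAutomatically (or a manual em.clear()), entities already loaded in the same transaction would still appear to exist after the bulk delete.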
The delete method deletes your entities one operation at a time. deleteInBatch groups several deletes into a single statement and executes them as one operation.
If you need a lot of delete operations the batch-deletion might be faster.
deleteInBatch(...) in the log would look like this:
DELETE FROM table_name WHERE (((((((? = id) OR (? = id)) OR (? = id)) OR (? = id)) OR (? = id)) OR (? = id)) OR (? = id))
That might lead to a problem if there is a large amount of data to be deleted, exceeding the maximum size of an SQL Server query:
Maximum size for a SQL Server Query? IN clause? Is there a Better Approach
Just to add a curious piece of information:
You can't create a custom delete method using 'Batch' in the method name and expect Spring Data to resolve it; for example, you can't do this:
void deleteByYourAttributeInBatch(Iterable<YourObject> objects);
You need to do something like this instead:
@Modifying
@Transactional
@Query("DELETE FROM YourObject o WHERE o.yourAttribute IN (:objects)")
void deleteByYourAttributeInBatch(@Param("objects") Iterable<YourObject> objects);
Maybe it's an issue worth raising with spring-data ;)

Using Hibernate sequence generators manually

Basically, I want a way to access sequence values in a database-neutral way.
The use case is that I have a field on an entity that I want to set based on an incrementing value (other than the id).
For instance, say I have a Shipment entity. At some point after the shipment gets created, it gets shipped. Once it gets shipped, a manifest number is generated for it and assigned. The manifest number looks something like M000009 (Where the stuff after the 'M' is a left-padded value from a sequence).
Something similar was asked here at SO , but I'm not a fan of the solution since it requires another table to maintain and seems like a weird relationship to have.
Does anyone know if it is possible to use something like Hibernate's MultipleHiLoPerTableGenerator as something other than an ID generator?
If that's not possible, does anyone know of any libraries that handle this (either using hibernate or even just pure JDBC). I'd prefer not to have to write this myself (and have to deal with prefetching values, locking and synchronization).
Thanks.
I think the complexity of your task depends on whether or not your manifest number needs to be sequential:
If you don't need sequential manifest numbers, then it's happy days and you can use a sequence.
If you do need sequential manifest numbers (or your database doesn't support sequences), then use an id table with the appropriate locking so that each transaction gets a unique sequential value.
Then you've got 2 options that I can think of:
write the necessary JDBC code on your client, ensuring (if the manifest number is sequential) that the transaction used is the same as that of the database update.
use a trigger to create the manifest number when the appropriate update occurs.
I think my preference would be the trigger, because the transaction side of things would be taken care of, although it would mean the object would need refreshing on the client.
I didn't read over the linked similar solution, but sounds like something I wound up doing. I created a table just for sequences. I added a row to the table for each sequence type I needed.
I then had a sequence generator class that would do the necessary sql query to fetch and update the sequence value for a particular named sequence.
I used hibernate's Dialect class to do it in a db neutral way.
I also would 'cache' the sequences. I would bump the stored sequence value by a large number, and then dole those out those allocated sequences from my generator class. If the class was destroyed (ie. app shutdown), a new instance of the sequence generator would start up at the stored value. (having a gap in my sequence numbers did not matter)
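The block-allocation scheme described here can be sketched independently of the persistence layer. In the snippet below, reserveBlock stands in for the SQL (or Hibernate) code that atomically bumps the stored value by the block size and returns its old value; the names and block size are illustrative:

```java
import java.util.function.LongUnaryOperator;

// Minimal in-memory sketch of "bump the stored value by a large number,
// then dole out the allocated values" described above.
class BlockSequence {
    private final LongUnaryOperator reserveBlock; // given a block size, returns the first value of a fresh block
    private final long blockSize;
    private long next;   // next value to hand out
    private long limit;  // exclusive upper bound of the current block

    BlockSequence(LongUnaryOperator reserveBlock, long blockSize) {
        this.reserveBlock = reserveBlock;
        this.blockSize = blockSize;
    }

    synchronized long nextValue() {
        if (next >= limit) { // current block exhausted: reserve a new one
            next = reserveBlock.applyAsLong(blockSize);
            limit = next + blockSize;
        }
        return next++;
    }
}
```

On restart, a fresh BlockSequence simply reserves a new block starting at the stored value, which matches the "gaps in the sequence don't matter" behavior described above.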
Here is a code sample. I would like to caveat this with: I have not compiled it and it requires Spring code. Having said that, it should still provide the bones of what you want to do.
public Long getManifestNumber() {
    final Object result = getHibernateTemplate().execute(new HibernateCallback() {
        public Object doInHibernate(Session sess) throws HibernateException, SQLException {
            // Oracle-style sequence access; adapt the SQL to your database
            SQLQuery sqlQuery = sess.createSQLQuery("select MY_SEQUENCE.NEXTVAL from dual");
            return sqlQuery.uniqueResult();
        }
    });
    Long toReturn = null;
    if (result instanceof BigDecimal) {
        toReturn = ((BigDecimal) result).longValue();
    }
    return toReturn;
}
