What is the difference between the delete(...) and deleteInBatch(...) methods in JpaRepository in Spring? The second one "deletes items in one SQL statement", but what does that mean from the application/database perspective? Why do two different methods with similar results exist, and when is it better to use one or the other?
EDIT:
The same applies also for deleteAll() and deleteAllInBatch() ...
The answers here are not complete!
First off, let's check the documentation!
void deleteInBatch(Iterable<T> entities)
Deletes the given entities in a batch which means it will create a single Query.
So the "delete[All]InBatch" methods will use a JPA bulk delete, like "DELETE FROM table [WHERE ...]". That may be WAY more efficient, but it has some caveats:
This will not call any JPA/Hibernate lifecycle hooks you might have (@PreRemove)
It will NOT CASCADE to other entities
You have to clear your persistence context, or just assume it is invalidated.
That's because JPA issues the bulk DELETE statement straight to the database, bypassing the first-level cache, and thus can't know which managed entities were affected.
See Hibernate Docs
The actual code in Spring Data JPA
And while I can't find a specific article here, I'd recommend everything Vlad Mihalcea has written, to gain a deeper understanding of JPA.
TL;DR: The "inBatch" methods use bulk DELETE statements, which can be drastically faster, but they have caveats because they bypass the JPA cache. You should really understand how they work and when to use them to benefit.
The delete method is going to delete your entity in one operation per entity. deleteInBatch is going to combine several deletes into a single SQL statement and execute them as one operation.
If you need a lot of delete operations, batch deletion may be faster.
deleteInBatch(...) in the log would look like this:
DELETE FROM table_name WHERE (((((((? = id) OR (? = id)) OR (? = id)) OR (? = id)) OR (? = id)) OR (? = id)) OR (? = id))
That might lead to a problem when a large amount of data has to be deleted and the statement reaches the maximum query size of the SQL server:
Maximum size for a SQL Server Query? IN clause? Is there a Better Approach
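One hedged workaround for that limit (the repository call in the comment is illustrative, not a specific Spring Data API): split the ids into fixed-size chunks and issue one batch delete per chunk, so each generated statement stays below the server's maximum query size.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedDelete {

    // Split ids into chunks of at most `size` elements, so each
    // deleteAllInBatch / IN-clause statement stays small enough.
    static <T> List<List<T>> chunks(List<T> ids, int size) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            result.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ids.add(i);
        List<List<Integer>> parts = chunks(ids, 1000);
        System.out.println(parts.size());        // 3 chunks
        System.out.println(parts.get(2).size()); // last chunk holds 500 ids
        // for each chunk: repository.deleteAllInBatch(loadByIds(chunk));
    }
}
```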
Just to add a curious piece of information:
You can't create a custom delete method using 'Batch' in the method name and expect Spring Data to derive it for you. For example, you can't do this:
void deleteByYourAttributeInBatch(Iterable<YourObject> object);
You need to do something like this instead:
@Modifying
@Transactional
@Query("DELETE FROM YourObject o WHERE o.yourAttribute IN (:objects)")
void deleteByYourAttributeInBatch(@Param("objects") Iterable<YourObject> objects);
Maybe it's an issue for spring-data ;)
Related
I was trying to use Spring's CrudRepository with Hibernate to delete rows by a non-primary-key column, using a deleteByColumnName method. However, the actual executed query is very inefficient and too slow in practice.
Suppose I have two tables, Project and Employee, and each employee is in charge of some projects, which implies that the Project table has an employee_id column. Now I would like to delete some projects by employee_id. I wrote something like
public interface ProjectRepository extends CrudRepository<Project, String> {
@Transactional
void deleteByEmployeeId(String employeeId);
}
What I was expecting is that Hibernate would execute the following query for this method:
DELETE FROM Project
WHERE employee_id = ?
However, Hibernate executes it in a drastically slower way:
SELECT id FROM Project
WHERE employee_id = ?
Hibernate stores the above result in a list and then executes
DELETE FROM Project
WHERE id = ?
N times... (it does execute them in batches, though)
To address this inefficiency, I have to override the method by writing the query directly, like
public interface ProjectRepository extends CrudRepository<Project, String> {
@Query("DELETE FROM Project p WHERE p.employeeId = ?1")
@Modifying
@Transactional
void deleteByEmployeeId(String employeeId);
}
Then the behavior will be exactly the same as what I am expecting.
The performance difference is substantial when I delete about 1k rows from a table containing around 500k entries: the first method takes 45 seconds to finish the deletion, while the second takes only 250ms!
The reason I use Hibernate is to take advantage of its ORM strategy, which avoids writing SQL directly and is easier to maintain in the long run. At this point, does anyone know how to make Hibernate execute the deletion in the manner of my second method without writing the query by hand? Is there something I am missing to optimize Hibernate's performance?
Thanks in advance!
Here you can find a good explanation of why Hibernate performs so badly when deleting Project items: Best Practices for Many-To-One and One-To-Many Association Mappings
I'm setting up a JPA Specification based repository implementation that uses JPA Specifications (constructed from RSQL filter strings) to filter the results, define result ordering, and remove via "distinct" any duplicates that would otherwise be returned due to joined tables. The Specification builder method joins several tables and sets the "distinct" flag:
final Join<Object, Object> rootJoinedTags = root.join("tags", JoinType.LEFT);
final Join<Object, Object> rootJoinedLocations = root.join("location", JoinType.LEFT);
...
query.distinct(true);
To allow sorting by joined-table columns, I've applied the HINT_PASS_DISTINCT_THROUGH hint to the relevant repository method (otherwise, sorting by joined-table columns returns an error along the lines of "sort column must be included in the SELECT DISTINCT query").
@QueryHints(value = {
    @QueryHint(name = org.hibernate.jpa.QueryHints.HINT_PASS_DISTINCT_THROUGH, value = "false")
})
Page<SomeEntity> findAll(#Nullable Specification<SomeEntity> spec, Pageable pageable);
The arguments for said repository method are constructed as such:
final Sort sort = getSort(searchFilter);
final Specification spec = getSpecificationIfPresent(searchFilter);
final PageRequest pageRequest = PageRequest.of(searchFilter.getPageNumber(), searchFilter.getLimit(), sort);
return eventRepository.findAll(spec, pageRequest);
After those changes, filtering and sorting work as expected. However, the hint seems to cause the "distinct" filtering to be applied after the result page has already been constructed, reducing the number of returned entities from the configured "size" PageRequest argument to whatever is left after duplicates are filtered out. For example, a PageRequest with page=0 and pageSize=10 may return only 5 SomeEntity instances, although the database contains far more matching entries (177, to be exact, in this case). If I remove the hint, the number of returned entities is correct again.
Question: is there a way to make the same Specification query setup work with correctly sized Pages(some other hints that might be added to have duplicate filtering performed before the Page object is constructed)? If not, then is there another approach I could use to achieve the required Specification-based filtering, with joined-column sorting and duplicate removal as with "distinct"?
PS: PostgreSQL is the database behind the application in question
The problem you are experiencing has to do with the way you are using the HINT_PASS_DISTINCT_THROUGH hint.
This hint lets you tell Hibernate that the DISTINCT keyword should not be included in the SELECT statement issued against the database.
You are taking advantage of this fact to allow your queries to be sorted by a field that is not included in the DISTINCT column list.
But that is not how this hint should be used.
This hint must only be used when you are sure there will be no difference between applying the DISTINCT keyword to the SQL SELECT statement or not, because the SELECT statement will already fetch all the distinct rows by itself. The idea is to improve query performance by avoiding an unnecessary DISTINCT.
This is usually what happens when you use the query.distinct method in your criteria queries and you are join fetching child relationships. This great article by @VladMihalcea explains how the hint works in detail.
On the other hand, when you use paging, Hibernate will set OFFSET and LIMIT - or something similar, depending on the underlying database - in the SQL SELECT statement issued against the database, limiting your query to a maximum number of results.
As stated, if you use the HINT_PASS_DISTINCT_THROUGH hint, the SELECT statement will not contain the DISTINCT keyword and, because of your joins, it can return duplicate records of your main entity. These records are then deduplicated by Hibernate in memory, because you are using query.distinct. I think this is why you may get fewer records than requested in your Pageable.
If you remove the hint, the DISTINCT keyword is included in the SQL statement sent to the database, and as long as you only project information of the main entity, it will fetch all the records indicated by LIMIT; this is why you always get the requested number of records.
You can try to fetch join your child entities (instead of only joining them). That eliminates the problem of not being able to include your sort field in the DISTINCT column list and, in addition, you will be able to apply the hint legitimately.
But doing so causes another problem: if you use join fetch and pagination together to return the main entities and their collections, Hibernate will no longer apply pagination at the database level - it will not include the OFFSET or LIMIT keywords in the SQL statement, and it will try to paginate the results in memory. This is the famous Hibernate HHH000104 warning:
HHH000104: firstResult/maxResults specified with collection fetch; applying in memory!
@VladMihalcea explains this in great detail in the last part of this article.
He also proposes one possible solution to your problem: window functions.
In your use case, instead of using Specifications, the idea is to implement your own DAO. This DAO only needs access to the EntityManager, which is not a big deal, as you can inject it via @PersistenceContext:
@PersistenceContext
protected EntityManager em;
Once you have this EntityManager, you can create native queries and use window functions to build, based on the provided Pageable information, the right SQL statement to issue against the database. This gives you a lot more freedom about which fields to use for sorting or whatever else you need.
As the last cited article indicates, window functions are a feature supported by all major databases.
In the case of PostgreSQL, you can easily come across them in the official documentation.
Finally, one more option, suggested in fact by @nickshoe and explained in great detail in the article he cited, is to perform the sorting and paging process in two phases: in the first phase, you create a query that references your child entities and to which you apply paging and sorting; this query identifies the ids of the main entities. In the second phase, you use those ids to obtain the main entities themselves.
You can take advantage of the aforementioned custom DAO to accomplish this process.
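The two-phase idea can be sketched in plain Java, independent of JPA (the data and names below are made up for illustration): phase one pages over the distinct main-entity ids that the joined query returns, and phase two then fetches the entities for just those ids.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TwoPhasePaging {

    // Phase 1: the id query. The joins produce duplicate parent ids,
    // so deduplicate, sort, and only then apply OFFSET/LIMIT.
    static List<Long> pageOfIds(List<Long> joinedParentIds, int page, int size) {
        return joinedParentIds.stream()
                .distinct()
                .sorted()
                .skip((long) page * size)
                .limit(size)
                .collect(Collectors.toList());
    }

    // Phase 2 would then be a simple query on the main entity alone:
    //   SELECT e FROM SomeEntity e WHERE e.id IN :ids ORDER BY ...
    // which is safe to combine with join fetch, since paging is already done.
}
```

Because deduplication happens before the page is cut, every page comes back with the requested number of ids (except the last one).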
It may be an off-topic answer, but it may help you.
You could try to tackle this problem (pagination of parent-child entities) by separating the query in two parts:
a query for retrieving the ids that match the given criteria
a query for retrieving the actual entities by the resulting ids of the previous query
I came across this solution in this blog post: https://vladmihalcea.com/fix-hibernate-hhh000104-entity-fetch-pagination-warning-message/
Hi, I used Spring Data to map my entity and repository. The mapping is very simple:
public class Car {
Set<Part> parts;
}
public class Part {
}
I use the findAllById(Iterable) method of my Spring Data repository, and it generates nice SQL of the form:
select from CAR where id in (?, ?, ?, ?)
My problem starts when the related parts are fetched. For each Car it executes exactly one SQL statement:
Select from Part where car_id = ?
It appears that the parts are being fetched one by one. Is there something in Spring Data JDBC equivalent to Hibernate's batch fetching?
If the answer is negative, is there some relatively easy way to implement it?
Unfortunately, the short answer is "No" to both questions right now.
If you want to implement batching for selects, you would need to come up with:
a) a new implementation of DataAccessStrategy, which essentially implements all the CRUD functionality, and/or
b) a new EntityRowMapper, which converts ResultSet rows into entities.
The first is needed if you want to execute a different SQL statement to start with; the second is enough if changing the subsequent SQL is sufficient.
There are open issues around batching that you might want to track, or, if the exact variant you are looking for doesn't exist, feel free to create another one.
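Until such batching exists, a common workaround (sketched here without Spring Data JDBC; the table and class names are made up to mirror the question) is to load the children for a whole page of parents with a single IN-clause query and group the flat rows in memory, instead of issuing one query per parent:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchChildFetch {

    record Part(long carId, String name) {}

    // Instead of one "SELECT ... FROM part WHERE car_id = ?" per car,
    // run one "SELECT ... FROM part WHERE car_id IN (...)" for the whole
    // page of cars, then group the flat result by car id in memory.
    static Map<Long, List<Part>> groupByCar(List<Part> flatRows) {
        Map<Long, List<Part>> byCar = new HashMap<>();
        for (Part p : flatRows) {
            byCar.computeIfAbsent(p.carId(), k -> new ArrayList<>()).add(p);
        }
        return byCar;
    }
}
```

The IN-clause query itself could be run with a plain JdbcTemplate alongside the repository; the grouping step is what replaces the per-car selects.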
Suppose we have an entity "Something", and this something has a one-to-many relationship (a factor of millions) to some "Data".
Something 1 -> * Data
Data 1 -> 1 Something
Now if we want to add some data objects we should do this:
Something.getDataList().add(Data)
This will actually pull all Data objects from the database first, which is not optimal, imho.
However, if I remove the relationship from Something and leave it only in Data, I'll be able to add and retrieve exactly those objects I ask for, using a DAO:
Something
Data 1 -> 1 Something
Now the data access interface will look like this:
Something.addData(Data) // will use DataDAO to save object
or
Something.addData(List<Data>) // will use same DataDAO batch insert
I need some guidance on this; maybe I lack some knowledge of JPA and there is no need for this? Also, entities feel less natural this way, since the data is provided through their methods but not actually contained in the entity. (If this is right, then I should remove every one-to-many relationship wherever a performance-critical operation deals with that particular entity, which also feels unnatural.)
In my particular case, I have a lot of REST consumers that are periodically going to update the database. I'm using ObjectDB and JPA, but the question is more abstract here.
I believe that having something.getDataList() around is a ticking bomb if there are millions of Data records related to a Something. Just as you said, calling something.getDataList().add(data) would fetch the whole data set from the DB in order to perform a single insert. Also, anyone could be tempted to use something.getDataList().size() to get the number of records, resulting in the same overhead.
My suggestion is that you use the DataDAO for such operations (i.e. adding or counting) like:
void addData(Something something, Data data) {
    // something similar can be used for a batch insert
    data.setSomething(something);
    em.persist(data);
}

List<Data> findData(Something something, MyFilter filter) {
    // "Data.findBySomething" is a NamedQuery assumed to be defined on the
    // entity, with parameters populated from something and your filter
    return em.createNamedQuery("Data.findBySomething", Data.class)
             .setParameter("something", something)
             .getResultList();
}

Long countData(Something something) {
    // backed by 'SELECT COUNT(d) FROM Data d WHERE d.something = :something'
    return em.createNamedQuery("Data.countBySomething", Long.class)
             .setParameter("something", something)
             .getSingleResult();
}
How do you join across multiple tables in an efficient way using JPQL?
select a.text, b.text, c.text
from Class1 a, Class2 b, Class3 c
where a.id = b.b_id and b.id = c.b_id and a.text like ... and b.text like ...
I am doing something like this; the tables only have a few thousand rows, yet the query takes 5-6 seconds to run. I assume it is joining all of the tables before applying the filter.
I know the speed may be JPA vendor implementation specific, but I suspect this is not the proper way to write this query!
See what SQL query has been generated. Then EXPLAIN that query and try to optimize it. For example, make sure you have proper indices.
If you don't like the SQL that JPA is generating (though I doubt it's generating "bad" SQL), you can always use a native query.
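Before falling back to native SQL, it may also be worth rewriting the implicit theta joins as explicit JPQL joins over the mapped associations, so the provider generates proper JOIN syntax instead of a cross join plus WHERE filter. This assumes Class1 has a mapped collection of Class2 entities and Class2 one of Class3; the association names class2s and class3s are hypothetical:

```sql
select a.text, b.text, c.text
from Class1 a
  join a.class2s b
  join b.class3s c
where a.text like :pattern1
  and b.text like :pattern2
```

If the associations are not mapped, the theta-join form in the question is the only JPQL option, and indexing the join columns (b_id on both child tables) becomes the main lever.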