I have a database with a quite large entity model.
In total, there are 14 tables with around 100k records on average.
When I tested my application that takes one entity from the database and converts it to json and throws that back to the caller, it took 7 seconds to get the entry.
This doesnt seem an initialization issue because if I do the same call twice in a row, they both take around 10 seconds to get the data.
When I enable sql script logging, I find that for each entity read, hibernate sends hundreds of sql requests to the database.
(The actual number is based on how many connections/licenses/products/services etc the entity has but for my test entry, I got 286 queries (which took around 7 seconds in total))
When the database is empty (except for the data that the test should return), then it takes around 6 seconds.
I suppose the issue is because I have my #OneToMany's and #ManyToMany's fetch set to Lazy, but when I set them to eager, I get the error of multiple fetch bags.
#javax.persistence.OneToMany(fetch = javax.persistence.FetchType.LAZY)
#org.hibernate.annotations.LazyCollection(org.hibernate.annotations.LazyCollectionOption.FALSE)
#javax.persistence.JoinColumn(name = "ProductId")
This is an example of a OneToMany relation I have, which simulates the eager fetch without the multiple fetch bags error.
Is this a common issue?
And how do I improve the speed on the reading?
Hibernate does not allow to have more then one collection of type List marked as EAGER.
The best to load the data eager is to use a JOIN FETCH
select e from YourEntity JOIN FETCH e.yourList l
Another option could be to use an EntityGraph to define the EAGER loading.
Read more about that topic here:
https://vladmihalcea.com/hibernate-facts-the-importance-of-fetch-strategy/
If you use Spring Data JPA repository interfaces you can use NamedEntityGraphs:
https://docs.spring.io/spring-data/jpa/docs/current/reference/html/#jpa.entity-graph
Related
I am looking for a way how to process a large amount of data that are loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M of rows) and then process them in Java. The processing itself is not the problem but fetching the data from the database is. The fetching generally takes from 1-2 minutes. However, I need it to be much faster than that. I am loading the data from db straight to DTO using following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Where id is primary key, id_post and id_comment are foreign keys to respective tables and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were
Ditch hibernate for this query and try to use plain jdbc connection hoping that it will be faster.
Completely rewrite the processing algorithm from Java to SQL procedure.
Did I miss something or these are my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"
Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
}
This will still remember every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you might detach the object from the entity manager after you're done. This can only be done if the objects are not modified. If they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
entityManager.detach(myObject);
}
If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and set some large fetch size (of the order of thousands) or else postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)
Since you asked for ideas, I have seen this problem being resolved in below options depending on how it fits in your environment:
1) First try with JDBC and Java, simple code and you can do a test run on your database and data to see if this improvement is enough. You will here need to compromise on the other benefits of Hibernate.
2) In point 1, use Multi-threading with multiple connections pulling data to one queue and then you can use that queue to process further or print as you need. you may consider Kafka also.
3) If data is going to further keep on increasing you can consider Spark as the latest technology which can make it all in memory and will be much more faster.
These are some of the options, please like if these ideas help you anywhere.
Why do you 30M keep in memory ??
it's better to rewrite it to pure sql and use pagination based on id
you will be sent 5 as the id of the last comment and you will issue
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 limit 20
if you need to update the entire table then you need to put the task in the cron but also there to process it in parts
the memory of the road and downloading 30M is very expensive - you need to process parts 0-20 20-n n+20
I have problems with full loading of very complex object from DB in a reasonable time and with reasonable count of queries.
My object has a lot of embedded entities, each entity has references to another entities, another entities references yet another and so on (So, the nesting level is 6)
So, I've created example to demonstrate what I want:
https://github.com/gladorange/hibernate-lazy-loading
I have User.
User has #OneToMany collections of favorite Oranges,Apples,Grapevines and Peaches. Each Grapevine has #OneToMany collection of Grapes. Each fruit is another entity with just one String field.
I'm creating user with 30 favorite fruits of each type and each grapevine has 10 grapes. So, totally I have 421 entity in DB - 30*4 fruits, 100*30 grapes and one user.
And what I want: I want to load them using no more than 6 SQL queries.
And each query shouldn't produce big result set (big is a result set with more that 200 records for that example).
My ideal solution will be the following:
6 requests. First request returns information about user and size of result set is 1.
Second request return information about Apples for this user and size of result set is 30.
Third, Fourth and Fifth requests returns the same, as second (with result set size = 30) but for Grapevines, Oranges and Peaches.
Sixth request returns Grape for ALL grapevines
This is very simple in SQL world, but I can't achieve such with JPA (Hibernate).
I tried following approaches:
Use fetch join, like from User u join fetch u.oranges .... This is awful. The result set is 30*30*30*30 and execution time is 10 seconds. Number of requests = 3. I tried it without grapes, with grapes you will get x10 size of result set.
Just use lazy loading. This is the best result in this example (with #Fetch=
SUBSELECT for grapes). But in that case that I need to manually iterate over each collection of elements. Also, subselect fetch is too global setting, so I would like to have something which could work on query level. Result set and time near ideal. 6 queries and 43 ms.
Loading with entity graph. The same as fetch join but it also make request for every grape to get it grapevine. However, result time is better (6 seconds), but still awful. Number of requests > 30.
I tried to cheat JPA with "manual" loading of entities in separate query. Like:
SELECT u FROM User where id=1;
SELECT a FROM Apple where a.user_id=1;
This is a little bit worse that lazy loading, since it requires two queries for each collection: first query to manual loading of entities (I have full control over this query, including loading associated entities), second query to lazy-load the same entities by Hibernate itself (This is executed automatically by Hibernate)
Execution time is 52, number of queries = 10 (1 for user, 1 for grape, 4*2 for each fruit collection)
Actually, "manual" solution in combination with SUBSELECT fetch allows me to use "simple" fetch joins to load necessary entities in one query (like #OneToOne entities) So I'm going to use it. But I don't like that I have to perform two queries to load collection.
Any suggestions?
I usually cover 99% of such use cases by using batch fetching for both entities and collections. If you process the fetched entities in the same transaction/session in which you read them, then there is nothing additionally that you need to do, just navigate to the associations needed by the processing logic and the generated queries will be very optimal. If you want to return the fetched entities as detached, then you initialize the associations manually:
User user = entityManager.find(User.class, userId);
Hibernate.initialize(user.getOranges());
Hibernate.initialize(user.getApples());
Hibernate.initialize(user.getGrapevines());
Hibernate.initialize(user.getPeaches());
user.getGrapevines().forEach(grapevine -> Hibernate.initialize(grapevine.getGrapes()));
Note that the last command will not actually execute a query for each grapevine, as multiple grapes collections (up to the specified #BatchSize) are initialized when you initialize the first one. You simply iterate all of them to make sure all are initialized.
This technique resembles your manual approach but is more efficient (queries are not repeated for each collection), and is more readable and maintainable in my opinion (you just call Hibernate.initialize instead of manually writing the same query that Hibernate generates automatically).
I'm going to suggest yet another option on how to lazily fetch collections of Grapes in Grapevine:
#OneToMany
#BatchSize(size = 30)
private List<Grape> grapes = new ArrayList<>();
Instead of doing a sub-select this one would use in (?, ?, etc) to fetch many collections of Grapes at once. Instead ? Grapevine IDs will be passed. This is opposed to querying 1 List<Grape> collection at a time.
That's just yet another technique to your arsenal.
I do not quite understand your demands here. It seems to me you want Hibernate to do something that it's not designed to do, and when it can't, you want a hack-solution that is far from optimal. Why not loosen the restrictions and get something that works? Why do you even have these restrictions in the first place?
Some general pointers:
When using Hibernate/JPA, you do not control the queries. You are not supposed to either (with a few exceptions). How many queries, the order they are executed in, etc, is pretty much beyond your control. If you want complete control of your queries, just skip JPA and use JDBC instead (Spring JDBC for instance.)
Understanding lazy-loading is key to making decisions in these type of situation. Lazy-loaded relations are not fetched when getting the owning entity, instead Hibernate goes back to the database and gets them when they are actually used. Which means that lazy-loading pays off if you don't use the attribute every time, but has a penalty the times you actually use it. (Fetch join is used for eager-fetching a lazy relation. Not really meant for use with regular load from the database.)
Query optimalization using Hibernate should not be your first line of action. Always start with your database. Is it modelled correctly, with primary keys and foreign keys, normal forms, etc? Do you have search indexes on proper places (typically on foreign keys)?
Testing for performance on a very limited dataset probably won't give the best results. There probably will be overhead with connections, etc, that will be larger than the time spent actually running the queries. Also, there might be random hickups that cost a few milliseconds, which will give a result that might be misleading.
Small tip from looking at your code: Never provide setters for collections in entities. If actually invoked within a transaction, Hibernate will throw an exception.
tryManualLoading probably does more than you think. First, it fetches the user (with lazy loading), then it fetches each of the fruits, then it fetches the fruits again through lazy-loading. (Unless Hibernate understands that the queries will be the same as when lazy loading.)
You don't actually have to loop through the entire collection in order to initiate lazy-loading. You can do this user.getOranges().size(), or Hibernate.initialize(user.getOranges()). For the grapevine you would have to iterate to initialize all the grapes though.
With proper database design, and lazy-loading in the correct places, there shouldn't be a need for anything other than:
em.find(User.class, userId);
And then maybe a join fetch query if a lazy load takes a lot of time.
In my experience, the most important factor for speeding up Hibernate is search indexes in the database.
I am stuck with an issue. I have 3 tables that are associated with a table in one to many relationship.
An employee may have one or more degrees.
An employee may have one or more departments in past
An employee may have one or more Jobs
I am trying to fetch results using named query in a way that I fetch all the results from Degree table and Department table, but only 5 results from Jobs table. Because I want to apply pagination on Jobs table.
But, all these entities are in User tables as a set. Secondly, I don't want to change mapping file because of other usages of same files and due to some architectural restrictions.
Else in case of mapping I could use BatchSize annotation in mapping file, which I am not willing to do.
The best approach is to write three queries:
userRepository.getDegrees(userId);
userRepository.getDepartments(userId);
userRepository.getJobs(userId, pageIndex);
Spring Data is very useful for pagination, as well as simplifying your data access code.
Hibernate cannot fetch multiple Lists in a single query, and even for Sets, you don't want to run a Cartesian Product. So use queries instead of a single JPQL query.
I have been using JPA 2.0 with hibernate as the vendor for my web application. For one of the database queries, I am using the JPA criteria builder API to construct the dynamic queries . I have been able to get it to work as expected.
The query has a fetch join between two tables A & B. A has a OneToMany relationship with B.
Join<A,B> ordLeg = ord.joinSet("bLegs", JoinType.INNER);
ord.fetch("bLegs",JoinType.INNER);
I also have used the setMaxResults to restrict the no. of records returned to the user
typedQuery.setMaxResults(10);
FYI, there are around 400 records in the database.
When I see the result of the query in the logs, I see that it loads all the 400 records in the memory. However, it returns only 20 records to the user since Maxresults is set.
When analyzed further, found out that hibernate has hints that can be given to fetch records in batches rather than loading all the 400 records in the memory.
TypedQuery<A> typedQuery = entityManager.createQuery(query);
typedQuery.setHint("org.hibernate.fetchSize", 5);
List<A> orderList = (List<A>) typedQuery.getResultList();
But it does not seem to take the hint at all. I am still seeing that it loads all 400 records in the memory and no batch fetching is done. I am expecting it to fetch the 10 (maxResults ) records in 2 batches of 5. Am i missing something ?\
It will be a severe performance hit if there are more than 1000 records in the DB for the specified query.
I even tried setting the following property in the persistence.xml level but in vain.
<properties>
<property name="hibernate.jdbc.batch_size" value="5"/>
<property name="hibernate.jdbc.fetch_size" value="5">
</properties>
Can you please help me out in the right direction ?
Hibernate Version: I have tried 3.6.4.Final, 4.0.0.Final & 4.2.1.Final
Appreciate your help.
Thanks,
Santhosh
maxresults might not work as expected due to joins. suppose each user has 10 addresses and you expect 10 users, so db query should give you 10x10=100 records, then jpa has to put user in one object with list of 10 addresses as collection. so you will see discrepancies in result set and final objects returned.
fetch size will tell jpa to fetch those many records only through jdbc call. the smaller the fetch size, more the no of calls jpa makes to db going back n forth.
try to increase your fetch size if query is not very heavy, in your case of 400 records, with one inner join, you can easily play with higher value of fetch size.
so, using solr 4.0
I have a fairly straight-up setup of an entity, with 1 sub entity (1:N relation)
the data to import sits on a mysql server
the main table has about 30 million records
the sub table has about 5 million records(most parent entities don't have the sub entity, the rest generally have a single 1)
I am running into rather horrible indexing(importing) performance. about 80 entities(docs) per second. so to index this table it'll in theory take few days.
now from what I am seeing that solr reports is, for example, if I tell it to index the first 1000 entities it actually issues 1000+ queries to sql. I have also tried setting the batchSize property for the data source with no luck... only -1 works(otherwise out of memory exception).
really not sure what I can do to optimize this, is there no PROPER data importer for mysql?
you could use CachedSqlEntityProcessor so that the sub entity query at least is cached...
Thought the cachedEntity approach helped me in another issue, I have found that using nested entities is usually not just the went to go.
The logic to fire the sub entity query for each "root" entity is just never going to work.
I've re-written my statements to SQL JOIN which fetches both root and sub entities as a single row and mapped to fields accordingly and performance improved significantly.