I would like to know what is the best way to do the following:
A client sends a JSON payload of 100 records to the Spring Boot application to insert into the DB.
But before inserting I have to execute a query to verify some data of EACH record of the 100 records. And then insert.
I currently have this:
for (int i = 0; i < productos.size(); i++) {
    productos.get(i).setIdvehiculo(productoRepository.findTesting("49878", 3)); // ----> NATIVE QUERY EXECUTION TAKES 100ms I THINK
    productoRepository.save(productos.get(i)); // ----> INSERT
}
//productoRepository.saveAll(productos);
entityManager.flush();
entityManager.clear();
And it takes 10 seconds ... doing the select and inserting. 100 records, 10 seconds, isn't that a long time?
Don't insert one record at a time inside the for loop. Just build each model there, add it to an ArrayList, and once you are done processing the records, call saveAll(productos) outside the loop.
Try enabling L2 cache. That would reduce the validation time. Depending on how critical your data is, you can also cache the entity on the application level.
Save the entities inside a single transaction. This lets the database apply its concurrency control to the whole batch.
See if you can change the architecture to use a queue (for example Kafka), with another application consuming the queue and writing to the database.
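The first two suggestions can be sketched in plain Java as follows. This is only an illustration under assumptions: `ProductSketch` and `CountingRepositorySketch` are hypothetical in-memory stand-ins (not the Spring Data API), and the lookup is assumed to depend only on the product code. The point is that repeated lookups are answered from a local cache and the write happens once, after the loop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the entity in the question.
final class ProductSketch {
    final String code;
    Long vehicleId;

    ProductSketch(String code) {
        this.code = code;
    }
}

// Hypothetical in-memory stand-in for a repository; it only counts calls.
final class CountingRepositorySketch {
    int lookupCalls = 0;
    int saveAllCalls = 0;
    final List<ProductSketch> stored = new ArrayList<>();

    // Placeholder for the slow native query from the question.
    Long findVehicleId(String code) {
        lookupCalls++;
        return (long) code.length();
    }

    // Placeholder for a single bulk write.
    void saveAll(List<ProductSketch> products) {
        saveAllCalls++;
        stored.addAll(products);
    }
}

final class BatchInsertSketch {
    static void importProducts(List<ProductSketch> products, CountingRepositorySketch repo) {
        // Application-level cache: each distinct code is looked up only once.
        Map<String, Long> cache = new HashMap<>();
        for (ProductSketch p : products) {
            p.vehicleId = cache.computeIfAbsent(p.code, repo::findVehicleId);
        }
        // One write for the whole list instead of one save() per record.
        repo.saveAll(products);
    }
}
```

With Spring Data JPA, whether saveAll is actually sent to the database as JDBC batches also depends on settings such as hibernate.jdbc.batch_size, so it is worth checking the generated SQL.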
I have written an application to scrape a huge set of reviews. For each review I store the review itself in Review_Table(User_Id, Trail_Id, Rating), the user (Id, Username, UserLink), and the Trail, which is built earlier in the code (Id, ...60 other attributes).
for(Element card: reviewCards){
String userName = card.select("expression").text();
String userLink = card.select("expression").attr("href");
String userRatingString = card.select("expression").attr("aria-label");
Double userRating;
if(userRatingString.equals("NaN Stars")){
userRating = 0.0;
}else {
userRating = Double.parseDouble(userRatingString.replaceAll("[^0-9.]", ""));
}
User u;
Rating r;
//probably this is the bottleneck
if(userService.getByUserLink(userLink)!=null){
u = new User(userName, userLink, new HashSet<Rating>());
r = Rating.builder()
.user(u)
.userRating(userRating)
.trail(t)
.build();
}else {
u = userService.getByUserLink(userLink);
r = Rating.builder()
.user(u)
.userRating(userRating)
.trail(t)
.build();
}
i = i +1;
ratingSet.add(r);
userSet.add(u);
}
saveToDb(userSet, t, link, ratingSet);
savedEntities = savedEntities + 1;
log.info(savedEntities + " Saved Entities");
}
The code works fine for small to medium-sized datasets, but I hit a huge bottleneck with larger ones. Suppose I have 13K user entities already stored in the Postgres DB and another batch of 8,500 reviews comes in to be scraped: I have to check for every review whether its user is already stored. This is taking forever.
I tried to define an index on the UserLink attribute in Postgres, but the speed didn't improve at all.
I tried collecting all the users stored in the DB into a set and using the contains method to check whether a particular user already exists (this way I thought I could bypass the database bottleneck of 8K writes and reads, though in a risky way, because if the user table grew too large I would run out of memory). The speed, again, didn't improve.
At this point I don't have any other ideas for improving this.
Well, for one, you would certainly benefit from not querying for each user individually in a loop. Instead, query once and cache only the UserLink (or UserName) values, that is, load the complete set of just one of them, because that is all you need to decide between the if and else branches.
You can query individual fields with a Spring Data JPA @Query directly, or use Spring Data JPA projections to fetch a subset of fields, and then cache them and use them for the lookup. If you think the number of users could run into millions or billions, you could use a distributed cache such as Apache Ignite, where your collection could scale easily.
By the way, the if-else branches seem to be inverted, are they not?
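As a rough sketch of that lookup, with the branches the right way round (create a user only when the link is not already known): the class and method names below are hypothetical, and existingLinks is assumed to come from a single query that loads only the UserLink column.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UserLookupSketch {
    /**
     * For each scraped link, reports whether a new user has to be created (true)
     * or an existing user should be reused (false).
     * The caller's set of existing links is copied and left unmodified.
     */
    static Map<String, Boolean> needsNewUser(Set<String> existingLinks, List<String> scrapedLinks) {
        Set<String> known = new HashSet<>(existingLinks);
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String link : scrapedLinks) {
            // Set.add returns true only when the link was not known yet,
            // so a link that repeats within the same batch is created once.
            boolean isNew = known.add(link);
            result.putIfAbsent(link, isNew);
        }
        return result;
    }
}
```

All checks run against an in-memory HashSet, so the per-review database round trip disappears; only the initial load and the final batched write touch the database.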
Next, don't store each review individually, which the code above appears to do. You can write in batches. Also, since you are using Postgres, you can use the CopyManager provided by the Postgres JDBC driver for bulk data transfer, wrapped in a Spring Data custom repository. You can keep appending to a local text/CSV file and, on a set schedule (every x minutes), load that batched file into the table and then remove it. This would be really quick.
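A minimal sketch of the CSV side of that idea, in plain Java so it runs without the driver. `CopyCsvSketch` is a hypothetical helper; the resulting text is what you would hand, for example wrapped in a StringReader, to the driver's CopyManager.copyIn(sql, reader) together with a COPY ... FROM STDIN WITH (FORMAT csv) statement for your own table.

```java
import java.util.List;

final class CopyCsvSketch {
    // Quote a field: wrap it in double quotes and double any embedded quotes.
    // A null becomes an unquoted empty field, which COPY in csv format
    // reads as NULL by default.
    static String csvField(String value) {
        if (value == null) {
            return "";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Build the whole payload: one line per row, fields separated by commas.
    static String toCsv(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(csvField(row[i]));
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Building the payload in memory like this suits batches of a few thousand rows; for much larger batches, the file-based variant described above avoids holding everything in the heap.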
The other option is to write a stored procedure that combines the above and invoke it, again, from a custom repository.
Please let me know which one you would like elaborated.
UPDATE (Jan 12 2022):
One other item I missed: when querying for UserLink or UserName, you can use a very efficient form of select that Postgres supports instead of an IN clause, like below:
@Query(value = "select u from user u where u.userLink = ANY('{:userLinks}'::varchar[])", nativeQuery = true)
List<Users> getUsersByLinks(@Param("userLinks") String[] userLinks);
I am new to Spring and Hibernate. I have 157 records and I want to batch insert the records using Hibernate. I created my entity and in my service I am making a list of entities. I am using CrudRepository.saveAll(List) to do batch insert.
But Hibernate statistics show me 1045 flushed entities. When I enable SQL logging, there are exactly 157 statements for this entity. The batch size is shown as the one I defined (160). I tried different approaches, such as session.flush and session.clear, and they all give the same result. 157 records are saved in the database, which means the result is correct. Why does it show me 1045 instead of 157? Can I see the number of inserted entities instead?
P.S: It is 158 statements in statistics because one single record is for another entity that is not part of my question.
This is part of my code:
for (int i = 0; i < listsize; i++){
listFields.add(entity);
}
FieldsDAO.saveAll(listFields);
and the screenshot for the report
spring.jpa.properties.hibernate.generate_statistics=true
My second concern is why it executes both batch and JDBC statements. Thanks in advance for your help.
There are more than 1,000,000 records in the table I am working on. I need to perform an asynchronous operation (a push to a queue) for each record. Getting all the records at once and processing each record in a loop feels like a bad idea. Instead, I want to fetch records in batches and loop over each batch. I read somewhere on the internet about querying in batches using setFetchSize(int n), and my DAO looks like:
public List<UserPreferenceDTO> getUserPreferences() {
String sqlQueryString = "select us.id as userId, pf.id as preferenceId from users us, preferences pf where us.id = pf.user_id;";
SQLQuery sqlQuery = (SQLQuery) session.createSQLQuery(sqlQueryString).setFetchSize(200);
return sqlQuery.addScalar("userId").addScalar("preferenceId").setResultTransformer(new AliasToBeanResultTransformer(UserPreferenceDTO.class)).list();
}
My Service class looks like:
List<UserPreferenceDTO> userPreferenceDTOs = userDeviceDao.getUserPreferences();
for(UserPreferenceDTO userPreferenceDTO: userPreferenceDTOs ){
pushToRabbitMQ(userPreferenceDTO);
}
I need to get N records from the DB, push them to the queue for processing, then get another N records, push them to the queue, and so on until all the records are pushed.
A reasonable setFetchSize() is a must in any batch-load scenario, as the database then won't have to send each row separately. Even if a round trip to the database takes just 10 ms, that is still 10 ms × 10 million rows ≈ 28 h to fetch every row one at a time. The improvement usually plateaus somewhere around a fetch size of 1000, but this depends on your environment, so you need to test it.
It might be enough to replace .list() with .scroll(), which returns a ScrollableResults that lets you read one record at a time. However, this depends on the database: some, like MySQL, will fake the scrolling and load the entire result set.
If that's the case, you need to use ORDER BY in your query together with setFirstResult() and setMaxResults(). This executes a new query for each batch. It's the safest approach, but ORDER BY might be an expensive operation.
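The paging loop itself can be sketched without Hibernate. In the sketch below, fetchPage is a hypothetical hook standing in for a query executed with setFirstResult(firstResult) and setMaxResults(maxResults) on top of a stable ORDER BY, and handler stands in for something like pushToRabbitMQ.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

final class PagedReadSketch {
    /**
     * Reads pages of at most pageSize records until a short (or empty) page
     * signals the end, handing every record to the handler.
     * Returns the total number of records handled.
     */
    static <T> int forEachInPages(int pageSize,
                                  BiFunction<Integer, Integer, List<T>> fetchPage,
                                  Consumer<T> handler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int offset = 0;
        while (true) {
            // One query per page: (firstResult, maxResults).
            List<T> page = fetchPage.apply(offset, pageSize);
            for (T item : page) {
                handler.accept(item);
            }
            offset += page.size();
            if (page.size() < pageSize) {
                return offset;
            }
        }
    }
}
```

Only one page is held in memory at a time. Offset paging assumes the table is not modified between pages; if rows are inserted or deleted concurrently, records can be skipped or seen twice.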
I have an application using Hibernate. One of its modules calls a native SQL stored procedure in a batch process. Roughly, every time it writes a file it updates a field in the database. Right now I am not sure how many files will need to be written, as it depends on the number of transactions per day, so it could be anywhere from zero to a million.
If I use this code snippet in a loop, will I have any problems?
@Transactional
public void test()
{
    //The for loop represents a list of records that needs to be processed.
    for (int i = 0; i < 1000000; i++ )
    {
        //Process the records and write the information into a file.
        ...
        //Update a field(s) in the database using a stored procedure based on the processed information.
        updateField(String.valueOf(i));
    }
}
@Transactional(propagation=Propagation.MANDATORY)
public void updateField(String value)
{
    Session session = getSession();
    SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
    sqlQuery.setParameter("value", value);
    sqlQuery.executeUpdate();
}
Will I need any other configurations for my data source and transaction manager?
Will I need to set hibernate.jdbc.batch_size and hibernate.cache.use_second_level_cache?
Will I need to use session flush and clear for this? The samples in the Hibernate tutorial use POJOs, not native SQL, so I am not sure whether this also applies.
Please note that another part of the application is already using Hibernate, so as much as possible I would like to stick to Hibernate.
Thank you for your time; I am hoping for a quick response. If possible, a code snippet would be really useful for me.
Application Work Flow
1) Query Database for the transaction information. (Transaction date, Type of account, currency, etc..)
2) For each account process transaction information. (Discounts, Current Balance, etc..)
3) Write the transaction information and processed information to a file.
4) Update a database field based on the process information
5) Go back to step 2 while there are still accounts. (Assuming that no exceptions are thrown)
The code snippet will open and close the session for each iteration, which is definitely not good practice.
Is it possible to have a job that checks how many new files were added to the folder?
The job should run, say, every 15 to 25 minutes, check how many files were changed or added in that interval, and update the database in one batch.
Something like that will reduce the number of sessions opened and closed. It should be much faster than this.
I have a program that is used to replicate/mirror the main tables (around 20) from Oracle to MSSQL 2005 via webservice (REST).
The program periodically reads XML data from the web service and converts it to a list of JPA entities. This list of entities is then stored to MSSQL via JPA.
All JPA entities are provided by the team who created the web service.
There are two issues that I noticed and that seem unsolvable after some searching.
1st issue: The performance of inserting/updating via JDBC jpa is very slow, it takes around 0.1s per row...
Doing the same via C# -> DataTable -> bulk insert into a new table in the DB -> call a stored procedure to do a mass insert/update based on joins takes 0.01 s for 4000 records.
(Each table will have around 500-5000 records every 5 minutes)
Below is a snapshot of the Java code that does the task (persistence library: EclipseLink JPA 2.0):
private void GetEntityA(OurClient client, EntityManager em, DBWriter dbWriter){
//code to log time and others
List<EntityA> response = client.findEntityA_XML();
em.setFlushMode(FlushModeType.COMMIT);
em.getTransaction().begin();
int count = 0;
for (EntityA object : response) {
count++;
em.merge(object);
//Batch commit
if (count % 1000 == 0){
try{
em.getTransaction().commit();
em.getTransaction().begin();
commitRecords = count;
} catch (Exception e) {
em.getTransaction().rollback();
}
}
}
try{
em.getTransaction().commit();
} catch (Exception e) {
em.getTransaction().rollback();
}
//dbWriter write log to DB
}
Is anything done wrong here that causes the slowness? How can I improve the insert/update speed?
2nd issue: There are around 20 tables to replicate and I have created the same number of methods similar to above, basically copying above method 20 times and replace EntityA with EntityB and so on, you get the idea...
Is there any way to generalize the method so that I can pass in any entity?
The performance of inserting/updating via JDBC jpa is very slow,
OR mappers are generally slow for bulk inserts, by definition. You want speed? Use another approach.
In general an ORM will not cater for the bulk insert / stored procedure approach and thus gets slaughtered here. You are using the wrong approach for high-performance inserts.
There are around 20 tables to replicate and I have created the same number of methods similar to
above, basically copying above method 20 times and replace EntityA with EntityB and so on, you get
the idea...
Generics. Part of java for some time now.
You can execute SQL, stored procedures or JPQL update-all queries through JPA as well. I'm not sure where these objects are coming from, but if you are just migrating one table to another in the same database, you can do the same thing in Java with JPA that you were doing in C#.
If you want to process the objects in JPA, then see,
http://java-persistence-performance.blogspot.com/2011/06/how-to-improve-jpa-performance-by-1825.html
For #2, change EntityA to Object, and you have a generic method.
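A type-safe variant of the same idea uses a generic method instead of Object. The sketch below is plain Java under stated assumptions: merge and commit are hypothetical hooks, which in the JPA code from the question would correspond to em.merge(...) and to committing and re-opening the transaction.

```java
import java.util.List;
import java.util.function.Consumer;

final class GenericBatchSketch {
    /**
     * Merges every entity and calls commit after each full batch,
     * plus once more for a trailing partial batch.
     * Returns the number of commits performed.
     */
    static <T> int mergeInBatches(List<T> entities,
                                  int batchSize,
                                  Consumer<? super T> merge,
                                  Runnable commit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int count = 0;
        for (T entity : entities) {
            merge.accept(entity);
            count++;
            if (count % batchSize == 0) {
                commit.run();
                commits++;
            }
        }
        // Commit whatever is left over after the last full batch.
        if (count % batchSize != 0) {
            commit.run();
            commits++;
        }
        return commits;
    }
}
```

One method then serves all 20 tables: the caller passes the list for EntityA, EntityB and so on, and the compiler infers T from the list.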