I have an application using Hibernate. One of its modules calls a native SQL stored procedure in a batch process. Roughly, every time it writes a file it updates a field in the database. Right now I am not sure how many files will need to be written, as it depends on the number of transactions per day, so it could be anywhere from zero to a million.
If I use this code snippet in a loop, will I have any problems?
@Transactional
public void test()
{
    // The for loop represents a list of records that needs to be processed.
    for (int i = 0; i < 1000000; i++)
    {
        // Process the record and write the information into a file.
        ...
        // Update a field(s) in the database using a stored procedure based on the processed information.
        updateField(String.valueOf(i));
    }
}

@Transactional(propagation = Propagation.MANDATORY)
public void updateField(String value)
{
    Session session = getSession();
    SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
    sqlQuery.setParameter("value", value);
    sqlQuery.executeUpdate();
}
Will I need any other configurations for my data source and transaction manager?
Will I need to set hibernate.jdbc.batch_size and hibernate.cache.use_second_level_cache?
Will I need to use session flush and clear for this? The samples in the Hibernate tutorial use POJOs and not native SQL, so I am not sure whether they are also applicable here.
Please note that another part of the application already uses Hibernate, so as much as possible I would like to stick with Hibernate.
Thank you for your time. If possible, a code snippet would be really useful for me.
Application Workflow
1) Query Database for the transaction information. (Transaction date, Type of account, currency, etc..)
2) For each account process transaction information. (Discounts, Current Balance, etc..)
3) Write the transaction information and processed information to a file.
4) Update a database field based on the processed information
5) Go back to step 2 while there are still accounts (assuming that no exceptions are thrown)
The code snippet will open and close a session for each iteration, which is definitely not good practice.
Is it possible to have a job which checks how many new files were added to the folder?
The job could run, say, every 15-25 minutes, check which files were changed or added in the last interval, and update the database in a batch.
Something like that will lower the number of sessions opened and closed, and it should be much faster than the current approach.
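If you go that route, a rough sketch of such a scheduled batch update could look like the following. FileUpdateJob, findFilesWrittenSince and the 15-minute interval are assumptions for illustration; it also assumes Spring scheduling is enabled and the SessionFactory participates in Spring-managed transactions.

import java.util.Collections;
import java.util.List;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class FileUpdateJob {

    @Autowired
    private SessionFactory sessionFactory;

    // Runs every 15 minutes and updates all affected records in one transaction.
    @Scheduled(fixedDelay = 15 * 60 * 1000)
    @Transactional
    public void updateProcessedFiles() {
        List<String> processedIds = findFilesWrittenSince(15); // hypothetical helper
        Session session = sessionFactory.getCurrentSession();
        for (String id : processedIds) {
            session.createSQLQuery("exec spUpdate :value")
                   .setParameter("value", id)
                   .executeUpdate();
        }
    }

    // Hypothetical: however the batch process records which files were written
    // in the last interval (a tracking table, a staging folder, etc.).
    private List<String> findFilesWrittenSince(int minutes) {
        return Collections.emptyList();
    }
}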
Related
I have written an application to scrape a huge set of reviews. For each review I store the review itself in Review_Table(User_Id, Trail_Id, Rating), the user (Id, Username, UserLink), and the Trail, which is built earlier in the code (Id, ...60 other attributes).
for (Element card : reviewCards) {
    String userName = card.select("expression").text();
    String userLink = card.select("expression").attr("href");
    String userRatingString = card.select("expression").attr("aria-label");
    Double userRating;
    if (userRatingString.equals("NaN Stars")) {
        userRating = 0.0;
    } else {
        userRating = Double.parseDouble(userRatingString.replaceAll("[^0-9.]", ""));
    }
    User u;
    Rating r;
    // probably this is the bottleneck
    if (userService.getByUserLink(userLink) != null) {
        u = new User(userName, userLink, new HashSet<Rating>());
        r = Rating.builder()
                .user(u)
                .userRating(userRating)
                .trail(t)
                .build();
    } else {
        u = userService.getByUserLink(userLink);
        r = Rating.builder()
                .user(u)
                .userRating(userRating)
                .trail(t)
                .build();
    }
    i = i + 1;
    ratingSet.add(r);
    userSet.add(u);
}
saveToDb(userSet, t, link, ratingSet);
savedEntities = savedEntities + 1;
log.info(savedEntities + " Saved Entities");
}
The code works fine for small to medium-sized datasets, but I hit a huge bottleneck for larger ones. Suppose I have 13K user entities already stored in the Postgres DB and another batch of 8,500 reviews comes in to be scraped; I have to check, for every review, whether the user of that review is already stored. This is taking forever.
I tried to define an index on the UserLink attribute in Postgres, but the speed didn't improve at all.
I also tried to load all the users stored in the DB into a set and use the contains method to check whether a particular user already exists (this way I thought I could bypass the database bottleneck of ~8K reads and writes, though in a risky way, because if there were too many users in the table I would have run into a memory overflow). The speed, again, didn't improve.
At this point I don't have any other ideas to improve this.
Well, for one, you would certainly benefit from not querying for each user individually in a loop. What you can do is query and cache only the UserLink or UserName values, i.e. fetch and cache the complete set of just one of them, because that is all you seem to need for the check in the if-else.
You can query for individual fields with a Spring Data JPA @Query directly, or with Spring Data JPA projections if you need a subset of fields, and cache those values for the lookup. If you think the users could run into the millions or billions, you could consider a distributed cache like Apache Ignite, where your collection could scale easily.
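For example, a minimal sketch of that idea; UserRepository, the Long ID type and the User/userLink mapping are assumptions based on the code above.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

// Sketch only: fetch just the userLink column instead of whole User entities.
public interface UserRepository extends JpaRepository<User, Long> {

    @Query("select u.userLink from User u")
    List<String> findAllUserLinks();
}

// Before the scraping loop: load the links once into a HashSet so the
// "does this user already exist?" check becomes an in-memory lookup.
Set<String> knownLinks = new HashSet<>(userRepository.findAllUserLinks());

// Inside the loop:
boolean alreadyStored = knownLinks.contains(userLink);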
By the way, the if-else branches seem to be inverted, are they not?
Next, don't store each review individually, which is what the above code appears to do; write in batches instead. Also, since you are using Postgres, you can use the CopyManager it provides for bulk data transfer, wiring it in through a Spring Data custom repository: keep writing to a new text/CSV file locally on a set schedule (every x minutes), use CopyManager to load that batched file into the table after each interval, and then remove the file (a sketch follows below). This would be really quick.
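As an illustration of the CopyManager idea, here is a minimal sketch of a custom repository. The RatingBulkRepository name and the rating table/column names are assumptions; it also assumes the Postgres JDBC driver sits behind the DataSource.

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;

import javax.sql.DataSource;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;
import org.springframework.stereotype.Repository;

// Sketch only: bulk-loads a locally written CSV file into the rating table with Postgres COPY.
@Repository
public class RatingBulkRepository {

    private final DataSource dataSource;

    public RatingBulkRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public long copyRatingsFromCsv(String csvPath) throws Exception {
        try (Connection connection = dataSource.getConnection();
             Reader reader = new FileReader(csvPath)) {
            CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
            // COPY streams the whole file into the table in a single round trip.
            return copyManager.copyIn(
                    "COPY rating (user_id, trail_id, rating) FROM STDIN WITH (FORMAT csv)",
                    reader);
        }
    }
}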
The other option is to write a stored procedure that combines the above and invoke it, again through a custom repository.
Please let me know which one you would like elaborated.
UPDATE (Jan 12 2022):
One other item I missed: when querying for UserLink or UserName you can use a very efficient form of select that Postgres supports, instead of an IN clause, like below:
#Select("select u from user u where u.userLink = ANY('{:userLinks}'::varchar[])", nativeQuery = true)
List<Users> getUsersByLinks(#Param("userLinks") String[] userLinks);
I would like to know the best way to do the following:
A client sends a JSON payload of 100 records to the Spring Boot application to insert into the DB.
But before inserting, I have to execute a query to verify some data for EACH of the 100 records, and then insert.
I currently have this:
for (int i = 0; i < productos.size(); i++) {
    productos.get(i).setIdvehiculo(productoRepository.findTesting("49878", 3)); // ----> NATIVE QUERY, EXECUTION TAKES ~100ms I THINK
    productoRepository.save(productos.get(i)); // ----> INSERT
}
//productoRepository.saveAll(productos);
entityManager.flush();
entityManager.clear();
And it takes 10 seconds ... doing the select and inserting. 100 records, 10 seconds, isn't that a long time?
Don't insert records one by one inside the for loop; construct the model there, add it to an ArrayList, and once you are done processing the records, call saveAll(productos) outside the loop (see the first sketch below).
Try enabling the second-level (L2) cache; that would reduce the verification time (see the second sketch below). Depending on how critical your data is, you can also cache the entity at the application level.
Create a transaction to save the entities; this allows the database to leverage its concurrency control.
See if you can change the architecture to introduce a queue (it could be a Kafka queue) and have another application consume from it to write to the database.
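For the first suggestion, a minimal sketch; the Producto entity name is assumed from the productos list in the question, and the repository methods are the ones shown there.

// Sketch only: verify each record first, collect it, then persist the whole
// batch with a single saveAll call instead of 100 individual save calls.
List<Producto> toInsert = new ArrayList<>();
for (Producto producto : productos) {
    producto.setIdvehiculo(productoRepository.findTesting("49878", 3)); // per-record verification query
    toInsert.add(producto);
}
productoRepository.saveAll(toInsert);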
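For the second suggestion, a rough sketch of enabling the second-level cache on the entity behind the verification query. The Vehiculo entity name is an assumption; it also assumes a cache provider such as Ehcache is on the classpath and that hibernate.cache.use_second_level_cache and a region factory are configured.

import javax.persistence.Cacheable;
import javax.persistence.Entity;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Sketch only: mark the looked-up entity as cacheable so repeated verification
// reads can be served from the second-level cache instead of the database.
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Vehiculo {
    // id and the fields read by the verification query go here
}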
I am trying to persist many records in the database by reading a file with many lines.
I'm using a forEach to iterate over the list of objects read from the file:
logs.stream().forEach(log -> save(log));
private LogData save(LogData log) {
    return repository.persist(log);
}
But the inserts are slow.
Is there a way to speed up the inserts?
Your approach takes a long time because you persist element by element, so you make N round trips to the database. I would use batch processing instead, with a single transaction instead of N transactions, so the persist method can be:
public void persist(List<Logs> logs) {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    logs.forEach(log -> session.save(log)); // from the comment of @shmosel
    tx.commit();
    session.close();
}
Use a batch insert: Google "Hibernate Batch Insert", or substitute the name of your ORM if it's not Hibernate.
https://www.tutorialspoint.com/hibernate/hibernate_batch_processing.htm
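For reference, the usual Hibernate batch-insert recipe combines the hibernate.jdbc.batch_size setting with periodic flush/clear calls; a rough sketch follows, where the batch size of 50 is an arbitrary choice.

// Sketch only: set hibernate.jdbc.batch_size=50 in the Hibernate configuration
// so the driver can group these inserts into JDBC batches.
public void persist(List<Logs> logs) {
    int batchSize = 50; // arbitrary; usually matched to hibernate.jdbc.batch_size
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    for (int i = 0; i < logs.size(); i++) {
        session.save(logs.get(i));
        if (i > 0 && i % batchSize == 0) {
            // Push pending inserts to the database and detach the entities so
            // the first-level cache does not grow with every row.
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}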
Inserting at every line makes this program slow; consider collecting n lines and inserting them together at once.
I am working on a monitoring tool developed in Spring Boot using Hibernate as ORM.
I need to go through each row (already persisted rows of sent messages) in my table and see whether a MailId (unique) has received feedback (status: OPENED, BOUNCED, DELIVERED...) or not.
I get the feedback by reading CSV files from a network folder. The parsing and reading of the CSV files goes very fast, but updating my database is very slow. My algorithm is not very efficient because I loop through a list that can have hundreds of thousands of objects and look each one up in my table.
This is the method that makes the update in my table by updating the "target" object (a row in the database table):
@Override
public void updateTargetObjectFoo() throws CSVProcessingException, FileNotFoundException {
    // Here I call performProcessing, which reads files from a folder, parses them
    // into Java objects, and maps them into a feedBackList of type Foo.
    List<Foo> feedBackList = performProcessing(env.getProperty("foo_in"), EXPECTED_HEADER_FIELDS_STATUS, Foo.class, ".LETTERS.STATUS.");
    for (Foo foo : feedBackList) {
        // findByKey does a simple select in MySQL where MailId = foo.getMailId()
        Foo persistedFoo = fooDao.findByKey(foo.getMailId());
        if (persistedFoo != null) {
            persistedFoo.setStatus(foo.getStatus());
            persistedFoo.setDnsCode(foo.getDnsCode());
            persistedFoo.setReturnDate(foo.getReturnDate());
            persistedFoo.setReturnTime(foo.getReturnTime());
            // saveAccount here does a MySQL UPDATE on the table
            fooDao.saveAccount(persistedFoo);
        }
    }
}
What if I do this selection/comparison and update on the Java side, and then re-save the whole list to the database?
Would that be faster?
Thanks to all for your help.
Hibernate is not particularly well-suited for batch processing.
You may be better off using Spring's JdbcTemplate to do JDBC batch processing.
However, if you must do this via Hibernate, this may help: https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/chapters/batch/Batching.html
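For illustration, a minimal JdbcTemplate sketch of the update step; the table and column names are assumptions to be adjusted to the real schema, and the Foo getters come from the code above.

import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: batch the UPDATE statements instead of loading and saving one entity per MailId.
public void updateFeedback(List<Foo> feedBackList, JdbcTemplate jdbcTemplate) {
    jdbcTemplate.batchUpdate(
            "UPDATE foo SET status = ?, dns_code = ?, return_date = ?, return_time = ? WHERE mail_id = ?",
            feedBackList,
            1000, // rows sent per JDBC batch
            (ps, foo) -> {
                ps.setObject(1, foo.getStatus());
                ps.setObject(2, foo.getDnsCode());
                ps.setObject(3, foo.getReturnDate());
                ps.setObject(4, foo.getReturnTime());
                ps.setObject(5, foo.getMailId());
            });
}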
My business flow is the following:
1) Invalidate a command
2) Fetch data from the command (database operations, a little slower)
Step 2 would be accessed by many concurrent users.
Now, when a command is invalidated and users try to fetch the data, multiple database queries start executing, because the execution is a little slow.
Is there any way to stop these multiple executions of the queries?
In other words, the question is: can we make the execution of the command and the fetching of data from the command synchronized?
Yes, you can do something like this.
public class Fetcher {
    private String data;
    private long timestamp;

    public synchronized String fetchData() {
        String result = "";
        if (data != null) {
            result = data;
            // let's invalidate too old data
            if (new Date().getTime() - timestamp > 100000)
                data = null;
        } else {
            DAO db = DAO.getConnection();
            data = db.performQuery();
            timestamp = new Date().getTime();
            result = data;
        }
        return result;
    }
}
If you are using a Dynacache cacheable command and the queries are the same for users, then the command should get cached after the first execution.
Only the first execution should hit the database, after that the data should be fetched from cache until the cache is invalidated.
I usually use Dynacache as part of the IBM WebSphere Commerce suite.
WebSphere Commerce uses a scheduled command to check a table called CACHEIVL.
You would set up triggers which insert an invalidation ID into CACHEIVL when the target table is changed.
Since you don't have the scheduled Dynacache command, you can implement something specific to your use case using WebSphere schedulers.
Here is an example of a cacheable command using Dynacache.