I have written an application to scrape a huge set of reviews. For each review I store the review itself in Review_Table(User_Id, Trail_Id, Rating), the user (Id, Username, UserLink), and the Trail, which is built earlier in the code (Id, ...60 other attributes).
for (Element card : reviewCards) {
    String userName = card.select("expression").text();
    String userLink = card.select("expression").attr("href");
    String userRatingString = card.select("expression").attr("aria-label");
    Double userRating;
    if (userRatingString.equals("NaN Stars")) {
        userRating = 0.0;
    } else {
        userRating = Double.parseDouble(userRatingString.replaceAll("[^0-9.]", ""));
    }
    User u;
    Rating r;
    // probably this is the bottleneck
    if (userService.getByUserLink(userLink) != null) {
        u = new User(userName, userLink, new HashSet<Rating>());
        r = Rating.builder()
                .user(u)
                .userRating(userRating)
                .trail(t)
                .build();
    } else {
        u = userService.getByUserLink(userLink);
        r = Rating.builder()
                .user(u)
                .userRating(userRating)
                .trail(t)
                .build();
    }
    i = i + 1;
    ratingSet.add(r);
    userSet.add(u);
}
saveToDb(userSet, t, link, ratingSet);
savedEntities = savedEntities + 1;
log.info(savedEntities + " Saved Entities");
}
The code works fine for small to medium sized datasets, but I hit a huge bottleneck for larger ones. Suppose I have 13K user entities already stored in the Postgres DB and another batch of 8,500 reviews comes in to be scraped: for every review I have to check whether its user is already stored. This takes forever.
I tried to define an index on the UserLink attribute in Postgres, but the speed didn't improve at all.
I tried to fetch all the users stored in the DB into a set and use the contains method to check whether a particular user already exists (this way I thought I could bypass the database bottleneck of ~8K reads, although in a risky way, because if there were too many users in the table I would run out of memory). The speed, again, didn't improve.
At this point I don't have any other ideas to improve this.
Well, for one, you would certainly benefit from not querying for each user individually in a loop. What you can do is query and cache only the UserLink or UserName, i.e. fetch and cache the complete set of just one of them, because that is all you seem to need for the differentiation in the if-else.
You can query for individual fields with a Spring Data JPA @Query directly, or even with Spring Data JPA projections to fetch a subset of fields if needed, and cache and use them for the lookup. If you think the users could run into millions or billions, you could consider a distributed cache like Apache Ignite, where your collection could scale easily.
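Coming back to the projection idea, here is a minimal sketch; the projection interface, repository, and method names are assumptions for illustration, not taken from your code:
// Hypothetical projection exposing only the userLink column.
public interface UserLinkView {
    String getUserLink();
}

// Hypothetical Spring Data repository for the User entity.
public interface UserRepository extends JpaRepository<User, Long> {
    // Derived query that returns only the projected field for all users.
    List<UserLinkView> findAllProjectedBy();
}

// Build the lookup set once per scraping run instead of one query per review:
Set<String> knownUserLinks = userRepository.findAllProjectedBy().stream()
        .map(UserLinkView::getUserLink)
        .collect(Collectors.toSet());
// Inside the loop, the existence check becomes a plain in-memory lookup:
boolean alreadyStored = knownUserLinks.contains(userLink);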
By the way, the if-else branches seem to be inverted, don't they? The new User is created in the branch where getByUserLink() found an existing one.
Next, don't store each review individually, which is what the above code appears to do. You can write in batches. Also, since you are using Postgres, you can use the CopyManager provided by the Postgres JDBC driver for bulk data transfer, wired in through a Spring Data custom repository. You can keep appending to a local text/CSV file on a set schedule (every x minutes), use CopyManager to bulk-write that batched text/CSV file to the table (after those x minutes), and then remove the file. This would be really quick.
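As a rough sketch of the CopyManager idea (the class name, table, and column names are assumptions, not taken from your code):
import java.io.FileReader;
import java.sql.Connection;
import javax.sql.DataSource;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

// Hypothetical custom-repository implementation that bulk-loads a CSV file into the rating table.
public class RatingBulkLoader {

    private final DataSource dataSource;

    public RatingBulkLoader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public long copyRatings(String csvPath) throws Exception {
        try (Connection connection = dataSource.getConnection();
             FileReader reader = new FileReader(csvPath)) {
            // CopyManager needs the underlying Postgres connection; with a pooled
            // DataSource the unwrap call may need to be adapted.
            CopyManager copyManager = new CopyManager(connection.unwrap(BaseConnection.class));
            // The listed columns must match the CSV layout.
            return copyManager.copyIn(
                    "COPY rating (user_id, trail_id, user_rating) FROM STDIN WITH (FORMAT csv)",
                    reader);
        }
    }
}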
The other option is to write a stored procedure that combines the above and to invoke it, again via a custom repository.
Please let me know which one you would like elaborated.
UPDATE (Jan 12 2022):
One other item I missed: when you query by UserLink or UserName you can use a very efficient form of select query that Postgres supports, instead of an IN clause, like below:
#Select("select u from user u where u.userLink = ANY('{:userLinks}'::varchar[])", nativeQuery = true)
List<Users> getUsersByLinks(#Param("userLinks") String[] userLinks);
I have a specific question regarding an AnyLogic model that I am trying to build.
I have 3 tables:
connections with columns connecteddc and connectedcustomer
customer with columns custname and demand
dcdetails with columns dcname and dccapactiy
I am trying to write Java code that connects each DC in the first table (connecteddc) to each customer assigned to it (connectedcustomer) and iterates through this process multiple times to build an accurate network. I have tried several variations of the code, as shown below.
for (int i = 0; i < 3; i++) {
    dc.get(i).LinktoCustomers.connectTo(Locations.get(selectFirstValue(false, int.class,
            "SELECT connectedcustomer FROM connections WHERE connectedDC = " + i + ";")));
}
This code only connects 1 DC to 1 customer. The problem lies in the 'selectFirstValue' portion of the code.
Database Query
You have to use one of the possibilities to retrieve all of the relevant database entries, instead of just the first one as you do with selectFirstValue(). Here is one option to do so:
for (int i = 0; i < dc.size(); i++) {
    List<Tuple> rows = selectFrom(connection)
            .where(connection.connecteddc.eq(dc.get(i).dcName))
            .list();
    for (Tuple row : rows) {
        dc.get(i).connectTo(getCustomerByName(row.get(connection.connectedcustomer)));
    }
}
Tip: AnyLogic offers an assistant to create such queries, which you can find in the AnyLogic toolbar under "Insert Database Query". It looks like this:
AnyLogic Database Query Assistant
Other Stuff
I modified a couple of other things that caught my attention:
To add a connection you use dc.get(i).LinktoCustomers.connectTo(...). It is not necessary to use a special variable for the connections; it is enough to add them to the standard connections by using dc.get(i).connectTo(...).
You go through the list of DCs with a hardcoded maximum index. As soon as you change the number of entries in the DC table, the code will no longer work. I recommend something like this: for (int i=0; i<dc.size(); i++){...}.
You gave the name "Locations" to your population of agent type "Customer". It is confusing to use a population name that doesn't reflect the underlying agent type at all. I recommend renaming it, for example to "Customers".
To access your DCs you store and use the index number of the DC as an integer in the tables. To be on the safe side, I recommend using unique String ids instead, which will keep working even if you change the order of your tables. For this to work you'll need to "parse" the id (stored in the tables) into a Customer object.
This could be done in a function getCustomerByName(String name) like this (although it obviously lacks error handling):
for (Customer c : Customers) {
    if (c.custName.equals(name)) {
        return c;
    }
}
return null;
My use case is that I have to run a query on an RDS instance that returns 2 million records. I want to copy the result directly to disk instead of bringing it into memory first and then copying it to disk.
The following statement will bring all the records into memory; I want to stream the results directly to a file on disk.
Result<Record> abc = dslContext.selectQuery().fetch();
Can anyone suggest a pointer?
Update1:
I found the following way to read it :
try (Cursor<BookRecord> cursor = create.selectFrom(BOOK).fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
How many records does it fetch at once, and are those records brought into memory first?
Update 2:
The MySQL driver by default fetches all the records at once. If the fetch size is set to Integer.MIN_VALUE, it fetches one record at a time. If you want to fetch the records in batches, set useCursorFetch=true in the connection properties (see the sketch after the link below).
Related wiki : https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html
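For illustration, a minimal sketch of those two options when opening the MySQL connection; the URL, credentials, and table name are placeholders, not from the original post:
Properties props = new Properties();
props.setProperty("user", "app");
props.setProperty("password", "secret");
// Option 1: server-side cursor, fetching rows in batches of the configured fetch size.
props.setProperty("useCursorFetch", "true");
props.setProperty("defaultFetchSize", "1000");
Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/library", props);

// Option 2 (without useCursorFetch): stream one row at a time via the fetch size hint.
PreparedStatement statement = connection.prepareStatement("select * from book");
statement.setFetchSize(Integer.MIN_VALUE);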
Your approach using the ResultQuery.fetchLazy() method is the way to go for jOOQ to fetch records one at a time from JDBC. Note that you can use Cursor.fetchNext(int) to fetch a batch of records from JDBC as well.
There's a second thing you might need to configure, and that's the JDBC fetch size, see Statement.setFetchSize(int). This configures how many rows are fetched by the JDBC driver from the server in a single batch. Depending on your database / JDBC driver (e.g. MySQL), the default would again be to fetch all rows in one go. In order to specify the JDBC fetch size on a jOOQ query, use ResultQuery.fetchSize(int). So your loop would become:
try (Cursor<BookRecord> cursor = create
        .selectFrom(BOOK)
        .fetchSize(size)
        .fetchLazy()) {
    while (cursor.hasNext()) {
        BookRecord book = cursor.fetchOne();
        Util.doThingsWithBook(book);
    }
}
Please read your JDBC driver manual on how it interprets the fetch size, noting that MySQL is "special".
There are more than 1,000,000 records in the table I am working on. I need to perform an asynchronous operation (a push to a queue) for each record. Getting all the records at once and processing each record in a loop feels like a bad idea. Instead, I want to fetch records in batches and loop over each batch. I read somewhere on the internet about querying in batches using setFetchSize(int n), and my DAO looks like:
public List<UserPreferenceDTO> getUserPreferences() {
    String sqlQueryString = "select us.id as userId, pf.id as preferenceId from users us, preferences pf where us.id = pf.user_id;";
    SQLQuery sqlQuery = (SQLQuery) session.createSQLQuery(sqlQueryString).setFetchSize(200);
    return sqlQuery.addScalar("userId")
            .addScalar("preferenceId")
            .setResultTransformer(new AliasToBeanResultTransformer(UserPreferenceDTO.class))
            .list();
}
My Service class looks like:
List<UserPreferenceDTO> userPreferenceDTOs = userDeviceDao.getUserPreferences();
for (UserPreferenceDTO userPreferenceDTO : userPreferenceDTOs) {
    pushToRabbitMQ(userPreferenceDTO);
}
I need to get "N" records from the DB, push them to the queue for processing, then get another "N" records and push them to the queue, and so on until all the records have been pushed.
A reasonable setFetchSize() is a must in any batch-load scenario, as the database won't have to send each row separately. Even if your round trip to the database is just 10 ms, it's still 10 ms × 1,000,000 ≈ 2.8 h to do it for all the rows. The improvement usually plateaus somewhere around 1000, but this depends on your environment setup, so you need to test it.
It might be enough to replace .list() with .scroll(), which returns ScrollableResults and allows reading one record at a time. This will, however, depend on the database; some, like MySQL, will fake the scrolling and load the entire result set anyway.
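A rough sketch of that change, reusing the query string from the DAO above; the row handling is illustrative, not from the original post:
ScrollableResults results = session.createSQLQuery(sqlQueryString)
        .addScalar("userId")
        .addScalar("preferenceId")
        .setFetchSize(200)
        .scroll(ScrollMode.FORWARD_ONLY);
try {
    while (results.next()) {
        Object[] row = results.get();   // one row at a time instead of the whole list
        // build the DTO from row[0] / row[1] and push it to RabbitMQ here
    }
} finally {
    results.close();
}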
If that's the case, you need to use ORDER BY in your query together with setFirstResult() and setMaxResults(). This executes a new query to read each batch. It's the safest approach, but ORDER BY can be an expensive operation.
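And a sketch of the ORDER BY + setFirstResult()/setMaxResults() variant; the batch size and ordering columns are assumptions:
int batchSize = 1000;
int offset = 0;
List<UserPreferenceDTO> batch;
do {
    batch = session.createSQLQuery(
                "select us.id as userId, pf.id as preferenceId "
                + "from users us join preferences pf on us.id = pf.user_id "
                + "order by us.id, pf.id")   // a stable ORDER BY keeps the pages consistent
            .addScalar("userId")
            .addScalar("preferenceId")
            .setResultTransformer(new AliasToBeanResultTransformer(UserPreferenceDTO.class))
            .setFirstResult(offset)
            .setMaxResults(batchSize)
            .list();
    for (UserPreferenceDTO userPreferenceDTO : batch) {
        pushToRabbitMQ(userPreferenceDTO);
    }
    offset += batchSize;
} while (batch.size() == batchSize);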
I have an application based on MySQL that is accessed through Hibernate. I use DAO utility code to query the database. Now I need to optimize my database queries with indexes. My question is: how can I query data through the Hibernate DAO utility code and make sure the indexes are used by MySQL when the queries are executed? Any hints or pointers to existing examples are appreciated!
Update: just to make the question a bit more understandable, the following is the code I use to query the MySQL database through the Hibernate DAO utility code. I'm not directly using HQL here. Any suggestions for a better solution? If needed, I will rewrite the database query code and use HQL directly instead.
public static List<Measurements> getMeasurementsList(String physicalId, String startdate, String enddate) {
    List<Measurements> listOfMeasurements = new ArrayList<Measurements>();
    Timestamp queryStartDate = toTimestamp(startdate);
    Timestamp queryEndDate = toTimestamp(enddate);
    MeasurementsDAO measurementsDAO = new MeasurementsDAO();
    PhysicalLocationDAO physicalLocationDAO = new PhysicalLocationDAO();
    short id = Short.parseShort(physicalId);
    List physicalLocationList = physicalLocationDAO.findByProperty("physicalId", id);
    Iterator ite = physicalLocationList.iterator();
    while (ite.hasNext()) {
        PhysicalLocation physicalLocation = (PhysicalLocation) ite.next();
        List measurementsList = measurementsDAO.findByProperty("physicalLocation", physicalLocation);
        Iterator jte = measurementsList.iterator();
        while (jte.hasNext()) {
            Measurements measurements = (Measurements) jte.next();
            if (measurements.getMeasTstime().after(queryStartDate)
                    && measurements.getMeasTstime().before(queryEndDate)) {
                listOfMeasurements.add(measurements);
            }
        }
    }
    return listOfMeasurements;
}
Just like with SQL, you don't need to do anything special. Just execute your queries as usual, and the database will use the indices you've created to optimize them, if possible.
For example, let's say you have a HQL query that searches all the products that have a given name:
select p from Product p where p.name = :name
This query will be translated by Hibernate to SQL:
select p.id, p.name, p.price, p.code from product p where p.name = ?
If you don't have any index set on product.name, the database will have to scan the whole table of products to find those that have the given name.
If you have an index set on product.name, the database will determine that, given the query, it's useful to use this index, and will thus know which rows have the given name thanks to the index. It will then be able to read only a small subset of the rows to return the queried data.
This is all transparent to you. You just need to know which queries are slow and frequent enough to justify the creation of an index to speed them up.
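For illustration, one hedged way to declare such an index, assuming a JPA 2.1-style mapping of the Product entity used in the example above (the entity and index names are assumptions); the equivalent plain SQL is shown as a comment:
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.Table;

// Equivalent SQL, if you prefer to manage indexes outside Hibernate:
//   CREATE INDEX idx_product_name ON product (name);
@Entity
@Table(name = "product", indexes = @Index(name = "idx_product_name", columnList = "name"))
public class Product {

    @Id
    private Long id;

    private String name;

    private Double price;

    private String code;
}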
I have an application using Hibernate. One of its modules calls native SQL (a stored procedure) in a batch process. Roughly, what it does is that every time it writes a file it updates a field in the database. Right now I am not sure how many files will need to be written, as it depends on the number of transactions per day, so it could be anywhere from zero to a million.
If I use this code snippet in a while loop, will I have any problems?
@Transactional
public void test()
{
    // The for loop represents a list of records that needs to be processed.
    for (int i = 0; i < 1000000; i++)
    {
        // Process the records and write the information into a file.
        ...
        // Update a field(s) in the database using a stored procedure based on the processed information.
        updateField(String.valueOf(i));
    }
}

@Transactional(propagation = Propagation.MANDATORY)
public void updateField(String value)
{
    Session session = getSession();
    SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
    sqlQuery.setParameter("value", value);
    sqlQuery.executeUpdate();
}
Will I need any other configurations for my data source and transaction manager?
Will I need to set hibernate.jdbc.batch_size and hibernate.cache.use_second_level_cache?
Will I need to use session flush and clear for this? The samples in the Hibernate tutorial use POJOs and not native SQL, so I am not sure whether they also apply here.
Please note that another part of the application already uses Hibernate, so as much as possible I would like to stick with Hibernate.
Thank you for your time; I am hoping for a quick response. If possible, a code snippet would be really useful for me.
Application Work Flow
1) Query Database for the transaction information. (Transaction date, Type of account, currency, etc..)
2) For each account process transaction information. (Discounts, Current Balance, etc..)
3) Write the transaction information and processed information to a file.
4) Update a database field based on the process information
5) Go back to step 2 while there are still accounts. (Assuming that no exceptions are thrown)
The code snippet will open and close the session for each iteration, which is definitely not a good practice.
Is it possible to have a job which checks how many new files were added to the folder?
The job should run, say, every 15/25 minutes, check how many files were changed or added in the last 15/25 minutes, and update the database in a batch.
Something like that will lower the number of session open/close operations, and it should be much faster than the current approach.
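A rough sketch of such a scheduled batch job, assuming Spring's @Scheduled is available; the class, query, and helper names are illustrative, not from your code:
import java.util.ArrayList;
import java.util.List;
import org.hibernate.SQLQuery;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class ProcessedFileSyncJob {

    @Autowired
    private SessionFactory sessionFactory;

    // Runs every 15 minutes and updates everything processed since the last run in one transaction.
    @Scheduled(fixedDelay = 15 * 60 * 1000)
    @Transactional
    public void syncProcessedFiles() {
        Session session = sessionFactory.getCurrentSession();
        for (String value : collectProcessedValuesFromFolder()) {
            SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
            sqlQuery.setParameter("value", value);
            sqlQuery.executeUpdate();
        }
    }

    // Hypothetical helper: scan the output folder for files written since the last run
    // and extract the values that still need to be written to the database.
    private List<String> collectProcessedValuesFromFolder() {
        return new ArrayList<String>();
    }
}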