Writing Data from RDS to Disk in JOOQ


My use case is that I have to run a query on an RDS instance that returns 2 million records. I want to copy the result directly to disk instead of bringing it into memory first and then copying it to disk.
The following statement brings all the records into memory; I want to transfer the results directly to a file on disk.
Result<Record> abc = dslContext.selectQuery().fetch();
Can anyone suggest a pointer?
Update1:
I found the following way to read it:
try (Cursor<BookRecord> cursor = create.selectFrom(BOOK).fetchLazy()) {
while (cursor.hasNext()){
BookRecord book = cursor.fetchOne();
Util.doThingsWithBook(book);
}
}
How many records does it fetch at once, and are those records brought into memory first?
Update2:
By default, the MySQL driver fetches all the records at once. If the fetch size is set to Integer.MIN_VALUE, it streams the records one at a time. If you want to fetch the records in batches, set useCursorFetch=true in the connection properties and use a positive fetch size.
Related documentation: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html

Your approach using the ResultQuery.fetchLazy() method is the way to go for jOOQ to fetch records one at a time from JDBC. Note that you can use Cursor.fetchNext(int) to fetch a batch of records from JDBC as well.
There's a second thing you might need to configure, and that's the JDBC fetch size, see Statement.setFetchSize(int). This configures how many rows are fetched by the JDBC driver from the server in a single batch. Depending on your database / JDBC driver (e.g. MySQL), the default would again be to fetch all rows in one go. In order to specify the JDBC fetch size on a jOOQ query, use ResultQuery.fetchSize(int). So your loop would become:
try (Cursor<BookRecord> cursor = create
.selectFrom(BOOK)
.fetchSize(size)
.fetchLazy()) {
while (cursor.hasNext()){
BookRecord book = cursor.fetchOne();
Util.doThingsWithBook(book);
}
}
Please read your JDBC driver's manual to see how it interprets the fetch size, noting that MySQL is "special".
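To cover the "directly to disk" part of the question, here is a minimal sketch of the writing side. It uses only the JDK and treats the row source as a plain Iterator<Object[]>, which is an assumption made to keep it self-contained: a jOOQ Cursor is Iterable, and each record could be adapted with Record.intoArray(). The class name RowFileWriter and the simple CSV escaping are illustrative, not part of jOOQ.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

/** Hypothetical helper: writes rows as CSV one at a time, holding only the current row. */
final class RowFileWriter {

    /** Streams every row from the iterator into the writer and returns the row count. */
    static long writeRows(Iterator<Object[]> rows, Writer out) {
        long count = 0;
        try {
            while (rows.hasNext()) {
                Object[] row = rows.next();
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) out.write(',');
                    out.write(escape(row[i]));
                }
                out.write('\n');
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    /** Convenience wrapper that opens a buffered UTF-8 file writer for the target path. */
    static long writeToFile(Iterator<Object[]> rows, Path target) {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            return writeRows(rows, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Minimal CSV escaping: null becomes empty, fields with comma, quote or newline are quoted. */
    private static String escape(Object value) {
        if (value == null) return "";
        String s = value.toString();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }
}
```

Memory use stays flat only if the source really is lazy, which is why the fetch size and the driver's behaviour described above still matter.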

Related

Spring Data JPA: Efficiently Query The Database for A Large Dataset

I have written an application to scrape a huge set of reviews. For each review I store the review itself in Review_Table(User_Id, Trail_Id, Rating), the user (Id, Username, UserLink), and the trail, which is built earlier in the code (Id, ...60 other attributes).
for(Element card: reviewCards){
String userName = card.select("expression").text();
String userLink = card.select("expression").attr("href");
String userRatingString = card.select("expression").attr("aria-label");
Double userRating;
if(userRatingString.equals("NaN Stars")){
userRating = 0.0;
}else {
userRating = Double.parseDouble(userRatingString.replaceAll("[^0-9.]", ""));
}
User u;
Rating r;
//probably this is the bottleneck
if(userService.getByUserLink(userLink)!=null){
u = new User(userName, userLink, new HashSet<Rating>());
r = Rating.builder()
.user(u)
.userRating(userRating)
.trail(t)
.build();
}else {
u = userService.getByUserLink(userLink);
r = Rating.builder()
.user(u)
.userRating(userRating)
.trail(t)
.build();
}
i = i +1;
ratingSet.add(r);
userSet.add(u);
}
saveToDb(userSet, t, link, ratingSet);
savedEntities = savedEntities + 1;
log.info(savedEntities + " Saved Entities");
}
The code works fine for small to medium-sized datasets, but I hit a huge bottleneck with larger ones. Suppose I have 13K user entities already stored in the Postgres DB and another batch of 8500 reviews arrives to be scraped: for every review I have to check whether its user is already stored. This is taking forever.
I tried to define an index on the UserLink attribute in Postgres, but the speed didn't improve at all.
I tried to collect all the users stored in the DB into a set and use the contains method to check whether a particular user already exists in the set. I thought this would bypass the database bottleneck of 8K reads and writes, although it is risky, because with too many users in the table I would run out of memory. The speed, again, didn't improve.
At this point I don't have any other idea for improving this.

Well, for one, you would certainly benefit from not querying for each user individually in a loop. What you can do is query and cache only the UserLink or the UserName, that is, load and cache the complete set of just one of them, because that is all you seem to need to decide between the branches of the if-else.
You can query for individual fields with a Spring Data JPA @Query, either directly or with Spring Data JPA projections to select a subset of fields, and then cache them and use them for the lookup. If you think the users could run into millions or billions, you could consider a distributed cache like Apache Ignite, where your collection could scale easily.
By the way, the if-else seems to be inverted, is it not?
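The lookup described above, with the branches the right way round, can be sketched in plain Java as follows. The User class here is a hypothetical stand-in for the entity in the question, and the list passed to the constructor is assumed to come from a single bulk query; none of this is Spring Data API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: resolve users from a map loaded once instead of one query per review. */
final class UserLinkCache {

    /** Stand-in for the User entity; only the fields needed for the lookup. */
    static final class User {
        final String name;
        final String link;

        User(String name, String link) {
            this.name = name;
            this.link = link;
        }
    }

    private final Map<String, User> byLink = new HashMap<>();
    private int created = 0;

    /** Seeds the cache with the users that are already stored. */
    UserLinkCache(List<User> existing) {
        for (User u : existing) {
            byLink.put(u.link, u);
        }
    }

    /** Returns the known user for the link, or creates and remembers a new one. */
    User resolve(String name, String link) {
        User u = byLink.get(link);
        if (u == null) { // not stored yet, so this is the branch that creates a user
            u = new User(name, link);
            byLink.put(link, u);
            created++;
        }
        return u;
    }

    /** Number of users created since the cache was seeded. */
    int createdCount() {
        return created;
    }
}
```

Each review then costs one hash lookup, and only the newly created users need to be written back.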
Next, don't store each review individually, which is what the above code appears to do. You can write in batches. Also, since you are using Postgres, you can use the CopyManager provided by the Postgres JDBC driver for bulk data transfer, wiring it in through a Spring Data custom repository. You would keep appending to a local text/CSV file, load that file into the table on a set schedule (every x minutes), and then remove it. This would be really quick.
The other option is to write a stored procedure that combines the above and invoke it, again from a custom repository.
Please let me know which one you'd like elaborated.
UPDATE (Jan 12 2022):
One other item I missed: when you are querying by UserLink or UserName, you can use a very efficient form of select query that Postgres supports instead of an IN clause, like below:
@Query(value = "select u from user u where u.userLink = ANY('{:userLinks}'::varchar[])", nativeQuery = true)
List<Users> getUsersByLinks(@Param("userLinks") String[] userLinks);

Oracle to SQL Server data transfer using spring boot

I am looking for a technical solution to query data from one database (Oracle) and load it into a SQL Server database using Java and Spring Boot.
Mock query to get the productNames updated within a given 20-hour window:
SELECT
productName, updatedtime FROM
products WHERE
updatedtime BETWEEN '2018-03-26 00:00:01' AND '2018-03-26 19:59:59';
Here is the approach we followed.
1) It is a long-running Oracle query that takes approximately 1 hour during business hours and returns ~1 million records.
2) We have to insert this result set into a SQL Server table using JDBC.
3) As far as I know, the Oracle JDBC driver supports a kind of streaming: when we iterate over the ResultSet, it loads only fetchSize rows into memory at a time.
int currentRow = 1;
while (rs.next()) {
// get your data from current row from Oracle database and accumulate in a batch
if (currentRow++ % BATCH_SIZE == 0) {
//insert whole accumulated batch into SqlServer database
}
}
In this case we do not need to hold the whole dataset from Oracle in memory, and we insert into SQL Server in batches of BATCH_SIZE. The only open point is where to commit on the SQL Server side.
4) The bottleneck is the time spent waiting for the query to return data from the Oracle database, so I am planning to split the query into 10 equal parts, each restricted to a sub-range of updatedtime (for example one hour, as shown), so that the execution time drops to ~10 minutes per query.
eg:
SELECT
productName, updatedtime FROM
products WHERE
updatedtime BETWEEN '2018-03-26 01:00:01' AND '2018-03-26 01:59:59';
5) For that I need 5 Oracle JDBC connections and 5 SQL Server connections (to query the data and insert it into the database) working independently. I am new to JDBC connection pooling.
How can I set up connection pooling and close connections when they are not in use?
Please suggest any better approach for getting the data from the data source quickly, close to real time. Thanks in advance.
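The split described in step 4 can be sketched with java.time. This is only an illustration under assumptions: the class and method names are made up, and it produces half-open intervals [from, to), which would be queried as updatedtime >= ? AND updatedtime < ? rather than with BETWEEN, so that neighbouring slices neither overlap nor leave one-second gaps.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts a time window into contiguous, roughly equal slices. */
final class TimeRangeSplitter {

    /** A half-open interval [from, to). */
    static final class Range {
        final LocalDateTime from;
        final LocalDateTime to;

        Range(LocalDateTime from, LocalDateTime to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Splits [start, end) into the given number of slices, working at one-second resolution. */
    static List<Range> split(LocalDateTime start, LocalDateTime end, int parts) {
        if (parts <= 0 || !start.isBefore(end)) {
            throw new IllegalArgumentException("need parts > 0 and start before end");
        }
        long totalSeconds = Duration.between(start, end).getSeconds();
        List<Range> ranges = new ArrayList<>(parts);
        LocalDateTime from = start;
        for (int i = 1; i <= parts; i++) {
            // The last slice ends exactly at 'end' so integer rounding never drops rows.
            LocalDateTime to = (i == parts) ? end : start.plusSeconds(totalSeconds * i / parts);
            ranges.add(new Range(from, to));
            from = to;
        }
        return ranges;
    }
}
```

Each Range can then be bound to its own query and handed to a worker that holds one connection from each pool.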

This is a typical use case for Spring Batch.
There you have the concepts of an ItemReader (reading from your source DB) and an ItemWriter (writing into your destination DB).
You can define multiple data sources, read with a fixed fetch size (JdbcCursorItemReader, for instance), and also partition the work for parallel execution.
With a quick search you can find many examples online for this kind of task.
I know I'm not posting code for the concept, but it would take me some time to prepare a decent example.
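Independently of Spring Batch, the chunked read-then-write loop from the question can be sketched in plain Java. One detail the loop in the question leaves out is the final partial batch, which has to be flushed after the loop ends. The names below are hypothetical; in a real JDBC copy the sink would typically call PreparedStatement.addBatch() for each row, then executeBatch() and commit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: pulls items from a source and hands them to a sink in fixed-size batches. */
final class ChunkedCopier {

    /** Copies all items and returns how many were read. */
    static <T> long copy(Iterator<T> source, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        long total = 0;
        while (source.hasNext()) {
            batch.add(source.next());
            total++;
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the final partial batch
        }
        return total;
    }
}
```

Committing once per batch inside the sink is one simple answer to the "where to commit" question, at the cost of a partially copied table if a later batch fails.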

DB2 ERRORCODE 4499 SQLSTATE=58009

On our production application we recently started getting a weird error from DB2:
Caused by: com.ibm.websphere.ce.cm.StaleConnectionException: [jcc][t4][2055][11259][4.13.80] The database manager is not able to accept new requests, has terminated all requests in progress, or has terminated your particular request due to an error or a force interrupt. ERRORCODE=-4499, SQLSTATE=58009
This occurs when Hibernate tries to select data from one big table (more than 6 million records and 320 columns).
I observed that when the ResultSet has fewer than 10 elements, Hibernate selects successfully.
Our architecture:
Spring 4.0.3
Hibernate 4.3.5
DB2 v10 z/OS
WebSphere 7.0.0.31 (with JDBC V9.7FP5)
This select works when I execute it in Data Studio or when the app is started locally from Tomcat (connected to the production data source). I suppose that the data source on WebSphere is not correctly configured, but I tried some modifications without results. I also tried updating the JDBC driver, but that did not help; I then got ERRORCODE = -1244 instead.
OK, so now I'm looking for any help ;).
I can obviously provide additional information when needed.
Maybe someone has fought with this problem before?
Thanks in advance!

We had the same problem and finally solved it by running REORG and RUNSTATS on the table(s). In our case, the database and tables were damaged, and running both operations resolved it.

This occurs when Hibernate tries to select data from one big table (more than 6 million records and 320 columns)
6 million records with 320 columns seems like a lot to read at once through Hibernate. Have you tried creating a database cursor and streaming a few records at a time? In plain JDBC it is done as follows:
Statement stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(50); //fetch only 50 records at a time
With Hibernate you would need the code below (hql is the query string):
Query query = session.createQuery(hql);
query.setReadOnly(true);
query.setFetchSize(50);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
Object row = results.get();
// process row then release reference
// flush() alone does not free memory; clear() the session periodically too
}
results.close();
This allows you to stream over the result set. However, Hibernate will still cache the loaded entities in the Session, so you'll need to call session.clear() every so often (preceded by session.flush() if you have pending changes) or evict the entities individually. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand.
Analyze the impact on database table locking when using this approach.

Dumping data from MySql to MongoDB

There is a table event_logs with about 16 million entries. The database is MySQL and is hosted in Google Cloud.
My task is to dump this data into MongoDB. Before dumping the data I need to convert each row into a JSON document.
Table schema issues:
1. There is no auto-increment ID and no primary key in the table.
What I have tried:
1. In Java, using the JDBC driver, I streamed the results in a ResultSet. The first 300k results came quickly, but after that it takes a long time to get results from the database server. Why?
2. I split the query into multiple queries using LIMIT (like limit 1000000,100000), but when the offset is large, like 1000000, it takes a long time to get results. It looks like MySQL still starts from the beginning and discards rows up to the offset.
Please suggest an efficient way to copy the data from MySQL to MongoDB.

First, you can try setting the fetch size on the Statement like this:
...
Statement statement = connection.createStatement();
statement.setFetchSize(2000); // perhaps more...
ResultSet resultSet = statement.executeQuery("YOUR QUERY");
...
Or you could just export your MySQL data (CSV/XML) and then import it using this import-export tool.
Converting each row into a JSON document could be done after that, by parsing the CSV file.
You can also try creating the Statement with these parameters:
Statement stmt = con.createStatement(
ResultSet.TYPE_SCROLL_INSENSITIVE,
ResultSet.CONCUR_READ_ONLY);
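For the row-to-JSON step mentioned in the question, a JDK-only sketch might look like the following. It assumes each row has already been read into a Map of column name to value; the class name RowJson is made up, and in practice a JSON library or the MongoDB Java driver's Document type would normally do this job.

```java
import java.util.Map;

/** Hypothetical helper: renders one row (column name to value) as a flat JSON object. */
final class RowJson {

    /** Numbers and booleans are emitted bare, null as null, everything else as a string. */
    static String toJson(Map<String, Object> row) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : row.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append(quote(e.getKey())).append(':');
            Object v = e.getValue();
            if (v == null) {
                sb.append("null");
            } else if (v instanceof Number || v instanceof Boolean) {
                sb.append(v); // note: NaN and infinities are not valid JSON and are not handled
            } else {
                sb.append(quote(v.toString()));
            }
        }
        return sb.append('}').toString();
    }

    /** Wraps the text in double quotes and escapes quotes, backslashes and control characters. */
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

Passing a LinkedHashMap keeps the columns in the order they were read, which makes the output easier to compare against the source table.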

Use Mongify, a Ruby-based application, for super simple conversion from MySQL to MongoDB.

What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?

I have a really big table that receives some millions of records every day, and at the end of every day I extract all the records of the previous day. I am doing it like this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
Statement.executeQuery(SQL);
The problem is that this program takes about 2GB of memory, because it loads all the results into memory and then processes them.
I tried setting Statement.setFetchSize(10), but it takes exactly the same amount of memory from the OS; it does not make any difference. I am using the Microsoft SQL Server 2005 JDBC Driver for this.
Is there any way to read the results in small chunks, like the Oracle database driver does, where executing the query shows only a few rows and more results are fetched as you scroll down?

In JDBC, the setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.
Inherently, if setFetchSize(10) is being called and the driver is ignoring it, there are probably only two options:
1. Try a different JDBC driver that will honor the fetch-size hint.
2. Look at driver-specific properties on the Connection (URL and/or property map when creating the Connection instance).
The RESULT-SET is the number of rows marshalled on the DB in response to the query.
The ROW-SET is the chunk of rows that are fetched out of the RESULT-SET per call from the JVM to the DB.
The number of these calls and resulting RAM required for processing is dependent on the fetch-size setting.
So if the RESULT-SET has 100 rows and the fetch-size is 10,
there will be 10 network calls to retrieve all of the data, using roughly 10*{row-content-size} RAM at any given time.
The default fetch-size is driver-specific; where it is 10 (as with Oracle), that is rather small.
In the case posted, it would appear the driver is ignoring the fetch-size setting, retrieving all data in one call (large RAM requirement, optimum minimal network calls).
What happens underneath ResultSet.next() is that it doesn't actually fetch one row at a time from the RESULT-SET. It fetches that from the (local) ROW-SET and fetches the next ROW-SET (invisibly) from the server as it becomes exhausted on the local client.
All of this depends on the driver as the setting is just a 'hint' but in practice I have found this is how it works for many drivers and databases (verified in many versions of Oracle, DB2 and MySQL).
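The arithmetic above can be written out as a tiny helper. This is an idealised model under the assumption that the driver honors the hint exactly; real drivers may add a round trip to detect the end of the result set. The class and method names are made up.

```java
/** Hypothetical helper: idealised fetch-size arithmetic for a driver that honors the hint. */
final class FetchMath {

    /** Network round trips needed to read totalRows when each call returns fetchSize rows. */
    static long roundTrips(long totalRows, int fetchSize) {
        if (totalRows < 0 || fetchSize <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and fetchSize > 0");
        }
        return (totalRows + fetchSize - 1) / fetchSize; // ceiling division
    }

    /** Upper bound on the number of rows held in the client-side ROW-SET at any time. */
    static long rowsBuffered(long totalRows, int fetchSize) {
        return Math.min(totalRows, fetchSize);
    }
}
```

For the example in the text, roundTrips(100, 10) gives 10 calls with at most 10 rows buffered; raising the fetch size trades memory for fewer calls.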

The fetchSize parameter is a hint to the JDBC driver as to how many rows to fetch in one go from the database. But the driver is free to ignore this and do what it sees fit. Some drivers, like the Oracle one, fetch rows in chunks, so you can read very large result sets without needing lots of memory. Other drivers just read in the whole result set in one go, and I'm guessing that's what your driver is doing.
You can try upgrading your driver to the SQL Server 2008 version (which might be better), or the open-source jTDS driver.

You need to ensure that auto-commit on the Connection is turned off, or setFetchSize will have no effect.
dbConnection.setAutoCommit(false);
Edit: Remembered that when I used this fix it was Postgres-specific, but hopefully it will still work for SQL Server.

Statement interface doc
Summary: void setFetchSize(int rows)
Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed.
Read this ebook: J2EE and Beyond by Art Taylor

Sounds like the MSSQL JDBC driver is buffering the entire result set for you. You can add a connection string parameter saying selectMethod=cursor or responseBuffering=adaptive. If you are on version 2.0+ of the 2005 MSSQL JDBC driver, then response buffering should default to adaptive.
http://msdn.microsoft.com/en-us/library/bb879937.aspx

It sounds to me that you really want to limit the rows being returned in your query and page through the results. If so, you can do something like:
select * from (select rownum myrow, a.* from TEST1 a )
where myrow between 5 and 10 ;
You just have to determine your boundaries.

Try this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
connection.setAutoCommit(false);
PreparedStatement stmt = connection.prepareStatement(SQL, SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY, SQLServerResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(2000);
stmt.set....
stmt.execute();
ResultSet rset = stmt.getResultSet();
while (rset.next()) {
// ......
}

I had the exact same problem in a project. The issue is that even though the fetch size might be small enough, JdbcTemplate reads the whole result of your query and maps it into one huge list, which might blow your memory. I ended up extending NamedParameterJdbcTemplate to add a function that returns a Stream of objects. That Stream is based on the ResultSet normally returned by JDBC, but it pulls data from the ResultSet only as the Stream requires it. This works as long as you don't keep a reference to every object the Stream emits. I drew heavily on the implementation of org.springframework.jdbc.core.JdbcTemplate#execute(org.springframework.jdbc.core.ConnectionCallback). The only real difference is what is done with the ResultSet. I ended up writing this function to wrap the ResultSet:
private <T> Stream<T> wrapIntoStream(ResultSet rs, RowMapper<T> mapper) {
CustomSpliterator<T> spliterator = new CustomSpliterator<T>(rs, mapper, Long.MAX_VALUE, Spliterator.NONNULL | Spliterator.IMMUTABLE | Spliterator.ORDERED);
Stream<T> stream = StreamSupport.stream(spliterator, false);
return stream;
}
private static class CustomSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
// won't put code for constructor or properties here
// the idea is to pull for the ResultSet and set into the Stream
@Override
public boolean tryAdvance(Consumer<? super T> action) {
try {
// you can add some logic to close the stream/Resultset automatically
if(rs.next()) {
T mapped = mapper.mapRow(rs, rowNumber++);
action.accept(mapped);
return true;
} else {
return false;
}
} catch (SQLException e) {
// do something with this exception; rethrowing unchecked keeps the method compilable
throw new IllegalStateException(e);
}
}
}
You can add some logic to make that Stream "auto closable"; otherwise don't forget to close it when you are done.
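A self-contained variant of the same idea, including the close handling, could look roughly like this. It is a sketch, not the code from the project above: RowFn is a hypothetical stand-in for Spring's RowMapper so that the example needs only the JDK, and the choice to rethrow SQLException as IllegalStateException is an assumption.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical helper: exposes a ResultSet as a lazy Stream that closes the ResultSet on close. */
final class ResultSetStreams {

    /** Stand-in for Spring's RowMapper; maps the current row of the ResultSet. */
    @FunctionalInterface
    interface RowFn<T> {
        T map(ResultSet rs) throws SQLException;
    }

    /** Rows are read and mapped only as the stream is consumed. */
    static <T> Stream<T> stream(ResultSet rs, RowFn<T> mapper) {
        Spliterator<T> spliterator = new Spliterators.AbstractSpliterator<T>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.IMMUTABLE) {
            @Override
            public boolean tryAdvance(Consumer<? super T> action) {
                try {
                    if (!rs.next()) {
                        return false; // result set exhausted
                    }
                    action.accept(mapper.map(rs));
                    return true;
                } catch (SQLException e) {
                    throw new IllegalStateException("failed to read row", e);
                }
            }
        };
        // onClose makes the stream usable in try-with-resources.
        return StreamSupport.stream(spliterator, false).onClose(() -> {
            try {
                rs.close();
            } catch (SQLException e) {
                throw new IllegalStateException("failed to close ResultSet", e);
            }
        });
    }
}
```

Used inside try-with-resources, the stream closes the ResultSet when the block exits; the Statement and Connection still have to be closed by whoever opened them.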
