Fast streaming batch data from mssql database

Fast streaming batch data from mssql database - java

I need to read each row from a complex query in SQL server database using Hibernate and write the result to a file. But the query can return millions of records so it seemed that the following code was appropriate:
Session unwrap = entityManager.unwrap(Session.class);
NativeQuery nativeQuery =
unwrap.createNativeQuery("the sql query string read from a file");
nativeQuery.setFlushMode(FlushMode.MANUAL);
nativeQuery.addEntity("C", CustomObject.class);
nativeQuery.setFetchSize(100000);
nativeQuery.setReadOnly(true);
ScrollableResults scroll = nativeQuery.scroll(ScrollMode.FORWARD_ONLY);
while(scroll.next()) {
CustomObject customObject = (CustomObject) scroll.get(0);
jsonGenerator.writeObject(customObject); // using the JsonGenerator library https://fasterxml.github.io/jackson-core/javadoc/2.6/com/fasterxml/jackson/core/JsonGenerator.html
unwrap.evict(claimEntity);
}
Currently, this code takes approximately 3-4 days to write around 1 million records to the file, which is too slow. I am using the mssql-jdbc driver with hibernate and I assume that the fetch size might be ignored by the driver, but changing the driver is not an option for me since the other drivers do not support the bulk copy functionality.
The problem is that hibernate is probably making a connection to fetch each row separately from the database, resulting in expensive network calls.
I have tried setting adaptive buffering, enabled cursors, setting the connection auto commit mode to false and other things, but nothing seemed to make this faster.
I would like to make this faster and would appreciate any help.

Had a similar issue!
Data set was too big, while in a project which involved task of Bank Migration
Solution Adopted: Used PlSql instead of Java Batch. They are always faster.
Another thought I will like to add into this, from my experience writing for big data sets
Instead of committing after every iteration, rather go for BULK COMMITS
We used to commit together after 30,000 iterations over result set.

Related

Processing large amount of data from PostgreSQL

I am looking for a way how to process a large amount of data that are loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M of rows) and then process them in Java. The processing itself is not the problem but fetching the data from the database is. The fetching generally takes from 1-2 minutes. However, I need it to be much faster than that. I am loading the data from db straight to DTO using following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Where id is primary key, id_post and id_comment are foreign keys to respective tables and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were
Ditch hibernate for this query and try to use plain jdbc connection hoping that it will be faster.
Completely rewrite the processing algorithm from Java to SQL procedure.
Did I miss something or these are my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"

Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
}
This will still remember every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you might detach the object from the entity manager after you're done. This can only be done if the objects are not modified. If they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
entityManager.detach(myObject);
}

If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and set some large fetch size (of the order of thousands) or else postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)

Since you asked for ideas, I have seen this problem being resolved in below options depending on how it fits in your environment:
1) First try with JDBC and Java, simple code and you can do a test run on your database and data to see if this improvement is enough. You will here need to compromise on the other benefits of Hibernate.
2) In point 1, use Multi-threading with multiple connections pulling data to one queue and then you can use that queue to process further or print as you need. you may consider Kafka also.
3) If data is going to further keep on increasing you can consider Spark as the latest technology which can make it all in memory and will be much more faster.
These are some of the options, please like if these ideas help you anywhere.

Why do you 30M keep in memory ??
it's better to rewrite it to pure sql and use pagination based on id
you will be sent 5 as the id of the last comment and you will issue
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 limit 20
if you need to update the entire table then you need to put the task in the cron but also there to process it in parts
the memory of the road and downloading 30M is very expensive - you need to process parts 0-20 20-n n+20

HibernateTemplate is taking too much time to execute queries

I am using HibernateTemplate with Oracle Database and while executing simple queries it is taking too much time.
String queryString = "from document as doc where doc.name=?";
return getHibernateTemplate().find(queryString, "cloud");
This simple query which fetches 200 records taking 8-10 seconds.

One first step you can take to solve this issue is to gather more information by setting "hibernate.show_sql" to "true" in your configuration files in order to see exactly what SQL is generated. This will let you see and test the generated queries to isolate the source of the problem.
My best guess without more information is that this statement is triggering eager fetching for a large number of records. Overuse of eager fetching is a common mistake that can significantly slow down Hibernate applications. Hibernate's eager fetching can be very inefficient, retrieving records one at a time and running large numbers of queries against the database.

Not able to run select query after setting TTL in cassandra

I have records already in cassandra DB,Using Java Class I am retrieving each row , updating with TTL and storing them back to Cassandra DB. after that if I run select query its executing and showing records. but when the TTL time was complete, If I run select query it has to show zero records but its not running select query showing Cassandra Failure during read query at consistency ONE error. For other tables select query working properly but for that table(to which rows I applied TTL) not working.

You are using common anti-patterns.
1) You are using batches to load data into two single tables, separately. I don't know if you already own a cluster or you're on your local machine, but this is not the way you load data to a C* cluster, and you are going to stress a lot your C* cluster. You should use batches only when you need to keep two or more tables in sync, and not to load a bunch of records at time. I suggest you the following readings on the topic:
DataStax documentation on BATCH
Ryan Svihla Blog
2) You are using synchronous writes to insert your pretty indipendent records into your cluster. You should use asynchronous writes to speed up your data processing.
DataStax Java Drive Async Queries
3) You are using the TTL features in your tables, which per se are not that bad. However, an expired TTL is a tombstone, and that means when you SELECT your query C* will have to read all those tombstones.
4) You bind your prepared statement multiple time:
BoundStatement bound = phonePrepared.bind(macAddress, ...
and that should be
BoundStatement bound = new BoundStatement(phonePrepared).bind(macAddress, ...
in order to use different bound statements. This is not an anti-pattern, this is a problem with your code.
Now, if you run your program multiple times, your tables have a lot of tombstones due to the TTL features, and that means C* is trying hard to read all these in order to find what you wrote "the last time" you successfully run, and it takes so long that the queries times-out.
Just for fun, you can try to increase your timeouts, say 2 minutes, in the SELECT and take a coffee, and in the meantime C* will get your records back.
I don't know what you are trying to achieve, but fast TTLs are your enemies. If you just wanted to refresh your records then try to keep TTLs time high enough so that it doesn't hurt your performances. Or, a probably better solution is to add a new column EXPIRED, "manually" written only when you need to delete a record instead. That depends on your requirements.

PreparedStatement.addBatch and thousands of rows from a file and a confusion

Hi I am trying to write to Sybase IQ using JDBC from a file which contains thousands of rows. People say that I should use batchUpdate. So I am reading file by NIO and adding it to PreparedStatement batches. But I dont see any advantage here for all the rows I need to do the following
PreparedStatement prepStmt = con.prepareStatement(
"UPDATE DEPT SET MGRNO=? WHERE DEPTNO=?");
prepStmt.setString(1,mgrnum1);
prepStmt.setString(2,deptnum1);
prepStmt.addBatch();
I dont understand what is the advantage of batches. I have to anyhow execute addBatch for thousands of time for all the records of file. Or Should I even be using addBatch() to write records from a file to sybase iq. Please guide. Thanks a lot.

With batch updates, basically, you're cutting down on your Network I/O overhead. It's providing the benefits analogous to what a BufferedWriter provides you while writing to the disk. That's basically what this is: buffering of database updates.
Any kind of I/O has a cost; be it disk I/O or network. By buffering your inserts or updates in a batch and doing a bulk update you're minimizing the performance hit incurred every time you hit the database and come back.
The performance hit becomes even more obvious in case of a real world application where the database server is almost always under some load serving other clients as opposed to development where you're the only one.
When paired with a PreparedStatement the bulk updates are even more efficient because the Statement is pre-compiled and the execution plan is cached as well throughout the execution of the batch. So, the binding of variables happen as per your chosen batch size and then a single batchUpdate() call persists all the values in one go.

The advantage of addBatch is that it allows the jdbc driver to write chunks of data instead of sending single insert statements to the database.
This can be faster in certain situations, but real life performance may vary.
It should also be noticed that it's recommended to use batches of 50-100 rows, instead of adding all the data into a single batch.

spring jdbc RowCallbackHandler nightmare

I'm having trouble retrieving data from my database using Spring Jdbc. Here's my issue:
I have a getData() method on my DAO which is supposed to return ONE row from the result of some select statement. When invoked again, the getData() method should return the second row in a FIFO-like manner. I'm aiming for having only one result in memory at a time, since my table will get potentially huge in the future and bringing everything to memory would be a disaster.
If I were using regular jdbc code with a result set I could set its fetch size to 1 and everything would be fine. However I recently found out that Spring Jdbc operations via the JdbcTemplate object don't allow me to achieve such a behaviour (as far as I know... I'm not really knowledgeable about the Spring framework's features). I've heard of the RowCallbackHandler interface, and this post in the java ranch said I could somehow expose the result set to be used later (though using this method it stores the result set as many times over as there are rows, which is pretty dumb).
I have been playing with implementing the RowCallbackHandler interface for a day now and I still can't find a way to get it to retrieve one row from my select at a time. If anyone could enlighten me in this matter i'd greatly appreciate it.

JdbcTemplate.setFetchSize(int fetchSize):
Set the fetch size for this JdbcTemplate. This is important for processing large result sets: Setting this higher than the default value will increase processing speed at the cost of memory consumption; setting this lower can avoid transferring row data that will never be read by the application.
Default is 0, indicating to use the JDBC driver's default.

After a lot of searching and consulting with the rest of my team, we have come to the conclusion that this is not the best implementation path for our project. As Boris suggested, a different approach is the way to go. However, I'm doing something different and using SimpleJdbcTemplate instead and splitting my query so it'll fit in memory better. A "status" field in my records table will be responsbile for telling if the record was successfully processed or read, so i know what records to fetch next.
The question if Spring Jdbc is capable of the behaviour i mentioned in my OP is, however, still in the air. If anyone has an answer for that question I'm sure it would help someone else out there.
Cheers!

You can take a different approach. Create a query which will return just IDs of rows that you want to read. Keep this collection of IDs in memory. You really need to have huge data set to consume a lot of memory. Iterate over it and load one by one row referenced by its ID.

We have the same issue:
- Test fetching fetchSize records in raw jdbc Preparestatement works well: when stop Db after fetching a fetchSize of records, the error throw is Jdbc Connection when the resultset.next() get run.
- Test fetchSize with JdbcTemplate:
PreparedStatementSetter preparedStatementSetter = ps -> { ps.setFetchSize(_exportParams.getFetchSize()); };
RowCallbackHandler rowCallbackHandler = _rs -> { //do st here}
this.jdbcTemplate.query(_exportParams.getSqlscript(), preparedStatementSetter, rowCallbackHandler);
After getting first record, we stop the Postgres. The callback record handler can still handle the rest of records without error.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.