Processing large amount of data from PostgreSQL - java

I am looking for a way to process a large amount of data loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M rows) and then process it in Java. The processing itself is not the problem, but fetching the data from the database is. The fetching generally takes 1-2 minutes. However, I need it to be much faster than that. I am loading the data from the db straight into a DTO using the following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Here id is the primary key, id_post and id_comment are foreign keys to the respective tables, and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were:
Ditch Hibernate for this query and try a plain JDBC connection, hoping that it will be faster.
Completely rewrite the processing algorithm from Java to an SQL procedure.
Did I miss something, or are these my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"

Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while (sr.next()) {
    MyClass myObject = (MyClass) sr.get()[0];
    // ... process row for myObject ...
}
Hibernate will still keep every loaded object in the entity manager, so this gets progressively slower. To avoid that issue, you can detach each object from the entity manager once you are done with it. This is only safe if the objects are not modified: if they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while (sr.next()) {
    MyClass myObject = (MyClass) sr.get()[0];
    // ... process row for myObject ...
    entityManager.detach(myObject);
}

If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and to set a large fetch size (on the order of thousands), or else the PostgreSQL JDBC driver will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)
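A minimal sketch of the JDBC approach described above, using the query from the question. The class name `PostCommentStreamer`, the `RowHandler` callback and the column types (the three id columns read as `long`, the two small int columns as `short`) are assumptions made for illustration. With the PostgreSQL driver, cursor-based fetching is only used when autocommit is off and a positive fetch size is set on a forward-only statement.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class PostCommentStreamer {

    /** Hypothetical callback that receives one row at a time. */
    interface RowHandler {
        void handle(long id, long idPost, long idComment, short colA, short colB);
    }

    /**
     * Streams post_comment rows in chunks of fetchSize instead of loading them all at once.
     * Returns the number of rows passed to the handler.
     */
    static long streamAll(Connection conn, int fetchSize, RowHandler handler)
            throws SQLException {
        // Autocommit must be off, otherwise the driver buffers the whole result set.
        conn.setAutoCommit(false);
        String sql = "select id, id_post, id_comment, col_a, col_b from post_comment";
        long count = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(fetchSize); // rows fetched per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs.getLong(1), rs.getLong(2), rs.getLong(3),
                            rs.getShort(4), rs.getShort(5));
                    count++;
                }
            }
        }
        conn.commit(); // end the read-only transaction
        return count;
    }
}
```

A call might look like `PostCommentStreamer.streamAll(connection, 10_000, (id, idPost, idComment, a, b) -> process(id, a, b))`, where `process` stands for your own processing code. The best fetch size depends on row width and available memory, so it is worth measuring a few values.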

Since you asked for ideas, I have seen this problem solved with the options below, depending on how they fit your environment:
1) First try plain JDBC and Java. The code is simple, and you can do a test run on your database and data to see if this improvement is enough. You will have to give up the other benefits of Hibernate here.
2) Building on point 1, use multi-threading with multiple connections pulling data into one queue, and then use that queue to process the data further or print it as you need. You may also consider Kafka.
3) If the data is going to keep growing, you can consider Spark, a newer technology that can do it all in memory and will be much faster.
These are some of the options; please like if these ideas help you anywhere.
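Option 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: `QueuePipeline`, the id ranges and the `DONE` sentinel are hypothetical, and the producers generate ids in memory where real code would run a range query on a dedicated JDBC connection.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.LongConsumer;

class QueuePipeline {
    // Sentinel a producer puts on the queue when its range is exhausted.
    private static final long DONE = Long.MIN_VALUE;

    /**
     * Starts one producer per id range [from, to) and consumes every id on the calling thread.
     * Returns the number of ids handed to the processor.
     */
    static long run(List<long[]> ranges, LongConsumer processor) throws InterruptedException {
        if (ranges.isEmpty()) {
            return 0;
        }
        // Bounded queue: producers block when the consumer falls behind.
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1_000);
        ExecutorService producers = Executors.newFixedThreadPool(ranges.size());
        for (long[] range : ranges) {
            producers.submit(() -> {
                try {
                    // Stand-in for "select ... where id >= ? and id < ?" on its own connection.
                    for (long id = range[0]; id < range[1]; id++) {
                        queue.put(id);
                    }
                    queue.put(DONE);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int finished = 0;
        long processed = 0;
        while (finished < ranges.size()) {
            long id = queue.take();
            if (id == DONE) {
                finished++; // one producer has delivered its whole range
            } else {
                processor.accept(id);
                processed++;
            }
        }
        producers.shutdown();
        return processed;
    }
}
```

Whether this helps depends on where the time goes: if the database or the network is the bottleneck, extra reader threads may not make the fetch any faster.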

Why do you keep 30M rows in memory?
It's better to rewrite it in pure SQL and use pagination based on id.
For example, you are sent 5 as the id of the last comment already seen, and you issue
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 order by id limit 20
If you need to update the entire table, put the task in cron, but process it in parts there as well.
Holding 30M rows in memory and downloading them in one go is very expensive; you need to process them in parts: 0-20, 20-n, n+20.
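The id-based pagination loop can be sketched like this. `KeysetPager`, `Row` and `PageFetcher` are hypothetical names; in real code the fetcher would run a query of the form `select ... from post_comment where id > ? order by id limit ?`, and the sketch assumes ids are positive.

```java
import java.util.List;
import java.util.function.Consumer;

class KeysetPager {

    /** Hypothetical row holder; only the id matters for paging. */
    static final class Row {
        final long id;

        Row(long id) {
            this.id = id;
        }
    }

    /** Returns up to limit rows with id greater than afterId, ordered by id. */
    interface PageFetcher {
        List<Row> fetch(long afterId, int limit);
    }

    /** Processes the table page by page and returns the number of rows seen. */
    static long processAll(PageFetcher fetcher, int pageSize, Consumer<Row> processor) {
        long lastId = 0; // assumption: ids start above 0
        long total = 0;
        while (true) {
            List<Row> page = fetcher.fetch(lastId, pageSize);
            for (Row row : page) {
                processor.accept(row);
            }
            total += page.size();
            if (page.size() < pageSize) {
                return total; // a short page means the table is exhausted
            }
            // Continue after the last id of this page instead of using OFFSET.
            lastId = page.get(page.size() - 1).id;
        }
    }
}
```

Only one page is held in memory at a time, and each query can use the primary key index to find its starting point.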

Related

How to get data from Oracle table into java application concurrently

I have an Oracle table with ~10 million records that are not dependent on each other. An existing Java application executes the query and iterates through the returned Iterator, batching the records for further processing. The fetchSize is set to 250.
Is there any way to parallelize getting the data from the Oracle DB? One thing that comes to mind is to break down the query into chunks using "rowid" and then pass these chunks to separate threads.
I am wondering if there is some kind of standard approach in solving this issue.
A few approaches to achieve it:
alter session force parallel QUERY parallel 32; execute this at the DB level in PL/SQL code just before the SELECT statement runs. You can adjust the value 32 depending on the number of nodes (RAC setup).
The approach you describe, based on ROWID, also works, but the difficult part is how you return the chunks of the SELECT query to Java and how you combine the results. So this approach is a bit difficult.
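One way the "return the chunks and combine the results" part might look in Java is sketched below. Note the substitution: it splits a numeric key range rather than ROWID ranges, and `ChunkedReader` and `ChunkLoader` are hypothetical names. A real loader would run a range query on its own connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedReader {

    /** Hypothetical loader for one key range [from, to). */
    interface ChunkLoader<T> {
        List<T> load(long from, long to) throws Exception;
    }

    /** Loads [minKey, maxKeyExclusive) in parallel chunks and returns the rows in key-range order. */
    static <T> List<T> readAll(long minKey, long maxKeyExclusive, long chunkSize,
                               int threads, ChunkLoader<T> loader)
            throws InterruptedException, ExecutionException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (long from = minKey; from < maxKeyExclusive; from += chunkSize) {
                final long lo = from;
                final long hi = Math.min(from + chunkSize, maxKeyExclusive);
                futures.add(pool.submit(() -> loader.load(lo, hi)));
            }
            List<T> combined = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                // Futures are read in submission order, so chunk order is preserved.
                combined.addAll(future.get());
            }
            return combined;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the records are processed independently, as in the question, each chunk could also be handed straight to the processing step instead of being combined into one list.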

Fast streaming batch data from mssql database

I need to read each row from a complex query in SQL server database using Hibernate and write the result to a file. But the query can return millions of records so it seemed that the following code was appropriate:
Session unwrap = entityManager.unwrap(Session.class);
NativeQuery nativeQuery =
    unwrap.createNativeQuery("the sql query string read from a file");
nativeQuery.setFlushMode(FlushMode.MANUAL);
nativeQuery.addEntity("C", CustomObject.class);
nativeQuery.setFetchSize(100000);
nativeQuery.setReadOnly(true);
ScrollableResults scroll = nativeQuery.scroll(ScrollMode.FORWARD_ONLY);
while (scroll.next()) {
    CustomObject customObject = (CustomObject) scroll.get(0);
    // using the JsonGenerator library https://fasterxml.github.io/jackson-core/javadoc/2.6/com/fasterxml/jackson/core/JsonGenerator.html
    jsonGenerator.writeObject(customObject);
    unwrap.evict(customObject); // evict the entity that was just written
}
Currently, this code takes approximately 3-4 days to write around 1 million records to the file, which is too slow. I am using the mssql-jdbc driver with hibernate and I assume that the fetch size might be ignored by the driver, but changing the driver is not an option for me since the other drivers do not support the bulk copy functionality.
The problem is that hibernate is probably making a connection to fetch each row separately from the database, resulting in expensive network calls.
I have tried setting adaptive buffering, enabled cursors, setting the connection auto commit mode to false and other things, but nothing seemed to make this faster.
I would like to make this faster and would appreciate any help.
I had a similar issue!
The data set was too big; this was in a project that involved a bank migration.
Solution adopted: we used PL/SQL instead of a Java batch. They are always faster.
Another thought I would like to add, from my experience of writing big data sets:
Instead of committing after every iteration, go for bulk commits.
We used to commit once every 30,000 iterations over the result set.
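If the loop stays in Java, the "commit every N iterations" idea might be sketched as follows. `BatchCommitter` is a hypothetical helper and the commit action is passed in as a `Runnable`; with JDBC it would wrap `connection.commit()` on a connection that has autocommit turned off.

```java
class BatchCommitter {
    private final int batchSize;
    private final Runnable commitAction; // e.g. a wrapper around connection.commit()
    private int pending = 0;
    private int commits = 0;

    BatchCommitter(int batchSize, Runnable commitAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.commitAction = commitAction;
    }

    /** Call once per processed row; commits when a full batch has accumulated. */
    void recordProcessed() {
        pending++;
        if (pending >= batchSize) {
            flush();
        }
    }

    /** Commits whatever is pending; call once more after the loop for the last partial batch. */
    void flush() {
        if (pending > 0) {
            commitAction.run();
            commits++;
            pending = 0;
        }
    }

    int commits() {
        return commits;
    }
}
```

With a batch size of 30,000, a loop over 70,000 rows commits twice inside the loop and once more in the final `flush()`.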

Not able to run select query after setting TTL in cassandra

I already have records in a Cassandra DB. Using a Java class, I retrieve each row, update it with a TTL, and store it back in Cassandra. If I run a select query right after that, it executes and shows the records. Once the TTL has expired, the select query should return zero records; instead it fails with the error "Cassandra Failure during read query at consistency ONE". Select queries on other tables work properly; only the table whose rows I applied the TTL to fails.
You are using common anti-patterns.
1) You are using batches to load data into two single tables, separately. I don't know if you already own a cluster or you're on your local machine, but this is not the way you load data into a C* cluster, and you are going to put a lot of stress on it. You should use batches only when you need to keep two or more tables in sync, not to load a bunch of records at a time. I suggest the following reading on the topic:
DataStax documentation on BATCH
Ryan Svihla Blog
2) You are using synchronous writes to insert your pretty independent records into your cluster. You should use asynchronous writes to speed up your data processing.
DataStax Java Driver Async Queries
3) You are using the TTL feature in your tables, which per se is not that bad. However, an expired TTL is a tombstone, and that means that when you run your SELECT query, C* will have to read all those tombstones.
4) You bind your prepared statement multiple times:
BoundStatement bound = phonePrepared.bind(macAddress, ...
and that should be
BoundStatement bound = new BoundStatement(phonePrepared).bind(macAddress, ...
in order to use different bound statements. This is not an anti-pattern, this is a problem with your code.
Now, if you run your program multiple times, your tables have a lot of tombstones due to the TTL feature, and that means C* is trying hard to read all of them in order to find what you wrote "the last time" you ran successfully, and it takes so long that the queries time out.
Just for fun, you can try to increase your timeouts, say to 2 minutes, in the SELECT and have a coffee; in the meantime C* will get your records back.
I don't know what you are trying to achieve, but short TTLs are your enemy. If you just want to refresh your records, then try to keep TTLs high enough that they don't hurt your performance. Or, a probably better solution is to add a new column EXPIRED, "manually" written only when you need to delete a record. That depends on your requirements.

How to use Bulk API with WHERE clause in Salesforce

I want to use Bulk API of Salesforce to run queries of this format.
Select Id from Object where field='<value>'.
I have thousands of such field values and want to retrieve Id of those objects. AFAIK, Bulk query of Salesforce supports only one SOQL statement as input.
One option could be to form a query like
Select Id, field from Object where field in (<all field values>)
but the problem is that SOQL has a 10000-character limit.
Any suggestions here?
Thanks
It seems like you are attempting to perform some kind of search query. If so you might look into using a SOSL query as opposed to SOQL as long as the fields you are searching are indexed by SFDC.
Otherwise, I agree with Born2BeMild. Your second approach is better and breaking up your list of values into batches would help get around the limits.
It would also help if you described your use case in a bit more detail. Typically, queries on a dynamic set of fields and values don't always yield the best performance, even with the bulk API. You are almost better off downloading the data to a local database and exploring the data that way.
You could break those down into batches of 200 or so values and iteratively query Salesforce to build up a result set in memory or process subsets of the data.
You would have to check the governor limits for the maximum number of SOQL queries though. You should be able to track your usage via the API at runtime to avoid going over the maximum.
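Splitting the values into batches of 200 and building one query per batch might look like the sketch below. `SoqlBatcher` is a hypothetical helper, the object and field names are passed in by the caller, and the escaping covers only backslashes and single quotes, so treat it as an assumption to check against the SOQL documentation before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

class SoqlBatcher {

    /** Builds one "WHERE field IN (...)" query per batch of at most batchSize values. */
    static List<String> buildQueries(String objectName, String fieldName,
                                     List<String> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < values.size(); start += batchSize) {
            int end = Math.min(start + batchSize, values.size());
            StringBuilder inList = new StringBuilder();
            for (String value : values.subList(start, end)) {
                if (inList.length() > 0) {
                    inList.append(',');
                }
                inList.append('\'').append(escape(value)).append('\'');
            }
            queries.add("SELECT Id, " + fieldName + " FROM " + objectName
                    + " WHERE " + fieldName + " IN (" + inList + ")");
        }
        return queries;
    }

    /** Escapes backslashes and single quotes inside a string literal. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }
}
```

Each generated query can then be checked against the character limit and submitted separately, with the results merged on the client side.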
The problem is that you are hitting the governor limits. Salesforce can only process 200 records at a time if they are coming from a database. Therefore, to be able to work with all these records, you first need to add all the records to a list, for example:
List<Account> accounts = [SELECT Id, Name FROM Account];
Then you can work with the list accounts and do everything you need to do with it. When you are done, you can update the database using:
Update accounts;
this link might be helpful:
https://help.salesforce.com/apex/HTViewSolution?id=000004410&language=en_US

Storing result set for later fetch

I have some queries that run for quite a long time (20-30 minutes). If a lot of queries are started simultaneously, the connection pool is drained quickly.
Is it possible to wrap the long-running query into a statement (procedure) that will store the result of a generic query in a temp table, terminating the connection, and fetching (polling) the results later on demand?
EDIT: queries and data structures are optimized, and tips like 'check your indices and execution plan' don't work for me. I'm looking for a way to store [maybe a] byte representation of a generic result set for later retrieval.
First of all, 20-30 minutes is an extremely long time for a query - are you sure you aren't missing any indexes for the query? Do check your execution plan - you could get a huge performance gain from a well-placed index.
In MySQL, you could do
INSERT INTO `cached_result_table` (
SELECT your_query_here
)
(of course, cached_result_table needs to have the exact same column structure as your SELECT returns, otherwise you'll get an error).
Then, you could query these cached results (instead of the original tables), and only run the above query from time to time - to update the cached_result_table.
Of course, the query will need to run at least once initially, which will take the 20-30 minutes you mentioned. I suggest pre-populating the cached table before the data is requested, and keeping some locking mechanism to prevent the update query from running several times simultaneously. Pseudocode:
init:
    insert select your_big_query
work:
    if your_big_query cached table is empty or nearing expiration:
        refresh in the background:
            check flag to see if there's another "refresh" process running
            if yes
                end // don't run two your_big_queries at the same time
            else
                set flag
                re-run your_big_query, save to cached table
                clear flag
    serve data to clients always from cached table
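The flag check in the pseudocode could be sketched in Java with an `AtomicBoolean`. `CacheRefresher` is a hypothetical class, and the `Runnable` stands for whatever re-runs your_big_query and saves the result to the cached table.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class CacheRefresher {
    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    private final Runnable bigQuery; // hypothetical: re-runs the query and fills the cached table

    CacheRefresher(Runnable bigQuery) {
        this.bigQuery = bigQuery;
    }

    /** Returns true if this call ran the refresh, false if another refresh was already running. */
    boolean refreshIfIdle() {
        // compareAndSet checks and sets the flag in one atomic step.
        if (!refreshing.compareAndSet(false, true)) {
            return false; // don't run two big queries at the same time
        }
        try {
            bigQuery.run();
            return true;
        } finally {
            refreshing.set(false); // clear the flag even if the query fails
        }
    }
}
```

This flag only protects a single JVM. If several application instances can trigger the refresh, the flag would have to live somewhere shared, for example in the database itself.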
An easy way to do that in Oracle is "CREATE TABLE sometempname AS SELECT...". That will create a new table using the result columns from the select.
Not quite sure what you are requesting.
Currently you have 50 database sessions. Say you get 40 running long-running queries, that leaves 10 to service the rest.
What you seem to be asking for is that those 40 queries run asynchronously (in the background) without clogging up the connection pool of 50. The question is, do you want those 40 running concurrently with (potentially) another 50 queries from the connection pool, or do you want them queued up in some way?
Queuing can be done (look into DBMS_SCHEDULER and DBMS_JOB). But you will need to deliver those results into some other table and know how to deliver that result set. The old fashioned way is simply to generate reports on request that get delivered to a directory on a shared drive or by email. Could be PDF or CSV or Excel.
If you want the 40 running concurrently alongside the 50 'connection pool' settings, then you may be best off setting up a separate connection pool for the long-running queries.
You can look into Resource Manager for terminating calls that take too long or too many resources. That way the quickie pool can't get bogged down in long running requests.
The most generic approach in Oracle I can think of is creating a stored procedure that will convert a result set into XML, and store it as CLOB XMLType in a table with the results of your long-running queries.
You can find more on generating XML from generic result sets here.
select dbms_xmlgen.getxml('select employee_id, first_name,
last_name, phone_number from employees where rownum < 6') xml
from dual
