Storing result set for later fetch - java

I have some queries that run for a quite long (20-30 minutes). If a lot of queries are started simultaneously, connection pool is drained quickly.
Is it possible to wrap the long-running query into a statement (procedure) that will store the result of a generic query into a temp table, terminanting the connection, and fetchin (polling) the results later on demand?
EDIT: queries and data stuctures are optimized, and tips like 'check your indices and execution plan' don't work for me. I'm looking for a way to store [maybe a] byte presentation of a generic result set, for later retreive.

First of all, 20-30 minutes is an extremely long time for a query - are you sure you aren't missing any indexes for the query? Do check your execution plan - you could get a huge performance gain from a well-placed index.
In MySQL, you could do
INSERT INTO `cached_result_table` (
SELECT your_query_here
)
(of course, cached_result_table needs to have the exact same column structure as your SELECT returns, otherwise you'll get an error).
Then, you could query these cached results (instead of the original tables), and only run the above query from time to time - to update the cached_result_table.
Of course, the query will need to run at least once initially, which will take the 20-30 minutes you mentioned. I suggest to pre-populate the cached table before the data are requested, and keep some locking mechanism to prevent the update query to run several times simultaneously. Pseudocode:
init:
insert select your_big_query
work:
if your_big_query cached table is empty or nearing expiration:
refresh in the background:
check flag to see if there's another "refresh" process running
if yes
end // don't run two your_big_queries at the same time
else
set flag
re-run your_big_query, save to cached table
clear flag
serve data to clients always from cached table

An easy way to do that in Oracle is "CREATE TABLE sometempname AS SELECT...". That will create a new table using the result columns from the select.

Not quite sure what you are requesting.
Currently you have 50 database sessions. Say you get 40 running long-running queries, that leaves 10 to service the rest.
What you seem to be asking for is, you want those 40 queries asynchronously (running in the background) not clogging up the connection pool of 50. The question is, do you want those 40 running concurrently with (potentially) another 50 queries from the connection pool, or do you want them queued up in some way ?
Queuing can be done (look into DBMS_SCHEDULER and DBMS_JOB). But you will need to deliver those results into some other table and know how to deliver that result set. The old fashioned way is simply to generate reports on request that get delivered to a directory on a shared drive or by email. Could be PDF or CSV or Excel.
If you want the 40 running concurrently alongside the 50 'connection pool' settings, then you may be best off setting up a separate connection pool for the long-running queries.
You can look into Resource Manager for terminating calls that take too long or too many resources. That way the quickie pool can't get bogged down in long running requests.

The most generic approach in Oracle I can think of is creating a stored procedure that will convert a result set into XML, and store it as CLOB XMLType in a table with the results of your long-running queries.
You can find more on generation XMLs from a generic result sets here.
SQL> select dbms_xmlgen.getxml('select employee_id, first_name,
2 last_name, phone_number from employees where rownum < 6') xml
3 from dual

Related

How to efficiently export/import database data with JDBC

I have a JAVA application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later on in a different instance of the application. The size of the DB is pretty big so there are many rows in there. The export and import process has to be done from inside the java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of the size of it and secondly, executing a lot if inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process or there is no "elegant" solution?
Why not do the export/import in a single step with batching (for performance) and chunking (to avoid errors and provide a checkpoint where to start off after a failure).
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk-size, which might need tuning for the specific target database (big enough to speed things up but small enough to make the chunk not exceed some database limit and create failures).
As already proposed, each of this chunk should then be executed in a transaction and your application should remember which chunk it successfully executed last in case some error occurs so it can continue at the next run there.
For the chunks itself, you really should use LIMIT OFFSET .
This way, you can repeat any chunk at any time, each chunk by itself is atomic and it should perform much better than with single row statements.
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm

Processing large amount of data from PostgreSQL

I am looking for a way how to process a large amount of data that are loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M of rows) and then process them in Java. The processing itself is not the problem but fetching the data from the database is. The fetching generally takes from 1-2 minutes. However, I need it to be much faster than that. I am loading the data from db straight to DTO using following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Where id is primary key, id_post and id_comment are foreign keys to respective tables and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were
Ditch hibernate for this query and try to use plain jdbc connection hoping that it will be faster.
Completely rewrite the processing algorithm from Java to SQL procedure.
Did I miss something or these are my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"
Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
}
This will still remember every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you might detach the object from the entity manager after you're done. This can only be done if the objects are not modified. If they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
entityManager.detach(myObject);
}
If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and set some large fetch size (of the order of thousands) or else postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)
Since you asked for ideas, I have seen this problem being resolved in below options depending on how it fits in your environment:
1) First try with JDBC and Java, simple code and you can do a test run on your database and data to see if this improvement is enough. You will here need to compromise on the other benefits of Hibernate.
2) In point 1, use Multi-threading with multiple connections pulling data to one queue and then you can use that queue to process further or print as you need. you may consider Kafka also.
3) If data is going to further keep on increasing you can consider Spark as the latest technology which can make it all in memory and will be much more faster.
These are some of the options, please like if these ideas help you anywhere.
Why do you 30M keep in memory ??
it's better to rewrite it to pure sql and use pagination based on id
you will be sent 5 as the id of the last comment and you will issue
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 limit 20
if you need to update the entire table then you need to put the task in the cron but also there to process it in parts
the memory of the road and downloading 30M is very expensive - you need to process parts 0-20 20-n n+20

Not able to run select query after setting TTL in cassandra

I have records already in cassandra DB,Using Java Class I am retrieving each row , updating with TTL and storing them back to Cassandra DB. after that if I run select query its executing and showing records. but when the TTL time was complete, If I run select query it has to show zero records but its not running select query showing Cassandra Failure during read query at consistency ONE error. For other tables select query working properly but for that table(to which rows I applied TTL) not working.
You are using common anti-patterns.
1) You are using batches to load data into two single tables, separately. I don't know if you already own a cluster or you're on your local machine, but this is not the way you load data to a C* cluster, and you are going to stress a lot your C* cluster. You should use batches only when you need to keep two or more tables in sync, and not to load a bunch of records at time. I suggest you the following readings on the topic:
DataStax documentation on BATCH
Ryan Svihla Blog
2) You are using synchronous writes to insert your pretty indipendent records into your cluster. You should use asynchronous writes to speed up your data processing.
DataStax Java Drive Async Queries
3) You are using the TTL features in your tables, which per se are not that bad. However, an expired TTL is a tombstone, and that means when you SELECT your query C* will have to read all those tombstones.
4) You bind your prepared statement multiple time:
BoundStatement bound = phonePrepared.bind(macAddress, ...
and that should be
BoundStatement bound = new BoundStatement(phonePrepared).bind(macAddress, ...
in order to use different bound statements. This is not an anti-pattern, this is a problem with your code.
Now, if you run your program multiple times, your tables have a lot of tombstones due to the TTL features, and that means C* is trying hard to read all these in order to find what you wrote "the last time" you successfully run, and it takes so long that the queries times-out.
Just for fun, you can try to increase your timeouts, say 2 minutes, in the SELECT and take a coffee, and in the meantime C* will get your records back.
I don't know what you are trying to achieve, but fast TTLs are your enemies. If you just wanted to refresh your records then try to keep TTLs time high enough so that it doesn't hurt your performances. Or, a probably better solution is to add a new column EXPIRED, "manually" written only when you need to delete a record instead. That depends on your requirements.

concurrently reading large datasets from JDBC (MySQL) [duplicate]

I want to create a external application which will query one table from a large Oracle database.
The query will run daily and I am expecting to handle 30,000+ rows.
To break down the size of these rows, I would like to create a new thread/ process for each 10,000 rows that exist. So going by the above figure it would be 3 threads to process all those rows.
I don't want each thread to overlap each others row set so I know I will need to add a column within the table to act as a range marker, a row_position
Logic
Get row_count of data set in query parameters
Get first_row_pos
While (row_count > 10,000)
{
Create thread with 10,000 rows starting from first_row_pos
row_count == row_count - 10,000
first_row_pos = first_row_pos + 10,000
}
create thread for remaining rows
all threads run their queries concurrently.
This is basic logic at the moment, however I do not know how feasible this is.
Is this a good way or is there a better way?
Can this be done through one database connection with each thread sharing or is it better to have a seperate db connection for each thread?
Any other advice welcome?
Note: I just realised a do while loop would be better if there is less than 10,000 rows in this case.
Thanks
Oralce provide a parallel hint for sutuations such as this where you have a full table scan or similar problem and want to make use of multiple cores to divide the workload. Further details here.
The syntax is very simple, you specify the table (or alias) and the number of cores (I usually leave as default) e.g.:
select /*+ parallel(a, default) */ *
from table_a a
You can also use this with multiple tables e.g.
select /*+ parallel(a, default) parallel(b,default) */ *
from table_a a, table_b b
where a.some_id = b.some_id
A database connection is not thread-safe, so if you are going to query the database from several threads, you would have to have a separate connection for each of them. You can either create a connection or get them from a pool.
Before you implement your approach, take some time to analyze where is the time spent. Oracle overall is pretty good with utilizing multiple cores. And the database interaction is usually is the most time-consuming part. By splitting the query in three you might actually slow things down.
If indeed your application is spending most of the time performing calculations on that data, your best approach might be loading all data in a single thread and then splitting processing into multiple threads.

How to implement several threads in Java for downloading a single table data?

How can I implement several threads with multiple/same connection(s), so that a single large table data can be downloaded in quick time.
Actually in my application, I am downloading a table having 12 lacs (1 lac = 100,000) records which takes atleast 4 hrs to download in normal connection speed and more hrs with slow connection.
So there is a need to implement several threads in Java for downloading a single table data with multiple/same connection(s) object. But no idea how to do this.
How to position a record pointer in several threads then how to add all thread records into a single large file??
Thanks in Advance
First of all, is it not advisable to fetch and download such a huge data onto the client. If you need the data for display purposes then you dont need more records that fit into your screen. You can paginate the data and fetch one page at a time. If you are fetching it and processsing in your memory then you sure would run out of memory on your client.
If at all you need to do this irrespective of the suggestion, then you can spawn multiple threads with separate connections to the database where each thread will pull a fraction of data (1 to many pages). If you have say 100K records and 100 threads available then each thread can pull 1K of records. It is again not advisable to have 100 threads with 100 open connections to the DB. This is just an example. Limit the no number of threads to some optimal value and also limit the number of records each thread is pulling. You can limit the number of records pulled from the DB on the basis of rownum.
As Vikas pointed out, if you're downloading a gigabytes of data to the client-side, you're doing something really really wrong, as he had said you should never need to download more records that can fit into your screen. If however, you only need to do this occasionally for database duplication or backup purpose, just use the database export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory" then you could try processing one row at a time somehow (or a batch of rows), then process the next batch, etc. Thus avoiding loading an entire table into memory (but still single thread so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
Second way: pagination, basically every thread somehow knows what chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something start with XXX limit 10". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads incremented by the calling thread].
The problem is the "order by something" it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of a table. If it's indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where column is nil limit 10 followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries or lock the table and/or rows appropriately. Something like that to avoid possible collision (postgres for instance requires special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then given you know the approximate size of the database, you can peel off a fraction of it like, if 100 and you want 100 batches "where x < 0.01 and X >= 0.02" or the like. (Idea inspired by how wikipedia is able to get a "random" page--assigns each row a random float at insert time).
The thing you really want to avoid is some kind of change in sort order half way through. For instance if you don't specify a sort order, and just query like this select * from table_name start by XXX limit 10 from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns you rows half way through [for instance, if new data is added] meaning you may skip rows or what not.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data, not only to acquire the data in multiple threads, but also to further process that data in multiple threads. Such approaches do not always call for all data to be accumulated in memory, it can be processed in groups and/or sliding windows, and only need to either accumulate a result, or pass the data further on (other permanent storage).
To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw textual, this could be a random sizer cut somewhere in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied on your query to allow paging. This could be something like:
Driver Program: Split my data in for parts, and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.

Categories