I want to create an external application that will query one table from a large Oracle database.
The query will run daily and I am expecting to handle 30,000+ rows.
To break the work down, I would like to create a new thread/process for every 10,000 rows. Going by the above figure, that would be 3 threads to process all the rows.
I don't want the threads' row sets to overlap, so I know I will need to add a column to the table to act as a range marker, a row_position.
Logic
Get row_count of data set in query parameters
Get first_row_pos
While (row_count > 10,000)
{
Create thread with 10,000 rows starting from first_row_pos
row_count = row_count - 10,000
first_row_pos = first_row_pos + 10,000
}
create thread for remaining rows
all threads run their queries concurrently.
This is basic logic at the moment, however I do not know how feasible this is.
Is this a good way or is there a better way?
Can this be done through one database connection shared by all threads, or is it better to have a separate db connection for each thread?
Any other advice is welcome.
Note: I just realised a do-while loop would be better in case there are fewer than 10,000 rows.
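The loop above, including the case with fewer than 10,000 rows, can be sketched as a small range planner. This is only a sketch: it assumes row_position is a contiguous integer column, and plan_chunks is a hypothetical helper name, not part of any library.

```python
def plan_chunks(first_row_pos, row_count, chunk_size=10_000):
    """Return (start, end) row_position ranges, end exclusive, covering row_count rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = first_row_pos
    stop = first_row_pos + row_count
    while start < stop:
        # The last chunk is simply shorter, so no special "remaining rows" branch is needed.
        end = min(start + chunk_size, stop)
        chunks.append((start, end))
        start = end
    return chunks


# 30,500 rows starting at row_position 1 give three full chunks and one short one.
chunks = plan_chunks(first_row_pos=1, row_count=30_500)
small = plan_chunks(first_row_pos=1, row_count=5_000)
```

Each thread would then run its query with something like "where row_position >= :start and row_position < :end", so no two threads see the same row.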
Thanks
Oracle provides a parallel hint for situations such as this, where you have a full table scan or a similar problem and want to make use of multiple cores to divide the workload. Further details here.
The syntax is very simple: you specify the table (or its alias) and the degree of parallelism (I usually leave it as default), e.g.:
select /*+ parallel(a, default) */ *
from table_a a
You can also use this with multiple tables e.g.
select /*+ parallel(a, default) parallel(b,default) */ *
from table_a a, table_b b
where a.some_id = b.some_id
A database connection is not thread-safe, so if you are going to query the database from several threads, you would have to have a separate connection for each of them. You can either create the connections yourself or get them from a pool.
Before you implement your approach, take some time to analyze where the time is spent. Oracle is generally pretty good at utilizing multiple cores, and the database interaction is usually the most time-consuming part. By splitting the query in three you might actually slow things down.
If indeed your application is spending most of the time performing calculations on that data, your best approach might be loading all data in a single thread and then splitting processing into multiple threads.
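A minimal sketch of that last suggestion: one thread loads everything, then a pool works on slices of the loaded rows. Assumptions: an in-memory SQLite table stands in for the Oracle table so the example runs anywhere, table_a's columns and process_chunk's calculation are made up, and in CPython truly CPU-heavy work would normally go to a process pool rather than threads.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def load_rows(conn):
    # Single query, single thread: the database is read exactly once.
    return conn.execute("SELECT id, amount FROM table_a ORDER BY id").fetchall()


def split(rows, size):
    # Cut the already-loaded rows into slices of at most `size` rows.
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def process_chunk(rows):
    # Placeholder for the real calculation (hypothetical).
    return sum(amount * 2 for _id, amount in rows)


# Stand-in data: 30 rows instead of 30,000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO table_a (id, amount) VALUES (?, ?)",
                 [(i, i) for i in range(1, 31)])
rows = load_rows(conn)
conn.close()

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_chunk, split(rows, 10)))
total = sum(partial_sums)
```

Because the workers never touch the connection, the thread-safety question does not come up at all with this layout.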
Related
I have an Oracle table with ~10 million records that are not dependent on each other. An existing Java application executes the query and iterates through the returned Iterator, batching the records for further processing. The fetchSize is set to 250.
Is there any way to parallelize getting the data from the Oracle DB? One thing that comes to mind is to break down the query into chunks using "rowid" and then pass these chunks to separate threads.
I am wondering if there is some kind of standard approach in solving this issue.
A few approaches to achieve it:
alter session force parallel QUERY parallel 32; execute this at DB level in PL/SQL code just before executing the SELECT statement. You can adjust the value 32 depending on the number of nodes (RAC setup).
The approach you describe, chunking on the basis of ROWID, can work, but the difficult part is how to return the chunks of the SELECT queries to Java and how to combine the results. So this approach is a bit difficult.
Interview question
Say we have an Employee table with 2 million records and we need to cut each employee's salary by 10% (i.e. do some processing) and then save it back to a collection. How can you do it efficiently?
I told him we could use the executor framework to create multiple threads that fetch values from the table, process them, and save them to a list.
Then he asked me how I would check whether a record has already been processed; there I was clueless about how to do that.
I am not even sure whether my approach is any good.
Please help.
One thing that you could do is to use a producer/consumer type model, where you have one thread working to feed the others the records to update. This way you would not have to worry as much about duplicate processing.
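A rough sketch of that producer/consumer model, using Python's standard queue and threading modules. Assumptions: the employee list is generated in memory instead of being read from the table, salaries are whole numbers, and the worker count and queue size are arbitrary.

```python
import queue
import threading

SENTINEL = None  # marks "no more records" for one consumer


def producer(records, work_queue, n_workers):
    # The only thread that hands out records, so each record is handed out once.
    for record in records:
        work_queue.put(record)
    for _ in range(n_workers):
        work_queue.put(SENTINEL)


def consumer(work_queue, results, results_lock):
    while True:
        record = work_queue.get()
        if record is SENTINEL:
            break
        emp_id, salary = record
        new_salary = salary * 9 // 10  # 10% cut on integer salaries
        with results_lock:
            results[emp_id] = new_salary


employees = [(emp_id, 1000 * emp_id) for emp_id in range(1, 101)]
work_queue = queue.Queue(maxsize=20)  # bounded, so the producer cannot run far ahead
results = {}
results_lock = threading.Lock()
n_workers = 4

workers = [threading.Thread(target=consumer, args=(work_queue, results, results_lock))
           for _ in range(n_workers)]
for w in workers:
    w.start()
producer(employees, work_queue, n_workers)
for w in workers:
    w.join()
```

Since a record leaves the queue exactly once, no "already processed?" check is needed in the consumers.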
The best approach given the question as stated is to use pure SQL, something like:
update employees set
salary = salary * .9
It is very hard to imagine needing to do something to employee data that SQL could not handle.
If by some quirk of bad design you really needed to do something to employee type data that SQL absolutely could not do, then you would open a cursor to the rowset and iterate through it, making the update synchronously so you only do one pass over the data.
In pseudo code:
cursor = forUpdate ("select * from employees for update")
while (cursor.next()) {
cursor.salary = cursor.salary * .9
}
This is the simplest and likely fastest executing approach.
---
Regarding logging
It’s only 2M rows, which is a “small” quantity, so most databases could handle it in a single transaction. If not, add a where clause, e.g. where id between <start> and <end>, to the query to chunk the process into loggable amounts if using the shell script approach.
If using the code approach, most databases allow you to commit while holding the cursor open, so just commit every 10K rows or so.
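As an illustration of the chunked variant, here is a sketch that updates one id range at a time and commits after each range. Assumptions: SQLite in memory stands in for the real database, the employees table has an integer id column, and cut_salaries_in_chunks is a hypothetical helper. The chunk size of 10 is only to keep the demo small.

```python
import sqlite3


def cut_salaries_in_chunks(conn, chunk_size):
    """Apply the 10% cut one id range at a time; return the number of commits."""
    low, high = conn.execute("SELECT MIN(id), MAX(id) FROM employees").fetchone()
    if low is None:  # empty table
        return 0
    commits = 0
    start = low
    while start <= high:
        end = start + chunk_size - 1
        conn.execute(
            "UPDATE employees SET salary = salary * 0.9 WHERE id BETWEEN ? AND ?",
            (start, end),
        )
        conn.commit()  # keeps each transaction (and its log) small
        commits += 1
        start = end + 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, salary REAL)")
conn.executemany("INSERT INTO employees (id, salary) VALUES (?, ?)",
                 [(i, 1000.0) for i in range(1, 26)])
conn.commit()

commits = cut_salaries_in_chunks(conn, chunk_size=10)
salaries = [row[0] for row in conn.execute("SELECT salary FROM employees ORDER BY id")]
```

Note that a failure halfway through leaves earlier chunks committed, so a rerun has to know where to resume.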
Regarding locking
Similar considerations apply as for logging. All rows touched by such a query are locked for the duration of the transaction. Given how long it could take to run, pick a quiet time to run it. If it’s really a big deal, chunk it up, but realise that locking is unavoidable.
I would load in this table, then add a column for the state. By default, you could set this column to "Not Processed". Once a thread starts processing this employee it would change the state to "Processing", then when finished it would finally switch it to "Processed".
Having 3 states like this would also allow you to use this as a Lock preventing the processing from happening twice.
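One way the state column could act as that lock is to make the claim a single UPDATE that only succeeds if the row is still "Not Processed", and then check how many rows were changed. A sketch under assumptions: SQLite in memory replaces the real table, and claim/finish are hypothetical helper names.

```python
import sqlite3


def claim(conn, emp_id):
    """Try to move a row from 'Not Processed' to 'Processing'; True if we got it."""
    cur = conn.execute(
        "UPDATE employees SET state = 'Processing' "
        "WHERE id = ? AND state = 'Not Processed'",
        (emp_id,),
    )
    conn.commit()
    # rowcount is 0 when someone else already claimed (or finished) the row.
    return cur.rowcount == 1


def finish(conn, emp_id):
    conn.execute(
        "UPDATE employees SET state = 'Processed' "
        "WHERE id = ? AND state = 'Processing'",
        (emp_id,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, "
    "state TEXT NOT NULL DEFAULT 'Not Processed')"
)
conn.execute("INSERT INTO employees (id) VALUES (1)")
conn.commit()

first_claim = claim(conn, 1)    # this caller gets the row
second_claim = claim(conn, 1)   # a second attempt is refused
finish(conn, 1)
final_state = conn.execute("SELECT state FROM employees WHERE id = 1").fetchone()[0]
```

The check and the state change happen in the same statement, which is the point; a separate SELECT followed by an UPDATE would leave a window for two threads to claim the same row.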
How can I implement several threads with multiple/same connection(s), so that the data of a single large table can be downloaded quickly?
In my application, I am downloading a table with 12 lacs (1 lac = 100,000) records, which takes at least 4 hrs to download at normal connection speed and more hours on a slow connection.
So there is a need to implement several threads in Java for downloading a single table's data with multiple/same connection object(s), but I have no idea how to do this.
How do I position a record pointer in several threads, and then how do I add all the threads' records to a single large file?
Thanks in Advance
First of all, it is not advisable to fetch and download such a huge amount of data onto the client. If you need the data for display purposes, then you don't need more records than fit on your screen. You can paginate the data and fetch one page at a time. If you are fetching it and processing it in memory, then you will surely run out of memory on your client.
If you need to do this regardless of the suggestion above, then you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (one to many pages). If you have, say, 100K records and 100 threads available, then each thread can pull 1K records. Again, it is not advisable to have 100 threads with 100 open connections to the DB; this is just an example. Limit the number of threads to some optimal value and also limit the number of records each thread is pulling. You can limit the number of records pulled from the DB on the basis of rownum.
As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than can fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the database export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory" then you could try processing one row at a time somehow (or a batch of rows), then process the next batch, etc. Thus avoiding loading an entire table into memory (but still single thread so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
Second way: pagination, basically every thread somehow knows what chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something start with XXX limit 10". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads incremented by the calling thread].
The problem is the "order by something" it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of a table. If it's indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10 followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries or lock the table and/or rows appropriately. Something like that to avoid possible collision (postgres for instance requires special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = mod(rownum, 20000) then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, given that you know the approximate size of the table, you can peel off a fraction of it; for instance, if you want 100 batches, "where x >= 0.01 and x < 0.02" or the like. (Idea inspired by how wikipedia is able to get a "random" page--assigns each row a random float at insert time).
The thing you really want to avoid is some kind of change in sort order half way through. For instance if you don't specify a sort order, and just query like this select * from table_name start by XXX limit 10 from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns you rows half way through [for instance, if new data is added] meaning you may skip rows or what not.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data: not only acquiring the data in multiple threads, but also processing that data further in multiple threads. Such approaches do not always call for all data to be accumulated in memory; it can be processed in groups and/or sliding windows, and you only need to either accumulate a result or pass the data further on (to other permanent storage).
To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw text, this could be a cut of random size somewhere in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied to your query to allow paging. This could be something like:
Driver Program: Split my data in four parts, and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.
We have a system where we collect data every second on user activity on multiple web sites. We dump that data into a database X (say MS SQL Server). We now need to fetch data from this single table in database X and insert it into database Y (say MySQL).
We want to fetch time-based data from database X through multiple threads so that we fetch it as fast as we can. Once it is fetched and stored in database Y, we will delete the data from database X.
Are there any best practices for this sort of design? Are there any specific things to take care of in the table design, like sharding or something? Are there any other things that we need to take care of to make sure we fetch it as fast as we can from threads running on multiple machines?
Thanks in advance!
Ravi
If you are moving data from one database to another, you will not gain any advantages by having multiple threads doing the work. It will only increase contention.
If both databases are of the same type, you should be looking into the vendor-specific tools for replication. This will basically always outperform homegrown solutions.
If the databases are different (vendors), you have to decide upon an efficient mechanism for
identifying new/updated/deleted rows (Triggers, range based queries, full dumps)
transporting the data (unload to file & FTP, pull/push from a program)
loading the data on the other database (import, bulk insert)
Without more details, it's impossible to be more specific than that.
Oh, and the two most important considerations that will influence your choice are:
What is the expected data volume?
Longest acceptable delay between row creation in source DB and availability in Target DB
I would test (by measurement) your assumption that multiple slurper threads will speed things up. Without more specifics in your question, it looks like you want to run an ETL (extract, transform, load) process with your database; these are pretty efficient when you let the database-specific technology handle it, especially if you're interested in aggregation etc.
There are two levels of concern in your issue:
The transaction between the two databases:
This is important because you delete data from the source database. You must ensure that data is removed from X only after it has been stored in Y successfully. Conversely, you must ensure that the deletion of data from X succeeds, to prevent re-inserting the same data into Y.
The performance of transferring data:
If database X is an online database with data coming in all the time, it is not good practice to just collect the data, store it to Y, and delete it. Instead, plan a batch size; the program starts a transaction for each batch and runs repeatedly until the number of rows in X is below the batch size.
In both databases, you should add a table that records the batches being processed.
There are three states in processing:
INIT - The start of a batch; this value should be synchronized between the two databases.
COPIED - In database Y, the insertion of data and the update of this status should be in one transaction.
FINISH - In database X, the deletion of data and the update of this status should be in one transaction.
When the program runs, it first checks for batches in the 'INIT' or 'COPIED' state and resumes processing them.
If X has an "INIT" record and Y doesn't, just insert the same INIT record into Y, then perform the insertion into Y.
If a record in Y is "COPIED" and X is "INIT", just update the state of X to "COPIED", then perform the deletion from X.
If a record in X is "FINISH" and the corresponding record in Y is "COPIED", just update the state of Y to "FINISH".
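The three restart rules can be written down as a small decision function. This is a sketch only: recovery_steps is a hypothetical name, the steps are returned as plain descriptions instead of running SQL, and state combinations that the rules above do not mention are rejected rather than guessed at.

```python
def recovery_steps(x_state, y_state):
    """Return the ordered steps to resume one batch.

    x_state / y_state are the batch states recorded in X and Y;
    y_state is None when Y has no record for the batch yet.
    """
    if x_state == "INIT" and y_state is None:
        return ["insert INIT record into Y", "copy batch rows into Y"]
    if x_state == "INIT" and y_state == "COPIED":
        return ["set X state to COPIED", "delete batch rows from X"]
    if x_state == "FINISH" and y_state == "COPIED":
        return ["set Y state to FINISH"]
    if x_state == "FINISH" and y_state == "FINISH":
        return []  # batch is complete on both sides
    # Not covered by the rules in the text; left to the implementer.
    raise ValueError(f"no recovery rule for X={x_state!r}, Y={y_state!r}")


after_crash_before_copy = recovery_steps("INIT", None)
after_crash_before_delete = recovery_steps("INIT", "COPIED")
after_crash_before_ack = recovery_steps("FINISH", "COPIED")
```

Because every step is driven by the recorded states, rerunning the program after a crash repeats at most the one step that was interrupted.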
In conclusion, processing data in batches gives you a chance to optimize the transfer between the two databases. The batch size dominates the efficiency of the transfer and depends on two factors: how heavily the databases are used concurrently by other operations, and the tuning parameters of your databases. In general, the write throughput of Y is likely to be the bottleneck.
Threads are not the way to go. The database(s) is the bottleneck here. Multiple threads will only increase contention. Even if 10 processes are jamming data into SQL Server, a single thread (rather than many) can pull it out faster. There is absolutely no doubt about that.
The SELECT itself can cause locks in the main table, reducing the throughput of the INSERTs, so I would "get in and get out" as fast as possible. If it were me, I would:
SELECT the rows based on a range query (date, recno, whatever), dump them into a file, and close the result set (cursor).
DELETE the rows based on the same range query.
Then process the dump. If possible, the dump format should be amenable to bulk-load into MySQL.
I don't want to beat up your architecture, but overall the design sounds problematic. SELECTing and DELETEing rows from a table undergoing a high INSERTion rate is going to create huge locking issues. I would be looking at "double-buffering" the data in the SQL Server.
For example, every minute the inserts switch between two tables. For example, in the first minute INSERTs go into TABLE_1, but when the minute rolls over they start INSERTing into TABLE_2, the next minute back to TABLE_1, and so forth. While INSERTS are going into TABLE_2, SELECT everything from TABLE_1 and dump it into MySQL (as efficiently as possible), then TRUNCATE the table (deleting all rows with zero penalty). This way, there is never lock-contention between the readers and writers.
Coordinating the rollover point between TABLE_1 and TABLE_2 is the tricky part. But it can be done automatically through a clever use of SQL Server Partitioned Views.
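Purely to illustrate the alternation itself (not the partitioned-view mechanism), the table choice per minute could look like the sketch below. The assumption here is that the writer and the reader both derive the table from the same clock, which is a simplification; insert_table and drain_table are hypothetical names.

```python
from datetime import datetime

TABLES = ("TABLE_1", "TABLE_2")


def insert_table(now):
    # Even minutes write to TABLE_1, odd minutes to TABLE_2.
    return TABLES[now.minute % 2]


def drain_table(now):
    # The reader always works on the table that is NOT receiving inserts.
    return TABLES[(now.minute + 1) % 2]


first_minute = datetime(2024, 1, 1, 12, 0, 30)
next_minute = datetime(2024, 1, 1, 12, 1, 5)
writes_first, reads_first = insert_table(first_minute), drain_table(first_minute)
writes_next, reads_next = insert_table(next_minute), drain_table(next_minute)
```

In practice the reader should also wait a moment after the rollover before it selects and truncates, so that inserts still in flight for the previous minute have landed.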
I have some queries that run for quite a long time (20-30 minutes). If a lot of queries are started simultaneously, the connection pool is drained quickly.
Is it possible to wrap the long-running query into a statement (procedure) that stores the result of a generic query in a temp table, terminates the connection, and lets me fetch (poll) the results later on demand?
EDIT: the queries and data structures are optimized, and tips like 'check your indices and execution plan' don't work for me. I'm looking for a way to store [maybe a] byte representation of a generic result set for later retrieval.
First of all, 20-30 minutes is an extremely long time for a query - are you sure you aren't missing any indexes for the query? Do check your execution plan - you could get a huge performance gain from a well-placed index.
In MySQL, you could do
INSERT INTO `cached_result_table` (
SELECT your_query_here
)
(of course, cached_result_table needs to have the exact same column structure as your SELECT returns, otherwise you'll get an error).
Then, you could query these cached results (instead of the original tables), and only run the above query from time to time - to update the cached_result_table.
Of course, the query will need to run at least once initially, which will take the 20-30 minutes you mentioned. I suggest to pre-populate the cached table before the data are requested, and keep some locking mechanism to prevent the update query to run several times simultaneously. Pseudocode:
init:
    insert select your_big_query
work:
    if your_big_query cached table is empty or nearing expiration:
        refresh in the background:
            check flag to see if there's another "refresh" process running
            if yes
                end // don't run two your_big_queries at the same time
            else
                set flag
                re-run your_big_query, save to cached table
                clear flag
    serve data to clients always from cached table
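The pseudocode above could be sketched in application code roughly as follows. Assumptions: the cached "table" is just an in-memory list, the flag is a threading.Lock inside one process (several application servers would need a flag in the database instead), and CachedQuery and fake_big_query are hypothetical names.

```python
import threading
import time


class CachedQuery:
    def __init__(self, run_query, ttl_seconds):
        self._run_query = run_query          # the expensive query
        self._ttl = ttl_seconds
        self._refresh_flag = threading.Lock()
        self._rows = None
        self._loaded_at = None

    def refresh(self):
        """Re-run the query unless a refresh is already running; True if we ran it."""
        if not self._refresh_flag.acquire(blocking=False):
            return False                     # flag is set: don't run it twice
        try:
            self._rows = self._run_query()
            self._loaded_at = time.monotonic()
            return True
        finally:
            self._refresh_flag.release()     # clear flag

    def is_stale(self):
        return (self._loaded_at is None
                or time.monotonic() - self._loaded_at >= self._ttl)

    def get(self):
        if self.is_stale():
            # Refresh in the background; callers are not blocked by it.
            threading.Thread(target=self.refresh, daemon=True).start()
        return self._rows                    # always serve from the cache


calls = []


def fake_big_query():
    calls.append(1)
    return [("row", len(calls))]


cache = CachedQuery(fake_big_query, ttl_seconds=3600)
initial_load = cache.refresh()   # "init": pre-populate before clients ask
served = cache.get()
```

Pre-populating with refresh() at startup matters: get() returns whatever is cached, which is None until the first load has finished.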
An easy way to do that in Oracle is "CREATE TABLE sometempname AS SELECT...". That will create a new table using the result columns from the select.
Not quite sure what you are requesting.
Currently you have 50 database sessions. Say you get 40 running long-running queries, that leaves 10 to service the rest.
What you seem to be asking for is, you want those 40 queries asynchronously (running in the background) not clogging up the connection pool of 50. The question is, do you want those 40 running concurrently with (potentially) another 50 queries from the connection pool, or do you want them queued up in some way ?
Queuing can be done (look into DBMS_SCHEDULER and DBMS_JOB). But you will need to deliver those results into some other table and know how to deliver that result set. The old fashioned way is simply to generate reports on request that get delivered to a directory on a shared drive or by email. Could be PDF or CSV or Excel.
If you want the 40 running concurrently alongside the 50 'connection pool' settings, then you may be best off setting up a separate connection pool for the long-running queries.
You can look into Resource Manager for terminating calls that take too long or too many resources. That way the quickie pool can't get bogged down in long running requests.
The most generic approach in Oracle I can think of is creating a stored procedure that converts a result set into XML and stores it as a CLOB/XMLType in a table holding the results of your long-running queries.
You can find more on generating XML from generic result sets here.
select dbms_xmlgen.getxml('select employee_id, first_name,
last_name, phone_number from employees where rownum < 6') xml
from dual