Spring Batch JdbcCursorItemReader or RepositoryItemReader? - java

I'm writing a batch job with Spring Batch. I have to move about 2 000 000 records from the data source (an Oracle database) to the target (a Kafka broker). I'm unsure which ItemReader I should choose for this job:
JdbcCursorItemReader: if I understand correctly, it opens a cursor that iterates through the ResultSet of ALL of those records one by one, so performance is not an issue; under the hood the database keeps a snapshot of the records satisfying the WHERE clause at the time of query execution;
RepositoryItemReader: might be less performant; partitioning is based on a paging mechanism, so the query is executed once per page, with the possibility of omitting some records that get written to the database during the fetch of those 2 000 000 records, which wouldn't happen in the former case (is my reasoning even correct?)
Summary: I want to send all of those 2 000 000 records as they were at the time of the query execution, in a partitioned manner. Am I overthinking this problem? Maybe skipping new records isn't such a problem for future executions of the job that pick up updates? Or maybe my reasoning regarding RepositoryItemReader is not correct?

Keeping a cursor open for extended periods of time is not always ideal. Depending on the DB you're using, it may not be optimized; for example, some DBs do not honor fetchSize and will retrieve results one by one as they are requested.
I would go with the RepositoryItemReader or one of the PagingItemReader implementations.
I'm not quite following whether your concern is that you DO or DO NOT want to omit new records.
If you DO want to omit new records, you should be able to add a predicate to your WHERE clause so that it does not pass a certain ID or timestamp field. If neither of those is available, you can set maxItemCount() on the reader based on a count query you execute up front, before the job (in a listener, for example). A sketch of the first approach is shown below.
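For illustration, a minimal sketch of the first approach with a JdbcPagingItemReader, assuming Spring Batch 4+; the RECORDS table, the SourceRow DTO and the 'maxId' job execution context entry are all hypothetical, with the upper bound captured before the step (e.g. in a JobExecutionListener running "select max(id) from records"):

// Sketch only: hypothetical RECORDS(ID, PAYLOAD) table and SourceRow(long, String) DTO.
import java.util.Collections;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ReaderConfig {

    @Bean
    @StepScope
    public JdbcPagingItemReader<SourceRow> recordReader(
            DataSource dataSource,
            @Value("#{jobExecutionContext['maxId']}") Long maxId) {
        return new JdbcPagingItemReaderBuilder<SourceRow>()
                .name("recordReader")
                .dataSource(dataSource)
                .selectClause("select id, payload")
                .fromClause("from records")
                .whereClause("where id <= :maxId")   // ignore rows inserted after the job started
                .sortKeys(Collections.singletonMap("id", Order.ASCENDING))
                .parameterValues(Collections.<String, Object>singletonMap("maxId", maxId))
                .rowMapper((rs, i) -> new SourceRow(rs.getLong("id"), rs.getString("payload")))
                .pageSize(1000)
                .build();
    }
}

The same upper-bound predicate works with a RepositoryItemReader, as long as the repository method takes the bound as a parameter.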

Related

How to get data from Oracle table into java application concurrently

I have an Oracle table with ~10 million records that are not dependent on each other. An existing Java application executes the query and iterates through the returned Iterator, batching the records for further processing. The fetchSize is set to 250.
Is there any way to parallelize getting the data from the Oracle DB? One thing that comes to mind is to break down the query into chunks using "rowid" and then pass these chunks to separate threads.
I am wondering if there is some kind of standard approach in solving this issue.
A few approaches to achieve it:
alter session force parallel QUERY parallel 32; execute this at the DB level, in PL/SQL, just before the SELECT statement. You can adjust the value 32 depending on the number of nodes (RAC setup).
The ROWID-based approach you mention. The difficult part is how you return each chunk's SELECT results to Java and how you combine them, so this approach is a bit more involved; a sketch is shown below.
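As an illustration of the second idea without manual chunk bookkeeping, here is a rough sketch that splits the table across threads by hashing the ROWID with ORA_HASH; the BIG_TABLE name and the per-row processing are placeholders:

// Sketch only: each worker reads the slice of BIG_TABLE whose ORA_HASH(ROWID, buckets - 1) equals its bucket.
import java.sql.*;
import java.util.concurrent.*;
import javax.sql.DataSource;

public final class ParallelTableReader {

    public static void readInParallel(DataSource ds, int buckets) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(buckets);
        for (int bucket = 0; bucket < buckets; bucket++) {
            final int b = bucket;
            pool.submit(() -> {
                try (Connection con = ds.getConnection();
                     PreparedStatement ps = con.prepareStatement(
                         "select t.* from big_table t where ora_hash(rowid, ?) = ?")) {
                    ps.setFetchSize(250);
                    ps.setInt(1, buckets - 1);   // ORA_HASH's max_bucket is inclusive
                    ps.setInt(2, b);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            process(rs);         // hypothetical per-row processing
                        }
                    }
                } catch (SQLException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void process(ResultSet rs) throws SQLException { /* ... */ }
}

Each thread only processes its own slice, so no result combination is needed in the application.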

JDBC Read without cursor

I have to read huge amounts of data from the database (for example, let's consider more than 500 000 records). Then I have to save the read data to a file. I have many issues with the cursor (not only memory issues).
Is it possible to do it without cursor, for example using stream? If so how can I achieve it?
I have experience working with huge data (almost 500 million records). I simply used a PreparedStatement query, a ResultSet and, of course, some buffer tweaking through:
setFetchSize(int)
In my case, I split the program into threads because the huge table was partitioned (each thread processed one partition), but I think that this is not your case.
It is pointless to fetch the data through a cursor. I would rather use a database view or a plain SQL query. Do not use an ORM for this purpose.
According to your comment, your best option is to limit JDBC to fetching only a specific number of rows at a time instead of fetching all of them (this lets processing start sooner and does not load the entire table into the ResultSet). Save the data into a collection and write it to a file using a BufferedWriter. You can also benefit from a multi-core CPU by running it in more threads, e.g. the first batch of fetched rows handled in one thread, the next batch in a second thread. In case of threading, use synchronized collections and be aware that you might face ordering problems. A single-threaded sketch follows.
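A minimal single-threaded sketch of that approach; the table, column names, fetch size and output path are placeholders:

// Sketch only: stream rows with a modest fetch size and write them to a file as they arrive.
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.sql.*;

public final class TableToFile {

    public static void export(Connection con, Path target) throws Exception {
        con.setAutoCommit(false);                           // some drivers (e.g. PostgreSQL) only stream when autocommit is off
        try (PreparedStatement ps = con.prepareStatement("select id, payload from big_table");
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            ps.setFetchSize(1000);                          // fetch 1000 rows per round trip instead of the whole table
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.write(rs.getLong("id") + ";" + rs.getString("payload"));
                    out.newLine();
                }
            }
        }
    }
}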

Not able to run select query after setting TTL in cassandra

I already have records in a Cassandra DB. Using a Java class I retrieve each row, update it with a TTL and store it back to Cassandra. After that, if I run a select query, it executes and shows the records. But once the TTL has expired, the select query should show zero records; instead it fails with a "Cassandra Failure during read query at consistency ONE" error. For other tables the select query works properly, but not for the table whose rows I applied the TTL to.
You are using common anti-patterns.
1) You are using batches to load data into two single tables, separately. I don't know if you already own a cluster or you're on your local machine, but this is not the way you load data into a C* cluster, and you are going to put a lot of stress on it. You should use batches only when you need to keep two or more tables in sync, not to load a bunch of records at a time. I suggest the following readings on the topic:
DataStax documentation on BATCH
Ryan Svihla Blog
2) You are using synchronous writes to insert your pretty independent records into your cluster. You should use asynchronous writes to speed up your data processing.
DataStax Java Driver Async Queries
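A minimal sketch of bounded asynchronous writes with the 3.x DataStax Java driver and Guava; the prepared statement and the limit of 128 in-flight requests are illustrative:

// Sketch only: issue writes asynchronously, bounded by a semaphore for back-pressure.
import com.datastax.driver.core.*;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Semaphore;

public final class AsyncWriter {

    public static void writeAll(Session session, PreparedStatement prepared,
                                Iterable<Object[]> rows) throws InterruptedException {
        final int maxInFlight = 128;
        Semaphore inFlight = new Semaphore(maxInFlight);
        for (Object[] values : rows) {
            inFlight.acquire();                                   // at most 128 concurrent writes
            BoundStatement bound = new BoundStatement(prepared).bind(values);
            ResultSetFuture future = session.executeAsync(bound);
            future.addListener(inFlight::release, MoreExecutors.directExecutor());
        }
        inFlight.acquire(maxInFlight);                            // wait until every write has completed
    }
}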
3) You are using the TTL feature in your tables, which per se is not that bad. However, an expired TTL is a tombstone, and that means that when you run your SELECT query, C* will have to read all those tombstones.
4) You bind your prepared statement multiple times:
BoundStatement bound = phonePrepared.bind(macAddress, ...
and that should be
BoundStatement bound = new BoundStatement(phonePrepared).bind(macAddress, ...
in order to use different bound statements. This is not an anti-pattern, this is a problem with your code.
Now, if you run your program multiple times, your tables accumulate a lot of tombstones due to the TTL feature, and that means C* has to read all of them in order to find what you wrote "the last time" you ran successfully, and it takes so long that the queries time out.
Just for fun, you can try increasing your timeout for the SELECT, say to 2 minutes, and take a coffee break; in the meantime C* will get your records back.
I don't know what you are trying to achieve, but short TTLs are your enemies. If you just want to refresh your records, keep the TTLs high enough that they don't hurt your performance. A probably better solution is to add a new column, EXPIRED, written "manually" only when you need to delete a record (see the sketch below). That depends on your requirements.
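A tiny sketch of that "manual expiry" idea, assuming a hypothetical phone table with a boolean expired column:

// Sketch only: phone(mac_address text primary key, ..., expired boolean) is made up.
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public final class ManualExpiry {

    // Flag the row instead of letting a TTL tombstone it.
    public static void expire(Session session, String macAddress) {
        session.execute("UPDATE phone SET expired = true WHERE mac_address = ?", macAddress);
    }

    // Readers fetch by key and simply skip rows that were flagged.
    public static boolean isActive(Session session, String macAddress) {
        Row row = session.execute(
                "SELECT expired FROM phone WHERE mac_address = ?", macAddress).one();
        return row != null && !row.getBool("expired");
    }
}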

Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?

I have a CSV which is... 34 million lines long. Yes, no joking.
This is a CSV file produced by a parser tracer which is then imported into the corresponding debugging program.
And the problem is in the latter.
Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
             .peek(ignored -> status.incrementProcessedNodes())
             .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}
csvToNode is simply a mapper which will turn a String (a line of a CSV) into a NodesRecord for insertion.
Now, the line:
.peek(ignored -> status.incrementProcessedNodes())
well... The method name tells pretty much everything; it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
But now jooq has this (taken from their documentation) which can load directly from a CSV:
create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();
(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query by a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.
So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
There are a number of ways to optimise this.
Custom load partitioning
One way to optimise query execution at your side is to collect sets of values into:
Bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
Batch statements (as in JDBC batch)
Commit segments (commit after N statements)
... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can heavily increase load speed.
This is currently the only way you can "listen" to loading progress; a sketch is shown below.
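For example, a rough sketch of the batching idea applied to your insertNodes() method, reusing the names from your code (the batch size of 1000 is arbitrary; batchInsert() turns the buffered records into a single JDBC batch), so the per-second counter keeps working:

// Sketch only: same method as above, but flushing JDBC batches of 1000 records
// (needs java.util.ArrayList / java.util.List in addition to the existing imports).
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    final int batchSize = 1000;
    final List<NodesRecord> buffer = new ArrayList<>(batchSize);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
             .peek(ignored -> status.incrementProcessedNodes())
             .forEach(r -> {
                 buffer.add(r);
                 if (buffer.size() == batchSize) {
                     jooq.batchInsert(buffer).execute();   // one JDBC batch instead of 1000 single INSERTs
                     buffer.clear();
                 }
             });
    }

    if (!buffer.isEmpty())
        jooq.batchInsert(buffer).execute();                // flush the remainder
}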
Load partitioning using jOOQ 3.6+
(this hasn't been released yet, but it will be, soon)
jOOQ natively implements the above three partitioning measures in jOOQ 3.6.
Using vendor-specific CSV loading mechanisms
jOOQ will always need to pass through JDBC and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. the ones you've mentioned:
H2: http://www.h2database.com/html/tutorial.html#csv
PostgreSQL: http://www.postgresql.org/docs/current/static/sql-copy.html
This will be more low-level, but certainly faster than anything else.
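As an illustration of the PostgreSQL route, a sketch using the JDBC driver's CopyManager; the nodes table and its columns are hypothetical (H2's CSVREAD plays a similar role on the SQL side):

// Sketch only: stream a CSV file straight into PostgreSQL with COPY, bypassing row-by-row INSERTs.
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public final class PgCsvLoader {

    public static long load(Connection con, Path csv) throws Exception {
        CopyManager copy = con.unwrap(PGConnection.class).getCopyAPI();
        try (Reader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            // Returns the number of rows copied.
            return copy.copyIn("COPY nodes (id, parent_id, label) FROM STDIN WITH (FORMAT csv)", reader);
        }
    }
}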
General remarks
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
That's a very interesting idea. Will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many factors other than your DB vendor, e.g.:
Constraints on the table
Indexes on the table
Logging turned on/off
etc.
jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:
Set the commit rate (e.g. commit every 1000 rows) to avoid long UNDO / REDO logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods.
In jOOQ 3.6+, you can also:
Set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution. This can be done via the bulkXXX() methods.
Set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details). This can be done via the batchXXX() methods.
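Put together, the Loader call could look roughly like this with jOOQ 3.6+ (exact method names and ordering may differ slightly between versions; the NODES fields are placeholders, and reading through a Reader sidesteps the encoding concern):

// Sketch only: tune bulk, batch and commit sizes on the Loader API.
create.loadInto(NODES)
      .bulkAfter(32)                                   // combine 32 rows into one bulk INSERT
      .batchAfter(10)                                  // combine 10 statements into one JDBC batch
      .commitAfter(320)                                // commit every 320 rows
      .loadCSV(Files.newBufferedReader(nodesPath, UTF8))
      .fields(ID, PARENT_ID, LABEL)
      .execute();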

Storing result set for later fetch

I have some queries that run for quite a long time (20-30 minutes). If a lot of queries are started simultaneously, the connection pool is drained quickly.
Is it possible to wrap the long-running query into a statement (procedure) that will store the result of a generic query into a temp table, terminating the connection, and fetching (polling) the results later on demand?
EDIT: the queries and data structures are optimized, and tips like 'check your indices and execution plan' don't work for me. I'm looking for a way to store [maybe a] byte representation of a generic result set, for later retrieval.
First of all, 20-30 minutes is an extremely long time for a query - are you sure you aren't missing any indexes for the query? Do check your execution plan - you could get a huge performance gain from a well-placed index.
In MySQL, you could do
INSERT INTO `cached_result_table` (
SELECT your_query_here
)
(of course, cached_result_table needs to have the exact same column structure as your SELECT returns, otherwise you'll get an error).
Then, you could query these cached results (instead of the original tables), and only run the above query from time to time - to update the cached_result_table.
Of course, the query will need to run at least once initially, which will take the 20-30 minutes you mentioned. I suggest pre-populating the cached table before the data are requested, and keeping some locking mechanism to prevent the update query from running several times simultaneously. Pseudocode:
init:
    insert select your_big_query

work:
    if your_big_query cached table is empty or nearing expiration:
        refresh in the background:
            check flag to see if there's another "refresh" process running
            if yes
                end            // don't run two your_big_queries at the same time
            else
                set flag
                re-run your_big_query, save to cached table
                clear flag
    serve data to clients always from cached table
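A minimal Java sketch of that guard, assuming a single application instance (with several instances the flag would have to live in the database, e.g. via MySQL's GET_LOCK); the refresh SQL is a placeholder for your_big_query:

// Sketch only: refresh the cached table in the background, never running two refreshes at once.
import java.sql.Connection;
import java.sql.Statement;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sql.DataSource;

public final class CachedQueryRefresher {

    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    private final DataSource ds;

    public CachedQueryRefresher(DataSource ds) { this.ds = ds; }

    public void refreshIfNeeded() {
        if (!refreshing.compareAndSet(false, true))
            return;                                    // another refresh is already running
        new Thread(() -> {
            try (Connection con = ds.getConnection();
                 Statement st = con.createStatement()) {
                st.executeUpdate("TRUNCATE TABLE cached_result_table");
                st.executeUpdate("INSERT INTO cached_result_table (SELECT /* your_big_query */ ...)");
            } catch (Exception e) {
                e.printStackTrace();                   // log properly in real code
            } finally {
                refreshing.set(false);                 // clear the flag
            }
        }).start();
    }
}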
An easy way to do that in Oracle is "CREATE TABLE sometempname AS SELECT...". That will create a new table using the result columns from the select.
Not quite sure what you are requesting.
Currently you have 50 database sessions. Say 40 of them end up running long-running queries; that leaves 10 to service the rest.
What you seem to be asking for is that those 40 queries run asynchronously (in the background) without clogging up the connection pool of 50. The question is, do you want those 40 running concurrently with (potentially) another 50 queries from the connection pool, or do you want them queued up in some way?
Queuing can be done (look into DBMS_SCHEDULER and DBMS_JOB), but you will need to deliver those results into some other table and know how to retrieve that result set. The old-fashioned way is simply to generate reports on request that get delivered to a directory on a shared drive or by email, as PDF, CSV or Excel.
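For example, a sketch of submitting such a background job from Java via DBMS_SCHEDULER; the REPORT_RESULTS table and the query inside the job are placeholders:

// Sketch only: runs the long query once, in the background, as an auto-dropping one-off job
// that fills a hypothetical REPORT_RESULTS table the application polls later.
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public final class ReportJobSubmitter {

    public static void submitReportJob(Connection connection) throws SQLException {
        try (CallableStatement cs = connection.prepareCall(
                "begin "
              + "  dbms_scheduler.create_job( "
              + "    job_name   => 'LONG_REPORT_' || to_char(systimestamp, 'YYYYMMDDHH24MISSFF3'), "
              + "    job_type   => 'PLSQL_BLOCK', "
              + "    job_action => 'begin insert into report_results select /* long-running query */ ... ; commit; end;', "
              + "    enabled    => true); "   // runs once immediately, then auto-drops
              + "end;")) {
            cs.execute();                     // returns immediately; the query runs inside the database
        }
    }
}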
If you want the 40 running concurrently alongside the 50 'connection pool' settings, then you may be best off setting up a separate connection pool for the long-running queries.
You can look into Resource Manager for terminating calls that take too long or too many resources. That way the quickie pool can't get bogged down in long running requests.
The most generic approach in Oracle I can think of is creating a stored procedure that converts a result set into XML and stores it as a CLOB XMLType in a table holding the results of your long-running queries.
You can find more on generating XML from a generic result set here.
select dbms_xmlgen.getxml(
    'select employee_id, first_name, last_name, phone_number
     from employees where rownum < 6') xml
from dual;
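And to store the generated XML for later retrieval, the same call can feed an INSERT into a results table; the query_results table and its columns are made up:

// Sketch only: query_results(request_id, produced_at, payload XMLTYPE) is a hypothetical table.
import java.sql.Connection;
import java.sql.PreparedStatement;

public final class ResultStore {

    public static void storeResult(Connection connection, String requestId) throws Exception {
        try (PreparedStatement ps = connection.prepareStatement(
                "insert into query_results (request_id, produced_at, payload) "
              + "values (?, systimestamp, xmltype(dbms_xmlgen.getxml(?)))")) {
            ps.setString(1, requestId);
            ps.setString(2, "select employee_id, first_name, last_name, phone_number "
                          + "from employees where rownum < 6");
            ps.executeUpdate();
        }
    }
}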
