I'm currently trying to insert many records (~2000) in a batch, and jOOQ's batchInsert is not doing what I want.
I'm transforming POJOs into UpdatableRecords and then performing batchInsert, which executes a separate insert for each record. So jOOQ issues ~2000 queries per batch insert, and it's killing database performance.
It's executing this code (jOOQ's batch insert):
for (int i = 0; i < records.length; i++) {
    Configuration previous = ((AttachableInternal) records[i]).configuration();
    try {
        records[i].attach(local);
        executeAction(i);
    }
    catch (QueryCollectorSignal e) {
        Query query = e.getQuery();
        String sql = e.getSQL();
        // Aggregate executable queries by identical SQL
        if (query.isExecutable()) {
            List<Query> list = queries.get(sql);
            if (list == null) {
                list = new ArrayList<Query>();
                queries.put(sql, list);
            }
            list.add(query);
        }
    }
    finally {
        records[i].attach(previous);
    }
}
I could just do it like this (because jOOQ is doing the same thing internally):
records.forEach(UpdatableRecord::insert);
instead of:
jooq.batchInsert(records).execute();
How can I tell jOOQ to create new records in batch mode? Should I transform the records into bind queries and then call batchInsert? Any ideas? ;)
jOOQ's DSLContext.batchInsert() creates one JDBC batch statement per set of consecutive records with identical generated SQL strings (the Javadoc doesn't formally define this, unfortunately).
This can turn into a problem when your records look like this:
+------+--------+--------+
| COL1 | COL2 | COL3 |
+------+--------+--------+
| 1* | {null} | {null} |
| 2* | B* | {null} |
| 3* | {null} | C* |
| 4* | D* | D* |
+------+--------+--------+
.. because in that case, the generated SQL strings will look like this:
INSERT INTO t (col1) VALUES (?);
INSERT INTO t (col1, col2) VALUES (?, ?);
INSERT INTO t (col1, col3) VALUES (?, ?);
INSERT INTO t (col1, col2, col3) VALUES (?, ?, ?);
The reason for this default behaviour is the fact that this is the only way to guarantee ... DEFAULT behaviour. As in SQL DEFAULT. I gave a rationale of this behaviour here.
With this in mind, and since each consecutive SQL string is different, the inserts unfortunately aren't grouped into a single batch as you intended.
Solution 1: Make sure all changed flags are true
One way to enforce that all INSERT statements are identical is to set all changed flags of each individual record to true:
for (Record r : records)
    r.changed(true);
Now, all SQL strings will be the same.
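With every changed flag forced to true, each record generates the same column list, so the original call from the question should now end up in a single JDBC batch (using the same jooq and records references as above):
jooq.batchInsert(records).execute(); // identical SQL for all records, hence one batch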
Solution 2: Use the Loader API
Instead of batching, you could import the data (and specify batch sizes there). For details, see the manual's section about importing records:
https://www.jooq.org/doc/latest/manual/sql-execution/importing/importing-records
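A hedged sketch of what that could look like, assuming a generated meta-model table MY_TABLE and the same jooq and records references as above (see the linked manual page for the exact loader options):
jooq.loadInto(MY_TABLE)
    .batchAfter(500)              // one JDBC batch per 500 rows
    .commitAfter(500)             // optionally also commit every 500 rows
    .loadRecords(records)
    .fields(MY_TABLE.fields())
    .execute();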
Solution 3: Use a batch statement instead
Your usage of batchInsert() is a convenience API that works when using TableRecords. But of course, you can generate a single INSERT statement manually and batch the individual bind variables using jOOQ's batch statement API:
https://www.jooq.org/doc/latest/manual/sql-execution/batch-execution
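For example, a single INSERT statement with many sets of bind values (a sketch assuming a generated table T with columns COL1..COL3; the null literals only exist to create the bind placeholders):
BatchBindStep batch = jooq.batch(
    jooq.insertInto(T, T.COL1, T.COL2, T.COL3)
        .values((Integer) null, null, null));

for (MyPojo p : pojos)                        // hypothetical source POJOs
    batch.bind(p.getCol1(), p.getCol2(), p.getCol3());

batch.execute();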
A note on performance
There are a couple of open issues regarding DSLContext.batchInsert() and similar API. The client-side algorithm that generates SQL strings for each individual record is inefficient and might be changed in the future to rely on the changed() flags directly. Some relevant issues:
https://github.com/jOOQ/jOOQ/issues/4533
https://github.com/jOOQ/jOOQ/issues/6294
Related
I have a scenario where I want to insert a record into DB2 if it doesn't exist. If it already exists, I want to update the is_active column of the existing row to 0 and insert the new row with is_active set to 1.
I cannot use MERGE INTO because I cannot run two queries in the WHEN MATCHED section.
How can I achieve this in a batch?
If I were to run the queries one by one I could do it, but since the records are streaming in at around 500 messages per second, I want to do this in a batch.
With a plain Statement we could have done:
statement.addBatch(sql1);
statement.addBatch(sql2);
After doing the above, say, 500 times, we just execute the batch:
statement.executeBatch();
But we are looking for something similar with a PreparedStatement. When we tried to do it the same way as with Statement, it failed.
You can combine two or more data change statements into a single statement, but the result is a SELECT statement, which you can't use with the addBatch method:
Retrieval of result sets from an SQL data change statement
But you can create an AFTER INSERT trigger that performs the update, and then use only the INSERT statement in your addBatch calls.
CREATE TABLE TEST
(
ID INT NOT NULL GENERATED ALWAYS AS IDENTITY
, KEY INT NOT NULL
, IS_ACTIVE INT NOT NULL
) IN USERSPACE1;
CREATE TRIGGER TEST_AIR
AFTER INSERT ON TEST
REFERENCING NEW AS N
FOR EACH ROW
UPDATE TEST T SET IS_ACTIVE=0 WHERE T.KEY=N.KEY AND T.ID<>N.ID AND T.IS_ACTIVE<>0;
INSERT INTO TEST (KEY, IS_ACTIVE) VALUES (1, 1);
INSERT INTO TEST (KEY, IS_ACTIVE) VALUES (1, 1);
SELECT * FROM TEST;
|ID |KEY |IS_ACTIVE |
|-----------|-----------|-----------|
|1 |1 |0 |
|2 |1 |1 |
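With the trigger in place, only the INSERT has to be batched, so a PreparedStatement works naturally. A minimal sketch (the connection handling and the Message type are assumptions):
try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO TEST (KEY, IS_ACTIVE) VALUES (?, 1)")) {
    int count = 0;
    for (Message m : messages) {        // hypothetical stream of incoming messages
        ps.setInt(1, m.getKey());
        ps.addBatch();
        if (++count % 500 == 0) {       // flush roughly every 500 messages, as in the question
            ps.executeBatch();
        }
    }
    ps.executeBatch();                  // flush the remainder
}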
I want to use a timestamp as one of the columns in Cassandra (and I decided to use it as the clustering key). Which is the right way to store the column as a timestamp in Cassandra?
(i.e.) Is it fine to use milliseconds (example: 1513078338560) directly, like below?
INSERT INTO testdata (nodeIp, totalCapacity, physicalUsage, readIOPS, readBW, writeIOPS, writeBW, writeLatency, flashMode, timestamp) VALUES('172.30.56.60',1, 1,1,1,1,1,1,'yes',1513078338560);
Or should I use dateof(now()), like this?
INSERT INTO testdata (nodeIp, totalCapacity, physicalUsage, readIOPS, readBW, writeIOPS, writeBW, writeLatency, flashMode, timestamp) VALUES('172.30.56.60',1, 1,1,1,1,1,1,'yes',dateof(now()));
Which is the faster and recommended way for timestamp-based queries in Cassandra?
NOTE: I know it is stored internally as milliseconds; I checked using 'SELECT timestamp, blobAsBigint(timestampAsBlob(timestamp)) FROM'
dateof is deprecated in Cassandra >= 2.2. Instead, it's better to use the toTimestamp function, like this: toTimestamp(now()). When selecting, you can also use the toUnixTimestamp function if you want to get the timestamp as a long:
cqlsh:test> CREATE TABLE test_times (a int, b timestamp, PRIMARY KEY (a,b));
cqlsh:test> INSERT INTO test_times (a,b) VALUES (1, toTimestamp(now()));
cqlsh:test> SELECT toUnixTimestamp(b) FROM test_times;
system.tounixtimestamp(b)
---------------------------
1513086032267
(1 rows)
cqlsh:test> SELECT b FROM test_times;
b
---------------------------------
2017-12-12 13:40:32.267000+0000
(1 rows)
Regarding performance, there are different considerations:
If you already have the timestamp as a number, you can use it directly instead of calling a function.
It's better to use prepared statements instead of "raw" inserts - in that case Cassandra doesn't need to transfer the full query, only the data, and doesn't have to parse the statement every time.
The pseudo-code looks as follows (Java-like):
// DataStax Java driver: prepare the statement once, then bind and execute per row
PreparedStatement prepared = session.prepare(
        "INSERT INTO your_table (field1, field2) VALUES (?, ?)");
while (true) {                                   // e.g. the streaming/ingest loop
    session.execute(prepared.bind(value1, value2));
}
Given the following SQL structure of MY_TABLE:
GROUP_LABEL | FILE | TOPIC
-----------------------------
group A | 1.pdf | topic A
group A | 1.pdf | topic B
group A | 2.pdf | topic A
group B | 2.pdf | topic B
My task is to get this grouped by GROUP_LABEL, ignoring the different TOPICs of a file. So my expected result is:
GROUP_LABEL | COUNT(*)
----------------------
group A | 2 -- two different files 1.pdf and 2.pdf here
group B | 1 -- only one file here
In pure SQL I would do it like this:
SELECT GROUP_LABEL, COUNT(*) FROM (
  SELECT DISTINCT GROUP_LABEL, FILE FROM MY_TABLE
) t
GROUP BY GROUP_LABEL;
Is it possible to transform this into a JPA Criteria API query? I have no idea how to get my inner query into the from construct of the Criteria query; section 9.3.1 of https://docs.jboss.org/hibernate/entitymanager/3.5/reference/en/html/querycriteria.html suggests this is not possible.
But I just can't believe it ;-) Has anyone done this before? The inner query would be enriched with various, well-tested filter Predicates which I want to reuse.
I'm using spring-boot-starter-data : 1.5.6.RELEASE with mainly standard configuration.
Try this:
Query:
select label, count(distinct file) from tableName group by label;
Criteria:
criteria.setProjection(Projections.projectionList()
    .add(Projections.groupProperty("label"))
    .add(Projections.countDistinct("file")));
Firstly, your SQL query can be reduced to this:
SELECT GROUP_LABEL, COUNT(DISTINCT FILE) FROM MY_TABLE GROUP BY GROUP_LABEL
Secondly, it's always good not to name your columns with reserved words (such as FILE), to avoid problems.
Finally, you can use this as your HQL query (with no magic):
SELECT ge.groupLabel, COUNT(DISTINCT ge.file) FROM GroupEntity ge GROUP BY ge.groupLabel
Yes it is possible by using JPA javax.persistence.criteria API.
Take a look at this example in the official Documentation.
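A hedged sketch of how the distinct count per group could look with the JPA Criteria API, assuming an entity MyEntity with attributes groupLabel and file (names are illustrative):
CriteriaBuilder cb = em.getCriteriaBuilder();
CriteriaQuery<Object[]> query = cb.createQuery(Object[].class);
Root<MyEntity> root = query.from(MyEntity.class);

query.multiselect(root.get("groupLabel"), cb.countDistinct(root.get("file")))
     .groupBy(root.get("groupLabel"));

List<Object[]> result = em.createQuery(query).getResultList();
Your reusable filter Predicates can then be added with query.where(...) before executing.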
The scenario is simple.
I have a somewhat large MySQL db containing two tables:
-- Table 1
id (primary key) | some other columns without constraints
-----------------+--------------------------------------
1 | foo
2 | bar
3 | foobar
... | ...
-- Table 2
id_src | id_trg | some other columns without constraints
-------+--------+---------------------------------------
1 | 2 | ...
1 | 3 | ...
2 | 1 | ...
2 | 3 | ...
2 | 5 | ...
...
On table1, only id is a primary key. This table contains about 12M entries.
On table2, id_src and id_trg together form the primary key, both have foreign key constraints on table1's id, and both have ON DELETE CASCADE enabled. This table contains about 110M entries.
OK, now what I'm doing is simply creating a list of ids that I want to remove from table1 and then executing a plain DELETE FROM table1 WHERE id IN (<the list of ids>);
The latter, as you may have guessed, also deletes the corresponding rows from table2. So far so good, but the problem is that when I run this in a multi-threaded environment, I get many deadlocks!
A few notes:
There is no other process running at the same time, nor will there be (for the time being).
I want this to be fast! I'm using about 24 threads (if that makes any difference for the answer).
I have already tried almost all transaction isolation levels (except TRANSACTION_NONE): Java sql connection transaction isolation
Ordering/sorting the ids would, I think, not help!
I have already tried SELECT ... FOR UPDATE, but then a simple DELETE takes up to 30 seconds (so there is no point in using it):
DELETE FROM table1
WHERE id IN (
    SELECT id FROM (
        SELECT * FROM table1
        WHERE id = 'some_id'
        FOR UPDATE) AS x);
How can I fix this?
I would appreciate any help and thanks in advance :)
Edit:
Using the InnoDB engine.
On a single thread this process would take a dozen hours, maybe even a whole day, but I'm aiming for a few hours!
I'm already using a connection pool manager: java.util.concurrent
For an explanation of the doubly nested SELECTs, please refer to MySQL can't specify target table for update in FROM clause
The list to be deleted from the DB may contain a couple of million entries in total, divided into chunks of 200.
I use the FOR UPDATE clause because I've heard that it locks a single row instead of the whole table.
The app uses Spring's batchUpdate(String sqlQuery) method, so the transactions are managed automatically.
All ids are indexed and are unique strings of at most 50 characters.
ON DELETE CASCADE on id_src and id_trg (each separately) means that every delete of table1 row id=x leads to deletes of table2 rows with id_src=x and id_trg=x.
Some code, as requested:
public void write(List data) {
    try {
        ArrayList<String> ids = getIdsToDelete();
        // join the ids into a quoted, comma-separated list for the IN clause
        String idsToDelete = "'" + String.join("','", ids) + "'";
        String query = "DELETE FROM table1 WHERE id IN (" + idsToDelete + ")";
        mysqlJdbcTemplate.getJdbcTemplate().batchUpdate(query);
    } catch (Exception e) {
        LOG.error(e);
    }
}
and mysqlJdbcTemplate is just an abstract class that extends JdbcDaoSupport.
First of all, your simple DELETE query with a list of ids should not cause problems as long as you pass a limited number of ids, say up to 1,000 (and the total number of affected child rows stays in the same order of magnitude, not something like 10,000 or more). If you pass 50,000 ids or more at once, it can create locking issues.
To avoid deadlocks, you can follow the approach below (assuming the bulk deletion will not be part of the production system):
Step 1: Fetch all ids with a SELECT query and keep them in a cursor.
Step 2: Delete the ids stored in the cursor one by one in a stored procedure.
Note: To check why the deletion is acquiring locks, we have to check several things, such as how many ids you are passing, which transaction isolation level is set at the DB level, what your MySQL configuration in my.cnf looks like, etc.
It may be dangerous to delete many (> 10,000) parent records, each having child records deleted by cascade, because the more records you delete at a time, the higher the chance of a lock conflict leading to a deadlock or a rollback.
If it is acceptable (meaning you can make a direct JDBC connection to the database), you should (no threading involved here):
compute the list of ids to delete
delete them in batches (between 10 and 100 a priori), committing every 100 or 1000 records
As the heavier job should be on the database side, I highly doubt that threading will help here. If you want to try it anyway, I would recommend (see the sketch after this list):
one single thread (with a dedicated database connection) computing the list of ids to delete and feeding a synchronized queue with them
a small number of threads (4, maybe 8), each with its own database connection, that:
use a prepared DELETE FROM table1 WHERE id = ? in batches
take ids from the queue and prepare the batches
send a batch to the database every 10 or 100 records
do a commit every 10 or 100 batches
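A minimal sketch of one such worker thread, assuming a shared BlockingQueue<String> of ids and a plain JDBC DataSource (all names are illustrative):
try (Connection con = dataSource.getConnection();
     PreparedStatement ps = con.prepareStatement("DELETE FROM table1 WHERE id = ?")) {
    con.setAutoCommit(false);
    int inBatch = 0;
    int batches = 0;
    String id;
    // poll until the producer thread stops filling the queue
    while ((id = queue.poll(1, TimeUnit.SECONDS)) != null) {
        ps.setString(1, id);
        ps.addBatch();
        if (++inBatch == 100) {            // send a batch every 100 records
            ps.executeBatch();
            inBatch = 0;
            if (++batches % 10 == 0) {     // commit every 10 batches
                con.commit();
            }
        }
    }
    if (inBatch > 0) {
        ps.executeBatch();                 // flush the remaining ids
    }
    con.commit();
}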
I cannot imagine that the whole process could take more than several minutes.
After some other readings, it looks like I was used to old systems and that my numbers are really conservative.
OK, here's what I did. It might not actually avoid deadlocks, but it was my only option at the time.
This solution is actually a way of handling MySQL deadlocks using Spring.
Catch and retry deadlocks:
public void write(List data) {
    try {
        ArrayList<String> ids = getIdsToDelete();
        // join the ids into a quoted, comma-separated list for the IN clause
        String idsToDelete = "'" + String.join("','", ids) + "'";
        String query = "DELETE FROM table1 WHERE id IN (" + idsToDelete + ")";
        try {
            mysqlJdbcTemplate.getJdbcTemplate().batchUpdate(query);
        } catch (org.springframework.dao.DeadlockLoserDataAccessException e) {
            LOG.info("Caught DEADLOCK : " + e);
            retryDeadlock(query); // Retry them!
        }
    } catch (Exception e) {
        LOG.error(e);
    }
}
public void retryDeadlock(final String... sqlQuery) {
    RetryTemplate template = new RetryTemplate();
    TimeoutRetryPolicy policy = new TimeoutRetryPolicy();
    policy.setTimeout(30000L);
    template.setRetryPolicy(policy);
    try {
        template.execute(new RetryCallback<int[]>() {
            public int[] doWithRetry(RetryContext context) {
                LOG.info("Retrying DEADLOCK " + context);
                return mysqlJdbcTemplate.getJdbcTemplate().batchUpdate(sqlQuery);
            }
        });
    } catch (Exception e1) {
        e1.printStackTrace();
    }
}
Another solution could be to use Spring Batch's multiple-step mechanism, so that the DELETE queries are split into three steps: the first step deletes the child rows by the blocking column, and the other steps delete by the two other columns respectively (see the sketch below).
Step1: Delete id_trg from child table;
Step2: Delete id_src from child table;
Step3: Delete id from parent table;
Of course the last two steps could be merged into one, but in that case two distinct ItemWriters would be needed!
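A hedged sketch of such a job using Spring Batch tasklet steps (the factory, bean and helper names are assumptions, not taken from the original code):
@Bean
public Job splitDeleteJob(JobBuilderFactory jobs, StepBuilderFactory steps, JdbcTemplate jdbc) {
    String ids = getIdsToDelete();   // hypothetical helper returning a quoted, comma-separated id list

    Step deleteTrg = steps.get("deleteIdTrg")
        .tasklet((contribution, context) -> {
            jdbc.update("DELETE FROM table2 WHERE id_trg IN (" + ids + ")");
            return RepeatStatus.FINISHED;
        }).build();

    Step deleteSrc = steps.get("deleteIdSrc")
        .tasklet((contribution, context) -> {
            jdbc.update("DELETE FROM table2 WHERE id_src IN (" + ids + ")");
            return RepeatStatus.FINISHED;
        }).build();

    Step deleteParent = steps.get("deleteParent")
        .tasklet((contribution, context) -> {
            jdbc.update("DELETE FROM table1 WHERE id IN (" + ids + ")");
            return RepeatStatus.FINISHED;
        }).build();

    return jobs.get("splitDeleteJob")
        .start(deleteTrg)
        .next(deleteSrc)
        .next(deleteParent)
        .build();
}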
I would like to know how to create custom setups/teardowns, mostly to fix cyclic reference issues where I can insert custom SQL commands, with Spring Test DBUnit: http://springtestdbunit.github.io/spring-test-dbunit/index.html
Is there an annotation I can use, or how else can this be customized?
There isn't currently an annotation that you can use but you might be able to create a subclass of DbUnitTestExecutionListener and add custom logic in the beforeTestMethod. Alternatively you might get away with creating your own TestExecutionListener and just ordering it before DbUnitTestExecutionListener.
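A hedged sketch of that second option: a custom listener ordered before DbUnitTestExecutionListener that runs arbitrary SQL before each test (the class name and SQL are illustrative assumptions, e.g. for an H2 test database):
public class CustomSqlSetupListener extends AbstractTestExecutionListener {

    @Override
    public void beforeTestMethod(TestContext testContext) throws Exception {
        DataSource dataSource = testContext.getApplicationContext().getBean(DataSource.class);
        try (Connection con = dataSource.getConnection();
             Statement st = con.createStatement()) {
            // custom SQL to work around the cyclic reference before the DbUnit inserts run
            st.execute("SET REFERENTIAL_INTEGRITY FALSE");
        }
    }
}

// registered on the test class, ordered before the DbUnit listener:
@TestExecutionListeners({ DependencyInjectionTestExecutionListener.class,
                          CustomSqlSetupListener.class,
                          DbUnitTestExecutionListener.class })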
Another, potentially better solution would be to re-design your database to remove the cycle. You could probably drop the reference from company to company_config and add a unique index to company_id in the company_config table:
+------------+ 1 0..1 +--------------------------------+
| company |<---------| company_config |
+------------+ +--------------------------------+
| company_id | | config_id |
| ... | | company_id (fk, notnull, uniq) |
+------------+ +--------------------------------+
Rather than looking at company.config_id to get the config you would do select * from company_config where company_id = :id.
DbUnit needs the insert statements (XML lines) in order, because they are performed sequentially. There is no magic parameter or annotation so that DbUnit can resolve your cyclic references or foreign keys automatically.
The most automated approach I could achieve, if your data set contains many tables with foreign keys:
Populate your database with a few records (in your example: Company, CompanyConfig) and make sure the foreign keys are satisfied.
Extract a sample of your database using DbUnit's export tool.
This is a snippet you could use:
IDatabaseConnection connection = new DatabaseConnection(conn, schema);
configConnection((DatabaseConnection) connection);
// dependent tables database export: export table X and all tables that have a
// PK which is a FK on X, in the right order for insertion
String[] depTableNames = TablesDependencyHelper.getAllDependentTables(connection, "company");
IDataSet depDataset = connection.createDataSet(depTableNames);
FlatXmlWriter datasetWriter = new FlatXmlWriter(new FileOutputStream("target/dependents.xml"));
datasetWriter.write(depDataset);
After running this code, you will have your DbUnit data set in "dependents.xml", with all your cyclic references fixed.
I pasted the full code here; also have a look at the DbUnit docs on how to export data.
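As a possible follow-up (the test class setup below is an assumption, not part of the answer), the exported file could then drive the test through Spring Test DBUnit's @DatabaseSetup annotation:
@RunWith(SpringRunner.class)
@SpringBootTest
@TestExecutionListeners({ DependencyInjectionTestExecutionListener.class,
                          DbUnitTestExecutionListener.class })
@DatabaseSetup("/dependents.xml")
public class CompanyRepositoryTest {
    // tests relying on the exported, insert-ordered data set go here
}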