How to get MySQL disk writes in Java if InnoDB is enabled

I am using the following code to check whether InnoDB is enabled on the MySQL server, but I also want to get the total number of disk writes made by MySQL. Please help with the following program:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class connectToInnoDb {
    public static void main(String[] args) {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3310/INFORMATION_SCHEMA", "root", "root");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT * FROM ENGINES");
            while (rs.next()) {
                // compare strings with equals(), not ==; the engine is reported as "InnoDB"
                if ("InnoDB".equalsIgnoreCase(rs.getString(1)))
                    System.out.println("Yes");
            }
            con.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

You can get a lot of InnoDB information with SHOW ENGINE INNODB STATUS, including I/O counts:
mysql> SHOW ENGINE INNODB STATUS\G
...
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (read thread)
I/O thread 4 state: waiting for i/o request (read thread)
I/O thread 5 state: waiting for i/o request (read thread)
I/O thread 6 state: waiting for i/o request (write thread)
I/O thread 7 state: waiting for i/o request (write thread)
I/O thread 8 state: waiting for i/o request (write thread)
I/O thread 9 state: waiting for i/o request (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
431 OS file reads, 69 OS file writes, 53 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
...
I see above there have been 69 OS file writes. The numbers above are small because I got this information from a sandbox MySQL instance running on my laptop, and it hasn't been running long.
As commented by JimGarrison above, most of the information reported by INNODB STATUS is also available as individual rows in the INFORMATION_SCHEMA.INNODB_METRICS table. This is much easier to use from Java than running SHOW ENGINE INNODB STATUS and parsing the text.
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME = 'os_data_writes'\G
NAME: os_data_writes
SUBSYSTEM: os
COUNT: 69
MAX_COUNT: 69
MIN_COUNT: NULL
AVG_COUNT: 0.0034979215248910067
COUNT_RESET: 69
MAX_COUNT_RESET: 69
MIN_COUNT_RESET: NULL
AVG_COUNT_RESET: NULL
TIME_ENABLED: 2017-12-22 10:27:50
TIME_DISABLED: NULL
TIME_ELAPSED: 19726
TIME_RESET: NULL
STATUS: enabled
TYPE: status_counter
COMMENT: Number of writes initiated (innodb_data_writes)
Read https://dev.mysql.com/doc/refman/5.7/en/innodb-information-schema-metrics-table.html
I won't show the Java code; you already know how to run a query and fetch the results. These statements can be run the same way you run SELECT queries.
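If you do want a starting point, a minimal sketch of running that query over JDBC (not part of the original answer; it reuses the connection details from the question) could look like this:
try (Connection con = DriverManager.getConnection(
             "jdbc:mysql://localhost:3310/INFORMATION_SCHEMA", "root", "root");
     Statement stmt = con.createStatement();
     ResultSet rs = stmt.executeQuery(
             "SELECT * FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME = 'os_data_writes'")) {
    if (rs.next()) {
        // COUNT is the number of writes InnoDB has initiated since the counter was enabled
        System.out.println("InnoDB OS data writes: " + rs.getLong("COUNT"));
    }
}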

mysql> SHOW GLOBAL STATUS LIKE 'Innodb%write%';
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| Innodb_buffer_pool_write_requests | 5379  |
| Innodb_data_pending_writes        | 0     |
| Innodb_data_writes                | 416   |
| Innodb_dblwr_writes               | 30    |
| Innodb_log_write_requests         | 1100  |
| Innodb_log_writes                 | 88    |
| Innodb_os_log_pending_writes      | 0     |
| Innodb_truncated_status_writes    | 0     |
+-----------------------------------+-------+
mysql> SHOW GLOBAL STATUS LIKE 'Uptime';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| Uptime        | 4807   |  -- divide by this to get "per second"
+---------------+--------+
Note: "requests" include both writes that need to hit the disk and those that do not.

Related

Firestore adding 2 documents execution time

Is there any difference between those 2 in terms of execution time?
collectionReference.add(testObject)
    .addOnSuccessListener(new OnSuccessListener<DocumentReference>() {
        @Override
        public void onSuccess(DocumentReference documentReference) {
            collectionReference.add(testObject2);
        }
    });
And
collectionReference.add(testObject);
collectionReference.add(testObject2);
In the first case the second add will be executed after the first one has finished; is the same thing happening in the second case? Is the second add queued and waiting for the first to finish, or are they running in parallel?
Yes, there will be a difference between the execution time of these two.
In the first case you're waiting for the first write to be completed on the server, before sending the second write to the server. In a diagram:
Client Server
| |
|---- Send document to write ----->|
| |
| |
|<----- Response from server ------|
|---- Send document to write ----->|
| |
| |
|<----- Response from server ------|
| |
In the second case, the second write is sent to the server right after the first write was sent.
Client Server
| |
|---- Send document to write ----->|
|---- Send document to write ----->|
| |
| |
| |
|<----- Response from server ------|
|<----- Response from server ------|
| |
The difference in performance between these two is the latency of the connection between you and the server.
Note that this is just the theoretical difference, and likely there are many more factors influencing the performance.
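If you want the parallel behaviour of the second case but still a single place where you know both writes were acknowledged, a sketch using the Tasks API could look like this (same collectionReference, testObject and testObject2 as in the question; Tasks.whenAll comes from com.google.android.gms.tasks):
Task<DocumentReference> first = collectionReference.add(testObject);
Task<DocumentReference> second = collectionReference.add(testObject2);

// Both writes are already on their way to the server; whenAll just waits for both acknowledgements.
Tasks.whenAll(first, second).addOnSuccessListener(new OnSuccessListener<Void>() {
    @Override
    public void onSuccess(Void ignored) {
        Log.d("Firestore", "Both documents were written");
    }
});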

Constant timeouts in Cassandra after adding second node

I'm trying to migrate a moderately large swath of data (~41 million rows) from an SQL database to Cassandra. I've previously done a trial-run using half the dataset, and everything worked exactly as expected.
The problem is, now that I'm trying the complete migration Cassandra is throwing constant timeout errors. For instance:
[INFO] [talledLocalContainer] com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:10112 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
[INFO] [talledLocalContainer] at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
[INFO] [talledLocalContainer] at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
[INFO] [talledLocalContainer] at com.mycompany.tasks.CassandraMigrationTask.execute(CassandraMigrationTask.java:164)
[INFO] [talledLocalContainer] at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
[INFO] [talledLocalContainer] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[INFO] [talledLocalContainer] Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:10112 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[INFO] [talledLocalContainer] at java.lang.Thread.run(Thread.java:745)
I've tried increasing the timeout values in cassandra.yaml, and that increased the amount of time that the migration was able to run before dying to a timeout (roughly in proportion to the increase in the timeout).
Prior to changing the timeout settings, my stack-trace looked more like:
[INFO] [talledLocalContainer] com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
[INFO] [talledLocalContainer] at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
[INFO] [talledLocalContainer] at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
[INFO] [talledLocalContainer] at com.mycompany.tasks.CassandraMigrationTask.execute(CassandraMigrationTask.java:164)
[INFO] [talledLocalContainer] at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
[INFO] [talledLocalContainer] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[INFO] [talledLocalContainer] Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
[INFO] [talledLocalContainer] at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
[INFO] [talledLocalContainer] at com.datastax.driver.core.Responses$Error.asException(Responses.java:99)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:140)
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:249)
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:433)
[INFO] [talledLocalContainer] at com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:697)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.Channels.fireMessageReceived(Channels.java:296)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
My timeout settings are currently:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 30000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 30000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 30000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 30000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 20000
...which gets me about 1.5m rows inserted before the timeout happens. The original timeout settings were:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 2000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000
...which caused the timeouts to happen approximately every 300,000 rows.
The only significant change that's occurred between when I had my successful run and now is that I added a second node to the Cassandra deployment. So intuitively I'd think the issue would have something to do with the propagation of data from the first node to the second (as in, there's <some process> that scales linearly with the amount of data inserted and which isn't used when there's only a single node). But I'm not seeing any obvious options that might be useful for configuring/mitigating this.
If it's relevant, I'm using batch statements during the migration, typically with between 100 and 200 statements/rows per batch, at most.
My keyspace was originally set up WITH REPLICATION =
{ 'class' : 'SimpleStrategy', 'replication_factor' : 2 }, but I altered it to be WITH REPLICATION =
{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 } to see if that would make any difference. It didn't.
I also tried explicitly setting ConsistencyLevel.ANY on all my insert statements (and also the enclosing batch statements). That also made no difference.
There doesn't seem to be anything interesting in Cassandra's log on either node, although the first node is certainly showing more 'ops' than the second:
First node - 454317 ops
INFO [SlabPoolCleaner] 2016-01-25 19:46:08,806 ColumnFamilyStore.java:905 - Enqueuing flush of assetproperties_flat: 148265302 (14%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:15] 2016-01-25 19:46:08,807 Memtable.java:347 - Writing Memtable-assetproperties_flat#350387072(20.557MiB serialized bytes, 454317 ops, 14%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:15] 2016-01-25 19:46:09,393 Memtable.java:382 - Completed flushing /var/cassandra/data/itb/assetproperties_flat-e83359a0c34411e593abdda945619e28/itb-assetproperties_flat-tmp-ka-32-Data.db (5.249MiB) for commitlog position ReplayPosition(segmentId=1453767930194, position=15188257)
Second node - 2020 ops
INFO [BatchlogTasks:1] 2016-01-25 19:46:33,961 ColumnFamilyStore.java:905 - Enqueuing flush of batchlog: 4923957 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:22] 2016-01-25 19:46:33,962 Memtable.java:347 - Writing Memtable-batchlog#796821497(4.453MiB serialized bytes, 2020 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:22] 2016-01-25 19:46:33,963 Memtable.java:393 - Completed flushing /var/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/system-batchlog-tmp-ka-11-Data.db; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1453767955411, position=18567563)
Has anyone encountered a similar issue, and if so, what was the fix?
Would it be advisable to just take the second node offline, run the migration with just the first node, and then run nodetool repair afterwards to get the second node back in sync?
Edit
Answers to questions from comments:
I'm using the DataStax Java driver, and have a server-side task (Quartz job) that uses the ORM layer (Hibernate) to look up the next chunk of data to migrate, write it into Cassandra, and then purge it from the SQL database. I'm getting a connection to Cassandra using the following code:
public static Session getCassandraSession(String keyspace) {
    Session session = clusterSessions.get(keyspace);
    if (session != null && ! session.isClosed()) {
        //can use the cached session
        return session;
    }
    //create a new session for the specified keyspace
    Cluster cassandraCluster = getCluster();
    session = cassandraCluster.connect(keyspace);
    //cache and return the session
    clusterSessions.put(keyspace, session);
    return session;
}

private static Cluster getCluster() {
    if (cluster != null && ! cluster.isClosed()) {
        //can use the cached cluster
        return cluster;
    }
    //configure socket options
    SocketOptions options = new SocketOptions();
    options.setConnectTimeoutMillis(30000);
    options.setReadTimeoutMillis(300000);
    options.setTcpNoDelay(true);
    //spin up a fresh connection
    cluster = Cluster.builder().addContactPoint(Configuration.getCassandraHost()).withPort(Configuration.getCassandraPort())
            .withCredentials(Configuration.getCassandraUser(), Configuration.getCassandraPass()).withSocketOptions(options).build();
    //log the cluster details for confirmation
    Metadata metadata = cluster.getMetadata();
    LOG.debug("Connected to Cassandra cluster: " + metadata.getClusterName());
    for (Host host : metadata.getAllHosts()) {
        LOG.debug("Datacenter: " + host.getDatacenter() + "; Host: " + host.getAddress() + "; Rack: " + host.getRack());
    }
    return cluster;
}
The part with the SocketOptions is a recent addition, as the latest timeout error sounded like it was coming from the Java/client side rather than from within Cassandra itself.
Each batch inserts no more than 200 records. Typical values are closer to 100.
Both nodes have the same specs:
Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
32GB RAM
256GB SSD (primary), 2TB HDD (backups), both in RAID-1 configurations
First node:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 58155 0 0
RequestResponseStage 0 0 655104 0 0
MutationStage 0 0 259151 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 58041 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 80 0 0
MemtableReclaimMemory 0 0 80 0 0
PendingRangeCalculator 0 0 3 0 0
MemtablePostFlush 0 0 418 0 0
CompactionExecutor 0 0 8979 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 2 0 0
Native-Transport-Requests 1 0 1175338 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Second node:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 55803 0 0
RequestResponseStage 0 0 1 0 0
MutationStage 0 0 733828 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 56623 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 394 0 0
MemtableReclaimMemory 0 0 394 0 0
PendingRangeCalculator 0 0 2 0 0
MemtablePostFlush 0 0 428 0 0
CompactionExecutor 0 0 8883 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 1 0 0
Native-Transport-Requests 0 0 70 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
The output of nodetool ring was very long. Here's a nodetool status instead:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 204.11.xxx.1 754.66 MB 1024 ? 8cf373d8-0b3e-4fd3-9e63-fdcdd8ce8cd4 RAC1
UN 208.66.xxx.2 767.78 MB 1024 ? 42e1f336-84cb-4260-84df-92566961a220 RAC2
I increased all of Cassandra's timeout values by a factor of 10, and also set the Java driver's read timeout settings to match, and now I'm up to 8m 29.4m inserts with no issues. In theory if the issue scales linearly with the timeout values I should be good up until around 15m inserts (which is at least good enough that I don't need to constantly babysit the migration process waiting for each new error).
1) CL.ANY is almost always a bad idea - you're writing faster than the server can even acknowledge the writes.
2) 1024 tokens is silly, but not the cause of the problems. You also can't change it once the node is live in the cluster.
3) You're masking your problems by increasing the timeouts - Cassandra on that hardware can easily run 100k writes/second.
4) Batches are meant for atomicity; you're probably misusing them, which is adding to the headache (see the sketch after this list).
5) You've tuned all sorts of knobs without understanding them. Cassandra is different from a relational DB.
6) The right way to do data loads of this nature is with CQLSSTableWriter and the bulk load interface. Details at http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
7) When the client starts throwing errors, what's in the server logs? What's the JVM doing? Are you seeing GC pauses? Is the server idle? CPU maxed? Disks maxed?
8) There exist some very good tuning guides - consider reading and understanding https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
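As a concrete sketch of the usual alternative to large batches with the 2.x DataStax driver - individually prepared, asynchronous inserts with a client-side cap on concurrency - something like the following (the column names and the SourceRecord/recordsToMigrate types are made up for illustration; Semaphore is java.util.concurrent, Futures/FutureCallback come from Guava, which the driver already depends on):
PreparedStatement insert = session.prepare(
        "INSERT INTO assetproperties_flat (asset_id, name, value) VALUES (?, ?, ?)");
final Semaphore inFlight = new Semaphore(256);      // cap on outstanding requests, tune to taste

for (final SourceRecord rec : recordsToMigrate) {   // recordsToMigrate: the rows read from SQL
    inFlight.acquireUninterruptibly();              // back-pressure: block while 256 writes are pending
    ResultSetFuture future = session.executeAsync(
            insert.bind(rec.getAssetId(), rec.getName(), rec.getValue()));
    Futures.addCallback(future, new FutureCallback<ResultSet>() {
        public void onSuccess(ResultSet rs) { inFlight.release(); }
        public void onFailure(Throwable t) { inFlight.release(); LOG.error("Insert failed", t); }
    });
}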
Okay, so I was able to get the timeout errors to stop by doing two things. First, I increased Cassandra's timeout values on both hosts, as follows:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 30000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 30000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 30000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 30000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 20000
I suspect those values are unnecessarily large, but those are what I had in place when everything started working.
The second part of the solution was to adjust the client timeout in my Java code, as follows:
//configure socket options
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(300000);
options.setTcpNoDelay(true);
//spin up a fresh connection (using the SocketOptions set up above)
cluster = Cluster.builder().addContactPoint(Configuration.getCassandraHost()).withPort(Configuration.getCassandraPort())
.withCredentials(Configuration.getCassandraUser(), Configuration.getCassandraPass()).withSocketOptions(options).build();
With those two changes, the timeout errors stopped and the data migration completed without issue.
As @MarcintheCloud rightly points out in the comments above, increasing the timeout values may only have the effect of masking the underlying problem. But that's good enough in my case since 1) the underlying problem only surfaces under very high load, 2) I only need to run the migration process once, and 3) once the data has been migrated, the actual load levels are orders of magnitude lower than what's experienced during the migration.
However, understanding the underlying cause still seems worthwhile. So what was it? Well I've got two theories:
As @MarcintheCloud posits, perhaps 1024 is too many tokens to reasonably use with Cassandra. And perhaps as a consequence of that the deployment gets a bit flaky under heavy load.
My alternative theory has to do with network chatter between the two nodes. In my deployment, the first node runs the app-server instance, the first Cassandra instance, and the primary SQL database. The second node runs the second Cassandra instance and also a replica SQL database that is kept in sync with the primary database in near-real-time.
Now, the migration process essentially does two things concurrently; it writes data into Cassandra, and it deletes data from the SQL database. Both of those actions generate changesets that need to propagate over the network to the second node.
So my theory is that if changes are happening quickly enough on the first node (since the SSD does allow very high IO throughput), the network transfers of the SQL and Cassandra changelogs (and/or the subsequent IO ops on the second node) may occasionally contend with each other, introducing additional latency into the replication process(es) and potentially leading to timeouts. It seems plausible that with enough contention, one process or the other might get blocked for several seconds at a time, which is enough to trigger timeout errors at Cassandra's default settings.
Those are the most plausible theories I can think of, though I have no real way of testing to confirm which (if any) is correct.

How can I know why my application using PostgreSQL 9, hibernate 4.3.5 and c3p0 is hanging?

I'm working on an application which is based on PostgreSQL 9, hibernate 4.3.5.Final, c3p0, Tomcat 7 and JDK 7.
Here is the c3p0 configuration:
hibernate.c3p0.min_size=5
hibernate.c3p0.max_size=20
hibernate.c3p0.timeout=1800
hibernate.c3p0.max_statements=50
After a few hours of use, the application hangs. All screens are frozen because it seems that no new transaction to the database can be opened.
I did a kill -3 on the tomcat 7 process (there is a single app deployed) to see where all the threads are locked. Here's a part of the output:
"ajp-bio-8127-exec-274" daemon prio=10 tid=0x0000000001365000 nid=0x257b in Object.wait() [0x0000000045242000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1414)
at com.mchange.v2.resourcepool.BasicResourcePool.prelimCheckoutResource(BasicResourcePool.java:606)
- locked <0x000000078567cb70> (a com.mchange.v2.resourcepool.BasicResourcePool)
at com.mchange.v2.resourcepool.BasicResourcePool.checkoutResource(BasicResourcePool.java:526)
at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutAndMarkConnectionInUse(C3P0PooledConnectio
It's the same for all the HTTP request-processing threads, so all requests are waiting indefinitely for an available connection in the pool.
We had a look at Postgres and saw that 20 connections were open (20 is the pool max size):
foobar=# select datname, usename, client_port, query from pg_stat_activity where usename='foobar';
datname | usename | client_port | query
---------+---------+-------------+----------
foobar | foobar | 52992 | ROLLBACK
foobar | foobar | 52993 | ROLLBACK
foobar | foobar | 52991 | ROLLBACK
foobar | foobar | 52994 | ROLLBACK
foobar | foobar | 52995 | ROLLBACK
foobar | foobar | 36398 | ROLLBACK
foobar | foobar | 36399 | ROLLBACK
foobar | foobar | 36400 | ROLLBACK
foobar | foobar | 51766 | ROLLBACK
foobar | foobar | 56689 | ROLLBACK
foobar | foobar | 56690 | ROLLBACK
foobar | foobar | 39582 | ROLLBACK
foobar | foobar | 39581 | ROLLBACK
foobar | foobar | 39583 | ROLLBACK
foobar | foobar | 39590 | ROLLBACK
foobar | foobar | 39592 | ROLLBACK
foobar | foobar | 39591 | ROLLBACK
foobar | foobar | 41799 | ROLLBACK
foobar | foobar | 36105 | ROLLBACK
foobar | foobar | 36103 | ROLLBACK
(20 rows)
So, we set the pool logging to DEBUG, and we can see statements like these:
2014/07/09 05:24:40 DEBUG (BasicResourcePool.java:1747) trace trace com.mchange.v2.resourcepool.BasicResourcePool@12c39c9e [managed: 19, unused: 4, excluded: 0] (e.g. com.mchange.v2.c3p0.impl.NewPooledConnection@4fc04747)
They show that the number of managed connections grows slowly until it reaches managed: 20 and unused: 0. This final state remains stable, and the application is frozen because all the threads are waiting for a connection to become available from the pool.
It's a web application and we use the session-per-request pattern, so connections are closed properly after each request is processed (in a finally block). There is no ERROR or WARN in the application logs.
How can I find out what I did wrong?
Well, evidently those queries are getting blocked without the connections being released. It could be that you are getting some exception that you are not seeing, because the query is marked as ROLLBACK, and for some reason the thread is hanging waiting for the query to finish, or something of that sort. Without seeing the code it's difficult to say exactly.
What you could do is wait for this to happen again and then get a full thread dump. This should give you full details of where each thread is hanging, so you could see what the 20 connections are waiting on.
You can use jstack for this, which comes with the JDK.
You could also enable JMX on Tomcat and connect to it using jconsole or jvisualvm to see in real time what the threads are doing.
It looks like your application runs out of available pooled connections. A transaction is marked for rollback only if an exception was thrown. If you can't see any exception, it might be because you don't handle exceptions properly, for example by logging every exception at the ERROR level.
You need to check the db log as well, maybe you find what causes all those transactions to rollback.
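One thing worth double-checking is that every request path really does roll back and close its session when something goes wrong. A sketch of the pattern (sessionFactory and LOG stand in for whatever you already use):
Session session = sessionFactory.openSession();
Transaction tx = null;
try {
    tx = session.beginTransaction();
    // ... work for the current request ...
    tx.commit();
} catch (RuntimeException e) {
    if (tx != null) {
        tx.rollback();               // otherwise the connection can be handed back mid-transaction
    }
    LOG.error("Request failed, transaction rolled back", e);   // don't let the cause disappear silently
    throw e;
} finally {
    session.close();                 // always return the underlying c3p0 connection to the pool
}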

Getting Error com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

I have some data migration scripts (shell scripts generated with the Talend data migration tool) which connect to MySQL and perform some operations.
One of the scripts performs heavy calculations and lookups. When I execute it on my local machine it completes in around 2.5 hours, and the connection made to MySQL stays in sleep mode.
From MySQL Processlist
mysql> show processlist;
+-------+------+---------------------+-------------------+---------+------+-------+------------------+
| Id    | User | Host                | db                | Command | Time | State | Info             |
+-------+------+---------------------+-------------------+---------+------+-------+------------------+
| 10631 | root | localhost           | psdata_psdatabase | Sleep   |   18 |       | NULL             |
| 11195 | root | localhost           | psdata_psdatabase | Sleep   | 5497 |       | NULL             |
| 11261 | root | localhost           | psdata_psdatabase | Query   |    0 | NULL  | show processlist |
| 11492 | root | 192.168.9.213:56507 | psdata_psdatabase | Sleep   | 5509 |       | NULL             |
| 11493 | root | 192.168.9.213:56508 | psdata_psdatabase | Sleep   | 5508 |       | NULL             |
+-------+------+---------------------+-------------------+---------+------+-------+------------------+
5 rows in set (0.00 sec)
The Threads from 192.168.9.213 are of that script.
But when I execute the same script on production/staging, I get an error after around 1 hour, and I also don't see any activity in the MySQL process list during that hour, whereas on my local machine there were connections in sleep mode.
"com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"
Stack Trace
[root@host5 /home/talend/PharmaSecure_psData_localhost/execute_analysis_jobs]# sh execute_analysis_jobs_run.sh
Analysis Started
Started psVerify_interaction_analysis
Ended psVerify_interaction_analysis
Started psVerify_interaction_analysis_p2
Exception in component tMysqlOutput_1
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 3,350,869 milliseconds ago. The last packet sent successfully to the server was 31 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3851)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2471)
at com.mysql.jdbc.MysqlIO.disableMultiQueries(MysqlIO.java:3771)
at com.mysql.jdbc.PreparedStatement.executePreparedBatchAsMultiStatement(PreparedStatement.java:1675)
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1463)
at pharmasecure_ps.psverify_interaction_analysis_p2_0_1.psVerify_interaction_analysis_p2.tMysqlInput_1Process(psVerify_interaction_analysis_p2.java:3576)
at pharmasecure_ps.psverify_interaction_analysis_p2_0_1.psVerify_interaction_analysis_p2.tMysqlInput_6Process(psVerify_interaction_analysis_p2.java:1196)
at pharmasecure_ps.psverify_interaction_analysis_p2_0_1.psVerify_interaction_analysis_p2.tJava_1Process(psVerify_interaction_analysis_p2.java:687)
at pharmasecure_ps.psverify_interaction_analysis_p2_0_1.psVerify_interaction_analysis_p2.runJobInTOS(psVerify_interaction_analysis_p2.java:7365)
at pharmasecure_ps.psverify_interaction_analysis_p2_0_1.psVerify_interaction_analysis_p2.runJob(psVerify_interaction_analysis_p2.java:7207)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tRunJob_1Process(psVerify_interaction_analysis.java:4911)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tJava_2Process(psVerify_interaction_analysis.java:4806)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_8Process(psVerify_interaction_analysis.java:4719)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_6Process(psVerify_interaction_analysis.java:2821)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_5Process(psVerify_interaction_analysis.java:1541)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_9Process(psVerify_interaction_analysis.java:6433)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tJava_1Process(psVerify_interaction_analysis.java:5924)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.runJobInTOS(psVerify_interaction_analysis.java:6652)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.runJob(psVerify_interaction_analysis.java:6494)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_4Process(execute_analysis_jobs.java:1174)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_1Process(execute_analysis_jobs.java:838)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tJava_1Process(execute_analysis_jobs.java:682)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.runJobInTOS(execute_analysis_jobs.java:2749)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.main(execute_analysis_jobs.java:2596)
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3832)
... 23 more
Exception in component tRunJob_1
java.lang.RuntimeException: Child job running failed
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tRunJob_1Process(psVerify_interaction_analysis.java:4932)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tJava_2Process(psVerify_interaction_analysis.java:4806)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_8Process(psVerify_interaction_analysis.java:4719)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_6Process(psVerify_interaction_analysis.java:2821)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_5Process(psVerify_interaction_analysis.java:1541)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tMysqlInput_9Process(psVerify_interaction_analysis.java:6433)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.tJava_1Process(psVerify_interaction_analysis.java:5924)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.runJobInTOS(psVerify_interaction_analysis.java:6652)
at pharmasecure_ps.psverify_interaction_analysis_0_1.psVerify_interaction_analysis.runJob(psVerify_interaction_analysis.java:6494)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_4Process(execute_analysis_jobs.java:1174)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_1Process(execute_analysis_jobs.java:838)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tJava_1Process(execute_analysis_jobs.java:682)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.runJobInTOS(execute_analysis_jobs.java:2749)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.main(execute_analysis_jobs.java:2596)
Exception in component tRunJob_4
java.lang.RuntimeException: Child job running failed
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_4Process(execute_analysis_jobs.java:1195)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_1Process(execute_analysis_jobs.java:838)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tJava_1Process(execute_analysis_jobs.java:682)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.runJobInTOS(execute_analysis_jobs.java:2749)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.main(execute_analysis_jobs.java:2596)
Exception in component tSendMail_2
javax.mail.AuthenticationFailedException: failed to connect
at javax.mail.Service.connect(Service.java:322)
at javax.mail.Service.connect(Service.java:172)
at javax.mail.Service.connect(Service.java:121)
at javax.mail.Transport.send0(Transport.java:190)
at javax.mail.Transport.send(Transport.java:120)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tSendMail_2Process(execute_analysis_jobs.java:1433)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_4_onSubJobError(execute_analysis_jobs.java:508)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.tRunJob_4_error(execute_analysis_jobs.java:384)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs$TalendException.printStackTrace(execute_analysis_jobs.java:330)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs$TalendException.printStackTrace(execute_analysis_jobs.java:318)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs$TalendException.printStackTrace(execute_analysis_jobs.java:318)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.runJobInTOS(execute_analysis_jobs.java:2755)
at pharmasecure_ps.execute_analysis_jobs_0_1.execute_analysis_jobs.main(execute_analysis_jobs.java:2596)
Check the value of the wait_timeout variable on your MySQL servers.
The value could be too low on your production server.
To show the value, run this command:
SHOW VARIABLES LIKE 'wait_timeout'
Set your local server to the same value as production to see if you can reproduce the problem.
If the problem can be reproduced, you may have to increase the value of wait_timeout on your production server.
http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_wait_timeout
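A small sketch of checking this over JDBC from the migration job itself (host, credentials and the session-level override are placeholders/examples, not required settings):
try (Connection con = DriverManager.getConnection(
             "jdbc:mysql://production-host:3306/psdata_psdatabase", "user", "password");
     Statement stmt = con.createStatement()) {

    try (ResultSet rs = stmt.executeQuery("SHOW VARIABLES LIKE 'wait_timeout'")) {
        if (rs.next()) {
            System.out.println("wait_timeout = " + rs.getString("Value") + " seconds");
        }
    }

    // If it turns out to be too low, it can be raised for this session only
    // (here to 8 hours) instead of editing my.cnf:
    stmt.execute("SET SESSION wait_timeout = 28800");
}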

CPU load with play framework

For a few days now, on a system which has been in development for about a year, I have had a constant CPU load from the Play! server. I have two servers, one active and one as a hot spare. In the past, the hot-spare server showed no load, or a negligible load. But now it consumes a constant 50-110% CPU (using top on Linux).
Is there an easy way to find out what the cause is? I don't see this behavior on my MacBook when debugging (usually 0.1-1%). This is something that has only happened in the past few days as far as I am aware.
This is a status print of the hot spare. As can be seen, no controllers are queried apart from the scheduled tasks (which do no real work on this server due to a flag, but are still launched):
~ _ _
~ _ __ | | __ _ _ _| |
~ | '_ \| |/ _' | || |_|
~ | __/|_|\____|\__ (_)
~ |_| |__/
~
~ play! 1.2.4, http://www.playframework.org
~ framework ID is prod-frontend
~
~ Status from http://localhost:xxxx/#status,
~
Java:
~~~~~
Version: 1.6.0_26
Home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre
Max memory: 64880640
Free memory: 11297896
Total memory: 29515776
Available processors: 2
Play framework:
~~~~~~~~~~~~~~~
Version: 1.2.4
Path: /opt/play
ID: prod-frontend
Mode: PROD
Tmp dir: /xxx/tmp
Application:
~~~~~~~~~~~~
Path: /xxx/server
Name: iDoms Server
Started at: 07/01/2012 12:05
Loaded modules:
~~~~~~~~~~~~~~
secure at /opt/play/modules/secure
paginate at /xxx/server/modules/paginate-0.14
Loaded plugins:
~~~~~~~~~~~~~~
0:play.CorePlugin [enabled]
100:play.data.parsing.TempFilePlugin [enabled]
200:play.data.validation.ValidationPlugin [enabled]
300:play.db.DBPlugin [enabled]
400:play.db.jpa.JPAPlugin [enabled]
450:play.db.Evolutions [enabled]
500:play.i18n.MessagesPlugin [enabled]
600:play.libs.WS [enabled]
700:play.jobs.JobsPlugin [enabled]
100000:play.plugins.ConfigurablePluginDisablingPlugin [enabled]
Threads:
~~~~~~~~
Thread[Reference Handler,10,system] WAITING
Thread[Finalizer,8,system] WAITING
Thread[Signal Dispatcher,9,system] RUNNABLE
Thread[net.sf.ehcache.CacheManager#449278d5,5,main] WAITING
Thread[Timer-0,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,main] TIMED_WAITING
Thread[jobs-thread-1,5,main] TIMED_WAITING
Thread[jobs-thread-2,5,main] TIMED_WAITING
Thread[jobs-thread-3,5,main] TIMED_WAITING
Thread[New I/O server boss #1 ([id: 0x7065ec20, /0:0:0:0:0:0:0:0:9001]),5,main] RUNNABLE
Thread[DestroyJavaVM,5,main] RUNNABLE
Thread[New I/O server worker #1-3,5,main] RUNNABLE
Requests execution pool:
~~~~~~~~~~~~~~~~~~~~~~~~
Pool size: 0
Active count: 0
Scheduled task count: 0
Queue size: 0
Monitors:
~~~~~~~~
controllers.ReaderJob.doJob(), ms. -> 114 hits; 4.1 avg; 0.0 min; 463.0 max;
controllers.MediaCoderProcess.doJob(), ms. -> 4572 hits; 0.1 avg; 0.0 min; 157.0 max;
controllers.Bootstrap.doJob(), ms. -> 1 hits; 0.0 avg; 0.0 min; 0.0 max;
Datasource:
~~~~~~~~~~~
Jdbc url: jdbc:mysql://xxxx
Jdbc driver: com.mysql.jdbc.Driver
Jdbc user: xxxx
Jdbc password: xxxx
Min pool size: 1
Max pool size: 30
Initial pool size: 3
Checkout timeout: 5000
Jobs execution pool:
~~~~~~~~~~~~~~~~~~~
Pool size: 3
Active count: 0
Scheduled task count: 4689
Queue size: 3
Scheduled jobs (4):
~~~~~~~~~~~~~~~~~~~~~~~~~~
controllers.APNSFeedbackJob run every 24h. (has never run)
controllers.Bootstrap run at application start. (last run at 07/01/2012 12:05:32)
controllers.MediaCoderProcess run every 15s. (last run at 07/02/2012 07:10:46)
controllers.ReaderJob run every 600s. (last run at 07/02/2012 07:05:36)
Waiting jobs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
controllers.MediaCoderProcess will run in 2 seconds
controllers.APNSFeedbackJob will run in 17672 seconds
controllers.ReaderJob will run in 276 seconds
If your server is running under Linux, you may have been hit by the leap second bug which appeared last weekend.
This bug affects the Linux kernel (the thread management), so applications which use threads (such as the JVM, MySQL, etc.) may consume a high CPU load.
If you are using JDK 1.7 this should be easy, as they added this feature; have a look at my other related answer -> How to monitor the computer's cpu, memory, and disk usage in Java?
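For reference, a minimal sketch of that approach on JDK 7+ (it uses the com.sun.management extension of OperatingSystemMXBean, so it assumes a HotSpot/OpenJDK JVM):
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class CpuProbe {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        for (int i = 0; i < 10; i++) {
            // Both values are fractions between 0.0 and 1.0; -1.0 means "not yet available".
            System.out.printf("process CPU: %.1f%%  system CPU: %.1f%%%n",
                    os.getProcessCpuLoad() * 100, os.getSystemCpuLoad() * 100);
            Thread.sleep(1000);
        }
    }
}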
