Constant timeouts in Cassandra after adding second node - java

I'm trying to migrate a moderately large swath of data (~41 million rows) from an SQL database to Cassandra. I've previously done a trial-run using half the dataset, and everything worked exactly as expected.
The problem is that now that I'm trying the complete migration, Cassandra is throwing constant timeout errors. For instance:
[INFO] [talledLocalContainer] com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:10112 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
[INFO] [talledLocalContainer] at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
[INFO] [talledLocalContainer] at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
[INFO] [talledLocalContainer] at com.mycompany.tasks.CassandraMigrationTask.execute(CassandraMigrationTask.java:164)
[INFO] [talledLocalContainer] at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
[INFO] [talledLocalContainer] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[INFO] [talledLocalContainer] Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:10112 (com.datastax.driver.core.exceptions.DriverException: Timed out waiting for server response))
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[INFO] [talledLocalContainer] at java.lang.Thread.run(Thread.java:745)
I've tried increasing the timeout values in cassandra.yaml, and that increased the amount of time the migration was able to run before dying with a timeout (roughly in proportion to the increase in the timeout).
Prior to changing the timeout settings, my stack-trace looked more like:
[INFO] [talledLocalContainer] com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
[INFO] [talledLocalContainer] at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
[INFO] [talledLocalContainer] at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
[INFO] [talledLocalContainer] at com.mycompany.tasks.CassandraMigrationTask.execute(CassandraMigrationTask.java:164)
[INFO] [talledLocalContainer] at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
[INFO] [talledLocalContainer] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
[INFO] [talledLocalContainer] Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
[INFO] [talledLocalContainer] at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:54)
[INFO] [talledLocalContainer] at com.datastax.driver.core.Responses$Error.asException(Responses.java:99)
[INFO] [talledLocalContainer] at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:140)
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:249)
[INFO] [talledLocalContainer] at com.datastax.driver.core.RequestHandler.onSet(RequestHandler.java:433)
[INFO] [talledLocalContainer] at com.datastax.driver.core.Connection$Dispatcher.messageReceived(Connection.java:697)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.channel.Channels.fireMessageReceived(Channels.java:296)
[INFO] [talledLocalContainer] at com.datastax.shaded.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
My timeout settings are currently:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 30000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 30000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 30000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 30000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 20000
...which gets me about 1.5m rows inserted before the timeout happens. The original timeout settings were:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 2000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000
...which caused the timeouts to happen approximately every 300,000 rows.
The only significant change that's occurred between when I had my successful run and now is that I added a second node to the Cassandra deployment. So intuitively I'd think the issue would have something to do with the propagation of data from the first node to the second (as in, there's <some process> that scales linearly with the amount of data inserted and which isn't used when there's only a single node). But I'm not seeing any obvious options that might be useful for configuring/mitigating this.
If it's relevant, I'm using batch statements during the migration, typically with between 100 and 200 statements/rows per batch, at most.
My keyspace was originally set up WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 }, but I altered it to WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 } to see if that would make any difference. It didn't.
I also tried explicitly setting ConsistencyLevel.ANY on all my insert statements (and also the enclosing batch statements). That also made no difference.
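For reference, this is roughly the shape of the batched insert code (a minimal sketch only; the column names are placeholders, not my real schema, and in the real task the PreparedStatement is prepared once and reused):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.List;

//writes one chunk of rows as a single batch (typically 100-200 rows, as described above)
public static void writeChunk(Session session, List<String[]> chunk) {
    //placeholder columns - the real schema differs
    PreparedStatement insert = session.prepare(
            "INSERT INTO assetproperties_flat (asset_id, name, value) VALUES (?, ?, ?)");
    BatchStatement batch = new BatchStatement();
    batch.setConsistencyLevel(ConsistencyLevel.ANY); //one of the settings I experimented with
    for (String[] row : chunk) {
        batch.add(insert.bind(row[0], row[1], row[2]));
    }
    session.execute(batch);
}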
There doesn't seem to be anything interesting in Cassandra's log on either node, although the first node is certainly showing more 'ops' than the second:
First node - 454317 ops
INFO [SlabPoolCleaner] 2016-01-25 19:46:08,806 ColumnFamilyStore.java:905 - Enqueuing flush of assetproperties_flat: 148265302 (14%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:15] 2016-01-25 19:46:08,807 Memtable.java:347 - Writing Memtable-assetproperties_flat@350387072(20.557MiB serialized bytes, 454317 ops, 14%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:15] 2016-01-25 19:46:09,393 Memtable.java:382 - Completed flushing /var/cassandra/data/itb/assetproperties_flat-e83359a0c34411e593abdda945619e28/itb-assetproperties_flat-tmp-ka-32-Data.db (5.249MiB) for commitlog position ReplayPosition(segmentId=1453767930194, position=15188257)
Second node - 2020 ops
INFO [BatchlogTasks:1] 2016-01-25 19:46:33,961 ColumnFamilyStore.java:905 - Enqueuing flush of batchlog: 4923957 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:22] 2016-01-25 19:46:33,962 Memtable.java:347 - Writing Memtable-batchlog@796821497(4.453MiB serialized bytes, 2020 ops, 0%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:22] 2016-01-25 19:46:33,963 Memtable.java:393 - Completed flushing /var/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/system-batchlog-tmp-ka-11-Data.db; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1453767955411, position=18567563)
Has anyone encountered a similar issue, and if so, what was the fix?
Would it be advisable to just take the second node offline, run the migration with just the first node, and then run nodetool repair afterwards to get the second node back in sync?
Edit
Answers to questions from comments:
I'm using the DataStax Java driver, and have a server-side task (Quartz job) that uses the ORM layer (Hibernate) to look up the next chunk of data to migrate, write it into Cassandra, and then purge it from the SQL database. I'm getting a connection to Cassandra using the following code:
public static Session getCassandraSession(String keyspace) {
    Session session = clusterSessions.get(keyspace);
    if (session != null && !session.isClosed()) {
        //can use the cached session
        return session;
    }
    //create a new session for the specified keyspace
    Cluster cassandraCluster = getCluster();
    session = cassandraCluster.connect(keyspace);
    //cache and return the session
    clusterSessions.put(keyspace, session);
    return session;
}
private static Cluster getCluster() {
    if (cluster != null && !cluster.isClosed()) {
        //can use the cached cluster
        return cluster;
    }
    //configure socket options
    SocketOptions options = new SocketOptions();
    options.setConnectTimeoutMillis(30000);
    options.setReadTimeoutMillis(300000);
    options.setTcpNoDelay(true);
    //spin up a fresh connection
    cluster = Cluster.builder().addContactPoint(Configuration.getCassandraHost()).withPort(Configuration.getCassandraPort())
            .withCredentials(Configuration.getCassandraUser(), Configuration.getCassandraPass()).withSocketOptions(options).build();
    //log the cluster details for confirmation
    Metadata metadata = cluster.getMetadata();
    LOG.debug("Connected to Cassandra cluster: " + metadata.getClusterName());
    for (Host host : metadata.getAllHosts()) {
        LOG.debug("Datacenter: " + host.getDatacenter() + "; Host: " + host.getAddress() + "; Rack: " + host.getRack());
    }
    return cluster;
}
The part with the SocketOptions is a recent addition, as the latest timeout error sounded like it was coming from the Java/client side rather than from within Cassandra itself.
Each batch inserts no more than 200 records. Typical values are closer to 100.
Both nodes have the same specs:
Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
32GB RAM
256GB SSD (primary), 2TB HDD (backups), both in RAID-1 configurations
First node:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 58155 0 0
RequestResponseStage 0 0 655104 0 0
MutationStage 0 0 259151 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 58041 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 80 0 0
MemtableReclaimMemory 0 0 80 0 0
PendingRangeCalculator 0 0 3 0 0
MemtablePostFlush 0 0 418 0 0
CompactionExecutor 0 0 8979 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 2 0 0
Native-Transport-Requests 1 0 1175338 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Second node:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 55803 0 0
RequestResponseStage 0 0 1 0 0
MutationStage 0 0 733828 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 56623 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 394 0 0
MemtableReclaimMemory 0 0 394 0 0
PendingRangeCalculator 0 0 2 0 0
MemtablePostFlush 0 0 428 0 0
CompactionExecutor 0 0 8883 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 0 1 0 0
Native-Transport-Requests 0 0 70 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
The output of nodetool ring was very long. Here's a nodetool status instead:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 204.11.xxx.1 754.66 MB 1024 ? 8cf373d8-0b3e-4fd3-9e63-fdcdd8ce8cd4 RAC1
UN 208.66.xxx.2 767.78 MB 1024 ? 42e1f336-84cb-4260-84df-92566961a220 RAC2
I increased all of Cassandra's timeout values by a factor of 10, and also set the Java driver's read timeout settings to match, and now I'm up to roughly 8m inserts with no issues. In theory, if the issue scales linearly with the timeout values, I should be good until around 15m inserts (which is at least good enough that I don't need to constantly babysit the migration process waiting for each new error).

1) CL.ANY is almost always a bad idea - you're writing faster than the server can even acknowledge the writes.
2) 1024 tokens is silly, but not the cause of the problems. You also can't change it once the node is live in the cluster.
3) You're masking your problems by increasing the timeouts - Cassandra on that hardware can easily handle 100k writes/second.
4) Batches are meant for atomicity; you're probably misusing them, which is adding to the headache.
5) You've tuned all sorts of knobs without understanding them. Cassandra is different from a relational DB.
6) The right way to do data loads of this nature is with CQLSSTableWriter and the bulk-load interface (see the sketch after this list). Details at http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
7) When the client starts throwing errors, what's in the server logs? What's the JVM doing? Are you seeing GC pauses? Is the server idle? CPU maxed? Disks maxed?
8) There exist some very good tuning guides - consider reading and understanding https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
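To make point 6 concrete, here is a minimal sketch of offline SSTable generation with CQLSSTableWriter. The keyspace, table, and columns are placeholders loosely taken from the logs above; adapt the schema and INSERT statement to the real table before using it.

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;
import java.util.UUID;

public class BulkLoadSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder schema - replace with the real table definition
        String schema = "CREATE TABLE itb.assetproperties_flat ("
                + "asset_id uuid, name text, value text, "
                + "PRIMARY KEY (asset_id, name))";
        String insert = "INSERT INTO itb.assetproperties_flat (asset_id, name, value) VALUES (?, ?, ?)";

        File outputDir = new File("/tmp/sstables/itb/assetproperties_flat");
        outputDir.mkdirs();

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(outputDir)
                .forTable(schema)
                .using(insert)
                .build();

        // In the real migration this loop would iterate over the rows pulled from the SQL database
        for (int i = 0; i < 1000; i++) {
            writer.addRow(UUID.randomUUID(), "name-" + i, "value-" + i);
        }
        writer.close();

        // Stream the generated SSTables into the cluster afterwards, e.g.:
        //   sstableloader -d <contact-point> /tmp/sstables/itb/assetproperties_flat
    }
}

This writes SSTables locally without going through the coordinator at all, so none of the request timeouts discussed above apply.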

Okay, so I was able to get the timeout errors to stop by doing two things. First, I increased Cassandra's timeout values on both hosts, as follows:
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 30000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 30000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 30000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 30000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 20000
I suspect those values are unnecessarily large, but those are what I had in place when everything started working.
The second part of the solution was to adjust the client timeout in my Java code, as follows:
//configure socket options
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(300000);
options.setTcpNoDelay(true);
//spin up a fresh connection (using the SocketOptions set up above)
cluster = Cluster.builder().addContactPoint(Configuration.getCassandraHost()).withPort(Configuration.getCassandraPort())
.withCredentials(Configuration.getCassandraUser(), Configuration.getCassandraPass()).withSocketOptions(options).build();
With those two changes, the timeout errors stopped and the data migration completed without issue.
As @MarcintheCloud rightly points out in the comments above, increasing the timeout values may only have the effect of masking the underlying problem. But that's good enough in my case since 1) the underlying problem only surfaces under very high load, 2) I only need to run the migration process once, and 3) once the data has been migrated, the actual load levels are orders of magnitude lower than what's experienced during the migration.
However, understanding the underlying cause still seems worthwhile. So what was it? Well I've got two theories:
As @MarcintheCloud posits, perhaps 1024 is too many tokens to reasonably use with Cassandra. And perhaps as a consequence of that the deployment gets a bit flaky under heavy load.
My alternative theory has to do with network chatter between the two nodes. In my deployment, the first node runs the app-server instance, the first Cassandra instance, and the primary SQL database. The second node runs the second Cassandra instance and also a replica SQL database that is kept in sync with the primary database in near-real-time.
Now, the migration process essentially does two things concurrently; it writes data into Cassandra, and it deletes data from the SQL database. Both of those actions generate changesets that need to propagate over the network to the second node.
So my theory is that if changes are happening quickly enough on the first node (since the SSD does allow very high IO throughput), the network transfers of the SQL and Cassandra changelogs (and/or the subsequent IO ops on the second node) may occasionally contend with each other, introducing additional latency into the replication process(es) and potentially leading to timeouts. It seems plausible that with enough contention, one process or the other might get blocked for several seconds at a time, which is enough to trigger timeout errors at Cassandra's default settings.
Those are the most plausible theories I can think of, though I have no real way of testing to confirm which (if any) is correct.

Related

(lettuce) READONLY You can't write against a read only slave

I need some help. Our service uses lettuce 5.1.6, and a total of 22 Docker nodes are deployed.
Whenever the service is deployed, several Docker nodes throw ERROR: READONLY You can't write against a read only slave.
After restarting the problematic Docker nodes, the ERROR no longer appears.
redis server configuration:
8 master 8 slave
stop-writes-on-bgsave-error no
slave-serve-stale-data yes
slave-read-only yes
cluster-enabled yes
cluster-config-file "/data/server/redis-cluster/{port}/conf/node.conf"
lettuce configuration:
ClientResources res = DefaultClientResources.builder()
        .commandLatencyPublisherOptions(
                DefaultEventPublisherOptions.builder()
                        .eventEmitInterval(Duration.ofSeconds(5))
                        .build())
        .build();
redisClusterClient = RedisClusterClient.create(res, REDIS_CLUSTER_URI);
redisClusterClient.setOptions(
        ClusterClientOptions.builder()
                .maxRedirects(99)
                .socketOptions(SocketOptions.builder().keepAlive(true).build())
                .topologyRefreshOptions(
                        ClusterTopologyRefreshOptions.builder()
                                .enableAllAdaptiveRefreshTriggers()
                                .build())
                .build());
RedisAdvancedClusterCommands<String, String> command = redisClusterClient.connect().sync();
command.setex("some key", 18000, "some value");
The Exception that appears:
io.lettuce.core.RedisCommandExecutionException: READONLY You can't write against a read only slave.
at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135)
at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:122)
at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:123)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy135.setex(Unknown Source)
at com.xueqiu.infra.redis4.RedisClusterImpl.lambda$setex$164(RedisClusterImpl.java:1489)
at com.xueqiu.infra.redis4.RedisClusterImpl$$Lambda$1422/1017847781.apply(Unknown Source)
at com.xueqiu.infra.redis4.RedisClusterImpl.execute(RedisClusterImpl.java:526)
at com.xueqiu.infra.redis4.RedisClusterImpl.executeTotal(RedisClusterImpl.java:491)
at com.xueqiu.infra.redis4.RedisClusterImpl.setex(RedisClusterImpl.java:1489)
With distributed middleware, the client side typically manages the partitioning/sharding relationships itself.
For lettuce, that means managing the slot mapping of the Redis Cluster: it keeps an array called slotCache that caches, for each slot, the node responsible for it.
When a key needs to be read or written, the client computes its slot with CRC16 and then looks up the owning node in that cache.
On the server side, each Redis Cluster node records the slot-to-node mapping in its local node.conf.
That metadata is broadcast through gossip (ping/pong exchanges) until it becomes eventually consistent across the cluster.
However, if the slot mapping on the server side is wrong, the client will cache and use that wrong data.
That is exactly what happened here: some server nodes mapped slots to a slave, so the slots cached by the client pointed at the slave node, read and write requests were sent to the slave, and the error resulted.
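As a quick diagnostic (a hypothetical sketch reusing the redisClusterClient built above), you can ask the client which node it currently believes owns a given key's slot:

import io.lettuce.core.cluster.SlotHash;
import io.lettuce.core.cluster.models.partitions.Partitions;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

int slot = SlotHash.getSlot("some key");                  // CRC16-based slot of the key
Partitions partitions = redisClusterClient.getPartitions();
RedisClusterNode owner = partitions.getPartitionBySlot(slot);
// if the flags show SLAVE for a key you are writing, the client is holding a bad topology
System.out.println("slot " + slot + " -> " + owner.getUri() + " " + owner.getFlags());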
lettuce source code investigation
1. lettuce initialization: Partitions.java
/**
 * Update the partition cache. Updates are necessary after the partition details have changed.
 */
public void updateCache() {
    synchronized (partitions) {
        if (partitions.isEmpty()) {
            this.slotCache = EMPTY;
            this.nodeReadView = Collections.emptyList();
            return;
        }
        RedisClusterNode[] slotCache = new RedisClusterNode[SlotHash.SLOT_COUNT];
        List<RedisClusterNode> readView = new ArrayList<>(partitions.size());
        for (RedisClusterNode partition : partitions) {
            readView.add(partition);
            for (Integer integer : partition.getSlots()) {
                slotCache[integer.intValue()] = partition;
            }
        }
        this.slotCache = slotCache;
        this.nodeReadView = Collections.unmodifiableCollection(readView);
    }
}
2. lettuce command dispatch: PooledClusterConnectionProvider.java
private CompletableFuture<StatefulRedisConnection<K, V>> getWriteConnection(int slot) {
    CompletableFuture<StatefulRedisConnection<K, V>> writer; // avoid races when reconfiguring partitions.
    synchronized (stateLock) {
        writer = writers[slot];
    }
    if (writer == null) {
        RedisClusterNode partition = partitions.getPartitionBySlot(slot);
        if (partition == null) {
            clusterEventListener.onUncoveredSlot(slot);
            return Futures.failed(new PartitionSelectorException("Cannot determine a partition for slot " + slot + ".",
                    partitions.clone()));
        }
        // Use always host and port for slot-oriented operations. We don't want to get reconnected on a different
        // host because the nodeId can be handled by a different host.
        RedisURI uri = partition.getUri();
        ConnectionKey key = new ConnectionKey(Intent.WRITE, uri.getHost(), uri.getPort());
        ConnectionFuture<StatefulRedisConnection<K, V>> future = getConnectionAsync(key);
        return future.thenApply(connection -> {
            synchronized (stateLock) {
                if (writers[slot] == null) {
                    writers[slot] = CompletableFuture.completedFuture(connection);
                }
            }
            return connection;
        }).toCompletableFuture();
    }
    return writer;
}
The sending principle of lettuce:
When the client starts, it loads the cluster topology and stores the slot-to-node mapping locally in the slotCache array.
When sending a command, it computes the CRC16 of the key, uses the resulting slot to look up the owning node in slotCache, and then obtains a connection to that node.
Note that essentially all middleware in this kind of cluster mode works this way: the client obtains the server's network topology and then computes the mapping logic locally (compare the cross-datacenter performance analysis of Kafka).
redis cluster information troubleshooting
./bin/redis-cli -h 10.10.28.2 -p 25661 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size: 3
cluster_current_epoch:8
cluster_my_epoch:6
cluster_stats_messages_ping_sent:615483
cluster_stats_messages_pong_sent:610194
cluster_stats_messages_meet_sent:3
cluster_stats_messages_fail_sent:8
cluster_stats_messages_auth-req_sent:5
cluster_stats_messages_auth-ack_sent:2
cluster_stats_messages_update_sent:4
cluster_stats_messages_sent:1225699
cluster_stats_messages_ping_received:610188
cluster_stats_messages_pong_received:603593
cluster_stats_messages_meet_received:2
cluster_stats_messages_fail_received:4
cluster_stats_messages_auth-req_received:2
cluster_stats_messages_auth-ack_received:2
cluster_stats_messages_received:1213791
./bin/redis-cli -h 10.10.28.2 -p 25661 cluster nodes
5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662#35662 slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921769000 15 connected
79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651#35651 master - 0 1595921770000 18 connected 4096-6143
2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661#35661 myself,master - 0 1595921759000 15 connected 10240-12287
6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686#35686 master - 0 1595921769000 14 connected 12288-14335
5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656#35656 master - 0 1595921771000 13 connected 14336-16383
f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652#35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921769000 18 connected
f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666#35666 master - 0 1595921769870 16 connected 8192-10239
f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687#35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921763000 14 connected
f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667#35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921770870 16 connected
ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681#35681 master - 0 1595921773876 0 connected 0-2047
19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677#35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921768000 17 connected
d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671#35671 master - 0 1595921773000 6 connected 2048-4095
068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682#35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921771872 12 connected
e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672#35672 slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921770000 6 connected
f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657#35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921762000 13 connected
5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676#35676 master - 0 1595921772873 17 connected 6144-8191
./bin/redis-cli -h 10.10.28.3 -p 25662 cluster nodes
f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652#35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921741000 18 connected
f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666#35666 master - 0 1595921744000 16 connected 8192-10239
f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657#35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921740000 13 connected
5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676#35676 master - 0 1595921743127 17 connected 6144-8191
79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651#35651 master - 0 1595921743000 18 connected 4096-6143
2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661#35661 master - 0 1595921744129 15 connected 10240-12287
f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667#35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921740000 16 connected
f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687#35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921745130 14 connected
5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662#35662 myself,slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921733000 5 connected 0-1820
068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682#35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921744000 12 connected
d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671#35671 master - 0 1595921739000 6 connected 2048-4095
5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656#35656 master - 0 1595921742000 13 connected 14336-16383
ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681#35681 master - 0 1595921746131 0 connected 1821-2047
6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686#35686 master - 0 1595921747133 14 connected 12288-14335
19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677#35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921742126 17 connected
e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672#35672 slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921745000 6 connected
./bin/redis-cli -h 10.10.49.9 -p 25672 cluster nodes
d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671#35671 master - 0 1595921829000 6 connected 2048-4095
79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651#35651 master - 0 1595921830000 18 connected 4096-6143
ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681#35681 master - 0 1595921830719 0 connected 0-1820
f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687#35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921827000 14 connected
5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676#35676 master - 0 1595921827000 17 connected 6144-8191
2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661#35661 master - 0 1595921822000 15 connected 10240-12287
5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662#35662 slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921828714 15 connected
068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682#35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921832721 12 connected
6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686#35686 master - 0 1595921825000 14 connected 12288-14335
f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652#35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921830000 18 connected
19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677#35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921829716 17 connected
e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672#35672 myself,slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921832000 4 connected 1821-2047
f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667#35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921826711 16 connected
f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657#35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921829000 13 connected
f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666#35666 master - 0 1595921831720 16 connected 8192-10239
5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656#35656 master - 0 1595921827714 13 connected 14336-16383
./bin/redis-trib.rb check 10.10.30.9:25671
>>> Performing Cluster Check (using node 10.10.30.9:25671)
M: d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671
slots:2048-4095 (2048 slots) master
1 additional replica(s)
S: e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672
slots: (0 slots) slave
········
········
S: f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657
slots: (0 slots) slave
replicates 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757
M: 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656
slots:14336-16383 (2048 slots) master
1 additional replica(s)
[ERR] Nodes don't agree about configuration!
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Be suspicious of everything and stay vigilant; diligence makes up for weaknesses.
At the beginning I assumed the cluster was healthy, because most nodes behaved normally, only a few showed the problem, and restarting them made the error go away.
Initially the information was checked through the healthy nodes and no problems were found. Even though the logs showed early on that the topology information reported by a few nodes was inconsistent, it was hard to see the problem without knowing the client's slot mapping.
Comparing the outputs showed that some slots were mapped to slave nodes.
After running the check it became clear that there was a problem with the cluster configuration, which is also explained in the related open-source issue below.
I saw the same issue as you and tried to investigate it. I figured out that it is caused by lettuce.
When we run a Redis command, lettuce analyzes it and decides which Redis endpoint to send it to.
If it is a READ command, it is sent to a slave node (by setting ReadFrom.Any_Rep; note that other ReadFrom options may change this behavior).
If it is a WRITE command, it is sent to the master node.
To determine which commands are READ commands, lettuce uses the ReadOnlyCommands class, which lists all read commands.
In my case, I used the EVAL command to write a key value to Redis, but lettuce classified it as a READ command and sent it to a slave node, so the exception happened.
So please check the ReadOnlyCommands class and make sure your write commands are not included there. This was a mistake by the Lettuce team, and they have already fixed it in newer versions.
In your version, ReadOnlyCommands for cluster settings is
class ReadOnlyCommands {

    private static final Set<CommandType> READ_ONLY_COMMANDS = EnumSet.noneOf(CommandType.class);

    static {
        for (CommandName commandNames : CommandName.values()) {
            READ_ONLY_COMMANDS.add(CommandType.valueOf(commandNames.name()));
        }
    }

    /**
     * @param protocolKeyword must not be {@literal null}.
     * @return {@literal true} if {@link ProtocolKeyword} is a read-only command.
     */
    public static boolean isReadOnlyCommand(ProtocolKeyword protocolKeyword) {
        return READ_ONLY_COMMANDS.contains(protocolKeyword);
    }

    /**
     * @return an unmodifiable {@link Set} of {@link CommandType read-only} commands.
     */
    public static Set<CommandType> getReadOnlyCommands() {
        return Collections.unmodifiableSet(READ_ONLY_COMMANDS);
    }

    enum CommandName {
        ASKING, BITCOUNT, BITPOS, CLIENT, COMMAND, DUMP, ECHO, EVAL, EVALSHA, EXISTS, //
        GEODIST, GEOPOS, GEORADIUS, GEORADIUSBYMEMBER, GEOHASH, GET, GETBIT, //
        GETRANGE, HEXISTS, HGET, HGETALL, HKEYS, HLEN, HMGET, HSCAN, HSTRLEN, //
        HVALS, INFO, KEYS, LINDEX, LLEN, LRANGE, MGET, PFCOUNT, PTTL, //
        RANDOMKEY, READWRITE, SCAN, SCARD, SCRIPT, //
        SDIFF, SINTER, SISMEMBER, SMEMBERS, SRANDMEMBER, SSCAN, STRLEN, //
        SUNION, TIME, TTL, TYPE, ZCARD, ZCOUNT, ZLEXCOUNT, ZRANGE, //
        ZRANGEBYLEX, ZRANGEBYSCORE, ZRANK, ZREVRANGE, ZREVRANGEBYLEX, ZREVRANGEBYSCORE, ZREVRANK, ZSCAN, ZSCORE, //
        // Pub/Sub commands are no key-space commands so they are safe to execute on slave nodes
        PUBLISH, PUBSUB, PSUBSCRIBE, PUNSUBSCRIBE, SUBSCRIBE, UNSUBSCRIBE
    }
}
So you can check this easily.
Solution: upgrading Lettuce is the best way to fix it. Alternatively, you can try to override this setting.
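If upgrading is not immediately possible, one mitigation to try (assuming your connection currently uses a replica-reading ReadFrom policy) is to pin all commands back to the master nodes. This is only a sketch of the idea, not a guaranteed fix:

import io.lettuce.core.ReadFrom;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

// assuming redisClusterClient is the client built earlier in the question
StatefulRedisClusterConnection<String, String> connection = redisClusterClient.connect();
connection.setReadFrom(ReadFrom.MASTER);   // route reads to master nodes only
RedisAdvancedClusterCommands<String, String> commands = connection.sync();
commands.setex("some key", 18000, "some value");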

The messages are not getting deleted from the file system when deleteRecords Kafka Admin Client Java API is invoked

I was trying to delete messages from my Kafka topic using the Java Admin Client API's deleteRecords method. These are the steps that I tried:
1. I pushed 20000 records to my TEST-DELETE topic
2. Started a console consumer and consumed all the messages
3. Invoked my Java program to delete all those 20k messages
4. Started another console consumer with a different group id. This consumer is not receiving any of the deleted messages
When I checked the file system, I could still see all those 20k records occupying disk space. My intention is to delete those records from the file system forever.
My Topic configuration is given below along with server.properties settings
Topic:TEST-DELETE PartitionCount:4 ReplicationFactor:1 Configs:cleanup.policy=delete
Topic: TEST-DELETE Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Topic: TEST-DELETE Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: TEST-DELETE Partition: 2 Leader: 0 Replicas: 0 Isr: 0
Topic: TEST-DELETE Partition: 3 Leader: 0 Replicas: 0 Isr: 0
log.retention.hours=24
log.retention.check.interval.ms=60000
log.cleaner.delete.retention.ms=60000
file.delete.delay.ms=60000
delete.retention.ms=60000
offsets.retention.minutes=5
offsets.retention.check.interval.ms=60000
log.cleaner.enable=true
log.cleanup.policy=compact,delete
My delete code is given below
public void deleteRecords(Map<String, Map<Integer, Long>> allTopicPartions) {
    Map<TopicPartition, RecordsToDelete> recordsToDelete = new HashMap<>();
    allTopicPartions.entrySet().forEach(topicDetails -> {
        String topicName = topicDetails.getKey();
        Map<Integer, Long> value = topicDetails.getValue();
        value.entrySet().forEach(partitionDetails -> {
            if (partitionDetails.getValue() != 0) {
                recordsToDelete.put(new TopicPartition(topicName, partitionDetails.getKey()),
                        RecordsToDelete.beforeOffset(partitionDetails.getValue()));
            }
        });
    });
    DeleteRecordsResult deleteRecords = this.client.deleteRecords(recordsToDelete);
    Map<TopicPartition, KafkaFuture<DeletedRecords>> lowWatermarks = deleteRecords.lowWatermarks();
    lowWatermarks.entrySet().forEach(entry -> {
        try {
            logger.info(entry.getKey().topic() + " " + entry.getKey().partition() + " "
                    + entry.getValue().get().lowWatermark());
        } catch (Exception ex) {
        }
    });
}
The output of my java program is given below
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 1 5000
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 0 5000
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 3 5000
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 2 5000
My intention is to delete the consumed records from the file system, as I am working with limited storage for my Kafka broker.
I would like some help with the doubts below:
I was under the impression that deleteRecords would remove the messages from the file system too, but it looks like I got that wrong!!
How long will those deleted records be present in the log directory?
Is there any specific configuration I need to use in order to remove the records from the file system once the deleteRecords API is invoked?
Appreciate your help
Thanks
The recommended approach to handle this is to set retention.ms and related configuration values for the topics you're interested in. That way, you can define how long Kafka will store your data until it deletes it, making sure all your downstream consumers have had the chance to pull down the data before it's deleted from the Kafka cluster.
If, however, you still want to force Kafka to delete based on bytes, there's the log.retention.bytes and retention.bytes configuration values. The first one is a cluster-wide setting, the second one is the topic-specific setting, which by default takes whatever the first one is set to, but you can still override it per topic. The retention.bytes number is enforced per partition, so you should multiply it by the total number of topic partitions.
Be aware, however, that if you have a runaway producer that suddenly starts generating a lot of data, and you have a hard byte limit set, you might wipe out entire days' worth of data in the cluster and be left with only the last few minutes of data, maybe before even valid consumers can pull it down. This is why it's much better to set your Kafka topics to have time-based retention, not byte-based.
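As a sketch of how those topic-level overrides could be applied from the same Java AdminClient already used for deleteRecords (the 24h and ~1 GiB values are placeholders; note that alterConfigs replaces the topic's whole override set, and newer clients prefer incrementalAlterConfigs):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collections;

public class TopicRetentionExample {

    public static void setRetention(AdminClient client, String topic) throws Exception {
        ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topic);
        Config config = new Config(Arrays.asList(
                new ConfigEntry("retention.ms", "86400000"),      // keep data for 24 hours
                new ConfigEntry("retention.bytes", "1073741824")  // ~1 GiB per partition
        ));
        client.alterConfigs(Collections.singletonMap(resource, config)).all().get();
    }
}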
You can find the configuration properties and their explanation in the official Kafka docs: https://kafka.apache.org/documentation/

Hadoop node is not active

I have a 1 x master node and 1 x slave node setup.
My issue is with running the MapReduce processing. The slave node doesn't seem to be working. Can anyone provide help on how to check, change, and ensure the slave is working?
The config files info can be found at the URL below too
https://drive.google.com/file/d/1ULEe6k2zYnfQDQUQIbz_xR29WgT1DJhB/view
Here are my observations:
1) When I check the CPU resource utilization, the slave doesn't seem to be working and its CPU is at 0% while running the MapReduce job, while the master is at 44% CPU. Refer to the attachment.
2) When I run the dfs report it shows 2 live nodes, but the cluster web UI shows only 1. Refer to the attachment and below.
3) The total processing time of the MapReduce job is the same with or without the slave.
-------------------------------------------------
Live datanodes (2):
Name: 192.168.249.128:9866 (node-master)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 174785723 (166.69 MB)
Non DFS Used: 60308293 (57.51 MB)
DFS Remaining: 20352647168 (18.95 GB)
DFS Used%: 0.85%
DFS Remaining%: 98.86%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:39 PDT 2018
Last Block Report: Tue Oct 23 11:07:32 PDT 2018
Num of Blocks: 93
Name: 192.168.249.129:9866 (node1)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 85743 (83.73 KB)
Non DFS Used: 33775889 (32.21 MB)
DFS Remaining: 20553879552 (19.14 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:38 PDT 2018
Last Block Report: Tue Oct 23 11:03:59 PDT 2018
Num of Blocks: 4
You're showing datanodes with the dfs report, not the NodeManagers that actually process the data. In the YARN UI, you will want to take note of the "Active Nodes" counter, which in your case is 1. That would make sense if the master is a NameNode and ResourceManager while the slave is a DataNode and NodeManager.
Other than that, if you have a non-splittable file, for example a ZIP, or your file is smaller than the block size (128 MB by default), then only one mapper will process it. Plus, it's not guaranteed that mappers (or reducers) will be distributed evenly over all available resources.
Outside of a learning environment, though, 40 GB of storage and 8 GB of RAM would be better spent on multi-threading rather than distributed computing (or on a proper database, i.e. parse the files and load them into a queryable store). Or use Spark or Pig, which don't require Hadoop but are much easier to work with than MapReduce.

How to check time on all the nodes in hadoop cluster

I am running a Spark job on a Hadoop cluster, and the job fails occasionally with the exception:
exception : Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, begin > end in range (begin, end): (1494159709088, 1494159706071)
The job ran successfully on the rerun.
After searching on Google, this might be clock skew between the Oozie server host and the launcher host.
Is there a way I can check if there is clock skew, or how can I check whether the time on all the nodes is in sync?
Thanks
ntptime command output :
ntp_gettime() returns code 0 (OK)
time dcb9b19b.a2328f64 Sun, May 7 2017 14:45:47.633, (.633584090),
maximum error 434990 us, estimated error 815 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
modes 0x0 (),
offset 176.871 us, frequency -25.666 ppm, interval 1 s,
maximum error 434990 us, estimated error 815 us,
status 0x2001 (PLL,NANO),
time constant 10, precision 0.001 us, tolerance 500 ppm,
ntpstat command output :
synchronised to NTP server (174.68.168.57) at stratum 3
time correct to within 77 ms
polling server every 1024 s

CPU load with play framework

For a few days now, on a system that has been in development for about a year, I have seen a constant CPU load from the play! server. I have two servers, one active and one as a hot spare. In the past, the hot-spare server showed no load, or a negligible load. But now it consumes a constant 50-110% CPU (using top on Linux).
Is there an easy way to find out what the cause is? I don't see this behavior on my MacBook when debugging (usually 0.1-1%). This is something that has only happened in the past few days as far as I am aware.
This is a status print of the hot spare. As can be seen, no controllers are queried apart from the scheduled tasks (which do no work on this server due to a flag, but are still launched):
~ _ _
~ _ __ | | __ _ _ _| |
~ | '_ \| |/ _' | || |_|
~ | __/|_|\____|\__ (_)
~ |_| |__/
~
~ play! 1.2.4, http://www.playframework.org
~ framework ID is prod-frontend
~
~ Status from http://localhost:xxxx/@status,
~
Java:
~~~~~
Version: 1.6.0_26
Home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre
Max memory: 64880640
Free memory: 11297896
Total memory: 29515776
Available processors: 2
Play framework:
~~~~~~~~~~~~~~~
Version: 1.2.4
Path: /opt/play
ID: prod-frontend
Mode: PROD
Tmp dir: /xxx/tmp
Application:
~~~~~~~~~~~~
Path: /xxx/server
Name: iDoms Server
Started at: 07/01/2012 12:05
Loaded modules:
~~~~~~~~~~~~~~
secure at /opt/play/modules/secure
paginate at /xxx/server/modules/paginate-0.14
Loaded plugins:
~~~~~~~~~~~~~~
0:play.CorePlugin [enabled]
100:play.data.parsing.TempFilePlugin [enabled]
200:play.data.validation.ValidationPlugin [enabled]
300:play.db.DBPlugin [enabled]
400:play.db.jpa.JPAPlugin [enabled]
450:play.db.Evolutions [enabled]
500:play.i18n.MessagesPlugin [enabled]
600:play.libs.WS [enabled]
700:play.jobs.JobsPlugin [enabled]
100000:play.plugins.ConfigurablePluginDisablingPlugin [enabled]
Threads:
~~~~~~~~
Thread[Reference Handler,10,system] WAITING
Thread[Finalizer,8,system] WAITING
Thread[Signal Dispatcher,9,system] RUNNABLE
Thread[net.sf.ehcache.CacheManager@449278d5,5,main] WAITING
Thread[Timer-0,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,main] TIMED_WAITING
Thread[jobs-thread-1,5,main] TIMED_WAITING
Thread[jobs-thread-2,5,main] TIMED_WAITING
Thread[jobs-thread-3,5,main] TIMED_WAITING
Thread[New I/O server boss #1 ([id: 0x7065ec20, /0:0:0:0:0:0:0:0:9001]),5,main] RUNNABLE
Thread[DestroyJavaVM,5,main] RUNNABLE
Thread[New I/O server worker #1-3,5,main] RUNNABLE
Requests execution pool:
~~~~~~~~~~~~~~~~~~~~~~~~
Pool size: 0
Active count: 0
Scheduled task count: 0
Queue size: 0
Monitors:
~~~~~~~~
controllers.ReaderJob.doJob(), ms. -> 114 hits; 4.1 avg; 0.0 min; 463.0 max;
controllers.MediaCoderProcess.doJob(), ms. -> 4572 hits; 0.1 avg; 0.0 min; 157.0 max;
controllers.Bootstrap.doJob(), ms. -> 1 hits; 0.0 avg; 0.0 min; 0.0 max;
Datasource:
~~~~~~~~~~~
Jdbc url: jdbc:mysql://xxxx
Jdbc driver: com.mysql.jdbc.Driver
Jdbc user: xxxx
Jdbc password: xxxx
Min pool size: 1
Max pool size: 30
Initial pool size: 3
Checkout timeout: 5000
Jobs execution pool:
~~~~~~~~~~~~~~~~~~~
Pool size: 3
Active count: 0
Scheduled task count: 4689
Queue size: 3
Scheduled jobs (4):
~~~~~~~~~~~~~~~~~~~~~~~~~~
controllers.APNSFeedbackJob run every 24h. (has never run)
controllers.Bootstrap run at application start. (last run at 07/01/2012 12:05:32)
controllers.MediaCoderProcess run every 15s. (last run at 07/02/2012 07:10:46)
controllers.ReaderJob run every 600s. (last run at 07/02/2012 07:05:36)
Waiting jobs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
controllers.MediaCoderProcess will run in 2 seconds
controllers.APNSFeedbackJob will run in 17672 seconds
controllers.ReaderJob will run in 276 seconds
If your server is running under Linux, you may have been hit by the leap second bug which appeared last weekend.
This bug affects the Linux kernel (thread management), so applications that use threads (such as the JVM, MySQL, etc.) may consume a high CPU load.
If you are using JDK 1.7 this should be easy, as they added this capability; have a look at my other related answer -> How to monitor the computer's cpu, memory, and disk usage in Java?
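For reference, a minimal sketch of that approach on JDK 7+ using the com.sun.management extension of OperatingSystemMXBean (the returned values are fractions between 0.0 and 1.0, and may be -1.0 until the first sample is available):

import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class CpuLoadProbe {

    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        for (int i = 0; i < 10; i++) {
            // JVM process CPU vs. whole-machine CPU, sampled once per second
            System.out.printf("process CPU: %.1f%%  system CPU: %.1f%%%n",
                    os.getProcessCpuLoad() * 100, os.getSystemCpuLoad() * 100);
            Thread.sleep(1000);
        }
    }
}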
