Slow connection to Cassandra - java

I have trouble connecting to my 3-node Cassandra cluster via the DataStax PHP and Java drivers.
Especially for the PHP driver it is crucial that I can connect quickly, to improve the loading times of my website.
How can I debug this, or what is the reason?
Java output shows this:
09:59:40,284 [main] DEBUG - com.datastax.driver.NEW_NODE_DELAY_SECONDS is undefined, using default value 1
09:59:40,284 [main] DEBUG - com.datastax.driver.NON_BLOCKING_EXECUTOR_SIZE is undefined, using default value 4
09:59:40,297 [main] DEBUG - com.datastax.driver.NOTIF_LOCK_TIMEOUT_SECONDS is undefined, using default value 60
09:59:40,357 [main] DEBUG - Starting new cluster with contact points [/XXX.XXX.XXX.XXX:9042, /XXX.XXX.XXX.YYY:9042, /XXX.XXX.XXX.ZZZ:9042]
09:59:40,402 [main] DEBUG - Using SLF4J as the default logging framework
09:59:40,489 [main] DEBUG - java.nio.Buffer.address: available
09:59:40,490 [main] DEBUG - sun.misc.Unsafe.theUnsafe: available
09:59:40,490 [main] DEBUG - sun.misc.Unsafe.copyMemory: available
09:59:40,490 [main] DEBUG - java.nio.Bits.unaligned: true
09:59:40,492 [main] DEBUG - Java version: 8
09:59:40,492 [main] DEBUG - -Dio.netty.noUnsafe: false
09:59:40,492 [main] DEBUG - sun.misc.Unsafe: available
09:59:40,492 [main] DEBUG - -Dio.netty.noJavassist: false
09:59:40,665 [main] DEBUG - Javassist: available
09:59:40,665 [main] DEBUG - -Dio.netty.tmpdir: /var/folders/4y/t4b47lbn1zjbjpb6x09l30wm0000gn/T (java.io.tmpdir)
09:59:40,666 [main] DEBUG - -Dio.netty.bitMode: 64 (sun.arch.data.model)
09:59:40,666 [main] DEBUG - -Dio.netty.noPreferDirect: false
09:59:40,708 [main] DEBUG - com.datastax.driver.FORCE_NIO is undefined, using default value false
09:59:40,710 [main] INFO - Did not find Netty's native epoll transport in the classpath, defaulting to NIO.
09:59:40,714 [main] DEBUG - -Dio.netty.eventLoopThreads: 8
09:59:40,723 [main] DEBUG - -Dio.netty.noKeySetOptimization: false
09:59:40,723 [main] DEBUG - -Dio.netty.selectorAutoRebuildThreshold: 512
09:59:40,747 [main] DEBUG - -Dio.netty.leakDetectionLevel: simple
09:59:41,035 [main] DEBUG - com.datastax.driver.DISABLE_COALESCING is undefined, using default value false
09:59:41,046 [main] DEBUG - Generated: io.netty.util.internal.__matchers__.com.datastax.driver.core.Message$ResponseMatcher
09:59:41,066 [main] DEBUG - -Dio.netty.allocator.numHeapArenas: 4
09:59:41,066 [main] DEBUG - -Dio.netty.allocator.numDirectArenas: 4
09:59:41,066 [main] DEBUG - -Dio.netty.allocator.pageSize: 8192
09:59:41,066 [main] DEBUG - -Dio.netty.allocator.maxOrder: 11
09:59:41,067 [main] DEBUG - -Dio.netty.allocator.chunkSize: 16777216
09:59:41,067 [main] DEBUG - -Dio.netty.allocator.tinyCacheSize: 512
09:59:41,067 [main] DEBUG - -Dio.netty.allocator.smallCacheSize: 256
09:59:41,067 [main] DEBUG - -Dio.netty.allocator.normalCacheSize: 64
09:59:41,067 [main] DEBUG - -Dio.netty.allocator.maxCachedBufferCapacity: 32768
09:59:41,067 [main] DEBUG - -Dio.netty.allocator.cacheTrimInterval: 8192
09:59:41,078 [main] DEBUG - Generated: io.netty.util.internal.__matchers__.com.datastax.driver.core.FrameMatcher
09:59:41,082 [main] DEBUG - Generated: io.netty.util.internal.__matchers__.com.datastax.driver.core.Message$RequestMatcher
09:59:41,104 [main] DEBUG - -Dio.netty.initialSeedUniquifier: 0x24d6f22f78c5a924 (took 8 ms)
09:59:41,130 [main] DEBUG - -Dio.netty.allocator.type: unpooled
09:59:41,130 [main] DEBUG - -Dio.netty.threadLocalDirectBufferSize: 65536
09:59:41,197 [cluster1-nio-worker-0] DEBUG - Connection[/XXX.XXX.XXX.YYY:9042-1, inFlight=0, closed=false] Connection opened successfully
09:59:41,218 [cluster1-nio-worker-0] DEBUG - -Dio.netty.recycler.maxCapacity.default: 262144
09:59:41,432 [main] DEBUG - [Control connection] Refreshing node list and token map
09:59:41,518 [main] DEBUG - [Control connection] Refreshing schema
09:59:42,137 [main] DEBUG - [Control connection] Refreshing node list and token map
09:59:42,315 [main] DEBUG - [Control connection] Successfully connected to /XXX.XXX.XXX.YYY:9042
09:59:42,315 [main] INFO - Using data-center name '168' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
09:59:42,315 [main] INFO - New Cassandra host /XXX.XXX.XXX.XXX:9042 added
09:59:42,315 [main] INFO - New Cassandra host /XXX.XXX.XXX.YYY:9042 added
09:59:42,315 [main] INFO - New Cassandra host /XXX.XXX.XXX.ZZZ:9042 added
09:59:42,342 [cluster1-nio-worker-1] DEBUG - Connection[/XXX.XXX.XXX.XXX:9042-2, inFlight=0, closed=false] Connection opened successfully
09:59:42,345 [cluster1-nio-worker-2] DEBUG - Connection[/XXX.XXX.XXX.YYY:9042-1, inFlight=0, closed=false] Connection opened successfully
09:59:42,348 [cluster1-nio-worker-3] DEBUG - Connection[/XXX.XXX.XXX.ZZZ:9042-1, inFlight=0, closed=false] Connection opened successfully
09:59:42,580 [cluster1-nio-worker-2] DEBUG - Added connection pool for /XXX.XXX.XXX.XXX:9042
09:59:42,591 [cluster1-nio-worker-3] DEBUG - Added connection pool for /XXX.XXX.XXX.YYY:9042
09:59:42,609 [cluster1-nio-worker-1] DEBUG - Added connection pool for /XXX.XXX.XXX.ZZZ:9042
As you can see, it takes ~2.5 seconds, which is too slow for my use case.
The same happens with the PHP driver, but I don't have a log for it.
Queries are very fast once the driver is connected; the only issue is the slow connection time. I have set up all three nodes as contact points.
EDIT
Just to clarify: the PHP driver is the problem. I'm wondering why it isn't using pooled/persistent connections. When I call the script twice in a row, every call takes 2-5 seconds; I would expect the second call to reuse the persistent pool. phpinfo() shows persistent clusters & sessions = 0. This is the code I'm using:
$cluster = Cassandra::cluster()
->withContactPoints('XXX.XXX.XXX.XXX', 'XXX.XXX.XXX.YYY', 'XXX.XXX.XXX.ZZZ')
->withCredentials('USERNAME', 'PASSWORD')
->build();
$keyspace = 'myKeyspace';
$session = $cluster->connect($keyspace);
UPDATE
The problem was my network: I had too little bandwidth.

The DataStax drivers are full-featured drivers. They are aware of your cluster topology and cluster state, which requires some expensive operations in the cluster object build stage. It is common for cluster object creation to take multiple seconds (depending on the size of your cluster / number of nodes).
The best practice is not to create the cluster object for every request (that would be extremely inefficient). Instead, build the cluster object once and keep its connections open; when you receive a request from your front end, handle it with the existing cluster object.
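For example, with the Java driver this can look like the following minimal sketch (the class name, keyspace, and contact points are placeholders taken from the question): the Cluster and Session are built once at application startup and shared by every request handler.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Built once at application startup and shared by all request handlers.
public final class CassandraConnector {

    private static final Cluster CLUSTER = Cluster.builder()
            .addContactPoints("XXX.XXX.XXX.XXX", "XXX.XXX.XXX.YYY", "XXX.XXX.XXX.ZZZ")
            .build();

    // connect() is the expensive part; do it once and keep the Session open.
    private static final Session SESSION = CLUSTER.connect("myKeyspace");

    private CassandraConnector() {
    }

    public static Session session() {
        return SESSION;
    }
}
Each request then just calls CassandraConnector.session() and runs its queries, paying the connection cost once per application lifetime rather than per request. The same idea applies to the PHP driver: keep the cluster/session persistent across requests instead of rebuilding them on every page load.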
Cassandra will give you very fast response times when used correctly.
For other c* client best practices, take a look at Brian's Cassandra Loader. This is a good reference application as well as a very efficient bulk loader.
Some key best practices: limit the number of in-flight requests if you are using executeAsync (see the sketch below); if you are using batches, make sure they are token-specific (targeting a single partition) to avoid excessive coordination; do not use logged batches unless you need atomicity; and do not dynamically manipulate your schema from your application, to avoid schema mismatches.
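For the first point, here is a minimal sketch (the class name and the limit of 128 are assumptions, tune for your cluster) that caps the number of in-flight executeAsync() requests with a semaphore:
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

import java.util.List;
import java.util.concurrent.Semaphore;

public final class ThrottledWriter {

    private static final int MAX_IN_FLIGHT = 128; // assumed limit

    public static void executeAll(Session session, List<Statement> statements) {
        Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);
        for (Statement stmt : statements) {
            inFlight.acquireUninterruptibly();               // blocks once MAX_IN_FLIGHT requests are pending
            ResultSetFuture future = session.executeAsync(stmt);
            // Release the permit as soon as the request completes (success or failure).
            future.addListener(inFlight::release, Runnable::run);
        }
    }
}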

Related

Elastic APM Java - Transactions & Spans are recorded but not reported to Elastic APM Server or Kibana

I have a standalone Java application and have integrated it successfully with Elastic APM (+ Elasticsearch + Kibana) for capturing telemetry.
Java Version: 8 - OpenJDK
Elastic Agent & Library Version: 1.16
Elastic Search, APM and Kibana Version: 7.7.1
Below are the relevant JVM Options being used:
JAVA_OPTS="$JAVA_OPTS -javaagent:$BASE_HOME/agent-lib/elastic-apm-agent-1.16.0.jar -Delastic.apm.service_name=my-app -Delastic.apm.server_urls=http://elastic-apm-server:8200"
JAVA_OPTS="$JAVA_OPTS -Delastic.apm.application_packages=com,org -Delastic.apm.span_frames_min_duration=-1ms"
JAVA_OPTS="$JAVA_OPTS -Delastic.apm.log_file=$BASE_HOME/logs/apm.log -Delastic.apm.log_level=DEBUG"
I am generating custom transactions and spans using the Tracer/Transaction/Span APIs as suggested in the official documentation.
As per the generated debug logs, these spans and transactions are being captured as expected.
I have also validated in the IDE debugger that the transactions are being captured as expected.
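For reference, the custom instrumentation looks roughly like this (a simplified sketch using the public co.elastic.apm.api API; the method names match the logs further down). Note that a transaction created via the public API is only queued for reporting once end() is called.
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Scope;
import co.elastic.apm.api.Span;
import co.elastic.apm.api.Transaction;

public class ExtractionRequestHandler {

    public void invokeExtraction() {
        Transaction transaction = ElasticApm.startTransaction();
        try (Scope scope = transaction.activate()) {
            transaction.setName("ExtractionRequestHandler#invokeExtraction");
            transaction.setType(Transaction.TYPE_REQUEST);

            Span span = transaction.startSpan();
            try {
                span.setName("BOpFileUtils#authorizeFilePath");
                // ... actual work ...
            } finally {
                span.end();          // spans must be ended explicitly
            }
        } finally {
            transaction.end();       // the transaction is only reported after end()
        }
    }
}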
Problem: the custom transactions are not shown on the Kibana APM dashboard.
However, some out-of-the-box transactions from Quartz (which is used in the application) are shown as expected, which should mean the integration with the Elastic APM Server is fine.
It appears to me that even though the transactions are being captured successfully, they are not reported (sent) to the APM Server.
Refer some relevant apm logs:
2020-07-01 12:33:09.569 [pool-1-thread-1] DEBUG co.elastic.apm.agent.impl.ElasticApmTracer - startTransaction '' 00-d0025079170e4f03698702f4e68be4ac-cf792454fbef1c77-01 (16970dc3) {
2020-07-01 12:33:09.569 [pool-1-thread-1] DEBUG co.elastic.apm.agent.impl.ElasticApmTracer - Activating 'ExtractionRequestHandler#invokeExtraction' 00-d0025079170e4f03698702f4e68be4ac-cf792454fbef1c77-01 (16970dc3) on thread 26
2020-07-01 12:33:09.569 [pool-1-thread-1] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - increment references to 'ExtractionRequestHandler#invokeExtraction' 00-d0025079170e4f03698702f4e68be4ac-cf792454fbef1c77-01 (16970dc3) (2)
2020-07-01 12:33:09.569 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving SPAN event (sequence 86)
2020-07-01 12:33:09.570 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'ExtractionRequestHandler#invokeExtraction' 00-98a1d8f4970d585915eb03a414b7b14c-994dd2823198f1ef-01 (33d448b5) (4)
2020-07-01 12:33:09.570 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'BOpFileUtils#authorizeFilePath' 00-98a1d8f4970d585915eb03a414b7b14c-133200d1793fbaab-01 (67fba8aa) (0)
2020-07-01 12:33:09.570 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving SPAN event (sequence 87)
2020-07-01 12:33:09.570 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'ExtractionRequestHandler#invokeExtraction' 00-98a1d8f4970d585915eb03a414b7b14c-994dd2823198f1ef-01 (33d448b5) (3)
2020-07-01 12:33:09.570 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'SCR#init' 00-98a1d8f4970d585915eb03a414b7b14c-77cf207c33eb24ab-01 (2f1f25c3) (0)
I need help finding out what I am doing wrong, and how to fix it.
I got the answer after posting the same question on the Elastic Support Forum; it was a very prompt response.
This was not a problem on the Elastic APM side, but rather a silly mistake on my side.
Refer to the discussion there for the problem and solution.

Error UNKNOWN_MEMBER_ID occurred while committing offsets for group xxx

With the Kafka Java client library, consuming logs worked for some time, but with the following errors it doesn't work any more:
2016-07-15 19:37:54.609 INFO 4342 --- [main] o.a.k.c.c.internals.AbstractCoordinator : Marking the coordinator 2147483647 dead.
2016-07-15 19:37:54.933 ERROR 4342 --- [main] o.a.k.c.c.internals.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group logstash
2016-07-15 19:37:54.933 WARN 4342 --- [main] o.a.k.c.c.internals.ConsumerCoordinator : Auto offset commit failed: Commit cannot be completed due to group rebalance
2016-07-15 19:37:54.941 ERROR 4342 --- [main] o.a.k.c.c.internals.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group logstash
2016-07-15 19:37:54.941 WARN 4342 --- [main] o.a.k.c.c.internals.ConsumerCoordinator : Auto offset commit failed:
2016-07-15 19:37:54.948 INFO 4342 --- [main] o.a.k.c.c.internals.AbstractCoordinator : Attempt to join group logstash failed due to unknown member id, resetting and retrying.
It keeps resetting.
Running another instance of the same application produces the errors immediately.
I suspect Kafka or its ZooKeeper has a problem, but there's no error log.
Does anyone have an idea of what's going on here?
This is the application I'm using: https://github.com/izeye/log-redirector
I just faced the same issue. I have been investigating, and in this thread and in this wiki you can find the solution.
The issue seems to be that the processing of a batch takes longer than the session timeout.
Either increase the session timeout, increase the polling frequency, or limit the number of bytes received.
What worked for me was changing max.partition.fetch.bytes, but you can also modify session.timeout.ms or the value you pass to consumer.poll(TIMEOUT), as sketched below.
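A minimal sketch of where these settings go (broker address and the numeric values are assumptions; tune them for your batch sizes):
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThrottledLogConsumer {

    public static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "logstash");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Smaller fetches mean each poll() returns less data, so a batch is
        // processed well within the session timeout.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "262144"); // assumed value
        // Alternatively, give the consumer more time before the group coordinator
        // considers it dead and triggers a rebalance.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");         // assumed value

        return new KafkaConsumer<>(props);
    }
}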

How to configure AmqpInboundChannelAdapter back off policy

I want to start my app even if RabbitMQ is not reachable. Currently my app hangs while the AmqpInboundChannelAdapter tries to establish a connection, and I can see a pattern in how long it waits before it tries again. How can I configure an app using Spring AMQP so that it starts regardless of RabbitMQ availability, and how do I configure this "back off policy"?
18/May/2015 16:52:57,666 INFO [main] - AmqpInboundChannelAdapter - started org.springframework.integration.amqp.inbound.AmqpInboundChannelAdapter#0
18/May/2015 16:54:47,769 INFO [main] - AmqpInboundChannelAdapter - started org.springframework.integration.amqp.inbound.AmqpInboundChannelAdapter#1
18/May/2015 16:57:07,818 INFO [main] - AmqpInboundChannelAdapter - started org.springframework.integration.amqp.inbound.AmqpInboundChannelAdapter#2
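A sketch of one setting that may be relevant (assuming Java configuration of Spring Integration; the queue name and channel are placeholders): the listener container's recovery interval controls how long the container waits between reconnection attempts, which is the retry pattern visible in the timestamps above.
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;
import org.springframework.integration.amqp.inbound.AmqpInboundChannelAdapter;
import org.springframework.messaging.MessageChannel;

public class AmqpInboundConfig {

    public AmqpInboundChannelAdapter inboundAdapter(ConnectionFactory connectionFactory,
                                                    MessageChannel inputChannel) {
        SimpleMessageListenerContainer container =
                new SimpleMessageListenerContainer(connectionFactory);
        container.setQueueNames("myQueue");                 // placeholder queue name
        container.setRecoveryInterval(30_000);              // wait 30s between reconnect attempts

        AmqpInboundChannelAdapter adapter = new AmqpInboundChannelAdapter(container);
        adapter.setOutputChannel(inputChannel);
        return adapter;
    }
}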

Neo4j 2.2.1 server does not start after db is generated via java code

I started with a new graph.db folder. I am using the embedded graph database, Java, and Cypher queries to create nodes. It seems to create the nodes successfully; I have debugged and checked the result object.
I now want to start the Neo4j server to check the nodes in the browser. However, it gives this message:
bash-4.2$ ./neo4j-community-2.2.1/bin/neo4j start
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...
WARNING: not changing user
process [2868]... waiting for server to be ready...... Failed to start within 120 seconds.
Neo4j Server may have failed to start, please check the logs.
I checked console.log and messages.log, but there is no error. I don't know what to pick out of the log files to post here for diagnosis. Please advise.
Console.log:
2015-04-26 05:14:47.278+0000 INFO [API] Setting startup timeout to: 120000ms based on 120000
2015-04-26 05:14:49.700+0000 INFO [API] Successfully shutdown Neo4j Server.
2015-04-26 05:15:24.684+0000 INFO [API] Setting startup timeout to: 120000ms based on 120000
2015-04-26 05:15:26.477+0000 INFO [API] Successfully shutdown Neo4j Server.
2015-04-26 05:19:54.699+0000 INFO [API] Setting startup timeout to: 120000ms based on 120000
2015-04-26 05:19:56.521+0000 INFO [API] Successfully shutdown Neo4j Server.
Where else should I look for errors? How can I get the Neo4j server up?
Edit:
I checked messages.log (text below), so it looks like the database is being shut down OK by the code:
2015-04-26 10:05:21.169+0000 INFO [org.neo4j]: --- STARTED diagnostics for KernelDiagnostics:StoreFiles END ---
2015-04-26 10:05:21.356+0000 INFO [org.neo4j]: Database is now ready
2015-04-26 10:06:05.470+0000 INFO [org.neo4j]: Index population started: [:PRODUCT(id) [provider: {key=lucene, version=1.0}]]
2015-04-26 10:06:05.653+0000 INFO [org.neo4j]: Schema state store has been cleared.
2015-04-26 10:06:05.695+0000 INFO [org.neo4j]: Index population completed. Index is now online: [:PRODUCT(id) [provider: {key=lucene, version=1.0}]]
2015-04-26 10:06:05.764+0000 INFO [org.neo4j]: Schema state store has been cleared.
2015-04-26 10:06:18.148+0000 INFO [org.neo4j]: Sampled index :PRODUCT(id) with 2 unique values in sample of avg size 2 taken from index containing 2 entries
2015-04-26 10:06:46.028+0000 INFO [org.neo4j]: Shutdown started
2015-04-26 10:06:46.031+0000 INFO [org.neo4j]: Database is now unavailable
2015-04-26 10:06:46.246+0000 INFO [org.neo4j]: About to rotate counts store at transaction 8 to [/home/dedhiaj/neo4j-community-2.2.1/data/graph.db/neostore.counts.db.b], from [/home/dedhiaj/neo4j-community-2.2.1/data/graph.db/neostore.counts.db.a].
2015-04-26 10:06:46.250+0000 INFO [org.neo4j]: Successfully rotated counts store at transaction 8 to [/home/dedhiaj/neo4j-community-2.2.1/data/graph.db/neostore.counts.db.b], from [/home/dedhiaj/neo4j-community-2.2.1/data/graph.db/neostore.counts.db.a].
2015-04-26 10:06:47.495+0000 INFO [org.neo4j]: NeoStore closed
2015-04-26 10:06:47.496+0000 INFO [org.neo4j]: --- STOPPING diagnostics START ---
2015-04-26 10:06:47.497+0000 INFO [org.neo4j]: --- STOPPING diagnostics END ---
Also increase the "ulimit" (max open files) of your Linux OS, as the startup warning suggests.

Cassandra Downed Host Retry (Hector)

Can someone assist me in understanding and troubleshooting this issue? I do not know what is causing Hector to fail when it tries to connect to the Cassandra cluster.
How can I find out where the issue is?
0 [main] INFO me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host Retry service started with queue size 10 and retry delay 30s
168 [main] INFO me.prettyprint.cassandra.service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_keyspace-name:ServiceType=hector,MonitorType=hector
399 [main] INFO me.prettyprint.cassandra.model.ConfigurableConsistencyLevel - READ ConsistencyLevel set to QUORUM for ColumnFamily Files
400 [main] INFO me.prettyprint.cassandra.model.ConfigurableConsistencyLevel - WRITE ConsistencyLevel set to QUORUM for ColumnFamily Files
406 [main] INFO me.prettyprint.cassandra.model.ConfigurableConsistencyLevel - READ ConsistencyLevel set to QUORUM for ColumnFamily FileList
407 [main] INFO me.prettyprint.cassandra.model.ConfigurableConsistencyLevel - WRITE ConsistencyLevel set to QUORUM for ColumnFamily FileList
From the trace it seems you are using QUORUM as the consistency level. Try ONE and see if it works; it seems that one or more of the nodes that should satisfy your request are down. Use nodetool ring/nodetool status to see if any node in your cluster is down.
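For reference, a minimal sketch (cluster name, host, and keyspace are placeholders) of how the default consistency level can be dropped to ONE with Hector's ConfigurableConsistencyLevel, which the trace shows you are already using:
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorConsistencyExample {

    public static Keyspace connect() {
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "XXX.XXX.XXX.XXX:9160");

        // QUORUM needs a majority of replicas up; ONE only needs a single replica,
        // so it is a quick way to check whether downed nodes are the problem.
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);

        return HFactory.createKeyspace("myKeyspace", cluster, ccl);
    }
}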
