Cassandra - Strange behaviour with one node - java

I have a 3-node cluster in my development environment, with a keyspace that has a replication factor of 2. Originally the cluster had only one node, and I added the other two one by one. The Cassandra version is 3.7.
All the nodes are "clones", so I only modified cassandra.yaml with the corresponding IP for each node.
I've run a repair and cleanup on every node, and my application uses consistency level ONE.
This is the nodetool status output:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.132.0.4  50.54 GiB  256     70.2%             50dc5baf-b8b3-4e19-8173-cf828afd36af  rack1
UN  10.132.0.3  50.31 GiB  256     65.3%             2a45b7a5-41ce-4533-ba63-60fd3c5cc530  rack1
UN  10.132.0.9  33.88 GiB  256     64.5%             e601fb16-6608-4e72-a820-dd4661977946  rack1
In cassandra.yaml, only 10.132.0.3 is listed as the seed node.
At this point everything works as expected: if I shut down one node, everything keeps running "fine", unless that node is 10.132.0.9. If I shut down this "bad" node, everything crashes with the following error:
org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
When I stop the bad node, the good ones show this error in their system.log files (I'm copying only the error, not the entire stack trace):
ERROR [SharedPool-Worker-1] 2018-02-27 10:59:16,449 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
I don't understand what's wrong with this node and I can't find a solution...
Edit
My connection code:
cluster_builder = Cluster.builder()
        .addContactPoints(serverIP.getCassandraList(sysvar))
        .withAuthProvider(new PlainTextAuthProvider(serverIP.getCassandraUser(sysvar), serverIP.getCassandraPwd(sysvar)))
        .withPoolingOptions(poolingOptions)
        .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE));
cluster = cluster_builder.build();
session = cluster.connect(keyspace);
My query:
statement = QueryBuilder.insertInto(keyspace, "measurement_minute").values(this.normal_names, (List<Object>) values);
And the execution:
ResultSetFuture future = session.executeAsync(statement.setConsistencyLevel(ConsistencyLevel.ONE));
I want to mention that I restarted, repaired and cleaned up all the nodes.

You are requesting QUORUM with a replication factor of 2. This won't work well, because with RF=2 a QUORUM is effectively the same as ALL: for a quorum, a majority of the replicas must respond to your query.
You can calculate the number of replicas needed for a quorum with (RF/2)+1 (using integer arithmetic), so RF=2 gives (2/2)+1=2: you need both of your replicas and can't have either one down. The reason some queries still work is that they don't involve replicas on 10.132.0.9.
You can either go to a replication factor of RF=3 or use CL.ONE, for example.
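To make the arithmetic above concrete, here is a tiny standalone Java sketch (class name is mine) that prints the quorum size and the number of replicas you can lose for a few replication factors:

public class QuorumCalc {
    // quorum = (RF / 2) + 1, using integer division
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 5; rf++) {
            System.out.println("RF=" + rf + " -> QUORUM needs " + quorum(rf)
                    + " replica(s), tolerates " + (rf - quorum(rf)) + " down");
        }
    }
}

For RF=2 this prints "QUORUM needs 2 replica(s), tolerates 0 down", which is exactly why stopping 10.132.0.9 breaks the QUORUM queries, while RF=3 would tolerate one node being down.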

Related

Ehcache Initial table allocation failed

One of our web applications runs in Tomcat 7 deployed on an AS400 server, and it uses Ehcache as a cache component to swap data to disk and reduce memory usage.
A few weeks ago, when we tried to deploy this application for one of our customers, it failed at startup and the log showed:
Caused by: java.lang.IllegalStateException: Cache 'data' creation in EhcacheManager failed.
at org.ehcache.core.EhcacheManager.createCache(EhcacheManager.java:288)
at org.ehcache.core.EhcacheManager.init(EhcacheManager.java:567)
... 7 more
Caused by: org.ehcache.StateTransitionException: Initial table allocation failed.
Initial Table Size (slots) : 64
Allocation Will Require : 1KB
Table Page Source : org.terracotta.offheapstore.disk.paging.MappedPageSource#bc8a4ca2
at org.ehcache.core.StatusTransitioner$Transition.succeeded(StatusTransitioner.java:209)
at org.ehcache.core.Ehcache.init(Ehcache.java:567)
at org.ehcache.core.EhcacheManager.createCache(EhcacheManager.java:261)
... 8 more
Caused by: java.lang.IllegalArgumentException: Initial table allocation failed.
Initial Table Size (slots) : 64
Allocation Will Require : 1KB
Table Page Source : org.terracotta.offheapstore.disk.paging.MappedPageSource#bc8a4ca2
at org.terracotta.offheapstore.OffHeapHashMap.<init>(OffHeapHashMap.java:219)
at org.terracotta.offheapstore.AbstractLockedOffHeapHashMap.<init>(AbstractLockedOffHeapHashMap.java:71)
at org.terracotta.offheapstore.AbstractOffHeapClockCache.<init>(AbstractOffHeapClockCache.java:76)
at org.terracotta.offheapstore.disk.persistent.AbstractPersistentOffHeapCache.<init>(AbstractPersistentOffHeapCache.java:43)
at org.terracotta.offheapstore.disk.persistent.PersistentReadWriteLockedOffHeapClockCache.<init>(PersistentReadWriteLockedOffHeapClockCache.java:36)
at org.ehcache.impl.internal.store.disk.factories.EhcachePersistentSegmentFactory$EhcachePersistentSegment.<init>(EhcachePersistentSegmentFactory.java:73)
at org.ehcache.impl.internal.store.disk.factories.EhcachePersistentSegmentFactory.newInstance(EhcachePersistentSegmentFactory.java:60)
at org.ehcache.impl.internal.store.disk.factories.EhcachePersistentSegmentFactory.newInstance(EhcachePersistentSegmentFactory.java:37)
at org.terracotta.offheapstore.concurrent.AbstractConcurrentOffHeapMap.<init>(AbstractConcurrentOffHeapMap.java:106)
at org.terracotta.offheapstore.concurrent.AbstractConcurrentOffHeapCache.<init>(AbstractConcurrentOffHeapCache.java:48)
at org.terracotta.offheapstore.disk.persistent.AbstractPersistentConcurrentOffHeapCache.<init>(AbstractPersistentConcurrentOffHeapCache.java:52)
at org.ehcache.impl.internal.store.disk.EhcachePersistentConcurrentOffHeapClockCache.<init>(EhcachePersistentConcurrentOffHeapClockCache.java:52)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore.createBackingMap(OffHeapDiskStore.java:279)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore.getBackingMap(OffHeapDiskStore.java:167)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore.access$600(OffHeapDiskStore.java:95)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider.init(OffHeapDiskStore.java:460)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider.initStore(OffHeapDiskStore.java:456)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider.initAuthoritativeTier(OffHeapDiskStore.java:507)
at org.ehcache.impl.internal.store.tiering.TieredStore$Provider.initStore(TieredStore.java:472)
at org.ehcache.core.EhcacheManager$8.init(EhcacheManager.java:499)
at org.ehcache.core.StatusTransitioner.runInitHooks(StatusTransitioner.java:135)
at org.ehcache.core.StatusTransitioner.access$000(StatusTransitioner.java:33)
at org.ehcache.core.StatusTransitioner$Transition.succeeded(StatusTransitioner.java:194)
The code that triggers this is:
CacheConfiguration<String, String[]> dconf = CacheConfigurationBuilder
        .newCacheConfigurationBuilder(String.class, String[].class,
                ResourcePoolsBuilder.heap(11).disk(3, MemoryUnit.GB, false))
        .withExpiry(Expirations.timeToLiveExpiration(Duration.of(30, TimeUnit.MINUTES)))
        .build();
dataCacheManager = CacheManagerBuilder.newCacheManagerBuilder()
        .with(CacheManagerBuilder.persistence(new File(cacheFolder, "requestdata"))) //$NON-NLS-1$
        .withCache(CACHE_NAME_DATA, dconf)
        .build(true);
This surprised us because it has never happened before; we have deployed it on other customers' servers (Windows, AS400, Linux) and none of them had this issue.
This is really a headache. We have spent weeks trying to figure it out: reading source code, tuning JVM parameters, googling around... nothing, except one unanswered post: https://groups.google.com/forum/#!topic/ehcache-users/ApFAe5nYxuA
Can anyone help us with this? Thanks ahead!
The Ehcache 3 disk store uses java.nio.MappedByteBuffer, which requires access to direct memory.
There is no documented default for MaxDirectMemorySize in Java, and the same JVM can behave differently on different operating systems.
If you have not already set the flag -XX:MaxDirectMemorySize=3G when launching your application, that could be the cause of the exception you see.
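If you want to confirm what the JVM actually resolved, here is a minimal, HotSpot-specific sketch (class name is mine) that prints the effective MaxDirectMemorySize at runtime; on HotSpot a value of 0 means the internal default is in effect:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryCheck {
    public static void main(String[] args) {
        // HotSpot-only diagnostic: reads the MaxDirectMemorySize VM option as the JVM sees it.
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("MaxDirectMemorySize = "
                + diag.getVMOption("MaxDirectMemorySize").getValue());
    }
}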

Infinispan TimeoutException ISPN000476

I am experiencing an embedded Infinispan cache issue where nodes time out when re-joining the cluster.
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7 from vvshost
at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:64)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
The only way I can get the node to re-join is to switch off the cache and delete all local cache persistence files.
Here is the configuration which I am using:
Transport:
  TransportConfigurationBuilder - defaultClusteredBuild
  JMX Statistics - Enabled
  Duplicate domains - Allowed
Cache Manager:
  Manager Class - EmbeddedCacheManager
  Memory - Memory Size: 0
  Persistence: Single File Store (async: disabled)
  Clustering Cache Mode - CacheMode.DIST_SYNC
The configuration looks right to me, but the value of remote-timeout is 15000 milliseconds by default. Increase the timeout until you stop getting the error.
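For reference, a minimal sketch of where remote-timeout lives in the embedded Infinispan API, assuming a programmatic DIST_SYNC configuration like the one described above; the cache name and the 30-second value are placeholders of mine, not recommendations:

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class RemoteTimeoutExample {
    public static void main(String[] args) {
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());
        // remote-timeout is given in milliseconds; 30000 is just an example value to experiment with
        manager.defineConfiguration("myDistCache", new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)
                    .remoteTimeout(30000L)
                .build());
        manager.getCache("myDistCache"); // starts the cache with the raised timeout
        manager.stop();
    }
}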
Hope it helps

Cassandra failure during write query at consistency LOCAL_QUORUM

While writing an XML file into a Cassandra table column I am facing the following exception. It's a 3-node cluster and all nodes are up.
com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure during write query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded, 1 failed)
at com.datastax.driver.core.exceptions.WriteFailureException.copy(WriteFailureException.java:80)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:55)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:39)
at DBConnection.oracle2Cassandra(DBConnection.java:267)
at DBConnection.main(DBConnection.java:292)
Caused by: com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure during write query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded, 1 failed)
at com.datastax.driver.core.exceptions.WriteFailureException.copy(WriteFailureException.java:91)
at com.datastax.driver.core.Responses$Error.asException(Responses.java:119)
at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:180)
at com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:186)
at com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:44)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.setFinalResult(RequestHandler.java:754)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:576)
It would be great if someone could help me out of this situation. Thanks.
However, I don't know the real root cause; here is what worked for me.
In one of the cluster nodes' /var/log/cassandra/system.log I found that an SSTable was corrupted for a table, and that table is the one used by the code that throws the exception.
Take a backup of the table.
Remove the corrupted db file from the node.
Drop the table.
Re-create the table.
That worked for me.
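While diagnosing something like this, it can also help to reproduce the write with an explicit consistency level so you can see whether the healthy replicas still accept it. This is only a hedged sketch with the DataStax Java driver 3.x; the keyspace, table, columns and contact point are placeholders of mine:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ExplicitConsistencyWrite {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO my_table (id, payload) VALUES (?, ?)", 1, "<xml/>");
            // LOCAL_QUORUM with RF=3 needs 2 healthy local replicas; ONE needs only 1,
            // which helps isolate whether a single bad replica is the problem.
            insert.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(insert);
        }
    }
}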

org.h2.jdbc.JdbcSQLException: General error: "java.lang.StackOverflowError" [50000-176]

StackOverflowError while using the H2 database in a multi-threaded environment.
Our application has a service layer that queries the H2 database and retrieves the result set.
The service layer connects to the H2 database through the open-source clustering middleware "Sequoia" (which offers load balancing and transparent failover) and also manages database connections:
https://sourceforge.net/projects/sequoiadb/
Our service layer has 50 service methods, and we have exposed them as EJBs. When invoking the EJBs we get the response from the service (which includes an H2 read) with an average response time of 0.2 seconds.
The DAO layer queries the database using Hibernate Criteria, and we also use a JPA 2.0 entity manager to manage the datasource.
For load testing, we created a test class (with a main method) that invokes all 50 EJB methods.
50 threads were created, and each thread invoked the test class. The execution was fine on the first run: all 50 threads successfully completed the 50 EJB invocations.
When we triggered the test class again, we encountered a StackOverflowError. The detailed stack trace is shown below:
org.h2.jdbc.JdbcSQLException: General error: "java.lang.StackOverflowError" [50000-176]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:344)
at org.h2.message.DbException.get(DbException.java:167)
at org.h2.message.DbException.convert(DbException.java:290)
at org.h2.server.TcpServerThread.sendError(TcpServerThread.java:222)
at org.h2.server.TcpServerThread.run(TcpServerThread.java:155)
at java.lang.Thread.run(Thread.java:784)
Caused by: java.lang.StackOverflowError
at java.lang.Character.digit(Character.java:4505)
at java.lang.Integer.parseInt(Integer.java:458)
at java.lang.Integer.parseInt(Integer.java:510)
at java.text.MessageFormat.makeFormat(MessageFormat.java:1348)
at java.text.MessageFormat.applyPattern(MessageFormat.java:469)
at java.text.MessageFormat.<init>(MessageFormat.java:361)
at java.text.MessageFormat.format(MessageFormat.java:822)
at org.h2.message.DbException.translate(DbException.java:92)
at org.h2.message.DbException.getJdbcSQLException(DbException.java:343)
at org.h2.message.DbException.get(DbException.java:167)
at org.h2.message.DbException.convert(DbException.java:290)
at org.h2.command.Command.executeUpdate(Command.java:262)
at org.h2.jdbc.JdbcPreparedStatement.execute(JdbcPreparedStatement.java:199)
at org.h2.server.TcpServer.addConnection(TcpServer.java:140)
at org.h2.server.TcpServerThread.run(TcpServerThread.java:152)
... 1 more
at org.h2.engine.SessionRemote.done(SessionRemote.java:606)
at org.h2.engine.SessionRemote.initTransfer(SessionRemote.java:129)
at org.h2.engine.SessionRemote.connectServer(SessionRemote.java:430)
at org.h2.engine.SessionRemote.connectEmbeddedOrServer(SessionRemote.java:311)
at org.h2.jdbc.JdbcConnection.<init>(JdbcConnection.java:107)
at org.h2.jdbc.JdbcConnection.<init>(JdbcConnection.java:91)
at org.h2.Driver.connect(Driver.java:74)
at org.continuent.sequoia.controller.connection.DriverManager.getConnectionForDriver(DriverManager.java:266)
We then even added a random thread sleep (10-25 seconds) between EJB invocations. The execution was successful three times (all 50 EJB invocations), and when we triggered it a fourth time, it failed with the above error.
We see the failure even with a thread count of 25.
The failure is random and there doesn't seem to be a pattern. Kindly let us know if we have missed any configuration.
Please let me know if you need any additional information. Thanks in advance for any help.
Technology stack:
1) Java 1.6
2) h2-1.3.176
3) Sequoia middleware that manages DB connection open and close
   - variable connection pool manager
   - initial pool size 250
Thanks Lance Java for your suggestions. Increasing the stack size didn't help in our scenario (the additional stack only allowed a few more executions), for the following reason.
In our app we are using an entity manager (JPA) and the transaction attribute was not set. Hence each query to the database created a thread to carry out the execution. In JVisualVM we observed the DB threads: the number of live threads was equal to the total number of threads started.
Eventually our app created more than 30K threads, which resulted in the StackOverflowError.
Upon setting the transaction attribute, the threads are released after the DB execution, and all the transactions are then managed by only 25-30 threads.
The issue is resolved now.
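For anyone hitting the same thing, here is a minimal sketch of what setting the transaction attribute can look like on a container-managed EJB; the bean name, table name and query are hypothetical:

import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Stateless
public class DataService {

    @PersistenceContext
    private EntityManager em;

    // Container-managed transaction: the query joins (or starts) a transaction
    // instead of leaving the work unmanaged.
    @TransactionAttribute(TransactionAttributeType.REQUIRED)
    public long countRows() {
        return ((Number) em.createNativeQuery("select count(*) from SOME_TABLE")
                .getSingleResult()).longValue();
    }
}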
There are two main causes for a StackOverflowError:
A bug containing a non-terminating recursive call
The allocated stack size for the JVM isn't big enough
Looking at your stack trace it doesn't look recursive, so I'm guessing you are running out of space. Have you set the -Xss flag for your JVM? You might need to increase this value.

Increment times out; set always succeeds after retry

I'm getting strange behavior in memcached, in particular, behavior that is strange in its consistency. Here is my test:
@Test
public void testMemc() {
    logger.info("Setting head.");
    memc.set(env.memcachedQueueKeys().head, 3600, 0);
    logger.info("Set head; incrementing.");
    memc.incr(env.memcachedQueueKeys().head, 1);
    logger.info("Incremented.");
}
And here is the output:
28 11:04:52.932 INFO; Setting head.
2014-01-28 11:04:52.933 WARN net.spy.memcached.MemcachedConnection: Could not redistribute to another node, retrying primary node for q:unittest:scannedemails:w.
28 11:04:52.933 INFO; Set head; incrementing.
2014-01-28 11:04:52.935 WARN net.spy.memcached.MemcachedConnection: Could not redistribute to another node, retrying primary node for q:unittest:scannedemails:w.
FAILED: testMemc
net.spy.memcached.OperationTimeoutException: Mutate operation timed out,unable to modify counter [q:unittest:scannedemails:w]
at net.spy.memcached.MemcachedClient.mutate(MemcachedClient.java:1484)
at net.spy.memcached.MemcachedClient.incr(MemcachedClient.java:1529)
at me.unroll.emailroller.ActOnScanResultsTest.testMemc(ActOnScanResultsTest.java:295)
Most of my intuition for this kind of error fails me here. The following things are all strange:
Why does it always fail exactly once to set?
Why does it permanently fail to increment after seeming to succeed at set?
This is on a high-load server (yes, it's a little wrong to be running a test on a load-bearing server, but if it catches issues like this there's at least some advantage). What can cause this consistent failure? There is only one node.
The problem was that I couldn't connect at all. This is a bug in spymemcached, since the set operation did not throw an exception even though it had no memcached server to perform the set on.
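One way to make that failure visible is to inspect the OperationFuture that spymemcached's asynchronous set() returns instead of assuming it succeeded. A minimal sketch; the host, port and key are placeholders:

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.OperationFuture;

public class MemcSetCheck {
    public static void main(String[] args) throws Exception {
        MemcachedClient memc = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        // set() is asynchronous: a dead or unreachable server only shows up
        // when the returned future is inspected.
        OperationFuture<Boolean> setResult = memc.set("q:unittest:scannedemails:w", 3600, 0);
        if (!setResult.get(5, TimeUnit.SECONDS)) {
            System.err.println("set failed: " + setResult.getStatus().getMessage());
        }
        memc.shutdown();
    }
}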
