Ignite cluster repeatedly failing - java

Our Ignite server is showing the following error.
We are not performing any queries or Ignite tasks; we are just using the put and get features of the caches.
We do have persistence enabled.
We did connect with Ignite Visor, and it shows nothing anomalous in the cluster topology or with any of the caches.
Any insight into this error would be appreciated.
[15:01:03,943][SEVERE][sys-stripe-4-#5][FailureProcessor] A critical problem with persistence data structures was detected. Please make backup of persistence storage and WAL files for further analysis. Persistence storage path: null WAL path: /ignite/wal WAL archive path: /ignite/walarchive
[15:01:03,946][SEVERE][sys-stripe-4-#5][FailureProcessor] No deadlocked threads detected.
....
[15:01:04,135][SEVERE][sys-stripe-4-#5][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-2113909708, val2=1125985806188550]], msg=Runtime failure on search row: SearchRow [key=KeyCacheObjectImpl [part=20, val=2AC8D168CC86654C1879E6999D1CCFD7, hasValBytes=true], hash=741925420, cacheId=0]]
Edit: server-side stack trace:
[18:40:50,919][SEVERE][sys-stripe-4-#5][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-2113909708, val2=1125985806188550]], msg=Runtime failure on search row: SearchRow [key=KeyCacheObjectImpl [part=20, val=2AC8D168CC86654C1879E6999D1CCFD7, hasValBytes=true], hash=741925420, cacheId=0]]]]
class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-2113909708, val2=1125985806188550]], msg=Runtime failure on search row: SearchRow [key=KeyCacheObjectImpl [part=20, val=2AC8D168CC86654C1879E6999D1CCFD7, hasValBytes=true], hash=741925420, cacheId=0]]
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1953)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1765)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1748)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2794)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:441)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2342)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2589)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2049)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1866)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1725)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3228)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:143)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:284)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:279)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1908)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1529)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1422)
at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:569)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Record is too long [capacity=134217728, size=134219738]
at org.apache.ignite.internal.processors.cache.persistence.wal.SegmentedRingByteBuffer.offer0(SegmentedRingByteBuffer.java:214)
at org.apache.ignite.internal.processors.cache.persistence.wal.SegmentedRingByteBuffer.offer(SegmentedRingByteBuffer.java:193)
at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FileWriteHandleImpl.addRecord(FileWriteHandleImpl.java:243)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:858)
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:811)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4338)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6492)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6244)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5928)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4021)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:3915)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2042)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1920)
... 27 more

The issue is caused by inserting a record whose size is greater than the actual WAL buffer size.
Ignite uses a WAL buffer to store serialized WAL records before writing them to the WAL file, so each WAL record must be smaller than the actual size of the WAL buffer.
By default, the WAL buffer size is equal to the configured WAL segment size. However, if IGNITE_WAL_MMAP is disabled, the WAL buffer size is limited by the WAL buffer size property, which defaults to the WAL segment size / 4.
As a workaround, you can try to increase the WAL buffer size using the properties mentioned above. More details can be found in the Ignite documentation on write-ahead log configuration.
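For example, a minimal sketch of raising those values in the Java configuration; the 512 MB / 256 MB sizes are purely illustrative, and the cluster-activation call assumes Ignite 2.9 or later:
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalBufferConfigExample {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Persistence enabled on the default data region, matching the question's setup.
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        // Illustrative sizes: the WAL buffer (derived from the segment size, or set
        // explicitly below) must be larger than the biggest record that gets logged.
        storageCfg.setWalSegmentSize(512 * 1024 * 1024); // 512 MB WAL segments
        storageCfg.setWalBufferSize(256 * 1024 * 1024);  // explicit 256 MB WAL buffer

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);

        Ignite ignite = Ignition.start(cfg);
        ignite.cluster().state(ClusterState.ACTIVE); // activation is required with persistence
    }
}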

Related

Redistribution fails for large messages in ActiveMQ Artemis

I'm using ActiveMQ Artemis version 2.19.1, and I'm facing an issue in a 6-node (3 masters) cluster where redistribution is failing for large messages with the below warning logs:
23:35:05,551 WARN [org.apache.activemq.artemis.core.server] AMQ222303: Redistribution by Redistributor[TEST_QUEUE/2244] of messageID = 196,950,715 failed: java.lang.UnsupportedOperationException: Method not supported with Large Messages
at org.apache.activemq.artemis.protocol.amqp.broker.AMQPLargeMessage.getData(AMQPLargeMessage.java:311) [artemis-amqp-protocol-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.protocol.amqp.broker.AMQPMessage.anyMessageAnnotations(AMQPMessage.java:1374) [artemis-amqp-protocol-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.protocol.amqp.broker.AMQPMessage.hasScheduledDeliveryTime(AMQPMessage.java:1352) [artemis-amqp-protocol-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:1499) [artemis-server-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.core.server.cluster.impl.Redistributor$1.run(Redistributor.java:169) [artemis-server-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.19.1.jar:2.19.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_322]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_322]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.19.1.jar:2.19.1]
Later, I see the broker removing consumers with the warning below, since properties=null:
00:10:28,280 WARN [org.apache.activemq.artemis.core.server] AMQ222151: removing consumer which did not handle a message, consumer=ServerConsumerImpl [id=57, filter=null, binding=LocalQueueBinding [address=TEST_QUEUE, queue=QueueImpl[name=TEST_QUEUE, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::name=localhost], temp=false]#9c000b7, filter=null, name=TEST_QUEUE, clusterName=TEST_QUEUE1e359f55-c92b-11ec-b908-005056a3af3f]], message=Reference[157995028]:RELIABLE:AMQPLargeMessage( [durable=true, messageID=157995028, address=TEST_QUEUE, size=0, scanningStatus=SCANNED, applicationProperties={VER=05, trackingId=62701757c80c3004d037ded6}, messageAnnotations={}, properties=null, extraProperties = TypedProperties[_AMQ_AD=TEST_QUEUE]]: java.lang.IllegalArgumentException: Array must not be empty or null
at org.apache.qpid.proton.codec.CompositeReadableBuffer.append(CompositeReadableBuffer.java:688) [proton-j-0.33.10.jar:]
at org.apache.qpid.proton.engine.impl.DeliveryImpl.send(DeliveryImpl.java:345) [proton-j-0.33.10.jar:]
at org.apache.qpid.proton.engine.impl.SenderImpl.send(SenderImpl.java:74) [proton-j-0.33.10.jar:]
at org.apache.activemq.artemis.protocol.amqp.proton.ProtonServerSenderContext$LargeMessageDeliveryContext.deliverInitialPacket(ProtonServerSenderContext.java:686) [artemis-amqp-protocol-2.19.1.jar:2.19.1]
at org.apache.activemq.artemis.protocol.amqp.proton.ProtonServerSenderContext$LargeMessageDeliveryContext.deliver(ProtonServerSenderContext.java:587) [artemis-amqp-protocol-2.19.1.jar:2.19.1]
I have 6 consumers for this queue. If one message out of many (say 1,000) is large, the other messages should still be processed, but processing stops completely with 0 consumers on the queue.
When you send large messages, the first block sent is parsed for properties. Most likely you are using large properties in a way that prevents the server from parsing the first few bytes of the message. You should avoid large properties and keep the large portion only in the message body, on the client side.
We tried following up with many tests on this JIRA, and the only plausible scenarios would be large properties, or some incomplete message your client is generating in a way the server can't parse.
https://issues.apache.org/jira/browse/ARTEMIS-3837
If you provide a reproducer showing how the property is failing to be parsed, we will follow up with a possible fix.
Please move any discussion regarding this to the JIRA. It could be a bug triggered by this anti-pattern.
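As an illustration of the advice above, a minimal JMS producer sketch that keeps application properties small and puts the bulk of the data in the message body; the broker URL, queue name, and 10 MB payload are hypothetical:
import javax.jms.BytesMessage;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class LargeBodySmallPropertiesProducer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
        try (Connection connection = cf.createConnection()) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("TEST_QUEUE"));

            BytesMessage message = session.createBytesMessage();

            // Keep application properties to small identifiers and flags...
            message.setStringProperty("VER", "05");
            message.setStringProperty("trackingId", "62701757c80c3004d037ded6");

            // ...and put the large payload in the body rather than in properties.
            byte[] payload = new byte[10 * 1024 * 1024];
            message.writeBytes(payload);

            producer.send(message);
        }
    }
}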

Error occurred while executing Hibernate query

Exception while executing a fetch HQL query. The query works most of the time, but sometimes it throws this exception.
The database is MySQL and the server is JBoss 5.1.0 GA.
The error that is shown is:
org.hibernate.exception.GenericJDBCException: could not execute query
at org.hibernate.exception.SQLStateConverter.handledNonSpecificException(SQLStateConverter.java:126)
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:114)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
at org.hibernate.loader.Loader.doList(Loader.java:2231)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2125)
at org.hibernate.loader.Loader.list(Loader.java:2120)
at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:401)
at org.hibernate.hql.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:361)
at org.hibernate.engine.query.HQLQueryPlan.performList(HQLQueryPlan.java:196)
at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1148)
at org.hibernate.impl.QueryImpl.list(QueryImpl.java:102)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.jboss.util.NestedSQLException: Error; - nested throwable:
(java.lang.OutOfMemoryError: GC overhead limit exceeded)
at org.jboss.resource.adapter.jdbc.WrappedConnection.checkException(WrappedConnection.java:873)
at org.jboss.resource.adapter.jdbc.WrappedStatement.checkException(WrappedStatement.java:852)
at org.jboss.resource.adapter.jdbc.WrappedResultSet.checkException(WrappedResultSet.java:1947)
at org.jboss.resource.adapter.jdbc.WrappedResultSet.getString(WrappedResultSet.java:892)
at org.hibernate.type.StringType.get(StringType.java:41)
at org.hibernate.type.NullableType.nullSafeGet(NullableType.java:184)
at org.hibernate.type.NullableType.nullSafeGet(NullableType.java:173)
at org.hibernate.type.AbstractType.hydrate(AbstractType.java:105)
at org.hibernate.persister.entity.AbstractEntityPersister.hydrate(AbstractEntityPersister.java:2124)
at org.hibernate.loader.Loader.loadFromResultSet(Loader.java:1404)
at org.hibernate.loader.Loader.instanceNotYetLoaded(Loader.java:1332)
at org.hibernate.loader.Loader.getRow(Loader.java:1230)
at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:603)
at org.hibernate.loader.Loader.doQuery(Loader.java:724)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:259)
at org.hibernate.loader.Loader.doList(Loader.java:2228)
... 11 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Your application seems to have run out of memory. The query is probably returning a very large result set, so the memory allocated to the application is not sufficient to handle it.
You can either:
Increase the memory allocated to the Java process, or
Fetch your data in subsets (pagination) to avoid a large result set being returned at once.
Based on your stack trace, the query was executed successfully in the database and the error occurs while the result set is being materialized inside the JVM.
This exception is a memory-related issue that occurs when your program demands more memory than is available:
"Thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector"
Some possibilities are:
Configuration: your heap configuration (-Xmx) is too small for your needs and you should increase it.
Bug: you have a bug in your query that generates a result set bigger than expected.
Design: you need to redesign your code to load this data without exceeding the available memory. You can paginate and hold only a small chunk of data on each iteration; in that case it is important to flush each page to the end user or consumer before loading the next one (see the sketch below).
GC overhead tutorial
Query Pagination tutorial
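A minimal pagination sketch using the legacy Hibernate Query API from this stack; the entity name MyEntity and the page size are placeholders, not part of the original code:
import java.util.List;

import org.hibernate.Query;
import org.hibernate.Session;

public class PagedFetch {
    // Fetch rows in fixed-size pages instead of materializing the whole result set.
    public static void processAll(Session session, int pageSize) {
        int offset = 0;
        while (true) {
            // "MyEntity" is a placeholder for the real mapped entity.
            Query query = session.createQuery("from MyEntity e order by e.id");
            query.setFirstResult(offset);
            query.setMaxResults(pageSize);

            List<?> page = query.list();
            if (page.isEmpty())
                break;

            for (Object row : page) {
                // hand each row to the consumer here before loading the next page
            }

            session.clear(); // drop the loaded page from the persistence context to free memory
            offset += pageSize;
        }
    }
}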

Ehcache Initial table allocation failed

One of our web applications runs within Tomcat 7 deployed on an AS400 server, and it uses Ehcache as a cache component to swap data to disk and reduce memory usage.
A few weeks ago, when we tried to deploy this application for one of our customers, it failed at startup, and the log shows:
Caused by: java.lang.IllegalStateException: Cache 'data' creation in EhcacheManager failed.
at org.ehcache.core.EhcacheManager.createCache(EhcacheManager.java:288)
at org.ehcache.core.EhcacheManager.init(EhcacheManager.java:567)
... 7 more
Caused by: org.ehcache.StateTransitionException: Initial table allocation failed.
Initial Table Size (slots) : 64
Allocation Will Require : 1KB
Table Page Source : org.terracotta.offheapstore.disk.paging.MappedPageSource#bc8a4ca2
at org.ehcache.core.StatusTransitioner$Transition.succeeded(StatusTransitioner.java:209)
at org.ehcache.core.Ehcache.init(Ehcache.java:567)
at org.ehcache.core.EhcacheManager.createCache(EhcacheManager.java:261)
... 8 more
Caused by: java.lang.IllegalArgumentException: Initial table allocation failed.
Initial Table Size (slots) : 64
Allocation Will Require : 1KB
Table Page Source : org.terracotta.offheapstore.disk.paging.MappedPageSource#bc8a4ca2
at org.terracotta.offheapstore.OffHeapHashMap.<init>(OffHeapHashMap.java:219)
at org.terracotta.offheapstore.AbstractLockedOffHeapHashMap.<init>(AbstractLockedOffHeapHashMap.java:71)
at org.terracotta.offheapstore.AbstractOffHeapClockCache.<init>(AbstractOffHeapClockCache.java:76)
at org.terracotta.offheapstore.disk.persistent.AbstractPersistentOffHeapCache.<init>(AbstractPersistentOffHeapCache.java:43)
at org.terracotta.offheapstore.disk.persistent.PersistentReadWriteLockedOffHeapClockCache.<init>(PersistentReadWriteLockedOffHeapClockCache.java:36)
at org.ehcache.impl.internal.store.disk.factories.EhcachePersistentSegmentFactory$EhcachePersistentSegment.<init>(EhcachePersistentSegmentFactory.java:73)
at org.ehcache.impl.internal.store.disk.factories.EhcachePersistentSegmentFactory.newInstance(EhcachePersistentSegmentFactory.java:60)
at org.ehcache.impl.internal.store.disk.factories.EhcachePersistentSegmentFactory.newInstance(EhcachePersistentSegmentFactory.java:37)
at org.terracotta.offheapstore.concurrent.AbstractConcurrentOffHeapMap.<init>(AbstractConcurrentOffHeapMap.java:106)
at org.terracotta.offheapstore.concurrent.AbstractConcurrentOffHeapCache.<init>(AbstractConcurrentOffHeapCache.java:48)
at org.terracotta.offheapstore.disk.persistent.AbstractPersistentConcurrentOffHeapCache.<init>(AbstractPersistentConcurrentOffHeapCache.java:52)
at org.ehcache.impl.internal.store.disk.EhcachePersistentConcurrentOffHeapClockCache.<init>(EhcachePersistentConcurrentOffHeapClockCache.java:52)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore.createBackingMap(OffHeapDiskStore.java:279)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore.getBackingMap(OffHeapDiskStore.java:167)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore.access$600(OffHeapDiskStore.java:95)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider.init(OffHeapDiskStore.java:460)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider.initStore(OffHeapDiskStore.java:456)
at org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider.initAuthoritativeTier(OffHeapDiskStore.java:507)
at org.ehcache.impl.internal.store.tiering.TieredStore$Provider.initStore(TieredStore.java:472)
at org.ehcache.core.EhcacheManager$8.init(EhcacheManager.java:499)
at org.ehcache.core.StatusTransitioner.runInitHooks(StatusTransitioner.java:135)
at org.ehcache.core.StatusTransitioner.access$000(StatusTransitioner.java:33)
at org.ehcache.core.StatusTransitioner$Transition.succeeded(StatusTransitioner.java:194)
The code that triggered this is:
CacheConfiguration<String, String[]> dconf = CacheConfigurationBuilder
.newCacheConfigurationBuilder(String.class, String[].class, ResourcePoolsBuilder.heap(11)
.disk(3, MemoryUnit.GB, false))
.withExpiry(Expirations.timeToLiveExpiration(Duration.of(30, TimeUnit.MINUTES)))
.build();
dataCacheManager = CacheManagerBuilder.newCacheManagerBuilder()
.with(CacheManagerBuilder.persistence(new File(cacheFolder, "requestdata"))) //$NON-NLS-1$
.withCache(CACHE_NAME_DATA,dconf)
.build(true);
This surprised us because it has never happened before; we have deployed it on other customers' servers (Windows, AS400, Linux), and none of them has this issue.
This is really a headache. We spent weeks trying to figure it out: reading source code, tuning JVM parameters, googling around... nothing except one unanswered post: https://groups.google.com/forum/#!topic/ehcache-users/ApFAe5nYxuA
Can anyone help us with this? Thanks ahead!
The Ehcache 3 disk store uses java.nio.MappedByteBuffer, which requires access to direct memory.
There is no documented default MaxDirectMemorySize in Java, and the same JVM on different operating systems can behave differently.
If you have not already set the flag -XX:MaxDirectMemorySize=3G when launching your application, that could be the cause of the exception you see.
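If you want to see what the JVM actually reports for direct and mapped buffers before the cache manager starts, here is a small diagnostic sketch using only the standard java.lang.management API (no Ehcache calls); it only prints the buffer pools and does not change any limits:
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryCheck {
    public static void main(String[] args) {
        // Prints the JVM's "direct" and "mapped" buffer pools. If mapped buffers
        // cannot be allocated on this platform, the disk tier's initial table
        // allocation can fail with the exception shown above.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-8s count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}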

Infinispan TimeoutException ISPN000476

I am experiencing an embedded Infinispan cache issue where nodes time out on re-joining the cluster.
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7 from vvshost
at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:64)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
The only way I can get the node to re-join is to switch off the cache and delete all local cache persistence files.
Here is the configuration which I am using:
Transport:
TransportConfigurationBuilder - defaultClusteredBuild
JMX Statistics - Enabled
Duplicate domains - Allowed
Cache Manager:
Manager Class - EmbeddedCacheManager
Memory - Memory Size: 0
Persistence: Single File Store
async: disabled
Clustering Cache Mode - CacheMode.DIST_SYNC
The configuration seems right to me, but the default value of remote-timeout is 15000 milliseconds. Increase the timeout until you stop getting the error.
Hope it helps.
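As a sketch, the remote timeout can also be raised programmatically on an embedded, clustered cache; the 30-second value and the cache name myDistCache are illustrative:
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class RemoteTimeoutExample {
    public static void main(String[] args) {
        // Clustered (embedded) cache manager, as in the question's setup.
        DefaultCacheManager manager =
                new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        ConfigurationBuilder cfg = new ConfigurationBuilder();
        cfg.clustering()
           .cacheMode(CacheMode.DIST_SYNC)
           .remoteTimeout(30_000); // default is 15000 ms; raise it until ISPN000476 stops

        manager.defineConfiguration("myDistCache", cfg.build());
        manager.getCache("myDistCache").put("key", "value");
    }
}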

SSTable loader streaming failed giving java.io.IOException: Connection reset by peer

I am trying to use sstableloader to stream data to a Cassandra database, which is in fact on the same node. It used to work when I was using DSE 2.2, but when I upgraded to DSE 4.5 and made all the relevant changes in the cassandra.yaml file, it stopped working, and now it throws an error like this:
Established connection to initial hosts
Opening sstables and calculating sections to stream
Streaming relevant part of demo/test_yale/demo-test_yale-jb-2-Data.db demo/test_yale/demo-test_yale-jb-1-Data.db to [/127.0.0.1]
Streaming session ID: 02225ef0-1c17-11e4-a1ea-5f2d4f6a32c1
progress: [/127.0.0.1 1/2 (88%)] [total: 88% - 2147483647MB/s (avg: 14MB/s)]ERROR 16:36:29,029 [Stream #02225ef0-1c17-11e4-a1ea-5f2d4f6a32c1] Streaming error occurred
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
at java.nio.channels.Channels.writeFully(Channels.java:98)
at java.nio.channels.Channels.access$000(Channels.java:61)
at java.nio.channels.Channels$1.write(Channels.java:174)
at com.ning.compress.lzf.LZFChunk.writeCompressedHeader(LZFChunk.java:77)
at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:132)
at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(LZFOutputStream.java:203)
at com.ning.compress.lzf.LZFOutputStream.write(LZFOutputStream.java:97)
at org.apache.cassandra.streaming.StreamWriter.write(StreamWriter.java:151)
at org.apache.cassandra.streaming.StreamWriter.write(StreamWriter.java:101)
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:59)
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:42)
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:45)
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:383)
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:363)
at java.lang.Thread.run(Thread.java:745)
WARN 16:36:29,032 [Stream #02225ef0-1c17-11e4-a1ea-5f2d4f6a32c1] Stream failed
Streaming to the following hosts failed:
[/127.0.0.1]
java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
I have even tried assigning the actual IP address of the node for listen_address, broadcast_address, and rpc_address in the cassandra.yaml file, but the same error occurs.
Can anyone be of assistance, please?
It's worth looking at your system.log, as specified in cassandra/conf/logback.xml, as suggested by Zanson.
In my case the issue was simply exhausted disk space on the node:
ERROR [STREAM-IN-/xx.xx.xx.xx] 2016-08-02 10:50:31,125 StreamSession.java:505 - [Stream #8420bfa0-589c-11e6-9512-235b1f79cf1b] Streaming error occurred
java.io.IOException: No space left on device
at java.io.RandomAccessFile.writeBytes(Native Method) ~[na:1.8.0_101]
at java.io.RandomAccessFile.write(RandomAccessFile.java:525) ~[na:1.8.0_101]
