Hazelcast custom timeout for operations - java

I am using "hazelcast.operation.call.timeout.millis = 100" configuration to timeout hazelcast operations.
But at the startup of the hazelcast some of the map size operation are getting timeout because of this configuration. I just only want to timeout the operations after the map load which are basically map get operations. Is there any way to add custom operation timeout for those map.get() operations ?
Is there any other way to get this done ???
com.hazelcast.core.OperationTimeoutException: HDMapSizeOperation got rejected before execution due to not starting within the operation-call-timeout of: 100ms. Current time: 2017-05-15 11:41:47.503. Start time: 2017-05-15 11:41:44.189. Total elapsed time: 3314 ms. Invocation{op=com.hazelcast.map.impl.operation.HDMapSizeOperation{serviceName='hz:impl:mapService', identityHash=1941379381, partitionId=0, replicaIndex=0, callId=-24461, invocationTime=1494828707296 (2017-05-15 11:41:47.296), waitTimeout=-1, callTimeout=100, name=blockMap}, tryCount=250, tryPauseMillis=500, invokeCount=11, callTimeoutMillis=100, firstInvocationTimeMs=1494828704189, firstInvocationTime='2017-05-15 11:41:44.189', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 05:30:00.000', target=[192.168.2.204]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:151)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:99)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:75)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.retryFailedPartitions(InvokeOnPartitions.java:143)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:73)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:371)
at com.hazelcast.map.impl.proxy.MapProxySupport.size(MapProxySupport.java:628)
at com.hazelcast.map.impl.proxy.MapProxyImpl.size(MapProxyImpl.java:102)
at it.XXXX.tbx.server.MapLoader.run(MapLoader.java:36)

If you are trying to control how long you wait for the result of e.g. a map.get, have a look at the asynchronous version, map.getAsync. It returns a future, and you can control how long you want to wait for the result.
Modifying the call timeout is not advised.
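
A minimal sketch of that approach, assuming Hazelcast 3.x (where getAsync returns an ICompletableFuture): leave hazelcast.operation.call.timeout.millis at its default so startup operations such as map.size() can complete, and enforce the 100 ms budget per call instead. The map name is taken from the stack trace above; the key is illustrative.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ICompletableFuture;
import com.hazelcast.core.IMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GetWithTimeout {
    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // "blockMap" is the map name from the stack trace; the key is illustrative.
        IMap<String, String> map = hz.getMap("blockMap");

        ICompletableFuture<String> future = map.getAsync("someKey");
        try {
            // Enforce the 100 ms budget per call instead of cluster-wide,
            // so startup operations like map.size() keep the default timeout.
            String value = future.get(100, TimeUnit.MILLISECONDS);
            System.out.println("value = " + value);
        } catch (TimeoutException e) {
            // This particular get took too long; handle or retry as needed.
        }
    }
}

This keeps the timeout a client-side, per-call decision rather than a cluster-wide setting that also affects internal operations.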

Related

JTA Transaction Timeout Troubleshooting

Setup:
Oracle 12 DB
JBoss EAP 7
A web service running on JBoss that inserts into the DB
A batch program calling the web service from multiple threads, about 130,000 times in the span of an hour
The problem:
2018-04-26 18:20:44,675 +0200 [WARN ] [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffffac110923:-4c44ed1d:5ac9329e:6866ea in state RUN
2018-04-26 18:20:44,675 +0200 [WARN ] [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012095: Abort of action id 0:ffffac110923:-4c44ed1d:5ac9329e:6866ea invoked while multiple threads active within it.
2018-04-26 18:20:44,679 +0200 [WARN ] [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012381: Action id 0:ffffac110923:-4c44ed1d:5ac9329e:6866ea completed with multiple threads - thread default task-48 was in progress with xxx.BaseEntity.getNextValue(BaseEntity.java:28)
This happens routinely in the production environment under heavy load, not when processing fewer records and not in an identical test environment with the exact same load.
The last line shows that this transaction timeout (300s) occurs while fetching the next value from a sequence:
CREATE SEQUENCE "XXX_S" MINVALUE xxx MAXVALUE xxx INCREMENT BY 1 START WITH xxx CACHE 2 NOORDER NOCYCLE NOPARTITION ;
I know Oracle needs to lock/unlock the sequence in order to keep it consistent, so my parallel web service calls must somehow run into a deadlock or massive contention, producing the timeout.
How do I find the root of this problem? Which parameters can I try to manipulate?
The issue is now resolved, though very unsatisfyingly: we removed the parallelism.
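
For reference, sequence contention can often be reduced without giving up parallelism by letting each client lease a block of sequence values, so far fewer round trips hit XXX_S (note the DDL above uses CACHE 2, so Oracle has to persist a new high-water mark every two fetches). A hedged sketch, assuming the entities were mapped with JPA; the question's BaseEntity.getNextValue points to a custom fetch, so this is an alternative, not the poster's actual code:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class BaseEntity {
    // With allocationSize = 50, the JPA provider hands out 50 ids per
    // sequence fetch, cutting round trips to XXX_S by roughly 50x.
    // The sequence's INCREMENT BY (and ideally CACHE) should match.
    @Id
    @SequenceGenerator(name = "xxx_seq", sequenceName = "XXX_S", allocationSize = 50)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "xxx_seq")
    private Long id;
}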

Infinispan TimeoutException ISPN000476

I am experiencing an embedded Infinispan cache issue where nodes time out on re-joining the cluster.
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7 from vvshost
at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:64)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
The only way I can get the node to re-join is to switch off the cache and delete all local cache persistence files.
Here is the configuration which I am using:
Transport:
TransportConfigurationBuilder - defaultClusteredBuild
JMX Statistics - Enabled
Duplicate domains - Allowed
Cache Manager:
Manager Class - EmbeddedCacheManager
Memory - Memory Size: 0
Persistence: Single File Store
async: disabled
Clustering Cache Mode - CacheMode.DIST_SYNC
The configuration seems right to me, but the value of remote-timeout is 15000 milliseconds by default. Increase the timeout until you stop getting the error.
Hope it helps
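
A minimal sketch of raising remote-timeout programmatically, matching the embedded DIST_SYNC setup described above (the cache name and the 60 s value are illustrative):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class CacheSetup {
    public static void main(String[] args) {
        // Clustered transport, as in the question's defaultClusteredBuild.
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();

        Configuration cfg = new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)
                    .remoteTimeout(60_000L) // raise from the 15 000 ms default
                .persistence()
                    .addSingleFileStore()   // single file store, async disabled
                .build();

        EmbeddedCacheManager manager = new DefaultCacheManager(global.build(), cfg);
        manager.getCache("someCache"); // illustrative cache name
    }
}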

Websphere HTTP outbound connection pool usage

I have an application running on WebSphere Application Server 6.1.0.43, and I'm having slowdown issues when trying to invoke a remote service.
The slowdown is in the method findGroupAndGetConnection of the class OutboundConnectionCache.
According to the IBM APAR PK94494:
The delay occurs after the client-side JAX-RPC handler (if present) is invoked and before the actual SOAP message is sent to the provider. Because the delay occurs in the IBM web services engine, this problem can be difficult to detect.
A com.ibm.ws.webservices.engine.transport.*=all trace will show entries similar to these which repeat:
[8/19/09 18:08:29:658 GMT] 00000047 OutboundConne 1 Enter:
WSWS3595I: Current pool size: 25. Connections-in-use size: 0.
Configured pool size: 25
In addition, that same trace spec will show long delays in executing the .findGroupAndGetConnection() method:
[8/19/09 18:08:03:428 GMT] 00000047 OutboundConne >
OutboundConnectionCache.findGroupAndGetConnection()
WAITING_THREADS_THRESHOLD is 5 Entry
[8/19/09 18:08:38:358 GMT] 00000047 OutboundConne <
OutboundConnectionCache.findGroupAndGetConnection() Exit
And they recommend the following:
Reduce the 'com.ibm.websphere.webservices.http.connectionPoolCleanUpTime' property from the default of 180 to 120 seconds.
Increase the max connections 'com.ibm.websphere.webservices.http.maxConnection' property from the default of 25 to 50. This will also require increasing the web container thread pool size to 100.
Before changing the default properties, I decided to monitor the Web Container thread usage, and I noticed that the maximum thread pool size (50) is never reached, but the pool shrinks to the minimum size (10) very often, forcing connections to be destroyed and recreated.
Could running near the minimum pool size cause this slowdown? Should I increase the minimum pool size? Or is my problem something other than the HTTP outbound connection pool?

Spark Job in YARN - executors are not executing tasks for a long time

From the Spark UI I can see that the executors have not been executing tasks for a long time.
When I look at stderr in the executors tab, I can see the logs below.
6/02/04 05:30:56 INFO storage.MemoryStore: Block broadcast_91 of size 153016 dropped from memory (free 6665239401)
16/02/04 06:11:20 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 31337ms (threshold=30000ms); ack: seqno: 1240 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 4835789, targets: [DatanodeInfoWithStorage[10.25.36.18:1004,DS-f6e20cf7-0ccb-45aa-988f-f3310d5acf89,DISK], DatanodeInfoWithStorage[10.25.36.11:1004,DS-61ad0a2d-a6fd-402e-b0a1-61682d1755fb,DISK], DatanodeInfoWithStorage[10.25.36.5:1004,DS-c77503a2-0c7f-4b5c-8f4a-9c61cb4f18d7,DISK]]
I do not see any log output for a long time, and no errors either. It just keeps running.
Has anyone faced the same problem? How can we improve this?
Update:
It actually takes a long time in the saveAsTextFile() method.

flink: job won't run with higher taskmanager.heap.mb

Simple job: kafka->flatmap->reduce->map.
The job runs OK with the default value for taskmanager.heap.mb (512 MB). According to the docs, this value should be as large as possible. Since the machine in question has 96 GB of RAM, I set it to 75000 (an arbitrary value).
Starting job gives this error:
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply$mcV$sp(JobManager.scala:563)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (Source: Custom Source (1/1)) # (unassigned) - [SCHEDULED] > with groupID < 95b239d1777b2baf728645df9a1c4232 > in sharing group < SlotSharingGroup [772c9ff1cf0b6cb3a361e3352f75fcee, d4f856f13654f424d7c49d0f00f6ecca, 81bb8c4310faefe32f97ebd6baa4c04f, 95b239d1777b2baf728645df9a1c4232] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:255)
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:686)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
... 8 more
Restoring the default value (512) makes the job run OK. At 5000 it works; at 10000 it doesn't.
What did I miss?
Edit: This is more hit-and-miss than I thought. Setting the value to 50000 and resubmitting succeeds. In every test, the cluster is stopped and restarted.
What you are probably experiencing is submitting the job before the workers have registered with the master.
A 5GB JVM heap is initialized fast, and the TaskManager can register almost immediately. For a 70GB heap, the JVM takes a while to initialize and boot. Consequently, the worker registers later, and the job cannot be executed when you submit it, due to a lack of workers.
That is also the reason why it works once you re-submit the job.
JVMs initialize faster if you start the cluster in "streaming" mode (standalone via start-cluster-streaming.sh), because then at least Flink's internal memory is initialized lazily.
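
Since a re-submission succeeds once the workers have registered, a crude client-side workaround is to retry the submission. A hedged sketch (the job name, retry count, and sleep interval are illustrative; the pipeline itself is elided):

import org.apache.flink.runtime.client.JobExecutionException;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SubmitWithRetry {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // ... build the kafka -> flatMap -> reduce -> map pipeline here ...

        int attempts = 0;
        while (true) {
            try {
                env.execute("kafka-flatmap-reduce-map");
                break; // submitted and completed
            } catch (JobExecutionException e) {
                if (++attempts >= 5) {
                    throw e; // slots still missing after several tries; give up
                }
                Thread.sleep(10_000); // let the large-heap TaskManager JVMs register
            }
        }
    }
}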
