We are using Hazelcast 1.9.4.4 with a cluster of 6 Tomcat servers. We restarted our cluster, and here is a fragment of the log:
14-Jul-2012 03:25:41 com.hazelcast.nio.InSelector
INFO: /10.152.41.105:5701 [cem-prod] 5701 accepted socket connection from /10.153.26.16:54604
14-Jul-2012 03:25:47 com.hazelcast.cluster.ClusterManager
INFO: /10.152.41.105:5701 [cem-prod]
Members [6] {
Member [10.152.41.101:5701]
Member [10.164.101.143:5701]
Member [10.152.41.103:5701]
Member [10.152.41.105:5701] this
Member [10.153.26.15:5701]
Member [10.153.26.16:5701]
}
We can see that 10.153.26.16 is connected to the cluster, but later in the log there is:
14-Jul-2012 03:28:50 com.hazelcast.impl.ConcurrentMapManager
INFO: /10.152.41.105:5701 [cem-prod] ======= 47: CONCURRENT_MAP_LOCK ========
thisAddress= Address[10.152.41.105:5701], target= Address[10.153.26.16:5701]
targetMember= Member [10.153.26.16:5701], targetConn=Connection [/10.153.26.16:54604 -> Address[10.153.26.16:5701]] live=true, client=false, type=MEMBER, targetBlock=Block [2] owner=Address[10.153.26.16:5701] migrationAddress=null
cemClientNotificationsLock Re-doing [20] times! c:__hz_Locks : null
14-Jul-2012 03:28:55 com.hazelcast.impl.ConcurrentMapManager
INFO: /10.152.41.105:5701 [cem-prod] ======= 57: CONCURRENT_MAP_LOCK ========
thisAddress= Address[10.152.41.105:5701], target= Address[10.153.26.16:5701]
targetMember= Member [10.153.26.16:5701], targetConn=Connection [/10.153.26.16:54604 -> Address[10.153.26.16:5701]] live=true, client=false, type=MEMBER, targetBlock=Block [2] owner=Address[10.153.26.16:5701] migrationAddress=null
cemClientNotificationsLock Re-doing [30] times! c:__hz_Locks : null
After several restarts of the servers (all together, stop all and start one-by-one, etc.) we were able to get the system running.
Could you explain why Hazelcast fails to lock the map on a node that is in the cluster, or, if this node was out of the cluster, why it is displayed as a member?
Also, are there any recommendations on how to restart a Tomcat cluster with distributed Hazelcast structures (stop all nodes and start them together, stop and start one-by-one, stop Hazelcast somehow before the server restart, etc.)?
Thanks!
Could you explain why Hazelcast fails to lock the map on a node if it is in the cluster
The map may be locked by some other node at that moment.
There have also been lots of fixes and changes since 1.9.4.4; it is a pretty old version. You should try 2.1+.
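If waiting on a lock held elsewhere is a possibility, a bounded tryLock avoids the endless "Re-doing" loop. A minimal sketch against the IMap API (map and key names here are illustrative, not taken from your application):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.concurrent.TimeUnit;

public class BoundedLock {
    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());
        IMap<String, String> map = hz.getMap("notifications"); // illustrative name
        // Wait at most 5 seconds for the key's owner to grant the lock,
        // instead of re-doing the lock operation indefinitely.
        if (map.tryLock("someKey", 5, TimeUnit.SECONDS)) {
            try {
                map.put("someKey", "value");
            } finally {
                map.unlock("someKey");
            }
        } else {
            System.err.println("Lock not acquired within 5s; owner may be unresponsive");
        }
    }
}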
Related
I am running two instances of a Spring Boot application on OpenShift, and both are joined to a Hazelcast cluster. However, I continuously see the below error messages in the logs.
{"timestamp":"2021-06-01T20:29:48.775+10:00","app":"my-protected-application","logLevel":"INFO","thread":"hz.peaceful_pike.priority-generic-operation.thread-0","eventSource":"com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp","message":"[10.1.28.31]:5701 [spring-session-cluster] [4.2] Removing null, since it thinks it's already split from this cluster and looking to merge."}
{"timestamp":"2021-06-01T20:29:48.776+10:00","app":"my-protected-application","logLevel":"ERROR","thread":"hz.peaceful_pike.priority-generic-operation.thread-0","eventSource":"com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp","message":"[10.1.28.31]:5701 [spring-session-cluster] [4.2] Target is this node! -> [10.1.28.31]:5701","stack_trace":"<#d3566be0> j.l.IllegalArgumentException: Target is this node! -> [10.1.28.31]:5701\n\tat c.h.s.i.o.i.OutboundResponseHandler.checkTarget(OutboundResponseHandler.java:226)\n\tat c.h.s.i.o.i.OutboundResponseHandler.sendNormalResponse(OutboundResponseHandler.java:125)\n\tat c.h.s.i.o.i.OutboundResponseHandler.sendResponse(OutboundResponseHandler.java:88)\n\tat c.h.s.i.o.Operation.sendResponse(Operation.java:475)\n\tat c.h.s.i.o.i.OperationRunnerImpl.call(OperationRunnerImpl.java:282)\n\tat c.h.s.i.o.i.OperationRunnerImpl.run(OperationRunnerImpl.java:248)\n\tat c.h.s.i.o.i.OperationRunnerImpl.run(OperationRunnerImpl.java:469)\n\tat c.h.s.i.o.i.OperationThread.process(OperationThread.java:197)\n\tat c.h.s.i.o.i.OperationThread.process(OperationThread.java:137)\n\tat c.h.s.i.o.i.OperationThread.executeRun(OperationThread.java:123)\n\tat c.h.i.u.e.HazelcastManagedThread.run(HazelcastManagedThread.java:102)\n"}
{"timestamp":"2021-06-01T20:29:48.776+10:00","app":"my-protected-application","logLevel":"WARN","thread":"hz.peaceful_pike.priority-generic-operation.thread-0","eventSource":"com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl","message":"[10.1.28.31]:5701 [spring-session-cluster] [4.2] While sending op error... op: com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp{serviceName='hz:core:clusterService', identityHash=127643227, partitionId=-1, replicaIndex=0, callId=1046044, invocationTime=1622543389369 (2021-06-01 20:29:49.369), waitTimeout=-1, callTimeout=60000, tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl#0}, error: java.lang.IllegalArgumentException: Target is this node! -> [10.1.28.31]:5701","stack_trace":"<#dbcc9949> j.l.IllegalArgumentException: Target is this node! -> [10.1.28.31]:5701, response: ErrorResponse{callId=1046044, urgent=true, cause=java.lang.IllegalArgumentException: Target is this node! -> [10.1.28.31]:5701}\n\tat c.h.s.i.o.i.OutboundResponseHandler.send(OutboundResponseHandler.java:113)\n\tat c.h.s.i.o.i.OutboundResponseHandler.sendResponse(OutboundResponseHandler.java:96)\n\tat c.h.s.i.o.Operation.sendResponse(Operation.java:475)\n\tat c.h.s.i.o.i.OperationRunnerImpl.sendResponseAfterOperationError(OperationRunnerImpl.java:425)\n\tat c.h.s.i.o.i.OperationRunnerImpl.handleOperationError(OperationRunnerImpl.java:419)\n\tat c.h.s.i.o.i.OperationRunnerImpl.run(OperationRunnerImpl.java:253)\n\tat c.h.s.i.o.i.OperationRunnerImpl.run(OperationRunnerImpl.java:469)\n\tat c.h.s.i.o.i.OperationThread.process(OperationThread.java:197)\n\tat c.h.s.i.o.i.OperationThread.process(OperationThread.java:137)\n\tat c.h.s.i.o.i.OperationThread.executeRun(OperationThread.java:123)\n\tat c.h.i.u.e.HazelcastManagedThread.run(HazelcastManagedThread.java:102)\n"}
{"timestamp":"2021-06-01T20:29:49.779+10:00","app":"my-protected-application","logLevel":"INFO","thread":"hz.peaceful_pike.InvocationMonitorThread","eventSource":"com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor","message":"[10.1.28.31]:5701 [spring-session-cluster] [4.2] Invocations:2 timeouts:1 backup-timeouts:0"}
What does this log mean and what is its significance?
Hazelcast for OpenShift
Check that you use a StatefulSet (not a Deployment). In the case of DNS Lookup discovery, using a Deployment may cause Hazelcast to start in split-brain mode. You may also want to increase the value of the service-dns-timeout parameter.
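For reference, a programmatic sketch of that discovery setup (Hazelcast 4.x Java config; the service DNS name below is a placeholder for your StatefulSet's headless service):

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class KubernetesDiscovery {
    public static void main(String[] args) {
        Config config = new Config();
        config.setClusterName("spring-session-cluster");
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);
        // DNS Lookup discovery: resolves the headless service name to pod IPs.
        join.getKubernetesConfig().setEnabled(true)
            // Placeholder; use your StatefulSet's headless service name.
            .setProperty("service-dns", "my-protected-application.default.svc.cluster.local")
            // A larger timeout gives slow DNS registration more room, reducing
            // the chance of members starting split-brained.
            .setProperty("service-dns-timeout", "10");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}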
I'm working with OpenJDK 11 and a very simple Spring Boot application where almost the only thing enabled is the Spring Boot actuator, so I can call /actuator/health etc.
I also have a very simple Kubernetes cluster on GCE with just a pod running one container (containing this app, of course).
My configuration has some key points that I want to highlight. It has resource requests and limits:
resources:
  limits:
    memory: 600Mi
  requests:
    memory: 128Mi
And it has a readiness probe
readinessProbe:
  initialDelaySeconds: 30
  periodSeconds: 30
  httpGet:
    path: /actuator/health
    port: 8080
I'm also setting JVM_OPTS like this (which my program is obviously using):
env:
  - name: JVM_OPTS
    value: "-XX:MaxRAM=512m"
The problem
I launch this and it gets OOMKilled in about 3 hours every time!
I'm never calling anything myself; the only call is the readiness probe that Kubernetes makes every 30 seconds. Is that enough to exhaust the memory? I have also not implemented anything out of the ordinary, just a GET method that says hello world alongside all the Spring Boot imports needed for the actuators.
If I run kubectl top pod XXXXXX I can actually see it gradually getting bigger and bigger.
I have tried a lot of different configurations, tips, etc., but nothing seems to work with a basic Spring Boot app.
Is there a way to actually hard-limit the memory so that Java can raise an OutOfMemoryError? Or to prevent this from happening?
Thanks in advance
EDIT: After 15h running
NAME READY STATUS RESTARTS AGE
pod/test-79fd5c5b59-56654 1/1 Running 4 15h
describe pod says...
State: Running
Started: Wed, 27 Feb 2019 10:29:09 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 27 Feb 2019 06:27:39 +0000
Finished: Wed, 27 Feb 2019 10:29:08 +0000
That last span of time is about 4 hours with only 483 calls to /actuator/health; apparently that was enough to make Java exceed the MaxRAM hint?
EDIT: Almost 17h
It's about to die again:
$ kubectl top pod test-79fd5c5b59-56654
NAME CPU(cores) MEMORY(bytes)
test-79fd5c5b59-56654 43m 575Mi
EDIT: losing hope at 23h
NAME READY STATUS RESTARTS AGE
pod/test-79fd5c5b59-56654 1/1 Running 6 23h
describe pod:
State: Running
Started: Wed, 27 Feb 2019 18:01:45 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 27 Feb 2019 14:12:09 +0000
Finished: Wed, 27 Feb 2019 18:01:44 +0000
EDIT: A new finding
Last night I did some interesting reading:
https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
https://banzaicloud.com/blog/java10-container-sizing/
https://medium.com/adorsys/jvm-memory-settings-in-a-container-environment-64b0840e1d9e
TL;DR: I decided to remove the memory limit and start the process again; the result was quite interesting (after about 11 hours running):
NAME CPU(cores) MEMORY(bytes)
test-84ff9d9bd9-77xmh 218m 1122Mi
So... WTH is up with that CPU? I was kind of expecting a big number for memory usage, but what is happening with the CPU?
The one thing I can think of is that the GC is running like crazy, thinking that MaxRAM is 512m while the process is using more than 1G. I'm wondering: is Java detecting its ergonomics correctly? (I'm starting to doubt it.)
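A throwaway snippet like this (separate from the app) would at least print what the ergonomics actually decided inside the container:

public class ErgonomicsCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // What the JVM believes its limits are, after container detection.
        System.out.printf("Max heap:  %d MiB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("CPUs seen: %d%n", rt.availableProcessors());
    }
}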
To test my theory I set a limit of 512m and deployed the app that way, and I found that from the start there was an unusual CPU load that had to be the GC running very frequently:
kubectl create ...
limitrange/mem-limit-range created
pod/test created
kubectl exec -it test-64ccb87fd7-5ltb6 /usr/bin/free
total used free shared buff/cache available
Mem: 7658200 1141412 4132708 19948 2384080 6202496
Swap: 0 0 0
kubectl top pod ..
NAME CPU(cores) MEMORY(bytes)
test-64ccb87fd7-5ltb6 522m 283Mi
522m is too much vCPU, so my logical next step was to ensure I'm using the most appropriate GC for this case. I changed the JVM_OPTS this way:
env:
  - name: JVM_OPTS
    value: "-XX:MaxRAM=512m -Xmx128m -XX:+UseSerialGC"
...
resources:
  requests:
    memory: 256Mi
    cpu: 0.15
  limits:
    memory: 700Mi
And that brings the vCPU usage back to a reasonable level; after kubectl top pod:
NAME CPU(cores) MEMORY(bytes)
test-84f4c7445f-kzvd5 13m 305Mi
Messing with Xmx while MaxRAM is set obviously affects the JVM, but how is it not possible to control the amount of memory we have in virtualized containers? I know the free command reports the host's available RAM, but OpenJDK should be using cgroups, right?
I'm still monitoring the memory ...
EDIT: A new hope
I did two things: the first was to remove my container limit again, because I want to analyze how much it will grow. I also added a new flag to see how the process is using native memory: -XX:NativeMemoryTracking=summary.
At the beginning everything was normal; the process started out consuming about 300MB according to kubectl top pod, so I let it run for about 4 hours and then ...
kubectl top pod
NAME CPU(cores) MEMORY(bytes)
test-646864bc48-69wm2 54m 645Mi
Kind of expected, right? But then I checked the native memory usage:
jcmd <PID> VM.native_memory summary
Native Memory Tracking:
Total: reserved=2780631KB, committed=536883KB
- Java Heap (reserved=131072KB, committed=120896KB)
(mmap: reserved=131072KB, committed=120896KB)
- Class (reserved=203583KB, committed=92263KB)
(classes #17086)
( instance classes #15957, array classes #1129)
(malloc=2879KB #44797)
(mmap: reserved=200704KB, committed=89384KB)
( Metadata: )
( reserved=77824KB, committed=77480KB)
( used=76069KB)
( free=1411KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=122880KB, committed=11904KB)
( used=10967KB)
( free=937KB)
( waste=0KB =0.00%)
- Thread (reserved=2126472KB, committed=222584KB)
(thread #2059)
(stack: reserved=2116644KB, committed=212756KB)
(malloc=7415KB #10299)
(arena=2413KB #4116)
- Code (reserved=249957KB, committed=31621KB)
(malloc=2269KB #9949)
(mmap: reserved=247688KB, committed=29352KB)
- GC (reserved=951KB, committed=923KB)
(malloc=519KB #1742)
(mmap: reserved=432KB, committed=404KB)
- Compiler (reserved=1913KB, committed=1913KB)
(malloc=1783KB #1343)
(arena=131KB #5)
- Internal (reserved=7798KB, committed=7798KB)
(malloc=7758KB #28415)
(mmap: reserved=40KB, committed=40KB)
- Other (reserved=32304KB, committed=32304KB)
(malloc=32304KB #3030)
- Symbol (reserved=20616KB, committed=20616KB)
(malloc=17475KB #212850)
(arena=3141KB #1)
- Native Memory Tracking (reserved=5417KB, committed=5417KB)
(malloc=347KB #4494)
(tracking overhead=5070KB)
- Arena Chunk (reserved=241KB, committed=241KB)
(malloc=241KB)
- Logging (reserved=4KB, committed=4KB)
(malloc=4KB #184)
- Arguments (reserved=17KB, committed=17KB)
(malloc=17KB #469)
- Module (reserved=286KB, committed=286KB)
(malloc=286KB #2704)
Wait, what? 2.1 GB reserved for threads, and 222 MB committed? What is this? I currently don't know, I just saw it...
I need time to try to understand why this is happening.
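In the meantime, a small probe like this (illustrative only, not part of the app) can confirm whether thread creation is the culprit, by comparing live threads against the total ever started:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // A live count that keeps climbing alongside total started threads
        // suggests threads are being created and never released.
        System.out.printf("Live threads:          %d%n", threads.getThreadCount());
        System.out.printf("Peak threads:          %d%n", threads.getPeakThreadCount());
        System.out.printf("Total started threads: %d%n", threads.getTotalStartedThreadCount());
    }
}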
I finally found my issue and I want to share it so others can benefit in some way from this.
As I found in my last edit, I had a thread problem that was causing all the memory consumption over time; specifically, we were using an asynchronous method from a third-party library without properly taking care of those resources (ensuring those calls ended correctly, in this case).
I was able to detect the issue because I used a memory limit on my Kubernetes deployment from the beginning (which is a good practice in production environments) and then I monitored my app's memory consumption very closely using tools like jstat, jcmd, VisualVM, kill -3, and most importantly the -XX:NativeMemoryTracking=summary flag, which gave me so much detail in this regard.
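To illustrate the failure mode (hypothetical code, not the actual third-party call): creating an executor per invocation and never shutting it down leaks threads, and every leaked thread keeps its whole stack reserved, which is exactly what the Thread line in the NMT output above shows. A shared, bounded pool avoids it:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LeakVsFixed {
    // Leaky pattern: a new executor per call whose threads are never released.
    static void leaky(Runnable task) {
        ExecutorService ex = Executors.newFixedThreadPool(4);
        ex.submit(task);
        // missing ex.shutdown() -> 4 idle threads (stacks included) stay alive
    }

    // Fixed pattern: one shared, bounded pool for the whole application.
    private static final ExecutorService SHARED = Executors.newFixedThreadPool(4);

    static void fixed(Runnable task) {
        SHARED.submit(task);
    }
}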
I have a cluster with 3 nodes in my development environment, with a keyspace and a replication factor of 2. Originally I had only one node in this cluster, but then I added 2 more nodes, one by one. The Cassandra version is 3.7.
All these nodes are "clones", so I just modified cassandra.yaml with the corresponding IP for every node.
I've done a repair and cleanup on every node, and in my application I use a consistency level of ONE.
This is the nodetool status output:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.132.0.4 50.54 GiB 256 70.2% 50dc5baf-b8b3-4e19-8173-cf828afd36af rack1
UN 10.132.0.3 50.31 GiB 256 65.3% 2a45b7a5-41ce-4533-ba63-60fd3c5cc530 rack1
UN 10.132.0.9 33.88 GiB 256 64.5% e601fb16-6608-4e72-a820-dd4661977946 rack1
In cassandra.yaml I have only 10.132.0.3 as the seed node.
So at this point everything works fine and as expected: if I turn off one node, everything keeps running "fine", unless that node is 10.132.0.9. If I turn off this "bad" node, everything crashes with the following error:
org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
When I stop the bad node, the good ones show this error in their system.log files (I only copy the error, not the entire stack trace):
ERROR [SharedPool-Worker-1] 2018-02-27 10:59:16,449 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
I don't understand what's wrong with this node and I can't find a solution...
Edit
My connection code:
cluster_builder = Cluster.builder()
        .addContactPoints(serverIP.getCassandraList(sysvar))
        .withAuthProvider(new PlainTextAuthProvider(serverIP.getCassandraUser(sysvar), serverIP.getCassandraPwd(sysvar)))
        .withPoolingOptions(poolingOptions)
        .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE));
cluster = cluster_builder.build();
session = cluster.connect(keyspace);
My query:
statement = QueryBuilder.insertInto(keyspace, "measurement_minute").values(this.normal_names, (List<Object>) values);
And the execution:
ResultSetFuture future = session.executeAsync(statement.setConsistencyLevel(ConsistencyLevel.ONE));
I want to mention that I restarted, repaired and cleaned up all the nodes.
You are requesting QUORUM with a replication factor of 2. This won't really work well, as you are effectively requesting ALL. For a quorum, a majority of your replicas need to respond to your query.
You can calculate the node count for a quorum as (RF/2)+1 (using integer arithmetic), so RF=2 gives (2/2)+1=2: you need both of your replicas and can't have one down. The reason some queries work is that they don't use 10.132.0.9.
You can go with a replication factor of RF=3, or use CL.ONE, for example.
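For example, a sketch that raises the replication factor using the same 3.x driver the question uses (keyspace name and contact point are placeholders; run nodetool repair on each node afterwards so existing data gains its new replica):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RaiseReplicationFactor {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.132.0.3") // placeholder contact point
                .build();
             Session session = cluster.connect()) {
            // With RF=3, QUORUM needs (3/2)+1 = 2 replicas, so one node can be down.
            session.execute("ALTER KEYSPACE my_keyspace WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            // Then run `nodetool repair` on each node so existing data
            // is streamed to the new replicas.
        }
    }
}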
I'm looking to set up a clustered server on WebLogic. I walk through the following steps, and yet when I restart my server and then start the cluster, I get an exception. The exception is listed below.
Here are the steps I go through in the WebLogic console:
Create Cluster
On the console page mydomain.com/7002/console, on the left side
I click on
Environment
to expand the tree and then click on
Clusters
It changes screens on the right. I fill out the data for the
CLUSTER NAME
and I leave the messaging mode set to Unicast.
The multicast address and port are already filled out but are greyed out.
I then click OK.
After this I configure the server. On the side of the console page I click on Servers
Server Config
I fill out the Server Name: MyServer
I punch in the Server Listen Address
The port defaults to 7001, which is available
I then fill out the question
Should this server belong to a cluster
I answer
Yes, make this server a member of an existing cluster, and I select the cluster I just created.
I then click Finish
After I accept the changes and reboot the server, when I start up the server instance I get the exception below in the error logs. Any help would be greatly appreciated.
####<Sep 20, 2016 1:57:25 PM CDT> <Debug> <ServerLifeCycle> <DIFDX> <WLS_RUN_MANAGER> <main> <<WLS Kernel>> <> <> <1474397845602> <BEA-000000> <calling halt on weblogic.nodemanager.NMService#3c3d22af>
####<Sep 20, 2016 1:57:25 PM CDT> <Debug> <DiagnosticContext> <> <> <weblogic.timers.TimerThread> <> <> <> <1474397845604> <BEA-000000> <Invoked DCM.initialValue() for thread id=15, name=weblogic.timers.TimerThread
java.lang.Exception
at weblogic.diagnostics.context.DiagnosticContextManager$1.initialValue(DiagnosticContextManager.java:267)
at weblogic.kernel.ResettableThreadLocal.initialValue(ResettableThreadLocal.java:117)
at weblogic.kernel.ResettableThreadLocal$ThreadStorage.get(ResettableThreadLocal.java:204)
at weblogic.kernel.ResettableThreadLocal.get(ResettableThreadLocal.java:74)
at weblogic.diagnostics.context.DiagnosticContextManager$WLSDiagnosticContextFactoryImpl.findOrCreateDiagnosticContext(DiagnosticContextManager.java:365)
at weblogic.diagnostics.context.DiagnosticContextFactory.findOrCreateDiagnosticContext(DiagnosticContextFactory.java:111)
at weblogic.diagnostics.context.DiagnosticContextFactory.findOrCreateDiagnosticContext(DiagnosticContextFactory.java:94)
at weblogic.diagnostics.context.DiagnosticContextHelper.getContextId(DiagnosticContextHelper.java:32)
at weblogic.logging.LogEntryInitializer.getCurrentDiagnosticContextId(LogEntryInitializer.java:117)
at weblogic.logging.LogEntryInitializer.initializeLogEntry(LogEntryInitializer.java:67)
at weblogic.logging.WLLogRecord.<init>(WLLogRecord.java:43)
at weblogic.logging.WLLogRecord.<init>(WLLogRecord.java:54)
at weblogic.logging.WLLogger.normalizeLogRecord(WLLogger.java:64)
at weblogic.logging.WLLogger.log(WLLogger.java:35)
at weblogic.diagnostics.debug.DebugLogger.log(DebugLogger.java:231)
at weblogic.diagnostics.debug.DebugLogger.debug(DebugLogger.java:204)
at weblogic.work.SelfTuningDebugLogger.debug(SelfTuningDebugLogger.java:18)
at weblogic.work.ServerWorkManagerImpl$1.log(ServerWorkManagerImpl.java:44)
at weblogic.work.SelfTuningWorkManagerImpl.debug(SelfTuningWorkManagerImpl.java:597)
at weblogic.work.RequestManager.log(RequestManager.java:1204)
at weblogic.work.RequestManager.addToCalendarQueue(RequestManager.java:315)
at weblogic.work.RequestManager.addToPriorityQueue(RequestManager.java:301)
at weblogic.work.RequestManager.executeIt(RequestManager.java:248)
at weblogic.work.SelfTuningWorkManagerImpl.scheduleInternal(SelfTuningWorkManagerImpl.java:164)
at weblogic.work.SelfTuningWorkManagerImpl.schedule(SelfTuningWorkManagerImpl.java:144)
at weblogic.timers.internal.TimerManagerFactoryImpl$WorkManagerExecutor.execute(TimerManagerFactoryImpl.java:132)
at weblogic.timers.internal.TimerManagerImpl.waitForStop(TimerManagerImpl.java:241)
at weblogic.timers.internal.TimerManagerImpl.stop(TimerManagerImpl.java:98)
at weblogic.timers.internal.TimerThread$Thread.run(TimerThread.java:250)
Solution: It ended up being a timeout issue. The server was taking such a long time to come up that the instance assumed there was no response and thus assumed it couldn't get through. Once I increased the timeout, it finally came up and worked.
StackOverflowError while using the H2 database in a multi-threaded environment
Our application has a service layer that queries the H2 database and retrieves the result set.
The service layer connects to the H2 database using the open-source clustering middleware "Sequoia" (which offers load balancing and transparent failover) and also manages database connections:
https://sourceforge.net/projects/sequoiadb/
Our service layer has 50 service methods, and we have exposed the service methods as EJBs. While invoking the EJBs we get the response from the service (which includes an H2 READ) with an average response time of 0.2 secs.
The DAO layer queries the database using Hibernate Criteria, and we also use the JPA 2.0 entity manager to manage the datasource.
For load testing, we created a test class (with a main method) that invokes all 50 EJB methods. 50 threads were created and all the threads invoked the test class. The execution was OK for the first run: all 50 threads successfully completed invoking the 50 EJB methods.
When we triggered the test class again, we encountered a StackOverflowError. The detailed stack trace is shown below:
org.h2.jdbc.JdbcSQLException: General error: "java.lang.StackOverflowError" [50000-176]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:344)
at org.h2.message.DbException.get(DbException.java:167)
at org.h2.message.DbException.convert(DbException.java:290)
at org.h2.server.TcpServerThread.sendError(TcpServerThread.java:222)
at org.h2.server.TcpServerThread.run(TcpServerThread.java:155)
at java.lang.Thread.run(Thread.java:784)
Caused by: java.lang.StackOverflowError
at java.lang.Character.digit(Character.java:4505)
at java.lang.Integer.parseInt(Integer.java:458)
at java.lang.Integer.parseInt(Integer.java:510)
at java.text.MessageFormat.makeFormat(MessageFormat.java:1348)
at java.text.MessageFormat.applyPattern(MessageFormat.java:469)
at java.text.MessageFormat.<init>(MessageFormat.java:361)
at java.text.MessageFormat.format(MessageFormat.java:822)
at org.h2.message.DbException.translate(DbException.java:92)
at org.h2.message.DbException.getJdbcSQLException(DbException.java:343)
at org.h2.message.DbException.get(DbException.java:167)
at org.h2.message.DbException.convert(DbException.java:290)
at org.h2.command.Command.executeUpdate(Command.java:262)
at org.h2.jdbc.JdbcPreparedStatement.execute(JdbcPreparedStatement.java:199)
at org.h2.server.TcpServer.addConnection(TcpServer.java:140)
at org.h2.server.TcpServerThread.run(TcpServerThread.java:152)
... 1 more
at org.h2.engine.SessionRemote.done(SessionRemote.java:606)
at org.h2.engine.SessionRemote.initTransfer(SessionRemote.java:129)
at org.h2.engine.SessionRemote.connectServer(SessionRemote.java:430)
at org.h2.engine.SessionRemote.connectEmbeddedOrServer(SessionRemote.java:311)
at org.h2.jdbc.JdbcConnection.<init>(JdbcConnection.java:107)
at org.h2.jdbc.JdbcConnection.<init>(JdbcConnection.java:91)
at org.h2.Driver.connect(Driver.java:74)
at org.continuent.sequoia.controller.connection.DriverManager.getConnectionForDriver(DriverManager.java:266)
We then even added a random thread sleep (10-25 secs) between EJB invocations. The execution was successful three times (all 50 EJB invocations), and when we triggered it for the 4th time, it failed with the above error.
We see the above failure even with a thread count of 25.
The failure is random and there doesn't seem to be a pattern. Kindly let us know if we have missed any configuration.
Please let me know if you need any additional information. Thanks in advance for any help.
Technology stack:
1) Java 1.6
2) h2-1.3.176
3) Sequoia middleware that manages DB connection open and close
   - Variable connection pool manager
   - Initial pool size: 250
Thanks Lance Java for your suggestions. Increasing the stack size didn't help in our scenario, for the following reasons (i.e. the additional stack helped only for a few more executions).
In our app we are using an entity manager (JPA), and the transaction attribute was not set. Hence each query to the database created a thread to carry out the execution. In JVisualVM, observing the DB threads, the live thread count was equal to the total threads started.
Eventually our app created more than 30K threads, which resulted in the StackOverflowError.
Upon setting the transaction attribute, the threads get killed after the DB execution, and all the transactions are then managed by only 25-30 threads.
The issue is resolved now.
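For anyone hitting the same thing, a hedged sketch of what setting the transaction attribute can look like with EJB 3.x annotations (the bean, entity, and query here are invented for illustration):

import java.util.List;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Stateless
public class MeasurementService {

    @PersistenceContext
    private EntityManager em;

    // REQUIRED lets the container run the query inside a managed transaction,
    // so connections and worker threads are pooled and released, instead of
    // each unmanaged query spawning its own thread.
    @TransactionAttribute(TransactionAttributeType.REQUIRED)
    public List<?> listMeasurements() {
        return em.createQuery("select m from Measurement m").getResultList();
    }
}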
There are two main causes for a stack overflow error:
A bug containing a non-terminating recursive call
The allocated stack size for the jvm isn't big enough
Looking at your stack trace, it doesn't look recursive, so I'm guessing you are running out of space. Have you set the -Xss flag for your JVM? You might need to increase this value.
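If raising -Xss for every thread is too coarse, Java can also request a larger stack for an individual thread via the Thread constructor's stackSize argument (it is only a hint that some platforms ignore):

public class BigStackThread {
    public static void main(String[] args) throws InterruptedException {
        // Ask for a 4 MiB stack for this one thread; others keep the -Xss default.
        Thread worker = new Thread(null, new Runnable() {
            public void run() {
                deepWork(0);
            }
        }, "deep-worker", 4L * 1024 * 1024);
        worker.start();
        worker.join();
    }

    static void deepWork(int depth) {
        if (depth < 10000) {
            deepWork(depth + 1); // deep but terminating recursion
        }
    }
}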