Why is the WebLogic work manager rejecting work - java

I am using CommonJ work managers in a WebLogic environment.
My setup is as follows:
My Spring application context:
<bean id="workManagerTaskExecutor"
class="org.springframework.scheduling.commonj.WorkManagerTaskExecutor">
<property name="workManagerName" value="java:comp/env/wm/UVWorkManager" />
</bean>
My web.xml file in WEB-INF:
<resource-ref>
<res-ref-name>wm/UVWorkManager</res-ref-name>
<res-type>commonj.work.WorkManager</res-type>
<res-auth>Container</res-auth>
<res-sharing-scope>Shareable</res-sharing-scope>
</resource-ref>
and my weblogic.xml file in WEB-INF:
<work-manager>
<name>wm/UVWorkManager</name>
<min-threads-constraint>
<name>min threads</name>
<count>30</count>
</min-threads-constraint>
<max-threads-constraint>
<name>max threads</name>
<count>200</count>
</max-threads-constraint>
<capacity>
<name>max capacity</name>
<count>20</count>
</capacity>
<work-manager-shutdown-trigger>
<max-stuck-thread-time>600</max-stuck-thread-time>
<stuck-thread-count>100</stuck-thread-count>
</work-manager-shutdown-trigger>
</work-manager>
After calling the service from SOAP UI for a while, I get the following error:
commonj.work.WorkException: commonj.work.WorkRejectedException: [WorkManager:002912]OverloadManager max capacity rejected request as current length 20 exceeds max capacity of 20
I understand that requests are rejected once the maximum capacity is reached, but why doesn't the work manager ever reclaim that capacity? Even if I wait 15 minutes I cannot submit a new request; I have to restart the server.

The capacity constraint counts all requests, queued or executing, from the constrained work set. With a capacity of 20, at most 20 requests can be in flight at once; if those 20 pieces of work never complete (for example because the threads are stuck), the capacity is never released and every new request is rejected until the server is restarted.
To verify the status of the threads, take a thread dump of the JVM.
You should set the JVM flag -Dweblogic.StuckThreadHandling=true to append the WorkManager's name to the log output when the thread dump is generated:
kill -3 JVM_PID
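For reference, this is roughly how submitted work occupies that capacity - a minimal CommonJ sketch, assuming the JNDI name from the web.xml above:
import javax.naming.InitialContext;
import commonj.work.Work;
import commonj.work.WorkManager;

public class SubmitExample {
    public void submit() throws Exception {
        // Container-managed work manager declared in web.xml/weblogic.xml.
        WorkManager wm = (WorkManager) new InitialContext()
                .lookup("java:comp/env/wm/UVWorkManager");

        // Each scheduled Work counts against the capacity constraint,
        // queued or executing, until run() returns. Work that blocks
        // forever holds its slot forever, so after 20 stuck submissions
        // every further schedule() call is rejected.
        wm.schedule(new Work() {
            public void run()         { /* must return to free the slot */ }
            public void release()     { /* interrupt long-running work here */ }
            public boolean isDaemon() { return false; }
        });
    }
}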

Related

Cluster instability with TCPPING protocol

I have 8 different processes distributed across 6 different servers with the following TCP/TCPPING protocol configuration:
<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
<TCP
bind_port="${jgroups.tcp.bind_port:16484}"
bind_addr="${jgroups.tcp.bind_addr:127.0.0.1}"
recv_buf_size="20M"
send_buf_size="20M"
max_bundle_size="64K"
sock_conn_timeout="300"
use_fork_join_pool="true"
thread_pool.min_threads="10"
thread_pool.max_threads="100"
thread_pool.keep_alive_time="30000" />
<TCPPING
async_discovery="true"
initial_hosts="${jgroups.tcpping.initial_hosts:127.0.0.1[16484]}"
port_range="5" />
<MERGE3 min_interval="10000" max_interval="30000" />
<FD_SOCK get_cache_timeout="10000"
cache_max_elements="300"
cache_max_age="60000"
suspect_msg_interval="10000"
num_tries="10"
sock_conn_timeout="10000"/>
<FD timeout="10000" max_tries="10" />
<VERIFY_SUSPECT timeout="10000" num_msgs="5"/>
<BARRIER />
<pbcast.NAKACK2
max_rebroadcast_timeout="5000"
use_mcast_xmit="false"
discard_delivered_msgs="true" />
<UNICAST3 />
<pbcast.STABLE
stability_delay="1000"
desired_avg_gossip="50000"
max_bytes="4M" />
<AUTH
auth_class="com.qfs.distribution.security.impl.CustomAuthToken"
auth_value="distribution_password"
token_hash="SHA" />
<pbcast.GMS
print_local_addr="true"
join_timeout="10000"
leave_timeout="10000"
merge_timeout="10000"
num_prev_mbrs="200"
view_ack_collection_timeout="10000"/>
</config>
The cluster keeps splitting into subgroups and then merging again and again, which results in high memory usage. I can also see a lot of "suspect" warnings in the logs, resulting from the periodic heartbeats sent by all the other cluster members. Am I missing something?
EDIT
After enabling GC logs, nothing suspect appeared to me. On the other hand, I've noticed these JGroups logs appearing a lot:
WARN: lonlx21440_FrtbQueryCube_QUERY_29302: I was suspected by woklxp00330_Sba-master_DATA_36219; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
DEBUG: lonlx21440_FrtbQueryCube_QUERY_29302: closing expired connection for redlxp00599_Sba-master_DATA_18899 (121206 ms old) in send_table
DEBUG: I (redlxp00599_Sba-master_DATA_18899) will be the merge leader
DEBUG: redlxp00599_Sba-master_DATA_18899: heartbeat missing from lonlx21503_Sba-master_DATA_2175 (number=1)
DEBUG: redlxp00599_Sba-master_DATA_18899: suspecting [lonlx21440_FrtbQueryCube_QUERY_29302]
DEBUG: lonlx21440_FrtbQueryCube_QUERY_29302: removed woklxp00330_Sba-master_DATA_36219 from xmit_table (not member anymore)
and this one
2020-08-31 16:35:34.715 [ForkJoinPool-3-worker-11] org.jgroups.protocols.pbcast.GMS:116
WARN: lonlx21440_FrtbQueryCube_QUERY_29302: failed to collect all ACKs (expected=6) for view [redlxp00599_Sba-master_DATA_18899|104] after 2000ms, missing 6 ACKs from (6) lonlx21503_Sba-master_DATA_2175, lonlx11179_DRC-master_DATA_15999, lonlx11184_Rrao-master_DATA_31760, lonlx11179_Rrao-master_DATA_25194, woklxp00330_Sba-master_DATA_36219, lonlx11184_DRC-master_DATA_49264
I still can't figure out where the instability comes from.
Thanks
The instability is not due to the TCPPING protocol - it belongs to the discovery protocol family, and its purpose is to find new members, not to kick them out of the cluster.
You use both FD_SOCK and FD to detect whether members have left, and then VERIFY_SUSPECT to confirm that a node is unreachable. The settings look pretty normal.
The first thing to check is your GC logs. If you experience stop-the-world pauses longer than, say, 15 seconds, chances are the nodes get disconnected because they are unresponsive during GC.
If your GC logs are fine, increase the logging level for FD, FD_SOCK and VERIFY_SUSPECT to TRACE and see what's going on, as in the sketch below.
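A minimal sketch of raising those protocols to TRACE programmatically - assuming the stack above is saved as tcp.xml, with "my-cluster" standing in for your real cluster name:
import org.jgroups.JChannel;
import org.jgroups.protocols.FD;
import org.jgroups.protocols.FD_SOCK;
import org.jgroups.protocols.VERIFY_SUSPECT;

public class TraceFailureDetection {
    public static void main(String[] args) throws Exception {
        // Build the channel from the XML stack shown in the question.
        JChannel ch = new JChannel("tcp.xml");
        // Per-protocol log levels, so global logging stays quiet.
        ch.getProtocolStack().findProtocol(FD_SOCK.class).setLevel("trace");
        ch.getProtocolStack().findProtocol(FD.class).setLevel("trace");
        ch.getProtocolStack().findProtocol(VERIFY_SUSPECT.class).setLevel("trace");
        ch.connect("my-cluster");
    }
}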

Infinispan TimeoutException ISPN000476

I am experiencing an embedded Infinispan cache issue where nodes time out on re-joining the cluster.
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7 from vvshost
at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:64)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
The only way I can get the node to re-join is to switch off the cache and delete all local cache persistence files.
Here is the configuration which I am using:
Transport:
TransportConfigurationBuilder - defaultClusteredBuild
JMX Statistics - Enabled
Duplicate domains - Allowed
Cache Manager:
Manager Class - EmbeddedCacheManager
Memory - Memory Size: 0
Persistence: Single File Store
async: disabled
Clustering Cache Mode - CacheMode.DIST_SYNC
The configuration looks right to me, but the value of remote-timeout defaults to 15000 milliseconds. Increase the timeout until you stop getting the error; a sketch follows.
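A minimal programmatic sketch using the embedded ConfigurationBuilder API (the 30-second value is illustrative, not a recommendation):
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class CacheConfig {
    public static Configuration distSync() {
        return new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)
                    // Default is 15000 ms; raise it until ISPN000476 stops.
                    .remoteTimeout(30_000)
                .build();
    }
}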
Hope it helps

Websphere HTTP outbound connection pool usage

I have an application running on WebSphere Application Server 6.1.0.43, and I'm having slowdown issues when trying to invoke a remote service.
The slowdown is in the method findGroupAndGetConnection of the class OutboundConnectionCache.
According to the IBM APAR PK94494:
The delay occurs after the client-side JAX-RPC handler (if present) is invoked and before the actual SOAP message is sent to the provider.
Because the delay occurs in the IBM web services engine, this problem
can be difficult to detect.
A com.ibm.ws.webservices.engine.transport.*=all trace will show entries similar to these which repeat:
[8/19/09 18:08:29:658 GMT] 00000047 OutboundConne 1 Enter:
WSWS3595I: Current pool size: 25. Connections-in-use size: 0.
Configured pool size: 25
In addition, that same trace spec will show long delays in executing
the .findGroupAndGetConnection() method:
[8/19/09 18:08:03:428 GMT] 00000047 OutboundConne >
OutboundConnectionCache.findGroupAndGetConnection()
WAITING_THREADS_THRESHOLD is 5 Entry
[8/19/09 18:08:38:358 GMT] 00000047 OutboundConne <
OutboundConnectionCache.findGroupAndGetConnection() Exit
And they recommend the following:
Reduce the 'com.ibm.websphere.webservices.http.connectionPoolCleanUpTime' from
the default of 180 to 120 seconds
Increase the max connections 'com.ibm.websphere.webservices.http.maxConnection' property from
default of 25 to 50. This will also require increasing the web
container thread pool size to 100.
Before changing the default properties I decided to monitor the web container thread usage, and I noticed that the maximum thread pool size (50) is never reached, but the minimum pool size (10) is reached very often, forcing connections to be destroyed and recreated.
Could this constant shrinking back to the minimum pool size cause the slowdown? Should I increase the minimum pool size? Or is my problem something other than the HTTP outbound connection pool?
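To double-check which values are actually in effect before and after any change, a small hedged sketch - the two custom properties are plain Java system properties, and the fallback defaults shown are taken from the APAR text; run it inside the server JVM (for example from a scratch JSP):
public class PoolPropsCheck {
    public static void main(String[] args) {
        // Both properties fall back to the documented defaults when unset.
        String cleanUp = System.getProperty(
                "com.ibm.websphere.webservices.http.connectionPoolCleanUpTime", "180 (default)");
        String maxConn = System.getProperty(
                "com.ibm.websphere.webservices.http.maxConnection", "25 (default)");
        System.out.println("connectionPoolCleanUpTime=" + cleanUp);
        System.out.println("maxConnection=" + maxConn);
    }
}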

Ofbiz error: Could not find simple-method

I repeatedly get this error in ofbiz.log:
Error running the simple-method: Could not find <simple-method name="checkProductRelatedPermission"> in XML document
This is weird, because I have a declaration of this method in my ProductServices.xml: <simple-method method-name="checkProductRelatedPermission" short-description="Check Product Related Permission">
I didn't have this error before, and the system was running properly for the past 6 months (the product index is currently growing).
Is it related to insufficient memory allocated to OFBiz? The server is running with limited memory.
[Update]
This is the service declaration in ProductServices.xml
<simple-method method-name="productGenericPermission" short-description="Main permission logic">
<set field="mainAction" from-field="parameters.mainAction"/>
<if-empty field="mainAction">
<add-error>
<fail-property resource="ProductUiLabels" property="ProductMissingMainActionInPermissionService"/>
</add-error>
<check-errors/>
</if-empty>
<set field="callingMethodName" from-field="parameters.resourceDescription"/>
<set field="checkAction" from-field="parameters.mainAction"/>
<call-simple-method method-name="checkProductRelatedPermission"/>
<if-empty field="error_list">
<set field="hasPermission" type="Boolean" value="true"/>
<field-to-result field="hasPermission"/>
<else>
<property-to-field resource="ProductUiLabels" property="ProductPermissionError" field="failMessage"/>
<set field="hasPermission" type="Boolean" value="false"/>
<field-to-result field="hasPermission"/>
<field-to-result field="failMessage"/>
</else>
</if-empty>
</simple-method>
Execution of <call-simple-method method-name="checkProductRelatedPermission"/> throws the exception.
If I restart the server, the same execution of the process doesn't throw this exception. The error happened after users heavily entered new products and updated products; I can see heavy Lucene activity in the log.
I increased the server memory from 2 GB to 4 GB and the Java heap from -Xmx1024m to -Xmx1512m. OFBiz is still running properly after 6 hours of monitoring.
[Update]
java.net.URL url = new java.net.URL("file:/home/ofbiz/ofbiz/applications/product/script/org/ofbiz/product/product/ProductServices.xml");
// Look the method up the same way the minilang engine does at runtime.
System.out.println(org.ofbiz.minilang.SimpleMethod.getSimpleMethod(url, "checkProductRelatedPermission"));
The output is the simple-method, which means the method is found.
Besides, the same process executes thousands of times and the error is thrown after thousands of executions, at random: sometimes after a few hours, sometimes after a few days.
The declaration in your XML uses "method-name". The error message says it is looking for a tag with "name".
Nothing was wrong with the configuration. The problem was that the JobSandbox job createAlsoBoughtProductAssocs had too many running, pending and queued instances. Those jobs consumed all the memory and drove CPU usage high.
I removed the createAlsoBoughtProductAssocs jobs and the problem disappeared.
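To gauge how bad the backlog is before clearing it, a hedged sketch using the entity engine - entity and field names are as in stock OFBiz, but the delegator call signature varies between OFBiz versions:
import java.util.List;
import org.ofbiz.base.util.UtilMisc;
import org.ofbiz.entity.Delegator;
import org.ofbiz.entity.GenericValue;

public class JobBacklog {
    // Counts every JobSandbox row for the offending service.
    public static int count(Delegator delegator) throws Exception {
        List<GenericValue> jobs = delegator.findByAnd("JobSandbox",
                UtilMisc.toMap("serviceName", "createAlsoBoughtProductAssocs"),
                null, false);
        return jobs.size();
    }
}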

Datastore Admin backup using Cron

I'm trying to use the Datastore Admin backup function to save data in case something happens. When I navigate to the App Engine link and run the backup from the web browser, it succeeds with the following information:
Overview
Success
Elapsed time: 00:00:03
Start time: 7/21/2014 8:36:53 PM
entity_kind: "AppVersion"
filesystem: "gs"
gs_bucket_name: "zz_backups"
namespace: null
Counters
io-write-bytes: 32768 (10922.67/sec avg.)
io-write-msec: 20 (6.67/sec avg.)
mapper-calls: 4 (1.33/sec avg.)
mapper-walltime-ms: 163 (54.33/sec avg.)
And it properly backs up 'AppVersion' entities to the zz_backups bucket. However, when I try to upload a cron job with the following details, using the information here:
<cron>
<url>/_ah/datastore_admin/backup.create?name=DataBackup&kind=AppVersion&filesystem=gs&gs_bucket_name=zz_backups</url>
<description>Backs up app data every day</description>
<schedule>every 24 hours</schedule>
<target>beta83</target>
</cron>
It fails. The log files don't say anything useful:
8ms /_ah/datastore_admin/backup.create?name=DataBackup&kind=AppVersion&filesystem=gs&gs_bucket_name=zz_backups
0.1.0.1 - - [21/Jul/2014:17:36:11 -0700] "GET /_ah/datastore_admin/backup.create?name=DataBackup&kind=AppVersion&filesystem=gs&gs_bucket_name=zz_backups HTTP/1.1" 404 234 - "AppEngine-Google; (+http://code.google.com/appengine)" "beta83.themeviewersproject.appspot.com" ms=8 cpu_ms=140 cpm_usd=0.000026 queue_name=__cron task_name=96c49edb17d5ff7f351fe5e42cad6614 instance=00c61b117cd710c22ec657388074a6c119debab4 app_engine_release=1.9.7 trace_id=857193fe141b15961aa6a7f514b907f1
The troubleshooting information at the bottom of this page doesn't appear to help either, as there is nothing similar to bullet 3. What is wrong with my cron file?
Figured it out. I missed the portion about <target>. I had targeted my current version instead of the one specified by the documentation:
This is required. It identifies the app version the cron backup job is
to be run on. You must use the value ah-builtin-python-bundle because
that is the version of your app that contains the Datastore Admin
features that the cron job needs to execute.
Full cron should have been:
<cron>
<url>/_ah/datastore_admin/backup.create?name=DataBackup&kind=AppVersion&filesystem=gs&gs_bucket_name=zz_backups</url>
<description>Backs up app data every day</description>
<schedule>every 24 hours</schedule>
<target>ah-builtin-python-bundle</target>
</cron>
