What cometd configurations to use to reduce 402 error occurrences? - java

We have implemented a Java servlet running on JBoss container that uses CometD long-polling. This has been implemented in a few organizations without any issue, but in a recent implementation there are functional issues which appear to be related to the network setup of this organization.
Specifically, around 5% of the time, the connect requests are getting back 402 errors:
{"id":"39","error":"402::Unknown client","successful":false,"advice":{"interval":0,"reconnect":"handshake"},"channel":"/meta/connect"}
Getting this organization to address network performance is a significant challenge, so we are looking at a way to tune the implementation to reduce these issues.
Which cometd configuration parameters can be updated to improve this?
maxInterval, timeout, multiSessionInterval, etc.?
Thank you!

The "402::Unknown client" error occurs because the server stops seeing /meta/connect heartbeat messages from the client and expires the corresponding session. This is typically due to network issues.
Once the client's network is restored, the client sends a /meta/connect heartbeat message, but the server no longer has the corresponding session, hence the 402.
The parameter that controls the server side expiration of sessions is maxInterval, documented here: https://docs.cometd.org/current/reference/#_java_server.
The default is 10 seconds. If you increase it, the server retains sessions in memory for longer, so you need to take that into account.
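To make the trade-off concrete, here is a minimal sketch of how a server-side sweep could decide that a session has expired. It is a simplified model, not CometD's actual implementation: the class and method names are hypothetical, and only the maxInterval parameter and its 10-second default come from the answer above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of server-side session expiration; NOT CometD's code.
class SessionExpirySketch {
    private final long maxIntervalMillis;
    // clientId -> time (ms) at which the server last saw a /meta/connect
    private final Map<String, Long> lastConnect = new ConcurrentHashMap<>();

    SessionExpirySketch(long maxIntervalMillis) {
        this.maxIntervalMillis = maxIntervalMillis;
    }

    // Records a heartbeat from the given client.
    void onConnect(String clientId, long nowMillis) {
        lastConnect.put(clientId, nowMillis);
    }

    // Removes every session that has been silent for longer than maxInterval.
    void sweep(long nowMillis) {
        lastConnect.values().removeIf(last -> nowMillis - last > maxIntervalMillis);
    }

    // A heartbeat from a client that is no longer known is the "402" case.
    boolean isKnown(String clientId) {
        return lastConnect.containsKey(clientId);
    }

    public static void main(String[] args) {
        // Same 15-second network stall, two different maxInterval values.
        SessionExpirySketch tenSeconds = new SessionExpirySketch(10_000);
        tenSeconds.onConnect("client-1", 0);
        tenSeconds.sweep(15_000);
        System.out.println("maxInterval=10s, still known: " + tenSeconds.isKnown("client-1"));

        SessionExpirySketch thirtySeconds = new SessionExpirySketch(30_000);
        thirtySeconds.onConnect("client-1", 0);
        thirtySeconds.sweep(15_000);
        System.out.println("maxInterval=30s, still known: " + thirtySeconds.isKnown("client-1"));
    }
}
```

In this model a larger maxInterval lets the session survive the stall, at the cost of keeping the map entry (and, on a real server, the session and its queued messages) in memory for longer.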

Related

How can I monitor EWS SOAP messages relating to subscription creation

We have a spring java app using EWS to connect to our on prem 2016 Exchange server and 'stream' pulling emails. Every 30 minutes a new 30 minute subscription is made (via new thread). We assume old connection just expires.
When one instance is running in our environment, it works perfectly fine, but when two instances run, after some time one instance will eventually start throwing error about
You have exceeded the available concurrent connections for your account. Try again once your other requests have completed.
It seems like an issue which is then hit by throttling. I found that the Exchange servers config is:
EWSMaxConcurrency=27, MaxStreamingConcurrency=10,
HangingConnectionLimit=10
Our code previously didn't explicitly close connections and unsubscribe (was running fine without when one instance). We tried including both but the issue still persists and we noticed the close method for StreamingSubscriptionConnection throws error. The team that handles the Exchange server can find errors referencing the exceeding connection count error above, but nothing relating to the close connection error
...[m.e.w.d.n.StreamingSubscriptionConnection.close(349)]: java.lang.Exception: microsoft.exchange.webservices.data.notification.StreamingSubscriptionConnection
Currently we don't have much ability to make changes on the exchange server side. I'm not familiar with SOAP messages but I was planning to look into how to monitor them to see what inbound and outbound messages there are for some insights
For the service I set service.setTraceEnabled(true) and service.setTraceFlags(EnumSet.allOf(TraceFlags.class))
However, I only see trace messages in the console when an email arrives. I don't see any messages during startup when a subscription/connection is created
Can anyone help provide any advice on how I can monitor these subscription related messages?
I tried using SOAPUI but I'm having difficulty applying our server's WSDL. I considered using the Tunnelij plugin for intellij but I'm not too familiar with how to set it up either
My suspicion is that there is some intermittent latency issue on Exchange server side, perhaps response messages are not coming back in a timely manner, and this may be screwing up. I presume if I monitor these SOAP messages then I should see more than 10 requests to subscribe before that error appears
The EWS logs on the CAS (Client Access Server) should have details about the throttling issue. Are you using impersonation in your application? If you are not using impersonation, the concurrent connections are charged against the account you are using; with impersonation, they are charged against the account you are impersonating. The difference is that a single user can have no more than 10 streaming subscriptions (unless you modify the web.config), whereas with impersonation you can scale your application to thousands of users. See https://github.com/MicrosoftDocs/office-developer-exchange-docs/blob/main/docs/exchange-web-services/how-to-maintain-affinity-between-group-of-subscriptions-and-mailbox-server.md

Failed to connect to Tomcat server on ec2 instance

UPDATE:
My goal is to learn what factors could overwhelm my little tomcat server. And when some exception happens, what I could do to resolve or remediate it without switching my server to a better machine. This is not a real app in a production environment but just my own experiment (Besides some changes on the server-side, I may also do something on my client-side)
Both of my client and server are very simple: the server only checks the URL format and send 201 code if it is correct. Each request sent from my client only includes an easy JSON body. There is no database involved. The two machines (t2-micro) only run client and server respectively.
My client is OkHttpClient(). To avoid timeout exceptions, I have already set the timeouts to 1,000,000 milliseconds via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout="-1" (infinite).
ORIGINAL POST:
I'm trying to stress out my server by having a client launching 3000+ threads sending HTTP requests to my server. Both of my client and server reside on different ec2 instances.
Initially, I encountered some timeout issues, but after I set the connection, read and write timeout to a bigger value, this exception has been resolved. However, with the same specification, I'm getting java.net.ConnectException: Failed to connect to my_host_ip:8080 exception. And I do not know its root cause. I'm new to multithreading and distributed system, can anyone please give me some insights of this exception?
Below are screenshots from my EC2 instances (client and server; images not preserved).
Having gone through similar exercise in past I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests that tweak a few parameters each time and measuring the changes in CPU, logs, etc.
Please state what value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly, because each thread can stay busy for longer. Question the need for increasing the timeout. Is there any processing that slows your server?
Check application logs, JVM usage, and memory usage on the client and server. There will be some hints there.
Your client seems to hit 99%+ and then come back down. This implies that there could be a problem on the client side, in that it maxes out during the test. You might want to resize your client so that it can do more.
Look at open file handles. The limit should be sufficiently high.
Tomcat has a limit on the number of threads it uses to handle load. You can check this in server.xml and, if required, raise it. However, the CPU doesn't actually max out on the server side, so this is unlikely to be the problem.
If you use a database, check the performance of the database. Also check the JDBC connection settings. There is thread and timeout configuration at the JDBC level as well.
Is response compression set up on Tomcat? It will give much better throughput on the server, especially if the data sent back by each request is more than a few KB.
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the path in terms of stressing the server should be to start low and increase load in increments whilst monitoring various things (cpu, memory, JVM usage, file handle count, network i/o).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase load and monitor, you will likely discover patterns that suggest tuning specific parameters. Each tuning attempt should then be tested again at the same level of multithreading. Any improvement will be obvious from the monitoring.
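The incremental approach above can be sketched with plain java.util.concurrent. This is only an illustration under stated assumptions: the class name, the runStep helper, and the stub request are hypothetical, and in a real test the Callable would issue the HTTP call (for example through OkHttpClient) and return whether it succeeded.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stepped load driver; the request itself is supplied by the caller.
class SteppedLoadSketch {

    // Runs 'threads' concurrent copies of 'request' and returns how many failed.
    // A request fails if it returns false or throws an exception.
    static int runStep(int threads, Callable<Boolean> request) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int failures = 0;
        try {
            List<Callable<Boolean>> tasks = Collections.nCopies(threads, request);
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                try {
                    if (!result.get()) {
                        failures++;
                    }
                } catch (ExecutionException e) {
                    failures++; // e.g. a ConnectException thrown by the request
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load step interrupted", e);
        } finally {
            pool.shutdown();
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stub request that always succeeds; replace with a real HTTP call.
        Callable<Boolean> stubRequest = () -> true;
        // First few increments; extend with 1000, 1500, ... once these hold up.
        int[] steps = {100, 200, 500};
        for (int threads : steps) {
            int failures = runStep(threads, stubRequest);
            System.out.println(threads + " threads -> " + failures + " failures");
        }
    }
}
```

Recording the failure count per step alongside CPU, memory, and file handle figures makes the breaking point easier to spot.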

Spring boot + tomcat 8.5 + mongoDB, AsyncRequestTimeoutException

I have created a spring boot web application and deployed war of the same to tomcat container.
The application connects to mongoDB using Async connections. I am using mongodb-driver-async library for that.
At startup everything works fine. But as soon as load increases, It shows following exception in DB connections:
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:75)
at org.springframework.web.context.request.async.WebAsyncManager$5.run(WebAsyncManager.java:392)
at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:143)
at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:131)
at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:157)
I am using following versions of software:
Spring boot -> 1.5.4.RELEASE
Tomcat (installed as standalone binary) -> apache-tomcat-8.5.37
Mongo DB version: v3.4.10
mongodb-driver-async: 3.4.2
As soon as I restart the tomcat service, everything starts working fine.
Please help, what could be the root cause of this issue.
P.S.: I am using DeferredResult and CompletableFuture to create Async REST API.
I have also tried setting spring.mvc.async.request-timeout in the application and configuring asyncTimeout in Tomcat, but I still get the same error.
It's probably obvious that Spring is timing out your requests and throwing AsyncRequestTimeoutException, which returns a 503 back to your client.
Now the question is, why is this happening? There are two possibilities.
These are legitimate timeouts. You mentioned that you only see the exceptions when the load on your server increases. So possibly your server just can't handle that load and its performance has degraded to the point where some requests can't complete before Spring times them out.
The timeouts are caused by your server failing to send a response to an asynchronous request due to a programming error, leaving the request open until Spring eventually times it out. It's easy for this to happen if your server doesn't handle exceptions well. If your server is synchronous, it's okay to be a little sloppy with exception handling because unhandled exceptions will propagate up to the server framework, which will send a response back to the client. But if you fail to handle an exception in some asynchronous code, that exception will be caught elsewhere (probably in some thread pool management code), and there's no way for that code to know that there's an asynchronous request waiting on the result of the operation that threw the exception.
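The second possibility can be illustrated with CompletableFuture alone. In this sketch the "response" future stands in for a DeferredResult and "dbCall" stands in for an async driver call; both names, and the class itself, are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// 'response' plays the role of a DeferredResult; all names are hypothetical.
class AsyncErrorHandlingSketch {

    // Buggy: if dbCall fails, thenAccept is skipped and 'response' is never
    // completed, so the request stays open until the framework times it out.
    static CompletableFuture<String> buggy(CompletableFuture<String> dbCall) {
        CompletableFuture<String> response = new CompletableFuture<>();
        dbCall.thenAccept(response::complete);
        return response;
    }

    // Safer: whenComplete runs on success and on failure, so 'response'
    // is always completed one way or the other.
    static CompletableFuture<String> safe(CompletableFuture<String> dbCall) {
        CompletableFuture<String> response = new CompletableFuture<>();
        dbCall.whenComplete((value, error) -> {
            if (error != null) {
                response.complete("500: " + error.getMessage());
            } else {
                response.complete(value);
            }
        });
        return response;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> failing = new CompletableFuture<>();
        failing.completeExceptionally(new IllegalStateException("db down"));

        System.out.println("safe handler: " + safe(failing).get());
        try {
            buggy(failing).get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("buggy handler never responded");
        }
    }
}
```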
It's hard to figure out what might be happening without knowing more about your application. But there are some things you could investigate.
First, try looking for resource exhaustion.
Is the garbage collector running all the time?
Are all CPUs pegged at 100%?
Is the OS swapping heavily?
If the database server is on a separate machine, is that machine showing signs of resource exhaustion?
How many connections are open to the database? If there is a connection pool, is it maxed out?
How many threads are running? If there are thread pools in the server, are they maxed out?
If something's at its limit then possibly it is the bottleneck that is causing your requests to time out.
Try setting spring.mvc.async.request-timeout to -1 and see what happens. Do you now get responses for every request, only slowly, or do some requests seem to hang forever? If it's the latter, that strongly suggests that there's a bug in your server that's causing it to lose track of requests and fail to send responses. (If setting spring.mvc.async.request-timeout appears to have no effect, then the next thing you should investigate is whether the mechanism you're using for setting the configuration actually works.)
A strategy that I've found useful in these cases is to generate a unique ID for each request and write the ID along with some contextual information every time the server either makes an asynchronous call or receives a response from an asynchronous call, and at various checkpoints within asynchronous handlers. If requests go missing, you can use the log information to figure out the request IDs and what the server was last doing with that request.
A similar strategy is to save each request ID into a map in which the value is an object that tracks when the request was started and what your server last did with that request. (In this case your server is updating this map at each checkpoint rather than, or in addition to, writing to the log.) You can set up a filter to generate the request IDs and maintain the map. If your filter sees the server send a 5xx response, you can log the last action for that request from the map.
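A minimal sketch of such a map, using only the JDK, might look like the following. The class and method names are hypothetical; in a real application the start and finish calls would live in a servlet filter and the checkpoint calls in the asynchronous handlers.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory tracker for in-flight asynchronous requests.
class RequestTrackerSketch {

    static final class Entry {
        final long startedAtMillis;
        volatile String lastAction;

        Entry(long startedAtMillis, String lastAction) {
            this.startedAtMillis = startedAtMillis;
            this.lastAction = lastAction;
        }
    }

    private final Map<String, Entry> inFlight = new ConcurrentHashMap<>();

    // Registers a new request and returns its unique ID.
    String start(String description) {
        String id = UUID.randomUUID().toString();
        inFlight.put(id, new Entry(System.currentTimeMillis(), "started: " + description));
        return id;
    }

    // Records what the server last did with the request.
    void checkpoint(String id, String action) {
        Entry entry = inFlight.get(id);
        if (entry != null) {
            entry.lastAction = action;
        }
    }

    // Removes the request and returns its last recorded action (null if unknown).
    String finish(String id) {
        Entry entry = inFlight.remove(id);
        return entry == null ? null : entry.lastAction;
    }

    int inFlightCount() {
        return inFlight.size();
    }

    public static void main(String[] args) {
        RequestTrackerSketch tracker = new RequestTrackerSketch();
        String id = tracker.start("GET /orders");
        tracker.checkpoint(id, "mongo query sent");
        // On a 5xx response, log the last action before discarding the entry.
        System.out.println("last action: " + tracker.finish(id));
    }
}
```

Entries that remain in the map long after their start time point to requests the server has lost track of.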
Hope this helps!
Asynchronous tasks are placed in a queue (pool) that is processed in parallel, depending on the number of threads allocated. Not all asynchronous tasks execute at the same time; some of them are queued. In such a system, getting AsyncRequestTimeoutException under load is normal behaviour.
If you are filling up the queues with asynchronous tasks that cannot execute under pressure, increasing the timeout will only delay the problem. You should focus instead on the underlying problem:
Reduce the execution time of the asynchronous tasks (through various optimizations). This will relieve the pooling of async tasks. It obviously requires coding.
Increase the number of CPUs allocated so that the parallel tasks can run more efficiently.
Increase the number of threads servicing the executor of the driver.
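The queueing behaviour described above can be observed with a plain ThreadPoolExecutor. This is a generic illustration rather than the Mongo driver's internals; the class and helper names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Generic illustration of task queueing in a fixed-size pool.
class QueueingSketch {

    // Fixed-size pool with an unbounded queue, like Executors.newFixedThreadPool.
    static ThreadPoolExecutor newPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    // Submits 'tasks' jobs that block until 'release' is counted down and
    // returns how many of them are waiting in the queue instead of running.
    static int submitBlocking(ThreadPoolExecutor pool, int tasks, CountDownLatch release) {
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        return pool.getQueue().size();
    }

    public static void main(String[] args) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = newPool(2);
        int queued = submitBlocking(pool, 5, release);
        // With 2 worker threads and 5 slow tasks, the remaining tasks wait in the queue.
        System.out.println("queued tasks: " + queued);
        release.countDown();
        pool.shutdown();
    }
}
```

Tasks stuck in the queue still count against the request timeout, which is why slow tasks or too few threads surface as timeouts under load.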
The Mongo async driver uses AsynchronousSocketChannel, or Netty if Netty is found on the classpath. To increase the number of worker threads servicing the async communication, configure the stream factory:
MongoClientSettings.builder()
.streamFactoryFactory(new NettyStreamFactoryFactory(eventLoopGroup, allocator))
.build();
where eventLoopGroup is an io.netty.channel.EventLoopGroup, for example new io.netty.channel.nio.NioEventLoopGroup(nThreads), and allocator is an io.netty.buffer.ByteBufAllocator.
The nThreads argument of NioEventLoopGroup sets the number of threads servicing your async communication.
Read more about Netty configuration here https://mongodb.github.io/mongo-java-driver/3.2/driver-async/reference/connecting/connection-settings/

Elasticsearch unclosed client. Live threads after Tomcat shutdown. Memory usage impact?

I am using Elasticsearch 1.5.1 and Tomcat 7. Web application creates a TCP client instance as Singleton during server startup through Spring Framework.
Just noticed that I failed to close the client during server shutdown.
Through analysis on various tools like VisualVm, JConsole, MAT in Eclipse, it is evident that threads created by the elasticsearch client are live even after server(tomcat) shutdown.
Note: after introducing client.close() via Context Listener destroy methods, the threads are killed gracefully.
But my query here is,
how to check the memory occupied by these live threads?
Memory leak impact due to this thread?
We have got few Out of memory:Perm gen errors in PROD. This might be a reason but still I would like to measure and provide stats for this.
Any suggestions/help please.
Typically clients run in a different process than the services they communicate with. For example, I can open a web page in a web browser, and then shutdown the webserver, and the client will remain open.
This has to do with the underlying design choices of TCP/IP. Glossing over the details, in most cases a client only detects that its server is gone during the next request to the server. (Again, generally speaking) it does not continually poll the server to see if it is alive, nor does the server generally send a "please disconnect" message on shutting down.
The reason that clients don't generally poll servers is because it allows the server to handle more clients. With a polling approach, the server is limited by the number of clients running, but without a polling approach, it is limited by the number of clients actively communicating. This allows it to support more clients because many of the running clients aren't actively communicating.
The reason that servers typically don't send an "I'm shutting down" message is that the server often goes down uncontrollably (power outage, operating system crash, fire, short circuit, etc.). This means that a protocol which requires such a message will leave the clients in a corrupt state if the server goes down in an uncontrolled manner.
So losing a connection is really a function of a failed request to the server. The client will still typically be running until it makes the next attempt to do something.
Likewise, opening a connection to a server often does nothing most of the time too. To validate that you really have a working connection to a server, you must ask it for some data and get a reply. Most protocols do this automatically to simplify the logic; but, if you ever write your own service, if you don't ask for data from the server, even if the API says you have a good "connection", you might not. The API can report back a good "connections" when you have all the stuff configured on your machine successfully. To really know if it works 100% on the other machine, you need to ask for data (and get it).
Finally servers sometimes lose their clients, but because they don't waste bandwidth chattering with clients just to see if they are there, often the servers will put a "timeout" on the client connection. Basically if the server doesn't hear from the client in 10 minutes (or the configured value) then it closes the cached connection information for the client (recreating the connection information as necessary if the client comes back).
From your description it is not clear which of the scenarios you might be seeing, but hopefully this general knowledge will help you understand why after closing one side of a connection, the other side of a connection might still think it is open for a while.
There are ways to configure the network connection to report closures more immediately, but I would avoid using them, unless you are willing to lose a lot of your network bandwidth to keep-alive messages and don't want your servers to respond as quickly as they could.

NewRelic Ignore cometd LongPolling in Jetty

I have a Java Web app running on Jetty which connects to the server using cometD to receive data and returns after 25s if the server has no data and reconnects, i.e., long-polling.
I monitor the performance of the server using NewRelic but those long-polling connections skew the performance diagrams.
Is there a way to tell newrelic to actually ignore the time the server is waiting and only show the actual time that the server has been busy? I understand that it is probably impossible to do this on the newrelic side, but I thought there may be some best practices on how to deal with long-polling connections in newrelic.
Any help is appreciated!
You won't be able to exclude only the time the server is waiting and show just the time the server has been busy, but you can ignore the transaction completely if you do not need to see those metrics. This documentation covers using New Relic's API for ignoring transactions: https://docs.newrelic.com/docs/java/java-agent-api
CometD sends long polls to a URL that is the base CometD Servlet URL with "/connect" appended, see parameter appendMessageTypeToURL in the documentation.
For example, if you have mapped the CometD Servlet to /cometd/*, then long polls are sent to /cometd/connect.
I don't know NewRelic, but perhaps you can filter out the requests that end in */connect and gather your statistics on the other requests, that now won't be skewed by the long poll timeout.
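As a rough sketch of that filtering idea, the check itself is just a suffix match on the request path. The class and method names below are hypothetical, and how the result is fed to the monitoring tool (for example from a servlet filter) is left open.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that separates CometD long polls from other requests.
class LongPollFilterSketch {

    // True for paths such as /cometd/connect when the servlet is mapped to /cometd/*.
    static boolean isLongPoll(String requestPath) {
        return requestPath.endsWith("/connect");
    }

    // Keeps only the paths that should count towards performance statistics.
    static List<String> withoutLongPolls(List<String> requestPaths) {
        return requestPaths.stream()
                .filter(path -> !isLongPoll(path))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/cometd/handshake", "/cometd/connect", "/cometd");
        System.out.println(withoutLongPolls(paths));
    }
}
```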
