I have a Java application backed by Elasticsearch, with indices on the order of 10 GB.
There is a strange pattern: during the night, as request volume decreases, response duration increases. I checked the server resource metrics but found nothing unusual.
Here is the request volume diagram:
and the request duration:
As you can see, the request duration maxima line up exactly with the request volume minima.
But when we check server resources, nothing unusual can be seen:
I should mention that no special configuration has been set for Elasticsearch; everything is left at the defaults.
I have also checked this link, How does high memory pressure affect performance?, but as you can see, server memory usage is far below 75%.
I should also mention that Elasticsearch's data is stored on a Ceph file system.
I came across this issue during a load test, where we see a considerable increase in the application's response time when Kubernetes HPA scales up new pods. The HPA is set to 75% CPU utilization, with a minimum of 3 pods already running. So for example:
As you can see, the response time increases drastically; the peaks in this image are the times when new pods scale up. Even accounting for the Java application taking some time to start and warm up the JVM, requests served drop almost to zero during this period.
Any clue as to what could be causing the issue?
Make sure that you have the correct readiness probe for your pod. It seems that the new pod reaches Ready status before it is actually ready to serve traffic.
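As a sketch, a probe along these lines keeps the new pod out of the Service's endpoints until the application can actually respond. The `/health` path, port, and timings below are assumptions for illustration, not values from the question:

```yaml
# Container spec fragment; path, port, and thresholds are illustrative.
readinessProbe:
  httpGet:
    path: /health          # hypothetical health endpoint exposed by the app
    port: 8080
  initialDelaySeconds: 30  # give the JVM time to start and warm up
  periodSeconds: 5
  failureThreshold: 3      # marked NotReady after 3 consecutive failures
```

Until the probe succeeds, the pod receives no traffic, so a slow JVM warm-up no longer drags down live request latency.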
Problem Description
Currently, I see SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time (full error log is below) from the AWS SDK 2.0 Lambda client (with the Netty HTTP client) in a service where several nodes poll SQS messages from N queues and try to invoke Lambdas at a very high (unlimited) rate.
I tried to apply back-pressure based on CPU usage per node. This didn't really help: consuming SQS messages at a high rate still produced a lot of network connections per host while keeping CPU usage low, resulting in the same error.
Also, increasing the connection acquisition timeout doesn't help (it even makes things worse), since a backlog of connection acquisitions builds up while new Lambda invocation requests keep coming in. The same applies to increasing the maximum number of connections (currently, I have a max-connections value of 120,000).
Thus, I'm building an SQS back-pressure mechanism that prevents a node from polling for more messages based on the number of network connections open on that node.
The questions are:
How can I get the number of open connections on a host (in addition to the solutions below)?
Are there any Java libraries/frameworks that can be used without implementing custom code for the options mentioned below?
Considered Solutions
Read the LeasedConcurrency metric (via CloudWatchMetricPublisher) emitted as part of the SDK metrics
Read the JMX FileDescriptorUse metric
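For the file-descriptor option, the JDK's com.sun.management extension exposes the process's open-descriptor count directly, which is the same figure the JMX FileDescriptorUse metric is derived from. A minimal sketch, for Unix-like JVMs only; the back-pressure gate and its budget parameter are illustrative, not part of any SDK:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OpenConnectionsGate {
    /** Returns the process's open file-descriptor count, or -1 if unsupported. */
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1; // e.g. Windows JVMs don't expose this bean
    }

    /** Back-pressure check: stop polling SQS while descriptors exceed a budget. */
    static boolean shouldPausePolling(long budget) {
        long open = openFileDescriptors();
        return open >= 0 && open > budget;
    }

    public static void main(String[] args) {
        System.out.println("open fds: " + openFileDescriptors());
    }
}
```

Since sockets consume file descriptors, this count is an upper bound on open connections (it also includes open files and pipes), which is usually good enough for a coarse polling gate.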
Full Error Log
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.
If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98) ~[AwsJavaSdk-Core-2.0.jar:?]
PS
Links to any related networking/OS/back-pressure resources (including low-level details, such as why CPU stays low while there is a high number of connections to handle on a host) would be appreciated.
UPDATE:
My goal is to learn what factors could overwhelm my little Tomcat server, and, when an exception happens, what I can do to resolve or remediate it without moving my server to a better machine. This is not a real app in a production environment, just my own experiment. (Besides changes on the server side, I may also do something on my client side.)
Both my client and server are very simple: the server only checks the URL format and sends a 201 code if it is correct. Each request sent from my client includes only a simple JSON body. There is no database involved. The two machines (t2.micro) run only the client and the server, respectively.
My client uses OkHttpClient. To avoid timeout exceptions, I already set the timeouts to 1,000,000 milliseconds via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout="-1" (infinite).
ORIGINAL POST:
I'm trying to stress my server by having a client launch 3000+ threads sending HTTP requests to it. My client and server reside on different EC2 instances.
Initially, I encountered some timeout issues, but after I set the connect, read, and write timeouts to larger values, that exception was resolved. However, with the same setup, I'm now getting java.net.ConnectException: Failed to connect to my_host_ip:8080, and I do not know its root cause. I'm new to multithreading and distributed systems; can anyone give me some insight into this exception?
Below are some screenshots from my EC2 instances:
1. Client:
2. Server:
Having gone through similar exercises in the past, I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests, tweaking a few parameters in each test, and measuring the changes in CPU, logs, etc.
Please provide the value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly (since each thread can stay busy for longer). Question the need for increasing the timeout: is there any processing that slows your server down?
Check the application logs, JVM usage, and memory usage on both the client and the server. There will be some hints there.
Your client seems to be hitting 99%+ CPU and then coming down. This implies there could be a problem on the client side, in that it maxes out during the test. You might want to resize your client so it can do more.
Look at open file handles. The number should be sufficiently high.
Tomcat has a limit on the number of threads available to handle load. You can check this in server.xml and, if required, change it to handle more. Although CPU doesn't actually max out on the server side, so this is unlikely to be the problem.
If you use a database, check its performance, and also check the JDBC connection settings. There is thread and timeout configuration at the JDBC level as well.
Is response compression set up on Tomcat? It gives much better throughput, especially if the data sent back by each request is more than a few KB.
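The thread limit and compression mentioned above both live on the HTTP connector in server.xml. A sketch, assuming Tomcat's default HTTP/1.1 connector; the values are illustrative, not recommendations:

```xml
<!-- conf/server.xml fragment: maxThreads caps the worker threads,
     acceptCount the backlog of pending connections, and compression
     enables gzip for responses above compressionMinSize bytes.
     All values here are examples only. -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="400"
           acceptCount="200"
           connectionTimeout="20000"
           compression="on"
           compressionMinSize="2048"
           compressibleMimeType="application/json,text/html" />
```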
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the way to stress the server is to start low and increase the load in increments while monitoring various things (CPU, memory, JVM usage, file handle count, network I/O).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments: 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase the load and monitor, you will likely discover patterns that suggest tuning specific parameters. Each tuning attempt should then be tested again at the same level of multithreading. Any improvement will be obvious from the monitoring.
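The stepped plan above can be sketched in plain JDK code. Here a throwaway local HTTP server stands in for the real Tomcat instance, and the thread counts in main are scaled down so the sketch runs quickly; all names and values are illustrative:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SteppedLoad {
    /** Runs one batch of `threads` concurrent GETs; returns how many got HTTP 200. */
    static int runStep(HttpClient client, URI target, int threads) throws InterruptedException {
        AtomicInteger ok = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    HttpResponse<String> r = client.send(
                            HttpRequest.newBuilder(target).build(),
                            HttpResponse.BodyHandlers.ofString());
                    if (r.statusCode() == 200) ok.incrementAndGet();
                } catch (Exception e) {
                    // a failed request here is exactly the signal we are probing for
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return ok.get();
    }

    /** Increases load step by step and records successful responses per step. */
    public static Map<Integer, Integer> run(int[] steps) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", ex -> {
            byte[] body = "ok".getBytes();
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.setExecutor(Executors.newFixedThreadPool(8)); // serve requests concurrently
        server.start();
        URI target = URI.create("http://localhost:" + server.getAddress().getPort() + "/");
        HttpClient client = HttpClient.newHttpClient();
        Map<Integer, Integer> okPerStep = new LinkedHashMap<>();
        try {
            for (int threads : steps) {
                okPerStep.put(threads, runStep(client, target, threads));
            }
        } finally {
            server.stop(0);
        }
        return okPerStep;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(new int[]{5, 10, 20}));
    }
}
```

The breaking point shows up as the first step where successes fall noticeably below the thread count; from there, each tuning change can be re-tested at that same step.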
I have a Redis server that keeps almost 1,500,000 keys under heavy traffic (500 requests per second, almost 50 write operations per request) on Windows Server 2008 R2. It performs well, with very low response times. However, when the snapshot process starts, RAM usage keeps increasing and then, a couple of minutes later, goes back to normal.
Here is a screenshot of when it happens:
And here is the Redis snapshot configuration in the conf file:
save 900 1
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
Here is the memory info from the Redis client:
Is this normal? If not, how can I fix it?
When I send about 100 users to my web service, I get responses and the web service performs fine, but when I test with 1000 concurrent users, none of the requests gets a reply.
I am using JMeter for testing.
When I send 1000 concurrent users, my GlassFish admin panel times out in the browser and only opens after 4-5 minutes. The same happens for the WSDL URL.
I have tested my web service on our LAN, and it handles 2000 queries without any issues.
Please help me find a solution.
Edit 1.0
Some more findings
On your recommendation, what I did was simply return a string from the web service call: no lookup, no DAO, nothing... just returning a string.
The thread pool is 2000; no issues there.
Now when I ran JMeter with 1000 users, they ran much faster, but I got responses for only ~200 requests.
So this means that my PC running Windows 7 with an i5 processor and 4 GB RAM is outperforming HostGator's dedicated server with 4 GB RAM and an 8-core Xeon 5*** :(
This is not what I am paying $220 a month for...
Correct me if my finding is wrong: I tested my app on a LAN between two PCs locally, and it can process 2000+ messages smoothly.
Edit 1.1
After a lot of reading and experimentation, I have come to the conclusion that network latency is responsible for this behavior.
I increased the bean pool size in GlassFish's admin panel, and it helped raise the number of concurrent users to 300, but the issue arises again no matter how many beans I keep in the pool.
So, friends, the question is: please suggest other settings that I can change in GlassFish's admin panel to remove this issue at the root!
You need to add some performance logging for the various steps that your service performs. Does it do multiple steps? Is computation slow? Is database access slow? Does your connection pool not scale well? Do things need to be tweaked in the web server to allow such high concurrency? You'll need to measure these things to find the bottlenecks so you can eliminate them.
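A minimal sketch of such per-step timing; the step name and the work shown are hypothetical, and you would swap the System.out print for your logger:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Tiny helper that times each named step of a request and keeps the
// durations (in microseconds) for logging at the end of the request.
public class StepTimer {
    private final Map<String, Long> micros = new LinkedHashMap<>();

    /** Runs `work`, recording how long it took under `step`. */
    public <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            micros.put(step, (System.nanoTime() - start) / 1_000);
        }
    }

    public Map<String, Long> timings() { return micros; }

    public static void main(String[] args) {
        StepTimer t = new StepTimer();
        // Hypothetical request steps: parse, then compute.
        String parsed = t.time("parse", () -> "payload".trim());
        String result = t.time("compute", parsed::toUpperCase);
        System.out.println(result + " timings(us)=" + t.timings());
    }
}
```

Logging one line of timings per request makes it obvious which step's share of the total grows as concurrency rises, which is where to start tuning.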
I had the same problem on a server (with 200+ simultaneous users). I studied the official GlassFish tuning guide, but there is a very important parameter that doesn't appear in it. I used JMeter too, and in my case the response time increased exponentially while the server's processor stayed low.
In the GlassFish admin console (Configurations/server-config/Network Config/Thread Pools/http-thread-pool) you can see how many users your server can handle. (The parameters are different in GlassFish 2 and 3.)
Max Queue Size: The maximum number of pending connections in the queue. A value of -1 indicates that there is no limit to the queue size.
Max Thread Pool Size: The maximum number of threads in the thread pool
Min Thread Pool Size: The minimum number of threads in the thread pool
Idle Thread Timeout: The maximum amount of time that a thread can remain idle in the pool. After this time expires, the thread is removed from the pool.
I recommend setting Max Thread Pool Size to 100 or 200 to fix the problem.
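The same settings can also be applied directly in domain.xml; a fragment using GlassFish 3's attribute names, with the illustrative values suggested above:

```xml
<!-- domain.xml fragment under <thread-pools>; values are examples only -->
<thread-pool name="http-thread-pool"
             min-thread-pool-size="50"
             max-thread-pool-size="200"
             max-queue-size="4096"
             idle-thread-timeout-seconds="900" />
```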
You can also set other JVM options, for example:
-Xms / -Xmx (initial and maximum heap size)
-server
-XX:ParallelGCThreads
-XX:+AggressiveOpts
I hope it helps.