Java/Scala resource consumption and load

I am developing a web application in Scala. It's a simple application that takes data on a port from clients (JSON or Protobufs), does some computation involving a database server, and then replies to the client with a JSON/Protobuf object.
It's not a very heavy application, 1000 lines of code max. It creates a thread on every client request. The time it currently takes between getting the request and replying is 20-40 ms.
I need advice on what kind of hardware/setup I should use to serve 3000+ such requests per second. I need to procure hardware to put at my data center.
Anybody who has some experience deploying Java apps at scale, please advise. Should I use one big box with 2-4 Xeon 5500s and 32 GB of RAM, or multiple smaller machines?
UPDATE: we don't have many clients, only 3-4 of them; all requests will come from these few clients.

If each request takes 30 ms on average, a single core can handle only about 33 requests per second (1000 ms / 30 ms). Supposing that your app scales linearly (the best scenario you can expect), you will need at least 90-100 cores to reach 3000 req/s, which is more than 2-4 Xeons provide.
Worse, if your app relies on IO or on a DB (like most useful applications), you will get sublinear scaling and may need a lot more...
So the first thing to do is to analyze and optimize the application. Here are a few tips:
Creating a thread is expensive; try to create a limited number of threads and reuse them across requests (in Java, see ExecutorService for example; there is a sketch after this list).
If your app is IO-intensive: try to reduce IO calls as much as possible, use an in-memory cache, and give non-blocking IO a try.
If your app depends on a database, consider caching, and try a distributed solution if possible.
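A minimal sketch of the thread-reuse idea, assuming a hypothetical handle(Socket) method that does the parse/compute/reply work:

import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Server {
    public static void main(String[] args) throws Exception {
        // One pool, sized roughly to the core count, reused for all requests
        // instead of spawning a fresh thread per connection.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                pool.submit(() -> handle(client)); // queued if all workers are busy
            }
        }
    }

    private static void handle(Socket client) {
        // hypothetical: parse JSON/Protobuf, hit the DB, write the reply
    }
}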

Related

Test Plan is not giving me accurate results

So I am wondering if someone can help me please. I am trying to load test a Java REST application with thousands of requests per minute, but something is amiss and I can't quite figure out what's happening.
I am executing the load test from my laptop (via the terminal) and it's hitting a number of servers in AWS.
I am using the infinite loop option to fire in a lot of requests repeatedly. The 3 thread groups have the same config.
The problem I am having is that the CPU is rising very high, and the numbers do not match what I see in production on the same environment with regard to CPU etc.; the JMeter load test seems to be making the CPU work harder on my test environment.
Question 1 - Is my load test too fast for the server?
Question 2 - Is there a way to space out the load test so that I can get, say, exactly 10k rpm?
Question 1 - Is my load test too fast for the server? - We don't know. If you see high CPU usage with only 60 threads and the application responds slowly due to high CPU usage, it looks like a bottleneck. Another question is the size of the machine, and the number of processors and their frequency. You need to find which function is consuming the CPU cycles using a profiler tool, and look for a way to optimize that function.
Question 2 - Is there a way to space out the load test so that I can say 10k rpm exactly? - There is: check out the Constant Throughput Timer. But be aware of the next 2 facts:
The Constant Throughput Timer can only pause threads to limit JMeter's request execution rate to the specified number of requests per minute, so you need to make sure you create a sufficient number of threads in your Thread Group(s) to produce the desired load; 60 might not be enough.
The application needs to be able to respond fast enough: 10000 requests per minute is approximately 166 requests per second; with 60 threads, each thread needs to execute about 2.7 requests per second, which means the response time needs to be around 370 ms or less (a quick check of this arithmetic follows below).
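A back-of-the-envelope sketch of that arithmetic, using the numbers quoted above:

public class ThroughputSizing {
    public static void main(String[] args) {
        double targetRpm = 10_000;                  // desired load
        double targetRps = targetRpm / 60;          // ~166.7 requests/second
        int threads = 60;                           // current Thread Group size
        double perThreadRps = targetRps / threads;  // ~2.78 requests/second/thread
        double maxResponseMs = 1000 / perThreadRps; // ~360 ms, matching the rough
                                                    // 370 ms figure above; any slower
                                                    // and 60 threads can't hit 10k rpm
        System.out.printf("max tolerable response time: %.0f ms%n", maxResponseMs);
    }
}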
There are different aspects to consider before going for 10k requests.
Configure the test for one user (thread) and execute it. Check that we get a proper response for every request.
Incrementally increase the number of threads: 1 user, 5 users, 10 users, 20 users, 50 users, etc.
Try different duration scenarios: 10 mins, 20 mins, 30 mins, 1 hour, etc.
Collect metrics like error %, response time, number of requests, etc.
You can check for probable breaking points like:
CPU utilization getting high (100%) on the machine from which you are executing the tests; in this case, you can set up a few machines in a master-slave configuration (an example command is sketched below).
Error % getting high: the server may not be able to respond, and might have crashed.
Response time getting high: the server may be getting busy due to the load.
Also, make sure you have reliable connectivity and bandwidth. Just imagine you want to generate a huge load, but the connection you have is a few kbps; your tests will fail because of that.
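For the master-slave setup mentioned above, JMeter's non-GUI distributed mode looks roughly like this (the host names and plan file are placeholders):

# run test.jmx from this machine, generating load from two remote JMeter servers
jmeter -n -t test.jmx -R slave1.example.com,slave2.example.com -l results.jtl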

Why does Java App take less overall CPU when running multiple instances of app instead of one instance?

I've got a Java app running on Ubuntu. The app listens on a socket for incoming connections and creates a new thread to process each connection. The app receives incoming data on each connection, processes the data, and sends the processed data back to the client. Simple enough.
With only one instance of the application running and up to 70 simultaneous threads, the app will run the CPU up to over 150% and have trouble keeping up with processing the incoming data. This is running on a Dell 24-core system.
Now if I create 3 instances of my application and split the incoming data across the 3 instances on the same machine, the max overall CPU on the same machine may only reach 25%.
Question is: why would one instance of the application use 6 times the amount of CPU that 3 instances on the same machine, each processing one third of the data, use?
I'm not a Linux guy, but can anyone recommend a tool to monitor system resources to try and figure out where the bottleneck is occurring? Any clues as to why 3 instances processing the same amount of data as 1 instance would use so much less overall system CPU?
In general this should not be the case. Maybe you are reading the CPU usage wrong. Try the top, htop, ps, and vmstat commands to see what's going on.
I could imagine one reason for such behaviour: resource contention. If you have some sort of lock or a busy loop which manifests itself only in one instance (max connections, or max threads), then your system might not parallelize processing optimally and will wait on resources. I suggest connecting something like jconsole to your Java processes and seeing what's happening.
As a general recommendation, check how many threads are available per JVM and whether you are using them correctly. Maybe you don't have enough memory allocated to the JVM, so it's garbage collecting too often. If you use database ops, then check for bottlenecks there too. Profile, find the place where the app spends most of its time, and compare 1 vs. 3 instances in terms of the % of time spent in that function.
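A contrived sketch of the kind of contention described above: with one JVM there is a single shared lock that all 70 threads queue on, while three JVMs would each have their own, far less contended, lock:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ContentionDemo {
    // One lock per JVM instance: 70 threads in one process all serialize here,
    // while 3 processes with ~23 threads each contend much less per lock.
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(70);
        for (int i = 0; i < 70; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    synchronized (LOCK) {
                        process(); // stand-in for per-connection work
                    }
                }
            });
        }
        TimeUnit.SECONDS.sleep(10);
        pool.shutdownNow();
    }

    private static void process() { /* hypothetical processing */ }
}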

Scaling Software/Hardware for a Large # of External API Requests?

We have a system that, given a batch of requests, makes an equivalent number of calls to an external 3rd-party API. Given that this is an I/O-bound task, we currently use a cached thread pool of size 20 to service these requests. Other than the above, is the solution to:
Use fewer machines with more cores (less context-switching, capable of supporting more concurrent threads)
or
Use more machines by leveraging commodity/cheap hardware (pizza boxes)
The number of requests we receive a day is on the order of millions.
We're using Java, so the threads here are kernel, not "green".
Other Points/Thoughts:
Hadoop is commonly used for problems of this nature, but this needs to be real-time vs. the stereotypical offline data mining.
The API requests take anywhere from 200ms to 2 seconds on average
There is no shared state between requests
The 3rd Party in question is capable of servicing more requests than we can possibly fire (payments vendor).
It's not obvious to me that you need more resources at all (larger machines or more machines). If you're talking about at most 10 million requests in a day taking at most 2 seconds each, that means:
~110 requests per second. That's not so fast. Are the requests particularly large? Or are there big bursts? Are you doing heavy processing besides dispatching to the third-party API? You haven't given me any information so far that leads me to believe it's not possible to run your whole service on a single core. (Call it three of the smallest possible machines if you want to have n+2 redundancy.)
on average, ~220 active requests. Again, that seems like no problem for a single machine, even with a (pooled) thread-per-request model. Why don't you just expand your pool size and call it a day? Are these really bursty? (And do you have really tight latency/reliability requirements?) Do they need a huge amount of RAM while active?
Could you give some more information on why you think you have to make this choice?
Rather than using a large number of threads, you might fare better with event-driven I/O using node.js, with the caveats that it may mean a large rewrite and the fact that node.js is fairly young.
This SO article may be of interest.
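For what it's worth, the same event-driven idea is available without leaving the JVM on newer Java versions (11+) via java.net.http.HttpClient; a minimal sketch, where the endpoint URL is illustrative and not the vendor's real API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncApiCalls {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();

        // Fire a batch of calls without dedicating a thread to each in-flight request.
        List<CompletableFuture<String>> responses = List.of("a", "b", "c").stream()
                .map(id -> HttpRequest.newBuilder(
                        URI.create("https://api.example.com/charge/" + id)).build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body))
                .collect(Collectors.toList());

        // Block only here, once all requests are already in flight.
        responses.forEach(f -> System.out.println(f.join()));
    }
}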

BlazeDS Polling Interval set to 0: Unwanted side-effects?

tl;dr: Setting the polling interval to 0 has given my performance a huge boost, but I am worried about possible problems down the line.
In my application, I am doing a fair amount of publishing from our Java server to our Flex client, publishing on a variety of topics and sub-topics.
Recently, we have been on a round of performance improvements system-wide, and the messaging layer was proving to be a big bottleneck.
A few minutes ago, I discovered that setting the <polling-interval-millis> property in our services-config.xml to 0 causes published messages, even when there are lots of them, to be recognized by the client almost instantly, instead of with the 3-second delay that is the default value for polling-interval-millis. This has obviously had a tremendous impact.
So I'm pretty happy with the current performance; the only thing is, I'm a bit nervous about unintended side-effects caused by this change. In particular, I am worried about our Flash client slowing way down, and about way too much unwanted traffic.
My preliminary testing has not borne out this fear, but before I commit the change to our repository, I was hoping that somebody with experience with this stuff would chime in.
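For context, the property in question lives on the polling channel definition in services-config.xml; a sketch along these lines, where the channel id and endpoint URL are illustrative:

<channel-definition id="my-polling-amf" class="mx.messaging.channels.AMFChannel">
    <endpoint url="http://{server.name}:{server.port}/{context.root}/messagebroker/amfpolling"
              class="flex.messaging.endpoints.AMFEndpoint"/>
    <properties>
        <polling-enabled>true</polling-enabled>
        <!-- 0 = clients poll again immediately after each response -->
        <polling-interval-millis>0</polling-interval-millis>
    </properties>
</channel-definition>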
Unfortunately your question is too general... there is no way to give a specific answer. I'll write some ideas below; maybe they are helpful.
Decreasing the value from 3 seconds to 0 means that you are receiving new data much faster. If your Flex client uses this data to make complex computations, it is possible to slow your client down or to show obsolete data (it is a known pattern, see http://help.adobe.com/en_US/LiveCycleDataServicesES/3.1/Developing/WS3a1a89e415cd1e5d1a8a18fb122bdc0aad5-8000Update.html ). You need to understand how the data is processed and probably do some client benchmarking.
Also, the server will have to handle more requests, and it would be good to identify the maximum number of requests per second it can handle. For that, you will need to use a tool like JMeter to detect the maximum capacity of your system; after that you can do some computations to figure out how many requests per second you will have after reducing the interval from 3 to 0, taking into account that the number of clients is increasing by 10% per month, etc.
The main idea is that you should do some performance testing for the API and save the scripts, in order to see whether your future modifications slow the system down too much. Without this, it is quite hard to guess whether it is OK to change configuration parameters.
You might want to try long-polling. For our WebLogic servers, we don't get any problems unless we let the poll request run to 5 minutes, so we keep it to 4, then give it a 1-second rest before starting again. We have a couple of hundred total users, with 60-70 on it hardcore all day. The thing to keep in mind is that you're basically turning intermittent user requests into what amounts to almost-always-connected telnet sessions. Depending on the browser your users are using, there can be implications from that as well, but overall we've been very pleased.

How do I improve Jetty response times?

I'm trying to speed-test Jetty (to compare it with Apache) for serving dynamic content.
I'm testing this using three client threads requesting again as soon as a response comes back.
These are running on a local box (OS X 10.5.8, MacBook Pro). Apache is pretty much straight out of the box (XAMPP distribution), and I've tested Jetty 7.0.2 and 7.1.6.
Apache is giving me spiky times: response times up to 2000 ms, but an average of 50 ms, and if you remove the spikes (about 2%) the average is 10 ms per call. (This was to a PHP hello-world page.)
Jetty is giving me no spikes, but response times of about 200 ms.
This was calling the localhost:8080/hello/ that is distributed with Jetty, starting Jetty with java -jar start.jar.
This seems slow to me, and I'm wondering if it's just me doing something wrong.
Any suggestions on how to get better numbers out of Jetty would be appreciated.
Thanks
Well, since I am successfully running a site with some traffic on Jetty, I was pretty surprised by your observation.
So I just tried your test, with the same result.
So I decompiled the Hello servlet which comes with Jetty. And I had to laugh: it really includes the following line:
Thread.sleep(200L);
You can see for yourself.
My own experience with Jetty performance: I ran multi-threaded load tests on my real-world app, where I had a throughput of about 1000 requests per second on my dev workstation...
Note also that your speed test is really just a latency test, which is fine so long as you know what you are measuring. But Jetty does trade off latency for throughput, so there are often servers with lower latency, but also lower throughput.
Realistic traffic for a web server is not 3 very busy connections: 1 browser will open 6 connections, so that represents half a user. More realistic traffic is many hundreds or thousands of connections, each of them mostly idle.
Have a read of my blogs on this subject:
https://webtide.com/truth-in-benchmarking/
and
https://webtide.com/lies-damned-lies-and-benchmarks-2/
You should definitely check it with a profiler. Here are instructions on how to set up remote profiling with Jetty:
http://sujitpal.sys-con.com/node/508048/mobile
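If you go the remote-profiling route, the standard JMX flags are usually enough to get a tool like JConsole or VisualVM attached; a sketch (the port is illustrative, and disabling auth/SSL is only sane on a dev box):

java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar start.jar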
Speeding up or performance-tuning any application or server is really hard to get done, in my experience. You'll need to benchmark several times with different workload models to define your peak load. Once you define the peak load for the configuration/environment mixture you need to tune and benchmark, you might have to run 5+ iterations of your benchmark. Check the configuration of both Apache and Jetty in terms of the number of worker threads that process requests, and get them to match if possible. Here are some recommendations:
Consider the differences between the two environments (GC in Jetty: consider tuning your min and max memory thresholds to the same size before executing your test).
The load should come from another box. If you don't have a second box/PC/server, take your CPU/core count into account and pin the test to a specific CPU; do the same for Jetty/Apache. This is assuming you can't get another machine to be the stress agent.
Run several workload models.
Moving on to modeling the test, do the following 2 stages:
Stage 1: one thread for each configuration, for 30 minutes.
Stage 2: start with 1 thread and go up to 5, with a 10-minute interval between each increase in the count.
Based on the metrics from Stage 2, define a number of threads for the test, and run that number of threads concurrently for 1 hour.
Correlate the metrics (response times) from your testing app with those of the server hosting the application (use sar, top, and other Unix commands to track CPU and memory); some other process might be impacting your app. (Memory is relevant for Apache; Jetty will be constrained by the JVM memory configuration, so its memory usage should not change once the server is up and running.)
Be aware of the HotSpot compiler: methods have to be called several times (1000 times?) before they are compiled into native code, so warm the server up before measuring (a sketch follows below).
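A minimal warm-up sketch, where the hypothetical work() stands in for the code under test:

public class WarmupDemo {
    static long work(long x) { return x * 31 + 7; } // stand-in for the code under test

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up: give HotSpot a chance to JIT-compile the hot method.
        for (int i = 0; i < 20_000; i++) sink += work(i);

        // Measured run, after compilation has (very likely) happened.
        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++) sink += work(i);
        System.out.println("measured ns: " + (System.nanoTime() - start) + " (" + sink + ")");
    }
}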
