I designed an authentication protocol using java. The average excution time for a single authentication on my Desktop computer is 2.87 ms. My computer has the following specification. Windows 10 with a 1.99 GHz Intel Core i7 and 8GB of RAM.
If a number of users say 10 users preform the authentication simultaneously. What is the total computational time. Can I just say (2.87*10)?
A typical Core i7 CPU has between 4 and 8 cores, and can execute 6..12 threads in parallel (# cores times hyperthreading).
Assuming
you have an i7-8650U (1.9GHz, 4 cores, hyper-threaded),
your Java server is multithreaded (which should be the case if you use any popular implementation like Tomcat, Jetty, or alike) and
there is no other CPU-intensive workloads running
You can say your server can handle 8 authentication requests simultaneously: 4 at full speed + 4 more at ~30% speed because of hyper-threading, or 2.87ms*1.3=3.73ms for every 8 users.
10 users will thus take around ~ 3.73ms+2.87ms = 6.6ms.
What's important when measuring Java performance, however, is to measure steady state under load, in order to take garbage collection overhead into account. When measuring a single request, you may often miss the garbage collection step entirely.
You can't say that the time will be multiplied by the number of users.
First of all because it would mean that your authentication mechanisms works on 1 threads, which would be terrible.
Then the time registered on your computer will for sure be different on the production or staging environment. You will have to add network hops, but on the others end servers that would do only this, but you cannot compare this time to the one on your personal computer.
If you want to test this kind of things, use performance/load testing tools. I can recommend JMeter and open source tool from the Apache foundation, or Gatling another open source tool for this.
They are designed to be used on such use case. With them you could then call your authentication API, for example with 100 users in 10 seconds, and you would see reports in the end which will give you your answer.
Related
Wowza is causing us troubles, not scaling past 6k concurrent users, sometimes freezing on a few hundred. It crashes and starts killing sessions. We step in to restart Wowza multiple times per streaming event.
Our server specs:
DL380 Gen10
2x Intel Xeon Silver 4110 / 2.1 GHz
64 GB RAM
300 GB HDD
The network 10 GB dedicated
Some servers running Centos 6, others Centos 7
Java version 1.8.0_20 64-bit
Wowza streaming engine 4.2.0
I asked and was told:
Wowza scales easily to millions if you put a CDN in front of it (which
is trivially easy to do). 6K users off a single instance simply ain’t
happening. For one, Java maxes out at around 4Gbps per JVM instance.
So even if you have a 10G NIC on the machine, you’ll want to run
multiple instances if you want to use the full bandwidth.
And:
How many 720 streams can you do on a 10gb network # 2mbps?
Without network overhead, it’s about 5,000
With the limitation of java at 4gbps, it’s only 2,000 per instance.
Then if you do manage to utilize that 10Gb network and saturate it,
what happens to all other applications people are accessing on other
servers?
If they want more streams, they need edge servers in multiple data
centers or have to somehow to get more 10Gb networks installed.
That’s for streaming only. No idea what transcoding would add in terms
of CPU load and disk IO.
So I began looking for an alternative to Wowza. Due to the nature of our business, we can't use CDN or cloud hosting except with very few clients. Everything should be hosted in-house, in the client's datacenter.
I found this article and reached out to the author to ask him about Flussonic and how it compares to Wowza. He said:
I can't speak to the 4 Gbps limit that you're seeing in Java. It's
also possible that your Wowza instance is configured incorrectly. We'd
need to look at your configuration parameters to see what's happening.
We've had great success scaling both Wowza and Flussonic media servers
by pairing them with our peer-to-peer (p2p) CDN service. That was the
whole point of the article we wrote in 2016
Because we reduce the number of HTTP requests that the server has to
handle by up to 90% (or more), we increase the capacity of each server
by 10x - meaning each server can handle 10x the number of concurrent
viewers.
Even on the Wowza forum, some people say that Java maxes out at around 5Gbps per JVM instance. Others say that this number is incorrect, or made up.
Due to the nature of our business, this number, as silly as it is, means so much for us. If Java cannot handle more than 7k viewers per instance, we need to hold meetings and discuss what to do with Wowza.
So is it true that Java maxes out at around 4Gbps or 5Gbps per JVM instance?
I have a big question about tuning linux for java performance, so i start with my case.
I have an application running a number of threads that communicate with each other. My typical workflow is:
1) Some consumer thread sync on a common Object lock and calls wait() on it.
2) Some producer thread waits via Selector for data from network.
2.1) producer receives data and form an object with received timestamp (microseconds precision).
2.2) producer puts this packet in some exchange map and calls notifyAll on common lock.
3) Consumer thread wakes up and reads produced object.
3.1) consumer creates new object and writes in it time difference in microseconds between received timestamp and current timestamp. this way i can monitor reaction time.
And this reaction time is the whole issue.
When i test my application on my own machine i usually get about 200-400 microseconds reaction time, but when i monitor it on my production linux machine i get numbers from 2000 to 4000 microseconds!
Right now i'm running ubuntu 16.04 as my production OS and Oracle jdk 8-111. I have a physical server with 2 Xeon Processors. I run only usual OS daemons and my app on this server so there is plenty of resources compared to my dev notebook.
I run my java app as a jar file with flags:
sudo chrt -r 77 java -server -XX:+UseNUMA -d64 -Xmx1500m -XX:NewSize=1000m -XX:+UseG1GC -jar ...
I use sudo chrt to change priority since i thought it's the case, but it didn't help.
I tuned bios for maximum performance and turned off C-States.
What else i can tune for faster reaction times and low context switches?
No, there is no single echo 1 > /proc/sys/unlock_concurrent_magic option on Linux to globally improve concurrent performance. If such an option existed, it would simply be enabled by default.
In fact, despite being tunable in general, you aren't going to find many tunables that have a big effect specifically on raw concurrency on Linux. The ones that might help often incidentially related to concurrency - i.e., something is enabled (let's say THP) which slows down your particular load, and since at least part of the slowness occurs under lock, the whole concurrent throughput is affected.
Java concurrency vs the OS
My experience, however, is that Java applications are very rarely affected directly by OS-level concurrency behavior. In fact, most of Java concurrency is implemented efficiently without using OS features, and will behave the same across OSes. The primary places where the JVM touches the OS as it relates to concurrency is for thread creation, destruction and waiting/blocking. Recent Linux kernels have very good implementations of all three so it would be unusual that you run into a bottleneck there (indeed, a well-tuned application should not be doing a ton of thread creation and should also seek to minimize blocking).
So I find it very likely the performance difference is due to other differences, in hardware, in application configuration, in the applied load, or something else.
Characterize your performance
Here's what I'd try first to characterize the performance discrepancy between your development host and the production system.
At the top level, the difference is going either be because the production stack is actually slower for the load in question or because the local test isn't an accurate reflection of the production load.
One quick test you can do to distinguish the cases is to run whatever local test you are running to get 200-400us response times on an unloaded production server. If the server is still getting response times that are 10x worse, then you know your test is probably reasonable, and the difference is really in the production stack.
At that point, the problem could still be in OS, in the software configuration, in the hardware, etc. So you should try to bisect the differences between the production stack and your local host - set any tunable parameters to the same value, investigate any application-specific configuration differences, try to characterize any hardware differences.
One big gotcha is that often production servers are in multi-socket configurations, which may increase the cost of contention by an order of magnitude, since cross-socket communication (generally 100+ cycles) is required - whereas development boxes are generally multi-core but single-socket, so contention overhead is contained to the shared L3 (generally ~30 cycles).
On the other hand, you might find that your local test performs just fine on the production server as well, so the real issue is that your test doesn't represent the true production load. You should then make an effort to characterize the true production load so you can replicate it and then tune it locally. How to tune it locally could of course fill a book or two (or require a very highly paid contractor or two), so you'd have to come back with a narrower question to get useful help here.
"Big Iron" vs your laptop
It is a common fallacy that "big iron" is going to be faster at everything than your puny laptop. In fact, quite the opposite is true for many operations, especially when you measure operation latency (as opposed to total throughput).
For example, latency to memory on the server parts is often 50% slower versus client parts, even comparing single socket systems. John McCalpin reports a main-memory latency of 54.6 ns for a client Sandy Bridge part and 79 ns for the corresponding server part. It is well known the path to memory and memory controller design for servers trades off latency for throughput, reliability and the ability to support more cores and total DRAM1.
In particular, you mention that your producer server is a "2 Xeon Processors", which I take to mean a dual-socket system. Once you introduce a second socket, you change the mechanics of synchronization entirely. On a single core system, when separate threads under contention, at worst you are sending cache lines and coherency traffic through the shared L3, which has a latency of 30-40 cycles.
On a system with more than one socket, however, concurrency traffic generally has to flow over the QPI links between sockets, which has latency on the order of DRAM access, perhaps 80 ns (i.e., 240 cycles on a 3GHz box). So you can have nearly an order of magnitude slowdown from the hardware architecture alone.
Furthermore, notifyAll type scenarios as you describe your workflow often get much worse with more cores and more threads. E.g., with more cores, you are less likely to have two communicating processes running on the same hyperthread (which dramatically speeds up inter-thread coordination, but is otherwise undesirable) and the total contention and coherency traffic may scale up in proportion to the number of cores (e.g., because a cache line has to ping-pong around to every core when you wake up threads).
So it's often the case that a heavily contended (often badly designed) algorithm performs much worse on "big iron" than on a single-socket consumer system.
1 E.g., through buffering, which adds latency, but increases the host's maximum RAM capacity.
Scenario : I have a sample application and I have 3 different system configuration -
- 2 core processor, 2 GB RAM, 60 GB HHD,
- 4 core processor, 4 GB RAM, 80 GB HHD,
- 8 core processor, 8 GB RAM, 120 GB HHD
In order to effectively exploit the H/W capabilities for my application, I wish to configure the no. of threads at the application level. However, I wish to do this only after a thorough understanding of system capabilities.
Could there be some way(system/modus/tool) to determine the system prowess with reference to the max and min no. of threads it could service optimally & without any loss in efficiency and performance. By this, I could configure only those values for my application that will do full justice and achieve best performance for the respective hardware configuration.
Edited1 :
Could any one please advise any read-up on how to set a baseline for a particular h/w config.
Edited2 :
To make it more direct - Wish to learn/know about any resource/write-up that I can read to gain some understanding on CPU management of Threads at a general/holistic level.
The optimal number of threads to use depends on several factors, but mostly the number of available processors and how cpu-intensive your tasks are. Java Concurrency in Practice proposes the following formal formula to estimate the optimal number of threads:
N_threads = N_cpu * U_cpu * (1 + W / C)
Where:
N_threads is the optimal number of threads
N_cpu is the number of prcessors, which you can obtain from Runtime.getRuntime().availableProcessors();
U_cpu is the target CPU utilization (1 if you want to use the full available resources)
W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)
So for example, in a CPU-bound scenario, you would have as many threads as CPU (some advocate to use that number + 1 but I have never seen that it made a significant difference).
For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.
Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).
This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.
Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.
My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:
Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speed (ghz) grows. Some recent Intel and AMD chips can range from 2.6ghz (all cores active) to 3.6ghz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6ghz - 2.0ghz throughput in the former design. There is currently no way to query this info at runtime.
If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all cpu resources may not please the user or server admin (depending on if the software is a user app or server app).
There is no robust way to know what's going on within the rest of the machine at run-time, without replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated and usefulness is limited to specific types of applications (of which yours may qualify), and usually benefit from or require elevated or privileged access levels.
Modern virus scanners now-days work by setting a special priority flag provided by modern operating systems, eg. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful to a cpu intensive task such as yours.
Distributed home computing apps (BOINC, Folding#Home, etc) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row then the app will suspend computation. Once the load goes low for some number of queries, it resumes. Multiple queries are required because the CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. if BOINC is run without Admin privileges then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
Regarding SMT (HyperThreading, Compute Modules):
Most SMTs will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half-a-thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation its doing, and the target architecture. Intel and AMD's SMT implementations behave almost opposite of each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel. AMD's is strong at running SIMD and memory ops in parallel.
Regarding Turbo Features:
Most CPUs these days have very effective built-in Turbo support that further lessens the value-gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7ghz on two threads. It dropped to 3.5ghz when a third thread is started, and to 3.4ghz when a fourth was started. Since it's an integrated GPU as well, it dropped all the way to approx 3.0ghz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios); but could still muster 3.6ghz with 2 threads and GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful, they served as GPU servicing threads -- able to wake up and respond quickly to push new data to the GPU, as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/Turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits local your purposes well enough, allow it to be overridden by config/switch, and move on.
You can get the number of processors available to the JVM like this:
Runtime.getRuntime().availableProcessors()
Calculating the optimal number of threads from the number of available processors is unfortunately not trivial however. This depends a lot on the characteristics of the application, for instance with a CPU-bound application having more threads than the number of processors make little sense, while if the application is mostly IO-bound you might want to use more threads. You also need to take into account if other resource intensive processes are running on the system.
I think the best strategy would be to decide the optimal number of threads empirically for each of the hardware configuration, and then use these numbers in your application.
I agree with the other answers here that recommend a best-guess approach, and providing configuration for overriding the defaults.
In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.
You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.
A common approach is to avoid CPU 0 (always used by the OS), and to set your application's cpu affinity to a group of CPUs that are in the same socket.
Keeping the app's threads away from cpu 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.
Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among cpus.
As with everything else, this is highly dependent on the architecture of the machine that you are running on, as well as what other applications are runnning.
Use VisualVm tool to monitor threads.First Create minimum threads in program and see its performance.Then increase the no of threads within the program ans again analyze its performance.May this help you.
I use this Python script here to determine the number of cores (and memory, etc.) to launch my Java application with optimum parameters and ergonomics. PlatformWise on Github
It works like this: Write a python script which calls the getNumberOfCPUCores() in the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that inform to your program via command line arguments. Your program can then use the appropriate number of threads based on the number of cores.
Creating a thread on application level is good and in a multicore processor separate threads are executed on cores to enhance performance.So to utilize the core processing power it is best practice to implement threading.
What i think:
At a time only 1 thread of a program will execute on 1 core.
Same application with 2 thread will execute on half time on 2 core.
Same application with 4 Threads will execute more faster on 4 core.
So the application you developing should have the threading level<= no of cores.
Thread execution time is managed by the operating system and is a highly unpredictable activity. CPU execution time is known as a time slice or a quantum. If we create more and more threads the operating system spends a fraction of this time slice in deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words each thread will do lesser work if there were a large number of threads queued up.
Read this to get how to actually utilize cpu core's.Fantastic content.
csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/
Calculating the optimal number of threads from the number of available processors is unfortunately not trivial however. This depends a lot on the characteristics of the application, for instance with a CPU-bound application having more threads than the number of processors make little sense, while if the application is mostly IO-bound you might want to use more threads. You also need to take into account if other resource intensive processes are running on the system.
On many forums I found that people use Solaris for their Java applications.
I interested what are the main advantages of such combination?
My first assumption is that Solaris is very fast.
I also found out that on Solaris it is possible to match one-to-one java threads with kernel threads - as I understand it results in again very fast thread creation.
Please correct me if I'm wrong and are there any other main points?
What Solaris gives you (as its Software not hardware) over Linux or Windows is greater system manageability and low level tracing like DTrace.
What you appear to be asking about is having more threads running concurrently which is a feature of the hardware. If you run Solaris x86 or Linux or Window on the same hardware you will have the same number of logical threads. However if you run Solaris on some SPARC processors which have lots of logical threads (32 or more) running concurrently which reduces overhead if you have a need for that many threads.
The http://en.wikipedia.org/wiki/SPARC_T3 process supports up to 512 logical threads across 16 cores. This can really improve performance where you have a need for so many threads, e.g. using many blocking IO connections.
However if you need only one to six critical threads (and a bunch of non-critcal threads) a plain x64 processor will be much faster, and cheaper. (As it is designed to handle less threads faster and are mass produced on a larger scale)
We use Solaris for java applications at my workplace. I do not know about any exact performance advantage, but the reasons we decided to use Solaris were:
Solaris Service Management Facility (http://www.oreillynet.com/pub/a/sysadmin/2006/04/13/using-solaris-smf.html )
Ability to copy the entire zone backup to another box in case of HW failure.
We run application servers such as Weblogic, and it helps that SMF starts them back up if they crash for any reason. Also, we do backup of our zones at regular intervals, and from what I hear- the zone can be moved to another machine in case of HW failure, and the application back to normal.
I'm trying to speed test jetty (to compare it with using apache) for serving dynamic content.
I'm testing this using three client threads requesting again as soon as a response comes back.
These are running on a local box (OSX 10.5.8 mac book pro). Apache is pretty much straight out of the box (XAMPP distribution) and I've tested Jetty 7.0.2 and 7.1.6
Apache is giving my spikey times : response times upto 2000ms, but an average of 50ms, and if you remove the spikes (about 2%) the average is 10ms per call. (This was to a PHP hello world page)
Jetty is giving me no spikes, but response times of about 200ms.
This was calling to the localhost:8080/hello/ that is distributed with jetty, and starting jetty with java -jar start.jar.
This seems slow to me, and I'm wondering if its just me doing something wrong.
Any sugestions on how to get better numbers out of Jetty would be appreciated.
Thanks
Well, since I am successfully running a site with some traffic on Jetty, I was pretty surprised by your observation.
So I just tried your test. With the same result.
So I decompiled the Hello Servlet which comes with Jetty. And I had to laugh - it really includes following line:
Thread.sleep(200L);
You can see for yourself.
My own experience with Jetty performance: I ran multi threaded load tests on my real-world app where I had a throughput of about 1000 requests per second on my dev workstation...
Note also that your speed test is really just a latency test, which is fine so long as you know what you are measuring. But Jetty does trade off latency for throughput, so often there are servers with lower latency, but also lower throughput as well.
Realistic traffic for a webserver is not 3 very busy connections - 1 browser will open 6 connections, so that represents half a user. More realistic traffic is many hundreds or thousands of connections, each of them mostly idle.
Have a read of my blogs on this subject:
https://webtide.com/truth-in-benchmarking/
and
https://webtide.com/lies-damned-lies-and-benchmarks-2/
You should definitely check it with profiler. Here are instructions how to setup remote profiling with Jetty:
http://sujitpal.sys-con.com/node/508048/mobile
Speedup or performance tune any application or server is really hard to get done in my experience. You'll need to benchmark several times with different work models to define what your peak load is. Once you define the peak load for the configuration/environment mixture you need to tune and benchmark, you might have to run 5+ iterations of your benchmark. Check the configuration of both apache/jetty in terms of number of working threads to process the request and get them to match if possible. Here are some recommendations:
Consider the differences of the two environments (GC in jetty, consider tuning you min and max memory threshold to the same size and later proceed to execute your test)
The load should come from another box. If you don't have a second box/PC/server take your CPU/core into count and setup your the test to a specific CPU, do the same for jetty/apache.
This is given that you cant get another machine to be the stress agent.
Run several workload model
Moving to modeling the test do the following 2 stages:
One Thread for each configuration for 30 minutes.
Start with 1 thread and going up to 5 with a 10 minutes interval to increase the count,
Base on the metrics Stage 2 define a number of threads for the test. and run that number of thread concurrent for 1 hour.
Correlate the metrics (response times) from your testing app to the server hosting the application resources (use sar, top and other unix commands to track cpu and memory), some other process might be impacting you app. (memory is relevant for apache jetty will be constraint to the JVM memory configuration so it should not change the memory usage once the server is up and running)
Be aware of the Hotspot Compiler.
Methods have to be called several times (1000 times ?), before the are compiled into native code.