I have a Java jar application running in the background on my Linux servers (Debian 7). The servers generally receive files through Apache; the jar then polls the database regularly and uploads the files to their final destination using a pool of HTTP connections from an HttpClient (v2.5).
The servers have 16 GB of RAM, an 8-core CPU, a 1 TB disk, and 2 Gb/s of internet bandwidth. My problem is that the jar's uploads only use 10% or 20% of that bandwidth. After much investigation I think the capacity of the remote server might be the bottleneck.
So I wanted to launch more threads on my servers to process more files at the same time and use all the bandwidth I have. Unfortunately, file upload with HttpClient seems to eat a lot of CPU!
Right now the jars run 20 simultaneous upload threads and the CPU is constantly at 100%; if I try to start more threads, the load average climbs to record levels and the system becomes slow and unusable.
The strange thing is that iowait seems to be zero, so I really don't know what is causing the load average.
I ran hprof using only one thread; here is the result:
CPU SAMPLES BEGIN (total = 4617) Thu Jul 28 17:42:35 2016
rank self accum count trace method
1 52.76% 52.76% 2436 301157 java.net.SocketOutputStream.socketWrite0
2 33.53% 86.29% 1548 300806 java.net.SocketInputStream.socketRead0
3 1.62% 87.91% 75 301138 org.sqlite.core.NativeDB.step
4 1.47% 89.39% 68 301158 java.io.FileInputStream.readBytes
5 1.06% 90.45% 49 300078 java.lang.ClassLoader.defineClass1
6 0.26% 90.71% 12 300781 com.mysql.jdbc.SingleByteCharsetConverter.<clinit>
7 0.19% 90.90% 9 300386 java.lang.Throwable.fillInStackTrace
8 0.19% 91.10% 9 300653 java.lang.ClassLoader.loadClass
9 0.19% 91.29% 9 300780 com.mysql.jdbc.SingleByteCharsetConverter.<clinit>
10 0.17% 91.47% 8 300387 java.net.URLClassLoader.findClass
11 0.17% 91.64% 8 300389 java.util.zip.Inflater.inflateBytes
12 0.15% 91.79% 7 300090 java.lang.ClassLoader.findBootstrapClass
13 0.15% 91.94% 7 300390 java.util.zip.ZipFile.getEntry
14 0.13% 92.07% 6 300805 java.net.PlainSocketImpl.socketConnect
The files are sent with a standard HttpClient POST execute request, using an overridden writeTo() method in the FileBody class that wraps the file in a BufferedInputStream with an 8 KB buffer.
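For reference, a minimal sketch of that setup (assuming the Apache httpmime FileBody class; the subclass name and surrounding request code are illustrative):

import java.io.*;
import org.apache.http.entity.mime.content.FileBody;

class BufferedFileBody extends FileBody {
    BufferedFileBody(File file) {
        super(file);
    }

    @Override
    public void writeTo(OutputStream out) throws IOException {
        // Stream the file through an 8 KB buffer instead of loading it whole.
        byte[] buf = new byte[8 * 1024];
        try (InputStream in = new BufferedInputStream(
                new FileInputStream(getFile()), 8 * 1024)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}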
Do you think it is possible to reduce the performance impact of the file uploads and solve my problem of unused bandwidth?
Thanks in advance for your help.
You should try https://hc.apache.org/
You could change your HttpClient library to a library based on java.nio:
https://docs.oracle.com/javase/7/docs/api/java/nio/package-summary.html
HttpCore is a set of low level HTTP transport components that can be used to build custom client and server side HTTP services with a minimal footprint. HttpCore supports two I/O models: blocking I/O model based on the classic Java I/O and non-blocking, event driven I/O model based on Java NIO.
Please take a look at https://hc.apache.org/httpcomponents-asyncclient-dev/index.html
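For illustration, a hedged sketch of an upload with the async client (HttpAsyncClient 4.x API; the URL and file name are placeholders):

import java.io.File;
import java.util.concurrent.Future;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.FileEntity;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;

public class AsyncUpload {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
            client.start();
            HttpPost post = new HttpPost("http://example.com/upload"); // placeholder
            post.setEntity(new FileEntity(new File("data.bin"),
                    ContentType.APPLICATION_OCTET_STREAM));
            // execute() returns immediately, so many uploads can be in
            // flight without one blocked thread per connection.
            Future<HttpResponse> future = client.execute(post, null);
            System.out.println(future.get().getStatusLine());
        }
    }
}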
Hope this is the right idea.
In fact it turned out that the problem was not coming from the Java process but from Apache consuming all the IO of the hard drive, ruining performance.
Thanks for your help anyway.
Wowza is causing us trouble: it doesn't scale past 6k concurrent users and sometimes freezes at a few hundred. It crashes and starts killing sessions, and we have to step in to restart Wowza multiple times per streaming event.
Our server specs:
DL380 Gen10
2x Intel Xeon Silver 4110 / 2.1 GHz
64 GB RAM
300 GB HDD
10 Gb dedicated network
Some servers run CentOS 6, others CentOS 7
Java version 1.8.0_20, 64-bit
Wowza Streaming Engine 4.2.0
I asked and was told:
Wowza scales easily to millions if you put a CDN in front of it (which
is trivially easy to do). 6K users off a single instance simply ain’t
happening. For one, Java maxes out at around 4Gbps per JVM instance.
So even if you have a 10G NIC on the machine, you’ll want to run
multiple instances if you want to use the full bandwidth.
And:
How many 720p streams can you do on a 10gb network @ 2mbps?
Without network overhead, it’s about 5,000
With the limitation of java at 4gbps, it’s only 2,000 per instance.
Then if you do manage to utilize that 10Gb network and saturate it,
what happens to all other applications people are accessing on other
servers?
If they want more streams, they need edge servers in multiple data
centers or have to somehow to get more 10Gb networks installed.
That’s for streaming only. No idea what transcoding would add in terms
of CPU load and disk IO.
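(For reference, the arithmetic behind those figures: 10 Gbps / 2 Mbps = 5,000 concurrent streams, and 4 Gbps / 2 Mbps = 2,000 streams per JVM instance.)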
So I began looking for an alternative to Wowza. Due to the nature of our business, we can't use a CDN or cloud hosting except with a very few clients. Everything has to be hosted in-house, in the client's datacenter.
I found this article and reached out to the author to ask him about Flussonic and how it compares to Wowza. He said:
I can't speak to the 4 Gbps limit that you're seeing in Java. It's
also possible that your Wowza instance is configured incorrectly. We'd
need to look at your configuration parameters to see what's happening.
We've had great success scaling both Wowza and Flussonic media servers
by pairing them with our peer-to-peer (p2p) CDN service. That was the
whole point of the article we wrote in 2016
Because we reduce the number of HTTP requests that the server has to
handle by up to 90% (or more), we increase the capacity of each server
by 10x - meaning each server can handle 10x the number of concurrent
viewers.
Even on the Wowza forum, some people say that Java maxes out at around 5Gbps per JVM instance. Others say that this number is incorrect, or made up.
Due to the nature of our business, this number, as silly as it sounds, matters a great deal to us. If Java cannot handle more than 7k viewers per instance, we need to hold meetings and discuss what to do with Wowza.
So is it true that Java maxes out at around 4Gbps or 5Gbps per JVM instance?
I am developing a web-based application.
The computer where I write the code has a 4-core Intel i5-4440 3.10 GHz processor.
The computer where I deploy the application has an 8-core Intel i7-4790K 4.00 GHz processor.
One of the tasks that needs to be calculated is very heavy, so I decided to use the Java executor framework.
I have this:
ExecutorService executorService = Executors.newFixedThreadPool(8);
and then I add 30 tasks at once.
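For completeness, a hedged sketch of the submission step (heavyCalculation() is a hypothetical stand-in for the real work):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

ExecutorService executorService = Executors.newFixedThreadPool(8);
List<Callable<Long>> tasks = new ArrayList<>();
for (int i = 0; i < 30; i++) {
    tasks.add(() -> heavyCalculation()); // hypothetical heavy task
}
// invokeAll blocks until all 30 tasks have completed.
List<Future<Long>> results = executorService.invokeAll(tasks);
executorService.shutdown();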
On my development machine the result was calculated in 3 seconds (it used to be 20 seconds with only one thread), whereas on the server machine it took 16 seconds (the same as with only one thread).
As you can guess, I am quite confused and have no idea why the server machine is so much slower.
Does anyone know why the faster processor gets no benefit from the multithreaded algorithm?
It is hard to guess the root cause without more evidence. Could you profile the running application on the server machine?
Connect to the server machine with JConsole and look at the threading info.
My guess is that the server machine is under heavy load (maybe from other applications or background threads?), or that your server user / Java application is only allowed to use one core.
I would start with top (on Linux) or Task Manager (Windows) to find out whether the server is under load when you run your application. Profiling/JMX monitoring adds overhead, but it will tell you how many threads are actually being used.
Final note: is the server using the same architecture (32/64-bit), operating system, and major/minor Java version as the development machine?
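As a quick sanity check (a hedged snippet, not from the original question), you can have the app itself report what the JVM actually sees on the server:

System.out.println("Cores visible to JVM: "
        + Runtime.getRuntime().availableProcessors());
System.out.println("Live threads: " + Thread.activeCount());

If availableProcessors() reports 1, the JVM (or its container/affinity settings) is the problem, not your code.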
I have built a custom message server in Java that takes a stream of messages and delivers each one to its client (1:1, dropping the message if the client is not connected; very simple). I am running Tomcat 7 on Win7 x64 and Java 7, using the NIO connector (I implemented a Comet servlet).
It works great, but I am now looking into scaling this beast and am currently seeing about 85 KB of RAM allocated per connected client: 10,000 clients @ under 900 MB, scaling linearly. (I am not doing anything but holding the connections yet.) That's quite a lot in my opinion, so I am wondering whether there are tweaks to make Tomcat or Java use less memory in their NIO implementations. None of the Tomcat settings I have tried so far affected this at all.
Does anybody have experience putting Java or Tomcat on a memory diet with respect to socket connections?
UPDATE:
I am now down to under 70 KB/connection by trimming the socket buffers and some other Tomcat internals. I'm not sure how this affects throughput. I've also tried it on 32-bit and 64-bit Linux with the same result.
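For reference, the kind of knobs involved (attribute names from the Tomcat 7 NIO connector documentation; the values here are illustrative, not the exact ones I used):

<Connector port="8080" protocol="org.apache.coyote.http11.Http11NioProtocol"
           socket.appReadBufSize="1024"
           socket.appWriteBufSize="1024"
           socket.rxBufSize="8192"
           socket.txBufSize="8192" />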
After quite some research and experimentation I came to the conclusion that it is simply not possible with Tomcat to handle a huge number of concurrent connections in a reasonable amount of memory. (I'd still be happy to be proven wrong here, btw.)
However, there is a savior:
Netty: http://www.jboss.org/netty/downloads
It's a Java IO framework that builds on Java's newer NIO architecture and seems very well designed and written. You can stack some lightweight modules together to create a mini webserver, or simply handle the TCP connections yourself in an asynchronous way.
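A minimal sketch of the kind of connection-holding server I tested (this uses the newer Netty 4 API, so the class names differ from the 3.x download linked above):

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class HoldConnectionsServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // accepts connections
        EventLoopGroup workers = new NioEventLoopGroup(); // handles IO events
        ServerBootstrap b = new ServerBootstrap();
        b.group(boss, workers)
         .channel(NioServerSocketChannel.class)
         .childHandler(new ChannelInitializer<SocketChannel>() {
             @Override
             protected void initChannel(SocketChannel ch) {
                 // No handlers: the connection is simply accepted and held,
                 // which is all the load test below did.
             }
         });
        b.bind(8080).sync().channel().closeFuture().sync();
    }
}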
I ran a load test on EC2 and made it to a mind-blowing 7 MILLION connections @ only 1.5 GB of RAM! (As with the Tomcat test, I did nothing but store the connections, so a real app will of course consume a bit more memory, but 200 bytes/connection of "overhead" is nothing!) It only stopped there because I limited the Java VM to 1.5 GB; I am sure a C10M test would be easily doable.
Big kudos to Netty and the Java VM guys! I'm impressed.
I am developing a web application in Scala. It's a simple application that takes data on a port from clients (JSON or Protobufs), does some computation involving a database server, and then replies to the client with a JSON/Protobuf object.
It's not a very heavy application, 1000 lines of code max. It creates a thread for every client request. The time it currently takes between getting a request and replying is between 20 and 40 ms.
I need advice on what kind of hardware/setup I should use to serve 3000+ such requests per second. I need to procure hardware to put in my data center.
Anybody who has experience deploying Java apps at scale, please advise. Should I use one big box with 2-4 Xeon 5500s and 32 GB of RAM, or multiple smaller machines?
UPDATE: we don't have many clients, only 3-4 of them. All requests will come from these few.
If each request takes 30 ms on average, a single core can handle only about 33 requests per second (1000 ms / 30 ms). Supposing your app scales linearly (the best scenario you can expect), you will need at least 90 cores to reach 3000 req/s, which is more than 2-4 Xeons provide.
Worse, if your app relies on IO or on a DB (like most useful applications), you will get sublinear scaling and may need a lot more...
So the first thing to do is to analyze and optimize the application. Here are a few tips:
Creating a thread is expensive; create a limited number of threads and reuse them across requests (in Java, see ExecutorService for example; there is a sketch after this list).
If your app is IO-intensive, try to reduce IO calls as much as possible, use an in-memory cache, and give non-blocking IO a try.
If your app depends on a database, consider caching, and try a distributed solution if possible.
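A minimal sketch of the thread-reuse idea (in Java; the handleRequest parameter is a hypothetical stand-in for your per-request work):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RequestDispatcher {
    // Size the pool once; roughly the core count for CPU-bound work.
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    public void onRequest(Runnable handleRequest) {
        // Reuse pooled threads instead of new Thread(handleRequest).start().
        pool.submit(handleRequest);
    }
}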
I'm trying to speed-test Jetty (to compare it with Apache) for serving dynamic content.
I'm testing this using three client threads, each requesting again as soon as a response comes back.
These are running on a local box (OS X 10.5.8, MacBook Pro). Apache is pretty much straight out of the box (XAMPP distribution), and I've tested Jetty 7.0.2 and 7.1.6.
Apache is giving me spiky times: response times up to 2000 ms, but an average of 50 ms, and if you remove the spikes (about 2%) the average is 10 ms per call. (This was against a PHP hello-world page.)
Jetty is giving me no spikes, but response times of about 200ms.
This was calling the localhost:8080/hello/ page distributed with Jetty, starting Jetty with java -jar start.jar.
This seems slow to me, and I'm wondering if it's just me doing something wrong.
Any suggestions on how to get better numbers out of Jetty would be appreciated.
Thanks
Well, since I am successfully running a site with some traffic on Jetty, I was pretty surprised by your observation.
So I just tried your test, with the same result.
So I decompiled the Hello servlet that comes with Jetty. And I had to laugh: it really includes the following line:
Thread.sleep(200L);
You can see for yourself.
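For context, roughly the shape such a handler has (a reconstruction for illustration, not the verbatim Jetty source):

public void doGet(HttpServletRequest request, HttpServletResponse response)
        throws IOException {
    try {
        Thread.sleep(200L); // this alone accounts for the ~200 ms measured
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    response.setContentType("text/html");
    response.getWriter().println("<h1>Hello</h1>");
}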
My own experience with Jetty performance: I ran multi-threaded load tests on my real-world app and saw a throughput of about 1000 requests per second on my dev workstation...
Note also that your speed test is really just a latency test, which is fine as long as you know what you are measuring. Jetty does trade off some latency for throughput, so there are often servers with lower latency, but lower throughput as well.
Realistic traffic for a webserver is not 3 very busy connections: one browser will open 6 connections, so that represents half a user. More realistic traffic is many hundreds or thousands of connections, each of them mostly idle.
Have a read of my blogs on this subject:
https://webtide.com/truth-in-benchmarking/
and
https://webtide.com/lies-damned-lies-and-benchmarks-2/
You should definitely check it with a profiler. Here are instructions on how to set up remote profiling with Jetty:
http://sujitpal.sys-con.com/node/508048/mobile
Speeding up or performance-tuning any application or server is really hard to get right, in my experience. You need to benchmark several times with different workload models to determine what your peak load is. Once you have defined the peak load for the configuration/environment mix you want to tune, you may have to run 5+ iterations of your benchmark. Check the configuration of both Apache and Jetty in terms of the number of worker threads processing requests, and get them to match if possible. Here are some recommendations:
Consider the differences between the two environments (GC in Jetty: consider setting the minimum and maximum heap sizes to the same value before executing your test).
The load should come from another box. If you can't get a second box/PC/server to act as the stress agent, take your CPU/core count into account and pin the load-generation test to specific CPUs; do the same for Jetty/Apache.
Run several workload models.
When modeling the test, use the following two stages:
Stage 1: run one thread per configuration for 30 minutes; then start with 1 thread and ramp up to 5, increasing the count at 10-minute intervals.
Stage 2: based on the Stage 1 metrics, choose a thread count for the test and run that many threads concurrently for 1 hour.
Correlate the metrics (response times) from your testing app with those of the server hosting the application (use sar, top, and other Unix commands to track CPU and memory); some other process might be impacting your app. (Memory matters mostly for Apache; Jetty will be constrained by the JVM memory configuration, so its memory usage should not change once the server is up and running.)
Be aware of the HotSpot compiler.
Methods have to be called many times (1000 times?) before they are compiled into native code.
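So warm up the code path before you measure it. A hedged sketch (workload() is a hypothetical stand-in for whatever you are timing):

// Exceed typical JIT compile thresholds before taking measurements.
for (int i = 0; i < 20000; i++) {
    workload();
}
long start = System.nanoTime();
workload();
System.out.println("warm time: " + (System.nanoTime() - start) + " ns");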