I'm reading about virtual memory swapping, and it says that pages of memory can be swapped out when an application becomes idle. I've tried to google what that means, but haven't found much detailed information apart from this Stack Overflow answer:
Your WinForms app is driven by a message loop that pulls messages out
of a queue. When that queue is emptied, the message loop enters a
quiet state, sleeping efficiently until the next message appears in
the message queue. This helps conserve CPU processing resources
(cycles wasted spinning in a loop take CPU time away from other
processes running on the machine, so everything feels slower) and also
helps reduce power consumption / extend laptop battery life.
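The same idea can be sketched in Java with a BlockingQueue: take() parks the thread until a message arrives, consuming no CPU while the queue is empty. (The WinForms loop itself sits on Win32's message queue; this is only an analogy, not the actual mechanism.)

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MessageLoop {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by producers to enqueue a message.
    public void post(String message) {
        queue.offer(message);
    }

    // Pulls one message, sleeping efficiently while the queue is empty.
    public String pullNext() {
        try {
            return queue.take(); // parks the thread; no busy spinning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        MessageLoop loop = new MessageLoop();
        loop.post("WM_PAINT");
        System.out.println(loop.pullNext()); // prints WM_PAINT
    }
}
```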
So does the application become idle when there are no messages in the message queue?
The operating system decides what idle means. In general, that means that the application doesn't actively utilize system resources (like processor cycles, IO operations, etc).
However, that doesn't mean an application's pages will only be swapped out once it goes idle. There may be many 'active' applications contending for the same limited physical memory, and the OS may be forced to swap out some pages belonging to one active application to make room for another.
We've got a Java app where we basically use three dispatcher pools to handle processing tasks:
Convert incoming messages (from RabbitMQ queues) into another format
Serialize messages
Push serialized messages to another RabbitMQ server
The thing we don't know how to start fixing is that we see latency in the first step. In other words, when we measure the time between the "tell" and the start of the conversion in an actor, there is (not always, but too often) a delay of up to 500 ms. Especially strange is that the CPUs are heavily under-utilized (10-15%) and the mailboxes are almost always empty; there is no large backlog of messages waiting to be processed. Our understanding was that Akka would typically utilize the CPUs much better than that?
The conversion is non-blocking and does not require I/O. There are approx. 200 actors running on that dispatcher, which is configured with throughput 2 and has 8 threads.
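A dispatcher like the one described might be configured roughly like this in application.conf (the dispatcher name `conversion-dispatcher` is made up; pinning parallelism-min and parallelism-max to the same value fixes the pool at 8 threads, and `throughput = 2` means each actor handles at most 2 messages per turn before the thread is handed back to the pool):

```hocon
conversion-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 8
    parallelism-max = 8
  }
  # Each actor processes at most 2 messages before yielding its thread.
  throughput = 2
}
```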
The system itself has 16 CPUs with around 400+ threads running, most of the passive, of course.
Interestingly enough, the other steps do not see such delays, but that can probably be explained by the fact that the first step already "spreads" the messages so that the other steps/pools can easily digest them.
Does anyone have an idea what could cause such latencies and CPU under-utilization, and how you would normally go about improving things here?
There isn't a single answer for this, but I don't know where else to ask this question.
I work on a large enterprise system that uses Tomcat to run REST services running in containers, managed by kubernetes.
Tomcat, or really any request processor, has a "max threads" property: if enough requests come in to drive thread creation up to that limit, additional requests are put into a queue (whose size is capped by another property), and once that queue is full, further requests may be rejected.
It's reasonable to consider whether this property should be set to a value that could possibly be reached, or whether it should be set to effective infinity.
There are many scenarios to consider, although the only interesting ones are when traffic is much higher than normal, whether from real customer traffic or malicious DDoS traffic.
In managed container environments, and other similar cases, this also raises the question of how many instances, pods, or containers should run copies of the service. I would assume you want as few of them as possible, to reduce the per-pod duplication of resources; that increases the average number of threads in each container, but I would assume that's better than spreading the threads thinly across a large set of containers.
Some members of my team think it's better to set the "max threads" property to effective infinity.
What are some reasonable thoughts about this?
As a general rule, I'd suggest trying to scale by running more pods (which can easily be scheduled on multiple hosts) rather than by running more threads. It's also easier for the cluster to schedule 16 1-core pods than to schedule 1 16-core pod.
In terms of thread count, it depends a little bit on how much work your process is doing. A typical Web application spends most of its time talking to the database, and does a little bit of local computation, so you could often set it to run 50 or 100 threads but still with a limit of 1.0 CPU, and be effectively using resources. If it's very computation-heavy (it's doing real image-processing or machine-learning work, say) you might be limited to 1 thread per CPU. The bad case is where your process is allocating 16 threads, but the system only actually has 4 cores available, in which case your process will get throttled but you really want it to scale up.
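A common way to turn this reasoning into a number is to scale the pool with the core count and the fraction of time a thread spends blocked (on the database, say). A sketch; the blocked fractions passed in below are assumptions you'd measure, not givens:

```java
public class PoolSizing {
    // Estimate pool size as cores / (1 - fraction of time a thread is blocked).
    // CPU-bound service -> blockedFraction near 0 -> roughly one thread per core.
    // I/O-heavy service -> blockedFraction near 1 -> many threads per core.
    static int poolSize(int cores, double blockedFraction) {
        return Math.max(1, (int) (cores / (1.0 - blockedFraction)));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("CPU-bound: " + poolSize(cores, 0.0));
        System.out.println("I/O-heavy: " + poolSize(cores, 0.5));
    }
}
```

With a CPU limit of 1.0 and a mostly-waiting web workload, this is how you end up at 50 or 100 threads on a single core without wasting resources.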
The other important bad state to be aware of is the thread pool filling up. If it does, requests will get queued up, as you note, but if some of those requests are Kubernetes health-check probes, that can result in the cluster recording your service as unhealthy. This can actually lead to a bad spiral where an overloaded replica gets killed off (because it's not answering health checks promptly), so its load gets sent to other replicas, which also become overloaded and stop answering health checks. You can escape this by running more pods, or more threads. (...or by rewriting your application in a runtime which doesn't have a fixed upper capacity like this.)
It's also worth reading about the horizontal pod autoscaler. If you can connect some metric (CPU utilization, thread pool count) to say "I need more pods", then Kubernetes can automatically create more for you, and scale them down when they're not needed.
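A minimal HorizontalPodAutoscaler sketch targeting CPU utilization (the deployment name `rest-service` and the numbers are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rest-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rest-service
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```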
I have 1,00,00 elements to use in my web crawling app. I am not sure how to choose the number of threads to use with an ExecutorService (Java 6).
Actually I am getting an Out Of Memory Error if I use more threads, which makes it harder to choose the number of threads. Also, when many threads hit the server, the internet connection halts and I have to restart my PC every time.
Could someone help me choose the number of threads for this case?
Thanks!
It is difficult to say, because the limiting factor for throughput is probably your system's network bandwidth and latency. If you try to go beyond that limit, you are likely to reduce throughput due to various secondary effects: e.g. congestion, thrashing, throttling (by the server), etc.
The correct approach is to make the number of threads in the pool a configuration parameter ... and tune it. Start small, and when increasing the thread count doesn't improve throughput significantly, stop increasing it.
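A sketch of that tunable-pool approach: read the thread count from a system property so it can be changed per run (with -Dcrawler.threads=N) while you measure throughput. The property name `crawler.threads` and the URLs are made up:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    public static ExecutorService newPool() {
        // Start small (default 4) and raise via -Dcrawler.threads=N between runs,
        // stopping once throughput no longer improves.
        int threads = Integer.getInteger("crawler.threads", 4);
        return Executors.newFixedThreadPool(threads);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = newPool();
        for (String url : new String[] {"http://a.example", "http://b.example"}) {
            pool.submit(() -> System.out.println("would fetch " + url));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```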
I've got a Java App running on Ubuntu, the app listens on a socket for incoming connections, and creates a new thread to process each connection. The app receives incoming data on each connection processes the data, and sends the processed data back to the client. Simple enough.
With only one instance of the application running and up to 70 simultaneous threads, the app will drive CPU usage to over 150% and have trouble keeping up with the incoming data. This is running on a Dell 24-core system.
Now if I create 3 instances of my application, and split the incoming data across the 3 instances on the same machine, the max overall cpu on the same machine may only reach 25%.
Question is why would one instance of the application use 6 times the amount of CPU that 3 instances on the same machine each processing one third of the amount of data use?
I'm not a linux guy, but can anyone recommend a tool to monitor system resources to try and figure out where the bottleneck is occurring? Any clues as to why 3 instances processing the same amount of data as 1 instance would use so much less overall system CPU?
In general this should not be the case. Maybe you are reading the CPU usage wrong. Try the top, htop, ps, or vmstat commands to see what's going on.
I could imagine one reason for such behaviour: resource contention. If you have some sort of lock or busy loop that manifests itself only in the single-instance case (max connections, or max threads), then your system might not parallelize processing optimally and will wait on resources. I suggest connecting something like jconsole to your Java processes to see what's happening.
As a general recommendation, check how many threads are available per JVM and whether you are using them correctly. Maybe you don't have enough memory allocated to the JVM, so it's garbage collecting too often. If you use database operations, check for bottlenecks there too. Profile, find where the application spends most of its time, and compare 1 vs. 3 instances in terms of the percentage of time spent in that function.
These things obviously require close inspection and availability of code to thoroughly analyze and give good suggestions. Nevertheless, that is not always possible and I hope it may be possible to provide me with good tips based on the information I provide below.
I have a server application that uses a listener thread to listen for incoming data. The incoming data is interpreted into application specific messages and these messages then give rise to events.
Up to that point I don't really have any control over how things are done.
Because this is a legacy application, these events were previously taken care of by that same listener thread (largely a single-threaded application). The events are sent to a blackbox and out comes a result that should be written to disk.
To improve throughput, I wanted to employ a threadpool to take care of the events. The idea being that the listener thread could just spawn new tasks every time an event is created and the threads would take care of the blackbox invocation. Finally, I have a background thread performing the writing to disk.
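The described design can be sketched as an executor for the blackbox calls plus a single background writer draining a queue. Names like `blackbox` and the String payloads are placeholders for the real legacy API, and the pool size of 8 is an assumption:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class EventPipeline {
    private final ExecutorService workers = Executors.newFixedThreadPool(8);
    private final BlockingQueue<String> writeQueue = new LinkedBlockingQueue<>();

    public EventPipeline() {
        // Single background thread that drains results to disk.
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String result = writeQueue.take();
                    // In the real app: append result to the output file.
                    System.out.println("wrote: " + result);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    // Called by the listener thread for each event.
    public void onEvent(String event) {
        workers.submit(() -> writeQueue.add(blackbox(event)));
    }

    // Placeholder for the legacy blackbox invocation.
    static String blackbox(String event) {
        return event.toUpperCase();
    }

    public void shutdown() {
        workers.shutdown();
        try {
            workers.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```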
With just the previous setup and the background writer, everything works OK and the throughput is ~1.6 times more than previously.
When I add the thread pool, however, performance degrades. At the start everything seems to run smoothly, but after a while everything becomes very slow and finally I get OutOfMemoryErrors. The weird thing is that when I print the number of active threads each time a task is added to the pool (along with how many tasks are queued and so on), it looks as if the thread pool has no problem keeping up with the producer (the listener thread).
Using top -H to check CPU usage, it's spread out quite evenly at the outset, but by the end the worker threads are barely ever active and only the listener thread is. Yet it doesn't seem to be submitting more tasks...
Can anyone hypothesize a reason for these symptoms? Do you think it's more likely that there's something in the legacy code (which I have no control over) that just goes bad when multiple threads are added? The out-of-memory issue should be because some queue somewhere grows too large, but since the thread pool almost never has queued tasks, it can't be that.
Any ideas are welcome, especially ideas on how to diagnose a situation like this more efficiently. How can I get a better profile of what my threads are doing, etc.?
Thanks.
Slowing down then out of memory implies a memory leak.
So I would start by using some Java memory analyzer tools to identify if there is a leak and what is being leaked. Sometimes you get lucky and the leaked object is well-known and it becomes pretty clear who is hanging on to things that they should not.
Thank you for the answers. I read up on Java VisualVM and used that as a tool. The results and conclusions are detailed below. Hopefully the pictures will work long enough.
I first ran the program and created some heap dumps thinking I could just analyze the dumps and see what was taking up all the memory. This would probably have worked except the dump file got so large and my workstation was of limited use in trying to access it. After waiting two hours for one operation, I realized I couldn't do this.
So my next option was something I, stupidly enough, hadn't thought about. I could just reduce the number of messages sent to the application, and the trend of increasing memory usage should still be there. Also, the dump file will be smaller and faster to analyze.
It turns out that when sending messages at a slower rate, no out-of-memory issue occurred! A graph of the memory usage can be seen below.
The peaks are results of cumulative memory allocations and the troughs that follow are after the garbage collector has run. Although the amount of memory usage certainly is quite alarming and there are probably issues there, no long term trend of memory leakage can be observed.
I started to incrementally increase the rate of messages sent per second to see where the application hits the wall. The image below shows a very different scenario than the previous one...
Because this happens when the rate of messages sent is increased, my guess is that freeing up the listener thread lets it accept a lot of messages very quickly, and this causes more and more allocations. The garbage collector can't keep up and the memory usage hits a wall.
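One common remedy for exactly this failure mode (a fast producer outrunning the pool and piling up allocations) is to bound the executor's queue and make the submitting thread run the task itself when the queue is full, which naturally throttles the listener. A hedged sketch; the thread and capacity numbers would need tuning:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    public static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                // Bounded queue: at most queueCapacity tasks wait here,
                // so memory use stays flat instead of growing without limit.
                new ArrayBlockingQueue<>(queueCapacity),
                // When full, the submitting (listener) thread runs the task
                // itself, which slows down message intake -> backpressure.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```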
There's of course more to this issue but given what I have found out today I have a fairly good idea of where to go from here. Of course, any additional suggestions/comments are welcome.
This question should probably be recategorized as dealing with memory usage rather than thread pools... The thread pool wasn't the problem at all.
I agree with #djna.
The thread pool from the java.util.concurrent package works correctly: it does not create threads it does not need, and you see that the number of threads is as expected. This means that probably something in your legacy code is not ready for multithreading. For example, some code fragment is not synchronized; as a result, some element is not removed from a collection, or extra elements are stored in a collection. So the memory usage grows.
BTW, I did not understand exactly which part of the application uses the thread pool now. Did you previously have one thread that processes events, and now several threads that do this? Have you perhaps changed the inter-thread communication mechanism? Added queues? This may be yet another direction for your investigation.
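To illustrate the safe variant of the shared-collection pattern described above: with a ConcurrentHashMap, concurrent put/remove pairs never corrupt the structure or lose removals, so the map drains back to empty. (With a plain, unsynchronized HashMap under the same load, removals can be lost and the map can grow indefinitely; the names here are hypothetical.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeSharedMap {
    // Thread-safe map tracking in-flight messages.
    static final Map<String, String> pending = new ConcurrentHashMap<>();

    static void process(String id) {
        pending.put(id, "in-flight");
        // ... do the work ...
        pending.remove(id); // with a plain HashMap under concurrency,
                            // this removal could silently be lost
    }

    public static void main(String[] args) {
        Thread[] workers = new Thread[8];
        for (int t = 0; t < workers.length; t++) {
            final int tid = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    process("msg-" + tid + "-" + i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("left over: " + pending.size()); // prints 0
    }
}
```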
Good luck!
As mentioned by djna, it's likely some type of memory leak. My guess would be that you're keeping a reference to the request around somewhere:
In the dispatcher thread that's queuing the requests
In the threads that deal with the requests
In the black box that's handling the requests
In the writer thread that writes to disk.
Since you said everything works fine before you add the thread pool into the mix, my guess would be that the threads in the pool are keeping a reference to the request somewhere. The idea being that, without the thread pool, you aren't reusing threads, so the information goes away.
As recommended by djna, you can use a Java memory analyzer to help figure out where the data is stacking up.