I am working on a system where I need a while(true) loop that constantly listens to a queue and increments counts in memory.
The data is constantly coming into the queue, so I cannot avoid the while(true) condition. But naturally it pushes my CPU utilization to 100%.
So, how can I keep a thread alive that listens to the tail of the queue and performs some action, while keeping the CPU utilization below 100%?
Blocking queues were invented exactly for this purpose.
Also see this: What are the advantages of Blocking Queue in Java?
LinkedBlockingQueue.take() is what you should be using. This waits for an entry to arrive on the queue, with no additional synchronization mechanism needed.
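A minimal sketch of the pattern (the element type and the process() handler are illustrative):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class QueueConsumer implements Runnable {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        @Override
        public void run() {
            try {
                while (true) {
                    String item = queue.take(); // blocks (0% CPU) until an item arrives
                    process(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit cleanly on interrupt
            }
        }

        private void process(String item) { /* increment in-memory counts here */ }
    }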
(There are one or two other blocking queues in Java, IIRC, but they have features that make them unsuitable in the general case. Don't know why such an important mechanism is buried so deeply in arcane classes.)
Usually a queue has a way to retrieve an item such that your thread is descheduled (thus using 0% CPU) until something arrives in the queue...
Based on your comments on another answer, you want a queue that is driven by changes in hsqldb.
Some quick googling turns up:
http://hsqldb.org/doc/guide/triggers-chapt.html
It appears you can set it up so that changes cause a trigger to fire, which will notify a class you write implementing the org.hsqldb.Trigger interface. Have that class hold a reference to a LinkedBlockingDeque from the java.util.concurrent package, and have the trigger add the change to the queue.
You now have a blocking queue that your reading thread will block on until hsqldb fires a trigger (from an update by a writer), which puts something in the queue. The waiting thread will then unblock and take the item off the queue.
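A rough sketch of the idea (the fire(...) signature follows the HSQLDB trigger docs linked above, but verify it against your version; sharing the queue via a static field is just to keep the sketch short):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingDeque;

    public class QueueingTrigger implements org.hsqldb.Trigger {
        // Shared with the consumer thread.
        static final BlockingQueue<Object[]> CHANGES = new LinkedBlockingDeque<>();

        public void fire(int type, String triggerName, String tableName,
                         Object[] oldRow, Object[] newRow) {
            CHANGES.offer(newRow); // hand the changed row to the reading thread
        }
    }

    // Reader side, blocks until the trigger fires:
    // Object[] change = QueueingTrigger.CHANGES.take();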
lbalazscs and Brain have excellent answers. Since I couldn't share my code, it was hard for them to give me the exact fix for my issue. And having a while(true) that constantly polls a queue is surely the wrong way to go about it. So, here is what I did:
I used a ScheduledExecutorService with a 10-second delay.
I read a block of messages (say 10k) and process those messages.
The thread is invoked again and the "loop" continues.
This considerably reduces my CPU usage. Suggestions welcome.
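A rough sketch of what that looks like (the queue type, batch size, and process() handler are illustrative, not my actual code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    class BatchPoller {
        static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>();

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleWithFixedDelay(() -> {
                List<String> batch = new ArrayList<>(10_000);
                QUEUE.drainTo(batch, 10_000); // read a block of up to 10k messages
                batch.forEach(BatchPoller::process);
            }, 0, 10, TimeUnit.SECONDS); // 10-second delay between runs
        }

        static void process(String msg) { /* increment in-memory counts */ }
    }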
Lots of dumb answers here from people who read books and only wasted time in schools; not many direct, logical answers, as I see it.
while(true) will make your program use all the CPU time that the Windows scheduler allots to it, running whatever is in the loop as fast as possible, over and over. This doesn't mean that if your application shows 100%, an empty-loop .exe will be taking all your OS CPU power; if you run a game, it should still run as intended. It is more like a visual bug, similar to the Windows idle process and some other processes. The fix is to add a Sleep(1) (at least 1 millisecond), or better yet a Sleep(5), to make sure other stuff can run and to ensure the CPU is not constantly spinning through your while(true) as fast as possible. This will generally drop the displayed CPU usage to 0% or 1%, as even one full millisecond is a big resting time for an older CPU.
Many times, while(true) and other endless loops are bad designs and can be drastically slowed down, even to Sleep(1000) (one-second interval checks) or higher. Endless loops are not always bad designs, but usually they can be improved.
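In Java terms the same idea looks like this (doWork() is a placeholder; a blocking take(), as in the answers above, is still the better fix):

    class PollLoop {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                doWork();        // hypothetical per-iteration work, e.g. checking a queue
                Thread.sleep(1); // at least 1 ms: lets the scheduler run other threads
            }
        }
        static void doWork() { /* poll and process here */ }
    }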
Funny to see this bug, which I learned about when I was maybe 12 learning C, pop up along with all the 'dumb' answers given.
Just know that if you try it, unless the slower scripted language you have learned has fixed this somewhere along the line by itself, Windows will claim to use a lot of CPU on an empty loop even when the OS actually has free resources to spare.
I'm currently running a highly concurrent benchmark which accesses a ConcurrentSkipList from the Java collections. I'm finding that threads are getting blocked within that method, more precisely here:
java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:828)
java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1626)
(This was obtained by printing the stack trace of each individual thread over a 10-second interval.) The situation is still not resolved after several minutes.
Is this expected behaviour for these collections? Which other concurrent collections are likely to experience blocking?
Having tested it, I see similar behaviour with ConcurrentHashMap:
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:994)
This could well be a spurious result.
When you ask Java for a dump of all its current stack traces, it tells each thread to wait when it gets to a yield point, then it captures the traces, and then it resumes all the threads. As you can imagine, this means that yield points are over-represented in these traces; these include synchronized methods, volatile accesses, etc. ConcurrentSkipListMap.head, a volatile field, is accessed in doGet.
See this paper for a more detailed analysis.
Solaris Studio has a profiler that captures stack traces from the OS and translates them to Java stack traces. This does away with the bias toward yield points and gives you more accurate results; you might find that doGet goes away almost entirely. I've only had luck running it on Linux, and even then it's not out-of-the-box. If you're interested, ask me in the comments how to set it up, I'd be happy to help.
As an easier approach, you could wrap your calls to ConcurrentSkipListMap.get with System.nanoTime() to get a sanity check on whether this is really where your time is going. Figure out how much time you're spending in that method, and confirm whether it's about what you'd expect given that the profiler says you're spending such-and-such percent of your time there.
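For instance, a rough sketch of that check (class and field names are illustrative):

    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.AtomicLong;

    class TimedGets {
        static final AtomicLong GET_NANOS = new AtomicLong(); // total time spent in get()

        static String timedGet(ConcurrentSkipListMap<String, String> map, String key) {
            long start = System.nanoTime();
            try {
                return map.get(key);
            } finally {
                GET_NANOS.addAndGet(System.nanoTime() - start);
            }
        }
    }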
Shameless self-plug: I created a simple program that demonstrates this a few months ago for a presentation at work. If you run it against a profiler, it should show that SpinWork.work appears a lot, while HardWork.work doesn't show up at all, even though the latter actually takes a lot more time; it just doesn't contain yield points.
Well, it isn't blocking in the truest sense. Blocking implies the suspension of thread activity. ConcurrentSkipListMap is non-blocking; it will spin until it succeeds. But it also guarantees that it will eventually succeed (that is, it shouldn't get into an infinite loop).
That being said, unless you are doing many, many gets and many, many puts asynchronously, I don't see how you can be spending so much time in this method.
If you can re-create it with an example and share with us that may help.
ConcurrentHashMap.get is a volatile read, which means the CPU must finish all outstanding writes before it can perform the read. This is called a StoreLoad barrier. Depending on how much is going on in the other threads/cores, this can take a long time. See https://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html.
Can you explain this nonsense to me?
I have a method that basically fills up an array with mathematical operations; there's no I/O involved or anything. This method takes about 50 seconds to run, and the code is perfectly scalable (theoretically 100%), so I split it up into 4 threads, wait for them to complete, and reassemble the 4 arrays. Now, I run the program on a quad-core processor, expecting it to take about 15 seconds, and it actually takes 58 seconds. That's right: it takes longer! I see the CPU working at 100%, and I know that each thread does 1/4 of the calculations; creating the threads and reassembling the arrays takes about 1-2 ms in total.
What's causing such a loss of performance? What the hell is the CPU doing all that time?
CODE: http://pastebin.com/cFUgiysw
Threads don't work that way.
Threads are still part of the same process (depending on the OS), so in terms of the operating system, CPU time will be scheduled the same for 4 threads in 1 process as it is for 1 thread in 1 process.
Also, with such a small number of values, you won't see the scalability through the overhead. Reassembling the arrays in Java will be costly.
Check out things like "context switching overhead"; things like that always mess you up when you try to map theory to practice :P
I would stick to the single-threaded way :)
~ Dan
http://en.wikipedia.org/wiki/Context_switch
A lot depends on what you are doing and how you are dividing the work. There are many possible causes for this problem.
The most likely cause is that you are using all the bandwidth of the CPU-to-main-memory bus with just one thread. This can happen if your data set is larger than your CPU cache, especially if you have some random-access behaviour. You could consider reusing the original array, rather than taking multiple copies, to reduce cache churn.
Your locking overhead could be greater than the performance gain, but I suspect you have used very coarse locking, so this shouldn't be an issue.
Starting and stopping threads takes too long; as your code runs for multiple seconds, I doubt this too.
There is a cost associated with starting new threads. I don't think it should add up to 8 seconds, but it depends on the threads you are using. Some threads need to create a copy of the data they handle in order to be thread-safe, and that can take some time. This cost is commonly referred to as overhead. If some part of the execution cannot run in parallel, for instance reading the same file or accessing a shared resource, the threads may need to wait on each other; this can take some time, and under sub-optimal conditions it can take more time than serial execution. My tip: try to find these serializing events and, if possible, remove them from the threaded part. Also try using a lower number of threads; 4 threads for 4 CPUs is not always optimal.
Hope it helps.
Unless you are constantly creating and killing threads, the thread overhead shouldn't be a problem. Four threads running simultaneously is no big deal for the scheduler.
As Peter Lawrey suggested, memory bandwidth could be the problem. Your 50-second code runs on a Java engine, and the two compete for the available memory bandwidth: the Java engine needs memory bandwidth to execute your code, and your code needs it to do its calculations.
You write "perfectly scalable", which would be the case if your code were compiled. Since it runs on a Java engine, this is not the case. So the 16% increase in overall time can be seen as the difference between the smoothness of one thread and the chaos of four threads colliding over memory accesses.
Good evening,
I'm developing a Java TCP server for communication between clients.
At this point I'm load testing the developed server.
This morning I got my hands on a profiler (YourKit) and started looking for problem spots in my server.
I now have 480 clients sending messages to the server every 500 ms. The server forwards every received message to 6 clients.
The server is now using about 8% of my CPU under constant load.
My question is about the Java functions that use the most CPU cycles.
The Java function that uses the most CPU cycles is, strangely, Thread.sleep, followed by BufferedReader.readLine.
Both of these functions seem to block the current thread while waiting for something (sleep waits for a few ms, readLine waits for data).
Can somebody explain why these two functions take up that many CPU cycles? I was also wondering whether there are alternative approaches that use fewer CPU cycles.
Kind regards,
T. Akhayo
sleep() and readLine() can use a lot of CPU, as they both result in system calls which can context switch. It is also possible that the timing for these methods is inaccurate for the same reason (it may be an overestimate).
A way to reduce the overhead of context switches and sleep() is to have fewer threads and avoid needing sleep at all (e.g. use a ScheduledExecutorService). readLine() overhead can be reduced by using NIO (sketched below), but that is likely to add some complexity.
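For what it's worth, here is a bare-bones sketch of the NIO approach (the port and message handling are illustrative): a single selector thread services all connections, so there are no per-connection sleep()/readLine() loops.

    import java.net.InetSocketAddress;
    import java.nio.channels.*;

    class SelectorLoop {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(9090)); // illustrative port
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            while (true) {
                selector.select(); // blocks until at least one channel is ready
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        // read the message and forward it to the 6 recipients here
                    }
                }
                selector.selectedKeys().clear();
            }
        }
    }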
Sleeping shouldn't be an issue, unless you have a bunch of threads sleeping for short periods of time (100-150 ms is 'short' when you have 480 threads running a loop that just sleeps and does something trivial).
The readLine call should use next to nothing when it's not actually reading something, except when you first call it. But like you said, it blocks, and it shouldn't use a noticeable amount of CPU unless the windows in which it blocks are small. CPU usage won't be significant unless you're reading tons of data, or initially calling the method.
So, your loops are too tight and you're receiving too many messages too quickly, which ultimately causes 'tons' of context switching and processing. I'd suggest using an NIO framework (like Netty) if you're not comfortable enough with NIO to use it on your own.
Also, 8% CPU isn't that much for 480 clients that send 2 messages per second.
Here is a program in which sleep uses almost 100% of the CPU cycles given to the application:
for (int i = 0; i < bigNumber; i++) {
    Thread.sleep(someTime); // assumes an enclosing method that declares InterruptedException
}
Why? Because it doesn't use very many actual cpu cycles at all,
and of the ones it does use, nearly all of them are spent entering and leaving sleep.
Does that mean it's a real problem? Of course not.
That's the problem with profilers that only look at CPU time.
You need a sampler that samples on wall-clock time, not CPU time.
It should sample the stack, not just the program counter.
It should show you by line of code (not by function) the fraction of stack samples containing that line.
The usual objection to sampling on wall-clock time is that the measurements will be inaccurate due to sharing the machine with other processes.
But that doesn't matter, because to find time drains does not require precision of measurement.
It requires precision of location.
What you are looking for is precise code locations, and call sites, that are on the stack a healthy fraction of actual time, as determined by stack sampling that's uncorrelated with the state of the program.
Competition with other processes does not change the fraction of time that call sites are on the stack by a large enough amount to result in missing the problems.
I am fairly new with concurrent programming and I am learning it.
I am implementing a quick sort in Java JDK 7 (Fork Join API) to sort a list of objects (100K).
While using this recursive piece of code without concurrency, I observe no memory explosion; everything is fine.
When I added the code to use it on multiple cores (by extending the class RecursiveAction), memory usage jumped very high until it reached its limits. Profiling shows a high thread-creation rate, and I think that is to be expected.
But is a Java thread by itself much more memory-demanding, or am I missing something here?
Quicksort must require a lot of threads, but surely not much more memory than regular objects.
Should I stop creating RecursiveAction tasks when I reach a threshold and then just switch to a sequential piece of code (no more threads)?
Thank you very much.
Java threads usually take 256k/512k (depending on OS, JDK version, etc.) of stack space alone, by default.
You waste huge resources and speed if you run more threads than you have processors/cores for a CPU-intensive process such as quicksort, so try not to run more threads than you have cores.
Yes, switching over to sequential code is a good idea when the unit of work is in the region of roughly 10,000-100,000 operations. This is just a rule of thumb. So, for quicksort, I'd drop out to sequential execution when the size to be sorted is less than, say, 10-20,000 elements, depending on the complexity of the comparison operation; a sketch follows.
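For instance, a sketch of that cut-over (the threshold and the partition scheme are illustrative, not a tuned implementation):

    import java.util.Arrays;
    import java.util.concurrent.RecursiveAction;

    class QuickSortAction extends RecursiveAction {
        private static final int THRESHOLD = 10_000; // tune per comparison cost
        private final int[] data;
        private final int lo, hi; // inclusive bounds

        QuickSortAction(int[] data, int lo, int hi) {
            this.data = data; this.lo = lo; this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo < THRESHOLD) {
                Arrays.sort(data, lo, hi + 1); // sequential below the threshold
                return;
            }
            int p = partition(data, lo, hi);
            invokeAll(new QuickSortAction(data, lo, p - 1),
                      new QuickSortAction(data, p + 1, hi));
        }

        private static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi], i = lo - 1;
            for (int j = lo; j < hi; j++) {
                if (a[j] <= pivot) { i++; int t = a[i]; a[i] = a[j]; a[j] = t; }
            }
            int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
            return i + 1;
        }
    }
    // Usage: new java.util.concurrent.ForkJoinPool().invoke(
    //         new QuickSortAction(array, 0, array.length - 1));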
What's the size of the ForkJoinPool? Usually it's set to create the same number of threads as processors, so you shouldn't be seeing too many threads. If you've manually set the parallelism high (say, in the hundreds or thousands), then you will see high (virtual) memory use, since each thread allocates space for its stack (256K by default on 32-bit Windows and Linux).
As a rule of thumb for a CPU bound computation, once your number of threads exceeds the number of available cores, adding more threads is not going to speed things up. In fact, it will probably slow you down due to the overheads of creating the threads, the resources tied down by each thread (e.g. the thread stacks), and the cost of synchronizing.
Indeed, even if you had an infinite number of cores, it would not be worth creating threads to do small tasks. Even with thread pools and other clever tricks, if the amount of work to be done in a task is too small the overheads of using a thread will exceed any savings. (It is difficult to predict exactly where that threshold is, and it certainly depends on the nature of the task as well as platform-related factors.)
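For illustration, a minimal sketch of the "no more threads than cores" rule of thumb (class name is illustrative):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores); // one thread per core
            // submit CPU-bound tasks here; more threads than cores mostly adds overhead
            pool.shutdown();
        }
    }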
I changed my code, and so far I have better results. I invoke the main task in the ForkJoinPool, and within the tasks I don't create more subtasks if there are already far more active threads than available cores in the ForkJoinPool.
I don't synchronize through the join() method. As a result, a parent task dies as soon as it has created its offspring. In the main function that invoked the root task, I wait for the tasks to be completed, i.e. until there are no more active threads. It seems to work fine: the memory stays normal, and I gained lots of time over the same piece of code executed sequentially.
I am going to learn more.
Thank you all !
I am testing my java application for any performance bottlenecks. The application uses concurrent.jar for locking purposes.
I have a computation-heavy call which invokes lock and unlock functions for its operations.
On removing the lock-unlock mechanism from the code, I saw performance degrade many-fold, contrary to my expectations. Among other things, I observed an increase in CPU consumption, which made it look as though the program was running faster, when actually it was not.
Q1. What can be the reason for this degradation in performance when we remove locks?
Best Regards !!!
This can be quite a usual finding, depending on what you're doing and what you're using as an alternative to Locks.
Essentially, what happens is that constructs such as ReentrantLock have some logic built into them that knows "when to back off" when they realistically can't acquire the lock. This reduces the amount of CPU that's burnt just in the logic of repeatedly trying to acquire the lock, which can happen if you use simpler locking constructs.
As an example, have a look at the graph I've hurriedly put up here. It shows the throughput of threads continually accessing random elements of an array, using different constructs as the locking mechanism. Along the X axis is the number of threads; the Y axis is throughput. The blue line is a ReentrantLock; the yellow, green and brown lines use variants of a spinlock. Notice how with low numbers of threads, the spinlock gives higher throughput, as you might expect, but as the number of threads ramps up, the back-off logic of ReentrantLock kicks in and it ends up doing better, while under high contention the spinlocks just sit there burning CPU.
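For reference, the spinlock variants were of roughly this shape (a minimal sketch, not the actual benchmark code): there is no back-off, so under contention each waiting thread burns CPU retrying.

    import java.util.concurrent.atomic.AtomicBoolean;

    class SpinLock {
        private final AtomicBoolean locked = new AtomicBoolean(false);

        void lock() {
            while (!locked.compareAndSet(false, true)) {
                // spin: retry immediately, burning CPU while waiting
            }
        }

        void unlock() {
            locked.set(false);
        }
    }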
By the way, this was really a trial run done on a dual-processor machine; I also ran it in the Amazon cloud (effectively an 8-way Xeon), but I've, ahem, mislaid the file. I'll either find it or run the experiment again soon and post an update. But you get an essentially similar pattern, as I recall.
Update: whether it's in locking code or not, a phenomenon that can happen on some multiprocessor architectures is that as the multiple processors do a high volume of memory accesses, the memory bus can be flooded, and in effect the processors slow each other down. (It's a bit like Ethernet: the more machines you add to the network, the more chance of collisions as they send data.)
Profile it. Anything else here will be just a guess and an uninformed one at that.
Using a profiler like YourKit will not only tell you which methods are "hot spots" in terms of CPU time but it will also tell you where threads are spending most of their time BLOCKED or WAITING
Is it still performing correctly? For instance, there was a case in an app server where an unsynchronised HashMap caused an occasional infinite loop. It is not too difficult to see how work could simply be repeated.
The most likely culprit when performance declines and CPU usage increases after removing shared-memory protection is a race condition: two or more threads could be continually flipping a state flag back and forth on a shared object.
More description of the purpose of your application would help with diagnosis.