I am testing my java application for any performance bottlenecks. The application uses concurrent.jar for locking purposes.
I have a high computation call which calls lock and unlock functions for its operations.
On removing the lock-unlock mechanism from the code, I have seen the performance degradation by multiple folds contrary to my expectations. Among other things observed was increase in CPU consumption which made me feel that the program is running faster but actually it was not.
Q1. What can be the reason for this degradation in performance when we remove locks?
Best Regards !!!
This can be quite a usual finding, depending on what you're doing and what you're using as an alternative to Locks.
Essentially, what happens is that constructs such as ReentrantLock have some logic built into them that knows "when to back off" when they realistically can't acquire the lock. This reduces the amount of CPU that's burnt just in the logic of repeatedly trying to acquire the lock, which can happen if you use simpler locking constructs.
As an example, have a look at the graph I've hurriedly put up here. It shows the throughput of threads continually accessing random elements of an array, using different constructs as the locking mechanism. Along the X axis is the number of threads; Y axis is throughput. The blue line is a ReentrantLock; the yellow, green and brown lines use variants of a spinlock. Notice how with low numbers of threads, the spinlock gives heigher throughput as you might expect, but as the number of threads ramps up, the back-off logic of ReentrantLock kicks in, and it ends up doing better, while with high contention, the spinlocks just sit burning CPU.
By the way, this was really a trial run done on a dual-processor machine; I also ran it in the Amazon cloud (effectively an 8-way Xeon) but I've ahem... mislaid the file, but I'll either find it or run the experiment again soon and train and post an update. But you get an essentially similar pattern as I recall.
Update: whether it's in locking code or not, a phenomenon that can happen on some multiprocessor architectures is that as the multiple processors do a high volume of memory accesses, you can end up flooding the memory bus, and in effect the processors slow each other down. (It's a bit like with ethernet-- the more machines you add to the network, the more chance of collisions as they send data.)
Profile it. Anything else here will be just a guess and an uninformed one at that.
Using a profiler like YourKit will not only tell you which methods are "hot spots" in terms of CPU time but it will also tell you where threads are spending most of their time BLOCKED or WAITING
Is it still performing correctly? For instance, there was a case in an app server where an unsychronised HashMap caused an occasional infinite loop. It is not to difficult to see how work could simply be repeated.
The most likely culprit for seeing performance decline and CPU usage increase when you remove shared memory protection is a race condition. Two or more threads could be continually flipping a state flag back and forth on a shared object.
More description of the purpose of your application would help with diagnosis.
Related
Suppose I have a large batch of memory-bound tasks that are quite independent of one another. To make things concrete, let's say I can allocate 30GB for the heap and that each task requires on average about 3GB of memory at its peak, but with some variability both over time and from task to task. A few tasks here and there might even require 6GB.
In this case, it seems more efficient to try to run 10 (or arguably even more) tasks concurrently, and if / when we bump into the memory limit have the task wait, much the same as we do with other shared resources like I/O, specific memory addresses (which are accessed through locks), etc.
Is it possible do this in Java? More generally
What's the best way to handle memory-bound task scheduling in Java?
Some Related Questions and "Close Misses"
This question asks whether it's possible to have threads in java wait for memory instead of throwing an OOM exception, but the answers seem to focus on why this is a bad idea to begin with - perhaps because the question suggests the number of threads is unreasonable. Also, I guess treating all memory requests as equal can lead to deadlocks. So I want to emphasize that here we are talking about only about 10 tasks, and the desire to "max out" the memory usage seems like a very natural one. I do not mind wrapping my tasks by some suitable logic that will distinguish their memory requests as having lower priority. I can even accept a solution where I need to identify the class whose instances are filling up the memory and maybe add some suitable counter - but I'd prefer a platform-independent solution that works "out of the box", if there is one.
This question also also asks about scheduling memory-bound tasks but seems to presuppose a specific solution framework.
The problem is that within a single JVM you have very little control on how much memory a single thread is going to use; unless you make use of offheap (e.g. using Unsafe or direct memory as AnatolyG already mentioned). If you have huge array allocations, you could also control these. But we need to know more about the data-structures that consume the most memory.
But if you have orbitrary object graphs you don't have much control over, perhaps it smarter to model the problem using multiple processes. You have 1 intake controller process and then a bunch of worker processes. And on each process you can configure the maximum amount of heap a JVM is allowed to use.
Bumping into memory limits on OS level can be a huge PITA because it could lead to swapping and this will makes all the threads in a system slow. Or even worse, OOM-killer. Make sure you set the vm.swappiness to a very low value to prevent premature swapping.
Do you know up front how much memory a process is going to consume? If so, then you could keep track of the maximum amount of memory being consumed in the system and don't allow for new tasks in the system before tasks have completed.
If you don't know up front the memory limits, then you could assume each tasks will use the maximum, but this can lead to under-utilization of memory.
I've been looking for the way to increase the speed of processing messages received from rabbitmq queue. The only way I've found is make more than one threads doing the same - receiving and processing. And this gave me some profit. After I created 4 threads the speed quadrupled. As I have 8-core processor I've decided to increase the number of threads to 8. But this gives no performance increasing. YourKit shows that only 50% of CPU is used. Somebody can say that my app is lightweight and so it is so, but I can say that it can't do more work than it does regardless I produce much more what to do. Why this doesn't work?
There are many different issues that can constrain the maximum speed of some application on a given system. For example, it can be limited by memory bandwidth, by Amdahl's Law effects (time needed for non-parallel code, including synchronized blocks), I/O bandwidth, and cache space.
If you want further improvement you need to do some measurements and profiling to find where the time is going, and then work on that.
The short (and not particularly helpful) answer is "overheads and bottlenecks".
For instance:
Creating threads in Java is relatively expensive. If the amount of work done by a thread isn't large, the overhead of creating the thread can out-weigh the benefits.
Context switching between threads is relatively expensive, especially when you take account of memory-related overheads such as cache misses, TLB misses. (These overheads actually hit when a native thread is assigned to a core. If the OS can somehow keep a native thread on a single core continuously (i.e. with no other threads on the same core), then it can use spinlocking ... and avoid the context switch. But the more Java threads you have, the less likely it is that the OS can do this.)
The threads may be spending a large proportion of their time waiting for I/O to complete. The I/O system's throughput or the speed / latency of some external service can be a bottleneck.
You may have contention over data structures; e.g. threads requiring exclusive access to safely read or update (say) a shared Map. If threads regularly need to wait for others to release locks, then you have a bottleneck.
Your computation may be dominated by the costs of "feeding" the threads. For example, if there is a single master thread that hands out "work" to worker threads, then the master thread's activities could be the bottleneck; i.e. it may not be able to provide enough work to keep the workers busy.
Since your tags imply that you are using a message queue, it is possible that that is the bottleneck, especially if the messages are big or the "work" done on each one is relatively small.
(Using a separate separate message queue service is liable to increase context switches, add I/O latency, add protocol overheads and so on. It's not an automatic route to performance improvement for small-scale systems.)
It is also possible that you have "hyperthreaded" cores not real cores, or that the operating system is stopping your JVM from using all cores.
If CPU or waiting for IO is your bottle neck, adding independent threads can make a big difference.
If you have a shared resource is a bottleneck, e.g. your L3 cache, your network adapter, your kernel, adding threads won't help because CPU is not the problem. In fact it can often make it worse by adding overhead.
my app is lightweight
In which case CPU is unlikely to be your issue and you are doing well to see a speed up with more than 1 CPU. Most likely you are speeding up CPU used by RabbitMQ. Ideally it should be more efficient and this shouldn't really help much. IMHO, more efficient messaging solutions don't gain much by multiple CPUs as they will not be bottlenecked on CPU.
One way or another, you're only using 4 cores. There's a lot that can stop stop you from doubling your performance by doubling your threads, but from your 4 thread success you've gotten past all that. I'm guessing there's a bug in your code to set off 8 threads and it's only firing up 4. (Even with hyperthreading, you're going to get some improvement. Even with every possible problem, you're going to get some improvement.) Otherwise, I'll go with T.J.Crowder and Stephen C: I don't think you really have 8 cores.
I'd try using different numbers of threads: 3, 5, 6. See what changes. I think you'll stumble on the problem soon enough.
To be fair to Java: if you write thread-safe code and avoid bottlenecks, it handles threads really well, as you've noticed going from one thread to 4. I've always found the overhead costs to be trivial.
Your application does not have linear speedup, and therefore, it does not have good scalability.
In order to keep increase the number of threads you need to ensure the data being handle is growing accordingly. For a fixed amount of data, increasing number of threads (and/or cores) will have a diminishing return at some point since the overhead of creating threads will outweight the thread's compute time.
Make sure to look up the following link:
Ahmdahl's law
Gustafson's law
Gustafson's law is a great counterpoint to Ahmdal's law so I highly recommend understanding that article.
I am working on a system where I need a while(true) where the loop constantly listens to a queue and increments counts in memory.
The data is constantly coming in the queue, so I cannot avoid a while(true) condition. But naturally it increases my CPU utilization to 100%.
So, how can I keep a thread alive which listens to the tail of queue and performs some action, but at the same time reduce the CPU utilization to 100%?
Blocking queues were invented exactly for this purpose.
Also see this: What are the advantages of Blocking Queue in Java?
LinkedBlockingQueue.take() is what you should be using. This waits for an entry to arrive on the queue, with no additional synchronization mechanism needed.
(There are one or two other blocking queues in Java, IIRC, but they have features that make them unsuitable in the general case. Don't know why such an important mechanism is buried so deeply in arcane classes.)
usually a queue has a way to retrieve an item from it and your thread will be descheduled (thus using 0% cpu) until something arrives in the queue...
Based on your comments on another answer, you want to have a queue that is based on changes in hsqldb
Some quick googling turns up:
http://hsqldb.org/doc/guide/triggers-chapt.html
It appears you can set it up so that changes cause a trigger to occur, which will notify a class you write implementing the org.hsqldb.Trigger interface. Have that class contain a reference to a LinkedBlockingDequeue from the Concurrent package in Java and have the trigger add the change to the queue.
You now have a blocking queue that your reading thread will block on until hsqldb fires a trigger (from an update by a writer) which will put something in the queue. The waiting thread will then unblock and have the item off the queue.
lbalazscs and Brain have excellent answers. I couldn’t share my code it was hard for them to give them the exact fix for my issue. And having a while(true) which constantly polls a queue is surely the wrong way to go about it. So, here is what I did:
I used ScheduledExecutorService with a 10sec delay.
I read a block of messages (say 10k) and process those messages
thread is invoked again and the "loop" continues.
This considerably reduces my CPU usage. Suggestions welcomed.
Lots of dumb answers from people who read books and only wasted time in schools, not as many direct logic or answers I see.
while(true) will set your program to use all the CPU power that's basically 'alloted' to it by the windows algorithms to run what is in the loop, usually as-fast-as-possible over and over. This doesn't mean if it says 100% on your application, that if you run a game, your empty loop .exe will be taking all your OS CPU power, the game should still run as intended. It is more like a visual bug, similar to the windows idle process and some other processes. The fix is to add a Sleep(1) (at least 1 millisecond) or better yet a Sleep(5) to make sure other stuff can run and ensure the CPU is not constantly looping your while(true) as fast as possible. This will generally drop CPU usage to 0% or 1% in the visual queue as 1 full millisecond is a big resting time for even older CPU.
Many times while(trues) or generic endless loops are bad designs and can be drastically slowed down to even Sleep(1000) - 1 second interval checks or higher. Endless loops are not always bad designs, but usually they can be improved..
funny to see this bug I learned whe nI was like 12 learning C pop up and all the 'dumb' answers given.
Just know if you try it, unless the scripted slower language you have learned to use has fixed it somewhere along the line by itself, windows will claim to use a lot of CPU on doing an empty loop when the OS is actually having free resources to spend.
can you explain this nonsense to me?
i have a method that basically fills up an array with mathematical operations. there's no I/O involved or anything. now, this method takes about 50 seconds to run, and the code is perfectly scalable (theoretically 100%), so i split it up into 4 threads, wait for them to complete, and reassemble the 4 arrays. now, i run the program on a quad core processor, expecting it to take about 15 seconds, and it actually takes 58 seconds. that's right: it takes longer! i see the cpu working 100%, and i know that each thread does 1/4 of the calculations, and creating threads and reassembling the arrays take about 1-2 ms in total.
what's causing such loss of performance? what the hell is the cpu doing all that time?
CODE: http://pastebin.com/cFUgiysw
Threads don't work that way.
Threads are still part of the same process (depending on the OS), so in terms of the operating system - CPU time will be scheduled the same for 4 threads in 1 process as it is for 1 thread in 1 process.
Also, with such a small number of values, you won't see the scalability in the midst of the overhead. Re-assembling the arrays in java will be costly.
Check out things like "Context switching overhead" - things like that always mess you up when you try to map theory to practise :P
I would stick to the single-threaded way :)
~ Dan
http://en.wikipedia.org/wiki/Context_switch
A lot depends on what you are doing and how you are dividing the work. There are many possible causes for this problem.
The most likely cause is, you are using all the bandwidth of your CPU to main memory bus with one thread. This can happen if your data set is larger than your CPU cache. esp if you have some random access behaviour. You could consider trying to reuse the original array, rather than taking multiple copies to reduce cache churn.
Your locking overhead is greater than the performance gain. I suspect you have used very course locking so this shouldnt be an issue.
Starting stopping threads takes too long. As your code is multi second, I doubt this too.
There is a cost associated with opening new threads. I don't think it should be up to 8 second but it depends on what threads you are using. Some threads needs to create a copy of the data that you are handling to be thread safe and that can take some time. This cost is commonly referred to as overhead. If the execution you are doing is somewhere not serializable for instance reads the same file or needs access to a shared resource the threads might need to wait on each other this can take some time and under sub optimal conditions it can take more time than serial execution. My tip is try and check for these unserializable events remove them from the threaded part if possible. Also try and use a lower amount of threads 4 threads for 4 cpus is not always optimal.
Hope it helps.
Unless you are constantly creating and killing threads the thread overhead shouldn't be a problem. Four threads running simultaeously is no big deal for the scheduler.
As Peter Lawrey suggested the memory bandwidth could be the problem. Your 50-second code is running on a java engine and they both compete for the available memory bandwidth. The java engine needs memory bandwidth to execute your code and your code needs it to do its calculations.
You write "perfectly scalable" which would be the case if your code was compiled. Since it runs on a java engine this is not the case. So the 16% increase in overall time could be seen as the difference between the smoothness of one thread vs the chaos of four colliding over memory accesses.
I am fairly new with concurrent programming and I am learning it.
I am implementing a quick sort in Java JDK 7 (Fork Join API) to sort a list of objects (100K).
While using this recursive piece of code without using concurrency,i observe no memory explosion, everything is fine.
I just added the code to use it on multi cores (by extending the class RecursiveAction) and then the memory usage jumped very high until it reached its limits. By doing some profiling i observe a high creation rate of threads and i think its expectable.
But, is a java Thread by itself much more memory demanding or am i missing something here ?
Quicksort must requires a lot of threads but not much than regular objects.
Should I stop creating RecursiveAction Threads when i meet a threshold and then just switch to a sequential piece of code (no more threads)?
Thank you very much.
Java threads usually take 256k/512k(depeding in OS, jdk versions..) of stack space alone, by default.
You're wasting huge resources and speed if you're running more threads than you have processors/cores for a CPU intensive process such as doing quicksort, so try to not run more threads than you have cores.
Yes, switching over to sequential code is a good idea when the unit of work is in the region of ca. 10,000-100,000 operations. This is just a rule of thumb. So, for quick sort, I'd drop out to sequential execution when the size to be sorted is less than say 10-20,000 elements, depending upon the complexity of the comparison operation.
What's the size of the ForkJoinPool - usually it's set to create the same number of threads as processors, so you shouldn't be seeing too many threads. If you've manually set the parallelism to be high (say, in the hundreds or thousands) then you will see high (virtual) memory use, since each thread allocates space for the stack (256K by default on 32-bit windows and linux.)
As a rule of thumb for a CPU bound computation, once your number of threads exceeds the number of available cores, adding more threads is not going to speed things up. In fact, it will probably slow you down due to the overheads of creating the threads, the resources tied down by each thread (e.g. the thread stacks), and the cost of synchronizing.
Indeed, even if you had an infinite number of cores, it would not be worth creating threads to do small tasks. Even with thread pools and other clever tricks, if the amount of work to be done in a task is too small the overheads of using a thread will exceed any savings. (It is difficult to predict exactly where that threshold is, and it certainly depends on the nature of the task as well as platform-related factors.)
I changed my code and so far I have better results. I invoke the main Thread task in the ForkJoinPool, in the Threads, I dont create more threads if there are a lot more active threads than available cores in the ForkJoinPool.
I dont do synchronism through the join() method. As a result a parent thread will die as soon as it created its offsprings. In the main function that invoked the root task. I wait for the tasks to be completed, aka no more active threads. Its seems to work fine as the memory stays normal and i gained lots of time over a the same piece of code executed sequentially.
I am going to learn more.
Thank you all !