design a test program: increase/decrease cpu usage - java

I am trying to design and then write a Java program that can increase/decrease CPU usage. Here is my basic idea: write a multi-threaded program in which each thread does floating-point calculations, and increase or decrease CPU usage by adding or removing threads.
I am not sure what kinds of floating-point operations are best in this test case, especially since I am going to test inside a VMware virtual machine.

You can just sum up the reciprocals of the natural numbers. Since this sum doesn't converge, the compiler will not dare to optimize it away. Just make sure that the result is somehow used in the end.
1/1 + 1/2 + 1/3 + 1/4 + 1/5 ...
This will of course occupy the floating-point unit, but not necessarily the rest of the central processing unit. So whether this approach is good or not is the main question you should ask.
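A minimal sketch of that idea in Java (class and field names are mine, purely illustrative): each worker keeps adding reciprocals and publishes its partial sum through a volatile field so the JIT cannot discard the work, and you raise or lower the load by starting or stopping workers.

    // Illustrative sketch: one worker per "unit" of CPU load.
    import java.util.ArrayList;
    import java.util.List;

    public class CpuLoadDemo {
        static class Worker extends Thread {
            volatile boolean running = true;
            volatile double sink;                 // published so the JIT cannot drop the loop

            @Override public void run() {
                double sum = 0.0;
                for (long n = 1; running; n++) {
                    sum += 1.0 / n;               // harmonic series: never converges
                    if ((n & 0xFFFFF) == 0) sink = sum;  // occasionally "use" the result
                }
                sink = sum;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            List<Worker> workers = new ArrayList<>();
            // Increase load: start more workers.
            for (int i = 0; i < 2; i++) {
                Worker w = new Worker();
                w.start();
                workers.add(w);
            }
            Thread.sleep(10_000);
            // Decrease load: stop workers again.
            for (Worker w : workers) w.running = false;
            for (Worker w : workers) w.join();
        }
    }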

Just simple busy loops will increase the CPU usage -- I am not sure whether doing FP calculations will change this significantly or otherwise achieve a more consistent load factor, even though it does exercise the FPU and not just the ALU.
While creating a similar proof-of-concept in C# I used a fixed number of threads and varied the sleep/work durations of each thread. Bear in mind that this process isn't exact and is subject to both CPU and process throttling, as well as other factors of modern preemptive operating systems. Adding VMware to the mix may also compound the observed behavior. In degenerate cases harmonics can form between the code designed to adjust the load and the load reported by the system.
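Translated into Java, that duty-cycle approach might look roughly like this (the 100 ms period, the names and the busy-work expression are assumptions of mine, not taken from the original C# code):

    // Sketch of a duty-cycle load thread: do floating-point busy-work for a
    // fraction of each period and sleep for the rest.
    public class DutyCycleLoad implements Runnable {
        private static final long PERIOD_MILLIS = 100;
        private volatile double load;            // target fraction, 0.0 .. 1.0
        public volatile double sink;             // keeps the busy loop from being optimized away

        public DutyCycleLoad(double load) { this.load = load; }
        public void setLoad(double load)  { this.load = load; }

        @Override public void run() {
            double x = 0.0;
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    long busy = (long) (PERIOD_MILLIS * Math.min(1.0, Math.max(0.0, load)));
                    long start = System.currentTimeMillis();
                    while (System.currentTimeMillis() - start < busy) {
                        x += 1.0 / (x + 1.0);    // arbitrary floating-point work
                    }
                    sink = x;
                    Thread.sleep(PERIOD_MILLIS - busy);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        public static void main(String[] args) {
            new Thread(new DutyCycleLoad(0.5)).start();   // roughly 50% of one core
        }
    }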
If lower-level constructs were used (generally requiring "kernel mode" access) then a more consistent throttle could be implemented -- partially because of the ability to avoid certain [thread] preemptions :-)
Another alternative that may be looked into, with the appropriate hardware, is setting the CPU clock and then running at 100%. The current Intel Core i line of chips is very flexible this way (the CPU multiplier can be set discretely through the entire range), although access from Java may be problematic. This would be done on the host, of course, not in VMware.
Happy coding.

Related

Software benefits of FPGA

I have a question: I understand that an FPGA takes advantage of hardware parallelism and that it controls I/O at the hardware level in order to provide faster response times, but what are the software benefits of an FPGA? Which software components can be accelerated? Thanks in advance.
Both for prototyping and parallelism. Since FPGAs are cheap, they are good candidates both for making industrial prototypes and for building parallel systems. FPGAs consist of arrays of logic elements connected with wires. The elements contain small lookup tables and flip-flops. FPGAs scale to thousands of lookup tables. Lookup tables and programmable wires are flexible enough to implement any logic function. Once you have the function ready, you might want to move to an ASIC. Xilinx and Altera are the major brands. Personally I use the Altera DE2 and DE2-115.
You are correct about the parallelism and I/O control of an FPGA. FPGAs are just huge re-configurable logic circuits that allow a developer to create circuits with very specific and dedicated functions. They typically come with a very large amount of I/O compared to typical microcontrollers as well. Because it is basically a bunch of gates in silicon, everything you code in your hardware description language (HDL) happens in parallel. This, combined with the fact that you can write custom circuits, is what gives an FPGA the ability to accelerate operations over a typical processor, even though the processor may have a much higher clock speed.
To better illustrate this point, let's say you have an algorithm that you need to run and you want to compare an FPGA to a processor. The FPGA is clocked at 100 MHz and the processor at 3 GHz, so the processor runs at a rate 30 times faster than the FPGA. Now say you code up a circuit that can compute the algorithm on the FPGA in 10 clock cycles, while the equivalent algorithm on the processor takes thousands of instructions to execute. This places the FPGA far ahead of the processor in terms of performance. And because of the parallel nature of FPGAs, if you code it right and the flow through the FPGA is continuous, the FPGA can finish computing a new result every clock cycle. Every stage in the FPGA executes concurrently, so although one result takes 10 clock cycles, at each clock cycle a different piece of 10 different results can be computed simultaneously (this is called pipelining: http://en.wikipedia.org/wiki/Pipeline_%28computing%29 ). A processor is not capable of this and cannot compute the different stages in parallel without taking extra time to do so.
Processors are also bound, performance-wise, by their instruction set, whereas on an FPGA this can be overcome with good, application-specific circuit design. A processor is a very general design that can run any combination of instructions, so it takes longer to compute things because of its generalized nature. FPGAs also don't have the issues of moving things in and out of cache or RAM; they typically do use RAM, but in a parallel fashion that does not inhibit or bottleneck the flow of the processing. It is also interesting to note that a processor can be created and implemented on an FPGA, because you can implement the circuits that compose a processor.
Typically, you find FPGAs on boards alongside processors or microcontrollers to speed up very math-intensive or Digital Signal Processing (DSP) tasks, or tasks that require a large flow of data. For example, a wireless modem that communicates via RF has to do quite a bit of DSP to pick signals out of the air and decode them, something like the receiver chip in a cell phone. There will be lots of data continually flowing in and out of the device. This is perfect for an FPGA because it can process such a large amount of data in parallel. The FPGA can then pass the decoded data off to a microcontroller to allow it to do things like display the text from a text message on a pretty touchscreen display. Note that the chip in your cell phone is not an FPGA but an ASIC (application-specific integrated circuit): a bunch of circuits stripped down to the bare minimum for maximum performance and minimal cost. However, it is easy to prototype ASICs on FPGAs, and the two are very similar. An FPGA can be wasteful because it can have a lot of resources on board that are not needed. Generally you only move from an FPGA to an ASIC if you are going to be producing a TON of them and you know they work perfectly and will only ever have to do exactly what they currently do. Cell phone transceiver chips are perfect for this: they sell millions and millions of them, and they only have to do that one thing their entire life.
On something like a desktop processor, it is common to see the term "hardware acceleration". This generally means that an FPGA or ASIC is on board to speed up certain operations. A lot of the time this means an ASIC (probably on the die of the processor) is included to handle operations such as floating-point math, encryption, hashing, signal processing, string processing, etc. This allows the processor to process data more efficiently by offloading certain operations that are known to be difficult and time-consuming for a processor. The circuit on the die, ASIC, or FPGA can do the computation in parallel while the processor does something else, then hand the answer back. So the speedup can be very large, because the processor is not bogged down with the computation and is free to continue processing other things while the other circuit performs the operation.
Some think of FPGAs as an alternative to processors. This is usually fundamentally wrong.
FPGAs are customizable logic. You can implement a processor in an FPGA and then run your regular software on that processor.
The key advantage with FPGAs is the flexibility to implement any kind of logic.
Think of a processor system. It might have a serial port, USB, Ethernet. What if you need another more specialized interface which your processor system does not support? You would need to change your hardware. Possibly create a new ASIC.
With an FPGA you can implement a new interface without the need for new hardware or ASICs.
FPGAs are almost never used to replace a processor. The FPGA is used for particular tasks, such as implementing a communications interface, speeding up a specific operation, or switching high-bandwidth communication traffic. You still run your software on a CPU.

Measure contention in wait-free multi-threaded java programs

I have a wait-free implementation for binary search trees but I am not able to figure out concrete methods to measure thread contention. By contention, here I mean number of threads that try to access the same piece of memory at the same time.
So far, I have looked at the ThreadMXBean and ThreadInfo classes, but as there are no locks involved, I haven't found any solution yet.
There is no way to measure contention over a "memory location" without prohibitive performance costs. Direct measurement (e.g. a properly synchronized counter wrapping all the accesses) will introduce artificial bottlenecks, which will ruin the reliability of the test.
"Same time" is loosely defined at the scale you want to measure it, because only a single CPU "owns" a particular memory location at any given time. The best you can do in this case is to measure the rate at which the CPUs are dealing with memory conflicts, e.g. through the HW counters. Doing that requires an understanding of the memory subsystem on a given platform. Also, the HW counters are attributed to machine (= CPU) state, not memory state; in other words, you can estimate how many conflicts particular instructions have experienced, not how many CPUs accessed a given memory location.
Trying to measure inside the source of the contention is the wrong approach anyway. What is the reason you care about contention in the first place?
So, first of all, set up a benchmarking suite which runs typical access patterns on your data structure and graph the performance for different thread counts. Here is a nice example from the nitro cache performance page.
If you scale almost linearly: congrats, you are done!
If you don't scale linearly, you need more insight. Then you need to profile the system as a whole and see what the reason is, e.g. CPU pipeline stalls. The best way is to use low-overhead tracing for this. On Linux you can use OProfile, which also has Java support that helps you correlate the JITed machine code with your Java program.
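For the benchmarking suite itself, a JMH sketch along these lines is a natural fit (a ConcurrentSkipListSet stands in for the wait-free tree here, and the sizes are arbitrary); it runs the same read operation from many threads against one shared structure:

    // Minimal JMH sketch for graphing throughput against thread count.
    import org.openjdk.jmh.annotations.*;
    import java.util.concurrent.ConcurrentSkipListSet;
    import java.util.concurrent.ThreadLocalRandom;

    @State(Scope.Benchmark)             // one shared structure, accessed by all benchmark threads
    @BenchmarkMode(Mode.Throughput)
    public class TreeScalingBenchmark {
        ConcurrentSkipListSet<Integer> tree;

        @Setup
        public void setUp() {
            tree = new ConcurrentSkipListSet<>();
            for (int i = 0; i < 100_000; i++) tree.add(i);
        }

        @Benchmark
        public boolean contains() {
            return tree.contains(ThreadLocalRandom.current().nextInt(100_000));
        }
    }

Run it with different thread counts (the -t option on the JMH command line, or a @Threads annotation) and plot ops/s against the number of threads; near-linear growth in total throughput is the scaling you are hoping for.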

Normalization of speed for testing on different multicore processors

I want to calculate the run time of some simple C programs on different multi-core processors. But as we know, with the advancement of technology, new processors incorporate more techniques for faster computation, such as higher clock speeds. How can I normalize away such speed differences (to filter out the effect of processor improvements other than the number of cores), since I only want results based on the number of cores of the processor?
Under Linux, you can boot with the kernel command line parameter maxcpus=N to limit the machine to only N CPUs. See Documentation/kernel-parameters.txt in the kernel source for details.
Most BIOS environments also have the ability to turn off hyperthreading; depending upon your benchmarks, HT may speed up or slow down your tests; being in control of HT would be ideal.
Decide on a known set of reference hardware, run some sort of repeatable reference benchmark against this, and get a good known value to compare to. Then you can run this benchmark against other systems to figure out how to scale the values you get from your target benchmark runs.
The closer your reference benchmark is to your actual application, the more accurate the results of your scaling will be. You could have a single deterministic run (single code path, maybe average of multiple executions) of your application used as your reference benchmark.
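As a hedged sketch of that scaling arithmetic (the variable names and sample numbers are mine, and it assumes run time scales roughly linearly with the single-core reference result, which is only an approximation):

    // Sketch: scale a measured run time by the ratio of single-core reference
    // benchmark times, so machines of different per-core speed become comparable.
    public class Normalize {
        public static void main(String[] args) {
            double refTimeBaseline = 12.0; // reference benchmark on the baseline machine (s)
            double refTimeTarget   = 8.0;  // same reference benchmark on the target machine (s)
            double measuredTarget  = 40.0; // your multi-core program on the target machine (s)

            // Express the target's result as if its cores ran at the baseline's speed.
            double normalized = measuredTarget * (refTimeBaseline / refTimeTarget);
            System.out.printf("normalized time: %.1f s%n", normalized);
        }
    }

Here the target's cores are 1.5 times faster on the reference benchmark, so its 40 s measurement is reported as 60 s at baseline speed; the remaining differences between machines are then attributed to core count, within the limits the next answer describes.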
If I understand you correctly, you are trying to find a measurement approach that allows you to separate the effect of scaling the number of cores from the effect of single-processor improvements. I am afraid that is not easily possible. E.g. if you compare a multi-core system to one single core of that system, you get a non-linear correlation, because there are shared resources such as the memory bus. If you use only one core of a multi-core system, it can use the complete memory bandwidth, while in the multi-core case it has to share it. Similar arguments apply to many shared resources: caches, buses, I/O capabilities, ALUs, etc.
Your issue is with the automatic scaling of core frequency based on the number of active cores at any given time. For instance, AMD Phenom 6-core chips operate at 3.4 GHz (or somewhat similar), and if your application creates more than 3 threads the clock goes down to 2.8 GHz (or similar). Intel, on the other hand, uses a bunch of heuristics to determine the right frequency at any given time.
However, you can always turn these settings off in the BIOS, and then the results will be comparable, differing only in clock frequency. Usually, people measure gigaflops to get comparable results.

If profiler is not the answer, what other choices do we have?

After watching the presentation "Performance Anxiety" of Joshua Bloch, I read the paper he suggested in the presentation "Evaluating the Accuracy of Java Profilers". Quoting the conclusion:
Our results are disturbing because they indicate that profiler incorrectness is pervasive—occurring in most of our seven benchmarks and in two production JVMs—and significant—all four of the state-of-the-art profilers produce incorrect profiles. Incorrect profiles can easily cause a performance analyst to spend time optimizing cold methods that will have minimal effect on performance. We show that a proof-of-concept profiler that does not use yield points for sampling does not suffer from the above problems.
The conclusion of the paper is that we cannot really believe the results of profilers. But then, what is the alternative to using profilers? Should we go back to just using our gut feeling to do optimization?
UPDATE: A point that seems to be missed in the discussion is the observer effect. Can we build a profiler that is really free of the observer effect?
Oh, man, where to begin?
First, I'm amazed that this is news. Second, the problem is not that profilers are bad, it is that some profilers are bad.
The authors built one that, they feel, is good, just by avoiding some of the mistakes they found in the ones they evaluated.
Mistakes are common because of some persistent myths about performance profiling.
But let's be positive.
If one wants to find opportunities for speedup, it is really very simple:
Sampling should be uncorrelated with the state of the program.
That means happening at a truly random time, regardless of whether the program is in I/O (except for user input), or in GC, or in a tight CPU loop, or whatever.
Sampling should read the function call stack,
so as to determine which statements were "active" at the time of the sample.
The reason is that every call site (point at which a function is called) has a percentage cost equal to the fraction of time it is on the stack.
(Note: the paper is concerned entirely with self-time, ignoring the massive impact of avoidable function calls in large software. In fact, the reason behind the original gprof was to help find those calls.)
Reporting should show percent by line (not by function).
If a "hot" function is identified, one still has to hunt inside it for the "hot" lines of code accounting for the time. That information is in the samples! Why hide it?
An almost universal mistake (that the paper shares) is to be concerned too much with accuracy of measurement, and not enough with accuracy of location.
For example, here is an account of performance tuning in which a series of performance problems were identified and fixed, resulting in a compounded speedup of 43 times.
It was not essential to know precisely the size of each problem before fixing it, but to know its location.
A phenomenon of performance tuning is that fixing one problem, by reducing the time, magnifies the percentages of remaining problems, so they are easier to find.
As long as any problem is found and fixed, progress is made toward the goal of finding and fixing all the problems.
It is not essential to fix them in decreasing size order, but it is essential to pinpoint them.
On the subject of statistical accuracy of measurement, if a call point is on the stack some percent of time F (like 20%), and N (like 100) random-time samples are taken, then the number of samples that show the call point is a binomial distribution, with mean = NF = 20, standard deviation = sqrt(NF(1-F)) = sqrt(16) = 4. So the percent of samples that show it will be 20% +/- 4%.
So is that accurate? Not really, but has the problem been found? Precisely.
In fact, the larger a problem is, in terms of percent, the fewer samples are needed to locate it. For example, if 3 samples are taken, and a call point shows up on 2 of them, it is highly likely to be very costly.
(Specifically, it follows a beta distribution. If you generate 4 uniform 0,1 random numbers, and sort them, the distribution of the 3rd one is the distribution of cost for that call point.
Its mean is (2+1)/(3+2) = 0.6, so that is the expected savings, given those samples.)
INSERTED: And the speedup factor you get is governed by another distribution, BetaPrime, and its average is 4. So if you take 3 samples, see a problem on 2 of them, and eliminate that problem, on average you will make the program four times faster.
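If you want to sanity-check that 0.6 figure, a quick simulation of the order-statistic argument (purely illustrative) reproduces it:

    // Check the claim: among 4 sorted uniform(0,1) numbers, the 3rd smallest
    // averages (2+1)/(3+2) = 0.6.
    import java.util.Arrays;
    import java.util.Random;

    public class OrderStatCheck {
        public static void main(String[] args) {
            Random rng = new Random(42);
            int trials = 1_000_000;
            double sum = 0.0;
            for (int t = 0; t < trials; t++) {
                double[] u = new double[4];
                for (int i = 0; i < 4; i++) u[i] = rng.nextDouble();
                Arrays.sort(u);
                sum += u[2];          // the 3rd smallest of the four
            }
            System.out.printf("mean of 3rd order statistic: %.4f (expected 0.6)%n", sum / trials);
        }
    }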
It's high time we programmers blew the cobwebs out of our heads on the subject of profiling.
Disclaimer - the paper failed to reference my article: Dunlavey, “Performance tuning with instruction-level cost derived from call-stack sampling”, ACM SIGPLAN Notices 42, 8 (August, 2007), pp. 4-8.
If I read it correctly, the paper only talks about sample-based profiling. Many profilers also do instrumentation-based profiling. It's much slower and has some other problems, but it should not suffer from the biases the paper talks about.
The conclusion of the paper is that we cannot really believe the results of profilers. But then, what is the alternative to using profilers?
No. The conclusion of the paper is that current profilers' measuring methods have specific defects. They propose a fix. The paper is quite recent. I'd expect profilers to implement this fix eventually. Until then, even a defective profiler is still much better than "feeling".
Unless you are building bleeding-edge applications that need every CPU cycle, I have found that profilers are a good way to find the 10% slowest parts of your code. As a developer, I would argue that should be all you really care about in nearly all cases.
I have experience with http://www.dynatrace.com/en/ and I can tell you it is very good at finding the low hanging fruit.
Profilers are like any other tool and they have their quirks but I would trust them over a human any day to find the hot spots in your app to look at.
If you don't trust profilers, then you can go into paranoia mode by using aspect-oriented programming, wrapping every method in your application, and then using a logger to log every method invocation.
Your application will really slow down, but at least you'll have a precise count of how many times each method is invoked. If you also want to see how long each method takes to execute, wrap every method with perf4j.
After dumping all these statistics to text files, use some tools to extract all necessary information and then visualize it. I'd guess this will give you a pretty good overview of how slow your application is in certain places.
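A hedged sketch of such an aspect using AspectJ annotations (the com.example.. package in the pointcut is a placeholder, and this assumes AspectJ weaving is already set up for the application):

    // Around-advice that logs every invocation in the chosen packages and
    // how long it took.
    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;

    @Aspect
    public class CallLoggingAspect {

        @Around("execution(* com.example..*.*(..))")
        public Object logTiming(ProceedingJoinPoint pjp) throws Throwable {
            long start = System.nanoTime();
            try {
                return pjp.proceed();                    // run the wrapped method
            } finally {
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println(pjp.getSignature() + " took " + micros + " us");
            }
        }
    }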
Actually, you are better off profiling at the database level. Most enterprise databases come with the ability to show the top queries over a period of time. Start working on those queries until the top ones are down to 300 ms or less, and you will have made great progress. Profilers are useful for showing behavior of the heap and for identifying blocked threads, but I personally have never gotten much traction with the development teams on identifying hot methods or large objects.

Java Performance Degradation on removing locks

I am testing my java application for any performance bottlenecks. The application uses concurrent.jar for locking purposes.
I have a computation-heavy call which uses the lock and unlock functions for its operations.
On removing the lock/unlock mechanism from the code, I have seen the performance degrade severalfold, contrary to my expectations. Among other things, I observed an increase in CPU consumption, which made me feel that the program was running faster, when actually it was not.
Q1. What can be the reason for this degradation in performance when we remove locks?
Best Regards !!!
This can be quite a common finding, depending on what you're doing and what you're using as an alternative to locks.
Essentially, what happens is that constructs such as ReentrantLock have some logic built into them that knows "when to back off" when they realistically can't acquire the lock. This reduces the amount of CPU that's burnt just in the logic of repeatedly trying to acquire the lock, which can happen if you use simpler locking constructs.
As an example, have a look at the graph I've hurriedly put up here. It shows the throughput of threads continually accessing random elements of an array, using different constructs as the locking mechanism. Along the X axis is the number of threads; the Y axis is throughput. The blue line is a ReentrantLock; the yellow, green and brown lines use variants of a spinlock. Notice how with low numbers of threads the spinlock gives higher throughput, as you might expect, but as the number of threads ramps up, the back-off logic of ReentrantLock kicks in and it ends up doing better, while under high contention the spinlocks just sit there burning CPU.
By the way, this was really a trial run done on a dual-processor machine; I also ran it in the Amazon cloud (effectively an 8-way Xeon) but I've ahem... mislaid the file, so I'll either find it or run the experiment again soon and post an update. You get an essentially similar pattern as I recall.
Update: whether it's in locking code or not, a phenomenon that can happen on some multiprocessor architectures is that as the multiple processors do a high volume of memory accesses, you can end up flooding the memory bus, and in effect the processors slow each other down. (It's a bit like with ethernet-- the more machines you add to the network, the more chance of collisions as they send data.)
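To make the contrast above concrete, here is a hedged sketch of the kind of naive spinlock meant by "simpler locking constructs", alongside a ReentrantLock; under heavy contention the spinlock keeps retrying its CAS and burns CPU, while ReentrantLock can park threads that fail to acquire the lock:

    // Contrast between a naive spinlock and java.util.concurrent's ReentrantLock.
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.ReentrantLock;

    public class LockContrast {
        // Naive spinlock: no back-off, no fairness, just repeated CAS attempts.
        static class SpinLock {
            private final AtomicBoolean held = new AtomicBoolean(false);
            void lock()   { while (!held.compareAndSet(false, true)) { /* spin */ } }
            void unlock() { held.set(false); }
        }

        static final SpinLock spin = new SpinLock();
        static final ReentrantLock reentrant = new ReentrantLock();
        static long spinCounter;
        static long reentrantCounter;

        static void withSpin()      { spin.lock();      try { spinCounter++; }      finally { spin.unlock(); } }
        static void withReentrant() { reentrant.lock(); try { reentrantCounter++; } finally { reentrant.unlock(); } }
    }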
Profile it. Anything else here will be just a guess and an uninformed one at that.
Using a profiler like YourKit will not only tell you which methods are "hot spots" in terms of CPU time, but also where threads are spending most of their time BLOCKED or WAITING.
Is it still performing correctly? For instance, there was a case in an app server where an unsynchronised HashMap caused an occasional infinite loop. It is not too difficult to see how work could simply be repeated.
The most likely culprit for seeing performance decline and CPU usage increase when you remove shared memory protection is a race condition. Two or more threads could be continually flipping a state flag back and forth on a shared object.
More description of the purpose of your application would help with diagnosis.
