How to include performance goals in a Java test suite?

Is there an easy way to automatically enforce goals like "This service must support 1,000 transactions per minute" in daily build tests for Java? Is this ever done in JUnit or are there caveats to it?

You can't do exactly what you are looking for, but you can get close using JUnitPerf.
JUnitPerf tests are intended to be used specifically in situations where you have quantitative performance and/or scalability requirements that you'd like to keep in check while refactoring code. For example, you might write a JUnitPerf test to ensure that refactoring an algorithm didn't incur undesirable performance overhead in a performance-critical code section. You might also write a JUnitPerf test to ensure that refactoring a resource pool didn't adversely affect the scalability of the pool under load.
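For concreteness, here is a hedged sketch of JUnitPerf's decorator style, assuming the library (com.clarkware.junitperf) and JUnit 3 are on the classpath; the inner test case, the 10-user load and the 60-second limit are hypothetical placeholders rather than a real requirement.

import com.clarkware.junitperf.LoadTest;
import com.clarkware.junitperf.TimedTest;
import junit.framework.Test;
import junit.framework.TestCase;
import junit.framework.TestSuite;

public class ExampleThroughputTest {

    // Hypothetical JUnit 3 test exercising one transaction against the service.
    public static class SingleTransactionTest extends TestCase {
        public void testProcessOneTransaction() {
            // real code would call the service under test here
        }
    }

    public static Test suite() {
        Test singleTransaction = new TestSuite(SingleTransactionTest.class);
        // Simulate 10 concurrent users each running the transaction test ...
        Test underLoad = new LoadTest(singleTransaction, 10);
        // ... and fail the build if the whole run takes longer than 60 seconds.
        return new TimedTest(underLoad, 60000);
    }
}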

That doesn't feel like a unit test to me. It would be a long-running test, one that will delay the completion of your build. You might want to reconsider if the total running time ends up longer than your build frequency.
It's not even clear to me how meaningful the question is, because it doesn't take into account simultaneous users.
You can easily do a "poor man's" multi-threaded load test like that with TestNG, as sketched below. It's not possible in JUnit 4.4, but it might be in a later version.
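A minimal sketch of that TestNG approach, using the invocationCount, threadPoolSize and timeOut attributes of @Test; processTransaction() is a hypothetical stand-in for the real service call.

import org.testng.annotations.Test;

public class PoorMansLoadTest {

    // Hypothetical stand-in for a call into the real service under test.
    private void processTransaction() {
        // real code would invoke the service here
    }

    // Run the same test method 100 times spread across 10 worker threads;
    // timeOut (in milliseconds) caps how long the invocations may take.
    @Test(invocationCount = 100, threadPoolSize = 10, timeOut = 60000)
    public void handlesTransactionsUnderLoad() {
        processTransaction();
    }
}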


Java application - which parts of my code are actually exercised in production?

I have a Java web-based application running in production. I need some way to see which parts of the code are actually being used, based on the actions of end users.
Just to clarify my requirement further.
I do not want a logging-based solution. Anything that requires me to add log statements and analyse the logs is not what I am looking for.
I need a solution that works along the same lines as a unit test coverage reporter. Cobertura or EMMA, for instance, report which parts of my code were exercised by the unit tests after they have run. I need something that will listen to the JVM in production and tell me which parts of my code are being exercised by the actions of end users.
Why am I trying to do this?
I have inherited a large codebase - some 25,000 classes. One of the things I need to do is chop off parts of the application that are not being used much. If I can show management that parts of the application are scarcely used, I can remove them and make the product a little more manageable (for instance, the manual regression test suite that needs to run every week or so, and takes a couple of days, could be shortened).
I hope there is a ready-made solution for this.
As Joachim Sauer said in the comments below your question: the most straightforward approach is to just use a Code Coverage Tool that you'd use for unit testing and instrument the production code with it.
There's a major catch: overhead. Code Coverage analysis can really slow things down and while an informed user-base will tolerate some temporary performance degradation, the whole thing needs to remain useable.
From my experience JaCoCo is relatively light and doesn't impose much overhead, whereas Cobertura will impose a tremendous slowdown. On the other hand, JaCoCo merely flags "hit or no hit" whereas Cobertura gives you per-line hit counts. This means that JaCoCo will only let you find dead spots, whereas Cobertura will let you find rarely hit spots.
Whichever of these two tools you use (possibly one after the other), you may end up with giant class whitelists and class blacklists to restrict the coverage counting to places where it makes sense to do so, thereby keeping the performance overhead down. For example, if the entire thing has a single front controller Servlet, including that in the analysis will maximize the performance overhead while providing no information of value. This could turn into a lot of work and a lot of application deployments.
It may actually be quicker and less work to identify bottlenecks/gateways into specific subsystems and slap a counter on each of those (e.g. perf4j or even a full blown Nagios). Queries are another good place to slap a counter on. If you suspect some part of the application is rarely used, put a few counters there and see what happens.
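As an illustration of the counter idea, here is a hedged sketch using Perf4J's stop watch; ReportingGateway and renderReport are hypothetical names standing in for one of those subsystem gateways.

import org.perf4j.LoggingStopWatch;
import org.perf4j.StopWatch;

public class ReportingGateway {

    // Hypothetical gateway method wrapping an existing subsystem call.
    public String generateMonthlyReport(int year, int month) {
        // Each call emits a timing log entry tagged "reporting.monthly";
        // aggregating those entries shows how often (and how slowly) the
        // subsystem is actually hit in production.
        StopWatch watch = new LoggingStopWatch("reporting.monthly");
        try {
            return renderReport(year, month);
        } finally {
            watch.stop();
        }
    }

    private String renderReport(int year, int month) {
        // Placeholder for the real report-generation code.
        return "report-" + year + "-" + month;
    }
}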

Automatic Runtime Performance Regression Test in Java

I'm looking for ways to detect changes in runtime performance of my code in an automatic way. This would act in a similar way that JUnit does, but instead of testing the code's functionality it would test for sudden changes in speed. As far as I know there are no tools right now to do this automatically.
So the first question is: Are there any tools available which will do this?
Then the second questions is: If there are no tools available and I need to roll my own, what are the issues that need to be addressed?
If the second question is relevant, then here are the issues that I see:
Variability depending on the environment it is run on.
How to detect changes, given that microbenchmarks in Java have a large variance.
If Caliper collects the results, how to get them out of Caliper so that they can be saved in a custom format. Caliper's documentation is lacking.
I have just come across http://scalameter.github.io/ which looks appropriate; it works with Scala and Java.
Take a look at Caliper CI; I put out version 2.0 yesterday as a Jenkins plugin.
I don't know of any separate tools to handle this, but JUnit has an optional timeout parameter in the @Test annotation:
The second optional parameter, timeout, causes a test to fail if it takes longer than a specified amount of clock time (measured in milliseconds). The following test fails:
@Test(timeout = 100)
public void infinity() {
    while (true);
}
So, you could write additional unit-tests to check that certain parts work "fast enough". Of course, you'd need to somehow first decide what is the maximum amount of time a particular task should take to run.
If the second question is relevant, then here are the issues that I see:
Variability depending on the environment it is run on.
There will always be some variability, but to minimize it, I'd use Hudson or similar automated building & testing server to run the tests, so the environment would be the same each time (of course, if the server running Hudson also does all other sorts of tasks, these other tasks still could affect the results). You'd need to take this into account when deciding the maximum running time for tests (leave some "head room", so if the test takes, say, 5% more to run than usually, it still wouldn't fail straight away).
How to detect changes, given that microbenchmarks in Java have a large variance.
Microbenchmarks in Java are rarely reliable. I'd test larger chunks with integration tests (such as handling a single HTTP request, or whatever you have) and measure the total time. If the test fails by taking too much time, isolate the problematic code by profiling, or measure and log the running time of the separate parts of the test during the test run to see which part takes the most time (a rough sketch of that follows).
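Here is a sketch of that idea with plain JUnit and System.nanoTime(); the step methods and the 500 ms budget are hypothetical placeholders.

import static org.junit.Assert.assertTrue;

import java.util.concurrent.TimeUnit;

import org.junit.Test;

public class RequestHandlingPerfTest {

    // Hypothetical time budget for handling one request end to end.
    private static final long MAX_TOTAL_MILLIS = 500;

    @Test
    public void handlesRequestWithinBudget() {
        long t0 = System.nanoTime();
        parseRequest();      // hypothetical step 1
        long t1 = System.nanoTime();
        processRequest();    // hypothetical step 2
        long t2 = System.nanoTime();

        long parseMillis = TimeUnit.NANOSECONDS.toMillis(t1 - t0);
        long processMillis = TimeUnit.NANOSECONDS.toMillis(t2 - t1);
        long totalMillis = TimeUnit.NANOSECONDS.toMillis(t2 - t0);

        // Log per-step timings so a failure points at the slow part.
        System.out.println("parse=" + parseMillis + "ms, process=" + processMillis
                + "ms, total=" + totalMillis + "ms");
        assertTrue("Request handling took " + totalMillis + "ms", totalMillis <= MAX_TOTAL_MILLIS);
    }

    private void parseRequest() { /* placeholder for real parsing code */ }
    private void processRequest() { /* placeholder for real processing code */ }
}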
If Caliper collects the results, how to get them out of Caliper so that they can be saved in a custom format. Caliper's documentation is lacking.
Unfortunately, I don't know anything about Caliper.

Bringing unit testing to an existing project

I'm working on an existing Java EE project with various maven modules that are developed in Eclipse, bundled together and deployed on JBoss using Java 1.6. I have the opportunity to prepare any framework and document how unit testing should be brought to the project.
Can you offer any advice on...
JUnit is where I expect to start; is this still the de facto choice for Java developers?
Any mocking frameworks worth setting as standard? JMock?
Any rules that should be set - code coverage, or making sure it's unit rather than integration tests.
Any tools to generate fancy looking outputs for Project Managers to fawn over?
Anything else? Thanks in advance.
Any tools to generate fancy looking outputs for Project Managers to fawn over?
Be careful. A fancy tool for displaying metrics on unit test counts, coverage, code quality metrics, line counts, check-in counts and so on can be dangerous in the hands of some project managers. A project manager (who is not in touch with the realities of software development) can get obsessed with the metrics, and fail to realize that:
they don't give the real picture of the project's health and progress, and
they can give a completely false picture of the performance of individual team members.
You can get silly situations where a manager gives the developers the message that they should (for example) try to achieve maximal unit test coverage for code where this is simply not warranted. Time is spent on pointless work, the important work doesn't get done, and deadlines are missed.
Any rules that should be set - code coverage, or making sure it's unit rather than integration tests.
Code coverage is more important for parts of the code that are likely to be fragile / buggy. Acceptable coverage levels should reflect this.
Unit tests versus integration tests depends on the nature and complexity of the system you are building.
Adding lots of unit-level tests after the fact is probably a waste of time. It should only be done for classes identified as being problematic / needing maintenance work.
Adding integration-level tests after the fact is useful, especially if the project's original developers are no longer around. A decent integration test suite helps increase your confidence that a change does not break important system functionality. But this needs to be done judiciously. A test suite that tests the N-th degree of a website's look and feel can be a nightmare to maintain ... and an impediment to progress.
Concerning the unit testing framework, there are mainly two of them: JUnit and TestNG. Both have their advantages, and both perform equally well. The main advantage of JUnit is (to my mind) the Eclipse plugin it comes with by default, which makes it easy to run tests.
Concerning mocking frameworks, I don't find them to be a required part of your testing approach. Of course they're useful, but they serve a specific purpose: testing behaviour (as opposed to testing through an interface, which is what JUnit allows). With a mocking framework, you can test how a specific class interacts with a specific interface. Will you need it? Obviously. Will you need it first? I don't know.
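To make the behaviour-versus-result distinction concrete, here is a hedged sketch using Mockito (mentioned in another answer below); AuditLog and OrderService are hypothetical.

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.Test;

public class OrderServiceTest {

    // Hypothetical collaborator and class under test.
    interface AuditLog { void record(String event); }

    static class OrderService {
        private final AuditLog audit;
        OrderService(AuditLog audit) { this.audit = audit; }
        void placeOrder(String id) { audit.record("order-placed:" + id); }
    }

    @Test
    public void placingAnOrderIsAudited() {
        AuditLog audit = mock(AuditLog.class);   // behaviour-verifying test double
        new OrderService(audit).placeOrder("42");

        // Verify *how* the collaborator was used, not just a returned value.
        verify(audit).record("order-placed:42");
    }
}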
Concerning rules, the only one I've found useful is simple (as always): "always test code that broke at least once." Consider your bug tracker: each time a bug is encountered, add a unit test ensuring there is no regression. To my mind, it's the fastest way to get quality code.
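As a hedged illustration of that rule, such a regression test might be named after the bug-tracker entry it pins down; BUG-1234 and Invoice are hypothetical.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class InvoiceRegressionTest {

    // Minimal stand-in for the real production class.
    static class Invoice {
        int total() { return 0; }
    }

    // BUG-1234 (hypothetical) reported that an empty invoice produced a
    // negative total; this test pins the fixed behaviour in place.
    @Test
    public void bug1234_emptyInvoiceHasZeroTotal() {
        assertEquals(0, new Invoice().total());
    }
}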
Concerning fancy (and useful) output, I cannot recommend enough that you install a continuous integration server (Hudson, obviously). It will run your whole test suite each time code is committed, to ensure there are no side effects. It will generate graphs showing the number of tests run, and so on. It can also integrate code coverage tools and graphs. This continuous integration server will quickly become your testing buddy.
This is a complex question, so just a few notes about our practice at $work:
JUnit is indeed still the standard. Most documentation and literature treats JUnit.
Mockito seems to be the new star in Java mocking, although we still use JMock and think it's fine for our needs.
We use the EclEmma Eclipse plugin for checking our test coverage, and like it.
If you haven't done so already, read Working Effectively with Legacy Code by Michael Feathers.
I've been retrofitting unit tests to a C++ project and it is not pleasant.
The first thing I did was to identify where most of the 'action' occurs. Then I used that to start putting unit tests on the functions that can be tested easily.
Once you have the easier ones, you can start looking at expanding the coverage virally - attack the functions that have fewer dependencies, run through them a few times in a debugger to see what values are passed in, and then write unit tests with those values to make sure you don't break anything.
Don't expect a quick fix - it's taken 3 weeks (6-hour days, 5 days a week) to get 20% coverage, but the application spends 80% of its time in that code, so I think it has been time well spent, and it has uncovered quite a few bugs.
Regarding test coverage, I think that when you're bringing in unit testing to an existing project it's too early to start setting coverage expectations. You should start by ensuring that you actually can integrate the test framework and get reports from the coverage tools. Once you've done that you can start monitoring coverage, and then you can consider targets.

How to measure robustness?

I am working on a thesis about measuring the quality of a product. The product in this case is a website. I have identified several quality attributes and measurement techniques.
One quality attribute is "Robustness". I want to measure it somehow, but I can't find any useful information on how to do this in an objective manner.
Is there any static or dynamic metric that could measure robustness? I.e., like unit test coverage, is there a way to measure robustness like that? If so, is there any (free) tool that can do such a thing?
Does anyone have any experience with such tooling?
Last but not least, perhaps there are other ways to determine robustness, if you have any ideas about that I am all ears.
Thanks a lot in advance.
Well, the short answer is "no." Robust can mean a lot of things, but the best definition I can come up with is "performing correctly in every situation." If you send a bad HTTP header to a robust web server, it shouldn't crash. It should return exactly the right kind of error, and it should log the event somewhere, perhaps in a configurable way. If a robust web server runs for a very long time, its memory footprint should stay the same.
A lot of what makes a system robust is its handling of edge cases. Good unit tests are a part of that, but it's quite likely that there will not be unit tests for any of the problems that a system has (if those problems were known, the developers probably would have fixed them and only then added a test).
Unfortunately, it's nearly impossible to measure the robustness of an arbitrary program because in order to do that you need to know what that program is supposed to do. If you had a specification, you could write a huge number of tests and then run them against any client as a test. For example, look at the Acid2 browser test. It carefully measures how well any given web browser complies with a standard in an easy, repeatable fashion. That's about as close as you can get, and people have pointed out many flaws with such an approach (for instance, is a program that crashes more often but does one extra thing according to spec more robust?)
There are, though, various checks that you could use as a rough, numerical estimate of the health of a system. Unit test coverage is a pretty standard one, as are its siblings: branch coverage, function coverage, statement coverage, etc. Another good choice is "lint" programs like FindBugs. These can indicate the potential for problems. Open source projects are often judged by how frequently and recently commits are made and releases published. If a project has a bug system, you can measure how many bugs have been fixed and the percentage. If there's a specific instance of the program you're measuring, especially one with a lot of activity, MTBF (Mean Time Between Failures) is a good measure of robustness (see Philip's answer).
These measurements, though, don't really tell you how robust a program is. They're merely ways to guess at it. If it were easy to figure out if a program was robust, we'd probably just make the compiler check for it.
Good luck with your thesis! I hope you come up with some cool new measurements!
You could look into mean time between failures as a robustness measure. The problem is that it is a theoretical quantity which is difficult to measure, particularly before you have deployed your product to a real-world situation with real-world loads. Part of the reason for this is that testing often does not cover real-world scalability issues.
In our fuzzing book (by Takanen, DeMott, Miller) we have several chapters dedicated to metrics and coverage in negative testing (robustness, reliability, grammar testing, fuzzing - many names for the same thing). I also tried to summarize the most important aspects in our company whitepaper here:
http://www.codenomicon.com/products/coverage.shtml
Snippet from there:
Coverage can be seen as the sum of two features, precision and accuracy. Precision is concerned with protocol coverage. The precision of testing is determined by how well the tests cover the different protocol messages, message structures, tags and data definitions. Accuracy, on the other hand, measures how accurately the tests can find bugs within different protocol areas. Therefore, accuracy can be regarded as a form of anomaly coverage. However, precision and accuracy are fairly abstract terms, thus, we will need to look at more specific metrics for evaluating coverage.
The first coverage analysis aspect is related to the attack surface. Test requirement analysis always starts off by identifying the interfaces that need testing. The number of different interfaces and the protocols they implement in various layers set the requirements for the fuzzers. Each protocol, file format, or API might require its own type of fuzzer, depending on the security requirements.
The second coverage metric is related to the specification that a fuzzer supports. This type of metric is easy to use with model-based fuzzers, as the basis of the tool is formed by the specifications used to create the fuzzer, and therefore they are easy to list. A model-based fuzzer should cover the entire specification. Mutation-based fuzzers, on the other hand, do not necessarily fully cover the specification, as implementing or including one message exchange sample from a specification does not guarantee that the entire specification is covered. Typically, when a mutation-based fuzzer claims specification support, it means it is interoperable with test targets implementing the specification.
Especially regarding protocol fuzzing, the third most critical metric is the level of statefulness of the selected fuzzing approach. An entirely random fuzzer will typically only test the first messages in complex stateful protocols. The more state-aware the fuzzing approach you are using is, the deeper the fuzzer can go in complex protocol exchanges. Statefulness is a difficult requirement to define for fuzzing tools, as it is more a metric for the quality of the protocol model being used, and can thus only be verified by running the tests.
I hope this was helpful. We also have studies on other metrics, such as looking at code coverage and other more or less useless data. ;) Metrics is a great topic for a thesis. Email me at ari.takanen#codenomicon.com if you are interested in getting access to our extensive research on this topic.
Robustness is very subjective, but you could have a look at FindBugs, Cobertura and Hudson, which, when correctly combined, could give you a growing sense of confidence over time that the software is robust.
You could look into mean time between failures as a robustness measure.
The problem with "MTBF" is that it is usually measured in positive traffic whereas failures often happen in unexpected situations. It does not give any indication of robustness or reliability. No matter if a web site stays always on in lab environment, it will still be hacked in a second in the Internet if it has a weakness.

Unit testing real-time / concurrent software [duplicate]

Possible Duplicate:
How should I unit test threaded code?
Classical unit testing is basically just putting x in and expecting y out, and automating that process. So it's good for testing anything that doesn't involve time. But then, most of the nontrivial bugs I've come across have had something to do with timing. Threads corrupt each other's data, or cause deadlocks. Nondeterministic behavior happens – in one run out of a million. Hard stuff.
Is there anything useful out there for "unit testing" parts of multithreaded, concurrent systems? How do such tests work? Isn't it necessary to run the subject of such test for a long time and vary the environment in some clever manner, to become reasonably confident that it works correctly?
Most of the work I do these days involves multi-threaded and/or distributed systems. The majority of bugs involve "happens-before" type errors, where the developer assumes (wrongly) that event A will always happen before event B. But every 1000000th time the program is run, event B happens first, and this causes unpredictable behavior.
Additionally, there aren't really any good tools to detect timing issues, or even data corruption caused by race conditions. Tools like Helgrind and drd from the Valgrind toolkit work great for trivial programs, but they are not very useful in diagnosing large, complex systems. For one thing, they report false positives quite frequently (Helgrind especially). For another thing, it's difficult to actually detect certain errors while running under Helgrind/drd simply because programs running under Helgrind run almost 1000x slower, and you often need to run a program for quite a long time to even reproduce the race condition. Additionally, since running under Helgrind totally changes the timing of the program, it may become impossible to reproduce a certain timing issue. That's the problem with subtle timing issues; they're almost Heisenbergian in the sense that altering a program to detect timing issues may obscure the original issue.
The sad fact is, the human race still isn't adequately prepared to deal with complex, concurrent software. So unfortunately, there's no easy way to unit-test it. For distributed systems especially, you should plan your program carefully using Lamport's happens-before diagrams to help you identify the necessary order of events in your program. But ultimately, you can't really get away from brute-force unit testing with randomly varying inputs. It also helps to vary the frequency of thread context-switching during your unit-test by, e.g. running another background process which just takes up CPU cycles. Also, if you have access to a cluster, you can run multiple unit-tests in parallel, which can detect bugs much quicker and save you a lot of time.
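A minimal sketch of that brute-force style in Java, assuming JUnit and the java.util.concurrent primitives; the AtomicInteger here stands in for whatever shared component you are actually testing.

import static org.junit.Assert.assertEquals;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;

public class CounterStressTest {

    // Hypothetical shared component under test.
    private final AtomicInteger counter = new AtomicInteger();

    @Test
    public void concurrentIncrementsAreNotLost() throws Exception {
        final int threads = 8;
        final int incrementsPerThread = 100000;
        final CountDownLatch startGate = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        // All workers start together to maximise contention.
                        startGate.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    for (int j = 0; j < incrementsPerThread; j++) {
                        counter.incrementAndGet();
                    }
                }
            });
        }

        startGate.countDown();
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // If increments were lost to a race, the total would come up short.
        assertEquals(threads * incrementsPerThread, counter.get());
    }
}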
If you can run your tests under Linux, valgrind includes a tool called helgrind which purports to detect race conditions and potential deadlocks in programs that use pthreads; you might get some benefit from running your multithreaded code under that, since it will report potential errors even if they didn't actually occur in that particular test run.
I have never heard of anything that can.
I guess if someone were to design one, it would have to have exact control over the execution of the threads and execute all possible interleavings of thread steps.
That sounds like a major task, not to mention the combinatorial explosion of interleavings for non-trivial threads once there are a handful or more of them...
Although, a quick search of stackoverflow... Unit testing a multithreaded application?
If the tested system is simple enough, you can control the concurrency quite well by blocking operations in external mock systems. This blocking can be done, for example, by waiting for some other operation to be started. If you can control all external calls, this can work quite well by implementing different blocking sequences (a hedged sketch follows). I have tried this, and it does reveal lock-level bugs quite well if you know the possible problematic sequences. Compared to many other kinds of concurrency testing it is quite deterministic. However, this approach doesn't detect low-level race conditions very well. I usually just go for load testing to find those, but I guess that isn't exactly unit testing.
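One way such a blocking mock could look, built on CountDownLatch; PaymentGateway is a hypothetical collaborator interface of the code under test, and the test would hold concurrent callers inside charge() until it decides to release them.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical collaborator interface used by the code under test.
interface PaymentGateway {
    void charge(String orderId);
}

// Mock gateway that blocks inside charge() until the test releases it,
// letting the test force a specific interleaving of concurrent calls.
class BlockingPaymentGateway implements PaymentGateway {
    private final CountDownLatch entered = new CountDownLatch(1);
    private final CountDownLatch release = new CountDownLatch(1);

    public void charge(String orderId) {
        entered.countDown();                     // signal: a thread is now inside charge()
        try {
            release.await(10, TimeUnit.SECONDS); // hold the thread here until released
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void awaitEntered() throws InterruptedException {
        entered.await(10, TimeUnit.SECONDS);     // test: wait until a caller is blocked
    }

    void releaseBlockedCallers() {
        release.countDown();                     // test: let blocked callers continue
    }
}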
I have seen such concurrency testing frameworks for .NET; I'd assume it's only a matter of time before someone writes one for Java (hopefully).
And don't forget good old code reading. One of the best ways to find concurrency bugs is to read through the code once again, giving it your full concentration.
Perhaps the answer is that you shouldn't. In concurrent systems, there may not always be a single deterministic answer that is correct.
Take the example of people boarding a train and choosing a seat. You are going to end up with different results every time.
Awaitility is a useful framework when you need to deal with asynchronicity in your tests. It allows you to wait until some state somewhere in your system is updated. For example:
await().untilCall(to(myService).myMethod(), equalTo(3));
or
await().until(fieldIn(myObject).ofType(int.class), greaterThan(1));
It also has Scala and Groovy support.
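A slightly fuller, hedged sketch of Awaitility in a JUnit test; the worker thread and counter are hypothetical, and the package name may be com.jayway.awaitility in older releases.

import static java.util.concurrent.TimeUnit.SECONDS;
import static org.awaitility.Awaitility.await;   // older releases: com.jayway.awaitility.Awaitility

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;

public class AsyncWorkerTest {

    // Hypothetical asynchronous component: a worker thread bumps a counter.
    private final AtomicInteger processedCount = new AtomicInteger();

    @Test
    public void workerEventuallyProcessesThreeMessages() {
        new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 3; i++) {
                    processedCount.incrementAndGet();
                }
            }
        }).start();

        // Poll until the condition holds, failing the test after 5 seconds.
        await().atMost(5, SECONDS).until(new Callable<Boolean>() {
            public Boolean call() {
                return processedCount.get() == 3;
            }
        });
    }
}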
