Our team is responsible for a large codebase containing legal rules.
The codebase works mostly like this:
class SNR_15_UNR extends Rule {
public double getValue(RuleContext context) {
double snr_15_ABK = context.getValue(SNR_15_ABK.class);
double UNR = context.getValue(GLOBAL_UNR.class);
if(UNR <= 0) // if UNR value would reduce snr, apply the reduction
return snr_15_ABK + UNR;
return snr_15_ABK;
}
}
When context.getValue(Class<? extends Rule>) is called, it just evaluates the specific rule and returns the result. This allows you to create a dependency graph while a rule is evaluating, and also to detect cyclic dependencies.
There are about 500 rule classes like this. We now want to implement tests to verify the correctness of these rules.
Our goal is to implement a testing list as follows:
TEST org.project.rules.SNR_15_UNR
INPUT org.project.rules.SNR_15_ABK = 50
INPUT org.project.rules.UNR = 15
OUTPUT SHOULD BE 50
TEST org.project.rules.SNR_15_UNR
INPUT org.project.rules.SNR_15_ABK = 50
INPUT org.project.rules.UNR = -15
OUTPUT SHOULD BE 35
Question is: how many test scenario's are needed? Is it possible to use static code analysis to detect how many unique code paths exist throughout the code? Does any such tool exist, or do I have to start mucking about with Eclipse JDT?
For clarity: I am not looking for code coverage tools. These tell me which code has been executed and which code was not. I want to estimate the development effort required to implement unit tests.
(EDIT 2/25, focused on test-coding effort):
You have 500 sub-classes, and each appears (based on your example with one conditional) to have 2 cases. I'd guess you need 500*2 tests.
If your code is not a regular as you imply, a conventional (branch) code coverage tool might not be the answer you think you want as starting place, but it might actually help you make an estimate. Code T<50 tests across randomly chosen classes, and collect code coverage data P (as a percentage) over whatever part of the code base you think needs testing (particularly your classes). Then you need roughly (1-P)*100*T tests.
If your extended classes are all as regular as you imply, you might consider generating them. If you trust the generation process, you might be able avoid writing the tests.
(ORIGINAL RESPONSE, focused on path coverage tools)
Most code coverage tools are "line" or "branch" coverage tools; they do not count unique paths through the code. At best they count basic blocks.
Path coverage tools do exist; people have built them for research demos, but commercial versions are relatively rare. You can find one at http://testingfaqs.org/t-eval.html#TCATPATH. I don't think this one handles Java.
One of the issues is that the apparent paths through code is generally exponential in the number of decisions since each encountered decision generates a True path and a False path based on the outcome of the conditional (1 decision --> 2 paths, 2 decisions --> 4 paths, ...). Worse loops are in effect a decision repeated as many times as the loop iterates; a loop that repeats a 100 times in effect has 2**100 paths. To control this problem, the more interesting path coverage tools try to determine the feasibility of a path: if the symbolically combined predicates from the conditionals in a prefix of that path are effectively false, the path is infeasible and can be ignored, since it can't really occur. Another standard trick is treat loops as 0, 1, and N iterations to reduce the number of apparent paths. Managing the number of paths requires rather a lot of machinery, considerably above what most branch-coverage test tools need, which helps explain why real path coverage tools are rare.
how many test scenario's are needed?
Many. 500 might be a good start.
Is it possible to use static code analysis to detect how many unique code paths exist throughout the code?
Yes. It called a code coverage tool. Here are some free ones. http://www.java-sources.com/open-source/code-coverage
Related
I was working on an assignment today that basically asked us to write a Java program that checks if HTML syntax is valid in a text file. Pretty simple assignment, I did it very quickly, but in doing it so quickly I made it very convoluted (lots of loops and if statements). I know I can make it a lot simpler, and I will before turning it in, but Amid my procrastination, I started downloading plugins and seeing what information they could give me.
I downloaded two in particular that I'm curious about - CodeMetrics and MetricsReloaded. I was wondering what exactly these numbers that it generates correlate to. I saw one post that was semi-similar, and I read it as well as the linked articles, but I'm still having some trouble understanding a couple of things. Namely, what the first two columns (CogC and ev(G)), as well as some more clarification on the other two (iv(G) and v(G)), mean.
MetricsReloaded Method Metrics:
MetricsReloaded Class Metrics:
These previous numbers are from MetricsReloaded, but this other application, CodeMetrics, which also calculates cyclomatic complexity gives slightly different numbers. I was wondering how these numbers correlate and if someone could just give a brief general explanation of all this.
CodeMetrics Analysis Results:
My final question is about time complexity. My understanding of Cyclomatic complexity is that it is the number of possible paths of execution and that it is determined by the number of conditionals and how they are nested. It doesn't seem like it would, but does this correlate in any way to time complexity? And if so, is there a conversion between them that can be easily done? If not, is there a way in either of these plug-ins (or any other in IntelliJ) that can automate time complexity calculations?
I am using 10 folds cross validations technique to train 200K records. The target class index is like
Status {PASS,FAIL}
Pass has ~144K and Fail has ~6K instances.
while training the model using J48. Its not able to find the failures. The accuracy is 95% but most the cases its predicting just success. where as in our case, we need to find the failure which are actually happening.
So my question is mainly hypothetical analysis.
Does it really matter the distribution among class instances during training(in my case PASS,FAIL).
What could be possible values in weka J48 tree to train better as i see 2% failure in every 1000 records i pass. So, there will be increase in success if we increase the Success scenarios.
What should be the ratio among them in order to better train them.
There is nothing i could find in the API as far as ratio is concerned.
I am not adding the code because this is happening both with Java API as well as using weka GUI tool.
Many Thanks.
The problem here is that your dataset is very unbalanced. You do have a few options on how to help your classification task:
Generate synthetic instances for your minority class using an algorithm like SMOTE. This should increase your performance.
It's not possible in every case, but you could maybe try splitting your majority class into a couple of smaller classes. This would help the balance.
I believe Weka has a One Class Classifier. This allows to see decision boundary of the larger class and considers the minority class as an outlier allowing for hopefully better classifications. See here for Weka's implementation.
Edit:
You could also use a classifier that will weight classifications based on whether they are correct or not. Again, Weka has this as a meta classifier that can be applied to most base classifiers, see here again.
Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. as a contrast, the following text talks about a process instead of a thing/product -> "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, manually doing it is not feasible. So far, with the help of using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with python, NLTK and Wordnet (interface already available), you might be able to use synset hypernyms for your problem.
This task is called named entity reconition problem.
EDIT: There is no clean definition of NER in NLP community, so one can say this is not NER task, but instance of more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Standford NLP can only recognize following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that possible can do the job, they can be readily found by googling "product name named entity recognition", some of them offer free trial plans. I don't know any free ready to deploy solution.
Of course, you can create you own model by hand-annotating about 1000 or so product name containing sentences and training some classifier like Conditional Random Field classifier with some basic features (here is documentation page that explains how to that with stanford NLP). This solution should work reasonable well, while it won't be perfect of course (no system will be perfect but some solutions are better then others).
EDIT: This is complex task per se, but not that complex unless you want state-of-the art results. You can create reasonable good model in just 2-3 days. Here is (example) step-by-step instruction how to do this using open source tool:
Download CRF++ and look at provided examples, they are in a simple text format
Annotate you data in a similar manner
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Spilt you annotated data into two files train (80%) and dev(20%)
use following baseline template features (paste in template file)
U02:%x[0,0]
U01:%x[-1,0]
U01:%x[-2,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
4.Run
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt. one column will contain your hand-labeled data and other - machine predicted labels. You can then compare these, compute accuracy etc. After that you can feed new unlabeled data into crf_test and get your labels.
As I said, this won't be perfect, but I will be very surprised if that won't be reasonable good (I actually solved very similar task not long ago) and certanly better just using few keywords/templates
ENDNOTE: this ignores many things and some best-practices in solving such tasks, won't be good for academic research, not 100% guaranteed to work, but still useful for this and many similar problems as relatively quick solution.
I see that Cobertura has a <cobertura:check> task that can be used to enforce coverage at build-time (if coverage metrics dip below a certain value, the build fails). The website shows examples with several different attributes that are available, but doesn't really give a description as to what they are or what they do:
branchrate
linerate
totalbranchrate
etc.
Also, what are the standard values for each of these attributes? I'm sure it will differ between projects, but there has to be some way for an organization to gauge what is acceptable and what isn't, and I'm wondering how to even arrive at that. Thanks in advance.
Perhaps the documentation has changed since you asked the question, because I think your answer is right there now.
At the time that I'm writing this, the answers to your specific questions are:
branchrate
Specify the minimum acceptable branch coverage rate needed by each class. This should be an integer value between 0 and 100.
linerate
Specify the minimum acceptable line coverage rate needed by each class. This should be an integer value between 0 and 100.
totalbranchrate
Specify the minimum acceptable average branch coverage rate needed by the project as a whole. This should be an integer value between 0 and 100.
If you do not specify branchrate, linerate, totalbranchrate or totallinerate, then Cobertura will use 50% for all of these values.
A bit of googling shows that most people agree that a "good" coverage number is somewhere from 75% - 95%. I use %85 for new projects. However, I think the metric that is the most useful in gauging whether you have enough test coverage is how comfortable your developers are in making and releasing changes to the code (assuming you have responsible developers who care about introducing bugs). Remember, you can have 100% test coverage without a single assert in any test!
For legacy projects things are usually more complicated. It's rare that you can get time to just focus on coverage alone, so most of the time you find out what your code coverage is, and then try to improve it over time. My dream cobertura-check task would check if the coverage on any given line/method/class/package/project is the same as or better than the last build, and have separate thresholds for any code that is "new in this build." Maybe Sonar has something like that...
I've encountered JBehave recently and I think we should use it. So I have called in the tester of our team and he also thinks that this should be used.
With that as starting point I have asked the tester to write stories for a test application (the Bowling Game Kata of Uncle Bob). At the end of the day we would try to map his tests against the bowling game.
I was expecting a test like this:
Given a bowling game
When player rolls 5
And player rolls 4
Then total pins knocked down is 9
Instead, the tester came with 'logical tests', in other words he was not being that specific. But, in his terms this was a valid test.
Given a bowling game
When player does a regular throw
Then score should be calculated appropriately
My problem with this is ambiguity, what is a 'regular throw'? What is 'appropriately'? What will it mean when one of those steps fail?
However, the tester says that a human does understand and that what I was looking for where 'physical tests', which where more cumbersome to write.
I could probably map 'regular' with rolling two times 4 (still no spare, nor strike), but it feels like I am again doing a translation I don't want to make.
So I wonder, how do you approach this? How do you write your JBehave tests? And do you have any experience when it is not you who writes these tests, and you have to map them to your code?
His test is valid, but requires a certain knowledge of the domain, which no framework will have. Automated tests should be explicit, think of them as examples. Writing them costs more than writing "logical tests", but this pays in the long run since they can be replayed at will, very quickly, and give an immediate feedback.
You should have paired with him writing the first tests, to put it in the right direction. Perhaps you could give him your test, and ask him to increase the coverage by adding new tests.
The amount of explicitness needed in acceptance criteria depends on level of trust between the development team and the business stakeholders.
In your example, the business is assuming that the developers/testers understand enough about bowling to determine the correct outcome.
But imagine a more complex domain, like finance. For that, it would probably be better to have more explicit examples to ensure a good understanding of the requirement.
Alternatively, let's say you have a scenario:
Given I try to sign up with an invalid email address
Then I should not be registered
For this, a developer/tester probably has better knowledge of what constitutes a valid or invalid email address than the business stakeholder does. You would still want to test against a variety of addresses, but that can be specified within the step definitions, rather than exposing it at the scenario level.
I hate such vague words as "appropriately" in the "expected values". The "appropriately" is just an example of "toxic word" for the testing, and if not eliminated, this "approach" can get widespread, effectively killing the testing in general. It might "be enough" for human tester, but such "test cases" are acceptable only at first attempts to exploratory "smoke test".
Whatever reproducible, systematical and automatable, every test case must be specific. (not just "should".. to assume the softness of "would" could be allowed? Instead I use the present tense "shall be" or even better strict "is", as a claim to confirm/refuse.) And this rule is absolute once it comes to automation.
What your tester made, was rather a "test-area", a "scenario template", instead of a real test-case: Because so many possible test-results can be produced...
You were specific, in your scenario: That was a very specific real "test case". It is possible to automate your test case, nice: You can delegate it on a machine and evaluate it as often as you need, automatically. (with the bonus of automated report, from an Continuous Integration server)
But the "empty test scenario template"? It has some value too: It is a "scenario template", an empty skeleton prepared to be filled by data: So I love to name these situations "DDT": "Data Driven Testing".
Imagine a web-form to be tested, with validations on its 10 inputs, with cross-validations... And the submit button. There can be 10 test-cases for every single input:
empty;
with a char, but still too short anyway;
too long for the server, but allowed within the form for copy-paste and further edits;
with invalid chars...
The approach I recommend is to prepare a set of to-pass data: even to generate them (from DB or even randomly), whatever you can predict shall pass the test, the "happy scenario". Keep the data aside, as a data-template, and use it to initialize the form, to fill it up, and then to brake-down some single value: Create test cases "to fail". Do it i.e. 10 times for every single input, for each of the 10 inputs (100 tests-cases even before cross-rules attempted) ... and then, after the 100 times of the refusing of the form by the server, fill up the form by the to-pass data, without distorting them, so the form can be accepted finally. (accepted submit changes status on the server-app, so needs to go as the last one, to test all the 101 cases on the same app-state)
To do your test this way, you need two things:
the empty scenario template,
and a table of 100 rows of data:
10 columns of input data: with only one value manipulated, as passing row by row down the table (i.e. ever heard about grey-code?),
possibly keeping the inheritance history in a row-description, where from is the row derived and how, via which manipulated value.
Also the 11th column, the "expected result" column(s) filled: to pass/fail expected status, expected err/validation message, reference to the requirements, for the test-coveradge tracking. (i.e. ever seen FitNesse?)
And possibly also the column for the real detected result, when test performed, to track history of the single row-test-case. (so the CI server mentioned already)
To combine the "empty scenario skeleton" on one side and the "data-table to drive the test" on the other side, some mechanism is needed, indeed. And your data need to be imported. So you can prepare the rows in excel, which could be theoretically imported too, but for the easier life I recommend either CSV, properties, XML, or just any machine&human readable format, textual format.
His 'logical test' has the same information content as the phrase 'test regular bowling score' in a test plan or TODO list. But it is considerably longer, therefor worse.
Using jbehave at all only makes sense in the case the test team are responsible for generating tests with more information in them than that. Otherwise, it would be more efficient to take the TODO list and code it up in JUnit.
And I love words like "appropriately" in the "expected values". You need to use cucumber or other wrappers as the generic documentation. If you're using it to cover and specify all possible scenarios you're probably wasting a lot of your time scrolling through hundred of feature files.