Calculating Test Strength from the XML output, what does 'detected' represent? - java

I am attempting to calculate the Test Strength figure shown in PIT's HTML report from the data in the XML output; however, the figure I calculate doesn't align with the Test Strength in the HTML report.
I use the equation mutationsKilled / (totalMutations - mutationsNotCovered) * 100, where:
mutationsKilled is the number of entries where status = "KILLED"
totalMutations is the total number of entries
mutationsNotCovered is the number of entries where detected = "false"
However, it's the last of these that I suspect is the problem: looking back at the raw XML output, the detected attribute doesn't necessarily seem to indicate whether the mutation was covered or not.
Is there a tried and tested way of doing this? It may also be worth mentioning that I'm running PIT 1.9.4 with incremental analysis enabled, in case that has any bearing on the issue.

After experimenting and intentionally reducing my test coverage to produce uncovered mutants, I found that there is a possible status, NO_COVERAGE. Adjusting the above equation so that mutationsNotCovered counts the entries where status = "NO_COVERAGE" produces the correct statistics.
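For reference, here is a minimal Java sketch of that adjusted calculation, assuming the report sits at the default target/pit-reports/mutations.xml and contains the usual mutation elements with a status attribute (depending on your configuration, statuses such as TIMED_OUT may also need to be treated as killed):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TestStrength {
    public static void main(String[] args) throws Exception {
        // adjust the path to wherever your build writes the PIT XML report
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("target/pit-reports/mutations.xml"));

        NodeList mutations = doc.getElementsByTagName("mutation");
        int total = mutations.getLength();
        int killed = 0;
        int noCoverage = 0;

        for (int i = 0; i < total; i++) {
            String status = ((Element) mutations.item(i)).getAttribute("status");
            if ("KILLED".equals(status)) {
                killed++;
            } else if ("NO_COVERAGE".equals(status)) {
                noCoverage++;
            }
        }

        // test strength = killed / (total - uncovered), expressed as a percentage
        double testStrength = 100.0 * killed / (total - noCoverage);
        System.out.printf("Test strength: %.2f%%%n", testStrength);
    }
}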


Duplication Criteria in Sonar

I have followed the link below, which is for JavaScript:
Sonarqube: Is it possible to adapt duplication metric for javascript code?
I have done the same for my Java project.
According to it, if we wish to change the duplication criterion (by default 10 lines), we have to add one line to the sonar.properties file stored in the project:
sonar.projectKey=Test
sonar.projectName=Test
sonar.projectVersion=1.0
sonar.sources=src
sonar.language=java
sonar.sourceEncoding=UTF-8
sonar.cpd.java.minimumLines=5
But it's not working for Java. Is there anything else I need to configure?
Per SonarQube's Duplications documentation:
A piece of code is considered duplicated as soon as there are at least 100 successive and duplicated tokens (can be overridden with property sonar.cpd.${language}.minimumTokens) spread on at least 10 lines of code (can be overridden with property sonar.cpd.${language}.minimumLines). For Java projects, the duplication detection mechanism behaves slightly differently. A piece of code is considered as duplicated as soon as there is the same sequence of 10 successive statements whatever the number of tokens and lines. This threshold cannot be overridden.

How to predict a continuous value (time) from text documents? [closed]

I have about 3000 text documents which are related to a duration of time when the document was "interesting". So let's say document 1 has 300 lines of text, which led to a duration of interest of 5.5 days, whereas another document with 40 lines of text led to a duration of 6.7 days of being "interesting", and so on.
Now the task is to predict the duration of interest (which is a continuous value) based on the text content.
I have two ideas to approach the problem:
Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives, one could try to find the 10 most similar documents in the past, compute the average of their durations, and take that value as the prediction of the duration of interest for the new document.
Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.
The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me how I could compute a similar measure of uncertainty. It is also unclear to me which categories to choose to get the best results from a classifier.
So is there a rule of thumb for how to build a system to best predict a continuous value like time from text documents? Should one use a classifier, or an approach based on averaging over similar documents? I have no real experience in this area and would like to know which approach you think would probably yield the best results. Bonus points are given if you know of a simple existing technology (Java or Python based) which could be used to solve this problem.
Approach (1) is called k-nearest neighbors regression. It's perfectly valid. So are myriad other approaches to regression, e.g. plain multiple regression using the documents' tokens as features.
Here's a skeleton script to fit a linear regression model using scikit-learn(*):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents) # documents: list of filenames
# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...
# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)
That's it. If you now have new documents for which you want to predict the duration of interest, do
Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)
This is a simple linear regression. In reality, I would expect interest duration to not be a linear function of a text's contents, but this might get you started. The next step would be to pick up any textbook on machine learning or statistics that treats more advanced regression models.
(*) I'm a contributor to this project, so this is not unbiased advice. Just about any half-decent machine learning toolkit has linear regression models.
(The following is based on my academic "experience", but seems informative enough to post it).
It looks like your task can be reformulated as:
Given a training set of scored documents, design a system for scoring arbitrary documents based on their content.
"based on their content" is very ambiguous. In fact, I'd say it's too ambiguous.
You could try to find a specific feature of those documents which seems to be responsible for the score. This is more of a human task until you can narrow it down, e.g. until you know you're looking for certain "valuable" words which make up the score, or maybe for groups of words (have a look at http://en.wikipedia.org/wiki/N-gram).
You might also try developing a search-engine-like system, based on a similarity measure, sim(doc1, doc2). However, you'd need a large corpus featuring all possible scores (from the lowest to the highest, multiple times), so that for every input document, similar documents would have a chance to exist. Otherwise, the results would be inconclusive.
Depending on what values sim() returns, the measure should fulfill a relationship like:
sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.
To test the quality of the measure, you could compute the similarity and score difference for each pair of documents, and check the correlation.
The first pick would be the cosine similarity using tf-idf.
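As a rough illustration of that pick, here is a minimal Java sketch of cosine similarity over plain term-count vectors (whitespace tokenisation and raw counts instead of full tf-idf weighting, to keep it short):

import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // build a term -> count vector from a document's text
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // cosine of the angle between two term-count vectors
    static double sim(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = termCounts("interest rates rise as markets fall");
        Map<String, Integer> doc2 = termCounts("markets fall when interest rates rise");
        System.out.println(sim(doc1, doc2)); // close to 1.0 for near-identical vocabularies
    }
}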
You've also mentioned categorizing the data. That seems to me like a way of "justifying" a poor similarity measure: if the measure is good, it should be clear which category a document falls into. As for classifiers, your documents should first have some "features" defined.
If you had a large corpus of the documents, you could try clustering to speed up the process.
Lastly, to determine the final score, I would suggest processing the scores of the few most similar documents. A raw average might not be the best idea in this case, because "less similar" would also mean "less accurate".
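For instance, a similarity-weighted average along those lines might look like this sketch (scores[i] is the known score of the i-th retrieved document and sims[i] its similarity to the new one; the values are made up):

class WeightedScore {
    // similarity-weighted average: more similar documents contribute more
    static double predict(double[] scores, double[] sims) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weighted += sims[i] * scores[i];
            totalWeight += sims[i];
        }
        return weighted / totalWeight;
    }

    public static void main(String[] args) {
        // e.g. three retrieved documents with known durations of 5.5, 6.7 and 4.0 days
        double[] scores = {5.5, 6.7, 4.0};
        double[] sims = {0.9, 0.5, 0.2};
        System.out.println(predict(scores, sims)); // dominated by the most similar document
    }
}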
As for implementation, have a look at: Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
(IMHO, 3000 documents is far too small a number to do anything reliable with, without further knowledge of their content or of the relationship between the content and the score.)

Cobertura Check & Validation

I see that Cobertura has a <cobertura:check> task that can be used to enforce coverage at build-time (if coverage metrics dip below a certain value, the build fails). The website shows examples with several different attributes that are available, but doesn't really give a description as to what they are or what they do:
branchrate
linerate
totalbranchrate
etc.
Also, what are the standard values for each of these attributes? I'm sure it will differ between projects, but there has to be some way for an organization to gauge what is acceptable and what isn't, and I'm wondering how to even arrive at that. Thanks in advance.
Perhaps the documentation has changed since you asked the question, because I think your answer is right there now.
At the time that I'm writing this, the answers to your specific questions are:
branchrate
Specify the minimum acceptable branch coverage rate needed by each class. This should be an integer value between 0 and 100.
linerate
Specify the minimum acceptable line coverage rate needed by each class. This should be an integer value between 0 and 100.
totalbranchrate
Specify the minimum acceptable average branch coverage rate needed by the project as a whole. This should be an integer value between 0 and 100.
If you do not specify branchrate, linerate, totalbranchrate or totallinerate, then Cobertura will use 50% for all of these values.
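For illustration, a minimal Ant snippet using those attributes might look like the following (the thresholds are arbitrary examples, cobertura.classpath is assumed to be defined elsewhere in the build file, and depending on the Cobertura version the task may be registered as cobertura-check rather than cobertura:check):

<!-- register the Cobertura Ant tasks (cobertura.classpath assumed to be defined) -->
<taskdef classpathref="cobertura.classpath" resource="tasks.properties"/>

<!-- fail the build if any class drops below 75% line or branch coverage,
     or the project as a whole drops below 85% -->
<cobertura-check linerate="75" branchrate="75"
                 totallinerate="85" totalbranchrate="85"/>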
A bit of googling shows that most people agree that a "good" coverage number is somewhere from 75% - 95%. I use 85% for new projects. However, I think the metric that is most useful in gauging whether you have enough test coverage is how comfortable your developers are in making and releasing changes to the code (assuming you have responsible developers who care about introducing bugs). Remember, you can have 100% test coverage without a single assert in any test!
For legacy projects things are usually more complicated. It's rare that you can get time to just focus on coverage alone, so most of the time you find out what your code coverage is, and then try to improve it over time. My dream cobertura-check task would check if the coverage on any given line/method/class/package/project is the same as or better than the last build, and have separate thresholds for any code that is "new in this build." Maybe Sonar has something like that...

Estimate unit tests required in large code base

Our team is responsible for a large codebase containing legal rules.
The codebase works mostly like this:
class SNR_15_UNR extends Rule {
    public double getValue(RuleContext context) {
        double snr_15_ABK = context.getValue(SNR_15_ABK.class);
        double UNR = context.getValue(GLOBAL_UNR.class);
        if (UNR <= 0) // if UNR value would reduce snr, apply the reduction
            return snr_15_ABK + UNR;
        return snr_15_ABK;
    }
}
When context.getValue(Class<? extends Rule>) is called, it just evaluates the specific rule and returns the result. This allows you to create a dependency graph while a rule is evaluating, and also to detect cyclic dependencies.
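Roughly, the context behaves like the following sketch (a hypothetical illustration with memoised evaluation and cycle detection; the real implementation differs in detail):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class RuleContext {
    private final Map<Class<? extends Rule>, Double> cache = new HashMap<>();
    private final Deque<Class<? extends Rule>> inProgress = new ArrayDeque<>();

    public double getValue(Class<? extends Rule> ruleType) {
        // a rule asking (directly or indirectly) for its own value is a cycle
        if (inProgress.contains(ruleType)) {
            throw new IllegalStateException("Cyclic rule dependency: " + ruleType.getName());
        }
        Double cached = cache.get(ruleType);
        if (cached != null) {
            return cached;
        }
        inProgress.push(ruleType);
        try {
            double value = ruleType.getDeclaredConstructor().newInstance().getValue(this);
            cache.put(ruleType, value);
            return value;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate rule " + ruleType.getName(), e);
        } finally {
            inProgress.pop();
        }
    }
}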
There are about 500 rule classes like this. We now want to implement tests to verify the correctness of these rules.
Our goal is to implement a testing list as follows:
TEST org.project.rules.SNR_15_UNR
INPUT org.project.rules.SNR_15_ABK = 50
INPUT org.project.rules.UNR = 15
OUTPUT SHOULD BE 50
TEST org.project.rules.SNR_15_UNR
INPUT org.project.rules.SNR_15_ABK = 50
INPUT org.project.rules.UNR = -15
OUTPUT SHOULD BE 35
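To give a feel for the effort per scenario, one of the entries above might translate into a JUnit test roughly like this (a sketch that assumes Mockito is available to stub RuleContext):

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class SNR_15_UNRTest {

    @Test
    public void positiveUnrIsIgnored() {
        RuleContext context = mock(RuleContext.class);
        when(context.getValue(SNR_15_ABK.class)).thenReturn(50.0);
        when(context.getValue(GLOBAL_UNR.class)).thenReturn(15.0);
        assertEquals(50.0, new SNR_15_UNR().getValue(context), 1e-9);
    }

    @Test
    public void negativeUnrReducesTheResult() {
        RuleContext context = mock(RuleContext.class);
        when(context.getValue(SNR_15_ABK.class)).thenReturn(50.0);
        when(context.getValue(GLOBAL_UNR.class)).thenReturn(-15.0);
        assertEquals(35.0, new SNR_15_UNR().getValue(context), 1e-9);
    }
}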
Question is: how many test scenarios are needed? Is it possible to use static code analysis to detect how many unique code paths exist throughout the code? Does any such tool exist, or do I have to start mucking about with Eclipse JDT?
For clarity: I am not looking for code coverage tools. These tell me which code has been executed and which code was not. I want to estimate the development effort required to implement unit tests.
(EDIT 2/25, focused on test-coding effort):
You have 500 sub-classes, and each appears (based on your example with one conditional) to have 2 cases. I'd guess you need 500 * 2 = 1000 tests.
If your code is not as regular as you imply, a conventional (branch) code coverage tool might not be the answer you think you want as a starting place, but it might actually help you make an estimate. Code T<50 tests across randomly chosen classes, and collect code coverage data P (as a fraction) over whatever part of the code base you think needs testing (particularly your classes). Then you need roughly (1-P)*100*T tests.
If your extended classes are all as regular as you imply, you might consider generating them. If you trust the generation process, you might be able to avoid writing the tests.
(ORIGINAL RESPONSE, focused on path coverage tools)
Most code coverage tools are "line" or "branch" coverage tools; they do not count unique paths through the code. At best they count basic blocks.
Path coverage tools do exist; people have built them for research demos, but commercial versions are relatively rare. You can find one at http://testingfaqs.org/t-eval.html#TCATPATH. I don't think this one handles Java.
One of the issues is that the number of apparent paths through code is generally exponential in the number of decisions, since each decision encountered generates a True path and a False path based on the outcome of the conditional (1 decision --> 2 paths, 2 decisions --> 4 paths, ...). Worse, loops are in effect a decision repeated as many times as the loop iterates; a loop that repeats 100 times in effect has 2**100 paths. To control this problem, the more interesting path coverage tools try to determine the feasibility of a path: if the symbolically combined predicates from the conditionals in a prefix of that path are effectively false, the path is infeasible and can be ignored, since it can't really occur. Another standard trick is to treat loops as 0, 1, and N iterations to reduce the number of apparent paths. Managing the number of paths requires rather a lot of machinery, considerably more than what most branch-coverage test tools need, which helps explain why real path coverage tools are rare.
how many test scenarios are needed?
Many. 500 might be a good start.
Is it possible to use static code analysis to detect how many unique code paths exist throughout the code?
Yes. It's called a code coverage tool. Here are some free ones: http://www.java-sources.com/open-source/code-coverage

How can I measure total change in codebase (Eclipse and Mercurial)

We need to be able to compute the total change in lines of code between two versions (V1 and V2) of a large Java codebase. A tool that uses either Eclipse or Mercurial would be ideal.
Counting the number of lines of code in V1 and V2 is not sufficient, since some sections of code will have been removed and rewritten between versions.
What we really need is to compute something like:
I = Intersection of V1 and V2
D = Difference from I to V2
Then we can compute things such as the percentage change = D/V2
Any recommendations for tools that can do this?
hg log --stat will show you various stats for each commit, including inserted / deleted lines.
I don't know if there's a better solution, but you can parse these results to achieve what you want.
You can also have a look at this previous answer on SO: Counting changed lines of code over time
After trying some approaches based on Hg, I found that the best solution is to use CLOC (Count Lines of Code): http://cloc.sourceforge.net/
You can give it two folders containing two versions of a project (its --diff mode), and it will count all of the lines that are the same, modified, added, and removed. It's exactly what I needed.
Yes, ProjectCodeMeter can give you differential SLOC between two versions of source code, but better than that, it can also give you the difference in development time (which I guess is what you really want to aim for).
