What are the differences between PMD and FindBugs? - java

There was a question comparing PMD and CheckStyle. However, I can't find a nice breakdown on the differences/similarities between PMD and FindBugs. I believe a key difference is that PMD works on source code, while FindBugs works on compiled bytecode files. But in terms of capabilities, should it be an either/or choice or do they complement each other?

I'm using both. I think they complement each other.
As you said, PMD works on source code and therefore finds problems like: violation of naming conventions, lack of curly braces, misplaced null checks, long parameter lists, unnecessary constructors, missing break in switch, etc. PMD also tells you about the cyclomatic complexity of your code, which I find very helpful (FindBugs doesn't report cyclomatic complexity).
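For instance, here is a small hypothetical snippet of the kind PMD flags at the source level (the exact rule names depend on which ruleset you enable; status, OPEN, handleOpen and handleClosed are placeholders):
switch (status) {
    case OPEN:
        handleOpen();      // PMD: fall-through, missing break before the next case
    case CLOSED:
        handleClosed();
        break;
}
if (result == null)
    result = "";           // PMD: if statement without curly braces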
FindBugs works on bytecode. Here are some problems FindBugs finds which PMD doesn't: equals() method fails on subtypes, clone method may return null, reference comparison of Boolean values, impossible cast, 32bit int shifted by an amount not in the range of 0-31, a collection which contains itself, equals method always returns true, an infinite loop, etc.
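For instance, a hypothetical fragment of the kind FindBugs flags from the bytecode (getFlagA, getFlagB and doSomething are placeholder methods):
Boolean a = getFlagA();
Boolean b = getFlagB();
if (a == b) {
    // FindBugs: reference comparison of boxed Boolean values;
    // == compares object identity here, use equals() or unbox to boolean instead
    doSomething();
}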
Usually each of them finds a different set of problems. Use both. These tools taught me a lot about how to write good Java code.

The best feature of PMD is its XPath rules, bundled with a Rule Designer that lets you easily construct new rules from code samples (similar to RegEx and XPath GUI builders). FindBugs is stronger out of the box, but constructing project-specific rules and patterns is very important.
For example, I encountered a performance problem involving two nested for loops, resulting in an O(n^2) running time that could easily be avoided. I used PMD to construct an ad-hoc query to review other instances of nested for loops - //ForStatement/Statement//ForStatement. This pointed out two more instances of the problem. This is not a generic rule by any means.
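To illustrate, that XPath expression matches hypothetical code such as the following, where the inner ForStatement sits inside the statement of the outer one (orders and customers are placeholder lists):
for (int i = 0; i < orders.size(); i++) {
    for (int j = 0; j < customers.size(); j++) {   // matched by //ForStatement/Statement//ForStatement
        // O(n^2) work that could often be replaced by a map lookup
    }
}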

PMD is:
famous
used widely in industry
extensible: you can add your own rules in XML
able to give you detailed analysis at error and warning levels
able to scan your code for copied-and-pasted (duplicate) code, which gives a good hint about where your object-oriented design could be improved

Related

Can "Ternary operators should not be nested" (squid:S3358) be configured?

When I have the following code with 2 levels of ternary operations:
double amount = isValid ? (isTypeA ? vo.getTypeA() : vo.getTypeB()) : 0;
Sonar warns about:
Ternary operators should not be nested (squid:S3358)
Just because you can do something, doesn't mean you should, and that's the case with nested ternary operations. Nesting ternary operators results in the kind of code that may seem clear as day when you write it, but six months later will leave maintainers (or worse - future you) scratching their heads and cursing.
Instead, err on the side of clarity, and use another line to express the nested operation as a separate statement.
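Applied to the snippet above, the un-nested version could look like this (a sketch; note that it evaluates the getter even when isValid is false, which is only acceptable if the getters are cheap and side-effect free):
double typeAmount = isTypeA ? vo.getTypeA() : vo.getTypeB();
double amount = isValid ? typeAmount : 0;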
My colleague suggested that this level of nesting is acceptable and is clearer than the alternative.
I wonder whether this rule (or others) can be configured with an allowed nesting limit?
If not, why is Sonar so strict when it comes to code conventions?
I don't want to ignore the rule, just to customize it to allow up to 2 levels instead of 1.
I wonder whether this rule can be configured with an allowed nesting limit?
The Ternary operators should not be nested rule cannot be configured. You are only able to enable or disable it.
I wonder whether other rules can be configured with an allowed nesting limit?
I don't know of any existing rule which can do it. Luckily, you are able to create a custom analyzer. The original rule class is NestedTernaryOperatorsCheck. You can simply copy it and adjust it to your needs.
Why is Sonar so strict when it comes to code conventions?
SonarSource provides a lot of rules for different languages. Every customization option makes their code more difficult to maintain. They have limited capacity, so they have to make decisions that not every user will accept (but most do).

Adding immutable programming rules to the Java language within a program

I'm writing a program in Java. I find that reading and debugging code is easiest when the paradigm techniques are consistent, allowing me to very quickly guess where and what a problem is.
Doing this has, as you might guess, made my programming much faster, and so I want to find a way to enforce these rules.
For example, let's say I have a method that makes changes to the state of an object and returns a value. If the method is called from outside the class, I don't ever want to see it resolved inside parameter parentheses, like this:
somefunction(param1, param2, object.change_and_return());
Instead, I want it to be done like this:
int relevant_variable_name = object.change_and_return();
somefunction(param1, param2, relevant_variable_name);
Another example is that I want to create a base class that includes certain print methods, and I want all user-defined classes to be derived from that base class, much in the way Java itself does.
Within my objects, is there a way I can force myself (and anyone else) to adhere to these rules? I.e., if you try to run code that breaks the rules, it will terminate and return a custom error report. Also, if you write code that breaks the rules, the IDE (I use Eclipse) will recognize it as an error, underline it and bring up the appropriate Javadoc?
For the check and underline violations part:
You can use PMD, a static code analyzer.
It has a default ruleset, and you can write custom rules matching what you need.
However, your checks seem to be quite complex to express in the "PMD language".
PMD is available in Eclipse Marketplace.
For the crash-if-not-conformant part:
I see no easy way to do it.
Hard/complex ways could be:
Write a rule within PMD, run the analysis at compile time, parse the report (still at compile time) and return an error if your rule is violated.
Write a Java agent doing the rule check and make it crash the VM if the rule is violated (not sure it is really feasible; agents are meant for instrumentation).
Use reflection everywhere in your code to load classes, analyze each loaded class against your rules, and crash the VM if a rule is violated (seriously, don't do this: the code would be ugly and the rule easily bypassed).

Why don't compilers use asserts to optimize? [closed]

Consider the following pseudo-C++ code:
vector v;
... filling vector here and doing stuff ...
assert(is_sorted(v));
auto x = std::find(v, elementToSearchFor);
find has linear runtime because it's called on a vector, which may be unsorted. But at that line in that specific program we know that either the program is incorrect (as in: it doesn't run to the end if the assertion fails), or the vector being searched is sorted, which would allow a binary search with O(log n) running time. Optimizing it into a binary search should be done by a good compiler.
This is only the simplest example of such worst-case behavior I've found so far (more complex assertions may allow even more optimization).
Do some compilers do this? If yes, which ones? If not, why don't they?
Appendix: some higher-level languages may easily do this (especially FP ones), so this is more about C/C++/Java and similar languages.
Rice's Theorem basically states that non-trivial properties of code cannot be computed in general.
The relationship between is_sorted being true and a faster-than-linear search being possible is a non-trivial property of the program after is_sorted is asserted.
You can arrange for explicit connections between is_sorted and the ability to use various faster algorithms. The way you communicate this information in C++ to the compiler is via the type system. Maybe something like this:
template<typename C>
struct container_is_sorted {
C c;
// forward a bunch of methods to `c`.
};
Then you'd invoke a container-based algorithm that would either use a linear search on most containers, or a sorted search on containers wrapped in container_is_sorted.
This is a bit awkward in C++. In a system where variables could carry different compiler-known type-like information at different points in the same stream of code (types that mutate under operations) this would be easier.
That is, suppose types in C++ had a sequence of tags like int{positive, even} that you could attach to them, and you could change the tags:
int x;
make_positive(x);
Operations on a type that did not actively preserve a tag would automatically discard it.
Then assert( {is sorted}, foo ) could attach the tag {is sorted} to foo. Later code could then consume foo and have that knowledge. If you inserted something into foo, it would lose the tag.
Such tags might be run time (that has cost, however, so unlikely in C++), or compile time (in which case, the tag-state of a given variable must be statically determined at a given location in the code).
In C++, due to the awkwardness of such stuff, we instead by habit simply note it in comments and/or use the full type system to tag things (rvalue vs lvalue references are an example that was folded into the language proper).
So the programmer is expected to know it is sorted, and invoke the proper algorithm given that they know it is sorted.
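In Java terms (the question also mentions Java), that means the programmer encodes the "sorted" knowledge by picking the algorithm explicitly. A minimal sketch, using java.util.Collections and java.util.List, where buildSortedList() and isSorted() are hypothetical helpers:
List<Integer> v = buildSortedList();
assert isSorted(v);                              // only evaluated when assertions are enabled (-ea)
int index = Collections.binarySearch(v, 42);     // the programmer, not the compiler, chooses the O(log n) search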
Well, there are two parts to the answer.
First, let's look at assert:
7.2 Diagnostics <assert.h>
1 The header defines the assert and static_assert macros and refers to another macro, NDEBUG, which is not defined by <assert.h>. If NDEBUG is defined as a macro name at the point in the source file where <assert.h> is included, the assert macro is defined simply as
#define assert(ignore) ((void)0)
The assert macro is redefined according to the current state of NDEBUG each time that <assert.h> is included.
2 The assert macro shall be implemented as a macro, not as an actual function. If the macro definition is suppressed in order to access an actual function, the behavior is undefined.
Thus, there is nothing left in release-mode to give the compiler any hint that some condition can be assumed to hold.
Still, there is nothing stopping you from redefining assert with an implementation-defined __assume in release-mode yourself (take a look at __builtin_unreachable() in clang / gcc).
Let's assume you have done so. Now, the condition tested could be really complicated and expensive. Thus, you really want to annotate it so it does not ever result in any run-time work. Not sure how to do that.
Let's grant that your compiler even allows that, for arbitrary expressions.
The next hurdle is recognizing what the expression actually tests, and how that relates to the code as written and any potentially faster, but under the given assumption equivalent, code.
This last step results in an immense explosion of compiler-complexity, by either having to create an explicit list of all those patterns to test or building a hugely-complicated automatic analyzer.
That's no fun, and just about as complicated as building SkyNET.
Also, you really do not want to use an asymptotically faster algorithm on a data-set which is too small for asymptotic time to matter. That would be a pessimization, and you just about need precognition to avoid such.
Assertions are (usually) compiled out in the final code. Meaning, among other things, that the code could (silently) fail (by retrieving the wrong value) due to such an optimization, if the assertion was not satisfied.
If the programmer (who put the assertion there) knew that the vector was sorted, why didn't he use a different search algorithm? What's the point in having the compiler second-guess the programmer in this way?
How does the compiler know which search algorithm to substitute for which, given that they all are library routines, not a part of the language's semantics?
You said "the compiler". But compilers are not there for the purpose of writing better algorithms for you. They are there to compile what you have written.
What you might have asked is whether the library function std::find should be implemented to detect whether it can use an algorithm other than linear search. In practice that might be possible if the user has passed in std::set iterators or even std::unordered_set iterators and the STL implementer knows the details of those iterators and can make use of them, but not in general and not for vector.
assert itself only applies in debug mode, while optimisations are normally needed for release mode. Also, a failed assert aborts the program; it does not switch to a different library routine.
Essentially, there are collections provided for faster lookup, and it is up to the programmer to choose them; it is not for the library writer to try to second-guess what the programmer really wanted to do. (And in my opinion even less so for the compiler.)
In the narrow sense of your question, the answer is that they do if they can, but mostly they can't, because the language isn't designed for it and assert expressions are too complicated.
If assert() is implemented as a macro (as it is in C++), it has not been disabled (by defining NDEBUG), and the expression can be evaluated at compile time (or its data flow can be traced), then the compiler will apply its usual optimisations. That doesn't happen often.
In most cases (and certainly in the example you gave) the relationship between the assert() and the desired optimisation is far beyond what a compiler can do without assistance from the language. Given the very low level of meta-programming capability in C++ (and Java) the ability to do this is quite limited.
In the wider sense, I think what you're really asking for is a language in which the programmer can make assertions about the intention of the code, from which the compiler can choose between different translations (and algorithms). There have been experimental languages attempting to do that, and Eiffel had some features in that direction, but I'm not aware of any mainstream compiled language that can do it.
Optimizing it into a binary search should be done by a good compiler.
No! A linear search has much more predictable branches. If the array is short enough, linear search is the right thing to do.
Apart from that, even if the compiler wanted to, the list of ideas and notions it would have to know about would be immense and it would have to do nontrivial logic on them. This would get very slow. Compilers are engineered to run fast and spit out decent code.
You might spend some time playing with formal verification tools whose job is to figure out everything they can about the code they're fed in, which asserts can trip, and so forth. They're often built without the same speed requirements compilers have and consequently they're much better at figuring things out about programs. You'll probably find that reasoning rigorously about code is rather harder than it looks at first sight.

What are the subphases of the semantics analysis compiler phase?

I took an interest in finding out how a compiler really works. I looked through several books and all of them agree that the compiler phases are roughly as follows (correct me if I'm wrong): lexical analysis, syntax analysis, semantic analysis, intermediate code, code optimization, code generation. The lexical and syntax phases look pretty clear and straightforward as methods (which does not mean easy, of course). However, I'm still not able to find out what the semantic phase really consists of. For one, I know that there should be some subphases like scope checking, declaration checking and type checking, but the question that has been bothering me is: are there other things that have to be done? Can you tell me what the mandatory steps are during this phase? I know this strongly depends on the programming language and the compiler implementation, but could you give me some examples concerning C/C++ and Java? And could you please point me to a book/page/article where I can read about those things in depth? Thanks.
Edit:
The books I looked through were "Compilers: Principles, Techniques, and Tools" (Aho et al.) and "Modern Compiler Design" (Grune, Reeuwijk). I haven't been able to answer this question using them. If you find this question too broad, could you please give an answer considering a compiler implementation of your choice for C, C++ or Java?
There are typical "semantic analysis" phases that many compilers go through in one form or another. After lexing and parsing, the following actions typically occur in this order:
Name and type resolution. Determines lexical scopes, the identifiers declared in those scopes, the type information for those identifiers, and, for each non-declaration use of an identifier, the declaration to which it refers.
Control flow analysis. The construction of a control flow graph over the computations explicit and/or implied (e.g., constructors) by the code.
Data flow analysis. Determines where variables receive new values, and where those values are read by other parts of the program. (This often involves a local analysis done within procedures, possibly followed by one across the procedures.)
Also often done, as part of data flow analysis:
Points-to analysis. Determination, for each pointer at each location in the code, of which entities that pointer might reference.
Call graph. Construction of a call graph across the procedures, often taking into account indirect function pointers whose estimated values come out of the points-to analysis.
As a practical matter, some of these need to be interleaved to produce better results.
Beyond this, there are many analyses used to support various optimizations and code generation passes. If you really want to know more, consult any decent compiler book.
As already mentioned by templatetypedef, semantic analysis is language specific. For C++ it would, among other things, involve determining which template instantiations are required (the C++ language tends towards more and more semantic analysis), and for Java there would need to be some checked-exception analysis.
Even for C, the GNU C compiler can be configured to check the arguments given to printf-style format strings. I guess there are hundreds of semi-semantic-analysis-related options for GCC to choose from. If you are doing a paper on the subject, you could spend an afternoon counting them :)
Besides availability, I find that the semantic analysis is what differentiates the statically typed imperative object-oriented languages of today.
You can't necessarily divide it into sub-phases at all. There are a number of things that have to be done, but at least conceptually they are all done while walking the parse tree from top to bottom and back up again. What exactly they are and how exactly it all happens depends on the language, the statement being processed, the specific compiler writer, ...
You could start to make a list:
Build symbol table.
Find the declarations of variables referenced.
Check compatibility of variable datatypes.
Establish subexpression types.
...
You can see that already these must be somewhat intermingled in practice, rather than constitute separable sub-phases.
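For instance, hypothetical fragments (inside some method body) that are rejected by exactly these semantic checks rather than by the parser:
int x = "hello";      // type check: incompatible types, a String cannot be assigned to an int
y = 3;                // declaration check: y was never declared ("cannot find symbol")
{
    int z = 1;
}
return z;             // scope check: z is no longer in scope here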

Comparison of two Java classes

I have two Java classes that are very similar in semantics but differ in syntax. The differences are minor, for example:
Changes in variable names,
Changes in position of some statements (with no dependent lines in between),
Extra imports, etc.
I need to compare these two classes to prove that they are indeed semantically identical. The same needs to be done for a large number of java file pairs.
The first approach, reading from the two files and comparing the lines with logic to deal with the differences mentioned above, seems inefficient. Is there some other way I can achieve this task? Any helpful APIs out there?
Compile both of the classes without debug information and then decompile them back to source files. The decompiled files should be a lot more similar than the original source files.
You can improve this further by running some optimizations on the compiled files. For example you can use Proguard with just shrinking enabled to removed unused code.
Changes in position of some statements can be hard to detect though.
If you want to examine the changes in the code try Araxis Merge or WinMerge.
But if you want logical differences, I am afraid you might have to do it manually.
I would advise to use one of these tools to look for textual changes and then look for logical differences.
There are a lot of similarity checkers out there, but there is no perfect tool for this yet. Each has its own advantages and disadvantages. The approaches generally fall into two categories: token-based and tree-based.
Token-based similarity checking is usually done with regular expressions, but other approaches are possible. In one of my projects at university, we developed one using an alignment strategy from the bioinformatics field. The main disadvantage of this technique shows up when the sizes of the two sources aren't more or less equal.
Tree-based checking is more like a compiler, so using some compilation techniques it is possible (well, more or less) to check for this. The tree-based approach has the disadvantage that the comparison complexity can be exponential.
Comparing line by line won't work. I think you may need to use a parser. I would suggest that you take a look at ANTLR. It should have a Java grammar into which you could put the actions that do the comparison.
As far as I know there's no way to compare the semantics of two Java classes. Take for example the following two methods:
public String m1(String a, int b) { ... }
and
public String m2(String x, int y) { ... }
Apart from changes in variable and method names, their signatures are the same: same return type and same input types. However, this is no guarantee that the two methods are semantically equivalent. For example, m1 could return a string consisting of the first b characters of a, while m2 could return a string consisting of y repetitions of x. As you can see, although only variables and names change, the semantics of the two methods are totally different.
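To make that concrete, a hypothetical pair of bodies matching those signatures but with different semantics:
public String m1(String a, int b) {
    return a.substring(0, b);                  // the first b characters of a
}
public String m2(String x, int y) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < y; i++) sb.append(x);  // y repetitions of x
    return sb.toString();
}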
I don't see an easy way out for your problem. You can perhaps make some assumptions and try the following approach:
assume that the methods names in the two classes are the same
write test cases (for example with JUnit) for all the methods in the first class
run the test cases on the second class
ensure that the second class does not have other (untested) methods (for example using reflection)
This approach gives you an idea about semantic equivalence, but it makes strong assumptions.
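A minimal sketch of that idea with JUnit 4, assuming both classes expose a method with the same name (ClassA, ClassB and transform() are hypothetical, and the expected values come from the tests written for the first class):
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class EquivalenceTest {
    // Run the same expectations against both implementations; reflection avoids
    // needing a shared interface between the two classes.
    private void checkContract(Object instance) throws Exception {
        Object result = instance.getClass()
                .getMethod("transform", String.class)
                .invoke(instance, "abc");
        assertEquals("ABC", result);
    }

    @Test public void firstClassSatisfiesContract() throws Exception  { checkContract(new ClassA()); }
    @Test public void secondClassSatisfiesContract() throws Exception { checkContract(new ClassB()); }
}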
As a final remark, let me add that specifying the semantics of programs is an interesting and open research topic. Some interesting developments in this area include research on Semantic Web Services. A widely adopted approach to give machine-processable semantics to programs is to specify their IOPE: Input and Output types (as in the Java methods above), plus their Preconditions and Effects. Preconditions are essentially logical conditions that must hold true for successfully invoking the program, and Effects are formal descriptions of the changes (in the state of the world) caused by the successful execution of the program. Even with IOPE there are a lot of problems ... which I skip in this short description.
