Comparison of two Java classes

I have two Java classes that are very similar in semantics but differ in syntax. The differences are minor, such as:
Changes in variable names,
Changes in the position of some statements (with no dependent lines in between),
Extra imports, etc.
I need to compare these two classes to prove that they are indeed semantically identical. The same needs to be done for a large number of Java file pairs.
The first approach, reading the two files and comparing them line by line with logic to handle the differences mentioned above, seems inefficient. Is there some other way I can achieve this task? Any helpful APIs out there?

Compile both classes without debug information and then decompile them back to source files. The decompiled files should be a lot more similar than the original source files.
You can improve this further by running some optimizations on the compiled files. For example, you can use ProGuard with just shrinking enabled to remove unused code.
Changes in the position of some statements can still be hard to detect, though.
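The compile half of this can be scripted with the JDK's standard compiler API; as a sketch, -g:none strips the debug attributes (local-variable names, line numbers) that differ between renamed versions. The file name and source text below are illustrative:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompileNoDebug {
    public static void main(String[] args) throws Exception {
        // Write a throwaway source file to compile (stand-in for one of the pair).
        Path dir = Files.createTempDirectory("cmp");
        Path src = dir.resolve("A.java");
        Files.writeString(src, "class A { int add(int a, int b) { return a + b; } }");

        // -g:none drops local-variable names and line numbers from the class file,
        // so two versions that differ only in variable names compile identically.
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        int rc = javac.run(null, null, null, "-g:none", src.toString());

        System.out.println(rc == 0 && Files.exists(dir.resolve("A.class")));
    }
}
```

The resulting class files (or their decompiled sources) can then be diffed byte for byte.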

If you want to examine the changes in the code, try Araxis Merge or WinMerge.
But if you want logical differences, I am afraid you might have to do it manually.
I would advise using one of these tools to look for textual changes and then looking for logical differences.

There are a lot of similarity checkers out there, and as yet there is no perfect tool for this. Each has its own advantages and disadvantages. The approaches generally fall into two categories: token-based or tree-based.
Token-based similarity checking is usually done with regular expressions, but other approaches are possible. In one of my projects at university, we developed one utilizing an alignment strategy from the bioinformatics field. The main disadvantage of this technique shows when the two sources are not of roughly equal size.
Tree-based checking is more like a compiler, so with some standard compilation techniques it is possible (well, more or less) to check for this. The tree-based approach has the disadvantage that the comparison complexity is exponential.

Comparing line by line won't work. I think you may need to use a parser. I would suggest that you take a look at ANTLR. It has a Java grammar into which you could put the actions that perform the comparison.

As far as I know there's no way to compare the semantics of two Java classes. Take for example the following two methods:
public String m1(String a, int b) { ... }
and
public String m2(String x, int y) { ... }
Apart from changes in variable and method names, their signatures are the same: same return type and same input types. However, this is no guarantee that the two methods are semantically equivalent. For example, m1 could return a string consisting of the first b characters of a, while m2 could return a string consisting of y repetitions of x. As you can see, although only variables and names change, the semantics of the two methods are totally different.
I don't see an easy way out of your problem. You can perhaps make some assumptions and try the following approach:
assume that the methods names in the two classes are the same
write test cases (for example with JUnit) for all the methods in the first class
run the test cases on the second class
ensure that the second class does not have other (untested) methods (for example using reflection)
This approach gives you an idea about equivalent semantics, but it makes strong assumptions.
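As a sketch of those steps under the stated assumption (same method names), reflection can verify that the two classes expose exactly the same method signatures, and the same behavioral assertion can be run against both. Impl1 and Impl2 below are stand-ins for the two classes under test:

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class SemanticsCheck {
    // Two "similar" classes differing only in parameter names.
    static class Impl1 { public String m(String a, int b) { return a.substring(0, Math.min(b, a.length())); } }
    static class Impl2 { public String m(String x, int y) { return x.substring(0, Math.min(y, x.length())); } }

    // Collect "name[paramTypes]->returnType" strings, ignoring parameter names.
    static Set<String> signatures(Class<?> c) {
        return Arrays.stream(c.getDeclaredMethods())
            .map((Method m) -> m.getName() + Arrays.toString(m.getParameterTypes()) + "->" + m.getReturnType())
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // Last step: neither class has extra, untested methods.
        System.out.println(signatures(Impl1.class).equals(signatures(Impl2.class)));

        // Behavioral spot check, standing in for a shared JUnit test suite.
        String r1 = new Impl1().m("hello", 3);
        String r2 = new Impl2().m("hello", 3);
        System.out.println(r1.equals(r2) ? "match" : "differ");
    }
}
```

Of course, a handful of passing test inputs only suggests equivalence; it cannot prove it.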
As a final remark, let me add that specifying the semantics of programs is an interesting and open research topic. Some interesting developments in this area include research on Semantic Web Services. A widely adopted approach to give machine-processable semantics to programs is that of specifying their IOPE: Input and Output types (as in the Java methods above), and their Preconditions and Effects. Preconditions are essentially logical conditions that must hold true for successfully invoking the program, and Effects are formal descriptions of the changes (in the state of the world) caused by the successful execution of the program. Even with IOPE there are a lot of problems ... which I skip in this short description.

Related

Higher-level, semantic search-and-replace in Java code from command-line

Command-line tools like grep, sed, awk, and perl allow one to carry out textual search-and-replace operations.
However, is there any tool that would allow me to carry out semantic search-and-replace operations in a Java codebase, from command-line?
The Eclipse IDE allows me, e.g., to easily rename a variable, a field, a method, or a class. But I would like to be able to do the same from command-line.
The rename operation above is just one example. I would further like to be able to select the text to be replaced with additional semantic constraints, such as:
only the scopes of methods M1, M2 of classes C, D, and E;
only all variables or fields of class C;
all expressions in which a variable of some class occurs;
only the scope of the class definition of a variable;
only the scopes of all overridden versions of method M of class C;
etc.
Having selected the code using such arbitrary semantic constraints, I would like to be able to then carry out arbitrary transformations on it.
So, basically, I would need access to the symbol-table of the code.
Question:
Is there an existing tool for this type of work, or would I have to build one myself?
Even if I have to build one myself, do any tools or libraries exist that would at least provide me the symbol-table of Java code, on top of which I could add my own search-and-replace and other refactoring operations?
The only tool that I know can do this easily is the long-awaited Refaster. However, it is still impossible to use it outside of Google. See [the research paper](http://research.google.com/pubs/pub41876.html) and the status on using Refaster outside of Google.
I am the author of AutoRefactor, and I am very interested in implementing this feature as part of this project. Please follow up on the github issue if you would like to help.
What you want is the ability to find code according to syntax, constrained by various semantic conditions, and then be able to replace the found code with new syntax.
Access to the symbol table (symbol type/scope/mentions in scope) is just one kind of semantic constraint. You'll probably want others, such as control-flow sequencing (this happens after that) and data-flow reaching (data produced here is consumed there). In fact, there is an unbounded number of semantic conditions you might consider important, depending on the properties of the language (does this function access data in parallel with that function?) or your application interests (is this matrix an upper-triangular matrix?).
In general you can't have a tool that has all possible semantic conditions of interest off the shelf. That means you need to be able to express new semantic conditions when you discover the need for them.
The best you might hope for is a tool that
knows the language syntax
has some standard semantic properties built in (my preference is symbol tables, control and data flow analysis)
can express patterns on the source in terms of the source code
can constrain the patterns based on such semantic properties
can be extended with new semantic analyses to provide additional properties
There is a classic category of tools that do this, called source-to-source program transformation systems.
My company offers the DMS Software Reengineering Toolkit, which is one of these. DMS has been used to carry out production transformations at scale on a wide variety of languages (including OP's target: Java). DMS's rewrite rules are of the form:
rule <rule_name>(syntax_parameters): syntax_category =
<match_pattern> -> <replacement_pattern>
if <semantic_condition>;
You can see a lot more detail of what the pattern language and rewrite rules look like: DMS Rewrite Rules.
It is worth noting that the rewrite rules represent operations on trees. This means that while they might look like text string matches, they are not. Consequently, a rewrite rule matches in spite of any whitespace issues (and in DMS's case, even in spite of differences in number radix or character string escapes). This makes DMS pattern matches far more effective than a regex, and a lot easier to write, since you don't have to worry about these issues.
This Software Recommendations link shows how one can define rules with DMS and (as per OP's request) "run them from the command line". This isn't as succinct as running sed, but then it is doing much more complex tasks.
DMS has a Java front end with symbol tables and control- and data-flow analysis. If one wants additional semantic analyses, one codes them in DMS's underlying programming language.

Why don't compilers use asserts to optimize? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Consider the following pseudo-C++ code:
vector v;
... filling vector here and doing stuff ...
assert(is_sorted(v));
auto x = std::find(v, elementToSearchFor);
find has linear runtime because it's called on a vector, which can be unsorted. But at that line in that specific program we know that either the program is incorrect (as in: it doesn't run to the end if the assertion fails) or the vector to be searched is sorted, therefore allowing a binary-search find with O(log n). Optimizing it into a binary search should be done by a good compiler.
This is only the simplest example I have found so far (more complex assertions may allow even more optimization).
Do some compilers do this? If yes, which ones? If not, why don't they?
Appendix: Some higher-level languages may easily do this (especially FP ones), so this is more about C/C++/Java/similar languages.
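Since the appendix mentions Java, here is the same situation rendered there. The assert likewise gives the compiler no license to swap algorithms; isSorted is a hypothetical helper, and Java assertions only execute when enabled with -ea:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AssertExample {
    public static void main(String[] args) {
        List<Integer> v = new ArrayList<>(List.of(1, 3, 5, 9));

        // Documents sortedness, but neither javac nor the JIT uses this
        // fact to replace the linear search below with a binary one.
        assert isSorted(v);

        int idx = v.indexOf(5);                     // linear search, O(n)
        int idx2 = Collections.binarySearch(v, 5);  // what the OP wants inferred, O(log n)
        System.out.println(idx + " " + idx2);
    }

    static boolean isSorted(List<Integer> v) {
        for (int i = 1; i < v.size(); i++)
            if (v.get(i - 1) > v.get(i)) return false;
        return true;
    }
}
```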
Rice's Theorem basically states that non-trivial properties of code cannot be computed in general.
The relationship between is_sorted being true and a faster-than-linear search being possible is a non-trivial property of the program after is_sorted is asserted.
You can arrange for explicit connections between is_sorted and the ability to use various faster algorithms. The way you communicate this information in C++ to the compiler is via the type system. Maybe something like this:
template<typename C>
struct container_is_sorted {
C c;
// forward a bunch of methods to `c`.
};
Then you'd invoke a container-based algorithm that would use either a linear search on most containers or a sorted search on containers wrapped in container_is_sorted.
This is a bit awkward in C++. In a system where variables could carry different compiler-known type-like information at different points in the same stream of code (types that mutate under operations) this would be easier.
I.e., suppose types in C++ had a sequence of tags, like int{positive, even}, that you could attach to them, and you could change the tags:
int x;
make_positive(x);
Operations on a type that did not actively preserve a tag would automatically discard it.
Then assert( {is sorted}, foo ) could attach the tag {is sorted} to foo. Later code could then consume foo and have that knowledge. If you inserted something into foo, it would lose the tag.
Such tags might be run time (that has cost, however, so unlikely in C++), or compile time (in which case, the tag-state of a given variable must be statically determined at a given location in the code).
In C++, due to the awkwardness of such stuff, we instead by habit simply note it in comments and/or use the full type system to tag things (rvalue vs lvalue references are an example that was folded into the language proper).
So the programmer is expected to know it is sorted, and invoke the proper algorithm given that they know it is sorted.
Well, there are two parts to the answer.
First, let's look at assert:
7.2 Diagnostics <assert.h>
1 The header defines the assert and static_assert macros and refers to another macro, NDEBUG, which is not defined by <assert.h>. If NDEBUG is defined as a macro name at the point in the source file where <assert.h> is included, the assert macro is defined simply as
#define assert(ignore) ((void)0)
The assert macro is redefined according to the current state of NDEBUG each time that <assert.h> is included.
2 The assert macro shall be implemented as a macro, not as an actual function. If the macro definition is suppressed in order to access an actual function, the behavior is undefined.
Thus, there is nothing left in release-mode to give the compiler any hint that some condition can be assumed to hold.
Still, there is nothing stopping you from redefining assert with an implementation-defined __assume in release-mode yourself (take a look at __builtin_unreachable() in clang / gcc).
Let's assume you have done so. Now, the condition tested could be really complicated and expensive. Thus, you really want to annotate it so it does not ever result in any run-time work. Not sure how to do that.
Let's grant that your compiler even allows that, for arbitrary expressions.
The next hurdle is recognizing what the expression actually tests, and how that relates to the code as written and any potentially faster, but under the given assumption equivalent, code.
This last step results in an immense explosion of compiler-complexity, by either having to create an explicit list of all those patterns to test or building a hugely-complicated automatic analyzer.
That's no fun, and just about as complicated as building SkyNET.
Also, you really do not want to use an asymptotically faster algorithm on a data-set which is too small for asymptotic time to matter. That would be a pessimization, and you just about need precognition to avoid such.
Assertions are (usually) compiled out in the final code. Meaning, among other things, that the code could (silently) fail (by retrieving the wrong value) due to such an optimization, if the assertion was not satisfied.
If the programmer (who put the assertion there) knew that the vector was sorted, why didn't he use a different search algorithm? What's the point in having the compiler second-guess the programmer in this way?
How does the compiler know which search algorithm to substitute for which, given that they all are library routines, not a part of the language's semantics?
You said "the compiler". But compilers are not there for the purpose of writing better algorithms for you. They are there to compile what you have written.
What you might have asked is whether the library function std::find should be implemented to detect whether it can use an algorithm other than linear search. In reality it might be possible if the user has passed in std::set iterators or even std::unordered_set ones, and the STL implementer knows the details of those iterators and can make use of them, but not in general and not for vector.
assert itself only applies in debug mode, and optimisations are normally needed in release mode. Also, a failed assert causes an abort, not a switch to a different library algorithm.
Essentially, there are collections provided for faster lookup and it is up to the programmer to choose it and not the library writer to try to second guess what the programmer really wanted to do. (And in my opinion even less so for the compiler to do it).
In the narrow sense of your question, the answer is they do when they can, but mostly they can't, because the language isn't designed for it and assert expressions are too complicated.
If assert() is implemented as a macro (as it is in C++), and it has not been disabled (by setting NDEBUG in C++) and the expression can be evaluated at compile time (or can be data traced) then the compiler will apply its usual optimisations. That doesn't happen often.
In most cases (and certainly in the example you gave) the relationship between the assert() and the desired optimisation is far beyond what a compiler can do without assistance from the language. Given the very low level of meta-programming capability in C++ (and Java) the ability to do this is quite limited.
In the wider sense, I think what you're really asking for is a language in which the programmer can make assertions about the intention of the code, from which the compiler can choose between different translations (and algorithms). There have been experimental languages attempting to do that, and Eiffel had some features in that direction, but I'm not aware of any mainstream compiled languages that could do it.
Optimizing it into a binary search should be done by a good compiler.
No! A linear search results in a much more predictable branch. If the array is short enough, linear search is the right thing to do.
Apart from that, even if the compiler wanted to, the list of ideas and notions it would have to know about would be immense and it would have to do nontrivial logic on them. This would get very slow. Compilers are engineered to run fast and spit out decent code.
You might spend some time playing with formal verification tools whose job is to figure out everything they can about the code they're fed in, which asserts can trip, and so forth. They're often built without the same speed requirements compilers have and consequently they're much better at figuring things out about programs. You'll probably find that reasoning rigorously about code is rather harder than it looks at first sight.

Why does Guava's Optional use abstract classes when Java 8's uses nulls?

When Java 8 was released, I was expecting to find its implementation of Optional to be basically the same as Guava's. And from a user's perspective, they're almost identical. But Java 8's Optional uses null internally to mark an empty Optional, rather than making Optional abstract and having two implementations. Aside from Java 8's version feeling wrong (you're avoiding nulls by just hiding the fact that you're really still using them), isn't it less efficient to check if your reference is null every time you want to access it, rather than just invoke an abstract method? Maybe it's not, but I'm wondering why they chose this approach.
Perhaps the developers of Google Guava wanted an idiom closer to those of the functional world:
datatype 'a option = NONE | SOME of 'a
In that case you use pattern matching to check the true nature of an instance of type option:
case x of
NONE => (* address the empty case here *)
| SOME y => (* do something with y here *)
By declaring Optional as an abstract class, Google Guava is following this approach: Optional represents the type ('a option), and the subclasses Present and Absent represent the particular instances of this type (SOME 'a and NONE).
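A minimal sketch of the two designs being contrasted; the names and methods here are illustrative stand-ins, not the real Guava or JDK sources:

```java
// Guava-style: an abstract type with two concrete subclasses,
// so "present vs. empty" is decided by virtual dispatch, not a null check.
abstract class Opt<T> {
    abstract boolean isPresent();

    static final class Present<T> extends Opt<T> {
        final T value;
        Present(T value) { this.value = value; }
        boolean isPresent() { return true; }
    }

    static final class Absent<T> extends Opt<T> {
        boolean isPresent() { return false; }
    }
}

// JDK-style: a single final class where a null reference marks the empty case,
// so every access performs a null check internally.
final class JdkOpt<T> {
    private final T value; // null means empty
    JdkOpt(T value) { this.value = value; }
    boolean isPresent() { return value != null; }
}

public class OptionalSketch {
    public static void main(String[] args) {
        System.out.println(new Opt.Present<>("x").isPresent() && !new Opt.Absent<String>().isPresent());
        System.out.println(new JdkOpt<>("x").isPresent() && !new JdkOpt<String>(null).isPresent());
    }
}
```

Both yield the same observable behavior; the question in this thread is purely about which internal representation is cheaper.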
The design of Option was thoroughly discussed in the lambda mailing list. In the words of Brian Goetz:
The problem is with the expectations. This is a classic "blind men
and elephant" problem; the thing called Optional has different
"essential natures" to different viewpoints, and the problem is not
that each is not valid, the problem is that we're all using the same
word to describe different concepts (more precisely, assuming that the
goals of the JDK team are the same as the goals of the people you
condescendingly refer to as "those familiar with the concept."
There is a narrow design scope of what Optional is being used for in
the JDK. The current design mostly meets that; it might be extended
in small ways, but the goal is NOT to create an option monad or solve
the problems that the option monad is intended to solve. (Even if we
did, the result would still likely not be satisfactory; without the
rest of the class library following the same monadic API conventions,
without higher-kinded generics to abstract over different kinds of
monads, without linguistic support for flatmap in the form of the <-
operator, without pattern matching, etc, etc, the value of turning
Optional into a monad is greatly decreased.) Given that this is not
our goal here, we're stopping where it stops adding value according to
our goals. Sorry if people are upset that we're not turning Java into
Scala or Haskell, but we're not.
On a purely practical note, the discussions surrounding Optional have
exceeded its design budget by several orders of magnitude. We've
carefully considered the considerable input we've received, spent no
small amount of time thinking about it, and have concluded that the
current design center is the right one for the current time. What is
surely meant as well-intentioned input is in fact rapidly turning into
a denial-of-service attack. We could spend endless time arguing this
back and forth, and there'd be no JDK 8 as a result. I'm sure no one
wants that.
So, let's keep our input on the subject to that which is within the
design center of the current implementation, rather than trying to
convince us to change the design center.
I would expect virtual method invocation to be more expensive: you have to load the virtual function table, look up an offset, and then invoke the method. A null check is a single bytecode that reads from a register, not from memory.

Best choice? Edit bytecode (asm) or edit java file before compiling

Goal
Detect where comparisons between and copies of variables are made
Inject code near the line where the operation happens
The purpose of the code: every time the class is run, make a counter increase
General purpose: count the number of comparisons and copies made during execution with certain parameters
2 options
Note: I always have a .java file to begin with
1) Edit the Java file
Find comparisons with a regex and inject pieces of code near the line,
and then compile the class (my application uses JavaCompiler)
2) Use ASM bytecode engineering
Also detect the events I want to track and inject pieces into the bytecode,
and then use the (already compiled but modified) class
My Question
What is the best/cleanest way? Is there a better way to do this?
If you go for the Java route, you don't want to use regexes -- you want a real java parser. So that may influence your decision. Mind, the Oracle JVM includes one, as part of their internal private classes that implement the java compiler, so you don't actually have to write one yourself if you don't want to. But decoding the Oracle AST is not a 5 minute task either. And, of course, using that is not portable if that's important.
If you go the ASM route, the bytecode will initially be easier to analyze, since the semantics are a lot simpler. Whether the simplicity of analyses outweighs the unfamiliarity is unknown in terms of net time to your solution. In the end, in terms of generated code, neither is "better".
There is an apparent simplicity in just looking at generated Java source code and "knowing" that What You See Is What You Get, versus doing primitive dumps of class files for debugging and so on, but all that apparent simplicity is there because of your already existing comfort with the Java language. Once you spend some time dredging through bytecode, that, too, will become comfortable. It's just a question of whether it's worth the time to you to get there in the first place.
Generally it all depends how comfortable you are with either option and how critical is performance aspect. The bytecode manipulation will be much faster and somewhat simpler, but you'll have to understand how bytecode works and how to use ASM framework.
Intercepting variable access is probably one of the simplest use cases for ASM. You could find a few more complex scenarios in this AOSD'07 paper.
Here is simplified code for intercepting variable access:
ClassReader cr = ...;
ClassWriter cw = ...;
// ClassReader.accept takes a ClassVisitor, not a MethodVisitor
cr.accept(new ClassVisitor(Opcodes.ASM9, cw) {
    @Override
    public MethodVisitor visitMethod(int access, String name, String desc, String sig, String[] exc) {
        return new MethodVisitor(Opcodes.ASM9, super.visitMethod(access, name, desc, sig, exc)) {
            @Override
            public void visitVarInsn(int opcode, int var) {
                if (opcode == Opcodes.ALOAD) { // loading an Object var
                    // ... insert counter-increment call here
                }
                super.visitVarInsn(opcode, var);
            }
        };
    }
}, 0);
If it were me, I'd probably use the ASM option.
If you need a tutorial on ASM, I stumbled upon this user-written tutorial.

Java -> Python?

Besides the dynamic nature of Python (and the syntax), what are some of the major features of the Python language that Java doesn't have, and vice versa?
List comprehensions. I often find myself filtering/mapping lists, and being able to say [line.replace("spam","eggs") for line in open("somefile.txt") if line.startswith("nee")] is really nice.
Functions are first-class objects. They can be passed as parameters to other functions, defined inside other functions, and have lexical scope. This makes it really easy to say things like people.sort(key=lambda p: p.age) and thus sort a bunch of people by their age without having to define a custom comparator class or something equally verbose.
Everything is an object. Java has primitive types which aren't objects, which is why many classes in the standard library define 9 different versions of functions (for boolean, byte, char, double, float, int, long, Object, short). Arrays.sort is a good example. Autoboxing helps, although it makes things awkward when something turns out to be null.
Properties. Python lets you create classes with read-only fields, lazily generated fields, as well as fields which are checked upon assignment to make sure they're never 0 or null or whatever else you want to guard against.
Default and keyword arguments. In Java if you want a constructor that can take up to 5 optional arguments, you must define 6 different versions of that constructor. And there's no way at all to say Student(name="Eli", age=25)
Functions can return more than one thing. In Python you have tuple assignment, so you can say spam, eggs = nee(), but in Java you'd need to either resort to mutable out-parameters or have a custom class with 2 fields and then two additional lines of code to extract those fields.
Built-in syntax for lists and dictionaries.
Operator Overloading.
Generally better designed libraries. For example, to parse an XML document in Java, you say
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse("test.xml");
and in Python you say
doc = parse("test.xml")
Anyway, I could go on and on with further examples, but Python is just overall a much more flexible and expressive language. It's also dynamically typed, which I really like, but which comes with some disadvantages.
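For comparison, the list-comprehension example above now has a rough Java analogue in the Stream API (added in Java 8, after this answer was written); a self-contained sketch with inline data standing in for the file:

```java
import java.util.List;
import java.util.stream.Collectors;

public class Comprehension {
    public static void main(String[] args) {
        // Stand-in for the lines of "somefile.txt" in the Python example.
        List<String> lines = List.of("nee spam", "ni spam", "nee ham");

        // Java 8+ equivalent of:
        // [line.replace("spam","eggs") for line in lines if line.startswith("nee")]
        List<String> out = lines.stream()
            .filter(l -> l.startsWith("nee"))
            .map(l -> l.replace("spam", "eggs"))
            .collect(Collectors.toList());

        System.out.println(out); // [nee eggs, nee ham]
    }
}
```

It closes some of the gap, though the Python version remains noticeably terser.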
Java has much better performance than Python and has way better tool support. Sometimes those things matter a lot and Java is the better language than Python for a task; I continue to use Java for some new projects despite liking Python a lot more. But as a language I think Python is superior for most things I find myself needing to accomplish.
I think this pair of articles by Philip J. Eby does a great job discussing the differences between the two languages (mostly about philosophy/mentality rather than specific language features).
Python is Not Java
Java is Not Python, either
One key difference in Python is significant whitespace. This puts a lot of people off - me too for a long time - but once you get going it seems natural and makes much more sense than ;s everywhere.
From a personal perspective, Python has the following benefits over Java:
No Checked Exceptions
Optional Arguments
Much less boilerplate and less verbose generally
Other than those, this page on the Python Wiki is a good place to look with lots of links to interesting articles.
With Jython you can have both. It's only at Python 2.2, but still very useful if you need an embedded interpreter that has access to the Java runtime.
Apart from what Eli Courtwright said:
I find iterators in Python more concise. You can use for i in something, and it works with pretty much everything. Yeah, Java has gotten better since 1.5, but for example you can iterate through a string in python with this same construct.
Introspection: in Python you can get runtime information about an object or a module: its symbols, methods, or even its docstrings. You can also instantiate classes dynamically. Java has some of this, but usually in Java it takes half a page of code to get an instance of a class, whereas in Python it is about 3 lines. And as far as I know, the docstrings feature is not available in Java.
