Tracking method implementation changes in class bytecode - java

I have the bytecode of some abstract project (let's call it The Project), of its every class, inside some Kotlin code; each class's bytecode is stored as a ByteArray. The task is to tell which specific methods in each class have been modified from build to build of The Project. In other words, there are two ByteArrays of the same class of The Project, but they belong to different versions of it, and I need to compare them accurately. A simple example. Let's assume we have a trivial class:
class Rst {
    fun getjson(): String {
        abc("""ss""")
        return "jsonValid"
    }

    public fun abc(s: String) {
        println(s)
    }
}
Its bytecode is stored in oldByteCode. Now some changes have happened to the class:
class Rst {
    fun getjson(): String {
        newMethod("""ss""")
        return "someOtherValue"
    }

    public fun newMethod(s: String) {
        println("it's not abc anymore!")
    }
}
Its bytecode is stored in newByteCode.
That's the main goal: compare oldByteCode to newByteCode.
Here we have the following changes:
the getjson() method has been changed;
the abc() method has been removed;
newMethod() has been created.
So a method counts as "changed" if its signature remains the same but its body differs. If the signature differs, it is already a different method.
Now back to the actual problem. I have to know every method's exact status from its bytecode. What I have at the moment is the JaCoCo analyzer, which parses class bytecode into "bundles". In these bundles I have a hierarchy of packages, classes and methods, but only with their signatures, so I can't tell whether a method's body has any changes. I can only track signature differences.
Are there any tools or libraries to split class bytecode into its methods' bytecode? With those I could, for example, calculate hashes and compare them. Maybe the ASM library has something for that?
Any ideas are welcome.
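For what it's worth, ASM's tree API can do the mechanical part of the splitting. Here is a minimal Java sketch (assuming the asm, asm-tree and asm-util artifacts are available) that dumps every method's instructions into a string keyed by name plus descriptor; those strings could then be hashed or diffed:

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.tree.ClassNode;
import org.objectweb.asm.tree.MethodNode;
import org.objectweb.asm.util.Textifier;
import org.objectweb.asm.util.TraceMethodVisitor;

import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class MethodSplitter {
    /** Returns a map from "name + descriptor" to a textual dump of that method's instructions. */
    public static Map<String, String> methodBodies(byte[] classBytes) {
        ClassNode cn = new ClassNode();
        // SKIP_DEBUG drops line numbers and local variable names, which would
        // otherwise produce spurious differences between builds.
        new ClassReader(classBytes).accept(cn, ClassReader.SKIP_DEBUG);
        Map<String, String> bodies = new HashMap<>();
        for (MethodNode mn : cn.methods) {
            Textifier textifier = new Textifier();
            mn.accept(new TraceMethodVisitor(textifier));
            StringWriter out = new StringWriter();
            textifier.print(new PrintWriter(out));
            bodies.put(mn.name + mn.desc, out.toString());
        }
        return bodies;
    }
}

Comparing the two maps would show abc() missing, newMethod() added, and a differing dump for getjson(). As the answer below explains, though, identical sources do not guarantee identical dumps across compiler runs, so equal hashes are a hint rather than proof.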

TL;DR: your approach of just comparing bytecode or even hashes won't lead to a reliable solution; in fact, there is no solution with a reasonable effort for this kind of problem at all.
I don't know how much of it applies to the Kotlin compiler, but as elaborated in Is the creation of Java class files deterministic?, Java compilers are not required to produce identical bytecode even if the same version is used to compile exactly the same source code. While they may have an implementation that tries to be as deterministic as possible, things change when looking at different versions or alternative implementations, as explained in Do different Java Compilers (where the vendor is different) produce different bytecode.
Even when we assume that the Kotlin compiler is outstandingly deterministic, even across versions, it can’t ignore the JVM evolution. E.g. the removal of the jsr/ret instructions could not be ignored by any compiler, even when trying to be conservative. But it’s rather likely that it will incorporate other improvements as well, even when not being forced¹.
So in short, even when the entire source code did not change, it’s not a safe bet to assume that the compiled form has to stay the same. Even with an explicitly deterministic compiler we would have to be prepared for changes when recompiling with newer versions.
Even worse, if one method changes, it may have an impact on the compiled form of others, as instructions refer to items of a constant pool whenever constants or linkage information are needed and these indices may change, depending on how the other methods use the constant pool. There’s also an optimized form for certain instructions when accessing one of the first 255 pool indices, so changes in the numbering may require changing the form of the instruction. This in turn may have an impact on other instructions, e.g. switch instructions have padding bytes, depending on their byte code position.
On the other hand, a simple change of a constant value used in only one method may have no impact on the method's bytecode at all, if the new constant happens to end up at the same place in the pool as the old constant.
So, to determine whether the code of two methods actually does the same thing, there is no way around parsing the instructions and understanding their meaning to some degree. Comparing just bytes or hashes won't work.
¹ to name some non-mandatory changes, the compilation of class literals changed, likewise string concatenation changed from using StringBuffer to use StringBuilder and changed again to use StringConcatFactory, the use of getClass() for intrinsic null checks changed to requireNonNull(…), etc. A compiler for a different language doesn’t have to follow, but no-one wants to be left behind…
There are also bugs to fix, like obsolete instructions, which no compiler would keep just to stay deterministic.

Related

Do private functions use more or less computer resources than public ones?

Computer resources being RAM, processing power, and disk space. I am just curious, even though it is more or less by a tiny itty-bitty amount.
It could, in theory, be a hair faster in some cases. In practice, they're equally fast.
Non-static, non-private methods are invoked using the invokevirtual bytecode op. This opcode requires the JVM to dynamically resolve the actual target method: if you have a call that's statically compiled to AbstractList::contains, should that resolve to ArrayList::contains, or LinkedList::contains, etc.? What's more, the JVM can't just reuse the result of this resolution for next time; what if the next time that myList.contains(val) gets called, it's on a different implementation? So, the JVM has to do at least some amount of checking, roughly per-invocation, for non-private methods.
Private methods can't be overridden, and they're invoked using invokespecial. This opcode is used for various kinds of method calls that you can resolve just once, and then never change: constructors, calls to super methods, etc. For instance, if I'm in ArrayList::add and I call super.add(value) (which doesn't happen there, but let's pretend it did), then the compiler can know for sure that this refers to AbstractList::add, since a class's super class can't ever change.
So, in very rough terms, an invokevirtual call requires resolving the method and then invoking it, while an invokespecial call doesn't require resolving the method (after the first time it's called -- you have to resolve everything at least once!).
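A quick way to see the two opcodes is to compile a toy class and inspect it with javap -c. The class below is purely illustrative; the exact constant-pool indices vary, and newer javac versions (with Java 11+ nestmates) may even choose a different opcode for the private call:

public class Calls {
    private int secret()  { return 1; }  // cannot be overridden
    public  int visible() { return 2; }  // subject to virtual dispatch

    int both() {
        // javap -c Calls has traditionally shown something like:
        //   invokespecial #2  // Method secret:()I
        //   invokevirtual #3  // Method visible:()I
        return secret() + visible();
    }
}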
This is covered in the JVM spec, section 5.4.3:
Resolution of the symbolic reference of one occurrence of an invokedynamic instruction does not imply that the same symbolic reference is considered resolved for any other invokedynamic instruction.
For all other instructions above, resolution of the symbolic reference of one occurrence of an instruction does imply that the same symbolic reference is considered resolved for any other non-invokedynamic instruction.
(emphasis in original)
Okay, now for the "but you won't notice the difference" part. The JVM is heavily optimized for virtual calls. It can do things like detecting that a certain site always sees an ArrayList specifically, and so "staticify" the List::add call to actually be ArrayList::add. To do this, it needs to verify that the incoming object really is the expected ArrayList, but that's very cheap; and if some earlier method call has already done that work in this method, it doesn't need to happen again. This is called a monomorphic call site: even though the code is technically polymorphic, in practice the list only has one form.
The JVM optimizes monomorphic call sites, and even bimorphic call sites (for instance, the list is always an ArrayList or a LinkedList, never anything else). Once it sees three forms, it has to use a full polymorphic dispatch, which is slower. But then again, at that point you're comparing apples to oranges: a non-private, polymorphic call to a private call that's monomorphic by definition. It's more fair to compare the two kinds of monomorphic calls (virtual and private), and in that case you'll probably find that the difference is minuscule, if it's even detectable.
I just did a quick JMH benchmark to compare (a) accessing a field directly, (b) accessing it via a public getter and (c) accessing it via a private getter. All three took the same amount of time. Of course, uber-micro benchmarks are very hard to get right, because the JIT can do such wonderful things with optimizations. Then again, that's kind of the point: The JIT does such wonderful things with optimizations that public and private methods are just as fast.
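The benchmark was essentially the shape below; this is a sketch rather than the exact code I ran, with made-up names, but it shows the three cases (returning the value from each @Benchmark method is what stops the JIT from eliminating the work):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class GetterBenchmark {
    int value = 42;

    private int privateGetter() { return value; }
    public int publicGetter()   { return value; }

    @Benchmark
    public int directField() { return value; }                // (a) direct field access

    @Benchmark
    public int viaPublicGetter() { return publicGetter(); }   // (b) public getter

    @Benchmark
    public int viaPrivateGetter() { return privateGetter(); } // (c) private getter
}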
Do private functions use more or less computer resources than public ones?
No. The JVM uses the same resources regardless of the access modifier on individual fields or methods.
But there is a far better reason to prefer private (or protected) besides resource utilization; namely encapsulation. Also, I highly recommend you read The Developer Insight Series: Part 1 - Write Dumb Code.
I am just curious, even though it is more or less by a tiny itty-bitty amount.
While it is good to be curious ... if you start taking this kind of thing into account when you are programming, then:
you are liable to waste a lot of time looking for micro-optimizations that are not needed,
your code is liable to be unmaintainable because you are sacrificing good design principles, and
you even risk making your code less efficient* than it would be if you didn't optimize.
* It can go like this. 1) You spend a lot of time tweaking your code to run fast on your test platform. 2) When you run on the production platform, you find that the hardware gives you different performance characteristics. 3) You upgrade the Java installation, and the new JVM's JIT compiler optimizes your code differently, or it has a bunch of new optimizations that are inhibited by your tweaks. 4) When you run your code on real-world workloads, you discover that the assumptions that were the basis for your tweaking are invalid.

What happen if I manually changed the bytecode before running it?

I am a little bit curious about what happens if I manually change something in the bytecode before execution. For instance, suppose I assign an int variable to a byte variable without casting, or remove a semicolon from somewhere in the program, or anything else that normally leads to a compile-time error. As I know, all compile-time errors are checked by the compiler before the .class file is produced. So what happens when I change the bytecode manually after successfully compiling a program? Is there any mechanism to handle this? If not, how does the program behave after execution?
EDIT: As Hot Licks, Darksonn and manouti have already given correct, satisfying answers, I'll just summarize for those readers who are seeking an answer to this type of question:
Every Java virtual machine has a class-file verifier, which ensures that loaded class files have a proper internal structure. If the class-file verifier discovers a problem with a class file, it throws an exception. Because a class file is just a sequence of binary data, a virtual machine can't know whether a particular class file was generated by a well-meaning Java compiler or by shady crackers bent on compromising the integrity of the virtual machine. As a consequence, all JVM implementations have a class-file verifier that can be invoked on untrusted classes, to make sure the classes are safe to use.
Refer to this for more details.
You certainly can use a hex editor (eg, the free "HDD Hex Editor Neo") or some other tool to modify the bytes of a Java .class file. But obviously, you must do so in a way that maintains the file's "integrity" (tables all in correct format, etc). Furthermore (and much trickier), any modification you make must pass muster by the JVM's "verifier", which essentially rechecks everything that javac verified while compiling the program.
The verification process occurs during class loading and is quite complex. Basically, a data flow analysis is done on each procedure to assure that only the correct data types can "reach" a point where the data type is assumed. Eg, you can't change a load operation to load a reference to a HashMap onto the "stack" when the eventual user of the loaded reference will be assuming it's a String. (But enumerating all the checks the verifier does would be a major task in itself. I can't remember half of them, even though I wrote the verifier for the IBM iSeries JVM.)
(If you're asking if one can "jailbreak" a Java .class file to introduce code that does unauthorized things, the answer is no.)
You will most likely get a java.lang.VerifyError:
Thrown when the "verifier" detects that a class file, though well formed, contains some sort of internal inconsistency or security problem.
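If you would rather provoke a VerifyError deliberately than hex-edit a class file, the sketch below uses the ASM library (my choice for illustration, not something the question requires) to define a method declared to return int that ends in a plain void return; linking the class should fail verification:

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class VerifyErrorDemo {
    public static void main(String[] args) throws Exception {
        ClassWriter cw = new ClassWriter(0); // no automatic stack/frame computation
        cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Broken", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(
                Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "answer", "()I", null, null);
        mv.visitCode();
        mv.visitInsn(Opcodes.RETURN); // void return in a method declared to return int
        mv.visitMaxs(0, 0);
        mv.visitEnd();
        cw.visitEnd();

        byte[] bytes = cw.toByteArray();
        ClassLoader loader = new ClassLoader() {
            { defineClass("Broken", bytes, 0, bytes.length); }
        };
        Class.forName("Broken", true, loader); // linking runs the verifier: VerifyError
    }
}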
You can certainly do this, and there are even tools to make it easier, like http://set.ee/jbe/. The Java runtime will run your modified bytecode just as it would run the bytecode emitted by the compiler. What you're describing is a Java-specific case of a binary patch.
The semicolon example wouldn't be an issue, since semicolons are only for the convenience of the compiler and don't appear in the bytecode.
Either the bytecode executes normally and performs the instructions given, or the JVM rejects them.
I played around with programming directly in Java bytecode some time ago using Jasmin, and I noticed some things.
If the bytecode you edited it into makes sense, it will of course run as expected. However, there are some bytecode patterns that are rejected with a VerifyError.
For the specific example of out-of-bounds access, you can compile code with an out-of-bounds access just fine. It will get you an ArrayIndexOutOfBoundsException at runtime.
int[] arr = new int[20];
for (int i = 0; i < 100; i++) {
    arr[i] = i;
}
However you can construct bytecode that is more fundamentally flawed than that. To give an example I'll explain some things first.
Java bytecode works with a stack, and instructions work on the top elements of that stack.
The stack naturally has different sizes at different places in the program, but sometimes you might use a goto in the bytecode such that the stack looks different depending on how you reached that point.
The stack might contain object, int; then you store the object in an object array and the int in an int array. But if somewhere else in the bytecode you reach the same point via a goto with the stack containing int, object, that would result in an int being passed to an object array and vice versa.
This is just one example of the things that could happen to make your bytecode fundamentally flawed. The JVM detects these kinds of flaws when the class is loaded at runtime, and emits a VerifyError if something doesn't work.
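To make that goto example concrete, here is a sketch that generates exactly such a flawed method, using the ASM library rather than Jasmin (an assumption on my part; any bytecode tool would do). One branch leaves a reference on the stack, the other an int, and both meet at the same label:

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.Label;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class BadMergeDemo {
    public static void main(String[] args) throws Exception {
        ClassWriter cw = new ClassWriter(0);
        // V1_5 keeps the class old enough that no stack map frames are required
        cw.visit(Opcodes.V1_5, Opcodes.ACC_PUBLIC, "BadMerge", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(
                Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC, "m", "(Z)V", null, null);
        Label otherPath = new Label(), merge = new Label();
        mv.visitCode();
        mv.visitVarInsn(Opcodes.ILOAD, 0);
        mv.visitJumpInsn(Opcodes.IFEQ, otherPath);
        mv.visitInsn(Opcodes.ACONST_NULL);     // this path pushes a reference...
        mv.visitJumpInsn(Opcodes.GOTO, merge);
        mv.visitLabel(otherPath);
        mv.visitInsn(Opcodes.ICONST_0);        // ...this one pushes an int
        mv.visitLabel(merge);                  // incompatible stacks meet here
        mv.visitInsn(Opcodes.POP);
        mv.visitInsn(Opcodes.RETURN);
        mv.visitMaxs(1, 1);
        mv.visitEnd();
        cw.visitEnd();

        byte[] bytes = cw.toByteArray();
        ClassLoader loader = new ClassLoader() {
            { defineClass("BadMerge", bytes, 0, bytes.length); }
        };
        Class.forName("BadMerge", true, loader); // expect java.lang.VerifyError
    }
}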

Java inheritance not recognised in reflection

I generally oppose extension since it creates a very strong connection between classes, which is easy to accidentally break.
However, I finally thought I'd found a reasonable case for it: I want to optionally use a compressed version of a file type in an existing system. The compressed version would be almost as quick as the uncompressed one and would have exactly the same methods available (i.e. read and write); the only difference would be in the representation on disk. Therefore, I had the compressed version extend the uncompressed version, so that either kind of file could be used just by optionally instantiating one type or the other.
public class CompressedSpecialFile extends SpecialFile { ... }

SpecialFile sf;
if (useCompression) {
    sf = new CompressedSpecialFile();
} else {
    sf = new SpecialFile();
}
However, at a later point in the program, we use reflection:
Object[] values = new Object[]{ sf, param1, param2, ... };
Class<?> myclass = Class.forName(algorithmName);
Class<?>[] classes = ...; // created by calling getClass() on each object in values
Constructor<?> constructor = myclass.getConstructor(classes);
Algorithm algorithm = (Algorithm) constructor.newInstance(values);
This all worked fine, but now the myclass.getConstructor call throws a NoSuchMethodException, since the runtime type of the SpecialFile is CompressedSpecialFile.
However, I thought that was how extension is supposed to work - since CompressedSpecialFile extends SpecialFile, any parameter accepting a SpecialFile should accept a CompressedSpecialFile. Is this an error in Java's reflection, or a failure of my understanding?
Hmm, the response to this bug report seems to indicate that this is intentional.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4301875
We cannot make this change for compatibility reasons. Furthermore, we would expect that getConstructor should behave analogously to getDeclaredMethod, which also requires an exact match, thus it does not make sense to change one without changing the other. It would be possible to add an additional suite of methods that differed only in the way in which the argument types were matched, however.
There are certainly cases where we might want to apply at runtime during reflection the same overload-resolution algorithm used statically by the compiler, i.e., in a debugger. It is not difficult to implement this functionality with the existing API, however, so the case for adding this functionality to core reflection is weak.
That bug report was closed as a duplicate of the following one, which provides a bit more implementation detail:
http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=1b08c721077da9fffffffff1e9a6465911b4e?bug_id=4287725
Work Around
Users of getMethod must be precise in identifying the Class passed to the argument.
Evaluation
The essence of this request is that the user would like for Class.getMethod to apply the same overloading rules as the compiler does. I think this is a reasonable request, as I see a need for this arising frequently in certain kinds of reflective programs, such as debuggers and scripting interpreters, and it would be helpful to have a standard implementation so that everybody gets it right. For compatibility, however, the behavior of the existing Class.getMethod should be left alone, and a new method defined. There is a case for leaving this functionality out on the basis of footprint, as it can be implemented using existing APIs, albeit somewhat inefficiently. See also 4401287.
Consensus appears to be that we should provide overload resolution in reflection. Exactly when such functionality is provided would depend largely on interest and potential uses.
For compatibility reasons, the Class.get(Declared)+{Method,Constructor} implementation should not change; a new method should be introduced. The specification for these methods does need to be modified to define "match". See bug 4651775.
You can keep digging into those referenced bugs and the actual links I provided (where there's discussion as well as possible workarounds), but I think that gets at the reasoning (though I don't know why a new method that mirrors Java's overload resolution in reflection has still not been implemented).
In terms of workarounds, I suppose that for the one-level-deep version of inheritance, you could just call getSuperclass() on each class whose name is that of the extending class, but that's extremely inelegant and tied to using it only on your classes implementing in the prescribed manner. Very kludgy. I'll try to look for another option; see the sketch below for a more general one.
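A rough sketch of that more general workaround: instead of relying on getConstructor's exact matching, scan getConstructors() yourself and accept any constructor whose parameter types are assignable from the argument types. Note this is my own illustration, and it does not implement the compiler's full overload-resolution rules (it takes the first compatible match rather than the most specific one):

import java.lang.reflect.Constructor;

public final class ReflectionUtil {
    /** Returns the first public constructor compatible with the given argument types. */
    public static Constructor<?> findCompatibleConstructor(Class<?> type, Class<?>... argTypes)
            throws NoSuchMethodException {
        for (Constructor<?> c : type.getConstructors()) {
            Class<?>[] params = c.getParameterTypes();
            if (params.length != argTypes.length) {
                continue;
            }
            boolean compatible = true;
            for (int i = 0; i < params.length; i++) {
                // unlike getConstructor, allow subtypes (e.g. CompressedSpecialFile
                // where the constructor declares SpecialFile)
                if (!params[i].isAssignableFrom(argTypes[i])) {
                    compatible = false;
                    break;
                }
            }
            if (compatible) {
                return c;
            }
        }
        throw new NoSuchMethodException(type.getName());
    }
}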

Java super-tuning, a few questions

Before I ask my question can I please ask not to get a lecture about optimising for no reason.
Consider the following questions purely academic.
I've been thinking about the efficiency of accesses between root (ie often used and often accessing each other) classes in Java, but this applies to most OO languages/compilers. The fastest way (I'm guessing) that you could access something in Java would be a static final reference. Theoretically, since that reference is available during loading, a good JIT compiler would remove the need to do any reference lookup to access the variable and point any accesses to that variable straight to a constant address. Perhaps for security reasons it doesn't work that way anyway, but bear with me...
Say I've decided that there are some order of operations problems or some arguments to pass at startup that means I can't have a static final reference, even if I were to go to the trouble of having each class construct the other as is recommended to get Java classes to have static final references to each other. Another reason I might not want to do this would be... oh, say, just for example, that I was providing platform specific implementations of some of these classes. ;-)
Now I'm left with two obvious choices. I can have my classes know about each other with a static reference (on some system hub class), which is set after constructing all classes (during which I mandate that they cannot access each other yet, thus doing away with order of operations problems at least during construction). On the other hand, the classes could have instance final references to each other, were I now to decide that sorting out the order of operations was important or could be made the responsibility of the person passing the args - or more to the point, providing platform specific implementations of these classes we want to have referencing each other.
A static variable means you don't have to look up the location of the variable with respect to the class it belongs to, saving you one operation. A final variable means you don't have to look up the value at all, but it does have to belong to your class, so you save 'one operation'. OK, I know I'm really handwaving now!
Then something else occurred to me: I could have static final stub classes, kind of like a wacky interface where each call was relegated to an 'impl' which can just extend the stub. The performance hit then would be the double function call required to run the functions and possibly I guess you can't declare your methods final anymore. I hypothesised that perhaps those could be inlined if they were appropriately declared, then gave up as I realised I would then have to think about whether or not the references to the 'impl's could be made static, or final, or...
So which of the three would turn out fastest? :-)
Any other thoughts on lowering frequent-access overheads or even other ways of hinting performance to the JIT compiler?
UPDATE: After running several hours of tests of various things and reading http://www.ibm.com/developerworks/java/library/j-jtp02225.html I've found that most things you would normally look at when tuning, e.g. C++, go out the window completely with the JIT compiler. I've seen it run 30 seconds of calculations once, twice, and on the third (and subsequent) runs decide "Hey, you aren't reading the result of that calculation, so I'm not running it!".
FWIW, you can test data structures, and using a microbenchmark I was able to develop an ArrayList implementation that was more performant for my needs. The access patterns must have been random enough to keep the compiler guessing, but it still worked out how to better implement a generic-ified growing array with my simpler and more tuned code.
As far as the test here was concerned, I simply could not get a benchmark result! My simple test of calling a function and reading a variable from a final vs. non-final object reference revealed more about the JIT than about the JVM's access patterns. Unbelievably, calling the same function on the same object at different places in the method changed the time taken by a factor of FOUR!
As the guy in the IBM article says, the only way to test an optimisation is in-situ.
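In hindsight, that "you aren't reading the result" behaviour is classic dead-code elimination, which is exactly what JMH's Blackhole exists to defeat. A minimal sketch, assuming JMH is on the classpath:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class DeadCodeDemo {
    @Benchmark
    public void broken() {
        Math.log(42.0); // result unused: the JIT is free to drop the call entirely
    }

    @Benchmark
    public void fixed(Blackhole bh) {
        bh.consume(Math.log(42.0)); // consuming the result keeps the computation alive
    }
}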
Thanks to everyone who pointed me along the way.
It's worth noting that static fields are stored in a special per-class object which contains the static fields for that class. Using static fields instead of object fields is unlikely to be any faster.
See the update, I answered my own question by doing some benchmarking, and found that there are far greater gains in unexpected areas and that performance for simple operations like referencing members is comparable on most modern systems where performance is limited more by memory bandwidth than CPU cycles.
Assuming you found a way to reliably profile your application, keep in mind that it will all go out the window should you switch to another JDK implementation (IBM to Sun to OpenJDK, etc.), or even upgrade the version of your existing JVM.
The reason you are having trouble, and would likely have different results with different JVM implementations, lies in the Java spec: it explicitly states that it does not define optimizations and leaves it to each implementation to optimize (or not) in any way, so long as execution behavior is unchanged by the optimization.

How to identify if an object returned was created during the execution of a method - Java

Original Question: Given a method I would like to determine if an object returned is created within the execution of that method. What sort of static analysis can or should I use?
Reworked Question: Given a method, I would like to determine whether an object created in that method may be returned by that method. So, if I go through and add all instantiations of the return type within that method to a set, is there an analysis that will tell me, for each member of the set, whether it may or may not be returned? Additionally, would it be possible not to limit the set to a single method, but to include all methods called by the original method, to account for delegation?
This is not specific to any invocation.
It looks like method escape analysis may be the answer.
Thanks everyone for your suggestions.
Your question seems to be either a simple "reaching" analysis ("does a new value reach a return statement?"), if you are interested in any invocation and only a method-local new creates the value; or, if you need to know whether any invocation can return a new value from any subcomputation, you need to compute the possible call graph and determine whether any called function can return a new value, or pass a new value from a called function up to its parent.
There are a number of Java static analysis frameworks.
SOOT is a byte-code based analysis framework. You could probably implement your static query using this.
The DMS Software Reengineering Toolkit is a generic engine for building custom analyzers and transformation tools. It has a full Java front end, and computes various useful base analyses (def/use chains, call graph) on source code. It can process class files but presently only to get type information.
If you wanted a dynamic analysis, either by itself or as a way to tighten up the static analysis, DMS can be used to instrument the source code in arbitrary ways by inserting code to track allocations.
I'm not sure if this would work for your circumstances, but one simple approach would be to populate a newly added 'instantiatedTime' field in the constructor of the object and compare that with the time the method call was made. This assumes you have access to the source for the object in question.
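A minimal sketch of that idea; TrackedObject and methodUnderTest() are hypothetical stand-ins:

public class TimestampCheck {
    static class TrackedObject {
        final long instantiatedTime = System.nanoTime(); // recorded in the constructor
    }

    static TrackedObject methodUnderTest() {
        return new TrackedObject(); // may or may not create a fresh instance
    }

    public static void main(String[] args) {
        long callTime = System.nanoTime();
        TrackedObject result = methodUnderTest();
        // if the field was set after the call began, the object was created inside the call
        System.out.println("created during call: " + (result.instantiatedTime >= callTime));
    }
}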
Are you sure static analysis is the right tool for the job? Static analysis can give you a result in some cases but not in all.
When running the JVM under a debugger, it assigns objects increasing object IDs, which you can fetch via System.identityHashCode(Object o). You can use this fact to build a test case that creates an object (the checkpoint) and then calls the method. If the returned object has an id greater than the checkpoint id, then you know the object was created in the method.
Disclaimer: this is observed behaviour under a debugger, under Windows XP.
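In code, the checkpoint trick looks roughly like this (someMethod() is a stand-in for the method under test, and the same disclaimer applies: this relies on unspecified, debugger-observed behaviour):

public class CheckpointTest {
    public static void main(String[] args) {
        Object checkpoint = new Object(); // id recorded just before the call
        Object returned = someMethod();
        if (System.identityHashCode(returned) > System.identityHashCode(checkpoint)) {
            System.out.println("returned object was (probably) created inside someMethod()");
        }
    }

    // hypothetical stand-in for the method being analysed
    static Object someMethod() {
        return new Object();
    }
}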
I have a feeling that this is impossible to do without a specially modified JVM. Here are some approaches ... and why they won't work in general.
The Static Analysis approach will work in simple cases. However, something like this is likely to stump any current generation static analysis tool:
// Bad design alert ... don't try this at home!
public class LazySingletonStringFactory {
    private String s;

    public String create(String initial) {
        if (s == null) {
            s = new String(initial);
        }
        return s;
    }
}
For a static analyser to figure out whether a given call to LazySingletonStringFactory.create(...) returns a newly created String, it must figure out that it has not been called previously. The Halting Problem tells us that this is theoretically impossible in some cases, and in practice this is beyond the "state of the art".
The IdentityHashCode approach may work in a single-threaded application that completes without the garbage collector running. However, if the GC runs you will get incorrect answers. And if you have multiple threads, then (depending on the JVM) you may find that objects are allocated in different "spaces", resulting in an object "id" creation sequence that is no longer monotonic across all threads.
The Code Instrumentation approach works if you can modify the code of the classes you are concerned about, whether by direct source-code changes, annotation-based code injection, or some kind of bytecode processing. However, in general you cannot do these things for all classes.
(I'm not aware of any other approaches that are materially different to the above three ... but feel free to suggest them as a comment.)
Not sure of a reliable way to do this statically.
You could use:
AspectJ or a similar AOP library could be used to instrument classes and increment a counter on object creation (see the sketch after this list)
a custom classloader (or JVM agent, but classloader is easier) could be used similarly
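For the first option, an annotation-style AspectJ sketch might look like the following, where com.example.Widget is a hypothetical class whose instantiations we want to count:

import java.util.concurrent.atomic.AtomicLong;

import org.aspectj.lang.annotation.After;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class AllocationCounter {
    private static final AtomicLong COUNT = new AtomicLong();

    // advice runs after every constructor execution of the hypothetical Widget class
    @After("execution(com.example.Widget.new(..))")
    public void countAllocation() {
        COUNT.incrementAndGet();
    }

    public static long allocations() {
        return COUNT.get();
    }
}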
