Is there any tool that can remove debug info from Java .class files, just like /usr/bin/strip can from C/C++ object files on Linux?
EDIT: I liked both Thilo's and Peter Mmm's answers: Peter's was short and to the point, exposing my ignorance of what ships with the JDK; Thilo's ProGuard suggestion is something I'll definitely be checking out anyway for all the extra features it appears to provide. Thank you Thilo and Peter!
ProGuard (which the Android SDK, for example, ships with to reduce code size) can do all kinds of manipulation to shrink JAR files:
Evaluate constant expressions.
Remove unnecessary field accesses and method calls.
Remove unnecessary branches.
Remove unnecessary comparisons and instanceof tests.
Remove unused code blocks.
Merge identical code blocks.
Reduce variable allocation.
Remove write-only fields and unused method parameters.
Inline constant fields, method parameters, and return values.
Inline methods that are short or only called once.
Simplify tail recursion calls.
Merge classes and interfaces.
Make methods private, static, and final when possible.
Make classes static and final when possible.
Replace interfaces that have single implementations.
Perform over 200 peephole optimizations, like replacing ...*2 by ...<<1.
Optionally remove logging code.
They do not mention removing debug info in that list, but I guess they can also do that.
Update: Yes, indeed:
By default, compiled bytecode still contains a lot of debugging information: source file names, line numbers, field names, method names, argument names, variable names, etc. This information makes it straightforward to decompile the bytecode and reverse-engineer entire programs. Sometimes, this is not desirable. Obfuscators such as ProGuard can remove the debugging information and replace all names by meaningless character sequences, making it much harder to reverse-engineer the code. It further compacts the code as a bonus. The program remains functionally equivalent, except for the class names, method names, and line numbers given in exception stack traces.
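For reference, if you only need the stripping (not the renaming), a small ASM-based tool can do the equivalent of strip for .class files; the following is a minimal sketch, assuming the ASM library is on the classpath and that rewriting the file in place is acceptable:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;

// Rewrites a .class file in place without its debug attributes
// (SourceFile, LineNumberTable, LocalVariableTable, ...).
public class StripDebug {
    public static void main(String[] args) throws Exception {
        Path classFile = Paths.get(args[0]);
        ClassReader reader = new ClassReader(Files.readAllBytes(classFile));
        ClassWriter writer = new ClassWriter(0);
        reader.accept(writer, ClassReader.SKIP_DEBUG); // copy everything except debug info
        Files.write(classFile, writer.toByteArray());
    }
}
Class and member names are left intact by this; ProGuard's obfuscation step goes further and renames them as well.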
I have the bytecode of every class of some abstract project (let's call it The Project) inside some Kotlin code, and each class's bytecode is stored as a ByteArray; the task is to tell which specific methods in each class have been modified from build to build of The Project. In other words, there are two ByteArrays of the same class of The Project, but they belong to different versions of it, and I need to compare them accurately. A simple example: let's assume we have a trivial class:
class Rst {
    fun getjson(): String {
        abc("""ss""");
        return "jsonValid"
    }
    public fun abc(s: String) {
        println(s)
    }
}
Its bytecode is stored in oldByteCode. Now some changes have happened to the class:
class Rst {
    fun getjson(): String {
        abc("""ss""");
        return "someOtherValue"
    }
    public fun newMethod(s: String) {
        println("it's not abc anymore!")
    }
}
Its bytecode is stored in newByteCode.
That's the main goal: compare oldByteCode to newByteCode.
Here we have the following changes:
the getjson() method has been changed;
the abc() method has been removed;
newMethod() has been created.
So a method counts as changed if its signature remains the same but its body differs; if the signature changes, it's already a different method.
Now back to the actual problem. I have to determine every method's exact status from its bytecode. What I have at the moment is the JaCoCo analyzer, which parses class bytecode into "bundles". In these bundles I have a hierarchy of packages, classes, and methods, but only with their signatures, so I can't tell whether a method's body has any changes; I can only track signature differences.
Are there any tools or libraries to split a class's bytecode into its individual methods' bytecode? With those I could, for example, calculate hashes and compare them. Maybe the ASM library can help with that?
Any ideas are welcome.
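Regarding the ASM part of the question: ASM's tree API can at least split a class into per-method instruction listings. Below is a minimal Java sketch (assuming the asm, asm-tree and asm-util artifacts are on the classpath); note that the answer that follows explains why comparing such dumps is still not fully reliable:
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.tree.ClassNode;
import org.objectweb.asm.tree.MethodNode;
import org.objectweb.asm.util.Textifier;
import org.objectweb.asm.util.TraceMethodVisitor;

public class MethodDumps {
    // Maps "name + descriptor" to a textual dump of that method's instructions.
    static Map<String, String> dump(byte[] classBytes) {
        ClassNode cls = new ClassNode();
        new ClassReader(classBytes).accept(cls, ClassReader.SKIP_DEBUG);
        Map<String, String> result = new HashMap<>();
        for (MethodNode method : cls.methods) {
            Textifier textifier = new Textifier();
            method.accept(new TraceMethodVisitor(textifier));
            StringWriter out = new StringWriter();
            textifier.print(new PrintWriter(out));
            result.put(method.name + method.desc, out.toString());
        }
        return result;
    }
}
Since Textifier resolves constant-pool references into symbolic text, such dumps avoid the index-shifting problem described below, but they cannot compensate for a compiler emitting genuinely different instructions for unchanged source code.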
TL;DR: your approach of just comparing bytecode or even hashes won’t lead to a reliable solution; in fact, there is no solution to this kind of problem with reasonable effort at all.
I don’t know how much of this applies to the Kotlin compiler, but as elaborated in Is the creation of Java class files deterministic?, Java compilers are not required to produce identical bytecode even if the same version is used to compile exactly the same source code. While they may have an implementation that tries to be as deterministic as possible, things change when looking at different versions or alternative implementations, as explained in Do different Java Compilers (where the vendor is different) produce different bytecode.
Even when we assume that the Kotlin compiler is outstandingly deterministic, even across versions, it can’t ignore the JVM evolution. E.g. the removal of the jsr/ret instructions could not be ignored by any compiler, even when trying to be conservative. But it’s rather likely that it will incorporate other improvements as well, even when not being forced¹.
So in short, even when the entire source code did not change, it’s not a safe bet to assume that the compiled form has to stay the same. Even with an explicitly deterministic compiler we would have to be prepared for changes when recompiling with newer versions.
Even worse, if one method changes, it may have an impact on the compiled form of others, as instructions refer to items of a constant pool whenever constants or linkage information are needed and these indices may change, depending on how the other methods use the constant pool. There’s also an optimized form for certain instructions when accessing one of the first 255 pool indices, so changes in the numbering may require changing the form of the instruction. This in turn may have an impact on other instructions, e.g. switch instructions have padding bytes, depending on their byte code position.
On the other hand, a simple change of a constant value used in only one method may have no impact on the method’s bytecode at all, if the new constant happens to end up at the same place in the pool as the old constant.
So, to determine whether the code of two methods actually does the same thing, there is no way around parsing the instructions and understanding their meaning to some degree. Comparing just bytes or hashes won’t work.
¹ to name some non-mandatory changes, the compilation of class literals changed, likewise string concatenation changed from using StringBuffer to use StringBuilder and changed again to use StringConcatFactory, the use of getClass() for intrinsic null checks changed to requireNonNull(…), etc. A compiler for a different language doesn’t have to follow, but no-one wants to be left behind…
There are also bugs to fix, like obsolete instructions, which no compiler would keep just to stay deterministic.
Suppose I have a project structure that looks roughly like this:
{module-package}.webapp
module.gwt.xml
{module-package}.webapp.client
Client.java
UsedByClient.java
NotUsedByClient.java
And the module.gwt.xml file has:
<source path='client'/>
<entry-point class='{module-package}.webapp.client.Client'/>
When I compile this project using GWT, how much of the Java code will be compiled into JavaScript?
Is NotUsedByClient.java included, even though the entry point doesn't reference it?
Is UsedByClient.java fully or partially included? E.g. if it has method m() which isn't called by Client, will m be compiled or not?
The motivation is that, unfortunately, I'm working with a legacy codebase that has server-side code living alongside client-side code in the same package, and it would be some work to separate them. The server-side code isn't used by the client, but I'm concerned that GWT might compile it to JavaScript, where someone might notice it and try to reverse engineer it.
All of the above and more happen:
unreferenced classes are removed
unreferenced methods and fields are removed
constants may be inlined
various operations on constants (like !, ==, +, &&, etc) may be simplified (based on some field always being null, or true, etc)
un-overridden methods may be made final...
...and final methods may be made static in certain situations (leading to smaller callsites, and no "this" reference inside that method)...
and small, frequently called static methods may be inlined
And this process repeats, with even more optimizations that I skipped, to further assist in removing code, both big and small. At the end, all classes, methods, fields, and local variables are renamed in a way to further reduce output size, including reordering methods in the output so that they are ordered by length, letting gzip more efficiently compress your content on the way to the client.
So while some aspects of your code could be reverse engineered (just like any machine code could be reverse engineered), code which isn't referenced won't be available, and code which is may not even be readable.
I somehow managed to stumble upon a 'deep dive' video presentation on the compiler by one of the GWT engineers which has an explanation: https://youtu.be/n-P4RWbXAT8?t=865
Key points:
One of the compiler optimizations is called Pruner and it will "Traverse all reachable code from entrypoint, delete everything else (uses ControlFlowAnalyzer)"
It is actually an essential optimization because without it, all GWT apps would need to include gwt-user.jar in its entirety, which would greatly increase app sizes.
So it seems the GWT compiler does indeed remove unused code.
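To make the pruning concrete, here is a hedged sketch reusing the class names from the question (serverOnlyHelper() is a hypothetical extra method):
import com.google.gwt.core.client.EntryPoint;

public class Client implements EntryPoint {
    @Override
    public void onModuleLoad() {
        new UsedByClient().m();  // reachable from the entry point -> compiled to JavaScript
    }
}

class UsedByClient {
    void m() { /* compiled, because it is called above */ }
    void serverOnlyHelper() { /* never reached from onModuleLoad() -> pruned */ }
}

class NotUsedByClient {
    // never referenced from reachable code -> the whole class is pruned
}
If m() were never called from code reachable from onModuleLoad(), it would be pruned as well.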
Goal
Detect where comparisons between variables and copies of variables are made
Inject code near the line where the operation happens
The purpose of the code: every time the class is run, make a counter increase
General purpose: count the number of comparisons and copies made after an execution with certain parameters
2 options
Note: I always have a .java file to begin with
1) Edit the Java file
Find comparisons with a regex and inject pieces of code near the line
Then compile the class (my application uses JavaCompiler)
2) Use ASM bytecode engineering
Detect the events I want to track and inject pieces into the bytecode
Then use the (already compiled but modified) class
My Question
What is the best/cleanest way? Is there a better way to do this?
If you go for the Java route, you don't want to use regexes -- you want a real Java parser. So that may influence your decision. Mind, the Oracle JDK includes one, as part of the internal classes that implement the Java compiler, so you don't actually have to write one yourself if you don't want to. But decoding the Oracle AST is not a 5-minute task either. And, of course, using that is not portable, if that's important.
If you go the ASM route, the bytecode will initially be easier to analyze, since the semantics are a lot simpler. Whether the simplicity of analyses outweighs the unfamiliarity is unknown in terms of net time to your solution. In the end, in terms of generated code, neither is "better".
There is an apparent simplicity in just looking at generated Java source code and "knowing" that What You See Is What You Get, versus doing primitive dumps of class files for debugging, etc., but all that apparent simplicity is there because of your existing familiarity with the Java language. Once you spend some time dredging through bytecode, that, too, will become comfortable. It's just a question of whether it's worth your time to get there in the first place.
Generally it all depends on how comfortable you are with either option and how critical the performance aspect is. The bytecode manipulation will be much faster and somewhat simpler, but you'll have to understand how bytecode works and how to use the ASM framework.
Intercepting variable access is probably one of the simplest use cases for ASM. You could find a few more complex scenarios in this AOSD'07 paper.
Here is simplified code for intercepting variable access:
ClassReader cr = ...;
ClassWriter cw = ...;
cr.accept(new ClassVisitor(Opcodes.ASM9, cw) { // or the ASM API version you build against
    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
        return new MethodVisitor(Opcodes.ASM9, mv) {
            @Override
            public void visitVarInsn(int opcode, int var) {
                if (opcode == Opcodes.ALOAD) { // loading an object var
                    // ... insert method call
                }
                super.visitVarInsn(opcode, var);
            }
        };
    }
}, 0);
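To actually make a counter increase at the interesting points, the injected code is a single static call. The fragment below belongs inside the returned MethodVisitor above; the Counters class and its comparisons() method are hypothetical names you would supply yourself, and Label is org.objectweb.asm.Label. As an example, it counts integer comparisons by instrumenting the IF_ICMP* jump instructions:
@Override
public void visitJumpInsn(int opcode, Label label) {
    if (opcode >= Opcodes.IF_ICMPEQ && opcode <= Opcodes.IF_ICMPLE) {
        // Injects: Counters.comparisons();
        super.visitMethodInsn(Opcodes.INVOKESTATIC,
                "mypkg/Counters", "comparisons", "()V", false);
    }
    super.visitJumpInsn(opcode, label);
}
Because a static ()V call neither pushes nor pops operands, the method's max stack size is unchanged, so the pass-through ClassWriter setup above still works.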
If it were me, I'd probably use the ASM option.
If you need a tutorial on ASM, I stumbled upon this user-written tutorial: click here.
I am trying to obfuscate a Spring web application using ProGuard. I want to keep class and method names, especially the ones used as Spring beans.
But ProGuard renames local variables to local[class name]; for example, if I have a User object, it renames the local variable to localUser. It also renames method parameters to param[class name]; for example, if I have a User parameter, the variable name in the obfuscated method becomes paramUser. So the obfuscated code stays pretty readable.
I want to prevent ProGuard from using the local and param prefixes together with class names. For example, I want it to use x1 instead of localUser. I checked the configuration options but could not find how to do that.
ProGuard manual > Troubleshooting > Unexpected observations after processing > Variable names not being obfuscated
If the names of the local variables and parameters in your obfuscated code don't look obfuscated, because they suspiciously resemble the names of their types, it's probably because the decompiler that you are using is coming up with those names. ProGuard's obfuscation step does remove the original names entirely, unless you explicitly keep the LocalVariableTable or LocalVariableTypeTable attributes.
The variable x1 isn't giving away any more information than paramUser, given that the viewed code would be:
public void foo(User x1)
{
...
}
Unless your methods are really long, it wouldn't be hard for anyone reading the method to remember that it's a parameter of type User, which is all that paramUser is saying. Yes, there's a bit of a difference in readability but I wouldn't say it's worth worrying about, personally - if someone's investing enough time to decompile your code to start with, a very small difference like that would be unlikely to deter them. If the class names were obfuscated as well, that makes a bigger difference IMO.
The naming scheme you are describing looks like the names regenerated by JD when the LocalVariableTable has been skipped by the Java compiler (see javac -g:vars). For me, this is not a bug in ProGuard.
To make the obfuscation of your applications more effective:
try to replace "protected" with "private" whenever possible: ProGuard will replace the class, method, and field names with short names,
try to use anonymous classes in your code,
and try to split your algorithms across a large number of classes to make the execution flow harder to understand.
I have two java classes that are very similar in semantics but differ in syntax. The differences are minor, like -
Changes in variable names,
Changes in position of some statements (with no dependent lines in between),
Extra imports, etc.
I need to compare these two classes to prove that they are indeed semantically identical. The same needs to be done for a large number of java file pairs.
The first approach, reading from the two files and comparing the lines with logic to deal with the differences mentioned above, seems inefficient. Is there some other way that I can achieve this task? Any helpful APIs out there?
Compile both of the classes without debug information and then decompile them back to source files. The decompiled files should be a lot more similar than the original source files.
You can improve this further by running some optimizations on the compiled files. For example, you can use ProGuard with just shrinking enabled to remove unused code.
Changes in position of some statements can be hard to detect though.
If you want to examine the changes in the code, try Araxis Merge or WinMerge.
But if you want logical differences, I am afraid you might have to do it manually.
I would advise using one of these tools to look for textual changes and then looking for logical differences.
There are a lot of similarity checkers out there, and so far there is no perfect tool for this. Each has its own advantages and disadvantages. The approaches generally fall into two categories: token-based or tree-based.
Token-based similarity checking is usually done with regular expressions, but other approaches are possible. In one of my projects at university, we developed one utilizing an alignment strategy from the bioinformatics field. The main disadvantage of this technique shows up when the sizes of the two sources aren't more or less equal.
Tree-based checking is more like a compiler, so using some compilation techniques it is normally possible (well, more or less) to check for this. The tree-based approach has the disadvantage of exponential comparison complexity.
Comparing line by line won't work. I think you may need to use a parser. I would suggest that you take a look at ANTLR. It should have a Java grammar where you could put your actions which will do the comparison.
As far as I know there's no way to compare the semantics of two Java classes. Take for example the following two methods:
public String m1(String a, int b) { ... }
and
public String m2(String x, int y) { ... }
Apart from changes in variable and method names, their signatures are the same: same return type, and same input types. However, this is no guarantee that the two methods are semantically equivalent. For example, m1 could return a string consisting of the first b characters of a, while m2 could return a string consisting of y repetitions of x. As you can see, although only variables and names change, the semantics of the two methods are totally different.
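To make that concrete, possible bodies could be (hypothetical, purely for illustration; String.repeat requires Java 11+):
public String m1(String a, int b) {
    return a.substring(0, b);  // first b characters of a
}

public String m2(String x, int y) {
    return x.repeat(y);        // y repetitions of x
}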
I don't see an easy way out for your problem. You can perhaps make some assumptions and try the following approach:
assume that the method names in the two classes are the same
write test cases (for example with JUnit) for all the methods in the first class
run the test cases on the second class
ensure that the second class does not have other (untested) methods (for example using reflection)
This approach gives you an idea about equivalent semantics, but it makes strong assumptions.
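The last step of that list could be a simple reflection check; a minimal sketch (it only compares declared method signatures and says nothing about behaviour, which is what the test cases are for):
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class ApiComparator {
    static Set<String> signatures(Class<?> c) {
        return Arrays.stream(c.getDeclaredMethods())
                .map((Method m) -> m.getName()
                        + Arrays.toString(m.getParameterTypes())
                        + ":" + m.getReturnType().getName())
                .collect(Collectors.toSet());
    }

    // True if both classes declare exactly the same method signatures.
    static boolean sameDeclaredMethods(Class<?> a, Class<?> b) {
        return signatures(a).equals(signatures(b));
    }
}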
As a final remark, let me add that specifying the semantics of programs is an interesting and open research topic. Some interesting developments in this area include research on Semantic Web Services. A widely adopted approach to give machine-processable semantics to programs is that of specifying their IOPE: Input and Output types (as in the Java methods above), and their Preconditions and Effects. Preconditions are essentially logical conditions that must hold true for successfully invoking the program, and Effects are formal descriptions of the changes (in the state of the world) caused by the successful execution of the program. Even with IOPE there are a lot of problems ... which I skip in this short description.