Java and Python bytecode is considerably easier to decompile than the compiled machine code generated by a C/C++ compiler.
I am unable to find a convincing answer as to why the information from the -g option is insufficient for decompilation but sufficient for debugging.
What is the extra stuff contained in Python/Java bytecode that makes decompilation easy?
Here are some of the reasons for this:
Java and Python bytecodes are relatively simple and high-level, whereas the instruction set of some CPUs (think x86) is fiendishly complicated.
The bytecodes closely mimic the structure of the language for which they've been designed.
When generating bytecode, Java and Python do very little by way of optimization. This results in bytecode that corresponds closely to the structure of the original source code (see the javap sketch after this list). A good optimizing C or C++ compiler, by contrast, is capable of producing assembly that's far removed from the original source code.
There are few Java and Python compilers, and many C and C++ compilers. It's easier to produce a high-quality decompiler if you are targeting a single known compiler (or a small set of known compilers).
Python and Java are relatively simple languages compared to C++ (this point doesn't apply to C).
C++ templates present many challenges to quality decompilation (this point also doesn't apply to C).
The C/C++ preprocessor: macros are expanded away before compilation, so a decompiler cannot recover them.
In Python, there is a one-to-one relationship between source files and bytecode files. In Java, the relationship is one source file to one or more bytecode files. In C and C++, the relationship is many-to-many, with a lot of overlap on the source front (think headers).
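To make the first few points concrete, here is a tiny (hypothetical) Java method together with the bytecode that javap -c prints for it, abbreviated and annotated; the stack-machine instructions map almost one-to-one onto the source expression, which is exactly the kind of structure a decompiler can exploit:
class Demo {
    int add(int a, int b) {
        return a + b;
    }
}

// Output of `javap -c Demo` for add (abbreviated; annotations are mine):
//   int add(int, int);
//     Code:
//        0: iload_1     // push the first argument
//        1: iload_2     // push the second argument
//        2: iadd        // add the top two stack values
//        3: ireturn     // return the result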
I am unable to find a convincing answer as to why the information from the -g option is insufficient for decompilation but sufficient for debugging.
The debugging information basically contains only a mapping between addresses in the generated code and line numbers in the source files. The debugger does not need to decompile code - it just shows you the original sources. If the source files are missing, the debugger won't magically show them.
That said, the presence of debug info does make decompilation easier. If the debug info includes the layout of the types used and the function prototypes, the decompiler can use it and produce a much more precise decompilation. In many cases, however, the result will still differ from the original source.
For example, here's a function decompiled with the Hex-Rays decompiler without using the debug info:
int __stdcall sub_4050A0(int a1)
{
  int result; // eax@1

  result = a1;
  if ( *(_BYTE *)(a1 + 12) )
  {
    result = sub_404600(*(_DWORD *)a1);
    *(_BYTE *)(a1 + 12) = 0;
  }
  return result;
}
Since it does not know the type of a1, the accesses to its fields are represented as additions and casts.
And here's the same function after the symbol file has been loaded:
void __thiscall mytree::write_page(mytree *this, PAGE *src)
{
  if ( src->isChanged )
  {
    cache::set_changed(this->cache, src->baseAddr);
    src->isChanged = 0;
  }
}
You can see that it's been improved quite a lot.
As for why decompiling bytecode is usually easier, in addition to NPE's answer check also this.
Some processors, like x86 ones, have instructions of variable length. If control is passed into the middle (= anywhere after the first byte) of an instruction, that can be a valid instruction (or several instructions) too. This makes it hard to unambiguously disassemble machine code. C/C++ code can exploit this feature.
On some processors and OSes it is possible to execute data as if it were code and use code as if it were data. This makes it hard to unambiguously separate the two. And, again, this is what C/C++ programs can often do easily.
On some processors and OSes it's easy to generate code on the fly and execute it and it's possible to modify the existing code at run time. This too contributes to ambiguities in decompiling code. And C/C++ programs can often do this as well.
EDIT: Also, some CPUs have multiple different encodings for the same instruction. For example, x86 CPUs have 2 instructions mov reg, reg/mem and mov reg/mem, reg. These let you move data between a register and a memory location (in either direction) and between two registers. Both of these instructions can be used to transfer data between two registers, but they have different encodings. If the program somehow relies on a particular encoding (e.g. for the purpose of validating its integrity via checksums), then from the disassembly like mov eax, ebx you wouldn't be able to tell which of the two mov instructions it originally was and so if you attempt to reassemble the disassembly, you may break the program.
You can use the debugger to debug a program with or without debug/symbol information. This information only makes it easier for the human to navigate the code and data since many (but not necessarily all) routines and variables can be identified and shown using their names and types and not just raw addresses and raw typeless data.
I'm guessing that the various bytecodes are less ambiguous and more restricted in what they can do and that's what makes it easier to decompile those.
I am a little bit curious about what happens if I manually change something in the bytecode before execution. For instance, suppose I assign an int variable to a byte variable without a cast, or remove a semicolon somewhere in the program, or do anything else that would normally lead to a compile-time error. As I understand it, all compile-time errors are checked by the compiler before the .class file is produced. So what happens if I compile a program successfully and then change the bytecode manually? Is there any mechanism to handle this, and if not, how does the program behave when executed?
EDIT:
Hot Licks, Darksonn and manouti have already given correct and satisfying answers. I'll just summarize here for readers looking for an answer to this type of question:
Every Java virtual machine has a class-file verifier, which ensures that loaded class files have a proper internal structure. If the class-file verifier discovers a problem with a class file, it throws an exception. Because a class file is just a sequence of binary data, a virtual machine can't know whether a particular class file was generated by a well-meaning Java compiler or by shady crackers bent on compromising the integrity of the virtual machine. As a consequence, all JVM implementations have a class-file verifier that can be invoked on untrusted classes, to make sure the classes are safe to use.
Refer to this for more details.
You certainly can use a hex editor (eg, the free "HDD Hex Editor Neo") or some other tool to modify the bytes of a Java .class file. But obviously, you must do so in a way that maintains the file's "integrity" (tables all in correct format, etc). Furthermore (and much trickier), any modification you make must pass muster by the JVM's "verifier", which essentially rechecks everything that javac verified while compiling the program.
The verification process occurs during class loading and is quite complex. Basically, a data flow analysis is done on each procedure to assure that only the correct data types can "reach" a point where the data type is assumed. Eg, you can't change a load operation to load a reference to a HashMap onto the "stack" when the eventual user of the loaded reference will be assuming it's a String. (But enumerating all the checks the verifier does would be a major task in itself. I can't remember half of them, even though I wrote the verifier for the IBM iSeries JVM.)
(If you're asking if one can "jailbreak" a Java .class file to introduce code that does unauthorized things, the answer is no.)
You will most likely get a java.lang.VerifyError:
Thrown when the "verifier" detects that a class file, though well formed, contains some sort of internal inconsistency or security problem.
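As a rough, hypothetical illustration of that, the sketch below reads an already compiled class, flips a single byte in the middle of the file, and asks the JVM to define it. Depending on which byte gets hit, loading typically fails with a ClassFormatError or a VerifyError (both are subclasses of LinkageError); the file name Foo.class and the class name Foo are placeholders:
import java.nio.file.Files;
import java.nio.file.Paths;

public class TamperDemo extends ClassLoader {
    public static void main(String[] args) throws Exception {
        // Read a previously compiled class and corrupt one byte in the middle.
        byte[] bytes = Files.readAllBytes(Paths.get("Foo.class"));
        bytes[bytes.length / 2] ^= 0x01;

        try {
            // defineClass() parses and structurally checks the class file;
            // bytecode verification itself happens when the class is linked.
            new TamperDemo().defineClass("Foo", bytes, 0, bytes.length);
            System.out.println("The JVM accepted the tampered class (this time).");
        } catch (LinkageError e) {
            System.out.println("Rejected by the JVM: " + e);
        }
    }
}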
You can certainly do this, and there are even tools to make it easier, like http://set.ee/jbe/. The Java runtime will run your modified bytecode just as it would run the bytecode emitted by the compiler. What you're describing is a Java-specific case of a binary patch.
The semicolon example wouldn't be an issue, since semicolons are only for the convenience of the compiler and don't appear in the bytecode.
Either the bytecode executes normally and performs the instructions given, or the JVM rejects it.
I played around with programming directly in Java bytecode some time ago using Jasmin, and I noticed some things.
If the bytecode you edit it into makes sense, it will of course run as expected. However, there are some bytecode patterns that are rejected with a VerifyError.
For the specific example of out-of-bounds access, you can compile such code just fine; it will simply get you an ArrayIndexOutOfBoundsException at runtime.
int[] arr = new int[20];
for (int i = 0; i < 100; i++) {
    arr[i] = i;
}
However you can construct bytecode that is more fundamentally flawed than that. To give an example I'll explain some things first.
Java bytecode works with a stack, and instructions operate on the top elements of that stack.
The stack naturally has different sizes at different places in the program, but you can use a goto in the bytecode such that the stack looks different depending on how you reached that point.
The stack might contain object, int at the point where you store the object in an object array and the int in an int array. Then, from somewhere else in that bytecode, you jump there with a goto, but now your stack contains int, object, which would result in an int being passed to an object array and vice versa.
This is just one example of the things that can make your bytecode fundamentally flawed. The JVM detects these kinds of flaws when the class is loaded at runtime and emits a VerifyError if something doesn't work.
I just started learning Scala, and I'm having the time of my life. Today I also just heard about Scala for Android, and I'm totally hooked.
However, considering that Scala is kind of like a superset of Java, in that it is Java with more stuff included, I am wondering exactly how Scala code works on Android.
Also, would the performance of an Android application be impacted if it is written in Scala? That is, is there any extra work needed to interpret Scala code?
To expand a little bit on Aneesh's answer -- yes, there is no extra work to interpret Scala bytecode, because it is exactly the same bits and pieces as Java bytecode.
Note that there is also a Java bytecode => Dalvik bytecode conversion step when you run code on Android.
But using the same bricks, one person can build a bike shed and another a town hall. For example, because the language encourages immutability, Scala generates a lot of short-lived objects. For mature JVMs like HotSpot this has not been a big deal for about a decade, but for Dalvik it is a problem (prior to recent versions, object pooling and tight reuse of already created objects was one of the most common performance tips, even for Java).
Next, writing val isn't the same as writing final Foo bar = .... Internally, a val is represented as a field plus a getter (unless you prefix it with private[this], in which case it is translated to a plain final field), and a var is translated into a field plus a getter plus a setter. Why is this important?
Old versions of Android (prior to 2.2) had no JIT at all, so this amounts to roughly a 3x-7x penalty compared to direct field access. And finally, while Google instructs you to avoid inner classes and prefer packages instead, Scala will create a lot of inner classes even if you don't write any. Consider this code:
object Foo extends App {
  List(1,2,3,4)
    .map(x => x * 2)
    .filter(x => x % 3 == 0)
    .foreach(print)
}
How many inner classes will be created? You could say none, but if you run scalac you will see:
Foo$$anonfun$1.class // Map
Foo$$anonfun$2.class // Filter
Foo$$anonfun$3.class // Foreach
Foo$.class // Companion object class, because you've used `object`
Foo$delayedInit$body.class // Delayed init functionality that is used by App trait
Foo.class // Actual class
So there will be some penalty, especially if you write idiomatic Scala code with a lot of immutability and syntactic sugar. How much depends on your deployment (do you target newer devices?) and your actual code patterns (you can always go back to Java, or at least write less idiomatic code in performance-critical spots), and some of the problems mentioned here, such as the last one, will be addressed by the language itself in future versions.
See also Stack Overflow question Is using Scala on Android worth it? Is there a lot of overhead? Problems? for possible problems that you may encounter during development.
I've found this paper showing some benchmarks between the two languages:
http://cse.aalto.fi/en/midcom-serveattachmentguid-1e3619151995344619111e3935b577b50548b758b75/denti_ngmast.pdf
I haven't read the entire article, but in the end it seems they give the point to Java:
In conclusion, we feel that Scala will not play a major role in the mobile application development in the future, due to the importance of keeping a low energy consumption on the device. The strong point of the Scala language is the way its components scale, which is not of major importance in mobile devices where applications tend to not scale at all and stay small.
Credits to Mattia Denti and Jukka K. Nurminen from Aalto University.
While Scala "will just work", because it is compiled to byte code just like Java, there are performance issues to consider. Idiomatic Scala code tends to create a lot more temporary objects, and the Dalvik VM doesn't take too kindly to these.
Here are some things you should be careful with when using Scala on Android:
Vectors, as they can be wasteful (always take up arrays of 32 items, even if you hold just one item)
Method chaining on collections - you should use .view wherever possible, to avoid creating redundant collections.
Boxing - Scala will box your primitives in various cases, like using Generics, using Option[Int], and in some anonymous functions.
for loops can waste memory, consider replacing them with while loops in sensitive sections
Implicit conversions - calls like str.indexWhere(...) will allocate a wrapper object around the string, which can be wasteful.
Scala's Map will allocate an Option[V] every time you access a key. I've had to replace it with Java's HashMap on occasion.
Of course, you should only optimize the places that are bottlenecks after using a profiler.
You can read more about the suggestions above in this blog post:
http://blogs.microsoft.co.il/dorony/2014/10/07/scala-performance-tips-on-android/
The Scala compiler produces JVM bytecode. So basically, at the lower level, it's just like running Java, and performance will be equivalent to Java.
However, how the Scala compiler creates the byte code may have some impact on performance. I'd say since Scala is new and due to possible inefficiency in its byte code generation, Scala would be a bit slower than Java on Android, though not by a lot.
I know it's possible to do nice stuff with Reflection, such as invoking methods, or altering the values of fields. Is it possible to do heavier code modification, though, at runtime and programmatically?
For instance, if I have a method:
public void foo(){
    this.bar = 100;
}
Can I write a program that modifies the innards of this method, notices that it assigns a constant to a field, and turns it into the following:
public int baz = 100;
public void foo(){
    this.bar = baz;
}
Perhaps Java isn't really the language to do this kind of thing in - if not, I'm open to suggestions for languages that would allow me to basically reparse or inspect code in this way, and be able to alter it so precisely. I might be pipe dreaming here though, so please tell me if this is the case also.
Just adding a suggestion from a friend - Apache Commons' BCEL looks excellent:
http://commons.apache.org/bcel/manual.html
The Byte Code Engineering Library (Apache Commons BCEL™) is intended to give users a convenient way to analyze, create, and manipulate (binary) Java class files (those ending with .class). Classes are represented by objects which contain all the symbolic information of the given class: methods, fields and byte code instructions, in particular.
Such objects can be read from an existing file, be transformed by a program (e.g. a class loader at run-time) and written to a file again. An even more interesting application is the creation of classes from scratch at run-time. The Byte Code Engineering Library (BCEL) may be also useful if you want to learn about the Java Virtual Machine (JVM) and the format of Java .class files.
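As a minimal sketch of the "read from an existing file" part of that description (assuming the BCEL jar is on the classpath; Foo.class is a placeholder path), you can parse a compiled class and list its methods like this:
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;

public class ListMethods {
    public static void main(String[] args) throws Exception {
        // Parse a compiled class from disk into BCEL's object model.
        JavaClass clazz = new ClassParser("Foo.class").parse();
        for (Method m : clazz.getMethods()) {
            // Print each method's name and JVM-level descriptor, e.g. "main([Ljava/lang/String;)V"
            System.out.println(m.getName() + m.getSignature());
        }
    }
}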
You are looking for software that lets you do bytecode manipulation; there are several frameworks for this, but the two best known at the moment are:
ASM
javassist
When performing bytecode modifications at runtime on Java classes, keep the following in mind:
If you change a class's bytecode after the class has been loaded by a classloader, you'll have to find a way to reload its class definition (either through classloading tricks or by using hot-swap functionality).
If you change the class's interface (for example, by adding new methods or fields), you will only be able to reach the new members through reflection.
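As a small, hedged sketch of the Javassist route (the class name com.example.Foo and the method name foo are placeholders, not anything from the question), patching a method body and getting the modified bytes back looks roughly like this:
import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;

public class PatchWithJavassist {
    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass cc = pool.get("com.example.Foo");   // placeholder class name
        CtMethod m = cc.getDeclaredMethod("foo");   // placeholder method name

        // Prepend a statement to the method body; Javassist compiles the snippet for us.
        m.insertBefore("{ System.out.println(\"foo() called\"); }");

        byte[] patched = cc.toBytecode();  // the modified class as a byte array
        // cc.writeFile();                 // ...or write the patched .class back to disk
        System.out.println("Patched class is " + patched.length + " bytes");
    }
}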
It's probably fair to say that Java wasn't designed with this purpose in mind, but you can do it potentially. How and when depends a little on the ultimate aim of the exercise. A couple of options:
At the source code level, you can use the Java Compiler API to compile arbitrary code into a class file (which you can then load).
At the bytecode level, you can write an agent that installs a ClassFileTransformer to arbitrarily alter a class "on the fly" as it is loaded (see the sketch below). In practice, if you do this, you will probably also make use of a library such as BCEL (Bytecode Engineering Library) to make manipulating the class easier.
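A skeletal java.lang.instrument agent for the second option might look like the sketch below (the target class com/example/Foo is a placeholder). You package it in a jar whose manifest contains Premain-Class: MyAgent and start the JVM with -javaagent:agent.jar:
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class MyAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                if ("com/example/Foo".equals(className)) {   // placeholder target class
                    // Hand classfileBuffer to BCEL/ASM here and return the rewritten bytes.
                }
                return null;   // returning null leaves the class unchanged
            }
        });
    }
}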
You want to investigate program transformation systems (PTS), which provide general facilities for parsing and transforming languages at the source level. PTS provide rewrite rules that say in effect, "if you see this pattern, replace it by that pattern" using the surface syntax of the target language. This is done using full parsers so the rewrite rule really operates on language syntax and not text; such rewrite rules obviously won't attempt to modify code-like text in comments, unlike tools based on regexps.
Our DMS Software Reengineering Toolkit is one of these. It provides not only the usual parsing, AST building and prettyprinting (reproducing compilable source code complete with comments), but also supports symbol tables and control and data flow analysis. These are needed for almost any interesting transformations. DMS also has front ends for a variety of dialects of Java as well as many other languages.
Bytecode transformers exist because they are much easier to build; it is pretty easy to "parse" bytecode. Of course, you can't make permanent source changes with a bytecode transformer, so it is a lot less useful.
You mean like this?
String script1 = "println(\"OK!\");";
eval( script1 );
script1 += "println(\"... well, maybe NOT OK after all\");";
eval( script1 );
Output:
OK!
OK!
... well, maybe NOT OK after all
... use a scripting extension to Java. Groovy and other things like that would probably allow you to do what you want. I've written a scripting extension which integrates with Java through reflection almost seamlessly myself; contact me if you're interested in the details.
Goal
Detecting where comparisons between and copies of variables are made
Inject code near the line where the operation happens
The purpose of the code: every time the class is run, make a counter increase
General purpose: count the number of comparisons and copies made during execution with certain parameters
2 options
Note: I always have a .java file to begin with
1) Edit java file
Find comparisons with regex and inject pieces of code near the line
And then compile the class (My application uses JavaCompiler)
2) Use ASM bytecode engineering
Also detect where the events I want to track happen and inject pieces into the bytecode
And then use the (already compiled but modified) class
My Question
What is the best/cleanest way? Is there a better way to do this?
If you go for the Java route, you don't want to use regexes -- you want a real Java parser. So that may influence your decision. Mind, the Oracle JVM includes one as part of the internal private classes that implement the Java compiler, so you don't actually have to write one yourself if you don't want to. But decoding the Oracle AST is not a 5-minute task either. And, of course, using that is not portable, if that's important.
If you go the ASM route, the bytecode will initially be easier to analyze, since the semantics are a lot simpler. Whether the simplicity of analyses outweighs the unfamiliarity is unknown in terms of net time to your solution. In the end, in terms of generated code, neither is "better".
There is an apparent simplicity to just looking at generated Java source code and "knowing" that What You See Is What You Get, versus doing primitive dumps of class files for debugging and so on, but all that apparent simplicity exists because of your already existing comfort with the Java language. Once you spend some time dredging through bytecode, that too will become comfortable. It's just a question of whether it's worth the time to you to get there in the first place.
Generally it all depends on how comfortable you are with either option and how critical the performance aspect is. Bytecode manipulation will be much faster and somewhat simpler, but you'll have to understand how bytecode works and how to use the ASM framework.
Intercepting variable access is probably one of the simplest use cases for ASM. You could find a few more complex scenarios in this AOSD'07 paper.
Here is simplified code for intercepting variable access:
ClassReader cr = ...;
ClassWriter cw = ...;
cr.accept(new ClassVisitor(Opcodes.ASM5, cw) {
    // accept() takes a ClassVisitor; the MethodVisitor below sees each method's instructions
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String sig, String[] exc) {
        return new MethodVisitor(Opcodes.ASM5, super.visitMethod(access, name, desc, sig, exc)) {
            public void visitVarInsn(int opcode, int var) {
                if (opcode == ALOAD) { // loading an object variable
                    ... // insert method call
                }
                super.visitVarInsn(opcode, var);
            }
        };
    }
}, 0);
If it were me, I'd probably use the ASM option.
If you need a tutorial on ASM, I stumbled upon this user-written tutorial: click here
Say I want to take a java class file, disassemble it, tweak the java bytecode output, and then reassemble it again.
I need to rename a symbol in the constant pool table. I also don't have access to the source code, and using a decompiler seems like overkill for this. I'm not trying to optimize anything - java does a fine job at that.
Is there... a simple way to do this?
I've found several tools for either disassembly or reassembly, but none for both; or no pairs of tools which seem to use the same format for representing the bytecode in text.
Did you check the ASM API?
Here is a code sample (adapted from the official documentation) explaining how to modify a class bytecode:
ClassWriter cw = new ClassWriter(0);
ClassAdapter ca = new ClassAdapter(cw); // ca forwards all events to cw
// ca should modify the class data
ClassReader cr = new ClassReader("MyClass");
cr.accept(ca, 0);
byte[] b2 = cw.toByteArray(); // b2 represents the same class as MyClass, modified by ca
Then b2 can be stored in a .class file for future use. You can also use the method ClassLoader.defineClass(String,byte[],int,int) to load it if you define your own classloader.
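A minimal sketch of that custom class loader, assuming b2 was produced as above (the name MyClass must match the class name recorded inside the bytecode):
// Define a class directly from a byte array via a custom class loader.
class ByteArrayClassLoader extends ClassLoader {
    Class<?> defineFromBytes(String name, byte[] bytes) {
        return defineClass(name, bytes, 0, bytes.length);
    }
}

// Usage, after producing b2 as in the sample above:
Class<?> patched = new ByteArrayClassLoader().defineFromBytes("MyClass", b2);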
This question is a bit older now, but as I don't find this answered anywhere on stackoverflow, let me put it on record:
There's the standard jasper/jasmin combo that I have successfully used in the past:
jasper for disassembly to a jasmin-compatible format
jasmin which will reassemble jasper's output
The only annoyance with jasper is that it forgets to create a label for the switch default clause, and then jasmin will give you errors like
Main.j:391: JAS Error Label: LABEL0x48 has not been added to the code.
Which then means you have to go into the .j file and manually fix it; "javap -c" might assist you there. Because of that bug, I'd suggest you run jasper and immediately jasmin, before making any modifications, just to make sure the round trip works.
You can actually fix that label bug by applying this patch to jasper:
--- Code_Collection.java.orig   1999-06-14 14:10:44.000000000 +0000
+++ Code_Collection.java        2011-02-05 07:23:21.000000000 +0000
@@ -1210,6 +1210,7 @@
 -----------------------------------------------------------------------*/
   void getLabel(Code_Collection code) {
     for (int i = 0; i < count; i++) code.setLabel(pc+branch[i]);
+    code.setLabel(pc+tableDefault);
   }
 /*-----------------------------------------------------------------------
I submitted it to the author, but I got a feeling the project has not been worked on for many years, so I don't know if it'll get merged in.
Edit: Jasper with the above patch applied is now available at https://github.com/EugenDueck/Jasper
And then there's Eclipse Bytecode Outline, as described in this answer:
java bytecode editor?
Krakatau provides an open source disassembler and assembler that make this very easy. Krakatau is designed to be a replacement for Jasmin. It uses a Jasmin like syntax for backwards compatibility but extends the format to support all the obscure features in the classfile format and fix bugs in Jasmin. It also lets you easily disassemble, modify, and reassemble classes.
The only real downside to Krakatau is that it currently isn't documented very well. But if you have any questions, feel free to ask. (Disclosure: I wrote Krakatau.)
Wouldn't it just be easier to find the original source code, modify it, then recompile it? Or is this from some binary code you don't have the source to?
Pro tip: The source code to the built-in Java Class Library is available as part of the OpenJDK project, specifically in the OpenJDK 6 Source.
You are describing what modern compilers do already. In addition to that, most JVMs can (and try to) keep optimizing the byte code while the app is running.
Start by studying what existing compilers/JVMs are doing with the bytecode. The best case is that you can improve on the JVM's optimizer, which is possible but has a low probability, and either way you may be reinventing the wheel. The worst case is that your changes actually interfere with the runtime optimizer and cause overall performance to go down.
Study compilers and JVMs
Benchmark
Benchmark
Benchmark
[EDIT] Found a related post: Bytecode manipulation patterns