java disassemble reassemble - java

Say I want to take a java class file, disassemble it, tweak the java bytecode output, and then reassemble it again.
I need to rename a symbol in the constant pool table. I also don't have access to the source code, and using a decompiler seems like overkill for this. I'm not trying to optimize anything - java does a fine job at that.
Is there... a simple way to do this?
I've found several tools for either disassembly or reassembly, but none for both; or no pairs of tools which seem to use the same format for representing the bytecode in text.

Did you check the ASM API?
Here is a code sample (adapted from the official documentation) explaining how to modify a class bytecode:
ClassWriter cw = new ClassWriter(0);
ClassAdapter ca = new ClassAdapter(cw); // ca forwards all events to cw (ASM 3.x; replaced by ClassVisitor in ASM 4 and later)
// ca should modify the class data
ClassReader cr = new ClassReader("MyClass"); // looks MyClass up on the classpath; throws IOException
cr.accept(ca, 0);
byte[] b2 = cw.toByteArray(); // b2 represents the same class as MyClass, modified by ca
Then b2 can be stored in a .class file for future use. You can also use the method ClassLoader.defineClass(String,byte[],int,int) to load it if you define your own classloader.
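For the narrow case in the question (renaming one constant pool symbol), here is a minimal sketch that uses only the JDK rather than ASM. It walks the constant pool, swaps any CONSTANT_Utf8 entry that exactly equals the old name, and copies the rest of the file unchanged. The names ConstantPoolRenamer, renameUtf8, minimalClass and ByteLoader are made up for illustration, and minimalClass only hand-assembles a tiny class so the sketch has something to work on. Note the limits: an exact match does not touch descriptors such as LHello; that embed the name, and a string literal sharing the same Utf8 entry would be renamed as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ConstantPoolRenamer {

    /** Copies classBytes, replacing every CONSTANT_Utf8 entry equal to 'from' with 'to'. */
    static byte[] renameUtf8(byte[] classBytes, String from, String to) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(classBytes));
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(in.readInt());               // magic number 0xCAFEBABE
            out.writeInt(in.readInt());               // minor and major version
            int count = in.readUnsignedShort();       // constant_pool_count
            out.writeShort(count);
            for (int i = 1; i < count; i++) {         // entries are numbered 1..count-1
                int tag = in.readUnsignedByte();
                out.writeByte(tag);
                switch (tag) {
                    case 1:                           // Utf8: the only entry kind that holds text
                        String text = in.readUTF();
                        out.writeUTF(text.equals(from) ? to : text);
                        break;
                    case 7: case 8: case 16: case 19: case 20:
                        copy(in, out, 2); break;
                    case 15:
                        copy(in, out, 3); break;
                    case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                        copy(in, out, 4); break;
                    case 5: case 6:                   // Long and Double occupy two pool slots
                        copy(in, out, 8); i++; break;
                    default:
                        throw new IOException("Unknown constant pool tag " + tag);
                }
            }
            copy(in, out, in.available());            // everything after the pool refers to it by index
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void copy(DataInputStream in, DataOutputStream out, int n) throws IOException {
        byte[] bytes = new byte[n];
        in.readFully(bytes);
        out.write(bytes);
    }

    /** Hand-assembles an empty public class extending Object (demo input only). */
    static byte[] minimalClass(String internalName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(0xCAFEBABE);
            out.writeShort(0); out.writeShort(52);               // class file version of Java 8
            out.writeShort(5);                                   // four entries, so count is 5
            out.writeByte(7); out.writeShort(2);                 // #1 Class -> #2
            out.writeByte(1); out.writeUTF(internalName);        // #2 Utf8
            out.writeByte(7); out.writeShort(4);                 // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");  // #4 Utf8
            out.writeShort(0x0021);                              // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1); out.writeShort(3);                // this_class, super_class
            out.writeShort(0); out.writeShort(0);                // no interfaces, no fields
            out.writeShort(0); out.writeShort(0);                // no methods, no attributes
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Exposes the protected ClassLoader.defineClass(String, byte[], int, int). */
    static final class ByteLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }
}
```

Renaming "Hello" to "World" in minimalClass("Hello") and passing the result to ByteLoader.define("World", ...) yields a loadable class whose getName() is World. For real class files you would read the bytes from the .class file instead of calling minimalClass.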

This question is a bit older now, but as I don't find this answered anywhere on stackoverflow, let me put it on record:
There's the standard jasper/jasmin combo that I have successfully used in the past:
jasper for disassembly to a jasmin-compatible format
jasmin which will reassemble jasper's output
The only annoyance with jasper is that it forgets to create a label for the switch default clause, and then jasmin will give you errors like
Main.j:391: JAS Error Label: LABEL0x48 has not been added to the code.
Which then means you have to go into the .j file and fix it manually; "javap -c" might assist you there. Because of that bug, I'd suggest running jasper and then immediately jasmin, before making any modifications, just to make sure the round trip works.
You can actually fix that label bug by applying this patch to jasper:
--- Code_Collection.java.orig 1999-06-14 14:10:44.000000000 +0000
+++ Code_Collection.java 2011-02-05 07:23:21.000000000 +0000
@@ -1210,6 +1210,7 @@
-----------------------------------------------------------------------*/
void getLabel(Code_Collection code) {
for (int i = 0; i < count; i++) code.setLabel(pc+branch[i]);
+ code.setLabel(pc+tableDefault);
}
/*-----------------------------------------------------------------------
I submitted it to the author, but I have a feeling the project has not been worked on for many years, so I don't know if it'll get merged in.
Edit: Jasper with the above patch applied is now available at https://github.com/EugenDueck/Jasper
And then there's Eclipse Bytecode Outline, as described in this answer:
java bytecode editor?

Krakatau provides an open source disassembler and assembler that make this very easy. Krakatau is designed to be a replacement for Jasmin. It uses a Jasmin like syntax for backwards compatibility but extends the format to support all the obscure features in the classfile format and fix bugs in Jasmin. It also lets you easily disassemble, modify, and reassemble classes.
The only real downside to Krakatau is that it currently isn't documented very well. But if you have any questions, feel free to ask. (Disclosure: I wrote Krakatau).

Wouldn't it just be easier to find the original source code, modify it, then recompile it? Or is this from some binary code you don't have the source to?
Pro tip: The source code to the built-in Java Class Library is available as part of the OpenJDK project, specifically in the OpenJDK 6 Source.

You are describing what modern compilers do already. In addition to that, most JVMs can (and try to) keep optimizing the byte code while the app is running.
Start by studying what existing compilers and JVMs do with the byte code. The best case is that you improve on the JVM's optimizer, which is possible but unlikely, and either way you may be reinventing the wheel. The worst case is that your changes actually interfere with the runtime optimizer and cause overall performance to go down.
Study compilers and JVMs
Benchmark
Benchmark
Benchmark
[EDIT] Found a related post: Bytecode manipulation patterns

Related

Decompile C code with debug info?

Java and Python byte code are relatively easier to decompile than compiled machine code generated by a C/C++ compiler.
I am unable to find a convincing answer as to why the information from the -g option is insufficient for de-compilation, but sufficient for debugging?
What is the extra stuff contained in Python/Java byte code, that makes decompilation easy?
Here are some of the reasons for this:
Java and Python bytecodes are relatively simple and high-level, whereas the instruction set of some CPUs (think x86) is fiendishly complicated.
The bytecodes closely mimic the structure of the language for which they've been designed.
When generating bytecodes, Java and Python perform very little by way of optimization. This results in bytecodes that closely correspond to the structure of the original source code. A good optimizing C or C++ compiler is capable of producing assembly that's far removed from the original source code.
There are few Java and Python compilers, and many C and C++ compilers. It's easier to produce a high-quality decompiler if you are targeting a single known compiler (or a small set of known compilers).
Python and Java are relatively simple languages compared to C++ (this point doesn't apply to C).
C++ templates present many challenges to quality decompilation (this point also doesn't apply to C).
The C/C++ preprocessor.
In Python, there is a one-to-one relationship between source files and bytecode files. In Java, the relationship is one source file to one or more bytecode files. In C and C++, the relationship is many-to-many, with a lot of overlap on the source front (think headers).
I am unable to find a convincing answer as to why the information from the -g option is insufficient for de-compilation, but sufficient for debugging?
The debugging information basically contains only a mapping between the addresses in the generated code and the source file line numbers. The debugger does not need to decompile code - it just shows you the original sources. If the source files are missing, the debugger won't magically show them.
That said, presence of debugging info does make decompilation easier. If the debug info includes the layout of the used types and function prototypes, the decompiler can use it and provide a much more precise decompilation. In many cases, however, it will still likely be different from the original source.
For example, here's a function decompiled with the Hex-Rays decompiler without using the debug info:
int __stdcall sub_4050A0(int a1)
{
int result; // eax@1
result = a1;
if ( *(_BYTE *)(a1 + 12) )
{
result = sub_404600(*(_DWORD *)a1);
*(_BYTE *)(a1 + 12) = 0;
}
return result;
}
Since it does not know the type of a1, the accesses to its fields are represented as additions and casts.
And here's the same function after the symbol file has been loaded:
void __thiscall mytree::write_page(mytree *this, PAGE *src)
{
if ( src->isChanged )
{
cache::set_changed(this->cache, src->baseAddr);
src->isChanged = 0;
}
}
You can see that it's been improved quite a lot.
As for why decompiling bytecode is usually easier, in addition to NPE's answer check also this.
Some processors, like x86 ones, have instructions of variable length. If control is passed into the middle (= anywhere after the first byte) of an instruction, that can be a valid instruction (or several instructions) too. This makes it hard to unambiguously disassemble machine code. C/C++ code can exploit this feature.
On some processors and OSes it is possible to execute data as if it were code and use code as if it were data. This makes it hard to unambiguously separate the two. And, again, this is what C/C++ programs can often do easily.
On some processors and OSes it's easy to generate code on the fly and execute it and it's possible to modify the existing code at run time. This too contributes to ambiguities in decompiling code. And C/C++ programs can often do this as well.
EDIT: Also, some CPUs have multiple different encodings for the same instruction. For example, x86 CPUs have 2 instructions mov reg, reg/mem and mov reg/mem, reg. These let you move data between a register and a memory location (in either direction) and between two registers. Both of these instructions can be used to transfer data between two registers, but they have different encodings. If the program somehow relies on a particular encoding (e.g. for the purpose of validating its integrity via checksums), then from the disassembly like mov eax, ebx you wouldn't be able to tell which of the two mov instructions it originally was and so if you attempt to reassemble the disassembly, you may break the program.
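To make the two-encodings point concrete, here is a small sketch of the arithmetic for mov eax, ebx in 32-bit mode. Both forms are an opcode byte followed by a ModRM byte with mod=11 (register-direct); only the roles of the reg and r/m fields swap. The class and method names are made up for illustration.

```java
final class MovEncodings {
    // x86 register numbers as used in the ModRM byte.
    static final int EAX = 0;
    static final int EBX = 3;

    /** ModRM byte for register-direct operands: mod=11, then the reg field, then the r/m field. */
    static int modRm(int reg, int rm) {
        return 0xC0 | (reg << 3) | rm;
    }

    /** Opcode 0x89 (mov r/m32, r32): destination sits in r/m, source in reg. */
    static int[] viaOpcode89(int dst, int src) {
        return new int[] { 0x89, modRm(src, dst) };
    }

    /** Opcode 0x8B (mov r32, r/m32): destination sits in reg, source in r/m. */
    static int[] viaOpcode8B(int dst, int src) {
        return new int[] { 0x8B, modRm(dst, src) };
    }

    static String hex(int[] bytes) {
        return String.format("%02X %02X", bytes[0], bytes[1]);
    }

    public static void main(String[] args) {
        System.out.println(hex(viaOpcode89(EAX, EBX))); // 89 D8
        System.out.println(hex(viaOpcode8B(EAX, EBX))); // 8B C3
    }
}
```

Both byte pairs disassemble to the same text, so a disassemble/reassemble round trip has to pick one of them and may not pick the original.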
You can use the debugger to debug a program with or without debug/symbol information. This information only makes it easier for the human to navigate the code and data since many (but not necessarily all) routines and variables can be identified and shown using their names and types and not just raw addresses and raw typeless data.
I'm guessing that the various bytecodes are less ambiguous and more restricted in what they can do and that's what makes it easier to decompile those.

Programmatic code modification (e.g. variable extraction) in Java

I know it's possible to do nice stuff with Reflection, such as invoking methods, or altering the values of fields. Is it possible to do heavier code modification, though, at runtime and programmatically?
For instance, if I have a method:
public void foo(){
this.bar = 100;
}
Can I write a program that modifies the innards of this method, notices that it assigns a constant to a field, and turns it into the following:
public int baz = 100;
public void foo(){
this.bar = baz;
}
Perhaps Java isn't really the language to do this kind of thing in - if not, I'm open to suggestions for languages that would allow me to basically reparse or inspect code in this way, and be able to alter it so precisely. I might be pipe dreaming here though, so please tell me if this is the case also.
Just adding a suggestion from a friend - Apache Commons' BCEL looks excellent:
http://commons.apache.org/bcel/manual.html
The Byte Code Engineering Library (Apache Commons BCEL™) is intended to
give users a convenient way to analyze, create, and manipulate (binary)
Java class files (those ending with .class). Classes are represented by
objects which contain all the symbolic information of the given class:
methods, fields and byte code instructions, in particular.
Such objects can be read from an existing file, be transformed by a
program (e.g. a class loader at run-time) and written to a file again.
An even more interesting application is the creation of classes from
scratch at run-time. The Byte Code Engineering Library (BCEL) may be
also useful if you want to learn about the Java Virtual Machine (JVM)
and the format of Java .class files.
You are looking for software that allows you to do bytecode manipulation. There are several frameworks for this, but the two best known currently are:
ASM
javassist
When performing bytecode modifications at runtime in Java classes keep in mind the following:
If you change a class's bytecode after the class has been loaded by a classloader, you'll have to find a way to reload its class definition (either through classloading tricks, or using hotswap functionality).
If you change the class's interface (for example, by adding new methods or fields), you will only be able to reach the new members through reflection.
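The second point can be sketched with plain reflection. Code compiled against the original class cannot name members that were added later, so it looks them up by name at run time. Here the nested class Rewritten is only a stand-in for a class that gained the field baz and the method doubled through rewriting; all names are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

final class ReflectiveAccess {

    /** Stand-in for a class whose extra members were added by bytecode rewriting. */
    public static class Rewritten {
        public int baz = 100;
        public int doubled() { return baz * 2; }
    }

    /** Reads a public field by name, as a caller compiled before the field existed would. */
    static Object readField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getField(fieldName);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No accessible field " + fieldName, e);
        }
    }

    /** Invokes a public no-argument method by name. */
    static Object call(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Object instance = new Rewritten();
        System.out.println(readField(instance, "baz"));
        System.out.println(call(instance, "doubled"));
    }
}
```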
It's probably fair to say that Java wasn't designed with this purpose in mind, but you can do it potentially. How and when depends a little on the ultimate aim of the exercise. A couple of options:
At the source code level, you can use the Java Compiler API to
compile arbitrary code into a class file (which you can then load).
At the bytecode level, you can write an agent that installs a
ClassFileTransformer to arbitrarily alter a class "on the fly"
as it is loaded. In practice, if you do this, you will also probably
make use of a library such as BCEL (Bytecode Engineering
Library) to make manipulating the class easier.
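The agent option could look roughly like the sketch below. It only shows the plumbing from java.lang.instrument: the transformer returns null for classes it does not care about (which tells the JVM to keep the original bytes) and hands back a copy for the one class it targets. The class names LoggingAgent and TargetTransformer and the target com/example/Foo are assumptions, and the actual rewriting, which a library such as BCEL would do, is left as a comment.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

final class LoggingAgent {

    /** Transforms exactly one class, identified by its internal name (slashes, not dots). */
    static final class TargetTransformer implements ClassFileTransformer {
        private final String targetInternalName;

        TargetTransformer(String targetInternalName) {
            this.targetInternalName = targetInternalName;
        }

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (!targetInternalName.equals(className)) {
                return null; // null means: leave this class unchanged
            }
            byte[] modified = classfileBuffer.clone();
            // A real agent would rewrite 'modified' here, e.g. with BCEL.
            return modified;
        }
    }

    /** Called by the JVM before main() when started with -javaagent:agent.jar. */
    public static void premain(String agentArgs, Instrumentation instrumentation) {
        instrumentation.addTransformer(new TargetTransformer("com/example/Foo"));
    }
}
```

To run as an agent, the jar's manifest needs a Premain-Class entry naming the agent class, and the JVM is started with -javaagent pointing at that jar.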
You want to investigate program transformation systems (PTS), which provide general facilities for parsing and transforming languages at the source level. PTS provide rewrite rules that say in effect, "if you see this pattern, replace it by that pattern" using the surface syntax of the target language. This is done using full parsers so the rewrite rule really operates on language syntax and not text; such rewrite rules obviously won't attempt to modify code-like text in comments, unlike tools based on regexps.
Our DMS Software Reengineering Toolkit is one of these. It provides not only the usual parsing, AST building and prettyprinting (reproducing compilable source code complete with comments), but also supports symbol tables and control and data flow analysis. These are needed for almost any interesting transformations. DMS also has front ends for a variety of dialects of Java as well as many other languages.
Bytecode transformers exist because they are much easier to build; it is pretty easy to "parse" bytecode. Of course, you can't make permanent source changes with a bytecode transformer, so it is a lot less useful.
You mean like this?
String script1 = "println(\"OK!\");";
eval( script1 );
script1 += "println(\"... well, maybe NOT OK after all\");";
eval( script1 );
Output:
OK!
OK!
... well, maybe NOT OK after all
... use a scripting extension to Java. Groovy and other things like that would probably allow you to do what you want. I've written a scripting extension which integrates with Java through reflection almost seamlessly myself; contact me if you're interested in the details.

Is Compiling String as code possible?

I have an app that gets the content of an html file.
Lets say the text of the page is:
String[] arr = new String[] {"!","#","#"};
for (String str : arr) {
write(str);
}
Can I somehow compile this text and run the code within my app?
Thanks
Use Janino. It's a java runtime in-memory compiler. Way easier than BCEL and the likes.
From the homepage:
"What is Janino?
Janino is a super-small, super-fast Java™ compiler. Not only can it compile a set of source files to a set of class files like the JAVAC tool, but also can it compile a Java™ expression, block, class body or source file in memory, load the bytecode and execute it directly in the same JVM. Janino is not intended to be a development tool, but an embedded compiler for run-time compilation purposes...
You can use the javac compiler, the Java Compiler API, or the BeanShell library (or similar). You can compile it any number of ways, but none is terribly simple, which often leads to finding another way to solve your problem.
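As a rough sketch of the Java Compiler API route on a desktop JDK (not Android): wrap the text in a class, compile it into a temporary directory, load it, and run it. The wrapper class Snippet, the helper write(String) that collects output, and the names StringCompiler and compileAndRun are all assumptions made for this example. ToolProvider.getSystemJavaCompiler() returns null when no compiler is available, for example on a plain JRE.

```java
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class StringCompiler {

    /** A source file that lives in a String instead of on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Compiles 'body' as the inside of a method, runs it, and returns what it wrote. */
    static String compileAndRun(String body) {
        String source =
              "public class Snippet implements java.util.function.Supplier<String> {\n"
            + "  private final StringBuilder out = new StringBuilder();\n"
            + "  void write(String s) { out.append(s); }\n"
            + "  public String get() {\n" + body + "\n    return out.toString();\n  }\n"
            + "}\n";
        try {
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system Java compiler available");
            }
            Path outputDir = Files.createTempDirectory("snippet");
            List<String> options = Arrays.asList("-d", outputDir.toString());
            boolean ok = compiler.getTask(null, null, null, options, null,
                    Collections.singletonList(new StringSource("Snippet", source))).call();
            if (!ok) {
                throw new IllegalArgumentException("Compilation failed");
            }
            try (URLClassLoader loader =
                     new URLClassLoader(new URL[] { outputDir.toUri().toURL() })) {
                Object snippet = loader.loadClass("Snippet")
                        .getDeclaredConstructor().newInstance();
                return String.valueOf(((Supplier<?>) snippet).get());
            }
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Passing the loop from the question as the body returns the string !##. Anything compiled this way runs with the full privileges of the host application, so text fetched from a web page should not be fed to it unchecked.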
Instead of generating source and compiling it, it's common to generate byte code directly using ASM, Javassist, BCEL or the like.
This appears to be the same as
for(char ch: "!##".toCharArray())
write(ch);
which is likely to be the same as
write("!##");
Since the question is tagged android:
The answers posted so far only apply to the “standard” JVM, not to Android's Dalvik VM. In principle, it is possible on Android too. I don't know if there's an existing Java compiler that you can embed, but you would probably generate the final Dalvik bytecode using dexmaker. It may be possible to combine an existing Java compiler with dexmaker.
But please think twice before attempting anything like this, and be very careful. The last thing you want is a way for an attacker to execute arbitrary code on your user's hardware.
You can try javassist, it's not full Java though.
This is not usually that hard to do, but I have to ask: can you give more detail on exactly what it is you are trying to accomplish? I do this type of thing all the time. This is just another example of getting information from the user and using it somewhere else in your code. Since you're using Java, maybe look at the String API http://docs.oracle.com/javase/6/docs/api/java/lang/String.html and the string tokenizer http://docs.oracle.com/javase/6/docs/api/index.html?java/lang/package-summary.html
Now you can break the string down into single values, one word or other value at a time. From there you can use methods such as isNaN() from the Float or Double class to determine whether it is a number, a string, or whatever it is you're testing for. Once you know what you're dealing with, you can reconstruct the data in a usable form.
Note: if you want to use the values as numbers, use the Float(String value) constructor, i.e. Float x = new Float(myString)
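A minimal sketch of that tokenize-and-classify idea follows. It swaps in Float.parseFloat with a NumberFormatException check, because isNaN() only tests an already-parsed floating-point value and cannot tell whether a string is numeric. The names TokenClassifier, isNumber and classify are made up for illustration. This only classifies tokens; it does not compile or run Java source.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenClassifier {

    /** True if Float.parseFloat accepts the token (this includes forms like "1e3" and "NaN"). */
    static boolean isNumber(String token) {
        try {
            Float.parseFloat(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Splits on whitespace and labels each token as a number or as text. */
    static List<String> classify(String input) {
        List<String> labelled = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            labelled.add(token + (isNumber(token) ? " -> number" : " -> text"));
        }
        return labelled;
    }

    public static void main(String[] args) {
        for (String line : classify("width 42 height 3.5")) {
            System.out.println(line);
        }
    }
}
```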

Best choice? Edit bytecode (asm) or edit java file before compiling

Goal
Detecting where comparisons between and copies of variables are made
Inject code near the line where the operation has happened
The purpose of the code: every time the class is run, make a counter increase
General purpose: count the amount of comparisons and copies made after execution with certain parameters
2 options
Note: I always have a .java file to begin with
1) Edit java file
Find comparisons with regex and inject pieces of code near the line
And then compile the class (My application uses JavaCompiler)
2) Use ASM bytecode engineering
Also detect the events I want to track and inject pieces of code into the bytecode
And then use the (already compiled but modified) class
My Question
What is the best/cleanest way? Is there a better way to do this?
If you go for the Java route, you don't want to use regexes -- you want a real java parser. So that may influence your decision. Mind, the Oracle JVM includes one, as part of their internal private classes that implement the java compiler, so you don't actually have to write one yourself if you don't want to. But decoding the Oracle AST is not a 5 minute task either. And, of course, using that is not portable if that's important.
If you go the ASM route, the bytecode will initially be easier to analyze, since the semantics are a lot simpler. Whether the simplicity of analyses outweighs the unfamiliarity is unknown in terms of net time to your solution. In the end, in terms of generated code, neither is "better".
There is an apparent simplicity in just looking at generated Java source code and "knowing" that What You See Is What You Get, versus doing primitive dumps of class files for debugging and so on, but all that apparent simplicity is there because of your existing comfort with the Java language. Once you spend some time dredging through byte code, that too will become comfortable. It is just a question of whether it's worth the time to you to get there in the first place.
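For the source-level route, one way to get a real parser without regexes is the Compiler Tree API (the com.sun.source packages shipped with the JDK's javac). The sketch below parses a source string and counts comparison operators by visiting binary expressions. It is only the detection half, with no code injection, and the names ComparisonCounter and countComparisons are assumptions. It relies on the system compiler being javac, so the cast to JavacTask is specific to that implementation.

```java
import com.sun.source.tree.BinaryTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class ComparisonCounter {

    /** A source file that lives in a String instead of on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses the source (no type checking) and counts ==, !=, <, >, <= and >= expressions. */
    static int countComparisons(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available");
        }
        JavacTask task = (JavacTask) compiler.getTask(null, null, null, null, null,
                Collections.singletonList(new StringSource(className, source)));
        final int[] count = {0};
        try {
            for (CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitBinary(BinaryTree node, Void unused) {
                        switch (node.getKind()) {
                            case EQUAL_TO: case NOT_EQUAL_TO:
                            case LESS_THAN: case GREATER_THAN:
                            case LESS_THAN_EQUAL: case GREATER_THAN_EQUAL:
                                count[0]++;
                                break;
                            default:
                                break;
                        }
                        return super.visitBinary(node, unused); // keep scanning nested expressions
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count[0];
    }
}
```

For a method body such as return a < b || a == c; this counts two comparisons, since || is a conditional-or rather than a comparison. Copies could be found the same way by overriding visitAssignment and visitVariable.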
Generally, it all depends on how comfortable you are with either option and how critical performance is. Bytecode manipulation will be much faster and somewhat simpler, but you'll have to understand how bytecode works and how to use the ASM framework.
Intercepting variable access is probably one of the simplest use cases for ASM. You could find a few more complex scenarios in this AOSD'07 paper.
Here is simplified code for intercepting variable access:
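ClassReader cr = ...;
ClassWriter cw = ...;
// accept() takes a ClassVisitor; Opcodes.ASM9 is the API level, use the one matching your ASM version
cr.accept(new ClassVisitor(Opcodes.ASM9, cw) {
  @Override
  public MethodVisitor visitMethod(int access, String name, String desc, String signature, String[] exceptions) {
    MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
    return new MethodVisitor(Opcodes.ASM9, mv) {
      @Override
      public void visitVarInsn(int opcode, int var) {
        if (opcode == Opcodes.ALOAD) { // loading Object var
          ... insert method call
        }
        super.visitVarInsn(opcode, var); // emit the original instruction
      }
    };
  }
}, 0);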
If it were me, I'd probably use the ASM option.
If you need a tutorial on ASM, I stumbled upon this user-written tutorial.

Rewriting method calls within compiled Java classes

I want to replace calls to a given class with calls to another class within a method body whilst parsing compiled class files...
Or, put another way: is there a method of detecting usages of a given class in a method and replacing just that part of the method using something like javassist?
for example.. if I had the compiled version of
class A { public int m() { int i = 2; B.multiply(i,i); return i; } }
is there a method of detecting the use of B and then altering the code to perform
class A { public int m() { int i = 2; C.divide(i,i); return i; } }
I know the alternative would be to write a parser to grep the source files for usages but I would prefer a more elegant solution such as using reflection to generate new compiled class files.
Any thoughts ?
As @djna says, it is possible to modify bytecode files before you load them, but you probably do not want to do this:
The code that does the code modification is likely to be complex and hard to maintain.
The code that has been modified is likely to be difficult to debug. For a start, a source level debugger will show you source code that no longer corresponds to the code that you are actually editing.
Bytecode rewriting is useful in certain cases. For example, JDO implementations use bytecode rewriting to replace object member fetches with calls into the persistence libraries. However, if you have access to the source code for these files, you'll get a better (i.e. more maintainable) solution by preprocessing (or generating) the source code.
EDIT: and AOP or Groovy sound like viable alternatives too, depending on the extent of rewriting that you anticipate doing.
BCEL or ASM.
I recently looked at a number of libraries for reading Java class files. BCEL was the fastest, had the least number of dependencies, compiled out of the box, and had a deliciously simple API. I preferred BCEL to ASM because ASM has more dependencies (although the API is reputedly simpler).
AspectJ, as previously mentioned, is another viable option.
BCEL is truly simple. You can get a list of methods in three lines of code:
ClassParser cp = new ClassParser( "A.class" );
JavaClass jc = cp.parse();
Method[] m = jc.getMethods();
There are other API facilities for further introspection, including, I believe, ways to get the instructions in a method. However, this solution will likely be more laborious than AspectJ.
Another possibility is to change the multiply or divide methods themselves, rather than trying to change all instances of the code that calls the operation. That would be an easier road to take with BCEL (or ASM).
The format of byte code for compiled Java is specified and products exist that manipulate it.
This library appears to have the capability you need. I've no idea how easy it is to do these transformations reliably.
If you don't mind using Groovy, you can intercept the call to B.multiply and replace it with C.divide. You can find an example here.
It's much easier to perform these operations ahead-of-time, where the executable on disk is modified before launching the application. Manipulating the code in memory at run time is even more prone to errors than manipulating code in memory in C/C++. Why do you need to do this?
