How to automatically find similarities in Java bytecode?

Not sure if the title is the most descriptive way of putting it, but it's about as descriptive as I could think of.
Anyway, onto the question. I want to know how I can find similarities in bytecode. What I mean by this is rather difficult to properly explain (at least for me), so I will give an example instead.
I have aba.class, and nhf.class. These classes are obfuscated classes from a game I made. I offer a modified version of this game which simply has some small code changes in some places, but because the game is for sale it gets reobfuscated every time there is a new update. I want to be able to tell what class has changed to what in the reobfuscation by checking how similar the bytecode is for the classes. I know this is possible, but I have no idea how to check how to do this.
Is there a library, program or something that can parse bytecode and check how similar it is, or would I have to write this myself? If I would have to write it myself, I would appreciate someone to point me in the right direction (or link me to something that might help, etc).
Also, I'm looking at doing this with code, rather than manually, in case that wasn't apparent.

There can be a simpler solution:
I don't know what obfuscator you use (maybe ProGuard), but it probably generates a map from obfuscated class names to the original class names. (If not, you can switch to ProGuard, which generates such a map.)
So, you can translate obfuscated classnames to original classnames (and vice versa) provided that you have the map for the version.
So, given the maps for two versions, you can build a map from the old obfuscated class names to the new obfuscated class names by matching entries on the original class names.
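The matching step described above can be sketched roughly as follows. This assumes ProGuard-style mapping files, where each class line looks like `original.Name -> obfuscated:` and member lines are indented. `MappingComposer` and its methods are hypothetical names for illustration, not part of ProGuard.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of ProGuard.
class MappingComposer {

    /** Parses class lines ("original.Name -> obfuscated:") into original -> obfuscated. */
    static Map<String, String> parseClassMappings(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            // Member lines are indented and comment lines start with '#'; skip both.
            if (line.isEmpty() || Character.isWhitespace(line.charAt(0)) || line.startsWith("#")) {
                continue;
            }
            int arrow = line.indexOf(" -> ");
            if (arrow < 0 || !line.endsWith(":")) {
                continue; // not a class line in the assumed format
            }
            String original = line.substring(0, arrow);
            String obfuscated = line.substring(arrow + 4, line.length() - 1);
            result.put(original, obfuscated);
        }
        return result;
    }

    /** Joins two original -> obfuscated maps on the original name: old obfuscated -> new obfuscated. */
    static Map<String, String> compose(Map<String, String> oldMap, Map<String, String> newMap) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : oldMap.entrySet()) {
            String renamed = newMap.get(entry.getKey());
            if (renamed != null) { // classes that no longer exist in the new version are skipped
                result.put(entry.getValue(), renamed);
            }
        }
        return result;
    }
}
```

In practice the line lists would come from the two mapping files, for example via `Files.readAllLines`. Classes that were added or removed between versions simply have no counterpart in the composed map.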

Related

Save values to file(e.g yml) in java

First of all this might be a dumb question and I searched for some days but didn't find an answer. So if there is an existing answer concerning my question, I would be grateful for a link.
I don't know if anyone of you ever coded Spigot, Paper or Bukkit, but there was a class called YamlConfiguration which had the following methods:
FileConfiguration cfg = YamlConfiguration.loadConfiguration(file);
cfg.set("path.path2", "hello");
cfg.getString("path.path2"); // returns "hello"; getInt etc. exist for other types
cfg.save(file);
The produced file then looks like this:
path:
  path2: "hello"
So you could basically save any value in those files and reuse them even if your program has been restarted.
I have now moved on from Spigot/Paper to plain Java and I'm missing something like that Yaml-thing. The only thing I found was a kind of config file where the whole file is overwritten every time I try to add values.
Can you show me a proper way of saving values to a file? (would be nice without libraries)

I'm missing sth like that Yaml-thing
SnakeYAML should have you covered. Without knowing anything about your use-case, it makes no sense to discuss its usage here since its documentation already does cover the general topics.
The only thing I found was a kind of a config file, where everytime the whole file is overwritten, when I try to add values.
Saving as YAML will always overwrite the complete file as well. Serialization does not really work with append-only. (Serialization is the term to search for when you want functionality like this, by the way.)
If you mean that previous values were deleted, that probably was because you didn't load the file's content before or some other coding error, but since you don't show your code, we can only speculate.
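As a library-free illustration of the load-before-save point, here is a minimal sketch using `java.util.Properties` from the JDK. It stores flat `key=value` pairs rather than nested YAML, and `SimpleConfig`, `setValue` and `getValue` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper for illustration.
class SimpleConfig {

    /** Loads the existing entries (if any) so that rewriting the file does not lose them. */
    private static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }

    /** Sets one key; the whole file is rewritten, but earlier keys are kept. */
    static void setValue(Path file, String key, String value) throws IOException {
        Properties props = load(file);
        props.setProperty(key, value);
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, null); // null means no header comment
        }
    }

    /** Returns the stored value, or null if the key is absent. */
    static String getValue(Path file, String key) throws IOException {
        return load(file).getProperty(key);
    }
}
```

The file is still overwritten completely on every save, as described above. Earlier values survive only because they are loaded first and written back together with the new one.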
Can you show me a proper way of saving values to a file?
People will have quite different opinions on what would be a proper way and therefore it is not a good question to ask here. It also heavily depends on your use-case.
would be nice without libraries
So you're basically saying „previously I used a library which had a nice feature but I want to have that feature without using a library“. This stance won't get you far in today's increasingly modular software world.
For example, JAXB which offers (de)serialization from/to XML was previously part of Java SE, but has been removed as of Java SE 11 and is a separate library now.

Java code change analysis tool - e.g tell me if a method signature has changed, method implementation

Is there any diff tool specifically for Java that doesn't just highlight differences in a file, but is more complex?
By more complex I mean it'd take 2 input files, the same class file of different versions, and tell me things like:
Field names changed
New methods added
Deleted methods
Methods whose signatures have changed
Methods whose implementations have changed (not interested in any more detail than that)
Done some Googling and can't find anything like this...I figure it could be useful in determining whether or not changes to dependencies would require a rebuild of a particular module.
Thanks in advance
Edit:
I suppose I should clarify:
I'm not bothered about a GUI for the tool, it'd be something I'm interested in calling programmatically.
And as for my reasoning:
To work out if I need to rebuild certain modules/components if their dependencies have changed (which could save us around 1 hour per component)... There is a more detailed explanation, but I don't really see it as important.
To be used to analyse changes made to certain components that we are trying to lock down and rely on as being more stable, we are attempting to ensure that only very rarely should method signatures change in a particular component.

You said above that Clirr is what you're looking for.
But for others with slightly different needs, I'd like to recommend JDiff. Both have pros and cons, but for my needs I ended up using JDiff. I don't think it'll satisfy your last bullet point and it's difficult to call programmatically. What it does do is generate a useful report for API differences.
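For readers who want to call something programmatically, the added/removed/changed-signature part of the question can be approximated with plain reflection. This is only a sketch and is not how Clirr or JDiff work internally. `ApiDiff`, `GreeterV1` and `GreeterV2` are hypothetical names, and the two "versions" are modelled as two separate classes so the example stays self-contained.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper for illustration.
class ApiDiff {

    /** Renders each declared method as "returnType name(paramTypes)". */
    static Set<String> signatures(Class<?> type) {
        Set<String> result = new TreeSet<>();
        for (Method method : type.getDeclaredMethods()) {
            if (method.isSynthetic()) {
                continue; // ignore compiler-generated or instrumentation methods
            }
            String params = Arrays.stream(method.getParameterTypes())
                    .map(Class::getName)
                    .collect(Collectors.joining(", "));
            result.add(method.getReturnType().getName() + " " + method.getName() + "(" + params + ")");
        }
        return result;
    }

    /** Signatures present in 'after' but not in 'before'. */
    static Set<String> added(Class<?> before, Class<?> after) {
        Set<String> result = signatures(after);
        result.removeAll(signatures(before));
        return result;
    }

    /** Signatures present in 'before' but not in 'after'. */
    static Set<String> removed(Class<?> before, Class<?> after) {
        return added(after, before);
    }
}

// Two hypothetical versions of the same class.
class GreeterV1 {
    String greet(String name) { return "Hi " + name; }
}

class GreeterV2 {
    String greet(String name, int times) { return "Hi " + name + " x" + times; }
    void reset() { }
}
```

A changed signature shows up as one removed entry plus one added entry. Comparing two real versions of the same class name would require loading each JAR in its own class loader, and detecting changed method implementations needs bytecode-level comparison, which this sketch does not attempt.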

Can we refactor the following scenario using Eclipse?

I want to change the type of a variable from String to int, can we use Eclipse to refactor?

There's no out of the box refactoring tool that does it as far as I know. The reason probably is that strictly speaking this isn't refactoring: refactoring is a change that doesn't affect the behaviour of the code, but this change definitely does.
Unless you're using reflection, the easiest way to make this change is to change the field first, then watch the bits that turn red, and work your way through them. (You'll get a cascade of errors, pieces that you fix will cause other pieces to go wrong, but eventually you'll get to the end of it.)
I know this isn't really the answer you wanted but if you follow this pattern (deliberately break the code first, then correct errors that arise), it doesn't take long.
If you do have reflection in your code though, then you have no other option than to go through every single file that uses reflection and check whether it would be affected by your change.

"Cosmetic" clean-up of old, unknown code. Which steps, which order? How invasive?

When I receive code I have not seen before to refactor it into some sane state, I normally fix "cosmetic" things (like converting StringTokenizers to String#split(), replacing pre-1.2 collections by newer collections, making fields final, converting C-style arrays to Java-style arrays, ...) while reading the source code I have to get familiar with.
Are there many people using this strategy (maybe it is some kind of "best practice" I don't know?) or is this considered too dangerous, and not touching old code if it is not absolutely necessary is generally preferred? Or is it more common to combine the "cosmetic cleanup" step with the more invasive "general refactoring" step?
What are the common "low-hanging fruits" when doing "cosmetic clean-up" (vs. refactoring with more invasive changes)?

In my opinion, "cosmetic cleanup" is "general refactoring." You're just changing the code to make it more understandable without changing its behavior.
I always refactor by attacking the minor changes first. The more readable you can make the code quickly, the easier it will be to do the structural changes later - especially since it helps you look for repeated code, etc.
I typically start by looking at code that is used frequently and will need to be changed often, first. (This has the biggest impact in the least time...) Variable naming is probably the easiest and safest "low hanging fruit" to attack first, followed by framework updates (collection changes, updated methods, etc). Once those are done, breaking up large methods is usually my next step, followed by other typical refactorings.

There is no right or wrong answer here, as this depends largely on circumstances.
If the code is live, working, undocumented, and contains no testing infrastructure, then I wouldn't touch it. If someone comes back in the future and wants new features, I will try to work them into the existing code while changing as little as possible.
If the code is buggy, problematic, missing features, and was written by a programmer that no longer works with the company, then I would probably redesign and rewrite the whole thing. I could always still reference that programmer's code for a specific solution to a specific problem, but it would help me reorganize everything in my mind and in source. In this situation, the whole thing is probably poorly designed and it could use a complete re-think.
For everything in between, I would take the approach you outlined. I would start by cleaning up everything cosmetically so that I can see what's going on. Then I'd start working on whatever code stood out as needing the most work. I would add documentation as I come to understand how it works, to help me remember what's going on.
Ultimately, remember that if you're going to be maintaining the code now, it should be up to your standards. Where it's not, you should take the time to bring it up to your standards - whatever that takes. This will save you a lot of time, effort, and frustration down the road.

The lowest-hanging cosmetic fruit is (in Eclipse, anyway) shift-control-F. Automatic formatting is your friend.

The first thing I do is try to hide as much as possible from the outside world. If the code is crappy, most of the time the guy who implemented it did not know much about data hiding and the like.
So my advice, first thing to do:
Turn as many members and methods as private as you can without breaking the compilation.
As a second step I try to identify the interfaces. I replace the concrete classes through the interfaces in all methods of related classes. This way you decouple the classes a bit.
Further refactoring can then be done more safely and locally.

You can buy a copy of Refactoring: Improving the Design of Existing Code from Martin Fowler, you'll find a lot of things you can do during your refactoring operation.
Plus you can use tools provided by your IDE and other code analyzers such as FindBugs or PMD to detect problems in your code.
Resources :
www.refactoring.com
wikipedia - List of tools for static code analysis in java
On the same topic :
How do you refactor a large messy codebase?
Code analyzers: PMD & FindBugs

By starting with "cosmetic cleanup" you get a good overview of how messy the code is and this combined with better readability is a good beginning.
I always (yeah, right... sometimes there's something called a deadline that messes with me) start with this approach and it has served me very well so far.

You're on the right track. By doing the small fixes you'll be more familiar with the code and the bigger fixes will be easier to do with all the detritus out of the way.
Run a tool like JDepend, CheckStyle or PMD on the source. They can automatically flag loads of issues that are cosmetic but based on general refactoring rules.

I do not change old code except to reformat it using the IDE. There is too much risk of introducing a bug - or removing a bug that other code now depends upon! Or introducing a dependency that didn't exist such as using the heap instead of the stack.
Beyond the IDE reformat, I don't change code that the boss hasn't asked me to change. If something is egregious, I ask the boss if I can make changes and state a case of why this is good for the company.
If the boss asks me to fix a bug in the code, I make as few changes as possible. Say the bug is in a simple for loop. I'd refactor the loop into a new method. Then I'd write a test case for that method to demonstrate I have located the bug. Then I'd fix the new method. Then I'd make sure the test cases pass.
Yeah, I'm a contractor. Contracting gives you a different point of view. I recommend it.

There is one thing you should be aware of. The code you are starting with has been TESTED and approved, and your changes automatically mean that retesting must happen, as you may have inadvertently broken some behaviour elsewhere.
Besides, everybody makes errors. Every non-trivial change you make (changing StringTokenizer to split is not an automatic feature in e.g. Eclipse, so you write it yourself) is an opportunity for errors to creep in. Do you get the exact behaviour right of a conditional, or did you by mere mistake forget a !?
Hence, your changes imply retesting. That work may be quite substantial and far outweigh the small changes you have made.
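As one concrete case of such a slip, the StringTokenizer-to-split conversion mentioned above is not behaviour-preserving when the input contains empty fields. The sketch below compares the two; `TokenizerVsSplit` and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical comparison helper for illustration.
class TokenizerVsSplit {

    /** Old style: StringTokenizer treats consecutive delimiters as one and never returns empty tokens. */
    static List<String> withTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, ",");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    /** "Cosmetic" replacement: String#split keeps empty strings between delimiters. */
    static List<String> withSplit(String input) {
        return Arrays.asList(input.split(","));
    }
}
```

For the input `"a,,b"`, `withTokenizer` yields two tokens (`a`, `b`) while `withSplit` yields three (`a`, an empty string, `b`), so any code that indexes into the result can silently change behaviour.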

I don't normally bother going through old code looking for problems. However, if I'm reading it, as you appear to be doing, and it makes my brain glitch, I fix it.
Common low-hanging fruits for me tend to be more about renaming classes, methods, fields etc., and writing examples of behaviour (a.k.a. unit tests) when I can't be sure of what a class is doing by inspection - generally making the code more readable as I read it. None of these are what I'd call "invasive" but they're more than just cosmetic.

From experience it depends on two things: time and risk.
If you have plenty of time then you can do a lot more, if not then the scope of whatever changes you make is reduced accordingly. As much as I hate doing it I have had to create some horrible shameful hacks because I simply didn't have enough time to do it right...
If the code you are working on has lots of dependencies or is critical to the application then make as few changes as possible - you never know what your fix might break... :)
It sounds like you have a solid idea of what things should look like so I am not going to say what specific changes to make in what order 'cause that will vary from person to person. Just make small localized changes first, test, expand the scope of your changes, test. Expand. Test. Expand. Test. Until you either run out of time or there is no more room for improvement!
BTW When testing you are likely to see where things break most often - create test cases for them (JUnit or whatever).
EXCEPTION:
Two things that I always find myself doing are reformatting (Ctrl+Shift+F in Eclipse) and commenting code that is not obvious. After that I just hammer the most obvious nail first...

From Static Typing to Dynamic Typing

I have always worked on statically typed languages (C/C++, Java). I have been playing with Clojure and I really like it.
One thing I am worried about is this: say that I have a window that takes 3 modules as arguments, and along the way the requirements change and I need to pass another module to the function. In a statically typed language I just change the function and the compiler complains everywhere I used it. But in Clojure it won't complain until the function is called. I can do a regex search and replace, but it seems there is a chance of missing a call, and it will go unnoticed until that function is actually called. How do you guys deal with this?

This is one of the reasons automated testing/test driven development is even more important in dynamically typed languages. I haven't used Clojure (I mostly use Ruby), so unfortunately I can't recommend a specific testing framework.

The first thing I'd like to mention is that Bruce Eckel has written a very interesting article called Strong Typing vs Strong Testing (the link is down at the moment, unfortunately, but hopefully it will be up soon).
His idea is that when dealing with compiled languages, the compiler is just acting as the first, automatic step of automatic testing. When making the move to a dynamic language, you lose this first level of automatic testing. But in both cases, this first, automatic level is just one part of testing, and not even a very important part.
His point is that if you're developing programs properly, i.e. doing some form of tests and regression tests, the lack of a compiler will only force you to add some more, somewhat basic tests anyways, which is why it's no big loss.
So I guess the first answer I'd give you is, focus on your testing, something you should be doing anyway, and such changes shouldn't affect you too badly.
The second thing I'd like to mention is many dynamic languages that I've seen (for example, Python) have much better abilities to change what methods/classes do without breaking existing code.
For example, with Python, if your method used to accept two parameters but now requires a third one, you can always add it as a parameter with a default value: existing code keeps working, and new code can pass the extra argument. This is a very basic technique, but in Python's case (and I assume most other dynamic languages as well), these techniques can get much more interesting; since they're dynamic, you can pretty much change the implementation of functions for specific modules, change what variables mean, etc.
I'd suggest looking at which techniques Clojure has that allow similar things, and deciding if they apply in your situation.

You do the same thing you did if the method was part of a public interface that you weren't the only user of.
You add a new method with the extra module and change the old one to call the new one with a suitable default.
Oh and if your program is that big, make sure you have good tests (test-is should make it simpler than Java)
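The pattern described above, sketched in Java with overloads (Clojure's multi-arity functions allow the same shape). `WindowFactory`, `createWindow` and `DEFAULT_MODULE` are hypothetical names, and modules are modelled as plain strings purely for illustration.

```java
// Hypothetical example; modules are represented as strings for brevity.
class WindowFactory {

    // Assumed default for the newly required fourth module.
    static final String DEFAULT_MODULE = "default-module";

    /** Old three-module entry point: existing callers keep working unchanged. */
    static String createWindow(String first, String second, String third) {
        return createWindow(first, second, third, DEFAULT_MODULE);
    }

    /** New entry point that takes the extra module explicitly. */
    static String createWindow(String first, String second, String third, String fourth) {
        return String.join("+", first, second, third, fourth);
    }
}
```

Existing call sites such as `createWindow("a", "b", "c")` do not need to be found and edited; only callers that care about the fourth module switch to the new method.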

Test coverage is definitely important. But a dynamically typed language will allow you to work in a different way. In a strongly typed language (like Java), a change in the interface means modifying all the callers. In Ruby, you could do this-- but probably won't. Instead, you'll probably add flexibility to the method in one of a few ways. Namely:
you tend to have very few methods that take as many as three parameters in Ruby (as opposed to Java). Because you don't have strong typed interface of Java, you break the problem down into smaller pieces and steps. It's much more common to write methods that take just 1 parameter, and then refactor when it becomes more complex.
it's possible-- and common-- to leave the old behavior in place while adding more arguments. For example, if you have to add a third argument to a two argument method, you will set its default value to preserve the old behavior (and save you a refactor). If you are familiar with Javascript libraries like jQuery, they take advantage of this everywhere with "optional" arguments.
similar to optional arguments, methods can grow to take a flexible parameter list. With solid test coverage, you can quite easily add a new behavior to an existing method and safely know you haven't broken the existing code. In Rails, methods like "render" take a wide range of options.

You're not completely without compiler support in Clojure. In the specific example you give, it's the arity of the function that changed, which would be picked up by compiling the Clojure code. I'm still making the strong -> dynamic typing transition and find this comforting!

You lose some level of refactoring and type safety when you move to dynamic languages. The more information the compiler has, the more it can do at compile time for you.

Tim Bray discusses it here, a critique of which by Cedric is here, and a post on Artima discusses it at length.

If you really need static typing, you can use https://github.com/clojure/core.typed and its Leiningen module to test static variable passing.
