How can you easily compare modified code against a reference implementation?


I'm currently in the process of modifying somebody else's R-Tree implementation in order to add additional behaviour. I want to ensure that once I have made my changes the basic structure of the tree remains unchanged.
My current approach is to create a copy of the reference code and move it into its own package (tree_ref). I have then created a unit test which has instances of my modified tree and the original tree (in tree_ref). I am filling the trees with data and then checking that their field values are identical, in which case I assert the test case as having passed.
It strikes me that this may not be the best approach and that there may be some recognised methodology that I am unaware of to solve this problem. I haven't been able to find one by searching.
Any help is appreciated. Thanks.

What you're doing makes sense, and is a good practice. Note that whenever you 'clone-and-own' an existing package, you're likely doing it for a reason. Maybe it's performance. Maybe it is a behavior change. But whatever the reason, the tests you run against the reference and the test subject need to be agnostic to the changes.
This sort of comparison works well with randomized testing, especially when the code under test is some sort of collection implementation, as it is here.
Note also that if the reference implementation had solid unit tests, you don't need to cover those cases -- you simply need to target the tests at your implementation.
(And for completeness, let me state this no-brainer) you still have to add your own tests to cover the new behavior you've introduced with your changes.
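To make the randomized comparison concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `PointIndex` interface and both stand-in classes are hypothetical and would be replaced by thin adapters around the reference tree and the modified tree. It compares query results rather than private fields, so it stays agnostic to internal changes, and it uses a fixed seed so that any failure can be reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class DifferentialCheck {

    /** Hypothetical common interface; adapt it to the real R-tree API. */
    interface PointIndex {
        void insert(double x, double y);
        /** Returns all points inside the axis-aligned rectangle, bounds inclusive. */
        List<double[]> search(double minX, double minY, double maxX, double maxY);
    }

    /** Shared linear-scan filter used by both stand-ins. */
    static List<double[]> filter(List<double[]> points, double minX, double minY, double maxX, double maxY) {
        List<double[]> hits = new ArrayList<>();
        for (double[] p : points) {
            if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY) hits.add(p);
        }
        return hits;
    }

    /** Stand-in for the untouched reference copy (for example the one in tree_ref). */
    static class ReferenceIndex implements PointIndex {
        private final List<double[]> points = new ArrayList<>();
        public void insert(double x, double y) { points.add(new double[] {x, y}); }
        public List<double[]> search(double minX, double minY, double maxX, double maxY) {
            return filter(points, minX, minY, maxX, maxY);
        }
    }

    /** Stand-in for the modified copy; it stores points in a different internal order. */
    static class ModifiedIndex implements PointIndex {
        private final List<double[]> points = new ArrayList<>();
        public void insert(double x, double y) { points.add(0, new double[] {x, y}); }
        public List<double[]> search(double minX, double minY, double maxX, double maxY) {
            return filter(points, minX, minY, maxX, maxY);
        }
    }

    /** Sorts a copy by x, then y, so that result order does not matter. */
    static List<double[]> normalized(List<double[]> hits) {
        List<double[]> copy = new ArrayList<>(hits);
        copy.sort(Comparator.<double[]>comparingDouble(p -> p[0]).thenComparingDouble(p -> p[1]));
        return copy;
    }

    static boolean sameResults(List<double[]> a, List<double[]> b) {
        List<double[]> x = normalized(a);
        List<double[]> y = normalized(b);
        if (x.size() != y.size()) return false;
        for (int i = 0; i < x.size(); i++) {
            if (!Arrays.equals(x.get(i), y.get(i))) return false;
        }
        return true;
    }

    /** Returns the index of the first query on which the two disagree, or -1 if they always agree. */
    static int firstMismatch(PointIndex reference, PointIndex modified, long seed, int points, int queries) {
        Random rnd = new Random(seed);
        for (int i = 0; i < points; i++) {
            double x = rnd.nextDouble() * 100;
            double y = rnd.nextDouble() * 100;
            reference.insert(x, y);
            modified.insert(x, y);
        }
        for (int q = 0; q < queries; q++) {
            double x1 = rnd.nextDouble() * 100, x2 = rnd.nextDouble() * 100;
            double y1 = rnd.nextDouble() * 100, y2 = rnd.nextDouble() * 100;
            double minX = Math.min(x1, x2), maxX = Math.max(x1, x2);
            double minY = Math.min(y1, y2), maxY = Math.max(y1, y2);
            if (!sameResults(reference.search(minX, minY, maxX, maxY),
                             modified.search(minX, minY, maxX, maxY))) {
                return q;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long seed = 42L;
        int mismatch = firstMismatch(new ReferenceIndex(), new ModifiedIndex(), seed, 1000, 200);
        System.out.println(mismatch < 0 ? "implementations agree"
                                        : "seed " + seed + ": mismatch at query " + mismatch);
    }
}
```

Printing the seed on failure is the important design choice: a random test that cannot be replayed is hard to debug.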

I would do that in two stages:
First, insert random data into the tree. (I assume that's what you are doing.)
Second, check some extreme cases (does the tree handle negative numbers, NaN, Infinity, hundreds of identical points, an unbalanced distribution of points?)
R-trees are fun. Enjoy!
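For the second stage, it can help to keep the extreme cases as named data sets that are fed to both trees. A small sketch, assuming 2-D points stored as `{x, y}` arrays; the set names, sizes, and coordinate values are arbitrary choices of mine, not anything the R-tree code prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EdgeCaseData {

    /** Returns named point sets; each point is {x, y}. All values are illustrative assumptions. */
    static Map<String, List<double[]>> datasets() {
        Map<String, List<double[]>> sets = new LinkedHashMap<>();

        // Negative coordinates, including negative zero.
        List<double[]> negative = new ArrayList<>();
        negative.add(new double[] {-5.0, -7.5});
        negative.add(new double[] {-0.0, 0.0});
        sets.put("negative", negative);

        // NaN and infinities; the tree should either reject these or handle them consistently.
        List<double[]> nonFinite = new ArrayList<>();
        nonFinite.add(new double[] {Double.NaN, 1.0});
        nonFinite.add(new double[] {Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY});
        sets.put("non-finite", nonFinite);

        // Hundreds of identical points, which stress node splitting.
        List<double[]> identical = new ArrayList<>();
        for (int i = 0; i < 500; i++) identical.add(new double[] {3.0, 3.0});
        sets.put("identical", identical);

        // Unbalanced distribution: a tight cluster plus one far-away outlier.
        List<double[]> skewed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) skewed.add(new double[] {i * 1e-6, i * 1e-6});
        skewed.add(new double[] {1e9, 1e9});
        sets.put("skewed", skewed);

        return sets;
    }

    public static void main(String[] args) {
        datasets().forEach((name, points) ->
                System.out.println(name + ": " + points.size() + " points"));
    }
}
```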

Related

Unit Test Practice About Field Accessibility

I have a tree data structure for example
    public class Tree {
        class Node {
            // stuffs...
        }
        private Node root;
        // ...
    }
I'm using JUnit 4. In my unit test, I'd like to run some sanity checks where I need to traverse the tree (e.g. check that the binary search tree properties are preserved). But since the root is kept private, I can't traverse it from outside the class.
The options I can think of are:
A getter for root. This prevents the reference itself from being changed, but external code may still change fields in root.
Putting the check inside the Tree class. A test is not part of the data structure itself, so I don't really want to put it there. And even if I did, I would have to delete it after the test is done.
Please tell me the right thing to do, thanks.

There are multiple things you can do. A few options are:
1. Make a public getter (but this breaks encapsulation, and should only be considered if for some odd reason you cannot put the tests in the same package).
2. Write a package-private getter and place the tests in the same package (probably the best solution).
3. Use reflection to gain access to the value (not recommended).
I would personally go with option 2 (but please see my last paragraph for my recommended answer, since ideally I would not do any of the above). This is what we used in industry to test things that we can't normally access. It is minimally invasive, it doesn't break encapsulation like a public getter does, and it doesn't require you to write intrusive reflection code.
As a discussion on why we didn't use (3), I was on a project once where the lead developer decided to do exactly this in all his unit tests. There were tens of thousands of unit tests which were all using reflection to verify things. The performance gain we got from converting them away from a reflection-assisting library was nice enough that we got a lot more snappy feedback when we ran our unit tests.
Further, you should ask yourself if you need to do such tests. While testing is nice, you should ideally be unit testing the interface of your stuff, not reaching into the guts to assert everything works. Tests that reach into the internals make it very painful if you ever need to refactor your class, because you'll invalidate a bunch of tests when you touch anything. Therefore I recommend only testing the public methods and being very rigorous in those tests.
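If you do go with option 2, it could look roughly like this. This is only a sketch: the `int` keys, the `insert` method, and the names `rootForTesting` and `TreeInvariants` are my assumptions, not part of the original Tree class. The checker would normally live in the test source tree, in the same package as Tree.

```java
/** Sketch of a binary search tree whose root stays private but is reachable from same-package tests. */
public class Tree {

    /** Package-private node type; the int key is an assumption for illustration. */
    static class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    public void insert(int key) { root = insert(root, key); }

    private static Node insert(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = insert(node.left, key);
        else if (key > node.key) node.right = insert(node.right, key);
        return node; // duplicates are ignored
    }

    /** Package-private on purpose: visible to tests in the same package, hidden from other packages. */
    Node rootForTesting() { return root; }
}

/** Test-side helper; placed in the same package so it can see the package-private members. */
class TreeInvariants {

    static boolean isSearchTree(Tree tree) {
        return isSearchTree(tree.rootForTesting(), Long.MIN_VALUE, Long.MAX_VALUE);
    }

    /** Every key must lie strictly between the bounds inherited from its ancestors. */
    private static boolean isSearchTree(Tree.Node node, long low, long high) {
        if (node == null) return true;
        if (node.key <= low || node.key >= high) return false;
        return isSearchTree(node.left, low, node.key)
            && isSearchTree(node.right, node.key, high);
    }
}
```

A JUnit test in the same package can then simply assert that `TreeInvariants.isSearchTree(tree)` holds after each batch of operations.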

How does a change in the hashCode implementation affect a HashSet?

I had implemented a HashSet and added some objects to it, but later we changed the hashCode implementation.
1) What will happen in this case?
2) What can be done to prevent changes to the hashCode implementation?

As is so often the case, the answer is: it depends.
Assume that you change the hashCode() implementation of one of your classes.
1) If your application does not persist its data, then when you restart it, every part of it will be using the new implementation. Thus: no problem.
2) If your application does persist its data, then when you restart it, it will reload that data, and depending on how and where you changed hashCode(), interesting things might occur.
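To get a feeling for what "interesting things" can look like: java.util.HashSet files each element under the hash code it had at insertion time. The sketch below simulates a changed hashCode() inside a single run by means of a hypothetical static `version` switch. The switch exists only for this demonstration; a real change would arrive with a new deployment.

```java
import java.util.HashSet;
import java.util.Set;

public class HashCodeChangeDemo {

    /** Hypothetical key whose hash "implementation" can be switched at runtime for demonstration. */
    static final class Key {
        static int version = 1; // assumption: stands in for deploying a new hashCode()
        final String name;

        Key(String name) { this.name = name; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            // Both versions are consistent with equals(); they just produce different numbers.
            return version == 1 ? name.length() : name.length() + 1000;
        }
    }

    public static void main(String[] args) {
        Set<Key> set = new HashSet<>();
        Key alice = new Key("alice");
        set.add(alice);                          // stored under the version-1 hash

        System.out.println(set.contains(alice)); // true
        Key.version = 2;                         // the "new implementation" takes effect
        System.out.println(set.contains(alice)); // false: lookup uses the new hash
        System.out.println(set.size());          // 1: the element is still in there
    }
}
```

The element is not lost, but it can no longer be found by lookup. Any persisted structure that stores hash-derived positions or values can end up in a similar state.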
For your second question; there is no generic way to "solve" that, but there are well known practices, and if you follow them, chances get smaller that somebody messes up:
1) Education and skill: try to make sure that everybody touching code knows what he is doing (and not blindly following orders "but you told me to do xyz, so I sat down and did exactly xyz, not considering at all what the consequences are")
2) Good design, and re-use of existing components. For example, standard Java comes with well-known, good sets, maps, and other collections. Why do you think that you have to re-invent the wheel, and why do you think that your implementation will be "better"?
3) Good tests. Do TDD, and make sure that each new function has unit tests that cover all its behavior. And then make sure that your unit tests run automatically when somebody pushes code into your version control system; so you notice when stuff gets broken. Beyond that, build reasonable function/integration tests for those aspects that can't be tested by unit tests.

Sanity Check - Significant increase in the number of objects when using JUnit

I am using JUnit for the first time in a project and I'm fascinated by the way it is forcing me to restructure my code. One thing I've noticed is that the number of objects I've created in order to be able to test chunks of code is significantly increasing. Is this typical?
Thanks,
Elliott

Yes, this is normal.
In general, the smaller and more focused your classes and methods are, the easier they are to understand and test. This might produce more files and more actual lines of code, but that is because you are adding abstractions that give your code a better, cleaner design.
You may want to read about the Single Responsibility Principle. Uncle Bob also has some refactoring examples in his book Clean Code where he touches on exactly these points.
One more thing when you are unit testing: Dependency Injection is one of the most important things that will save you a lot of headaches when it comes to structuring your code. (And just for clarification, DI will not necessarily cause you to have more classes, but it will help decouple your classes from each other.)
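As a small illustration of the DI point, here is a sketch using java.time.Clock from the standard library as the injected dependency. The `Greeter` class and its greeting texts are hypothetical; the idea is only that a class which receives its clock through the constructor can be tested with a fixed time instead of the real one.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneOffset;

public class Greeter {

    private final Clock clock; // injected dependency instead of a hidden call to the system clock

    public Greeter(Clock clock) {
        this.clock = clock;
    }

    public String greeting() {
        int hour = LocalTime.now(clock).getHour();
        return hour < 12 ? "Good morning" : "Good afternoon";
    }

    public static void main(String[] args) {
        // In a test: a fixed clock makes the result deterministic.
        Clock nineAm = Clock.fixed(Instant.parse("2024-01-01T09:00:00Z"), ZoneOffset.UTC);
        System.out.println(new Greeter(nineAm).greeting()); // Good morning

        // In production: the real clock is passed in from the outside.
        System.out.println(new Greeter(Clock.systemDefaultZone()).greeting());
    }
}
```

No extra class was needed here; the dependency just moved from inside the method to the constructor.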

Yes, I think this is fairly typical. When I start introducing testing code into a legacy codebase, I find myself creating smaller utility classes and POJOs and testing those. The original class just becomes a wrapper that calls these smaller classes.
One example would be when you have a method which does a calculation, updates an object and then saves to a database.
    public void calculateAndUpdate(Thing t) {
        calculate(t); // quite a complex calculation with multiple results; also updates t
        dao.save(t);
    }
You could create a calculation object which is returned by the calculate method. The method then updates the Thing object and saves it.
    public void calculateAndUpdate(Thing t) {
        Calculation calculation = new Calculator().calculate(t); // does not update t at all
        update(t, calculation); // updates t with the result of calculation
        dao.save(t); // saves t to the database
    }
So I've introduced two new objects, a Calculator & Calculation. This allows me to test the result of the calculation without having to have a database available. I can also unit test the update method as well. It's also more functional, which I like :-)
If I continued to test with the original method, then I would have to unit test the calculation, update, and save as one item, which isn't nice.
For me, the second version is the better code design: better separation of concerns, smaller classes, more easily tested. The number of small classes goes up, but the overall complexity goes down.
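Fleshed out into something runnable, the second version could look like the sketch below. The names Thing, Calculator, Calculation, update, and dao come from the snippets above; the fields, the `ThingDao` interface, the `ThingService` wrapper, and the arithmetic are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CalculationExample {

    /** Hypothetical domain object; the fields are assumptions for illustration. */
    static class Thing {
        final int quantity;
        final double unitPrice;
        double total;
        Thing(int quantity, double unitPrice) { this.quantity = quantity; this.unitPrice = unitPrice; }
    }

    /** Immutable result of the calculation. */
    static final class Calculation {
        final double total;
        Calculation(double total) { this.total = total; }
    }

    /** Pure calculation: reads the Thing, never modifies it. */
    static class Calculator {
        Calculation calculate(Thing t) { return new Calculation(t.quantity * t.unitPrice); }
    }

    /** Hypothetical persistence interface; production code would talk to the database. */
    interface ThingDao {
        void save(Thing t);
    }

    static class ThingService {
        private final ThingDao dao;
        ThingService(ThingDao dao) { this.dao = dao; }

        void calculateAndUpdate(Thing t) {
            Calculation calculation = new Calculator().calculate(t); // does not update t at all
            update(t, calculation);                                  // updates t with the result
            dao.save(t);                                             // saves t
        }

        static void update(Thing t, Calculation calculation) { t.total = calculation.total; }
    }

    public static void main(String[] args) {
        Thing thing = new Thing(3, 2.5);

        // The calculation can be checked on its own, with no database in sight.
        Calculation calculation = new Calculator().calculate(thing);
        System.out.println(calculation.total); // 7.5
        System.out.println(thing.total);       // 0.0, the Thing is untouched so far

        // The whole method can be exercised with an in-memory stand-in for the DAO.
        List<Thing> saved = new ArrayList<>();
        new ThingService(saved::add).calculateAndUpdate(thing);
        System.out.println(thing.total + " " + saved.size()); // 7.5 1
    }
}
```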

It depends on what kind of objects you are referring to. Typically, you should be fine using a mocking framework like EasyMock or Mockito, in which case the number of additional classes required solely for testing purposes should be fairly small. If you are referring to additional objects in your main source code, maybe unit testing is helping you refactor your code to make it more readable and reusable, which is a good idea anyway IMHO :-)

Good algorithm for generating call graphs?

I am writing some code to generate call graphs for a particular intermediate representation without executing it by statically scanning the IR code. The IR code itself is not too complex and I have a good understanding of what function call sequences look like so all I need to do is trace the calls. I am currently doing it the obvious way:
Keep track of where we are
If we encounter a function call, branch to that location, execute and come back
While branching put an edge between the caller and the callee
I am satisfied with my progress so far, but I want to make sure that I am not reinventing the wheel here and that I won't run into corner cases. Are there any accepted good algorithms (and/or design patterns) that do this efficiently?
UPDATE:
The IR code is a byte-code disassembly from a homebrew Java-like language and looks like the Jasmin specification.

From an academic perspective, here are some considerations:
Do you care about being conservative / correct? For example, suppose the code you're analyzing contains a call through a function pointer. If you're just generating documentation, then it's not necessary to deal with this. If you're doing a code optimization that might go wrong, you will need to assume that 'call through pointer' means 'could be anything.'
Beware of exceptional execution paths. Your IR may or may not abstract this away from you, but keep in mind that many operations can trigger both language-level exceptions and hardware interrupts. Again, it depends on what you want to do with the call graph later.
Consider how you'll deal with cycles (e.g. recursion, mutual recursion). This may affect how you write code for traversing the graphs later on (i.e., they will need some sort of 'visited' set to avoid traversing cycles forever).
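The point about cycles can be sketched as follows. I am assuming the IR scan has already produced, for each function, the list of callee names found in its body; that `callSites` map and the function names are hypothetical. The builder uses a worklist plus a 'visited' set, so recursion and mutual recursion terminate.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CallGraphBuilder {

    /**
     * Builds the call graph reachable from entry.
     * callSites maps a function name to the callee names found by scanning its body
     * (assumed to come from the IR scan; functions without an entry are treated as leaves).
     */
    static Map<String, Set<String>> build(String entry, Map<String, List<String>> callSites) {
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        Set<String> visited = new HashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.push(entry);

        while (!worklist.isEmpty()) {
            String caller = worklist.pop();
            if (!visited.add(caller)) continue; // already scanned: this is what breaks cycles

            Set<String> callees = edges.computeIfAbsent(caller, k -> new LinkedHashSet<>());
            for (String callee : callSites.getOrDefault(caller, Collections.emptyList())) {
                callees.add(callee);   // edge caller -> callee
                worklist.push(callee); // scan the callee later
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<String, List<String>> callSites = new HashMap<>();
        callSites.put("main", Arrays.asList("parse", "eval"));
        callSites.put("eval", Arrays.asList("eval", "apply")); // direct recursion
        callSites.put("apply", Arrays.asList("eval"));         // mutual recursion
        System.out.println(build("main", callSites));
    }
}
```

Each function body is scanned once, however often it is called, so the work is linear in the number of functions plus call sites.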
Cheers.
Update March 6:
Based on extra information added to the original post:
Be careful about virtual method invocations. Keep in mind that, in general, it is unknowable which method will execute. You may have to assume that the call will go to any of the subclasses of a particular class. The standard example goes a bit like this: suppose you have an ArrayList<A>, and you have class B extends A. Based on a random number generator, you will add instances of A and B to the list. Now you call x.foo() for all x in the list, where foo() is a virtual method in A with an override in B. So, by just looking at the source code, there is no way of knowing whether the loop calls A.foo, B.foo, or both at run time.
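A conservative way to handle this, commonly called class hierarchy analysis, is to add an edge to every override found in the static type or any of its subclasses. A minimal sketch under stated assumptions: the `subclasses` and `declares` maps are hypothetical inputs that you would extract from the IR, the hierarchy is single-inheritance, and the static type itself declares the method (inherited declarations further up are ignored here).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class VirtualCallTargets {

    /**
     * Returns every "Class.method" that a virtual call on staticType could dispatch to.
     * subclasses maps a class to its direct subclasses; declares maps a class to the
     * methods it declares or overrides.
     */
    static Set<String> possibleTargets(String staticType, String method,
                                       Map<String, List<String>> subclasses,
                                       Map<String, Set<String>> declares) {
        Set<String> targets = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(staticType);

        while (!pending.isEmpty()) {
            String cls = pending.pop();
            if (declares.getOrDefault(cls, Collections.emptySet()).contains(method)) {
                targets.add(cls + "." + method); // this class provides its own implementation
            }
            for (String sub : subclasses.getOrDefault(cls, Collections.emptyList())) {
                pending.push(sub); // any subclass instance might be in the list at run time
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // The example from the text: class B extends A, and B overrides foo().
        Map<String, List<String>> subclasses = new HashMap<>();
        subclasses.put("A", Arrays.asList("B"));

        Map<String, Set<String>> declares = new HashMap<>();
        declares.put("A", new HashSet<>(Arrays.asList("foo")));
        declares.put("B", new HashSet<>(Arrays.asList("foo")));

        System.out.println(possibleTargets("A", "foo", subclasses, declares)); // [A.foo, B.foo]
    }
}
```

The call graph then gets one edge per returned target, which over-approximates the real behavior but never misses a callee within these assumptions.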

I don't know the algorithm, but pycallgraph does a decent job. It is worth checking out the source for it. It is not long and should be good for checking out existing design patterns.

Multiple threads modifying a collection in Java?

The project I am working on requires a whole bunch of queries towards a database. In principle there are two types of queries I am using:
(I) Read from an Excel file, check a couple of parameters, and query the database for hits. These hits are then registered as instances of a series of custom classes. Any hit may (and most likely will) occur more than once, so this part of the code checks and updates the occurrence in a custom list implementation that extends ArrayList.
(II) For each hit found, do a detail query and parse the output, so that the objects created in (I) get detailed info.
I figured I would use multiple threads to save time. However, I can't really come up with a good way to solve the problem that occurs with the collection these items are stored in. To elaborate a little: throughout the execution, the objects are supposed to be modified by both (I) and (II).
I deliberately didn't copy and paste any code, as it would take big chunks of code to make any sense. I hope the description above makes some sense.
Thanks,

In Java 5 and above, you may either use CopyOnWriteArrayList or a synchronized wrapper around your list. In earlier Java versions, only the latter choice is available. The same is true if you absolutely want to stick to the custom ArrayList implementation you mention.
CopyOnWriteArrayList is feasible if the container is read much more often than written (changed), which seems to be true based on your explanation. Its atomic addIfAbsent() method may even help simplify your code.
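A tiny sketch of how addIfAbsent() behaves; the hit identifiers are made up for illustration. The return value tells the caller whether this call actually added the element, so no separate contains() check (and no lock around it) is needed.

```java
import java.util.concurrent.CopyOnWriteArrayList;

public class AddIfAbsentDemo {

    public static void main(String[] args) {
        CopyOnWriteArrayList<String> hits = new CopyOnWriteArrayList<>();

        // The check and the insert happen as one atomic step.
        System.out.println(hits.addIfAbsent("hit-1")); // true: first time this hit is seen
        System.out.println(hits.addIfAbsent("hit-1")); // false: already present, list unchanged
        System.out.println(hits.addIfAbsent("hit-2")); // true

        System.out.println(hits); // [hit-1, hit-2]
    }
}
```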
[Update] On second thought, a map sounds more fitting to the use case you describe. So if changing from a list to e.g. a map is an option, you should consider ConcurrentHashMap. [/Update]
Changing the objects within the container does not affect the container itself; however, you need to ensure that the objects themselves are thread-safe.
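If you do switch to a map, counting how often each hit occurs could look like the sketch below. The hit names and the `countHits` helper are assumptions for illustration; `ConcurrentHashMap.merge` performs the read-modify-write atomically per key. The values here are immutable Integer objects, which sidesteps the thread-safety concern for the contained objects; your own hit classes would need their own protection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HitCounter {

    /** Counts occurrences of each hit, with one task per hit spread over the given number of threads. */
    static Map<String, Integer> countHits(List<String> hits, int threads) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String hit : hits) {
            // merge() inserts 1 for a new key or atomically adds 1 to the existing count.
            pool.execute(() -> counts.merge(hit, 1, Integer::sum));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("counting did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("a", "b", "a", "c", "a", "b");
        Map<String, Integer> counts = countHits(hits, 4);
        System.out.println(counts.get("a") + " " + counts.get("b") + " " + counts.get("c")); // 3 2 1
    }
}
```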

Just use the java.util.concurrent package.
Classes like ConcurrentLinkedQueue and ConcurrentHashMap are already there for you to use and are all thread-safe.
