Good algorithm for generating call graphs? - java

I am writing some code to generate call graphs for a particular intermediate representation by statically scanning the IR code, without executing it. The IR itself is not too complex, and I have a good understanding of what function call sequences look like, so all I need to do is trace the calls. I am currently doing it the obvious way:
Keep track of where we are
If we encounter a function call, branch to that location, execute and come back
While branching put an edge between the caller and the callee
I am satisfied with where this is going, but I want to make sure that I am not reinventing the wheel here and that I won't run into corner cases. I am wondering if there are any accepted good algorithms (and/or design patterns) that do this efficiently?
UPDATE:
The IR code is a byte-code disassembly from a homebrew Java-like language and looks like the Jasmin specification.
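For concreteness, here is a minimal sketch of that scan-and-branch approach. Everything here is invented for illustration: the "invoke <name>" instruction format, the method table, and all names; it is not taken from any real IR or bytecode library.

import java.util.*;

// Minimal sketch of the approach described above: scan each method's
// instructions; on every call instruction, record an edge and recurse.
class CallGraphBuilder {
    private final Map<String, List<String>> methods; // method name -> IR instructions
    private final Map<String, Set<String>> edges = new HashMap<>();
    private final Set<String> visited = new HashSet<>();

    CallGraphBuilder(Map<String, List<String>> methods) {
        this.methods = methods;
    }

    void trace(String caller) {
        if (!visited.add(caller)) {
            return; // already traced; also guards against recursion cycles
        }
        for (String insn : methods.getOrDefault(caller, List.of())) {
            if (insn.startsWith("invoke ")) {
                String callee = insn.substring("invoke ".length());
                edges.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
                trace(callee); // "branch to that location ... and come back"
            }
        }
    }

    Map<String, Set<String>> edges() {
        return edges;
    }
}

Calling new CallGraphBuilder(methods).trace("main") then yields the edge map rooted at main.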

From an academic perspective, here are some considerations:
Do you care about being conservative / correct? For example, suppose the code you're analyzing contains a call through a function pointer. If you're just generating documentation, then it's not necessary to deal with this. If you're doing a code optimization that might go wrong, you will need to assume that 'call through pointer' means 'could be anything.'
Beware of exceptional execution paths. Your IR may or may not abstract these away from you, but keep in mind that many operations can throw language-level exceptions as well as trigger hardware interrupts. Again, it depends on what you want to do with the call graph later.
Consider how you'll deal with cycles (e.g. recursion, mutual recursion). This may affect how you write code for traversing the graphs later on (i.e., they will need some sort of 'visited' set to avoid traversing cycles forever).
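For the last point, here is a sketch of a cycle-safe walk over an already-built call graph (using the same hypothetical Map<String, Set<String>> edge representation as above):

import java.util.*;

class CallGraphWalker {
    // Depth-first walk over a call graph that may contain cycles
    // (recursion, mutual recursion); the visited set stops infinite loops.
    static void walk(String method, Map<String, Set<String>> edges, Set<String> visited) {
        if (!visited.add(method)) {
            return; // already seen: we hit a cycle, stop here
        }
        System.out.println(method);
        for (String callee : edges.getOrDefault(method, Set.of())) {
            walk(callee, edges, visited);
        }
    }
}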
Cheers.
Update March 6:
Based on extra information added to the original post:
Be careful about virtual method invocations. Keep in mind that, in general, it is unknowable which method will execute. You may have to assume that the call will go to any of the subclasses of a particular class. The standard example goes a bit like this: suppose you have an ArrayList<A>, and you have class B extends A. Based on a random number generator, you will add instances of A and B to the list. Now you call x.foo() for all x in the list, where foo() is a virtual method in A with an override in B. So, by just looking at the source code, there is no way of knowing whether the loop calls A.foo, B.foo, or both at run time.
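As a sketch (class and method names as in the example above, the rest invented):

import java.util.*;

class A { void foo() { System.out.println("A.foo"); } }
class B extends A { @Override void foo() { System.out.println("B.foo"); } }

class VirtualDispatchDemo {
    public static void main(String[] args) {
        List<A> list = new ArrayList<>();
        Random rng = new Random();
        for (int i = 0; i < 10; i++) {
            list.add(rng.nextBoolean() ? new A() : new B()); // runtime-dependent element types
        }
        for (A x : list) {
            x.foo(); // a conservative call graph must record edges to both A.foo and B.foo
        }
    }
}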

I don't know the algorithm, but pycallgraph does a decent job. Its source is worth checking out: it is not long, and it is good for seeing existing design patterns.

Related

Is it better to add a conditional inside or outside of a for loop for cleaner code?

So this may sound simple, but I have a method with a for loop inside. Inside the for loop, the method createPrints needs a map of "parameters", which it gets from getParameters. Now there are 2 types of reports: one has a general set of parameters, and the other has that general set plus a set of its own.
I have two options:
Either have 2 getParameters methods: one that is general, and another that is for rp2 but also calls the general one. If I do this, then it makes sense to put the conditional before the for loop, like this (Report and reports stand in for whatever the loop actually iterates over):
void theMethod() {
    if (rp1) {
        for (Report report : reports) {
            createPrints(getGenParameters());
            // general for-loop work
        }
    } else {
        for (Report report : reports) {
            createPrints(getParameters());
            // general for-loop work
        }
    }
}
This way it checks only once which parameters method to call, instead of having the if statement inside the loop where it would be checked every iteration (which is wasteful, because the report type never changes throughout the loop). But written this way, repeating the for loop looks ugly and not clean at all. Is there a cleaner way to design this?
The other option is to pass a boolean into the getParameters method, check inside which type of report it is, and build the map based on that; this, however, also adds a conditional at each iteration.
From a performance perspective, it makes sense to have the conditional outside the loop so that it's not redundantly checked at each iteration, but it doesn't look clean, and my boss really cares about how clean code looks. He didn't like me using an if/else block instead of a ternary operator, since the ternary uses only one line (I think performance is still the same, no?).
Forgot to mention: I am using Java, so I cannot assign functions to variables or use callbacks.
Inside the method, there was an if/else block before the for loop, something like:
String aVariable;
if (condition) {
    aVariable = value1;
} else {
    aVariable = value2;
}
So I initially wanted to just create a boolean variable like isReport1 and assign it inside that if/else block, since it uses the same condition, and then, as mentioned before, pass it in as the parameter. But my boss again said not to use booleans as parameters. Is this the same situation, where I shouldn't do it here?
Branch prediction is a property of the CPU, not of the compiler. All modern CPUs have it; don't worry.
Most compilers can pull a constant condition out of the loop, so don't worry again. Java does it, too. Details: the javac compiler does not do it; its job is to transform source code to byte code and nothing else. But at runtime, the time-critical parts of the byte code get compiled to machine code, and that is where many optimizations happen.
Forgot to mention: I am using Java, so I cannot assign functions to variables or use callbacks.
You sort of can. It's implemented via anonymous classes, and since Java 8 there are lambdas. Surely worth learning, but not directly relevant to your case.
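For illustration, a sketch of what that looks like since Java 8, reusing the question's names (rp1, reports, createPrints, and the two get*Parameters methods); the Map type is an assumption:

import java.util.Map;
import java.util.function.Supplier;

// Pick the parameter source once, outside the loop, then reuse it.
Supplier<Map<String, String>> params = rp1 ? this::getGenParameters : this::getParameters;
for (Report report : reports) {
    createPrints(params.get());
    // general for-loop work
}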
I'd simply go for
for (Report report : reports) {
    createPrints(rp1 ? getGenParameters() : getParameters());
    // general for-loop work
}
as it's the shortest and cleanest way of doing it.
There are tons of alternatives, like defining Parameters getSomeParameters(boolean rp1), which you can create easily by applying "extract method" to the above ternary.
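Such an extracted method might look like this (a sketch; Parameters is the placeholder type from the sentence above):

// One conditional, in one place; the loop body just calls this.
Parameters getSomeParameters(boolean rp1) {
    return rp1 ? getGenParameters() : getParameters();
}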
From a performance perspective, it makes sense to have the conditional outside the loop so that it's not redundantly checked at each iteration, but it doesn't look clean, and my boss really cares about how clean code looks
From a performance perspective, none of it matters. The compiler is damn smart and knows tons of optimizations. Just write clean code with short methods, so it can do its job properly.
he didn't like me using an if/else block instead of doing it with ternary operators since the ternary uses only one line
A simple ternary one-liner makes the code much easier to understand, so you should go for it (complicated ternaries may be hard to grasp, but that's not the case here).
(I think performance is still the same, no?)
Definitely. Note that
most program parts are irrelevant to performance (the Pareto principle)
low-level optimizations rarely matter (the compiler knows them better than you)
clean code using proper data structures matters
Most "debates" about whether one way of writing some chunk of code is more efficient than another are missing the point.
The JIT compiler performs a lot of optimizations on your code behind the scenes. It (or more precisely, the people who wrote it) knows a lot more about how to optimize than you do. And they have the advantage of having done extensive benchmarking, (machine-level) code walk-throughs, and the like to figure out the best way to optimize.
The compiler can also optimize more reliably than a human can. And its developers are constantly improving it.
And the compiler does not need to be paid a salary to optimize code. Every hour you spend on optimizing is time that you could be spending on something else.
So the best strategy for you when optimizing is as follows:
First get the code implemented. Time spent on optimizing while coding is probably wasted. That is classic "premature optimization" behavior. Your intuition is probably not good enough to make the right call at this stage. (Few people are genuinely smart enough ...)
Next get the code working and passing its tests. Optimizing buggy code is missing the point.
Set some realistic, quantifiable goals for performance. ("As fast as possible" is neither realistic nor quantifiable.) Your code just needs to be fast enough to meet the business requirements. If it is already fast enough, then you are wasting developer time by optimizing it further.
Create a realistic benchmark so that you can measure your application's performance. If you can't (or don't) measure, you won't know if you have met your goals, or if a particular attempted optimization has improved things.
Use profiling to determine which parts of your code are worth the effort optimizing. Profiling tells you the hotspots where most of the time is spent. Focus there ... not on stuff that is only executed occasionally. This is where the 80-20 rule comes in.
I would talk to my boss about this. If the boss is not on board with the idea of being methodical and scientific about performance and optimization, then don't fight it. Just do what the boss says. (But be willing to push back if you are blamed for missed deadlines caused by time wasted on unnecessary optimization.)
There are two kinds of efficiency in Software Engineering:
The efficiency of the program
The efficiency of the programmer
These two efficiencies are in conflict.
How about this? (Sketched with java.util.function.Supplier standing in for the callback; the Map type and the surrounding names are taken from the question.)
void theMethod(Supplier<Map<String, String>> parameterSupplier) {
    for (Report report : reports) {
        createPrints(parameterSupplier.get());
        // general for-loop work
    }
}

// ... at the call site:
if (rp1) {
    theMethod(this::getGenParameters);
} else {
    theMethod(this::getParameters);
}

different approaches in retrieving Caller method details apart from stack trace [Java]

There is a need to pass caller method details in Java. I do not want to use the stack trace to find them.
Are there any alternative means to get them?
I know aspects would help, but there is a concern that they would slow down performance.
Any suggestions will help.
I am not aware of any.
In the end, you are asking for some sort of instrumentation. In other words: you want to tell the JVM to keep track of the call stack and, more importantly, make that information available to you programmatically.
And even when you only want that to happen for specific methods, the JVM still has to track all method invocations, as it can't know in advance whether one of the methods to track will be called. The fact that Java byte code is both interpreted and compiled to native machine code adds to the complexity, too.
So, as said: there is no way to track method invocations easily without a performance impact. And the tools I know of that can keep that impact at a reasonable level, like XRebel, are meant for later evaluation, not for programmatic consumption.
Finally: you should revisit your requirements. Java is simply not a good language when you really need such information; it isn't meant to keep call stacks around for inspection. So the real solution would be either to select a platform that works better for you, or (recommended) to step back and design a solution that doesn't have this requirement.
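If the requirement survives that redesign, the usual workaround is to make the caller explicit rather than discovering it; a sketch with invented names:

// Instead of asking the JVM "who called me?", have callers identify themselves.
void audit(String action, String caller) {
    System.out.println(action + " requested by " + caller);
}

// At each call site:
audit("delete-account", "AccountService.close");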

In java streams using .peek() is regarded as to be used for debugging purposes only, would logging be considered as debugging? [duplicate]

This question already has answers here:
In Java streams is peek really only for debugging?
(10 answers)
Closed 4 years ago.
So I have a list of objects, part or all of which I want to be processed, and I would like to log those objects that were processed.
consider a fictional example:
List<ClassInSchool> classes;
classes
    .stream()
    .filter(verifyClassInSixthGrade())
    .filter(classHasNoClassRoom())
    .peek(classInSchool -> log.debug("Processing classroom {} in sixth grade without classroom.", classInSchool))
    .forEach(findMatchingClassRoomIfAvailable());
Would using .peek() in this instance be regarded as unintended use of the API?
To further explain: in this question, the key takeaway is "Don't use the API in an unintended way, even if it accomplishes your immediate goal." My question is whether every use of peek, apart from debugging your stream until you have verified the whole chain works as designed (and then removing the .peek() again), is unintended use. Specifically, whether using it as a means to log every object actually processed by the stream is considered unintended use.
The documentation of peek describes the intent as
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline.
An expression of the form .peek(classInSchool -> log.debug("Processing classroom {} in sixth grade without classroom.", classInSchool)) fulfills this intent, as it is about reporting the processing of an element. It doesn't matter whether you use a logging framework or just print statements, as in the documentation's example, .peek(e -> System.out.println("Filtered value: " + e)). In either case, it is the intent that matters, not the technical approach. If someone used peek with the intent of printing all elements, it would be wrong, even if it used the same technical approach as the documentation's example (System.out.println).
The documentation doesn’t mandate that you have to distinguish between production environment or debugging environment, to remove the peek usage for the former. Actually, your use would even fulfill that, as the logging framework allows you to mute that action via the configurable logging level.
I would still suggest keeping in mind that for some pipelines, inserting a peek operation may add more overhead than the actual operation itself (or hinder the JVM's loop optimizations to such a degree). But if you are not experiencing performance problems, you may follow the old advice not to optimize unless you have a real reason…
peek should be avoided because for certain terminal operations it may not be called; see this answer. In that example it would probably be better to do the logging inside the action of forEach rather than using peek. Debugging in this situation means temporary code used for fixing a bug or diagnosing an issue.
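Under that reading, the fictional example would fold the logging into the terminal operation, e.g. (a sketch reusing the question's names and assuming findMatchingClassRoomIfAvailable() returns a Consumer<ClassInSchool>):

classes.stream()
    .filter(verifyClassInSixthGrade())
    .filter(classHasNoClassRoom())
    .forEach(classInSchool -> {
        log.debug("Processing classroom {} in sixth grade without classroom.", classInSchool);
        findMatchingClassRoomIfAvailable().accept(classInSchool);
    });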
In java streams using .peek() is regarded as to be used for debugging purposes only, would logging be considered as debugging?
It depends on whether your logging code is going to be a permanent fixture of your code, or not.
Only you can really know the real purpose of your logging ...
Also note that the javadoc says:
In cases where the stream implementation is able to optimize away the production of some or all the elements (such as with short-circuiting operations like findFirst, or in the example described in count()), the action will not be invoked for those elements.
So, you are liable to find that in some circumstances peek won't be a reliable way to log (or debug) your pipeline.
In general, adding peek is liable to change the behavior of the pipeline and/or the JVM's ability to optimize it ... in a current or future generation of the JVM.
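A concrete case where this bites, on Java 9 and later, where count() can often be computed straight from the source:

import java.util.stream.Stream;

// peek does not change the stream's size, so the implementation may skip
// executing the pipeline entirely, and nothing gets printed.
long n = Stream.of("a", "b", "c")
        .peek(e -> System.out.println("saw " + e)) // may never run
        .count();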
Eh, it's somewhat open to interpretation. Intent is something that's not always easy to determine.
I think the API note was mostly added to discourage an overzealous usage of peek when almost all desirable behaviour can be accomplished without it. It was just too useful for the developers to exclude it completely but they wanted to be clear that its inclusion was not to be taken as an unqualified endorsement; they saw the potential for misuse and they tried to address it.
I suspect - though I'm only speculating - that there were mixed opinions on whether to include it at all, and that including a version with a caveat in the JavaDoc was the compromise.
With that in mind, I think my suggestion for deciding when to use peek would simply be: don't use it unless you have a very good reason to.
In your case, you definitely don't have a good reason to. You're iterating over everything and passing the result to the method findMatchingClassRoomIfAvailable (well, presumably - your example wasn't very good). If you want to log something for each item in the stream, then just log it at the top of that method.
Is it misuse? I don't think so. Would I write it this way? No.

Why do we need getters?

I have read the Stack Overflow page which discusses "Why use getters and setters?", and I have been convinced by some of the reasons for using a setter (for example: later validation, data encapsulation). But what is the reason for using a getter? I don't see any harm in getting the value of a private field, or any reason to validate before you get a field's value. Is it OK to never use a getter and always read a field's value using dot notation?
If a given field in a Java class is visible for reading (on the RHS of an expression), then (unless the field is declared final) it is also possible to assign to it (on the LHS of an expression). For example:
class A {
    int someValue;
}

A a = new A();
int value = a.someValue; // if you can do this (potentially harmless)
a.someValue = 10;        // then you can also do this (bad)
Besides the above problem, a major reason for having a getter in a class is to shield the consumer of that class from implementation details. A getter does not necessarily have to simply return a value. It could return a value distilled from a Collection or something else entirely. By using a getter (and a setter), we free the consumer of the class from having to worry about the implementation changing over time.
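For example (a sketch with invented names), a getter is free to compute its result rather than return a stored field:

import java.util.ArrayList;
import java.util.List;

class Order {
    private final List<Integer> itemPrices = new ArrayList<>();

    void addItem(int price) {
        itemPrices.add(price);
    }

    // Callers just see "a total"; whether it is a stored field or a value
    // distilled from a collection can change later without breaking them.
    int getTotal() {
        return itemPrices.stream().mapToInt(Integer::intValue).sum();
    }
}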
I want to focus on practicalities, since I think you're at a point where you haven't seen the conceptual benefits line up just yet with the actual practice.
The obvious conceptual benefit is that setters and getters can be changed without impacting the outside world that uses those functions. Another Java-specific benefit is that all methods not marked as final are capable of being overridden, so you get the ability for subclasses to override the behavior as a bonus.
Overkill?
Yet you're probably at a point where you've heard these conceptual benefits before, and it still sounds like overkill for your everyday scenarios. A difficult part of understanding software engineering practices is that they are generally designed to deal with real-world, large-scale codebases managed by teams of developers. A lot of things are going to seem like overkill initially when you're just working on a small project of your own.
So let's get into some practical, real-world scenarios. I formerly worked in a very large-scale codebase. It was a low-level C codebase with a long legacy, sometimes barely a step above assembly, but many of the lessons I learned there translate to all kinds of languages.
Real-World Grief
In this codebase, we had a lot of bugs, and the majority of them related to state management and side effects. For example, we had cases where two fields of a structure were supposed to stay in sync with each other: the range of valid values for one field depended on the value of the other. Yet we ran into bugs where those two fields were out of sync. Unfortunately, since they were just public variables with a very global scope ('global' is really a matter of degree, measured by how much code can access a variable, rather than an absolute), there were potentially tens of thousands of lines of code that could be the culprit.
As a simpler example, we had cases where the value of a field was never supposed to be negative, yet in our debugging sessions, we found negative values. Let's call this value that's never supposed to be negative, x. When we discovered the bugs resulting from x being negative, it was long after x was touched by anything. So we spent hours placing memory breakpoints and trying to find needles in a haystack by looking at all possible places that modified x in some way. Eventually we found and fixed the bug, but it was a bug that should have been discovered years earlier and should have been much less painful to fix.
Such would have been the case if large portions of the codebase hadn't been directly accessing x, and had used a function like set_x instead. In that case, we could have done something as simple as this:
#include <assert.h>

void set_x(int new_value)
{
    assert(new_value >= 0); /* halt the moment the invariant is violated */
    x = new_value;
}
... and we would have discovered the culprit immediately and fixed it in a matter of minutes. Instead, we discovered it years after the bug was introduced and it took us meticulous hours of headaches to trace it down and fix.
Such is the price we can pay for ignoring engineering wisdom. After dealing with the 10,000th issue that could have been avoided by a practice as simple as going through functions rather than raw data throughout a codebase, if your hairs haven't all turned grey by that point, you're still generally not going to have a cheerful disposition.
The biggest value of getters and setters comes from the setters. It's state manipulation that you generally want to control the most, to prevent and detect bugs. The getter becomes a necessity simply as a result of requiring a setter to modify the data. Yet getters can also be useful when you want to exchange raw state for a computation non-intrusively, by just changing one function's implementation.
Interface Stability
One of the most difficult things to appreciate earlier in your career is going to be interface stability (to prevent public interfaces from changing constantly). This is something that can only be appreciated with projects of scale and possibly compatibility issues with third parties.
When you're working on a small project on your own, you might be able to change the public definition of a class to your heart's content and rewrite all the code using it. It won't seem like a big deal to constantly rewrite code this way, as the amount of code using an interface might be quite small (e.g., a few hundred lines of code using your class, all of which you wrote yourself).
When you work on a large-scale project and look down at millions of lines of code, changing the public definition of a widely-used class might mean that 100,000 lines of code need to be rewritten using that class in response. And a lot of that code won't even be your own code, so you have to intrusively analyze and fix other people's code and possibly collaborate with them closely to coordinate these changes. Some of these people may not even be on your team: they may be third parties writing plugins for your software or former developers who have moved on to other projects.
You really don't want to run into this scenario repeatedly, so designing public interfaces well enough to keep them stable (unchanging) becomes a key skill for your most central interfaces. If those interfaces leak implementation details like raw data, the temptation to change them over and over will be something you face constantly.
So you generally want to design interfaces to focus on "what" they should do, not "how" they should do it, since the "how" might change a lot more often than the "what". For example, perhaps a function should append a new element to a list. However, you may want to swap out the list data structure it's using for another, or introduce a lock to make that function thread safe ("how" concerns). If these "how" concerns are not leaked to the public interface, then you can change the implementation of that class (how it's doing things) locally without affecting any of the existing code that is requesting it to do things.
You also don't want classes to do too much and become monolithic, since then your class variables will become "more global" (become visible to a lot more code even within the class's implementation) and it'll also be hard to settle on a stable design when it's already doing so much (the more classes do, the more they'll want to do).
Getters and setters aren't the best examples of such interface design, but they do avoid exposing those "how" details at least slightly better than a publicly exposed variable, and thus have fewer reasons to change (break).
Practical Avoidance of Getters/Setters
Is it OK to never use a getter and always get a field's value using dot notation?
This could sometimes be okay. For example, if you are implementing a tree structure and it utilizes a node class as a private implementation detail that clients never use directly, then trying too hard to focus on the engineering of this node class is probably going to start becoming counter-productive.
There your node class isn't a public interface. It's a private implementation detail for your tree. You can guarantee that it won't be used by anything more than the tree implementation, so there it might be overkill to apply these kinds of practices.
Where you don't want to ignore such practices is in the real public interface, the tree interface. You don't want to allow the tree to be misused and left in an invalid state, and you don't want an unstable interface which you're constantly tempted to change long after the tree is being widely used.
Another case where it might be okay is if you're just working on a scrap project/experiment as a kind of learning exercise, and you know for sure that the code you write is rather disposable and is never going to be used in any project of scale or grow into anything of scale.
Nevertheless, if you're very new to these concepts, I think it's a useful exercise even for your small scale projects to err on the side of using getters/setters. It's similar to how Mr. Miyagi got Daniel-San to paint the fence, wash the car, etc. Daniel-San finds it all pointless with his arms exhausted on top of that. Then Mr. Miyagi goes "hyah hyah hyoh hyah" throwing big punches and kicks, and using that indirect training, Daniel-San blocks all of them without realizing how he's even doing it.
In Java you can't tell the compiler to allow read-only access to a public field from the outside while the class itself can still modify it (final comes closest, but it makes the field unassignable for everyone after construction).
So exposing a non-final public field opens the door to uncontrolled modification.
Fields are not polymorphic.
The alternative to a getter would be a public field; however, fields are not polymorphic.
This means that you cannot extend the class and "override" the field without introducing weird behaviour. Basically, the value you get will depend on how you refer to the field.
Furthermore, you can't include the field in an interface and you can't perform validation (that applies more to a setter).
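A short sketch of that weird behaviour with a hidden field:

class Base {
    int x = 1;
    int getX() { return x; }
}

class Derived extends Base {
    int x = 2; // hides Base.x rather than overriding it

    @Override
    int getX() { return x; }
}

// Elsewhere:
Base b = new Derived();
System.out.println(b.x);      // prints 1: field access binds to the declared type
System.out.println(b.getX()); // prints 2: the method call dispatches to Derived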

Shall I use guard clause, and try to avoid else clause?

I've read (e.g. from Martin Fowler) that we should use guard clauses instead of a single return in a (short) method in OOP. I've also read (somewhere I don't remember) that else clauses should be avoided when possible.
But my colleagues (I work in a small team of only 3 guys) force me not to use multiple returns in a method, and to use else clauses as much as possible, even if there is only one comment line in the else block.
This makes it difficult for me to follow their coding style, because, for example, I cannot view all the code of a method on one screen. And when I code, I have to write the guard clauses first and then convert the method into a form without multiple returns.
Am I wrong, or what should I do about it?
This is arguable, and a purely aesthetic question.
Early returns have historically been avoided in C and similar languages, because it was possible to miss resource cleanup, which is usually placed at the end of the function, when returning early.
Given that Java has exceptions and try/catch/finally, there's no need to fear early returns.
Personally, I agree with you. I return early often; that usually means less code and a simpler code flow, with less if/else nesting.
A guard clause is a good idea because it clearly indicates that the current method is not interested in certain cases. When you make clear at the very beginning of the method that it doesn't deal with some cases (e.g. when some value is less than zero), the rest of the method is purely the implementation of its responsibility.
There is one even stronger case for guard clauses: statements that validate input and throw exceptions when some argument is unacceptable, e.g. null. In that case you don't want to proceed with execution but wish to throw at the very beginning of the method. That is where guard clauses are the best solution, because you don't want to mix exception-throwing logic with the core logic of the method you're implementing.
When talking about guard clauses that throw exceptions, here is an article about how you can simplify them in C# using extension methods: How to Reduce Cyclomatic Complexity: Guard Clause. That technique is not available in Java, though it is useful in C#.
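In Java, such a validating guard clause is typically just the first few lines of the method; a sketch (Student is an invented type, Objects.requireNonNull is the standard JDK helper):

import java.util.Objects;

void enroll(Student student, int year) {
    Objects.requireNonNull(student, "student must not be null"); // guard: throw early
    if (year < 0) {
        throw new IllegalArgumentException("year must be non-negative: " + year);
    }
    // ... from here on, only the method's core responsibility ...
}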
Have them read http://www.cis.temple.edu/~ingargio/cis71/software/roberts/documents/loopexit.txt and see if it will change their minds. (There is history to their idea, but I side with you.)
Edit: Here are the critical points from the article. The principle of a single exit from each control structure was adopted on principle, not observational data. But observational data says that allowing multiple ways of exiting a control structure makes certain problems easier to solve accurately, and does not hurt readability; disallowing it makes code harder to write and more likely to be buggy. This holds across a wide variety of programmers, from students to textbook writers. Therefore we should allow and use multiple exits where appropriate.
I'm in the multiple-return/return-early camp, and I would lobby to convince other engineers of this. You can have great arguments and cite great sources, but in the end, all you can do is make your pitch, suggest compromises, come to a decision, and then work as a team, whichever way it works out. (Although revisiting the topic from time to time isn't out of the question either.)
This really just comes down to style and, in the grand scheme of things, a relatively minor one. Overall, you're a more effective developer if you can adapt to either style. If this really "makes it difficult ... to follow their coding style", then I suggest you work on it, because in the end, you'll end up the better engineer.
I had an engineer once come to me and insist he be given dispensation to follow his own coding style (and we had a pretty minimal set of guidelines). He said the established coding style hurt his eyes and made it difficult for him to concentrate (I think he may even have said "nauseous"). I told him that he was going to work on a lot of other people's code, not just code he wrote, and vice versa; if he couldn't adapt to work with the agreed-upon style, I couldn't use him, and maybe this type of collaborative project wasn't the right place for him. Coincidentally, it was less of an issue after that (although every code review was still a battle).
My issue with guard clauses is that 1) they can easily be dispersed through the code and be easy to miss (this has happened to me on multiple occasions); 2) I have to remember which cases have been "ejected" as I trace code blocks, which can become complex; and 3) by setting code within if/else, you have a contained block of code that you know executes for a given set of criteria. With guard conditions, the criteria is EVERYTHING minus what the guards have ejected. It is much more difficult for me to get my head around that.
