I'm working on a Scala-based script language (internal DSL) that allows users to define multiple data transformations functions in a Scala script file. Since the application of these functions could take several hours I would like to cache the results in a database.
Users are allowed to change the definition of the transformation functions and also to add new functions. However, when the user restarts the application with a slightly modified script, I would like to execute only those functions that have been changed or added. The question is how to detect those changes. For simplicity, let us assume that the user can only adapt the script file, so that any reference to something not defined in this script can be assumed to be unchanged.
In this case what's the best practice for detecting changes to such user-defined functions?
Until now I have thought about:
parsing the script file and calculating fingerprints based on the source code of the function definitions
getting the bytecode of each function at runtime and building fingerprints based on this data
applying the functions to some test data and calculating fingerprints on the results
However, all three approaches have their pitfalls.
Writing a parser for Scala to extract the function definitions could be quite some work, especially if you want to detect changes that indirectly affect the behaviour of your functions (e.g. if your function calls another (changed) function defined in the script).
Bytecode analysis could be another option, but I have never worked with those libraries, so I have no idea whether they can solve my problem, or how they deal with Java's dynamic binding.
The approach with example data is definitely the simplest one, but has the drawback that different user-defined functions could accidentally be mapped to the same fingerprint if they return the same results for my test data.
Does anyone have experience with one of these "solutions", or can you suggest a better one?
The second option doesn't look difficult. For example, with the Javassist library, obtaining the bytecode of a method is as simple as
CtClass c = ClassPool.getDefault().get(className);
for (CtMethod m : c.getDeclaredMethods()) {
    CodeAttribute ca = m.getMethodInfo().getCodeAttribute();
    if (ca != null) { // null for native and abstract methods
        byte[] byteCode = ca.getCode();
        ...
    }
}
So, as long as you assume that the results of your methods depend only on the code of those methods, it's pretty straightforward.
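If you go this route, turning that byteCode array into a comparable fingerprint is straightforward with the JDK's MessageDigest. The class below is an illustrative sketch, not part of Javassist; the byte array in main merely stands in for what ca.getCode() would return:

```java
import java.security.MessageDigest;

public class BytecodeFingerprint {
    // Hex SHA-256 digest of a method's bytecode, suitable as a cache key
    static String fingerprint(byte[] byteCode) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(byteCode)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] byteCode = "demo".getBytes("UTF-8"); // stand-in for ca.getCode()
        System.out.println(fingerprint(byteCode));
    }
}
```

Comparing the stored digest with the freshly computed one tells you whether the method's body changed since the last run.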
UPDATE:
On the other hand, since your methods are written in Scala, they probably contain closures, so parts of their code reside in anonymous classes, and you may need to trace the usage of these classes somehow.
I am struggling with the following problem and ask for help.
My application has a logger module. It takes the trace level and the message (as a string).
Messages often need to be constructed from different sources and/or in different ways (e.g. sometimes using String.format prior to logging, other times using the .toString() methods of various objects, etc.). Therefore the construction of the error messages cannot be generalized.
What I want is to make my logger module efficient. That means: a trace message should only be constructed if the current trace level would actually log it. And this without copy-pasted code all over my application.
With C/C++, using macros this was very easy to achieve:
#define LOG_IT(level, message) if(level>=App.actLevel_) LOG_MSG(message);
The LOG_MSG and the string construction was done only if the trace level enabled that message.
With Java, I can't find any similar possibility. The goals remain: logging should be one line (no if-else copy-paste everywhere), and the string construction (an expensive operation) should only be done if necessary.
The only solution I know is to surround every logger call with an if statement. But this is exactly what I avoided in the C++ app, and what I want to avoid in my current Java implementation.
My problem is that only Java 1.6 is available on the target system. Therefore Supplier is not an option.
What can I do in Java? How can this C/C++ method easily be done?
Firstly, I would encourage you to read this if you're thinking about implementing your own logger.
Then, I'd encourage you to look at a well-established logging API such as SLF4J. Whilst it is possible to create your own, using a pre-existing API will save you time and effort, and above all provide you with more features and flexibility out of the box (e.g. file-based configuration, customisability; look at Mapped Diagnostic Context).
To your specific question: there isn't a simple way to do what you're trying to do. C/C++ are fundamentally different from Java in that the preprocessor allows for macros like the one you've created above. Java doesn't really have an easy-to-use equivalent, though there are projects that make use of compile-time code generation, which is probably the closest equivalent (e.g. Project Lombok, MapStruct).
The simplest way I know of to avoid expensive string building operations whilst logging is to surround the building of the string with a simple conditional:
if (logger.isTraceEnabled())
{
    // Really expensive operation here
}
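In Java 6 terms, the same guard works with java.util.logging's isLoggable; below is a minimal sketch (the logger name and message are illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLogging {
    private static final Logger LOG = Logger.getLogger("app"); // name is illustrative

    static String expensiveMessage() {
        // stands in for costly String.format / toString() work
        return String.format("Value is: %d", 42);
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        if (LOG.isLoggable(Level.FINE)) {   // guard: the message is built only when needed
            LOG.fine(expensiveMessage());
        }
    }
}
```

Since FINE is below the configured INFO level here, expensiveMessage() is never called.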
Or, if you're using Java 8, the standard logging library takes a java.util.function.Supplier<String> argument which will only be evaluated if the current log level matches that of the logging method being called:
log.fine(() -> "Value is: " + getValue());
There is also currently a ticket open for SLF4j to implement this functionality here.
If you're really, really set on implementing your own logger, the two features above are easy enough to implement yourself, but again, I'd encourage you not to.
Edit: Aspectj compile time weaving can be used to achieve something similar to what you're trying to achieve. It would allow you to wrap all your logging statements with a conditional statement in order to remove the boilerplate checking.
Newer logging libraries, including java.util.logging, have a second form of their methods, taking a Supplier<String>.
e.g. log.info(() -> "Hello"); instead of log.info("Hello");.
The get() method of the supplier is only called if the message has effectively to be logged, therefore your string is only constructed in that case.
I think the most important thing to understand here is that the C/C++ macro solution does not save computational effort by not constructing the logged message when the log level is such that the message would not be logged.
Why is so? Simply because the macro method would make the pre-processor substitute every usage of the macro:
LOG_IT(level, message)
with the code:
if(level>=App.actLevel_) LOG_MSG(message);
substituting whatever you passed as level and whatever you passed as message into the macro body. The resulting compiled code will be exactly the same as if you had copied and pasted the macro code everywhere in your program. The only thing macros help you with is to avoid the actual copying and pasting, and to make the code more readable and maintainable.
Sometimes they manage to do that; other times they make the code more cryptic and thus harder to maintain. In any case, macros do not provide deferred execution to save you from actually constructing the string, the way the Java 8 Logger class does by using lambda expressions. Java defers the execution of the body of a lambda until the last possible moment; in other words, the body of the lambda is executed only after the if statement.
To go back to your example in C/C++: you, as a developer, would probably want the code to work regardless of the log level, so you would be forced to construct a valid string message and pass it to the macro; otherwise, at certain log levels, the program would crash! So, since the message-string construction code must run before the call to the macro, you will execute it every time, regardless of the log level.
So, making the equivalent of your code is quite simple in Java 6! You just use the built-in Logger class. It supports logging levels automatically, so you do not need a custom implementation of them.
If what you are asking is how to implement deferred execution without lambdas, though, the best you can do is pass an object whose method builds the message, i.e. an anonymous inner class, at the cost of considerable boilerplate.
If you wanted real deferred execution in C/C++, you would have to make the logging code take a function pointer to a function returning the message string, execute that function inside the if statement, and call your macro passing not a string but a function that creates and returns the string. The actual C/C++ code for this is out of scope for this question. The key concept is that C/C++ give you the tools for deferred execution simply because they support function pointers. Java had no lightweight equivalent until Java 8.
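To make the anonymous-class workaround concrete, here is a Java 6 compatible sketch; MessageSupplier and LazyLogger are illustrative names, not part of any real library:

```java
// The interface plays the role of the function pointer.
interface MessageSupplier {
    String get();
}

class LazyLogger {
    static final int FINE = 500, INFO = 800;
    private final int level = INFO; // messages below INFO are dropped
    int messagesBuilt = 0;          // counts how often a message was really built

    void log(int messageLevel, MessageSupplier supplier) {
        if (messageLevel >= level) {        // the string is built only past this guard
            System.out.println(supplier.get());
        }
    }
}

public class DeferredDemo {
    public static void main(String[] args) {
        final LazyLogger logger = new LazyLogger();
        logger.log(LazyLogger.FINE, new MessageSupplier() {
            public String get() {
                logger.messagesBuilt++;
                return "expensive message";
            }
        });
        // FINE < INFO, so get() was never called
        System.out.println(logger.messagesBuilt); // prints 0
    }
}
```

The call site is still one line per log statement, but clearly far noisier than the Java 8 lambda form.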
The world of Minecraft Modding has made me curious about differences in mechanisms between Java and C/C++ libraries to allow methods / functions in the libraries to be invoked externally.
My understanding is that Minecraft Modding came about due to the ability to decompile / reflect over Java in order to reverse engineer classes and methods that can be invoked from the library. I believe that the Java class specification includes quite a lot of metadata about the structure of classes allowing code to be used in ways other than intended.
There are some obfuscation tools around that try to make it harder to reverse engineer Java but overall it seems to be quite difficult to prevent.
I don't have the depth of knowledge in C/C++ to know to what degree the same can be done there.
C/C++ code is compiled natively ahead of time. The end result is machine code specific to that platform. C/C++ has the notion of exporting functions so that they can be accessed from outside the library or executable. Some libraries also have an entry point.
Typically when connecting to external functions there is a header file to list what functions are available to code against from the library.
I would assume there would need to be a mechanism to map an exposed function to its address within the library's or executable's machine code, so that the function calls get made in the right place.
Typically connecting the function calls together with the address is the job of the linker. The linker still needs to somehow know where to find these functions.
This makes me wonder whether it is fundamentally possible to invoke non-exported functions. If so, would this require the ability to locate their addresses and understand their parameter formats?
Function calls in C/C++, as I understand it, are typically done by assigning the parameters to registers for simple functions, or to the stack for more complex ones.
I don't know if the practice of invoking non-public APIs in native code is common, or if the inherent difficulty in doing so makes native code fairly safe from this kind of use.
First of all, there are tools (of varying quality and capability) to reverse engineer compiled machine code back to the original language [or another language, for that matter]. The biggest problem when doing this is that in languages such as C and C++, the members of a structure don't keep their names, and the structure often becomes "flat", so what was originally:
struct user
{
std::string name;
int age;
int score;
};
will become:
struct s0
{
char *f0;
char *f1;
int f2;
int f3;
};
[Note of course that std::string may be implemented in a dozen different ways, and the "two pointers" is just one plausible variant]
Of course, if there is a header file describing how the library works, you can use the data structures in that to get better type information. Likewise, if there is debug information in the file, it can be used to form data structures and variable names in a much better way. But someone who wants to keep these things private will (most often) not ship the code with debug symbols, and only publish the actual necessary parts to call the public functionality.
But if you understand how these are used [or read some code that, for example, displays a "user"], you can figure out which field is the name, which is the age, and which is the score.
Understanding what is an array and what is separate fields can also be difficult. Which is it:
struct
{
int x, y, z;
};
or
int arr[3];
Several years ago, I started on a patience card game (similar to "Solitaire"). To do that, I needed a way to display cards on the screen. So I thought, "well, there's one in the existing Solitaire on Windows, I bet I can figure out how to use that", and indeed, I did. I could draw the Queen of Clubs or the Two of Spades as I wished. I never finished the actual game-play part, but I certainly managed to load the card-drawing functionality from a non-public shared library. Not rocket science by any means (there are people who do this for commercial games with thousands of functions and really complex data structures; this had two or three functions that you needed to call), but I didn't spend much time on it either: a couple of hours, if I remember right, from coming up with the idea to having something that "works".
As for the second part of your question: plugin interfaces (such as filter plugins for Photoshop, or transitions in video editors) are very often implemented as "shared libraries" (aka "dynamic link libraries", DLLs).
There are functions in the OS to load a shared library into memory, and to query for functions by their name. The interface of these functions is (typically) pre-defined, so a function pointer prototype in a header-file can be used to form the actual call.
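For comparison, the Java analogue of this load-and-query-by-name mechanism is reflection, which leans on exactly the class metadata mentioned in the question; a minimal sketch:

```java
import java.lang.reflect.Method;

public class LookupByName {
    public static void main(String[] args) throws Exception {
        // Analogue of dlopen + dlsym: load a class, then find a method by name
        Class<?> mathClass = Class.forName("java.lang.Math");
        Method max = mathClass.getMethod("max", int.class, int.class);
        Object result = max.invoke(null, 3, 7);
        System.out.println(result); // prints 7
    }
}
```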
As long as the compiler for the shared library and the application code are using the same ABI (application binary interface), all should work out when it comes to how arguments are passed from the caller to the function - it's not like the compiler just randomly uses whatever register it fancies; the parameters are passed in a well-defined order, and which register is used for what is defined by the ABI specification for a given processor architecture. [It gets more complicated if you have to know the contents of data structures and there are different versions of those structures - say, for example, someone has a std::string that contains two pointers (start and end), and for whatever reason the design is changed to one pointer and a length - both the application code and the shared library need to be compiled with the same version of std::string, or bad things will happen!]
Non-public API functions CAN be called, but they wouldn't be discoverable by calling the query for finding a function by name - you'd have to figure out some other way - for example by knowing that "this function is 132 bytes on from the function XYZ", and of course, you wouldn't have the function prototype either.
There is of course the added complication that, where Java bytecode is portable across many different processor architectures, machine code only works on a defined set of processors - code for x86 works on Intel and AMD processors (and maybe a few others), code for ARM processors works on chips implementing the ARM instruction set, and so on. You have to compile the C or C++ code for the given processor.
One version of a function may run on the CPU, another on the GPU, but both do the same job.
Sure, you want to use the GPU solution (assuming it's faster), but it may not be available, for example on an older OpenGL version.
Instead of programming a check (if available then use "this" else "that") at each function call, you may want to just call one function through a reference bound to it at load time.
To go further into optimization, imagine 4 solutions:
CPU + optimized for small pictures
CPU + optimized for big pictures
GPU + optimized for small pictures
GPU + optimized for big pictures
Now, not only do you (as a programmer) have to eliminate 2 possibilities depending on old/new "OpenGL version", you also have to choose one of the 2 remaining possibilities depending on usage.
Some calls only ever have small or big pictures as function parameters, but in other places of your code you need to choose which function to call depending on the picture parameter's values.
- For 4x4 pixel pictures or small lookup-tables even the CPU-solution could be the fastest (lower overhead)
One solution could be to make a function from which code paths split and lead to optimized functions.
This works within the same package, but not for different packages providing the same function (example: standard library vs. driver library/hooks).
Another solution could be to write yet another package which wraps the used ones and chooses the function optimized for a certain task.
Yet another, even uglier, solution would be to update each function call by hand.
But the solution I am searching for uses, for each call, a function reference given to the call site at program-loading time, depending on the hardware or software environment.
It should even be able to change when dependency libraries load or unload.
(For example: a new version of the other library is installed and the old one uninstalled while your program is running, or waiting during execution of another thread on this CPU core.)
The program shouldn't care whether there are one or more functions under a given name; it should only care which is the fastest to execute.
Example:
package Pictures; //has averageRedValue( byte[height][width][RGB] )
package Images; //has averageRedValue( byte[height][width][RGB] ) too
If they both give the same result, why should the programmer care which one is used?
He wants the fastest solution, or an option read from a settings file.
And the end user wants a simple option to choose the same functions as used at a past date, which calls for version control and rollback features.
Please tell me if you have seen a solution, or have an idea where to look.
This is all rather confused, but for the normal case the solution is trivial: write an interface containing the needed methods and make sure that only the best implementation gets loaded.
The JIT determines that there's just one such implementation (Class Hierarchy Analysis) and calls the proper method directly.
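A minimal sketch of that pattern, with illustrative names (the GPU-backed variant is omitted; select() is where the environment probing would happen):

```java
// One interface, interchangeable implementations; callers never care which was picked.
interface AverageRed {
    double compute(byte[][][] picture); // picture[height][width][RGB]
}

class CpuAverageRed implements AverageRed {
    public double compute(byte[][][] picture) {
        long sum = 0;
        int count = 0;
        for (byte[][] row : picture) {
            for (byte[] pixel : row) {
                sum += pixel[0] & 0xFF; // red channel
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}

public class Dispatch {
    // In a real program this would probe the OpenGL version, GPU, picture size, etc.
    static AverageRed select() {
        return new CpuAverageRed(); // a GPU-backed variant would be returned here when available
    }

    public static void main(String[] args) {
        AverageRed impl = select();
        byte[][][] pic = { { { 10, 0, 0 }, { 20, 0, 0 } } };
        System.out.println(impl.compute(pic)); // prints 15.0
    }
}
```

As long as only one implementation is ever loaded, the JIT devirtualizes the call as described above.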
It should even be able to change when dependency-libraries load or unload.
Java can't do this efficiently. Whenever a second implementation gets loaded, the optimized code must be thrown away and the methods recompiled. The conditional branch is still pretty cheap; with more implementations loaded it gets slower.
There's no way to unload a class without using some classloader magic.
What do you need it for?
I am working on a big Java project not so well engineered, and we actually have two main development branches.
One branch, A, is a subset of the second, B: it has all the functionality of the latter but no security checks on user operations (these are just hashes on files that record which user did what).
Since the development is done on the A branch, I have to manually merge all the work on branch B whenever a bugfix is done.
The codebase is huge and has interdependencies all around, but rewriting it is out of the question (funding problems, as usual). Moreover, the whole architecture is so complex that any structural change can have strange side effects.
(I realize that this is a programmer's nightmare!).
Now, my question as a Java beginner is the following: would it be possible to "externalize" some functions of some classes -- that is, all the functions that implement security checks -- into an external library, so that the code executes these functions whenever the library is present in the jar file, and executes the plain "no-security" functions otherwise?
Just to be clear, here's a small schematic of what I would like to do:
--- branch A ---
+ class ONE
f1()
f2()
+ class TWO
g1()
g2()
--- branch B ---
+ class ONE
f1*()
f2()
+ class TWO
g1*()
g2()
The code has to execute f1() and g1() whenever the library is not present, but execute their starred versions if the library is there.
Ideally, given the problems mentioned above, I would like to just cut and paste the "security-related" functions into a set of Java files and compile them as a library; I would perform changes to these functions manually when needed -- they are not modified often.
Is there otherwise a way to deal with this situation that avoids these problems?
Thanks a lot in advance!
@RH6, what you are asking is certainly possible, but may not be very easy in the situation you described above. As detailed above, the fundamental idea is to look for the presence/absence of the library in question and behave accordingly. This is more a design matter, and there is more than one approach, so right from the onset you should be prepared to modify your design to incorporate this behaviour.
One avenue you could explore is to use AspectJ and weave advices (an around advice). In the advice body you could check whether the required JAR is present. If it is, you could load the required class (a custom class loader is not necessary if the JAR is on the classpath), create an object of it, and execute the f1*()/g1*() method. If the JAR is not present, proceed to execute the f1()/g1() method.
As you have observed, this method is only slightly intrusive (it requires build-level changes to the existing code base), but it would require you to modify the build process as well as develop and maintain the advices.
I don't think you need to load functions dynamically. For example, you can either:
Make B extend A (and name it something like SecuredA) and override f1() and g1() to add the required security checks.
Create a SecurityManager interface that is called inside f1() and g1(). You then create two implementations: one that does nothing (= A) and one that performs the security-related functions (= B). Then you just have to inject/use the correct SecurityManager depending on the case.
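A sketch of the second option, with illustrative names; the idea is that both branches share the same class bodies and only the injected checker differs:

```java
import java.util.ArrayList;
import java.util.List;

// Named SecurityChecker here to avoid clashing with java.lang.SecurityManager.
interface SecurityChecker {
    void check(String user, String operation);
}

// Branch A behaviour: no security at all
class NoOpChecker implements SecurityChecker {
    public void check(String user, String operation) { }
}

// Branch B behaviour: record which user did what (the real hashing is elided)
class AuditingChecker implements SecurityChecker {
    final List<String> audit = new ArrayList<String>();
    public void check(String user, String operation) {
        audit.add(user + ":" + operation);
    }
}

class One {
    private final SecurityChecker checker;
    One(SecurityChecker checker) { this.checker = checker; }

    void f1() {
        checker.check("alice", "f1"); // behaves as f1*() when AuditingChecker is injected
        // ... original f1() body ...
    }
}

public class SecurityDemo {
    public static void main(String[] args) {
        AuditingChecker checker = new AuditingChecker();
        new One(checker).f1();
        System.out.println(checker.audit); // prints [alice:f1]
    }
}
```

With this shape, "the library is present" simply means "an AuditingChecker was injected instead of the no-op one".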
There are various design principles to solve this.
For example: IoC (inversion of control).
In software engineering, inversion of control (IoC) describes a design in which custom-written portions of a computer program receive the flow of control from a generic, reusable library. A software architecture with this design inverts control as compared to traditional procedural programming: in traditional programming, the custom code that expresses the purpose of the program calls into reusable libraries to take care of generic tasks, but with inversion of control, it is the reusable code that calls into the custom, or task-specific, code.
The most popular framework for this (as far as I know) is Spring. During the instantiation of your objects you use a factory method, and this factory method checks an XML file for possible overruling.
Here is an example:
<?xml version="1.0" encoding="UTF-8"?>
<beans ...>
<bean id="myClass" class="package.my.MyClass" />
</beans>
Alternatively, if you don't like the Spring dependency, you can create something yourself using some reflection:
Class defaultClass = package.my.MyClass.class;
String overruledClassName = System.getProperty(defaultClass.getName() + ".clazz");
Class clazz = (overruledClassName == null) ? defaultClass : Class.forName(overruledClassName);
Object createdObject = clazz.newInstance();
In combination with a property file that contains the following property:
package.my.MyClass.clazz = package.my.MyClassVersion2
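A runnable variant of the same reflection idea, assuming the overruling is supplied as a system property; the JDK collection classes here are just stand-ins for your own default and overruled classes:

```java
public class OverrideDemo {
    public static void main(String[] args) throws Exception {
        Class<?> defaultClass = java.util.ArrayList.class;
        // Normally set on the command line, e.g. -Djava.util.ArrayList.clazz=java.util.LinkedList
        System.setProperty(defaultClass.getName() + ".clazz", "java.util.LinkedList");

        String overruledClassName = System.getProperty(defaultClass.getName() + ".clazz");
        Class<?> clazz = (overruledClassName == null)
                ? defaultClass
                : Class.forName(overruledClassName);
        Object createdObject = clazz.getDeclaredConstructor().newInstance();
        System.out.println(createdObject.getClass().getName()); // prints java.util.LinkedList
    }
}
```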
It seems that Java is not set up to do what I have previously done in C++ (no big surprise there). I've got a set of rules that are generated from another application (a series of if-then checks). These rules change from time to time, so in C++ I would do this:
double variableForRules=1;
bool condition=false;
#include "rules.out"
if(condition) //do something
Essentially, the if-then checks in rules.out use variableForRules (and several other variables) to decide whether condition should be set to true. If it is true after evaluating the rules, the program does something.
Is there a similar way to do this in Java? Or is my only option to have rules.out actually be an entire class that needs to be instantiated, etc.?
Thanks!
Since you're autogenerating that rules.out, you could autogenerate your Java function as well. Hopefully it's not too painful to add that functionality.
Since there is no preprocessor in Java, you can't do this. Like you said, you have to implement your logic inside a class.
Maybe you could use scripting for that. You could take a look at the Java Scripting Programmer's guide.
In Java, it would be common for the other application to save the rules into an .xml or .properties file, and then have Java read in that file.
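A minimal sketch of that idea, assuming the generating application can emit a .properties file with the rule thresholds instead of C-style if-then source (the key name is illustrative; the StringReader stands in for a file):

```java
import java.io.StringReader;
import java.util.Properties;

public class RulesFromProperties {
    public static void main(String[] args) throws Exception {
        // In a real program this would be new FileReader("rules.properties")
        Properties rules = new Properties();
        rules.load(new StringReader("threshold=1\n"));

        double variableForRules = 1;
        boolean condition =
            variableForRules >= Double.parseDouble(rules.getProperty("threshold"));
        System.out.println(condition); // prints true
    }
}
```

This only works if the rules can be reduced to data (thresholds, ranges); arbitrary logic still needs one of the code-generation or scripting approaches below.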
rules.out actually has to be an entire class that needs to be instantiated for code to be executed.
Since rules.out is generated by a third-party application, the best thing would be to write your own CppToJavaTransformer that reads rules.out as input and generates Rules.java. This assumes rules.out is available before compile time and Rules.java is used at compile time. The drawback is that an extra transformation step is required.
Alternatively, you can write code that interprets rules.out and executes the required instructions using introspection. This is the hard way, but rules.out can then be changed at runtime as well.
Even if you could include the rules file, it would be of no help. Your rules are dynamic, as you said; it looks like your Java code needs to change for each scenario.
You could try using reflection (See: Creating New Objects and Invoking Methods by Name)
First generate a Rules.java in the manner in which you currently build rules.out and compile it.
Then load the class file into your app at runtime in the manner in which JDBC drivers were traditionally loaded.
Class clazz = Class.forName("com.mydomain.Rules");
If your app runs for long periods of time (longer than the lifetime of a single Rules.class file), then you would have to create your own ClassLoader in order to swap out the underlying class during a single runtime.