Java frameworks complexity statistics

It is extremely difficult to illustrate the complexity of frameworks (hibernate, spring, apache-commons, ...).
The only thing I could think of was to compare the file sizes of the jar libraries or, even better, the number of classes contained in the jar files.
Of course this is not a mathematically sound proof of complexity. But at least it should make clear that some frameworks are lightweight compared to others.
Of course it would take quite some time to calculate such statistics. In an attempt to save time, I was wondering whether somebody has perhaps already done so?
EDIT:
Yes, there are a lot of tools that calculate the complexity of individual methods and classes. But this question is about third-party jar files as a whole.
Also please note that 40% of the phrases in my original question already stress that everybody is well aware that complexity is hard to measure, and that file size and number of classes may indeed not be sufficient. So it is not necessary to elaborate on this any further.
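For reference, the kind of counting I have in mind is trivial to script. Here is a minimal sketch using only the JDK (the lib directory is just a placeholder for wherever the jars live); it reports the crude size and class-count metrics discussed above, nothing more:
import java.io.IOException;
import java.nio.file.*;
import java.util.jar.JarFile;

// Minimal sketch: report file size and number of .class entries
// for every jar in a directory. This is only the crude metric
// discussed above, not a real complexity measure.
public class JarStats {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "lib"); // placeholder directory
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jarPath : jars) {
                long classes;
                try (JarFile jar = new JarFile(jarPath.toFile())) {
                    classes = jar.stream()
                                 .filter(e -> e.getName().endsWith(".class"))
                                 .count();
                }
                System.out.printf("%-40s %10d bytes %6d classes%n",
                        jarPath.getFileName(), Files.size(jarPath), classes);
            }
        }
    }
}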

There are tools out there that can measure the complexity of code. However, this is more of a psychological question, as you cannot mathematically define the term 'complex code'. Obviously, giving the same piece of code to two random people will get you very different answers.
In general, the issue with complexity arises from the fact that a human brain cannot process more than a certain number of lines of code simultaneously (strictly speaking, functional pieces, but ordinary lines of code should approximate that). The exact number of lines one can hold and understand in memory at the same time varies with many factors (including time of day, day of the week and the status of your coffee machine) and therefore depends entirely on the audience. Still, the fewer lines of code you have to keep in your 'internal memory register' for one task, the better, so this should be the main factor when trying to determine the complexity of an API.
There is, however, a pitfall with this way of measuring complexity: many APIs offer you a fast way of solving a problem (an easy entry level), but that solution later turns out to force several very complex coding decisions that, overall, make your code very difficult to understand. In contrast, other APIs require a very complex setup that is hard to understand at first, but the rest of your code becomes extremely easy because of that initial setup.
A good way of measuring API complexity is therefore to define a representative and sufficiently large task to solve with that API, and then measure the average number of simultaneous lines of code one has to keep in mind to implement that task. And once you're done, please publish the result in a scientific paper of your choice. ;)

Related

Cyclomatic Complexity in IntelliJ

I was working on an assignment today that basically asked us to write a Java program that checks whether the HTML syntax in a text file is valid. Pretty simple assignment, and I did it very quickly, but in doing it so quickly I made it very convoluted (lots of loops and if statements). I know I can make it a lot simpler, and I will before turning it in, but amid my procrastination I started downloading plugins and seeing what information they could give me.
I downloaded two in particular that I'm curious about: CodeMetrics and MetricsReloaded. I was wondering what exactly the numbers they generate correspond to. I saw one post that was semi-similar and read it as well as the linked articles, but I'm still having trouble understanding a couple of things. Namely, what the first two columns (CogC and ev(G)) mean, as well as some more clarification on the other two (iv(G) and v(G)).
MetricsReloaded method metrics (screenshot omitted):
MetricsReloaded class metrics (screenshot omitted):
These numbers are from MetricsReloaded, but the other application, CodeMetrics, which also calculates cyclomatic complexity, gives slightly different numbers. I was wondering how these numbers correlate, and whether someone could give a brief general explanation of all this.
CodeMetrics analysis results (screenshot omitted):
My final question is about time complexity. My understanding of cyclomatic complexity is that it is the number of possible execution paths, determined by the number of conditionals and how they are nested. It doesn't seem like it would, but does this correlate in any way with time complexity? And if so, is there an easy conversion between them? If not, is there a way in either of these plug-ins (or any other IntelliJ plugin) to automate time complexity calculations?
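For reference, here is how I understand the v(G) count on a small made-up method; the annotations are my own hand count, not output from either plugin:
// Hypothetical example: cyclomatic complexity v(G) counted by hand.
// Start at 1 for the method itself, then add 1 for every decision point
// (if, for, while, case, &&, ||, ?:).
static boolean isValidTag(String tag) {
    if (tag == null || tag.isEmpty()) {           // +1 for the if, +1 for ||
        return false;
    }
    for (int i = 0; i < tag.length(); i++) {      // +1 for the loop
        if (!Character.isLetter(tag.charAt(i))) { // +1
            return false;
        }
    }
    return true;
}
// v(G) = 1 + 4 = 5. Note this says nothing about time complexity:
// the method is O(n) in the tag length no matter how many branches it has.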

Custom math functions vs. supplied Math functions?

I am basically making a Java program that will have to run a lot of calculations pretty quickly (each frame, aiming for at least 30 fps). These will mostly be trigonometric and power functions.
The question I'm asking is:
Which is faster: using the Math functions already supplied by Java, or writing my own?
The built-in Math functions will be extremely difficult to beat, given that most of them are backed by JVM intrinsics that map to fast native or hardware implementations. You could conceivably beat some of them by trading away accuracy, with a lot of work, but you're very unlikely to beat the Math utilities otherwise.
You will want to use the java.lang.Math functions, as most of them run as native or intrinsic code in the JVM. You can see the source code here.
Lots of very intelligent and well-qualified people have put a lot of effort, over many years, into making the Math functions work as quickly and as accurately as possible. So unless you're smarter than all of them, and have years of free time to spend on this, it's very unlikely that you'll be able to do a better job.
Most of them are native too; they're not actually implemented in Java. So writing faster versions of them in Java is a non-starter. You're probably best off with a mixture of C and assembly language if you do write your own, and you'll need to know all the quirks of whatever hardware you're going to run this on.
Moreover, the current implementations have been tested over many years, by the fact that millions of people all around the world are using Java in some way. You're not going to have access to the same body of testers, so your functions will automatically be more error-prone than the standard ones. This is unavoidable.
So are you still thinking about writing your own functions?
If you can bear roughly 1e-15 relative error (or more like 1e-13 for pow(double,double)), you can try this library, which should be faster than java.lang.Math if you call it a lot: http://sourceforge.net/projects/jafama/
As some have said, it's usually hard to beat java.lang.Math in pure Java if you want to keep similar (1-ulp-ish) accuracy, but a little less accuracy in double precision is often perfectly bearable (and still much more accurate than what you would get computing with floats), and can allow for a noticeable speed-up.
Another option is caching the values. If you know you are only going to need a fixed set of inputs, or if you can get away without perfect accuracy, this can save a lot of time. Say you want to draw a lot of circles: precompute the values of sin and cos for each degree, then use those values when drawing. Most circles will be small enough that you can't see the difference, and the small number that are very big can be drawn using the library functions.
Be sure to test whether this is worth it. On my 5-year-old MacBook I can do a million evaluations of cos per second.
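A minimal sketch of that per-degree table (class and method names are made up; there is no interpolation, so it is only accurate to the nearest degree):
// Minimal sketch of a per-degree sin/cos lookup table.
// Accurate only to the nearest degree: fine for drawing small circles,
// not for anything that needs real precision.
final class DegreeTrig {
    private static final double[] SIN = new double[360];
    private static final double[] COS = new double[360];
    static {
        for (int d = 0; d < 360; d++) {
            SIN[d] = Math.sin(Math.toRadians(d));
            COS[d] = Math.cos(Math.toRadians(d));
        }
    }
    static double sin(int degrees) {
        return SIN[((degrees % 360) + 360) % 360]; // handles negative angles too
    }
    static double cos(int degrees) {
        return COS[((degrees % 360) + 360) % 360];
    }
}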

Where can I get large number of rules and facts or how can I generate them for Drools benchmark?

I would like to test Drools performance, such as memory consumption and inference speed, for a large amount of data. I did this by running the benchmarks available in the Drools project, https://github.com/droolsjbpm/drools, just like the other examples there. There are commonly used benchmarks such as Manners, Waltz and WaltzDB, but on my computer they take a dozen seconds or so. Could you suggest any sources of rules and objects/facts that I can use and test for free with Drools? Maybe it is possible to generate such data and rules? If so, how could I do that?
Thanks for the help.
It's worth noting that those benchmarks have no purpose whatsoever: they are specifically designed to do things that are inefficient in rules engines. They have very little value even for comparisons between engines, given that you're unlikely to ever write a real-world application that looks anything like Miss Manners.
If you just want large amounts of data for your tests, there is loads of open data out there. For instance, the UK provides a variety of open data sets. You can pick one which suits your experiment here.
http://data.gov.uk/data/search
Or you could grab a load of gene sequence data from GenBank:
http://www.ncbi.nlm.nih.gov/genbank/
There's loads of free data out there, for which you could write rules.
If you are really looking to benchmark rules engines, then it would probably be better to generate the data yourself. That's the best way to ensure that you get reliable statistical variations.
However, all you will be doing is benchmarking a specific set of rules. Any such benchmarks would be redundant as soon as the rules change.
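If you do generate your own facts, something as simple as the sketch below gives you control over volume and statistical spread (the Customer fields are invented purely for illustration); you would then insert the generated objects into your Drools session as usual:
import java.util.*;

// Hypothetical fact generator: produces N simple POJOs with controlled
// random variation, to be inserted into a Drools session for benchmarking.
public class FactGenerator {
    public static final class Customer {
        final int id;
        final int age;
        final double balance;
        Customer(int id, int age, double balance) {
            this.id = id; this.age = age; this.balance = balance;
        }
    }

    public static List<Customer> generate(int count, long seed) {
        Random rnd = new Random(seed); // fixed seed makes runs repeatable
        List<Customer> facts = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            facts.add(new Customer(i, 18 + rnd.nextInt(70), rnd.nextGaussian() * 1000));
        }
        return facts;
    }
}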

What can I do in Java code to optimize for CPU caching?

When writing a Java program, do I have any influence on how the CPU will utilize its cache to store my data? For example, if I have an array that is accessed a lot, does it help if it's small enough to fit into one cache line (typically 64 bytes on current x86-64 hardware)? What if I keep a much-used object within that limit: can I expect the memory used by its members to be close together and to stay in cache?
Background: I'm building a compressed digital tree that's heavily inspired by Judy arrays, which are written in C. While I'm mostly after their node compression techniques, Judy has CPU cache optimization as a central design goal, and the node types as well as the heuristics for switching between them are heavily influenced by that. I was wondering whether I have any chance of getting those benefits too?
Edit: The general advice of the answers so far is: don't try to micro-optimize machine-level details when you're as far away from the machine as you are in Java. I totally agree, so I felt I had to add some (hopefully) clarifying comments to better explain why I think the question still makes sense. These are below:
There are some things that are just generally easier for computers to handle because of the way they are built. I have seen Java code run noticeably faster on compressed data (from memory), even though the decompression had to use additional CPU cycles. If the data were stored on disk, it's obvious why that is so, but of course in RAM it's the same principle.
Now, computer science has lots to say about what those things are, for example, locality of reference is great in C and I guess it's still great in Java, maybe even more so, if it helps the optimizing runtime to do more clever things. But how you accomplish it might be very different. In C, I might write code that manages larger chunks of memory itself and uses adjacent pointers for related data.
In Java, I can't (and don't want to) know much about how memory is going to be managed by a particular runtime. So I have to take optimizations to a higher level of abstraction, too. My question is basically, how do I do that? For locality of reference, what does "close together" mean at the level of abstraction I'm working on in Java? Same object? Same type? Same array?
In general, I don't think that abstraction layers change the "laws of physics", metaphorically speaking. Doubling your array in size every time you run out of space is a good strategy in Java, too, even though you don't call malloc() anymore.
The key to good performance with Java is to write idiomatic code, rather than trying to outwit the JIT compiler. If you write your code to try to influence it to do things in a certain way at the native instruction level, you are more likely to shoot yourself in the foot.
That isn't to say that common principles like locality of reference don't matter. They do, but I would consider the use of arrays and such to be performance-aware, idiomatic code, but not "tricky."
HotSpot and other optimizing runtimes are extremely clever about how they optimize code for specific processors. (For an example, check out this discussion.) If I were an expert machine language programmer, I'd write machine language, not Java. And if I'm not, it would be unwise to think that I could do a better job of optimizing my code than the experts.
Also, even if you do know the best way to implement something for a particular CPU, the beauty of Java is write-once-run-anywhere. Clever tricks to "optimize" Java code tend to make optimization opportunities harder for the JIT to recognize, while straightforward code that adheres to common idioms is easier for an optimizer to work with. So even when you get the best Java code for your testbed, that code might perform horribly on a different architecture, or at best fail to take advantage of enhancements in future JITs.
If you want good performance, keep it simple. Teams of really smart people are working to make it fast.
If the data you're crunching is primarily or wholly made up of primitives (e.g. in numeric problems), I would advise the following.
Allocate a flat structure of fixed-size arrays of primitives at initialisation time, and make sure the data therein is periodically compacted/defragmented (0->n, where n is the smallest maximum index possible given your element count), to be iterated over using a for loop. This is the only way to guarantee contiguous allocation in Java, and compaction further serves to improve locality of reference. Compaction is beneficial because it reduces the need to iterate over unused elements, reducing the number of conditionals: as the for loop iterates, termination occurs earlier, and less iteration means less movement through the heap and fewer chances for a cache miss. While compaction creates an overhead in and of itself, it can be done only periodically (with respect to your primary areas of processing) if you so choose.
Even better, you can interleave values in these pre-allocated arrays. For instance, if you are representing spatial transforms for many thousands of entities in 2D space and are processing the equations of motion for each one, you might have a tight loop like:
int axIdx, ayIdx, vxIdx, vyIdx, xIdx, yIdx;
// Acceleration, velocity, and displacement for each
// of x and y totals 6 elements per entity.
for (axIdx = 0; axIdx < array.length; axIdx += 6)
{
    ayIdx = axIdx + 1;
    vxIdx = axIdx + 2;
    vyIdx = axIdx + 3;
    xIdx  = axIdx + 4;
    yIdx  = axIdx + 5;
    // velocity1 = velocity0 + acceleration
    array[vxIdx] += array[axIdx];
    array[vyIdx] += array[ayIdx];
    // displacement1 = displacement0 + velocity
    array[xIdx] += array[vxIdx];
    array[yIdx] += array[vyIdx]; // was array[vxIdx]: the y displacement must use the y velocity
}
This example ignores issues such as rendering those entities using their associated (x,y)... rendering always requires non-primitives (thus references/pointers). If you do need such object instances, you can no longer guarantee locality of reference and will likely be jumping around all over the heap. So if you can split your code into sections where you have primitive-intensive processing as shown above, this approach will help you a lot. For games at least, AI, dynamic terrain, and physics can be some of the most processor-intensive aspects, and they are all numeric, so this approach can be very beneficial.
If you are down to where an improvement of a few percent makes a difference, use C where you'll get an improvement of 50-100%!
If you think that the ease of use of Java makes it a better language to use, then don't screw it up with questionable optimizations.
The good news is that Java will do a lot of stuff beneath the covers to improve your code at runtime, but it almost certainly won't do the kind of optimizations you're talking about.
If you decide to go with Java, just write your code as clearly as you can and don't take minor optimizations into account at all. (Major ones, like using the right collections for the right job and not allocating/freeing objects inside a loop, are still worthwhile.)
So far the advice is pretty sound: in general it's best not to try to outsmart the JIT. But as you say, some knowledge of the details is useful sometimes.
Regarding the memory layout of objects, Sun's JVM (now Oracle's) lays out an object's fields grouped by type (i.e. doubles and longs first, then ints and floats, then shorts and chars, after that bytes and booleans, and finally object references). You can get more details here.
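For illustration, a hypothetical class like the one below would, under that classic HotSpot scheme, end up with its fields grouped roughly as the comments indicate, regardless of declaration order (the exact layout varies with JVM version and flags; the OpenJDK JOL tool can print the real layout):
// Hypothetical class: under the field ordering described above, HotSpot
// groups fields by width rather than by declaration order.
class Packed {
    boolean flag;  // 1 byte: grouped with the other byte-sized fields, near the end
    long counter;  // 8 bytes: placed among the first fields
    byte tag;      // 1 byte: grouped with flag
    int size;      // 4 bytes: placed after the 8-byte fields
    Object next;   // reference: placed last under this scheme
}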
Local variables (that is, references and primitive values) are usually kept on the stack.
As Nick mentions, the best way to control memory layout in Java is to use primitive arrays. That way you can make sure the data is contiguous in memory. Be careful about array sizes though: GCs can have trouble with very large arrays. It also has the downside that you have to do some memory management yourself.
On the upside, you can use a Flyweight pattern to get Object-like usability while keeping fast performance.
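A sketch of what such a flyweight might look like over the interleaved array from the earlier answer (names are illustrative only):
// Illustrative flyweight view over an interleaved primitive array:
// one reusable cursor object exposes entity i without allocating per entity.
final class EntityView {
    private static final int STRIDE = 6; // ax, ay, vx, vy, x, y
    private double[] data;
    private int base;

    EntityView bind(double[] data, int entityIndex) {
        this.data = data;
        this.base = entityIndex * STRIDE;
        return this;
    }
    double x() { return data[base + 4]; }
    double y() { return data[base + 5]; }
    void addVelocity(double dvx, double dvy) {
        data[base + 2] += dvx;
        data[base + 3] += dvy;
    }
}
Re-binding the same view instance inside a loop keeps object-style call sites while the actual data stays contiguous, instead of scattering thousands of tiny objects across the heap.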
If you need the extra oomph in performance, generating your own bytecode on the fly helps with some problems, as long as the generated code is executed enough times and your VM's native code cache doesn't get full (which disables the JIT for all practical purposes).
To the best of my knowledge: No. You pretty much have to be writing in machine code to get that level of optimization. With assembly you're a step away because you no longer control where things are stored. With a compiler you're two steps away because you don't even control the details of the generated code. With Java you're three steps away because there's a JVM interpreting your code on the fly.
I don't know of any constructs in Java that let you control things on that level of detail. In theory you could indirectly influence it by how you organize your program and data, but you're so far away that I don't see how you could do it reliably, or even know whether or not it was happening.

Solving nonlinear equations numerically

I need to solve nonlinear minimization problems (least residual squares of N unknowns) in my Java program. The usual way to solve these is the Levenberg-Marquardt algorithm. I have a couple of questions:
Does anybody have experience with the different LM implementations available? There are slightly different flavors of LM, and I've heard that the exact implementation of the algorithm has a major effect on its numerical stability. My functions are pretty well behaved, so this will probably not be a problem, but of course I'd like to choose one of the better alternatives. Here are some alternatives I've found:
FPL Statistics Group's Nonlinear Optimization Java Package. This includes a Java translation of the classic Fortran MINPACK routines.
JLAPACK, another Fortran translation.
Optimization Algorithm Toolkit.
Javanumerics.
Some Python implementation. Pure Python would be fine, since it can be compiled to Java with jythonc.
Are there any commonly used heuristics for making the initial guess that LM requires?
In my application I need to set some constraints on the solution, but luckily they are simple: I just require that the solutions are nonnegative (in order to be physical solutions). Slightly negative solutions are a result of measurement inaccuracies in the data and should obviously be zero. I was thinking of using "regular" LM but iterating, so that if one of the unknowns becomes negative, I set it to zero and re-solve for the rest. Real mathematicians will probably laugh at me, but do you think this could work?
Thanks for any opinions!
Update: This is not rocket science; the number of parameters to solve for (N) is at most 5 and the data sets are barely big enough to make solving possible, so I believe Java is quite efficient enough for this. And I believe this problem has been solved numerous times by clever applied mathematicians, so I'm just looking for a ready-made solution rather than cooking up my own. E.g. scipy.optimize.minpack.leastsq would probably be fine if it were pure Python.
The closer your initial guess is to the solution, the faster you'll converge.
You said it was a non-linear problem. You can compute a linearized least-squares solution and maybe use that as a first guess. A few non-linear iterations will tell you something about how good or bad an assumption that is.
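As a concrete (if simplistic) illustration, a straight-line fit via the normal equations is the kind of linearized solution you could feed in as the starting point; the line model here is just a stand-in for whatever linearization applies to your actual function:
// Sketch: closed-form linear least-squares fit of y = a + b*x via the normal
// equations, usable as a cheap first guess before the non-linear iterations.
static double[] lineFit(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope
    double a = (sy - b * sx) / n;                         // intercept
    return new double[] { a, b };
}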
Another idea would be trying another optimization algorithm. Genetic and ant colony algorithms can be a good choice if you can run them on many CPUs. They also don't require continuous derivatives, so they're nice if you have discrete, discontinuous data.
You should not use an unconstrained solver if your problem has constraints. For instance, if you know that some of your variables must be nonnegative, you should tell your solver so.
If you are happy to use SciPy, I would recommend scipy.optimize.fmin_l_bfgs_b. You can place simple bounds on your variables with L-BFGS-B.
Note that L-BFGS-B takes a general nonlinear objective function, not just a nonlinear least-squares problem.
I agree with codehippo; I think that the best way to solve problems with constraints is to use algorithms which are specifically designed to deal with them. The L-BFGS-B algorithm should probably be a good solution in this case.
However, if using python's scipy.optimize.fmin_l_bfgs_b module is not a viable option in your case (because you are using Java), you can try using a library I have written: a Java wrapper for the original Fortran code of the L-BFGS-B algorithm. You can download it from http://www.mini.pw.edu.pl/~mkobos/programs/lbfgsb_wrapper and see if it matches your needs.
The FPL package is quite reliable but has a few quirks (array access starts at 1) due to its very literal translation of the old Fortran code. The LM method itself is quite reliable if your function is well behaved. A simple way to force nonnegative constraints is to optimize over the squares of the parameters instead of the parameters themselves. This can introduce spurious solutions, but for simple models these are easy to screen out.
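To make the squaring trick concrete, a hypothetical residual function might look like the sketch below (the model itself is a made-up stand-in; the point is only that the solver varies t freely while the model sees x = t*t, which is nonnegative by construction):
// Hypothetical sketch of the "square the parameters" trick: the solver varies
// t freely, but the model only ever sees x[i] = t[i]*t[i], so the fitted
// physical parameters are nonnegative by construction.
final class SquaredParamResiduals {
    static double model(double[] x, double input) {
        return x[0] + x[1] * input; // placeholder model, replace with your own
    }
    static double[] residuals(double[] t, double[][] obs) {
        double[] x = new double[t.length];
        for (int i = 0; i < t.length; i++) {
            x[i] = t[i] * t[i]; // x_i >= 0 by construction
        }
        double[] r = new double[obs.length];
        for (int j = 0; j < obs.length; j++) {
            r[j] = obs[j][1] - model(x, obs[j][0]); // measured minus predicted
        }
        return r;
    }
}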
There is code available for a "constrained" LM method: look at http://www.physics.wisc.edu/~craigm/idl/fitting.html for mpfit. There is a Python version (which unfortunately relies on Numeric) and a C version. The LM method is around 1500 lines of code, so you might be inclined to port the C version to Java. In fact, the "constrained" LM method is not much different from the method you envisioned: in mpfit, the code adjusts the step size relative to the bounds on the variables. I've had good results with mpfit as well.
I don't have that much experience with BFGS, but the code is much more complex and I've never been clear on the licensing of the code.
Good luck.
I haven't actually used any of those Java libraries so take this with a grain of salt: based on the backends I would probably look at JLAPACK first. I believe LAPACK is the backend of Numpy, which is essentially the standard for doing linear algebra/mathematical manipulations in Python. At least, you definitely should use a well-optimized C or Fortran library rather than pure Java, because for large data sets these kinds of tasks can become extremely time-consuming.
For creating the initial guess, it really depends on what kind of function you're trying to fit (and what kind of data you have). Basically, look for some relatively quick (probably O(N) or better) computation that gives an approximate value for the parameter you want. (I recently did this with a Gaussian distribution in NumPy: I estimated the mean as just average(values, weights=counts), that is, an average of the histogram's bin values weighted by their counts, which is simply the mean of the data set. It wasn't the exact center of the peak I was looking for, but it got close enough, and the algorithm went the rest of the way.)
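That weighted-average guess is only a couple of lines in Java as well; a sketch, assuming counts[i] is the histogram count at values[i]:
// Sketch: weighted mean of a histogram as a cheap O(N) initial guess
// for the centre of a peak.
static double weightedMean(double[] values, double[] counts) {
    double sum = 0, total = 0;
    for (int i = 0; i < values.length; i++) {
        sum += values[i] * counts[i];
        total += counts[i];
    }
    return sum / total;
}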
As for keeping the constraints positive, your method seems reasonable. Since you're writing a program to do the work, maybe just add a boolean flag that lets you easily enable or disable the "force-nonnegative" behavior and run it both ways for comparison. Only if you get a large discrepancy (or if one version of the algorithm takes unreasonably long) might it be something to worry about. (And REAL mathematicians would do least-squares minimization analytically, from scratch ;-P so I think you're the one who can laugh at them... kidding. Maybe.)
