I was working on an assignment today that basically asked us to write a Java program that checks whether the HTML syntax in a text file is valid. It was a pretty simple assignment and I did it very quickly, but in doing so I made it very convoluted (lots of loops and if statements). I know I can make it a lot simpler, and I will before turning it in, but amid my procrastination I started downloading plugins and seeing what information they could give me.
I downloaded two in particular that I'm curious about: CodeMetrics and MetricsReloaded. I was wondering what exactly the numbers they generate correspond to. I saw one post that was semi-similar, and I read it as well as the linked articles, but I'm still having some trouble understanding a couple of things. Namely, what do the first two columns (CogC and ev(G)) mean, and could I get some more clarification on the other two (iv(G) and v(G))?
MetricsReloaded Method Metrics:
MetricsReloaded Class Metrics:
The numbers above are from MetricsReloaded, but the other plugin, CodeMetrics, which also calculates cyclomatic complexity, gives slightly different numbers. I was wondering how these numbers relate to each other, and whether someone could give a brief general explanation of all this.
CodeMetrics Analysis Results:
My final question is about time complexity. My understanding of cyclomatic complexity is that it is the number of possible paths of execution and that it is determined by the number of conditionals and how they are nested. It doesn't seem like it would, but does this correlate in any way with time complexity? If so, is there an easy conversion between them? If not, is there a way in either of these plugins (or any other in IntelliJ) to automate time complexity calculations?
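To make the distinction in that last question concrete, here is a minimal sketch (the class and method names are made up for illustration). Each method has a single decision point, so under the common "decision points plus one" counting both would have a cyclomatic complexity of 2, yet one loops about n times and the other about log2(n) times. How a particular plugin counts is an assumption here, not something taken from its documentation.

```java
// Hypothetical illustration: same number of decision points, different running time.
class ComplexityDemo {
    // One loop condition; the body runs n times (linear time).
    static int linearSteps(int n) {
        int steps = 0;
        for (int i = 0; i < n; i++) {
            steps++;
        }
        return steps;
    }

    // Also one loop condition, but n is halved on every pass (logarithmic time).
    static int halvingSteps(int n) {
        int steps = 0;
        while (n > 1) {
            n /= 2;
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(linearSteps(1024));   // prints 1024
        System.out.println(halvingSteps(1024));  // prints 10
    }
}
```

The control-flow graphs of the two methods have the same shape, so a metric based only on that shape cannot tell them apart.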
I was solving a Codeforces problem yesterday. The problem's URL is this
I will just explain the question in short below.
Given a binary string, divide it into a minimum number of subsequences
in such a way that each character of the string belongs to exactly one
subsequence and each subsequence looks like "010101 ..." or "101010
..." (i.e. the subsequence should not contain two adjacent zeros or
ones).
Now, for this problem, I had submitted a solution yesterday during the contest. This is the solution. It got accepted temporarily and on final test cases got a Time limit exceeded status.
So today, I again submitted another solution and this passed all the cases.
In the first solution I used HashSet, and in the second one I used LinkedHashSet. I want to know why HashSet didn't pass all the cases. Does this mean I should use LinkedHashSet whenever I need a Set implementation? I saw this article and found that HashSet performs better than LinkedHashSet. So why doesn't my code work here?
This question would probably get more replies on Codeforces, but I'll answer it here anyways.
After a contest ends, Codeforces allows other users to "hack" solutions by writing custom inputs to run on other users' programs. If the defending user's program runs slowly on the custom input, the status of their code submission will change from "Accepted" to "Time Limit Exceeded".
The reason why your code, specifically, changed from "Accepted" to "Time Limit Exceeded" is that somebody created an "anti-hash test" (a test on which your hash function results in many collisions) on which your program ran slower than usual. If you're interested in how such tests are generated, you can find several posts on Codeforces, like this one: https://codeforces.com/blog/entry/60442.
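As a small illustration of what "many collisions" means, here is a sketch using String keys. It relies on the fact that "Aa" and "BB" have the same String.hashCode() value, so any concatenation of those two blocks collides as well. This is only a toy example: the class and method names are hypothetical, and it is not the test that was used against the submission above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo: many distinct strings that share one hash code.
class HashCollisionDemo {
    // Builds 2^blocks distinct strings from the equal-hash blocks "Aa" and "BB".
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = collidingStrings(10);
        Set<Integer> hashes = new HashSet<>();
        for (String key : keys) {
            hashes.add(key.hashCode());
        }
        // Many distinct keys, but only one distinct hash code among them.
        System.out.println(keys.size() + " keys, " + hashes.size() + " distinct hash codes");
    }
}
```

All of these keys land in the same bucket of a hash table, which is what makes lookups degrade.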
As linked by @Photon, there's a post on Codeforces explaining why you should avoid using Java's HashSet and HashMap: https://codeforces.com/blog/entry/4876. The reason is essentially anti-hash tests. Switching to a balanced BST (TreeSet or TreeMap) adds an extra log(n) factor, but in many cases that factor won't make your code time out, and it gives you protection from anti-hash tests.
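For reference, swapping in the tree-based set is usually a one-line change, because both classes implement java.util.Set. A minimal sketch (the class and method names are made up):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper showing TreeSet used where a HashSet might otherwise be.
class OrderedSetDemo {
    // Counts distinct values; each add costs O(log n) regardless of the input's hash codes.
    static int countDistinct(int[] values) {
        Set<Integer> seen = new TreeSet<>(); // was: new HashSet<>()
        for (int v : values) {
            seen.add(v);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(new int[] {3, 1, 3, 2, 1})); // prints 3
    }
}
```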
How do you determine whether your algorithm is fast enough to add the log(n) factor? I guess this comes with some experience, but most people suggest performing some sort of calculation. Most online judges (including Codeforces) show the time that your program is allowed to run on a particular problem (usually somewhere between one and four seconds), and you can use 10^9 constant-time operations per second as a rule of thumb when performing calculations.
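That back-of-the-envelope calculation can be sketched as follows. The input size of 200,000 and the time limits are made-up values for illustration, and the operations-per-second figure is just the rule of thumb from above passed in as a parameter.

```java
// Hypothetical estimate: does an operation count fit within the time limit?
class BudgetEstimate {
    // Rough operation count of an O(n log n) algorithm (log base 2).
    static double nLogN(long n) {
        return n * (Math.log(n) / Math.log(2));
    }

    // True if the estimated operations fit into seconds * opsPerSecond.
    static boolean fitsInBudget(double operations, double seconds, double opsPerSecond) {
        return operations <= seconds * opsPerSecond;
    }

    public static void main(String[] args) {
        long n = 200_000;           // assumed input size
        double opsPerSecond = 1e9;  // rule-of-thumb figure, not a measurement
        System.out.println("n log n fits in 1 s: " + fitsInBudget(nLogN(n), 1.0, opsPerSecond));
        System.out.println("n^2 fits in 2 s: " + fitsInBudget((double) n * n, 2.0, opsPerSecond));
    }
}
```

The estimate ignores constant factors, so treat a result close to the limit as "too slow" rather than "just fine".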
Hi,
I have a somewhat hypothetical question. We've just written some code implementing a genetic algorithm to find a solution to a sudoku puzzle as part of a Computational Intelligence course project. Unfortunately it runs very slowly, which limits our ability to perform enough runs to find the optimal parameters. The question is whether rewriting the whole thing in Java (the code base is not that big) would be a viable way to speed up the software. We really need a 10x performance improvement, and I am doubtful that a Java version would be that much snappier. Any thoughts?
Thanks
=== Update 1 ===
Here is the code of the function that is computationally most expensive. It's a GA fitness function that iterates through the population (different sudoku boards) and computes, for each row and column, how many elements are duplicates. The parameter n is passed in and is currently set to 9. That is, the function computes how many elements in the range 1 to 9 come up more than once in a row. The higher the number, the lower the fitness of the board, meaning that it is a weak candidate for the next generation.
The profiler reports that the two lines calling intersect in the for loops cause the poor performance, but we don't know how to really optimize the code. It follows below:
function [fitness, finished,d, threshold]=fitness(population_, n)
    finished=false;
    threshold=false;
    V=ones(n,1);
    d=zeros(size(population_,2),1);
    s=[1:1:n];
    for z=1:size(population_,2)
        board=population_{z};
        t=0;
        l=0;
        for i=1:n
            l=l+n-length(intersect(s,board(:,i)'));
            t=t+n-length(intersect(s,board(i,:)));
        end
        k=sum(abs(board*V-t));
        f=t+l+k/50;
        if t==2 &&l==2
            threshold=true;
        end
        if f==0
            finished=true;
        else
            fitness(z)=1/f;
            d(z)=f;
        end
    end
end
=== Update 2 ===
Found a solution here: http://www.mathworks.com/matlabcentral/answers/112771-how-to-optimize-the-following-function
Using histc(V, 1:9), it's much faster :)
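For anyone who ends up porting this to Java anyway, here is a minimal sketch of the same counting idea: build a histogram of the values in a row and count how many of 1..n never occur. This is not the code from the linked answer, and the class and method names are hypothetical.

```java
// Hypothetical Java version of the histogram idea behind histc(V, 1:9).
class HistogramDuplicates {
    // Returns how many of the values 1..n are missing from the row.
    // For a full row of n valid entries this equals the number of surplus duplicates.
    static int missingValues(int[] row, int n) {
        int[] counts = new int[n + 1];
        for (int v : row) {
            if (v >= 1 && v <= n) {
                counts[v]++;
            }
        }
        int missing = 0;
        for (int v = 1; v <= n; v++) {
            if (counts[v] == 0) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        int[] row = {1, 1, 2, 2, 3, 3, 4, 4, 5};
        System.out.println(missingValues(row, 9)); // prints 4 (6, 7, 8 and 9 are missing)
    }
}
```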
This is rather impossible to say without viewing your code, knowing if you use parallelization, etc. Indeed, as MrAzzaman says, profiling is the first thing to do. If you find a single bottleneck, especially if it is loop-heavy, it might be sufficient to write that part in C and connect it to Matlab via MEX.
For genetic algorithms, I'd expect that a 10x speed increase is more likely obtainable than not. I do not quite agree with MrAzzaman here: in some cases (for loops, working with dynamic objects) Matlab is much, much slower than C/C++/Java. That is not to say that Matlab is always slow, for it is not, but there are plenty of algorithms where it would be slow.
I.e., I'd say that if you don't spend much time looping over things, don't use objects, and are not limited by Matlab's data structures, you might be OK with Matlab. That said, if I were to write GAs in Java or Matlab, I'd rather pick the former (and I'm using Matlab a lot more than Java these days, so it's not just a matter of habit).
Btw. if you don't want to program it yourself, have a look at JGAP, it's a rather useful Java library for GAs.
OK, the first step is just to write a faster MATLAB function. Save the new languages for later.
I'm going to make the assumption that the board is full of valid guesses: that is, each entry is in [1, 9]. Now, what we're really looking for are duplicate entries in each row/column. To find duplicates, we sort. On a sorted row, if any element is equal to its neighbor, we have a duplicate. In MATLAB, the diff function does sliding pairwise differencing, and a zero in its output means that two neighboring values are equal. Both sort and diff operate on entire matrices, so no need for looping. Here's the code for the columnwise check:
l=sum(sum(diff(sort(board)) == 0));
The rowwise check is exactly the same, just using the transpose. Now let's put that in a test harness to compare results and timing with the previous version:
n = 9;
% Generate a test board: random integers from 1:n
board = randi(n, n);
s = 1:n;
K=1000; % number of iterations to use for timing
% Repeat current code for comparison
tic
for k=1:K
    t=0;
    l=0;
    for i=1:n
        l=l+n-length(intersect(s,board(:,i)'));
        t=t+n-length(intersect(s,board(i,:)));
    end
end
toc
% New code based on sort/diff for finding repeated values
tic
for k=1:K
    l2=sum(sum(diff(sort(board)) == 0));
    t2=sum(sum(diff(sort(board.')) == 0));
end
toc
% Check that reported values match
disp([l l2])
disp([t t2])
I encourage you to break down the sort/diff/sum code, and build it up on a sample board right at the command line, and try to understand exactly how it works.
On my system, the new code is about 330x faster.
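If you do move to Java later, the same sort-then-compare-neighbours idea carries over directly. A minimal sketch, assuming the board is stored as an int[][] indexed [row][column] (the class and method names are made up):

```java
import java.util.Arrays;

// Hypothetical Java counterpart of sum(sum(diff(sort(board)) == 0)).
class SortDiffDuplicates {
    // Sorts each column and counts adjacent equal pairs, i.e. surplus duplicates.
    static int duplicatesInColumns(int[][] board) {
        int rows = board.length;
        int duplicates = 0;
        for (int c = 0; c < board[0].length; c++) {
            int[] column = new int[rows];
            for (int r = 0; r < rows; r++) {
                column[r] = board[r][c];
            }
            Arrays.sort(column);
            for (int r = 1; r < rows; r++) {
                if (column[r] == column[r - 1]) {
                    duplicates++;
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        int[][] board = {
            {1, 2, 3},
            {1, 2, 3},
            {1, 5, 6}
        };
        System.out.println(duplicatesInColumns(board)); // prints 4
    }
}
```

The row-wise count works the same way on each board[r] (sort a copy so the board itself is not reordered).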
For traditional GA applications for study and research purposes, it is better to use a language that compiles to native machine code, like C or C++. That is what I used when working with Genetic Programming in the past, and it is really fast.
However, if you are planning to put this inside a more modern type of application that can be deployed in a web container or run on a mobile device, a different OS, etc., then Java is your best alternative, as it is platform independent.
Another thing that can be important is concurrency. For example, suppose that you want to put your GA on the Internet, with a growing number of users connected concurrently who all want to solve a different sudoku. Java applications are very good at scaling horizontally and work well with a large number of concurrent connections.
Another advantage of migrating to Java is the number of libraries and frameworks you can use; the Java universe is so big that you can find useful tools for almost any kind of application.
Java is compiled for a virtual machine, but it is important to note that current JVMs perform very well and are able to optimize programs. For example, they find which methods are most heavily used and compile them to native code, which means that for some applications a Java program can be almost as fast as a native one compiled from C.
Matlab is a platform that is very useful for engineering training and for math, vector, and matrix based calculations, and also for some control work with Simulink. I used these products during my electrical engineering bachelor's degree. However, their goal is mainly to be a tool for academic purposes, and I definitely wouldn't go for Matlab if I wanted to build a production application for the real world. It is not scalable, it is expensive to maintain and fine-tune, and not many infrastructure providers support this kind of technology.
As for the complexity of rewriting your code in Java: Matlab and Java syntax are pretty similar, and they live in the same paradigm (procedural OOP). Even if you are not using OO in your code, it can easily be rewritten in Java. The painful part will be Matlab's shortcuts for math structures like matrices, and passing functions as parameters.
For the matrix part, there are a lot of Java libraries, like EJML, that will make your life easier. As for assigning functions to variables and then passing them as parameters to other functions, Java cannot currently do that (Java 8 will, with lambda expressions), but you can get equivalent functionality by using closures built from classes (for example, anonymous inner classes). These may be the only slightly painful things you will find when migrating.
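A minimal sketch of both ways to pass a function as a parameter: an anonymous class implementing a small interface, and a lambda (which needs Java 8 or later). The interface and method names here are hypothetical, chosen only for the example.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical demo of passing a function to another function in Java.
class FunctionPassingDemo {
    // A small hand-written interface, usable before Java 8.
    interface RealFunction {
        double apply(double x);
    }

    // Applies f to every element and sums the results (interface version).
    static double applyAndSum(RealFunction f, double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += f.apply(x);
        }
        return sum;
    }

    // Same idea using the standard functional interface, so a lambda can be passed.
    static double applyAndSumLambda(DoubleUnaryOperator f, double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += f.applyAsDouble(x);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0};
        RealFunction square = new RealFunction() {
            @Override
            public double apply(double x) {
                return x * x;
            }
        };
        System.out.println(applyAndSum(square, xs));            // prints 14.0
        System.out.println(applyAndSumLambda(x -> x * x, xs));  // prints 14.0
    }
}
```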
It is extremely difficult to illustrate the complexity of frameworks (hibernate, spring, apache-commons, ...)
The only thing I could think of was to compare the file sizes of the jar libraries or even better, the number of classes contained in the jar files.
Of course this is not a mathematically sound proof of complexity. But at least it should make clear that some frameworks are lightweight compared to others.
Of course it would take quite some time to calculate the statistics. In an attempt to save time, I was wondering if perhaps somebody has done so already?
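For what it's worth, counting the classes in a jar takes only a few lines with the standard java.util.jar API. A minimal sketch (the class name JarClassCounter and its methods are made up; it simply counts entries ending in ".class", so nested and anonymous classes are counted too):

```java
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical utility: number of .class entries per jar file.
class JarClassCounter {
    // Counts names ending in ".class" (includes inner classes such as Foo$1.class).
    static long countClassNames(Collection<String> entryNames) {
        return entryNames.stream().filter(name -> name.endsWith(".class")).count();
    }

    // Opens the jar, collects its entry names and counts the class files.
    static long countClasses(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            List<String> names = jar.stream()
                    .map(JarEntry::getName)
                    .collect(Collectors.toList());
            return countClassNames(names);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more jar paths on the command line.
        for (String path : args) {
            System.out.println(path + ": " + countClasses(path) + " classes");
        }
    }
}
```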
EDIT:
Yes, there are a lot of tools to calculate the complexity of individual methods and classes. But this question is about third party jar files.
Also please note that 40% of phrases in my original question stress the fact that everybody is well aware of the fact that complexity is hard to measure and that file size and nr of classes may indeed not be sufficient. So, it is not necessary to elaborate on this any further.
There are tools out there that can measure the complexity of code. However this is more of a psychological question as you cannot mathematically define the term 'complex code'. And obviously giving two random persons some piece of code will give you very different answers.
In general, the issue with complexity arises from the fact that a human brain cannot process more than a certain number of lines of code simultaneously (actually functional pieces, but normal lines of code should be exactly that). The exact number of lines that one can hold and understand in memory at the same time of course varies with many factors (including time of day, day of the week and the status of your coffee machine) and therefore depends completely on the audience. However, the fewer lines of code you have to keep in your 'internal memory register' for one task, the better, so this should be the general factor when trying to determine the complexity of an API.
There is, however, a pitfall with this way of calculating complexity. Many APIs offer you a fast way of solving a problem (an easy entry level), but this solution later turns out to force several very complex coding decisions that, overall, make your code very difficult to understand. In contrast, other APIs require a very complex setup that is hard to understand at first, but the rest of your code will be extremely easy because of that initial setup.
Therefore, a good way of measuring API complexity is to define a task for that API that is representative and big enough, and then measure the average number of lines of code one has to keep in mind simultaneously to implement that task. And once you're done, please publish the result in a scientific paper of your choice. ;)
I am looking for a program or library in Java capable of finding non-random properties of a byte sequence: something that, when given a huge file, runs some statistical tests and reports whether the data show any regularities.
I know three such programs, but not in Java. I tried all of them, but they don't really seem to work for me (which is quite surprising as one of them is by NIST). The oldest of them, diehard, works fine, but it's a bit hard to use.
As some of the commenters have stated, this is really an expert mathematics problem. The simplest explanation I could find for you is:
Run Tests for Non-randomness
Autocorrelation
It's interesting, but as it uses 'heads or tails' to simplify its example, you'll find you need to go much deeper to apply the same theory to encryption / cryptography etc - but it's a good start.
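To give a feel for what those two tests compute, here is a minimal Java sketch: a Wald-Wolfowitz runs test z-score on a bit sequence and a lag-k autocorrelation on bytes. The class and method names are made up, and this is nowhere near a full test battery.

```java
// Hypothetical sketch of two simple non-randomness checks.
class RandomnessChecks {
    // Number of maximal blocks of identical consecutive bits.
    static int countRuns(boolean[] bits) {
        if (bits.length == 0) {
            return 0;
        }
        int runs = 1;
        for (int i = 1; i < bits.length; i++) {
            if (bits[i] != bits[i - 1]) {
                runs++;
            }
        }
        return runs;
    }

    // Runs test: z-score of the observed run count against its expectation.
    // Large |z| suggests non-randomness; the result is NaN if all bits are equal.
    static double runsTestZ(boolean[] bits) {
        double n1 = 0;
        for (boolean b : bits) {
            if (b) {
                n1++;
            }
        }
        double n0 = bits.length - n1;
        double n = n0 + n1;
        double mean = 2 * n1 * n0 / n + 1;
        double variance = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n * n * (n - 1));
        return (countRuns(bits) - mean) / Math.sqrt(variance);
    }

    // Sample autocorrelation of the byte values (0..255) at the given lag.
    static double autocorrelation(byte[] data, int lag) {
        double mean = 0;
        for (byte b : data) {
            mean += b & 0xFF;
        }
        mean /= data.length;
        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < data.length; i++) {
            double d = (data[i] & 0xFF) - mean;
            denominator += d * d;
            if (i + lag < data.length) {
                numerator += d * ((data[i + lag] & 0xFF) - mean);
            }
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        boolean[] alternating = {false, true, false, true, false, true, false, true};
        System.out.println("runs = " + countRuns(alternating)); // prints runs = 8
        System.out.println("z = " + runsTestZ(alternating));
        byte[] bytes = {0, 1, 0, 1, 0, 1, 0, 1};
        System.out.println("lag-1 autocorrelation = " + autocorrelation(bytes, 1));
    }
}
```

A perfectly alternating sequence has too many runs and a strongly negative lag-1 autocorrelation, so both checks flag it even though it has equal numbers of zeros and ones.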
Another approach would be using Fuzzy logic. You can extract fuzzy associative rules from sets of data. Those rules are basically implications in the form:
if A then B, interpreted for example "if 01101 (is present) then 1111 (will follow)"
Googling "fuzzy data mining"/"extracting fuzzy associative rules" should yield you more than enough results.
Your problem domain is quite huge, actually, since this is what data/text mining is all about. That, and statistical & combinatorial analysis, just to name a few.
About a program that does that - take a look at this.
Not so much an answer to your question but to your comment that "any observable pattern is bad". Which got me thinking that randomness wasn't the problem but rather observable patterns, and to tackle this problem surely you need observers. So, in short, just set up a website and crowdsource it.
Some examples of this technique applied to colour naming: http://blog.xkcd.com/2010/05/03/color-survey-results/ and http://www.hpl.hp.com/personal/Nathan_Moroney/color-name-hpl.html
I need to solve nonlinear minimization (least residual squares of N unknowns) problems in my Java program. The usual way to solve these is the Levenberg-Marquardt algorithm. I have a couple of questions:
Does anybody have experience with the different LM implementations available? There exist slightly different flavors of LM, and I've heard that the exact implementation of the algorithm has a major effect on its numerical stability. My functions are pretty well-behaved, so this will probably not be a problem, but of course I'd like to choose one of the better alternatives. Here are some alternatives I've found:
FPL Statistics Group's Nonlinear Optimization Java Package. This includes a Java translation of the classic Fortran MINPACK routines.
JLAPACK, another Fortran translation.
Optimization Algorithm Toolkit.
Javanumerics.
Some Python implementation. Pure Python would be fine, since it can be compiled to Java with jythonc.
Are there any commonly used heuristics to do the initial guess that LM requires?
In my application I need to set some constraints on the solution, but luckily they are simple: I just require that the solutions (in order to be physical solutions) are nonnegative. Slightly negative solutions are the result of measurement inaccuracies in the data, and should obviously be zero. I was thinking of using "regular" LM but iterating, so that if some of the unknowns become negative, I set them to zero and re-solve the rest from that. Real mathematicians will probably laugh at me, but do you think that this could work?
Thanks for any opinions!
Update: This is not rocket science, the number of parameters to solve (N) is at most 5 and the data sets are barely big enough to make solving possible, so I believe Java is quite efficient enough to solve this. And I believe that this problem has been solved numerous times by clever applied mathematicians, so I'm just looking for some ready solution rather than cooking my own. E.g. Scipy.optimize.minpack.leastsq would probably be fine if it was pure Python..
The closer your initial guess is to the solution, the faster you'll converge.
You said it was a non-linear problem. You can do a least squares solution that's linearized. Maybe you can use that solution as a first guess. A few non-linear iterations will tell you something about how good or bad an assumption that is.
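As a sketch of that idea, suppose (purely for illustration, since the actual model wasn't given) that the model is y = a * exp(b * x). Taking logarithms gives ln y = ln a + b * x, which is an ordinary linear least-squares fit, and its result can seed the non-linear solver. The class and method names are made up.

```java
// Hypothetical example: linearized fit of y = a * exp(b * x) as an initial guess.
class LinearizedGuess {
    // Returns {a, b} from a straight-line fit of ln(y) against x. Requires all y > 0.
    static double[] initialGuess(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double ly = Math.log(y[i]);
            sx += x[i];
            sy += ly;
            sxx += x[i] * x[i];
            sxy += x[i] * ly;
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double lnA = (sy - b * sx) / n;
        return new double[] {Math.exp(lnA), b};
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = 2.0 * Math.exp(0.5 * x[i]); // noise-free synthetic data
        }
        double[] guess = initialGuess(x, y);
        System.out.println("a ~ " + guess[0] + ", b ~ " + guess[1]);
    }
}
```

With noisy data the log transform distorts the error weighting, which is exactly why this is only a starting point for the real fit.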
Another idea would be trying another optimization algorithm. Genetic and ant colony algorithms can be a good choice if you can run them on many CPUs. They also don't require continuous derivatives, so they're nice if you have discrete, discontinuous data.
You should not use an unconstrained solver if your problem has constraints. For instance, if you know that some of your variables must be nonnegative, you should tell this to your solver.
If you are happy to use Scipy, I would recommend scipy.optimize.fmin_l_bfgs_b
You can place simple bounds on your variables with L-BFGS-B.
Note that L-BFGS-B takes a general nonlinear objective function, not just a nonlinear least-squares problem.
I agree with codehippo; I think that the best way to solve problems with constraints is to use algorithms which are specifically designed to deal with them. The L-BFGS-B algorithm should probably be a good solution in this case.
However, if using Python's scipy.optimize.fmin_l_bfgs_b function is not a viable option in your case (because you are using Java), you can try a library I have written: a Java wrapper for the original Fortran code of the L-BFGS-B algorithm. You can download it from http://www.mini.pw.edu.pl/~mkobos/programs/lbfgsb_wrapper and see if it matches your needs.
The FPL package is quite reliable but has a few quirks (array access starts at 1) due to its very literal interpretation of the old Fortran code. The LM method itself is quite reliable if your function is well behaved. A simple way to force non-negative constraints is to use the squares of the parameters instead of the parameters directly. This can introduce spurious solutions, but for simple models these solutions are easy to screen out.
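Here is a toy sketch of the squared-parameter trick. To keep it short it uses plain gradient descent on a one-parameter problem instead of LM; the objective, starting value and step size are all made up for illustration.

```java
// Hypothetical demo: enforce p >= 0 by optimizing q with p = q * q.
class SquaredParameterDemo {
    // Minimizes (p - target)^2 over p >= 0 via gradient descent on q, where p = q * q.
    static double fitNonNegative(double target) {
        double q = 1.0;      // assumed starting value; q = 0 is a stationary point, so avoid it
        double step = 0.02;  // assumed step size, small enough for these targets
        for (int iter = 0; iter < 5000; iter++) {
            double residual = q * q - target;
            double gradient = 4 * q * residual; // d/dq of (q^2 - target)^2
            q -= step * gradient;
        }
        return q * q; // always non-negative by construction
    }

    public static void main(String[] args) {
        System.out.println(fitNonNegative(4.0));   // close to 4: constraint inactive
        System.out.println(fitNonNegative(-1.0));  // close to 0: unconstrained optimum is infeasible
    }
}
```

The stationary point at q = 0 is an example of the spurious solutions mentioned above: the gradient with respect to q vanishes there even when p = 0 is not the best value.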
There is code available for a "constrained" LM method. Look here http://www.physics.wisc.edu/~craigm/idl/fitting.html for mpfit. There is a python (relies on Numeric unfortunately) and a C version. The LM method is around 1500 lines of code, so you might be inclined to port the C to Java. In fact, the "constrained" LM method is not much different than the method you envisioned. In mpfit, the code adjusts the step size relative to bounds on the variables. I've had good results with mpfit as well.
I don't have that much experience with BFGS, but the code is much more complex and I've never been clear on the licensing of the code.
Good luck.
I haven't actually used any of those Java libraries so take this with a grain of salt: based on the backends I would probably look at JLAPACK first. I believe LAPACK is the backend of Numpy, which is essentially the standard for doing linear algebra/mathematical manipulations in Python. At least, you definitely should use a well-optimized C or Fortran library rather than pure Java, because for large data sets these kinds of tasks can become extremely time-consuming.
For creating the initial guess, it really depends on what kind of function you're trying to fit (and what kind of data you have). Basically, just look for some relatively quick (probably O(N) or better) computation that will give an approximate value for the parameter you want. (I recently did this with a Gaussian distribution in Numpy and I estimated the mean as just average(values, weights = counts) - that is, a weighted average of the counts in the histogram, which was the true mean of the data set. It wasn't the exact center of the peak I was looking for, but it got close enough, and the algorithm went the rest of the way.)
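In Java, that weighted-average estimate is a few lines. A minimal sketch, where values are the histogram bin centers and counts are the bin counts (the class and method names are made up):

```java
// Hypothetical helper: weighted mean of histogram bins as an initial guess for a peak center.
class WeightedMeanGuess {
    // Equivalent in spirit to average(values, weights = counts); O(N) in the number of bins.
    static double weightedMean(double[] values, double[] counts) {
        double weightedSum = 0;
        double total = 0;
        for (int i = 0; i < values.length; i++) {
            weightedSum += values[i] * counts[i];
            total += counts[i];
        }
        return weightedSum / total;
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3};
        double[] counts = {1, 2, 1};
        System.out.println(weightedMean(values, counts)); // prints 2.0
    }
}
```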
As for keeping the solutions nonnegative, your method seems reasonable. Since you're writing a program to do the work, maybe just add a boolean flag that lets you easily enable or disable the "force-non-negative" behavior, and run it both ways for comparison. It is only something to worry about if you get a large discrepancy (or if one version of the algorithm takes unreasonably long). (And REAL mathematicians would do least-squares minimization analytically, from scratch ;-P so I think you're the one who can laugh at them.... kidding. Maybe.)