I need to implement the calculation of some special polynomials in Java (the language is not really important). These are calculated as a weighted sum of a number of base polynomials with fixed coefficients.
Each base polynomial has 2 to 10 coefficients and there are typically 10 base polynomials considered, giving a total of, say 20-50 coefficients.
Basically the calculation is no big deal, but I am worried about typos. I only have a printed document as a template, so I would like to implement unit tests for the calculations. The issue is: how do I get reliable testing data? I do have another piece of software that is supposed to calculate these functions, but the process is complicated and also error-prone: I would have to scale the input values, go through a number of menu selections in the software to produce the output, and then paste it into my testing code.
I guess that there is no way around using the external software to generate some testing data, but maybe you have some recommendations for making this type of testing procedure safer or minimize the required number of test cases.
I am also worried about providing suitable input values: Depending on the value of the independent variable, certain terms will only have a tiny contribution to the output, while for other values they might dominate.
The types of errors I expect (and need to avoid) are:
Typos in coefficients
Coefficients applied to wrong power (i.e. a_7*x^6 instead of a_7*x^7 - just for demonstration, I am not calculating this way but am using Horner's scheme)
Off-by-one errors (i.e. missing zero order or highest order term)
Since you have a polynomial of degree 10, testing at 11 distinct points should give certainty: two different polynomials of degree at most 10 cannot agree at 11 distinct points.
However, a test at even a single well-randomized point, say x=1.23004 (away from small fractions like 2/3 or 4/5), will show a difference with high probability if there is an error, because it is unlikely that the difference between the wrong and the true polynomial has a root at exactly that point.
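To illustrate the single-point argument, here is a minimal sketch in Java. The class name `PolyCheck`, its methods, and the coefficient values are made up for the example; they are not taken from the question. It evaluates two coefficient sets with Horner's scheme and reports the largest difference over a set of test points, so a transposed digit in one coefficient shows up as a clearly nonzero difference at x=1.23004.

```java
final class PolyCheck {
    // Evaluates c[0] + c[1]*x + ... + c[n]*x^n using Horner's scheme.
    static double horner(double[] c, double x) {
        double result = 0.0;
        for (int i = c.length - 1; i >= 0; i--) {
            result = result * x + c[i];
        }
        return result;
    }

    // Largest absolute difference between two coefficient sets over the test points.
    static double maxAbsDiff(double[] a, double[] b, double[] xs) {
        double max = 0.0;
        for (double x : xs) {
            max = Math.max(max, Math.abs(horner(a, x) - horner(b, x)));
        }
        return max;
    }

    public static void main(String[] args) {
        double[] reference = {1.0, -2.5, 0.75, 3.0}; // hypothetical coefficients
        double[] withTypo  = {1.0, -2.5, 0.57, 3.0}; // digits transposed in c[2]
        double[] points = {1.23004};                 // the "well-randomized" point
        System.out.println(maxAbsDiff(reference, withTypo, points));
    }
}
```

In a real test the reference values would come from the external software rather than from a second coefficient array; the sketch only shows that one such point is usually enough to expose a wrong coefficient.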
I would like to create two models of binary prediction: one with the cut point strictly greater than 0.5 (in order to obtain fewer signals but better ones) and second with the cut point strictly less than 0.5.
Cross-validation reports a test error for a cut point equal to 0.5. How can I get it for a different cut value? I am using XGBoost for Java.
xgboost returns a list of scores. You can do whatever you want with that list of scores.
I think that in Java in particular (XGBoost4J), it returns a 2D float array (float[][]) with one row per input row.
In binary prediction you probably used a logistic objective, so your scores will be between 0 and 1.
Take your scores object and create a custom function that will calculate new predictions, by the rules you've described.
If you are using an automated, xgboost-implemented cross-validation function, you might want to build a customized evaluation function that does what you need, and pass it as an argument to xgb.cv.
If you want to be smart about setting your threshold, I suggest reading about the AUC of the ROC curve and about the precision-recall curve.
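As a rough sketch of such a custom function, the Java code below applies a strict cut point to the scores and computes the resulting error rate. It does not call XGBoost at all. It assumes the scores arrive as a `float[][]` with one row per sample and the probability in column 0; the class and method names are invented for the example.

```java
final class Threshold {
    // Label is 1 only when the score is strictly greater than the cut point.
    // Assumes one row per sample with the probability in column 0.
    static int[] classify(float[][] scores, float cut) {
        int[] labels = new int[scores.length];
        for (int i = 0; i < scores.length; i++) {
            labels[i] = scores[i][0] > cut ? 1 : 0;
        }
        return labels;
    }

    // Fraction of samples whose predicted label differs from the actual label.
    static double errorRate(int[] predicted, int[] actual) {
        int wrong = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] != actual[i]) {
                wrong++;
            }
        }
        return (double) wrong / predicted.length;
    }
}
```

Calling `classify` once with a cut above 0.5 and once with a cut below 0.5 on the same held-out scores gives the two models described in the question without retraining.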
I need to write a program that determines the Big-O notation of an algorithm in Java.
I don't have access to the algorithms code, so it must be based on experimentation and execution times. I don't know where to start.
Can someone help me?
Edit: The only thing I know about the algorithm is that it takes an integer value and doesn't return anything.
Firstly, you need to be aware that what such a program does is provide an evidence-based guess as to which complexity class the algorithm belongs to. It can give the wrong answer. (Indeed, in complicated cases where the complexity class is unusual, wrong answers become increasingly likely.)
In short, this is NOT complexity analysis.
The general approach would be:
Run the algorithm multiple times with values of N across the range, measuring the execution times. Repeat multiple times for each N, to ensure that you are getting consistent measurements.
Try to fit the experimental results to different kinds of curves; i.e. linear, quadratic, logarithmic. Note that it is the fit for large values of N that matters. So when you check for "goodness of fit", use a measure that gives increasing weight to the larger data points.
This is intended as a starting point. For example, I'm expecting that you will do your own research on issues such as:
how to get reliable execution-time measurements (for Java),
how to do curve fitting in a mathematically sound way, and
dealing with the case where the execution times get too long to measure experimentally for large N.
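The two steps of the general approach can be sketched roughly as follows. This is a simplification with assumptions of its own: the class `GrowthEstimate` and its methods are invented for the example, the timing loop ignores JIT warm-up and garbage collection, and the fit is an unweighted least-squares line through (log N, log time), so it only estimates an exponent k for growth like N^k. It does not weight large N more heavily and cannot distinguish logarithmic factors.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

final class GrowthEstimate {
    // Median of several timings (nanoseconds) for one input size.
    // For an even number of repeats this returns the upper median.
    static double medianTimeNanos(IntConsumer algorithm, int n, int repeats) {
        long[] t = new long[repeats];
        for (int r = 0; r < repeats; r++) {
            long start = System.nanoTime();
            algorithm.accept(n);
            t[r] = System.nanoTime() - start;
        }
        Arrays.sort(t);
        return t[repeats / 2];
    }

    // Least-squares slope of log(time) against log(n).
    // A slope near 1 suggests linear growth, near 2 quadratic, and so on.
    static double logLogSlope(double[] n, double[] time) {
        int m = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(n[i]);
            double y = Math.log(time[i]);
            sx += x;
            sy += y;
            sxx += x * x;
            sxy += x * y;
        }
        return (m * sxy - sx * sy) / (m * sxx - sx * sx);
    }
}
```

A typical use would be to collect `medianTimeNanos` for N = 1000, 2000, 4000, ... and pass both arrays to `logLogSlope`; the result is still only an evidence-based guess, as noted above.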
You could do some experiments and graph the amount of input vs the time spent executing the function. Then you could compare it to the well known curves associated with Big-O or try to estimate the equation.
Since you don't have access to the algorithm's source code, the only thing you can do is to measure how long the algorithm takes for inputs of different size, and then try to extrapolate a function from that. Since you are doing experiments, you now enter the field of statistics, so maybe you can use ideas from that area, such as regression analysis.
Several months ago I had to implement a two-dimensional Fourier transformation in Java. While the results seemed sane for a few manual checks, I wondered what a good test-driven approach would look like.
Basically, what I did was check that the DC components had reasonable values and that the AC components roughly matched the Mathematica output.
My question is: Which unit tests would you implement for a discrete Fourier transformation? How would you validate results returned by your calculation?
As with other unit tests, you should consider small fixed input test vectors for which results can easily be computed manually and compared against. For the more involved input test vectors, a direct DFT implementation should be easy enough to write and can be used to cross-validate results (possibly on top of your own manual computations).
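A direct DFT of the kind mentioned here could look like the following sketch. The class name, the split real/imaginary array layout, and the unscaled forward-transform convention X[k] = sum over n of x[n]*exp(-2*pi*i*k*n/N) are assumptions made for the example; your FFT may use a different sign or scaling, in which case the reference has to be adapted.

```java
final class DirectDft {
    // O(N^2) reference transform. Returns {re, im} of
    // X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N), with no scaling applied.
    static double[][] transform(double[] re, double[] im) {
        int n = re.length;
        double[] outRe = new double[n];
        double[] outIm = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                double c = Math.cos(angle);
                double s = Math.sin(angle);
                outRe[k] += re[t] * c - im[t] * s;
                outIm[k] += re[t] * s + im[t] * c;
            }
        }
        return new double[][] {outRe, outIm};
    }

    // Largest magnitude of the difference between two results in {re, im} layout.
    static double maxAbsError(double[][] a, double[][] b) {
        double max = 0.0;
        for (int k = 0; k < a[0].length; k++) {
            max = Math.max(max, Math.hypot(a[0][k] - b[0][k], a[1][k] - b[1][k]));
        }
        return max;
    }
}
```

A unit test would then feed the same input to the FFT under test and to `transform`, and assert that `maxAbsError` stays below a tolerance chosen relative to the input magnitude and N.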
As far as specific test vectors for one-dimensional FFT, you can start with the following from dsprelated, which they selected to exercise common flaws:
Single FFT tests - N inputs and N outputs
Input random data
Inputs are all zeros
Inputs are all ones (or some other nonzero value)
Inputs alternate between +1 and -1.
Input is e^(8*j*2*pi*i/N) for i = 0,1,2, ...,N-1. (j = sqrt(-1))
Input is cos(8*2*pi*i/N) for i = 0,1,2, ...,N-1.
Input is e^((43/7)*j*2*pi*i/N) for i = 0,1,2, ...,N-1. (j = sqrt(-1))
Input is cos((43/7)*2*pi*i/N) for i = 0,1,2, ...,N-1.
Multi FFT tests - run continuous sets of random data
Data sets start at times 0, N, 2N, 3N, 4N, ....
Data sets start at times 0, N+1, 2N+2, 3N+3, 4N+4, ....
For two-dimensional FFT, you can then build on the above. The first three cases are still directly applicable (random data, all zeros, all ones). Others require a bit more work but are still manageable for small input sizes.
Finally, Google searches should yield some reference images (before and after transform) for a few common cases such as black & white squares, rectangles, and circles, which can be used as a reference (see for example http://www.fmwconcepts.com/misc_tests/FFT_tests/).
99.9% of the numerical and coding issues you are likely to encounter will be found by testing with random complex vectors and comparing with a direct DFT to a tolerance on the order of floating point precision.
Zero, constant, or sinusoidal vectors may help understand a failure by allowing your eye to catch issues like initialization, clipping, folding, scaling. But they will not typically find anything that the random case does not.
My kissfft library does a few extra tests related to fixed point issues -- not an issue if you are working in floating point.
Hi I am building a simple multilayer network which is trained using back propagation. My problem at the moment is that some attributes in my dataset are nominal (non numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting up how many distinct values there are for each attribute and assigning each an equal number between 0 and 1. For example suppose one of my attributes had values A to E then would the following be suitable?:
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part to my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also after I get an output from the network, do I just see which number it is closer to? For example if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the nearest value to the output (0.435) which is y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A, B, etc. stand for colours. These are the values of a nominal variable and cannot be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables.
Now suppose A, B, C, etc. stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to a chest circumference of 40), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.
One-of-N encoding
A better way to go about this is to use the so-called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.
A similar principle applies to the outputs of the network. If we treat garment size as output instead of input, when the network outputs the vector [0.01 -0.01 0.5 0.0001 -.0002], you interpret that as size medium, because the third unit has the largest value.
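The encoding and the corresponding decoding step can be sketched in a few lines of Java. The class and method names are illustrative only: `encode` builds the input vector for a category index, and `decode` picks the index of the largest output.

```java
final class OneOfN {
    // Builds a vector of length n with a 1 at the given category index and 0 elsewhere.
    static double[] encode(int index, int n) {
        double[] v = new double[n];
        v[index] = 1.0;
        return v;
    }

    // Returns the index of the largest output (the first one if there is a tie).
    static int decode(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With five sizes indexed 0 to 4, `encode(2, 5)` corresponds to 00100 (medium), and `decode` applied to the output vector from the example above yields index 2.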
In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes (4 for the continuous inputs plus 20 for the discrete one). You might want to normalise the values of your 4 continuous inputs to the range [0, 1], because otherwise they may end up dominating your discrete variable.
It really depends on the meaning of the attributes you're trying to normalize, and the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to do.
If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.
EDIT:
If the NN was able to accurately perform its function, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see inaccurate results.
Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.
If you've successfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.
In all of these cases, there's a strong likelihood that your NN did something entirely different than what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any. But there's absolutely no guarantee that it'll be a better guess.
I'm trying to write a program that computes the reduced row echelon form of a given matrix. Basically, I'm writing a program that solves systems of equations. However, some divisions produce repeating digits (such as 2/3, which is .66666...) and Java rounds them off at a certain digit, so there are times when a pivot that should be 0 (meaning no pivot) comes out as something like .0000001, and it messes up my whole program.
My first question is: if I were to have some sort of if statement, what is the best way to write something like "if this number is less than .00001 away from being an integer, then round to that closest integer"?
My second question is: does anyone have any ideas on better ways of handling this situation than putting if statements that round numbers all over the place?
Thank you very much.
You say that you are writing a program that solves systems of equations. This is quite a complicated problem. If you only want to use such a program, you are better off using a library written by somebody else. I will assume that you really want to write the program yourself, for fun and/or education.
You identified the main problem: using floating point numbers leads to rounding and thus to inexact results. There are two solutions for this.
The first solution is not to use floating point numbers. Use only integers and reduce the matrix to row echelon form (not reduced); this can be done without divisions. Since all computations with integers are exact, a pivot that should be 0 will be exactly 0 (actually, there may be a problem with overflow). Of course, this will only work if the matrix you start with consists of integers. You can generalize this approach by working with fractions instead of integers.
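As an illustration of working with fractions, here is a minimal exact rational type built on `java.math.BigInteger`, which also sidesteps the overflow problem. The class `Fraction` and its methods are written for this example and are not a standard library type. Every value is kept in lowest terms with a positive denominator, so a pivot that should be zero is exactly zero.

```java
import java.math.BigInteger;

final class Fraction {
    final BigInteger num;
    final BigInteger den;

    Fraction(BigInteger num, BigInteger den) {
        if (den.signum() == 0) {
            throw new ArithmeticException("zero denominator");
        }
        if (den.signum() < 0) { // keep the denominator positive
            num = num.negate();
            den = den.negate();
        }
        BigInteger g = num.gcd(den); // never zero because den is nonzero
        this.num = num.divide(g);
        this.den = den.divide(g);
    }

    static Fraction of(long num, long den) {
        return new Fraction(BigInteger.valueOf(num), BigInteger.valueOf(den));
    }

    Fraction multiply(Fraction o) {
        return new Fraction(num.multiply(o.num), den.multiply(o.den));
    }

    Fraction divide(Fraction o) {
        return new Fraction(num.multiply(o.den), den.multiply(o.num));
    }

    Fraction subtract(Fraction o) {
        return new Fraction(num.multiply(o.den).subtract(o.num.multiply(den)),
                            den.multiply(o.den));
    }

    boolean isZero() {
        return num.signum() == 0;
    }
}
```

For example, `Fraction.of(2, 3).multiply(Fraction.of(3, 1)).subtract(Fraction.of(2, 1)).isZero()` is true, whereas the same computation in floating point may or may not leave a tiny residue. The price is speed and growing numerators and denominators.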
The second solution is to use floating point numbers and be very careful. This is a topic of a whole branch of mathematics / computer science called numerical analysis. It is too complicated to explain in an answer here, so you have to get a book on numerical analysis. In simple terms, what you want to do is to say that if Math.abs(pivot) < some small value, then you assume that the pivot should be zero, but that it is something like .0000000001 because of rounding errors, so you just act as if the pivot is zero. The problem is finding out what "some small value" is.
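A small sketch of such a check is shown below. The names and the idea of scaling the tolerance by the largest absolute entry of the matrix are assumptions chosen for illustration, not a recommendation from a numerical analysis text; a suitable tolerance depends on the problem. `snapToInteger` is the literal answer to the first question.

```java
final class PivotCheck {
    // Largest absolute value in the matrix, used as a scale for the tolerance.
    static double maxAbs(double[][] m) {
        double max = 0.0;
        for (double[] row : m) {
            for (double v : row) {
                max = Math.max(max, Math.abs(v));
            }
        }
        return max;
    }

    // Treats the pivot as zero when it is tiny relative to the given scale.
    static boolean isNegligible(double pivot, double scale, double relTol) {
        return Math.abs(pivot) <= relTol * scale;
    }

    // Rounds to the nearest integer only if the value is within tol of it.
    static double snapToInteger(double value, double tol) {
        double nearest = Math.rint(value);
        return Math.abs(value - nearest) < tol ? nearest : value;
    }
}
```

Keeping the test in one helper at least avoids scattering ad hoc rounding through the elimination code: the pivot search calls `isNegligible` and nothing else needs to round.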