XGBoost Cross Validation with different cut point - Java

I would like to create two binary prediction models: one with a cut point strictly greater than 0.5 (in order to obtain fewer signals, but better ones) and a second with a cut point strictly less than 0.5.
During cross-validation, the test error is computed with the cut point fixed at 0.5. How can I do this with another cut value? I am talking about XGBoost for Java.

xgboost returns a list of scores. You can do whatever you want with that list of scores.
I think that in Java specifically, it returns a 2-D float array with one score per instance.
In binary prediction you probably used a logistic objective, so your scores will be between 0 and 1.
Take your scores object and write a custom function that computes new predictions according to the rules you've described, for example:
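For illustration, a minimal sketch, assuming the XGBoost4J API (ml.dmlc.xgboost4j.java), where Booster.predict returns a float[][] of probabilities under a binary:logistic objective; the 0.7 cut point is just an example value:

    import ml.dmlc.xgboost4j.java.Booster;
    import ml.dmlc.xgboost4j.java.DMatrix;
    import ml.dmlc.xgboost4j.java.XGBoostError;

    public class ThresholdedPredictions {

        // Converts raw probabilities into 0/1 labels using a custom cut point.
        static int[] applyThreshold(float[][] scores, double cutPoint) {
            int[] labels = new int[scores.length];
            for (int i = 0; i < scores.length; i++) {
                labels[i] = scores[i][0] > cutPoint ? 1 : 0;
            }
            return labels;
        }

        // E.g. cutPoint = 0.7 for the "fewer but better signals" model.
        static int[] predictWithCutPoint(Booster booster, DMatrix data, double cutPoint)
                throws XGBoostError {
            float[][] scores = booster.predict(data);
            return applyThreshold(scores, cutPoint);
        }
    }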
If you are using an automated/xgboost-implemented cross-validation function, you might want to build a customized evaluation function that does what you want and pass it as an argument to xgb.cv.
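A sketch of such an evaluation function, assuming XGBoost4J's IEvaluation interface (getMetric/eval) and a hypothetical cut point of 0.7; check the interface against your XGBoost4J version:

    import ml.dmlc.xgboost4j.java.DMatrix;
    import ml.dmlc.xgboost4j.java.IEvaluation;
    import ml.dmlc.xgboost4j.java.XGBoostError;

    // Reports the error rate at a non-default cut point; pass an instance
    // of this class to the cross-validation call.
    public class ThresholdError implements IEvaluation {
        private final float cutPoint = 0.7f;

        @Override
        public String getMetric() {
            return "error@" + cutPoint;
        }

        @Override
        public float eval(float[][] predicts, DMatrix dmat) {
            try {
                float[] labels = dmat.getLabel();
                int wrong = 0;
                for (int i = 0; i < labels.length; i++) {
                    int predicted = predicts[i][0] > cutPoint ? 1 : 0;
                    if (predicted != (int) labels[i]) wrong++;
                }
                return wrong / (float) labels.length;
            } catch (XGBoostError e) {
                throw new RuntimeException(e);
            }
        }
    }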
If you want to be smart when setting your threshold, I suggest reading about the AUC of the ROC curve and the precision-recall curve.

Related

How to get the optimal cluster number using the elbow method for Java?

I use haifengl/smile and I need to get the optimal cluster number.
I am using CLARANS, where I need to specify the number of clusters to create. I think there may be a way to try, for example, 2 to 10 clusters, inspect each result, and choose the number of clusters with the best score. How can this be done with the elbow method?
The appropriate number of clusters, such that elements within a cluster are similar to each other and dissimilar to elements in other groups, can be found by applying a variety of techniques, such as:
Gap statistic: compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data.
Silhouette method: the optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
Sum-of-squares method.
For more details, read the sklearn documentation on this subject.
The Elbow method is not automatic.
You compute the scores for the desired range of k, plot this, and then visually try to find an "elbow" - which may or may not work.
Because x and y have no "correct" relation to each other, beware that the interpretation of the plot (and any geometric attempt to automate it) depends on the scaling of the plot and is inherently subjective. In the end, the entire concept of an "elbow" is likely flawed and not sound in this form. I'd rather look for more advanced measures where you can argue for a maximum or minimum, although some notion of a "significantly better k" would be desirable.
Ways to find clusters:
1- Silhouette method:
Using separation and cohesion, or just an implemented method, the optimal number of clusters is the one with the maximum silhouette coefficient. The silhouette coefficient ranges over [-1, 1], and 1 is the best value.
Example of the silhouette method with scikit-learn.
2- Elbow method (the elbow method can be applied automatically)
The elbow method plots the number of clusters against the average sum of squared distances.
To apply it automatically in Python, there is a library called Kneed that detects the knee in such a graph: Kneed Repository
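To make the sweep concrete, here is a sketch in Java, assuming Smile 2.x's CLARANS.fit(data, distance, k, maxNeighbor) and its distortion field (the API has changed between Smile releases, so check yours); the random data is only a stand-in for your dataset:

    import java.util.Random;
    import smile.clustering.CLARANS;
    import smile.math.distance.EuclideanDistance;

    public class ElbowSweep {
        public static void main(String[] args) {
            // Toy data stand-in; replace with your own dataset.
            Random rng = new Random(42);
            double[][] data = new double[300][2];
            for (double[] row : data) {
                row[0] = rng.nextGaussian();
                row[1] = rng.nextGaussian();
            }

            EuclideanDistance distance = new EuclideanDistance();
            for (int k = 2; k <= 10; k++) {
                CLARANS<double[]> model = CLARANS.fit(data, distance, k, 20);
                // Plot k against distortion and look for the bend by eye,
                // or feed the (k, distortion) curve to a knee detector.
                System.out.printf("k = %2d  distortion = %.4f%n", k, model.distortion);
            }
        }
    }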

How to create multivariate uniform distribution?

Goal
I would like to sample from a bivariate uniform distribution with a specified correlation coefficient in Java.
Question
What method can I use to implement such a multivariate uniform distribution?
or
Is there an existing package that implements such a thing, so that I don't have to reinvent the wheel?
What I've got so far
The mvtnorm package in R allows sampling from a multivariate normal distribution with specified correlation coefficients. I thought that understanding their method might help me out, either by doing something similar with uniform distributions or by repeating their work and using copulas to transform the multivariate normal into a multivariate uniform (as I did in R there).
The source code is written in Fortran, and I don't speak Fortran! The code is based on this paper by Genz and Bretz, but it is too math-heavy for me.
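For what it's worth, the copula route can be reproduced in Java without touching the Fortran. Here is a sketch, assuming Apache Commons Math 3 (MultivariateNormalDistribution, NormalDistribution): draw from a bivariate normal with correlation rho, then push each margin through the standard normal CDF. Under this Gaussian copula the uniforms' Pearson correlation is (6/pi)*asin(rho/2), so rho is adjusted below to hit a target r:

    import org.apache.commons.math3.distribution.MultivariateNormalDistribution;
    import org.apache.commons.math3.distribution.NormalDistribution;

    public class CorrelatedUniforms {
        public static void main(String[] args) {
            double targetR = 0.5;
            // Invert r = (6/pi)*asin(rho/2) to get the normal correlation.
            double rho = 2.0 * Math.sin(Math.PI * targetR / 6.0);

            MultivariateNormalDistribution mvn = new MultivariateNormalDistribution(
                    new double[] {0.0, 0.0},
                    new double[][] {{1.0, rho}, {rho, 1.0}});
            NormalDistribution stdNormal = new NormalDistribution(); // N(0,1)

            for (int i = 0; i < 5; i++) {
                double[] z = mvn.sample();
                // The CDF maps each normal margin to U(0,1), preserving the
                // rank correlation between the two components.
                double u = stdNormal.cumulativeProbability(z[0]);
                double v = stdNormal.cumulativeProbability(z[1]);
                System.out.printf("u = %.4f, v = %.4f%n", u, v);
            }
        }
    }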
I have an idea. Typically, you generate U(0,1) by producing, say, 32 random bits and dividing by 2^32, getting back one float. For two U(0,1) values you generate two 32-bit values, divide, and get two floats back. So far so good. Such a bivariate generator would be uncorrelated and very simple to check.
Now suppose you build your bivariate generator in the following way. Inside, you get two random 32-bit integers, and then produce two U(0,1) values with shared parts. Say, you take 24 bits from the first integer and 24 bits from the second integer, but the upper (or lower, or middle, or ...) 8 bits would be the same for both (taken from the first integer and copied into the second).
Clearly, those two U(0,1) variates would be correlated. We could write them as
U_0 = a_0 + b
U_1 = a_1 + b
omitting some scaling coefficients for simplicity. Each one is U(0,1), with mean 1/2 and variance 1/12. Now you have to compute the Pearson correlation as
r = (E[U_0 * U_1] - 1/4) / (sqrt(1/12))^2 = (E[U_0 * U_1] - 1/4) / (1/12)
Since a_0, a_1, and b are independent, E[U_0 * U_1] - 1/4 = Var(b), so r = Var(b) / (1/12): the variance of the shared part directly sets the correlation. You may vary the size of the shared part b, as well as its position (high bits, low bits, somewhere in the middle), to fit the desired r.
Realistically speaking, there are infinitely many ways to obtain the same r with different sampling code and different bivariate distributions. You might want to add more constraints in the future.
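A sketch of the shared-bits construction in plain Java: the top 8 of 24 bits are copied from the first draw into the second, and a Monte Carlo loop estimates the resulting r, which you can tune by resizing or moving the shared block:

    import java.util.Random;

    public class SharedBitsBivariate {
        public static void main(String[] args) {
            Random rng = new Random();
            int n = 1_000_000;
            double sumU = 0, sumV = 0, sumUV = 0, sumU2 = 0, sumV2 = 0;

            for (int i = 0; i < n; i++) {
                int a = rng.nextInt() >>> 8;          // 24 random bits
                int b = rng.nextInt() >>> 8;          // 24 random bits
                int shared = a & 0xFF0000;            // top 8 of the 24 bits
                int bMixed = (b & 0x00FFFF) | shared; // copy shared bits into b

                double u = a / (double) (1 << 24);
                double v = bMixed / (double) (1 << 24);
                sumU += u; sumV += v; sumUV += u * v;
                sumU2 += u * u; sumV2 += v * v;
            }

            // Empirical Pearson correlation of the two U(0,1) streams.
            double r = (sumUV / n - (sumU / n) * (sumV / n))
                     / Math.sqrt((sumU2 / n - Math.pow(sumU / n, 2))
                               * (sumV2 / n - Math.pow(sumV / n, 2)));
            System.out.printf("empirical r = %.4f%n", r);
        }
    }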

How to test the correct implementation of special polynomials?

I need to implement the calculation of some special polynomials in Java (the language is not really important). These are calculated as a weighted sum of a number of base polynomials with fixed coefficients.
Each base polynomial has 2 to 10 coefficients, and there are typically 10 base polynomials considered, giving a total of, say, 20-50 coefficients.
Basically, the calculation is no big deal, but I am worried about typos. I only have a printed document as a template, so I would like to implement unit tests for the calculations. The issue is: how do I get reliable testing data? I do have another piece of software that is supposed to calculate these functions, but the process is complicated and error-prone: I would have to scale the input values, go through a number of menu selections in the software to produce the output, and then paste it into my testing code.
I guess there is no way around using the external software to generate some testing data, but maybe you have some recommendations for making this type of testing procedure safer or for minimizing the required number of test cases.
I am also worried about providing suitable input values: Depending on the value of the independent variable, certain terms will only have a tiny contribution to the output, while for other values they might dominate.
The types of errors I expect (and need to avoid) are:
Typos in coefficients
Coefficients applied to the wrong power (e.g. a_7*x^6 instead of a_7*x^7; just for demonstration, I am not calculating this way but am using Horner's scheme)
Off-by-one errors (i.e. missing the zero-order or highest-order term)
Since you have a polynomial of degree 10, testing at 11 distinct points gives certainty: two polynomials of degree at most 10 that agree at 11 distinct points are identical.
However, even a test at a single well-randomized point, say x = 1.23004 (away from simple fractions like 2/3 or 4/5), will with high probability reveal an error, because it is unlikely that the difference between the wrong and the true polynomial has a root at exactly that place.
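A sketch of such a test in plain Java (evaluate and naive are hypothetical names): the Horner implementation under test and an independently written power-sum version are evaluated at several randomized points. Note this catches wrong-power and off-by-one errors, but coefficient typos shared by both copies still require checking against an independent source such as the external software:

    import java.util.Random;

    public class PolynomialCheck {
        // Placeholder coefficients a_0..a_n; use the printed values, ideally
        // transcribed a second time by another person to decorrelate typos.
        static final double[] COEFFS = {1.0, -3.5, 0.25};

        // Naive power-sum evaluation, deliberately structured differently
        // from Horner's scheme.
        static double naive(double x) {
            double sum = 0;
            for (int k = 0; k < COEFFS.length; k++) {
                sum += COEFFS[k] * Math.pow(x, k);
            }
            return sum;
        }

        // Horner's scheme; stands in for the implementation under test.
        static double evaluate(double x) {
            double result = 0;
            for (int k = COEFFS.length - 1; k >= 0; k--) {
                result = result * x + COEFFS[k];
            }
            return result;
        }

        public static void main(String[] args) {
            Random rng = new Random(1);
            for (int i = 0; i < 20; i++) {
                double x = -2 + 4 * rng.nextDouble(); // spread over the domain
                double expected = naive(x);
                double actual = evaluate(x);
                double tol = 1e-9 * Math.max(1, Math.abs(expected));
                if (Math.abs(expected - actual) > tol) {
                    throw new AssertionError("Mismatch at x = " + x);
                }
            }
            System.out.println("All points agree.");
        }
    }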

Unit testing a discrete Fourier transformation

Several months ago I had to implement a two-dimensional Fourier transformation in Java. While the results seemed sane for a few manual checks, I wondered what a good test-driven approach would look like.
Basically, what I did was check that the DC components had reasonable values and compare whether the AC components roughly matched the Mathematica output.
My question is: Which unit tests would you implement for a discrete Fourier transformation? How would you validate results returned by your calculation?
As with other unit tests, you should consider small fixed input test vectors for which results can easily be computed manually and compared against. For the more involved input test vectors, a direct DFT implementation should be easy enough to write and use to cross-validate results (possibly on top of your own manual computations).
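Such a direct reference DFT is only a few lines; here is a sketch in Java (the split real/imaginary data layout is an assumption, adapt it to yours):

    public class ReferenceDft {
        // Computes X[k] = sum_t x[t] * e^(-2*pi*j*k*t/N) directly in O(N^2);
        // returns {real parts, imaginary parts}.
        static double[][] dft(double[] re, double[] im) {
            int n = re.length;
            double[] outRe = new double[n];
            double[] outIm = new double[n];
            for (int k = 0; k < n; k++) {
                for (int t = 0; t < n; t++) {
                    double angle = -2.0 * Math.PI * k * t / n;
                    double c = Math.cos(angle), s = Math.sin(angle);
                    outRe[k] += re[t] * c - im[t] * s;
                    outIm[k] += re[t] * s + im[t] * c;
                }
            }
            return new double[][] {outRe, outIm};
        }
    }

Apply your FFT and this reference to the same random input and assert that the outputs agree to a tolerance on the order of floating-point precision.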
As far as specific test vectors for one-dimensional FFT, you can start with the following from dsprelated, which they selected to exercise common flaws:
Single FFT tests - N inputs and N outputs
Input random data
Inputs are all zeros
Inputs are all ones (or some other nonzero value)
Inputs alternate between +1 and -1.
Input is e^(8*j*2*pi*i/N) for i = 0,1,2, ...,N-1. (j = sqrt(-1))
Input is cos(8*2*pi*i/N) for i = 0,1,2, ...,N-1.
Input is e^((43/7)*j*2*pi*i/N) for i = 0,1,2, ...,N-1. (j = sqrt(-1))
Input is cos((43/7)*2*pi*i/N) for i = 0,1,2, ...,N-1.
Multi FFT tests - run continuous sets of random data
Data sets start at times 0, N, 2N, 3N, 4N, ....
Data sets start at times 0, N+1, 2N+2, 3N+3, 4N+4, ....
For two-dimensional FFT, you can then build on the above. The first three cases are still directly applicable (random data, all zeros, all ones). Others require a bit more work but are still manageable for small input sizes.
Finally, Google searches should yield some reference images (before and after transform) for a few common cases such as black & white squares, rectangles, and circles, which can be used as references (see for example http://www.fmwconcepts.com/misc_tests/FFT_tests/).
99.9% of the numerical and coding issues you are likely to encounter will be found by testing with random complex vectors and comparing against a direct DFT to a tolerance on the order of floating-point precision.
Zero, constant, or sinusoidal vectors may help you understand a failure by letting your eye catch issues like initialization, clipping, folding, and scaling. But they will not typically find anything that the random case does not.
My kissfft library does a few extra tests related to fixed-point issues, which are not a concern if you are working in floating point.

Java - normalize and denormalize nominal attributes in neural networks

Hi, I am building a simple multilayer network which is trained using back-propagation. My problem at the moment is that some attributes in my dataset are nominal (non-numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting how many distinct values there are for each attribute and assigning each one an equally spaced number between 0 and 1. For example, suppose one of my attributes had values A to E; would the following be suitable?
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part of my question is about denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also, after I get an output from the network, do I just see which number it is closest to? For example, if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the value nearest to the output (0.435), which is y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A, B, etc. stand for colours. These are the values of a nominal variable and cannot be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables.
Now suppose A, B, C, etc. stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to a chest circumference of 40), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.
One-of-N encoding
A better way to go about this is to use the so-called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.
A similar principle applies to the outputs of the network. If we treat garment size as an output instead of an input, then when the network outputs the vector [0.01 -0.01 0.5 0.0001 -0.0002], you interpret that as size medium; a sketch of this encode/decode step follows below.
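A minimal sketch of this encode/decode step in Java, using hypothetical garment-size labels:

    public class OneHot {
        static final String[] VALUES = {"XS", "S", "M", "L", "XL"};

        // Encodes a nominal value as a one-out-of-N vector, e.g. "M" -> [0,0,1,0,0].
        static double[] encode(String value) {
            double[] v = new double[VALUES.length];
            for (int i = 0; i < VALUES.length; i++) {
                if (VALUES[i].equals(value)) {
                    v[i] = 1.0;
                    return v;
                }
            }
            throw new IllegalArgumentException("Unknown value: " + value);
        }

        // Decodes a network output vector by taking the index of the maximum.
        static String decode(double[] output) {
            int best = 0;
            for (int i = 1; i < output.length; i++) {
                if (output[i] > output[best]) best = i;
            }
            return VALUES[best];
        }

        public static void main(String[] args) {
            // The vector from the example above decodes to "M" (medium).
            System.out.println(decode(new double[] {0.01, -0.01, 0.5, 0.0001, -0.0002}));
        }
    }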
In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might want to normalise the values of your 4 continuous inputs to the range [0, 1], because they may end up dominating your discrete variable.
It really depends on the meaning of the attributes you're trying to normalize, and the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to do.
If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.
EDIT:
If the NN was able to accurately perform its function, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see the inaccurate results.
Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.
If you've successfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.
In all of these cases, there's a strong likelihood that your NN did something entirely different than what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any. But there's absolutely no guarantee that it'll be a better guess.
