I have two Weka instances which, when printed, look as follows:
0.44,0.34,0.48,0.5,0.3,0.33,0.43,cp
0.51,0.37,0.48,0.5,0.35,0.36,0.45,cp
I am trying to obtain their distance using the built-in Euclidean Distance function. My code:
EuclideanDistance e = new EuclideanDistance(neighbours);
double x = e.distance(neighbours.instance(0), neighbours.instance(1));
Where neighbours is an object of type Instances and the objects at indexes 0 and 1 are the two instances I referred to.
I am slightly confused because x comes back as 1.5760032627255223, whereas by doing the calculation separately I was expecting 0.09798. cp is the class label, but earlier in my code I did specify data.setClassIndex(data.numAttributes() - 1);
Any advice?
By default, Weka's EuclideanDistance metric normalizes the ranges to compute the distance. If you don't want that, call e.setDontNormalize(true).
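For example, reusing the neighbours object from the question, a minimal sketch would be:
EuclideanDistance e = new EuclideanDistance(neighbours);
e.setDontNormalize(true);  // use the raw attribute values instead of normalized ranges
double x = e.distance(neighbours.instance(0), neighbours.instance(1));
// x should now match the hand calculation (about 0.098 for the two instances above)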
I would like to create two models for binary prediction: one with a cut point strictly greater than 0.5 (in order to obtain fewer but better signals) and a second with a cut point strictly less than 0.5.
When doing cross-validation, the test error is computed with the cut point equal to 0.5. How can I do it with another cut value? I am talking about XGBoost for Java.
XGBoost returns a list of scores, and you can do whatever you want with that list of scores.
I think that in Java in particular it returns a 2D ArrayList of shape (1, n).
In binary prediction you probably used a logistic objective, so your scores will be between 0 and 1.
Take your scores object and create a custom function that calculates new predictions according to the rules you've described.
If you are using an automated/xgboost-implemented cross-validation function, you might want to build a customized evaluation function which does as you bid, and pass it as an argument to xgb.cv.
If you want to be smart when setting your threshold, I suggest reading about the AUC of the ROC curve and the precision-recall curve.
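Concretely, applying a custom cut point to the scores could look like this rough sketch (assuming XGBoost4J, where booster.predict(testData) returns one probability per row for a binary:logistic model; booster and testData stand in for your own objects):
// Sketch only: booster and testData are assumed to exist already;
// predict() may throw XGBoostError in XGBoost4J.
float[][] rawScores = booster.predict(testData);
double cutPoint = 0.7;  // your custom cut point instead of 0.5
int[] predictions = new int[rawScores.length];
for (int i = 0; i < rawScores.length; i++) {
    // rawScores[i][0] is the score for the positive class
    predictions[i] = rawScores[i][0] > cutPoint ? 1 : 0;
}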
I have already understood Horner's scheme for one-variable polynomials like (2x^3+x+1), but I have not found a clear explanation for two-variable polynomials like (2x^6+3y+9), and I want to create a program in Java to calculate the scheme for me.
Use the fact that A[x,Y] = A[x][Y]. In other words, consider your polynomials in x and Y, with coefficients in some set A, as polynomials in Y whose coefficients are in turn polynomials, this time in A[x]. For example, rewrite
x^3+x^2Y+xY^2+xY+x^2+x+Y^3+Y^2+Y+1
as
Y^3 + (x+1)Y^2 + (x^2+x+1)Y + (x^3+x^2+x+1)
and then apply Horner first in A[x][Y], using it again for every one of the coefficients 1, x+1, x^2+x+1 and x^3+x^2+x+1 in A[x].
Note that this will require first sorting the monomials according to their Y-degree and, after grouping the coefficients, sorting their monomials according to the x-degree.
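For instance, a rough Java sketch of this nested scheme, assuming the coefficients are stored in a 2D array where coeffs[i][j] holds the coefficient of Y^i * x^j:
// Sketch: evaluate a two-variable polynomial with nested Horner.
// coeffs[i][j] is assumed to hold the coefficient of Y^i * x^j.
static double evaluate(double[][] coeffs, double x, double y) {
    double result = 0.0;
    for (int i = coeffs.length - 1; i >= 0; i--) {         // Horner in Y
        double inner = 0.0;
        for (int j = coeffs[i].length - 1; j >= 0; j--) {  // Horner in x for the coefficient of Y^i
            inner = inner * x + coeffs[i][j];
        }
        result = result * y + inner;
    }
    return result;
}
For the example above, coeffs would be {{1,1,1,1}, {1,1,1,0}, {1,1,0,0}, {1,0,0,0}}, i.e. the coefficients x^3+x^2+x+1, x^2+x+1, x+1 and 1 of Y^0 through Y^3.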
For two variables you could separate the two and apply the rule to each group:
x^3+x^2y+xy^2+xy+x^2+x+1+y^3+y^2+y+1=
=[1+x(1+y+y^2+x(1+y+x))]+[1+y(1+y(1+y))]
So the algorithm would be:
Put the x and xy terms in one group.
Treat y as a constant and apply Horner's scheme to that group.
Put the y-only terms in another group.
Apply Horner's scheme to that group as well (the grouped form is evaluated as shown below).
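In Java, evaluating the grouped form at a concrete point then amounts to two Horner-style expressions (a sketch, with x and y given as doubles):
// Sketch: evaluate the two groups of the example polynomial at given x and y
double xGroup = 1 + x * (1 + y + y * y + x * (1 + y + x));  // terms containing x, with y treated as a constant
double yGroup = 1 + y * (1 + y * (1 + y));                  // y-only terms
double value = xGroup + yGroup;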
I am calculating the 95th percentile of the following list of numbers:
66,337.8,989.7,1134.6,1118.7,1097.9,1122.1,1121.3,1106.7,871,325.2,285.1,264.1,295.8,342.4
The Apache libraries use the NIST standard to calculate the percentile, which is the same method used by Excel. According to Excel, the 95th percentile of the list above should be 1125.85.
However, using the following code I get a different result:
DescriptiveStatistics shortList = new DescriptiveStatistics();

@BeforeTest
@Parameters("shortStatsList")
private void buildShortStatisticsList(String list) {
    StringTokenizer tokens = new StringTokenizer(list, ",");
    while (tokens.hasMoreTokens()) {
        shortList.addValue(Double.parseDouble(tokens.nextToken()));
    }
}

@Test
@Parameters("95thPercentileShortList")
public void percentileShortListTest(String percentile) {
    Assert.assertEquals(Double.toString(shortList.getPercentile(95)), percentile);
}
This fails with the following message:
java.lang.AssertionError: expected:<1125.85> but was:<1134.6>
at org.testng.Assert.fail(Assert.java:89)
at org.testng.Assert.failNotEquals(Assert.java:489)
1134.6 is the maximum value in the list, not the 95th percentile, so I don't know where this value is coming from.
According to the documentation of getPercentile() it is using the percentile estimation algorithm, as recorded here.
Percentiles can be estimated from N measurements as follows: for the pth percentile, set p(N+1) equal to k+d for k an integer, and d, a fraction greater than or equal to 0 and less than 1.
For 0<k<N, Y(p)=Y[k]+d(Y[k+1]−Y[k])
For k=0, Y(p)=Y[1]
Note that any p ≤ 1/(N+1) will simply be set to the minimum value.
For k ≥ N, Y(p) = Y[N]
Note that any p ≥ N/(N+1) will simply be set to the maximum value.
Basically this means multiplying the requested percentile (0.95) by (N+1). In your case N is 15, and N+1 is 16, so you get 15.2.
You split this into the whole part k (15) and the fraction d (0.2). Since k ≥ N, the third case above applies: the estimated percentile is the maximum value.
If you keep on reading the NIST article that I linked above, you'll see the part titled "Note that there are other ways of calculating percentiles in common use". It refers you to an article by Hyndman & Fan, which describes several alternative ways of calculating percentiles. It's a misconception to think that there is one single NIST method. The methods in Hyndman & Fan are denoted by the labels R1 through R9. The article goes on to say:
Some software packages set 1+p(N−1) equal to k+d and then proceed as above. This is method R7 of Hyndman and Fan. This is the method used by Excel and is the default method for R (the R quantile function can optionally use any of the nine methods discussed in Hyndman & Fan).
The method used by default by Apache's DescriptiveStatistics is Hyndman & Fan's R6. The method used by Excel is R7. Both of them are "NIST methods", but for a small number of measurements, they can give different results.
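You can check this by hand on the sorted list, whose two largest values are 1122.1 and 1134.6:
R6: position = 0.95 × (15 + 1) = 15.2, which is ≥ N, so the estimate is the maximum, 1134.6.
R7: position = 1 + 0.95 × (15 − 1) = 14.3, so k = 14, d = 0.3, and the estimate is 1122.1 + 0.3 × (1134.6 − 1122.1) = 1125.85.
That accounts for both values in the assertion message.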
Note that the Apache library does allow you to use the R7 algorithm or any of the others, by using the Percentile class. Something like this should do the trick:
DescriptiveStatistics shortList = new DescriptiveStatistics();
shortList.setPercentileImpl(new Percentile()
        .withEstimationType(Percentile.EstimationType.R_7));
(Note that I haven't tested this).
I need to implement a multiplication formula where a row matrix of size n is to be multiplied by an n*n matrix.
I have used the DenseMatrix class to create the n*n matrix from a 2D array, but my problem is how to create a row vector.
I can use the CompRowMatrix class to create a row matrix, but for that the input must be a Matrix, and Matrix is an interface that can't be instantiated. The first constructor of CompRowMatrix states it requires a 'non-zero array of indices' as input, but I am unable to understand what this non-zero array of indices is.
Also, I can create a vector with DenseVector or any other suitable class, but there seems to be no method to directly multiply a vector with a matrix.
Please help.
The CompRowMatrix class is not really intended to be used as a row vector; rather, it is used to represent sparse matrices in such a way that it is easy to iterate over the matrix elements row by row.
While it is possible to use CompRowMatrix as a vector by setting all rows other than the first to zero, this is more complicated for you as a programmer and less efficient for the code, which has to assume that other rows could potentially become non-zero.
Instead, use a DenseVector object to hold your row vector, together with the mult and transMult methods from the Matrix interface. mult(x, y) computes the matrix-vector product y = A*x, while transMult(x, y) computes y = A^T*x, which is exactly the row-vector product x*A that you want. Both accept two Vector objects as arguments and are called on the matrix object being multiplied with the following arguments:
1st arg, x, is the vector you want to multiply with your matrix
2nd arg, y, holds the result of the multiplication
So to produce the vector-matrix product y = x*A (where both x and y are 1xnrow vectors and A is an nxn matrix), you would do something like this:
// create matrix A
double[][] matValues = new double[n][n];
... // initialize values of the matrix
Matrix A = new DenseMatrix(matValues);
// create vector x
double[] vecValues = new double[n];
... // initialize values of the vector
Vector x = new DenseVector(vecValues);
// create vector y to store result of multiplication
Vector y = new DenseVector(n);
// perform the multiplication; transMult computes y = A^T * x,
// which is the same as the row-vector product y = x * A
A.transMult(x, y);
Now you can use y in the rest of your code as needed. It is important that you allocate y before the multiplication, but it is irrelevant what data it holds; the multiplication will overwrite whatever is in y on exit.
Also note that the ways I chose to initialize x and A are not the only ways available. For instance, the above code automatically deep copies the arrays vecValues and matValues when constructing the corresponding Vector and Matrix objects. If you don't intend to use the arrays for any other purpose, then you should probably skip this deep copy. You do that by passing an extra boolean parameter set to false in the constructor, e.g.
// create matrix A without deep copying matValues
Matrix A = new DenseMatrix(matValues, false);
You should refer to the javadoc both you and I linked to earlier for more constructor options. Be aware, however, that said javadoc is for a different version than the current release of MTJ (version 1.01 as of the time of this post). I don't know which version it is for, nor have I been able to find javadoc for the current version, but I did spot a few differences between it and the current source code.
If I understand your question, one solution would be to create a matrix with one row and n columns to premultiply the n x n matrix. There are routines for multiplying vectors, but I believe they all have the vector post-multiplying the matrix. If you'd like to use these routines instead, you'd have to do the appropriate transposes.
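As a rough sketch of that approach (assuming MTJ's DenseMatrix and the Matrix.mult(Matrix, Matrix) method, which computes C = A*B, and reusing vecValues, A and n from the earlier snippet):
// create a 1 x n row matrix X from the same values as before
Matrix X = new DenseMatrix(1, n);
for (int j = 0; j < n; j++) {
    X.set(0, j, vecValues[j]);
}
// allocate the 1 x n result and premultiply: C = X * A
Matrix C = new DenseMatrix(1, n);
X.mult(A, C);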
Is there a distance calculation implementation using Hadoop MapReduce? I am trying to calculate the distance between a given set of points.
Looking for any resources.
Edit
This is a very intelligent solution. I tried something like the first algorithm, and I got almost what I was looking for. I am not concerned about optimizing the program at the moment, but my problem was that the dist(X,Y) function was not working. When I got all the points in the reducer, I was unable to go through all the points with an Iterator and calculate the distance. Someone on stackoverflow.com told me that the Iterator in Hadoop is different from the normal Java Iterator; I am not sure about that. But if I can find a simple way to go through the Iterator in my dist() function, I can use your second algorithm to optimize.
// This is your code and I am referring to it below, just to make my point clear.
map(x,y) {
    for i in 1:N  # number of points
        emit(i, (x,y))  // I did exactly like this
reduce (i, X)
    p1 = X[i]
    for j in i:N
        // here is my problem, I can't get the values from the Iterator
        emit(dist(X[i], X[j]))
You need to do a self-join on that data set. In Hive that would look, more or less, like
select dist(P1.x,P1.y,P2.x, P2.y) from points P1 join points P2 on (True) where P1.x < P2.x or (P1.x = P2.x and P1.y < P2.y)
The function dist would need to be implemented using other Hive functions or written in Java and added as a UDF. Also, I am not sure about the True constant, but you can write 0=0 to the same effect. The where clause is there to avoid computing the same distance twice, or zero distances. The question is: would Hive optimize this the way you can by programming carefully in Hadoop? I am not sure. Here is a sketch in Hadoop:
map(x,y) {
    for i in 1:N  # number of points
        emit(i, (x,y))
reduce (i, X)
    p1 = X[i]
    for j in i:N
        emit(dist(X[i], X[j]))
For this to work you need X to get to the reducer sorted in some order, for instance by x and then by y using secondary sort keys (that do not affect the grouping). This way every reducer gets a copy of all the points and works on a column of the distance matrix you are trying to generate. The memory requirements are minimal. You could trade some communication for memory by re-organizing the computation so that every reducer computes a square submatrix of the final matrix, knowing only two subsets of the points and calculating the distances among all of them. To achieve this, you need to make explicit the order of your points, say you are storing i, x, y
map(i,x,y) {
    for j in 1:N/k  # k is the size of a submatrix
        emit((i/k, j), ("row", (x,y)))
        emit((j, i/k), ("col", (x,y)))
reduce ((a,b), Z)
    split Z in rows X and cols Y
    for x in X
        for y in Y
            emit(dist(x,y))
In this case you can see that the map phase emits only 2*N*N/k points, whereas the previous algorithm emitted N^2. Here we have (N/k)^2 reducers vs N for the other one. Each reducer has to hold k values in memory (using the secondary-key technique to have all the rows get to the reducer before all the columns), vs only 2 before. So you see there are tradeoffs, and for the second algorithm you can use the parameter k for performance tuning.
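Regarding the Iterator problem from your edit: the reducer's value Iterator can only be traversed once and Hadoop reuses the value objects, so the usual trick is to copy the points into a list before the pairwise loop. A rough sketch against the org.apache.hadoop.mapreduce API, where Point is a hypothetical custom Writable with getX()/getY() and dist() is your own function:
public void reduce(IntWritable key, Iterable<Point> values, Context context)
        throws IOException, InterruptedException {
    List<Point> points = new ArrayList<>();
    for (Point p : values) {
        points.add(new Point(p.getX(), p.getY()));  // copy, since Hadoop reuses p
    }
    for (int i = 0; i < points.size(); i++) {
        for (int j = i + 1; j < points.size(); j++) {
            context.write(key, new DoubleWritable(dist(points.get(i), points.get(j))));
        }
    }
}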
This problem does not sound like a good fit for map-reduce since you're not really able to break it into pieces and calculate each piece independently. If you could have a separate program that generates the complete graph of your points as a list (x1,y1,x2,y2) then you could do a straightforward map to get the distance.
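For instance, if each input line were already a pre-generated pair "x1,y1,x2,y2", the map could be as simple as this sketch (new mapreduce API, Euclidean distance assumed):
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] p = value.toString().split(",");
    double dx = Double.parseDouble(p[0]) - Double.parseDouble(p[2]);
    double dy = Double.parseDouble(p[1]) - Double.parseDouble(p[3]);
    context.write(value, new DoubleWritable(Math.sqrt(dx * dx + dy * dy)));
}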