digit categorisation using Euclidean distance


I want to categorise digits that are represented in a 64-dimensional space, i.e. as 8x8 pixel character images. Each attribute is an integer from 0 to 16. I have 20 rows of 64 values, plus one value at the end of each row that gives the category. The categories were assigned beforehand by UCI, but I want to know how they arrived at the category for each row. So they say they used Euclidean distance to determine the category.
My question is how do I apply Euclidean distance to 64 values? I tried to use the following formula (Pythagorean theorem), Math.sqrt(Math.pow(x2-x1,2)+Math.pow(y2-y1,2)), within a row, but the result was too big and I do not know what it represents. For example, for the first row I obtained 1612, whose square root is 40.15.
This is my code for the process:
public static void main(String[]args)
{
int row[]= new int[64];
for(int z=0;z<64;z++)
{
row[z]=digits[0][z]; //get the first row and store it
}
double result = 0;
for(int z=0;z<64;z+=2)
{
double distance = Math.pow(row[z]-row[z+1],2);
result = result+distance; //add distance each time
System.out.print(result+", ");
}
}
The first row of digits is this:
0,0,5,13,9,1,0,0,0,0,13,15,10,15,5,0,0,3,15,2,0,11,8,0,0,4,12,0,0,8,8,0,0,5,8,0,0,9,8,0,0,4,11,0,1,12,7,0,0,2,14,5,10,12,0,0,0,0,6,13,10,0,0,0,0
I am not sure if this makes sense but if something is not clear please do ask.
Thanks in advance.

My question is how do I apply Euclidean distance to 64 values?
You do not. Distance is a measure between two objects, each of which can have 64 values, but you need two objects. In particular, euclidean distance is defined as
dist(x, y) = ||x-y||_2 = sqrt[ SUM_{i=1}^d (x_i - y_i)^2 ]
where d is the number of dimensions, and x_i means ith dimension of x.
So they say they used Euclidean distance to determine the category.
They said more than that, as the distance itself does not define anything besides... distance. A category, on the other hand, is an abstract object, which might be defined by some characteristic point (a centroid); you then assign the category whose centroid is closest (in terms of the given distance).
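To make the formula and the centroid idea concrete, here is a minimal Java sketch. It assumes each category is represented by one centroid vector; the class name, method names and the tiny 4-dimensional data are made up for illustration, and this is not necessarily how UCI assigned the labels.

```java
final class NearestCentroid {
    // Euclidean distance between two vectors with the same number of dimensions.
    static double distance(int[] x, int[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the centroid closest to the sample (hypothetical: one centroid per category).
    static int classify(int[] sample, int[][] centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance(sample, centroids[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up 4-dimensional centroids; real digit rows would have 64 values each.
        int[][] centroids = { {0, 0, 0, 0}, {16, 16, 16, 16} };
        int[] sample = {1, 2, 0, 3};
        System.out.println(distance(sample, centroids[0]));
        System.out.println(classify(sample, centroids));
    }
}
```

The same two methods work unchanged for 64-value rows, because the loop runs over however many dimensions the arrays have.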

Related

How to compare two curves (array of points)

I am having trouble finding a method to compare two trajectories (curves).
The first, original one contains points (x,y).
The second one can be offset, scaled smaller or larger, and rotated; it is also an array of points (x,y).
The first method I tried is to find the smallest distance between two points, repeat this in every iteration, sum the results and divide by the number of points. The result then tells me the average error per point:
http://www.mathopenref.com/coorddist.html
I also found this method:
https://help.scilab.org/docs/6.0.0/en_US/fminsearch.html
But I can't figure out how to use it.
I would like to compare both trajectories, but my results have to account for rotation, or at least for an offset at the beginning.
My current approach calculates the error per point (distance):
1. Get the coordinates (x,y) of a point on the second trajectory.
2. In a loop, find the minimum distance between the (x,y) from step 1 and the points of the original trajectory.
3. Add the smallest distance found in step 2 to a running sum.
4. Divide the sum of smallest distances by the number of points in the second trajectory.
My result describes the average error (distance) per point compared with the original trajectory.
But I cannot figure out how to handle a trajectory that is rotated, scaled or shifted.
Please look at my example trajectories:
http://pokazywarka.pl/trajectory/
http://pokazywarka.pl/trajectory2/

So you need to compare the shape of 2 curves in a way that is invariant to rotation, translation and scale.
Solution
Let us assume 2 sine waves for testing, both rotated and scaled but with the same aspect ratio, and one with added noise. I generated them in C++ like this:
struct _pnt2D
{
double x,y;
// inline
_pnt2D() {}
_pnt2D(_pnt2D& a) { *this=a; }
~_pnt2D() {}
_pnt2D* operator = (const _pnt2D *a) { *this=*a; return this; }
//_pnt2D* operator = (const _pnt2D &a) { ...copy... return this; }
};
List<_pnt2D> curve0,curve1; // curves points
_pnt2D p0,u0,v0,p1,u1,v1; // curves OBBs
const double deg=M_PI/180.0;
const double rad=180.0/M_PI;
void rotate2D(double alfa,double x0,double y0,double &x,double &y)
{
double a=x-x0,b=y-y0,c,s;
c=cos(alfa);
s=sin(alfa);
x=x0+a*c-b*s;
y=y0+a*s+b*c;
}
// this code is the init stuff:
int i;
double x,y,a;
_pnt2D p,*pp;
Randomize();
for (x=0;x<2.0*M_PI;x+=0.01)
{
y=sin(x);
p.x= 50.0+(100.0*x);
p.y=180.0-( 50.0*y);
rotate2D(+15.0*deg,200,180,p.x,p.y);
curve0.add(p);
p.x=150.0+( 50.0*x);
p.y=200.0-( 25.0*y)+5.0*Random();
rotate2D(-25.0*deg,250,100,p.x,p.y);
curve1.add(p);
}
OBB (oriented bounding box)
Compute the OBB of each curve. It gives the rotation angle and position of both curves, so you can rotate one of them so that they start at the same position and have the same orientation.
If the OBB sizes are too different, then the curves are different.
For the above example it yields this result:
Each OBB is defined by start point P and basis vectors U,V where |U|>=|V| and z coordinate of U x V is positive. That will ensure the same winding for all OBBs. It can be done in OBBox_compute by adding this to the end:
// |U|>=|V|
if ((u.x*u.x)+(u.y*u.y)<(v.x*v.x)+(v.y*v.y)) { _pnt2D p; p=u; u=v; v=p; }
// (U x V).z > 0
if ((u.x*v.y)-(u.y*v.x)<0.0)
{
p0.x+=v.x;
p0.y+=v.y;
v.x=-v.x;
v.y=-v.y;
}
So curve0 has p0,u0,v0 and curve1 has p1,u1,v1.
Now we want to rescale, translate and rotate curve1 to match curve0. It can be done like this:
// compute OBB
OBBox_compute(p0,u0,v0,curve0.dat,curve0.num);
OBBox_compute(p1,u1,v1,curve1.dat,curve1.num);
// difference angle = - acos((U0.U1)/(|U0|.|U1|))
a=-acos(((u0.x*u1.x)+(u0.y*u1.y))/(sqrt((u0.x*u0.x)+(u0.y*u0.y))*sqrt((u1.x*u1.x)+(u1.y*u1.y))));
// rotate curve1
for (pp=curve1.dat,i=0;i<curve1.num;i++,pp++)
rotate2D(a,p1.x,p1.y,pp->x,pp->y);
// rotate OBB1
rotate2D(a,0.0,0.0,u1.x,u1.y);
rotate2D(a,0.0,0.0,v1.x,v1.y);
// translation difference = P0-P1
x=p0.x-p1.x;
y=p0.y-p1.y;
// translate curve1
for (pp=curve1.dat,i=0;i<curve1.num;i++,pp++)
{
pp->x+=x;
pp->y+=y;
}
// translate OBB1
p1.x+=x;
p1.y+=y;
// scale difference = |U0|/|U1|
x=sqrt((u0.x*u0.x)+(u0.y*u0.y))/sqrt((u1.x*u1.x)+(u1.y*u1.y));
// scale curve1
for (pp=curve1.dat,i=0;i<curve1.num;i++,pp++)
{
pp->x=((pp->x-p0.x)*x)+p0.x;
pp->y=((pp->y-p0.y)*x)+p0.y;
}
// scale OBB1
u1.x*=x;
u1.y*=x;
v1.x*=x;
v1.y*=x;
You can use the approach from "Understanding 4x4 homogenous transform matrices" to do all this in one step. Here is the result:
sampling
In case of non-uniform or very different point density between the curves, or between parts of them, you should re-sample your curves to a common point density. You can use linear or polynomial interpolation for this. You also do not need to store the new sampling in memory; instead you could build a function that returns a point of each curve parametrized by arc length from the start.
point curve0(double distance);
point curve1(double distance);
comparison
Now you can subtract the 2 curves and sum up the abs of the differences. Then divide it by the curve length and threshold the result.
double sum=0.0,l;
for (l=0.0;l<=bigger_curve_length;l+=step)
sum+=fabs(curve0(l)-curve1(l)); // distance between the two sampled points
sum/=bigger_curve_length;
if (sum>threshold) curves are different
else curves match
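The sampling and comparison steps can be sketched in Java as follows. This is an illustration under stated assumptions, not the code used for the images in this answer: the curves are taken to be already aligned (rotation, translation and scale removed), they are stored as parallel x/y arrays, interpolation is linear, and the sum is averaged over the number of samples rather than divided by the curve length. All class and method names are made up.

```java
final class CurveCompare {
    // Cumulative arc length at each vertex of the polyline (xs[i], ys[i]).
    static double[] cumulativeLength(double[] xs, double[] ys) {
        double[] cum = new double[xs.length];
        for (int i = 1; i < xs.length; i++) {
            cum[i] = cum[i - 1] + Math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1]);
        }
        return cum;
    }

    // Point at arc-length distance d from the start, linearly interpolated
    // between vertices and clamped to the end points of the curve.
    static double[] pointAt(double[] xs, double[] ys, double[] cum, double d) {
        int last = xs.length - 1;
        if (d <= 0.0) return new double[] { xs[0], ys[0] };
        if (d >= cum[last]) return new double[] { xs[last], ys[last] };
        int i = 1;
        while (cum[i] < d) i++;
        double t = (d - cum[i - 1]) / (cum[i] - cum[i - 1]);
        return new double[] {
            xs[i - 1] + t * (xs[i] - xs[i - 1]),
            ys[i - 1] + t * (ys[i] - ys[i - 1])
        };
    }

    // Mean distance between corresponding points, sampled every `step`
    // units of arc length along the longer of the two curves.
    static double meanDifference(double[] x0, double[] y0,
                                 double[] x1, double[] y1, double step) {
        double[] c0 = cumulativeLength(x0, y0);
        double[] c1 = cumulativeLength(x1, y1);
        double longer = Math.max(c0[c0.length - 1], c1[c1.length - 1]);
        double sum = 0.0;
        int samples = 0;
        for (double l = 0.0; l <= longer; l += step) {
            double[] p = pointAt(x0, y0, c0, l);
            double[] q = pointAt(x1, y1, c1, l);
            sum += Math.hypot(p[0] - q[0], p[1] - q[1]);
            samples++;
        }
        return sum / samples;
    }
}
```

A caller would compare the returned value against a threshold chosen for the data at hand; two identical curves give 0.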
You should try this with a +180deg rotation as well, because the orientation difference obtained from the OBB covers only half of the true range.
Here are a few related QAs:
compare shapes
How can i produce multi point linear interpolation?

Using random function in selecting an object if two same distance values

I have an ArrayList unsolvedOutlets containing object Outlet that has attributes longitude and latitude.
Using the longitude and latitude of the Outlet objects in ArrayList unsolvedOutlets, I need to find the smallest distance in that list using the distance formula SQRT((X2 - X1)^2 + (Y2 - Y1)^2), wherein (X1, Y1) are given. I use Collections.min(list) to find the smallest distance.
My problem is if there are two or more values with the same smallest distance, I'd have to randomly select one from them.
Code:
ArrayList<Double> distances = new ArrayList<Double>();
Double smallestDistance = 0.0;
for (int i = 0; i < unsolvedOutlets.size(); i++) {
distances.add(Math.sqrt(
(unsolvedOutlets.get(i).getLatitude() - currSolved.getLatitude())*
(unsolvedOutlets.get(i).getLatitude() - currSolved.getLatitude())+
(unsolvedOutlets.get(i).getLongitude() - currSolved.getLongitude())*
(unsolvedOutlets.get(i).getLongitude() - currSolved.getLongitude())));
distances.add(0.0); //added this to test
distances.add(0.0); //added this to test
smallestDistance = Collections.min(distances);
System.out.println(smallestDistance);
}
The console prints 0.0, but it won't stop. Is there a way to know whether there are multiple values with the same smallest value? Then I'd incorporate the Random function. Did that make sense? lol, but if anyone has the logic for that, it would be really helpful!!
Thank you!

Keep track of the indices with min distance in your loop and after the loop choose one at random:
Random random = ...
...
List<Integer> minDistanceIndices = new ArrayList<>();
double smallestDistance = Double.POSITIVE_INFINITY; // start above any real distance so the first outlet becomes the minimum
for (int i = 0; i < unsolvedOutlets.size(); i++) {
double newDistance = Math.sqrt(
(unsolvedOutlets.get(i).getLatitude() - currSolved.getLatitude())*
(unsolvedOutlets.get(i).getLatitude() - currSolved.getLatitude())+
(unsolvedOutlets.get(i).getLongitude() - currSolved.getLongitude())*
(unsolvedOutlets.get(i).getLongitude() - currSolved.getLongitude()));
distances.add(newDistance);
if (newDistance < smallestDistance) {
minDistanceIndices.clear();
minDistanceIndices.add(i);
smallestDistance = newDistance;
} else if (newDistance == smallestDistance) {
minDistanceIndices.add(i);
}
}
if (!unsolvedOutlets.isEmpty()) {
int index = minDistanceIndices.get(random.nextInt(minDistanceIndices.size()));
Object chosenOutlet = unsolvedOutlets.get(index);
System.out.println("chosen outlet: "+ chosenOutlet);
}
As Jon Skeet mentioned you don't need to take the square root to compare the distances.
Also if you want to use distances on a sphere your formula is wrong:
With your formula you'll get the same distance for (0° N, 180° E) to (0° N, 0° E) as for (90° N, 180° E) to (90° N, 0° E), but while you need to travel around half the earth to travel from the first to the second, the last 2 coordinates both denote the north pole.
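For distances on a sphere, one common alternative is the haversine formula. The sketch below is my own illustration, not part of the question's code; it treats the Earth as a perfect sphere with a mean radius of 6371 km, which is an approximation, and the class and method names are made up.

```java
final class GreatCircle {
    // Mean Earth radius in kilometres (an approximation; the Earth is not a perfect sphere).
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in km between two points given in degrees (haversine formula).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                 + Math.cos(phi1) * Math.cos(phi2)
                 * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // Half way around the equator versus two descriptions of the north pole.
        System.out.println(haversineKm(0, 180, 0, 0));
        System.out.println(haversineKm(90, 180, 90, 0));
    }
}
```

With this formula the two pairs from the example above no longer get the same distance: the equator pair is half the circumference apart, while the two pole coordinates are practically zero apart.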

Note: I believe fabian's solution is superior to this, but I've kept it around to demonstrate that there are many different ways of implementing this...
I would probably:
Create a new type which contained the distance from the outlet as well as the outlet (or just the square of the distance), or use a generic Pair type for the same purpose
Map (using Stream.map) the original list to a list of these pairs
Order by the distance or square-of-distance
Look through the sorted list until you find a distance which isn't the same as the first one in the list
You then know how many - and which - outlets have the same distance.
Another option would be to simply shuffle the original collection, then sort the result by distance, then take the first element - that way even if multiple of them do have the same distance, you'll be taking a random one of those.
JB Nizet's option of "find the minimum, then perform a second scan to find all those with that distance" would be fine too - and quite possibly simpler :) Lots of options...
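The shuffle-based option can be sketched like this. The Outlet class here is a hypothetical stand-in for the one in the question, and taking the minimum directly replaces the full sort, which is a small simplification of the option described above. It relies on the minimum being picked by position among equal elements, so that a uniform shuffle makes every tied outlet equally likely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

final class RandomNearest {
    // Hypothetical stand-in for the Outlet class from the question.
    static final class Outlet {
        final String name;
        final double latitude;
        final double longitude;

        Outlet(String name, double latitude, double longitude) {
            this.name = name;
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    // Squared distance is enough for comparisons, so no square root is taken.
    static double squaredDistance(Outlet a, Outlet b) {
        double dLat = a.latitude - b.latitude;
        double dLon = a.longitude - b.longitude;
        return dLat * dLat + dLon * dLon;
    }

    // Shuffle a copy, then take the minimum: ties are broken by the random order.
    // Throws NoSuchElementException if the list is empty.
    static Outlet pickNearest(List<Outlet> outlets, Outlet from, Random random) {
        List<Outlet> shuffled = new ArrayList<>(outlets);
        Collections.shuffle(shuffled, random);
        return Collections.min(shuffled,
                Comparator.comparingDouble((Outlet o) -> squaredDistance(o, from)));
    }
}
```

Passing the Random in makes the behaviour reproducible in tests by using a fixed seed.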

Most valuable plot in a 2D array?

This is the problem:
"A 2d array of ints will be used to represent the value of each block in a city. The value could be negative indicating the block is a liability to own. Complete a method that finds the value of the most valuable contiguous sub rectangle in the city represented by the 2d array. The sub rectangle must be at least 1 by 1. (If all the values are negative "the most valuable" rectangle would be the negative value closest to 0.)
Consider the following example. The 2d array of ints has 6 rows and 5 columns per row, representing an area of the city. The cells with the square around it represent the most valuable contiguous sub rectangle in the given array. (Value of 15.)"
I am completely stumped as to how to go about solving this. I'm thinking that I could start on every single value and make every possible subplot with it and update a variable for the highest value. Is there another way of going about doing this? I'm not looking for the answer, I just need some guidance. Thanks
int most=-10000;
int current=0;
for(int i=0;i<city.length;i++){
for(int j=0;j<city.length;j++){
current+=city[i][j];
if(current>most){
most=current;
}
}
}
return most;
This is my attempt so far. Hopefully you guys can see where I'm going with it. I start at 0,0 and check the entire line and update most accordingly.

The algorithm is to explore all rectangular shapes, and scan the city for that shape. The maximum value is found in a particular shape in a particular part of the city.
Algorithm (assume the city is NxM):
Set MAX = Lowest value in the city
// ROW / COL represent the shape of the rectangle
for ROW = 1 to N
for COL = 1 to M
// scan the city for a shape the size of ROWxCOL
for POS_X = 0 to N-ROW
for POS_Y = 0 to M-COL
// You now have a top,left co-ordinate for the shape (POS_X,POS_Y)
// This represents the position in the city[][] array
SUM Values from co-ordinate POS_X,POS_Y to POS_X+ROW-1, POS_Y+COL-1
IF SUM>MAX; MAX=SUM
PRINT MAX
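A direct Java translation of the pseudocode above might look like the sketch below. It assumes a non-empty rectangular array; the class and method names are made up, and the sample data is mine because the example grid from the problem statement is not reproduced here. It is a plain brute-force scan that favours clarity over speed.

```java
final class MostValuablePlot {
    // Tries every rectangle shape at every position and returns the largest sum found.
    static int maxRectangleSum(int[][] city) {
        int n = city.length;
        int m = city[0].length;
        int max = Integer.MIN_VALUE;
        for (int rows = 1; rows <= n; rows++) {
            for (int cols = 1; cols <= m; cols++) {
                // Slide a rows x cols window over every valid top-left corner.
                for (int top = 0; top + rows <= n; top++) {
                    for (int left = 0; left + cols <= m; left++) {
                        int sum = 0;
                        for (int r = top; r < top + rows; r++) {
                            for (int c = left; c < left + cols; c++) {
                                sum += city[r][c];
                            }
                        }
                        max = Math.max(max, sum);
                    }
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[][] city = { {1, 2, -1}, {-3, 4, 5}, {-2, -1, -4} };
        System.out.println(maxRectangleSum(city)); // prints 10 (rows 0-1, columns 1-2)
    }
}
```

Starting max at Integer.MIN_VALUE covers the all-negative case, where the answer is the single cell closest to 0.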

Calculate the distance between points with large xy values

I'm trying to compare the distance between Point 1 and Point 2 with the distance between Point 1 and Point 3. And I'm trying to find the smaller one. The only problem is that the xy values of all three points are rather large and using the distance formula will likely cause an overflow. Is there another way to find the distances?

Scale the values by a constant, calculate the distance, then "unscale" the values. For example, divide your values by 10^6, or 10^9, or whatever it takes, then calculate the scaled distance and then convert back using your scale constant.

Math.hypot() may be useful in this context, as it returns sqrt(x^2 + y^2) "without intermediate overflow or underflow."
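As a small illustration of the difference, the sketch below compares the naive formula with Math.hypot for coordinates chosen (by me, purely for demonstration) to be large enough that squaring them overflows a double. The class and method names are made up.

```java
final class HypotDemo {
    // Naive formula: the squares can overflow to Infinity before the square root is taken.
    static double naiveDistance(double x1, double y1, double x2, double y2) {
        double dx = x1 - x2;
        double dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Math.hypot computes the same quantity without intermediate overflow.
    static double hypotDistance(double x1, double y1, double x2, double y2) {
        return Math.hypot(x1 - x2, y1 - y2);
    }

    public static void main(String[] args) {
        double big = 1e200; // deliberately extreme so that big * big exceeds Double.MAX_VALUE
        System.out.println(naiveDistance(0, 0, 3 * big, 4 * big)); // Infinity
        System.out.println(hypotDistance(0, 0, 3 * big, 4 * big)); // finite, about 5e200
    }
}
```

To find the nearer of two points, compute both distances with hypotDistance and compare them.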

It is the fastest solution:
double dx12=x1-x2;
double dy12=y1-y2;
double dx13=x1-x3;
double dy13=y1-y3;
double r12sq=dx12*dx12+dy12*dy12;
double r13sq=dx13*dx13+dy13*dy13;
double minR= r12sq>r13sq ? Math.sqrt(r13sq) : Math.sqrt(r12sq);
You need to take only one sqrt: the one for the shortest distance.
Normalization by some fixed constant is pointless for doubles.
If you use integers instead of doubles, normalizing and centering the coordinates by some fixed constant could be useful for some distances and bad for others. For example, dividing by 1000 is fine for coordinates whose differences are in the billions, but for differences in the hundreds it destroys the result. So you can choose a useful normalization coefficient only after you have the intermediate dx and dy. Let us say you need 4 digits of precision:
int dx12=x1-x2;
int dy12=y1-y2;
int dx13=x1-x3;
int dy13=y1-y3;
int d=(Math.abs(dx12) +Math.abs(dx13) + Math.abs(dy12) + Math.abs(dy13));
int coeff = d/10000;
if(coeff<1) coeff=1;
// reuse the existing variables; declaring them again would not compile
dx12=dx12/coeff;
dy12=dy12/coeff;
dx13=dx13/coeff;
dy13=dy13/coeff;
int r12sq=dx12*dx12+dy12*dy12;
int r13sq=dx13*dx13+dy13*dy13;
// scaled distance; multiply by coeff to get back to the original units
double minR= r12sq>r13sq ? Math.sqrt(r13sq) : Math.sqrt(r12sq);
Here you can multiply these int variables without overflow.

java cosine similarity problem

I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It worked very well. But there is one problem.... :(
For example:
If I have the following two matrices and I want to calculate cosine similarity, it does not work because the rows are not the same length:
doc 1
1 2 3
4 5 6
doc 2
1 2 3 4 5 6
7 8 5 2 4 9
If the rows and columns are the same length then my program works very well, but it does not if they are not the same length.
Any tips?

I'm not sure of your implementation, but the cosine similarity of two vectors is equal to the normalized dot product of those vectors.
The dot product can be expressed as a . b = a^T b. As a result, if the matrices have different lengths you can't take the dot product to compute the cosine.
Now in a standard TF*IDF approach the entries in your matrix should be indexed by (term, document); as a result, any terms not appearing in a document should appear as zeroes in your matrix.
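As a minimal sketch of that idea, the Java code below builds two vectors of equal length over the shared vocabulary of two documents, filling in zero for any term a document does not contain. It uses raw term counts only (no IDF weighting) to keep the example short; the class name, method names and sample documents are made up for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class TermVectors {
    // Returns two term-count vectors indexed by the sorted union of both vocabularies.
    // A term missing from a document gets the value 0, so both vectors have equal length.
    static double[][] align(List<String> doc1, List<String> doc2) {
        SortedSet<String> vocabulary = new TreeSet<>(doc1);
        vocabulary.addAll(doc2);
        double[][] vectors = new double[2][vocabulary.size()];
        int i = 0;
        for (String term : vocabulary) {
            vectors[0][i] = Collections.frequency(doc1, term);
            vectors[1][i] = Collections.frequency(doc2, term);
            i++;
        }
        return vectors;
    }

    // Cosine of the angle between two vectors of equal length (normalized dot product).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

For the documents ["a", "b", "b"] and ["b", "c"], the shared vocabulary is a, b, c, so the aligned vectors are [1, 2, 0] and [0, 1, 1], and the dot product is well defined even though the documents differ in length.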
Now the way you have it set up seems to suggest there are two different matrices for your two documents. I'm not sure if this is your intent, but it seems incorrect.
On the other hand if one of your matrices is supposed to be your query, then it should be a vector and not a matrix, so that the transpose produces the correct result.
A full explanation of TF*IDF follows:
OK, in classic TF*IDF you construct a term-document matrix a. Each value in matrix a is denoted a_{i,j}, where i is the term and j is the document. This value is a combination of local, global and normalized weights (although if you normalize your documents, the normalized weight should be 1). Thus a_{i,j} = f_{i,j} * D / d_i, where f_{i,j} is the frequency of word i in doc j, D is the document size, and d_i is the number of documents with term i in them.
Your query is a vector of terms designated as b. Each entry b_{i,q} refers to term i for query q, with b_{i,q} = f_{i,q}, where f_{i,q} is the frequency of term i in query q. In this case each query is a vector, and multiple queries form a matrix.
We can then calculate the unit vectors of each so that when we take the dot product it will produce the correct cosine. To obtain the unit vectors we divide both the matrix a and the query b by their Frobenius norm.
Finally we can compute the cosine by taking the transpose of the vector b for a given query. Thus one query (or vector) per calculation. This is denoted b^T a. The final result is a vector with a score for each document, where a higher score denotes a higher document rank.

simple java cosine similarity
import java.util.Map;
import java.util.Set;
import com.google.common.collect.Sets; // Guava

static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // keep only the keys present in both vectors (removeAll would leave none of them)
    Set<String> both = Sets.newHashSet(v1.keySet());
    both.retainAll(v2.keySet());
    double scalar = 0, norm1 = 0, norm2 = 0;
    for (String k : both) scalar += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return scalar / Math.sqrt(norm1 * norm2);
}
