I have two vectors represented as a HashMap and I want to measure the similarity between them. I use the cosine similarity metric as in the following code:
public static void cosineSimilarity(HashMap<Integer, Double> vector1, HashMap<Integer, Double> vector2) {
    double scalar = 0.0d, v1Norm = 0.0d, v2Norm = 0.0d;
    for (int featureId : vector1.keySet()) {
        scalar += vector1.get(featureId) * vector2.get(featureId);
        v1Norm += vector1.get(featureId) * vector1.get(featureId);
        v2Norm += vector2.get(featureId) * vector2.get(featureId);
    }
    v1Norm = Math.sqrt(v1Norm);
    v2Norm = Math.sqrt(v2Norm);
    double cosine = scalar / (v1Norm * v2Norm);
    System.out.println("v1 is: " + v1Norm + " , v2 is: " + v2Norm + " Cosine is: " + cosine);
}
Strangely, two vectors that are supposed to be dissimilar give a result close to 0.9999, which is just wrong!
Please note that the keys are exactly the same for both maps.
data file is here: file
File format:
FeatureId vector1_value vector2_value
Your code is fine.
The vectors are dominated by several large features. In those features, the two vectors are almost collinear, which is why the similarity measure is close to 1.
I include the six largest features below. Look at the ratio of vec2 over vec1: it's almost identical across those features.
feature vec1 vec2 vec2/vec1
64806110 2875 1.85E+07 6.43E+03
64806108 5750 3.68E+07 6.40E+03
64806107 8625 5.49E+07 6.37E+03
64806106 11500 7.29E+07 6.34E+03
64806111 14375 9.07E+07 6.31E+03
64806109 17250 1.08E+08 6.28E+03
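As a quick check, here is a small sketch that computes the cosine over just these six features, with the values hard-coded from the table above. The dominance of these nearly collinear features already pushes the similarity above 0.999:

```java
// Sketch: cosine similarity of the six dominant features from the table above.
public class CosineCheck {

    static double cosine(double[] v1, double[] v2) {
        double dot = 0, n1 = 0, n2 = 0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            n1 += v1[i] * v1[i];
            n2 += v2[i] * v2[i];
        }
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    public static void main(String[] args) {
        // Values taken from the six largest features listed above.
        double[] vec1 = {2875, 5750, 8625, 11500, 14375, 17250};
        double[] vec2 = {1.85e7, 3.68e7, 5.49e7, 7.29e7, 9.07e7, 1.08e8};
        System.out.println(cosine(vec1, vec2));  // ~0.9999: the ratios are nearly constant
    }
}
```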
My goal is to visualize a model classifying an image. For the visualization I need the raw activations / outputs of each layer. Is there a way to access these when predicting? Furthermore, it would be very helpful if there were a way to access the weights, but this is optional.
The models to visualize are built dynamically and will be used to classify images of the MNIST and EMNIST data sets.
model.summary() of an exemplary model:
=======================================================================
LayerName (LayerType) nIn,nOut TotalParams ParamsShape
=======================================================================
layer0 (DenseLayer) 784,200 157.000 W:{784,200}, b:{1,200}
layer1 (DenseLayer) 200,100 20.100 W:{200,100}, b:{1,100}
layer2 (OutputLayer) 100,10 1.010 W:{100,10}, b:{1,10}
-----------------------------------------------------------------------
Total Parameters: 178.110
Trainable Parameters: 178.110
Frozen Parameters: 0
=======================================================================
The code for image classification:
INDArray reshaped = reshapeImage(image);
int predictedIndex = model.predict(reshaped)[0];
double conf = model.output(reshaped).getDouble(predictedIndex);
If you need more information / code snippets, please let me know.
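If the model is a Deeplearning4j MultiLayerNetwork (the summary above looks like DL4J output), one possible sketch uses feedForward(...), which returns the activations of every layer, and getParam(...) for the weights. The parameter keys follow DL4J's "layerIndex_paramName" convention; the variable names and shapes below are assumptions based on the question's summary and snippet:

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import java.util.Arrays;
import java.util.List;

// Sketch: dump per-layer activations and weights for one input.
// Assumes 'model' is the MultiLayerNetwork from the question and
// 'reshaped' is the flattened 1x784 MNIST image.
List<INDArray> activations = model.feedForward(reshaped);
// activations.get(0) is the input itself; activations.get(i) is the
// output of layer i-1, so the last entry holds the softmax output.
for (int i = 0; i < activations.size(); i++) {
    System.out.println("Activation " + i + " shape: "
            + Arrays.toString(activations.get(i).shape()));
}

// Weights and biases, keyed as "<layerIndex>_<paramName>":
INDArray w0 = model.getParam("0_W");  // 784x200 weight matrix of layer0
INDArray b0 = model.getParam("0_b");  // 1x200 bias row of layer0
```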
I have a program that calculates the area of a polygon in metres squared and I would like to convert it to other units (as the user wants) using the javax.measure library.
Measure<Double, Area> a = Measure.valueOf(area, SI.SQUARE_METRE);
So if I want hectares I can use:
a.doubleValue(NonSI.HECTARE);
but the only other Area quantity is Are.
While I can easily divide by 1000 * 1000 to get square kilometres, it gets messier when I try to get acres, square miles, or other common areal units.
A code snippet like NonSI.MILE.divide(8.0).times... tells me you are using the old, unfinished JSR 275 and implementations like JScience 4 in your solution. Any reason to prefer that over the official, finished javax.measure JSR 363?
All of the above works, with a few variations, perfectly fine in JSR 363 (e.g. UnitFormat is an API interface now; the JSR 363 code would be SimpleUnitFormat.getInstance().label(acre, "acre");). Unlike 275, there is also a vast infrastructure of extension modules for the SI system and other unit systems, and a couple of major open-source projects already use the new standard. Give it a try.
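For illustration, a hedged sketch of what the JSR 363 version might look like, using the reference implementation (tec.units.ri) and the systems-uom common module for US customary units; package names differ between versions, so treat these imports as assumptions:

```java
import javax.measure.Quantity;
import javax.measure.quantity.Area;
import tec.units.ri.quantity.Quantities;
import tec.units.ri.unit.Units;
import systems.uom.common.USCustomary;

// Sketch: the area value is the example output from later in this post.
Quantity<Area> a = Quantities.getQuantity(4.872007325925411E11, Units.SQUARE_METRE);
// ACRE is predefined here, so no manual unit labeling is needed.
System.out.println(a.to(USCustomary.ACRE));
```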
After some investigation and experimentation I can generate sq miles and sq kilometres as so:
Unit<Area> sq_km = (Unit<Area>) SI.KILOMETER.times(SI.KILOMETER);
System.out.println(a.to(sq_km));
Unit<Area> sq_mile = (Unit<Area>) NonSI.MILE.times(NonSI.MILE);
System.out.println(a.to(sq_mile));
System.out.println(a.to(NonSI.HECTARE));
Which gives me the output:
4.872007325925411E11 m²
487200.7325925411 km²
188109.25449744106 mi²
4.8720073259254105E7 ha
But acres are escaping me; according to Wikipedia, an acre is 1 furlong times 66 ft. So I tried:
Unit<Area> acre = (Unit<Area>) NonSI.MILE.divide(8.0).times(NonSI.FOOT).times(66.0);
System.out.println(a.to(acre));
which gives the right answer but labels the unit as m²*4046.8564224.
Edit
So further experimentation gives me:
Unit<Area> acre = (Unit<Area>) NonSI.MILE.divide(8.0).times(NonSI.FOOT).times(66.0);
UnitFormat.getInstance().label(acre, "acre");
and the output (for a different polygon than before):
2.6529660563942477E7 acre
Further Update
GeoTools now uses JSR-363 units so the above becomes:
Unit<Area> sq_km = (Unit<Area>) MetricPrefix.KILO(SI.METRE).multiply(MetricPrefix.KILO(SI.METRE));
System.out.println(a.to(sq_km));
System.out.println(pop.divide(a.to(sq_km)));
Unit<Area> sq_mile = (Unit<Area>) USCustomary.MILE.multiply(USCustomary.MILE);
System.out.println(a.to(sq_mile));
System.out.println(a.to(NonSI.HECTARE));
System.out.println(a.to(USCustomary.ACRE).getValue() + " acres");
So acres are in, but for some reason the unit label isn't defined in the java8 jar (it is in master).
I am using the Apache Commons lib to calculate the p-value with the ChiSquareTest:
I use the method chiSquareTest(double[] expected, long[] observed), but the values I get back don't make sense to me. So I tried numerous chi-square online calculators to find out what this function calculates.
An example:
Group 1: {25,25}
Group 2: {30,20}
(Taken from Wikipedia, German Chi Square Test article)
P-values from http://www.quantpsy.org/chisq/chisq.htm and http://vassarstats.net/newcs.html:
P = 0.3149 and 0.31490284 (without Yates correction)
P = 0.42154642 and 0.4201 (with Yates correction)
Apache Commons: 0.1489146731787664
Code:
ChiSquareTest tester = new ChiSquareTest();
long[] b = {25,25};
double[] a = {30,20};
tester.chiSquareTest(a,b);
Another thing I do not understand is the need to have a long and a double array. Why not two long arrays?
There are two functions in the lib:
chiSquareTest(double[] expected, long[] observed)
chiSquareTest(long[][] values)
The first one (which I used in the question above) computes a goodness-of-fit test, but what I expected is the result of the second one, the test of independence.
The answer was given to me on the Apache Commons user Mailinglist, I will add a link to the archive once it is there. But it is also written in the JavaDoc.
Update:
Mailinglist Archive
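To see where the 0.3149 comes from, here is a minimal pure-Java sketch of the independence test for the 2x2 table, matching what chiSquareTest(long[][] values) computes. It uses the Abramowitz-Stegun 7.1.26 approximation of erf; for one degree of freedom the p-value is erfc(sqrt(chi2/2)):

```java
// Minimal sketch of the chi-square test of independence for the
// 2x2 table {25,25} vs {30,20} from the question.
public class ChiSquareDemo {

    // Abramowitz-Stegun 7.1.26 approximation of erf for x >= 0, |error| < 1.5e-7.
    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        return 1.0 - poly * Math.exp(-x * x);
    }

    // p-value of a chi-square independence test on a 2x2 table (1 degree of freedom).
    static double pValue(long[][] counts) {
        double[] rowSum = new double[2], colSum = new double[2];
        double total = 0;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        double chi2 = 0;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                double d = counts[i][j] - expected;
                chi2 += d * d / expected;
            }
        // For 1 degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
        return 1.0 - erf(Math.sqrt(chi2 / 2.0));
    }

    public static void main(String[] args) {
        long[][] groups = {{25, 25}, {30, 20}};
        System.out.println(pValue(groups));  // ~0.3149, as the online calculators report
    }
}
```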
For my App I need compact code for converting between LatLon (WGS84) and MGRS.
JCoord.jar:
Looks great, but the version 1.1 jar is 0.5 MB in size. That doubles my app's size just for performing a two-way conversion of coordinates.
Openmap:
Isolating just the MGRSPoint.java (https://code.google.com/p/openmap/source/browse/src/openmap/com/bbn/openmap/proj/coords/MGRSPoint.java) from the rest is not easy.
GeographicLib:
This seems a good solution, but I could not find a Java source file.
Is it available for usage?
NASA:
The NASA code looks great; see http://worldwind31.arc.nasa.gov/svn/trunk/WorldWind/src/gov/nasa/worldwind/geom/coords/MGRSCoordConverter.java. But isolating just the MGRS conversion code was not easy.
GDAL:
Was implemented in another programming language.
IBM (via j-coordconvert.zip):
Is compact and suits the UTM conversion well, but the MGRS conversion is reported to be erroneous. Alas.
Is there a good (compact) Java source for converting between LatLon/wgs84 and MGRS?
Finally found a sufficiently good answer. Berico, thank you!
https://github.com/Berico-Technologies/Geo-Coordinate-Conversion-Java
This source code isolates the NASA Java source code and adds 1 nice utility class.
Examples:
double lat = 52.202050;
double lon = 6.102050;
System.out.println( "To MGRS is " + Coordinates.mgrsFromLatLon( lat, lon));
And the other way around:
String mgrs = "31UCU 59248 14149";
double[] latlon = Coordinates.latLonFromMgrs( mgrs);
I am getting wrong eigenvectors (also checked by running multiple times to be sure) when I use matrix.eig(). The matrix is:
1.2290 1.2168 2.8760 2.6370 2.2949 2.6402
1.2168 0.9476 2.5179 2.1737 1.9795 2.2828
2.8760 2.5179 8.8114 8.6530 7.3910 8.1058
2.6370 2.1737 8.6530 7.6366 6.9503 7.6743
2.2949 1.9795 7.3910 6.9503 6.2722 7.3441
2.6402 2.2828 8.1058 7.6743 7.3441 7.6870
The function returns the eigenvectors:
-0.1698 0.6764 0.1442 -0.6929 -0.1069 0.0365
-0.1460 0.6478 0.1926 0.6898 0.0483 -0.2094
-0.5239 0.0780 -0.5236 0.1621 -0.2244 0.6072
-0.4906 -0.0758 -0.4573 -0.1279 0.2842 -0.6688
-0.4428 -0.2770 0.4307 0.0226 -0.6959 -0.2383
-0.4884 -0.1852 0.5228 -0.0312 0.6089 0.2865
Matlab gives the following eigenvectors for the same input:
0.1698 -0.6762 -0.1439 0.6931 0.1069 0.0365
0.1460 -0.6481 -0.1926 -0.6895 -0.0483 -0.2094
0.5237 -0.0780 0.5233 -0.1622 0.2238 0.6077
0.4907 0.0758 0.4577 0.1278 -0.2840 -0.6686
0.4425 0.2766 -0.4298 -0.0227 0.6968 -0.2384
0.4888 0.1854 -0.5236 0.0313 -0.6082 0.2857
The eigenvalues from Matlab and Jama match, but in the eigenvectors the first five columns are reversed in sign and only the last column agrees.
Is there any restriction on the kind of input that Jama.Matrix.EigenvalueDecomposition.eig()
accepts, or some other problem? Please tell me how I can fix the error. Thanks in advance.
There is no error here; both results are correct, as is any other scalar multiple of the eigenvectors.
There are infinitely many eigenvectors that work; it's just convention that most software reports vectors of length one. That Jama reports eigenvectors equal to -1 times those of Matlab is probably just an artifact of the algorithm used.
For a given matrix, the eigenvalues are unique; counted with multiplicity, there are as many as the dimension of the matrix. The corresponding eigenvectors, however, can differ between implementations, because each eigenvector can be scaled (including by -1) along its direction. In your posted results, both the Java and Matlab versions are correct.
Also, you could check the D matrix, which holds the eigenvalues. You will find they are the same.
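A quick sketch illustrating the point with a hand-picked symmetric matrix: for A = {{2,1},{1,2}}, v = (1,1)/sqrt(2) is a unit eigenvector for eigenvalue 3, and so is -v, since A(-v) = -(Av) = 3(-v). Both leave a zero residual:

```java
// Sketch: both v and -v satisfy A v = lambda v for the same eigenvalue.
public class EigenSignDemo {

    // Residual ||A v - lambda v|| for a 2x2 matrix.
    static double residual(double[][] a, double[] v, double lambda) {
        double r0 = a[0][0] * v[0] + a[0][1] * v[1] - lambda * v[0];
        double r1 = a[1][0] * v[0] + a[1][1] * v[1] - lambda * v[1];
        return Math.hypot(r0, r1);
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 2}};
        double s = 1.0 / Math.sqrt(2);
        double[] v = {s, s};          // unit eigenvector for lambda = 3
        double[] minusV = {-s, -s};   // sign-flipped copy
        System.out.println(residual(a, v, 3.0));      // ~0
        System.out.println(residual(a, minusV, 3.0)); // ~0 as well
    }
}
```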