Compute covariance matrix using Nd4j - java

Given a 2 dimensional matrix, I'd like to compute the corresponding covariance matrix.
Are there any methods included with Nd4j that would facilitate this operation?
For example, the covariance matrix computed from the following matrix
1 2
8 12
constructed using Nd4j here:
INDArray array1 = Nd4j.zeros(2, 2);
array1.putScalar(0, 0, 1);
array1.putScalar(0, 1, 2);
array1.putScalar(1, 0, 8);
array1.putScalar(1, 1, 12);
should be
24.5 35.0
35.0 50.0
This can easily be done using the cov method of pandas' DataFrame, like so:
>>> pandas.DataFrame([[1, 2],[8, 12]]).cov()
0 1
0 24.5 35.0
1 35.0 50.0
Is there any way of doing this using Nd4j?

I hope you already found a solution. For those facing the same problem, here is a method in ND4J that computes a covariance matrix:
/**
 * Returns the covariance matrix of a data set of many records, each with N features.
 * It also returns the average values, which are usually going to be important since in this
 * version, all modes are centered around the mean. It's a matrix that has elements that are
 * expressed as average dx_i * dx_j (used in procedure) or average x_i * x_j - average x_i * average x_j
 *
 * @param in A matrix of vectors of fixed length N (N features) on each row
 * @return INDArray[2], an N x N covariance matrix is element 0, and the average values is element 1.
 */
public static INDArray[] covarianceMatrix(INDArray in)
GitHub source
This method is found in the org.nd4j.linalg.dimensionalityreduction.PCA class.
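For reference, here is a minimal sketch of calling it on the matrix from the question. This assumes an ND4J version where PCA.covarianceMatrix is public, as in the source quoted above; note that the normalization (n versus n - 1) may differ from pandas' cov, so verify the output against the expected values.
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dimensionalityreduction.PCA;
import org.nd4j.linalg.factory.Nd4j;

public class CovarianceExample {
    public static void main(String[] args) {
        // Same 2 x 2 input as in the question
        INDArray array1 = Nd4j.create(new double[][] {{1, 2}, {8, 12}});
        INDArray[] result = PCA.covarianceMatrix(array1);
        System.out.println(result[0]); // element 0: the covariance matrix
        System.out.println(result[1]); // element 1: the per-feature means
    }
}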


Get random number with larger numbers increasingly unlikely [closed]

How can I get a random number in the range k to h such that the closer a number is to h the more unlikely it will come up?
I'm going to need a number between 20 and 1980.
I've tried some things in Eclipse; here are the results.
import java.util.Random;

interface Generator {
    double generate(double low, double high);
}

abstract class AbstractGenerator implements Generator {
    protected final Random rand;

    public AbstractGenerator() {
        rand = new Random();
    }

    public AbstractGenerator(long seed) {
        rand = new Random(seed);
    }
}
Now results for various generator implementations:
I generated 100k numbers on a scale of 0 to 9; they are shown as bars below.
Catan 2 (add two dice)
class Catan2 extends AbstractGenerator {
    @Override
    public double generate(double low, double high) {
        return low + (high - low) * Math.abs(-1 + (rand.nextDouble() + rand.nextDouble()));
    }
}
Results:
0 : *******************
1 : ******************
2 : ****************
3 : **************
4 : ************
5 : *********
6 : *******
7 : *****
8 : ***
9 : *
Catan 3 (add three dice)
class Catan3 extends AbstractGenerator {
    @Override
    public double generate(double low, double high) {
        return low + (high - low) * Math.abs(-1.5 + (rand.nextDouble() + rand.nextDouble() + rand.nextDouble())) / 1.5;
    }
}
Results:
0 : ***********************
1 : *********************
2 : *******************
3 : ***************
4 : ***********
5 : *******
6 : *****
7 : ***
8 : *
9 : *
Catan 4 (add four dice)
class Catan4 extends AbstractGenerator {
    @Override
    public double generate(double low, double high) {
        return low + (high - low) * Math.abs(-2 + (rand.nextDouble() + rand.nextDouble() + rand.nextDouble() + rand.nextDouble())) / 2D;
    }
}
Results:
0 : ***************************
1 : ************************
2 : ********************
3 : **************
4 : *********
5 : *****
6 : ***
7 : *
8 : *
9 : *
I think "Catan 3" is the best of those.
Formula being: low+(high-low)*abs(-1.5+(RAND+RAND+RAND))/1.5
Basically, I get a "hill" distribution, then I center it and take its absolute value. Then I scale it to the desired range.
And yet another option. There are standard methods to produce random numbers on a Gaussian distribution. Set up a Gaussian RNG with an average of k and a standard deviation of h/5. Reject any number below k (about half the numbers generated) and reject all numbers greater than h (5% or less).
You can tweak the standard deviation if you want to optimise the results. Effectively this is a half-Gaussian RNG with a truncated tail, so the distribution is not uniform; you will get more numbers close to k than to h.
ETA: Thanks to @MightyPork's comment, which got me thinking. A Gaussian distribution is symmetric, so there is no need to throw away any raw values less than k. Just shift them from below k to the same distance above k:
if (raw < k)
    raw <- k + (k - raw)
end if
Values above h will still need to be rejected.
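A minimal Java sketch of this reflect-and-reject idea; the standard deviation h/5 follows the suggestion above and is a tunable choice:
import java.util.Random;

// Gaussian centered on k with sd h/5; values below k are reflected
// back above k, and values above h are rejected and redrawn.
static double lowBiasedGaussian(Random rand, double k, double h) {
    while (true) {
        double raw = k + rand.nextGaussian() * (h / 5.0);
        if (raw < k) {
            raw = k + (k - raw); // reflect instead of rejecting
        }
        if (raw <= h) {
            return raw;          // reject only the truncated upper tail
        }
    }
}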
Say our range is [0,4], create an array like this:
[000001111222334]
Now use a standard Random object to draw from the array. By doing this, we have gone from drawing from a uniform distribution to a distribution of our own design. In reality, we're not going to want to use an auxiliary array. You can do the following in lieu of an auxiliary array:
Draw from [0,14]; map [0,4] to 0, [5,8] to 1, [9,11] to 2, [12,13] to 3 and [14] to 4.
It really depends on what your distribution looks like. You can approximate drawing from a non-uniform distribution via drawing multiple times from uniform distributions over varying ranges. Of course, if you know the probability mass function or probability density function of your distribution, then you're golden.
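For illustration, here is a sketch of the [0,14] mapping described above, with the bucket boundaries taken from the answer:
import java.util.Random;

// Draw uniformly from 0..14, then bucket so smaller outputs are more
// likely: 5/15 of outcomes -> 0, 4/15 -> 1, 3/15 -> 2, 2/15 -> 3, 1/15 -> 4.
static int weightedDraw(Random rand) {
    int r = rand.nextInt(15);
    if (r <= 4)  return 0;
    if (r <= 8)  return 1;
    if (r <= 11) return 2;
    if (r <= 13) return 3;
    return 4;
}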
If you need good control over the distribution of numbers, then a good way to go is the method of inverses. Create a sorted table of (x,y) pairs where x and y both increase monotonically: x from 0 to 1 and y from the low to high value of pseudo-random numbers you need. The algorithm is:
x = uniform random float in [0..1)
Search the table to find (x[i],y[i]) such that x[i] <= x < x[i+1]
// Return linearly interpolated y value
return y[i] + (x - x[i]) / (x[i+1] - x[i]) * (y[i+1] - y[i])
You control the distribution of return values with the table entries.
If the table contains only (0,0) and (1,1), then obviously the return value is equal to x, and the distribution is uniform. To get more high numbers, describe a curve that increases more rapidly at the start and is flatter at the higher x values, say:
(0,0) (0.25,0.5) (1,1)
You should be able to see why this works. In the uniform distribution, half the numbers are between 0 and .5. With this table, only a quarter of the numbers are in that range, so the other three-quarters are in 0.5 to 1. The high numbers are more frequent as you require.
You can create as smooth a curve as you like and of any shape as long as it's monotonically increasing. If the table has more than a few pairs, consider binary search for speed.
For a range of 20 to 1980, the corresponding table would be something like:
(0, 20) (0.25, 1000) (1, 1980)
If you need integers, you'd want to use
(0, 20) (0.25, 1000) (1, 1981)
and then truncate the fraction from the result.
Again, you'd probably want more points in the table to make the ICDF smoother. This is for illustration.
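Here is a sketch of that table-driven lookup in Java, using the integer table above (TableGenerator is an illustrative name; a linear scan is used for brevity, and a binary search would be preferable for larger tables):
import java.util.Random;

// Method of inverses: x is uniform in [0,1); the table maps it through
// a piecewise-linear ICDF to the output range.
class TableGenerator {
    private final double[] xs = {0.0, 0.25, 1.0};        // cumulative probabilities
    private final double[] ys = {20.0, 1000.0, 1981.0};  // mapped output values
    private final Random rand = new Random();

    double next() {
        double x = rand.nextDouble();
        int i = 0;
        while (x >= xs[i + 1]) i++;  // find the bracketing segment
        // Linearly interpolate between (xs[i], ys[i]) and (xs[i+1], ys[i+1])
        return ys[i] + (x - xs[i]) / (xs[i + 1] - xs[i]) * (ys[i + 1] - ys[i]);
    }

    int nextInt() {
        return (int) next(); // truncate the fraction, as described above
    }
}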
The Math
The curve stored in the table is called the inverse cumulative distribution function (ICDF) of the returned pseudo-random numbers. A probability density function (PDF) is a non-negative function with an area under the curve of 1. Commonly used PDFs are uniform, exponential, and normal. The corresponding CDF is the running integral of the PDF. The ICDF is the inverse of the CDF. It's well known that to generate random numbers with any given PDF, you can find the ICDF and apply the algorithm above.

Find a sum equal or greater than given target using only numbers from set

Example 1:
Shop selling beer, available packages are 6 and 10 units per package. Customer inputs 26 and algorithm replies 26, because 26 = 10 + 10 + 6.
Example 2:
Selling spices; available packages are 0.6, 1.5 and 3. Target value = 5. The algorithm returns 5.1, because it is the nearest value above the target achievable with the packages (3, 1.5, 0.6).
I need a Java method that will suggest that number.
A similar algorithm is described under the bin packing problem, but it doesn't suit me.
I tried it, and when it returned a number smaller than the target I ran it again with an increased target. But that is not efficient when the number of packages is huge.
I need almost the same algorithm, but returning the nearest number equal to or greater than the target.
Similar question: Find if a number is a possible sum of two or more numbers in a given set - python.
First, let's reduce this problem to integers rather than real numbers; otherwise we won't get a fast optimal algorithm out of this. For example, multiply all the numbers by 100 and round to the nearest integer. So say we have item sizes x1, ..., xn and target size Y. We want to minimize the value
k1 x1 + ... + kn xn - Y
under the conditions
(1) ki is a non-negative integer for all n ≥ i ≥ 1
(2) k1 x1 + ... + kn xn - Y ≥ 0
One simple algorithm for this would be to ask a series of questions like
Can we achieve k1 x1 + ... + kn xn = Y + 0?
Can we achieve k1 x1 + ... + kn xn = Y + 1?
Can we achieve k1 x1 + ... + kn xn = Y + z?
etc. with increasing z
until we get the answer "Yes". All of these problems are instances of the Knapsack problem with the weights set equal to the values of the items. The good news is that we can solve all those at once, if we can establish an upper bound for z. It's easy to show that there is a solution with z ≤ Y, unless all the xi are larger than Y, in which case the solution is just to pick the smallest xi.
So let's use the pseudopolynomial dynamic programming approach to solve Knapsack: Let f(i,j) be 1 iff we can reach total item size j using only the first i item types (x1, ..., xi). We have the recurrence
f(0,0) = 1
f(0,j) = 0 for all j > 0
f(i,j) = f(i - 1, j) or f(i - 1, j - x_i) or f(i - 1, j - 2 * x_i) or ...
(equivalently, f(i,j) = f(i - 1, j) or f(i, j - x_i), since each item can be reused; this form is what gives the bound below)
We can fill this DP table in O(n * Y) time and O(Y) space. The result is the first j ≥ Y with f(n, j) = 1.
There are a few technical details that are left as an exercise to the reader:
How to implement this in Java
How to reconstruct the solution if needed. This can be done in O(n) time using the DP array (but then we need O(n * Y) space to remember the whole thing).
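As a starting point for the first exercise, here is a minimal sketch of the DP in Java, using a reachability array over totals up to 2Y, which is a safe bound by the argument above (the method name is illustrative):
// Returns the smallest achievable total >= target, for positive integer
// package sizes. Scale and round real-valued sizes first (e.g. by 100).
static int smallestSumAtLeast(int[] sizes, int target) {
    int smallest = Integer.MAX_VALUE;
    for (int s : sizes) smallest = Math.min(smallest, s);
    if (smallest >= target) return smallest; // every item alone meets the target

    int bound = 2 * target; // a solution exists with total <= 2 * target
    boolean[] reachable = new boolean[bound + 1];
    reachable[0] = true;
    for (int j = 1; j <= bound; j++) {
        for (int s : sizes) {
            if (j >= s && reachable[j - s]) {
                reachable[j] = true;
                break;
            }
        }
    }
    for (int j = target; j <= bound; j++) {
        if (reachable[j]) return j;
    }
    throw new IllegalStateException("unreachable given the bound above");
}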
You want to solve the integer programming problem: minimize c·t subject to c·t ≥ T and c ≥ 0, where T is your target weight, t is the vector of package weights, and c is a non-negative integer vector specifying how many of each package to purchase. You can either solve this with dynamic programming, as pointed out in another answer, or, if your weights and target weight are too large, use a general integer programming solver; these have been highly optimized over the years and give good performance in practice.

Implementation using linear congruential equation in java

I see an LCG implementation in Java's Random class, as shown below:
/*
 * This is a linear congruential pseudorandom number generator, as
 * defined by D. H. Lehmer and described by Donald E. Knuth in
 * <i>The Art of Computer Programming,</i> Volume 3:
 * <i>Seminumerical Algorithms</i>, section 3.2.1.
 *
 * @param bits random bits
 * @return the next pseudorandom value from this random number
 *         generator's sequence
 * @since 1.1
 */
protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
}
But the link below says that an LCG should be of the form x2 = (a * x1 + b) mod M:
https://math.stackexchange.com/questions/89185/what-does-linear-congruential-mean
The code above does not look like that form. Instead it uses & in place of the modulo operation, in this line:
nextseed = (oldseed * multiplier + addend) & mask;
Can somebody help me understand this approach of using & instead of the modulo operation?
Bitwise-ANDing with a mask which is of the form 2^n - 1 is the same as computing the number modulo 2^n: Any 1's higher up in the number are multiples of 2^n and so can be safely discarded. Note, however, that some multiplier/addend combinations work very poorly if you make the modulus a power of two (rather than a power of two minus one). That code is fine, but make sure it's appropriate for your constants.
This can be used if mask + 1 is a power of 2.
For instance, if you want to do modulo 4, you can write x & 3 instead of x % 4 to obtain the same result.
Note however that this requires that x be a positive number.
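A quick demonstration of the equivalence, including the 48-bit mask the Random source above relies on:
public class MaskDemo {
    public static void main(String[] args) {
        int x = 1234567;
        System.out.println(x % 4); // 3
        System.out.println(x & 3); // 3, because 3 = 4 - 1 = 2^2 - 1
        // The mask in java.util.Random is 2^48 - 1, so the & keeps the
        // low 48 bits of the product, i.e. computes mod 2^48.
        long mask = (1L << 48) - 1;
        System.out.println(Long.toHexString(mask)); // ffffffffffff
    }
}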

YCbCr at LipMap detection

I'm confused about converting RGB values to the YCbCr color space. I used these equations:
int R, G, B;
double Y = 0.299 * R + 0.587 * G + 0.114 * B;
double Cb = -0.168 * R - 0.3313 * G + 0.5 * B + 128;
double Cr = 0.5 * R - 0.4187 * G - 0.0813 * B + 128;
The expected YCbCr output is normalized to 0-255; I'm confused because one of my sources says it is normalized to the range 0-1.
That part is going well, but I am having a problem with the LipMap used to isolate/detect the lips in the face. I implemented this:
double LipMap = Cr*Cr*(Cr*Cr-n*(Cr/Cb))*(Cr*Cr-n*(Cr/Cb));
n should be in the range 0-255; the equation for n is: n = 0.95 * (summation(Cr*Cr) / summation(Cr/Cb))
but another sources says: n = 0.95*(((1/k)*summation(Cr*Cr))/((1/k)*summation(Cr/Cb)))
where k is equal to the number of pixels in the face image.
My sources say it should return a result in 0-255, but in my program it always returns large numbers, never within 0-255.
So can anyone help me implement this and solve my problem?
From the sources you linked in your comments, it looks like either the equations or the descriptions in the first source are wrong:
If you use RGB values in the range [0, 255] and the given conversion (your Cb conversion differs from it, by the way), you should get Cr and Cb values in the same range.
Now if you calculate n = 0.95 * (ΣCr² / Σ(Cr/Cb)), you'll notice that the values of Cr² range over [0, 65025] whereas Cr/Cb is in the range [0, 255] (assuming Cb = 0 is not possible, so the highest value would be 255/1 = 255).
If you further assume an image with quite high red and low blue components, you'll get way higher values for n than what is stated in that paper:
Constant η fits final value in range 0..255
The second paper states this, which makes much more sense IMHO (although I don't know whether they normalize Cr and Cb to range [0,1] before the calculation or if they normalize the result which might result in a higher difference between Cr2 and Cr/Cb):
Where (Cr)², (Cr/Cb) are all normalized to the range [0, 1].
Note that in order to normalize Cr and Cb to range [0,1] you'd either need to divide the result of your equations by 255 or simply use RGB in range [0,1] and add 0.5 instead of 128:
//assumes RGB are in range [0,1]
double Cb = -0.168 * R - 0.3313 * G + 0.5 * B + 0.5;
double Cr = 0.5 * R - 0.4187 * G - 0.0813 * B + 0.5;
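Putting that together, a small sketch of a helper that takes 8-bit RGB and returns Cr and Cb already normalized to [0, 1] (the helper name is illustrative):
// Scale RGB to [0,1] and use 0.5 as the offset, per the equations above.
static double[] toCrCb(int r, int g, int b) {
    double R = r / 255.0, G = g / 255.0, B = b / 255.0;
    double Cb = -0.168 * R - 0.3313 * G + 0.5 * B + 0.5;
    double Cr =  0.5   * R - 0.4187 * G - 0.0813 * B + 0.5;
    return new double[] { Cr, Cb };
}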

java cosine similarity problem

I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It works very well, but there is one problem... :(
for example:
If I have the following two matrices and I want to compute the cosine similarity, it does not work because the rows are not the same length:
doc 1
1 2 3
4 5 6
doc 2
1 2 3 4 5 6
7 8 5 2 4 9
If the rows and columns are the same length, my program works very well, but it does not if they differ.
Any tips?
I'm not sure of your implementation but the cosine distance of two vectors is equal to the normalized dot product of those vectors.
The dot product of two vectors can be expressed as a · b = aᵀb. As a result, if the matrices have different dimensions you can't take the dot product to compute the cosine.
Now, in a standard TF*IDF approach, the matrix should be indexed by (term, document); as a result, any term not appearing in a document should appear as a zero in your matrix.
Now the way you have it set up seems to suggest there are two different matrices for your two documents. I'm not sure if this is your intent, but it seems incorrect.
On the other hand if one of your matrices is supposed to be your query, then it should be a vector and not a matrix, so that the transpose produces the correct result.
A full explanation of TF*IDF follows:
Ok, in a classic TF*IDF you construct a term-document matrix a. Each value in matrix a is written a_ij, where i is the term and j is the document. This value is a combination of local, global and normalization weights (although if you normalize your documents, the normalization weight should be 1). Thus a_ij = f_ij * D/d_i, where f_ij is the frequency of term i in doc j, D is the total number of documents, and d_i is the number of documents containing term i.
Your query is a vector of terms designated as b. Each value b_iq refers to term i of query q: b_iq = f_iq, where f_iq is the frequency of term i in query q. In this case each query is a vector, and multiple queries form a matrix.
We can then calculate the unit vectors of each so that when we take the dot product it produces the correct cosine. To obtain the unit vectors we divide both the matrix a and the query b by their Frobenius norm.
Finally we can compute the cosine by taking the transpose of the vector b for a given query, so one query (or vector) per calculation. This is denoted bᵀa. The final result is a vector with the score for each document, where a higher score denotes a higher document rank.
Simple Java cosine similarity:
static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
    // Terms present in both vectors; note retainAll (intersection), not
    // removeAll, since only shared terms contribute to the dot product.
    Set<String> both = new HashSet<>(v1.keySet());
    both.retainAll(v2.keySet());
    double dot = 0, norm1 = 0, norm2 = 0;
    for (String k : both) dot += v1.get(k) * v2.get(k);
    for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
    for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
    return dot / Math.sqrt(norm1 * norm2);
}
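Example usage with two small term-weight maps (the values are made up for illustration; Map.of requires Java 9+):
Map<String, Double> d1 = Map.of("beer", 0.7, "shop", 0.3);
Map<String, Double> d2 = Map.of("beer", 0.5, "spice", 0.5);
System.out.println(cosine_similarity(d1, d2)); // ~0.65, from the shared "beer" term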
