R - lm and r squared - java

I have been reading through the questions on this site and others, and I just want to make sure I am doing this correctly; then I would like some advice on analysing the results.
I am exporting an m by n binary matrix from Java to R (using JRI), and then I want to run lm() against an expected vector of 0s.
Here is the export code for getting the matrix into R:
REXP x = re.eval("selectionArray <- c()");
for (int j = 0; j < currentSelection.length; j++) {
    boolean result = re.assign("currentSNPs", currentSelection[j]);
    if (result) {
        x = re.eval("selectionArray <- rbind(selectionArray, currentSNPs)");
    }
}
Then I want to execute the lm() function to get the R-squared values:
x = re.eval("fm = lm(selectionArray ~ 0)");
I know that I need to use summary(fm) at this point to get the R-squared values, but I am not sure how to pull them out or what they mean. I want to know the significance of the deviation from the expected 0 value at each column.
Thanks!

To extract the R^2 value from an 'lm' object named 'm':
summary(m)$r.squared
You can always view the structure of an object in R by using the str() function; in this situation you want str(summary(m)).
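Since the fit lives in the embedded R session, here is a minimal JRI sketch (mine, not from the question) of pulling that number back into Java, assuming an lm fit named 'm' already exists in R and 're' is the Rengine from the question:
// Hedged JRI sketch: evaluate summary(m)$r.squared in R and read it back in Java.
REXP rsq = re.eval("summary(m)$r.squared");
if (rsq != null) {                        // eval() returns null if the R call failed
    double rSquared = rsq.asDouble();     // asDouble() reads a single numeric value
    System.out.println("R-squared: " + rSquared);
}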
However, it's not clear what you're trying to accomplish here. In the formula argument of the lm() function you specify selectionArray ~ 0, which doesn't make sense for two reasons: 1) as previously hinted at, a 0 on the right side of the formula corresponds to a model where your predictor variable is a vector of zeros and the beta coefficient corresponding to this predictor cannot be defined. 2) Your outcome, selectionArray, is a matrix. As far as I know, lm() isn't set up to have multiple outcomes.
Are you attempting to test whether each column of selectionArray differs significantly from 0? If so, ANY column with at least one success (1) in it is significantly different from an all-zero column. If you're interested in the confidence intervals for the probability of success in each column, use the following code. Note that this does not adjust for multiple comparisons.
First, let's start with a toy example to demonstrate the concept:
v1 <- rbinom(100,size=1,p=.25)
#create a vector, length 100,
#where each entry corresponds to the
#result of a bernoulli trial with probability p
binom.test(sum(v1), n=length(v1), p = 0)
##let's pretend we didn't just generate v1 ourselves,
##we can use binom.test to determine the 95% CI for p
#now in terms of what you want to do...
#here's a dataset that might be something like yours:
selectionArray <- sapply(runif(10), FUN=function(.p) rbinom(100,size=1,p=.p))
#I'm just generating 10 vectors from a binomial distribution
#where each entry corresponds to 1 trial and each column
#has a randomly generated p between 0 and 1
#using a for loop
#run a binomial test on each column, store the results in binom.test.results
binom.test.results <- list()
for(i in 1:ncol(selectionArray)){
  binom.test.results[[i]] <- binom.test(sum(selectionArray[,i]),
                                        n=nrow(selectionArray), p=0)
}
#for loops are considered bad programming in r, so here's the "right" way to do it:
binom.test.results1 <- lapply(as.data.frame(selectionArray), function(.v){
  binom.test(sum(.v), n=nrow(selectionArray), p = 0)
})
#using str() on a single element of binom.test.result will help you
#identify what results you'd like to extract from each test
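If you need those confidence bounds back on the Java side, here is a hedged JRI sketch built on the R code above; 'selectionArray' and 're' come from the question, the sapply wrappers are mine, and I leave binom.test at its default p since only the confidence interval is extracted:
// Hedged JRI follow-up: run the per-column binom.test inside R and pull the
// 95% confidence bounds back into Java.
REXP lower = re.eval("sapply(as.data.frame(selectionArray),"
        + " function(v) binom.test(sum(v), n = length(v))$conf.int[1])");
REXP upper = re.eval("sapply(as.data.frame(selectionArray),"
        + " function(v) binom.test(sum(v), n = length(v))$conf.int[2])");
double[] lowerBounds = lower.asDoubleArray();   // one lower bound per column
double[] upperBounds = upper.asDoubleArray();   // one upper bound per column
for (int j = 0; j < lowerBounds.length; j++) {
    System.out.println("column " + j + ": 95% CI ["
            + lowerBounds[j] + ", " + upperBounds[j] + "]");
}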

I don't know much about Java, so I won't address that part.
So you've got a matrix containing only 0 and 1 values?
And you want to know whether the means of the columns are significantly different from 0?
That means you should do hypothesis tests, not necessarily a regression; however, a regression can be equivalent to such a test.
lm(y~0) does not make sense. If you only want an intercept you should use lm(y~1). However, that would be equivalent to a t-test, which is not statistically correct for this kind of data.
I suspect it would be better to use fit<-glm(y~1,family=binomial) and then extract the p-value with p<-summary(fit)$coef[4], but I am not a statistician.
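A hedged JRI sketch of that glm suggestion applied to every column of the matrix from the question; the apply wrapper and the Java-side names are mine, and summary(fit)$coefficients[1, 4] is the intercept's p-value:
// Hedged sketch: one intercept-only binomial glm per column, p-values back in Java.
REXP pv = re.eval("apply(selectionArray, 2,"
        + " function(v) summary(glm(v ~ 1, family = binomial))$coefficients[1, 4])");
double[] pValues = pv.asDoubleArray();   // one p-value per column of selectionArray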

Related

Dynamic Programming - Rod Cutting Problem with maximum cuts and actual solution

So I'm trying to write code for a modified version of the rod cutting problem. The link gives a good intuition of the problem. However, I want to modify the code so that it not only returns the actual solution, i.e. what cuts give the optimal solution, but also limits the number of cuts to a maximum of k.
For proof of concept, I'm trying to create an algorithm to achieve this. The following is what I have so far; I think it successfully returns the actual solution, but I can't figure out how to limit the maximum to k.
let r[0..n] be a new array
r[0] = 0
for j = 1 to n
    q = -1
    for i = 1 to j
        for k = 0 to n-1
            q = Math.max(q[n][k], p[i] + q[n-i-1][k-1]);
    r[j] = q
return r[n]
Please do not provide actual code in your answers; I want to implement that myself. I just need help tweaking my algorithm to give the correct solution.
Update 1: I am already able to find the optimal solution for a maximum of k cuts by adding a second dimension to my array. This is shown in the above code.
As you say, you already have the optimal solution, so this answer covers only how to retrace the exact solution (the cuts made at each step).
Store the candidate cut for length = n and maximum cuts = k
For this, you simply need a 2-d array (say, visit[n][k]) to store the cut made that gets the maximum solution to q[n][k]. In terms of pseudo code and recurrence relations, it will look like the following.
q[n][k] = q[n][k-1]
visit[n][k] = -1
for each value of i:
    if q[n][k] < p[i] + q[n-i-1][k-1]:
        q[n][k] = p[i] + q[n-i-1][k-1]
        visit[n][k] = i
Explanation
It is possible that we don't have a cut that maximizes the solution. In this case, we initialize visit[n][k] = -1.
Every time we have a candidate to cut the rod of length n at length = i+1, i.e. we could get a better price by making a cut, we store the respective cut in another 2-d array.
Reconstruct the solution
Using this 2-d array (visit[n][k]), you can trace back the exact cuts with the following pseudo code (I am deliberately avoiding code since you mentioned you don't need it).
cuts = []
while k > 0:
    i = visit[n][k]
    if i != -1
        // If there is a cut
        cuts.push(i + 1)
        n = n - i - 1
    k = k - 1
Explanation
We iterate from k down to 0.
Every time visit[n][k] is not -1, i.e. it is optimal to cut somewhere, we reassign n after making the cut, i.e. n = n - i - 1, and store the resultant cut in the array cuts.
Finally, cuts will contain the exact cuts that led to the optimal solution.
Please note that the pseudo code in your question is slightly incorrect in terms of the variables used in the recurrence relation: q is used both to store the DP 2-d array and as the integer -1, j is not used in the bottom-up DP at all and is replaced by the constant n, and q[j][k] is uninitialized. However, the general idea is correct.

Convex Hull Optimization Java

I recently read the article from PEG Wiki about the convex hull trick. Surprisingly, at the end of the article I read that we can achieve a fully dynamic variant of the trick (meaning that there are no conditions of applicability) if we store the lines in a std::set. Although I have understood the approach mentioned, I always fail when I try to implement it.
In other words, there is an array A of size n, where each array element contains two positive integers a_i and b_i.
There are Q queries, where each query can be one of two types:
1) Given a positive integer x, find max(a_i * x + b_i) over all i from 1 to n.
2) Update the values of a_i and b_i for some i.
The values to be updated are in non-decreasing order, i.e. a_{i1} >= a_{i2} and b_{i1} >= b_{i2} for Q >= i1 > i2 >= 1.
The update can be performed by deleting the previous line and adding a new one. I am looking for both the update and the query in amortized O(log n) time, in Java.
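Not the std::set variant from the article, but one alternative that is often easier to implement in Java: a Li Chao segment tree over integer x in [0, xmax], which supports "insert a line" and "query max at x" in O(log xmax). It has no deletion, but if I read the update condition correctly (the new a_i and b_i are never smaller than the old ones), the replaced line is dominated for every positive x, so plain insertion may be enough. A minimal sketch, assuming a_i * x fits in a long; the class and all names are mine:
// Li Chao tree sketch: each node keeps one line y = a*x + b, namely the line
// that wins at the midpoint of the node's interval; the loser is pushed into
// the one half where it could still win.
class LiChaoTree {
    private static final long NEG = Long.MIN_VALUE / 4; // sentinel: "no line yet"
    private final long[] a, b;
    private final int xmax;

    LiChaoTree(int xmax) {
        this.xmax = xmax;
        a = new long[4 * (xmax + 1)];
        b = new long[4 * (xmax + 1)];
        java.util.Arrays.fill(b, NEG);   // a[] is all zeros, so eval() = NEG at first
    }

    private long eval(int node, long x) { return a[node] * x + b[node]; }

    void insert(long na, long nb) { insert(1, 0, xmax, na, nb); }

    private void insert(int node, int lo, int hi, long na, long nb) {
        int mid = (lo + hi) >>> 1;
        boolean leftBetter = na * lo + nb > eval(node, lo);
        boolean midBetter = na * mid + nb > eval(node, mid);
        if (midBetter) {                 // keep the mid-winner in this node
            long ta = a[node], tb = b[node];
            a[node] = na; b[node] = nb;
            na = ta; nb = tb;
        }
        if (lo == hi) return;
        if (leftBetter != midBetter) insert(2 * node, lo, mid, na, nb);
        else insert(2 * node + 1, mid + 1, hi, na, nb);
    }

    long queryMax(int x) {
        int node = 1, lo = 0, hi = xmax;
        long best = NEG;
        while (true) {
            best = Math.max(best, eval(node, x));
            if (lo == hi) return best;
            int mid = (lo + hi) >>> 1;
            if (x <= mid) { node = 2 * node; hi = mid; }
            else { node = 2 * node + 1; lo = mid + 1; }
        }
    }
}
Usage would be tree.insert(a_i, b_i) for every update and tree.queryMax(x) for every type-1 query.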

Java: Inverse of a matrix using EJML not working as expected

Within a Java project I've developed, I need to calculate the inverse of a matrix. In order to align with other projects and other developers I'm using the Efficient Java Matrix Library (org.ejml).
For inverting the matrix I'm using invert from org.ejml.ops.CommonOps, and it has worked fine until now, when I'm getting an unexpected result.
I've isolated the case that doesn't work to be:
DenseMatrix64F X = new DenseMatrix64F(3, 3);
X.setData(new double[]{77.44000335693366, -24.64000011444091, -8.800000190734865,
                       -24.640000114440916, 7.839999732971196, 2.799999952316285,
                       -8.800000190734865, 2.799999952316285, 1.0000000000000004});
DenseMatrix64F invX = new DenseMatrix64F(3, 3);
boolean completed = CommonOps.invert(X, invX);
System.out.println(X);
System.out.println(invX);
System.out.println(completed);
The output I get from this test is:
Type = dense , numRows = 3 , numCols = 3
77.440 -24.640 -8.800
-24.640 7.840 2.800
-8.800 2.800 1.000
Type = dense , numRows = 3 , numCols = 3
NaN -Infinity Infinity
NaN Infinity -Infinity
NaN -Infinity Infinity
true
My first thought was that it could be a singular matrix and therefore not invertible, but after testing the same matrix with a different calculation tool I've found that it is not singular.
So I went back to the EJML documentation and found out the following information for this particular function.
If the algorithm could not invert the matrix then false is returned. If it returns true that just means the algorithm finished. The results could still be bad because the matrix is singular or nearly singular.
And, in this particular case, the matrix is not singular, but we could say it is nearly singular.
The only solution I could think of was to search the inverted matrix for NaN or infinite values after calculating it, and if I find something funny in there, just replace the inverted matrix with the original matrix. Although it doesn't seem like a very clean practice, it yields reasonable results.
My question is:
Can you think of any solution for this situation? Something smarter and wiser than just using the original matrix as its own inverse.
In case there is no way around it, do you know of any other Java matrix library that has some solution to this situation? I'm not keen to introduce a new library, but it may be the only solution if this becomes a real problem.
Regards and thanks for your inputs!
You should try using SVD if you have to have an inverse. Also consider a pseudo-inverse instead. Basically, any library using LU decomposition will have serious issues. Here's the output from Octave. Note how two of the singular values are almost zero. Octave will give you an inverse with real numbers, but it's a poor one...
octave:7> cond(B)
ans = 8.5768e+17
octave:8> svd(B)
ans =
8.6280e+01
3.7146e-15
1.0060e-16
inv(B)*B
warning: inverse: matrix singular to machine precision, rcond = 4.97813e-19
ans =
0.62500 0.06250 0.03125
0.00000 0.00000 0.00000
0.00000 0.00000 4.00000
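Following the pseudo-inverse suggestion, here is a minimal Java sketch, assuming your EJML version exposes CommonOps.pinv (an SVD-backed pseudo-inverse); if it does not, the same idea can be built from EJML's SVD decomposition classes:
// Hedged sketch: swap invert() for a Moore-Penrose pseudo-inverse.
// X is the same nearly singular matrix from the question.
DenseMatrix64F pinvX = new DenseMatrix64F(3, 3);
CommonOps.pinv(X, pinvX);          // pseudo-inverse instead of CommonOps.invert
System.out.println(pinvX);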

Is it possible to get k-th element of m-character-length combination in O(1)?

Do you know of any way to get the k-th element of an m-element combination in O(1)? The expected solution should work for any size of input data and any value of m.
Let me explain this problem by example (python code):
>>> import itertools
>>> data = ['a', 'b', 'c', 'd']
>>> k = 2
>>> m = 3
>>> result = [''.join(el) for el in itertools.combinations(data, m)]
>>> print result
['abc', 'abd', 'acd', 'bcd']
>>> print result[k-1]
abd
For the given data, the k-th (2nd in this example) element of the m-element combinations is abd. Is it possible to get that value (abd) without creating the whole list of combinations?
I'm asking because I have data of ~1,000,000 characters and it is impossible to create the full m-character-length combination list to get the k-th element.
The solution can be pseudo code, or a link to a page describing this problem (unfortunately, I didn't find one).
Thanks!
http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Basically, express the index in the factorial number system, and use its digits as a selection from the original sequence (without replacement).
Not necessarily O(1), but the following should be very fast:
Take the original combinations algorithm:
def combinations(elems, m):
    #The k-th element depends on what order you use for
    #the combinations. Assuming it looks something like this...
    if m == 0:
        return [[]]
    else:
        combs = []
        for e in elems:
            combs += combinations(remove(e,elems), m-1)
        return combs
For n initial elements and m combination length, we have n!/((n-m)! m!) total combinations. We can use this fact to skip directly to our desired combination:
def kth_comb(elems, m, k):
    #High level pseudo code
    #Untested and probably full of errors
    if m == 0:
        return []
    else:
        combs_per_set = ncombs(len(elems) - 1, m-1)
        i = k / combs_per_set
        k = k % combs_per_set
        x = elems[i]
        return x + kth_comb(remove(x,elems), m-1, k)
First calculate r = n!/(m!(n-m)!) with n the number of elements.
Then floor(r/k) is the index of the first element in the result.
Remove it (shift everything following to the left).
Do m--, n-- and k = r % k.
Repeat until m is 0 (hint: when k is 0, just copy the following chars to the result).
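For completeness, a hedged Java sketch of combination unranking in lexicographic order (the combinatorial number system idea); all names are mine, and k is taken 0-based with 0 <= k < C(n, m):
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Return the k-th (0-based) m-combination of {0, 1, ..., n-1} in lexicographic
// order without enumerating all C(n, m) combinations.
public class CombinationUnrank {

    // n choose r with BigInteger so n ~ 1,000,000 does not overflow
    static BigInteger choose(long n, long r) {
        if (r < 0 || r > n) return BigInteger.ZERO;
        BigInteger result = BigInteger.ONE;
        for (long i = 0; i < r; i++) {
            result = result.multiply(BigInteger.valueOf(n - i))
                           .divide(BigInteger.valueOf(i + 1));
        }
        return result;
    }

    static List<Integer> kthCombination(int n, int m, BigInteger k) {
        List<Integer> result = new ArrayList<>();
        int next = 0;                            // smallest index still available
        for (int remaining = m; remaining > 0; remaining--) {
            // combinations whose next chosen element is `next`
            BigInteger count = choose(n - next - 1, remaining - 1);
            while (k.compareTo(count) >= 0) {    // skip `next` while k is past them
                k = k.subtract(count);
                next++;
                count = choose(n - next - 1, remaining - 1);
            }
            result.add(next);
            next++;
        }
        return result;
    }

    public static void main(String[] args) {
        // data = ['a','b','c','d'], m = 3, 0-based k = 1 -> "abd", as in the question
        char[] data = {'a', 'b', 'c', 'd'};
        StringBuilder sb = new StringBuilder();
        for (int i : kthCombination(data.length, 3, BigInteger.ONE)) sb.append(data[i]);
        System.out.println(sb);                  // prints abd
    }
}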
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem appears to fall under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it too is faster than other published techniques.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to Java, Python, or C++.

mapreduce distance calculation in hadoop

Is there a distance calculation implementation using Hadoop map/reduce? I am trying to calculate the distances between a given set of points.
Looking for any resources.
Edit
This is a very intelligent solution. I have tried something like the first algorithm, and I get almost what I was looking for. I am not concerned about optimizing the program at the moment, but my problem was that the dist(X,Y) function was not working. When I got all the points on the reducer, I was unable to go through all the points on an Iterator and calculate the distance. Someone on stackoverflow.com told me that the Iterator on Hadoop is different from the normal Java Iterator; I am not sure about that. But if I can find a simple way to go through the Iterator in my dist() function, I can use your second algorithm to optimize.
//This is your code and I am referring to that code too, just to make my point clear.
map(x,y) {
    for i in 1:N #number of points
        emit(i, (x,y)) //I did exactly like this
}
reduce (i, X)
    p1 = X[i]
    for j in i:N
        // here is my problem, I can't get the values from the Iterator.
        emit(dist(X[i], X[j]))
You need to do a self join on that data set. In Hive that would look, more or less, like:
select dist(P1.x, P1.y, P2.x, P2.y) from points P1 join points P2 on (True) where P1.x < P2.x or (P1.x = P2.x and P1.y < P2.y)
The function dist would need to be implemented using other Hive functions or written in Java and added as a UDF. Also, I am not sure about the True constant, but you can write 0=0 to the same effect. The where clause is there to avoid computing the same distance twice or 0 distances. The question is: would Hive optimize this the way you can by programming carefully in Hadoop? I am not sure. Here is a sketch in Hadoop:
map(x,y) {
    for i in 1:N #number of points
        emit(i, (x,y))
}
reduce (i, X)
    p1 = X[i]
    for j in i:N
        emit(dist(X[i], X[j]))
For this to work you need X to get to the reducer sorted in some order, for instance by x and then by y using secondary sort keys (that do not affect the grouping). This way every reducer gets a copy of all the points and works on a column of the distance matrix you are trying to generate. The memory requirements are minimal. You could trade some communication for memory by re-organizing the computation so that every reducer computes a square submatrix of the final matrix, knowing only two subsets of the points and calculating the distances among all of them. To achieve this, you need to make explicit the order of your points, say you are storing i, x, y
map(i,x,y) {
    for j in 1:N/k #k is size of submatrix
        emit((i/k, j), ("row", (x,y)))
        emit((j, i/k), ("col", (x,y)))
}
reduce ((a,b), Z)
    split Z in rows X and cols Y
    for x in X
        for y in Y
            emit(dist(x,y))
In this case you can see that the map phase emits only 2*N*N/k points, whereas the previous algorithm emitted N^2. Here we have (N/k)^2 reducers vs N for the other one. Each reducer has to hold k values in memory (using the secondary key technique to have all the rows get to the reducer before all the columns), vs only 2 before. So you see there are tradeoffs and for the second algorithm you can use the parameter k for perf tuning.
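Regarding the Iterator problem from the question's edit: Hadoop reuses the same Writable instance while iterating over the values, so copy each point into a plain Java list before the nested distance loop. A hedged sketch of a reducer for the first algorithm above; the class name and the "x,y" Text value format are assumptions, not code from either post:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Every reducer receives all points as "x,y" strings and emits pairwise distances.
public class PairwiseDistanceReducer
        extends Reducer<IntWritable, Text, Text, DoubleWritable> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Hadoop reuses the same Text instance on every iteration,
        // so copy each value's contents before collecting them in a list.
        List<double[]> points = new ArrayList<>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            points.add(new double[]{Double.parseDouble(parts[0]),
                                    Double.parseDouble(parts[1])});
        }
        // Now the points can be traversed as many times as needed.
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                double dx = points.get(i)[0] - points.get(j)[0];
                double dy = points.get(i)[1] - points.get(j)[1];
                context.write(new Text(i + "," + j),
                              new DoubleWritable(Math.sqrt(dx * dx + dy * dy)));
            }
        }
    }
}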
This problem does not sound like a good fit for map-reduce since you're not really able to break it into pieces and calculate each piece independently. If you could have a separate program that generates the complete graph of your points as a list (x1,y1,x2,y2) then you could do a straightforward map to get the distance.
