Working out a point using curve fitting in Java

The following code produces a curve that should fit the points
1, 1
150, 250
10000, 500
100000, 750
1000000, 1000
I built this code based on the documentation here; however, I am not entirely sure how to use the resulting data correctly for further calculations, and whether PolynomialCurveFitter.create(3) will affect the answers in those future calculations.
For example, how would I use the output to calculate the x value for a y value of 200, and how would the result differ if I had PolynomialCurveFitter.create(2) instead of PolynomialCurveFitter.create(3)?
import java.util.ArrayList;
import java.util.Arrays;

import org.apache.commons.math3.fitting.PolynomialCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoints;

public class MyFuncFitter {
    public static void main(String[] args) {
        ArrayList<Integer> keyPoints = new ArrayList<Integer>();
        keyPoints.add(1);
        keyPoints.add(150);
        keyPoints.add(10000);
        keyPoints.add(100000);
        keyPoints.add(1000000);

        WeightedObservedPoints obs = new WeightedObservedPoints();
        if (keyPoints != null && keyPoints.size() != 1) {
            int size = keyPoints.size();
            int sectionSize = (int) (1000 / (size - 1));
            for (int i = 0; i < size; i++) {
                if (i != 0)
                    obs.add(keyPoints.get(i), i * sectionSize);
                else
                    obs.add(keyPoints.get(0), 1);
            }
        } else if (keyPoints.size() == 1 && keyPoints.get(0) >= 1) {
            obs.add(1, 1);
            obs.add(keyPoints.get(0), 1000);
        }

        PolynomialCurveFitter fitter = PolynomialCurveFitter.create(3);
        // note: withStartPoint returns a new fitter; the returned instance is discarded here
        fitter.withStartPoint(new double[] {keyPoints.get(0), 1});
        double[] coeff = fitter.fit(obs.toList());
        System.out.println(Arrays.toString(coeff));
    }
}

What the consequences of changing the degree d are for your function
PolynomialCurveFitter.create takes the degree of the polynomial as a parameter.
Very (very) roughly speaking, the polynomial degree describes the "complexity" of the curve you want to fit. A low degree produces simple curves (just a parabola for d=2), whereas higher degrees produce more intricate curves, with lots of peaks and valleys of highly varying sizes. A high-degree curve is therefore better able to perfectly "fit" all your data points, at the expense of not necessarily being a good "prediction" of the other values.
Think of the classic overfitting graphic: a wiggly high-degree curve passing exactly through every point, next to a straight line. The straight line would be a better "approximation" of the trend, while not fitting the data points exactly.
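To see the difference concretely, you can fit the same observations with both degrees and compare the coefficient arrays; here is a minimal sketch using the points from the question (the class name is mine, not from the original code):
import java.util.Arrays;
import org.apache.commons.math3.fitting.PolynomialCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoints;

public class DegreeComparison {
    public static void main(String[] args) {
        WeightedObservedPoints obs = new WeightedObservedPoints();
        obs.add(1, 1);
        obs.add(150, 250);
        obs.add(10000, 500);
        obs.add(100000, 750);
        obs.add(1000000, 1000);

        // coefficients are returned lowest degree first: {a0, a1, ...}
        double[] quadratic = PolynomialCurveFitter.create(2).fit(obs.toList());
        double[] cubic = PolynomialCurveFitter.create(3).fit(obs.toList());

        System.out.println("degree 2: " + Arrays.toString(quadratic));
        System.out.println("degree 3: " + Arrays.toString(cubic));
    }
}
The degree-2 array has three entries and the degree-3 array has four, so any value you later compute from them (such as the x for a given y) will generally differ between the two fits.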
How to compute x for any y value with the fitted function
You "simply" need to solve the polynomial, using the very same library: subtract the target y value from the constant coefficient, then find the root.
Let's say you chose a degree of 2.
Your coefficients array coeffs will contain 3 factors {a0, a1, a2}, which describe the equation:
y = a2*x^2 + a1*x + a0
If you want to solve this for a particular value, like y = 600, you need to solve:
600 = a2*x^2 + a1*x + a0
So, basically:
a2*x^2 + a1*x + (a0 - 600) = 0
So, just subtract 600 from a0:
coeffs[0] -= 600;
and find the root of the polynomial using the dedicated function:
import org.apache.commons.math3.analysis.polynomials.PolynomialFunction;
import org.apache.commons.math3.analysis.solvers.LaguerreSolver;

// coefficients are ordered lowest degree first: {a0 - 600, a1, a2, ...}
PolynomialFunction polynomial = new PolynomialFunction(coeffs);
LaguerreSolver laguerreSolver = new LaguerreSolver();
double x = laguerreSolver.solve(100, polynomial, 0, 1000000); // search between x = 0 and x = 1,000,000
System.out.println("For y = 600, we found x = " + x);
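If the fitted polynomial is not monotonic over your range, there may be several real roots, and the bracketed solve above only returns one. As far as I recall, LaguerreSolver also offers solveAllComplex, which returns every root so you can pick the real one inside your data range; treat the exact method name as an assumption and check the Javadoc. A sketch with illustrative coefficient values:
import org.apache.commons.math3.analysis.solvers.LaguerreSolver;
import org.apache.commons.math3.complex.Complex;

public class AllRoots {
    public static void main(String[] args) {
        // coefficients of (a0 - 600) + a1*x + a2*x^2 + a3*x^3, lowest degree first;
        // in practice these come from fitter.fit(...) after coeffs[0] -= 600
        double[] coeffs = {-599.0, 0.5, 1e-4, -1e-10};   // illustrative values only
        Complex[] roots = new LaguerreSolver().solveAllComplex(coeffs, 0);
        for (Complex root : roots) {
            // keep roots that are (numerically) real and inside the observed x range
            if (Math.abs(root.getImaginary()) < 1e-6
                    && root.getReal() >= 0 && root.getReal() <= 1_000_000) {
                System.out.println("candidate x for y = 600: " + root.getReal());
            }
        }
    }
}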

Related

Exponential distribution in Java not right - values too small?

I am trying to generate an exponential distribution for arrival and service times of processes. In C++, the example I have works fine and generates pseudo-random numbers in the range [0, inf) and some are bigger as expected. In Java, it does not work. The numbers are orders of magnitude smaller than their C++ equivalents, and I NEVER get any values > 0.99 even though I am using the same formula. In C++ I get 1.xx, or 2.xx etc., but never in Java.
lambda is the average rate of arrival and gets varied from 1 to 30.
I know that rand.nextDouble() gives a value between 0 and 1, and from the formula given and answers here on this site, this seems to be a needed component.
I should mention that multiplying my distribution values by 10 gets me much closer to where they need to be and they behave as expected.
In Java:
Random rand = new Random();
// if I multiply x by 10, I get much closer to the distribution I need
// I just don't know why it's off by a factor of 10?!
x = (Math.log(1-rand.nextDouble())/(-lambda));
I have also tried:
x = 0;
while (x == 0)
{
    x = (-1 / lambda) * Math.log(rand.nextDouble());
}
The C++ code I was given:
// returns a random number between 0 and 1
float urand()
{
return( (float) rand()/RAND_MAX );
}
// returns a random number that follows an exp distribution
float genexp(float lambda)
{
float u,x;
x = 0;
while (x == 0)
{
u = urand();
x = (-1/lambda)*log(u);
}
return(x);
}
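For comparison, here is a minimal self-contained Java sketch of inverse-CDF exponential sampling (the class and variable names are mine, not from the original program). One thing worth checking in the original code is the type of lambda: if it is declared as an integer, the second attempt's (-1/lambda) is evaluated with integer division and truncates to 0 for lambda > 1.
import java.util.Random;

public class ExpSampler {
    // Inverse-CDF sampling of an exponential distribution with rate lambda (must be a double).
    static double nextExp(Random rand, double lambda) {
        double u;
        do {
            u = rand.nextDouble();       // uniform in [0, 1)
        } while (u == 0.0);              // avoid log(0)
        return -Math.log(u) / lambda;    // mean 1/lambda, range (0, inf)
    }

    public static void main(String[] args) {
        Random rand = new Random();
        double lambda = 1.0;             // illustrative rate
        double sum = 0;
        int n = 1_000_000;
        for (int i = 0; i < n; i++) {
            sum += nextExp(rand, lambda);
        }
        System.out.println("sample mean (should be close to 1/lambda): " + sum / n);
    }
}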

XOR Neural Network(FF) converges to 0.5

I've created a program that allows me to create flexible neural networks of any size/length; however, I'm testing it using the simple structure of an XOR setup (feed-forward, sigmoid activation, backpropagation, no batching).
EDIT: The following is a completely new approach to my original question, which didn't supply enough information.
EDIT 2: I started my weights between -2.5 and 2.5, and fixed a problem in my code where I forgot some negatives. Now it either converges to 0 for all cases or to 1 for all cases, instead of 0.5.
Everything works exactly the way that I THINK it should, however it is converging toward 0.5 instead of oscillating between outputs of 0 and 1. I've completely gone through and hand-calculated an entire setup of feeding forward / calculating delta errors / backpropagation, etc., and it matched what I got from the program. I have also tried optimizing it by changing the learning rate / momentum, as well as increasing complexity in the network (more neurons/layers).
Because of this, I assume that either one of my equations is wrong, or I have some other sort of misunderstanding in my Neural Network. The following is the logic with equations that I follow for each step:
I have an input layer with two inputs and a bias, a hidden layer with 2 neurons and a bias, and an output layer with 1 neuron.
Take the input from each of the two input neurons and the bias neuron, then multiply them by their respective weights, and then add them together as the input for each of the two neurons in the hidden layer.
Take the input of each hidden neuron, pass it through the Sigmoid activation function (Reference 1) and use that as the neuron's output.
Take the outputs of each neuron in hidden layer (1 for the bias), multiply them by their respective weights, and add those values to the output neuron's input.
Pass the output neuron's input through the Sigmoid activation function, and use that as the output for the whole network.
Calculate the Delta Error(Reference 2) for the output neuron
Calculate the Delta Error(Reference 3) for each of the 2 hidden neurons
Calculate the Gradient(Reference 4) for each weight (starting from the end and working back)
Calculate the Delta Weight(Reference 5) for each weight, and add that to its value.
Start the process over by changing the inputs and expected output (Reference 6).
Here are the specifics of those references to equations/processes (This is probably where my problem is!):
x is the input of the neuron: (1/(1 + Math.pow(Math.E, (-1 * x))))
-1*(actualOutput - expectedOutput)*(Sigmoid(x) * (1 - Sigmoid(x))) // same sigmoid used in reference 1
SigmoidDerivative(Neuron.input)*(The sum of(Neuron.Weights * the deltaError of the neuron they connect to))
ParentNeuron.output * NeuronItConnectsTo.deltaError
learningRate*(weight.gradient) + momentum*(Previous Delta Weight)
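For what it's worth, here is a direct transcription of references 1-5 into small Java helpers (the class and method names are mine, not from the program being described); it can make it easier to compare a hand calculation against the code:
public class XorBackpropFormulas {
    // Reference 1: sigmoid of the neuron's input x
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Sigmoid derivative, evaluated at the neuron's input (as used in references 2 and 3)
    static double sigmoidDerivative(double input) {
        double s = sigmoid(input);
        return s * (1 - s);
    }

    // Reference 2: delta error of the output neuron
    static double outputDelta(double actual, double expected, double input) {
        return -1 * (actual - expected) * sigmoidDerivative(input);
    }

    // Reference 3: delta error of a hidden neuron
    static double hiddenDelta(double input, double[] outgoingWeights, double[] downstreamDeltas) {
        double sum = 0;
        for (int k = 0; k < outgoingWeights.length; k++) {
            sum += outgoingWeights[k] * downstreamDeltas[k];
        }
        return sigmoidDerivative(input) * sum;
    }

    // Reference 4: gradient for one weight
    static double gradient(double parentOutput, double downstreamDelta) {
        return parentOutput * downstreamDelta;
    }

    // Reference 5: weight update step
    static double deltaWeight(double learningRate, double gradient, double momentum, double previousDeltaWeight) {
        return learningRate * gradient + momentum * previousDeltaWeight;
    }

    public static void main(String[] args) {
        // tiny sanity check: one output neuron with weighted input 0.3 and target 1
        double out = sigmoid(0.3);
        System.out.println("output delta = " + outputDelta(out, 1.0, 0.3));
    }
}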
I have an arrayList with the values 0,1,1,0 in it in that order. It takes the first pair(0,1), and then expects a 1. For the second time through, it takes the second pair(1,1) and expects a 0. It just keeps iterating through the list for each new set. Perhaps training it in this systematic way causes the problem?
Like I said before, the reason I don't think it's a code problem is that it matched exactly what I had calculated with paper and pencil (which wouldn't have happened if there was a coding error).
Also when I initialize my weights the first time, I give them a random double value between 0 and 1. This article suggests that that may lead to a problem: Neural Network with backpropogation not converging
Could that be it? I used the n^(-1/2) rule but that did not fix it.
If I can be more specific or you want other code let me know, thanks!
This is wrong:
SigmoidDerivative(Neuron.input)*(The sum of(Neuron.Weights * the deltaError of the neuron they connect to))
The first function below is the sigmoid activation (g); the second is the derivative of the sigmoid activation:
private double g(double z) {
    return 1 / (1 + Math.exp(-z));   // sigmoid activation
}

private double gD(double gZ) {
    return gZ * (1 - gZ);            // derivative, expressed in terms of the activation g(z)
}
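Note that gD expects the already-activated value g(z), not the raw input; a tiny usage sketch (variable names are illustrative):
double z = 0.42;               // a neuron's weighted input (illustrative value)
double activation = g(z);      // forward pass: sigmoid of the input
double slope = gD(activation); // derivative of the sigmoid, computed from the activation itself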
Unrelated note: your notation (-1 * x) is really strange; just use -x.
Your implementation, judging from how you phrase the steps of your ANN, seems poor. Try to focus on implementing ForwardPropagation/BackPropagation and then an UpdateWeights method.
Creating a matrix class
This is my Java implementation; it's very simple and somewhat rough. I use a Matrix class to make the math behind it appear very simple in code.
If you can code in C++, you can overload operators, which enables even easier writing of comprehensible code.
https://github.com/josephjaspers/ArtificalNetwork/blob/master/src/artificalnetwork/ArtificalNetwork.java
Here are the algorithms (C++)
All of this code can be found on my GitHub (the neural nets are simple and functional).
Each layer includes the bias nodes, which is why there are offsets
void NeuralNet::forwardPropagation(std::vector<double> data) {
    setBiasPropogation(); // sets all the bias nodes' activation to 1
    a(0).set(1, Matrix(data)); // 1 to offset for the bias unit (A = X)
    for (int i = 1; i < layers; ++i) {
        z(i).set(1, w(i - 1) * a(i - 1)); // set(1, ...) offsets the bias unit
        a(i) = g(z(i)); // g(z) is the sigmoid function
    }
}

void NeuralNet::setBiasPropogation() {
    for (int i = 0; i < activation.size(); ++i) {
        a(i).set(0, 0, 1);
    }
}
outLayer: D = A - Y (Y is the output data)
hiddenLayers: d^l = (w^l(T) * d^(l+1)) *: gD(a^l)

d = delta (error) vector
W = weights matrix (length = connections, width = features)
a = activation matrix
gD = derivative function
^l is not a power; it just means "at layer l"
* = dot product
*: = element-wise multiply (multiply each element "through")
cpy(n) returns a copy of the matrix offset by n (ignores the first n rows)
void NeuralNet::backwardPropagation(std::vector<double> output) {
    d(layers - 1) = a(layers - 1) - Matrix(output);
    for (int i = layers - 2; i > -1; --i) {
        d(i) = (w(i).T() * d(i + 1).cpy(1)).x(gD(a(i)));
    }
}
Explaining this code may be confusing without images, so I'm including this link, which I think is a good source; it also contains an explanation of backpropagation that may be better than my own.
http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
void NeuralNet::updateWeights() {
    // operator()(int l, int w) returns a double reference at that position in the matrix
    // operator[](int n) returns the nth double (reference) in the matrix (useful for vectors)
    for (int l = 0; l < layers - 1; ++l) {
        for (int i = 1; i < d(l + 1).length(); ++i) {
            for (int j = 0; j < a(l).length(); ++j) {
                w(l)(i - 1, j) -= (d(l + 1)[i] * a(l)[j]) * learningRate + m(l)(i - 1, j);
                m(l)(i - 1, j) = (d(l + 1)[i] * a(l)[j]) * learningRate * momentumRate;
            }
        }
    }
}

Bertrand's Paradox Simulation

So, I saw this on Hacker News the other day: http://web.mit.edu/tee/www/bertrand/problem.html
It basically asks: what is the probability that a random chord of a circle with radius 1 has a length greater than the square root of 3?
Looking at it, it seems obvious that the answer is 1/3, but comments on HN have people who are smarter than me debating this. https://news.ycombinator.com/item?id=10000926
I didn't want to debate, but I did want to make sure I wasn't crazy. So I coded what I thought would prove it to be P = 1/3, but I end up getting P ~ .36. So, something's got to be wrong with my code.
Can I get a sanity check?
package com.jonas.betrand;

import java.awt.geom.Point2D;
import java.util.Random;

public class Paradox {
    final static double ROOT_THREE = Math.sqrt(3);

    public static void main(String[] args) {
        int greater = 0;
        int less = 0;
        for (int i = 0; i < 1000000; i++) {
            Point2D.Double a = getRandomPoint();
            Point2D.Double b = getRandomPoint();
            // Pythagorean distance between the two endpoints
            if (Math.sqrt(Math.pow((a.x - b.x), 2) + Math.pow((a.y - b.y), 2)) > ROOT_THREE) {
                greater++;
            } else {
                less++;
            }
        }
        System.out.println("Probability Observed: " + (double) greater / (greater + less));
    }

    public static Point2D.Double getRandomPoint() {
        // get an x such that -1 < x < 1
        double x = Math.random();
        boolean xsign = new Random().nextBoolean();
        if (!xsign) {
            x *= -1;
        }
        // formula for a circle centered on the origin with radius 1: x^2 + y^2 = 1
        double y = Math.sqrt(1 - (Math.pow(x, 2)));
        boolean ysign = new Random().nextBoolean();
        if (!ysign) {
            y *= -1;
        }
        Point2D.Double point = new Point2D.Double(x, y);
        return point;
    }
}
EDIT: Thanks to a bunch of people setting me straight, I found that my method of picking a random point wasn't actually uniform. Here is a fix for that function, which returns about 1/3.
public static Point2D.Double getRandomPoint() {
    // get an x such that -1 < x < 1
    double x = Math.random();
    Random r = new Random();
    if (!r.nextBoolean()) {
        x *= -1;
    }
    // circle centered on origin: x^2 + y^2 = r^2. r is 1.
    double y = Math.sqrt(1 - (Math.pow(x, 2)));
    if (!r.nextBoolean()) {
        y *= -1;
    }
    if (r.nextBoolean()) {
        return new Point2D.Double(x, y);
    } else {
        return new Point2D.Double(y, x);
    }
}
I believe you need to assume one fixed point, say at (0, 1), and then choose a random amount of rotation in [0, 2*pi] around the circle for the location of the second endpoint of the chord.
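Here is a minimal Java sketch of that suggestion (class and variable names are mine): both endpoints are chosen by a uniform angle, which gives the same chord-length distribution as fixing one endpoint and rotating the other.
import java.util.Random;

public class RandomEndpoints {
    public static void main(String[] args) {
        Random rand = new Random();
        final double ROOT_THREE = Math.sqrt(3);
        int trials = 1_000_000;
        int longer = 0;
        for (int i = 0; i < trials; i++) {
            // pick both endpoints uniformly by angle on the unit circle
            double a = 2 * Math.PI * rand.nextDouble();
            double b = 2 * Math.PI * rand.nextDouble();
            double dx = Math.cos(a) - Math.cos(b);
            double dy = Math.sin(a) - Math.sin(b);
            if (Math.hypot(dx, dy) > ROOT_THREE) {
                longer++;
            }
        }
        System.out.println((double) longer / trials); // prints roughly 0.333
    }
}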
Just for the hell of it I wrote your incorrect version in Swift (learn Swift!):
struct P {
    let x, y: Double
    init() {
        x = (Double(arc4random()) / 0xFFFFFFFF) * 2 - 1
        y = sqrt(1 - x * x) * (arc4random() % 2 == 0 ? 1 : -1)
    }
    func dist(other: P) -> Double {
        return sqrt((x - other.x) * (x - other.x) + (y - other.y) * (y - other.y))
    }
}

let root3 = sqrt(3.0)
let total = 100_000_000
var samples = 0
for var i = 0; i < total; i++ {
    if P().dist(P()) > root3 {
        samples++
    }
}
println(Double(samples) / Double(total))
And the answer is indeed 0.36. As the comments have been explaining, a random X value is more likely to choose the "flattened area" around pi/2 and highly unlikely to choose the "vertically squeezed" area around 0 and pi.
It is easily fixed however in the constructor for P:
(Double(arc4random()) / 0xFFFFFFFF is fancy-speak for random floating point number in [0, 1))
let angle = Double(arc4random()) / 0xFFFFFFFF * M_PI * 2
x = cos(angle)
y = sin(angle)
// outputs 0.33334509
Bertrand's paradox is exactly that: a paradox. The answer can be argued to be 1/3 or 1/2 depending on how the problem is interpreted. It seems you took the random chord approach where one side of the line is fixed and then you draw a random chord to any part of the circle. Using this method, the chances of drawing a chord that is longer than sqrt(3) is indeed 1/3.
But if you use a different approach, which I'll call the random radius approach, you'll see that it can be 1/2! The random radius approach is this: you draw a radius of the circle, pick a point uniformly along that radius, and take the chord that the radius bisects at that point. With this method, a random chord will be longer than sqrt(3) 1/2 of the time.
Lastly, the random midpoint method. Choose a random point uniformly in the circle, and then draw the chord that has this point as its midpoint. If the point falls within a concentric circle of radius 1/2, the chord is longer than sqrt(3); if it falls outside that circle, it is shorter. A circle of radius 1/2 has 1/4 the area of a circle with radius 1, so the chance of a chord longer than sqrt(3) is 1/4.
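To see the three interpretations side by side, here is a minimal Java sketch (method and variable names are mine); each helper returns the length of one random chord under that interpretation:
import java.util.Random;

public class BertrandMethods {
    static final Random R = new Random();
    static final double ROOT_THREE = Math.sqrt(3);

    // Random endpoints: two angles chosen uniformly on the circle.
    static double chordEndpoints() {
        double a = 2 * Math.PI * R.nextDouble(), b = 2 * Math.PI * R.nextDouble();
        return Math.hypot(Math.cos(a) - Math.cos(b), Math.sin(a) - Math.sin(b));
    }

    // Random radius: distance of the chord from the centre chosen uniformly in [0, 1].
    static double chordRadius() {
        double d = R.nextDouble();
        return 2 * Math.sqrt(1 - d * d);
    }

    // Random midpoint: midpoint chosen uniformly over the area of the disk.
    static double chordMidpoint() {
        double d = Math.sqrt(R.nextDouble());   // sqrt gives an area-uniform radius
        return 2 * Math.sqrt(1 - d * d);
    }

    public static void main(String[] args) {
        int n = 1_000_000, a = 0, b = 0, c = 0;
        for (int i = 0; i < n; i++) {
            if (chordEndpoints() > ROOT_THREE) a++;
            if (chordRadius() > ROOT_THREE) b++;
            if (chordMidpoint() > ROOT_THREE) c++;
        }
        System.out.printf("endpoints %.3f, radius %.3f, midpoint %.3f%n",
                (double) a / n, (double) b / n, (double) c / n);  // ~0.333, ~0.500, ~0.250
    }
}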
As for your code, I haven't had time to look at it yet, but hope this clarifies the paradox (which is just an incomplete question not actually a paradox) :D
I would argue that the Bertrand paradox is less a paradox and more a cautionary lesson in probability. It's really asking the question: What do you mean by random?
Bertrand argued that there are three natural but different methods for randomly choosing a chord, giving three distinct answers. There are of course other random methods, but they are arguably not the most natural ones (that is, not the first that come to mind). For example, we could randomly position the two chord endpoints in a non-uniform manner, or position the chord midpoint according to some non-uniform density, like a truncated bi-variate normal.
To simulate the three methods with a programming language, you need to be able to generate uniform random variables on the unit interval, which is what all standard (pseudo-)random number generators should do. For one of the methods/solutions (the random midpoint one), you then have to take the square root of one of the uniform random variables. You then multiply the random variables by a suitable factor (or rescale). Then, for each simulation method (or solution), some geometry gives the expressions for the two endpoints.
For more details, I have written a post about this problem. I recommend the links and books I have cited at the end of that post, under the section Further reading. For example, see Section 1.3 in this new set of published lecture notes. The Bertrand paradox is also in The Pleasures of Probability by Isaac. It’s covered in a non-mathematical way in the book Paradoxes from A to Z by Clark.
I have also uploaded some simulation code in MATLAB, R and Python, which can be found here.
For example, in Python (with NumPy):
import numpy as np; #NumPy package for arrays, random number generation, etc
import matplotlib.pyplot as plt #for plotting
from matplotlib import collections as mc #for plotting line chords
###START Parameters START###
#Simulation disk dimensions
xx0=0; yy0=0; #center of disk
r=1; #disk radius
numbLines=10**2;#number of lines
###END Parameters END###
###START Simulate three solutions on a disk START###
#Solution A
thetaA1=2*np.pi*np.random.uniform(0,1,numbLines); #choose angular component uniformly
thetaA2=2*np.pi*np.random.uniform(0,1,numbLines); #choose angular component uniformly
#calculate chord endpoints
xxA1=xx0+r*np.cos(thetaA1);
yyA1=yy0+r*np.sin(thetaA1);
xxA2=xx0+r*np.cos(thetaA2);
yyA2=yy0+r*np.sin(thetaA2);
#calculate midpoints of chords
xxA0=(xxA1+xxA2)/2; yyA0=(yyA1+yyA2)/2;
#Solution B
thetaB=2*np.pi*np.random.uniform(0,1,numbLines); #choose angular component uniformly
pB=r*np.random.uniform(0,1,numbLines); #choose radial component uniformly
qB=np.sqrt(r**2-pB**2); #distance to circle edge (along the line)
#calculate trig values
sin_thetaB=np.sin(thetaB);
cos_thetaB=np.cos(thetaB);
#calculate chord endpoints
xxB1=xx0+pB*cos_thetaB+qB*sin_thetaB;
yyB1=yy0+pB*sin_thetaB-qB*cos_thetaB;
xxB2=xx0+pB*cos_thetaB-qB*sin_thetaB;
yyB2=yy0+pB*sin_thetaB+qB*cos_thetaB;
#calculate midpoints of chords
xxB0=(xxB1+xxB2)/2; yyB0=(yyB1+yyB2)/2;
#Solution C
#choose a point uniformly in the disk
thetaC=2*np.pi*np.random.uniform(0,1,numbLines); #choose angular component uniformly
pC=r*np.sqrt(np.random.uniform(0,1,numbLines)); #choose radial component
qC=np.sqrt(r**2-pC**2); #distance to circle edge (along the line)
#calculate trig values
sin_thetaC=np.sin(thetaC);
cos_thetaC=np.cos(thetaC);
#calculate chord endpoints
xxC1=xx0+pC*cos_thetaC+qC*sin_thetaC;
yyC1=yy0+pC*sin_thetaC-qC*cos_thetaC;
xxC2=xx0+pC*cos_thetaC-qC*sin_thetaC;
yyC2=yy0+pC*sin_thetaC+qC*cos_thetaC;
#calculate midpoints of chords
xxC0=(xxC1+xxC2)/2; yyC0=(yyC1+yyC2)/2;
###END Simulate three solutions on a disk END###

Hash function for 2D point in limited Euclidean space

I am storing a lot of objects with geographic positions as 2D points (x, y) at a granularity of meters. To represent the world I am using a grid divided into cells of 1 square km. Currently I am using HashMap<Position, Object> for this. Any other map or appropriate data structure is fine, but the current solution works, so I am only interested in getting the details right.
I have been reading a lot about making good hash functions, specifically for 2D points. So far, no solution has been really good (rated in terms of being as collision-free as possible).
To test some ideas I wrote a very simple Java program that generates hash codes for every point from (-1000, -1000) to (1000, 1000) and stores them in a HashSet<Integer>. This is my result:
# java HashTest
4000000 number of unique positions
test1: 3936031 (63969 buckets, 1,60%) collisions using Objects.hash(x,y)
test2: 0 (4000000 buckets, 100,00%) collisions using (x << 16) + y
test3: 3998000 (2000 buckets, 0,05%) collisions using x
test4: 3924037 (75963 buckets, 1,90%) collisions using x*37 + y
test5: 3996001 (3999 buckets, 0,10%) collisions using x*37 + y*37
test6: 3924224 (75776 buckets, 1,89%) collisions using x*37 ^ y
test7: 3899671 (100329 buckets, 2,51%) collisions using x*37 ^ y*37
test8: 0 (4000000 buckets, 100,00%) collisions using PerfectlyHashThem
test9: 0 (4000000 buckets, 100,00%) collisions using x << 16 | (y & 0xFFFF)
Legend: number of collisions , buckets(collisions), perc(collisions)
Most of these hash functions perform really badly. In fact, the only good solution is the one that shifts x into the first 16 bits of the integer. The limitation, I guess, is that the two most distant points must not be further apart than the square root of Integer.MAX_VALUE, i.e. the grid can be at most about 46,340 km on a side.
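For illustration, the test9 hash wired into a hashCode() for the map key might look like the sketch below (the Position class and its fields are assumptions, not from the original code); it stays collision-free as long as both coordinates fit in 16 bits:
public final class Position {
    private final int x;   // grid cell x, assumed to fit in 16 bits
    private final int y;   // grid cell y, assumed to fit in 16 bits

    public Position(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public int hashCode() {
        return (x << 16) | (y & 0xFFFF);   // collision-free while |x|, |y| < 2^15
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Position)) return false;
        Position p = (Position) o;
        return p.x == x && p.y == y;
    }
}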
This is my test function (just copied for each new hash function):
public void test1() {
HashSet<Integer> hashCodes = new HashSet<Integer>();
int collisions = 0;
for (int x = -MAX_VALUE; x < MAX_VALUE; ++x) {
for (int y = -MAX_VALUE; y < MAX_VALUE; ++y) {
final int hashCode = Objects.hash(x,y);
if (hashCodes.contains(hashCode))
collisions++;
hashCodes.add(hashCode);
}
}
System.console().format("test1: %1$s (%2$s buckets, %3$.2f%%) collisions using Objects.hash(x,y)\n", collisions, buckets(collisions), perc(collisions));
}
Am I thinking wrong here? Should I fine-tune the primes to get better results?
Edits:
Added more hash functions (test8 and test9). test8 comes from the response by @nawfal in Mapping two integers to one, in a unique and deterministic way (converted from short to int).
public void test1() {
    int MAX_VALUE = 1000;
    HashSet<Integer> hashCodes = new HashSet<Integer>();
    int collisions = 0;
    for (int x = -MAX_VALUE; x < MAX_VALUE; ++x) {
        for (int y = -MAX_VALUE; y < MAX_VALUE; ++y) {
            final int hashCode = ((x + MAX_VALUE) << 16) | ((y + MAX_VALUE) & 0xFFFF);
            if (hashCodes.contains(hashCode))
                collisions++;
            hashCodes.add(hashCode);
        }
    }
    System.out.println("Collisions: " + collisions + " // Buckets: " + hashCodes.size());
}
Prints: Collisions: 0 // Buckets: 4000000
I found a similar question whose answer was to use a Cantor pairing function. Here:
Mapping two integers to one, in a unique and deterministic way.
The Cantor pairing function can be used for negative integers as well, by first mapping them to the natural numbers with a bijection.
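A sketch of that idea in Java (class and method names are mine): fold negatives into the naturals with a simple bijection, then apply the Cantor pairing function. Returning a long avoids overflow for larger grids:
public final class CantorHash {
    // bijection from the integers to the naturals: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    static long toNatural(int z) {
        return z >= 0 ? 2L * z : -2L * z - 1;
    }

    // Cantor pairing function on the folded coordinates
    static long pair(int x, int y) {
        long a = toNatural(x);
        long b = toNatural(y);
        return (a + b) * (a + b + 1) / 2 + b;
    }

    public static void main(String[] args) {
        System.out.println(pair(-1000, -1000));
        System.out.println(pair(1000, 999));    // distinct inputs give distinct values
    }
}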

Generating correlated numbers

Here is a fun one: I need to generate random x/y pairs that are correlated at a given value of Pearson product moment correlation coefficient, or Pearson r. You can imagine this as two arrays, array X and array Y, where the values of array X and array Y must be re-generated, re-ordered or transformed until they are correlated with each other at a given level of Pearson r. Here is the kicker: Array X and Array Y must be uniform distributions.
I can do this with a normal distribution, but transforming the values without skewing the distribution has me stumped. I tried re-ordering the values in the arrays to increase the correlation, but I will never get arrays correlated at 1.00 or -1.00 just by sorting.
Any ideas?
--
Here is the AS3 code for random correlated Gaussians, to get the wheels turning:
public static function nextCorrelatedGaussians(r:Number):Array {
    var d1:Number;
    var d2:Number;
    var n1:Number;
    var n2:Number;
    var lambda:Number;
    var arr:Array = new Array();
    var isNeg:Boolean;

    if (r < 0) {
        r *= -1;
        isNeg = true;
    }
    lambda = ((r * r) - Math.sqrt((r * r) - (r * r * r * r))) / ((2 * r * r) - 1);
    n1 = nextGaussian();
    n2 = nextGaussian();
    d1 = n1;
    d2 = ((lambda * n1) + ((1 - lambda) * n2)) / Math.sqrt((lambda * lambda) + (1 - lambda) * (1 - lambda));
    if (isNeg) { d2 *= -1; }
    arr.push(d1);
    arr.push(d2);
    return arr;
}
I ended up writing a short paper on this
It doesn't include your sorting method (although in practice I think it's similar to my first method, in a roundabout way), but does describe two ways that don't require iteration.
Here is an implementation of twolfe18's algorithm written in ActionScript 3:
var xValues:Array = new Array();
var yValues:Array = new Array();
for (var j:int = 0; j < size; j++) {
    xValues[j] = Math.random();
}
var varX:Number = Util.variance(xValues);
var varianceE:Number = 1 / (r * varX) - varX;
for (var i:int = 0; i < size; i++) {
    yValues[i] = xValues[i] + boxMuller(0, Math.sqrt(varianceE));
}
boxMuller is just a method that generates a random Gaussian with the arguments (mean, stdDev).
size is the size of the distribution.
Sample output
Target p: 0.8
Generated p: 0.04846346291280387
variance of x distribution: 0.0707786253165176
varianceE: 17.589920412141158
As you can see I'm still a ways off. Any suggestions?
This apparently simple question has been messing with my mind since yesterday evening! I looked into the topic of simulating distributions with a dependency, and the best I found is this: simulate dependent random variables. The gist of it is that you can easily simulate 2 normals with a given correlation, and the article outlines a method to transform these non-independent normals; but this won't preserve the correlation. The transformed variables will still be correlated, so to speak, but not at the same value. See the paragraph "Rank correlation coefficients".
Edit: from what I gather from the second part of the article, the copula method would allow you to simulate / generate random variables with rank correlation.
start with the model y = x + e where e is the error (a normal random variable). e should have a mean of 0 and variance k.
long story short, you can write a formula for the expected value of the Pearson in terms of k, and solve for k. note, you cannot randomly generate data with the Pearson exactly equal to a specific value, only with the expected Pearson of a specific value.
i'll try to come back and edit this post to include a closed form solution when i have access to some paper.
EDIT: ok, i have a hand-wavy solution that is probably correct (but will require testing to confirm). for now, assume desired Pearson = p > 0 (you can figure out the p < 0 case). like i mentioned earlier, set your model for Y = X + E (X is uniform, E is normal).
sample to get your x's
compute var(x)
the variance of E should be: (sd(x)/r)^2 - var(x)
generate your y's based on your x's and sample from your normal random variable E
for p < 0, set Y = -X + E. proceed accordingly.
basically, this follows from the definition of Pearson: cov(x,y)/var(x)*var(y). when you add noise to the x's (Y = X + E), the expected covariance cov(x,y) should not change from that with no noise. the var(x) does not change. the var(y) is the sum of var(x) and var(e), hence my solution.
SECOND EDIT: ok, i need to read definitions better. the definition of Pearson is cov(x, y)/(sd(x)*sd(y)). from that, i think the true value of var(E) should be (sd(x)/r)^2 - var(x). see if that works.
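For reference, here is a minimal Java sketch of the Y = X + E recipe with that corrected variance (the class and helper names are mine). Since cov(X, Y) = var(X) and var(Y) = var(X) + var(E), solving r = sd(X)/sqrt(var(X) + var(E)) for var(E) gives (sd(x)/r)^2 - var(x). Note that Y is uniform plus Gaussian noise, so it is no longer exactly uniform, which is the compromise this approach makes:
import java.util.Random;

public class CorrelatedPairs {
    public static void main(String[] args) {
        Random rand = new Random();
        double targetR = 0.8;
        int n = 100_000;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) x[i] = rand.nextDouble();   // uniform X

        double varX = variance(x);
        double varE = varX / (targetR * targetR) - varX;         // (sd(x)/r)^2 - var(x)
        double sdE = Math.sqrt(varE);
        for (int i = 0; i < n; i++) y[i] = x[i] + sdE * rand.nextGaussian();

        System.out.println("target r ~ " + targetR + ", observed r ~ " + pearson(x, y));
    }

    static double mean(double[] a) {
        double s = 0;
        for (double v : a) s += v;
        return s / a.length;
    }

    static double variance(double[] a) {
        double m = mean(a), s = 0;
        for (double v : a) s += (v - m) * (v - m);
        return s / a.length;
    }

    static double pearson(double[] a, double[] b) {
        double ma = mean(a), mb = mean(b), cov = 0;
        for (int i = 0; i < a.length; i++) cov += (a[i] - ma) * (b[i] - mb);
        cov /= a.length;
        return cov / (Math.sqrt(variance(a)) * Math.sqrt(variance(b)));
    }
}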
To get a correlation of 1 both X and Y should be the same, so copy X to Y and you have a correlation of 1. To get a -1 correlation, make Y = 1 - X. (assuming X values are [0,1])
A strange problem demands a strange solution -- here is how I solved it.
-Generate array X
-Clone array X to Create array Y
-Sort array X (you can use whatever method you want to sort array X -- quicksort, heapsort anything stable.)
-Measure the starting Pearson's r with array X sorted and array Y unsorted.
WHILE the correlation is outside of the range you are hoping for
    IF the correlation is too low
        run one iteration of CombSort11 on array Y, then recheck the correlation
    ELSE IF the correlation is too high
        randomly swap two values and recheck the correlation
And that's it! Combsort is the real key; it has the effect of increasing the correlation slowly and steadily. Check out Jason Harrison's demo to see what I mean. To get a negative correlation you can invert the sort or invert one of the arrays after the whole process is complete.
Here is my implementation in AS3:
public static function nextReliableCorrelatedUniforms(r:Number, size:int, error:Number):Array {
var yValues:Array = new Array;
var xValues:Array = new Array;
var coVar:Number = 0;
for (var e:int=0; e < size; e++) { //create x values
xValues.push(Math.random());
}
yValues = xValues.concat();
if(r != 1.0){
xValues.sort(Array.NUMERIC);
}
var trueR:Number = Util.getPearson(xValues, yValues);
while(Math.abs(trueR-r)>error){
if (trueR < r-error){ // combsort11 for y
var gap:int = yValues.length;
var swapped:Boolean = true;
while (trueR <= r-error) {
if (gap > 1) {
gap = Math.round(gap / 1.3);
}
var i:int = 0;
swapped = false;
while (i + gap < yValues.length && trueR <= r-error) {
if (yValues[i] > yValues[i + gap]) {
var t:Number = yValues[i];
yValues[i] = yValues[i + gap];
yValues[i + gap] = t;
trueR = Util.getPearson(xValues, yValues)
swapped = true;
}
i++;
}
}
}
else { // decorrelate
while (trueR >= r+error) {
var a:int = Random.randomUniformIntegerBetween(0, size-1);
var b:int = Random.randomUniformIntegerBetween(0, size-1);
var temp:Number = yValues[a];
yValues[a] = yValues[b];
yValues[b] = temp;
trueR = Util.getPearson(xValues, yValues)
}
}
}
var correlates:Array = new Array;
for (var h:int=0; h < size; h++) {
var pair:Array = new Array(xValues[h], yValues[h]);
correlates.push(pair);}
return correlates;
}
