Gradient Descent in linear regression

Gradient Descent in linear regression - java

I am trying to implement linear regression in java. My hypothesis is theta0 + theta1 * x[i].
I am trying to figure out the value of theta0 and theta1 so that the cost function is minimum.
I am using gradient descent to find out the value -
In the
while(repeat until convergence)
{
calculate theta0 and theta1 simultaneously.
}
what is this repeat until convergence?
I understood that it is the local minimum but what is exact code that I should put in the while loop?
I am very new to machine learning and just began to code basic algos to get better understanding. Any help will be greatly appreciated.

The gradient descent is an iterative approach for minimizing the given function. We start with an initial guess of the solution and we take the gradient of the function at that point. We step the solution in the negative direction of the gradient and we repeat the process. The algorithm will eventually converge where the gradient is zero (which correspond to a local minimum). So your job is to find out the value of theta0 and theta1 that minimization the loss function [for example least squared error].
The term "converges" means you reached in the local minimum and further iteration does not affect the value of parameters i.e. value of theta0 and theta1 remains constant. Lets see an example Note: Assume it is in first quadrant for this explanation.
Lets say you have to minimize a function f(x) [cost function in your case]. For this you need to find out the value of x that minimizes the functional value of f(x). Here is the step by step procedure to find out the value of x using gradient descent method
You choose the initial value of x. Lets say it is in point A in the figure.
You calculate the gradient of f(x) with respect to x at A.
This gives the slope of the function at point A. Since it the function is increasing at A, it will yield a positive value.
You subtract this positive value from initial guess of x and update the value of x. i.e. x = x - [Some positive value]. This brings the x more closer to the D [i.e. the minimum] and reduces the functional value of f(x) [from figure]. Lets say after iteration 1, you reach to the point B.
At point B, you repeat the same process as mention in step 4 and reach the point C, and finally point D.
At point D, since it is local minimum, when you calculate gradient, you get 0 [or very close to 0]. Now you try to update value of x i.e. x = x - [0]. You will get same x [or very closer value to the previous x]. This condition is known as "Convergence".
The above steps are for increasing slope but are equally valid for decreasing slope. For example, the gradient at point G results into some negative value. When you update x i.e x = x - [ negative value] = x - [ - some positive value] = x + some positive value. This increases the value of x and it brings x close to the point F [ or close to the minimum].
There are various approaches to solve this gradient descent. As #mattnedrich said, the two basic approaches are
Use fixed number of iteration N, for this pseudo code will be
iter = 0
while (iter < N) {
theta0 = theta0 - gradient with respect to theta0
theta1 = theta1 - gradient with respect to theta1
iter++
}
Repeat until two consecutive values of theta0 and theta1 are almost identical. The pseudo code is given by #Gerwin in another answer.
Gradient descent is one of the approach to minimize the function in Linear regression. There exists direct solution too. Batch processing (also called normal equation) can be used to find out the values of theta0 and theta1 in a single step. If X is the input matrix, y is the output vector and theta be the parameters you want to calculate, then for squared error approach, you can find the value of theta in a single step using this matrix equation
theta = inverse(transpose (X)*X)*transpose(X)*y
But as this contains matrix computation, obviously it is more computationally expensive operation then gradient descent when size of the matrix X is large.
I hope this may answer your query. If not then let me know.

Gradient Descent is an optimization algorithm (minimization be exact, there is gradient ascent for maximization too) to. In case of linear regression, we minimize the cost function. It belongs to gradient based optimization family and its idea is that cost when subtracted by negative gradient, will take it down the hill of cost surface to the optima.
In your algorithm, repeat till convergence means till you reach the optimal point in the cost-surface/curve, which is determined when the gradient is very very close to zero for some iterations. In that case, the algorithm is said to be converged (may be in local optima, and its obvious that Gradient Descent converges to local optima in many cases)
To determine if your algorithm has converged, you can do following:
calculate gradient
theta = theta -gradientTheta
while(True):
calculate gradient
newTheta = theta - gradient
if gradient is very close to zero and abs(newTheta-Theta) is very close to zero:
break from loop # (The algorithm has converged)
theta = newTheta
For detail on Linear Regression and Gradient Descent and other optimizations you can follow Andrew Ng's notes : http://cs229.stanford.edu/notes/cs229-notes1.pdf.

I do not know very much about the gradient descend, but we learned another way to calculate linear regression with a number of points:
http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line
But if you really want to add the while loop, I recommend the following:
Eventually, theta0 and theta1 will converge to a certain value. This means that, no matter how often you apply the formula, it will always stay in the vicinity of that value. (http://en.wikipedia.org/wiki/(%CE%B5,_%CE%B4)-definition_of_limit).
So applying the code once again will not change theta0 and theta1 very much, only for a very small amount. Or: the difference between theta0(1) and the next theta0(1) is smaller than a certain amount.
This brings us to the following code:
double little = 1E-10;
do {
$theta0 = theta0;
$theta1 = theta1;
// now calculate the new theta0, theta1 simultaneously.
} while(Math.abs(theta0-$theta0) + Math.abs(theta1-$theta1)>little);

You need to do the following inside of the while loop:
while (some condition is not met)
// 1) Compute the gradient using theta0 and theta1
// 2) Use the gradient to compute newTheta0 and newTheta1 values
// 3) Set theta0 = newTheta0 and theta1 = newTheta1
You can use several different criteria for terminating the gradient descent search. For example, you can run gradient descent
For a fixed number of iterations
Until the gradient value at (theta0, theta1) is sufficiently close to zero (indicating a minimum)
Each iteration you should get closer and closer to the optimal solution. That is, if you compute the error (how well your theta0, theta1 model predicts your data) for each iteration it should get smaller and smaller.
To learn more about how to actually write this code, you can refer to:
https://www.youtube.com/watch?v=CGHDsi_l8F4&list=PLnnr1O8OWc6ajN_fNcSUz9k5gF_E9huF0
https://www.youtube.com/watch?v=kjes46vP5m8&list=PLnnr1O8OWc6asSH0wOMn5JjgSlqNiK2C4

Initially assign theta[0] and theta[1] to some arbitrary value and then calculate the value of your hypothesis (theta[0] +theta[1]*x1), and then by the gradient descent algorithm calculate theta[0] and theta[1]. By the algo:
theta[0](new) = theta[0](old) - alpha*[partialderivative(J(theta[0],theta[1]) w.r.t theta[0])
theta[1](new) = theta[1](old) - alpha*[partialderivative(J(theta[0],theta[1]) w.r.t theta[1])
where alpha: learning rate
J(theta[0],theta[1])=cost function
You'll get the new value of theta[0] and theta[1]. You then need to again calculate the new value of your hypothesis. Repeat this process of calculating theta[0] and theta[1], until the difference between theta[i](new) and theta[i](old) is less than 0.001
For details refer: http://cs229.stanford.edu/notes/cs229-notes1.pdf

Related

Least squares Levenburg Marquardt with Apache commons

I'm using the non linear least squares Levenburg Marquardt algorithm in java to fit a number of exponential curves (A+Bexp(Cx)). Although the data is quite clean and has a good approximation to the model the algorithm is not able to model the majority of them even with a excessive number of iterations(5000-6000). For the curves it can model, it does so in about 150 iterations.
LeastSquaresProblem problem = new LeastSquaresBuilder()
.start(start).model(jac).target(dTarget)
.lazyEvaluation(false).maxEvaluations(5000)
.maxIterations(6000).build();
LevenbergMarquardtOptimizer optimizer = new LevenbergMarquardtOptimizer();
LeastSquaresOptimizer.Optimum optimum = optimizer.optimize(problem);}
My question is how would I define a convergence criteria in apache commons in order to stop it hitting a max number of iterations?

I don't believe Java is your problem. Let's address the mathematics.
This problem is easier to solve if you change your function.
Your assumed equation is:
y = A + B*exp(C*x)
It'd be easier if you could do this:
y-A = B*exp(C*x)
Now A is just a constant that can be zero or whatever value you need to shift the curve up or down. Let's call that variable z:
z = B*exp(C*x)
Taking the natural log of both sides:
ln(z) = ln(B*exp(C*x))
We can simplify that right hand side to get the final result:
ln(z) = ln(B) + C*x
Transform your (x, y) data to (x, z) and you can use least squares fitting of a straight line where C is the slope in (x, z) space and ln(B) is the intercept. Lots of software available to do that.

How to efficiently select neighbour in 1-dimensional and n-dimensional space for Simulated Annealing

I would like to use Simulated Annealing to find local minimum of single variable Polynomial function, within some predefined interval. I would also like to try and find Global minimum of Quadratic function.
Derivative-free algorithm such as this is not the best way to tackle the problem, so this is only for study purposes.
While the algorithm itself is pretty straight-forward, i am not sure how to efficiently select neighbor in single or n-dimensional space.
Lets say that i am looking for local minimum of function: 2*x^3+x+1 over interval [-0.5, 30], and assume that interval is reduced to tenths of each number, e.g {1.1, 1.2 ,1.3 , ..., 29.9, 30}.
What i would like to achieve is balance between random walk and speed of convergence from starting point to points with lower energy.
If i simply select random number form the given interval every time, then there is no random walk and the algorithm might circle around. If, on the contrary, next point is selected by simply adding or subtracting 0.1 with the equal probability, then the algorithm might turn into exhaustive search - based on the starting point.
How should i efficiently balance Simulated Annealing neighbor search in single dimensional and n-dimensional space ?

So you are trying to find an n-dimensional point P' that is "randomly" near another n-dimensional point P; for example, at distance T. (Since this is simulated annealing, I assume that you will be decrementing T once in a while).
This could work:
double[] displacement(double t, int dimension, Random r) {
double[] d = new double[dimension];
for (int i=0; i<dimension; i++) d[i] = r.nextGaussian()*t;
return d;
}
The output is randomly distributed in all directions and centred on the origin (notice that r.nextDouble() would favour 45º angles and be centred at 0.5). You can vary the displacement by increasing t as needed; 95% of results will be within 2*t of the origin.
EDIT:
To generate a displaced point near a given one, you could modify it as
double[] displaced(double t, double[] p, Random r) {
double[] d = new double[p.length];
for (int i=0; i<p.length; i++) d[i] = p[i] + r.nextGaussian()*t;
return d;
}
You should use the same r for all calls (because if you create a new Random() for each you will keep getting the same displacements over and over).

In "Numerical Recepies in C++" there is a chapter titled "Continuous Minimization by Simulated Annealing". In it we have
A generator of random changes is inefficient if, when local downhill moves exist, it nevertheless almost always proposes an uphill move. A good generator, we think, should not become inefficient in narrow valleys; nor should it become more and more inefficient as convergence to a minimum is approached.
They then proceed to discuss a "downhill simplex method".

How to move along sinus function at constant speed

I have a Java program that moves an object according to a sinus function asin(bx).
The object moves at a certain speed by changing the x parameter by the time interval x the speed. However, this only moves the object at a constant speed along the x axis.
What I want to do is move the object at a constant tangential speed along the curve of the function. Could someone please help me? Thanks.

Let's say you have a function f(x) that describes a curve in the xy plane. The problem consists in moving a point along this curve at a constant speed S (i.e., at a constant tangential speed, as you put it.)
So, let's start at an instant t and a position x. The point has coordinates (x, f(x)). An instant later, say, at t + dt the point has moved to (x + dx, f(x + dx)).
The distance between these two locations is:
dist = sqrt((x + dx - dx)2 + (f(x+dx) - f(x))2) = sqrt(dx2 + (f(x+dx) - f(x))2)
Now, let's factor out dx to the right. We get:
dist = sqrt(1 + f'(x)2) dx
where f'(x) is the derivative (f(x+dx) - f(x)) /dx.
If we now divide by the time elapsed dt we get
dist/dt = sqrt(1 + f'(x)2) dx/dt.
But dist/dt is the speed at which the point moves along the curve, so it is the constant S. Then
S = sqrt(1 + f'(x)2) dx/dt
Solving for dx
dx = S / sqrt(1 + f'(x)2) dt
which gives you how much you have to move the x-coordinate of the point after dt units of time.

The arc length on a sine curve as a function of x is an elliptic integral of the second kind. To determine the x coordinate after you move a particular distance (or a particular time with a given speed) you would need to invert this elliptic integral. This is not an elementary function.
There are ways to approximate the inverse of the elliptic integral that are much simpler than you might expect. You could combine a good numerical integration algorithm such as Simpson's rule with either Newton's method or binary search to find a numerical root of arc length(x) = kt. Whether this is too computationally expensive depends on how accurate you need it to be and how often you need to update it. The error will decrease dramatically if you estimate the length of one period once, and then reduce t mod the arc length on one period.
Another approach you might try is to use a different curve than a sine curve with a more tractable arc length parametrization. Unfortunately, there are few of those, which is why the arc length exercises in calculus books repeat the same types of curves over and over. Another possibility is to accept a speed that isn't constant but doesn't go too far above or below a specified constant, which you can get with some Fourier analysis.
Another approach is to recognize the arc length parametrization as a solution to a 2-dimensional ordinary differential equation. A first order numerical approximation (Euler's method) might suffice, and I think that's what Leandro Caniglia's answer suggests. If you find that the round off errors are too large, you can use a higher order method such as Runge-Kutta.

Java - How to calculate score exponentially higher based on reverse parabola

I would like the score to be exponentially higher based on how close the value is to x as illustrated in the graph below (and my apologies for the bad quality)
I know the math form y = x^2 + ax + b (well not exactly) however I can't get it to work in such a manner that y (the score) is higher based on close you are to x.

If you want your function to be higher when it is closer to zero, then you should have negative coefficient in front of x^2
y = -1 * x^2 + ...
Take a look at this example at Wolfram-Alpha: parabola

Nearest point on a quadratic bezier curve

I am having some issues calculating the nearest point on a quadratic curve to the mouse position. I have tried a handful of APIs, but have not had any luck finding a function for this that works. I have found an implementation that works for 5th degree cubic bezier curves, but I do not have the math skills to convert it to a quadratic curve. I have found some methods that will help me solve the problem if I have a t value, but I have no idea how to begin finding t. If someone could point me to an algorithm for finding t, or some example code for finding the nearest point on a quadratic curve to an arbitrary point, I'd be very grateful.
Thanks

I can get you started on the math.
I'm not sure how quadratic Bezier is defined, but it must be equivalent
to:
(x(t), y(t)) = (a_x + b_x t + c_x t^2, a_y + b_y t + c_y t^2),
where 0 < t < 1. The a, b, c's are the 6 constants that define the curve.
You want the distance to (X, Y):
sqrt( (X - x(t))^2 + (Y - y(t))^2 )
Since you want to find t that minimizes the above quantity, you take its
first derivative relative to t and set that equal to 0.
This gives you (dropping the sqrt and a factor of 2):
0 = (a_x - X + b_x t + c_x t^2) (b_x + 2 c-x t) + (a_y - Y + b_y t + c_y t^2) ( b_y + 2 c_y t)
which is a cubic equation in t. The analytical solution is known and you can find it on the web; you will probably need to do a bit of algebra to get the coefficients of the powers of t together (i.e. 0 = a + b t + c t^2 + d t^3). You could also solve this equation numerically instead, using for example Newton-Raphson.
Note however that if none of the 3 solutions might be in your range 0 < t < 1. In that case just compute the values of the distance to (X, Y) (the first equation) at t = 0 and t = 1 and take the smallest distance of the two.
EDIT:
actually, when you solve the first derivative = 0, the solutions you get can be maximum distances as well as minimum. So you should compute the distance (first equation) for the solutions you get (at most 3 values of t), as well as the distances at t=0 and t=1 and pick the actual minimum of all those values.

There's an ActionScript implementation that could easily be adapted to Java available online here.
It's derived from an algorithm in the book "Graphics Gems" which you can peruse on Google Books here.

The dumb, naive way would be to iterate through every point on the curve and calculate the distance between that point and the mouse position, with the smallest one being the winner. You might have to go with something better than that though depending on your application.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.