Linear Regression – Gradients and Optimization

Hi there! In my last post, we explored the basics of Linear Regression. You can find it here:

https://codingmachinelearning.wordpress.com/2017/02/13/linear-regression-understanding-the-details/

In this post, we will see how to optimize the parameters of a simple linear regression. Just a quick recap:

  • Y = MX + C is the equation of the straight line we use to approximate the data.
  • M and C are the tunable parameters – we start with random values of M and C and iteratively converge to a ‘local’ optimal solution.
  • The line of best fit is given by the best parameters M and C, for which the error (the squared loss defined in the previous post) is minimum.

Now, I need you to imagine this. When we say (x1, y1) – for example, (3, 6) – this tuple is a point in the xy-space, right? Now, what about M and C? Together, M and C form a point in the parameter space, and we seek to iteratively converge to the best pair (M, C) that minimizes the error.

Now the next question will be: what does iterative convergence mean? It essentially means the following sequence of steps:

  • Start with random values for M and C
  • For that (M, C), compute the loss
  • Compute the gradient of the loss – (why on earth do this?)
  • Based on the ‘learning rate’ and gradients, update M and C to a more meaningful position in the parameter space.
  • And go on and on!

Starting with random values for M and C and computing the loss are quite straightforward. In this post, we will deal with gradients and optimization.
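As a quick sketch of these two straightforward steps – the data points, variable names, and helper function here are my own illustration, not from the post:

```python
import random

# Toy data generated by the line y = 2x + 1 (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Step 1: start with random values for M and C
M = random.uniform(-1, 1)
C = random.uniform(-1, 1)

def squared_loss(M, C, xs, ys):
    """Mean squared error of the line y = M*x + C over the data."""
    return sum((M * x + C - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Step 2: for that (M, C), compute the loss
loss = squared_loss(M, C, xs, ys)
```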

Gradients

This is the most important part of learning – it tells you in what direction ‘to move’ to reach an optimal solution. In plain, simple terms, the gradient captures the rate of change of a function and points in the direction of its steepest ascent. So if you move in the direction opposite to the gradient, you will converge towards the optimal M and C parameters.
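To make this concrete, here is a tiny one-dimensional illustration (my own toy example, not from the post): for f(w) = w², the gradient 2w points uphill, so repeatedly stepping against it drives w toward the minimum at 0.

```python
def f(w):
    """A simple bowl-shaped function with its minimum at w = 0."""
    return w ** 2

def grad_f(w):
    """Gradient (derivative) of f: points in the direction of steepest ascent."""
    return 2 * w

w = 5.0
for _ in range(50):
    w -= 0.1 * grad_f(w)   # step opposite to the gradient, i.e. downhill
```

After these steps, w sits very close to 0 and f(w) is far smaller than f(5.0) – the same descent idea we apply to M and C below.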

So, we have our cost function and 2 tunable parameters M and C. We have the following gradients to compute:

  • Gradient of cost with respect to M
  • Gradient of cost with respect to C

Once you compute the gradients of the cost with respect to M and C, you have a quantifiable value for the change in cost w.r.t. a unit change in M and C. Based on this approximation of the direction of descent, we update the learnable parameters M and C.
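The derivative math is deliberately left out of this post, but as a sketch: assuming the cost is the mean squared error, the two gradients work out to the standard textbook expressions below. The function name and data layout are my own for illustration.

```python
def gradients(M, C, xs, ys):
    """Gradients of the mean squared error w.r.t. M and C.

    For cost = (1/n) * sum((M*x + C - y)^2):
      d(cost)/dM = (2/n) * sum((M*x + C - y) * x)
      d(cost)/dC = (2/n) * sum(M*x + C - y)
    """
    n = len(xs)
    errors = [M * x + C - y for x, y in zip(xs, ys)]
    dM = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    dC = (2 / n) * sum(errors)
    return dM, dC
```

Note that at the best-fit line the errors vanish, so both gradients become zero – there is no direction left to move in.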

Optimization

Once you have the gradients (the direction to move in the parameter space – the (M, C) space), we need to update the values of M and C to make the next meaningful ‘jump’. For that, we do the following:

  • M = M – learning_rate * gradient of cost w.r.t M
  • C = C – learning_rate * gradient of cost w.r.t C

Now you have a new pair of values (M, C). Perform this update ‘n’ number of times and – voila – you will see convergence.
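Putting all the steps together, the whole loop might look like the sketch below. Mean squared error is assumed as the loss; the data, learning rate, and step count are illustrative, not prescriptive.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Fit y = M*x + C by gradient descent on the mean squared error."""
    M, C = 0.0, 0.0                  # starting point (random works too)
    n = len(xs)
    for _ in range(n_steps):
        errors = [M * x + C - y for x, y in zip(xs, ys)]
        dM = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        dC = (2 / n) * sum(errors)
        M -= learning_rate * dM      # step against the gradient
        C -= learning_rate * dC
    return M, C

# Toy data generated by y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
M, C = gradient_descent(xs, ys)
```

With this data, M and C converge to values very close to 2 and 1. Try a much larger learning rate and you will see the updates overshoot and diverge – the ‘jump’ size matters.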

This method is called gradient descent – it converges to a local optimum. In the next post, we will see what a local optimum is, and how to do our best to avoid the pitfall of converging to a poor one.

It is very intentional that I have left out the math and the derivative part. I understand that math puts people off, and I wanted to keep this simple and clear. Do let me know if you are interested in the details – I will make a separate post for it. And watch out next week for functions, analysis, and more on optimization!
