Before we dive into gradient descent, let's go over a few terms. First, let's understand what maxima and minima are.
In the above image we can see the global maximum, the global minimum, local maxima, and local minima. So what are these terms? A local minimum is a point that is the smallest within its neighborhood, and a local maximum is a point that is the largest within its neighborhood. The global maximum is the largest value of the whole function, and the global minimum is the smallest value of the whole function f(x).
Our main agenda here is to find the global minimum using gradient descent. If we can find a global minimum, it is just as easy to find a global maximum, because
argmin(f(x)) = argmax(-f(x)). Let's see the below image for more clarity.
For y = x²
Here the minimum value of x² is 0.
For y = -x²
For -x² the maximum value is 0.
Hence argmin(f(x)) = argmax(-f(x)).
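We can sanity-check this identity numerically. Here is a minimal sketch that evaluates y = x² on a grid of candidate points (the grid range and resolution are illustrative assumptions) and confirms that the x minimizing f is the same x that maximizes -f:

```python
import numpy as np

# Candidate values of x (illustrative grid; step of 0.01 includes x = 0).
xs = np.linspace(-3, 3, 601)
f = xs ** 2

# argmin of f and argmax of -f pick out the same x, namely 0.
print(xs[np.argmin(f)])   # 0.0
print(xs[np.argmax(-f)])  # 0.0
```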
Why do we need gradient descent?
Because it is much easier to evaluate the derivative of an equation like this than to solve it analytically for the minimum:
Gradient descent, on the other hand, is an iterative algorithm, and iterative methods work well on modern computers.
Let's take x0, where x0 is nothing but an initial guess at the optimal minimum.
Next, we make the next best guess x1, which is closer to the minimum (which is 0) than x0, by using gradient descent.
And then my next best guess is x2, then x3, x4, and so on. Each one will be closer to the minimum than the last. For example: x2 will be closer to the minimum than x1, x3 will be closer than x2, and so on.
The gradient descent formula is as follows:
x1 = x0 - r * (df/dx at x0)
Here x0 is the first value we took as our guess, and from it we move to x1, which is a step closer to the optimal x, that is, the minimum.
For the next iteration the formula becomes:
x2 = x1 - r * (df/dx at x1)
In general, the gradient descent equation is:
xi+1 = xi - r * (df/dx at xi)
This is also called the update equation.
Here, xi+1 is the next step, xi is the current step, and r is called the step size or learning rate.
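In code, the update equation is a single line. Here is a minimal sketch, assuming f(x) = x² as the example function, so its derivative, worked out by hand, is df/dx = 2x (both choices are illustrative, not fixed by the formula):

```python
def df_dx(x):
    # Derivative of f(x) = x**2, worked out by hand.
    return 2 * x

def update(x_i, r):
    # The update equation: xi+1 = xi - r * (df/dx at xi).
    return x_i - r * df_dx(x_i)
```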
For our convenience, let r = 1.
Here the slope at x0 is positive, because tan(Θ) is positive;
that is, df/dx is positive.
Plugging this into the general equation,
x1 = x0 - 1 * (df/dx at x0) = x0 - (a positive value),
which will ultimately result in x0 > x1.
Hence, x1 will be closer to the minimum than x0. Similarly we find x2, x3, x4, and we do this iteratively until the difference between xi+1 and xi is negligible.
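Putting the update and the stopping rule together gives a loop like the sketch below. One caveat: for f(x) = x², taking r = 1 literally makes the iterate jump from x to -x forever (x - 1 * 2x = -x), so this sketch assumes a smaller step size r = 0.1; the tolerance 1e-6 is also an arbitrary illustrative choice:

```python
def df_dx(x):
    return 2 * x  # derivative of f(x) = x**2

def gradient_descent(x0, r=0.1, tol=1e-6):
    x_i = x0
    while True:
        x_next = x_i - r * df_dx(x_i)   # the update equation
        if abs(x_next - x_i) < tol:     # difference is negligible: stop
            return x_next
        x_i = x_next

print(gradient_descent(x0=4.0))  # converges toward the minimum at x = 0
```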
When is the difference between xi+1 and xi negligible?
As we move towards the minimum, the slope gets smaller with each iteration.
Hence, df/dx is smaller for each iteration, so the step r * (df/dx) shrinks too. Therefore, we get a smaller difference between xi+1 and xi in each iteration.
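We can watch this happen by printing the slope and the step at every iteration, under the same sketch assumptions as before (f(x) = x², r = 0.1, starting at x0 = 4):

```python
def df_dx(x):
    return 2 * x  # derivative of f(x) = x**2

x_i, r = 4.0, 0.1
for step in range(5):
    slope = df_dx(x_i)
    x_next = x_i - r * slope
    print(f"x = {x_i:.4f}, slope = {slope:.4f}, diff = {x_next - x_i:.4f}")
    x_i = x_next  # both the slope and the diff shrink every iteration
```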
Let's try to look at it from the negative side: imagine x0 being negative.
Here, the slope at x0 is negative, because tan(Θ) is negative. Therefore, df/dx is negative.
Let's assume that negative value is -1.
Then xi+1 = xi - 1 * (-1) = xi + 1. Therefore, xi+1 > xi.
x1 will be closer to the minimum than x0. Similarly we find x2, x3, x4, and we do this iteratively until the difference between xi+1 and xi is negligible.
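Starting the same sketch from a negative x0 shows the iterate climbing toward the minimum, since subtracting a negative slope increases x (same illustrative assumptions: f(x) = x², r = 0.1):

```python
def df_dx(x):
    return 2 * x  # derivative of f(x) = x**2; negative when x < 0

x_i, r = -4.0, 0.1
for step in range(5):
    x_i = x_i - r * df_dx(x_i)  # subtracting a negative value increases x_i
    print(x_i)  # -3.2, -2.56, -2.048, ... climbing toward 0
```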
Thank you for your time.