Stochastic gradient descent

Nikhil.M
2 min read · Aug 19, 2021

The ordinary gradient descent algorithm takes a lot of time to converge, especially when the dataset is large, say 1,000,000 data points…

Please check out this link for gradient descent:

Let's take an example. Here is the gradient descent update rule for linear regression (with a squared-error loss; the constant factor is absorbed into the learning rate α):

wj+1 = wj − α · (1/N) · Σ (i = 1 to N) (wj · xi − yi) · xi

Let's say N = 1,000,000.

For each iteration we have to compute over one million data points, i.e., to go from wj to wj+1, all 1,000,000 data points are processed in every single iteration.

It's because of this part of the equation: the summation Σ (i = 1 to N), which runs over every one of the N data points.

Hence, the computation cost and time will be extremely high.
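To make this concrete, here is a minimal NumPy sketch of one ordinary (full-batch) gradient descent update for linear regression. The names X, y, w, and alpha are mine, just for illustration; the point is that every single update has to touch all N rows of the dataset.

```python
import numpy as np

def batch_gradient_descent_step(w, X, y, alpha=0.01):
    """One ordinary gradient descent update for linear regression
    (squared-error loss). The whole dataset X is used in every step."""
    N = X.shape[0]
    predictions = X @ w                                 # N predictions
    gradient = -(2.0 / N) * X.T @ (y - predictions)     # sum runs over all N points
    return w - alpha * gradient

# Even one update touches every row of a 1,000,000-point dataset.
N, d = 1_000_000, 10
X = np.random.randn(N, d)
y = np.random.randn(N)
w = np.zeros(d)
w = batch_gradient_descent_step(w, X, y)
```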

Why do we need stochastic gradient descent (SGD)?

In stochastic gradient descent, we randomly choose a subset of points from the one million data points (n) for each iteration.

Let's say k is the size of the new dataset that we sample from n (one million).

Here, k will be significantly smaller than n: if n is 1,000,000, then k might be only 1,000.

k << n.

So the new equation is:

wj+1 = wj − α · (1/k) · Σ (i = 1 to k) (wj · xi − yi) · xi

where the k points are drawn randomly from the full dataset at each iteration.

Since k (1,000) is significantly smaller than n (1,000,000), this equation will take more iterations to converge compared to normal gradient descent; however, each iteration is far more cost- and time-efficient.

k can be any number; it could be 10, 100, or 1,000. But it should be significantly smaller than n.

k is called the batch size in SGD (stochastic gradient descent).
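Here is a hedged sketch of how one mini-batch SGD step might look, using the same toy setup as before (X, y, w, alpha, and true_w are my own illustrative names). The only change from the full-batch version is that the gradient is estimated on k randomly sampled rows, with batch_size playing the role of k.

```python
import numpy as np

def sgd_step(w, X, y, alpha=0.01, batch_size=1000):
    """One SGD update: the gradient is estimated on a random batch of
    k = batch_size points instead of all n points."""
    n = X.shape[0]
    idx = np.random.choice(n, size=batch_size, replace=False)  # pick k << n points at random
    X_batch, y_batch = X[idx], y[idx]
    predictions = X_batch @ w
    gradient = -(2.0 / batch_size) * X_batch.T @ (y_batch - predictions)
    return w - alpha * gradient

# Each update now touches only 1,000 rows instead of 1,000,000, so a single
# iteration is roughly 1000x cheaper, even though more iterations are
# usually needed to converge.
n, d = 1_000_000, 10
X = np.random.randn(n, d)
true_w = np.random.randn(d)
y = X @ true_w + 0.1 * np.random.randn(n)

w = np.zeros(d)
for _ in range(100):
    w = sgd_step(w, X, y)
```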

Even though stochastic gradient descent is simple, it is very powerful; it saves a lot of time and computing cost.

There are many variations of stochastic gradient descent used in deep learning, such as Adam and Adagrad.

Thank you for your time.
