The ordinary gradient descent algorithm takes a lot of time to converge, especially when the dataset is large, say 1,000,000 data points.
Please check out this link for an introduction to gradient descent:
Let's take an example: the gradient descent update rule for linear regression.
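One standard way to write this update, assuming a mean-squared-error loss with learning rate α, is:

w_{j+1} = w_j - \frac{\alpha}{n} \sum_{i=1}^{n} \left( w_j^\top x_i - y_i \right) x_i

where n is the number of training points and (x_i, y_i) is the i-th example.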
Let's say n = 1,000,000.
For each iteration we have to compute over one million data points; that is, to get from w_j to w_{j+1}, the gradient is computed over all 1,000,000 data points.
This is because of the summation over all n data points in the update; hence, the computational cost and time per iteration are extremely high.
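To make that cost concrete, here is a minimal NumPy sketch of full-batch gradient descent for linear regression; the synthetic data, learning rate, and iteration count are illustrative assumptions, not part of the original example.

```python
import numpy as np

# A minimal sketch of full-batch gradient descent for linear regression.
# The synthetic data (X, y), learning rate alpha, and iteration count are
# illustrative assumptions.
n, d = 1_000_000, 10
rng = np.random.default_rng(0)
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.1
for j in range(100):
    # Every single update touches ALL n rows of X, which is what makes
    # each iteration expensive when n is large.
    grad = X.T @ (X @ w - y) / n
    w -= alpha * grad
```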
Why do we need stochastic gradient descent (SGD)?
In stochastic gradient descent, we randomly choose a subset of points from the n (one million) data points at each iteration.
Let's say k is the size of this new subset taken from n (one million).
Here, k will be significantly smaller than n: if n is 1,000,000, then k might be only 1,000.
k ≪ n.
So the new equation is:
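A common way to write it, using B_j to denote the random set of k indices sampled at iteration j (B_j is notation introduced here purely for illustration), is:

w_{j+1} = w_j - \frac{\alpha}{k} \sum_{i \in B_j} \left( w_j^\top x_i - y_i \right) x_i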
Since k (1,000) is significantly smaller than n (1,000,000), each iteration is much cheaper; the algorithm may need more iterations than ordinary gradient descent to converge, but overall it is far more cost- and time-efficient.
k can be any number, for example 10, 100, or 1,000, but it should be significantly smaller than n.
k is called the batch size in SGD (stochastic gradient descent).
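As a sketch, mini-batch SGD for the same linear regression setup might look like the following in NumPy; the batch size k, learning rate, and iteration count are again illustrative assumptions.

```python
import numpy as np

# A minimal sketch of mini-batch SGD for linear regression.
# The batch size k, learning rate alpha, and iteration count are
# illustrative assumptions.
n, d, k = 1_000_000, 10, 1_000
rng = np.random.default_rng(0)
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.1
for j in range(1_000):
    # Each update touches only k randomly chosen rows instead of all n,
    # so a single iteration is roughly n / k times cheaper.
    idx = rng.choice(n, size=k, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / k
    w -= alpha * grad
```

Sampling a fresh random batch each iteration is just one choice; shuffling the data once per epoch and slicing it into consecutive batches is equally common.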
Even though stochastic gradient descent is simple, it is very powerful: it saves a lot of time and computing cost.
There are many variations of stochastic gradient descent used in deep learning, such as Adam and Adagrad.
Thank you for your time.