The ordinary gradient descent algorithm takes a lot of time to converge, especially when the dataset is large, say 1,000,000 data points.
Please check out this link for an introduction to gradient descent:
Let's take an example: the gradient descent update rule for linear regression.
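One standard way to write this update, assuming a mean-squared-error loss with learning rate α, is:

w_{j+1} = w_j - \frac{\alpha}{n} \sum_{i=1}^{n} \left( w_j^\top x_i - y_i \right) x_i

where n is the number of training points and (x_i, y_i) is the i-th example.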
Let's say n = 1,000,000.
For each iteration we have to compute over one million data points; that is, to get from w_j to w_{j+1}, the gradient is computed over all 1,000,000 data points.
This is because of the summation over all n data points in the update; hence, the computational cost and time per iteration are extremely high.
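To make that cost concrete, here is a minimal NumPy sketch of full-batch gradient descent for linear regression; the synthetic data, learning rate, and iteration count are illustrative assumptions, not part of the original example.

```python
import numpy as np

# A minimal sketch of full-batch gradient descent for linear regression.
# The synthetic data (X, y), learning rate alpha, and iteration count are
# illustrative assumptions.
n, d = 1_000_000, 10
rng = np.random.default_rng(0)
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.1
for j in range(100):
    # Every single update touches ALL n rows of X, which is what makes
    # each iteration expensive when n is large.
    grad = X.T @ (X @ w - y) / n
    w -= alpha * grad
```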
Why do we need stochastic gradient descent (SGD)?
In stochastic gradient descent, we randomly choose a subset of points from the n (one million) data points at each iteration.
Let's say k is the size of this new subset taken from n (one million).
Here, k will be significantly smaller than n: if n is 1,000,000, then k might be only 1,000.
k ≪ n.
So the new equation is:
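A common way to write it, using B_j to denote the random set of k indices sampled at iteration j (B_j is notation introduced here purely for illustration), is:

w_{j+1} = w_j - \frac{\alpha}{k} \sum_{i \in B_j} \left( w_j^\top x_i - y_i \right) x_i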
Since k (1,000) is significantly smaller than n (1,000,000), each iteration is much cheaper; the algorithm may need more iterations than ordinary gradient descent to converge, but overall it is far more cost- and time-efficient.
k can be any number, for example 10, 100, or 1,000, but it should be significantly smaller than n.
k is called the batch size in SGD (stochastic gradient descent).
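As a sketch, mini-batch SGD for the same linear regression setup might look like the following in NumPy; the batch size k, learning rate, and iteration count are again illustrative assumptions.

```python
import numpy as np

# A minimal sketch of mini-batch SGD for linear regression.
# The batch size k, learning rate alpha, and iteration count are
# illustrative assumptions.
n, d, k = 1_000_000, 10, 1_000
rng = np.random.default_rng(0)
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.1
for j in range(1_000):
    # Each update touches only k randomly chosen rows instead of all n,
    # so a single iteration is roughly n / k times cheaper.
    idx = rng.choice(n, size=k, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / k
    w -= alpha * grad
```

Sampling a fresh random batch each iteration is just one choice; shuffling the data once per epoch and slicing it into consecutive batches is equally common.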
Even though stochastic gradient descent is simple, it is very powerful: it saves a lot of time and computing cost.
There are many variations of stochastic gradient descent used in deep learning, such as Adam and Adagrad.
Thank you for your time.