Batch, Mini-Batch and Stochastic Gradient Descent for Linear Regression
Implementation and comparison of three basic Gradient Descent variants

1. Introduction
The Gradient Descent algorithm is an iterative first-order optimisation method for finding a function's local minimum (ideally the global one). I've described its basic implementation and behaviour in my other article here. This one focuses on the three main variants, which differ in the amount of data the algorithm uses to calculate the gradient and to make steps.
These 3 variants are:
- Batch Gradient Descent (BGD)
- Stochastic Gradient Descent (SGD)
- Mini-Batch Gradient Descent (mBGD)
In this article, we will see their performance in a simple linear regression task.
A quick recap: a univariate linear function is defined as:

$$ \hat{y} = a_0 + a_1 x $$
It is parametrised by two coefficients:
- a0 - bias
- a1 - function’s slope.
For demonstration purposes we define the following linear function:

$$ y = a_0 + a_1 x + \sigma $$

where σ is white (Gaussian) noise. The code below generates 100 points for the dataset we will be working with.
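A minimal sketch of such a generator (the article's exact coefficients are not shown here, so the values of a0 and a1 below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed so results are reproducible

# Assumed example coefficients; substitute the bias a0 and slope a1 you want
a0_true, a1_true = 2.0, 3.0

N = 100
x = rng.uniform(-5, 5, N)            # 100 input points
noise = rng.normal(0.0, 1.0, N)      # white (Gaussian) noise, sigma
y = a0_true + a1_true * x + noise    # noisy linear observations
```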

The cost function (metric) we want to minimise is the Mean Squared Error, defined as:

$$ MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 $$
In the case of a univariate function it can be written explicitly as:

$$ MSE(a_0, a_1) = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - (a_0 + a_1 x_i) \big)^2 $$
The code below calculates the MSE cost for a given pair of parameters.
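A sketch of that calculation, reusing the x and y arrays generated above:

```python
def mse(a0, a1, x, y):
    """Mean Squared Error of the linear model y_hat = a0 + a1 * x."""
    y_hat = a0 + a1 * x                  # model predictions
    return np.mean((y - y_hat) ** 2)     # average squared residual
```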
Notice that because of the random error (white noise), the minimal value of the cost function for our original coefficients is not 0, and it will vary at each run.
The visualisation below shows this function in the vicinity of the optimum point. We can see it has the shape of an elongated bowl.

To use any gradient descent algorithm we have to calculate the gradient of this function. Because in univariate linear regression the algorithm optimises two coefficients, we have to calculate the partial derivative with respect to each of them separately. Let's notice that:

$$ MSE(a_0, a_1) = \frac{1}{N} \sum_{i=1}^{N} e_i^2, \qquad e_i = y_i - (a_0 + a_1 x_i) $$

Now, using the chain rule we obtain the following result:

$$ \frac{\partial MSE}{\partial a_0} = -\frac{2}{N} \sum_{i=1}^{N} e_i, \qquad \frac{\partial MSE}{\partial a_1} = -\frac{2}{N} \sum_{i=1}^{N} e_i \, x_i $$
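A direct, vectorised translation of these two formulas (the function name is mine, not necessarily the repository's):

```python
def gradient(a0, a1, x, y):
    """Gradient of the MSE with respect to (a0, a1) over all samples."""
    e = y - (a0 + a1 * x)                            # per-sample errors e_i
    return -2.0 * np.mean(e), -2.0 * np.mean(e * x)  # (dMSE/da0, dMSE/da1)
```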
The next sections focus on the algorithms themselves. The code used is available in my GitHub repository.
2. Batch Gradient Descent
In Batch GD the entire dataset is used at each step to calculate the gradient (remember: we don't need to calculate the cost function itself). The graph below shows how it behaves during the optimisation process. It requires 86 iterations to find the global optimum (within a given tolerance).


The trajectory of Batch Gradient Descent looks good: with each step it gets closer to the optimum, and the lateral oscillations shrink over time. This is why it has a good convergence rate.
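For illustration, a minimal Batch GD loop built on the gradient function above; the learning rate, tolerance and iteration cap are assumptions, not the article's exact settings:

```python
def batch_gd(x, y, lr=0.05, tol=1e-6, max_iter=10_000):
    """Batch GD: the gradient is computed over the entire dataset at every step."""
    a0, a1 = 0.0, 0.0                              # starting point (0, 0)
    for k in range(1, max_iter + 1):
        g0, g1 = gradient(a0, a1, x, y)            # full-dataset gradient
        a0, a1 = a0 - lr * g0, a1 - lr * g1        # take the step
        if max(abs(lr * g0), abs(lr * g1)) < tol:  # stop when steps become tiny
            break
    return a0, a1, k
```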
To find its exact convergence rate we have to do some maths. To keep things simple, let's assume our cost function is convex (twice differentiable) and has a Lipschitz continuous gradient with constant L > 0, defined as:

$$ \| \nabla f(x) - \nabla f(y) \| \le L \, \| x - y \| \quad \text{for all } x, y $$
The second assumption restricts the speed of the gradient’s change.
If you can calculate L, then you can derive the so-called "bound on guaranteed progress", which is a step size (learning rate) that guarantees convergence:

$$ t = \frac{1}{L} $$
However, you should not use this value in practice because it is usually very small and convergence is slow. Finding the best learning rate is a huge topic, good for a separate article; just check out things like "backtracking line search", the "Armijo condition" or the "Wolfe conditions".
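For a flavour of those methods, here is a minimal, generic sketch of backtracking line search with the Armijo condition (the parameter values are conventional defaults, not prescriptions from the article):

```python
def backtracking_step(f, grad_f, x, t0=1.0, alpha=0.3, beta=0.8):
    """One GD step where the step size is found by backtracking line search."""
    g = grad_f(x)
    t = t0
    # Armijo (sufficient decrease) condition:
    # f(x - t*g) must drop below f(x) - alpha * t * ||g||^2
    while f(x - t * g) > f(x) - alpha * t * np.dot(g, g):
        t *= beta                      # shrink the step and retry
    return x - t * g                   # take the accepted step
```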
Assuming fixed step size the convergence rate depends on the function’s convexity.
For (weakly) convex functions the convergence rate is [1]:

$$ f(x^{(k)}) - f(x^*) = O\!\left( \frac{1}{k} \right) $$
where k is the number of iterations. This rate is called "sub-linear convergence", and for a given tolerance ε it needs the following number of iterations to converge [1]:

$$ k = O\!\left( \frac{1}{\varepsilon} \right) $$
For strongly convex functions the rate is [1]:

$$ f(x^{(k)}) - f(x^*) = O\!\left( \gamma^{k} \right) $$
where 0 < γ < 1 and k is the number of iterations. This rate is called "linear convergence", and it means that Batch GD is exponentially fast. For a given tolerance ε it needs the following number of iterations to converge [1]:

$$ k = O\!\left( \log \frac{1}{\varepsilon} \right) $$
Pros and Cons of Batch Gradient Descent:
Pros:
- A simple algorithm that just needs to compute a gradient
- A fixed learning rate can be used during training and BGD can be expected to converge
- Very quick convergence rate to the global minimum if the loss function is convex (and to a local minimum for non-convex functions)
Cons:
- Even with a vectorised implementation, it may be slow when datasets are huge (case of Big Data)
- Not all problems are convex so gradient descent algorithms are not universal
Typical use cases:
- Small datasets that fit into computer memory
- Problems with convex cost functions (like OLS, Logistic Regression, etc.)
3. Stochastic Gradient Descent
The idea of Stochastic Gradient Descent is not to use the entire dataset to calculate the gradient but only a single sample. The objective is to speed up the process. There are two main rules for selecting a sample:
- randomised rule — randomly chosen sample (repetitions possible)
- cyclic rule — each sample once (no or minimised number of repetitions)
The randomised rule is much more common.
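A minimal sketch of SGD with the randomised rule, reusing the rng and the error formula from above (the starting point (0, 0) and learning rate 0.05 match the experiment below; the iteration count is an assumption):

```python
def sgd(x, y, lr=0.05, n_iter=1_000):
    """SGD with the randomised rule: one randomly chosen sample per step."""
    a0, a1 = 0.0, 0.0                  # starting point (0, 0)
    for _ in range(n_iter):
        i = rng.integers(len(x))       # randomised rule (repetitions possible)
        e = y[i] - (a0 + a1 * x[i])    # error on the single chosen sample
        a0 += lr * 2.0 * e             # step along minus the sample gradient
        a1 += lr * 2.0 * e * x[i]
    return a0, a1
```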
The graph below shows how SGD converges to the final solution (an exemplary run). The red dot indicates the sample chosen for the given step's calculations.

Due to its stochastic nature, each run requires a different number of steps to arrive at the global minimum. Below is a histogram of the iterations required for 100 runs with the same starting point (0, 0) and the same learning rate (0.05).

In contrast to Batch GD, it doesn't converge directly towards the solution because it uses only one sample per iteration, so the steps are quite noisy. However, it is much more computationally efficient: each step puts less load on the CPU/GPU. This effect is barely visible on small datasets (like this one) but has a huge impact on performance when dealing with big data.
Below is a graph showing the trajectory of the SGD steps from the example above.

The convergence rate for Stochastic Gradient Descent with a fixed step size is [1]:

$$ f(x^{(k)}) - f(x^*) = O\!\left( \frac{1}{\sqrt{k}} \right) $$
This means SGD does not have the linear convergence rate of Batch Gradient Descent, which simply means it needs more iterations (but not necessarily more computational time).
Pros and cons of Stochastic Gradient Descent:
Pros:
- Converges quicker (less time) than Batch GD for huge datasets
- Can escape from local minima
Cons:
- Steps are noisier, so SGD may require more iterations to converge within a given tolerance
- It can "bounce around" the global optimum, so it may require a bigger tolerance than Batch GD
Typical use case:
- It is the basis of more advanced stochastic algorithms used for training artificial neural networks
4. Mini-batch Gradient Descent
Mini-batch Gradient Descent is an approach that strikes a balance between pure SGD and Batch Gradient Descent. The idea is to use a subset of observations for each gradient update. The number of points used at each step is called the batch size, and each full pass over the whole dataset is called an epoch. The animation below shows the convergence process, highlighting the points used at each step (batch size of 10).

The trajectory is still noisy but moves much more steadily toward the minimum.

The convergence rate of this algorithm lies somewhere between those of BGD and SGD and is [1]:

$$ O\!\left( \frac{1}{\sqrt{bk}} + \frac{1}{k} \right) $$

where b is the batch size.
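For illustration, a sketch of a single mini-batch epoch under these definitions, reusing gradient and rng from above (shuffling once per epoch is one common scheme, not necessarily the article's exact one):

```python
def minibatch_gd_epoch(a0, a1, x, y, lr=0.05, batch_size=10):
    """One epoch of mini-batch GD: shuffle, then one update per mini-batch."""
    idx = rng.permutation(len(x))                # shuffle sample indices
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]        # indices of the current mini-batch
        g0, g1 = gradient(a0, a1, x[b], y[b])    # gradient on the subset only
        a0, a1 = a0 - lr * g0, a1 - lr * g1
    return a0, a1
```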
Pros and Cons of mini-Batch Gradient Descent:
Pros:
- A good balance between BGD and SGD in terms of efficiency
- Easily fits into computer memory
- Can escape from local minima
Cons:
- It can still "bounce around" the global optimum, so it may require a bigger tolerance than Batch GD (but a smaller one than SGD)
- One more hyper-parameter to optimise: the batch size
Typical use case:
- It is a very common algorithm for training deep neural networks
5. Summary
We went through three basic variants of the Gradient Descent algorithm. Contemporary ML uses more advanced and efficient versions, but they still build on the fundamental ideas described here. Further modifications encompass things like adaptive learning rates, various kinds of momentum (e.g. Nesterov), averaging, etc.
Some very popular implementations are Adam, RMSprop and Adagrad.
There is an ongoing research effort to improve them further for non-convex functions (deep neural networks), which includes various ideas on how to pre-process data.
If you want to learn more details about the topics in this article, I highly encourage you to check out these readings:
- Stochastic Gradient Descent by Ryan Tibshirani from UC Berkeley
- Convex Optimization by Ryan Tibshirani from UC Berkeley
- Accelerating deep neural network training with inconsistent stochastic gradient descent
- Non-convergence of stochastic gradient descent in the training of deep neural networks
- Convergence analysis of distributed stochastic gradient descent with shuffling