1. Introduction

_Internal Covariate Shift_: the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down training by requiring low learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.

This problem is usually addressed by normalizing layer inputs.

Batch Normalization (BN) performs the normalization for each training mini-batch. BN allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.

When the input distribution to a learning system changes, it is said to experience covariate shift. This is typically handled via domain adaptation.

A note on saturating nonlinearities:

A saturating activation function squashes its input into a bounded range; a non-saturating one is unbounded in at least one direction. These definitions are not specific to convolutional neural networks.


The Rectified Linear Unit (ReLU) activation function, defined as f(x) = max(0, x), is non-saturating because lim_{x→+∞} f(x) = +∞:

(figure: plot of the ReLU activation)

The sigmoid activation function, defined as f(x) = 1 / (1 + e^(−x)), is saturating because it squashes real numbers into the range [0, 1]:

(figure: plot of the sigmoid activation)

The tanh activation function is saturating, as it squashes real numbers into the range [−1, 1]:

(figure: plot of the tanh activation)

(figures are from CS231n, MIT License)
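The distinction above can be checked numerically. A small sketch (function names are my own): ReLU grows without bound, while sigmoid and tanh squash large inputs toward the edges of their bounded ranges.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0, 100.0])

# ReLU is unbounded above: its output keeps growing with the input.
print(relu(x))        # [0. 0. 10. 100.]

# Sigmoid and tanh saturate: large |x| is squashed to the range edges,
# which is where their gradients vanish.
print(sigmoid(x))     # values approach 0 or 1
print(np.tanh(x))     # values approach -1 or 1
```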

2. Towards reducing internal covariate shift

By fixing the distribution of the layer inputs x as training progresses, we expect to improve the training speed.

The network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero means and unit variances, and decorrelated.

To remove the ill effects of internal covariate shift, we could apply the same whitening to the inputs of each layer; this would be a step towards achieving fixed distributions of inputs.

In Batch Normalization, we’d like to ensure that, for any parameter values, the network always produces activations with the desired distribution.

We want to preserve the information in the network by normalizing the activations in a training example relative to the statistics of the entire training data.

3. Normalization via Mini-Batch Statistics

Since the full whitening of each layer’s inputs is costly and not everywhere differentiable, we make two necessary simplifications: first, we normalize each scalar feature independently rather than jointly whitening the layer inputs; second, we compute the statistics from each mini-batch rather than the whole training set.

Below is the BN algorithm; epsilon is a constant added to the mini-batch variance for numerical stability.

The BN transform can be added to a network to manipulate any activation.
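The BN transform described above can be sketched in a few lines of NumPy (the function name and shapes are illustrative): compute mini-batch mean and variance per activation, normalize, then scale and shift with the learned parameters gamma and beta.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """BN transform for a mini-batch x of shape (m, d):
    normalize each activation using mini-batch statistics,
    then scale and shift with the learned gamma and beta."""
    mu = x.mean(axis=0)                    # mini-batch mean, per activation
    var = x.var(axis=0)                    # mini-batch variance, per activation
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize (eps for stability)
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))     # a shifted, scaled mini-batch
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))   # ~0 for every activation
print(y.std(axis=0))    # ~1 for every activation
```

With gamma = 1 and beta = 0 the output simply has zero mean and unit variance per activation; learning gamma and beta lets the network recover the original activations if that is the optimal thing to do.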

3.1 Training and Inference with Batch-Normalized Networks

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference.

We want the output to depend only on the input, deterministically. For this, once the network has been trained, we perform the normalization using the population, rather than mini-batch, statistics.

Neglecting ε, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate Var[x] = m/(m−1) · E_B[σ_B²], where the expectation is over training mini-batches of size m and σ_B² are their sample variances.

Using moving averages instead, we can track the accuracy of a model as it trains.

Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
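The training/inference split can be sketched as a small class (names, the momentum value, and the moving-average scheme are illustrative, not prescribed by the paper): training uses mini-batch statistics and updates exponential moving averages; inference uses the tracked population statistics, making the output deterministic.

```python
import numpy as np

class BatchNorm1d:
    """Sketch of BN over (m, d) activations: mini-batch statistics during
    training, moving-average population statistics at inference."""
    def __init__(self, d, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)
        self.running_mean, self.running_var = np.zeros(d), np.ones(d)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            m = x.shape[0]
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Track population statistics; the m/(m-1) factor applies the
            # unbiased correction Var[x] = m/(m-1) * E_B[sigma_B^2].
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var * m / (m - 1)
        else:
            # Inference: fixed statistics, so this is just a linear transform.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(3)
rng = np.random.default_rng(1)
for _ in range(200):                                  # "training" batches
    bn.forward(rng.normal(2.0, 4.0, size=(32, 3)), training=True)
out = bn.forward(np.zeros((1, 3)), training=False)    # deterministic inference
print(bn.running_mean)   # ≈ 2, the population mean
print(bn.running_var)    # ≈ 16, the population variance
```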

3.2 Batch-Normalized Convolutional Networks

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use an effective mini-batch of size m′ = m · p · q. We learn a pair of parameters γ^(k) and β^(k) per feature map, rather than per activation. The inference process is modified accordingly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
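A minimal sketch of the convolutional variant (names and shapes are illustrative): for activations of shape (m, c, p, q), each of the c feature maps is normalized over the batch and both spatial axes, i.e. over the effective mini-batch of m′ = m · p · q values, with one γ/β pair per channel.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Per-feature-map BN for x of shape (m, c, p, q): each channel is
    normalized over the effective mini-batch of size m' = m * p * q."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # one mean per feature map
    var = x.var(axis=(0, 2, 3), keepdims=True)   # one variance per feature map
    x_hat = (x - mu) / np.sqrt(var + eps)
    # One gamma/beta pair per feature map, broadcast over all locations,
    # so every element of a given map gets the same linear transform.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=(8, 16, 5, 5))     # m=8, c=16, p=q=5
y = batch_norm_conv(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=(0, 2, 3)))   # ~0 for every channel
print(y.std(axis=(0, 2, 3)))    # ~1 for every channel
```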

3.3 Batch Normalization enables higher learning rates and regularizes the model

By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Moreover, larger weights lead to smaller gradients, so Batch Normalization stabilizes the parameter growth.
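The scale-invariance claim is easy to verify numerically: since BN(aWu) = BN(Wu) for a scalar a, scaling a layer's weights leaves the normalized activations unchanged. A sketch (helper name is illustrative; γ = 1, β = 0, and a tiny eps so the equality holds almost exactly):

```python
import numpy as np

def bn(x, eps=1e-12):
    """Plain BN normalization step (gamma=1, beta=0) over a mini-batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(3)
u = rng.normal(size=(32, 10))   # mini-batch of layer inputs
W = rng.normal(size=(10, 4))    # layer weights

# Scaling the weights by a scalar a scales both the activations and
# their batch statistics, so the normalized output is unchanged.
a = 7.3
print(np.allclose(bn(u @ (a * W)), bn(u @ W)))   # True
```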

We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training.

When training with BN, a training example is seen in conjunction with other examples in the mini-batch, and the network no longer produces deterministic values for a given training example. This effect is advantageous to the generalization of the network. Whereas Dropout is typically used to reduce over-fitting, in a batch-normalized network we found that it can be either removed or reduced in strength.