**Internal Covariate Shift**: the distribution of each layer's inputs changes during training as the parameters of the previous layers change. This slows down training by requiring **low learning rates** and **careful parameter initialization**, and makes it notoriously hard to train models with **saturating nonlinearities**.

This problem is usually addressed by normalizing layer inputs.

Batch Normalization (BN) performs the normalization for *each training mini-batch*. BN allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout.

When the input distribution to a learning system changes, it is said to experience *covariate shift*. This is typically handled via domain adaptation.

The following is about **saturating nonlinearities**.

## Intuition

A saturating activation function squeezes its input into a bounded range.

## Definitions

- $f$ is non-saturating iff $\left(\lim_{z \to -\infty} |f(z)| = +\infty\right) \vee \left(\lim_{z \to +\infty} |f(z)| = +\infty\right)$
- $f$ is saturating iff $f$ is not non-saturating.
These definitions are not specific to convolutional neural networks.

## Examples

The Rectified Linear Unit (ReLU) activation function, defined as $f(x) = \max(0, x)$, is non-saturating because $\lim_{z \to +\infty} f(z) = +\infty$.

The sigmoid activation function, defined as $f(x) = \frac{1}{1 + e^{-x}}$, is saturating because it squashes real numbers into the range $[0, 1]$.

The tanh activation function is saturating, as it squashes real numbers into the range $[-1, 1]$.

(figures are from CS231n, MIT License)
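The distinction is easy to verify numerically. A minimal NumPy sketch (the function names here are just for illustration): ReLU keeps growing with its input, while sigmoid and tanh flatten out at their bounds.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# ReLU grows without bound (non-saturating)...
print(relu(np.array([10.0, 1000.0])))     # [10., 1000.]
# ...while sigmoid and tanh get pinned near 1 (saturating).
print(sigmoid(np.array([10.0, 1000.0])))  # both values close to 1.0
print(np.tanh(np.array([10.0, 1000.0])))  # both values close to 1.0
```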

By fixing the distribution of the layer inputs x as training progresses, we expect to improve the training speed.

The network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero means and unit variances, and decorrelated.

**To remove the ill effects of internal covariate shift**, we could apply the same whitening to the inputs of each layer, so that we would take a step towards achieving fixed distributions of inputs.

In Batch Normalization, we’d like to ensure that, for any parameter values, the network always **produces activations with the desired distribution**.

We want to preserve the information in the network by normalizing the activations in a training example relative to the statistics of the entire training data.

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two simplifications:

First, instead of whitening the features in layer inputs and outputs jointly, we normalize each scalar feature independently, making it have zero mean and unit variance. Such normalization speeds up convergence even when the features are not decorrelated.
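This per-feature normalization can be sketched in a few lines of NumPy (the data here is synthetic, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64, 4 features

# Normalize each scalar feature (column) independently: zero mean, unit variance.
# No joint whitening -- the features are not decorrelated.
x_hat = (x - x.mean(axis=0)) / x.std(axis=0)

print(x_hat.mean(axis=0))  # ~0 for every feature
print(x_hat.std(axis=0))   # ~1 for every feature
```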

Second, simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that

**the transformation inserted in the network can represent the identity transform**. To achieve this, we introduce, for **each activation x^(k)**, a pair of parameters gamma^(k), beta^(k), which scale and shift the normalized value. These parameters are learned along with the original model parameters, and

**restore the representation power of the network**. In the batch setting, where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization.
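On the identity-transform point: as noted in the paper, setting gamma^(k) to the standard deviation and beta^(k) to the mean recovers the original activations exactly. A quick NumPy check (synthetic data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=(128, 8))

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var)

# With gamma = sqrt(Var[x]) and beta = E[x], the scale-and-shift
# y = gamma * x_hat + beta undoes the normalization: y == x.
gamma, beta = np.sqrt(var), mu
y = gamma * x_hat + beta
print(np.allclose(y, x))  # True -- the identity transform is representable
```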

Therefore, since we use mini-batches in SGD training,

**each mini-batch produces estimates of the mean and variance of each activation**. This way, the statistics used for normalization can fully participate in the gradient backpropagation.

Note that the use of mini-batches is enabled by the computation of per-dimension variances rather than joint covariances.

The above is the definition of the **Batch Normalizing Transform**.

**Below is the BN algorithm; epsilon is a constant added to the mini-batch variance for numerical stability.**
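A minimal NumPy sketch of the Batch Normalizing Transform over a mini-batch (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for a mini-batch.

    x: (m, d) mini-batch of activations; gamma, beta: (d,) learned parameters.
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize; eps for numerical stability
    return gamma * x_hat + beta            # scale and shift

x = np.random.default_rng(2).normal(size=(32, 3))
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
```

With gamma = 1 and beta = 0, the outputs have approximately zero mean and unit variance per dimension.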

The BN transform can be added to a network to manipulate any activation.

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference.

We want the output to depend only on the input, deterministically. For this, once the network has been trained, we perform the normalization using the **population**, rather than mini-batch, statistics.

Neglecting epsilon, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate $\mathrm{Var}[x] = \frac{m}{m-1} \cdot E_\mathcal{B}[\sigma_\mathcal{B}^2]$, where the expectation is over training mini-batches of size $m$ and $\sigma_\mathcal{B}^2$ are their sample variances.

Using **moving averages** instead, we can track the accuracy of a model as it trains.

Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
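A sketch of that folding into a single linear transform, using NumPy (names and values here are illustrative): with fixed population statistics, BN reduces to y = a·x + b per activation.

```python
import numpy as np

def bn_inference_coeffs(gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Fold fixed population statistics into a single linear map y = a*x + b."""
    a = gamma / np.sqrt(pop_var + eps)
    b = beta - a * pop_mean
    return a, b

gamma, beta = np.array([2.0]), np.array([0.5])
pop_mean, pop_var = np.array([1.0]), np.array([4.0])
a, b = bn_inference_coeffs(gamma, beta, pop_mean, pop_var)

x = np.array([3.0])
y_linear = a * x + b
# Same result as normalizing with population stats, then scaling and shifting:
y_bn = gamma * (x - pop_mean) / np.sqrt(pop_var + 1e-5) + beta
print(np.allclose(y_linear, y_bn))  # True
```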

For convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations; so for a mini-batch of size m and feature maps of size p × q, we use an effective mini-batch of size m' = |B| = m · p · q. We learn a pair of parameters gamma^(k) and beta^(k) per feature map, rather than per activation. The inference process is modified accordingly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
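A NumPy sketch of the convolutional case (the layout (m, c, p, q) and function name are assumptions for illustration): each channel is normalized jointly over the batch and spatial dimensions.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for conv feature maps x of shape (m, c, p, q).

    Each feature map (channel) is normalized jointly over the batch and
    spatial locations, i.e. over an effective mini-batch of size m * p * q,
    with one (gamma, beta) pair per feature map rather than per activation.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = np.random.default_rng(3).normal(size=(8, 4, 5, 5))
y = batch_norm_conv(x, gamma=np.ones(4), beta=np.zeros(4))
```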

By normalizing activations throughout the network, BN prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Moreover, **larger weights lead to smaller gradients**, and Batch Normalization will stabilize the parameter growth.
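The scale-resilience claim can be written out. For a scalar a scaling the weights W of a layer with input u, the paper observes:

```latex
\mathrm{BN}(Wu) = \mathrm{BN}((aW)u)
\qquad
\frac{\partial\, \mathrm{BN}((aW)u)}{\partial u} = \frac{\partial\, \mathrm{BN}(Wu)}{\partial u}
\qquad
\frac{\partial\, \mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a} \cdot \frac{\partial\, \mathrm{BN}(Wu)}{\partial W}
```

The scale a cancels in the normalization, so the gradient flowing to u is unchanged, while the gradient with respect to the scaled weights shrinks by 1/a: larger weights, smaller gradients.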

We further conjecture that Batch Normalization may lead the layer Jacobians to **have singular values close to 1**, which is known to be beneficial for training.

When training with BN, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. This effect is advantageous to the **generalization** of the network. Whereas Dropout is typically used to reduce over-fitting, in a batch-normalized network we found that it can be either removed or reduced in strength.