BatchNorm NN
Table of Contents
- Improves train loss but degrades test loss
It's easier for a neural network to learn when the data is normalized. The most common form of data normalization is
- centering the data on zero
- and giving the data a unit standard deviation
Batch normalization applies this not only to the input data but also to the outputs of the transformations inside the network; a minimal sketch of the computation follows below.
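As a rough illustration, here is a minimal NumPy sketch of what the normalization step computes for one batch (the function name, shapes, and data are arbitrary; gamma and beta stand in for the learnable scale and shift a real layer would train):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature across the batch: zero mean, unit standard deviation.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale (gamma) and shift (beta) let the layer undo the
    # normalization if that turns out to be useful.
    return gamma * x_hat + beta

batch = np.random.randn(32, 64) * 3.0 + 5.0        # 32 samples, 64 features, off-center on purpose
out = batch_norm(batch, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # per-feature mean ≈ 0, std ≈ 1
```

A real BatchNormalization layer additionally tracks running means and variances during training so that it can normalize with those fixed statistics at inference time, when batch statistics are not available.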
1. Why it works
- May not be internal covariate shift: Although the original paper stated that batch normalization works by "reducing internal covariate shift", no one really knows for sure why it helps.
- Helps with gradient propagation: In practice, the main effect of batch normalization appears to be that it helps with gradient propagation, much like residual connections, and thus allows deeper networks to be trained; a toy comparison is sketched after this list.
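One rough way to see the gradient-propagation effect is to compare the first-layer gradient norm of a deep plain MLP with and without batch normalization. The sketch below uses TensorFlow/Keras with arbitrary depth, width, and random data; it is an illustration, not a benchmark, and the exact numbers will differ between runs:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def deep_mlp(use_batchnorm, depth=20, width=64):
    # A deliberately deep stack of Dense layers; depth and width are arbitrary.
    model = tf.keras.Sequential([tf.keras.Input(shape=(32,))])
    for _ in range(depth):
        model.add(layers.Dense(width, use_bias=not use_batchnorm))
        if use_batchnorm:
            model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
    model.add(layers.Dense(1))
    return model

x = tf.constant(np.random.randn(128, 32), dtype=tf.float32)
y = tf.constant(np.random.randn(128, 1), dtype=tf.float32)

for use_bn in (False, True):
    model = deep_mlp(use_bn)
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    first_kernel_grad = tf.norm(grads[0]).numpy()  # gradient reaching the first Dense kernel
    print(f"BatchNorm={use_bn}: first-layer gradient norm = {first_kernel_grad:.3e}")
```

Without normalization the signal typically shrinks layer by layer in a stack this deep, so the gradient reaching the first layer tends to come out much smaller than in the batch-normalized version.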
2. Use before activation
This ordering is also recommended in Deep Learning with Python by François Chollet (p. 256), although he notes that it is still debatable.
The intuitive reason for this approach is that batch normalization centers its inputs on zero, while the ReLU activation uses zero as the pivot for keeping or dropping activated channels, so normalizing before the activation makes fuller use of the ReLU (see the sketch below). That said, this ordering best practice is not exactly critical: if you do convolution, then activation, and then batch normalization, your model will still train, and you won't necessarily see worse results.
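In Keras this ordering looks like the following sketch (the layer sizes, kernel size, and input shape are arbitrary placeholders). Because batch normalization re-centers its input with its own beta parameter, the bias of the preceding convolution is redundant and can be dropped:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, use_bias=False)(inputs)  # no bias: BatchNormalization's beta re-centers anyway
x = layers.BatchNormalization()(x)                # normalize before the activation
x = layers.Activation("relu")(x)                  # ReLU now pivots around a zero-centered input
outputs = layers.GlobalAveragePooling2D()(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

The alternative ordering would simply swap the BatchNormalization and Activation lines; as noted above, both variants train.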
3. BatchNorm affects Gradient Penalty in WGAN
Batch normalization is avoided in the critic (discriminator). Batch normalization creates correlations between samples in the same batch, whereas the gradient penalty is computed on the critic's gradient with respect to each input independently; this mismatch reduces the effectiveness of the penalty, which the WGAN-GP authors confirmed experimentally (they use no batch normalization in the critic, recommending layer normalization as a substitute). A sketch of the penalty term is shown below.
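For context, here is a minimal sketch of the WGAN-GP gradient penalty, assuming a TensorFlow/Keras critic that takes image-shaped inputs (the function and argument names are placeholders). The comment marks the per-sample independence that batch normalization in the critic would break:

```python
import tensorflow as tf

def gradient_penalty(critic, real, fake):
    batch_size = tf.shape(real)[0]
    # Random points on the lines between real and fake samples.
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    grads = tape.gradient(scores, interpolated)
    # This treats each sample independently: scores[i] is assumed to depend only on
    # interpolated[i]. Batch normalization in the critic mixes batch statistics into
    # every score, violating that assumption.
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))  # penalize deviation of the gradient norm from 1
```

In the WGAN-GP paper this term is added to the critic loss with a weight of λ = 10.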