BatchNorm NN
Table of Contents
- Improves train loss but degrades test loss
It's easier for a neural network to learn when the data is normalized. The most common form of data normalization is
- centering the data on zero
- and giving the data a unit standard deviation
Batch normalization applies this not only to the input data but also to the outputs of the transformations inside the network; a minimal sketch of the computation follows below.
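As a rough illustration, here is a minimal NumPy sketch of what the normalization step computes for one batch (the function name, shapes, and data are arbitrary; gamma and beta stand in for the learnable scale and shift a real layer would train):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature across the batch: zero mean, unit standard deviation.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale (gamma) and shift (beta) let the layer undo the
    # normalization if that turns out to be useful.
    return gamma * x_hat + beta

batch = np.random.randn(32, 64) * 3.0 + 5.0        # 32 samples, 64 features, off-center on purpose
out = batch_norm(batch, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # per-feature mean ≈ 0, std ≈ 1
```

A real BatchNormalization layer additionally tracks running means and variances during training so that it can normalize with those fixed statistics at inference time, when batch statistics are not available.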
1. Why it works
- May not be internal covariate shift: Although the original paper stated that batch normalization works by "reducing internal covariate shift", no one really knows for sure why it helps.
- Helps with gradient propagation: In practice, the main effect of batch normalization appears to be that it helps with gradient propagation, much like residual connections, and thus allows deeper networks to be trained; a toy comparison is sketched after this list.
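One rough way to see the gradient-propagation effect is to compare the first-layer gradient norm of a deep plain MLP with and without batch normalization. The sketch below uses TensorFlow/Keras with arbitrary depth, width, and random data; it is an illustration, not a benchmark, and the exact numbers will differ between runs:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def deep_mlp(use_batchnorm, depth=20, width=64):
    # A deliberately deep stack of Dense layers; depth and width are arbitrary.
    model = tf.keras.Sequential([tf.keras.Input(shape=(32,))])
    for _ in range(depth):
        model.add(layers.Dense(width, use_bias=not use_batchnorm))
        if use_batchnorm:
            model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
    model.add(layers.Dense(1))
    return model

x = tf.constant(np.random.randn(128, 32), dtype=tf.float32)
y = tf.constant(np.random.randn(128, 1), dtype=tf.float32)

for use_bn in (False, True):
    model = deep_mlp(use_bn)
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    first_kernel_grad = tf.norm(grads[0]).numpy()  # gradient reaching the first Dense kernel
    print(f"BatchNorm={use_bn}: first-layer gradient norm = {first_kernel_grad:.3e}")
```

Without normalization the signal typically shrinks layer by layer in a stack this deep, so the gradient reaching the first layer tends to come out much smaller than in the batch-normalized version.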
2. Use before activation
This ordering is also recommended in Deep Learning with Python by François Chollet (p. 256), although he notes that it is still debatable.
The intuitive reason for this approach is that batch normalization centers its inputs on zero, while the ReLU activation uses zero as the pivot for keeping or dropping activated channels, so normalizing before the activation makes fuller use of the ReLU (see the sketch below). That said, this ordering best practice is not exactly critical: if you do convolution, then activation, and then batch normalization, your model will still train, and you won't necessarily see worse results.
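In Keras this ordering looks like the following sketch (the layer sizes, kernel size, and input shape are arbitrary placeholders). Because batch normalization re-centers its input with its own beta parameter, the bias of the preceding convolution is redundant and can be dropped:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, use_bias=False)(inputs)  # no bias: BatchNormalization's beta re-centers anyway
x = layers.BatchNormalization()(x)                # normalize before the activation
x = layers.Activation("relu")(x)                  # ReLU now pivots around a zero-centered input
outputs = layers.GlobalAveragePooling2D()(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

The alternative ordering would simply swap the BatchNormalization and Activation lines; as noted above, both variants train.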
3. BatchNorm affects Gradient Penalty in WGAN
Batch normalization is avoided in the critic (discriminator). Batch normalization creates correlations between samples in the same batch, whereas the gradient penalty is computed on the critic's gradient with respect to each input independently; this mismatch reduces the effectiveness of the penalty, which the WGAN-GP authors confirmed experimentally (they use no batch normalization in the critic, recommending layer normalization as a substitute). A sketch of the penalty term is shown below.
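For context, here is a minimal sketch of the WGAN-GP gradient penalty, assuming a TensorFlow/Keras critic that takes image-shaped inputs (the function and argument names are placeholders). The comment marks the per-sample independence that batch normalization in the critic would break:

```python
import tensorflow as tf

def gradient_penalty(critic, real, fake):
    batch_size = tf.shape(real)[0]
    # Random points on the lines between real and fake samples.
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    grads = tape.gradient(scores, interpolated)
    # This treats each sample independently: scores[i] is assumed to depend only on
    # interpolated[i]. Batch normalization in the critic mixes batch statistics into
    # every score, violating that assumption.
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))  # penalize deviation of the gradient norm from 1
```

In the WGAN-GP paper this term is added to the critic loss with a weight of λ = 10.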