Batch Normalization

Why is Batch Normalization considered a Regularization technique?

Regularization is something that helps prevent overfitting; how does batch normalization do that?

Normalization of Input Space

We know that a model learns better when its input is normalized; that is why we always normalize the input data. Normalized data represents the overall distribution rather than specific data points. That takes care of the input itself, but what about the intermediate features inside the layers of a neural network? Batch norm normalizes these features as well, giving the network a better feature representation so that it learns weights in a more general sense, somewhat agnostic to the exact data points.

Hence, normalization makes weight learning more general, which is why batch normalization also acts as regularization. It affects the weights indirectly, rather than directly as dropout or weight decay do.

But what if, for some layer, the features should not be normalized? For that case we have gamma and beta as learnable parameters: the network learns them in order to un-normalize the normalized features as much as needed. So on one hand we normalize for a more general representation, and on the other hand we give the network the ability to transform the normalized features as needed to preserve its representational ability.
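To make these two steps concrete, here is a minimal NumPy sketch (a simplification, assuming a plain feed-forward setting: per-feature statistics over the batch axis, no running averages) of the normalization followed by the learnable scale and shift:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: (features,) learnable parameters."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalization step
    return gamma * x_hat + beta            # learned scale/shift can partially undo it

# Features come out roughly zero-mean, unit-variance before gamma/beta act.
x = np.random.randn(64, 8) * 3.0 + 5.0
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```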

Noise Induction Behavior

One more reason batch norm is said to act as regularization is that it adds noise to training: each sample appears multiple times, each time within a different batch of other samples chosen at random. This means the statistics computed to normalize the batch containing this sample are slightly different each time, adding a form of non-determinism to the behavior of the network on this sample during training. Batch norm is similar to dropout in the sense that it multiplies each hidden unit by a random value at each step of training. In this case, the random value is the standard deviation of all the hidden units in the minibatch. Because different examples are randomly chosen for inclusion in the minibatch at each step, the standard deviation fluctuates randomly. Batch norm also subtracts a random value (the mean of the minibatch) from each hidden unit at each step. Both of these sources of noise mean that every layer has to learn to be robust to a lot of variation in its input, just like with dropout.
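A tiny illustration of that fluctuation (a made-up setup, just to show the effect): the same sample receives a slightly different normalized value depending on which other samples happen to share its minibatch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
sample = data[0]

for _ in range(3):
    # Build a random minibatch of 32 that contains our fixed sample.
    batch = np.vstack([sample[None, :], data[rng.choice(len(data), size=31)]])
    mu, sigma = batch.mean(axis=0), batch.std(axis=0)
    print((sample - mu) / sigma)  # changes from batch to batch
```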

But the normalization of the input space is believed to be a more convincing reason for why batch norm works than the noise-inducing behavior.

Advantages of Batch Normalization:

  • Normalized data makes the network a bit more robust to inter-distributional change.

  • Allows increasing the learning rate.

  • Reduces the need for dropout and other regularization techniques.

  • Achieves higher accuracy.

  • Mitigates the problem of covariate shift: it weakens the coupling between the parameters of the initial layers and those of the later layers, so changes to the layers are more independent of each other, which speeds up learning.

What is Batch Normalization

There is debate about whether to use the batch norm layer before or after the activation function. There are many convincing arguments that the batch norm layer should be used after the activation function, whereas in the original paper the batch norm layer is used before the activation.
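For reference, a minimal PyTorch sketch of the two orderings under debate (layer sizes are arbitrary placeholders):

```python
import torch.nn as nn

# Original-paper ordering: Conv -> BN -> activation
conv_bn_relu = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),  # bias is redundant right before BN
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

# Alternative ordering: Conv -> activation -> BN
conv_relu_bn = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(32),
)
```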

Theory Behind Batch Normalization

Fusing Batch-Norm parameters with Conv parameters at Run-time
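The idea follows from the affine form of BN shown in the next section: at inference time, $\gamma \frac{(W * x + b) - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ is itself a convolution with rescaled weights and a shifted bias. A minimal PyTorch sketch (an illustrative implementation, assuming a `Conv2d` immediately followed by a `BatchNorm2d` with its default affine parameters, inference only):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to conv followed by bn (eval mode only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))   # rescale each output channel
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)      # fold mean/beta into the bias
    return fused
```

This is the same $\gamma'$/$\beta'$ folding described in the next section, just absorbed into the convolution so a single layer does both operations at run time.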

Fixed or Frozen Batch-Norm during Fine-Tuning

During fine-tuning, if the batch size is too small for batch norm to estimate reliable statistics, it can degrade performance. In that case we convert the BN layer into a simple affine layer as follows:

$$\hat x = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \times \gamma + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance estimated during training using a moving average, and $\gamma$ and $\beta$ are the learned parameters. All of these parameters can be combined into a single affine transform:

$$\hat x = \gamma \times \frac{x}{\sqrt{\sigma^2 + \epsilon}} + \beta - \frac{\mu\gamma}{\sqrt{\sigma^2 + \epsilon}} = \gamma' x + \beta'$$

So during fine-tuning we have parameters $\gamma'$ and $\beta'$ whose values are $\frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}$ and $\beta - \frac{\mu\gamma}{\sqrt{\sigma^2 + \epsilon}}$ respectively. As such, the BN layers become linear activations with constant offsets and scales ($\beta'$ and $\gamma'$), and the BN statistics ($\mu$ and $\sigma^2$) are not updated during fine-tuning. This is basically a fixed layer behaving like an activation function, specifically an affine transform with parameters $\gamma'$ and $\beta'$.

NOTE: there is no batch normalization left here, i.e. the activations are no longer normalized with batch statistics.

Why and When to do this?

When the batch statistics are not a good representation of the actual (population) statistics, the normalization differs between training and testing, and this mismatch leads to different training and testing performance.

So, if your fine-tuning dataset is small and different from the pre-training dataset, your fine-tuning batch statistics may not match the population statistics, because the moving average of the population statistics is dominated by the pre-training data. In this case you simply freeze the batch-norm layers and there is no statistics mismatch at all.

But if you have a large fine-tuning dataset and training runs for long enough, the final population statistics will match the fine-tuning dataset, so you may not need to freeze the batch-norm layers. The same holds if your fine-tuning dataset is similar to the pre-training dataset.

Things that can be done

During fine-tuning we can, if we want, make $\gamma'$ and $\beta'$ learnable parameters that are initialized with the above values, instead of keeping them fixed.
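A minimal PyTorch sketch (assuming a generic `model` containing `nn.BatchNorm2d` layers, not any specific codebase) of freezing BN during fine-tuning: eval mode makes the layers use and stop updating the running statistics, and the affine parameters can optionally be frozen too.

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module, freeze_affine: bool = True):
    """Freeze all batch-norm layers of `model` for fine-tuning."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                       # use running mean/var, stop updating them
            if freeze_affine:
                for p in m.parameters():   # gamma (weight) and beta (bias)
                    p.requires_grad_(False)

# Caveat: model.train() switches BN layers back to training mode, so call
# freeze_batchnorm(model) again after every model.train() in the training loop.
```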

Issues with Batch Normalization

  • Different parameters are used to compute the normalized output during training and inference.

    • How can we be sure that using the estimated (population) mean and variance during testing is better than using the batch mean and variance? Testing and training behave similarly only if the batch estimates are close to the estimated population statistics that will be used at test time.

  • BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation.

    • BN requires a sufficiently large batch size (e.g. 32 per worker) to work well. A small batch leads to inaccurate estimation of the batch statistics, and reducing BN's batch size increases the model error dramatically.

    • Group Normalization was introduced to address this (see the sketch after this list).

  • Non-i.i.d. minibatches can have a detrimental effect on models with batch norm. For example, in a metric-learning scenario with a minibatch of size 32, we may randomly select 16 labels and then choose 2 examples for each of these labels; the examples interact at every layer and may cause the model to overfit to the specific distribution of minibatches and suffer when used on individual examples.

  • The pre-computed statistics may also change when the target data distribution changes. These issues lead to inconsistency at training, transferring, and testing time.
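Regarding the small-batch issue above, Group Normalization computes statistics per sample over groups of channels, so it does not depend on the batch size. A minimal PyTorch sketch (channel and group counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 32, 16, 16)  # tiny batch of 2 -> noisy batch statistics for BN

bn = nn.BatchNorm2d(32)                            # normalizes across the batch
gn = nn.GroupNorm(num_groups=8, num_channels=32)   # normalizes per sample, per channel group

print(bn(x).shape, gn(x).shape)  # same shapes, different normalization statistics
```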

Caveats


During implementation, many people simply set the learnable batch-norm parameters $\gamma$ and $\beta$ to constants, $1$ and $0$ respectively. In this setting batch norm only normalizes the features and does not allow them to be scaled and shifted as needed. This generally doesn't matter very much, I guess, since the scale and shift is simply an affine transformation, which the conv layers are already doing.
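In PyTorch, for example, this corresponds to constructing the layer without learnable affine parameters:

```python
import torch.nn as nn

# Normalization only: gamma fixed to 1 and beta fixed to 0 (no learnable scale/shift).
bn_no_affine = nn.BatchNorm2d(32, affine=False)
```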


References

  • Normalizing your data (specifically, input and batch normalization), Jeremy Jordan
  • Pitfalls of Batch Norm in TensorFlow and Sanity Checks for Training Networks, Medium
  • Fusing batch normalization and convolution in runtime
  • How Does Batch Normalization Help Optimization?, arXiv.org
  • How to Train Your ResNet 7: Batch Norm, Myrtle
  • fuse convolutional and batch_norm weights into one convolutional-layer · Issue #5 · AlexeyAB/yolo2_light, GitHub
  • On The Perils of Batch Norm, alexirpan