Deep learning techniques

This page collects deep learning techniques that might help your model converge faster, reach higher accuracy, reduce run time, etc.

PReLU (Parametric ReLU):

This is just like leaky ReLU, where f(x) = alpha*x for x < 0, except that here alpha is a learnable parameter, so its value changes during training. The authors of the paper claim that this gives better model fitting with little extra risk of overfitting.

f(x) = alpha * x for x < 0, f(x) = x for x >= 0,
where alpha is a learned array with the same shape as x.
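
As a rough sketch (PyTorch, with a channel-wise alpha initialised to 0.25 as in the paper; the class below is my own minimal version, and torch.nn.PReLU provides the same thing out of the box):

```python
import torch
import torch.nn as nn

class PReLU(nn.Module):
    """Parametric ReLU: f(x) = x for x >= 0, alpha * x for x < 0, with alpha learned."""
    def __init__(self, num_channels, init_alpha=0.25):
        super().__init__()
        # one learnable alpha per channel, broadcast over the spatial dimensions
        self.alpha = nn.Parameter(torch.full((num_channels,), init_alpha))

    def forward(self, x):                      # x: (N, C, H, W)
        alpha = self.alpha.view(1, -1, 1, 1)
        return torch.where(x >= 0, x, alpha * x)
```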

Merging Batch Norm with Conv:

We can fold the batch norm's learned parameters gamma and beta (together with the running mean and variance) into the convolution layer's weights and biases. This lets us remove the batch norm layer at inference time, getting the batch norm effect with less computation.
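
A minimal sketch of the folding in NumPy (function and variable names are my own; the batch norm's gamma, beta, running mean and running variance are assumed to be available):

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into the preceding conv's weights/bias.

    W: (out_ch, in_ch, kH, kW), b: (out_ch,); all BN params are per output channel.
    Returns W_fold, b_fold such that conv(x, W_fold) + b_fold == bn(conv(x, W) + b).
    """
    scale = gamma / np.sqrt(running_var + eps)      # per output channel
    W_fold = W * scale[:, None, None, None]         # scale each output filter
    b_fold = (b - running_mean) * scale + beta
    return W_fold, b_fold
```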

Deconvolutions introduce Checkerboard Artifacts:

This blog post by colah (linked at the bottom of this page) explains that the checkerboard artifacts we usually see in GANs and other image-generation models are caused by the use of deconvolutions (transposed convolutions).

One fix is to make sure the kernel size is divisible by the stride, so that activations overlap evenly, but even this is not fully effective. The better way to upsample is to first resize the input with nearest-neighbour interpolation and then apply a regular convolution on top of it.
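
A minimal PyTorch sketch of this resize-then-convolve upsampling (module name and channel arguments are my own):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResizeConvUpsample(nn.Module):
    """Upsample by nearest-neighbour resize + regular conv, instead of a transposed conv."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")  # resize first
        return self.conv(x)                                            # then convolve
```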

Weight Decay and L2 Regularization:

When using L2 regularization, the weight update becomes w' = w - lr*(2*λ*w + d(loss)/dw).

But the update rule commonly used in practice has momentum together with weight decay: v' = 0.9*v - lr*(0.0005*w + d(loss)/dw) and then w' = w + v', where v is the momentum term. There is also a paper on decoupled weight decay (AdamW) which argues that weight decay and L2 regularization are not equivalent for adaptive optimizers, and that with Adam it works best to apply weight decay separately from the gradient-based update.
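
A rough sketch of the different update rules side by side (plain Python; the learning rate, weight decay and momentum values are just the illustrative numbers used above):

```python
# L2 regularization: the penalty's gradient is folded into the loss gradient,
# so it gets scaled by the learning rate (and by Adam's adaptive terms, if used).
def sgd_l2_step(w, grad, lr=0.01, lam=0.0005):
    return w - lr * (2 * lam * w + grad)

# SGD with momentum + weight decay as it is usually implemented:
def sgd_momentum_wd_step(w, v, grad, lr=0.01, wd=0.0005, momentum=0.9):
    v = momentum * v - lr * (wd * w + grad)
    return w + v, v

# Decoupled weight decay (the AdamW idea): shrink w directly, separately from
# whatever update the optimizer computes from the gradient.
def decoupled_wd_step(w, optimizer_update, lr=0.01, wd=0.0005):
    return w + optimizer_update - lr * wd * w
```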

DenseNet:

Here, for each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers. Advantages:

  • they encourage feature reuse and substantially reduce the number of parameters,

  • they alleviate the vanishing-gradient problem,

  • they strengthen feature propagation.

A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to re-learn redundant feature maps.
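
A minimal sketch of a dense block in PyTorch (my own simplified version: plain BN-ReLU-3x3-conv layers, no bottleneck or transition layers):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all previous feature maps as input."""
    def __init__(self, in_ch, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate                 # channels seen by layer i
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))      # reuse all earlier feature maps
            features.append(out)
        return torch.cat(features, dim=1)
```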

Modelling Network Architecture for constrained time cost

This paper shows how to choose the architecture, i.e. the depth (number of layers), feature-map depth, stride, etc., such that the computation time stays constant. The trade-offs among all of these factors for maintaining a fixed time cost are discussed.

A general CNN architecture (a minimal sketch follows the list):

  • 1st Stage: A convolution layer with a large filter size (7x7 or 11x11) and a relatively small number of filters, followed by pooling.

  • 2nd: A layer with 5x5 filters.

  • 3rd: A layer with 3x3 filters.
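
A sketch of such a stack in PyTorch (channel counts and strides are illustrative, not taken from any particular paper):

```python
import torch.nn as nn

cnn_stem = nn.Sequential(
    # 1st stage: large 7x7 filter, few channels, followed by pooling
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    # 2nd stage: 5x5 filters
    nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    # 3rd stage: 3x3 filters
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
```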

Time Complexity of Neural Network:

Here l is the index of a convolutional layer and d is the depth (number of convolutional layers). n_l is the number of filters (also known as the "width") of the l-th layer; n_{l-1} is then the number of input channels of the l-th layer. s_l is the spatial size (side length) of the filter and m_l is the spatial size of the output feature map.
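
Putting these symbols together, the total time complexity of all the convolutional layers, as given in the paper linked at the bottom of this page (arXiv:1412.1710), is:

O( sum over l = 1..d of n_{l-1} * s_l^2 * n_l * m_l^2 )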

Trade Offs between depth, feature maps, filter size:

The paper (arXiv:1412.1710, linked below) works through the trade-offs between depth, number of feature maps per layer, kernel size, and spatial size of the feature maps. It finds that depth is the most important factor: you can increase the depth up to some extent without a time penalty by compensating on kernel size, number of feature maps, or spatial size of the feature maps.

One-Hot Encoding:

One-hot encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms so that they can do a better job in prediction.

Suppose we have K categories; then the one-hot vector representing an example has the value 1 at the index of its category and 0 everywhere else. This turns the categorical data into a vector.
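
A minimal NumPy example (assuming integer category labels in the range [0, K)):

```python
import numpy as np

def one_hot(labels, num_classes):
    """labels: array of integer category indices -> (len(labels), num_classes) matrix."""
    return np.eye(num_classes)[labels]

print(one_hot(np.array([0, 2, 1]), num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```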

K Nearest Neighbour Layer:

In this, colah proposes a k-nearest-neighbour layer instead of softmax. First, he trains a classification model on MNIST with a softmax layer and, say, a cross-entropy loss. At test time he drops the softmax layer and runs k-NN on the final-layer features to do the classification; essentially he uses k-NN in the feature space of the input images. This gave better performance than softmax. The next question is how to use k-NN at training time as well, instead of softmax. From the link:

k-NN is differentiable with respect to the representation it’s acting on, because of the 1/distance weighting. As such, we can train a network directly for k-NN classification. This can be thought of as a kind of “nearest neighbor” layer that acts as an alternative to softmax.

We don’t want to feedforward our entire training set for each mini-batch because that would be very computationally expensive. I think a nice approach is to classify each element of the mini-batch based on the classes of other elements of the mini-batch, giving each one a weight of 1/(distance from classification target).
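
A rough sketch of this in-batch weighted k-NN classification in PyTorch (my own simplified version of the idea, not colah's code; every other element of the mini-batch votes with weight 1/distance):

```python
import torch
import torch.nn.functional as F

def batch_knn_probs(features, labels, num_classes, eps=1e-8):
    """Classify each mini-batch element from the *other* elements, weighted by 1/distance."""
    dists = torch.cdist(features, features)                        # (B, B) pairwise distances
    mask = 1.0 - torch.eye(len(features), device=features.device)  # zero out self-votes
    weights = mask / (dists + eps)
    votes = weights @ F.one_hot(labels, num_classes).float()       # (B, num_classes)
    return votes / votes.sum(dim=1, keepdim=True)

# Training sketch: probs = batch_knn_probs(net(x), y, 10); loss = F.nll_loss(probs.log(), y)
```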

https://arxiv.org/abs/1412.1710
http://machinethink.net/blog/object-detection-with-yolo
Deconvolution and Checkerboard Artifacts -- Distill
Neural Networks, Manifolds, and Topology -- colah's blog
DenseNet architecture