Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

How uncertainty is used to weigh the different losses in tasks where multiple losses are combined.


Last updated 5 years ago

Summary

They learn the task uncertainty (homoscedastic uncertainty) and use it to weight the losses of the different tasks. So when training with multiple losses, such as $L = \sum_i w_i L_i$, each $w_i$ is not a hand-tuned hyperparameter as in traditional methods but a learnable uncertainty. They demonstrate this on a network that jointly learns semantic segmentation and depth regression, hence a classification and a regression task.
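As a rough sketch of this idea (this is not the paper's code; the function name and the log-variance parameterisation $s_i = \log \sigma_i^2$, a common trick for numerical stability, are my assumptions), the uncertainty-weighted combination can be written as:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Weigh per-task losses by learned homoscedastic uncertainties.

    log_vars[i] holds s_i = log(sigma_i^2), so each term
        0.5 * exp(-s_i) * L_i + 0.5 * s_i
    equals 1/(2 sigma_i^2) * L_i + log(sigma_i),
    matching the multi-task objective derived below.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total
```

In a training loop the `log_vars` would be registered as trainable parameters alongside the network weights; with $\sigma_i = 1$ (i.e. $s_i = 0$) the expression reduces to half the plain sum of the losses.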

Methodology

When a network learns multiple outputs, as in this case, the outputs have different units and dimensions, which makes it hard to assign weights to the different losses.

Homoscedastic Uncertainty: an aleatoric uncertainty that does not depend on the input data. It is not a model output, but a quantity that stays constant across all inputs and varies between tasks.

Deriving losses using likelihood

Let $f^W(x)$ be the output of a neural network with weights $W$ on input $x$. For regression tasks, we define the likelihood as a Gaussian with the model output as its mean:

$$p(y \mid f^W(x)) = \mathcal{N}(f^W(x), \sigma^2)$$

where $\sigma^2$ is the observation noise (the task uncertainty in this case). The log likelihood is then

$$\log p(y \mid f^W(x)) \propto -\frac{1}{2\sigma^2} ||y - f^W(x)||^2 - \log \sigma$$

For classification we usually squash the model output through a softmax function and sample from the resulting probability vector:

$$p(y \mid f^W(x)) = \mathrm{Softmax}(f^W(x))$$

Here, for the classification likelihood, they instead squash a scaled version of the model output through the softmax:

$$p(y \mid f^W(x), \sigma) = \mathrm{Softmax}\left(\frac{1}{\sigma^2} f^W(x)\right)$$

This can be interpreted as a Boltzmann distribution (also called a Gibbs distribution) where the input is scaled by $\sigma^2$ (often referred to as temperature). The log likelihood is given by

$$\log p(y = c \mid f^W(x), \sigma) = \frac{1}{\sigma^2} f_c^W(x) - \log \sum_{c'} \exp\left(\frac{1}{\sigma^2} f_{c'}^W(x)\right)$$

with $f_c^W(x)$ the $c$'th element of the vector $f^W(x)$. The classification loss is the negative log likelihood of the above.

Now let $y_1, y_2$ be the model outputs corresponding to regression and classification. Assuming the tasks are independent given the shared features, the joint likelihood factorizes:

$$p(y_1, y_2 \mid f^W(x)) = p(y_1 \mid f^W(x)) \cdot p(y_2 \mid f^W(x))$$

The multi-task loss is the joint negative log likelihood:

$$L(W, \sigma_1, \sigma_2) = -\log p(y_1, y_2 = c \mid f^W(x)) = -\log \mathcal{N}(y_1; f^W(x), \sigma_1^2) \cdot \mathrm{Softmax}(y_2 = c; f^W(x), \sigma_2) \approx \frac{1}{2\sigma_1^2} L_1(W) + \frac{1}{2\sigma_2^2} L_2(W) + \log \sigma_1 + \log \sigma_2$$

where $L_1(W) = ||y_1 - f^W(x)||^2$ is the Euclidean loss for $y_1$ and $L_2(W) = -\log \mathrm{Softmax}(y_2, f^W(x))$ is the cross-entropy loss for $y_2$.

Hence we have mathematically derived a multi-task loss with a learned uncertainty per task. Using this loss formulation for classification and regression, it can be extended to any multi-task loss.

Insights/Discussions

  • Using uncertainty to weigh the losses lets us handle the different dimensions and units of the tasks in a multi-task learning problem.

  • The uncertainties are learnable parameters that depend not on the input but on the task itself.

Intuitively, if a task's output is more spread out, i.e. has larger scale, then its $\sigma$ will be high, normalizing that task's loss. Using $\sigma$ essentially normalizes the losses and removes the units from the equation.
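The temperature reading of the scaled softmax can be checked numerically: dividing the logits by a larger $\sigma^2$ flattens the distribution. A minimal stdlib-only sketch (the function name is mine, not from the paper):

```python
import math

def scaled_softmax(logits, sigma_sq):
    """Softmax of logits scaled by 1/sigma^2: a Boltzmann distribution
    with temperature sigma^2. Higher task uncertainty -> flatter output."""
    scaled = [z / sigma_sq for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = scaled_softmax([2.0, 0.0], sigma_sq=1.0)
flat = scaled_softmax([2.0, 0.0], sigma_sq=10.0)
# The winning class's probability shrinks as sigma^2 grows:
# sharp[0] > flat[0] > 0.5
```

With $\sigma^2 = 1$ this is the ordinary softmax; as $\sigma^2 \to \infty$ the output approaches a uniform distribution, which is exactly why a high-uncertainty task contributes a softer, down-weighted classification loss.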

Figure: pipeline of the work.