
Probabilistic Equations



Joint Probability

p(a,b) = p(a|b)p(b) = p(b|a)p(a)

Marginalization or Total Probability Theorem

Whether variable b is dependent on or independent of variable a, we have:

p(b) = \int_{a} p(b,a)\, da = \int_{a} p(b|a)p(a)\, da = \sum_i p(b|a_i)p(a_i)

Here we say that the distribution of b is marginalized with respect to a.

Note that the a_i are disjoint for i = 1, 2, ...
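As a concrete check of the discrete form (a minimal sketch; the distributions below are made-up numbers purely for illustration):

```python
# Total probability theorem: p(b) = sum_i p(b | a_i) p(a_i), with disjoint a_i.
p_a = {"a1": 0.5, "a2": 0.3, "a3": 0.2}          # assumed prior over a
p_b_given_a = {"a1": 0.9, "a2": 0.4, "a3": 0.1}  # assumed p(b = 1 | a_i)

# Marginalize a out to get the probability of b.
p_b = sum(p_b_given_a[a] * p_a[a] for a in p_a)
print(p_b)  # 0.59
```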

Marginalization as Expected Value

p_X(x) = \int_y p_{X|Y}(x|y)\, p_Y(y)\, dy = E_Y[p_{X|Y}(x|y)]

Intuitively, the marginal probability of X is computed by examining the conditional probability of X given a particular value of Y, and then averaging this conditional probability over the distribution of all values of Y.

This follows from the definition of expected value, after applying the law of the unconscious statistician.
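The expectation form also suggests a simple Monte Carlo estimate of a marginal: sample y from p_Y and average the conditional density. A minimal sketch, assuming (for illustration only) Y ~ N(0, 1) and X | Y = y ~ N(y, 1), so the exact marginal is X ~ N(0, 2):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

x = 0.7
ys = rng.normal(0.0, 1.0, size=100_000)         # samples from p_Y
p_x_mc = norm.pdf(x, loc=ys, scale=1.0).mean()  # E_Y[ p_{X|Y}(x | y) ]
p_x_exact = norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))

print(p_x_mc, p_x_exact)  # both close to ~0.25
```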

Conditional Probability

P(A|B) = \text{Probability of A, given B} = \frac{P(A \cap B)}{P(B)} = \frac{\#\text{ of A and B}}{\#\text{ of B}}

i.e "If I know B is the case, then what is the probability that A is also the case"

To understand the intuition behind conditional probability, note that if B has occurred then we know that the selected ω belongs to B. Therefore, when evaluating A, one should ignore all ω's that are not in B. Thus, a revised measure of the likeliness of A should depend on P(A ∩ B). Intuitively, this is equivalent to reducing the effective sample space from Ω to B ⊂ Ω. However, according to the second axiom of probability, the total probability measure must equal one. Hence, we re-scale P(A ∩ B) by P(B) so that the conditional measure is properly normalized: P(Ω|B) = P(B|B) = 1.

Example: Consider rolling a die: Ω = {1, 2, . . . , 6}. Let A = {6} and B = {2, 4, 6}. In this example, the unconditional probability of rolling a six is P(A) = 1/6. Now suppose you know that a player rolled an even number, but you do not know which number exactly. In that case, you can update the probability of rolling a six to P(A|B) = P(A ∩ B)/P(B) = P({6})/P({2, 4, 6}) = 1/3.

There is one more way to express conditional probability; let me take an example to explain it. Suppose you have to tell whether a person is a man or a woman, given that they have long hair. You know the ratio of men to women and the percentage of people with long hair within each gender, i.e. you have P(m), P(w), P(long hair | w), P(long hair | m), and you have to find P(w|lh) or P(m|lh). Note that P(w|lh) + P(m|lh) = 1, so finding either one suffices. Using Bayes Theorem:

P(w|lh) = \frac{P(lh|w)P(w)}{P(lh)} = \frac{P(lh|w)P(w)}{P(lh|w)P(w)+P(lh|m)P(m)} = \frac{P(lh,w)}{P(lh,w)+P(lh,m)}

This shows that P(w|lh) is just the ratio of women with long hair to the total number of people with long hair.
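A quick numeric check of this example (a sketch; the proportions are made up for illustration):

```python
# Assumed numbers: population split 50/50, long hair more common among women.
P_w, P_m = 0.5, 0.5
P_lh_given_w, P_lh_given_m = 0.7, 0.1

# Bayes theorem, with the denominator expanded via total probability.
P_lh = P_lh_given_w * P_w + P_lh_given_m * P_m
P_w_given_lh = P_lh_given_w * P_w / P_lh
print(P_w_given_lh)  # 0.875: the fraction of long-haired people who are women
```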

Bayes Theorem/ Conditional Probability

Using the joint probability equation:

p(a|b) = \frac{p(a,b)}{p(b)} = \frac{p(b|a)p(a)}{p(b)}

A common scenario for applying the Bayes Rule formula is when you want to know the probability of something “unobservable” given an “observed” event.

Also, using the total probability theorem to expand the denominator:

p(a|b) = \frac{p(b|a)p(a)}{\int_{a}p(b|a)p(a)\, da}

Considering conditionality in Bayes theorem: suppose we condition the probability p(a|b) on some event c, so it is now the probability of a given b, with everything conditioned on c. Conditioning every probability in the above equation on c gives:

p(a|b,c) = \frac{p(b|a,c)p(a|c)}{p(b|c)}

Bayes with Relative Probabilities

There are times when we would like to use Bayes Theorem to update a belief, but there is no way to calculate the probability of the observed event, P(F). All hope is not lost: in such situations we can still calculate the relative probability of events. For example, imagine we would like to answer the question: is event A or event B more likely given an observation F? We can express this mathematically as asking whether P(A|F)/P(B|F) is greater than or equal to 1. Both of those terms can be expanded using Bayes, and when they are, the P(F) term cancels out:

\frac{P(A|F)}{P(B|F)} = \frac{P(F|A)P(A)/P(F)}{P(F|B)P(B)/P(F)} = \frac{P(F|A)P(A)}{P(F|B)P(B)}
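A minimal sketch of the same trick in code (priors and likelihoods are assumed numbers); note that P(F) is never computed:

```python
# Which of two hypotheses, A or B, is more likely given observation F?
P_A, P_B = 0.01, 0.05                 # assumed priors
P_F_given_A, P_F_given_B = 0.8, 0.1   # assumed likelihoods

# P(A|F) / P(B|F): the shared P(F) denominator cancels.
ratio = (P_F_given_A * P_A) / (P_F_given_B * P_B)
print(ratio)  # 1.6 > 1, so A is more likely than B given F
```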

Distributing two variables over Third Variable

p(a,b|c) = p(a|b,c)p(b|c) = p(b|a,c)p(a|c)

Event independence

When events x and y are independent of each other, then

\begin{align*} p(x,y) &= p(x)p(y) & p(x|y) &= p(x) & p(y|x) &= p(y) \end{align*}

Conditional Independence

Here, two events x and y are independent given some event z: if you know the value of z, then x and y become independent of each other, even if they were not independent to begin with. Note: conditional independence does not imply unconditional independence.

\begin{align*} p(x,y|z) &= p(x|z)p(y|z) & p(x|y,z) &= p(x|z) & p(y|x,z) &= p(y|z) \end{align*}
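To make the note above concrete, here is a small simulation sketch under an assumed common-cause model (all numbers made up): z influences both x and y, so x and y are marginally dependent but become independent once z is fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.9, 0.1)   # p(x=1|z=1)=0.9, p(x=1|z=0)=0.1
y = rng.random(n) < np.where(z, 0.8, 0.2)   # p(y=1|z=1)=0.8, p(y=1|z=0)=0.2

print(x[y].mean(), x[~y].mean())            # differ (~0.74 vs ~0.26): dependent
print(x[y & z].mean(), x[~y & z].mean())    # both ~0.9: independent given z = 1
```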

Chain Rule of Probability

p(a,b,c,d) = p(a|b,c,d)p(b|c,d)p(c|d)p(d)

Marginalization of Conditional Probability

p(a|b) = \frac{p(a,b)}{p(b)} = \frac{\sum_c p(a,b,c)}{p(b)} = \frac{\sum_c p(a,c|b)p(b)}{p(b)} = \sum_c p(a,c|b)

Breaking Independence

I wanted to pass on this surprising insight so that you have a fuller understanding of probability: if two events E and F are independent, it is possible that there exists another event G such that E|G is no longer independent of F|G.

As an example, say a person has a fever (G) if they either have malaria (E) or have an infection (F). We are going to assume that getting malaria (E) and having an infection (F) are independent: knowing whether a person has malaria does not tell us whether they have an infection. Now, a patient walks into a hospital with a fever (G). Your belief that the patient has malaria given the fever, P(E|G), is high, and your belief that the patient has an infection given the fever, P(F|G), is high; both explain the fever. Interestingly, at this point (conditioned on G), learning that the patient has malaria will change your belief that the patient has an infection: the malaria explains the fever, so the alternative explanation becomes less likely. The two events, which were previously independent, are dependent now that we have conditioned on the patient having a fever.
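A small simulation of this "explaining away" effect, under an assumed toy model in which fever occurs exactly when the patient has malaria or an infection (all probabilities are made up): malaria and infection are sampled independently, yet conditioning on fever makes them dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

E = rng.random(n) < 0.02   # malaria
F = rng.random(n) < 0.10   # infection (independent of malaria by construction)
G = E | F                  # fever iff malaria or infection

print(F[G].mean())         # P(F | G)    ~ 0.85: infection is a likely explanation
print(F[G & E].mean())     # P(F | G, E) ~ 0.10: knowing malaria "explains away" F
```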

Resources


https://medium.com/@laumannfelix/statistics-probability-fundamentals-2-cbb1239f9605
Probability_summary.pdf (133 KB): Brief summary of probability theorems/equations