
Probability


Last updated 4 years ago

Notation

  • Uppercase $X$ denotes a random variable.

  • Uppercase $P(X)$ denotes the probability distribution over that variable.

  • Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process.

  • Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$.

  • $p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at a particular value $x$.

Probability Distribution

A probability distribution is a list of all of the possible outcomes of a random variable along with their corresponding probability values.

For example, a fair coin has two possible outcomes, heads and tails, each with probability 0.5.

Probability Density function

This is the function that represents the probability distribution of a random variable. It is denoted by $p(x)$.

For discrete random variables (where $p$ is more precisely called a probability mass function):

$$P(X=x) = p(x)$$

For continuous random variables:

$$P(a<x<b) = \int_a^b p(x)\,dx$$

So, basically, the function $p$ is a way to associate probabilities with the outcomes $x$ of the random variable $X$.
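The integral for the continuous case can be checked numerically. The following is a minimal sketch, assuming a standard normal density and a simple midpoint rule; the names `normal_pdf` and `interval_probability` are illustrative and not part of the original notes.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density p(x; mu, sigma) of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def interval_probability(pdf, a, b, steps=10_000):
    """Approximate P(a < X < b) as the integral of pdf over [a, b] (midpoint rule)."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability that a standard normal variable falls within one sigma of the mean.
prob = interval_probability(normal_pdf, -1.0, 1.0)
print(f"P(-1 < X < 1) is approximately {prob:.4f}")
```

The result should be close to the familiar 68% figure for one standard deviation around the mean.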

Note that these functions have parameters. For example, the normal distribution has a mean and a variance as its parameters. To show the parameters, we use the notation $p(x;\mu, \sigma)$ or $f(x|\mu, \sigma)$.

Also, how do we distinguish two probability functions? Say you have three variables $X, Y, Z$. Instead of using three letters for their pdfs, like $f(x), g(y), h(z)$, you denote every pdf by $p$ and attach the random variable's letter as a subscript: $p_X, p_Y, p_Z$. So when you write $p_X(a)$, you are talking about the random variable $X$ taking the value $a$, and not about $Y$ or $Z$.

Summary

What Is Probability

Let’s suppose I want to bet on a soccer game between two teams of robots, Arduino Arsenal and C Milan. After thinking about it, I decide that there is an 80% probability of Arduino Arsenal winning. What do I mean by that? Here are three possibilities…

  • They’re robot teams, so I can make them play over and over again, and if I did that, Arduino Arsenal would win 8 out of every 10 games on average.

  • For any given game, I would agree that betting on this game is “fair” only if a $1 bet on C Milan gives a $5 payoff (i.e. I get my $1 back plus a $4 reward for being correct), as would a $4 bet on Arduino Arsenal (i.e., my $4 bet plus a $1 reward).

  • My subjective “belief” or “confidence” in an Arduino Arsenal victory is four times as strong as my belief in a C Milan victory.
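The "fair bet" reading in the second bullet can be checked with a little arithmetic. This is a minimal sketch, assuming "fair" means the bet breaks even on average; the function name `fair_payoff` is illustrative.

```python
import math


def fair_payoff(stake, win_probability):
    """Total payoff (stake returned plus reward) at which a bet breaks even on average.

    Expected net gain = win_probability * payoff - stake, which is zero
    when payoff = stake / win_probability.
    """
    return stake / win_probability


p_arsenal = 0.8
payoff_arsenal = fair_payoff(4.0, p_arsenal)      # 4 staked on Arduino Arsenal
payoff_milan = fair_payoff(1.0, 1.0 - p_arsenal)  # 1 staked on C Milan

print(f"Arsenal bet pays {payoff_arsenal:.2f}, Milan bet pays {payoff_milan:.2f}")
assert math.isclose(payoff_arsenal, payoff_milan)
```

Both bets return a total of 5, matching the stakes and rewards quoted in the bullet.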

The Frequentist View

It defines probability as a long-run frequency. Suppose we were to flip a fair coin over and over again. By definition, this is a coin that has $P(H)=0.5$. What might we observe? The frequentist view defines probability in terms of how frequently an event happens: in this case, the proportion of flips that come up heads, as the number of flips grows, is the probability of heads.
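The long-run frequency idea can be illustrated with a simulation. This is a minimal sketch, assuming a pseudo-random coin from Python's `random` module; the function name `running_frequency` and the fixed seed are illustrative choices.

```python
import random


def running_frequency(num_flips, p_heads=0.5, seed=0):
    """Return the proportion of heads observed after each flip."""
    rng = random.Random(seed)
    heads = 0
    frequencies = []
    for n in range(1, num_flips + 1):
        if rng.random() < p_heads:
            heads += 1
        frequencies.append(heads / n)
    return frequencies


frequencies = running_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} flips: proportion of heads = {frequencies[n - 1]:.4f}")
```

The early proportions can wander quite far from 0.5, while the later ones should settle close to it.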

The frequentist definition of probability has some desirable characteristics. First, it is objective: the probability of an event is necessarily grounded in the world. The only way that probability statements can make sense is if they refer to (a sequence of) events that occur in the physical universe. Second, it is unambiguous: any two people watching the same sequence of events unfold, trying to calculate the probability of an event, must inevitably come up with the same answer.

The frequentist definition has a narrow scope. There are lots of things out there that human beings are happy to assign probability to in everyday language, but that cannot (even in theory) be mapped onto a hypothetical sequence of events. For instance, if a meteorologist comes on TV and says, “the probability of rain in Adelaide on 2 November 2048 is 60%” we humans are happy to accept this. But it’s not clear how to define this in frequentist terms. There’s only one city of Adelaide, and only one 2 November 2048. There’s no infinite sequence of events here, just a once-off thing. Frequentist probability genuinely forbids us from making probability statements about a single event.

The Bayesian View

The Bayesian view of probability is often called the subjectivist view. The most common way of thinking about subjective probability is to define the probability of an event as the degree of belief that an intelligent and rational agent assigns to the truth of that event.

However, in order for this approach to work, we need some way of operationalising “degree of belief”. One way that you can do this is to formalise it in terms of “rational gambling”, though there are many other ways. Suppose that I believe that there’s a 60% probability of rain tomorrow, and someone offers me a bet: if it rains tomorrow, I win $5, but if it doesn’t rain, I lose $5. Clearly, from my perspective, this is a pretty good bet. On the other hand, if I think that the probability of rain is only 40%, then it’s a bad bet to take.
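Why the bet is good at 60% and bad at 40% follows from its expected gain. This is a minimal sketch, assuming the agent judges the bet by its average outcome; the function name `expected_gain` is illustrative.

```python
import math


def expected_gain(p_win, win_amount, loss_amount):
    """Average gain of a bet that pays win_amount with probability p_win
    and costs loss_amount otherwise."""
    return p_win * win_amount - (1.0 - p_win) * loss_amount


gain_if_60 = expected_gain(0.6, 5.0, 5.0)  # believer in 60% chance of rain
gain_if_40 = expected_gain(0.4, 5.0, 5.0)  # believer in 40% chance of rain

print(f"belief 60%: expected gain {gain_if_60:+.2f}")
print(f"belief 40%: expected gain {gain_if_40:+.2f}")
```

The same bet has a positive expected gain under one belief and a negative one under the other, which is what makes the betting framing a way to measure degree of belief.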

The main advantage is that it allows you to assign probabilities to any event you want to. You don’t need to be limited to those events that are repeatable. The main disadvantage (to many people) is that we can’t be purely objective – specifying a probability requires us to specify an entity that has the relevant degree of belief. This entity might be a human, an alien, a robot, or even a statistician, but there has to be an intelligent agent out there that believes in things. To many people this is uncomfortable: it seems to make probability arbitrary. While the Bayesian approach does require that the agent in question be rational (i.e., obey the rules of probability), it does allow everyone to have their own beliefs; I can believe the coin is fair and you don’t have to, even though we’re both rational.

Probability vs Likelihood

A probability is a number between 0 and 1 that describes how likely an event is to happen, whereas a likelihood is a value that tells the relative chance of an event happening. To understand this better, consider a continuous random variable with a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. What is the probability that the random variable is exactly $\mu$? It is 0. But its likelihood is greatest there. Probability is given by the area under the curve of the distribution, whereas likelihood is the value at a point on the distribution curve.
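The contrast between "area under the curve" and "height of the curve" can be made concrete. This is a minimal sketch, assuming a Gaussian with the closed-form CDF written via `math.erf`; the function names are illustrative.

```python
import math


def gaussian_density(x, mu, sigma):
    """Height of the Gaussian curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for a Gaussian, using the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


mu, sigma = 0.0, 1.0

# Height of the curve at the mean: the largest value the density takes.
height_at_mean = gaussian_density(mu, mu, sigma)

# Area over the zero-width interval [mu, mu]: exactly zero.
prob_exactly_mean = gaussian_cdf(mu, mu, sigma) - gaussian_cdf(mu, mu, sigma)

# Area over a small interval around the mean: small but positive.
prob_near_mean = gaussian_cdf(mu + 0.01, mu, sigma) - gaussian_cdf(mu - 0.01, mu, sigma)

# A narrow Gaussian has a density above 1, so a density value is not a probability.
narrow_height = gaussian_density(0.0, 0.0, 0.1)

print(height_at_mean, prob_exactly_mean, prob_near_mean, narrow_height)
```

The density at the mean is the peak of the curve, yet the probability of hitting the mean exactly is zero; only intervals carry probability.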

Stationarity

If a process is stationary, its density (probability distribution) does not change with time.

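Why is the likelihood function not a pdf?

Viewed as a function of the parameters for fixed observed data, the likelihood is not required to integrate (or sum) to 1 over the parameters, so it is not a density over them. See "What is the reason that a likelihood function is not a pdf?" (Cross Validated).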

Inverse transform sampling

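The problem that the inverse transform sampling method solves is as follows:

  • Let $X$ be a random variable whose distribution can be described by the cumulative distribution function $F_X$.

  • We want to generate values of $X$ which are distributed according to this distribution.

The inverse transform sampling method works as follows:

  1. Generate a random number $u$ from the standard uniform distribution in the interval $[0,1]$, e.g. from $U \sim \mathrm{Unif}[0,1]$.

  2. Find the inverse of the desired CDF, e.g. $F_X^{-1}(x)$.

  3. Compute $X = F_X^{-1}(u)$. The computed random variable $X$ has distribution $F_X(x)$.

Expressed differently, given a continuous uniform variable $U$ in $[0,1]$ and an invertible cumulative distribution function $F_X$, the random variable $X = F_X^{-1}(U)$ has distribution $F_X$ (or, $X$ is distributed $F_X$).

So, this can basically be used to draw samples from different probability distributions.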
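As an illustration of inverse transform sampling, here is a minimal sketch for the exponential distribution, chosen here only because its CDF $F_X(x) = 1 - e^{-\lambda x}$ has the closed-form inverse $F_X^{-1}(u) = -\ln(1-u)/\lambda$. The function names and the seed are illustrative.

```python
import math
import random


def exponential_inverse_cdf(u, rate):
    """Inverse of F(x) = 1 - exp(-rate * x), valid for u in [0, 1)."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=0):
    """Draw n exponential samples by pushing uniform draws through the inverse CDF."""
    rng = random.Random(seed)
    # rng.random() returns values in [0, 1), so 1 - u is never zero.
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


rate = 2.0
samples = sample_exponential(100_000, rate)
sample_mean = sum(samples) / len(samples)
print(f"sample mean = {sample_mean:.4f}, theoretical mean = {1.0 / rate:.4f}")
```

The sample mean should land near the theoretical mean $1/\lambda$, which is a quick sanity check that the draws follow the intended distribution.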

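Sampling from Categorical Distribution

See "How to obtain a random sample from a categorical distribution using Matlab's rand()?" (Stack Overflow).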
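For a categorical distribution, the same inverse-CDF idea reduces to locating a uniform draw within the cumulative probabilities. This is a minimal sketch in Python rather than Matlab; the function name `sample_categorical` and the example probabilities are illustrative.

```python
import bisect
import itertools
import random
from collections import Counter


def sample_categorical(probs, rng):
    """Return an index i drawn with probability proportional to probs[i]."""
    cumulative = list(itertools.accumulate(probs))
    u = rng.random() * cumulative[-1]
    # The first cumulative value strictly greater than u marks the chosen category.
    index = bisect.bisect_right(cumulative, u)
    # Guard against floating-point rounding pushing u up to the total.
    return min(index, len(probs) - 1)


probs = [0.2, 0.5, 0.3]
rng = random.Random(0)
num_draws = 100_000
counts = Counter(sample_categorical(probs, rng) for _ in range(num_draws))
empirical = [counts[i] / num_draws for i in range(len(probs))]
print("empirical frequencies:", [round(f, 3) for f in empirical])
```

The empirical frequencies should be close to the probabilities that were passed in.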

Support of Function (Function can be probability density)

In mathematics, the support of a real-valued function f is the subset of the domain containing those elements which are not mapped to zero.

The set-theoretic support of $f$, written $\operatorname{supp}(f)$, is the set of points in $X$ where $f$ is non-zero, where $X$ is the domain of $f$.

$$\operatorname{supp}(f) = \{x \in X \mid f(x) \neq 0\}$$

The support of f is the smallest subset of X with the property that f is zero on the subset's complement. If f(x) = 0 for all but a finite number of points x in X, then f is said to have finite support.

Support of Random Variable

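In the case of a probability distribution, the support of the random variable is the same as the support of its probability density function. The two terms are often used interchangeably.

For discrete random variables, it is the set of all the realizations that have a strictly positive probability of being observed.

In symbols, if a discrete random variable $X$ has probability mass function $p_X$, its support, denoted by $R_X$, is

$$R_X = \{x : p_X(x) > 0\}$$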
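For a discrete random variable with finitely many listed values, the support can be read directly off the probability mass function. This is a minimal sketch, assuming the pmf is stored as a dictionary from values to probabilities; the function name `support` and the example pmf are illustrative.

```python
def support(pmf):
    """Return the set of values that have strictly positive probability."""
    return {value for value, prob in pmf.items() if prob > 0}


# A hypothetical pmf that lists one value with zero probability.
example_pmf = {0: 0.5, 1: 0.5, 2: 0.0}
example_support = support(example_pmf)
print(example_support)
```

The value with zero probability is listed in the pmf's domain but excluded from the support.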
