
Log Probability


You said:

How should one interpret log probability? Can it be considered a quantity of its own? What would be the intuition behind understanding log probability as a concept of its own?

ChatGPT said:

Log probability is simply the logarithm of a probability. Instead of working with raw probabilities, we work with their logarithms, which makes certain calculations easier and more numerically stable.

Interpretation of Log Probability

  • Summing Instead of Multiplying: Since probabilities multiply when dealing with sequences of independent events, taking the logarithm converts this into summation. This is especially useful in machine learning, where working with very small probabilities can lead to numerical underflow.

  • Better Dynamic Range: Taking logs spreads very small probabilities over a much wider range, so differences between them are easier to see and compare.

  • Comparative Measure: A higher log probability still means a higher probability, just on a different scale. The values are negative because $\log(x) < 0$ for $0 < x < 1$.

Can Log Probability Be Considered as a Quantity of Its Own?

Yes! Many machine learning models, especially language models, work with log probabilities directly rather than probabilities. When evaluating a model's performance, we often use metrics like log-likelihood or cross-entropy, which are based on log probabilities rather than raw probabilities.

Intuition Behind Understanding Log Probability as a Concept of Its Own

Imagine you are dealing with a language model predicting the next word. Instead of thinking about a tiny probability like $0.00001$, you might think of it as $\log(0.00001) = -11.5$, which is easier to compare meaningfully against another log probability, like $\log(0.01) = -4.6$.

Would you like an example where using log probabilities simplifies calculations?

You said:

yes, I would like that example.

ChatGPT said:

Example: Computing Probability vs. Log Probability in a Sequence

Imagine you have a language model predicting the probability of a sentence:

"The cat sat on the mat."

Suppose the model assigns the following probabilities to each word given the previous ones:

P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"Thecat")=0.05,...P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"The cat")=0.05,...P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"Thecat")=0.05,...P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"The cat")=0.05,...P(\text{"The"}) = 0.2, \quad P(\text{"cat"} \mid \text{"The"}) = 0.1, \quad P(\text{"sat"} \mid \text{"The cat"}) = 0.05, \quad \text{...}P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"The cat")=0.05,...P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"Thecat")=0.05,...P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"The cat")=0.05,...P("The")=0.2,P("cat"∣"The")=0.1,P("sat"∣"Thecat")=0.05,...

Using Regular Probabilities

To compute the probability of the full sentence, we multiply:

P("Thecatsatonthemat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P("The cat sat on the mat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P("Thecatsatonthemat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P("The cat sat on the mat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P(\text{"The cat sat on the mat"}) = 0.2 \times 0.1 \times 0.05 \times 0.08 \times 0.03 \times 0.04 = 9.6 \times 10^{-7}P("The cat sat on the mat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P("Thecatsatonthemat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P("The cat sat on the mat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7P("Thecatsatonthemat")=0.2×0.1×0.05×0.08×0.03×0.04=9.6×10−7

This is a very small number, which can cause numerical underflow when working with floating-point numbers.

Using Log Probabilities

Instead of multiplying, we take the log of each probability and sum:

$$\log P(\text{"The"}) = \log(0.2) = -1.61$$

$$\log P(\text{"cat"} \mid \text{"The"}) = \log(0.1) = -2.30$$

$$\log P(\text{"sat"} \mid \text{"The cat"}) = \log(0.05) = -2.99$$

and similarly for the remaining words of the sentence.

Summing these:

$$\log P(\text{"The cat sat on the mat"}) = -1.61 + (-2.30) + (-2.99) + (-2.52) + (-3.50) + (-3.22) = -16.14$$
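
To make this concrete, here is a small Python sketch (added for illustration, not part of the original chat) that computes the same sentence score in probability space and in log space, using the illustrative per-word probabilities above:

```python
import math

# Illustrative per-word probabilities from the example above
word_probs = [0.2, 0.1, 0.05, 0.08, 0.03, 0.04]

# Probability space: multiply the terms (underflows for long sequences)
prob = math.prod(word_probs)

# Log space: sum the logs (stays in a comfortable numeric range)
log_prob = sum(math.log(p) for p in word_probs)

print(f"product of probabilities : {prob:.2e}")                # ~9.6e-08
print(f"sum of log probabilities : {log_prob:.2f}")            # ~-16.16 (-16.14 with the rounded values above)
print(f"exp(sum of logs)         : {math.exp(log_prob):.2e}")  # matches the product
```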

Why Is Log Probability Useful?

  1. Avoids Numerical Underflow: Instead of multiplying tiny numbers, we sum manageable negative values.

  2. Simpler Computation: Long products of probabilities become plain sums, which are easier to accumulate, compare, and differentiate (this is why log-likelihood is the standard training objective).

  3. Easier Comparison: Instead of comparing tiny probabilities, comparing log probabilities is more intuitive (e.g., $-16.14$ vs. $-20$ is clearer than $10^{-7}$ vs. $10^{-9}$).

Would you like a visualization or code example to solidify the intuition?

You said:

Yes I would like visualization to make it more intuitive

ChatGPT said:

Here's a simple visualization to show the difference between working with probabilities and log probabilities.

I'll generate:

  1. A plot of probabilities showing how they shrink as we multiply them.

  2. A plot of log probabilities showing how they remain manageable when summed instead.

Let’s generate the plots.
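
The plots generated in the original conversation are not included in this page. A minimal matplotlib sketch (an assumption of what such code could look like, not the code actually run in the chat) that would reproduce the two panels described below:

```python
import math
import matplotlib.pyplot as plt

# Illustrative per-word probabilities from the sentence example
word_probs = [0.2, 0.1, 0.05, 0.08, 0.03, 0.04]
steps = range(1, len(word_probs) + 1)

# Cumulative product (probability space) and cumulative sum (log space)
cum_prob, cum_log = [], []
p, lp = 1.0, 0.0
for w in word_probs:
    p *= w
    lp += math.log(w)
    cum_prob.append(p)
    cum_log.append(lp)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(steps, cum_prob, marker="o")
ax1.set_yscale("log")  # values shrink too fast for a linear axis
ax1.set_title("Multiplying probabilities")
ax1.set_xlabel("number of words")
ax1.set_ylabel("cumulative probability")

ax2.plot(steps, cum_log, marker="o")
ax2.set_title("Summing log probabilities")
ax2.set_xlabel("number of words")
ax2.set_ylabel("cumulative log probability")

plt.tight_layout()
plt.show()
```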

Here’s the visualization:

  1. Left Plot (Multiplication of Probabilities)

    • The cumulative probability shrinks rapidly, making it hard to work with very small numbers.

    • The y-axis is in log scale because otherwise, the values would be too small to see clearly.

  2. Right Plot (Summation of Log Probabilities)

    • Instead of shrinking, log probabilities accumulate in a linear way.

    • This makes them easier to compute, compare, and interpret.

This visualization highlights why log probabilities are preferred over raw probabilities in machine learning and NLP models. Let me know if you’d like a deeper breakdown! 🚀

You said:

I would still want a better intuition of what log probability could denote as a quantity of its own, rather than just being numerically convenient. Probability is the chance of an event happening, or a belief about an event happening; could log probability also be associated with some such real or pseudo quantity?

ChatGPT said:

That’s a great question! You're asking if log probability can be understood as a meaningful quantity rather than just a numerical trick.

Intuition: Log Probability as "Surprise" or "Information Content"

Log probability is closely linked to information theory, where it represents the amount of "surprise" or "unexpectedness" of an event.

  • Low probability events are surprising.

  • High probability events are expected.

  • The negative log of the probability quantifies that surprise, and surprises from independent events add up.

This concept is formalized as Shannon Information Content:

$$I(x) = -\log P(x)$$

Example: Rolling a Die 🎲

  • If you roll a fair 6-sided die, the probability of rolling any number (1-6) is 1/6.

  • Using log base 2 (common in information theory):

    $$I(x) = -\log_2(1/6) \approx 2.58 \text{ bits}$$

    This means observing the result of a fair die roll gives about 2.58 bits of surprise.

  • If instead, you had a biased die where one side (say "6") comes up with probability 1/2, then:

    $$I(6) = -\log_2(1/2) = 1 \text{ bit}$$

    You’re less surprised because it happens frequently.
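
As a quick numerical check of the die example above, here is a minimal Python sketch (added for illustration):

```python
import math

def surprisal_bits(p: float) -> float:
    """Shannon information content -log2(p) of an outcome with probability p, in bits."""
    return -math.log2(p)

print(surprisal_bits(1 / 6))  # fair die, any face: ~2.58 bits
print(surprisal_bits(1 / 2))  # biased die showing "6": 1.0 bit
print(surprisal_bits(1.0))    # a certain event carries no surprise: 0.0 bits
```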

Interpreting Log Probability as "Belief Update"

In Bayesian reasoning, log probabilities relate to how much you should update your belief when new evidence appears.

If an event has log probability = -5, it's much less expected than an event with log probability = -1. This could correspond to a larger shift in our belief when it actually happens.

Key Takeaway

Log probability is like a measure of surprise or unexpectedness of an event. Instead of thinking about it as just a transformation of probability, you can interpret it as:

"How much information do I gain when this event happens?"

Does this framing help? Would you like another example, maybe related to language modeling? 🚀

You said:

I would like to know more about log probability as a belief update, maybe with an example or some maths.

ChatGPT said:

Great! Let’s dive into log probability as a belief update, using an example from Bayesian inference.


Belief Update in Bayesian Terms

In Bayesian reasoning, we update our beliefs when we receive new evidence. This is expressed by Bayes’ Theorem:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where:

  • $P(H)$ is our prior belief in hypothesis $H$ (before seeing evidence).

  • $P(E \mid H)$ is the likelihood (how probable the evidence is given the hypothesis).

  • $P(E)$ is the total probability of the evidence occurring.

  • $P(H \mid E)$ is our posterior belief (updated belief after seeing evidence).

Now, let’s take the log of this equation.


Log Probability Form: Additive Belief Update

Taking the logarithm of both sides:

$$\log P(H \mid E) = \log P(E \mid H) + \log P(H) - \log P(E)$$

What This Means

  • The posterior log probability is the sum of:

    1. The prior log probability $\log P(H)$

    2. The log-likelihood $\log P(E \mid H)$ (how well the hypothesis explains the evidence)

    3. A normalization term $-\log P(E)$ (which doesn’t depend on $H$)

Thus, new evidence combines additively in log-space rather than multiplicatively in probability-space (see the two-evidence example below).
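
To see the additivity concretely, suppose two pieces of evidence $E_1$ and $E_2$ arrive and are conditionally independent given $H$ (an assumption made here purely for illustration). Taking logs of Bayes’ theorem then gives a running sum:

$$\log P(H \mid E_1, E_2) = \log P(H) + \log P(E_1 \mid H) + \log P(E_2 \mid H) - \log P(E_1, E_2)$$

Each new observation contributes one more log-likelihood term, so evidence accumulates by simple addition.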


Example: Diagnosing a Disease

Imagine a doctor diagnosing a rare disease.

Step 1: Prior Belief

  • Suppose the probability of having the disease before any test is very low: $P(D) = 0.001 \;\Rightarrow\; \log P(D) = \log(0.001) = -6.91$

Step 2: New Evidence (Positive Test Result)

  • The test is 90% accurate if you have the disease:

    $P(T \mid D) = 0.9 \;\Rightarrow\; \log P(T \mid D) = \log(0.9) = -0.105$

  • The test also gives false positives (5% of healthy people test positive):

    $P(T \mid \neg D) = 0.05$

  • The total probability of a positive test (marginal likelihood):

    $$P(T) = P(T \mid D)\,P(D) + P(T \mid \neg D)\,P(\neg D)$$

    Plugging in numbers:

    $$P(T) = (0.9 \times 0.001) + (0.05 \times 0.999) = 0.05085$$

    Taking the log:

    $$\log P(T) = \log(0.05085) = -2.98$$

Step 3: Updating Belief with Log Probabilities

Now, applying our formula:

$$\log P(D \mid T) = \log P(T \mid D) + \log P(D) - \log P(T)$$

Substituting values:

$$\log P(D \mid T) = (-0.105) + (-6.91) - (-2.98) = -4.03$$

Converting back to probability:

$$P(D \mid T) = e^{-4.03} \approx 0.018$$

Even after a positive test result, the probability of having the disease only increases from 0.1% to 1.8%.
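
These numbers are easy to verify with a few lines of Python (a sketch added here, not from the original chat), computing the posterior both directly and via the log-space update:

```python
import math

# Illustrative numbers from the example above
p_disease = 0.001           # prior P(D)
p_pos_given_d = 0.9         # sensitivity P(T | D)
p_pos_given_not_d = 0.05    # false-positive rate P(T | not D)

# Marginal likelihood of a positive test
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Direct Bayes' theorem in probability space
posterior = p_pos_given_d * p_disease / p_pos

# The same update in log space: log-likelihood + log-prior - log-evidence
log_posterior = math.log(p_pos_given_d) + math.log(p_disease) - math.log(p_pos)

print(f"P(T)               = {p_pos:.5f}")                    # 0.05085
print(f"P(D | T) direct    = {posterior:.4f}")                # ~0.0177
print(f"P(D | T) from logs = {math.exp(log_posterior):.4f}")  # same value
print(f"log P(D | T)       = {log_posterior:.2f}")            # ~-4.03
```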


Key Takeaways

  1. Log probability makes belief updates additive

    • Instead of multiplying probabilities, we add log probabilities for each new piece of evidence.

  2. Prior + Evidence = Posterior in Log-Space

    • We can think of log probability as a "belief score" that increases or decreases as we observe new data.

  3. Prevents Numerical Issues

    • Bayesian inference often involves tiny probabilities; log-space helps avoid numerical underflow.


Final Intuition: Log Probability as "Weight of Evidence"

  • If something is unlikely, it has a very negative log probability (strong disbelief).

  • If something is likely, it has a less negative log probability (strong belief).

  • Each new piece of evidence adds or subtracts from that belief in a linear way.
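
One standard way to make this "weight of evidence" picture precise (not spelled out in the chat above) is to write Bayes’ theorem in log-odds form, where the evidence contributes an additive log-likelihood-ratio term:

$$\log \frac{P(H \mid E)}{P(\neg H \mid E)} = \log \frac{P(H)}{P(\neg H)} + \log \frac{P(E \mid H)}{P(E \mid \neg H)}$$

In the disease example, the positive test adds $\log(0.9 / 0.05) \approx 2.89$ to prior log-odds of $\log(0.001 / 0.999) \approx -6.91$, giving posterior log-odds of about $-4.02$, which again corresponds to $P(D \mid T) \approx 0.018$.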

Would you like me to simulate this belief update with a plot? 🚀

https://chatgpt.com/share/67cb8814-8324-8007-a105-9f96315ab607