
Estimation

Let's talk about estimators, estimates, and related concepts.


Estimation

As you know, in statistics we try to understand a large population on the basis of the information available in a small sample. Part of what we mean by "understand" is knowing the values of the population parameters; these parameters partially describe the population, or rather the distribution of the population. The game here is to use suitable sample statistics to estimate population parameters. For example, we may use the sample mean $\bar x$ as an estimate of the population mean $\mu$.

So when we talk about estimation, we will mostly be estimating one of the following:

  • Parameters of a distribution $\boldsymbol{\theta}$: for example, if we assume our population follows a Gaussian distribution, we estimate its mean and variance.

  • Parameters of some model $\boldsymbol{w}$: for example, in a neural network we estimate the weights of the network using gradients of a loss.

  • Outputs: for example, the output $\hat{y}$ of a neural network is itself an estimate.

Estimator & Estimate

An estimator is basically a random variable $X$. A random variable has a set of possible values it can realize as part of an experiment, called its sample space. Generally an estimator is a statistic; for example, the sample mean $\bar X$ is a statistic that serves as an estimator for the population mean $\mu$. A neural network, too, is an estimator.

An estimate is a number $x$ which is a realization of the estimator $X$ as part of some experiment. Hence $x$ will definitely lie in the sample space of $X$.

For example, we say that the sample mean $\bar X$ is an estimator of the population mean $\mu$, and the computed value $\bar x$ of $\bar X$ is an estimate of $\mu$. The estimator is a random variable and the estimate is a number. Similarly, the sample standard deviation $S$ is an estimator of the population standard deviation $\sigma$, and the computed value $s$ of $S$ is an estimate of $\sigma$.

Example: I take a sample and compute the variance $S$ of that sample. The sample is a random sample from the population, so the sample variance $S$ is (at least before you actually take the sample and compute it) itself a random variable. If you can figure out the distribution of the sample variance, you can find its expected value. In general, once we have the sample in place, the value the estimator computes is a fixed number that depends on the actual sample we got. Until we've taken the sample, it's a random variable that we can analyze in terms of expected value, variance, etc.

Note that the estimate depends on the actual data you sampled, whereas the estimator is just a random variable without any value assigned to it.

Random variables are functions (mathematical procedures), and hence so are estimators.

An estimator is a definite mathematical procedure (a random variable is also a function) that comes up with a number (the estimate) for any possible set of data that a particular problem could produce. That number is intended to represent some definite numerical property $g(\theta)$ of the data-generating process; we might call this the "estimand".

NOTE: Often $\hat\theta$ is used to denote both the estimator and the estimate; the meaning should be clear from the context. The estimator $\hat\theta$ is a random variable and has a distribution associated with it. The estimate $\hat\theta$ is a realization of the estimator for some particular sample, i.e. a point sampled from the estimator's distribution.
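To make this distinction concrete, here is a minimal Python sketch (the population and the function name `sample_mean` are illustrative, not from the text): the estimator is the rule, and each estimate is a number the rule produces for one particular sample.

```python
import random

random.seed(0)

# The estimator is the *rule*: a function mapping any sample to a number.
def sample_mean(sample):
    return sum(sample) / len(sample)

# A made-up population whose true mean is 50.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Two different samples give two different *estimates*
# from the same estimator (two realizations of the RV X-bar).
sample1 = random.sample(population, 30)
sample2 = random.sample(population, 30)
estimate1 = sample_mean(sample1)
estimate2 = sample_mean(sample2)

print(estimate1, estimate2)
```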

Are you a Good Estimator?

  • The estimator's sampling distribution

  • The bias of the estimator (a property of the sampling distribution)

  • The variance of the estimator (a property of the sampling distribution)

Distribution of Estimator

How does an estimator work? You take a sample from the population and calculate an estimate from it as defined by your estimator. Now you have an estimate of some population parameter. If you draw another sample from the population, you can estimate the parameter again.

  • Visualize calculating the estimator over and over with different samples from the same population, i.e. take a sample, calculate an estimate using that rule, then repeat.

  • This process yields the sampling distribution of the estimator.

The distribution of an estimator is also called the sampling distribution of the estimator.

Basically, an estimator is also a random variable, so it has a distribution associated with it. You can build up this distribution by computing multiple estimates from multiple samples of the population.

Remember, whenever someone says "distribution of a parameter", they mean the distribution of the parameter's estimator. Most of the time the actual parameter is unknown; we try to find its value, but what we can actually obtain is a distribution built from multiple estimates of that parameter.
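The take-a-sample, compute-an-estimate, repeat loop can be simulated directly. A sketch, assuming a Gaussian population with made-up parameters:

```python
import random
import statistics

random.seed(1)

mu, sigma = 10.0, 2.0    # true (normally unknown) population parameters
n, n_trials = 25, 5000   # sample size, number of repeated samples

# Take a sample, compute the estimate, repeat: the collected estimates
# form the empirical sampling distribution of the sample-mean estimator.
estimates = []
for _ in range(n_trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    estimates.append(sum(sample) / n)

center = statistics.mean(estimates)   # close to mu: the estimator is unbiased
spread = statistics.stdev(estimates)  # close to sigma / sqrt(n) = 0.4
print(center, spread)
```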

Example: see the gravity estimation walkthrough further down this page.

Bias of Estimator

When you draw a sample and calculate an estimate of some population parameter, that estimate might not be exactly equal to the actual population parameter; there will be some error between the actual parameter and a single estimate, i.e. an estimate may overestimate or underestimate the parameter.

Note the expectation in the bias formula $\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta$: bias is defined via the mean of the estimator, not via any single estimate.

An estimator is said to be unbiased if its bias is equal to zero for all values of the parameter $\theta$.

Variance of Estimator

For an estimator, we want the estimates to be very concentrated around the actual value of the parameter; hence an ideal estimator would have low variance. It shouldn't need reiterating, but since the estimator is a random variable, it has a variance. Having low variance means that all the estimates are very close to each other. One thing to note here: even if all the estimates are very close to each other but far from the actual value, the estimator still has low variance. Hence the variance of an estimator indicates how much the estimates are spread among themselves, not how far they are from the actual value of the parameter.

Mean Square Error of Estimator

So, what do we have up until now? We have the bias of the estimator, which tells us how far the mean of our estimates is from the actual value. We have the variance, which tells us how much the estimates are spread among themselves.

But one important thing is still left to tell: how far our estimates are spread from the actual value of the parameter. This we can tell using the mean square error of the estimator.

If one or more of the estimators are biased, it may be harder to choose between them. For example, one estimator may have a very small bias and a small variance, while another is unbiased but has a very large variance. In this case, you may prefer the biased estimator over the unbiased one.

Mean square error (MSE) is a criterion which tries to take into account concerns about both bias and variance of estimators.

Let's write the MSE in one more form.

Now, if we have an estimator with zero bias, then minimizing the mean square error is equivalent to minimizing the variance.

Estimation Error

Estimation Criteria

The concept of optimality expressed by the words best estimate corresponds to the minimization of the state estimation error in some respect.

Different optimization criteria may be chosen, leading to different estimates of the system's state vector. The estimate can be:

  • the mean, i.e. the center of the probability mass, corresponding to the minimum mean-square error criterion;

  • the mode, i.e. the value of x that has the highest probability, corresponding to the maximum a posteriori (MAP) criterion;

  • the median, i.e. the value of x such that half the probability weight lies to its left and half to its right.
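These three criteria can be compared on samples from a skewed distribution, where they disagree. A sketch (the lognormal "posterior" and the histogram-based MAP are illustrative choices, not from the text):

```python
import random
import statistics

random.seed(5)

# Made-up "posterior" samples for a state x: a right-skewed lognormal,
# so the three point estimates disagree (mode < median < mean).
samples = [random.lognormvariate(0.0, 0.75) for _ in range(20000)]

mmse_estimate = statistics.mean(samples)      # minimum mean-square error
median_estimate = statistics.median(samples)  # minimum absolute error

# Crude MAP estimate: center of the fullest histogram bin.
n_bins = 60
lo, hi = min(samples), max(samples)
width = (hi - lo) / n_bins
counts = [0] * n_bins
for s in samples:
    counts[min(int((s - lo) / width), n_bins - 1)] += 1
map_estimate = lo + (counts.index(max(counts)) + 0.5) * width

print(map_estimate, median_estimate, mmse_estimate)
```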

Random

Resources

What are estimators for? Estimators are for estimating population parameters. But you can have multiple estimators for the same population parameter. For example, you can use $\frac{\sum X_i}{n}$ as an estimator for the mean, or you could also use $\frac{\sum X_i}{n-1}$. How do you know which one is better, and what makes you actually use $\frac{\sum X_i}{n}$ to estimate the population mean? To answer this, you have to analyse certain properties of the estimators, and based on those properties you decide which estimator is good. What properties are used to analyse estimators?

We look at the mean of this sampling distribution to see what value our estimates are centered around. This is the expectation (expected value) of the estimator, $E[\hat\theta]$, where $\hat\theta$ is our estimator.

The spread of this distribution is the variance of the estimator, $\mathrm{Var}[\hat\theta]$.

If you think about it, the distribution of the estimator is what people call the distribution of the parameter. The parameter $\theta$ is actually a fixed constant (most of the time unknown); it is a point value, not a random variable, so it cannot have a distribution. But to get at that actual point value of $\theta$ we compute multiple estimates, and those estimates are what form the distribution. This is how we get a "parameter's distribution", though it is actually the estimator's. Note how easily people use the word parameter in place of its estimator. Be careful, because many times when people are talking about a parameter, they actually mean its estimator. Since the actual value of the parameter is unknown, talking about the parameter usually means talking about our understanding of it, which is its estimator.

Let's say we have the parameter $g$ (gravity). Note that it is a fixed constant; it's just that we don't know its value. What are we going to do to get the value of $g$? We will set up a physics experiment with which we try to estimate it. This experiment is a mathematical procedure for estimating the value of $g$, so the experiment is nothing but an estimator. Writing the mathematical equation of the experiment makes it clearer how the experiment is an estimator. What physics experiment can you think of to estimate $g$? Two options are the free-fall experiment and the pendulum: drop a ball from height $h$ and measure its velocity $v$ at the ground, governed by $v^2 = 2gh$, or measure the time period of a pendulum, $T = 2\pi\sqrt{L/g}$. These experiments give us mathematical procedures to estimate $g$; the equations are our estimators. Suppose we choose the free-fall experiment. Then we have the estimator $\hat g = \frac{v^2}{2h}$. This estimator is the mathematical equation (procedure) defined by the experiment, hence the experiment is the estimator. Now let's say we did the experiment 5 times and got the measurements $v = [10, 20, 15, 25, 20]$ and $h = [5, 22, 11, 32, 20]$. Using these values we get the estimates $\hat g = [10.00, 9.09, 10.23, 9.77, 10.00]$ for the actual parameter $g$. These values give us the "distribution of the parameter $g$", which is actually the distribution of the estimator $\hat g$. The expectation of $\hat g$ is $\frac{10.00 + 9.09 + 10.23 + 9.77 + 10.00}{5} \approx 9.82$.
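Running the free-fall estimator over the five measurements above (a direct transcription, not new data):

```python
# Free-fall estimator g-hat = v^2 / (2h), applied to the five
# measurements from the text.
v = [10, 20, 15, 25, 20]
h = [5, 22, 11, 32, 20]

estimates = [vi ** 2 / (2 * hi) for vi, hi in zip(v, h)]
print([round(g, 2) for g in estimates])  # [10.0, 9.09, 10.23, 9.77, 10.0]

mean_g = sum(estimates) / len(estimates)
print(round(mean_g, 2))  # 9.82
```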

But that's the beauty of statistics: you don't need to restrict yourself to a single estimate; you can draw more samples and calculate multiple estimates of the same parameter. Each of these estimates may be a bit more or less than the actual value of the parameter. So the real measure of how good your estimator is, is not a single estimate but the mean of all the estimates. For a good estimator, the mean of all the estimates should equal the actual population value, even though individual estimates may be a bit above or below the actual value. How do we calculate the mean of all estimates? Simple: it's the expectation of the estimator, i.e. $E[\hat\theta]$.

Although it should be clear why the mean of the estimates is just the expectation of the estimator, here is one more take on it. The estimator $\hat\theta$ is just a random variable, and every random variable has a distribution associated with it; in this case the distribution of the estimator is called the sampling distribution. What are the estimates? The estimates are simply points sampled from the estimator's distribution. So the mean of the estimates is simply the mean of points sampled from an RV's distribution, and that is the mean (expectation) of the RV itself. Hence the mean of the estimates is the expectation $E[\hat\theta]$ of the random variable $\hat\theta$, which is our estimator.

Bias is the tendency of an estimator to overestimate or underestimate a parameter. (See the difference between a statistic and a parameter.) Bias is a property of the estimator.

Mathematically, the bias (or bias function) of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated, i.e. the difference between the mean of values estimated from different samples and the actual population parameter. Note that it is not the difference with a single estimated value, but with the mean of the estimated values over different samples.

Let $\hat\theta$ be an estimator and $\theta$ be the parameter (estimand). Then the bias is given by

$$\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta$$

Suppose we have a statistical model, parameterized by a real number $\theta$, giving rise to a probability distribution for observed data, $P_\theta(x) = P(x \mid \theta)$, and a statistic $\hat\theta$ which serves as an estimator of $\theta$ based on any observed data $x$. That is, we assume that our data follow some unknown distribution $P(x \mid \theta)$ (where $\theta$ is a fixed constant that is part of this distribution, but is unknown), and then we construct some estimator $\hat\theta$ that maps observed data to values that we hope are close to $\theta$. The bias of $\hat\theta$ relative to $\theta$ is defined as

$$\mathrm{Bias}_\theta[\hat\theta] = \mathrm{E}_{x \mid \theta}[\hat\theta] - \theta = \mathrm{E}_{x \mid \theta}[\hat\theta - \theta],$$

where $\mathrm{E}_{x \mid \theta}$ denotes the expected value over the distribution $P(x \mid \theta)$, i.e. averaging over all possible observations $x$. The second equation follows since $\theta$ is measurable with respect to the conditional distribution.

Ever noticed why we use $s^2 = \frac{\sum (x_i - \bar x)^2}{n-1}$ as the estimator for the population variance, and not $s^2 = \frac{\sum (x_i - \bar x)^2}{n}$, even though the latter matches the formula for the population variance? The answer lies in the bias of the two estimators. Here $\bar x$ is the sample mean (also an estimator for the population mean, hence also a random variable). Note that in the estimators above, the $x_i$ are still random variables sampled from the population distribution, not fixed values, until they are actually observed.
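A quick simulation makes the bias visible (a sketch with made-up population parameters): the divide-by-$n$ estimator comes out low by a factor of $(n-1)/n$ on average, while the divide-by-$(n-1)$ estimator centers on the true variance.

```python
import random
import statistics

random.seed(3)

mu, sigma2 = 0.0, 4.0      # made-up population mean and variance
n, trials = 5, 20000       # small n makes the bias easy to see

biased, unbiased = [], []
for _ in range(trials):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    biased.append(ss / n)          # E[ss/n]     = (n-1)/n * sigma2 = 3.2
    unbiased.append(ss / (n - 1))  # E[ss/(n-1)] = sigma2           = 4.0

print(statistics.mean(biased), statistics.mean(unbiased))
```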

For example, the variance of the sample mean $\bar x = \frac{\sum x_i}{n}$ is $\frac{\sigma^2}{n}$, where $\sigma^2$ is the population variance.

$$\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2]$$

The MSE measures the expected value of the squared error of the estimate $\hat\theta$ from the actual value $\theta$.

$$\begin{aligned} \mathrm{MSE}(\hat\theta) &= E[(\hat\theta - \theta)^2] \\ &= E[((\hat\theta - \bar\theta) - (\theta - \bar\theta))^2] \\ &= E[(\hat\theta - \bar\theta)^2 + (\theta - \bar\theta)^2 - 2(\hat\theta - \bar\theta)(\theta - \bar\theta)] \\ &= E[(\hat\theta - \bar\theta)^2] + E[(\theta - \bar\theta)^2] - 2E[(\hat\theta - \bar\theta)(\theta - \bar\theta)] \end{aligned}$$

where $\bar\theta = E[\hat\theta]$ is the mean of the estimator. Now $(\theta - \bar\theta)$ is a constant with respect to this expectation (think about it; it is). Therefore $E[(\theta - \bar\theta)^2] = (\theta - \bar\theta)^2$, and $E[(\hat\theta - \bar\theta)(\theta - \bar\theta)] = (\theta - \bar\theta)E[\hat\theta - \bar\theta] = 0$ because $E[\hat\theta - \bar\theta] = 0$. Hence we get

$$\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + (\bar\theta - \theta)^2 = \mathrm{Var}(\hat\theta) + \mathrm{Bias}^2$$
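The decomposition can be checked numerically. Using the divide-by-$n$ variance estimator as $\hat\theta$ (an illustrative choice with made-up parameters), the empirical MSE matches variance plus squared bias up to floating-point noise:

```python
import random
import statistics

random.seed(4)

theta = 4.0               # true population variance, playing the role of theta
mu, n, trials = 0.0, 5, 20000

# theta-hat: the (biased) divide-by-n variance estimator, computed repeatedly.
ests = []
for _ in range(trials):
    x = [random.gauss(mu, theta ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ests.append(sum((xi - xbar) ** 2 for xi in x) / n)

mse = statistics.mean((e - theta) ** 2 for e in ests)
var = statistics.pvariance(ests)
bias = statistics.mean(ests) - theta

print(mse, var + bias ** 2)  # the two sides agree
```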

Imagine that we have a point estimate $\hat\theta$ for the population parameter $\theta$. Even with a good point estimate, there is very likely to be some error ($\hat\theta = \theta$ is not likely). We can express this error of estimation, denoted $\varepsilon$, as $\varepsilon = |\hat\theta - \theta|$. This is the number of units that our estimate $\hat\theta$ is off from $\theta$ (it does not take the direction of the error into account).

Note that since the estimator $\hat\theta$ is an RV, the error $\varepsilon$ is also random. We can use the sampling distribution of $\hat\theta$ to help place bounds on how big the error is likely to be. Note how closely this notion of estimation error is related to the mean square error.
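For instance, for the sample mean the sampling distribution is approximately normal, so the error exceeds $2\sigma/\sqrt{n}$ only about 5% of the time. A simulation sketch with made-up parameters:

```python
import random

random.seed(2)

mu, sigma, n = 10.0, 2.0, 25       # made-up population and sample size
bound = 2 * sigma / n ** 0.5       # ~95% normal bound on the error

# Fraction of repeated samples whose error |xbar - mu| stays within the bound.
trials = 4000
hits = 0
for _ in range(trials):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if abs(xbar - mu) <= bound:
        hits += 1

coverage = hits / trials
print(coverage)  # close to 0.95
```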

  • Sampling variation and sampling distributions

  • Linear MMSE Estimation of Random Variables

  • Mean Squared Error (MSE)