Estimation

Let's talk about estimators, estimates, and related ideas.

Estimation

As you know, in statistics we try to understand a large population on the basis of the information available in a small sample. Part of what we mean by "understand" is knowing the values of the population parameters. These population parameters partially describe the population, or equivalently the distribution of the population. The game here is to use suitable sample statistics to estimate population parameters. For example, we may use the sample mean $\bar x$ as an estimate of the population mean $\mu$.

So most of the time when we talk about estimation, we will be estimating one of the following:

  • Parameters of a distribution $\boldsymbol{\theta}$: for example, if we assume our population follows a Gaussian distribution, we estimate its mean and variance.

  • Parameters of a model $\boldsymbol{w}$: for example, in a neural network we estimate the weights using gradients of a loss function.

  • Outputs: for example, the output $\hat{y}$ of a neural network is itself an estimate.

Estimator & Estimate

An estimator is basically a random variable $X$. A random variable has a set of possible values it can realize as part of an experiment, called its sample space. Generally an estimator is a statistic; for example, the sample mean $\bar X$, which is a statistic, is an estimator of the population mean $\mu$. A neural network is also an estimator.

An estimate is basically a number $x$ which is a realization of the estimator $X$ as part of some experiment. Hence, $x$ will always lie in the sample space of $X$.

For example, we say that the sample mean $\bar X$ is an estimator of the population mean $\mu$, and the computed value $\bar x$ of $\bar X$ is an estimate of $\mu$. The estimator is a sampling random variable and the estimate is a number. Similarly, the sample standard deviation $S$ is an estimator of the population standard deviation $\sigma$, and the computed value $s$ of $S$ is an estimate of $\sigma$.

Example: suppose I take a sample and compute its variance $S$. The sample is a random sample from the population, so the sample variance $S$ is (at least before you actually take the sample and compute it) itself a random variable. If you can work out the distribution of the sample variance, then you can find its expected value. In general, once we have the sample in place, the value the estimator produces is a fixed number that depends on the actual sample we got. Until we have taken the sample, it is a random variable that we can analyse in terms of expected value, variance, and so on.

Note that the estimate depends on the actual values of the data you sampled, whereas the estimator is just a random variable without any value assigned to it.

Random variables are functions or mathematical procedures, and hence so are estimators.

An estimator is a definite mathematical procedure (recall that a random variable is also a function) that produces a number (the estimate) for any possible set of data a particular problem could generate. That number is intended to represent some definite numerical property $g(\theta)$ of the data-generating process; we might call this the "estimand."

NOTE: Often $\hat \theta$ is used to denote both the estimator and the estimate; the meaning should be clear from the context. The estimator $\hat \theta$ is a random variable and has a distribution associated with it. The estimate $\hat \theta$ is the realization of the estimator for some particular sample, basically a point sampled from the estimator's distribution.
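To make the distinction concrete, here is a minimal Python sketch (the function name `sample_mean`, the Gaussian population, and all the numbers are purely illustrative): the estimator is the procedure, and the estimate is the number that procedure returns for one particular sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# The estimator is a procedure: a function that maps any possible sample to a number.
def sample_mean(x):
    return np.sum(x) / len(x)

# The estimate is the realization of that procedure on one concrete sample.
population_mu, population_sigma = 5.0, 2.0
sample = rng.normal(population_mu, population_sigma, size=30)  # one random sample
estimate = sample_mean(sample)                                 # one fixed number

print(f"estimate of mu from this sample: {estimate:.3f}")
# A different sample would give a different estimate; that randomness is exactly
# why the estimator itself is treated as a random variable.
```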

Are you a Good Estimator?

What are estimators for? Estimators are for estimating some population parameter. But you can have multiple estimators for the same population parameter. For example, you can use $\frac{\sum X_i}{n}$ as an estimator of the mean, or you could instead use $\frac{\sum X_i}{n-1}$. How do you know which one is better, and what makes you actually use $\frac{\sum X_i}{n}$ to estimate the population mean? To answer this, you analyse certain properties of the estimators and, based on those properties, decide which estimator is good. Which properties are used to analyse estimators?

  • Estimator's sampling distribution

  • Bias of Estimator - Related to sampling distribution

  • Variance of Estimator - Related to sampling distribution

Distribution of Estimator

How does an estimator work? You take a sample from the population and calculate an estimate from that sample using the procedure defined by your estimator. Now you have an estimate of some population parameter. If you draw another sample from the population, you can estimate the parameter again.

  • Visualize calculating an estimate over and over with different samples from the same population, i.e. take a sample, calculate an estimate using that rule, then repeat.

  • This process yields the sampling distribution of the estimator.

The distribution of an estimator is also called the sampling distribution of the estimator.

Basically, an estimator is also a random variable, so it has a distribution associated with it. You can approximate this distribution by computing multiple estimates from multiple samples drawn from the population.

We look at the mean of this sampling distribution to see what value our estimates are centered around. This is simply the expectation (expected value) of the estimator, $E[\hat \theta]$, where $\hat \theta$ is our estimator.

The spread of this distribution is the variance of the estimator, $Var[\hat \theta]$.
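As a rough sketch of how this plays out in practice, the simulation below (the Gaussian population, sample size, and repeat count are arbitrary choices for illustration) repeatedly draws samples, computes the sample mean as an estimate each time, and then looks at the mean and variance of those estimates, i.e. an empirical view of $E[\hat \theta]$ and $Var[\hat \theta]$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, n_repeats = 5.0, 2.0, 30, 10_000

# Take a sample, compute the estimate (here the sample mean), repeat many times.
estimates = np.array([rng.normal(mu, sigma, size=n).mean() for _ in range(n_repeats)])

# The collection of estimates approximates the sampling distribution of the estimator.
print("E[theta_hat]   ~", estimates.mean())  # close to mu = 5.0
print("Var[theta_hat] ~", estimates.var())   # close to sigma^2 / n = 4/30 ~ 0.133
```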

If you think about it, the distribution of the estimator is often loosely called the distribution of the parameter. The parameter $\theta$ is actually a fixed constant (usually unknown); it is a point value, not a random variable, so strictly speaking it cannot have a distribution. But to get at that point value we compute multiple estimates, and those estimates form a distribution. That is how we end up with a "parameter's distribution," even though it is really the estimator's. Note how easily people use the word parameter in place of its estimator. Be careful: many times when people talk about a parameter, they actually mean its estimator. This slippage happens because the actual value of the parameter is unknown, so talking about the parameter in practice means talking about our knowledge of it, which is its estimator.

Remember, whenever someone says "distribution of a parameter," it usually means the distribution of the parameter's estimator. Most of the time the actual parameter is unknown; we try to find its value, but all we can actually obtain is a distribution built from multiple estimates of that parameter.

Example:

Let's say we have the parameter $g$ (gravity). Note that it is a fixed constant; we just don't know its value. What are we going to do to get the value of $g$? We will set up a physics experiment with which we will try to estimate it. The experiment defines a mathematical procedure for estimating $g$, so the experiment is basically an estimator. Writing down the equation behind the experiment makes this clearer. What physics experiment can you think of to estimate $g$? Two obvious options are a free-fall experiment or a pendulum. For free fall, drop a ball from height $h$ and measure its speed $v$ at the ground; this is described by $v^2 = 2gh$. For the pendulum, measure its period, $T = 2\pi\sqrt{\frac{L}{g}}$. In what sense are these experiments our estimator? They give us the mathematical procedure to estimate $g$, so these equations are our estimator. Suppose we choose the free-fall experiment. Then our estimator is $\hat g = \frac{v^2}{2h}$. This estimator is given by the mathematical equation (procedure) defined by the experiment, so the experiment is the estimator. Now say we perform the experiment 5 times and get the measurements $v = [10, 20, 15, 25, 20]$ and $h = [5, 22, 11, 32, 20]$. Using these values we get the following estimates $\hat g$ for the actual parameter $g$: $[10.00, 9.09, 10.23, 9.77, 10.00]$. These values give us the "distribution of the parameter $g$," which is really the distribution of the estimator $\hat g$. The expectation of $\hat g$ is $\frac{10.00 + 9.09 + 10.23 + 9.77 + 10.00}{5} = 9.82$.
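A small sketch reproducing those numbers (the variable names are arbitrary; the measurements are the ones listed above):

```python
import numpy as np

# The five measurements from the free-fall experiment above.
v = np.array([10.0, 20.0, 15.0, 25.0, 20.0])  # impact speeds
h = np.array([5.0, 22.0, 11.0, 32.0, 20.0])   # drop heights

# The estimator is the procedure defined by the physics: g_hat = v^2 / (2h).
g_hat = v**2 / (2 * h)

print("estimates:", np.round(g_hat, 2))      # [10.    9.09  10.23  9.77  10.  ]
print("E[g_hat] ~", round(g_hat.mean(), 2))  # ~9.82
```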

Bias of Estimator

When you draw a sample and calculate an estimate of some population parameter, that estimate might not be exactly equal to the actual population parameter; there will be some error between the actual parameter and a single estimate, i.e. an estimate may overestimate or underestimate the parameter.

But that's the beauty of statistics: you don't need to restrict yourself to a single estimate. Instead you can draw more samples and calculate multiple estimates of the same parameter. All of these estimates may be a bit more or less than the actual value of the parameter. So the real measure of how good your estimator is, is not a single estimate but the mean of all the estimates. For a good estimator, the mean of all the estimates should equal the actual population parameter, even though individual estimates may be a bit higher or lower than the actual value. How do we calculate the mean of all the estimates? Simple: it's the expectation of the estimator, i.e. $E[\hat \theta]$.

Although it should be clear why the mean of the estimates is just the expectation of the estimator, let me spell it out. The estimator $\hat \theta$ is a random variable, and every random variable has a distribution associated with it; in this case the estimator's distribution is called the sampling distribution. What are the estimates? They are simply points sampled from the estimator's distribution. So the mean of the estimates is simply the mean of points sampled from a random variable's distribution, and what does that converge to? The mean, or expectation, of the random variable itself. Hence the mean of the estimates is the expectation $E[\hat \theta]$ of the random variable $\hat \theta$, which is our estimator.

Bias is the tendency of an estimator to overestimate or underestimate a parameter. (See here for the difference between a statistic and a parameter.) Bias is a property of the estimator.

Mathematically, the bias (or bias function) of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated, i.e. the difference between the mean of the estimates obtained from different samples and the actual population parameter. Note that it is not the difference from a single estimate, but from the mean of the estimates over different samples.

Let $\hat \theta$ be an estimator and $\theta$ be the parameter, or estimand. Then the bias is given by

$$\mathrm{Bias}(\hat \theta) = E[\hat \theta] - \theta$$

Note the expectation in the above formula.

An estimator is said to be unbiased if its bias is equal to zero for all values of the parameter $\theta$.

Ever noticed why we use $s^2 = \frac{\sum (x_i - \bar x)^2}{n-1}$ as the estimator of the population variance and not $s^2 = \frac{\sum (x_i - \bar x)^2}{n}$, even though the latter mirrors the formula for the population variance? The answer lies in the bias of the two estimators. Here $\bar x$ is the sample mean (also an estimator of the population mean, hence also a random variable). Note that in these estimators the $x_i$ are still random variables sampled from the population distribution, not fixed values, until they are actually observed.
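One way to see the answer without doing the algebra is to simulate it. The sketch below (the Gaussian population, $\sigma^2 = 4$, $n = 10$, and the repeat count are arbitrary choices for illustration) computes both estimators on many samples; the average of the $n$-divisor estimates comes out near $\frac{n-1}{n}\sigma^2$, i.e. biased low, while the $(n-1)$-divisor estimates average out near $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, n_repeats = 0.0, 4.0, 10, 100_000

var_n, var_n_minus_1 = [], []
for _ in range(n_repeats):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    ss = np.sum((x - x.mean()) ** 2)
    var_n.append(ss / n)                # divide by n
    var_n_minus_1.append(ss / (n - 1))  # divide by n - 1

# With the n divisor, E[s^2] = (n-1)/n * sigma^2 = 3.6: biased downward.
# With the n-1 divisor, E[s^2] = sigma^2 = 4.0: unbiased.
print("mean of n-divisor estimates:    ", np.mean(var_n))
print("mean of (n-1)-divisor estimates:", np.mean(var_n_minus_1))
```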

Variance of Estimator

For an estimator, we would want the estimates to be concentrated near the actual value of the parameter, so an ideal estimator would have low variance. It shouldn't need reiterating, but since an estimator is a random variable, it has a variance. Having low variance means that all the estimates are very close to each other. One thing to note: even if all the estimates are very close to each other but far from the actual value, the estimator will still have low variance. So the variance of an estimator indicates how much the estimates are spread among themselves, not how far they are from the actual value of the parameter.

For example, the variance of the sample mean $\bar x = \frac{\sum x_i}{n}$ is $\frac{\sigma^2}{n}$, where $\sigma^2$ is the population variance.
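For the record, here is the short derivation of that fact, assuming the $x_i$ are independent draws from the population, each with variance $\sigma^2$:

$$
Var(\bar x) = Var\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n} Var(x_i)
= \frac{1}{n^2}\cdot n\,\sigma^2
= \frac{\sigma^2}{n}
$$

where the second equality uses the independence of the $x_i$.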

Mean Square Error of Estimator

So, what do we have so far? We have the bias of an estimator, which tells us how far the mean of our estimates is from the actual value. We have the variance, which tells us how spread out the estimates are among themselves.

But one important thing is still missing: how far our estimates are spread from the actual value of the parameter. This we can quantify using the mean square error of the estimator.

If one or more of the estimators are biased, it may be harder to choose between them. For example, one estimator may have a very small bias and a small variance, while another is unbiased but has a very large variance. In this case, you may prefer the biased estimator over the unbiased one.

Mean square error (MSE) is a criterion which tries to take into account concerns about both bias and variance of estimators.

$$MSE(\hat \theta) = E[(\hat \theta - \theta)^2]$$

MSE measures the expected value of the squared error of the estimator $\hat \theta$ from the actual value $\theta$.

Let's write MSE in one more form

$$
\begin{aligned}
MSE(\hat \theta) &= E[(\hat \theta - \theta)^2] \\
&= E\left[\left((\hat \theta - \bar\theta) - (\theta - \bar\theta)\right)^2\right] \\
&= E\left[(\hat \theta - \bar\theta)^2 + (\theta - \bar\theta)^2 - 2(\hat \theta - \bar\theta)(\theta - \bar\theta)\right] \\
&= E[(\hat \theta - \bar\theta)^2] + E[(\theta - \bar\theta)^2] - 2E[(\hat \theta - \bar\theta)(\theta - \bar\theta)]
\end{aligned}
$$

where $\bar \theta = E[\hat \theta]$ is the mean of the estimator. Now $(\theta - \bar \theta)$ is a constant with respect to this expectation (think about it; it is). Therefore $E[(\theta - \bar \theta)^2] = (\theta - \bar \theta)^2$ and $E[(\hat \theta - \bar\theta)(\theta - \bar \theta)] = (\theta - \bar \theta)\,E[\hat \theta - \bar\theta] = 0$, because $E[\hat \theta - \bar\theta] = 0$. Hence we get

$$MSE(\hat \theta) = Var(\hat \theta) + (\bar \theta - \theta)^2 = Var(\hat \theta) + \mathrm{Bias}(\hat \theta)^2$$

Now, if we restrict ourselves to estimators that have zero bias, then minimizing the mean square error is equivalent to minimizing the variance.
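As a quick sanity check of the decomposition, the sketch below (again using the $n$-divisor variance estimator on Gaussian samples, purely as an example) estimates the MSE directly and compares it to $Var + \mathrm{Bias}^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, n_repeats = 4.0, 10, 200_000

# The biased (n-divisor) variance estimator evaluated on many samples.
estimates = []
for _ in range(n_repeats):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    estimates.append(np.sum((x - x.mean()) ** 2) / n)
estimates = np.array(estimates)

mse = np.mean((estimates - sigma2) ** 2)  # E[(theta_hat - theta)^2]
var = estimates.var()                     # Var(theta_hat)
bias = estimates.mean() - sigma2          # E[theta_hat] - theta

print("MSE          :", mse)
print("Var + Bias^2 :", var + bias ** 2)  # equals the MSE above: the identity
                                          # holds exactly for the simulated estimates too
```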

Estimation Error

Imagine that we have a point estimate $\hat \theta$ for the population parameter $\theta$. Even with a good point estimate, there is very likely to be some error ($\hat \theta = \theta$ is not likely). We can express this error of estimation, denoted $\epsilon$, as $\epsilon = |\hat \theta - \theta|$. This is the number of units that our estimate $\hat \theta$ is off from $\theta$ (it does not take into account the direction of the error).

Note that since the estimator $\hat \theta$ is a random variable, the error $\epsilon$ is also random. We can use the sampling distribution of $\hat \theta$ to help place bounds on how big the error is likely to be. Note how closely this notion of estimation error is related to the mean square error.
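For instance, a minimal sketch assuming the sample mean as the estimator: its sampling distribution is approximately $\mathcal{N}(\mu, \sigma^2/n)$, so the error $\epsilon = |\bar x - \mu|$ should stay below $1.96\,\sigma/\sqrt{n}$ roughly 95% of the time. The simulation below checks this empirically (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, n_repeats = 5.0, 2.0, 30, 50_000

# Absolute error of the sample mean over many repeated samples.
sample_means = rng.normal(mu, sigma, size=(n_repeats, n)).mean(axis=1)
errors = np.abs(sample_means - mu)

# The sampling distribution of the sample mean is ~ N(mu, sigma^2 / n),
# so the error should stay below 1.96 * sigma / sqrt(n) about 95% of the time.
bound = 1.96 * sigma / np.sqrt(n)
print("error bound:", bound)
print("fraction of samples within the bound:", np.mean(errors <= bound))
```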

Estimation Criteria

The concept of optimality expressed by the words best estimate corresponds to the minimization of the state estimation error in some respect.

Different optimization criteria may be chosen, leading to different estimates of the system's state vector. The estimate can be one of the following (a small sketch comparing them follows the list):

  • the mean, i.e., the center of the probability mass, corresponding to the minimum mean-square error criterion;

  • the mode, i.e., the value of $x$ that has the highest probability, corresponding to the Maximum a Posteriori (MAP) criterion;

  • the median, where the estimate is the value of $x$ such that half the probability weight lies to the left and half to the right of it.
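As an illustration of how these three criteria can disagree, the sketch below draws samples from a skewed, posterior-like distribution (a Gamma distribution, chosen purely for illustration) and computes the three candidate point estimates; the mode is approximated crudely from a histogram, which is an assumption of this sketch rather than a proper MAP computation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Samples from a skewed, posterior-like distribution (Gamma, used only for illustration).
samples = rng.gamma(shape=2.0, scale=1.5, size=100_000)

mean_est = samples.mean()        # minimum mean-square-error choice
median_est = np.median(samples)  # half the probability mass on each side
hist, edges = np.histogram(samples, bins=200)  # crude, histogram-based mode (MAP-like)
mode_est = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])

print(f"mean   = {mean_est:.2f}")    # ~3.0 for Gamma(shape=2, scale=1.5)
print(f"median = {median_est:.2f}")  # ~2.5
print(f"mode   = {mode_est:.2f}")    # ~1.5 = (shape - 1) * scale
```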
