Friday, July 5, 2019

Estimate vs. Estimand. Above my grade.

Raj Bhuptani, Harvard '13 (Statistics), Two Sigma Investments

The estimand is the quantity of interest whose true value you want to know.

An estimator is a method for estimating the estimand.

An estimate is the numerical value for the estimand that results from applying a particular estimator to data.

For example, suppose we are interested in the mean height of all male adults in the United States. Our estimand is "the mean height of all male adults in the United States". A foolproof way to find this mean exactly would be to measure the height of each and every male adult in the United States and compute the mean. But that sounds too hard, so instead we decide to estimate the mean height by taking a random sample of male adults in the United States and measuring the height of each individual. Suppose we take a random sample of 100 adult men in the United States and measure their heights. Using this data, we now have to choose an estimator that will provide us with an estimate of our estimand.

The most obvious thing to do would be to compute the sample average of the heights. That is, "the sample average" is an estimator that provides an estimate of our estimand. Suppose the sample average is 70 inches. Then 70 inches is the estimate of our estimand provided by the "sample average" estimator.

Another strategy could be to use the sample median of the heights as an estimator. "The sample median" is another estimator that provides an estimate of our estimand. Suppose the sample median is 69 inches. Then 69 inches is the estimate of our estimand provided by the "sample median" estimator.

Yet another strategy could be to use the average of the largest height and the smallest height from the sample. "Method three" (this one does have a name: the mid-range) is another estimator that provides an estimate of our estimand. Suppose the result of this calculation is 71 inches. Then 71 inches is the estimate of our estimand provided by the "method three" estimator.
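
To make the three estimators concrete, here is a minimal sketch in Python (using NumPy); the sample of heights is simulated rather than taken from a real survey, and the distribution parameters are purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical stand-in for a real survey: 100 heights (in inches),
    # assumed here to come from a normal distribution for illustration.
    heights = rng.normal(loc=70.0, scale=3.0, size=100)

    sample_average = heights.mean()                  # estimator 1
    sample_median = np.median(heights)               # estimator 2
    mid_range = (heights.max() + heights.min()) / 2  # estimator 3 ("method three")

    # Three different estimates of the same estimand:
    print(sample_average, sample_median, mid_range)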

Which of these three estimators should we use? You probably know that the "correct" answer is "the sample average", but why? Think of estimators as rules. We would want to choose the rule that tends to perform the best. There are many criteria for evaluating the performance of estimators. What follows is a description of the most common ones and a justification of why "the sample average" is the "best" estimator for the above problem:

  • bias: The bias of an estimator is the difference between the expected value of the estimate (averaged across all possible samples) and the true value. Ostensibly, a good estimator should be unbiased: the expected value of the estimator should equal the true value. While this is usually a good rule of thumb, there are certain interesting cases where one might actually prefer a biased estimator. In the above examples, the "sample average" estimator is unbiased. The other two estimators are unbiased if the distribution of adult male heights is symmetric about the true mean value.
  • variance: The variance of an estimator is... the variance of the estimates it produces across all possible samples. In general, an estimator with low variance is preferable to an estimator with high variance. If adult male heights are normally distributed (so that all three methods discussed above are unbiased), the "sample average" estimator wins over the other two because it has the lowest variance; note that symmetry alone is not enough here, since for heavy-tailed symmetric distributions the sample median can have lower variance than the sample average. That is, even though all three estimators provide the correct estimate on average, the "sample average" estimator tends to deviate least from the truth, followed by the "sample median" estimator and then the "method three" estimator (the first simulation sketch after this list checks this numerically).
  • mean squared error: The mean squared error of an estimator is the bias squared plus the variance. Oftentimes, estimators are chosen to minimize not solely bias or solely variance, but instead the MSE. The phenomenon that it is often possible to reduce the bias of an estimator by increasing its variance, or to decrease the variance of an estimator by introducing some bias, is known as the bias-variance tradeoff, a concept that is central to statistics: all sorts of ideas, theorems, and results across the field can be traced back to it.
  • consistency: An estimator is consistent if, as the sample size increases, the estimate produced by the estimator "converges in probability" to the estimand. Ostensibly this is another good property for an estimator to have. Assuming, say, that the distribution of male heights is normal, all three estimators from above are consistent. Note that consistent is not the same thing as unbiased. For example, you might have heard of the mysterious degrees of freedom, which dictate dividing by (n - 1) instead of (n) in the calculation of the sample variance for the sake of unbiasedness. Indeed, using (n - 1) results in an estimator that is unbiased and consistent; using (n) results in an estimator that is biased but still consistent (the second sketch after this list demonstrates this).
  • efficiency: An estimator is efficient if (a) it is unbiased and (b) it has the lowest possible variance among all unbiased estimators. Assuming that the distribution of adult male heights is normal, both the "sample average" estimator and the "sample median" estimator are unbiased, but only the sample average is efficient. In fact, a noteworthy result is that, as the sample size grows large, the "sample median" estimator is only about 64 percent (2/pi) as efficient as the "sample average" estimator.
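
One way to see these criteria in action is a quick Monte Carlo experiment: repeatedly draw samples from a known distribution, apply each estimator, and compare the empirical bias, variance, and MSE. A sketch, assuming (hypothetically) that heights are normal with true mean 70 inches:

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, sd, n, trials = 70.0, 3.0, 100, 50_000

    # Each row is one hypothetical sample of n heights.
    samples = rng.normal(true_mean, sd, size=(trials, n))
    estimates = {
        "sample average": samples.mean(axis=1),
        "sample median": np.median(samples, axis=1),
        "method three": (samples.max(axis=1) + samples.min(axis=1)) / 2,
    }
    for name, est in estimates.items():
        bias = est.mean() - true_mean
        var = est.var()
        mse = bias**2 + var
        print(f"{name:14s}  bias={bias:+.4f}  var={var:.4f}  mse={mse:.4f}")

    # Under normality, the ratio of variances approaches 2/pi ~ 0.64,
    # the asymptotic relative efficiency of the median:
    print(estimates["sample average"].var() / estimates["sample median"].var())

All three biases should come out near zero, with the variances (and hence the MSEs) ordered sample average < sample median < method three.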
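
Likewise, the unbiased-versus-consistent distinction from the consistency bullet can be checked numerically: dividing by n - 1 gives an unbiased variance estimator, dividing by n gives a biased one, and both converge to the truth as n grows. A sketch under the same (assumed) normal model:

    import numpy as np

    rng = np.random.default_rng(1)
    true_var = 3.0 ** 2  # heights assumed N(70, 3^2), so the true variance is 9
    for n in (5, 50, 500):
        samples = rng.normal(70.0, 3.0, size=(10_000, n))
        unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1
        biased = samples.var(axis=1, ddof=0).mean()    # divide by n
        print(f"n={n:3d}  divide by n-1: {unbiased:.3f}  divide by n: {biased:.3f}")

At small n the divide-by-n estimator comes out noticeably low (by a factor of (n - 1)/n), but both columns approach 9 as n increases.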
