Theoretical Statistics

Notes based on Casella and Berger, 2nd edition.

Probability Theory

To Do

Transformations and Expectations

Transformation

Theorem 2.1.5: Let \(X\) have pdf \(f_X(x)\) and let \(Y = g(X)\) where \(g\) is a monotone function. Let \(\mathbf{X}\) and \(\mathbf{Y}\) be defined by \(\mathbf{X} = \{ x: f_X(x) > 0 \}\) and \(\mathbf{Y} = \{y: y = g(x) \hbox{ for some } x \in \mathbf{X} \}\). Suppose that \(f_X(x)\) is continuous on \(\mathbf{X}\) and that \(g^{-1}(y)\) has a continuous derivative on \(\mathbf{Y}\). Then, the pdf of \(Y\) is given by

\[\begin{split}f_Y(y) = \begin{cases} f_X(g^{-1}(y)) |\frac{d}{dy} g^{-1}(y) | & y \in \mathbf{Y}\\ 0 & o.w.\\ \end{cases}\end{split}\]

Procedure:

  1. Check if \(g\) is monotonic

  2. Split up into regions where monotonic and then evaluate formula above

See: Example 2.1.6 (monotonic) and Example 2.1.7 (multiple regions)
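
A quick illustration (my own, not from the text): let \(X\) have pdf \(f_X(x) = e^{-x}\) on \(x>0\) and take \(Y = g(X) = \sqrt{X}\), which is monotone on \(\mathbf{X} = (0,\infty)\). Then \(g^{-1}(y) = y^2\), \(\frac{d}{dy}g^{-1}(y) = 2y\), and

\[f_Y(y) = f_X(y^2)\,|2y| = 2ye^{-y^2}, \quad y > 0\]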


Theorem 2.1.10:

\[F^{-1}_X(y) = x \Leftrightarrow F_X(x) = y\]

See: Proof on pg. 54


Expected values

\[\begin{split}E g(X) = \begin{cases} \int_{-\infty}^\infty g(x) f_X(x) dx& \hbox{ if $X$ is continuous}\\ \sum_{x \in \mathbf{X}} g(x) f_X(x) = \sum_{x \in \mathbf{X}} g(x) P(X=x)& \hbox{ if $X$ is discrete}\\ \end{cases}\end{split}\]

See: Example 2.2.2 for continuous and Example 2.2.3 for discrete
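
A quick worked computation (mine, for the continuous case): if \(X\) has pdf \(f_X(x) = e^{-x}\) on \(x > 0\), then

\[EX = \int_0^\infty x e^{-x} dx = \big[-xe^{-x}\big]_0^\infty + \int_0^\infty e^{-x} dx = 0 + 1 = 1\]

using integration by parts.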

Properties:

  1. \(E(a g_1(X) + b g_2(X) + c) = a E g_1(X) + b E g_2(X) + c\).

  2. If \(g_1(x) \geq 0\) for all \(x\) then \(E g_1(X) \geq 0\).

  3. If \(g_1(x) \geq g_2(x)\) for all \(x\) then \(E g_1(X) \geq E g_2(X)\).

  4. If \(a \leq g_1(x) \leq b\) for all \(x\) then \(a \leq E g_1(X) \leq b\).

Example: Minimize distance

\[\begin{split}\begin{align*} & \textcolor{red}{\hbox{Add } \pm E X}\\ E(X-b)^2 &= E(X - E X + E X - b)^2\\ & \textcolor{red}{\hbox{Group terms}}\\ &= E((X - EX) + (EX - b))^2\\ &= E(X-EX)^2 + (EX - b)^2 + 2E((X-EX)(EX - b))\\ & \textcolor{red}{\hbox{We know } E((X-EX)(EX - b)) = (EX-b)E(X-EX) = 0}\\ & \textcolor{red}{\hbox{because } (EX-b) \hbox{ is constant and } E(X-EX) = EX-EX=0}\\ &= E(X-EX)^2 + (EX - b)^2\\ \min_b E(X-b)^2 &= E(X-EX)^2\\ & \textcolor{red}{\hbox{achieved by choosing } b=EX} \end{align*}\end{split}\]

This result happens to be the definition of variance.


Moments

The \(n\)th central moment of \(X\) is

\[\mu_n = E(X-\mu)^n\]

where \(\mu = E X\).

From this, we know the variance is

\[Var X = E(X - EX)^2\]

See: Example 2.3.3 for the variance of a parameter

Properties:

  1. \(Var(aX + b) = a^2 Var X\)

  2. \(Var X = E(X - EX)^2 = EX^2 - (EX)^2\)
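
Property 2 follows directly from expanding the square:

\[E(X-EX)^2 = E\big(X^2 - 2X\,EX + (EX)^2\big) = EX^2 - 2(EX)^2 + (EX)^2 = EX^2 - (EX)^2\]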


Moment Generating Function (mgf)

Let \(X\) be a R.V. with cdf \(F_X\). The mgf of \(X\) (or \(F_X\)) is

\[M_X(t) = E e^{tX}\]

With our knowledge of expected values,

\[\begin{split}M_X(t) = \begin{cases} \int_{-\infty}^\infty e^{tx} f_X(x) dx & \hbox{if } X \hbox{ is continuous}\\ \sum_{x} e^{tx} P(X=x) & \hbox{if } X \hbox{ is discrete}\\ \end{cases}\end{split}\]

Theorem: The \(n\)th moment is equal to the \(n\)th derivative of \(M_X(t)\) evaluated at \(t=0\).

\[M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \rvert_{t=0}\]

Assuming we can differentiate under the integral sign (see Leibnitz Rule below),

\[\frac{d}{dt} M_X(t) = E X e^{tX}\]

Evaluating this at \(t=0\), we have

\[\frac{d^n}{dt^n} M_X(t)\rvert_{t=0} = E X^n e^{tX} \rvert_{t=0} = E X^n\]

See: Example 2.3.8 for continuous case and Example 2.3.9 for discrete.
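
A quick worked case (mine): for \(f_X(x) = e^{-x}\) on \(x>0\),

\[M_X(t) = \int_0^\infty e^{tx}e^{-x} dx = \frac{1}{1-t} \;\; (t<1), \qquad M_X^\prime(t) = \frac{1}{(1-t)^2}, \qquad M_X^\prime(0) = 1 = EX\]

matching the integration-by-parts computation of \(EX\) above.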

Properties:

  1. \(M_{aX + b}(t) = e^{bt} M_X(at)\)
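
Property 1 is immediate from the definition:

\[M_{aX+b}(t) = E e^{t(aX+b)} = e^{bt} E e^{(at)X} = e^{bt} M_X(at)\]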


Convergence of mgfs

Suppose \(\{X_i, i = 1, 2, ... \}\) is a sequence of RVs, each with mgf \(M_{X_i}(t)\). Furthermore suppose that

\[\lim_{i\xrightarrow{} \infty} M_{X_i}(t) = M_X(t)\]

for all \(t\) in a neighborhood of 0 and \(M_X(t)\) is an mgf. Then, there is a unique cdf \(F_X\) whose moments are determined by \(M_X(t)\) and, for all \(x\) where \(F_X(x)\) is continuous we have

\[\lim_{i\xrightarrow{}\infty}F_{X_i}(x) = F_X(x)\]

That is, convergence of mgfs (for \(|t| < h\)) to an mgf implies convergence of cdfs.

This relies on the theory of Laplace transforms, which treats the mgf as the transform

\[M_X(t) = \int_{-\infty}^\infty e^{tx} f_X(x) dx\]

Proof: Poisson approximation of Binomial

We know that the Poisson approximation is valid when \(n\) is large and \(np\) is small.

Recall that the mgf of the binomial is \(M_X(t) = [p e^t + (1 - p)]^n\).

From the textbook, the mgf of the Poisson is \(M_Y(t) = e^{\lambda (e^t - 1)}\).

If we define \(p = \lambda / n\) then \(M_X(t) \xrightarrow{} M_Y(t)\) as \(n \xrightarrow{} \infty\).
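
Filling in the limit (using the lemma below): with \(p = \lambda/n\),

\[M_X(t) = \left[1 + \frac{\lambda(e^t - 1)}{n}\right]^n \xrightarrow{} e^{\lambda(e^t-1)} = M_Y(t)\]

as \(n \xrightarrow{} \infty\), taking \(a_n = a = \lambda(e^t-1)\) in the lemma.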


Lemma: If \(\lim_{n\xrightarrow{} \infty} a_n = a\), then

\[\lim_{n\xrightarrow{} \infty} (1 + \frac{a_n}{n})^n = e^a\]

Leibnitz Rule

If \(f(x,\theta)\), \(a(\theta)\), and \(b(\theta)\) are differentiable with respect to \(\theta\), then

\[\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x,\theta) dx = f(b(\theta),\theta) \frac{d}{d\theta}b(\theta) - f(a(\theta),\theta) \frac{d}{d\theta}a(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial \theta} f(x,\theta) dx\]

See page 69.

If \(a(\theta)\) and \(b(\theta)\) are constant, then

\[\frac{d}{d\theta} \int_a^b f(x,\theta) dx = \int_a^b \frac{\partial}{\partial \theta} f(x,\theta) dx\]

Lebesgue’s Dominated Convergence Theorem

See pages 69 and 70. Roughly: if the integrands are dominated by a single integrable function, then the limit can be brought inside the integral.

Lipschitz Continuous

Smoothness is imposed on a function by bounding its first derivative (in \(\theta\)) by a function with a finite integral. This justifies interchanging integration and differentiation.

See: Theorem 2.4.3 (pg 70), Corollary 2.4.4 and Examples 2.4.5 and 2.4.6



Families of Distributions

Check if pdf part of exponential family

\[f(x|\mathbf{\theta}) = h(x) c(\mathbf{\theta}) \exp (\sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x))\]

where \(h(x) \geq 0\) and the \(t_i(x)\) are real-valued functions of the observation \(x\) (they cannot depend on \(\mathbf{\theta}\)), and \(c(\mathbf{\theta}) \geq 0\) and the \(w_i(\mathbf{\theta})\) are real-valued functions of the possibly vector-valued parameter \(\mathbf{\theta}\) (they cannot depend on \(x\)).

Here are some common exponential families:

  1. Continuous: normal, gamma, beta

  2. Discrete: binomial, Poisson, negative binomial

A distribution which is a member of the exponential family has nice properties. For instance,

Expectations and Variance of exponential family pdf

Theorem: If \(X\) is a RV with a pdf or pmf that is a member of an exponential family, then

\[E\left(\sum_{i=1}^k \frac{\partial w_i(\mathbf{\theta})}{\partial \theta_j} t_i(X)\right) = - \frac{\partial}{\partial \theta_j} \log c(\mathbf{\theta})\]
\[Var\left(\sum_{i=1}^k \frac{\partial w_i(\mathbf{\theta})}{\partial \theta_j} t_i(X)\right) = - \frac{\partial^2}{\partial \theta_j^2} \log c(\mathbf{\theta}) - E\left(\sum_{i=1}^k \frac{\partial^2 w_i(\mathbf{\theta})}{\partial \theta^2_j} t_i(X)\right)\]
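
As a quick check of the theorem (my own, using the \(binomial(n,p)\) exponential family form worked out later in these notes, with \(c(p) = (1-p)^n\), \(w_1(p) = \log(\frac{p}{1-p})\), \(t_1(x) = x\)):

\[\begin{split}\begin{align*} E\left(\frac{d w_1(p)}{dp} X\right) &= -\frac{d}{dp} \log c(p)\\ \frac{1}{p(1-p)} EX &= \frac{n}{1-p}\\ EX &= np\\ \end{align*}\end{split}\]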

Definition: The indicator function of a set \(A\)

\[\begin{split}I_A(x) = \begin{cases} 1 & x \in A\\ 0 & x \not\in A\\ \end{cases}\end{split}\]

So, we can write the normal pdf (example 3.4.4) as

\[f(x|\mu, \sigma^2) = h(x) c(\mu, \sigma) \exp [w_1(\mu, \sigma) t_1(x) + w_2(\mu, \sigma) t_2(x)] I_{(-\infty, \infty)}(x)\]

Since the indicator function is only a function of \(x\), it can be incorporated into the function \(h(x)\), showing that this pdf is of the exponential family form.
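
A worked decomposition (my own filling-in of the functions, using \(-(x-\mu)^2/(2\sigma^2) = \frac{\mu}{\sigma^2}x - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2}\)):

\[f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}}\; \Big(\frac{1}{\sigma} e^{-\mu^2/(2\sigma^2)}\Big)\; \exp\Big(\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2\Big)\]

so \(h(x) = 1/\sqrt{2\pi}\), \(c(\mu,\sigma) = \frac{1}{\sigma} e^{-\mu^2/(2\sigma^2)}\), \(w_1(\mu,\sigma) = \mu/\sigma^2\), \(t_1(x) = x\), \(w_2(\mu,\sigma) = -1/(2\sigma^2)\), and \(t_2(x) = x^2\).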

Another example is \(f(x|\theta) = \theta^{-1} \exp(1 - \frac{x}{\theta})\) on \(0 < \theta < x < \infty\). Although the exponential term fits the exponential family definition, the indicator function \(I_{[\theta, \infty)}(x)\) depends on \(\theta\), so this pdf is not a member of the exponential family.


Chebychev’s inequality

Let \(X\) be a RV and let \(g(x)\) be a nonnegative function. Then, for any \(r>0\),

\[P(g(X) \geq r) \leq \frac{E g(X)}{r}\]

We usually set \(r = t^2\).

See: Example 3.6.2 and 3.6.3
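
The standard special case (my own restatement): if \(X\) has mean \(\mu\) and variance \(\sigma^2\), take \(g(x) = (x-\mu)^2/\sigma^2\) and \(r = t^2\); then

\[P\left(\frac{(X-\mu)^2}{\sigma^2} \geq t^2\right) = P(|X-\mu| \geq t\sigma) \leq \frac{1}{t^2} E\frac{(X-\mu)^2}{\sigma^2} = \frac{1}{t^2}\]

so, for example, at most \(1/4\) of any distribution's mass lies more than \(2\sigma\) from its mean.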



Multiple Random Variables

Joint probability \(f_{X,Y}(x,y)\)

Definition: Discrete

Marginal probability is \(f_X(x) = \sum_{y \in \mathbf{R}} f_{X,Y}(x,y)\)

Definition: Continuous

\[Eg(X,Y) = \int_{-\infty}^\infty \int_{-\infty}^\infty g(x,y) f(x,y) dx dy\]
\[f_X(x) = \int_{-\infty}^\infty f(x,y)dy\]

where \(-\infty <x < \infty\)

\[\frac{\partial^2 F(x,y)}{\partial x \partial y} = f(x,y)\]

Definition: Conditional

\[f(y|x) = P(Y=y|X=x) = \frac{f(x,y)}{f_X(x)}\]

where \(\sum_y f(y|x) = 1\).

\[E(g(Y) | x) = \int_{-\infty}^\infty g(y) f(y|x) dy\]

Independence properties

\[f(x,y) = f_X(x) f_Y(y)\]
\[f(y|x) = \frac{f(x,y)}{f_X(x)} = \frac{f_X(x) f_Y(y)}{f_X(x)} = f_Y(y)\]
\[E(g(X)h(Y)) = (Eg(X))(Eh(Y))\]
\[M_Z(t) = M_X(t) M_Y(t), \quad \hbox{where } Z = X + Y\]

Let \(U=g(X)\) and \(V=h(Y)\), where \(X\) and \(Y\) are independent, and define \(A_u = \{ x: g(x) \leq u\}\) and \(B_v = \{ y: h(y) \leq v\}\). Then,

\[f_{U,V}(u, v) = \frac{\partial^2}{\partial u \partial v} F_{U,V}(u,v) = \left(\frac{d}{du} P(X \in A_u)\right) \left(\frac{d}{dv} P(Y \in B_v)\right)\]

Conditional expectation \(EX = E(E(X|Y))\)

Rewritten, we say \(E_X X = E_Y (E_{X|Y} (X|Y))\) because \(E(X|Y)\) is a rv (random in \(Y\)),

\[E(X|Y=y) = \int x f_{X|Y}(x|y) dx\]

is a constant and
\[E_Y E(X|Y=y) = \int \{ \int x f_{X|Y}(x|y)dx \} f_Y(y) dy\]

See: Example 4.4.5
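
Filling in the step for the continuous case (my own):

\[E_Y E(X|Y) = \int \left\{ \int x f_{X|Y}(x|y) dx \right\} f_Y(y) dy = \int \int x f(x,y)\, dx\, dy = EX\]

using \(f_{X|Y}(x|y) f_Y(y) = f(x,y)\).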


Definition: Conditional variance identity

For any two rv \(X\) and \(Y\),

\[Var X = E(Var(X|Y)) + Var(E(X|Y))\]

See: Example 4.4.8
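
A short derivation (my own, using the tower property):

\[\begin{split}\begin{align*} E(Var(X|Y)) + Var(E(X|Y)) &= E\big(E(X^2|Y) - [E(X|Y)]^2\big) + E[E(X|Y)]^2 - [E(E(X|Y))]^2\\ &= E(E(X^2|Y)) - [E(E(X|Y))]^2\\ &= EX^2 - (EX)^2 = Var X\\ \end{align*}\end{split}\]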


Covariance and correlation

Covariance and correlation measure the strength of a relationship between two rv.

Covariance of \(X\) and \(Y\) is

\[Cov(X,Y) = E((X - \mu_X)(Y - \mu_Y)) = EXY - \mu_X \mu_Y\]

This gives information about the relationship between \(X\) and \(Y\): large positive values mean \(X\) and \(Y\) tend to go up or down together, while large negative values mean they move in opposite directions. The magnitude of the covariance, however, depends on the scales of \(X\) and \(Y\), so it is hard to interpret on its own. Normalizing by the standard deviations puts the measure on a fixed range; this is what correlation does.

Correlation of \(X\) and \(Y\) is

\[\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}\]

\(\rho_{XY}\) is also known as the correlation coefficient.

If \(X\) and \(Y\) are independent, then \(EXY = (EX)(EY)\) and therefore,

\[Cov(X,Y) = EXY - (EX)(EY) = 0\]
\[\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} = \frac{0}{\sigma_X \sigma_Y} = 0\]

Note: It is invalid to say because \(Cov(X,Y)=0\), \(X\) and \(Y\) are independent. For example, if \(X \sim f(x-\theta)\) and \(Y\) is an indicator function \(Y = I(|X-\theta|<2)\), then \(Y\) and \(X\) are not independent but \(E(XY)\) ends up equaling \(EXEY\) so \(Cov(X,Y) = 0\).

Properties:

For any rv \(X\) and \(Y\),

  1. \(-1 \leq \rho_{XY} \leq 1\)

  2. \(|\rho_{XY}| = 1\) iff there exists \(a \neq 0\) and \(b\) st \(P(Y=aX+b) = 1\). If \(\rho_{XY} = 1\), then \(a > 0\), and if \(\rho_{XY} = -1\), then \(a<0\).


Definition: Multivariate variance

If \(X\) and \(Y\) are any two rv and \(a\) and \(b\) are any two constants, then

\[Var(aX + bY) = a^2 Var X + b^2 Var Y + 2ab Cov(X,Y)\]
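
For reference, this identity follows from expanding the square (my own filling-in), since \(aX+bY - E(aX+bY) = a(X-\mu_X) + b(Y-\mu_Y)\):

\[\begin{split}\begin{align*} Var(aX+bY) &= E\big( a(X-\mu_X) + b(Y-\mu_Y) \big)^2\\ &= a^2 E(X-\mu_X)^2 + b^2 E(Y-\mu_Y)^2 + 2ab\, E\big((X-\mu_X)(Y-\mu_Y)\big)\\ &= a^2 Var X + b^2 Var Y + 2ab\, Cov(X,Y)\\ \end{align*}\end{split}\]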

Note: if \(X\) and \(Y\) are independent rv then

\[Var(aX + bY) = a^2 Var X + b^2 Var Y\]

Multivariate distributions

With \(\mathbf{X} = (X_1, ..., X_n)\) a random vector whose sample space is a subset of \(\mathbf{R}^n\),

\[P(\mathbf{X} \in A) = \int ... \int_A f(\mathbf{x}) d\mathbf{x}\]

and its expectation

\[E g(\mathbf{X}) = \int_{-\infty}^\infty ... \int_{-\infty}^\infty g(\mathbf{x}) f(\mathbf{x}) d\mathbf{x}\]

The marginal pdf of any subset of the coordinates of \((X_1, ..., X_n)\) can be computed by integrating the joint pdf over all possible values of the remaining coordinates.


Multinomial distribution

Let \(n\) and \(m\) be positive integers and let \(p_1,..., p_n\) be numbers satisfying \(0 \leq p_i \leq 1\), \(i=1, ..., n\) and \(\sum_{i=1}^n p_i = 1\). Then the random vector \((X_1, ..., X_n)\) has a multinomial distribution with \(m\) trials and cell probabilities \(p_1, ..., p_n\) if the joint pmf of \((X_1, ..., X_n)\) is

\[f(x_1, ..., x_n) = \frac{m!}{x_1! \times ... \times x_n!} p_1^{x_1} \times ... \times p_n^{x_n} = m! \prod_{i=1}^n \frac{p_i^{x_i}}{x_i!}\]

on the set of \((x_1, ..., x_n)\) st \(x_i\) is a nonnegative integer and \(\sum_{i=1}^n x_i = m\).

This corresponds to the following experiment: the experiment consists of \(m\) independent trials. Each trial results in one of \(n\) distinct possible outcomes. The probability of the \(i\)th outcome is \(p_i\) on every trial, and \(X_i\) is the count of the number of times the \(i\)th outcome occurred in the \(m\) trials. For \(n=2\), this is just the binomial experiment, in which each trial has two possible outcomes; \(X_1\) counts the number of "successes" and \(X_2 = m - X_1\) counts the number of failures in the \(m\) trials. In a general multinomial experiment, there are \(n\) possible outcomes to count.


Multinomial properties (similar to univariate)

\[Cov(X_i, X_j) = E[(X_i - mp_i)(X_j - mp_j)] = -m p_i p_j\]

For mutually independent rvs \(X_1, ..., X_n\),

\[E(g_1(X_1) \times ... \times g_n(X_n)) = (E g_1(X_1))\times ... \times (E g_n(X_n))\]
\[M_Z(t) = M_{X_1}(t) \times ... \times M_{X_n}(t), \quad \hbox{where } Z = X_1 + ... + X_n\]

In particular, if \(X_1, ..., X_n\) share the same distribution with mgf \(M_X(t)\), then

\[M_Z(t) = (M_X(t))^n\]

Corollary: mgf of a linear combination of mutually independent rvs

Let \(X_1, ..., X_n\) be mutually independent rv with mgfs \(M_{X_1}(t), ..., M_{X_n}(t)\). Let \(a_1, ..., a_n\) and \(b_1, ..., b_n\) be fixed constants. Let \(Z = (a_1X_1 +b_1) + ... + (a_n X_n + b_n)\). Then, the mgf of \(Z\) is

\[\begin{split}\begin{align*} M_Z(t) &= E e^{tZ}\\ &= Ee^{t\sum (a_i X_i + b_i)}\\ &= (e^{t(\sum b_i)}) E(e^{t a_1 X_1} \times ... \times e^{t a_n X_n})\\ &= (e^{t(\sum b_i)}) M_{X_1}(a_1 t) \times ... \times M_{X_n}(a_n t)\\ \end{align*}\end{split}\]

From this, we can conclude (for instance) that a linear combination of independent normal rvs \(X_i \sim n(\mu_i, \sigma_i^2)\) is normally distributed:

\[Z = \sum_{i=1}^n (a_i X_i + b_i) \sim Normal(\sum_{i=1}^n (a_i \mu_i + b_i), \sum_{i=1}^n a_i^2 \sigma_i^2)\]

Inequalities

Per Holder’s Inequality, if \(p, q > 1\) satisfy \(\frac{1}{p} + \frac{1}{q} = 1\), then \(|EXY| \leq E|XY| \leq (E|X|^p)^{1/p}(E|Y|^q)^{1/q}\)

Cauchy-Schwarz’s inequality is a special case of Holder’s Inequality where \(p=q=2\):

\[|EXY| \leq E|XY| \leq (E|X|^2)^{1/2}(E|Y|^2)^{1/2}\]

The covariance inequality states that if \(X\) and \(Y\) have means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\), we can apply the Cauchy-Schwarz Inequality to get

\[E|(X-\mu_X)(Y-\mu_Y)| \leq \{ E(X-\mu_X)^2 \}^{1/2} \{ E(Y-\mu_Y)^2 \}^{1/2}\]

By squaring both sides, we get a useful property:

\[(Cov(X,Y))^2 \leq \sigma_X^2 \sigma_Y^2\]

This can be modified (by setting \(Y \equiv 1\)) to state \(E|X| \leq \{ E|X|^p \}^{1/p}\) (with \(1<p<\infty\)).

Additionally, Liapounov’s Inequality takes this a step further: for \(1<r<p\), if we replace \(|X|\) by \(|X|^r\) we obtain

\[E|X|^r \leq \{ E|X|^{pr} \}^{1/p}\]

and then we set \(s=pr\) (where \(s>r\)) and rearrange:

\[\{E|X|^r\}^{1/r} \leq \{ E|X|^{s} \}^{1/s}\]

Also, Minkowski’s Inequality states that for two rvs \(X\) and \(Y\) and for \(1 \leq p < \infty\),

\[[E|X+Y|^p]^{1/p} \leq [E|X|^p]^{1/p} + [E|Y|^p]^{1/p}\]

This just uses the normal triangle inequality property.

Lastly, Jensen’s Inequality says that for a rv \(X\), if \(g(x)\) is a convex function, then

\[Eg(X) \geq g(EX)\]

Note: Equality holds iff, for every line \(a+bx\) that is tangent to \(g(x)\) at \(x=EX\), \(P(g(X) = a+bX)=1\).
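
A standard application (my own): since \(g(x) = x^2\) is convex, Jensen's Inequality gives

\[EX^2 \geq (EX)^2\]

which is just the statement that \(Var X \geq 0\).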



Properties of Random Sample

Statistics

Definition: A statistic \(Y = T(X_1, ..., X_n)\) is any function of the sample; the only restriction is that it cannot be a function of a parameter of the distribution.

The sample mean is

\[\bar{X} = \frac{X_1 + ... + X_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i\]

The sample variance is

\[S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\]

Theorem: With \(\bar{x}\) and \(s^2\) defined as above, we know a few things:

  1. \(\min_a \sum_{i=1}^n (x_i - a)^2 = \sum_{i=1}^n (x_i - \bar{x})^2\)

  2. \((n-1)s^2 = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n \bar{x}^2\)


Lemma: Let \(X_1, ..., X_n\) be a random sample from a population and let \(g(x)\) be a function such that \(Eg(X_1)\) and \(\hbox{Var} g(X_1)\) exist. Then,

\[E(\sum_{i=1}^n g(X_i)) = n(E g(X_1))\]
\[\hbox{Var}(\sum_{i=1}^n g(X_i)) = n(\hbox{Var} g(X_1))\]

Theorem: With \(X_1, ..., X_n\) a random sample from a population with mean \(\mu\) and variance \(\sigma^2<\infty\),

  1. \(E \bar{X} = \mu\)

  2. \(\hbox{Var} \bar{X} = \frac{\sigma^2}{n}\)

  3. \(E S^2 = \sigma^2\)

In property 3, we see why \(S^2\) requires the \(\frac{1}{n-1}\) factor:

\[\begin{split}\begin{align*} ES^2 &= E(\frac{1}{n-1} [\sum_{i=1}^n X_i^2 - n \bar{X}^2])\\ &= \frac{1}{n-1} (n E X_1^2 - n E \bar{X}^2)\\ &= \frac{1}{n-1} (n(\sigma^2 + \mu^2) - n(\frac{\sigma^2}{n} + \mu^2))\\ &= \sigma^2\\ \end{align*}\end{split}\]
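
Similarly, property 2 follows from the lemma (my own filling-in):

\[Var \bar{X} = Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2} Var\left(\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}\]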

Distribution of statistics

Example: since \(\bar{X} = \frac{1}{n}(X_1 + ... + X_n)\), if \(f(y)\) is the pdf of \(Y = (X_1 + ... + X_n)\), then \(f_{\bar{X}}(x) = nf(nx)\) is the pdf of \(\bar{X}\). We can prove this using the transformation equation from chapter 4.3.2:

We say that \(\bar{X} = g(Y) = (1/n)Y\). Therefore, \(g^{-1}(x) = nx\) and \(|\frac{d}{dx} g^{-1}(x)| = n\), so

\(f_{\bar{X}}(x) = f_{Y}(nx)\, |n| = n f_Y(nx)\)

Additionally, the same approach works for mgfs:

\[M_{\bar{X}}(t) = Ee^{t\bar{X}} = Ee^{t(X_1 + ... + X_n)/n} = Ee^{(t/n)Y} = M_Y(t/n)\]

The last equality is just the substitution \(Y = X_1 + ... + X_n\).

Theorem: If \(X_1, ..., X_n\) are a random sample (iid) from a population with mgf \(M_X(t)\), then the mgf of the sample mean factors as:

\[M_{\bar{X}}(t) = [M_X(t/n)]^n\]

This can be useful in cases where \(M_{\bar{X}}(t)\) is a familiar mgf, for instance:

Example: Distribution of the mean

Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) population. Then, the mgf of the sample mean is

\[\begin{split}\begin{align*} M_{\bar{X}}(t) &= [ \exp(\mu \frac{t}{n} + \frac{\sigma^2(t/n)^2}{2}) ]^n\\ &= \exp(n(\mu\frac{t}{n} + \frac{\sigma^2(t/n)^2}{2}))\\ &= \exp(\mu t + \frac{(\sigma^2/n)t^2}{2})\\ \end{align*}\end{split}\]

This also works for a random sample of \(\gamma(\alpha,\beta)\) (see Example 5.2.8)
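
A sketch of that case (my own, assuming the parameterization with mgf \(M_X(t) = (1-\beta t)^{-\alpha}\) for \(t < 1/\beta\)):

\[M_{\bar{X}}(t) = \left[\left(1 - \beta\frac{t}{n}\right)^{-\alpha}\right]^n = \left(1 - \frac{\beta}{n}t\right)^{-n\alpha}\]

which is the mgf of a \(gamma(n\alpha, \beta/n)\) distribution.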


Definition: Check if pdf is member of exponential family

Suppose \(X_1, ..., X_n\) is a random sample from a pdf or pmf \(f(x|\theta)\), where

\[f(x|\theta) = h(x)c(\theta) \exp(\sum_{i=1}^k w_i(\theta) t_i(x))\]

is a member of an exponential family. Define statistics \(T_1, ..., T_k\) by

\[T_i(X_1, ..., X_n) = \sum_{j=1}^n t_i(X_j)\]

where \(i=1,...,k\).

If the set \(\{(w_1(\theta), w_2(\theta), ..., w_k(\theta)): \theta \in \Theta\}\) contains an open subset of \(\mathbf{R}^k\), then the distribution of \((T_1,...,T_k)\) is an exponential family of the form

\[f_{T}(u_1,...,u_k|\theta) = H(u_1,...,u_k)[c(\theta)]^n \exp(\sum_{i=1}^k w_i(\theta)u_i)\]

The open set condition eliminates a density such as the \(n(\theta, \theta^2)\) and, in general, eliminates curved exponential families.

Example: Sum of Bernoulli rvs

Suppose \(X_1, ..., X_n\) is a random sample from a \(Bernoulli(p)\) distribution. We know that

\[\begin{split}\begin{align*} f(x|p) &= {n \choose x} p^x (1-p)^{n-x}\\ &= {n \choose x} (1 - p)^n (\frac{p}{1-p})^x\\ &= {n \choose x} (1 - p)^n \exp(\log(\frac{p}{1-p})x)\\ \end{align*}\end{split}\]

We can see that this is an exponential family where (taking \(n=1\), because \(Bernoulli(p) \sim Binomial(1,p)\))

\[\begin{split}h(x) =\begin{cases} {n \choose x} & x=0,...,n\\ 0 & \hbox{o.w.} \end{cases}\end{split}\]

\(c(p) = (1-p)^n, 0<p<1\)

\(w_1(p) = \log(\frac{p}{1-p}), 0<p<1\)

\(t_1(x) = x\)

Thus, from the previous theorem, \(T_1(X_1,...,X_n) = X_1 + ... + X_n\). From the definition of a binomial distribution, we know that \(T_1\) has a \(binomial(n,p)\) distribution, which we have already shown is an exponential family. This verifies the theorem shown above.


Properties: of the sample mean and variance for a random sample \(X_1, ..., X_n\) from a \(n(\mu, \sigma^2)\) population

  1. \(\bar{X}\) and \(S^2\) are independent random variables

  2. \(\bar{X}\) has a \(n(\mu, \sigma^2/n)\) distribution

  3. \((n-1)S^2/\sigma^2 \sim \chi_{n-1}^2\)

Some distribution facts

Facts: about \(\chi_p^2\) distribution with \(p\) dof

  1. If \(Z\) is a \(n(0,1)\) rv, then \(Z^2 \sim \chi_1^2\)

  2. If \(X_1, ..., X_n\) are independent and \(X_i \sim \chi_{p_i}^2\) then \(X_1 + ... + X_n \sim \chi^2_{p_1 + ... + p_n}\); so, independent chi-squared variables add to a chi-squared variable AND the dof also add.


Definition: Student’s t-distribution

For a random sample from a \(n(\mu, \sigma^2)\) population, instead of looking at

\[\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\]

where knowledge of \(\sigma\) and the measurement \(\bar{X}\) are used to draw conclusions about \(\mu\), we can look at the quantity obtained by replacing \(\sigma\) with the sample standard deviation \(S\), which is useful when both \(\mu\) and \(\sigma\) are unknown:

\[\frac{\bar{X} - \mu}{S /\sqrt{n}}\]

This quantity has Student's \(t\) distribution with \(n-1\) degrees of freedom.

Properties: of t-distribution

  1. Has no mgf because it does not have moments of all orders. If it has \(p\) degrees of freedom, it only has \(p-1\) moments. For instance, \(t_1\) has no mean and \(t_2\) has no variance.


Definition: F-distribution

Built by a ratio of variances. See Definition 5.3.6

This distribution has a few important corollaries:

  1. If \(X \sim F_{p,q}\), then \(\frac{1}{X} \sim F_{q,p}\)

  2. If \(X \sim t_q\), then \(X^2 \sim F_{1,q}\)

  3. If \(X \sim F_{p,q}\), then \(\frac{p}{q} \frac{X}{1 + (p/q)X} \sim beta(p/2, q/2)\)


Order statistics

Organize the random variables by size: \(X_{(1)} \leq ... \leq X_{(n)}\)

We know the pdf of \(X_{(j)}\) of a random sample \(X_1, ..., X_n\) from a continuous population with cdf \(F_X(x)\) and pdf \(f_X(x)\) is

\[f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}\]

and the joint pdf of \(X_{(i)}\) and \(X_{(j)}\), \(1 \leq i < j \leq n\), is (for \(u < v\))

\[f_{X_{(i)}, X_{(j)}}(u, v) = \frac{n!}{(i-1)!(j-1-i)!(n-j)!} f_X(u) f_X(v) [F_X(u)]^{i-1} \times [F_X(v) - F_X(u)]^{j-1-i} [1 - F_X(v)]^{n-j}\]
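
For example (my own specialization of the marginal formula): the maximum \(X_{(n)}\) has pdf \(n f_X(x)[F_X(x)]^{n-1}\) and the minimum \(X_{(1)}\) has pdf \(n f_X(x)[1-F_X(x)]^{n-1}\). For a \(uniform(0,1)\) sample, \(F_X(x) = x\), so

\[f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} x^{j-1}(1-x)^{n-j}, \quad 0<x<1\]

which is the \(beta(j, n-j+1)\) pdf.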

Convergence

Definition: Convergence in probability

The weak law of large numbers (WLLN) says that \(\lim_{n\xrightarrow[]{}\infty}P(|\bar{X}_n - \mu| < \epsilon) = 1\) for every \(\epsilon > 0\); that is, \(\bar{X}_n\) converges in probability to \(\mu\). In general, a sequence of rvs \(X_n\) converges in probability to a rv \(X\) if, for every \(\epsilon > 0\), \(\lim_{n\xrightarrow[]{}\infty} P(\{w \in S: |X_n(w) - X(w)| \geq \epsilon\}) = 0\), where \(w\) ranges over the sample points in the sample space \(S\). E.g., for \(\sum X_i = 4\), the set of sample points could be \(\{(1,1,1,1), (2,2), \hbox{etc.}\}\)

Definition: Convergence almost surely

A stronger definition of convergence (the sequence may fail to converge only on a set of probability 0) says that a sequence of RVs converges almost surely to a random variable \(X\) if \(P(\lim_{n\xrightarrow[]{}\infty}|X_n - X| < \epsilon) = 1\). Moving the limit inside the probability makes the definition stricter.

Definition: Convergence in distribution

A sequence of RVs converges in distribution to a random variable \(X\) if \(\lim_{n\xrightarrow[]{}\infty}F_{X_n}(x) = F_X(x)\) at every point \(x\) where \(F_X(x)\) is continuous. The CDFs converge.


Proving Tools

Definition: Slutsky’s Theorem

If \(X_n \xrightarrow{} X\) in distribution and \(Y_n \xrightarrow{} a\), a constant, in probability, then

  1. \(Y_n X_n \xrightarrow{} a X\) in distribution

  2. \(X_n + Y_n \xrightarrow{} X + a\) in distribution

Definition: Delta method

Let \(Y_n\) be a sequence of rvs that satisfies \(\sqrt{n} (Y_n - \theta) \xrightarrow{} n(0, \sigma^2)\) in distribution. For a given function \(g\) and a specific value of \(\theta\), suppose that \(g^\prime(\theta)\) exists and is not 0. Then, \(\sqrt{n}[g(Y_n) - g(\theta)] \xrightarrow{} n(0, \sigma^2[g^\prime(\theta)]^2)\) in distribution.
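
A quick worked case (my own): take \(g(y) = 1/y\) with \(\theta \neq 0\), so \(g^\prime(\theta) = -1/\theta^2\). Then

\[\sqrt{n}\left(\frac{1}{Y_n} - \frac{1}{\theta}\right) \xrightarrow{} n\left(0, \frac{\sigma^2}{\theta^4}\right)\]

in distribution.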



Principles of Data Reduction

We are interested in methods of data reduction that do not discard important information about the unknown parameter \(\theta\) and methods that successfully discard information that is irrelevant as far as gaining knowledge about \(\theta\) is concerned.

  • Sufficiency: data reduction that does not discard information about \(\theta\) while achieving some summarization of the data

  • Likelihood: a function of the parameter, obtained by the observed sample, that contains all the information about \(\theta\) that is available from the sample

  • Equivariance: preserve important features of the model


Sufficiency

If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) then any inference about \(\theta\) should depend on the sample \(\mathbf{X}\) only through the value \(T(\mathbf{X})\). That is, if \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points such that \(T(\mathbf{x}) = T(\mathbf{y})\) then the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed.

A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\)

  1. if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\)

  2. if \(p(\mathbf{x}|\theta)\) is the joint pdf/pmf of \(\mathbf{X}\) and \(q(t|\theta)\) is the pdf/pmf of \(T(\mathbf{X})\) and if, for every \(\mathbf{x}\) in the sample space, the ratio \(p(\mathbf{x}|\theta)/q(T(\mathbf{x})|\theta)\) is constant as a function of \(\theta\) (aka, does not depend on \(\theta\)).

Definition: Determine if sufficient statistic

Factorization Theorem: (no prereq)

\[f(\mathbf{x}|\theta) = g(T(\mathbf{x})|\theta) h(\mathbf{x})\]
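
A quick worked factorization (my own example): for an iid \(Poisson(\lambda)\) sample,

\[f(\mathbf{x}|\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \underbrace{e^{-n\lambda}\lambda^{\sum x_i}}_{g(T(\mathbf{x})|\lambda)}\; \underbrace{\frac{1}{\prod_i x_i!}}_{h(\mathbf{x})}\]

so \(T(\mathbf{X}) = \sum_{i=1}^n X_i\) is a sufficient statistic for \(\lambda\).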

For exponential family pdfs:

\[f(x|\mathbf{\theta}) = h(x) c(\theta) \exp(\sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x))\]

where \(\mathbf{\theta} = (\theta_1, ..., \theta_d), d \leq k\). Then,

\[T(\mathbf{X}) = ( \sum_{j=1}^n t_1(X_j), ..., \sum_{j=1}^n t_k(X_j))\]

is a sufficient statistic for \(\mathbf{\theta}\).


Definition: Minimal sufficient statistic

General: (no prereq)

A sufficient statistic \(T(\mathbf{X})\) is a minimal sufficient statistic if, for any other sufficient statistic \(T^\prime(\mathbf{X})\), \(T(\mathbf{x})\) is a function of \(T^\prime(\mathbf{x})\).

General: (no prereq)

Let \(f(\mathbf{x}|\theta)\) be the pmf or pdf of a sample \(\mathbf{X}\). Suppose there exists a function \(T(\mathbf{x})\) st, for every two sample points \(\mathbf{x}\) and \(\mathbf{y}\), the ratio \(f(\mathbf{x}|\theta) / f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) iff \(T(\mathbf{x}) = T(\mathbf{y})\). Then, \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\).

Proof sketch (Theorem 6.2.13): let \(T^\prime\) be any other sufficient statistic, so by factorization \(f(\textbf{x}|\theta) = g^\prime(T^\prime(\textbf{x})|\theta)h^\prime(\textbf{x})\). If \(T^\prime(\textbf{x}) = T^\prime(\textbf{y})\), then \(\frac{f(\textbf{x}|\theta)}{f(\textbf{y}|\theta)} = \frac{g^\prime(T^\prime(\textbf{x})|\theta)h^\prime(\textbf{x})}{g^\prime(T^\prime(\textbf{y})|\theta)h^\prime(\textbf{y})} = \frac{h^\prime(\textbf{x})}{h^\prime(\textbf{y})}\) does not depend on \(\theta\), so \(T(\textbf{x}) = T(\textbf{y})\). Thus \(T(\textbf{x})\) is a function of \(T^\prime(\textbf{x})\), which makes \(T(\textbf{X})\) minimal sufficient (6.2.13).


Other properties

Definition: Ancillary statistics

A statistic \(S(\mathbf{X})\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic.

To show a statistic is ancillary, derive its distribution \(f(S(\textbf{X})|\theta)\) and check that \(\theta\) does not appear in it. By Basu's theorem, if \(T(\textbf{X})\) is a complete and minimal sufficient statistic, then \(T(\textbf{X})\) is independent of every ancillary statistic.


Definition: Complete statistic

Let \(f(t|\theta)\) be a family of pdfs or pmfs for a statistic \(T(\mathbf{X})\). The family of probability distributions is called complete if \(E_\theta g(T) = 0\) for all \(\theta\) implies \(P_\theta(g(T) = 0) = 1\) for all \(\theta\). Equivalently, \(T(\mathbf{X})\) is called a complete statistic.

General: (no prereq)

Basu's theorem: if \(T(\textbf{X})\) is a complete and minimal sufficient statistic, then \(T(\textbf{X})\) is independent of every ancillary statistic (as stated above).

For exponential family pdfs:

\[f(x|\mathbf{\theta}) = h(x) c(\mathbf{\theta}) \exp(\sum_{j=1}^k w_j(\mathbf{\theta}) t_j(x))\]

where \(\mathbf{\theta} = (\theta_1, ..., \theta_k)\). Then the statistic

\[T(\mathbf{X}) = ( \sum_{i=1}^n t_1(X_i), ..., \sum_{i=1}^n t_k(X_i))\]

is complete as long as the parameter space \(\Theta\) contains an open set in \(\mathbf{R}^k\).


Likelihood

Let \(f(\mathbf{x}|\theta)\) denote the joint pdf/pmf of the sample \(\mathbf{X} = (X_1, ..., X_n)\). Then, given that \(\mathbf{X} = \mathbf{x}\) is observed, the function of \(\theta\) defined by

\[L(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\]

is called the likelihood function.

Definition: Likelihood principle

If \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points st \(L(\theta|\mathbf{x})\) is proportional to \(L(\theta|\mathbf{y})\), that is, there exists a constant \(C(\mathbf{x},\mathbf{y})\) st

\[L(\theta|\mathbf{x}) = C(\mathbf{x},\mathbf{y}) L(\theta|\mathbf{y})\]

for all \(\theta\), then the conclusions drawn from \(\mathbf{x}\) and \(\mathbf{y}\) should be identical.

If \(C(\mathbf{x},\mathbf{y}) = 1\), then the likelihood principle states that if two sample points result in the same likelihood function, then they contain the same information about \(\theta\). But this can be taken further: the principle states that even if two sample points have only proportional likelihoods, they contain equivalent information about \(\theta\). Relative plausibility is measured by the ratio of likelihoods: for instance, if \(L(\theta_2|\mathbf{x}) = 2L(\theta_1|\mathbf{x})\) then \(\theta_2\) is said to be twice as plausible as \(\theta_1\).

Fiducial inference sometimes interprets likelihoods as probabilities for \(\theta\). That is, \(L(\theta|\mathbf{x})\) is multiplied by \(M(\mathbf{x}) = (\int_{-\infty}^\infty L(\theta|\mathbf{x})d\theta)^{-1}\) and then \(M(\mathbf{x})L(\theta|\mathbf{x})\) is interpreted as a pdf for \(\theta\) (if \(M(\mathbf{x})\) is finite).


Equivariance

If \(\mathbf{Y} = g(\mathbf{X})\) is a change of measurement scale st the model for \(\mathbf{Y}\) has the same formal structure as the model for \(\mathbf{X}\), then an inference procedure should be both measurement equivariant and formally equivariant.

Point Estimation

A point estimator is any function \(W(X_1, ..., X_n)\) of a sample; that is, any statistic is a point estimator.

There exist three ways of finding estimators: 1) Method of Moments, 2) Maximum Likelihood Estimation, and 3) Bayes’

Method of moments (MOM)

Let \(X_1, ..., X_n\) be a sample from a population with pdf/pmf \(f(x|\theta_1,...,\theta_k)\). MOM estimators are found by equating the first \(k\) sample moments to the corresponding \(k\) population moments, and solving the resulting system of simultaneous equations.

\[\begin{split}\begin{cases} m_1 = \frac{1}{n} \sum_{i=1}^n X_i^1, & \mu_1^\prime = EX^1\\ m_2 = \frac{1}{n} \sum_{i=1}^n X_i^2, & \mu_2^\prime = EX^2\\ \vdots\\ m_k = \frac{1}{n} \sum_{i=1}^n X_i^k, & \mu_k^\prime = EX^k\\ \end{cases}\end{split}\]

Maximum likelihood estimation (MLE)

Form \(L(\theta | \mathbf{x})\), take \(\ln L(\theta | \mathbf{x})\), set \(\frac{d}{d\theta} \ln L(\theta | \mathbf{x}) = 0\), and solve for \(\theta\) over the parameter space to find the MLE, verifying it is a maximum. Check the endpoints of the parameter space as well.
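
A quick worked case (my own): for an iid \(n(\theta, 1)\) sample,

\[\ln L(\theta|\mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2, \qquad \frac{d}{d\theta}\ln L(\theta|\mathbf{x}) = \sum_{i=1}^n (x_i - \theta) = 0 \;\Rightarrow\; \hat{\theta} = \bar{x}\]

and the second derivative \(-n < 0\) confirms a maximum.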

Bayes

\(\hbox{posterior} = \pi(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\pi(\theta) / m(\mathbf{x})\)

where \(f(\mathbf{x}|\theta)\pi(\theta) = f(\mathbf{x},\theta)\) is the joint pdf, \(m(\mathbf{x}) = \int f(\mathbf{x}|\theta) \pi(\theta) d\theta\) is the marginal pdf, and \(\pi(\theta)\) is the prior distribution.

Definition: Let \(\mathbf{F}\) denote the class of pdfs or pmfs \(f(x|\theta)\) (indexed by \(\theta\)). A class \(\Pi\) of prior distributions is a conjugate family for \(\mathbf{F}\) if the posterior distribution is in the class \(\Pi\) for all \(f \in \mathbf{F}\), all priors in \(\Pi\) and all \(x \in \mathbf{X}\).

For instance, the beta family is conjugate for the binomial family. Thus, if we start with a beta prior, we will end up with a beta posterior.
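
A sketch of that conjugacy (my own, with \(X \sim binomial(n,p)\) and prior \(p \sim beta(\alpha,\beta)\)):

\[\pi(p|x) \propto f(x|p)\pi(p) \propto p^x(1-p)^{n-x}\, p^{\alpha-1}(1-p)^{\beta-1} = p^{x+\alpha-1}(1-p)^{n-x+\beta-1}\]

which is the kernel of a \(beta(x+\alpha,\; n-x+\beta)\) distribution.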


Examples Finding Estimators

Example: Normal distribution

MOM

If \(X_1, ..., X_n\) are iid \(n(\theta,\sigma^2)\), then \(\theta_1 = \theta\) and \(\theta_2 = \sigma^2\). We have \(m_1 = \bar{X}, m_2 = \frac{1}{n} \sum X_i^2, \mu_1^\prime = \theta, \mu_2^\prime = \theta^2 + \sigma^2\), and hence we must solve

\[\bar{X} = \theta\]
\[\frac{1}{n} \sum X_i^2 = \theta^2 + \sigma^2\]

Solving for \(\theta\) and \(\sigma^2\) yields the MOM estimators:

\[\hat{\theta} = \bar{X}\]
\[\hat{\sigma}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2 = \frac{1}{n} \sum(X_i - \bar{X})^2\]

Bayes

(Example 7.2.16) Let \(X \sim n(\theta, \sigma^2)\) and suppose that the prior distribution on \(\theta\) is \(n(\mu, \tau^2)\). Here, we assume all the parameters are known. The posterior distribution of \(\theta\) is also normal with mean and variance given by

\[\pi(\theta) = \frac{1}{\sqrt{2\pi}\tau} \exp(-\frac{1}{2\tau^2} (\theta - \mu)^2)\]
\[f(x|\theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp(-\frac{1}{2\sigma^2} (x - \theta)^2)\]

Step 1: evaluate the posterior \(\pi(\theta|\vec{x}) = \frac{f(x|\theta) \pi(\theta)}{m(x)}\). Or, since \(m(x)\) is not dependent on \(\theta\), we can just evaluate the joint distribution \(f(x,\theta) = f(x|\theta) \pi(\theta)\).

\[\begin{split}\begin{align*} f(x,\theta) &= [\frac{1}{\sqrt{2\pi}\sigma} \exp(-\frac{1}{2\sigma^2} (x - \theta)^2)] [\frac{1}{\sqrt{2\pi}\tau} \exp(-\frac{1}{2\tau^2} (\theta - \mu)^2)]\\ &= \frac{1}{2\pi\sigma\tau} \exp(-\frac{1}{2\sigma^2} (x - \theta)^2 - \frac{1}{2\tau^2} (\theta - \mu)^2)\\ &= \frac{1}{2\pi\sigma\tau} \exp(-\frac{1}{2\sigma^2} (x^2 - 2x\theta + \theta^2) - \frac{1}{2\tau^2} (\theta^2 - 2\theta\mu + \mu^2))\\ &= \frac{1}{2\pi\sigma\tau} \exp(-\frac{x^2}{2\sigma^2} + \frac{2x\theta}{2\sigma^2} - \frac{\theta^2}{2\sigma^2} - \frac{\theta^2}{2\tau^2} + \frac{2\theta\mu}{2\tau^2} - \frac{\mu^2}{2\tau^2} )\\ &= \frac{1}{2\pi\sigma\tau} \exp(\theta^2(- \frac{1}{2\sigma^2} - \frac{1}{2\tau^2}) + \theta(\frac{x}{\sigma^2} + \frac{\mu}{\tau^2}) - \frac{\mu^2}{2\tau^2} -\frac{x^2}{2\sigma^2} )\\ &= \frac{1}{2\pi\sigma\tau} \exp(-\theta^2(\frac{1}{2\sigma^2} + \frac{1}{2\tau^2})) \times \exp(\theta(\frac{x}{\sigma^2} + \frac{\mu}{\tau^2})) \times \exp(- \frac{\mu^2}{2\tau^2} - \frac{x^2}{2\sigma^2} )\\ \end{align*}\end{split}\]

Step 1b: Find \(m(x) = \int f(x|\theta) \pi(\theta) d\theta = \int f(x,\theta) d\theta\).

\[\begin{split}\begin{align*} m(x) &= \int f(x,\theta) d\theta\\ &= \int \frac{1}{2\pi\sigma\tau} \exp(-\theta^2(\frac{1}{2\sigma^2} + \frac{1}{2\tau^2})) \times \exp(\theta(\frac{x}{\sigma^2} + \frac{\mu}{\tau^2})) \times \exp(- \frac{\mu^2}{2\tau^2} - \frac{x^2}{2\sigma^2} ) d\theta\\ & \textcolor{red}{\hbox{This is a hard integral which is okay because we don't need it!}} \end{align*}\end{split}\]

We can continue to boil down the joint \(f(x,\theta)\)

\[\begin{split}\begin{align*} f(x,\theta) &= \frac{1}{2\pi\sigma\tau} \exp(\theta^2(- \frac{1}{2\sigma^2} - \frac{1}{2\tau^2}) + \theta(\frac{x}{\sigma^2} + \frac{\mu}{\tau^2}) - \frac{\mu^2}{2\tau^2} -\frac{x^2}{2\sigma^2} )\\ &= ...\\ &\sim N(\frac{\tau^2 x + \sigma^2 \mu}{\tau^2 + \sigma^2}, \frac{\sigma^2 \tau^2}{\tau^2 + \sigma^2}) \times N(\mu, \tau^2 + \sigma^2)\\ \end{align*}\end{split}\]

Therefore, we can find the mean and variance

\[E(\theta|x) = \frac{\tau^2}{\tau^2 + \sigma^2} x + \frac{\sigma^2}{\sigma^2 + \tau^2} \mu\]
\[Var(\theta|x) = \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\]

When \(\tau^2\) is near 0, the weight on \(x\) is 0 and the weight on \(\mu\) is 1; a very confident prior dominates the data.

The normal family is its own conjugate.


Example: Let \(X_1, ..., X_n\) be iid \(binomial(k,p)\), that is,

\[P(X_i=x|k,p) = {k \choose x} p^x (1-p)^{k-x}\]

on \(x = 0,1,...,k\).

We desire point estimators for both \(k\) and \(p\). Equating the first two sample moments to the corresponding population moments yields:

\[\bar{X} = kp\]
\[\frac{1}{n}\sum X_i^2 = kp(1-p) + k^2 p^2\]

Or more simply,

\[\frac{1}{n} \sum(X_i - \bar{X})^2 = kp(1-p)\]

With these definitions, we can build parameter estimates

\[\hat{p} = 1 - \frac{\frac{1}{n} \sum(X_i - \bar{X})^2}{\bar{X}}\]
\[\hat{k} = \frac{\bar{X}}{\hat{p}}\]

Metrics for evaluating estimators

Definition: Mean squared error (MSE)

The MSE of an estimator \(W\) of a parameter \(\theta\) is a function of \(\theta\) defined by \(E_\theta (W - \theta)^2\).

\[E_\theta (W-\theta)^2 = Var_\theta W + (E_\theta W - \theta)^2 = Var_\theta W + (Bias_\theta W)^2\]

Minimizing MSE is usually a balancing act between variance and bias.

Note: for an unbiased (\(Bias_\theta = 0\)) estimator, we have

\[E_\theta(W-\theta)^2 = Var_\theta W\]

Example: Normal MSE

Let \(X_1, ..., X_n\) be iid \(n(\mu, \sigma^2)\). The statistics \(\bar{X}\) and \(S^2\) are both unbiased estimators since

\[E\bar{X} = \mu\]
\[E S^2 = \sigma^2\]

for all \(\mu\) and \(\sigma^2\); this is true without the normality assumption.

The MSEs of these estimators are

\[MSE_\mu (\bar{X}_n) = E(\bar{X} - \mu)^2 = Var \bar{X} = \frac{\sigma^2}{n}\]

This goes to 0 as \(n \xrightarrow{} \infty\).

\[MSE_{\sigma^2}(S_n^2) = E(S^2 - \sigma^2)^2 = Var S^2 = \frac{2 \sigma^4}{n-1}\]

For a non-normal case, this is \(MSE_{\sigma^2}(S_n^2) = \frac{1}{n}(\theta_4 - \frac{n-3}{n-1} \theta_2^2)\)

Example: \(MSE(\hat{\sigma}^2) < MSE(S^2)\)

We know the MLE of \(\sigma^2\) is \(\hat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n} S^2\).

\[E \hat{\sigma^2} = E(\frac{n-1}{n} S^2) = \frac{n-1}{n} \sigma^2\]

So, \(\hat{\sigma^2}\) is a biased estimator of \(\sigma^2\). The variance of \(\hat{\sigma^2}\) can be calculated

\[Var \hat{\sigma^2} = Var(\frac{n-1}{n} S^2) = (\frac{n-1}{n})^2 Var S^2 = \frac{2(n-1)\sigma^4}{n^2}\]

because \(Var S^2 = \frac{2\sigma^4}{n-1}\) in the normal case, from above.

Therefore, the \(MSE\) is given by

\[E(\hat{\sigma^2} - \sigma^2)^2 = Var\,\hat{\sigma^2} + (Bias_{\sigma^2}\hat{\sigma^2})^2 = \frac{2(n-1)\sigma^4}{n^2} + (\frac{n-1}{n} \sigma^2 - \sigma^2)^2 = (\frac{2n-1}{n^2})\sigma^4\]
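
To verify the claim in the heading, compare the two MSEs (my own arithmetic):

\[\frac{2n-1}{n^2}\sigma^4 < \frac{2}{n-1}\sigma^4 \quad\Longleftrightarrow\quad (2n-1)(n-1) = 2n^2 - 3n + 1 < 2n^2\]

which holds for every \(n \geq 2\), so \(\hat{\sigma^2}\) has uniformly smaller MSE than \(S^2\), at the price of some bias.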

Conclusions about MSE:

  • A small increase in bias can be traded for a large decrease in variance, resulting in a smaller MSE.

  • Because MSE is a function of the parameter, there is often not one best estimator. Often, the MSEs of two estimators will cross each other, showing that each estimator is better (wrt the other) in only a portion of the parameter space.

This second bullet point is why we discuss other tactics of finding the best estimator… see next!


Definition: Best unbiased estimator

We want to recommend a candidate estimator. Specifically, we consider unbiased estimators. So, if both \(W_1\) and \(W_2\) are unbiased estimators of a parameter \(\theta\), that is, \(E_\theta W_1 = E_\theta W_2 = \theta\), then their MSE are equal to their variances, so we should choose the estimator with the smaller variance.