Theoretical Statistics
Notes based on Casella and Berger, 2nd edition.
Probability Theory
To Do
Transformations and Expectations
Transformation
Theorem 2.1.5: Let \(X\) have pdf \(f_X(x)\) and let \(Y = g(X)\), where \(g\) is a monotone function. Let \(\mathbf{X}\) and \(\mathbf{Y}\) be defined by \(\mathbf{X} = \{ x: f_X(x) > 0 \}\) and \(\mathbf{Y} = \{y: y = g(x) \hbox{ for some } x \in \mathbf{X} \}\). Suppose that \(f_X(x)\) is continuous on \(\mathbf{X}\) and that \(g^{-1}(y)\) has a continuous derivative on \(\mathbf{Y}\). Then, the pdf of \(Y\) is given by \(f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|\) for \(y \in \mathbf{Y}\) (and 0 otherwise).
Procedure:
Check if \(g\) is monotonic
Split up into regions where monotonic and then evaluate formula above
See: Example 2.1.6 (monotonic) and Example 2.1.7 (multiple regions)
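The change-of-variables formula can be checked numerically. A minimal sketch (the example is assumed, not from the text): take \(X \sim\) exponential(1) and the monotone map \(Y = g(X) = \sqrt{X}\), so \(g^{-1}(y) = y^2\) and the theorem gives \(f_Y(y) = f_X(y^2)\,|2y| = 2y e^{-y^2}\).

```python
# Compare a histogram of Y = sqrt(X) against the pdf from Theorem 2.1.5.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)
y = np.sqrt(x)

grid = np.linspace(0.1, 2.0, 20)
f_y = 2 * grid * np.exp(-grid**2)          # f_X(g^{-1}(y)) |d g^{-1}/dy|
hist, edges = np.histogram(y, bins=200, range=(0, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(grid, centers, hist)

print(np.max(np.abs(empirical - f_y)))     # small -> formula matches
```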
Theorem 2.1.10 (Probability integral transformation): If \(X\) has continuous cdf \(F_X(x)\), then \(Y = F_X(X)\) is uniformly distributed on \((0,1)\).
See: Proof on pg. 54
Expected values
See: Example 2.2.2 for continuous and Example 2.2.3 for discrete
Properties:
\(E(a g_1(X) + b g_2(X) + c) = a E g_1(X) + b E g_2(X) + c\).
If \(g_1(x) \geq 0\) for all \(x\) then \(E g_1(X) \geq 0\).
If \(g_1(x) \geq g_2(x)\) for all \(x\) then \(E g_1(X) \geq E g_2(X)\).
If \(a \leq g_1(x) \leq b\) for all \(x\) then \(a \leq E g_1(X) \leq b\).
Example: Minimizing \(E(X-b)^2\) over \(b\) yields \(b = EX\); the minimized value, \(E(X - EX)^2\), happens to be the definition of variance.
Moments
The \(n\)th central moment of \(X\) is \(\mu_n = E(X - \mu)^n\), where \(\mu = E X\).
From this, we know the variance is \(Var X = \mu_2 = E(X - EX)^2\).
See: Example 2.3.3 for the variance of a parameter
Properties:
\(Var(aX + b) = a^2 Var X\)
\(Var X = E(X - EX)^2 = EX^2 - (EX)^2\)
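Both variance properties are easy to verify numerically; a quick sketch (the gamma sample is an assumed stand-in, not from the text):

```python
# Check Var(aX + b) = a^2 Var X and Var X = EX^2 - (EX)^2 by simulation.
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=500_000)
a, b = -2.5, 7.0

var_x = np.mean(x**2) - np.mean(x)**2      # Var X = EX^2 - (EX)^2
print(np.var(a * x + b) - a**2 * var_x)    # ~ 0: the identities agree
```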
Moment Generating Function (mgf)
Let \(X\) be a R.V. with cdf \(F_X\). The mgf of \(X\) (or \(F_X\)) is \(M_X(t) = E e^{tX}\), provided the expectation exists for \(t\) in a neighborhood of 0.
With our knowledge of expected values, \(M_X(t) = \int_{-\infty}^\infty e^{tx} f_X(x)\,dx\) if \(X\) is continuous, or \(\sum_x e^{tx} P(X = x)\) if \(X\) is discrete.
Theorem: The \(n\)th moment is equal to the \(n\)th derivative of \(M_X(t)\) evaluated at \(t=0\): \(E X^n = M_X^{(n)}(0)\).
Assuming we can differentiate under the integral sign (see Leibnitz Rule below), \(\frac{d}{dt} M_X(t) = \int_{-\infty}^\infty x e^{tx} f_X(x)\,dx = E(X e^{tX})\).
Evaluating this at \(t=0\), we have \(\frac{d}{dt} M_X(t) \big|_{t=0} = EX\).
See: Example 2.3.8 for continuous case and Example 2.3.9 for discrete.
Properties:
\(M_{aX + b}(t) = e^{bt} M_X(at)\)
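The derivative-at-zero theorem can be illustrated with a known mgf. A sketch (assumed example: standard normal, whose mgf is \(e^{t^2/2}\)), using a finite difference for the second derivative:

```python
# Second derivative of the n(0,1) mgf at t = 0 should recover EX^2 = 1.
import numpy as np

def mgf(t):
    return np.exp(t**2 / 2.0)              # mgf of n(0,1)

h = 1e-4
second_deriv = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2
print(second_deriv)                        # approx EX^2 = 1
```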
Convergence of mgfs
Suppose \(\{X_i, i = 1, 2, ... \}\) is a sequence of RVs, each with mgf \(M_{X_i}(t)\). Furthermore, suppose that \(\lim_{i \to \infty} M_{X_i}(t) = M_X(t)\)
for all \(t\) in a neighborhood of 0, where \(M_X(t)\) is an mgf. Then, there is a unique cdf \(F_X\) whose moments are determined by \(M_X(t)\) and, for all \(x\) where \(F_X(x)\) is continuous, we have \(\lim_{i \to \infty} F_{X_i}(x) = F_X(x)\).
That is, convergence, for \(|t| < h\), of mgfs to an mgf implies convergence of cdfs.
This relies on the theory of Laplace transforms: the mgf is essentially the Laplace transform of the pdf, and the uniqueness of Laplace transforms gives the one-to-one correspondence between mgfs and cdfs.
Proof: Poisson approximation of the Binomial
We know that the Poisson approximation is valid when \(n\) is large and \(p\) is small (with \(\lambda = np\) moderate).
Recall that the mgf of the binomial is \(M_X(t) = [p e^t + (1 - p)]^n\).
From the rule above (and from the textbook), the mgf of the Poisson is \(M_Y(t) = e^{\lambda (e^t - 1)}\).
If we define \(p = \lambda / n\) then \(M_X(t) \xrightarrow{} M_Y(t)\) as \(n \xrightarrow{} \infty\).
Lemma: If \(\lim_{n \to \infty} a_n = a\), then \(\lim_{n \to \infty} \left(1 + \frac{a_n}{n}\right)^n = e^a\).
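The mgf convergence above shows up directly in the pmfs. A sketch (the specific \(n\), \(p\) are assumed for illustration): for large \(n\) and small \(p\), the binomial(\(n, p\)) pmf is close to the Poisson(\(np\)) pmf.

```python
# Compare binomial(n, p) with Poisson(lam = n*p) pointwise.
from scipy import stats

n, p = 10_000, 0.0003            # lam = np = 3
lam = n * p
binom_pmf = stats.binom.pmf(range(10), n, p)
pois_pmf = stats.poisson.pmf(range(10), lam)
print(max(abs(binom_pmf - pois_pmf)))   # very small
```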
Leibnitz Rule
If \(f(x,\theta)\), \(a(\theta)\), and \(b(\theta)\) are differentiable with respect to \(\theta\), then \(\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x,\theta)\,dx = f(b(\theta),\theta)\frac{d}{d\theta}b(\theta) - f(a(\theta),\theta)\frac{d}{d\theta}a(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial\theta} f(x,\theta)\,dx\).
See page 69.
If \(a(\theta) = a\) and \(b(\theta) = b\) are constant, then \(\frac{d}{d\theta} \int_{a}^{b} f(x,\theta)\,dx = \int_{a}^{b} \frac{\partial}{\partial\theta} f(x,\theta)\,dx\).
Lebesgue’s Dominated Convergence Theorem
See page 69 & 70. Basically, if the integral is not too badly behaved, then we can say it’s good enough to bring a limit inside an integral.
Lipschitz Continuous
Impose smoothness on a function by bounding its first derivative by a function with finite integral. It leads to interchangeability of integration and differentiation.
See: Theorem 2.4.3 (pg 70), Corollary 2.4.4 and Examples 2.4.5 and 2.4.6
Families of Distributions
Check if pdf is part of an exponential family
A pdf or pmf is a member of an exponential family if it can be written as \(f(x|\mathbf{\theta}) = h(x)\, c(\mathbf{\theta}) \exp\left(\sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x)\right)\), where \(h(x) \geq 0\) and the \(t_i(x)\) are real-valued functions of \(x\) (not depending on \(\mathbf{\theta}\)), and \(c(\mathbf{\theta}) \geq 0\) and the \(w_i(\mathbf{\theta})\) are real-valued functions of the possibly vector-valued parameter \(\mathbf{\theta}\) (not depending on \(x\)).
Here are some common exponential families:
Continuous: normal, gamma, beta
Discrete: binomial, Poisson, negative binomial
A distribution which is a member of the exponential family has nice properties. For instance,
Expectations and Variance of exponential family pdf
Theorem: If \(X\) is a RV with pdf or pmf which is a member of an exponential family, then \(E\left(\sum_{i=1}^k \frac{\partial w_i(\mathbf{\theta})}{\partial \theta_j} t_i(X)\right) = -\frac{\partial}{\partial \theta_j} \log c(\mathbf{\theta})\) and \(Var\left(\sum_{i=1}^k \frac{\partial w_i(\mathbf{\theta})}{\partial \theta_j} t_i(X)\right) = -\frac{\partial^2}{\partial \theta_j^2} \log c(\mathbf{\theta}) - E\left(\sum_{i=1}^k \frac{\partial^2 w_i(\mathbf{\theta})}{\partial \theta_j^2} t_i(X)\right)\).
Definition: The indicator function of a set \(A\) is \(I_A(x) = 1\) if \(x \in A\) and \(I_A(x) = 0\) otherwise.
So, we can write the normal pdf (example 3.4.4) as
Since the indicator function is only a function of \(x\), it can be incorporated into the function \(h(x)\), showing that this pdf is of the exponential family form.
Another example is \(f(x|\theta) = \theta^{-1} \exp(1 - \frac{x}{\theta})\) on \(0 < \theta < x < \infty\). Although this expression can otherwise fit the exponential family definition, the indicator function \(I_{[\theta, \infty)}(x)\) depends on \(\theta\): the support involves the parameter, so this family is not exponential.
Chebychev’s inequality
Let \(X\) be a RV and let \(g(x)\) be a nonnegative function. Then, for any \(r>0\), \(P(g(X) \geq r) \leq \frac{E g(X)}{r}\).
We usually set \(g(x) = (x - \mu)^2 / \sigma^2\) and \(r = t^2\), which gives \(P(|X - \mu| \geq t\sigma) \leq \frac{1}{t^2}\).
See: Example 3.6.2 and 3.6.3
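The bound holds for any distribution with a finite variance. A sketch (the skewed gamma sample is an assumed example, not from the text):

```python
# Empirical tail probability vs. the Chebychev bound 1/t^2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)
mu, sigma = x.mean(), x.std()

t = 2.0
tail = np.mean(np.abs(x - mu) >= t * sigma)
print(tail, 1 / t**2)   # empirical tail sits below the bound
```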
Multiple Random Variables
Joint probability \(f_{X,Y}(x,y)\)
Definition: Discrete
Marginal probability is \(f_X(x) = \sum_{y \in \mathbf{R}} f_{X,Y}(x,y)\)
Definition: Continuous
where \(-\infty <x < \infty\)
Definition: Conditional
where \(\sum_y f(y|x) = 1\).
Independence properties
With \(U=g(X)\) and \(V=h(Y)\), where \(X\) and \(Y\) are independent, and \(A_u = \{ x: g(x) \leq u\}\) and \(B_v = \{ y: h(y) \leq v\}\). Then, \(P(U \leq u, V \leq v) = P(X \in A_u) P(Y \in B_v)\), so \(U\) and \(V\) are also independent.
Conditional expectation \(EX = E(E(X|Y))\)
Rewritten, we say \(E_X X = E_Y (E_{X|Y} (X|Y))\) because \(E(X|Y)\) is a rv (random in \(Y\)),
See: Example 4.4.5
Definition: Conditional variance identity
For any two rv \(X\) and \(Y\), \(Var X = E(Var(X|Y)) + Var(E(X|Y))\), provided the expectations exist.
See: Example 4.4.8
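Both the tower property \(EX = E(E(X|Y))\) and the conditional variance identity can be checked by simulation. A sketch under an assumed hierarchical model (not from the text): \(P \sim beta(2,3)\) and \(X|P \sim binomial(10, P)\).

```python
# E(X|P) = 10P and Var(X|P) = 10P(1-P) are available in closed form,
# so both identities can be compared against the raw moments of X.
import numpy as np

rng = np.random.default_rng(3)
n = 10
p = rng.beta(2.0, 3.0, size=1_000_000)
x = rng.binomial(n, p)

cond_mean = n * p                  # E(X|P), a random variable in P
cond_var = n * p * (1 - p)         # Var(X|P)

print(x.mean(), cond_mean.mean())                   # tower property
print(x.var(), cond_mean.var() + cond_var.mean())   # variance identity
```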
Covariance and correlation
Covariance and correlation measure the strength of a relationship between two rv.
Covariance of \(X\) and \(Y\) is \(Cov(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = EXY - \mu_X \mu_Y\).
This gives information regarding the relationship of \(X\) and \(Y\): large positive values mean \(X\) and \(Y\) tend to go up together or down together. By itself, however, the value is hard to interpret because its magnitude depends on the scales (units) of \(X\) and \(Y\). We can normalize by the standard deviations to fix the range of the metric; this is what correlation does.
Correlation of \(X\) and \(Y\) is \(\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}\).
\(\rho_{XY}\) is also known as the correlation coefficient.
If \(X\) and \(Y\) are independent, then \(EXY = (EX)(EY)\) and therefore \(Cov(X,Y) = 0\) and \(\rho_{XY} = 0\).
Note: It is invalid to say because \(Cov(X,Y)=0\), \(X\) and \(Y\) are independent. For example, if \(X \sim f(x-\theta)\) and \(Y\) is an indicator function \(Y = I(|X-\theta|<2)\), then \(Y\) and \(X\) are not independent but \(E(XY)\) ends up equaling \(EXEY\) so \(Cov(X,Y) = 0\).
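The note above can be simulated with an assumed concrete case, \(X \sim n(0,1)\) and \(Y = I(|X| < 2)\): \(Y\) is a deterministic function of \(X\), yet the covariance is (approximately) zero.

```python
# Zero covariance does not imply independence.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
y = (np.abs(x) < 2).astype(float)

cov = np.mean(x * y) - x.mean() * y.mean()
print(cov)      # near 0, even though Y is a function of X
```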
Properties:
For any rv \(X\) and \(Y\),
\(-1 \leq \rho_{XY} \leq 1\)
\(|\rho_{XY}| = 1\) iff there exists \(a \neq 0\) and \(b\) st \(P(Y=aX+b) = 1\). If \(\rho_{XY} = 1\), then \(a > 0\), and if \(\rho_{XY} = -1\), then \(a<0\).
Definition: Multivariate variance
If \(X\) and \(Y\) are any two rv and \(a\) and \(b\) are any two constants, then \(Var(aX + bY) = a^2 Var X + b^2 Var Y + 2ab\, Cov(X,Y)\).
Note: if \(X\) and \(Y\) are independent rv, then \(Var(aX + bY) = a^2 Var X + b^2 Var Y\).
Multivariate distributions
With \(\mathbf{X} = (X_1, ..., X_n)\) representing a random vector with sample space that is a subset of \(\mathbf{R}^n\), write the joint pdf as \(f(\mathbf{x}) = f(x_1, ..., x_n)\),
and the expectation of a function \(g(\mathbf{X})\) is \(E g(\mathbf{X}) = \int_{\mathbf{R}^n} g(\mathbf{x}) f(\mathbf{x})\, d\mathbf{x}\).
The marginal pdf of any subset of the coordinates of \((X_1, ..., X_n)\) can be computed by integrating the joint pdf over all possible values of the coordinates
Multinomial distribution
Let \(n\) and \(m\) be positive integers and let \(p_1,..., p_n\) be numbers satisfying \(0 \leq p_i \leq 1\), \(i=1, ..., n\), and \(\sum_{i=1}^n p_i = 1\). Then the rv \((X_1, ..., X_n)\) has a multinomial distribution with \(m\) trials and cell probabilities \(p_1, ..., p_n\) if the joint pmf of \((X_1, ..., X_n)\) is \(f(x_1, ..., x_n) = \frac{m!}{x_1! \cdots x_n!} p_1^{x_1} \cdots p_n^{x_n}\)
on the set of \((x_1, ..., x_n)\) st each \(x_i\) is a nonnegative integer and \(\sum_{i=1}^n x_i = m\).
This corresponds to the following experiment: the experiment consists of \(m\) independent trials. Each trial results in one of \(n\) distinct possible outcomes. The probability of the \(i\)th outcome is \(p_i\) on every trial, and \(X_i\) counts the number of times the \(i\)th outcome occurred in the \(m\) trials. For \(n=2\), this is just the binomial experiment, in which each trial has two possible outcomes: \(X_1\) counts the number of “successes” and \(X_2 = m - X_1\) counts the number of failures in \(m\) trials. In a general multinomial experiment, there are \(n\) possible outcomes to count.
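One consequence worth checking: each marginal of a multinomial is binomial. A sketch with assumed numbers (\(m = 20\), cell probabilities 0.5, 0.3, 0.2):

```python
# The first coordinate of a multinomial(m, p) sample should behave
# like a binomial(m, p_1): mean m*p_1 and variance m*p_1*(1-p_1).
import numpy as np

rng = np.random.default_rng(5)
m, probs = 20, [0.5, 0.3, 0.2]
counts = rng.multinomial(m, probs, size=500_000)

x1 = counts[:, 0]
print(x1.mean(), m * probs[0])                    # ~ 10
print(x1.var(), m * probs[0] * (1 - probs[0]))    # ~ 5
```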
Multinomial properties (similar to univariate)
In particular, if \(X_1, ..., X_n\) are mutually independent and share the same distribution with mgf \(M_X(t)\), then \(M_{X_1 + \cdots + X_n}(t) = [M_X(t)]^n\).
Corollary: Linear combination of independent distributions form the same (but multivariate) distribution
Let \(X_1, ..., X_n\) be mutually independent rv with mgfs \(M_{X_1}(t), ..., M_{X_n}(t)\). Let \(a_1, ..., a_n\) and \(b_1, ..., b_n\) be fixed constants. Let \(Z = (a_1X_1 +b_1) + ... + (a_n X_n + b_n)\). Then, the mgf of \(Z\) is \(M_Z(t) = e^{t(b_1 + \cdots + b_n)} M_{X_1}(a_1 t) \cdots M_{X_n}(a_n t)\).
From this, we can conclude (for instance) that a linear combination of independent (say) normal rv is normally distributed.
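A sketch of that conclusion with assumed parameters: \(Z = 2X_1 + 3X_2 + 1\) with independent \(X_1 \sim n(1, 4)\) and \(X_2 \sim n(-2, 9)\) should be \(n(-3, 97)\), since the mean is \(2(1) + 3(-2) + 1 = -3\) and the variance is \(4 \cdot 4 + 9 \cdot 9 = 97\).

```python
# KS test of the simulated linear combination against n(-3, 97).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x1 = rng.normal(1.0, 2.0, size=200_000)
x2 = rng.normal(-2.0, 3.0, size=200_000)
z = 2 * x1 + 3 * x2 + 1

ks, pvalue = stats.kstest(z, cdf="norm", args=(-3.0, np.sqrt(97.0)))
print(ks)   # small KS distance -> consistent with normality
```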
Inequalities
Per Hölder’s Inequality, if \(p\) and \(q\) satisfy \(\frac{1}{p} + \frac{1}{q} = 1\) (with \(p, q > 1\)), then \(|EXY| \leq E|XY| \leq (E|X|^p)^{1/p}(E|Y|^q)^{1/q}\)
Cauchy–Schwarz’s inequality is a special case of Hölder’s Inequality where \(p=q=2\): \(|EXY| \leq E|XY| \leq (E|X|^2)^{1/2}(E|Y|^2)^{1/2}\)
The covariance inequality states that if \(X\) and \(Y\) have means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\), we can apply Cauchy–Schwarz’s Inequality to get \(|E(X - \mu_X)(Y - \mu_Y)| \leq \{E(X-\mu_X)^2\}^{1/2} \{E(Y-\mu_Y)^2\}^{1/2}\).
By squaring both sides, we get a useful property: \((Cov(X,Y))^2 \leq \sigma_X^2 \sigma_Y^2\).
This can be modified (by setting \(Y \equiv 1\)) to state \(E|X| \leq \{ E|X|^p \}^{1/p}\) (with \(1<p<\infty\)).
Additionally, Liapounov’s Inequality takes this a step further: for \(1<r<p\), if we replace \(|X|\) by \(|X|^r\) we obtain \(E|X|^r \leq \{E|X|^{pr}\}^{1/p}\),
and then we set \(s=pr\) (so \(s>r\)) and rearrange: \(\{E|X|^r\}^{1/r} \leq \{E|X|^s\}^{1/s}\) for \(1 < r < s < \infty\).
Also, Minkowski’s Inequality states that for two rvs \(X\) and \(Y\) and for \(1 \leq p < \infty\), \([E|X+Y|^p]^{1/p} \leq [E|X|^p]^{1/p} + [E|Y|^p]^{1/p}\).
This just uses the normal triangle inequality property.
Lastly, Jensen’s Inequality says that for a rv \(X\), if \(g(x)\) is a convex function, then \(E g(X) \geq g(EX)\).
Note: Equality holds iff, for every line \(a+bx\) that is tangent to \(g(x)\) at \(x=EX\), \(P(g(X) = a+bX)=1\).
Properties of Random Sample
Statistics
Definition: A statistic \(Y = T(X_1, ..., X_n)\) is any function of the sample; it cannot be a function of a parameter of the distribution.
The sample mean is \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\).
The sample variance is \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\).
Theorem: With the definitions above, we know a few things:
\(\min_a \sum_{i=1}^n (x_i - a)^2 = \sum_{i=1}^n (x_i - \bar{x})^2\)
\((n-1)s^2 = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n \bar{x}^2\)
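The second identity is purely algebraic, so it can be checked exactly on any data (the numbers below are arbitrary):

```python
# (n-1) s^2 = sum x_i^2 - n xbar^2 for an arbitrary dataset.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
xbar = x.mean()

lhs = (n - 1) * x.var(ddof=1)
rhs = (x**2).sum() - n * xbar**2
print(lhs, rhs)   # equal
```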
Lemma: Let \(X_1, ..., X_n\) be a random sample from a population and let \(g(x)\) be a function such that \(Eg(X_1)\) and \(\hbox{Var} g(X_1)\) exist. Then, \(E\left(\sum_{i=1}^n g(X_i)\right) = n E g(X_1)\) and \(\hbox{Var}\left(\sum_{i=1}^n g(X_i)\right) = n \hbox{Var} g(X_1)\).
Theorem: With \(X_1, ..., X_n\) as a random sample and population with mean \(\mu\) and variance \(\sigma^2<\infty\),
\(E \bar{X} = \mu\)
\(\hbox{Var} \bar{X} = \frac{\sigma^2}{n}\)
\(E S^2 = \sigma^2\)
In the third relationship, we see why \(S^2\) requires the \(\frac{1}{n-1}\): it is exactly the factor that makes \(E S^2 = \sigma^2\), i.e., that makes \(S^2\) unbiased.
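The unbiasedness of \(S^2\) (and the bias of the \(1/n\) version) shows up clearly in simulation; a sketch with assumed parameters (\(\sigma^2 = 4\), \(n = 5\)):

```python
# Averaging S^2 over many samples recovers sigma^2 = 4; the 1/n
# version is biased low by the factor (n-1)/n.
import numpy as np

rng = np.random.default_rng(7)
sigma2, n, reps = 4.0, 5, 200_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = samples.var(axis=1, ddof=1)        # divide by n-1
sig_hat2 = samples.var(axis=1, ddof=0)  # divide by n

print(s2.mean())       # ~ 4.0 (unbiased)
print(sig_hat2.mean()) # ~ 3.2 = (n-1)/n * 4.0 (biased)
```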
Distribution of statistics
Example: since \(\bar{X} = \frac{1}{n}(X_1 + ... + X_n)\), if \(f(y)\) is the pdf of \(Y = X_1 + ... + X_n\), then \(f_{\bar{X}}(x) = nf(nx)\) is the pdf of \(\bar{X}\). We can prove this using the transformation equation from chapter 4.3.2:
We say that \(\bar{X} = g(Y) = (1/n)Y\), so \(g^{-1}(x) = nx\). Therefore,
\(f_{\bar{X}}(x) = f_{Y}(g^{-1}(x)) \left| \frac{d}{dx} g^{-1}(x) \right| = f_Y(nx)\,|n| = n f_Y(nx)\)
Additionally, this can be conducted for mgfs:
Since \(X_1, ..., X_n\) are iid, then this is true.
Theorem: However, if they are just a random sample from a population with mgf \(M_X(t)\), then the mgf of the sample mean is \(M_{\bar{X}}(t) = [M_X(t/n)]^n\).
This can be useful in cases where \(M_{\bar{X}}(t)\) is a familiar mgf, for instance:
Example: Distribution of the mean
Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) population. Then, the mgf of the sample mean is \(M_{\bar{X}}(t) = \left[\exp\left(\frac{\mu t}{n} + \frac{\sigma^2 (t/n)^2}{2}\right)\right]^n = \exp\left(\mu t + \frac{(\sigma^2/n) t^2}{2}\right)\), the mgf of a \(n(\mu, \sigma^2/n)\) distribution.
This also works for a random sample of \(\gamma(\alpha,\beta)\) (see Example 5.2.8)
Definition: Check if pdf is member of exponential family
Suppose \(X_1, ..., X_n\) is a random sample from a pdf or pmf \(f(x|\theta)\), where
is a member of an exponential family. Define statistics \(T_1, ..., T_k\) by \(T_i(X_1, ..., X_n) = \sum_{j=1}^n t_i(X_j)\),
where \(i=1,...,k\).
If the set \(\{(w_1(\theta), w_2(\theta), ..., w_k(\theta)), \theta \in \Theta\}\) contains an open subset of \(\mathbf{R}^k\), then the distribution of \((T_1,...,T_k)\) is an exponential family of the form
The open set condition eliminates a density such as the \(n(\theta, \theta^2)\) and, in general, eliminates curved exponential families.
Example: Sum of Bernoulli rvs
Suppose \(X_1, ..., X_n\) is a random sample from a \(Bernoulli(p)\), with pmf \(f(x|p) = p^x (1-p)^{1-x}\), \(x \in \{0,1\}\).
We can see that this is an exponential family where (taking \(n=1\), because \(Bernoulli(p) \sim Binomial(1,p)\))
\(c(p) = (1-p)^n, 0<p<1\)
\(w_1(p) = \log(\frac{p}{1-p}), 0<p<1\)
\(t_1(x) = x\)
Thus, from the previous theorem, \(T_1(X_1,...,X_n) = X_1 + ... + X_n\). From the definition of a binomial distribution, we know that \(T_1\) has a \(binomial(n,p)\) distribution, which we have already shown is an exponential family. This verifies the theorem shown above.
Properties: of the sample mean and variance of a random sample from a \(n(\mu, \sigma^2)\) population
\(\bar{X}\) and \(S^2\) are independent random variables
\(\bar{X}\) has a \(n(\mu, \sigma^2/n)\) distribution
\((n-1)S^2/\sigma^2 \sim \chi_{n-1}^2\)
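The third property can be checked by simulation with assumed parameters (\(n = 6\), \(\sigma = 2\)):

```python
# For normal samples, (n-1) S^2 / sigma^2 should follow chi2(n-1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, sigma = 6, 2.0
samples = rng.normal(10.0, sigma, size=(100_000, n))
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

ks, pvalue = stats.kstest(stat, cdf="chi2", args=(n - 1,))
print(ks)   # small KS distance -> consistent with chi2(5)
```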
Some distribution facts
Facts: about \(\chi_p^2\) distribution with \(p\) dof
If \(Z\) is a \(n(0,1)\) rv, then \(Z^2 \sim \chi_1^2\)
If \(X_1, ..., X_n\) are independent and \(X_i \sim \chi_{p_i}^2\), then \(X_1 + \cdots + X_n \sim \chi^2_{p_1 + \cdots + p_n}\); so, independent chi-squared variables add to a chi-squared variable AND the degrees of freedom also add.
Definition: Student’s t-distribution
Instead of looking at the \(n(\mu, \sigma^2)\)-based quantity \(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim n(0,1)\),
where we can use our knowledge of \(\sigma\) and our measurement of \(\bar{X}\) as a basis to determine \(\mu\), we can look at a quantity whose distribution is free of both unknowns \(\mu\) and \(\sigma\): \(\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}\), Student’s \(t\)-distribution with \(n-1\) degrees of freedom.
Properties: of t-distribution
Has no mgf because it does not have moments of all orders. If it has \(p\) degrees of freedom, it has only \(p-1\) moments. For instance, \(t_1\) has no mean and \(t_2\) has no variance.
Definition: F-distribution
Built by a ratio of sample variances: \(F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}\). See Definition 5.3.6.
This distribution has a few important corollaries:
If \(X \sim F_{p,q}\), then \(\frac{1}{X} \sim F_{q,p}\)
If \(X \sim t_q\), then \(X^2 \sim F_{1,q}\)
If \(X \sim F_{p,q}\), then \(\frac{p}{q} \frac{X}{1 + (p/q)X} \sim beta(p/2, q/2)\)
Order statistics
Organize the random variables by size: \(X_{(1)} \leq ... \leq X_{(n)}\)
We know the pdf of \(X_{(j)}\) of a random sample \(X_1, ..., X_n\) from a continuous population with cdf \(F_X(x)\) and pdf \(f_X(x)\) is \(f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j}\),
and the joint pdf of \(X_{(i)}\) and \(X_{(j)}\), \(1 \leq i < j \leq n\), is \(f_{X_{(i)},X_{(j)}}(u,v) = \frac{n!}{(i-1)!(j-1-i)!(n-j)!} f_X(u) f_X(v) [F_X(u)]^{i-1} [F_X(v) - F_X(u)]^{j-1-i} [1 - F_X(v)]^{n-j}\) for \(u < v\).
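A sketch with an assumed uniform example: for a uniform(0,1) sample of size \(n=5\), the formula for \(X_{(2)}\) reduces (with \(F_X(x) = x\), \(f_X(x) = 1\)) to a \(beta(2, 4)\) density, which can be tested by simulation.

```python
# Second order statistic of 5 uniforms vs. the beta(2, 4) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
u = np.sort(rng.uniform(size=(200_000, 5)), axis=1)
x2 = u[:, 1]                      # the 2nd order statistic

ks, pvalue = stats.kstest(x2, cdf="beta", args=(2, 4))
print(ks)
```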
Convergence
Definition: Convergence in probability
The weak law of large numbers (WLLN) says that \(\lim_{n\to\infty} P(|\bar{X}_n - \mu| < \epsilon) = 1\); that is, \(\bar{X}_n\) converges in probability to \(\mu\). In general, convergence in probability means \(\lim_{n\to\infty} P(\{w \in S: |X_n(w) - X(w)| \geq \epsilon\}) = 0\), where \(w\) ranges over the sample points in the set \(S\). E.g., for \(\Sigma X_i = 4\), the set could be \(\{(1,1,1,1), (2,2), \hbox{etc.}\}\)
Definition: Convergence almost surely
A stronger notion of convergence (it allows non-convergence only on a set of probability 0) says that a sequence of RVs converges almost surely to a random variable \(X\) if \(P(\lim_{n\to\infty}|X_n - X| < \epsilon) = 1\). Moving the limit inside the probability gives the stricter definition.
Definition: Convergence in distribution
A sequence of RVs converges in distribution to a random variable \(X\) if \(\lim_{n\to\infty}F_{X_n}(x) = F_X(x)\) at all points \(x\) where \(F_X\) is continuous. The cdfs converge.
Proving Tools
Definition: Slutsky’s Theorem
If \(X_n \xrightarrow{} X\) in distribution and \(Y_n \xrightarrow{} a\), a constant, in probability, then
\(Y_n X_n \xrightarrow{} a X\) in distribution
\(X_n + Y_n \xrightarrow{} X + a\) in distribution
Definition: Delta method
Let \(Y_n\) be a sequence of rvs that satisfies \(\sqrt{n} (Y_n - \theta) \xrightarrow{} n(0, \sigma^2)\) in distribution. For a given function \(g\) and a specific value of \(\theta\), suppose that \(g^\prime(\theta)\) exists and is not 0. Then, \(\sqrt{n}[g(Y_n) - g(\theta)] \xrightarrow{} n(0, \sigma^2[g^\prime(\theta)]^2)\) in distribution.
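The delta method can be illustrated by Monte Carlo under an assumed example: \(X_i\) iid exponential with mean \(\theta = 2\) and \(g(x) = 1/x\). Here \(\sigma^2 = \theta^2 = 4\) and \(g^\prime(\theta) = -1/\theta^2\), so the limiting variance of \(\sqrt{n}(1/\bar{Y}_n - 1/\theta)\) is \(4 \cdot (1/16) = 1/4\).

```python
# Check the delta-method variance prediction by simulation.
import numpy as np

rng = np.random.default_rng(10)
theta, n, reps = 2.0, 500, 20_000
x = rng.exponential(scale=theta, size=(reps, n))
ybar = x.mean(axis=1)

z = np.sqrt(n) * (1 / ybar - 1 / theta)
print(z.var())   # ~ 0.25
```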
Principles of Data Reduction
We are interested in methods of data reduction that do not discard important information about the unknown parameter \(\theta\) and methods that successfully discard information that is irrelevant as far as gaining knowledge about \(\theta\) is concerned.
Sufficiency: data reduction that does not discard information about \(\theta\) while achieving some summarization of the data
Likelihood: a function of the parameter, obtained by the observed sample, that contains all the information about \(\theta\) that is available from the sample
Equivariance: preserve important features of the model
Sufficiency
If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\), then any inference about \(\theta\) should depend on the sample \(\mathbf{X}\) only through the value \(T(\mathbf{X})\). That is, if \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points such that \(T(\mathbf{x}) = T(\mathbf{y})\), then the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed.
A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\)
if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\)
if \(p(\mathbf{x}|\theta)\) is the joint pdf/pmf of \(\mathbf{X}\) and \(q(t|\theta)\) is the pdf/pmf of \(T(\mathbf{X})\) and if, for every \(\mathbf{x}\) in the sample space, the ratio \(p(\mathbf{x}|\theta)/q(T(\mathbf{x})|\theta)\) is constant as a function of \(\theta\) (aka, does not depend on \(\theta\)).
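The first characterization can be computed exactly for a small assumed example: for iid \(Bernoulli(p)\) with \(T = \sum X_i\), the conditional probability of any sample point given \(T\) is \(1/\binom{n}{t}\), free of \(p\).

```python
# P(X = x | T = t) for iid Bernoulli(p) does not depend on p.
from math import comb

def conditional_prob(x, p):
    """P(X = x | T = sum(x)) for an iid Bernoulli(p) sample."""
    n, t = len(x), sum(x)
    joint = p**t * (1 - p) ** (n - t)                  # P(X = x)
    marginal = comb(n, t) * p**t * (1 - p) ** (n - t)  # P(T = t)
    return joint / marginal

x = (1, 0, 1, 1, 0)
print(conditional_prob(x, 0.2), conditional_prob(x, 0.9))  # both 1/10
```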
Definition: Determine if sufficient statistic
Factorization Theorem: (no prereq) \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) iff there exist functions \(g(t|\theta)\) and \(h(\mathbf{x})\) st, for all sample points \(\mathbf{x}\) and all parameter points \(\theta\), \(f(\mathbf{x}|\theta) = g(T(\mathbf{x})|\theta)\, h(\mathbf{x})\).
For exponential family pdfs: if \(f(x|\mathbf{\theta}) = h(x)\, c(\mathbf{\theta}) \exp\left(\sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x)\right)\),
where \(\mathbf{\theta} = (\theta_1, ..., \theta_d), d \leq k\), then
\(T(\mathbf{X}) = \left( \sum_{j=1}^n t_1(X_j), ..., \sum_{j=1}^n t_k(X_j) \right)\)
is a sufficient statistic for \(\mathbf{\theta}\).
Definition: Minimal sufficient statistic
General: (no prereq)
A sufficient statistic \(T(\mathbf{X})\) is a minimal sufficient statistic if, for any other sufficient statistic \(T^\prime(\mathbf{X})\), \(T(\mathbf{x})\) is a function \(T^\prime(\mathbf{x})\).
General: (no prereq)
Let \(f(\mathbf{x}|\theta)\) be the pmf or pdf of a sample \(\mathbf{X}\). Suppose there exists a function \(T(\mathbf{x})\) st, for every two sample points \(\mathbf{x}\) and \(\mathbf{y}\), the ratio \(f(\mathbf{x}|\theta) / f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) iff \(T(\mathbf{x}) = T(\mathbf{y})\). Then, \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\).
Therefore, show \(\frac{f(\textbf{x}|\theta)}{f(\textbf{y}|\theta)} = \frac{g^\prime(T^\prime(\textbf{x})|\theta)h^\prime(\textbf{x})}{g^\prime(T^\prime(\textbf{y})|\theta)h^\prime(\textbf{y})} = \frac{h^\prime(\textbf{x})}{h^\prime(\textbf{y})}\) does not depend on \(\theta\) whenever \(T^\prime(\textbf{x}) = T^\prime(\textbf{y})\). Then \(T(\textbf{x}) = T(\textbf{y})\), so \(T(\textbf{x})\) is a function of \(T^\prime(\textbf{x})\), and \(T(\textbf{X})\) is minimal sufficient (6.2.13).
Other properties
Definition: Ancillary statistics
A statistic \(S(\mathbf{X})\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic.
Prove that the statistic does not depend on \(\theta\). Derive \(f(T(X)|\theta)\) and check whether \(\theta\) is in it. Using Basu’s theorem, if \(T(\textbf{X})\) is a complete and minimal sufficient statistic, then \(T(\textbf{X})\) is independent of every ancillary statistic.
Definition: Complete statistic
Let \(f(t|\theta)\) be a family of pdfs or pmfs for a statistic \(T(\mathbf{X})\). The family of probability distributions is called complete if \(E_\theta g(T) = 0\) for all \(\theta\) implies \(P_\theta(g(T) = 0) = 1\) for all \(\theta\). Equivalently, \(T(\mathbf{X})\) is called a complete statistic.
General: (no prereq)
Basu’s theorem
For exponential pdfs:
where \(\mathbf{\theta} = (\theta_1, ..., \theta_k)\). Then the statistic \(T(\mathbf{X}) = \left( \sum_{j=1}^n t_1(X_j), ..., \sum_{j=1}^n t_k(X_j) \right)\)
is complete as long as the parameter space \(\Theta\) contains an open set in \(\mathbf{R}^k\).
Likelihood
Let \(f(\mathbf{x}|\theta)\) denote the joint pdf/pmf of the sample \(\mathbf{X} = (X_1, ..., X_n)\). Then, given that \(\mathbf{X} = \mathbf{x}\) is observed, the function of \(\theta\) defined by \(L(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\)
is called the likelihood function.
Definition: Likelihood principle
If \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points st \(L(\theta|\mathbf{x})\) is proportional to \(L(\theta|\mathbf{y})\), that is, there exists a constant \(C(\mathbf{x},\mathbf{y})\) st \(L(\theta|\mathbf{x}) = C(\mathbf{x},\mathbf{y}) L(\theta|\mathbf{y})\)
for all \(\theta\), then the conclusions drawn from \(\mathbf{x}\) and \(\mathbf{y}\) should be identical.
If \(C(\mathbf{x},\mathbf{y}) = 1\), then the likelihood principle states that if two sample points result in the same likelihood function, then they contain the same information about \(\theta\). But this can be taken further: the principle states that even if two sample points have only proportional likelihoods, they contain equivalent information about \(\theta\). Relative plausibility is measured by the ratio of likelihoods: for instance, if \(L(\theta_2|\mathbf{x}) = 2L(\theta_1|\mathbf{x})\), then \(\theta_2\) is said to be twice as plausible as \(\theta_1\).
Fiducial inference sometimes interprets likelihoods as probabilities for \(\theta\). That is, \(L(\theta|\mathbf{x})\) is multiplied by \(M(\mathbf{x}) = (\int_{-\infty}^\infty L(\theta|\mathbf{x})d\theta)^{-1}\) and then \(M(\mathbf{x})L(\theta|\mathbf{x})\) is interpreted as a pdf for \(\theta\) (if \(M(\mathbf{x})\) is finite).
Equivariance
If \(\mathbf{Y} = g(\mathbf{X})\) is a change of measurement scale st the model for \(\mathbf{Y}\) has the same formal structure as the model for \(\mathbf{X}\), then an inference procedure should be both measurement equivariant and formally equivariant.
Point Estimation
A point estimator is any function \(W(X_1, ..., X_n)\) of a sample; that is, any statistic is a point estimator.
There exist three ways of finding estimators: 1) Method of Moments, 2) Maximum Likelihood Estimation, and 3) Bayes’
Method of moments (MOM)
Let \(X_1, ..., X_n\) be a sample from a population with pdf/pmf \(f(x|\theta_1,...,\theta_k)\). MOM estimators are found by equating the first \(k\) sample moments to the corresponding \(k\) population moments, and solving the resulting system of simultaneous equations.
Maximum likelihood estimation (MLE)
Get \(L(\theta | \mathbf{x})\), then \(\ln L(\theta | \mathbf{x})\); set \(\frac{d}{d\theta} \ln L(\theta | \mathbf{x}) = 0\) and solve over the parameter's domain to find candidate MLEs. Verify that each candidate is a maximum and check the endpoints of the domain.
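A sketch with an assumed exponential model: for \(X_i\) iid with pdf \(f(x|\theta) = \frac{1}{\theta} e^{-x/\theta}\), setting \(\frac{d}{d\theta}\ln L(\theta|\mathbf{x}) = 0\) gives the closed form \(\hat{\theta} = \bar{x}\); a numerical maximization of the log-likelihood should agree.

```python
# Numerical MLE for the exponential scale parameter vs. xbar.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(11)
x = rng.exponential(scale=3.0, size=10_000)

def neg_log_lik(theta):
    # -ln L(theta|x) = n ln(theta) + sum(x)/theta
    return len(x) * np.log(theta) + x.sum() / theta

res = optimize.minimize_scalar(neg_log_lik, bounds=(0.01, 100.0),
                               method="bounded")
print(res.x, x.mean())   # numerical MLE matches xbar
```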
Bayes
\(\hbox{posterior} = \pi(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\pi(\theta) / m(\mathbf{x})\)
where \(f(\mathbf{x}|\theta)\pi(\theta) = f(\mathbf{x},\theta)\), the joint PDF and the marginal PDF \(m(\mathbf{x}) = \int f(x|\theta) \pi(\theta) d\theta\). \(\pi(\theta)\) is your prior distribution.
Definition: Let \(\mathbf{F}\) denote the class of pdfs or pmfs \(f(x|\theta)\) (indexed by \(\theta\)). A class \(\Pi\) of prior distributions is a conjugate family for \(\mathbf{F}\) if the posterior distribution is in the class \(\Pi\) for all \(f \in \mathbf{F}\), all priors in \(\Pi\) and all \(x \in \mathbf{X}\).
For instance, the beta family is conjugate for the binomial family. Thus, if we start with a beta prior, we will end up with a beta posterior.
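The beta–binomial update is simple enough to write out; a sketch with assumed numbers (prior \(beta(2,2)\), \(x = 7\) successes in \(n = 10\) trials):

```python
# Beta prior + binomial likelihood -> beta posterior.
def beta_binomial_update(a, b, n, x):
    """Posterior beta parameters after observing x successes in n trials."""
    return a + x, b + n - x

post = beta_binomial_update(a=2, b=2, n=10, x=7)
print(post)                          # (9, 5)
mean = post[0] / (post[0] + post[1])
print(mean)                          # posterior mean E(p|x) = 9/14
```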
Examples Finding Estimators
Example: Normal distribution
MOM
If \(X_1, ..., X_n\) are iid \(n(\theta,\sigma^2)\), then \(\theta_1 = \theta\) and \(\theta_2 = \sigma^2\). We have \(m_1 = \bar{X}\), \(m_2 = \frac{1}{n} \sum X_i^2\), \(\mu_1^\prime = \theta\), \(\mu_2^\prime = \theta^2 + \sigma^2\), and hence we must solve \(\bar{X} = \theta\) and \(\frac{1}{n}\sum X_i^2 = \theta^2 + \sigma^2\).
Solving for \(\theta\) and \(\sigma^2\) yields the MOM estimators \(\tilde{\theta} = \bar{X}\) and \(\tilde{\sigma}^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2\).
Bayes
(Example 7.2.16) Let \(X \sim n(\theta, \sigma^2)\) and suppose that the prior distribution on \(\theta\) is \(n(\mu, \tau^2)\). Here, we assume all the parameters are known. The posterior distribution of \(\theta\) is also normal, with mean and variance given by \(E(\theta|x) = \frac{\tau^2}{\tau^2 + \sigma^2} x + \frac{\sigma^2}{\tau^2 + \sigma^2} \mu\) and \(Var(\theta|x) = \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\).
Step 1: evaluate the posterior \(\pi(\theta|\vec{x}) = \frac{f(x|\theta) \pi(\theta)}{m(x)}\). Or, since \(m(x)\) is not dependent on \(\theta\), we can just evaluate the joint distribution \(f(x,\theta) = f(x|\theta) \pi(\theta)\).
Step 1b: Find \(m(x) = \int f(x|\theta) \pi(\theta) d\theta = \int f(x,\theta) d\theta\).
We can continue to simplify the joint \(f(x,\theta)\).
Therefore, we can find the mean and variance
When \(\tau^2\) is near 0, the weight on \(\bar{x}\) is 0 and the weight on \(\mu\) is 1.
The normal family is its own conjugate.
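The posterior mean's shrinkage behavior can be sketched directly (a single observation \(x\) is assumed, with posterior mean \(\frac{\tau^2}{\tau^2+\sigma^2} x + \frac{\sigma^2}{\tau^2+\sigma^2} \mu\)):

```python
# Posterior mean as a precision-weighted average of data and prior.
def posterior_mean(x, mu, sigma2, tau2):
    w = tau2 / (tau2 + sigma2)   # weight on the data
    return w * x + (1 - w) * mu

print(posterior_mean(x=10.0, mu=0.0, sigma2=1.0, tau2=1e-8))  # ~ 0 (prior wins)
print(posterior_mean(x=10.0, mu=0.0, sigma2=1e-8, tau2=1.0))  # ~ 10 (data wins)
```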
Example: Let \(X_1, ..., X_n\) be iid \(binomial(k,p)\), that is, \(P(X_i = x|k,p) = \binom{k}{x} p^x (1-p)^{k-x}\)
on \(x = 0,1,...,k\).
We desire point estimators for both \(k\) and \(p\). We start by equating the sample moments to the population moments:
Or more simply,
With these definitions, we can build parameter estimates
Metrics for evaluating estimators
Definition: Mean squared error (MSE)
The MSE of an estimator \(W\) of a parameter \(\theta\) is the function of \(\theta\) defined by \(E_\theta (W - \theta)^2 = Var_\theta W + (Bias_\theta W)^2\), where \(Bias_\theta W = E_\theta W - \theta\).
This is usually a balancing act between variance and bias.
Note: for an unbiased estimator (\(Bias_\theta W = 0\)), we have \(E_\theta (W - \theta)^2 = Var_\theta W\).
Example: Normal MSE
Let \(X_1, ..., X_n\) be iid \(n(\mu, \sigma^2)\). The statistics \(\bar{X}\) and \(S^2\) are both unbiased estimators, since \(E\bar{X} = \mu\) and \(ES^2 = \sigma^2\)
for all \(\mu\) and \(\sigma^2\).
This is true even without the normality assumption. The MSEs of these estimators are \(MSE(\bar{X}) = \frac{\sigma^2}{n}\) and \(MSE(S^2) = Var\, S^2 = \frac{2\sigma^4}{n-1}\) (the latter under normality).
This goes to 0 as \(n \xrightarrow{} \infty\).
For a non-normal case, this is \(MSE_{\sigma^2}(S_n^2) = \frac{1}{n}(\theta_4 - \frac{n-3}{n-1} \theta_2^2)\)
Example: \(MSE(\hat{\sigma}^2) < MSE(S^2)\)
We know the MLE of \(\sigma^2\) is \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n} S^2\), so \(E\hat{\sigma}^2 = \frac{n-1}{n} \sigma^2\).
So, \(\hat{\sigma}^2\) is a biased estimator of \(\sigma^2\). The variance of \(\hat{\sigma}^2\) can be calculated as \(Var\, \hat{\sigma}^2 = \left(\frac{n-1}{n}\right)^2 Var\, S^2\),
because \(Var\, S^2 = \frac{1}{n}(\theta_4 - \frac{n-3}{n-1} \theta_2^2)\), from above; under normality this reduces to \(Var\, S^2 = \frac{2\sigma^4}{n-1}\).
Therefore, in the normal case, the \(MSE\) is given by \(E(\hat{\sigma}^2 - \sigma^2)^2 = \frac{2(n-1)\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{2n-1}{n^2}\sigma^4 < \frac{2\sigma^4}{n-1} = MSE(S^2)\).
Conclusions about MSE:
A small increase in bias can be traded for a large decrease in variance, resulting in a smaller MSE.
Because MSE is a function of the parameter, there is often no single best estimator. Often, the MSEs of two estimators cross each other, showing that each estimator is better (wrt the other) in only a portion of the parameter space.
This second bullet point is why we discuss other tactics of finding the best estimator… see next!
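The bias-variance tradeoff in this example can be confirmed by simulation (assumed setup: normal data with \(\sigma^2 = 1\), \(n = 5\)):

```python
# Estimated MSE of the biased MLE vs. the unbiased S^2.
import numpy as np

rng = np.random.default_rng(12)
sigma2, n, reps = 1.0, 5, 400_000
samples = rng.normal(0.0, 1.0, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)    # unbiased S^2
mle = samples.var(axis=1, ddof=0)   # biased MLE, (n-1)/n * S^2

mse_s2 = np.mean((s2 - sigma2) ** 2)
mse_mle = np.mean((mle - sigma2) ** 2)
print(mse_s2, mse_mle)   # MSE of the MLE is smaller
```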
Definition: Best unbiased estimator
We want to recommend a candidate estimator. Specifically, we consider unbiased estimators. So, if both \(W_1\) and \(W_2\) are unbiased estimators of a parameter \(\theta\), that is, \(E_\theta W_1 = E_\theta W_2 = \theta\), then their MSE are equal to their variances, so we should choose the estimator with the smaller variance.