Portfolio website: Michael Hopwood

Hello, I am a PhD student at the University of Central Florida studying data science. I have worked on many data science and data analytics projects, all of which are summarized in my CV. Summaries of select journal papers, conference proceedings, hackathons, and personal projects are located in projects. For a more up-to-date list of my publications, see my Google Scholar.
Here’s my LinkedIn; let’s connect!
An ongoing goal of mine is to record some notes from my PhD and other studies. An abridged version of my PhD notes is captured in notes, which will continue to grow as I find time to transcribe them! The notes are a concatenation of lecture notes and online resources; I try to cite resources as often as I can, but do not always manage to.
This website was built to be a portfolio for others to see; however, perhaps more importantly, it functions as a platform for me to keep track of previous projects and my ongoing notes.
Projects
Exploring a link between network topology and active learning
Active learning in graph neural networks can provide similar, if not better, results in a node classification task simply by ordering the input data in the training process. An ordering process is studied and an empirical rule of thumb is established to infer the best ordering based on the topology of the network. This work was formalized in a peer-reviewed journal paper and a conference paper.

An assessment of the value of principal component analysis for photovoltaic IV trace classification of physically-induced failures
Utilizing PCA on photovoltaic current-voltage (IV) curves improves the performance of a failure classification task. Observing the performance of a random forest classifier on point-wise classification shows better results where the IV curve profile more often contains failure trends. This work was presented at the PVSC47 conference in 2020 and won best student presentation and paper.

Neural network-based classification of string-level IV curves from physically-induced failures of photovoltaic modules
A classification task of photovoltaic failures using photovoltaic current-voltage (IV) curves is addressed with neural networks, specifically a single-headed LSTM, a multi-headed LSTM, and a 1D CNN. Results show high accuracy (99%+) in the classification of three common failures which were physically induced in a real system. This study established a methodology which could be applied to a larger list of failures in more complicated systems. This work was formalized in a peer-reviewed journal paper.

Physics-based method for generating fully synthetic IV curve training datasets for machine learning classification of PV failures
To circumvent having to induce failures in real systems for data collection (as done in the study above), a physics-based simulation framework is built. Results show no difference in a failure classification task (evaluated on measured data) when using simulations or real measured data in training. The code was released here and a paper is under review at a journal.

pvOps: Improving Operational Assessments through Data Fusion
Two conference proceedings (PVSC & AGU) introduce a python package which preprocesses text data (operations and maintenance tickets) and time-series data (meteorological and electrical measurements) for the purpose of better-informed knowledge retrieval. Data preprocessing steps were established for both types of data. Results show distributions of failure frequencies and failure impacts on system production.

Electric Vehicle Detection
Through a competition with an energy utility provider, a detection task was conducted to identify whether a household had an electric vehicle (EV). Our results showed an accuracy of 82.4%, which beat our competitors' results (~70%). Outside of the data provided in the competition, we found data from the United Nations online which helped capture global trends in EV adoption. Data smoothing and ensemble models were used to establish the prediction. A white paper was written and submitted to the company and is covered by an NDA.

Generalized Low-Rank Models for Parking Garage Capacity Modeling
During a short Stanford hackathon, I studied parking capacity levels across different garages at UCF. A GLRM was built to model busy-day conditions so that deviations from this model would indicate smaller-than-normal capacity for a given time of day. Specifically, a set of regularization methods was adopted to minimize the reconstruction loss, where a busy day is constrained to be smooth, parabolic, and similar to other busy days. The results generated interesting garage-specific profiles; additionally, garage capacity changes due to COVID-19 were analyzed. The code and paper are available online.

Notes (Beta)
The notes on this website were written during my PhD in data science. Below is a list of topics.
Supervised Learning: An Introduction
Least-squares
The assumption is that \(f(X) = E(Y|X)\) is linear.
Here, \(X \in R^{n \times (p+1)}\) where the extra parameter is a column of ones for the intercept \(\beta_0\).
We assume that \(Y = E(Y|X) + \epsilon\), where \(\epsilon\) captures the variation not explained by the predictors (e.g. noise) and \(\epsilon \sim N(0, \sigma_{\epsilon}^2)\), independent of \(X\).
Here the \(\beta_0\) term is the model bias (intercept). The gradient \(f^\prime(X) = \beta\) is a vector in input space that points in the steepest uphill direction. To fit the model, a (simple) method is least squares: we pick coefficients \(\beta\) to minimize the residual sum of squares,
which is a quadratic function of the parameters. Therefore, a minimum always exists but may not be unique. In matrix notation, \(RSS(\beta) = (y - X\beta)^T (y - X\beta)\),
where \(X\) is an \(N \times p\) matrix with each row a sample, and \(y\) is an \(N\)-vector of the outputs in the training set. Differentiating w.r.t. \(\beta\) we get the normal equations \(X^T (y - X\beta) = 0\).
If \(X^T X\) is nonsingular (i.e. invertible: \(AB = BA = I\)), then the unique solution is given by \(\hat{\beta} = (X^T X)^{-1} X^T y\).
The projection matrix (or hat matrix) is \(H = X(X^T X)^{-1} X^T\), so that \(\hat{y} = Hy\). Given an observation \(y\), the best we can do is find the projection of \(y\) onto the column space of \(X\). The residual (the orthogonal component) should be independent of \(\hat{y}\), because the best linear prediction has no systematic bias. By Pythagoras, the length of \(\hat{y}\) is never longer than the length of \(y\).
The error is \(y - \hat{y} = (I - H)y\). We can check its distribution.
Because \(Y = X\beta + \epsilon\) with \(E[Y|X] = X\beta\),
Because \(E[\epsilon \epsilon^T] = \sigma_{\epsilon}^2 I\), we conclude that
We know \(HH = X(X^T X)^{-1} X^T X(X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = H\), i.e. \(H\) is idempotent.
Projecting \(y\) onto the space orthogonal to the column space of \(X\) is done by \(I-H\). When we multiply this by \(H\), we get \(H - HH = 0\).
If given a new input, then the same thing applies.
We know \(RSS\) increases with degrees of freedom, so we use the residual standard error (RSE) instead:
This is an unbiased estimator: \(\hat{\sigma}_{\epsilon}^2 = \frac{RSS}{N - (p+1)} = RSE^2\)
Therefore, a solution for the best \(\beta\) can be found without iteration.
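As a quick illustration (a minimal sketch on synthetic data, not part of the coursework), the closed-form solution, hat matrix, and RSE can be computed directly with numpy:
import numpy as np

np.random.seed(0)
n, p = 50, 2
X = np.hstack([np.ones((n, 1)), np.random.randn(n, p)])   # include intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + np.random.randn(n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # normal-equations solution
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix, y_hat = H y
y_hat = H @ y
residuals = y - y_hat
RSE = np.sqrt(residuals @ residuals / (n - (p + 1)))   # unbiased estimate of sigma_eps
print(beta_hat, RSE)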
A “poor man’s” classifier can use linear regression and predict \(1(\hat{Y} > 0.5)\). Ideally, we would like to estimate \(P(Y=1|X=x)\)
Additionally, because \(tr(AB) = tr(BA)\), we know \(tr(H) = tr(X(X^T X)^{-1} X^T) = tr((X^T X)^{-1} X^T X) = tr(I_{p+1}) = p + 1\). Examining the residuals lets us judge how bad an outlier is.
\(Var(\hat{\epsilon}) = Var(y - \hat{y}) = (I - H) \sigma_\epsilon^2\)
\(Var(\hat{\epsilon}_i) = (1 - h_i) \sigma_\epsilon^2\), where \(h_i\) is the \(i\)th diagonal element of \(H\). This gives \(H\) an important role in the variance of the residuals: if \(h_i\) is large, the residual must have a small variance. \(h_i\) is called the leverage.
For simple linear regression (p=1), \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\), where the ratio \(\frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\) is point \(i\)'s share of the total variation in \(x\) and is always between 0 and 1.
We can standardize the residual by its estimated standard deviation, giving the standardized residual \(r_i\) (which follows a t-distribution).
For instance, if \(r_i = -2.5\), the point lies far in the lower tail of the distribution. A large \(|r_i|\) indicates an outlier.
Cook's distance measures how much the fitted model changes when a specific sample is left out.
As shown here, an influential point must have high leverage and a large standardized residual. We call this influence, in the sense that the point has a large effect on the final model.
On a plot of the square root of the standardized residuals vs. the fitted values, a nonlinear pattern suggests a transformation is needed (this plot checks the constant-variance assumption on the residuals). Consider transforming X instead of Y; if you have heteroscedasticity, consider transforming Y. https://stats.stackexchange.com/questions/116486/why-y-should-be-transformed-before-the-predictors
On a plot of standardized residuals vs. leverage, contour lines mark Cook's distances of 0.5 and 1. If a point lies outside the contour at 1, it is a very influential point; between 0.5 and 1, it is influential but not as extreme.
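A sketch of these diagnostics computed by hand, reusing X, H, residuals, n, and p from the sketch above; the formulas follow the notes rather than any particular library:
h = np.diag(H)                                    # leverage of each point
sigma2_hat = residuals @ residuals / (n - (p + 1))
r = residuals / np.sqrt(sigma2_hat * (1 - h))     # standardized residuals
# Cook's distance: influence of deleting point i on the whole fit
cooks_d = (r**2 / (p + 1)) * (h / (1 - h))
print(h[:3], r[:3], cooks_d[:3])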
How do we evaluate our linear model?
We can use the residual standard error (RSE): the smaller the better. This depends on the unit and scale of the response.
\(R^2\) instead measures the proportion of variance explained by regression.
With no predictors, we can only assume that all of the Ys are IID, so the best fit is a horizontal line at the mean.
\(SSE\) (or \(RSS\)), as described above, measures from the data point to the regression line: the total variability not explained by the model (\(= SST - SSR\)).
\(SSR\) measures from the regression line to the mean (horizontal line): the total variability explained by the model.
\(SST\) measures from the data point to the mean and defines the total variability in your dataset.
We know that \(SST = SSE + SSR\).
\(R^2\) (r-squared, r squared) measures the proportion of variability which can be explained by the model.
For simple linear regression, \(Y = \beta_0 + \beta_1 X + \epsilon\), \(R^2 = r^2\), where \(r\) is the Pearson correlation. The Pearson correlation coefficient is \(r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}\).
It measures how strongly X and Y are correlated, and lies in the range [-1, 1].
Question: Can we get \(F\)-statistic from \(R^2\)?
Yes. With \(n\) samples and \(p\) predictors, \(F = \frac{SSR/p}{SSE/(n-p-1)} = \frac{R^2/p}{(1-R^2)/(n-p-1)}\), since \(R^2 = SSR/SST\) and \(SSE = SST - SSR\).
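A quick numerical check of this relationship (the sums of squares below are arbitrary example values):
# Check that F computed from sums of squares matches F computed from R^2
n, p = 100, 1
SSR, SSE = 319.1, 90.8          # example values
SST = SSR + SSE
R2 = SSR / SST
F_from_ss = (SSR / p) / (SSE / (n - p - 1))
F_from_r2 = (R2 / p) / ((1 - R2) / (n - p - 1))
print(F_from_ss, F_from_r2)     # identical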
Statistical Inference
Let a random sample \(X_1, X_2, ..., X_n\) be i.i.d. A statistic \(T = T(X_1, X_2, ..., X_n)\) is a function of the random sample; for example, the sample mean. An estimator is \(\hat{\theta} = \hat{\theta}(X_1, X_2, ..., X_n)\). Our hypothesis could be to test \(H_0 : \theta = \theta_0\) (two-sided).
There are two types of errors (type 1 and type 2).
p-value: the probability of obtaining a value of the statistic more extreme than the observed value under \(H_0\). This mimics the type 1 error.
Confidence interval: an interval \(CI(X_1, X_2, ..., X_n)\) constructed from the random sample such that \(Pr(\theta \in CI(X_1, X_2, ..., X_n)) = 1 - \alpha\). The narrower the confidence interval, the more precise the estimate, given a pre-specified confidence level.
Common distributions
If \(Z_i \sim N(0,1)\), i.i.d. then \(\sum_i Z_i^2 \sim \chi_n^2\) where \(n\) is the number of RVs.
If \(Z \sim N(0,1)\) and \(W \sim \chi_n^2\) and they are independent, then \(\frac{Z}{\sqrt{W/n}} \sim t_n\). For instance, the ratio of \(\bar{x}\) to the sample standard deviation \(s\).
If \(W_1 \sim \chi_n^2\) and \(W_2 \sim \chi_m^2\) and they are independent, then \(\frac{W_1/n}{W_2/m} \sim F_{n,m}\). For instance, the F-test compares SSR and SSE, each scaled by its degrees of freedom (see the simulation sketch below).
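These constructions can be verified by simulation; a small sketch (assuming numpy and scipy are available):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
Z = rng.standard_normal((100000, 5))
chi2_samples = (Z**2).sum(axis=1)                              # sum of 5 squared N(0,1): chi^2_5
print(stats.kstest(chi2_samples, 'chi2', args=(5,)).pvalue)    # large p-value: consistent

t_samples = Z[:, 0] / np.sqrt((Z[:, 1:]**2).sum(axis=1) / 4)   # N(0,1) over sqrt(chi^2_4 / 4): t_4
print(stats.kstest(t_samples, 't', args=(4,)).pvalue)

F_samples = ((Z[:, :2]**2).sum(axis=1) / 2) / ((Z[:, 2:]**2).sum(axis=1) / 3)  # F_{2,3}
print(stats.kstest(F_samples, 'f', args=(2, 3)).pvalue)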
Significance t-tests of coefficients
Check the importance of the coefficients. Generally, if \(\beta_j = 0\), we say that predictor \(j\) is not aiding the model's prediction abilities.
\(H_0 : \beta_j = 0\) and \(H_1: \beta_j \neq 0\)
Note \(\hat{\beta} = (X^T X)^{-1} X^T (X \beta + \epsilon)\)
Given \(X\), \(\hat{\beta} \sim N(\beta, (X^T X)^{-1} \sigma_\epsilon^2)\). The expected value is therefore: \(\beta\).
Also, \(Cov(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] = (X^T X)^{-1} \sigma_\epsilon^2\), using \(E(Y|X) = X\beta\).
The Z-score for each \(\hat{\beta_j}\) is
Under the null (\(\beta_j = 0\)), we should get \(z_j \sim t_{N-p-1}\).
More goodness of fit
Bringing in more predictors will reduce SSE naturally (and increase \(R^2\)). Instead, we can do a hypothesis test.
SSR is independent of SSE. We define the F-statistic
then \(F \sim F_{p, N-p-1}\). The p-value is \(Pr(F_{p,N-p-1} \geq F)\). If \(F\) is large, reject the null.
Two parameters define a line (an intercept and a slope); with \(p\) predictors, SSR has \(p\) degrees of freedom. The residual has \(N - p - 1\) degrees of freedom, since the model estimates \(p\) predictor coefficients and one intercept.
Question: Is t-test and f-test equivalent for simple linear regression?
Answer: Yes!
Nearest neighbors
For regression, kNN predicts the average value of the \(k\) nearest neighbors. This replaces the expected value (in ordinary regression) with the sample average. For classification, a majority vote is conducted.
With a large number of variables, a larger \(k\) is required. If \(k\) is kept the same, a smaller fraction of truly nearby points will be included (the curse of dimensionality). As the number of features increases, the definition of the neighborhood also has to expand, so the bias increases. This is because adding another feature inherently pushes the points further apart.
Also, as you increase \(k\), a smoother surface will be formed (i.e. reduced variance).
The best \(k\) can be found empirically.
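A minimal sketch of finding \(k\) empirically with cross-validation, using scikit-learn on synthetic data (the data and the candidate \(k\) values are illustrative only):
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)

for k in [1, 5, 15, 50]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y, cv=5,
                             scoring='neg_mean_squared_error')
    print(k, -scores.mean())   # small k: low bias / high variance; large k: the opposite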
Bias-variance tradeoff
For a fixed \(x_0\),
We know that \(E[\hat{f}(x_0) - E\hat{f}(x_0)] = 0\). Therefore,
There is no bias if \(k=1\) in nearest-neighbor analysis. Small \(k\) gives small bias but high variance. Large \(k\) averages over more neighbors, so the variance is low (the sample average carries a \(\frac{1}{k}\) factor) but the bias is high.
Linear regression vs. kNN
Linear regression has high bias (linear assumption can be violated) but only needs to estimate p+1 parameters.
kNN uses effectively \(\frac{n}{k}\) parameters but is flexible and adaptive. It has small bias but large variance.
Interval prediction
Confidence interval
A confidence interval of \(f(X) = \sum_{j=0}^p \beta_j x_j\) for given \(x=(x_0,...,x_p)^T\) is
Why? Hint: What is the distribution of \(\vec{x}^T \hat{\beta}\), where \(\vec{x} = (x_0=1, x_1, ..., x_p)^T\)?
Prediction interval
A confidence interval of \(y = \sum_{j=0}^p \beta_j x_j + \epsilon\) for given \(x=(x_0,...,x_p)^T\) is
The error term \(\sum_j(\beta_j x_j - \hat{\beta}_j x_j)\) has the CI of \(\sum_j \beta_j x_j\) calculated above.
This is because \(y - \hat{y} = y - \sum_j \hat{\beta}_j x_j = \epsilon + \sum_j \beta_j x_j - \sum_j \hat{\beta}_j x_j\).
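A short sketch contrasting the two intervals with statsmodels on synthetic data (the data are made up; the get_prediction/summary_frame calls are standard statsmodels API):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.standard_normal(100)

res = sm.OLS(y, sm.add_constant(x)).fit()
pred = res.get_prediction(sm.add_constant(np.array([2.0, 5.0])))
frame = pred.summary_frame(alpha=0.05)
# mean_ci_* is the confidence interval for f(x); obs_ci_* is the wider prediction interval for y
print(frame[['mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])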
Review of Conditional Expectation
The conditional expected value is just the expectation when X is specified.
Conditional expectation is a random variable. Without specifying \(X=x\), \(E(Y|X)\) is a function of \(X\). Because \(X\) is a RV, \(E(Y|X)\) is also a RV.
Tower property: \(E(Y) = E[E(Y|X)]\).
If \(X\) takes a fixed value such as \(x_0 = 0\), then \(g(x_0) = E(Y|X=x_0)\) is deterministic (i.e. not random). Its form may be unknown, or may involve unknown parameters, e.g.
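The tower property can be checked by simulation; a tiny sketch (the model \(Y = c + X^2 + \epsilon\) mirrors the example below):
import numpy as np

rng = np.random.default_rng(0)
c = 2.0
X = rng.standard_normal(1_000_000)
Y = c + X**2 + rng.standard_normal(1_000_000)

# E[E(Y|X)] where E(Y|X) = c + X^2, versus the plain sample mean of Y
print(np.mean(c + X**2), np.mean(Y))   # both approach E(Y) = c + 1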
Example
\(Y = c + X^2 + \epsilon\), \(\epsilon\) independent of \(X\), \(\epsilon \sim N(0,1)\)
\(E(Y|X) = E(c + X^2 + \epsilon | X) = c + X^2 + E(\epsilon|X) = c + X^2\) where \(E(\epsilon|X)=0\)
Example
\(Y = X^2 + 10X + 20 + \epsilon\)
where \(\epsilon \sim N(0, 3)\) and \(X \sim N(30,10)\)
In this case, we know the underlying probability model.
The joint distribution gives a lot of information!
We can find the best model \(f\) by minimizing a loss function (e.g. \(L(Y, f(X)) = (Y - f(X))^2\)).
Because we have assumed that we know the joint distribution (and it’s all continuous), then we evaluate an integral.
The best \(f\) is \(E(Y|X=x)\).
This depends on your loss function (here, squared loss). If you use L1 loss, the best \(f\) is the conditional median. Squared loss is convenient because it is differentiable; however, it can be influenced by extreme values.
Minimize \(E[(Y - f(X))^2 | X]\) for every \(X\). This can be decomposed
With \(A = Y - E(Y|X)\) and \(B = E(Y|X) - f(X)\)
Given \(X\), \(B\) is a constant and \(E[A|X] = 0\), so the cross term \(E[AB|X]\) vanishes.
Therefore, \(EPE(f) = EPE(E(Y|X)) + E[B^2] \geq EPE(E(Y|X))\).
If the population is known, then simply \(f(x) = \int y\, f_{Y|X}(y|x)\, dy\). This is the ideal case where you have the population; however, this is rare.
For example, if \(Y\) is a known function of \(X\) (with some error), then you know the conditional distribution. From this, you can compute \(f\) as the mean of that conditional distribution.
Categorical classification
A loss matrix can be used to penalize certain misclassifications more heavily.
For example, in stock market prediction, we may place a heavier weight on the loss for the more costly kind of error (say, predicting an increase when the market actually drops).
A popular choice is \(L = 1_{K \times K} - I_K\), a matrix of ones except for zeros on the diagonal (no loss is incurred when the prediction is correct). This can also be expressed as \(L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))\), the 0-1 loss.
The solution that minimizes the EPE is \(\hat{G}(x) = \arg\max_g Pr_{G|X}(g|x)\): the group that maximizes the conditional probability \(Pr_{G|X}(g|x)\). This is called the Bayes classifier, and its error is called the Bayes rate. Each group has a prior (original) distribution, e.g. increasing and decreasing are equally likely; given yesterday's information, we update and calculate the posterior probability \(Pr_{G|X}(g|x)\).
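A minimal sketch of choosing the class that minimizes expected loss under a loss matrix (the posterior and loss values below are made up for illustration):
import numpy as np

# L[g, ghat]: loss of predicting ghat when the truth is g
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])       # missing class 1 is penalized 5x more
posterior = np.array([0.7, 0.3]) # Pr(G = g | X = x)

expected_loss = posterior @ L    # expected loss of each possible prediction
print(expected_loss.argmin())    # with 0-1 loss this would reduce to argmax of the posterior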
Example
Generate \(X|G \sim N(\mu_G, I_2)\) where two centers are defined: \(\mu_1 = (0,1)^T, \mu_2 = (1,0)^T\)
Because this was generated, we know the labels: \(G_{1}, ..., G_{100} = 1\) and \(G_{101}, ..., G_{200} = 2\).
The Bayes classifier is found using the known generating distribution \(X|G \sim N(\mu_G, I_2)\); a priori, each group is equally likely. The boundary between these two groups is found by
comparing \(E[1(G=1)|X=x_0] = P(G=1|x_0)\) versus \(P(G=2|x_0)\); the larger one is chosen for the point.
At the beginning, \(P(G=1) = P(G=2) = 0.5\).
At a sample located at \(\vec{x}_0 = (10,9)\), the posterior can be evaluated by \(P(G = 1 | x_0 = (10,9)) = \frac{f(x_0 = (10,9) | G=1)\, P(G=1)}{f_X((10,9))}\).
Here \(f(x_1, x_2)\) is the bivariate normal density (a function of \(X_1\) and \(X_2\)).
So plug in the likelihood of observing \(X\) multiplied by the prior (per Bayes' rule). Bayes' rule expresses the posterior as the joint probability divided by the marginal.
Linear regression
With a feature \(p=1\), what is the estimated \(\beta\)?
Solution: Take the derivative and then set equal to zero. RSS will have a minimum.
\(RSS(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2\)
Exercises
Exercise: Suppose each of \(K\) classes has an associated target \(t_k\), which is a vector of all zeros, except a one in the \(k\)th position. Show that classifying to the largest element of \(\hat{y}\) amounts to choosing the closest target, \(\min_{k} ||t_k - \hat{y}||\), if the elements of \(\hat{y}\) sum to one.
Proof:
where \(t_k \in T\).
The model predicts \(Pr(y_i = t_k)\) where
For the first term, when \(k=i\), the quantity equals 1, else it is 0. Thus, \(\sum_i t_{k,i}^2 = 1\) for all values of \(k\). Likewise, the last term \(\sum_i y_i^2\) is independent of \(k\), so it is constant w.r.t. \(k\). Finally, the middle term \(\sum_i -2 t_{k,i} y_i = -2 y_k\), since only the \(i = k\) entry contributes; it varies across different values of \(k\), so it is a function of \(k\). Then, we can rewrite the above function as a function of only the middle term as follows:
Multiplying the above quantity by (-1), we can change the min to a max problem.
Therefore, the largest element in \(\hat{y}\) corresponds to the closest target.
Exercise: Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.
[6]:
from utils import disp
disp('bayes_decision_boundary.png')

Proof:
Above, we see two classes, each generated by a mixture of Gaussians. The generating density for a class is an equally weighted sum of 10 Gaussians \(N(m_k, I/5)\), where the means \(m_k\) were themselves drawn from \(N((0,1)^T, I)\) for one class and \(N((1,0)^T, I)\) for the other.
Bayes classifier says that we classify to the most probable class using the conditional distribution \(Pr(G|X)\). Hence, the decision boundary is the set of points that partitions the vector space into two sets: one for each class. On the decision boundary itself, the output label is ambiguous.
Boundary = \(\{ x: max_{g \in G} Pr(g | X=x) = max_{k\in G} Pr(k | X=x)\}\)
It is the set of points where the most probable class is tied between two or more classes.
In the case of two examples,
Boundary = \(\{ x: Pr(g|X=x) = Pr(k|X=x)\} = \{ x: \frac{Pr(g|X=x)}{Pr(k|X=x)} = 1 \}\)
We can rewrite the above quantity by Bayes rule as follows:
\(\frac{Pr(g|X=x)}{Pr(k|X=x)} = \frac{Pr(X=x|g) Pr(g) / Pr(X=x)}{Pr(X=x|k) Pr(k) / Pr(X=x)} = \frac{Pr(X=x | g) Pr(g)}{Pr(X=x|k) Pr(k)} = 1\)
because we have 100 points in each class, so \(Pr(g) = Pr(k)\). The boundary becomes \(\{x: Pr(X=x|g) = Pr(X=x|k) \}\). We know \(Pr(X=x|g)\) because we know the generating density is gaussian. So,
We take the log, which is monotonic and so does not change the comparison.
Equating class \(g\) and \(k\) to get the decision boundary, we get
Boundary = \(\{ x: \sum_{k=1}^{10} \ln( \frac{1}{5 \sqrt{2 \pi}}) - \frac{(x - m_k)^2}{2 \times 25} = \sum_{k=1}^{10} \ln( \frac{1}{5 \sqrt{2 \pi}}) - \frac{(x - n_i)^2}{2 \times 25} \}\)
The observations of cluster 1 (\(m_k\)) and cluster 2 (\(n_i\)) help generate the exact boundary.
[7]:
import numpy as np
import pandas as pd
class KNearestNeighbors():
    def __init__(self, X_train, y_train, n_neighbors=5, weights='uniform'):
        self.X_train = X_train
        self.y_train = y_train
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.n_classes = len(np.unique(y_train))

    def euclidian_distance(self, a, b):
        return np.sqrt(np.sum((a - b)**2, axis=1))

    def kneighbors(self, X_test, return_distance=False):
        dist = []
        neigh_ind = []
        point_dist = [self.euclidian_distance(x_test, self.X_train) for x_test in X_test]
        for row in point_dist:
            enum_neigh = enumerate(row)
            sorted_neigh = sorted(enum_neigh, key=lambda x: x[1])[:self.n_neighbors]
            ind_list = [tup[0] for tup in sorted_neigh]
            dist_list = [tup[1] for tup in sorted_neigh]
            dist.append(dist_list)
            neigh_ind.append(ind_list)
        if return_distance:
            return np.array(dist), np.array(neigh_ind)
        return np.array(neigh_ind)

    def predict(self, X_test):
        if self.weights == 'uniform':
            neighbors = self.kneighbors(X_test)
            print(neighbors)
            y_pred = np.array([
                np.argmax(np.bincount(self.y_train[neighbor]))
                for neighbor in neighbors
            ])
            return y_pred
        if self.weights == 'distance':
            dist, neigh_ind = self.kneighbors(X_test, return_distance=True)
            inv_dist = 1 / dist
            mean_inv_dist = inv_dist / np.sum(inv_dist, axis=1)[:, np.newaxis]
            proba = []
            for i, row in enumerate(mean_inv_dist):
                row_pred = self.y_train[neigh_ind[i]]
                for k in range(self.n_classes):
                    indices = np.where(row_pred == k)
                    prob_ind = np.sum(row[indices])
                    proba.append(np.array(prob_ind))
            predict_proba = np.array(proba).reshape(X_test.shape[0], self.n_classes)
            y_pred = np.array([np.argmax(item) for item in predict_proba])
            return y_pred

    def score(self, X_test, y_test):
        y_pred = self.predict(X_test)
        return float(sum(y_pred == y_test)) / float(len(y_test))
data = np.array(
[
[0,3,0,1],
[2,0,0,1],
[0,1,3,1],
[0,1,2,0],
[1,0,1,0],
[1,1,1,1]
]
)
X = data[:,0:3]
y = data[:,-1]
k = 3
knn = KNearestNeighbors(X,y,n_neighbors=k)
knn.predict([[0,0,0]])
[[4 5 1]]
[7]:
array([1], dtype=int64)
[10]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
np.random.seed(1234)
alpha = 0.95
cov = [[1, 0], [0, 1]] # diagonal covariance
mean0 = np.array([0, 1])
x, y = np.random.multivariate_normal(mean0, cov, 100).T
plt.plot(x, y, 'o', color='orange', label='G=0', alpha=0.6)
mean1 = np.array([1, 0])
x2, y2 = np.random.multivariate_normal(mean1, cov, 100).T
plt.plot(x2, y2, 'bo', label='G=1', alpha=0.6)
plt.axis('equal')
plt.legend()
plt.xlabel(r'$X_1$')
plt.ylabel(r'$X_2$')
plt.grid('..')
plt.show()
X = np.append(np.column_stack((x,y)), np.column_stack((x2,y2)), axis=0)
Y = np.append(np.zeros(len(x)), np.ones(len(x2))).astype(float)
nneighbors = [1,10,20,100]
accs = []
for n in nneighbors:
    neigh = KNeighborsClassifier(n_neighbors=n)
    neigh.fit(X, Y)
    yhat = neigh.predict(X)
    acc = accuracy_score(Y, yhat)
    accs.append(1 - acc)
plt.plot(nneighbors, accs, 'o-')
plt.xlabel(r'$k$ neighbors')
plt.ylabel('Error rate')
plt.show()
plt.plot(x, y, 'o', color='orange', label='G=0', alpha=0.6)
plt.plot(x2, y2, 'bo', label='G=1', alpha=0.6)
plt.axis('equal')
plt.legend()
plt.xlabel(r'$X_1$')
plt.ylabel(r'$X_2$')
plt.grid('..')
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda_object = lda.fit(X, Y)
print('m', lda_object.coef_)
print('b', lda_object.intercept_)
x1 = np.array([np.min(X[:,0], axis=0), np.max(X[:,0], axis=0)])
i = 0
c = 'r'
b, w1, w2 = lda.intercept_[i], lda.coef_[i][0], lda.coef_[i][1]
print(b, w1, w2)
y1 = -(b+x1*w1)/w2
plt.plot(x1,y1,c=c)
plt.xlim(-3,3)
plt.ylim(-4,4)
plt.show()


m [[ 1.15981357 -0.59760123]]
b [-0.43829084]
-0.4382908386736625 1.1598135678848691 -0.5976012319170687

[16]:
import collections
import numpy as np
from numpy import sqrt, exp
def pre_prob(y):
    y_dict = collections.Counter(y)
    pre_probab = np.ones(2)
    for i in range(0, 2):
        pre_probab[i] = y_dict[i]/y.shape[0]
    return pre_probab

def mean_var(X, y):
    n_features = X.shape[1]
    m = np.ones((2, n_features))
    v = np.ones((2, n_features))
    n_0 = np.bincount(y)[np.nonzero(np.bincount(y))[0]][0]
    x0 = np.ones((n_0, n_features))
    x1 = np.ones((X.shape[0] - n_0, n_features))
    k = 0
    for i in range(0, X.shape[0]):
        if y[i] == 0:
            x0[k] = X[i]
            k = k + 1
    k = 0
    for i in range(0, X.shape[0]):
        if y[i] == 1:
            x1[k] = X[i]
            k = k + 1
    for j in range(0, n_features):
        m[0][j] = np.mean(x0.T[j])
        v[0][j] = np.var(x0.T[j])*(n_0/(n_0 - 1))
        m[1][j] = np.mean(x1.T[j])
        v[1][j] = np.var(x1.T[j])*((X.shape[0]-n_0)/((X.shape[0] - n_0) - 1))
    return m, v  # mean and variance

def prob_feature_class(m, v, x):
    n_features = m.shape[1]
    pfc = np.ones(2)
    for i in range(0, 2):
        product = 1
        for j in range(0, n_features):
            product = product * (1/sqrt(2*np.pi*v[i][j])) * exp(-0.5 * pow((x[j] - m[i][j]), 2)/v[i][j])
        pfc[i] = product
    return pfc

def GNB(X, y, x):
    m, v = mean_var(X, y)
    pfc = prob_feature_class(m, v, x)
    pre_probab = pre_prob(y)
    pcf = np.ones(2)
    total_prob = 0
    for i in range(0, 2):
        total_prob = total_prob + (pfc[i] * pre_probab[i])
    for i in range(0, 2):
        pcf[i] = (pfc[i] * pre_probab[i])/total_prob
    prediction = int(pcf.argmax())
    return m, v, pre_probab, pfc, pcf, prediction
Y = Y.astype(int)
# executing the Gaussian Naive Bayes for the test instance...
m, v, pre_probab, pfc, pcf, prediction = GNB(X, Y, np.array([2,2]))
print(m) # Output given below...(mean for 2 classes of all features)
print(v) # Output given below..(variance for 2 classes of features)
print(pre_probab) # Output given below.........(prior probabilities)
print(pfc) # Output given below............(posterior probabilities)
print(pcf) # Conditional Probability of the classes given test-data
print(prediction) # Output given below............(final prediction)
[[0.12908891 0.85586925]
[1.13164542 0.12410715]]
[[0.79127964 1.14334245]
[0.84673467 0.96361033]]
[0.5 0.5]
[0.01034183 0.01819066]
[0.36245814 0.63754186]
1
[53]:
import pandas as pd # for data manipulation
import numpy as np # for data manipulation
from sklearn.model_selection import train_test_split # for splitting the data into train and test samples
from sklearn.metrics import classification_report # for model evaluation metrics
from sklearn.preprocessing import OrdinalEncoder # for encoding categorical features from strings to number arrays
import plotly.express as px # for data visualization
import plotly.graph_objects as go # for data visualization
# Differnt types of Naive Bayes Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import BernoulliNB
# Function that handles sample splitting, model fitting and report printing
def mfunc(X, y, typ):
    # Create training and testing samples
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Fit the model
    model = typ
    clf = model.fit(X_train, y_train)
    # Predict class labels on a test data
    pred_labels = model.predict(X_test)
    # Print model attributes
    print('Classes: ', clf.classes_)  # class labels known to the classifier
    if str(typ) == 'GaussianNB()':
        print('Class Priors: ', clf.class_prior_)  # prior probability of each class
    else:
        print('Class Log Priors: ', clf.class_log_prior_)  # log prior probability of each class
    # Use score method to get accuracy of the model
    print('--------------------------------------------------------')
    score = model.score(X_test, y_test)
    print('Accuracy Score: ', score)
    print('--------------------------------------------------------')
    # Look at classification report to evaluate the model
    print(classification_report(y_test, pred_labels))
    # Return relevant data for chart plotting
    return X_train, X_test, y_train, y_test, clf, pred_labels
y = Y
# Fit the model and print the result
X_train, X_test, y_train, y_test, clf, pred_labels, = mfunc(X, y, GaussianNB())
# Specify a size of the mesh to be used
mesh_size = 5
margin = 5
# Create a mesh grid on which we will run our model
x_min, x_max = np.min(X[:,0]) - margin, np.max(X[:,0]) + margin
y_min, y_max = np.min(X[:,1]) - margin, np.max(X[:,1]) + margin
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
xx, yy = np.meshgrid(xrange, yrange)
# Create classifier, run predictions on grid
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)
# Specify traces
trace_specs = [
[X_train, y_train, 0, 'Train', 'brown'],
[X_train, y_train, 1, 'Train', 'aqua'],
[X_test, y_test, 0, 'Test', 'red'],
[X_test, y_test, 1, 'Test', 'blue']
]
# Build the graph using trace_specs from above
fig = go.Figure(data=[
go.Scatter(
x=X[y==label,0], y=X[y==label,1],
name=f'{split} data, Actual Class: {label}',
mode='markers', marker_color=marker
)
for X, y, label, split, marker in trace_specs
])
# Update marker size
fig.update_traces(marker_size=5, marker_line_width=0)
# Update axis range
#fig.update_xaxes(range=[-1600, 1500])
#fig.update_yaxes(range=[0,345])
# Update chart title and legend placement
fig.update_layout(title_text="Decision Boundary for Naive Bayes Model",
legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
# Add contour graph
fig.add_trace(
go.Contour(
x=xrange,
y=yrange,
z=Z,
showscale=True,
colorscale='magma',
opacity=1,
name='Score',
hoverinfo='skip'
)
)
fig.show()
Classes: [0 1]
Class Priors: [0.5125 0.4875]
--------------------------------------------------------
Accuracy Score: 0.775
--------------------------------------------------------
precision recall f1-score support
0 0.76 0.72 0.74 18
1 0.78 0.82 0.80 22
accuracy 0.78 40
macro avg 0.77 0.77 0.77 40
weighted avg 0.77 0.78 0.77 40
(Interactive Plotly figure omitted: decision boundary for the Naive Bayes model.)
[6]:
import numpy as np
from scipy import stats
from pprint import pprint
np.random.seed(1)
n = 100
p = 1
model_type = 'y=f(x)'
#model_type = 'x=f(y)'
#add_intercept = False
add_intercept = True
df = pd.read_csv('_static//datamining_hw2_question2.csv')
xx = df['x'].values
yy = df['y'].values
# GENERATE DATA IN PYTHON
# xx = np.random.normal(0,1,n)
# yy = 2 * xx + np.random.normal(0,1,n)
if model_type == 'y=f(x)':
    y = yy
    x = xx
elif model_type == 'x=f(y)':
    y = xx
    x = yy

if add_intercept:
    xvec = np.hstack((np.vstack(np.ones(len(x))), np.vstack(x)))
else:
    xvec = np.vstack(x)

def f(arr):
    if add_intercept:
        return m[0] + m[1] * arr
    else:
        return m * arr

m = np.dot(np.linalg.inv(np.dot(xvec.T, xvec)), np.dot(xvec.T, y))
# also, m, _, _, _ = np.linalg.lstsq(xvec, y)
yhat = f(x)
print('constants', m)

def analyze_linear_model(y, yhat, x, n, p):
    ybar = np.sum(y)/len(y)
    residuals = y - yhat
    SSR = np.sum((yhat - ybar)**2)
    SST = np.sum((y - ybar)**2)
    SSE = np.sum((y - yhat)**2)  # or residuals.T @ residuals
    RSE = np.sqrt(SSR / (n - 2))
    MSE = (sum((y-yhat)**2))/(n-p)
    correlation_r = []
    for col in range(x.shape[1]):
        correlation_r.append(np.cov(x[:,col], y)[0][1] / (np.std(x[:,col]) * np.std(y)))
    sigma_squared_hat = SSE / (n - p)
    var_beta_hat = (np.linalg.inv(xvec.T @ xvec) * sigma_squared_hat)[0][0]
    # or var_beta_hat = MSE*(np.linalg.inv(np.dot(xvec.T,xvec)).diagonal())
    sd_b = np.sqrt(var_beta_hat)
    ts_b = m / sd_b
    p_ttest = 2*(1 - stats.t.cdf(np.abs(ts_b), (n - p)))
    F = (SSR/p)/(SSE/(n - p - 1))
    p_ftest = stats.f.cdf(F, p, n-p-1)
    R2_another_calc = 1 - (1 + F * (p) / (n - p - 1))**(-1)
    # print("r2 another way", R2_another_calc)
    info = {'SSR': SSR,
            'SSE': SSE,
            'SST': SST,
            'r2': SSR / SST,
            'RSE': RSE,
            'MSE': MSE,
            'r': correlation_r,
            'Var(Bhat)': var_beta_hat,
            'Sd(Bhat)': sd_b,
            't(Bhat)': ts_b,
            'p_ttest(Bhat)': p_ttest,
            'F(Bhat)': F,
            'p_ftest(Bhat)': p_ftest}
    pprint(info)
    import statsmodels.api as sm
    model = sm.OLS(y, x)
    results = model.fit()
    results_summary = results.summary()
    print(results_summary)
analyze_linear_model(y, yhat, xvec, n, p)
# Approximate form of t-test (for a no-intercept model)
approx_t_bhat = (np.sqrt(n - 1) * np.sum(x * y)) / np.sqrt(np.sum(x**2) * np.sum(y**2) - (np.sum(x * y))**2)
'approx_t_bhat', approx_t_bhat
constants [-0.03769261 1.99893961]
{'F(Bhat)': 344.31026392149494,
'MSE': 0.9175314048171069,
'RSE': 1.8045834713513378,
'SSE': 90.83560907689355,
'SSR': 319.1391074972955,
'SST': 409.97471657418896,
'Sd(Bhat)': 0.09649621618971455,
'Var(Bhat)': 0.009311519738932126,
'p_ftest(Bhat)': 0.9999999999999999,
'p_ttest(Bhat)': array([0.69692306, 0. ]),
'r': [nan, 0.8912022644583839],
'r2': 0.7784360707998542,
't(Bhat)': array([-0.39061235, 20.7152124 ])}
<ipython-input-6-9a69040a4fca>:61: RuntimeWarning: invalid value encountered in double_scalars
correlation_r.append(np.cov(x[:,col],y)[0][1]/ (np.std(x[:,col]) * np.std(y)))
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.778
Model: OLS Adj. R-squared: 0.776
Method: Least Squares F-statistic: 344.3
Date: Wed, 29 Sep 2021 Prob (F-statistic): 7.72e-34
Time: 21:56:20 Log-Likelihood: -137.09
No. Observations: 100 AIC: 278.2
Df Residuals: 98 BIC: 283.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0377 0.097 -0.389 0.698 -0.230 0.155
x1 1.9989 0.108 18.556 0.000 1.785 2.213
==============================================================================
Omnibus: 3.621 Durbin-Watson: 2.174
Prob(Omnibus): 0.164 Jarque-Bera (JB): 3.626
Skew: 0.448 Prob(JB): 0.163
Kurtosis: 2.743 Cond. No. 1.17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[6]:
('approx_t_bhat', 18.725931937448564)
[7]:
import numpy as np
from scipy import stats
import pandas as pd
from pprint import pprint
np.random.seed(1)
n = 100
p = 5
# TO BUILD IN PYTHON
# X = np.random.normal(np.arange(1,p+1), 1, (n,p))
# eps = np.random.normal(0,1,n)
# beta_star = np.array([1.0,0.0,2.0,-0.5,0.5,1.0])[:p+1]
# y = np.dot(xvec, beta_star) + eps
# Read in from R to ensure same data
if p == 5:
    filename = "_static//datamining_hw2_question3.csv"
    df = pd.read_csv(filename, index_col=0)
    X = df[[f'x.{i}' for i in range(1,p+1)]].values
    y = df['y'].values
else:
    filename = "_static//datamining_hw2_question3c.csv"
    df = pd.read_csv(filename, index_col=0)
    X = df[['x']].values
    y = df['y'].values
xvec = np.hstack((np.vstack(np.ones(len(X))),np.vstack(X)))
m = np.dot(np.linalg.inv(np.dot(xvec.T, xvec)), np.dot(xvec.T, y))
print('coefficients:', m)
yhat = np.dot(xvec, m)
analyze_linear_model(y, yhat, xvec, n, p)
import matplotlib.pyplot as plt
plt.plot(y)
plt.plot(yhat)
r2 = np.corrcoef(y,yhat) ** 2
r2
coefficients: [ 0.80429471 -0.01880692 2.07708222 -0.46570561 0.49094525 0.96793266]
{'F(Bhat)': 465.3294242077351,
'MSE': 0.9608057901465713,
'RSE': 4.80140017057294,
'SSE': 91.2765500639243,
'SSR': 2259.2374726018297,
'SST': 2350.5140226657495,
'Sd(Bhat)': 0.23273137162305024,
'Var(Bhat)': 0.054163891337546316,
'p_ftest(Bhat)': 0.9999999999999999,
'p_ttest(Bhat)': array([8.21720809e-04, 9.35763387e-01, 3.28626015e-14, 4.82409864e-02,
3.75311950e-02, 7.00777143e-05]),
'r': [nan,
0.7231431404804363,
0.9424031322894084,
0.6299604540858132,
0.7127929784445699,
0.8480567254338526],
'r2': 0.9611674088374925,
't(Bhat)': array([ 3.45589297, -0.08080958, 8.92480546, -2.00104357, 2.10949323,
4.1590124 ])}
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.961
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 465.3
Date: Wed, 29 Sep 2021 Prob (F-statistic): 1.17e-64
Time: 21:56:21 Log-Likelihood: -137.33
No. Observations: 100 AIC: 286.7
Df Residuals: 94 BIC: 302.3
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.8043 0.234 3.438 0.001 0.340 1.269
x1 -0.0188 0.099 -0.191 0.849 -0.215 0.177
x2 2.0771 0.095 21.926 0.000 1.889 2.265
x3 -0.4657 0.088 -5.263 0.000 -0.641 -0.290
x4 0.4909 0.092 5.310 0.000 0.307 0.675
x5 0.9679 0.084 11.474 0.000 0.800 1.135
==============================================================================
Omnibus: 1.726 Durbin-Watson: 2.009
Prob(Omnibus): 0.422 Jarque-Bera (JB): 1.664
Skew: -0.225 Prob(JB): 0.435
Kurtosis: 2.555 Cond. No. 18.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<ipython-input-6-9a69040a4fca>:61: RuntimeWarning: invalid value encountered in double_scalars
correlation_r.append(np.cov(x[:,col],y)[0][1]/ (np.std(x[:,col]) * np.std(y)))
[7]:
array([[1. , 0.96116741],
[0.96116741, 1. ]])

TODO: add CIs for the betas, try out an MLE solution (maximize the log-likelihood across the Box-Cox parameter lambda), add H leverage, add standardized residuals, plot standardized residuals (y) vs leverage (x), and add Cook's distance lines
Discriminant Analysis
Overview
Discriminant analysis employs a decision boundary to partition a feature space into classes. We find the boundary by finding where the discriminant scores are equal, \(\delta_1 = \delta_2\).
Details
Other notes
LDA has \(k \times p\) parameters (from \(\mu_k\)), plus \(k - 1\) (from \(\pi_k\)), plus \(p(p+1)/2\) from \(\Sigma\) (because the covariance matrix is symmetric)
QDA has \(k \times p\) parameters (from \(\mu_k\)), plus \(k - 1\) (from \(\pi_k\)), plus \(k \times p(p+1)/2\) from \(\Sigma_k\) (because each covariance matrix is symmetric)
QDA:
If we take the \(\log\) of this, it becomes the discriminant score. In LDA the first term cancels; the squared term remains in QDA, so a quadratic boundary is found.
\(\delta_k(x) = \ln p_k(x) + c\)
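To make the parameter counts above concrete, a small helper sketch (following the formulas in these notes):
def lda_param_count(k, p):
    # k*p means + (k-1) priors + one shared symmetric covariance
    return k * p + (k - 1) + p * (p + 1) // 2

def qda_param_count(k, p):
    # same, but each class gets its own covariance matrix
    return k * p + (k - 1) + k * p * (p + 1) // 2

print(lda_param_count(k=3, p=4), qda_param_count(k=3, p=4))   # 24 vs 44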
Hidden Markov Models (HMM)
A hidden Markov model (HMM) is a stochastic process (the evolution in time of a random process) based on a Markov chain (the probability of each event depends only on the state attained in the previous event). An HMM infers unobserved information from observed data through the study of probabilities.
Observed data can contain noise (e.g. due to sensor readings) which hides the ground truth. For instance, if reading location from a GPS while standing still, the sensor may jump around a small amount even though the hidden truth is constant. This “hidden truth” is called a state variable. An HMM chains these states together (chronologically) to estimate the state at time = 1 from time = 0 (for the case of a simple 1st-order Markov chain). Each state \(X_i\) has an associated observation \(Y_i\).
Assume there are M possible states to choose from (e.g. for binary, M=2). We can define a matrix A of size \(M \times M\), the transition matrix. This transition matrix contains the probabilities of moving to a state \(j\) given the previous state \(i\).
We can define a vector \(S_k\) at a specific point in time \(k\) as a probability vector.
This is calculated by multiplying the previous state’s probabilities \(S_{k-1}\) by the transition matrix \(A\).
Again, this tells us the probabilities of a state at timestep \(k\).
The probabilities of a state \(i\) resulting in an observed value \(j\) can be contained in an emission matrix \(B\) of size \(M \times K\), where \(K\) is the number of possible observed values. Written formally,
The emission matrix shows us the probability that a state \(i\) will produce an observation \(k\). For instance, there may exist a (say) 5% chance that intentionally typing “d” (state) results in “c” (observed) because your finger slips.
Generally speaking, a transition matrix \(A\) and emission matrix \(B\) could be rebuilt at each timestep \(t\) to consider non-stationary markov chains. However, in the above case, we assume stationary behavior and can therefore conclude a single matrix \(A\) that can be applied across time.
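A small numpy sketch of this state propagation under the stationarity assumption (the transition and emission probabilities are made-up numbers):
import numpy as np

A = np.array([[0.9, 0.1],    # transition matrix: A[i, j] = Pr(next state j | current state i)
              [0.3, 0.7]])
B = np.array([[0.95, 0.05],  # emission matrix: B[i, k] = Pr(observe k | state i)
              [0.10, 0.90]])
S = np.array([0.5, 0.5])     # initial state distribution

for _ in range(3):
    S = S @ A                # S_k = S_{k-1} A
    obs_prob = S @ B         # probability of each observed value at this step
    print(S, obs_prob)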
Baum Welch
The transition matrix \(A\), emission matrix \(B\), and initial state distribution \(\pi_0\) can be trained so that the model best matches the observed data. This is done with expectation maximization (EM). The training process has four phases: 1) initialization, 2) forward, 3) backward, and 4) update.
In the initialization phase, the \(A\), \(B\), and \(\pi_0\) parameters are initialized. This can be random, but the rows are usually drawn from a Dirichlet distribution, \(Dir(\alpha)\), which is a multivariate generalization of the beta distribution. The Dirichlet is an exponential-family distribution with a conjugate prior, so it can be used easily in Bayesian analysis.
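A sketch of that initialization step using numpy's Dirichlet sampler (the sizes \(M\) and \(K\) are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 4                                   # number of hidden states and observation symbols
alpha = np.ones(M)                            # symmetric Dirichlet concentration

A0 = rng.dirichlet(alpha, size=M)             # each row sums to 1
B0 = rng.dirichlet(np.ones(K), size=M)
pi0 = rng.dirichlet(alpha)
print(A0.sum(axis=1), B0.sum(axis=1), pi0.sum())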
NN Learning
We can solve for gradients in a vectorized manner.
A function \(\mathbf{f}: R^n \xrightarrow{} R^m\) exists such that \(\mathbf{f}(\mathbf{x}) = [f_1(x_1, ..., x_n), f_2(x_1, ..., x_n), ..., f_m(x_1, ..., x_n)]\). Its jacobian can be calculated as
In other words, \((\frac{d\mathbf{f}}{d\mathbf{x}})_{ij} = \frac{df_i}{dx_j}\). This jacobian is useful for the application of chain rule to a vector-valued function just by multiplying jacobians.
For example, suppose we have a function \(\mathbf{f}(x) = [f_1(x), f_2(x)]\) taking a scaler to a vector of size 2 and a function \(\mathbf{g}(\mathbf{y}) = [g_1(y_1, y_2), g_2(y_1, y_2)]\) taking a vector of size two to a vector of size two. Now let’s compose them to get \(\mathbf{g}(x) = [g_1(f_1(x), f_2(x)), g_2(f_1(x), f_2(x))]\). Using the regular chain rule, we can compute the derivative of \(\mathbf{g}\) as the Jacobian
This is the same as multiplying the two jacobians.
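This can be checked numerically with PyTorch's jacobian utility (a sketch with simple hand-picked functions; torch is assumed available, as in the examples later in these notes):
import torch
from torch.autograd.functional import jacobian

def f(x):                      # R -> R^2
    return torch.stack([x**2, torch.sin(x)])

def g(y):                      # R^2 -> R^2
    return torch.stack([y[0] + y[1], y[0] * y[1]])

x = torch.tensor(0.7)
J_composed = jacobian(lambda t: g(f(t)), x)                  # d g(f(x)) / dx directly
J_chain = jacobian(g, f(x)) @ jacobian(f, x).reshape(2, 1)   # product of the two Jacobians
print(J_composed, J_chain.flatten())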
Useful Linear Algebra Identities
(1) Matrix times column vector wrt the column vector
\(\mathbf{z} = \mathbf{W} \mathbf{x}\), what is \(\frac{d\mathbf{z}}{d\mathbf{x}}\)?
Suppose \(\mathbf{W} \in R^{n \times m}\); then we can think of \(\mathbf{z}\) as a function of \(\mathbf{x}\) taking an \(m\)-dimensional vector to an \(n\)-dimensional vector. So, its Jacobian will be \(n \times m\). Note that
An entry \((\frac{d\mathbf{f}}{d\mathbf{x}})_{ij}\) of the Jacobian will be
because \(\frac{d}{d x_j} x_k = 1\) if \(k = j\) and 0 if otherwise. So, we see that \(\frac{d \mathbf{z}}{d \mathbf{x}} = \mathbf{W}\).
(2) Row vector times matrix wrt the row vector
\(\mathbf{z} = \mathbf{x} \mathbf{W}\), what is \(\frac{d \mathbf{z}}{d \mathbf{x}}\)?
A computation similar to (1) shows that \(\frac{d \mathbf{z}}{d \mathbf{x}} = \mathbf{W}^T\).
(3) A vector with itself
\(\mathbf{z} = \mathbf{x}\), what is \(\frac{d \mathbf{z}}{d \mathbf{x}}\)?
We have \(z_i = x_i\), so
So we see that the Jacobian \(\frac{d\mathbf{z}}{d \mathbf{x}}\) is a diagonal matrix where the entry at \((i, i)\) is 1. This is just the identity matrix: \(\frac{d\mathbf{z}}{d \mathbf{x}} = \mathbf{I}\). When applying the chain rule, this term will disappear because a matrix or vector multiplied by the identity matrix does not change.
(4) An elementwise function applied a vector
\(\mathbf{z} = f(\mathbf{x})\), what is \(\frac{d \mathbf{z}}{d \mathbf{x}}\)?
Since \(f\) is being applied elementwise, we have \(z_i = f(x_i)\). So,
So we see that the jacobian \(\frac{d\mathbf{z}}{d\mathbf{x}}\) is a diagonal matrix where the entry at \((i,i)\) is the derivative of \(f\) applied to \(x_i\). We can write this as \(\frac{d \mathbf{z}}{d \mathbf{x}} = diag( f^\prime (\mathbf{x}))\). Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write \(\circ f^\prime (\mathbf{x})\) when applying the chain rule.
(5) Matrix times column vector wrt the matrix
\(\mathbf{z} = \mathbf{W} \mathbf{x}\), \(\mathbf{\delta} = \frac{d J}{d z}\) what is \(\frac{dJ}{d\mathbf{W}} = \frac{d J}{d \mathbf{z}} \frac{d \mathbf{z}}{d \mathbf{W}} = \delta \frac{d \mathbf{z}}{d \mathbf{W}}\)?
Suppose we have a loss function \(J\) (a scalar) and are computing its gradient wrt a matrix \(\mathbf{W} \in R^{n \times m}\). Then we could think of \(J\) as a function of \(\mathbf{W}\) taking \(nm\) inputs (the entries of \(\mathbf{W}\)) to a single output (\(J\)). This means the Jacobian \(\frac{d J}{d \mathbf{W}}\) would be a \(1 \times nm\) vector. But in practice this is not a very useful way of arranging the gradient. It would be much nicer if the derivatives were in an \(n \times m\) matrix like this:
Since this matrix has the same shape as \(\mathbf{W}\), we could just subtract it (times the learning rate) from \(\mathbf{W}\) when doing gradient descent. So (in a slight abuse of notation) let’s find this matrix as \(\frac{d J}{d \mathbf{W}}\) instead.
This way of arranging the gradients becomes complicated when computing \(\frac{d \mathbf{z}}{d \mathbf{W}}\). Unlike \(J\), \(\mathbf{z}\) is a vector. So if we are trying to rearrange the gradients like with \(\frac{d J}{d \mathbf{W}}\), \(\frac{d \mathbf{z}}{d \mathbf{W}}\) would be an \(n \times m \times n\) tensor! Luckily, we can avoid the issue by taking the gradient wrt a single weight \(W_{ij}\) instead. \(\frac{d \mathbf{z}}{d W_{ij}}\) is just a vector, which is much easier to deal with. We have
Note that \(\frac{d}{d W_{ij}} W_{kl} = 1\) if \(i=k\) and \(j=l\) and 0 if otherwise. So if \(k \neq i\) everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when \(l=j\), so we just get \(x_j\). Thus we find \(\frac{d z_k}{d W_{ij}} = x_j\) if \(k = i\) and 0 if otherwise. Another way of writing this is
Here, the \(x_j\) is located in the \(i\)th element.
Now let’s compute \(\frac{d J}{d W_{ij}}\)
The only nonzero term in the sum is \(\delta_i \frac{d z_i}{d W_{ij}}\). To get \(\frac{d J}{d \mathbf{W}}\) we want a matrix where entry \((i, j)\) is \(\delta_i x_j\). This matrix is equal to the outer product \(\frac{d J}{d \mathbf{W}} = \delta^T x^T\).
(6) Row vector times matrix wrt the matrix
\(\mathbf{z} = \mathbf{x} \mathbf{W}\), \(\mathbf{\delta} = \frac{d J}{d \mathbf{z}}\); what is \(\frac{d J}{d \mathbf{W}} = \delta \frac{d \mathbf{z}}{d \mathbf{W}}\)?
A similar computation to the one above shows \(\frac{d J}{d \mathbf{W}} = \mathbf{x}^T \delta\).
(7) Cross-entropy loss wrt logits
\(\mathbf{\hat{y}} = softmax(\mathbf{\theta}), J = CE(\mathbf{y}, \mathbf{\hat{y}})\), what is \(\frac{d J}{d \mathbf{\theta}}\)?
The gradient is \(\frac{d J}{d \mathbf{\theta}} = \mathbf{\hat{y}} - \mathbf{y}\), or \((\mathbf{\hat{y}} - \mathbf{y})^T\) if \(\mathbf{y}\) is a column vector.
These identities will be enough to let you quickly compute the gradients for many neural networks.
Gradient layout
The Jacobian formulation is great for applying the chain rule: simply multiply the Jacobians. However, when doing SGD it's more convenient to follow the convention “the shape of the gradient equals the shape of the parameter” (as done when computing \(\frac{d J}{d \mathbf{W}}\)). That way, subtracting the gradient times the learning rate from the parameters is easy.
Example on 1-layer NN
We compute the gradients of a full neural network with one layer and cross-entropy (CE) loss.
The forward pass is as follows:
It helps to break up the model into the simplest parts possible, so note that we defined \(\mathbf{z}\) and \(\mathbf{\theta}\) to split up the activation functions from the linear transformations in the network’s layers. The dimensions of the model’s parameters are
where \(D_x\) is the size of our input, \(D_h\) is the size of our hidden layer, and \(N_c\) is the number of classes.
In this example, we will compute all of the network’s gradients: \(\frac{d J}{d \mathbf{U}}\), \(\frac{d J}{d \mathbf{b_2}}\), \(\frac{d J}{d \mathbf{W}}\), \(\frac{d J}{d \mathbf{b_1}}\), \(\frac{d J}{d \mathbf{x}}\)
To start with, recall that ReLU(\(x\)) = max(\(x, 0\)). This means
where sgn is the signum function. Note that we are able to write the derivative of the activation in terms of the activation itself.
Now let’s write out the chain rule for \(\frac{d J}{d \mathbf{U}}\) and \(\frac{d J}{d \mathbf{b_2}}\):
Notice that \(\frac{d J}{d \mathbf{\hat{y}}} \frac{d \mathbf{\hat{y}}}{d \mathbf{\theta}} = \frac{d J}{d \mathbf{\theta}}\) is present in both gradients. This makes the math a bit cumbersome. Even worse, if we’re implementing the model without automatic differentiation, computing \(\frac{d J}{d \mathbf{\theta}}\) twice will be inefficient. So it will help us to define some variables to represent the intermediate derivatives:
These are the error signals passed down to \(\mathbf{\theta}\) and \(\mathbf{z}\) when doing backpropagation. We can compute them as follows:
Per cross-entropy loss wrt logits,
Using the chain rule,
Substituting in \(\mathbf{\delta_1}\),
Using matrix times column vector wrt column vector,
Using elementwise function applied to a vector,
These final objects have the following sizes: \(\frac{d J}{d \mathbf{z}}\) is (\(1 \times D_h\)), \(\mathbf{\delta_1}\) is (\(1 \times N_c\)), \(\mathbf{U}\) is (\(N_c \times D_h\)), and \(sgn(\mathbf{h})\) is (\(D_h\), ).
The error terms can be utilized to compute the gradients.
Using the property that the matrix times column vector wrt matrix,
Using identity matrix and transposing
Using the property that the matrix times column vector wrt matrix,
Using identity matrix and transposing
Using matrix times column vector wrt column vector and transposing
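Below is a sketch that implements these formulas for a single input and checks them against PyTorch autograd; the dimensions and data are arbitrary, and only the ReLU network with softmax cross-entropy from the derivation above is assumed:
import torch

torch.manual_seed(0)
Dx, Dh, Nc = 4, 3, 2
x = torch.randn(Dx)
y = torch.tensor([0.0, 1.0])                       # one-hot label
W  = torch.randn(Dh, Dx, requires_grad=True)
b1 = torch.randn(Dh, requires_grad=True)
U  = torch.randn(Nc, Dh, requires_grad=True)
b2 = torch.randn(Nc, requires_grad=True)

# Forward pass
z = W @ x + b1
h = torch.relu(z)
theta = U @ h + b2
yhat = torch.softmax(theta, dim=0)
J = -(y * torch.log(yhat)).sum()
J.backward()

# Manual gradients from the derivation above
delta1 = (yhat - y).detach()                                   # dJ/dtheta
delta2 = (U.detach().T @ delta1) * (z.detach() > 0).float()    # dJ/dz
print(torch.allclose(U.grad, torch.outer(delta1, h.detach()))) # dJ/dU
print(torch.allclose(b2.grad, delta1))                         # dJ/db2
print(torch.allclose(W.grad, torch.outer(delta2, x)))          # dJ/dW
print(torch.allclose(b1.grad, delta2))                         # dJ/db1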
The example below shows the use of backpropagation (gradient descent) on a linear regression problem. As a note, this problem can also be solved (more easily, perhaps) with least-squares linear algebra.
[36]:
from numpy import *
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# y = mx + b
# m is slope, b is y-intercept
def compute_error_for_line_given_points(b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # Calculate gradient with partial derivatives
        m_gradient += -x * (y - (m_current * x + b_current))
        b_gradient += -(y - (m_current * x + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

def gen_data():
    n_samples = 1000
    random_state = 170
    X, y = make_blobs(n_samples=n_samples, random_state=random_state, centers=1)
    # Anisotropicly distributed data to stretch the blobs
    transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
    X_aniso = dot(X, transformation)
    return X_aniso

def run():
    points = gen_data()
    learning_rate = 0.0001
    initial_b = 0  # initial y-intercept guess
    initial_m = 0  # initial slope guess
    num_iterations = 1000
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))
    plt.scatter(points[:,0], points[:,1])
    plt.plot(points[:,0], m*points[:,0] + b)
    plt.show()

run()
Starting gradient descent at b = 0, m = 0, error = 2.059703523518794
Running...
After 1000 iterations b = -3.2643663545390273, m = -1.3386872077957057, error = 0.11444715957515363

Rules of thumb for neural networks
Batch normalization standardizes the activations of a prior layer (or the inputs directly), offering some regularization effect and reducing generalization error, perhaps removing the need for dropout. It can halve the number of training epochs or better. Do not use dropout and batch normalization at the same time.
A rule of thumb for the number of hidden neurons in a neural network is \(N_h = \frac{N_s}{\alpha \times (N_i + N_o)}\), where \(\alpha\) is an arbitrary scaling factor (usually 2 through 10) related to the number of nonzero weights for each neuron (smaller when including dropout layers), \(N_i\) is the number of input neurons, \(N_o\) is the number of output neurons, and \(N_s\) is the number of samples in the training data set.
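A quick sketch of this rule of thumb as a function (the example values are arbitrary):
def hidden_neurons_rule_of_thumb(n_samples, n_in, n_out, alpha=2):
    # N_h = N_s / (alpha * (N_i + N_o))
    return n_samples / (alpha * (n_in + n_out))

print(hidden_neurons_rule_of_thumb(n_samples=10000, n_in=20, n_out=1, alpha=2))   # ~238
print(hidden_neurons_rule_of_thumb(n_samples=10000, n_in=20, n_out=1, alpha=10))  # ~48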
[2]:
# -*- coding: utf-8 -*-
import torch
import math

class LegendrePolynomial3(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For this example, we need
# 4 weights: y = a + b * P3(c + d * x), these weights need to be initialized
# not too far from the correct result to ensure convergence.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

learning_rate = 5e-6
for t in range(2000):
    # To apply our Function, we use Function.apply method. We alias this as 'P3'.
    P3 = LegendrePolynomial3.apply
    # Forward pass: compute predicted y using operations; we compute
    # P3 using our custom autograd operation.
    y_pred = a + b * P3(c + d * x)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())
    # Use autograd to compute the backward pass.
    loss.backward()
    # Update weights using gradient descent
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad
        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')
99 209.95831298828125
199 144.6602020263672
299 100.70250701904297
399 71.03519439697266
499 50.97850799560547
599 37.403133392333984
699 28.206867218017578
799 21.97317886352539
899 17.745729446411133
999 14.877889633178711
1099 12.93176555633545
1199 11.610918045043945
1299 10.714245796203613
1399 10.105476379394531
1499 9.69210433959961
1599 9.411376953125
1699 9.220744132995605
1799 9.091286659240723
1899 9.003360748291016
1999 8.943641662597656
Result: y = 4.691534383205465e-10 + -2.208526849746704 * P3(2.9543581470115043e-10 + 0.2554861009120941 x)
Network science
Graphs \(G = G(A, X)\) contain an adjacency definition and a feature matrix.
Components of complex systems
Complex systems are ones which:
have many interacting components
have no central point of control
Interactions
Interactions occur between nodes and with their environment.
Emergence
Sometimes the system exhibits behavior that we do not expect from any individual component; it emerges from the interactions among all of them.
In simple systems, the properties of the whole can be understood or predicted from the addition or aggregation of its components. If you have knowledge of the microscopic properties of your system, you can predict macroscopic properties of the system.
In complex systems, knowledge of the components alone is not enough to predict the behavior of the whole; the whole is more than the sum of its parts. These systems change their states dynamically, and it is hard to anticipate what will happen.
Self-organization
Complex systems may self-organize to produce non-trivial patterns spontaneously without a blueprint. For example,
local tree growth with random forest fires caused by lightning yield complex spatio-temporal patterns
Adaptation
Complex systems may adapt and evolve; they are often active and respond to their environment. For example,
the immune system continuously learning about pathogens
evolution theory
Interdisciplinary
Complex systems appear in all scientific and professional domains.
Universality means that many systems in different domains display phenomena with common underlying features that can be described using the same scientific models. This is a common idea in transfer learning
Methods
Complexity science involves many variables and configurations, needs advanced mathematical and computational modeling, analysis, and simulations, and uses computers to check if a set of hypothetical rules could lead to a behavior observed in nature. For example,
Agent-based modeling for the flocking of birds
mathematical and computer models of the brain
climate forecasting computer models
computer models of pedestrian dynamics
Two main methods:
advanced mathematical models
computer-based mathematical techniques
Complexity Theory
System Theory: the mother of complexity theory; a framework for investigating any group of objects that work together to produce some result
Chaos Theory: Concerns deterministic systems whose behavior can, in principle, be predicted. They are highly sensitive to initial conditions (butterfly effect). Chaotic systems are predictable for a while and then ‘appear’ to become random.
Adaptive System Theory: organization arises without any central control; agents act on the behavior of others and operate on simple rules. Examples: evolutionary adaptation in biology, robotics
Network Theory: study of graphs as a representation of relations between discrete objects. Studies real-world data. Examples: social networks, gene interactions, pandemics, transportation networks, citation networks
Anatomy
Node degree: the number of links connected to the node
Directed networks define an in-degree (\(k_i^{in}\)) and out-degree (\(k_i^{out}\)) of node \(i\). The total degree is the sum of in- and out-degree. The source is a node with \(k^{in} = 0\). A sink is a node with \(k^{out} = 0\)
Average degree
Undirected:
we define the total number of links as \(L = \frac{1}{2} \sum_{i=1}^N k_i\). We divide by two to not double count.
the average degree is \(<k> = \frac{1}{N} \sum_{i=1}^N k_i = \frac{2L}{N}\)
where \(N\) is the number of nodes in the graph.
Directed:
A node’s total degree is \(k_i = k_i^{in} + k_i^{out}\).
The total number of links is \(L = \sum_{i=1}^N k_i^{in} = \sum_{i=1}^{N} k_i^{out}\)
The average degree is \(<k^{in}> = \frac{1}{N} \sum_{i=1}^N k_i^{in} = <k^{out}> = \frac{1}{N} \sum_{i=1}^N k_i^{out} = \frac{L}{N}\)
Degree Distribution
The degree distribution \(p_k\) is the probability that a randomly chosen node has degree \(k\): \(p_k = \frac{N_k}{N}\),
where \(N_k\) is the number of nodes with degree \(k\).
The average degree is simply the weighted average over this distribution (the expected value of a discrete variable), \(<k> = \sum_k k \, p_k\).
log-log plots are helpful when plotting degree distributions that span a wide range of probabilities (see the sketch after this list).
“hubs” are the highly-connected nodes in a network
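A minimal sketch of these definitions (average degree and the empirical degree distribution \(p_k = N_k/N\)) using plain numpy; the toy edge list is made up for illustration.
[ ]:
import numpy as np
from collections import Counter

# toy undirected edge list (made up for illustration)
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
N = 6
L = len(edges)

# node degrees: each undirected edge contributes to two nodes
k = np.zeros(N, dtype=int)
for i, j in edges:
    k[i] += 1
    k[j] += 1

print("average degree <k> =", k.mean(), "= 2L/N =", 2 * L / N)

# empirical degree distribution p_k = N_k / N
p_k = {deg: count / N for deg, count in sorted(Counter(k).items())}
print("p_k =", p_k)
print("<k> from p_k =", sum(deg * p for deg, p in p_k.items()))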
Bipartite networks
Nodes can be divided into two disjoint sets \(U\) and \(V\) such that each link connects a U-node to a V-node. We can generate two projections for each bipartite network. The first projection connects two \(U\)-nodes if they are linked to the same \(V\)-node. The second projection connects the \(V\)-nodes if they connect to the same \(U\)-node.
This is used in (say) the hollywood actor network, where a movie is connected to an actor if the actor plays in that movie. We can decompose a \(U\) (movie network) and \(V\) (actor network).
The Human Disease Network connects diseases to the genes whose mutations are known to cause or contribute to the corresponding disease. We can decompose it into a \(U\) projection (disease network) and a \(V\) projection (gene network).
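A minimal sketch of the two projections using networkx (assuming it is available); the movie/actor labels are made up for illustration.
[ ]:
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
movies = ["m1", "m2"]
actors = ["a1", "a2", "a3"]
B.add_nodes_from(movies, bipartite=0)
B.add_nodes_from(actors, bipartite=1)
B.add_edges_from([("m1", "a1"), ("m1", "a2"), ("m2", "a2"), ("m2", "a3")])

# U-projection: movies connected if they share an actor
movie_proj = bipartite.projected_graph(B, movies)
# V-projection: actors connected if they appear in the same movie
actor_proj = bipartite.projected_graph(B, actors)

print(list(movie_proj.edges()))
print(list(actor_proj.edges()))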
Shortest path
Never loops, never overlaps.
The shortest path between two nodes can be read off powers of the adjacency matrix: \((A^d)_{ij}\) counts the walks of length \(d\) from \(i\) to \(j\), so the distance is the smallest \(d\) for which this entry is nonzero. This is not computationally efficient, however.
Instead, we can use breadth-first search (BFS), sketched in code below.
Step 1: Start at node i that we label with “0”.
Step 2: Find the nodes directly linked to i, and label them distance “1” and put them in a queue.
Step 3: Take the first node, labeled n, out of the queue (n=1 in the first step), find the unlabeled nodes adjacent to it in the graph, label them with \(n+1\), and put them in the queue.
Step 4: Repeat step 3 until you find the target \(j\) or there are no more nodes in the queue.
Step 5: The distance between \(i\) and \(j\) is the label of \(j\). If \(j\) does not have a label, then \(d_{ij} = \infty\).
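A minimal sketch of this BFS labeling procedure on an adjacency-list representation; the toy graph is made up for illustration.
[ ]:
from collections import deque
import math

def bfs_distance(adj, i, j):
    """Shortest path length from i to j following the labeling steps above."""
    labels = {i: 0}            # step 1: label the start node with 0
    queue = deque([i])
    while queue:               # step 4: repeat until the queue is empty
        node = queue.popleft()
        if node == j:
            return labels[node]
        for nbr in adj[node]:  # steps 2-3: label unlabeled neighbors with n+1
            if nbr not in labels:
                labels[nbr] = labels[node] + 1
                queue.append(nbr)
    return math.inf            # step 5: unreachable target => infinite distance

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3], 5: []}
print(bfs_distance(adj, 0, 4))  # 3
print(bfs_distance(adj, 0, 5))  # inf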
Network Diameter
Network diameter, \(d_{max}\), is the maximum distance between any pair of nodes in the graph.
The average path length is the average number of steps along the shortest paths over all possible pairs of network nodes, \(<d> = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}\),
where \(d_{ij}\) is the distance from node \(i\) to node \(j\).
These are measures of the efficiency of information or mass transport on a network.
Connectedness
Subgraphs of a network appear in the adjacency matrix as \(k \times k\) blocks containing at least one connection (nonzero entry); with a suitable node ordering, the adjacency matrix of a disconnected network is block diagonal, one block per connected component.
Clustering coefficient
The clustering coefficient of a node (in [0,1]) tells you what fraction of the pairs of its neighbors are connected to each other.
from Watts & Strogatz, Nature 1998.
For random networks, the diameter of \(G(n,m)\) is defined as the diameter of a graph \(G\) averaged over the ensemble.
For instance, for the \(G(n,m)\) model, the average number of edges is \(m\) and the mean node degree is \(<k>=2m/n\). However, other properties are hard to obtain analytically.
If, instead, we fix the number of nodes and the probability \(p\) of making each connection, denoted \(G(n,p)\), more properties become analytically tractable.
Question:
\(p(L)\): What is the probability that a random network has exactly \(L\) links?
A network of \(N\) nodes has \(N(N-1)/2\) pairs of nodes. The probability that \(L\) of the attempts to connect these pairs succeed is \(p^L\), and the probability that the remaining \(N(N-1)/2 - L\) attempts have not resulted in a link is \((1-p)^{\frac{N(N-1)}{2}-L}\).
The probability to have exactly \(L\) links in a network of \(N\) nodes with link probability \(p\) is therefore the binomial \(p(L) = \binom{N(N-1)/2}{L} p^L (1-p)^{\frac{N(N-1)}{2}-L}\).
Small World
In random networks, the typical distance between nodes is not proportional to \(N\); it scales like \(\ln N\). As a note, in many real (scale-free) networks it is even smaller, scaling like \(\ln \ln N\).
The expected number of nodes up to distance \(d\) from our starting node is \(N(d) \approx 1 + <k> + <k>^2 + \dots + <k>^d = \frac{<k>^{d+1}-1}{<k>-1}\).
Obviously, the \(N(d)\) must not exceed \(N\). So, \(N(d_{max}) \approx N\). Assuming that \(<k>\) is much greater than \(1\), \(<k>^{d_{max}} \approx N\). Therefore, \(d_{max} \approx \frac{\ln N}{\ln <k>}\). This changes slowly because of the natural logs.
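A quick worked example of this estimate with illustrative numbers (roughly the world's population and an assumed \(<k>\) of 1,000 acquaintances):
[ ]:
import math

N = 7e9        # number of nodes (illustrative)
k_avg = 1000   # average degree (illustrative)
print(math.log(N) / math.log(k_avg))  # d_max ~ 3.3: the small-world effect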
Clustering coefficient
Degree of a node contains no information about the relationship between a node’s neighbors.
We estimate the expected number of links \(L_i\) between the node’s \(k_i\) neighbors as \(<L_i> = p \frac{k_i(k_i-1)}{2}\),
where \(p\) is the probability that two of \(i\)’s neighbors link to each other.
The local clustering coefficient of a random network is therefore \(C_i = \frac{2<L_i>}{k_i(k_i-1)} = p = \frac{<k>}{N}\).
Two predictions (regarding real networks) can be formed:
For fixed \(<k>\), the larger the network, the smaller a node’s clustering coefficient: \(C_i\) decays as \(1/N\). Note that the average clustering coefficient \(<C>\) follows the same formula. This prediction does not hold for real networks, where the measured clustering coefficient is much higher and does not decay with \(N\) in this way.
The local clustering coefficient of a node is independent of the node’s degree (also not what is observed in real networks).
Summary: the random network model does not capture the clustering of real networks (see the numerical check below).
Real networks have a much higher clustering coefficient than expected
Average clustering coefficient of real networks is much higher
In real networks, the number of nodes at distance \(d > <d>\) drops rapidly
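A minimal numerical check of the \(C_i = <k>/N = p\) prediction on a \(G(n,p)\) graph, assuming networkx is available:
[ ]:
import networkx as nx

n, p = 2000, 0.005
G = nx.gnp_random_graph(n, p, seed=0)

k_avg = 2 * G.number_of_edges() / n
print("predicted C ~ <k>/N = p =", p)
print("measured  C            =", nx.average_clustering(G))
print("average degree <k>     =", k_avg)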
Universality
Problem with the small-world picture: it explains why the distance between arbitrary nodes is very small, but the clustering coefficient of real networks cannot be explained by the small-world/random model. So what kind of model can explain a real network?
Scale-free property: the degree distribution follows a power law, so we see hubs.
By fitting a power law to the degree distribution, we can figure out whether the network is scale free.
Algorithm (a rough sketch in code follows below): Step 1: Plot the degree distribution.
Step 2: Measure the degree exponent.
Step 3: Use the estimated exponent \(\gamma\) to characterize the network.
As the value of the degree exponent plays an important role in predicting various network properties, we need tools to fit the \(p_k\) distribution and to estimate \(\gamma\)
A power grid is not scale free because it is a grid: there aren’t huge hubs.
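A rough sketch of Steps 1-3 using a crude log-log linear fit with numpy; in practice a maximum-likelihood fit (e.g. via the powerlaw package) is preferred, and the degree sequence below is synthetic.
[ ]:
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# synthetic power-law-ish degree sequence (illustrative only)
degrees = np.round(rng.pareto(1.5, size=5000) + 1).astype(int)

# Step 1: empirical degree distribution p_k
counts = Counter(degrees)
k = np.array(sorted(counts))
p_k = np.array([counts[v] for v in k]) / len(degrees)

# Steps 2-3: crude estimate of gamma from a linear fit in log-log space
slope, intercept = np.polyfit(np.log(k), np.log(p_k), 1)
print("estimated gamma ~", -slope)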
Barabasi-Albert (BA) Model
Generative Adversarial Networks (GANs)
GANs are trained in a self-supervised process where a discriminator tries to determine whether a data point is real data or one that is generated.
[1]:
from utils import disp
disp('gan_architecture.png')

The GAN uses a value function (a minimax problem) to learn.
We optimize this value function with a gradient-based optimizer (e.g. stochastic gradient descent). The training process is generally structured as follows:
Fix the learning of G
Inner loop on D k times
Take \(m\) data samples & m fake data samples
Update \(\theta_d\) by gradient ascent (because discriminator is trying to maximize the value function)
Fix learning of D
Take m fake data samples
Update \(\theta_g\) by gradient descent
For every \(k\) updates of D, we update G once.
The goal is to make \(p_g\) converge to \(p_{data}\); the claim is that \(p_g = p_{data}\) at the global minimum of the value function \(V(D,G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]\). To prove this,
For fixed G, \(V(D,G)\)
will be maximum for \(D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\).
Fixing D at this maximum and plugging it back into the value function gives \(V(D^*,G) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x)+p_g(x)}\right]\).
From here, we want to show that the pdf of G is the same as the pdf of the data. To measure the difference between two distributions, we can use the Jensen-Shannon (JS) divergence, \(JSD(p_{data} \| p_g) = \frac{1}{2} KL\left(p_{data} \,\middle\|\, \tfrac{p_{data}+p_g}{2}\right) + \frac{1}{2} KL\left(p_g \,\middle\|\, \tfrac{p_{data}+p_g}{2}\right)\).
This is close to the previous expression; with some rearrangement, \(V(D^*,G) = -2\ln 2 + 2\, JSD(p_{data} \| p_g)\).
What is the minimum value of this expression? The JS divergence cannot be negative and is zero only when \(p_{data} = p_g\). At that point the minimum is \(\min_G V = -2\ln 2\).
Once the generator is trained, the discriminator will not be able to tell which data point is original and which is generated. So, the discriminator will output 0.5 for every input (theoretically).
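A minimal numeric illustration of the JS-divergence property used above (non-negative, and zero only when the two distributions coincide), computed directly with numpy:
[ ]:
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.2, 0.5, 0.3])
p_g_far = np.array([0.6, 0.1, 0.3])
print(js(p_data, p_g_far))   # > 0
print(js(p_data, p_data))    # 0.0: the minimum is reached when p_g = p_data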
Thought experiment: GANs for explainable AI?
GANs should be helpful for explainable AI.
A standard implementation:
[3]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torch.optim as optim
import torchvision.datasets as datasets
import imageio
import numpy as np
import matplotlib
from torchvision.utils import make_grid, save_image
from torch.utils.data import DataLoader
from matplotlib import pyplot as plt
from tqdm import tqdm

matplotlib.style.use('ggplot')

# learning parameters
batch_size = 512
epochs = 200
sample_size = 64  # fixed sample size
nz = 128          # latent vector size
k = 1             # number of steps to apply to the discriminator
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,),(0.5,)),
])
to_pil_image = transforms.ToPILImage()

train_data = datasets.FashionMNIST(
    root='input/data',
    train=True,
    download=True,
    transform=transform
)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

class Generator(nn.Module):
    def __init__(self, nz):
        super(Generator, self).__init__()
        self.nz = nz
        self.main = nn.Sequential(
            nn.Linear(self.nz, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, 784),
            nn.Tanh(),
        )
    def forward(self, x):
        return self.main(x).view(-1, 1, 28, 28)

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.n_input = 784
        self.main = nn.Sequential(
            nn.Linear(self.n_input, 1024),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        x = x.view(-1, 784)
        return self.main(x)

generator = Generator(nz).to(device)
discriminator = Discriminator().to(device)

print('##### GENERATOR #####')
print(generator)
print('######################')
print('\n##### DISCRIMINATOR #####')
print(discriminator)
print('######################')

# optimizers
optim_g = optim.Adam(generator.parameters(), lr=0.0002)
optim_d = optim.Adam(discriminator.parameters(), lr=0.0002)

# loss function
criterion = nn.BCELoss()

losses_g = []  # to store generator loss after each epoch
losses_d = []  # to store discriminator loss after each epoch
images = []    # to store images generated by the generator

# to create real labels (1s)
def label_real(size):
    data = torch.ones(size, 1)
    return data.to(device)

# to create fake labels (0s)
def label_fake(size):
    data = torch.zeros(size, 1)
    return data.to(device)

# function to create the noise vector
def create_noise(sample_size, nz):
    return torch.randn(sample_size, nz).to(device)

# to save the images generated by the generator
def save_generator_image(image, path):
    save_image(image, path)

# function to train the discriminator network
def train_discriminator(optimizer, data_real, data_fake):
    b_size = data_real.size(0)
    real_label = label_real(b_size)
    fake_label = label_fake(b_size)
    optimizer.zero_grad()
    output_real = discriminator(data_real)
    loss_real = criterion(output_real, real_label)
    output_fake = discriminator(data_fake)
    loss_fake = criterion(output_fake, fake_label)
    loss_real.backward()
    loss_fake.backward()
    optimizer.step()
    return loss_real + loss_fake

# function to train the generator network
def train_generator(optimizer, data_fake):
    b_size = data_fake.size(0)
    real_label = label_real(b_size)
    optimizer.zero_grad()
    output = discriminator(data_fake)
    loss = criterion(output, real_label)
    loss.backward()
    optimizer.step()
    return loss

# create the noise vector
noise = create_noise(sample_size, nz)

generator.train()
discriminator.train()

for epoch in range(epochs):
    loss_g = 0.0
    loss_d = 0.0
    for bi, data in tqdm(enumerate(train_loader), total=int(len(train_data)/train_loader.batch_size)):
        image, _ = data
        image = image.to(device)
        b_size = len(image)
        # run the discriminator for k number of steps
        for step in range(k):
            data_fake = generator(create_noise(b_size, nz)).detach()
            data_real = image
            # train the discriminator network
            loss_d += train_discriminator(optim_d, data_real, data_fake)
        data_fake = generator(create_noise(b_size, nz))
        # train the generator network
        loss_g += train_generator(optim_g, data_fake)
    # create the final fake image for the epoch
    generated_img = generator(noise).cpu().detach()
    # make the images as grid
    generated_img = make_grid(generated_img)
    # save the generated torch tensor models to disk
    save_generator_image(generated_img, f"_output/gen_img{epoch}.png")
    images.append(generated_img)
    epoch_loss_g = loss_g / bi  # total generator loss for the epoch
    epoch_loss_d = loss_d / bi  # total discriminator loss for the epoch
    losses_g.append(epoch_loss_g)
    losses_d.append(epoch_loss_d)
    print(f"Epoch {epoch} of {epochs}")
    print(f"Generator loss: {epoch_loss_g:.8f}, Discriminator loss: {epoch_loss_d:.8f}")

print('DONE TRAINING')
torch.save(generator.state_dict(), '_output/generator.pth')

# save the generated images as GIF file
imgs = [np.array(to_pil_image(img)) for img in images]
imageio.mimsave('_output/generator_images.gif', imgs)

# plot and save the generator and discriminator loss
plt.figure()
plt.plot(losses_g, label='Generator loss')
plt.plot(losses_d, label='Discriminator Loss')
plt.legend()
plt.savefig('_output/loss.png')
##### GENERATOR #####
Generator(
(main): Sequential(
(0): Linear(in_features=128, out_features=256, bias=True)
(1): LeakyReLU(negative_slope=0.2)
(2): Linear(in_features=256, out_features=512, bias=True)
(3): LeakyReLU(negative_slope=0.2)
(4): Linear(in_features=512, out_features=1024, bias=True)
(5): LeakyReLU(negative_slope=0.2)
(6): Linear(in_features=1024, out_features=784, bias=True)
(7): Tanh()
)
)
######################
##### DISCRIMINATOR #####
Discriminator(
(main): Sequential(
(0): Linear(in_features=784, out_features=1024, bias=True)
(1): LeakyReLU(negative_slope=0.2)
(2): Dropout(p=0.3, inplace=False)
(3): Linear(in_features=1024, out_features=512, bias=True)
(4): LeakyReLU(negative_slope=0.2)
(5): Dropout(p=0.3, inplace=False)
(6): Linear(in_features=512, out_features=256, bias=True)
(7): LeakyReLU(negative_slope=0.2)
(8): Dropout(p=0.3, inplace=False)
(9): Linear(in_features=256, out_features=1, bias=True)
(10): Sigmoid()
)
)
######################
118it [00:10, 12.12it/s]
Epoch 0 of 200
Generator loss: 1.42679989, Discriminator loss: 0.98298115
118it [00:10, 10.82it/s]
Epoch 1 of 200
Generator loss: 1.62489319, Discriminator loss: 0.70377380
118it [00:10, 12.19it/s]
Epoch 2 of 200
Generator loss: 3.10378194, Discriminator loss: 0.70644265
118it [00:10, 12.16it/s]
Epoch 3 of 200
Generator loss: 4.92797756, Discriminator loss: 1.26403916
118it [00:10, 11.90it/s]
Epoch 4 of 200
Generator loss: 3.85051537, Discriminator loss: 0.34978214
118it [00:10, 10.73it/s]
Epoch 5 of 200
Generator loss: 4.61972666, Discriminator loss: 0.26378199
118it [00:10, 10.79it/s]
Epoch 6 of 200
Generator loss: 4.38954210, Discriminator loss: 0.36163151
118it [00:10, 12.21it/s]
Epoch 7 of 200
Generator loss: 4.36764336, Discriminator loss: 0.45374233
118it [00:10, 12.28it/s]
Epoch 8 of 200
Generator loss: 3.98004150, Discriminator loss: 0.61882484
118it [00:10, 12.31it/s]
Epoch 9 of 200
Generator loss: 2.88338304, Discriminator loss: 0.70212597
118it [00:10, 12.20it/s]
Epoch 10 of 200
Generator loss: 3.22872758, Discriminator loss: 0.73741198
118it [00:10, 10.97it/s]
Epoch 11 of 200
Generator loss: 3.26948786, Discriminator loss: 0.62486356
118it [00:10, 10.92it/s]
Epoch 12 of 200
Generator loss: 3.41106462, Discriminator loss: 0.41624784
118it [00:10, 12.09it/s]
Epoch 13 of 200
Generator loss: 3.96793866, Discriminator loss: 0.48023599
118it [00:10, 10.98it/s]
Epoch 14 of 200
Generator loss: 3.39452696, Discriminator loss: 0.40835723
118it [00:10, 11.98it/s]
Epoch 15 of 200
Generator loss: 3.72954559, Discriminator loss: 0.32693329
118it [00:10, 12.20it/s]
Epoch 16 of 200
Generator loss: 4.63171339, Discriminator loss: 0.35380819
118it [00:10, 10.95it/s]
Epoch 17 of 200
Generator loss: 3.63302135, Discriminator loss: 0.40510964
118it [00:10, 12.04it/s]
Epoch 18 of 200
Generator loss: 3.70032191, Discriminator loss: 0.36745089
118it [00:10, 12.15it/s]
Epoch 19 of 200
Generator loss: 3.79039121, Discriminator loss: 0.47844118
118it [00:10, 11.89it/s]
Epoch 20 of 200
Generator loss: 3.54762769, Discriminator loss: 0.30928752
118it [00:10, 12.22it/s]
Epoch 21 of 200
Generator loss: 3.71710515, Discriminator loss: 0.38970572
118it [00:10, 12.20it/s]
Epoch 22 of 200
Generator loss: 3.31698275, Discriminator loss: 0.52958453
118it [00:11, 10.69it/s]
Epoch 23 of 200
Generator loss: 3.57484317, Discriminator loss: 0.55633307
118it [00:11, 10.55it/s]
Epoch 24 of 200
Generator loss: 3.25523281, Discriminator loss: 0.51782972
118it [00:10, 12.00it/s]
Epoch 25 of 200
Generator loss: 3.18583012, Discriminator loss: 0.47012895
118it [00:10, 12.14it/s]
Epoch 26 of 200
Generator loss: 3.19335461, Discriminator loss: 0.51145601
118it [00:10, 12.10it/s]
Epoch 27 of 200
Generator loss: 3.21174383, Discriminator loss: 0.56492382
118it [00:10, 12.13it/s]
Epoch 28 of 200
Generator loss: 3.45740318, Discriminator loss: 0.48976579
118it [00:10, 12.14it/s]
Epoch 29 of 200
Generator loss: 3.05499339, Discriminator loss: 0.57028377
118it [00:10, 11.98it/s]
Epoch 30 of 200
Generator loss: 2.88134313, Discriminator loss: 0.55931473
118it [00:10, 12.14it/s]
Epoch 31 of 200
Generator loss: 3.24226332, Discriminator loss: 0.51059788
118it [00:10, 12.09it/s]
Epoch 32 of 200
Generator loss: 2.86517239, Discriminator loss: 0.63791561
118it [00:10, 12.02it/s]
Epoch 33 of 200
Generator loss: 2.45503569, Discriminator loss: 0.68121541
118it [00:10, 12.02it/s]
Epoch 34 of 200
Generator loss: 2.83101821, Discriminator loss: 0.51598758
118it [00:10, 10.88it/s]
Epoch 35 of 200
Generator loss: 3.05535507, Discriminator loss: 0.49450979
118it [00:10, 11.66it/s]
Epoch 36 of 200
Generator loss: 3.04819036, Discriminator loss: 0.50154257
118it [00:10, 10.93it/s]
Epoch 37 of 200
Generator loss: 3.18599129, Discriminator loss: 0.46228018
118it [00:10, 12.20it/s]
Epoch 38 of 200
Generator loss: 3.08141351, Discriminator loss: 0.50581169
118it [00:10, 12.07it/s]
Epoch 39 of 200
Generator loss: 2.89046168, Discriminator loss: 0.54147011
118it [00:10, 11.00it/s]
Epoch 40 of 200
Generator loss: 3.11255932, Discriminator loss: 0.54231316
118it [00:10, 12.05it/s]
Epoch 41 of 200
Generator loss: 2.69802594, Discriminator loss: 0.55339050
118it [00:10, 11.79it/s]
Epoch 42 of 200
Generator loss: 3.06265306, Discriminator loss: 0.51745349
118it [00:10, 10.87it/s]
Epoch 43 of 200
Generator loss: 2.89635634, Discriminator loss: 0.49023831
118it [00:10, 12.13it/s]
Epoch 44 of 200
Generator loss: 2.85235381, Discriminator loss: 0.54068846
118it [00:10, 12.14it/s]
Epoch 45 of 200
Generator loss: 2.77114344, Discriminator loss: 0.57422400
118it [00:10, 12.08it/s]
Epoch 46 of 200
Generator loss: 2.94623184, Discriminator loss: 0.50683928
118it [00:10, 12.12it/s]
Epoch 47 of 200
Generator loss: 2.51394677, Discriminator loss: 0.66653192
118it [00:10, 12.05it/s]
Epoch 48 of 200
Generator loss: 2.32383966, Discriminator loss: 0.68538570
118it [00:11, 12.04it/s]
Epoch 49 of 200
Generator loss: 2.57492971, Discriminator loss: 0.61851585
118it [00:10, 12.13it/s]
Epoch 50 of 200
Generator loss: 2.47522569, Discriminator loss: 0.60982305
118it [00:10, 12.16it/s]
Epoch 51 of 200
Generator loss: 2.36268973, Discriminator loss: 0.64775735
118it [00:10, 12.05it/s]
Epoch 52 of 200
Generator loss: 2.54246497, Discriminator loss: 0.58563745
118it [00:10, 10.92it/s]
Epoch 53 of 200
Generator loss: 2.59760189, Discriminator loss: 0.61331671
118it [00:10, 12.11it/s]
Epoch 54 of 200
Generator loss: 2.66271830, Discriminator loss: 0.60149300
118it [00:10, 12.17it/s]
Epoch 55 of 200
Generator loss: 2.57131219, Discriminator loss: 0.62431341
118it [00:10, 11.98it/s]
Epoch 56 of 200
Generator loss: 2.45217991, Discriminator loss: 0.64607465
118it [00:10, 10.92it/s]
Epoch 57 of 200
Generator loss: 2.45755219, Discriminator loss: 0.67585933
118it [00:10, 12.11it/s]
Epoch 58 of 200
Generator loss: 2.52498460, Discriminator loss: 0.62965387
118it [00:10, 10.88it/s]
Epoch 59 of 200
Generator loss: 2.40800285, Discriminator loss: 0.66435039
118it [00:10, 10.87it/s]
Epoch 60 of 200
Generator loss: 2.32626867, Discriminator loss: 0.70449632
118it [00:10, 11.75it/s]
Epoch 61 of 200
Generator loss: 2.38575864, Discriminator loss: 0.67307276
118it [00:10, 12.06it/s]
Epoch 62 of 200
Generator loss: 2.46737528, Discriminator loss: 0.61570835
118it [00:10, 12.14it/s]
Epoch 63 of 200
Generator loss: 2.42095017, Discriminator loss: 0.70034283
118it [00:10, 11.89it/s]
Epoch 64 of 200
Generator loss: 2.31449914, Discriminator loss: 0.65447074
118it [00:10, 12.01it/s]
Epoch 65 of 200
Generator loss: 2.39653182, Discriminator loss: 0.60882252
118it [00:10, 12.02it/s]
Epoch 66 of 200
Generator loss: 2.37150860, Discriminator loss: 0.68175209
118it [00:10, 11.57it/s]
Epoch 67 of 200
Generator loss: 2.28956985, Discriminator loss: 0.69170892
118it [00:10, 12.00it/s]
Epoch 68 of 200
Generator loss: 2.57752275, Discriminator loss: 0.64327300
118it [00:10, 12.18it/s]
Epoch 69 of 200
Generator loss: 2.37329602, Discriminator loss: 0.71189284
118it [00:10, 12.09it/s]
Epoch 70 of 200
Generator loss: 2.16448283, Discriminator loss: 0.69848782
118it [00:10, 11.93it/s]
Epoch 71 of 200
Generator loss: 2.34472179, Discriminator loss: 0.67156476
118it [00:10, 12.10it/s]
Epoch 72 of 200
Generator loss: 2.17551994, Discriminator loss: 0.69883925
118it [00:10, 12.05it/s]
Epoch 73 of 200
Generator loss: 2.17170811, Discriminator loss: 0.70166582
118it [00:10, 12.08it/s]
Epoch 74 of 200
Generator loss: 2.05627131, Discriminator loss: 0.76419610
118it [00:10, 11.90it/s]
Epoch 75 of 200
Generator loss: 2.27301955, Discriminator loss: 0.66036499
118it [00:10, 11.71it/s]
Epoch 76 of 200
Generator loss: 2.28018999, Discriminator loss: 0.69041985
118it [00:10, 12.02it/s]
Epoch 77 of 200
Generator loss: 2.20005488, Discriminator loss: 0.70438713
118it [00:10, 10.90it/s]
Epoch 78 of 200
Generator loss: 2.27268672, Discriminator loss: 0.65581048
118it [00:10, 10.79it/s]
Epoch 79 of 200
Generator loss: 2.36812997, Discriminator loss: 0.68771487
118it [00:11, 11.57it/s]
Epoch 80 of 200
Generator loss: 2.19527173, Discriminator loss: 0.69965822
118it [00:10, 11.87it/s]
Epoch 81 of 200
Generator loss: 2.38353372, Discriminator loss: 0.67542112
118it [00:10, 10.83it/s]
Epoch 82 of 200
Generator loss: 2.00821018, Discriminator loss: 0.77538747
118it [00:10, 11.75it/s]
Epoch 83 of 200
Generator loss: 2.06199837, Discriminator loss: 0.72377205
118it [00:10, 10.89it/s]
Epoch 84 of 200
Generator loss: 2.10705400, Discriminator loss: 0.73932397
118it [00:10, 12.01it/s]
Epoch 85 of 200
Generator loss: 2.16597724, Discriminator loss: 0.72270906
118it [00:10, 12.11it/s]
Epoch 86 of 200
Generator loss: 2.15547657, Discriminator loss: 0.72369409
118it [00:10, 12.07it/s]
Epoch 87 of 200
Generator loss: 1.98006141, Discriminator loss: 0.77487230
118it [00:10, 12.10it/s]
Epoch 88 of 200
Generator loss: 2.17026782, Discriminator loss: 0.71066028
118it [00:10, 10.85it/s]
Epoch 89 of 200
Generator loss: 2.10151815, Discriminator loss: 0.71836138
118it [00:10, 12.10it/s]
Epoch 90 of 200
Generator loss: 2.15137696, Discriminator loss: 0.70107192
118it [00:10, 10.76it/s]
Epoch 91 of 200
Generator loss: 2.13201833, Discriminator loss: 0.71878541
118it [00:10, 12.08it/s]
Epoch 92 of 200
Generator loss: 2.12932873, Discriminator loss: 0.69973910
118it [00:10, 12.03it/s]
Epoch 93 of 200
Generator loss: 2.10871553, Discriminator loss: 0.74259883
118it [00:10, 12.06it/s]
Epoch 94 of 200
Generator loss: 2.17168546, Discriminator loss: 0.69877094
118it [00:11, 11.87it/s]
Epoch 95 of 200
Generator loss: 2.05657387, Discriminator loss: 0.76476651
118it [00:11, 11.83it/s]
Epoch 96 of 200
Generator loss: 1.95592308, Discriminator loss: 0.76803041
118it [00:11, 11.70it/s]
Epoch 97 of 200
Generator loss: 2.10769629, Discriminator loss: 0.70019478
118it [00:10, 12.07it/s]
Epoch 98 of 200
Generator loss: 1.97890449, Discriminator loss: 0.73829842
118it [00:10, 12.10it/s]
Epoch 99 of 200
Generator loss: 1.90354776, Discriminator loss: 0.80168343
118it [00:10, 12.09it/s]
Epoch 100 of 200
Generator loss: 1.97588968, Discriminator loss: 0.78837484
118it [00:10, 10.86it/s]
Epoch 101 of 200
Generator loss: 1.93798590, Discriminator loss: 0.80875230
118it [00:10, 12.08it/s]
Epoch 102 of 200
Generator loss: 1.96531487, Discriminator loss: 0.78535426
118it [00:10, 10.75it/s]
Epoch 103 of 200
Generator loss: 1.86302114, Discriminator loss: 0.76587468
118it [00:10, 10.80it/s]
Epoch 104 of 200
Generator loss: 1.93535113, Discriminator loss: 0.77565098
118it [00:10, 11.95it/s]
Epoch 105 of 200
Generator loss: 1.99481142, Discriminator loss: 0.74644858
118it [00:10, 12.06it/s]
Epoch 106 of 200
Generator loss: 1.92222798, Discriminator loss: 0.75295651
118it [00:10, 11.93it/s]
Epoch 107 of 200
Generator loss: 1.98527122, Discriminator loss: 0.76719052
118it [00:10, 12.05it/s]
Epoch 108 of 200
Generator loss: 1.88296998, Discriminator loss: 0.75847024
118it [00:10, 12.18it/s]
Epoch 109 of 200
Generator loss: 1.91546285, Discriminator loss: 0.79423177
118it [00:10, 12.04it/s]
Epoch 110 of 200
Generator loss: 2.05347419, Discriminator loss: 0.78333181
118it [00:10, 11.95it/s]
Epoch 111 of 200
Generator loss: 1.98762047, Discriminator loss: 0.74812883
118it [00:10, 11.90it/s]
Epoch 112 of 200
Generator loss: 2.00333953, Discriminator loss: 0.75139976
118it [00:10, 10.78it/s]
Epoch 113 of 200
Generator loss: 1.93170142, Discriminator loss: 0.73702711
118it [00:10, 12.06it/s]
Epoch 114 of 200
Generator loss: 1.93748260, Discriminator loss: 0.77334636
118it [00:11, 12.00it/s]
Epoch 115 of 200
Generator loss: 1.93474305, Discriminator loss: 0.76822644
118it [00:10, 10.77it/s]
Epoch 116 of 200
Generator loss: 1.93480587, Discriminator loss: 0.78705817
118it [00:10, 11.94it/s]
Epoch 117 of 200
Generator loss: 1.81781912, Discriminator loss: 0.80566841
118it [00:10, 12.09it/s]
Epoch 118 of 200
Generator loss: 1.90504992, Discriminator loss: 0.74386704
118it [00:10, 11.88it/s]
Epoch 119 of 200
Generator loss: 1.86636162, Discriminator loss: 0.76808059
118it [00:10, 12.02it/s]
Epoch 120 of 200
Generator loss: 1.93788433, Discriminator loss: 0.75187492
118it [00:10, 11.95it/s]
Epoch 121 of 200
Generator loss: 1.75950122, Discriminator loss: 0.80740660
118it [00:11, 12.00it/s]
Epoch 122 of 200
Generator loss: 1.73203480, Discriminator loss: 0.84196305
118it [00:10, 11.94it/s]
Epoch 123 of 200
Generator loss: 1.83940530, Discriminator loss: 0.80670452
118it [00:10, 11.90it/s]
Epoch 124 of 200
Generator loss: 1.80591881, Discriminator loss: 0.78669566
118it [00:10, 11.86it/s]
Epoch 125 of 200
Generator loss: 1.84839618, Discriminator loss: 0.77231079
118it [00:11, 10.45it/s]
Epoch 126 of 200
Generator loss: 1.82202017, Discriminator loss: 0.80033922
118it [00:11, 12.02it/s]
Epoch 127 of 200
Generator loss: 1.82240641, Discriminator loss: 0.80433965
118it [00:10, 10.82it/s]
Epoch 128 of 200
Generator loss: 1.78250694, Discriminator loss: 0.78977102
118it [00:10, 11.96it/s]
Epoch 129 of 200
Generator loss: 1.73686993, Discriminator loss: 0.83117014
118it [00:10, 10.81it/s]
Epoch 130 of 200
Generator loss: 1.69894624, Discriminator loss: 0.83944160
118it [00:10, 11.58it/s]
Epoch 131 of 200
Generator loss: 1.70522535, Discriminator loss: 0.83726037
118it [00:10, 12.03it/s]
Epoch 132 of 200
Generator loss: 1.63788044, Discriminator loss: 0.82422584
118it [00:11, 10.69it/s]
Epoch 133 of 200
Generator loss: 1.68943703, Discriminator loss: 0.79931623
118it [00:10, 12.04it/s]
Epoch 134 of 200
Generator loss: 1.77729273, Discriminator loss: 0.82075375
118it [00:10, 11.92it/s]
Epoch 135 of 200
Generator loss: 1.68428946, Discriminator loss: 0.82309252
118it [00:10, 11.99it/s]
Epoch 136 of 200
Generator loss: 1.67224669, Discriminator loss: 0.82916367
118it [00:10, 12.03it/s]
Epoch 137 of 200
Generator loss: 1.72521901, Discriminator loss: 0.83404922
118it [00:10, 10.75it/s]
Epoch 138 of 200
Generator loss: 1.78916895, Discriminator loss: 0.80936950
118it [00:11, 11.95it/s]
Epoch 139 of 200
Generator loss: 1.75150931, Discriminator loss: 0.83272260
118it [00:11, 11.77it/s]
Epoch 140 of 200
Generator loss: 1.76419652, Discriminator loss: 0.81509191
118it [00:11, 11.38it/s]
Epoch 141 of 200
Generator loss: 1.77207625, Discriminator loss: 0.80571991
118it [00:10, 12.01it/s]
Epoch 142 of 200
Generator loss: 1.68275905, Discriminator loss: 0.82235527
118it [00:10, 10.74it/s]
Epoch 143 of 200
Generator loss: 1.76718843, Discriminator loss: 0.80837804
118it [00:11, 11.75it/s]
Epoch 144 of 200
Generator loss: 1.65909672, Discriminator loss: 0.85105503
118it [00:11, 10.62it/s]
Epoch 145 of 200
Generator loss: 1.71078217, Discriminator loss: 0.80839849
118it [00:11, 12.06it/s]
Epoch 146 of 200
Generator loss: 1.73700798, Discriminator loss: 0.83374196
118it [00:11, 10.64it/s]
Epoch 147 of 200
Generator loss: 1.56826782, Discriminator loss: 0.86148900
118it [00:10, 12.03it/s]
Epoch 148 of 200
Generator loss: 1.71385252, Discriminator loss: 0.83115745
118it [00:11, 10.63it/s]
Epoch 149 of 200
Generator loss: 1.70669937, Discriminator loss: 0.84008586
118it [00:11, 10.67it/s]
Epoch 150 of 200
Generator loss: 1.76197851, Discriminator loss: 0.81925625
118it [00:11, 10.39it/s]
Epoch 151 of 200
Generator loss: 1.68872285, Discriminator loss: 0.86071086
118it [00:11, 10.43it/s]
Epoch 152 of 200
Generator loss: 1.68789732, Discriminator loss: 0.83586329
118it [00:11, 10.33it/s]
Epoch 153 of 200
Generator loss: 1.63197529, Discriminator loss: 0.87918919
118it [00:11, 11.87it/s]
Epoch 154 of 200
Generator loss: 1.54937506, Discriminator loss: 0.88343716
118it [00:11, 11.64it/s]
Epoch 155 of 200
Generator loss: 1.57689917, Discriminator loss: 0.92167449
118it [00:11, 11.15it/s]
Epoch 156 of 200
Generator loss: 1.75131404, Discriminator loss: 0.82436347
118it [00:11, 10.36it/s]
Epoch 157 of 200
Generator loss: 1.68738174, Discriminator loss: 0.82771856
118it [00:11, 11.69it/s]
Epoch 158 of 200
Generator loss: 1.65384710, Discriminator loss: 0.85041094
118it [00:11, 11.69it/s]
Epoch 159 of 200
Generator loss: 1.63523507, Discriminator loss: 0.84368449
118it [00:11, 10.67it/s]
Epoch 160 of 200
Generator loss: 1.59416497, Discriminator loss: 0.88509560
118it [00:11, 10.59it/s]
Epoch 161 of 200
Generator loss: 1.61184669, Discriminator loss: 0.88112897
118it [00:11, 10.67it/s]
Epoch 162 of 200
Generator loss: 1.57819831, Discriminator loss: 0.87391430
118it [00:11, 10.63it/s]
Epoch 163 of 200
Generator loss: 1.60059571, Discriminator loss: 0.88005883
118it [00:11, 10.56it/s]
Epoch 164 of 200
Generator loss: 1.64797294, Discriminator loss: 0.84477597
118it [00:11, 11.56it/s]
Epoch 165 of 200
Generator loss: 1.63212502, Discriminator loss: 0.89352602
118it [00:11, 10.56it/s]
Epoch 166 of 200
Generator loss: 1.62318993, Discriminator loss: 0.86100531
118it [00:11, 10.66it/s]
Epoch 167 of 200
Generator loss: 1.69759154, Discriminator loss: 0.86181867
118it [00:11, 11.73it/s]
Epoch 168 of 200
Generator loss: 1.65174174, Discriminator loss: 0.87229192
118it [00:11, 10.44it/s]
Epoch 169 of 200
Generator loss: 1.70960736, Discriminator loss: 0.83075064
118it [00:11, 11.86it/s]
Epoch 170 of 200
Generator loss: 1.53380632, Discriminator loss: 0.88815320
118it [00:11, 10.55it/s]
Epoch 171 of 200
Generator loss: 1.61525106, Discriminator loss: 0.86920583
118it [00:11, 10.57it/s]
Epoch 172 of 200
Generator loss: 1.74724710, Discriminator loss: 0.83650088
118it [00:11, 12.01it/s]
Epoch 173 of 200
Generator loss: 1.56013346, Discriminator loss: 0.90197140
118it [00:11, 11.50it/s]
Epoch 174 of 200
Generator loss: 1.71457875, Discriminator loss: 0.83909225
118it [00:11, 11.84it/s]
Epoch 175 of 200
Generator loss: 1.70677531, Discriminator loss: 0.83399868
118it [00:11, 10.72it/s]
Epoch 176 of 200
Generator loss: 1.60876012, Discriminator loss: 0.86906475
118it [00:11, 11.86it/s]
Epoch 177 of 200
Generator loss: 1.47320735, Discriminator loss: 0.90702981
118it [00:11, 10.49it/s]
Epoch 178 of 200
Generator loss: 1.71973288, Discriminator loss: 0.79212767
118it [00:11, 10.58it/s]
Epoch 179 of 200
Generator loss: 1.60410821, Discriminator loss: 0.87754571
118it [00:11, 10.28it/s]
Epoch 180 of 200
Generator loss: 1.62787437, Discriminator loss: 0.83345455
118it [00:11, 11.79it/s]
Epoch 181 of 200
Generator loss: 1.67670560, Discriminator loss: 0.88268971
118it [00:11, 11.76it/s]
Epoch 182 of 200
Generator loss: 1.55184007, Discriminator loss: 0.87109929
118it [00:11, 11.36it/s]
Epoch 183 of 200
Generator loss: 1.63959110, Discriminator loss: 0.87113965
118it [00:11, 11.79it/s]
Epoch 184 of 200
Generator loss: 1.59965444, Discriminator loss: 0.87320077
118it [00:11, 10.61it/s]
Epoch 185 of 200
Generator loss: 1.62461567, Discriminator loss: 0.85745209
118it [00:11, 10.50it/s]
Epoch 186 of 200
Generator loss: 1.61729789, Discriminator loss: 0.85247296
118it [00:11, 10.44it/s]
Epoch 187 of 200
Generator loss: 1.58122087, Discriminator loss: 0.86318010
118it [00:11, 10.55it/s]
Epoch 188 of 200
Generator loss: 1.51833153, Discriminator loss: 0.92227668
118it [00:11, 10.58it/s]
Epoch 189 of 200
Generator loss: 1.59854269, Discriminator loss: 0.88623565
118it [00:11, 11.85it/s]
Epoch 190 of 200
Generator loss: 1.60617423, Discriminator loss: 0.88586754
118it [00:11, 11.25it/s]
Epoch 191 of 200
Generator loss: 1.59582424, Discriminator loss: 0.88292956
118it [00:11, 10.47it/s]
Epoch 192 of 200
Generator loss: 1.59524691, Discriminator loss: 0.89096653
118it [00:11, 11.85it/s]
Epoch 193 of 200
Generator loss: 1.58404517, Discriminator loss: 0.88426113
118it [00:11, 11.95it/s]
Epoch 194 of 200
Generator loss: 1.56883609, Discriminator loss: 0.88825554
118it [00:11, 11.86it/s]
Epoch 195 of 200
Generator loss: 1.66106474, Discriminator loss: 0.85098940
118it [00:11, 11.81it/s]
Epoch 196 of 200
Generator loss: 1.66241097, Discriminator loss: 0.87594551
118it [00:11, 11.82it/s]
Epoch 197 of 200
Generator loss: 1.50948346, Discriminator loss: 0.94106507
118it [00:11, 10.48it/s]
Epoch 198 of 200
Generator loss: 1.62702358, Discriminator loss: 0.85632008
118it [00:11, 11.84it/s]
Epoch 199 of 200
Generator loss: 1.69493830, Discriminator loss: 0.84410864
DONE TRAINING

Explainable AI
There exist a few categories of explainable AI: post-hoc, intrinsic, and distillation.
The post-hoc paradigm usually provides a heat map highlighting the regions that were important for a decision (e.g. [31, 30]); the heat map is computed alongside, rather than within, the forward pass of the model.
The intrinsic paradigm explores the important pieces of information within the forward pass of the model itself, e.g. as attention maps.
Distillation tries to rebuild the neural network into a transparent model.
Summaries
ML & DL models
Category | Model | Description | When to use | Extending Concepts
---|---|---|---|---
Ensemble Learning | Gradient Boosting | Builds trees one at a time! At each stage \(m\) (\(1 \leq m \leq M\), of \(M\) total stages) of gradient boosting, suppose some imperfect model \(F_{m}\) (for low \(m\), this model may simply return \(\hat{y}_{i}=\bar{y}\)). In order to improve \(F_m\), our algorithm adds some new estimator, \(h_{m}(x)\). Thus, \(F_{m+1}(x)=F_{m}(x)+h_{m}(x)=y\), and therefore \(h_{m}(x)=y - F_m(x)\). Gradient boosting fits \(h\) to the residual \(y - F_m(x)\), attempting to correct the errors of its predecessor (see the sketch after this table). | Decreases the bias error. Used in regression (MSE) and classification (log-loss). | A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that the residuals \(h_{m}(x)\) for a given model are the negative gradients of the mean squared error (MSE) loss function with respect to \(F(x)\): \(L_{\rm {MSE}}={\frac {1}{2}}\left(y-F(x)\right)^{2}\) and \(h_{m}(x)=-{\frac {\partial L_{\rm {MSE}}}{\partial F}}=y-F(x)\). Gradient boosting can thus be seen as a gradient descent algorithm, and generalizing it entails “plugging in” a different loss and its gradient.
Ensemble Learning | XGBoost | A form of gradient boosting. XGBoost delivers high performance compared to vanilla gradient boosting. Its training is very fast and can be parallelized / distributed across clusters. XGBoost computes second-order gradients, i.e. second partial derivatives of the loss function, which provide more information about the direction of gradients and how to get to the minimum of the loss function. XGBoost also handles missing values internally, so a separate treatment of missing values during data wrangling may not be needed. | A good model to try first. |
Ensemble Learning | Random Forest | RFs train each tree independently, using a random sample of the data. This randomness helps make the model more robust than a single decision tree and less likely to overfit the training data. | Multi-class problems, because it is efficient. |
Statistics | Beta regression | | |
Statistics | Naive Bayes | Classifier \(\hat{y} = argmax_{k\in\{1,...,K\}} p(C_k) \Pi_{i=1}^n p(x_i \| C_k)\) that often (in Gaussian NB) uses \(p(X\|C_k) \sim N(\mu_k, \sigma_k^2)\). | Per the product in this equation, NB requires feature independence. Data scarcity causes probabilities to be close to zero, which can cause numerical instabilities. Additionally, continuous features have to be discretized, throwing away a lot of information. |
Statistics | Linear Discriminant Analysis | While the naive Bayes classifier assumes covariance = 0, LDA assumes equal covariances between the classes \(G\) in \(y\). As a note: quadratic discriminant analysis allows different covariance matrices. | Does not work well if the classes are not balanced. Also not applicable to non-linear problems, and sensitive to overfitting. |
Statistics | Support Vector Machine | See notebook. | Good when \(p > n\) or when there is a good class partition. |
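As a concrete illustration of the gradient-boosting row above (each new estimator \(h_m\) is fit to the residuals of \(F_m\)), here is a minimal sketch with scikit-learn decision stumps on synthetic data:
[ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# F_0 predicts the mean, then we repeatedly fit stumps to the residuals
F = np.full_like(y, y.mean())
learning_rate = 0.3
for m in range(50):
    residual = y - F                                  # h_m targets the current residual
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    F = F + learning_rate * stump.predict(X)          # F_{m+1} = F_m + lr * h_m

print("MSE of F_0:        ", np.mean((y - y.mean()) ** 2))
print("MSE after boosting:", np.mean((y - F) ** 2))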
Sampling
Category | Technique | Description | When to use | Extending Concepts
---|---|---|---|---
Estimate integrals | Variational Bayesian | See the tutorial at http://www.orchid.ac.uk/eprints/40/1/fox_vbtut.pdf | |
Estimate integrals | Monte Carlo | See the Monte Carlo Integration tab | |
Optimizers
In stochastic gradient descent, \(\theta_{t+1} = \theta_t - \alpha \delta L(\theta_t)\), the parameters \(\theta\) are updated according to the gradient of the loss with respect to \(\theta\). \(\alpha\) is the learning rate. If it is small, convergence will be slow and has the potential to get caught in local minima; a large \(\alpha\) will lead to divergence. The gradient of the loss \(L\) changes quickly after each iteration due to the diversity of each training example. In this base case, it is common to see a “zig-zag” pattern even as we slowly approach the loss minimum.
Momentum can overcome this “zig-zag”. We introduce a new hyperparameter \(\mu\) where \(v_{t+1} = \mu v_t - \alpha \delta L(\theta_t)\) and \(\theta_{t+1} = \theta_t + v_{t+1}\).
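A minimal sketch of the two update rules on a toy quadratic loss (all numbers are illustrative):
[ ]:
grad = lambda theta: 2 * theta          # gradient of L(theta) = theta^2
alpha, mu = 0.1, 0.9

# plain SGD: theta_{t+1} = theta_t - alpha * grad
theta = 5.0
for t in range(50):
    theta = theta - alpha * grad(theta)
print("SGD:     ", theta)

# SGD with momentum: v_{t+1} = mu*v_t - alpha*grad, theta_{t+1} = theta_t + v_{t+1}
theta, v = 5.0, 0.0
for t in range(50):
    v = mu * v - alpha * grad(theta)
    theta = theta + v
print("momentum:", theta)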
Category | Technique | Description | When to use | Extending Concepts
---|---|---|---|---
Adaptive learning rate | Adagrad | Scales \(\alpha\) for each parameter according to the history of gradients (previous steps) for that parameter, which is basically done by dividing the current gradient in the update rule by the sum of previous squared gradients: \(g_{t+1} = g_t + \delta L(\theta_t)^2\) and \(\theta_{t+1} = \theta_t - \frac{\alpha \, \delta L(\theta_t)}{\sqrt{g_{t+1} + \epsilon}}\). As a result, when the accumulated gradients are large, the effective learning rate is reduced, and vice-versa. | |
Adaptive learning rate | RMSProp | The \(g_t\) term (as seen in Adagrad) is calculated by an exponentially decaying average rather than a sum of squared gradients. | |
Adaptive learning rate | Adam | Uses both the first order moment \(m_t\) and the second order moment \(g_t\), but both are decayed over time. Step size is approximately \(\pm \alpha\) and decreases as it approaches the minimum (all three update rules are sketched below). | |
There is often value to using more than one method (an ensemble), because every method has a weakness.
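A minimal sketch of the three adaptive update rules in the table above, applied to the same toy quadratic loss; the hyperparameter values are illustrative, not tuned:
[ ]:
import numpy as np

grad = lambda theta: 2 * theta   # gradient of L(theta) = theta^2
alpha, eps = 0.5, 1e-8

# Adagrad: accumulate squared gradients
theta, g = 5.0, 0.0
for t in range(100):
    d = grad(theta)
    g += d ** 2
    theta -= alpha * d / np.sqrt(g + eps)
print("Adagrad:", theta)

# RMSProp: exponentially decaying average of squared gradients
theta, g, rho = 5.0, 0.0, 0.9
for t in range(100):
    d = grad(theta)
    g = rho * g + (1 - rho) * d ** 2
    theta -= alpha * d / np.sqrt(g + eps)
print("RMSProp:", theta)

# Adam: decayed first and second moments, with bias correction
theta, m, v, b1, b2 = 5.0, 0.0, 0.0, 0.9, 0.999
for t in range(1, 101):
    d = grad(theta)
    m = b1 * m + (1 - b1) * d
    v = b2 * v + (1 - b2) * d ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print("Adam:   ", theta)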
A/B Tests
This tutorial is verbatim from Source: https://towardsdatascience.com/the-math-behind-a-b-testing-with-example-code-part-1-of-2-7be752e1d06f.
1. Set up the experiment
The goal of running an A/B test is to evaluate if a change in a (say) website will lead to improved performance in a specific metric. You may decide to test very simple alternatives such as changing the look of a single button on a webpage or testing different layouts and headlines. You could also run an A/B test on multi-step processes which may have many differences. Examples of this include the steps required in signing up a new user or processing the sale on an online marketplace.
Baseline Conversion Rate and Lift
Before running the test, we will know the baseline conversion rate and the desired lift or increase in signups that we would like to test. The baseline conversion rate is the current rate at which we sign up new users under the existing design. For our example, we want to use our test to confirm that the changes we make to our signup process will result in at least a 2% increase in our sign up rate. We currently sign up 10 out of 100 users who are offered a premium account.
[15]:
bcr = 0.10 # baseline conversion rate
d_hat = 0.02 # difference between the groups
Control Group (A) and Test Group (B)
Typically, the total number of users participating in the A/B test make up a small percentage of the total amount of users. Users are randomly selected and assigned to either a control group or a test group. The sample size that you decide on will determine how long you might have to wait until you have collected enough. For example, websites with large audiences may be able to collect enough data very quickly, while other websites may have to wait a number of weeks. There are some events that happen rarely even for high-traffic websites, so determining the necessary sample size will inform how soon you can assess your experiment and move on to improving other metrics.
Initially, we will collect 1000 users for each group and serve the current signup page to the control group and a new signup page to the test group.
[16]:
N_A = 1000 # Number of users in control group
N_B = 1000 # Number of users in test group
[20]:
import scipy.stats as scs
import pandas as pd
import numpy as np
np.random.seed(2)
def generate_data(N_A, N_B, p_A, p_B, days=None, control_label='A',
                  test_label='B'):
    """Returns a pandas dataframe with fake CTR data
    Example:
    Parameters:
        N_A (int): sample size for control group
        N_B (int): sample size for test group
            Note: final sample size may not match N_A provided because the
            group at each row is chosen at random (50/50).
        p_A (float): conversion rate; conversion rate of control group
        p_B (float): conversion rate; conversion rate of test group
        days (int): optional; if provided, a column for 'ts' will be included
            to divide the data in chunks of time
            Note: overflow data will be included in an extra day
        control_label (str)
        test_label (str)
    Returns:
        df (df)
    """
    # initiate empty container
    data = []
    # total amount of rows in the data
    N = N_A + N_B
    # distribute events based on proportion of group size
    group_bern = scs.bernoulli(N_A / (N_A + N_B))
    # initiate bernoulli distributions from which to randomly sample
    A_bern = scs.bernoulli(p_A)
    B_bern = scs.bernoulli(p_B)
    for idx in range(N):
        # initiate empty row
        row = {}
        # for 'ts' column
        if days is not None:
            if type(days) == int:
                row['ts'] = idx // (N // days)
            else:
                raise ValueError("Provide an integer for the days parameter.")
        # assign group based on 50/50 probability
        row['group'] = group_bern.rvs()
        if row['group'] == 0:
            # assign conversion based on provided parameters
            row['converted'] = A_bern.rvs()
        else:
            row['converted'] = B_bern.rvs()
        # collect row into data container
        data.append(row)
    # convert data into pandas dataframe
    df = pd.DataFrame(data)
    # transform group labels of 0s and 1s to user-defined group labels
    df['group'] = df['group'].apply(
        lambda x: control_label if x == 0 else test_label)
    return df
[21]:
ab_data = generate_data(N_A, N_B, bcr, bcr + d_hat)  # control converts at bcr, test at bcr + d_hat
ab_data.head()
[21]:
group | converted | |
---|---|---|
0 | A | 0 |
1 | B | 0 |
2 | A | 0 |
3 | A | 0 |
4 | A | 0 |
The generated data has two columns. The converted column indicates whether a user signed up for the premium service or not with a 1 or 0, respectively.
2. Run the test and record the success rate for each group.
[22]:
import numpy as np
ab_summary = ab_data.pivot_table(values='converted', index='group', aggfunc=np.sum)
# add additional columns to the pivot table
ab_summary['total'] = ab_data.pivot_table(values='converted', index='group', aggfunc=lambda x: len(x))
ab_summary['rate'] = ab_data.pivot_table(values='converted', index='group')
ab_summary
[22]:
converted | total | rate | |
---|---|---|---|
group | |||
A | 114 | 1070 | 0.106542 |
B | 17 | 930 | 0.018280 |
It looks like the difference in conversion rates between the two groups is 0.028 which is greater than the lift we initially wanted of 0.02. This is a good sign but this is not enough evidence for us to confidently go with the new design. At this point we have not measured how confident we are in this result. This can be mitigated by looking at the distributions of the two groups.
3. Plot the distribution of the difference between the two samples.
4. Calculate the statistical power.
5. Evaluate how sample size affects A/B tests.
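Steps 3-5 are not transcribed in these notes. As a placeholder, here is a minimal sketch of step 4 (approximate power of a two-sided two-proportion z-test) under the usual normal-approximation assumptions, reusing the variables defined above:
[ ]:
import numpy as np
import scipy.stats as scs

def power_two_proportions(p_A, p_B, n_A, n_B, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
    z_crit = scs.norm.ppf(1 - alpha / 2)
    effect = abs(p_B - p_A)
    # probability of exceeding the critical value under the alternative
    return scs.norm.sf(z_crit - effect / se) + scs.norm.cdf(-z_crit - effect / se)

print(power_two_proportions(bcr, bcr + d_hat, N_A, N_B))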
Attention
See for implementations: https://github.com/uzaymacar/attention-mechanisms
The key/value/query formulation of attention is from the paper Attention Is All You Need.
How should one understand the queries, keys, and values
The key/value/query concepts come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then present you the best matched videos (values).
It turns out that the attention operation can be thought of as a retrieval process as well, so the key/value/query concepts also apply here. (The above example is just a toy system for illustration; in practice, search engines and recommendation systems are much more complex.)
As mentioned in the referenced paper (Neural Machine Translation by Jointly Learning to Align and Translate), attention by definition is just a weighted average of values, \(c = \sum_j \alpha_j h_j\) with \(\sum_j \alpha_j = 1\).
If we restrict \(\alpha\) to be a one-hot vector, this operation becomes the same as retrieving from a set of elements \(h\) with index \(\alpha\). With the restriction removed, the attention operation can be thought of as doing “proportional retrieval” according to the probability vector \(\alpha\).
It should be clear that \(h\) in this context is the value. The difference between the two papers lies in how the probability vector \(\alpha\) is calculated. The first paper (Bahdanau et al. 2015) computes the score through a small neural network, \(e_{ij} = a(s_i, h_j)\), normalized with a softmax, \(\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}\).
A more efficient model would be to first project \(s\) and \(h\) onto a common space, then choose a similarity measure (e.g. dot product) as the attention score, like \(e_{ij} = (W_q s_i)^\top (W_k h_j)\).
This is essentially the approach proposed by the second paper (Vaswani et al. 2017), where the two projection vectors are called query (for decoder) and key (for encoder), which is well aligned with the concepts in retrieval systems. (There are later techniques to further reduce the computational complexity, for example Reformer, Linformer.)
How are the queries, keys, and values obtained
The proposed multihead attention alone doesn’t say much about how the queries, keys, and values are obtained, they can come from different sources depending on the application scenario.
For unsupervised language model training like GPT, \(Q, K, V\) are usually from the same source, so such operation is also called self-attention.
For the machine translation task in the second paper, it first applies self-attention separately to source and target sequences, then on top of that it applies another attention where \(Q\) is from the target sequence and \(K, V\) are from the source sequence.
For recommendation systems, \(Q\) can be from the target items, \(K, V\) can be from the user profile and history.
[1]:
from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax
# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])
# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()
# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
# computing the attention by a weighted sum of the value vectors
attention = weights @ V
print(attention)
[[0.98522025 1.74174051 0.75652026]
[0.90965265 1.40965265 0.5 ]
[0.99851226 1.75849334 0.75998108]
[0.99560386 1.90407309 0.90846923]]
Common Questions
Here, I answer questions from the list published here in preparation for my interviews. I also include other questions.
Vectors
Dot product
[E] What’s the geometric interpretation of the dot product of two vectors?
[E] Given a vector \(u\), find a vector \(v\) of unit length such that the dot product of \(u\) and \(v\) is maximum.
Outer product
[E] Given two vectors a=[3,2,1] and b=[-1,0,1]. Calculate the outer product \(a^T b\)?
[M] Give an example of how the outer product can be useful in ML.
[E] What does it mean for two vectors to be linearly independent?
[M] Given two sets of vectors \(A = a_1, a_2, ..., a_n\) and \(B = b_1, b_2, ..., b_n\). How do you check that they share the same basis?
[M] Given n vectors, each of d dimensions. What is the dimension of their span?
Norms and metrics
[E] What’s a norm? What is \(L_0, L_1, L_2, L_{\infty}\)?
[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?
[5]:
import numpy as np
import matplotlib.pyplot as plt

print("""1.1 Dot product finds the length of the projection of x onto y
""")
num_iter = 3
fig, axs = plt.subplots(1,num_iter)
for seed, ax in zip(range(num_iter), axs):
    np.random.seed(seed)
    n=2
    x = np.random.uniform(0,1,n)
    y = np.random.uniform(0,1,n)
    # Dot product finds the length of the projection of x onto y
    dot = np.sum(x.T*y) # or np.dot(x,y)
    x_mag = np.sqrt(np.sum(np.square(x)))
    y_mag = np.sqrt(np.sum(np.square(y)))
    angle = np.arccos(dot / (x_mag * y_mag)) * 360 / (2 * np.pi)
    ax.plot([0,x[0]], [0,x[1]], label='x')
    ax.plot([0,y[0]], [0,y[1]], label='y')
    ax.set_title(f"Dot:{round(dot,2)}, angle:{round(angle,2)}")
    handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels, loc='center right')
plt.tight_layout()
plt.show()

print("""1.2 The maximum dot product is found when the lines are parallel.
""")
print("""2.1 Calculate the outer product (notated with "X⊗Y")
""")
x = np.array([3,2,1])
y = np.array([-1,0,1])
print('x', x), print('y', y)
print('X⊗Y =', np.multiply.outer(x.T,y))
print("""2.2 Outer products can be used to analyze pairwise interactions, e.g. covariance matrices
""")
print("""3. Vectors are linearly independent if none can be written as a linear combination of the others; orthogonal vectors (dot(x,y)=0) are a special case. If the matrix formed by stacking the vectors has a zero eigenvalue (zero determinant, not full rank), the vectors are linearly dependent.
""")
matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])
lambdas, V = np.linalg.eig(matrix.T)
# The linearly dependent row vectors
print("Dependent: ", matrix[lambdas == 0,:])
print("4. Confirm independence.")
print("5. The span is the same dimension as the basis. It is generated from linear combinations of the basis vectors.")
print("6. The L0 'norm' counts the number of nonzero elements. For instance, if 1 answer out of 5 differs from the key, the L0 distance is 1.")
print(" L1 is the Manhattan distance: the sum of absolute values.")
print(" L2 is the Euclidean distance: the square root of the sum of squares.")
print(" L-infinity reports the largest magnitude among the elements of a vector. In the construction analogy, by minimizing the L-infinity, we are reducing the cost of the most expensive building.")
print("""\nMetrics d(u,v) induced by a vector space norm have additional properties that are not true of general metrics, namely:
1. Translation Invariance: d(u+w, v+w) = d(u,v)
2. Scaling property: for any real number t, d(tu,tv) = |t| d(u,v)
Conversely, if a metric has the above properties, then d(u,0) is a norm. In other words, a metric is a function of two variables while a norm is a function of one variable.
""")
1.1 Dot product finds the length of the projection of x onto y

1.2 The maximum dot product is found when the lines are parallel.
2.1 Calculate the outer product (notated with "X⊗Y")
x [3 2 1]
y [-1 0 1]
X⊗Y = [[-3 0 3]
[-2 0 2]
[-1 0 1]]
2.2 Outer products can be used to analyze pairwise interactions, e.g. covariance matrices
3. Vectors are linearly independent if none can be written as a linear combination of the others; orthogonal vectors (dot(x,y)=0) are a special case. If the matrix formed by stacking the vectors has a zero eigenvalue (zero determinant, not full rank), the vectors are linearly dependent.
Dependent: [[0 1 1 0]]
4. Confirm independence.
5. The span is the same dimension as the basis. It is generated from linear combinations of the basis vectors.
6. L0 reports the number of incorrect responses. For instance, if 1 answer is reported incorrect out of 5 questions, then the L0 is 1.
L1 is manhattan distance and is described as the sum of absolutes.
L2 is euclidean distance and is described as the square root of the sum of squares.
L-infinity reports the largest magnitude among the elements of a vector. In the analogy of construction, by minimizing the L-infinity, we are reducing the cost of the most expensive building.
A metric d(u,v) induced by a vector space norm has additional properties that are not true of general metrics, namely:
1. Translation Invariance: d(u+w, v+w) = d(u,v)
2. Scaling property: for any real number t, d(tu,tv) = |t| d(u,v)
Conversely, if a metric has the above properties, then d(u,0) is a norm. In other words, a metric is a function of two variables while a norm is a function of one variable.
Matrices
1. Why do we say that matrices are linear transformations?
Matrices, when multiplied with a vector, apply a linear transformation to that vector.
Matrices give us a powerful, systematic way to describe a wide variety of transformations: they can describe rotations, reflections, dilations, and much more.
2. What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix always unique?
\(A^{-1} A = A A^{-1} = I\) describes a matrix \(A\) that, when multiplied by its inverse \(A^{-1}\), generates the identity matrix. Matrices are invertible when they have a nonzero determinant, nonzero eigenvalues, a trivial nullspace (only the zero vector), and full rank (rank = dimension). By \(A=AI=A(CB)=(AC)B=IB=B\), where \(A\) and \(B\) are square matrices with the same inverse \(C\), an inverse of a matrix is always unique.
3. What does the determinant of a matrix represent?
The factor by which the transformation scales (signed) volumes. A determinant of zero “squashes” the parallelepiped; in other words, the matrix is singular.
4. What happens to the determinant of a matrix if we multiply one of its rows by a scalar \(t\)?
\(\det (kA) = k^n \det(A)\) where A is an \(n \times n\) matrix
Also, If a matrix \(A\) has a row that is all zeros, then \(\det A = 0\)
5. A \(4 \times 4\) matrix has four eigenvalues \(3, 3, 2, -1\). What can we say about the trace and the determinant of this matrix?
Trace is the sum of the eigenvalues of a matrix. Product of eigenvalues of a matrix is equal to the value of the determinant of a matrix.
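For this example: trace \(= 3 + 3 + 2 - 1 = 7\) and determinant \(= 3 \cdot 3 \cdot 2 \cdot (-1) = -18\).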
6. Given the following matrix:
This matrix has dependent columns, so we know that the determinant is zero. This is true because a matrix whose column vectors are linearly dependent will have a zero row show up in its reduced row echelon form, which means that a parameter in the system can be of any value you like.
7. What’s the difference between the covariance matrix \(A^T A\) and the Gram matrix \(AA^T\)? Given \(A \in \mathbf{R}^{n\times m}\) and \(b \in \mathbf{R}^n\).
\(A A^T\) is an \(n \times n\) matrix (the Gram matrix of the rows/samples).
\(A^T A\) is an \(m \times m\) matrix and, when the columns are centered, is proportional to the covariance matrix of the features.
i. Find \(x\) such that: \(Ax=b\).
\(Ax = b\)
\(A^{-1} A x = A^{-1} b\)
\(I x = A^{-1} b\)
\(x = A^{-1} b\)
ii. When does this have a unique solution?
When A is invertible.
iii. Why is it that when A has more columns than rows, Ax=b has multiple solutions?
With more columns than rows, the system is underdetermined and \(A\) has a nontrivial null space; if \(x_0\) is one solution, then \(x_0 + v\) is also a solution for any \(v\) in the null space, so there are infinitely many solutions (when any exist).
iv. Given a matrix A with no inverse. How would you solve the equation Ax=b? What is the pseudoinverse and how to calculate it?
Use the Moore-Penrose pseudoinverse \(A^+\): \(x = A^+ b\) gives the minimum-norm least-squares solution. It can be computed from the SVD, \(A^+ = V D^+ U^T\) (invert the nonzero singular values), or as \(A^+ = (A^T A)^{-1} A^T\) when \(A^T A\) is invertible. See https://www.omnicalculator.com/math/pseudoinverse
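A minimal numpy sketch (the matrix and vector below are assumed examples): np.linalg.pinv computes the pseudoinverse via the SVD, and its solution matches a least-squares solver.
[ ]:
import numpy as np

# A is 3x2 (not square), so it has no ordinary inverse
A = np.array([[1., 2.], [2., 4.], [3., 1.]])
b = np.array([1., 2., 3.])

x_pinv = np.linalg.pinv(A) @ b                    # pseudoinverse solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solver gives the same answer
print(x_pinv, x_lstsq)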
8. Derivative is the backbone of gradient descent.
i. What does derivative represent?
The rate of change of a function with respect to its input.
ii. What’s the difference between derivative, gradient, and Jacobian?
Gradient: the vector of partial derivatives of a scalar-valued multivariate function.
Jacobian: the matrix of partial derivatives of a vector-valued function.
As a note, the Hessian is the Jacobian of the gradient (the matrix of second derivatives of a scalar-valued function).
9. Say we have the weights \(w \in \mathbf{R}^{d \times m}\) and a mini-batch \(x\) of \(n\) elements, each element of shape \(1 \times d\) so that \(x \in \mathbf{R}^{n \times d}\). We have the output \(y = f(x; w) = xw\). What’s the dimension of the Jacobian \(\frac{\partial y}{\partial x}\)?
10. Given a very large symmetric matrix \(A\) that doesn’t fit in memory, say \(A \in \mathbf{R}^{1M \times 1M}\), and a function \(f\) that can quickly compute \(f(x) = Ax\) for \(x \in \mathbf{R}^{1M}\). Find the unit vector \(x\) so that \(x^T A x\) is minimal. Hint: Can you frame it as an optimization problem and use gradient descent to find an approximate solution?
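A minimal sketch of the hinted approach (not a definitive solution): projected gradient descent on \(x^T A x\) over the unit sphere, using only matrix-vector products. A small random symmetric matrix stands in for the \(1M \times 1M\) one, and the learning rate and iteration count are arbitrary choices.
[ ]:
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(500, 500))
A = (B + B.T) / 2                      # small symmetric stand-in for the huge matrix

def matvec(x):                         # stand-in for the fast black-box f(x) = Ax
    return A @ x

x = rng.normal(size=A.shape[0])
x /= np.linalg.norm(x)
lr = 1e-2
for _ in range(2000):
    g = 2 * matvec(x)                  # gradient of x^T A x
    x = x - lr * g                     # gradient descent step
    x /= np.linalg.norm(x)             # project back onto the unit sphere
print("estimated min of x^T A x:", x @ matvec(x))
print("smallest eigenvalue:     ", np.linalg.eigvalsh(A)[0])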
Linear regression
1. Derive the least squares solution.
Differentiate wrt \(\beta\) to minimize…
This is a common solution. But, to solve for \(\beta\), we can backtrack a little…
2. Prove that \(X\) and \(\epsilon\) are independent
We do this by proving \(X \perp \epsilon\). In other words, \(X^T \epsilon = 0\) where \(X\) is \(n \times p\) and \(\epsilon\) is \(n \times 1\)
While here, we should also prove that \(\epsilon\) and \(\hat{y}\) are independent
assuming \(E(\epsilon \epsilon^T) = \sigma_\epsilon^2 I\) where \(\epsilon \sim N(0,\sigma_\epsilon^2 I)\) and knowing that \(H\) is idempotent (\(HH = H\)).
3. Prove ANOVA \(SST = SSE + SSR\)
We know \(2 \sum_{i=1}^n (y_i - \bar{y}) (\hat{y}_i - \bar{y}) = 0\) because
\(\sum_{i=1}^n (y_i - \hat{y}_i) (\hat{y}_i - \bar{y}) = \sum_{i=1}^n \hat{y}_i (y_i - \hat{y}_i) - \bar{y}_i \sum_{i=1}^n (y_i - \hat{y}_i) = 0 - 0 = 0\)
We know
As a note, the adjusted \(R^2\) is
4. Given the standard deviation of residuals (\(\hat{\sigma}^2\)), find \(RSS\):
5. Given F, p, and n, find \(R^2\)
6. In the R output below, how are the terms calculated?
[1]:
from utils import disp
disp('example_OLS_output.png')

The estimate is calculated through ordinary least squares (closed form derivation shown above).
The std. error is \(\sqrt{\widehat{Var}(\hat{\beta}_j)}\) where \(\hat{\beta}_j\) is the LS estimator of \(\beta_j\), and \(Var(\hat{\beta}) = \sigma_\epsilon^2 (X^\prime X)^{-1}\) (proof below) is the variability of the coefficients (as new data points are added). We use \(\widehat{Var}\) instead of \(Var\) because we are estimating the sampling variability; things like the Gaussian noise variance can be unknown quantities and, therefore, must be estimated.
The t-value is the estimate divided by the std. error
The p-value \(Pr(>|t|)\) is a table lookup; We find the p-value on the t distribution with DF \(N-p-1\) and t-value.
The residual standard error is \(RSE = \sqrt{\frac{RSS}{N-p-1}}\). Note, we can find \(RSS\) using the information on this line. Additionally, if we square this value, we receive the variance of the residuals according to \(\hat{\sigma}^2 = RSE^2\).
The R-square value is described as the total amount of variance explained by the model, or \(SSR / SST\).
The adjusted R-square is calculated as a function of the \(R^2\): \(R^2_{adj} = 1 - (1 - R^2)\frac{N-1}{N-p-1}\).
The F-statistic is a “global” test that checks whether at least one of your coefficients is nonzero.
Because \(F \sim F_{p, N - p - 1}\), the p-value is estimated as \(Pr(F_{p, N - p - 1} \geq F)\).
7. Prove \(Var[\hat{\beta}] = \sigma_\epsilon^2 (X^\prime X)^{-1}\)
We know that
Knowing OLS is unbiased, \(E(\hat{\beta}|X) = \beta\), and therefore \(Var[E(\hat{\beta}|X)] = 0\) and that \(\beta\) is a constant so
To prove this last step,
Using this, Let \(\mathbf{x}_j\) be the \(j^{th}\) column of \(\mathbf{X}\), and \(\mathbf{X}_{-j}\) be the \(\mathbf{X}\) matrix with the \(j^{th}\) column removed.
From here, Let \(\mathbf{x_1}\) be the \(1\)st column of \(X\). Let \(X_{-1}\) be the matrix \(X\) with the \(1\)st column removed.
Consider the matrices:
Observe that:
By the matrix inversion lemma (and under some existence conditions):
Notice the 1st row, 1st column of \((X'X)^{-1}\) is given by the Schur complement of block \(D\) of the matrix \(X'X\)
8. Derive the ridge regression beta in closed form
It suffices to modify the loss function by adding the penalty. In matrix terms, the initial quadratic loss function becomes
Expanding the RSS
Differentiate wrt \(\beta\) to minimize…
Therefore, the ridge estimator is
As a note, assuming orthonormality of the design matrix implies \(X^T X = I = (X^T X)^{-1}\). So, the ridge estimator can be defined as \(\hat{\beta}(\lambda)_{ridge} = (1 + \lambda)^{-1} \hat{\beta}_{OLS}\).
Also, bias increases with \(\lambda\) (the coefficients are shrunk further toward zero) and variance decreases with \(\lambda\). So, what happens to the MSE of ridge?
9. Compare the MSE of ridge regression and OLS
OLS is unbiased and minimizes the training RSS, but it does not necessarily minimize the MSE of the estimator. Ridge trades a small increase in bias for a reduction in variance, so for a suitable \(\lambda > 0\) (especially with collinear predictors) ridge can achieve a lower MSE than OLS; see the simulation sketch below.
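A small simulation sketch of this tradeoff (the design below, with nearly collinear columns, and the choice of \(\lambda\) are assumed examples, not from the notes):
[ ]:
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, n_rep = 50, 5, 5.0, 500
beta = np.ones(p)

def fit_once():
    z = rng.normal(size=(n, 1))
    X = z + 0.05 * rng.normal(size=(n, p))                     # nearly collinear columns
    y = X @ beta + rng.normal(size=n)
    ols = np.linalg.solve(X.T @ X, X.T @ y)
    ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.sum((ols - beta) ** 2), np.sum((ridge - beta) ** 2)

errs = np.array([fit_once() for _ in range(n_rep)])
print("MSE(OLS)  :", errs[:, 0].mean())   # large: variance explodes under collinearity
print("MSE(ridge):", errs[:, 1].mean())   # smaller: biased but much lower variance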
Dimensionality reduction
1. Why do we need dimensionality reduction?
Remove collinearity & multicollinearity, and save storage & computation time.
2. Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?
No. If multiple eigenvalues are the same, then decomposition is not unique.
3. Name some applications of eigenvalues and eigenvectors.
Singular value decomposition (SVD), \(A = U D V^T\), is more general than eigendecomposition. Every real matrix has a SVD
[8]:
# Singular-value decomposition
import numpy as np
from scipy.linalg import svd
# define a matrix
A = np.array([[1, 2], [3, 4], [5, 6]])
A = A - np.mean(A,0)
print("A\n",A)
# Eigendecomposition
co=np.cov(A.T)
[D,UI]=np.linalg.eigh(co)
print("UI",UI)
# SVD
U, s, VT = svd(A)
print("U, left-singular vectors of A\n", U)
print("Singular values of original matrix A\n", s)
print("V, right-singular vectors of A\n", VT)
A
[[-2. -2.]
[ 0. 0.]
[ 2. 2.]]
UI [[-0.70710678 0.70710678]
[ 0.70710678 0.70710678]]
U, left-singular vectors of A
[[-0.70710678 0. 0.70710678]
[ 0. 1. 0. ]
[ 0.70710678 0. 0.70710678]]
Singular values of original matrix A
[4. 0.]
V, right-singular vectors of A
[[ 0.70710678 0.70710678]
[-0.70710678 0.70710678]]
4. We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range 0-1 and one is in the range 10 - 1000. Will PCA work on this dataset?
Normalization is important in PCA since it is a variance maximizing exercise. On larger scales, the variance is naturally larger. So, the wrong feature combinations might be chosen.
5. Under what conditions can one apply eigendecomposition? What about SVD?
https://math.stackexchange.com/a/365020/752105
i. What is the relationship between SVD and eigendecomposition?
ii. What’s the relationship between PCA and SVD?
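For (i): the right-singular vectors of \(A\) are eigenvectors of \(A^T A\) and the squared singular values are its eigenvalues. For (ii): PCA is the SVD of the centered data matrix. A quick numeric check of (i) (illustrative sketch; the matrix is random):
[ ]:
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))
U, s, VT = np.linalg.svd(A, full_matrices=False)
evals, evecs = np.linalg.eigh(A.T @ A)        # eigenvalues in ascending order
print(np.allclose(np.sort(s**2), evals))      # True: sigma_i^2 are the eigenvalues of A^T A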
6. How does t-SNE (T-distributed Stochastic Neighbor Embedding) work? Why do we need it?
https://towardsdatascience.com/t-distributed-stochastic-neighbor-embedding-t-sne-bb60ff109561
An unsupervised, randomized algorithm used mainly for visualization. It applies a non-linear dimensionality reduction technique whose focus is on keeping very similar data points close together in the lower-dimensional space. It preserves the local structure of the data by using a heavy-tailed Student t-distribution (rather than a Gaussian) to compute the similarity between two points in the low-dimensional space, which helps address the crowding and optimization problems and makes the embedding less sensitive to outliers.
Step 1: Find the pairwise similarity between nearby points in a high dimensional space.
Step 2: Map each point in high dimensional space to a low dimensional map based on the pairwise similarity of points in the high dimensional space.
Step 3: Find a low-dimensional data representation that minimizes the mismatch between \(p_{ij}\) and \(q_{ij}\) using gradient descent on the Kullback-Leibler divergence (KL divergence)
Step 4: Use Student-t distribution to compute the similarity between two points in the low-dimensional space.
PCA is deterministic, whereas t-SNE is not deterministic and is randomized. t-SNE tries to map only local neighbors whereas PCA is just a diagonal rotation of our initial covariance matrix and the eigenvectors represent and preserve the global properties
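A quick illustrative sketch of this contrast (using scikit-learn; the dataset and parameters are arbitrary choices, not from the notes):
[ ]:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                                   # deterministic linear projection
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)  # randomized non-linear embedding
print(X_pca.shape, X_tsne.shape)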
Statistics
1. Explain frequentist vs. Bayesian statistics.
I have misplaced my phone somewhere in the home. I can use the phone locator on the base of the instrument to locate the phone and when I press the phone locator the phone starts beeping.
Problem: Which area of my home should I search?
Frequentist Reasoning
I can hear the phone beeping. I also have a mental model which helps me identify the area from which the sound is coming. Therefore, upon hearing the beep, I infer the area of my home I must search to locate the phone.
Bayesian Reasoning
I can hear the phone beeping. Now, apart from a mental model which helps me identify the area from which the sound is coming from, I also know the locations where I have misplaced the phone in the past. So, I combine my inferences using the beeps and my prior information about the locations I have misplaced the phone in the past to identify an area I must search to locate the phone.
So, prior beliefs (\(f(p)\)) get updated with new data! This follows human thinking! However, it is sometimes hard to define the priors.
2. Given the array , find its mean, median, variance, and standard deviation.
mean \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\)
variance \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x - \bar{x})^2\)
3. When should we use median instead of mean? When should we use mean instead of median?
The median is more robust to outliers. The mean is tractable (easy to work with analytically) and uses all of the data.
4. What is a moment of function? Explain the meanings of the zeroth to fourth moments.
The \(n\)th moment of a distribution about a number is the expected value of the \(n\)th power of the deviations about that number. The moment generating function \(M_X(t) = E(e^{tX})\) is a good trick for calculating the moments (and hence the properties) of a distribution.
n = 0, moment = 1 because the AUC of PDF must be 1.
n = 1 and centered about origin, \(E(X)\)
n = 2 and centered about mean, the variance \(Var(X) = E((X-\mu)^2)\)
n = 3 and centered about mean, the skewness \(E((X-\mu)^3)\)
n = 4 and centered about mean, the kurtosis \(E((X-\mu)^4)\)
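A quick numeric illustration of the moments listed above on a simulated sample (an assumed example; the skewness and kurtosis reported by scipy are the standardized versions of the third and fourth central moments):
[ ]:
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=100_000)
print("mean    ", np.mean(x))                        # 1st moment about the origin
print("variance", np.mean((x - x.mean()) ** 2))      # 2nd central moment
print("skewness", stats.skew(x))                     # standardized 3rd central moment
print("kurtosis", stats.kurtosis(x, fisher=False))   # standardized 4th central moment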
5. Are independence and zero covariance the same? Give a counterexample if not.
Zero covariance does not imply independence. For instance, let \(X\) be a random variable that is \(−1\) or \(+1\) with probability \(0.5\). Then let \(Y\) be a random variable such that \(Y=0\) if \(X=-1\) and \(Y\) is randomly \(-1\) or \(+1\) with probability \(0.5\) if \(X=1\). Clearly, \(X\) and \(Y\) are dependent (since knowing \(Y\) allows me to perfectly know \(X\)), but their covariance is zero. They both have zero mean, and
Or more generally, take any distribution \(P(X)\) and any \(P(Y|X)\) such that \(P(Y=a|X)=P(Y=−a|X)\) for all \(X\) (i.e., a joint distribution that is symmetric around the \(x\) axis), and you will always have zero covariance. But you will have non-independence whenever \(P(Y|X)\neq P(Y)\); i.e., the conditionals are not all equal to the marginal. Or ditto for symmetry around the \(y\) axis.
Another example is: Take a random variable \(X\) with \(EX=0\) and \(EX^3=0\), e.g. a normal random variable with zero mean. Take \(Y=X^2\). It is clear that \(X\) and \(Y\) are related, but
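A quick numeric check of this second counterexample (illustrative sketch): \(Y = X^2\) is fully determined by \(X\), yet the sample covariance is approximately zero.
[ ]:
import numpy as np

x = np.random.default_rng(0).normal(size=1_000_000)
y = x ** 2
print("cov(X, Y)  ≈", np.cov(x, y)[0, 1])             # close to 0
print("corr(X^2, Y) =", np.corrcoef(x**2, y)[0, 1])   # 1.0: Y is a deterministic function of X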
Summary of ML implementations on Kaggle: https://www.kaggle.com/shivamb/data-science-glossary-on-kaggle
Bayesian rule
Let’s say the response variable is either low or high. The population can have either default (D) or not default (ND).
We know \(P(low|D)=\)
What is the probability of default given high?
More generally, we can write the posterior continuous probability as:
With prior of 1/2 what is posterior probability?
\(P(k) = 1/2 = P(1) = P(2)\)
If the \(P(k)\) are the same (as we see), then the posterior update is simply the ratio of the probabilities. So, you’d classify the point where the density is higher. In other words, instead of calculating the posterior \(p_k(x)\), \(k \in C\), we can simply compare them and select the class \(k\) that maximizes \(p_k(x)\)
\(p_1(x) = \frac{\pi_1 f_1(x)}{f(x)}\)
\(p_2(x) = \frac{\pi_2 f_2(x)}{f(x)}\)
Taking the ratio eliminates the \(f(x)\) and makes the computation simpler.
If you simplify this form (where \(P(k) \sim N(\mu, \sigma^2)\)),
Show \(x = \frac{\mu_1 + \mu_2}{2}\)
As a note, if \(\sigma\) is the same, this is contained for \(i = 2, ..., m\) groups.
Cross Entropy
Negative log likelihood
Likelihood refers to the chances of some calculated parameters producing some known data. In ML, the parameters are updated to fit a static dataset.
Cost function that is used as loss for machine learning classification models (the lower the better). We use NEGATIVE because most ML frameworks only have minimization optimization functionality.
We take \(\ln\) because it is numerically cleaner to work with when the probabilities involved are very large or very small.
The output of a classification problem is usually a probability vector. For example,
If the correct answer is the fourth class \(y = [0,0,0,1]\), the likelihood of the current state of the model producing the input is:
Therefore, \(-\ln(0.1) = 2.3\)
If the correct category would have been the third class \(y = [0,0,1,0]\), the likelihood would be
Therefore \(-\ln(0.5) = 0.69\).
The better the prediction, the lower the number!
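A minimal numpy sketch of this calculation (the probability vector below is an assumed example, chosen to be consistent with the numbers above):
[ ]:
import numpy as np

p = np.array([0.3, 0.1, 0.5, 0.1])   # assumed model output (a probability vector)
y = np.array([0, 0, 0, 1])           # one-hot label: the correct answer is the fourth class
nll = -np.log(np.sum(p * y))         # picks out the predicted probability of the true class
print(nll)                           # -ln(0.1) ≈ 2.3; a better prediction gives a lower value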
Regularization
Ridge
Add notes from missed class.
The \(\lambda I\) term adds a positive constant to the diagonal of \(X^T X\), making it invertible even when \(X^T X\) is singular and shrinking the coefficients toward zero.
Lasso
Shrinkage effect
If each feature is independent, the design matrix (centered and standardized) will have \(x_j^T x_k = 1(j=k)\) because the columns are orthogonal.
The least squares estimator can then be re-evaluated as
\(\hat{\beta}^{OLS} = (X^T X)^{-1} X^T y = X^T y\)
because \(X^T X = I\). In other words, each predictor can be treated as a basis.
In ridge regression, the \(X^T X + \lambda I_p\) becomes a diagonal matrix of lambdas.
\(\hat{\beta}^R_\lambda = (X^T X + \lambda I_p)^{-1} X^T y = \frac{1}{1 + \lambda} X^T y = \frac{\beta^{OLS}}{1 + \lambda}\)
Therefore, the larger \(\lambda\) is, the smaller the magnitude of each \(\beta_j\). The shrinkage is a linear function of \(\lambda\) and, here, it is applied uniformly. However, in practice, features are not orthogonal, so each coefficient is affected non-uniformly: directions with smaller eigenvalues are shrunk more.
This will shrink the prediction by the same scale.
\(\hat{y}\) is scaled by (\(1 + \lambda\))
We know that the \(\hat{\beta}^{OLS}\) is unbiased. \(E \hat{\beta} = (X^T X)^{-1} X^T E(Y) = \beta\) because \(E(Y) = X\beta\).
However, the ridge regression is a biased estimator: \(E \hat{\beta}_\lambda^R = \frac{\beta}{1 + \lambda} \neq \beta\)
Tradeoff between bias and variance. With collinearity, the variance explodes (because \(X^T X\) is nearly singular and cannot be stably inverted). Ridge regression regularizes (reduces) the variance.
For OLS, the \(MSE(\hat{y}^{OLS}) = E(\hat{y}^{OLS} - f(x))^2 = \sigma^2_{OLS}\)
But, consider the ridge regression solution \(\hat{y}_\lambda^R = \hat{y}^{OLS} / (1 + \lambda)\): its MSE is \(E(\hat{y}^{OLS} / (1 + \lambda) - f(x))^2 = \frac{\sigma^2_{OLS}}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}f^2(x)\)
In some cases, \(\sigma^2_{OLS}\) can be very large due to the near-singularity of \(X^TX\). The shrinkage will increase bias but will reduce variance.
Proof:
\(E(\frac{\hat{y}^{OLS}}{1 + \lambda} - f(x))^2\)
\(= E[(\frac{\hat{y}^{OLS}}{1 + \lambda} - \frac{f(x)}{1 + \lambda}) - (\frac{\lambda}{1 + \lambda} f(x))]^2\)
\(= \frac{\sigma^2_{OLS}}{(1 + \lambda)^2} + (\frac{\lambda}{1 + \lambda})^2 f^2(x)\)
PCA
\(X^T X = (V D U^T) (U D V^T) = V D^2 V^T\)
\(F = XV = UD\) where \(f_j = X v_j\) are the projections onto the PC directions and are the principal components.
Note \(f^T_j f_j = d^2_j\) and \(f^T_i f_j = 0\) because orthogonal.
So, with our PCs as \(X\),
\(y - X\beta = y - UDV^T\beta = y - F \alpha = y - \sum_{j=1}^p \alpha_j f_j\)
Regression after PCA against all PCs
\(\alpha = V \beta\) and \(||\alpha||_2 = ||\beta||_2\)
Therefore, the L2 norm of \(\beta\) is the same as L2 norm of \(\alpha\)
The ridge regression therefore
\(min_\beta ()...\)
is equivalent to
\(min_\alpha ()...\)
This allows \(F = UD\) to satisfy the property \(F^T F = DU^T UD = D^2\)
The ridge estimator
\(\hat{\alpha}_\lambda^R = (F^T F + \lambda I)^{-1} F^T y\)
\(= diag(\frac{d_j}{d_j^2 + \lambda}) U^T y\)
\(\hat{\alpha}_{\lambda,j}^R = \frac{d_j^2}{d_j^2 + \lambda} \hat{\alpha}_j^{OLS}\)
The smaller the \(d_j\), the more shrinkage.
The prediction is \(\hat{y}^R = F \alpha^R\)
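A quick numeric check of these shrinkage factors (illustrative sketch; the data are random and \(\lambda\) is an arbitrary choice):
[ ]:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)                                  # center the design matrix
y = rng.normal(size=100)
lam = 3.0

U, d, VT = np.linalg.svd(X, full_matrices=False)
F = U * d                                            # principal components F = UD
alpha_ols = np.linalg.solve(F.T @ F, F.T @ y)
alpha_ridge = np.linalg.solve(F.T @ F + lam * np.eye(4), F.T @ y)
# In the PC basis, ridge shrinks each OLS coordinate by d_j^2 / (d_j^2 + lambda)
print(np.allclose(alpha_ridge, d**2 / (d**2 + lam) * alpha_ols))   # True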
Partial Least Squares
Gradually capture the information in \(x\) corresponding to the information in \(y\).
\(z_1 = \sum_{j=1}^p \phi_j x_j\)
where \(\phi_j = \frac{<y, x_j>}{<x_j, x_j>} = <y, x_j>\)
We adjust the predictors by removing the effect contained in \(Z_1\)
\(e_j = X_j - Z_1 x_j\)
We do not want duplicate information. Info in \(Z_1\) should not be in \(Z_2\).
Degrees of freedom
We can evaluate the effective degree-of-freedom because
In OLS, the trace of \(H\) is \(p\) because \(tr(H) = tr(X(X^T X)^{-1}X^T) = tr((X^T X)^{-1}X^TX) = tr(I_p) = p\)
The ridge regression projects \(y\) to
\(\hat{y}^R_\lambda = X(X^T X + \lambda I_p)^{-1} X^T y = S_\lambda y\)
We measure the effective degrees-of-freedom of ridge regression by
\(tr(S_\lambda) < p\)
Lasso
The lasso minimizes
\((y - X\beta)^T (y - X\beta) + \lambda ||\beta||_1\)
where \(||\beta||_1 = |\beta_1| + |\beta_2| + ... = \sum_{j=1}^p |\beta_j|\). Therefore, the constraint region is a diamond (constructed from linear functions).
L2 norm encourages small \(\beta_j\) but L1 will encourage sparse solutions.
Theoretical Statistics
Notes based on Casella and Berger, 2nd edition.
Probability Theory
To Do
Transformations and Expectations
Transformation
Theorem 2.1.5: Let \(X\) have pdf \(f_X(x)\) and let \(Y = g(X)\) where \(g\) is a monotone function. Let the sets \(\mathbf{X}\) and \(\mathbf{Y}\) be defined by \(\mathbf{X} = \{ x: f_X(x) > 0 \}\) and \(\mathbf{Y} = \{y: y = g(x)\) for some \(x \in \mathbf{X} \}\). Suppose that \(f_X(x)\) is continuous on \(\mathbf{X}\) and that \(g^{-1}(y)\) has a continuous derivative on \(\mathbf{Y}\). Then, the pdf of \(Y\) is given by
Procedure:
Check if \(g\) is monotonic
Split up into regions where monotonic and then evaluate formula above
See: Example 2.1.6 (monotonic) and Example 2.1.7 (multiple regions)
Theorem 2.1.10:
See: Proof on pg. 54
Expected values
See: Example 2.2.2 for continuous and Example 2.2.3 for discrete
Properties:
\(E(a g_1(X) + b g_2(X) + c) = a E g_1(X) + b E g_2(X) + c\).
If \(g_1(x) \geq 0\) for all \(x\) then \(E g_1(X) \geq 0\).
If \(g_1(x) \geq g_2(x)\) for all \(x\) then \(E g_1(X) \geq E g_2(X)\).
If \(a \leq g_1(x) \leq b\) for all \(x\) then \(a \leq E g_1(X) \leq b\).
Example: Minimize distance
This result happens to be the definition of variance.
Moments
The \(n\)th central moment of \(X\) is
where \(\mu = E X\).
From this, we know the variance is
See: Example 2.3.3 for the variance of a parameter
Properties:
\(Var(aX + b) = a^2 Var X\)
\(Var X = E(X - EX)^2 = EX^2 - (EX)^2\)
Moment Generating Function (mgf)
Let \(X\) be a R.V. with cdf \(F_X\). The mgf of \(X\) (or \(F_X\)) is
With our knowledge of expected values,
Theorem: The \(n\)th moment is equal to the \(n\)th derivative of \(M_X(t)\) evaluated at \(t=0\).
Assuming we can differentiate under the integral sign (see Leibnitz Rule below),
Evaluating this at \(t=0\), we have
See: Example 2.3.8 for continuous case and Example 2.3.9 for discrete.
Properties:
\(M_{aX + b}(t) = e^{bt} M_X(at)\)
Convergence of mgfs
Suppose \(\{X_i, i = 1, 2, ... \}\) is a sequence of RVs, each with mgf \(M_{X_i}(t)\). Furthermore suppose that
for all \(t\) in a neighborhood of 0 and \(M_X(t)\) is an mgf. Then, there is a unique cdf \(F_X\) whose moments are determined by \(M_X(t)\) and, for all \(x\) where \(F_X(x)\) is continuous we have
. That is, convergence, for \(|t| < h\), of mgfs to an mgf implies convergence of cdfs.
This relies on Laplace transforms, which defines
Proof: Poisson approximation of Binomial
We know that the poisson approximation is valid when \(n\) is large and \(np\) is small.
Recall that the mgf of the binomial is \(M_X(t) = [p e^t + (1 - p)]^n\).
From the rule above (and from the textbook), the mgf of the Poisson is \(M_Y(t) = e^{\lambda (e^t - 1)}\).
If we define \(p = \lambda / n\) then \(M_X(t) \xrightarrow{} M_Y(t)\) as \(n \xrightarrow{} \infty\).
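A quick numeric check of this limit (illustrative, using scipy): for large \(n\) and \(p = \lambda / n\), the binomial pmf is close to the Poisson pmf.
[ ]:
from scipy import stats

lam, n = 2.0, 10_000
p = lam / n
for k in range(5):
    print(k, stats.binom.pmf(k, n, p), stats.poisson.pmf(k, lam))   # nearly identical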
Lemma: If \(\lim_{n\xrightarrow{} \infty} a_n = a\), then \(\lim_{n\xrightarrow{} \infty} (1 + \frac{a_n}{n})^n = e^a\)
Leibnitz Rule
If \(f(x,\theta)\), \(a(\theta)\), and \(b(\theta)\) are differentiable with respect to \(\theta\), then
See page 69.
If \(a(\theta)\) and \(b(\theta)\) are constant, then
Lebesgue’s Dominated Convergence Theorem
See page 69 & 70. Basically, if the integral is not too badly behaved, then we can say it’s good enough to bring a limit inside an integral.
Lipschitz Continuous
Impose smoothness on a function by bounding its first derivative by a function with finite integral. It leads to interchangeability of integration and differentiation.
See: Theorem 2.4.3 (pg 70), Corollary 2.4.4 and Examples 2.4.5 and 2.4.6
Families of Distributions
Check if pdf part of exponential family
where \(h(x) \geq 0\), \(t_i(x)\) are real-valued functions of \(x\), \(c(\mathbf{\theta}) \geq 0\) and \(w_i(\mathbf{\theta})\) are real-valued functions of the possibly vector-valued parameter \(\mathbf{\theta}\) which is independent of \(x\).
Here are some common exponential families:
Continuous: normal, gamma, beta
Discrete: binomial, poisson, negative binomial
A distribution which is a member of the exponential family has nice properties. For instance,
Expectations and Variance of exponential family pdf
Theorem: If \(X\) is a RV with pdf or pmf which is member of exponential family,
Definition: The indicator function of a set \(A\)
So, we can write the normal pdf (example 3.4.4) as
Since the indicator function is only a function of \(x\), it can be incorporated into the function \(h(x)\), showing that this pdf is of the exponential family form.
Another example is \(f(x|\theta) = \theta^{-1} \exp(1 - \frac{x}{\theta})\) on \(0 < \theta < x < \infty\). Although this expression can be put in the exponential family form, the indicator function \(I_{[\theta, \infty)}(x)\) depends on \(\theta\), so the pdf is not a member of the exponential family.
Chebychev’s inequality
Let \(X\) be a RV and let \(g(x)\) be a nonnegative function. Then, for any \(r>0\),
We usually set \(r = t^2\).
See: Example 3.6.2 and 3.6.3
Multiple Random Variables
Joint probability \(f_{X,Y}(x,y)\)
Definition: Discrete
Marginal probability is \(f_X(x) = \sum_{y \in \mathbf{R}} f_{X,Y}(x,y)\)
Definition: Continuous
where \(-\infty <x < \infty\)
Definition: Conditional
where \(\sum_y f(y|x) = 1\).
Independence properties
With \(U=g(X)\) and \(V=h(Y)\) where \(X\) and \(Y\) are independent and \(A_u = \{ x: g(x) \leq u\}\) and \(B_v = \{ y: h(y) \leq v\}\). Then,
Conditional expectation \(EX = E(E(X|Y))\)
Rewritten, we say \(E_X X = E_Y (E_{X|Y} (X|Y))\) because \(E(X|Y)\) is a rv (random in \(Y\)),
See: Example 4.4.5
Definition: Conditional variance identity
For any two rv \(X\) and \(Y\),
See: Example 4.4.8
Covariance and correlation
Covariance and correlation measure the strength of a relationship between two rv.
Covariance of \(X\) and \(Y\) is
This gives information regarding the relationship of \(X\) and \(Y\). Large positive values mean \(X\) and \(Y\) both go up together or down together. This value, however, struggles because, by itself, it is domain-specific. We can normalize by the variance to ensure the range of the metric… this is what correlation does.
Correlation of \(X\) and \(Y\) is
\(\rho_{XY}\) is also known as the correlation coefficient.
If \(X\) and \(Y\) are independent, then \(EXY = (EX)(EY)\) and therefore,
Note: It is invalid to say because \(Cov(X,Y)=0\), \(X\) and \(Y\) are independent. For example, if \(X \sim f(x-\theta)\) and \(Y\) is an indicator function \(Y = I(|X-\theta|<2)\), then \(Y\) and \(X\) are not independent but \(E(XY)\) ends up equaling \(EXEY\) so \(Cov(X,Y) = 0\).
Properties:
For any rv \(X\) and \(Y\),
\(-1 \leq \rho_{XY} \leq 1\)
\(|\rho_{XY}| = 1\) iff there exists \(a \neq 0\) and \(b\) st \(P(Y=aX+b) = 1\). If \(\rho_{XY} = 1\), then \(a > 0\), and if \(\rho_{XY} = -1\), then \(a<0\).
Definition: Multivariate variance
If \(X\) and \(Y\) are any two rv and \(a\) and \(b\) are any two constants, then
Note: if \(X\) and \(Y\) are independent rv then
Multivariate distributions
With \(\mathbf{X} = (X_1, ..., X_n)\) representing a sample space that is a subset of \(\mathbf{R}^n\)
and its expectation
The marginal pdf of any subset of the coordinates of \((X_1, ..., X_n)\) can be computed by integrating the joint pdf over all possible values of the coordinates
Multinomial distribution
Let \(n\) and \(m\) be positive integers and let \(p_1,..., p_n\) be numbers satisfying \(0 \leq p_i \leq 1\), \(i=1, ..., n\) and \(\sum_{i=1}^n p_i = 1\). Then the rv \((X_1, ..., X_n)\) has a multinomial distribution with \(m\) trials and cell probabilities \(p_1, ..., p_n\) if the joint pmf of \((X_1, ..., X_n)\) is
on the set of \((x_1, ..., x_n)\) st \(x_i\) is a nonnegative integer and \(\sum_{i=1}^n x_i = m\).
This corresponds to the following experiment: the experiment consists of \(m\) independent trials. Each trial results in one of \(n\) distinct possible outcomes. The probability of the \(i\)th outcome is \(p_i\) on every trial. And, \(X_i\) is the count of the number of times the \(i\)th outcome occurred in the \(m\) trials. For \(n=2\), this is just the binomial experiment in which each trial has \(n=2\) possible outcomes, \(X_1\) counts the number of “successes”, and \(X_2 = m - X_1\) counts the number of failures in \(m\) trials. In a general multinomial experiment, there are \(n\) possible outcomes to count.
Multinomial properties (similar to univariate)
In particular, if \(X_1, ..., X_n\) share the same distribution with mgf \(M_X(t)\), then
Corollary: Linear combination of independent distributions form the same (but multivariate) distribution
Let \(X_1, ..., X_n\) be mutually independent rv with mgfs \(M_{X_1}(t), ..., M_{X_n}(t)\). Let \(a_1, ..., a_n\) and \(b_1, ..., b_n\) be fixed constants. Let \(Z = (a_1X_1 +b_1) + ... + (a_n X_n + b_n)\). Then, the mgf of \(Z\) is
From this, we can conclude (for instance) that a linear combination of independent (say) normal rv is normally distributed.
Inequalities
Per Holder’s Inequality, if \(\frac{1}{p} + \frac{1}{q} = 1\), then \(|EXY| \leq E|XY| \leq (E|X|^p)^{1/p}(E|X|^q)^{1/q}\)
Cauchy-Schwarz’s inequality is a special case of Holder’s Inequality where \(p=q=2\):
The covariance inequality states that if \(X\) and \(Y\) have means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\), we can apply Cauchy-Schwarz’s Inequality to get
By squaring both sides, we get a useful property:
This can be modified (by setting \(Y \equiv 1\)) to state \(E|X| \leq \{ E|X|^p \}^{1/p}\) (with \(1<p<\infty\)).
Additionally, Liapounov’s Inequality takes this a step further by, for \(1<r<p\), if we replace \(|X|\) by \(|X|^r\) we obtain
and then we set \(s=pr\) (where \(s>r\)) and rearrange:
Also, Minkowski’s Inequality states that for two rvs \(X\) and \(Y\) and for \(1 \leq p < \infty\),
This just uses the normal triangle inequality property.
Lastly, Jensen’s Inequality says that for a rv \(X\), if \(g(x)\) is a convex function, then
Note: Equality holds iff, for every line \(a+bx\) that is tangent to \(g(x)\) at \(x=EX\), \(P(g(X) = a+bX)=1\).
Properties of Random Sample
Statistics
Definition: A statistic \(Y = T(X_1, ..., X_n)\) cannot be a function of a parameter of the distribution.
The sample mean is
The sample variance is
Theorem: With the definition defined like above, we know a few things:
\(\min_a \sum_{i=1}^n (x_i - a)^2 = \sum_{i=1}^n (x_i - \bar{x})^2\)
\((n-1)s^2 = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n \bar{x}^2\)
Lemma: Let \(X_1, ..., X_n\) be a random sample from a population and let \(g(x)\) be a function such that \(Eg(X_1)\) and \(\hbox{Var} g(X_1)\) exist. Then,
Theorem: With \(X_1, ..., X_n\) as a random sample and population with mean \(\mu\) and variance \(\sigma^2<\infty\),
\(E \bar{X} = \mu\)
\(\hbox{Var} \bar{X} = \frac{\sigma^2}{n}\)
\(E S^2 = \sigma^2\)
In relationship (c), we see why \(S^2\) requires a \(\frac{1}{n-1}\):
Distribution of statistics
Example: since \(\bar{X} = \frac{1}{n}(X_1 + ... + X_n)\), if \(f(y)\) is the pdf of \(Y = (X_1 + ... + X_n)\), then \(f_{\bar{X}}(x) = nf(nx)\) is the pdf of \(\bar{X}\). We can prove this using the transformation equation from chapter 4.3.2:
We say that \(\bar{X} = g(Y) = (1/n)Y\). Therefore, \(g^{-1}(\bar{X}) = n\bar{X}\)
\(f_{\bar{X}}(x) = f_{Y}(n\bar{X}) |n|\)
Additionally, this can be conducted for mgfs:
Since \(X_1, ..., X_n\) are iid, then this is true.
Theorem: However, if they are just from a random sample from a population with mgf \(M_X(t)\), then the mgf of the sample mean is:
This can be useful in cases where \(M_{\bar{X}}(t)\) is a familiar mgf, for instance:
Example: Distribution of the mean
Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) population. Then, the mgf of the sample mean is
This also works for a random sample of \(\gamma(\alpha,\beta)\) (see Example 5.2.8)
Definition: Check if pdf is member of exponential family
Suppose \(X_1, ..., X_n\) is a random sample from a pdf or pmf \(f(x|\theta)\), where
is a member of an exponential family. Define statistics \(T_1, ..., T_k\) by
where \(i=1,...,k\).
If the set \(\{(w_1(\theta) w_2(\theta), ..., w_k(\theta)), \theta \in \Theta\}\) contains an open subset of \(\mathbf{R}^k\), then the distribution of \((T_1,...,T_k)\) is an exponential family of the form
The open set condition eliminates a density such as the \(n(\theta, \theta^2)\) and, in general, eliminates curved exponential families.
Example: Sum of Bernoulli rvs
Suppose \(X_1, ..., X_n\) is a random sample from a \(Bernoulli(p)\). We know that
We can see that this is an exponential family where (if \(n=1\), because \(Bernoulli(p) \sim Binomial(1,p)\))
\(c(p) = (1-p)^n, 0<p<1\)
\(w_1(p) = \log(\frac{p}{1-p}), 0<p<1\)
\(t_1(x) = x\)
Thus, from the previous theorem, \(T_1(X_1,...,X_n) = X_1 + ... + X_n\). From the definition of a binomial distribution, we know that \(T_1\) has a \(binomial(n,p)\) distribution, which we have already shown is an exponential family. This verifies the theorem shown above.
Properties: of sample mean and variance
\(\bar{X}\) and \(S^2\) are independent random variables
\(\bar{X}\) has a \(n(\mu, \sigma^2/n)\) distribution
\((n-1)S^2/\sigma^2 \sim \chi_{n-1}^2\)
Some distribution facts
Facts: about \(\chi_p^2\) distribution with \(p\) dof
If \(Z\) is a \(n(0,1)\) rv, then \(Z^2 \sim \chi_1^2\)
If \(X_1, ..., X_n\) are independent and \(X_i \sim \chi_{p_i}^2\) then \(X_1 + ... X_n \sim \chi^2_{p_1 + ... p_n}\); so, independent chi-squared variables add to a chi-squared variable AND the dof also add.
Definition: Student’s t-distribution
Instead of looking at the \(n(\mu, \sigma^2)\),
where we can use our knowledge of \(\sigma\) and our measurement of \(\bar{X}\) as a basis to determine \(\mu\), we can look at a distribution where \(\mu\) and \(\sigma\) are unknown:
Properties: of t-distribution
Has no mgf because it does not have moments of all orders. If it has \(p\) degrees of freedom, it only has \(p-1\) moments. For instance, \(t_1\) has no mean and \(t_2\) has no variance.
Definition: F-distribution
Built by a ratio of variances. See Definition 5.3.6
This distribution has a few important corollaries:
If \(X \sim F_{p,q}\), then \(\frac{1}{X} \sim F_{q,p}\)
If \(X \sim t_q\), then \(X^2 \sim F_{1,q}\)
If \(X \sim F_{p,q}\), then \(\frac{p}{q} \frac{X}{1 + (p/q)X} \sim beta(p/2, q/2)\)
Order statistics
Organize the random variables by size: \(X_{(1)} \leq ... \leq X_{(n)}\)
We know the pdf of \(X_{(j)}\) of a random sample \(X_1, ..., X_n\) from a continuous population with cdf \(F_X(x)\) and pdf \(f_X(x)\) is
and the joint pdf of \(X_{(i)}\) and \(X_{(j)}\), \(1 \leq i \leq j \leq n\) is
Convergence
Definition: Convergence in probability
Weak law of large numbers (WLLN) says that \(\lim_{n\xrightarrow[]{}\infty}P(|\bar{X}_n - \mu| < \epsilon) = 1\). That is, \(\bar{X}_n\) converges in probability to \(\mu\). So, convergence in probability is \(\lim_{n\xrightarrow[]{}\infty} P(w \in S: |X_n(w) - X(w)| \geq \epsilon) = 0\) where \(w\) ranges over the outcomes in the set. I.e. for \(\Sigma X_i = 4\), the set could be \(\{(1,1,1,1), (2,2), \hbox{etc.}\}\)
Definition: Convergence almost surely
A stronger definition of convergence (it is allowed to fail only on a set with probability 0) says that a sequence of RVs converges almost surely to a random variable \(X\) if \(P(\lim_{n\xrightarrow[]{}\infty}|X_n - X| < \epsilon) = 1\). Moving the limit inside the probability gives it a more strict definition.
Definition: Convergence in distribution
A sequence of RVs converge in distribution to a random variable X if \(\lim_{n\xrightarrow[]{}\infty}F_{X_n}(x) = F_X(x)\). The CDFs converge.
Proving Tools
Definition: Slutsky’s Theorem
If \(X_n \xrightarrow{} X\) in distribution and \(Y_n \xrightarrow{} a\), a constant, in probability, then
\(Y_n X_n \xrightarrow{} a X\) in distribution
\(X_n + Y_n \xrightarrow{} X + a\) in distribution
Definition: Delta method
Let \(Y_n\) be a sequence of rvs that satisfies \(\sqrt{n} (Y_n - \theta) \xrightarrow{} n(0, \sigma^2)\) in distribution. For a given function \(g\) and a specific value of \(\theta\), suppose that \(g^\prime(\theta)\) exists and is not 0. Then, \(\sqrt{n}[g(Y_n) - g(\theta)] \xrightarrow{} n(0, \sigma^2[g^\prime(\theta)]^2)\) in distribution.
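A small simulation sketch of the delta method (the distribution and the function \(g\) below are arbitrary illustrative choices): with \(X_i\) exponential with mean 2 (so \(\sigma^2 = 4\)) and \(g(x) = x^2\), the variance of \(\sqrt{n}(g(\bar{X}_n) - g(\mu))\) should be close to \(\sigma^2 [g^\prime(\mu)]^2 = 64\).
[ ]:
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, n_rep = 2.0, 4.0, 1_000, 5_000
g = lambda x: x ** 2          # the transformation
dg = lambda x: 2 * x          # its derivative

xbar = rng.exponential(scale=mu, size=(n_rep, n)).mean(axis=1)
z = np.sqrt(n) * (g(xbar) - g(mu))
print("simulated variance   :", z.var())
print("delta-method variance:", sigma2 * dg(mu) ** 2)   # 4 * (2*2)^2 = 64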
Principles of Data Reduction
We are interested in methods of data reduction that do not discard important information about the unknown parameter \(\theta\) and methods that successfully discard information that is irrelevant as far as gaining knowledge about \(\theta\) is concerned.
Sufficiency: data reduction that does not discard information about \(\theta\) while achieving some summarization of the data
Likelihood: a function of the parameter, obtained by the observed sample, that contains all the information about \(\theta\) that is available from the sample
Equivariance: preserve important features of the model
Sufficiency
If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) then any inference about \(\theta\) should depend on the sample \(\mathbf{X}\) only through the value \(T(\mathbf{X})\). That is, if \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points such that \(T(\mathbf{x}) = T(\mathbf{y})\) then the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed.
A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\)
if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\)
if \(p(\mathbf{x}|\theta)\) is the joint pdf/pmf of \(\mathbf{X}\) and \(q(t|\theta)\) is the pdf/pmf of \(T(\mathbf{X})\) and if, for every \(\mathbf{x}\) in the sample space, the ratio \(p(\mathbf{x}|\theta)/q(T(\mathbf{x})|\theta)\) is constant as a function of \(\theta\) (aka, does not depend on \(\theta\)).
Definition: Determine if sufficient statistic
Factorization Theorem: (no prereq)
For exponential family pdfs:
where \(\mathbf{\theta} = (\theta_1, ..., \theta_d), d \leq k\). Then,
is a sufficient statistic for \(\mathbf{\theta}\).
Definition: Minimal sufficient statistic
General: (no prereq)
A sufficient statistic \(T(\mathbf{X})\) is a minimal sufficient statistic if, for any other sufficient statistic \(T^\prime(\mathbf{X})\), \(T(\mathbf{x})\) is a function \(T^\prime(\mathbf{x})\).
General: (no prereq)
Let \(f(\mathbf{x}|\theta)\) be the pmf or pdf of a sample \(\mathbf{X}\). Suppose there exists a function \(T(\mathbf{x})\) st, for every two sample points \(\mathbf{x}\) and \(\mathbf{y}\), the ratio \(f(\mathbf{x}|\theta) / f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) iff \(T(\mathbf{x}) = T(\mathbf{y})\). Then, \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\).
Therefore, show \(\frac{f(\textbf{x}|\theta)}{f(\textbf{y}|\theta)} = \frac{g^\prime(T^\prime(\textbf{x})|\theta)h^\prime(\textbf{x})}{g^\prime(T^\prime(\textbf{y})|\theta)h^\prime(\textbf{y})} = \frac{h^\prime(\textbf{x})}{h^\prime(\textbf{y})}\) does not depend on \(\theta\). Therefore, \(T(\textbf{x}) = T(\textbf{y})\). Thus, \(T(\textbf{x})\) is a function of \(T^\prime(\textbf{x})\), and \(T(\textbf{X})\) is minimal sufficient (Theorem 6.2.13).
Other properties
Definition: Ancillary statistics
A statistic \(S(\mathbf{X})\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic.
Prove that the statistic does not depend on \(\theta\). Derive \(f(T(X)|\theta)\) and check whether \(\theta\) is in it. Using Basu’s theorem, if \(T(\textbf{X})\) is a complete and minimal sufficient statistic, then \(T(\textbf{X})\) is independent of every ancillary statistic.
Definition: Complete statistic
Let \(f(t|\theta)\) be a family of pdfs or pmfs for a statistic \(T(\mathbf{X})\). The family of probability distributions is called complete if \(E_\theta g(T) = 0\) for all \(\theta\) implies \(P_\theta(g(T) = 0) = 1\) for all \(\theta\). Equivalently, \(T(\mathbf{X})\) is called a complete statistic.
General: (no prereq)
Basu’s theorem
For exponential pdfs:
where \(\mathbf{\theta} = (\theta_1, ..., \theta_k)\). Then the statistic
is complete as long as the parameter space \(\Theta\) contains an open set in \(\mathbf{R}^k\).
Likelihood
Let \(f(\mathbf{x}|\theta)\) denote the joint pdf/pmf of the sample \(\mathbf{X} = (X_1, ..., X_n)\). Then, given that \(\mathbf{X} = \mathbf{x}\) is observed the function of \(\theta\) is defined by
is called the likelihood function.
Definition: Likelihood principle
If \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points st \(L(\theta|\mathbf{x})\) is proportional to \(L(\theta|\mathbf{y})\), that is, there exists a constant \(C(\mathbf{x},\mathbf{y})\) st
for all \(\theta\), then the conclusions drawn from \(\mathbf{x}\) and \(\mathbf{y}\) should be identical.
If \(C(\mathbf{x},\mathbf{y}) = 1\), then the likelihood principle states that if two sample points result in the same likelihood function, then they contain the same information about \(\theta\). But, this can be taken further: the principle states that even if two sample points have only proportional likelihoods, then they contain equivalent information about \(\theta\). The plausibility can be observed by the proportion. For instance, if \(L(\theta_2|\mathbf{x}) = 2L(\theta_1|\mathbf{x})\) then \(\theta_2\) is said to be twice as plausible as \(\theta_1\).
The fiducial inference sometimes interprets likelihoods as probabilities for \(\theta\). That is, \(L(\theta|\mathbf{x})\) is multiplied by \(M(\mathbf{x}) = (\int_{-\infty}^\infty L(\theta|\mathbf{x})d\theta)^{-1}\) and then \(M(\mathbf{x})L(\theta|\mathbf{x})\) is interpreted as a pdf for \(\theta\) (if \(M(\mathbf{x})\) is finite).
Equivariance
If \(\mathbf{Y} = g(\mathbf{X})\) is a change of measurement scale st the model for \(\mathbf{Y}\) has the same formal structure as the model for \(\mathbf{X}\), then an inference procedure should be both measurement equivariant and formally equivariant.
Point Estimation
A point estimator is any function \(W(X_1, ..., X_n)\) of a sample; that is, any statistic is a point estimator.
There exist three ways of finding estimators: 1) Method of Moments, 2) Maximum Likelihood Estimation, and 3) Bayes’
Method of moments (MOM)
Let \(X_1, ..., X_n\) be a sample from a population with pdf/pmf \(f(x|\theta_1,...,\theta_k)\). MOM estimators are found by equating the first \(k\) sample moments to the corresponding \(k\) population moments, and solving the resulting system of simultaneous equations.
Maximum likelihood estimation (MLE)
Get \(L(\theta | \mathbf{x})\), then \(\ln L(\theta | \mathbf{x})\), then observe \(\frac{d}{d\theta} \ln L(\theta | \mathbf{x})\) domain and find the MLE. Check endpoints.
Bayes
\(\hbox{posterior} = \pi(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\pi(\theta) / m(\mathbf{x})\)
where \(f(\mathbf{x}|\theta)\pi(\theta) = f(\mathbf{x},\theta)\), the joint PDF and the marginal PDF \(m(\mathbf{x}) = \int f(x|\theta) \pi(\theta) d\theta\). \(\pi(\theta)\) is your prior distribution.
Definition: Let \(\mathbf{F}\) denote the class of pdfs or pmfs \(f(x|\theta)\) (indexed by \(\theta\)). A class \(\Pi\) of prior distributions is a conjugate family for \(\mathbf{F}\) if the posterior distribution is in the class \(\Pi\) for all \(f \in \mathbf{F}\), all priors in \(\Pi\) and all \(x \in \mathbf{X}\).
For instance, the beta family is conjugate for the binomial family. Thus, if we start with a beta prior, we will end up with a beta posterior.
Examples Finding Estimators
Example: Normal distribution
MOM
If \(X_1, ..., X_n\) are iid \(n(\theta,\sigma^2)\), then \(\theta_1 = \theta\) and \(\theta_2 = \sigma^2\). We have \(m_1 = \bar{X}, m_2 = \frac{1}{n} \sum X_i^2, \mu_1^\prime = \theta, \mu_2^\prime = \theta^2 + \sigma^2\), and hence we must solve
Solving for \(\theta\) and \(\sigma^2\) yields the MOM estimators:
Bayes
(Example 7.2.16) Let \(X \sim n(\theta, \sigma^2)\) and suppose that the prior distribution on \(\theta\) is \(n(\mu, \tau^2)\). Here, we assume all the parameters are known. The posterior distribution of \(\theta\) is also normal with mean and variance given by
Step 1: evaluate the posterior \(\pi(\theta|\vec{x}) = \frac{f(x|\theta) \pi(\theta)}{m(x)}\). Or, since \(m(x)\) is not dependent on \(\theta\), we can just evaluate the joint distribution \(f(x,\theta) = f(x|\theta) \pi(\theta)\).
Step 1b: Find \(m(x) = \int f(x|\theta) \pi(\theta) d\theta = \int f(x,\theta) d\theta\).
We can continue to boil down the joint \(f(x,\theta)\)
Therefore, we can find the mean and variance
When \(\tau^2\) is near 0, the weight on \(\bar{x}\) is 0 and the weight on \(\mu\) is 1.
The normal family is its own conjugate.
Example: Let \(X_1, ..., X_n\) be iid \(binomial(k,p)\), that is,
on \(x = 0,1,...,k\).
We desire point estimators for both \(k\) and \(p\). We start by equating the sample moments to the population moments:
Or more simply,
With these definitions, we can build parameter estimates
Metrics for evaluating estimators
Definition: Mean squared error (MSE)
The MSE of an estimator \(W\) of a parameter \(\theta\) is a function of \(\theta\) defined by \(E_\theta (W - \theta)^2\).
This is usually a balancing act.
Note: for an unbiased (\(Bias_\theta = 0\)) estimator, we have
Example: Normal MSE
Let \(X_1, ..., X_n\) be iid \(n(\mu, \sigma^2)\). The statistics \(\bar{X}\) and \(S^2\) are both unbiased estimators since
for all \(\mu\) and \(\sigma^2\). This is always true!!
This is true without the normality assumption. The MSE of these estimators are
This goes to 0 as \(n \xrightarrow{} \infty\).
For a non-normal case, this is \(MSE_{\sigma^2}(S_n^2) = \frac{1}{n}(\theta_4 - \frac{n-3}{n-1} \theta_2^2)\)
Example: \(MSE(\hat{\sigma}^2) < MSE(S^2)\)
We know the MLE is \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2\), with \(E\hat{\sigma}^2 = \frac{n-1}{n} \sigma^2\).
So, \(\hat{\sigma^2}\) is a biased estimator of \(\sigma^2\). The variance of \(\hat{\sigma^2}\) can be calculated
because \(Var S^2 = \frac{1}{n}(\theta_4 - \frac{n-3}{n-1} \theta_2^2)\), from above.
Therefore, the \(MSE\) is given by
Conclusions about MSE:
A small increase in bias can be traded for a large decrease in variance, resulting in a smaller MSE.
Because MSE is a function of the parameter, there is often not one best estimator. Often, the MSEs of two estimators will cross each other, showing that each estimator is better (wrt the other) in only a portion of the parameter space.
This second bullet point is why we discuss other tactics of finding the best estimator… see next!
Definition: Best unbiased estimator
We want to recommend a candidate estimator. Specifically, we consider unbiased estimators. So, if both \(W_1\) and \(W_2\) are unbiased estimators of a parameter \(\theta\), that is, \(E_\theta W_1 = E_\theta W_2 = \theta\), then their MSE are equal to their variances, so we should choose the estimator with the smaller variance.
Theoretical Statistics Questions
Is a pdf part of the exponential family?
Equation 3.4.1
Is a statistic complete?
If distribution part of the exponential family:
Theorem 6.2.25
If not:
Theorem 6.2.21
Is a statistic sufficient?
If distribution part of the exponential family:
Theorem 6.2.10
If not:
Factorization theorem (Theorem 6.2.6)
Is a statistic unbiased?
\(E[T] = \int_{x \in R} x g_T(X|\theta) dx\)
If \(E[g(T)] = \theta\), then unbiased.
To find the expected value, it is sometimes helpful to go through the mgf and use \(M_{\sum X_i}(t) = M_{X_i}(t)^n\); for instance, if you know the pdf of \(X_i\) but the statistic is \(\sum X_i\).
Does a statistic converge in probability as \(n \xrightarrow{} \infty\)
Convergence in probability (Definition 5.5.1)
But, we are looking at the difference of \(T\) and \(\theta\) instead.
Usually, like this:
\(P(|T - \theta| \geq \epsilon) = P((T - \theta)^2 \geq \epsilon^2) \leq \frac{E[(T - \theta)^2]}{\epsilon^2}\), which equals \(Var[T] / \epsilon^2\) when \(T\) is unbiased.
So, calculate the variance (\(VarT = E[T^2] - E[T]^2\)) and check whether this bound (Markov’s inequality, Lemma 3.8.3, applied in Example 5.5.3) vanishes:
\(\lim_{n\xrightarrow{}\infty} [Var[T] / \epsilon^2] = 0\)
then it converges in probability, so \(T \xrightarrow[p]{} \theta\).
Find unique best unbiased estimator of \(\theta\).
Theorem 7.3.23
Find suff. statistic \(T\). Get its distribution \(g(t|\theta)\). Find \(E[T(\vec{x})] = \hbox{ (say) } 2n\theta\). With this, we solve for \(\theta\). This becomes \(\phi(T) = \frac{T}{2n}\). Then, verify \(E[\phi(T)] = \theta\).
Find MLE
Solution not solvable directly (i.e. pdf has indicator function with support containing parameter)
Look at function and think about maximum (yes.. it’s vague).
Normal solutions
Find \(\frac{d}{d\theta} [\ln L(\theta|\vec{x})] = 0\), confirm \(\frac{d^2}{d\theta^2} [\ln L(\theta=\hat{\theta}|\vec{x})] < 0\), evaluate boundary conditions \(\lim_{\theta \xrightarrow{} \hbox{ upper}} [L(\theta|\vec{x})] = \lim_{\theta \xrightarrow{} \hbox{ lower}} [L(\theta|\vec{x})] = 0\)
Find MOM
Define pdf/pmf-specific expected value:
\(\mu = E(X^1) = \int \hbox{ or } \sum = \hbox{func}(\theta)\)
Set that equal to \(\frac{1}{n} \sum_{i=1}^n x_i^1\)
solve for \(\theta\).
If solving for variance, do the same but use \(X^2\) in both places.
Find CRLB (variance bound)
7.3.10 and, if exponential fam, 7.3.11 helps.
If it’s a series of n RV’s, make sure to multiply by n (e.g. 7.3.12)
Check regularity condition: \(E[\frac{d}{d\theta} \ln p(x;\theta)] = 0\)
Find statistic at CRLB
7.3.15 (but have to prove W(X) is unbiased)
Can calculate Var(W(X)) for variance lower bound also.
Best unbiased of \(\tau(\theta)\)
Rao-Blackwell (7.3.17)
Find complete suff stat \(T\) for \(\theta\)
Compute \(E(T)\)
If \(E(T) = a + b \tau(\theta)\), then the UMVUE of \(T(\theta)\) is \(\phi(\tau) = \frac{\tau-a}{b}\)
If \(E(T) \neq a + b \tau(\theta)\). Find an unbiased estimator of \(\tau(\theta)\). Compute \(\phi(T) = E[W|T]\).
Find an LRT of size 0.05
Find the likelihood function, find MLE \(\hat{\theta}\), build \(\lambda = \frac{L(\theta_0|\vec{x})}{L(\hat{\theta}|\vec{x})}\).
evaluate boundary conditions \(\lim_{\theta \xrightarrow{} \hbox{ upper}} [L(\theta|\vec{x})] = \lim_{\theta \xrightarrow{} \hbox{ lower}} [L(\theta|\vec{x})] = 0\)
Derive level \(\alpha\) UMP test of \(H_0\) and \(H_1\).
Find sufficient statistic, \(T\), (either Factorization theorem or Exponential family proxy)
Find distribution \(g(T|\theta)\)
The suff. stat \(T\) is obtained via the transformation \(Y = g(X) = T(\vec{x})\); remove the summations from it during this initial transformation. Once transformed, you usually want to find the distribution it follows and then apply the summations afterwards using properties of that distribution (e.g., the summation of exponentials is a gamma).
Verify MLR with Definition 8.3.16, usually want to take \(\frac{d}{dt}\) and verify if always positive or negative.
Apply Karlin-Rubin with Theorem 8.3.17.
If \(\theta \geq \theta_0\), then the rejection region \(R = \{ \vec{x}: T(\vec{x}) < t_0\}\) and \(\alpha = P_{\theta_0}(T(\vec{x}) < t_0 | \theta \geq \theta_0) = F_T(t|\theta \geq \theta_0) = \int_{-\infty}^{t_0} g(T|\theta) dt\).
Find a pivot quantity and its distribution
Find MLE.
Find pdf of MLE.
Transform this pdf \(X \xrightarrow{} Z\) (where the transformation \(Z = g(X;\theta)\) is the pivot). This pivot \(Z\) should be dependent on both \(X\) and \(\theta\).
Show this pdf \(f_Z(z)\) is not dependent on the parameter \(\theta\)
Use equation 9.2.11.
Find a pivotal interval of \(\theta\) w/ confidence coeff \((1-\alpha)\)
\(P_\theta(a \leq Q(\vec{X},\theta) \leq b) \geq 1-\alpha\). Then, replace \(Q(\vec{X},\theta)\) with the actual pivot (e.g. \(Q(\vec{X},\theta) = \frac{X}{\theta}\)) and then solve for \(\theta\). This is the confidence interval. If the interval is symmetric, you can make this problem simpler by \(P_\theta(Q(\vec{X},\theta) \leq a) = P_\theta(Q(\vec{X},\theta) \geq b) = \alpha/2\)
Find smallest pivotal interval with CI \((1-\alpha)\)
See Theorem 9.2.12 (pivoting the CDF)
Evaluate the CDF of the \(Q(\vec{X},\theta)\) (take the integral from lower to x)… let’s call this \(F_{Q(\vec{X},\theta)}(x) = \int_a^x f(t)dt\)
We have \(P(a \leq Q(\vec{X},\theta) \leq b) = 1 - \alpha\)
CI region: \(\theta\) is \(\{\theta: Q^{-1}(\vec{x},a) \leq \theta \leq Q^{-1}(\vec{x},b) \}\)
CI size: \(P(a \leq Q(\vec{x},\theta) \leq b) = F_{Q(\vec{X},\theta)}(b) - F_{Q(\vec{X},\theta)}(a) = 1 - \alpha\)
Lastly, plug in the end points.
So, if (say) \(x \in (0,1)\)
Use upper bound: then plug in \(b=1\) and solve for \(a\) using the CI size expression. Plug the a and b values into the CI region expression
Use lower bound: then plug in \(a=0\) and solve for \(b\) using the CI size expression. Plug the a and b values into the CI region expression
Look at the interval which is smaller. And conclude that that’s the interval to choose.
Natural Language Processing
Graph neural networks at scale
Introduction
GNNs are unique because they leverage connectivity in addition to the typical \((X, y)\) pairs for supervised learning. A GNN utilizes a graph \(g = (X,A,y)\) to conduct a task (e.g. node classification), where \(X \in \mathbf{R}^{n\times d}\) are node features, \(A \in \mathbf{R}^{n \times n}\) are adjacencies, and \(y \in \mathbf{R}^{n,k}\) are labels. Often, an adjacency matrix is defined as \(A_{v,u} = \begin{cases}1 & u \xrightarrow{} v\\ 0 & otherwise \end{cases}\). However, it is not rare to see an adjacency matrix whose values are float scalars or vectors.
Graph convolutional networks (Kipf and Welling, 2017)
At every layer of the GNN, we transform the information at the node level into a hidden representation. We stack the transformations into multiple layers and then use the final hidden space to map to our final value, based on the task.
where \(S H^{(l - 1)}\) is the aggregation (in this case, average since \(S\) is a symmetrically normalized adjacency matrix) of current representations of neighbor nodes.
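A minimal numpy sketch of one such layer (a toy example under the formulation above, not code from the lecture): build the symmetrically normalized adjacency with self-loops, aggregate neighbor features, then apply a linear transform and a nonlinearity.
[ ]:
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 4, 8                        # nodes, input features, hidden size
A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                   # undirected toy graph
A_hat = A + np.eye(n)                    # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
S = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetrically normalized adjacency

X = rng.normal(size=(n, d))              # node features H^(0)
W = rng.normal(size=(d, h))              # layer weights
H1 = np.maximum(S @ X @ W, 0)            # H^(1) = ReLU(S H^(0) W)
print(H1.shape)                          # (n, h)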
This is the typical process for all GNNs. The variants change what message to send, how to aggregate messages, and how to transform the aggregated message. These changes are usually motivated by additional knowledge about the process generating the graph.
This shared structure is formalized in:
Message Passing Neural Networks (Gilmer et al 2017)
For each incoming edge, generate a message to send based on the current representation of the endpoints and edge (if any)
Aggregate all incoming messages at a node to reduce to a single vector
Apply a transformation to the aggregated messages to update the node’s representation
GNN Scaling issues
This section will focus on scaling GNNs to large data sets. In general, GNNs are not very large models (in terms of # of parameters), especially compared to the recent language models. The key bottleneck is not so much the size of the model but fitting the data into GPU memory. GPU memory pressure comes from the recursive structure of the message-passing computation that happens at each layer.
What needs to be on the GPU when training a GNN?
Model parameters (\(W^1\), \(W^2\))
Input data (e.g., X, y)
Intermediate outputs (e.g. \(H^1\), \(H^2\))
Gradients
Typically, mini-batch training is used to sample a small number of samples from the training dataset at every iteration before computing a model update. That reduces the size of the input data during training.
Overview
In GNNs, minibatches don’t give us an easy win. In order to compute the output of a given node, we have to collect nodes from its neighborhood (e.g. k-hop neighbors). This can cover a large portion of the graph if the graph is dense. “Small world” says that the distance between any two nodes in a graph only grows logarithmically with the number of nodes in the graph (e.g. if we have a graph with 10x the number of nodes, we only expect to increase the distance between two nodes by 1 hop). So, the implication is that the K-hop receptive fields for a minibatch of nodes can increase the memory requirements on the GPU by several orders of magnitude (beyond what we’d need just for the target nodes). So in practice, we don’t get the same benefit of using minibatches directly in GNNs because of the inter-dependence of samples. This is called the receptive field problem.
Another way to get around GPU issues is to use distributed GPU learning. Data parallelism will partition the graph into parts and then pass each partition to a separate GPU. The same model architecture is trained on each partition and every once in a while they are synced by averaging (for instance) the model parameters across the different model instances. In general, this is not easy to implement because if we partition the nodes, then there will be edges between nodes on different partitions. So, what do we do if we have to collect a message from a neighbor that lives on another partition? We can query data from different GPUs but that’s expensive. In general, data parallelism is hard to implement. As a note, if the graph is small enough to fit in host CPU memory for each GPU, then data parallelism is great to speed up GNN training.
Another way is model and pipeline parallelism. This is often used when a model is really large (for instance, large language models).
An alternative is to take a subgraph of the graph in each batch and use that for training.
Solutions
Message Flow Graphs
GraphSage (Hamilton et al 2017)
Starting with the last layer, sample k neighbor nodes to use from previous layer to compute the representation of the target node
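A plain-Python sketch (not the DGL or PyG sampler APIs) of how such a message flow graph could be built by recursively sampling a fixed number of neighbors per layer; the adjacency list and fanouts are made up:

```python
import random

def sample_mfg(adj, target, fanouts, seed=0):
    """Recursively sample neighbors layer by layer, starting from the target node.

    adj: dict mapping node -> list of neighbor nodes
    fanouts: neighbors to sample per layer, e.g. [2, 2] for a 2-layer GNN
    Returns a list of frontiers; frontiers[0] are the nodes needed at the input layer.
    """
    random.seed(seed)
    frontiers = [{target}]
    for k in fanouts:                         # walk from the output layer down to the input
        prev = frontiers[0]
        frontier = set(prev)                  # keep the nodes themselves (self-loops)
        for v in prev:
            neighbors = adj.get(v, [])
            frontier.update(random.sample(neighbors, min(k, len(neighbors))))
        frontiers.insert(0, frontier)
    return frontiers

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_mfg(adj, target=0, fanouts=[2, 2]))
```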
Subgraphs
ClusterGCN (Chiang et al. 2019)
Partition nodes into K groups to find “dense subgraphs” (the paper and DGL use METIS). For instance, we partition our graph into K groups, then ClusterGCN samples k groups and induces the subgraph. The intuition is that if we only pass messages between nodes from one partition, we may get a pretty good approximation of the true output because the partition is a dense subset of the graph. But occasionally we want to reach across partitions and include info from other partitions. So it controls the number of nodes on the GPU but still gives a reasonably representative sample of the true graph.
GraphSAINT: Node Sampler (Zeng et al 2020)
Sample nodes proportional to their out-degree and induce subgraph
GraphSAINT: Edge Sampler (Zeng et al 2020)
Instead of sampling nodes and then including any edges between that subset of nodes, we are going to sample edges directly. So, sample edges with probability proportional to the sum of the reciprocal of the source node’s out-degree and the destination node’s in-degree.
Once we have a sample of edges, we induce a subgraph from the nodes that appear in at least one sampled edge.
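A rough numpy sketch of that sampling rule (the exact normalization and whether sampling is done with or without replacement may differ from the paper; the toy edge list is made up):

```python
import numpy as np

def saint_edge_sample(edges, out_deg, in_deg, n_edges, rng=None):
    """Sample edges with prob ∝ 1/out_deg(src) + 1/in_deg(dst), then induce the node set.

    edges: array of shape (E, 2) with (src, dst) pairs
    """
    rng = rng or np.random.default_rng(0)
    src, dst = edges[:, 0], edges[:, 1]
    p = 1.0 / out_deg[src] + 1.0 / in_deg[dst]
    p = p / p.sum()
    idx = rng.choice(len(edges), size=n_edges, replace=False, p=p)
    sampled = edges[idx]
    nodes = np.unique(sampled)                # nodes appearing in at least one sampled edge
    return sampled, nodes

edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0], [0, 2]])
out_deg = np.array([2, 1, 1, 1]); in_deg = np.array([1, 1, 2, 1])
print(saint_edge_sample(edges, out_deg, in_deg, n_edges=3))
```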
GraphSAINT: Random Walk Sampler (Zeng et al. 2020)
Randomly choose \(n\) root nodes (with replacement) and then start length-\(k\) random walks from each root. Then, we induce a subgraph from the union of all visited nodes, including the roots.
ShaDowKHop (Zeng et al 2021)
Induce a subgraph from the K-hop neighborhood of sampled root nodes. E.g. sample 2 neighbors at layer 1 and 2 neighbors at layer 2, and induce the subgraph. This is really similar to GraphSage. The key difference is that in GraphSage, we use the message flow graphs to only pass the messages needed to compute the outputs of the target nodes. In ShaDowKHop, we keep the extra edges that may exist between sampled nodes (even ones at the same distance from the target node) and allow messages to be exchanged across them.
Which to use?
Hard to predict, but if the graph is too large, use GraphSage with as many neighbors as possible before you run out of GPU memory. When refining hyperparameters, treat the minibatch sampler as a hyperparameter and run experiments to see which is best for the problem.
Recent solutions in literature
Some recent extensions have been proposed in the literature, namely
reduce the variance of sample approximations
maximize “embedding utilization” (use each node more than once if possible) using layer sampling
learn to select which neighbors to include in the approximation based on the loss (similar to graph attention networks but avoiding having to compute on full graph)
Overview
Batch training
Batch training is usually written mathematically as follows (symbols first, then the objective):
\(\mathcal{V}\): Node set
\(|\mathcal{V}|\): Number of nodes in the graph
\(Z_v^L\): Model output for node \(v\)
\(y_v\): Label for node \(v\)
\(l\): loss function
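Presumably the objective is the usual average loss over all labeled nodes:
\[
\mathcal{L}_{\text{batch}} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} l\bigl( Z_v^L, y_v \bigr)
\]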
Minibatch training w/ full neighbors
\(\mathcal{B}\): Minibatch of randomly selected nodes from \(\mathcal{V}\)
\(|\mathcal{B}|\): Number of sampled nodes in minibatch
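With these definitions, the minibatch objective is presumably the same average taken over the sampled nodes only:
\[
\mathcal{L}_{\text{MBFN}} = \frac{1}{|\mathcal{B}|} \sum_{v \in \mathcal{B}} l\bigl( Z_v^L, y_v \bigr)
\]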
Minibatch training with full neighbors (MBFN) is an unbiased estimator of the full-batch loss.
Minibatch training w/ sampled neighbors
\(\hat{\mathcal{N}}_L(v)\): Recursively sampled neighbors from \(v\)’s \(L\)-hop neighborhood
\(\hat{Z}_v^L\): Approximate model output for \(v\) based on sampled neighbors
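The sampled-neighbor objective then presumably replaces the exact outputs with their sampled approximations:
\[
\mathcal{L}_{\text{MBSN}} = \frac{1}{|\mathcal{B}|} \sum_{v \in \mathcal{B}} l\bigl( \hat{Z}_v^L, y_v \bigr)
\]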
Variance reduced GCN (Chen et al 2018)
In general, the MBSN loss for GCN training is biased because the nonlinear activations do not commute with the expectation over sampled neighbors, so \(\mathbf{E}[\hat{Z}_v^L] \neq Z_v^L\) even though each layer’s sampled aggregation is unbiased on its own.
If the number of neighbors sampled at each hop from the target node is large, then the bias is small (the intuition follows the continuous mapping theorem). A large sample, however, puts more memory pressure on the GPU. The goal is to reduce the bias by reducing the variance without sampling a larger number of nodes.
The way they do that is as follows:
Message to node \(v\) at layer \(l\) is:
With neighbor sampling, the message is approximated with
where \(\mathcal{S}_l(v)\) is the sample of neighbors of node \(v\). Approximations of the neighbors’ representations are used as well.
To reduce the variance, Chen et al. propose a control variate based on the historical embeddings of a node’s neighbors at the previous layer.
They noticed we can break each previous representation into a historical value and the difference between the historical and current value:
This history value \(\hat{H}_u^{(l-1)}\) is assumed to be known: we keep a cache of the layer-\((l-1)\) representations for all nodes and push updates to it as the model trains. Then we compute the difference \(\Delta H_u^{(l-1)} = H_u^{(l-1)} - \hat{H}_u^{(l-1)}\).
If we only sample the deltas, then we are sampling vectors with smaller overall norm, which means smaller variance; this is exactly the control variate effect. The deltas shrink as training converges, so the variance shrinks too.
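Putting the pieces together, the control-variate estimate of the message is roughly (up to the exact scaling used in the paper):
\[
\sum_{u \in \mathcal{N}(v)} S_{vu} H_u^{(l-1)}
\;\approx\;
\sum_{u \in \mathcal{N}(v)} S_{vu} \hat{H}_u^{(l-1)}
\;+\;
\frac{|\mathcal{N}(v)|}{|\mathcal{S}_l(v)|} \sum_{u \in \mathcal{S}_l(v)} S_{vu} \, \Delta H_u^{(l-1)}
\]
where the first (historical) term is computed exactly from the cache and only the second (delta) term is sampled.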
This uses more CPU memory (from the cache) instead of GPU memory.
Adaptive Sampling GCN (Huang et al 2018)
Even when we use GraphSage to subsample the number of neighbors at each layer, the number of intermediate nodes is still exponential in the number of layers (and we need large samples to reduce bias). Key goal: Make the number of intermediate nodes linear in the number of layers by selecting a fixed set of nodes in the previous layer for approximating messages for all nodes in the current layer. So, instead of having a multiplicative effect at every layer that we add (causing exponential effect), we add same number of nodes at every layer independent of the number of nodes at previous layer, giving us a linear dependence. So, how do we choose a good fixed set of nodes at each layer to help approximate the representation of the layer above?
The proposal distribution that minimizes the variance for node \(v\):
To minimize variance, we prefer to sample nodes \(u\) that have a large L2 norm in representation on the previous layer.
There are two problems with this:
the optimal proposal distribution is specific to \(v\) (we want to sample from a common \(q\), not a distribution specific to each node \(v\), since the per-node neighbor sets overlap and would produce duplicates)
we need to compute the hidden representation at the previous layer for all neighbors of all nodes at the current layer (the thing we were trying to avoid!)
To sidestep these two issues, they introduce this proposal distribution:
where \(W\) is a learnable \(1 \times d\) matrix. So, we are trying to predict the norm of the hidden representation at layer \(l-1\) given the known feature values for node \(u\).
They augment the learning objective by minimizing the minibatch with sampled neighbors objective (from before) plus a variance penalty:
The variance is only computed in the final layer (not computing variance at each layer… just at last layer \(L\)). This is interesting because 1) we share/compact the nodes that we use to help approximate the last layer, 2) having a learnable distribution when choosing neighbors at every layer as we go down the stack.
Performance Adaptive Sampling (Yoon et al 2021)
Like AS-GCN, this method introduces a learnable distribution over neighbors. The key difference is PASS optimizes the distribution to directly improve performance instead of minimizing variance.
Each pre-activation hidden representation is
where \(\mathcal{S}_l(v)\) is a set of neighbors sampled using a learned distribution. Specifically, the learned policy is a mixture of random sampling and “affinity” sampling. So this is a tradeoff of exploration and exploitation.
The proposal distribution at layer \(l\) over \(u\) given \(v\) is:
where \(a_s\) is an attention matrix. The affinity weight is a dot product of the vectors obtained after passing the nodes’ hidden representations through the linear transformation \(W_s\). The sampling operation is non-differentiable, so they use REINFORCE to update the sampling distribution parameters, which avoids having to backpropagate through the sampling step.
Contextual Bandits
Sequential learning environment!
The agent interacts with the environment. In the beginning there is no data; in each round, an action is taken and the environment gives feedback.
[2]:
from utils import disp
disp('bandits_intro.drawio.png')

The environment sends noisy feedback. The agent has to learn a policy, i.e. a rule for choosing the next action given the information observed so far.
In a real-world setting, the environment can be a user. In this case, feedback passes through a delay buffer: after a while, the feedback is sent to the agent, which then updates the policy.
How can we learn an efficient policy for the agent? And, how do we deal with delays?
Linear Bandits
In round \(t\), observe action set \(\mathcal{A}_t \subset \mathcal{R}^d\)
The learner chooses \(A_t \in \mathcal{A}_t\) and receives \(X_t\), satisfying \(X_t = \bigl< A_t, \theta_\star \bigr> + \eta_t\)
for some unknown \(\theta_\star\).
Light-tailed noise: \(X_t - \bigl< A_t, \theta_\star\bigr> = \eta_t \sim N(0,1)\)
Goal: Keep regret small.
Regret is the cumulative sum of instantaneous losses, where the loss in round \(t\) is the gap between the reward you would have obtained by playing the best available action and the reward of the action you actually played: \(R_n = \sum_{t=1}^{n} \bigl( \max_{a \in \mathcal{A}_t} \bigl< a, \theta_\star \bigr> - \bigl< A_t, \theta_\star \bigr> \bigr)\).
The action set \(\mathcal{A}_t\) can be an arbitrary subset of \(\mathcal{R}^d\), possibly selected by an adversary.
Real-world setting
A typical setting is that a user, represented by a feature vector \(u_t\), shows up and we have a finite set of (correlated) actions \((a_1, ..., a_k)\).
Some function \(\phi\) joins these vectors pairwise to create a contextualized action set:
\(\forall i \in [K]\), \(a_{i,t} = \phi(u_t, a_i) \in \mathcal{R}^d\)
No assumption needs to be made on the joining function \(\phi\), as the bandit takes over the decision step from that contextualized action set.
So, it is equivalent to assume \(\mathcal{A}_t \sim \Pi (\mathcal{R}^d)\) for some arbitrary distribution \(\Pi\), or \(\mathcal{A}_1, ..., \mathcal{A}_n\) fixed arbitrarily by the environment.
We want an algorithm which is robust to any choice (even sampled from any unknown distribution).
Toolbox of the optimist
To minimize the regret…
Say, reward in round \(t\) is \(X_t\), action in round \(t\) is \(A_t \in \mathcal{R}^d\):
We want to estimate \(\theta_\star\): regularized least-squares estimator:
Choice of confidence regions (ellipsoids) \(\mathcal{C}_t\):
where, for \(A\) positive definite, \(||x||_{A}^2 = x^\intercal A x\).
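Up to indexing conventions, the standard choices (this is a reconstruction following Abbasi-Yadkori et al. and the Lattimore–Szepesvári book cited below) are:
\[
V_t = \lambda I + \sum_{s=1}^{t} A_s A_s^\intercal,
\qquad
\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} X_s A_s,
\qquad
\mathcal{C}_t = \Bigl\{ \theta \in \mathcal{R}^d : \bigl\| \theta - \hat{\theta}_{t-1} \bigr\|_{V_{t-1}}^2 \leq \beta_t \Bigr\}.
\]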
Immediately, we can see that \(\beta_t\) is a very important hyperparameter.
LinUCB
“Choose the best action in the best environment amongst the plausible ones.”
Choose \(\mathcal{C}_t\) with suitable \((\beta_t)_t\) and let
Or, more concretely, for each action \(a \in \mathcal{A}\), compute the “optimistic index”
Maximising a linear function over a convex closed set, the solution is explicit:
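Spelled out (with \(\sqrt{\beta_t}\) because the ellipsoid above is defined through the squared norm), the index and the chosen action are:
\[
U_t(a) = \max_{\theta \in \mathcal{C}_t} \bigl< a, \theta \bigr>
       = \bigl< a, \hat{\theta}_{t-1} \bigr> + \sqrt{\beta_t} \, \|a\|_{V_{t-1}^{-1}},
\qquad
A_t \in \argmax_{a \in \mathcal{A}_t} U_t(a).
\]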
At round \(t\), I have an estimator \(\hat{\theta}_t\) (red dot) which is encompassed by a confidence ellipsoid (here it is round because we have Tikhonov regularization, which uses \(\lambda I\)), and for each green dot (i.e. action, or arm) we compute an index and choose the action \(a_{1,t} = A_t\) (red circle around the green dot) that has the best optimistic scalar product with \(\theta\).
[3]:
disp('bandit_uncertainty_principle.png')

In the beginning, the confidence ellipsoid is round. As we repeatedly pull actions in the direction that gives a lot of reward, the ellipsoid shrinks along that direction and becomes elongated. Sometimes the optimistic index selects an action in a poorly explored (possibly bad) direction; playing it reduces the uncertainty there and makes the ellipsoid rounder again. This is how exploration vs. exploitation is regulated.
Exploitation pulls in the direction that elongates the ellipsoid, and exploration is when we reshape the ellipsoid of confidence.
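A minimal numpy sketch of LinUCB as described above; the toy environment, the constant \(\beta\), and all dimensions are made-up illustration choices rather than anything from the lecture.

```python
import numpy as np

def linucb(n_rounds, d, theta_star, lam=1.0, beta=2.0, n_arms=10, seed=0):
    """Run LinUCB on a toy linear bandit with Gaussian reward noise."""
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)                       # regularized design matrix V_t
    b = np.zeros(d)                           # running sum of X_s * A_s
    regret = 0.0
    for _ in range(n_rounds):
        arms = rng.normal(size=(n_arms, d))
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)    # bounded actions (L = 1)
        theta_hat = np.linalg.solve(V, b)     # regularized least-squares estimate
        V_inv = np.linalg.inv(V)
        # optimistic index: <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}}
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        a = arms[np.argmax(arms @ theta_hat + np.sqrt(beta) * widths)]
        x = a @ theta_star + rng.normal()     # noisy reward
        V += np.outer(a, a)
        b += x * a
        regret += np.max(arms @ theta_star) - a @ theta_star
    return regret

theta_star = np.array([0.6, -0.8, 0.0])
print("cumulative regret:", linucb(n_rounds=2000, d=3, theta_star=theta_star))
```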
[4]:
disp('bandits_uncertainty_principle_laterepoch.png')

Regret bound
So what is the regret of LinUCB?
We first make a few assumptions that the rewards and parameters are bounded, namely:
Bounded scalar mean reward: \(|\bigl< a, \theta_\star \bigr>| \leq 1\) for any \(a \in \cup_t A_t\)
Bounded actions: for any \(a \in \cup_t \mathcal{A}_t\), \(||a||_2 \leq L\)
Honest confidence intervals: There exists a \(\delta \in (0, 1)\) such that with probability \(1 - \delta\) for all \(t \in [n]\), \(\theta_\star \in \mathcal{C}_t\) for some choice of \((\beta_t)_{t \leq n}\)
Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability \(1 - \delta\), the regret of LinUCB satisfies
Proof
Jensen’s inequality shows that
where \(A_t^\star \dot{=} \argmax_{a \in \mathcal{A}_t} \bigl< a, \theta_\star \bigr>\) is the optimal action.
Let \(\tilde{\theta}_t\) be the vector that realizes the maximum over the ellipsoid:
\(\tilde{\theta}_t \in \mathcal{C}_t\) s.t. \(\bigl< A_t, \tilde{\theta}_t \bigr> = U_t(A_t)\)
From the definition of LinUCB,
Then, (by Cauchy–Schwarz with the \(V_t\)-norm)
So now we know the bound of \(r_t\) and can use it in Jensen’s inequality.
Elliptical Potential Lemma
So we now have a new upper bound,
where \(\bigl(1 \wedge ||A_t||^2_{V_{t-1}^{-1}} \bigr)\) is the squared \(V_{t-1}^{-1}\)-norm of the action, clipped at 1.
Lemma (Abbasi-Yadkori et al. (2011))
Let \(x_1, ..., x_n \in \mathcal{R}^d\), \(V_t = V_0 + \sum_{s=1}^t x_s x_s^\intercal\), \(t \in [n]\) and \(L \geq \max_t ||x_t||_2\). Then,
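(taking \(V_0 = \lambda I\); this is the standard statement, up to constants, from the book chapter cited below)
\[
\sum_{t=1}^{n} \Bigl( 1 \wedge \|x_t\|_{V_{t-1}^{-1}}^2 \Bigr)
\;\leq\; 2 \log \frac{\det V_n}{\det V_0}
\;\leq\; 2 d \log\Bigl( 1 + \frac{n L^2}{d \lambda} \Bigr).
\]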
See Chapter 20 of Bandit Algorithms located here: https://tor-lattimore.com/downloads/book/book.pdf
Summary of Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability \(1 - \delta\), the regret of LinUCB satisfies
Linear bandits are an elegant model of the exploration-exploitation dilemma when actions are correlated. The main ingredients of the regret analysis are
bounding the instantaneous regret using the definition of optimism
a maximal concentration inequality holding for a randomized, sequential design
the elliptical potential lemma
Real-World Setting: Delayed Feedback
If there is a delay, the learner is not going to wait forever. For instance, after \(m\) timesteps, it will start deleting data from the delay buffer.
Modified settings: at round \(t \geq 1\),
receive contextualized action set \(\mathcal{A}_t = \{ a_1, ..., a_k \}\) and choose action \(A_t \in \mathcal{A}_t\),
two random variables are generated but not observed: \(X_t \sim \mathcal{B} (\theta^\intercal A_t)\) and \(D_t \sim \mathcal{D}(\tau)\)
at \(t + D_t\) the reward \(X_t\) of action \(A_t\) is disclosed …
…unless \(D_t > m\): if the delay is too long, the reward is discarded.
New parameter: \(0 < m < T\) is the cut-off time of the system. If the delay is longer, the reward is never received. The delay distribution \(\mathcal{D}(\tau)\) characterizes the proportion of converting actions \(\tau_m = p(D_t \leq m)\). So this denotes the amount of data which is sent back to the agent.
We now have the covariance matrix \(V_t\) and a slightly changed \(b_t\) vector, which is the sum of the actions weighted by \(X_t\) times an indicator that the corresponding feedback has actually arrived within the delay window.
where \(\tilde{b}_t\) contains additional non-identically distributed samples:
It is split into two parts: 1) old actions made a long time ago, 2) actions made in last delay portion.
This establishes a new estimator
a “conditionally biased” least-squares estimator which includes every received feedback. It is biased because the last set of observations are still awaiting their rewards somewhere in the buffer; these are not all i.i.d. because the probability of the delay being smaller than \(t-s\) is not the same for all actions. This hurts the estimator.
Baseline: use the previous estimator but discard the last \(m\) steps
with \(E[\hat{\theta}_t^{disc} | \mathcal{F}_t] \approx \tau_m \theta\)
We remark that
where \(\hat{\theta}_t^b - \hat{\theta}_{t+m}^{disc}\) contains the finite bias and \(\hat{\theta}_{t+m}^{disc} - \tau_m \theta\) is the same as before.
For the new \(\mathcal{C}_t\), we have new optimistic indices
But now, the solution has an extra (vanishing) bias term
D-LinUCB : Easy, straightforward, harmless modification of LinUCB, with regret guarantees in the delayed feedback setting
Regret bound (Theorem: D-LinUCB Regret)
Under the same conditions as before, with \(V_0 = \lambda I\), with probability \(1 - \delta\) the regret of D-LinUCB satisfies
So, we have an extra term. Note: There is a dependency on \(\tau_m\) here.
Conclusion
Linear bandits are a powerful and well-understood way of solving the exploration-exploitation trade-off in a metric space
The techniques have been extended to generalized linear models by Filippi et al
and to kernel regression Valko et al
Yet, including constraints and external sources of noise in real-world applications is challenging
Sometimes when you take an action, it modifies the environment, and the reward that follows this environment change should be credited to the first action. In the general RL setting, you take actions but don’t get the reward immediately: you change your position in the state space and rewards come later (in a delayed way). In this case, we have to go beyond the bandit assumption and open the Markov decision process box (e.g. UCRL and KL-UCRL, Auer et al., Filippi et al.).
Neural Network Gaussian Process (NNGP)
Summary
Here, I will sketch a proof of the NNGP result. The proof is by induction: as we go through the layers of the NN, the gaussian structure is preserved from layer to layer and the kernel updates.
\(f^{old}\) is a GP \(\Rightarrow\) \(f^{new}\) is a GP.
Definition
A “gaussian process” is a random function \(f(x)\) such that any finite sample of points \(x\) yields a gaussian vector. So a gaussian process is a generalization of a gaussian vector to an infinite number of points.
Take any finite set of test points \(x^{(1)}, \dots, x^{(n_{\text{test}})}\). Then the joint distribution of the vector \(\bigl(f(x^{(1)}), \dots, f(x^{(n_{\text{test}})})\bigr)\) is gaussian with mean \(m(x^{(i)})\) and covariance structure \(\Sigma \bigl( x^{(i)}, x^{(j)} \bigr)_{i,j=1}^{n_{\text{test}}}\).
So a gaussian process is just something where any finite-dimensional subset of it is a gaussian vector.
The main result is that, in a neural network, as you go from one layer to the next, if one layer is a gaussian process then the next layer is also a gaussian process; propagating that through the network, the whole network is an NNGP. Note: this requires the layer widths to approach infinity.
Proposition
Note: An old function is assumed to be a gaussian process and the new function is our new layer update.
If \(f^{old} : \mathbb{R}^{n_{in}} \xrightarrow{} \mathbb{R}^{n_{old}}\) where every component \(f_i^{old} \sim \mathcal{GP}(0, \Sigma^{old} (x, x^\prime))\) and \(f_i^{old}, f_j^{old}\) are independent
Then \(f^{new} : \mathbb{R}^{n_{in}} \xrightarrow{} \mathbb{R}^{n_{new}}\) is defined as
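Presumably the usual fully-connected layer with i.i.d. gaussian weights and biases; the \(\sigma_W^2 / n_{old}\) weight variance below is an assumption chosen so that the covariance stated next comes out right:
\[
f_i^{new}(x) = \sum_{j=1}^{n_{old}} W_{ij} \, \phi\bigl( f_j^{old}(x) \bigr) + b_i,
\qquad
W_{ij} \sim \mathcal{N}\Bigl(0, \tfrac{\sigma_W^2}{n_{old}}\Bigr),
\quad
b_i \sim \mathcal{N}(0, \sigma_b^2).
\]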
Note \(f^{new}\) is also a gaussian process and in the limit \(n_{old} \xrightarrow{} \infty\), we can write its kernel:
with covariance structure \(\Sigma^{new} (x, x^\prime) = \sigma_W^2 \mathbf{E} \bigl[ \phi \bigl( f^{old}_1 (x) \bigr)^T \phi \bigl( f_1^{old}(x^\prime ) \bigr) \bigr] + \sigma_b^2\), or more simply \(\Sigma^{new} (x, x^\prime) = \sigma_W^2 \mathbf{E} \bigl[ \phi \bigl( Z \bigr)^T \phi \bigl( Z^\prime \bigr) \bigr] + \sigma_b^2\), where \((Z, Z^\prime)\) is a centered gaussian pair with covariance \(\begin{pmatrix} \Sigma^{old}(x,x) & \Sigma^{old}(x,x^\prime) \\ \Sigma^{old}(x^\prime,x) & \Sigma^{old}(x^\prime,x^\prime) \end{pmatrix}\).
And \(f_i, f_j\) are independent for \(i \neq j\).
Short proposition
\(f^{old}\) is a GP \(\Rightarrow\) \(f^{new}\) is a GP.
Proof
\(f_i^{new}\) is exactly a random feature regression model with features given by \(\phi\bigl(f^{old}(x)\bigr) \in \mathbb{R}^{n_{old}}\).
\(\Rightarrow f_i^{new}\) is a GP with kernel:
and \(f_i^{new}\) and \(f_j^{new}\) are independent.
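Concretely, under the scaling assumed above, the finite-width kernel of this random feature model and its infinite-width limit are:
\[
k_{n_{old}}(x, x^\prime) = \frac{\sigma_W^2}{n_{old}} \sum_{j=1}^{n_{old}} \phi\bigl(f_j^{old}(x)\bigr) \, \phi\bigl(f_j^{old}(x^\prime)\bigr) + \sigma_b^2
\;\xrightarrow[\;n_{old} \to \infty\;]{}\;
\sigma_W^2 \, \mathbf{E}\bigl[ \phi(Z) \, \phi(Z^\prime) \bigr] + \sigma_b^2
= \Sigma^{new}(x, x^\prime).
\]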
Acknowledgements
Summarized from https://bpb-ca-c1.wpmucdn.com/sites.uoguelph.ca/dist/8/175/files/2021/02/Notes_on_feature_regression_and_wide_NNs.pdf
Helpful Resources
Good Blogs: https://colah.github.io/
Graph Theory: http://olizardo.bol.ucla.edu/classes/soc-111/textbook/_book/1-intro.html#intro
Derivatives of Lin. Algebra: http://michael.orlitzky.com/articles/the_derivative_of_a_quadratic_form.xhtml
Use non-linear model at end of NN : https://www.reddit.com/r/MachineLearning/comments/qex0o7/d_mlps_are_actually_nonlinear_linear/
Computer science at UCF: https://www.cs.ucf.edu/~kienhua/classes/
Statistics with proofs and python: https://xavierbourretsicotte.github.io/#
Statistics notes written out on Twitter: https://twitter.com/mervenoyann/status/1386752998131605510
Data science project (including XGBoost) example implementation: https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/06-trees/notebook.ipynb
Data science project template: https://drivendata.github.io/cookiecutter-data-science/#cookiecutter-data-science
Deep learning descriptions: https://arthurdouillard.com/deepcourse/
DL Timeseries Resources: https://github.com/Alro10/deep-learning-time-series
Cool Papers
Uncertainty estimation NN: https://arxiv.org/pdf/2003.02037.pdf
Decomposing sensitivity components for calibration: https://arxiv.org/pdf/2110.14577.pdf
HMM -> ODE: http://www.stat.columbia.edu/~liam/teaching/neurostat-fall20/papers/hmm/minka-lds-techreport.pdf
Data + ODE: https://arxiv.org/pdf/2103.10153.pdf