Summaries

ML & DL models

Category

Model

Description

When to use

Extending Concepts

Ensemble Learning

Gradient Boosting

Builds trees one at a time. At each stage \(m\) (\(1 \leq m \leq M\), of \(M\) total stages) of gradient boosting, suppose we have some imperfect model \(F_m\) (for low \(m\), this model may simply return \(\hat{y}_i = \bar{y}\)). To improve \(F_m\), the algorithm adds a new estimator \(h_m(x)\), so that ideally \(F_{m+1}(x) = F_m(x) + h_m(x) = y\), i.e. \(h_m(x) = y - F_m(x)\). Gradient boosting therefore fits \(h_m\) to the residual \(y - F_m(x)\), attempting to correct the errors of its predecessor.

Decreases the bias error. Used in regression (MSE) and classification (log-loss).

A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that the residuals \(h_{m}(x)\) for a given model are the negative gradients of the mean squared error (MSE) loss function with respect to \(F(x)\): \(L_{\rm {MSE}}={\frac {1}{2}}\left(y-F(x)\right)^{2}\) and \(h_{m}(x)=-{\frac {\partial L_{\rm {MSE}}}{\partial F}}=y-F(x)\). Gradient boosting can thus be seen as a gradient descent algorithm in function space, and generalizing it entails “plugging in” a different loss and its gradient.
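A minimal numpy sketch of the squared-error case, using a hand-rolled regression stump as the weak learner (`fit_stump`, `predict_stump`, and the synthetic sine data are illustrative, not a library API):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to targets r by minimizing SSE."""
    best_sse, best = np.inf, (x[0], r.mean(), r.mean())
    for t in np.unique(x)[:-1]:  # every threshold leaves both sides non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

nu, M = 0.1, 100                 # shrinkage (learning rate) and number of stages
F = np.full_like(y, y.mean())    # F_1: the constant model returning ȳ
mse_start = ((y - F) ** 2).mean()
for _ in range(M):
    residual = y - F                       # negative gradient of ½(y − F)²
    stump = fit_stump(x, residual)         # h_m fit to the residuals
    F = F + nu * predict_stump(stump, x)   # F_{m+1} = F_m + ν·h_m
mse_end = ((y - F) ** 2).mean()
```

The shrinkage factor ν is the standard practical addition: taking only a fraction of each \(h_m\) trades more stages for better generalization.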

Ensemble Learning

XGBoost

A form of gradient boosting. XGBoost delivers high performance compared to plain gradient boosting: training is very fast and can be parallelized / distributed across clusters. XGBoost computes second-order gradients, i.e. second partial derivatives of the loss function, which provide more information about the shape of the loss and how to reach its minimum. XGBoost also handles missing values in the dataset internally, so during data wrangling you may not need a separate treatment for missing values.
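As a hedged illustration of what the second-order information buys: each XGBoost round has a closed-form optimal leaf weight \(w^* = -\sum_i g_i / (\sum_i h_i + \lambda)\), where \(g_i\) and \(h_i\) are first and second derivatives of the loss and \(\lambda\) is the L2 penalty on leaf weights. A toy numpy computation for squared loss (the data values are made up):

```python
import numpy as np

# Derivatives of the squared loss L_i = ½(y_i − F_i)² w.r.t. the prediction F_i.
y = np.array([3.0, 2.5, 4.0, 3.5])
F = np.zeros_like(y)     # current model predicts 0 for every example

g = F - y                # first-order gradients: ∂L/∂F = F − y
h = np.ones_like(y)      # second-order gradients: ∂²L/∂F² = 1 for squared loss

lam = 1.0                # L2 regularization on leaf weights (XGBoost's λ)
# Optimal weight of a single leaf containing all four examples:
w_star = -g.sum() / (h.sum() + lam)   # = 13 / 5 = 2.6
```

For squared loss the Hessian is constant, so this reduces to a damped mean of the residuals; for log-loss the \(h_i\) vary per example, which is exactly the extra information second-order methods exploit.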

A good model to try first.

Ensemble Learning

Random Forest

RFs train each tree independently, using a random sample of the data. This randomness makes the model more robust than a single decision tree and less likely to overfit the training data.

Good for multi-class problems because it is efficient.
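The "random sample of the data" is a bootstrap sample: \(n\) draws with replacement, so each tree sees about \(1 - 1/e \approx 63.2\%\) of the distinct rows and the rest are "out of bag". A small numpy sketch of one draw (the per-split random feature subset RFs also use is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
indices = np.arange(n)

# Each tree in a random forest sees a bootstrap sample: n draws with replacement.
sample = rng.choice(indices, size=n, replace=True)
unique_fraction = np.unique(sample).size / n
# In expectation, 1 − (1 − 1/n)^n ≈ 63.2% of rows appear in the sample;
# the remaining "out of bag" rows can serve as a free validation set.
```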

Statistics

Beta regression

https://www.ime.usp.br/~sferrari/beta.pdf

Statistics

Naive Bayes

Classifier \(\hat{y} = \operatorname{argmax}_{k \in \{1, \ldots, K\}} \, p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\) that often (in Gaussian NB) uses \(p(x_i \mid C_k) \sim \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)\).

Per the product in this equation, NB assumes feature independence. Data scarcity drives estimated probabilities close to zero, which can cause numerical instabilities (hence the usual switch to log-probabilities). Additionally, outside the Gaussian variant, continuous features have to be discretized, throwing away a lot of information.
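A compact numpy sketch of Gaussian NB under the model above, scoring in log space to avoid the near-zero products mentioned above; the function names, variance floor, and synthetic two-blob data are illustrative:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])  # floor avoids /0
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # Work in log space: log p(C_k) + Σ_i log N(x_i; μ_ik, σ_ik²),
    # which avoids the underflow the raw product form suffers from.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[scores.argmax(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_nb(X, y)
accuracy = (predict_gaussian_nb(model, X) == y).mean()
```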

Statistics

Linear Discriminant Analysis

While the naive Bayes classifier assumes zero covariance between features, LDA assumes equal covariance matrices across the classes \(G\) in \(y\). As a note: quadratic discriminant analysis allows a different covariance matrix per class.

Does not work well if the classes are imbalanced. Also not applicable to non-linear problems, and sensitive to overfitting.

Statistics

Support Vector Machine

see notebook.

Good when \(p > n\) or when the classes are well separated.

Sampling

Category

Technique

Description

When to use

Extending Concepts

Estimate integrals:

Variational Bayes: http://www.orchid.ac.uk/eprints/40/1/fox_vbtut.pdf

Monte Carlo: see the Monte Carlo Integration tab

Optimizers

In stochastic gradient descent, \(\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)\): the parameters \(\theta\) are updated according to the gradient of the loss with respect to \(\theta\), and \(\alpha\) is the learning rate. If \(\alpha\) is small, convergence will be slow and the optimizer can get caught at local minima; a large \(\alpha\) can lead to divergence. The gradient of the loss \(L\) changes quickly between iterations due to the diversity of the training examples, so in this base case it is common to see a “zig-zag” trajectory even as we slowly approach the loss minimum.

Momentum can overcome this “zig-zag”. We introduce a new hyperparameter \(\mu\) with \(v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t)\) and \(\theta_{t+1} = \theta_t + v_{t+1}\).
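A toy numpy run of momentum on a poorly conditioned 2-D quadratic \(L(\theta) = \frac{1}{2}\theta^\top A \theta\), whose steep/shallow axes produce exactly the zig-zag described above for plain gradient steps (the matrix and hyperparameters are illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])        # ill-conditioned quadratic bowl
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta, v = np.array([1.0, 1.0]), np.zeros(2)
alpha, mu = 0.05, 0.9
for _ in range(100):
    v = mu * v - alpha * grad(theta)   # v_{t+1} = μ v_t − α ∇L(θ_t)
    theta = theta + v                  # θ_{t+1} = θ_t + v_{t+1}
final_norm = np.linalg.norm(theta)     # should be near the minimum at 0
```

The velocity \(v\) averages successive gradients, so the oscillating components along the steep axis cancel while the consistent ones along the shallow axis accumulate.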

Category

Technique

Description

When to use

Extending Concepts

Adaptive learning rate

Adagrad

Scales \(\alpha\) for each parameter according to that parameter’s history of gradients (previous steps), by dividing the current gradient in the update rule by the (square root of the) accumulated sum of squared gradients. As a result, when the accumulated gradient is large, the effective step is reduced, and vice versa: \(g_{t+1} = g_t + \nabla L(\theta_t)^2\) and \(\theta_{t+1} = \theta_t - \frac{\alpha \nabla L(\theta_t)}{\sqrt{g_{t+1} + \epsilon}}\)
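A toy numpy sketch of Adagrad on a 2-D quadratic, where the coordinate with the larger gradient history automatically gets the smaller effective step (matrix and hyperparameters are illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])        # steep axis and shallow axis
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta = np.array([1.0, 1.0])
g_acc = np.zeros(2)             # per-parameter accumulator g_t
alpha, epsilon = 0.5, 1e-8
for _ in range(200):
    g = grad(theta)
    g_acc += g ** 2                                # g_{t+1} = g_t + ∇L(θ_t)²
    theta -= alpha * g / np.sqrt(g_acc + epsilon)  # per-parameter scaled step
loss = 0.5 * theta @ A @ theta
```

Note the accumulator only grows, so the effective learning rate decays monotonically, which is Adagrad's known weakness on long runs.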

Adaptive learning rate

RMSProp

The \(g_t\) term (as seen in Adagrad) is calculated as an exponentially decaying average of squared gradients rather than their sum, so the effective learning rate does not shrink monotonically toward zero.
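A sketch of that change in numpy on a toy quadratic: only the accumulator line differs from Adagrad, becoming a decaying average with rate \(\rho\) (the matrix and hyperparameters are illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta = np.array([1.0, 1.0])
g_avg = np.zeros(2)
alpha, rho, epsilon = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(theta)
    g_avg = rho * g_avg + (1 - rho) * g ** 2       # decaying average, not a sum
    theta -= alpha * g / np.sqrt(g_avg + epsilon)  # step stays usable late in training
loss = 0.5 * theta @ A @ theta
```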

https://www.quora.com/What-are-differences-between-update-rules-like-AdaDelta-RMSProp-AdaGrad-and-AdaM

Adaptive learning rate

Adam

Uses both the first-order moment \(m_t\) and the second-order moment \(g_t\), each decayed over time (with bias correction). The effective step size is approximately bounded by \(\pm \alpha\), and it decreases as the optimizer approaches a minimum.
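A toy numpy sketch of Adam with the standard bias-corrected moment estimates, on an illustrative 2-D quadratic (\(\beta_1, \beta_2, \epsilon\) are the common defaults; the learning rate and matrix are made up):

```python
import numpy as np

A = np.diag([10.0, 1.0])
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta = np.array([1.0, 1.0])
m, g2 = np.zeros(2), np.zeros(2)
alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # decayed first moment m_t
    g2 = beta2 * g2 + (1 - beta2) * g ** 2   # decayed second moment g_t
    m_hat = m / (1 - beta1 ** t)             # bias correction (moments start at 0)
    g2_hat = g2 / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(g2_hat) + epsilon)  # step magnitude ≈ ±α early on
loss = 0.5 * theta @ A @ theta
```

Because \(m_t\) and \(g_t\) track gradients of similar magnitude, the ratio \(\hat{m}/\sqrt{\hat{g}_2}\) stays near \(\pm 1\) early on, which is why each step is roughly \(\pm\alpha\) regardless of the raw gradient scale.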

There is often value to using more than one method (an ensemble), because every method has a weakness.