Summaries

ML & DL models

Category

Model

Description

When to use

Extending Concepts

Ensemble Learning

Gradient Boosting

Builds trees one at a time. At each stage \(m\) (\(1 \leq m \leq M\), of \(M\) total stages) of gradient boosting, suppose we have some imperfect model \(F_m\) (for low \(m\), this model may simply return \(\hat{y}_i = \bar{y}\)). To improve \(F_m\), the algorithm adds a new estimator \(h_m(x)\), so that ideally \(F_{m+1}(x) = F_m(x) + h_m(x) = y\), i.e. \(h_m(x) = y - F_m(x)\). Gradient boosting therefore fits \(h_m\) to the residual \(y - F_m(x)\), attempting to correct the errors of its predecessor.

Decreases the bias error. Used in regression (MSE) and classification (log-loss).

A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that the residuals \(h_{m}(x)\) for a given model are the negative gradients of the mean squared error (MSE) loss function with respect to \(F(x)\): \(L_{\rm {MSE}}={\frac {1}{2}}\left(y-F(x)\right)^{2}\) and \(h_{m}(x)=-{\frac {\partial L_{\rm {MSE}}}{\partial F}}=y-F(x)\). Gradient boosting can thus be seen as a gradient descent algorithm in function space, and generalizing it entails “plugging in” a different loss and its gradient.
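A minimal numpy sketch of the squared-error case, using a hand-rolled regression stump as the weak learner (`fit_stump`, `predict_stump`, and the synthetic sine data are illustrative, not a library API):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to targets r by minimizing SSE."""
    best_sse, best = np.inf, (x[0], r.mean(), r.mean())
    for t in np.unique(x)[:-1]:  # every threshold leaves both sides non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

nu, M = 0.1, 100                 # shrinkage (learning rate) and number of stages
F = np.full_like(y, y.mean())    # F_1: the constant model returning ȳ
mse_start = ((y - F) ** 2).mean()
for _ in range(M):
    residual = y - F                       # negative gradient of ½(y − F)²
    stump = fit_stump(x, residual)         # h_m fit to the residuals
    F = F + nu * predict_stump(stump, x)   # F_{m+1} = F_m + ν·h_m
mse_end = ((y - F) ** 2).mean()
```

The shrinkage factor ν is the standard practical addition: taking only a fraction of each \(h_m\) trades more stages for better generalization.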

Ensemble Learning

XGBoost

A form of gradient boosting. XGBoost delivers high performance compared to plain gradient boosting: training is very fast and can be parallelized / distributed across clusters. XGBoost computes second-order gradients, i.e. second partial derivatives of the loss function, which provide more information about the shape of the loss and how to reach its minimum. XGBoost also handles missing values in the dataset internally, so during data wrangling you may not need a separate treatment for missing values.
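As a hedged illustration of what the second-order information buys: each XGBoost round has a closed-form optimal leaf weight \(w^* = -\sum_i g_i / (\sum_i h_i + \lambda)\), where \(g_i\) and \(h_i\) are first and second derivatives of the loss and \(\lambda\) is the L2 penalty on leaf weights. A toy numpy computation for squared loss (the data values are made up):

```python
import numpy as np

# Derivatives of the squared loss L_i = ½(y_i − F_i)² w.r.t. the prediction F_i.
y = np.array([3.0, 2.5, 4.0, 3.5])
F = np.zeros_like(y)     # current model predicts 0 for every example

g = F - y                # first-order gradients: ∂L/∂F = F − y
h = np.ones_like(y)      # second-order gradients: ∂²L/∂F² = 1 for squared loss

lam = 1.0                # L2 regularization on leaf weights (XGBoost's λ)
# Optimal weight of a single leaf containing all four examples:
w_star = -g.sum() / (h.sum() + lam)   # = 13 / 5 = 2.6
```

For squared loss the Hessian is constant, so this reduces to a damped mean of the residuals; for log-loss the \(h_i\) vary per example, which is exactly the extra information second-order methods exploit.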

A good model to try first.

Ensemble Learning

Random Forest

RFs train each tree independently, using a random sample of the data. This randomness makes the model more robust than a single decision tree and less likely to overfit the training data.

Good for multi-class problems because it is efficient.
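The "random sample of the data" is a bootstrap sample: \(n\) draws with replacement, so each tree sees about \(1 - 1/e \approx 63.2\%\) of the distinct rows and the rest are "out of bag". A small numpy sketch of one draw (the per-split random feature subset RFs also use is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
indices = np.arange(n)

# Each tree in a random forest sees a bootstrap sample: n draws with replacement.
sample = rng.choice(indices, size=n, replace=True)
unique_fraction = np.unique(sample).size / n
# In expectation, 1 − (1 − 1/n)^n ≈ 63.2% of rows appear in the sample;
# the remaining "out of bag" rows can serve as a free validation set.
```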

Statistics

Beta regression

https://www.ime.usp.br/~sferrari/beta.pdf

Statistics

Naive Bayes

Classifier \(\hat{y} = \operatorname{argmax}_{k \in \{1, \ldots, K\}} \, p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\) that often (in Gaussian NB) uses \(p(x_i \mid C_k) \sim \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)\).

Per the product in this equation, NB assumes feature independence. Data scarcity drives estimated probabilities close to zero, which can cause numerical instabilities (hence the usual switch to log-probabilities). Additionally, outside the Gaussian variant, continuous features have to be discretized, throwing away a lot of information.
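A compact numpy sketch of Gaussian NB under the model above, scoring in log space to avoid the near-zero products mentioned above; the function names, variance floor, and synthetic two-blob data are illustrative:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])  # floor avoids /0
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # Work in log space: log p(C_k) + Σ_i log N(x_i; μ_ik, σ_ik²),
    # which avoids the underflow the raw product form suffers from.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[scores.argmax(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_nb(X, y)
accuracy = (predict_gaussian_nb(model, X) == y).mean()
```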

Statistics

Linear Discriminant Analysis

While the naive Bayes classifier assumes zero covariance between features, LDA assumes equal covariance matrices across the classes \(G\) in \(y\). As a note: quadratic discriminant analysis allows a different covariance matrix per class.

Does not work well if the classes are imbalanced. Also not applicable to non-linear problems, and sensitive to overfitting.

Statistics

Support Vector Machine

see notebook.

Good when \(p > n\) or when the classes are well separated.

Sampling

Category

Technique

Description

When to use

Extending Concepts

Estimate integrals:

Variational Bayes: http://www.orchid.ac.uk/eprints/40/1/fox_vbtut.pdf

Monte Carlo: see the Monte Carlo Integration tab

Optimizers

In stochastic gradient descent, \(\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)\): the parameters \(\theta\) are updated according to the gradient of the loss with respect to \(\theta\), and \(\alpha\) is the learning rate. If \(\alpha\) is small, convergence will be slow and the optimizer can get caught at local minima; a large \(\alpha\) can lead to divergence. The gradient of the loss \(L\) changes quickly between iterations due to the diversity of the training examples, so in this base case it is common to see a “zig-zag” trajectory even as we slowly approach the loss minimum.

Momentum can overcome this “zig-zag”. We introduce a new hyperparameter \(\mu\) with \(v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t)\) and \(\theta_{t+1} = \theta_t + v_{t+1}\).
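A toy numpy run of momentum on a poorly conditioned 2-D quadratic \(L(\theta) = \frac{1}{2}\theta^\top A \theta\), whose steep/shallow axes produce exactly the zig-zag described above for plain gradient steps (the matrix and hyperparameters are illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])        # ill-conditioned quadratic bowl
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta, v = np.array([1.0, 1.0]), np.zeros(2)
alpha, mu = 0.05, 0.9
for _ in range(100):
    v = mu * v - alpha * grad(theta)   # v_{t+1} = μ v_t − α ∇L(θ_t)
    theta = theta + v                  # θ_{t+1} = θ_t + v_{t+1}
final_norm = np.linalg.norm(theta)     # should be near the minimum at 0
```

The velocity \(v\) averages successive gradients, so the oscillating components along the steep axis cancel while the consistent ones along the shallow axis accumulate.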

Category

Technique

Description

When to use

Extending Concepts

Adaptive learning rate

Adagrad

Scales \(\alpha\) for each parameter according to that parameter’s history of gradients (previous steps), by dividing the current gradient in the update rule by the (square root of the) accumulated sum of squared gradients. As a result, when the accumulated gradient is large, the effective step is reduced, and vice versa: \(g_{t+1} = g_t + \nabla L(\theta_t)^2\) and \(\theta_{t+1} = \theta_t - \frac{\alpha \nabla L(\theta_t)}{\sqrt{g_{t+1} + \epsilon}}\)
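A toy numpy sketch of Adagrad on a 2-D quadratic, where the coordinate with the larger gradient history automatically gets the smaller effective step (matrix and hyperparameters are illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])        # steep axis and shallow axis
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta = np.array([1.0, 1.0])
g_acc = np.zeros(2)             # per-parameter accumulator g_t
alpha, epsilon = 0.5, 1e-8
for _ in range(200):
    g = grad(theta)
    g_acc += g ** 2                                # g_{t+1} = g_t + ∇L(θ_t)²
    theta -= alpha * g / np.sqrt(g_acc + epsilon)  # per-parameter scaled step
loss = 0.5 * theta @ A @ theta
```

Note the accumulator only grows, so the effective learning rate decays monotonically, which is Adagrad's known weakness on long runs.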

Adaptive learning rate

RMSProp

The \(g_t\) term (as seen in Adagrad) is calculated as an exponentially decaying average of squared gradients rather than their sum, so the effective learning rate does not shrink monotonically toward zero.
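A sketch of that change in numpy on a toy quadratic: only the accumulator line differs from Adagrad, becoming a decaying average with rate \(\rho\) (the matrix and hyperparameters are illustrative):

```python
import numpy as np

A = np.diag([10.0, 1.0])
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta = np.array([1.0, 1.0])
g_avg = np.zeros(2)
alpha, rho, epsilon = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(theta)
    g_avg = rho * g_avg + (1 - rho) * g ** 2       # decaying average, not a sum
    theta -= alpha * g / np.sqrt(g_avg + epsilon)  # step stays usable late in training
loss = 0.5 * theta @ A @ theta
```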

https://www.quora.com/What-are-differences-between-update-rules-like-AdaDelta-RMSProp-AdaGrad-and-AdaM

Adaptive learning rate

Adam

Uses both the first-order moment \(m_t\) and the second-order moment \(g_t\), each decayed over time (with bias correction). The effective step size is approximately bounded by \(\pm \alpha\), and it decreases as the optimizer approaches a minimum.
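A toy numpy sketch of Adam with the standard bias-corrected moment estimates, on an illustrative 2-D quadratic (\(\beta_1, \beta_2, \epsilon\) are the common defaults; the learning rate and matrix are made up):

```python
import numpy as np

A = np.diag([10.0, 1.0])
grad = lambda theta: A @ theta  # ∇L(θ) for L(θ) = ½ θᵀAθ

theta = np.array([1.0, 1.0])
m, g2 = np.zeros(2), np.zeros(2)
alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # decayed first moment m_t
    g2 = beta2 * g2 + (1 - beta2) * g ** 2   # decayed second moment g_t
    m_hat = m / (1 - beta1 ** t)             # bias correction (moments start at 0)
    g2_hat = g2 / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(g2_hat) + epsilon)  # step magnitude ≈ ±α early on
loss = 0.5 * theta @ A @ theta
```

Because \(m_t\) and \(g_t\) track gradients of similar magnitude, the ratio \(\hat{m}/\sqrt{\hat{g}_2}\) stays near \(\pm 1\) early on, which is why each step is roughly \(\pm\alpha\) regardless of the raw gradient scale.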

There is often value to using more than one method (an ensemble), because every method has a weakness.