The goal is to do inference or prediction based on a statistical model.
A statistical model describes a relationship between the dependent variable(s) (Y/D) and the independent variable(s) (X) in the presence of uncertainty (which is induced by random sampling, and is measured by probability).
Generally, inference aims to learn about the data-generating process, whereas prediction aims to predict the outcome (for given values of the independent variables). Without uncertainty (i.e., for deterministic modeling), they are the same problem. But in the presence of uncertainty, they are different problems: even if a data-generating process is completely known (all parameters known, so there is no parameter uncertainty), the process will not produce the same value every time because of randomness (random error). However, in machine learning terminology, learning the data-generating process is simply called learning, whereas predicting the outcome is often called inference.
Any inference or prediction is based on a given statistical model. Model selection is concerned with selecting a particular statistical model out of all possible candidate models – i.e., selecting an inferential model (for inference), or a predictive model (for prediction). An inferential model is generally validated through residual analysis and goodness-of-fit tests (in the same data), whereas a predictive model is validated through cross-validation (CV) using discrimination and calibration (in new data/test data).
The statistical community is generally more (but not exclusively) interested in learning about the data-generating process (inferential modeling). The machine learning community is generally more interested in prediction (predictive modeling), no matter how complex the model is. That is why machine learning models are often seen as ‘black boxes,’ which suffer in model interpretability but gain in model validation, as they are validated against test data. On the other hand, inferential models generally have much higher interpretability, but they may show lower model validation against test data, as predictive accuracy is not considered in modeling.
There are two main criteria for predictive model validation – discrimination and calibration. Discrimination is measured using sensitivity, specificity, precision or positive predictive value (PPV), negative predictive value (NPV), accuracy, the ROC curve, the area under the ROC curve (AUROC) or concordance statistic (C-statistic), and similar concepts. Calibration is measured using calibration curves, or the Hosmer–Lemeshow (HL) goodness-of-fit test. There are some overall performance measures that take into account both discrimination and calibration, e.g., \(R^2\) and the Brier score. All of these predictive measures can be used for model validation through cross-validation (CV) or resampling in test data/new data.
Take the example of (disease) risk prediction models. For such a model, discrimination corresponds to separating subjects with and without the disease, whereas calibration corresponds to the agreement between the subjects’ observed and predicted risks of the disease. It is possible for discrimination to be good but calibration poor. For example, if the predicted risk for all diseased subjects is 0.55 and for all non-diseased subjects is 0.45, then discrimination is perfect, but calibration is very poor.
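A minimal R sketch of this hypothetical example (the group sizes and the risk values 0.55/0.45 are assumptions for illustration):

```r
## Hypothetical cohort: 100 diseased and 100 non-diseased subjects,
## with predicted risks of 0.55 and 0.45, respectively
disease <- c(rep(1, 100), rep(0, 100))
risk    <- c(rep(0.55, 100), rep(0.45, 100))

## Discrimination: proportion of diseased/non-diseased pairs in which
## the diseased subject has the higher predicted risk (the AUROC)
mean(outer(risk[disease == 1], risk[disease == 0], ">"))  # 1 (perfect)

## Calibration: observed event rate within each predicted-risk group
tapply(disease, risk, mean)  # 0.45 group: 0.00; 0.55 group: 1.00 (very poor)
```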
Based on a prediction model, you can do two kinds of predictions – prediction about the present (diagnosis) and prediction about the future (prognosis). When diagnosis is of interest, the event (disease) has already taken place, and you can only talk of ‘discriminating’ between the diseased and non-diseased persons; so discrimination measures are more important for diagnostic purposes. When prognosis is of interest, the event (disease) has not yet taken place in time, and so you can only talk of the ‘risk’ of developing the disease (not of discriminating); so calibration measures are more important for prognostic purposes. For discrimination, the model intercept is not important; but for calibration, the model intercept, being a measure of disease prevalence, is important.
To achieve discrimination, the subjects are classified into disease and non-disease groups, generally by dividing the risk scale at an arbitrary threshold (e.g., predicted risk > 0.5 means disease, else non-disease).
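For example, a minimal R sketch of the resulting discrimination measures (simulated data; the distributions used to generate the risks are assumptions for illustration):

```r
## Threshold-based classification and the usual discrimination measures
set.seed(1)
disease <- rbinom(200, 1, 0.3)                   # true disease status
risk <- ifelse(disease == 1, rbeta(200, 4, 2),   # higher risks if diseased
                             rbeta(200, 2, 4))
pred <- as.integer(risk > 0.5)                   # arbitrary 0.5 threshold

TP <- sum(pred == 1 & disease == 1); FP <- sum(pred == 1 & disease == 0)
TN <- sum(pred == 0 & disease == 0); FN <- sum(pred == 0 & disease == 1)

c(sensitivity = TP / (TP + FN),   # true positive rate
  specificity = TN / (TN + FP),   # true negative rate
  PPV         = TP / (TP + FP),   # precision
  NPV         = TN / (TN + FN),
  accuracy    = (TP + TN) / 200)
```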
The traditional Harrell-type C-index is implemented in pec::cindex(), and the time-dependent AUROC is implemented in riskRegression::Score(). The difference is the following.
The C-index at time \(t\) is the probability that, for any two subjects \(i\) and \(j\), the \(t\)-year risk for \(i\) is greater than the \(t\)-year risk for \(j\), given that \(i\) has the event before \(j\).
The time-dependent AUROC at time \(t\) is the probability that, for any two subjects \(i\) and \(j\), the \(t\)-year risk for \(i\) is greater than the \(t\)-year risk for \(j\), given that \(i\) has the event before \(t\) and \(j\) has the event after \(t\).
\(C = \text{Prob}(\text{Risk}_t(i) > \text{Risk}_t(j) \mid i \text{ has event before } j)\).
\(\text{AUC}_t = \text{Prob}(\text{Risk}_t(i) > \text{Risk}_t(j) \mid i \text{ has event before } t \text{ and } j \text{ has event after } t)\).
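A hedged sketch of both functions on the veteran lung cancer data from the survival package (the covariates, the evaluation time of 90 days, and evaluation on the training data are choices made only for illustration; check the argument names against your installed package versions):

```r
## C-index (pec) vs time-dependent AUROC (riskRegression) for a Cox model
library(survival)
library(pec)             # cindex()
library(riskRegression)  # Score()

fit <- coxph(Surv(time, status) ~ age + karno, data = veteran,
             x = TRUE)   # x = TRUE: keep the design matrix for pec/Score

## Harrell-type C-index evaluated at t = 90 days
cindex(fit, formula = Surv(time, status) ~ 1, data = veteran,
       eval.times = 90)

## Time-dependent AUROC at t = 90 days
Score(list(Cox = fit), formula = Surv(time, status) ~ 1, data = veteran,
      times = 90, metrics = "auc")
```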
A sample, being randomly drawn from a population, is a random variable (X). A discrete random variable (e.g., disease status, number of minor alleles) has a probability mass function (pmf), whereas a continuous random variable (e.g., height, blood pressure) has a probability density function (pdf). The pmf/pdf, \(p(x \mid \beta)\), is a function of \(x\) for a given parameter \(\beta\). The likelihood function, or simply likelihood, is seen as a function of that parameter \(\beta\) (not of \(x\)) for a given value \(x\) of \(X\), and is written as \(L(\beta \mid x)\). The interpretation is that the likelihood function denotes how likely the various values of \(\beta\) are to have produced the given sample \(x\).
So, for any particular values of \(X\) and \(\beta\), \(p(x \mid \beta) = L(\beta \mid x)\), that is, their functional values are the same. But there is a difference. Here, \(p(\cdot)\) is defined over all values of \(x\), whereas \(L(\cdot)\) is defined over all values of \(\beta\). So, although \(p(\cdot)\) is a pmf or pdf \((\sum_x p(x \mid \beta) = 1, \text{ or } \int p(x \mid \beta)\,dx = 1)\), \(L(\cdot)\) need not be a pmf or pdf \((\sum L(\beta \mid x) \neq 1, \text{ or } \int L(\beta \mid x)\,d\beta \neq 1 \text{ in general})\).
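As a concrete example, consider a single Bernoulli observation with success probability \(\beta\). If \(x = 1\) is observed, then \(L(\beta \mid x = 1) = p(1 \mid \beta) = \beta\). As a function of \(x\), \(p(x \mid \beta)\) sums to 1 \((\beta + (1 - \beta) = 1)\), but as a function of \(\beta\), \(\int_0^1 L(\beta \mid 1)\,d\beta = \int_0^1 \beta\,d\beta = 1/2 \neq 1\).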
Maximum likelihood estimation is a method to obtain a point estimator of a model parameter. The MLE is the value of the parameter (\(\beta\)) that maximizes the likelihood function of the parameter. In other words, for a given sample \(x\), the MLE is the value of \(\beta\) for which the likelihood of observing \(x\) is maximized.
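A minimal R sketch of the idea for a Bernoulli sample (the sample size and the true \(\beta = 0.3\) are assumptions for illustration); for this model the MLE has the closed form \(\hat{\beta} = \bar{x}\), which the numerical maximization recovers:

```r
## Maximum likelihood estimation of a Bernoulli parameter
set.seed(1)
x <- rbinom(50, size = 1, prob = 0.3)   # sample with true beta = 0.3

## Log-likelihood of beta given the observed sample x
loglik <- function(beta) sum(dbinom(x, size = 1, prob = beta, log = TRUE))

## Numerical maximization of the likelihood over (0, 1)
mle <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
c(numerical = mle, closed_form = mean(x))  # both equal the sample mean
```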
Penalized regression is used to reduce the tendency of ordinary regression to overfit, especially when the number of independent variables is large. When the parameters of a regression model are estimated by maximizing a penalized likelihood function (instead of the ordinary likelihood function), it is called penalized regression analysis. A penalized likelihood is the ordinary likelihood combined (regularized) with a penalty term, or penalty. The penalty term depends on a parameter called the regularization parameter, penalty parameter, or tuning parameter. Introducing the penalty ‘shrinks’ the resulting estimates towards 0, for which the corresponding estimators are called shrinkage estimators, and the method is also called a shrinkage method. The larger the penalty parameter, the greater the shrinkage.
The intuition is that if you add a variable to a model, the likelihood will always increase (it can never decrease). So using MLE for model selection will always favor more complex models (models with more variables) over simpler models. As a result, the model may ‘overfit.’ Overfitted models perform unusually well on the training set but generalize poorly beyond it, because they model/memorize/learn noise instead of signal. They can also be difficult to interpret.
The goal of (predictive) model selection is to build a model that predicts well. Good prediction means a smaller mean squared error (\(MSE = \text{mean}((\text{predicted} - \text{observed})^2)\)). MSE is composed of bias, variance and the irreducible population variance (\(MSE(\hat{f}) = Var(\hat{f}) + Bias^2(\hat{f}) + \text{population variance}\)). So, reducing both bias and variance would be ideal. But it is not possible to minimize them simultaneously. There is always a trade-off (the bias-variance trade-off). Generally, complex models (models with a large number of variables) have larger variance but smaller bias, and simple models (models with a small number of variables) have smaller variance but larger bias. This is because allowing more variables increases the number of models to search from (e.g., from 3 candidate variables, \(2^3 = 8\) models can be made, whereas from 4 candidate variables, \(2^4 = 16\) models can be made); so although there is more chance for the true model to be included in that large model space (lower bias), there is less chance that you will find it among so many models (more variance). On the contrary, simple models lead to a smaller model space, which may well not include the true model (more bias) but is easier to search (lower variance). Too complex models tend to overfit, and too simple models tend to underfit.
By adding the penalty term, we can control this bias-variance trade-off: adjusting the penalty parameter lets us obtain the right/desired amount of ‘fitting.’ Penalized regression, by shrinking the parameter estimates, introduces some bias but reduces the variance arising from overfitting.
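A small simulation sketch of this trade-off, comparing unpenalized least squares with ridge regression on independent test data (the dimensions, effect sizes, and use of the glmnet package are assumptions for illustration):

```r
## Shrinkage trading a little bias for a large reduction in variance
library(glmnet)
set.seed(1)
n <- 60; p <- 40                          # many predictors, few observations
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, 5), rep(0, p - 5))     # only 5 true predictors
y <- as.numeric(X %*% beta + rnorm(n))

Xnew <- matrix(rnorm(1000 * p), 1000, p)  # independent test data
ynew <- as.numeric(Xnew %*% beta + rnorm(1000))

## Unpenalized least squares: low bias, high variance (overfits here)
ols <- lm(y ~ X)
mse_ols <- mean((ynew - cbind(1, Xnew) %*% coef(ols))^2)

## Ridge with CV-chosen lambda: some bias, much lower variance
cvfit <- cv.glmnet(X, y, alpha = 0)
mse_ridge <- mean((ynew - predict(cvfit, newx = Xnew, s = "lambda.min"))^2)

c(OLS = mse_ols, ridge = mse_ridge)       # ridge typically has smaller test MSE
```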
The penalty term is combined with the (log-)likelihood in the following way:
\(\log L(\boldsymbol{\beta}) - \lambda \cdot (\text{a function of } \boldsymbol{\beta}) = \log L(\boldsymbol{\beta}) - P_{\lambda}(\boldsymbol{\beta})\), where \(\lambda\) is the tuning parameter.
(Recall that adding a variable will increase the likelihood; the penalty is subtracted to counteract that.)
Here, we will consider the following penalty functions:
Ridge: \(P_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_j \beta_j^2.\) This penalty penalizes the squares of the parameter estimates. It shrinks the estimates, but does not reduce any of them exactly to 0.
LASSO (least absolute shrinkage and selection operator): \(P_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_j |\beta_j|.\) This penalty penalizes the absolute values of the parameter estimates. It shrinks the estimates such that some of them are reduced exactly to 0. So, in addition to shrinkage, it provides variable selection (reduces the number of variables, or the dimensionality).
ADALASSO (adaptive LASSO): \(P_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_j a_j |\beta_j|.\) This penalty penalizes a weighted sum of the absolute values of the parameter estimates, where the weights are generally a function of the ordinary least squares estimates. It provides shrinkage and selection, is more flexible than LASSO, and theoretically tends to the ideal estimator (the estimator based on the true predictors only) as the sample size increases.
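A minimal glmnet sketch of these three penalties (simulated data; in glmnet, `alpha = 0` gives ridge, `alpha = 1` gives LASSO, and adaptive-LASSO weights can be passed through `penalty.factor`; the displayed \(\lambda = 0.1\) is arbitrary):

```r
## Ridge, LASSO and adaptive LASSO fits with glmnet
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(1, -1, 0.5, rep(0, p - 3)) + rnorm(n))

ridge <- glmnet(X, y, alpha = 0)          # shrinks, but no exact zeros
lasso <- glmnet(X, y, alpha = 1)          # some coefficients exactly 0

## Adaptive LASSO: weights a_j = 1 / |OLS estimate of beta_j|
w <- 1 / abs(coef(lm(y ~ X))[-1])         # drop the intercept
adalasso <- glmnet(X, y, alpha = 1, penalty.factor = w)

## Compare coefficient estimates at one (arbitrary) lambda value
round(cbind(ridge    = as.numeric(coef(ridge,    s = 0.1)),
            lasso    = as.numeric(coef(lasso,    s = 0.1)),
            adalasso = as.numeric(coef(adalasso, s = 0.1))), 3)
```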
LASSO, although it provides variable selection/sparsity (estimating only the meaningful coefficients as non-zero and the others as 0), often introduces large bias. The following penalties are motivated by reducing this bias while maintaining the sparsity.
SCAD (smoothly clipped absolute deviation): The penalty \(P_{\lambda}(\boldsymbol{\beta}) = \sum_j p_{\lambda}(|\beta_j|)\) is generally expressed through the first derivative of its components, \(p'_{\lambda}(\theta) = \lambda \{ I(\theta \leq \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda} I(\theta > \lambda) \}\) for \(\theta > 0\), where \(a (> 2)\) is a tunable parameter that determines how fast the penalty rate decreases for large coefficients. If the coefficient is large enough, the SCAD penalty becomes constant (i.e., the derivative is 0; the top flat part of the curve in the figure), whereas the LASSO penalty keeps increasing with larger coefficients. This is how SCAD reduces bias. If the coefficient is small enough, the SCAD penalty is linear (a line; the bottom linear part of the curve in the figure). For intermediate values of the coefficient, the SCAD penalty is quadratic (the intermediate curvy part of the curve in the figure).
MCP (minimax concave penalty): The penalty \(P_{\lambda}(\boldsymbol{\beta}) = \sum_j p_{\lambda}(|\beta_j|)\) is generally expressed through the first derivative of its components, \(p'_{\lambda}(\theta) = (\lambda - \frac{\theta}{a})\, I(\theta \leq a\lambda)\) for \(\theta > 0\), where \(a (> 1)\) is a tunable parameter. The interpretation is similar to SCAD. The difference from SCAD is that MCP does not have a region with a constant penalization rate like SCAD; the penalization rate starts decreasing immediately as the coefficient value moves away from 0.
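A minimal sketch with the ncvreg package, which implements both penalties (simulated data; in ncvreg the parameter written as \(a\) above is the `gamma` argument, and the displayed \(\lambda = 0.1\) is arbitrary):

```r
## SCAD and MCP fits with ncvreg
library(ncvreg)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(1, -1, 0.5, rep(0, p - 3)) + rnorm(n))

scad <- ncvreg(X, y, penalty = "SCAD")  # gamma (= a) defaults to 3.7
mcp  <- ncvreg(X, y, penalty = "MCP")   # gamma (= a) defaults to 3

## Coefficient estimates at one (arbitrary) lambda value
round(cbind(SCAD = coef(scad, lambda = 0.1),
            MCP  = coef(mcp,  lambda = 0.1)), 3)
```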
We will consider two commonly used ways to choose a value of the tuning parameter for the above penalty functions. Both are motivated by overcoming overfitting.
CV (cross-validation): Cross-validation means we take predictive ability into account to overcome overfitting. The idea is to divide the data into several, say \(k\), subsamples. Predict each subsample from a model fitted using only the other subsamples, and compare with the true values to obtain a subsample-specific prediction error. Combine these subsample-specific prediction errors to get an estimate of the overall prediction error. For each value of \(\lambda\), we can obtain an overall prediction error; choose the \(\lambda\) that minimizes this prediction error.
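For example, glmnet implements this through cv.glmnet() (a sketch with simulated data; 10 folds and the LASSO penalty are assumptions for illustration):

```r
## Choosing lambda by 10-fold cross-validation for the LASSO
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- as.numeric(X %*% c(1, -1, 0.5, rep(0, 7)) + rnorm(100))

cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cvfit$lambda.min              # lambda minimizing the CV prediction error
coef(cvfit, s = "lambda.min") # coefficients at that lambda
```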
BIC (Bayesian information criterion): BIC tries to overcome overfitting by combining the likelihood with a penalty term that depends on the number of selected variables in the model (the model complexity). Choose the \(\lambda\) that maximizes this penalized likelihood (equivalently, minimizes the BIC).
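A hand-rolled sketch of BIC-based selection over a LASSO path (assuming a Gaussian model, with the degrees of freedom taken as the number of non-zero coefficients; the data are simulated for illustration):

```r
## Choosing lambda by BIC over a LASSO path (Gaussian model)
library(glmnet)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 10), n, 10)
y <- as.numeric(X %*% c(1, -1, 0.5, rep(0, 7)) + rnorm(n))

fit <- glmnet(X, y, alpha = 1)
rss <- colSums((y - predict(fit, newx = X))^2)  # RSS at each lambda
df  <- fit$df                                   # no. of non-zero coefficients
bic <- n * log(rss / n) + log(n) * df           # BIC up to an additive constant

best <- fit$lambda[which.min(bic)]              # lambda minimizing BIC
coef(fit, s = best)
```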