Main concepts
[1] What is boosting? What is the basic procedure of boosting?
Both penalized regression and boosting are used to reduce overfitting in high-dimensional data (see [S1]). Specifically, both methods aim to maximize a ‘regularized’ likelihood. But, whereas in penalized regression, regularization is achieved by introducing an explicit penalty term, in boosting (algorithm), regularization takes place in form of algorithmic constraints (as we will see below).
Boosting, bagging and random forest, all are examples of ensemble methods – methods which combine results from multiple learning algorithms to obtain better predictive performance than could be obtained by any single constituent algorithm (think of random forest, combining predictions obtained from multiple decision trees). In boosting, the basic idea is to combine predictions obtained from multiple re-weighted versions of the original data.
Booting algorithms exist for both classification and regression. The basic procedure of boosting is as follows:
- Specify a base procedure, also called weak leraner (a sensible estimator for regression probelm such as regression tree; a sensible classifier for classification problem such as classification tree) to obtain prediction based on a given input data.
- Next, consider different re-weighted versions of the original data. Using each re-weighted data as input data in the base procedure, obtain the corresponding prediction (weak learner).
- Combine the predictions from all re-weighted data into a single ensemble estimate (strong learner).
Different boosting algorithms consider different choices of re-weighting mechanism and different forms of ensemble.
[2] What are the existing boosting methods for survival model? What is their advantage over penalized regression?
For survival data, there may be some covariates that are always to be included in the analysis as mandatory covariates (such as clinical covariates), whereas all other covariates are to be considered optional covariates that are chosen only if relevant. However, penalized regression does not naturally allow any unpenalized mandatory covariate in the model. This is problematic, because then such penalized model cannot be directly compared with a purely clinical model which is generally fit without penalization. For a valid comparison, in both fitted models, clinical covariates have to stay unpenalized.
Tutz & Binder (2007) proposed a boosting method for ridge regression that allows unpenalized mandatory covariates in the model. Binder and Schumacher (2008b) extended this method to the Cox proportional hazards model, and Binder et al. (2009) adopted it further to Fine-Gray proportional subdistribution hazards (PSH) model for competing risks, implemented in R package CoxBoost
. In contrast to gradient-based boosting (implemented e.g. in the glmboost() function in the R package mboost
, using the CoxPH loss function), CoxBoost
is not based on gradients of loss functions, but adapts the offset-based boosting approach from Tutz and Binder (2007) (as we will see below).
[3] What is the boosting algorithm for the FGR-PSH model?
The algorithm can be summarized as follws:
The whole boosting procedure consists of \(M\) boosting steps (just like \(B\) decision trees in a random forest)
Initialization : The parameter vector includes elements corresponding to the mandatory covariates and the optional covariates, and is initialized with 0 elements \((\hat{\beta}_0 = (0,...,0)')\). In all boosting steps, the object is to estimate the mandatory covariates elements without penalization and the optional covariates elements with penalization. Further, in each boosting step, previous boosting steps are incorporated through an offset, which is simply the linear predictor, i.e., linear combination of the covariates (offset = \(x'_i \hat{\beta}\)) with current estimates. Thus, the offset is initialized as 0.
Update mandatory covariates : Before each boosting step, all the mandatory covariates elements of the parameter vector are updated by one maximum partial likelihood Newton-Raphson step (see [S4]) (this involves no penalization), and subsequently, the offset is updated. The Fine-Gray log partial likelihood function is used (see [S3]).
Update optional covariates : In each boosting step, only one optional covariate element of the parameter vector is updated. To determine which one to update, compute estimates of all covariates using one penalized maximum partial likelihood Newton-Raphson step (see [S4]) incorporating the offset (involves penalization), then select to update that covariate which maximizes the score statistic (see [S5]). This penalized estimation is done by combining a penalty term with the Fine-Gray log partial likelihood (see [S3]).
Using the offset-based penalized partial likelihood estimation results in sparse fits similar to Lasso-like approaches, with many estimated coefficients being zero.
The number of boosting steps \(M\) can be determined by cross-validation.
Supporting background concepts
[S1] What are high-dimensional data/problems?
High-dimensional data/problems simply means data/problems where the number of unknown parameters to be estimated, which is generally equal to the number of independent variables / features (\(p\)), is large, making the dimension of the feature space high. However, there is no fixed definition of how large is large, and the use is context-dependent.
Classical statistical methods (such as multiple regression) work so long as the number of features is less than or equal to the number of observations / sample size as the sample size increases (i.e., so long as \(p \leq n\) as \(n \to \infty\)). Further, classical asymptotic results hold under the assumption that number of features is fixed even as the sample size increases (i.e., \(p\) is fixed as \(n \to \infty\)). But these classical procedures fail when \(p > n\) and/or \(p\) increases as \(n\) increases – situations which are very common in the contemporary real datasets, as massive data collection has been greatly facilitated by progressively advanced technology in past two decades. The term high-dimensional data has become popular because of the availability of such datasets. In many such real data, \(p\) is not only larger than \(n\), but is much larger than \(n\) (\(p \gg n\)) (such as microarray gene expression data, or GWAS SNP data or fMRI imaging data). [Note that, strictly speaking, from the statistical point of view, any \(p > n\) problem would be called a high-dimensional problem even when \(p\) itself is not very large (e.g., \(p=5\) and \(n=4\)). But practically, the problem of high-dimension arises when \(p\) itself is large.]
So, using order of magnitude idea (see [S2]), one can generally say high-dimensional data are data where \(p\) is one or several order of magnitude higher than \(n\). The notation \(\mathcal{O}\), read ‘big O,’ is often used to express such order relations.
\(p = \mathcal{O}(n)\) : means \(p\) is of the same order as \(n\)
\(p > \mathcal{O}(n)\) : means \(p\) is of a greater order than \(n\)
\(p = \mathcal{O}(n^c)\), \(c>1\) : means \(p\) is of polynomial order in \(n\)
\(p = \mathcal{O}(c^n)\), \(c>1\) : means \(p\) is of exponential order in \(n\).
High-dimensional data where \(p > \mathcal{O}(n^c)\) (\(p\) is of superpolynomial order in \(n\)) has been called ‘ultrahigh-dimensional’ data.
[S2] What is order of magnitude of a number?
The order of magnitude of a number (\(N\)) is usually the smallest power of 10 used to represent that number, and there are mainly two different ways of representation that are used in different places-
- representation of the form \(N = a \times 10^b\), with \({1/\sqrt{10} \leq a < \sqrt{10}}\), i.e, \(0.316 \leq a < 3.162\) (because the geometric mean of 1 and 10 is \(\sqrt{10}\)). So, any number in 0.316 and 3.162 (excluding 3.162) has order 0 in this representation. This representation determines order by the power of 10 that is nearest (in the ratio scale) to the given number.
- representation of the form \(N = a \times 10^b\), with \(1 \leq a < 10\). This representation is simply the scientific notation of the number. Any number in 1 to 10 (excluding 10) has order 0 in this representation. Generally, for any number \(N\), this order is the truncated value (i.e., the largest integer contained in) \(\log_{10} N\). So, an \(n\)-digit number has order \(n-1\). E.g., 3518588 has order 6 (\(\log_{10} (3518588) = 6.546\)).
Depending on which representation is considered, the order of a number can be different. For example, the order of the number 65 is 2 in first representation (as \(65 = 0.65 \times 10^2\)), but 1 in second representation (as \(65 = 6.5 \times 10^1\)); the order of 0.045 is -1 in first representation (as \(0.045 = 0.45 \times 10^{-1}\)), but -2 in second representation (as \(0.045 = 4.5 \times 10^{-2}\)).
Order of magnitude is generally used to make very approximate comparisons. Two numbers of the same order of magnitude have roughly the same scale in the sense that the larger value is less than ten times the smaller value. If two numbers differ by 1 order of magnitude, one is roughly 10 times larger than the other, if they differ by 2 orders of magnitude, one is roughly 100 times larger than the other, and so on.
[S3] What are different kinds/modifications of likelihood function?
Given, primary parameter (parameter of interest) \(\beta\) and nuisance parameter \(\delta\),
Ordinary likelihood: \(L(\beta, \delta | \text{data})\).
Inference based on maximum likelihood estimation: Ordinary likelihood maximized w.r.t. \(\beta\) and \(\delta\) \(\; = \; \underset{\beta,\delta}{\arg\max} \; L(\beta, \delta | \text{data}) \; = \; L(\beta_{MLE}, \delta_{MLE} | \text{data})\).
Limitation: Maximization can sometimes be computationally challenging/expensive.
Marginal likelihood / Integrated likelihood: If the conditional distribution of \(\delta\) given \(\beta\) can be identified as a standard distribution, one can simply ‘integrate out’ \(\delta\) from the ordinary likelihood to obtain the marginal likelihood of \(\beta\), \(L_{marg}(\beta | \text{data}) = \int_{\delta} L(\beta, \delta | \text{data}) d\delta\), that can be used for inference on \(\beta\).
Limitation: Conditional distribution often does not have any standard form (cannot benefit from well-known properties of standard distributions).
Partial likelihood: If we can write the ordinary likelihood function as \(L(\beta,\delta | \text{data}) \; = \; L_1(\beta | \text{data}) \; L_2(\beta, \delta | \text{data}),\) with \(L_1\) containing most information on \(\beta\) and \(L_2\) containing relatively little information on \(\beta\), then \(L_1(\beta | \text{data})\) is a partial likelihood of \(\beta\), and can be used for inference on \(\beta\).
Limitation: Less efficient than ordinary/marginal likelihood based estimators.
Profile likelihood: If we can express \(\delta\) as a function of \(\beta\), \(h(\beta)\), then the ordinary likelihood can be written, ‘profiling out’ \(\delta\), as \(L(\beta, h(\beta) | \text{data})\). This is called profile likelihood of \(\beta\), and can be used for inference on \(\beta\).
Penalized likelihood: Ordinary likelihood combined (regularized) with a penalty term, or penalty that depends on a parameter, called the regularization parameter or penalty parameter. It is generally expressed in the log-scale as: \(\log(L(\beta,\delta | \text{data})) - P_{\lambda}(\beta)\).
[S4] What is the Newton-Raphson method?
Newton’s method, or Newton-Raphson method is a numerical method of finding roots of a function \(f(x)\) by
solving \(f(x)=0\), that is to say, finding at what value(s) of \(x\) (in the X-axis), the functional value \(f(.)\) (height of the \(f(.)\) curve) is 0, or in other words, finding where the \(f(x)\) curve touches the X-axis.
This is a numerical method, and is to be used iteratively to get closer and closer approximations of the true root. Each iteration can be called a step. Given an initial guess \(x_n\), of the true root \(x_0\), one uses the tangent at \(x_n\) to find a better guess \(x_{n+1}\) (the X-value of the point where the tangent cuts the X-axis). Using the property that the slope of the tangent of \(f(x)\) at \(x_n\) is \(f'(x_n)\), the better guess can be written as
\(x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\).
This process is repeated to get increasingly improved guesses.
[S5] What is score function? What is score statistic?
The score function, or score, is the gradient (the vector of partial derivatives of a multiparameter function w.r.t. the individual parameters) of the log-likelihood function. For one-parameter function, the gradient is simply derivative of the function w.r.t. the parameter.
Recall that a derivative of a function \(f(x)\) w.r.t \(x\) is a measure of steepness of the function w.r.t. \(x\), which means a measure of how sharply the function changes as \(x\) changes; in other words, how sensitive is the function to change in \(x\). So, score function measures how sensitive is the log-likelihood function w.r.t changes in the parameters.
The maximum likelihood estimator (MLE) of the parameter(s) can be found by setting the score to zero, that is, by solving score = 0 (if the log-likelihood is concave [imagine dome-sized, looks like opening of a cave]).
MLE (\(\hat{\theta}\)) asymptotically has a normal distribution with mean equal to the true parameter value (\(\theta\)) and variance equal to the inverse of the so-called information matrix (\(I^{-1}(\theta)\)), i.e., \(\hat{\theta} \sim N(\theta,I^{-1}(\theta))\). A score test, which tests whether a given value \(\theta_0\) can be taken as the true parameter value in the light of the data, uses a score statistic that is a standardized measure of the distance between MLE and the a given tentative parameter value \(\theta_0\), expressed as \((\hat{\theta}-\theta_0)'I(\theta_0)(\hat{\theta}-\theta_0)\) [this follows a chi-square distribution]. If the given value is 0 (\(\theta_0=0\)), then a small enough score statistic (i.e., MLE is close to 0) will indicate the true parameter value can be taken as 0. On the other hand, a large score statistic will indicate MLE is far from 0, so 0 is less likely to be the true parameter value. Thus, the larger the score statistic, the more significant is the parameter.
References and other resources
Tutz G, Binder H: Boosting Ridge Regression. Computational Statistics & Data Analysis 2007, 51(12):6044-6059.
Binder,H. and Schumacher,M. (2008b) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 9, 14.
Binder, H., Allignol, A., Schumacher, M., & Beyersmann, J. (2009). Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics, 25(7), 890-896.