knitr::opts_chunk$set(echo = TRUE, fig.align = "center", out.width = "70%")

In the R package {bigstatsr}, you can fit efficient penalized (linear and logistic) regressions using the functions big_spLinReg() and big_spLogReg(). A similar implementation of Cox regression is an area of future development.

You might want to look at the corresponding paper and cite it:

Privé, Florian, Hugues Aschard, and Michael GB Blum. "Efficient implementation of penalized regression for genetic risk prediction." Genetics (2019). [Open access]

Data

To illustrate how to use big_spLinReg() and big_spLogReg(), let us use some simulated data:

library(bigstatsr)
set.seed(1)

# simulating some data
N <- 530
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y <- rowSums(X[, 1:100]) + 20 * rnorm(N)

ind.train <- sort(sample(nrow(X), 400))
ind.test <- setdiff(rows_along(X), ind.train)

Cross-Model Selection and Averaging (CMSA)

Functions big_spLinReg() and big_spLogReg() automatically perform a procedure similar to cross-validation to choose hyper-parameters $\lambda$ and $\alpha$ of the elastic net regularization:

knitr::include_graphics("https://raw.githubusercontent.com/privefl/paper2-PRS/master/figures/simple-CMSA.png")

Sequences of hyper-parameters
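
For example, you can provide several values for alphas and control the number of $\lambda$ values with nlambda (a minimal sketch; the values below are purely illustrative):

# Minimal sketch: test several values of alpha and use a shorter lambda sequence.
# alphas and nlambda are arguments of big_spLinReg(); the values are illustrative.
mod_alphas <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                           alphas = c(1, 0.5, 0.1), nlambda = 100, K = 4)
summary(mod_alphas)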

Stopping criterion for regularization path

There are three main reasons why a regularization path can end; the early stopping criterion discussed below is one of them.

It is possible to fit the K folds for each value of alphas in parallel using the parameter ncores.
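
For example (a sketch only; nb_cores() is a helper exported by {bigstatsr} to pick a sensible number of cores):

# Sketch: fit the folds (and the different values of alphas) in parallel.
mod_par <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                        alphas = c(1, 0.5), K = 4, ncores = nb_cores())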

You should check why regularization paths stop; for this, you can use both methods plot() and summary(). Let us look at an example below.

mod <- big_spLinReg(X, y[ind.train], ind.train = ind.train, K = 4)
summary(mod)
summary(mod)$message

Here, "No more improvement" means that the early stopping criterion has been reached. We can see the validation loss getting worse at some point:

plot(mod)

For small datasets, you might want to reduce the number of folds (K = 10 by default). For large datasets, you might want to decrease n.abort (i.e. stop the regularization path earlier), as the path is usually much smoother and fitting becomes more and more demanding as we advance along the regularization path (because more and more variables are used in the model).
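
For example (illustrative values only; K and n.abort are documented arguments of big_spLinReg()):

# Illustrative only: fewer folds and a stricter early stopping criterion.
mod_fast <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                         K = 5, n.abort = 5)
summary(mod_fast)$message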

Use predict() to get predictions from the model for the test set:

pred <- predict(mod, X, ind.test)
plot(pred, y[ind.test], pch = 20); abline(0, 1, col = "red")
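
To quantify predictive performance (a simple additional check, not part of the original example), you can e.g. compute the squared correlation between predicted and observed values:

# Squared correlation between predictions and observations in the test set
cor(pred, y[ind.test])^2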

Other parameters
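
For instance, both functions accept covariates through covar.train (a matrix aligned with ind.train); the sketch below only illustrates this argument and does not necessarily reflect all the other parameters this section refers to:

# Sketch: adding two simulated covariates via covar.train (illustrative only).
covar <- matrix(rnorm(N * 2), ncol = 2)
mod_covar <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                          covar.train = covar[ind.train, ], K = 4)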

Time and memory requirements (from the paper)

The computation time of our PLR implementation mainly depends on the sample size and the number of candidate variables (variables that are included in the gradient descent). Indeed, the algorithm is composed of two steps. First, for each variable, the algorithm computes a univariate statistic that is used to decide whether the variable is included in the model (for each value of $\lambda$). This first step is very fast (provided data can be read quickly from disk). Then, the algorithm iterates over a regularization path of decreasing values of $\lambda$, which progressively enables variables to enter the model (see the first figure). In this second step, the number of variables increases and computations stop when an early stopping criterion is reached (when prediction is getting worse on the corresponding validation set, see the first figure).
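
To make the early stopping criterion concrete, here is a small standalone illustration (not code from the package) that stops a path when the validation loss has not improved for n_abort consecutive steps:

# Standalone toy example of early stopping (not package code):
# stop when validation loss has not improved for n_abort consecutive steps.
val_loss <- c(10, 8, 7, 6.5, 6.4, 6.5, 6.6, 6.8, 7.0, 7.3)  # made-up losses
n_abort <- 3
best <- Inf
no_improve <- 0
for (i in seq_along(val_loss)) {
  if (val_loss[i] < best) {
    best <- val_loss[i]
    no_improve <- 0
  } else {
    no_improve <- no_improve + 1
    if (no_improve >= n_abort) {
      message("Stopped at step ", i)
      break
    }
  }
}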

For outcomes that require many variables to predict well, such as height, and when using huge datasets such as the UK Biobank, the algorithm might iterate over more than 100,000 variables, which is computationally demanding. Conversely, for outcomes that require only a few variables to predict, such as autoimmune diseases, the number of variables included in the model is much smaller, so that fitting is very fast (only 13 minutes for 150K women of the UK Biobank for breast cancer).

Memory requirements are tightly linked to computation time. Indeed, variables are accessed in memory through memory-mapping when they are used. When there is not enough memory left, the operating system (OS) frees some memory for new incoming variables. Yet, if too many variables are used in the gradient descent, the OS will regularly swap memory between disk and RAM, severely slowing down computations.


