knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Snpnet is a package that is used to fit the lasso on big genomics data. We assume the data are stored in .pgen/.pvar/.psam format by the PLINK library. The potential training/validation split can be specified with a separate column in the phenotype file.

The most essential parameters in the core function snpnet include:

Some additional important parameters for model building include:

The other parameters can be specified in a config list object, such as missing.rate, MAF.thresh, nCores, num.snps.batch (batch size M of the BASIL algorithm), save (whether to save intermediate results), results.dir, prevIter (when starting from the middle), use.glmnetPlus and glmnet.thresh (convergence threshold). More details can be seen in the function documentation. In particular, If we want to recover results and continue the procedure from a previous job, we should have save = TRUE and specify prevIter with the index of the last successful (and saved) iteration.

Snpnet depends on two other programs plink2 and zstdcat. If they are not already on the system serach path, it is important to specify their locations in the configs object and pass it to snpnet.

configs <- list(
  # results.dir = "PATH/TO/SAVE/DIR",  # needed when saving intermediate results
  # save = TRUE,  # save intermediate results per iteration (default FALSE)
  # nCores = 16,  # number of cores available (default 1)
  # niter = 100,  # max number of iterations (default 50)
  # prevIter = 15,  # if we want to start from some iteration saved in results.dir
  # use.glmnetPlus = TRUE,  # recommended for faster computation
  # early.stopping = FALSE,  # whether to stop based on validation performance (default TRUE)
  plink2.path = "plink2",   # path to plink2 program
  zstdcat.path = "zstdcat"  # path to zstdcat program
)
# check if the provided paths are valid
for (name in names(configs)) {
  tryCatch(system(paste(configs[[name]], "-h"), ignore.stdout = T),
    condition = function(e) cat("Please add", configs[[name]], "to PATH, or modify the path in the configs list.")
  )
}

A Simple Example

We demonstrate a simple lasso example first.

library(snpnet)
genotype.pfile <- file.path(system.file("extdata", package = "snpnet"), "sample")
phenotype.file <- system.file("extdata", "sample.phe", package = "snpnet")
phenotype <- "QPHE"
covariates <- c("age", "sex", paste0("PC", 1:10))

fit_snpnet <- snpnet(  
  genotype.pfile = genotype.pfile,
  phenotype.file = phenotype.file,
  phenotype = phenotype,
  covariates = covariates,
  # split.col = "split",  # split column name in phenotype.file with train/val/test labels
  # mem = 128000,  # amount of memory available (MB), recommended
  configs = configs
)  # we hide the intermediate messages

The intercept and coefficients can be extracted by fit_snpnet$a0 and fit_snpnet$beta. It also saves the evaluation metric, which by default is $R^2$ for the Gaussian family.

fit_snpnet$metric.train

We can make prediction with the fitted object fit_snpnet. For example,

pred_snpnet <- predict_snpnet(
  fit = fit_snpnet,
  new_genotype_file = genotype.pfile,
  new_phenotype_file = phenotype.file,
  phenotype = phenotype,
  covariate_names = covariates,
  split_col = "split",
  split_name = c("train", "val"),  # can also include "test" if such samples are available in the phenotype file
  configs = configs)

We can find out both the predicted values from the prediction field and the evaluation metrics from the metric field.

str(pred_snpnet$prediction)
str(pred_snpnet$metric)

Lasso with Refitting

Refitting is often recommended for the lasso/elastic-net to make the most of the validation set (more than serving for tuning parameter selection). One may take the following steps:

We show a code example of refitting below. To do that, we will need to use the split column in the phenotype file.

fit_snpnet_train <- snpnet(  
  genotype.pfile = genotype.pfile,
  phenotype.file = phenotype.file,
  phenotype = phenotype,
  covariates = covariates,
  split.col = "split",
  # mem = 128000,  # amount of memory available (MB), recommended
  configs = configs
)  # we hide the intermediate messages

Due to the default early stopping criterion, snpnet doesn't fit all the way to the end.

max_idx <- sum(!is.na(fit_snpnet_train$metric.val))
fit_snpnet_train$metric.val[1:max_idx]

To do the refitting, we have created a separate column split_refit in the phenotype file that merges the original train and val labels, and replaces the test labels with val so that snpnet will conveniently evaluate the test performance of the refit models.

library(data.table)
phe_tbl <- fread(phenotype.file)
table(phe_tbl$split, phe_tbl$split_refit)

We extract the exact same lambda sequence from the fit above and refit using the split_refit column. Note that we want to turn off early.stopping in this case.

configs[["early.stopping"]] <- FALSE
fit_snpnet_refit <- snpnet(  
  genotype.pfile = genotype.pfile,
  phenotype.file = phenotype.file,
  phenotype = phenotype,
  covariates = covariates,
  split.col = "split_refit",
  lambda = fit_snpnet_train$full.lams[1:max_idx],
  configs = configs
)  # we hide the intermediate messages

We may take a look at the refit training and test performance.

fit_snpnet_refit$metric.train
fit_snpnet_refit$metric.val  # this is in fact the performance evaluated on the test individuals

In the end, we should extract the test performance at the $\lambda$ value that achieves the best validation performance earlier. We may also output the corresponding model size.

opt_idx <- which.max(fit_snpnet_train$metric.val)
metric_optimal_test <- fit_snpnet_refit$metric.val[opt_idx]
size_optimal <- sum(fit_snpnet_refit$beta[[opt_idx]] != 0)
list(metric = metric_optimal_test, size = size_optimal)

Numerical Comparison with Glmnet

To compare with glmnet, we need to convert the genotype data into a normal R object.

mypfile <- dplyr::mutate(
  dplyr::rename(data.table::fread(paste0(genotype.pfile, '.psam')), 'FID' = '#FID'),
  ID = paste(FID, IID, sep='_')
  )
ids <- mypfile$ID
phe <- readPheMaster(phenotype.file, ids, "gaussian", covariates, phenotype, NULL, NULL, configs)
vars <- readRDS(system.file("extdata", "vars.rds", package = "snpnet"))
pvar <- pgenlibr::NewPvar(paste0(genotype.pfile, '.pvar.zst'))
pgen <- pgenlibr::NewPgen(paste0(genotype.pfile, '.pgen'), pvar = pvar, sample_subset = NULL)
data.X <- pgenlibr::ReadList(pgen, seq_along(vars), meanimpute=F)
colnames(data.X) <- vars
p <- ncol(data.X)
pnas <- numeric(p)
for (j in 1:p) {
  pnas[j] <- mean(is.na(data.X[, j]))
  data.X[is.na(data.X[, j]), j] <- mean(data.X[, j], na.rm = T)  # mean imputation
}
data.X <- as.matrix(cbind(age = phe$age, sex = phe$sex, phe[, paste("PC", 1:10, sep = "")], data.X))
data.y <- phe$QPHE
pfactor <- rep(1, p + 12)
pfactor[1:12] <- 0  # we don't penalize the covariates

fit_glmnet <- glmnet::glmnet(data.X, data.y, penalty.factor = pfactor, standardize = F)
# check difference of coefficients matched by the names
checkDiff <- function(x, y) {
  unames <- union(names(x), names(y))
  xf <- yf <- rep(0, length(unames))
  names(xf) <- names(yf) <- unames
  xf[match(names(x), unames)] <- x
  yf[match(names(y), unames)] <- y
  list(max = max(abs(xf-yf)), mean = mean(abs(xf-yf)))
}

We show the difference of the computed $\lambda$ sequence and the estimated coefficients. There is small discrepancy between the two solutions within the range of convergence threshold. The gap will shrink and eventually goes to 0 if we keep on tightening the threshold.

max(abs(fit_snpnet$full.lams - fit_glmnet$lambda*length(pfactor)/sum(pfactor)))  # adjustment due to some internal normalization by glmnet
checkDiff(fit_snpnet$beta[[6]], fit_glmnet$beta[, 6])

More Examples

We also show two more sophisticated usage of the snpnet function.

configs[["nCores"]] <- 2
configs[["num.snps.batch"]] <- 500
fit_snpnet_ent <- snpnet(
  genotype.pfile = genotype.pfile,
  phenotype.file = phenotype.file,
  phenotype = phenotype,
  covariates = covariates,
  alpha = 0.5,  # elastic-net
  split.col = "split",  # the sample phenotype file contains a column specifying the training/validation subsets
  configs = configs
)

When split.col is specified, validation metric will automatically be computed. Moreover, by default, early stopping is adopted to stop the process based on the validation metric (the extent controlled by stopping.lag in configs object).

fit_snpnet_ent$metric.train
fit_snpnet_ent$metric.val
fit_snpnet_bin <- snpnet(
  genotype.pfile = genotype.pfile,
  phenotype.file = phenotype.file,
  phenotype = "BPHE",  # binary phenotype with logistic regression
  covariates = covariates,
  alpha = 0.5,
  family = "binomial",
  split.col = "split",
  configs = configs
)

For binary phenotypes, instead of $R^2$, AUC is computed.

fit_snpnet_bin$metric.train
fit_snpnet_bin$metric.val


junyangq/GWASLasso documentation built on Nov. 25, 2022, 10:03 a.m.