GWASLasso: snpnet: Efficient Lasso Solver for Large SNP Data

View source: R/snpnet.R

snpnet

R Documentation

Fit the Lasso/Elastic-Net for Large Phenotype-Genotype Datasets

Description

Fit the entire lasso or elastic-net solution path using the Batch Screening Iterative Lasso (BASIL) algorithm on large phenotype-genotype datasets.

Usage

snpnet(genotype.pfile, phenotype.file, phenotype, family = NULL, covariates = NULL,
  alpha = 1, nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04),
  lambda = NULL, split.col = NULL, p.factor = NULL, status.col = NULL, mem = NULL,
  configs = NULL)
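
The defaults for nlambda and lambda.min.ratio in the signature above together determine the lambda grid. The following is a minimal base R sketch of how such a grid could be built. It assumes a glmnet-style log-spaced sequence, which this page does not state explicitly, and the values of nobs, nvars and lambda.max are hypothetical placeholders that would normally be derived from the data.

```r
# Hypothetical problem size; in a real run these come from the data after QC.
nobs <- 5000      # number of individuals (assumed)
nvars <- 200000   # number of variants (assumed)

# Hypothetical lambda.max: the smallest lambda at which all coefficients are zero.
lambda.max <- 0.25

# Defaults taken from the usage signature.
nlambda <- 100
lambda.min.ratio <- ifelse(nobs < nvars, 0.01, 1e-04)

# Assumption: a log-spaced grid from lambda.max down to
# lambda.max * lambda.min.ratio, as in glmnet.
lambda.seq <- exp(seq(log(lambda.max),
                      log(lambda.max * lambda.min.ratio),
                      length.out = nlambda))

range(lambda.seq)
```

A vector built this way could also be passed as the lambda argument, for example when refitting on a grid chosen in advance.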

Arguments

`genotype.pfile`	the prefix of the PLINK 2.0 pgen fileset that contains the genotypes. The files genotype.pfile.pgen, genotype.pfile.pvar.zst and genotype.pfile.psam are assumed to exist.
`phenotype.file`	the path of the file that contains the phenotype values and can be read as a table. It should contain FID (family ID) and IID (individual ID) columns identifying each individual, plus the phenotype column(s). Optionally, covariate columns and a column specifying the training/validation split can be included in this file.
`phenotype`	the name of the phenotype. Must be the same as the corresponding column name in the phenotype file.
`family`	the type of the phenotype: "gaussian", "binomial", or "cox". If not provided or NULL, it will be detected based on the number of levels in the response.
`covariates`	a character vector containing the names of the covariates included in the lasso fitting, whose coefficients will not be penalized. The names must exist in the column names of the phenotype file.
`alpha`	the elastic-net mixing parameter, where the penalty is defined as alpha * ||beta||_1 + (1-alpha)/2 * ||beta||_2^2. alpha = 1 corresponds to the lasso penalty, while alpha = 0 corresponds to the ridge penalty.
`nlambda`	the number of lambda values - default is 100.
`lambda.min.ratio`	smallest value for lambda, as a fraction of lambda.max, the (data derived) entry value, i.e. the smallest value for which all coefficients are zero. The default depends on the sample size nobs relative to the number of actual variables nvars (after QC filtering). If nobs > nvars, the default is 0.0001, close to zero. If nobs < nvars, the default is 0.01. A very small value of lambda.min.ratio will lead to a saturated fit in the nobs < nvars case.
`lambda`	the full lambda sequence on which the lasso/elastic-net will be solved. Once provided, 'nlambda' and 'lambda.min.ratio' will be ignored. It can be used for refitting after the optimal parameter is selected by validation.
`split.col`	the column name in the phenotype file that specifies the membership of individuals to the training or the validation set. The individuals marked as "train" and "val" will be treated as the training and validation set, respectively. When specified, the model performance is evaluated on both the training and the validation sets.
`p.factor`	a named vector of separate penalty factors applied to each coefficient. This is a number that multiplies lambda to allow different shrinkage. If not provided, default is 1 for all variables. Otherwise should be complete and positive for all variables.
`status.col`	the column name for the status column for Cox proportional hazards model. When running the Cox model, the specified column must exist in the phenotype file.
`mem`	Memory (MB) available for the program. It tells PLINK 2.0 the amount of memory it can harness for the computation. IMPORTANT if using a job scheduler.
`configs`	a list of other config parameters:
  `missing.rate`	variants are excluded if the missing rate exceeds this level. Default is 0.1.
  `MAF.thresh`	variants are excluded if the minor allele frequency (MAF) is lower than this level. Default is 0.001.
  `nCores`	the number of cores used for computation. You may use the maximum number of cores available on the computer. Default is 1 (single core).
  `num.snps.batch`	the number of variants added to the strong set in each iteration. Default is 1000.
  `niter`	the maximum number of iterations in the algorithm. Note that each iteration may find solutions for more than one lambda value. Default is 50.
  `prevIter`	if non-zero, the last successful iteration of the procedure, so that the run can restart from there. niter should be no less than prevIter.
  `save`	a logical value indicating whether to save the intermediate results (e.g. in case of job failure and restart).
  `results.dir`	the path to the directory where meta and intermediate results are saved.
  `meta.dir`	the relative path to the subdirectory used to store the computed summary statistics, e.g. mean, missing rate, standard deviation (when 'standardization = TRUE'). Needed when 'save = TRUE'. Default is "meta.dir/".
  `save.dir`	the relative path to the subdirectory used to store the intermediate results, so that they can be inspected or recovered from later. Needed when 'save = TRUE'. Default is "results/".
  `excludeSNP`	a character vector containing genotype names to exclude from the analysis.
  `nlams.init`	the number of lambdas considered in the first iteration. The default of 10 is a reasonable number to start with.
  `nlams.delta`	the number of lambdas by which the sequence is extended when few are left in the current sequence (not all lambdas are fit in every iteration; the sequence is extended only when most of the current ones have been completed and validated). Default is 5.
  `glmnet.thresh`	the convergence threshold used in glmnet/glmnetPlus.
  `keep`	a keep file in PLINK format, used to restrict the analysis to a subset of individuals.
  `use.glmnetPlus`	a logical value indicating whether to use glmnet with warm start, if the glmnetPlus package is available. Currently only the "gaussian" family is supported.
  `early.stopping`	a logical value indicating whether early stopping based on the validation metric is desired.
  `stopping.lag`	a parameter for the stopping criterion: the procedure stops after this number of consecutive decreases in the validation metric.
  `verbose`	a logical value indicating whether more detailed messages should be printed.
  `KKT.verbose`	a logical value indicating whether details of the KKT check should be printed.
  `increase.size`	the increase in batch size if the KKT condition fails often in recent iterations. Default is half of the batch size.
  `plink2.path`	the user-specified path to plink2 (default: plink2).
  `zstdcat.path`	the user-specified path to zstdcat (default: zstdcat).
  `zcat.path`	the user-specified path to zcat, used to read a zcat-compressed phenotype file (default: zcat).
  `rank`	if TRUE, the smallest lambda index at which each variable enters the model is recorded.
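
The arguments above can be put together as in the following sketch. It writes a toy phenotype table with the FID and IID columns, two covariates, one phenotype and a train/val split column, then assembles a call to snpnet. Everything specific is an assumption made for illustration: the column names age, sex, height and split, the genotype prefix path/to/genotypes, the memory value, the stopping.lag value, and the expectation that a partial configs list is completed with defaults. The fit itself only runs if the snpnet package is installed and the genotype files exist.

```r
# Toy phenotype table. Only the FID and IID column names are prescribed above;
# all other names and values are hypothetical.
set.seed(1)
n <- 8
pheno <- data.frame(
  FID    = sprintf("F%02d", seq_len(n)),
  IID    = sprintf("I%02d", seq_len(n)),
  age    = sample(40:70, n, replace = TRUE),
  sex    = rep(c(1, 2), length.out = n),
  height = round(rnorm(n, mean = 170, sd = 10), 1),
  split  = rep(c("train", "val"), times = c(6, 2)),
  stringsAsFactors = FALSE
)

# Write a tab-separated file that can be read back as a table.
phenotype.file <- tempfile(fileext = ".tsv")
write.table(pheno, phenotype.file, sep = "\t", quote = FALSE, row.names = FALSE)

# A subset of the config parameters listed above, mostly at their documented defaults.
configs <- list(
  missing.rate   = 0.1,
  MAF.thresh     = 0.001,
  nCores         = 1,
  num.snps.batch = 1000,
  niter          = 50,
  early.stopping = TRUE,
  stopping.lag   = 2,        # illustrative value, not a documented default
  plink2.path    = "plink2",
  zstdcat.path   = "zstdcat"
)

args <- list(
  genotype.pfile = "path/to/genotypes",  # hypothetical prefix of the pgen fileset
  phenotype.file = phenotype.file,
  phenotype      = "height",
  family         = "gaussian",
  covariates     = c("age", "sex"),
  alpha          = 1,                     # lasso penalty
  split.col      = "split",
  mem            = 16000,                 # MB, illustrative
  configs        = configs
)

# Only attempt the fit when the package and the genotype files are available.
pfile.exists <- file.exists(paste0(args$genotype.pfile, ".pgen"))
if (requireNamespace("snpnet", quietly = TRUE) && pfile.exists) {
  fit <- do.call(snpnet::snpnet, args)
}
```

In a real analysis the phenotype file would cover the individuals in the genotype fileset, and mem would match the memory requested from the job scheduler.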

Details

Junyang Qian, Wenfei Du, Yosuke Tanigawa, Matthew Aguirre, Robert Tibshirani, Manuel A. Rivas, and Trevor Hastie. "A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems." bioRxiv (2019): https://doi.org/10.1101/630079