multisnpnet: Fast Multi-Phenotype SRRR on SNP Data

multisnpnet R Documentation

Description

Fit a sparse reduced rank regression (SRRR) model on large-scale SNP data with multivariate responses, using batch variable screening and alternating minimization. It computes a full solution path on a grid of penalty values, and it can handle larger-than-memory SNP data, missing values, and adjustment covariates.

Usage

multisnpnet(genotype_file, phenotype_file, phenotype_names, binary_phenotypes = NULL,
  covariate_names, rank, nlambda = 100, lambda.min.ratio = 0.01, standardize_response = TRUE,
  weight = NULL, validation = FALSE, split_col = NULL, mem = NULL,
  batch_size = 100, prev_iter = 0, max.iter = 10, configs = NULL, save = TRUE,
  early_stopping = FALSE)
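
A minimal sketch of a call, using only arguments that appear in the signature above. Every path and column name is a hypothetical placeholder, and the package name multiSnpnet is assumed from the repository name. The call is guarded so that the snippet runs harmlessly when the package or the input files are absent.

```r
# All paths and column names below are hypothetical placeholders.
call_args <- list(
  genotype_file   = "data/cohort",      # expects cohort.pgen, cohort.psam, cohort.pvar.zst
  phenotype_file  = "data/cohort.phe",
  phenotype_names = c("height", "bmi"),
  covariate_names = c("age", "sex"),
  rank            = 2,
  nlambda         = 50,
  validation      = TRUE,
  split_col       = "split"
)

# Check that every input file named in the argument list exists.
inputs_present <- all(file.exists(
  paste0(call_args$genotype_file, c(".pgen", ".psam", ".pvar.zst")),
  call_args$phenotype_file
))

# Only attempt the fit when the package (name assumed) and the inputs are available.
if (requireNamespace("multiSnpnet", quietly = TRUE) && inputs_present) {
  fit <- do.call(multiSnpnet::multisnpnet, call_args)
}
```

Arguments left out of the list keep the defaults shown in the signature.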

Arguments

genotype_file

Path prefix of the suite of genotype files. The files genotype_file.pgen, genotype_file.psam, and genotype_file.pvar.zst must all exist.

phenotype_file

Path to the phenotype file. The header must include FID, IID, and the columns named in covariate_names and phenotype_names. Missing values are expected to be encoded as -9.

phenotype_names

Character vector of the names of the phenotypes (response columns) in the phenotype file.

binary_phenotypes

Names of the binary phenotypes. AUC will be evaluated for binary phenotypes.

covariate_names

Character vector of the names of the adjustment covariates.

rank

Target rank of the model.

nlambda

Number of penalty values.

lambda.min.ratio

Ratio of the minimum penalty to the maximum penalty.
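
To illustrate how nlambda and lambda.min.ratio together determine a penalty grid, here is a sketch in base R. The log-scale spacing follows the glmnet convention and is an assumption, since this page does not state the spacing. The value of lambda_max is a hypothetical placeholder; in practice the maximum penalty would be derived from the data.

```r
nlambda <- 100
lambda.min.ratio <- 0.01
lambda_max <- 2.5  # hypothetical value, normally derived from the data

# Decreasing grid, evenly spaced on the log scale (assumed spacing).
lambda_grid <- exp(seq(
  log(lambda_max),
  log(lambda_max * lambda.min.ratio),
  length.out = nlambda
))

range(lambda_grid)
```

With the defaults, the smallest penalty is 1% of the largest.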

standardize_response

Boolean. Whether to standardize the responses before fitting, so that responses measured in different units are comparable.
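
As an illustration of what standardizing responses with different units means, the sketch below centers and scales each column of a small hypothetical response matrix with base R's scale(), which ignores missing entries when computing the column means and standard deviations. Whether the package standardizes in exactly this way is an assumption.

```r
# Hypothetical responses on very different scales, with missing entries.
Y <- cbind(
  height_cm = c(172.5, NA, 160.2, 181.0, 168.4),
  bmi       = c(24.1, 27.8, NA, 22.3, 30.5)
)

# Center each column to mean 0 and scale it to standard deviation 1.
Y_std <- scale(Y)

col_means <- colMeans(Y_std, na.rm = TRUE)
col_sds   <- apply(Y_std, 2, sd, na.rm = TRUE)
```

After this step both columns contribute on a comparable scale, and the missing entries stay missing.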

weight

Numeric vector that specifies the (importance) weights for the responses.

p.factor

Named vector of separate penalty factors applied to each coefficient. Each factor is a number that multiplies lambda to allow different shrinkage. Default is 1 for all variables. The vector may name only some of the variables, in which case the rest are set to 1. All factors must be positive.
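
The partial specification can be pictured as follows. This is a sketch in base R, not the package's internal code, and the variable names are hypothetical.

```r
# Hypothetical variable names: two covariates and three variants.
variables <- c("age", "sex", "rs1", "rs2", "rs3")

# User supplies factors for only some variables.
p_partial <- c(rs2 = 0.5, rs3 = 2)

# Start from the default of 1, then overwrite the named entries.
p_full <- setNames(rep(1, length(variables)), variables)
p_full[names(p_partial)] <- p_partial

# Penalty factors must be positive.
stopifnot(all(p_full > 0))
```

Here rs2 is shrunk half as strongly as a default variable and rs3 twice as strongly.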

validation

Boolean. Whether to evaluate on a validation set.

split_col

Name of the column in the phenotype file that specifies whether each sample belongs to the training split or the validation split. The values are either "train" or "val".
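
The phenotype file requirements described above (FID and IID columns, covariates, phenotypes, -9 for missing values, and a train/val split column) can be sketched with a small hypothetical table. The tab delimiter, the column names, and the file name are assumptions made for illustration.

```r
# Hypothetical phenotype table; -9 marks a missing phenotype value.
phe <- data.frame(
  FID    = c("f1", "f2", "f3", "f4"),
  IID    = c("i1", "i2", "i3", "i4"),
  age    = c(54, 61, 47, 58),
  sex    = c(0, 1, 1, 0),
  height = c(172.5, -9, 160.2, 181.0),
  bmi    = c(24.1, 27.8, -9, 22.3),
  split  = c("train", "train", "val", "train"),
  stringsAsFactors = FALSE
)

# Write with a header row; the tab delimiter is an assumption.
phe_path <- file.path(tempdir(), "example.phe")
write.table(phe, phe_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and decode -9 as NA to check the layout.
check <- read.table(phe_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
pheno_cols <- c("height", "bmi")
check[pheno_cols] <- lapply(check[pheno_cols], function(x) replace(x, x == -9, NA))

n_missing    <- colSums(is.na(check[pheno_cols]))
split_counts <- table(check$split)
```

With this table, split_col would be "split", phenotype_names would be c("height", "bmi"), and covariate_names would be c("age", "sex").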

mem

Memory available for the program. It tells PLINK 2.0 the amount of memory it can harness for the computation. This is important when running under a job scheduler.

batch_size

Number of variants used in batch screening.

prev_iter

Index of the iteration to start from (e.g. to resume a previously interrupted computation).

max.iter

Maximum number of iterations allowed for alternating minimization.

configs

List of additional configuration parameters. It can include:

nCores

number of cores for the PLINK computation (default: 1)

results.dir

directory to save intermediate results if save=TRUE (default: temp directory created by the tempdir function)

thresh

convergence threshold for alternating minimization (default: 1E-7)

glmnet.thresh

convergence threshold for glmnet(Plus) (default: 1E-7)

plink2.path

path to the PLINK 2.0 program, if not on the system path

zstdcat.path

path to the zstdcat program, if not on the system path

use_safe

whether to use safe product to deal with very large matrix multiplication (default: TRUE). One may also specify MAXLEN (default: (2^31-1)/2), the maximum vector length passed to the R base matrix multiplication operation

excludeSNP

character vector containing genotype names to exclude from the analysis
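
The configuration entries listed above are passed as a plain named list. A sketch with hypothetical values follows; the variant ID format in excludeSNP is an assumption and should match the IDs in your genotype files.

```r
configs <- list(
  nCores        = 4,                                   # cores for the PLINK computation
  results.dir   = file.path(tempdir(), "srrr-results"), # used when save = TRUE
  thresh        = 1e-7,                                 # alternating minimization threshold
  glmnet.thresh = 1e-7,                                 # glmnet(Plus) threshold
  use_safe      = TRUE,                                 # safe product for very large matrices
  excludeSNP    = c("rs0000001_A")                      # hypothetical variant ID
)

# Names documented on this page, used here as a spelling check.
documented <- c("nCores", "results.dir", "thresh", "glmnet.thresh",
                "plink2.path", "zstdcat.path", "use_safe", "excludeSNP")
unknown <- setdiff(names(configs), documented)
```

The entries plink2.path and zstdcat.path are omitted here because they are only needed when the programs are not on the system path.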

save

Boolean. Whether to save intermediate results.

early_stopping

Boolean. Whether to stop the process early once the validation metric starts to fall.

early_stopping_phenotypes

List of phenotypes to focus on when evaluating the early stopping condition.

early_stopping_check_average

Whether to check the average metric when evaluating the early stopping condition.
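
One plausible reading of the averaged early stopping check is sketched below. The exact rule used by the package is not stated on this page, so the criterion (stop at the first penalty index where the average validation metric drops below its previous value) and all metric values are assumptions.

```r
# Hypothetical validation metric per phenotype (rows) along the penalty path (columns).
val_metric <- rbind(
  height = c(0.10, 0.18, 0.22, 0.21, 0.19),
  bmi    = c(0.05, 0.09, 0.12, 0.12, 0.11)
)

# Phenotypes to focus on, in the role of early_stopping_phenotypes.
focus <- c("height", "bmi")

# Average metric over the focus phenotypes at each penalty index.
avg_metric <- colMeans(val_metric[focus, , drop = FALSE])

# First index at which the average is lower than at the previous index.
stop_at <- which(diff(avg_metric) < 0)[1] + 1
```

In this example the average rises for the first three penalty values and first falls at the fourth, so the path would stop there under the assumed rule.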


junyangq/multiSnpnet documentation built on Oct. 19, 2023, 8:22 p.m.