knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
multiSnpnet is a package that is used to fit sparse reduced rank regression for multiple responses on big genomics data. It optimizes the following objective function: $$ \begin{aligned} \text{minimize} & \quad \frac{1}{2} \|\mathcal{P}\Omega(\mathbf{Y})-\mathcal{P}\Omega(\mathbf{X}\mathbf{UV}^\top)\|F^2 + \lambda \|\mathbf{U}\|{1,2}, \ \text{s.t.} & \quad \mathbf{V}^\top \mathbf{V} = \mathbf{I}, \end{aligned} $$ where $\mathcal{P}\Omega$ is a projection operator for missing values in $\mathbf{Y}$, i.e. $\mathcal{P}\Omega(\mathbf{Y}){ij} = \mathbf{Y}{ij}$ if $\mathbf{Y}{ij}$ is not missing and $0$ otherwise, and $\|\mathbf{U}\|{1,2} = \sum_{j=1}^p \|\mathbf{U}_{j\cdot}\|_2$.
We assume the data are stored in .pgen/.pvar/.psam format by the PLINK library. The potential training/validation split can be specified by a separate column in the phenotype file.
The most essential parameters in the core function multisnpnet
include:
genotype_file
: the PLINK 2.0 pgen file that contains genotype. We assume the existence of genotype_file.{pgen,pvar.zst,psam}.phenotype_file
: the path of the file that contains the phenotype values and can be read as a table. Missing values are expected to be encoded as -9.phenotype_names
: names of the phenotypes. Must be consistent with the column names in the phenotype file.covariate_names
: a character vector containing the names of the covariates included in the fitting, whose coefficients will not be penalized. The names must exist in the column names of the phenotype file. Can be an empty vector.rank
: target rank of the reduced rank model.validation
: whether to evaluate the model performance on the validation set.split_col
: the column name in the phenotype file that specifies the membership of individuals to the training or the validation set. The individuals marked as "train"
and "val"
will be treated as the training and validation set, respectively. If not specified, all the qualified samples will be treated as training samples.mem
: Memory (MB) available for the program. It tells PLINK 2.0 the amount of memory it can harness for the computation. IMPORTANT if using a job scheduler.batch_size
: the number of variants used in each batch screening and should be chosen based on the problem size and memory available. Default is 100.save
: a boolean indicating whether to save intermediate results. If TRUE
, results.dir
in the configs
list should be provided.configs
: a list of additional configuration parameters. It can include:results.dir
: path to the directory holding intermediate results, if save = TRUE
.nCores
: number of cores used for computation.Some additional important parameters for model building include:
nlambda
: the number of lambda values on the solution path.lambda.min.ratio
: the ratio of the minimum lambda considered versus the maximum lambda that makes all penalized coefficients zero.standardize_response
: whether to standardize the responses before fitting to deal with potential different units of the responses. Default is TRUE
.prev_iter
: index of the iteration to start from (e.g. to resume a previously interrupted computation). Set prev_iter = -1
if one wants to start from the last saved iteration.There are other parameters that can be used to control the training behavior, such as weight
to adjust the relative importance of the phenotypes, p.factor
to control the priority of each variable entering the model.
library(multiSnpnet)
configs <- list( plink2.path = "plink2", # path to plink2 program zstdcat.path = "zstdcat" # path to zstdcat program ) # check if the provided paths are valid for (name in names(configs)) { tryCatch(system(paste(configs[[name]], "-h"), ignore.stdout = T), condition = function(e) cat("Please add", configs[[name]], "to PATH, or modify the path in the configs list.") ) }
genotype_file <- file.path(system.file("extdata", package = "multiSnpnet"), "sample") phenotype_file <- system.file("extdata", "sample.phe", package = "multiSnpnet") phenotype_names <- c("QPHE", "BPHE") covariate_names <- c("age", "sex", paste0("PC", 1:10)) multisnpnet_results <- multisnpnet( genotype_file = genotype_file, phenotype_file = phenotype_file, phenotype_names = phenotype_names, covariate_names = covariate_names, rank = 2, nlambda = 10, validation = TRUE, split_col = "split", batch_size = 100, standardize_response = TRUE, save = TRUE ) # we hide the intermediate messages
We can make prediction with the fitted object in multisnpnet_results
. For example,
pred_multisnpnet <- predict_multisnpnet( fit = multisnpnet_results$fit_list, new_genotype_file = genotype_file, new_phenotype_file = phenotype_file, covariate_names = covariate_names, split_col = "split", split_name = c("train", "val"), # can also include "test" if such samples are available in the phenotype file binary_phenotypes = c("BPHE"))
We can extract the R-squared ($R^2$) for both phenotypes.
pred_multisnpnet$R2
The negative validation $R^2$'s are caused by overfitting. We can also obtain AUC for the specified binary phenotypes.
pred_multisnpnet$AUC
The coefficent matrix $C$ in the returned value from multiSnpnet::multisnpnet()
has the dimension of $p \times q$. One can visualize this matrix with a biplot visualization.
fit_single <- multisnpnet_results$fit if (require(ggrepel)) plot_biplot( svd(t(as.matrix(fit_single$C))), label=list( 'phenotype'=rownames(fit_single$A_init), # we should update this phenotype labels so that it matches with the simple example above. 'variant'=rownames(fit_single$C) ) )
By default, this function annotates 5 phenotypes and 5 variants based on the distance from the center of origin. One can annotate more points by specifying n_labels
option.
Alternatively, you can generate an interactive plot with plotly
with mouseover annotation of phenotypes and variants. In that case, you need to disable the static annotation by specifying use_ggrepel=FALSE
.
# if (!require('plotly', character.only = TRUE)) { # install.packages('plotly', dependencies = TRUE) # library('plotly', character.only = TRUE) # } if (require(plotly)) { plot_biplot( svd(t(as.matrix(fit_single$C))), label=list( 'phenotype'=rownames(fit_single$A_init), # we should update this phenotype labels so that it matches with the simple example above. 'variant'=rownames(fit_single$C) ), use_ggrepel=FALSE ) %>% plotly::ggplotly() %>% htmlwidgets::saveWidget('multisnpnet.biplot.html', selfcontained = F, libdir = "lib") }
You can open the resulting html file (multisnpnet.biplot.html
) with your favorite browser.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.