knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

multiSnpnet is a package that is used to fit sparse reduced rank regression for multiple responses on big genomics data. It optimizes the following objective function: $$ \begin{aligned} \text{minimize} & \quad \frac{1}{2} \|\mathcal{P}\Omega(\mathbf{Y})-\mathcal{P}\Omega(\mathbf{X}\mathbf{UV}^\top)\|F^2 + \lambda \|\mathbf{U}\|{1,2}, \ \text{s.t.} & \quad \mathbf{V}^\top \mathbf{V} = \mathbf{I}, \end{aligned} $$ where $\mathcal{P}\Omega$ is a projection operator for missing values in $\mathbf{Y}$, i.e. $\mathcal{P}\Omega(\mathbf{Y}){ij} = \mathbf{Y}{ij}$ if $\mathbf{Y}{ij}$ is not missing and $0$ otherwise, and $\|\mathbf{U}\|{1,2} = \sum_{j=1}^p \|\mathbf{U}_{j\cdot}\|_2$.

We assume the data are stored in .pgen/.pvar/.psam format by the PLINK library. The potential training/validation split can be specified by a separate column in the phenotype file.

The most essential parameters in the core function multisnpnet include:

Some additional important parameters for model building include:

There are other parameters that can be used to control the training behavior, such as weight to adjust the relative importance of the phenotypes, p.factor to control the priority of each variable entering the model.

A Simple Example

library(multiSnpnet)
configs <- list(
  plink2.path = "plink2",   # path to plink2 program
  zstdcat.path = "zstdcat"  # path to zstdcat program
)
# check if the provided paths are valid
for (name in names(configs)) {
  tryCatch(system(paste(configs[[name]], "-h"), ignore.stdout = T),
    condition = function(e) cat("Please add", configs[[name]], "to PATH, or modify the path in the configs list.")
  )
}
genotype_file <- file.path(system.file("extdata", package = "multiSnpnet"), "sample")
phenotype_file <- system.file("extdata", "sample.phe", package = "multiSnpnet")

phenotype_names <- c("QPHE", "BPHE")
covariate_names <- c("age", "sex", paste0("PC", 1:10))

multisnpnet_results <- multisnpnet(
  genotype_file = genotype_file,
  phenotype_file = phenotype_file,
  phenotype_names = phenotype_names,
  covariate_names = covariate_names,
  rank = 2,
  nlambda = 10,
  validation = TRUE,
  split_col = "split",
  batch_size = 100,
  standardize_response = TRUE,
  save = TRUE
) # we hide the intermediate messages

We can make prediction with the fitted object in multisnpnet_results. For example,

pred_multisnpnet <- predict_multisnpnet(
  fit = multisnpnet_results$fit_list,
  new_genotype_file = genotype_file,
  new_phenotype_file = phenotype_file,
  covariate_names = covariate_names,
  split_col = "split",
  split_name = c("train", "val"),  # can also include "test" if such samples are available in the phenotype file
  binary_phenotypes = c("BPHE"))

We can extract the R-squared ($R^2$) for both phenotypes.

pred_multisnpnet$R2

The negative validation $R^2$'s are caused by overfitting. We can also obtain AUC for the specified binary phenotypes.

pred_multisnpnet$AUC

Example of Biplot Visualization

The coefficent matrix $C$ in the returned value from multiSnpnet::multisnpnet() has the dimension of $p \times q$. One can visualize this matrix with a biplot visualization.

fit_single <- multisnpnet_results$fit
if (require(ggrepel))
plot_biplot(
  svd(t(as.matrix(fit_single$C))),
  label=list(
    'phenotype'=rownames(fit_single$A_init), # we should update this phenotype labels so that it matches with the simple example above.
    'variant'=rownames(fit_single$C)
  )
)

By default, this function annotates 5 phenotypes and 5 variants based on the distance from the center of origin. One can annotate more points by specifying n_labels option.

Alternatively, you can generate an interactive plot with plotly with mouseover annotation of phenotypes and variants. In that case, you need to disable the static annotation by specifying use_ggrepel=FALSE.

# if (!require('plotly', character.only = TRUE)) {
#   install.packages('plotly', dependencies = TRUE)
#   library('plotly', character.only = TRUE)
# }
if (require(plotly)) {
  plot_biplot(
    svd(t(as.matrix(fit_single$C))),
    label=list(
      'phenotype'=rownames(fit_single$A_init), # we should update this phenotype labels so that it matches with the simple example above.
      'variant'=rownames(fit_single$C)
    ),
    use_ggrepel=FALSE
  ) %>%
    plotly::ggplotly() %>%
    htmlwidgets::saveWidget('multisnpnet.biplot.html', selfcontained = F, libdir = "lib")
}

You can open the resulting html file (multisnpnet.biplot.html) with your favorite browser.



junyangq/multiSnpnet documentation built on Oct. 19, 2023, 8:22 p.m.