Introduction of mlm4omics

library(knitr)
library(rmarkdown)
knitr::opts_chunk$set(echo = TRUE)

Introduction

Summary

High throughput mass-spectrometry clinical proteomics studies curate abundance data with hierarchical structure. The non-random missingness of the measurements from a vast amount of information also adds complexity in data analysis. We propose a multivariate 2 level model to analyze protein abundances and to handle abundance-dependent Missingness within a Bayesian framework. The proposed model enables the variance decomposition at different levels of the data hierarchy and provides central shrinkage of protein-level estimates for proteins with small numbers of observations. A logistic regression model with informative prior of the regression coefficient is used to model the missing probability. The non-random missing and censored values are treated as unknown parameters in a Bayesian model. Hamiltonian MC/No-U-Turn Sampling algorithm is created to derive the posterior distribution of the unknown parameters. Hamiltonian MC is demonstrated to gain more efficiency for these high-dimensional correlated data.

The proposed model

We use an inference model within a multivariate-multilevel framework to analyze the protein abundance data that are curated in clinical proteomics studies. These models decompose variations from different sources, such as experimental factors, proteins' biological features and biological samples' physiological properties. It is also known that in the ion intensities data which is used to construct protein's relative abundance, its missing values depend on the unobserved abundance values and the probability of missing associated with some observable experiment factors. In addition to this, due to the detection limit of a device, we also observe left-censoring values in the output from many studies. We model non-random missingness as either left-censored or completely missing using one of the proposed likelihood-based method of Little and Rubin.
Little (1993), Little and Rubin (1987) proposed three likelihood-based methods (selection model, pattern-mixture model and shared-parameter model) for Missing Not At Random (MNAR). Selection model, in which the missing data mechanism and parameters of the inferential model are conditionally estimated given the hypothetical completed data, is an appropriate approach given the abundance-dependent missingness. The hypothetical completed data comprises the missing and observed values. The known physical properties of peptides and mass-spectrometers provide the auxiliary information for the missing probability and missing values. We use the selection model factorization and include the left-censored missingness under a Bayesian framework. In two case studies, we estimate the likelihood of missingness using auxiliary variable mass-to-charge ratio (m/z) and the intensity values.

The computing algorithm

Our proposed method utilizes Hamiltonian MC/Non-U-Turn Sampling for the posterior distribution. Compared to the Gibb sampling, Hamiltonian MC (HMC) avoids the random walk by introducing the leapfrog function. It provides an alternative to approximating the solution on the continuous time scale from the solutions on the discrete time scale, with a specified step-size. The logarithmic posterior probability function was simulated by one of the paired partial differentiated equations, namely the Hamiltonian. Larger moving steps are generated from the leap frog scheme, and this helps to improve the convergence compared to the random walk. It has been shown to have higher efficiency in sampling high-dimensional correlated multivariate distributions. RStan is a new tool recently developed to implement HMC modeling for Bayesian data analysis, of which the posterior distribution can be sampled using the No-U-Turn Sampling method -an extension of HMC. No-U-Turn sampling implements a recursive algorithm that will enable auto-tuning of the couple parameters in HMC: numbers of leap-frog steps and the discrete step. We based on Rstan library function stan() to set up the program for our proposed model. The R library "mlm4omics" is written to implement posterior samplings using HMC.

The mlm4omics R package

The functions included in the pacakge "mlm4omics" are:

  1. setinitialvalue(): this function generates initiating values for the parameters in the computing.

  2. mlmc(): this function provides Rstan program to estimate the posterior distribution of parameters in a linear regression model with multivariate normal distributed respondents , for example the log (peptide ion intensity) from mass-spec, and explanatory variables such as a protein identification. Other explanatory variables can be experimental variables and clinical design variables. The protein identification is an ID to link the peptides to its parent protein. The probability of missing, missing and censored values of the respondent variable are estimated in the algorithm through a logistic regression and a truncated normal distribution for the censored values. The probability of missing can be fitted via explanatory variables such as the abundance and mass-to-charge ratio of the observed ion intensity. The censored values are modelled through defining a truncated normal as its prior distribution. mlmc() function also provides option (respond_dep_missing=TRUE/FALSE) to choose if the missing probability is depending on the magnitude of missing value.

  3. mlmm(): this function is similar to mlmc(), but it is written for data without any censored respondent values in the regression model. The input variables in this function are similar to those in mlmm() but without terms for censored missing.

Installation

There are two ways to install the package. Before install the package, please use the instructions from https://github.com/stan-dev/rstan/wiki to install the rstan package firstly.

Use bioconductor as installation source

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("mlm4omics")
library(mlm4omics)

Use github as installation source

devtools::install_github("superOmics/mlm4omics")
library(mlm4omics)

Instructions and Examples of using main functions for regressions

The following section provides examples using a simulated dataset: pdata. The examples include:

  1. Examples with no informative prior to call the compiled model; and the default priors are used.

  2. Example with informative priors.

When set up the formulas and parameters for using the mlmm() and mlmc() functions, some instructions are as follows:

1.The formula_subject will need to have at least one variable. For example, to include subject-id for subject and this will provide the mean value for each subject.

2.The missing and censor values in the response variable require Different indicators, the missing and censor indictors are mutually exclusive.

3.The number of iterations are better to set to be >1000. The default burn-in is half of the set iterations.

Read data and check data

library(mlm4omics)
data(pdata,package="mlm4omics")
pdata[1:2,]

Check for missing and censor indictors

Check if there is overlapped definition for missing and censor.
The definition for missing and censored are differently defined. The following code is to check if the indicators for missing and censored are mutually exclusive.

table(pdata$miss, pdata$censor)

If the indictors coded for missing and censored are overlapped, the following codes convert missing indictor from 1 to 0 where censor =1.

n=dim(pdata)[1]
for (i in seq_len(n))
if (pdata$miss[i]==1 && pdata$censor[i]==1) pdata$miss[i]=0

After that, we can set the formula for response, probability of missing and subject as the followings:

Set the formula for response, missingness and subject

formula_completed=var1~var2+treatment;
formula_missing=miss~var2;
formula_censor=censor~1;
formula_subject=~treatment;
response_censorlim=0.002

Where var1 is the response variable, miss and censor are the indicator variables for missing and censoring respectively.

Set the formula and fit the regression model

The formula for subject requires at least one variable. For example, to include subject-id for subject and this will provide the mean value for each subject.

R examples for mlmc()

Example 1a. Use default prior and call complied codes

When no priors information available, we can have posterior samples based on the compiled codes, given the default flat prior of precision matrix of the protein level covariates (a second level unit in the multilevel model), and prior for regression coefficients of the logistic regression predicting the missing probability.

model1 <- mlmc(formula_completed = var1~var2+treatment,
formula_missing = miss~var2, 
formula_censor = censor~1, formula_subject = ~sid, 
pdata = pdata, response_censorlim = 0.002, respond_dep_missing = TRUE, 
pidname = "geneid", sidname = "sid", 
iterno = 100, chains = 2, savefile = TRUE)

Example 1b-1c. Use a provided informative priors

Example 1b is to set priors for regression coefficients in the logistic regression model for missingness. alpha_prior is a prior for coefficients (including intercept) in the logistic regression to predict missing probability; it is a vector with a length of (1+number of predictors).

Example 1c is to Use an informative prior for precision matrix in the regression for completed data. prec_prior is the precision matrix prior of regression coefficients -intercept and predictor (var2) in the completed data formula.

#Example 1b
model1 <- mlmc(formula_completed=var1~var2+treatment, 
formula_missing=miss~var2, 
formula_censor=censor~1, formula_subject=~sid, 
pdata=pdata, response_censorlim=0.002,
respond_dep_missing=TRUE, pidname="geneid", sidname="sid", 
alpha_prior <- c(0,0.001), iterno=100, chains=2, 
savefile=TRUE)

#Example 1c
prec_example <- matrix(c(0.01,0.001,0.001,0.01),nrow=2,ncol=2)

model3 <- mlmc(formula_completed=var1~var2, 
formula_missing=miss~var2, 
formula_censor=censor~1, formula_subject=~sid+treatment, 
pdata=pdata, response_censorlim=0.002, respond_dep_missing=TRUE,
pidname="geneid", sidname="sid", prec_prior=prec_example,
iterno=1000, chains=2, savefile=TRUE)

R examples for mlmm()

Formulas for completed data, subject and missingness all require at least one variable.

Example 2a. Use default prior

model5 <- mlmm(formula_completed=var1~var2+treatment, 
formula_missing=miss~var2, 
formula_subject=~sid, pdata=pdata,
respond_dep_missing=FALSE, pidname="geneid", sidname="sid",
iterno=1000, chains=2, savefile=FALSE)

Example 2b-2c. Provide informative priors

The user define priors include:

a) prec_prior : the precision matrix of the first-level parameters (var2 and treatment);

b) alpha_prior: coefficients for intercept and var2 to predict probability of having missing value in logistic regression.

prec_example <- matrix(c(0.01,0.001,0.001,0.001,0.01,0.001,0.001,0.001,0.01),
nrow=3, ncol=3)
#Example 2b use both priors
model4 <- mlmm(formula_completed=var1~var2+treatment, 
formula_missing=miss~var2,
formula_subject=~sid, pdata=pdata,
respond_dep_missing=FALSE, pidname="geneid", sidname="sid",
alpha_prior <- c(0,0.001), prec_prior=prec_example,
iterno=100, chains=2, savefile=TRUE)

#Example 2c. Use alpha prior 
model5 <- mlmm(formula_completed=var1~var2+treatment, 
formula_missing=miss~var2,
formula_subject=~sid+treatment, pdata=pdata,
respond_dep_missing=FALSE, pidname="geneid",alpha_prior <- c(0,0.001),
sidname="sid", iterno=100, chains=2, savefile=TRUE)

How to make plots of the parameters and to diagnose convergence

If the estimate converges, it will have high number of efficient samples and the rhat value will close to 1.

The following codes are to:

1) plot posterior parameter using the outsummary results from multiple chains.

2) plot the trajectory of posterior sampling including those values generated from burn-in iterations.

plot posterior parameters

summaryreader <- read.csv(file = file.path(getwd(),"outsummary.csv"),
header=TRUE, sep=",", skip=0)

iterno <- dim(summaryreader)[1]; burnin <- iterno/2

U.1.1 <- rowMeans(matrix(c(summaryreader$chain.1.U.1.1,
summaryreader$chain.2.U.1.1), nrow=iterno, ncol=2))[burnin:iterno]

meanU <- mean(U.1.1)
qU <- quantile(U.1.1,p <- seq(0, 1, by=0.025))
scale <- seq(0, 1, by=0.025)

plot(scale, qU, pch=19, ylab="quantiles of estimate", xlab="quantiles")

segments(0,qU[names(qU)=="50%"],1,qU[names(qU)=="50%"],lwd=2,col="red")
segments(0,qU[names(qU)=="2.5%"],1,qU[names(qU)=="2.5%"],lty=2,lwd=2,col="red")
segments(0,qU[names(qU)=="97.5%"],1,qU[names(qU)=="97.5%"],lty=2,lwd=2,
col="red")

legend(0.5,qU[names(qU)=="50%"],"median",cex=0.8,bty="n")
legend(0.03,qU[names(qU)=="2.5%"],"2.5%",cex=0.8,bty="n")
legend(0.90,qU[names(qU)=="97.5%"],"97.5%",cex=0.8,bty="n")
qU

plot trajectory of the posterior samples

sample1reader <- read.csv(file <- file.path(getwd(),"samples_1.csv"),
header=TRUE, sep=",", skip=25)

sample2reader <- read.csv(file <- file.path(getwd(),"samples_2.csv"),
header=TRUE, sep=",", skip=25)

#plot variable U.1.1 - the intercept of first unit
trajectory_length <- dim(sample1reader)[1]

plot(seq(1, trajectory_length, by=1), sample1reader$U.1.1, xlab="trajectory
number", ylab="U.1.1", type="n",
ylim = c(min(sample1reader$U.1.1, sample2reader$U.1.1, na.rm=TRUE),
max(sample1reader$U.1.1, sample2reader$U.1.1, na.rm=TRUE)))

trajectory <- seq(1, trajectory_length, by=1)

lines(trajectory, sample1reader$U.1.1)
lines(trajectory, sample2reader$U.1.1, col="red")

sessionInfo

All of the outputs in this vignette are produced under the following conditions:

sessionInfo()

References

  1. Hoffman, M.D. and Gelman, A. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of machine learning research 2011;12.
  2. Little, R.J.A. Pattern-mixture models for multivariate incomplete data. J. Am. Statist. Assoc. 1993;88:125-134.
  3. Little, R.J.A. and Rubin, D.B. Statistical Analysis with Missing Data. 1987. Neal, R.M. MCMC using Hamiltonian dynamics. In: S. Brooks, e.a., editor, Handbook of Markov Chain Monte Carlo. 2011.
  4. Zeng,I.S. PhD thesis. Statistical methods in clinical proteomic studies 2007-2014 -A protein concerto-. Research Space: The University of Auckland.
  5. Zeng, I.S. Topics in Study Design and Analysis for Multistage Clinical Proteomics In: Jung, K., editor, Statistical Analysis in Proteomics New York: Humana Press: Springer; 2015.
  6. Hrydziuszko, O., and Viant, M. R.. Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline. Metabolomics 2012;8,S161S174.
  7. Zeng, I.S., Lumley, T., Ruggiero, K. and Middleditch, M. A Bayesian approach to multivariate and multilevel modelling with non-random missingness for hierarchical clinical proteomics data. In: https://www.biorxiv.org/content/early/2017/06/21/153049; 2017.


Try the mlm4omics package in your browser

Any scripts or data that you put into this service are public.

mlm4omics documentation built on Oct. 31, 2019, 9:43 a.m.