library(knitr) library(rmarkdown) knitr::opts_chunk$set(echo = TRUE)
High throughput mass-spectrometry clinical proteomics studies curate abundance data with hierarchical structure. The non-random missingness of the measurements from a vast amount of information also adds complexity in data analysis. We propose a multivariate 2 level model to analyze protein abundances and to handle abundance-dependent Missingness within a Bayesian framework. The proposed model enables the variance decomposition at different levels of the data hierarchy and provides central shrinkage of protein-level estimates for proteins with small numbers of observations. A logistic regression model with informative prior of the regression coefficient is used to model the missing probability. The non-random missing and censored values are treated as unknown parameters in a Bayesian model. Hamiltonian MC/No-U-Turn Sampling algorithm is created to derive the posterior distribution of the unknown parameters. Hamiltonian MC is demonstrated to gain more efficiency for these high-dimensional correlated data.
We use an inference model within a multivariate-multilevel framework
to analyze the protein abundance data that are curated in clinical
proteomics studies. These models decompose variations from different
sources, such as experimental factors, proteins' biological features
and biological samples' physiological properties. It is also known
that in the ion intensities data which is used to construct protein's
relative abundance, its missing values depend on the unobserved
abundance values and the probability of missing associated with some
observable experiment factors. In addition to this, due to the detection
limit of a device, we also observe left-censoring
values in the output from many studies. We model
non-random missingness as either left-censored or completely missing
using one of the proposed likelihood-based method of Little and Rubin.
Little (1993), Little and Rubin (1987) proposed three likelihood-based methods
(selection model, pattern-mixture model and shared-parameter model) for
Missing Not At Random (MNAR). Selection model, in which the missing data
mechanism and parameters of the inferential model are conditionally
estimated given the hypothetical completed data, is an appropriate
approach given the abundance-dependent missingness. The hypothetical
completed data comprises the missing and observed values. The known
physical properties of peptides and mass-spectrometers provide the
auxiliary information for the missing probability and missing
values. We use the selection model factorization and include
the left-censored missingness under a Bayesian framework. In two case
studies, we estimate the likelihood of missingness using auxiliary variable
mass-to-charge ratio (m/z) and the intensity values.
Our proposed method utilizes Hamiltonian MC/Non-U-Turn Sampling for the posterior distribution. Compared to the Gibb sampling, Hamiltonian MC (HMC) avoids the random walk by introducing the leapfrog function. It provides an alternative to approximating the solution on the continuous time scale from the solutions on the discrete time scale, with a specified step-size. The logarithmic posterior probability function was simulated by one of the paired partial differentiated equations, namely the Hamiltonian. Larger moving steps are generated from the leap frog scheme, and this helps to improve the convergence compared to the random walk. It has been shown to have higher efficiency in sampling high-dimensional correlated multivariate distributions. RStan is a new tool recently developed to implement HMC modeling for Bayesian data analysis, of which the posterior distribution can be sampled using the No-U-Turn Sampling method -an extension of HMC. No-U-Turn sampling implements a recursive algorithm that will enable auto-tuning of the couple parameters in HMC: numbers of leap-frog steps and the discrete step. We based on Rstan library function stan() to set up the program for our proposed model. The R library "mlm4omics" is written to implement posterior samplings using HMC.
The functions included in the pacakge "mlm4omics" are:
setinitialvalue(): this function generates initiating values for the parameters in the computing.
mlmc(): this function provides Rstan program to estimate the posterior distribution of parameters in a linear regression model with multivariate normal distributed respondents , for example the log (peptide ion intensity) from mass-spec, and explanatory variables such as a protein identification. Other explanatory variables can be experimental variables and clinical design variables. The protein identification is an ID to link the peptides to its parent protein. The probability of missing, missing and censored values of the respondent variable are estimated in the algorithm through a logistic regression and a truncated normal distribution for the censored values. The probability of missing can be fitted via explanatory variables such as the abundance and mass-to-charge ratio of the observed ion intensity. The censored values are modelled through defining a truncated normal as its prior distribution. mlmc() function also provides option (respond_dep_missing=TRUE/FALSE) to choose if the missing probability is depending on the magnitude of missing value.
mlmm(): this function is similar to mlmc(), but it is written for data without any censored respondent values in the regression model. The input variables in this function are similar to those in mlmm() but without terms for censored missing.
There are two ways to install the package. Before install the package, please use the instructions from https://github.com/stan-dev/rstan/wiki to install the rstan package firstly.
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("mlm4omics") library(mlm4omics)
devtools::install_github("superOmics/mlm4omics") library(mlm4omics)
The following section provides examples using a simulated dataset: pdata. The examples include:
Examples with no informative prior to call the compiled model; and the default priors are used.
Example with informative priors.
When set up the formulas and parameters for using the mlmm() and mlmc() functions, some instructions are as follows:
1.The formula_subject will need to have at least one variable. For example, to include subject-id for subject and this will provide the mean value for each subject.
2.The missing and censor values in the response variable require Different indicators, the missing and censor indictors are mutually exclusive.
3.The number of iterations are better to set to be >1000. The default burn-in is half of the set iterations.
library(mlm4omics) data(pdata,package="mlm4omics") pdata[1:2,]
Check if there is overlapped definition for missing and censor.
The definition for missing and censored are differently defined.
The following code is to check if the indicators for missing and
censored are mutually exclusive.
table(pdata$miss, pdata$censor)
If the indictors coded for missing and censored are overlapped, the following codes convert missing indictor from 1 to 0 where censor =1.
n=dim(pdata)[1] for (i in seq_len(n)) if (pdata$miss[i]==1 && pdata$censor[i]==1) pdata$miss[i]=0
After that, we can set the formula for response, probability of missing and subject as the followings:
formula_completed=var1~var2+treatment; formula_missing=miss~var2; formula_censor=censor~1; formula_subject=~treatment; response_censorlim=0.002
Where var1 is the response variable, miss and censor are the indicator variables for missing and censoring respectively.
The formula for subject requires at least one variable. For example, to include subject-id for subject and this will provide the mean value for each subject.
When no priors information available, we can have posterior samples based on the compiled codes, given the default flat prior of precision matrix of the protein level covariates (a second level unit in the multilevel model), and prior for regression coefficients of the logistic regression predicting the missing probability.
model1 <- mlmc(formula_completed = var1~var2+treatment, formula_missing = miss~var2, formula_censor = censor~1, formula_subject = ~sid, pdata = pdata, response_censorlim = 0.002, respond_dep_missing = TRUE, pidname = "geneid", sidname = "sid", iterno = 100, chains = 2, savefile = TRUE)
Example 1b is to set priors for regression coefficients in the logistic regression model for missingness. alpha_prior is a prior for coefficients (including intercept) in the logistic regression to predict missing probability; it is a vector with a length of (1+number of predictors).
Example 1c is to Use an informative prior for precision matrix in the regression for completed data. prec_prior is the precision matrix prior of regression coefficients -intercept and predictor (var2) in the completed data formula.
#Example 1b model1 <- mlmc(formula_completed=var1~var2+treatment, formula_missing=miss~var2, formula_censor=censor~1, formula_subject=~sid, pdata=pdata, response_censorlim=0.002, respond_dep_missing=TRUE, pidname="geneid", sidname="sid", alpha_prior <- c(0,0.001), iterno=100, chains=2, savefile=TRUE) #Example 1c prec_example <- matrix(c(0.01,0.001,0.001,0.01),nrow=2,ncol=2) model3 <- mlmc(formula_completed=var1~var2, formula_missing=miss~var2, formula_censor=censor~1, formula_subject=~sid+treatment, pdata=pdata, response_censorlim=0.002, respond_dep_missing=TRUE, pidname="geneid", sidname="sid", prec_prior=prec_example, iterno=1000, chains=2, savefile=TRUE)
Formulas for completed data, subject and missingness all require at least one variable.
model5 <- mlmm(formula_completed=var1~var2+treatment, formula_missing=miss~var2, formula_subject=~sid, pdata=pdata, respond_dep_missing=FALSE, pidname="geneid", sidname="sid", iterno=1000, chains=2, savefile=FALSE)
The user define priors include:
a) prec_prior : the precision matrix of the first-level parameters (var2 and treatment);
b) alpha_prior: coefficients for intercept and var2 to predict probability of having missing value in logistic regression.
prec_example <- matrix(c(0.01,0.001,0.001,0.001,0.01,0.001,0.001,0.001,0.01), nrow=3, ncol=3)
#Example 2b use both priors model4 <- mlmm(formula_completed=var1~var2+treatment, formula_missing=miss~var2, formula_subject=~sid, pdata=pdata, respond_dep_missing=FALSE, pidname="geneid", sidname="sid", alpha_prior <- c(0,0.001), prec_prior=prec_example, iterno=100, chains=2, savefile=TRUE) #Example 2c. Use alpha prior model5 <- mlmm(formula_completed=var1~var2+treatment, formula_missing=miss~var2, formula_subject=~sid+treatment, pdata=pdata, respond_dep_missing=FALSE, pidname="geneid",alpha_prior <- c(0,0.001), sidname="sid", iterno=100, chains=2, savefile=TRUE)
If the estimate converges, it will have high number of efficient samples and the rhat value will close to 1.
The following codes are to:
1) plot posterior parameter using the outsummary results from multiple chains.
2) plot the trajectory of posterior sampling including those values generated from burn-in iterations.
summaryreader <- read.csv(file = file.path(getwd(),"outsummary.csv"), header=TRUE, sep=",", skip=0) iterno <- dim(summaryreader)[1]; burnin <- iterno/2 U.1.1 <- rowMeans(matrix(c(summaryreader$chain.1.U.1.1, summaryreader$chain.2.U.1.1), nrow=iterno, ncol=2))[burnin:iterno] meanU <- mean(U.1.1) qU <- quantile(U.1.1,p <- seq(0, 1, by=0.025)) scale <- seq(0, 1, by=0.025) plot(scale, qU, pch=19, ylab="quantiles of estimate", xlab="quantiles") segments(0,qU[names(qU)=="50%"],1,qU[names(qU)=="50%"],lwd=2,col="red") segments(0,qU[names(qU)=="2.5%"],1,qU[names(qU)=="2.5%"],lty=2,lwd=2,col="red") segments(0,qU[names(qU)=="97.5%"],1,qU[names(qU)=="97.5%"],lty=2,lwd=2, col="red") legend(0.5,qU[names(qU)=="50%"],"median",cex=0.8,bty="n") legend(0.03,qU[names(qU)=="2.5%"],"2.5%",cex=0.8,bty="n") legend(0.90,qU[names(qU)=="97.5%"],"97.5%",cex=0.8,bty="n") qU
sample1reader <- read.csv(file <- file.path(getwd(),"samples_1.csv"), header=TRUE, sep=",", skip=25) sample2reader <- read.csv(file <- file.path(getwd(),"samples_2.csv"), header=TRUE, sep=",", skip=25) #plot variable U.1.1 - the intercept of first unit trajectory_length <- dim(sample1reader)[1] plot(seq(1, trajectory_length, by=1), sample1reader$U.1.1, xlab="trajectory number", ylab="U.1.1", type="n", ylim = c(min(sample1reader$U.1.1, sample2reader$U.1.1, na.rm=TRUE), max(sample1reader$U.1.1, sample2reader$U.1.1, na.rm=TRUE))) trajectory <- seq(1, trajectory_length, by=1) lines(trajectory, sample1reader$U.1.1) lines(trajectory, sample2reader$U.1.1, col="red")
All of the outputs in this vignette are produced under the following conditions:
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.