Log-Linear Poisson Graphical Model with Hot-Deck Multiple Imputation
In RNAseqNet: Log-Linear Poisson Graphical Model with Hot-Deck Multiple Imputation

Introduction

The R package RNAseqNet can be used to infer networks from RNA-seq expression data. The count data are given as a $n \times p$ matrix in which $n$ is the number of individuals and $p$ the number of genes. This matrix is denoted by $\mathbf{X}$ in the sequel.

Eventually, the RNA-seq dataset is complemented with an $n' \times d$ matrix, $\mathbf{Y}$ which can be used to impute missing individuals in $\mathbf{X}$ as described in [Imbert et al., 2018]^[Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Lefebvre, G., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2018) Multiple hot-deck imputation for network inference from RNA sequencing data. Bioinformatics. DOI: http://dx.doi.org/10.1093/bioinformatics/btx819].

library(RNAseqNet)

Dataset description

Two datasets are available in the package: lung and thyroid with $n = 221$ rows and respectively 100 and 50 columns. The raw data were downloaded from https://gtexportal.org/. The TMM normalisation of RNA-seq expression was performed with the R package edgeR.

Data are loaded with:

data(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
data(thyroid)
boxplot(log2(thyroid + 1), las = 3, cex.names = 0.5)

Network inference from RNA-seq data

Network inference from RNA-seq data is performed with the Poisson GLM model described in [Allen and Liu, 2012]^[Allen, G. and Liu, Z. (2012) A log-linear model for inferring genetic networks from high-throughput sequencing data. In Proceedings of IEEE International Conference on Bioinformatics and Biomedecine (BIBM)]. The inference can be performed with the function GLMnetwork as follows:

lambdas <- 4 * 10^(seq(0, -2, length = 10))
ref_lung <- GLMnetwork(lung, lambdas = lambdas)

The entry path of ref_lung contains length(lambdas) = r length(lambdas) matrices with estimated coefficients. Each matrix is a square matrix with ncol(lung) = r ncol(lung) rows and columns.

The choice of the most appropriate value for $\lambda$ can be performed with the StARS criterion of [Liu et al., 2010]^[Liu, H., Roeber, K. and Wasserman, L. (2010) Stability approach to regularization selection (StARS) for high dimensional graphical models. In Proceedings of Neural Information Processing Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada.], which is implemented in the function stabilitySelection. The argument B is used to specify the number of re-sampling used to compute the stability criterion:

set.seed(11051608)
stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50)
plot(stability_lung)

The entry best of stability_lung is the index of the chosen $\lambda$ in lambdas. Here, the value $\lambda=$ lambdas[stability_lung$best] = r lambdas[stability_lung$best] is chosen.

The corresponding set of estimated coefficients, is in ref_lung$path[[stability_lung$best]] and can be transformed into a network with the function GLMnetToGraph:

lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]])
print(lung_refnet)
set.seed(1243)
plot(lung_refnet, vertex.size = 5, vertex.color = "orange", 
     vertex.frame.color = "orange", vertex.label.cex = 0.5, 
     vertex.label.color = "black")

Network inference with an auxiliary dataset

In this section, we artificially remove some of the observations in lung to create missing individuals (as compared to those in thyroid):

set.seed(1717)
nobs <- nrow(lung)
miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE)
lung[miss_ind, ] <- NA
lung <- na.omit(lung)
boxplot(log2(lung + 1), las = 3, cex.names = 0.5)

The method described in [Imbert et al., 2018] is thus used to infer a network for lung expression data, imputing missing individuals from the information provided between gene expressions by the thyroid dataset.

The first step of the method is to choose a relevant value for the donor list parameter, $\sigma$. This is done computing $V_{\textrm{intra}}$, the intra-variability in donor pool, for various values of $\sigma$. An elbow rule is thus used to choose an appropriate value:

sigmalist <- 1:5
sigma_stats <- chooseSigma(lung, thyroid, sigmalist)
p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() +
  geom_line() + theme_bw() + 
  ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) +
  xlab(expression(sigma)) + ylab(expression(V[intra])) +
  theme(title = element_text(size = 10))
print(p)

Here, $\sigma = 2$ is chosen. Finally, hd-MI is processed with the chosen $\sigma$, a list of regularization parameters $\lambda$ that are selected with with the StARS criterion (from B = 10 subsamples) in m = 100 replicates of the inference, all performed on a different imputed dataset. The function imputedGLMnetwork is the one implementing the full method. The distribution of edge frequency among the m = 100 inferred network is obtained with the function plot applied to the result of this function.

set.seed(16051244)
lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas,
                               m = 100, B = 10)
plot(lung_hdmi)

Finally, the final graph is extracted using the function GLMnetToGraph on the result of the function ``imputedGLMnetwork``` and providing a threshold for edge frequency prediction.

lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9)
lung_net
set.seed(1605)
plot(lung_net, vertex.size = 5, vertex.color = "orange", 
     vertex.frame.color = "orange", vertex.label.cex = 0.5, 
     vertex.label.color = "black")