The R package RNAseqNet
can be used to infer networks from RNA-seq
expression data. The count data are given as a $n \times p$ matrix in which $n$
is the number of individuals and $p$ the number of genes. This matrix is denoted
by $\mathbf{X}$ in the sequel.
Eventually, the RNA-seq dataset is complemented with an $n' \times d$ matrix, $\mathbf{Y}$ which can be used to impute missing individuals in $\mathbf{X}$ as described in [Imbert et al., 2018]^[Imbert, A., Valsesia, A., Le Gall, C., Armenise, C., Lefebvre, G., Gourraud, P.A., Viguerie, N. and Villa-Vialaneix, N. (2018) Multiple hot-deck imputation for network inference from RNA sequencing data. Bioinformatics. DOI: http://dx.doi.org/10.1093/bioinformatics/btx819].
library(RNAseqNet)
Two datasets are available in the package: lung
and thyroid
with
$n = 221$ rows and respectively 100 and 50 columns. The raw data were downloaded
from https://gtexportal.org/. The TMM normalisation of RNA-seq expression was
performed with the R package edgeR
.
Data are loaded with:
data(lung) boxplot(log2(lung + 1), las = 3, cex.names = 0.5) data(thyroid) boxplot(log2(thyroid + 1), las = 3, cex.names = 0.5)
Network inference from RNA-seq data is performed with the Poisson GLM model
described in [Allen and Liu, 2012]^[Allen, G. and Liu, Z. (2012) A log-linear
model for inferring genetic networks from high-throughput sequencing data. In
Proceedings of IEEE International Conference on Bioinformatics and Biomedecine
(BIBM)]. The inference can be performed with the function GLMnetwork
as
follows:
lambdas <- 4 * 10^(seq(0, -2, length = 10)) ref_lung <- GLMnetwork(lung, lambdas = lambdas)
The entry path
of ref_lung
contains length(lambdas)
= r
length(lambdas)
matrices with estimated coefficients. Each matrix is a square
matrix with ncol(lung)
= r ncol(lung)
rows and columns.
The choice of the most appropriate value for $\lambda$ can be performed with
the StARS criterion of [Liu et al., 2010]^[Liu, H., Roeber, K. and Wasserman,
L. (2010) Stability approach to regularization selection (StARS) for high
dimensional graphical models. In Proceedings of Neural Information Processing
Systems (NIPS 2010), 23, 1432-1440, Vancouver, Canada.], which is
implemented in the function stabilitySelection
. The argument B
is
used to specify the number of re-sampling used to compute the stability
criterion:
set.seed(11051608) stability_lung <- stabilitySelection(lung, lambdas = lambdas, B = 50) plot(stability_lung)
The entry best
of stability_lung
is the index of the chosen
$\lambda$ in lambdas
. Here, the value $\lambda=$
lambdas[stability_lung$best]
= r lambdas[stability_lung$best]
is chosen.
The corresponding set of estimated coefficients, is in
ref_lung$path[[stability_lung$best]]
and can be transformed into a network
with the function GLMnetToGraph
:
lung_refnet <- GLMnetToGraph(ref_lung$path[[stability_lung$best]]) print(lung_refnet) set.seed(1243) plot(lung_refnet, vertex.size = 5, vertex.color = "orange", vertex.frame.color = "orange", vertex.label.cex = 0.5, vertex.label.color = "black")
In this section, we artificially remove some of the observations in lung
to create missing individuals (as compared to those in thyroid
):
set.seed(1717) nobs <- nrow(lung) miss_ind <- sample(1:nobs, round(0.2 * nobs), replace = FALSE) lung[miss_ind, ] <- NA lung <- na.omit(lung) boxplot(log2(lung + 1), las = 3, cex.names = 0.5)
The method described in [Imbert et al., 2018] is thus used to infer a network
for lung
expression data, imputing missing individuals from the
information provided between gene expressions by the thyroid
dataset.
The first step of the method is to choose a relevant value for the donor list parameter, $\sigma$. This is done computing $V_{\textrm{intra}}$, the intra-variability in donor pool, for various values of $\sigma$. An elbow rule is thus used to choose an appropriate value:
sigmalist <- 1:5 sigma_stats <- chooseSigma(lung, thyroid, sigmalist) p <- ggplot(sigma_stats, aes(x = sigma, y = varintra)) + geom_point() + geom_line() + theme_bw() + ggtitle(expression("Evolution of intra-pool homogeneity versus" ~ sigma)) + xlab(expression(sigma)) + ylab(expression(V[intra])) + theme(title = element_text(size = 10)) print(p)
Here, $\sigma = 2$ is chosen. Finally, hd-MI is processed with the chosen
$\sigma$, a list of regularization parameters $\lambda$ that are selected with
with the StARS criterion (from B = 10
subsamples) in m = 100
replicates of the inference, all performed on a different imputed dataset.
The function imputedGLMnetwork
is the one implementing the full method.
The distribution of edge frequency among the m = 100
inferred network is
obtained with the function plot
applied to the result of this function.
set.seed(16051244) lung_hdmi <- imputedGLMnetwork(lung, thyroid, sigma = 2, lambdas = lambdas, m = 100, B = 10) plot(lung_hdmi)
Finally, the final graph is extracted using the function GLMnetToGraph
on
the result of the function ``imputedGLMnetwork``` and providing a threshold for
edge frequency prediction.
lung_net <- GLMnetToGraph(lung_hdmi, threshold = 0.9) lung_net set.seed(1605) plot(lung_net, vertex.size = 5, vertex.color = "orange", vertex.frame.color = "orange", vertex.label.cex = 0.5, vertex.label.color = "black")
Here is the output of sessionInfo()
on the system on which this document was
compiled:
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.