In yliu433/scZINB: Graphical model estimation for zero-inflated count data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The package is developed for DAG skeleton estimation using zero-inflated count data. Consider the sample count data $Y_{n\times p}=(y_1, \ldots, y_p)$ which contain a large number of zeros, we assume that each variable $y_j,j=1,\ldots,p$ follows a zero-inflated negative binomial distribution, that is, the distribution has two components: the zero-inflated component and the negative binomial component. The likelihood can be written as $$\log L_j = \sum_{i = 1}^n \left{ \log[ \pi_{ij} I(y_{ij} =0) + (1-\pi_{ij}) f_{nb}(y_{ij}; \mu_{ij}, \phi_{ij}) ] \right},$$ where $\pi_{ij}$ is the probability that $y_{ij}$ arises from the zero-inflated component, $f_{nb}(y_{ij}; \mu_{ij}, \phi_{ij})$ is the density function of the negative binomial distribution with mean $\mu_{ij}$ and over-dispersion parameter $\phi_{ij}$.

The package uses a two-step procedure to estimate the skeleton of the graphical model based on this sample data. First, we estimate the moral graph using a neighborhood selection procedure. In particular, we estimate the neighbors of each vertex via penalized maximum likelihood in the ZINB regression separately, and then consolidate the results. An edge between $y_{j1}$ and $y_{j2}$ will be established if either $y_{j1}$ is selected as a neighbor of $y_{j2}$ or $y_{j2}$ is selected as a neighbor of $y_{j1}$. Second, after we obtain the moral graph, we could further estimate the skeleton by removing the false connections in the v-structure using a modified PC-algorithm.

The scZINB package uses an EM algorithm to obtain the penalized maximum likelihood estimation of the coefficients in the regression model. And in the M step, it uses the coordinate descent algorithm to update the objective function over each coefficient with others being fixed, and cycles repeatedly until convergence The core of the code is written using C++ for reducing the computational burden.

The package can handle high-dimensional data where the number of variables p is larger than the sample size n. It also involves other related functions such as simulating graphical model and the sample data using the ER model, generating the tuning parameters for the regression model.

Installation

The package can be installed from the github:

library(devtools)
install_github("yliu433/scZINB", build_vignettes = TRUE)

To use the package, we need to load several required packages

library(doParallel)
library(penalized) # for penalized
library(MASS) # for negative.binomal() and glm.nb
library(pscl) # for zeroinfl function
library(graph)
library(mvtnorm)

Then we can load the scZINB package

library(scZINB)

Quick start

In this section, we will use a simulated example to show how to perform the algorithm to estimate the skeleton of a graphical model. We can simple generate a true graphical model and the continuous sample data using the genER function:

set.seed(1)
n <- 200 # sample size
p <- 100 # dimension
p0 <- 0.01 # the probability of having an edge for any pair of vertices
effs <- 0.5 # effect sizes of the association in the graphical model
ts <- genER(p, p0, n, effs)
graph <- ts$A
graph <- (graph != 0) * 1
mgraph <- mirror(graph)
dim(mgraph)
dat <- ts$X
dim(dat)

Here "graph" is the true adjacency matrix of the simulated graph, and "mgraph" is the true undirected graph, and "dat" is the sample data generated from a mulitvariate normal distribution. To further obtain the zero-inflated count data, we can use the inverse distribution functions of the binomial distribution and the negative binomial distribution as follows:

data.negbin <- matrix(qnbinom(pnorm(dat), mu = 2, size = 1.5), ncol = ncol(dat))
data.bin <- matrix(qbinom(pnorm(dat), prob = 0.4, size = 1), ncol = ncol(dat))
data <- data.negbin
data[which(data.bin == 0)] <- 0
data[which(data == Inf)] <- max(data[which(data != Inf)]) + 1
head(data[, 1:6])

Then we will use the sample data "data" to estimate the skeleton and compare the result with the true graph "mgraph". Before performing the first step for estimating the moral graph, we can use a marginal screening to filter out the covariates that are not even weakly associated with the response variable to reduce the computational time:

# step 0 marginal screening 
gEstM <- marscr(log(data + 1/6), thres = 10)

Now we can use the function nsZINB to perform the neighborhood selection for the sample data using each varaible as the response and the remaining $(p−1)$ as covaraites. Here we use the above "gEstM" as an initial filter prior to the variable selection. The output is the estimated moral graph based on the extended BIC. See help(nsZINB) for more details of this function.

# step 1 neighborhood selection
mgh1 <- nsZINB(data, filter = gEstM)

The compareGraphs function can be used to compare the estimated moral graph with the true graph structure:

compareGraphs(mgh1, mgraph)

In step 2, the function pcZINB removes the false connections in the moral graph via a modified PC algorithm:

# Step 2 Conditional Dependence Testing
res <- pcZINB(mgh1, data)
compareGraphs(res$G, mgraph)

Compared to the result of the moral graph, the false positive rate has been reduced.

We can plot the graphs using the plot.igraph function in the igraph package:

library(igraph)
g0 <- graph.adjacency(mgraph, mode = "undirected", diag = FALSE)
g1 <- graph.adjacency(mgh1, mode = "undirected", diag = FALSE)
g2 <- graph.adjacency(res$G, mode = "undirected", diag = FALSE)
par(mar=c(0, 0, 0, 0) + .1)
plot(g0, layout = layout_in_circle, vertex.size = 1, vertex.label = NA, 
     main = "True graph", cex = 0.25)
plot(g1, layout = layout_in_circle, vertex.size = 1, vertex.label = NA, 
     main = "Estimated moral graph", cex = 0.25)
plot(g2, layout = layout_in_circle, vertex.size = 1, vertex.label = NA, 
     main = "Estimated skeleton", cex = 0.25)