nb.gof.m: Main Function of Implementing Simulation-based...

Description Usage Arguments Details Value Author(s) References Examples

View source: R/nb.gof.m.R

Description

This function performs simulation-based goodness-of-fit tests of different negative binomial dispersion models for an RNA-Seq dataset with a known design matrix. Currently supported models are implemented in the NBPSeq and edgeR packages.

Usage

1
2
nb.gof.m(counts, x, lib.sizes=colSums(counts), sim=999, model=NULL, method=NULL, 
min.n=100, prior.df = 10, seed=1, ncores = NULL, ...)

Arguments

counts

an m-by-n count matrix of non-negative integers. For a typical RNA-Seq experiment, this is the read counts with m genes and n samples.

x

an n-by-p design matrix.

lib.sizes

library sizes of an RNA-Seq experiment. Default is the column sums of the counts matrix.

sim

number of simulations performed.

model

a string of characters specifying the negative binomial dispersion model used to fit the data. Currently the following dispersion models are available to be checked for goodness-of-fit:

  • NBP dispersion model in the NBPSeq package (NBP)

  • NBQ dispersion model in the NBPSeq pacakge (NBQ)

  • NBS dispersion model in the NBPSeq pacakge (NBS)

  • NBSTEP dispersion model in the NBPSeq pacakge (NBSTEP)

  • NB common dispersion model in the edgeR package (Common)

  • NB genewise in the NBPSeq package (Genewise)

  • NB trended (non-parametric) dispersion model in the edgeR package (Trended)

  • NB tagwise-common dispersion model in the edgeR package (Tagwise-Common)

  • NB tagwise-trended dispersion model in the edgeR package (Tagwise-Trend)

Users should specify exactly the same characters as the ones in paratheses above for each dispersion model.

method

method for estimating dispersions. MAPL: maximum adjusted profile likelihood other estimation methods from edgeR can also be specified.

min.n

for Trended model only: specify the minimim number of genes in a bin (default is 100).

prior.df

control of the shrinkage applied for edgeR genewise, tagwise (and its variants) dispersion models. Setting prior.df=0 gives the genewise model without any shrinkage. Setting prior.df=10 (default in edgeR) gives the tagwise model with shrinkage, either towards the common dispersion (Tagwise-Common) or the global dispersion trend (Tagwise-Trend). In the paper of McCarthy, DJ et al. (2012), in the formula G_0 = prior.df/df = 20/df: df is the "residual df" (equals the number of libraries minus the number of distinct treatment groups), and prior.df=20. For example, if there are six libraries and two distinct groups, df = residual df = 4. If we set prior.df = 20, then G_0 = 5. G_0 is the "prior.n" in edgeR's earlier versions. Users should be careful in choosing appropriate prior.df when using tagwise models.

seed

initial random number seed for reproducibility of re-simulations (default is 1).

ncores

number of CPU cores to use. If unspecified, use the total number of CPUs minus 1.

...

(for future use)

Details

When the response is a count matrix, we can use this function to test the goodness-of-fit of a specified negative binomial dispersion model. Specifically, this function calls model.nbp.m to fit the NBP dispersion model, calls model.nbq.m to fit the NBQ dispersion model, calls model.edgeR.common to fit the common dispersion model, calls model.edgeR.genewise to fit the pure genewise dispersion model (without shrinkage), calls model.edgeR.trended to fit the non-parametric trended dispersion model in edgeR, calls model.edgeR.tagcom to fit the tagwise model with the dispersions shrinking towards a common dispersion, and calls model.edgeR.tagtrd to fit the tagwise model with the dispersions shrinking towards the global dispersion trend.

Value

An object of class "gofm" to which other methods can be applied.

Author(s)

Gu Mi <neo.migu@gmail.com>, Yanming Di, Daniel Schafer

References

Mi, G, Di, Y, & Schafer, DW (2015). Goodness-of-Fit Tests and Model Diagnostics for Negative Binomial Regression of RNA Sequencing Data. PLOS ONE, 10(3).

Di Y, Schafer DW, Cumbie JS, and Chang JH (2011): "The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq", Statistical Applications in Genetics and Molecular Biology, 10 (1).

McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297.

Cox, DR, and Reid, N (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society Series B 49, 1-39.

See https://github.com/gu-mi/NBGOF/wiki/ for more details.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
## load package into session:
library(NBGOF)

## basic set-up of the model:
seed = 539
sim = 999
conf.env = 0.95

## basic parameter specifications:
m = 1000 # 1000 genes
n = 6    # 6 samples
s = 1e6  # lib.sizes
offset = log(s)  # make sure offset for NBP functions should take the log

## simulate coefficients:
beta = matrix(0, m, 1)  # beta0 only
set.seed(seed)
beta[,1] = rnorm(m, 5, 2) - offset   # beta0 ~ normal (mean and a based on real RNA-Seq data)

## design matrix (single group only):
x = matrix(rep(1,n))

## specify mean levels:
mu = round(t(s * exp(x %*% t(beta))))
pi = mu/s

## simulate an m-by-n count matrix mimicking a RNA-Seq dataset:
## simulate NB1.8 data with random uniform noise added to the dispersion
alpha1 = -0.2  # NB1.8
phi0 = 0.02
alpha0 = log(phi0)
phi.nbp = phi0 * pi^alpha1
range(phi.nbp)
cbind(mu[,1], phi.nbp[,1])

## add noise:
a = 0.1
set.seed(seed)  
phi.noi.vec = phi.nbp[ ,1] * exp(matrix(runif(m, -a, a), nr=m, nc=1))
phi.noi = matrix(phi.noi.vec, nr=m, nc=n)
range(phi.noi)
cbind(mu[,1], phi.noi[,1])  # make sure phi's are in reasonable range

## generate NBP response with added noise:
set.seed(seed)
y = rnbinom(m * n, mu=mu, size=1/phi.noi)
dim(y) = dim(mu)
rownames(y) = paste("g", seq(1,m), sep="")
colnames(y) = paste("s", seq(1,n), sep="")
plot(mu, phi.noi, log="xy")

## GOF tests for different dispersion models, using parallel computing:
## CAUTION: may be time-consuming depending on the size of data and simulations!
nc = detectCores() - 1
fnbp.noip = nb.gof.m(counts=y, x=x, sim=sim, model="NBP", method="MAPL", seed=1, ncores=nc)
fnbq.noip = nb.gof.m(counts=y, x=x, sim=sim, model="NBQ", method="MAPL", seed=1, ncores=nc)
fcom.noip = nb.gof.m(counts=y, x=x, sim=sim, model="Common", method="CoxReid", seed=1, ncores=nc)
fgen.noip = nb.gof.m(counts=y, x=x, sim=sim, model="Genewise", method="auto", seed=1, ncores=nc)
ftgc.noip = nb.gof.m(counts=y, x=x, sim=sim, model="Tagwise-Common", method="CoxReid", seed=1, ncores=nc)
ftgt.noip = nb.gof.m(counts=y, x=x, sim=sim, model="Tagwise-Trend", method="auto", seed=1, ncores=nc)
ftrd.noip = nb.gof.m(counts=y, x=x, sim=sim, model="Trended", method="auto", seed=1, ncores=nc)

## summarize the GOF test results:
summary(fnbp.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")
summary(fnbq.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")
summary(fcom.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")
summary(fgen.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")
summary(ftgc.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")
summary(ftgt.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")
summary(ftrd.noip, conf.env=conf.env, data.note="NB1.8-noise(a=0.1)")

## evaluate GOF p-values using histograms and quantile-quantile uniform plots:
# pdf("plots-NBP-01-1grp.pdf", width=20, height=20)
h1 = diagPlot(fnbp.noip, type="hist", binwidth=0.1)
q1 = diagPlot(fnbp.noip, type="qq")
h2 = diagPlot(fnbq.noip, type="hist", binwidth=0.1)
q2 = diagPlot(fnbq.noip, type="qq")
h3 = diagPlot(fcom.noip, type="hist", binwidth=0.1)
q3 = diagPlot(fcom.noip, type="qq")
h4 = diagPlot(fgen.noip, type="hist", binwidth=0.1)
q4 = diagPlot(fgen.noip, type="qq")
h5 = diagPlot(ftgc.noip, type="hist", binwidth=0.1)
q5 = diagPlot(ftgc.noip, type="qq")
h6 = diagPlot(ftgt.noip, type="hist", binwidth=0.1)
q6 = diagPlot(ftgt.noip, type="qq")
h7 = diagPlot(ftrd.noip, type="hist", binwidth=0.1)
q7 = diagPlot(ftrd.noip, type="qq")
multiplot(h1, q1, h2, q2, h3, q3, h4, q4, h5, q5, h6, q6, h7, q7, cols=4, layout=matrix(seq(1,16), nrow=4, byrow=TRUE))
# dev.off()

gu-mi/NBGOF documentation built on Oct. 25, 2020, 3:30 a.m.