HTSDiff: Differential analysis of RNA-seq data using a Poisson mixture...
In HTSDiff: Differential Analysis of RNA-Seq Data with Poisson Mixture Models

Description Usage Arguments Details Value Author(s) References Examples

This function implements a differential analysis of RNA-seq data using a Poisson mixture model, where one cluster is fixed to represent genes with equal mean in each experimental condition (i.e., a cluster of non-differentially expressed genes).

HTSDiff(counts, conds, DEclusters=4, norm="TMM", epsilon=0.8, EM.verbose=FALSE, ...)

`counts`	(n x q) matrix of observed counts for n genes and q samples, with row names corresponding to gene IDs
`conds`	Vector of length q defining the condition (treatment group) for each variable (column) in `counts`
`DEclusters`	Number of clusters to include to represent differentially expressed genes (default value of 4), in addition to the cluster fixed to represent non-differentially expressed genes.
`norm`	The estimator to be used for the library size parameter: “`TC`” for total count, “`UQ`” for upper quantile, “`Med`” for median, “`DESeq`” for the normalization method in the DESeq package, and “`TMM`” for the TMM normalization method (default).
`epsilon`	Cutoff used to identify whether the log2-ratio of cluster parameters between conditions is sufficiently large to be declared as differentially expressed, with default value 0.8
`EM.verbose`	If `TRUE`, more informative output is printed about the EM algorithm, including the number of iterations run and the difference between log-likelihoods at the last and penultimate iterations.
`...`	Additional parameters to be passed to the HTSCluster package, if desired. These include notably the following: 1) `init.runs`, the number of small-EM algorithms to run in initialization of Poisson mixture model estimation, with default value of 1, and 2) `init.iter`, the number of iterations to run within each small-EM algorithm in initialization of Poisson mixture model estimation, with default value of 10

In a Poisson mixture model, the data y are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as

f(y;g,ψ_g) = ∏_{i=1}^n ∑_{k=1}^g π_k ∏_{j=1}^{d}∏_{l=1}^{r_j} P(y_{ijl} ; θ_k)

for i = 1, …, n observations in l = 1, …, r_j replicates of j = 1, …, d conditions (treatment groups), where P(\cdot) is the standard Poisson density, ψ_g = (π_1,…,π_{g-1}, θ^\prime), θ^\prime contains all of the parameters in θ_1,…,θ_g assumed to be distinct, and π = (π_1,…,π_g)^\prime are the mixing proportions such that π_k is in (0,1) for all k and ∑_k π_k = 1. We consider the following parameterization of the Poisson mean of y_{ijl} in cluster k:

μ_{ijlk} = w_i s_{jl} λ_{jk}

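where w_i corresponds to the overall expression level of observation i, λ_k = (λ_{1k}, …, λ_{dk}) are the clustering parameters that define the expression profile of cluster k across conditions, and s_{jl} is the normalized library size (a fixed constant) for replicate l of condition j. See Rau et al. (2011) for more details on this model, including parameter estimation, algorithm initialization, and model selection.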
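To make the model concrete, the following base R sketch evaluates the log-likelihood of a Poisson mixture with means w_i s_{jl} λ_{jk}. It is only an illustration of the formula above, not the estimation code used by HTSDiff or HTSCluster: the function `pmm_loglik` and its argument names are hypothetical, and taking w as the row totals and s as column totals scaled to sum to 1 is an assumption made for the toy example.

```r
# Sketch only: pmm_loglik and its argument names are illustrative, not HTSDiff API.
pmm_loglik <- function(y, conds, w, s, lambda, pi_k) {
  # y:      n x q matrix of counts
  # conds:  length-q vector of condition labels, coded 1..d
  # w:      length-n vector of gene-level expression (w_i)
  # s:      length-q vector of normalized library sizes (s_jl)
  # lambda: d x g matrix of cluster parameters (lambda_jk)
  # pi_k:   length-g vector of mixing proportions
  g <- length(pi_k)
  log_comp <- matrix(NA_real_, nrow(y), g)
  for (k in seq_len(g)) {
    mu <- outer(w, s * lambda[conds, k])            # n x q matrix of Poisson means
    lp <- matrix(dpois(y, mu, log = TRUE), nrow = nrow(y))
    log_comp[, k] <- log(pi_k[k]) + rowSums(lp)     # log(pi_k) + log density in cluster k
  }
  # Log-sum-exp over clusters for each gene, then sum over genes
  m <- apply(log_comp, 1, max)
  sum(m + log(rowSums(exp(log_comp - m))))
}

# Toy data: 2 genes, 4 samples, 2 conditions with 2 replicates each
y <- matrix(c(3, 5, 2, 7,
              10, 1, 4, 6), nrow = 2, byrow = TRUE)
conds <- c(1, 1, 2, 2)
w <- rowSums(y)                          # assumed estimate of w_i: total count per gene
s <- colSums(y) / sum(y)                 # assumed library sizes, scaled to sum to 1
lambda <- cbind(c(1, 1), c(0.5, 1.5))    # cluster 1 has equal means in both conditions
loglik <- pmm_loglik(y, conds, w, s, lambda, pi_k = c(0.6, 0.4))
```

The per-cluster terms are combined on the log scale (log-sum-exp) because products of many Poisson densities underflow quickly for realistic read counts.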

In the case of differential analysis, we fix one of the clusters (typically the first, although this choice is arbitrary) to represent non-differentially expressed genes, i.e., λ_{11} = ... = λ_{1d} = 1. Typically we fix the number of remaining clusters (`DEclusters`) to be 4, although this choice may be modified by the user. In addition to the fixed cluster, clusters for which the absolute value of \log_2(λ_{1k} / λ_{2k}) is less than `epsilon` (default value 0.8) are also considered to represent non-differentially expressed genes.
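The cluster-level rule can be sketched in a few lines of base R for the two-condition case. The helper `nde_clusters` and the example values of λ are hypothetical and chosen only to illustrate the cutoff; they are not taken from the package.

```r
# Sketch only: nde_clusters is an illustrative helper, not part of HTSDiff.
# lambda: 2 x g matrix of cluster parameters; rows are conditions, columns are clusters.
nde_clusters <- function(lambda, epsilon = 0.8) {
  log2_ratio <- log2(lambda[1, ] / lambda[2, ])
  which(abs(log2_ratio) < epsilon)   # indices of clusters treated as non-DE
}

# Column 1 is the fixed cluster (equal means); the others are made-up estimates
lambda <- cbind(c(1, 1), c(1.2, 0.8), c(0.4, 1.6), c(1.9, 0.1))
nde <- nde_clusters(lambda)
nde
# [1] 1 2
```

The fixed cluster always passes the test because its log2-ratio is exactly 0; cluster 2 is also retained as non-differentially expressed because log2(1.2 / 0.8) is about 0.58, below the default cutoff of 0.8.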

Following clustering, a gene is declared differentially expressed if its conditional probability of being non-differentially expressed (i.e., of belonging to a cluster of non-differentially expressed genes) is less than 1e-8.
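The gene-level decision can be illustrated as follows. This sketch assumes that the conditional probability of non-differential expression is the sum of a gene's conditional probabilities over all clusters regarded as non-differentially expressed; the helper `call_de` and the toy probabilities are hypothetical, and the column names merely mirror those of the `res` data frame.

```r
# Sketch only: call_de is an illustrative helper, not part of HTSDiff.
# tau: n x g matrix of conditional probabilities of cluster membership (rows sum to 1)
# nde: indices of the clusters regarded as non-differentially expressed
call_de <- function(tau, nde, cutoff = 1e-8) {
  tauNDE <- rowSums(tau[, nde, drop = FALSE])   # assumed: sum over non-DE clusters
  data.frame(tauDE = 1 - tauNDE, tauNDE = tauNDE, DE = tauNDE < cutoff)
}

# Two toy genes and four clusters, of which clusters 1 and 2 are non-DE
tau <- rbind(c(0.7, 0.2, 0.1, 0.0),
             c(0.0, 0.0, 0.3, 0.7))
calls <- call_de(tau, nde = c(1, 2))
```

With the strict 1e-8 cutoff, the first gene (tauNDE = 0.9) is not declared differentially expressed, while the second (tauNDE = 0) is.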

`res`	Results data frame containing the following information: `id` = gene IDs, `baseMean` = base mean (for normalized counts), `baseMeanA` = base mean for condition A (for normalized counts), `baseMeanB` = base mean for condition B (for normalized counts), `foldChange` = fold change between `baseMeanA` and `baseMeanB`, `log2FoldChange`, `tauDE` = conditional probability of differential expression, `tauNDE` = conditional probability of non-differential expression, `DE` = TRUE if gene is declared differentially expressed using cutoff for conditional probability of the non-differential cluster and FALSE otherwise
`PMM`	Object of class `HTSCluster` containing parameter estimates and other information from the Poisson mixture model estimation
`iterations`	Number of iterations run
`logLikeDiff`	Difference in log-likelihood between the last and penultimate iterations of the algorithm

Andrea Rau <andrea.rau@jouy.inra.fr>

S. Balzergue, G. Rigaill, V. Brunaud, E. Blondet, A. Rau, O. Rogier, J. Caius, C. Maugis-Rabusseau, L. Soubigou-Taconnat, S. Aubourg, C. Lurin, E. Delannoy, and M.-L. Martin-Magniette. (2014) HTSDiff: A Model-Based Clustering Alternative to Test-Based Methods in Differential Gene Expression Analyses by RNA-Seq Benchmarked on Real and Synthetic Datasets (submitted).

library(HTSDiff)
set.seed(12345)

## Generate synthetic data: 2000 genes under H0
test <- syntheticData(H0number = 2000)

## Mixture model differential analysis
## DEtest <- HTSDiff(test, c(1,1,2,2))