dryseq: Differential rhythmicity analysis for RNA-Seq datasets

Description Usage Arguments Details Value References Examples

View source: R/dryseq_function.R

Description

This function performs a rhythmicity analysis based on generalized linear models with a subsequent models selection. The function accepts raw count data from a temporal RNA-Seq dataset of two or more groups. The function outputs parameters mean, phase and amplitude are for each group.

Usage

1
2
3
4
5
6
7
8
9
dryseq(
  countData,
  group,
  time,
  period = 24,
  sample_name = colnames(countData),
  batch = rep("A", length(sample_name)),
  n.cores = round(detectCores() * 0.6, 0)
)

Arguments

countData

matrix containing non-negative integers; each column represents a sample, each row represents a gene/transcript.

group

vector containing the name of each sample.

time

vector containing numeric values of the time for each sample.

period

numeric value to indicate period length of the oscillation. Default: circadian data period of 24 h.

sample_name

vector containing sample names. Default: colnames are sample names.

batch

vector containing potential batch effects between samples. Default: no batch effect.

nthreads

vector numeric value to indicate the threads for parallel computing .Default: 60 % of detected cores.

Details

DryR assesses rhythmicity and mean differences of gene expression in RNA-Seq count data. As proposed (Love et al. 2014), a count Y of a gene in a sample s can be modeled as a negative binomial with a fitted mean μ_gs and a gene-specific dispersion parameter θ_g.

Y_gs~NB(μ_gs,θ_g)

The fitted mean is proportional to the quantity q of fragments that correspond to a gene in a sample scaled by a sample-specific scaling factor s_s (Love et al. 2014). This scaling factor depends on the sampling depth of each library and can be estimated using the median-of-ratios method of DESeq2 (Anders and Huber 2010).

μ_gs = s_s q_gs

DryR estimates gene-specific distribution θg using empirical Bayes shrinkage described by Love et al. (Love et al. 2014). and variance is computed from the following relationship to the dispersion parameter θ:

Var(Y_gs) = E[(Y_gs + θ_g μ^2_gs)]

The fit uses a generalized linear model with a logarithmic link function. Sample specific size factor (λ_s) is defined as an offset. The full GLM is defined as follows:

log(μ_gbcs) = m_gb + m_gc + α_gc cos(ω t(s)) + β_gc sin(ω t(s)) + log(λ_t(s))

μ is the raw count for gene g, condition/group c and Zeitgeber/circadian time t. α and β are coefficients of the cosine and sine functions, respectively. m is a coefficient to describe a mean expression level. When necessary, a batch specific mean (m) can be given to the dryseq function to account for technical batch effects. A technical batch effect is not allowed to be confounding so the resulting model matrix is fully ranked. To select an optimal gene-specific model, dryseq first assesses rhythmicity across the different conditions. To this end, dryR defines different models across all groups. Models refined to have either zero (non-rhythmic pattern) or non-zero (rhythmic pattern) α and β coefficients for each analyzed group. Moreover, for some models the values of α and β can be also shared within any combination of all groups The coefficients α and β were used to calculate the phase (arctan(α/β)) and amplitude (log2-fold change peak-to-trough; 2sqrt(α^2+β^2) ) of a gene. Bayesian information criterion (BIC) based model selection was employed to account for model complexity using the following formula:

BIC_j = ln(n)k - 2ln(L̂_ĵ)

L̂ is defined as the log-likelihood of the model j from the regression, n is the number of data points and k is the number of parameters. To assess the confidence of the selected model j we calculated the Schwarz weight (BICW):

BICW_j = e^(0.5ΔBIC_j)\ sum(e^0.5 ΔBIC_m), with ΔBIC_j - BIC_j - BIC_m*

m* is the minimum BIC value in the entire model set. Dryseq consideres the BICW_j as the confidence level for model j. The model with the highest BICW is selected as the optimal model within the set of all defined models. In a second iteration step, dryR set the coefficient α and β to the values of the selected model in the first regression. dryseq then defined different models for the mean coefficient with differing or shared means between groups. Each model is solved using generalized linear regression and each gene was assigned to a preferred model based on the BICW as described above for the first iteration. The model selection is sensitive to outliers: dryseq provide a cook's distance for each gene. A fit for a gene with a maximum cook's distance of higher than 1 should be considered with care.

Value

a list that contains the following data.frames: results (summary of results), parameters (rhythmic parameters), ncounts (normalized counts), counts (raw counts), cook (cook's distance)

References

Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology

Anders, S. and Huber, W. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology

Examples

1
2
3
4
5
6
7
8
9
countData = simData[["countData"]]
group = simData[["group"]]
time  = simData[["time"]]
dryList = dryseq(countData,group,time)
head(dryList[["results"]])    # data frame summarizing results
head(dryList[["parameters"]]) # coefficients: phase, amplitude and mean for each group
head(dryList[["ncounts"]])    # normalized counts
head(dryList[["counts"]])     # raw counts
head(dryList[["cook"]])       # cook's distance

benjaminweger/dryR documentation built on Oct. 6, 2020, 6:03 a.m.