cnvEQTL: CNV-expression association analysis
In CNVRanger: Summarization and expression/phenotype association of CNV ranges

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/expr_assoc.R

Testing CNV regions for effects on the expression level of genes in defined genomic windows.

cnvEQTL(
  cnvrs,
  calls,
  rcounts,
  data,
  window = "1Mbp",
  multi.calls = .largest,
  min.samples = 10,
  de.method = c("edgeR", "limma"),
  padj.method = "BH",
  filter.by.expr = TRUE,
  verbose = FALSE
)

`cnvrs`	A `GRanges` or character object containing the summarized CNV regions as e.g. obtained with `populationRanges`. Alternatively, the assay name if the 'data' argument is provided.
`calls`	Either a `GRangesList` or `RaggedExperiment` storing the individual CNV calls for each sample. Alternatively, the assay name if 'data' is provided.
`rcounts`	A `RangedSummarizedExperiment` or character name storing either the raw RNA-seq read counts in a rectangular fashion (genes x samples). Alternatively, the assay name if 'data' is provided.
`data`	(optional) A `MultiAssayExperiment` object with 'cnvrs', 'calls', and 'rcounts' arguments corresponding to assay names.
`window`	Numeric or Character. Size of the genomic window in base pairs by which each CNV region is extended up- and downstream. This determines which genes are tested for each CNV region. Character notation is supported for convenience such as "100kbp" (same as 100000) or "1Mbp" (same as 1000000). Defaults to `"1Mbp"`. Can also be set to `NULL` to test against all genes included in the analysis.
`multi.calls`	A function. Determines how to summarize the CN state in a CNV region when there are multiple (potentially conflicting) calls for one sample in that region. Defaults to `.largest`, which assigns the CN state of the call that covers the largest part of the CNV region tested. A user-defined function that is passed on to `qreduceAssay` can also be provided for customized behavior.
`min.samples`	Integer. Minimum number of samples with at least one call overlapping the CNV region tested. Defaults to 10. See details.
`de.method`	Character. Differential expression method. Defaults to `"edgeR"`.
`padj.method`	Character. Method for adjusting p-values to multiple testing. For available methods see the man page of the function `p.adjust`. Defaults to `"BH"`.
`filter.by.expr`	Logical. Include only genes with sufficiently large counts in the DE analysis? If TRUE, excludes genes not satisfying a minimum number of read counts across samples using the `filterByExpr` function from the edgeR package. Defaults to TRUE.
`verbose`	Logical. Display progress messages? Defaults to `FALSE`.

Association testing between CNV regions and RNA-seq read counts is carried out using edgeR, which applies generalized linear models (GLMs) based on the negative-binomial distribution while incorporating normalization factors for different library sizes.

In the case of only one CN state deviating from 2n for a CNV region under investigation, this reduces to the classical 2-group comparison. For more than two states (e.g. 0n, 1n, 2n), edgeR’s ANOVA-like test is applied to test all deviating groups for significant expression differences relative to 2n.

To avoid artificial effects due to low expression of a gene or insufficient sample size in deviating groups, it is typically recommended to exclude from the analysis (i) genes with fewer than r reads per million reads mapped (cpm, counts per million) in the maximally expressed sample group, and (ii) CNV regions with fewer than s samples in a group deviating from 2n. Use the min.cpm and min.samples arguments, respectively.

When testing local effects (adjacent or coinciding genes of a CNV region), suitable thresholds for candidate discovery are r = 3, s = 4, and a nominal significance level of 0.05; as such effects have a clear biological indication and the number of genes tested is typically small.

For distal effects (i.e. when testing genes far away from a CNV region) more stringent thresholds such as r = 20 and s = 10 for distal effects in conjunction with multiple testing correction using a conservative adjusted significance level such as 0.01 is typically recommended (due to power considerations and to avoid detection of spurious effects).

A DataFrame containing measures of association for each CNV region and each gene tested in the genomic window around the CNV region.

Ludwig Geistlinger <Ludwig.Geistlinger@sph.cuny.edu>

Geistlinger et al. (2018) Widespread modulation of gene expression by copy number variation in skeletal muscle. Sci Rep, 8(1):1399.

findOverlaps to find overlaps between sets of genomic regions,

qreduceAssay to summarize ragged genomic location data in defined genomic regions,

glmQLFit and glmQLFTest to conduct negative binomial generalized linear models for RNA-seq read count data.

# (1) CNV calls
states <- sample(c(0,1,3,4), 17, replace=TRUE)
calls <- GRangesList(
     sample1 = GRanges( c("chr1:1-10", "chr2:15-18", "chr2:25-34"), state=states[1:3]),
     sample2 = GRanges( c("chr1:1-10", "chr2:11-18" , "chr2:25-36"), state=states[4:6] ),
     sample3 = GRanges( c("chr1:2-11", "chr2:14-18", "chr2:26-36"), state=states[7:9] ),
     sample4 = GRanges( c("chr1:1-12", "chr2:18-35" ), state=states[10:11] ),
     sample5 = GRanges( c("chr1:1-12", "chr2:11-17" , "chr2:26-34"), state=states[12:14] ) ,
     sample6 = GRanges( c("chr1:1-12", "chr2:12-18" , "chr2:25-35"), state=states[15:17] )
)

# (2) summarized CNV regions
cnvrs <- populationRanges(calls, density=0.1)

# (3) RNA-seq read counts
genes <- GRanges(c("chr1:2-9", "chr1:100-150", "chr1:200-300",
                   "chr2:16-17", "chr2:100-150", "chr2:200-300", "chr2:26-33"))
y <- matrix(rnbinom(42,size=1,mu=10),7,6)
names(genes) <- rownames(y) <- paste0("gene", 1:7)
colnames(y) <- paste0("sample", 1:6)

library(SummarizedExperiment)
rse <- SummarizedExperiment(assays=list(counts=y), rowRanges=granges(genes))

# (4) perform the association analysis
res <- cnvEQTL(cnvrs, calls, rse, min.samples=1, filter.by.expr=FALSE)