knitr::opts_chunk$set(echo = TRUE, crop = NA)
The countsimQC package provides a simple way to compare the characteristic
features of a collection of (e.g., RNA-seq) count data sets. An important
application is in situations where a synthetic count data set has been generated
using a real count data set as an underlying source of parameters, in which case
it is often important to verify that the final synthetic data captures the main
features of the original data set. However, the package can be used to create a
visual overview of any collection of one or more count data sets.
In this vignette we will show how to generate a comparative report from a
collection of two simulated data sets and the original, underlying real data
set. First, we load the object containing the three data sets. The object is a
named list, where each element is a DESeqDataSet object containing the count
matrix, a sample information data frame and a model formula (necessary to
calculate dispersions). For more information about the DESeqDataSet class,
please see the DESeq2 Bioconductor package. For speed reasons, we use only a
subset of the features in each data set for the following calculations.
suppressPackageStartupMessages({
  library(countsimQC)
  library(DESeq2)
})
data(countsimExample)
countsimExample
countsimExample <- lapply(countsimExample, function(cse) {
  cse[seq_len(1500), ]
})
Next, we generate the report using the countsimQCReport() function. Depending
on the level of detail and the type of information required for the final
report, this function can be run in different "modes":
- With calculateStatistics = FALSE, only plots will be generated. This is the
  fastest way of running countsimQCReport(), and in many cases generates enough
  information for the user to make a visual evaluation of the count data set(s).
- With calculateStatistics = TRUE and permutationPvalues = FALSE, some
  quantitative pairwise comparisons between data sets will be performed. In
  particular, the Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test will
  be used to compare distributions, and additional statistics will be calculated
  to evaluate how similar the evaluated aspects are between pairs of data sets.
- With calculateStatistics = TRUE and permutationPvalues = TRUE (and the
  requested number of permutations given via the nPermutations argument),
  permutation of data set labels will be used to evaluate the significance of
  the statistics calculated in the previous mode. Naturally, this increases the
  run time of the analysis considerably.
Here, for the sake of speed, we calculate statistics for a small subset of the
observations (subsampleSize = 25) and refrain from calculating permutation
p-values.
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample,
                 outputFile = "countsim_report.html",
                 outputDir = tempDir,
                 outputFormat = "html_document",
                 showCode = FALSE,
                 forceOverwrite = TRUE,
                 savePlots = TRUE,
                 description = "This is my test report.",
                 maxNForCorr = 25,
                 maxNForDisp = Inf,
                 calculateStatistics = TRUE,
                 subsampleSize = 25,
                 kfrac = 0.01,
                 kmin = 5,
                 permutationPvalues = FALSE,
                 nPermutations = NULL)
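To make the difference between the second and third modes more concrete, the following is a minimal sketch in base R of a two-sample Kolmogorov-Smirnov comparison followed by a label-permutation p-value. It is not the code that countsimQC runs internally. The vectors real_vals and sim_vals, and the number of permutations, are hypothetical stand-ins chosen for illustration.

```r
set.seed(1)

# Hypothetical per-feature summaries (e.g., average log-abundance)
# for a real and a simulated data set. These are illustrative values only.
real_vals <- rnorm(200, mean = 5, sd = 1)
sim_vals  <- rnorm(200, mean = 5.3, sd = 1.2)

# Two-sample Kolmogorov-Smirnov test from the stats package
ks <- ks.test(real_vals, sim_vals)
ks_stat <- unname(ks$statistic)

# Permutation p-value: shuffle the data set labels, recompute the
# statistic, and count how often it is at least as large as observed
n_perm <- 200
pooled <- c(real_vals, sim_vals)
n_real <- length(real_vals)
perm_stats <- replicate(n_perm, {
  idx <- sample(length(pooled), n_real)
  unname(ks.test(pooled[idx], pooled[-idx])$statistic)
})
perm_p <- (1 + sum(perm_stats >= ks_stat)) / (1 + n_perm)

ks_stat
perm_p
```

Because the statistic is recomputed once per permutation, the cost grows linearly with the number of permutations, which is why the permutation mode is noticeably slower.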
The countsimQCReport() function can generate either an HTML file (by setting
outputFormat = "html_document" or outputFormat = NULL) or a pdf file (by
setting outputFormat = "pdf_document"). The description argument can be used
to provide a more extensive description of the data set(s) that are included in
the report.
If the argument savePlots is set to TRUE, an .rds file containing the
individual ggplot objects will be generated. These objects can be used to
perform fine-tuning of the visualizations if desired. Note, however, that the
.rds file can become large if the number of data sets is large, or if the
individual data sets have many samples or features. The convenience function
generateIndividualPlots() can be used to quickly generate individual figures
for all plots included in the report, using a variety of devices. For example,
to generate each plot in pdf format:
ggplots <- readRDS(file.path(tempDir, "countsim_report_ggplots.rds"))
if (!dir.exists(file.path(tempDir, "figures"))) {
  dir.create(file.path(tempDir, "figures"))
}
generateIndividualPlots(ggplots, device = "pdf", nDatasets = 3,
                        outputDir = file.path(tempDir, "figures"))
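As a rough illustration of the fine-tuning mentioned above, the sketch below adjusts a single ggplot object with ordinary ggplot2 layers. The internal layout of the saved .rds object is not assumed here, so a toy plot (p, built from the hypothetical data frame toy) stands in for one of the saved plots. The same kind of additions should apply to any ggplot object extracted from the file.

```r
library(ggplot2)

set.seed(1)
# Toy stand-in for one saved plot object; the data and column names
# are hypothetical and not taken from the countsimQC output.
toy <- data.frame(
  value = c(rnorm(100, 5), rnorm(100, 5.5)),
  dataset = rep(c("Original", "Sim1"), each = 100)
)
p <- ggplot(toy, aes(x = value, colour = dataset)) +
  geom_density()

# Fine-tune with standard ggplot2 components: theme, labels, colours
p_tuned <- p +
  theme_bw(base_size = 14) +
  labs(x = "Hypothetical feature summary", y = "Density",
       colour = "Data set") +
  scale_colour_manual(values = c(Original = "black", Sim1 = "firebrick"))

# Write the adjusted figure to a temporary directory
out_file <- file.path(tempdir(), "tuned_plot.pdf")
ggsave(out_file, p_tuned, width = 6, height = 4)
```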
In the example above, all data sets were provided as DESeqDataSet objects. The
advantage of this is that it allows the specification of the experimental
design, which is used in the dispersion calculations. countsimQC also allows a
data set to be provided as either a data.frame or a matrix. However, in
these situations, it will be assumed that all samples are replicates (i.e., a
design ~1). An example is provided in the countsimExample_dfmat data set,
provided with the package.
data(countsimExample_dfmat)
names(countsimExample_dfmat)
lapply(countsimExample_dfmat, class)
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample_dfmat,
                 outputFile = "countsim_report_dfmat.html",
                 outputDir = tempDir,
                 outputFormat = "html_document",
                 showCode = FALSE,
                 forceOverwrite = TRUE,
                 savePlots = TRUE,
                 description = "This is my test report.",
                 maxNForCorr = 25,
                 maxNForDisp = Inf,
                 calculateStatistics = TRUE,
                 subsampleSize = 25,
                 kfrac = 0.01,
                 kmin = 5,
                 permutationPvalues = FALSE,
                 nPermutations = NULL)
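If the samples are not all replicates, one way to avoid the ~1 assumption is to wrap the count matrix in a DESeqDataSet before passing it on. The sketch below uses DESeqDataSetFromMatrix() from DESeq2. The count matrix, the sample annotation, the condition variable and the list name MyData are all hypothetical and only illustrate the idea.

```r
suppressPackageStartupMessages(library(DESeq2))

set.seed(1)
# Hypothetical count matrix: 100 features x 6 samples
counts <- matrix(rnbinom(600, mu = 50, size = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 paste0("sample", 1:6)))

# Hypothetical sample annotation with a two-group design
sample_info <- data.frame(
  condition = factor(rep(c("control", "treated"), each = 3)),
  row.names = colnames(counts)
)

# Attach the design so it is available for the dispersion calculations
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = sample_info,
                              design = ~ condition)

# Named list in the same form as the example objects used above
my_list <- list(MyData = dds)
```

A list built this way has the same structure as countsimExample and could be supplied through the ddsList argument.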
sessionInfo()