title: "The QCnormSE User's Guide" author: Silke Szymczak package: QCnormSE abstract: > A comprehensive guide to using the QCnormSE package for quality control of processed gene expression data sets from microarray and RNA-seq experiments. vignette: > %\VignetteIndexEntry{QCnormSE User's Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc_float: true
The QCnormSE package provides implementations of several quality control (QC) procedures and processing steps for normalized gene expression data sets from microarray and RNA-seq experiments within a unified framework.
The package uses the SummarizedExperiment class from Bioconductor to store expression values as well as gene annotation and phenotype information of transcriptome data sets in a single object. For more information on the SummarizedExperiment class see https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html.
The package provides functions to detect outlier, duplicated or mislabeled samples and to evaluate potential batch effects together with low dimensional visualizations, e.g. principal component analysis (PCA) or multidimensional scaling (MDS). Furthermore, functions are provided to access expression data from GEO including extraction of information about detection P-values and scan dates if available, to aggregate expression values based on new identifiers (e.g. from probe sets to genes), to perform log transformation and normalization for raw counts from RNA-seq studies and to remove samples or genes based on different criteria (e.g. low expression).
This user guide demonstrates the functionality based on two exemplary data sets , one generated by microarrays and one by RNA-seq technology.
This document has the following dependencies:
library(devtools) library(QCnormSE) library(SummarizedExperiment) library(ggpubr) library(sva) library(recount)
The first example is a GEO data set (GSE36398) and contains data from a microarray study comparing gene expression between patients with Facioscapulohumeral muscular dystrophy (FSHD) and their unaffected first degree relatives (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36398).
The function get_geo_data()
can be used to retrieve normalized gene
expression data from GEO and to store it as a SummarizedExperiment object.
se.ma = get_geo_data(accession = "GSE36398") print(se.ma)
The SummarizedExperiment object contains expression data for r nrow(se.ma)
genes
and r ncol(se.ma)
samples.
Information about the case-control status is given in the variable 'disease.state'.
table(se.ma$disease.state)
For this data set, GEO also provides the raw files (.CEL files). Thus, the date
of scanning can be extracted using the function extract_scan_date()
and is
then stored as an additional column called scan.date
in the colData()
slot.
Please note that this step will take a while since all the .CEL files have to
be downloaded. It can be skipped for the further analyses presented in this
vignette.
se.ma = extract_scan_date(se = se.ma)
table(se.ma$scan.date)
The scanning date can be used as a surrogate for unwanted technical variability
(batch effects) which can be automatically defined using the function
define_batches()
. However, the colData()
slot already contains batch
information which will be renamed so that it is not overwritten.
## rename se.ma$batch.original = se.ma$batch se.ma$batch = NULL ## define se.ma = define_batches(se = se.ma, col.scan.date = "scan.date") ## comparison table(original = se.ma$batch.original, new = se.ma$batch)
The comparison shows that both batch definitions are equivalent.
One method to detect outlier samples utilises a boxplot of the distribution of
expression values for each sample which can be generated using the function
plot_distribution()
.
plot_distribution(se = se.ma, method = "boxplot")
For studies with even larger sample size an alternative is the quantile plot where the quantiles are visualized as lines across the samples.
plot_distribution(se = se.ma, method = "quantileplot")
Another possibility is a low dimensional visualization using PCA or MDS which
can be generated using the functions calculate_mds_pca()
and plot_mds_pca()
. Different phenotype variables can be specified for coloring and shaping.
## PCA res.pca = calculate_mds_pca(se = se.ma, method = "pca") ## color code by disease.state plot_mds_pca(res = res.pca, se = se.ma, var.color = "disease.state") ## color code by disease.state ## and shape by tissue plot_mds_pca(res = res.pca, se = se.ma, var.color = "disease.state", var.shape = "tissue")
The plots show no obvious outliers which is also confirmed by the return value of the functions (the info component is NULL). Otherwise this component would give information about outlier samples.
If the PCA plot is colored by batch, it can be seen that the clusters in the data are driven by the different batches.
## color code by batch plot_mds_pca(res = res.pca, se = se.ma, var.color = "batch")
The association between each of the first three components and variables of
interest (e.g. batch but also case-control status) can be tested using the
function check_batch_effects()
. The resulting heatmap visualizes the P-values
of the F tests in a linear model (as in the prince plot in the R package swamp).
res.batch = check_batch_effects(se = se.ma, res.pca = res.pca, col.test = c("batch", "disease.state", "tissue")) print(res.batch$plot)
It can be clearly seen that the variability captured in the first three components is not based on the biological information (disease.state and tissue) but rather caused by batch effects (all P-values < 0.0001).
For the remaining QC steps correction of the batch effect is
performed using the ComBat function in the R package sva
. However, for
statistical analyses of differential expression, it might be better to include
batch as a covariate in the statistical model.
## extract expression values exprs = assays(se.ma)$exprs ## perform bacth correction exprs.corrected = ComBat(dat = exprs, batch = se.ma$batch) ## store corrected expression values as new assay assays(se.ma)$exprs.corrected = exprs.corrected ## check correction ## explicitly state assay res.pca.corrected = calculate_mds_pca(se = se.ma, assay = "exprs.corrected") plot_mds_pca(res = res.pca.corrected, se = se.ma, var.shape = "batch", var.color = "disease.state") res.corrected = check_batch_effects(se = se.ma, res.pca = res.pca.corrected, col.test = c("batch", "disease.state", "tissue")) print(res.corrected$plot)
The influence of the batches is removed, however, the biological information is still not captured by the first component.
Potential duplicates can be detected using the function
detect_duplicated_samples()
which calculates pairwise correlations between
samples. Duplicated samples have an extremely high correlation and are
visualized using a vertical red line on the histogram.
detect_duplicated_samples(se = se.ma, assay = "exprs.corrected")
No problematic samples are detected.
Sample swaps or mislabeled samples can be identified using the function
check_sex()
which predicts sex based on the expression of sex chromosomal
genes and compares it to the reported sex in the colData
slot.
Information about genes on the X and Y chromosome for humans is provided as part of the package.
## load info data("info.sex.genes") check_sex(se = se.ma, assay = "exprs.corrected", info.sex.genes = info.sex.genes, gene.column = "Gene.symbol", sex.column = "Sex")
The MDS plot shows a clear clustering by sex and no misannotation across sexes.
The second example is a sequencing based transcriptome study of normal and Parkinson's disease human brain samples (SRP015668).
The fastq files are available in the Sequence Read Archive (SRA,
https://www.ncbi.nlm.nih.gov/sra?term=SRP015668).
In addition, raw read counts are provided by recount2
(https://jhubiostatistics.shinyapps.io/recount) and can be accessed using the
Bioconductor package recount
. The gene-level SummarizedExperiment object is
downloaded in a .Rdata file which can be loaded into R.
## temporary directory for storing the file temp.dir = tempdir() ## download the file download_study(project = "SRP015668", type = "rse-gene", outdir = temp.dir) ## load file load(file.path(temp.dir, "rse_gene.Rdata")) ## rename object se.seq = rse_gene print(se.seq)
The SummarizedExperiment object contains count data for r nrow(se.seq)
genes
and r ncol(se.seq)
samples.
The phenotype information is stored as strings in a single column
called characteristics
in the colData
slot. The function
extract_geo_characteristics
can be used to extract the relevant information
into separate columns of colData
.
## show original column head(se.seq$characteristics) ## extract information se.seq = extract_geo_characteristics(se = se.seq) ## show resulting columns head(colData(se.seq)[, 22:25])
Information about the case-control status is given in the variable 'disease.state'.
table(se.seq$disease.state)
In contrast to the microarray experiment with normalized data, recount provides raw data which need to be filtered and normalized for the following QC steps. However, if differential expression analysis is performed using RNA-seq specific package such as DESeq2 or edgeR, the raw counts are required as input.
Lowly expressed genes can be removed using the function remove_genes()
which
includes an option to run the function 'filterByExpr()' from the 'edgeR'
package.
## remove lowly expressed genes se.seq = remove_genes(se = se.seq, assay = "counts", method = "edgeR", freq = 0.5) print(se.seq)
Only r nrow(se.seq)
genes are left after filtering.
The function normalize_counts()
is a wrapper for several other functions
provided by edgeR
, e.g. the tmm method (weighted trimmed mean of M-values)
followed by a log-transformation.
se.seq = normalize_counts(se = se.seq, assay = "counts", method = "tmm") print(se.seq)
The normalized counts are given in a second assay called counts.norm
.
As in the microarray example, the distribution of the expression values for
each sample can be visualized using the function plot_distribution()
. Here,
the name of the assay that should be plotted needs to be specified to generate
a boxplot based on counts.norm
instead of counts
.
## boxplot plot_distribution(se = se.seq, assay = "counts.norm", method = "boxplot") ## quantile plot plot_distribution(se = se.seq, assay = "counts.norm", method = "quantileplot")
Gene expression values show comparable distributions across the majority of the given samples. However, for one sample the overall gene expression has been estimated to be very low. For another three samples a considerably high variability of gene expression values has been observed. All of these outliers are clearly visible within a PCA plot.
## PCA res.pca = calculate_mds_pca(se = se.seq, assay = "counts.norm", method = "pca") ## color code by disease.state plot_mds_pca(res = res.pca, se = se.seq, var.color = "disease.state")
For the remaining QC steps, these outliers will be removed.
## store outlier information res.plot = plot_distribution(se = se.seq, assay = "counts.norm", method = "boxplot") out = res.plot$info$id ## remove se.seq = se.seq[, -(which(colnames(se.seq) %in% out))] print(se.seq)
To detect batch effects, the PCA needs to be run again after removing the outlier samples.
res.pca = calculate_mds_pca(se = se.seq, assay = "counts.norm", method = "pca")
One potential batch in RNA-seq studies is the total read count. Thus, the
variable mapped_read_count
is used to color the PCA plot and also included
in the check_batch_effects()
function. Other potentially interesting
biological variables are case-control status and age.
plot_mds_pca(res = res.pca, se = se.seq, var.color = "mapped_read_count") res.batch = check_batch_effects(se = se.seq, res.pca = res.pca, col.test = c("disease.state", "age.at.death..y.", "mapped_read_count")) print(res.batch$plot)
Apparently the normalization did not completely remove the read count effect.
Again, the assay needs to be specified for the detect_duplicated_samples()
function.
detect_duplicated_samples(se = se.seq, assay = "counts.norm")
No duplicated samples are detected.
As before, the assay needs to be specified for the check_sex()
function.
## load info data("info.sex.genes") check_sex(se = se.seq, assay = "counts.norm", info.sex.genes = info.sex.genes, gene.column = "symbol", sex.column = "gender")
In contrast to the microarray study, no clear clustering of sexes is shown. The reason might be a wrong annotation of the samples. However, the approach of sex-specific expression could be invalid in brain samples. Additional tests based on further studies of brain samples would be necessary.
session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.