title: "The QCnormSE User's Guide" author: Silke Szymczak package: QCnormSE abstract: > A comprehensive guide to using the QCnormSE package for quality control of processed gene expression data sets from microarray and RNA-seq experiments. vignette: > %\VignetteIndexEntry{QCnormSE User's Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc_float: true


Introduction

The QCnormSE package provides implementations of several quality control (QC) procedures and processing steps for normalized gene expression data sets from microarray and RNA-seq experiments within a unified framework.

The package uses the SummarizedExperiment class from Bioconductor to store expression values as well as gene annotation and phenotype information of transcriptome data sets in a single object. For more information on the SummarizedExperiment class see https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html.

The package provides functions to detect outlier, duplicated or mislabeled samples and to evaluate potential batch effects together with low dimensional visualizations, e.g. principal component analysis (PCA) or multidimensional scaling (MDS). Furthermore, functions are provided to access expression data from GEO including extraction of information about detection P-values and scan dates if available, to aggregate expression values based on new identifiers (e.g. from probe sets to genes), to perform log transformation and normalization for raw counts from RNA-seq studies and to remove samples or genes based on different criteria (e.g. low expression).

This user guide demonstrates the functionality based on two exemplary data sets , one generated by microarrays and one by RNA-seq technology.

Dependencies

This document has the following dependencies:

library(devtools)
library(QCnormSE)
library(SummarizedExperiment)
library(ggpubr)
library(sva)
library(recount)

Microarray example

The first example is a GEO data set (GSE36398) and contains data from a microarray study comparing gene expression between patients with Facioscapulohumeral muscular dystrophy (FSHD) and their unaffected first degree relatives (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36398).

Prepare data set

The function get_geo_data() can be used to retrieve normalized gene expression data from GEO and to store it as a SummarizedExperiment object.

se.ma = get_geo_data(accession = "GSE36398")
print(se.ma)

The SummarizedExperiment object contains expression data for r nrow(se.ma) genes and r ncol(se.ma) samples.

Information about the case-control status is given in the variable 'disease.state'.

table(se.ma$disease.state)

For this data set, GEO also provides the raw files (.CEL files). Thus, the date of scanning can be extracted using the function extract_scan_date() and is then stored as an additional column called scan.date in the colData() slot. Please note that this step will take a while since all the .CEL files have to be downloaded. It can be skipped for the further analyses presented in this vignette.

se.ma = extract_scan_date(se = se.ma)
table(se.ma$scan.date)

The scanning date can be used as a surrogate for unwanted technical variability (batch effects) which can be automatically defined using the function define_batches(). However, the colData() slot already contains batch information which will be renamed so that it is not overwritten.

## rename
se.ma$batch.original = se.ma$batch
se.ma$batch = NULL

## define
se.ma = define_batches(se = se.ma,
                       col.scan.date = "scan.date")

## comparison
table(original = se.ma$batch.original,
      new = se.ma$batch)

The comparison shows that both batch definitions are equivalent.

Quality control

Detection of outlier samples

One method to detect outlier samples utilises a boxplot of the distribution of expression values for each sample which can be generated using the function plot_distribution().

plot_distribution(se = se.ma,
                  method = "boxplot")

For studies with even larger sample size an alternative is the quantile plot where the quantiles are visualized as lines across the samples.

plot_distribution(se = se.ma,
                  method = "quantileplot")

Another possibility is a low dimensional visualization using PCA or MDS which can be generated using the functions calculate_mds_pca() and plot_mds_pca() . Different phenotype variables can be specified for coloring and shaping.

## PCA
res.pca = calculate_mds_pca(se = se.ma,
                            method = "pca")

## color code by disease.state
plot_mds_pca(res = res.pca,
             se = se.ma,
             var.color = "disease.state")

## color code by disease.state
## and shape by tissue
plot_mds_pca(res = res.pca,
             se = se.ma,
             var.color = "disease.state",
             var.shape = "tissue")

The plots show no obvious outliers which is also confirmed by the return value of the functions (the info component is NULL). Otherwise this component would give information about outlier samples.

Detection of batch effects

If the PCA plot is colored by batch, it can be seen that the clusters in the data are driven by the different batches.

## color code by batch
plot_mds_pca(res = res.pca,
             se = se.ma,
             var.color = "batch")

The association between each of the first three components and variables of interest (e.g. batch but also case-control status) can be tested using the function check_batch_effects(). The resulting heatmap visualizes the P-values of the F tests in a linear model (as in the prince plot in the R package swamp).

res.batch = check_batch_effects(se = se.ma,
                                res.pca = res.pca,
                                col.test = c("batch",
                                             "disease.state",
                                             "tissue"))
print(res.batch$plot)

It can be clearly seen that the variability captured in the first three components is not based on the biological information (disease.state and tissue) but rather caused by batch effects (all P-values < 0.0001).

For the remaining QC steps correction of the batch effect is performed using the ComBat function in the R package sva. However, for statistical analyses of differential expression, it might be better to include batch as a covariate in the statistical model.

## extract expression values
exprs = assays(se.ma)$exprs

## perform bacth correction
exprs.corrected = ComBat(dat = exprs,
                         batch = se.ma$batch)

## store corrected expression values as new assay
assays(se.ma)$exprs.corrected = exprs.corrected

## check correction
## explicitly state assay
res.pca.corrected = calculate_mds_pca(se = se.ma,
                                      assay = "exprs.corrected")

plot_mds_pca(res = res.pca.corrected,
             se = se.ma,
             var.shape = "batch",
             var.color = "disease.state")

res.corrected = check_batch_effects(se = se.ma,
                                    res.pca = res.pca.corrected,
                                    col.test = c("batch",
                                                 "disease.state",
                                                 "tissue"))
print(res.corrected$plot)

The influence of the batches is removed, however, the biological information is still not captured by the first component.

Detection of duplicated samples

Potential duplicates can be detected using the function detect_duplicated_samples() which calculates pairwise correlations between samples. Duplicated samples have an extremely high correlation and are visualized using a vertical red line on the histogram.

detect_duplicated_samples(se = se.ma,
                                     assay = "exprs.corrected")

No problematic samples are detected.

Detection of mislabeled samples

Sample swaps or mislabeled samples can be identified using the function check_sex() which predicts sex based on the expression of sex chromosomal genes and compares it to the reported sex in the colData slot.

Information about genes on the X and Y chromosome for humans is provided as part of the package.

## load info
data("info.sex.genes")

check_sex(se = se.ma,
          assay = "exprs.corrected",
          info.sex.genes = info.sex.genes,
          gene.column = "Gene.symbol",
          sex.column = "Sex")

The MDS plot shows a clear clustering by sex and no misannotation across sexes.

RNA-seq example

The second example is a sequencing based transcriptome study of normal and Parkinson's disease human brain samples (SRP015668).

Prepare data set

The fastq files are available in the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra?term=SRP015668). In addition, raw read counts are provided by recount2 (https://jhubiostatistics.shinyapps.io/recount) and can be accessed using the Bioconductor package recount. The gene-level SummarizedExperiment object is downloaded in a .Rdata file which can be loaded into R.

## temporary directory for storing the file
temp.dir = tempdir()

## download the file
download_study(project = "SRP015668",
               type = "rse-gene",
               outdir = temp.dir)

## load file
load(file.path(temp.dir, "rse_gene.Rdata"))

## rename object
se.seq = rse_gene

print(se.seq)

The SummarizedExperiment object contains count data for r nrow(se.seq) genes and r ncol(se.seq) samples.

The phenotype information is stored as strings in a single column called characteristics in the colData slot. The function extract_geo_characteristics can be used to extract the relevant information into separate columns of colData.

## show original column
head(se.seq$characteristics)

## extract information
se.seq = extract_geo_characteristics(se = se.seq)

## show resulting columns
head(colData(se.seq)[, 22:25])

Information about the case-control status is given in the variable 'disease.state'.

table(se.seq$disease.state)

In contrast to the microarray experiment with normalized data, recount provides raw data which need to be filtered and normalized for the following QC steps. However, if differential expression analysis is performed using RNA-seq specific package such as DESeq2 or edgeR, the raw counts are required as input.

Lowly expressed genes can be removed using the function remove_genes() which includes an option to run the function 'filterByExpr()' from the 'edgeR' package.

## remove lowly expressed genes
se.seq = remove_genes(se = se.seq,
                      assay = "counts",
                      method = "edgeR",
                      freq = 0.5)
print(se.seq)

Only r nrow(se.seq) genes are left after filtering.

The function normalize_counts() is a wrapper for several other functions provided by edgeR, e.g. the tmm method (weighted trimmed mean of M-values) followed by a log-transformation.

se.seq = normalize_counts(se = se.seq,
                          assay = "counts",
                          method = "tmm")
print(se.seq)

The normalized counts are given in a second assay called counts.norm.

Quality control

Detection of outlier samples

As in the microarray example, the distribution of the expression values for each sample can be visualized using the function plot_distribution(). Here, the name of the assay that should be plotted needs to be specified to generate a boxplot based on counts.norm instead of counts.

## boxplot
plot_distribution(se = se.seq,
                  assay = "counts.norm",
                  method = "boxplot")

## quantile plot
plot_distribution(se = se.seq,
                  assay = "counts.norm",
                  method = "quantileplot")

Gene expression values show comparable distributions across the majority of the given samples. However, for one sample the overall gene expression has been estimated to be very low. For another three samples a considerably high variability of gene expression values has been observed. All of these outliers are clearly visible within a PCA plot.

## PCA
res.pca = calculate_mds_pca(se = se.seq,
                            assay = "counts.norm",
                            method = "pca")

## color code by disease.state
plot_mds_pca(res = res.pca,
             se = se.seq,
             var.color = "disease.state")

For the remaining QC steps, these outliers will be removed.

## store outlier information
res.plot = plot_distribution(se = se.seq,
                             assay = "counts.norm",
                             method = "boxplot")
out = res.plot$info$id

## remove
se.seq = se.seq[, -(which(colnames(se.seq) %in% out))]
print(se.seq)

Detection of batch effects

To detect batch effects, the PCA needs to be run again after removing the outlier samples.

res.pca = calculate_mds_pca(se = se.seq,
                            assay = "counts.norm",
                            method = "pca")

One potential batch in RNA-seq studies is the total read count. Thus, the variable mapped_read_count is used to color the PCA plot and also included in the check_batch_effects() function. Other potentially interesting biological variables are case-control status and age.

plot_mds_pca(res = res.pca,
             se = se.seq,
             var.color = "mapped_read_count")

res.batch = check_batch_effects(se = se.seq,
                                res.pca = res.pca,
                                col.test = c("disease.state",
                                             "age.at.death..y.",
                                             "mapped_read_count"))
print(res.batch$plot)

Apparently the normalization did not completely remove the read count effect.

Detection of duplicated samples

Again, the assay needs to be specified for the detect_duplicated_samples() function.

detect_duplicated_samples(se = se.seq,
                          assay = "counts.norm")

No duplicated samples are detected.

Detection of mislabeled samples

As before, the assay needs to be specified for the check_sex() function.

## load info
data("info.sex.genes")

check_sex(se = se.seq,
          assay = "counts.norm",
          info.sex.genes = info.sex.genes,
          gene.column = "symbol",
          sex.column = "gender")

In contrast to the microarray study, no clear clustering of sexes is shown. The reason might be a wrong annotation of the samples. However, the approach of sex-specific expression could be invalid in brain samples. Additional tests based on further studies of brain samples would be necessary.

Session info

session_info()


szymczak-lab/QCnormSE documentation built on March 25, 2023, 1:05 p.m.