knitr::opts_chunk$set(comment="# ", collapse=TRUE)

This guide follows the Bioconductor RNA-Seq workflow to find differentially expressed genes using DESeq2 version r gsub("‘’", "", packageVersion("DESeq2")). The hciR package works best with tidyverse packages (readr, dplyr, tibble, etc.) and simplifies the code in a typical differential expression analysis.

Load samples and counts

Load the sample table from the pasilla package using the readr package. A copy of these files are also available in extdata directory in the hciR package.

library(tidyverse)
extdata <- system.file("extdata", package="hciR")
samples <- read_tsv(paste(extdata, "pasilla_samples.tsv", sep="/"))
samples

Load the count table. The sample names in the first column above should match the count column names (the same order is not required).

options(width=110)
counts  <- read_tsv(paste(extdata, "pasilla_counts.tsv", sep="/"))
counts

Check the pre-filter count cutoffs. The plot displays the number of reads removed using either a maximum based filter or the total number of reads.

par(mar=c(5,4,1,1))
library(hciR)
plot_filter(counts)

Remove 2240 features with zero counts and 2954 with five or fewer reads in every sample to create a final matrix with 9405 rows.

counts <- filter_counts(counts, n = 5)

Load annotations

The hciRdata package includes the latest Ensembl annotations for twelve common reference genomes.

options(width=110)
library(hciRdata)
# data(package="hciRdata")
dplyr::select(fly104, 1:4,8)

The pasilla dataset was analyzed using Ensembl version 62, so 737 gene ids have been removed from the latest release. See the Ensembl vignette to download and use an earlier release.

filter(counts, !id %in% fly104$id) %>% nrow()

Check genes with the highest number of assigned reads.

options(width=110)
n1 <- rowMeans(as_matrix(counts))
right_join( dplyr::select(fly104, 1:4,8),
 tibble(id= names(n1), mean_count = n1)) %>%
 arrange(desc(mean_count))

Run DESeq

Run DESeq using ~ condition + type in the design formula to control for paired vs single end effects on gene expression and get the regularized log (rlog) counts for sample visualizations. These values are similar to the log2 normalized counts except the variance in low count genes is reduced.

dds <- deseq_from_tibble(counts, samples,  design = ~ condition + type)
rld <- DESeq2::rlog(dds)

Plot the first two principal components using the rlog values from the top 500 variable genes. The plot_pca function will plot an interactive highchart by default.

# plot_pca(rld, "condition", tooltip=c("file", "type") , width=700)
plot_pca(rld, "condition", label="file", ggplot=TRUE)


Cluster all the rlog values using the R function dist to calculate the Euclidean distance between samples. Since the distance values on the diagonal are always zero, this is set to NA to avoid skewing the heatmap colors.

# plot_dist(rld , c("condition", "type"), palette="Blues", diagNA=FALSE, reverse=TRUE)
plot_dist(rld , c("condition", "type"), na_col="white")

Results

DESeq will convert any character columns in the design formula to factors (see warning above). In this case, the values will be ordered alphabetically and the results_all function will run all pairwise comparisons in reverse order (for example C vs. B, C vs. A and B vs. A if samples groups are A, B, C).

Check the possible contrasts using check_contrasts.

check_contrasts(dds$condition)

The best way to ensure that control groups are set as the reference level is to factor the columns before running deseq_from_tibble.

samples$condition <- factor(samples$condition, levels = c("untreated", "treated"))

Another option is to use the relevel option in results_all below to compare treated vs. untreated using a 5% FDR (or setting vs="untreated" would also work in this case).

res <- results_all(dds, fly104, alpha= 0.05, relevel =c("untreated", "treated"))

The results are a tibble or list of tibbles if there is more than one contrast.

options(width=110)
arrange(res, padj)  %>%
  dplyr::select(1:7,11)

Plot the fold changes and p-values in a volcano plot.

plot_volcano(res, pvalue= c(35,25))

Cluster the top 40 significant genes and scale by rows, so values represent the number of standard deviations from the mean rlog value.

x <- top_counts( res, rld, top=40)
x
plot_genes(x, c("condition", "type"), scale ="row", annotation_names_col=FALSE)

Cluster all 1090 significant genes.

x <- top_counts(res, rld, top=2000)
nrow(x)
plot_genes(x, c("condition", "type"), scale ="row", annotation_names_col=FALSE,
 show_rownames=FALSE)

Save results

Save the DESeq results, raw counts, normalized counts, regularized log counts and fly annotations to a single Excel file in DESeq.xlsx and R objects to a binary data file to load into a new session.

write_deseq(res, dds, rld, fly104)
save(res, dds, rld, file="dds.rda")

The pasilla dataset is also available in the hciR package and used in the help examples.

pasilla <- list(dds = dds, rlog = rld, results = res)




HuntsmanCancerInstitute/hciR documentation built on March 26, 2024, 3:09 a.m.