In paulklemm/seqc: What the Package Does (One Line, Title Case)

# Use this code for debugging the session interactively. Put this as well as the setup chunk to set everything up.
if (!exists('params')) {
  params <- list()
  params$cuffdiff_path <- '/beegfs/scratch/bruening_scratch/pklemm/2018-06-johan-ruud-rnaseq-tuxedo/old_run/cuffdiff_output/RUUD'
  params$output_path <- '/beegfs/scratch/bruening_scratch/pklemm/2018-06-johan-ruud-rnaseq-tuxedo/old_run/seqc_output/RUUD'
  params$save_plots <- TRUE
  params$save_width <- 10
  params$save_height <- 10
  params$save_format <- 'pdf'
}

library(seqc)
library(magrittr)
library(ggplot2)
library(ggrepel)

# Debug outputs
params %>% print()
# End debug outputs
saveAndPrint <- function(plot, plot_name, is_ggplot = TRUE) {
  plot %>% seqc::saveReturnPlot(save_plot = params$save_plots, plot_name = plot_name, output_path = params$save_plots_path, is_ggplot = is_ggplot, save_width = save_width, save_height = save_height, save_format = save_format)
}
cuff <- read_cuffdiff(params$cuffdiff_path)

This Notebook is guided by the cummeRbund Manual. Accompanying descriptions of the plots are cited from the cummeRbund manual.

Global Statistics & Quality Control

cuff %>% print()

Several plotting methods are available that allow for quality-control or global analysis of cufflinks data. A good place to begin is to evaluate the quality of the model fitting. Overdispersion is a common problem in RNA-Seq data. As of cufflinks v2.0 mean counts, variance, and dispersion are all emitted, allowing you to visualize the estimated overdispersion for each sample as a quality control measure.

cuff %>% genes() %>% dispersionPlot() %>%
  saveAndPrint('dispersionPlot')

The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data. Differences in CV 2 can result in lower numbers of differentially expressed genes due to a higher degree of variability between replicate fpkm estimates.

cuff %>% genes() %>% fpkmSCVPlot() %>% 
  saveAndPrint('genes_scv')
cuff %>% isoforms() %>% fpkmSCVPlot() %>%
  saveAndPrint('isoforms_scv')

To assess the distributions of FPKM scores across samples, you can use the csDensity plot.

cuff %>% genes() %>% csDensity() %>%
  saveAndPrint('csDensity')
cuff %>% genes() %>% csDensity(replicates = TRUE) %>%
  saveAndPrint('csDensityReplicates')

Boxplots [of FPKM distributions] can be visualized using the csBoxplot method Box plot with replicates=TRUE exposes individual replicate FPKM distributions.

cuff %>% genes() %>% csBoxplot() %>% 
  saveAndPrint('csBoxplot')
cuff %>% genes() %>% csBoxplot(replicates = TRUE) %>%
  saveAndPrint('csBoxplotReplicates')

A matrix of pairwise scatterplots can be drawn using the csScatterMatrix() method. Scatterplots can be useful to identify global changes and trends in gene expression between pairs of conditions. Pairwise scatterplots can identify biases in gene ex- pression between two particular conditions.

cuff %>% genes() %>% csScatterMatrix() %>%
  saveAndPrint('csScatterMatrix')

Dendrogram of JS distances between conditions and replicates.

cuff %>% genes() %>% csDendro() %>%
  saveAndPrint('csDendogram', is_ggplot = FALSE)
cuff %>% genes() %>% csDendro(replicates = TRUE) %>%
  saveAndPrint('csDendogramReplicates', is_ggplot = FALSE)

MvsA plots can be useful to determine any systematic bias that may be present between conditions. The CuffData method MAplot() can be used to examine these intensity vs fold-change plots. You must specify the sample names to use for the pairwise comparison with x and y:

Not implemented

Volcano plots are also available for the CuffData objects.

cuff %>% genes() %>% csVolcanoMatrix() %>%
  saveAndPrint("csVolcanoMatrix")

Distribution of p- and q-values. This is not part of the standard cummeRbund package.

cuff %>% genes %>% diffData %>%
  ggplot(aes(x=p_value)) + 
    geom_histogram() +
    scale_y_log10() +
    ggtitle("p Value distribution (log10 transformed)")

cuff %>% genes %>% diffData %>%
  ggplot(aes(x=q_value)) + 
    geom_histogram() +
    scale_y_log10() +
    ggtitle("q Value distribution (adjusted p-value, log10 transformed")

Geneset Level Plots

The csHeatmap() function is a plotting wrapper that takes as input either a CuffGeneSet or a CuffFeatureSet object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. The ’cluster’ argument can be used to re-order either ’row’, ’column’, or ’both’ dimensions of this matrix. By default, the Jensen-Shannon distance is used as the clustering metric, however, any function that produces a dist object can be passed to the ’cluster’ argument as well.

On getSig():

By default getSig() outputs a vector of tracking IDs corresponding to all genes that reject the null hypothesis in any condition tested. The default feature type can be changed by adjusting the ’level’ argument to getSig(). In addition, a alpha value can be provided on which to filter the resulting list (the default is 0.05 to match the default of cuffdiff). Significance results for specific pairwise comparisons can be retrieved as well by specifying the two conditions as ’x’ and ’y’. In this case, p-values are adjusted to reduce the impact of multiple-testing correction when only one set of tests is being conducted.

# Remove "|XLOC" labels from heatmap
heatmapFixLabels <- function(plot) {
  plot$scales$scales[[2]]$labels <- sub("\\|.*", "",plot$scales$scales[[2]]$labels)
  plot$scales$scales[[2]]$labels <- sub(",.*", "",plot$scales$scales[[2]]$labels)
  plot %>% return()
}
# Get significant genes
minPValue <- 0.05
sigGeneIDs <- cuff %>% getSig(alpha=minPValue, level='genes')
# Check if there are actually genes passing the significance filter
if (length(sigGeneIDs) > 0) {
  sigGenes <- cuff %>% getGenes(sigGeneIDs)
  # Make plots
  sigGenes %>% csHeatmap(cluster = 'both') %>% heatmapFixLabels() %>%
    saveAndPrint('heatmap')
  sigGenes %>% csHeatmap(cluster = 'both', replicates = TRUE) %>% heatmapFixLabels() %>%
    saveAndPrint('heatmap_isoforms')
} else {
  paste0("No genes significant with p-value <= ", minPValue) %>% print()
}

Data Exploration

The sigMatrix() function can provide you with a “quick–and–dirty” view of the number of significant features of a particular type, and at a given alpha (0.05 by default).

# Alpha = False discovery value (p-Value)
cuff %>% sigMatrix(,level='genes',alpha=0.05) %>%
  saveAndPrint('sigMatrix')

Similarities between conditions and/or replicates can provide useful insight into the relationship between various groupings of conditions and can aid in identifying outlier replicates that do not behave as expected. cummeRbund provides the csDistHeat() method to visualize the pairwise similarities between conditions. Again with the replicates argument, distances between individual replicates can be presented.

genes(cuff) %>% csDistHeat() %>%
  saveAndPrint('distanceMatrix')

genes(cuff) %>% csDistHeat(replicates = TRUE) %>%
  saveAndPrint('distanceMatrix_replicates')

Dimensionality Reduction

Dimensionality reduction is an informative approach for clustering and exploring the relationships between conditions. It can be useful for feature selection as well as identifying the sources of variability within your data. To this end, we have applied two different dimensionality reduction strategies in cummeRbund: principal component analysis (PCA) and multi-dimensional scaling (MDS). We provide the two wrapper methods, PCAplot and MDSplot.

cuff %>% genes() %>% PCAplot('PC1', 'PC2') %>%
  saveAndPrint('PCA_genes')
cuff %>% genes() %>% PCAplot('PC1', 'PC2', replicates = TRUE) %>%
  saveAndPrint('PCA_genes_replicates')

# Normal MDS plot only makes sense for more than two conditions (cummeRbund calls conditions 'samples' and samples 'replicates'
if (cuff %>% samples() %>% nrow() > 2) {
  cuff %>% genes() %>% MDSplot() %>%
    saveAndPrint('PCoA')
}
cuff %>% genes() %>% MDSplot(replicates = TRUE) %>%
  saveAndPrint('PCoA_replicates')
cuff %>% genes() %>% MDSplot(replicates = TRUE)

Custom: Own heat map implementation

Since the heat maps produced by cummeRbund usually look very busy we did a custom implementation of it. This one has a more stringent q-value cutoff of 0.005.

q_value_cutoff_heatmap <- 0.005

# Get FPKM per Replicate
custom_heat_map <- cummeRbund::repFpkm(cummeRbund::genes(cuff)) %>%
  tibble::as_tibble() %>%
  # Filter for significant genes
  dplyr::right_join(
    # Get list of significant genes
    cummeRbund::diffData(cummeRbund::genes(cuff)) %>%
      tibble::as_tibble() %>%
      dplyr::filter(q_value <= q_value_cutoff_heatmap) %>%
      dplyr::select(gene_id)
  ) %>%
  dplyr::select(gene_id, rep_name, fpkm) %>%
  # Make same adjustment as in cummerbund
  dplyr::mutate(adjFPKM = log10(fpkm + 1)) %>%
  ggplot(aes(x = rep_name, y = gene_id)) +
    geom_tile(aes(fill = adjFPKM)) +
    scale_fill_distiller(direction = 1, name = "log(10) FPKM + 1") +
    ggtitle("Heat map of top regulated genes") +
    xlab("Sample Label") +
    ylab("Genes") +
    theme(
      axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )

custom_heat_map %>%
  saveAndPrint('Heatmap_replicates_own')

Custom: PCA plot

custom_pca_plot <- cummeRbund::repFpkm(cummeRbund::genes(cuff)) %>%
  # Make a data frame that contains each replicate as column and each gene as row
  tibble::as_tibble() %>%
  dplyr::select(gene_id, rep_name, fpkm) %>%
  tidyr::spread(rep_name, fpkm) %>%
  dplyr::select(-gene_id) %>%
  dplyr::distinct() %>%
  as.matrix() %>%
  # Run principal component analysis on the matrix
  prcomp() %>%
  (function(prcomp_result) {
    # Calculate proportion of variance
    proportion_of_variance <<- tibble::tibble(
      pc = paste0("PC", 1:nrow(prcomp_result$rotation)),
      variance_proportion = ((prcomp_result$sdev^2) / (sum(prcomp_result$sdev^2)))*100
    )
    prcomp_result %>%
      return()
  }) %>%
  .$rotation %>%
  as.data.frame() %>%
  # Attach sample labels again
  tibble::rownames_to_column("sample") %>%
  # Get the group name
  tidyr::separate(sample, into = c("group", "id"), sep = "_([^_])*$", remove = FALSE) %>%
  # Make scatterplot
  ggplot(aes(x=PC1, y=PC2, label = sample, colour = group)) +
    geom_point() +
    ggtitle("Principal Component Analysis") +
    ggrepel::geom_text_repel() +
    xlab(paste0("Principal Component 1 (", proportion_of_variance$variance_proportion[1] %>% round(digits = 2), "% variance explained)")) +
    ylab(paste0("Principal Component 2 (", proportion_of_variance$variance_proportion[2] %>% round(digits = 2), "% variance explained)")) +
    theme_classic()

custom_pca_plot %>%
  saveAndPrint('PCA_own')

Plots in cummeRbund that are not included in this report

Skipped all plots showing individual gene levels - make this interactive?
Skipped "Finding similar genes"

Another common question in large-scale gene expression analyses is ’How can I find genes with similar expression profiles to gene x?’. We have implemented a method, findSimilar to allow you to identify a fixed number of the most similar genes to a given gene of interest.

paulklemm/seqc documentation built on Sept. 12, 2019, 8:15 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com