rna_seq_analysis: Wrapper function to run edgeR analysis of summarized reads...
In luisvalesilva/hwglabr: Collection of R functions used in the Hochwagen Lab

rna_seq_analysis

R Documentation

Wrapper function to run edgeR analysis of summarized reads from RNA-seq experiment

Description

This function allows you to run a statistical analysis of an RNA-seq experiment using Bioconductor package edgeR. Inputs are tables of summarized reads generated using featureCounts (part of the lab's RNA-seq analysis pipeline running on NYU's HPC) for each sample in the experiment. The presence or absence of biological replicates is automatically inferred from the conditionNames argument: conditions aggregate samples as replicates. If you provide as many different conditionNames as there are sampleNames each sample is taken as a single condition (no replicates). The output includes tables of CPM, TPM and, if included, differential expression (DE) analysis for selected pairs of samples (see "Value" section below for more details).

Running without replicates: This function allows you to run the analysis on experiments without biological replicate libraries. While the calculation of CPM and TPM has no requirement for replicates, you should obviously avoid doing DE analysis in the absence of replicates. The statistical model used by edgeR to perform DE analysis relies on biological replicates to estimate biological variability. While there is no satisfactory alternative to actual biological replication, you can still run a preliminary DE analysis without replicates. The approach taken here relies on providing an estimate of biological variation that is reasonable for RNA-seq experiments using S. cerevisiae. This is a better alternative to simply assuming that biological variability is absent. Typical values for the common BCV (square root-dispersion) for datasets arising from well-controlled experiments are 0.1 for data on genetically identical model organisms, so that value is used here.

Usage

rna_seq_analysis(pathToFiles, sampleNames, conditionNames, batchNames = NULL,
  pairwiseDE = FALSE, outputFilePrefix)

Arguments

`pathToFiles`	A list of strings corresponding to the full path to the featureCounts output files for all samples in the experiment. No default.
`sampleNames`	A list of strings corresponding to the names of all samples in the experiment in the same order as the files in `pathToFiles`. No default.
`conditionNames`	A list of strings corresponding to the experimental groups/conditions for each sample/library. Will be the input to edgeR's `DGEList` constructor function and will define replicate groups, if any. No default.
`batchNames`	Optional list of strings corresponding to the experimental batch for each sample/library. Will be part of the design matrix for differential expression testing and will define batch groups, if any. Defaults to `NULL` (all samples come from the same batch).
`pairwiseDE`	Logical indicating whether to test differential expression (DE). Defaults to `FALSE`.
`outputFilePrefix`	Optional string to be added as prefix to output file names. No default.

Value

The output includes several tables saved as .csv files in a directory named "RNA-seq_analysis" written to the working directory:

CPM: Compositional bias-normalized Counts Per Million (output of edgeR's cpm function). Useful for intersample feature expression comparison (not feature length-normalized).
TPM: Compositional bias and feature length-normalized Transcripts Per Million. Useful for within-sample feature expression comparison.
DE: A Differential Expression table comparing the first sample to all others. Includes log2 fold change (logFC), average log2 counts per million (logCPM), two-sided p-value (PValue) and false discovery rate (FDR).

Note: Fold change differences (logFC) between samples not directly ompared in the table can be obtained by subtracting their reported logFC (to the first sample). For example, if the first sample is sample1 and we want the logFC between sample2 and sample3, simply calculate the difference:

logFC.sample3_vs_sample2 = logFC.sample3 - logFC.sample2

A Multi-Dimensional Scaling plot of all samples is also saved in the output directory as a .pdf file.

Examples

## Not run: 
rna_seq_analysis(pathToFiles = list('AH119-2h_featureCounts.txt', 'AH119-3h_featureCounts.txt',
                                    'AH8104-2h_featureCounts.txt', 'AH8104-3h_featureCounts.txt'),
                 sampleNames = list('AH119_2h', 'AH119_3h', 'AH8104_2h', 'AH8104_3h'),
                 conditionNames = list('WT_2h', 'WT_3h', 'dot1_2h', 'dot1_3h'),
                 outputFilePrefix = 'dot1_noReplicates')

rna_seq_analysis(pathToFiles = list('AH119-2h_featureCounts.txt', 'AH119-3h_featureCounts.txt',
                                    'AH8104-2h_featureCounts.txt', 'AH8104-3h_featureCounts.txt'),
                 sampleNames = list('AH119_2h', 'AH119_3h', 'AH8104_2h', 'AH8104_3h'),
                 conditionNames = list('WT_2h', 'WT_3h', 'dot1_2h', 'dot1_3h'),
                 pairwiseDE = TRUE, outputFilePrefix = 'dot1_noReplicates')

rna_seq_analysis(pathToFiles = list('AH119-A_featureCounts.txt', 'AH119-B_featureCounts.txt',
                                    'AH8104-A_featureCounts.txt', 'AH8104-B_featureCounts.txt'),
                 sampleNames = list('AH119_A', 'AH119_B', 'AH8104_A', 'AH8104_B'),
                 conditionNames = list('WT', 'WT', 'dot1', 'dot1'),
                 batchNames = list('batch1', 'batch2', 'batch1', 'batch2')
                 pairwiseDE = TRUE)

## End(Not run)

luisvalesilva/hwglabr documentation built on Feb. 25, 2025, 3:51 p.m.