rna_seq_analysis: Wrapper function to run edgeR analysis of summarized reads...

Description Usage Arguments Value Examples

Description

This function allows you to run a statistical analysis of an RNA-seq experiment using Bioconductor package edgeR. Inputs are tables of summarized reads generated using featureCounts (part of the lab's RNA-seq analysis pipeline running on NYU's HPC) for each sample in the experiment. The presence or absence of biological replicates is automatically inferred from the conditionNames argument: conditions aggregate samples as replicates. If you provide as many different conditionNames as there are sampleNames each sample is taken as a single condition (no replicates). The output includes tables of CPM, TPM and, if included, differential expression (DE) analysis for selected pairs of samples (see "Value" section below for more details).

Running without replicates: This function allows you to run the analysis on experiments without biological replicate libraries. While the calculation of CPM and TPM has no requirement for replicates, you should obviously avoid doing DE analysis in the absence of replicates. The statistical model used by edgeR to perform DE analysis relies on biological replicates to estimate biological variability. While there is no satisfactory alternative to actual biological replication, you can still run a preliminary DE analysis without replicates. The approach taken here relies on providing an estimate of biological variation that is reasonable for RNA-seq experiments using S. cerevisiae. This is a better alternative to simply assuming that biological variability is absent. Typical values for the common BCV (square root-dispersion) for datasets arising from well-controlled experiments are 0.1 for data on genetically identical model organisms, so that value is used here.

Usage

1
2
rna_seq_analysis(pathToFiles, sampleNames, conditionNames, batchNames = NULL,
  pairwiseDE = FALSE, outputFilePrefix)

Arguments

pathToFiles

A list of strings corresponding to the full path to the featureCounts output files for all samples in the experiment. No default.

sampleNames

A list of strings corresponding to the names of all samples in the experiment in the same order as the files in pathToFiles. No default.

conditionNames

A list of strings corresponding to the experimental groups/conditions for each sample/library. Will be the input to edgeR's DGEList constructor function and will define replicate groups, if any. No default.

batchNames

Optional list of strings corresponding to the experimental batch for each sample/library. Will be part of the design matrix for differential expression testing and will define batch groups, if any. Defaults to NULL (all samples come from the same batch).

pairwiseDE

Logical indicating whether to test differential expression (DE). Defaults to FALSE.

outputFilePrefix

Optional string to be added as prefix to output file names. No default.

Value

The output includes several tables saved as .csv files in a directory named "RNA-seq_analysis" written to the working directory:

  1. CPM: Compositional bias-normalized Counts Per Million (output of edgeR's cpm function). Useful for intersample feature expression comparison (not feature length-normalized).

  2. TPM: Compositional bias and feature length-normalized Transcripts Per Million. Useful for within-sample feature expression comparison.

  3. DE: A Differential Expression table comparing the first sample to all others. Includes log2 fold change (logFC), average log2 counts per million (logCPM), two-sided p-value (PValue) and false discovery rate (FDR).

    Note: Fold change differences (logFC) between samples not directly ompared in the table can be obtained by subtracting their reported logFC (to the first sample). For example, if the first sample is sample1 and we want the logFC between sample2 and sample3, simply calculate the difference:

    logFC.sample3_vs_sample2 = logFC.sample3 - logFC.sample2

A Multi-Dimensional Scaling plot of all samples is also saved in the output directory as a .pdf file.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
## Not run: 
rna_seq_analysis(pathToFiles = list('AH119-2h_featureCounts.txt', 'AH119-3h_featureCounts.txt',
                                    'AH8104-2h_featureCounts.txt', 'AH8104-3h_featureCounts.txt'),
                 sampleNames = list('AH119_2h', 'AH119_3h', 'AH8104_2h', 'AH8104_3h'),
                 conditionNames = list('WT_2h', 'WT_3h', 'dot1_2h', 'dot1_3h'),
                 outputFilePrefix = 'dot1_noReplicates')

rna_seq_analysis(pathToFiles = list('AH119-2h_featureCounts.txt', 'AH119-3h_featureCounts.txt',
                                    'AH8104-2h_featureCounts.txt', 'AH8104-3h_featureCounts.txt'),
                 sampleNames = list('AH119_2h', 'AH119_3h', 'AH8104_2h', 'AH8104_3h'),
                 conditionNames = list('WT_2h', 'WT_3h', 'dot1_2h', 'dot1_3h'),
                 pairwiseDE = TRUE, outputFilePrefix = 'dot1_noReplicates')

rna_seq_analysis(pathToFiles = list('AH119-A_featureCounts.txt', 'AH119-B_featureCounts.txt',
                                    'AH8104-A_featureCounts.txt', 'AH8104-B_featureCounts.txt'),
                 sampleNames = list('AH119_A', 'AH119_B', 'AH8104_A', 'AH8104_B'),
                 conditionNames = list('WT', 'WT', 'dot1', 'dot1'),
                 batchNames = list('batch1', 'batch2', 'batch1', 'batch2')
                 pairwiseDE = TRUE)

## End(Not run)

luisvalesilva/hwglabr documentation built on May 21, 2019, 8:56 a.m.