knitr::opts_chunk$set(echo = TRUE) options(rmarkdown.html_vignette.check_title = FALSE)
library(pmartR) library(pmartRdata)
RNAseq data in pmartR
follows a different workflow compared to the other data types, with different filtering, plotting, and statistical options. This vignette walks through the RNAseq workflow using example data from the pmartRdata
package, including the use of all three statistical analysis options, DEseq2 [@deseq2], edgeR [@edger], or limmaVOOM [@limmavoom].
Data upload, create object
# load example e_data, f_data, e_meta for the RNAseq dataset edata <- rnaseq_edata fdata <- rnaseq_fdata emeta <- rnaseq_emeta
For these data, the column named "Transcript" denotes the transcript identifier in e_data.
head(edata)
The f_data data frame contains the sample identifier, "SampleName", mapping to the column names in e_data other than "Transcript". The column for "Virus" takes on 1 of 3 values, and will be used as our main effect, since comparisons of interest in this experiment are between virus strains. Finally, f_data contains information about the Donor and Replicate, which describe the provenance of the samples and biological replicate information.
head(fdata)
The e_meta data frame contains just 2 columns, "Transcript" and "Gene". The "Transcript" column contains the same transcripts found in e_data, and "Gene" provides the mapping of the transcripts to the gene level.
head(emeta)
Since these are RNAseq data, we will create a seqData
object in pmartR
. Note that unlike the other data types supported in pmartR
, seqData objects are not log2 transformed.
myseq <- as.seqData( e_data = rnaseq_edata, e_meta = rnaseq_emeta, f_data = rnaseq_fdata, edata_cname = "Transcript", fdata_cname = "SampleName", emeta_cname = "Transcript" )
We can use the summary
function to get basic summary information for our isbaricseqData
object.
summary(myseq)
Using the plot
function on our seqData object provides box plots of log counts per million (lcpm) for each sample.
plot(myseq, transformation = "lcpm")
We are preparing this data for statistical analysis where we will compare the samples belonging to one group to the samples belonging to another, and so we must specify the group membership of the samples. We do this using the group_designation()
function, which modifies our seqData object and returns an updated version of it. Up to two main effects and up to two covariates may be specified (VERIFY THIS IS TRUE), with one main effect being required at minimum. For these example data, we specify the main effect to be "Virus" so that we can make comparisons between the different strains. Certain functions we will use below require that groups have been designated via the group_designation()
function, and doing so makes the results of plot
and summary
more meaningful.
myseq <- group_designation(omicsData = myseq, main_effects = "Virus") summary(myseq) plot(myseq, order_by = "Virus", color_by = "Virus", transformation = "lcpm")
The group_designation()
function creates an attribute of the dataset as follows:
attributes(myseq)$group_DF
Transcripts can be removed from seqData
objects using the molecule filter or the custom filter, if desired. See the "Filter Functionality" vignette for more information.
The total count filter is specifically applicable to seqData
objects, and is implemented in pmartR
based
on recommendations in edgeR processing [@edger], where the low-observed biomolecules are removed. Here we use the default recommendation in edgeR and require at least 15 total counts observed across samples. We plot the filter first without a min_count
threshold, and then with our threshold of 15, so we can see how the application of the filter changes the observation density.
mytotcountfilt <- total_count_filter(omicsData = myseq) summary(mytotcountfilt, min_count = 15) plot(mytotcountfilt) plot(mytotcountfilt, min_count = 15)
Once we are satified with the filter results, we apply the filter to the seqData
object.
myseq <- applyFilt(filter_object = mytotcountfilt, omicsData = myseq, min_count = 15)
The RNA filter removes samples based on either:
library size (number of reads)
number/proportion of unique non-zero biomolecules per sample
This filter is particularly useful for identifying any samples that contain lower than expected numbers of reads.
First we utilize this filter to identify any samples with fewer than one million reads, of which there are none.
# create RNA filter object myrnafilt <- RNA_filter(omicsData = myseq) # filter by library size plot(myrnafilt, plot_type = "library") plot(myrnafilt, plot_type = "library", size_library = 1000000)
Next, we utilize the filter to identify any samples with fewer than 5,000 non-zero counts, of which there are none.
# filter based on number or proportion of non-zero counts plot(myrnafilt, plot_type = "biomolecule", min_nonzero = 5000)
After filtering out any transcripts and/or samples above, we may wish to do some exploratory data analysis.
A Spearman correlation heatmap shows a couple of samples with lower correlation than the rest, but neither of these are
Strain B 3 D
Strain C 1 E
mycor <- cor_result(omicsData = myseq) plot(mycor, interactive = TRUE)
A principal components analysis (PCA) plot of our data shows clustering of strains A and C with one another, separated from strain B. The two samples that are separated on PC1 one from the rest are the same two samples that stood out in the correlation heatmap.
mypca <- dim_reduction(omicsData = myseq, k = 2) plot(mypca, interactive = TRUE) mypca_df <- data.frame(SampleID = mypca$SampleID, PC1 = mypca$PC1, PC2 = mypca$PC2)
For RNAseq data there are three methods available via pmartR
for making statistical comparisons:
edgeR [@edger]
DESeq2 [@deseq2]
limma-voom [@limmavoom]
Here we utilize all three for demonstration purposes, but in practice a single method should be utilized, the selection of which can be guided by visual inspection of dispersion plots for each method via dispersion_est
. After using the diffexp_seq
function to perform statistical comparisons with the selected method, plot
and summary
methods are available to help display the results, and the data frame returned by the diffexp_seq
function can be saved as an Excel file or another useful format if helpful.
We examine the dispersion plot for use of the edgeR method.
dispersion_est(omicsData = myseq, method = "edgeR")
If we are satisfied with using this method for our statistical comparisons between virus strains, we can do so as follows.
stats_edger <- diffexp_seq(omicsData = myseq, method = "edgeR")
Various plot types are available to display the results of the statistical comparisons.
plot(stats_edger, plot_type = "volcano") plot(stats_edger, plot_type = "bar") plot(stats_edger, plot_type = "ma")
We examine the dispersion plot for use of the DESeq2 method.
dispersion_est(omicsData = myseq, method = "DESeq2")
If we are satisfied with using this method for our statistical comparisons between virus strains, we can do so as follows.
stats_deseq <- diffexp_seq(omicsData = myseq, method = "DESeq2")
Various plot types are available to display the results of the statistical comparisons.
plot(stats_deseq, plot_type = "volcano") plot(stats_deseq, plot_type = "bar") plot(stats_deseq, plot_type = "ma")
We examine the dispersion plot for use of the voom method.
dispersion_est(omicsData = myseq, method = "voom")
If we are satisfied with using this method for our statistical comparisons between virus strains, we can do so as follows.
stats_voom <- diffexp_seq(omicsData = myseq, method = "voom")
Various plot types are available to display the results of the statistical comparisons.
plot(stats_voom, plot_type = "volcano") plot(stats_voom, plot_type = "bar") plot(stats_voom, plot_type = "ma")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.