\newpage

DGE.Tools2: Plotting tools

DGE.Tools2 includes an assortment of standard data exploration plotting tools. The intention here is to make common plot types dead easy to produce and provide a consistent look and feel. Generally, optional argument allow you to tweak the plots and since the plots are based on ggplot, you can further customize the resulting ggplot objects.

The Plotting functions are:

Most of the plots are generated with ggplot, thus, the user can take a returned ggplot object and further modify it to their heart's content.

Code Block: Load some test data (IPF Fibroblast data).

rm(list=ls()) #Clear the workspace
invisible(gc()) #garbage collection to masimize available memory
startTime = Sys.time()

setwd("~/R/lib/pkgsrc/DGE.Tools2/vignettes")
library(tidyverse)
library(magrittr)
library(DGEobj)
library(DGE.Tools2)
library(JRTutil)

outputPath <- "./output"

#get a dataset to work with from Stash
dgeObj <- getRDSobjFromStash("UCSD_Lung_Fibroblasts_P-20171107-0002_8Feb2018.RDS")

#apply a low intensity filter
dgeObj <- lowIntFilter(dgeObj, zfpkmThreshold=-3, countThreshold=10)

\newpage

Table of Differential Gene Counts from Contrasts

Code Block: Signature Table

#grab a topTable dataframe
myContastList <- getType(dgeObj, "topTable")

#Plot with defaults
df <- summarizeSigCounts(myContastList)
knitr::kable(df)

#add a fold change threshold
df <- summarizeSigCounts(myContastList, fcThreshold = 2)
knitr::kable(df)

Arguments to summarizeSigCounts allow you to specify which fields to include and define the thresholds for significance. see ?summarizeSigcounts

\newpage

Profile Plot (aka MA plot): LogInt vs. LogRatio plots

Function profilePlot uses defaults appropriate for topTable/topTreat dataframes.

The printAndSave function sends the plot to the console with a 12pt base font (suitable for knitr/pdf output) and saves an image file with a 24pt base font which is more suitable for PPT use. The file extension determines the image file type (allowed values include: .pdf, .png, .jpg, .tiff, .svg, bmp)

Code Block: Profile Plot Example

#grab a topTable dataframe
myTopTable <- dgeObj$TGFb25_vs_Veh

#draw the plot
MyProfilePlot <- profilePlot(myTopTable, 
                             title="TGFbeta Signature",
                             legendPosition = "ne")
printAndSave(MyProfilePlot, file.path(outputPath, "ProfilePlot.PNG"))

See ?profilePlot for options to modify various attributes (color, shape, tranparency, reference lines, etc)

\newpage

Volcano Plot: LogRatio vs. Negative Log Pvalue

Function volcanoPlot uses defaults appropriate for topTable dataframes.

Code Block: Volcano Plot Example

#grab a topTable dataframe
myTopTable <- dgeObj$TGFb25_vs_Veh

#draw the plot
MyVolcanoPlot <- volcanoPlot(myTopTable)
printAndSave(MyVolcanoPlot, file.path(outputPath, "VolcanoPlot.PNG"))

See ?volcanoPlot for options to modify various attributes (color, shape, tranparency, reference lines, etc)

\newpage

CDF Plot: Evaluate model performance

Plot the pvalue Rank vs. the actual pvalues. A straight diagonal line indicate no differential genes (null hypothesis satisfied), while differential genes appear as a break above the line at low pvalues.

The main plot shows pvalues <= 0.1 (user settable). An inset plot shows the full range of pvalues and a blue box indicates the region shown in the main plot. Genes with p<0.01 (user settable) are shown in red.

Code Block: CDF Plot Example

#grab a topTable dataframe
myTopTable <- dgeObj$TGFb25_vs_Veh

#draw the plot
MyCDFPlot <- cdfPlot(myTopTable, plotFile=file.path(outputPath, "MyCdfPlot.PNG"))

Note: printAndSave doesn't work with cdfPlot because a cdfPlot is really two plots. Therefore, cdfPlot has an argument to enable saving the plot to an image file.

See ?cdfPlot options to modify various attributes (color, shape, tranparency, reference lines, etc)

\newpage

Compare Plot: compare two signature

Plot the LogRatios of two contrasts, highlight common genes and genes unique to either contrast.

The input for this plot requires pulling the log ratio and pvalue columns from two topTable dataframes together.

Code Block: Compare Plot Example

#Get two topTreat dataframes
#grab a topTable dataframe
tt1 <- dgeObj$TGFb25_vs_Veh
tt2 <- dgeObj$TGFb10_vs_Veh

#combine the LogRatios and pvalues into a dataframe
compareData <- data.frame(cbind(TGFb25= tt1$logFC,
                   Stable.Vs.Norm = tt2$logFC,
                   xp = tt1$P.Value,
                   yp = tt2$P.Value) )

MyComparePlot = comparePlot(compareData, 
                            title = "Compare/Contrast Rapid and Stable",
                            legendPosition = "se")
printAndSave (MyComparePlot, file.path(outputPath, "ComparePlot.PNG"))

See ?comparePlot options to modify various attributes (color, shape, tranparency, reference lines, etc)

\newpage

Observation Plot: Intensity boxplots by gene

This is a gene of interest plot, intended to plot a reasonable small number of genes, a separate plot for each gene with a box for each treatment group. It takes a tidy data frame as input and a tidyIntensity function is supplied to simplify creation of the input file.

The plot is faceted by default. Setting facet = FALSE produces individual plots for each gene that are returned as a list of plots.

Code Block: Observation Plot Example

  #get log2cpm data
  log2cpm <- convertCounts(dgeObj$counts, unit="cpm", log=TRUE, normalize="tmm")
  #filter for a smallish set of genes
  idx <- stringr::str_detect(dgeObj$geneData$GeneName, "^PPAR")
  log2cpm <- log2cpm[idx,]
  #swap gene symbols for Ensembl IDs
  rownames(log2cpm) <- dgeObj[idx,]$geneData$GeneName

  #put in tidy format
  tidyInt <- tidyIntensity(log2cpm,
                           rowidColname="GeneID",
                           keyColname="Sample",
                           valueColname="Log2CPM",
                           group=dgeObj$design$ReplicateGroup)

  #Facetted boxplot
  obsPlot2(tidyInt, plotByCol="GeneID",
           groupCol = "group",
           valueCol ="Log2CPM",
           pointJitter = 0.1,
           facetRow = 2)

  #Facetted violin plot
  obsPlot2(tidyInt, plotByCol = "GeneID",
           violinLayer = TRUE,
           boxLayer = FALSE,
           groupCol="group",
           valueCol = "Log2CPM",
           pointJitter = 0.1,
           facetRow = 2)

  #return a list of individual plots
  myplots <- obsPlot2(tidyInt, plotByCol="GeneID",
                      groupCol = "group",
                      valueCol ="Log2CPM",
                      pointJitter = 0.1,
                      facet = FALSE)
  #plot the first one
  printAndSave(myplots[[1]], file.path(outputPath, "obsPlot1.png"))

See ?obsPlot2 options to modify various attributes (color, shape, tranparency, reference lines, etc)

\newpage

logRatioPlot

Plotting logRatio data with 95% confidence limits is generally preferable to plotting intensity data with the Observation Plot. This is because the ratio data is post-modeling and thus reflects adjustments for any "nusiance" factors included in the modeling (e.g. batch effect, demographic or sex effects).

  #Put contrasts in tidy format keeping logFC, and confidence limits data
  tidyDat <-tidyContrasts(dgeObj, rownameColumn="EnsgID", includeColumns = c("logFC", "CI.R", "CI.L"))

  #add gene symbols from geneData
  ens2genesym <- dgeObj$geneData %>%
                 rownames_to_column(var="EnsgID") %>%
                 select(EnsgID, GeneSymbol=GeneName)
  tidyDat <-  left_join(tidyDat, ens2genesym) 


  #filter for a small set of genes of interest
  idx <- stringr::str_detect(tidyDat$GeneSymbol, "^PPAR")
  tidyDat <- tidyDat[idx,]  

  #simple barplot
  myLogRatioPlot <- logRatioPlot(tidyDat,
                                 facetColname = "GeneSymbol",
                                 xColname = "Contrast", 
                                 title = "LogRatio Plots: PPAR genes",
                                 facetCol = 2,
                                 barWidth = 0.6
  )

  printAndSave(myLogRatioPlot, file.path(outputPath, "logRatioPlot.png"))

The companion function, tidyContrasts, aids in reformatting topTable contrast data for use with function logRatioPlot. tidyContrasts will accept a DGEobj as input and will find all topTable data in the DGEobj. tidyContrasts will also work with a list of topTable dataframes which provides flexibility to mix and match topTable data from multiple projects for plotting.

See ?logratio options to modify various attributes (color, shape, tranparency, reference lines, etc)

\newpage

checkSex Plot

The checkSex function uses expression intensity data from the X and Y chromosomes to infer sex. When compared to actuall sex annotation, this can help to identify sample labeling problems and swaps.

The XIST gene us used as an X-linked reporter gene that is well expressed in most tissues. Since most Y-linked genes are expressed predominantly in testes, we scan for and use the highest expressed Y-linked gene for the analysis.

You can provide the name of the annotated sex column in the design table and it will be used to color code the points.

  sexPlot <- checkSex(dgeObj, species="human", sexCol="Sex")
  printAndSave(sexPlot, file.path(outputPath, "sexPlot.png"))

Note in this example each subject was tested twice. There are two points from the same individual that are annotated male and plot as a female and vice versa. Also, there are two points from one individual that plot in a position indicative of an XXY genotype.

\newpage

Dispersion Plot (plotDisp function)

Dispersion is a measure of variance that is plotted against intensity as a QC plot that illustrates heteroscedasticity.

# Note a DGEobj can have multiple design matrices or DGELists;  in this case
# only one of each is present
designMatrix <- getType(dgeObj, "designMatrix")[[1]]
dgelist <- getType(dgeObj, "DGEList")[[1]]
#dispersion plot
dispersionPlot <- plotDisp(dgelist, designMatrix, lineFit="loess", plotType="dispersion")
printAndSave(dispersionPlot, file.path(outputPath, "dispersionPlot.png"))

#BCV (biological coefficient of variation) plot
BCV_Plot <- plotDisp(dgelist, designMatrix, lineFit="loess", plotType="BCV")
printAndSave(BCV_Plot, file.path(outputPath, "BCV_Plot.png"))

\newpage

Plot Normalization (plotNorm function)

This function plots distributions before and after normalization as a two panel facetted plot. It uses edgeR::calcNormFactors and thus supports any normalization accepted by calcNormFactors.

Input can be a DGEobj or a raw counts matrix.
Output can be a box plot or density distribution.

boxNorm <- plotNorm(dgeObj, plotType="box", normalize="tmm")
printAndSave(boxNorm, file.path(outputPath, "boxNorm.png"))

densityNorm <- plotNorm(dgeObj, plotType="density", normalize="tmm")
printAndSave(densityNorm, file.path(outputPath, "densityNorm.png"))

\newpage

Alignment QCplots (Function QCplots)

This function is intended to work on Omicsoft-style alignment QC data. QC parameters are named in the first column and subsequent columns hold data for each sample.

Since Omicsoft alignment QC includes hundreds of metrics, you typically want to limit the plots to a subset of the available metrics.

  #Get some data from an Omicsoft project in S3
  s3mount <- "Y:" #where you have mounted the S3 bucket "bmsrd-ngs-arrayserver"
  s3path <- "/OmicsoftHome/output/P-20171107-0002/UCSD_Lung_Fibroblasts_P-20171107-0002_8Feb2018/ExportedViewsAndTables"
  qcfilename <- "RNA-Seq.QCMetrics.Table.txt" #standard name for QC file in Omicosoft projects


  qcdat <- readr::read_delim(file.path(s3mount, s3path, qcfilename), delim="\t")
  #shorten the column names
  colnames(qcdat) <- str_replace(colnames(qcdat), "_UCSD_Lung_Fibroblasts_P-20171107-0002_8Feb2018", "")

  # qcdat <- qcdat[complete.cases(qcdat),]

  #pick some Omicsoft Metrics from column 1 of the data frame
  someFavMetrics <- c("Alignment_MappedRate", "Source_rRNA", "Profile_ExonRate",
                              "Profile_InterGene_FPK")

  MyQCplots <- QCplots(qcdat, metricNames=someFavMetrics) #all defaults
  # plots <- QCplots(qcdat, metricNames=someFavMetrics, plotType="bar", xAngle=90)
  #draw the first plot
  printAndSave(MyQCplots[[1]], file.path(outputPath, "QCplot.png"))

You need to download a QC data file from the S3 bucket bms-ngs-arrayserver or, preferrably, you can read data directly from S3 using software to mount an S3 bucket to a local location.

To map an S3 bucket to local storage:
For Mac: see s3fs
For PC: see Cloudberry Drive

\newpage

Pvalue Histogram: Evaluate Quality of Fit

Code Block: Pvalue Histogram Facet Plot Example

topTableList <- getType(dgeObj, "topTable")

#Get all the pvalue columns from the contrasts 
MyPval <- extractCol(topTableList, "P.Value")

pvalPlot <- plotPvalHist(MyPval)
printAndSave(pvalPlot, file.path(outputPath, "pvalHist.png"))

See ?plotPvalHist for optional arguments

\newpage

ggplotMDS

Multidimentional scaling as implemented by limma::plotMDS is PCA-like but uses a distance metric instead of a correlation metric. As such, when you plot log2cpm values, the axes represent log2cpm units and you can interpret the distance between points accordingly.

ggplotMDS is a ggplot wrapper around limma::plotMDS

dgeList <- getItem(dgeObj, "DGEList")
result <- ggplotMDS(dgeList, colorBy = dgeObj$design$Treatment, 
                    shapeBy = dgeObj$design$DiseaseStatus, labels = NULL)
printAndSave(result[[1]], file.path(outputPath, "MDSplot.png"))

\newpage

MDS_var_explained

Takes the mds object from the ggplotMDS result list (item 2) and returns a list with:
a plot of the amount of variance explained by each component
a cumulative plot of the variance explained
* the underlying data table

MDSvarResult <- MDS_var_explained(result[["mdsobj"]])

printAndSave(MDSvarResult$varexp, file.path(outputPath, "varExplained.png"))
printAndSave(MDSvarResult$cumvar, file.path(outputPath, "cumVarExplained.png"))           

\newpage

Session Info

Time required to process this report: r format(Sys.time() - startTime)

R Session Info

sessionInfo()


jrthompson54/DGE.Tools2 documentation built on May 12, 2021, 8:47 p.m.