Package: r Rpackage("DASC")
Authors: r packageDescription("DASC")[["Author"]]
Version: r packageDescription("DASC")[["Version"]]
Compiled date: r Sys.Date()
License: r packageDescription("DASC")[["License"]]
Prerequisites: NMF, cvxclustr, Biobase, foreach, doParallel
BiocStyle::markdown()
library(knitr) opts_chunk$set(comment=NA, fig.align="center", warning=FALSE) ## Libraries require(Biobase) require(NMF) require(cvxclustr)
r Biocpkg("DASC")
is an R package distributed as part of the
Bioconductor project. To install the package, start
R and enter:
source("http://bioconductor.org/biocLite.R") biocLite("DASC")
r Biocpkg("DASC")
is used for identifying batches and classifying samples
into different batches in a high dimensional gene expression dataset. The batch
information can be further used as a covariate in conjunction with other
variables of interest among standard bioinformatics analysis like differential
expression analysis.
If you use r Biocpkg("DASC")
for your analysis, please cite our paper as here below.
@article{Yi2018Detecting, title={Detecting hidden batch factors through data-adaptive adjustment for biological effects}, author={Yi, H. and Raman, A. T. and Zhang, H. and Allen, G. I. and Liu, Z.}, journal={Bioinformatics}, volume={34}, number={7}, pages={1141}, year={2018}, }
library(DASC) data("esGolub") samples <- c(20,21,28,30) dat <- exprs(esGolub)[1:100,samples] pdat <- pData(esGolub)[samples,] ## use nrun = 50 or more for better convergence of results res <- DASC(edata = dat, pdata = pdat, factor = pdat$Cell, method = 'ama', type = 3, lambda = 1, rank = 2:3, nrun = 5, annotation='esGolub Dataset') res
The first step in using DASC
package is to properly format the data. For
example, in case of gene expression data, it should be a matrix with features
(genes, transcripts) in the rows and samples in the columns. DASC
then
requires the information for the variable of interest to model the gene
expression data effectively.Variable of interest could be a genotype or
treatment information.
Below is an example of Stanford gene expression dataset (Chen et. al. PNAS, 2015; Gilad et. al. F1000 Research, 2015). It is a filtered raw counts dataset which was published by Gilad et al. F1000 Research. 30% of genes with the lowest expression & mitochondrial genes were removed (Gilad et al.F1000 Research).
## libraries set.seed(99999) library(DESeq2) library(ggplot2) library(pcaExplorer) ## dataset rawCounts <- stanfordData$rawCounts metadata <- stanfordData$metadata
## Using a smaller dataset idx <- which(metadata$tissue %in% c("adipose", "adrenal", "sigmoid")) rawCounts <- rawCounts[,idx] metadata <- metadata[idx,]
head(rawCounts) head(metadata)
## Normalizing the dataset using DESeq2 dds <- DESeqDataSetFromMatrix(rawCounts, metadata, design = ~ species+tissue) dds <- estimateSizeFactors(dds) dat <- counts(dds, normalized = TRUE) lognormalizedCounts <- log2(dat + 1)
## PCA plot using rld.dds <- rlog(dds) pcaplot(rld.dds, intgroup=c("tissue","species"), ntop=1000, pcX=1, pcY=2)
In the PCA plot, PC1 shows the differences between the species. PC2 shows the differences between the species i.e. samples clustering based on tissues.
res <- DASC(edata = dat, pdata = metadata, factor = metadata$tissue, method = 'ama', type = 3, lambda = 1, rank = 2:3, nrun = 10, annotation = 'Stanford Dataset')
## Consensus plot annotation <- data.frame(Batch = res$`2`$class, Tissue = as.character(metadata$tissue)) consensusmap(res$`2`$consensus, annCol = annotation, main = "rank = 2")
## Batches -- dataset has 6 batches sample.clust <- data.frame(sample.name = colnames(lognormalizedCounts), clust = as.vector(res$`2`$class), batch = metadata$seqBatch) ggplot(data = sample.clust, aes(x=c(1:6), y=clust, color=factor(clust))) + geom_point(size = 4) + xlab("Sample Number") + ylab("Cluster Number")
Based on the above plots, we observe that the dataset has 2 batches. This can
further be compared with the sequencing platform or metadata$seqBatch
. The
results suggest that differences in platform led to batch effects. Batch number
can be used as another covariate, when differential expression analyses using
DESeq2
,edgeR
or limma
are performed.
## not running this part of the code for building package ## Using entire dataset rawCounts <- stanfordData$rawCounts metadata <- stanfordData$metadata dds <- DESeqDataSetFromMatrix(rawCounts, metadata, design = ~ species+tissue) dds <- estimateSizeFactors(dds) dat <- counts(dds, normalized = TRUE) lognormalizedCounts <- log2(dat + 1) ## PCA Plot rld.dds <- rlog(dds) pcaplot(rld.dds, intgroup=c("tissue","species"), ntop = nrow(rld.dds), pcX = 1, pcY = 2) ## Running DASC res <- DASC(edata = dat, pdata = metadata, factor = metadata$tissue, method = 'ama', type = 3, lambda = 1, rank = 2:10, nrun = 100, annotation = 'Stanford Dataset') ## Consensus plot annotation <- data.frame(Batch = res$`10`$class, Tissue = metadata$tissue) consensusmap(res$`10`$consensus, annCol = annotation, main = "rank = 10") ## Clustering samples based on batches sample.clust <- data.frame(sample.name = colnames(lognormalizedCounts), clust = as.vector(res$`10`$class), batch = metadata$seqBatch) ggplot(data = sample.clust, aes(x=c(1:26), y=clust, color=factor(clust))) + geom_point(size = 4) + xlab("Sample Number") + ylab("Cluster Number")
knitr::include_graphics("PCA_Plot_Fig1.png") knitr::include_graphics("Consensus_Plot_Fig2.png") knitr::include_graphics("ClusterNumber_Fig4.png")
Based on the PCA plot, we see the PC1 captures the differences based on
species. Based on consensusmap
plot and Cophenetic & dispersion values, there
are 10 batches in the dataset. Our results show that the batches are not only
due to platform but due other reason like differences in the date and the time
of library preparation and sequencing of the samples. Another important
observation, is that at rank equal to 2, we observe entire dataset to cluster
based on speicies.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.