The iasva package can be used to detect hidden heterogenity within bulk or single cell sequencing data. To illustrate how to use the iasva package for heterogenity detection, we use real-world single cell RNA sequencing (scRNA-Seq) data obtained from human pancreatic islet samples (Lawlor et. al., 2016). This dataset is included in a R data package ("iasvaExamples") containing data examples for IA-SVA (https://github.com/dleelab/iasvaExamples). To install the package, follow the instruction provided in the GitHub page.
#devtools library(devtools) #iasva devtools::install_github("UcarLab/iasva") #iasvaExamples devtools::install_github("dleelab/iasvaExamples")
rm(list=ls()) library(irlba) # partial SVD, the augmented implicitly restarted Lanczos bidiagonalization algorithm library(iasva) library(iasvaExamples) library(sva) library(SCnorm) library(Rtsne) library(pheatmap) library(corrplot) library(DescTools) #pcc i.e., Pearson's contingency coefficient library(RColorBrewer) library(SummarizedExperiment) color.vec <- brewer.pal(9, "Set1")[c(1,9)] # Normalization. normalize <- function(counts) { normfactor <- colSums(counts) return(t(t(counts)/normfactor)*median(normfactor)) }
data("Lawlor_Islet_scRNAseq_Read_Counts") data("Lawlor_Islet_scRNAseq_Annotations") ls() counts <- Lawlor_Islet_scRNAseq_Read_Counts anns <- Lawlor_Islet_scRNAseq_Annotations dim(anns) dim(counts) summary(anns) ContCoef(table(anns$Gender, anns$Cell_Type)) ContCoef(table(anns$Phenotype, anns$Cell_Type)) ContCoef(table(anns$Race, anns$Cell_Type)) ContCoef(table(anns$Patient_ID, anns$Cell_Type)) ContCoef(table(anns$Batch, anns$Cell_Type))
The annotations describing the islet samples and experimental settings are stored in "anns" and the read counts information is stored in "counts".
To illustrate how IA-SVA can be used to detect hidden heterogeneity within a homogenous cell population (i.e., alpha cells), we use read counts of alpha cells from healthy (non-diabetic) subjects (n = 101).
# Selected T2D patients/GCG cell type counts <- counts[, (anns$Phenotype!="Non-Diabetic")&(anns$Cell_Type=="GCG")] anns <- subset(anns, (Phenotype!="Non-Diabetic")&(Cell_Type=="GCG")) dim(counts) dim(anns) anns <- droplevels(anns) prop.zeros <- sum(counts==0)/length(counts) prop.zeros # filter out genes that are sparsely and lowly expressed filter = apply(counts, 1, function(x) length(x[x>5])>=3) counts = counts[filter,] dim(counts) prop.zeros <- sum(counts==0)/length(counts) prop.zeros
## count-depth relationship for all genes Conditions = rep(c(1), each=101) countDeptEst <- plotCountDepth(Data = counts, Conditions = Conditions, FilterCellProportion = .1, NCores=8) DataNorm <- SCnorm(Data = counts, Conditions = Conditions, PrintProgressPlots = FALSE, FilterCellNum = 10, NCores=8) counts <- SingleCellExperiment::normcounts(DataNorm) summary(colSums(counts))
It is well known that the number of detected genes in each cell explains a very large portion of variability in scRNA-Seq data (Hicks et. al. 2015 BioRxiv, McDavid et. al. 2016 Nature Biotechnology). Frequently, the first principal component of log-transformed scRNA-Seq read counts is highly correlated with the number of detected genes (e.g., r > 0.9). Here, we calculate the number of detected genes for islet cells, which will be used as an known factor in the IA-SVA analyses.
Num_Detected_Genes <- colSums(counts>0) Geo_Lib <- colSums(log(counts+1)) summary(Num_Detected_Genes) summary(Geo_Lib) barplot(Num_Detected_Genes, xlab="Cell", las=2, ylab = "Number of detected genes") lcounts <- log(counts + 1) # PC1 and Geometric library size correlation pc1 = irlba(lcounts - rowMeans(lcounts), 1)$v[,1] ## partial SVD cor(Num_Detected_Genes, pc1) cor(Geo_Lib, pc1)
Here, we run IA-SVA using Patient_ID and Geo_Lib_Size as known factors and identify five hidden factors. SVs are plotted in a pairwise fashion to uncover which SVs can seperate cell types.
set.seed(4543535) Patient_ID <- anns$Patient_ID mod <- model.matrix(~Patient_ID+Geo_Lib) summ_exp <- SummarizedExperiment(assays = counts) iasva.res<- iasva(summ_exp, mod[,-1],verbose=FALSE, permute=FALSE, num.sv=5) ##irlba iasva.sv <- iasva.res$sv plot(iasva.sv[,1], iasva.sv[,2], xlab="SV1", ylab="SV2") Cluster <- as.factor(iasva.sv[,2] < 0.1) levels(Cluster)=c("Cell1","Cell2") table(Cluster) # We identified 6 outlier cells based on SV2 that are marked in red pairs(iasva.sv, main="IA-SVA", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster], oma=c(4,4,6,12)) #4,4,6,12 legend("right", levels(Cluster), fill=color.vec, bty="n") plot(iasva.sv[,1:2], main="IA-SVA", pch=21, xlab="SV1", ylab="SV2", col=color.vec[Cluster], bg=color.vec[Cluster]) cor(Num_Detected_Genes, iasva.sv[,2]) cor(Geo_Lib, iasva.sv[,2]) corrplot(cor(iasva.sv))
As shown in the above figure, SV2 clearly separates alpha cells into two groups: 6 outlier cells (marked in red) and the rest of the alpha cells (marked in green). SV3 and SV4 also capture outlier cells. However, we will focus on SV2 in the rest of the analyses.
Here, using the find_markers() function we find marker genes (n=105 genes) that are significantly associated with SV2 (multiple testing adjusted p-value < 0.05, default significance cutoff, and R-squared value > 0.3, default R-squared cutoff).
# try different R2 thresholds pdf("output/Clustering_analyses_figure2.pdf") r2.results <- study_R2(summ_exp, iasva.sv, selected.svs=2, no.clusters=2) dev.off() marker.counts <- find_markers(summ_exp, as.matrix(iasva.sv[,2]), rsq.cutoff = 0.6) marker.counts.long <- find_markers(summ_exp, as.matrix(iasva.sv[,2]), rsq.cutoff = 0.3) nrow(marker.counts) rownames(marker.counts) nrow(marker.counts.long) anno.col <- data.frame(Cluster=Cluster, SV2=iasva.sv[,2]) rownames(anno.col) <- colnames(marker.counts) head(anno.col) cluster.col <- color.vec[1:2] names(cluster.col) <- as.vector(levels(Cluster)) anno.colors <- list(Cluster=cluster.col) anno.colors pheatmap(log(marker.counts+1), show_colnames =FALSE, clustering_method = "ward.D2",cutree_cols = 2,annotation_col = anno.col, annotation_colors = anno.colors)
For comparison purposes, we applied tSNE on read counts of all genes to identify the hidden heterogeneity. We used the Rtsne R package with default settings.
set.seed(323542534) tsne.res <- Rtsne(t(lcounts), dims = 2) plot(tsne.res$Y, main="tSNE", xlab="tSNE Dim1", ylab="tSNE Dim2", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster], oma=c(4,4,6,12)) legend("bottomright", levels(Cluster), border="white", fill=color.vec, bty="n")
As shown above, tSNE fails to detect the outlier cells that are identified by IA-SVA when all genes are used. Same color coding is used as above.
Here, we use PCA to detect the hidden heterogeneity (SV2) detected by IA-SVA.
set.seed(345233) pca.res = irlba(lcounts - rowMeans(lcounts), 5)$v ## partial SVD pairs(pca.res, main="PCA", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster], oma=c(4,4,6,12)) #4,4,6,12 legend("right", levels(Cluster), border="white", fill=color.vec, bty="n") plot(pca.res[,2:3], main="PCA", xlab="PC2", ylab="PC3", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster], oma=c(4,4,6,12)) legend("bottomright", levels(Cluster), border="white", fill=color.vec, bty="n")
PC3 somewhat captures the six outlier cells, however this seperation is not as clear as the IA-SVA results.
Here, for comparison purposes we use SVA (using thre SVs) to detect the hidden heterogeneity in our example data.
mod1 <- model.matrix(~Patient_ID+Geo_Lib) mod0 <- cbind(mod1[,1]) sva.res = svaseq(counts,mod1,mod0, n.sv=5)$sv pairs(sva.res, main="SVA", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster], oma=c(4,4,6,12)) #4,4,6,12 legend("right", levels(Cluster), border="white", fill=color.vec, bty="n") plot(sva.res[,1:2], main="SVA", xlab="SV1", ylab="SV2", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster]) legend("topleft", levels(Cluster), border="white", fill=color.vec, bty="n")
SV2 is associated with the six outlier samples, however the seperation of these cells is not as clear as the IA-SVA results.
cor(Num_Detected_Genes, iasva.sv[,2]) cor(Geo_Lib, iasva.sv[,2])
pdf(file="output/Lawlor_Islets_Alpha_Doublets_Figure2_ABCD.pdf", width=5, height=6) layout(matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)) plot(iasva.sv[,1:2], main="IA-SVA", pch=21, xlab="SV1", ylab="SV2", col=color.vec[Cluster], bg=color.vec[Cluster]) legend("topright", levels(Cluster), border="white", fill=color.vec, bty="n") plot(pca.res[,2:3], main="PCA", pch=21, xlab="PC2", ylab="PC3", col=color.vec[Cluster], bg=color.vec[Cluster]) plot(sva.res[,1:2], main="USVA", xlab="SV1", ylab="SV2", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster]) plot(tsne.res$Y, main="tSNE", xlab="Dimension 1", ylab="Dimension 2", pch=21, col=color.vec[Cluster], bg=color.vec[Cluster]) dev.off()
anno.col <- data.frame(Cluster=Cluster) rownames(anno.col) <- colnames(marker.counts) head(anno.col) cluster.col <- color.vec names(cluster.col) <- as.vector(levels(Cluster)) anno.colors <- list(Cluster=cluster.col) anno.colors pheatmap(log(marker.counts+1), show_colnames =FALSE, clustering_method = "ward.D2",cutree_cols = 2,annotation_col = anno.col, annotation_colors = anno.colors, filename="output/Lawlor_Islets_Alpha_iasva_SV2Markers_rsqcutoff0.6_pheatmap_iasvaV0.95.pdf", width=6, height=5) pheatmap(log(marker.counts.long+1), show_colnames =FALSE, clustering_method = "ward.D2",cutree_cols = 2,annotation_col = anno.col, annotation_colors = anno.colors, filename="output/Lawlor_Islets_Alpha_iasva_SV2Markers_rsqcutoff0.3_pheatmap_iasvaV0.95.pdf", width=6, height=14)
write.table(as.data.frame(rownames(marker.counts)), file="output/Lawlor_Islets_Alpha_Doublets_SV2_Genes_rsqcutoff0.6.txt", quote=F, row.names=F, col.names=F, sep=" ") write.table(as.data.frame(rownames(marker.counts.long)), file="output/Lawlor_Islets_Alpha_Doublets_SV2_Genes_rsqcutoff0.3.txt", quote=F, row.names=F, col.names=F, sep=" ")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.