knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(gac)

Getting Started

This data is composed of 82 simulated cell copy number profiles with deletions, losses, gains and focal amplifications along 5000 bins. The dataset contains example binary, categorical, and quantitative traits, and basic QC data.

The primary container to handle the complexity of single-cell DNA data is a CNR. The cnr is composed of a genotype (integer copy number) matrix, X, phenotype matrix Y, expression matrix Ye , a QC or metadata matrix qc, the chromInfo with chromosome-end coordinates, and a gene.index to map bins to specific genes. These are user provided tables. See documentation for buildCNR to learn how to build a CNR bundle with your data. The CNR is modeled after the expressionSet and AnnData objects from R/Bioconductor and Python, respectively. It is a lightweight and versitile framework to operate and manipulate large single-cell DNA data. The framework facilitates the use matrix operations by concurrently managing genotype and phenotype data. At the heart of the CNR bundle is the genotype X matrix wich hold copy number as ploidy corrected numeric estimations or integer states, with bins as rows, and cells as columns. Bins can be the direct output from Ginkgo/Varbin algorithm, or the a set of common segments from a .seg file in matrix form. The bin information is located on the chromInfo table, and the gene-to-bin mappings on the gene.index, which helps link the output of the segmentation to a biological context. An example on how to construt these is provided in the section Preparing your data. Cell/sample phenotypes are stored in the Y matrix, these are sample specific and biological in nature for which we wish to gain biological insights.

In a not so distant future, the posibility to acquire single-cell DNA and RNA from the same cell will become feasable at scale. To this end, a Ye or exprs matrix is included, to enable genetic analysis gene expression. If you have data of this type, it's best to have the last/most-accurate, curated, and complete data possible for the analyses to make sense. Eg. scaled, log2-transformed, MAGIC Imputed, without doublets.

Lastly, a qc matrix containg metadata about each cell is required, For example the number of alinged reads, duplication percentage, and a bin-to-bin dispersion metric such as median absolute deviation (MAD) or median absolute pairwise deviation (MAPD) are easily obtainable. Technical metrics of cell FASTQ and BAM file can be summarized with MultiQC and used to identify and filter problematic cells.

Figure1{ width=80% }

Figure 1. Schematic representation of the CNR bundle and how the different components are related.

Preparing your data

Data should be in rectancular form, preferably as a data.frame or matrix. The example data included in the package was generated as follows.

## number of cells to simulate
n.cells <- 82

## set pattern
X <- as.data.frame(replicate(n = n.cells, expr = round(seq(0, 5, by = 5/4999), digits = 4)))
names(X) <- paste("cell", 0:(n.cells -1), sep = "")

## set focal
X[, 1] <- 2
X[801:840, 2:10] <- 12
X[800:843, 11:24] <- 18
X[1810:1860, 24:40] <- 22
X[3801:3881, 40:82] <- 80

## adding noisy cells
set.seed(2020)
noisy.cells <- sort(round(runif(floor(n.cells*0.08), min = 1, max = n.cells)))
for(i in noisy.cells) {
    X[,i] <- X[,i] + rnorm(5000)
}
Y <- data.frame(matrix(NA, nrow = n.cells, ncol = 9))
names(Y) <- c("cellID", "binary1", "binary2",
              "category1", "category2",
              "quantitative1", "quantitative2",
              "random1", "random2")
Y$cellID <- names(X)

Y$binary1 <- rep(c(0, 1), c(floor(n.cells * 0.2), ## 20% of the data
                            ceiling(n.cells * 0.8))) ## 80% of the data
Y$binary2 <- rep(c(1, 0), each = floor(n.cells/2)) ## 50:50 + NA if n.cells is an odd number

Y$category1 <- rep(LETTERS[1:3], round(c(0.33, 0.50, 0.17) * n.cells))
Y$category2 <- rep(LETTERS[26:24], round(c(0.17, 0.50, 0.33) * n.cells))

Y$quantitative1 <- c(rnorm(n = floor(n.cells * 0.2), mean = 2),
                     rnorm(n = ceiling(n.cells * 0.8), mean = 8))
Y$quantitative2 <- rev(seq(0, 20, by = 20/n.cells)[-1])
Y$quantitative2[noisy.cells] <- Y$quantitative2[noisy.cells] - abs(rnorm(1, 3))
Y$quantitative2[noisy.cells] <- Y$quantitative2[noisy.cells] + abs(rnorm(1, 3))

Y$random1 <- abs(rnorm(n.cells))
Y$random2 <- abs(rnorm(n.cells, mean = 200, sd = 300))

This example was constructed using the following : * ReadsKept are simulated based on the recomended sequencing depth in Baslan et al. 2015 * MedianBinCouts assume uniform distribution of ReadsKept across all bins * DNA concentrations are simulated based on expected yields from WGA methods * Cells 5, and 11 set as FAIL

qc <- data.frame(matrix(NA, nrow = n.cells, ncol = 6))
names(qc) <- c("cellID", "ReadsKept", "MedianBinCount",
               "dna.ng.ul", "sort.gate", "qc.status")

qc$cellID <- names(X)
qc$ReadsKept <- 2000000 + rnorm(n = n.cells, sd = 700000)
qc$MedianBinCount <- round(qc$ReadsKept/5000)
qc$dna.ng.ul <- 200 + rnorm(n = n.cells, sd = 70)
qc$sort.gate <- rep(c("p6", "p7"), c(floor(n.cells * 0.2),
                                     ceiling(n.cells * 0.8)))
qc$qc.status <- "PASS"
## match noisy cells to QC
qc$ReadsKept[c(noisy.cells)] <- abs(200000 + rnorm(n = 2, sd = 700000))
qc$qc.status[c(noisy.cells)] <- "FAIL"
suppressPackageStartupMessages({
    library(GenomicRanges)
    library(biomaRt)
    library(dplyr)
})

## load example bin data
data(chromInfo)

## get ensembl genes w/HGNC symbol
grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                  dataset = "hsapiens_gene_ensembl",
                  host="https://grch37.ensembl.org",
                  path="/biomart/martservice")

grch37.genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
                                     "chromosome_name", "start_position",
                                     "end_position", "strand", "band",
                                     "gene_biotype"),
                      mart = grch37) %>%
    filter(chromosome_name %in% c(1:22, "X", "Y")) %>%
    arrange(chromosome_name, start_position)

## set up a genomic ranges object for the gene index annotation
( grch37.genes <- GRanges(seqnames = grch37.genes$chromosome_name,
                        ranges = IRanges(start = grch37.genes$start_position,
                                         end = grch37.genes$end_position,
                                         names = grch37.genes$ensembl_gene_id),
                        ensembl_gene_id = grch37.genes$ensembl_gene_id,
                        hgnc.symbol = grch37.genes$hgnc_symbol,
                        gene_biotype= grch37.genes$gene_biotype) )


## set up a genomic ranges object for the genome bins
( gr.5k <- GRanges(seqnames = chromInfo$bin.chrom,
                 ranges = IRanges(start = chromInfo$bin.start,
                                  end = chromInfo$bin.end,
                                  names = 1:nrow(chromInfo))) )

## find and print the overlaps
( gr.5k.grch37.genes <- findOverlaps(gr.5k, grch37.genes,
                                     ignore.strand = TRUE) )

tail( gr.5k.grch37.genes <- as.data.frame(gr.5k.grch37.genes)[!duplicated(gr.5k.grch37.genes@to),]  )

any(duplicated(gr.5k.grch37.genes$subjectHits))

## assign binID to each gene
grch37.genes.5k <- grch37.genes[gr.5k.grch37.genes$subjectHits]
grch37.genes.5k$bin.id <- gr.5k.grch37.genes$queryHits
grch37.genes.5k

## coerce to data.frame and keep unique genes with HGNC symbols
grch37.genes.5k <- data.frame(grch37.genes.5k) %>%
    filter(hgnc.symbol != "", !duplicated(hgnc.symbol)) %>%
    mutate(chrom = factor(seqnames, unique(seqnames))) %>%
    dplyr::select(hgnc.symbol, chrom, start, end, width, strand, ensembl_gene_id,
           gene_biotype, bin.id)

grch37.genes.5k$chrom <- factor(grch37.genes.5k$chrom, c(1:22, "X", "Y"))

cnr <- buildCNR(X = X, Y= Y, qc = qc, chromInfo = chromInfo, gene.index = grch37.genes.5k)

## example for changing loss threshold in round CNR
cnr2 <- buildCNR(X = X, Y= Y, qc = qc, chromInfo = chromInfo, gene.index = grch37.genes.5k, loss = 1.5) 

summary_cnr(cnr)

View what data is present in the cnr

make sure numbers of cells match in all objects

data(cnr)
summary_cnr(cnr)

GAC makes use of ComplexHeatmap. ComplexHeatmap is one of the most powerful visualization tools available for R. To use the richness of ComplexHeatmap, the HeatmapCNR function attempts to minimize the total number of presets. Please visit the ComplexHeatmap documentation to take advantage of its potential.

The HeatmapCNR function will create a viable plot if you are in a time constraint.

## a color scale to visualize the integer copy number is provided
## visualize genome-wide
HeatmapCNR(cnr, use_raster = TRUE)

## visualize genes of interest
HeatmapCNR(cnr, what = "genes", which.genes = c("CDK4", "MDM2", "TP53"), use_raster = TRUE)

Working with a CNR

GAC is equiped with a couple of management functions that are prefixed with add*, exclude* and keep*.

On a more frequent basis you will want to remove cells that have noisy profiles.

For example to exclude cells that you have been deemed as "FAIL" b/c of low read counts you can do:

## remove cells
( excl.cells <- cnr$qc$cellID[cnr$qc$qc.status == "FAIL"] )
cnrE <- excludeCells(cnr, excl = excl.cells)
sapply(cnrE, dim)

HeatmapCNR(cnrE, use_raster = TRUE)

Conversly, if you have multiple samples or have multiple conditions and wish to use only part of the data, you can use keepCells

e.g.

## keep cells
## offset by 1 b/c cell names are 0 based
keep.cells <- setdiff(colnames(cnr$X), paste0("cell",noisy.cells-1))

cnrK <- keepCells(cnr, keep = keep.cells)
sapply(cnrK, dim)

HeatmapCNR(cnrK, use_raster = TRUE)

Lastly, if you perform a new experiment, and need to add additional cells use the addCells

## make new cells
newX <- data.frame(cbind(rep(c(5,2), c(3000, 2000)),
                         rep(c(2,4),  c(3000, 2000))))
names(newX) <- paste0("cell", c(n.cells+1, n.cells+2))
head(newX)

tail(newX)

## creating new phenotypes
newY <- head(cnr$Y, n = 2)
newY$cellID <- paste0("cell", c(n.cells+1, n.cells+2))
rownames(newY) <- newY$cellID

newY[, c(6:9)] <- newY[,c(6,9)]+3
newY

## creating new QC
newQC <- head(cnr$qc, 2)
newQC$cellID <- paste0("cell", c(n.cells+1, n.cells+2))
rownames(newQC) <- newQC$cellID
newQC$qc.status <- "FAIL"

newQC[,2:4] <- newQC[,2:4] + 2
newQC

## add cells
cnrA <- addCells(cnr, newX = newX, newY = newY, newqc = newQC)
sapply(cnrA, dim)
HeatmapCNR(cnrA, use_raster = TRUE)

If you run an algorithm to estimate genomic complexity or another parameter, you can add these to the Y matrix using addPheno

## addPheno
rand3 <- data.frame(cellID = cnr$Y$cellID, rand3 = rnorm(nrow(cnr$Y), mean = 2, sd = 1))

cnr <- addPheno(cnr, df = rand3)
sapply(cnr, dim)

head(cnr$Y)

Similarly, if you run a qc metric such as Median Absolute Pairwise Deviance (MAPD) which is used to estimate how noisy the profiles are. This estimate should be computed on the non-segmented bin estimates (this is not the right input for GAC as it requires you provided it segmented results). See documentation for Ginkgo/Varbin, MUMdex(http://mumdex.com), HMMcopy(https://bioconductor.org/packages/release/bioc/html/HMMcopy.html), or SCOPE(https://github.com/rujinwang/SCOPE) for details on how to perform the upstream analysis.

To add QC metrix use the addQC

## add QC
mapdX <- data.frame(t(apply(cnr$X, 2, mapd)))
mapdX <- data.frame(cellID = rownames(mapdX), mapdX)

cnr <- addQC(cnr, df = mapdX)

head(cnr$qc)

Lastly, the goal of this tool is to facilitate the genetic, and statistical analysis of single-cell copy number data. Most analysis will be carried out at the bin level. Even if you aim to work with a single gene, the bin is the smallest independent unit for segmented data. To add, for example, p-values to see which bins are linked to your phenotype the helper function is addInfo.

## addInfo
fakePval <- data.frame(fakePval = runif(5000))
cnr <- addInfo(cnr, df = fakePval)
head(cnr$chromInfo)

fake.Gene.pval <- data.frame(hgnc.symbol = unique(cnr$gene.index$hgnc.symbol), fPval = runif(nrow(cnr$gene.index)))
cnr <- addGeneInfo(cnr, df = fake.Gene.pval)
head(cnr$gene.index)

Clonal decomposition of cell populations based on copy number

Two methods are implemented: Heriarchical clustering, and Consensus Clustering

Heirarchical Clustering

For this method we estimate Bray-Curtis Dissimilarty across all pairwise cells. Heirarchical clustering is then performed on the complete matrix with phylo_cnr. To set the tree height and asign cluster membership, we iteratively test various heights, and count the number of one-cell clusters, and multi-cell clusters. By default setBrayClusters will set the height is set to the intersection between one-cell and multi-cell clusters. The package ape contains various functions to visualize heirarchical trees, a tree object class phylo is created to faciliate this. Bray-Dissimilarity membership is saved to the phenotype table Y as BrayC, and pairwise distances saved in cbd.

cnr <- phylo_cnr(cnr, root.cell = "cell0")
cnr <- setBrayClusters(cnr)

table(cnr$Y$BrayC)

Internally, the operations to select the intersect above are:

( mopc <- optClust(cnr, seq(0, 0.2, by = 0.005)) )

plotCL(mopc)

## select a different tree height
minimum.intersect(mopc)
sc.tree.height <- 0.06

## to overwrite the tree.height
cnr <- cnr
cnr$Y <- cnr$Y[, -c(grep("BrayC", names(cnr$Y)))]

cnr <- setBrayClusters(cnr, tree.height = sc.tree.height)

Single-cell dendrogram with tree height for cluster membership

plot(cnr[["hcdb"]])
abline(h = cnr$tree.height, col = "#D40000")
text(1, y = cnr$tree.height,
     labels = paste("tree.height =", cnr$tree.height),
     adj = c(0, 0))

Other tree types from package ape using the phylo class

cnr[["phylo"]]
plot(cnr[["phylo"]])

Consensus clustering

In addition, we implemented ConsensusClusterPlus using the function run_consensus_clustering directly on the CNR. Optimum number of clusters is based on the maximum number of stable clusters (Meyer et al. 2013). Stable Cluster membership is saved to the phenotype table Y as ConsensusC. K parameter (kCC) and number of stable clusters (sK) are saved into cnr[["optK"]]. The table of kCC, and kS is saved as cnr[["kStats"]].

cnr <- run_consensus_clustering(cnr, maxK = 20)

cnr <- doKSpectral(cnr)

cnr$kStats

cnr$optK

cnr <- setKcc(cnr)

plot_ccp(cnr)

table(cnr$Y$ConsensusC)
HeatmapCNR(cnr,
           column_split = cnr$Y$ConsensusC,
           column_gap = unit(0, "mm"),
           top_annotation =
               HeatmapAnnotation(df = cnr$Y[, c("BrayC", "ConsensusC")],
                                 simple_anno_size = unit(3, "mm")),
           border = TRUE,
           use_raster = TRUE)

Cluster heterogeneity

Sample heterogeneity based on clustering can be estimated using cluster_heterogeneity. The function returns a table with the same name containing the counts of each cluster, frequency, and other stats. Furthermore, it returns a vector in the phenotype table Y named final_cluster. There is no default cluster_column to use, one must choose either BrayC, ConsensusC, or if clustering was done outside GAC, these clusters can be imported to Y to be used here.

cnr <- cluster_heterogeneity(cnr, cluster_column = "ConsensusC")

cnr$cluster_heterogeneity

Cluster Summary Profiles

The function get_cluster_profiles is used to summarize the genetic profile of each cluster. Cluster summary profiles are saved in cnr[["DDRC.df"]].

segCol <- circlize::colorRamp2(breaks = c(0:5, 10, 20, 40, 80),
                     colors = c("#F2D200", "#318CE7", "#FFFFFF",
                                "#FFA89F", "#FF523F", "#D40000",
                                "#7F7F7F", "#636363", "#515151",
                                "#303030"))

cnr <- get_cluster_profiles(cnr)

chrAnnoLeft <- create_chromosome_annotation(cnr, side = "left")

Heatmap(cnr$DDRC.df, col = segCol, cluster_rows = FALSE, cluster_columns = FALSE,
        column_labels = paste(names(cnr$uclust), "(n =", cnr$uclust, ")" ),
        left_annotation = chrAnnoLeft)

Phylogenetic analysis of clones

The function phylo_ddrc was created to perform a phylogenetic analysis of clones. First a pairwise Bray-Curtis disimilarity, then the phylogeny is constructed using Neighbourhood Joining, finally the tree is rooted if a root.clone is provided.

cnr <- phylo_ddrc(cnr, root.clone = "X2")

plot(cnr$DDRC.phylo)

VDJ genotyping

To asses weather T-cells or B-cells are present in the cell populations, the function genotype_vdj will classify cells containing VDJ deletions into T-cells, B-cells, or vdj.unspecified. A visualization function is also provided.

cnr <- genotype_vdj(cnr)
vdjHeatmap(cnr, use_raster = TRUE)

Example of a complete analysis pipeline

library(tidyverse)

data(cnr)
noisy.cells <- cnr$qc$cellID[cnr$qc$qc.status == "FAIL"]
cnr <- cnr %>%
    excludeCells(excl = noisy.cells) %>%
    phylo_cnr(root.cell = "cell0") %>%
    setBrayClusters() %>%
    run_consensus_clustering(iters = 20, maxK = 40,
                             verbose = FALSE) %>%
    doKSpectral() %>%
    setKcc() %>%
    cluster_heterogeneity(by = "category1",
                          cluster_column = "ConsensusC") %>%
    get_cluster_profiles() %>%
    genotype_vdj() %>%
    phylo_ddrc(root.clone = "X2") %>%
    get_alteration_frequencies() 
clone.colors <- rainbow(length(unique(cnr$Y$final_cluster)))
names(clone.colors) <- unique(cnr$Y$final_cluster)

HeatmapCNR(cnr,
           cluster_columns = FALSE,
           column_split = cnr$Y$final_cluster,
           top_annotation = HeatmapAnnotation(
               clone = cnr$Y$final_cluster,
               col = list(clone = clone.colors),
               border = TRUE),
           column_gap = unit(0, "mm"),
           border = TRUE,
           use_raster = TRUE)
layout(mat = matrix(1:2, nrow = 1))
plot(cnr$phylo, main = "cell phylogenetic tree")
plot(cnr$DDRC.phylo, main = "clone phylogenetic tree")

Analysis considerations

Session Informatioin

date()
sessionInfo()


SingerLab/gac documentation built on March 23, 2024, 5:15 a.m.