In SingerLab/gac: Genetic Analysis of Cells

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(gac)

Getting Started

This data is composed of 82 simulated cell copy number profiles with deletions, losses, gains and focal amplifications along 5000 bins. The dataset contains example binary, categorical, and quantitative traits, and basic QC data.

The primary container to handle the complexity of single-cell DNA data is a CNR. The cnr is composed of a genotype (integer copy number) matrix, X, phenotype matrix Y, expression matrix Ye , a QC or metadata matrix qc, the chromInfo with chromosome-end coordinates, and a gene.index to map bins to specific genes. These are user provided tables. See documentation for buildCNR to learn how to build a CNR bundle with your data. The CNR is modeled after the expressionSet and AnnData objects from R/Bioconductor and Python, respectively. It is a lightweight and versitile framework to operate and manipulate large single-cell DNA data. The framework facilitates the use matrix operations by concurrently managing genotype and phenotype data. At the heart of the CNR bundle is the genotype X matrix wich hold copy number as ploidy corrected numeric estimations or integer states, with bins as rows, and cells as columns. Bins can be the direct output from Ginkgo/Varbin algorithm, or the a set of common segments from a .seg file in matrix form. The bin information is located on the chromInfo table, and the gene-to-bin mappings on the gene.index, which helps link the output of the segmentation to a biological context. An example on how to construt these is provided in the section Preparing your data. Cell/sample phenotypes are stored in the Y matrix, these are sample specific and biological in nature for which we wish to gain biological insights.

In a not so distant future, the posibility to acquire single-cell DNA and RNA from the same cell will become feasable at scale. To this end, a Ye or exprs matrix is included, to enable genetic analysis gene expression. If you have data of this type, it's best to have the last/most-accurate, curated, and complete data possible for the analyses to make sense. Eg. scaled, log2-transformed, MAGIC Imputed, without doublets.

Lastly, a qc matrix containg metadata about each cell is required, For example the number of alinged reads, duplication percentage, and a bin-to-bin dispersion metric such as median absolute deviation (MAD) or median absolute pairwise deviation (MAPD) are easily obtainable. Technical metrics of cell FASTQ and BAM file can be summarized with MultiQC and used to identify and filter problematic cells.

{ width=80% }

Figure 1. Schematic representation of the CNR bundle and how the different components are related.

Preparing your data

Data should be in rectancular form, preferably as a data.frame or matrix. The example data included in the package was generated as follows.

Copy numbers are a sequence from 0 - 5 in 12 cells as our copy number data, with some added focal amplifcations in a bins, and noise in cells 5 and 11. Column names must be the cellID.

## number of cells to simulate
n.cells <- 82

## set pattern
X <- as.data.frame(replicate(n = n.cells, expr = round(seq(0, 5, by = 5/4999), digits = 4)))
names(X) <- paste("cell", 0:(n.cells -1), sep = "")

## set focal
X[, 1] <- 2
X[801:840, 2:10] <- 12
X[800:843, 11:24] <- 18
X[1810:1860, 24:40] <- 22
X[3801:3881, 40:82] <- 80

## adding noisy cells
set.seed(2020)
noisy.cells <- sort(round(runif(floor(n.cells*0.08), min = 1, max = n.cells)))
for(i in noisy.cells) {
    X[,i] <- X[,i] + rnorm(5000)
}

Phenotype in three forms, binary, categorical, and quantitive; and an additional two random variables. Must have one column named cellID, and must match those in the copy number matrix.

Y <- data.frame(matrix(NA, nrow = n.cells, ncol = 9))
names(Y) <- c("cellID", "binary1", "binary2",
              "category1", "category2",
              "quantitative1", "quantitative2",
              "random1", "random2")
Y$cellID <- names(X)

Y$binary1 <- rep(c(0, 1), c(floor(n.cells * 0.2), ## 20% of the data
                            ceiling(n.cells * 0.8))) ## 80% of the data
Y$binary2 <- rep(c(1, 0), each = floor(n.cells/2)) ## 50:50 + NA if n.cells is an odd number

Y$category1 <- rep(LETTERS[1:3], round(c(0.33, 0.50, 0.17) * n.cells))
Y$category2 <- rep(LETTERS[26:24], round(c(0.17, 0.50, 0.33) * n.cells))

Y$quantitative1 <- c(rnorm(n = floor(n.cells * 0.2), mean = 2),
                     rnorm(n = ceiling(n.cells * 0.8), mean = 8))
Y$quantitative2 <- rev(seq(0, 20, by = 20/n.cells)[-1])
Y$quantitative2[noisy.cells] <- Y$quantitative2[noisy.cells] - abs(rnorm(1, 3))
Y$quantitative2[noisy.cells] <- Y$quantitative2[noisy.cells] + abs(rnorm(1, 3))

Y$random1 <- abs(rnorm(n.cells))
Y$random2 <- abs(rnorm(n.cells, mean = 200, sd = 300))

QC metadata in this case will be limited to aligned reads (ReadsKept), median bin counts, amplified DNA concentrations, flow cytometry gate, and qc.status as PASS/FAIL. Must have one column named cellID, and must match those in the copy number matrix and phenotype data. QC metrics of cells 5 and 11 are set similar to failed cells in an experiment.

This example was constructed using the following : * ReadsKept are simulated based on the recomended sequencing depth in Baslan et al. 2015 * MedianBinCouts assume uniform distribution of ReadsKept across all bins * DNA concentrations are simulated based on expected yields from WGA methods * Cells 5, and 11 set as FAIL

qc <- data.frame(matrix(NA, nrow = n.cells, ncol = 6))
names(qc) <- c("cellID", "ReadsKept", "MedianBinCount",
               "dna.ng.ul", "sort.gate", "qc.status")

qc$cellID <- names(X)
qc$ReadsKept <- 2000000 + rnorm(n = n.cells, sd = 700000)
qc$MedianBinCount <- round(qc$ReadsKept/5000)
qc$dna.ng.ul <- 200 + rnorm(n = n.cells, sd = 70)
qc$sort.gate <- rep(c("p6", "p7"), c(floor(n.cells * 0.2),
                                     ceiling(n.cells * 0.8)))
qc$qc.status <- "PASS"
## match noisy cells to QC
qc$ReadsKept[c(noisy.cells)] <- abs(200000 + rnorm(n = 2, sd = 700000))
qc$qc.status[c(noisy.cells)] <- "FAIL"

gene.index is prepared using biomart, requires bin information to map genes Bin mappings are part of the chromInfo, and user supplied. These can be downloaded from Ginkgo or QDNAseq. Ginkgo uses variable length bins with an equal mappability, vs QDNAseq which uses equally spaced bins with variable mappability. For this example, we'll use the bins from Baslan et al. 2012 and are provided in the package.

suppressPackageStartupMessages({
    library(GenomicRanges)
    library(biomaRt)
    library(dplyr)
})

## load example bin data
data(chromInfo)

## get ensembl genes w/HGNC symbol
grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                  dataset = "hsapiens_gene_ensembl",
                  host="https://grch37.ensembl.org",
                  path="/biomart/martservice")

grch37.genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
                                     "chromosome_name", "start_position",
                                     "end_position", "strand", "band",
                                     "gene_biotype"),
                      mart = grch37) %>%
    filter(chromosome_name %in% c(1:22, "X", "Y")) %>%
    arrange(chromosome_name, start_position)

## set up a genomic ranges object for the gene index annotation
( grch37.genes <- GRanges(seqnames = grch37.genes$chromosome_name,
                        ranges = IRanges(start = grch37.genes$start_position,
                                         end = grch37.genes$end_position,
                                         names = grch37.genes$ensembl_gene_id),
                        ensembl_gene_id = grch37.genes$ensembl_gene_id,
                        hgnc.symbol = grch37.genes$hgnc_symbol,
                        gene_biotype= grch37.genes$gene_biotype) )


## set up a genomic ranges object for the genome bins
( gr.5k <- GRanges(seqnames = chromInfo$bin.chrom,
                 ranges = IRanges(start = chromInfo$bin.start,
                                  end = chromInfo$bin.end,
                                  names = 1:nrow(chromInfo))) )

## find and print the overlaps
( gr.5k.grch37.genes <- findOverlaps(gr.5k, grch37.genes,
                                     ignore.strand = TRUE) )

tail( gr.5k.grch37.genes <- as.data.frame(gr.5k.grch37.genes)[!duplicated(gr.5k.grch37.genes@to),]  )

any(duplicated(gr.5k.grch37.genes$subjectHits))

## assign binID to each gene
grch37.genes.5k <- grch37.genes[gr.5k.grch37.genes$subjectHits]
grch37.genes.5k$bin.id <- gr.5k.grch37.genes$queryHits
grch37.genes.5k

## coerce to data.frame and keep unique genes with HGNC symbols
grch37.genes.5k <- data.frame(grch37.genes.5k) %>%
    filter(hgnc.symbol != "", !duplicated(hgnc.symbol)) %>%
    mutate(chrom = factor(seqnames, unique(seqnames))) %>%
    dplyr::select(hgnc.symbol, chrom, start, end, width, strand, ensembl_gene_id,
           gene_biotype, bin.id)

grch37.genes.5k$chrom <- factor(grch37.genes.5k$chrom, c(1:22, "X", "Y"))

cnr <- buildCNR(X = X, Y= Y, qc = qc, chromInfo = chromInfo, gene.index = grch37.genes.5k)

## example for changing loss threshold in round CNR
cnr2 <- buildCNR(X = X, Y= Y, qc = qc, chromInfo = chromInfo, gene.index = grch37.genes.5k, loss = 1.5) 

summary_cnr(cnr)

View what data is present in the cnr

make sure numbers of cells match in all objects

data(cnr)
summary_cnr(cnr)

GAC makes use of ComplexHeatmap. ComplexHeatmap is one of the most powerful visualization tools available for R. To use the richness of ComplexHeatmap, the HeatmapCNR function attempts to minimize the total number of presets. Please visit the ComplexHeatmap documentation to take advantage of its potential.

The HeatmapCNR function will create a viable plot if you are in a time constraint.

## a color scale to visualize the integer copy number is provided
## visualize genome-wide
HeatmapCNR(cnr, use_raster = TRUE)

## visualize genes of interest
HeatmapCNR(cnr, what = "genes", which.genes = c("CDK4", "MDM2", "TP53"), use_raster = TRUE)

Working with a CNR

GAC is equiped with a couple of management functions that are prefixed with add*, exclude* and keep*.

On a more frequent basis you will want to remove cells that have noisy profiles.

For example to exclude cells that you have been deemed as "FAIL" b/c of low read counts you can do:

## remove cells
( excl.cells <- cnr$qc$cellID[cnr$qc$qc.status == "FAIL"] )
cnrE <- excludeCells(cnr, excl = excl.cells)
sapply(cnrE, dim)

HeatmapCNR(cnrE, use_raster = TRUE)

Conversly, if you have multiple samples or have multiple conditions and wish to use only part of the data, you can use keepCells

e.g.

## keep cells
## offset by 1 b/c cell names are 0 based
keep.cells <- setdiff(colnames(cnr$X), paste0("cell",noisy.cells-1))

cnrK <- keepCells(cnr, keep = keep.cells)
sapply(cnrK, dim)

HeatmapCNR(cnrK, use_raster = TRUE)

Lastly, if you perform a new experiment, and need to add additional cells use the addCells

## make new cells
newX <- data.frame(cbind(rep(c(5,2), c(3000, 2000)),
                         rep(c(2,4),  c(3000, 2000))))
names(newX) <- paste0("cell", c(n.cells+1, n.cells+2))
head(newX)

tail(newX)

## creating new phenotypes
newY <- head(cnr$Y, n = 2)
newY$cellID <- paste0("cell", c(n.cells+1, n.cells+2))
rownames(newY) <- newY$cellID

newY[, c(6:9)] <- newY[,c(6,9)]+3
newY

## creating new QC
newQC <- head(cnr$qc, 2)
newQC$cellID <- paste0("cell", c(n.cells+1, n.cells+2))
rownames(newQC) <- newQC$cellID
newQC$qc.status <- "FAIL"

newQC[,2:4] <- newQC[,2:4] + 2
newQC

## add cells
cnrA <- addCells(cnr, newX = newX, newY = newY, newqc = newQC)
sapply(cnrA, dim)
HeatmapCNR(cnrA, use_raster = TRUE)

If you run an algorithm to estimate genomic complexity or another parameter, you can add these to the Y matrix using addPheno

## addPheno
rand3 <- data.frame(cellID = cnr$Y$cellID, rand3 = rnorm(nrow(cnr$Y), mean = 2, sd = 1))

cnr <- addPheno(cnr, df = rand3)
sapply(cnr, dim)

head(cnr$Y)

Similarly, if you run a qc metric such as Median Absolute Pairwise Deviance (MAPD) which is used to estimate how noisy the profiles are. This estimate should be computed on the non-segmented bin estimates (this is not the right input for GAC as it requires you provided it segmented results). See documentation for Ginkgo/Varbin, MUMdex(http://mumdex.com), HMMcopy(https://bioconductor.org/packages/release/bioc/html/HMMcopy.html), or SCOPE(https://github.com/rujinwang/SCOPE) for details on how to perform the upstream analysis.

To add QC metrix use the addQC

## add QC
mapdX <- data.frame(t(apply(cnr$X, 2, mapd)))
mapdX <- data.frame(cellID = rownames(mapdX), mapdX)

cnr <- addQC(cnr, df = mapdX)

head(cnr$qc)

Lastly, the goal of this tool is to facilitate the genetic, and statistical analysis of single-cell copy number data. Most analysis will be carried out at the bin level. Even if you aim to work with a single gene, the bin is the smallest independent unit for segmented data. To add, for example, p-values to see which bins are linked to your phenotype the helper function is addInfo.

## addInfo
fakePval <- data.frame(fakePval = runif(5000))
cnr <- addInfo(cnr, df = fakePval)
head(cnr$chromInfo)

fake.Gene.pval <- data.frame(hgnc.symbol = unique(cnr$gene.index$hgnc.symbol), fPval = runif(nrow(cnr$gene.index)))
cnr <- addGeneInfo(cnr, df = fake.Gene.pval)
head(cnr$gene.index)

Clonal decomposition of cell populations based on copy number

Two methods are implemented: Heriarchical clustering, and Consensus Clustering

Heirarchical Clustering

For this method we estimate Bray-Curtis Dissimilarty across all pairwise cells. Heirarchical clustering is then performed on the complete matrix with phylo_cnr. To set the tree height and asign cluster membership, we iteratively test various heights, and count the number of one-cell clusters, and multi-cell clusters. By default setBrayClusters will set the height is set to the intersection between one-cell and multi-cell clusters. The package ape contains various functions to visualize heirarchical trees, a tree object class phylo is created to faciliate this. Bray-Dissimilarity membership is saved to the phenotype table Y as BrayC, and pairwise distances saved in cbd.

cnr <- phylo_cnr(cnr, root.cell = "cell0")
cnr <- setBrayClusters(cnr)

table(cnr$Y$BrayC)

Internally, the operations to select the intersect above are:

( mopc <- optClust(cnr, seq(0, 0.2, by = 0.005)) )

plotCL(mopc)

## select a different tree height
minimum.intersect(mopc)
sc.tree.height <- 0.06

## to overwrite the tree.height
cnr <- cnr
cnr$Y <- cnr$Y[, -c(grep("BrayC", names(cnr$Y)))]

cnr <- setBrayClusters(cnr, tree.height = sc.tree.height)

Single-cell dendrogram with tree height for cluster membership

plot(cnr[["hcdb"]])
abline(h = cnr$tree.height, col = "#D40000")
text(1, y = cnr$tree.height,
     labels = paste("tree.height =", cnr$tree.height),
     adj = c(0, 0))

Other tree types from package ape using the `phylo` class

cnr[["phylo"]]
plot(cnr[["phylo"]])

Consensus clustering

In addition, we implemented ConsensusClusterPlus using the function run_consensus_clustering directly on the CNR. Optimum number of clusters is based on the maximum number of stable clusters (Meyer et al. 2013). Stable Cluster membership is saved to the phenotype table Y as ConsensusC. K parameter (kCC) and number of stable clusters (sK) are saved into cnr[["optK"]]. The table of kCC, and kS is saved as cnr[["kStats"]].

cnr <- run_consensus_clustering(cnr, maxK = 20)

cnr <- doKSpectral(cnr)

cnr$kStats

cnr$optK

cnr <- setKcc(cnr)

plot_ccp(cnr)

table(cnr$Y$ConsensusC)

HeatmapCNR(cnr,
           column_split = cnr$Y$ConsensusC,
           column_gap = unit(0, "mm"),
           top_annotation =
               HeatmapAnnotation(df = cnr$Y[, c("BrayC", "ConsensusC")],
                                 simple_anno_size = unit(3, "mm")),
           border = TRUE,
           use_raster = TRUE)

Cluster heterogeneity

Sample heterogeneity based on clustering can be estimated using cluster_heterogeneity. The function returns a table with the same name containing the counts of each cluster, frequency, and other stats. Furthermore, it returns a vector in the phenotype table Y named final_cluster. There is no default cluster_column to use, one must choose either BrayC, ConsensusC, or if clustering was done outside GAC, these clusters can be imported to Y to be used here.

cnr <- cluster_heterogeneity(cnr, cluster_column = "ConsensusC")

cnr$cluster_heterogeneity

Cluster Summary Profiles

The function get_cluster_profiles is used to summarize the genetic profile of each cluster. Cluster summary profiles are saved in cnr[["DDRC.df"]].

segCol <- circlize::colorRamp2(breaks = c(0:5, 10, 20, 40, 80),
                     colors = c("#F2D200", "#318CE7", "#FFFFFF",
                                "#FFA89F", "#FF523F", "#D40000",
                                "#7F7F7F", "#636363", "#515151",
                                "#303030"))

cnr <- get_cluster_profiles(cnr)

chrAnnoLeft <- create_chromosome_annotation(cnr, side = "left")

Heatmap(cnr$DDRC.df, col = segCol, cluster_rows = FALSE, cluster_columns = FALSE,
        column_labels = paste(names(cnr$uclust), "(n =", cnr$uclust, ")" ),
        left_annotation = chrAnnoLeft)

Phylogenetic analysis of clones

The function phylo_ddrc was created to perform a phylogenetic analysis of clones. First a pairwise Bray-Curtis disimilarity, then the phylogeny is constructed using Neighbourhood Joining, finally the tree is rooted if a root.clone is provided.

cnr <- phylo_ddrc(cnr, root.clone = "X2")

plot(cnr$DDRC.phylo)

VDJ genotyping

To asses weather T-cells or B-cells are present in the cell populations, the function genotype_vdj will classify cells containing VDJ deletions into T-cells, B-cells, or vdj.unspecified. A visualization function is also provided.

cnr <- genotype_vdj(cnr)
vdjHeatmap(cnr, use_raster = TRUE)

Example of a complete analysis pipeline

library(tidyverse)

data(cnr)
noisy.cells <- cnr$qc$cellID[cnr$qc$qc.status == "FAIL"]
cnr <- cnr %>%
    excludeCells(excl = noisy.cells) %>%
    phylo_cnr(root.cell = "cell0") %>%
    setBrayClusters() %>%
    run_consensus_clustering(iters = 20, maxK = 40,
                             verbose = FALSE) %>%
    doKSpectral() %>%
    setKcc() %>%
    cluster_heterogeneity(by = "category1",
                          cluster_column = "ConsensusC") %>%
    get_cluster_profiles() %>%
    genotype_vdj() %>%
    phylo_ddrc(root.clone = "X2") %>%
    get_alteration_frequencies()

clone.colors <- rainbow(length(unique(cnr$Y$final_cluster)))
names(clone.colors) <- unique(cnr$Y$final_cluster)

HeatmapCNR(cnr,
           cluster_columns = FALSE,
           column_split = cnr$Y$final_cluster,
           top_annotation = HeatmapAnnotation(
               clone = cnr$Y$final_cluster,
               col = list(clone = clone.colors),
               border = TRUE),
           column_gap = unit(0, "mm"),
           border = TRUE,
           use_raster = TRUE)

layout(mat = matrix(1:2, nrow = 1))
plot(cnr$phylo, main = "cell phylogenetic tree")
plot(cnr$DDRC.phylo, main = "clone phylogenetic tree")

Analysis considerations

Filtering noisy cells is an iterative process. QC metrics such as ReadsAligned, MAPD, sequence quality, and other metrics are particularly useful to exclude technical artifacts. Clustering can help spot groups of problematic cells.
If the data does not contain a suitable root.cell, consider building an in-silico diploid cell, and clone to use in the analysis
We prefer the use of Consensus Clustering with Spectral anaysis over heirarchical clustering. The current set up is to select the maximum number of stable clones. As a result, some larger clones may get broken up into two or three subclones. Future versions will be focused to address this.
By default, consensus clustering is interated 200 times. This number is particularly low, and should be increased. The specific number of times will depend on the complexity of the data, and the confidence desired to separate specific groups of cells. Our current analyses are set at iter = 1000 and left to run over a weekend`
Cluster Heterogeneity can be split by a variable. E.g. sample, region information, histology, etc.

Session Informatioin

date()
sessionInfo()

SingerLab/gac documentation built on March 29, 2025, 4:38 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

SingerLab/gac
Genetic Analysis of Cells

In SingerLab/gac: Genetic Analysis of Cells

Getting Started

Preparing your data

View what data is present in the cnr

make sure numbers of cells match in all objects

Working with a CNR

Clonal decomposition of cell populations based on copy number

Heirarchical Clustering

Single-cell dendrogram with tree height for cluster membership

Other tree types from package ape using the `phylo` class

Consensus clustering

Cluster heterogeneity

Cluster Summary Profiles

Phylogenetic analysis of clones

VDJ genotyping

Example of a complete analysis pipeline

Analysis considerations

Session Informatioin

R Package Documentation

Browse R Packages

We want your feedback!

SingerLab/gac Genetic Analysis of Cells

In SingerLab/gac: Genetic Analysis of Cells

Getting Started

Preparing your data

View what data is present in the cnr

make sure numbers of cells match in all objects

Working with a CNR

Clonal decomposition of cell populations based on copy number

Heirarchical Clustering

Single-cell dendrogram with tree height for cluster membership

Other tree types from package ape using the phylo class

Consensus clustering

Cluster heterogeneity

Cluster Summary Profiles

Phylogenetic analysis of clones

VDJ genotyping

Example of a complete analysis pipeline

Analysis considerations

Session Informatioin

R Package Documentation

Browse R Packages

We want your feedback!

SingerLab/gac
Genetic Analysis of Cells

Other tree types from package ape using the `phylo` class