knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(varbintools)
library(tidyverse)

Introduction

When we think about genetic variations we immediately think about individual nucleotide sequences, however, throughout the genome, we can also observe large-scale changes such as insertions, deletions, inversions, and amplifications. Those genetic anomalies are known as aneuploidy.

Take for example the HeLa karyotype, a cervical adenocarcinoma cell line:

knitr::include_graphics("images/F1.large.jpg")

Comprehensive and Definitive Molecular Cytogenetic Characterization of HeLa Cells by Spectral Karyotyping. Macville et al.

You can immediately notice how this cell diverges immensely from the karyotype expected from a normal diploid cell with 22 pairs of autosomes and one allosome pair

As you can imagine, those large-scale genetic changes have huge implications for cancer, indeed, aneuploidy is a hallmark of tumor cells.

While a lot of the scientific studies used a light microscope and fluorescent dye to measure aneuploidy, we can also obtain that data with the short-read sequencing, we will learn how that is done in this tutorial.

Single-cell copy number data

Single-cell techniques are a powerful approach to resolve the heterogeneity of samples and reconstruct their evolutionary lineages. You can get a sense of what can be done by checking the previous works from our group.

Navin et al. 2011. Tumor Evolution Inferred by Single Cell Sequencing.

Gao et al. 2016. Punctuated copy number evolution and clonal stasis in triple-negative breast cancer

Casasent et al. 2018. Multiclonal Invasion in Breast Tumors Identified by Topographic Single Cell Sequencing

Copy Number data from short-reads

Principle

Short-read sequencing can be used for an array of applications: genome assembly, single nucleotide variants detection, RNA-Seq among others. A different way to approach this data is to use it to obtain information about the copy number of a sample. The way we approach this is very simple:

We want to count the number of reads that align to a particular region of the genome

Notice that, counterintuitively, we are not interested in the individual nucleotides sequences output by the sequencer, but to the regions of the genome that the reads align.

The varbin method

Read alignment, i.e. the mapping of the short-reads output by the sequencer to their regions in a reference genome, is a computationally intensive task with several tools developed to perform this task. When aligning millions of reads, aligners have to accept a trade-off between speed and accuracy. A consequence of this trade-off are mapping errors.

Mapping errors may be a significant source of bias. In CNA analysis, our estimation of copy number gains or losses is dependent on how well can we control for factors that may affect the counts to a particular region. One of the methods developed to control for this error is the Variable Binning method.

With the Varbin method, the genome is partitioned into bins of variable sizes, the idea is that every bin will receive the same number of reads if we were to map a diploid genome to it. Therefore, to accomplish that, we can generate reads from a reference genome, map it to the genome and partition the genome into the bins of variable size.

Different resolutions can be obtained using this method and it is a trade-off for the number of cells you would like to multiplex in your sequence run. For this workshop, we will be using the varbin method where the average size of the bins were set to 200kb. We usually aim a target of 1M reads to every cell when using 200kb bins.

Alignment

After sequencing, we will align the fastq files to the human reference genome, hg19. We will accomplish that using bowtie2

For this workshop, you can find the fastq file from a sequenced single-cell in the data folder. We will be skipping this step during the workshop. Here is the syntax on how you would have performed this task.

In terminal

bowtie2 -x hg_19 -1 fastq1 -2 fastq2 -S outputname -p 6 

For every cell, you expect an output similar to this:

bowtie2 -x /volumes/seq/code/3rd_party/bowtie2-2.2.6/hg19/hg19 -U ES16-54_SKBR3P11C2P1-C12_S1932_L005_R1_001.fastq.gz -S ES16-54_SKBR3P11C2P1-C12_S1932_L005_R1_001.sam 1091782 reads; of these: 1091782 (100.00%) were unpaired; of these: 36523 (3.35%) aligned 0 times 794548 (72.78%) aligned exactly 1 time 260711 (23.88%) aligned >1 times 96.65% overall alignment rate

The output of bowtie is a SAM file. SAM files are intermediates of large size the binary format (BAM) is better suited for long term storage while still allowing us to work with the data. We can do that using SAMtools. SAMtools is a powerful tool, an excelent tutorial can be found here.

In terminal

samtools view -bS -q 1 ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.sam > ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.bam

After alignment, the reads appear in random order in the fastq files, for efficient access the BAM files need to be sorted in genome order, i.e. ordering according to the genomic chromosome coordinates. This can be done with samtools:

In terminal

samtools sort ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.bam -o ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.sorted.bam

Lastly, we will prepare the input for variable binning method. Varbin script needs a sam file as input, so we want to convert the now sorted file back to bam.

In terminal

/volumes/seq/code/3rd_party/samtools-1.2/samtools view ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.sorted.bam > ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.sorted.sam

Varbin

Now we will create the varbin file, to recap, we previously set up scaffolds where the genome is partitioned into bins of variables size where, if we were to map a diploid genome, every bin will receive roughly the same number of reads. We can use this property to evaluate the copy number of tumor cells.

We can't immediately determine the actual copy number of a cell, but we can estimate regions of copy number alterations relative to the total number of reads that of that cell

Bins with regions of the genome that are amplified will receive more reads than the average of total read counts

Bins with regions of the genome that are deleted will receive fewer reads than the average of total read counts

This version of varbin only works with python2.

In terminal

../bin/varbin.sam.2.py ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.sorted.sam ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.vb \
ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001.stat.txt \
../lib/hg19.chromInfo.cumsum.txt \
../lib/bin.boundaries.200k.bowtie.sorted.bin.txt

The varbin script takes care of another important source of bias, PCR duplicates, PCR can be a source of uneven amplification of the different molecules, if we do not remove those, they will be a confounder in our analysis and we would think that certain regions have copy number variations, when they are actually a result of this bias. We can assume that regions with the same start and end genomic coordinates position are a result of PCR duplicates and remove those from our analysis. Modern experimental copy number methods significantly reduce this bias by applying fragmentation of the DNA at the single-molecule level when compared to previous methods that relied on whole genome amplification prior to DNA fragmentation.

Lastly, we have to account for two more sources of bias before our analysis.

Some regions of the genome have highly repeated sequences that are hard for the seqeuncers to make the correct base call, as a result, those regions are incorrectly mapped to the reference genome. For the varbin method, we currently blacklist regions of the genome that present those patterns, the bins that contain those regions will be excluded from the analysis.

Another important and well-known source of bias in sequencing is the GC-content. To normalize for this effect we follow this approach:

Both fragment counts and GC counts are binned to a bin-size of choice. A curve describing the conditional mean fragment count per GC value is estimated. The resulting GC curve determines a predicted count for each bin based on the bin's GC. These predictions can be used directly to normalize the original signal, or as the rates for a heterogeneous Poisson model. extracted from: Summarizing and correcting the GC content bias in high-throughput sequencing. Benjamini & Speed

Therefore, we smooth the signal using loess normalization.

vb_files <- fs::dir_ls(system.file("extdata", "cells_fastq/vbdir", package = "varbintools"))

filterFile = system.file("extdata", "lib/200kb_with_a_chr6_region.excluded.txt", package = "varbintools")

gcInputFile = system.file("extdata", "lib/varbin.gc.content.200k.bowtie.k50.hg19.bin_not_removed.txt", package = "varbintools")

outArg = system.file("extdata", "cells_fastq", package = "varbintools")


purrr::map(vb_files, filter_vb_file, filterFile, gcInputFile, outArg)

Segmentation

CNA data is noisy

Now, for every cell, we will have a file where with the corrected bin counts, let's visualize one of those cells:

filtered_vb_files <- system.file("extdata", "cells_fastq/blvb/ES16-54_SKBR3P11C2P1-C37_S1957_L005_R1_001-bl.vb", package = "varbintools")

filtered_vb <- purrr::map(filtered_vb_files, read_tsv, col_names = FALSE) %>% 
  purrr::map(~purrr::set_names(.x, c("chrom","chrompos","abspos", "V4", "ratio", "loess.ratio", "loess.bin"))) %>%
  identity()

purrr::map(filtered_vb, varbintools::plot_bin)

CBS segmentation

We need some way to amplify this signal to characterize the discrete copy number states, this is done is with segmentation.

A popular method of segmentation is Circular Binary Segmentation (CBS).

From the help of DNAcopy package (you can access by typing ?segment in the R console):

This function implements the circular binary segmentation (CBS) algorithm of Olshen and Venkatraman (2004). Given a set of genomic data, either continuous or binary, the algorithm recursively splits chromosomes into either two or three subsegments based on a maximum t-statistic. A reference distribution, used to decided whether or not to split, is estimated by permutation. Options are given to eliminate splits when the means of adjacent segments are not sufficiently far apart. Note that after the first split the α-levels of the tests for splitting are not unconditional.

We can perform CBS in R with the help of the DNAcopy R package.

c_01 <- filtered_vb[[1]]

c_01 <- varbintools::segment_varbin(c_01)

varbintools::plot_loess_segments(c_01)

Merge Levels

After that we want to combine two segments if the p-value for wilcoxon rank sum test between the observed median between two segments is greater than pv.thres

# you can find this file at the github repository that you either download or cloned in the beginning
# source("bin/Mergelevels.R")

merge.obj <- varbintools::MergeLevels(log(c_01$loess.ratio, base=2), 
                         log2(c_01$seg.mean.loess),
                         pv.thres=0.0001)

c_01$seg.mean.ml <- merge.obj$vecMerged
varbintools::plot_merged_segments(c_01)

Now we apply this segmentation to all of our cells and our sample set is ready for analysis.

Filtering

Unfortunately, not all sequenced cells will have good quality, as a result, some cells will present poor segmentation.

For this workshop, we will work with an already segmented dataset of ~ 1000 cells of the MDA-MB-231, a breast cancer cell line.

Keep in mind that cell line datasets are usually cleaner and easier to work with than the ones we obtain from tumors. Nevertheless, the same filtering criteria can be applied.

We will work with three datasets - mdamb231_bin: Contains the GC-corrected bin counts for every cell. - mdamb231_seg: Contains the segment means for all the cells. - mdamb231_ratio: Contains the ratio values (number of reads in the bin/mean for all the cell reads) for all the cells.

Bin counts

Some cells will not receive enough reads during sequencing, filtering using the bin counts aims to remove those cells that are likely to have poor copy number profiles due to low bin counts.

The snippet below is a function which calculates the median bin counts for every cell in the dataset and filters based on the distribution of the bin counts, for this particular dataset we are filtering cells that are 2 standard deviations below the mean of all the bin counts.

Running bin count filtering

mdamb231_binf <- varbintools::bin_median_filter(mdamb231_bin,
                                   seg_data = mdamb231_seg,
                                   sd_threshold = TRUE,
                                   exploratory_plots = FALSE)

Keep in mind that if your dataset has been sequenced to a depth where every cell has received enough reads, bin filtering can be skipped. In our experience, if the median bin counts of a cell are equal or bigger than 50 those generally yield good copy number profiles.

Breakpoint filter

Cells with low-quality will be over segmented, we can use the number of breakpoints observed in the cells to filter out those profiles. We also aim to filter cells that were sequenced during DNA replication, since replicating cells would have lots of origin of replication with segments containing twice as much DNA, which could be confounded with a copy number aberration.

The principle is the same as the bin filtering, we calculate the number of breakpoints for all of the population and check the distribution. Only this time we will remove those 2 standard deviations above the mean.

We advise caution in using this filtering criteria. Can you think why?

Running breakpoint filter

mdamb231_binf_bpf <- varbintools::breakpoint_filter(mdamb231_binf,
                                       sd_threshold = TRUE,
                                       exploratory_plots = FALSE)

DBSCAN

Lastly, we apply density based clustering (DBSCAN), the idea is that any cell that does not cluster in density cluster is an outlier and therefore should be removed from the dataset. The idea of consensus clustering is widespread in single-cell sequencing and used in a variety of techniques, we tend to trust events that appear in X or more cells.

There's no general rule on how to tune the parameters for dbscan. We recommend the tune of the eps parameter by calculating a knn distance matrix and looking for a 'knee' in the plot.

db_mdamb231 <- dbscan::dbscan(x=t(mdamb231_binf_bpf), eps=14, 
                      minPts = 5, weights = NULL,
                      borderPoints = FALSE, search = "dist",
                      bucketSize = 10,
                      splitRule = "suggest", approx = 0)

mdamb231_dbf <- mdamb231_binf_bpf[which(db_mdamb231$cluster == 1)]

It is good practice to plot a heatmap and label the cells that were removed to visually assess if the filtering makes sense and no important populations were accidentaly removed.

Ratio Plots

The most common way to look at copy number data is to look at ratio plots, we've already seen they before in [segmentation][Segmentation], let's take a look at some cells from the MDA-MB-231 dataset.

We will be using the package ggplot2 for plotting, So we will start with data wrangling to put in a format that is more friendly with ggplot2.

# selecting a subset of cells to make computations faster, if you need the entire dataset remember to remove the 1:30 argument within select
ratios_data_l_gat <- tidyr::gather(data = mdamb231_ratio %>% dplyr::select(1:30, abspos), key = "cell", value = "ratios_rat", -abspos)
long_seg_t_gat <- tidyr::gather(data = mdamb231_seg %>% dplyr::select(1:30, abspos), key = "cell", value = "seg_mean", -abspos)
# joining by cell
df <- inner_join(ratios_data_l_gat, long_seg_t_gat)

It's very simple to quickly obtain a ratio plot

varbintools::ratio_plot(df, cell_id = unique(df$cell)[3])

It's good to clearly see chromosomal boundaries, this snippet will take care of several aesthetics parameters.

chrom.lengths <-
  c(
    249250621,
    243199373,
    198022430,
    191154276,
    180915260,
    171115067,
    159138663,
    146364022,
    141213431,
    135534747,
    135006516,
    133851895,
    115169878,
    107349540,
    102531392,
    90354753,
    81195210,
    78077248,
    59128983,
    63025520,
    48129895,
    51304566,
    155270560,
    59373566
  )
chrom.names <- c(paste("chr", 1:22, sep = ""),
                 "chrX",
                 "chrY")

Plotting again

varbintools::pretty_ratio_plot(df, cell_id = unique(df$cell)[3])

It's possible to observe clear discrete states of the different copy numbers in this cell, you also want to make sure that the segments are following the ratios.

Ratio plots are the best way to observe the copy number profile of a single-cell.

some_more_cells <- names(mdamb231_dbf)[4:7]

pretty_ratio_plot(df, cell_ids = some_more_cells)

Clustering

Different options can be considered when clustering copy number data, unfortunately, there's not yet a clear cut answer of which method should be applied. Hierarchical clustering has been the prevalent method used for the analysis of single-cell copy number data in the past and it is still one of the most effective and fast ways to obtain a quick clustering of your data.

Pre-processing

Here, we will present an alternative approach that takes advantage of UMAP dimension reduction as a pre-processing step before applying a clustering algorithm and has already been applied to copy number datasets such as Laks et al..

We'll follow the recommendations for UMAP by trying to maximize the identification global structure of the sample

mdamb231_seg_cells_t <- as.data.frame(t(mdamb231_dbf))

umap_df <- varbintools::varbin_umap(mdamb231_seg_cells_t)

ggplot(umap_df) +
    geom_point(aes(x = umap1,
                   y = umap2),
               show.legend = FALSE) +
    theme(legend.position = "none",
          panel.grid = element_blank(),
          axis.line = element_blank(),
          panel.background = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank())

Phenograph

Now we will use the UMAP pre-processed data as the input for Phenograph. Phenograph was originally designed for mass cytometry data, but can be nicely applied to copy number data, it calculates the Jaccard coefficient between nearest-neighbor sets and creates a graph where the communities are identified with the louvain community detection method to maximize modularity.

pheno <- Rphenograph::Rphenograph(umap_df, k = 30)
cl <- tibble(cell = rownames(umap_df),
             cluster = igraph::membership(pheno[[2]])) %>% 
  arrange(cluster)
# reordering
umap_df <- umap_df[cl$cell,]
# ordering cells
mdamb231_seg_cells_t <- mdamb231_seg_cells_t[cl$cell,]

# setting colors
n_colors <- cl$cluster %>% unique() %>% length()

row_col <- setNames(colorRampPalette(paletteer::paletteer_d(pals, 
                                alphabet, 26))(n_colors),
                    levels(as.factor(cl$cluster)))

Clustering is a largely unsolved problem in single-cell copy number data analysis, other options can be attempt and there is a need to benchmark methods to determine which method would be ideal. Other methods such as SNN, k-medoids among others.

Heatmap

Here we will apply another important tool for the visualization of copy number datasets and it has become increasingly important as we increase our capabilities of profiling thousands of cell from a single sample. In a heatmap, every row is a cell and the x-axis are the ordered genomic positions, from chr1 to chrY. The colors are representing the log2(ratios) and, in this case, red represents amplifications above the ploidy of the sample whereas blue represents regions of deletions

# chromosome annotation top bar
# getting the vector of chrom lengths
varbin_complexheatmap(mdamb231_seg, mdamb231_seg_cells_t)

We can also observe again the UMAP plot, now with the identified clusters

umap_df %>% 
  rownames_to_column(var = "Cell") %>%
  dplyr::mutate(cluster = cl$cluster) %>% 
  ggplot() +
  geom_point(aes(x = umap1, 
                 y = umap2, 
                 color = as.factor(cluster),
                 label = Cell),
             show.legend = FALSE) +
  scale_color_manual(values = row_col) +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        panel.background = element_blank(),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank()) 

Consensus heatmap

On a heatmap with so many cells it becomes difficult to visualize the unique events of the different clusters.

To make this visualization easier, we will draw a consensus profile for each of the detected clusters. We accomplish this by calculating the median of every segment for all of the cells that compose one cluster.

varbin_consensus_complexheatmap(mdamb231_seg, mdamb231_seg_cells_t)

Consensus heatmaps are a way to condense the data for easier visualization. Ideally, if our clustering is good, every row of the consensus heatmap will have distinct events.

Gene Annotations

We frequently want to check if the subclonal regions are located in important genes for cancer progression. Here we will compare those regions with the genes on the COSMIC Cancer Gene Census.

We will use the coefficient of variation between the segments of the different consensus clusters and overlap those ranges with the gene ranges for the human genome (hg19). To explain exactly what is happening in the snippet below is not within the scope of this workshop. Most of this is accomplished using the help of the awesome GenomicRanges R package.

suppressMessages(library(Homo.sapiens))

clusters_list <- split(mdamb231_seg_cells_t, cl$cluster)

# we can use apply function to calculate the median
clusters_consensus <- purrr::map_dfr(clusters_list, function(x) apply(x, 2, median))
clusters_consensus <- as.data.frame(t(clusters_consensus))

consensus_row_anno <- ComplexHeatmap::rowAnnotation(clusters = unique(sort(cl$cluster)),
                                      col = list(clusters = row_col))

# obtaining variable regions
hvr <- unname(apply(clusters_consensus,2, function(x) sd(x)/mean(x)))

rle_df <- as.data.frame(accelerometry::rle2(round(hvr,2)))

rle_df <- rle_df %>% dplyr::mutate(end_pos = cumsum(length)) %>%
  dplyr::filter(length > 10, value > 0.01) %>%
  dplyr::mutate(start_pos = (end_pos - length)+1)

#initializing columns
rle_df <- rle_df %>% 
  dplyr::mutate(seqnames = NA,
         start_p = NA,
         end_p = NA)

for (i in 1:nrow(rle_df)) {
  start_idx <- rle_df$start_pos[i]
  end_idx <- rle_df$end_pos[i]

  rle_df$seqnames[i] <- paste0("chr", mdamb231_seg$chrom[start_idx])
  rle_df$start_p[i] <- mdamb231_seg$chrompos[start_idx]
  rle_df$end_p[i] <- mdamb231_seg$chrompos[end_idx]

}

rle_df <- rle_df %>%  dplyr::group_by(seqnames) %>% 
  dplyr::mutate(start = min(start_p),
                end = max(end_p)) %>% 
  dplyr::ungroup() %>% dplyr::select(seqnames, start, end) %>% 
  dplyr::distinct(seqnames, .keep_all = TRUE)

rle_df <- as.data.frame(rle_df)

hvr_gr <- GenomicRanges::makeGRangesFromDataFrame(rle_df)

cosmic_census <- cosmic_census %>% 
  pull(Gene.Symbol) %>% 
  identity()

g <- GenomicFeatures::genes(Homo.sapiens, columns = "SYMBOL") %>% 
  plyranges::expand_ranges(SYMBOL) %>% 
  plyranges::filter(SYMBOL %in% cosmic_census)

sub_g <- suppressWarnings(IRanges::subsetByOverlaps(g, hvr_gr)) 

DT::datatable(as.data.frame(sub_g))

We can visualize the genes in our consensus heatmap.

## data frame with annotation for the varbin pipeline we are using that it'll be used on the next function

bins_in_cna_file <- system.file("extdata", "lib/bins_in_cna_pipeline_bands.txt", package = "varbintools")

bins_in_cna_pipeline <-
  read.delim(bins_in_cna_file, 
             sep = "\t",
             stringsAsFactors = FALSE, 
             header = T)

gene_ht <- varbintools::gene_anno(sub_g$SYMBOL, varbintools::bins_in_cna_pipeline) 

chr_bar <- varbintools::create_chr_bar(mdamb231_seg)

ht_consensus_anno <- ComplexHeatmap::Heatmap(as.matrix(log2(clusters_consensus+1e-3)),
              cluster_columns = FALSE,
              cluster_rows = FALSE,
              show_row_names = FALSE,
              show_column_names = FALSE,
              use_raster = TRUE,
              border = TRUE,
              col = circlize::colorRamp2(breaks = c(-1.5,-0.25,1.5), 
                                         c("dodgerblue3", "white", "firebrick3")),
              bottom_annotation = chr_bar,
              top_annotation = gene_ht,
              heatmap_legend_param = list(title = "Log2 (Ratio)"))

ComplexHeatmap::draw(consensus_row_anno + ht_consensus_anno, 
                     padding = unit(c(1, .1, 2.4, .1), "cm"))

Phylogenetics

Phylogenetics is a powerfull tool to help us determine the evolutionary relationships between the different analyzed samples.

Here, we will apply a distance-based phylogenetic method, neighbor-joining, to the MDA-MB-231 dataset as an example of how you can obtain phylogenetic information from single-cell copy number data. Commonly we apply other distance-based metrics such as minimum-evolution.

With higher throughput of single-cell analysis there is a need to develop methods specific for copy number analysis as well as optimize the runtime for thousands of cells.

Packages developed to R make it possible to obtain a phylogenetic tree in a faster and easy way

tree <- ape::nj(
  dist(mdamb231_seg_cells_t, 
             method = "manhattan")
)

# obtaining colors
list_samples <- split(rownames(mdamb231_seg_cells_t),
                      sort(cl$cluster))
tree <- ggtree::groupOTU(tree, list_samples) 

ggtree::ggtree(ape::ladderize(tree),
  ladderize = FALSE,
  size = .1) +
  ggtree::geom_tippoint(aes(color = group), size = .7) +
  scale_colour_manual(values = c("black", row_col)) +
  theme(legend.position = "right") +
  labs(color = "Cluster")


whtns/varbintools documentation built on Dec. 1, 2019, 5:15 a.m.