knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Rascal is an R package that contains functions for scaling relative copy number data from shallow whole genome sequencing of cancer samples to absolute values and estimating the tumour ploidy and cellularity of the samples. It also provides an interactive Shiny web application for visualizing copy number data and assessing the goodness-of-fit for various ploidies and cellularities. This document provides a brief walkthrough of the available functions.

Background

Shallow whole genome sequencing is a relatively inexpensive method for obtaining copy number profiles for tumour samples, particularly as libraries from several samples can be multiplexed in a single lane of sequencing. Packages such as QDNAseq [@scheinin2014dna] are used for summing reads that align within genomic windows or bins, typically 30kb in size, and correcting for GC-content and mappability. This results in values that are relative to the average copy number within the sample for the GC and mappability of each bin. Relative copy numbers are smoothed and segmented to determine genomic regions with gains or losses, providing insights into the genomic abnormalities of cancers.

For some research projects it is desirable to obtain absolute copy numbers. Normally this would require deeper whole genome sequencing from which allele fractions of germline SNPs can help determine the clonal architecture of a tumour sample. In the absence of such information, and noting the significant increase in cost for deeper sequencing, we can attempt to fit the relative copy number profiles to absolute copy numbers by evaluating various estimates of ploidy and cellularity.

The approach used in rascal is based on concepts introduced in the ACE [@poell2019ace] package. The mathematics underpinning this approach assume a single dominant clone. Estimating ploidy and cellularity for tumour samples exhibiting significant heterogeneity at the level of copy number may prove difficult with this method.

Loading the package

In addition to loading the rascal package, the following walkthrough makes use of some functions for working with data frames provided by the dplyr and ggplot2 packages from the tidyverse.

library(rascal)
library(dplyr)
library(ggplot2)

Import copy number data

Rascal works with segmented copy number data in tabular format stored in a data frame. It should be relatively straightforward to convert data from various copy number analysis packages into the form required with the expected columns:

Usually each row will correspond to a genomic window or bin of fixed size. The relative copy number for that bin is given in the copy_number column while the result of smoothing and segmentation is given in the segmented column. Several successive bins that are within a segment share the same segmented value.

Some copy number packages output data as log~2~ ratios. Rascal uses relative copy numbers that are have not been log~2~-transformed.

The copy number value for each bin (copy_number) is optional and only used in the plotting functions to display points for each bin on a whole genome or chromosome copy number profile. Similarly the genomic location (chromosome, start and end) is only required by these plotting functions.

The package provides an example dataset that contains copy number data for two high-grade serous ovarian cancer samples generated by the Brenton lab at Cancer Research UK Cambridge Institute.

data(copy_number)
filter(copy_number, sample == "X17222", chromosome == 1, start >= 6000000, end <= 6500000)

The relative copy numbers should be centred around a value of 1.0 and the mathematical formulation used to scale relative to absolute copy numbers based on the ploidy and cellularity relies on this being the case. However the segmented copy numbers output from QDNAseq or other copy number analysis tools may need to be re-centred by scaling by the median segmented value.

copy_number %>%
  group_by(sample) %>%
  summarize(median_segmented = median(segmented, na.rm = TRUE))
# divide copy numbers by the median segmented copy number value for each sample
copy_number <- copy_number %>%
  group_by(sample) %>%
  mutate(across(c(copy_number, segmented), ~ . / median(segmented, na.rm = TRUE))) %>%
  ungroup()
# check centering worked
copy_number %>%
  group_by(sample) %>%
  summarize(median_segmented = median(segmented, na.rm = TRUE))

Collapsing bins to segments

The data frame in which each row contains the relative copy number for each bin can be collapsed to a smaller data frame that contains a single row for each segment using the copy_number_segments function.

segments <- copy_number_segments(copy_number)
segments

This function looks for consecutive rows (bins) within a chromosome that have the same segmented copy number value (segmented variable). Bins with missing (NA) values are removed prior to collapsing so do not cause new segments to be created.

Note that rascal does not provide a function for performing segmentation. This is usually carried out as part of a standard workflow for processsing copy number data, e.g. with QDNAseq which runs the circular binary segmentation (CBS) algorithm [@olshen2004circular] implemented in the DNAcopy package.

Visualizing copy number profiles

Rascal provides functions for plotting the copy number profile including a whole genome plot and a plot for a specific chromosome or chromosomal region.

Copy number data for bins and segments are provided separately to these functions. Both data frames must contain chromosome, start, end and copy_number columns. Typically these data frames will contain copy number values for multiple samples and will also contain a sample column.

We can create a copy number plot for a specified sample within the data set as follows.

genome_copy_number_plot(copy_number, segments, sample = "X17222")

We can also create a plot that shows the copy number data for a specific chromosome.

chromosome_copy_number_plot(copy_number, segments, sample = "X17222", chromosome = 3)

The start and end coordinates on the chromosome can be specified to focus on a smaller region.

chromosome_copy_number_plot(copy_number, segments, sample = "X17222",
                            chromosome = 3, start = 100000000, end = 120000000)

The following shows how to display log~2~ ratios instead of relative copy number values.

log2_ratio <- mutate(copy_number, copy_number = log2(copy_number))
log2_ratio_segments <- mutate(segments, copy_number = log2(copy_number))

plot <- genome_copy_number_plot(
  log2_ratio, log2_ratio_segments, sample = "X17222",
  min_copy_number = -2, max_copy_number = 3,
  xlabel = NULL, ylabel = expression(log[2]~ratio))

plot + theme(
  axis.text.x = element_text(size = 6),
  axis.text.y = element_text(size = 8),
  axis.title.y = element_text(size = 10)
)

Plotting functions in rascal use ggplot2 and create ggplot objects that can be modified, e.g. changing font sizes by altering elements of the theme as shown in the above example.

Similarly, log2 ratios can be passed to the chromosome_copy_number_plot function.

chromosome_copy_number_plot(log2_ratio, log2_ratio_segments,
                            sample = "X17222", chromosome = 3,
                            xlabel = NULL, ylabel = expression(log[2]~ratio))

Filtering for a specific sample

Although it is convenient to collect copy number data for many samples in a single file, particularly when importing data into the rascal Shiny app, most of the R functions rascal provides require a filtered set of values for a single sample. The plotting functions above are the exception to this, performing filtering based on the sample argument if provided.

We'll focus on the X17222 sample in what follows.

copy_number <- filter(copy_number, sample == "X17222")
# also filter the segments data frame or alternative re-run copy_number_segments
# segments <- filter(segments, sample == "X17222")
segments <- copy_number_segments(copy_number)

Scaling relative copy number data to absolute copy numbers

Relative copy number values can be scaled to absolute copy numbers for a given ploidy and cellularity using the following simple formulation where $r_i$ is the relative copy number at locus $i$ and $a_i$ is the absolute copy number at that locus. The ploidy, $p$, is the average number of copies across the genome for the tumour sample, e.g. a triploid tumour will have an average of 3 DNA copies per cell, $p = 3$. The cellularity, $c$, is the proportion of cells in the sample that are tumour cells.

$$ r_i = \frac{a_ic + 2(1-c)}{pc + 2(1-c)} $$

which rearranges to

$$ a_i = p + (r_i - 1)(p + \frac{2}{c} - 2) $$

This assumes that the tumour is homogeneous in terms of copy number, i.e. that all clones within an otherwise heterogeneous tumour sample have the same copy number across the genome. In practice this is not the case but the fitting and scaling process is not adversely affected when only small portions of the genome exhibit copy number heterogeneity.

We can determine how well the copy number data fit a given ploidy and cellularity by using the above equation to scale the copy numbers to absolute values and seeing how closely these match whole number steps.

ploidy <- 3
cellularity <- 0.6
absolute_segments <- mutate(segments,
    copy_number = relative_to_absolute_copy_number(copy_number, ploidy, cellularity))
select(absolute_segments, chromosome, start, end, copy_number)

We can do the same for the copy number values for each bin.

absolute_copy_number <- mutate(copy_number,
                               across(c(copy_number, segmented),
                               relative_to_absolute_copy_number, ploidy, cellularity))
filter(absolute_copy_number, chromosome == 1, start >= 6000000, end <= 6500000)

The plot for chromosome 3 shows that this choice of ploidy and cellularity results in segmented copy numbers that line up fairly closely to whole number steps but also suggests that a better fit could be obtain with different ploidy and cellularity valeus.

chromosome_copy_number_plot(absolute_copy_number, absolute_segments,
                            chromosome = 3,
                            copy_number_breaks = 0:8,
                            ylabel = "absolute copy number")

Note that we didn't need to specify the sample argument since the data were already filtered for the sample of interest.

Similarly, we can plot the scaled copy number values across the whole genome.

genome_copy_number_plot(absolute_copy_number, absolute_segments,
                        min_copy_number = 0, max_copy_number = 7,
                        copy_number_breaks = 0:7,
                        point_colour = "grey40",
                        ylabel = "absolute copy number")

A density plot can also be used to assess visually the goodness of fit.

copy_number_density_plot(absolute_copy_number$segmented,
                         min_copy_number = 0, max_copy_number = 6,
                         xlabel = "scaled/absolute copy number")

Alternatively the density plot can be displayed for the relative copy numbers with lines for whole number absolute copy number steps placed at the corresponding relative copy numbers.

# first create a table with each of the absolute copy number steps
copy_number_steps <- tibble(absolute_copy_number = 1:5)
copy_number_steps <- mutate(copy_number_steps, copy_number = absolute_to_relative_copy_number(absolute_copy_number, ploidy, cellularity))
copy_number_steps
copy_number_density_plot(copy_number$segmented, copy_number_steps,
                         min_copy_number = 0.3, max_copy_number = 1.7)

The relative copy number of 1 corresponds to the ploidy of the tumour sample. In this case a ploidy of 3 doesn't quite match up with the closest peak suggesting that the actual ploidy is slightly below 3 with the 5 main peaks corresponding to absolute copy numbers 1, 2, 3, 4, and 5.

The spacing between adjacent maxima is fairly consistent for the four or five main peaks which provides reassurance that these data fit the basic mathematical framework. Samples with lower cellularity, i.e. less pure and more contaminated with normal cells, display an average of the tumour copy number profile with a normal diploid profile (single peak at relative copy number 1 corresponding to 2 DNA copies) and will have more tightly spaced peaks.

A potential strategy for determining the ploidy and cellularity that best fit the data would be to use the average spacing between copy number maxima on the relative scale and to assign the absolute copy number for the peak at relative copy number 1 based on some prior knowledge of the tumour ploidy. It would then be straightforward to solve the equation given above for the cellularity of the sample. Rascal provides the copy_number_maxima function to obtain these maxima and here we show the difference between successive maxima for this example.

copy_number_maxima(copy_number$segmented, lower_threshold = 0.1) %>%
  mutate(difference = copy_number - lag(copy_number))

Determining optimal fits

The rascal Shiny app uses the same approach as ACE and other copy number tools to find the optimal ploidy and cellularity that best fit the copy number data. It explores a range of combinations of ploidy and cellularity values and computes a goodness-of-fit or distance function.

For a given ploidy and cellularity, rascal scales the relative copy numbers to absolute values and calculates the difference for each value to the nearest whole number. This is done for all bins or equivalently for the collapsed segment values weighted by the sizes of the bins. The absolute_copy_number_distance function can compute either a mean absolute deviation (MAD) or root mean square deviation (RMSD).

absolute_copy_number_distance(ploidy, cellularity,
                              segments$copy_number, segments$weight,
                              distance = "MAD")

Rascal also provides a helper function to calculate the goodness-of-fit for a grid of ploidies and cellularities.

distances <- absolute_copy_number_distance_grid(
    segments$copy_number, segments$weight,
    min_ploidy = 1.5, max_ploidy = 5.5, ploidy_step = 0.01,
    min_cellularity = 0.1, max_cellularity = 1.0, cellularity_step = 0.01,
    distance_function = "MAD")
distances

These can be displayed as a heatmap to show local minima. The function absolute_copy_number_distance_heatmap takes relative copy numbers and computes the distance grid and generates the heatmap in one step.

absolute_copy_number_distance_heatmap(
  segments$copy_number,
  segments$weight,
  distance_function = "MAD"
)

The find_best_fit_solutions function performs a grid-based search and identifies the local minima in the distance function that are considered the most likely solutions.

find_best_fit_solutions(
    segments$copy_number, segments$weight,
    min_ploidy = 1.5, max_ploidy = 5.5, ploidy_step = 0.01,
    min_cellularity = 0.1, max_cellularity = 1.0, cellularity_step = 0.01,
    distance_function = "MAD"
)

find_best_fit_solutions filters sub-optimal and unlikely solutions and has various parameters that can help tune these filters. For example, in the heat map there appears to be a feasible solution at a ploidy of just under 2 and a cellularity of approximately 0.45. This solution is filtered out because it would result in the lowest copy number segments being assigned an absolute copy number of 0. One of the filters removes solutions where the proportion of the genome with zero copies is above a specified level. The default value of the max_proportion_zero is 0.05 but the Shiny app applies a more lenient setting, allowing for 10% of the genome to be completely deleted.

solutions <- find_best_fit_solutions(
    segments$copy_number, segments$weight,
    min_ploidy = 1.5, max_ploidy = 5.5, ploidy_step = 0.01,
    min_cellularity = 0.1, max_cellularity = 1.0, cellularity_step = 0.01,
    max_proportion_zero = 0.1,
    distance_function = "MAD"
)
solutions

The density plot for the additional low-cellularity, near-diploid solution confirms that a substantial proportion of the genome would have zero copies.

ploidy <- 1.87
cellularity <- 0.45
copy_number_steps <- tibble(absolute_copy_number = 0:4)
copy_number_steps <- mutate(copy_number_steps, copy_number = absolute_to_relative_copy_number(absolute_copy_number, ploidy, cellularity))
copy_number_density_plot(copy_number$segmented, copy_number_steps,
                         min_copy_number = 0.3, max_copy_number = 1.7)

Using additional information to select the best solution

It is common for multiple solutions to fit the data equally well. The main difference between these solutions is a shift in the absolute copy number steps assigned up or down incrementally. In the X17222 example, the peak corresponding to the most common copy number state with a relative copy number of around 0.8 is assigned an absolute copy number of 1 for the ploidy 1.87 solution while it is assigned a copy number 2 for the ploidy 2.87 solution and 3 for ploidy 3.87.

A very high proportion of high-grade serous ovarian carcinomas have a driver TP53 mutation that is clonal, i.e. homozygous variant in all tumour cells. This can help distinguish between the 3 competing solutions in the case of X17222. The segmented copy number for TP53 is 0.83 corresponding to the most common copy number state.

filter(copy_number, chromosome == "17", end > 7565097, start < 7590856)

Amplicon sequencing of this sample identifies a TP53 mutation with an allele fraction of 0.58. Given the absolute copy number at locus $i$, $a_i$, and the cellularity, $c$, it is straightforward to calculate the tumour fraction at that locus that should match the observed allele fraction.

$$ tumour fraction_i= \frac{a_ic}{a_ic + 2(1-c)} $$

Rascal provides a simple helper function for computing the tumour fraction for a given absolute copy number and cellularity and we can use this to predict the TP53 allele fraction for each of the 3 competing solutions.

solutions %>%
  select(ploidy, cellularity) %>%
  mutate(tp53_absolute_copy_number = relative_to_absolute_copy_number(0.832, ploidy, cellularity)) %>%
  mutate(tp53_tumour_fraction = tumour_fraction(tp53_absolute_copy_number, cellularity))

The solution closest that gives a TP53 tumour fraction closest to the observed allele fraction is the one with ploidy 2.87 and cellularity 0.58.

ploidy <- 2.87
cellularity <- 0.58
absolute_copy_number <- mutate(copy_number,
                               across(c(copy_number, segmented),
                               relative_to_absolute_copy_number, ploidy, cellularity))
absolute_segments <- copy_number_segments(absolute_copy_number)
copy_number_steps <- tibble(absolute_copy_number = 1:5)
copy_number_steps <- mutate(copy_number_steps, copy_number = absolute_to_relative_copy_number(absolute_copy_number, ploidy, cellularity))
copy_number_density_plot(copy_number$segmented, copy_number_steps,
                         min_copy_number = 0.3, max_copy_number = 1.7)
genome_copy_number_plot(absolute_copy_number, absolute_segments,
                        min_copy_number = 0, max_copy_number = 7,
                        copy_number_breaks = 0:7,
                        point_colour = "grey40",
                        ylabel = "absolute copy number")

Batch fitting

The rascal package contains a script, fit_absolute_copy_numbers.R (located in the inst/scripts directory) and on GitHub here that finds the best fitting solutions for a large number of samples as a batch process. It can be run from the Unix command line as follows, assuming the script is in a directory on the PATH:

fit_absolute_copy_numbers.R -i copy_number_data.csv -o solutions.csv

The input file can be a comma-separated value (CSV) file, tab-delimited (TSV) file or a RDS file containing a QDNAseqCopyNumbers object. For more details about command-line options use the --help option:

fit_absolute_copy_numbers.R --help

Running the Shiny app

The plots and functions described in this workflow were all created to support an interactive web application developed using the R Shiny framework. This can be started at the R command prompt as follows:

library(rascal)
start_shiny_app()

Alternatively it can be started from the Unix command line.

Rscript -e "rascal::start_shiny_app()"

A help page with instructions on how to use the rascal Shiny app can be accessed from the top menu (select 'More' and then 'Help').

References



crukci-bioinformatics/rascal documentation built on Dec. 19, 2021, 6:21 p.m.