knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Rascal is an R package that contains functions for scaling relative copy number data from shallow whole genome sequencing of cancer samples to absolute values and estimating the tumour ploidy and cellularity of the samples. It also provides an interactive Shiny web application for visualizing copy number data and assessing the goodness-of-fit for various ploidies and cellularities. This document provides a brief walkthrough of the available functions.
Shallow whole genome sequencing is a relatively inexpensive method for obtaining copy number profiles for tumour samples, particularly as libraries from several samples can be multiplexed in a single lane of sequencing. Packages such as QDNAseq [@scheinin2014dna] are used for summing reads that align within genomic windows or bins, typically 30kb in size, and correcting for GC-content and mappability. This results in values that are relative to the average copy number within the sample for the GC and mappability of each bin. Relative copy numbers are smoothed and segmented to determine genomic regions with gains or losses, providing insights into the genomic abnormalities of cancers.
For some research projects it is desirable to obtain absolute copy numbers. Normally this would require deeper whole genome sequencing from which allele fractions of germline SNPs can help determine the clonal architecture of a tumour sample. In the absence of such information, and noting the significant increase in cost for deeper sequencing, we can attempt to fit the relative copy number profiles to absolute copy numbers by evaluating various estimates of ploidy and cellularity.
The approach used in rascal is based on concepts introduced in the ACE [@poell2019ace] package. The mathematics underpinning this approach assume a single dominant clone. Estimating ploidy and cellularity for tumour samples exhibiting significant heterogeneity at the level of copy number may prove difficult with this method.
In addition to loading the rascal package, the following walkthrough makes use of some functions for working with data frames provided by the dplyr and ggplot2 packages from the tidyverse.
library(rascal) library(dplyr) library(ggplot2)
Rascal works with segmented copy number data in tabular format stored in a data frame. It should be relatively straightforward to convert data from various copy number analysis packages into the form required with the expected columns:
sample
chromosome
start
end
copy_number
segmented
Usually each row will correspond to a genomic window or bin of fixed size. The
relative copy number for that bin is given in the copy_number
column while
the result of smoothing and segmentation is given in the segmented
column.
Several successive bins that are within a segment share the same segmented
value.
Some copy number packages output data as log~2~ ratios. Rascal uses relative copy numbers that are have not been log~2~-transformed.
The copy number value for each bin (copy_number
) is optional and only used in
the plotting functions to display points for each bin on a whole genome or
chromosome copy number profile. Similarly the genomic location (chromosome
,
start
and end
) is only required by these plotting functions.
The package provides an example dataset that contains copy number data for two high-grade serous ovarian cancer samples generated by the Brenton lab at Cancer Research UK Cambridge Institute.
data(copy_number) filter(copy_number, sample == "X17222", chromosome == 1, start >= 6000000, end <= 6500000)
The relative copy numbers should be centred around a value of 1.0 and the mathematical formulation used to scale relative to absolute copy numbers based on the ploidy and cellularity relies on this being the case. However the segmented copy numbers output from QDNAseq or other copy number analysis tools may need to be re-centred by scaling by the median segmented value.
copy_number %>% group_by(sample) %>% summarize(median_segmented = median(segmented, na.rm = TRUE))
# divide copy numbers by the median segmented copy number value for each sample copy_number <- copy_number %>% group_by(sample) %>% mutate(across(c(copy_number, segmented), ~ . / median(segmented, na.rm = TRUE))) %>% ungroup() # check centering worked copy_number %>% group_by(sample) %>% summarize(median_segmented = median(segmented, na.rm = TRUE))
The data frame in which each row contains the relative copy number for each bin
can be collapsed to a smaller data frame that contains a single row for each
segment using the copy_number_segments
function.
segments <- copy_number_segments(copy_number) segments
This function looks for consecutive rows (bins) within a chromosome that have
the same segmented copy number value (segmented
variable). Bins with missing
(NA
) values are removed prior to collapsing so do not cause new segments to be
created.
Note that rascal does not provide a function for performing segmentation. This is usually carried out as part of a standard workflow for processsing copy number data, e.g. with QDNAseq which runs the circular binary segmentation (CBS) algorithm [@olshen2004circular] implemented in the DNAcopy package.
Rascal provides functions for plotting the copy number profile including a whole genome plot and a plot for a specific chromosome or chromosomal region.
Copy number data for bins and segments are provided separately to these
functions. Both data frames must contain chromosome
, start
, end
and
copy_number
columns. Typically these data frames will contain copy number
values for multiple samples and will also contain a sample
column.
We can create a copy number plot for a specified sample within the data set as follows.
genome_copy_number_plot(copy_number, segments, sample = "X17222")
We can also create a plot that shows the copy number data for a specific chromosome.
chromosome_copy_number_plot(copy_number, segments, sample = "X17222", chromosome = 3)
The start and end coordinates on the chromosome can be specified to focus on a smaller region.
chromosome_copy_number_plot(copy_number, segments, sample = "X17222", chromosome = 3, start = 100000000, end = 120000000)
The following shows how to display log~2~ ratios instead of relative copy number values.
log2_ratio <- mutate(copy_number, copy_number = log2(copy_number)) log2_ratio_segments <- mutate(segments, copy_number = log2(copy_number)) plot <- genome_copy_number_plot( log2_ratio, log2_ratio_segments, sample = "X17222", min_copy_number = -2, max_copy_number = 3, xlabel = NULL, ylabel = expression(log[2]~ratio)) plot + theme( axis.text.x = element_text(size = 6), axis.text.y = element_text(size = 8), axis.title.y = element_text(size = 10) )
Plotting functions in rascal use ggplot2 and
create ggplot
objects that can be modified, e.g. changing font sizes by
altering elements of the theme as shown in the above example.
Similarly, log2 ratios can be passed to the chromosome_copy_number_plot
function.
chromosome_copy_number_plot(log2_ratio, log2_ratio_segments, sample = "X17222", chromosome = 3, xlabel = NULL, ylabel = expression(log[2]~ratio))
Although it is convenient to collect copy number data for many samples in a
single file, particularly when importing data into the rascal Shiny app, most
of the R functions rascal provides require a filtered set of values for a
single sample. The plotting functions above are the exception to this,
performing filtering based on the sample
argument if provided.
We'll focus on the X17222 sample in what follows.
copy_number <- filter(copy_number, sample == "X17222") # also filter the segments data frame or alternative re-run copy_number_segments # segments <- filter(segments, sample == "X17222") segments <- copy_number_segments(copy_number)
Relative copy number values can be scaled to absolute copy numbers for a given ploidy and cellularity using the following simple formulation where $r_i$ is the relative copy number at locus $i$ and $a_i$ is the absolute copy number at that locus. The ploidy, $p$, is the average number of copies across the genome for the tumour sample, e.g. a triploid tumour will have an average of 3 DNA copies per cell, $p = 3$. The cellularity, $c$, is the proportion of cells in the sample that are tumour cells.
$$ r_i = \frac{a_ic + 2(1-c)}{pc + 2(1-c)} $$
which rearranges to
$$ a_i = p + (r_i - 1)(p + \frac{2}{c} - 2) $$
This assumes that the tumour is homogeneous in terms of copy number, i.e. that all clones within an otherwise heterogeneous tumour sample have the same copy number across the genome. In practice this is not the case but the fitting and scaling process is not adversely affected when only small portions of the genome exhibit copy number heterogeneity.
We can determine how well the copy number data fit a given ploidy and cellularity by using the above equation to scale the copy numbers to absolute values and seeing how closely these match whole number steps.
ploidy <- 3 cellularity <- 0.6 absolute_segments <- mutate(segments, copy_number = relative_to_absolute_copy_number(copy_number, ploidy, cellularity)) select(absolute_segments, chromosome, start, end, copy_number)
We can do the same for the copy number values for each bin.
absolute_copy_number <- mutate(copy_number, across(c(copy_number, segmented), relative_to_absolute_copy_number, ploidy, cellularity)) filter(absolute_copy_number, chromosome == 1, start >= 6000000, end <= 6500000)
The plot for chromosome 3 shows that this choice of ploidy and cellularity results in segmented copy numbers that line up fairly closely to whole number steps but also suggests that a better fit could be obtain with different ploidy and cellularity valeus.
chromosome_copy_number_plot(absolute_copy_number, absolute_segments, chromosome = 3, copy_number_breaks = 0:8, ylabel = "absolute copy number")
Note that we didn't need to specify the sample
argument since the data were
already filtered for the sample of interest.
Similarly, we can plot the scaled copy number values across the whole genome.
genome_copy_number_plot(absolute_copy_number, absolute_segments, min_copy_number = 0, max_copy_number = 7, copy_number_breaks = 0:7, point_colour = "grey40", ylabel = "absolute copy number")
A density plot can also be used to assess visually the goodness of fit.
copy_number_density_plot(absolute_copy_number$segmented, min_copy_number = 0, max_copy_number = 6, xlabel = "scaled/absolute copy number")
Alternatively the density plot can be displayed for the relative copy numbers with lines for whole number absolute copy number steps placed at the corresponding relative copy numbers.
# first create a table with each of the absolute copy number steps copy_number_steps <- tibble(absolute_copy_number = 1:5) copy_number_steps <- mutate(copy_number_steps, copy_number = absolute_to_relative_copy_number(absolute_copy_number, ploidy, cellularity)) copy_number_steps copy_number_density_plot(copy_number$segmented, copy_number_steps, min_copy_number = 0.3, max_copy_number = 1.7)
The relative copy number of 1 corresponds to the ploidy of the tumour sample. In this case a ploidy of 3 doesn't quite match up with the closest peak suggesting that the actual ploidy is slightly below 3 with the 5 main peaks corresponding to absolute copy numbers 1, 2, 3, 4, and 5.
The spacing between adjacent maxima is fairly consistent for the four or five main peaks which provides reassurance that these data fit the basic mathematical framework. Samples with lower cellularity, i.e. less pure and more contaminated with normal cells, display an average of the tumour copy number profile with a normal diploid profile (single peak at relative copy number 1 corresponding to 2 DNA copies) and will have more tightly spaced peaks.
A potential strategy for determining the ploidy and cellularity that best fit
the data would be to use the average spacing between copy number maxima on the
relative scale and to assign the absolute copy number for the peak at relative
copy number 1 based on some prior knowledge of the tumour ploidy. It would then
be straightforward to solve the equation given above for the cellularity of the
sample. Rascal provides the copy_number_maxima
function to obtain these
maxima and here we show the difference between successive maxima for this
example.
copy_number_maxima(copy_number$segmented, lower_threshold = 0.1) %>% mutate(difference = copy_number - lag(copy_number))
The rascal Shiny app uses the same approach as ACE and other copy number tools to find the optimal ploidy and cellularity that best fit the copy number data. It explores a range of combinations of ploidy and cellularity values and computes a goodness-of-fit or distance function.
For a given ploidy and cellularity, rascal scales the relative copy numbers to
absolute values and calculates the difference for each value to the nearest
whole number. This is done for all bins or equivalently for the collapsed
segment values weighted by the sizes of the bins. The
absolute_copy_number_distance
function can compute either a mean absolute
deviation (MAD) or root mean square deviation (RMSD).
absolute_copy_number_distance(ploidy, cellularity, segments$copy_number, segments$weight, distance = "MAD")
Rascal also provides a helper function to calculate the goodness-of-fit for a grid of ploidies and cellularities.
distances <- absolute_copy_number_distance_grid( segments$copy_number, segments$weight, min_ploidy = 1.5, max_ploidy = 5.5, ploidy_step = 0.01, min_cellularity = 0.1, max_cellularity = 1.0, cellularity_step = 0.01, distance_function = "MAD") distances
These can be displayed as a heatmap to show local minima. The function
absolute_copy_number_distance_heatmap
takes relative copy numbers and computes
the distance grid and generates the heatmap in one step.
absolute_copy_number_distance_heatmap( segments$copy_number, segments$weight, distance_function = "MAD" )
The find_best_fit_solutions
function performs a grid-based search and
identifies the local minima in the distance function that are considered the
most likely solutions.
find_best_fit_solutions( segments$copy_number, segments$weight, min_ploidy = 1.5, max_ploidy = 5.5, ploidy_step = 0.01, min_cellularity = 0.1, max_cellularity = 1.0, cellularity_step = 0.01, distance_function = "MAD" )
find_best_fit_solutions
filters sub-optimal and unlikely solutions and has
various parameters that can help tune these filters. For example, in the heat
map there appears to be a feasible solution at a ploidy of just under 2 and a
cellularity of approximately 0.45. This solution is filtered out because it
would result in the lowest copy number segments being assigned an absolute
copy number of 0. One of the filters removes solutions where the proportion of
the genome with zero copies is above a specified level. The default value of the
max_proportion_zero
is 0.05 but the Shiny app applies a more lenient setting,
allowing for 10% of the genome to be completely deleted.
solutions <- find_best_fit_solutions( segments$copy_number, segments$weight, min_ploidy = 1.5, max_ploidy = 5.5, ploidy_step = 0.01, min_cellularity = 0.1, max_cellularity = 1.0, cellularity_step = 0.01, max_proportion_zero = 0.1, distance_function = "MAD" ) solutions
The density plot for the additional low-cellularity, near-diploid solution confirms that a substantial proportion of the genome would have zero copies.
ploidy <- 1.87 cellularity <- 0.45 copy_number_steps <- tibble(absolute_copy_number = 0:4) copy_number_steps <- mutate(copy_number_steps, copy_number = absolute_to_relative_copy_number(absolute_copy_number, ploidy, cellularity)) copy_number_density_plot(copy_number$segmented, copy_number_steps, min_copy_number = 0.3, max_copy_number = 1.7)
It is common for multiple solutions to fit the data equally well. The main difference between these solutions is a shift in the absolute copy number steps assigned up or down incrementally. In the X17222 example, the peak corresponding to the most common copy number state with a relative copy number of around 0.8 is assigned an absolute copy number of 1 for the ploidy 1.87 solution while it is assigned a copy number 2 for the ploidy 2.87 solution and 3 for ploidy 3.87.
A very high proportion of high-grade serous ovarian carcinomas have a driver TP53 mutation that is clonal, i.e. homozygous variant in all tumour cells. This can help distinguish between the 3 competing solutions in the case of X17222. The segmented copy number for TP53 is 0.83 corresponding to the most common copy number state.
filter(copy_number, chromosome == "17", end > 7565097, start < 7590856)
Amplicon sequencing of this sample identifies a TP53 mutation with an allele fraction of 0.58. Given the absolute copy number at locus $i$, $a_i$, and the cellularity, $c$, it is straightforward to calculate the tumour fraction at that locus that should match the observed allele fraction.
$$ tumour fraction_i= \frac{a_ic}{a_ic + 2(1-c)} $$
Rascal provides a simple helper function for computing the tumour fraction for a given absolute copy number and cellularity and we can use this to predict the TP53 allele fraction for each of the 3 competing solutions.
solutions %>% select(ploidy, cellularity) %>% mutate(tp53_absolute_copy_number = relative_to_absolute_copy_number(0.832, ploidy, cellularity)) %>% mutate(tp53_tumour_fraction = tumour_fraction(tp53_absolute_copy_number, cellularity))
The solution closest that gives a TP53 tumour fraction closest to the observed allele fraction is the one with ploidy 2.87 and cellularity 0.58.
ploidy <- 2.87 cellularity <- 0.58 absolute_copy_number <- mutate(copy_number, across(c(copy_number, segmented), relative_to_absolute_copy_number, ploidy, cellularity)) absolute_segments <- copy_number_segments(absolute_copy_number)
copy_number_steps <- tibble(absolute_copy_number = 1:5) copy_number_steps <- mutate(copy_number_steps, copy_number = absolute_to_relative_copy_number(absolute_copy_number, ploidy, cellularity)) copy_number_density_plot(copy_number$segmented, copy_number_steps, min_copy_number = 0.3, max_copy_number = 1.7)
genome_copy_number_plot(absolute_copy_number, absolute_segments, min_copy_number = 0, max_copy_number = 7, copy_number_breaks = 0:7, point_colour = "grey40", ylabel = "absolute copy number")
The rascal package contains a script, fit_absolute_copy_numbers.R
(located
in the inst/scripts
directory) and on GitHub
here
that finds the best fitting solutions for a large number of samples as a batch
process. It can be run from the Unix command line as follows, assuming the
script is in a directory on the PATH:
fit_absolute_copy_numbers.R -i copy_number_data.csv -o solutions.csv
The input file can be a comma-separated value (CSV) file, tab-delimited (TSV)
file or a RDS file containing a QDNAseqCopyNumbers
object. For more details
about command-line options use the --help
option:
fit_absolute_copy_numbers.R --help
The plots and functions described in this workflow were all created to support an interactive web application developed using the R Shiny framework. This can be started at the R command prompt as follows:
library(rascal) start_shiny_app()
Alternatively it can be started from the Unix command line.
Rscript -e "rascal::start_shiny_app()"
A help page with instructions on how to use the rascal Shiny app can be accessed from the top menu (select 'More' and then 'Help').
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.