knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(crayon.enabled = FALSE)
library(CNAqc)

library(dplyr)

Input format

CNAqc notation:

CNAqc comes with a template dataset.

# Load template data
data('example_dataset_CNAqc', package = 'CNAqc')

Somatic mutations

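These fields are required for somatic mutations: chr, from, to (genomic coordinates), ref, alt (reference and alternative alleles), and DP, NV, VAF (read depth, number of variant reads, and tumour VAF).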

Chromosome names and alleles should be in character format; chromosomes must be in the format chr1, chr2, etc.

# Example input SNVs
example_dataset_CNAqc$mutations %>%
  dplyr::select(chr, from, to, # Genomic coordinates
         ref, alt,      # Alleles (reference and alternative)
         DP, NV, VAF    # Read counts (depth, number of variant reads, tumour VAF)
         ) %>%
         print()
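As a rough illustration of the format requirements above, the following base-R sketch builds a toy mutation table and checks it. All values are invented, the check is not CNAqc's own validation, and VAF is assumed here to be NV / DP.

```r
# Toy mutation table (all values are made up for illustration)
mutations <- data.frame(
  chr  = c("chr1", "chr2", "chrX"),
  from = c(1000L, 2500L, 40000L),
  to   = c(1001L, 2501L, 40001L),
  ref  = c("A", "G", "T"),
  alt  = c("C", "T", "G"),
  DP   = c(100L, 80L, 60L),   # read depth
  NV   = c(25L, 40L, 6L),     # number of variant reads
  stringsAsFactors = FALSE
)

# Assumption: VAF is the fraction of reads carrying the variant
mutations$VAF <- mutations$NV / mutations$DP

# Check that every expected column is present
required <- c("chr", "from", "to", "ref", "alt", "DP", "NV", "VAF")
missing_cols <- setdiff(required, colnames(mutations))

# Check chromosome naming (chr1, chr2, ...) and character alleles
chr_ok <- all(grepl("^chr", mutations$chr))
alleles_ok <- is.character(mutations$ref) && is.character(mutations$alt)
```

If missing_cols is empty and both flags are TRUE, the table has the shape described in this section.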

Adding driver mutations

Optionally, you can annotate driver mutations by adding the columns is_driver (a logical flag) and driver_label (a label for the driver) to your data:

example_dataset_CNAqc$mutations %>%
  dplyr::select(chr, from, to, ref, alt, is_driver, driver_label) %>%
  dplyr::filter(is_driver) %>%
  print()
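A minimal sketch of how such annotations could be added to your own table in base R. The table and the label "GENE_A" are hypothetical; only the column names is_driver and driver_label come from the example above.

```r
# Toy mutation table (hypothetical values)
mutations <- data.frame(
  chr  = c("chr1", "chr2", "chr3"),
  from = c(1000L, 2500L, 700L),
  to   = c(1001L, 2501L, 701L),
  ref  = c("A", "G", "C"),
  alt  = c("C", "T", "A"),
  stringsAsFactors = FALSE
)

# Start with no drivers annotated
mutations$is_driver <- FALSE
mutations$driver_label <- NA_character_

# Flag the second mutation as a driver; "GENE_A" is a made-up label
mutations$is_driver[2] <- TRUE
mutations$driver_label[2] <- "GENE_A"

# Keep only the annotated drivers
drivers <- mutations[mutations$is_driver, ]
```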

Copy number segments

CNAqc distinguishes between 3 types of copy number segments:

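These fields are required for all types of CNAs: chr, from, to (genomic coordinates), and Major, minor (number of copies of the major and minor allele).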

Adding subclonal copy numbers

Optionally, you can also annotate subclonal CNAs.

To do this, first annotate the Cancer Cell Fraction (CCF) of each input segment as an extra column, CCF, in the dataframe: segments with CCF = 1 are clonal, all others are subclonal.

# Example input CNA
print(
  example_dataset_CNAqc$cna %>% 
        select(
          chr, from, to, # Genomic coordinates
          Major, minor   # Number of copies of the major and minor allele (B-allele)
        )
  )

Note: the CCF of a segment can only be computed by callers that support subclonal segments. If there are no subclonal CNAs the CCF column can be omitted. In that case CNAqc assumes all segments to be clonal and assigns CCF = 1.

If you wish to use subclonal CNAs, further columns are required.

The CNAqc model captures a mixture of two subclones: one with segment Major:minor and cancer cell fraction given by the CCF column (which is compulsory), and another with segment Major_2:minor_2 and cancer cell fraction 1 - CCF.

The values of Major_2 and minor_2 for clonal segments (CCF = 1) can be NA and will not be used.
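To make the two-subclone layout concrete, here is a small base-R sketch of a CNA table with one subclonal segment. The segment values are invented; the helper columns is_clonal and CCF_2 are hypothetical and are not part of the CNAqc input format.

```r
# Toy CNA table (all values are made up for illustration)
cna <- data.frame(
  chr     = c("chr1", "chr1", "chr2"),
  from    = c(1, 5000001, 1),
  to      = c(5000000, 9000000, 8000000),
  Major   = c(1, 2, 2),
  minor   = c(1, 1, 0),
  CCF     = c(1, 0.6, 1),     # CCF of the Major:minor subclone
  Major_2 = c(NA, 1, NA),     # second subclone, NA for clonal segments
  minor_2 = c(NA, 1, NA),
  stringsAsFactors = FALSE
)

# Segments with CCF = 1 are clonal, all others subclonal
cna$is_clonal <- cna$CCF == 1

# The second subclone has cancer cell fraction 1 - CCF (subclonal segments only)
cna$CCF_2 <- ifelse(cna$is_clonal, NA, 1 - cna$CCF)
```

In this toy table the second segment is a mixture of a 2:1 subclone at CCF 0.6 and a 1:1 subclone at CCF 0.4.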

Tumour purity

Tumour purity, defined as the proportion of reads coming from tumour cells, must be a value in $[0, 1]$.

# Example purity
print(example_dataset_CNAqc$purity)

Initialisation of a new dataset

To use CNAqc, you need to initialise a cnaqc S3 object with the initialisation function init.

This function checks the input formats and maps mutations to CNA segments. It does not otherwise subset the data: it retains all and only the mutations that map on top of a CNA segment.

When you create a dataset, you must specify the reference genome assembly (see below).

# Use SNVs, CNAs and tumour purity (hg19 reference, see below)
x = init(
  mutations = example_dataset_CNAqc$mutations, 
  cna = example_dataset_CNAqc$cna,
  purity = example_dataset_CNAqc$purity,
  ref = 'hg19'
  )
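The mapping of mutations onto segments can be pictured with the following base-R sketch. It is an illustrative interval lookup on invented data, not the implementation used by init, and map_to_segment is a hypothetical helper.

```r
# Toy segments and mutations (all values are made up)
segments <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  from  = c(1, 1001, 1),
  to    = c(1000, 2000, 5000),
  Major = c(1, 2, 2),
  minor = c(1, 1, 0),
  stringsAsFactors = FALSE
)
mutations <- data.frame(
  chr  = c("chr1", "chr1", "chr2", "chr3"),
  from = c(500, 1500, 100, 10),
  to   = c(501, 1501, 101, 11),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first segment that fully contains
# the mutation, or NA if no segment does
map_to_segment <- function(chr, from, to, segments) {
  hit <- which(segments$chr == chr & segments$from <= from & segments$to >= to)
  if (length(hit) == 0) return(NA_integer_)
  hit[1]
}

mutations$segment_id <- mapply(
  map_to_segment,
  mutations$chr, mutations$from, mutations$to,
  MoreArgs = list(segments = segments),
  USE.NAMES = FALSE
)

# Only mutations that fall on a segment are retained
mapped <- mutations[!is.na(mutations$segment_id), ]
```

Here the chr3 mutation has no matching segment, so it is dropped from mapped.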

Printing x gives a summary with useful information about the dataset.

print(x)

Subsetting data

You can randomly subsample the data; if drivers are annotated, they can be forced to stay in.

y_5000 = subsample(x, N = 5000, keep_drivers = TRUE)

# 5000 randomly sampled entries, plus any annotated drivers that were kept
print(y_5000)

You can also subset the data by segment karyotype, or by the total copy number of the segment.

Neither subset function keeps drivers that map outside the selected segments.

# Triploid and copy-neutral LOH segments 
y_tripl_cnloh = subset_by_segment_karyotype(x, karyotypes = c('2:1', '2:0'))

print(y_tripl_cnloh)

# Two and four copies
y_2_4 = subset_by_segment_totalcn(x, totalcn = c(2, 4))

print(y_2_4)
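The two selection criteria can be illustrated on a plain table of segments. This base-R sketch uses invented copy numbers and is not how the CNAqc functions are implemented; it only shows what a "Major:minor" karyotype string and a total copy number are.

```r
# Toy segments (made-up copy numbers)
segments <- data.frame(
  Major = c(1, 2, 2, 2, 3),
  minor = c(1, 1, 0, 2, 1)
)

# Karyotype as a "Major:minor" string, total copy number as their sum
segments$karyotype <- paste(segments$Major, segments$minor, sep = ":")
segments$total_cn  <- segments$Major + segments$minor

# Triploid (2:1) and copy-neutral LOH (2:0) segments
by_karyotype <- segments[segments$karyotype %in% c("2:1", "2:0"), ]

# Segments with two or four copies in total
by_totalcn <- segments[segments$total_cn %in% c(2, 4), ]
```

Note that the two criteria differ: a total copy number of 2 matches both 1:1 and 2:0 segments, whereas a karyotype selects one specific allelic configuration.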

Reference genome coordinates

CNAqc uses a genome coordinate reference system to convert relative coordinates to absolute coordinates, a step required to plot segments across the whole genome (see plot_segments). For instance, if a mutation maps to position $100$ of chromosome chr2, its absolute coordinate is $100 + L$, where $L$ is the length of chr1. The reference system adopted by CNAqc therefore needs to report the length of each chromosome, plus the boundaries of each centromere.
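The conversion described above can be sketched in a few lines of base R. The chromosome lengths below are toy numbers, not real assembly lengths, and to_absolute is a hypothetical helper rather than a CNAqc function.

```r
# Toy chromosome lengths, in genome order (NOT real hg19/GRCh38 values)
chr_lengths <- c(chr1 = 1000, chr2 = 800, chr3 = 600)

# Offset of each chromosome = total length of all chromosomes before it
offsets <- cumsum(c(0, head(chr_lengths, -1)))
names(offsets) <- names(chr_lengths)

# Hypothetical helper: relative position -> absolute position
to_absolute <- function(chr, pos) {
  unname(offsets[chr]) + pos
}

abs_chr2 <- to_absolute("chr2", 100)                 # 100 + length of chr1
abs_many <- to_absolute(c("chr1", "chr3"), c(5, 10)) # vectorised over inputs
```

With these toy lengths, position 100 on chr2 becomes 1100, matching the $100 + L$ rule with $L = 1000$.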

CNAqc supports two reference genomes, hg19 (GRCh37) and GRCh38 (hg38), for which two dataframes are stored inside the package.

CNAqc:::get_reference("hg19") # equivalent to CNAqc:::get_reference("GRCh37")

CNAqc:::get_reference("GRCh38") # equivalent to CNAqc:::get_reference("hg38")

The reference genome has to be specified when you create a CNAqc object (see function init).

Note: mapping of mutations onto segments is independent of the reference genome, and it will work as long as both mutations and CNA segments are mapped to the same reference.

You can use a hidden function to plot a reference genome:

CNAqc:::blank_genome(ref = 'hg19') + 
  ggplot2::labs(title = "HG19 genome reference")

Example CNAqc object(s)

CNAqc comes with an example object released by PCAWG:

CNAqc::example_PCAWG


caravagnalab/CNAqc documentation built on Oct. 31, 2024, 3:54 a.m.