knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(crayon.enabled=F)
library(CNAqc) require(dplyr)
CNAqc notation:
Major:minor
denotes a CNA segments with Major
/minor
copies of the major/minor allele. We sometimes call "1:1"
the karyotype or the copy state of a segment.
Clonal denotes something (a CNA, or a mutaiton) that is found at clonality 100% or, so to say, in all tumour cells; subclonal is something that is found in a subset of the tumour cells.
CNAqc comes with a template dataset.
# Load template data data('example_dataset_CNAqc', package = 'CNAqc')
These fields are required for somatic mutations:
chr
, from
, to
. ref
and alt
;DP
(depth);NV
(number of reads with variant);VAF
, defined as NV/DP
.Chromosome names and alleles should be in character format; chromosomes must be in the format chr1
, chr2
, etc..
# Example input SNVs example_dataset_CNAqc$mutations %>% dplyr::select(chr, from, to, # Genomic coordinates ref, alt, # Alleles (reference and alternative) DP, NV, VAF # Read counts (depth, number of variant reads, tumour VAF) ) %>% print()
Optionally, you can annotate driver mutations by adding the following columns to your data:
is_driver
: whether the mutation is a driver, or not;driver_label
: the label to be shown in the plots that report also drivers (e.g., BRAF V600E
could be a label).example_dataset_CNAqc$mutations %>% dplyr::select(chr, from, to, ref, alt, is_driver, driver_label) %>% filter(is_driver) %>% print()
CNAqc distinguishes between 3 types of copy number segments:
"1:0"
loss of heterzygosity (LOH);"2:0"
copy neutral LOH;"1:1"
diploid heterozygous (assumed to be the normal reference);"2:1"
trisomy;"2:2"
tetraploidy.These fields are required for all types of CNAs:
chr
, from
and to
;Major
and minor
;Optionally, you can annotate also subclonal CNAs.
To do this first you annotate the Cancer Cell Fraction (CCF) CCF
for each input segment as an extra column in the dataframe: segments with CCF = 1
are clonal, otherwise subclonal;
# Example input CNA print( example_dataset_CNAqc$cna %>% select( chr, from, to, # Genomic coordinates Major, minor # Number of copies of major/ and minor allele (B-allele) ) )
Note: the CCF of a segment can only be computed by callers that support subclonal segments. If there are no subclonal CNAs the
CCF
column can be omitted. In that case CNAqc assumes all segments to be clonal and assignsCCF = 1
.
If you wish to use subclonal CNAs, further columns are required.
Major_2
and minor_2
reporting the major and minor alleles for the second clone.The CNAqc model captures a mixture of two subclones, one with segment Major:minor
and CCF CCF
(which is compulsory), and another with segment Major_2:minor_2
and CCF 1 - CCF
.
The values of Major_2
and minor_2
for clonal segments (CCF = 1
) can be NA
and will not be used.
Tumour purity, defined as the percentage of reads coming from tumour cells must be a value in $[0, 1]$.
# Example purity print(example_dataset_CNAqc$purity)
To use CNAqc
, you need to initialize a cnaqc
S3 object with the initialisation function init
.
This function will check input formats, and will map mutations to CNA segments. This function does not subset the data and retains all and only the mutations that map on top of a CNA segment.
When you create a dataset it is required to explicit the reference genome for the assembly (see below).
# Use SNVs, CNAs and tumour purity (hg19 reference, see below) x = init( mutations = example_dataset_CNAqc$mutations, cna = example_dataset_CNAqc$cna, purity = example_dataset_CNAqc$purity, ref = 'hg19' )
The summary of x
can be print
to provide a number of usefull information.
print(x)
You can subset randomly the data; if drivers are annotated, they can be forced to stay in.
y_5000 = subsample(x, N = 5000, keep_drivers = TRUE) # 5000 + the ranomd entries that we sampled before print(y_5000)
You can also subset data by karyotype of the segments, and by total copy number of the segment.
Both subset functions do not keep drivers that map off from the selected segments.
# Triploid and copy-neutral LOH segments y_tripl_cnloh = subset_by_segment_karyotype(x, karyotypes = c('2:1', '2:0')) print(y_tripl_cnloh) # Two and four copies y_2_4 = subset_by_segment_totalcn(x, totalcn = c(2, 4)) print(y_2_4)
CNAqc uses a genome coordinates reference system to convert relative relative to absolute coordinates, a step required to plot segments across the whole genome (see plot_segments
). For instance, if a mutation maps to position $100$ of chromosome chr2
, its absolute coordinate is $100 + L$ where $L$ is the length of chr1
. The reference system adopted by CNAqc needs therefore to report the length of each chromosome, plus the information regarding the boundary of each centromere.
CNAqc supports two coordinates reference genomes:
hg19
or GRCh37
;hg38
or GRCh38
(default),for which two dataframes are stored inside the package.
CNAqc:::get_reference("hg19") # equivalent to CNAqc:::get_reference("GRCh37") CNAqc:::get_reference("GRCh38") # equivalent to CNAqc:::get_reference("hg38")
The reference genomes has to be specified when you create a CNAqc object -- see function init
.
Note: mapping of mutations onto segments is independent of the reference genome, and it will work as far as both mutation and CNA segments are mapped to the same reference.
You can use a hidden function to plot a reference
CNAqc:::blank_genome(ref = 'hg19') + ggplot2::labs(title = "HG19 genome reference")
CNAqc comes with an object released by PCAWG
CNAqc::example_PCAWG
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.