knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The aim of this package is to provide a powerful filtering pipeline of CNV calling data in fixed loci, before the visual inspection. This to reduce the amount of human effort as much as possible, discarding putative calls that we can be sure are not TRUE using a set of quality metrics derived from the raw data.
The focus is PennCNV calls, however it can in principle work with any other software output, given the availability of the three following global quality scores: BAF drift, LRR SD, GC Waviness Factor.
Please note also that at the moment only Illumina SNP array data and autosomal chromosomes (1:22) are supported.
This is a graphical representation of a CNV calling pipeline in fixed genomic loci. The package implement a filtering step just before visual inspection.
{width=100%}
To install and load the latest version of the package you can run the following code:
## NOT RUN ATM devtools::install_github("SinomeM/QCtreeCNV") library(QCtreeCNV)
The main function require three main input objects:
data.frame
(bonus if already data.table
)
containing the putative CNV carriers calls; data.frame
(bonus if already data.table
)
containing the required quality metrics, it can be
created using provided functions; data.table
containing CNVRs info, this is computed
by a package function starting from table 1.We provide functions to create the quality metrics table starting from:
a. the raw intensity files;
b. genomic coordinates of the loci of interest;
c. a table to link samples ID to the raw intensity files.
Se the section Run the Pipeline for graphical representation of the package pipeline and functions.
Specifications on each object is given in the following sections. We try to be consistent with CNVgears format when possible, and a further interconnection between the two packages is planned.
Table 1 consist of the [...]
| Column Name | Format | Notes | | :---------- | :----: | :---- | | sample_ID | \<chr> | _ | | locus | \<chr> | e.g. 22q11.2 | | chr | \<int> | 1:24 | | start | \<int> | _ | | end | \<int> | _ | | numsnp | \<int> | number of SNPs contributing to the call | | GT | \<int> | 0/1/2*| | CN | \<int> | 0:5* | | CNVR_ID | \<chr> | this columns is added by the package |
*Both measure are necessary. GT can be computed with a provided function.
Details [...]
Table 2 consist of [...]
| Column Name | Format | Notes | | :---------- | :----: | :---- | | sample_ID | \<chr> | _ | | locus | \<chr> | _ | | putCarrier | \<logi> | is the sample a putative carrier for this locus?* | | logr1 | \<num> | ** | | LRRSDlocus | \<num> | ** | | BAFc | \<num> | ** | | BAFb | \<num> | ** | | centDistProp | \<num> | ** | | overlapProp | \<num> | ** | | BAFdrift | \<num> | ** | | LRRSD | \<num> | ** | | GCWF | \<num> | ** |
*: Currently if the table is created using the provided functions one line is created for each sample/locus combination, even if the sample is a putative carrier only for one specific locus. Note however that this is not mandatory, it is for done mostly to reuse computation time in downstream efficacy measures.
**: details on each measure is given below.
Details [...]
Table 3 consist of [...]
| Column name | Format | Notes | |-------------| :------: |-------| | CNVR_ID | \<chr> | matches the one in table 1 | | chr | \<int> | 1:24 | | start | \<int> | _ | | end | \<int> | _ | | freq | \<int> | number of samples with a CNV in the region |
We provide functions to easily create the quality metrics table, see the section Run the Pipeline for details.
This is required to link each sample to an intensity file, it needs to have the following format:
| Column name | Format | Notes |
|-------------|:------:|---------------------------------------------------|
| sample_ID | /
or ~
) to the intensity file |
| file_path_tabix | /
or ~
) to the tabix indexed intensity file |
Genomic coordinates of the loci of interest, needed to extract only the necessary markers from the intensity files. It needs to have the following format:
| Column name | Format | Notes |
|-------------| :----: |-----------------------------------------------|
| locus |
We need LRR and BAF to compute the quality scores metrics, at the moment only data from Illumina array is supported. Moreover only "standard" Final Report format is supported at the moment, meaning that the function will look for the following columns:
Chr
Position
Log R Ratio
B Allele Freq
SNP Name
Both intensity file with full genomic coordinates (columns 1 and 2)
and with only SNP Name
(column 5) are supported, however the snppos
file is required in the second situation. It should have been already
computed by PennCNV anyway.
Please note that the package expect one file for each sample.
Please refer to the Current Protocol manuscript (see citation()
)
and the accompanying repository at https://github.com/SinomeM/IBPcnv
for instructions on how to index the intensity files.
The package has two main functions:
computeCNVRs()
clusterize/collapse CNVs into CNV Regions; qctree()
runs the filtering pipeline;In principle one could also use a different method to create
CNVRs, as long as the output format is compatible to the QCtreeCNV
one.
We provide two additional functions to easily create the
quality metrics table, however it can be computed by the user
separately, as long as the resulting table follows the format requirements.
Moreover the package include functions to standardise chr
and GT
/ CN
notation.
The two following images give a graphical representation of the package
standard pipeline (with the specific function used in each step) and of
the decision making tree implemented with qctree()
.
{width=100%}
{width=80%}
Some details on each step, the numbering follows the one in the figure.
Fig 2:
Fig 3:
Three main objects are needed to run the pipeline:
cnvs
, PennCNV calling results;sample_list
, as described earlier;loci
, list of loci of interest, as described earlier;qccnvs
, PennCNV main QC file, one line per sample.Assuming the objects have the required columns we can start by checking
chr
is in the correct format and computing GT
with the provided helper
functions
NB: there will be a wrapper function to run the entire pipeline or at least big chunks of it.
cnvs <- QCtreeCNV:::chr_uniform(cnvs) cnvs <- QCtreeCNV:::uniform_GT_CN(cnvs) loci <- QCtreeCNV:::chr_uniform(loci)
At this point we can select the putative calls using the function
select_stitch_calls()
, this will also stitch close calls by default.
Maximum gap in the stitching, minim number of SNPs and minim overlap with
a locus can be set by the user.
put_cnvs <- select_stitch_calls(cnvs, loci)
Now we can extract the data we need from the intensity files and then create the master QC object needed by the main function. We are also ready to run the CNVR computation.
Note that these steps may be slow and can be run in parallel of further parallelized.
As an example, data extraction from the intensity files can be parallelized by splitting
the put_cnvs
object.
extractRDS(loci, sample_list, "path/to/wkdir/tmp/") cnvrs <- cnvrs_create(cnvs=put_cnvs, chr_arms=hg19_chr_arms) putQC <- extractMetrics(loci, put_cnvs, qccnvs, "path/to/wkdir/tmp")
Now we have all the ingredients for the main function, qctree()
:
put_cnvs_qc <- qctree(put_cnvs, cnvrs[[1]], putQC, loci)
The column excl
is 1 for "bad" calls and 0 for "good" ones, so
to filter the bad one we can create a new object.
put_cnvs_filtered <- put_cnvs_qc[excl == 0, ]
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.