knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The aim of this package is to provide a powerful filtering pipeline of CNV calling data in fixed loci, before the visual inspection. This to reduce the amount of human effort as much as possible, discarding putative calls that we can be sure are not TRUE using a set of quality metrics derived from the raw data.

The focus is PennCNV calls, however it can in principle work with any other software output, given the availability of the three following global quality scores: BAF drift, LRR SD, GC Waviness Factor.

Please note also that at the moment only Illumina SNP array data and autosomal chromosomes (1:22) are supported.

This is a graphical representation of a CNV calling pipeline in fixed genomic loci. The package implement a filtering step just before visual inspection.

Fig 1, CNV calling pipeline{width=100%}

Setup

To install and load the latest version of the package you can run the following code:

## NOT RUN ATM
devtools::install_github("SinomeM/QCtreeCNV")
library(QCtreeCNV)

Data Format

The main function require three main input objects:

  1. a data.frame (bonus if already data.table) containing the putative CNV carriers calls;
  2. a data.frame (bonus if already data.table) containing the required quality metrics, it can be created using provided functions;
  3. a data.table containing CNVRs info, this is computed by a package function starting from table 1.

We provide functions to create the quality metrics table starting from:

a. the raw intensity files;
b. genomic coordinates of the loci of interest;
c. a table to link samples ID to the raw intensity files.

Se the section Run the Pipeline for graphical representation of the package pipeline and functions.

Specifications on each object is given in the following sections. We try to be consistent with CNVgears format when possible, and a further interconnection between the two packages is planned.

Main objects

CNV calls

Table 1 consist of the [...]

Format

| Column Name | Format | Notes | | :---------- | :----: | :---- | | sample_ID | \<chr> | _ | | locus | \<chr> | e.g. 22q11.2 | | chr | \<int> | 1:24 | | start | \<int> | _ | | end | \<int> | _ | | numsnp | \<int> | number of SNPs contributing to the call | | GT | \<int> | 0/1/2*| | CN | \<int> | 0:5* | | CNVR_ID | \<chr> | this columns is added by the package |

*Both measure are necessary. GT can be computed with a provided function.

Details [...]

Quality metrics / Raw data trends

Table 2 consist of [...]

Format

| Column Name | Format | Notes | | :---------- | :----: | :---- | | sample_ID | \<chr> | _ | | locus | \<chr> | _ | | putCarrier | \<logi> | is the sample a putative carrier for this locus?* | | logr1 | \<num> | ** | | LRRSDlocus | \<num> | ** | | BAFc | \<num> | ** | | BAFb | \<num> | ** | | centDistProp | \<num> | ** | | overlapProp | \<num> | ** | | BAFdrift | \<num> | ** | | LRRSD | \<num> | ** | | GCWF | \<num> | ** |

*: Currently if the table is created using the provided functions one line is created for each sample/locus combination, even if the sample is a putative carrier only for one specific locus. Note however that this is not mandatory, it is for done mostly to reuse computation time in downstream efficacy measures.

**: details on each measure is given below.

Details [...]

CNVRs

Table 3 consist of [...]

Format

| Column name | Format | Notes | |-------------| :------: |-------| | CNVR_ID | \<chr> | matches the one in table 1 | | chr | \<int> | 1:24 | | start | \<int> | _ | | end | \<int> | _ | | freq | \<int> | number of samples with a CNV in the region |

Objects required to create the QC metrics table using the package

We provide functions to easily create the quality metrics table, see the section Run the Pipeline for details.

Sample list

This is required to link each sample to an intensity file, it needs to have the following format:

| Column name | Format | Notes | |-------------|:------:|---------------------------------------------------| | sample_ID | | as in table 1, 2 and 3 | | file_path | | full path (from / or ~) to the intensity file | | file_path_tabix | | full path (from / or ~) to the tabix indexed intensity file |

Loci coordinates

Genomic coordinates of the loci of interest, needed to extract only the necessary markers from the intensity files. It needs to have the following format:

| Column name | Format | Notes | |-------------| :----: |-----------------------------------------------| | locus | | as in table 1 and 2 | | chr | | 1:24 | | start | | _ | | end | | _ | | length | | if not present will be computed automatically |

Intensity files

We need LRR and BAF to compute the quality scores metrics, at the moment only data from Illumina array is supported. Moreover only "standard" Final Report format is supported at the moment, meaning that the function will look for the following columns:

  1. Chr
  2. Position
  3. Log R Ratio
  4. B Allele Freq
  5. SNP Name

Both intensity file with full genomic coordinates (columns 1 and 2) and with only SNP Name (column 5) are supported, however the snppos file is required in the second situation. It should have been already computed by PennCNV anyway.

Please note that the package expect one file for each sample.

Tabix indexing

Please refer to the Current Protocol manuscript (see citation()) and the accompanying repository at https://github.com/SinomeM/IBPcnv for instructions on how to index the intensity files.

Run the Pipeline

The package has two main functions:

  1. computeCNVRs() clusterize/collapse CNVs into CNV Regions;
  2. qctree() runs the filtering pipeline;

In principle one could also use a different method to create CNVRs, as long as the output format is compatible to the QCtreeCNV one.

We provide two additional functions to easily create the quality metrics table, however it can be computed by the user separately, as long as the resulting table follows the format requirements. Moreover the package include functions to standardise chr and GT / CN notation.

The two following images give a graphical representation of the package standard pipeline (with the specific function used in each step) and of the decision making tree implemented with qctree().

Fig 2, QCtreeCNV package full pipeline{width=100%}

Fig 3, Decision making filtering tree{width=80%}

Details

Some details on each step, the numbering follows the one in the figure.

Fig 2:

  1. df
  2. dsf
  3. sfd
  4. sdf
  5. df

Fig 3:

  1. df
  2. dsf
  3. sfd
  4. sdf
  5. df
  6. dfd

Full pipeline example pseudo-code

Three main objects are needed to run the pipeline:

  1. cnvs, PennCNV calling results;
  2. sample_list, as described earlier;
  3. loci, list of loci of interest, as described earlier;
  4. qccnvs, PennCNV main QC file, one line per sample.

Assuming the objects have the required columns we can start by checking chr is in the correct format and computing GT with the provided helper functions

NB: there will be a wrapper function to run the entire pipeline or at least big chunks of it.

cnvs <- QCtreeCNV:::chr_uniform(cnvs)
cnvs <- QCtreeCNV:::uniform_GT_CN(cnvs)
loci <- QCtreeCNV:::chr_uniform(loci)

At this point we can select the putative calls using the function select_stitch_calls(), this will also stitch close calls by default. Maximum gap in the stitching, minim number of SNPs and minim overlap with a locus can be set by the user.

put_cnvs <- select_stitch_calls(cnvs, loci)

Now we can extract the data we need from the intensity files and then create the master QC object needed by the main function. We are also ready to run the CNVR computation.

Note that these steps may be slow and can be run in parallel of further parallelized. As an example, data extraction from the intensity files can be parallelized by splitting the put_cnvs object.

extractRDS(loci, sample_list, "path/to/wkdir/tmp/")

cnvrs <- cnvrs_create(cnvs=put_cnvs, chr_arms=hg19_chr_arms)

putQC <- extractMetrics(loci, put_cnvs, qccnvs, "path/to/wkdir/tmp")

Now we have all the ingredients for the main function, qctree():

put_cnvs_qc <- qctree(put_cnvs, cnvrs[[1]], putQC, loci)

The column excl is 1 for "bad" calls and 0 for "good" ones, so to filter the bad one we can create a new object.

put_cnvs_filtered <- put_cnvs_qc[excl == 0, ]

Session

sessionInfo()


SinomeM/QCtreeCNV documentation built on Dec. 5, 2023, 4:06 p.m.