knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Initial Note #

The illustrative pipelines shown in this file uses real data from the 1000 Genomes project (https://www.internationalgenome.org/). The examples are performed on array data for chromosomes 14 and 22 only and the raw final report files provided as they are necessary for some critical steps of the pipeline. Because those files are relatively large (> 5Mb) they are stored as a separate GitHub repository (https://github.com/SinomeM/CNVgears_data). The user is kindly asked to download the repo and give the directory path at the step 5.2 in order to being able to perform all the examples.

Introduction

The CNVgears package aims to implement an interactive and extensible framework for the analysis of Copy Number Variation calling and/or segmentation results, typically from multiple pipelines or algorithms, regardless the initial raw data type. At the moment Illumina SNPs array and any type of NGS data are supported, provided that at least the raw log2R value, or any related measure is accessible for each sample. The package is particularly suited for the analysis of family based studies, however, the more vast majority of the functions are of general purpose and useful for any CNVs study. Next session illustrate the pipeline main steps and most of the package features using CNVs calling results on a subset of the 1000 Genomes cohort. We created the subset using samples that creates full trios and have both SNP array and WGS data available. However at the moment only the arrays are used here. CNVs calling was performed using a modified minimal version of EnsembleCNV.

Main Steps #

CNVgears supports multiple analysis and cohort composition (e.g. family based or not) as it is possible to perform all or only a fraction of these steps. As an example, if the cohort is not family based, if CNVs inheritance pattern is not a desired information or if the markers-level raw data is not available, the raw data load step can be easily skip. The package can be used also only to combine the results of different methods in a uniform format and export the output to be processed in other programs.

The typical main steps of such analysis using CNVgears are:

  1. Load the cohort minimal metadata in the form of PED file (mandatory);
  2. Load the markers file (i.e. genomic location of each marker, SNP in case of array, interval/bin in case of NGS data) (mandatory);
  3. Read and pre-process the CNVs calling results for all samples from the individual algorithms/pipelines, in particular adjacent calls with equal genotype (i.e. deletions/duplications) can be merged if closer than user-specified distances (mandatory);
  4. Read and pre-process the "raw data" (in particular the log2R for each marker for all samples) (not mandatory but required for CNVs inheritance detection);
  5. Explore the sample's summary statistics, identify groups of unwanted events (e.g. smaller then 5 kbp or consisting of less than 5 points) and possible over-segmented samples (not mandatory but strongly suggested);
  6. Filter the results based on several parameters and blacklist (both provided by the package and user-generated), such as immunoglobulin and telomeric/centromeric regions. Also more extreme intra-results merging can be performed if one or more methods appear to over-segment (not mandatory but strongly suggested);
  7. Detect CNVs inheritance pattern, if families structure is given (not mandatory);
  8. Visually inspect the good de novo CNVs candidate raw data (not mandatory, strongly suggested in the case of de novo CNVs, require the previous step);
  9. Merge the various methods results in a single object (not mandatory, however this is one of the main points of the package);
  10. Second round of filtering, keep only the call made by N (one or more) algorithms/pipelines (not mandatory, requires the previous step);
  11. Annotate genomic locus of the calls and search for CNVs in known disease-linked loci (e.g. in studies on ASD or schizophrenia) (not mandatory);
  12. Annotate genic content of selected CNVs (not mandatory);
  13. Compute CNV regions (CNVRs) (not mandatory);
  14. Extract high confidence rare events based on CNVRs (not mandatory, requires the previous step);
  15. Export desired data. Several format are supported, such as BED (e.g. for being viewed in IGV) or VCF (e.g. for being processed with Annovar), as well as TSV (not mandatory).

Setup #

Install

CNVgears is part of Bioconductor (https://bioconductor.org), to install it using BiocManager:

if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("CNVgears")

Until Biconductor 3.14 it can be found in the devel branch.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("CNVgears")

To install the latest development version of the package via GitHub repository:

# not run 
devtools::install_github("SinomeM/CNVgears")

Load library

To load the package.

library(CNVgears, quietly = TRUE)

Set data.table threads

All the major functions uses data.table package, especially in the more intensive steps. The package uses by default half the CPU threads available, this can be changed with the function data.table::setDTthreads. For example to use the 75% of available threads or an integer number (four in the example) of them:

data.table::setDTthreads(percent=75)
data.table::setDTthreads(4)
data.table::setDTthreads(2)

Empirically, we found the ideal number of threads being between 4 and 8.
See the relative man pages for details on each function used.

Data Load

Cohort PED and Markers data

The first two objects we need to load are the cohort and markers information. The first object is loaded from a PED file from with the following columns are extracted: Sample ID, Family ID, sex and role (proband/sibling/mother/father). Role information needs to be present only if CNV inheritance detection is a desired step. The second object contains the genomic location of the markers used for CNV calling, thus the SNPs in case of DNA chip and genomic intervals if WES/WGS. The following columns are needed: SNP ID (only for DNA chips), chromosome, start, end. One of these objects is needed for each data type used for calling. Here we used only SNP arrays so only one markers object is needed.
In the package are bundled CNV calling results of three different algorithms for chromosomes 14 and 22 of 24 samples from the 1000 Genomes project, as well as a PED file of the cohort and the markers-level raw data for a single trio (within the 24 samples).

# cohort data
cohort <- read_metadt(DT_path = system.file("extdata", "cohort.ped", 
                                            package = "CNVgears"),
                      sample_ID_col = "Individual ID", 
                      fam_ID_col = "Family ID",
                      sex_col = "Gender", role_col = "Role")
head(cohort)

# markers location
SNP_markers <- read_finalreport_snps(system.file("extdata", "SNP.pfb", 
                                                 package = "CNVgears"),
                                     mark_ID_col = "Name", chr_col = "Chr", 
                                     pos_col = "Position")
head(SNP_markers)

CNV Calling Results and Markers-level raw data

Then we load the actual CNV calling results and the marker-level raw data. The results are loaded in the session as R objects, while the raw data are processed and stored as RDS one chromosome at the time. The RDS files will be loaded only when necessary in order to reduce the RAM requirements of processing results from very large cohorts and/or from several methods.
In this scenario the raw data consist of the SNP array only, thus we can process them one single time and use them for all the three callers results.
Note that during the importing of the results an intra-method CNV merging step is performed, i.e. adjacent call are merged into a single one if requirements are met. This step is performed by default but can be avoided by setting do_merge parameter to FALSE. The resulting objects are of the class CNVresults.

# CNV calling results
penn <- read_results(DT_path = system.file("extdata", 
                                           "chrs_14_22_cnvs_penn.txt", 
                                           package = "CNVgears"), 
                     res_type = "file", DT_type = "TSV/CSV", 
                     chr_col = "chr", start_col = "posStart", 
                     end_col = "posEnd", CN_col = "CN", 
                     samp_ID_col = "Sample_ID", sample_list = cohort, 
                     markers = SNP_markers, method_ID = "P")
head(penn)

quanti <- read_results(DT_path = system.file("extdata", 
                                             "chrs_14_22_cnvs_quanti.txt", 
                                             package = "CNVgears"), 
                      res_type = "file", DT_type = "TSV/CSV", 
                      chr_col = "chr", start_col = "posStart", 
                      end_col = "posEnd", CN_col = "CN", 
                      samp_ID_col = "Sample_ID", sample_list = cohort, 
                      markers = SNP_markers, method_ID = "Q")

ipn <-  read_results(DT_path = system.file("extdata", 
                                           "chrs_14_22_cnvs_ipn.txt", 
                                           package = "CNVgears"), 
                     res_type = "file", DT_type = "TSV/CSV", 
                     chr_col = "chr", start_col = "posStart", 
                     end_col = "posEnd", CN_col = "CN", 
                     samp_ID_col = "Sample_ID", sample_list = cohort, 
                     markers = SNP_markers, method_ID = "I")
# raw data
# not run. This step needs the additional data
library("data.table", quietly = TRUE)
setDTthreads(4)

read_finalreport_raw(DT_path = "PATH_TO_DOWNLOADED_DATA_DIRECTORY", 
                     pref = NA, suff = ".txt", 
                     rds_path = file.path("tmp_RDS"), 
                     markers = SNP_markers, 
                     sample_list = cohort[fam_ID == "1463", ])

Note that when loading the raw data only the trio is used as cohort. The samples object is subset using a data.table query.

Now the initial step of data load and normalization is completed and all a series of integration and analysis steps can take place.
Obviously, if more methods are used this initial step is longer, in particular if also WGS and/or high density SNPs array data is used (because of the very high number of markers). More examples will be added in future version of this vignette.

cn.mops package results loading

CNVs calling from NGS data Bioconductor package cn.mops uses GRanges objects as results. In particular, the actual calls are stored in the cnvs slot. These can be loaded into a CNVgears pipeline using the function cnmops_to_CNVresults as shown here. Note that only calls in chromosomes 1:22 are supported by this package at the moment.

## not run
library(cn.mops)
data(cn.mops)
resCNMOPS <- cn.mops(XRanges)
resCNMOPS <- calcIntegerCopyNumbers(resCNMOPS)
resCNMOPS_cnvs <- cnvs(resCNMOPS)

sample_list <- read_metadt("path/to/PED_like_file")
intervals <- read_NGS_intervals("path/to/markers/file")
cnmops_calls <- cnmops_to_CNVresults(resCNMOPS_cnvs, sample_list, intervals)

Data integration and processing

Results exploration and filtering

When analyzing CNVs calling or segmentation results, an important step is to filter out undesired and/or noisy/unreliable calls. This package provide the means to do so in an integrated an standardized manner using two functions: summary_stats and cleaning_filter. The first produces several summary statistics and exploratory plot on the results, both cohort-wise and samples-wise. They are designed for interactive use, also more then one time if desired. An example for one of the callers results.

sstatPenn <- summary(object = penn, sample_list = cohort,
                     markers = SNP_markers)
# not run 
sstatPenn <- summary(object = penn, sample_list = cohort,
                     markers = SNP_markers, 
                     plots_path = file.path("tmp","sstatPenn")) # plots are saved

Exploring these summary statistics can provide important insights on the cohort composition and on the differences between the individual callers. Obviously, the process is completely optional, and we provide default values for the filtering function based on literature and our experience.
The output of summary can help identifying over-segmented samples.

head(sstatPenn)
penn_filtered <- cleaning_filter(results = penn, min_len = 10000, min_NP = 10)
quanti_filtered <- cleaning_filter(quanti)
ipn_filtered <- cleaning_filter(ipn)

It is also possible to filter calls based on blacklists, e.g. immunoglobulin regions or telomers. In our package we provide two functions to generate immunoglobulin, telomeric and centromeric regions for the hg18, hg19 and hg38 genome assemblies. The user can specify the number of minimum consecutive immuno genes to create a region and the desired telomers and centromers blacklists width. Here we create those blacklists in hg19 as an example.
Note that the fucntion immuno_regions uses internally the package biomaRt that work with the hg38 genome assembly by default. To work with older assemblies we need to pass a custom mart object, as exemplified here.

hg19telom <- telom_centrom(hg19_start_end_centromeres, centrom = FALSE)
hg19centrom <- telom_centrom(hg19_start_end_centromeres, telom = FALSE, 
                             region = 25000)
ensembl37 <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", 
                              host = "grch37.ensembl.org", 
                              dataset="hsapiens_gene_ensembl")
hg19_immuno <- immuno_regions(n_genes = 10, mart = ensembl37)

A blacklist can be any data.table with at least these three columns: chr, start, end. Users should be particularly careful when filtering using blacklists as it is possible to filter out relevant data.

head(hg19telom)

Inheritance pattern detection

In family-based cohort studies, CNV inheritance detection is possible. However, because the process of CNVs calling tends to be imprecise, both false positive and false negative are present in the results. This is in part solved with the filtering process and can be solved combining more than one methods results.

Here, to identify good de-novo candidates hits as well as inherited events we use the function cnvs_inheritance that integrates three screening steps, both at segment-level and at marker-level. The first step, for every CNVs in an offspring, searches compatible CNVs in both parents (i.e. with 30% reciprocal overlap, by default). If an hit is found the CNVs is inherited, if not it is considered a putative de-novo ad undergoes the successive marker-level screenings. At this point the CopyRatio values for all the markers in the region are considered in the trio. A first screening consider as potentially inherited all the putative de-novo events of the offspring for witch a parent have more than 75% compatible markers points in the regions, i.e. markers with a CopyRatio < 0.75 if the CNV is a deletion or a CopyRatio > 1.25 if the CNV is a duplication. This step is mostly needed to speed up the function and will be made optional in near future.

For autosomal chromosomes

$$LRR = log2R = log2(CopyRatio) = log2(numeric_{CN}/2)$$

The putative de-novo events that survive undergo the final screening. This can be performed in several way, at the moment the suggested and most supported is the following. In brief for both pairs parent-offspring the mean of the CopyRatio values are compared using a Wilcox test. Under the null hypothesis that the difference of the two means is zero, we test if a difference is statistically significant, i.e. if the resulting p-value is lower than a user-defined alpha (default value is 0.05). If the test has a positive outcome for both parents than the CNV is defined as a good de-novo candidate. The two p-values from each comparison are stored in the dataset and, when all the individual comparison are done, two additional columns containing the adjusted p-values are computed, leaving to the user the choice of which sets of p-value use.

# not run, needs the additional raw data
penn_inheritance <- 
  cnvs_inheritance(results = penn_filtered, sample_list =  cohort[fam_ID == "1463", ], 
                   markers = SNP_markers, raw_path = file.path("tmp_RDS"))
quanti_inheritance <- 
  cnvs_inheritance(results = quanti_filtered, sample_list =  cohort[fam_ID == "1463", ], 
                   markers = SNP_markers, raw_path = file.path("tmp_RDS"))
ipn_inheritance <- 
  cnvs_inheritance(results = ipn_filtered, sample_list =  cohort[fam_ID == "1463", ], 
                   markers = SNP_markers, raw_path = file.path("tmp_RDS"))

penn_denovo <- penn_inheritance[inheritance == "denovo", ]
quanti_denovo <- quanti_inheritance[inheritance == "denovo", ]
ipn_denovo <- ipn_inheritance[inheritance == "denovo", ]

Note that inheritance detection is performed separately for each method, in this way it is possible to use marker-level raw data. However, when all the results use the same raw data (as in in this vignette, at the moment), it is also possible to first combine the results (see next section) an only then perform inheritance detection.

Inter-methods merging

At this point we can integrate the multiple methods results into a single object. As exemplification of a real pipeline we also process the de-novo calls.

ipq_filtered <- 
  inter_res_merge(res_list = list(penn_filtered, quanti_filtered, ipn_filtered),
                  sample_list = cohort, chr_arms = hg19_chr_arms)
# not run, it needs inheritance detection step
ipq_denovo <- 
  inter_res_merge(res_list = list(penn_denovo, quanti_denovo, ipn_denovo),
                  sample_list = cohort, chr_arms = hg19_chr_arms)

At this point it is possible to perform a second filtering step based on the number of calling methods, e.g. filter out all the calls made by one single algorithm.

Note how the merge perform reciprocal overlap comparison in order to merge two calls. Candidate events should thus not only overlap but must also be of comparable size.

CNV Regions Computation

Copy Number Variable Regions can be defined in several ways, an intuitive way to visualize them is: regions of the human genomes were CNVs (of comparable size) are present across samples in a population. This can be due for examples because such region is bordered by two particularly active breakpoints. CNVRs can be used for CNVs-GWAS and, more easily, to extract rare events from a cohort.
In this package, we provide a method to compute Copy Number Variable Regions (CNVRs), based on reciprocal overlap, starting from the integrated object created in the previous step. See the ?cnvrs_create for details.

cnvrs_chr22 <- cnvrs_create(cnvs = ipq_filtered[chr == 22,], 
                            chr_arms = hg19_chr_arms)

CNVRs can also be used for filtering purposes, as an example a user can be more interested in rare events and use CNVRs frequency to achieve this internally.
Rare (within the cohort) calls can be obtained filtering out the common CNVRs, as follows.

# define common CNVR as the one present in > 5% of the cohort 
# here the cohort is very small so it won't be a significative result
common_cnvrs_chr22 <- cnvrs_chr22[[1]][freq > nrow(cohort)*0.5, ]
rare_cvns_chr22 <- cnvrs_chr22[[2]][!cnvr %in% common_cnvrs_chr22$r_ID, ]

De-novo CNVs visual confirmation

In the package it is also implemented a function to visually inspect the de-novo candidates. We use as an example the two calls detected by both PennCNV and QantiSNP and the larger one made by IPattern.

# not run, it needs inheritance detection step
lrr_trio_plot(cnv = ipq_denovo[1], raw_path =  file.path("tmp_RDS"), 
              sample_list = cohort, results = ipn_filtered)
lrr_trio_plot(cnv = ipq_denovo[2], raw_path =  file.path("tmp_RDS"), 
              sample_list = cohort, results = ipn_filtered)
lrr_trio_plot(cnv = ipq_denovo[3], raw_path =  file.path("tmp_RDS"), 
              sample_list = cohort, results = ipn_filtered)

Note that the two calls are quite close.

Annotation

The package include also some annotation functions, for extracting both the CNVs locus and affected genes. At the moment the affected genes are listed using "-" separated Ensemble IDs.
As immuno_regions, the function genic_load uses biomaRt internally, thus a custom mart is needed to work in hg19.

# locus 
ipq_denovo_locus <- genomic_locus(DT_in = ipq_filtered, keep_str_end = FALSE)

# genes 
ensembl37 <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", 
                              host = "grch37.ensembl.org", 
                              dataset="hsapiens_gene_ensembl")

ipq_denovo_genes <- genic_load(DT_in = ipq_filtered, mart = ensembl37)

rm(ensembl37)

Visualization

All CNVs from an offspring (not only the de novo ones) can be visualized in the raw data using the function lrr_trio_plot. A simpler function to visualize one single event without any focus on family relations will be implemented soon. Integration with the package IGVR is also planned.

Data export

All the major object handled by the package are data.table. Thus to export outside of R an object (as an example the output of the combination of several methods results) it is possible to use data.table semantics and functions.

## not run

# Select a subset of columns using DT[, .(col1,col2,coln)]
tmp <- ipq_filtered[, .(sample_ID, chr, start, end, GT, meth_ID)]
head(tmp, 3)

# Select a subset of rows (i.e. filter) using DT[condition, ]
tmp <- ipq_filtered[n_meth > 1, .(sample_ID, chr, start, end, GT, meth_ID)]
head(tmp, 3)

# Export object as a TSV 
fwrite(tmp, "example.tsv", sep = "\t")

As an example to create a BED4 file were the entries names consist of sample_ID"-"GT that can be loaded in IGV we do:

tmp <- ipq_filtered[, .(chr, start, end, paste0(sample_ID, "-", GT))]
head(tmp)
## not run
fwrite(tmp, "example.bed", sep = "\t", col.names = FALSE)

Objects of the class CNVresults can also be converted into GenomicRanges::GRanges() using the function CNVresults_to_GRanges. Required columns (and thus maintained in the resulting object) are: chr, start, end, sample_ID, GT, meth_ID. The last three columns will be placed in the metadata field. In the near future a more general function to convert all the major object into GRanges as well as GRangesList (e.g. for integration with CNVranger package) will be implemented.

ipq_filtered_GRanges <- CNVresults_to_GRanges(ipq_filtered)

data.table queries can also be used to easily extract subset of interest from the results. Some relevant examples are shown here.

# only call with > 1 calling methods
tmp <- ipq_filtered[n_meth > 1, ]

# de novo CNVs with a significant corrected p-value (note that the p-values are
# lost when combining more result-objects)
tmp <- penn_inheritance[adj_m_pval < 0.05 & adj_p_pval < 0.05, ]

Notes on Parallelization #

As already mentioned, this package use data.table whenever is possible to handle the big tables that working with large cohorts produces and data.table is internally parallelized. As a consequence, most of the function already use more than one thread an should be run serially on system with a low number of CPU cores. On more CPU and RAM equipped systems however, the user may want to parallelize the longer steps. This can be done easily on two particularly long steps, i.e. the results and the markers-level raw data load. The suggested way for read_results is to split the sample_list object in two or more batches, run read_results in parallel on those batches, and finally combine the objects into a final one. The raw data processing do not produce any object, as it stores all its results into RDS files named after the individual samples, so it can be run as parallel as the user desire, again by simply generating subsets of the sample_list object. Ideally, the entire pipeline (excluding for obvious reasons CNVRs creation, in that case it is possible to run subset of chromosomes in parallel) can be run in parallel on more than one cohort subsets, as long as families are kept together. However, in order to have a better view on the cohort summary statistics and the filtering procedure effect, it is recommended to perform at least those steps on the entire dataset. While parallelizing the user should always keep in mind that data.table uses more than one thread by default, so data.table::setDTthreads() should always be set appropriately.
Again, empirically we found that the optimal number of threads is between 2 and 8 depending on the number of total accessible threads and the amount of formal, external parallelization desired, as more tend to increase the process speed rather than increasing it.



SinomeM/CNVgears documentation built on Nov. 21, 2021, 5:34 a.m.