Easy Copy Number !
EaCoN aims to be an all-in-one, user-friendly solution for relative or absolute copy-number analysis from multiple sources of data, with three different segmenters available (each with its corresponding copy-number modeling method). It consists of a series of R packages that perform this type of analysis, from raw CEL files of Affymetrix microarrays (GenomeWide SNP6, OncoScan, CytoScan 750K, CytoScan HD) or from aligned reads, as BAM files, for WES (whole exome sequencing).
```r
install.packages('devtools')
```
WARNING : If you get a GITHUB_PAT error when using the devtools::install_github() function, please run the following line once per session before running devtools::install_github() :
```r
Sys.unsetenv("GITHUB_PAT")
```
```r
devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")
devtools::install_github("mskcc/facets")
```
```r
## try using http:// if https:// URLs are not supported
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
install.packages('sequenza')
```

```r
## try using http:// if https:// URLs are not supported
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install(c("affxparser", "Biostrings", "aroma.light", "BSgenome", "copynumber", "GenomicRanges", "limma", "rhdf5", "sequenza"))
```
```r
## Install the most recent STABLE version (@master)
devtools::install_github("gustaveroussy/EaCoN")
```
While the current EaCoN package is the core of the process and works directly for WES data, several other packages are needed to properly handle Affymetrix microarrays : APT (Affymetrix Power Tools), the designs and their corresponding annotations (genome build, Affymetrix annotation databases) ; others are required for the (re)normalization, especially pre-computed GC% or Wavetracks.
```r
install.packages("https://zenodo.org/record/5494853/files/affy.CN.norm.data_0.1.2.tar.gz", repos = NULL, type = "source")
```

```r
devtools::install_github("gustaveroussy/apt.oncoscan.2.4.0")
```
For the NA33 (hg19) build :
```r
install.packages("https://zenodo.org/record/5494853/files/OncoScan.na33.r4_0.1.0.tar.gz", repos = NULL, type = "source")
```

```r
install.packages("https://zenodo.org/record/5494853/files/OncoScanCNV.na33.r2_0.1.0.tar.gz", repos = NULL, type = "source")
```
For the NA36 (hg38) build :
```r
install.packages("https://zenodo.org/record/5494853/files/OncoScan.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
```

```r
install.packages("https://zenodo.org/record/5494853/files/OncoScanCNV.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
```

```r
devtools::install_github("gustaveroussy/apt.cytoscan.2.4.0")
```
For the NA33 (hg19) build :
```r
install.packages("https://zenodo.org/record/5494853/files/CytoScan750K.Array.na33.r4_0.1.0.tar.gz", repos = NULL, type = "source")
```
- For the CytoScan HD design :
```r
install.packages("https://zenodo.org/record/5494853/files/CytoScanHD.Array.na33.r4_0.1.0.tar.gz", repos = NULL, type = "source")
```
For the NA36 (hg38) build :
```r
install.packages("https://zenodo.org/record/5494853/files/CytoScan750K.Array.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
```
- For the CytoScan HD design :
```r
install.packages("https://zenodo.org/record/5494853/files/CytoScanHD.Array.na36.r1_0.1.0.tar.gz", repos = NULL, type = "source")
```
Lastly, install the rcnorm package to perform BAF normalization for the CytoScan family of arrays :
```r
install.packages("https://zenodo.org/record/5494853/files/rcnorm_0.1.5.tar.gz", repos = NULL, type = "source")
```

```r
devtools::install_github("gustaveroussy/apt.snp6.1.20.0")
```

```r
install.packages("https://zenodo.org/record/5494853/files/GenomeWideSNP.6.na35.r1_0.1.0.tar.gz", repos = NULL, type = "source")
```

```r
install.packages("https://zenodo.org/record/5494853/files/rcnorm_0.1.5.tar.gz", repos = NULL, type = "source")
```
```r
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install('BSgenome')
BSgenome::available.genomes()
```

```r
BSgenome::installed.genomes()
```
```r
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
## To support NA33 / NA35 annotations (hg19)
BiocManager::install('BSgenome.Hsapiens.UCSC.hg19')
## To support NA36 annotations (hg38)
BiocManager::install('BSgenome.Hsapiens.UCSC.hg38')
```
```r
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```
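With this many packages to install, a quick pre-flight check can show what is still missing before launching an analysis. The following is a minimal sketch in base R only; the helper name `missing_packages` is hypothetical (it is not part of EaCoN), and the package list is an assumed example for an OncoScan / hg19 setup that should be adapted to your own arrays and genome build.

```r
# Hypothetical helper (not part of EaCoN): return the packages that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Assumed example list for an OncoScan / hg19 setup, using package names
# taken from the installation commands above.
needed <- c("EaCoN", "affy.CN.norm.data", "apt.oncoscan.2.4.0",
            "OncoScan.na33.r4", "OncoScanCNV.na33.r2",
            "BSgenome.Hsapiens.UCSC.hg19")

# Character vector of the packages still to install (empty if all are present)
missing_packages(needed)
```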
The full workflow is decomposed into a few different functions, which roughly correspond to these steps :
```
normalization -> segmentation +-> reporting
                              |
                              +-> copy-number estimation
```
EaCoN allows different ways of running the full workflow : considering the analysis of a single sample, you can either run each step independently and write, then load the intermediate results, or you can pipe all steps in a single line of code. You can also run the step-by-step approach on multiple samples in a row, even possibly at the same time using multithreading, using a batch mode.
First, under R, load EaCoN and choose a directory for writing results, for example : /home/project/EaCoN_results
```r
require(EaCoN)
setwd("/home/project/EaCoN_results")
```
```r
OS.Process(ATChannelCel = "/home/project/CEL/S1_OncoScan_CNV_A.CEL", GCChannelCel = "/home/project/CEL/S1_OncoScan_CNV_C.CEL", samplename = "S1_OS")
```

```r
CS.Process(CEL = "/home/project/CEL/S2_CytoScanHD.CEL", samplename = "S2_CSHD")
```
- The same output files will be generated (except for the "paircheck" file, obviously)
```r
SNP6.Process(CEL = "/home/project/CEL/S3_GenomeWide_snp.6.CEL", samplename = "S3_SNP6")
```
- Again, the same output files will be generated (except for the "paircheck" file, obviously)
First, we will use the capture BED (a text file containing the positions of the captured regions, usually provided by the capture kit manufacturer), choose a genome version corresponding to our aligned BAM files, and choose a window size for the future binning of the data. These will be used to generate what we call a "BINpack", a set of pre-computed tracks containing the bin positions and corresponding GC% values. Several tracks will be computed, corresponding to different levels of elongation of the bin positions. In the example below, we used the BED corresponding to the Agilent SureSelect v5 capture kit, a bin size of 50 nt, and the human hg19 genome build.
```r
BINpack.Maker(bed.file = "/home/project/WES/SureSelect_v5.bed", bin.size = 50, genome.pkg = "BSgenome.Hsapiens.UCSC.hg19")
```
This will generate a "BINpack" (with a ".rda" extension) to be used in the next normalization steps : /home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda
PLEASE NOTE THAT THIS STEP IS SAMPLE-INDEPENDENT, THUS NEEDS TO BE PERFORMED AGAIN ONLY IF YOU CHANGE EITHER THE CAPTURE BED, THE BIN SIZE OR THE GENOME BUILD. Thus, the generated BINpack can be used for any other sample in the same conditions.
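Since the BINpack is sample-independent, a script can check whether it already exists before regenerating it. The sketch below assumes that the BINpack file name always follows the pattern seen in the example above (`<BED name>_merged_sorted_<genome>_b<bin size>.GC.rda`); this is inferred from that single example, not from the EaCoN documentation, so compare it with your own output. The helper name `binpack_path` is hypothetical.

```r
# Hypothetical helper: build the expected BINpack path.
# ASSUMPTION: the naming pattern is inferred from the example file name above.
binpack_path <- function(bed.file, genome = "hg19", bin.size = 50, out.dir = getwd()) {
  bed.name <- sub("\\.bed$", "", basename(bed.file), ignore.case = TRUE)
  file.path(out.dir, paste0(bed.name, "_merged_sorted_", genome, "_b", bin.size, ".GC.rda"))
}

bp <- binpack_path("/home/project/WES/SureSelect_v5.bed",
                   out.dir = "/home/project/EaCoN_results")

if (file.exists(bp)) {
  message("Reusing existing BINpack : ", bp)
} else {
  message("BINpack not found, generate it once with BINpack.Maker() : ", bp)
}
```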
Second, the WES data will be binned using the generated BINpack. We need three files as input :
```r
WES.Bin(testBAM = "/home/project/WES/S4_WES_hg19_Tumor.BAM", refBAM = "/home/project/WES/S4_WES_hg19_Normal.BAM", BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", samplename = "S4_WES")
```
Third, now that the data have been binned, the normalization step can be performed :
```r
WES.Normalize.ff(BIN.RDS.file = "/home/project/EaCoN_results/S4_WES/S4_WES_hg19_b50_binned.RDS", BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda")
```

```r
Segment.ff(RDS.file = "/home/me/my_project/EaCoN_results/SAMPLE1/S1_OncoScan_CNV_hg19_processed.RDS", segmenter = "ASCAT")
```
This will perform the segmentation, centralization and calling steps, create a /home/project/EaCoN_results/S1/ASCAT/L2R/ subdirectory and write multiple files in it :
To perform the same using the FACETS segmenter, just change the value of the segmenter parameter !
I suppose you guessed how to do the same with SEQUENZA, right ? ;)
```r
ASCN.ff(RDS.file = "/home/me/my_project/EaCoN_results/SAMPLE1/ASCAT/L2R/SAMPLE1.ASCAT.RDS")
```
This will perform these estimations for a range of values of the "gamma" parameter (default is 0.35 to 0.95, with a step of 0.05 ; see the ASCAT R package help pages for more details), create a /home/project/EaCoN_results/S1/ASCAT/ASCN/ subdirectory, in which further subdirectories will be created, one for each gamma value : /home/project/EaCoN_results/S1/ASCAT/ASCN/gamma_0.xx/. In each of those will be written :
To perform the same using the FACETS or SEQUENZA estimator, just use a RDS generated with Segment.FACETS() or Segment.SEQUENZA(), respectively (or their ".ff" equivalent).
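To get a feel for how many gamma values (and thus output subdirectories) the default ASCAT range produces, the range can be enumerated in base R. This is only an illustration : the two-decimal directory naming is an assumption read from the `gamma_0.xx` pattern above, and EaCoN itself is not called.

```r
# Default gamma range described above : 0.35 to 0.95, step 0.05
gammas <- seq(0.35, 0.95, by = 0.05)

# ASSUMPTION: subdirectories are named with two decimals ("gamma_0.xx")
gamma.dirs <- file.path("/home/project/EaCoN_results/S1/ASCAT/ASCN",
                        sprintf("gamma_%.2f", gammas))

length(gammas)       # number of gamma values tested
head(gamma.dirs, 3)  # first expected subdirectories
```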
```r
Annotate.ff(RDS.file = "/home/project/EaCoN_results/S1/ASCAT/L2R/S1.EaCoN.ASPCF.RDS", author.name = "Me!")
```
All the steps described above in single-sample mode can be run in batch mode, that is for multiple samples, possibly combined with multithreading to process several samples in parallel. This simply consists in using functions with the same name plus a ".Batch" suffix. Those are just wrappers around the single-sample versions of the functions.
The OS.Process.Batch function replaces the ATChannelCel, GCChannelCel and samplename parameters with the pairs.file parameter, a tab-separated file made of three columns with a header, and one sample per line :
- ATChannelCel : the path to the "A" OncoScan CEL file
- GCChannelCel : the path to the "C" OncoScan CEL file
- SampleName : the sample name to use
By default, the function will run all samples one by one, but multithreading can be enabled by setting the nthread parameter to a value greater than 1. Take care not to set a value higher than the number of threads currently available on your machine ! Please also remember that each new thread will use its own amount of RAM...
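One way to follow this advice is to cap the requested thread count by the number of detected cores and by the number of samples. The sketch below uses `parallel::detectCores()` from the base R distribution ; the helper name `pick_nthread` is hypothetical, and the cap does not account for RAM, which still has to be judged by hand.

```r
library(parallel)

# Hypothetical helper (not part of EaCoN): never request more threads than
# detected cores or than samples to process, and never fewer than 1.
pick_nthread <- function(requested, n.samples) {
  cores <- detectCores()
  if (is.na(cores)) cores <- 1L  # detectCores() may return NA on some platforms
  max(1L, min(as.integer(requested), as.integer(n.samples), cores))
}

# With 4 samples, asking for 8 threads is capped at 4 or fewer
pick_nthread(requested = 8, n.samples = 4)
```

The returned value can then be passed as the nthread argument of any of the ".Batch" functions.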
Here is a synthetic example with 4 samples :
- The pairs.file (stored as /home/project/CEL/OS_pairs.txt) :

ATChannelCel | GCChannelCel | SampleName
--- | --- | ---
/home/project/CEL/S1_OncoScan_CNV_A.CEL | /home/project/CEL/S1_OncoScan_CNV_C.CEL | S1_OS
/home/project/CEL/S5_OncoScan_CNV_A.CEL | /home/project/CEL/S5_OncoScan_CNV_C.CEL | S5_OS
/home/project/CEL/S6_OncoScan_CNV_A.CEL | /home/project/CEL/S6_OncoScan_CNV_C.CEL | S6_OS
/home/project/CEL/S7_OncoScan_CNV_A.CEL | /home/project/CEL/S7_OncoScan_CNV_C.CEL | S7_OS

```r
OS.Process.Batch(pairs.file = "/home/project/CEL/OS_pairs.txt", nthread = 2)
```
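Rather than typing the pairs file by hand, it can be generated from the list of "A" CEL files. This sketch assumes that the "A" and "C" files of a sample differ only by their `_A` / `_C` suffix and that sample names are derived from the file names, as in the example paths above ; both are assumptions about your file layout, not EaCoN requirements. The file is written to a temporary directory here.

```r
# ASSUMPTION: A/C CEL files of a sample only differ by the _A / _C suffix
a.cels <- c("/home/project/CEL/S1_OncoScan_CNV_A.CEL",
            "/home/project/CEL/S5_OncoScan_CNV_A.CEL")

pairs <- data.frame(
  ATChannelCel = a.cels,
  GCChannelCel = sub("_A\\.CEL$", "_C.CEL", a.cels),
  # Sample name : file name prefix plus an "_OS" suffix (arbitrary convention)
  SampleName   = paste0(sub("_OncoScan_CNV_A\\.CEL$", "", basename(a.cels)), "_OS"),
  stringsAsFactors = FALSE
)

# Tab-separated, with header, no quotes, no row names
pairs.file <- file.path(tempdir(), "OS_pairs.txt")
write.table(pairs, file = pairs.file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The resulting path can then be given to the pairs.file parameter.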
Same principle, but this time there is one column less and the header changes a bit :
- CEL : the path to the CEL file
- SampleName : the sample name to use
Here is a synthetic example with 4 samples :
- The CEL.list.file (stored as /home/project/CEL/CSHD_list.txt) :

CEL | SampleName
--- | ---
/home/project/CEL/S8_CytoScanHD.CEL | S8_CSHD
/home/project/CEL/S9_CytoScanHD.CEL | S9_CSHD
/home/project/CEL/S10_CytoScanHD.CEL | S10_CSHD
/home/project/CEL/S11_CytoScanHD.CEL | S11_CSHD

```r
CS.Process.Batch(CEL.list.file = "/home/project/CEL/CSHD_list.txt", nthread = 2)
```
Identical to CytoScan 750k / HD, but the function is named SNP6.Process.Batch.
Still the same principle with an external list file, with column names :
- testBAM : the path to the test BAM file
- refBAM : the path to the reference BAM file
- SampleName : the sample name to use
Here is a synthetic example with 4 samples :
- The BAM.list.file (stored as /home/project/WES/BAM_list.txt) :

testBAM | refBAM | SampleName
--- | --- | ---
/home/project/WES/S4_WES_hg19_Tumor.BAM | /home/project/WES/S4_WES_hg19_Normal.BAM | S4_WES
/home/project/WES/S12_WES_hg19_Tumor.BAM | /home/project/WES/S12_WES_hg19_Normal.BAM | S12_WES
/home/project/WES/S13_WES_hg19_Tumor.BAM | /home/project/WES/S13_WES_hg19_Normal.BAM | S13_WES
/home/project/WES/S14_WES_hg19_Tumor.BAM | /home/project/WES/S14_WES_hg19_Normal.BAM | S14_WES

```r
WES.Bin.Batch(BAM.list.file = "/home/project/WES/BAM_list.txt", BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", nthread = 2)
WES.Normalize.ff.Batch(BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", nthread = 2)
```
Note that here we did not specify any RDS or list file to WES.Normalize.ff.Batch. This is because this function takes as its first argument BIN.RDS.files, a list of "_binned.RDS" files (generated by the previous command), and by default it recursively searches the current working directory and its subdirectories for any such RDS files. You can of course build your own list of RDS files to process, if you know a bit of R.
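As an illustration of building such a list by hand, the sketch below creates a mock results tree in a temporary directory, finds every "_binned.RDS" file with `list.files()`, and keeps only chosen samples. The directory layout (one subdirectory per sample) mirrors the example paths above and is an assumption ; the final EaCoN call is left commented out since it needs real data.

```r
# Self-contained demo : mock results tree in a temporary directory
root <- file.path(tempdir(), "EaCoN_demo")
for (s in c("S4_WES", "S12_WES", "S13_WES")) {
  dir.create(file.path(root, s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, s, paste0(s, "_hg19_b50_binned.RDS")))
}

# Find every "_binned.RDS" file below the root directory...
all.rds <- list.files(path = root, pattern = "_binned\\.RDS$",
                      full.names = TRUE, recursive = TRUE)

# ...then keep only the samples of interest (matched on their directory name)
keep <- c("S4_WES", "S13_WES")
my.rds <- all.rds[basename(dirname(all.rds)) %in% keep]
basename(my.rds)

# The selection could then be passed explicitly (not run here) :
# WES.Normalize.ff.Batch(BIN.RDS.files = my.rds, BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda", nthread = 2)
```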
As for the WES.Normalize.ff.Batch function, the Segment.ff.Batch function needs as its first argument RDS.files, a list of "_processed.RDS" files (generated at the raw data processing step). Likewise, it will by default recursively search downwards for any compatible RDS file.
Here is a synthetic example that will segment our CytoScan HD samples (as defined by the pattern below) using ASCAT :
```r
Segment.ff.Batch(RDS.files = list.files(path = getwd(), pattern = ".*_processed.RDS$", full.names = TRUE, recursive = TRUE), segmenter = "ASCAT", smooth.k = 5, SER.pen = 20, nrf = 1.0, nthread = 2)
```
To perform the same using the FACETS segmenter, just change the value of the segmenter parameter.
I suppose you guessed how to do the same with SEQUENZA, right ? ;)
Still the same, with the ASCN.ff.Batch :
```r
ASCN.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE), nthread = 2)
```
And here again with the Annotate.ff.Batch :
```r
Annotate.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE), author.name = "Me!")
```
EaCoN has been implemented in such a way that one can also opt to launch the full workflow in a single command line for a single sample, using pipes from the magrittr package. However, this is not recommended as a default use : even though EaCoN is provided with recommendations that should fit most cases, users may have to deal with particular profiles requiring parameter tweaking, which is not possible in piped mode... Here is an example using ASCAT :
```r
samplename <- "SAMPLE1_OS"
workdir <- "/home/me/my_project/EaCoN_results"
setwd(workdir)
require(EaCoN)
require(magrittr)

OS.Process(ATChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_A.CEL", GCChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_C.CEL", samplename = samplename, return.data = TRUE) %>%
  Segment(out.dir = paste0(workdir, "/", samplename), segmenter = "ASCAT", return.data = TRUE) %T>%
  Annotate(out.dir = paste0(workdir, "/", samplename, "/ASCAT/L2R")) %>%
  ASCN.ASCAT(out.dir = paste0(workdir, "/", samplename))
```
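The piped example mixes two magrittr operators : `%>%` passes the result of the left-hand side to the next function, whereas the "tee" operator `%T>%` calls the next function for its side effect only and passes the original left-hand value further down. That is presumably why it is placed before Annotate, so that the segmentation result, and not the output of the reporting step, reaches ASCN.ASCAT. Here is a toy illustration with a hypothetical `report()` function standing in for a reporting step ; EaCoN is not involved.

```r
library(magrittr)

# Hypothetical stand-in for a reporting step : used for its side effect only
report <- function(x) {
  message("reporting on ", length(x), " values")
  invisible(NULL)
}

# %T>% runs report() but forwards the original vector, so sum() still
# receives c(1, 2, 3) and not the NULL returned by report()
res <- c(1, 2, 3) %T>% report() %>% sum()
res
```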
SOURCE | SER.pen | smooth.k | nrf | BAF.filter
--- | --- | --- | --- | ---
OncoScan | 40 (default) | NULL (default) | 0.5 (default) | 0.9
CytoScan HD | 20 | 5 | 1.0 | 0.75 (default)
SNP6 | 60 | 5 | 0.25 | 0.75 (default)
WES | 2 to 10 | 5 | 0.5 (default) to 1 | 0.75 (default)
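These values can also be kept in a small lookup structure, so that a script picks the right settings from the data source. The sketch below only transcribes the table above ; for WES, where the table gives ranges, the lower bounds are used as an arbitrary starting point. The RDS path is a hypothetical placeholder, and passing BAF.filter to the segmentation function alongside the other three parameters is an assumption.

```r
# Values transcribed from the table above.
# For WES the table gives ranges (SER.pen 2 to 10, nrf 0.5 to 1) :
# the lower bounds are used here as an arbitrary starting point.
guidelines <- list(
  OncoScan   = list(SER.pen = 40, smooth.k = NULL, nrf = 0.5,  BAF.filter = 0.9),
  CytoScanHD = list(SER.pen = 20, smooth.k = 5,    nrf = 1.0,  BAF.filter = 0.75),
  SNP6       = list(SER.pen = 60, smooth.k = 5,    nrf = 0.25, BAF.filter = 0.75),
  WES        = list(SER.pen = 2,  smooth.k = 5,    nrf = 0.5,  BAF.filter = 0.75)
)

# Assemble the arguments for one source (the RDS path is a placeholder)
seg.args <- c(list(RDS.file = "path/to/sample_processed.RDS", segmenter = "ASCAT"),
              guidelines[["CytoScanHD"]])
str(seg.args)

# Not run : requires EaCoN and a real RDS file
# do.call(Segment.ff, seg.args)
```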
The FACETS segmenter cannot currently be used on SNP6 data (due to missing normalized A and B signals).
The SEQUENZA segmenter SHOULD NOT be used with SNP6 microarrays (it theoretically can, but requires huge amounts of RAM, i.e. more than 32 GB). This may freeze your computer or make it swap heavily !
For WES data, any sorted BAM should work, but we recommend using BAMs for which duplicates were marked/removed (samtools markdup, Picard MarkDuplicates, etc...), for higher quality results.