README.md

EaCoN

Easy Copy Number !

DESCRIPTION

EaCoN aims to be an all-in-one, user-friendly solution for relative or absolute copy-number analysis of multiple data sources, with three segmenters available (and three corresponding copy-number modeling methods). It consists of a series of R packages that perform this type of analysis, from raw CEL files of Affymetrix microarrays (GenomeWide SNP6, OncoScan, CytoScan 750K, CytoScan HD) or from aligned reads as BAM files for WES (whole exome sequencing).

FEATURES

NOTES

QUICK NEWS

2021-10-18 : v0.3.6-2 (CloudyMonday2) is out !

2021-06-18 : v0.3.6-1 (SweetSummerSweat) is out !

2021-05-23 : v0.3.6 (Barolo) is out !

2020-08-17 : v0.3.5 (CloudyMonday) is out !

2018-12-10 : v0.3.4-1 (PostRoscovite) is out !

2018-10-30 : v0.3.4 (Papy60) is out !

2018-10-02 : v0.3.3-1 (LittleWomanNoCry) is out !

2018-09-12 : v0.3.3 (Trinity) is out !

2018-08-08 : v0.3.2 (PapeMamiePichine) is out !

INSTALLATION

CORE

```r
install.packages('devtools')
```

WARNING : If you get a GITHUB_PAT error when using the devtools::install_github() function, please run the following line once per session before running devtools::install_github() :

```r
Sys.unsetenv("GITHUB_PAT")
```

```r
devtools::install_github("Crick-CancerGenomics/ascat/ASCAT")
devtools::install_github("mskcc/facets")
```

```r
## try using http:// if https:// URLs are not supported
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
install.packages('sequenza')
```

```r
## try using http:// if https:// URLs are not supported
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install(c("affxparser", "Biostrings", "aroma.light", "BSgenome", "copynumber", "GenomicRanges", "limma", "rhdf5", "sequenza"))
```

```r
## Install the most recent STABLE version (@master)
devtools::install_github("gustaveroussy/EaCoN")
```

MICROARRAY-SPECIFIC

While the current EaCoN package is the core of the process and works out of the box for WES data, several other packages are needed to properly handle Affymetrix microarrays : APT (Affymetrix Power Tools), designs and corresponding annotations (genome build, Affymetrix annotation databases) ; others are required for the (re)normalization, especially pre-computed GC% or Wavetracks.

ALL AFFYMETRIX MICROARRAYS

```r
install.packages("https://zenodo.org/record/5494853/files/affy.CN.norm.data_0.1.2.tar.gz", repos = NULL, type = "source")
```

ONCOSCAN FAMILY (OncoScan / OncoScan_CNV)

```r
devtools::install_github("gustaveroussy/apt.oncoscan.2.4.0")
```

CYTOSCAN FAMILY (CytoScan 750k / CytoScan HD)

```r
devtools::install_github("gustaveroussy/apt.cytoscan.2.4.0")
```

```r
install.packages("https://zenodo.org/record/5494853/files/rcnorm_0.1.5.tar.gz", repos = NULL, type = "source")
```

GENOMEWIDE SNP6

```r
devtools::install_github("gustaveroussy/apt.snp6.1.20.0")
```

```r
install.packages("https://zenodo.org/record/5494853/files/GenomeWideSNP.6.na35.r1_0.1.0.tar.gz", repos = NULL, type = "source")
```

```r
install.packages("https://zenodo.org/record/5494853/files/rcnorm_0.1.5.tar.gz", repos = NULL, type = "source")
```

GENOMES

```r
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install('BSgenome')
BSgenome::available.genomes()
```

```r
BSgenome::installed.genomes()
```

```r
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')

## To support NA33 / NA35 annotations (hg19)
BiocManager::install('BSgenome.Hsapiens.UCSC.hg19')

## To support NA36 annotations (hg38)
BiocManager::install('BSgenome.Hsapiens.UCSC.hg38')
```

```r
if(!'BiocManager' %in% installed.packages()) install.packages('BiocManager')
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```

INPUT

USAGE

The full workflow is decomposed into a few different functions, which roughly correspond to these steps :

normalization -> segmentation +-> reporting
                              |
                              +-> copy-number estimation

EaCoN offers different ways of running the full workflow. For a single sample, you can either run each step independently, writing and then loading the intermediate results, or pipe all steps in a single line of code. You can also run the step-by-step approach on multiple samples in batch mode, either one after another or in parallel using multithreading.

Step by step mode

First, under R, load EaCoN and choose a directory for writing results, for example : /home/project/EaCoN_results

```r
require(EaCoN)
setwd("/home/project/EaCoN_results")
```

Raw data processing

Affymetrix OncoScan / OncoScan_CNV

```r
OS.Process(ATChannelCel = "/home/project/CEL/S1_OncoScan_CNV_A.CEL",
           GCChannelCel = "/home/project/CEL/S1_OncoScan_CNV_C.CEL",
           samplename = "S1_OS")
```

Affymetrix CytoScan 750k / CytoScan HD

```r
CS.Process(CEL = "/home/project/CEL/S2_CytoScanHD.CEL", samplename = "S2_CSHD")
```

- The same output files will be generated (except for the "paircheck" file, obviously)

Affymetrix GenomeWide SNP6

```r
SNP6.Process(CEL = "/home/project/CEL/S3_GenomeWide_snp.6.CEL", samplename = "S3_SNP6")
```

- Again, the same output files will be generated (except for the "paircheck" file, obviously)

WES data

L2R & BAF Segmentation

```r
Segment.ff(RDS.file = "/home/me/my_project/EaCoN_results/SAMPLE1/S1_OncoScan_CNV_hg19_processed.RDS", segmenter = "ASCAT")
```

Copy-number estimation

```r
ASCN.ff(RDS.file = "/home/me/my_project/EaCoN_results/SAMPLE1/ASCAT/L2R/SAMPLE1.ASCAT.RDS")
```

HTML reporting

```r
Annotate.ff(RDS.file = "/home/project/EaCoN_results/S1/ASCAT/L2R/S1.EaCoN.ASPCF.RDS", author.name = "Me!")
```

Batch mode (with multithreading)

All the steps described above in single-sample mode can be run in batch mode, that is for multiple samples, possibly combined with multithreading to process several samples in parallel. It simply consists in using functions with the same name plus a ".Batch" suffix. These are just wrappers around the single-sample versions of the functions.

Raw data processing

Affymetrix OncoScan / OncoScan_CNV

The OS.Process.Batch function replaces the ATChannelCel, GCChannelCel and samplename parameters with a single pairs.file parameter : a tab-separated file with a header and three columns, one sample per line :

- ATChannelCel : the path to the "A" OncoScan CEL file
- GCChannelCel : the path to the "C" OncoScan CEL file
- SampleName : the sample name to use

By default, the function will process all samples one by one, but multithreading can be enabled by setting the nthread parameter to a value greater than 1. Be careful not to set a value higher than the number of threads available on your machine ! Please also remember that each new thread will use its own amount of RAM...
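As a minimal sketch of that advice, the thread count can be derived from the machine rather than hard-coded. The variable names below are illustrative only, and leaving one core free for the system is a choice made for this example, not an EaCoN requirement. Note that this only bounds the thread count ; it does not account for RAM.

```r
# Number of samples to process (4 in the synthetic examples of this README)
n.samples <- 4L

# parallel::detectCores() may return NA on some platforms
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Never more threads than samples, never more than (cores - 1), at least 1
nthread <- max(1L, min(n.samples, available - 1L))

# The value could then be passed to a batch function, for example:
# OS.Process.Batch(pairs.file = "/home/project/CEL/OS_pairs.txt", nthread = nthread)
```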

Here is a synthetic example with 4 samples :

- The pairs.file (stored as /home/project/CEL/OS_pairs.txt) :

ATChannelCel | GCChannelCel | SampleName
--- | --- | ---
/home/project/CEL/S1_OncoScan_CNV_A.CEL | /home/project/CEL/S1_OncoScan_CNV_C.CEL | S1_OS
/home/project/CEL/S5_OncoScan_CNV_A.CEL | /home/project/CEL/S5_OncoScan_CNV_C.CEL | S5_OS
/home/project/CEL/S6_OncoScan_CNV_A.CEL | /home/project/CEL/S6_OncoScan_CNV_C.CEL | S6_OS
/home/project/CEL/S7_OncoScan_CNV_A.CEL | /home/project/CEL/S7_OncoScan_CNV_C.CEL | S7_OS
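Such a tab-separated file can be written by hand, or generated from R. The sketch below uses base R only ; the column names follow the list given above, and the file is written to a temporary directory here purely so that the example has no side effects (in practice you would write it wherever your pairs.file should live).

```r
# Sample identifiers from the synthetic example
ids <- c("S1", "S5", "S6", "S7")

pairs <- data.frame(
  ATChannelCel = sprintf("/home/project/CEL/%s_OncoScan_CNV_A.CEL", ids),
  GCChannelCel = sprintf("/home/project/CEL/%s_OncoScan_CNV_C.CEL", ids),
  SampleName   = paste0(ids, "_OS"),
  stringsAsFactors = FALSE
)

# Tab-separated, with a header, without quotes or row names
pairs.path <- file.path(tempdir(), "OS_pairs.txt")
write.table(pairs, file = pairs.path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to check the layout
check <- read.delim(pairs.path, stringsAsFactors = FALSE)
```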

```r
OS.Process.Batch(pairs.file = "/home/project/CEL/OS_pairs.txt", nthread = 2)
```

Affymetrix CytoScan 750k / CytoScan HD

Same principle, but this time there is one column less and the header changes a bit :

- CEL : the path to the CEL file
- SampleName : the sample name to use

Here is a synthetic example with 4 samples :

- The CEL.list.file (stored as /home/project/CEL/CSHD_list.txt) :

CEL | SampleName
--- | ---
/home/project/CEL/S8_CytoScanHD.CEL | S8_CSHD
/home/project/CEL/S9_CytoScanHD.CEL | S9_CSHD
/home/project/CEL/S10_CytoScanHD.CEL | S10_CSHD
/home/project/CEL/S11_CytoScanHD.CEL | S11_CSHD

```r
CS.Process.Batch(CEL.list.file = "/home/project/CEL/CSHD_list.txt", nthread = 2)
```

Affymetrix GenomeWide SNP6

Identical to CytoScan 750k / HD, but the function is named SNP6.Process.Batch.

WES data

Still the same principle with an external list file, with column names :

- testBAM : the path to the test BAM file
- refBAM : the path to the reference BAM file
- SampleName : the sample name to use

Here is a synthetic example with 4 samples :

- The BAM.list.file (stored as /home/project/WES/BAM_list.txt) :

testBAM | refBAM | SampleName
--- | --- | ---
/home/project/WES/S4_WES_hg19_Tumor.BAM | /home/project/WES/S4_WES_hg19_Normal.BAM | S4_WES
/home/project/WES/S12_WES_hg19_Tumor.BAM | /home/project/WES/S12_WES_hg19_Normal.BAM | S12_WES
/home/project/WES/S13_WES_hg19_Tumor.BAM | /home/project/WES/S13_WES_hg19_Normal.BAM | S13_WES
/home/project/WES/S14_WES_hg19_Tumor.BAM | /home/project/WES/S14_WES_hg19_Normal.BAM | S14_WES

```r
WES.Bin.Batch(BAM.list.file = "/home/project/WES/BAM_list.txt",
              BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda",
              nthread = 2)
WES.Normalize.ff.Batch(BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda",
                       nthread = 2)
```

Note that here we did not specify any RDS or list file to WES.Normalize.ff.Batch. This is because this function takes as its first argument BIN.RDS.files, a list of "_binned.RDS" files (generated by the previous command), and by default it recursively searches the current working directory and its subdirectories for any of these RDS files. You can of course build your own list of RDS files to process, if you know a bit of R.
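As an illustration of building such a list yourself, here is a minimal base R sketch. The helper find.binned is hypothetical (it is not part of EaCoN), and it assumes that each binned file name starts with the sample name followed by an underscore ; adapt the matching rule to your own file names.

```r
# Hypothetical helper (not part of EaCoN): list "_binned.RDS" files under
# `root`, keeping only those whose file name starts with one of `samples`.
find.binned <- function(root, samples) {
  files <- list.files(path = root, pattern = "_binned\\.RDS$",
                      full.names = TRUE, recursive = TRUE)
  # Assumption: each file name starts with "<SampleName>_"
  keep <- vapply(basename(files),
                 function(f) any(startsWith(f, paste0(samples, "_"))),
                 logical(1))
  files[unname(keep)]
}

# Restrict the normalization to two of the four WES samples
my.bins <- find.binned(getwd(), c("S4_WES", "S12_WES"))

# The resulting vector could then be passed explicitly, for example:
# WES.Normalize.ff.Batch(BIN.RDS.files = my.bins,
#                        BINpack = "/home/project/EaCoN_results/SureSelect_v5_merged_sorted_hg19_b50.GC.rda",
#                        nthread = 2)
```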

L2R & BAF Segmentation

As for the WES.Normalize.ff.Batch function, the Segment.ff.Batch function needs as its first argument RDS.files, a list of "_processed.RDS" files (generated at the raw data processing step). Likewise, it will by default recursively search downwards for any compatible RDS file.

Here is a synthetic example that will segment our CytoScan HD samples (as defined by the pattern below) using ASCAT :

```r
Segment.ff.Batch(RDS.files = list.files(path = getwd(), pattern = ".*_processed.RDS$", full.names = TRUE, recursive = TRUE),
                 segmenter = "ASCAT", smooth.k = 5, SER.pen = 20, nrf = 1.0, nthread = 2)
```

Copy-number estimation

Still the same, with the ASCN.ff.Batch :

```r
ASCN.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE),
              nthread = 2)
```

HTML reporting

And here again with the Annotate.ff.Batch :

```r
Annotate.ff.Batch(RDS.files = list.files(path = getwd(), pattern = "SEG\\..*\\.RDS$", full.names = TRUE, recursive = TRUE),
                  author.name = "Me!")
```

Piped

EaCoN has been implemented in such a way that one can also opt to launch the full workflow in a single command line for a single sample, using pipes from the magrittr package. However, this is not recommended as a default use : even though EaCoN is provided with recommendations that should fit most cases, users may have to deal with particular profiles requiring parameter tweaking, which is not possible in piped mode... Here is an example using ASCAT :

```r
samplename <- "SAMPLE1_OS"
workdir <- "/home/me/my_project/EaCoN_results"
setwd(workdir)
require(EaCoN)
require(magrittr)

OS.Process(ATChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_A.CEL",
           GCChannelCel = "/home/me/my_project/CEL/SAMPLE1_OncoScan_CNV_C.CEL",
           samplename = samplename, return.data = TRUE) %>%
  Segment(out.dir = paste0(workdir, "/", samplename), segmenter = "ASCAT", return.data = TRUE) %T>%
  Annotate(out.dir = paste0(workdir, "/", samplename, "/ASCAT/L2R")) %>%
  ASCN.ASCAT(out.dir = paste0(workdir, "/", samplename))
```
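The pipeline above mixes two magrittr operators : `%>%` forwards the result of the left-hand call, whereas the "tee" operator `%T>%` runs the right-hand call for its side effect and forwards the left-hand value unchanged. This is presumably why it precedes the reporting step, so that the segmentation result (not the output of the report) reaches the copy-number estimation. The toy example below, unrelated to EaCoN and using a made-up report function, illustrates the difference.

```r
library(magrittr)

# A side-effect-only step: prints something, returns nothing useful
report <- function(x) {
  cat("values seen by report():", x, "\n")
  invisible(NULL)
}

# report() is called on the square roots, but sum() still receives
# the square roots themselves, not the NULL returned by report()
total <- c(1, 4, 9) %>% sqrt() %T>% report() %>% sum()
```

With a plain `%>%` in place of `%T>%`, sum() would receive the NULL returned by report() instead of the values.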

Conclusion on usage

GUIDELINES

Segmentation

SOURCE | SER.pen | smooth.k | nrf | BAF.filter
--- | --- | --- | --- | ---
OncoScan | 40 (default) | NULL (default) | 0.5 (default) | 0.9
CytoScan HD | 20 | 5 | 1.0 | 0.75 (default)
SNP6 | 60 | 5 | 0.25 | 0.75 (default)
WES | 2 to 10 | 5 | 0.5 (default) to 1 | 0.75 (default)
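For scripted use, these recommendations can be kept in a small lookup structure. This is only a sketch : the list name seg.guidelines is made up for the example, and it assumes that the column names of the table are accepted as argument names by Segment.ff, as SER.pen, smooth.k and nrf are in the batch example of this README. WES is left out because the table gives ranges rather than single values for it.

```r
# Recommended segmentation settings per microarray source (from the table)
seg.guidelines <- list(
  OncoScan   = list(SER.pen = 40, smooth.k = NULL, nrf = 0.5,  BAF.filter = 0.9),
  CytoScanHD = list(SER.pen = 20, smooth.k = 5,    nrf = 1.0,  BAF.filter = 0.75),
  SNP6       = list(SER.pen = 60, smooth.k = 5,    nrf = 0.25, BAF.filter = 0.75)
)

# Pick the settings for one source
params <- seg.guidelines[["CytoScanHD"]]

# They could then be combined with the other arguments, for example:
# do.call(Segment.ff, c(list(RDS.file = "path/to/sample_processed.RDS",
#                            segmenter = "ASCAT"), params))
```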

NOTES

AUTHORS & CONTACT



gustaveroussy/EaCoN documentation built on Oct. 20, 2021, 2:41 a.m.