README.md

CopywriteR

(formerly known as ENCODER)

Current methods for detection of copy number variants and aberrations (CNV and CNA) from targeted sequencing data are based on the depth of coverage of captured exons. Accurate CNA determination is complicated by uneven genomic distribution and non-uniform capture efficiency of targeted exons. Here we present CopywriteR which eludes these problems by exploiting ‘off-target’ sequence reads. CopywriteR allows for extracting uniformly distributed copy number information, can be used without reference and can be applied to sequencing data obtained from various techniques including chromatin immunoprecipitation and target enrichment on small gene panels. CopywriteR outperforms existing methods and constitutes a widely applicable alternative to available tools.

CopywriteR has been described in Kuilman et al., 2015. The analysis described in this publication were performed using the older version V1.3.

Bioconductor

We are happy to announce that the CopywriteR package has been accepted in Bioconductor and has been released as of 17 April 2015. All the Bioconductor packages in the newest release have a dependency for the newest version of R (version 3.2) and for Bioconductor version 3.1. If you have these installed, CopywriteR can be installed as follows:

> source("http://bioconductor.org/biocLite.R")
> biocLite("CopywriteR")

We have developed CopywriteR with R 3.1 and Bioconductor 3.0, and therefore know that it works fine with these version too. In case you are running these, you can still install CopywriteR as is explained below.

Installation (not via Bioconductor)

CopywriteR can be installed without Bioconductor, which is useful when you have R 3.1 and Bioconductor 3.0 installed. Installation should be performed as follows:

> source("http://bioconductor.org/biocLite.R")
> biocLite(c("matrixStats", "gtools", "data.table", "S4Vectors", "chipseq",
             "IRanges", "Rsamtools", "DNAcopy", "GenomicAlignments",
             "GenomicRanges", "GenomeInfoDb", "BiocParallel",
             "futile.logger"))

In addition, CopywriteR requires the CopyhelpeR package, which can be downloaded as tarball (.tar.gz-file) from the CopyhelpeR releases webpage. Subsequently, the CopyhelpeR package can be installed from the command line using the following command:

$ R CMD INSTALL CopyhelpeR*.tar.gz

As the last step in the installation process, the latest CopywriteR package can be downloaded from the CopywriteR releases webpage and installed using the following command:

$ R CMD INSTALL CopywriteR*.tar.gz

Now you are all set to start your analysis.

CopywriteR usage:

Load the CopywriteR package in R using:

> library("CopywriteR")

CopywriteR contains three main functions:

preCopywriteR will generate a GRanges object containing mappability and GC-content information, and one containing 'blacklisted' regions that contain are subject to copy number variation. These 'helper' files can be created for any specified bin size that is a multiple of 1000 bp, and for any of the available reference genomes (hg18, hg19, hg38, mm9 and mm10). The helper files can be re-used and need to be created only once for every combination of reference genome and bin size. preCopywriteR uses information stored in pre-assembled 1kb bin mappability and GC-content GRanges objects to create the custom bin size helper files. These objects are stored in the CopyhelpeR annotation package.

preCopywriteR can be run as follows:

> preCopywriteR(output.folder, bin.size, ref.genome, prefix = "")

CopywriteR will generate separate tables with compensated read counts and log2-transformed normalized compensated read counts after correction for GC-content, mappability and upon removal of blacklisted regions.

> CopywriteR(sample.control, destination.folder, reference.folder,
             bp.param, capture.regions.file,
             keep.intermediary.files = FALSE)

plotCNA performs segmentation using the DNAcopy Bioconductor package, and plotting of copy number profiles.

> plotCNA(destinationFolder, set.nchrom)

For more details please refer to the CopywriteR vignette. Alternatively, one of the following commands can be used to show help files for the corresponding function:

> ?preCopywriteR
> ?CopywriteR
> ?plotCNA

Quick start

A typical analysis using CopywriteR could be as follows. First, CopywriteR needs to be loaded:

> library(CopywriteR)

Then, preCopywriteR can be run using the command:

> preCopywriteR(output.folder = file.path("./path/to/output/folder"),
                bin.size = 20000,
                ref.genome = "mm10",
                prefix = "")

Next, we need to specify the settings for parallel computing. We have implemented use of the BiocParallel package, which supports several different types of environments. For every environment, a BiocParallelParam can be specified that defines how parallel computation is executed. Below, we use a SnowParam instance of BiocParallelParam, which is based on the snow package. Please refer to the BiocParallel package for more information. A SnowParam using 12 CPUs can be defined as follows:

> bp.param <- SnowParam(workers = 12, type = "SOCK")

Next, we need to specify which samples and controls correspond to each other using the sample.control variable. For the CopywriteR function, controls are specified as those samples that will be used to identify which regions are 'peaks' and contain on-target reads. This information will then be used to remove on-target reads in the corresponding sample. We specify the sample.control variable as follows:

> path <- "./path/to/bam"
> samples <- list.files(path = path, pattern = "bam$", full.names = TRUE)
> controls <- samples[c(1:6, 1:6)]
> sample.control <- data.frame(samples, controls)

This might result in the following variable:

> sample.control
      samples   controls
1  ./C003.bam ./C003.bam
2  ./C016.bam ./C016.bam
3  ./C024.bam ./C024.bam
4  ./C037.bam ./C037.bam
5  ./C049.bam ./C049.bam
6  ./C055.bam ./C055.bam
7  ./M003.bam ./C003.bam
8  ./M016.bam ./C016.bam
9  ./M024.bam ./C024.bam
10 ./M037.bam ./C037.bam
11 ./M049.bam ./C049.bam
12 ./M055.bam ./C055.bam

Sequence data starting with 'M' could for instance be from a tumor sample, and the corresponding 'C' data set would from a matched germline sample. Please note that any sample that is to be used by the downstream plotCNA function needs to be analyzed by the CopywriteR function. Therefore, by including:

1  ./C003.bam ./C003.bam
2  ./C016.bam ./C016.bam
3  ./C024.bam ./C024.bam
4  ./C037.bam ./C037.bam
5  ./C049.bam ./C049.bam
6  ./C055.bam ./C055.bam

we make sure that in the plotCNA function we can analyze the tumor samples relative to the corresponding germline samples. We recommend identifying on-target and off-target regions based on a germline sample if possible, as this would avoid identifying highly amplified genomic regions in tumor cells as on-target regions. Nevertheless, we have observed that this effect is negligible in practice, and that CopywriteR analysis without a reference is still highly accurate. Please refer to Kuilman et al., 2015 for more details on how CopywriteR extracts copy number profiles from targeted sequencing.

We are now set for running CopywriteR:

> CopywriteR(sample.control = sample.control,
             destination.folder = file.path("./path/to/destination/folder"),
             reference.folder = file.path("./path/to/reference/folder", "mm10_20kb"),
             bp.param = bp.param)

Finally, we can segment the data using DNAcopy and plot the result using the plotCNA function:

> plotCNA(destination.folder = file.path("/path/to/destination/folder"))

By default, data will be analyzed and plotted according to the matching of samples and controls in the sample.control variable (specified above). Every sample in the samples column of the sample.control variable will be analyzed without a reference, and with the corresponding control as a reference. Optionally, the sample.plot argument can be used to control analysis and plotting by plotCNA. Please refer to the manual for more information.

Vignette code

The following packages can be optionally installed to allow running the vignette code:

> source("http://bioconductor.org/biocLite.R")
> biocLite(c("BiocStyle", "snow"))

In addition, the SCLCBam experiment package is also required. The latest SCLCBam package can be downloaded from the SCLCBam releases webpage and installed using the following command:

$ R CMD INSTALL SCLCBam*.tar.gz

Troubleshooting

In general, we advise to update to the latest versions of CopywriteR and CopyhelpeR in case of errors. If the problems persist, please refer below for troubleshooting purposes.

Bug

We have found out about a bug that is affecting situations where the number of comparisons between samples and controls exceeds the total number of unique samples AND controls. For instance, in the case below, the number of comparisons (3) exceeds the total number of unique samples and controls (./C003.bam and ./C016.bam; thus 2)

> sample.control
      samples   controls
1  ./C003.bam ./C003.bam
2  ./C016.bam ./C016.bam
3  ./C003.bam ./C016.bam

Error messages look like this: Error in [<-.data.frame(*tmp*, , "off.target", value = c(X, X, : replacement has Y rows, data has Z

The preferable way to solve this issue is by installing the newest CopywriteR version using Bioconductor. We have fixed the bug in the release (2.8.1) and development (2.9.0) versions of CopywriteR. Alternatively, I have provided two versions on the release page (under V2.6.1.2) that solve this issue. An alternative to installing a new package is to prevent the above situation from happening, which is usually possible too. Please contact us if you have more questions.

Mm10

We have come to realise that there was an bug in CopyhelpeR version 1.0.0 which leads to an error when creating mm10 helper files using the preCopywriteR function. If you need to use CopywriteR for mm10-based sequence data, please update to version 1.0.1 available at GitHub or via bioconductor (available with the next build).

Installation

One of the dependencies of CopywriteR (the chipseq package) requires Bioconductor 3.0. If installation fails, please check whether you are running the correct version of Bioconductor and whether all dependencies have been installed.

X11 and PNG

We have received feedback that CopywriteR throws the following error in a platform-dependent manner:

Error in log2.read.counts[selection, , drop = FALSE] : (subscript) logical subscript too long

In some (and possibly in all) cases this is due to missing X11 and/or PNG libraries. To test whether this is the case, one can type these

> x11()
> png()

in the R command line. If one of these libraries is indeed missing, you will get one or both of the following error message:

Error in x11(): X11 is not available

unable to start device PNG
In addition: Warning message:
no png support in this version of R

In this case, re-installing R after installation of the required libraries will solve the problem. This can be done as follows:

$ ## Install missing X11 libraries
$ sudo yum install libXt-devel
$ sudo yum install libX11-devel

$ ## Install missing PNG libraries
$ sudo yum install libpng
$ sudo yum install libpng-devel

$ ## Download and install R from source (we needed to use the additional
$ ##  --with-readline=no flag to allow installation)
$ curl -O http://cran.rstudio.com/src/base/R-3/R-3.2.0.tar.gz
$ ./configure --with-x
$ make

You can finish this exercise by re-installing CopywriteR. If this does not solve your issue, please let us know.

Unique naming of .bam files

CopywriteR uses the (base)name of a .bam file as an identifier for that file. Therefore, all the names of .bam files should be unique, and something like

> sample.control
                         samples                        controls
1  path/sample1/aligned_read.bam   path/sample2/aligned_read.bam

would result in an error. This can be solved by renaming the .bam files in such a way that all names become unique.

All .bam files should be processed by CopywriteR

Any sample that is used as a sample or a reference for analysis and plotting in the plotCNA function needs prior analysis as a sample in the CopywriteR function. As an example, if one would like to analyze and plot tumor1.bam relative to matched.normal1.bam in plotCNA, the sample.control variable should contain both of the two rows below:

> sample.control
                       samples                    controls
1  path/to/tumor1.bam          path/to/matched.normal1.bam
2  path/to/matched.normal1.bam path/to/matched.normal1.bam

If not all samples needed by plotCNA have been analyzed by CopywriteR, the following error message will be displayed: "One of the samples in [LIST OF .BAM FILES] refers to a BAM file that has not been processed in CopywriteR. Please make sure that you have provided the correct input files or re-run CopywriteR accordingly."

Number of off-target reads

CopywriteR relies on the off-target sequence read count, which in our hands is roughly 10% of the total amount of reads. You can find the relevant numbers in the log-file. Since the sequence reads are normally not required, a number of pipelines filter for or deplete sequence reads that are off-target. We recommend using the unprocessed .bam files to prevent discarding data for CopywriteR.

In addition to the above, we have never tested CopywriteR on sequence data upon PCR-based enrichment strategies. However, we believe it is likely that there will be insufficient off-target reads to allow high-quality copy number data using CopywriteR.

Chromosome names

CopywriteR by default assumes that the chromosome names in .bam files are "1", "2", ... "X", and "Y". These chromosome names are incorporated in the mappability and GC-content GRanges object, as well as in the blacklist GRanges object created by preCopywriteR. preCopywriteR has an optional prefix argument that can be used to created helper files with a chromosome notation using a prefix (for instance "chr" for "chr1", "chr2", ... notation).

Moreover, CopywriteR assumes that all .bam files that are analyzed together are mapped to the same reference genome. If this is not the case, please re-run multiple CopywriteR analyses with every analysis using .bam files mapped to the same reference genome.

Plotting

In order to allow plotting, the .bam files need to be present in the same folder as when the CopywriteR analysis was performed. Also, we have noted that due to changes in a CopywriteR dependency, chromosomes are plotted in the wrong order in older versions (< 2.0.4). Please update to version 2.0.4 to fix this problem.

Sorting of .bam files

The Rsamtools package requires .bam files to be sorted. If the .bam files that are to be analyzed are not sorted, CopywriteR will provide an error message. Unsorted .bam files can then be sorted using samtools as follows:

$ samtools sort sample.bam sample.sorted.bam

Speed

We have noted a ~1.5-fold reduction in total calculation time with the development version of R (3.2) and Bioconductor (3.1). These will soon be officially released, and in case the calculation time becomes a critical issue we recommend upgrading to these versions of R and Bioconductor.

Contact

We have tried to make the CopywriteR code readable and its use as easy as possible. For additional support next to GitHub, please refer the Bioconductor support site for CopywriteR-related posts or to submit a question regarding the use of CopywriteR. To report any bugs, please do not hesitate and contact:

Thomas and Oscar are working in the laboratory of Prof. Dr. Daniel S. Peeper.

Cited by

The CopywriteR tool has been cited and referenced by:

Reported bugs

None

Changes and additions we are currently working on



PeeperLab/CopywriteR documentation built on May 8, 2019, 1:28 a.m.