Current methods for detection of copy number variants and aberrations (CNV and CNA) from targeted sequencing data are based on the depth of coverage of captured exons. Accurate CNA determination is complicated by uneven genomic distribution and non-uniform capture efficiency of targeted exons. Here we present CopywriteR which eludes these problems by exploiting ‘off-target’ sequence reads. CopywriteR allows for extracting uniformly distributed copy number information, can be used without reference and can be applied to sequencing data obtained from various techniques including chromatin immunoprecipitation and target enrichment on small gene panels. CopywriteR outperforms existing methods and constitutes a widely applicable alternative to available tools.
We are happy to announce that the CopywriteR package has been accepted in Bioconductor and has been released as of 17 April 2015. All the Bioconductor packages in the newest release have a dependency for the newest version of R (version 3.2) and for Bioconductor version 3.1. If you have these installed, CopywriteR can be installed as follows:
> source("http://bioconductor.org/biocLite.R") > biocLite("CopywriteR")
We have developed CopywriteR with R 3.1 and Bioconductor 3.0, and therefore know that it works fine with these version too. In case you are running these, you can still install CopywriteR as is explained below.
CopywriteR can be installed without Bioconductor, which is useful when you have R 3.1 and Bioconductor 3.0 installed. Installation should be performed as follows:
> source("http://bioconductor.org/biocLite.R") > biocLite(c("matrixStats", "gtools", "data.table", "S4Vectors", "chipseq", "IRanges", "Rsamtools", "DNAcopy", "GenomicAlignments", "GenomicRanges", "GenomeInfoDb", "BiocParallel", "futile.logger"))
In addition, CopywriteR requires the CopyhelpeR package, which can be downloaded as tarball (.tar.gz-file) from the CopyhelpeR releases webpage. Subsequently, the CopyhelpeR package can be installed from the command line using the following command:
$ R CMD INSTALL CopyhelpeR*.tar.gz
As the last step in the installation process, the latest CopywriteR package can be downloaded from the CopywriteR releases webpage and installed using the following command:
$ R CMD INSTALL CopywriteR*.tar.gz
Now you are all set to start your analysis.
Load the CopywriteR package in R using:
CopywriteR contains three main functions:
preCopywriteR will generate a GRanges object containing mappability and
GC-content information, and one containing 'blacklisted' regions that contain
are subject to copy number variation. These 'helper' files can be created for
any specified bin size that is a multiple of 1000 bp, and for any of the
available reference genomes (hg18, hg19, hg38, mm9 and mm10). The helper files
can be re-used and need to be created only once for every combination of
reference genome and bin size.
preCopywriteR uses information stored in
pre-assembled 1kb bin mappability and GC-content GRanges objects to create the
custom bin size helper files. These objects are stored in the CopyhelpeR
preCopywriteR can be run as follows:
> preCopywriteR(output.folder, bin.size, ref.genome, prefix = "")
CopywriteR will generate separate tables with compensated read counts and
log2-transformed normalized compensated read counts after correction for
GC-content, mappability and upon removal of blacklisted regions.
> CopywriteR(sample.control, destination.folder, reference.folder, bp.param, capture.regions.file, keep.intermediary.files = FALSE)
plotCNA performs segmentation using the DNAcopy Bioconductor package, and
plotting of copy number profiles.
> plotCNA(destinationFolder, set.nchrom)
For more details please refer to the CopywriteR vignette. Alternatively, one of the following commands can be used to show help files for the corresponding function:
> ?preCopywriteR > ?CopywriteR > ?plotCNA
A typical analysis using CopywriteR could be as follows. First, CopywriteR needs to be loaded:
Then, preCopywriteR can be run using the command:
> preCopywriteR(output.folder = file.path("./path/to/output/folder"), bin.size = 20000, ref.genome = "mm10", prefix = "")
Next, we need to specify the settings for parallel computing. We have implemented use of the BiocParallel package, which supports several different types of environments. For every environment, a BiocParallelParam can be specified that defines how parallel computation is executed. Below, we use a SnowParam instance of BiocParallelParam, which is based on the snow package. Please refer to the BiocParallel package for more information. A SnowParam using 12 CPUs can be defined as follows:
> bp.param <- SnowParam(workers = 12, type = "SOCK")
Next, we need to specify which samples and controls correspond to each other using the sample.control variable. For the CopywriteR function, controls are specified as those samples that will be used to identify which regions are 'peaks' and contain on-target reads. This information will then be used to remove on-target reads in the corresponding sample. We specify the sample.control variable as follows:
> path <- "./path/to/bam" > samples <- list.files(path = path, pattern = "bam$", full.names = TRUE) > controls <- samples[c(1:6, 1:6)] > sample.control <- data.frame(samples, controls)
This might result in the following variable:
> sample.control samples controls 1 ./C003.bam ./C003.bam 2 ./C016.bam ./C016.bam 3 ./C024.bam ./C024.bam 4 ./C037.bam ./C037.bam 5 ./C049.bam ./C049.bam 6 ./C055.bam ./C055.bam 7 ./M003.bam ./C003.bam 8 ./M016.bam ./C016.bam 9 ./M024.bam ./C024.bam 10 ./M037.bam ./C037.bam 11 ./M049.bam ./C049.bam 12 ./M055.bam ./C055.bam
Sequence data starting with 'M' could for instance be from a tumor sample, and the corresponding 'C' data set would from a matched germline sample. Please note that any sample that is to be used by the downstream plotCNA function needs to be analyzed by the CopywriteR function. Therefore, by including:
1 ./C003.bam ./C003.bam 2 ./C016.bam ./C016.bam 3 ./C024.bam ./C024.bam 4 ./C037.bam ./C037.bam 5 ./C049.bam ./C049.bam 6 ./C055.bam ./C055.bam
we make sure that in the plotCNA function we can analyze the tumor samples relative to the corresponding germline samples. We recommend identifying on-target and off-target regions based on a germline sample if possible, as this would avoid identifying highly amplified genomic regions in tumor cells as on-target regions. Nevertheless, we have observed that this effect is negligible in practice, and that CopywriteR analysis without a reference is still highly accurate. Please refer to Kuilman et al., 2015 for more details on how CopywriteR extracts copy number profiles from targeted sequencing.
We are now set for running CopywriteR:
> CopywriteR(sample.control = sample.control, destination.folder = file.path("./path/to/destination/folder"), reference.folder = file.path("./path/to/reference/folder", "mm10_20kb"), bp.param = bp.param)
Finally, we can segment the data using DNAcopy and plot the result using the plotCNA function:
> plotCNA(destination.folder = file.path("/path/to/destination/folder"))
By default, data will be analyzed and plotted according to the matching of samples and controls in the sample.control variable (specified above). Every sample in the samples column of the sample.control variable will be analyzed without a reference, and with the corresponding control as a reference. Optionally, the sample.plot argument can be used to control analysis and plotting by plotCNA. Please refer to the manual for more information.
The following packages can be optionally installed to allow running the vignette code:
> source("http://bioconductor.org/biocLite.R") > biocLite(c("BiocStyle", "snow"))
In addition, the SCLCBam experiment package is also required. The latest SCLCBam package can be downloaded from the SCLCBam releases webpage and installed using the following command:
$ R CMD INSTALL SCLCBam*.tar.gz
In general, we advise to update to the latest versions of CopywriteR and CopyhelpeR in case of errors. If the problems persist, please refer below for troubleshooting purposes.
We have come to realise that there was an bug in CopyhelpeR version 1.0.0 which leads to an error when creating mm10 helper files using the preCopywriteR function. If you need to use CopywriteR for mm10-based sequence data, please update to version 1.0.1 available at GitHub or via bioconductor (available with the next build).
One of the dependencies of CopywriteR (the chipseq package) requires Bioconductor 3.0. If installation fails, please check whether you are running the correct version of Bioconductor and whether all dependencies have been installed.
We have received feedback that CopywriteR throws the following error in a platform-dependent manner:
Error in log2.read.counts[selection, , drop = FALSE] : (subscript) logical subscript too long
In some (and possibly in all) cases this is due to missing X11 and/or PNG libraries. To test whether this is the case, one can type these
> x11() > png()
in the R command line. If one of these libraries is indeed missing, you will get one or both of the following error message:
Error in x11(): X11 is not available unable to start device PNG In addition: Warning message: no png support in this version of R
In this case, re-installing R after installation of the required libraries will solve the problem. This can be done as follows:
$ ## Install missing X11 libraries $ sudo yum install libXt-devel $ sudo yum install libX11-devel $ ## Install missing PNG libraries $ sudo yum install libpng $ sudo yum install libpng-devel $ ## Download and install R from source (we needed to use the additional $ ## --with-readline=no flag to allow installation) $ curl -O http://cran.rstudio.com/src/base/R-3/R-3.2.0.tar.gz $ ./configure --with-x $ make
You can finish this exercise by re-installing CopywriteR. If this does not solve your issue, please let us know.
CopywriteR uses the (base)name of a .bam file as an identifier for that file. Therefore, all the names of .bam files should be unique, and something like
> sample.control samples controls 1 path/sample1/aligned_read.bam path/sample2/aligned_read.bam
would result in an error. This can be solved by renaming the .bam files in such a way that all names become unique.
Any sample that is used as a sample or a reference for analysis and plotting in the plotCNA function needs prior analysis as a sample in the CopywriteR function. As an example, if one would like to analyze and plot tumor1.bam relative to matched.normal1.bam in plotCNA, the sample.control variable should contain both of the two rows below:
> sample.control samples controls 1 path/to/tumor1.bam path/to/matched.normal1.bam 2 path/to/matched.normal1.bam path/to/matched.normal1.bam
If not all samples needed by plotCNA have been analyzed by CopywriteR, the following error message will be displayed: "One of the samples in [LIST OF .BAM FILES] refers to a BAM file that has not been processed in CopywriteR. Please make sure that you have provided the correct input files or re-run CopywriteR accordingly."
CopywriteR relies on the off-target sequence read count, which in our hands is roughly 10% of the total amount of reads. You can find the relevant numbers in the log-file. Since the sequence reads are normally not required, a number of pipelines filter for or deplete sequence reads that are off-target. We recommend using the unprocessed .bam files to prevent discarding data for CopywriteR.
In addition to the above, we have never tested CopywriteR on sequence data upon PCR-based enrichment strategies. However, we believe it is likely that there will be insufficient off-target reads to allow high-quality copy number data using CopywriteR.
CopywriteR by default assumes that the chromosome names in .bam files are "1", "2", ... "X", and "Y". These chromosome names are incorporated in the mappability and GC-content GRanges object, as well as in the blacklist GRanges object created by preCopywriteR. preCopywriteR has an optional prefix argument that can be used to created helper files with a chromosome notation using a prefix (for instance "chr" for "chr1", "chr2", ... notation).
Moreover, CopywriteR assumes that all .bam files that are analyzed together are mapped to the same reference genome. If this is not the case, please re-run multiple CopywriteR analyses with every analysis using .bam files mapped to the same reference genome.
In order to allow plotting, the .bam files need to be present in the same folder as when the CopywriteR analysis was performed. Also, we have noted that due to changes in a CopywriteR dependency, chromosomes are plotted in the wrong order in older versions (< 2.0.4). Please update to version 2.0.4 to fix this problem.
The Rsamtools package requires .bam files to be sorted. If the .bam files that are to be analyzed are not sorted, CopywriteR will provide an error message. Unsorted .bam files can then be sorted using samtools as follows:
$ samtools sort sample.bam sample.sorted.bam
We have noted a ~1.5-fold reduction in total calculation time with the development version of R (3.2) and Bioconductor (3.1). These will soon be officially released, and in case the calculation time becomes a critical issue we recommend upgrading to these versions of R and Bioconductor.
We have tried to make the CopywriteR code readable and its use as easy as possible. If any questions arise regarding the package, or if you want to report any bugs, please do not hesitate and contact:
Thomas and Oscar are working in the laboratory of Prof. Dr. Daniel S. Peeper.
The CopywriteR tool has been cited and referenced by:
/in folder path names
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.