cofragr is an R package for cofragmentation patterns of cell-free DNA.
The cofragr package is still under development and hasn’t been released to CRAN, so you must install cofragr by using the devtools workflow. You can install devtools two different ways:
You can install the released version of devtools directly from CRAN:
install.packages("devtools")
You can also install the development version of devtools from GitHub:
devtools::install_github("r-lib/devtools")
You can learn more about devtools and get help with its installation here.
Now that you've installed devtools, you can use the following command to install cofragr:
devtools::install_github("epifluidlab/cofragr")
Congratulations! You've successfully downloaded and installed cofragr!
Note: If you're having any problems installing cofragr, make sure that you have the latest version of R installed.
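The installation steps above can be combined into one guarded helper that installs a package only when it is missing. This is a minimal sketch: the helper name `install_cofragr_if_missing` is hypothetical and not part of cofragr, and it only wraps the `install.packages` and `devtools::install_github` calls already shown.

```r
# Hypothetical helper (not part of cofragr): install devtools and cofragr
# only if they are not already available.
install_cofragr_if_missing <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  if (!requireNamespace("cofragr", quietly = TRUE)) {
    devtools::install_github("epifluidlab/cofragr")
  }
  # Returns TRUE invisibly when cofragr can be loaded afterwards.
  invisible(requireNamespace("cofragr", quietly = TRUE))
}

# Report the running R version, which helps when troubleshooting an install.
print(R.version.string)

# Uncomment to run the installation (requires network access):
# install_cofragr_if_missing()
```

The installation call is left commented out so that sourcing the snippet does not trigger a download by accident.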
Understanding how cofragr works can be crucial to using it successfully. This R package has three core functions: read_fragments, contact_matrix, and write_contact_matrix.
Calling read_fragments will read the fragment data file, which is typically a BED file. Since this function uses the package bedtools to read the fragment files, the arguments/options for range and genome follow the same format.
cofragr::read_fragments(file_path, range = NULL, genome = NULL)
range; default: range = NULL
Specifies the ranges of the BED file that read_fragments will read.
Note that range follows the standard genomic range notation, e.g. "chr1:1001-2000".
genome; default: genome = NULL
Specifies the reference genome for the BED file in question.
Note that genome can take any of the genome arguments that GenomeInfoDb::Seqinfo() can.
Again, you can read more about the options behind range and genome here.
Calling contact_matrix will calculate the contact matrix from the fragment data, which should have already been read by the read_fragments function.
cofragr::contact_matrix(frag, bin_size = 500e3L, n_workers = 1L, subsample = 10e3L, min_sample_size = 100L, bootstrap = 1L, seed = NULL)
Note that frag is the fragment data returned by the read_fragments function.
bin_size; default: bin_size = 500e3L
Specifies the size of the bins for the contact matrix in base-pair units.
n_workers; default: n_workers = 1L
Specifies the number of workers for the task at hand.
subsample; default: subsample = 10e3L
Specifies the sampling size for the contact matrix calculations.
Note: Here, subsampling means that for each genomic bin, we randomly select only 1 in every 10,000 fragments for the analysis. This compensates for the number of fragments differing between bins.
bootstrap; default: bootstrap = 1L
Specifies the number of bootstrap samples in the contact matrix calculations.
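To make bin_size concrete, the base R sketch below maps a few fragment start coordinates to 500 kb bins. Assigning a fragment to the bin that contains its start coordinate is an assumption made for illustration only; cofragr may use a different rule internally.

```r
bin_size <- 500e3L  # default value from the contact_matrix signature

# Start coordinates taken from the example fragment data file in this README.
frag_start <- c(19000035L, 19000044L, 19000045L, 19000049L)

# Assumption: a fragment belongs to the bin containing its start coordinate.
bin_start <- (frag_start %/% bin_size) * bin_size
bin_end <- bin_start + bin_size

bins <- data.frame(frag_start, bin_start, bin_end)
print(bins)
```

With the default bin size, bin boundaries fall on multiples of 500,000, which matches the start and end columns in the contact matrix output.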
Calling write_contact_matrix will write the contact matrix to a specified file path; the contact matrix should have already been calculated by the contact_matrix function.
cofragr::write_contact_matrix(cm, file_path, comments = NULL)
comments; default: comments = NULL
Specifies a character vector to be written at the top of the output BED file as a header.
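The three functions are meant to be chained: read the fragments, compute the contact matrix, then write it out. The sketch below wraps that sequence in a hypothetical helper, `run_cofragr_sketch`, using only the signatures listed above. The seed value and the comment text are arbitrary placeholders, and the helper needs cofragr to be installed before it is called.

```r
# Hypothetical wrapper (not part of cofragr) chaining the three core functions.
run_cofragr_sketch <- function(fragment_file, output_file) {
  # Step 1: read the fragment data file (range and genome left at defaults).
  frag <- cofragr::read_fragments(fragment_file, range = NULL, genome = NULL)

  # Step 2: compute the contact matrix with the documented defaults.
  # The seed is an arbitrary placeholder chosen for reproducibility.
  cm <- cofragr::contact_matrix(
    frag,
    bin_size = 500e3L,
    n_workers = 1L,
    subsample = 10e3L,
    min_sample_size = 100L,
    bootstrap = 1L,
    seed = 1L
  )

  # Step 3: write the result; the comment text is a placeholder.
  cofragr::write_contact_matrix(cm, output_file, comments = c("sample: example"))

  invisible(cm)
}

# Example call (file names are placeholders):
# run_cofragr_sketch("examplefile.bed.gz", "cofrag_cm.bed.gz")
```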
cofragr provides both R APIs and an R script for the analysis.
The pipeline requires fragment data files as input. A fragment data file is essentially a BED file, with each row representing a cfDNA fragment. Below is an example:
14 19000035 19000198 . 27 +
14 19000044 19000202 . 42 +
14 19000045 19000202 . 20 -
14 19000049 19000202 . 12 +
The six columns are: chrom, start, end, feature name (which we do not use), MAPQ (which is the smallest MAPQ of the two paired reads), and strand (which we do not use).
You can easily prepare the fragment data file from a query-sorted BAM file, or refer to our FinaleDB paper for more details (Zheng, Zhu, and Liu 2020).
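As a rough illustration of the six-column layout, the base R sketch below parses the four example rows and applies a MAPQ and fragment-length filter. The column names are illustrative (the file has no header), the thresholds mirror the --min-mapq 30, --min-fraglen 50 and --max-fraglen 350 options of the command-line script, and the filtering logic is an assumption; it is not how cofragr reads or filters fragments internally.

```r
# Example rows copied from the fragment data file shown above.
bed_text <- "14 19000035 19000198 . 27 +
14 19000044 19000202 . 42 +
14 19000045 19000202 . 20 -
14 19000049 19000202 . 12 +"

# Column names are illustrative; they follow the six columns described above.
frags <- read.table(
  text = bed_text,
  col.names = c("chrom", "start", "end", "name", "mapq", "strand"),
  colClasses = c("character", "integer", "integer",
                 "character", "integer", "character")
)

# Assuming standard BED coordinates, fragment length is end - start.
frags$frag_len <- frags$end - frags$start

# Hypothetical filter mirroring the script's MAPQ and length options.
keep <- frags$mapq >= 30 & frags$frag_len >= 50 & frags$frag_len <= 350
filtered <- frags[keep, ]
print(filtered)
```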
The quickest way to start is to run the R script on your dataset.
You can find the script in the R package installation:
system.file("extdata/scripts", "cofragr.R", package = "cofragr")
Or simply clone the source code from GitHub and find the script in the directory inst/extdata/scripts/.
Then you can run some variation of the following command, adjusting the arguments as needed.
Rscript cofragr.R \
  -i examplefile.bed.gz \
  -o output_file_path \
  --min-mapq 30 \
  --chroms 1:2:3 \
  --subsample 10000 \
  --min-fraglen 50 \
  --max-fraglen 350 \
  --sample-id example
We encourage you to look through the script file, cofragr.R, to understand all of the available arguments.
The script produces two output files: the calculated contact matrix and its accompanying index file.
cofrag_cm.bed.gz is the calculated contact matrix:
#chrom start end chrom2 start2 end2 score n_frag1 n_frag2 p_value p_value_sd
1 0 500000 1 0 500000 15.14129840182 435 435 0.58172039812 .
1 0 500000 1 500000 1000000 0 435 2539 0 .
1 0 500000 1 1000000 1500000 0 435 3997 0 .
1 0 500000 1 1500000 2000000 0 435 3707 0 .
1 0 500000 1 2000000 2500000 0 435 4361 0 .
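The output is a long-format table with one row per pair of bins. The base R sketch below parses the example rows and places the scores into a square matrix indexed by bin. It assumes that "." marks a missing value and that all bins share the size of the first row; the reshaping is only one possible way to inspect the file and is not part of cofragr.

```r
# Example rows copied from the cofrag_cm.bed.gz excerpt above.
cm_text <- "#chrom start end chrom2 start2 end2 score n_frag1 n_frag2 p_value p_value_sd
1 0 500000 1 0 500000 15.14129840182 435 435 0.58172039812 .
1 0 500000 1 500000 1000000 0 435 2539 0 .
1 0 500000 1 1000000 1500000 0 435 3997 0 .
1 0 500000 1 1500000 2000000 0 435 3707 0 .
1 0 500000 1 2000000 2500000 0 435 4361 0 ."

# Drop the leading '#' so the header line is not treated as a comment.
cm_text <- sub("^#", "", cm_text)

# Assumption: "." denotes a missing value. Chromosome names stay as text.
cm <- read.table(
  text = cm_text,
  header = TRUE,
  na.strings = ".",
  colClasses = c(chrom = "character", chrom2 = "character")
)

# Derive 1-based bin indices from the bin size of the first row.
bin_size <- cm$end[1] - cm$start[1]
bin1 <- cm$start / bin_size + 1
bin2 <- cm$start2 / bin_size + 1

# Fill a square matrix; bin pairs absent from the excerpt remain NA.
n_bins <- max(bin1, bin2)
score_mat <- matrix(NA_real_, nrow = n_bins, ncol = n_bins)
score_mat[cbind(bin1, bin2)] <- cm$score
print(score_mat)
```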
For large-scale real applications, we suggest using workflow management tools such as snakemake or nextflow. We provide a snakemake file. Similar to cofragr.R, you can find it:
In the R package installation:
system.file("extdata/scripts", "cofragr.smk", package = "cofragr")
Or simply clone the source code from GitHub and find the file in the directory inst/extdata/scripts/.
Note: A snakemake workflow usually depends on your specific computing environment, so the provided snakemake file is for reference only. Review both the snakemake file and the Rscript command above, and modify them accordingly.
See LICENSE for more information.