R/cancermutations.R

#' Mutations in tumor DNA
#'
#' The dataset contains mutated and non-mutated genomic positions obtained from tumor samples from 505 cancer patients (see Source). The dataset consists of a random sample of genomic positions that cover around 0.4\% of the genome. The data is formatted for multinomial regression analysis of the mutation rate.
#'
#' The genomic positions are classified according to the following genomic properties: expression level, phyloP score, replication timing, strong site, CpG site, apobec site, neighboring sites (see Format for details). For each type of position, the number of mutations (YES) and the number of non-mutated positions (NO) per sample are counted. In addition, different types of mutations are counted, assuming strand-symmetry: I (transition), VA (transversion to an A:T basepair), VG (transversion to a G:C basepair). Only single-nucleotide variants are considered.
#'
#' The genomic properties expression, phyloP score and replication timing are originally measured on a (pseudo-)continuous scale. Here, they are binned by quintiles and the quintile means are used. For the expression level, this is done separately for each cancer type.
#'
#' The rows are deliberately not sorted, so that a subset of consecutive rows does not contain only very few factor levels.
#'
#' @format A data frame with 1'092'000 observations on the following 14 variables:
#' \describe{
#'  \item{sample_id}{factor. Patient ID by TCGA with the cancer type added in front.}
#'  \item{cancer_type}{factor. Cancer type by TCGA.}
#'  \item{expression}{numeric. Cancer type specific gene expression level.}
#'  \item{phyloP}{numeric. PhyloP score.}
#'  \item{replication_timing}{numeric. Replication timing.}
#'  \item{strong, CpG, apobec}{numeric. strong: C:G position; CpG: CpG position (or reverse complement); apobec: TpCpA or TpCpT position (or reverse complement).}
#'  \item{neighbors}{factor. Left and right neighboring nucleotides (on the strand where the C or T lies, assuming strand-symmetry).}
#'  \item{NO, I, VA, VG, YES}{integer. Number of positions of this type: NO counts non-mutated positions; I, VA, VG and YES count mutations (see Details).}
#'  }
#'
#' @source Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879 \url{http://www.biorxiv.org/content/early/2017/06/21/122879}
#'
#' Sources of the underlying datasets:
#'
#' \describe{
#'  \item{Mutations}{Fredriksson, N. J.; Ny, L.; Nilsson, J. A. & Larsson, E. Systematic Analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nature Genetics, 2014, 46, 1258-1263}
#'  \item{Reference genome}{hg19}
#'  \item{Gene expression}{The Cancer Genome Atlas}
#'  \item{PhyloP score}{Pollard, K. S.; Hubisz, M. J.; Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research, 2010, 20, 110-121}
#'  \item{Replication timing}{Chen, C.-L.; Rappailles, A.; Duquenne, L.; Huvet, M.; Guilbaud, G.; Farinelli, L.; Audit, B.; d'Aubenton Carafa, Y.; Arneodo, A.; Hyrien, O. & Thermes, C. Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Research, 2010, 20, 447-457}
#' }
#'
#' @author Johanna Bertl, Qianyun Guo
"cancermutations"
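
# A minimal usage sketch, not part of the package itself. It may help to see
# how the documented count columns and the quintile binning fit together.
# Everything below runs on a made-up toy data frame (`toy`), so the values are
# not real data; only the column names follow the @format section above.
# Two assumptions are made for illustration: YES is taken to be the sum of
# I, VA and VG, and the binning is done with base R `quantile()`, `cut()` and
# `ave()`, which is not necessarily how the authors prepared the data.

```r
set.seed(1)
n <- 200

# Toy stand-in for the cancermutations data frame (hypothetical values).
toy <- data.frame(
  cancer_type = factor(sample(c("typeA", "typeB"), n, replace = TRUE)),
  strong      = rbinom(n, 1, 0.4),
  NO          = rpois(n, 1000),
  I           = rpois(n, 2),
  VA          = rpois(n, 1),
  VG          = rpois(n, 1)
)

# Assumption: YES is the total over the three mutation types.
toy$YES <- toy$I + toy$VA + toy$VG

# Pooled mutation rate per cancer type: mutated / all counted positions.
counts <- aggregate(cbind(YES, NO) ~ cancer_type, data = toy, FUN = sum)
counts$rate <- counts$YES / (counts$YES + counts$NO)
print(counts)

# Quintile binning of a continuous covariate, replacing each value by the
# mean of its quintile (x is a hypothetical covariate, e.g. a phyloP score).
x <- rnorm(n)
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
x_binned <- ave(x, bin, FUN = mean)
```

# With the real data, `toy` would be replaced by `cancermutations`, and the
# columns NO, I, VA and VG would form the multinomial response.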