cancermutations: Mutations in tumor DNA

Description

The dataset contains mutated and non-mutated genomic positions obtained from tumor samples from 505 cancer patients (see Source). The dataset consists of a random sample of genomic positions that cover around 0.4% of the genome. The data is formatted for multinomial regression analysis of the mutation rate.

Usage

data(cancermutations)

Format

A data frame with 1'092'000 observations on the following 14 variables:

sample_id

factor. TCGA patient ID, prefixed with the cancer type.

cancer_type

factor. Cancer type as assigned by TCGA.

expression

numeric. Cancer-type-specific gene expression level.

phyloP

numeric. PhyloP score.

replication_timing

numeric. Replication timing.

strong, CpG, apobec

numeric. strong: C:G position; CpG: CpG position (or reverse complement); apobec: TpCpA or TpCpT position (or reverse complement).

neighbors

factor. Left and right neighboring nucleotides (on the strand where the C or T lies, assuming strand symmetry).

NO, I, VA, VG, YES

integer. Number of mutations of this type (see Details).
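The site annotations strong, CpG, apobec and neighbors can be illustrated with a small sketch that folds a trinucleotide context onto the strand carrying the C or T. The helper name site_flags, the 0/1 encoding and the two-letter format of neighbors are assumptions made for illustration; the dataset's actual encoding may differ.

```r
# Hypothetical helper (not part of the package): derive the site annotations
# from a trinucleotide context given as left neighbor, center base, right neighbor.
complement <- c(A = "T", C = "G", G = "C", T = "A")

site_flags <- function(left, center, right) {
  # Strand symmetry: if the center base is a purine, switch to the reverse
  # complement so that the center base is always C or T.
  if (center %in% c("A", "G")) {
    folded_left  <- unname(complement[right])
    folded_right <- unname(complement[left])
    center <- unname(complement[center])
    left   <- folded_left
    right  <- folded_right
  }
  data.frame(
    neighbors = paste0(left, right),   # assumed format: left and right base pasted together
    strong    = as.numeric(center == "C"),                   # C:G position
    CpG       = as.numeric(center == "C" && right == "G"),   # C followed by G
    apobec    = as.numeric(left == "T" && center == "C" &&
                           right %in% c("A", "T")),          # TpCpA or TpCpT
    stringsAsFactors = FALSE
  )
}

site_flags("T", "C", "A")  # a TpCpA site
site_flags("C", "G", "A")  # CGA is read as its reverse complement TCG
```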

Details

The genomic positions are classified according to the following genomic properties: expression level, phyloP score, replication timing, strong site, CpG site, apobec site, neighboring sites (see Format for details). For each type of position, the number of mutations (YES) and the number of non-mutated positions (NO) per sample are counted. In addition, different types of mutations are counted, assuming strand-symmetry: I (transition), VA (transversion to an A:T basepair), VG (transversion to a G:C basepair). Only single-nucleotide variants are considered.
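As a rough illustration of how these count columns can be used, the sketch below aggregates a toy data frame by cancer type and computes an empirical mutation rate. The values and the cancer type labels are made up, and the toy data assumes that YES equals I + VA + VG, which the description suggests but does not state explicitly.

```r
# Toy rows with the count columns described above; all values are invented.
toy <- data.frame(
  cancer_type = c("BRCA", "BRCA", "LUAD", "LUAD"),  # placeholder labels
  NO = c(990L, 1995L, 980L, 2970L),
  I  = c(6L, 3L, 10L, 15L),
  VA = c(3L, 1L, 6L, 10L),
  VG = c(1L, 1L, 4L, 5L)
)
# Assumption: YES is the total over the three mutation types.
toy$YES <- toy$I + toy$VA + toy$VG

# Sum the counts within each cancer type.
count_cols <- c("NO", "I", "VA", "VG", "YES")
counts <- aggregate(toy[, count_cols],
                    by = list(cancer_type = toy$cancer_type),
                    FUN = sum)

# Empirical mutation rate: mutated positions over all positions.
counts$rate <- counts$YES / (counts$YES + counts$NO)
# Share of transitions among all mutations.
counts$frac_transition <- counts$I / counts$YES

counts
```

With the real data, the same aggregation could be grouped by any of the genomic properties instead of cancer_type.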

The genomic properties expression, phyloP score and replication timing are originally measured on a (pseudo-)continuous scale. Here, they are binned by quintiles and the quintile means are used. For the expression level, this is done separately for each cancer type.
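The binning step can be sketched in base R as follows. This is a minimal illustration on simulated values, not the authors' preprocessing code: the helper quintile_means, the column expression_raw and the cancer type labels are hypothetical, and the sketch relies on R's default quantile definition and on data without ties at the break points.

```r
# Hypothetical helper: replace each value by the mean of its quintile bin.
quintile_means <- function(x) {
  # Quintile boundaries (R's default quantile type; an assumption).
  breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
  # Assign each value to one of five bins. Note that cut() fails if the
  # breaks are not unique, which can happen with heavily tied data.
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  # Replace every value by the mean of its bin.
  ave(x, bin, FUN = mean)
}

set.seed(1)
toy <- data.frame(
  cancer_type    = rep(c("typeA", "typeB"), each = 500),  # placeholder labels
  expression_raw = c(rexp(500, rate = 1), rexp(500, rate = 0.2))
)

# Bin separately within each cancer type, as described for expression.
toy$expression <- ave(toy$expression_raw, toy$cancer_type, FUN = quintile_means)

# Number of distinct binned values per cancer type.
tapply(toy$expression, toy$cancer_type, function(v) length(unique(v)))
```

Because each value is replaced by its bin mean, the overall mean within each cancer type is unchanged by the binning.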

The rows are deliberately left unsorted, so that a subset of consecutive rows does not contain only a few of the factor levels.

Author(s)

Johanna Bertl, Qianyun Guo

Source

Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S.; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879 http://www.biorxiv.org/content/early/2017/06/21/122879

Sources of the underlying datasets:

Mutations

Fredriksson, N. J.; Ny, L.; Nilsson, J. A. & Larsson, E. Systematic Analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nature Genetics, 2014, 46, 1258-1263

Reference genome

hg19

Gene expression

The Cancer Genome Atlas

PhyloP score

Pollard, K. S.; Hubisz, M. J.; Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research, 2010, 20, 110-121

Replication timing

Chen, C.-L.; Rappailles, A.; Duquenne, L.; Huvet, M.; Guilbaud, G.; Farinelli, L.; Audit, B.; d'Aubenton Carafa, Y.; Arneodo, A.; Hyrien, O. & Thermes, C. Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Research, 2010, 20, 447-457


MultinomialMutations/MultinomialMutations documentation built on May 22, 2019, 4:39 p.m.