add_counts: Adding pseudo counts to the number of mutations


Description

The function add_counts adds pseudo counts to the number of mutations to incorporate prior information and avoid counts of zero.

Usage

add_counts(x, reference = NULL, categorical, make.integers = F)

Arguments

x

data frame with a set of count columns named NO, I, VA, VG (and YES).

reference

reference data frame of the same format. If reference=NULL, x is used. Using a reference data frame other than x is currently not implemented.

categorical

names of the categorical variables. This vector should not include cancer_type and sample_id.

make.integers

logical. Should the final output contain integer counts only?

Details

The function uses functionality from the package data.table.

Let c_1, ..., c_n be a set of categorical variables and v_1, ..., v_m the remaining (categorical or continuous) explanatory variables in the dataset, except for the two variables sample_id and cancer_type.

First, the function checks whether the count of each mutation type, denoted by nI, nVA and nVG, is positive for each combination of c_1 x ... x c_n x sample_id. If any of them is zero, pseudo counts are added to nNO, nI, nVA and nVG.
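The check described above can be illustrated with a minimal base R sketch. It is not the package's implementation (which relies on data.table); the toy data, the column `region` and the object names are hypothetical and chosen only for illustration.

```r
# Toy data: two samples, one categorical variable (strong) and one further
# explanatory variable (region, a hypothetical column).
toy <- data.frame(
  sample_id = c("s1", "s1", "s2", "s2"),
  strong    = c(1, 1, 1, 1),
  region    = c("a", "b", "a", "b"),
  NO = c(500, 498, 499, 500),
  I  = c(1, 1, 1, 0),
  VA = c(0, 1, 0, 0),
  VG = c(1, 0, 0, 0)
)

# Sum each mutation type over every combination of categorical variables x sample_id.
totals <- aggregate(cbind(I, VA, VG) ~ sample_id + strong, data = toy, FUN = sum)

# Flag a combination if any mutation type has a total of zero.
totals$zero <- rowSums(totals[, c("I", "VA", "VG")] == 0) > 0
totals
```

In this toy example sample s2 has no VA or VG mutations, so it is the one flagged for pseudo counts.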

The pseudo counts are obtained from nNO, nI, nVA and nVG for each combination of c_1 x ... x c_n x v_1 x ... x v_m x cancer_type and added to the observed counts for each combination c_1 x ... x c_n x v_1 x ... x v_m x sample_id. The pseudo counts and the observed counts are weighted equally. The new sum nNO + nI + nVA + nVG is the same as the original sum, so the number of sites of a specific category is preserved and the size of the genome doesn't change.
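As a worked illustration of equal weighting with a preserved total, the following base R sketch handles a single combination. It assumes that the cancer-type counts are first rescaled to the number of sites observed in the sample and then averaged with the observed counts; this is one reading of the description above, not necessarily the exact formula used by add_counts, and all numbers and object names are made up.

```r
# Observed counts for one sample in one combination of explanatory variables.
observed <- c(NO = 998, I = 2, VA = 0, VG = 0)

# Counts pooled over all samples of the same cancer type (made-up numbers).
cancer_type_counts <- c(NO = 9900, I = 50, VA = 30, VG = 20)

# Assumption: rescale the pooled counts to the sample's number of sites ...
n_sites <- sum(observed)
pseudo  <- cancer_type_counts * n_sites / sum(cancer_type_counts)

# ... and give pseudo counts and observed counts equal weight.
smoothed <- 0.5 * observed + 0.5 * pseudo

smoothed       # no mutation type has a zero count any more
sum(smoothed)  # equals sum(observed), so the number of sites is preserved
```

Under these assumptions the number of mutations (I + VA + VG) in the sample changes from 2 to 6, while the number of sites stays at 1000.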

If make.integers=T is used, the sum is preserved only approximately, because after adjusting to the number of observed sites, the counts are rounded up with the ceiling function (to avoid zero counts). This avoids very small non-integer counts that can induce the same numerical problems as zero counts, but it also increases the total count and can cause quite substantial biases.
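The effect of rounding up can be seen in a small base R sketch. The fractional counts are made-up values for a single combination, not output of add_counts.

```r
# Hypothetical non-integer counts after adding pseudo counts.
fractional <- c(NO = 994, I = 3.5, VA = 1.5, VG = 1)

# Rounding up guarantees that no positive count becomes zero ...
rounded <- ceiling(fractional)

# ... but the total can only stay the same or grow.
sum(fractional)
sum(rounded)
```

Here the total grows by one site and the number of mutations by one, which is a large relative change for a category with few mutations.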

Note that the number of mutations per sample is not preserved.

Value

A data frame (or data table) of the exact same format as the input table x with an additional logical column 'zero' (indicating the addition of pseudocounts because of a zero mutation count).

Author(s)

Johanna Bertl

References

Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879 http://www.biorxiv.org/content/early/2017/06/21/122879

See Also

add_counts_pres_mut – does the same, but preserving the number of mutations per sample.

Examples

# Adding a prior to the example data

data(cancermutations)
newdata = add_counts(cancermutations, categorical=c("strong", "neighbors"), make.integers=TRUE)

# Looking at a sample with few mutations to see the effect of the imputation:

sample02 = cancermutations[cancermutations$sample_id=="GBM_TCGA_02_2483_01A" & cancermutations$strong==1 & cancermutations$neighbors=="TG",]
new02 = newdata[newdata$sample_id=="GBM_TCGA_02_2483_01A" & newdata$strong==1 & newdata$neighbors=="TG",]

# number of mutations before adding the prior:
sum(sample02$YES)
# number of mutations after adding the prior:
sum(new02$YES)

MultinomialMutations/MultinomialMutations documentation built on May 22, 2019, 4:39 p.m.