count_table_prep_multinom: Count table for multinomial regression

Description

count_table_prep_multinom imputes missing values and reshapes the raw count table for multinomial logistic regression on the strand-symmetric mutation model.

Usage

count_table_prep_multinom(count_table, m, impute = T, strong = T, CpG = T,
  apobec = T, neighbors = T, DNase1_dummy = F, expression_dummy = F,
  data_source = "fredriksson")

Arguments

count_table

data frame. Raw count table. See examples.

m

integer. Number of multiple imputations. 0 means no imputation.

impute

logical. Impute missing values or remove them? See details.

strong

logical. Should the variable strong be computed?

CpG

logical.

apobec

logical.

neighbors

logical.

DNase1_dummy

logical. Should a dummy variable be computed for DNase1 peaks? If yes, NAs in the original variable DNase1 are replaced by zero. This means that DNase1 := DNase1*I(DNase1_dummy==1).

expression_dummy

logical. Should a dummy variable be computed for expression measure available? If yes, NAs in the original variable expression are replaced by zero.

data_source

character. Either "fredriksson" or "pcawg".
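The effect of DNase1_dummy (and, analogously, expression_dummy) described above can be illustrated on a toy data frame. This is only a sketch in base R, not the package's implementation: the values are invented, and it assumes that the dummy marks the rows where the original variable was observed.

```r
# Toy data with missing DNase1 values (invented numbers, not package data).
toy <- data.frame(DNase1 = c(0.8, NA, 0.3, NA))

# Assumption: the dummy is 1 where DNase1 was observed and 0 where it was NA.
toy$DNase1_dummy <- as.integer(!is.na(toy$DNase1))

# Replace the NAs by zero, so that DNase1 equals DNase1 * I(DNase1_dummy == 1).
toy$DNase1[is.na(toy$DNase1)] <- 0

print(toy)
```

After this step the variable contains no NAs, and the dummy keeps track of which zeros stand for missing measurements.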

Details

If impute=T and m>=1, missing data are imputed m times. If impute=T and m=0, missing values are left untouched as NA. If impute=F, the value of m is ignored and only the complete cases are returned (selected with the function complete.cases).
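The three cases can be summarised in a small sketch. The helper describe_mode and the toy data frame are hypothetical and only mirror the branching documented here. They are not part of the package, and the actual imputation done by mimp is not reproduced.

```r
# Toy table with missing values (invented numbers and column names).
toy <- data.frame(
  cancer_type = c("A", "A", "B", "B"),
  expression  = c(1.2, NA, 0.7, 2.1),
  DNase1      = c(0.5, 0.1, NA, 0.9)
)

# Hypothetical helper: states which of the documented cases applies.
describe_mode <- function(impute, m) {
  if (!impute) return("complete cases only, m is ignored")
  if (m >= 1) return(sprintf("impute %d times", as.integer(m)))
  "keep missing values as NA"
}

describe_mode(impute = TRUE,  m = 2)
describe_mode(impute = TRUE,  m = 0)
describe_mode(impute = FALSE, m = 5)

# The impute = FALSE case corresponds to keeping only rows without any NA:
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```

In the toy table, rows 2 and 3 each contain one NA, so only rows 1 and 4 remain in toy_complete.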

The packages data.table and reshape2 are used for efficient and fast handling of the large mutation datasets. Multiple imputation is handled by the function mimp.

Note that this function uses a few small functions from small_dataprep_functions.R.

The Fredriksson and PCAWG datasets are handled in the same way, apart from the location information that is removed from the cancer type in the PCAWG set.

Value

A list that consists of the following elements:

imputed

a list of length min(m, 1) of imputed or complete data.tables (data.frames)

missing

a named vector giving the number of sites with missing values

total_count

an integer value giving the total number of sites (to compute proportions of missing sites)
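As a sketch of how missing and total_count might be combined, the following uses a hand-made list as a stand-in for the return value. The element names follow the description above, but the numbers and the names of the vector entries are invented for illustration.

```r
# Hypothetical stand-in for the returned list (invented numbers and names).
res <- list(
  missing     = c(expression = 120L, DNase1 = 30L),
  total_count = 1000L
)

# Proportion of sites with missing values, per variable.
prop_missing <- res$missing / res$total_count
print(prop_missing)
```

Because missing is a named vector, the proportions keep the variable names.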

Author(s)

Johanna Bertl & Malene Juul

References

Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S.; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879. URL: http://www.biorxiv.org/content/early/2017/06/21/122879

See Also

cancermutations

Examples

# This is how the example dataset cancermutations was created (data(cancermutations)).

# use system.file to find the raw dataset that was installed along with the package:
location = system.file("extdata", "set0", package = "multinomutils")
count.raw = read.table(file=location, header = T, as.is=T)

# data preparation with imputation (on a subset of the data, for speed -- this still takes a few minutes!)
set.seed(1234)
count.raw.sub = count.raw[sample.int(nrow(count.raw), 1000),]
count.imp = count_table_prep_multinom(count.raw.sub, 2)
# Note that imputation only works if there is more than one cancer type in the dataset.

# data preparation without imputation, but with expression dummy variable:
count.noimp = count_table_prep_multinom(count.raw, 0, expression_dummy=T)

# complete cases only
count.complete = count_table_prep_multinom(count.raw, m=5, impute=F)
# Note that this doesn't work with a very small subset of the data where after removal of the missing cases not all 4 mutation types exist.
# This is similar to the example dataset cancermutations. The code to create this dataset is in data-raw/cancermutations.R

MultinomialMutations/MultinomialMutations documentation built on May 22, 2019, 4:39 p.m.