count_table_prep_multinom
Imputes missing values and reshapes the raw count table for multinomial logistic regression on the strand-symmetric mutation model.
count_table_prep_multinom(count_table, m, impute = T, strong = T, CpG = T,
  apobec = T, neighbors = T, DNase1_dummy = F, expression_dummy = F,
  data_source = "fredriksson")
count_table: data frame. Raw count table. See examples.
m: integer. Number of multiple imputations. 0 means no imputation.
impute: logical. Impute missing values or remove them? See details.
strong: logical. Should the variable strong be computed?
CpG: logical.
apobec: logical.
neighbors: logical.
DNase1_dummy: logical. Should a dummy variable be computed for DNase1 peaks? If yes, NAs in the original variable DNase1 are replaced by zero. This means that DNase1 := DNase1*I(DNase1_dummy==1).
expression_dummy: logical. Should a dummy variable be computed for expression measure available? If yes, NAs in the original variable expression are replaced by zero.
data_source: character. Either "fredriksson" or "pcawg".
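The recoding described for DNase1_dummy and expression_dummy can be illustrated on a toy vector. The following is a minimal base R sketch, not the package's implementation: the values are made up, and it assumes that the dummy is 1 where a measurement is available and 0 where it is missing.

```r
# Toy DNase1 measurements with missing values (made-up values for illustration).
DNase1 <- c(0.8, NA, 1.5, NA, 0.2)

# Assumed coding: the dummy is 1 where a measurement is available, 0 where it is NA.
DNase1_dummy <- as.integer(!is.na(DNase1))

# Replace NAs by zero, so that the result equals DNase1 * I(DNase1_dummy == 1).
DNase1_filled <- ifelse(is.na(DNase1), 0, DNase1)

print(data.frame(DNase1, DNase1_dummy, DNase1_filled))
```

With this coding, a regression model can estimate a separate effect for "measurement missing" through the dummy, while the zero-filled variable contributes nothing for those sites.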
If impute=T and m>=1, the missing data are imputed m times. If impute=T and m=0, the missing data are left untouched and kept as NA. If impute=F, the value of m is irrelevant: only the complete cases are returned (using the function complete.cases).
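To make the complete-case behaviour concrete, here is a small base R sketch of what filtering with complete.cases does. The data frame and its column names are hypothetical and only stand in for a count table; the sketch does not call count_table_prep_multinom itself.

```r
# Toy table with missing values (hypothetical columns, for illustration only).
toy <- data.frame(
  site       = 1:5,
  expression = c(2.1, NA, 0.4, 3.3, NA),
  DNase1     = c(NA, 0.7, 1.2, 0.9, 0.1)
)

# Analogue of impute = F: keep only the rows without any missing value.
toy_complete <- toy[complete.cases(toy), ]

# Analogue of impute = T with m = 0: the table is kept as is, NAs included.
toy_untouched <- toy

print(toy_complete)
print(sum(!complete.cases(toy_untouched)))  # number of rows that contain an NA
```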
The packages data.table and reshape2 are used for efficient and fast handling of the large mutation datasets. Multiple imputation is handled by the function mimp.
Note that this function uses a few small functions from small_dataprep_functions.R.
The Fredriksson and PCAWG datasets are handled in the same way, apart from the location information that is removed from the cancer type in the PCAWG set.
A list that consists of the following elements:
a list of length max(m, 1) of imputed or complete data.tables (data.frames)
a named vector giving the number of sites with missing values
an integer value giving the total number of sites (to compute proportions of missing sites)
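The last two elements are meant to be combined into proportions of missing sites. A minimal sketch of that computation follows; the numbers are invented, and the assumption that the vector is named by variable is ours, not something stated above.

```r
# Hypothetical stand-ins for the second and third elements of the returned list.
n_missing <- c(expression = 120L, DNase1 = 45L)  # sites with missing values (assumed: per variable)
n_sites   <- 1000L                               # total number of sites

# Proportion of sites with a missing value, keeping the names of the vector.
prop_missing <- n_missing / n_sites
print(prop_missing)
```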
Johanna Bertl & Malene Juul
Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S.; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879. http://www.biorxiv.org/content/early/2017/06/21/122879
# This is how the example dataset cancermutations was created (data(cancermutations)).
# Use system.file to find the raw dataset that was installed along with the package:
location = system.file("extdata", "set0", package = "multinomutils")
count.raw = read.table(file = location, header = TRUE, as.is = TRUE)

# Data preparation with imputation (on a subset of the data, for speed; this still takes a few minutes!)
set.seed(1234)
count.raw.sub = count.raw[sample.int(nrow(count.raw), 1000), ]
count.imp = count_table_prep_multinom(count.raw.sub, 2)
# Note that imputation only works if there is more than one cancer type in the dataset.

# Data preparation without imputation, but with expression dummy variable:
count.noimp = count_table_prep_multinom(count.raw, 0, expression_dummy = TRUE)

# Complete cases only:
count.complete = count_table_prep_multinom(count.raw, m = 5, impute = FALSE)
# Note that this doesn't work on a very small subset of the data in which
# not all 4 mutation types remain after the missing cases are removed.

# This is similar to the example dataset cancermutations. The code to create this dataset is in data-raw/cancermutations.R