Description Usage Arguments Details Value Caution! Author(s) References See Also Examples
fast_multinom
estimates an M-category multinomial logistic regression
model by M-1 separate binomial logistic regression models and estimates the
VC matrix according to Begg and Gray (1984).
1 2 3 |
multinomformula |
A formula of the form cbind(category1, category2, ..., categoryM) ~ explanatory1 + explanatory2 + ... + explanatory3). |
data |
Dataframe. |
refLevel |
Reference level. |
family |
I don't think I implemented anything except binomial with logit link. |
subsetmatrix |
Logical matrix with M-1 columns and the same number of rows as there are explanatory variables (including the intercept, even if it is 0 or -1). F indicates that an explanatory variable is not included in the submodel. |
VC |
logical Should the variance-covariance-matrix of the estimates be calculated? |
predictions |
logical Should the predictions (fitted values) for the data be calculated? |
loglik |
logical Should the log likelihood be calculated? Note that the predictions have to be computed to obtain the log likelihood. Setting loglik=T and predictions=F saves space when saving the results, but it doesn't save time. |
parallel |
logical. Should the estimation of the M-1 submodels be distributed between M-1 cores? Not implemented yet. |
The separate regression models are estimated with the function glm4
from package MatrixModels
, which is very fast for sparse design
matrices (usually models including many factors and interactions with factors). For not sparse design matrices, a different function like
speedglm
or bigglm
might be preferable, but this is not
implemented here.
The computation of the VC matrix follows Begg & Gray (1984). As very large and sparse matrices are involved, the Matrix
package is used here. The computation of the VC matrix is very fast, but computations can be sped up by setting VC=F. This can be useful if the model is only used for prediction or estimation of a loss function.
The function vglm
from package VGAM
provides a more flexible interface to multinomial logistic regression, has much
more functionalities and a more extensive output, and also allows the
specification of joint parametes and many different (custom) link functions.
It is certainly preferable for many problems, but when it reaches its
boundary in terms of runtime and memory, this function can be an alternative.
If the data is a data.table, it is internally changed to a data.frame. The code of this function is developed for data.frames and data.table is not compatible with this (see e. g. FAQ 1.1 in the data.table package for examples).
An object of class fast_multinom (an S3 class, a list, actually) that consists of the following elements:
A named list that contains the formulae for the separate logistic regression models.
A named list that contains the estimated regression coefficients for each of the M-1 submodels.
Variance covariance matrix of the coefficients, if VC=T, otherwise NULL. Some type of Matrix
object.
NULL (the mean fraction of missing information, only computed by the function pooling
).
A matrix of predicted probabilities obtained from the current dataset, if predictions=T, otherwise NULL. The columns correspond to the M categories and each line corresponds to one observation.
Log-likelihood of the multinomial model, if loglik=T, otherwise NULL.
A named list that contains a named list of the contrasts of each factor in each of the M-1 submodels. This is needed for prediction with predict.fast_multinom
. The contrasts can be specified either as a contrast matrix or with a character string with the name of the function to make them (usually, when the default contrasts are used).
A named list that contains a named list of the factor levels for each factor in each of the M-1 submodels. This is needed for prediction with predict.fast_multinom
, when the new data set contains less levels.
Number of observations (sum over the weights).
Note that there is a bug in glm4
that produces erroneous results when contrasts other than the default treatment contrasts are used for a factor that is involved in an interaction with a numeric variable (see http://r-forge.r-project.org/tracker/index.php?func=detail&aid=6192&group_id=61&atid=294 for my bug report). The function doesn't find these errors, so only use treatment contrasts for that case. If the numeric variable is a dummy variable, it is enough to change it to a factor.
Note that the function model.Matrix
that is internally used by glm4
has a bug, s. th. models with interaction terms that are specified with ":" (partial interaction) instead of "*" (full interaction) are often not handled correctly and result in model matrices that don't have full rank. The function doesn't catch these errors, but usually the underlying Cholesky decomposition will stop with an error message. If unsure about the specified model, create the model matrix first with model.Matrix(..., sparse=T, ...)
and compare its rank with its dimension using the function rankMatrix
.
Note that it is currently impossible to run the function on data with missing values. glm4 allows standard handling of missing values (with na.action), but I have not implemented handling missing data (and as a consequence, changing sizes of predicted vs observed data, and potentially different numbers of missing values in the binary submodels). I set na.action=na.fail in glm4.
Johanna Bertl
Begg, C. B. & Gray, R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 1984, 71, 11-18
Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879 http://www.biorxiv.org/content/early/2017/06/21/122879
vglm
, glm4
, predict.fast_multinom
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | # Load the internal example dataset:
data(cancermutations)
# the APOBEC signature is only relevant for transitions and transversions to a G:C basepair -- construct the corresponding subset of parameters for the 3 binomial models:
subs = matrix(T, ncol=3, nrow=4)
subs[3,2] = F
# fit the multinomial model
fit = fast_multinom(cbind(NO, I, VA, VG) ~ strong + apobec + cancer_type, data = cancermutations, refLevel=1, loglik=T, predictions=F, VC=T, subsetmatrix=subs)
# The same without intercept
fit2 = fast_multinom(cbind(NO, I, VA, VG) ~ 0 + strong + apobec + cancer_type, data = chrom21, refLevel=1, loglik=T, predictions=F, VC=T, subsetmatrix=subs)
# Simpler model without subsetting and continuous variable
fit3 = fast_multinom(cbind(NO, I, VA, VG) ~ replication_timing, data = chrom21, refLevel=1, loglik=T, predictions=F, VC=T, subsetmatrix=NULL)
# See predict.multinom for more examples, especially for the use of different contrasts.
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.