fast_multinom: Fast multinomial logistic regression on large and sparse...

Description Usage Arguments Details Value Caution! Author(s) References See Also Examples

Description

fast_multinom estimates an M-category multinomial logistic regression model by M-1 separate binomial logistic regression models and estimates the VC matrix according to Begg and Gray (1984).

Usage

1
2
3
fast_multinom(multinomformula, data, refLevel = 1, family = binomial(link =
  "logit"), subsetmatrix = NULL, VC = F, predictions = F, loglik = F,
  parallel = F)

Arguments

multinomformula

A formula of the form cbind(category1, category2, ..., categoryM) ~ explanatory1 + explanatory2 + ... + explanatory3).

data

Dataframe.

refLevel

Reference level.

family

I don't think I implemented anything except binomial with logit link.

subsetmatrix

Logical matrix with M-1 columns and the same number of rows as there are explanatory variables (including the intercept, even if it is 0 or -1). F indicates that an explanatory variable is not included in the submodel.

VC

logical Should the variance-covariance-matrix of the estimates be calculated?

predictions

logical Should the predictions (fitted values) for the data be calculated?

loglik

logical Should the log likelihood be calculated? Note that the predictions have to be computed to obtain the log likelihood. Setting loglik=T and predictions=F saves space when saving the results, but it doesn't save time.

parallel

logical. Should the estimation of the M-1 submodels be distributed between M-1 cores? Not implemented yet.

Details

The separate regression models are estimated with the function glm4 from package MatrixModels, which is very fast for sparse design matrices (usually models including many factors and interactions with factors). For not sparse design matrices, a different function like speedglm or bigglm might be preferable, but this is not implemented here.

The computation of the VC matrix follows Begg & Gray (1984). As very large and sparse matrices are involved, the Matrix package is used here. The computation of the VC matrix is very fast, but computations can be sped up by setting VC=F. This can be useful if the model is only used for prediction or estimation of a loss function.

The function vglm from package VGAM provides a more flexible interface to multinomial logistic regression, has much more functionalities and a more extensive output, and also allows the specification of joint parametes and many different (custom) link functions. It is certainly preferable for many problems, but when it reaches its boundary in terms of runtime and memory, this function can be an alternative.

If the data is a data.table, it is internally changed to a data.frame. The code of this function is developed for data.frames and data.table is not compatible with this (see e. g. FAQ 1.1 in the data.table package for examples).

Value

An object of class fast_multinom (an S3 class, a list, actually) that consists of the following elements:

formulae

A named list that contains the formulae for the separate logistic regression models.

coefficients

A named list that contains the estimated regression coefficients for each of the M-1 submodels.

VC

Variance covariance matrix of the coefficients, if VC=T, otherwise NULL. Some type of Matrix object.

gamma

NULL (the mean fraction of missing information, only computed by the function pooling).

predictions

A matrix of predicted probabilities obtained from the current dataset, if predictions=T, otherwise NULL. The columns correspond to the M categories and each line corresponds to one observation.

log.lik

Log-likelihood of the multinomial model, if loglik=T, otherwise NULL.

contrasts

A named list that contains a named list of the contrasts of each factor in each of the M-1 submodels. This is needed for prediction with predict.fast_multinom. The contrasts can be specified either as a contrast matrix or with a character string with the name of the function to make them (usually, when the default contrasts are used).

xlevels

A named list that contains a named list of the factor levels for each factor in each of the M-1 submodels. This is needed for prediction with predict.fast_multinom, when the new data set contains less levels.

n

Number of observations (sum over the weights).

Caution!

Note that there is a bug in glm4 that produces erroneous results when contrasts other than the default treatment contrasts are used for a factor that is involved in an interaction with a numeric variable (see http://r-forge.r-project.org/tracker/index.php?func=detail&aid=6192&group_id=61&atid=294 for my bug report). The function doesn't find these errors, so only use treatment contrasts for that case. If the numeric variable is a dummy variable, it is enough to change it to a factor.

Note that the function model.Matrix that is internally used by glm4 has a bug, s. th. models with interaction terms that are specified with ":" (partial interaction) instead of "*" (full interaction) are often not handled correctly and result in model matrices that don't have full rank. The function doesn't catch these errors, but usually the underlying Cholesky decomposition will stop with an error message. If unsure about the specified model, create the model matrix first with model.Matrix(..., sparse=T, ...) and compare its rank with its dimension using the function rankMatrix.

Note that it is currently impossible to run the function on data with missing values. glm4 allows standard handling of missing values (with na.action), but I have not implemented handling missing data (and as a consequence, changing sizes of predicted vs observed data, and potentially different numbers of missing values in the binary submodels). I set na.action=na.fail in glm4.

Author(s)

Johanna Bertl

References

Begg, C. B. & Gray, R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 1984, 71, 11-18

Bertl, J.; Guo, Q.; Rasmussen, M. J.; Besenbacher, S; Nielsen, M. M.; Hornshøj, H.; Pedersen, J. S. & Hobolth, A. A Site Specific Model And Analysis Of The Neutral Somatic Mutation Rate In Whole-Genome Cancer Data. bioRxiv, 2017. doi: https://doi.org/10.1101/122879 http://www.biorxiv.org/content/early/2017/06/21/122879

See Also

vglm, glm4, predict.fast_multinom

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
# Load the internal example dataset:
data(cancermutations)

# the APOBEC signature is only relevant for transitions and transversions to a G:C basepair -- construct the corresponding subset of parameters for the 3 binomial models:
subs = matrix(T, ncol=3, nrow=4)
subs[3,2] = F

# fit the multinomial model
fit = fast_multinom(cbind(NO, I, VA, VG) ~ strong + apobec + cancer_type, data = cancermutations, refLevel=1, loglik=T, predictions=F, VC=T, subsetmatrix=subs)

# The same without intercept
fit2 = fast_multinom(cbind(NO, I, VA, VG) ~ 0 + strong + apobec + cancer_type, data = chrom21, refLevel=1, loglik=T, predictions=F, VC=T, subsetmatrix=subs)

# Simpler model without subsetting and continuous variable
fit3 = fast_multinom(cbind(NO, I, VA, VG) ~ replication_timing, data = chrom21, refLevel=1, loglik=T, predictions=F, VC=T, subsetmatrix=NULL)

# See predict.multinom for more examples, especially for the use of different contrasts.

MultinomialMutations/MultinomialMutations documentation built on May 22, 2019, 4:39 p.m.