MSTweedie: Regularization path for the Multi-source sparse Tweedie model


Description

This function fits the sparse Tweedie model on multi-source datasets along a sequence of regularization parameters lambda. The optimization is done by a Fortran95 routine.

Usage

MSTweedie(x, y, w, source, rho = 1.5,
      nlambda = 100, lambda.min, lambda, x.normalize = T,
      eps, sr = T, kktstop = F, reg = c("L2", "Linf"),
      alpha = 0, dfmax = nvars + 1, pmax = min(dfmax * 1.2, nvars),
      pf = rep(1, nvars), maxit = 10000)

Arguments

x

Either (1) a data frame containing the predictors, the responses (identifying the sources either by different columns in the simultaneous case or via an additional index column) and, optionally, the observation weights or (2) a list of matrices containing only the predictors (mostly used internally for cross-validation).

y

Either (1) a single integer identifying the column of x containing the response (requires source to be specified), (2) a vector of integers identifying which columns of x are the responses (simultaneous case) or (3) a list of response vectors (mostly used internally for cross-validation).

w

(Optional) Either (1) a single integer identifying the column of x containing the observation weights or (2) a list of weight vectors (mostly used internally for cross-validation). If this argument is missing, equal weights are assumed.

source

When y is a single integer, this argument identifies the column of x that indexes the different sources. Disregarded if y is a vector of integers or a list of vectors.

rho

Power used for the mean-variance relation of the Tweedie distribution. Possible range is [1,2], default is 1.5.
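
For reference, the standard Tweedie mean-variance relation that rho indexes can be written as below. The dispersion symbol \phi is notation introduced here for illustration and is not an argument of this function.

```latex
\operatorname{Var}(Y) = \phi\,\mu^{\rho}, \qquad \mu = \operatorname{E}(Y), \qquad 1 \le \rho \le 2
```

For 1 < rho < 2 the distribution is a compound Poisson-gamma with a point mass at zero; rho = 2 gives the gamma distribution and rho = 1 an (over-dispersed) Poisson.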

nlambda

The length of the regularization path. Disregarded if lambda is specified; default is 100.

lambda.min

The fraction of the first regularization parameter (which is computed to be the smallest such that no predictors are included) defining the last regularization parameter. Disregarded if lambda is specified; possible range is (0,1), default is 1e-3.

lambda

(Optional) User-specified sequence of positive regularization parameters. When omitted, the sequence starts from the smallest value that excludes all predictors from the model and decreases, in equal steps on the logarithmic scale, to a fraction lambda.min of that starting value.
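
A minimal sketch of such a log-spaced sequence in base R, not the package's internal code. The starting value lambda_max is a hypothetical number chosen for illustration; the function computes its own starting value from the data.

```r
# Hypothetical inputs: in practice the starting value is computed from the data.
lambda_max <- 2.5    # smallest value excluding all predictors (assumed here)
lambda_min <- 1e-3   # fraction of lambda_max, as in the lambda.min argument
nlambda    <- 100

# Equally spaced on the log scale, from lambda_max down to lambda_min * lambda_max
lambda_seq <- exp(seq(log(lambda_max),
                      log(lambda_min * lambda_max),
                      length.out = nlambda))

range(lambda_seq)
```

A sequence built this way could also be passed through the lambda argument, in which case nlambda and lambda.min are disregarded.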

x.normalize

Logical flag for standardization of the predictors prior to fitting the model. If TRUE, each predictor in each source is centered to zero and scaled to variance 1. After the model is fitted, the coefficients are returned on the original scale. Default is TRUE.
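
The following base R sketch illustrates per-source standardization and the mapping of coefficients back to the original scale. It is an illustration under assumptions, not the package's code: the data and coefficient values are made up, and scale() uses the unweighted sample standard deviation, which may differ from the scaling used internally.

```r
# Toy predictors for two sources (hypothetical data)
set.seed(1)
x_list <- list(
  matrix(rnorm(20, mean = 5,  sd = 2), nrow = 10, ncol = 2),
  matrix(rnorm(30, mean = -1, sd = 3), nrow = 15, ncol = 2)
)

# Center and scale each column separately within each source
x_std <- lapply(x_list, scale)

# Hypothetical coefficients on the standardized scale for source 1
b0_std <- 0.3
b_std  <- c(0.5, -0.2)

# Map them back to the original scale of source 1
ctr <- attr(x_std[[1]], "scaled:center")
scl <- attr(x_std[[1]], "scaled:scale")
b_orig  <- b_std / scl
b0_orig <- b0_std - sum(b_std * ctr / scl)

# Both parameterizations give the same linear predictor
eta_std  <- as.numeric(b0_std  + x_std[[1]]  %*% b_std)
eta_orig <- as.numeric(b0_orig + x_list[[1]] %*% b_orig)
all.equal(eta_std, eta_orig)
```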

eps

Convergence threshold. Default is 1e-3.

sr

Logical flag for using the strong rule in the fit. Default is TRUE.

kktstop

Logical flag for using the KKT conditions to stop the fit before the end of the regularization parameter sequence. Default is FALSE.

reg

Either "Linf" for using L_∞-regularization in the fit or "L2" for the L_2-regularization. Default is "Linf".

alpha

Parameter controlling the balance between across-feature and within-feature sparsity in the penalty term

(1-α)||β||_q +α||β||_1.

Possible range is [0,1], default is 0.
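
As a sketch of how this penalty term behaves, the base R snippet below evaluates it for each feature of a small coefficient matrix. It assumes that β in the formula denotes the coefficients of one feature across all sources and that q is ∞ or 2 according to reg. The helper feature_penalty and the coefficient values are hypothetical and not part of the package.

```r
# Penalty for one feature across sources:
# (1 - alpha) * ||b||_q + alpha * ||b||_1, with q = Inf ("Linf") or q = 2 ("L2")
feature_penalty <- function(b, alpha = 0, reg = c("Linf", "L2")) {
  reg <- match.arg(reg)
  group_norm <- if (reg == "Linf") max(abs(b)) else sqrt(sum(b^2))
  (1 - alpha) * group_norm + alpha * sum(abs(b))
}

# Hypothetical coefficients: 3 features (rows) by 2 sources (columns)
beta <- matrix(c(0,  3, 1,
                 0, -4, 0), nrow = 3)

# alpha = 0: only the group norm is used, so a feature is penalized as a whole
pen_linf <- apply(beta, 1, feature_penalty, alpha = 0, reg = "Linf")
pen_linf  # 0 4 1

# alpha = 0.5: half of the weight goes to the L1 norm (within-feature sparsity)
pen_l2 <- apply(beta, 1, feature_penalty, alpha = 0.5, reg = "L2")
pen_l2    # 0 6 1
```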

dfmax

Maximum number of variables included in the model at a single time. Default is nvars+1.

pmax

Limits the number of features ever to be nonzero. Unlike dfmax, a variable that enters the model and later exits it is still counted here. Default is min(dfmax*1.2,nvars).

pf

Per-feature weights in the penalty term. Mostly used internally when the Adaptive Lasso is used in cross-validation. Expects a vector of length nvars; default is a vector of ones.

maxit

Maximum number of inner-loop iterations. Default is 10,000.

Details

The sequence of regularization parameters implies a sequence of models fitted by the IRLS-BSUM algorithm described in the reference. For each value of the parameter, this function yields a model optimizing the penalized Tweedie log-likelihood of multi-source data. The type of sparsity can be controlled by the arguments reg and alpha.

The computation time is influenced by the arguments eps, nlambda, lambda.min (or lambda) and maxit. Consider adjusting these parameters to speed up computation. Small values of the regularization parameter often take the longest to fit; the kktstop argument can stop the algorithm before the end of the sequence if convergence is judged sufficient in terms of the KKT conditions.

To pass a source that lacks features present in other sources, simply add a column of zeros in place of each missing feature.
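
A small base R sketch of this zero-padding step for predictor matrices. The helper pad_features and the toy matrices are hypothetical and only illustrate the idea.

```r
# Hypothetical predictors: source A has features f1, f2, f3; source B lacks f2
xA <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("f1", "f2", "f3")))
xB <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("f1", "f3")))

# Return x with one column per entry of all_features, in that order;
# features absent from x are filled with zeros
pad_features <- function(x, all_features) {
  out <- matrix(0, nrow = nrow(x), ncol = length(all_features),
                dimnames = list(NULL, all_features))
  out[, colnames(x)] <- x
  out
}

xB_padded <- pad_features(xB, colnames(xA))
xB_padded
```

After padding, both sources have the same columns in the same order, so they can be supplied together as a list of matrices.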

Value

An object with S3 class MSTweedie :

beta0

An ntasks*nlambda matrix of parameter estimates for the intercept.

beta

A list of length nlambda containing nvars*ntasks matrices of parameter estimates for the features.

df

The number of included variables along the regularization path.

lambda

The sequence of regularization parameters.

npasses

The number of inner-loop iterations.

idvars

The index of the variables in order of inclusion in the model.

dim

The dimensions of the model (nvars,ntasks).

call

The original call that produced this object.

pf

The penalty factors for the features.

eps

The convergence threshold used in the algorithm.

kkt

An nvars*ntasks*nlambda array containing the values of the KKT conditions.

norm

An nvars*nlambda matrix containing the norm of the features along the regularization path.

reg

The type of regularization used in the algorithm.

alpha

The value of the argument alpha used.

y

A list of length ntasks containing the vectors of the responses for each source.

x

A list of length ntasks containing matrices of the features for each source.

w

A list of length ntasks containing the vectors of the observation weights for each source.

rho

The power of the mean-variance relation used in the algorithm.

M

An nvars*ntasks*nlambda array containing flags for the KKT conditions.

time

Computing time.

Author(s)

Simon Fontaine, Yi Yang, Bo Fan, Wei Qian and Yuwen Gu.

Maintainer: Simon Fontaine fontaines@dms.umontreal.ca

References

Fontaine, S., Yang, Y., Fan, B., Qian, W. and Gu, Y. (2018). "A Unified Approach to Sparse Tweedie Model with Big Data Applications to Multi-Source Insurance Claim Data Analysis," to be submitted.

See Also

MSTweedie, coef.MSTweedie, print.MSTweedie, plot.MSTweedie, kkt.check, predict.MSTweedie

Examples

# import package
library(MSTweedie)

# load data
data(AutoClaim)

# fit the MSTweedie model with L1/Linf regularization
# y=1 sets CLM_AMT5 as the response
# source=4 sets REVOLKED as the source index
fit <- MSTweedie(x = AutoClaim, y=1, source=4, reg='Linf')

fontaine618/MSTweedie documentation built on May 25, 2019, 5:22 p.m.