mphcrm: Estimate a mixed proportional hazard model

Description Usage Arguments Details Value Note See Also Examples

View source: R/duration.R


mphcrm implements estimation of a mixed proportional hazard competing risk model. The baseline hazard is of the form exp(X β) where X is a matrix of covariates, and β is a vector of parameters to estimate. In addition there is an intercept term μ, i.e. the hazard is exp(X β + μ). There are several transitions to be made, and a set of X, β, and μ for each possible transition.

Each individual may have several observations, with either a transition at the end of the observation, or not a transition. It is a competing risk, there can be more than one possible transition for an observation, but only one is taken at the end of the period.

For each individual i there is a log likelihood as a function of μ, called M_i(μ).

The mixture part is that the μ's are stochastic. I.e. we have probabilities p_j, and a vector of μ_j of masspoints (one for each transition), for each such j.

So the full likelihood for an individual is L_i = ∑_j p_j M_i(μ_j).

The mphcrm() function maximizes the likelihood ∑_i L_i over p_j, μ_j, and β.

In addition to the covariates specified by a formula, a variable which records the duration of each observation must be specified.

In some datasets it is known that not all risks are present at all times. Like, losing your job when you do not have one. In this case it should be specified which risks are present.

The estimation starts out with one masspoint, maximizes the likelihood, tries to add another point, and continues in this fashion.


  risksets = NULL,
  timing = c("exact", "interval", "none"),
  control = mphcrm.control()



A formula specifying the covariates. In a formula like d ~ x1 + x2 + ID(id) + D(dur) + C(job,alpha) + S(state), the d is the transition which is taken, coded as an integer where 0 means no transition, and otherwise d is the number of the transition which is taken. d can also be a factor, in which case the level which is no transition must be named "0" or "none". If d is an integer, the levels for transitions will be named "t1", "t2", ..., and "none".

The x1+x2 part is like in lm, i.e. ordinary covariates or factors.

The D() specifies the covariate which holds the duration of each observation. The transition in d is assumed to be taken at the end of this period.

The ID() part specifies the covariate which holds the individual identification.

The S() specifies the covariate which holds an index into the risksets list.

These three special symbols are replaced with I(), so it is possible to have calculations inside them.

If the covariates differ among the transitions, one can specify covariates conditional on the transition taken. If e.g. the covariates alpha and x3 should only explain transition to job, specify C(job, alpha+x3). This comes in addition to the ordinary covariates. The name "job" refers to a level in the factor d, the transition taken.


A data frame which contains the covariates. It must be sorted on individuals.


A list of character vectors. Each vector is a list of transitions, i.e. which risks are present for the observation. The elements of the vectors must be levels of the covariate which is the left hand side of the formula. If the state variable in the formula is a factor, the risksets argument should be a named list, with names matching the levels of state. If all risks are present at all times, the risksets-argument can be specified as NULL, or ignored.


character. The timing in the duration model. Can be one of

  • "exact". The timing is exact, the transition occured at the end of the observation interval.

  • "interval". The transition occured some time during the observation interval. This model can be notoriously hard to estimate due to unfavourable numerics.

  • "none". There is no timing, the transition occured, or not. A logit model is used.


For specifying a subset of the dataset, similar to lm.


For handling of missing cases, similar to lm.


List of control parameters for the estimation. See mphcrm.control.


The estimation starts by estimating the null-model, i.e. all parameters set to 0, only one intercept for each transition is estimated.

Then it estimates the full model, still with one intercept in each transition.

After the initial model has been estimated, it tries to add a masspoint to the mixing distribution, then estimates the model with this new distribution.

The algorithm continues to add masspoints in this way until either it can not improve the likelihood, or the number of iterations as specified in control$iters are reached.

The result of every iteration is returned in a list.

If you interrupt mphcrm it will catch the interrupt and return with the estimates it has found so far. This behaviour can be switched off with control$trap.interrupt=FALSE.


A list, one entry for each iteration. Ordered in reverse order. Ordinarily you will be interested in the first entry.


The algorithm is not fully deterministic. New points are searched for randomly, there is no canonical order in which they can be found. It can happen that a point is found early which makes the rest of the estimation hard, so it terminates early. In particular when using interval timing. One should then make a couple of runs to ensure they yield reasonably equal results.

See Also

A description of the dataset is available in datagen and durdata, and in the vignette vignette("whatmph")


risksets <- list(c('job','program'), c('job'))
Fit <- mphcrm(d ~ x1+x2 + C(job,alpha) + ID(id) + D(duration) + S(alpha+1), data=durdata, 
     risksets=risksets, control=mphcrm.control(threads=1,iters=2))
best <- Fit[[1]]

durmod documentation built on March 31, 2020, 5:23 p.m.