datagen: Generate example data


View source: R/datagen.R


Generate a data table with example data


datagen(N, censor = 80)



N: integer. The number of individuals in the dataset.


censor: numeric. The total observation period. Individuals are removed from the dataset if they do not exit to "job" before this time.


The dataset simulates a labour market programme. People entering the dataset are without a job.

They experience two hazards, i.e. probabilities per time period. They can either get a job and exit from the dataset, or they can enter a labour market programme, e.g. a subsidised job or similar, and remain in the dataset and possibly get a job later. In the terms of this package, there are two transitions, "job" and "program".

The two hazards are influenced by covariates observed by the researcher, called "x1" and "x2". In addition there are unobserved characteristics influencing the hazards. Being on a programme also influences the hazard to get a job. In the generated dataset, being on a programme is the indicator variable alpha. While on a programme, the only transition that can be made is "job".

The dataset is organized as a series of rows for each individual. Each row is a time period with constant covariates.

The length of the time period is stored in the covariate duration.

The transition made at the end of the period is coded in the covariate d. This is an integer which is 0 if no transition occurs (e.g. if the row ends only because a covariate changes), 1 for the first transition, and 2 for the second transition. It can also be a factor, in which case the level marking no transition must be called "none".

The covariate alpha is 0 when unemployed and 1 when on a programme. It is used for two purposes. First, it is an explanatory variable for the transition to job; this yields a coefficient which can be interpreted as the effect of being on the programme. Second, it is a "state variable", an index into a "risk set": when estimating, the mphcrm function must be told which risks/hazards are present. While on a programme the "program" transition cannot be made. This is implemented by specifying a list of risksets and using alpha+1 as an index into this list.
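As a small illustration of this layout, the sketch below builds a few rows by hand rather than calling datagen. The values are made up, and the order of the factor levels is an assumption; only the column names and the riskset list come from this page.

```r
# Hypothetical hand-built spells for two individuals, mimicking the layout
# described above (the values are invented for illustration).
spells <- data.frame(
  id       = c(1, 1, 2),
  duration = c(4.0, 2.5, 7.0),
  alpha    = c(0, 1, 0),
  d        = factor(c("program", "job", "job"),
                    levels = c("none", "job", "program"))
)

# Risk sets as in the Examples section: entry 1 = unemployed, 2 = on programme.
risksets <- list(unemp = c("job", "program"), onprogram = "job")

# alpha + 1 selects the risk set that applies to each row.
state <- spells$alpha + 1
active <- names(risksets)[state]

# Every observed transition should belong to the risk set of its row.
ok <- mapply(function(tr, s) tr == "none" || tr %in% risksets[[s]],
             as.character(spells$d), state)
```

Individual 1 is unemployed for 4 time units, enters the programme, and gets a job 2.5 units later; individual 2 goes straight from unemployment to a job. The check in `ok` is only a sanity test for hand-made data, not something the package is known to require.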

The two hazards are modeled as exp(X β + μ), where X is a matrix of covariates, β is a vector of coefficients to be estimated, and μ is an intercept. All of these quantities are transition specific. This yields an individual likelihood which we call M_i(μ). The idea behind the mixed proportional hazard model is to model the individual heterogeneity as a probability distribution of intercepts. We obtain the individual likelihood L_i = ∑_j p_j M_i(μ_j) and, thus, the likelihood L = ∏_i L_i, or equivalently the log-likelihood log L = ∑_i log L_i.
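Written out, and assuming that individuals are independent so that their contributions multiply, the quantities above can be summarised as follows. The superscript (k) marking the transition is notation introduced here for illustration; the exact form of M_i is not spelled out on this page and is left abstract.

```latex
\begin{aligned}
h_i^{(k)} &= \exp\!\left(X_i^{(k)} \beta^{(k)} + \mu^{(k)}\right), \\
L_i &= \sum_j p_j \, M_i(\mu_j), \qquad p_j \ge 0,\quad \sum_j p_j = 1, \\
\log L &= \sum_i \log L_i .
\end{aligned}
```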

The likelihood is to be maximized over the parameter vectors β (one for each transition), the masspoints μ_j, and the probabilities p_j.
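To make the mixture concrete, here is a minimal base-R sketch of the log-likelihood for a deliberately simplified setting: a single transition, one row per individual, exact exit times, and a constant hazard. The function name mix_loglik and the toy data are hypothetical and are not part of durmod; the package's own likelihood handles several transitions and multi-row spells.

```r
# Toy mixture log-likelihood (illustrative only, not durmod's implementation).
mix_loglik <- function(beta, mu, p, x, duration, d) {
  stopifnot(length(mu) == length(p), abs(sum(p) - 1) < 1e-12)
  # M[i, j] = M_i(mu_j): density of an exit (d = 1) or survival (d = 0)
  # under a constant hazard exp(x_i * beta + mu_j).
  M <- sapply(mu, function(m) {
    h <- exp(x * beta + m)
    h^d * exp(-h * duration)
  })
  M <- matrix(M, nrow = length(x))   # keep matrix shape when length(mu) == 1
  Li <- as.vector(M %*% p)           # L_i = sum_j p_j M_i(mu_j)
  sum(log(Li))                       # log L = sum_i log L_i
}

set.seed(1)
x <- rnorm(200)
duration <- rexp(200, rate = exp(0.5 * x - 1))
d <- rep(1, 200)                     # everyone exits; no censoring in this toy

# A single masspoint with probability 1 ...
one_point <- mix_loglik(0.5, mu = -1, p = 1, x, duration, d)
# ... gives the same value as two coinciding masspoints sharing the probability.
two_points <- mix_loglik(0.5, mu = c(-1, -1), p = c(0.3, 0.7), x, duration, d)
```

The comparison at the end shows why a newly added masspoint only helps if it moves away from the existing ones: two points at the same location are equivalent to one.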

The probability distribution is built up in steps. We start with a single masspoint, with probability 1. Then we search for another point with a small probability, and maximize the likelihood from there. We continue adding masspoints until we can no longer improve the likelihood.


The example illustrates how data(durdata) was generated.


data.table::setDTthreads(1)  # avoid screams from cran-testing
dataset <- datagen(5000,80)
risksets <- list(unemp=c("job","program"), onprogram="job")
# just two iterations to save time
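Fit <- mphcrm(d ~ x1+x2 + ID(id) + D(duration) + S(alpha+1) + C(job,alpha),
          data=dataset, risksets=risksets,
          control=mphcrm.control(threads=1,iters=2))
# the first element of the returned list is the estimate we keep
best <- Fit[[1]]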

durmod documentation built on Aug. 21, 2019, 5:10 p.m.