Data management


This vignette explains how to provide choice data for RprobitB via

  1. empirical data or

  2. simulated data.

As a first step, we recommend specifying the model formula.

Specify the model formula

The model formula is specified using a formula object, let's call it form.

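The structure of form is choice ~ A | B | C, where choice is the dependent variable, A are alternative specific covariates with a generic coefficient (covariates of type 1), B are covariates that are constant across alternatives and have alternative specific coefficients (covariates of type 2), and C are alternative specific covariates with alternative specific coefficients (covariates of type 3).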

Keep the following rules in mind:

To have random effects for specific variables, we need to define a character vector re of the corresponding variable names. To have random effects for the alternative specific constants, include "ASC" in re.
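To make the three-part structure concrete, the following base R sketch splits the right-hand side of form at each | and lists the variables in each part. The helper split_bar() is hypothetical and written only for this illustration; it is not part of RprobitB and does not claim to mirror how the package parses form.

```r
form = choice ~ cost | income | travel_time
re = "cost"

# Hypothetical helper (not part of RprobitB): split an expression at
# every top-level `|`, returning the parts from left to right.
split_bar = function(expr) {
  if (is.call(expr) && identical(expr[[1]], as.name("|"))) {
    c(split_bar(expr[[2]]), list(expr[[3]]))
  } else {
    list(expr)
  }
}

parts = split_bar(form[[3]])            # right-hand side of the formula
vars_by_part = lapply(parts, all.vars)  # variable names in A, B, and C
names(vars_by_part) = c("A", "B", "C")[seq_along(parts)]

dependent = all.vars(form[[2]])         # name of the dependent variable

# Every entry of re must be a covariate in form or the keyword "ASC".
stopifnot(all(re %in% c(unlist(vars_by_part), "ASC")))

print(dependent)
print(vars_by_part)
```

The final check only illustrates the rule stated above for re: each entry is either a variable name that appears in form or the string "ASC".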

[^1]: Alternative specific constants can be interpreted as covariates of type 2. Due to the dummy variable trap, we cannot estimate alternative specific constants for all the alternatives. Therefore, they are added for all except for the last alternative.

Example: Simulated choice of transportation means

Say we want to explain the choice of transportation means by the variables cost, income, and travel_time. We furthermore want to add alternative specific constants.

Therefore, we specify:

form = choice ~ cost | income | travel_time

We would typically expect heterogeneity in preferences regarding spending money on a means of transportation. Therefore, we impose a random effect on cost:

re = "cost"

Empirical data

This section explains how to prepare empirical data for estimation using the function prepare().

Say we have a data set with empirical choice data, let's call it choice_data. It must meet the following requirements:

  1. It must be a data frame.

  2. It must be in wide format, that is, each row represents one choice occasion.

  3. It must contain a column named id, which contains a unique identifier for each decision maker.

  4. It must contain a column for the observed choice whose name matches the dependent variable in form (choice in our examples).

  5. For each alternative specific covariate p in form and each choice alternative j, choice_data must contain a column named p_j.

  6. For each covariate q that is constant across alternatives (covariate of type 2), choice_data must contain a column named q.
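As an illustration of these requirements, the base R sketch below builds a small wide-format data frame for the transportation example and checks that the expected columns are present. All values are made up, and the helper required_columns() is hypothetical (it is not part of RprobitB); it simply spells out requirements 3 to 6.

```r
# Toy data: 2 decision makers with 2 choice occasions each and the
# alternatives "car", "bus", "train" (all values are made up).
alternatives = c("car", "bus", "train")
choice_data = data.frame(
  id = c(1, 1, 2, 2),
  choice = c("car", "bus", "train", "car"),
  cost_car = c(5.0, 5.5, 4.8, 5.2),
  cost_bus = c(2.0, 2.1, 2.0, 2.3),
  cost_train = c(3.0, 3.2, 2.9, 3.1),
  income = c(3000, 3000, 5000, 5000),
  travel_time_car = c(20, 25, 22, 21),
  travel_time_bus = c(35, 40, 38, 36),
  travel_time_train = c(30, 28, 31, 29)
)

# Hypothetical helper (not part of RprobitB): column names implied by
# requirements 3 to 6 for the given covariate names.
required_columns = function(dependent, alt_specific, constant, alternatives) {
  # one column p_j per alternative specific covariate p and alternative j
  alt_cols = as.vector(outer(alt_specific, alternatives, paste, sep = "_"))
  c("id", dependent, alt_cols, constant)
}

needed = required_columns(
  dependent = "choice",
  alt_specific = c("cost", "travel_time"),
  constant = "income",
  alternatives = alternatives
)
missing_cols = setdiff(needed, names(choice_data))

# Requirement 1 and the column requirements hold for the toy data.
stopifnot(is.data.frame(choice_data), length(missing_cols) == 0)
```

A check of this kind before calling prepare() can help locate a misnamed column early.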

To prepare choice_data for estimation, we must call

data = prepare(form = form, choice_data = choice_data)

The function prepare() has the following optional arguments:

Example: "Train" data set of the mlogit package

Let's prepare the Train data set of the mlogit package for estimation. We consider the covariates price (type 1), time, comfort and change (each of type 3), where we link price and time to random effects[^2].

data("Train", package = "mlogit")
data = prepare(form = choice ~ price | 0 | time + comfort + change, 
               choice_data = Train,
               re = c("price","time"))

[^2]: Note that alternative specific constants are excluded here.

Simulated data

This section explains how to simulate choice data using the function simulate().

To simulate the choices of N decision makers in T choice occasions[^3] among J alternatives from our model formulation form, we call

data = simulate(form = form, N = N, T = T, J = J)

The function simulate() has the following optional arguments:

We can specify true parameter values by adding values for

[^3]: T can be either a positive number, representing a fixed number of choice occasions for each decision maker, or a vector of length N, i.e. a decision maker specific number of choice occasions.

[^4]: For a covariate cov of type 1 or 3, you can either choose "name" = cov (to draw the covariate for all alternatives from the same distribution) or "name" = cov_alternative (to draw the covariate for a specific alternative from a specific distribution).
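Since T may be a single number or a vector of length N, it can help to picture both cases as one vector with an entry per decision maker. The base R sketch below does this with a hypothetical helper expand_T(), which is not part of RprobitB and is shown only to illustrate the two accepted forms.

```r
# Hypothetical helper (not part of RprobitB): return the number of choice
# occasions for each of the N decision makers.
expand_T = function(T, N) {
  if (length(T) == 1) {
    T = rep(T, N)  # fixed number of choice occasions for everyone
  }
  stopifnot(length(T) == N, all(T > 0), all(T == round(T)))
  T
}

T_fixed = expand_T(T = 10, N = 3)             # same T for each decision maker
T_varying = expand_T(T = c(5, 8, 10), N = 3)  # decision maker specific T
total_occasions = sum(T_varying)              # total number of choice occasions
```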

Example: Simulated choice of transportation means

We revisit our example of the simulated choice of transportation means, where we already specified:

form = choice ~ cost | income | travel_time
re = "cost"

Let us now simulate the choices of N = 100 decision makers in T = 10 choice occasions on the J = 3 alternatives "car", "bus" and "train". We want C = 2 true latent classes and specific distributions[^5] for our covariates:

N = 100
T = 10
J = 3
alternatives = c("car", "bus", "train")
distr = list("cost" = list("name" = "rnorm", sd = 3),
             "income" = list("name" =  "sample", x = (1:10)*1e3, replace = TRUE),
             "travel_time_car" = list("name" = "rlnorm", meanlog = 1),
             "travel_time_bus" = list("name" = "rlnorm", meanlog = 2))
data = simulate(form = form, N = N, T = T, J = J, re = re,
                alternatives = alternatives, distr = distr, C = 2)

[^5]: Note that the cost covariate for all alternatives is drawn from the same distribution. Also note that since we did not specify a distribution for travel_time_train, this covariate is drawn from a standard normal distribution.
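Each entry of distr names a sampling function together with its arguments. As a rough illustration of what such a specification means, the base R sketch below turns one entry into n draws via do.call(). The helper draw_covariate() is hypothetical, and the standard normal fallback follows the default described in the footnote; this is an assumption-laden sketch and not necessarily how simulate() is implemented.

```r
distr = list("cost" = list("name" = "rnorm", sd = 3),
             "income" = list("name" = "sample", x = (1:10)*1e3, replace = TRUE),
             "travel_time_car" = list("name" = "rlnorm", meanlog = 1))

# Hypothetical helper (not part of RprobitB): draw n values for a covariate.
# The number of draws is passed as the first unnamed argument, which R
# matches to `n` for rnorm() and rlnorm() and to `size` for sample().
draw_covariate = function(covariate, n, distr) {
  spec = distr[[covariate]]
  if (is.null(spec)) {
    return(rnorm(n))  # no specification: standard normal draws
  }
  do.call(what = spec[["name"]],
          args = c(list(n), spec[names(spec) != "name"]))
}

set.seed(1)
cost = draw_covariate("cost", n = 5, distr = distr)
income = draw_covariate("income", n = 5, distr = distr)
travel_time_train = draw_covariate("travel_time_train", n = 5, distr = distr)
```

Storing the function name as a string keeps the specification a plain list, so any sampling function whose remaining first argument is the number of draws can be used in the same way.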

Standardize covariates

Both simulate() and prepare() have the optional argument standardize, a character vector with the names of the covariates to standardize, i.e. to rescale to mean 0 and standard deviation 1. If standardize = "all", all covariates get standardized.

Covariates of type 1 or 3 must be referred to in the form covariate_alternative, for example travel_time_car.


Example: Simulated choice of transportation means

In our example of the simulated choice of transportation means, scaling income is reasonable and can improve model fitting. For demonstration purposes, we also standardize travel_time for each alternative:

standardize = c("income", "travel_time_car", "travel_time_bus",
                "travel_time_train")
data = simulate(form = form, N = N, T = T, J = J, re = re,
                alternatives = alternatives, distr = distr,
                standardize = standardize)
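The effect of standardization on a single covariate can be sketched in base R. The example assumes that standardizing means subtracting the sample mean and dividing by the sample standard deviation; the income values are simulated here only for illustration and are not taken from the data object above.

```r
set.seed(1)
income = sample((1:10) * 1e3, size = 100, replace = TRUE)

# Standardize: subtract the sample mean, divide by the sample standard deviation.
income_z = (income - mean(income)) / sd(income)

# Base R's scale() performs the same transformation (it returns a matrix).
stopifnot(isTRUE(all.equal(income_z, as.vector(scale(income)))))

print(c(mean = mean(income_z), sd = sd(income_z)))
```

After this transformation the covariate is on a unit scale, which is why a variable such as income, measured in thousands, is a natural candidate for standardization.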

Data summary

We can check if the data preparation or simulation worked as expected using the summary() function. The columns z and re indicate standardized and random effect covariates, respectively. The rest of the output is self-explanatory.

summary(data)


RprobitB documentation built on Nov. 12, 2021, 5:08 p.m.