Data Formatting and Encoding
In logitr: Logit Models w/Preference & WTP Space Utility Parameterizations

knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.retina = 3,
  comment = "#>"
)

Basic required format

The {logitr} package contains several example data sets that illustrate the required data structure. For example, the yogurt data set contains observations of yogurt purchases by a panel of 100 households [@Jain1994]. The choice is identified by the choice column, the observation ID is identified by the obsID column, and the columns price, feat, and brand can be used as model covariates:

library("logitr")

head(yogurt)

This data set also includes an alt variable, which identifies the alternatives in the choice set of each observation, and an id variable, which identifies the individual. The id variable is included because the data have a panel structure, with multiple choice observations from each individual.
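The layout described above can be sketched with a small hand-made data frame. The values below are hypothetical, chosen only to mirror the column roles in yogurt (id, obsID, alt, choice, price); the two checks are plain base R and are not part of {logitr}.

```r
# Toy long-format data (hypothetical values): one row per alternative,
# three alternatives per choice observation, two observations per individual.
toy <- data.frame(
  id     = rep(c(1, 2), each = 6),   # individual (panel) identifier
  obsID  = rep(1:4, each = 3),       # choice observation identifier
  alt    = rep(1:3, times = 4),      # alternative within the choice set
  choice = c(0, 1, 0,  1, 0, 0,  0, 0, 1,  0, 1, 0),
  price  = c(8.1, 6.1, 7.9,  9.8, 6.4, 7.5,  9.8, 6.1, 8.6,  9.8, 6.1, 7.9)
)

# Each observation should contain exactly one chosen alternative
choices_per_obs <- tapply(toy$choice, toy$obsID, sum)
all(choices_per_obs == 1)

# Each observation should belong to exactly one individual
ids_per_obs <- tapply(toy$id, toy$obsID, function(x) length(unique(x)))
all(ids_per_obs == 1)
```

Both checks return TRUE for this toy data, since every obsID has a single row with choice equal to 1 and a single id.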

Continuous versus discrete variables

Variables are modeled as either continuous or discrete based on their data type. Numeric variables are by default estimated with a single "slope" coefficient. For example, consider a data frame that contains a price variable with the levels $10, $15, and $20. Adding price to the pars argument in the main logitr() function would result in a single price coefficient for the "slope" of the change in price.
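To make the distinction concrete, the sketch below uses base R's model.matrix() as a stand-in for how a numeric column yields one covariate while a factor yields one per non-reference level. This is an illustration under that assumption only; it is not necessarily how logitr() builds its covariates internally, and the data frame df is hypothetical.

```r
# Hypothetical data frame with a numeric and a character column
df <- data.frame(
  price = c(10, 15, 20, 10, 15, 20),
  brand = c("a", "b", "c", "c", "a", "b")
)

# Inspect the type of each column
sapply(df, class)

# Numeric price: a single column, hence a single "slope" coefficient
X_cont <- model.matrix(~ price, data = df)
colnames(X_cont)

# Converting price to a factor treats each price level as discrete:
# one column per level except the first (the reference level)
df$price_f <- factor(df$price)
X_disc <- model.matrix(~ price_f, data = df)
colnames(X_disc)
```

With three price levels, the factor version produces two price columns (for 15 and 20) alongside the intercept, with 10 as the reference level.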

In contrast, categorical variables (i.e., character or factor type variables) are by default estimated with one coefficient for every level except the first, which serves as the reference level. The default reference level is determined alphabetically, but it can also be set by modifying the factor levels for that variable. For example, the default reference level for the brand variable is "dannon" as it is alphabetically first. To set "weight" as the reference level, the factor levels can be modified using the factor() function:

yogurt2 <- yogurt

brands <- c("weight", "hiland", "yoplait", "dannon")
yogurt2$brand <- factor(yogurt2$brand, levels = brands)
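The effect of the level order can be checked directly with levels(). The sketch below uses a short hypothetical brand vector rather than the yogurt data, and also shows base R's relevel() as an alternative that moves a single level to the front without listing the others.

```r
# Hypothetical brand vector using the same four brand names
brand <- c("dannon", "hiland", "weight", "yoplait", "weight", "dannon")

# Default: levels are sorted alphabetically, so "dannon" comes first
f_default <- factor(brand)
levels(f_default)

# Explicit level order, with "weight" first (the reference level)
f_weight <- factor(brand, levels = c("weight", "hiland", "yoplait", "dannon"))
levels(f_weight)

# relevel() moves one level to the front and keeps the rest in order
f_relevel <- relevel(f_default, ref = "weight")
levels(f_relevel)
```

Either approach puts "weight" first; they differ only in the order of the remaining levels, which changes the order of the estimated coefficients but not which level is the reference.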

Creating dummy coded variables

If you wish to create dummy-coded variables yourself for use in a model, I recommend the dummy_cols() function from the {fastDummies} package. For example, in the code below, I create dummy-coded columns for the brand variable and then use those variables as covariates in a model:

yogurt2 <- fastDummies::dummy_cols(yogurt2, "brand")

The yogurt2 data frame now has new dummy-coded columns for brand:

head(yogurt2)

Now I can use those columns as covariates. Leaving brand_hiland out of pars makes "hiland" the reference level:

mnl_pref_dummies <- logitr(
  data    = yogurt2,
  outcome = 'choice',
  obsID   = 'obsID',
  pars    = c(
    'price', 'feat', 'brand_yoplait', 'brand_dannon', 'brand_weight'
  )
)

summary(mnl_pref_dummies)
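If adding a package dependency is not desirable, columns with the same brand_<level> naming pattern can be built in base R. This is a minimal sketch on a hypothetical data frame df, not a replacement for dummy_cols(), which handles more cases.

```r
# Hypothetical stand-in for a data frame with a categorical brand column
df <- data.frame(
  brand = c("dannon", "hiland", "weight", "yoplait", "weight"),
  stringsAsFactors = FALSE
)

# Create one 0/1 column per brand level, named brand_<level>
for (level in sort(unique(df$brand))) {
  df[[paste0("brand_", level)]] <- as.integer(df$brand == level)
}

df
```

Each row has exactly one dummy column equal to 1. As in the model above, one of these columns should be left out of pars so that its level acts as the reference.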

Validating data before estimation

Before estimating a model, it is often helpful to validate that your data is properly formatted. The validate_data() function checks for common formatting errors and provides detailed diagnostic information. This can save time by catching errors before you attempt to estimate a model.

Basic validation

At a minimum, you should validate the outcome and obsID columns:

validation <- validate_data(
  data = yogurt,
  outcome = "choice",
  obsID = "obsID"
)

validation

The function returns a validation object that indicates whether the data is valid and provides summary information about the data structure.

Validation with parameters

You can also validate specific parameters to check for missing values or other issues:

validation <- validate_data(
  data = yogurt,
  outcome = "choice",
  obsID = "obsID",
  pars = c("price", "feat", "brand")
)

validation

Panel data validation

For panel data, you can validate the panel structure:

validation <- validate_data(
  data = yogurt,
  outcome = "choice",
  obsID = "obsID",
  pars = c("price", "feat", "brand"),
  panelID = "id"
)

validation

Common errors detected

The validate_data() function checks for several common formatting errors:

Multiple choices per observation: Each obsID should have exactly one choice (outcome = 1)
No choice in observation: Each obsID must have at least one choice
Non-contiguous observation blocks: All rows with the same obsID must be grouped together
Invalid outcome values: The outcome variable must only contain 0 and 1 (or TRUE and FALSE)
Missing values: Checks for missing values in required columns
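To show what checks of this kind involve, the sketch below implements rough base R versions of them. The helper check_choice_data() is hypothetical and written only for illustration; it is not the implementation of validate_data(), and its messages and return value are assumptions.

```r
# Hypothetical helper, NOT the package's implementation: returns a character
# vector describing the problems found (empty if none are found).
check_choice_data <- function(data, outcome, obsID) {
  problems <- character(0)
  y  <- data[[outcome]]
  id <- data[[obsID]]

  # Missing values in the required columns: stop early, since the
  # remaining checks assume complete data
  if (anyNA(y) || anyNA(id)) {
    return("missing values in outcome or obsID")
  }

  # Outcome must be 0/1 (TRUE/FALSE coerce to 1/0)
  if (!all(as.numeric(y) %in% c(0, 1))) {
    problems <- c(problems, "outcome has values other than 0 and 1")
  }

  # Contiguity: each obsID should form exactly one run of rows
  if (anyDuplicated(rle(as.character(id))$values) > 0) {
    problems <- c(problems, "rows with the same obsID are not contiguous")
  }

  # Exactly one chosen alternative per observation
  n_chosen <- tapply(as.numeric(y), id, sum)
  if (any(n_chosen > 1)) {
    problems <- c(problems, "an observation has multiple choices")
  }
  if (any(n_chosen == 0)) {
    problems <- c(problems, "an observation has no choice")
  }
  problems
}

# Toy data: observation 1 has two choices, observation 2 has none
toy <- data.frame(
  obsID  = c(1, 1, 1, 2, 2, 2),
  choice = c(1, 1, 0, 0, 0, 0)
)
check_choice_data(toy, outcome = "choice", obsID = "obsID")
```

For the toy data, the helper reports both the multiple-choice and the no-choice problem. The run-length trick for contiguity works because an obsID that reappears after a different obsID produces a repeated value in the run-length encoding.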

Here's an example of detecting an error:

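# Create problematic data with multiple choices in one observation
bad_data <- yogurt
# Rows 1 and 2 belong to the same observation, so marking both as chosen
# gives that observation more than one choice
bad_data$choice[1:2] <- 1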

validation <- validate_data(
  data = bad_data,
  outcome = "choice",
  obsID = "obsID"
)

validation