This example will use the gapminder data to demonstrate a multinomial logistic regression model using cluster bootstrapping to calculate estimates and variances and showing the resulting odds ratios and predicted probabilities.

List of vignettes I'd like to include later:

Data and Simple Model Fit

A multinomial outcome has >2 unordered categories. This fits the criteria of Earth's continents, which are included in the gapminder data, along with characteristics of each country in a given year.

library(gapminder)
head(gapminder)

Let's say we want to look at the relationship between each continent and life expectancy, population and GDP. One option is to use vglm() from the VGAM package, with family = multinomial(), to model the log odds of a data point being from a certain continent.[^1]

We'll use Africa as our reference category, and since Oceania has relatively few records, for purposes of model stability we'll combine it with Asia. (This may or may not be the best approach, but we'll leave that to sociologists.)

library(VGAM, quietly = TRUE)

## Combine Australia/NZ with Asia to help model stability
gapminder$continent <- with(gapminder, {
  ifelse(continent == "Oceania", "Asia", as.character(continent))
})

## Fit model with outcome = continent,
##  exposures = life expectancy, population, GDP
our_mod <- vglm(
  continent ~ lifeExp + pop + gdpPercap,
  data = gapminder,
  family = multinomial(refLevel = "Africa")
)

# ## Print model summary (commented out for length)
# summary(our_mod)

If you run the code above, you'll see we get three sets of coefficients; each set describes the log odds of a data point being from Europe, Asia or the Americas vs Africa.

Accounting for Clustering

But let's say we want to account for the fact that our data are clustered by year; for example, correlations between records from the same continent may vary over time, or correlations among the continents may vary over time.

Here, we'll use the bootstrap to account for this clustering. A basic outline of the process:

We'll repeat these steps B times, then use the coefficients from all B models to get overall coefficient point estimates and variance-covariance. We'll use those to calculate meaningful results from our model: odds ratios, predicted probabilities (plus confidence limits for both), and hypothesis tests.

(In our examples, we'll use a small number of bootstrap samples to save computation time; in real life, you'll want many more.)

Step 1: Create Our Datasets

ClusterMultinom::create_bootdata() takes as arguments:

  1. An original dataset (multiple rows per cluster; data.frame)
  2. A single clustering variable name (column name in original dataset; character)
  3. A positive integer representing the number of bootstraps requested
  4. A seed (optional, to ensure reproducibility; numeric)

It returns a list of length nboot, where each element is a cluster bootstrapped dataset.

library(ClusterMultinom)
library(purrr)
library(tibble)

gap_bootdfs <- create_bootdata(
  df = gapminder,
  cluster_var = "year",
  nboot = 10,
  seed = 1234
)

length(gap_bootdfs)

Step 2: Run the Model on Each Dataset

ClusterMultinom::summarize_multinom_list() takes as required arguments:

  1. A model formula
  2. A list of datasets (eg, one created by ClusterMultinom::create_bootdata())
  3. Which level of the outcome to use as a reference

It also takes as optional arguments:

  1. A "test" dataset (orgdf)

    The function will run the specified model on the test dataset first; if that model fit succeeds, it will proceed with fitting the model to each element of the given list. If the model fails, the function will stop with an error message. In a typical use case, you would use the original dataset (from which all the bootstraps are sampled) as a test dataset.

  2. What information to keep from test model fit (orginfo)

    If an original dataset is specified, the user might be interested in looking at the entire model object (residuals, etc); in this case, specify orginfo = "modobj". Often, we are only interested in the coefficients - for example, to compare coefficients from the original dataset to what we get from bootstrapping. Thus, the default is orginfo = "coefs". Sometimes we're not interested in keeping any information; we just want to make sure the model runs on the test dataset. In this case, specify orginfo = "nothing".

  3. The number of successful model fits required (nsuccfits)

    Depending on the distribution of the outcome and the complexity of the model, some bootstrap datasets may not produce a model fit that converges, so we recommend creating some "extra" datasets in Step 1. If this argument is specified, and there are more than nsuccfits successful model fits, the results will be restricted to as many elements as it took to get nsuccfits successful fits. If this argument is specified and the entire list results in fewer than N successful fits, a warning message will be displayed.

    For example, if the goal is to have 1000 successful bootstraps, you might create a list of 1100 data objects in Step 1 and specify nsuccfits = 1000 in Step 2.

In one step, we can run our specified model on each element of the list we created in Step 1 and summarize the most relevant information from each model fit. We use the original dataset as our "test" - if the model fails on the original dataset, we don't want to try it on the bootstraps.

## Fit model with outcome = continent, exposures = [life expectancy, population,
##   GDP] to each data.frame in our list
## For demo purposes, we ask for 7 successful model fits
our_mod_summary <- summarize_multinom_list(
  continent ~ lifeExp + pop + gdpPercap,
  df_list = gap_bootdfs,
  ref_level = "Africa",
  orgdf = gapminder,
  nsuccfits = 7
)

Our output includes the following elements:

  1. orgcoefs: Coefficients from the full model fit using our original data (here, the original gapminder dataset)
  2. allresults: A tibble which includes a column for each piece of information we saved from each model fit
    • fitsucc: Whether the model was fit without warnings or errors
    • coefs: A 1 * p tibble of coefficient estimates; NULL if the model failed
    • msgs: A character vector of warnings/errors; NULL if the model was fit successfully
  3. fitsuccess: A named numeric vector indicating the total number of successful and failed model fits
  4. coefs: A tibble with [number of successful model fits] rows * p columns.
our_mod_summary

FROM HERE ONWARDS IT'S NOT READY

Great, now we have correct variance-covariance between all our coefficients! Now we can get the correct results.

Everybody loves a p-value.

<< this is where we have a function that gets p-values for a single/group of covariates >>

But p-values are imperfect and don't tell us anything about directionality or magnitude of effect.

Let's look at odds ratios!

<< this is where we have a function that gets odds ratios + CIs for a single covariate given a design matrix >>
<< and we plot them >>

But for continuous covariates like life expectancy, sometimes we want to plot the association over an entire range of the covariate, especially if we used something like restricted cubic splines, where the slope in our association isn't constant.

<< this is where we have a function that gets predicted probabilities + CIs for a single covariate given a design matrix >>
<< and we plot them >>

[^1]: There are other options for multinomial regression in R, including nnet::multinom() and the mlogit package. We chose VGAM::vglm() because its warning messages were the most conservative and verbose, which is particularly helpful in the context of bootstrapping.



jenniferthompson/ClusterMultinom documentation built on May 7, 2019, 8:59 p.m.