In friendly/heplots: Visualizing Hypothesis Tests in Multivariate Linear Models

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.height=5,
  fig.width=5,
  # results='hide',
  # fig.keep='none',
  fig.path='fig/datasets-',
  echo=TRUE,
  collapse = TRUE,
  comment = "#>"
)

set.seed(1071)
options(width=80, digits=5, continue="  ")
library(heplots)
library(candisc)
library(ggplot2)
library(dplyr)

Documenting package datasets {-}

Datasets used in package examples are an important part of making a package understandable and usable, but they are often overlooked. In developing the heplots package I assembled a large collection of datasets illustrating a variety of multivariate linear models, with some analyses and graphical displays. Each of these has much more than the usual stub examples, which often look like:

data(dataset)
# str(dataset); plot(dataset)

But .Rd, and now roxygen, don't make it easy to work with numerous datasets in a package, or, more importantly, to document what they illustrate. I'm showing the work to create this vignette, in case these ideas are useful to others.

In this release, I started with a file generated by:

vcdExtra::datasets("heplots") |> head(4)

Then, in the roxygen documentation, I added @concept tags to classify these datasets according to the methods used. (@concept entries are indexed with the package, so they work via help.search().) For example, the documentation for the AddHealth data contains these lines:

#' @name AddHealth
#' @docType data
 ...
#' @keywords datasets
#' @concept MANOVA
#' @concept ordered

With standard processing, these concepts, along with the keywords, appear in the Index section of the manual constructed by devtools::build_manual(). In the pkgdown site for this package, they are also searchable from the search box.
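The concept index can also be queried from the R console. The following is a minimal sketch, assuming the heplots package is installed (the call is skipped otherwise); it restricts help.search() to the concept field and turns off fuzzy matching so that only entries containing "MANOVA" are returned.

```r
# Sketch: look up help topics tagged with a given @concept.
# Assumes heplots is installed; otherwise `hits` stays NULL.
hits <- NULL
if (requireNamespace("heplots", quietly = TRUE)) {
  hits <- help.search("MANOVA",
                      fields  = "concept",   # search only \concept entries
                      package = "heplots",
                      agrep   = FALSE)       # regular-expression match, no fuzzy matching
  # `matches` holds one row per matching help entry
  print(unique(hits$matches$Topic))
}
```

Interactively, printing `hits` itself shows the usual help-search listing.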

With a bit of extra processing, I created a file, datasets.csv, that is used below.
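The extra processing itself is not shown here. As a rough illustration only, and not the code actually used for this package, the sketch below pulls the @name and @concept values out of a roxygen block and collapses the concepts into one space-delimited `tags` string. The helper `tag_values()` and the single-row layout are assumptions made for this example.

```r
# Roxygen lines modeled on the AddHealth excerpt shown earlier
roxy <- c(
  "#' @name AddHealth",
  "#' @docType data",
  "#' @keywords datasets",
  "#' @concept MANOVA",
  "#' @concept ordered"
)

# Hypothetical helper: return the values that follow a given roxygen tag
tag_values <- function(lines, tag) {
  pattern <- paste0("^#'[[:space:]]*@", tag, "[[:space:]]+")
  trimws(sub(pattern, "", grep(pattern, lines, value = TRUE)))
}

# One row per dataset, with its concepts collapsed into a single string
entry <- data.frame(
  dataset = tag_values(roxy, "name"),
  tags    = paste(tag_values(roxy, "concept"), collapse = " "),
  stringsAsFactors = FALSE
)
print(entry)
```

Applied to every data documentation file and row-bound, this kind of step would give a table with one space-delimited `tags` entry per dataset.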

Methods {-}

The main methods used in the example datasets are listed below:

MANOVA: Multivariate analysis of variance
MANCOVA: Multivariate analysis of covariance
MMRA: Multivariate multiple regression
cancor: Canonical correlation (using the candisc package)
candisc: Canonical discriminant analysis (using candisc)
repeated: Repeated measures designs, analyzed from the multivariate perspective
robust: Robust estimation of MLMs

In addition, a few examples illustrate special handling for linear hypotheses concerning factors:

ordered: ordered factors
contrasts: other contrasts

The dataset names are linked to the documentation with graphical output on the pkgdown website, http://friendly.github.io/heplots/.

Dataset table {-}

library(here)
library(dplyr)
library(tinytable)
#dsets <- read.csv(here::here("extra", "datasets.csv"))  # doesn't work in a vignette
dsets <- read.csv("https://raw.githubusercontent.com/friendly/heplots/master/extra/datasets.csv")
dsets <- dsets |> 
  dplyr::select(-X) |> 
  arrange(tolower(dataset))

# link dataset to pkgdown doc
refurl <- "http://friendly.github.io/heplots/reference/"

dsets <- dsets |>
  mutate(dataset = glue::glue("[{dataset}]({refurl}{dataset}.html)")) 

#knitr::kable(dsets)
tinytable::tt(dsets)  |> format_tt(markdown = TRUE)

Concept table {-}

This table can be inverted to list the datasets that illustrate each concept:

concepts <- dsets |>
  select(dataset, tags) |>
  tidyr::separate_longer_delim(tags, delim = " ") |>
  arrange(tags, dataset) |>
  summarize(datasets = toString(dataset), .by = tags) |>
  rename(concept = tags)

#knitr::kable(concepts)
tinytable::tt(concepts) |> format_tt(markdown = TRUE)
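To see what the inversion does without downloading datasets.csv, here is a base-R sketch of the same idea on a made-up two-row table. "ToyData" and its tags are invented for illustration; only the AddHealth tags come from the excerpt shown earlier.

```r
toy <- data.frame(
  dataset = c("AddHealth", "ToyData"),        # "ToyData" is a made-up name
  tags    = c("MANOVA ordered", "MANOVA robust"),
  stringsAsFactors = FALSE
)

# Split the space-delimited tags: one row per (dataset, concept) pair
tag_list <- strsplit(toy$tags, " ", fixed = TRUE)
long <- data.frame(
  dataset = rep(toy$dataset, lengths(tag_list)),
  concept = unlist(tag_list),
  stringsAsFactors = FALSE
)

# Collapse the dataset names within each concept
by_concept <- tapply(long$dataset, long$concept,
                     function(x) toString(sort(x)))
inverted <- data.frame(
  concept  = names(by_concept),
  datasets = as.vector(by_concept),
  stringsAsFactors = FALSE
)
print(inverted)
```

The `strsplit()` and `rep()` step plays the role of `separate_longer_delim()`, and `tapply()` with `toString()` plays the role of the grouped `summarize()`.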