Tidy inference for paired data
In infer: Tidy Statistical Inference

knitr::opts_chunk$set(fig.width = 6, fig.height = 4.5) 
options(digits = 4)

library(ggplot2)
library(dplyr)
library(infer)

Introduction

In this vignette, we'll walk through conducting a randomization-based paired test of independence with infer.

Throughout this vignette, we'll make use of the gss dataset supplied by infer, which contains a sample of data from the General Social Survey. See ?gss for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let's suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:

dplyr::glimpse(gss)

Two sets of observations are paired if each observation in one column has a special correspondence or connection with exactly one observation in the other. For the purposes of this vignette, we'll simulate an additional data variable with a natural pairing: suppose that each of these survey respondents had provided the number of hours worked per week when surveyed 5 years prior, encoded as hours_previous.

set.seed(1)

gss_paired <- gss %>%
   mutate(
      hours_previous = hours + 5 - rpois(nrow(.), 4.8),
      diff = hours - hours_previous
   )

gss_paired %>%
   select(hours, hours_previous, diff)

The number of hours worked per week by a particular respondent has a special correspondence with the number of hours worked 5 years prior hours_previous by that same respondent. We'd like to test the null hypothesis that the "mean" hours worked per week did not change between the sampled time and five years prior.

To carry out inference on paired data with infer, we pre-compute the difference between paired values at the beginning of the analysis, and use those differences as our values of interest.

Here, we pre-compute the difference between paired observations as diff. The distribution of diff in the observed data looks like this:

unique_diff <- unique(gss_paired$diff)
gss_paired %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = diff) +
  ggplot2::geom_histogram(bins = diff(range(unique_diff))) +
  ggplot2::labs(x = "diff: Difference in Number of Hours Worked",
                y = "Number of Responses") +
  ggplot2::scale_x_continuous(breaks = c(range(unique_diff), 0))

From the looks of the distribution, most respondents worked a similar number of hours worked per week as they had 5 hours prior, though it seems like there may be a slight decline of number of hours worked per week in aggregate. (We know that the true effect is -.2 since we've simulated this data.)

We calculate the observed statistic in the paired setting in the same way that we would outside of the paired setting. Using specify() and calculate():

# calculate the observed statistic
observed_statistic <- 
   gss_paired %>% 
   specify(response = diff) %>% 
   calculate(stat = "mean")

The observed statistic is r observed_statistic. Now, we want to compare this statistic to a null distribution, generated under the assumption that the true difference was actually zero, to get a sense of how likely it would be for us to see this observed difference if there were truly no change in hours worked per week in the population.

Tests for paired data are carried out via the null = "paired independence" argument to hypothesize().

# generate the null distribution
null_dist <- 
   gss_paired %>% 
   specify(response = diff) %>% 
   hypothesize(null = "paired independence") %>%
   generate(reps = 1000, type = "permute") %>%
   calculate(stat = "mean")

null_dist

For each replicate, generate() carries out type = "permute" with null = "paired independence" by:

Randomly sampling a vector of signs (i.e. -1 or 1), probability .5 for either, with length equal to the input data, and
Multiplying the response variable by the vector of signs, "flipping" the observed values for a random subset of value in each replicate

To get a sense for what this distribution looks like, and where our observed statistic falls, we can use visualize():

# visualize the null distribution and test statistic
null_dist %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided")

It looks like our observed mean of r observed_statistic would be relatively unlikely if there were truly no change in mean number of hours worked per week over this time period.

More exactly, we can calculate the p-value:

# calculate the p value from the test statistic and null distribution
p_value <- null_dist %>%
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")

p_value

Thus, if the change in mean number of hours worked per week over this time period were truly zero, our approximation of the probability that we would see a test statistic as or more extreme than r observed_statistic is approximately r p_value.

We can also generate a bootstrap confidence interval for the mean paired difference using type = "bootstrap" in generate(). As before, we use the pre-computed differences when generating bootstrap resamples:

# generate a bootstrap distribution
boot_dist <- 
   gss_paired %>% 
   specify(response = diff) %>% 
   hypothesize(null = "paired independence") %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "mean")

visualize(boot_dist)

Note that, unlike the null distribution of test statistics generated earlier with type = "permute", this distribution is centered at observed_statistic.

Calculating a confidence interval:

# calculate the confidence from the bootstrap distribution
confidence_interval <- boot_dist %>%
  get_confidence_interval(level = .95)

confidence_interval

By default, get_confidence_interval() constructs the lower and upper bounds by taking the observations at the $(1 - .95) / 2$ and $1 - ((1-.95) / 2)$th percentiles. To instead build the confidence interval using the standard error of the bootstrap distribution, we can write:

boot_dist %>%
  get_confidence_interval(type = "se",
                          point_estimate = observed_statistic,
                          level = .95)

To learn more about randomization-based inference for paired observations, see the relevant chapter in Introduction to Modern Statistics.