In UBC-MDS/rfer: Rewriting of `infer` package for practice purposes

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Overview

rfer is a reimplementation of the infer R package, which offers a tidy approach to statistical inference built on top of the tidyverse.

The infer package streamlines reshuffling and bootstrapping samples, calculating summary statistics and confidence intervals, and performing hypothesis tests. It does this with a small set of functions designed for clear, expressive code and a consistent statistical grammar that makes explicit how values are calculated and how tests are evaluated.

With this package as the inspiration, rfer will have four main functions (specify, generate, calculate, get_ci) in its first iteration. Given a data frame and a specified response variable, these functions calculate summary statistics and confidence intervals for that variable. Further details follow in the description of each function below.

Data preparation

To show how the rfer package works, we'll use the familiar iris dataset. Boring, we know, but it is fairly straightforward to interpret and will get you up to speed with this package faster than other datasets. We also prepare a rough point estimate from the mtcars dataset, which is used in the overall example at the end.

library(rfer)
library(dplyr)

set.seed(41)

iris_df <- iris %>%
  mutate(Species = factor(Species))

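# Rough way to get a point estimate for the mean of hp (used in the overall example)

hp_point <- mtcars %>%
  specify(response = "hp") %>%
  generate(n_samples = 1) %>%
  calculate(column = "hp", stat = "mean")

# Extract the second element of the result to use as the point estimate
hp_point_estimate <- hp_point[[2]]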

The Specify Function

The specify function creates the data frame used in the remainder of the pipeline. It contains the response variable under study and, optionally, some explanatory variables.

Sep_Width <- iris_df %>%
  specify(response = "Sepal.Width")

Sep_Width
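As a rough illustration of the idea behind this step, the selection can be sketched in base R. The helper `specify_sketch` and its `explanatory` argument are hypothetical names used only for this example; this is not rfer's implementation, and the object rfer returns may carry additional structure.

```r
# Hypothetical stand-in for specify(): keep the response column (and any
# explanatory columns) as a plain data frame. Not rfer's implementation.
specify_sketch <- function(data, response, explanatory = NULL) {
  cols <- c(response, explanatory)
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found: ", paste(missing_cols, collapse = ", "))
  }
  # drop = FALSE keeps a data frame even when a single column is selected
  data[, cols, drop = FALSE]
}

sep_width_sketch <- specify_sketch(iris, response = "Sepal.Width")
str(sep_width_sketch)
```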

The Generate Function

The generate function creates resamples of the specified data. The number of resamples is set by the n_samples parameter.

Sep_width_resamples <- Sep_Width %>%
  generate(n_samples = 20)

head(Sep_width_resamples)
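For readers unfamiliar with bootstrapping, the following base R sketch shows one common way to build such resamples: draw rows with replacement, once per replicate, and stack the results. It assumes bootstrap-style resampling with each resample the same size as the original data; `generate_sketch` and the `replicate` column name are hypothetical, and rfer's actual output layout may differ.

```r
# Hypothetical stand-in for generate(): bootstrap resampling in base R.
# Assumes each resample has as many rows as the original data.
generate_sketch <- function(data, n_samples) {
  n <- nrow(data)
  resamples <- lapply(seq_len(n_samples), function(i) {
    # Draw n row indices with replacement for replicate i
    idx <- sample.int(n, size = n, replace = TRUE)
    data.frame(replicate = i, data[idx, , drop = FALSE], row.names = NULL)
  })
  # Stack all replicates into one long data frame
  do.call(rbind, resamples)
}

set.seed(41)
resamples_sketch <- generate_sketch(iris["Sepal.Width"], n_samples = 20)
head(resamples_sketch)
```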

The Calculate Function

The calculate function computes a statistic for each of the resampled groups. In the current release, only the 'mean' statistic is available.

Sep_width_means <- Sep_width_resamples %>%
  calculate(column="Sepal.Width",stat="mean")

Sep_width_means
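Conceptually, this step reduces each resample to a single number. A minimal base R sketch of a per-replicate mean is shown below; it assumes the resamples are stored in long form with a hypothetical `replicate` column identifying each group, which may not match rfer's internal layout.

```r
# Build 20 bootstrap resamples in long form (replicate id + value).
# Sampling 20 * n values with replacement and labelling them in blocks of n
# is equivalent to drawing 20 independent resamples of size n.
set.seed(41)
values <- iris$Sepal.Width
resamples <- data.frame(
  replicate = rep(1:20, each = length(values)),
  Sepal.Width = sample(values, size = 20 * length(values), replace = TRUE)
)

# One mean per replicate: the analogue of calculate(..., stat = "mean")
means_sketch <- aggregate(Sepal.Width ~ replicate, data = resamples, FUN = mean)
means_sketch
```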

The Get_CI Function

The get_ci function finds the confidence interval of the statistic calculated across the resampled groups. The user sets the confidence level to a value strictly between 0 and 1.

Sep_width_CI <- Sep_width_means %>%
  rfer::get_ci(column = "Sepal.Width", confidence_level = 0.9)

Sep_width_CI

Note that the point estimate is N/A in the output above. It was not specified by the user, and the percentile method does not require it. If a point estimate is specified, its value is displayed in the output.
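To make the percentile method concrete, here is a small base R sketch: for a confidence level of 0.9, the interval bounds are the 5th and 95th percentiles of the resampled statistics, which is why no point estimate is needed. The helper `percentile_ci_sketch` and its output column names are hypothetical, and 1000 resamples are used here only to get a stable interval; this is not rfer's implementation.

```r
# Hypothetical percentile-method confidence interval in base R.
percentile_ci_sketch <- function(stats, confidence_level = 0.9) {
  # The level must lie strictly between 0 and 1
  stopifnot(confidence_level > 0, confidence_level < 1)
  alpha <- 1 - confidence_level
  # Lower and upper bounds are the alpha/2 and 1 - alpha/2 quantiles
  bounds <- quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  data.frame(
    lower = bounds[1],
    upper = bounds[2],
    confidence_level = confidence_level
  )
}

set.seed(41)
# Bootstrap distribution of the mean sepal width
boot_means <- replicate(1000, mean(sample(iris$Sepal.Width, replace = TRUE)))
ci_sketch <- percentile_ci_sketch(boot_means, confidence_level = 0.9)
ci_sketch
```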

An Overall Example

Ultimately, the objective of the rfer package is to combine all of the above functions into a streamlined way to calculate confidence intervals (and eventually other estimates of interest) once a column or variable is specified. In the example below, we use the mtcars dataset (just to shake things up a little) to arrive at the 90% confidence interval for the mean hp across all the cars.

mtcars %>%
  specify(response = "hp") %>%
  generate(n_samples = 10, type = "bootstrap") %>%
  calculate(column = "hp", stat = "mean") %>%
  rfer::get_ci(
    column = "hp",
    confidence_level = 0.9,
    point_estimate = hp_point_estimate,
    type = "percentile"
  )