southwick-associates/sastats: Calculate Statistics

Overview

Outliers should be considered when working with continuous survey variables. There are no hard-and-fast rules for identifying outliers, but Tukey's test offers one method that is easy to apply consistently. You can view a production example for B4W-19-01.
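For reference, Tukey's rule is usually stated as flagging values that lie more than k interquartile ranges below the first quartile or above the third, with k = 1.5 by convention. The base R sketch below illustrates that general rule only. tukey_flag() is a hypothetical helper written for this example, not the sastats implementation, and its use of R's default quantile method is an assumption.

```r
# Hypothetical helper illustrating Tukey's rule (not the sastats source).
# Flags values more than k * IQR below the first quartile or above the third.
tukey_flag <- function(x, k = 1.5) {
    q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - k * iqr
    upper <- q[2] + k * iqr
    # missing values are never flagged
    !is.na(x) & (x < lower | x > upper)
}

# Made-up example values: one respondent reports far more days than the rest
days <- c(1, 2, 2, 3, 4, 5, 6, 8, 10, 120)
flagged <- tukey_flag(days)
days[flagged]
```

In this toy vector only the value 120 falls outside the fences; 10 stays inside them.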

Package sastats includes a few simple outlier functions:

For visualizing: outlier_plot()
For identification: outlier_tukey()
For top-coding: outlier_tukey_top()

Example Data

For demonstration, I included a survey dataset with annual participation metrics for 9 outdoor recreation activities:

library(dplyr)
library(sastats)

data(svy) # list with 2 data frames: person, act
activities <- svy$act

glimpse(activities)

Visualize

Visualizing the data is a good first step. We can use outlier_plot(), which is largely a wrapper for a few ggplot2 functions. Setting ignore_zero = TRUE excludes respondents who didn't actually participate.

outlier_plot(activities, days, act, ignore_zero = TRUE)

The resulting plot shows that the distributions are highly skewed and difficult to read. Additionally, the position of the whiskers suggests that we would be flagging many reasonable values as outliers (e.g., those above 20 or so for fishing).

Log-transforming the y-axis (apply_log = TRUE) produces distributions that are closer to normal, and likely provides a more reasonable criterion for outlier identification. Note that we don't need to supply ignore_zero = TRUE since log(0) is undefined.

outlier_plot(activities, days, act, apply_log = TRUE)
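The effect of the log scale can also be checked numerically. The sketch below applies Tukey's rule (values beyond 1.5 interquartile ranges from the quartiles, an assumed default) to simulated right-skewed data, once on the raw values and once on their logs. The simulated data and the count_flagged() helper are illustrative assumptions and do not use sastats.

```r
set.seed(42)
# Simulated right-skewed "days" (log-normal); purely illustrative data
sim_days <- rlnorm(1000, meanlog = 2, sdlog = 1)

# Hypothetical helper: count values outside Tukey's fences
count_flagged <- function(x, k = 1.5) {
    q <- quantile(x, c(0.25, 0.75), names = FALSE)
    iqr <- q[2] - q[1]
    sum(x < q[1] - k * iqr | x > q[2] + k * iqr)
}

n_raw <- count_flagged(sim_days)       # fences computed on the raw scale
n_log <- count_flagged(log(sim_days))  # fences computed on the log scale
c(raw = n_raw, log = n_log)
```

On data shaped like this, the raw-scale fences typically flag several times as many values as the log-scale fences, which matches what the two plots suggest visually.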

Flag Outliers

We can use outlier_tukey() to flag the values identified as outliers:

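activities <- activities %>%
    group_by(act) %>%
    mutate(
        # flag outliers separately within each activity, using the log scale
        is_outlier = outlier_tukey(days, apply_log = TRUE),
        # set flagged values to missing
        days_cleaned = ifelse(is_outlier, NA, days)
    ) %>%
    ungroup()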

outlier_plot(activities, days, act, apply_log = TRUE, show_outliers = TRUE)

We also have a couple of summary functions available to demonstrate the effects of outlier removal:

outlier_pct(activities, act)

outlier_mean_compare(activities, days, days_cleaned, act)
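As a rough idea of what summaries like these report, the base R sketch below computes the percentage of flagged values and the mean before and after removal for each group. The toy data frame is made up for illustration, and the actual output format of outlier_pct() and outlier_mean_compare() may differ.

```r
# Toy data (made up): two activities with outliers already flagged
toy <- data.frame(
    act = c("fish", "fish", "fish", "fish", "hunt", "hunt", "hunt", "hunt"),
    days = c(2, 4, 6, 200, 1, 3, 5, 7),
    is_outlier = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)
toy$days_cleaned <- ifelse(toy$is_outlier, NA, toy$days)

# One row per activity: share flagged, then mean days with and without outliers
by_act <- split(toy, toy$act)
summary_tbl <- data.frame(
    pct_flagged = sapply(by_act, function(d) mean(d$is_outlier) * 100),
    mean_raw = sapply(by_act, function(d) mean(d$days)),
    mean_cleaned = sapply(by_act, function(d) mean(d$days_cleaned, na.rm = TRUE))
)
summary_tbl
```

A single extreme value moves the fishing mean from 53 to 4 days in this toy example, while the hunting mean is unchanged because nothing was flagged.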

Topcode

Instead of removing outliers, we could use outlier_tukey_top() to identify the topcode value and then recode the flagged values to it:

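activities <- activities %>%
    group_by(act) %>%
    mutate(
        # topcode value for each activity, using the log scale
        topcode_value = outlier_tukey_top(days, apply_log = TRUE),
        # recode flagged values (is_outlier from the previous step) to the topcode value
        days_cleaned = ifelse(is_outlier, topcode_value, days)
    ) %>%
    ungroup()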

outlier_mean_compare(activities, days, days_cleaned, act)
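
The idea behind top-coding can be sketched in base R: compute the upper Tukey fence on the log scale, transform it back with exp(), and cap larger values at that threshold. The helper name upper_fence_log(), k = 1.5, and the exclusion of zeros before taking logs are assumptions for this example and may not match what outlier_tukey_top() does internally.

```r
# Hypothetical helper: upper Tukey fence computed on log(x), returned on the
# original scale. Zeros and missing values are excluded before taking logs.
upper_fence_log <- function(x, k = 1.5) {
    lx <- log(x[!is.na(x) & x > 0])
    q <- quantile(lx, c(0.25, 0.75), names = FALSE)
    exp(q[2] + k * (q[2] - q[1]))
}

# Made-up example values with one extreme response
days <- c(0, 1, 2, 4, 8, 16, 32, 5000)
cap <- upper_fence_log(days)

# pmin() only lowers values above the cap; everything else is unchanged
days_topcoded <- pmin(days, cap)
days_topcoded
```

Here the cap works out to about 512, so only the value 5000 is recoded. Capping with pmin() affects high values only, which is one way to keep unusually low values from being raised to the topcode.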