Outliers should be considered when working with continuous survey variables. There are no hard-and-fast rules for identifying outliers, but Tukey's test, which flags values lying more than 1.5 times the interquartile range beyond the quartiles, is one method that is easy to apply consistently. You can view a production example for B4W-19-01.
Package sastats includes a few simple outlier functions:
outlier_plot()
outlier_tukey()
outlier_tukey_top()
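For readers unfamiliar with Tukey's rule, the idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `tukey_flag()` is a hypothetical helper, not a sastats function, and the `days` values are made up.

```r
# Hypothetical helper (not part of sastats): flag values outside Tukey's fences
tukey_flag <- function(x, k = 1.5) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - k * iqr  # lower fence
    upper <- q[2] + k * iqr  # upper fence
    x < lower | x > upper
}

days <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # made-up values
tukey_flag(days)  # only the last value (100) is flagged
```

The multiplier `k = 1.5` is the conventional choice; a larger value would flag fewer observations.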
For demonstration, I included a survey dataset with annual participation metrics for 9 outdoor recreation activities:
library(dplyr)
library(sastats)

data(svy) # list with 2 data frames: person, act
activities <- svy$act
glimpse(activities)
Visualizing the data is a good first step. We can use outlier_plot(), which is largely a wrapper for a few ggplot2 functions. The ignore_zero = TRUE specification ensures we exclude any respondents who didn't actually participate.
outlier_plot(activities, days, act, ignore_zero = TRUE)
After running this function, we can see that the distributions are highly skewed and difficult to view. Additionally, the position of the whiskers suggests that we would be flagging many reasonable values as outliers (e.g., those above 20 or so for fishing).
Log-transforming the y-axis (apply_log = TRUE) produces more normal distributions, and likely provides a more reasonable criterion for outlier identification. Note that we don't need to supply ignore_zero = TRUE since log(0) is undefined, so zero values drop out of the log-scaled plot anyway.
outlier_plot(activities, days, act, apply_log = TRUE)
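To see why the log scale matters for skewed data, the sketch below compares Tukey's upper fence computed on raw values with the same fence computed on logged values and converted back to days. It is a base R illustration only: `upper_fence()` and its `use_log` argument are hypothetical, the data are made up, and none of this is claimed to match the internals of sastats.

```r
# Hypothetical helper (not part of sastats): Tukey's upper fence,
# computed on the raw scale or on the log scale
upper_fence <- function(x, k = 1.5, use_log = FALSE) {
    x <- x[!is.na(x) & x > 0]  # drop zeros (non-participants); log(0) is -Inf in R
    if (use_log) x <- log(x)
    q <- quantile(x, c(0.25, 0.75), names = FALSE)
    fence <- q[2] + k * (q[2] - q[1])
    if (use_log) exp(fence) else fence  # back-transform to the original units
}

days <- c(1, 1, 2, 2, 3, 5, 8, 10, 20, 30, 60, 365)  # made-up, right-skewed
upper_fence(days)                  # fence on the raw scale
upper_fence(days, use_log = TRUE)  # fence on the log scale, expressed in days
```

With these made-up values the log-scale fence sits well above the raw-scale one, so fewer of the large but plausible values would be flagged.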
We can use outlier_tukey() to flag the values identified as outliers:
activities <- activities %>%
    group_by(act) %>%
    mutate(
        is_outlier = outlier_tukey(days, apply_log = TRUE),
        days_cleaned = ifelse(is_outlier, NA, days)
    ) %>%
    ungroup()

outlier_plot(activities, days, act, apply_log = TRUE, show_outliers = TRUE)
We also have a couple of summary functions available to show the effects of outlier removal:
outlier_pct(activities, act)
outlier_mean_compare(activities, days, days_cleaned, act)
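Comparable summaries can be computed by hand, which may help when checking results. The base R sketch below uses a made-up data frame whose columns mirror those created above. It shows one plausible way to get the share of flagged values and the means before and after removal per activity, and is not necessarily what the sastats functions compute or return.

```r
# Made-up data; column names mirror the ones above, values are hypothetical
df <- data.frame(
    act = c("fish", "fish", "fish", "fish", "hunt", "hunt", "hunt", "hunt"),
    days = c(2, 5, 10, 300, 1, 3, 4, 6),
    is_outlier = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)
df$days_cleaned <- ifelse(df$is_outlier, NA, df$days)

# Percent of values flagged within each activity
pct_flagged <- tapply(df$is_outlier, df$act, mean) * 100

# Mean days per activity, before and after setting outliers to NA
mean_raw <- tapply(df$days, df$act, mean)
mean_cleaned <- tapply(df$days_cleaned, df$act, mean, na.rm = TRUE)

data.frame(
    act = names(mean_raw),
    pct_flagged = as.vector(pct_flagged),
    mean_raw = as.vector(mean_raw),
    mean_cleaned = as.vector(mean_cleaned)
)
```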
Instead of removing outliers, we could use outlier_tukey_top() to identify the topcode value and then recode accordingly:
activities <- activities %>%
    group_by(act) %>%
    mutate(
        topcode_value = outlier_tukey_top(days, apply_log = TRUE),
        days_cleaned = ifelse(is_outlier, topcode_value, days)
    ) %>%
    ungroup()

outlier_mean_compare(activities, days, days_cleaned, act)
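The topcoding idea can also be sketched in base R. In the example below, `topcode()` is a hypothetical helper (not a sastats function) that assumes the topcode value is Tukey's upper fence computed on the log scale and converted back to the original units. The data are made up.

```r
# Hypothetical helper (not part of sastats): cap values at a log-scale Tukey upper fence
topcode <- function(x, k = 1.5) {
    pos <- x[!is.na(x) & x > 0]  # fence is estimated from positive values only
    q <- quantile(log(pos), c(0.25, 0.75), names = FALSE)
    cap <- exp(q[2] + k * (q[2] - q[1]))  # upper fence, back in original units
    pmin(x, cap)  # values above the cap are set to the cap; others are unchanged
}

days <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 400)  # made-up values
topcode(days)  # the extreme value is pulled down to the cap
```

Using `pmin()` only ever lowers values that exceed the cap, so any unusually low values are left as they are. Unlike removal, topcoding keeps every respondent in the data while limiting the influence of extreme values on the mean.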