sastats: Calculate Statistics

Overview

Outliers should be considered when working with continuous survey variables. There are no hard and fast rules for outlier identification, but Tukey’s Test provides one method that is easy to apply in a standard way. You can view a production example for B4W-19-01.
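As a rough illustration of the rule itself (not the sastats implementation), Tukey’s Test flags values that lie more than k times the interquartile range below the first quartile or above the third quartile, with k = 1.5 by convention. The helper name tukey_flag() and the toy values below are hypothetical and use base R only:

```r
# Hypothetical helper, not part of sastats: flag values outside Tukey's fences
tukey_flag <- function(x, k = 1.5) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - k * iqr
    upper <- q[2] + k * iqr
    # Missing inputs stay missing in the result
    x < lower | x > upper
}

# Made-up participation days for one activity
days <- c(1, 2, 2, 3, 4, 5, 7, 10, 15, NA, 200)
flags <- tukey_flag(days)
days[which(flags)]
```

Here only the value 200 falls outside the fences, and the missing value is neither flagged nor cleared.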

The sastats package includes a few simple outlier functions:

For visualizing: outlier_plot()
For identification: outlier_tukey()
For top-coding: outlier_tukey_top()

For demonstration, I included a survey dataset with annual participation metrics for 9 outdoor recreation activities:

library(dplyr)
library(sastats)

data(svy) # list with 2 data frames: person, act
activities <- svy$act

glimpse(activities)
#> Observations: 11,268
#> Variables: 4
#> $ Vrid <chr> "98", "99", "100", "101", "102", "103", "105", "106", "107", "...
#> $ act  <chr> "trail", "trail", "trail", "trail", "trail", "trail", "trail",...
#> $ part <chr> "Unchecked", "Unchecked", "Unchecked", "Unchecked", "Unchecked...
#> $ days <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 15, 10, NA, 2, NA, NA, 10, NA,...

Visualizing the data is a good first step. We can use outlier_plot(), which is largely a wrapper for a few ggplot2 functions. Setting ignore_zero = TRUE excludes any respondents who didn’t actually participate.

outlier_plot(activities, days, act, ignore_zero = TRUE)
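For readers who want to see roughly what such a wrapper does, a comparable boxplot can be built directly with ggplot2. This is a sketch under assumptions: the toy data frame is made up, and the actual layers used inside outlier_plot() may differ.

```r
library(ggplot2)

# Toy data standing in for svy$act (values are made up for illustration)
toy <- data.frame(
    act  = rep(c("fish", "hunt"), each = 6),
    days = c(0, 2, 5, 10, 30, 120, NA, 1, 3, 4, 8, 40)
)

# Drop missing values and zeros, mirroring the intent of ignore_zero = TRUE
toy_part <- subset(toy, !is.na(days) & days > 0)

# One box per activity; whiskers follow the boxplot's default 1.5 * IQR rule
p <- ggplot(toy_part, aes(x = act, y = days)) +
    geom_boxplot()
p
```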

After running this function, we can see that the distributions are highly skewed and difficult to view. Additionally, the position of the whiskers suggests that we would be flagging many reasonable values as outliers (e.g., those above 20 or so for fishing).

Log-transforming the y-axis (apply_log = TRUE) produces more nearly normal distributions and likely provides a more reasonable criterion for outlier identification. Note that we don’t need to supply ignore_zero = TRUE since log(0) is undefined.

outlier_plot(activities, days, act, apply_log = TRUE)
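To make the effect of the log scale concrete, the sketch below compares the upper Tukey fence on the raw scale with the fences computed on log(days) for a small made-up vector. The helper tukey_flag_log() is hypothetical and is not the code behind apply_log = TRUE in sastats.

```r
# Hypothetical helper, not part of sastats: Tukey's fences applied to log(x)
tukey_flag_log <- function(x, k = 1.5) {
    x[!is.na(x) & x <= 0] <- NA   # log is undefined for zero, so treat as missing
    lx <- log(x)
    q <- quantile(lx, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lx < q[1] - k * iqr | lx > q[2] + k * iqr
}

# Made-up, right-skewed participation days
days <- c(1, 2, 2, 3, 4, 5, 7, 10, 15, 30, 60, 365)

# Upper fence on the raw scale
raw_q <- quantile(days, c(0.25, 0.75), names = FALSE)
raw_flag <- days > raw_q[2] + 1.5 * diff(raw_q)

# Fences on the log scale
log_flag <- tukey_flag_log(days)

days[raw_flag]   # flagged on the raw scale
days[log_flag]   # flagged on the log scale
```

On this toy vector the raw-scale rule flags both 60 and 365, while the log-scale rule flags only 365, which matches the intuition that the log scale is less aggressive for skewed data.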

We can use outlier_tukey() to flag the values identified as outliers and set them to missing in a cleaned variable:

activities <- activities %>%
    group_by(act) %>% # apply the test separately for each activity
    mutate(
        is_outlier = outlier_tukey(days, apply_log = TRUE),
        days_cleaned = ifelse(is_outlier, NA, days) # drop flagged values
    ) %>%
    ungroup()

outlier_plot(activities, days, act, apply_log = TRUE, show_outliers = TRUE)

We also have a couple of summary functions available to demonstrate the effects of outlier removal:

outlier_pct(activities, act)
#> # A tibble: 8 x 4
#> # Groups:   act [8]
#>   act      is_outlier     n pct_outliers
#>   <chr>    <lgl>      <int>        <dbl>
#> 1 camp     TRUE           6       0.479 
#> 2 fish     TRUE           3       0.240 
#> 3 hunt     TRUE           1       0.0799
#> 4 picnic   TRUE          15       1.20  
#> 5 snow     TRUE           2       0.160 
#> 6 trail    TRUE           4       0.319 
#> 7 water    TRUE           4       0.319 
#> 8 wildlife TRUE          13       1.04
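
The percentages above appear to use every row in the activity group as the denominator: with 11,268 rows spread over 9 activities (1,252 each), 6 / 1,252 × 100 ≈ 0.479 for camp. A base R sketch of that calculation on made-up flags follows. The implementation inside outlier_pct() may differ.

```r
# Toy data (made up): flags as produced by a step like outlier_tukey()
toy <- data.frame(
    act = rep(c("camp", "fish"), each = 5),
    is_outlier = c(TRUE, FALSE, FALSE, NA, FALSE, FALSE, FALSE, NA, NA, FALSE)
)

# Count flagged rows per group, treating missing flags as not flagged
n_flagged <- tapply(toy$is_outlier, toy$act, function(f) sum(f, na.rm = TRUE))

# Denominator: all rows in the group, including those with missing days
n_rows <- tapply(toy$is_outlier, toy$act, length)

pct_outliers <- 100 * n_flagged / n_rows
pct_outliers
```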

outlier_mean_compare(activities, days, days_cleaned, act) 
#> # A tibble: 9 x 3
#>   act       days days_cleaned
#>   <chr>    <dbl>        <dbl>
#> 1 bike     31.6         31.6 
#> 2 camp     11.2          8.58
#> 3 fish     11.6          9.40
#> 4 hunt      9.37         8.34
#> 5 picnic   17.5         13.1 
#> 6 snow      9.99         9.29
#> 7 trail    28.4         25.5 
#> 8 water    12.4         10.5 
#> 9 wildlife 30.5         21.8

Instead of removing outliers, we could use outlier_tukey_top() to identify the top-code value and then recode the flagged values accordingly:

activities <- activities %>%
    group_by(act) %>%
    mutate(
        topcode_value = outlier_tukey_top(days, apply_log = TRUE),
        # is_outlier was created by the earlier outlier_tukey() step
        days_cleaned = ifelse(is_outlier, topcode_value, days)
    ) %>%
    ungroup()
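
The idea behind top-coding can be sketched in base R: compute the upper Tukey fence on the log scale, convert it back to days, and cap larger values at that limit. The helper tukey_top_log() is hypothetical, and whether outlier_tukey_top() computes its value in exactly this way is an assumption.

```r
# Hypothetical helper, not part of sastats: upper Tukey fence from log(x),
# returned on the original scale
tukey_top_log <- function(x, k = 1.5) {
    lx <- log(x[!is.na(x) & x > 0])
    q <- quantile(lx, c(0.25, 0.75), names = FALSE)
    exp(q[2] + k * (q[2] - q[1]))
}

# Made-up, right-skewed participation days
days <- c(1, 2, 2, 3, 4, 5, 7, 10, 15, 30, 60, 365)

top <- tukey_top_log(days)
days_topcoded <- pmin(days, top)   # values above the fence are replaced by it
days_topcoded
```

Unlike removal, this keeps the respondent in the analysis while limiting the influence of the extreme value on the mean.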

outlier_mean_compare(activities, days, days_cleaned, act)
#> # A tibble: 9 x 3
#>   act       days days_cleaned
#>   <chr>    <dbl>        <dbl>
#> 1 bike     31.6         31.6 
#> 2 camp     11.2          9.30
#> 3 fish     11.6         10.5 
#> 4 hunt      9.37         9.05
#> 5 picnic   17.5         15.8 
#> 6 snow      9.99         9.91
#> 7 trail    28.4         28.1 
#> 8 water    12.4         11.5 
#> 9 wildlife 30.5         28.6

southwick-associates/sastats documentation built on March 27, 2020, 9:39 p.m.