vignettes/ggsigmark.md

Market researchers typically use symbols in charts delivered to clients to denote whether a specific subgroup's result is (statistically speaking) significantly higher than another subgroup's result. The ggsigmark package aims to make it easier for market researchers to use ggplot2 by letting them flag statistically significant differences with markers in the plots they create.

This vignette provides a manual on how to use the ggsigmark functions and lists planned future additions to the package.

Package installation

To install the package from GitHub, you will first need to install the devtools package.

install.packages("devtools")

The ggsigmark package can be installed by running the following line:

devtools::install_github("philstraforelli/ggsigmark")
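
If you would also like this vignette available locally, the package can be installed with its vignettes built. This is optional and slows down installation; build_vignettes is a standard argument of devtools::install_github:

devtools::install_github("philstraforelli/ggsigmark", build_vignettes = TRUE)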

It is made available for usage by running the following line:

library(ggsigmark)

The code below relies on the tidyverse (which is necessary in any case, since the ultimate purpose is to use ggplot2), so we load that package as well, along with the other packages used in this vignette:

library(tidyverse)
library(broom)
library(forcats)
library(ggrepel)
library(scales)
library(stringr)

Guide

A note about the gss_data General Social Survey data set

The package uses a reduced version of NORC's General Social Survey (GSS) to illustrate the usage of the ggsigmark tools. This data was chosen because it represents the type of data that market researchers generally work with; that is, tracking survey data (in this case run on a biennial basis) with post-stratification weights.

The full original data file includes, as of this writing, over 65,000 responses and nearly 6,000 variables (many of which are defunct questions). To save memory, only a select list of variables is included, and the gss_data dataset only covers 2000 to 2016 (the GSS was first run in 1972). This brings the data frame down to a more manageable 24,350 responses and 101 variables.
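
Since the gss_data set is bundled with the package and used directly throughout this vignette, a quick check of its dimensions after loading the package should match the figures above:

dim(gss_data) #Should return 24350 rows and 101 columns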

The full data can be downloaded here. More information about the GSS data is available here.

Usage of the ggsigmark functions

The following flowchart illustrates the workflow for using ggsigmark functions.

The freq_pair_wtd_t_test function

Surprisingly, while the stats package includes a pairwise.t.test function for making multiple pairwise comparisons between group levels (with corrections for multiple testing), and the Hmisc package provides a wtd.mean function for calculating means with post-stratification weights, no function was available to run pairwise comparison t-tests on weighted means.
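
For intuition, a single weighted comparison between two groups could be computed by hand along the following lines. This is only a minimal sketch, using Hmisc's weighted mean and variance together with a Kish effective-sample-size approximation and a Welch-style test; it is not necessarily the exact computation that ggsigmark performs internally. It could, for example, be applied to the coninc values and wtssall weights of two degree groups to approximately reproduce a single unadjusted comparison.

#Minimal sketch of a weighted two-group t-test, for intuition only
#(NA handling omitted for brevity)
wtd_two_group_t <- function(x1, w1, x2, w2) {
    m1 <- Hmisc::wtd.mean(x1, w1); m2 <- Hmisc::wtd.mean(x2, w2)
    v1 <- Hmisc::wtd.var(x1, w1);  v2 <- Hmisc::wtd.var(x2, w2)
    n1 <- sum(w1)^2 / sum(w1^2) #Kish effective sample sizes
    n2 <- sum(w2)^2 / sum(w2^2)
    se <- sqrt(v1 / n1 + v2 / n2)
    df <- se^4 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1)) #Welch-Satterthwaite
    2 * pt(-abs((m1 - m2) / se), df) #Two-sided p-value
}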

As with the pairwise.t.test function, the p-values are adjusted to mitigate the chance of a Type I error, defaulting to the Holm method (as opposed to the better-known Bonferroni correction, which is generally deemed too conservative).
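
To see the difference between the two corrections, here is a small self-contained illustration with made-up p-values, using base R's p.adjust function:

p_raw <- c(0.001, 0.012, 0.030, 0.040) #Illustrative raw p-values
p.adjust(p_raw, method = "holm")

## [1] 0.004 0.036 0.060 0.060

p.adjust(p_raw, method = "bonferroni")

## [1] 0.004 0.048 0.120 0.160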

The freq_pair_wtd_t_test function fills that gap. It is used in the background by the freq_t_test function, and most users will likely not need to run it directly; it may nonetheless be useful to some, and is therefore made available. The following code shows two examples of its usage: testing whether the coninc variable (inflation-adjusted family income) differs by level of education (degree), and whether respondents' age (age_num) differs by region. In both cases, the weight wtssall is applied:

freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall)

## 
##  Pairwise comparisons using Unpaired T test 
## 
## data:  gss_data$coninc and gss_data$degree 
## 
##                LT HIGH SCHOOL HIGH SCHOOL JUNIOR COLLEGE BACHELOR
## HIGH SCHOOL    <2e-16         -           -              -       
## JUNIOR COLLEGE <2e-16         <2e-16      -              -       
## BACHELOR       <2e-16         <2e-16      <2e-16         -       
## GRADUATE       <2e-16         <2e-16      <2e-16         <2e-16  
## 
## P value adjustment method: holm

freq_pair_wtd_t_test(x = gss_data$age_num, subgroup = gss_data$region, weight = gss_data$wtssall)

## 
##  Pairwise comparisons using Unpaired T test 
## 
## data:  gss_data$age_num and gss_data$region 
## 
##                 NEW ENGLAND MIDDLE ATLANTIC E. NOR. CENTRAL
## MIDDLE ATLANTIC 1.00000     -               -              
## E. NOR. CENTRAL 1.00000     1.00000         -              
## W. NOR. CENTRAL 1.00000     1.00000         1.00000        
## SOUTH ATLANTIC  1.00000     1.00000         1.00000        
## E. SOU. CENTRAL 1.00000     1.00000         1.00000        
## W. SOU. CENTRAL 0.13380     1.00000         0.92937        
## MOUNTAIN        0.00121     0.06522         0.00426        
## PACIFIC         5.7e-05     0.00211         1.9e-05        
##                 W. NOR. CENTRAL SOUTH ATLANTIC E. SOU. CENTRAL
## MIDDLE ATLANTIC -               -              -              
## E. NOR. CENTRAL -               -              -              
## W. NOR. CENTRAL -               -              -              
## SOUTH ATLANTIC  1.00000         -              -              
## E. SOU. CENTRAL 1.00000         1.00000        -              
## W. SOU. CENTRAL 1.00000         0.20971        0.79919        
## MOUNTAIN        0.37097         0.00036        0.01023        
## PACIFIC         0.07095         2.0e-07        0.00054        
##                 W. SOU. CENTRAL MOUNTAIN
## MIDDLE ATLANTIC -               -       
## E. NOR. CENTRAL -               -       
## W. NOR. CENTRAL -               -       
## SOUTH ATLANTIC  -               -       
## E. SOU. CENTRAL -               -       
## W. SOU. CENTRAL -               -       
## MOUNTAIN        1.00000         -       
## PACIFIC         0.28691         1.00000 
## 
## P value adjustment method: holm

The resulting tables above list the p-values for each pairwise comparison. This output is not ideal for further handling, so the user may wish to use the tidy() function from the broom package:

tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall))

##            group1         group2       p.value
## 1     HIGH SCHOOL LT HIGH SCHOOL 3.383092e-100
## 2  JUNIOR COLLEGE LT HIGH SCHOOL 2.992877e-107
## 3        BACHELOR LT HIGH SCHOOL  0.000000e+00
## 4        GRADUATE LT HIGH SCHOOL  0.000000e+00
## 6  JUNIOR COLLEGE    HIGH SCHOOL  8.027388e-19
## 7        BACHELOR    HIGH SCHOOL  0.000000e+00
## 8        GRADUATE    HIGH SCHOOL  0.000000e+00
## 11       BACHELOR JUNIOR COLLEGE  2.294531e-68
## 12       GRADUATE JUNIOR COLLEGE 9.895405e-162
## 16       GRADUATE       BACHELOR  4.389259e-44

tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$region, weight = gss_data$wtssall))

##             group1          group2      p.value
## 1  MIDDLE ATLANTIC     NEW ENGLAND 3.111807e-04
## 2  E. NOR. CENTRAL     NEW ENGLAND 1.185668e-16
## 3  W. NOR. CENTRAL     NEW ENGLAND 7.889610e-19
## 4   SOUTH ATLANTIC     NEW ENGLAND 2.420030e-19
## 5  E. SOU. CENTRAL     NEW ENGLAND 2.032580e-25
## 6  W. SOU. CENTRAL     NEW ENGLAND 4.160657e-28
## 7         MOUNTAIN     NEW ENGLAND 1.742476e-22
## 8          PACIFIC     NEW ENGLAND 2.167928e-05
## 10 E. NOR. CENTRAL MIDDLE ATLANTIC 3.228809e-08
## 11 W. NOR. CENTRAL MIDDLE ATLANTIC 1.745968e-10
## 12  SOUTH ATLANTIC MIDDLE ATLANTIC 4.852845e-11
## 13 E. SOU. CENTRAL MIDDLE ATLANTIC 4.914109e-17
## 14 W. SOU. CENTRAL MIDDLE ATLANTIC 2.452296e-20
## 15        MOUNTAIN MIDDLE ATLANTIC 4.280706e-14
## 16         PACIFIC MIDDLE ATLANTIC 1.000000e+00
## 19 W. NOR. CENTRAL E. NOR. CENTRAL 2.466990e-01
## 20  SOUTH ATLANTIC E. NOR. CENTRAL 1.000000e+00
## 21 E. SOU. CENTRAL E. NOR. CENTRAL 1.781225e-04
## 22 W. SOU. CENTRAL E. NOR. CENTRAL 8.198474e-05
## 23        MOUNTAIN E. NOR. CENTRAL 1.579073e-02
## 24         PACIFIC E. NOR. CENTRAL 5.577282e-07
## 28  SOUTH ATLANTIC W. NOR. CENTRAL 8.297285e-01
## 29 E. SOU. CENTRAL W. NOR. CENTRAL 5.666813e-01
## 30 W. SOU. CENTRAL W. NOR. CENTRAL 8.297285e-01
## 31        MOUNTAIN W. NOR. CENTRAL 1.000000e+00
## 32         PACIFIC W. NOR. CENTRAL 2.415946e-09
## 37 E. SOU. CENTRAL  SOUTH ATLANTIC 1.646748e-03
## 38 W. SOU. CENTRAL  SOUTH ATLANTIC 1.205193e-03
## 39        MOUNTAIN  SOUTH ATLANTIC 1.074700e-01
## 40         PACIFIC  SOUTH ATLANTIC 9.718589e-10
## 46 W. SOU. CENTRAL E. SOU. CENTRAL 1.000000e+00
## 47        MOUNTAIN E. SOU. CENTRAL 1.000000e+00
## 48         PACIFIC E. SOU. CENTRAL 9.086298e-16
## 55        MOUNTAIN W. SOU. CENTRAL 1.000000e+00
## 56         PACIFIC W. SOU. CENTRAL 5.154153e-19
## 64         PACIFIC        MOUNTAIN 8.010958e-13
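
Because the tidied result is an ordinary data frame, it can be handled with the usual dplyr verbs. For example, one might keep only the comparisons significant at the 5% level and sort them by p-value (output omitted):

tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall)) %>% 
    filter(p.value < 0.05) %>% 
    arrange(p.value)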

The freq_t_test function

Most users will find the freq_t_test function more useful, as it takes the output from the freq_pair_wtd_t_test function and wrangles it into a more convenient form. The output of freq_t_test is a data frame in which each row is a possible pairwise comparison; as the output below shows, it contains the two groups being compared (group1 and group2), the adjusted p-value, a logical significant flag, the sample size of each group, and the (weighted) mean of each group.

Note that the arguments needed for the freq_t_test function consist of the following, in order (as in the calls below): the data frame, the name of the numeric variable being tested, the name of the subgroup variable, and optionally the name of the weight variable, with the variable names passed as strings.

#First filtering to 2016 cases among those who reported a family income
gss_data_mean <- filter(gss_data, year == "2016", coninc > 0)

freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall") #With weights

##             group1          group2      p.value significant
## 1  MIDDLE ATLANTIC     NEW ENGLAND 4.422520e-01       FALSE
## 2  E. NOR. CENTRAL     NEW ENGLAND 9.378812e-03        TRUE
## 3  W. SOU. CENTRAL     NEW ENGLAND 2.101227e-05        TRUE
## 4  W. NOR. CENTRAL     NEW ENGLAND 6.247240e-04        TRUE
## 5  E. SOU. CENTRAL     NEW ENGLAND 1.163490e-03        TRUE
## 6         MOUNTAIN     NEW ENGLAND 1.850896e-02        TRUE
## 7   SOUTH ATLANTIC     NEW ENGLAND 3.513405e-02        TRUE
## 8          PACIFIC     NEW ENGLAND 4.007362e-01       FALSE
## 9  E. NOR. CENTRAL MIDDLE ATLANTIC 1.000000e+00       FALSE
## 10 W. NOR. CENTRAL MIDDLE ATLANTIC 3.610339e-01       FALSE
## 11        MOUNTAIN MIDDLE ATLANTIC 1.000000e+00       FALSE
## 12 E. SOU. CENTRAL MIDDLE ATLANTIC 5.045073e-01       FALSE
## 13 W. SOU. CENTRAL MIDDLE ATLANTIC 5.105789e-02       FALSE
## 14         PACIFIC MIDDLE ATLANTIC 1.000000e+00       FALSE
## 15  SOUTH ATLANTIC MIDDLE ATLANTIC 1.000000e+00       FALSE
## 16 W. NOR. CENTRAL E. NOR. CENTRAL 1.000000e+00       FALSE
## 17  SOUTH ATLANTIC E. NOR. CENTRAL 1.000000e+00       FALSE
## 18        MOUNTAIN E. NOR. CENTRAL 1.000000e+00       FALSE
## 19 E. SOU. CENTRAL E. NOR. CENTRAL 1.000000e+00       FALSE
## 20         PACIFIC E. NOR. CENTRAL 1.000000e+00       FALSE
## 21 W. SOU. CENTRAL E. NOR. CENTRAL 6.372623e-01       FALSE
## 22  SOUTH ATLANTIC W. NOR. CENTRAL 9.768747e-01       FALSE
## 23 W. SOU. CENTRAL W. NOR. CENTRAL 1.000000e+00       FALSE
## 24        MOUNTAIN W. NOR. CENTRAL 1.000000e+00       FALSE
## 25         PACIFIC W. NOR. CENTRAL 2.433656e-01       FALSE
## 26 E. SOU. CENTRAL W. NOR. CENTRAL 1.000000e+00       FALSE
## 27        MOUNTAIN  SOUTH ATLANTIC 1.000000e+00       FALSE
## 28 E. SOU. CENTRAL  SOUTH ATLANTIC 1.000000e+00       FALSE
## 29         PACIFIC  SOUTH ATLANTIC 1.000000e+00       FALSE
## 30 W. SOU. CENTRAL  SOUTH ATLANTIC 2.038831e-01       FALSE
## 31 W. SOU. CENTRAL E. SOU. CENTRAL 1.000000e+00       FALSE
## 32        MOUNTAIN E. SOU. CENTRAL 1.000000e+00       FALSE
## 33         PACIFIC E. SOU. CENTRAL 3.865496e-01       FALSE
## 34        MOUNTAIN W. SOU. CENTRAL 1.000000e+00       FALSE
## 35         PACIFIC W. SOU. CENTRAL 2.461451e-02        TRUE
## 36         PACIFIC        MOUNTAIN 1.000000e+00       FALSE
##    Sample_Size_group1 Sample_Size_group2 wtd.mean_group1 wtd.mean_group2
## 1                 278                159        56184.96        66071.30
## 2                 461                159        51884.27        66071.30
## 3                 273                159        44860.81        66071.30
## 4                 177                159        46136.68        66071.30
## 5                 185                159        47019.22        66071.30
## 6                 223                159        50890.25        66071.30
## 7                 488                159        53476.81        66071.30
## 8                 352                159        56358.38        66071.30
## 9                 461                278        51884.27        56184.96
## 10                177                278        46136.68        56184.96
## 11                223                278        50890.25        56184.96
## 12                185                278        47019.22        56184.96
## 13                273                278        44860.81        56184.96
## 14                352                278        56358.38        56184.96
## 15                488                278        53476.81        56184.96
## 16                177                461        46136.68        51884.27
## 17                488                461        53476.81        51884.27
## 18                223                461        50890.25        51884.27
## 19                185                461        47019.22        51884.27
## 20                352                461        56358.38        51884.27
## 21                273                461        44860.81        51884.27
## 22                488                177        53476.81        46136.68
## 23                273                177        44860.81        46136.68
## 24                223                177        50890.25        46136.68
## 25                352                177        56358.38        46136.68
## 26                185                177        47019.22        46136.68
## 27                223                488        50890.25        53476.81
## 28                185                488        47019.22        53476.81
## 29                352                488        56358.38        53476.81
## 30                273                488        44860.81        53476.81
## 31                273                185        44860.81        47019.22
## 32                223                185        50890.25        47019.22
## 33                352                185        56358.38        47019.22
## 34                223                273        50890.25        44860.81
## 35                352                273        56358.38        44860.81
## 36                352                223        56358.38        50890.25

freq_t_test(gss_data_mean, "coninc", "degree") #Without weights

##            group1         group2      p.value significant
## 1     HIGH SCHOOL LT HIGH SCHOOL 3.633278e-05        TRUE
## 2  JUNIOR COLLEGE LT HIGH SCHOOL 2.082524e-07        TRUE
## 3        BACHELOR LT HIGH SCHOOL 1.028278e-42        TRUE
## 4        GRADUATE LT HIGH SCHOOL 2.861872e-63        TRUE
## 5  JUNIOR COLLEGE    HIGH SCHOOL 3.344687e-03        TRUE
## 6        BACHELOR    HIGH SCHOOL 9.442534e-46        TRUE
## 7        GRADUATE    HIGH SCHOOL 4.199987e-69        TRUE
## 8        BACHELOR JUNIOR COLLEGE 2.883638e-10        TRUE
## 9        GRADUATE JUNIOR COLLEGE 1.739274e-23        TRUE
## 10       GRADUATE       BACHELOR 5.587394e-07        TRUE
##    Sample_Size_group1 Sample_Size_group2 wtd.mean_group1 wtd.mean_group2
## 1                1317                279        37944.81        27334.50
## 2                 200                279        46300.44        27334.50
## 3                 489                279        66997.66        27334.50
## 4                 308                279        81251.88        27334.50
## 5                 200               1317        46300.44        37944.81
## 6                 489               1317        66997.66        37944.81
## 7                 308               1317        81251.88        37944.81
## 8                 489                200        66997.66        46300.44
## 9                 308                200        81251.88        46300.44
## 10                308                489        81251.88        66997.66

The names of the levels of the region variable are currently a little unsightly, so we'll use the fct_recode() function from the forcats package to make them look a little nicer. I'm also going to use fct_relevel() so that the order of the regions in the chart roughly follows the geographical order from west to east.

levels(gss_data_mean$region)

## [1] "NEW ENGLAND"     "MIDDLE ATLANTIC" "E. NOR. CENTRAL" "W. NOR. CENTRAL"
## [5] "SOUTH ATLANTIC"  "E. SOU. CENTRAL" "W. SOU. CENTRAL" "MOUNTAIN"       
## [9] "PACIFIC"

gss_data_mean$region <- gss_data_mean$region %>% 
    fct_recode(
    `New England` = "NEW ENGLAND",
    `Middle Atlantic` = "MIDDLE ATLANTIC",
    `East North Central` = "E. NOR. CENTRAL",
    `West North Central` = "W. NOR. CENTRAL",
    `South Atlantic` = "SOUTH ATLANTIC",
    `East South Central` = "E. SOU. CENTRAL",
    `West South Central` = "W. SOU. CENTRAL",
    Mountain = "MOUNTAIN",
    Pacific = "PACIFIC") %>% 
    fct_relevel(
        "Pacific",
        "Mountain", 
        "West South Central", 
        "West North Central",
        "East South Central",
        "East North Central",
        "South Atlantic",
        "Middle Atlantic",
        "New England")

The freq_prop_test function

The freq_prop_test function behaves similarly to the freq_t_test function, except that it focuses on proportions instead of means. It relies on the pairwise.prop.test function in the background, but allows post-stratification weights to be used. As the output below shows, the resulting data frame mirrors that of freq_t_test, with an additional level column identifying which level of the variable is being compared, and proportions reported in place of weighted means.

Note that the arguments needed for the freq_prop_test function consist of the following, in order (as in the calls below): the data frame, the name of the factor variable being tested, the name of the subgroup variable, and optionally the name of the weight variable, with the variable names passed as strings.

#First filtering to 2016 cases among those who gave an answer on their level of confidence in Congress and their party identification
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(partyid))

#Are there differences in Americans' level of confidence in US Congress across their party identification?
gss_data_prop$partyidcoll <- fct_collapse(gss_data_prop$partyid,
                                        Democrat = c("STRONG DEMOCRAT", "NOT STR DEMOCRAT"),
                                        Independent = c("IND,NEAR DEM", "INDEPENDENT", "IND,NEAR REP"),
                                        Republican = c("NOT STR REPUBLICAN", "STRONG REPUBLICAN"),
                                        Other = "OTHER PARTY")

#With weights. The full output across all levels is shown here; the next example shows how to filter to a single level.
freq_prop_test(gss_data_prop, "conlegis", "partyidcoll", weight = "wtssall")

##         group1      group2        level     p.value significant
## 1  Independent    Democrat A GREAT DEAL 1.000000000       FALSE
## 2   Republican    Democrat A GREAT DEAL 1.000000000       FALSE
## 3        Other    Democrat A GREAT DEAL 1.000000000       FALSE
## 4  Independent    Democrat    ONLY SOME 0.009432938        TRUE
## 5   Republican    Democrat    ONLY SOME 0.859591206       FALSE
## 6        Other    Democrat    ONLY SOME 0.059022114       FALSE
## 7  Independent    Democrat   HARDLY ANY 0.092655207       FALSE
## 8        Other    Democrat   HARDLY ANY 0.160015743       FALSE
## 9   Republican    Democrat   HARDLY ANY 0.954596593       FALSE
## 10  Republican Independent A GREAT DEAL 1.000000000       FALSE
## 11       Other Independent A GREAT DEAL 1.000000000       FALSE
## 12  Republican Independent   HARDLY ANY 0.164178736       FALSE
## 13       Other Independent   HARDLY ANY 0.468843493       FALSE
## 14  Republican Independent    ONLY SOME 0.051645473       FALSE
## 15       Other Independent    ONLY SOME 0.439684591       FALSE
## 16       Other  Republican    ONLY SOME 0.067348400       FALSE
## 17       Other  Republican   HARDLY ANY 0.164178736       FALSE
## 18       Other  Republican A GREAT DEAL 1.000000000       FALSE
##    Sample_Size_group1 Sample_Size_group2 prop_group1 prop_group2
## 1                 800                641  0.06862036  0.05006322
## 2                 410                641  0.05383465  0.05006322
## 3                  52                641  0.06885408  0.05006322
## 4                 800                641  0.37286146  0.45774682
## 5                 410                641  0.45017368  0.45774682
## 6                  52                641  0.28477256  0.45774682
## 7                 800                641  0.55851818  0.49218996
## 8                  52                641  0.64637336  0.49218996
## 9                 410                641  0.49599167  0.49218996
## 10                410                800  0.05383465  0.06862036
## 11                 52                800  0.06885408  0.06862036
## 12                410                800  0.49599167  0.55851818
## 13                 52                800  0.64637336  0.55851818
## 14                410                800  0.45017368  0.37286146
## 15                 52                800  0.28477256  0.37286146
## 16                 52                410  0.28477256  0.45017368
## 17                 52                410  0.64637336  0.49599167
## 18                 52                410  0.06885408  0.05383465

#Are there differences in the usage of LinkedIn by levels of education?
gss_data_prop2 <- filter(gss_data, year == "2016", !is.na(LINKEDIN), !is.na(degree))
#Without weights, and filtering only to the "Yes" level
gss_data_prop2 %>% 
    freq_prop_test("LINKEDIN", "degree") %>% 
    filter(level == "Yes")

##            group1         group2 level      p.value significant
## 1     HIGH SCHOOL LT HIGH SCHOOL   Yes 2.038852e-02        TRUE
## 2  JUNIOR COLLEGE LT HIGH SCHOOL   Yes 2.468184e-03        TRUE
## 3        GRADUATE LT HIGH SCHOOL   Yes 2.210382e-11        TRUE
## 4        BACHELOR LT HIGH SCHOOL   Yes 1.851483e-11        TRUE
## 5  JUNIOR COLLEGE    HIGH SCHOOL   Yes 2.003130e-01       FALSE
## 6        BACHELOR    HIGH SCHOOL   Yes 5.028227e-20        TRUE
## 7        GRADUATE    HIGH SCHOOL   Yes 4.794632e-17        TRUE
## 8        BACHELOR JUNIOR COLLEGE   Yes 4.753333e-04        TRUE
## 9        GRADUATE JUNIOR COLLEGE   Yes 4.665409e-04        TRUE
## 10       GRADUATE       BACHELOR   Yes 7.422444e-01       FALSE
##    Sample_Size_group1 Sample_Size_group2 prop_group1 prop_group2
## 1                 679                 93   0.1826215  0.06451613
## 2                 110                 93   0.2545455  0.06451613
## 3                 188                 93   0.4946809  0.06451613
## 4                 301                 93   0.4750831  0.06451613
## 5                 110                679   0.2545455  0.18262150
## 6                 301                679   0.4750831  0.18262150
## 7                 188                679   0.4946809  0.18262150
## 8                 301                110   0.4750831  0.25454545
## 9                 188                110   0.4946809  0.25454545
## 10                188                301   0.4946809  0.47508306

The tbl_chart function

The tbl_chart function is a convenience function that takes the raw data and turns it into a data frame that allows for easy charting in ggplot2. The user may, however, find that some further wrangling is needed prior to charting.

First, let's use the function on the results for proportions:

tbl_chart(gss_data_prop, "conlegis", "region", weight = "wtssall")

##             region     conlegis       prop Sample_Size
## 1      NEW ENGLAND A GREAT DEAL 0.02442069         119
## 2  MIDDLE ATLANTIC A GREAT DEAL 0.04689564         218
## 3  E. NOR. CENTRAL A GREAT DEAL 0.05054757         327
## 4  W. NOR. CENTRAL A GREAT DEAL 0.06970093         127
## 5   SOUTH ATLANTIC A GREAT DEAL 0.05078355         366
## 6  E. SOU. CENTRAL A GREAT DEAL 0.08040465         134
## 7  W. SOU. CENTRAL A GREAT DEAL 0.05643535         194
## 8         MOUNTAIN A GREAT DEAL 0.07232877         156
## 9          PACIFIC A GREAT DEAL 0.08561493         262
## 10     NEW ENGLAND    ONLY SOME 0.40286534         119
## 11 MIDDLE ATLANTIC    ONLY SOME 0.42561665         218
## 12 E. NOR. CENTRAL    ONLY SOME 0.35691209         327
## 13 W. NOR. CENTRAL    ONLY SOME 0.42547619         127
## 14  SOUTH ATLANTIC    ONLY SOME 0.40887691         366
## 15 E. SOU. CENTRAL    ONLY SOME 0.31844158         134
## 16 W. SOU. CENTRAL    ONLY SOME 0.55880086         194
## 17        MOUNTAIN    ONLY SOME 0.35487771         156
## 18         PACIFIC    ONLY SOME 0.44625534         262
## 19     NEW ENGLAND   HARDLY ANY 0.57271397         119
## 20 MIDDLE ATLANTIC   HARDLY ANY 0.52748771         218
## 21 E. NOR. CENTRAL   HARDLY ANY 0.59254034         327
## 22 W. NOR. CENTRAL   HARDLY ANY 0.50482288         127
## 23  SOUTH ATLANTIC   HARDLY ANY 0.54033954         366
## 24 E. SOU. CENTRAL   HARDLY ANY 0.60115377         134
## 25 W. SOU. CENTRAL   HARDLY ANY 0.38476379         194
## 26        MOUNTAIN   HARDLY ANY 0.57279352         156
## 27         PACIFIC   HARDLY ANY 0.46812973         262

And again on the results for means:

tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall")

##               region wtd.mean Sample_Size
## 1            Pacific 56358.38         352
## 2           Mountain 50890.25         223
## 3 West South Central 44860.81         273
## 4 West North Central 46136.68         177
## 5 East South Central 47019.22         185
## 6 East North Central 51884.27         461
## 7     South Atlantic 53476.81         488
## 8    Middle Atlantic 56184.96         278
## 9        New England 66071.30         159

NOTE: In versions prior to v0.0.3, this function took the output of either freq_t_test or freq_prop_test as input. It now takes the raw data as input for greater convenience.

The tbl_sig function

The tbl_sig function is a convenience function that takes the output from either the freq_t_test or the freq_prop_test function and turns it into a data frame that allows for easy charting with geom_sigmark. The user may, however, find that some further wrangling is needed prior to charting. Its inputs, as used below, are the results data frame, the name of the subgroup variable, and the space_label and space_between arguments, which (as the pos column in the output suggests) control how far beyond a group's own value the first significance marker is placed and how much space separates successive markers.

First, let's use the function on the results for proportions:

my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "region", weight = "wtssall")

## Warning in prop.test(x[c(i, j)], n[c(i, j)], ...): Chi-squared
## approximation may be incorrect

tbl_sig(my_results_prop, "region", space_label = 0.1, space_between = 0.05)

##        level          region            sign group_prop sign_prop
## 1 HARDLY ANY E. NOR. CENTRAL W. SOU. CENTRAL  0.5925403 0.3847638
## 2 HARDLY ANY  SOUTH ATLANTIC W. SOU. CENTRAL  0.5403395 0.3847638
## 3 HARDLY ANY E. SOU. CENTRAL W. SOU. CENTRAL  0.6011538 0.3847638
## 4  ONLY SOME W. SOU. CENTRAL E. NOR. CENTRAL  0.5588009 0.3569121
## 5  ONLY SOME W. SOU. CENTRAL  SOUTH ATLANTIC  0.5588009 0.4088769
## 6  ONLY SOME W. SOU. CENTRAL E. SOU. CENTRAL  0.5588009 0.3184416
## 7  ONLY SOME W. SOU. CENTRAL        MOUNTAIN  0.5588009 0.3548777
## 8 HARDLY ANY        MOUNTAIN W. SOU. CENTRAL  0.5727935 0.3847638
##   group_Sample_Size sign_Sample_Size       pos
## 1               327              194 0.6925403
## 2               366              194 0.6403395
## 3               134              194 0.7011538
## 4               194              327 0.6588009
## 5               194              366 0.7088009
## 6               194              134 0.7588009
## 7               194              156 0.8088009
## 8               156              194 0.6727935

And again on the results for means:

my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall")
tbl_sig(my_results_mean, "region", space_label = 0.1, space_between = 0.05)

##        region               sign group_wtd.mean sign_wtd.mean
## 1     Pacific West South Central       56358.38      44860.81
## 2 New England           Mountain       66071.30      50890.25
## 3 New England West South Central       66071.30      44860.81
## 4 New England West North Central       66071.30      46136.68
## 5 New England East South Central       66071.30      47019.22
## 6 New England East North Central       66071.30      51884.27
## 7 New England     South Atlantic       66071.30      53476.81
##   group_Sample_Size sign_Sample_Size      pos
## 1               352              273 56358.48
## 2               159              223 66071.40
## 3               159              273 66071.45
## 4               159              177 66071.50
## 5               159              185 66071.55
## 6               159              461 66071.60
## 7               159              488 66071.65

The geom_sigmark, geom_sigmark_interactive, geom_sigmark_waves, and geom_sigmark_waves_interactive functions

The geom_sigmark function is a geom_point wrapper that uses a marker to identify whether a particular subgroup's result is, statistically speaking, significantly higher than another subgroup's result. When more than 2 subgroups are compared, colour is key to identifying which other subgroup's result the marked result is significantly higher than (i.e. which subgroup's result is significantly lower).

All 4 geom_sigmark functions take as their first argument the data frame produced by tbl_sig, along with an x argument naming the subgroup variable and an optional icon argument selecting the marker symbol (for example "full star", "checkmark", or "asterisk" in the examples below).

The 2 _waves variants take additional arguments for comparisons across survey waves, and the 2 _interactive variants take additional arguments from ggiraph to make the markers interactive.

The geom_sigmark function is best illustrated with examples of how it can be used with the most common types of charts in market research. A generic sketch of the call pattern is shown below, and full examples follow in the next section.
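
As a generic sketch of the call pattern (mirroring the examples below), a plot p built from tbl_chart output is overlaid with markers built from tbl_sig output; the x argument names the subgroup variable and icon optionally selects the marker symbol:

#Generic pattern, assuming p is an existing ggplot object and my_sig_data
#comes from tbl_sig (see the full examples below)
p + geom_sigmark(my_sig_data, x = "region", icon = "full star")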

Examples of the application of the geom_sigmark function

Simple bar chart displaying means with numeric variables

We will use the coninc variable, which indicates each respondent's family income adjusted for inflation, and revisit the earlier example testing whether income differs across US regions. I will re-run the code that creates the my_results_mean object, this time setting the nlabels argument to TRUE; this merges the sample sizes with the names of the regions so that both can be displayed in the chart.

my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE)
my_chart_data <- tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE)
my_sig_data <- tbl_sig(my_results_mean, "region", space_label = 5000, space_between = 2200)

As noted above, it is best to give the subgroups (i.e. the levels of the region variable) their own specific colours. We therefore create the following named vector:

colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0", "#2175d9", "#00308f", "#e30074", "#b8d000")

#Applying names of regions to each colour
attributes(colour_vec)$names <- levels(my_chart_data$region)

And we now create the plot, assigning it to the object p. Note that we re-label the x-axis to avoid the text overlapping.

(
    p <- ggplot() +
        geom_col(data = my_chart_data, aes(x = region, y = wtd.mean, fill = region)) +
        geom_text(data = my_chart_data, aes(x = region, y = wtd.mean,
                            label = scales::dollar(round(wtd.mean, 0))),
                            vjust = -0.2,
                            size = 3.25) +
        scale_fill_manual(values = colour_vec) + 
        scale_colour_manual(values = colour_vec) + 
        scale_y_continuous(limits = c(0, 85000), labels = scales::dollar) + 
        scale_x_discrete(labels = c("Pacific (n=352)" = "Pacific",
                                    "Mountain (n=223)" = "Mountain",
                                    "West South Central (n=273)" = "SW\nCentral",
                                    "West North Central (n=177)" = "NW\nCentral",
                                    "East South Central (n=185)" = "SE\nCentral",
                                    "East North Central (n=461)" = "NE\nCentral",
                                    "South Atlantic (n=488)" = "South\nAtl.",
                                    "Middle Atlantic (n=278)" = "Mid.\nAtl.",
                                    "New England (n=159)" = "New\nEngland")) +
        labs(title = "Average 2016 Inflation-Adjusted Family Income Across Regions", y = "Inflation-Adjusted Family Income", x = "Region", fill = "Region", colour = "Sig. Diff.")
)

And we now overlay this with the geom_sigmark function to indicate which regions have a significantly higher average family income than other regions.

p + geom_sigmark(my_sig_data, x = "region")

Simple bar chart on single-select variables

We will use the conlegis variable, which asks about Americans' level of confidence in US Congress. A useful feature of ggplot2 is the facet_wrap() layer, which easily breaks out answers across a factor variable. We use it here to examine how the level of confidence in US Congress differs across educational levels, using the degree variable.

gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree))

#Relabeling the degree and conlegis variables to be more readable.
gss_data_prop$degree <- fct_recode(gss_data_prop$degree,
                                                `Less than high school` = "LT HIGH SCHOOL",
                                                `High school` = "HIGH SCHOOL",
                                                `Junior college` = "JUNIOR COLLEGE",
                                                `Bachelor` = "BACHELOR",
                                                `Graduate` = "GRADUATE")

gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis,
                                                `A Great Deal` = "A GREAT DEAL",
                                                `Only Some` = "ONLY SOME",
                                                `Hardly Any` = "HARDLY ANY")

my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)
my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)
my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.4, space_between = 0.15)

We now create the colour vector to denote significant differences and the plot.

colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")

#Applying names to each colour
attributes(colour_vec)$names <- levels(my_conlegis_chart$degree)

(
    p <- ggplot() +
        geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop, fill = degree)) +
        geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        scale_fill_manual(values = colour_vec) + 
        scale_colour_manual(values = colour_vec) + 
        coord_flip() +
        facet_wrap(~ degree) +
        scale_y_continuous(limits = c(0, 1.2), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Level of Confidence on US Congress by Level of Education in 2016", x = "Level of Confidence", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") + 
        theme(legend.key.height = unit(1.5, "lines"))
)

Let's use a full star icon this time. Again, the space_label and space_between arguments may require some trial and error to get right.

p + geom_sigmark(my_conlegis_sig, icon = "full star")

Simple bar chart on multi-select dummy variables

With multi-select dummy variables, the chart would look very similar to the previous section, but the data wrangling would be quite different, since the data is dispersed across multiple dummy variables instead of being all in one variable.

Here, we'll use the degree variable denoting respondents' level of education again, but this time we'll focus on whether respondents use various forms of social media. The variables in question are the following: FACEBOOK, TWITTER, INSTAGRM, LINKEDIN, SNAPCHAT, TUMBLR, WHATSAPP, GOOGLESN (Google+), PINTERST, FLICKR, VINE, and CLSSMTES (Classmates). Each of these is a dummy variable for which respondents answered "Yes", "No", or were not asked the question.

#Also making the names of the social media more readable
gss_data_prop <- gss_data %>% 
    filter(year == "2016", !is.na(degree)) %>% 
    select(wtssall, degree, 
                 Facebook = FACEBOOK, 
                 Twitter = TWITTER, 
                 Instagram = INSTAGRM, 
                 LinkedIn = LINKEDIN, 
                 Snapchat = SNAPCHAT, 
                 Tumblr = TUMBLR, 
                 WhatsApp = WHATSAPP, 
                 `Google+` = GOOGLESN, 
                 Pinterest = PINTERST, 
                 Flickr = FLICKR, 
                 Vine = VINE, 
                 Classmates = CLSSMTES)

#Fixing labels for degree as well.
gss_data_prop$degree <- fct_recode(gss_data_prop$degree,
                                         `Less than high school` = "LT HIGH SCHOOL",
                                         `High school` = "HIGH SCHOOL",
                                         `Junior college` = "JUNIOR COLLEGE",
                                         `Bachelor` = "BACHELOR",
                                         `Graduate` = "GRADUATE")

#Switch NA to "No" so that the full bases are kept.
gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))

This is a good time to provide an example of how a function like freq_prop_test can be applied across several variables at once. First, we gather() the data frame so that all social media variables are listed in a long format instead of a wide format. We then split this long data frame by type of social media with nest() from the tidyr package and apply freq_prop_test to each nested piece using map() from the purrr package.

(
    gss_data_prop_nest <- gss_data_prop %>% 
    gather(key = Social_Media, value = Use, -degree, -wtssall)
)

## # A tibble: 34,308 x 4
##      wtssall         degree Social_Media   Use
##        <dbl>         <fctr>        <chr> <chr>
##  1 0.9569935       Bachelor     Facebook    No
##  2 0.4784968    High school     Facebook    No
##  3 0.9569935       Bachelor     Facebook    No
##  4 1.9139870    High school     Facebook    No
##  5 1.4354903       Graduate     Facebook    No
##  6 0.9569935 Junior college     Facebook    No
##  7 1.4354903    High school     Facebook    No
##  8 0.9569935    High school     Facebook   Yes
##  9 0.9569935    High school     Facebook    No
## 10 0.9569935 Junior college     Facebook   Yes
## # ... with 34,298 more rows

gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor)

(
gss_data_prop_nest <- gss_data_prop_nest %>% 
    group_by(Social_Media) %>% 
    nest()
)

## # A tibble: 12 x 2
##    Social_Media                 data
##          <fctr>               <list>
##  1     Facebook <tibble [2,859 x 3]>
##  2      Twitter <tibble [2,859 x 3]>
##  3    Instagram <tibble [2,859 x 3]>
##  4     LinkedIn <tibble [2,859 x 3]>
##  5     Snapchat <tibble [2,859 x 3]>
##  6       Tumblr <tibble [2,859 x 3]>
##  7     WhatsApp <tibble [2,859 x 3]>
##  8      Google+ <tibble [2,859 x 3]>
##  9    Pinterest <tibble [2,859 x 3]>
## 10       Flickr <tibble [2,859 x 3]>
## 11         Vine <tibble [2,859 x 3]>
## 12   Classmates <tibble [2,859 x 3]>

#Note: warnings about chi-squares being approximated are suppressed here to save space.
(
gss_data_result_nest <- gss_data_prop_nest %>% 
    mutate(test = map(data, freq_prop_test, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE),
                 chart = map(data, tbl_chart, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE))
)

## # A tibble: 12 x 4
##    Social_Media                 data                  test
##          <fctr>               <list>                <list>
##  1     Facebook <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  2      Twitter <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  3    Instagram <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  4     LinkedIn <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  5     Snapchat <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  6       Tumblr <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  7     WhatsApp <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  8      Google+ <tibble [2,859 x 3]> <data.frame [20 x 9]>
##  9    Pinterest <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 10       Flickr <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 11         Vine <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 12   Classmates <tibble [2,859 x 3]> <data.frame [20 x 9]>
## # ... with 1 more variables: chart <list>

(gss_data_result <- gss_data_result_nest %>%
    unnest(test) %>% 
    filter(level == "Yes"))

## # A tibble: 120 x 10
##    Social_Media                 group1                        group2
##          <fctr>                 <fctr>                        <fctr>
##  1     Facebook  High school
## (n=1,461) Less than high school
## (n=328)
##  2     Facebook Junior college
## (n=216) Less than high school
## (n=328)
##  3     Facebook       Bachelor
## (n=536) Less than high school
## (n=328)
##  4     Facebook       Graduate
## (n=318) Less than high school
## (n=328)
##  5     Facebook       Bachelor
## (n=536)         High school
## (n=1,461)
##  6     Facebook Junior college
## (n=216)         High school
## (n=1,461)
##  7     Facebook       Graduate
## (n=318)         High school
## (n=1,461)
##  8     Facebook       Bachelor
## (n=536)        Junior college
## (n=216)
##  9     Facebook       Graduate
## (n=318)        Junior college
## (n=216)
## 10     Facebook       Graduate
## (n=318)              Bachelor
## (n=536)
## # ... with 110 more rows, and 7 more variables: level <fctr>,
## #   p.value <dbl>, significant <lgl>, Sample_Size_group1 <int>,
## #   Sample_Size_group2 <int>, prop_group1 <dbl>, prop_group2 <dbl>

(my_chart_data <- gss_data_result_nest %>% 
        unnest(chart) %>% 
    filter(Use == "Yes"))

## # A tibble: 60 x 5
##    Social_Media                        degree    Use       prop
##          <fctr>                        <fctr> <fctr>      <dbl>
##  1     Facebook Less than high school
## (n=328)    Yes 0.21703540
##  2     Facebook         High school
## (n=1,461)    Yes 0.36974611
##  3     Facebook        Junior college
## (n=216)    Yes 0.35990926
##  4     Facebook              Bachelor
## (n=536)    Yes 0.40321565
##  5     Facebook              Graduate
## (n=318)    Yes 0.38672326
##  6      Twitter Less than high school
## (n=328)    Yes 0.03841661
##  7      Twitter         High school
## (n=1,461)    Yes 0.08290407
##  8      Twitter        Junior college
## (n=216)    Yes 0.07512621
##  9      Twitter              Bachelor
## (n=536)    Yes 0.12217037
## 10      Twitter              Graduate
## (n=318)    Yes 0.15212663
## # ... with 50 more rows, and 1 more variables: Sample_Size <int>

We repeat the process for the tbl_sig function.

(
gss_data_result_nest <- gss_data_result %>%
    group_by(Social_Media) %>%
    nest() %>%
    mutate(sig = map(data, tbl_sig, "degree", space_label = 0.3, space_between = 0.1))
)

## # A tibble: 12 x 3
##    Social_Media              data              sig
##          <fctr>            <list>           <list>
##  1     Facebook <tibble [10 x 9]> <tibble [4 x 8]>
##  2      Twitter <tibble [10 x 9]> <tibble [4 x 8]>
##  3    Instagram <tibble [10 x 9]> <tibble [2 x 8]>
##  4     LinkedIn <tibble [10 x 9]> <tibble [8 x 8]>
##  5     Snapchat <tibble [10 x 9]> <tibble [2 x 8]>
##  6       Tumblr <tibble [10 x 9]> <tibble [0 x 8]>
##  7     WhatsApp <tibble [10 x 9]> <tibble [4 x 8]>
##  8      Google+ <tibble [10 x 9]> <tibble [2 x 8]>
##  9    Pinterest <tibble [10 x 9]> <tibble [6 x 8]>
## 10       Flickr <tibble [10 x 9]> <tibble [2 x 8]>
## 11         Vine <tibble [10 x 9]> <tibble [0 x 8]>
## 12   Classmates <tibble [10 x 9]> <tibble [1 x 8]>

(my_sig_data <- unnest(gss_data_result_nest, sig))

## # A tibble: 35 x 9
##    Social_Media  level                 degree
##          <fctr> <fctr>                 <fctr>
##  1     Facebook    Yes  High school
## (n=1,461)
##  2     Facebook    Yes Junior college
## (n=216)
##  3     Facebook    Yes       Bachelor
## (n=536)
##  4     Facebook    Yes       Graduate
## (n=318)
##  5      Twitter    Yes  High school
## (n=1,461)
##  6      Twitter    Yes       Bachelor
## (n=536)
##  7      Twitter    Yes       Graduate
## (n=318)
##  8      Twitter    Yes       Graduate
## (n=318)
##  9    Instagram    Yes  High school
## (n=1,461)
## 10    Instagram    Yes       Bachelor
## (n=536)
## # ... with 25 more rows, and 6 more variables: sign <fctr>,
## #   group_prop <dbl>, sign_prop <dbl>, group_Sample_Size <int>,
## #   sign_Sample_Size <int>, pos <dbl>

I would also like to set the order based on the proportions using fct_reorder. I first need to create a summarized data frame that lists the total proportion of usage for each social media platform.

social_media_total <- gss_data_prop_nest %>%
    unnest(data) %>%
    group_by(Social_Media) %>%
    summarize(total_prop = sum(if_else(Use == "Yes", wtssall, 0)) / sum(wtssall))

my_chart_data <- left_join(my_chart_data, social_media_total)

## Joining, by = "Social_Media"

my_sig_data <- left_join(my_sig_data, social_media_total)

## Joining, by = "Social_Media"

my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$total_prop)
my_sig_data$Social_Media <- fct_reorder(my_sig_data$Social_Media, my_sig_data$total_prop)

We now create the colour vector to denote significant differences and the plot.

colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")

#Applying names to each colour
attributes(colour_vec)$names <- levels(my_chart_data$degree)

(
    p <- ggplot() +
        geom_col(data = my_chart_data, aes(x = Social_Media, y = prop, fill = degree)) +
        geom_text(data = my_chart_data, aes(x = Social_Media, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        scale_fill_manual(values = colour_vec) +
        scale_colour_manual(values = colour_vec) +
        coord_flip() +
        facet_wrap(~ degree) +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Level of Usage of Social Media by Level of Education in 2016", x = "Social Media", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") + 
    theme(legend.key.height = unit(1.5, "lines"))
)

p + geom_sigmark(my_sig_data, x = "Social_Media", icon = "checkmark")

Stacked bar chart with scaled variables

Scaled variables are a very common way of measuring attitudes in market research and polling surveys. We now provide a couple of examples of charts involving scaled variables.

This example focuses on comparing one scaled variable across subgroups. We will compare the polviews variable, which asks respondents to rate how liberal or conservative they consider themselves on a 7-point scale, across their level of family income (grouped into 6 categories). Again, we'll focus only on 2016 data.

gss_data_polviews <- gss_data %>% 
    filter(year == "2016", !is.na(coninc))

For the sake of expediency, we'll let the cut function set the breaks, even if the resulting cut points are not round numbers.

gss_data_polviews$coninc_6 <- cut(gss_data_polviews$coninc, 6, labels = 
                                                    c("Up to $27,600",
                                                        "$27,600 to $54,900",
                                                        "$54,900 to $82,100",
                                                        "$82,100 to $109,000", 
                                                        "$109,000 to $137,000", 
                                                        "$137,000 and higher"))

We'll specifically look at the proportion of those who consider themselves liberal. This means providing a top 3 box and a summed percentage. First, we'll need to make sure those who didn't give an answer are counted, using the fct_explicit_na() function.

gss_data_polviews$polviews <- gss_data_polviews$polviews %>% 
    fct_recode(
        `Extremely Liberal` = "EXTREMELY LIBERAL",
        Liberal = "LIBERAL",
        `Slightly Liberal` = "SLIGHTLY LIBERAL",
        Moderate = "MODERATE",
        `Slightly Conservative` = "SLGHTLY CONSERVATIVE",
        Conservative = "CONSERVATIVE",
        `Extremely Conservative` = "EXTRMLY CONSERVATIVE") %>% 
    fct_explicit_na(na_level = "NA")

gss_data_polviews <- mutate(gss_data_polviews, 
                                                        liberal = if_else(polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"), "Liberal", "Not Liberal") %>% 
                                                            factor())

We're only interested in showing significance markers for the top 3 box, but the chart itself needs the full distribution, so we run tbl_chart on the full polviews variable.

(polviews_chart <- tbl_chart(gss_data_polviews, "polviews", "coninc_6", weight = "wtssall", nlabels = TRUE))

##                       coninc_6               polviews        prop
## 1      Up to $27,600 (n=1,070)      Extremely Liberal 0.048854089
## 2   $27,600 to $54,900 (n=683)      Extremely Liberal 0.043309374
## 3   $54,900 to $82,100 (n=393)      Extremely Liberal 0.037676752
## 4  $82,100 to $109,000 (n=225)      Extremely Liberal 0.052291285
## 5  $109,000 to $137,000 (n=61)      Extremely Liberal 0.107185558
## 6  $137,000 and higher (n=164)      Extremely Liberal 0.044100002
## 7      Up to $27,600 (n=1,070)                Liberal 0.118936455
## 8   $27,600 to $54,900 (n=683)                Liberal 0.113199526
## 9   $54,900 to $82,100 (n=393)                Liberal 0.135425916
## 10 $82,100 to $109,000 (n=225)                Liberal 0.149289929
## 11 $109,000 to $137,000 (n=61)                Liberal 0.105530524
## 12 $137,000 and higher (n=164)                Liberal 0.181187495
## 13     Up to $27,600 (n=1,070)       Slightly Liberal 0.098353280
## 14  $27,600 to $54,900 (n=683)       Slightly Liberal 0.105672707
## 15  $54,900 to $82,100 (n=393)       Slightly Liberal 0.109279813
## 16 $82,100 to $109,000 (n=225)       Slightly Liberal 0.149463618
## 17 $109,000 to $137,000 (n=61)       Slightly Liberal 0.067005330
## 18 $137,000 and higher (n=164)       Slightly Liberal 0.181670300
## 19     Up to $27,600 (n=1,070)               Moderate 0.402649235
## 20  $27,600 to $54,900 (n=683)               Moderate 0.369981123
## 21  $54,900 to $82,100 (n=393)               Moderate 0.306858359
## 22 $82,100 to $109,000 (n=225)               Moderate 0.346090543
## 23 $109,000 to $137,000 (n=61)               Moderate 0.316001610
## 24 $137,000 and higher (n=164)               Moderate 0.232208331
## 25     Up to $27,600 (n=1,070)  Slightly Conservative 0.116218443
## 26  $27,600 to $54,900 (n=683)  Slightly Conservative 0.155753787
## 27  $54,900 to $82,100 (n=393)  Slightly Conservative 0.153236023
## 28 $82,100 to $109,000 (n=225)  Slightly Conservative 0.136452281
## 29 $109,000 to $137,000 (n=61)  Slightly Conservative 0.224416174
## 30 $137,000 and higher (n=164)  Slightly Conservative 0.194483931
## 31     Up to $27,600 (n=1,070)           Conservative 0.124838774
## 32  $27,600 to $54,900 (n=683)           Conservative 0.140980268
## 33  $54,900 to $82,100 (n=393)           Conservative 0.219460073
## 34 $82,100 to $109,000 (n=225)           Conservative 0.133847878
## 35 $109,000 to $137,000 (n=61)           Conservative 0.141335610
## 36 $137,000 and higher (n=164)           Conservative 0.153777713
## 37     Up to $27,600 (n=1,070) Extremely Conservative 0.042753683
## 38  $27,600 to $54,900 (n=683) Extremely Conservative 0.052178367
## 39  $54,900 to $82,100 (n=393) Extremely Conservative 0.029254590
## 40 $82,100 to $109,000 (n=225) Extremely Conservative 0.018091370
## 41 $109,000 to $137,000 (n=61) Extremely Conservative 0.020090114
## 42 $137,000 and higher (n=164) Extremely Conservative 0.012572228
## 43     Up to $27,600 (n=1,070)                     NA 0.047396040
## 44  $27,600 to $54,900 (n=683)                     NA 0.018924848
## 45  $54,900 to $82,100 (n=393)                     NA 0.008808475
## 46 $82,100 to $109,000 (n=225)                     NA 0.014473096
## 47 $109,000 to $137,000 (n=61)                     NA 0.018435080
## 48 $137,000 and higher (n=164)                     NA 0.000000000
##    Sample_Size
## 1         1070
## 2          683
## 3          393
## 4          225
## 5           61
## 6          164
## 7         1070
## 8          683
## 9          393
## 10         225
## 11          61
## 12         164
## 13        1070
## 14         683
## 15         393
## 16         225
## 17          61
## 18         164
## 19        1070
## 20         683
## 21         393
## 22         225
## 23          61
## 24         164
## 25        1070
## 26         683
## 27         393
## 28         225
## 29          61
## 30         164
## 31        1070
## 32         683
## 33         393
## 34         225
## 35          61
## 36         164
## 37        1070
## 38         683
## 39         393
## 40         225
## 41          61
## 42         164
## 43        1070
## 44         683
## 45         393
## 46         225
## 47          61
## 48         164

Since we're only interested in showing 3 of these levels, we'll remove the others.

polviews_chart <- filter(polviews_chart, polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"))
polviews_chart$polviews <- droplevels(polviews_chart$polviews)

We'll use the created liberal variable for the markers.

polviews_results <- freq_prop_test(gss_data_polviews, "liberal", "coninc_6", weight = "wtssall", nlabels = TRUE)
(
    polviews_sig <- polviews_results %>% 
        tbl_sig("coninc_6", space_label = 0.1, space_between = 0.03) %>% 
    filter(level == "Liberal")
)

##     level                    coninc_6                       sign
## 1 Liberal $137,000 and higher (n=164)    Up to $27,600 (n=1,070)
## 2 Liberal $137,000 and higher (n=164) $27,600 to $54,900 (n=683)
## 3 Liberal $137,000 and higher (n=164) $54,900 to $82,100 (n=393)
##   group_prop sign_prop group_Sample_Size sign_Sample_Size       pos
## 1  0.4069578 0.2661438               164             1070 0.5069578
## 2  0.4069578 0.2621816               164              683 0.5369578
## 3  0.4069578 0.2823825               164              393 0.5669578

We also need the total proportions for the labels.

chart_labels <- polviews_chart %>% 
    group_by(coninc_6) %>% 
    summarize(prop = sum(prop))

Now turning to the chart. Here, we'll need two sets of colours; one for the scales, and one for the subgroups.

colour_scale_vec <- c("#d95f0e", "#fec44f", "#fff7bc")
colour_subgroup_vec <- c("#253494", "#2c7fb8", "#41b6c4", "#7fcdbb", "#c7e9b4", "#ffffcc")

#Applying names to each colour
attributes(colour_scale_vec)$names <- levels(polviews_chart$polviews)
attributes(colour_subgroup_vec)$names <- levels(polviews_chart$coninc_6)

ggplot() +
    geom_col(data = polviews_chart, aes(x = coninc_6, y = prop, fill = fct_rev(polviews))) +
    geom_text(data = polviews_chart, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), position = position_stack(vjust = 0.5)) +
    geom_text(data = chart_labels, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) +
    scale_fill_manual(values = colour_scale_vec) +
    scale_colour_manual(values = colour_subgroup_vec) +
    coord_flip() +
    scale_y_continuous(limits = c(0, .75), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
    labs(title = "Proportion of Residents who Self-Label as Liberal", subtitle = "by Family Income Level", x = "Level of Family Income", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") +
    geom_sigmark(polviews_sig, x = "coninc_6", icon = "asterisk") +
    guides(fill = guide_legend(reverse = TRUE))

Multiple stacked bar chart with scaled variables

This example turns to the need to display multiple scaled variables across subgroups. We'll focus on 6 variables, LOTR1 to LOTR6, which are agreement scales on 6 attitudinal statements, and observe whether agreement with these statements differs significantly by gender (sex). The code required to produce the desired chart mixes the code from the two previous examples.

Note that not everyone was asked these 6 questions, and there is no indication in the data file as to why some respondents were asked and others were not. A dummy variable indicating whether a respondent was asked would be useful here to ensure the sample sizes match throughout. For the sake of simplicity, the code below assigns those who are NA to Neutral across the 6 questions.

gss_data_LOTR <- gss_data %>% 
    filter(year == "2016", !is.na(LOTR1)) %>% 
    select(wtssall, sex, starts_with("LOTR"))

gss_data_LOTR$sex <- fct_recode(gss_data_LOTR$sex, Male = "MALE", Female = "FEMALE")

gss_data_LOTR <- mutate_if(gss_data_LOTR, is.factor, funs(fct_explicit_na(., na_level = "Neutral")))
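
If you want to sanity-check that simplification, a minimal sketch is shown below (not part of the original workflow; the gss_2016 and asked_LOTR names are purely illustrative). It derives an "asked" flag from the missingness of LOTR1, the same assumption the filter above relies on, and counts how many NAs remain per statement among those flagged as asked.

gss_2016 <- filter(gss_data, year == "2016")
#Illustrative flag: assumes missingness on LOTR1 tracks whether the module was asked
gss_2016 <- mutate(gss_2016, asked_LOTR = !is.na(LOTR1))
count(gss_2016, asked_LOTR)

#Remaining NAs per statement among respondents flagged as asked
gss_2016 %>% 
    filter(asked_LOTR) %>% 
    summarize_at(vars(starts_with("LOTR")), funs(sum(is.na(.))))

Any non-zero counts here represent item nonresponse within the module, which is what the fct_explicit_na() call above recodes to Neutral.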

As per the previous example, we'll need to calculate the top 2 box (those who agree or strongly agree) for each statement. Since the same wrangling has to be applied across 6 variables, I'm using the mutate_at() and mutate_if() functions.

gss_data_LOTR <- gss_data_LOTR %>% 
    mutate_at(vars(starts_with("LOTR")), 
                    funs("Top2" = if_else(. %in% c("Strongly agree", "Agree"), "Agree", "Disagree/Neutral"))) %>% 
    mutate_if(is.character, factor)

And as in the example two sections back, we'll need to use nest() and map() to apply the freq_prop_test, tbl_chart, and tbl_sig functions. Unlike last time, the levels differ between the original LOTR variables and the created LOTRx_Top2 variables, so the easiest approach at this stage is to split the original variables from the created ones.

(
    gss_data_LOTR_nest <- gss_data_LOTR %>% 
    gather(key = LOTR, value = Scale, -sex, -wtssall)
)

## # A tibble: 17,364 x 4
##      wtssall    sex  LOTR          Scale
##        <dbl> <fctr> <chr>          <chr>
##  1 0.4784968   Male LOTR1        Neutral
##  2 0.9569935   Male LOTR1        Neutral
##  3 1.4354903 Female LOTR1          Agree
##  4 0.9569935 Female LOTR1          Agree
##  5 0.9569935 Female LOTR1        Neutral
##  6 0.9569935   Male LOTR1          Agree
##  7 1.9139870   Male LOTR1 Strongly agree
##  8 0.4784968 Female LOTR1 Strongly agree
##  9 0.4784968   Male LOTR1          Agree
## 10 0.9569935 Female LOTR1          Agree
## # ... with 17,354 more rows

gss_data_LOTR_nest_Top2 <- filter(gss_data_LOTR_nest, str_detect(LOTR, "Top2"))
gss_data_LOTR_nest_Full <- filter(gss_data_LOTR_nest, !str_detect(LOTR, "Top2"))

gss_data_LOTR_nest_Top2$LOTR <- str_replace(gss_data_LOTR_nest_Top2$LOTR, "_Top2", "")
gss_data_LOTR_nest_Top2 <- mutate_if(gss_data_LOTR_nest_Top2, is.character, factor)
gss_data_LOTR_nest_Full <- mutate_if(gss_data_LOTR_nest_Full, is.character, factor)

(
gss_data_LOTR_nest_Full <- gss_data_LOTR_nest_Full %>% 
    group_by(LOTR) %>% 
    nest()
)

## # A tibble: 6 x 2
##     LOTR                 data
##   <fctr>               <list>
## 1  LOTR1 <tibble [1,447 x 3]>
## 2  LOTR2 <tibble [1,447 x 3]>
## 3  LOTR3 <tibble [1,447 x 3]>
## 4  LOTR4 <tibble [1,447 x 3]>
## 5  LOTR5 <tibble [1,447 x 3]>
## 6  LOTR6 <tibble [1,447 x 3]>

(
gss_data_LOTR_nest_Top2 <- gss_data_LOTR_nest_Top2 %>% 
    group_by(LOTR) %>% 
    nest()
)

## # A tibble: 6 x 2
##     LOTR                 data
##   <fctr>               <list>
## 1  LOTR1 <tibble [1,447 x 3]>
## 2  LOTR2 <tibble [1,447 x 3]>
## 3  LOTR3 <tibble [1,447 x 3]>
## 4  LOTR4 <tibble [1,447 x 3]>
## 5  LOTR5 <tibble [1,447 x 3]>
## 6  LOTR6 <tibble [1,447 x 3]>

(
gss_data_LOTR_chart_Full <- gss_data_LOTR_nest_Full %>% 
        mutate(chart = map(data, tbl_chart, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>% 
        unnest(chart) %>% 
        filter(Scale %in% c("Strongly agree", "Agree"))
)

## # A tibble: 24 x 5
##      LOTR            sex          Scale       prop Sample_Size
##    <fctr>         <fctr>         <fctr>      <dbl>       <int>
##  1  LOTR1   Male (n=605)          Agree 0.46834182         605
##  2  LOTR1 Female (n=842)          Agree 0.47338741         842
##  3  LOTR1   Male (n=605) Strongly agree 0.14088480         605
##  4  LOTR1 Female (n=842) Strongly agree 0.16687914         842
##  5  LOTR2   Male (n=605)          Agree 0.24654460         605
##  6  LOTR2 Female (n=842)          Agree 0.23532435         842
##  7  LOTR2   Male (n=605) Strongly agree 0.05552968         605
##  8  LOTR2 Female (n=842) Strongly agree 0.05483089         842
##  9  LOTR3   Male (n=605)          Agree 0.52541323         605
## 10  LOTR3 Female (n=842)          Agree 0.49361067         842
## # ... with 14 more rows

gss_data_LOTR_chart_Top2 <- gss_data_LOTR_chart_Full %>% 
    group_by(LOTR, sex) %>% 
    summarize(total_prop = sum(prop))

(
gss_data_LOTR_result_Top2 <- gss_data_LOTR_nest_Top2 %>% 
        mutate(test = map(data, freq_prop_test, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>% 
        unnest(test) %>% 
        filter(level == "Agree")
)

## # A tibble: 6 x 10
##     LOTR         group1       group2  level     p.value significant
##   <fctr>         <fctr>       <fctr> <fctr>       <dbl>       <lgl>
## 1  LOTR1 Female (n=842) Male (n=605)  Agree 0.253150237       FALSE
## 2  LOTR2 Female (n=842) Male (n=605)  Agree 0.667930118       FALSE
## 3  LOTR3 Female (n=842) Male (n=605)  Agree 0.765824419       FALSE
## 4  LOTR4 Female (n=842) Male (n=605)  Agree 0.446722189       FALSE
## 5  LOTR5 Female (n=842) Male (n=605)  Agree 0.005903342        TRUE
## 6  LOTR6 Female (n=842) Male (n=605)  Agree 0.039162074        TRUE
## # ... with 4 more variables: Sample_Size_group1 <int>,
## #   Sample_Size_group2 <int>, prop_group1 <dbl>, prop_group2 <dbl>

Now on to the tbl_sig function, which I only need to run on the significant differences for the Top 2 box, because I want to extract the proportions for the labels.

(
my_sig_data <- gss_data_LOTR_result_Top2 %>%
    group_by(LOTR) %>%
    nest() %>%
    mutate(sig = map(data, tbl_sig, "sex", space_label = 0.25, space_between = 0.1)) %>%
    unnest(sig)
)

## # A tibble: 2 x 9
##     LOTR  level            sex           sign group_prop sign_prop
##   <fctr> <fctr>         <fctr>         <fctr>      <dbl>     <dbl>
## 1  LOTR5  Agree   Male (n=605) Female (n=842)  0.2272998 0.1677041
## 2  LOTR6  Agree Female (n=842)   Male (n=605)  0.7961268 0.7486442
## # ... with 3 more variables: group_Sample_Size <int>,
## #   sign_Sample_Size <int>, pos <dbl>

I'll now merge the total proportions from the label data back into the chart data so the chart can be re-ordered from high to low. (The "_Top2" suffix was already removed from the LOTR levels earlier, so the two data frames join cleanly on LOTR and sex.)

my_chart_data <- left_join(gss_data_LOTR_chart_Full, gss_data_LOTR_chart_Top2, by = c("LOTR", "sex"))
my_chart_data$Scale <- droplevels(my_chart_data$Scale)

Now I'll apply the question labels and reorder them so that they appear from high to low in the chart. Note the usage of "\n" within the labels to provide line breaks, which keeps the chart from getting squished by the length of the labels.

#Labels taken from here: http://gss.norc.org/documents/codebook/GSS_Codebook_mainbody.pdf and from gss_data_questions
my_chart_data$LOTR <- my_chart_data$LOTR %>% 
    fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1",
                `If something can go\nwrong for me, it will` = "LOTR2",
                `I'm always optimistic\nabout the future` = "LOTR3", 
                `I hardly ever expect\nthings to go my way` = "LOTR4", 
                `I rarely count on good\nthings happening to me` = "LOTR5", 
                `Overall I expect more good things\nto happen to me than bad` = "LOTR6") %>% 
    fct_reorder(my_chart_data$total_prop)

my_sig_data$LOTR <- my_sig_data$LOTR %>% 
    fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1",
                `If something can go\nwrong for me, it will` = "LOTR2",
                `I'm always optimistic\nabout the future` = "LOTR3", 
                `I hardly ever expect\nthings to go my way` = "LOTR4", 
                `I rarely count on good\nthings happening to me` = "LOTR5", 
                `Overall I expect more good things\nto happen to me than bad` = "LOTR6")

Let's build up the chart. One wrinkle I'm adding here is to display the individual scale point labels only when they're over 5%, to keep the "Strongly agree" and "Agree" labels from encroaching on each other.

colour_scale_vec <- c("#fff7bc", "#d95f0e")
colour_subgroup_vec <- c("#f48041", "#4286f4")

#Applying names to each colour
names(colour_scale_vec) <- levels(my_chart_data$Scale)
names(colour_subgroup_vec) <- levels(my_chart_data$sex)

(
    p <- ggplot() +
        geom_col(data = my_chart_data, aes(x = LOTR, y = prop, fill = Scale)) +
        geom_text(data = my_chart_data, aes(x = LOTR, y = prop, group = Scale, label = if_else(prop > 0.05, scales::percent(round(prop, 2)), "")), position = position_stack(vjust = 0.5)) +
        geom_text(data = my_chart_data, aes(x = LOTR, y = total_prop, label = scales::percent(round(total_prop, 2))), hjust = -0.2) +
        scale_fill_manual(values = colour_scale_vec) +
        scale_colour_manual(values = colour_subgroup_vec) +
        coord_flip() +
        facet_wrap(~ sex) +
        scale_y_continuous(limits = c(0, 1.1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Levels of Optimism Across Gender", x = "Statement", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") 
)

p + geom_sigmark(my_sig_data, x = "LOTR", icon = "asterisk", group = "sex") +
        guides(fill = guide_legend(reverse = TRUE))

Line chart for tracking responses of one question

We will now turn to the geom_sigmark_waves function, used for tracking charts across multiple waves. A different function is required because we need to show whether a significant shift is an increase or a decrease over time. For this example, we'll look at the change in the level of confidence Americans place in the financial industry (confinan).

gss_data1 <- filter(gss_data, year_num >= 2000, !is.na(confinan))
gss_data1$year <- droplevels(gss_data1$year)

gss_data1$confinan <- fct_recode(gss_data1$confinan, 
                                            "A great deal" = "A GREAT DEAL", 
                                            "Hardly any" = "HARDLY ANY", 
                                            "Only some" = "ONLY SOME")

my_track_results <- freq_prop_test(gss_data1, "confinan", "year", weight = "wtssall")
my_chart_track <- tbl_chart(gss_data1, "confinan", "year", weight = "wtssall")
my_sig_track <- tbl_sig(my_track_results, "year", compare = "w2w")

ggplot() + 
    geom_line(data = my_chart_track, aes(x = year, y = prop, colour = confinan, group = confinan)) +
    geom_point(data = my_chart_track, aes(x = year, y = prop, colour = confinan), show.legend = FALSE) +
    geom_text_repel(data = my_chart_track, aes(x = year, y = prop, colour = confinan, label = percent(round(prop, 2))), show.legend = FALSE) +
    scale_y_continuous(labels = percent) +
    geom_sigmark_waves(my_sig_track) +
        labs(title = "Americans' Level of Confidence in the Financial Industry", subtitle = "2000 to 2016", y = "Percentage", x = "Year", colour = "Confidence level", shape = "Sig. Diff.")

Line chart for tracking responses of multiple questions

For this section, we'll expand the scope and look at how Americans' level of confidence ("only some" or "a great deal") in several institutions changes over time.

gss_data1 <- select(gss_data, year, year_num, wtssall, starts_with("con"), -coninc)
gss_data1 <- filter(gss_data1, rowSums(is.na(gss_data1)) < 13, year_num >= 2000) %>% select(-year_num)
gss_data1 <- rename(gss_data1,
                                        Financial = confinan,
                                        Business = conbus,
                                        Religion = conclerg,
                                        Education = coneduc,
                                        POTUS = confed,
                                        Unions = conlabor,
                                        Press = conpress,
                                        Medicine = conmedic,
                                        Television = contv,
                                        SCOTUS = conjudge,
                                        Science = consci,
                                        Congress = conlegis,
                                        Military = conarmy)
gss_data1$year <- droplevels(gss_data1$year)

gss_data1 <- gss_data1 %>% 
    mutate_if(is.factor, funs("Top2" = if_else(. == "A GREAT DEAL" | . == "ONLY SOME", "Yes", "No"))) %>% 
  select(year, wtssall, ends_with("Top2"), -year_Top2) %>% 
    mutate_if(is.character, factor)
names(gss_data1) <- str_replace(names(gss_data1), "_Top2", "")

gss_data1_nest <- gss_data1 %>% 
    gather(key = Institution, value = Confident, -year, -wtssall) %>% 
  mutate_if(is.character, factor)

gss_data1_result <- gss_data1_nest %>% 
    group_by(Institution) %>% 
    nest() %>% 
  mutate(test = map(data, freq_prop_test, "Confident", "year", weight = "wtssall"),
             chart = map(data, tbl_chart, "Confident", "year", weight = "wtssall"))

gss_data1_result_nest <- gss_data1_result %>%
      unnest(test) %>% 
    filter(level == "Yes") %>% 
    group_by(Institution) %>%
    nest() %>%
    mutate(sig = map(data, tbl_sig, "year", compare = "curr"))

gss_data1_chart <- gss_data1_result %>% 
    unnest(chart) %>% 
    filter(Confident == "Yes")
gss_data1_sig <- unnest(gss_data1_result_nest, sig)

gss_data1_chart <- mutate(gss_data1_chart, Sector = case_when(
                                Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1", 
                                Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2",
                                TRUE ~ "Sector 3"),
                                Sector = factor(Sector))
gss_data1_sig <- mutate(gss_data1_sig, Sector = case_when(
                                Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1", 
                                Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2",
                                TRUE ~ "Sector 3"),
                                Sector = factor(Sector))

colour_vec <- c("#0db36f", "#c8dd56", "#030303", "#72477f", "#382b57", "#ce215b", "#e1a83e", "#39b287", "#71c3ca", "#3a8bea", "#034385", "#9f2b68", "#9d0e27")
names(colour_vec) <- levels(gss_data1_chart$Institution)

ggplot() + 
  geom_line(data = gss_data1_chart, aes(x = year, y = prop, group = Institution, colour = Institution)) + 
  geom_point(data = gss_data1_chart, aes(x = year, y = prop, colour = Institution)) +
  geom_text_repel(data = gss_data1_chart, aes(x = year, y = prop, label = percent(round(prop, 2)), colour = Institution), show.legend = FALSE) +
    labs(title = "Proportion of Americans with at least some level of confidence in Institutions", x = "Year", y = "Percentage", shape = "Sig. Diff.") +
    scale_y_continuous(labels = percent) +
    facet_wrap(~ Sector, nrow = 3) +
    scale_colour_manual(values = colour_vec) +
  geom_sigmark_waves(gss_data1_sig, colour = "Institution")

I would actually break this into 3 separate charts instead of using facet_wrap() to make the chart easier to read, but that's beside the point.
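
If you did want to go that route, a minimal sketch is shown below (not part of the original vignette; the plot_sector() helper is purely illustrative). It builds the same chart for a single Sector and then maps over the three levels:

#Illustrative helper: one tracking chart per Sector
plot_sector <- function(sector) {
    chart_sub <- filter(gss_data1_chart, Sector == sector)
    sig_sub <- filter(gss_data1_sig, Sector == sector)
    ggplot() + 
        geom_line(data = chart_sub, aes(x = year, y = prop, group = Institution, colour = Institution)) + 
        geom_point(data = chart_sub, aes(x = year, y = prop, colour = Institution)) +
        geom_text_repel(data = chart_sub, aes(x = year, y = prop, label = percent(round(prop, 2)), colour = Institution), show.legend = FALSE) +
        labs(title = paste("Confidence in Institutions:", sector), x = "Year", y = "Percentage", shape = "Sig. Diff.") +
        scale_y_continuous(labels = percent) +
        scale_colour_manual(values = colour_vec) +
        geom_sigmark_waves(sig_sub, colour = "Institution")
}

sector_plots <- map(levels(gss_data1_chart$Sector), plot_sector)

Each element of sector_plots can then be printed or exported on its own.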

Comparing a subgroup to the rest of the sample for one question

Since version 0.0.2, we can compare a subgroup to the rest of the sample. Here we will chart those with a graduate degree (degree of "GRADUATE") and compare them to everyone else.

gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree))
gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis, 
                                        `A Great Deal` = "A GREAT DEAL",
                                        `Only Some` = "ONLY SOME",
                                        `Hardly Any` = "HARDLY ANY")

my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", level = "GRADUATE", weight = "wtssall")
my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall")
my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.1, compare = "total")

We only want to chart the Graduates, so we'll filter the data down to just them.

my_conlegis_chart <- filter(my_conlegis_chart, degree == "GRADUATE")
my_conlegis_sig <- filter(my_conlegis_sig, degree == "GRADUATE")

(
    p <- ggplot() +
        geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop)) +
        geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        coord_flip() +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Graduates' Level of Confidence on US Congress in 2016", x = "Level of Confidence", y = "Percentage", shape = "Significant Differences")
)

Here, we will use geom_sigmark_total:

p + geom_sigmark_total(my_conlegis_sig, labels = c("Higher than those with \nother levels of education", "Lower than those with \nother levels of education")) + theme(legend.key.height = unit(1.5, "lines"))

Comparing a subgroup to the rest of the sample for multiple questions

We'll now expand on the previous section and look into setting up a chart with multiple dummy variables. Here we'll focus on how those in the 18-24 age category differ from the rest of the sample in their usage of social media:

gss_data_prop <- gss_data %>% 
    filter(year == "2016", !is.na(degree)) %>% 
    select(wtssall, age_7, Facebook = FACEBOOK, Twitter = TWITTER, Instagram = INSTAGRM, LinkedIn = LINKEDIN, Snapchat = SNAPCHAT, Tumblr = TUMBLR, Whatsapp = WHATSAPP, `Google+` = GOOGLESN, Pinterest = PINTERST, Flickr = FLICKR, Vine = VINE, Classmates = CLSSMTES)

#Switch NA to "No" so that the full bases are kept.
gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))

(
    gss_data_prop_nest <- gss_data_prop %>% 
    gather(key = Social_Media, value = Use, -age_7, -wtssall)
)

## # A tibble: 34,308 x 4
##      wtssall  age_7 Social_Media   Use
##        <dbl> <fctr>        <chr> <chr>
##  1 0.9569935  45-54     Facebook    No
##  2 0.4784968  55-64     Facebook    No
##  3 0.9569935  65-74     Facebook    No
##  4 1.9139870  35-44     Facebook    No
##  5 1.4354903  55-64     Facebook    No
##  6 0.9569935  45-54     Facebook    No
##  7 1.4354903  45-54     Facebook    No
##  8 0.9569935  18-24     Facebook   Yes
##  9 0.9569935  45-54     Facebook    No
## 10 0.9569935  65-74     Facebook   Yes
## # ... with 34,298 more rows

gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor)

(
gss_data_prop_nest <- gss_data_prop_nest %>% 
    group_by(Social_Media) %>% 
    nest()
)

## # A tibble: 12 x 2
##    Social_Media                 data
##          <fctr>               <list>
##  1     Facebook <tibble [2,859 x 3]>
##  2      Twitter <tibble [2,859 x 3]>
##  3    Instagram <tibble [2,859 x 3]>
##  4     LinkedIn <tibble [2,859 x 3]>
##  5     Snapchat <tibble [2,859 x 3]>
##  6       Tumblr <tibble [2,859 x 3]>
##  7     Whatsapp <tibble [2,859 x 3]>
##  8      Google+ <tibble [2,859 x 3]>
##  9    Pinterest <tibble [2,859 x 3]>
## 10       Flickr <tibble [2,859 x 3]>
## 11         Vine <tibble [2,859 x 3]>
## 12   Classmates <tibble [2,859 x 3]>

(
gss_data_result <- gss_data_prop_nest %>% 
    mutate(test = map(data, freq_prop_test, "Use", "age_7", level = "18-24", weight = "wtssall"),
                 chart = map(data, tbl_chart, "Use", "age_7", weight = "wtssall"))
)

## # A tibble: 12 x 4
##    Social_Media                 data                 test
##          <fctr>               <list>               <list>
##  1     Facebook <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  2      Twitter <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  3    Instagram <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  4     LinkedIn <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  5     Snapchat <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  6       Tumblr <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  7     Whatsapp <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  8      Google+ <tibble [2,859 x 3]> <data.frame [2 x 9]>
##  9    Pinterest <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 10       Flickr <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 11         Vine <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 12   Classmates <tibble [2,859 x 3]> <data.frame [2 x 9]>
## # ... with 1 more variables: chart <list>

(
my_sig_data <- gss_data_result %>%
        unnest(test) %>% 
        filter(level == "Yes") %>% 
        group_by(Social_Media) %>%
        nest() %>%
        mutate(sig = map(data, tbl_sig, "age_7", space_label = 0.1, compare = "total")) %>% 
        unnest(sig)
)

## # A tibble: 7 x 10
##   Social_Media  level  age_7           sign  group_prop  sign_prop
##         <fctr> <fctr> <fctr>         <fctr>       <dbl>      <dbl>
## 1     Facebook    Yes  18-24 Rest of Sample 0.491384611 0.34355988
## 2      Twitter    Yes  18-24 Rest of Sample 0.211865563 0.07786832
## 3    Instagram    Yes  18-24 Rest of Sample 0.390509132 0.13203718
## 4     Snapchat    Yes  18-24 Rest of Sample 0.415225073 0.08418934
## 5       Tumblr    Yes  18-24 Rest of Sample 0.080860827 0.01846067
## 6         Vine    Yes  18-24 Rest of Sample 0.058186669 0.01554382
## 7   Classmates    Yes  18-24 Rest of Sample 0.003217104 0.02946433
## # ... with 4 more variables: group_Sample_Size <int>,
## #   sign_Sample_Size <int>, pos <dbl>, Direction <fctr>

(my_chart_data <- gss_data_result %>% 
        unnest(chart) %>% 
        filter(age_7 == "18-24", Use == "Yes"))

## # A tibble: 12 x 5
##    Social_Media  age_7    Use        prop Sample_Size
##          <fctr> <fctr> <fctr>       <dbl>       <int>
##  1     Facebook  18-24    Yes 0.491384611         228
##  2      Twitter  18-24    Yes 0.211865563         228
##  3    Instagram  18-24    Yes 0.390509132         228
##  4     LinkedIn  18-24    Yes 0.116682200         228
##  5     Snapchat  18-24    Yes 0.415225073         228
##  6       Tumblr  18-24    Yes 0.080860827         228
##  7     Whatsapp  18-24    Yes 0.068425641         228
##  8      Google+  18-24    Yes 0.196367705         228
##  9    Pinterest  18-24    Yes 0.167692568         228
## 10       Flickr  18-24    Yes 0.020169083         228
## 11         Vine  18-24    Yes 0.058186669         228
## 12   Classmates  18-24    Yes 0.003217104         228

We just need to re-order the Social_Media variable based on the prop variable, and we're ready to chart.

my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$prop)

(
ggplot() +
        geom_col(data = my_chart_data, aes(x = Social_Media, y = prop)) +
        geom_text(data = my_chart_data, aes(x = Social_Media, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        coord_flip() +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Level of Usage of Social Media in 2016 Among 18-24 Year Old", x = "Social Media", y = "Percentage", shape = "Significant Differences") +
        geom_sigmark_total(my_sig_data, x = "Social_Media", labels = c("Higher than \nthose aged 25+", "Lower than \nthose aged 25+")) +
        theme(legend.key.height = unit(1.5, "lines"))
)

Version changes

v0.0.3

v0.0.2

v0.0.1.1

Plans for future additions and fixes

The following is a to-do list for future additions to the package, in order of importance from high to low:


