In philstraforelli/ggsigmark: An Implementation of Grammar of Graphics to Denote Statistically Significant Differences

Market researchers typically use symbols in charts delivered to clients to denote whether a specific subgroup's result is (statistically speaking) significantly higher than another subgroup's result. The package ggsigmark aims to make it easier for market researchers to utilize ggplot2 by providing the ability to indicate whether differences are statistically significant with the usage of markers in the plots they are creating.

This vignette provides a manual on how to use the ggsigmark functions and lists planned future additions to the package.

Package installation

To install the package from GitHub, you will first need to install the devtools package.

install.packages("devtools")

The ggsigmark package can be installed by running the following line:

devtools::install_github("philstraforelli/ggsigmark")

It is made available for usage by running the following line:

library(ggsigmark)

The code below will utilize tidyverse code (which is of course necessary since the ultimate purpose is to use ggplot2), so we might as well load up that package as well, along with others that are used in this vignette:

library(tidyverse)
library(broom)
library(forcats)
library(ggrepel)
library(scales)
library(stringr)

Guide

A note about the `gss_data` General Social Survey data set

The package uses a reduced version of the NORC's General Social Survey to illustrate the usage of the ggsigmark tools. This data was chosen because it represents the type of data that market researchers generally work with; that is, tracking survey data (in this case run on a biennial basis) with post-stratification weights.

The full original data file includes as of this writing over 65,000 responses and over nearly 6,000 variables (many of which are defunct questions). To save memory, only a select list of variables are included, and the gss_data dataset only includes data from 2000 to 2016 (the GSS survey was first run in 1972). This brings the data frame down to a more manageable 24,350 responses and 101 variables.

The full data can be downloaded here. More information about the GSS data is available here.

Usage of the `ggsigmark` functions

The following flowchart illustrates the workflow for using ggsigmark functions.

knitr::include_graphics('./www/flowchart.png')

The `freq_pair_wtd_t_test` function

Surprisingly, while the stats package includes a pairwise.t.test function to make multiple pairwise comparisons between group levels (with corrections for multiple testing), and the Hmisc package provides a wtd.mean function to calculate means with post-stratification weights, there was no function available to run pairwise comparison t-tests with weighted means.

As with the pairwise.t.test function, an adjustment is made to the p-value to mitigate the chance of a Type I error, defaulted to the "Holm" method (as opposed to the better known Bonferroni correction, generally deemed too conservative).

The function freq_pair_wtd_t_test fills that gap. This function is used in the background for the freq_t_test function, and most users would likely not see a need to run this function. However, it may have its uses to some, and it is therefore made available here. The following code is an example of its usage, where whether the coninc variable (adjusted family income) differs by levels of education (degree), and whether respondents' age (age_num) differ based on their region. For both cases, the weight wtssall is applied:

freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall)
freq_pair_wtd_t_test(x = gss_data$age_num, subgroup = gss_data$region, weight = gss_data$wtssall)

The resulting tables above list the p-values for each pairwise comparison. The output is not ideal in handling further, so the user may wish to use the tidy() function from the broom package:

tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall))
tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$region, weight = gss_data$wtssall))

The `freq_t_test` function

Most users would consider the freq_t_test function more useful, as it takes the output from the freq_pair_wtd_t_test function and wrangles it into a more useful output. The output from the freq_t_test function consists of a data frame in which each row consists of a possible pairwise comparison, with the following variables:

Names of the two groups compared (group1 and group2)
The p.value for that comparison, and a logical indicating whether it is significant (i.e. whether it is below the alpha level, which can be user-defined but is defaulted to 0.05).
The unweighted sample sizes of each group (Sample_Size_group1 and Sample_Size_group2)
the weighted means of each group (wtd.mean_group1 and wtd.mean_group2).

Note that the arguments needed for the freq_t_test function consist of the following, in order:

The name of the dataframe
The name of the numeric variable being compared across subgroups, in quotes
The name of the factor variable in which the subgroup levels are compared, in quotes
Optionally, the name of the weight variable, in quotes
The criteria to determine whether a difference is significant, defaulted to 0.05.
The name of the method to adjust p-values due to multiple comparisons in quotes, defaulted to "holm". See p.adjust.methods for more details.
A logical argument to indicate whether the user wants to include the sample sizes into the names of the subgroup levels (TRUE) or not (FALSE), defaulted to FALSE.

#First filtering only to 2016 cases and among those who gave an inflation-adjusted family income
gss_data_mean <- filter(gss_data, year == "2016", coninc > 0)

freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall") #With weights
freq_t_test(gss_data_mean, "coninc", "degree") #Without weights

The names of the levels of the region variable are currently a little unsightly, so we'll use the fct_recode() function from the forcats package to make it look a little nicer. I'm also going to use fct_relevel() so that the order of the regions in the chart are somewhat similar to the geographical order from west to east.

levels(gss_data_mean$region)

gss_data_mean$region <- gss_data_mean$region %>% 
    fct_recode(
    `New England` = "NEW ENGLAND",
    `Middle Atlantic` = "MIDDLE ATLANTIC",
    `East North Central` = "E. NOR. CENTRAL",
    `West North Central` = "W. NOR. CENTRAL",
    `South Atlantic` = "SOUTH ATLANTIC",
    `East South Central` = "E. SOU. CENTRAL",
    `West South Central` = "W. SOU. CENTRAL",
    Mountain = "MOUNTAIN",
    Pacific = "PACIFIC") %>% 
    fct_relevel(
        "Pacific",
        "Mountain", 
        "West South Central", 
        "West North Central",
        "East South Central",
        "East North Central",
        "South Atlantic",
        "Middle Atlantic",
        "New England")

The `freq_prop_test` function

The freq_prop_test function behaves similarly as the freq_t_test function, except of course it focuses on proportions, instead of means. It relies on the pairwise.prop.test function in the background, but allows for the usage of post-stratification weights. The output of the freq_prop_test function consists of the following:

Names of the two groups compared (group1 and group2)
The specific level on which the proportions for group1 and group2 are being compared.
The p.value for that comparison, and a logical indicating whether it is significant (i.e. whether it is below the alpha level, which can be user-defined but is defaulted to 0.05).
The unweighted sample sizes of each group (Sample_Size_group1 and Sample_Size_group2)
the weighted proportions of each group (prop_group1 and prop_group2).

Note that the arguments needed for the freq_prop_test function consist of the following, in order:

The name of the dataframe
The name of the factor variable being compared across subgroups, in quotes
The name of the factor variable in which the subgroup levels are compared, in quotes
Optionally, the name of the weight variable, in quotes
The criteria to determine whether a difference is significant, defaulted to 0.05.
The name of the method to adjust p-values due to multiple comparisons in quotes, defaulted to "holm". See p.adjust.methods for more details.
A logical argument to indicate whether the user wants to include the sample sizes into the names of the subgroup levels (TRUE) or not (FALSE), defaulted to FALSE.

#First filtering only to 2016 cases and among those who gave an answer on their level of confidence with Congress
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(partyid))

#Are there differences in Americans' level of confidence with US Congress across their party identification?
gss_data_prop$partyidcoll <- fct_collapse(gss_data_prop$partyid,
                                        Democrat = c("STRONG DEMOCRAT", "NOT STR DEMOCRAT"),
                                        Independent = c("IND,NEAR DEM", "INDEPENDENT", "IND,NEAR REP"),
                                        Republican = c("NOT STR REPUBLICAN", "STRONG REPUBLICAN"),
                                        Other = "OTHER PARTY")

#With weights. To avoid a large table output, I'm filtering to show only the "HARDLY ANY" level here.
freq_prop_test(gss_data_prop, "conlegis", "partyidcoll", weight = "wtssall") 

#Are there differences in the usage of LinkedIn by levels of education?
gss_data_prop2 <- filter(gss_data, year == "2016", !is.na(LINKEDIN), !is.na(degree))
#Without weights, and filtering only to the "Yes" level
gss_data_prop2 %>% 
    freq_prop_test("LINKEDIN", "degree") %>% 
    filter(level == "Yes")

The `tbl_chart` function

The tbl_chart function is a convenience function that takes the raw data and turns it into a data frame that allows for easy charting in ggplot2. The user may however find that some further wrangling may be needed prior to charting.

First, let's use the function on the results for proportions:

tbl_chart(gss_data_prop, "conlegis", "region", weight = "wtssall")

And again on the results for means:

tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall")

NOTE: This function in versions previous to v0.0.3 took the output of either freq_t_test or freq_prop_test as input. It now takes the raw data as input for greater convenience.

The `tbl_sig` function

The tbl_sig function is a convenience function that takes the output from either the freq_t_test or the freq_prop_test functions and turns it into a data frame that allows for easy charting with geom_sigmark. The user may however find that some further wrangling may be needed prior to charting. Its input consists of the following:

An object created by the freq_t_test or the freq_prop_test functions.
A space_label numeric value indicating how much spacing is needed between the label and the marker. Some trial and error is needed to determine the ideal value.
A space_between numeric value indicating how much spacing is needed between markers. Some trial and error is needed to determine the ideal value.
An optional keep argument that should be one of three options:
- "all": the default setting. This is to be used for subgroup comparisons.
- "w2w": For tracking changes when we are only interested in comparing differences from one wave to the next.
- "curr": For tracking changes when we are only interested in comparing differences between the latest wave and past waves.

First, let's use the function on the results for proportions:

my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "region", weight = "wtssall")
tbl_sig(my_results_prop, "region", space_label = 0.1, space_between = 0.05)

And again on the results for means:

my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall")
tbl_sig(my_results_mean, "region", space_label = 0.1, space_between = 0.05)

The `geom_sigmark`, `geom_sigmark_interactive`, `geom_sigmark_waves`, and `geom_sigmark_waves_interaction` functions

The geom_sigmark function is a geom_point wrapper that identifies with a marker whether a particular subgroup's result is, statistically speaking, significantly higher than another subgroup's result. In cases where there are more than 2 subgroups being compared, the usage of colours is key here to be able to identify which subgroup's result is being identified as being significantly lower.

All 4 geom_sigmark functions require several arguments:

The output created from the tbl_sig function.
The x and y arguments indicating what axes are used to chart the markers. The group argument is provided if needed, and the colour argument indicating how the markers should be coloured.
A quoted character string indicating the type of icon the user would like to have displayed where there are significant differences.
A value for the size of the icon. Generally a size of 5 is deemed appropriate.

The 2 _waves variants also include the following arguments:

The direction argument to specify which variable indicates whether the marker should be an increase or a decrease.
The labels argument to specify how the markers should be labeled in the plot legend. By default they are "Increase"/"Decrease".

The 2 _interactive variants also include the following arguments (both from ggiraph).

The tooltip argument to specify which variable indicates what gets shown in the tooltip when the mouse hovers on a marker.
The data_id argument to specific which variable should be highlighted if the end user clicks on a marker.

The geom_sigmark function is best displayed by providing examples on how it can be utilized with the most common types of charts used in market research. These are listed below.

Examples of the application of the `ggsigmark` function

Simple bar chart displaying means with numeric variables

We will use the coninc variable, which indicates the family income of each respondent, adjusted for inflation, and use the previous examples indicating whether there is a difference in income across regions in the US. I will re-run the my_results_mean object, this time setting the nlabels argument to TRUE. This is to merge the sample sizes with the names of the regions in order to display both in the chart.

my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE)
my_chart_data <- tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE)
my_sig_data <- tbl_sig(my_results_mean, "region", space_label = 5000, space_between = 2200)

As noted above, it is best to have the subgroups (i.e. the levels of the region variable) have their own specific colours. We therefore create the following labeled vector:

colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0", "#2175d9", "#00308f", "#e30074", "#b8d000")

#Applying names of regions to each colour
attributes(colour_vec)$names <- levels(my_chart_data$region)

And we now create the plot. Here we assign the plot into the object p. Note that we're re-labeling the x-axis to avoid the text overlapping.

(
    p <- ggplot() +
        geom_col(data = my_chart_data, aes(x = region, y = wtd.mean, fill = region)) +
        geom_text(data = my_chart_data, aes(x = region, y = wtd.mean,
                            label = scales::dollar(round(wtd.mean, 0))),
                            vjust = -0.2,
                            size = 3.25) +
        scale_fill_manual(values = colour_vec) + 
        scale_colour_manual(values = colour_vec) + 
        scale_y_continuous(limits = c(0, 85000), labels = scales::dollar) + 
        scale_x_discrete(labels = c("Pacific (n=352)" = "Pacific",
                                    "Mountain (n=223)" = "Mountain",
                                    "West South Central (n=273)" = "SW\nCentral",
                                    "West North Central (n=177)" = "NW\nCentral",
                                    "East South Central (n=185)" = "SE\nCentral",
                                    "East North Central (n=461)" = "NE\nCentral",
                                    "South Atlantic (n=488)" = "South\nAtl.",
                                    "Middle Atlantic (n=278)" = "Mid.\nAtl.",
                                    "New England (n=159)" = "New\nEngland")) +
        labs(title = "Average 2016 Inflation-Adjusted Family Income Across Regions", y = "Inflation-Adjusted Family Income", x = "Region", fill = "Region", colour = "Sig. Diff.")
)

And we now overlay this with the geom_sigmark function to indicate which regions have a significantly higher average family income than other regions.

p + geom_sigmark(my_sig_data, x = "region")

Simple bar chart on single-select variables

We will use the conlegis variable, which asks for Americans' level of confidence in US Congress. A useful feature of ggplot2 is the facet_wrap() layer that more easily breaks out answers across a factor variable. This feature will be used here to examine how the level of confidence in US Congress differs across educational levels using the degree variable.

gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree))

#Relabeling the degree and conlegis variables to be more readable.
gss_data_prop$degree <- fct_recode(gss_data_prop$degree,
                                                `Less than high school` = "LT HIGH SCHOOL",
                                                `High school` = "HIGH SCHOOL",
                                                `Junior college` = "JUNIOR COLLEGE",
                                                `Bachelor` = "BACHELOR",
                                                `Graduate` = "GRADUATE")

gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis,
                                                `A Great Deal` = "A GREAT DEAL",
                                                `Only Some` = "ONLY SOME",
                                                `Hardly Any` = "HARDLY ANY")

my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)
my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)
my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.4, space_between = 0.15)

We now create the colour vector to denote significant differences and the plot.

colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")

#Applying names to each colour
attributes(colour_vec)$names <- levels(my_conlegis_chart$degree)

(
    p <- ggplot() +
        geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop, fill = degree)) +
        geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        scale_fill_manual(values = colour_vec) + 
        scale_colour_manual(values = colour_vec) + 
        coord_flip() +
        facet_wrap(~ degree) +
        scale_y_continuous(limits = c(0, 1.2), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Level of Confidence on US Congress by Level of Education in 2016", x = "Level of Confidence", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") + 
        theme(legend.key.height = unit(1.5, "lines"))
)

Let's use a full star instead. Again here, the space_label and space_between may require trial and error to get right.

p + geom_sigmark(my_conlegis_sig, icon = "full star")

Simple bar chart on multi-select dummy variables

With multi-select dummy variables, the chart would look very similar to the previous section, but the data wrangling would be quite different, since the data is dispersed across multiple dummy variables instead of being all in one variable.

Here, we'll use the degree variable denoting respondents' level of education again, but this time we'll focus on whether respondents use various forms of social media. The variables in question here are the following: FACEBOOK, TWITTER, INSTAGRM, LINKEDIN, SNAPCHAT, TUMBLR, WHATSAPP, GOOGLESN (Google +), PINTERST, FLICKR, VINE, and CLSSMTES (classmates). Each of these are dummy variables in which respondents answered "Yes", "No", or were not asked the question.

#Also making the names of the social media more readable
gss_data_prop <- gss_data %>% 
    filter(year == "2016", !is.na(degree)) %>% 
    select(wtssall, degree, 
                 Facebook = FACEBOOK, 
                 Twitter = TWITTER, 
                 Instagram = INSTAGRM, 
                 LinkedIn = LINKEDIN, 
                 Snapchat = SNAPCHAT, 
                 Tumblr = TUMBLR, 
                 WhatsApp = WHATSAPP, 
                 `Google+` = GOOGLESN, 
                 Pinterest = PINTERST, 
                 Flickr = FLICKR, 
                 Vine = VINE, 
                 Classmates = CLSSMTES)

#Fixing labels for degree as well.
gss_data_prop$degree <- fct_recode(gss_data_prop$degree,
                                         `Less than high school` = "LT HIGH SCHOOL",
                                         `High school` = "HIGH SCHOOL",
                                         `Junior college` = "JUNIOR COLLEGE",
                                         `Bachelor` = "BACHELOR",
                                         `Graduate` = "GRADUATE")

#Switch NA to "No" so that the full bases are kept.
gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))

This is a good time to provide an example on how a function like freq_test_prop can be used across several variables at once. First, we gather() the data frame so that all social media variables are listed in a long format instead of a wide format. We split this long data frame by type of social media, apply the freq_test_prop to each split, and recombine them using nest() from the tidyr package and map() from the purrr package.

(
    gss_data_prop_nest <- gss_data_prop %>% 
    gather(key = Social_Media, value = Use, -degree, -wtssall)
)

gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor)

(
gss_data_prop_nest <- gss_data_prop_nest %>% 
    group_by(Social_Media) %>% 
    nest()
)

#Note: warnings about chi-squares being approximated are suppressed here to save space.
(
gss_data_result_nest <- gss_data_prop_nest %>% 
    mutate(test = map(data, freq_prop_test, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE),
                 chart = map(data, tbl_chart, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE))
)

(gss_data_result <- gss_data_result_nest %>%
    unnest(test) %>% 
    filter(level == "Yes"))

(my_chart_data <- gss_data_result_nest %>% 
        unnest(chart) %>% 
    filter(Use == "Yes"))

We repeat the process for the tbl_sig function.

(
gss_data_result_nest <- gss_data_result %>%
    group_by(Social_Media) %>%
    nest() %>%
    mutate(sig = map(data, tbl_sig, "degree", space_label = 0.3, space_between = 0.1))
)

(my_sig_data <- unnest(gss_data_result_nest, sig))

I would also like to set the order based on the proportions using fct_reorder. I first need to create a summarized data frame that lists the total proportions of usage for each social media.

social_media_total <- gss_data_prop_nest %>%
    unnest(data) %>%
    group_by(Social_Media) %>%
    summarize(total_prop = sum(if_else(Use == "Yes", wtssall, 0)) / sum(wtssall))

my_chart_data <- left_join(my_chart_data, social_media_total)
my_sig_data <- left_join(my_sig_data, social_media_total)

my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$total_prop)
my_sig_data$Social_Media <- fct_reorder(my_sig_data$Social_Media, my_sig_data$total_prop)

We now create the colour vector to denote significant differences and the plot.

colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")

#Applying names to each colour
attributes(colour_vec)$names <- levels(my_chart_data$degree)

(
    p <- ggplot() +
        geom_col(data = my_chart_data, aes(x = Social_Media, y = prop, fill = degree)) +
        geom_text(data = my_chart_data, aes(x = Social_Media, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        scale_fill_manual(values = colour_vec) +
        scale_colour_manual(values = colour_vec) +
        coord_flip() +
        facet_wrap(~ degree) +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Level of Usage of Social Media by Level of Education in 2016", x = "Social Media", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") + 
    theme(legend.key.height = unit(1.5, "lines"))
)

p + geom_sigmark(my_sig_data, x = "Social_Media", icon = "checkmark")

Stacked bar chart with scaled variables

Scaled variables are a very common form of measuring attitudes in market research and polling surveys. We will now turn to provide a couple of examples of charts involving scaled variables.

This example will focus on comparing one scaled variable across subgroups. Here we will compare the variable polviews which asks respondents to rate on a scale how liberal or conservative they consider themselves on a 7-point scale on their level of family income (grouped to 6 categories). Again, we'll only on focus on 2016 data.

gss_data_polviews <- gss_data %>% 
    filter(year == "2016", !is.na(coninc))

For the sake of expediency, we'll let the cut function set the breaks, even if they don't round up.

gss_data_polviews$coninc_6 <- cut(gss_data_polviews$coninc, 6, labels = 
                                                    c("Up to $27,600",
                                                        "$27,600 to $54,900",
                                                        "$54,900 to $82,100",
                                                        "$82,100 to $109,000", 
                                                        "$109,000 to $137,000", 
                                                        "$137,000 and higher"))

We'll specifically look at the proportion of those who consider themselves liberal. This means providing a top 3 box and a sum %. First, we'll need to make sure those who didn't give an answer are counted, using the fct_explicity_na() function.

gss_data_polviews$polviews <- gss_data_polviews$polviews %>% 
    fct_recode(
        `Extremely Liberal` = "EXTREMELY LIBERAL",
        Liberal = "LIBERAL",
        `Slightly Liberal` = "SLIGHTLY LIBERAL",
        Moderate = "MODERATE",
        `Slightly Conservative` = "SLGHTLY CONSERVATIVE",
        Conservative = "CONSERVATIVE",
        `Extremely Conservative` = "EXTRMLY CONSERVATIVE") %>% 
    fct_explicit_na(na_level = "NA")

gss_data_polviews <- mutate(gss_data_polviews, 
                                                        liberal = if_else(polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"), "Liberal", "Not Liberal") %>% 
                                                            factor())

We're only interested in showing significant difference markers for the top 3 box, so we're just running the tbl_chart for that.

(polviews_chart <- tbl_chart(gss_data_polviews, "polviews", "coninc_6", weight = "wtssall", nlabels = TRUE))

Since we're only interested in showing 3 of these levels, we'll remove the others.

polviews_chart <- filter(polviews_chart, polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"))
polviews_chart$polviews <- droplevels(polviews_chart$polviews)

We'll use the created liberal variable for the markers.

polviews_results <- freq_prop_test(gss_data_polviews, "liberal", "coninc_6", weight = "wtssall", nlabels = TRUE)
(
    polviews_sig <- polviews_results %>% 
        tbl_sig("coninc_6", space_label = 0.1, space_between = 0.03) %>% 
    filter(level == "Liberal")
)

We also need the total proportions for the labels.

chart_labels <- polviews_chart %>% 
    group_by(coninc_6) %>% 
    summarize(prop = sum(prop))

Now turning to the chart. Here, we'll need two sets of colours; one for the scales, and one for the subgroups.

colour_scale_vec <- c("#d95f0e", "#fec44f", "#fff7bc")
colour_subgroup_vec <- c("#253494", "#2c7fb8", "#41b6c4", "#7fcdbb", "#c7e9b4", "#ffffcc")

#Applying names to each colour
attributes(colour_scale_vec)$names <- levels(polviews_chart$polviews)
attributes(colour_subgroup_vec)$names <- levels(polviews_chart$coninc_6)

ggplot() +
    geom_col(data = polviews_chart, aes(x = coninc_6, y = prop, fill = fct_rev(polviews))) +
    geom_text(data = polviews_chart, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), position = position_stack(vjust = 0.5)) +
    geom_text(data = chart_labels, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) +
    scale_fill_manual(values = colour_scale_vec) +
    scale_colour_manual(values = colour_subgroup_vec) +
    coord_flip() +
    scale_y_continuous(limits = c(0, .75), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
    labs(title = "Proportion of Residents who Self-Label as Liberal", subtitle = "by Family Income Level", x = "Level of Family Income", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") +
    geom_sigmark(polviews_sig, x = "coninc_6", icon = "asterisk") +
    guides(fill = guide_legend(reverse = TRUE))

Multiple stacked bar chart with scaled variables

This example will now turn to the need for multiple scaled variables to appear across subgroups. There are 6 variables that we'll focus here ranging from LOTR1 to LOTR6, which represent agreement scales on 6 attitudinal statements. We'll observe whether there are significant differences in whether they agree with these statements across gender (sex). The code required to produce the desired chart requires a mix of the code from the two previous examples.

Note that not everyone is asked these 6 questions, and there is no indication in the data file as to the logic as to why some respondents were asked and others not. The presence of a dummy variable indicating if they were asked or not would be useful here to ensure the sample sizes match throughout. For the sake of simplicity, the loop below assigns those who are NA to Neutral across the 6 questions.

gss_data_LOTR <- gss_data %>% 
    filter(year == "2016", !is.na(LOTR1)) %>% 
    select(wtssall, sex, starts_with("LOTR"))

gss_data_LOTR$sex <- fct_recode(gss_data_LOTR$sex, Male = "MALE", Female = "FEMALE")

gss_data_LOTR <- mutate_if(gss_data_LOTR, is.factor, funs(fct_explicit_na(., na_level = "Neutral")))

As per the previous example, we'll need to calculate the top 2 box for those who agree or strongly agree across each statement. Since I need to apply the same wrangling across 6 variables, I'm using the mutate_at() and mutate_if() functions.

gss_data_LOTR <- gss_data_LOTR %>% 
    mutate_at(vars(starts_with("LOTR")), 
                    funs("Top2" = if_else(. %in% c("Strongly agree", "Agree"), "Agree", "Disagree/Neutral"))) %>% 
    mutate_if(is.character, factor)

And as per 2 examples ago, we'll need to use the nest() and map() to apply the freq_test_prop, the tbl_chart, and the tbl_sig functions. Unlike last time, my levels differ between the original LOTR variables and my created LOTRx_Top2 variables. The easiest thing to do at this stage would be to split the original variables from the created variables.

(
    gss_data_LOTR_nest <- gss_data_LOTR %>% 
    gather(key = LOTR, value = Scale, -sex, -wtssall)
)

gss_data_LOTR_nest_Top2 <- filter(gss_data_LOTR_nest, str_detect(LOTR, "Top2"))
gss_data_LOTR_nest_Full <- filter(gss_data_LOTR_nest, !str_detect(LOTR, "Top2"))

gss_data_LOTR_nest_Top2$LOTR <- str_replace(gss_data_LOTR_nest_Top2$LOTR, "_Top2", "")
gss_data_LOTR_nest_Top2 <- mutate_if(gss_data_LOTR_nest_Top2, is.character, factor)
gss_data_LOTR_nest_Full <- mutate_if(gss_data_LOTR_nest_Full, is.character, factor)

(
gss_data_LOTR_nest_Full <- gss_data_LOTR_nest_Full %>% 
    group_by(LOTR) %>% 
    nest()
)

(
gss_data_LOTR_nest_Top2 <- gss_data_LOTR_nest_Top2 %>% 
    group_by(LOTR) %>% 
    nest()
)

(
gss_data_LOTR_chart_Full <- gss_data_LOTR_nest_Full %>% 
        mutate(chart = map(data, tbl_chart, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>% 
            unnest(chart) %>% 
    filter(Scale %in% c("Strongly agree", "Agree"))
)

gss_data_LOTR_chart_Top2 <- gss_data_LOTR_chart_Full %>% 
    group_by(LOTR, sex) %>% 
    summarize(total_prop = sum(prop))

(
gss_data_LOTR_result_Top2 <- gss_data_LOTR_nest_Top2 %>% 
        mutate(test = map(data, freq_prop_test, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>% 
        unnest(test) %>% 
        filter(level == "Agree")
)

Now onto the tbl_sig function, which I only need to run for the significant differences for the Top 2 box because I want to extract the proportions for the labels.

(
my_sig_data <- gss_data_LOTR_result_Top2 %>%
    group_by(LOTR) %>%
    nest() %>%
    mutate(sig = map(data, tbl_sig, "sex", space_label = 0.25, space_between = 0.1)) %>%
    unnest(sig)
)

I'll now merge the total proportions from the label data back to the chart data in order to re-order the chart from high to low. First I need to remove "_Top2" from each level of the LOTR factor.

my_chart_data <- left_join(gss_data_LOTR_chart_Full, gss_data_LOTR_chart_Top2, by = c("LOTR", "sex"))
my_chart_data$Scale <- droplevels(my_chart_data$Scale)

Now, I'll apply the question labels, and reorder them so that they appear from high to low in the chart. Note the usage of "\n" within the labels to provide breaks to avoid the chart getting squished due to the length of the labels.

#Labels taken from here: http://gss.norc.org/documents/codebook/GSS_Codebook_mainbody.pdf and from gss_data_questions
my_chart_data$LOTR <- my_chart_data$LOTR %>% 
    fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1",
                `If something can go\nwrong for me, it will` = "LOTR2",
                `I'm always optimistic\nabout the future` = "LOTR3", 
                `I hardly ever expect\nthings to go my way` = "LOTR4", 
                `I rarely count on good\nthings happening to me` = "LOTR5", 
                `Overall I expect more good things\nto happen to me than bad` = "LOTR6") %>% 
    fct_reorder(my_chart_data$total_prop)

my_sig_data$LOTR <- my_sig_data$LOTR %>% 
    fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1",
                `If something can go\nwrong for me, it will` = "LOTR2",
                `I'm always optimistic\nabout the future` = "LOTR3", 
                `I hardly ever expect\nthings to go my way` = "LOTR4", 
                `I rarely count on good\nthings happening to me` = "LOTR5", 
                `Overall I expect more good things\nto happen to me than bad` = "LOTR6")

Let's build up the chart. One wrinkle I'm adding here for the individual scale point labels is to only display if they're over 5%, to avoid the "Strongly agree" and "agree" labels encroaching on each other.

colour_scale_vec <- c("#fff7bc", "#d95f0e")
colour_subgroup_vec <- c("#f48041", "#4286f4")

#Applying names to each colour
attributes(colour_scale_vec)$names <- levels(my_chart_data$Scale)
attributes(colour_subgroup_vec)$names <- levels(my_chart_data$sex)

(
    p <- ggplot() +
        geom_col(data = my_chart_data, aes(x = LOTR, y = prop, fill = Scale)) +
        geom_text(data = my_chart_data, aes(x = LOTR, y = prop, group = Scale, label = if_else(prop > 0.05, scales::percent(round(prop, 2)), "")), position = position_stack(vjust = 0.5)) +
        geom_text(data = my_chart_data, aes(x = LOTR, y = total_prop, label = scales::percent(round(total_prop, 2))), hjust = -0.2) +
        scale_fill_manual(values = colour_scale_vec) +
        scale_colour_manual(values = colour_subgroup_vec) +
        coord_flip() +
        facet_wrap(~ sex) +
        scale_y_continuous(limits = c(0, 1.1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Levels of Optimism Across Gender", x = "Statement", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") 
)

p + geom_sigmark(my_sig_data, x = "LOTR", icon = "asterisk", group = "sex") +
        guides(fill = guide_legend(reverse = TRUE))

Line chart for tracking responses of one question

We will now turn to the geom_sigmark_waves function to use for tracking charts across multiple waves. A different function is required because we need to present whether a significant shift is an increase or a decrease over time. For this example, we'll look at the change in the level of confidence Americans have been placing on the financial industry (confinan).

gss_data1 <- filter(gss_data, year_num >= 2000, !is.na(confinan))
gss_data1$year <- droplevels(gss_data1$year)

gss_data1$confinan <- fct_recode(gss_data1$confinan, 
                                            "A great deal" = "A GREAT DEAL", 
                                            "Hardly any" = "HARDLY ANY", 
                                            "Only some" = "ONLY SOME")

my_track_results <- freq_prop_test(gss_data1, "confinan", "year", weight = "wtssall")
my_chart_track <- tbl_chart(gss_data1, "confinan", "year", weight = "wtssall")
my_sig_track <- tbl_sig(my_track_results, "year", compare = "w2w")

ggplot() + 
    geom_line(data = my_chart_track, aes(x = year, y = prop, colour = confinan, group = confinan)) +
    geom_point(data = my_chart_track, aes(x = year, y = prop, colour = confinan), show.legend = FALSE) +
    geom_text_repel(data = my_chart_track, aes(x = year, y = prop, colour = confinan, label = percent(round(prop, 2))), show.legend = FALSE) +
    scale_y_continuous(labels = percent) +
    geom_sigmark_waves(my_sig_track) +
        labs(title = "Americans' Level of Confidence in the Financial Industry", subtitle = "2000 to 2016", y = "Percentage", x = "Year", colour = "Confidence level", shape = "Sig. Diff.")

Line chart for tracking responses of multiple questions

For this section, we'll expand to looking at how Americans' level of confidence ("some" or "a great deal") over several topics change over time.

gss_data1 <- select(gss_data, year, year_num, wtssall, starts_with("con"), -coninc)
gss_data1 <- filter(gss_data1, rowSums(is.na(gss_data1)) < 13, year_num >= 2000) %>% select(-year_num)
gss_data1 <- rename(gss_data1,
                                        Financial = confinan,
                                        Business = conbus,
                                        Religion = conclerg,
                                        Education = coneduc,
                                        POTUS = confed,
                                        Unions = conlabor,
                                        Press = conpress,
                                        Medicine = conmedic,
                                        Television = contv,
                                        SCOTUS = conjudge,
                                        Science = consci,
                                        Congress = conlegis,
                                        Military = conarmy)
gss_data1$year <- droplevels(gss_data1$year)

gss_data1 <- gss_data1 %>% 
    mutate_if(is.factor, funs("Top2" = if_else(. == "A GREAT DEAL" | . == "ONLY SOME", "Yes", "No"))) %>% 
  select(year, wtssall, ends_with("Top2"), -year_Top2) %>% 
    mutate_if(is.character, factor)
names(gss_data1) <- str_replace(names(gss_data1), "_Top2", "")

gss_data1_nest <- gss_data1 %>% 
    gather(key = Institution, value = Confident, -year, -wtssall) %>% 
  mutate_if(is.character, factor)

gss_data1_result <- gss_data1_nest %>% 
    group_by(Institution) %>% 
    nest() %>% 
  mutate(test = map(data, freq_prop_test, "Confident", "year", weight = "wtssall"),
             chart = map(data, tbl_chart, "Confident", "year", weight = "wtssall"))

gss_data1_result_nest <- gss_data1_result %>%
      unnest(test) %>% 
    filter(level == "Yes") %>% 
    group_by(Institution) %>%
    nest() %>%
    mutate(sig = map(data, tbl_sig, "year", compare = "curr"))

gss_data1_chart <- gss_data1_result %>% 
    unnest(chart) %>% 
    filter(Confident == "Yes")
gss_data1_sig <- unnest(gss_data1_result_nest, sig)

gss_data1_chart <- mutate(gss_data1_chart, Sector = case_when(
                                Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1", 
                                Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2",
                                TRUE ~ "Sector 3"),
                                Sector = factor(Sector))
gss_data1_sig <- mutate(gss_data1_sig, Sector = case_when(
                                Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1", 
                                Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2",
                                TRUE ~ "Sector 3"),
                                Sector = factor(Sector))

colour_vec <- c("#0db36f", "#c8dd56", "#030303", "#72477f", "#382b57", "#ce215b", "#e1a83e", "#39b287", "#71c3ca", "#3a8bea", "#034385", "#9f2b68", "#9d0e27")
names(colour_vec) <- levels(gss_data1_chart$Institution)

ggplot() + 
  geom_line(data = gss_data1_chart, aes(x = year, y = prop, group = Institution, colour = Institution)) + 
  geom_point(data = gss_data1_chart, aes(x = year, y = prop, colour = Institution)) +
  geom_text_repel(data = gss_data1_chart, aes(x = year, y = prop, label = percent(round(prop, 2)), colour = Institution), show.legend = FALSE) +
    labs(title = "Proportion of Americans with at least some level of confidence in Institutions", x = "Year", y = "Percentage", shape = "Sig. Diff.") +
    scale_y_continuous(labels = percent) +
    facet_wrap(~ Sector, nrow = 3) +
    scale_colour_manual(values = colour_vec) +
  geom_sigmark_waves(gss_data1_sig, colour = "Institution")

I would actually break this into 3 separate charts instead of using facet_wrap() to make the chart easier to read, but that's beside the point.

Comparing a subgroup to the rest of the sample for one question

Since version 0.0.2, we can now compare a subgroup to the rest of the sample. Here we will use "Pacific" as a region to chart and compare it to the rest.

gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree))
gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis, 
                                        `A Great Deal` = "A GREAT DEAL",
                                        `Only Some` = "ONLY SOME",
                                        `Hardly Any` = "HARDLY ANY")

my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", level = "GRADUATE", weight = "wtssall")
my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall")
my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.1, compare = "total")

We only want to chart the Graduates, so we'll filter them out.

my_conlegis_chart <- filter(my_conlegis_chart, degree == "GRADUATE")
my_conlegis_sig <- filter(my_conlegis_sig, degree == "GRADUATE")

(
    p <- ggplot() +
        geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop)) +
        geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        coord_flip() +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Graduates' Level of Confidence on US Congress in 2016", x = "Level of Confidence", y = "Percentage", shape = "Significant Differences")
)

Here, we will use geom_sigmark_total:

p + geom_sigmark_total(my_conlegis_sig, labels = c("Higher than those with \nother levels of education", "Lower than those with \nother levels of education")) + theme(legend.key.height = unit(1.5, "lines"))

Comparing a subgroup to the rest of the sample for multiple questions

We'll now expand on the previous section, and look into setting up a chart with multiple dummy variables. Here's we'll focus into how those in the 18-24 age category differ from the rest of the sample on their usage of social media:

gss_data_prop <- gss_data %>% 
    filter(year == "2016", !is.na(degree)) %>% 
    select(wtssall, age_7, Facebook = FACEBOOK, Twitter = TWITTER, Instagram = INSTAGRM, LinkedIn = LINKEDIN, Snapchat = SNAPCHAT, Tumblr = TUMBLR, Whatsapp = WHATSAPP, `Google+` = GOOGLESN, Pinterest = PINTERST, Flickr = FLICKR, Vine = VINE, Classmates = CLSSMTES)

#Switch NA to "No" so that the full bases are kept.
gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))

(
    gss_data_prop_nest <- gss_data_prop %>% 
    gather(key = Social_Media, value = Use, -age_7, -wtssall)
)

gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor)

(
gss_data_prop_nest <- gss_data_prop_nest %>% 
    group_by(Social_Media) %>% 
    nest()
)

(
gss_data_result <- gss_data_prop_nest %>% 
    mutate(test = map(data, freq_prop_test, "Use", "age_7", level = "18-24", weight = "wtssall"),
                 chart = map(data, tbl_chart, "Use", "age_7", weight = "wtssall"))
)

(
my_sig_data <- gss_data_result %>%
        unnest(test) %>% 
        filter(level == "Yes") %>% 
        group_by(Social_Media) %>%
        nest() %>%
        mutate(sig = map(data, tbl_sig, "age_7", space_label = 0.1, compare = "total")) %>% 
        unnest(sig)
)

(my_chart_data <- gss_data_result %>% 
        unnest(chart) %>% 
        filter(age_7 == "18-24", Use == "Yes"))

We just need to re-order the Social_Media variable based on the prop variable, and we're ready to chart.

my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$prop)

(
ggplot() +
        geom_col(data = my_chart_data, aes(x = Social_Media, y = prop)) +
        geom_text(data = my_chart_data, aes(x = Social_Media, y = prop,
                            label = scales::percent(round(prop, 2))),
                            hjust = -0.2) +
        coord_flip() +
        scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
        labs(title = "Level of Usage of Social Media in 2016 Among 18-24 Year Old", x = "Social Media", y = "Percentage", shape = "Significant Differences") +
        geom_sigmark_total(my_sig_data, x = "Social_Media", labels = c("Higher than \nthose aged 25+", "Lower than \nthose aged 25+")) +
        theme(legend.key.height = unit(1.5, "lines"))
)

Version changes

v0.0.3

renamed freq_test_prop to freq_prop_test for consistency with *_test functions going forward.
Changed tbl_chart so that it can be run directly off the raw data frame instead of the *_test functions.
Added a subgroup input in tbl_sig so that the function's output matches with the tbl_chart output.
Moved the min_sample argument from tbl_sig to the *_test functions to increase computational efficiency by avoiding the tests to be run on subgroups the user would ignore if it is below the specified sample size threshold anyway.
Fixed some bugs, including a major one with freq_t_test that wasn't properly computing the weighted mean.

v0.0.2

Changed the keep argument to compare for clarity in the tbl_sig function.
Added a "total" feature to the compare argument when one wishes to compare a particular subgroup to the rest of the sample.
Made some improvements in the background in how these functions handle factor levels.
Added sample sizes as output from tbl_sig to make it easier to include that information in the charts.
Added 2 examples to this vignette to show how to use the new "total" feature.
Simplified the code throughout this vignette in regards to the handling of the factor levels after including sample sizes, and added the nlabels and newline logical arguments to freq_test_prop and freq_t_test functions to make this change seamless.

v0.0.1.1

Fixed an issue when used in Shiny when an end user de-selects one or more subgroups being compared in a chart ends up with other subgroups also being removed from the chart.

Plans for future additions and fixes

The following is a to-do list for future additions to the package, in order of importance from high to low:

The package currently provides frequentist tests. Bayesian alternatives to these tests will be eventually included.
A known bug: The freq_prop_test function does not work well when one of the sample groups has a sample size of 0.
The order of the guides in the legend appear to change randomly. I've attempted to coerce the markers legend to always remain at the bottom, but further testing is needed to ensure this is done correctly.
Provide a test for Poisson distributed data.
Any ideas on what else to include? Is there a particular type of chart commonly used in market research reports you'd like to see included in this vignette? Please let me know by contacting me at phil.straforelli@environics.ca.

philstraforelli/ggsigmark documentation built on May 20, 2019, 1:59 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

philstraforelli/ggsigmark
An Implementation of Grammar of Graphics to Denote Statistically Significant Differences

In philstraforelli/ggsigmark: An Implementation of Grammar of Graphics to Denote Statistically Significant Differences

Package installation

Guide

A note about the `gss_data` General Social Survey data set

Usage of the `ggsigmark` functions

The `freq_pair_wtd_t_test` function

The `freq_t_test` function

The `freq_prop_test` function

The `tbl_chart` function

The `tbl_sig` function

The `geom_sigmark`, `geom_sigmark_interactive`, `geom_sigmark_waves`, and `geom_sigmark_waves_interaction` functions

Examples of the application of the `ggsigmark` function

Simple bar chart displaying means with numeric variables

Simple bar chart on single-select variables

Simple bar chart on multi-select dummy variables

Stacked bar chart with scaled variables

Multiple stacked bar chart with scaled variables

Line chart for tracking responses of one question

Line chart for tracking responses of multiple questions

Comparing a subgroup to the rest of the sample for one question

Comparing a subgroup to the rest of the sample for multiple questions

Version changes

v0.0.3

v0.0.2

v0.0.1.1

Plans for future additions and fixes

R Package Documentation

Browse R Packages

We want your feedback!

philstraforelli/ggsigmark An Implementation of Grammar of Graphics to Denote Statistically Significant Differences

In philstraforelli/ggsigmark: An Implementation of Grammar of Graphics to Denote Statistically Significant Differences

Package installation

Guide

A note about the gss_data General Social Survey data set

Usage of the ggsigmark functions

The freq_pair_wtd_t_test function

The freq_t_test function

The freq_prop_test function

The tbl_chart function

The tbl_sig function

The geom_sigmark, geom_sigmark_interactive, geom_sigmark_waves, and geom_sigmark_waves_interaction functions

Examples of the application of the ggsigmark function

Simple bar chart displaying means with numeric variables

Simple bar chart on single-select variables

Simple bar chart on multi-select dummy variables

Stacked bar chart with scaled variables

Multiple stacked bar chart with scaled variables

Line chart for tracking responses of one question

Line chart for tracking responses of multiple questions

Comparing a subgroup to the rest of the sample for one question

Comparing a subgroup to the rest of the sample for multiple questions

Version changes

v0.0.3

v0.0.2

v0.0.1.1

Plans for future additions and fixes

R Package Documentation

Browse R Packages

We want your feedback!

philstraforelli/ggsigmark
An Implementation of Grammar of Graphics to Denote Statistically Significant Differences

A note about the `gss_data` General Social Survey data set

Usage of the `ggsigmark` functions

The `freq_pair_wtd_t_test` function

The `freq_t_test` function

The `freq_prop_test` function

The `tbl_chart` function

The `tbl_sig` function

The `geom_sigmark`, `geom_sigmark_interactive`, `geom_sigmark_waves`, and `geom_sigmark_waves_interaction` functions

Examples of the application of the `ggsigmark` function