Market researchers typically use symbols in charts delivered to clients to denote whether a specific subgroup's result is (statistically speaking) significantly higher than another subgroup's result. The package ggsigmark
aims to make it easier for market researchers to utilize ggplot2
by providing the ability to indicate whether differences are statistically significant with the usage of markers in the plots they are creating.
This vignette provides a manual on how to use the ggsigmark
functions and lists planned future additions to the package.
To install the package from GitHub, you will first need to install the devtools
package.
install.packages("devtools")
The ggsigmark
package can be installed by running the following line:
devtools::install_github("philstraforelli/ggsigmark")
It is made available for usage by running the following line:
library(ggsigmark)
The code below will utilize tidyverse
code (which is of course necessary since the ultimate purpose is to use ggplot2
), so we might as well load up that package as well, along with others that are used in this vignette:
library(tidyverse) library(broom) library(forcats) library(ggrepel) library(scales) library(stringr)
gss_data
General Social Survey data setThe package uses a reduced version of the NORC's General Social Survey to illustrate the usage of the ggsigmark
tools. This data was chosen because it represents the type of data that market researchers generally work with; that is, tracking survey data (in this case run on a biennial basis) with post-stratification weights.
The full original data file includes as of this writing over 65,000 responses and over nearly 6,000 variables (many of which are defunct questions). To save memory, only a select list of variables are included, and the gss_data
dataset only includes data from 2000 to 2016 (the GSS survey was first run in 1972). This brings the data frame down to a more manageable 24,350 responses and 101 variables.
The full data can be downloaded here. More information about the GSS data is available here.
ggsigmark
functionsThe following flowchart illustrates the workflow for using ggsigmark
functions.
knitr::include_graphics('./www/flowchart.png')
freq_pair_wtd_t_test
functionSurprisingly, while the stats package includes a pairwise.t.test function to make multiple pairwise comparisons between group levels (with corrections for multiple testing), and the Hmisc package provides a wtd.mean function to calculate means with post-stratification weights, there was no function available to run pairwise comparison t-tests with weighted means.
As with the pairwise.t.test function, an adjustment is made to the p-value to mitigate the chance of a Type I error, defaulted to the "Holm" method (as opposed to the better known Bonferroni correction, generally deemed too conservative).
The function freq_pair_wtd_t_test
fills that gap. This function is used in the background for the freq_t_test
function, and most users would likely not see a need to run this function. However, it may have its uses to some, and it is therefore made available here. The following code is an example of its usage, where whether the coninc
variable (adjusted family income) differs by levels of education (degree
), and whether respondents' age (age_num
) differ based on their region
. For both cases, the weight wtssall
is applied:
freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall) freq_pair_wtd_t_test(x = gss_data$age_num, subgroup = gss_data$region, weight = gss_data$wtssall)
The resulting tables above list the p-values for each pairwise comparison. The output is not ideal in handling further, so the user may wish to use the tidy()
function from the broom package:
tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall)) tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$region, weight = gss_data$wtssall))
freq_t_test
functionMost users would consider the freq_t_test
function more useful, as it takes the output from the freq_pair_wtd_t_test
function and wrangles it into a more useful output. The output from the freq_t_test
function consists of a data frame in which each row consists of a possible pairwise comparison, with the following variables:
group1
and group2
)p.value
for that comparison, and a logical indicating whether it is significant
(i.e. whether it is below the alpha level, which can be user-defined but is defaulted to 0.05).Sample_Size_group1
and Sample_Size_group2
)wtd.mean_group1
and wtd.mean_group2
).Note that the arguments needed for the freq_t_test
function consist of the following, in order:
#First filtering only to 2016 cases and among those who gave an inflation-adjusted family income gss_data_mean <- filter(gss_data, year == "2016", coninc > 0) freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall") #With weights freq_t_test(gss_data_mean, "coninc", "degree") #Without weights
The names of the levels of the region
variable are currently a little unsightly, so we'll use the fct_recode()
function from the forcats
package to make it look a little nicer. I'm also going to use fct_relevel()
so that the order of the regions in the chart are somewhat similar to the geographical order from west to east.
levels(gss_data_mean$region) gss_data_mean$region <- gss_data_mean$region %>% fct_recode( `New England` = "NEW ENGLAND", `Middle Atlantic` = "MIDDLE ATLANTIC", `East North Central` = "E. NOR. CENTRAL", `West North Central` = "W. NOR. CENTRAL", `South Atlantic` = "SOUTH ATLANTIC", `East South Central` = "E. SOU. CENTRAL", `West South Central` = "W. SOU. CENTRAL", Mountain = "MOUNTAIN", Pacific = "PACIFIC") %>% fct_relevel( "Pacific", "Mountain", "West South Central", "West North Central", "East South Central", "East North Central", "South Atlantic", "Middle Atlantic", "New England")
freq_prop_test
functionThe freq_prop_test
function behaves similarly as the freq_t_test
function, except of course it focuses on proportions, instead of means. It relies on the pairwise.prop.test
function in the background, but allows for the usage of post-stratification weights. The output of the freq_prop_test
function consists of the following:
group1
and group2
)level
on which the proportions for group1
and group2
are being compared.p.value
for that comparison, and a logical indicating whether it is significant
(i.e. whether it is below the alpha level, which can be user-defined but is defaulted to 0.05).Sample_Size_group1
and Sample_Size_group2
)prop_group1
and prop_group2
).Note that the arguments needed for the freq_prop_test
function consist of the following, in order:
#First filtering only to 2016 cases and among those who gave an answer on their level of confidence with Congress gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(partyid)) #Are there differences in Americans' level of confidence with US Congress across their party identification? gss_data_prop$partyidcoll <- fct_collapse(gss_data_prop$partyid, Democrat = c("STRONG DEMOCRAT", "NOT STR DEMOCRAT"), Independent = c("IND,NEAR DEM", "INDEPENDENT", "IND,NEAR REP"), Republican = c("NOT STR REPUBLICAN", "STRONG REPUBLICAN"), Other = "OTHER PARTY") #With weights. To avoid a large table output, I'm filtering to show only the "HARDLY ANY" level here. freq_prop_test(gss_data_prop, "conlegis", "partyidcoll", weight = "wtssall") #Are there differences in the usage of LinkedIn by levels of education? gss_data_prop2 <- filter(gss_data, year == "2016", !is.na(LINKEDIN), !is.na(degree)) #Without weights, and filtering only to the "Yes" level gss_data_prop2 %>% freq_prop_test("LINKEDIN", "degree") %>% filter(level == "Yes")
tbl_chart
functionThe tbl_chart
function is a convenience function that takes the raw data and turns it into a data frame that allows for easy charting in ggplot2
. The user may however find that some further wrangling may be needed prior to charting.
First, let's use the function on the results for proportions:
tbl_chart(gss_data_prop, "conlegis", "region", weight = "wtssall")
And again on the results for means:
tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall")
NOTE: This function in versions previous to v0.0.3 took the output of either freq_t_test
or freq_prop_test
as input. It now takes the raw data as input for greater convenience.
tbl_sig
functionThe tbl_sig
function is a convenience function that takes the output from either the freq_t_test
or the freq_prop_test
functions and turns it into a data frame that allows for easy charting with geom_sigmark
. The user may however find that some further wrangling may be needed prior to charting. Its input consists of the following:
freq_t_test
or the freq_prop_test
functions.space_label
numeric value indicating how much spacing is needed between the label and the marker. Some trial and error is needed to determine the ideal value.space_between
numeric value indicating how much spacing is needed between markers. Some trial and error is needed to determine the ideal value.keep
argument that should be one of three options:First, let's use the function on the results for proportions:
my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "region", weight = "wtssall") tbl_sig(my_results_prop, "region", space_label = 0.1, space_between = 0.05)
And again on the results for means:
my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall") tbl_sig(my_results_mean, "region", space_label = 0.1, space_between = 0.05)
geom_sigmark
, geom_sigmark_interactive
, geom_sigmark_waves
, and geom_sigmark_waves_interaction
functionsThe geom_sigmark
function is a geom_point
wrapper that identifies with a marker whether a particular subgroup's result is, statistically speaking, significantly higher than another subgroup's result. In cases where there are more than 2 subgroups being compared, the usage of colours is key here to be able to identify which subgroup's result is being identified as being significantly lower.
All 4 geom_sigmark
functions require several arguments:
tbl_sig
function.x
and y
arguments indicating what axes are used to chart the markers. The group
argument is provided if needed, and the colour
argument indicating how the markers should be coloured. icon
the user would like to have displayed where there are significant differences.size
of the icon
. Generally a size of 5 is deemed appropriate.The 2 _waves
variants also include the following arguments:
direction
argument to specify which variable indicates whether the marker should be an increase or a decrease.labels
argument to specify how the markers should be labeled in the plot legend. By default they are "Increase"/"Decrease".The 2 _interactive
variants also include the following arguments (both from ggiraph).
tooltip
argument to specify which variable indicates what gets shown in the tooltip when the mouse hovers on a marker.data_id
argument to specific which variable should be highlighted if the end user clicks on a marker.The geom_sigmark
function is best displayed by providing examples on how it can be utilized with the most common types of charts used in market research. These are listed below.
ggsigmark
functionWe will use the coninc
variable, which indicates the family income of each respondent, adjusted for inflation, and use the previous examples indicating whether there is a difference in income across regions in the US. I will re-run the my_results_mean object, this time setting the nlabels
argument to TRUE. This is to merge the sample sizes with the names of the regions in order to display both in the chart.
my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE) my_chart_data <- tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE) my_sig_data <- tbl_sig(my_results_mean, "region", space_label = 5000, space_between = 2200)
As noted above, it is best to have the subgroups (i.e. the levels of the region
variable) have their own specific colours. We therefore create the following labeled vector:
colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0", "#2175d9", "#00308f", "#e30074", "#b8d000") #Applying names of regions to each colour attributes(colour_vec)$names <- levels(my_chart_data$region)
And we now create the plot. Here we assign the plot into the object p
. Note that we're re-labeling the x-axis to avoid the text overlapping.
( p <- ggplot() + geom_col(data = my_chart_data, aes(x = region, y = wtd.mean, fill = region)) + geom_text(data = my_chart_data, aes(x = region, y = wtd.mean, label = scales::dollar(round(wtd.mean, 0))), vjust = -0.2, size = 3.25) + scale_fill_manual(values = colour_vec) + scale_colour_manual(values = colour_vec) + scale_y_continuous(limits = c(0, 85000), labels = scales::dollar) + scale_x_discrete(labels = c("Pacific (n=352)" = "Pacific", "Mountain (n=223)" = "Mountain", "West South Central (n=273)" = "SW\nCentral", "West North Central (n=177)" = "NW\nCentral", "East South Central (n=185)" = "SE\nCentral", "East North Central (n=461)" = "NE\nCentral", "South Atlantic (n=488)" = "South\nAtl.", "Middle Atlantic (n=278)" = "Mid.\nAtl.", "New England (n=159)" = "New\nEngland")) + labs(title = "Average 2016 Inflation-Adjusted Family Income Across Regions", y = "Inflation-Adjusted Family Income", x = "Region", fill = "Region", colour = "Sig. Diff.") )
And we now overlay this with the geom_sigmark
function to indicate which regions have a significantly higher average family income than other regions.
p + geom_sigmark(my_sig_data, x = "region")
We will use the conlegis
variable, which asks for Americans' level of confidence in US Congress. A useful feature of ggplot2
is the facet_wrap()
layer that more easily breaks out answers across a factor variable. This feature will be used here to examine how the level of confidence in US Congress differs across educational levels using the degree
variable.
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree)) #Relabeling the degree and conlegis variables to be more readable. gss_data_prop$degree <- fct_recode(gss_data_prop$degree, `Less than high school` = "LT HIGH SCHOOL", `High school` = "HIGH SCHOOL", `Junior college` = "JUNIOR COLLEGE", `Bachelor` = "BACHELOR", `Graduate` = "GRADUATE") gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis, `A Great Deal` = "A GREAT DEAL", `Only Some` = "ONLY SOME", `Hardly Any` = "HARDLY ANY") my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE) my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE) my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.4, space_between = 0.15)
We now create the colour vector to denote significant differences and the plot.
colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0") #Applying names to each colour attributes(colour_vec)$names <- levels(my_conlegis_chart$degree) ( p <- ggplot() + geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop, fill = degree)) + geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) + scale_fill_manual(values = colour_vec) + scale_colour_manual(values = colour_vec) + coord_flip() + facet_wrap(~ degree) + scale_y_continuous(limits = c(0, 1.2), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) + labs(title = "Level of Confidence on US Congress by Level of Education in 2016", x = "Level of Confidence", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") + theme(legend.key.height = unit(1.5, "lines")) )
Let's use a full star instead. Again here, the space_label
and space_between
may require trial and error to get right.
p + geom_sigmark(my_conlegis_sig, icon = "full star")
With multi-select dummy variables, the chart would look very similar to the previous section, but the data wrangling would be quite different, since the data is dispersed across multiple dummy variables instead of being all in one variable.
Here, we'll use the degree
variable denoting respondents' level of education again, but this time we'll focus on whether respondents use various forms of social media. The variables in question here are the following: FACEBOOK
, TWITTER
, INSTAGRM
, LINKEDIN
, SNAPCHAT
, TUMBLR
, WHATSAPP
, GOOGLESN
(Google +), PINTERST
, FLICKR
, VINE
, and CLSSMTES
(classmates). Each of these are dummy variables in which respondents answered "Yes", "No", or were not asked the question.
#Also making the names of the social media more readable gss_data_prop <- gss_data %>% filter(year == "2016", !is.na(degree)) %>% select(wtssall, degree, Facebook = FACEBOOK, Twitter = TWITTER, Instagram = INSTAGRM, LinkedIn = LINKEDIN, Snapchat = SNAPCHAT, Tumblr = TUMBLR, WhatsApp = WHATSAPP, `Google+` = GOOGLESN, Pinterest = PINTERST, Flickr = FLICKR, Vine = VINE, Classmates = CLSSMTES) #Fixing labels for degree as well. gss_data_prop$degree <- fct_recode(gss_data_prop$degree, `Less than high school` = "LT HIGH SCHOOL", `High school` = "HIGH SCHOOL", `Junior college` = "JUNIOR COLLEGE", `Bachelor` = "BACHELOR", `Graduate` = "GRADUATE") #Switch NA to "No" so that the full bases are kept. gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))
This is a good time to provide an example on how a function like freq_test_prop
can be used across several variables at once. First, we gather()
the data frame so that all social media variables are listed in a long format instead of a wide format. We split this long data frame by type of social media, apply the freq_test_prop
to each split, and recombine them using nest()
from the tidyr
package and map()
from the purrr
package.
( gss_data_prop_nest <- gss_data_prop %>% gather(key = Social_Media, value = Use, -degree, -wtssall) ) gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor) ( gss_data_prop_nest <- gss_data_prop_nest %>% group_by(Social_Media) %>% nest() ) #Note: warnings about chi-squares being approximated are suppressed here to save space. ( gss_data_result_nest <- gss_data_prop_nest %>% mutate(test = map(data, freq_prop_test, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE), chart = map(data, tbl_chart, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)) ) (gss_data_result <- gss_data_result_nest %>% unnest(test) %>% filter(level == "Yes")) (my_chart_data <- gss_data_result_nest %>% unnest(chart) %>% filter(Use == "Yes"))
We repeat the process for the tbl_sig
function.
( gss_data_result_nest <- gss_data_result %>% group_by(Social_Media) %>% nest() %>% mutate(sig = map(data, tbl_sig, "degree", space_label = 0.3, space_between = 0.1)) ) (my_sig_data <- unnest(gss_data_result_nest, sig))
I would also like to set the order based on the proportions using fct_reorder
. I first need to create a summarized data frame that lists the total proportions of usage for each social media.
social_media_total <- gss_data_prop_nest %>% unnest(data) %>% group_by(Social_Media) %>% summarize(total_prop = sum(if_else(Use == "Yes", wtssall, 0)) / sum(wtssall)) my_chart_data <- left_join(my_chart_data, social_media_total) my_sig_data <- left_join(my_sig_data, social_media_total) my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$total_prop) my_sig_data$Social_Media <- fct_reorder(my_sig_data$Social_Media, my_sig_data$total_prop)
We now create the colour vector to denote significant differences and the plot.
colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0") #Applying names to each colour attributes(colour_vec)$names <- levels(my_chart_data$degree) ( p <- ggplot() + geom_col(data = my_chart_data, aes(x = Social_Media, y = prop, fill = degree)) + geom_text(data = my_chart_data, aes(x = Social_Media, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) + scale_fill_manual(values = colour_vec) + scale_colour_manual(values = colour_vec) + coord_flip() + facet_wrap(~ degree) + scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) + labs(title = "Level of Usage of Social Media by Level of Education in 2016", x = "Social Media", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") + theme(legend.key.height = unit(1.5, "lines")) ) p + geom_sigmark(my_sig_data, x = "Social_Media", icon = "checkmark")
Scaled variables are a very common form of measuring attitudes in market research and polling surveys. We will now turn to provide a couple of examples of charts involving scaled variables.
This example will focus on comparing one scaled variable across subgroups. Here we will compare the variable polviews
which asks respondents to rate on a scale how liberal or conservative they consider themselves on a 7-point scale on their level of family income (grouped to 6 categories). Again, we'll only on focus on 2016 data.
gss_data_polviews <- gss_data %>% filter(year == "2016", !is.na(coninc))
For the sake of expediency, we'll let the cut function set the breaks, even if they don't round up.
gss_data_polviews$coninc_6 <- cut(gss_data_polviews$coninc, 6, labels = c("Up to $27,600", "$27,600 to $54,900", "$54,900 to $82,100", "$82,100 to $109,000", "$109,000 to $137,000", "$137,000 and higher"))
We'll specifically look at the proportion of those who consider themselves liberal. This means providing a top 3 box and a sum %. First, we'll need to make sure those who didn't give an answer are counted, using the fct_explicity_na()
function.
gss_data_polviews$polviews <- gss_data_polviews$polviews %>% fct_recode( `Extremely Liberal` = "EXTREMELY LIBERAL", Liberal = "LIBERAL", `Slightly Liberal` = "SLIGHTLY LIBERAL", Moderate = "MODERATE", `Slightly Conservative` = "SLGHTLY CONSERVATIVE", Conservative = "CONSERVATIVE", `Extremely Conservative` = "EXTRMLY CONSERVATIVE") %>% fct_explicit_na(na_level = "NA") gss_data_polviews <- mutate(gss_data_polviews, liberal = if_else(polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"), "Liberal", "Not Liberal") %>% factor())
We're only interested in showing significant difference markers for the top 3 box, so we're just running the tbl_chart
for that.
(polviews_chart <- tbl_chart(gss_data_polviews, "polviews", "coninc_6", weight = "wtssall", nlabels = TRUE))
Since we're only interested in showing 3 of these levels, we'll remove the others.
polviews_chart <- filter(polviews_chart, polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal")) polviews_chart$polviews <- droplevels(polviews_chart$polviews)
We'll use the created liberal
variable for the markers.
polviews_results <- freq_prop_test(gss_data_polviews, "liberal", "coninc_6", weight = "wtssall", nlabels = TRUE) ( polviews_sig <- polviews_results %>% tbl_sig("coninc_6", space_label = 0.1, space_between = 0.03) %>% filter(level == "Liberal") )
We also need the total proportions for the labels.
chart_labels <- polviews_chart %>% group_by(coninc_6) %>% summarize(prop = sum(prop))
Now turning to the chart. Here, we'll need two sets of colours; one for the scales, and one for the subgroups.
colour_scale_vec <- c("#d95f0e", "#fec44f", "#fff7bc") colour_subgroup_vec <- c("#253494", "#2c7fb8", "#41b6c4", "#7fcdbb", "#c7e9b4", "#ffffcc") #Applying names to each colour attributes(colour_scale_vec)$names <- levels(polviews_chart$polviews) attributes(colour_subgroup_vec)$names <- levels(polviews_chart$coninc_6) ggplot() + geom_col(data = polviews_chart, aes(x = coninc_6, y = prop, fill = fct_rev(polviews))) + geom_text(data = polviews_chart, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), position = position_stack(vjust = 0.5)) + geom_text(data = chart_labels, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) + scale_fill_manual(values = colour_scale_vec) + scale_colour_manual(values = colour_subgroup_vec) + coord_flip() + scale_y_continuous(limits = c(0, .75), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) + labs(title = "Proportion of Residents who Self-Label as Liberal", subtitle = "by Family Income Level", x = "Level of Family Income", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") + geom_sigmark(polviews_sig, x = "coninc_6", icon = "asterisk") + guides(fill = guide_legend(reverse = TRUE))
This example will now turn to the need for multiple scaled variables to appear across subgroups. There are 6 variables that we'll focus here ranging from LOTR1
to LOTR6
, which represent agreement scales on 6 attitudinal statements. We'll observe whether there are significant differences in whether they agree with these statements across gender (sex
). The code required to produce the desired chart requires a mix of the code from the two previous examples.
Note that not everyone is asked these 6 questions, and there is no indication in the data file as to the logic as to why some respondents were asked and others not. The presence of a dummy variable indicating if they were asked or not would be useful here to ensure the sample sizes match throughout. For the sake of simplicity, the loop below assigns those who are NA
to Neutral
across the 6 questions.
gss_data_LOTR <- gss_data %>% filter(year == "2016", !is.na(LOTR1)) %>% select(wtssall, sex, starts_with("LOTR")) gss_data_LOTR$sex <- fct_recode(gss_data_LOTR$sex, Male = "MALE", Female = "FEMALE") gss_data_LOTR <- mutate_if(gss_data_LOTR, is.factor, funs(fct_explicit_na(., na_level = "Neutral")))
As per the previous example, we'll need to calculate the top 2 box for those who agree or strongly agree across each statement. Since I need to apply the same wrangling across 6 variables, I'm using the mutate_at()
and mutate_if()
functions.
gss_data_LOTR <- gss_data_LOTR %>% mutate_at(vars(starts_with("LOTR")), funs("Top2" = if_else(. %in% c("Strongly agree", "Agree"), "Agree", "Disagree/Neutral"))) %>% mutate_if(is.character, factor)
And as per 2 examples ago, we'll need to use the nest()
and map()
to apply the freq_test_prop
, the tbl_chart
, and the tbl_sig
functions. Unlike last time, my levels differ between the original LOTR
variables and my created LOTRx_Top2
variables. The easiest thing to do at this stage would be to split the original variables from the created variables.
( gss_data_LOTR_nest <- gss_data_LOTR %>% gather(key = LOTR, value = Scale, -sex, -wtssall) ) gss_data_LOTR_nest_Top2 <- filter(gss_data_LOTR_nest, str_detect(LOTR, "Top2")) gss_data_LOTR_nest_Full <- filter(gss_data_LOTR_nest, !str_detect(LOTR, "Top2")) gss_data_LOTR_nest_Top2$LOTR <- str_replace(gss_data_LOTR_nest_Top2$LOTR, "_Top2", "") gss_data_LOTR_nest_Top2 <- mutate_if(gss_data_LOTR_nest_Top2, is.character, factor) gss_data_LOTR_nest_Full <- mutate_if(gss_data_LOTR_nest_Full, is.character, factor) ( gss_data_LOTR_nest_Full <- gss_data_LOTR_nest_Full %>% group_by(LOTR) %>% nest() ) ( gss_data_LOTR_nest_Top2 <- gss_data_LOTR_nest_Top2 %>% group_by(LOTR) %>% nest() ) ( gss_data_LOTR_chart_Full <- gss_data_LOTR_nest_Full %>% mutate(chart = map(data, tbl_chart, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>% unnest(chart) %>% filter(Scale %in% c("Strongly agree", "Agree")) ) gss_data_LOTR_chart_Top2 <- gss_data_LOTR_chart_Full %>% group_by(LOTR, sex) %>% summarize(total_prop = sum(prop)) ( gss_data_LOTR_result_Top2 <- gss_data_LOTR_nest_Top2 %>% mutate(test = map(data, freq_prop_test, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>% unnest(test) %>% filter(level == "Agree") )
Now onto the tbl_sig
function, which I only need to run for the significant differences for the Top 2 box because I want to extract the proportions for the labels.
( my_sig_data <- gss_data_LOTR_result_Top2 %>% group_by(LOTR) %>% nest() %>% mutate(sig = map(data, tbl_sig, "sex", space_label = 0.25, space_between = 0.1)) %>% unnest(sig) )
I'll now merge the total proportions from the label data back to the chart data in order to re-order the chart from high to low. First I need to remove "_Top2" from each level of the LOTR
factor.
my_chart_data <- left_join(gss_data_LOTR_chart_Full, gss_data_LOTR_chart_Top2, by = c("LOTR", "sex")) my_chart_data$Scale <- droplevels(my_chart_data$Scale)
Now, I'll apply the question labels, and reorder them so that they appear from high to low in the chart. Note the usage of "\n" within the labels to provide breaks to avoid the chart getting squished due to the length of the labels.
#Labels taken from here: http://gss.norc.org/documents/codebook/GSS_Codebook_mainbody.pdf and from gss_data_questions my_chart_data$LOTR <- my_chart_data$LOTR %>% fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1", `If something can go\nwrong for me, it will` = "LOTR2", `I'm always optimistic\nabout the future` = "LOTR3", `I hardly ever expect\nthings to go my way` = "LOTR4", `I rarely count on good\nthings happening to me` = "LOTR5", `Overall I expect more good things\nto happen to me than bad` = "LOTR6") %>% fct_reorder(my_chart_data$total_prop) my_sig_data$LOTR <- my_sig_data$LOTR %>% fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1", `If something can go\nwrong for me, it will` = "LOTR2", `I'm always optimistic\nabout the future` = "LOTR3", `I hardly ever expect\nthings to go my way` = "LOTR4", `I rarely count on good\nthings happening to me` = "LOTR5", `Overall I expect more good things\nto happen to me than bad` = "LOTR6")
Let's build up the chart. One wrinkle I'm adding here for the individual scale point labels is to only display if they're over 5%, to avoid the "Strongly agree" and "agree" labels encroaching on each other.
colour_scale_vec <- c("#fff7bc", "#d95f0e") colour_subgroup_vec <- c("#f48041", "#4286f4") #Applying names to each colour attributes(colour_scale_vec)$names <- levels(my_chart_data$Scale) attributes(colour_subgroup_vec)$names <- levels(my_chart_data$sex) ( p <- ggplot() + geom_col(data = my_chart_data, aes(x = LOTR, y = prop, fill = Scale)) + geom_text(data = my_chart_data, aes(x = LOTR, y = prop, group = Scale, label = if_else(prop > 0.05, scales::percent(round(prop, 2)), "")), position = position_stack(vjust = 0.5)) + geom_text(data = my_chart_data, aes(x = LOTR, y = total_prop, label = scales::percent(round(total_prop, 2))), hjust = -0.2) + scale_fill_manual(values = colour_scale_vec) + scale_colour_manual(values = colour_subgroup_vec) + coord_flip() + facet_wrap(~ sex) + scale_y_continuous(limits = c(0, 1.1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) + labs(title = "Levels of Optimism Across Gender", x = "Statement", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") ) p + geom_sigmark(my_sig_data, x = "LOTR", icon = "asterisk", group = "sex") + guides(fill = guide_legend(reverse = TRUE))
We will now turn to the geom_sigmark_waves
function to use for tracking charts across multiple waves. A different function is required because we need to present whether a significant shift is an increase or a decrease over time. For this example, we'll look at the change in the level of confidence Americans have been placing on the financial industry (confinan
).
gss_data1 <- filter(gss_data, year_num >= 2000, !is.na(confinan)) gss_data1$year <- droplevels(gss_data1$year) gss_data1$confinan <- fct_recode(gss_data1$confinan, "A great deal" = "A GREAT DEAL", "Hardly any" = "HARDLY ANY", "Only some" = "ONLY SOME") my_track_results <- freq_prop_test(gss_data1, "confinan", "year", weight = "wtssall") my_chart_track <- tbl_chart(gss_data1, "confinan", "year", weight = "wtssall") my_sig_track <- tbl_sig(my_track_results, "year", compare = "w2w") ggplot() + geom_line(data = my_chart_track, aes(x = year, y = prop, colour = confinan, group = confinan)) + geom_point(data = my_chart_track, aes(x = year, y = prop, colour = confinan), show.legend = FALSE) + geom_text_repel(data = my_chart_track, aes(x = year, y = prop, colour = confinan, label = percent(round(prop, 2))), show.legend = FALSE) + scale_y_continuous(labels = percent) + geom_sigmark_waves(my_sig_track) + labs(title = "Americans' Level of Confidence in the Financial Industry", subtitle = "2000 to 2016", y = "Percentage", x = "Year", colour = "Confidence level", shape = "Sig. Diff.")
For this section, we'll expand to looking at how Americans' level of confidence ("some" or "a great deal") over several topics change over time.
gss_data1 <- select(gss_data, year, year_num, wtssall, starts_with("con"), -coninc) gss_data1 <- filter(gss_data1, rowSums(is.na(gss_data1)) < 13, year_num >= 2000) %>% select(-year_num) gss_data1 <- rename(gss_data1, Financial = confinan, Business = conbus, Religion = conclerg, Education = coneduc, POTUS = confed, Unions = conlabor, Press = conpress, Medicine = conmedic, Television = contv, SCOTUS = conjudge, Science = consci, Congress = conlegis, Military = conarmy) gss_data1$year <- droplevels(gss_data1$year) gss_data1 <- gss_data1 %>% mutate_if(is.factor, funs("Top2" = if_else(. == "A GREAT DEAL" | . == "ONLY SOME", "Yes", "No"))) %>% select(year, wtssall, ends_with("Top2"), -year_Top2) %>% mutate_if(is.character, factor) names(gss_data1) <- str_replace(names(gss_data1), "_Top2", "") gss_data1_nest <- gss_data1 %>% gather(key = Institution, value = Confident, -year, -wtssall) %>% mutate_if(is.character, factor) gss_data1_result <- gss_data1_nest %>% group_by(Institution) %>% nest() %>% mutate(test = map(data, freq_prop_test, "Confident", "year", weight = "wtssall"), chart = map(data, tbl_chart, "Confident", "year", weight = "wtssall")) gss_data1_result_nest <- gss_data1_result %>% unnest(test) %>% filter(level == "Yes") %>% group_by(Institution) %>% nest() %>% mutate(sig = map(data, tbl_sig, "year", compare = "curr")) gss_data1_chart <- gss_data1_result %>% unnest(chart) %>% filter(Confident == "Yes") gss_data1_sig <- unnest(gss_data1_result_nest, sig) gss_data1_chart <- mutate(gss_data1_chart, Sector = case_when( Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1", Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2", TRUE ~ "Sector 3"), Sector = factor(Sector)) gss_data1_sig <- mutate(gss_data1_sig, Sector = case_when( Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1", Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2", TRUE ~ "Sector 3"), Sector = factor(Sector)) colour_vec <- c("#0db36f", "#c8dd56", "#030303", "#72477f", "#382b57", "#ce215b", "#e1a83e", "#39b287", "#71c3ca", "#3a8bea", "#034385", "#9f2b68", "#9d0e27") names(colour_vec) <- levels(gss_data1_chart$Institution) ggplot() + geom_line(data = gss_data1_chart, aes(x = year, y = prop, group = Institution, colour = Institution)) + geom_point(data = gss_data1_chart, aes(x = year, y = prop, colour = Institution)) + geom_text_repel(data = gss_data1_chart, aes(x = year, y = prop, label = percent(round(prop, 2)), colour = Institution), show.legend = FALSE) + labs(title = "Proportion of Americans with at least some level of confidence in Institutions", x = "Year", y = "Percentage", shape = "Sig. Diff.") + scale_y_continuous(labels = percent) + facet_wrap(~ Sector, nrow = 3) + scale_colour_manual(values = colour_vec) + geom_sigmark_waves(gss_data1_sig, colour = "Institution")
I would actually break this into 3 separate charts instead of using facet_wrap()
to make the chart easier to read, but that's beside the point.
Since version 0.0.2, we can now compare a subgroup to the rest of the sample. Here we will use "Pacific" as a region to chart and compare it to the rest.
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree)) gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis, `A Great Deal` = "A GREAT DEAL", `Only Some` = "ONLY SOME", `Hardly Any` = "HARDLY ANY") my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", level = "GRADUATE", weight = "wtssall") my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall") my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.1, compare = "total")
We only want to chart the Graduates, so we'll filter them out.
my_conlegis_chart <- filter(my_conlegis_chart, degree == "GRADUATE") my_conlegis_sig <- filter(my_conlegis_sig, degree == "GRADUATE") ( p <- ggplot() + geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop)) + geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) + coord_flip() + scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) + labs(title = "Graduates' Level of Confidence on US Congress in 2016", x = "Level of Confidence", y = "Percentage", shape = "Significant Differences") )
Here, we will use geom_sigmark_total
:
p + geom_sigmark_total(my_conlegis_sig, labels = c("Higher than those with \nother levels of education", "Lower than those with \nother levels of education")) + theme(legend.key.height = unit(1.5, "lines"))
We'll now expand on the previous section, and look into setting up a chart with multiple dummy variables. Here's we'll focus into how those in the 18-24 age category differ from the rest of the sample on their usage of social media:
gss_data_prop <- gss_data %>% filter(year == "2016", !is.na(degree)) %>% select(wtssall, age_7, Facebook = FACEBOOK, Twitter = TWITTER, Instagram = INSTAGRM, LinkedIn = LINKEDIN, Snapchat = SNAPCHAT, Tumblr = TUMBLR, Whatsapp = WHATSAPP, `Google+` = GOOGLESN, Pinterest = PINTERST, Flickr = FLICKR, Vine = VINE, Classmates = CLSSMTES) #Switch NA to "No" so that the full bases are kept. gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No"))) ( gss_data_prop_nest <- gss_data_prop %>% gather(key = Social_Media, value = Use, -age_7, -wtssall) ) gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor) ( gss_data_prop_nest <- gss_data_prop_nest %>% group_by(Social_Media) %>% nest() ) ( gss_data_result <- gss_data_prop_nest %>% mutate(test = map(data, freq_prop_test, "Use", "age_7", level = "18-24", weight = "wtssall"), chart = map(data, tbl_chart, "Use", "age_7", weight = "wtssall")) ) ( my_sig_data <- gss_data_result %>% unnest(test) %>% filter(level == "Yes") %>% group_by(Social_Media) %>% nest() %>% mutate(sig = map(data, tbl_sig, "age_7", space_label = 0.1, compare = "total")) %>% unnest(sig) ) (my_chart_data <- gss_data_result %>% unnest(chart) %>% filter(age_7 == "18-24", Use == "Yes"))
We just need to re-order the Social_Media
variable based on the prop
variable, and we're ready to chart.
my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$prop) ( ggplot() + geom_col(data = my_chart_data, aes(x = Social_Media, y = prop)) + geom_text(data = my_chart_data, aes(x = Social_Media, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) + coord_flip() + scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) + labs(title = "Level of Usage of Social Media in 2016 Among 18-24 Year Old", x = "Social Media", y = "Percentage", shape = "Significant Differences") + geom_sigmark_total(my_sig_data, x = "Social_Media", labels = c("Higher than \nthose aged 25+", "Lower than \nthose aged 25+")) + theme(legend.key.height = unit(1.5, "lines")) )
freq_test_prop
to freq_prop_test
for consistency with *_test
functions going forward.tbl_chart
so that it can be run directly off the raw data frame instead of the *_test
functions.subgroup
input in tbl_sig
so that the function's output matches with the tbl_chart
output.min_sample
argument from tbl_sig
to the *_test
functions to increase computational efficiency by avoiding the tests to be run on subgroups the user would ignore if it is below the specified sample size threshold anyway.freq_t_test
that wasn't properly computing the weighted mean.keep
argument to compare
for clarity in the tbl_sig
function.compare
argument when one wishes to compare a particular subgroup to the rest of the sample.tbl_sig
to make it easier to include that information in the charts.nlabels
and newline
logical arguments to freq_test_prop
and freq_t_test
functions to make this change seamless. The following is a to-do list for future additions to the package, in order of importance from high to low:
freq_prop_test
function does not work well when one of the sample groups has a sample size of 0.Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.