Market researchers typically use symbols in charts delivered to clients
to denote whether a specific subgroup's result is (statistically
speaking) significantly higher than another subgroup's result. The
ggsigmark package aims to make it easier for market researchers to
use ggplot2 by providing the ability to flag statistically significant
differences with markers in the plots they create.
This vignette provides a manual on how to use the ggsigmark
functions
and lists planned future additions to the package.
To install the package from GitHub, you will first need to install the
devtools
package.
install.packages("devtools")
The ggsigmark
package can be installed by running the following line:
devtools::install_github("philstraforelli/ggsigmark")
It is made available for usage by running the following line:
library(ggsigmark)
The code below makes use of tidyverse code (a natural fit, since the
ultimate purpose is to use ggplot2), so we might as well load up
that package as well, along with others that are used in this vignette:
library(tidyverse)
library(broom)
library(forcats)
library(ggrepel)
library(scales)
library(stringr)
gss_data: General Social Survey data set

The package uses a reduced version of NORC's General Social
Survey to illustrate the usage of the ggsigmark
tools. This data was chosen because it represents the type of data that
market researchers generally work with; that is, tracking survey data
(in this case run on a biennial basis) with post-stratification weights.
The full original data file includes, as of this writing, over 65,000
responses and nearly 6,000 variables (many of which are defunct
questions). To save memory, only a select list of variables is
included, and the gss_data dataset only includes data from 2000 to
2016 (the GSS was first run in 1972). This brings the data frame
down to a more manageable 24,350 responses and 101 variables.
The full data can be downloaded here. More information about the GSS data is available here.
ggsigmark functions

The following flowchart illustrates the workflow for using the ggsigmark
functions.
freq_pair_wtd_t_test function

Surprisingly, while the stats package includes a pairwise.t.test function to make multiple pairwise comparisons between group levels (with corrections for multiple testing), and the Hmisc package provides a wtd.mean function to calculate means with post-stratification weights, there was no function available to run pairwise comparison t-tests with weighted means.
As with the pairwise.t.test function, an adjustment is made to the p-value to mitigate the chance of a Type I error, defaulted to the "Holm" method (as opposed to the better known Bonferroni correction, generally deemed too conservative).
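To illustrate the difference between the two corrections, base R's p.adjust() function can be applied directly (a quick sketch with arbitrary p-values, not taken from the GSS data):

```r
# Three hypothetical raw p-values from pairwise comparisons
p_raw <- c(0.010, 0.020, 0.030)

# Holm: a step-down method, uniformly less conservative than Bonferroni
p.adjust(p_raw, method = "holm")        # 0.03 0.04 0.04

# Bonferroni: multiplies each p-value by the number of tests
p.adjust(p_raw, method = "bonferroni")  # 0.03 0.06 0.09
```

The larger adjusted values under Bonferroni show why it is generally deemed the more conservative of the two.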
The function freq_pair_wtd_t_test fills that gap. This function is
used in the background by the freq_t_test function, and most users
would likely not need to run it directly. However, it may have
its uses to some, and it is therefore made available here. The following
code is an example of its usage, testing whether the coninc variable
(inflation-adjusted family income) differs by level of education (degree), and
whether respondents' age (age_num) differs by region. In
both cases, the weight wtssall is applied:
freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall)
##
## Pairwise comparisons using Unpaired T test
##
## data: gss_data$coninc and gss_data$degree
##
## LT HIGH SCHOOL HIGH SCHOOL JUNIOR COLLEGE BACHELOR
## HIGH SCHOOL <2e-16 - - -
## JUNIOR COLLEGE <2e-16 <2e-16 - -
## BACHELOR <2e-16 <2e-16 <2e-16 -
## GRADUATE <2e-16 <2e-16 <2e-16 <2e-16
##
## P value adjustment method: holm
freq_pair_wtd_t_test(x = gss_data$age_num, subgroup = gss_data$region, weight = gss_data$wtssall)
##
## Pairwise comparisons using Unpaired T test
##
## data: gss_data$age_num and gss_data$region
##
## NEW ENGLAND MIDDLE ATLANTIC E. NOR. CENTRAL
## MIDDLE ATLANTIC 1.00000 - -
## E. NOR. CENTRAL 1.00000 1.00000 -
## W. NOR. CENTRAL 1.00000 1.00000 1.00000
## SOUTH ATLANTIC 1.00000 1.00000 1.00000
## E. SOU. CENTRAL 1.00000 1.00000 1.00000
## W. SOU. CENTRAL 0.13380 1.00000 0.92937
## MOUNTAIN 0.00121 0.06522 0.00426
## PACIFIC 5.7e-05 0.00211 1.9e-05
## W. NOR. CENTRAL SOUTH ATLANTIC E. SOU. CENTRAL
## MIDDLE ATLANTIC - - -
## E. NOR. CENTRAL - - -
## W. NOR. CENTRAL - - -
## SOUTH ATLANTIC 1.00000 - -
## E. SOU. CENTRAL 1.00000 1.00000 -
## W. SOU. CENTRAL 1.00000 0.20971 0.79919
## MOUNTAIN 0.37097 0.00036 0.01023
## PACIFIC 0.07095 2.0e-07 0.00054
## W. SOU. CENTRAL MOUNTAIN
## MIDDLE ATLANTIC - -
## E. NOR. CENTRAL - -
## W. NOR. CENTRAL - -
## SOUTH ATLANTIC - -
## E. SOU. CENTRAL - -
## W. SOU. CENTRAL - -
## MOUNTAIN 1.00000 -
## PACIFIC 0.28691 1.00000
##
## P value adjustment method: holm
The resulting tables above list the p-values for each pairwise
comparison. The output is not ideal for further handling, so the user may
wish to use the tidy() function from the broom package:
tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$degree, weight = gss_data$wtssall))
## group1 group2 p.value
## 1 HIGH SCHOOL LT HIGH SCHOOL 3.383092e-100
## 2 JUNIOR COLLEGE LT HIGH SCHOOL 2.992877e-107
## 3 BACHELOR LT HIGH SCHOOL 0.000000e+00
## 4 GRADUATE LT HIGH SCHOOL 0.000000e+00
## 6 JUNIOR COLLEGE HIGH SCHOOL 8.027388e-19
## 7 BACHELOR HIGH SCHOOL 0.000000e+00
## 8 GRADUATE HIGH SCHOOL 0.000000e+00
## 11 BACHELOR JUNIOR COLLEGE 2.294531e-68
## 12 GRADUATE JUNIOR COLLEGE 9.895405e-162
## 16 GRADUATE BACHELOR 4.389259e-44
tidy(freq_pair_wtd_t_test(x = gss_data$coninc, subgroup = gss_data$region, weight = gss_data$wtssall))
## group1 group2 p.value
## 1 MIDDLE ATLANTIC NEW ENGLAND 3.111807e-04
## 2 E. NOR. CENTRAL NEW ENGLAND 1.185668e-16
## 3 W. NOR. CENTRAL NEW ENGLAND 7.889610e-19
## 4 SOUTH ATLANTIC NEW ENGLAND 2.420030e-19
## 5 E. SOU. CENTRAL NEW ENGLAND 2.032580e-25
## 6 W. SOU. CENTRAL NEW ENGLAND 4.160657e-28
## 7 MOUNTAIN NEW ENGLAND 1.742476e-22
## 8 PACIFIC NEW ENGLAND 2.167928e-05
## 10 E. NOR. CENTRAL MIDDLE ATLANTIC 3.228809e-08
## 11 W. NOR. CENTRAL MIDDLE ATLANTIC 1.745968e-10
## 12 SOUTH ATLANTIC MIDDLE ATLANTIC 4.852845e-11
## 13 E. SOU. CENTRAL MIDDLE ATLANTIC 4.914109e-17
## 14 W. SOU. CENTRAL MIDDLE ATLANTIC 2.452296e-20
## 15 MOUNTAIN MIDDLE ATLANTIC 4.280706e-14
## 16 PACIFIC MIDDLE ATLANTIC 1.000000e+00
## 19 W. NOR. CENTRAL E. NOR. CENTRAL 2.466990e-01
## 20 SOUTH ATLANTIC E. NOR. CENTRAL 1.000000e+00
## 21 E. SOU. CENTRAL E. NOR. CENTRAL 1.781225e-04
## 22 W. SOU. CENTRAL E. NOR. CENTRAL 8.198474e-05
## 23 MOUNTAIN E. NOR. CENTRAL 1.579073e-02
## 24 PACIFIC E. NOR. CENTRAL 5.577282e-07
## 28 SOUTH ATLANTIC W. NOR. CENTRAL 8.297285e-01
## 29 E. SOU. CENTRAL W. NOR. CENTRAL 5.666813e-01
## 30 W. SOU. CENTRAL W. NOR. CENTRAL 8.297285e-01
## 31 MOUNTAIN W. NOR. CENTRAL 1.000000e+00
## 32 PACIFIC W. NOR. CENTRAL 2.415946e-09
## 37 E. SOU. CENTRAL SOUTH ATLANTIC 1.646748e-03
## 38 W. SOU. CENTRAL SOUTH ATLANTIC 1.205193e-03
## 39 MOUNTAIN SOUTH ATLANTIC 1.074700e-01
## 40 PACIFIC SOUTH ATLANTIC 9.718589e-10
## 46 W. SOU. CENTRAL E. SOU. CENTRAL 1.000000e+00
## 47 MOUNTAIN E. SOU. CENTRAL 1.000000e+00
## 48 PACIFIC E. SOU. CENTRAL 9.086298e-16
## 55 MOUNTAIN W. SOU. CENTRAL 1.000000e+00
## 56 PACIFIC W. SOU. CENTRAL 5.154153e-19
## 64 PACIFIC MOUNTAIN 8.010958e-13
freq_t_test function

Most users would consider the freq_t_test function more useful, as it
takes the output from the freq_pair_wtd_t_test function and wrangles
it into a more convenient format. The output from the freq_t_test function
consists of a data frame in which each row is a possible
pairwise comparison, with the following variables:

- The two subgroups being compared (group1 and group2)
- The p.value for that comparison, and a logical indicating whether
  it is significant (i.e. whether it is below the alpha level, which
  can be user-defined but is defaulted to 0.05)
- The sample sizes for each subgroup (Sample_Size_group1 and
  Sample_Size_group2)
- The weighted means for each subgroup (wtd.mean_group1 and
  wtd.mean_group2)

Note that the arguments needed for the freq_t_test function consist of
the following, in order: the data frame, the name of the variable of
interest, the name of the subgroup variable, and (optionally) the name
of the weight variable.
#First filtering only to 2016 cases and among those who gave an inflation-adjusted family income
gss_data_mean <- filter(gss_data, year == "2016", coninc > 0)
freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall") #With weights
## group1 group2 p.value significant
## 1 MIDDLE ATLANTIC NEW ENGLAND 4.422520e-01 FALSE
## 2 E. NOR. CENTRAL NEW ENGLAND 9.378812e-03 TRUE
## 3 W. SOU. CENTRAL NEW ENGLAND 2.101227e-05 TRUE
## 4 W. NOR. CENTRAL NEW ENGLAND 6.247240e-04 TRUE
## 5 E. SOU. CENTRAL NEW ENGLAND 1.163490e-03 TRUE
## 6 MOUNTAIN NEW ENGLAND 1.850896e-02 TRUE
## 7 SOUTH ATLANTIC NEW ENGLAND 3.513405e-02 TRUE
## 8 PACIFIC NEW ENGLAND 4.007362e-01 FALSE
## 9 E. NOR. CENTRAL MIDDLE ATLANTIC 1.000000e+00 FALSE
## 10 W. NOR. CENTRAL MIDDLE ATLANTIC 3.610339e-01 FALSE
## 11 MOUNTAIN MIDDLE ATLANTIC 1.000000e+00 FALSE
## 12 E. SOU. CENTRAL MIDDLE ATLANTIC 5.045073e-01 FALSE
## 13 W. SOU. CENTRAL MIDDLE ATLANTIC 5.105789e-02 FALSE
## 14 PACIFIC MIDDLE ATLANTIC 1.000000e+00 FALSE
## 15 SOUTH ATLANTIC MIDDLE ATLANTIC 1.000000e+00 FALSE
## 16 W. NOR. CENTRAL E. NOR. CENTRAL 1.000000e+00 FALSE
## 17 SOUTH ATLANTIC E. NOR. CENTRAL 1.000000e+00 FALSE
## 18 MOUNTAIN E. NOR. CENTRAL 1.000000e+00 FALSE
## 19 E. SOU. CENTRAL E. NOR. CENTRAL 1.000000e+00 FALSE
## 20 PACIFIC E. NOR. CENTRAL 1.000000e+00 FALSE
## 21 W. SOU. CENTRAL E. NOR. CENTRAL 6.372623e-01 FALSE
## 22 SOUTH ATLANTIC W. NOR. CENTRAL 9.768747e-01 FALSE
## 23 W. SOU. CENTRAL W. NOR. CENTRAL 1.000000e+00 FALSE
## 24 MOUNTAIN W. NOR. CENTRAL 1.000000e+00 FALSE
## 25 PACIFIC W. NOR. CENTRAL 2.433656e-01 FALSE
## 26 E. SOU. CENTRAL W. NOR. CENTRAL 1.000000e+00 FALSE
## 27 MOUNTAIN SOUTH ATLANTIC 1.000000e+00 FALSE
## 28 E. SOU. CENTRAL SOUTH ATLANTIC 1.000000e+00 FALSE
## 29 PACIFIC SOUTH ATLANTIC 1.000000e+00 FALSE
## 30 W. SOU. CENTRAL SOUTH ATLANTIC 2.038831e-01 FALSE
## 31 W. SOU. CENTRAL E. SOU. CENTRAL 1.000000e+00 FALSE
## 32 MOUNTAIN E. SOU. CENTRAL 1.000000e+00 FALSE
## 33 PACIFIC E. SOU. CENTRAL 3.865496e-01 FALSE
## 34 MOUNTAIN W. SOU. CENTRAL 1.000000e+00 FALSE
## 35 PACIFIC W. SOU. CENTRAL 2.461451e-02 TRUE
## 36 PACIFIC MOUNTAIN 1.000000e+00 FALSE
## Sample_Size_group1 Sample_Size_group2 wtd.mean_group1 wtd.mean_group2
## 1 278 159 56184.96 66071.30
## 2 461 159 51884.27 66071.30
## 3 273 159 44860.81 66071.30
## 4 177 159 46136.68 66071.30
## 5 185 159 47019.22 66071.30
## 6 223 159 50890.25 66071.30
## 7 488 159 53476.81 66071.30
## 8 352 159 56358.38 66071.30
## 9 461 278 51884.27 56184.96
## 10 177 278 46136.68 56184.96
## 11 223 278 50890.25 56184.96
## 12 185 278 47019.22 56184.96
## 13 273 278 44860.81 56184.96
## 14 352 278 56358.38 56184.96
## 15 488 278 53476.81 56184.96
## 16 177 461 46136.68 51884.27
## 17 488 461 53476.81 51884.27
## 18 223 461 50890.25 51884.27
## 19 185 461 47019.22 51884.27
## 20 352 461 56358.38 51884.27
## 21 273 461 44860.81 51884.27
## 22 488 177 53476.81 46136.68
## 23 273 177 44860.81 46136.68
## 24 223 177 50890.25 46136.68
## 25 352 177 56358.38 46136.68
## 26 185 177 47019.22 46136.68
## 27 223 488 50890.25 53476.81
## 28 185 488 47019.22 53476.81
## 29 352 488 56358.38 53476.81
## 30 273 488 44860.81 53476.81
## 31 273 185 44860.81 47019.22
## 32 223 185 50890.25 47019.22
## 33 352 185 56358.38 47019.22
## 34 223 273 50890.25 44860.81
## 35 352 273 56358.38 44860.81
## 36 352 223 56358.38 50890.25
freq_t_test(gss_data_mean, "coninc", "degree") #Without weights
## group1 group2 p.value significant
## 1 HIGH SCHOOL LT HIGH SCHOOL 3.633278e-05 TRUE
## 2 JUNIOR COLLEGE LT HIGH SCHOOL 2.082524e-07 TRUE
## 3 BACHELOR LT HIGH SCHOOL 1.028278e-42 TRUE
## 4 GRADUATE LT HIGH SCHOOL 2.861872e-63 TRUE
## 5 JUNIOR COLLEGE HIGH SCHOOL 3.344687e-03 TRUE
## 6 BACHELOR HIGH SCHOOL 9.442534e-46 TRUE
## 7 GRADUATE HIGH SCHOOL 4.199987e-69 TRUE
## 8 BACHELOR JUNIOR COLLEGE 2.883638e-10 TRUE
## 9 GRADUATE JUNIOR COLLEGE 1.739274e-23 TRUE
## 10 GRADUATE BACHELOR 5.587394e-07 TRUE
## Sample_Size_group1 Sample_Size_group2 wtd.mean_group1 wtd.mean_group2
## 1 1317 279 37944.81 27334.50
## 2 200 279 46300.44 27334.50
## 3 489 279 66997.66 27334.50
## 4 308 279 81251.88 27334.50
## 5 200 1317 46300.44 37944.81
## 6 489 1317 66997.66 37944.81
## 7 308 1317 81251.88 37944.81
## 8 489 200 66997.66 46300.44
## 9 308 200 81251.88 46300.44
## 10 308 489 81251.88 66997.66
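Because the output is a plain data frame, standard dplyr verbs can be applied to it directly. For instance, a quick sketch of narrowing the weighted regional results down to only the significant comparisons:

```r
my_results <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall")

# Keep only the rows flagged as significant at the default alpha of 0.05
filter(my_results, significant)
```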
The names of the levels of the region variable are currently a little
unsightly, so we'll use the fct_recode() function from the forcats
package to make them look nicer. I'm also going to use fct_relevel()
so that the order of the regions in the chart roughly follows
the geographical order from west to east.
levels(gss_data_mean$region)
## [1] "NEW ENGLAND" "MIDDLE ATLANTIC" "E. NOR. CENTRAL" "W. NOR. CENTRAL"
## [5] "SOUTH ATLANTIC" "E. SOU. CENTRAL" "W. SOU. CENTRAL" "MOUNTAIN"
## [9] "PACIFIC"
gss_data_mean$region <- gss_data_mean$region %>%
fct_recode(
`New England` = "NEW ENGLAND",
`Middle Atlantic` = "MIDDLE ATLANTIC",
`East North Central` = "E. NOR. CENTRAL",
`West North Central` = "W. NOR. CENTRAL",
`South Atlantic` = "SOUTH ATLANTIC",
`East South Central` = "E. SOU. CENTRAL",
`West South Central` = "W. SOU. CENTRAL",
Mountain = "MOUNTAIN",
Pacific = "PACIFIC") %>%
fct_relevel(
"Pacific",
"Mountain",
"West South Central",
"West North Central",
"East South Central",
"East North Central",
"South Atlantic",
"Middle Atlantic",
"New England")
freq_prop_test function

The freq_prop_test function behaves similarly to the freq_t_test
function, except that it focuses on proportions instead of means.
It relies on the pairwise.prop.test function in the background, but
allows for the usage of post-stratification weights. The output of the
freq_prop_test function consists of the following:

- The two subgroups being compared (group1 and group2)
- The level on which the proportions for group1 and group2 are being
  compared
- The p.value for that comparison, and a logical indicating whether
  it is significant (i.e. whether it is below the alpha level, which
  can be user-defined but is defaulted to 0.05)
- The sample sizes for each subgroup (Sample_Size_group1 and
  Sample_Size_group2)
- The proportions for each subgroup (prop_group1 and prop_group2)

Note that the arguments needed for the freq_prop_test function consist
of the following, in order: the data frame, the name of the variable of
interest, the name of the subgroup variable, and (optionally) the name
of the weight variable.
#First filtering only to 2016 cases and among those who gave an answer on their level of confidence with Congress
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(partyid))
#Are there differences in Americans' level of confidence with US Congress across their party identification?
gss_data_prop$partyidcoll <- fct_collapse(gss_data_prop$partyid,
Democrat = c("STRONG DEMOCRAT", "NOT STR DEMOCRAT"),
Independent = c("IND,NEAR DEM", "INDEPENDENT", "IND,NEAR REP"),
Republican = c("NOT STR REPUBLICAN", "STRONG REPUBLICAN"),
Other = "OTHER PARTY")
#With weights
freq_prop_test(gss_data_prop, "conlegis", "partyidcoll", weight = "wtssall")
## group1 group2 level p.value significant
## 1 Independent Democrat A GREAT DEAL 1.000000000 FALSE
## 2 Republican Democrat A GREAT DEAL 1.000000000 FALSE
## 3 Other Democrat A GREAT DEAL 1.000000000 FALSE
## 4 Independent Democrat ONLY SOME 0.009432938 TRUE
## 5 Republican Democrat ONLY SOME 0.859591206 FALSE
## 6 Other Democrat ONLY SOME 0.059022114 FALSE
## 7 Independent Democrat HARDLY ANY 0.092655207 FALSE
## 8 Other Democrat HARDLY ANY 0.160015743 FALSE
## 9 Republican Democrat HARDLY ANY 0.954596593 FALSE
## 10 Republican Independent A GREAT DEAL 1.000000000 FALSE
## 11 Other Independent A GREAT DEAL 1.000000000 FALSE
## 12 Republican Independent HARDLY ANY 0.164178736 FALSE
## 13 Other Independent HARDLY ANY 0.468843493 FALSE
## 14 Republican Independent ONLY SOME 0.051645473 FALSE
## 15 Other Independent ONLY SOME 0.439684591 FALSE
## 16 Other Republican ONLY SOME 0.067348400 FALSE
## 17 Other Republican HARDLY ANY 0.164178736 FALSE
## 18 Other Republican A GREAT DEAL 1.000000000 FALSE
## Sample_Size_group1 Sample_Size_group2 prop_group1 prop_group2
## 1 800 641 0.06862036 0.05006322
## 2 410 641 0.05383465 0.05006322
## 3 52 641 0.06885408 0.05006322
## 4 800 641 0.37286146 0.45774682
## 5 410 641 0.45017368 0.45774682
## 6 52 641 0.28477256 0.45774682
## 7 800 641 0.55851818 0.49218996
## 8 52 641 0.64637336 0.49218996
## 9 410 641 0.49599167 0.49218996
## 10 410 800 0.05383465 0.06862036
## 11 52 800 0.06885408 0.06862036
## 12 410 800 0.49599167 0.55851818
## 13 52 800 0.64637336 0.55851818
## 14 410 800 0.45017368 0.37286146
## 15 52 800 0.28477256 0.37286146
## 16 52 410 0.28477256 0.45017368
## 17 52 410 0.64637336 0.49599167
## 18 52 410 0.06885408 0.05383465
#Are there differences in the usage of LinkedIn by levels of education?
gss_data_prop2 <- filter(gss_data, year == "2016", !is.na(LINKEDIN), !is.na(degree))
#Without weights, and filtering only to the "Yes" level
gss_data_prop2 %>%
freq_prop_test("LINKEDIN", "degree") %>%
filter(level == "Yes")
## group1 group2 level p.value significant
## 1 HIGH SCHOOL LT HIGH SCHOOL Yes 2.038852e-02 TRUE
## 2 JUNIOR COLLEGE LT HIGH SCHOOL Yes 2.468184e-03 TRUE
## 3 GRADUATE LT HIGH SCHOOL Yes 2.210382e-11 TRUE
## 4 BACHELOR LT HIGH SCHOOL Yes 1.851483e-11 TRUE
## 5 JUNIOR COLLEGE HIGH SCHOOL Yes 2.003130e-01 FALSE
## 6 BACHELOR HIGH SCHOOL Yes 5.028227e-20 TRUE
## 7 GRADUATE HIGH SCHOOL Yes 4.794632e-17 TRUE
## 8 BACHELOR JUNIOR COLLEGE Yes 4.753333e-04 TRUE
## 9 GRADUATE JUNIOR COLLEGE Yes 4.665409e-04 TRUE
## 10 GRADUATE BACHELOR Yes 7.422444e-01 FALSE
## Sample_Size_group1 Sample_Size_group2 prop_group1 prop_group2
## 1 679 93 0.1826215 0.06451613
## 2 110 93 0.2545455 0.06451613
## 3 188 93 0.4946809 0.06451613
## 4 301 93 0.4750831 0.06451613
## 5 110 679 0.2545455 0.18262150
## 6 301 679 0.4750831 0.18262150
## 7 188 679 0.4946809 0.18262150
## 8 301 110 0.4750831 0.25454545
## 9 188 110 0.4946809 0.25454545
## 10 188 301 0.4946809 0.47508306
tbl_chart function

The tbl_chart function is a convenience function that takes the raw
data and turns it into a data frame that allows for easy charting in
ggplot2. The user may, however, find
that some further wrangling is needed prior to charting.
First, let's use the function on the results for proportions:
tbl_chart(gss_data_prop, "conlegis", "region", weight = "wtssall")
## region conlegis prop Sample_Size
## 1 NEW ENGLAND A GREAT DEAL 0.02442069 119
## 2 MIDDLE ATLANTIC A GREAT DEAL 0.04689564 218
## 3 E. NOR. CENTRAL A GREAT DEAL 0.05054757 327
## 4 W. NOR. CENTRAL A GREAT DEAL 0.06970093 127
## 5 SOUTH ATLANTIC A GREAT DEAL 0.05078355 366
## 6 E. SOU. CENTRAL A GREAT DEAL 0.08040465 134
## 7 W. SOU. CENTRAL A GREAT DEAL 0.05643535 194
## 8 MOUNTAIN A GREAT DEAL 0.07232877 156
## 9 PACIFIC A GREAT DEAL 0.08561493 262
## 10 NEW ENGLAND ONLY SOME 0.40286534 119
## 11 MIDDLE ATLANTIC ONLY SOME 0.42561665 218
## 12 E. NOR. CENTRAL ONLY SOME 0.35691209 327
## 13 W. NOR. CENTRAL ONLY SOME 0.42547619 127
## 14 SOUTH ATLANTIC ONLY SOME 0.40887691 366
## 15 E. SOU. CENTRAL ONLY SOME 0.31844158 134
## 16 W. SOU. CENTRAL ONLY SOME 0.55880086 194
## 17 MOUNTAIN ONLY SOME 0.35487771 156
## 18 PACIFIC ONLY SOME 0.44625534 262
## 19 NEW ENGLAND HARDLY ANY 0.57271397 119
## 20 MIDDLE ATLANTIC HARDLY ANY 0.52748771 218
## 21 E. NOR. CENTRAL HARDLY ANY 0.59254034 327
## 22 W. NOR. CENTRAL HARDLY ANY 0.50482288 127
## 23 SOUTH ATLANTIC HARDLY ANY 0.54033954 366
## 24 E. SOU. CENTRAL HARDLY ANY 0.60115377 134
## 25 W. SOU. CENTRAL HARDLY ANY 0.38476379 194
## 26 MOUNTAIN HARDLY ANY 0.57279352 156
## 27 PACIFIC HARDLY ANY 0.46812973 262
And again on the results for means:
tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall")
## region wtd.mean Sample_Size
## 1 Pacific 56358.38 352
## 2 Mountain 50890.25 223
## 3 West South Central 44860.81 273
## 4 West North Central 46136.68 177
## 5 East South Central 47019.22 185
## 6 East North Central 51884.27 461
## 7 South Atlantic 53476.81 488
## 8 Middle Atlantic 56184.96 278
## 9 New England 66071.30 159
NOTE: In versions prior to v0.0.3, this function took the output of
either freq_t_test or freq_prop_test as input. It now takes the raw
data as input for greater convenience.
tbl_sig function

The tbl_sig function is a convenience function that takes the output
from either the freq_t_test or the freq_prop_test functions and
turns it into a data frame that allows for easy charting with
geom_sigmark. The user may, however, find that some further wrangling
is needed prior to charting. Its input consists of the following:

- The output from either the freq_t_test or the freq_prop_test
  functions
- A space_label numeric value indicating how much spacing is needed
  between the label and the marker. Some trial and error is needed to
  determine the ideal value.
- A space_between numeric value indicating how much spacing is
  needed between markers. Some trial and error is needed to determine
  the ideal value.
- A keep argument that should be one of three options.

First, let's use the function on the results for proportions:
my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "region", weight = "wtssall")
## Warning in prop.test(x[c(i, j)], n[c(i, j)], ...): Chi-squared
## approximation may be incorrect
tbl_sig(my_results_prop, "region", space_label = 0.1, space_between = 0.05)
## level region sign group_prop sign_prop
## 1 HARDLY ANY E. NOR. CENTRAL W. SOU. CENTRAL 0.5925403 0.3847638
## 2 HARDLY ANY SOUTH ATLANTIC W. SOU. CENTRAL 0.5403395 0.3847638
## 3 HARDLY ANY E. SOU. CENTRAL W. SOU. CENTRAL 0.6011538 0.3847638
## 4 ONLY SOME W. SOU. CENTRAL E. NOR. CENTRAL 0.5588009 0.3569121
## 5 ONLY SOME W. SOU. CENTRAL SOUTH ATLANTIC 0.5588009 0.4088769
## 6 ONLY SOME W. SOU. CENTRAL E. SOU. CENTRAL 0.5588009 0.3184416
## 7 ONLY SOME W. SOU. CENTRAL MOUNTAIN 0.5588009 0.3548777
## 8 HARDLY ANY MOUNTAIN W. SOU. CENTRAL 0.5727935 0.3847638
## group_Sample_Size sign_Sample_Size pos
## 1 327 194 0.6925403
## 2 366 194 0.6403395
## 3 134 194 0.7011538
## 4 194 327 0.6588009
## 5 194 366 0.7088009
## 6 194 134 0.7588009
## 7 194 156 0.8088009
## 8 156 194 0.6727935
And again on the results for means:
my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall")
tbl_sig(my_results_mean, "region", space_label = 0.1, space_between = 0.05)
## region sign group_wtd.mean sign_wtd.mean
## 1 Pacific West South Central 56358.38 44860.81
## 2 New England Mountain 66071.30 50890.25
## 3 New England West South Central 66071.30 44860.81
## 4 New England West North Central 66071.30 46136.68
## 5 New England East South Central 66071.30 47019.22
## 6 New England East North Central 66071.30 51884.27
## 7 New England South Atlantic 66071.30 53476.81
## group_Sample_Size sign_Sample_Size pos
## 1 352 273 56358.48
## 2 159 223 66071.40
## 3 159 273 66071.45
## 4 159 177 66071.50
## 5 159 185 66071.55
## 6 159 461 66071.60
## 7 159 488 66071.65
geom_sigmark, geom_sigmark_interactive, geom_sigmark_waves, and geom_sigmark_waves_interaction functions

The geom_sigmark function is a geom_point wrapper that identifies with
a marker whether a particular subgroup's result is, statistically
speaking, significantly higher than another subgroup's result. In cases
where more than 2 subgroups are being compared, the use of colours is
key to identifying which subgroup's result is being flagged as
significantly lower.

All 4 geom_sigmark functions require several arguments:

- The data, which would typically be the output of the tbl_sig
  function
- The x and y arguments indicating what axes are used to chart the
  markers. The group argument is provided if needed, along with the
  colour argument indicating how the markers should be coloured.
- The icon the user would like to have displayed where there are
  significant differences
- The size of the icon. Generally a size of 5 is deemed appropriate.

The 2 _waves variants also include the following arguments:

- A direction argument to specify which variable indicates whether
  the marker should be an increase or a decrease
- A labels argument to specify how the markers should be labeled
  in the plot legend. By default they are "Increase"/"Decrease".

The 2 _interactive variants also include the following arguments (both
from ggiraph):

- A tooltip argument to specify which variable indicates what gets
  shown in the tooltip when the mouse hovers over a marker
- A data_id argument to specify which variable should be
  highlighted if the end user clicks on a marker

The geom_sigmark function is best displayed by providing examples of
how it can be utilized with the most common types of charts used in
market research. These are listed below.
ggsigmark function

We will use the coninc variable, which indicates the family income of
each respondent, adjusted for inflation, and use the previous examples
indicating whether there is a difference in income across regions in the
US. I will re-create the my_results_mean object, this time setting the
nlabels argument to TRUE. This merges the sample sizes with the
names of the regions in order to display both in the chart.
my_results_mean <- freq_t_test(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE)
my_chart_data <- tbl_chart(gss_data_mean, "coninc", "region", weight = "wtssall", nlabels = TRUE)
my_sig_data <- tbl_sig(my_results_mean, "region", space_label = 5000, space_between = 2200)
As noted above, it is best to have the subgroups (i.e. the levels of the
region
variable) have their own specific colours. We therefore create
the following labeled vector:
colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0", "#2175d9", "#00308f", "#e30074", "#b8d000")
#Applying names of regions to each colour
attributes(colour_vec)$names <- levels(my_chart_data$region)
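As an aside, the same named vector can be built in one step with base R's setNames(), which some may find cleaner than assigning the names attribute afterwards:

```r
colour_vec <- setNames(
  c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0",
    "#2175d9", "#00308f", "#e30074", "#b8d000"),
  levels(my_chart_data$region))
```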
And we now create the plot, assigning it to the object p.
Note that we're re-labeling the x-axis to avoid the text overlapping.
(
p <- ggplot() +
geom_col(data = my_chart_data, aes(x = region, y = wtd.mean, fill = region)) +
geom_text(data = my_chart_data, aes(x = region, y = wtd.mean,
label = scales::dollar(round(wtd.mean, 0))),
vjust = -0.2,
size = 3.25) +
scale_fill_manual(values = colour_vec) +
scale_colour_manual(values = colour_vec) +
scale_y_continuous(limits = c(0, 85000), labels = scales::dollar) +
scale_x_discrete(labels = c("Pacific (n=352)" = "Pacific",
"Mountain (n=223)" = "Mountain",
"West South Central (n=273)" = "SW\nCentral",
"West North Central (n=177)" = "NW\nCentral",
"East South Central (n=185)" = "SE\nCentral",
"East North Central (n=461)" = "NE\nCentral",
"South Atlantic (n=488)" = "South\nAtl.",
"Middle Atlantic (n=278)" = "Mid.\nAtl.",
"New England (n=159)" = "New\nEngland")) +
labs(title = "Average 2016 Inflation-Adjusted Family Income Across Regions", y = "Inflation-Adjusted Family Income", x = "Region", fill = "Region", colour = "Sig. Diff.")
)
And we now overlay this with the geom_sigmark
function to indicate
which regions have a significantly higher average family income than
other regions.
p + geom_sigmark(my_sig_data, x = "region")
We will use the conlegis
variable, which asks for Americans' level of
confidence in US Congress. A useful feature of
ggplot2
is the
facet_wrap()
layer that more easily breaks out answers across a factor variable. This
feature will be used here to examine how the level of confidence in US
Congress differs across educational levels using the degree
variable.
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree))
#Relabeling the degree and conlegis variables to be more readable.
gss_data_prop$degree <- fct_recode(gss_data_prop$degree,
`Less than high school` = "LT HIGH SCHOOL",
`High school` = "HIGH SCHOOL",
`Junior college` = "JUNIOR COLLEGE",
`Bachelor` = "BACHELOR",
`Graduate` = "GRADUATE")
gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis,
`A Great Deal` = "A GREAT DEAL",
`Only Some` = "ONLY SOME",
`Hardly Any` = "HARDLY ANY")
my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)
my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE)
my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.4, space_between = 0.15)
We now create the colour vector (used both for the bar fills and for the significance markers) and build the plot.
colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")
#Applying names to each colour
attributes(colour_vec)$names <- levels(my_conlegis_chart$degree)
(
p <- ggplot() +
geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop, fill = degree)) +
geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop,
label = scales::percent(round(prop, 2))),
hjust = -0.2) +
scale_fill_manual(values = colour_vec) +
scale_colour_manual(values = colour_vec) +
coord_flip() +
facet_wrap(~ degree) +
scale_y_continuous(limits = c(0, 1.2), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
labs(title = "Level of Confidence on US Congress by Level of Education in 2016", x = "Level of Confidence", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") +
theme(legend.key.height = unit(1.5, "lines"))
)
Let's use a full star instead. Again here, the space_label
and
space_between
may require trial and error to get right.
p + geom_sigmark(my_conlegis_sig, icon = "full star")
With multi-select dummy variables, the chart would look very similar to the previous section, but the data wrangling would be quite different, since the data is dispersed across multiple dummy variables instead of being all in one variable.
Here, we'll use the degree
variable denoting respondents' level of
education again, but this time we'll focus on whether respondents use
various forms of social media. The variables in question here are the
following: FACEBOOK
, TWITTER
, INSTAGRM
, LINKEDIN
, SNAPCHAT
,
TUMBLR
, WHATSAPP
, GOOGLESN
(Google +), PINTERST
, FLICKR
,
VINE
, and CLSSMTES
(classmates). Each of these is a dummy variable
in which respondents answered "Yes", "No", or were not asked the
question.
#Also making the names of the social media more readable
gss_data_prop <- gss_data %>%
filter(year == "2016", !is.na(degree)) %>%
select(wtssall, degree,
Facebook = FACEBOOK,
Twitter = TWITTER,
Instagram = INSTAGRM,
LinkedIn = LINKEDIN,
Snapchat = SNAPCHAT,
Tumblr = TUMBLR,
WhatsApp = WHATSAPP,
`Google+` = GOOGLESN,
Pinterest = PINTERST,
Flickr = FLICKR,
Vine = VINE,
Classmates = CLSSMTES)
#Fixing labels for degree as well.
gss_data_prop$degree <- fct_recode(gss_data_prop$degree,
`Less than high school` = "LT HIGH SCHOOL",
`High school` = "HIGH SCHOOL",
`Junior college` = "JUNIOR COLLEGE",
`Bachelor` = "BACHELOR",
`Graduate` = "GRADUATE")
#Switch NA to "No" so that the full bases are kept.
gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))
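As an aside, mutate_at() and funs() have since been superseded in dplyr; with dplyr 1.0.0 or later, the same step can be written with across() (a sketch of the modern equivalent):

```r
gss_data_prop <- mutate(gss_data_prop,
                        across(3:14, ~ fct_explicit_na(.x, "No")))
```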
This is a good time to provide an example of how a function like
freq_prop_test can be used across several variables at once. First, we
gather() the data frame so that all social media variables are listed
in a long format instead of a wide format. We then split this long data
frame by type of social media using nest() from the tidyr package,
apply freq_prop_test to each split with map() from the purrr package,
and recombine the results.
(
gss_data_prop_nest <- gss_data_prop %>%
gather(key = Social_Media, value = Use, -degree, -wtssall)
)
## # A tibble: 34,308 x 4
## wtssall degree Social_Media Use
## <dbl> <fctr> <chr> <chr>
## 1 0.9569935 Bachelor Facebook No
## 2 0.4784968 High school Facebook No
## 3 0.9569935 Bachelor Facebook No
## 4 1.9139870 High school Facebook No
## 5 1.4354903 Graduate Facebook No
## 6 0.9569935 Junior college Facebook No
## 7 1.4354903 High school Facebook No
## 8 0.9569935 High school Facebook Yes
## 9 0.9569935 High school Facebook No
## 10 0.9569935 Junior college Facebook Yes
## # ... with 34,298 more rows
gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor)
(
gss_data_prop_nest <- gss_data_prop_nest %>%
group_by(Social_Media) %>%
nest()
)
## # A tibble: 12 x 2
## Social_Media data
## <fctr> <list>
## 1 Facebook <tibble [2,859 x 3]>
## 2 Twitter <tibble [2,859 x 3]>
## 3 Instagram <tibble [2,859 x 3]>
## 4 LinkedIn <tibble [2,859 x 3]>
## 5 Snapchat <tibble [2,859 x 3]>
## 6 Tumblr <tibble [2,859 x 3]>
## 7 WhatsApp <tibble [2,859 x 3]>
## 8 Google+ <tibble [2,859 x 3]>
## 9 Pinterest <tibble [2,859 x 3]>
## 10 Flickr <tibble [2,859 x 3]>
## 11 Vine <tibble [2,859 x 3]>
## 12 Classmates <tibble [2,859 x 3]>
#Note: warnings about chi-squares being approximated are suppressed here to save space.
(
gss_data_result_nest <- gss_data_prop_nest %>%
mutate(test = map(data, freq_prop_test, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE),
chart = map(data, tbl_chart, "Use", "degree", weight = "wtssall", nlabels = TRUE, newline = TRUE))
)
## # A tibble: 12 x 4
## Social_Media data test
## <fctr> <list> <list>
## 1 Facebook <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 2 Twitter <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 3 Instagram <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 4 LinkedIn <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 5 Snapchat <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 6 Tumblr <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 7 WhatsApp <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 8 Google+ <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 9 Pinterest <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 10 Flickr <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 11 Vine <tibble [2,859 x 3]> <data.frame [20 x 9]>
## 12 Classmates <tibble [2,859 x 3]> <data.frame [20 x 9]>
## # ... with 1 more variables: chart <list>
(gss_data_result <- gss_data_result_nest %>%
unnest(test) %>%
filter(level == "Yes"))
## # A tibble: 120 x 10
## Social_Media group1 group2
## <fctr> <fctr> <fctr>
## 1 Facebook High school
## (n=1,461) Less than high school
## (n=328)
## 2 Facebook Junior college
## (n=216) Less than high school
## (n=328)
## 3 Facebook Bachelor
## (n=536) Less than high school
## (n=328)
## 4 Facebook Graduate
## (n=318) Less than high school
## (n=328)
## 5 Facebook Bachelor
## (n=536) High school
## (n=1,461)
## 6 Facebook Junior college
## (n=216) High school
## (n=1,461)
## 7 Facebook Graduate
## (n=318) High school
## (n=1,461)
## 8 Facebook Bachelor
## (n=536) Junior college
## (n=216)
## 9 Facebook Graduate
## (n=318) Junior college
## (n=216)
## 10 Facebook Graduate
## (n=318) Bachelor
## (n=536)
## # ... with 110 more rows, and 7 more variables: level <fctr>,
## # p.value <dbl>, significant <lgl>, Sample_Size_group1 <int>,
## # Sample_Size_group2 <int>, prop_group1 <dbl>, prop_group2 <dbl>
(my_chart_data <- gss_data_result_nest %>%
unnest(chart) %>%
filter(Use == "Yes"))
## # A tibble: 60 x 5
## Social_Media degree Use prop
## <fctr> <fctr> <fctr> <dbl>
## 1 Facebook Less than high school
## (n=328) Yes 0.21703540
## 2 Facebook High school
## (n=1,461) Yes 0.36974611
## 3 Facebook Junior college
## (n=216) Yes 0.35990926
## 4 Facebook Bachelor
## (n=536) Yes 0.40321565
## 5 Facebook Graduate
## (n=318) Yes 0.38672326
## 6 Twitter Less than high school
## (n=328) Yes 0.03841661
## 7 Twitter High school
## (n=1,461) Yes 0.08290407
## 8 Twitter Junior college
## (n=216) Yes 0.07512621
## 9 Twitter Bachelor
## (n=536) Yes 0.12217037
## 10 Twitter Graduate
## (n=318) Yes 0.15212663
## # ... with 50 more rows, and 1 more variables: Sample_Size <int>
We repeat the process for the tbl_sig
function.
(
gss_data_result_nest <- gss_data_result %>%
group_by(Social_Media) %>%
nest() %>%
mutate(sig = map(data, tbl_sig, "degree", space_label = 0.3, space_between = 0.1))
)
## # A tibble: 12 x 3
## Social_Media data sig
## <fctr> <list> <list>
## 1 Facebook <tibble [10 x 9]> <tibble [4 x 8]>
## 2 Twitter <tibble [10 x 9]> <tibble [4 x 8]>
## 3 Instagram <tibble [10 x 9]> <tibble [2 x 8]>
## 4 LinkedIn <tibble [10 x 9]> <tibble [8 x 8]>
## 5 Snapchat <tibble [10 x 9]> <tibble [2 x 8]>
## 6 Tumblr <tibble [10 x 9]> <tibble [0 x 8]>
## 7 WhatsApp <tibble [10 x 9]> <tibble [4 x 8]>
## 8 Google+ <tibble [10 x 9]> <tibble [2 x 8]>
## 9 Pinterest <tibble [10 x 9]> <tibble [6 x 8]>
## 10 Flickr <tibble [10 x 9]> <tibble [2 x 8]>
## 11 Vine <tibble [10 x 9]> <tibble [0 x 8]>
## 12 Classmates <tibble [10 x 9]> <tibble [1 x 8]>
(my_sig_data <- unnest(gss_data_result_nest, sig))
## # A tibble: 35 x 9
## Social_Media level degree
## <fctr> <fctr> <fctr>
## 1 Facebook Yes High school
## (n=1,461)
## 2 Facebook Yes Junior college
## (n=216)
## 3 Facebook Yes Bachelor
## (n=536)
## 4 Facebook Yes Graduate
## (n=318)
## 5 Twitter Yes High school
## (n=1,461)
## 6 Twitter Yes Bachelor
## (n=536)
## 7 Twitter Yes Graduate
## (n=318)
## 8 Twitter Yes Graduate
## (n=318)
## 9 Instagram Yes High school
## (n=1,461)
## 10 Instagram Yes Bachelor
## (n=536)
## # ... with 25 more rows, and 6 more variables: sign <fctr>,
## # group_prop <dbl>, sign_prop <dbl>, group_Sample_Size <int>,
## # sign_Sample_Size <int>, pos <dbl>
I would also like to set the order based on the proportions using
fct_reorder()
. I first need to create a summarized data frame that lists the total
proportion of usage for each social media platform.
social_media_total <- gss_data_prop_nest %>%
unnest(data) %>%
group_by(Social_Media) %>%
summarize(total_prop = sum(if_else(Use == "Yes", wtssall, 0)) / sum(wtssall))
my_chart_data <- left_join(my_chart_data, social_media_total)
## Joining, by = "Social_Media"
my_sig_data <- left_join(my_sig_data, social_media_total)
## Joining, by = "Social_Media"
my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$total_prop)
my_sig_data$Social_Media <- fct_reorder(my_sig_data$Social_Media, my_sig_data$total_prop)
We now create the colour vector used to denote significant differences, and then the plot.
colour_vec <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")
#Applying names to each colour
names(colour_vec) <- levels(my_chart_data$degree)
(
p <- ggplot() +
geom_col(data = my_chart_data, aes(x = Social_Media, y = prop, fill = degree)) +
geom_text(data = my_chart_data, aes(x = Social_Media, y = prop,
label = scales::percent(round(prop, 2))),
hjust = -0.2) +
scale_fill_manual(values = colour_vec) +
scale_colour_manual(values = colour_vec) +
coord_flip() +
facet_wrap(~ degree) +
scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
labs(title = "Level of Usage of Social Media by Level of Education in 2016", x = "Social Media", y = "Percentage", fill = "Level of Education", colour = "Sig. Diff.") +
theme(legend.key.height = unit(1.5, "lines"))
)
p + geom_sigmark(my_sig_data, x = "Social_Media", icon = "checkmark")
Scaled variables are a very common way of measuring attitudes in market research and polling surveys. We will now turn to a couple of examples of charts involving scaled variables.
This example will focus on comparing one scaled variable across
subgroups. Here we will compare the variable polviews
, which asks
respondents to rate how liberal or conservative they consider
themselves on a 7-point scale, by their level of family income (grouped
into 6 categories). Again, we'll only focus on 2016 data.
gss_data_polviews <- gss_data %>%
filter(year == "2016", !is.na(coninc))
For the sake of expediency, we'll let the cut() function set the breaks, even if they aren't round numbers.
gss_data_polviews$coninc_6 <- cut(gss_data_polviews$coninc, 6, labels =
c("Up to $27,600",
"$27,600 to $54,900",
"$54,900 to $82,100",
"$82,100 to $109,000",
"$109,000 to $137,000",
"$137,000 and higher"))
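If round break points were preferred instead, they could be supplied to cut() manually. A minimal sketch, where the break points and the coninc_6_round name are illustrative assumptions rather than GSS-defined income bands:

```r
#Sketch only: manual, rounded income breaks instead of letting cut() choose.
#The break points and the coninc_6_round name are illustrative assumptions.
gss_data_polviews$coninc_6_round <- cut(gss_data_polviews$coninc,
  breaks = c(0, 25000, 50000, 75000, 100000, 125000, Inf),
  labels = c("Up to $25,000", "$25,000 to $50,000", "$50,000 to $75,000",
             "$75,000 to $100,000", "$100,000 to $125,000", "$125,000 and higher"))
```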
We'll specifically look at the proportion of those who consider
themselves liberal. This means providing a top 3 box and a sum %. First,
we'll need to make sure those who didn't give an answer are counted,
using the
fct_explicit_na()
function.
gss_data_polviews$polviews <- gss_data_polviews$polviews %>%
fct_recode(
`Extremely Liberal` = "EXTREMELY LIBERAL",
Liberal = "LIBERAL",
`Slightly Liberal` = "SLIGHTLY LIBERAL",
Moderate = "MODERATE",
`Slightly Conservative` = "SLGHTLY CONSERVATIVE",
Conservative = "CONSERVATIVE",
`Extremely Conservative` = "EXTRMLY CONSERVATIVE") %>%
fct_explicit_na(na_level = "NA")
gss_data_polviews <- mutate(gss_data_polviews,
liberal = if_else(polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"), "Liberal", "Not Liberal") %>%
factor())
We're only interested in showing significant-difference markers for the
top 3 box, so we're just running tbl_chart
for that.
(polviews_chart <- tbl_chart(gss_data_polviews, "polviews", "coninc_6", weight = "wtssall", nlabels = TRUE))
## coninc_6 polviews prop
## 1 Up to $27,600 (n=1,070) Extremely Liberal 0.048854089
## 2 $27,600 to $54,900 (n=683) Extremely Liberal 0.043309374
## 3 $54,900 to $82,100 (n=393) Extremely Liberal 0.037676752
## 4 $82,100 to $109,000 (n=225) Extremely Liberal 0.052291285
## 5 $109,000 to $137,000 (n=61) Extremely Liberal 0.107185558
## 6 $137,000 and higher (n=164) Extremely Liberal 0.044100002
## 7 Up to $27,600 (n=1,070) Liberal 0.118936455
## 8 $27,600 to $54,900 (n=683) Liberal 0.113199526
## 9 $54,900 to $82,100 (n=393) Liberal 0.135425916
## 10 $82,100 to $109,000 (n=225) Liberal 0.149289929
## 11 $109,000 to $137,000 (n=61) Liberal 0.105530524
## 12 $137,000 and higher (n=164) Liberal 0.181187495
## 13 Up to $27,600 (n=1,070) Slightly Liberal 0.098353280
## 14 $27,600 to $54,900 (n=683) Slightly Liberal 0.105672707
## 15 $54,900 to $82,100 (n=393) Slightly Liberal 0.109279813
## 16 $82,100 to $109,000 (n=225) Slightly Liberal 0.149463618
## 17 $109,000 to $137,000 (n=61) Slightly Liberal 0.067005330
## 18 $137,000 and higher (n=164) Slightly Liberal 0.181670300
## 19 Up to $27,600 (n=1,070) Moderate 0.402649235
## 20 $27,600 to $54,900 (n=683) Moderate 0.369981123
## 21 $54,900 to $82,100 (n=393) Moderate 0.306858359
## 22 $82,100 to $109,000 (n=225) Moderate 0.346090543
## 23 $109,000 to $137,000 (n=61) Moderate 0.316001610
## 24 $137,000 and higher (n=164) Moderate 0.232208331
## 25 Up to $27,600 (n=1,070) Slightly Conservative 0.116218443
## 26 $27,600 to $54,900 (n=683) Slightly Conservative 0.155753787
## 27 $54,900 to $82,100 (n=393) Slightly Conservative 0.153236023
## 28 $82,100 to $109,000 (n=225) Slightly Conservative 0.136452281
## 29 $109,000 to $137,000 (n=61) Slightly Conservative 0.224416174
## 30 $137,000 and higher (n=164) Slightly Conservative 0.194483931
## 31 Up to $27,600 (n=1,070) Conservative 0.124838774
## 32 $27,600 to $54,900 (n=683) Conservative 0.140980268
## 33 $54,900 to $82,100 (n=393) Conservative 0.219460073
## 34 $82,100 to $109,000 (n=225) Conservative 0.133847878
## 35 $109,000 to $137,000 (n=61) Conservative 0.141335610
## 36 $137,000 and higher (n=164) Conservative 0.153777713
## 37 Up to $27,600 (n=1,070) Extremely Conservative 0.042753683
## 38 $27,600 to $54,900 (n=683) Extremely Conservative 0.052178367
## 39 $54,900 to $82,100 (n=393) Extremely Conservative 0.029254590
## 40 $82,100 to $109,000 (n=225) Extremely Conservative 0.018091370
## 41 $109,000 to $137,000 (n=61) Extremely Conservative 0.020090114
## 42 $137,000 and higher (n=164) Extremely Conservative 0.012572228
## 43 Up to $27,600 (n=1,070) NA 0.047396040
## 44 $27,600 to $54,900 (n=683) NA 0.018924848
## 45 $54,900 to $82,100 (n=393) NA 0.008808475
## 46 $82,100 to $109,000 (n=225) NA 0.014473096
## 47 $109,000 to $137,000 (n=61) NA 0.018435080
## 48 $137,000 and higher (n=164) NA 0.000000000
## Sample_Size
## 1 1070
## 2 683
## 3 393
## 4 225
## 5 61
## 6 164
## 7 1070
## 8 683
## 9 393
## 10 225
## 11 61
## 12 164
## 13 1070
## 14 683
## 15 393
## 16 225
## 17 61
## 18 164
## 19 1070
## 20 683
## 21 393
## 22 225
## 23 61
## 24 164
## 25 1070
## 26 683
## 27 393
## 28 225
## 29 61
## 30 164
## 31 1070
## 32 683
## 33 393
## 34 225
## 35 61
## 36 164
## 37 1070
## 38 683
## 39 393
## 40 225
## 41 61
## 42 164
## 43 1070
## 44 683
## 45 393
## 46 225
## 47 61
## 48 164
Since we're only interested in showing 3 of these levels, we'll remove the others.
polviews_chart <- filter(polviews_chart, polviews %in% c("Extremely Liberal", "Liberal", "Slightly Liberal"))
polviews_chart$polviews <- droplevels(polviews_chart$polviews)
We'll use the created liberal
variable for the markers.
polviews_results <- freq_prop_test(gss_data_polviews, "liberal", "coninc_6", weight = "wtssall", nlabels = TRUE)
(
polviews_sig <- polviews_results %>%
tbl_sig("coninc_6", space_label = 0.1, space_between = 0.03) %>%
filter(level == "Liberal")
)
## level coninc_6 sign
## 1 Liberal $137,000 and higher (n=164) Up to $27,600 (n=1,070)
## 2 Liberal $137,000 and higher (n=164) $27,600 to $54,900 (n=683)
## 3 Liberal $137,000 and higher (n=164) $54,900 to $82,100 (n=393)
## group_prop sign_prop group_Sample_Size sign_Sample_Size pos
## 1 0.4069578 0.2661438 164 1070 0.5069578
## 2 0.4069578 0.2621816 164 683 0.5369578
## 3 0.4069578 0.2823825 164 393 0.5669578
We also need the total proportions for the labels.
chart_labels <- polviews_chart %>%
group_by(coninc_6) %>%
summarize(prop = sum(prop))
Now turning to the chart. Here, we'll need two sets of colours: one for the scale points, and one for the subgroups.
colour_scale_vec <- c("#d95f0e", "#fec44f", "#fff7bc")
colour_subgroup_vec <- c("#253494", "#2c7fb8", "#41b6c4", "#7fcdbb", "#c7e9b4", "#ffffcc")
#Applying names to each colour
names(colour_scale_vec) <- levels(polviews_chart$polviews)
names(colour_subgroup_vec) <- levels(polviews_chart$coninc_6)
ggplot() +
geom_col(data = polviews_chart, aes(x = coninc_6, y = prop, fill = fct_rev(polviews))) +
geom_text(data = polviews_chart, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), position = position_stack(vjust = 0.5)) +
geom_text(data = chart_labels, aes(x = coninc_6, y = prop, label = scales::percent(round(prop, 2))), hjust = -0.2) +
scale_fill_manual(values = colour_scale_vec) +
scale_colour_manual(values = colour_subgroup_vec) +
coord_flip() +
scale_y_continuous(limits = c(0, .75), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
labs(title = "Proportion of Residents who Self-Label as Liberal", subtitle = "by Family Income Level", x = "Level of Family Income", y = "Percentage", fill = "Scale", colour = "Sig. Diff.") +
geom_sigmark(polviews_sig, x = "coninc_6", icon = "asterisk") +
guides(fill = guide_legend(reverse = TRUE))
This example will now turn to the need for multiple scaled variables to
appear across subgroups. We'll focus on 6 variables here, LOTR1
through LOTR6
, which represent agreement scales for 6
attitudinal statements. We'll observe whether there are significant
differences in agreement with these statements across gender
(sex
). The code required to produce the desired chart requires a mix
of the code from the two previous examples.
Note that not everyone is asked these 6 questions, and there is no
indication in the data file as to why some respondents
were asked and others were not. The presence of a dummy variable indicating
whether they were asked would be useful here to ensure the sample
sizes match throughout. For the sake of simplicity, the code below
assigns those who are NA
to Neutral
across the 6 questions.
gss_data_LOTR <- gss_data %>%
filter(year == "2016", !is.na(LOTR1)) %>%
select(wtssall, sex, starts_with("LOTR"))
gss_data_LOTR$sex <- fct_recode(gss_data_LOTR$sex, Male = "MALE", Female = "FEMALE")
gss_data_LOTR <- mutate_if(gss_data_LOTR, is.factor, funs(fct_explicit_na(., na_level = "Neutral")))
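Such an asked/not-asked flag can be approximated from the data itself. A minimal sketch, assuming that a missing LOTR1 response means the respondent was not asked the battery (an assumption the data file does not confirm):

```r
#Sketch only: derive an "asked the LOTR battery" flag from missingness in LOTR1.
#Assumes NA on LOTR1 means "not asked", which the GSS file does not state explicitly.
gss_data_LOTR_flag <- gss_data %>%
  filter(year == "2016") %>%
  mutate(LOTR_asked = factor(if_else(is.na(LOTR1), "Not asked", "Asked")))
```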
As per the previous example, we'll need to calculate the top 2 box for
those who agree or strongly agree with each statement. Since I need to
apply the same wrangling across 6 variables, I'm using the
mutate_at()
and
mutate_if()
functions.
gss_data_LOTR <- gss_data_LOTR %>%
mutate_at(vars(starts_with("LOTR")),
funs("Top2" = if_else(. %in% c("Strongly agree", "Agree"), "Agree", "Disagree/Neutral"))) %>%
mutate_if(is.character, factor)
And as per two examples ago, we'll need to use
nest()
and
map()
to apply the
freq_prop_test
, the tbl_chart
, and the tbl_sig
functions. Unlike
last time, the levels differ between the original LOTR
variables and the
created LOTRx_Top2
variables, so the easiest thing to do at this stage
is to split the original variables from the created variables.
(
gss_data_LOTR_nest <- gss_data_LOTR %>%
gather(key = LOTR, value = Scale, -sex, -wtssall)
)
## # A tibble: 17,364 x 4
## wtssall sex LOTR Scale
## <dbl> <fctr> <chr> <chr>
## 1 0.4784968 Male LOTR1 Neutral
## 2 0.9569935 Male LOTR1 Neutral
## 3 1.4354903 Female LOTR1 Agree
## 4 0.9569935 Female LOTR1 Agree
## 5 0.9569935 Female LOTR1 Neutral
## 6 0.9569935 Male LOTR1 Agree
## 7 1.9139870 Male LOTR1 Strongly agree
## 8 0.4784968 Female LOTR1 Strongly agree
## 9 0.4784968 Male LOTR1 Agree
## 10 0.9569935 Female LOTR1 Agree
## # ... with 17,354 more rows
gss_data_LOTR_nest_Top2 <- filter(gss_data_LOTR_nest, str_detect(LOTR, "Top2"))
gss_data_LOTR_nest_Full <- filter(gss_data_LOTR_nest, !str_detect(LOTR, "Top2"))
gss_data_LOTR_nest_Top2$LOTR <- str_replace(gss_data_LOTR_nest_Top2$LOTR, "_Top2", "")
gss_data_LOTR_nest_Top2 <- mutate_if(gss_data_LOTR_nest_Top2, is.character, factor)
gss_data_LOTR_nest_Full <- mutate_if(gss_data_LOTR_nest_Full, is.character, factor)
(
gss_data_LOTR_nest_Full <- gss_data_LOTR_nest_Full %>%
group_by(LOTR) %>%
nest()
)
## # A tibble: 6 x 2
## LOTR data
## <fctr> <list>
## 1 LOTR1 <tibble [1,447 x 3]>
## 2 LOTR2 <tibble [1,447 x 3]>
## 3 LOTR3 <tibble [1,447 x 3]>
## 4 LOTR4 <tibble [1,447 x 3]>
## 5 LOTR5 <tibble [1,447 x 3]>
## 6 LOTR6 <tibble [1,447 x 3]>
(
gss_data_LOTR_nest_Top2 <- gss_data_LOTR_nest_Top2 %>%
group_by(LOTR) %>%
nest()
)
## # A tibble: 6 x 2
## LOTR data
## <fctr> <list>
## 1 LOTR1 <tibble [1,447 x 3]>
## 2 LOTR2 <tibble [1,447 x 3]>
## 3 LOTR3 <tibble [1,447 x 3]>
## 4 LOTR4 <tibble [1,447 x 3]>
## 5 LOTR5 <tibble [1,447 x 3]>
## 6 LOTR6 <tibble [1,447 x 3]>
(
gss_data_LOTR_chart_Full <- gss_data_LOTR_nest_Full %>%
mutate(chart = map(data, tbl_chart, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>%
unnest(chart) %>%
filter(Scale %in% c("Strongly agree", "Agree"))
)
## # A tibble: 24 x 5
## LOTR sex Scale prop Sample_Size
## <fctr> <fctr> <fctr> <dbl> <int>
## 1 LOTR1 Male (n=605) Agree 0.46834182 605
## 2 LOTR1 Female (n=842) Agree 0.47338741 842
## 3 LOTR1 Male (n=605) Strongly agree 0.14088480 605
## 4 LOTR1 Female (n=842) Strongly agree 0.16687914 842
## 5 LOTR2 Male (n=605) Agree 0.24654460 605
## 6 LOTR2 Female (n=842) Agree 0.23532435 842
## 7 LOTR2 Male (n=605) Strongly agree 0.05552968 605
## 8 LOTR2 Female (n=842) Strongly agree 0.05483089 842
## 9 LOTR3 Male (n=605) Agree 0.52541323 605
## 10 LOTR3 Female (n=842) Agree 0.49361067 842
## # ... with 14 more rows
gss_data_LOTR_chart_Top2 <- gss_data_LOTR_chart_Full %>%
group_by(LOTR, sex) %>%
summarize(total_prop = sum(prop))
(
gss_data_LOTR_result_Top2 <- gss_data_LOTR_nest_Top2 %>%
mutate(test = map(data, freq_prop_test, "Scale", "sex", weight = "wtssall", nlabels = TRUE)) %>%
unnest(test) %>%
filter(level == "Agree")
)
## # A tibble: 6 x 10
## LOTR group1 group2 level p.value significant
## <fctr> <fctr> <fctr> <fctr> <dbl> <lgl>
## 1 LOTR1 Female (n=842) Male (n=605) Agree 0.253150237 FALSE
## 2 LOTR2 Female (n=842) Male (n=605) Agree 0.667930118 FALSE
## 3 LOTR3 Female (n=842) Male (n=605) Agree 0.765824419 FALSE
## 4 LOTR4 Female (n=842) Male (n=605) Agree 0.446722189 FALSE
## 5 LOTR5 Female (n=842) Male (n=605) Agree 0.005903342 TRUE
## 6 LOTR6 Female (n=842) Male (n=605) Agree 0.039162074 TRUE
## # ... with 4 more variables: Sample_Size_group1 <int>,
## # Sample_Size_group2 <int>, prop_group1 <dbl>, prop_group2 <dbl>
Now onto the tbl_sig
function, which I only need to run for the
significant differences on the Top 2 box, because I want to extract the
proportions for the labels.
(
my_sig_data <- gss_data_LOTR_result_Top2 %>%
group_by(LOTR) %>%
nest() %>%
mutate(sig = map(data, tbl_sig, "sex", space_label = 0.25, space_between = 0.1)) %>%
unnest(sig)
)
## # A tibble: 2 x 9
## LOTR level sex sign group_prop sign_prop
## <fctr> <fctr> <fctr> <fctr> <dbl> <dbl>
## 1 LOTR5 Agree Male (n=605) Female (n=842) 0.2272998 0.1677041
## 2 LOTR6 Agree Female (n=842) Male (n=605) 0.7961268 0.7486442
## # ... with 3 more variables: group_Sample_Size <int>,
## # sign_Sample_Size <int>, pos <dbl>
I'll now merge the total proportions from the label data back into the
chart data in order to re-order the chart from high to low. The "_Top2"
suffix was already removed from each level of the LOTR
factor above, so the two data frames join cleanly.
my_chart_data <- left_join(gss_data_LOTR_chart_Full, gss_data_LOTR_chart_Top2, by = c("LOTR", "sex"))
my_chart_data$Scale <- droplevels(my_chart_data$Scale)
Now, I'll apply the question labels, and reorder them so that they appear from high to low in the chart. Note the usage of "\n" within the labels to provide line breaks, which keeps the chart from getting squished by the length of the labels.
#Labels taken from here: http://gss.norc.org/documents/codebook/GSS_Codebook_mainbody.pdf and from gss_data_questions
my_chart_data$LOTR <- my_chart_data$LOTR %>%
fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1",
`If something can go\nwrong for me, it will` = "LOTR2",
`I'm always optimistic\nabout the future` = "LOTR3",
`I hardly ever expect\nthings to go my way` = "LOTR4",
`I rarely count on good\nthings happening to me` = "LOTR5",
`Overall I expect more good things\nto happen to me than bad` = "LOTR6") %>%
fct_reorder(my_chart_data$total_prop)
my_sig_data$LOTR <- my_sig_data$LOTR %>%
fct_recode(`In uncertain times,\nI usually expect the best` = "LOTR1",
`If something can go\nwrong for me, it will` = "LOTR2",
`I'm always optimistic\nabout the future` = "LOTR3",
`I hardly ever expect\nthings to go my way` = "LOTR4",
`I rarely count on good\nthings happening to me` = "LOTR5",
`Overall I expect more good things\nto happen to me than bad` = "LOTR6")
Let's build up the chart. One wrinkle I'm adding here is that the individual scale point labels only display if they're over 5%, to avoid the "Strongly agree" and "Agree" labels encroaching on each other.
colour_scale_vec <- c("#fff7bc", "#d95f0e")
colour_subgroup_vec <- c("#f48041", "#4286f4")
#Applying names to each colour
names(colour_scale_vec) <- levels(my_chart_data$Scale)
names(colour_subgroup_vec) <- levels(my_chart_data$sex)
(
p <- ggplot() +
geom_col(data = my_chart_data, aes(x = LOTR, y = prop, fill = Scale)) +
geom_text(data = my_chart_data, aes(x = LOTR, y = prop, group = Scale, label = if_else(prop > 0.05, scales::percent(round(prop, 2)), "")), position = position_stack(vjust = 0.5)) +
geom_text(data = my_chart_data, aes(x = LOTR, y = total_prop, label = scales::percent(round(total_prop, 2))), hjust = -0.2) +
scale_fill_manual(values = colour_scale_vec) +
scale_colour_manual(values = colour_subgroup_vec) +
coord_flip() +
facet_wrap(~ sex) +
scale_y_continuous(limits = c(0, 1.1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
labs(title = "Levels of Optimism Across Gender", x = "Statement", y = "Percentage", fill = "Scale", colour = "Sig. Diff.")
)
p + geom_sigmark(my_sig_data, x = "LOTR", icon = "asterisk", group = "sex") +
guides(fill = guide_legend(reverse = TRUE))
We will now turn to the geom_sigmark_waves
function, used for
tracking charts across multiple waves. A different function is required
because we need to present whether a significant shift is an increase or
a decrease over time. For this example, we'll look at the change in the
level of confidence Americans have been placing in the financial
industry (confinan
).
gss_data1 <- filter(gss_data, year_num >= 2000, !is.na(confinan))
gss_data1$year <- droplevels(gss_data1$year)
gss_data1$confinan <- fct_recode(gss_data1$confinan,
"A great deal" = "A GREAT DEAL",
"Hardly any" = "HARDLY ANY",
"Only some" = "ONLY SOME")
my_track_results <- freq_prop_test(gss_data1, "confinan", "year", weight = "wtssall")
my_chart_track <- tbl_chart(gss_data1, "confinan", "year", weight = "wtssall")
my_sig_track <- tbl_sig(my_track_results, "year", compare = "w2w")
ggplot() +
geom_line(data = my_chart_track, aes(x = year, y = prop, colour = confinan, group = confinan)) +
geom_point(data = my_chart_track, aes(x = year, y = prop, colour = confinan), show.legend = FALSE) +
geom_text_repel(data = my_chart_track, aes(x = year, y = prop, colour = confinan, label = percent(round(prop, 2))), show.legend = FALSE) +
scale_y_continuous(labels = percent) +
geom_sigmark_waves(my_sig_track) +
labs(title = "Americans' Level of Confidence in the Financial Industry", subtitle = "2000 to 2016", y = "Percentage", x = "Year", colour = "Confidence level", shape = "Sig. Diff.")
For this section, we'll expand to looking at how Americans' level of confidence ("some" or "a great deal") in several institutions changes over time.
gss_data1 <- select(gss_data, year, year_num, wtssall, starts_with("con"), -coninc)
gss_data1 <- filter(gss_data1, rowSums(is.na(gss_data1)) < 13, year_num >= 2000) %>% select(-year_num)
gss_data1 <- rename(gss_data1,
Financial = confinan,
Business = conbus,
Religion = conclerg,
Education = coneduc,
POTUS = confed,
Unions = conlabor,
Press = conpress,
Medicine = conmedic,
Television = contv,
SCOTUS = conjudge,
Science = consci,
Congress = conlegis,
Military = conarmy)
gss_data1$year <- droplevels(gss_data1$year)
gss_data1 <- gss_data1 %>%
mutate_if(is.factor, funs("Top2" = if_else(. == "A GREAT DEAL" | . == "ONLY SOME", "Yes", "No"))) %>%
select(year, wtssall, ends_with("Top2"), -year_Top2) %>%
mutate_if(is.character, factor)
names(gss_data1) <- str_replace(names(gss_data1), "_Top2", "")
gss_data1_nest <- gss_data1 %>%
gather(key = Institution, value = Confident, -year, -wtssall) %>%
mutate_if(is.character, factor)
gss_data1_result <- gss_data1_nest %>%
group_by(Institution) %>%
nest() %>%
mutate(test = map(data, freq_prop_test, "Confident", "year", weight = "wtssall"),
chart = map(data, tbl_chart, "Confident", "year", weight = "wtssall"))
gss_data1_result_nest <- gss_data1_result %>%
unnest(test) %>%
filter(level == "Yes") %>%
group_by(Institution) %>%
nest() %>%
mutate(sig = map(data, tbl_sig, "year", compare = "curr"))
gss_data1_chart <- gss_data1_result %>%
unnest(chart) %>%
filter(Confident == "Yes")
gss_data1_sig <- unnest(gss_data1_result_nest, sig)
gss_data1_chart <- mutate(gss_data1_chart, Sector = case_when(
Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1",
Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2",
TRUE ~ "Sector 3"),
Sector = factor(Sector))
gss_data1_sig <- mutate(gss_data1_sig, Sector = case_when(
Institution %in% c("Business", "Financial", "Press", "Unions") ~ "Sector 1",
Institution %in% c("Education", "Science", "Medicine", "Religion", "Television") ~ "Sector 2",
TRUE ~ "Sector 3"),
Sector = factor(Sector))
colour_vec <- c("#0db36f", "#c8dd56", "#030303", "#72477f", "#382b57", "#ce215b", "#e1a83e", "#39b287", "#71c3ca", "#3a8bea", "#034385", "#9f2b68", "#9d0e27")
names(colour_vec) <- levels(gss_data1_chart$Institution)
ggplot() +
geom_line(data = gss_data1_chart, aes(x = year, y = prop, group = Institution, colour = Institution)) +
geom_point(data = gss_data1_chart, aes(x = year, y = prop, colour = Institution)) +
geom_text_repel(data = gss_data1_chart, aes(x = year, y = prop, label = percent(round(prop, 2)), colour = Institution), show.legend = FALSE) +
labs(title = "Proportion of Americans with at least some level of confidence in Institutions", x = "Year", y = "Percentage", shape = "Sig. Diff.") +
scale_y_continuous(labels = percent) +
facet_wrap(~ Sector, nrow = 3) +
scale_colour_manual(values = colour_vec) +
geom_sigmark_waves(gss_data1_sig, colour = "Institution")
I would actually break this into 3 separate charts instead of using
facet_wrap()
to make the chart easier to read, but that's beside the
point.
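A minimal sketch of that alternative, assuming the gss_data1_chart, gss_data1_sig, and colour_vec objects created above, would loop over the sectors and build one plot each:

```r
#Sketch only: one chart per sector instead of facet_wrap().
#Assumes gss_data1_chart, gss_data1_sig, and colour_vec from above.
sector_plots <- map(levels(gss_data1_chart$Sector), function(s) {
  chart_s <- filter(gss_data1_chart, Sector == s)
  sig_s <- filter(gss_data1_sig, Sector == s)
  ggplot() +
    geom_line(data = chart_s, aes(x = year, y = prop, group = Institution, colour = Institution)) +
    geom_point(data = chart_s, aes(x = year, y = prop, colour = Institution)) +
    geom_text_repel(data = chart_s, aes(x = year, y = prop, label = percent(round(prop, 2)), colour = Institution), show.legend = FALSE) +
    scale_y_continuous(labels = percent) +
    scale_colour_manual(values = colour_vec) +
    geom_sigmark_waves(sig_s, colour = "Institution") +
    labs(title = paste("Confidence in Institutions:", s), x = "Year", y = "Percentage", shape = "Sig. Diff.")
})
```

Each element of sector_plots can then be printed or saved individually.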
Since version 0.0.2, we can compare a subgroup to the rest of the sample. Here we will chart those with a graduate degree and compare them to the rest.
gss_data_prop <- filter(gss_data, year == "2016", !is.na(conlegis), !is.na(degree))
gss_data_prop$conlegis <- fct_recode(gss_data_prop$conlegis,
`A Great Deal` = "A GREAT DEAL",
`Only Some` = "ONLY SOME",
`Hardly Any` = "HARDLY ANY")
my_results_prop <- freq_prop_test(gss_data_prop, "conlegis", "degree", level = "GRADUATE", weight = "wtssall")
my_conlegis_chart <- tbl_chart(gss_data_prop, "conlegis", "degree", weight = "wtssall")
my_conlegis_sig <- tbl_sig(my_results_prop, "degree", space_label = 0.1, compare = "total")
We only want to chart the Graduates, so we'll filter the data down to them.
my_conlegis_chart <- filter(my_conlegis_chart, degree == "GRADUATE")
my_conlegis_sig <- filter(my_conlegis_sig, degree == "GRADUATE")
(
p <- ggplot() +
geom_col(data = my_conlegis_chart, aes(x = conlegis, y = prop)) +
geom_text(data = my_conlegis_chart, aes(x = conlegis, y = prop,
label = scales::percent(round(prop, 2))),
hjust = -0.2) +
coord_flip() +
scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
labs(title = "Graduates' Level of Confidence on US Congress in 2016", x = "Level of Confidence", y = "Percentage", shape = "Significant Differences")
)
Here, we will use geom_sigmark_total
:
p + geom_sigmark_total(my_conlegis_sig, labels = c("Higher than those with \nother levels of education", "Lower than those with \nother levels of education")) + theme(legend.key.height = unit(1.5, "lines"))
We'll now expand on the previous section, and look into setting up a chart with multiple dummy variables. Here we'll focus on how those in the 18-24 age category differ from the rest of the sample in their usage of social media:
gss_data_prop <- gss_data %>%
filter(year == "2016", !is.na(degree)) %>%
select(wtssall, age_7, Facebook = FACEBOOK, Twitter = TWITTER, Instagram = INSTAGRM, LinkedIn = LINKEDIN, Snapchat = SNAPCHAT, Tumblr = TUMBLR, Whatsapp = WHATSAPP, `Google+` = GOOGLESN, Pinterest = PINTERST, Flickr = FLICKR, Vine = VINE, Classmates = CLSSMTES)
#Switch NA to "No" so that the full bases are kept.
gss_data_prop <- mutate_at(gss_data_prop, vars(3:14), funs(fct_explicit_na(., "No")))
(
gss_data_prop_nest <- gss_data_prop %>%
gather(key = Social_Media, value = Use, -age_7, -wtssall)
)
## # A tibble: 34,308 x 4
## wtssall age_7 Social_Media Use
## <dbl> <fctr> <chr> <chr>
## 1 0.9569935 45-54 Facebook No
## 2 0.4784968 55-64 Facebook No
## 3 0.9569935 65-74 Facebook No
## 4 1.9139870 35-44 Facebook No
## 5 1.4354903 55-64 Facebook No
## 6 0.9569935 45-54 Facebook No
## 7 1.4354903 45-54 Facebook No
## 8 0.9569935 18-24 Facebook Yes
## 9 0.9569935 45-54 Facebook No
## 10 0.9569935 65-74 Facebook Yes
## # ... with 34,298 more rows
gss_data_prop_nest <- mutate_if(gss_data_prop_nest, is.character, factor)
(
gss_data_prop_nest <- gss_data_prop_nest %>%
group_by(Social_Media) %>%
nest()
)
## # A tibble: 12 x 2
## Social_Media data
## <fctr> <list>
## 1 Facebook <tibble [2,859 x 3]>
## 2 Twitter <tibble [2,859 x 3]>
## 3 Instagram <tibble [2,859 x 3]>
## 4 LinkedIn <tibble [2,859 x 3]>
## 5 Snapchat <tibble [2,859 x 3]>
## 6 Tumblr <tibble [2,859 x 3]>
## 7 Whatsapp <tibble [2,859 x 3]>
## 8 Google+ <tibble [2,859 x 3]>
## 9 Pinterest <tibble [2,859 x 3]>
## 10 Flickr <tibble [2,859 x 3]>
## 11 Vine <tibble [2,859 x 3]>
## 12 Classmates <tibble [2,859 x 3]>
(
gss_data_result <- gss_data_prop_nest %>%
mutate(test = map(data, freq_prop_test, "Use", "age_7", level = "18-24", weight = "wtssall"),
chart = map(data, tbl_chart, "Use", "age_7", weight = "wtssall"))
)
## # A tibble: 12 x 4
## Social_Media data test
## <fctr> <list> <list>
## 1 Facebook <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 2 Twitter <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 3 Instagram <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 4 LinkedIn <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 5 Snapchat <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 6 Tumblr <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 7 Whatsapp <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 8 Google+ <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 9 Pinterest <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 10 Flickr <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 11 Vine <tibble [2,859 x 3]> <data.frame [2 x 9]>
## 12 Classmates <tibble [2,859 x 3]> <data.frame [2 x 9]>
## # ... with 1 more variables: chart <list>
(
my_sig_data <- gss_data_result %>%
unnest(test) %>%
filter(level == "Yes") %>%
group_by(Social_Media) %>%
nest() %>%
mutate(sig = map(data, tbl_sig, "age_7", space_label = 0.1, compare = "total")) %>%
unnest(sig)
)
## # A tibble: 7 x 10
## Social_Media level age_7 sign group_prop sign_prop
## <fctr> <fctr> <fctr> <fctr> <dbl> <dbl>
## 1 Facebook Yes 18-24 Rest of Sample 0.491384611 0.34355988
## 2 Twitter Yes 18-24 Rest of Sample 0.211865563 0.07786832
## 3 Instagram Yes 18-24 Rest of Sample 0.390509132 0.13203718
## 4 Snapchat Yes 18-24 Rest of Sample 0.415225073 0.08418934
## 5 Tumblr Yes 18-24 Rest of Sample 0.080860827 0.01846067
## 6 Vine Yes 18-24 Rest of Sample 0.058186669 0.01554382
## 7 Classmates Yes 18-24 Rest of Sample 0.003217104 0.02946433
## # ... with 4 more variables: group_Sample_Size <int>,
## # sign_Sample_Size <int>, pos <dbl>, Direction <fctr>
(my_chart_data <- gss_data_result %>%
unnest(chart) %>%
filter(age_7 == "18-24", Use == "Yes"))
## # A tibble: 12 x 5
## Social_Media age_7 Use prop Sample_Size
## <fctr> <fctr> <fctr> <dbl> <int>
## 1 Facebook 18-24 Yes 0.491384611 228
## 2 Twitter 18-24 Yes 0.211865563 228
## 3 Instagram 18-24 Yes 0.390509132 228
## 4 LinkedIn 18-24 Yes 0.116682200 228
## 5 Snapchat 18-24 Yes 0.415225073 228
## 6 Tumblr 18-24 Yes 0.080860827 228
## 7 Whatsapp 18-24 Yes 0.068425641 228
## 8 Google+ 18-24 Yes 0.196367705 228
## 9 Pinterest 18-24 Yes 0.167692568 228
## 10 Flickr 18-24 Yes 0.020169083 228
## 11 Vine 18-24 Yes 0.058186669 228
## 12 Classmates 18-24 Yes 0.003217104 228
We just need to re-order the `Social_Media` variable based on the `prop` variable, and we're ready to chart.
my_chart_data$Social_Media <- fct_reorder(my_chart_data$Social_Media, my_chart_data$prop)
(
ggplot() +
geom_col(data = my_chart_data, aes(x = Social_Media, y = prop)) +
geom_text(data = my_chart_data, aes(x = Social_Media, y = prop,
label = scales::percent(round(prop, 2))),
hjust = -0.2) +
coord_flip() +
scale_y_continuous(limits = c(0, 1), labels = scales::percent, breaks = c(0, 0.25, 0.5, 0.75)) +
labs(title = "Level of Usage of Social Media in 2016 Among 18-24 Year Olds", x = "Social Media", y = "Percentage", shape = "Significant Differences") +
geom_sigmark_total(my_sig_data, x = "Social_Media", labels = c("Higher than \nthose aged 25+", "Lower than \nthose aged 25+")) +
theme(legend.key.height = unit(1.5, "lines"))
)
- Renamed `freq_test_prop` to `freq_prop_test` for consistency with the `*_test` functions going forward.
- Modified `tbl_chart` so that it can be run directly off the raw data frame instead of the `*_test` functions.
- Added a `subgroup` input in `tbl_sig` so that the function's output matches the `tbl_chart` output.
- Moved the `min_sample` argument from `tbl_sig` to the `*_test` functions to increase computational efficiency by avoiding running tests on subgroups that the user would ignore anyway because they fall below the specified sample size threshold.
- Fixed a bug in `freq_t_test` that wasn't properly computing the weighted mean.
- Renamed the `keep` argument to `compare` for clarity in the `tbl_sig` function.
- Added a `"total"` option to the `compare` argument for when one wishes to compare a particular subgroup to the rest of the sample.
- Added sample sizes to the output of `tbl_sig` to make it easier to include that information in the charts.
- Added `nlabels` and `newline` logical arguments to the `freq_test_prop` and `freq_t_test` functions to make this change seamless.

The following is a to-do list for future additions to the package, in order of importance from high to low:

- The `freq_prop_test` function does not work well when one of the sample groups has a sample size of 0.
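One plausible source of the zero-sample-size issue (an assumption about the internals, not confirmed by the package) is base R's `prop.test()`, which refuses groups with no observations. A small illustration with made-up counts:

```r
# Hypothetical counts: the second group has a sample size of 0
successes <- c(10, 0)
sizes <- c(50, 0)

# prop.test() stops with an error rather than returning a result,
# so a wrapper would need to catch or filter out empty groups
res <- tryCatch(prop.test(successes, sizes), error = function(e) e)
inherits(res, "error")
```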