knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(actdata)
library(dplyr)
library(tidyr)
library(ggplot2)
The Tidyverse makes it easy to create new combinations or subsets of ACT dictionaries in a way that is completely replicable, and to visualize quantities of interest over time and/or across countries. An example of how these data sets can be used along with dplyr, tidyr, and ggplot2 is presented here.
Say we are interested in comparing changes in evaluation ratings for terms over time in every country where that is possible. The table on the dictionaries help page shows that the three countries in which multiple dictionaries have been collected are the U.S., Canada, and Germany. Let's choose three U.S. dictionaries (nc1978, texas1998, and usfullsurveyor2015), two Canadian dictionaries (ontario1980 and ontario2001), and two German dictionaries (germany1989 and germany2007). Now we need to find the terms that are in all seven dictionaries. The term table is a quick way to get a first look at this.
datasets <- c("nc1978", "texas1998", "usfullsurveyor2015",
              "ontario1980", "ontario2001",
              "germany1989", "germany2007")

term_table_short <- term_table %>%
  # we only need the columns for our chosen dictionaries
  select(term, component, all_of(datasets)) %>%
  # filter to those terms present in all chosen dictionaries
  # (all_of() avoids the tidyselect warning about bare character vectors)
  filter(if_all(all_of(datasets), ~ . == 1))

head(term_table_short)
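As a small illustration of the filtering step above, here is a minimal sketch on a made-up presence table. The column names `dict_a`, `dict_b`, and `dict_c` and all values are hypothetical stand-ins, not real ACT dictionary keys; the point is only to show how `if_all()` keeps the rows in which every selected column equals 1.

```r
library(dplyr)

# Made-up presence table: 1 = term is in that dictionary, 0 = it is not.
# dict_a, dict_b, dict_c are hypothetical names used only for this sketch.
toy_table <- tibble(
  term   = c("doctor", "nurse", "pilot", "judge"),
  dict_a = c(1, 1, 0, 1),
  dict_b = c(1, 0, 1, 1),
  dict_c = c(1, 1, 1, 1)
)
toy_keys <- c("dict_a", "dict_b", "dict_c")

# keep only the rows where every selected column equals 1
shared_terms <- toy_table %>%
  filter(if_all(all_of(toy_keys), ~ .x == 1))

shared_terms$term
```

Only "doctor" and "judge" survive, because they are the only terms marked 1 in all three toy columns.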
We now have a list of the `r nrow(term_table_short)` identities included in all seven dictionaries. Let's get the mean EPA values for these terms.
# trim the term table further to get rid of the dataset key columns
term_table_forfilter <- term_table_short %>%
  select(term, component)

# Now merge this list with the EPA summary dataset to get EPA values for the
# countries and years we are interested in.
# First, subset the EPA summary data frame to the dictionaries we want, keeping
# only means and only gender-averaged values.
term_subset <- epa_subset(dataset = datasets, stat = "mean", group = "all") %>%
  # then right join with the term list we built before to keep only the
  # terms we are interested in
  right_join(term_table_forfilter, by = c("term", "component")) %>%
  arrange(term)

term_subset[1:20, ]
Now we have a dataframe that contains just the gender-averaged EPA values for the terms that the seven dictionaries share.
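One thing worth checking before reshaping a data frame like this is whether any term appears more than once within a single dataset, since duplicate rows cannot be uniquely identified when pivoting wider. The sketch below uses a small made-up data frame (the dataset names `set_a` and `set_b` and all ratings are invented); with the real data one would presumably count by `component` as well.

```r
library(dplyr)

# Made-up ratings; "clown" appears twice within the hypothetical dataset "set_a"
toy_ratings <- tibble(
  term    = c("clown", "clown", "doctor", "clown", "doctor"),
  dataset = c("set_a", "set_a", "set_a", "set_b", "set_b"),
  E       = c(1.2, 0.9, 2.5, 1.0, 2.7)
)

# count rows per term/dataset pair and keep the pairs that occur more than once
duplicated_terms <- toy_ratings %>%
  count(term, dataset, name = "n_rows") %>%
  filter(n_rows > 1)

duplicated_terms
```

Any rows returned here flag term/dataset pairs that need a decision (drop, average, or keep one) before a wide reshape.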
How are meanings changing overall? We can calculate meaning change summary statistics to get a better sense of this:
eval_change <- term_subset %>%
  # get rid of the texas dataset here; just use the first and last US sets
  # for a change calculation. Drop institution codes too, we won't use them here.
  filter(dataset != "texas1998") %>%
  # first pivot longer so that there is one sentiment value per row
  pivot_longer(cols = c(E, P, A), names_to = "dimension", values_to = "rating") %>%
  select(term, component, context, year, dimension, rating, -instcodes) %>%
  mutate(order = ifelse(year %in% c("2015", "2007", "2001"), "late", "early")) %>%
  select(-year) %>%
  # then pivot wider by time order
  pivot_wider(names_from = order, values_from = rating) %>%
  # We get a warning here because there are a few duplicate terms. Clown,
  # secretary, and waiter are all included more than once in the nc1978 data
  # set. This is something to watch for; there are a handful of duplicates in
  # other data sets as well. These are in the original raw data files and they
  # have been left as is in this package. Here, I remove these terms. I also
  # remove "no_emotion" here.
  filter(!(term %in% c("waiter", "clown", "secretary", "no_emotion"))) %>%
  # now calculate change
  mutate(change = as.numeric(late) - as.numeric(early))

summary(eval_change$change)
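The `summary()` call above pools every country and dimension into one set of statistics. A per-country breakdown can be sketched with `group_by()` and `summarise()`. The example below uses a small made-up data frame with the same `context` and `change` column names as `eval_change`; the country labels and all values are invented for illustration and are not results from the ACT data.

```r
library(dplyr)

# Made-up change scores; labels and values are invented for illustration only
toy_change <- tibble(
  context = c("US", "US", "Canada", "Canada", "Germany", "Germany"),
  change  = c(0.4, -0.2, 0.1, -0.3, 1.0, 0.0)
)

# one row of summary statistics per country
change_by_country <- toy_change %>%
  group_by(context) %>%
  summarise(
    mean_change    = mean(change, na.rm = TRUE),
    median_change  = median(change, na.rm = TRUE),
    max_abs_change = max(abs(change), na.rm = TRUE),
    .groups = "drop"
  )

change_by_country
```

Swapping `toy_change` for `eval_change` should give the same kind of table for the real data, assuming its `context` column identifies the country.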
Mean and median change for all three countries are close to 0, suggesting broad stability in cultural meanings over time, but in each of the three countries there are terms whose meaning changes substantially. Let's find the terms that have undergone the biggest changes in meaning:
eval_change_substantial <- eval_change %>%
  arrange(-abs(change))

head(eval_change_substantial)
This data format is amenable to plotting with ggplot. Let's look at a few of the terms which have seen substantial meaning change.
change_terms <- c("lesbian", "junkie", "depressed")

# I'm returning to the subset data frame here
terms_toplot <- term_subset %>%
  filter(term %in% change_terms) %>%
  rename(Evaluation = E, Potency = P, Activity = A) %>%
  pivot_longer(cols = c(Evaluation, Potency, Activity),
               names_to = "dimension", values_to = "rating") %>%
  mutate(dimension = factor(dimension,
                            levels = c("Evaluation", "Potency", "Activity")))

ggplot(terms_toplot,
       aes(x = as.numeric(year), y = rating, shape = context, color = term)) +
  geom_point(size = 2.5) +
  geom_line(alpha = .5) +
  facet_wrap("dimension") +
  labs(x = "Year", y = "Mean sentiment rating")
This shows that "lesbian" rose on the evaluation and potency dimensions in all three countries and "junkie" fell on the activity dimension in all three countries. The trajectory of "depressed" was more place-dependent: it fell on all three dimensions in the U.S., stayed stable on evaluation and potency in Canada and Germany, and fell on activity in Germany. This type of analysis cannot show causation, of course, but it can be a useful starting point for identifying and investigating terms that may have been affected by social movements or other shifts within and/or across countries. Similarly, this kind of analysis could be used to identify terms that vary across cultures and, in cultures where multiple datasets have been collected, to investigate whether these differences have changed over time.