Gender

# Pick the author and gender fields; label missing gender as "unknown"
dfs <- df %>% select(author, author_gender)
dfs$author_gender <- as.character(dfs$author_gender)
dfs$author_gender[is.na(dfs$author_gender)] <- "unknown"
for (id in unique(dfs$author_gender)) {
  p <- top_plot(filter(dfs, author_gender == id), "author", ntop) + ggtitle(paste("Top", id, "authors"))
  print(p)
}

Author gender distribution in the complete data:

dfs <- df %>% group_by(author_gender) %>%
              summarize(docs = n(), fraction = round(100*n()/nrow(df), 2))
names(dfs) <- c("Gender", "Documents (n)", "Fraction (%)")
kable(dfs, digits = 2)

Author gender distribution over time. Note that the association between a name and a gender varies over time and between regions; this has not been taken into account here.

tab <- table(df$author_gender)
# Share of male and female authors per decade (decades with over 25 documents, after 1470)
dfd <- df %>% group_by(publication_decade) %>%
         summarize(n.male = sum(author_gender == "male", na.rm = TRUE),
                   n.female = sum(author_gender == "female", na.rm = TRUE),
                   n.total = n()) %>%
         mutate(p.male = 100*n.male/n.total,
                p.female = 100*n.female/n.total) %>%
         filter(n.total > 25 & publication_decade > 1470)
# The same summary per publication year (years with over 25 documents)
dfy <- df %>% group_by(publication_year) %>%
         summarize(n.male = sum(author_gender == "male", na.rm = TRUE),
                   n.female = sum(author_gender == "female", na.rm = TRUE),
                   n.total = n()) %>%
         mutate(p.male = 100*n.male/n.total,
                p.female = 100*n.female/n.total) %>%
         filter(n.total > 25)
library(microbiome)
theme_set(theme_bw(25))
p <- NULL # Reset so that a stale plot is not printed if plotting fails
p <- microbiome::plot_regression(p.female ~ publication_decade, dfd)
p <- p + 
       labs(x = "Publication decade", y = "Female authors (%)") +
       guides(fill = "none", size = "none", color = "none", alpha = "none")
print(p)

Data sources

The name-gender mappings were collected from the following sources using this script:

The name-gender mappings from different years and regions are combined. When the sources give conflicting genders for the same name, the gender is marked as ambiguous. Afterwards, our custom name-gender mappings and custom author information tables are used to augment this information. The genderizeR R package could also be useful, but the genderize.io API it relies on has a limit of 1000 queries a day, so it has been omitted for now.
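The combination step described above can be sketched roughly as follows. This is a minimal illustration in base R, not the script used by the package: the tables `source_a`, `source_b` and `custom`, their `name` and `gender` columns, and the function `combine_gender_maps` are all hypothetical, and the sketch assumes that custom mappings take precedence over the combined sources.

```r
# Hypothetical source tables: one row per (name, gender) pair
source_a <- data.frame(name = c("anna", "john", "kim"),
                       gender = c("female", "male", "female"),
                       stringsAsFactors = FALSE)
source_b <- data.frame(name = c("john", "kim", "mary"),
                       gender = c("male", "male", "female"),
                       stringsAsFactors = FALSE)

# Hypothetical custom mappings, applied after the sources are combined
custom <- data.frame(name = "kim", gender = "female",
                     stringsAsFactors = FALSE)

combine_gender_maps <- function(sources, custom = NULL) {
  # Stack all sources and drop exact duplicate (name, gender) pairs
  all_maps <- unique(do.call(rbind, sources))
  # Keep the gender when all sources agree; otherwise mark it "ambiguous"
  by_name <- split(all_maps$gender, all_maps$name)
  genders <- vapply(by_name, function(g) {
    if (length(unique(g)) == 1) g[[1]] else "ambiguous"
  }, character(1))
  combined <- data.frame(name = names(genders),
                         gender = unname(genders),
                         stringsAsFactors = FALSE)
  # Assumption: custom mappings override the combined result
  # and may also add names that no source contains
  if (!is.null(custom)) {
    combined <- combined[!combined$name %in% custom$name, ]
    combined <- rbind(combined, custom)
  }
  combined <- combined[order(combined$name), ]
  rownames(combined) <- NULL
  combined
}

without_custom <- combine_gender_maps(list(source_a, source_b))
with_custom <- combine_gender_maps(list(source_a, source_b), custom)
```

In this toy data the two sources disagree on "kim", so `without_custom` marks that name as ambiguous, while `with_custom` resolves it from the custom table. Whether the custom tables should override or only fill in missing entries is a design choice that the text above leaves open.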



rOpenGov/bibliographica documentation built on April 10, 2022, 8:51 p.m.