Author-gender mappings in the final data
`r length(unique(subset(df.preprocessed, author_gender == "male")$author))` unique male authors
`r length(unique(subset(df.preprocessed, author_gender == "female")$author))` unique female authors
`r nrow(subset(df.preprocessed, author_gender == "male"))` documents (`r round(100*nrow(subset(df.preprocessed, author_gender == "male"))/nrow(df.preprocessed), 1)`%) with a male author
`r nrow(subset(df.preprocessed, author_gender == "female"))` documents (`r round(100*nrow(subset(df.preprocessed, author_gender == "female"))/nrow(df.preprocessed), 1)`%) with a female author
`r nrow(subset(df.preprocessed, author_gender == "ambiguous" | is.na(author_gender)))` documents (`r round(100*nrow(subset(df.preprocessed, author_gender == "ambiguous" | is.na(author_gender)))/nrow(df.preprocessed), 1)`%) with unresolved gender (including pseudonyms)
First names identified as female in the preprocessed data (including pseudonyms)
First names identified as male in the preprocessed data (including pseudonyms)
First names with ambiguous gender (both male and female listed in the gender mapping tables) in the preprocessed data (including pseudonyms). To override and resolve ambiguous mappings, gender information can be added to the custom name-gender mappings or to the custom author information table.
First names with unknown gender (no gender mapping information available) in the preprocessed data (including pseudonyms). The missing information can be added to the custom name-gender mappings or to the custom author information table.
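As a rough illustration of how candidates for the custom tables could be prioritized, the sketch below counts the first names whose gender is still ambiguous or unknown, most frequent first. The data frame `df_example` and its `first_name` column are hypothetical stand-ins, not objects from this analysis; only the `author_gender` column name is taken from the code in this document.

```r
# Hypothetical example data; `first_name` is an assumed column name
df_example <- data.frame(
  first_name    = c("anne", "kim", "kim", "xyz", "john", "xyz", "xyz"),
  author_gender = c("female", "ambiguous", "ambiguous", NA, "male", NA, NA),
  stringsAsFactors = FALSE
)

# Keep documents whose gender is unknown (NA) or ambiguous
unresolved <- subset(df_example,
                     is.na(author_gender) | author_gender == "ambiguous")

# Count unresolved first names, most frequent first
unresolved_counts <- sort(table(unresolved$first_name), decreasing = TRUE)
print(unresolved_counts)
```

Names at the top of such a table affect the most documents, so they are the natural first additions to a custom mapping.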
# One row per document, with the author and the inferred gender
gen <- df %>% select(author, author_gender)
dfs <- gen %>% group_by(author, author_gender) # %>% summarize(n = n())
# Treat missing gender as its own "unknown" category
dfs$author_gender <- as.character(dfs$author_gender)
dfs$author_gender[is.na(dfs$author_gender)] <- "unknown"
# Plot the top authors separately for each gender category
for (id in unique(dfs$author_gender)) {
  p <- top_plot(filter(dfs, author_gender == id), "author", ntop) +
    ggtitle(paste("Top", id, "authors"))
  print(p)
}
Author gender distribution in the complete data:
# Document counts and percentages per gender category
dfs <- df %>%
  group_by(author_gender) %>%
  summarize(docs = n(), fraction = round(100*n()/nrow(df), 2))
names(dfs) <- c("Gender", "Documents (n)", "Fraction (%)")
kable(dfs, digits = 2)
Author gender distribution over time. Note that the name-gender mappings change over time and geography but this has not been taken into account here.
tab <- table(df$author_gender)
# Gender counts and percentages per decade (decades with more than 25 documents, after 1470)
dfd <- df %>%
  group_by(publication_decade) %>%
  summarize(n.male = sum(author_gender == "male", na.rm = TRUE),
            n.female = sum(author_gender == "female", na.rm = TRUE),
            n.total = n()) %>%
  mutate(p.male = 100*n.male/n.total, p.female = 100*n.female/n.total) %>%
  filter(n.total > 25 & publication_decade > 1470)
# The same summary per publication year
dfy <- df %>%
  group_by(publication_year) %>%
  summarize(n.male = sum(author_gender == "male", na.rm = TRUE),
            n.female = sum(author_gender == "female", na.rm = TRUE),
            n.total = n()) %>%
  mutate(p.male = 100*n.male/n.total, p.female = 100*n.female/n.total) %>%
  filter(n.total > 25)
library(microbiome)
theme_set(theme_bw(25))
p <- NULL # Avoid confusion if the plot gives an error
p <- microbiome::plot_regression(p.female ~ publication_decade, dfd)
p <- p + labs(x = "Publication decade", y = "Female authors (%)") +
  guides(fill = "none", size = "none", color = "none", alpha = "none")
print(p)
The name-gender mappings were collected from the following sources using this script:
The name-gender mappings from different years and regions are combined. When the sources give conflicting gender mappings for a name, its gender is marked as ambiguous. Afterwards, our custom name-gender mappings and custom author information tables are used to augment this information. The genderizeR R package could also be useful, but the genderize.io API has a limit of 1000 queries a day, so it is omitted for now.
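The combination rule described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the script used for this analysis: the tables `source_a`, `source_b`, and `custom`, and the helper functions `combine_gender_maps()` and `apply_custom_map()`, are all hypothetical names with made-up data.

```r
# Hypothetical name-gender tables from two sources (illustrative data only)
source_a <- data.frame(name = c("anne", "john", "kim"),
                       gender = c("female", "male", "female"),
                       stringsAsFactors = FALSE)
source_b <- data.frame(name = c("john", "kim", "mary"),
                       gender = c("male", "male", "female"),
                       stringsAsFactors = FALSE)

# Combine sources; a name listed with more than one gender becomes "ambiguous"
combine_gender_maps <- function(...) {
  all_maps <- unique(do.call(rbind, list(...)))
  genders <- vapply(split(all_maps$gender, all_maps$name), function(g) {
    if (length(unique(g)) == 1) g[[1]] else "ambiguous"
  }, character(1))
  data.frame(name = names(genders), gender = unname(genders),
             stringsAsFactors = FALSE)
}

# Custom entries take precedence over the combined mapping
apply_custom_map <- function(map, custom) {
  rbind(map[!map$name %in% custom$name, ], custom)
}

# Hypothetical custom mapping that resolves one ambiguous name
custom <- data.frame(name = "kim", gender = "female", stringsAsFactors = FALSE)

combined <- combine_gender_maps(source_a, source_b)
final <- apply_custom_map(combined, custom)
print(final)
```

In this sketch the custom table simply replaces any combined entry for the same name, which is one straightforward way to let manual corrections override conflicting sources.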