GenderInfer
In GenderInfer: This is a Collection of Functions to Analyse Gender Differences

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(GenderInfer)
library(dplyr)
library(ggplot2)

GenderInfer package

GenderInfer is a package developed to investigate gender differences within a data set. It is based on the work of Dr. A. Day et al., Chem. Sci., 2020, 11, 2277-2301, and was developed for analysing differences in publishing authorship by gender. It may also be useful for other analyses where male and female percentages might differ from a specified baseline. Gender is assigned based on the first name, using the globalnamedata corpus (https://github.com/OpenGenderTracking/globalnamedata), which draws on data from:

United States
  Social Security Administration
United Kingdom
  UK Office of National Statistics
  Northern Ireland Statistics and Research Administration
  Scotland General Register Office

Example data set

In this vignette, the example data frame authors contains random names (first and last name in each row), a country code (country_code) and publication_years from 2016 to 2020. This data set allows us to examine gender differences in article submissions to a journal.

head(authors)

Assign gender based on first name

The function assign_gender assigns a plausible gender to each row of the supplied data frame (data_df), based on the first name stored in the column specified by first_name_col. It returns a data frame like the input one, with a new column gender containing the values M (male), F (female) or U (unknown).

authors_df <- assign_gender(data_df = authors, first_name_col = "first_name")

head(authors_df)
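
To make the idea of name-based lookup concrete, the sketch below matches first names against a tiny corpus. The table `toy_corpus` and the helper `toy_assign_gender` are invented for illustration; they are not the package's implementation and do not use the globalnamedata corpus. Names missing from the corpus are labelled U.

```r
# Hypothetical miniature corpus; the real package relies on globalnamedata.
toy_corpus <- data.frame(
  name = c("anna", "john", "maria", "peter"),
  gender = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of GenderInfer): look up each first name,
# ignoring case, and return "U" when the name is not in the corpus.
toy_assign_gender <- function(first_names, corpus = toy_corpus) {
  idx <- match(tolower(first_names), corpus$name)
  ifelse(is.na(idx), "U", corpus$gender[idx])
}

toy_result <- toy_assign_gender(c("Anna", "Peter", "Xyzzy"))
toy_result
```

A real corpus would also need a rule for names used by more than one gender; that is left out of this sketch.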

We can now explore how many female, male and unknown entries there are in the data frame, using the function count from the dplyr package.

## Count how many female, male and unknown gender there are in the data
authors_df %>% count(gender)

## per gender and country
authors_df %>% count(gender, country_code)

Calculate baseline and plot basic chart

GenderInfer calculates the female baseline using the function baseline; this value is then used for the statistical calculations and for the graphics. The baseline female percentage is calculated as:

$$baseline = \frac{Female}{Female + Male}$$

Note that, following the methodology discussed in the paper, the Unknown totals are omitted when calculating any percentages (both for baselines and for any female percentage compared with them). The analysis compares the female percentage of various sub-populations with this baseline in order to find those where the difference is significant. It is also possible to calculate the baseline for different levels, such as year, country or another variable. The level represents the variable we want to use to make the comparison.
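
The formula can be checked by hand with base R. This is a minimal sketch on made-up gender labels (`toy_gender` is hypothetical). The result is expressed as a proportion; no assumption is made here about whether the package's baseline function returns a proportion or a percentage.

```r
# Hypothetical gender labels, for illustration only.
toy_gender <- c("F", "M", "M", "U", "F", "M", "U", "M")

n_female <- sum(toy_gender == "F")  # 2 females
n_male   <- sum(toy_gender == "M")  # 4 males

# Unknowns are left out of the denominator, as described above.
toy_baseline <- n_female / (n_female + n_male)
toy_baseline
```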

In the following case we calculate the baseline for the year range 2016-2019 to compare with 2020 for the whole data set.

## calculates baseline for the year range 2016-2019
baseline_female <- baseline(data_df = authors_df %>% 
                              filter(publication_years %in% seq(2016, 2019)),
                            gender_col = "gender")
baseline_female

Create a simple bar chart showing the number of males and females

The package provides the function calculate_binom_baseline, which applies a binomial test in which the number of females is the number of successes in a series of Bernoulli trials and the baseline is the expected probability of success. The test shows whether the female percentage differs significantly from the baseline. Before the binomial test is calculated, the input data frame is reshaped into a new form.
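
For readers who want to see the test in isolation, here is a minimal sketch using `binom.test` from base R's stats package. The counts and the baseline are made up, and using `binom.test` is an assumption for illustration: the vignette does not state which routine calculate_binom_baseline calls internally.

```r
# Hypothetical counts and baseline, for illustration only.
n_female <- 45
n_male <- 80
toy_baseline <- 0.30  # expected probability of "success" (female)

# Exact binomial test: successes, number of trials, expected probability.
toy_test <- binom.test(x = n_female, n = n_female + n_male,
                       p = toy_baseline, conf.level = 0.95)

toy_test$estimate   # observed female proportion
toy_test$conf.int   # lower and upper confidence limits
toy_test$p.value    # small values suggest a difference from the baseline
```

Unknowns are excluded from the number of trials, in line with the way the baseline itself is defined.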

First, we calculate the gender counts for 2020. The variable used for the comparison in this case is publication_years, which allows a comparison with the previous year range. In this package, the variable used for comparison is called the level. The function reshape_for_binomials creates a new data frame containing the female and male percentages, the total for the level (total_for_level), which is the sum of female, male and unknown, and the sum of female and male (total_female_male).

## Create a data frame containing only the data from 2020 and
## the count of each value of the variable gender.
female_count_2020 <- authors_df %>% 
  filter(publication_years == 2020) %>%
  count(gender)

## Create a new data frame to be used for the binomial calculation.
df_gender <- reshape_for_binomials(data_df = female_count_2020,
                                   gender_col = "gender",
                                   level = 2020)

df_gender
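
The totals and percentages described above can be reproduced by hand. This is a sketch on hypothetical counts (`toy_counts`), following the definitions given in the text; it is not the package's code, and the percentage variable names are illustrative.

```r
# Hypothetical gender counts for a single level (here, the year 2020).
toy_counts <- c(F = 30, M = 60, U = 10)

total_for_level   <- sum(toy_counts)               # female + male + unknown
total_female_male <- sum(toy_counts[c("F", "M")])  # unknowns excluded

# Percentages are taken over female + male only.
female_percentage <- 100 * toy_counts[["F"]] / total_female_male
male_percentage   <- 100 * toy_counts[["M"]] / total_female_male
c(female = female_percentage, male = male_percentage)
```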

The function calculate_binom_baseline also calculates the lower CI, the upper CI and the significance. The default confidence level is 0.95. Before plotting the results, the function total_gender_df pivots the data into a longer format: the data frame now has more rows and fewer columns, with a column gender that contains the values female, male and unknown. The function bar_chart then creates a bar chart showing the number of female, male and unknown authors.

## Calculate the binomial
## Create a new column with the baseline and calculate the binomial.

df_gender <- calculate_binom_baseline(data_df = df_gender,
                                      baseline_female = baseline_female)

df_gender

## First reshape the data frame using `total_gender_df`, then create a bar
## chart showing the number of male, female and unknown gender with `bar_chart`.
gender_total <- total_gender_df(data_df = df_gender, level = "level")

bar_chart(data_df = gender_total, x_label = "Year",
          y_label = "Total number")
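
The wide-to-long step can be illustrated with base R's `stack`. The data frame `toy_wide` and its column names are hypothetical, and the sketch only shows what pivoting to a longer format means; it does not reproduce the output of total_gender_df.

```r
# Hypothetical wide data: one row per level, one column per gender.
toy_wide <- data.frame(level = "2020", female = 30, male = 60, unknown = 10)

# stack() gathers the selected columns into a `values`/`ind` pair;
# the single level value is recycled across the three new rows.
toy_long <- data.frame(level = toy_wide$level,
                       stack(toy_wide[c("female", "male", "unknown")]))
names(toy_long) <- c("level", "total", "gender")
toy_long
```

The result has three rows (one per gender) instead of one, and three columns instead of four.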

Create bar chart with significance bar and baseline

The function stacked_bar_chart creates a stacked bar chart of the percentages. The chart shows the baseline and the percentages of males and females.

## Reshape the data frame using the function `percent_df`.
## Adding coord_flip() from ggplot2 to `stacked_bar_chart` swaps the x and y axes.
percent_data <- percent_df(data_df = df_gender)
stacked_bar_chart(percent_data, baseline_female = baseline_female,
                  x_label = "Year", y_label = "Percentage of authors",
                  baseline_label = "Female baseline 2016-2019:") +
  coord_flip()

Multibaseline analysis

We can now see how to calculate the baseline for several levels of the same variable and how to generate the graphics. In the example below we use the function sapply to generate the baseline values for c("UK", "US"). This produces a numeric vector containing two values, one for each country. As before, we reshape the data with the function reshape_for_binomials and then apply calculate_binom_baseline.

## Calculate binomials for the UK and the US.
## Filter the data frame to the countries UK and US and the year 2020, count
## gender per country, and reshape the result.

UK_US_df <- reshape_for_binomials(data_df = authors_df %>%
                                    filter(country_code %in% c("UK", "US"),
                                           publication_years == 2020) %>%
                                    count(gender, country_code),
                                  gender_col = "gender", level = "country_code")

## To calculate the baseline for each country we can use the function `sapply`
baseline_uk_us <- sapply(UK_US_df$level, function(x) {
  baseline(data_df = authors_df %>%
            filter(country_code %in% x, publication_years %in% seq(2016, 2019)),
           gender_col = "gender")
})

baseline_uk_us
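
The per-level pattern can be tried on toy data without the package. In this sketch, `toy_authors` and `toy_baseline_fun` are hypothetical stand-ins (the helper is not the package's baseline function); the point is only that sapply over character levels returns a numeric vector named by level.

```r
# Hypothetical authors table, for illustration only.
toy_authors <- data.frame(
  country_code = c("UK", "UK", "UK", "US", "US", "US", "US"),
  gender       = c("F", "M", "U", "F", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# Illustrative helper: female / (female + male), unknowns excluded.
toy_baseline_fun <- function(gender) {
  sum(gender == "F") / sum(gender %in% c("F", "M"))
}

# One baseline per country; names are taken from the character input.
toy_baselines <- sapply(c("UK", "US"), function(x) {
  toy_baseline_fun(toy_authors$gender[toy_authors$country_code == x])
})
toy_baselines
```

Keeping the names on the vector makes it easy to confirm that each baseline lines up with the level it is compared against.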

UK_US_binom <- calculate_binom_baseline(data_df = UK_US_df,
                                        baseline_female = baseline_uk_us)

UK_US_binom

A bullet chart displays the baseline and the female and male percentages for the US and the UK.

percent_uk_us <- percent_df(UK_US_binom)

## Store the plot under a name that does not mask the function `bullet_chart`.
uk_us_bullet_chart <- bullet_chart(data_df = percent_uk_us,
                                   baseline_female = baseline_uk_us,
                                   x_label = "Countries", y_label = "% Authors",
                                   baseline_label = "Female baseline for 2016-2019")
uk_us_bullet_chart

With the GenderInfer package it is possible to combine a bullet chart and a line chart in the same graph. The bullet chart in this example shows the female and male percentages for the UK for each year from 2016 to 2020. Each bar is compared with a baseline for the same year, here the female baseline calculated for France.

## Reshape the UK data per publication year for the binomial calculation.

UK_df <- reshape_for_binomials(data_df = authors_df %>%
                                 filter(country_code == "UK") %>%
                                 count(gender, publication_years),
                               "gender", "publication_years")

UK_df

## Create a baseline vector containing one value for each year from 2016 to 2020,
## using France as the country to compare with.
baseline_fr <- sapply(seq(2016, 2020), function(x) {
  baseline(data_df = authors_df %>%
             filter(country_code == "FR", publication_years %in% x), 
           gender_col = "gender")
})
baseline_fr

UK_binom <- calculate_binom_baseline(UK_df, baseline_female = baseline_fr)
UK_binom

The line chart drawn on top of the bullet chart shows the total number of authors per year, in this case for the UK.

## Calculate the female and male percentages for the UK per year
percent_uk <- percent_df(UK_binom)
## Calculate the number of submissions from the UK per year
total_uk <- authors_df %>%
  filter(country_code == "UK") %>%
  count(publication_years) %>%
  mutate(x_values = factor(publication_years,
                           levels = publication_years))
## Conversion factor used to create the second y-axis
## (named so that it does not mask the base function `c`)
line_scaling <- min(total_uk$n) / 100
bullet_line_chart(data_df = percent_uk, baseline_female = baseline_fr,
                  x_label = "year", y_bullet_chart_label = "Authors submission (%)",
                  baseline_label = "French Female baseline",
                  line_chart_df = total_uk,
                  line_chart_scaling = line_scaling, y_line_chart_label = "Total number",
                  line_label = "Total submission UK")
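
The arithmetic behind a conversion factor for a second y-axis can be sketched on made-up totals. How bullet_line_chart applies line_chart_scaling internally is not shown in this vignette; the sketch assumes the common convention in which counts are divided by the factor to be drawn on the percentage axis and axis breaks are multiplied by it to label the second axis. All names below are hypothetical.

```r
# Hypothetical yearly totals, for illustration only.
toy_totals <- c(`2016` = 250, `2017` = 300, `2018` = 420,
                `2019` = 380, `2020` = 500)

# Same rule as above: smallest total divided by 100.
toy_scaling <- min(toy_totals) / 100

# Assumed convention: counts divided by the factor are in primary-axis units,
# and primary-axis breaks multiplied by it give labels in count units.
scaled_totals <- toy_totals / toy_scaling
secondary_labels <- c(0, 50, 100) * toy_scaling
scaled_totals
secondary_labels
```

With this rule the smallest yearly total maps to 100 on the primary axis and larger totals fall above it, so a different factor (for example one based on the largest total) may suit other data better.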