knitr::opts_chunk$set (
    collapse = TRUE,
    warning = TRUE,
    message = TRUE,
    width = 120,
    comment = "#>",
    fig.retina = 2,
    fig.path = "README-"
)
library (genderconsciousrouting)

Gender Encoding

This R package includes an internally bundled C library for categorising the gender of first names, thanks to Michael Jörg (available here). The library is extremely fast and flexible; covers all European languages and a host of others, and categorises gender very accurately. Here's a test run on a very large data set of names from the English-speaking world:

u <- "https://github.com/hadley/data-baby-names/raw/master/baby-names.csv"
if (!file.exists ("baby-names.csv"))
    chk <- download.file (u, "baby-names.csv")
n <- read.csv ("baby-names.csv", stringsAsFactors = FALSE)
format (nrow (n), big.mark = ",")
st <- system.time (x <- get_gender (n$name))
st
knitr::kable (table (x$gender))
knitr::kable (table (n$sex))

Categorising r format (nrow (n), big.mark = ",") names took only r st [3] seconds, or around 100,000 names per second. The following code compares the accuracy, noting that many names are of course unisex, whereas the "baby-names" data are direct records of individual names and sex.

x$gender [x$gender == "IS_MALE"] <- "boy"
x$gender [x$gender == "IS_MOSTLY_MALE"] <- "boy"
x$gender [x$gender == "IS_FEMALE"] <- "girl"
x$gender [x$gender == "IS_MOSTLY_FEMALE"] <- "girl"

index_right <- which (x$gender == n$sex)
message (format (length (index_right), big.mark = ","), " / ",
         format (nrow (x), big.mark = ","),
         " of names correctly classified = ",
         formatC (100 * length (index_right) / nrow (x),
                  format = "f", digits = 1), "%")

Noting that the baby name records are structured over time, and include many repeats of the same names, we can try to create "mostly girl/boy" categories based on relative proportions.

library (dplyr)

categorise_sex <-  function (sex, size) {
    # define relative proportions:
    # if > rel_props [2], then category is singular
    # else if > rel_props [1], then category is "mostly" singular,
    # else category is unisex
    rel_props <- c (4, 1000)

    if (length (size) == 1)
      return (sex)

    bi <- which (sex == "boy")
    gi <- which (sex == "girl")

    if (size [bi] > (size [gi] * rel_props [2]))
        return ("boy")
    else if (size [gi] > (size [bi] * rel_props [2]))
        return ("girl")
    else if (size [bi] > (size [gi] * rel_props [1]))
        return ("mostly boy")
    else if (size [gi] > (size [bi] * rel_props [1]))
        return ("mostly girl")
    else
        return ("unisex")
    }
n2 <- n |>
    group_by (name, sex) |>
    summarise (size = n ()) |>
    group_by (name) |>
    summarise (category = categorise_sex (sex, size))
knitr::kable (table (n2$category))

The above values for relative proportions were selected to give good agreement with the observed overall distribution of categories as determined by the internal library. These two more refined data sets can then be compared:

n2$gender <- get_gender (n2$name)$gender
n2$gender [n2$gender == "IS_FEMALE"] <- "girl"
n2$gender [n2$gender == "IS_MALE"] <- "boy"
n2$gender [n2$gender == "IS_MOSTLY_FEMALE"] <- "mostly girl"
n2$gender [n2$gender == "IS_MOSTLY_MALE"] <- "mostly boy"
n2$gender [n2$gender == "IS_UNISEX_NAME"] <- "unisex"

Some names are simply not found, so we'll remove those from the comparison before calculating final statistics.

n2 <- n2 [which (!n2$gender == "NAME_NOT_FOUND"), ]
knitr::kable (with (n2, table (category, gender)))

The accuracy in that case is

ct <- with (n2, table (category, gender))
sum (diag (ct)) / sum (ct)


mpadge/gender-conscious--routing documentation built on Aug. 15, 2022, 10:28 a.m.