estc: English Short Title Catalogue (ESTC) Metadata Toolkit

title: "Gender preprocessing overview"
author: "Leo Lahti / Computational History Group"
date: "2018-06-21"
output: markdown_document

Author-gender mappings in the final data
3282 unique male authors
375 unique female authors
15289 documents (3.2%) with a male author
1828 documents (0.4%) with a female author
457245 documents (95.2%) with unresolved gender (including pseudonyms)
First names identified as female in the preprocessed data (including pseudonyms)
First names identified as male in the preprocessed data (including pseudonyms)
First names with ambiguous gender (both male and female listed in the gender mapping tables) in the preprocessed data (including pseudonyms). To override and resolve ambiguous mappings, gender information can be added to the custom name-gender mappings or the custom author information table.
First names with unknown gender (no gender mapping information available) in the preprocessed data (including pseudonyms). The missing information can be added to the custom name-gender mappings or the custom author information table.

[Figure: summary of author genders (chunk summary-authorgenders)]

Author gender distribution in the complete data:

|Gender    | Documents (n)| Fraction (%)|
|:---------|-------------:|------------:|
|          |          5846|         1.22|
|ambiguous |        119231|        24.83|
|female    |          1828|         0.38|
|male      |         15289|         3.18|
|NA        |        338014|        70.39|
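The fractions in the table, and the 95.2% "unresolved" share quoted in the summary list, can be re-derived from the document counts. The following is a minimal Python sketch for checking the arithmetic; it is not part of the toolkit (which is written in R), and the variable names are illustrative. It assumes that "unresolved" means the `ambiguous` and `NA` rows taken together, which is consistent with the reported count of 457245 but is not stated explicitly in the text.

```python
# Document counts copied from the table above; the empty key is the
# row with a blank gender label.
counts = {
    "": 5846,
    "ambiguous": 119231,
    "female": 1828,
    "male": 15289,
    "NA": 338014,
}

# Total number of documents across all rows.
total = sum(counts.values())

# Percentage share of each row, rounded to two decimals as in the table.
fractions = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Assumption: "unresolved" = ambiguous + NA (the blank row is not included).
unresolved = counts["ambiguous"] + counts["NA"]
unresolved_pct = round(100 * unresolved / total, 1)

for label, pct in fractions.items():
    print(f"{label or '(blank)':10s} {counts[label]:7d} {pct:6.2f}")
print("unresolved", unresolved, unresolved_pct)
```

Under this reading, the male, female and unresolved shares do not add up to 100% because the row with the blank gender label is counted in none of them.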

Author gender distribution over time. Note that name-gender associations vary over time and geography, but this has not been taken into account here.

[Figure: author gender distribution over time (chunk summarygendertime)]

The name-gender mappings were collected from the following sources using this script:

U.S. Social Security Administration baby name data as implemented in the babynames and gender R packages. The data give, for each year from 1880 to 2013, the number of children of each sex given each name. All names with more than 5 uses are given.
The U.S. Census data in the Integrated Public Use Microdata Series as implemented in the genderdata R package
The Kantrowitz corpus of male and female names as implemented in the genderdata R package
The genderdata R package mappings for Canada, UK, Germany, Iceland, Norway, and Sweden.
Multilingual database (Prenoms.txt)
French first names
German first names
Finnish population register (Vaestorekisterikeskus; VRK). First names for living Finnish citizens that live in Finland and abroad in 2016. Only names with frequency n>10 are included. Source: avoindata.fi service and Vaestorekisterikeskus (VRK). Version: 3/2016. Data license CC-BY 4.0.
Pseudonyms provided by the authors of the bibliographica R package.
Custom name-gender mappings constructed manually by the authors of this R package
Custom author information constructed manually by the authors of this R package

The name-gender mappings from different years and regions are combined. When the sources give conflicting gender mappings for a name, its gender is marked as ambiguous. Afterwards, our custom name-gender mappings and custom author information tables are used to augment this information. The genderizeR R package could also be useful, but the genderize.io API it relies on has a limit of 1000 queries a day, so it has been omitted for now.
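The combination rule described above can be illustrated with a short Python sketch. This is not the toolkit's implementation (the package is written in R). The function names `combine_mappings` and `lookup`, the lowercasing of names, and the toy name tables are all assumptions made for illustration. The sketch shows three steps: pool the source tables, mark a name as ambiguous when sources disagree, and then let the manual custom table take precedence.

```python
from collections import defaultdict


def combine_mappings(sources, custom=None):
    """Merge name -> gender tables; conflicting names become 'ambiguous'.

    sources: iterable of dicts mapping a first name to 'male' or 'female'.
    custom:  optional dict of manual mappings, applied last so that it
             can override or resolve ambiguous entries.
    """
    seen = defaultdict(set)
    for table in sources:
        for name, gender in table.items():
            # Assumption: names are matched case-insensitively.
            seen[name.strip().lower()].add(gender)

    combined = {}
    for name, genders in seen.items():
        if len(genders) == 1:
            combined[name] = next(iter(genders))
        else:
            combined[name] = "ambiguous"

    # Manual mappings take precedence over the pooled sources.
    for name, gender in (custom or {}).items():
        combined[name.strip().lower()] = gender
    return combined


def lookup(first_name, mapping):
    """Return the mapped gender, or None when the name is unknown."""
    return mapping.get(first_name.strip().lower())


# Toy tables, made up for illustration only.
source_a = {"John": "male", "Mary": "female", "Leslie": "female"}
source_b = {"John": "male", "Leslie": "male"}
custom_table = {"Leslie": "male"}

pooled = combine_mappings([source_a, source_b])
resolved = combine_mappings([source_a, source_b], custom=custom_table)
```

With these toy tables, `lookup("Leslie", pooled)` gives `"ambiguous"` because the two sources disagree, while `lookup("Leslie", resolved)` gives `"male"` once the custom table is applied. A name absent from every table yields `None`, which corresponds to the "unknown gender" category.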

COMHIS/estc documentation built on April 7, 2022, 4:53 p.m.
