inst/examples/gender.md

title: "Gender preprocessing overview"
author: "Leo Lahti / Computational History Group"
date: "2018-06-21"
output: markdown_document

Gender

plot of chunk summary-authorgenders

Author gender distribution in the complete data:

|Gender    | Documents (n)| Fraction (%)|
|:---------|-------------:|------------:|
|          |          5846|         1.22|
|ambiguous |        119231|        24.83|
|female    |          1828|         0.38|
|male      |         15289|         3.18|
|NA        |        338014|        70.39|
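The fraction column can be recomputed from the document counts. The following is a minimal Python sketch for checking that arithmetic, not the code that generated this report; the variable names are illustrative, and the counts are copied from the table above.

```python
# Document counts per gender label, copied from the table above.
# The empty-string key stands for the row whose gender label is blank.
counts = {
    "": 5846,
    "ambiguous": 119231,
    "female": 1828,
    "male": 15289,
    "NA": 338014,
}

# Total number of documents across all labels.
total = sum(counts.values())

# Percentage share of each label, rounded to two decimals as in the table.
fractions = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, share in fractions.items():
    print(f"{label or '(blank)':<10} {counts[label]:>7} {share:>6.2f}")
```

Computed this way, the rounded shares match the Fraction (%) column.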

Author gender distribution over time. Note that name-gender mappings vary over time and geography; this has not been taken into account here.

plot of chunk summarygendertime

Data sources

The name-gender mappings were collected from the following sources using this script:

The name-gender mappings from different years and regions are combined. When the sources give conflicting gender mappings for a name, the gender is marked as ambiguous. Afterwards, our custom name-gender mappings and custom author information tables are used to augment this information. The genderizeR R package could also be useful, but the genderize.io API it relies on is limited to 1000 queries a day, so it has been omitted for now.
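The combination rule described above can be sketched as follows. This is an illustrative Python sketch, not the package's actual script: the function names `combine_sources` and `augment`, the lowercase name normalization, the choice to let custom entries override the combined ones, and the toy name tables are all assumptions made for the example.

```python
from collections import defaultdict

AMBIGUOUS = "ambiguous"


def normalize(name):
    # Assumption: names are compared case-insensitively, ignoring outer spaces.
    return name.strip().lower()


def combine_sources(sources):
    """Merge several name -> gender tables into one.

    A name keeps its gender when every source that lists it agrees;
    a name with conflicting genders across sources is marked ambiguous.
    """
    seen = defaultdict(set)
    for table in sources:
        for name, gender in table.items():
            seen[normalize(name)].add(gender)
    return {
        name: next(iter(genders)) if len(genders) == 1 else AMBIGUOUS
        for name, genders in seen.items()
    }


def augment(combined, custom):
    """Layer custom mappings on top of the combined table.

    Assumption: a custom entry takes precedence over the combined value.
    """
    merged = dict(combined)
    for name, gender in custom.items():
        merged[normalize(name)] = gender
    return merged


# Made-up toy tables standing in for mappings from different years or regions.
source_a = {"Mary": "female", "John": "male", "Leslie": "male"}
source_b = {"mary": "female", "Leslie": "female"}

combined = combine_sources([source_a, source_b])
final = augment(combined, {"Leslie": "female", "Anne": "female"})

print(combined)
print(final)
```

In this toy data the two sources disagree on "Leslie", so the name is marked ambiguous in `combined` until the custom table supplies a value for it.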



COMHIS/estc documentation built on April 7, 2022, 4:53 p.m.