genderBR predicts gender from Brazilian first names using data from the Instituto Brasileiro de Geografia e Estatistica's 2010 Census.
genderBR's main function is get_gender, which takes a string with a Brazilian first name and predicts its gender using data from the IBGE's 2010 Census, drawn from the IBGE's API and from an internal dataset.
More specifically, it uses the number of females and males with the same name in Brazil, or in a given Brazilian state, and calculates the proportion of female uses of that name. The function then classifies a name as male or female only when that proportion is beyond a given threshold (e.g., female if proportion > 0.9, or male if proportion <= 0.1); names whose proportion falls between those thresholds are classified as missing (NA). An example:
library(genderBR)
get_gender("joão")
get_gender("ana")
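The classification rule described above can be sketched in base R. This is only an illustration of the thresholding logic, not the package's internal code: `classify_by_proportion` is a hypothetical helper, the `"Female"`/`"Male"` labels are assumed, and the counts passed to it are arbitrary inputs rather than census figures.

```r
# Hypothetical helper (not part of genderBR): classify names from raw counts
# of female and male uses, leaving ambiguous names as NA.
classify_by_proportion <- function(n_female, n_male, threshold = 0.9) {
  # Proportion of female uses of each name
  prop_female <- n_female / (n_female + n_male)

  # Female above the threshold, male at or below its complement, NA otherwise
  ifelse(prop_female > threshold, "Female",
         ifelse(prop_female <= 1 - threshold, "Male", NA_character_))
}

# Three arbitrary names: mostly female, mostly male, and ambiguous
classify_by_proportion(n_female = c(95, 5, 60), n_male = c(5, 95, 40))
```

With a threshold of 0.9, the third name (60% female uses) falls between the two cut-offs and is therefore left unclassified.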
Multiple names can be passed in the same function call:
get_gender(c("pedro", "maria"))
And both full names and names written in lower or upper case are accepted as inputs:
get_gender("Mario da Silva")
get_gender("ANA MARIA")
Additionally, one can filter results by state with the argument state, or get the probability that a given first name belongs to a female person by setting the prob argument to TRUE (defaults to FALSE).
# What is the probability that the name Ariel belongs to a female person in Brazil?
get_gender("Ariel", prob = TRUE)

# What about differences between Brazilian states?
get_gender("Ariel", prob = TRUE, state = "RJ") # RJ, Rio de Janeiro
get_gender("Ariel", prob = TRUE, state = "RS") # RS, Rio Grande do Sul
get_gender("Ariel", prob = TRUE, state = "SP") # SP, Sao Paulo
Note that a vector of state abbreviations is also a valid input for the get_gender function, so this works as well:
name <- rep("Ariel", 3)
states <- c("rj", "rs", "sp")
get_gender(name, prob = TRUE, state = states)
This is also useful for predicting the gender of different individuals living in different states:
df <- data.frame(
  name = c("Alberto da Silva", "Maria dos Santos", "Thiago Rocha", "Paula Camargo"),
  uf = c("AC", "SP", "PE", "RS"),
  stringsAsFactors = FALSE
)
df$gender <- get_gender(df$name, df$uf)
df
The genderBR package relies on Brazilian state abbreviations (acronyms) to filter results. To get a complete dataset with the full name, IBGE code, and abbreviation of each of the 27 Brazilian states, use the get_states function:
get_states()
The genderBR package can also be used to get information on the relative and total number of persons with a given name by gender and by state in Brazil. To that end, use the map_gender function:
map_gender("maria")
To restrict the query to one gender, use the optional argument gender (valid inputs are f, for female; m, for male; or NULL, the default option).
map_gender("iris", gender = "m")
To install genderBR's latest stable version from CRAN, use:
install.packages("genderBR")
To install a development version, use:
if (!require("devtools")) install.packages("devtools")
devtools::install_github("meirelesff/genderBR")
The surveyed population in the Instituto Brasileiro de Geografia e Estatistica's (IBGE) 2010 Census includes 190.8 million Brazilians, with more than 130,000 unique first names.
To extract the number of male or female uses of a given first name in Brazil, the package employs the IBGE's API and, as of version 1.1.0, an internal dataset containing all the names recorded in the IBGE's Census. In this service, different spellings (e.g., Ana and Anna, or Marcos and Markos) count as different names, and only names with more than 20 occurrences, or more than 15 occurrences in a given state, are included in the database.
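As a rough illustration of the inclusion rule just described, the base R sketch below applies the national cut-off to a toy table. The counts are invented for the example and are not IBGE figures, `toy_counts` and `is_included` are hypothetical names, and the separate per-state cut-off (more than 15 occurrences) is left out for brevity.

```r
# Toy national counts (invented for illustration, NOT real IBGE figures).
# Each spelling is its own row, mirroring how variants are counted separately.
toy_counts <- data.frame(
  name  = c("ANA", "ANNA", "MARKOS"),
  total = c(5000, 300, 12),
  stringsAsFactors = FALSE
)

# National rule from the text: keep names with more than 20 occurrences
is_included <- toy_counts$total > 20

# "ANA" and "ANNA" are kept as separate entries; "MARKOS" is dropped
toy_counts[is_included, ]
```

One practical consequence is that a rare spelling may be absent from the data even when a common variant of the same name is present.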
For more information on the IBGE's data, please check (in Portuguese): https://censo2010.ibge.gov.br/nomes/