gender_df: Use gender prediction with data frames


View source: R/gender_df.R

Description

In a common use case for gender prediction, you have a data frame with a column for first names and a column for birth years (or two columns specifying the minimum and maximum potential birth year). This function wraps the gender function to apply it efficiently to such a data frame. The result is a data frame with one gender prediction for each unique combination of first name and birth year (or year range). The resulting data frame can then be merged back into your original data frame.
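The merge step mentioned above could look roughly like the following sketch. It uses only base R so that it runs without the gender package installed: `predictions` is a hand-typed stand-in for the output of `gender_df()`, with values copied from the example output on this page, and the object name `merged` is purely illustrative. The join keys assume the single-year case, where `year_min` equals the birth year.

```r
# Original data, as in the Examples section (five rows, one duplicate).
demo_df <- data.frame(
  names = c("Hillary", "Hillary", "Hillary", "Madison", "Madison"),
  birth_year = c(1930, 2000, 1930, 1930, 2000),
  stringsAsFactors = FALSE
)

# Stand-in (assumption) for the result of
#   gender_df(demo_df, method = "demo",
#             name_col = "names", year_col = "birth_year")
# Values are copied from the example output shown on this page.
predictions <- data.frame(
  name = c("Hillary", "Madison", "Hillary", "Madison"),
  proportion_male = c(1, 1, 0, 0.0069),
  proportion_female = c(0, 0, 1, 0.993),
  gender = c("male", "male", "female", "female"),
  year_min = c(1930, 1930, 2000, 2000),
  year_max = c(1930, 1930, 2000, 2000),
  stringsAsFactors = FALSE
)

# Left join: every original row is kept and gains the prediction
# for its (name, year) combination. Note that merge() sorts by key.
merged <- merge(
  demo_df, predictions,
  by.x = c("names", "birth_year"),
  by.y = c("name", "year_min"),
  all.x = TRUE
)

merged[, c("names", "birth_year", "gender")]
```

With a year range, the same idea should apply with the minimum and maximum year columns as join keys alongside the name. If dplyr is already loaded, `left_join()` with a named `by` vector is an equivalent alternative to `merge()`.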

Usage

gender_df(
  data,
  name_col = "name",
  year_col = "year",
  method = c("ssa", "ipums", "napp", "demo")
)

Arguments

data

A data frame containing first names and either a birth year or a range of potential birth years.

name_col

A string specifying the name of the column containing the first names.

year_col

Either a single string specifying the name of the column containing the birth year associated with each first name, or a character vector with two elements: the names of the columns containing the minimum and maximum years of the range of potential birth years.

method

One of the historical methods provided by this package: "ssa", "ipums", "napp", or "demo". See gender for details.

Value

A data frame with columns from the output of the gender function, and one row for each unique combination of first names and birth years.
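To see why the result can have fewer rows than the input, the following base R sketch counts the unique name and year combinations in the data used in the Examples section. It only illustrates the idea of "one row per unique combination"; it is not a claim about how `gender_df()` deduplicates internally, and `unique_pairs` is an illustrative name.

```r
# Same names and birth years as the example data on this page.
demo_df <- data.frame(
  names = c("Hillary", "Hillary", "Hillary", "Madison", "Madison"),
  birth_year = c(1930, 2000, 1930, 1930, 2000),
  stringsAsFactors = FALSE
)

# Keep one row per distinct (name, birth year) pair.
unique_pairs <- unique(demo_df[, c("names", "birth_year")])

nrow(demo_df)       # 5 input rows
nrow(unique_pairs)  # 4 unique combinations: Hillary 1930 appears twice
```

This matches the four-row tibbles in the example output, where the duplicate row for Hillary in 1930 yields a single prediction.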

See Also

gender

Examples

library(gender)
library(dplyr)

# Example data: first names with a birth year and a +/- 1 year range
demo_df <- tibble(names = c("Hillary", "Hillary", "Hillary",
                            "Madison", "Madison"),
                  birth_year = c(1930, 2000, 1930, 1930, 2000),
                  min_year = birth_year - 1,
                  max_year = birth_year + 1)

# Using the birth year for the predictions.
# Notice that the duplicate value for Hillary in 1930 is removed
gender_df(demo_df, method = "demo",
          name_col = "names", year_col = "birth_year")

# Using a range of years
gender_df(demo_df, method = "demo",
          name_col = "names", year_col = c("min_year", "max_year"))

Example output

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# A tibble: 4 x 6
  name    proportion_male proportion_female gender year_min year_max
* <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Hillary          1                  0     male       1930     1930
2 Madison          1                  0     male       1930     1930
3 Hillary          0                  1     female     2000     2000
4 Madison          0.0069             0.993 female     2000     2000
# A tibble: 4 x 6
  name    proportion_male proportion_female gender year_min year_max
* <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Hillary          1                  0     male       1929     1931
2 Madison          1                  0     male       1929     1931
3 Hillary          0.0065             0.994 female     1999     2001
4 Madison          0.0072             0.993 female     1999     2001

gender documentation built on July 1, 2020, 7:02 p.m.