Introduction to DemografixeR

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Let's illustrate the usefulness of DemografixeR with a simple example. Say we know the first name of a sample of customers, but useful information about gender, age or nationality is unavailable:

df <- structure(list(X1 = "Maria", X2 = "Ben", X3 = "Claudia", X4 = "Adam", 
    X5 = "Hannah", X6 = "Robert"), class = "data.frame", row.names = "**Customers:**")
knitr::kable(df, col.names = NULL)

It's common knowledge that names have a strong sociocultural influence - names' popularity vary across time and location - and these naming conventions may be good predictors for other useful variables such as gender, age & nationality. Here's where DemografixeR comes in:

"DemografixeR allows R users to connect directly to the (1) genderize.io API, the (2) agify.io API and the (3) nationalize.io API to obtain the (1) gender, (2) age & (3) nationality of a name in a tidy format."

DemografixeR deals with the hassle of API pagination, missing values, duplicated names, trimming whitespace and parsing the results in a tidy format, giving the user time to analyze instead of tidying the data.

To do so, DemografixeR is based on these three main pillar functions, which we will use to predict the key demographic variables of the previous sample of customers, so that we can 'fix' the missing demographic information:

`` {r package structure, echo=FALSE, cache=FALSE} df <- structure(list(API = c("https://genderize.io", "https://agify.io", "https://nationalize.io"),R function= c("genderize(name)", "agify(name)", "nationalize(name)"),Estimated variable` = c("Gender", "Age", "Nationality")), class = "data.frame", row.names = c(NA, -3L))

knitr::kable(df, row.names = FALSE)

They all work similarly, and allow to be integrated in multiple workflows. Using the previous group of customers, we can obtain the following results:

```r
df <- structure(list(X1 = c("Maria", "female", "21", "CY"), X2 = c("Ben", 
"male", "48", "AU"), X3 = c("Claudia", "female", "45", "CL"), 
    X4 = c("Adam", "male", "34", "PL"), X5 = c("Hannah", "female", 
    "27", "SL"), X6 = c("Robert", "male", "59", "US")), row.names = c("**Customers:**", 
"**Estimated gender:**", "**Estimated age:**", "**Estimated nationality:**"
), class = "data.frame")

knitr::kable(df, col.names = NULL)

To see how to get to these results, read on!

Get Started

Setup

First, we need to load the package:

library("DemografixeR")

API credentials

The following step is optional, it is only necessary if you plan to estimate gender, age or nationality for more than 1000 different names a day. To do so, you need to obtain an API key from the following link:

To use the API key, simply save it only once with the save_api_key(key) and you're all set. All the functions will automatically retrieve the key once saved:

save_api_key(key = "__YOUR_API_KEY__")

Please be careful when dealing with secrets/tokens/credentials and do not share them publicly. Yet, if you wish explicitly know which API key you've saved, retrieve it with the get_api_key() function. To fully remove the saved key use the remove_api_key() function.

Gender

We start by predicting the gender from our customers. For this we use the genderize(name) function:

customers_names <- c("Maria", "Ben", "Claudia", 
                     "Adam", "Hannah", "Robert")
customers_predicted_gender <- genderize(name = customers_names)
customers_predicted_gender # Print results

We see that genderize(name) returns the estimated gender for each name as a character vector:

class(customers_predicted_gender)

Yet, it is also possible to obtain a detailed data.frame object with additional information. DemografixeR also allows to use 'pipes':

gender_df <- genderize(name = customers_names, simplify = FALSE)
customers_names %>% 
  genderize(simplify = FALSE) %>% 
  knitr::kable(row.names = FALSE)
df <- structure(list(name = c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert"),
                     type = c("gender", "gender", "gender", "gender", "gender", "gender"),
                     gender = c("female", "male", "female", "male", "female", "male"), 
                     probability = c(0.98, 0.95, 0.98, 0.98, 0.97, 0.99), 
                     count = c(334287L, 77991L, 118604L, 116396L, 13198L, 177418L)), 
                row.names = c(5L, 2L, 3L, 1L, 4L, 6L), class = "data.frame")

knitr::kable(df, row.names = FALSE)

Age

We continue with the age estimation of our customers. As with the genderize(name) function, the simplify parameter also works with the agify(name) function to retrieve a data.frame:

customers_predicted_age <- agify(name = customers_names, simplify = FALSE)

customers_names %>% 
  agify(simplify = FALSE) %>% 
  knitr::kable(row.names = FALSE)
df <- structure(list(name = c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert"), 
               type = c("age", "age", "age", "age", "age", "age"), 
               age = c(21L, 48L, 45L, 34L, 27L, 59L), 
               count = c(517258L, 75632L, 110105L, 110754L, 12843L, 160915L)), 
          row.names = c(5L, 2L, 3L, 1L, 4L, 6L), class = "data.frame")

knitr::kable(df, row.names = FALSE)

Nationality

Last but not least, we finish with the nationality extrapolation. Equally as with the genderize(name) and agify(name) function, the simplify parameter also works with the nationalize(name) function to retrieve a data.frame:

customers_predicted_nationality <- nationalize(name = customers_names, simplify = FALSE)

customers_names %>% 
  nationalize(simplify = FALSE) %>% 
  knitr::kable(row.names = FALSE)
df <- structure(list(name = c("Maria", "Ben", "Claudia", "Adam", "Hannah", "Robert"), 
               type = c("nationality", "nationality", "nationality", "nationality", "nationality", "nationality"),
               country_id = c("CY", "AU", "CL", "PL", "SL", "US"), 
               probability = c(0.0550797623186088, 0.0665534047056411, 0.0559340459826196, 0.0905835974781003, 0.267325397602591, 0.0909442129069588)), 
          row.names = c(5L, 2L, 3L, 1L, 4L, 6L), class = "data.frame")

knitr::kable(df, row.names = FALSE)

Other parameters

country_id parameter

Responses of names will in a lot of cases be more accurate if the data is narrowed to a specific country. Luckily, both the genderize(name) and agify(name) function support passing a country code parameter (following the common ISO 3166-1 alpha-2 country code convention). For obvious reasons the nationalize(name) does not:

us_customers_predicted_gender<-genderize(name = customers_names, 
                                         country_id = "US")
us_customers_predicted_gender

us_customers_predicted_age<-agify(name = customers_names,
                                  country_id = "US")
us_customers_predicted_age

To obtain a data.frame of all supported countries, use the supported_countries(type) function. Here's an example of 5 countries:

supported_countries(type = "genderize") %>% 
  head(5) %>%
  knitr::kable(row.names = FALSE)
df <- structure(list(country_id = c("AD", "AE", "AF", "AG", "AI"), 
               name = c("Andorra", "United Arab Emirates", "Afghanistan", "Antigua and Barbuda", "Anguilla"),
               total = c(29783L, 145847L, 23531L, 1723L, 1081L)), 
          row.names = c(NA, 5L), class = "data.frame")

knitr::kable(df, row.names = FALSE)

In this case the total column reflects the number of observations the API has for each country. The beauty of the country_id parameter lies in that it allows to pass a single character string or a character vector with the same length as the name parameter. An example illustrates this better:

agify(name = c("Hannah", "Ben"),
      country_id = c("US", "GB"),
      simplify = FALSE) %>%
  knitr::kable(row.names = FALSE)
df <- structure(list(name = c("Hannah", "Ben"), 
                     type = c("age", "age"),
                     age = c(54L, 38L),
                     count = c(67L, 1980L), 
                     country_id = c("US", "GB")),
                row.names = 2:1, class = "data.frame") 

knitr::kable(df, row.names = FALSE)

In this previous example we passed two names - Hannah & Ben - and two country codes - US & GB. Thus, the functions allow to pass vectorized vectors - this is especially useful for workflows where we are using a data.frame with a variable with names and another variable containing country codes.

meta parameter

All three functions have a parameter defined as meta, which returns information about the API itself, such as:

Here's an example:

genderize(name = "Hannah", 
          simplify = FALSE, 
          meta = TRUE) %>% 
  knitr::kable(row.names = FALSE)
df <- structure(list(name = "Hannah",
                     type = "gender",
                     gender = "female", 
                     probability = 0.97, 
                     count = 13198L, 
                     api_rate_limit = 1000L, 
                     api_rate_remaining = 977L,
                     api_rate_reset = 7218L,
                     api_request_timestamp = 
                       structure(1588715982, class = c("POSIXct", "POSIXt"), tzone = "GMT")),
                row.names = 1L, class = "data.frame")
knitr::kable(df, row.names = FALSE)

sliced parameter

The nationalize(name) function has the useful sliced parameter. Logically, names can have multiple estimated nationalities - and the nationalize(name) function automatically ranks them by probability. This logical parameter allows to 'slice'/keep only the value with the highest probability to keep a single estimate for each name (one country per name) - and is set by default to TRUE. But you may wish to see all to potential countries a name can be associated to. For this simply set the parameter to FALSE:

nationalize(name = "Matthias", 
            simplify = FALSE, 
            sliced=FALSE) %>% 
  knitr::kable(row.names = FALSE)
df <- structure(list(name = c("Matthias", "Matthias", "Matthias"), 
                     type = c("nationality", "nationality", "nationality"),
                     country_id = c("DE", "AT", "CH"), 
                     probability = c(0.41616382151, 0.265062479571606, 0.1106921611552)), 
                row.names = c(NA, 3L), class = "data.frame")
knitr::kable(df, row.names = FALSE)

In the last example you see that instead of returning a single country code, it returns multiple country codes with their associated probability.

Customers example

Let's replicate the initial example with our group of customers. VoilĂ !

library(dplyr)

df<-data.frame("Customers:"=c("Maria", "Ben", "Claudia",
                           "Adam", "Hannah", "Robert"), 
               stringsAsFactors = FALSE,
               check.names = FALSE)

df <- df %>% mutate(`Estimated gender:`= genderize(`Customers:`),
                    `Estimated age:`= agify(`Customers:`),
                    `Estimated nationality:`= nationalize(`Customers:`))

df %>% t() %>% knitr::kable(col.names = NULL)
df <- structure(
  c("Maria", "female", "21", "CY", "Ben", "male", "48", 
    "AU", "Claudia", "female", "45", "CL", "Adam", "male", "34", 
    "PL", "Hannah", "female", "27", "SL", "Robert", "male", "59", "US"),
  .Dim = c(4L, 6L), 
  .Dimnames = list(c("Customers:", "Estimated gender:", "Estimated age:", "Estimated nationality:"), NULL))

knitr::kable(df, col.names = NULL)

Further information

For more information access the package documentation at https://matbmeijer.github.io/DemografixeR.



Try the DemografixeR package in your browser

Any scripts or data that you put into this service are public.

DemografixeR documentation built on July 8, 2020, 5:18 p.m.