addAgeGender: Function to enrich a filtered corpus with Twitter users' most...

Description

This function cross-references the 'name' field in the corpus files with a large database of baby-name statistics drawn from two sources: the United States Social Security Administration (included in the R package 'babynames' by Hadley Wickham) and the Spanish Instituto Nacional de Estadística (INE). The function implements a cascade: it first attempts to find exact matches, and then resorts to approximate string matching using Levenshtein distance.
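
The cascade described above can be illustrated with a minimal base-R sketch. This is not the package's implementation (which, per the arguments below, does its approximate matching in multithreaded C++); the `reference` table and the `match_name` helper are hypothetical names invented for illustration, and `utils::adist` stands in for the Levenshtein computation.

```r
# Hypothetical reference table standing in for the babynames/INE data.
reference <- data.frame(
  name = c("john", "mary", "carlos"),
  sex  = c("M", "F", "M"),
  year = c(1950L, 1948L, 1975L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact match first, then a Levenshtein fallback.
match_name <- function(query, reference, maxDistance = 1) {
  query <- tolower(query)

  # Step 1: exact match on the name column.
  idx <- match(query, reference$name)
  if (!is.na(idx)) {
    return(reference[idx, c("sex", "year")])
  }

  # Step 2: approximate match; adist() returns edit distances
  # between the query and every reference name.
  d <- utils::adist(query, reference$name)[1, ]
  best <- which.min(d)
  if (d[best] <= maxDistance) {
    return(reference[best, c("sex", "year")])
  }

  # No candidate within maxDistance: return missing values.
  data.frame(sex = NA_character_, year = NA_integer_)
}

match_name("Mary", reference)   # exact match after lower-casing
match_name("jon", reference)    # one edit away from "john"
match_name("xyzzy", reference)  # nothing within distance 1
```

Raising `maxDistance` in a sketch like this admits more candidates but also more false matches, which is the trade-off the `maxDistance` argument controls.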

Usage

addAgeGender(filtered_corpus, language = c("English", "Spanish"),
  maxDistance = 1, nthreads = parallel::detectCores())

Arguments

maxDistance

Maximum Levenshtein distance to use for approximate string matching. Defaults to 1 (per the usage signature).

nthreads

Number of threads to use in the C++ code for approximate string matching. Defaults to the number of CPU cores on your machine, which is usually the best choice.

filtered_corpus

Filtered corpus. Do not use on unfiltered data if you want to get results in this century.

Value

A data.frame with two added columns: gender (column 'sex') and most likely year of birth (column 'year').


jeroenclaes/tweetCorp documentation built on May 27, 2019, 4:50 a.m.