detect_geographies: Convert messy text and email data into standardized...


View source: R/detect_geographies.R

Description

This function converts messy text data that contains city, region, and/or country names, as well as email domains, into standardized geographic entities. detect_geographies() relies on a "funnel matching" method: it unnests the text into n-grams and iterates from the longest word sequences (length n) down to single words (n = 1), matching all word sequences of each length, with little reliance on regular expressions or text cleaning. Currently, the function offers 14 output types, including countries, continents, flag emojis, and seven languages.
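The funnel idea described above can be illustrated with a small base R sketch. This is not the package's implementation: `toy_dictionary`, `funnel_match()`, and `max_n` are hypothetical names made up for this example, and the dictionary entries are illustrative only. The sketch assumes that longer word sequences are tried first and that words claimed by a longer match are not matched again on their own.

```r
# Hypothetical toy dictionary: lower-case place names mapped to countries.
# These entries are illustrative, not the package's lookup tables.
toy_dictionary <- c(
  "new york city" = "United States",
  "new zealand"   = "New Zealand",
  "paris"         = "France",
  "york"          = "United Kingdom"
)

# Hypothetical helper: match the longest word sequences first, then shorter ones.
funnel_match <- function(text, dictionary, max_n = 3) {
  # Lower-case the text and split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  used <- rep(FALSE, length(tokens))  # words already claimed by a longer match
  found <- character(0)

  for (n in seq(max_n, 1)) {
    if (length(tokens) < n) next
    for (start in seq_len(length(tokens) - n + 1)) {
      idx <- start:(start + n - 1)
      if (any(used[idx])) next
      ngram <- paste(tokens[idx], collapse = " ")
      if (ngram %in% names(dictionary)) {
        found <- c(found, dictionary[[ngram]])
        used[idx] <- TRUE
      }
    }
  }
  # Join multiple matches with "|", mirroring the separator described for `input`
  paste(unique(found), collapse = "|")
}

funnel_match("Born in Paris, now based in New York City", toy_dictionary)
# "United States|France"
```

Because "new york city" is matched at n = 3 before single words are tried, the word "york" is never matched on its own, so the toy "york" entry does not produce a false match here.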

Usage

detect_geographies(
  data,
  id,
  input,
  output = c("country", "iso_2", "iso_3", "flag", "continent", "region", "sub_region",
    "int_region", "country_chinese", "country_russian", "country_french",
    "country_spanish", "country_arabic"),
  email = FALSE,
  cities = TRUE,
  denonyms = TRUE
)

Arguments

data

A data frame or data frame extension (e.g. a tibble).

id

A numeric or character vector unique to each entry.

input

Character vector of text data that includes the names of cities, states, and/or countries that will be standardized into country names or country codes. If multiple countries are detected, they will be separated by the "|" symbol.

output

Output column. Options include 'country', 'iso_2', 'iso_3', 'flag', 'continent', 'region', 'sub_region', 'int_region', 'country_arabic', 'country_chinese', 'country_french', 'country_russian', and 'country_spanish'.

email

Character vector of email or email domain information. Defaults to FALSE.

cities

Optional argument to detect major cities in each country. Defaults to TRUE.

denonyms

Optional argument to detect demonyms, i.e. the names used for the inhabitants of each country. Defaults to TRUE.
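This page does not describe how the email argument is matched internally. As a rough illustration only, one plausible approach is to look up the final label of the domain (the country-code top-level domain) in a table. In the base R sketch below, `toy_cctld` and `country_from_email()` are hypothetical names and are not part of diverstidy.

```r
# Hypothetical lookup from country-code top-level domains to country names.
# A real table would cover many more domains.
toy_cctld <- c(de = "Germany", fr = "France", jp = "Japan", br = "Brazil")

# Hypothetical helper: return the country implied by the last domain label,
# or NA when the label is not in the lookup (e.g. generic domains like "com").
country_from_email <- function(email, lookup) {
  # Drop everything up to and including the last dot, then lower-case
  tld <- tolower(sub(".*\\.", "", email))
  unname(lookup[tld])
}

country_from_email(
  c("ana@uni-heidelberg.de", "ken@example.co.jp", "sam@example.com"),
  toy_cctld
)
# "Germany" "Japan" NA
```

Generic domains such as .com carry no country signal in this sketch and fall through as NA.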

Examples

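library(tidyverse)   # provides the %>% pipe
library(diverstidy)
data(github_users)   # load the github_users example data

# Standardize the free-text location column into country names,
# also passing the email column
classified_by_text <- github_users %>%
  detect_geographies(id = login, input = location,
                     output = "country", email = email)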
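Because entries with several detected countries are separated by "|", a common follow-up step is to reshape the result to one row per id and country. The base R sketch below uses a made-up data frame; the `login` and `country` column names and all values are assumptions for illustration, not actual function output.

```r
# Hypothetical result, shaped like the output described for `input`
classified <- data.frame(
  login   = c("user_a", "user_b", "user_c"),
  country = c("France", "United States|Canada", "Germany"),
  stringsAsFactors = FALSE
)

# Split on the literal "|" (fixed = TRUE prevents it being read as regex alternation)
parts <- strsplit(classified$country, "|", fixed = TRUE)

# Build one row per login-country pair
long <- data.frame(
  login   = rep(classified$login, lengths(parts)),
  country = unlist(parts),
  stringsAsFactors = FALSE
)

long$country
# "France" "United States" "Canada" "Germany"
```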

brandonleekramer/diverstidy documentation built on Dec. 19, 2021, 11:42 a.m.