detect_geographies: Convert messy text and email data into standardized...
In brandonleekramer/diverstidy: Standardize messy text data for geographic, population, and diversity-related research

This function standardizes messy text data containing city, region, and/or country names, as well as email domains, into standardized geographic entities. The detect_geographies() function relies on a "funnel matching" method that unnests text and then iterates over n-grams, matching all word sequences from the largest n down to n = 1, with little reliance on regular expressions or text cleaning. Currently, the function offers 14 output types including countries, continents, flag emojis, and seven languages.
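The funnel idea described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the `lookup` table, the `make_ngrams()` and `funnel_match()` helpers, and the `max_n` parameter are all hypothetical names introduced for illustration, and the real function relies on its own, much larger dictionaries.

```r
# Hypothetical lookup table (lower-case place name -> country).
# An assumption for illustration; the package ships its own dictionaries.
lookup <- c(
  "new york"       = "United States",
  "rio de janeiro" = "Brazil",
  "london"         = "United Kingdom",
  "paris"          = "France"
)

# Build every n-gram of size n from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Funnel from the longest n-grams down to single words, collecting matches.
funnel_match <- function(text, lookup, max_n = 3) {
  # Tokenize on anything that is not a letter and drop empty tokens.
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  found <- character(0)
  for (n in seq(max_n, 1)) {
    grams <- make_ngrams(tokens, n)
    hits <- lookup[grams[grams %in% names(lookup)]]
    found <- c(found, unname(hits))
  }
  if (length(found) == 0) return(NA_character_)
  # Multiple detected countries are joined with "|".
  paste(unique(found), collapse = "|")
}

funnel_match("Based in London and Paris", lookup)
# "United Kingdom|France"
```

Starting with the longest n-grams lets multi-word names such as "rio de janeiro" match as a unit before their individual words are considered. This sketch simply de-duplicates the results; it does not remove matched words from the text between passes.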

detect_geographies(
  data,
  id,
  input,
  output = c("country", "iso_2", "iso_3", "flag", "continent", "region", "sub_region",
    "int_region", "country_chinese", "country_russian", "country_french",
    "country_spanish", "country_arabic"),
  email = FALSE,
  cities = TRUE,
  denonyms = TRUE
)

`data`	A data frame or data frame extension (e.g. a tibble).
`id`	A numeric or character vector unique to each entry.
`input`	Character vector of text data that includes the names of cities, states, and/or countries that will be standardized into country names or country codes. If multiple countries are detected, they will be separated by the "|" symbol.
`output`	Output column. Options include 'country', 'iso_2', 'iso_3', 'flag', 'continent', 'region', 'sub_region', 'int_region', 'country_arabic', 'country_chinese', 'country_french', 'country_russian', and 'country_spanish'.
`email`	Character vector of email or email domain information. Defaults to FALSE.
`cities`	Optional argument to detect major cities in each country. Defaults to TRUE.
`denonyms`	Optional argument to detect demonyms (names for the inhabitants) of each country. Defaults to TRUE.
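Because entries with several detected countries are returned as a single "|"-separated string, downstream code usually needs to split them. The following base R sketch shows one way to do that; the `detected` vector is made-up example data, not actual package output.

```r
# Hypothetical output column, as it might look after detection.
detected <- c("United States", "France|Germany", NA)

# fixed = TRUE treats "|" as a literal character rather than regex alternation.
split_countries <- strsplit(detected, "|", fixed = TRUE)

# Count the countries detected per entry (an NA entry counts as zero).
n_countries <- vapply(split_countries, function(x) sum(!is.na(x)), integer(1))
n_countries
# 1 2 0
```

Passing `fixed = TRUE` matters here: without it, "|" is interpreted as a regular-expression alternation and the strings are not split on the separator as intended.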

library(tidyverse)
library(diverstidy)
data(github_users)

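# Classify each GitHub user by country, using the free-text location
# field and the email column.
classified_by_text <- github_users %>%
  detect_geographies(id = login, input = location, output = "country", email = email)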

brandonleekramer/diverstidy documentation built on Dec. 19, 2021, 11:42 a.m.
