View source: R/detect_geographies.R
This function standardizes messy text data containing city, region, and/or country names, as well as email domains, into standardized geographic entities. The detect_geographies() function relies on a "funnel matching" method: it unnests the text into n-grams and then iterates over them, matching word sequences from the longest (n) down to single words (n = 1), with little reliance on regular expressions or text cleaning. Currently, the function offers 14 output types, including countries, continents, flag emojis, and seven languages.
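The funnel idea described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the `lookup` table, `make_ngrams()`, and `funnel_match()` are hypothetical names introduced only for illustration, and the sketch simplifies by not removing matched words before moving to shorter n-grams.

```r
# Hypothetical lookup table: lower-cased n-gram -> country.
# This is an illustrative stand-in, not the dictionary shipped with the package.
lookup <- c(
  "new york" = "United States",
  "united kingdom" = "United Kingdom",
  "paris" = "France",
  "berlin" = "Germany"
)

# Build all contiguous n-word sequences from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

# Funnel from the longest n-grams down to single words, collecting matches.
funnel_match <- function(text, lookup, max_n = 3) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  found <- character(0)
  for (n in seq(max_n, 1)) {
    grams <- make_ngrams(tokens, n)
    hits <- unname(lookup[intersect(grams, names(lookup))])
    found <- union(found, hits)
  }
  # Multiple matches are joined with "|", mirroring the documented output format.
  paste(found, collapse = "|")
}

funnel_match("Based in New York, often in Paris", lookup)
# returns "United States|France"
```

Matching longer sequences first lets multi-word names such as "new york" be recognized as a unit before any single word is considered.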
data | A data frame or data frame extension (e.g. a tibble).
id | A numeric or character vector unique to each entry.
input | Character vector of text data that includes the names of cities, states, and/or countries to be standardized into country names or country codes. If multiple countries are detected, they are separated by the "|" symbol.
output | Output column. Options include 'country', 'iso2', 'iso3', 'flag', 'continent', 'region', 'sub_region', 'int_region', 'country_arabic', 'country_chinese', 'country_french', 'country_russian', and 'country_spanish'.
email | Character vector of email or email domain information. Defaults to FALSE.
cities | Optional argument to detect major cities in each country. Defaults to TRUE.
denonyms | Optional argument to detect demonyms (names for the inhabitants) of each country. Defaults to TRUE.
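Because multiple detected countries are joined with "|", downstream code usually needs to split that column. The following is a minimal base R sketch; the `detected` vector is made-up illustrative data, and treating unmatched entries as NA is an assumption rather than documented behavior.

```r
# Hypothetical output values, shaped like the "|"-separated results described above.
# Assumption: entries with no match are NA.
detected <- c("United States|France", "Germany", NA)

# fixed = TRUE treats "|" literally; as a regular expression it would mean alternation.
countries <- strsplit(detected, "|", fixed = TRUE)

# Number of countries detected per entry (0 for missing values).
n_countries <- vapply(countries, function(x) sum(!is.na(x)), integer(1))
```

Passing `fixed = TRUE` matters here: without it, `"|"` is interpreted as a regular expression and the string is split between every character.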
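library(tidyverse)
library(diverstidy)
# Load the example data set shipped with the package
data(github_users)
# Standardize the free-text location column to country names,
# passing the email column as an additional signal
classified_by_text <- github_users %>%
  detect_geographies(login, location, "country", email)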