After working with tldextract (https://github.com/john-kurkowski/tldextract) in Python, I wanted the same functionality in R. The list of top-level domains can be loaded automatically from https://publicsuffix.org/list/effective_tld_names.dat, and a cached version of that data is stored in the package.

Installation

To install this package, use the devtools package:

devtools::install_github("jayjacobs/tldextract")

Usage

library(tldextract)
# use the cached lookup data, simple call
tldextract("www.google.com")

# it can take multiple domains at the same time
tldextract(c("www.google.com", "www.google.com.ar", "googlemaps.ca", "tbn0.google.cn"))
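
To show what kind of split is involved, here is a minimal base-R sketch of longest-suffix matching. It is not the package's implementation: split_host and the small suffixes vector are hypothetical names made up for this illustration, and the real Public Suffix List also contains wildcard and exception rules that the sketch ignores.

```r
# Toy suffix list (assumption for illustration); the real list is far larger.
suffixes <- c("com", "ca", "cn", "uk", "co.uk", "ar", "com.ar")

# Hypothetical helper: split one hostname into subdomain, domain, and suffix.
split_host <- function(host, suffixes) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Try candidate suffixes from longest to shortest; the first match wins.
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      domain <- if (i > 1) labels[i - 1] else NA_character_
      subdomain <- if (i > 2) {
        paste(labels[1:(i - 2)], collapse = ".")
      } else {
        NA_character_
      }
      return(c(host = host, subdomain = subdomain,
               domain = domain, tld = candidate))
    }
  }
  # No known suffix matched.
  c(host = host, subdomain = NA_character_,
    domain = NA_character_, tld = NA_character_)
}

split_host("www.google.com.ar", suffixes)
split_host("googlemaps.ca", suffixes)
```

Checking the longest candidate first is what lets "com.ar" win over "ar" for www.google.com.ar, so the domain comes out as "google" rather than "com".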

The specification for the top-level domains is cached in the package as the tldnames data set and can be inspected.

# load and view the cached TLD list in the tldnames data
data(tldnames)
head(tldnames)

If the cached version is out of date and the package hasn't been updated, the data can be loaded manually and then passed to the tldextract function.

# get most recent TLD listings
tld <- getTLD() # optionally pass in a different URL than the default
manyhosts <- c("pages.parts.marionautomotive.com", "www.embroiderypassion.com", 
               "fsbusiness.co.uk", "www.vmm.adv.br", "ttfc.cn", "carole.co.il",
               "visiontravail.qc.ca", "mail.space-hoppers.co.uk", "chilton.k12.pa.us")
tldextract(manyhosts, tldnames=tld)


jayjacobs/tldextract documentation built on Jan. 7, 2020, 12:25 a.m.