get_dictionary_and_links: Scrape the Drug Definitions and Links from the NCI Drug...

View source: R/cg-internals.R

Description

Runs the full sequence that scrapes and parses the NCI Drug Dictionary at Cancer.gov, along with any correlates to the NCI Thesaurus, and stores the results in a Postgres database.

Usage

get_dictionary_and_links(
  conn,
  max_page = 50,
  sleep_time = 3,
  verbose = TRUE,
  render_sql = TRUE,
  crawl_delay = 5,
  size = 10000
)

Arguments

conn

Postgres connection object.

max_page

Maximum page number to iterate the scrape over in the "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page=" path. Default: 50

sleep_time

Time in seconds for the system to sleep before each scrape with read_html. Default: 3

verbose

When reading from a slow connection, this prints some output on every iteration so you know it's working. Default: TRUE
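The way max_page, sleep_time, and verbose interact can be illustrated with a minimal base R sketch. It is not the package's implementation: it assumes pages are numbered from 1, sets sleep_time to 0 so it runs instantly, and only prints each URL where a real run would fetch it.

```r
# Base path taken from the max_page argument description above.
base_url <- "https://www.cancer.gov/publications/dictionaries/cancer-drug?expand=ALL&page="

max_page   <- 3     # small value for illustration; the documented default is 50
sleep_time <- 0     # documented default is 3; zero keeps this sketch fast
verbose    <- TRUE

# Assumption: page numbers start at 1 and run up to max_page.
page_urls <- paste0(base_url, seq_len(max_page))

for (i in seq_along(page_urls)) {
  # Pause before each request to avoid overloading the site.
  Sys.sleep(sleep_time)

  # A real run would fetch and parse the page at this point,
  # for example with xml2::read_html(page_urls[i]).
  if (verbose) {
    message(sprintf("[%d/%d] %s", i, length(page_urls), page_urls[i]))
  }
}
```

With the documented defaults, a full run would pause at least 50 x 3 = 150 seconds in total before any time spent on the requests themselves.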

Details

Scrapes the definitions and the links to each drug page from the main Drug Dictionary pages at https://www.cancer.gov/publications/dictionaries/cancer-drug and stores the parsed responses in the Drug Dictionary and Drug Link Tables, respectively.

Value

Any differences found between the scraped data and the existing data in the Drug Dictionary and Drug Link Tables are appended to their respective tables with the local timestamp.
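The append-only update described here can be sketched in base R with in-memory data frames standing in for the database tables. This is an assumption-laden illustration rather than the package's code: the column names (drug, definition, scraped_datetime), the placeholder rows, and the row_key helper are all hypothetical.

```r
# Stand-in for the existing Drug Dictionary Table (placeholder data).
existing <- data.frame(
  drug = c("drug_a", "drug_b"),
  definition = c("definition of a", "definition of b"),
  scraped_datetime = as.POSIXct("2020-12-01 00:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Stand-in for a fresh scrape: drug_b has a changed definition, drug_c is new.
scraped <- data.frame(
  drug = c("drug_a", "drug_b", "drug_c"),
  definition = c("definition of a", "revised definition of b", "definition of c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one comparison key per row from the compared columns.
row_key <- function(df) paste(df$drug, df$definition, sep = "\r")

# Keep only the scraped rows that are not already present in the table.
new_rows <- scraped[!(row_key(scraped) %in% row_key(existing)), , drop = FALSE]

# Stamp the differences with the local time of this run.
new_rows$scraped_datetime <- rep(Sys.time(), nrow(new_rows))

# Append rather than overwrite, so earlier versions of a row are retained.
updated <- rbind(existing, new_rows)
```

In this sketch the unchanged drug_a row is skipped, while the revised drug_b row and the new drug_c row are appended, leaving the original drug_b row in place as history.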

Web Source Types

The NCI Drug Dictionary has two data sources that run in parallel. The first source is the Drug Dictionary itself at https://www.cancer.gov/publications/dictionaries/cancer-drug. The second source is the set of individual drug pages, called Drug Detail Links in skyscraper, which contain tables of synonyms, including investigational names.

Drug Dictionary

The listed drug names and their definitions are scraped from the Drug Dictionary HTML and written to a Drug Dictionary Table in a cancergov schema.

See Also

brake_closed_conn, query, appendTable; typewrite_progress; html_nodes, html_text; tibble; bind, mutate, distinct


meerapatelmd/skyscraper documentation built on Dec. 27, 2020, 7:46 a.m.