chemidplus_scraping_functions: ChemiDPlus Scraping Functions

Description

All ChemiDPlus Scraping Functions operate on a Registry Number URL (rn_url). The initial search is logged to the "REGISTRY_NUMBER_LOG" Table. The RN URL is then tested for a 404 status, and the result is logged to the "RN_URL_VALIDITY" Table. The major sections found at the ChemiDPlus site are "Names and Synonyms", "Classification", "Registry Numbers", and "Links to Resources". These sections are written to their respective tables: "NAMES_AND_SYNONYMS", "CLASSIFICATION", "REGISTRY_NUMBERS", and "LINKS_TO_RESOURCES".

Arguments

conn

Postgres connection object

rn_url

Registry Number URL to read, which also serves as an identifier

response

(optional) Object of class "xml_document" "xml_node" returned by xml2::read_html for the rn_url argument. Providing a response from a single HTML read reduces the chance of encountering an HTTP 503 error when parsing multiple sections from a single URL. If the response argument is missing, the URL is read and the function then pauses for sleep_time seconds.

schema

Schema that the returned data is written to, Default: 'chemidplus'

sleep_time

If the response argument is missing, the number of seconds to pause after reading the URL, Default: 3
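
The interplay between response and sleep_time can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's own code: get_response and its reader parameter are hypothetical names introduced here, with reader defaulting to xml2::read_html as named in the response argument.

```r
# Hypothetical helper (not a skyscraper function): read rn_url only when no
# response is supplied, then pause so that repeated reads are spaced out.
get_response <- function(rn_url, response, sleep_time = 3,
                         reader = xml2::read_html) {
  if (missing(response)) {
    response <- reader(rn_url)   # single HTML read of the RN URL
    Sys.sleep(sleep_time)        # pause only after an actual read
  }
  response
}

# Intended use (not run here because it needs network access):
# response <- get_response(rn_url)
# ...then pass the same `response` to each section parser.
```

Reading once and passing the same response to every section parser means the site is contacted one time per RN URL rather than once per section.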

Value

Each section is parsed by its respective skyscraper function, which stores the scraped results in a table of the same name in the given schema. If a connection argument is not provided, the results are instead returned as a dataframe in the R console.
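
The write-or-return behavior described above might look roughly like the sketch below. The function name store_or_return and its parameters are hypothetical. The use of DBI::dbWriteTable with DBI::Id is an assumption about how a Postgres connection object could be written to, not a statement of how skyscraper does it.

```r
# Hypothetical helper (not a skyscraper function): append results to
# schema.table_name when a connection is given, otherwise return them.
store_or_return <- function(results, table_name, conn, schema = "chemidplus") {
  if (missing(conn)) {
    return(results)              # no connection: hand back the dataframe
  }
  DBI::dbWriteTable(
    conn,
    name   = DBI::Id(schema = schema, table = table_name),
    value  = results,
    append = TRUE                # keep rows from earlier scrapes
  )
  invisible(results)
}
```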

Names and Synonyms

The "Names and Synonyms" Section scraped results contain a Timestamp, the RN URL, and each Synonym. If the section has subheadings, the subheading is scraped as the Synonym Type along with the Synonym itself.
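
As a rough sketch of the resulting shape, the function below turns a named list of synonyms (one element per subheading) into one row per synonym. The function name, the column names, and the example values are all hypothetical and are not taken from the package.

```r
# Hypothetical helper: `sections` is a list of character vectors; element
# names, when present, are the subheadings and become the synonym type.
synonyms_to_df <- function(rn_url, sections) {
  types <- names(sections)
  if (is.null(types)) {
    types <- rep(NA_character_, length(sections))  # no subheadings present
  }
  data.frame(
    scraped_at   = Sys.time(),                     # timestamp of the scrape
    rn_url       = rn_url,
    synonym_type = rep(types, lengths(sections)),  # repeat type per synonym
    synonym      = unlist(sections, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up input for illustration only
example <- synonyms_to_df(
  "rn-url",
  list("Type A" = c("name one", "name two"), "Type B" = "name three")
)
```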

Classification

The "Classification" Section results contain a Timestamp, RN URL, and the drug classifications on the page.

Links to Resources

The "Links to Resources" Section collects all the HTML links to other data and web sources for the drug. The results include a Timestamp, the RN URL, and each Resource Agency with its HTML link.

Registry Numbers

The "Registry Numbers" Section contains other identifiers for the given drug at other Agencies.

Registry Number Log Table

The REGISTRY_NUMBER_LOG Table is the landing table for any ChemiDPlus searches using skyscrape. It records each source concept searched with a given set of parameters, along with all the possible Registry Numbers (RN) that the source concept can be associated with in ChemiDPlus. Each Registry Number then serves as a jumping-off point: a second URL, the RN URL, is read, split by section, and written to the corresponding ChemiDPlus Tables.

The Table logs the Raw Concept, the processed version of the Concept (i.e. with spaces and error-throwing characters removed to generate a valid search URL for the Concept), the type of search (i.e. equals or contains), and the final search URL used to read a search result. A series of boolean flags records whether the search was performed (i.e. a response was received), whether the response contained any records, and whether those records were saved. If an RN was found, it is included along with the full URL associated with that RN.
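
The concept-processing step can be illustrated with the sketch below. It assumes that "error-throwing characters" means anything other than letters, digits, and hyphens. The function name, column names, and base_url are placeholders, and the real ChemiDPlus search URL pattern is not reproduced here.

```r
# Hypothetical helper: build one log row for a concept search.
# base_url is a placeholder, not the actual ChemiDPlus endpoint.
build_search_log <- function(raw_concept, type = c("contains", "equals"),
                             base_url = "https://example.org/search") {
  type <- match.arg(type)
  # Assumption: drop spaces and every character that is not a letter,
  # digit, or hyphen so the concept is safe to place in a URL path.
  processed <- gsub("[^A-Za-z0-9-]", "", raw_concept)
  data.frame(
    raw_concept       = raw_concept,
    processed_concept = processed,
    type              = type,
    url               = paste(base_url, type, processed, sep = "/"),
    stringsAsFactors  = FALSE
  )
}

log_row <- build_search_log("acetyl salicylic acid", type = "equals")
```

Keeping both the raw and processed concept in the same row makes it possible to trace a search URL back to the exact input that produced it.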

RN URL Validity Table

The RN_URL_VALIDITY Table logs, for QA purposes, whether an HTTP 404 Error was recorded for an RN URL found in the REGISTRY_NUMBER_LOG Table.
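
A validity check of this kind could be sketched as below. The function name, the column names, and the choice of httr for the status request are assumptions made for illustration. The get_status parameter exists only so that the check can be exercised without network access.

```r
# Hypothetical helper: return one log row stating whether rn_url gave a 404.
check_rn_url <- function(rn_url,
                         get_status = function(u) httr::status_code(httr::GET(u))) {
  status <- as.integer(get_status(rn_url))
  data.frame(
    checked_at = Sys.time(),               # when the check was made
    rn_url     = rn_url,
    is_404     = identical(status, 404L),  # TRUE only for HTTP 404
    stringsAsFactors = FALSE
  )
}

# Offline illustration with a stubbed status function
stub_row <- check_rn_url("rn-url", get_status = function(u) 404L)
```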


meerapatelmd/skyscraper documentation built on Dec. 27, 2020, 7:46 a.m.