URL_PDF_Text_Pull: PDF scrape text from a list of pdf urls generated from a...


Description

URL_PDF_Text_Pull takes the URL of a web page and collects the links on it that match a minimal CSS selector. These links are expected to point to PDFs, which are downloaded and whose text is then searched for two words that appear after a provided string.
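
The last step described above, picking the words that follow a matched string, can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: words_after() is a hypothetical helper, the sample text is invented, words are assumed to be separated by whitespace, and word 1 is taken to be the first word after the matched string. The package may count word positions differently.

```r
# Hypothetical helper, not part of the package: return the selected words
# that follow the first occurrence of `string` in `text`.
words_after <- function(text, string, search.length = 200, words = c(1, 2)) {
  # Locate the first literal (non-regex) occurrence of the string.
  hit <- regexpr(string, text, fixed = TRUE)
  pos <- as.integer(hit)
  if (pos == -1L) {
    return(rep(NA_character_, length(words)))
  }
  # Take `search.length` characters starting just after the match.
  start <- pos + attr(hit, "match.length")
  window <- substr(text, start, start + search.length - 1)
  # Split the window on runs of whitespace; word 1 is the first token.
  tokens <- strsplit(trimws(window), "[[:space:]]+")[[1]]
  tokens[words]
}

# Invented sample text standing in for the text of one PDF.
sample_text <- "Country profile. Population (2013) 30.6 million people. GDP ..."
result <- words_after(sample_text, "Population (2013)")
print(result)
```

Under these assumptions a missing string yields NA values rather than an error, which is a design choice of the sketch and not documented behaviour of URL_PDF_Text_Pull.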

Usage

  URL_PDF_Text_Pull(url = "http://www.who.int/globalchange/resources/country-profiles/en/",
  css = ".a_z a", string = "Population (2013)", pdf.dir.dump,
  downloaded = FALSE, col.names = c("Number", "Size"),
  search.length = 200, words = c(2, 3))

Arguments

url

URL of the page containing links to the PDFs

css

Minimal css selector for links in url

string

String which is to be matched in the PDF text

pdf.dir.dump

Directory path where pdfs are downloaded to

downloaded

Logical indicating whether URL_PDF_Text_Pull has already been called, in which case the PDFs do not need to be downloaded again. Default = FALSE

col.names

Vector of length 2 giving the column names of the resulting data frames

search.length

Integer giving the length of the PDF text to search after the occurrence of string

words

Vector of integers determining which words to store from the search length. N.B. the function will fail if a requested word position is greater than the number of words that actually appear within search.length after string
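
The failure mode noted for words can be checked before indexing. The sketch below is a hypothetical guard, assuming that words are separated by whitespace and counted from 1 within the search window; check_words() is not a package function, and the window text is invented.

```r
# Hypothetical guard, not part of the package: TRUE when every requested
# word index exists in the search window.
check_words <- function(window, words) {
  tokens <- strsplit(trimws(window), "[[:space:]]+")[[1]]
  # Drop any empty token (for example from an empty window).
  tokens <- tokens[nzchar(tokens)]
  all(words >= 1 & words <= length(tokens))
}

# Invented window text: three words are available.
window <- "30.6 million people"
ok_small <- check_words(window, c(2, 3))  # indices 2 and 3 exist
ok_large <- check_words(window, c(2, 4))  # index 4 does not exist
```

If such a check fails, increasing search.length or lowering the values in words may be enough, depending on the PDF text.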

Value

List of data frames of scraped information



OJWatson/waities documentation built on May 7, 2019, 8:34 p.m.