fetchR_parseR_edit: Fetch a List of URLs.

View source: R/fetchR_parseR_edit.R

fetchR_parseR_edit    R Documentation

Fetch a List of URLs.

Description

Fetches the list of URLs created by the generateR() function.

Usage

fetchR_parseR_edit(
  out_dir = NULL,
  work_dir = NULL,
  fetch_list = NULL,
  crawl_delay = NULL,
  max_concurr = NULL,
  max_concurr_host = NULL,
  timeout = Inf,
  timeout_request = NULL,
  queue_scl = 1,
  comments = "",
  log_file = NULL,
  readability_content = F,
  parser = crawlR::parse_content,
  writer = NULL,
  status_print_interval = 500,
  curl_opts = list(`User-Agent` =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",
    `Accept-Language` = "en;q=0.7", Connection = "close", CURLOPT_DNS_CACHE_TIMEOUT =
    "3600")
)
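A call might be assembled along the following lines. This is a minimal sketch, not taken from the package: the directory paths and the numeric values for crawl_delay, max_concurr, max_concurr_host, and timeout_request are illustrative assumptions, and fetch_list is left as NULL because its structure is defined by generateR() and is not described here. The call itself only runs when crawlR is installed and a fetch list has been supplied.

```r
# Collect the arguments in a named list so they can be inspected before the call.
# All values below are illustrative assumptions, not package recommendations.
fetch_args <- list(
  out_dir = file.path(tempdir(), "crawl_out"),    # hypothetical output directory
  work_dir = file.path(tempdir(), "crawl_work"),  # hypothetical working directory
  fetch_list = NULL,          # replace with the object produced by generateR()
  crawl_delay = 2,            # assumed value
  max_concurr = 50,           # assumed value
  max_concurr_host = 1,       # assumed value
  timeout = Inf,              # documented default
  timeout_request = 30,       # assumed value
  log_file = NULL,            # NULL writes to stdout()
  status_print_interval = 500 # documented default
)

# Only attempt the call when crawlR is installed and a fetch list was supplied.
if (requireNamespace("crawlR", quietly = TRUE) && !is.null(fetch_args$fetch_list)) {
  do.call(crawlR::fetchR_parseR_edit, fetch_args)
}
```

Arguments left out of the list (queue_scl, comments, readability_content, parser, writer, curl_opts) keep the defaults shown in Usage.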

Arguments

out_dir

(Required) Current output directory.

work_dir

(Required) Current working directory.

fetch_list

(Required) Fetch list created by generateR().

crawl_delay

Delay (in seconds) between calls to the same host.

max_concurr

Max. total concurrent connections open at any given time.

max_concurr_host

Max. total concurrent connections per host at any given time.

timeout

Total timeout across all requests. Defaults to Inf.

timeout_request

Per-request timeout.

queue_scl

Queue scaler. Defaults to 1.

comments

Some comments to print while running.

log_file

Name of the log file. If NULL, writes to stdout().

readability_content

Logical flag (TRUE or FALSE). Defaults to F.

parser

Parsing function. Defaults to crawlR::parse_content.

writer

Placeholder to allow custom output functions. Defaults to NULL.

status_print_interval

Number of URLs fetched between crawler status outputs. Defaults to 500.

curl_opts

Named list of curl options. The default sets User-Agent, Accept-Language, Connection, and CURLOPT_DNS_CACHE_TIMEOUT.
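Because curl_opts is a plain named list, one way to change a single option while keeping the rest is base R's utils::modifyList(). The sketch below copies the default values from the Usage section; the replacement User-Agent string is a hypothetical example, and how crawlR treats each entry internally is not covered here.

```r
# Default options, copied from the Usage section.
default_curl_opts <- list(
  `User-Agent` = paste(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "AppleWebKit/537.36 (KHTML, like Gecko)",
    "Chrome/79.0.3945.117 Safari/537.36"
  ),
  `Accept-Language` = "en;q=0.7",
  Connection = "close",
  CURLOPT_DNS_CACHE_TIMEOUT = "3600"
)

# modifyList() replaces entries with matching names and keeps all others.
# The User-Agent value here is a hypothetical placeholder.
my_curl_opts <- utils::modifyList(
  default_curl_opts,
  list(`User-Agent` = "my-crawler/0.1 (admin@example.org)")
)
```

The resulting list could then be passed as curl_opts = my_curl_opts.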


barob1n/crawlR documentation built on May 23, 2023, 10:53 a.m.