scrapeR_in_batches: Batch Web Page Content Scraper

View source: R/scrapeR.R


Batch Web Page Content Scraper

Description

The scrapeR_in_batches function processes a dataframe in batches, scraping web content from URLs in a specified column and writing the scraped content to an output file.

Usage

scrapeR_in_batches(df, url_column, output_file)

Arguments

df

A dataframe containing the URLs to be scraped.

url_column

The name of the column in df that contains the URLs.

output_file

The path to the output file where the scraped content will be saved.

Details

This function divides the input dataframe into batches of a fixed size of 100 rows. For each batch, it extracts the combined text content of the web pages at the URLs in the specified column and appends the results to the output file. A built-in throttling pause between batches reduces the load on the servers being scraped.
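
As a rough illustration, the batch loop behaves like the sketch below. This is a minimal outline, not the package's actual source: the 100-row batch size follows the paragraph above, but the CSS selector, the two-column output layout, and the 2-second pause are assumptions made for the example.

  library(httr)
  library(rvest)

  batch_scrape_sketch <- function(df, url_column, output_file, batch_size = 100) {
    n <- nrow(df)
    for (start in seq(1, n, by = batch_size)) {
      end  <- min(start + batch_size - 1, n)
      urls <- df[[url_column]][start:end]

      ## Fetch each page and collapse its paragraph text
      ## (the "p" selector is an assumption for this sketch)
      content <- vapply(urls, function(u) {
        page <- read_html(u)
        paste(html_text(html_nodes(page, "p")), collapse = " ")
      }, character(1))

      ## Append the batch to the output file; write headers only on the first batch
      write.table(data.frame(url = urls, content = content),
                  file = output_file, sep = ",", append = start > 1,
                  row.names = FALSE, col.names = start == 1)

      Sys.sleep(2)  ## throttle between batches (pause length is assumed)
    }
  }

Appending one batch at a time keeps only the current batch's content in memory, which is why the function can handle dataframes far larger than a single in-memory result would allow.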

Value

There is no return value; the function's output is written directly to the specified file.

Note

Ensure that the httr and rvest packages are installed and loaded. Also, handle large datasets and output files with care to avoid memory issues.

Author(s)

Mathieu Dubeau, Ph.D.

References

Refer to the rvest and httr package documentation for the underlying web scraping methods.

See Also

GET, read_html, html_nodes, html_text, write.table

Examples


  ## A mock scraper that can stand in for network calls when testing offline
  mock_scrapeR <- function(url) {
    paste("Scraped content from", url)
  }

  ## A small dataframe of URLs to scrape
  df <- data.frame(url = c("http://site1.com", "http://site2.com"),
                   stringsAsFactors = FALSE)

  ## Not run: 
    scrapeR_in_batches(df, url_column = "url", output_file = "mock_output.csv")
  
## End(Not run)
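
After a run completes, the output file can be inspected with base R. This assumes a comma-separated, header-bearing file like the one produced in the sketch under Details; the actual column layout may differ.

  ## Not run: 
    results <- read.csv("mock_output.csv", stringsAsFactors = FALSE)
    head(results)
  
## End(Not run)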
