knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = TRUE)
When it comes to web scraping, there is good news and bad news.
The good news is the fundamentals of scraping are simple and R
offers an array of great tools to assist us. Once you master, or become conversant in, a few packages---rvest
(the go-to html parser in R
), httr
, and rjson
or jsonlite
(for API interfacing)---you are ready to tackle almost any web scraping task. For large web crawling endeavors or really difficult to parse objects, such as D3.js visualizations, other languages (java, python, etc.) may be better or necessary. Fortunately, R
should meet 99 percent of one's basic needs.
The bad news is that there is no cookie-cutter solution for web scraping. Websites are idiosyncratic beasts, with their underlying code ranging in quality and structure. Some sites look clean on the surface, but are riddled with invisible javascript and other structures that will derail your scraper, intentionally or not. Thus, the best we can do is cover the basics andtailor our scripts to the specific task, addressing problems as they arise.
Before starting, it's pertinent to understand how websites and their underlying servers work as well as some basic etiquette we should follow when scraping data from them.
Each call to a web server takes time, server cycles, and memory. Most servers can handle significant traffic, but cannot necessarily handle the strain induced by massive automated requests. An overzealous scraper can overload a server. Servers may interpret this as a Denial of Service Attack (DOS).
We do not want to be seen as committing a DOS for a couple reasons. First, such an overload can crash a server and keep other users from accessing the website. Second, servers can and will implement countermeasures (i.e. block our access from our IP and so on). We will cover ways to keep our scrapers from being seen as committing this type of attack.
If you want to build a crawler---continuously extract data from a website or group of websites in real time or anytime some change occurs---there are some additional considerations. Below is a list of best practices:
All websites can be broken down into two main parts:
A well-designed website is like an iceberg---the front-end is the small tip above the surface while the back-end is everything hidden under the water. Most of the heavy-lifting is hidden from site, and an efficient, minimal interface is displayed to the user.
Before pulling out the scalpel and scraping a website to shreds, make sure there is not a more efficient option available. There are three possible cases for getting data from the web:
The first choice, direct database access, is ideal since it entail unrestricted direct access to source of your desired information. There are no layers of abstraction or restrictions on what you can get. However, this is also an extremely rare situation. Unrestricted access would entail a major security breach for most websites, potentially exposing private user data to the public. If you manage to find a way to directly access a website's database via the front-end, as a good citizen you should contact the administrators and inform them of the security risk.
The second choice, indirect database access via an api, is the sweet spot. This is a structured way to make requests to the database that is designed by the website developers so as to ensure security. Most websites utilize APIs in the back-end to service requests. However, many websites do not expose their API to the public. Others do expose the API to the user, but finding the connection request in your network traffic monitor can be difficult. Finally, some websites explicitly offer public-facing APIs for developers to interact with, e.g. Twitter, Facebook graph API, US GPO ,etc.
The third choice, database access via the front-end, is what comes to mind when most people think of web scraping. This entails making a request to the page where our sought after info lives, parsing through the returned document which is a combination of HTML, CSS, and Javascript, extracting the information, storing it somewhere. If our desired information lives across more than one page, then we must iterate this process until we have collected everything. Web scraping is tedious and time-consuming, but unfortunately the most common situation.
This lab focuses on web scraping and APIs since they are the most common. Below you proceed to either.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.