The wikiScraper package makes it easy to retrieve and transform data from Wikipedia pages. It uses rvest and xml2 to scrape web pages, and tidyverse packages to transform the results.
wikiScraper is available via GitHub. To install it, use the devtools package:
install.packages('devtools')
devtools::install_github("niedermansam/wikiScraper")
When the installation is complete, load the wikiScraper package and you're ready to get started. The following code creates a data frame of all of the metro systems listed on the Wikipedia page List of metro systems.
library(wikiScraper)
library(tidyverse)
metro_systems <- wiki_table("List_of_metro_systems")
metro_systems
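Because wiki_table returns an ordinary data frame, the usual tidyverse verbs work directly on the result. The sketch below is illustrative only: column names come from the live Wikipedia page, so the "Country" column referenced here is an assumption that may not match the current table layout.
# Inspect the parsed columns (names depend on the live page)
glimpse(metro_systems)
# Count systems per country -- "Country" is an assumed column name
metro_systems %>%
  count(Country, sort = TRUE)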
If you are planning on getting information from several parts of a page (e.g. more than one table), load the full page using wiki_page. wiki_page automatically replaces spaces (" ") with underscores ("_"), and by default appends the page title provided to the URL "https://en.wikipedia.org/wiki/". Let's say we want to get data from the page List of power stations in California.
# Get the page from Wikipedia
cali_power <- wiki_page("List of power stations in California")
# Get the natural gas plant table, the fourth table on the page
cali_gas <- cali_power %>% wiki_table(table_num = 4)
# For pages with lots of tables, use wiki_section()
cali_solar <- cali_power %>%
  wiki_section('Solar') %>% # Get HTML data for the section titled "Solar"
  wiki_table(1)             # Get the first table in the "Solar" section
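Because wiki_page handles the space-to-underscore conversion described above, passing the human-readable title or the underscore-separated title gives the same result:
# Equivalent calls -- spaces are converted to underscores automatically
cali_power <- wiki_page("List of power stations in California")
cali_power <- wiki_page("List_of_power_stations_in_California")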
Many Wikipedia pages and tables contain geographic data, and wikiScraper provides a helper function to parse Wikipedia's coordinate formatting. wiki_geography takes a data frame as an argument and returns the same data frame with "lat" and "lon" columns added.
# Deletes "Coordinates" column, and inserts columns for "lat" and "lon"
cali_solar %>% wiki_geography()
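Once "lat" and "lon" are added, the coordinates can be plotted with ggplot2 (loaded with the tidyverse). This is a rough sketch that assumes the two added columns are numeric; everything else comes from the code above.
# Plot the parsed coordinates of the solar plants
cali_solar %>%
  wiki_geography() %>%
  ggplot(aes(x = lon, y = lat)) +
  geom_point()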
Some tables with complex header structures are not accurately parsed by wiki_table.
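In those cases, one possible workaround is to drop down to rvest (which wikiScraper already builds on) and clean the table up by hand. The sketch below is not part of wikiScraper; the CSS selector and table index are illustrative assumptions.
# Manual fallback using rvest directly
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/List_of_metro_systems")
tables <- html_elements(page, "table.wikitable")  # assumed selector for Wikipedia tables
raw <- html_table(tables[[1]], header = TRUE)     # headers may still need manual clean-up
head(raw)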