README.md
In niedermansam/wikiScraper: Scraping and formatting information from Wikipedia pages.

wikiScraper

The wikiScraper package makes it easy to get and transform data from Wikipedia pages. The package uses rvest and xml2 to get data from web pages, and tidyverse packages for transformation.

wikiScraper is available on GitHub. To install it, use the devtools package.

install.packages('devtools')
devtools::install_github("niedermansam/wikiScraper")

When the installation is complete, load the wikiScraper package and you're ready to get started. The following code creates a data frame of all of the metro systems listed on the Wikipedia page List of metro systems.

library(wikiScraper)
library(tidyverse)

metro_systems <- wiki_table("List_of_metro_systems")
metro_systems

If you plan to get information from several parts of a page (e.g. more than one table), load the full page using wiki_page. wiki_page automatically replaces spaces (" ") with underscores ("_") and, by default, appends the page name to the URL "https://en.wikipedia.org/wiki/". Let's say we want to get data from the page List of power stations in California.

# Get page from wikipedia
cali_power <- wiki_page("List of power stations in California")

# Get natural gas plant table, the fourth table on the page
cali_gas <- cali_power %>% wiki_table(table_num = 4)

# For pages with lots of tables, use wiki_section()
cali_solar <- cali_power %>%
  wiki_section('Solar') %>% # Get HTML data for the section titled "Solar"
  wiki_table(1) # Get the first table in the "Solar" section
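
The URL handling described for wiki_page above can be illustrated with a few lines of base R. This is only a sketch of the described behaviour, not the package's actual implementation: `build_wiki_url` and its `base_url` argument are hypothetical names introduced here for illustration.

```r
# Hypothetical helper, NOT part of wikiScraper: builds a Wikipedia URL
# from a page title in the way the text above describes.
build_wiki_url <- function(page, base_url = "https://en.wikipedia.org/wiki/") {
  # Replace every literal space with an underscore
  slug <- gsub(" ", "_", page, fixed = TRUE)
  # Append the cleaned page name to the base URL
  paste0(base_url, slug)
}

build_wiki_url("List of power stations in California")
```

Keeping the base URL as an argument with a default mirrors the "by default" wording above: a caller could, under this assumption, point the same helper at another language edition of Wikipedia.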

A lot of Wikipedia pages and tables contain geographic data. wikiScraper provides a helper function to parse Wikipedia's formatting for coordinates. wiki_geography takes a data frame as an argument, and returns the same data frame with columns "lat" and "lon" added.

# Deletes "Coordinates" column, and inserts columns for "lat" and "lon"
cali_solar %>% wiki_geography()
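
To make the transformation concrete, here is a rough base R sketch of the kind of parsing involved. It is not wiki_geography's actual code: `add_lat_lon` and `coord_col` are hypothetical names, and the sketch assumes each coordinate cell contains a decimal pair written as "lat; lon" (for example "37.767; -122.417") somewhere in its text.

```r
# Hypothetical helper, NOT part of wikiScraper: extracts a decimal
# "lat; lon" pair from a coordinate column and adds numeric columns.
add_lat_lon <- function(df, coord_col = "Coordinates") {
  # Assumed format: two decimal numbers separated by a semicolon
  pattern <- "(-?[0-9]+\\.?[0-9]*); *(-?[0-9]+\\.?[0-9]*)"
  coords <- as.character(df[[coord_col]])
  hits <- regmatches(coords, regexec(pattern, coords))

  # Each hit is c(full match, lat, lon), or character(0) when nothing matched
  pick <- function(i) {
    vapply(
      hits,
      function(h) if (length(h) == 3) as.numeric(h[i]) else NA_real_,
      numeric(1)
    )
  }
  df$lat <- pick(2)
  df$lon <- pick(3)

  # Drop the original text column, as the comment above describes
  df[[coord_col]] <- NULL
  df
}

plants <- data.frame(
  Name = c("Plant A", "Plant B", "Plant C"),
  Coordinates = c("37.767; -122.417", "see note / 34.05; -118.25", "unknown"),
  stringsAsFactors = FALSE
)
add_lat_lon(plants)
```

Rows whose text does not match the assumed pattern get NA in both new columns rather than raising an error, which makes malformed cells easy to find afterwards.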

Some tables with complex header structures are not accurately parsed by wiki_table.

niedermansam/wikiScraper documentation built on Nov. 4, 2019, 10:06 p.m.
