Encyclopedia of Life Web Scrape

A step-by-step walkthrough of the eol_data function in bomeara/phydo

To install the repo:

#install the devtools package (if not already installed)
install.packages("devtools")
#load the devtools package
library(devtools)
#install phydo
devtools::install_github("bomeara/phydo")
#load the phydo package
library(phydo)

To run the eol_data function:

#function_name("Genus species")
eol_data("Formica accreta")

The function in its entirety is at the bottom of this page.

Breaking down the components of the function:

Locate the correct EOL webpage

Search EOL for the species name entered in the function call (see above: "Formica accreta").
The code below creates a variable, searchurl, that contains the search URL with the species name pasted into the correct place; paste0() joins the pieces with no separator.

#this code will not run on its own because it is designed to sit inside the function, where species has been supplied
searchurl <- paste0('http://eol.org/api/search/1.0.json?q=', URLencode(species), '&exact=1&page=1&key=')

When species is supplied in the function call, the code that runs looks like this:

searchurl <- paste0('http://eol.org/api/search/1.0.json?q=', URLencode("Formica accreta"), '&exact=1&page=1&key=')
searchurl

Copy-paste this link into your browser. It looks like gobbledygook, but there is one particular piece of information we need. Find the word 'link'; this is what the next steps will extract.
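
You can also inspect the search result directly from R instead of the browser (a quick check; it uses the same jsonlite::fromJSON() call as the next step):

#parse the JSON returned by the search URL and look at the results table, which includes the 'link' column
str(jsonlite::fromJSON(searchurl)$results)
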
Create a placeholder variable called url (initialized to NA).
Then use jsonlite::fromJSON() to read the JSON at searchurl (above), pull the link out of the first search result, and add '/data' to the end of the url, because that page is where the trait information we are looking for is stored.

url <- NA
url <- paste0(jsonlite::fromJSON(searchurl)$results$link[1], "/data")
url

Again, copy-paste this link to see the page we will be scraping in the next steps.
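
If you prefer to stay in R, you can open the same page straight from the console (optional; browseURL() is part of base R's utils package):

#open the EOL data page in the default web browser
utils::browseURL(url)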

Scrape the webpage

Use rvest::read_html() to read the url (created above) and save it as an xml_document in an object called input.

input <- rvest::read_html(url)
input

Use rvest::html_elements() to search input (created above) for every 'ul' element (the CSS selector 'ul') and save the results in an object called all_ul.

all_ul <- rvest::html_elements(input, 'ul')
head(all_ul)
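
To confirm which 'ul' holds the trait data, you can list the class attribute of each element (a quick check with rvest::html_attr()):

#show the class attribute of every 'ul' element; one of them is "traits"
rvest::html_attr(all_ul, "class")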

The 5th 'ul' has class "traits"; this is the one we want, so we extract the 5th element and save it in an object called trait_ul.

trait_ul <- all_ul[[5]]
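
Selecting by position is fragile if EOL changes its page layout. A more robust alternative (a sketch; it assumes the traits list keeps its "traits" class and that rvest >= 1.0.0, which provides html_element(), is installed) is to select by CSS class instead:

#select the <ul class="traits"> element directly by its class
trait_ul <- rvest::html_element(input, "ul.traits")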

rvest::html_nodes() extracts all "div" nodes from trait_ul, and rvest::html_text2() converts them to plain text, so trait_list_text is a vector of strings.

trait_list_text <- rvest::html_text2(rvest::html_nodes(trait_ul, "div"))
head(trait_list_text)

The resulting vector contains strings that are not data we need, for example:

trait_list_text[[60]]
trait_list_text[[62]]

These leftovers are also visible on the webpage at the url we created above.
The gsub() calls below strip the 'records hidden' and 'show all records' text that comes from the page's interactive widgets, and trait_list_raw saves the raw HTML of the same "div" nodes so the trait-category headings can be found later.

trait_list_text <- gsub("([0-9]*) records hidden", " \\1 records hidden", trait_list_text)
trait_list_text[[60]]
trait_list_text[[62]]
trait_list_text <- gsub('\\d* records hidden \\— show all', "", trait_list_text)
trait_list_text[[60]]
trait_list_text[[62]]
trait_list_text <- gsub('\nshow all records', "", trait_list_text)
trait_list_raw <- as.character(rvest::html_nodes(trait_ul, "div"))

Remove empty strings

empty <- which(nchar(trait_list_text)==0)
trait_list_text <- trait_list_text[-empty]
trait_list_raw <- trait_list_raw[-empty]
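
One caveat (not part of the original function): if there are no empty strings, which() returns integer(0), and subsetting with -integer(0) drops every element rather than none. A guarded version would look like this:

#only drop elements when at least one empty string was found
if (length(empty) > 0) {
    trait_list_text <- trait_list_text[-empty]
    trait_list_raw <- trait_list_raw[-empty]
}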

Find trait classes

Each trait category on the data page begins with an 'h3' heading. The loop below finds those headings in the raw HTML, walks through the text rows between one heading and the next, keeps only rows that contain a URI, splits each row on newlines into its fields, and appends the pieces to a data frame called trait_df.

#indices of the rows whose raw HTML contains an 'h3' tag: these are the trait-category headings
data_heads <- which(grepl("h3", trait_list_raw))
#empty data frame to hold one row per trait record
trait_df <- data.frame(matrix(nrow=0, ncol=6))
colnames(trait_df) <- c("species", "trait", "value", "source", "URI", "definition")
#index of the last row belonging to each heading: the row before the next heading, or the end of the vector
data_head_plus_end <- c(-1+data_heads[-1], length(trait_list_text))
for (i in seq_along(data_heads)) {
    #text rows between this heading and the next
    relevant_rows <- trait_list_text[(data_heads[i]+1):(data_head_plus_end[i])]

    #keep only rows that contain newline-separated fields including a URI
    relevant_rows <- relevant_rows[grepl(".+\\\n.+\\\nURI", relevant_rows)]

    for(j in seq_along(relevant_rows)) {
        #each kept row splits on newlines into source, value, URI, and (sometimes) a definition
        trait_info <- strsplit(relevant_rows[j], "\n")[[1]][2]
        source_info <- strsplit(relevant_rows[j], "\n")[[1]][1]
        URI_info <- strsplit(relevant_rows[j], "\n")[[1]][3]
        definition_info <- NA
        try(definition_info <- strsplit(relevant_rows[j], "\n")[[1]][4])

        #species name manually inserted here for example
        trait_df <- rbind(trait_df, data.frame(species="Formica accreta", trait=gsub("\n", "", trait_list_text[data_heads[i]]), value=trait_info, source=source_info, URI=URI_info, definition=definition_info))
    }
}
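
After the loop finishes, trait_df holds one row per trait record. To take a quick look at the result:

#inspect the first few rows of the assembled trait table
head(trait_df)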

