eol_data function in bomeara/phydo
To install the package from GitHub:
# install the devtools package (if not already installed)
install.packages("devtools")
# load the devtools package
library(devtools)
# install phydo from GitHub
devtools::install_github("bomeara/phydo")
# load phydo
library(phydo)
To run the eol_data function:
# function_name("Genus species")
eol_data("Formica accreta")
The function in its entirety is at the bottom of this page.
Search EOL for the species name entered in the function call (see above: "Formica accreta").
The code below tells R to create a variable that contains the search URL, with the URL-encoded species name pasted into the correct location. paste0 joins the pieces with no separator between them (the same as paste with sep="").
# this code will not work on its own because it is designed to sit inside a function where (species) has been supplied.
searchurl <- paste0('http://eol.org/api/search/1.0.json?q=', URLencode(species), '&exact=1&page=1&key=')
When species has been supplied in the function the code that runs will look like this:
searchurl <- paste0('http://eol.org/api/search/1.0.json?q=', URLencode("Formica accreta"), '&exact=1&page=1&key=')
searchurl
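To see what URLencode contributes, the URL construction can be wrapped in a small helper. This is only a sketch: build_search_url is a hypothetical name used for illustration and is not part of phydo.

```r
# Hypothetical helper (not part of phydo): build the EOL search URL
# for a species name, using the same pieces as the code above.
build_search_url <- function(species) {
  paste0(
    "http://eol.org/api/search/1.0.json?q=",
    utils::URLencode(species),  # the space in "Genus species" becomes %20
    "&exact=1&page=1&key="
  )
}

build_search_url("Formica accreta")
```

The space between genus and species is not valid inside a URL, which is why the name is encoded before it is pasted in.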
Copy and paste this link into your browser. The response looks like gobbledygook, but it contains one particular piece of information we need. Find the word 'link'. This is what the next steps will extract.
Create an empty variable called url. Then use the fromJSON function from the jsonlite library to find the first link in the search results returned by searchurl (above), and add '/data' to the end of that link, because this is where the information we are looking for is stored.
url <- NA
url <- paste0(jsonlite::fromJSON(searchurl)$results$link[1], "/data")
url
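If the search returns no results, there is no link to extract, and pasting '/data' onto nothing produces a malformed URL. The sketch below shows one way this case might be guarded against. The helper name first_data_url, the toy parsed object and the link value are all made up for illustration; the toy list only mimics the part of the fromJSON result that the code above uses.

```r
# Hypothetical helper (not part of phydo): take the first result link from
# a parsed search response and append "/data"; return NA if there is none.
first_data_url <- function(parsed) {
  links <- parsed$results$link
  if (is.null(links) || length(links) == 0 || is.na(links[1])) {
    return(NA_character_)
  }
  paste0(links[1], "/data")
}

# Toy stand-in for jsonlite::fromJSON(searchurl); the link is invented.
parsed <- list(
  results = data.frame(link = "https://eol.org/pages/0000", stringsAsFactors = FALSE)
)
first_data_url(parsed)

# An empty result set gives NA rather than a malformed URL.
first_data_url(list(results = list()))
```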
Again, copy and paste this link into your browser to see the page we will be scraping in the next steps.
Use the read_html function from the rvest library to save the page at url (created above) as an xml_document in an object called input.
input <- rvest::read_html(url)
input
Use the html_elements function from the rvest library to search input (created above) with the CSS selector 'ul', and save all matching elements in an object called all_ul.
all_ul <- rvest::html_elements(input,'ul')
head(all_ul)
The 5th 'ul' has the class "traits". This is the one we want, so we extract the 5th element and save it in an object called trait_ul.
trait_ul <- all_ul[[5]]
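Picking the list by position depends on the page layout staying the same. Since the wanted list has the class "traits", a class-based CSS selector is one possible alternative. The sketch below runs on a made-up toy page, not the real EOL page, and assumes rvest version 1.0.0 or later is installed.

```r
if (requireNamespace("rvest", quietly = TRUE)) {
  # Toy page standing in for the EOL data page (structure is assumed)
  toy_page <- rvest::read_html(paste0(
    "<html><body>",
    "<ul class='nav'><li>Home</li></ul>",
    "<ul class='traits'><li>body mass</li></ul>",
    "</body></html>"
  ))

  # Select the list by its class instead of by its position on the page
  trait_ul_by_class <- rvest::html_element(toy_page, "ul.traits")
  rvest::html_text2(trait_ul_by_class)
}
```

Whether this is more robust on the live site depends on how stable the class name is, so treat it as an option to test rather than a drop-in replacement.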
rvest::html_nodes extracts all "div" nodes from trait_ul, and rvest::html_text2 converts them to plain text, so that trait_list_text is a vector of strings.
trait_list_text <- rvest::html_text2(rvest::html_nodes(trait_ul, "div"))
head(trait_list_text)
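The later steps split each string on "\n", which works because html_text2 turns line breaks in the HTML into newline characters. The following sketch illustrates this on a made-up fragment; the markup and values are assumptions and do not reproduce the real EOL page.

```r
if (requireNamespace("rvest", quietly = TRUE)) {
  # Made-up fragment: one div with three lines of text, one empty div
  toy_ul <- rvest::read_html(paste0(
    "<ul class='traits'>",
    "<li><div>Some source<br>12 mm<br>URI http://example.org/term</div></li>",
    "<li><div></div></li>",
    "</ul>"
  ))

  toy_divs <- rvest::html_elements(toy_ul, "div")
  toy_text <- rvest::html_text2(toy_divs)  # one string per div
  toy_text
}
```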
The resulting vector contains some strings that are not data we need, such as:
trait_list_text[[60]]
trait_list_text[[62]]
These strings can also be seen on the web page at url (created above). The gsub calls that follow clean them up.
trait_list_text <- gsub("([0-9]*) records hidden", " \\1 records hidden", trait_list_text)
trait_list_text[[60]]
trait_list_text[[62]]
trait_list_text <- gsub('\\d* records hidden \\— show all', "", trait_list_text)
trait_list_text[[60]]
trait_list_text[[62]]
trait_list_text <- gsub('\nshow all records', "", trait_list_text)
trait_list_raw <- as.character(rvest::html_nodes(trait_ul, "div"))
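The effect of the gsub cleanup can be seen on a few made-up strings. This sketch covers only the first and third substitutions above, and the toy strings are assumptions about what the page residue looks like.

```r
# Made-up strings that mimic the residue described above
toy <- c("adult12 records hidden", "body mass\nshow all records", "1.2 g")

# Put a space in front of the record count so it separates from the
# text before it; \\1 re-inserts the digits captured by ([0-9]*)
toy <- gsub("([0-9]*) records hidden", " \\1 records hidden", toy)

# Drop the "show all records" link text
toy <- gsub("\nshow all records", "", toy)

toy
```

Strings that match neither pattern, such as "1.2 g", pass through unchanged.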
Remove empty strings
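# keep only the non-empty strings; a logical mask is used because
# negative indexing with which() drops every element when nothing is empty
keep <- nzchar(trait_list_text)
trait_list_text <- trait_list_text[keep]
trait_list_raw <- trait_list_raw[keep]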
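One pitfall to be aware of when removing elements by position in R: a negative index built from which() selects nothing at all when there is nothing to remove. The sketch below shows this on a toy vector and compares it with a logical mask.

```r
x <- c("a", "b", "c")           # toy vector with no empty strings

empty <- which(nchar(x) == 0)   # integer(0): nothing to remove
dropped <- x[-empty]            # -integer(0) is integer(0), so nothing is selected
length(dropped)

keep <- nzchar(x)               # logical mask: TRUE for non-empty strings
kept <- x[keep]
length(kept)
```

The logical mask keeps all three elements, while the negative index returns an empty vector.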
Find trait classes
data_heads <- which(grepl("h3", trait_list_raw))
trait_df <- data.frame(matrix(nrow=0, ncol=6))
colnames(trait_df) <- c("species", "trait", "value", "source", "URI", "definition")
data_head_plus_end <- c(-1+data_heads[-1], length(trait_list_text))
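The line above computes where each trait class ends: one position before the next heading, or the end of the vector for the last class. A sketch with made-up positions (toy_heads and toy_n are illustrative values, not taken from the real page):

```r
toy_heads <- c(1, 4, 9)   # made-up positions of the trait headings
toy_n <- 12               # made-up length of the text vector

# Each class ends just before the next heading; the last ends at toy_n
toy_ends <- c(toy_heads[-1] - 1, toy_n)

# Rows belonging to each class run from heading + 1 to the class end
data.frame(first_row = toy_heads + 1, last_row = toy_ends)
```

If two headings sit directly next to each other, first_row is larger than last_row, and the colon operator then counts downwards instead of giving an empty range.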
for (i in seq_along(data_heads)) {
  relevant_rows <- trait_list_text[(data_heads[i]+1):(data_head_plus_end[i])]
  relevant_rows <- relevant_rows[grepl(".+\\\n.+\\\nURI", relevant_rows)]
  for(j in seq_along(relevant_rows)) {
    trait_info <- strsplit(relevant_rows[j], "\n")[[1]][2]
    source_info <- strsplit(relevant_rows[j], "\n")[[1]][1]
    URI_info <- strsplit(relevant_rows[j], "\n")[[1]][3]
    definition_info <- NA
    try(definition_info <- strsplit(relevant_rows[j], "\n")[[1]][4])
    # species name manually inserted here for example
    trait_df <- rbind(trait_df, data.frame(
      species="Formica accreta",
      trait=gsub("\n", "", trait_list_text[data_heads[i]]),
      value=trait_info,
      source=source_info,
      URI=URI_info,
      definition=definition_info
    ))
  }
}
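The inner part of the loop splits each entry on newlines and reads the source, value, URI and definition by position. A minimal sketch of that step on made-up entries follows; parse_trait_row is a hypothetical helper and the strings are invented, so this illustrates the idea rather than the package's code.

```r
# Hypothetical helper (not part of phydo): split one entry into its parts.
parse_trait_row <- function(row) {
  parts <- strsplit(row, "\n", fixed = TRUE)[[1]]
  data.frame(
    source = parts[1],
    value = parts[2],
    URI = parts[3],
    definition = parts[4],   # NA when the entry has no fourth line
    stringsAsFactors = FALSE
  )
}

# Made-up entries: the second one has no definition line
toy_rows <- c(
  "Some source\n12 mm\nURI http://example.org/term\nA made-up definition",
  "Another source\nred\nURI http://example.org/colour"
)

# Build all rows first, then bind them once
toy_df <- do.call(rbind, lapply(toy_rows, parse_trait_row))
toy_df
```

Indexing past the end of a vector in R returns NA instead of raising an error, so a missing definition simply becomes NA.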