library(learnr) knitr::opts_chunk$set(error = TRUE) set.seed(123)
In this session you will take some classic biodiversity data and turn it into an R package.
When we think about biodiversity data, we generally think about the hypothesis we are testing, how we are going to collect the data, and how we need to analyse it to investigate that hypothesis. Little thought is usually given to how to store the data. This is a mistake! Large amounts of hard work in the field can be wasted by mistakes in storage. After regular backups so you don't physically lose your data (if you're not doing this, then you should stop whatever you're doing and fix that now!), the next important thing to consider is whether you are definitely, always using the most up-to-date, error-free copy of your dataset when you carry out analyses.
If your data is stored in one or more csv files or excel spreadsheets, you need to read that file into R. For that you need to know where it is. If you only have one copy of it on your computer, and you update that when you clean the data or add new records, then that is a single point of failure. If you accidentally make incorrect edits or delete the file, then you are entirely dependent on backups to recover from your mistakes (assuming you followed the instructions above and have a backup!).
If, on the other hand, you create a new file every time you update the data, then all of your R scripts that do analyses on the data now have to refer to the new file, or you will need to copy the new data file to the projects where you are doing the analysis. If you don't do either of these everywhere you are working on the data, one or more of your analyses may be using an out of date version of the data, and silently give incorrect answers.
Even if your behaviour is perfect, and you never make mistakes with your data, then it is usually necessary to do some processing to the data before carrying out the analyses. This may be as simple as changing some of the column names in your tables so they are easier to work with in R (no spaces, for instance), some cleaning of the data to remove missing data, or some significant processing to turn records that are easy to collect into data that is easy to analyse. Such a script, just like the data, needs to be updated, and the code needs to be present wherever you carry out an analysis. It therefore suffers from exactly the same problems as the data itself.
There is a simple(-ish!) solution though. You don't need to carry around your
own copy of ggplot()
or lmer()
in every script you write, so why should you
have to do it with the data? There is one definitive version of ggplot()
and
lmer()
on your computer, and you just refer to them by loading the ggplot2
and lme4
packages using the library()
function. When these packages get
updated to fix bugs or extend functionality, you just do it once by installing
(normally updating) the R package, and then all of your scripts use the new
version. So why not turn your data into an R package and do the same?
You'll go through the process of creating a package, processing data, inserting the data into the package, and building the package so that you can use it during the project work at the end of the course.
First, create a new package called githubusernameBCI
(replacing
githubusername
with your actual GitHub username) as a git repository
connected to GitHub as a repo in the SBOHVM
organisation. Remember that
a guide to creating packages on GitHub is available on the RPiR
help pages
here if
you're not sure what to do.
Generally, you would want to store the raw data from your work in the git repository so everything is archived and version controlled on GitHub. However, in this work, the dataset is not yours, and it's so big it easily exceeds file limits for storage in GitHub. So, under no circumstances commit the raw records to your git repository -- it's far too big. We'll show you how to ignore it so that you don't do this by mistake shortly, but just bear this in mind now so you ensure that you don't -- the moment you commit a file like this into a repo it's hard to undo the error.
Next, you need to download the Barro Colorado Forest Census Plot Data from the Smithsonian's repository at https://repository.si.edu/handle/10088/20925. Download files 3 and 4, the full census data (ViewFullTable.zip) and the taxonomic data on all of the species present (ViewTax.txt). Now create a folder to store these raw data in, and create the files you will use shortly to process the data:
```{R, eval = FALSE} usethis::use_data_raw("bci_2010") usethis::use_data_raw("bci_quadrats") usethis::use_data_raw("bci_taxa")
Copy the files you have downloaded to the <span style="color: #dd1c77;">data-raw</span> folder that has just been created inside the R package, and unzip the <span style="color: #dd1c77;">ViewFullTable.zip</span> file into the same folder. You can now delete the zip file if you like. Now we need to make sure that none of these files make it into your git repo by accident. Run: ```{R, eval = FALSE} usethis::edit_git_ignore("project")
And add the following lines to the end of the .gitignore file that has opened:
```{R, eval = FALSE} ViewTax.txt ViewFullTable.zip ViewFullTable TSMAttributes.txt ViewFullTable.txt ViewFullTable.pdf
Now go to the git pane in RStudio, and make sure that the none of the files you just downloaded show up in the pane. Commit everything and create a suitable commit message for these initial files. Now create a new R file in the <span style="color: #dd1c77;">data-raw</span> folder. This will be the file you run to put all of the data into the package. It should contain the following: ```{R, eval = FALSE} # Put the datasets into the package library(dplyr) # Move to the data-raw subfolder to access the raw data devtools::wd(".", "data-raw") # BCI data # Load individual BCI records data <- read.delim(file.path("ViewFullTable", "ViewFullTable.txt")) # Load BCI taxonomic data, create new species column, and extract species taxa <- read.delim("ViewTax.txt") %>% mutate(GenusSpecies = as.factor(paste(Genus, SpeciesName))) %>% filter(IDLevel == "species") # Store species counts from 2010 census in package source("bci_2010.R") # Store quadrat metadata in package source("bci_quadrats.R") # Store taxonomic data in package source("bci_taxa.R")
This will load all of the raw data into R (this will take some time) and run the
files you have created to process each of them and store them in the package.
Don't run this script yet though! First, you need to edit the processing
files (bci_2010.R
, bci_quadrats.R
and bci_taxa.R
) to do the processing
correctly. Actually, we provide them for you here!
This is bci_2010.R:
```{R, eval = FALSE} library(dplyr) library(reshape2)
records <- data %>% filter(PrimaryStem == "main", Status == "alive", !is.na(QuadratName)) %>% filter(GenusSpecies %in% taxa$GenusSpecies) %>% select(GenusSpecies, PlotCensusNumber, QuadratName) %>% mutate(col = as.integer(floor(QuadratName / 100)), row = as.integer(QuadratName - col * 100)) %>% filter(row < 25)
bci_2010 <- records %>% filter(PlotCensusNumber == 7) %>% select(GenusSpecies, QuadratName) %>% acast(GenusSpecies ~ QuadratName, fill = 0, value.var = "QuadratName", fun.aggregate = length)
colnames(bci_2010) <- sprintf("Q.%04d", as.integer(colnames(bci_2010)))
usethis::use_data(bci_2010, overwrite = TRUE)
It takes the massive Barro Colorado Island (BCI) dataset and extracts only the 2010 (7th) census and summarises it to extract a matrix of counts of known species of living trees in 20m x 20m quadrats across the site. This is <span style="color: #dd1c77;">bci_quadrats.R</span>: ```{R, eval = FALSE} library(dplyr) library(tibble) # Work out useful information about the quadrats themselves # Note: they index from 0 and they are 20m x 20m quadrats bci_quadrats <- records %>% select(QuadratName, row, col) %>% unique %>% mutate(x = row * 20, y = col * 20, row = row + 1, col = col + 1) %>% mutate(Quadrat = sprintf("Q.%04d", as.integer(QuadratName))) %>% arrange(Quadrat) %>% as_tibble # Store in package usethis::use_data(bci_quadrats, overwrite = TRUE)
It provides metadata about the quadrats -- where they are within the site -- as
a tibble (a prettier version of a data frame that the RStudio team have
created). Specifically, the bci_quadrats
data frame translates the column
names in bci_2010
into actually rows and columns in a grid or (x,y)
coordinates in metres in case you want to be able to display where the quadrats
are physically.
Finally, this is bci_taxa.R:
```{R, eval = FALSE} library(dplyr) library(tibble)
bci_taxa <- taxa %>% filter(IDLevel == "species") %>% select(GenusSpecies, Genus, Family) %>% filter(GenusSpecies %in% rownames(bci_2010)) %>% unique %>% as_tibble
usethis::use_data(bci_taxa, overwrite = TRUE)
This will turn the huge data frame containing all of the taxonomic records of the site into a tibble called `bci_taxa`, which just contains the species name (`GenusSpecies`), genus (`Genus`) and family (`Family`) of each living tree species on Barro Colorado Island (BCI) and stores it in the package. Finally you need to do three things: 1. Commit all of the scripts that you have just created 2. Run the script that you created to generate the data files and commit them 3. Add in all of the packages you are using as dependencies of this package If you have a problem in step 2, with an error about a file missing here: ```{R, eval = FALSE} data <- read.delim(file.path("ViewFullTable", "ViewFullTable.txt"))
Then your computer may have unzipped files differently from mine -- mine has created a folder in data-raw called ViewFullTable, but yours may have just unzipped the files directly into data-raw. In that case, just change the line to:
```{R, eval = FALSE} data <- read.delim("ViewFullTable.txt")
Step 3 is (hopefully!) easy, and it is something you are going to have to keep up to date as you develop any package -- you need to check which libraries you load using `library(xxx)` in any script or demo, and which libraries you use by qualifying function calls with `xxx::function_name()`. You then need to run: ```{R, eval = FALSE} usethis::use_package("xxx")
to add the xxx
package to those imported by your package in the
DESCRIPTION file. See
here
for further details. Commit these changes to the repo too, and push them to
GitHub. Note that you should not add your own (githubusernameBCI
) package to
the package dependencies of githubusernameBCI
, because it is not a dependency
of itself! If you have accidentally done this you need to go into the
DESCRIPTION
file and delete it from the Imports:
entry there.
Finally, if you are going to add a package on GitHub to your dependencies (such
as our RPiR
package), you need to use usethis::use_dev_package()
instead of
usethis::use_package()
. You need to know where to find it on GitHub, and then
use:
```{R, eval = FALSE} usethis::use_dev_package("RPiR", remote = "SBOHVM/RPiR")
where the `remote` argument is the organisation and repo name. In the project, when you are using your own data package in your new project package, you will need to do this by calling something like `usethis::use_dev_package("githubusernameBCI", remote = "SBOHVM/githubusernameBCI")`. ## Tasks: documenting the package First you need to edit the <span style="color: #dd1c77;">DESCRIPTION</span> file (more [here](https://r-pkgs.org/description.html) so that it contains the right information about you and the package, and you need to create a documentation file for the package in the <span style="color: #dd1c77;">R</span> folder so that `?githubusernameBCI` returns a description. You had this in the package for the third practical series, but you'll see details on how to do that [here](https://sbohvm.github.io/RPiR/articles/pages/packages_guide.html#package-documentation). Create a new file in the <span style="color: #dd1c77;">R</span> folder -- I'm going to call it <span style="color: #dd1c77;">githubusernameBCI-package.R</span> because the convention is packagename-package.R. Then describe your package (most easily lifted from the <span style="color: #dd1c77;">DESCRIPTION</span> file): ```r #' Barro Colorado Island data package #' #' Package to hold the BCI data (or whatever) -- maybe also mention something #' about these functions now, and put that in the DESCRIPTION too. And then #' put it in the README.md file. And don't forget to reference the source of #' the data correctly. #' #' @import magrittr #' #' @name githubusernameBCI-package #' @aliases githubusernameBCI #' @docType package #' NULL
There are a few things going on here that you should notice for making packages
in the future. The first is that you need to say what your package is called --
here I have given it two names githubusernameBCI-package
and an alias of just
githubusernameBCI
, which you do with the @aliases
command. Secondly, the
NULL
at the end of the file is included when there is no object associated
with this documentation, which is the case here, since this file contains the
package documentation, as defined explicitly in the @docType
tag (more info
here). Finally, if you are
using any packages in any of your functions, you may want to import them into
your package here. This is done using @import
-- see more
here. You can then put the same, or a
similar, package description into the
README.md file in the package so that
people going to GitHub will see what the package that you have created does
without having to install it. Note that GitHub README pages use a special type
of markdown called Git Flavored Markdown; more information can be found here https://docs.github.com/en/free-pro-team@latest/github/writing-on-github with
more general documentation here
https://guides.github.com/features/mastering-markdown/. Note that with markdown
you don't need to start lines with #'
.
Second, you need to document the data that you are providing in this package, by creating file(s) in the R folder again. You'll find details on how to create the documentation here, or an abbreviated version in our guide here.
Finally, build the documentation (using devtools::document()
), and commit all
of the changes to git.
Make sure throughout the documentation that you credit the real source of the data as the Smithsonian.
Now you should have a working R data package. Try installing it (using
devtools::install()
), restarting R and loading it. Check that the objects you
have created exist -- you should be able to see them in the
Environment pane, but more simply by just
typing:
```{R, eval = FALSE} library(githubusernameBCI) bci_taxa bci_quadrats bci_2010
You should (hopefully!) find that all of the objects exist, though you may be surprised to find that they appear to be normal data frames instead of tibbles. This actually isn't true, and if you run: ```{R, eval = FALSE} library(githubusernameBCI) library(tibble) bci_taxa bci_quadrats
You'll see that they are tibbles. Tibbles automatically fall back to being data frames if you don't load the library.
Now push all of your changes (commits) to GitHub, and check you can install
it by using devtools::install_github("SBOHVM/githubusernameBCI")
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.