This package is in the early stages of development and should be used with caution. If you encounter any bugs, please open an issue on GitHub.
eBird is an online tool for recording bird observations. Since its inception, nearly 500 million records of bird sightings (i.e. combinations of location, date, time, and bird species) have been collected, making eBird one of the largest citizen science projects in history and an extremely valuable resource for bird research and conservation. The full eBird database is packaged as a text file and available for download as the eBird Basic Dataset (EBD). Due to the large size of this dataset, it must be filtered to a smaller subset of desired observations before it can be read into R. This filtering is most efficiently done using AWK, a Unix utility and programming language for processing column-formatted text data. This package acts as a front end for AWK, allowing users to filter eBird data before importing it into R.
This package can be installed directly from GitHub with:
# install.packages("devtools")
devtools::install_github("ebird/auk")
`auk` requires the Unix utility AWK, which is available on most Linux and Mac OS X machines. Windows users will first need to install Cygwin before using this package. Note that Cygwin must be installed in the default location (`C:/cygwin/bin/gawk.exe` or `C:/cygwin64/bin/gawk.exe`) in order for `auk` to work.
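Before using the package, it can help to confirm that an AWK executable is actually reachable. The following is a minimal base R sketch, not part of `auk`: the helper name `find_awk()` is hypothetical, and the Windows paths are simply the default Cygwin locations mentioned above.

```r
# Hypothetical helper (not part of auk): look for an AWK executable.
find_awk <- function(sysname = Sys.info()[["sysname"]]) {
  if (identical(sysname, "Windows")) {
    # default Cygwin install locations named in this README
    candidates <- c("C:/cygwin/bin/gawk.exe", "C:/cygwin64/bin/gawk.exe")
    found <- candidates[file.exists(candidates)]
  } else {
    # Sys.which() returns "" when the program is not on the PATH
    found <- unname(Sys.which("awk"))
    found <- found[nzchar(found)]
  }
  if (length(found) == 0) NA_character_ else found[1]
}

awk_path <- find_awk()
if (is.na(awk_path)) {
  message("AWK not found; install it (or Cygwin on Windows) first.")
} else {
  message("AWK found at: ", awk_path)
}
```

If the helper returns `NA`, AWK (or Cygwin on Windows) likely needs to be installed before any filtering will work.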
Full details on using `auk` to produce both presence-only and presence-absence data are outlined in the vignette, which can be accessed with `vignette("auk")`.
This package contains a current (as of the time of package release) version of the bird taxonomy used by eBird. This taxonomy determines the species that can be reported in eBird and therefore the species that users of `auk` can extract from the EBD. eBird releases an updated taxonomy once a year, typically in August, at which time `auk` will be updated to include the current taxonomy. When using `auk`, users should be careful to ensure that the version they're using is in sync with the EBD file they're working with. This is most easily accomplished by always using the most recent version of `auk` and the most recent release of the EBD.
Some rows in the eBird Basic Dataset (EBD) may have an incorrect number of columns, typically from problematic characters in the comments fields, and the dataset has an extra blank column at the end. The function `auk_clean()` drops these erroneous records and removes the blank column.
library(auk)
# sample data
f <- system.file("extdata/ebd-sample_messy.txt", package = "auk")
tmp <- tempfile()
# remove problem records
auk_clean(f, tmp)
# number of lines in input
length(readLines(f))
# number of lines in output
length(readLines(tmp))
unlink(tmp)
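To make the idea behind the cleaning step concrete, here is a small base R sketch that applies the same two rules (drop rows whose column count differs from the header, then strip the trailing blank column) to a made-up file. It is only an illustration under those assumptions and is not how `auk_clean()` is implemented; reading the whole file into memory is fine for a toy example but would not be practical for the full EBD.

```r
# Toy tab-separated file standing in for the EBD (contents are made up)
toy <- tempfile(fileext = ".txt")
writeLines(c(
  "species\tcount\tcomments\t",
  "Gray Jay\t2\tat feeder\t",
  "Blue Jay\t1\tbroken\tcomment\t",  # stray tab gives one extra column
  "Blue Jay\t3\t\t"
), toy)

lines <- readLines(toy)

# number of fields per line = number of tabs + 1
n_fields <- nchar(lines) - nchar(gsub("\t", "", lines, fixed = TRUE)) + 1

# the header row defines the expected number of columns
keep <- n_fields == n_fields[1]

# drop malformed rows, then remove the trailing blank column
clean <- sub("\t$", "", lines[keep])

length(lines)  # lines in input
length(clean)  # lines in output
unlink(toy)
```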
`auk` uses a pipeline-based workflow for defining filters, which can then be compiled into an AWK script. Users should start by defining a reference to the EBD file with `auk_ebd()`. Then any of the following filters can be applied:
- `auk_species()`: filter by species using common or scientific names.
- `auk_country()`: filter by country using the standard English names or ISO 2-letter country codes.
- `auk_extent()`: filter by spatial extent, i.e. a range of latitudes and longitudes.
- `auk_date()`: filter to checklists from a range of dates.
- `auk_last_edited()`: filter to checklists from a range of last edited dates, useful for extracting just new or recently edited data.
- `auk_time()`: filter to checklists started during a range of times-of-day.
- `auk_duration()`: filter to checklists that are the result of observation periods that lasted a given range of durations.
- `auk_complete()`: only retain checklists in which the observer has specified that they recorded all species seen or heard. It is necessary to retain only complete records for the creation of presence-absence data, because the "absence" information is inferred by the lack of reporting of a species on checklists.

Note that all of the functions listed above only modify the `auk_ebd` object, in order to define the filters. Once the filters have been defined, the filtering is actually conducted using `auk_filter()`.
# sample data
f <- system.file("extdata/ebd-sample.txt", package = "auk")
# define an EBD reference and a set of filters
ebd <- auk_ebd(f) %>%
  # species: common and scientific names can be mixed
  auk_species(species = c("Gray Jay", "Cyanocitta cristata")) %>%
  # country: codes and names can be mixed; case insensitive
  auk_country(country = c("US", "Canada", "mexico")) %>%
  # extent: formatted as `c(lng_min, lat_min, lng_max, lat_max)`
  auk_extent(extent = c(-100, 37, -80, 52)) %>%
  # date: use standard ISO date format `"YYYY-MM-DD"`
  auk_date(date = c("2012-01-01", "2012-12-31")) %>%
  # time: 24h format
  auk_time(time = c("06:00", "09:00")) %>%
  # duration: length in minutes of checklists
  auk_duration(duration = c(0, 60)) %>%
  # complete: all species seen or heard are recorded
  auk_complete()
ebd
In all cases, extensive checks are performed to ensure filters are valid. For example, species are checked against the official eBird taxonomy and countries are checked using the `countrycode` package.
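As a rough illustration of what such a validity check might look like, the sketch below validates a date-range filter in base R. The function `check_date_range()` is hypothetical and written only for this example; the checks that `auk` itself performs may differ.

```r
# Hypothetical validator (not part of auk) for a two-element date range
check_date_range <- function(date) {
  stopifnot(is.character(date), length(date) == 2)
  # as.Date() with an explicit format returns NA for unparseable strings
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (anyNA(parsed)) {
    stop("dates must be formatted as YYYY-MM-DD")
  }
  if (parsed[1] > parsed[2]) {
    stop("the first date must not be after the second")
  }
  parsed
}

# a valid range passes through as Date objects
check_date_range(c("2012-01-01", "2012-12-31"))

# an invalid range raises an error, captured here with try()
bad <- try(check_date_range(c("2012-12-31", "2012-01-01")), silent = TRUE)
inherits(bad, "try-error")
```

Failing early at the point where a filter is defined is cheaper than discovering a malformed filter after a long-running extraction.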
Each of the functions described in the Defining filters section only defines a filter. Once all of the required filters have been set, `auk_filter()` should be used to compile them into an AWK script and execute it to produce an output file. So, as an example of bringing all of these steps together, the following commands will extract all Gray Jay and Blue Jay records from Canada and save the results to a tab-separated text file for subsequent use:
output_file <- "ebd_filtered_blja-grja.txt"
ebd <- system.file("extdata/ebd-sample.txt", package = "auk") %>%
  auk_ebd() %>%
  auk_species(species = c("Gray Jay", "Cyanocitta cristata")) %>%
  auk_country(country = "Canada") %>%
  auk_filter(file = output_file)
# tidy up
unlink(output_file)
Filtering the full EBD typically takes at least a couple of hours, so set it running then go grab lunch!
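For readers curious about what "compiling filters into an AWK script" can mean in practice, the following base R sketch builds an AWK program string from a list of simple equality filters. Everything here is an assumption made for illustration: the filter structure, the helper `to_condition()`, and the column positions are invented and do not reflect the real EBD layout or the script that `auk_filter()` actually generates.

```r
# Hypothetical filters: column positions are invented for illustration
filters <- list(
  list(column = 2, values = c("Gray Jay", "Blue Jay")),
  list(column = 5, values = "Canada")
)

# One condition per filter; values within a filter are OR-ed together
to_condition <- function(f) {
  tests <- sprintf('$%d == "%s"', f$column, f$values)
  paste0("(", paste(tests, collapse = " || "), ")")
}

# Conditions from different filters are AND-ed; matching lines are printed
awk_program <- paste0(
  'BEGIN { FS = "\\t" } ',
  paste(vapply(filters, to_condition, character(1)), collapse = " && "),
  " { print }"
)

cat(awk_program, "\n")
```

Because AWK evaluates such a program line by line, the full dataset never has to be loaded into memory, which is the reason the filtering is delegated to AWK rather than done in R.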
EBD files can be read with `read_ebd()`:
system.file("extdata/ebd-sample.txt", package = "auk") %>%
  read_ebd() %>%
  str()
For many applications, presence-only data are sufficient; however, for modeling and analysis, presence-absence data are required. `auk` includes functionality to produce presence-absence data from eBird checklists. For full details, consult the vignette: `vignette("auk")`.
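The way absences are inferred from complete checklists can be illustrated with a small base R sketch. The data frames and column names below are made up for this example, and this is not the implementation used by `auk`; the vignette describes the supported workflow.

```r
# Toy data (made up): three complete checklists ...
checklists <- data.frame(
  checklist_id = c("S1", "S2", "S3"),
  stringsAsFactors = FALSE
)
# ... and the checklists on which Gray Jay was reported
sightings <- data.frame(
  checklist_id = c("S1", "S3"),
  species = "Gray Jay",
  stringsAsFactors = FALSE
)

# A complete checklist with no record of the species is an inferred absence
pa <- data.frame(
  checklist_id = checklists$checklist_id,
  species = "Gray Jay",
  present = checklists$checklist_id %in% sightings$checklist_id,
  stringsAsFactors = FALSE
)
pa
```

This inference is only valid when every checklist is complete, which is why filtering with `auk_complete()` matters for presence-absence data.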
This package is based on the AWK scripts provided in a presentation given by Wesley Hochachka, Daniel Fink, Tom Auer, and Frank La Sorte at the 2016 NAOC eBird Data Workshop on August 15, 2016.
eBird Basic Dataset. Version: ebd_relFeb-2017. Cornell Lab of Ornithology, Ithaca, New York. May 2013.