The naturecounts
package restricts users to three types of data filters in order not to overload the NatureCounts server. However, it is fairly straightforward to apply data filters after the download.
In this vignette we will go over several different types of filtering using the tidyverse
collection of packages. In particular, we will make extensive use of the filter()
function from the dplyr
package. For more information about this function, see the chapter on Data Transformations in R for Data Science by Hadley Wickham and Garrett Grolemund.
For each example, we will show case how to filter your data either directly downloaded as a data frame, or obtained from a SQLite database.
First let's get a collection of chickadee data to use. We'll collect all public data on Carolina, black-capped and mountain chickadees (see the vignette on species codes) for more information on how to find species codes).
The following examples use the "testuser" user which is not available to you. You can quickly sign up for a free account of your own to access and play around with these examples. Simply replace
testuser
with your own username.
Packages
library(naturecounts) library(dplyr) library(lubridate)
Downloading data directly to a Data Frame
boreal_chickadee_df <- nc_data_dl(species = 14320, collection = "BBS", verbose = FALSE, username = "testuser", info = "nc_vignette")
Downloading data to a SQLite DataBase (chickadee.nc
)
unlink("boreal_chickadee.nc") con <- nc_data_dl(species = 14320, collection = "BBS", sql_db = "boreal_chickadee", verbose = FALSE, username = "testuser", info = "nc_vignette") boreal_chickadee_db <- tbl(con, "naturecounts")
DBI::dbDisconnect(con)
Or, if starting a new session with an existing SQLite database, first create the connection to the database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "boreal_chickadee.nc") boreal_chickadee_db <- tbl(con, "naturecounts")
When using a database, you can choose to collect()
your data into a data frame (in which case it will be identical to chickadee_df
, but if your database is large, it will be faster to leave it as a database as long as possible.
boreal_chickadee_df <- collect(boreal_chickadee_db)
We will also use a helper function to add columns for date
and doy
(day of year) to our data.
For data frames, you directly modify the data frame object.
boreal_chickadee_df <- format_dates(boreal_chickadee_df, overwrite = TRUE)
For databases, you'll modify the connection, then update your table reference.
Note: This results in on-disk additions to the SQLite database. If you do not wish to make add fields to your SQLite database, you will have to collect()
the data into a data frame first.
con <- format_dates(con) boreal_chickadee_db <- tbl(con, "naturecounts")
Not all dates can be created if survey_year
, survey_month
or survey_day
are missing.
filter(boreal_chickadee_df, is.na(date)) %>% select(survey_year, survey_month, survey_day, date)
Unless stated otherwise, in the following examples, we demonstrate code that works for both data frames and databases.
The simplest way to filter your data is to pick out categories to include or exclude. In each case there are three steps.
These steps work for categories such as collection
, species
, survey_year
, iba_site
, bcr
or any variable that can be expressed as a category.
pull()
grabs the column of interest, unique()
returns only unique values.boreal_chickadee_db %>% select(collection) %>% distinct()
2a) Filter the categories you wish to keep
alberta_atlas <- filter(boreal_chickadee_db, collection %in% c("ABATLAS1", "ABATLAS2"))
2b) Filter the categories you wish to exclude with !
alberta_atlas <- filter(boreal_chickadee_db, !collection %in% c("ABBIRDRECS", "RCBIOTABASE"))
alberta_atlas %>% select(collection) %>% distinct()
Another way to filter your data is to specify numerical ranges include or exclude. In each case there are the same three steps as above.
These steps work for categories such as survey_year
, survey_month
, survey_day
, doy
or any variable that can be expressed as a number.
pull()
grabs the column of interest, summary()
returns a numerical summary of the values.boreal_chickadee_db %>% select(survey_year) %>% summarize(min = min(survey_year, na.rm = TRUE), max = max(survey_year, na.rm = TRUE))
2a) Filter the variable to the range you wish to keep (note that the ,
is an implied &
)
recent <- filter(boreal_chickadee_db, survey_year >= 2005)
2b) Filter the variable to the range you wish to exclude with !
(note that here we need to use the &
explicitly)
recent <- filter(boreal_chickadee_db, !(survey_year >= 1960 & survey_year < 2000))
recent %>% select(survey_year) %>% summarize(min = min(survey_year, na.rm = TRUE), max = max(survey_year, na.rm = TRUE))
Filtering by dates is usually slightly more complex, but fortunately, the filter()
function automatically detects dates which means it's pretty much the same as a numerical filter.
pull()
grabs the column of interest, summary()
returns a summary of the values.boreal_chickadee_db %>% select(date) %>% summarize(min = min(date, na.rm = TRUE), max = max(date, na.rm = TRUE))
2a) Filter the variable to the range you wish to keep (note that the ,
is an implied &
)
field_season <- filter(boreal_chickadee_db, date >= "2010-05-10", date <= "2010-09-21")
2b) Filter the variable to the range you wish to exclude with !
(note that here we need to use the &
explicitly)
omit_field_season <- filter(boreal_chickadee_db, !(date >= "2010-05-10" & date <= "2010-09-21"))
field_season %>% select(date) %>% summarize(min = min(date, na.rm = TRUE), max = max(date, na.rm = TRUE))
unlink("boreal_chickadee.nc")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.