This R-package is for the analysis of recorder behaviour in citizen science projects. This package takes data from citizen science projects that have 'what' (typically species), 'where', 'when', and 'who' fields. From this data the various functions in this package create metrics that quantify the behaviour of the participants (the 'who').
This package accompanies the paper Data-derived metrics describing the behaviour of field-based citizen scientists provide insights for project design and modelling bias by TA August et al in press 2019
# Install the dependencies install.packages(c('adehabitatHR', 'raster', 'sp', 'rgdal')) devtools::install_github('biologicalrecordscentre/recorderMetrics')
library(knitr) knitr::opts_chunk$set(cache = TRUE)
This R package is designed for organisers of citizen science projects and academics interested in the 'citizen' element of citizen science. Specifically this package is designed for wildlife recording citizen science projects where observers are allowed to record what, where and when, they want. However, elements of the package are still useful if this is not the case. The metrics presented here allow organisers to gain new insights in the variation in recorder behaviour. The insights will help organisers to decide which citizen scientists might be interested in a new project, or what sort of project will best fit with the patterns of behaviours in the citizen scientists.
The data needed is what, where, when, who. What was recorded (e.g. species), where was it recorded (i.e. latitude and longitude AND the grid square this falls in), when was it recorded (e.g. 20/01/1997), and who recorded it (e.g. 'John Smith' or 'USER1548785').
# We have included some example data in the r-package library(recorderMetrics) head(cit_sci_data) str(cit_sci_data)
In the case of the example the recorder column is an anonymised number, but you could also use the users name (e.g. 'John Smith') as long as this uniquely identifies your user. If you usernames vary (e.g. 'John Smith', 'J Smith', 'John A Smith', 'JA Smith') you will need to harmonise these first. The same applies for species, numerics are used here but you could have Latin binomials as long a they uniquely identify a species. Note that location is given in latitude and longitude as well as 1km grid square. If you don't have 1km square information you could use the name of the site or a concatenation of latitude and longitude, rounded to a fewer decimal places. Either way this location should be the same for all records submitted by a recorder on an individual visit. For comparability to August et al it should be a 1km square grid cell.
The four axes presented in August et al may not be the most appropriate axes for all citizen science projects however if you wish to use them this can be done using the predictAxes
function. This function is a wrapper for a lot of other functions that we will explore in a moment.
#Run for 10 recorders metrics_axes <- predictAxes(data = cit_sci_data, recorders = unique(cit_sci_data$recorder)[1:10])
# The returned object is a table of the metrics... str(metrics_axes$recorder_metrics) # ...and the axes values head(metrics_axes$axes) # Run the metric all recorders. NOTE: this takes a long time # metrics_axes <- predictAxes(data = cit_sci_data)
Note that recorder metrics are only calculated for recorders with 10 or more active days as below that the metrics cannot be estimated with confidence.
The function we just looked at (predictAxes
) calculates all of the metrics as well as the axes so you can use that to quickly calculate the metrics, however that function makes use of a number of other functions which calculate the metrics. If you want to have more control over the parameters used to calculate the metrics then these functions will give you that level of control.
summerData
Recorder metrics can be biased if there are annual breaks in data collection. In these cases it is better to ensure that only data in the recording period (typically summer), is included. This function is an objective way to identify this recording period.
# Subset this data to summer periods only SD <- summerData(input_data = cit_sci_data, probs = c(0.025, 0.975), date_col = 'date') head(SD) # Data not in the summer period is removed nrow(cit_sci_data) nrow(SD) # The cutoffs used to define summer are also returned attr(SD, which = 'cutoffs')
activityRatio
This function takes in data for a recorder and calculates the activity ratio, total duration and number of active days.
# Get the summer period SD <- summerData(cit_sci_data, date_col = 'date') ar <- activityRatio(recorder_name = 3007, data = cit_sci_data, recorder_col = 'recorder', date_col = 'date', summer_days = attr(SD, 'cutoffs')) # Run the metric for all recorders ar_all <- lapply(unique(cit_sci_data$recorder), FUN = activityRatio, data = cit_sci_data, recorder_col = 'recorder', date_col = 'date', summer_days = attr(SD, 'cutoffs')) # summarise as one table ar_all_sum <- do.call(rbind, ar_all) head(ar_all_sum) hist(ar_all_sum$active_days, breaks = 80, xlab = 'Number of Active days', main = '')
listLength
This function takes in data for a recorder and calculates the list length metrics. These metrics are based around the idea of a 'list', defined as the species recorded at a single location (often a 1km square) on a single day by an individual recorder.
# run for one recorder LL <- listLength(data = cit_sci_data, recorder_name = 3007, threshold = 10, plot = FALSE, sp_col = 'species', date_col = 'date', recorder_col = 'recorder', location_col = 'km_sq') # Run the metric for all recorders LL_all <- lapply(unique(cit_sci_data$recorder), FUN = listLength, data = cit_sci_data, threshold = 10, plot = FALSE, sp_col = 'species', date_col = 'date', recorder_col = 'recorder', location_col = 'km_sq') # summarise as one table LL_all_sum <- do.call(rbind, LL_all) head(LL_all_sum) hist(LL_all_sum$n_lists, breaks = 80, main = '', xlab = 'Number of lists recorded by an individual recorder')
periodicity
This function takes in data for a recorder and calculates the periodicity metrics.
# run for one recorder P <- periodicity(recorder_name = 3007, data = cit_sci_data, date_col = 'date', recorder_col = 'recorder', day_limit = 5) # Run the metric for all recorders P_all <- lapply(unique(cit_sci_data$recorder), FUN = periodicity, data = cit_sci_data, date_col = 'date', recorder_col = 'recorder', day_limit = 5) # summarise as one table P_all_sum <- do.call(rbind, P_all) head(P_all_sum) hist(P_all_sum$max_streak, main = '', xlab = 'Max streak length')
spatialBehaviour
This function takes in data for a recorder and calculates the spatial metrics. Note that the spatial projection of the data allows for estimates of distance in meters. Pay particular attention to the specification of the parameters crs
and new_crs
## get the proj4 strings from http://spatialreference.org # current form is lat long WGS_84 <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs" # I want to change to UK national grid as that is in meters UKNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs" SB <- spatialBehaviour(recorder_name = 3007, data = cit_sci_data, crs = WGS_84, new_crs = UKNG, y_col = 'lat', x_col = 'long', recorder_col = 'recorder') # Plot the polygons that capture 95% of a recorders records plot(SB$poly_upper) # Run for more than one recorder, this can be slow SB_all <- lapply(unique(cit_sci_data$recorder)[1:10], FUN = spatialBehaviour, data = cit_sci_data, crs = WGS_84, new_crs = UKNG, y_col = 'lat', x_col = 'long', recorder_col = 'recorder') # summarise as one table SB_all_sum <- do.call(rbind, SB_all) # This contains the polygons which I drop here # so we can see the other results in a table head(SB_all_sum)[,c(1,6:10)]
speciesRarity
This function takes in data for a recorder and calculates the recorder's rarity metrics.
# Run for a single recorder SR <- speciesRarity(recorder_name = 3007, data = cit_sci_data, sp_col = 'species', recorder_col = 'recorder') # Run the metric for all recorders SR_all <- lapply(unique(cit_sci_data$recorder), FUN = speciesRarity, data = cit_sci_data, sp_col = 'species', recorder_col = 'recorder') # summarise as one table SR_all_sum <- do.call(rbind, SR_all) head(SR_all_sum) hist(SR_all_sum$median_diff_rarity, breaks = 20, main = '', xlab = 'Median Rarity Difference')
When you are calculating this metric over large scales, for example across Europe or North America, the metric can become biased because the observable species vary spatially. As an alternative the metric can be calculated at a regional level (e.g. country/state), or using a buffer around the records of the recorder.
## Accounting for spatial restriction in movement # If recorders where restricted to the countries that # make up GB (Scotland England and Wales). We should # analyse the data by country library(sp) # Convert our citizen science data to a SpatialPointsDataframe SP <- SpatialPointsDataFrame(data = cit_sci_data, coords = cit_sci_data[,c('long','lat')]) # Define lat long coordinate system CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") proj4string(SP) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") # Empty object for all results SR_all_countries <- NULL # Loop through counties for(i in unique(GB$NAME)){ # Subset by country SP_C <- SP[GB[GB$NAME == i, ], ] # Calculate the metric within country SR_one_country <- lapply(unique(SP_C$recorder), FUN = speciesRarity, data = SP_C@data, sp_col = 'species', recorder_col = 'recorder') # combine data SR_one_country <- do.call(rbind, SR_one_country) SR_one_country$country <- i SR_all_countries <- rbind(SR_all_countries, SR_one_country) } # Note that recorders that have recorded in more than # one country are replicated in our results (n = 75) sum(table(SR_all_countries$recorder) > 1) # Alternativly we can subset data by a buffer around the # recorders records, rather than by country. # Here I use a buffer of 150km library(raster) library(rgeos) # Empty object for all results SR_all_150km_buffer <- NULL for(i in unique(SP$recorder)){ SP_R <- SP[SP$recorder == i, ] SP_R_buffer <- buffer(SP_R, 150000) SP_P <- SP[SP_R_buffer, ] SR_one_buffer <- speciesRarity(recorder_name = i, data = SP_P@data, sp_col = 'species', recorder_col = 'recorder') SR_all_150km_buffer <- rbind(SR_all_150km_buffer, SR_one_buffer) } # Compare results with original analysis combo <- merge(y = SR_all_150km_buffer, x = SR_all_sum, by = 'recorder') plot(combo$median_diff_rarity.x[combo$n.x > 10], combo$median_diff_rarity.y[combo$n.x > 10], xlab = 'Original', ylab = 'By buffer', main = 'Median Rarity Difference') abline(0,1)
taxaBreadth
These metrics describe the 'experience' the recorder has had recording species within the group.
# Calculate the taxa breadth metrics for one recorder TB <- taxaBreadth(recorder_name = 3007, data = cit_sci_data, sp_col = 'species', recorder_col = 'recorder') # Run for more than one recorder, this can be slow TB_all <- lapply(unique(cit_sci_data$recorder), FUN = taxaBreadth, data = cit_sci_data, sp_col = 'species', recorder_col = 'recorder') # summarise as one table TB_all_sum <- do.call(rbind, TB_all) head(TB_all_sum) hist(TB_all_sum$taxa_prop, breaks = 40, main = '', xlab = 'Proportion of Taxa Recorded')
When you are calculating this metric over large scales, for example across Europe or North America, the metric can become biased because the denominator (total number of species) varies spatially. As an alternative the metric can be calculated at a regional level (e.g. country/state), or using a buffer around the records of the recorder.
library(sp) # Convert our citizen science data to a SpatialPointsDataframe SP <- SpatialPointsDataFrame(data = cit_sci_data, coords = cit_sci_data[,c('long','lat')]) # Define lat long coordinate system CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") proj4string(SP) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") # Empty object for all results TB_all_countries <- NULL # Loop through counties for(i in unique(GB$NAME)){ # Subset by country SP_C <- SP[GB[GB$NAME == i, ], ] # Calculate the metric within country TB_one_country <- lapply(unique(SP_C$recorder), FUN = taxaBreadth, data = SP_C@data, sp_col = 'species', recorder_col = 'recorder') # combine data TB_one_country <- do.call(rbind, TB_one_country) TB_one_country$country <- i TB_all_countries <- rbind(TB_all_countries, TB_one_country) } # Note that recorders that have recorded in more than # one country are replicated in our results (n = 75) sum(table(TB_all_countries$recorder) > 1) # Alternativly we can subset data by a buffer around the # recorders records, rather than by country. # Here I use a buffer of 30km library(raster) library(rgeos) # Empty object for all results TB_all_30km_buffer <- NULL for(i in unique(SP$recorder)){ SP_R <- SP[SP$recorder == i, ] SP_R_buffer <- buffer(SP_R, 30000) SP_P <- SP[SP_R_buffer, ] TB_one_buffer <- taxaBreadth(recorder_name = i, data = SP_P@data, sp_col = 'species', recorder_col = 'recorder') TB_all_30km_buffer <- rbind(TB_all_30km_buffer, TB_one_buffer) } # Compare results with original analysis combo <- merge(y = TB_all_30km_buffer, x = TB_all_sum, by = 'recorder') plot(combo$taxa_prop.x[combo$n.x > 10], combo$taxa_prop.y[combo$n.x > 10], xlab = 'Original', ylab = 'By buffer', main = 'Proportion of taxa recorded') abline(0,1)
weeklyDevotedDays
These metrics describe frequency of recording in weeks in which the recorder made observations.
# Run for one recorder WDD <- weeklyDevotedDays(recorder_name = 3007, data = cit_sci_data, recorder_col = 'recorder', date_col = 'date') # Run for more than one recorder, this can be slow WDD_all <- lapply(unique(cit_sci_data$recorder), FUN = weeklyDevotedDays, data = cit_sci_data, recorder_col = 'recorder', date_col = 'date') # summarise as one table WDD_all_sum <- do.call(rbind, WDD_all) head(WDD_all_sum) hist(WDD_all_sum$median_weekly_devoted_days, main = '', xlab = 'Median Weekly Devoted days')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.