title: "cosore-working-with-data"
author: "Ben Bond-Lamberty"
date: "r Sys.Date()
"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
knitr::opts_chunk$set( echo = TRUE, collapse = TRUE, comment = "#>" )
Soil respiration--the flux of CO2 from the soil surface to the atmosphere--has been measured by 'continuous' (e.g. half-hourly) measurement systems in many places over the last 20 years. The goal of the COSORE database is to collect many of these data into a single open community resource to support and speed up data synthesis and to fight the file drawer problem.
The database is distributed as an R package. Use the remotes
or devtools
packages
to install COSORE, e.g. devtools::install_github("bpbond/cosore")
.
(For non-R users, the database is also distributed as a zip file of comma-separated files, available through the package's release on GitHub. When you download and extract the file, there's a README, the data in two separate formats, and a number of other files, including a version of this vignette.)
But how do we work with this database, exactly? Let's start by loading it into R:
library(cosore)
The database is comprised of a collection of datasets, each converted to a standard format and units. A dataset is one or more files of continuous (automated) soil respiration data, with accompanying metadata, with all measurements (i) taken at a single site (although different chambers can have individual geographic coordinates) and (ii) with constant treatment assignments. In R, each dataset is a list of data frames:
dataset 1 |- description table (a data.frame) |- contributors table (ditto...) |- ports table |- columns table |- ancillary table |- data table |- diagnostics table dataset 2 | |- etc.
For most analyses we want to extract one or more of these pieces and combine them--for example, to get a single dataset, a table of contributors, or an overview of the entire database.
The package provides a useful function that gives an overview of the entire database:
db_info <- csr_database() tibble::glimpse(db_info)
There's lots of information here, one row per dataset, including dataset name; geographic location; number of records; vegetation types; and gases, fluxes, and dates measured. Much of this is also summarized in the Report-all.html
file included with the data in each release.
To begin, we pick a single dataset (d20190415_VARNER
), get some information about it, and plot it.
varner <- csr_dataset("d20190415_VARNER") tibble::glimpse(varner$description)
The description
table gives the basic information about this dataset:
where it was measured, the time zone that the data
timestamps are in,
instrument used, and citation and acknowledgment information.
If you have questions about what COSORE fields contain, their units, etc., the csr_metadata()
function returns a table with full information about this. This information is also packaged into each flat-file release.
Next, we want to look at the actual data:
sr <- varner$data nrow(sr) summary(sr)
This dataset has r format(nrow(sr), big.mark = ",")
observations; extends from April 2003 to December 2006; and soil respiration was measured using eight chambers, along with air and 5 cm soil temperature. Visualizing it:
library(ggplot2) theme_set(theme_minimal()) # so much nicer ggplot(sr, aes(CSR_TIMESTAMP_BEGIN, CSR_FLUX_CO2, color = CSR_PORT)) + geom_point(size = 0.5, alpha = 0.25) + coord_cartesian(ylim = c(0, 20)) library(lubridate, warn.conflicts = FALSE) ggplot(sr, aes(CSR_T5, CSR_FLUX_CO2)) + facet_wrap(~month(CSR_TIMESTAMP_BEGIN), scales = "free") + geom_point(size = 0.5, alpha = 0.25)
The November data include one very large flux that we'd probably want to exclude.
Did these eight different ports (chambers) represent different treatments?
We might want to exclude treatment collars, or color them differently
in the plots above. The ports
table holds this information.
tibble::glimpse(varner$ports)
From this we see that the only CSR_PORT
entry is zero (meaning that this information applies to all ports), has has a CSR_TREATMENT
of "None". Also, the collars were r format(varner$ports$CSR_AREA, big.mark = ",")
cm2.
Finally, we can use the description
table information to get a full reference (which you will definitely cite in your published analysis, right?):
doi <- varner$description$CSR_PRIMARY_PUB print(doi) try({ # in case you don't have 'rcrossref' installed... library(rcrossref) cr_cn(dois = doi, format = "text") })
Time for something more ambitious: let's examine how soil respiration varies over the course of the day in temperate deciduous forests. For this we use the csr_table()
function, which combined data across multiple datasets.
dbf_datasets <- subset(db_info, CSR_IGBP == "Deciduous broadleaf forest")$CSR_DATASET tdf <- csr_table("description", dbf_datasets)
# Make a map of these datasets library(sp) library(leaflet) map <- data.frame(lon = tdf$CSR_LONGITUDE, lat = tdf$CSR_LATITUDE) coordinates(map) <- ~lon + lat leaflet(map) %>% addMarkers() %>% addTiles()
There are r nrow(tdf)
datasets here. Extract and visualize their data:
tdf_dat <- csr_table("data", dbf_datasets, quiet = TRUE)
r format(nrow(tdf_dat), big.mark = ",")
rows of data!
ggplot(tdf_dat, aes(CSR_TIMESTAMP_BEGIN, CSR_FLUX_CO2, color = CSR_DATASET)) + geom_point(size = 0.5, alpha = 0.25) + scale_color_discrete(guide = FALSE) + coord_cartesian(ylim = c(0, 20))
The original question we were interested in was how respiration varies over the course of the day. Say we're also interested in site latitude as a possible covariate, requiring us to join together two of the tables we've extracted.
site_info <- tdf[c("CSR_DATASET", "CSR_SITE_NAME", "CSR_LATITUDE")] # join two tables together tdf_combined <- merge(tdf_dat, site_info, by = "CSR_DATASET") # add some new fields tdf_combined$Hour <- hour(tdf_combined$CSR_TIMESTAMP_BEGIN) tdf_combined$Month <- month(tdf_combined$CSR_TIMESTAMP_BEGIN) # for each month, compute mean flux for each hour of the day tdf_smry <- aggregate(CSR_FLUX_CO2 ~ CSR_DATASET + CSR_LATITUDE + Month + Hour, FUN = mean, data = tdf_combined) ggplot(tdf_smry, aes(Hour, CSR_FLUX_CO2, color = CSR_LATITUDE)) + geom_point(size = 0.5) + facet_wrap(~Month)
Note that in COSORE all timestamps are in the site's local, standard time.
This vignette has mostly focused on the description
and data
tables, and briefly mentioned ports
and contributors
. Others include:
contributors
table: dataset contributors. The first person listed should be considered the primary point of contact for questions about the data or offers of co-authorship.ancillary
: ancillary site-level data such as leaf area index, net primary production, soil texture, etc. All optional.Two additional tables provide information about the processing of raw (contributed) data into standardized COSORE datasets:
columns
: describes how columns in the original dataset (i.e. as contributed) were mapped to the COSORE standard fields, including any unit changes or transformations.diagnostics
: metadata about the data ingestion process: rows and columns removed, errors, etc.These can all be extracted using the csr_table()
function shown above.
csr_database()
, demonstrated above, provides a database overviewcsr_dataset()
, demonstrated above, loads a single datasetcsr_table()
, demonstrated above, load a single table across multiple datasetscsr_report_database()
generates a HTML summary of the entire databasecsr_report_dataset()
generates an HTML summary of a single datasetcsr_metadata()
returns an informational table describing columns in all tablesFeedback is welcome on any aspects of the database design, strengths, limitations, formats, documentation...please open a GitHub issue or email; see the README.
Interested in contributing data to COSORE? Pleae contact the maintainer or open a GitHub issue.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.