title: "cosore-working-with-data" author: "Ben Bond-Lamberty" date: "r Sys.Date()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}


  knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>"
  )

Introduction

Soil respiration--the flux of CO2 from the soil surface to the atmosphere--has been measured by 'continuous' (e.g. half-hourly) measurement systems in many places over the last 20 years. The goal of the COSORE database is to collect many of these data into a single open community resource to support and speed up data synthesis and to fight the file drawer problem.

The database is distributed as an R package. Use the remotes or devtools packages to install COSORE, e.g. devtools::install_github("bpbond/cosore").

(For non-R users, the database is also distributed as a zip file of comma-separated files, available through the package's release on GitHub. When you download and extract the file, there's a README, the data in two separate formats, and a number of other files, including a version of this vignette.)

But how do we work with this database, exactly? Let's start by loading it into R:

library(cosore)

The database is comprised of a collection of datasets, each converted to a standard format and units. A dataset is one or more files of continuous (automated) soil respiration data, with accompanying metadata, with all measurements (i) taken at a single site (although different chambers can have individual geographic coordinates) and (ii) with constant treatment assignments. In R, each dataset is a list of data frames:

dataset 1
|- description table (a data.frame)
|- contributors table (ditto...)
|- ports table
|- columns table
|- ancillary table
|- data table
|- diagnostics table

dataset 2
|   |- etc.

For most analyses we want to extract one or more of these pieces and combine them--for example, to get a single dataset, a table of contributors, or an overview of the entire database.

The package provides a useful function that gives an overview of the entire database:

db_info <- csr_database()
tibble::glimpse(db_info)

There's lots of information here, one row per dataset, including dataset name; geographic location; number of records; vegetation types; and gases, fluxes, and dates measured. Much of this is also summarized in the Report-all.html file included with the data in each release.

Exploring a single dataset

To begin, we pick a single dataset (d20190415_VARNER), get some information about it, and plot it.

varner <- csr_dataset("d20190415_VARNER")
tibble::glimpse(varner$description)

The description table gives the basic information about this dataset: where it was measured, the time zone that the data timestamps are in, instrument used, and citation and acknowledgment information.

If you have questions about what COSORE fields contain, their units, etc., the csr_metadata() function returns a table with full information about this. This information is also packaged into each flat-file release.

Next, we want to look at the actual data:

sr <- varner$data
nrow(sr)
summary(sr)

This dataset has r format(nrow(sr), big.mark = ",") observations; extends from April 2003 to December 2006; and soil respiration was measured using eight chambers, along with air and 5 cm soil temperature. Visualizing it:

library(ggplot2)
theme_set(theme_minimal()) # so much nicer
ggplot(sr, aes(CSR_TIMESTAMP_BEGIN, CSR_FLUX_CO2, color = CSR_PORT)) +
  geom_point(size = 0.5, alpha = 0.25) +
  coord_cartesian(ylim = c(0, 20))

library(lubridate, warn.conflicts = FALSE)
ggplot(sr, aes(CSR_T5, CSR_FLUX_CO2)) + 
  facet_wrap(~month(CSR_TIMESTAMP_BEGIN), scales = "free") +
  geom_point(size = 0.5, alpha = 0.25)

The November data include one very large flux that we'd probably want to exclude.

Did these eight different ports (chambers) represent different treatments? We might want to exclude treatment collars, or color them differently in the plots above. The ports table holds this information.

tibble::glimpse(varner$ports)

From this we see that the only CSR_PORT entry is zero (meaning that this information applies to all ports), has has a CSR_TREATMENT of "None". Also, the collars were r format(varner$ports$CSR_AREA, big.mark = ",") cm2.

Finally, we can use the description table information to get a full reference (which you will definitely cite in your published analysis, right?):

doi <- varner$description$CSR_PRIMARY_PUB
print(doi)
try({  # in case you don't have 'rcrossref' installed...
  library(rcrossref)
  cr_cn(dois = doi, format = "text")
})

Selecting and combining multiple datasets

Time for something more ambitious: let's examine how soil respiration varies over the course of the day in temperate deciduous forests. For this we use the csr_table() function, which combined data across multiple datasets.

dbf_datasets <- subset(db_info, CSR_IGBP == "Deciduous broadleaf forest")$CSR_DATASET
tdf <- csr_table("description", dbf_datasets)
# Make a map of these datasets
library(sp)
library(leaflet)
map <- data.frame(lon = tdf$CSR_LONGITUDE, lat = tdf$CSR_LATITUDE)
coordinates(map) <- ~lon + lat
leaflet(map) %>% 
  addMarkers() %>% 
  addTiles()

There are r nrow(tdf) datasets here. Extract and visualize their data:

tdf_dat <- csr_table("data", dbf_datasets, quiet = TRUE)

r format(nrow(tdf_dat), big.mark = ",") rows of data!

ggplot(tdf_dat, aes(CSR_TIMESTAMP_BEGIN, CSR_FLUX_CO2, color = CSR_DATASET)) + 
  geom_point(size = 0.5, alpha = 0.25) + 
  scale_color_discrete(guide = FALSE) +
  coord_cartesian(ylim = c(0, 20))

The original question we were interested in was how respiration varies over the course of the day. Say we're also interested in site latitude as a possible covariate, requiring us to join together two of the tables we've extracted.

site_info <- tdf[c("CSR_DATASET", "CSR_SITE_NAME", "CSR_LATITUDE")]
# join two tables together
tdf_combined <- merge(tdf_dat, site_info, by = "CSR_DATASET")
# add some new fields
tdf_combined$Hour <- hour(tdf_combined$CSR_TIMESTAMP_BEGIN)
tdf_combined$Month <- month(tdf_combined$CSR_TIMESTAMP_BEGIN)

# for each month, compute mean flux for each hour of the day
tdf_smry <- aggregate(CSR_FLUX_CO2 ~ CSR_DATASET + CSR_LATITUDE + Month + Hour,
                      FUN = mean, data = tdf_combined)

ggplot(tdf_smry, aes(Hour, CSR_FLUX_CO2, color = CSR_LATITUDE)) + 
  geom_point(size = 0.5) + facet_wrap(~Month)

Note that in COSORE all timestamps are in the site's local, standard time.

Other tables and data

This vignette has mostly focused on the description and data tables, and briefly mentioned ports and contributors. Others include:

Two additional tables provide information about the processing of raw (contributed) data into standardized COSORE datasets:

These can all be extracted using the csr_table() function shown above.

Useful package functions for users

Feedback and contributions

Feedback is welcome on any aspects of the database design, strengths, limitations, formats, documentation...please open a GitHub issue or email; see the README.

Interested in contributing data to COSORE? Pleae contact the maintainer or open a GitHub issue.



bpbond/cosore documentation built on July 20, 2021, 3:17 p.m.