knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) folder <- 'resources'
sapfluxnetr
R package provides tools for a tidy data analysis for the first
sap flow measurements global database
(SAPFLUXNET Project). In this vignette you
will learn how to install the package, download the data and get started with
some data metrics.
sapfluxnetr
is in the CRAN, so the installation is straightforward:
install.packages('sapfluxnetr')
Development versions of the package reside in github. If you want the latest
updates, and also the latest bugs ( be advised ;) ), please install the
package from the devel branch of the
github repository with the
remotes
package:
# if (!require(remotes)) {install.packages('remotes')} remotes::install_github( 'sapfluxnet/sapfluxnetr', ref = 'devel', build_opts = c("--no-resave-data", "--no-manual", "--build-vignettes") )
Now you can load the package, along with the tidyverse
-related packages, as
they will be needed later:
library(sapfluxnetr) # tidyverse library(dplyr) library(ggplot2)
.zip file is about 3GB. Unzipped folders use about 24GB of disk space. Please check thet you have enough space available before downloading and unzipping the files.
Data is publicly available on Zenodo to download. You can manually download the data or use a programmatic approach:
# download the data download.file( url = "https://zenodo.org/record/3971689/files/0.1.5.zip?download=1", destfile = '0.1.5.zip' ) # unzip the data # BE SURE YOU HAVE AT LEAST 24GB OF DISK SPACE unzip("0.1.5.zip") # check if files are present list.files(file.path('0.1.5', 'RData', 'plant')) list.files(file.path('0.1.5', 'csv', 'plant'))
SAPFLUXNET database structure is as follows:
0.1.5 (database version number) | |-- RData | | | |-- plant | |-- sapwood | |-- leaf | |-- csv | | | |-- plant | |-- sapwood | |-- leaf
RData
folder contains the RData files for each site divided by sap flow units
level:
plant
sites with sap flow plant level units available. sapwood
sites with sap flow sapwood level units available. leaf
sites with sap flow leaf level units available.csv
folder contains the csv files (9 files, 5 of metadata, 2 of data and 2 more
for the data flags) for each units level available and site. We do not provide
scripts or functions to work with the csv files, only the RData objects.
To start working with the data, you have two options:
If working in RStudio, create a project in the data root folder (0.1.5
in
this example) and follow the examples in this vignette to get started.
If not, set the data root folder (0.1.5
in this example) as the working
directory with setwd
and follow the examples in this vignette to get
started.
DISCLAIMER: In order to be able to build the vignette in the CRAN tests the following examples will be based on a small subset of SAPFLUXNET Data, composed by
ARG_TRE
,ARG_MAZ
andAUS_CAN_ST2_MIX
sites. Outputs will vary if you follow the vignette examples with the complete database.
First, let's get used to the data structure thet SAPFLUXNET provides, and for thet we will choose a site and start playing with it.
In this example we will use the ARG_MAZ
site, as it is small and it will be
fast seeing the package capabilities. There are sites like FRA_PUE
100 times
bigger then this, but as one can imagine, the time is also increasing when
analyzing those bigger datasets.
So, let's read the data in the environment:
# read the data arg_maz <- read_sfn_data('ARG_MAZ', folder = 'RData/plant') # see a brief summary of the site: arg_maz
# read the data arg_maz <- read_sfn_data('ARG_MAZ', folder = folder) # see a brief summary of the site: arg_maz
At first glance, we know by this summary thet is an Argentinian forest (first
three letters of the site code are always the country code),
contributed by Sebastian Pfautsch and Pablo Peri with 5 Nothofagus pumilio
trees measured along 15 days in 2009. Also we can see the environmental variables
measured (ta
, rh
, vpd
, sw_in
, ws
, precip
, swc_shallow
, ppfd_in
and ext_rad
) and the biome classification. Finally, we can see thet the
environmental data has some flags (more on thet later).
sfn_data
objects have different slots containing the different data, each of
one has a getter function (see ?sfn_get_methods
and
vignette('sfn-data-classes', package = 'sapfluxnetr'
for detailed info):
# sapf data with original site timestamp arg_maz_sapf <- get_sapf_data(arg_maz, solar = FALSE) arg_maz_sapf # env_data with calculated aparent solar time arg_maz_env <- get_env_data(arg_maz, solar = TRUE) arg_maz_env
You can see thet the TIMESTAMP variable changes between both kinds of data. That
is because the TIMESTAMP returned is controlled by the solar
parameter
(see ?sfn_get_methods
).
Metadata can be accessed in the same way:
arg_maz_site_md <- get_site_md(arg_maz) arg_maz_site_md arg_maz_stand_md <- get_stand_md(arg_maz) arg_maz_stand_md arg_maz_species_md <- get_species_md(arg_maz) arg_maz_species_md arg_maz_plant_md <- get_plant_md(arg_maz) arg_maz_plant_md arg_maz_env_md <- get_env_md(arg_maz) arg_maz_env_md
If in doubt about some of the metadata variables (what it means, units...) a
description can be obtained from describe_md_variable
function:
# what is env_ta? describe_md_variable('env_ta') # or pl_species? describe_md_variable('pl_species')
There is also some getters thet can come in handy sometimes. get_timestamp
and
get_solar_timestamp
access to the original timestamp and the apparent solar
time timestamp. get_si_code
access to the site code. See
vignette('sfn-data-classes', package = 'sapfluxnetr'
for more info.
sfn_data
objects also have two more slots, accessed with get_sapf_flags
and
get_env_flags
.
arg_maz_sapf_flags <- get_sapf_flags(arg_maz, solar = TRUE) arg_maz_sapf_flags arg_maz_env_flags <- get_env_flags(arg_maz, solar = TRUE) arg_maz_env_flags
This datasets store any flag thet each data point may have (possible outlier,
data removed in the Quality Check of the data...). For a complete list of
flags possible values see vignette('data-flags', package = 'sapfluxnetr')
.
As an example, let's see which values are marked as "RANGE_WARN" (a warning
indicating thet the value may be out of normal variable range):
arg_maz_env_flags %>% filter_all(any_vars(stringr::str_detect(., 'RANGE_WARN')))
We see thet the out of range warnings refer to wind variable. We can cross the data to see which values of wind speed are giving the warnings,
arg_maz_env_flags %>% filter_all(any_vars(stringr::str_detect(., 'RANGE_WARN'))) %>% semi_join(arg_maz_env, ., by = 'TIMESTAMP') %>% select(TIMESTAMP, ws)
and confirm thet the warnings refer to values above the "usual" wind speed maximum.
We can also plot the different data with the help of sfn_plot
function. It will
return ggplot
objects thet can be modified afterwards:
sfn_plot(arg_maz, type = 'sapf', solar = TRUE) + facet_wrap(~ Tree) + theme(legend.position = 'none')
sfn_plot(arg_maz, type = 'env', solar = TRUE) + facet_wrap(~ Variable, scales = 'free_y') + theme(legend.position = 'none')
We can also plot environmental variables individually (with the type
argument),
or an environmental variable versus the sap flow measurements (with the
formula_env
argument). See ?sfn_plot
for a complete description of the
available plots.
# vpd individually sfn_plot(arg_maz, type = 'vpd', solar = TRUE) # vpd vs sapf sfn_plot(arg_maz, formula_env = ~vpd, solar = TRUE) + theme(legend.position = 'none')
SAPFLUXET data is stored as sub-daily measures with different time step between
sites (ranging from 10 minutes to 2 hours).
sapfluxnetr
offers some simple, yet powerful aggregation functions returning
pre-defined statistics: daily_metrics
, monthly_metrics
, predawn_metrics
,
midday_metrics
, daylight_metrics
and nightly_metrics
.
daily_metrics
and monthly_metrics
perform a daily and monthly aggregation,
respectively. predawn_metrics
, midday_metrics
, daylight_metrics
and
nightly_metrics
perform daily or monthly aggregations (controlled by the
period
argument) only by hour-defined intervals. All the aggregations are
performed both for sap flow and environmental data.
Predefined calculated statistics are:
daily_metrics
, see ?diurnal_centroid
for limitations in the calculation
of this metric)Let's see some examples:
arg_maz_daily <- daily_metrics(arg_maz, solar = TRUE) names(arg_maz_daily) names(arg_maz_daily[['sapf']]) names(arg_maz_daily[['env']])
We can see thet results are divided in sapf
and env
and inside each of them
the metrics are indicated by the end of the variable names.
This way we can select specific variables, for example the 0.95 quantile of sap
flow measures:
arg_maz_daily[['sapf']] %>% select(TIMESTAMP, ends_with('q_95'))
The same is applicable to the environmental data, in this case the mean values:
arg_maz_daily[['env']] %>% select(TIMESTAMP, ends_with('mean'))
If interested in custom metrics or custom aggregations, there is a generic
function, sfn_metrics
thet allows for customization of the statistics to
calculate and the periods to aggregate. See ?sfn_metrics
and
vignette('custom-aggregation'. package = 'sapfluxnetr')
for more details
about it.
It's worth to mention thet aggregated TIMESTAMPS are fixed to the beginning
of the period aggregated, meaning thet data from 2018-01-01 00:00:00
to
2018-01-01 23:59:59
are aggregated as 2018-01-01 00:00:00
.
You can change this using the side
parameter (see ?sfn_metrics
)
The default returned object for the aggregation functions is a list with
the sap flow and the environmental data, but given thet usually is more
comfortable to have all data (sap flow and environmental) and ancillary data
(metadata) altogether in a tidy data frame (each row an observation), all
aggregation functions have an argument, tidy
thet can be set to TRUE
to obtain this kind of data frame. We will cover this in the
"Tidy metrics" section.
Getting the insights about one site is interesting, but getting the insights
of a common group of sites could be even more interesting. sapfluxnetr
allows
filtering sites by metadata values (biome, country, species...) and work
with them as a unique set.
First thing we have to do is creating a metadata database. It is not mandatory, but filtering sites by metadata can be a time/resources consuming step if we have to temporary build the database each time we want filter sites. So, let's create a cached metadata database. This will take some minutes, so maybe it is a good moment to prepare a hot beverage ;)
sfn_metadata <- read_sfn_metadata(folder = 'RData/plant', .write_cache = TRUE)
# sfn_metadata <- read_sfn_metadata(folder = folder, .write_cache = TRUE) sfn_metadata <- sapfluxnetr:::.write_metadata_cache(folder = folder, .dry = TRUE)
The important bit here is .write_cache = TRUE
. This will write a file called
.metadata_cache.RData
containing all the metadata for all sites present in
folder
. This file will be used any time we will filter the metadata, so there
is no need of accessing all the data again.
If we take a look at sfn_metadata
we can see a list with 5 data frames, one for
each metadata class (site, stand, species, plant and environmental metadata).
# access plant metadata sfn_metadata[['plant_md']]
Now thet we have our metadata database built, we can inspect the site codes
in a folder with sfn_sites_in_folder
:
folder <- 'RData/plant/' sites <- sfn_sites_in_folder(folder) sites
sites <- sfn_sites_in_folder(folder) sites
We can filter these sites by any metadata variable, to select those thet met
some criteria. This is done with filter_sites_by_md
. As a first try, let's
list all sites belonging to temperate forests (woodland/shrubland included):
temperate <- sfn_sites_in_folder(folder) %>% filter_sites_by_md( si_biome %in% c('Woodland/Shrubland', 'Temperate forest'), metadata = sfn_metadata ) temperate
temperate <- sfn_sites_in_folder(folder) %>% filter_sites_by_md( si_biome %in% c('Woodland/Shrubland', 'Temperate forest'), metadata = sfn_metadata ) temperate
You can combine all filters you want:
temperate_hr <- sfn_sites_in_folder(folder) %>% filter_sites_by_md( si_biome %in% c('Woodland/Shrubland', 'Temperate forest'), pl_sens_meth == 'HR', metadata = sfn_metadata ) temperate_hr
temperate_hr <- sfn_sites_in_folder(folder) %>% filter_sites_by_md( si_biome %in% c('Woodland/Shrubland', 'Temperate forest'), pl_sens_meth == 'HR', metadata = sfn_metadata ) temperate_hr
Remember thet you can get all the info from a metadata variable with
describe_md_variable
, and also you can get a complete list of metadata
variables for filtering with sfn_vars_to_filter
:
sfn_vars_to_filter() # and see what values we must use for filtering by pl_sens_meth describe_md_variable('pl_sens_meth')
We can load all temperate sites with a simple pipe
temperate_sites <- temperate %>% read_sfn_data(folder = 'RData/plant') temperate_sites
temperate_sites <- temperate %>% read_sfn_data(folder = folder) temperate_sites
This creates an sfn_data_multi
object, which is just a list, but adapted to
contain sfn_data
objects. Main functions of sapfluxnetr
work with this type
of objects. As in any list, we can access the sites:
# the following is the same as temperate_sites[[3]] or # temperate_sites$AUS_CAN_ST2_MIX temperate_sites[['AUS_CAN_ST2_MIX']]
Now we can aggregate all sites at once
temperate_aggregated <- temperate_sites %>% daily_metrics() names(temperate_aggregated)
and voilĂ , all sites aggregated and stored in a list, so we can start working with the data.
Most of the times it comes in handy to have a tidy dataset with all the
sap flow and environmental measurements, along with the metadata for the sites
aggregated, in order to work with the data in an easier way. This can be
done with the metrics_tidyfier
function (see ?metrics_tidyfier
). But
daily_metrics
and related functions also implement the tidy
argument.
For the tidy
argument to work, we must have available the metadata,
see the "metadata section" for more details about this.
This allows for returning directly a metrics data frame combining all sites aggregated along with their metadata, saving one step in the workflow:
temperate_tidy <- temperate_sites %>% daily_metrics(tidy = TRUE, metadata = sfn_metadata) temperate_tidy
Now we can start analyzing, modeling, etc.
For example, to look for site effects in the relationship between sap flow and
vpd, using the 0.95 quantile as maximum values of sap flow and the plant sapwood
area as a third variable:
ggplot( temperate_tidy, aes(x = vpd_mean, y = sapflow_q_95, colour = si_code, size = pl_sapw_area) ) + geom_point(alpha = 0.2)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.