Internal reproducibility => Public reproducibility
L0 code: download PRISM data
.Renviron (file):
Does not get put into git; each user must create their own, for example:
DATADIR=/Volumes/GoogleDrive/Shared\ drives/MacrosystemsBiodiversity/data/projectx
# OR, if data is on your local computer, something like
DATADIR=D:/MSB/downloaded_data/projectx
Tell the user this needs to be set: "You must create an .Renviron file that contains the line DATADIR=/", or set the environment variable before running the script: DATADIR=/my/working/folder Rscript write_fd_files.R
Possibly include a file in the project called something like Renviron-example.txt to be used as a starting point for a personalized .Renviron. A single user may run in multiple environments, and each git clone would need its own .Renviron file (e.g. laptop, HPC, cloud (eventually)).
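A minimal Renviron-example.txt could look like this (the path is a placeholder to be edited after copying the file):

```
# Renviron-example.txt -- copy to .Renviron in your home folder or project root, then edit the path
DATADIR=/path/to/your/data/projectx
```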
Example R code to set options for file paths:
```R
# global variables using R 'options'
datadir <- Sys.getenv("DATADIR")
if (datadir == "") datadir <- "C:/somewhere/reasonable"  # Sys.getenv() returns "" when unset
options(L0dir = file.path(datadir, "L0"))
options(L1dir = file.path(datadir, "L1"))
options(L2dir = file.path(datadir, "L2"))
```
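As a sketch of the "L0 code: download PRISM data" idea at the top of these notes, a download script can then build paths from the option. The URL below is a placeholder, not the real PRISM endpoint:

```R
# untested sketch: download one raw file into the L0 folder
prism_url <- "https://example.org/prism/ppt_2020.zip"  # placeholder URL
l0dir <- getOption("L0dir")
destfile <- file.path(l0dir, "prism", basename(prism_url))
dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
download.file(prism_url, destfile, mode = "wb")
```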
```R
#' Read tree trait data, calculate tree functional diversity, and write the output to tree_pd.csv.
#' Overwrites any existing output file in the given output dir.
#'
#' @param treefile Optional name of the file to read (not the full path, just the name); defaults to trees.csv
#' @param inputdir Optional directory to read from (defaults to the L0dir option)
#' @param outputdir Optional directory to write to (defaults to the L1dir option)
#' @return The full path of the file that was written, or NULL if there was an error
write_tree_fd <- function(treefile = "trees.csv",
                          inputdir = getOption("L0dir"),
                          outputdir = getOption("L1dir")) {
  # note the calculation is in a different function than the one that writes data,
  # so that the calculation may be tested without the side effects of writing files
  treedata <- read.csv(file.path(inputdir, treefile))
  fd <- tree_fd(treedata)
  # validate fd, check for non-zero length, etc.
  # this overwrites existing data
  outputfile <- file.path(outputdir, "tree_pd.csv")
  write.csv(fd, file = outputfile)
  return(outputfile)
}
```
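Because the calculation lives in its own function, it can be exercised without touching the file system. A sketch (tree_fd() and the trait columns are stand-ins for the project's real function and schema):

```R
# test the calculation in memory, with no files read or written
fake_trees <- data.frame(species = c("a", "b", "c"), trait1 = c(1.2, 3.4, 5.6))
fd <- tree_fd(fake_trees)
stopifnot(length(fd) > 0)  # the non-zero-length check mentioned above
```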
Option for the R configuration recommendation: one main "datadir" folder (/Volumes/GoogleDrive/Shared\ drives/MacrosystemsBiodiversity/data/projectx/) or separate folders for each of L0/L1/L2 instead of a single "DATADIR" (I have had times when I wanted to write output to a different folder). For example,
.Renviron:
L0=/Volumes/GoogleDrive/Shared\ drives/MacrosystemsBiodiversity/data/projectx/L0
L1=/Volumes/GoogleDrive/Shared\ drives/MacrosystemsBiodiversity/data/projectx/L1
L2=/Volumes/GoogleDrive/Shared\ drives/MacrosystemsBiodiversity/data/projectx/L2
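The R side then reads each variable separately (a sketch, mirroring the options pattern above):

```R
# one environment variable per data level
options(L0dir = Sys.getenv("L0"))
options(L1dir = Sys.getenv("L1"))
options(L2dir = Sys.getenv("L2"))
```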
* update data with coordinated intention
* quarterly? annually? based on the NEON schedule?
goal: answer "am I using the latest data?" with "yes"
L0 == data obtained from another producer (NEON, PRISM, etc.)
L1 == any data manipulation needed for analysis: filter/group/joins
L2 == output; can be another project's L0
data inside projects:
example: MSB/data/projectx/L0 => data project owners have collected for use in the project
project personnel maintain these folders
Updates for common data must be coordinated: find a week when all projects update their data sources, if needed
MSB/data/L0 => data we work together on and share
how to deal with two possible data folders in R? Two variables: L0= and L0MSB= (see the sketch below)
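A sketch of the two-variable approach (the env var names follow the L0=/L0MSB= suggestion above; values are set in .Renviron):

```R
# one root per data location
project_L0 <- Sys.getenv("L0")     # project-owned data, e.g. MSB/data/projectx/L0
shared_L0  <- Sys.getenv("L0MSB")  # shared data, e.g. MSB/data/L0
```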
who maintains data outside of projects?
Minimally, there should be clear instructions for how L0 data are acquired and, where possible, there should be R scripts. It's easy to spend a lot of time perfecting an L0 script when manual instructions would suffice.
R functions or scripts should be flexible enough to use different L0 paths, set as configuration.
Functions should be of two classes, if possible:
code to get the data and save it into the folder where it belongs, using L0dir as the root.
If the data is at all complex, or has any quirks that need to be remembered aside from a simple
read.csv(file.path(L0dir, 'filename.csv'))
then it may be worth creating functions to read and return the data. Could these be called L0 vs. L1 scripts? (the EDI framework speaks to data, not code.) This functional approach protects the downstream scripts (L1, L2) from small alterations in the data, and is a single point of code for changes. For example, if the data spans multiple years (2018xdata.csv, 2019xdata.csv, 2020xdata.csv) and L1 scripts should be able to use all of these files, write a function:
```R
# untested demo code
read_xdata <- function(datadir){
  # this function knows the file names and folders where x data live
  xdata_folder <- file.path(datadir, "xdata")
  read_by_year <- function(year){
    xdata_file <- file.path(xdata_folder, paste0(as.character(year), "xdata.csv"))
    if (file.exists(xdata_file)) {
      read.csv(xdata_file)
    } else {
      warning(paste(xdata_file, "not found, skipping"))
      NULL  # NULL entries are dropped when the list is rbind-ed below
    }
  }
  # read data for each year; update as more data is stored
  xdata_list <- lapply(c(2018, 2019, 2020), read_by_year)
  # combine the per-year data frames in the list
  xdata <- do.call(rbind, xdata_list)
  return(xdata)
}
```
function(s) to check that the data is present, and potentially print or plot very basic summaries (without manipulation), and possibly return TRUE/FALSE. This splits the tasks of validating data files, reading data, and cleaning/summarizing into separate procedures for easier troubleshooting. See the Separation of Concerns principle from software engineering.
```R
# untested demo code
check_xdata <- function(datadir, years = c(2018, 2019, 2020)){
  # this function knows the file names and folders where x data live
  xdata_folder <- file.path(datadir, "xdata")
  check_by_year <- function(year){
    xdata_file <- file.path(xdata_folder, paste0(as.character(year), "xdata.csv"))
    if (file.exists(xdata_file)) {
      return(TRUE)
    } else {
      warning(paste(xdata_file, "not found"))
      return(FALSE)
    }
  }
  # check the file for each year; update as more data is stored
  checks <- sapply(years, check_by_year)
  return(all(checks))  # all() returns TRUE only if every element is TRUE
}
```
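The two classes then compose into a defensive read (sketch):

```R
# fail fast if any expected file is missing, then read
l0 <- getOption("L0dir")
if (check_xdata(l0)) {
  xdata <- read_xdata(l0)
} else {
  stop("xdata files are missing; see warnings above")
}
```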
There are (at least) three packages to get NEON data (that I know of): neonUtilities, neonDivData, and neonstore.
There are detailed notes and example scripts in the repository https://github.com/NEON-biodiversity/neon_data_examples
I believe that MSB should choose one method and recommend all sub-projects use this method.
neonDivData quickly provides data without downloading. This follows the "data as a package" recommendation we discussed last summer.
```R
# install.packages("devtools")
devtools::install_github("daijiang/neonDivData")
library(neonDivData)
```
Loading it automatically makes available neon_sites, neon_taxa, and data files like data_zooplankton:
```R
# count distinct taxa in a neonDivData data frame
taxa_count <- function(div_data){ length(unique(div_data$taxon_id)) }
zooplankton_count <- function(){ taxa_count(neonDivData::data_zooplankton) }
bird_species_count <- function(){ taxa_count(neonDivData::data_birds) }
```
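Example usage (actual counts depend on the installed neonDivData version):

```R
bird_species_count()
zooplankton_count()
```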
Question to group: How to document the source and state of L1 data?
Example: running a computation on HPC.
Now... how to document the provenance of the results? If the source code is updated in prep for publication, or for use in another analysis, how can we link output data file(s) to that version of the code? How do we indicate who/what created a results file?
1. In all data folders, or the parent folder, include documentation of the files present (README.txt? log?)
2. Use git "tags" to mark the version of code that generated a data file, and indicate that in the data README (can link to the GitHub tag)
3. Include a copy of the code used to create all files in data folders?
4. Add versioning to all file names? (requires using these versioned file names in all of our code)
5. How to indicate the version of the inputs, or the parameter values used to create this output?
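A minimal sketch of option 2, assuming git is on the PATH and the script runs inside the repo clone (the README name and fields are placeholders):

```R
# record code version and author next to the output file
commit <- system("git rev-parse --short HEAD", intern = TRUE)
writeLines(c(
  paste("created:", format(Sys.time())),
  paste("code version (git commit):", commit),
  paste("created by:", Sys.info()[["user"]])
), file.path(getOption("L1dir"), "tree_pd_README.txt"))
```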