```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This vignette documents the workflow used to create the target dataset. The training dataset is extracted from the target dataset at selected labelled locations.
```r
library(magrittr)
library(RiverML)
library(raster)
```
The dataset of target streamlines is loaded with `get_target_streamlines()`. The target streamlines for nine regions are included in the package and accessed with objects such as `target_streamlines_SFE`, where `SFE` is the region code corresponding to the catchment of the South Fork Eel River, California, USA.
The nine regions are:
| Region name            | Region code |
|:-----------------------|:-----------:|
| South Fork Eel         | SFE         |
| Klamath                | K           |
| North Coast            | NC          |
| North Central Coast    | NCC         |
| South Central Coast    | SCC         |
| South Coast            | SC          |
| Sacramento             | SAC         |
| South East California  | SECA        |
| San Joaquin Tulare     | SJT         |
## target_streamlines
We first define a region of study with the variable `region` and check that the output of `get_target_streamlines()` and the lazy-loaded dataset `target_streamlines_SFE` are identical.
region <- "SFE" target_streamlines <- get_target_streamlines(region) all.equal(target_streamlines, target_streamlines_SFE) target_streamlines
## target_points
From the `target_streamlines`, we now extract the `target_points` using `get_target_points()`, which returns a `data.frame` containing the location of the midpoints of all the 200-m stream intervals.
```r
target_points <- get_target_points(target_streamlines)
head(target_points)
```
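As an optional sanity check, you can overlay the extracted midpoints on the streamlines. This is only a suggestion, not part of the package workflow, and the `X`/`Y` column names below are assumptions about the layout of `target_points`; inspect `head(target_points)` for the actual coordinate columns.

```r
# Optional visual check: streamlines in grey, extracted midpoints in red.
# The X/Y column names are assumed, not guaranteed by the package.
plot(target_streamlines, col = "grey50")
points(target_points$X, target_points$Y, pch = 20, cex = 0.3, col = "red")
```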
You will want to save this `target_points` `data.frame` as a separate `.csv` file following the naming pattern `region_all_input.csv` (e.g., `SFE_all_input.csv`). Such a file is required for the next step, high-performance computing.
```r
fname <- paste0(region, "_all_input.csv")
write.csv(target_points, file.path("path/to/output/", fname))
```
At this stage, the goal is to use High Performance Computing (HPC) to massively parallelize the extraction of the data at the locations defined by `target_points`. While it is possible to request one HPC node per location in `target_points`, this is highly inefficient. As one HPC admin once told me, if you do so, most of your time might be spent reading the required data rather than actually computing.
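A back-of-the-envelope calculation illustrates the point, assuming (as in the call below) that reading the input data is a one-off cost of about 20 s per job while one iteration computes in about 4 s:

```r
io_time <- 20e3        # one-off read time per job, in milliseconds (assumed)
computing_time <- 4e3  # compute time per iteration, in milliseconds (assumed)
# fraction of wall time spent on I/O for 1 vs 100 iterations per job
io_time / (io_time + 1 * computing_time)    # ~0.83: mostly reading data
io_time / (io_time + 100 * computing_time)  # ~0.05: mostly computing
```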
To estimate the optimum ratio of nodes to compute iterations, we use `HPC_optim()`, which requires that you first profile one iteration of the code, for example with the `profvis` package.
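A minimal profiling sketch could look like the following; the profiled calls mirror the TAM-DM chunk shown later in this vignette, the local DEM path is a placeholder, and `target_points` is passed directly under the assumption that it has the same layout as the `input_data` read back from the CSV on the HPC side. Any representative single iteration will do.

```r
# Profile one iteration to split its cost between I/O (reading the DEM) and
# compute (terrain analysis); the flame graph gives rough io_time and
# computing_time values (in milliseconds) for HPC_optim().
library(profvis)
profvis({
  DEM <- raster::raster("path/to/DEM_SFE.tif")              # placeholder path; I/O-dominated
  polys <- get_input_polygons(target_points, 1, DEM, .dl = 25)
  ls <- get_terrain_metrics(polys, DEM, curvature = FALSE)   # compute-dominated
})
```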
```r
optim_df <- HPC_optim(
  nrow(target_points),
  hpc_lim = 1e4,
  io_time = 20e3,
  computing_time = 4e3,
  AssocGrpCpuLimit = 258,
  core_per_task = 1
)
optim_df
```
This indicates that we should submit `r optim_df$n_array` HPC jobs, each computing `r optim_df$task_per_array` iterations, with an estimated time of `r signif(optim_df$exec_time_min, 3)` minutes per array.
Take this estimated execution time with a grain of salt: the `io_time` and `computing_time` values are usually derived on a local computer, which likely has more RAM than a single HPC node and faster I/O than the connection between the HPC nodes and wherever your input data is stored.
Armed with the optimized number of arrays and tasks per array, the HPC job can be submitted.
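On a SLURM cluster, the optimized job count typically ends up in the array-submission call. The following is only a hypothetical sketch: the wrapper script name and sbatch options are placeholders, not the package's actual submission setup; the two trailing arguments are the output directory and run type described below.

```r
# Build and (optionally) launch an sbatch array using the optimized job count;
# run_tam_dm.sh is a hypothetical wrapper around the srun call shown below.
cmd <- sprintf(
  "sbatch --array=1-%d run_tam_dm.sh /home/hguillon/out/run124 SFE_all",
  optim_df$n_array
)
cmd
# system(cmd)  # uncomment on the HPC login node
```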
This job will compute terrain analysis metrics (TAM, e.g., slope, curvature), and more specifically distribution metrics (DM) of such TAM (e.g., mean, max).
To derive these, a digital elevation model (DEM) is tiled around the locations specified by `target_points`. In our case, the input DEM is a 10-m DEM of California, and its size prevents us from making the rest of the vignette executable. In the following, the HPC code is detailed and commented.
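To make the tiling concrete, here is a small, self-contained sketch of cropping one tile around a single target point and computing one terrain metric with `raster`. It is only an illustration under stated assumptions (a local DEM path, a projected CRS in meters, `X`/`Y` coordinate columns, and an arbitrary 500-m half-width); the package itself does this through `get_input_polygons()` and `get_terrain_metrics()`.

```r
library(raster)
DEM <- raster("path/to/DEM_SFE.tif")  # placeholder path to a local 10-m DEM
pt <- target_points[1, ]              # one midpoint; X/Y column names assumed
dl <- 500                             # hypothetical tile half-width, in meters
tile_extent <- extent(pt$X - dl, pt$X + dl, pt$Y - dl, pt$Y + dl)
dem_tile <- crop(DEM, tile_extent)    # tile the DEM around the target point
slope_tile <- terrain(dem_tile, opt = "slope", unit = "degrees")  # one example TAM
```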
The TAM-DM HPC job is submitted with the following chunk, which you will need to tailor to your own needs. The arguments `$1` and `$2` specify the output directory and the type of run; for example, `/home/hguillon/out/run124` for `$1` and `SFE_all` for `$2`.
```{bash, eval = FALSE}
srun Rscript ~/farm_scripts/main_process_tam-dm.R $1 $2
```

### Initialization

We first check for arguments from the submitting `slurm` commands, retrieve the `region` and `slurm_arrayid` variables and define the output files, `fname_tam` and `fname_H`.

```r
args <- commandArgs(trailingOnly = TRUE)
# check the arguments: if there are not exactly two, return an error
if (length(args) != 2) {
  stop("Output dir or input file missing.\n", call. = FALSE)
}
outdir <- args[1]
run_type <- args[2]
region <- unlist(strsplit(run_type, "_"))[1]
# grab the array id value from the environment variable passed from sbatch
slurm_arrayid <- Sys.getenv("SLURM_ARRAY_TASK_ID")
# coerce the value to a number
slurm_arrayid <- as.numeric(slurm_arrayid)
fname_tam <- paste0(outdir, slurm_arrayid, "_tam.csv")
fname_H <- paste0(outdir, slurm_arrayid, "_H.csv")
```
Assuming `run_type <- "SFE_all"`: `r run_type <- "SFE_all"`

```r
input_dir <- system.file("extdata/input_data", package = "RiverML")
fname <- paste0(run_type, "_input.csv")
input_data <- get_input_data(file.path(input_dir, fname))
head(input_data)
```
Assuming `slurm_arrayid <- 89`: `r slurm_arrayid <- 89`

```r
df_optim <- HPC_optim(nrow(input_data), io_time = 20e3, computing_time = 4e3, pl = FALSE)
n_start <- (slurm_arrayid - 1) * df_optim$task_per_array + 1
n_end <- n_start + (df_optim$task_per_array - 1)
if (n_end > nrow(input_data)) n_end <- nrow(input_data)
n <- seq(n_start, n_end)
```
This selects the iterations between `r n_start` and `r n_end`.
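As a worked example, if `HPC_optim()` had returned (hypothetically) `task_per_array = 100`, array 89 would process iterations 8801 to 8900:

```r
# Purely illustrative numbers: task_per_array is whatever HPC_optim() returns.
task_per_array <- 100
slurm_arrayid <- 89
(slurm_arrayid - 1) * task_per_array + 1  # n_start: 8801
slurm_arrayid * task_per_array            # n_end: 8900 (before the clamping above)
```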
We now calculate the TAM-DM. We first compute the tile extents with `get_input_polygons()` and feed the resulting `SpatialPolygons` to `get_terrain_metrics()`, which produces a stack of terrain analysis `Raster` objects. These are then summarized with `get_stats_df()` at the level of the entire tile (`raster_stats`) and along a 100-m riparian buffer (`near_channel_stats`), which requires retrieving the `target_streamlines` with `get_target_streamlines()`.
```r
# if (!file.exists(fname_tam)) {
dem_dir <- "/vsicurl/https://riverml.s3-us-west-2.amazonaws.com/"
dem_file <- "DEM_SFE.tif"
DEM <- raster(paste0(dem_dir, dem_file))
polys <- get_input_polygons(input_data, n, DEM, .dl = 25)
ls <- get_terrain_metrics(polys, DEM, curvature = FALSE)
# raster stats
raster_stats_df <- get_stats_df(raster_stats, "rstr", ls, dem_file)
# near channel stats
target_streamlines <- get_target_streamlines(region)
target_streamlines <- target_streamlines[n, ]
near_channel_stats_df <- get_stats_df(near_channel_stats, "nrch", ls, dem_file,
                                      lines = target_streamlines)
target_data_df_tam <- cbind(raster_stats_df, near_channel_stats_df)
# write.csv(target_data_df_tam, file.path(fname_tam), row.names = FALSE) # saving
# rm(DEM) # clean up
# gc()
# }
```
```r
target_data_df_tam %>%
  knitr::kable(digits = 3, format = "html", caption = "TAM-DM data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
Rasters of the statistical roughness of California were computed with the R package `statisticalRoughness` for scales from `r min(get_H_rasters()$upper_scales)` to `r max(get_H_rasters()$upper_scales)` meters. The resulting rasters and associated scales are loaded with `get_H_rasters()`.
```r
# if (!file.exists(fname_H)) {
lH <- get_H_rasters()
H_rasters <- lH$H_rasters
upper_scales <- lH$upper_scales
target_data_df_H <- join_streamlines_with_H_rasters(H_rasters, target_streamlines, upper_scales)
# write.csv(target_data_df_H, file.path(fname_H), row.names = FALSE) # saving
# }
```
```r
target_data_df_H %>%
  knitr::kable(digits = 3, format = "html", caption = "statistical roughness data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
Typically, the following chunks are executed after `target_data_df_H` and `target_data_df_tam` have been calculated for each HPC array. It is now time to join all of them and to add StreamCat data to the resulting unified `data.frame`.
StreamCat data are publicly available. In our case, we retrieved data from the HydroRegions corresponding to California, Nevada and Oregon.
These data are aggregated into `streamcat_df` with `get_streamcat_df()`.
A cloud-based version of `streamcat_df` can be loaded directly with:
```r
streamcat_df <- data.table::fread("https://riverml.s3-us-west-2.amazonaws.com/streamcat_df.csv.bz2")
```
```r
streamcat_df %>%
  head(20) %>%
  knitr::kable(digits = 3, format = "html", caption = "StreamCat data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
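Downloading the compressed table can take a while; if you run this step repeatedly, a simple local cache avoids re-downloading. This is only a suggestion, not part of the package workflow, and `cache_dir` below is a placeholder.

```r
# Download streamcat_df.csv.bz2 once, then read the local copy on later runs.
cache_dir <- "path/to/cache"  # hypothetical local directory
fname_sc <- file.path(cache_dir, "streamcat_df.csv.bz2")
if (!file.exists(fname_sc)) {
  download.file("https://riverml.s3-us-west-2.amazonaws.com/streamcat_df.csv.bz2",
                fname_sc, mode = "wb")
}
streamcat_df <- data.table::fread(fname_sc)
```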
## target_streamlines (again)

Because, as explained above, these chunks are likely executed in a separate run from the previous ones, we need to reload the `target_streamlines`. For the sake of this vignette, we keep the reduced version with only `r length(target_streamlines)` stream intervals.
```r
target_streamlines <- get_target_streamlines(region)
```
Using `get_target_streamcat_df()`, we select the rows of `streamcat_df` matching the `COMID` of `target_streamlines`. Under the hood, `get_target_streamcat_df()` also drops some variables that are not required for our predictions.
```r
target_streamcat_df <- get_target_streamcat_df(streamcat_df, target_streamlines)
```
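Conceptually, the row selection is roughly equivalent to matching on `COMID`, as in the base-R sketch below; this is only an approximation, as the actual function additionally drops unneeded variables and handles the column bookkeeping.

```r
# Rough equivalent of the row selection only: one StreamCat row per stream
# interval, in the same order as target_streamlines (COMIDs may repeat).
idx <- match(target_streamlines$COMID, streamcat_df$COMID)
streamcat_rows <- streamcat_df[idx, ]
```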
```r
target_streamcat_df %>%
  knitr::kable(digits = 3, format = "html", caption = "StreamCat data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
We can now add some predictors stored in `target_streamlines`: slope (`SLOPE`), valley confinement distance (`CONFINEMEN`), an estimate of sediment supply (`RUSLE`), the Strahler stream order (`SO`), and the local drainage density (`LDD`). Finally, we drop the `COMID`.
```r
target_streamcat_df <- cbind(target_streamcat_df,
                             SLOPE = target_streamlines$SLOPE,
                             CONFINEMEN = target_streamlines$CONFINEMEN,
                             RUSLE = target_streamlines$RUSLE,
                             SO = target_streamlines$StreamOrde,
                             LDD = target_streamlines$LDD)
target_streamcat_df$COMID <- NULL
```
The following two chunks read and combine the `.csv` files corresponding to all arrays.
```r
lf <- list.files(file.path(outdir), pattern = "tam.csv")
ids <- unlist(lapply(lf, function(x) unlist(strsplit(x, "_"))[1]))
ids <- sapply(ids, as.numeric)
lf <- lf[order(ids)]
l_target_data_df_tam <- lapply(lf, function(f) read.csv(file.path(outdir, f)))
target_data_df_tam <- do.call(rbind, l_target_data_df_tam)
```
```r
lf <- list.files(file.path(outdir), pattern = "H.csv")
ids <- unlist(lapply(lf, function(x) unlist(strsplit(x, "_"))[1]))
ids <- sapply(ids, as.numeric)
lf <- lf[order(ids)]
l_target_data_df_H <- lapply(lf, function(f) read.csv(file.path(outdir, f)))
target_data_df_H <- do.call(rbind, l_target_data_df_H)
```
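Before binding, a quick sanity check helps catch arrays that failed or are still running. This is a suggestion rather than part of the package workflow, and it assumes one output row per stream interval.

```r
# Both combined tables should line up with the (reloaded) target streamlines.
stopifnot(nrow(target_data_df_tam) == nrow(target_data_df_H))
stopifnot(nrow(target_data_df_tam) == length(target_streamlines))
```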
Finally, we can bind together the data from StreamCat, the TAM-DM and the statistical roughness.
```r
target_data_df <- cbind(target_streamcat_df, target_data_df_tam, target_data_df_H)
# write.csv(target_data_df, file = file.path(outdir, paste0(region, '_all_data_df.csv')), row.names = FALSE)
```
```r
target_data_df %>%
  knitr::kable(digits = 3, format = "html", caption = "target data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
We also save a record of the `ID` of the computed data as well as the associated `COMID`.
```r
target_ID_df <- data.frame(ID = target_streamlines$ID, COMID = target_streamlines$COMID)
# write.csv(target_ID_df, file = file.path(outdir, paste0(region, '_all_ID_df.csv')), row.names = FALSE)
```
```r
target_ID_df %>%
  knitr::kable(digits = 3, format = "html", caption = "target ID") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```