```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This vignette documents the workflow used to create the target dataset. The training dataset is extracted from the target dataset at selected labelled locations.
```r
library(magrittr)
library(RiverML)
library(raster)
```
The dataset of target streamlines is loaded with `get_target_streamlines()`. The target streamlines for nine regions are included in the package and accessed with objects such as `target_streamlines_SFE`, where `SFE` is the region code corresponding to the catchment of the South Fork Eel River, California, USA.
The nine regions are:
| Region name            | Region code |
|:-----------------------|:-----------:|
| South Fork Eel         | SFE         |
| Klamath                | K           |
| North Coast            | NC          |
| North Central Coast    | NCC         |
| South Central Coast    | SCC         |
| South Coast            | SC          |
| Sacramento             | SAC         |
| South East California  | SECA        |
| San Joaquin Tulare     | SJT         |
## target_streamlines
We first define a region of study with the variable `region` and check that the output of `get_target_streamlines()` and the lazy-loaded dataset `target_streamlines_SFE` are identical.
region <- "SFE" target_streamlines <- get_target_streamlines(region) all.equal(target_streamlines, target_streamlines_SFE) target_streamlines
## target_points
From the `target_streamlines`, we now extract the `target_points` using `get_target_points()`, which returns a `data.frame` containing the location of the midpoints of all the 200-m stream intervals.
```r
target_points <- get_target_points(target_streamlines)
head(target_points)
```
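As an optional sanity check, you can overlay the extracted midpoints on the streamlines. This is only a suggestion, not part of the package workflow, and the `X`/`Y` column names below are assumptions about the layout of `target_points`; inspect `head(target_points)` for the actual coordinate columns.

```r
# Optional visual check: streamlines in grey, extracted midpoints in red.
# The X/Y column names are assumed, not guaranteed by the package.
plot(target_streamlines, col = "grey50")
points(target_points$X, target_points$Y, pch = 20, cex = 0.3, col = "red")
```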
You will want to save this `target_points` `data.frame` as a separate `.csv` file following the naming pattern `region_all_input.csv` (e.g., `SFE_all_input.csv`). Such a file is required for the next step, high-performance computing.
```r
fname <- paste0(region, "_all_input.csv")
write.csv(target_points, file.path("path/to/output/", fname))
```
At this stage, the goal is to use High Performance Computing (HPC) to massively parallelize the extraction of the data at the locations defined by `target_points`. While it is possible to request one HPC node per location in `target_points`, this is highly inefficient. As one HPC admin once told me, if you do so, most of your time might be spent reading the required data rather than actually computing.
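A back-of-the-envelope calculation illustrates the point, assuming (as in the call below) that reading the input data is a one-off cost of about 20 s per job while one iteration computes in about 4 s:

```r
io_time <- 20e3        # one-off read time per job, in milliseconds (assumed)
computing_time <- 4e3  # compute time per iteration, in milliseconds (assumed)
# fraction of wall time spent on I/O for 1 vs 100 iterations per job
io_time / (io_time + 1 * computing_time)    # ~0.83: mostly reading data
io_time / (io_time + 100 * computing_time)  # ~0.05: mostly computing
```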
To estimate the optimum ratio of nodes to compute iterations, we use `HPC_optim()`, which requires that you first profile one iteration of the code, for example with the `profvis` package.
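A minimal profiling sketch could look like the following; the profiled calls mirror the TAM-DM chunk shown later in this vignette, the local DEM path is a placeholder, and `target_points` is passed directly under the assumption that it has the same layout as the `input_data` read back from the CSV on the HPC side. Any representative single iteration will do.

```r
# Profile one iteration to split its cost between I/O (reading the DEM) and
# compute (terrain analysis); the flame graph gives rough io_time and
# computing_time values (in milliseconds) for HPC_optim().
library(profvis)
profvis({
  DEM <- raster::raster("path/to/DEM_SFE.tif")              # placeholder path; I/O-dominated
  polys <- get_input_polygons(target_points, 1, DEM, .dl = 25)
  ls <- get_terrain_metrics(polys, DEM, curvature = FALSE)   # compute-dominated
})
```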
```r
optim_df <- HPC_optim(
  nrow(target_points),
  hpc_lim = 1e4,
  io_time = 20e3,
  computing_time = 4e3,
  AssocGrpCpuLimit = 258,
  core_per_task = 1
)
optim_df
```
This indicates that we should submit `r optim_df$n_array` HPC jobs, each computing `r optim_df$task_per_array` iterations, with an estimated time of `r signif(optim_df$exec_time_min, 3)` minutes per array.
Take this estimated execution time with a grain of salt: the `io_time` and `computing_time` values are usually derived on a local computer, which likely has more RAM than a single HPC node and faster I/O than the connection between the HPC nodes and wherever your input data is stored.
Armed with the optimized number of arrays and tasks per array, the HPC job can be submitted.
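On a SLURM cluster, the optimized job count typically ends up in the array-submission call. The following is only a hypothetical sketch: the wrapper script name and sbatch options are placeholders, not the package's actual submission setup; the two trailing arguments are the output directory and run type described below.

```r
# Build and (optionally) launch an sbatch array using the optimized job count;
# run_tam_dm.sh is a hypothetical wrapper around the srun call shown below.
cmd <- sprintf(
  "sbatch --array=1-%d run_tam_dm.sh /home/hguillon/out/run124 SFE_all",
  optim_df$n_array
)
cmd
# system(cmd)  # uncomment on the HPC login node
```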
This job will compute terrain analysis metrics (TAM, e.g., slope, curvature), and more specifically distribution metrics (DM) of such TAM (e.g., mean, max).
To derive these, a digital elevation model (DEM) is tiled around the locations specified by `target_points`. In our case, the input DEM is a 10-m DEM of California, and its size prevents us from making the rest of the vignette executable. In the following, the HPC code is detailed and commented.
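To make the tiling concrete, here is a small, self-contained sketch of cropping one tile around a single target point and computing one terrain metric with `raster`. It is only an illustration under stated assumptions (a local DEM path, a projected CRS in meters, `X`/`Y` coordinate columns, and an arbitrary 500-m half-width); the package itself does this through `get_input_polygons()` and `get_terrain_metrics()`.

```r
library(raster)
DEM <- raster("path/to/DEM_SFE.tif")  # placeholder path to a local 10-m DEM
pt <- target_points[1, ]              # one midpoint; X/Y column names assumed
dl <- 500                             # hypothetical tile half-width, in meters
tile_extent <- extent(pt$X - dl, pt$X + dl, pt$Y - dl, pt$Y + dl)
dem_tile <- crop(DEM, tile_extent)    # tile the DEM around the target point
slope_tile <- terrain(dem_tile, opt = "slope", unit = "degrees")  # one example TAM
```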
The TAM-DM HPC job is submitted with the following chunk, which you will need to tailor to your own needs. The arguments `$1` and `$2` specify the output directory and the type of run; for example, `/home/hguillon/out/run124` for `$1` and `SFE_all` for `$2`.
```{bash, eval = FALSE}
srun Rscript ~/farm_scripts/main_process_tam-dm.R $1 $2
```

### Initialization

We first check for arguments from the submitting `slurm` commands, retrieve the `region` and `slurm_arrayid` variables and define the output files, `fname_tam` and `fname_H`.

```r
args <- commandArgs(trailingOnly = TRUE)
# check the arguments: if there are not exactly two, return an error
if (length(args) != 2) {
  stop("Output dir or input file missing.\n", call. = FALSE)
}
outdir <- args[1]
run_type <- args[2]
region <- unlist(strsplit(run_type, "_"))[1]
# grab the array id value from the environment variable passed from sbatch
slurm_arrayid <- Sys.getenv("SLURM_ARRAY_TASK_ID")
# coerce the value to a number
slurm_arrayid <- as.numeric(slurm_arrayid)
fname_tam <- paste0(outdir, slurm_arrayid, "_tam.csv")
fname_H <- paste0(outdir, slurm_arrayid, "_H.csv")
```
Assuming `run_type <- "SFE_all"`: `r run_type <- "SFE_all"`

```r
input_dir <- system.file("extdata/input_data", package = "RiverML")
fname <- paste0(run_type, "_input.csv")
input_data <- get_input_data(file.path(input_dir, fname))
head(input_data)
```
Assuming `slurm_arrayid <- 89`: `r slurm_arrayid <- 89`

```r
df_optim <- HPC_optim(nrow(input_data), io_time = 20e3, computing_time = 4e3, pl = FALSE)
n_start <- (slurm_arrayid - 1) * df_optim$task_per_array + 1
n_end <- n_start + (df_optim$task_per_array - 1)
if (n_end > nrow(input_data)) n_end <- nrow(input_data)
n <- seq(n_start, n_end)
```
This selects the iterations between `r n_start` and `r n_end`.
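As a worked example, if `HPC_optim()` had returned (hypothetically) `task_per_array = 100`, array 89 would process iterations 8801 to 8900:

```r
# Purely illustrative numbers: task_per_array is whatever HPC_optim() returns.
task_per_array <- 100
slurm_arrayid <- 89
(slurm_arrayid - 1) * task_per_array + 1  # n_start: 8801
slurm_arrayid * task_per_array            # n_end: 8900 (before the clamping above)
```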
We now calculate the TAM-DM. We first compute the tile extents with `get_input_polygons()` and feed the resulting `SpatialPolygons` to `get_terrain_metrics()`, which produces a stack of terrain analysis `Raster` objects. These are then summarized with `get_stats_df()` at the level of the entire tile (`raster_stats`) and along a 100-m riparian buffer (`near_channel_stats`), which requires retrieving the `target_streamlines` with `get_target_streamlines()`.
```r
# if (!file.exists(fname_tam)) {
dem_dir <- "/vsicurl/https://riverml.s3-us-west-2.amazonaws.com/"
dem_file <- "DEM_SFE.tif"
DEM <- raster(paste0(dem_dir, dem_file))
polys <- get_input_polygons(input_data, n, DEM, .dl = 25)
ls <- get_terrain_metrics(polys, DEM, curvature = FALSE)
# raster stats
raster_stats_df <- get_stats_df(raster_stats, "rstr", ls, dem_file)
# near channel stats
target_streamlines <- get_target_streamlines(region)
target_streamlines <- target_streamlines[n, ]
near_channel_stats_df <- get_stats_df(near_channel_stats, "nrch", ls, dem_file,
                                      lines = target_streamlines)
target_data_df_tam <- cbind(raster_stats_df, near_channel_stats_df)
# write.csv(target_data_df_tam, file.path(fname_tam), row.names = FALSE) # saving
# rm(DEM) # clean up
# gc()
# }
```
```r
target_data_df_tam %>%
  knitr::kable(digits = 3, format = "html", caption = "TAM-DM data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
Rasters of the statistical roughness of California were computed with the R package `statisticalRoughness` for scales from `r min(get_H_rasters()$upper_scales)` to `r max(get_H_rasters()$upper_scales)` meters. The resulting rasters and associated scales are loaded with `get_H_rasters()`.
```r
# if (!file.exists(fname_H)) {
lH <- get_H_rasters()
H_rasters <- lH$H_rasters
upper_scales <- lH$upper_scales
target_data_df_H <- join_streamlines_with_H_rasters(H_rasters, target_streamlines, upper_scales)
# write.csv(target_data_df_H, file.path(fname_H), row.names = FALSE) # saving
# }
```
```r
target_data_df_H %>%
  knitr::kable(digits = 3, format = "html", caption = "statistical roughness data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
Typically, the following chunks are executed after `target_data_df_H` and `target_data_df_tam` have been calculated for each HPC array. It is now time to join all of them and to add StreamCat data to the resulting unified `data.frame`.
StreamCat data are publicly available. In our case, we retrieved data from the HydroRegions corresponding to California, Nevada and Oregon.
These data are aggregated into `streamcat_df` with `get_streamcat_df()`.
A cloud-based version of `streamcat_df` can be loaded directly with:
```r
streamcat_df <- data.table::fread("https://riverml.s3-us-west-2.amazonaws.com/streamcat_df.csv.bz2")
```
```r
streamcat_df %>%
  head(20) %>%
  knitr::kable(digits = 3, format = "html", caption = "StreamCat data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
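Downloading the compressed table can take a while; if you run this step repeatedly, a simple local cache avoids re-downloading. This is only a suggestion, not part of the package workflow, and `cache_dir` below is a placeholder.

```r
# Download streamcat_df.csv.bz2 once, then read the local copy on later runs.
cache_dir <- "path/to/cache"  # hypothetical local directory
fname_sc <- file.path(cache_dir, "streamcat_df.csv.bz2")
if (!file.exists(fname_sc)) {
  download.file("https://riverml.s3-us-west-2.amazonaws.com/streamcat_df.csv.bz2",
                fname_sc, mode = "wb")
}
streamcat_df <- data.table::fread(fname_sc)
```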
## target_streamlines (again)

Because, as explained above, these chunks are likely executed in a separate run from the previous ones, we need to reload the `target_streamlines`. For the sake of this vignette, we keep the reduced version with only `r length(target_streamlines)` stream intervals.
```r
target_streamlines <- get_target_streamlines(region)
```
Using `get_target_streamcat_df()`, we select the rows of `streamcat_df` matching the `COMID` of `target_streamlines`. Under the hood, `get_target_streamcat_df()` also drops some variables that are not required for our predictions.
```r
target_streamcat_df <- get_target_streamcat_df(streamcat_df, target_streamlines)
```
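Conceptually, the row selection is roughly equivalent to matching on `COMID`, as in the base-R sketch below; this is only an approximation, as the actual function additionally drops unneeded variables and handles the column bookkeeping.

```r
# Rough equivalent of the row selection only: one StreamCat row per stream
# interval, in the same order as target_streamlines (COMIDs may repeat).
idx <- match(target_streamlines$COMID, streamcat_df$COMID)
streamcat_rows <- streamcat_df[idx, ]
```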
```r
target_streamcat_df %>%
  knitr::kable(digits = 3, format = "html", caption = "StreamCat data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
We can now add some predictors stored in `target_streamlines`: slope (`SLOPE`), valley confinement distance (`CONFINEMEN`), an estimate of sediment supply (`RUSLE`), the Strahler stream order (`SO`), and the local drainage density (`LDD`). Finally, we drop the `COMID`.
```r
target_streamcat_df <- cbind(target_streamcat_df,
                             SLOPE = target_streamlines$SLOPE,
                             CONFINEMEN = target_streamlines$CONFINEMEN,
                             RUSLE = target_streamlines$RUSLE,
                             SO = target_streamlines$StreamOrde,
                             LDD = target_streamlines$LDD)
target_streamcat_df$COMID <- NULL
```
The following two chunks read and combine the `.csv` files corresponding to all arrays.
```r
lf <- list.files(file.path(outdir), pattern = "tam.csv")
ids <- unlist(lapply(lf, function(x) unlist(strsplit(x, "_"))[1]))
ids <- sapply(ids, as.numeric)
lf <- lf[order(ids)]
l_target_data_df_tam <- lapply(lf, function(f) read.csv(file.path(outdir, f)))
target_data_df_tam <- do.call(rbind, l_target_data_df_tam)
```
```r
lf <- list.files(file.path(outdir), pattern = "H.csv")
ids <- unlist(lapply(lf, function(x) unlist(strsplit(x, "_"))[1]))
ids <- sapply(ids, as.numeric)
lf <- lf[order(ids)]
l_target_data_df_H <- lapply(lf, function(f) read.csv(file.path(outdir, f)))
target_data_df_H <- do.call(rbind, l_target_data_df_H)
```
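Before binding, a quick sanity check helps catch arrays that failed or are still running. This is a suggestion rather than part of the package workflow, and it assumes one output row per stream interval.

```r
# Both combined tables should line up with the (reloaded) target streamlines.
stopifnot(nrow(target_data_df_tam) == nrow(target_data_df_H))
stopifnot(nrow(target_data_df_tam) == length(target_streamlines))
```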
Finally, we can bind together the data from StreamCat, the TAM-DM and the statistical roughness.
```r
target_data_df <- cbind(target_streamcat_df, target_data_df_tam, target_data_df_H)
# write.csv(target_data_df, file = file.path(outdir, paste0(region, '_all_data_df.csv')), row.names = FALSE)
```
```r
target_data_df %>%
  knitr::kable(digits = 3, format = "html", caption = "target data") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```
We also save a record of the `ID` of the computed data as well as the associated `COMID`.
```r
target_ID_df <- data.frame(ID = target_streamlines$ID, COMID = target_streamlines$COMID)
# write.csv(target_ID_df, file = file.path(outdir, paste0(region, '_all_ID_df.csv')), row.names = FALSE)
```
```r
target_ID_df %>%
  knitr::kable(digits = 3, format = "html", caption = "target ID") %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  kableExtra::scroll_box(width = "7in", height = "5in")
```