createBigDistMat: Calculate Hydrologic Distances for a large...
In jayverhoef/SSNbd: Spatial Modeling for Big Data Sets on Stream Networks

Creates a collection of (non-symmetric) matrices containing pairwise downstream hydrologic distances between sites in a SpatialStreamNetwork object

createBigDistMat(ssn, predpts = NULL, o.write = FALSE, amongpreds = FALSE, no.cores = 1)

`ssn`	SpatialStreamNetwork-class object
`predpts`	a valid predpoints ID from the ssn. Default is `NULL`.
`o.write`	If `TRUE`, overwrite existing distance matrices. Defaults to `FALSE`.
`amongpreds`	If `TRUE`, compute the distances between the prediction sites. Defaults to `FALSE`.
`no.cores`	Number of cores to use when computing the distance matrices. Also the number of chunks the dataset is split into during computation. Defaults to `1`.

A distance matrix that contains the hydrologic distance between any two sites in a SpatialStreamNetwork object is needed to fit a spatial statistical model using the tail-up and tail-down autocovariance functions described in Ver Hoef and Peterson (2010). These models are implemented in R via glmssn in the SSN package. The hydrologic distance information needed to model the covariance differs between flow-connected locations (i.e. water flows from one location to the other) and flow-unconnected locations (i.e. water does not flow from one location to the other, but both reside on the same network). Flow-unconnected pairs are modeled using the hydrologic distance from each site to a common downstream stream junction, which we term the downstream hydrologic distance. In contrast, flow-connected pairs are modeled using the total hydrologic distance, a directionless measure of the hydrologic distance between two sites that ignores flow direction.

A downstream hydrologic distance matrix provides enough information to meet the data requirements for both the tail-up and tail-down models. When two locations are flow-connected, the downstream hydrologic distance from the upstream location to the downstream location is greater than zero, but it is zero in the other direction. When two locations are flow-unconnected, the downstream hydrologic distance is greater than zero in both directions. A site's downstream hydrologic distance to itself is zero. This format is efficient because the distance information needed to fit both the tail-up and tail-down models is stored only once. For example, a matrix containing the total hydrologic distance between sites is easily calculated by adding the downstream distance matrix to its transpose.
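The relationship between the downstream and total distance matrices can be sketched with a toy example. The three sites and their distances below are invented for illustration and do not come from any SSN dataset; the matrix follows the convention that columns are the FROM site and rows are the TO site.

```r
# Hypothetical network: site 1 lies 2 km upstream of site 2 (flow-connected).
# Site 3 is on another branch that joins 3 km downstream of site 2,
# and site 3 is 4 km upstream of that junction.
sites <- c("1", "2", "3")
D <- matrix(0, nrow = 3, ncol = 3, dimnames = list(sites, sites))

# D[to, from] holds the downstream hydrologic distance travelled FROM a site
D["2", "1"] <- 2                  # site 1 down to site 2; reverse entry stays 0
D["3", "1"] <- 5; D["1", "3"] <- 4  # sites 1 and 3 to their common junction
D["3", "2"] <- 3; D["2", "3"] <- 4  # sites 2 and 3 to their common junction

# Total hydrologic distance: add the matrix to its transpose
total <- D + t(D)

# A pair is flow-connected when the distance is zero in one direction
flow.connected <- pmin(D, t(D)) == 0 & row(D) != col(D)
```

Here `total` is symmetric, while `D` keeps the directional information needed to tell flow-connected pairs from flow-unconnected ones.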

The downstream hydrologic distances are calculated based on the binaryIDs and stored as matrices. The matrices are stored in a directory named ‘distance’, which is created by the createBigDistMat function within the .ssn directory. The distance directory will always contain at least one directory named ‘obs’, which contains a set of files for each network that has observed sites residing on it, with file extensions .bmat, .desc.txt, .nmscol.txt, and .nmsrow.txt. The base file name is derived from the netID number (e.g. dist.net1). Each set of files in the ‘obs’ folder contains the information needed to form a square matrix holding the downstream hydrologic distance between each pair of observed sites on the network. Direction is preserved, with columns representing the FROM site and rows representing the TO site. Row and column names correspond to the pid attribute for each site.

If the argument predpts is specified in the call to the function, the downstream hydrologic distances between the observed and prediction sites will also be computed. A new directory is created within the distance directory, with the name corresponding to the predpoints ID (e.g. “preds”). A sequence of files is created within this directory, similar to the structure for the observed sites, except that two objects are stored for each network that contains both observed and prediction sites. The letters a and b are used in the naming convention to distinguish between the two objects (e.g. dist.net1.a and dist.net1.b). The matrices that these objects represent are not necessarily square. In matrices of both type a and type b, rows correspond to prediction locations and columns to observed locations. However, direction is preserved differently in the two types. For matrices of type a, columns represent the TO site and rows represent the FROM site. In contrast, columns represent the FROM site and rows represent the TO site in matrices of type b. Again, row and column names correspond to the pid attribute for each site.
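As a rough illustration of how the two matrix types relate, the sketch below builds toy type a and type b matrices following the layout described above (rows are prediction sites, columns are observed sites). The pid values and distances are invented, and the element-wise sum is simply what that layout implies, not a documented step of the package.

```r
# Hypothetical pids: two observed sites and three prediction sites
obs   <- c("1", "2")
preds <- c("101", "102", "103")

# Type a: downstream distance FROM the prediction site (row)
#         TO the observed site (column)
a <- matrix(c(1.5, 0, 2,    # distances to observed site 1
              4,   3, 0),   # distances to observed site 2
            nrow = 3, dimnames = list(preds, obs))

# Type b: downstream distance FROM the observed site (column)
#         TO the prediction site (row)
b <- matrix(c(0, 2.5, 1,    # distances from observed site 1
              6, 0,   3.5), # distances from observed site 2
            nrow = 3, dimnames = list(preds, obs))

# Both matrices share the same row/column layout, so the total hydrologic
# distance between each prediction/observed pair is the element-wise sum
total.po <- a + b
```

Because a and b already share one orientation under this layout, no transpose is needed, unlike the square matrices in the ‘obs’ directory.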

If the argument amongpreds is set to TRUE, the downstream hydrologic distances will also be computed between prediction sites, for each network. Again, these are stored within the distance directory, in the directory named after the predpoints ID. The naming convention for these prediction-to-prediction distance matrices is the same as for the distance matrices stored in the ‘obs’ directory (e.g. dist.net1). These extra distance matrices are needed to perform block Kriging using glmssn.
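The directory layout described above can be summarised with a small helper that builds the expected file paths. The function `distFiles` is hypothetical (it is not part of SSN or SSNbd), and it assumes that each extension is appended directly to the base name; the actual files on disk should be checked with `list.files`.

```r
# Hypothetical helper: expected file paths for one network's distance matrix
distFiles <- function(ssn.path, net.id, subdir = "obs", suffix = "") {
  # Base name follows the netID convention, e.g. dist.net1 or dist.net1.a
  base <- paste0("dist.net", net.id, suffix)
  # Extensions listed in the documentation (assumed to be appended to base)
  exts <- c(".bmat", ".desc.txt", ".nmscol.txt", ".nmsrow.txt")
  file.path(ssn.path, "distance", subdir, paste0(base, exts))
}

# Observed-to-observed files for network 1
obs.files <- distFiles("MiddleFork04.ssn", 1)

# Type a files between observed sites and the "pred1km" prediction sites
pred.a.files <- distFiles("MiddleFork04.ssn", 1,
                          subdir = "pred1km", suffix = ".a")
```

Passing the result to `file.exists` gives a quick check of whether a given set of matrices has already been created before deciding on `o.write`.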

The distance matrices generated using the createBigDistMat function are file-backed, meaning that they are stored as files rather than in memory. This approach is more computationally efficient when users must calculate a large number of pairwise distances between observed and/or prediction locations. The no.cores argument designates the number of cores to be used if the code is to be executed in parallel. Note that the detectCores function in the parallel package can be used to detect the number of cores available. The no.cores argument is also used to 'chunk' the dataset into multiple parts for processing, with one part allocated to each core. However, connecting to each core carries an overhead, so the computational benefits of using createBigDistMat over createDistMat are only evident when datasets are large.
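A minimal sketch of choosing a value for no.cores with detectCores is shown below. Leaving one core free for the rest of the system is a common convention assumed here, not a requirement of createBigDistMat.

```r
library(parallel)

# detectCores() can return NA when the count cannot be determined
n.avail <- detectCores()
if (is.na(n.avail)) n.avail <- 1L

# Assumption: keep one core free, but always use at least one
no.cores <- max(1L, n.avail - 1L)

# The value would then be passed on, for example:
# createBigDistMat(mf04, o.write = TRUE, no.cores = no.cores)
```

For small datasets, no.cores = 1 may still be the faster choice because of the per-core connection overhead.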

The createBigDistMat function creates a hierarchy of directories in the ssn@path directory, which store the pairwise distances between sites associated with the SpatialStreamNetwork-class object. See the Details section for additional information.

Erin Peterson

SpatialStreamNetwork-class, importSSN, createSSN, glmssn

library(SSN)
# NOT RUN
# mf04 <- importSSN(system.file("lsndata/MiddleFork04.ssn",
#	        package = "SSN"), o.write = TRUE)
# use SpatialStreamNetwork object mf04 that was already created
data(mf04)
# for examples, copy MiddleFork04.ssn directory to R's temporary directory
copyLSN2temp()
# make sure mf04 has the correct path; this will vary for each user's installation
mf04 <- updatePath(mf04, paste0(tempdir(),'/MiddleFork04.ssn'))

# create distance matrix among observed data points
createBigDistMat(mf04, o.write = TRUE, no.cores = 1)

# create distance matrix among observed data points
#     and between observed and prediction points
data(mf04p)
mf04p <- updatePath(mf04p, paste0(tempdir(),'/MiddleFork04.ssn'))
createBigDistMat(mf04p, predpts = "pred1km", o.write = TRUE, no.cores = 1)

# NOT RUN: include prediction to prediction site distances
# createBigDistMat(mf04p, predpts = "pred1km", o.write = TRUE,
#     amongpreds = TRUE, no.cores = 4)