Nothing
##############################################################
## profile classification functions ##
##############################################################
## lots TODO: https://github.com/ncss-tech/aqp/issues/7
# define function for summing a list of dissimilarity matrices
# that have had NA replaced with 0
.SumDistanceList <- function(x) Reduce("+", x)
## low-level function that the user will probably not ever use directly
# Seems to scale to 1000 profiles with 5 variables, could use optimization
# function requires at least two attributes
# hard coded reference to s$id
# set k to 0 for no depth weighting
#' Deprecated: Numerical Soil Profile Comparison
#'
#' See `NCSP()` for the modern interface to numerical soil profile comparison/classification.
#'
#' Performs a numerical comparison of soil profiles using named properties,
#' based on a weighted, summed, depth-segment-aligned dissimilarity
#' calculation. If \code{s} is a \code{\link{SoilProfileCollection}},
#' site-level variables (2 or more) can also be used. The site-level and
#' horizon-level dissimilarity matrices are then re-scaled and averaged.
#'
#' Variability in soil depth can interfere significantly with the calculation
#' of between-profile dissimilarity-- what is the numerical ``distance'' (or
#' dissimilarity) between a slice of soil from profile A and the corresponding,
#' but missing, slice from a shallower profile B? Gower's distance metric would
#' yield a NULL distance, despite the fact that intuition suggests otherwise:
#' shallower soils should be more dissimilar from deeper soils. For example,
#' when a 25 cm deep profile is compared with a 50 cm deep profile, numerical
#' distances are only accumulated for the first 25 cm of soil (distances from
#' 26 - 50 cm are NULL). When summed, the total distance between these profiles
#' will generally be less than the distance between two profiles of equal
#' depth. Our algorithm has an option (setting replace_na=TRUE) to replace NULL
#' distances with the maximum distance between any pair of profiles for the
#' current depth slice. In this way, the numerical distance between a slice of
#' soil and a corresponding slice of non-soil reflects the fact that these two
#' materials should be treated very differently (i.e. maximum dissimilarity).
#'
#' This alternative calculation of dissimilarities between soil and non-soil
#' slices solves the problem of comparing shallow profiles with deeper
#' profiles. However, it can result in a new problem: distances calculated
#' between two shallow profiles will be erroneously inflated beyond the extent
#' of either profile's depth. Our algorithm has an additional option (setting
#' add_soil_flag=TRUE) that will preserve NULL distances between slices when
#' both slices represent non-soil material. With this option enabled, shallow
#' profiles will only accumulate mutual dissimilarity to the depth of the
#' deeper profile.
#'
#' Note that when the \code{add_soil_flag} option is enabled (default), slices
#' are classified as 'soil' down to the maximum depth to which at least one of
#' variables used in the dissimilarity calculation is not NA. This will cause
#' problems when profiles within a collection contain all NAs within the
#' columns used to determine dissimilarity. An approach for identifying and
#' removing these kind of profiles is presented in the examples section below.
#'
#' A notice is issued if there are any NA values within the matrix used for
#' distance calculations, as these values are optionally replaced by the max
#' dissimilarity.
#'
#' Our approach builds on the work of (Moore, 1972) and the previously
#' mentioned depth-slicing algorithm.
#'
#' @aliases pc pc.SPC profile_compare
#' profile_compare,SoilProfileCollection-method
#' profile_compare,data.frame-method
#'
#' @docType methods
#'
#' @param s a `data.frame` with at least 2 columns of soil properties, and an 'id'
#' column for each profile. horizon depths must be integers and
#' self-consistent, or a \code{SoilProfileCollection} object
#'
#' @param vars A vector with named properties that will be used in the
#' comparison. These are typically column names describing horizon-level
#' attributes (2 or more), but can also contain site-level attributes (2 or
#' more) if \code{s} is a \code{\link{SoilProfileCollection}}.
#' @param max_d depth-slices up to this depth are considered in the comparison
#' @param k a depth-weighting coeficient, use '0' for no depth-weighting (see
#' examples below)
#' @param filter an index used to determine which horizons (rows) are included
#' in the analysis
#' @param sample_interval use every n-th depth slice instead of every depth
#' slice, useful for working with > 1000 profiles at a time
#' @param replace_na if TRUE, missing data are replaced by maximum
#' dissimilarity (TRUE)
#' @param add_soil_flag The algorithm will generate a 'soil'/'non-soil' matrix
#' for use when comparing soil profiles with large differences in depth (TRUE).
#' See details section below.
#' @param return_depth_distances return intermediate, depth-wise dissimilarity
#' results (FALSE)
#' @param strict_hz_eval should horizons be strictly checked for internal
#' self-consistency? (FALSE)
#' @param progress 'none' (default)
#' @param plot.depth.matrix should a plot of the 'soil'/'non-soil' matrix be
#' returned (FALSE)
#' @param rescale.result should the result be rescaled by dividing by max(D)
#' (FALSE)
#' @param verbose extra debug output (FALSE)
#' @return A dissimilarity matrix object of class 'dissimilarity, dist',
#' optionally scaled by max(D).
#' @author Dylan E. Beaudette
#' @seealso [dice()], [cluster::daisy()]
#' @references
#' - D.E. Beaudette, P. Roudier, A.T. O'Geen, Algorithms for quantitative pedology: A toolkit for soil scientists, Computers & Geosciences, Volume 52, 2013, Pages 258-268, ISSN 0098-3004, \doi{10.1016/j.cageo.2012.10.020}.
#' - Moore, A.; Russell, J. & Ward, W. Numerical analysis of soils: A comparison of three soil profile models with field classification. Journal of Soil Science, 1972, 23, 194-209.
#' @keywords methods manip
#' @examples
#' # See ?NCSP for examples
#' @rdname profile_compare
#' @export
pc <- function(s, vars, max_d, k, filter=NULL, sample_interval=NA, replace_na=TRUE,
add_soil_flag=TRUE, return_depth_distances=FALSE, strict_hz_eval=FALSE, progress='none',
plot.depth.matrix=FALSE, rescale.result=FALSE, verbose=FALSE) {
# as of aqp 2.0
.Deprecated(new = 'NCSP', old = "profile_compare", msg = 'profile_compare() has been deprecated, please use NCSP()')
if (!requireNamespace("plyr")) {
stop("package `plyr` is required to use `profile_compare()`; use `NCSP()` instead or install the `plyr` package", call. = FALSE)
}
# currently this will only work with integer depths
# test by attempting to cast to integers
# if there is no difference, then we are fine
top.test <- any( ! as.integer(na.omit(s$top)) == na.omit(s$top))
bottom.test <- any( ! as.integer(na.omit(s$bottom)) == na.omit(s$bottom))
if(top.test | bottom.test)
stop('this function can only accept integer horizon depths', call.=FALSE)
## TODO: this should be updated
# check to make sure that there is an 'id' column
if(is.null(s$id))
stop("'s' must contain a column named 'id' ", call.=FALSE)
## TODO: evaluate the practical implications of this
# optionally filter the data
if(!missing(filter)) {
s <- s[filter, ]
}
# identify the number of profiles
# n.profiles <- length(s)
n.profiles <- length(unique(s$id))
# number of variables
n.vars <- length(vars)
# sequence describing depth slice indices
# use a sequence from 1 ... max depth
depth_slice_seq <- 1:max_d
# use decimated sampling if requested
if(!is.na(sample_interval) & sample_interval != 1)
depth_slice_seq <- depth_slice_seq[depth_slice_seq %% sample_interval == 1]
# compute a weighting vector based on k
w <- 1 * exp(-k * depth_slice_seq)
## TODO: convert to dice() (#7)
## BUG: !!! this step re-orders via string-factor-string conversion step by split() (#7)
## --> this is only a bug when profile IDs have been altered after SPC init
## --> https://stackoverflow.com/questions/17611734/r-split-preserving-natural-order-of-factors
##
## unroll each named soil property, for each soil profile
## the result is a list matrices with dimensions: depth, num_properties
# this approach requires a named list of soil properties
s.unrolled <- plyr::dlply(s, "id", .progress=progress, .fun=function(di, p=vars, d=max_d, strict=strict_hz_eval, .parallel=getOption('AQP_parallel', default=FALSE)) {
# iterate over the set of properties, unrolling as we go
# the result is a [z by p] matrix unrolled to max_d
m <- try(sapply(p, function(p_i) unroll(di$top, di$bottom, prop=di[, p_i], max_depth=d, strict=strict) ))
## TODO: could be better
# check for a non-NULL attribute of 'class'
# this will only happen when there was an error
if( !is.null(attr(m, 'class'))) {
if(attr(m, 'class') == 'try-error') {
stop(paste('Error: bad horizon structure in soil id', as.character(unique(di$id))), call.=FALSE)
}
}
else
return(m)
}
)
## generate a matrix storing a flag describing soil vs. non-soil at each slice
## note that this will truncate a profile to the max depth of actual data
## profiles missing data in all variables will cause function to stop here
if(add_soil_flag){
# keep temp subset of the data so that soil/non-soil matrix is
# evaluated based on presence of real data in at least 1 variable
s.sub <- na.omit(s[, c('id', 'top', 'bottom', vars)])
## BUG!!! tapply() re-orders the results based on sorting of s.sub$id (#7)
## ----> this will cause problems when input isn't sorted by ID
# get the depth of each profile
s.slices_of_soil <- tapply(s.sub$bottom, s.sub$id, max, na.rm=TRUE)
# truncate to the max requested depth
s.slices_of_soil <- ifelse(s.slices_of_soil <= max_d, s.slices_of_soil, max_d)
s.slices_of_non_soil <- max_d - s.slices_of_soil
s.slices_of_soil.length <- length(s.slices_of_soil)
# init a matrix with dimensions: depth slices, number of profiles
soil.matrix <- matrix(ncol=s.slices_of_soil.length, nrow=max_d)
# fill with TRUE for 'soil' or FALSE for 'non-soil'
for(s.i in 1:s.slices_of_soil.length) {
soil.matrix[, s.i] <- c(rep(TRUE, s.slices_of_soil[s.i]), rep(FALSE, s.slices_of_non_soil[s.i]))
}
# plot a diagnostic image, may not be useful for n > 100 profiles
if(plot.depth.matrix) {
# define color scheme: if all TRUE, then we only need 1 color
if(length(table(soil.matrix)) > 1)
image.cols <- c(NA, 'grey')
else
image.cols <- c('grey')
## TODO: use SPC methods, unique() will not return labels in the correct order
# labs <- profile_id(s)
labs <- unique(s.sub$id)
image(x=1:n.profiles, y=1:max_d, z=t(soil.matrix), col=image.cols, ylim=c(max_d, 0), xlab='ID', ylab='Slice Number (usually eq. to depth)', main='Soil / Non-Soil Matrix', axes=FALSE)
box()
abline(v=seq(1, n.profiles)+0.5, lty=2)
axis(side=2, at=pretty(c(0, depth_slice_seq)), las=1)
axis(side=1, at=1:n.profiles, labels=labs, las=2, cex.axis=0.75)
}
# cleanup
rm(s.sub)
}
##
## new version for computing slice-wise dissimilarities... fast!
##
message(paste('Computing dissimilarity matrices from', n.profiles, 'profiles'), appendLF=FALSE)
## TEMP HACK to supress warnings generated by calling daisy with all NA input (#7)
ow <- options('warn')
options(warn=-1)
d <- plyr::llply(depth_slice_seq, .parallel=getOption('AQP_parallel', default=FALSE), .progress=progress, .fun=function(i, su=s.unrolled) {
## 2021-03-03
## this approach breaks when using a single variable for the distance calc
## sp is a 1-row matrix which returns an empty distance matrix
## solution: don't attempt to fix here, re-write the entire thing
## this could be a source of slowness, esp. the t()
## TODO: new implementation will require drop=FALSE
ps <- sapply(su, function(dz, z_i=depth_slice_seq[i]) { dz[z_i,] })
sp <- t(ps)
# compute distance metric for this depth
# distance metric has large effect on results
# Gower's distance gives the best looking results, and automatically standardizes variables
## this is where we run into memory-size limitations
## an ff object would help here... however it can not preserve all of the information
## that a list can... we would need to store these data as raw matrices
## TODO: sometimes we don't want to standardize, and use Euclidean distance
## TODO: don't call daisy on bogus input data, temp fix: disable warnings (#7)
## if all of the input to daisy is NA, then we get warnings from min() and max()
## this happens when we set a max depth that is beyond most profiles
d.i <- cluster::daisy(sp, metric='gower')
return(d.i)
}
)
# reset warning options
options(ow)
## TODO: does this actually do anything?
# clean-up
rm(s.unrolled) ; gc()
# print total size of D
message(paste(" [", round(object.size(d) / 1024^2, 2), " Mb]", sep=''))
# should NA in the dissimilarity matrix be replaced with max(D) ?
if(replace_na) {
# replace all NA with the MAX distance between any observations
# note that down deep, there may not be enough data for any pair-wise comparisons
# therefore, we should not attempt to calculate max() on a matrix of all NA
max.distance.vect <- sapply(d, function(i) if(all(is.na(i))) NA else max(i, na.rm=TRUE))
max.distance <- max(max.distance.vect, na.rm=TRUE)
## note: this will not work with sample_interval set
# should we use a more expensive approach, that uses the soil/non_soil flag?
if(add_soil_flag) {
# kind of messy: re-using an object like this
for(i in 1:length(d)) {
d_i <- as.matrix(d[[i]])
# set all pairs that are made between deep vs. shallow soil
# to the maximum distance- by row and column
cells.with.na.rows <- which(is.na(d_i[, which(soil.matrix[i, ])]))
cells.with.na.cols <- which(is.na(d_i[which(soil.matrix[i, ]), ]))
d_i[, which(soil.matrix[i, ])][cells.with.na.rows] <- max.distance
d_i[which(soil.matrix[i, ]), ][cells.with.na.cols] <- max.distance
# convert back to dist object
d_i <- as.dist(d_i)
# copy original attributes
attributes(d_i) <- attributes(d[[i]])
# save back to original position in list
d[[i]] <- d_i
}
# remove the soil.matrix object to save some space
rm(soil.matrix) ; gc()
}
# use a less expensive approach, where all NA are replaced by the max distance
else {
d <- lapply(d, function(d_i)
{
cells.with.na <- which(is.na(d_i))
d_i[cells.with.na] <- max.distance
return(d_i)
} )
}
}
# optionally return the distances for each depth
# depth-weighting is performed, but NA is not converted to 0
if(return_depth_distances) {
# depth-weighting
for(i in seq_along(depth_slice_seq))
d[[i]] <- d[[i]] * w[i]
return(d)
}
# final tidy-ing of the list of dissimilarity matrices
for(i in seq_along(depth_slice_seq)) {
# convert NA -> 0
na.idx <- which(is.na(d[[i]]))
if(length(na.idx) > 0)
d[[i]][na.idx] <- 0
# depth-weighting
d[[i]] <- d[[i]] * w[i]
}
# reduce list of dissimilarity matrices by summation
D <- .SumDistanceList(d)
# add distance metric
attr(D, 'Distance Metric') <- 'Gower'
# remove previous warnings about NA
attr(D, 'NA.message') <- NULL
# normalize by dividing by max(D)
# this is important when incorporating site data
# causes problems for some functions like sammon
if(rescale.result)
D <- D/max(D, na.rm=TRUE)
## DEBUG
if(verbose) {
cat('depth-slice seq:\n')
print(depth_slice_seq)
cat('depth-weighting vector:\n')
print(round(w, 2))
cat(paste('max dissimilarity:', max.distance, '\n'))
}
# return the distance matrix, class = 'dissimilarity, dist'
return(D)
}
#' @param s A SoilProfileCollection
#' @param vars Variable names
#' @param rescale.result Rescale result? Default: FALSE
#' @param ... Additional arguments passed to `profile_compare()`
#'
#' @rdname profile_compare
#' @export
pc.SPC <- function(s, vars, rescale.result=FALSE, ...){
.Deprecated('NCSP', old = "profile_compare")
if(!requireNamespace("scales", quietly = TRUE))
stop("package `scales` is required", call.=FALSE)
# default behavior: do not normalize D
## 2016-02-22: check for missing data moved from pc() to here
## TODO: this makes an assumption on the column containing horizon designations
## 2016-08-16: this function ignores vars that don't existin in @horizons
## 2019-12-19: disabled until PC is re-factored and / or there is a better way to get hz name column
# # iterate over profiles and compute percent missing data by variable
# pct_data <- evalMissingData(s, vars, name = name)
#
# ## TODO: review this, and make optional via argument (#7)
# # keep track of profiles missing any or all of their data
# problem.profiles.idx <- which(pct_data < 1)
# # bad.profiles.idx <- which(pct_data == 0)
#
# if(length(problem.profiles.idx) > 0) {
# # assign('problem.profiles', value=pct_missing[problem.profiles.idx,], envir=aqp.env)
# message('Missing data will bias results, check inputs.')
# }
## 2016-02-22: disabled for now
# if(length(bad.profiles.idx) > 0) {
# # stop stop and let the user know
# bad.profiles <- profile_id(s)[bad.profiles.idx]
# stop(paste('no non-NA values associated with profiles:', paste(bad.profiles, collapse=', '), '\nConsider removing these profiles and re-running.'), call.=FALSE)
# }
## 2020-06-19: DEB
## temporary fix for #7, related to profile ID ordering in site, horizon, and results from tapply()
## this is only a problem when using profile IDs that are numeric and not stable when alpha-sorted
##
## However, this will result in changes to sorting of profile_id(), @site, @horizon
##
## Also, this will break with merge of aqpdf branch were sorting by ID is not performed at init time
##
## this was a dump idea -> going to remove these comments after re-write
# s <- rebuildSPC(s)
# extract horizons
s.hz <- horizons(s)
# extract site
s.site <- site(s)
sn <- names(s.site)
# check for any site data, remove and a save for later
if(any(vars %in% sn)) {
# extract site-level vars
matching.idx <- na.omit(match(sn, vars))
site.vars <- vars[matching.idx]
# remove from hz-level vars
vars <- vars[-matching.idx]
## TODO: BUG!!! horizon data are rescaled via D/max(D) !!!
## TODO: allow user to pass-in variable type information
# compute dissimilarty on site-level data: only works with 2 or more variables
# rescale to [0,1]
message(paste('site-level variables included:', paste(site.vars, collapse=', ')))
d.site <- cluster::daisy(s.site[, site.vars, drop=FALSE], metric='gower')
# re-scale to [0,1]
d.site <- .rescaleRange(d.site, x0 = 0, x1 = 1)
# reset default behavior of hz-level D
rescale.result=TRUE
## TODO: there might be cases where we get an NA in d.site ... seems like it happens with boolean variables
## ... but why ? read-up on daisy
if(any(is.na(d.site))) {
warning('NA in site-level dissimilarity matrix, replacing with min dissimilarity', call.=FALSE)
d.site[which(is.na(d.site))] <- min(d.site, na.rm=TRUE)
}
## TODO: ordering of D_hz vs D_site ... assumptions safe?
} else {
# setup a dummy D_site
d.site <- NULL
}
##
## TODO: update this next part
##
# add old-style, hard-coded {id, top, bottom} column names
s.hz$id <- s.hz[[idname(s)]]
hzDepthCols <- horizonDepths(s)
s.hz$top <- s.hz[[hzDepthCols[1]]]
s.hz$bottom <- s.hz[[hzDepthCols[2]]]
# invoke data.frame method
res <- profile_compare(s.hz, vars=vars, rescale.result=rescale.result, ...)
# if we have site-level data and a valid D_site
# combine via weighted average: using weights of 1 for now
if(inherits(d.site, 'dist')) {
res <- (res + d.site) / 2
# re-scale to [0,1]
res <- .rescaleRange(res, x0 = 0, x1 = 1)
}
## fail-safe check on ordering of input profile IDs vs. labels
## #7
test.labels <- ! profile_id(s) == attributes(res)$Labels
if(any(test.labels)) {
warning('SPC / distance matrix IDs out of order, soon to be fixed (#7)', call. = FALSE)
}
# result is a distance matrix
return(res)
}
##############
## S4 stuff ##
##############
## NOTE: don't mess with this!
# setup generic function
# if (!isGeneric("profile_compare"))
setGeneric("profile_compare", function(s, ...) standardGeneric("profile_compare"))
# temp interface to SPC class objects
#' @export
#' @rdname profile_compare
setMethod(f='profile_compare', signature='SoilProfileCollection', definition=pc.SPC)
# temp interface for dataframes
#' @export
#' @rdname profile_compare
setMethod(f='profile_compare', signature='data.frame', definition=pc)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.