knitr::opts_chunk$set(echo = TRUE)

klrfome - Kernel Logistic Regression on Focal Mean Embeddings

Introduction

The purpose of this package is to solve the Distribution Regression problem for archaeological site location modeling, or any other data for that matter. The aim of Distribution Regression is to map a single scalar outcome (e.g. presence/absence; 0/1) to a distribution of features. This is opposed to typical regression, where one observation maps a single outcome to a single set of features/predictors. For example, an archaeological site is singularly defined as either present or absent, but the area within the site's boundary is not singularly defined by any one measurement. The area within an archaeological site is defined by an effectively infinite distribution of measurements. Modeling this in traditional terms means either collapsing that distribution to a single measurement or pretending that a site is actually a series of adjacent, but independent, measurements. The methods developed for this package take a different view by modeling the distribution of measurements from within a single site on a scale of similarity to the distributions of measurements from other sites and from the environmental background in general. This approach avoids collapsing measurements and promotes the assumption of independence from within a site to between sites. By doing so, it models a richer description of the landscape in a more intuitive sense of similarity.

To achieve this goal, the package fits a Kernel Logistic Regression (KLR) model onto a mean embedding similarity matrix and predicts as a roving focal function of varying window size. The name of the package is derived from this approach: Kernel Logistic Regression on FOcal Mean Embeddings (klrfome), pronounced "clear foam".
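For orientation, the core quantities can be written schematically (this is a sketch of the general approach, not necessarily the package's exact parameterization). With a Gaussian kernel $k(x, y)$ on individual observations, the similarity between two bags $S_i$ and $S_j$ is the mean of all pairwise kernel values, and site-presence is modeled through a logistic link on those bag-level similarities, in the spirit of Zhu & Hastie's kernel logistic regression:

$$
k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^{2}}{\sigma^{2}}\right), \qquad
K(S_i, S_j) = \frac{1}{\lvert S_i \rvert \, \lvert S_j \rvert} \sum_{x \in S_i} \sum_{y \in S_j} k(x, y)
$$

$$
\hat{p}_i = \frac{1}{1 + \exp\!\left(-\sum_{j} \alpha_j K(S_i, S_j)\right)}
$$

where the $\alpha_j$ are estimated by iteratively re-weighted least squares with a ridge penalty controlled by $\lambda$.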

Example workflow on simulated data

knitr::include_graphics("https://github.com/mrecos/klrfome/blob/master/README_images/KLRfome_dataflow.png?raw=true")

In brief, the process below is: 1) take a table of observations of two or more environmental variables within known sites and across the background of the study area; 2) use format_site_data() to convert that table to a list and under-sample the background data to a desired ratio (each group of observations within a site or background area is referred to in the ML literature as a "bag"); 3) use the build_K() function with the sigma hyperparameter and a distance metric (default euclidean) to create a similarity matrix between all site and background bags; 4) fit the kernel logistic regression model to that similarity matrix with the KLR() function. Steps 3 and 4 are where this method departs most from traditional regression, but they are also what sets it apart: unlike most regression, which fits a model to a table of measurements, this approach fits a model to a matrix of similarities between all of the units of analysis (sites and background areas). A toy sketch of the bag format appears below.
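To make the "bag" structure concrete: conceptually, each bag is a small table of environmental measurements for one site or one background area, paired with a single 0/1 label. The toy objects below (toy_bags, toy_presence, and the column names var1/var2) are purely illustrative assumptions about that shape; in practice format_site_data() builds the actual list format consumed by build_K() and KLR_predict().

### Illustrative only: the conceptual shape of "bags" and their labels
toy_bags <- list(
  site_1       = data.frame(var1 = rnorm(25, 50, 10),  var2 = rnorm(25, 3, 2)),  # measurements within one site
  site_2       = data.frame(var1 = rnorm(40, 52, 10),  var2 = rnorm(40, 3, 2)),  # a second site, its own distribution
  background_1 = data.frame(var1 = rnorm(30, 100, 20), var2 = rnorm(30, 6, 3))   # a sample of the background
)
toy_presence <- c(1, 1, 0)  # one presence/absence label per bag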

Libraries

library("ggplot2")   # for plotting results
library("NLMR")      # for creating simulated landscapes
library("rasterVis") # for plotting simulated lan
library("pROC")      # for evaluation of model AUC metric
library("dplyr")     # for data manipulation
library("knitr")     # for printing tables in this document
library("klrfome")   # for modeling
library("sp")        # for plotting raster prediction in document

Set hyperparameters and load simulated site location data

In this block, the random seed, the sigma and lambda hyperparameters, and the dist_metric are all set. The sigma hyperparameter controls how "close" observations must be to be considered similar. Closeness in this context is defined in the feature space, not in geographic or measurement space. At a higher sigma, more distant observations can still be considered similar. The lambda hyperparameter controls the regularization in the KLR model by penalizing large coefficients; it must be greater than zero. The higher the lambda penalty, the more the model shrinks its alpha parameters toward zero, thereby reducing the influence of any one observation, or group of observations, on the overall model. These two hyperparameters are the most critical, as they govern the flexibility and scope of the model. Ideally, they will be set via k-fold cross-validation, leave-one-out cross-validation, grid search, or trial and error; a minimal grid-search sketch is given at the end of the tabular modeling section below. Exploring these hyperparameters will likely take most of the time and attention within the modeling process.

#Parameters
set.seed(232)
sigma = 0.5
lambda = 0.1
dist_metric = "euclidean"

Simulate site data and format

Archaeological site data is protected information that often cannot be shared. This is a major impediment to open archaeological research. In this package I created a function to simulate archaeological site data so that people can test the package and see how to format their site data. The get_sim_data function simulates N_site_bags sites from site_samples observations of two environmental variables. The variables are drawn from normal distributions with a mean and standard deviation; well-discriminated defaults are provided. The function simulates site-present and environmental-background classes with different means and SDs for the two variables. The returned object is a list that contains the data in a list format for use in the KLR functions, but also in a table format for use in most other modeling functions. This is so that you can compare model results on the same data.

### Simulate Training Data
sim_data <- get_sim_data(site_samples = 800, N_site_bags = 75)
formatted_data <- format_site_data(sim_data, N_sites=10, train_test_split=0.8,
                                   sample_fraction = 0.9, background_site_balance=1)
train_data <- formatted_data[["train_data"]]
train_presence <- formatted_data[["train_presence"]]
test_data <- formatted_data[["test_data"]]
test_presence <- formatted_data[["test_presence"]]

Build similarity kernel, fit KLR model, and predict on the test set

The first step in modeling these data is to build the similarity kernel with build_K. The output is a pairwise similarity matrix for the elements of the list you give it, in this case train_data. The object K is the NxN similarity matrix of the mean similarity between the multivariate distances of each site and background list element. These elements are often referred to as labelled bags because they are collections of measurements with a presence or absence label. The second step is to fit the KLR model with the KLR function. The KLR fit function is a key component of this package; it fits a KLR model using Iteratively Re-weighted Least Squares (IRLS). Setting verbose = 2 shows the convergence of the algorithm. The output of this function is a list of the alpha parameters (for prediction) and the predicted probability of site-presence for the train_data set. Finally, the KLR_predict function uses the train_data and alphas to predict the probability of site-presence for new test_data. The similarity matrix can be visualized with corrplot, the predicted site-presence probability for simulated site-present and background test data can be viewed as a ggplot, and finally the parameters of the model are saved to a list to be used for predicting on a study area raster stack.

##### Logistic Mean Embedding KRR Model
#### Build Kernel Matrix
K <- build_K(train_data, sigma = sigma, dist_metric = dist_metric, progress = FALSE)
#### Train KLR model
train_log_pred <- KLR(K, train_presence, lambda, 100, 0.001, verbose = 2)
#### Predict KLR model on test data
test_log_pred <- KLR_predict(test_data, train_data, dist_metric = dist_metric,
                             train_log_pred[["alphas"]], sigma, progress = FALSE)

### Plot K Matrix
K_corrplot(K,train_data,clusters=4)

### Plot Test Set Prediction
predicted_log <- data.frame(pred = test_log_pred, obs = test_presence)
ggplot(predicted_log, aes(x = as.factor(obs), y = pred, color = as.factor(obs))) +
  geom_jitter(width = 0.1) +
  theme_bw() +
  ylim(c(0,1)) +
  labs(y = "Predicted Probability", x = "Site Presence",
       title = "Kernel Logistic Regression",
       subtitle = "test set predictions; simulated data") +
  theme(
    legend.position = "none"
  )

### Save parameters for later prediction
params <- list(train_data = train_data,
               alphas_pred = train_log_pred[["alphas"]],
               sigma = sigma,
               lambda = lambda,
               means = formatted_data$means,
               sds = formatted_data$sds)
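
As noted in the hyperparameter section above, sigma and lambda are ideally tuned rather than fixed a priori. Below is a minimal grid-search sketch using only the functions already shown in this vignette and scoring on the single train/test split created above (not true k-fold cross-validation); the grid values and object names (sigma_grid, lambda_grid, tune_results) are illustrative assumptions.

### Illustrative grid search over sigma and lambda (assumptions noted above)
sigma_grid  <- c(0.1, 0.5, 1, 2)
lambda_grid <- c(0.01, 0.1, 0.5)
tune_results <- expand.grid(sigma = sigma_grid, lambda = lambda_grid, auc = NA)
for (i in seq_len(nrow(tune_results))) {
  s <- tune_results$sigma[i]
  l <- tune_results$lambda[i]
  ### similarity kernel and KLR fit at this hyperparameter pair
  K_i   <- build_K(train_data, sigma = s, dist_metric = dist_metric, progress = FALSE)
  fit_i <- KLR(K_i, train_presence, l, 100, 0.001)
  ### predict the held-out test bags and score with AUC
  pred_i <- KLR_predict(test_data, train_data, dist_metric = dist_metric,
                        fit_i[["alphas"]], s, progress = FALSE)
  tune_results$auc[i] <- as.numeric(pROC::auc(test_presence, pred_i))
}
tune_results[which.max(tune_results$auc), ]  # best-scoring pair on this split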

Predicting on a raster stack

knitr::include_graphics("https://github.com/mrecos/klrfome/blob/master/README_images/KLRfome_prediction.png?raw=true")

This package can be used to predict on tabular data as above, but a more practical approach is to predict directly on a set of raster layers representing the predictor variables. Most of the code below is there to create a simulated landscape that has some fidelity to the training data. For real-world examples, the prediction starts with a raster stack of predictor variable rasters. From there the scale_prediction_rasters function centers and scales the values of the rasters to those of the train_data. Having data that is centered at zero and scaled to z-scores is critical in measuring the distance between observations. Further, it is critical that the test data (raster stack) is scaled to the same values as the training data, or the predictions will be invalid. Once scaled, the raster stack is sent to the KLR_raster_predict function for prediction. The prediction function requires a scaled raster stack of the same variables used to train the model, the ngb value that specifies the x and y dimensions of the focal window, and finally the list of params from the trained model. The settings on KLR_raster_predict shown here predict over the entire raster at once and not in parallel. The KLR_raster_predict function has options for splitting the prediction into a number of blocks and predicting on each of those. Further, each split raster block can be assigned to a different core on your computer to compute in parallel. Prediction is a time-consuming process, and it is often helpful to split the computation into more manageable blocks. You can also set it to return the predicted blocks as a list of rasters (all in memory) or to save each block as a GeoTiff after it is predicted. A version of parallel processing is shown in the next code section.

### width and height of roving focal window (required)
ngb = 5
### Number of rows and columns in prediction rasters
## needed for making simulated rasters, as well as for predicting real-world rasters
cols = 100
rows = 100

### Create simulated environmental rasters  (sim data only) ####
s_var1r <- NLMR::nlm_gaussianfield(cols,rows, autocorr_range = 20)
s_var1 <- rescale_sim_raster(s_var1r, 50, 10) 
s_var2 <- rescale_sim_raster(s_var1r, 3, 2) 
b_var1r <- NLMR::nlm_gaussianfield(cols,rows,autocorr_range = 20)
b_var1 <- rescale_sim_raster(b_var1r, 100, 20) 
b_var2 <- rescale_sim_raster(b_var1r, 6, 3) 
### Create a site-present trend surface  (sim data only)
trend_coords <- sim_trend(cols, rows, n = 3)
coords <- trend_coords$coords
trend <- trend_coords$trend
inv_trend <- abs(1-trend)
var1 <- (s_var1 * trend) + (b_var1 * inv_trend)
var2 <- (s_var2 * trend) + (b_var2 * inv_trend)
#### end simulated data creation ####

### Create raster stack of predictor variables
pred_var_stack <- raster::stack(var1, var2)
names(pred_var_stack) <- c("var1","var2")
### scale rasters to training data
pred_var_stack_scaled <- scale_prediction_rasters(pred_var_stack, params, verbose = 0)
### Predict raster (single chunk, not in parallel) 
pred_rast <- KLR_raster_predict(pred_var_stack_scaled, ngb = ngb, params, split = FALSE, ppside = NULL,
                                progress = FALSE, parallel = FALSE)
### plot with simulated sites
rasterVis::levelplot(pred_rast, margin = FALSE, par.settings=viridisTheme()) +
  latticeExtra::layer(sp.points(sp.points(SpatialPoints(coords), pch=15, cex = 2.25, col = "red")), columns=1)

Predicting in parallel/multi-core

This package uses the foreach package's %dopar% operator to parallelize model prediction. To make this work you need to start a parallel backend. Here I show how to do this with the doParallel package. It is simple in most cases: make a cluster with makeCluster and the number of cores you have available, and then register the cluster with registerDoParallel (note the use of stopCluster() when you are done). In the KLR_raster_predict function, you need to set a few things. Set parallel = TRUE, split = TRUE, and ppside to the number of raster blocks along the x and y axes (resulting in ppside^2 output rasters). The function also gives the options output = "list" to return a list or output = "save" to save each raster as a GeoTiff to the save_loc directory. Finally, the cols and rows values are used here so that the split function knows the dimensions of the entire prediction raster. It needs to know this because it mitigates the edge effect of predicting on blocks by putting a collar of size ngb around each block. The cols and rows values let the function know when to remove the collar and when it is at an edge of the study area.

library("doParallel")
library("parallel")
### create and register parallel backend
### Vignette uses 2 cores, but you can use as many as you have
cl <- makeCluster(2L)
doParallel::registerDoParallel(cl)

### Use same KLR_raster_predict function with parallel = TRUE
pred_rast_list <- KLR_raster_predict(pred_var_stack_scaled, ngb = ngb, params, split = TRUE, ppside = 5,
                                   progress = FALSE, parallel = TRUE, output = "list",
                                   save_loc = NULL, overwrite = TRUE, cols = cols, rows = rows)
### Merge list back to a single raster
pred_rast <-  do.call(merge, pred_rast_list)
### plot with simulated sites
rasterVis::levelplot(pred_rast, margin = FALSE, par.settings=viridisTheme()) +
  latticeExtra::layer(sp.points(sp.points(SpatialPoints(coords), pch=15, cex = 2.25, col = "red")), columns=1)

### Or set output = "save" to save each prediction block out to a folder as a GeoTiff # not run
# pred_rast_list <- KLR_raster_predict(pred_var_stack_scaled, ngb = ngb, params, split = TRUE, ppside = 5,
#                                    progress = FALSE, parallel = TRUE, output = "save",
#                                    save_loc = "c:/Temp/tif", overwrite = TRUE)

stopCluster(cl)

Model Evaluation

Evaluating the model on independent test data is very important for understanding the model's ability to generalize to new observations. The klrfome package provides two functions to assist with this: CM_quads() and metrics(). In order to use these functions, the user needs to extract the sensitivity values output from KLR_raster_predict at both site locations (preferably not those used to create the model) and at a large number of random locations to sample the full distribution of sensitivities. The code below uses the simulated site locations, raster::extract, and sampleRandom to obtain these values. Once the values are put into a table and labelled as 1 or 0, the CM_quads function is applied to retrieve the True Positive, True Negative, False Positive, and False Negative values of the model at one or more probability thresholds. Choosing the appropriate threshold at which to classify a model into High, Moderate, and Low, or site-present vs. site-absent, needs a good deal of consideration of both the intrinsic model quality and the extrinsic model goals. Here I show how to evaluate the classification metrics at every 0.1 threshold between 0 and 1. The results from CM_quads (named for the four quadrants of the confusion matrix) are then evaluated with the pROC::auc and klrfome::metrics functions to calculate the metrics of choice.

### Make some polygons around the simulated site points.
### If all you have is points for sites, site radius can be an assumption
site_pnts  <- SpatialPoints(coords)
site_polys <- rgeos::gBuffer(site_pnts, width = 6, byid = FALSE)

### extract sensitivity raster values to site areas
site_sample <- raster::extract(pred_rast, site_polys, weights = FALSE, 
                               small = TRUE, df = TRUE) %>%
  rename(pred = layer) %>%
  mutate(presence = 1)
### sample for an environmental background of sensitivity values. (e.g. n = 500)
bkg_sample <- data.frame(ID = 0, pred = raster::sampleRandom(pred_rast, 500),
                         presence = 0)
model_pred <- rbind(site_sample, bkg_sample)

### A vector of the sensitivity thresholds that you want to evaluate the model at
threshold <- seq(0,1,0.1)
### Compute True Positive, True Negative, False Positive, and False Negative values at each threshold
kstats <- CM_quads(model_pred, threshold)

### use the pROC::auc and klrfome::metrics functions to compute the metrics of choice at each threshold
Test_area_metrics <- kstats %>%
  group_by(Threshold) %>%
  dplyr::mutate(AUC = round(pROC::auc(model_pred$presence, model_pred$pred, type = "linear"),3),
                YoudensJ = round(metrics(TP,TN,FP,FN)$Informedness,3),
                KG       = round(metrics(TP,TN,FP,FN)$KG,3),
                Sensitivity = round(metrics(TP,TN,FP,FN)$Sensitivity,3),
                FPR = round(metrics(TP,TN,FP,FN)$FPR,3),
                FNR = round(metrics(TP,TN,FP,FN)$FNR,3)) %>%
  data.frame()

knitr::kable(Test_area_metrics)

In this case those metrics are the AUC, Youden's J, KG (Kvamme Gain), Sensitivity, FPR (False Positive Rate), and FNR (False Negative Rate). While no one metric is a perfect descriptor of model performance in all scenarios, these are the metrics that I find most useful for describing archaeological predictive models. In this example, I am aiming to maximize Youden's J (aka Informedness) for the two-class classification. As such, I would set my site-present vs. site-absent threshold at r Test_area_metrics[which(Test_area_metrics$YoudensJ == max(Test_area_metrics$YoudensJ)),"Threshold"]. At this threshold we have a Sensitivity = r Test_area_metrics[which(Test_area_metrics$YoudensJ == max(Test_area_metrics$YoudensJ)),"Sensitivity"] and a FPR = r Test_area_metrics[which(Test_area_metrics$YoudensJ == max(Test_area_metrics$YoudensJ)),"FPR"].
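If a single site-present vs. site-absent map is needed, the chosen threshold can be applied back to the sensitivity raster. A minimal sketch, assuming the raster package's comparison operators and the hypothetical object name best_threshold:

### Illustrative: reclassify the sensitivity raster at the selected threshold
best_threshold <- Test_area_metrics$Threshold[which.max(Test_area_metrics$YoudensJ)]
pred_class <- pred_rast >= best_threshold   # 0/1 raster: site-absent vs. site-present
rasterVis::levelplot(pred_class, margin = FALSE, par.settings = viridisTheme())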

Main Package Functions

This package contains the functions necessary to compute Kernel Logistic Regression (KLR) on mean kernel embeddings, functions for preparing site and background data, and a function for simulating archaeological site location data.

KLR_funs.R for Fitting and Predicting on Tabular Data

raster_predict_funs.R for Predicting on Raster Data

format_site_data.R for Data Formatting

get_sim_data.R for Simulating Site Location Data

metrics.R for Calculating Performance Metrics

plot.R for Plotting Similarity Kernel

Citation

Please cite this package as:

Harris, Matthew D., (2017). klrfome - Kernel Logistic Regression on Focal Mean Embeddings. Accessed 10 Sep 2017. Online at https://doi.org/10.5281/zenodo.1218403

Special Thanks

This model is inspired by and borrows from Zoltán Szabó's work on mean embeddings [@Szabo, @Szabo2] and Ji Zhu & Trevor Hastie's Kernel Logistic Regression algorithm [@Zhu]. I extend a hearty thank you to Zoltán for his correspondence during the development of this approach; it would not have been created without his help. Further, a big thank you to Ben Marwick for his moral support and for the rrtools package used to create this package. That being said, all errors, oversights, and omissions are my own.

Licenses

Text and figures: CC-BY-4.0

Code: See the DESCRIPTION file

Data: CC-0, attribution requested in reuse
