suppressMessages(library(Rcssplot))
library(checkmate)
source("plot_consonance.R")
The consonance package provides a framework for performing quality control.
This vignette demonstrates how to attach a consonance test suite to a model.
One advantage of this technique is that it allows a model to be deployed together with its own set of quality-control criteria, which can be checked on any data before computing predictions. Another advantage is that it allows the consonance suite to access model data during quality-control checks.
The vignette is centered around an example of multiple regression on a synthetic dataset. The techniques demonstrated, however, are applicable to any model. Indeed, consonance suites can be attached to any list-like object.
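As a minimal sketch (not evaluated), attaching a suite to a plain list might look like the following; my_suite here stands for any consonance suite, such as the ones defined later in this vignette.

# minimal sketch (not run): any list-like object can carry a consonance suite
# my_object <- list(label="any list-like object")
# my_object <- attach_consonance(my_object, my_suite)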
For a concrete example of a statistical model, let's work with multiple regression on a synthetic dataset included with the package.
library(consonance)
d <- consonance_model_data
head(d, 1)
The dataset has nrow(d) rows, one variable y that we will treat as an outcome, and five variables x1 through x5 that we will consider as inputs. In this section we want to train regression models on this data, so let's split it into parts for training and testing.
d_train <- d[seq(1, nrow(d), by=2),]   # odd rows for training
d_test <- d[seq(2, nrow(d), by=2),]    # even rows for testing
We can explore the training set by examining the correlations between its variables.
pairwise_cors <- cor(d_train)
round(pairwise_cors, 2)
The outcome variable y is strongly or moderately correlated with each of x1 through x4, but not with x5. In addition, x2, x3, and x4 are strongly correlated with one another. Correlations between the variables can complicate the interpretation of multiple regression models, but they are not in themselves incompatible with the regression framework.
We start modeling by including all the input variables.
model_all <- lm(y ~ x1 + x2 + x3 + x4 + x5, data=d_train)
coef(model_all)
Because the ranges of x1 through x5 are all similar, the model coefficients capture feature importance. Thus, the most important variables are x1 and x4.
Let's suppose we want to simplify the model so that it uses fewer explanatory variables. In practice, this might be done using lasso regularization with the glmnet package. Here, let's just create a model with two variables manually.
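For reference, a hedged sketch of the glmnet approach might look as follows (not evaluated here; it assumes the glmnet package is installed):

# sketch of lasso-based variable selection (not run)
# x_mat <- as.matrix(d_train[, paste0("x", 1:5)])         # glmnet expects a numeric matrix
# fit_cv <- glmnet::cv.glmnet(x_mat, d_train$y, alpha=1)  # alpha=1 requests the lasso
# coef(fit_cv, s="lambda.1se")                            # zero entries mark dropped variables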
model_two <- lm(y ~ x1 + x4, data=d_train)
coef(model_two)
We now have two distinct models. The quality of their fit can be captured by the root-mean-square error.
rmse <- function(m) {
  # root-mean-square error of a fitted model
  sqrt(mean(residuals(m)^2))
}
c(model_all=rmse(model_all), model_two=rmse(model_two))
The errors are similar in magnitude, so the models provide similar fits for the training data.
We can now use the two models to make predictions on new data, d_test. The error between the predicted and the expected values summarizes prediction quality.
predict_rmse <- function(model, data) {
  predicted_values <- predict(model, data)
  # root-mean-square error between predictions and observed outcomes
  sqrt(mean((predicted_values - data$y)^2))
}
d_errors <- data.frame(dataset="d_test",
                       model_all = predict_rmse(model_all, d_test),
                       model_two = predict_rmse(model_two, d_test))
d_errors
The errors remain about equal for the two models.
Let's now consider how quality is affected if the data is corrupted. In a dataset with many explanatory variables, corruption can occur in many different ways. Here, let's consider what happens when corruption affects the variables that are correlated with each other.
d_corrupt_3 <- d_corrupt_4 <- d_test
# corrupt by reversing the values in one column
# (this approach does not require any randomization)
d_corrupt_3$x3 <- rev(d_corrupt_3$x3)
d_corrupt_4$x4 <- rev(d_corrupt_4$x4)
We can now evaluate prediction errors on the two corrupted datasets.
# errors for datasets with corrupted x3 and corrupted x4
d_errors <- rbind(
  d_errors,
  data.frame(dataset="corrupt_3",
             model_all = predict_rmse(model_all, d_corrupt_3),
             model_two = predict_rmse(model_two, d_corrupt_3)),
  data.frame(dataset="corrupt_4",
             model_all = predict_rmse(model_all, d_corrupt_4),
             model_two = predict_rmse(model_two, d_corrupt_4))
)
d_errors
For the model that uses all input variables, either corruption raises the prediction error. The model that uses only two features is unaffected by the corruption of x3, but is more affected than the larger model by the corruption of x4.
We saw that regression models can make poor predictions from corrupted datasets. In this section, we try to avoid that, or at least set off warnings when a new dataset does not conform to the training data.
It is important to distinguish between two types of criteria. One type consists of conditions that can be formulated without reference to the training data. The other type depends more deeply on the data used in training. These types are called model-independent and model-dependent below; the distinction is important because they are implemented differently in the consonance package.
An example of a model-independent criterion is the requirement that variables lie in a well-defined range. For our dataset, all the input variables are in the unit range. Thus, we can create a suite of tests to check these ranges.
library(checkmate)
suite_ranges <-
  consonance_test("x1 range", test_numeric, lower=0, upper=1, .var="x1") +
  consonance_test("x2 range", test_numeric, lower=0, upper=1, .var="x2") +
  consonance_test("x3 range", test_numeric, lower=0, upper=1, .var="x3") +
  consonance_test("x4 range", test_numeric, lower=0, upper=1, .var="x4") +
  consonance_test("x5 range", test_numeric, lower=0, upper=1, .var="x5")
Another type of quality-control criterion might be on the correlations between the input variables. We saw that x4 in the training data is negatively correlated with x2 and x3. We can create a custom test function that computes correlations between pairs of variables, and then define consonance tests for the pairs x4 and x2, and x4 and x3.
# custom test function
is_cor_neg <- function(x, a=1, b=2) {
  cor(x[[a]], x[[b]]) < 0
}
suite_cor <-
  consonance_test("x2 x4 negative cor", is_cor_neg, a="x2", b="x4") +
  consonance_test("x3 x4 negative cor", is_cor_neg, a="x3", b="x4")
Note that these checks are model-independent because the correlations are computed only from the new dataset, and the threshold of zero is hard-coded. Let's attach these suites to the two regression models from the previous section.
model_all_A <- attach_consonance(model_all, suite_ranges + suite_cor)
model_two_A <- attach_consonance(model_two, suite_cor + suite_ranges)
We can now validate the training data, test data, and the two corrupted datasets.
library(magrittr)
# the training and test datasets should evaluate quietly
d_train %>% validate(model_two_A)
d_test %>% validate(model_two_A)
# the corrupted datasets will generate messages
d_corrupt_3 %>% validate(model_two_A)
d_corrupt_4 %>% validate(model_two_A)
Note that the test suites for model_all_A and model_two_A are the same, so repeating the above commands with the other model will produce the same results and error messages. Importantly, although the regression in model_two_A is based only on variables x1 and x4, the test suite requires access to variables x2 and x3 in the input data.
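To illustrate this point (a sketch, not run; the exact output depends on how consonance reports failing tests), validating a dataset stripped of x2 and x3 should produce failures even though model_two_A does not use those variables for prediction.

# sketch (not run): a dataset without x2 and x3 should fail validation
# d_test[, c("y", "x1", "x4", "x5")] %>% validate(model_two_A)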
The previous section implemented tests that can be evaluated entirely from the data being tested and hard-coded criteria. It is also possible to create tests that use information from the model during testing. This is achieved via custom functions and a special function signature.
For concreteness, let's again implement a test on correlations. We already computed the pairwise correlations between variables in the training data. Let's store these values within our two models.
model_all$pairwise_cors <- pairwise_cors
model_two$pairwise_cors <- pairwise_cors
Now, let's create a custom test function that will read this information.
is_strong_cor <- function(x, .model, a="x1", b="x2") {
  # look up the correlation observed in the training data
  .threshold <- .model$pairwise_cors[a, b]
  # require the correlation in the new data to be at least half as strong
  abs(cor(x[[a]], x[[b]])) > abs(.threshold)/2
}
Compared to the previous function is_cor_neg, this function carries an extra argument .model. The name of this argument is important because it signals to the consonance package that the function requires access to the model data. The consonance package provides the model object at run time, so the function body can look up the pairwise correlations matrix. In this implementation, the function compares a correlation in the dataset x with the corresponding correlation in the model, and it reports TRUE only if the correlation in x is at least half of the previously seen value. (The criterion 'at least half of the previously seen value' is an ad-hoc construct for the purpose of this illustrative example; a real consonance suite should implement a criterion that is appropriate to the data.)
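Before building a suite, we can exercise the function directly by supplying the model by hand; within a consonance suite, the package fills in the .model argument automatically.

# calling the custom test directly; consonance supplies .model at run time
is_strong_cor(d_test, .model=model_all, a="x2", b="x4")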
We can now define a test suite.
suite_strong <-
  consonance_test("x2 x3 strong cor", is_strong_cor, a="x2", b="x3") +
  consonance_test("x2 x4 strong cor", is_strong_cor, a="x2", b="x4") +
  consonance_test("x3 x4 strong cor", is_strong_cor, a="x3", b="x4")
Apart from having three terms rather than two, the definition syntax is similar to before. The next step is to attach the suite to the models.
model_all_B <- attach_consonance(model_all, suite_ranges + suite_strong)
model_two_B <- attach_consonance(model_two, suite_ranges + suite_strong)
We can again evaluate our original and corrupted datasets with these new models.
# the training and test datasets should evaluate quietly
d_train %>% validate(model_all_B)
d_test %>% validate(model_all_B)
# the corrupted datasets will generate messages
d_corrupt_3 %>% validate(model_all_B)
d_corrupt_4 %>% validate(model_all_B)
Again, we have signals that the corrupted datasets are not concordant with the models. The difference is that these new tests draw thresholds from a matrix of numbers stored within the model definitions.
Other custom functions can make use of any other component of the model object and implement any kind of criterion.
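For example, a hypothetical model-dependent test (the function and test names here are illustrative, not part of the package) might use the fitted model itself to confirm that predictions on a new dataset are finite:

# hypothetical sketch: a model-dependent test using the fitted object itself
has_finite_predictions <- function(x, .model) {
  all(is.finite(predict(.model, x)))
}
suite_finite <- consonance_test("finite predictions", has_finite_predictions)

Thus, the consonance testing framework is versatile and powerful.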
sessionInfo()