library(statisticalModeling)
library(dplyr)
library(ggplot2)
library(mosaic)

A course named "data analysis" is a common gateway to the concepts and techniques for drawing information from data. In this context, as in so many, "analysis" is used to infuse a scientific flavor into the word it modifies; it has no precise meaning.^[The character Spock in Star Trek says in one episode, "I have no analysis due to insufficient information." That sounds more scientific or logical than its simpler equivalent: "I don't know."] Strictly speaking, "analysis" is the opposite of "synthesis." Synthesis brings parts together to form a whole; analysis splits the whole into its parts. In statistics, "analysis of variance" uses "analysis" in precisely this sense: dividing the whole variance into parts.

Wikipedia describes "data analysis" as having "multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains." This means that "data analysis" is, ironically, the synthesis of many different things.

Many, but not all, of the techniques taught in a typical "data analysis" course are really about modeling. In my experience, "modeling" provides a useful framework for describing much of what we want to accomplish by looking at data. The goal of this vignette is to motivate, describe, and document some software --- the statisticalModeling package --- designed to make it easier for students and their teachers to discuss, study, and carry out modeling.

Modeling

My favorite definition of a model comes from a book, How to Model It: Problem Solving for the Computer Age, by Starfield, Smith, and Bleloch. They wrote that

a model is a representation for a purpose.

This definition points out that a model is not the thing itself: it is a representation that stands for the thing and displays its important facets. The definition also emphasizes that the best representation depends in large part on the purpose for which that representation will be used. In doing so, it provides a useful framework for organizing concepts, e.g., what is being represented, and for what purpose?

In teaching and conducting statistical modeling, I think it's important to be able to hold in one's head the answers to these questions. With this accomplished, building a model is a matter of making choices from the possibilities.

With respect to data, models are typically created to:

  1. Make predictions about new events based on observations of events that have already occurred.
  2. Provide a description of the real-world mechanisms that underlie the system being modeled.
  3. Look for (potentially complex) relationships in a mass of data.

The statisticalModeling package is intended to make it straightforward to carry out these objectives. It does so by using notation that is consistent across the different modeling operations and by providing the specific operations most needed for building and interpreting models.

Data

The starting point for using statisticalModeling is data organized into the usual tabular form: cases and variables. The package does not provide facilities for collecting and organizing raw data. There are many excellent packages already in wide use for that, e.g., dplyr, tidyr, and data.table. So start working with statisticalModeling with a data table (e.g. a "data.frame", a "tibble", etc.) already in hand.

Throughout the operations provided by statisticalModeling, you will access a variable by its column name, and direct the operation to a particular data table by using a data = argument.

Some data tables used for examples are provided directly by statisticalModeling. There is nothing special or essential about these data. They are in the package because it is helpful in examples or demonstrations to work with data that the learner also has access to, and because over-familiarity with ever-popular data sets such as the famous iris, mtcars, and diamonds can lead to a failure to see what's new about a demonstration.
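
For instance, the Houses_for_sale data table used later in this vignette shows this layout, with one case per row and one variable per column:

head(Houses_for_sale)  # first few rows of a package-provided data table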

Models and functions

Models are built using the standard R functions for such things, e.g. lm(), glm(), rpart() and so on.

Each of these model-building functions takes the same kinds of inputs (see the sketch after this list):

  1. a data table passed via the data = argument.
  2. a formula specifying the response variable on the left and the explanatory variables on the right. Depending on the model architecture, e.g. lm() or rpart(), the formula may also specify interactions, transformations, etc.
  3. whatever parameters are specific to the model architecture, e.g. the complexity parameter cp for recursive partitioning models or the family for generalized linear models.
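
To make the shared interface concrete, here is a minimal sketch that fits three different architectures to the CPS85 data from the mosaicData package (used again later in this vignette). It assumes the rpart package is installed:

library(rpart)
data(CPS85, package = "mosaicData")
lm(wage ~ age + sector, data = CPS85)                          # linear model
glm(married ~ age + wage, family = "binomial", data = CPS85)   # logistic regression
rpart(wage ~ age + sector, data = CPS85, cp = 0.02)            # recursive partitioning

In each case the formula and data = arguments play the same roles; only the architecture-specific parameters (family, cp) differ.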

This vignette isn't about how to select response and explanatory variables, or about which model architectures are appropriate for any given purpose. For our purposes here, it suffices to say that the choice depends on (among other things) the purpose for building the model.

Accomplishing these purposes of course involves constructing models. R already provides many relevant model-fitting capabilities in regression, machine learning, classification, etc. Typically, constructing a model means investigating several or many competing models, comparing them, and evaluating and interpreting them to extract information of interest.

The statisticalModeling package provides a small set of operations on models to make it straightforward to compare, evaluate, and interpret them. The sections that follow describe these operations.

Graphics of model functions

The function gmodel()^[Called fmodel() in earlier versions of this package.] draws a graph of the function induced by a model. I'll start with this because it illustrates well how the statisticalModeling functions provide ease of use.

When graphing a model, choices need to be made about the levels at which to evaluate the explanatory variables, and about which explanatory variable is to be mapped to which feature of the graph: x-axis, color, facets. The only required argument to gmodel() is a model object. You can use a formula if you want to control the roles played by the explanatory variables.

Example: The Houses_for_sale dataset was organized by Prof. Dick De Veaux of Williams College as a demonstration of confounding for introductory statistics students. Here is a model of the asking prices of houses as a function of several familiar attributes of the property.

house_mod <- lm(price ~ fireplaces * living_area + land_value + bedrooms, data = Houses_for_sale)

Since five variables are involved in the model, some quantitative and some categorical, decisions need to be made about how to depict the model in a graph. gmodel() provides plausible defaults so that a graph of the model function can be made without specifying any details. gmodel() also provides a simple mechanism, via a formula, to set the roles of the various explanatory variables in the graph.

gmodel(house_mod)
gmodel(house_mod, ~ living_area + fireplaces)
gmodel(house_mod, ~ living_area + land_value + bedrooms, bedrooms = 1:4)

Note that the argument bedrooms = 1:4 indicates to gmodel() that it should use the values 1, 2, 3, and 4 as levels shown in the plot for the number of bedrooms.

Data can be added to the graphic using geom_point(). Here's an example of a model linking wages to age, sex, and the sector of work:

library(ggplot2)
wage_mod <- lm(wage ~ age * sex + sector, data = mosaicData::CPS85)
gmodel(wage_mod, ~ age + sex + sector, nlevels = 8) + 
  ggplot2::geom_point(data = mosaicData::CPS85, alpha = 0.1)

Transforms

Often, we apply transforms such as the logarithm to the response or to explanatory variables before constructing the model. It can be convenient to be able to display the variable itself rather than the log. The post_transform argument supports this for the response variable. [Feedback: would you like to be able to do this for explanatory variables? How about for transforms such as the rank?]

mod <- lm(log(wage) ~ age + educ + sex, data = CPS85)
gmodel(mod, post_transform = c("hourly wage" = exp))

Plotting classifiers

When the output of a classifier is to be plotted, gmodel() selects one of the possible output levels and presents the probability of that level as a function of the explanatory variables. You can specify this level with the prob_of argument. For example:

data(CPS85, package = "mosaicData")
mod <- glm(married ~ age*wage*sex + sector, family = "binomial", data = CPS85)
gmodel(mod, prob_of = "Single")

Evaluating models

Evaluating a model means providing a set of inputs for the explanatory variables and then computing the output of the model at each of those inputs. In base R, the predict() method provides this functionality, but it has several deficiencies:

  1. The default inputs are the data used to train the model. But often, the modeller wants a quick view of the model output for just a few different inputs. Accomplishing this involves creating a new data frame, as sketched after this list.
  2. predict() has an argument interface that is inconsistent among different model types.
  3. predict() does not record which inputs got mapped to which outputs.
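
As a point of comparison, here is the base-R workflow described in the first point, sketched with the wage_mod fitted above: build a data frame of inputs by hand, then re-attach it to the predictions.

inputs <- data.frame(age = c(25, 55), sex = "F", sector = "service")
cbind(inputs, model_output = predict(wage_mod, newdata = inputs))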

The evaluate_model() function in the statisticalModeling package addresses these difficulties. Its default inputs are a small set of "typical" values derived from the training data. It's easy to override this default for any particular explanatory variable, while continuing to use the defaults for the others. The interface is consistent across kinds of model objects. And the output of evaluate_model() is a data frame that includes both the inputs and the corresponding output.

To illustrate:

evaluate_model(wage_mod)

Or, to override the levels selected for evaluation:

evaluate_model(wage_mod, nlevels = 2)
evaluate_model(wage_mod, sector = "service", age = c(25, 55))

Use the data = argument to provide an explicit set of cases, or set on_training = TRUE to evaluate the model on the training data. When this is done, the response variable used for training will also be in the output of evaluate_model(), making it easy to calculate residuals or prediction error.
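
For instance, here is a sketch of an in-sample mean-squared error computed this way. It assumes the model values appear in a column named model_output; check names() on the result if in doubt.

In_sample <- evaluate_model(wage_mod, on_training = TRUE)
with(In_sample, mean((wage - model_output)^2))  # in-sample mean squared error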

Effect sizes

A simple but effective way to interpret a model is to look at how a change in an explanatory variable leads to a change in model output: the effect size. Conventionally, this is done by looking at model coefficients, but this can become complicated in the presence of interactions or with models involving link functions.
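
The coefficients of the wage_mod fitted earlier illustrate the problem: because the model contains an age * sex interaction, no single coefficient gives the effect of age.

coef(wage_mod)  # the age effect is split between the age and age:sexM terms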

A short-cut is to use evaluate_model() with inputs at different levels, for example:

evaluate_model(wage_mod, sector = "service", sex = "F", age = 50)
evaluate_model(wage_mod, sector = "service", sex = "F", age = 55)

The effect_size() function provides a simpler interface:

effect_size(wage_mod, ~ age, age = 50, sector = "service")

The explanatory variable whose effect size is being calculated is set by the one-sided formula, e.g. ~ age in the above example. Note also that the effect size is specified as a slope when the explanatory variable is quantitative, and as a change when the explanatory variable is categorical.
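
For instance, here is a categorical counterpart to the quantitative example above:

effect_size(wage_mod, ~ sex)  # reported as a change between levels, not a slope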

Also notice that for quantitative variables, the step between the two values being compared is set to a mid-sized value (based on the range in the training data). This finite-difference approach can be more appropriate than an instantaneous slope when working with discontinuous architectures such as recursive partitioning.

Inference

Statistical inference is one of the most daunting subjects that a beginning student faces. Students need to learn about quantities like degrees of freedom, the null hypothesis plays an outsize role, and there are several different kinds of sampling distributions involved.

Any inference technique provided by a model object is, of course, always available for your use. For instance, regression tables and ANOVA are often used in linear models.

The statisticalModeling package attempts to provide a much simpler basis for inference. There are two elements behind the strategy for doing so:

  1. Use conceptually simple techniques, e.g. the bootstrap, that work in the same way across many different model architectures.
  2. Use cross-validation to avoid having to worry about the consequences of in-sample evaluation of data.

To illustrate, let's draw on the question that motivated the collection of the Houses_for_sale data: how much does a fireplace influence the sales price of a house? It's hardly reasonable to claim that a fireplace is a primary determinant of a house's price: there are covariates that are likely much more important. Here's a baseline model of price based on some of these covariates:

baseline_mod <- lm(price ~ living_area + bathrooms + land_value, data = Houses_for_sale)

And here's a model that adds fireplaces to the mix:

fireplace_mod <- lm(price ~ living_area + bathrooms + land_value + fireplaces, 
                    data = Houses_for_sale)

A conventional comparison of the two models is done using in-sample residuals and therefore needs to take into account degrees of freedom, the shape of the F distribution, etc.

anova(baseline_mod, fireplace_mod)

This is fine, but somewhat complicated and abstract.

A cross-validation approach looks at the prediction error of the models on out-of-sample data (implemented in this case with k-fold cross-validation). That prediction error will vary depending on the data used for training and the data used for evaluation. The cv_pred_error() function calculates several random replicates of cross-validation, producing an output of mean-squared error for each replicate.

Trials <- cv_pred_error(baseline_mod, fireplace_mod, ntrials = 10)
Trials

Assessing whether fireplaces shows up as a determinant of the house's price is a matter of comparing the MSE for the two models:

boxplot(mse ~ model, data = Trials)

Not so much. Or, if you want to generate a p-value, try

t.test(mse ~ model, data = Trials)

Cross-validation tests the performance of models when assessed on out-of-sample data. Here, we see that fireplaces is not a major player.

Bootstrapping

Returning to the house-price models, consider the effect size of fireplaces.

effect_size(fireplace_mod, ~ fireplaces)

This indicates that a fireplace adds about $4000 to the price of a house. How precise is this estimate? We can address this by bootstrapping.

In the statisticalModeling package, the process is to create an ensemble of bootstrap replicates of a model.

E <- ensemble(fireplace_mod, nreps = 50)

This particular ensemble has 50 bootstrap replicates. These replicates can be passed to functions such as effect_size() in the same way that an individual model is used.

Calculating the effect size on this ensemble will produce an ensemble of effect sizes, the spread of which gives the precision of the estimate.

EE <- effect_size(E, ~ fireplaces)
with(data = EE, sd(slope))
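
The ensemble also supports a rough percentile-style bootstrap interval, again relying on the slope column used just above:

with(data = EE, quantile(slope, c(0.025, 0.975)))  # rough 95% interval for the fireplace effect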

[Please provide feedback: Would it be useful to be able to pass a bootstrap ensemble to gmodel() to plot the ensemble of model functions?]

This is different from the approach in the mosaic package. I think the mosaic approach is expressive and elegant, but it's not well suited to the automatic choice of input values supported by evaluate_model(). Nonetheless, you can use the mosaic approach with statisticalModeling functions, for instance:

library(mosaic)

do(10) * {
  fireplace_mod <- 
    lm(price ~ living_area + bathrooms + land_value + fireplaces, 
       data = resample(Houses_for_sale))
  effect_size(fireplace_mod, ~ fireplaces)
}

Formula-driven graphics

The functions and documentation for the gf_ graphics functions have been moved to the ggformula package as of statisticalModeling v. 0.4.0.


