library(learnr)
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(caret)
library(GGally)
library(mice)
library(imputeMissings)
library(missForest)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))


homes <- read_csv("LAHomes.csv") %>% 
  mutate(city = case_when(city == "Beverly Hills" ~ "San Francisco",
                          city == "Santa Monica" ~ "Menlo Park",
                          city == "Westwood" ~ "Berkeley", TRUE ~ "Oakland"))

set.seed(123)

price <- homes$price
city <- homes$city
#homes$bed <- rpois(1594, 3) + 1
homes$rooms <- homes$bed * 2 + 1 + rbinom(1594, 1, .5)
#homes$bath <- rpois(1594, 2) +2
#homes$pool <- pool
homes <- prodNA(homes, .05)
homes$spa[c(12,945,677)] <- 1
homes$price <- price
homes$city <- city

homes <- homes |>
  dplyr::select(-garage) |>
  mutate(city = factor(city),
         pool = factor(pool))

complete_homes <- homes %>% 
  mutate(pool = ifelse(is.na(pool), "N", pool)) %>% 
  dplyr::select(-spa)

complete_homes <- impute(complete_homes, method = "median/mode")

log_mod <- lm(log(price) ~ city + 
             sqft + 
             bed + 
             bath + 
             pool, data = complete_homes)

loglog_mod <- lm(log(price) ~ city + 
             log(sqft) + 
             bed + 
             bath + 
             pool, data = complete_homes)

nolog_mod <- lm(price ~ city + 
             log(sqft) + 
             bed + 
             bath + 
             pool, data = complete_homes)

# set the seed for reproducibility
set.seed(123)

# randomly select 70% of the row indices of the data frame
index <- sample(1:nrow(complete_homes), size = .7*nrow(complete_homes), replace = F)

# subset the data with the selected row indices to create the training set
train <- complete_homes[index, ]

# subset the data with the remaining row indices to create the testing set
test <- complete_homes[-index, ]

loglog_model <- lm(log(price) ~ city + 
             log(sqft) + 
             bed + 
             bath + pool, data = train)

Introduction

By "real world regression" I mean to refer to all of the potential complications associated with using regression to model actual data. What issues are you likely to encounter?

The goal in this tutorial is to equip you with the skills and example code to handle some of the above challenges.

Predicting Home Sale Price

As an example, we will use data similar to what you worked with for the PacDev case, but exhibiting more real world characteristics, with---among other problems---missing observations and non-linearity. Here is the data dictionary:

  - city: city in which the home is located
  - price: sale price in dollars
  - sqft: interior square footage
  - bed: number of bedrooms
  - bath: number of bathrooms
  - rooms: total number of rooms
  - pool: Y if the home has a pool
  - spa: 1 if the home has a spa

Each row represents sales information on a single family home in the Bay Area.

To get started, take a look at the data:

summary(homes) 
glimpse(homes)

Two of the variables are factor variables, while the rest are numeric. There is quite a bit of missing data, which we will need to address before modeling. Additionally, pool and spa have zero variance: the data that is not missing consists of just a single value---Y and 1, respectively.

Near zero variance predictors

Zero variance predictors like pool and spa contain no information at all. By definition, these predictors will add nothing to a regression model and should be removed. In fact, the least squares algorithm will not be able to estimate a coefficient for a zero variance predictor.
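A quick illustration using spa (which we will remove below anyway): its non-missing values are all identical, so lm() cannot distinguish it from the intercept and returns NA for its coefficient.

# spa is constant among its non-missing values, so its coefficient cannot be estimated
coef(lm(price ~ spa, data = homes))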

Near zero variance predictors contain a little information---but not much. Should they be removed? Not necessarily. In fact, I would recommend not automatically eliminating such variables with the rationale that a little information is better than none. And including weak predictors in a linear regression model will typically not cause problems.

When might you want to remove near zero variance predictors? Here are some considerations:

  1. When using resampling methods such as cross-validation, a near zero variance predictor can become a zero variance predictor within a particular fold and cause the model fit to fail.
  2. When a data set has many predictors, dropping near zero variance columns reduces model complexity and speeds up fitting with very little loss of information.
  3. When such a predictor is included in interactions, the resulting interaction terms will be extremely sparse and the associated estimates unstable.

The caret package includes a function, nzv() (for "near zero variance"), that will identify the columns in a data frame with little or no variation. Here is a quick example of how to use nzv():

nzv(homes, names = T)

This result tells us that all the columns in this data set except pool and spa have enough variance to function as useful predictors. Here is how you would remove near zero variance variables using nzv(): new_data <- data[, -nzv(data)].
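For example, here is a sketch of applying that idiom to this data (where nzv() flags pool and spa):

# Drop the near zero variance columns identified by nzv()
homes_reduced <- homes[, -nzv(homes)]
names(homes_reduced)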

Missing Data

If the missing observations are missing at random (MAR) or missing completely at random (MCAR), we could just remove the rows with NAs. Why? If the missings are indeed random, then removing them should not alter the general pattern of relationships in the data. But if the number of missing observations is large, as here, then using only the complete cases can create problems: we might not have enough data left to fit a reliable model, particularly if some of the variables don't have much variation.

The solution in such situations is to impute the missing observations, which means to make an educated guess about the probable values of the NAs---what they would have been were they not missing.

Note that NA can mean different things for different data sets. Read the data dictionary carefully! Sometimes you don't need to impute NAs but to replace them with appropriate values. For example, the Alley column in the Ames Housing data (for the Kaggle project) is 90% NAs. The data dictionary informs us that NA means the home has no alley. Thus, NA does not represent missing information in this case, but a missing feature of the home. In this case we could confidently replace NA with "none." In the data we are working with here, the NAs in pool also seem to mean "none" since the only recorded observations are Y. And the extent of missing observations in spa---1591---combined with the fact that the only recorded values are 1, means that this variable contains no information.

unique(homes$spa)

spa can be removed without consequence, and the NAs in pool can be replaced with N (based on the logic stated above). Let's make these changes now and calculate the complete cases:

homes <- homes %>% 
  mutate(pool = ifelse(is.na(pool), "N", pool),
         pool = factor(pool)) %>% 
  select(-spa)

sum(complete.cases(homes))

The lm() function will automatically remove rows with missing observations---it uses only complete cases---so the functional size of this data set for regression modeling is much smaller than it seems at first.

How do you know when to impute genuinely missing observations rather than removing the entire row? The short answer is: you don't. You must use your judgment. If there are just a few missing observations then remove those rows. That is the most convenient thing to do and should not impact results. But if the missing observations are more than 10-20% of the data set (though this is a very rough, ballparky guideline), and if there are large numbers of predictors, and you are including interactions, then you should consider imputation simply in the interest of data preservation.
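A quick way to inform that judgment is to check the share of missing values in each column. For example:

# Proportion of NAs in each column of homes
round(colMeans(is.na(homes)), 2)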

One caveat is that if the NAs are too extensive then the columns won't contain very much information and imputation won't be very reliable. In such cases consider starting over with data collection or---if there are just a few majority NA columns---removing those that have a high proportion of missings.

The imputation approach you use will in part depend on your modeling goal. For example, if your goal is inference---that is, if your goal is to interpret model coefficients to reason about causal relationships---then you should use multiple imputation. This method creates multiple data sets representing different possibilities for the imputed values. It ensures that the uncertainty inherent in guessing unknown values makes its way appropriately into the standard errors for the model coefficients. If, on the other hand, you need to impute a single value in order to have a complete test set for prediction then you should use single imputation. Here we'll cover only single imputation, since it is the appropriate method for the Kaggle competition.
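For reference, here is a minimal sketch of what multiple imputation looks like using the mice package (loaded in the setup); the settings here are illustrative only, and we won't pursue this approach further in this tutorial.

# Multiple imputation with mice: create 5 imputed data sets,
# fit the model on each, then pool the coefficients and standard errors
imp <- mice(homes, m = 5, printFlag = FALSE)
fits <- with(imp, lm(log(price) ~ city + sqft + bed + bath + pool))
summary(pool(fits))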

There are many packages available in R to do imputation. Most of them are fiddly. One that is straightforward to use for single imputation is the imputeMissings package, which offers several imputation methods, the fastest and simplest of which is imputation with medians (for numeric data) or the mode (for categorical data) using the impute() function.

complete_homes <- impute(homes, method = "median/mode") 

Did it work?

nrow(complete_homes) == sum(complete.cases(complete_homes))

Yes. That was pretty simple.

Multicollinearity

Highly correlated predictors add noise to a regression model without adding information, and can cause instability in standard error and coefficient estimates. This is obviously unwelcome if your goal is inference.

Note that multicollinearity is not a problem for prediction; it will not impact the quality of predictions. But when working with large data sets, removing one of a pair of highly correlated predictors can facilitate faster model fitting. The loss of information is minimal at correlations above .8 or .9.

There are more or less complicated ways of identifying multicollinearity statistically. Various functions in R use what is known as the "variance inflation factor" or VIF to identify correlated predictors for removal. More simply, bivariate correlations can be calculated between all the predictors and then one of each correlated pair above a certain threshold can be removed. (See Chapter 3.3 in the caret documentation for information on how to implement this strategy.) However you identify the correlated predictors, the solution is to remove them.
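Here, for instance, is a sketch of the correlation-threshold approach using caret's findCorrelation() on the numeric predictors in this data:

# Flag numeric predictors whose pairwise correlations exceed .9
num_preds <- complete_homes %>% 
  dplyr::select(sqft, bed, bath, rooms)

findCorrelation(cor(num_preds), cutoff = .9, names = TRUE)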

As noted, the symptoms of multicollinearity for inference include instability in standard errors and coefficients. You will see these estimates go haywire when a correlated predictor is added to the model. For example:

lm(price ~ rooms, data = complete_homes) %>% 
  summary

Observe what happens to the coefficient, SE and p-value for rooms when we add bed:

lm(price ~ rooms + bed, data = complete_homes) %>% 
  summary

The coefficient estimate declines and the SE goes way up. This behavior is due to the correlation between bed and rooms:

complete_homes %>% 
  select(bed, rooms) %>% 
  cor()

Clearly, these two variables contain virtually the same information. To reason accurately about effect sizes and statistical significance we need to remove one or the other. Again, for purposes of prediction, it would be fine to leave both in the model.

Non-linear Data

Price data is typically right skewed, and its relationship with predictors such as square footage is often non-linear. We see both problems in this data set:

ggplot(complete_homes, aes(sqft, price)) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method="lm", se = F) +
  labs(title = "price ~ sqft")

The price variable spans many orders of magnitude, and the relationship between price and sqft is not linear, especially in San Francisco.

ggplot(complete_homes, aes(sqft, price)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(~city) +
  stat_smooth(method="lm", se = F) +
  labs(title = "price ~ sqft, varying by city")

A linear model will have trouble with non-linear data.

Log Transformation

One solution is to log transform some of the variables. Log transformation compresses right skewed data and helps the linear model fit better. The function we typically use in R for log transformation, log(), computes the natural log. (log2() will compute log base 2.) After log transformation of price, a fixed distance on the log scale represents a multiplication (not an addition) of the original value.
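A quick numeric illustration of that last point: equal steps on the log scale correspond to multiplying (not adding) on the original scale.

# Multiplying by 10 adds a constant amount---log(10), about 2.3---on the log scale
log(c(100, 1000, 10000))
diff(log(c(100, 1000, 10000)))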

ggplot(complete_homes, aes(sqft, log(price))) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method="lm", se = F) +
  labs(title = "log(price) ~ sqft")

That doesn't look like an improvement. Remember that our goal in transforming the data is to make the relationship between two variables linear, so as to improve the fit of a linear model. The problem is that sqft is also right skewed. So, let's log transform sqft as well.

ggplot(complete_homes, aes(log(sqft), log(price))) +
  geom_point() +
  theme_minimal() +
  stat_smooth(method="lm", se = F) +
  labs(title = "log(price) ~ sqft")

That looks much better!

What is a logarithm anyway? Think of the Richter scale for earthquake magnitude. Each linear increment---3 to 4, for example---represents a multiplicative increase in size: 10 times larger. This is because the Richter scale uses a log base 10 scale. Basically: $10^3$ = 10 x 10 x 10 = 1000 vs. $10^4$ = 10 x 10 x 10 x 10 = 10,000, which is 10 times larger than $10^3$. The linear increments of the Richter scale are the exponents on 10, in this case 3 vs. 4.
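In R, base 10 logarithms are available with log10() (or log(x, base = 10)):

# The Richter-style exponents: the base 10 logs of 1,000 and 10,000 are 3 and 4
log10(c(1000, 10000))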

In statistics we almost always use the natural log, where the base is not 10 but the mathematical constant $e$. What is $e$? It is approximately 2.718, which we can compute in R as exp(1):

(e <- exp(1)) 

The natural log asks: to what power must $e$ (2.718...) be raised to equal another number? So if we take the natural log of, say, 9, we are asking: to what power must $e$ be raised to equal 9? It must be raised to the power of 2.197...

exp(1) ^ 2.197225 

How do we know that?

log(9)

Key fact: Natural log undoes exponentiation (raising $e$ to a power) and exponentiation undoes natural log. Use this identity: $e^{\ln(x)} = x$. In R: exp(log(x)) = x. Observe:

exp(log(9))
log(exp(9))

Here is what the natural log function looks like:

data.frame(x = seq(1, 1000, by = 1)) %>% 
  mutate(natural_log_x = log(x)) %>% 
  ggplot(aes(x, natural_log_x)) +
  geom_line()

Notice that the natural log values are compressed: the range 1-1000 becomes roughly 0-7. This is important because a natural log transformation will help make outliers less influential in a regression.

And here is the reverse of the natural log, the exponential function:

data.frame(x = seq(1, 10, by = .01)) %>% 
  mutate(exp_x = exp(x)) %>% 
  ggplot(aes(x, exp_x)) +
  geom_line()

A log transformation takes an exponential increase and makes it linear, which is the goal in linear modeling.

data.frame(x = seq(1, 10, by = .01)) %>% 
  mutate(exp_x = exp(x)) %>% 
  ggplot(aes(x, log(exp_x))) +
  geom_line()

Such a transformation will often improve model fit. Why? Think about it this way: when the underlying relationship is multiplicative---each additional percentage increase in square footage adds roughly a constant percentage to price---taking logs turns that multiplicative relationship into an additive, linear one, which is exactly what a linear model assumes.

Will log transforming price improve the model? We can answer this question empirically by comparing models using $R^2$. In the models below I will use shorthand: . means to include all predictors.

No log transformation:

summary(nolog_mod <- lm(price ~ ., data = complete_homes))$r.square
plot(nolog_mod, which = 1)

Terrible! The exponential increase in price shows up in the residuals.

Log transform price (a log model):

summary(log_mod <- lm(log(price) ~ ., data = complete_homes))$r.square
plot(log_mod, which = 1)

Log transform price and sqft (a log-log model):

summary(loglog_mod <- lm(log(price) ~ . -sqft + log(sqft), data = complete_homes))$r.square
plot(loglog_mod, which = 1)

The third model, judged by $R^2$, and the residual plot, is the best.

One thing to keep in mind is that after log transformation the model's fitted values will be expressed in log units. This won't affect $R^2$, since it is calculated from a ratio---$1 - \frac{RSS}{TSS}$---and therefore does not depend on the underlying units. But RMSE will be affected. To calculate RMSE from a log or log-log model in original units (rather than log units) will require exponentiating the fitted values: $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - e^{\hat{y}_i})^2}$.

Here is residual standard deviation, in log units, for the log-log model:

summary(loglog_mod)$sigma

And here is the calculation of RMSE in original units:

rmse <- function(actual, fitted) sqrt(mean((actual - fitted)^2))


rmse(complete_homes$price, exp(fitted(loglog_mod)))

You can see the advantage of computing RMSE in dollars rather than log dollars. The average error of the log-log model expressed in log units is useful only for model comparison, but the error in dollars provides a clear sense of the model's accuracy.

Interpreting a log model

The coefficients in a log or log-log model are expressed in log units, which are hard to connect to real world quantities.

It is always possible to leave the coefficients expressed in log units and interpret them as for a model with an unlogged outcome. For example, the coefficient for log sqft could be interpreted as the change in log price associated with one unit increase in log sqft, holding the other predictors constant. This provides limited insight, however, since log units are not intuitive. So, how would we make the scale of these coefficients easier to understand?

For reference, here are the coefficients for the log model.

log_mod$coefficients

We need to exponentiate coefficients in a log model to get the multiplicative change---and hence the percentage increase---in the outcome associated with a 1 unit increase in the predictor, holding the other predictors constant. Thus, the coefficient for Menlo Park, .317, becomes exp(.317) = 1.373, which is a 37.3% increase compared to the baseline of 1: (1.373 - 1) * 100 = 37.3%. In other words, we see a 37.3% increase in average home prices in Menlo Park compared to the reference city of Berkeley, everything else being equal. As a rule of thumb, we can dispense with exponentiation when the coefficient is close to 0, because exponentiating a small number returns a value very close to 1 plus that number. For example, exp(.02) = 1.02, representing a 2% increase over the baseline of 1. (The calculation goes like this: 1.02 - 1 = .02, and .02 * 100 = 2%.) In this example the rule of thumb works well. But it does not work well for the Menlo Park coefficient, which is further from 0, as we saw above.
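Here is a quick way to do that conversion for all the coefficients at once (the exponentiated intercept is not itself a percentage change and can be ignored here):

# Convert log-model coefficients into percentage changes in price
# associated with a one unit increase in each predictor
round((exp(coef(log_mod)) - 1) * 100, 1)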

For reference, here are the coefficients for the log-log model.

loglog_mod$coefficients

How would coefficients in a log-log model, in which both the outcome and a predictor had been log transformed, be interpreted? Mercifully, we don't need to exponentiate. Instead, we can interpret the coefficient directly as a percentage increase. The coefficient for log(sqft) is .976. This would mean that each 1% increase in square feet is associated with a .976% increase in the price. The other coefficients would be interpreted similarly as percentage increases.

Notice that the signs on the coefficients in this model now make sense, given what we know about average home prices: Oakland has lower prices than Berkeley, but San Francisco and Menlo Park have higher prices.

Outliers

There are some extreme home prices in this data set:

homes %>% 
  arrange(desc(price)) %>% 
  slice(1)

This is an outlier, yes, but it is not---presumably---a data collection error, but simply a very large (and expensive) home.

In general, outliers should not be reflexively removed from a data set. Simply because the model fit is better without an outlier is not a reason to remove it! Instead:

  1. Ask questions about data collection. Is the outlier a mistake, the result of an error?

  2. Ask questions of domain experts to understand other factors influencing home sales. Could a variable be added, or created, that would help explain high prices?

  3. Use log transformations to make outliers less influential.

As a last resort, remove the extreme values, but make sure to clearly indicate that you have done so when presenting results.

Overfitting

Adding predictors to a regression model---increasing model complexity---will generally improve its fit to the data it is trained on. Some complexity is necessary: overly simple models may not describe the actual patterns in the data. As an illustration, consider the following scenarios.

set.seed(123) # set seed for reproducibility
n <- 50 # number of observations
x1 <- rnorm(n, mean = 0, sd = 1) # generate predictor 1
x2 <- rnorm(n, mean = 0, sd = 1) # generate predictor 2
y <- x1 + x1^2 -x2 + rnorm(n, mean = 0, sd = 0.5) # generate response variable
data <- data.frame(x1, x2, y) # combine into a data frame

data$pred1 <- predict(lm(y~x1, data = data))
data$pred2 <- predict(lm(y~x1 + I(x1^2), data = data))


ggplot(data, aes(x1, y))+
geom_point()+
  geom_line(aes(x1, pred1), col = 2) +
  theme_minimal()+
  labs(title= "Regression model: y ~ x1")

Clearly there is a non-linear pattern in the data that could be modeled by adding a quadratic term.

set.seed(123) # set seed for reproducibility
n <- 50 # number of observations
x1 <- rnorm(n, mean = 0, sd = 1) # generate predictor 1
x2 <- rnorm(n, mean = 0, sd = 1) # generate predictor 2
y <- x1 + x1^2 -x2 + rnorm(n, mean = 0, sd = 0.5) # generate response variable
data <- data.frame(x1, x2, y) # combine into a data frame

data$pred1 <- predict(lm(y~x1, data = data))
data$pred2 <- predict(lm(y~x1 + I(x1^2), data = data))

ggplot(data, aes(x1, y))+
geom_point()+
  geom_line(aes(x1, pred2), col = 2) +
  theme_minimal()+
  labs(title= "Regression model: y ~ x1 + x1 * x1")

# summary(lm(y ~ x1, data = data))$r.square
#  summary(lm(y ~ x1 + I(x1^2), data = data))
#  
#  rmse <- function(y, pred) sqrt(mean((y-pred)^2))
# rmse(data$y, data$pred2)

R-squared for the first linear model was .42, which has increased to .65 with the addition of the quadratic term. In this case additional model complexity seems warranted to capture the non-linear pattern in the data. RMSE for this model is 1.07.

However, inspired by this success, we may be drawn to creating even more complex models:

set.seed(123) # set seed for reproducibility
n <- 50 # number of observations
x1 <- rnorm(n, mean = 0, sd = 1) # generate predictor 1
x2 <- rnorm(n, mean = 0, sd = 1) # generate predictor 2
y <- x1 + x1^2 -x2 + rnorm(n, mean = 0, sd = 0.5) # generate response variable
data <- data.frame(x1, x2, y) # combine into a data frame

data$pred1 <- predict(lm(y~x1, data = data))
data$pred2 <- predict(lm(y~x1 + I(x1^2), data = data))
model <- lm(y~x1 + I(x1^2) + I(x1^3)+ I(x1^4)+ I(x1^5)+ I(x1^6)+ I(x1^7)+ I(x1^8)+ I(x1^9)+ I(x1^10) +
                           I(x1^11) + I(x1^12)+ I(x1^13)+ I(x1^14)+ I(x1^15)+ I(x1^16)+ I(x1^17)+ I(x1^18), data = data)
data$pred3 <- predict(model)


ggplot(data, aes(x1, y))+
geom_point()+
  geom_line(aes(x1, pred3), col = 2) +
  theme_minimal()+
  labs(title= "Regression model of y with 20th degree polynomial")

# summary(lm(y~x1 + I(x1^2) + I(x1^3)+ I(x1^4)+ I(x1^5)+ I(x1^6)+ I(x1^7)+ I(x1^8)+ I(x1^9)+ I(x1^10) +
#                           I(x1^11) + I(x1^12)+ I(x1^13)+ I(x1^14)+ I(x1^15)+ I(x1^16)+ I(x1^17)+ I(x1^18), data = data))

#  rmse <- function(y, pred) sqrt(mean((y-pred)^2))
# rmse(data$y, data$pred3)

Our efforts are rewarded with an even higher R-squared: .76. RMSE has of course also declined---down to .87.

But is this a worthwhile achievement? Not really. It should be apparent that the basic pattern in the data is a curve. Additional model complexity, represented by the squiggles of the regression line, is very likely driven by individual data points. Notice that the model fits some data points exactly. This is an example of overfitting. The model has become overly complex: it fits the data points in a single sample so accurately---too accurately---that it will actually be quite inaccurate with new data. Complexity is a good thing in statistical modeling, but only up to a certain point. To see this, consider predictions from this model using inputs from a new sample from the same population:

set.seed(123) # set seed for reproducibility
n <- 50 # number of observations
x1 <- rnorm(n, mean = 0, sd = 1) # generate predictor 1
x2 <- rnorm(n, mean = 0, sd = 1) # generate predictor 2
y <- x1 + x1^2 -x2 + rnorm(n, mean = 0, sd = 0.5) # generate response variable
data <- data.frame(x1, x2, y) # combine into a data frame

data$pred1 <- predict(lm(y~x1, data = data))
data$pred2 <- predict(lm(y~x1 + I(x1^2), data = data))
model <- lm(y~x1 + I(x1^2) + I(x1^3)+ I(x1^4)+ I(x1^5)+ I(x1^6)+ I(x1^7)+ I(x1^8)+ I(x1^9)+ I(x1^10) +
                           I(x1^11) + I(x1^12)+ I(x1^13)+ I(x1^14)+ I(x1^15)+ I(x1^16)+ I(x1^17)+ I(x1^18), data = data)

set.seed(130) # set seed for reproducibility
n <- 50 # number of observations
x1_new <- rnorm(n, mean = 0, sd = 1) # generate predictor 1
x2_new <- rnorm(n, mean = 0, sd = 1) # generate predictor 2
y_new <- x1_new + x1_new^2 -x2_new + rnorm(n, mean = 0, sd = 0.5) # generate response variable
data_new <- data.frame(x1_new, x2_new, y_new) # combine into a data frame
data_new$pred <- predict(model, newdata= data_new)

ggplot()+
geom_point(data = data_new, aes(x1_new, y_new))+
  geom_line(data =data_new, aes(x1_new, pred), col = 2) +
  theme_minimal()+
  labs(title= "Predictions from a 20th degree polynomial model using new data")


# rmse <- function(y, pred) sqrt(mean((y-pred)^2))
# rmse(data_new$y_new, data_new$pred)

This new data displays the same basic curve. But the model, expecting the same data points on which it was trained, does not generalize well and yields a much worse RMSE: 3.67. What looked like spectacular performance falls apart when the model is used with new data.
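To see the contrast directly, here is a sketch that compares out-of-sample RMSE for the simpler quadratic model and the high-degree polynomial on the new sample; the quadratic model should generalize far better.

# Refit the quadratic model on the original sample and compare RMSE on the new sample
quad_model <- lm(y ~ x1 + I(x1^2), data = data)

rmse(data_new$y_new, predict(quad_model, newdata = data.frame(x1 = data_new$x1_new)))
rmse(data_new$y_new, data_new$pred)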

Bias-Variance Trade-off

The bias-variance trade-off refers to the tension between two important properties of a model: bias and variance. Bias is error that comes from a model being too simple to capture the true pattern in the data (underfitting). Variance is error that comes from a model being so flexible that it fits the idiosyncrasies of the particular sample it was trained on (overfitting), so that its estimates change substantially from sample to sample.

It is not possible to create a model that has both very low bias and very low variance. Hence the trade off. The best we can do is manage the tension between bias and variance by avoiding overfitting and creating models with appropriate complexity. How would we know how much complexity is appropriate or when a model begins overfitting? Cross-validation.

Cross-validation

The language that we've been using---training and testing---is the language of cross-validation. When we train a model on one sample of data (training data) and then evaluate its performance by predicting the target on another sample (testing data) we are doing cross-validation. Note that the testing data must contain the target; model evaluation then consists of comparing the model's predictions with the observed target. Ideally, performance on the training and testing data---we call this "in-sample" and "out-of-sample" performance---should be similar. The typical pattern is that both in-sample and out-of-sample performance will improve as a model becomes more complex---think of this as the addition of appropriate complexity. The reduction in bias outweighs any increase in variance. Eventually, however, overfitting will occur, the tell-tale sign of which is that in-sample performance continues to improve while out-of-sample performance gets worse.

There are many different forms of cross-validation.

Validation set method.

This is the simplest cross-validation method. It involves splitting the available data into two separate sets: a training set and a testing or validation set. A 70/30 train/test split is common. The training set is then used to train the model, while the validation set is used to evaluate its performance.

A variation of the validation set method is to split the data into three parts: train, test, and holdout. This is, in effect, what you will be using for the Kaggle competition; your submitted predictions will be evaluated on the holdout set and summarized on the leaderboard.

One disadvantage of the validation set method is that it can result in high variance, as the model's performance may be sensitive to the random partitioning of the data. Other cross-validation techniques such as k-fold cross-validation reduce this problem.

k-fold cross-validation

K-fold cross-validation involves splitting the available data set into k equal-sized parts or "folds". In each iteration, one fold of the data is used as the validation set or test set while the other folds are together used as the training set. This process is then repeated k times, with each fold being used as the validation set exactly once. The performance metric for each iteration of k-fold cross-validation---for example, RMSE or R-squared--- is recorded and averaged across all k iterations to obtain an estimate of the model's performance on new data.

A common value for k is 10. However, while this number ensures relatively low variance in performance estimates, it can also be time-consuming, since the model must be fit 10 times. Five-fold cross-validation can be a good compromise.
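For reference, here is a minimal sketch of 5-fold cross-validation using caret's trainControl() and train() functions with the complete_homes data (note that RMSE will be reported in log units for this model):

# 5-fold cross-validation of the log-log model with caret
cv_control <- trainControl(method = "cv", number = 5)

cv_loglog <- train(log(price) ~ city + bed + bath + log(sqft) + pool,
                   data = complete_homes,
                   method = "lm",
                   trControl = cv_control)

# Average out-of-fold RMSE and R-squared across the 5 folds
cv_loglog$results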

Cross validation example

Here are the steps for the validation set method.

  1. Split the data into train and test.
  2. Fit the model on train.
  3. Calculate in-sample model performance.
  4. Use the model to predict the target for the test set.
  5. Calculate out-of-sample model performance.

We will demonstrate using the complete_homes data.

  1. Split the data into train and test.
# set the seed for reproducibility
set.seed(123)

# randomly select 70% of the row indices of the data frame
index <- sample(1:nrow(complete_homes), size = .7*nrow(complete_homes), replace = F)

# subset the data with the selected row indices to create the training set
train <- complete_homes[index, ]

# subset the data with the remaining row indices to create the testing set
test <- complete_homes[-index, ]
  2. Fit the model using the train data.

As we've seen, log transforming both price and sqft improves the fit.

# Create a linear regression model with log transformed price and sqft
loglog_model <- lm(log(price) ~ city + bed + bath + log(sqft) + pool, data = train)

summary(loglog_model)
  3. Calculate in-sample model performance.

We must remember to exponentiate the fitted values (expressed in log dollars) in order to return dollar units.

# Create a function to calculate RMSE
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Calculate RMSE
rmse(train$price, exp(predict(loglog_model))) 
  4. Use the model to predict the target for the test set.
  5. Calculate out-of-sample model performance.
# Predict for the test data
predictions <- predict(loglog_model, newdata = test)

# Calculate RMSE
rmse(test$price, exp(predictions))

The RMSE is modestly higher on the test set than on the training set but not enough to worry about overfitting.


