208 Data Analysis

## Do not delete this!
## It loads the s20x library for you. If you delete it 
## your document may not compile
library(s20x)

Question 1

A real estate agent in Saratoga, New York, wishes to investigate how the sale price of houses is affected by the size of the house. In particular, what is the effect on price of an additional 20 $m^2$ of living area? We also want to compare prices for houses with living areas categorised as small and large, and to estimate the expected house price for each of these two groups. She has compiled data from a random sample of 112 recent house sales in the city.

The dataset is stored in Houses.csv and includes variables:

| Variable | Description |
|-------------|-------------------------------------------------------|
| price | sale price of house, in US dollars |
| livingArea | size of the living area of the house, in square metres |
| livingSpace | a factor classifying the size of the living area as either small if less than 170 square metres or large if greater |
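The livingSpace factor can be reproduced from livingArea. The following is a minimal sketch, assuming the 170 square metre cut-off described above and using a small made-up vector of areas (hypothetical values, not rows from Houses.csv). The description does not say how a house of exactly 170 square metres is classified, so assigning it to large here is an assumption.

```r
# Hypothetical living areas in square metres (illustrative only, not from Houses.csv)
livingArea <- c(95, 169.9, 170, 210, 340)

# "small" if below 170 square metres, otherwise "large"
# (the treatment of exactly 170 is an assumption)
livingSpace <- factor(ifelse(livingArea < 170, "small", "large"),
                      levels = c("small", "large"))

# Count how many houses fall in each group
table(livingSpace)
```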

Disclaimer: Before you rush off to Saratoga to buy a house, this is an old data set. I'm afraid house prices have gone up a lot since this data was collected.

Instructions:

Comment on the two plots of the data.
Fit an appropriate model to the data. Check the model assumptions.
Plot the data on the log scale with your appropriate model superimposed over it.
Plot the data on the original scale with your appropriate model superimposed over it.
Write appropriate Methods and Assumption Checks.
Write an appropriate Executive Summary.

Question of interest/goal of the study

We wish to investigate how the prices of houses in Saratoga, New York, are affected by the size of the house. In particular, what is the effect on price of an additional 20 $m^2$ of living area?

Inspect the data: livingArea as an explanatory variable

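# Load the copy of the data that ships with the s20x package
load(system.file("extdata", "houses.df.rda", package = "s20x"))

# Or, with Houses.csv in the working directory, read it directly
houses.df = read.csv("Houses.csv", header = TRUE, stringsAsFactors = TRUE)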
plot(price~livingArea, houses.df, main="Price versus Living Area")
plot(log(price)~livingArea, houses.df, main="log(Price) versus Living Area")


Comment on the two plots of the data.

There is an increasing relationship between house price and living area. The first plot shows that the scatter increases for higher values of living area, but log-transforming the response variable in the second plot gives a relationship that looks reasonably linear with constant scatter.
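Scatter that fans out like this is what multiplicative errors produce, which is why logging the response tends to stabilise it. The simulation below is a sketch on made-up data (the intercept, slope and noise level are arbitrary assumptions, not estimates from the Saratoga data). It compares the residual spread in the upper and lower halves of living area on each scale; a ratio near 1 indicates roughly constant scatter.

```r
set.seed(1)

# Hypothetical living areas and prices with multiplicative errors:
# log(price) is linear in area with constant-variance noise
area  <- runif(200, 50, 350)
price <- exp(11 + 0.005 * area + rnorm(200, sd = 0.3))

# Residuals from a straight-line fit on the raw and on the log scale
raw.res <- resid(lm(price ~ area))
log.res <- resid(lm(log(price) ~ area))

# Ratio of residual spread: larger houses versus smaller houses
upper  <- area > median(area)
ratios <- c(raw = sd(raw.res[upper]) / sd(raw.res[!upper]),
            log = sd(log.res[upper]) / sd(log.res[!upper]))
ratios
```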

Fit model and check assumptions

houses.fit1 <- lm(price~livingArea, houses.df)
modelcheck(houses.fit1)

# Log the response variable and refit the model:
houses.fit2 <- lm(log(price)~livingArea, houses.df)
modelcheck(houses.fit2)
summary(houses.fit2)
confint(houses.fit2)
# back transform
exp(confint(houses.fit2))

# Extract second row of CI output only.
exp(confint(houses.fit2)[2,])

# % change 100*(value-1)
100*(exp(confint(houses.fit2)[2,])-1)

# scale by 20 and THEN back transform
exp(confint(houses.fit2)[2,]*20)
# % change 100*(value-1)
100*(exp(confint(houses.fit2)[2,]*20)-1)

conf1=as.data.frame(t(100*(exp(confint(houses.fit2)[2,]*20)-1)))
resultStr1 = paste0(sprintf("%.1f", conf1$`2.5 %`), " and ", sprintf("%.1f", conf1$`97.5 %`))
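The order of operations in the last steps matters: the slope is multiplied by 20 on the log scale and only then back-transformed. The sketch below illustrates this with a hypothetical confidence interval for the slope (the bounds 0.004 and 0.006 are invented for illustration and are not the fitted values from houses.fit2).

```r
# Hypothetical 95% CI for the slope on the log scale (NOT the fitted values)
slope.ci <- c(lower = 0.004, upper = 0.006)

# Effect of an extra 20 square metres: scale on the log scale, then back-transform
mult.20 <- exp(20 * slope.ci)    # multiplicative change in median price
pct.20  <- 100 * (mult.20 - 1)   # the same change expressed as a percentage

# Back-transforming first and then multiplying by 20 is not equivalent
wrong <- 20 * 100 * (exp(slope.ci) - 1)

round(rbind(pct.20, wrong), 2)
```

Percentage changes compound rather than add, so the correctly scaled interval is slightly wider and higher than the naive one.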

Plot the data on the log scale with your appropriate model superimposed over it

plot(log(price)~livingArea, houses.df, main="log(Price) versus Living Area")
abline(houses.fit2)

Plot the data on the original scale with your appropriate model superimposed over it

plot(price~livingArea, houses.df, main="Price versus Living Area")
# Back-transform the fitted line from the log scale to dollars
lines(50:350, exp(coef(houses.fit2)[1] + coef(houses.fit2)[2]*50:350))
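The same back-transformed curve can be obtained with predict() instead of writing out the coefficients by hand. This is a sketch on simulated data standing in for houses.df; the names sim.df, sim.fit, grid and curve.hat are hypothetical, and the simulated values are not the Saratoga data.

```r
set.seed(2)

# Simulated stand-in for houses.df (hypothetical values)
sim.df <- data.frame(livingArea = runif(100, 50, 350))
sim.df$price <- exp(11 + 0.005 * sim.df$livingArea + rnorm(100, sd = 0.3))
sim.fit <- lm(log(price) ~ livingArea, data = sim.df)

# Predict on the log scale over a grid of areas, then back-transform to dollars
grid <- data.frame(livingArea = 50:350)
curve.hat <- exp(predict(sim.fit, newdata = grid))

plot(price ~ livingArea, sim.df, main = "Price versus Living Area (simulated)")
lines(grid$livingArea, curve.hat)
```

Using predict() keeps the plotting code unchanged if further terms are later added to the model.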

Methods and assumption checks

We have one numeric explanatory variable, so we fitted a simple linear regression model to the data. However, there was clear evidence of scatter increasing with living area, so we log-transformed the response variable, price.

After logging price, the residuals looked much better. Normality looks good and no influential points were detected. We have a random sample, so the independence assumption is satisfied. Model assumptions are satisfied.

Our model is: $\log(price_i) = \beta_0 + \beta_1 \times livingArea_i + \epsilon_i$, where $\epsilon_i \sim iid\ N(0,\sigma^2)$.
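The percentage interpretation used in the Executive Summary follows from this model. Because the errors are symmetric about zero on the log scale, exponentiating the linear predictor gives the median price, and adding 20 $m^2$ of living area multiplies that median by a constant factor:

$$
\begin{aligned}
\text{median}(price \mid livingArea = x) &= \exp(\beta_0 + \beta_1 x), \\
\frac{\text{median}(price \mid livingArea = x + 20)}{\text{median}(price \mid livingArea = x)} &= \exp(20\,\beta_1), \\
\text{percentage change} &= 100\,\bigl(\exp(20\,\beta_1) - 1\bigr).
\end{aligned}
$$

The ratio does not depend on $x$, which is why a single percentage can be quoted for every 20 $m^2$ increase.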

Our model explained 53.5% of the variability in the logged data.

Executive Summary.

We investigated how the prices of houses in Saratoga, New York, are affected by house size.

We found strong evidence that the median house price increases as the size of the living area increases. Furthermore, this relationship is exponential: the larger the living area, the bigger the dollar increase in median price for each additional square metre.

We estimate that the median house price increases by between `r resultStr1[1]`% for every 20 $m^2$ increase in living area.


s20x documentation built on Jan. 14, 2026, 9:07 a.m.


s20x
Functions for University of Auckland Course STATS 201/208 Data Analysis

Case Study 6.5: STATS 201/8 Extra Case Study - Log-linear model
In s20x: Functions for University of Auckland Course STATS 201/208 Data Analysis
