208 Data Analysis

knitr::opts_chunk$set(fig.height=3)

## Do not delete this!
## It loads the s20x library for you. If you delete it 
## your document may not compile
library(s20x)

Question 1

A real estate agent in Saratoga, New York, wishes to investigate how the sale price of houses is affected by the size of the house. She has compiled data from a random sample of 112 recent house sales in the city. For this question, we want to compare prices for houses with living areas categorised as small and large, and also estimate the expected house price for each of these two groups.

The dataset is stored in Houses.csv and includes variables:

Variable | Description
------------|-------------------------------------------------------
price | sale price of house, in US dollars
livingArea | size of the living area of the house, in square metres
livingSpace | a factor classifying the size of the living area as either small if less than 170 square metres or large if greater

Disclaimer: Before you rush off to Saratoga to buy a house, this is an old data set. I'm afraid house prices have gone up a lot since this data was collected.

Instructions:

Comment on the two plots of the data.
Comment on why we log the response variable for this data.
Fit an appropriate model to the data. Check the model assumptions.
Write appropriate Methods and Assumption Checks.
Write an appropriate Executive Summary.

Question of interest/goal of the study

We want to compare prices for houses with living areas categorised as small and large, and estimate the expected house price for each group.

Inspect the data: livingSpace as an explanatory variable

load(system.file("extdata", "houses.df.rda", package = "s20x"))

houses.df=read.csv("Houses.csv",header=TRUE, stringsAsFactors = TRUE)
plot(price~livingSpace, houses.df, main="Price versus Living Area",horizontal=TRUE)
summaryStats(price~livingSpace, houses.df)

plot(log(price)~livingSpace, houses.df, main="log(Price) versus Living Area",horizontal=TRUE)
summaryStats(log(price)~livingSpace, houses.df)

boxplot(price~livingSpace, houses.df, main="Price versus Living Area",horizontal=TRUE)
summaryStats(price~livingSpace, houses.df)

boxplot(log(price)~livingSpace, houses.df, main="log(Price) versus Living Area",horizontal=TRUE)
summaryStats(log(price)~livingSpace, houses.df)

Comment on the two plots of the data.

House prices are higher and more varied for houses with large living spaces than for houses with small living spaces. We can see this in the summary statistics as well. After logging the data, the variability is fairly similar for the two groups, with the centre still being higher for the group with larger living spaces.

Comment why we log the response variable for this data

As the variability between the groups is not constant, with the group with the larger centre having more than double the standard deviation of the other group, we need to solve the issue of unequal variability, so logging the data is the best choice.

We also have financial data, so interpreting the results as percentage changes makes more sense than interpreting them as absolute differences.

The data is right skewed, but this is not the major concern for this particular data set as the skewness isn't too extreme and we have a sufficiently large sample size for the Central Limit Theorem to come into play.
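
The point about unequal variability can be illustrated with simulated data. The sketch below does not use Houses.csv: the group centres, the log-scale standard deviation and all object names are assumptions chosen purely for illustration. It shows that when prices vary multiplicatively, the group with the larger centre has the larger spread on the raw scale, while the spreads are roughly equal on the log scale.

```r
# Simulated data for illustration only (not the Houses.csv data).
set.seed(1)
n <- 500
group <- factor(rep(c("Large", "Small"), each = n))

# Hypothetical log-scale centres; the same log-scale SD in both groups.
mu <- ifelse(group == "Large", log(250000), log(130000))
price <- exp(rnorm(2 * n, mean = mu, sd = 0.4))

# Raw scale: the group with the larger centre also has the larger spread.
sd_raw <- tapply(price, group, sd)

# Log scale: the spreads are roughly equal.
sd_log <- tapply(log(price), group, sd)

sd_raw["Large"] / sd_raw["Small"]   # ratio of raw-scale SDs
sd_log["Large"] / sd_log["Small"]   # ratio of log-scale SDs
```

A raw-scale ratio well above 1 alongside a log-scale ratio close to 1 is the pattern that motivates fitting the model to `log(price)`.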

# Log the response variable and fit the model:
houses.fit1 = lm(log(price)~livingSpace, houses.df)
modelcheck(houses.fit1)
summary(houses.fit1)
confint(houses.fit1)
# back transform
exp(confint(houses.fit1))

# Extract second row of CI output only.
exp(confint(houses.fit1)[2,])

# % change 100*(value-1)
100*(exp(confint(houses.fit1)[2,])-1)

# % Change CI for opposite direction (Large - Small), so negative confidence interval before back transform
100*(exp(-confint(houses.fit1)[2,])-1)


# Rotate factor to get intercept to give estimates when area = small
houses.df=within(houses.df, {livingSpaceR=factor(livingSpace,levels=c("Small","Large"))})
houses.fit2 = lm(log(price)~livingSpaceR, houses.df)
summary(houses.fit2)
confint(houses.fit2)
# back transform
exp(confint(houses.fit2))

conf1 = as.data.frame(t(abs(100*(exp(confint(houses.fit1)[2,])-1))))
resultStr1 = paste0(sprintf("%.1f%%", conf1$`97.5 %`), " and ", sprintf("%.1f%%", conf1$`2.5 %`))

# % Change CI for opposite direction (Large - Small), so negative confidence interval before back transform
conf2=as.data.frame(t(100*(exp(-confint(houses.fit1)[2,])-1)))
resultStr2 = paste0(sprintf("%.1f%%", conf2$`97.5 %`), " and ", sprintf("%.1f%%", conf2$`2.5 %`))

conf3 = as.data.frame(exp(confint(houses.fit1)))
resultStr3 = sprintf("$%s and $%s",
                    format(round(conf3$`2.5 %`,-3), big.mark = ",", trim = TRUE),
                    format(round(conf3$`97.5 %`,-3), big.mark = ",", trim = TRUE)
                    )

conf4 = as.data.frame(exp(confint(houses.fit2)))
resultStr4 = sprintf("$%s and $%s",
                    format(round(conf4$`2.5 %`,-3), big.mark = ",", trim = TRUE),
                    format(round(conf4$`97.5 %`,-3), big.mark = ",", trim = TRUE)
                    )
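
The sign flip used above for the opposite direction can be checked on simulated data. The sketch below is an illustration under assumed values (the data frame `sim.df`, its column names and the group means are hypothetical, not taken from Houses.csv). It compares the negated confidence interval from one fit with the interval obtained by refitting with the other level as the baseline.

```r
# Simulated data for illustration only (not the Houses.csv data).
set.seed(2)
sim.df <- data.frame(
  space = factor(rep(c("Large", "Small"), each = 50)),
  logPrice = c(rnorm(50, mean = 12.4, sd = 0.4),
               rnorm(50, mean = 11.8, sd = 0.4))
)

# Baseline "Large": the slope is the Small - Large difference in mean log price.
fitA <- lm(logPrice ~ space, data = sim.df)
ciA <- confint(fitA)[2, ]

# Baseline "Small": the slope is the Large - Small difference.
sim.df$spaceR <- relevel(sim.df$space, ref = "Small")
fitB <- lm(logPrice ~ spaceR, data = sim.df)
ciB <- confint(fitB)[2, ]

# Negating the first interval and swapping its ends reproduces the second.
all.equal(unname(rev(-ciA)), unname(ciB))

# Percentage changes in the median, in each direction.
100 * (exp(ciA) - 1)        # Small relative to Large
100 * (exp(rev(-ciA)) - 1)  # Large relative to Small
```

Note that negating an interval swaps its ends, so after negation the element labelled `97.5 %` holds the lower bound. This appears to be why the result strings above print the `97.5 %` element first.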

Method and Assumption Checks

We have a single grouping explanatory variable with two levels, so have fitted a linear model with a single dummy variable to the data. Due to the large differences in variability between the two groups, we logged the response variable, price.

After logging price, the residuals looked much better. Normality looks good and no influential points were detected. We have a random sample, so the independence assumption is satisfied. Model assumptions are satisfied.

Our model is: $\log(price_i) = \beta_0 + \beta_1 \times LivingSpaceSmall_i + \epsilon_i,$

where $LivingSpaceSmall_i = 1$ if the $i$th house has a small living area and is 0 otherwise, and $\epsilon_i \sim iid ~N(0,\sigma^2)$.

Alternatively, our model is: $\log(price_{ij}) =\mu_i + \epsilon_{ij}$, where $\mu_1$ is the mean log price if the house has a small living space, $\mu_2$ is the mean log price if the house has a large living space, and $\epsilon_{ij} \sim iid ~N(0,\sigma^2)$.
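
The back-transformations used for the percentage statements follow from this model. As a short derivation, assuming the baseline level is the large group (which is what a dummy variable for the small group implies): the log prices are Normal and hence symmetric, so the mean of log price equals the median of log price, and exponentiating preserves medians.

$$
\begin{aligned}
\text{median}(price \mid \text{large}) &= e^{\beta_0}, \\
\text{median}(price \mid \text{small}) &= e^{\beta_0 + \beta_1}, \\
\frac{\text{median}(price \mid \text{small})}{\text{median}(price \mid \text{large})} &= e^{\beta_1}, \\
\text{percentage change} &= 100\,(e^{\beta_1} - 1).
\end{aligned}
$$

Because $x \mapsto 100(e^{x} - 1)$ is increasing, applying it to the endpoints of the confidence interval for $\beta_1$ gives a confidence interval for the percentage change in the median.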

Our model explained 36.7% of the variation in the data.

Executive Summary.

We investigated how the prices of houses in Saratoga, New York, are affected by house size.

We found strong evidence that the median house price is greater for houses with large living areas (greater than 170 square metres) than for those with small living areas.

We estimate that the median house price for houses with small living areas (less than 170 square metres) was between `r resultStr1[1]` smaller than that for houses with large living areas.

Alternatively: We estimate that the median house price for houses with large living areas (greater than 170 square metres) was between `r resultStr2[1]` more than that for houses with small living areas.

We estimate that the median house price for: