knitr::opts_chunk$set(echo = TRUE)
library(car)
data(SLID, package="car")
SLID <- SLID[complete.cases(SLID), ]

library(tidyverse)
library(ggplot2)
library(GGally)

library(mylm)

Part 1: Exploratory analysis of the dataset (0.5 points)

Below is a scatterplot matrix of the continuous variables wages, education and age, colored by the categorical variable sex. First, the density plots on the diagonal show that the sample is roughly equally split between the two sexes with respect to age and education, but the sampled women generally earn less than the sampled men. Second, the wage-education scatterplot (as well as the wage-education correlation) indicates that women have more to gain in terms of earnings by acquiring higher education. Similarly, the wage-age scatterplot hints that men's wages increase more with age than women's do.

ggpairs(SLID, columns=1:3, mapping=aes(color=sex, alpha=0.9))

The next plot is colored by language rather than sex. We see that, in the given sample, the median years of education is approximately 12.5 years for each language; however, there is greater variance among the non-English speakers (and the most variance in the "other" category).

ggpairs(SLID, columns=1:3, mapping=aes(color=language, alpha=0.9))

Lastly, we have included bar charts of the two categorical variables against each other. It is apparent that the sample includes roughly equal numbers of male and female speakers of each language. However, there are many more English speakers than French or other-language speakers.

ggpairs(SLID, columns=4:5)

Inherent in any linear model is the assumption that the mean of the response variable is determined as a linear function of the covariates. Moreover, the responses need to be independently sampled with equal variance, and the covariates linearly independent in the sense that no covariate can be expressed as a linear combination of the others. As such, we keep these assumptions in mind when performing our following (multiple) linear regression.

Part 2: Linear regression with the mylm package (3.5 points)

a) Fitting our regression model amounts to computing $\hat\beta = (X^T X)^{-1} X^T Y$ and $\hat Y = X\hat\beta$. Here is the model fitted using our mylm function:

model1 <- mylm(wages ~ education, data=SLID); print(model1)

And using R's lm function:

model1b <- lm(wages ~ education, data = SLID); print(model1b)
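As a quick sanity check (a minimal base-R sketch, independent of mylm; the names X, Y and beta.hat are local to this chunk), we can solve the normal equations directly and compare the result with lm's estimates:

# Solve (X^T X) beta = X^T Y directly and compare with lm's coefficients
X <- model.matrix(wages ~ education, data = SLID)
Y <- SLID$wages
beta.hat <- drop(solve(t(X) %*% X, t(X) %*% Y))
cbind(normal.equations = beta.hat, lm = coef(model1b))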

b) We see (in both outputs) that $\hat\beta_1 > 0$, indicating that education is positively associated with hourly wage. That is, the estimated average hourly wage increases by 0.7923 CAD/hr for every additional year of schooling.

The estimated covariance matrix is $\widehat{\text{Cov}}(\hat\beta) = \hat\sigma^2(X^T X)^{-1}$, where $\hat\sigma^2 = \frac{\hat\epsilon^T\hat\epsilon}{n - p}$, $\hat\epsilon = Y - \hat Y$, and $X$ is the $n\times p$ design matrix. The residual standard error is $\hat\sigma = \sqrt{\hat\sigma^2}$. The individual coefficients are $\hat\beta_i = e_i^T\hat\beta$, with standard errors $\sqrt{\hat\sigma^2(X^TX)^{-1}_{ii}}$, and the asymptotic $p$-value for $\hat\beta_i$ is $P\left(|Z| > \frac{|\hat\beta_i|}{\sqrt{\hat\sigma^2 (X^TX)^{-1}_{ii}}}\right)$ with $Z \sim N(0,1)$. Below is a summary of our model:

summary(model1)
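The following sketch (plain base R, not part of mylm; all names are local to this chunk) reproduces these summary quantities, namely the standard errors and the asymptotic $z$-based $p$-values, directly from the design matrix:

# Estimated covariance matrix of beta-hat, standard errors and z-based p-values
X <- model.matrix(wages ~ education, data = SLID)
Y <- SLID$wages
n <- nrow(X); p <- ncol(X)
XtX.inv    <- solve(t(X) %*% X)
beta.hat   <- drop(XtX.inv %*% t(X) %*% Y)
eps.hat    <- Y - X %*% beta.hat
sigma2.hat <- sum(eps.hat^2) / (n - p)        # SSE / (n - p)
cov.beta   <- sigma2.hat * XtX.inv            # estimated Cov(beta-hat)
se         <- sqrt(diag(cov.beta))            # standard errors
z          <- beta.hat / se
cbind(estimate = beta.hat, std.error = se, p.value = 2 * pnorm(abs(z), lower.tail = FALSE))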

c)

plot(model1)

d) The residual sum of squares $\text{SSE} = \hat\epsilon^T \hat\epsilon$, which satisfies $\frac{\text{SSE}}{\sigma^2} \sim \chi^2_{n-p}$ on $n-p$ degrees of freedom, and the total sum of squares SST are:

sprintf("SSE: %.1f, SST: %.1f", model1$sse, model1$sst)

Testing the significance of the model is done using the fact that $\frac{SST - SSE}{SSE/(n-p)} \xrightarrow{\mathcal D} \chi^2_{p-1}$. In the last row of the summary above, we see both the $\chi^2_{p-1}$-statistic and the associated p-value, which suggests that the model is significant.
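The same test can be reproduced by hand (a short sketch; model1$n and model1$p are assumed to be the number of rows and columns of the design matrix, as used by the AIC computation in Part 5):

# Chi-square statistic for the overall significance of model1, computed from SST and SSE
chi2.stat <- (model1$sst - model1$sse) / (model1$sse / (model1$n - model1$p))
c(statistic = chi2.stat,
  p.value = pchisq(chi2.stat, df = model1$p - 1, lower.tail = FALSE))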

Relationship between the $\chi^2$- and $z$-statistics

In simple linear regression, the $t$- (or in our asymptotic case, $z$-) statistic is calculated as follows:

$$Z = \frac{\hat \beta_1 - 0}{\sqrt{c_{11}} \hat \sigma},$$ where $c_{11} = (X^TX)^{-1}_{11}$ (here "11" refers to the second row and second column, since we index from 0).

The $F$- (or in our asymptotic case, $\chi^2$-) statistic is calculated as follows:

$$X = \frac{SSR/(p-1)}{SSE/(n-p)} = \frac{SSR}{SSE/(n-2)},$$ since $p = 2$ in simple linear regression.

Claim: We first claim that $X = Z^2$.

Proof:

We have $$(X^TX) = \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}$$

Therefore,

$$(X^TX)^{-1} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}$$

And so we have that $$c_{11} = \frac{n}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{1}{\sum(x_i - \bar x)^2}$$

As such, we have that $$Z^2 = \frac{\hat \beta_1^2 \sum(x_i - \bar x)^2}{\hat \sigma^2} = \frac{\hat \beta_1^2 \sum(x_i - \bar x)^2}{SSE/(n-2)}$$

So to prove $Z^2 = X$, it suffices to show that $SSR = \hat \beta_1^2 \sum(x_i - \bar x)^2$.

Well, consider $SSR$:

$$SSR = \sum(\hat y_i - \bar y)^2 = \sum(\hat \beta_0 + \hat \beta_1x_i - \bar y)^2$$

And since $\hat \beta_0 = \bar y - \hat \beta_1 \bar x$, we have that

$$SSR = \sum(\bar y - \hat \beta_1 \bar x + \hat \beta_1 x_i - \bar y)^2 = \hat \beta_1^2\sum(x_i - \bar x)^2$$

And we are done. Therefore, $Z^2 = X$.

End of proof

Now, consider any random variables $A \sim F(1, n-2)$ and $B \sim t_{n-2}$.

Then $A \xrightarrow{\mathcal D} \chi^2(1)$ and $B \xrightarrow{\mathcal D} N(0,1)$ as $n \rightarrow \infty$.

Moreover, $B^2 \sim F(1, n-2)$, so $B^2$ and $A$ have the same distribution (in fact for every $n$, not only asymptotically).

Let $p_Z$ and $p_X$ be the p-values obtained from testing the null hypothesis: $\beta_1 = 0$ using the $Z$ and $X$ statistics respectively.

Claim: $p_Z = p_X$.

Proof:

Let $\tilde Z \sim N(0,1)$ and $\tilde X \sim \chi^2(1)$; then $\tilde Z^2 \sim \chi^2(1)$, i.e. $\tilde Z^2$ and $\tilde X$ have the same distribution. Hence

$$p_Z = P(|\tilde Z| > |Z|) = P(\tilde Z^2 > Z^2) = P(\tilde Z^2 > X) = P(\tilde X > X) = p_X.$$

End of proof

As we see, there is a precise relationship between the $\chi^2$- and $z$-statistics: the $\chi^2$-statistic is the square of the $z$-statistic, and the two tests yield the same p-value for the null hypothesis $\beta_1 = 0$.
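We can illustrate this numerically (a sketch; the $t$-statistic reported by lm coincides with the $z$-statistic defined above, and model1$n and model1$p are assumed to be the dimensions of the design matrix):

# The squared z-statistic for education equals the chi-square statistic, and the
# corresponding asymptotic p-values coincide.
z.stat    <- coef(summary(model1b))["education", "t value"]
chi2.stat <- (model1$sst - model1$sse) / (model1$sse / (model1$n - model1$p))
c(z.squared = z.stat^2, chi.square = chi2.stat)
c(p.z = 2 * pnorm(abs(z.stat), lower.tail = FALSE),
  p.chi2 = pchisq(chi2.stat, df = 1, lower.tail = FALSE))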

Under a significance level of $\alpha = 0.05$, we may calculate the critical values of the $\chi^2$- and $z$-statistics as follows in R:

alpha <- 0.05
chisq.crit <- qchisq(p = alpha, df = 1, lower.tail = FALSE)

# p_Z = P(|Z| > |z|) = P(Z > |z|) + P(Z < -|z|)
# Therefore, we must calculate |z| such that P(Z > |z|) = P(Z < -|z|) = alpha/2
z.crit <- qnorm(p = alpha/2, lower.tail = FALSE, mean = 0, sd = 1) 

print("Critical value for chi-square:")
print(chisq.crit)

print("Critical value for z")
print(z.crit)

Statistical theory of ANOVA

Our sequential ANOVA is built iteratively: we start by fitting the intercept-only model $M_0$ and the full model $M_{full}$. The model $M_i$ is then constructed by adding the covariate $x_i$ to the model $M_{i-1}$. We test the hypothesis that the two models $M_i$ and $M_{i-1}$ fit the data equally well with the following asymptotically $\chi^2_{df(M_{i-1})-df(M_{i})}$-distributed statistic:

$$ X_i = \frac{(SSE_{M_{i-1}} - SSE_{M_i}) / (df(M_{i-1})-df(M_{i}))}{SSE_{full} / df(M_{full})}$$
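A single step of this procedure can be sketched with base R's lm (deviance() returns the residual sum of squares of an lm fit); here education is added to the intercept-only model, with wages ~ education + age playing the role of the full model:

# One sequential-ANOVA step: does adding education improve on the intercept-only model?
m0    <- lm(wages ~ 1,               data = SLID)
m1    <- lm(wages ~ education,       data = SLID)
mfull <- lm(wages ~ education + age, data = SLID)
df.diff <- df.residual(m0) - df.residual(m1)
x1 <- ((deviance(m0) - deviance(m1)) / df.diff) / (deviance(mfull) / df.residual(mfull))
c(statistic = x1, p.value = pchisq(x1, df = df.diff, lower.tail = FALSE))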

e) The coefficient of determination $R^2 = \frac{SSR}{SST}$ (where $SSR = Y^T(H - \frac{1}{n}\mathbf{1}\mathbf{1}^T)Y$ and $SST = SSR + SSE$) measures the fraction of the response variance captured by our regression model. It lies between 0 and 1, and larger values indicate a better fit.

f) In simple linear regression, $R^2$ is the square of Pearson's correlation coefficient between the covariate and the response, and also equals $\text{Corr}(\hat Y, Y)^2$.
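A quick numerical check of e) and f) for the simple model from Part 2 (a sketch; model1$yhat is assumed to hold the fitted values, as used for model5a in Part 5):

# R^2 = SSR/SST should equal the squared correlation between fitted and observed wages
R2 <- (model1$sst - model1$sse) / model1$sst
c(R.squared = R2,
  corr.squared = cor(as.numeric(model1$yhat), SLID$wages)^2)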

Part 3: Multiple linear regression (2 points)

a) We fit the desired model, and print the fitted values:

model2 <- mylm(wages ~ education + age, data=SLID); print(model2)

b) The summary of our model (below) contains the desired information. All coefficients appear significant, and both education and age are positively associated with wages.

summary(model2)

c) Fitting a third model having only age as covariate allows us to compare the three models.

model1age <- mylm(wages ~ age, data=SLID); print(model1age)

As we can see, the estimated coefficients are not the same across the different models. If the columns of the design matrix $X$ corresponding to age and education were orthogonal, the coefficient estimates would remain unchanged across the models. In general, however, this is unlikely to happen, especially since we would expect age and education to be correlated; a quick check follows.
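A one-line check of the sample correlation between the two covariates (base R):

# Sample correlation between age and education in the (complete-case) SLID data
cor(SLID$age, SLID$education)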

Part 4: Analysis of variance (ANOVA) (2 points)

a) In the output below, dummy coding was used for the categorical variables sex and language.

model3a<-mylm(wages~sex+language, data=SLID,
              contrasts=list(language="contr.treatment",sex="contr.treatment"))
summary(model3a)

Interpretation:

$\beta_0$ represents the average hourly wage of English-speaking females. The intercept is highly significant, judging by the corresponding p-value.

$\beta_1$ represents the additional average hourly wage of males compared to females, holding language constant. This covariate is also highly significant.

$\beta_2$ represents the difference in average wage between French speakers and English speakers, all else held constant. This covariate, however, is not significant at level $\alpha = 0.05$.

$\beta_3$ represents the difference in average wage between speakers of other languages and English speakers. This covariate is not significant at level $\alpha = 0.05$ either.

In the output below, effect coding was used for the categorical variables sex and language.

model3b<-mylm(wages~sex+language, data=SLID,contrasts = list(language="contr.sum",sex="contr.sum"))
summary.mylm(model3b)

The interpretation of the effect-coded coefficients $\beta_1, \beta_2, \beta_3$ differs from the dummy-coded ones in that they represent differences relative to the grand mean, not to the "baseline" English-speaking female.

The significant covariates are the same as with dummy-coding.
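To make the two codings concrete, the contrast matrices for a three-level factor such as language can be displayed directly (base R; each row corresponds to a factor level, each column becomes a column of the design matrix):

contr.treatment(3)   # dummy coding: the first level acts as the reference
contr.sum(3)         # effect coding: each column sums to zero over the levels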

b) The ANOVA table below, with dummy coding, shows the sequential reduction in the residual sum of squares ($SSE$) as each term of the regression formula is added in turn.

anova.mylm(model3a)

We observe the following:

The covariate sex is significant, indicating a difference in mean wages between males and females.

On the other hand, there is an insignificant difference in mean wages among the English, French and other-language speakers once sex is already in the model. Hence we may choose not to add language as a covariate.

The ANOVA output that follows is that of the effect-coded covariates. The results of this ANOVA are the same as those for the dummy-coded variables.

anova.mylm(model3b)

Now we fit a model with the covariates in the opposite order:

anova.mylm(mylm(wages~language+sex,data=SLID))

The difference is due to the covariates being added in a different order in the two ANOVAs. Generally, a covariate added to an intercept-only model will yield a smaller p-value than the same covariate added to a larger model, since the added covariate is generally not orthogonal (as a column of $X$) to the previously added ones.

c) We fit the two desired models containing the interaction term, which allows us to fit different slopes for age vs wages in the two sex groups.

I) DUMMY CODING

model4a<-mylm(wages~sex+age+sex*age, data=SLID,contrasts=list(sex="contr.treatment"))
summary(model4a)

Interpretation:

$\beta_0$ represents the (extrapolated) average wage of females at age zero.

$\beta_1$ represents the reduction in (extrapolated) average wage at age zero for males relative to females, i.e. the difference in intercepts.

$\beta_2$ represents the average wage increase per year of age for females.

$\beta_3$ represents the additional average wage increase per year of age for males relative to females.

The estimates are all significant at $\alpha =0.05$.

II) EFFECT CODING

model4b<-mylm(wages~sex+age+sex*age, data=SLID,contrasts = list(sex="contr.sum"))
summary.mylm(model4b)

Interpretation of the result

The differences between dummy and effect coding are as previously explained. Again, the estimates are all significant at $\alpha =0.05$.

Now, we'll plot the regression lines for different values of sex:

ggplot(SLID, aes(x=age, y=wages, col=sex)) + geom_point(alpha=0.5) +
            geom_smooth(method="lm", se=TRUE)

As we can see, the regression lines appear to be different, with wages increasing with age more among males than among females. However, there is still considerable overlap between the two groups. This naturally motivates a more formal test of hypotheses.

To test more formally whether these slopes are the same, we can test the null hypothesis $\beta_3 = 0$; under this null hypothesis, the slope of wages against age is the same for both sex groups. In the summary above, the p-value for $\beta_3$ is very small, so we reject the null hypothesis.
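For comparison (a sketch using base R only), the exact F-test counterpart of this asymptotic test can be obtained by comparing the nested lm fits with and without the interaction:

# F-test of the sex:age interaction via nested lm fits
anova(lm(wages ~ sex + age, data = SLID),
      lm(wages ~ sex * age, data = SLID))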

Part 5: Testing the mylm package (2 points)

We fit the 3 requested models.

model5a <- mylm(wages ~ sex + age + language + I(education^2), data=SLID) # red
model5b <- mylm(wages ~ language + age + language*age, data=SLID)         # blue
model5c <- mylm(wages ~ education - 1, data=SLID)                         # green

Looking at their summaries (omitted for brevity), language is insignificant in both of the first two models, as might be expected from the earlier ANOVA. The largest coefficient by far (by an order of magnitude) is sexMale in the first model, indicating that wages depend more strongly on sex than on education or age.

Computing $AIC = n\log(\frac{SSE}{n}) + 2(p+1)$ for the three models also favors the first, bigger model.

aic <- function(model) { model$n * log(model$sse / model$n) + 2 * (model$p + 1) }  # n log(SSE/n) + 2(p+1)
sprintf("AIC 5a: %.3f, 5b: %.3f, 5c: %.3f", 
        aic(model5a),
        aic(model5b),
        aic(model5c))
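As a cross-check (a sketch fitting the same formulas with lm), base R's AIC() differs from the formula above only by an additive constant that is identical for all three models (they share the same $n$), so the ranking should agree:

# AIC via base R's AIC() on the corresponding lm fits
c(AIC(lm(wages ~ sex + age + language + I(education^2), data = SLID)),
  AIC(lm(wages ~ language + age + language*age,         data = SLID)),
  AIC(lm(wages ~ education - 1,                         data = SLID)))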

To better understand the behaviour of the models, we may plot the fitted values against the studentized residuals $r_i^* = \frac{\hat\epsilon_i}{\hat\sigma_{(i)}\sqrt{1 - h_{ii}}}$:

s <- 0.3
a <- 0.5

# check homosk. of residuals vs fitted values
ggplot(data.frame(xa=model5a$yhat, ya=model5a$studentepsilon, 
                  xb=model5b$yhat, yb=model5b$studentepsilon,
                  xc=model5c$yhat, yc=model5c$studentepsilon)) +
  geom_point(aes(x=xa, y=ya), size=s, color="red", alpha=a) + geom_smooth(aes(x=xa, y=ya), method="loess", color="red") +
  geom_point(aes(x=xb, y=yb), size=s, color="blue", alpha=a) + geom_smooth(aes(x=xb, y=yb), method="loess", color="blue") +
  geom_point(aes(x=xc, y=yc), size=s, color="green", alpha=a) + geom_smooth(aes(x=xc, y=yc), method="loess", color="green") +
  labs(x="Fitted values", y="Studentized Residuals")

The only model for which $E(\hat\epsilon) = 0$ seems to hold is the first one (red). Moreover, judging by the blue curve, the model containing only language and age (and their interaction) is missing covariates; there is evident structure in its residuals, which adding sex and education alleviates. The residuals are heteroskedastic for all of the fitted models, with a clear pattern of increasing spread as $\hat y$ grows; a natural remedy is to model the logarithm of the response in order to counteract this, which we now do:

model5d <- mylm(log(wages) ~ age + sex + sex*age + education, data=SLID)
ggplot(data.frame(x=model5d$yhat, y=model5d$studentepsilon), aes(x=x, y=y)) +
  geom_point() + geom_smooth(method="loess")

(Language is excluded since it is insignificant; the interaction between age and sex is important from earlier analysis.)

It appears that model5d obeys the assumption of mean-0 homoskedastic (studentized) residuals.

# AIC of the log-transformed model, with residuals computed on the original wage scale
res5d <- exp(model5d$y) - exp(model5d$yhat)
sse5d <- as.numeric(t(res5d) %*% res5d)
aic5d <- model5d$n * log(sse5d / model5d$n) + 2 * (model5d$p + 1)
sprintf("AIC 5d: %.3f", aic5d)

The computed $AIC$ is roughly the same as that of model5a, and model5d performs far better in the QQ-plots below:

# qq-plot may be used with untreated residuals; only tests normality
# (not uncorrelation nor mean-zero)
set.seed(1)
par(mar=c(4,4,1,1),mfrow=c(1,2))
qqnorm(model5a$epsilon, main="model5a"); qqline(model5a$epsilon, col="red")
qqnorm(model5b$epsilon, main="model5b"); qqline(model5b$epsilon, col="red")

qqnorm(model5c$epsilon, main="model5c"); qqline(model5c$epsilon, col="red")
qqnorm(model5d$epsilon, main="model5d"); qqline(model5d$epsilon, col="red")

We have found other models (using different transformations of the covariates) that improve $R^2_{adj}$ and $AIC$; however, model5d seems most faithful to the assumptions of the linear model. Another consideration is the interpretability of the model, which suffers whenever covariates (or responses) are transformed; in this sense, model5d is parsimonious. Furthermore, judging by the QQ-plot above, its residuals are nearly normally distributed, and by the previous plot its studentized residuals are nearly homoskedastic. The tools for comparing models rely first and foremost on the model assumptions being satisfied, which justifies picking model5d among the models considered.

All R code from R/mylm.R

project.dir <- getwd()
cat(readLines(file.path(project.dir, "R/mylm.R")) , sep = "\n")

