knitr::opts_chunk$set(echo = TRUE)
#install.packages("car")
#install.packages("GGally")
library(car)
library(GGally)
library(mylm)

Part 1: Exploratory analysis

We will work on the dataset SLID from the carData package. After removing observations with missing values, it consists of 3987 observations on the following 5 variables: wages (composite hourly wage rate), education (years of schooling), age (in years), sex (a factor with levels Female and Male) and language (a factor with levels English, French and Other).

data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID), ]  # keep only complete observations
n <- nrow(SLID)                       # number of observations
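A quick look at the data structure confirms the variable types and the number of observations:

str(SLID)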

Using the function ggpairs from the GGally package, we get the following diagnostic plot matrix:

ggpSLID <- ggpairs(SLID)
print(ggpSLID, progress = FALSE)

The plot suggests that wages increase with both education and age, even though education itself decreases slightly with age. It also seems that males receive higher wages than females at similar levels of age and education. There is some relationship between language and age, but it is not clear whether there is a corresponding relationship between language and wages.

When performing a multiple linear regression analysis to study how wages depend on the explanatory variables, we need to assume that the model is linear in the parameters, $Y = X\beta + \varepsilon$, and that the errors $\varepsilon_i$ have zero mean, constant variance $\sigma^2$ and are uncorrelated. For the inference below we additionally assume normality, $\varepsilon \sim N(0, \sigma^2 I)$.

Part 2: Simple linear regression with the mylm package

a)

model1 <- mylm(wages ~ education, data = SLID)
print(model1)

model1b <- lm(wages ~ education, data = SLID)
print(model1b)
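As a quick sanity check, the two sets of estimates can be compared directly. This sketch assumes that mylm stores its estimates in a coefficients component, mirroring lm; adjust the component name if the implementation differs.

# Compare the mylm and lm estimates numerically (assumes a `coefficients`
# component in the mylm object, as in lm).
all.equal(unname(model1$coefficients), unname(coef(model1b)))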

b)

summary(model1)

c)

plot(model1)

From the plot it is not clear whether the residuals depend on the fitted values. While the mean of the residuals is very close to zero, the distribution seems skewed, with more variation in the positive direction.
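To quantify the skew, the residuals can be summarised numerically; here we use the reference lm fit model1b from a), since it is guaranteed to expose them via residuals():

# A maximum much larger in magnitude than the minimum indicates right skew.
summary(residuals(model1b))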

d)

The variability in a linear regression can be decomposed into the following quantities. We use the following naming convention, which may differ from other texts (in particular, here SSE denotes the explained sum of squares and SSR the residual sum of squares):

\begin{align}
SSE &= \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \\
SSR &= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
SST &= \sum_{i=1}^{n}(y_i - \bar{y})^2 = SSE + SSR \\
R^2 &= 1 - \frac{SSR}{SST}
\end{align}

where $\bar{y}$ is the mean of the responses and $\hat{y}_i$ is the $i$-th fitted value.
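To make the decomposition concrete, the three quantities can be computed directly in base R from the lm fit model1b of Part 2 a):

y    <- SLID$wages
yhat <- fitted(model1b)            # fitted values
SSE  <- sum((yhat - mean(y))^2)    # explained sum of squares (our convention)
SSR  <- sum((y - yhat)^2)          # residual sum of squares
SST  <- sum((y - mean(y))^2)       # total sum of squares
all.equal(SST, SSE + SSR)          # the decomposition holds
1 - SSR / SST                      # R^2, matches summary(model1b)$r.squared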

The residual sum of squares (SSR) for this model is `r sprintf('%.0f', model1$ssr)`. The model has `r n - model1$p` degrees of freedom, which is obtained by subtracting the number of estimated parameters from the number of observations. The total sum of squares (SST) is `r sprintf('%.0f', model1$sst)`. For a model with so many degrees of freedom we can use the limiting distribution

\begin{equation}
(p-1)F = (p-1)\frac{SSE/(p-1)}{SSR/(n-p)} \approx \chi_{p-1}^2,
\end{equation}

which gives a value of `r sprintf('%.2f', (n - model1$p) * model1$sse / model1$ssr)`. For a $\chi_{1}^2$ distribution, whose mean is 1, this is a very extreme value; hence the regression is clearly significant. A $\chi^2_k$ distribution is the sum of the squares of $k$ independent standard normal random variables. In our case $k = 1$, which means that the $\chi^2$-statistic is simply the square of the $z$-statistic. The critical value for a $\chi_{1}^2$ test at the 5% significance level is `r sprintf('%.2f', qchisq(0.95, 1))`, while the two-sided $z$-test has critical values $\pm 1.96$.
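The statistic and the corresponding asymptotic p-value can be computed directly from the quantities stored in the mylm fit (model1$sse, model1$ssr and model1$p, as used above):

chisq_stat <- (n - model1$p) * model1$sse / model1$ssr     # (p-1)F
pchisq(chisq_stat, df = model1$p - 1, lower.tail = FALSE)  # asymptotic p-value
qchisq(0.95, df = model1$p - 1)                            # 5% critical value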

e)

The coefficient of determination, $R^2$, shows how much of the variance in the response is explained by the model. This gives some information about the model's goodness of fit. In our case $R^2 \approx 0.09$, which means a high proportion of the variance in the data is not accounted for by model1; this indicates that a model with additional covariates is worth considering.

Part 3: Multiple linear regression

a)

model2 <- mylm(wages ~ education + age, data = SLID)
summary(model2)

b)

From the summary above we see that the intercept and both the regression coefficients are significantly different from zero according to the $z$-test, which has critical values $\pm1.96$.
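For reference, the z-statistics are simply the estimates divided by their standard errors. A minimal sketch using the corresponding lm fit, whose coefficient table layout is documented (mylm's internal layout may differ):

fit2 <- lm(wages ~ education + age, data = SLID)
ctab <- coef(summary(fit2))
ctab[, "Estimate"] / ctab[, "Std. Error"]  # z-statistics (lm reports them as t-values)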

c)

model2c1 <- mylm(wages ~ education, data = SLID)
model2c2 <- mylm(wages ~ age, data = SLID)
summary(model2c1)
summary(model2c2)

The estimates differ from those of the bigger model because each covariate must on its own explain more of the response than in the bigger model. We know from the plot in Part 1 that age and education are correlated, so changing one covariate in a simple model acts similarly to changing both covariates in the bigger model. For uncorrelated covariates we would expect the same parameter estimates from the simple and the multiple linear regression.
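The correlation can be checked directly; any nonzero value explains why the single-covariate estimates differ from those of the joint model:

cor(SLID$age, SLID$education)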

Part 4: Testing mylm

Now we can test our function mylm on any linear model.

model4a <- mylm(wages ~ sex + age + language + I(education^2), data = SLID)
model4b <- mylm(wages ~ age + language + language * age, data = SLID)
shiftedlanguage <- relevel(SLID$language, ref = 2)  # French becomes the reference level
model4bnew <- mylm(wages ~ age + shiftedlanguage + shiftedlanguage * age, data = SLID)
model4c <- mylm(wages ~ education - 1, data = SLID)  # model without an intercept
summary(model4a)
plot(model4a)
summary(model4b)
plot(model4b)
summary(model4c)
plot(model4c)

All the coefficients in both model4a and model4c are significant according to the $z$-test. In model4b the interaction between age and language is barely significant for French speakers, and not at all for the other languages. By releveling so that French is the reference, we expose the interaction between English and age, which is significant. It is notable that the coefficients for language change a lot between model4a and model4b; the coefficient for speaking French even changes sign between the two models. The interpretation is that the seemingly higher wages of French speakers are better explained by their sex and education. model4a could possibly be improved by transforming the age covariate, for example by taking its logarithm; model4b might be improved by adding education as a covariate; and model4c could be improved by including the intercept.
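As a sketch, these suggested refinements could be fitted as follows; none of them are validated here, and we assume mylm handles the same formula syntax as lm (which the fits above suggest):

model4a_log <- mylm(wages ~ sex + log(age) + language + I(education^2), data = SLID)   # log-transformed age
model4b_edu <- mylm(wages ~ age + language + language * age + education, data = SLID)  # add education
model4c_int <- mylm(wages ~ education, data = SLID)                                    # same model with an intercept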


