Part 4: Testing the mylm package

library(car)
library(mylm)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID),]

In the following section the mylm function will be employed to three different model suggestions. To evaluate the fit of these models we'll print the summary of the models to discuss the p-values of the parameters, as well as the residual plots to see if there are any clear trends in the residuals, or any discrepancies from the MLR assumptions.

Model 1: sex, age, language and education^2

model4a <- mylm(formula = wages ~ sex + age + language + I(education^2), data=SLID)
summary(model4a)

From the summary of the model fit above the first thing to note is that the intercept is negative, implying that the "base case" (all covariates set equal to 0) has negative wage. This may be considered non-intuitive, and an indication that the model fit is perhaps not the best.

What is the base case in this model? A female, English speaking newborn with no education. You could therefore argue that the non-intuitive intercept is acceptable, as this is not in the domain of interest.

Further, the p-values for sex, age and education^2 indicate that all these parameters have a significant effect on wages, while the effect from language is not quite as clear.

Finally, note that the effect from education is aggregated, as the effect of wages from an increase in education will be quadratic rather than linear. There is no clear indication why this type of response is better than a linear response.

Residual analysis

plot(model4a)

In the plot of the residuals a clear trend is visible. From low fitted values the residuals are small, while they increase with larger fitted values. In addition, it seems that the residuals are somewhat asymmetric, with larger positive values than negative. This further support the claim that the model fit is not particularly good, as these observations do not coincide with the assumptions made regarding the distribution of $\varepsilon$.

Improvements

A suggested improvement to this model is to not use education as a quadratic parameter, but rather use it as a linear one. Using education as a linear parameter has shown success in earlier model attempts above, and there is no clear quadratic trend to observe in the wage-education distribution shown in part 1.

Model 2: age, language and interaction term

Now we model the effect of language and age on wage, including interaction effects.

model4b <- mylm(formula = wages ~ language*age, data=SLID)
summary(model4b)

The first thing to notice from the summary of the second model, is that the only two parameters that are clearly significant is the intercept term and the effect from age. The language terms, as well as the interaction effect, are not clearly significant, although the effect of languageFrench and age:languageFrench would be included with a significance level of 5%.

Residual analysis

plot(model4b)

The plot of the residuals exhibit a similar trend as the previous model, namely that the residuals increase with larger fitted values. In addition, it also seems in this case as if the residuals are asymmetric, the positive residuals are considerably larger in absolute value than the negative. The residuals plot should, if the model assumptions are in fact correct, be normally distributed around the $x$-axis. One can with some confidence say that this is not the case for this model, as the residuals exhibits some non-random behaviour.

Improvements

Language doesn't seem to be significant and hasn't been significant in previous models either. So we suggest removing language altogether and replacing it with for example education.

Model 3: education without intercept

Now we fit a new model with education as covariate and without an intercept.

model4c <- mylm(formula = wages ~ -1 + education, data=SLID)
summary(model4c)

First thing to notice from this model is that, as there is now interaction term, the only coefficient, education, is highly significant. The statistics does not make sense in this case, and can therefore be discarded. In all the previous models the interaction term has been significantly different from 0, indicating that it should be included - there is no indication it should be dropped.

Residual analysis

plot(model4c)

The residual plot seem to have a decreasing trend with the fitted values, as all residuals are larger than zero fro small fitted values, while the majority seem to be negative for larger fitted values. This residual plot does also exhibit asymmetric behaviour, similar to the previous models.

Improvement

An improvement to this model could be to include an intercept, for the reasons mentioned above.



JakobGM/mylm documentation built on May 7, 2019, 2:27 p.m.