Part 1: Explanatory analysis of the dataset

Importing the data set

The dataset we will work on is from Canada, and consists of 3987 observations on the following 5 variables:

We will store this data set in a SLID dataframe, and remove all rows with incomplete data:

library(car)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID),]

a) Analyze relationships between some of the variables

First, let's plot a diagnostic plot of all the variables:

library(GGally)
ggpairs(data = SLID)

We can now make some remarks regarding the relationship between wages and some of the explanatory variables.

Effect of education on wages

There is correlation of 0.306 between education and wages, meaning that you would expect more educated workers to earn more.

Looking at the scatter plot, it shows that there is a much greater spread in education among low paid workers, while better wages require a minimum level of education in most cases.

With other words, a high degree of education does not guarantee greater wages, while high wages require higher education.

PS: Much of the same can be said on the effect of age on wages.

Effect of sex on wages

Men earn on average more than women, as low paid jobs are over-represented by women, and high paid jobs are over-represented by men. The extremal values (minimum and maximum wages) are though approximately equal among the sexes.

Assumptions of the MLR analysis

Some assumptions must be made regarding the data if we are to perform a successful multiple linear regression analysis on the data.

We construct a classical linear model of the form: $$ \bf{Y} = \bf{X} \beta + \bf{\varepsilon} $$ Let's explain the notation:

Now onto the assumptions we must make regarding this classical linear model.

1) $\text{E}[\varepsilon] = \bf 0$

The sum of error terms must converge to zero as $n$ approaches infinity.

2) $\mathrm{Cov}(\varepsilon) = \mathrm{E}(\varepsilon\varepsilon^T) = \sigma^2 \bf I$

Error terms must be completely independent and have identical variances.

3) $\mathrm{rank}(\bf{X}) = k + 1 = p$

$\bf X$ must have full rank, i.e. no column should be a linear combination of the other columns. We must also have $p <= n$. In most cases we have $p << n$.

In addition, if we want a classical normal linear regression model, we must assume:

4) $\varepsilon \sim \text{N}_n(\bf 0, \sigma^2 \bf{I})$

The error terms must be normally distributed, in addition to the earlier assumptions (above).



JakobGM/mylm documentation built on May 7, 2019, 2:27 p.m.