In JakobGM/mylm:

Part 1: Explanatory analysis of the dataset

Importing the data set

The dataset we will work on is from Canada, and consists of 3987 observations on the following 5 variables:

wages, composite hourly wage rate from all jobs
education, number of years in schooling
age, in years
sex, Male or Female
language, English, French or Other

We will store this data set in a SLID dataframe, and remove all rows with incomplete data:

library(car)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID),]

a) Analyze relationships between some of the variables

First, let's plot a diagnostic plot of all the variables:

library(GGally)
ggpairs(data = SLID)

We can now make some remarks regarding the relationship between wages and some of the explanatory variables.

Effect of `education` on `wages`

There is correlation of 0.306 between education and wages, meaning that you would expect more educated workers to earn more.

Looking at the scatter plot, it shows that there is a much greater spread in education among low paid workers, while better wages require a minimum level of education in most cases.

With other words, a high degree of education does not guarantee greater wages, while high wages require higher education.

PS: Much of the same can be said on the effect of age on wages.

Effect of `sex` on `wages`

Men earn on average more than women, as low paid jobs are over-represented by women, and high paid jobs are over-represented by men. The extremal values (minimum and maximum wages) are though approximately equal among the sexes.

Assumptions of the MLR analysis

Some assumptions must be made regarding the data if we are to perform a successful multiple linear regression analysis on the data.

We construct a classical linear model of the form: $$ \bf{Y} = \bf{X} \beta + \bf{\varepsilon} $$ Let's explain the notation:

We have $n$ observations and $k$ covariates.
Denote $p := k + 1$, as we will include an intercept parameter.
$\bf Y$ is a $n \times 1$ column vector containing respective response values for each observation.
$\bf X$ is a $n \times p$ matrix, containing individual observations on each row, and respective covariates on each column. The first column contains solely 1s, as this is the intercept "covariate".
$\bf \beta$ is $p \times 1$ column vector containing model coefficient parameters $[\beta_0, \beta_1, ..., \beta_k]^T$.
$\varepsilon$ is the error term, being a $n \times 1$ column vector.

Now onto the assumptions we must make regarding this classical linear model.

1) $\text{E}[\varepsilon] = \bf 0$

The sum of error terms must converge to zero as $n$ approaches infinity.

2) $\mathrm{Cov}(\varepsilon) = \mathrm{E}(\varepsilon\varepsilon^T) = \sigma^2 \bf I$

Error terms must be completely independent and have identical variances.

3) $\mathrm{rank}(\bf{X}) = k + 1 = p$

$\bf X$ must have full rank, i.e. no column should be a linear combination of the other columns. We must also have $p <= n$. In most cases we have $p << n$.

In addition, if we want a classical normal linear regression model, we must assume:

4) $\varepsilon \sim \text{N}_n(\bf 0, \sigma^2 \bf{I})$