The dataset we will work on is from Canada, and consists of 3987 observations on the following 5 variables:
wages: composite hourly wage rate from all jobs
education: number of years of schooling
age: in years
sex: Male or Female
language: English, French or Other
We will store this data set in a SLID data frame, and remove all rows with incomplete data:
library(car)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID), ]
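To see what the filtering step actually removes, a minimal sketch like the following can help. It assumes the carData package is installed, and the names `n_before`, `n_after` and `SLID_complete` are introduced here purely for illustration.

```r
# Load the raw data (assumes the carData package is installed)
data(SLID, package = "carData")

# Count missing values per variable before filtering
colSums(is.na(SLID))

# Keep only the rows where every variable is observed
n_before <- nrow(SLID)
SLID_complete <- SLID[complete.cases(SLID), ]
n_after <- nrow(SLID_complete)

# Number of rows before and after removing incomplete cases
c(before = n_before, after = n_after)
```

The per-variable counts show which variables are responsible for most of the dropped rows.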
First, let's draw a pairs plot of all the variables as a diagnostic:
library(GGally)
ggpairs(data = SLID)
We can now make some remarks regarding the relationship between wages and some of the explanatory variables.
education on wages
There is a positive correlation of 0.306 between education and wages, meaning that you would expect more educated workers to earn more.
The scatter plot shows a much greater spread in education among low-paid workers, while higher wages require a minimum level of education in most cases.
In other words, a high level of education does not guarantee higher wages, but high wages usually require more education.
PS: Much the same can be said about the effect of age on wages.
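The correlations discussed above can also be computed directly rather than read off the plot. This is a small sketch assuming the carData package is installed; it repeats the loading step so that it runs on its own, and the names `cor_education` and `cor_age` are illustrative.

```r
# Load and filter the data as in the text (assumes carData is installed)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID), ]

# Pearson correlation between wages and each numeric covariate
cor_education <- cor(SLID$wages, SLID$education)
cor_age <- cor(SLID$wages, SLID$age)

round(c(education = cor_education, age = cor_age), 3)
```

Note that a correlation coefficient only measures linear association, so it cannot by itself reveal the uneven spread seen in the scatter plot.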
sex on wages
Men earn more than women on average, as women are over-represented in low-paid jobs and men in high-paid jobs. The extreme values (minimum and maximum wages) are, however, approximately equal for both sexes.
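One way to check these remarks numerically is to summarise wages separately for each sex. The sketch below assumes the carData package is installed and uses an illustrative name, `wage_summary`, that does not appear in the original analysis.

```r
# Load and filter the data as in the text (assumes carData is installed)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID), ]

# Mean, minimum and maximum wage for each sex;
# the result has one column per sex and one row per statistic
wage_summary <- sapply(
  split(SLID$wages, SLID$sex),
  function(w) c(mean = mean(w), min = min(w), max = max(w))
)

round(wage_summary, 2)
```

Comparing the `mean` row with the `min` and `max` rows shows whether the difference lies in the bulk of the distribution rather than in its extremes.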
Some assumptions about the data must hold if we are to perform a meaningful multiple linear regression analysis.
We construct a classical linear model of the form:
$$ \mathbf{Y} = \mathbf{X} \beta + \varepsilon $$
Let's explain the notation:
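$\mathbf{Y}$ is the $n \times 1$ vector of responses.
$\mathbf{X}$ is the $n \times p$ design matrix, with one row per observation and one column per covariate. Its first column consists of 1s, as this is the intercept "covariate".
$\beta$ is the $p \times 1$ vector of regression coefficients.
$\varepsilon$ is the $n \times 1$ vector of error terms.
Now on to the assumptions we must make regarding this classical linear model.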
1) $\text{E}[\varepsilon] = \mathbf{0}$
Each error term must have expected value zero, so that $\text{E}[\mathbf{Y}] = \mathbf{X} \beta$ and the model makes no systematic error.
2) $\mathrm{Cov}(\varepsilon) = \mathrm{E}(\varepsilon\varepsilon^T) = \sigma^2 \mathbf{I}$
Error terms must be uncorrelated with each other and have identical variances (homoscedasticity).
3) $\mathrm{rank}(\mathbf{X}) = k + 1 = p$
$\mathbf{X}$ must have full column rank, i.e. no column may be a linear combination of the other columns. Here $k$ is the number of covariate columns excluding the intercept. Full column rank requires $p \leq n$; in most cases we have $p \ll n$.
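The rank condition can be checked numerically once a design matrix has been built. The following is a sketch only: it assumes the carData package is installed and that all four covariates enter the model additively, which is an illustrative choice and not necessarily the model the analysis goes on to use.

```r
# Load and filter the data as in the text (assumes carData is installed)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID), ]

# Design matrix for an additive model (an assumed model formula);
# factor covariates are expanded into dummy columns
X <- model.matrix(wages ~ education + age + sex + language, data = SLID)

n <- nrow(X)  # number of observations
p <- ncol(X)  # number of columns, including the intercept

# The first column is the intercept column of 1s
all(X[, 1] == 1)

# Full column rank: the numerical rank of X equals p
qr(X)$rank == p

c(n = n, p = p)
```

Because the factors `sex` and `language` are dummy-coded, $p$ counts columns of $\mathbf{X}$, not variables in the data set.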
In addition, if we want a classical normal linear regression model, we must assume:
4) $\varepsilon \sim \text{N}_n(\mathbf{0}, \sigma^2 \mathbf{I})$
The error terms must be jointly normally distributed, in addition to satisfying assumptions 1) and 2). Under normality, uncorrelated error terms are also independent.
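The error terms themselves are never observed, so in practice these assumptions are assessed through the residuals of a fitted model. The sketch below shows one common way to do this. It assumes the carData package is installed and uses an additive model with all four covariates, which is an illustrative choice; `fit` and `res` are names introduced here.

```r
# Load and filter the data as in the text (assumes carData is installed)
data(SLID, package = "carData")
SLID <- SLID[complete.cases(SLID), ]

# Fit an additive linear model (an assumed model formula)
fit <- lm(wages ~ education + age + sex + language, data = SLID)
res <- residuals(fit)

# With an intercept, least squares forces the residuals to average to zero,
# so this value does not by itself verify assumption 1)
mean(res)

# Residuals against fitted values: a funnel shape suggests
# non-constant variance, i.e. a violation of assumption 2)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest
# a departure from normality, i.e. a violation of assumption 4)
qqnorm(res)
qqline(res)
```

These plots are informal checks; they indicate whether an assumption looks implausible rather than proving that it holds.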