knitr::opts_chunk$set(echo = TRUE)
s_ex04p01_data_path <- "https://charlotte-ngs.github.io/asmss2022/data/asm_bw_flem.csv"

Problem 1: Overfitting

Use the extended dataset on Body Weight of animals and fit all the variables and the factor breed. Compare the result with a regression that uses only Breast Circumference or with the linear model that only uses the factor Breed. The data set is available from: r s_ex04p01_data_path

Solution

s_ex04p01_data_path <- "https://charlotte-ngs.github.io/asmss2022/data/asm_bw_flem.csv"
tbl_ex04p01_data <- readr::read_csv(file = s_ex04p01_data_path)
lm_ex04p01_full <- lm(formula = `Body Weight` ~ `Breast Circumference` + BCS + HEI + Breed, data = tbl_ex04p01_data)
summary(lm_ex04p01_full)
lm_ex04p01_bwbc <- lm(formula = `Body Weight` ~ `Breast Circumference`, data = tbl_ex04p01_data)
summary(lm_ex04p01_bwbc)
lm_ex04p01_bwbreed <- lm(formula = `Body Weight` ~ Breed, data = tbl_ex04p01_data)
summary(lm_ex04p01_bwbreed)

The comparison of the models shows that the full model does not produce a better model fit. The reason for this is that the explanatory variables in the full model are correlated among each other. As a result of this correlation structure, the same information is contained in different variables and as a result the single variables do not contribute a substantial amount to the explanation of the variation in the response variable.

The correlation structure amoung the different variables can be visualized via a so called pairs plot.

tbl_ex04p01_data$Breed <- as.factor(tbl_ex04p01_data$Breed)
pairs(formula = ~ `Breast Circumference` + BCS + HEI + Breed , data = tbl_ex04p01_data)

From this plot, we can clearly see that Breast Circumference and Breed are correlated. If we switch levels 2 and 3 of the breeds, then we can see the relationship between Breast Circumference and Breed even better.

Problem 2: Plotting

The first step before doing any analysis should always be to plot the data which helps to visualise the internal structure of a dataset. A very instructive plot is the so-called pairs-plot. This plot can be done using the function pairs(). The task of this problem is to create a pairs-plot for the extended dataset on Body Weight of animals. The input to the function pairs() must be all numeric. This means that the column containing the Breed in our dataset must be converted to a datatype called factor. This can be done using the function as.factor().

Results of linear models can also be plotted. In such plots, we are mainly interested in the behavior of the residuals. Hence, fit a linear regression model between Body Weight and Breast Circumference and plot the resulting linear model object.

Solution

s_ex04p02_data_path <- "https://charlotte-ngs.github.io/asmss2022/data/asm_bw_flem.csv"
tbl_ex04p02_data <- readr::read_csv(file = s_ex04p02_data_path)
tbl_ex04p02_data$Breed <- as.factor(tbl_ex04p02_data$Breed)
pairs(tbl_ex04p02_data)

The above matrix of scatterplots shows relationships between pairs of variables.

lm_ex04p02 <- lm(formula = `Body Weight` ~ `Breast Circumference`, data = tbl_ex04p02_data)
plot(lm_ex04p02)

For the behavior of the resiudals, we are focusing on the first two plots. The first plot shows whether there is a dependence pattern between the residuals and the fitted values. For this plot a random pattern is desired. The second plot shows a QQ-plot of the residuals. This plot shows any deviation of the numeric distribution of the residuals from the normal distribution.



charlotte-ngs/asmss2022 documentation built on June 7, 2022, 1:33 p.m.