knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "ws#>",
  fig.align = "center"
)

# directory of the Lab4 files in the package
dirdl <- system.file("Lab4",package = "Intro2R")

# create rmd link

library(Intro2R)

Introduction

Lab 4 is primarily about fixing the problem that became apparent in Lab 3, namely the non-linear trend seen in the data.

This means that a different model must be used. But which model? Since we have one model by default -- the straight line -- we will compare it with another model on the basis of some criterion.

Criteria

  1. Information measures
    • AIC (choose the model with the smallest AIC)
    • DIC
    • BIC
    • ...
  2. Penalized $R^2$, also known as adjusted R squared, $R_a^2$ (choose the model with the biggest $R_a^2$)
  3. Mallows's $C_p$ (choose the model with the smaller $C_p$ -- equivalent to AIC for Gaussian regression)
  4. In cases where one MLR model is nested in the other, an ANOVA $F$ test can be used.

There are others.
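To make this concrete, here is a minimal sketch comparing a straight line with a quadratic fitted to the spruce data set from the Intro2R package (this data set is used again later in the lab); lin.lm and quad.lm are names introduced here for illustration. Since the straight line is nested in the quadratic, criterion 4 (ANOVA) also applies.

lin.lm  <- lm(Height ~ BHDiameter, data = spruce)                    # straight line
quad.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = spruce)  # quadratic

AIC(lin.lm, quad.lm)            # 1. choose the model with the smallest AIC
summary(lin.lm)$adj.r.squared   # 2. choose the model with the biggest adjusted R squared
summary(quad.lm)$adj.r.squared
anova(lin.lm, quad.lm)          # 4. F test -- valid because the models are nested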

R skills

Plots

We have already been introduced to linear and non-linear models. Briefly, linear models can be expressed as a linear combination of the parameters and non-linear models cannot be.

In Lab 3 you were introduced to the shiny package, which enables you to create a dynamic input/output application.

Lab 4 does two major things:

  1. Fits a curve to the data
  2. Derives and applies the theory for piecewise regression

The following will remind you of some things you have learnt in the past and extend others.

Smoothers

Background

In the R package ggplot2 there is a function called geom_smooth() -- this enables you to add an estimated trend to your data.

The word smoother is less technical than the precise idea of an estimated trend line. Smoother is the better term in the context of ggplot because we will use non-parametric methods, which seek only to produce a smoothed curve or line from the data itself.

Two closely related examples of smoothers are LOESS (locally estimated scatterplot smoothing) and LOWESS (locally weighted scatterplot smoothing). These are non-parametric, meaning that we do not invoke a parametric model and estimate its parameters; rather, we apply weighted least squares regression over a window of the data, where more weight is given to points near the middle of the window and less to those at the boundaries. The window is moved through the domain of the data and a curve is generated from the predicted response. This means that the curve is "data" generated and gives you a better idea of what the data are saying about the trend.

Example

Let's go ahead and use LOESS to create a smoother.

library(ggplot2)
gs <- ggplot(ddt, aes(x = LENGTH, y = WEIGHT)) +
  geom_point(aes(color = SPECIES)) +
  geom_smooth(method = "loess", formula = y ~ x)
gs

You can make good use of smoothers to represent more complex models by including an aesthetic that divides the data and applies the model to subsets. In this case we have a non-linear model applied to each of the species.

g <- ggplot(ddt, aes(x = LENGTH, y = WEIGHT)) +
  geom_point(aes(color = SPECIES)) 
g <- g + geom_smooth(method = "nls", formula = 'y ~ b0*exp(b1*x)',
                     method.args = list(start = list(b0 = 100, b1 = 0.05)),
                     se = FALSE, aes(color = SPECIES))
g <- g + labs(title = "nls fitted to subgroups of SPECIES")
g

Piecewise smoother

You will need to understand the development of this:

Introduction

Suppose we have two line segments which make a reasonable fit to the data, joining at a single point $(x_k, y_k)$ which we may call the change point. We wish to use the data to estimate the lines and some measure of fit to determine the change point.

Theory

Suppose that for line 1 and line 2 we have the following formulae

$$l1: y=\beta_0 +\beta_1 x$$ $$l2: y = \beta_0 + \delta +(\beta_1 + \zeta)x$$

[fig 1: Piecewise]

Then at the change point we have the two lines intersecting

$$\beta_0 +\beta_1x_k = \beta_0+\delta + (\beta_1 +\zeta)x_k$$

Hence we have

$$\delta=-\zeta x_k$$

Therefore we can write l2 as

$$y = \beta_0 -\zeta x_k + (\beta_1 +\zeta)x $$ That is

$$ y = \beta_0 + \beta_1 x + \zeta (x-x_k)$$

l2 is l1 with an adjustment term.

We will introduce an indicator function that is $1$ when $x > x_k$ and $0$ otherwise.

So

$$y = \beta_0 + \beta_1 x +\zeta (x-x_k)I(x>x_k)$$

Create plot using ggplot

library(ggplot2)
xk = 18
dff <- spruce
df <- within(dff, X<-(BHDiameter-xk)*(BHDiameter>xk))
head(df)
ylm <- lm(Height ~ BHDiameter + X, data = df)
summary(ylm)
gg <- ggplot(df, aes(x = BHDiameter, y = Height)) + geom_point() + 
  geom_smooth(method = "lm", formula = 'y ~ x + I((x-xk)*(x>xk))', se = FALSE, data = df)
gg
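As a quick check (a sketch only; lin.lm is a name introduced here), we can compare the straight line with the piecewise fit using AIC, the first criterion listed above:

lin.lm <- lm(Height ~ BHDiameter, data = df)  # straight line, no change point
AIC(lin.lm, ylm)                              # the model with the smaller AIC is preferred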

Generalization

Since the next line segment is obtained by adding another adjustment (interaction) term, we can easily generalize this procedure.

Say we have two change points and thus three line segments.

Then:

$$ y=\beta_0 + \beta_1 x + \beta_2 (x-x_k)I(x> x_k) + \beta_3 (x-x_{k2})I(x>x_{k2}) $$

Application

Add two change points to the Spruce.csv data set and make a plot of the line segments. Use $x_k = 10$ and $x_{k2} = 18$.

spruce.df = spruce
head(spruce.df)
# piecewise mean function: intercept + slope, plus an adjustment after each knot
myf2 = function(x, xk, xk2, coef){
  coef[1] + coef[2]*x + coef[3]*(x - xk)*(x - xk > 0) + coef[4]*(x - xk2)*(x - xk2 > 0)
}

# fit the piecewise model for the given knots and return the estimated coefficients
coeff = function(xk,xk2){ # data=spruce.df
  df=within(spruce.df, 
            {
            X<-(BHDiameter-xk)*(BHDiameter>xk) 
            X2<-(BHDiameter-xk2)*(BHDiameter>xk2)
            }
            ) 
  lmp=lm(Height ~ BHDiameter + X + X2, data=df)
  coef(lmp)
}


with(spruce.df,plot(BHDiameter, Height,
                    pch = 21, 
                    cex=1.5,
                    bg="green",
                    main="Height Vs BHDiameter using piecewise regression"
                    ))

cf = coeff(10,18)
curve(myf2(x,10,18,cf), add=TRUE, lwd=2,col="Blue")

Questions to consider (Not a part of the Lab -- extra for experts!!)

  1. Make a shiny app that will plot the points and fit a 3-segment model using change points set through widgets.

  2. Adjust the app so that the AIC (from AIC(), keeping the default k = 2) is displayed on the upper plot.

  3. Find optimal knots -- what criteria would you use?

Some code

# combinations of xk1 <= xk2
n = 4
x = seq(5, 20, length = n)
for(xk1 in x[x <= (max(x) - 3)]){
  for(xk2 in x[x >= xk1]){
    print(paste0("xk1=", xk1, ", ", "xk2=", xk2))
  }
}
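Building on question 3, here is a minimal sketch of one possible criterion: refit the piecewise model for each candidate pair of knots and record its AIC (smaller is better). The helper aic.knots() is introduced here for illustration; it mirrors the coeff() function above.

# score a candidate pair of knots by the AIC of the refitted piecewise model
aic.knots <- function(xk, xk2){
  df <- within(spruce.df, {
    X  <- (BHDiameter - xk)  * (BHDiameter > xk)
    X2 <- (BHDiameter - xk2) * (BHDiameter > xk2)
  })
  AIC(lm(Height ~ BHDiameter + X + X2, data = df))
}

n <- 4
x <- seq(5, 20, length = n)
for(xk1 in x[x <= (max(x) - 3)]){
  for(xk2 in x[x >= xk1]){
    print(paste0("xk1=", xk1, ", xk2=", xk2, ", AIC=", round(aic.knots(xk1, xk2), 2)))
  }
}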

More models

This course will take you further by introducing you to other models -- remember, modeling must be well understood to be implemented with confidence. However, as a means of motivation, we will learn more advanced applications of models from an introductory and practical standpoint.

GLM (Generalized Linear Models)

If we extend the SLR model, that is, add more predictors, the model becomes MLR (Multiple Linear Regression).

We could express it this way:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_r x_r + \epsilon$$ where $\epsilon \sim N(0, \sigma^2)$

Or we could define the linear predictor

$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_r x_r $$

$$Y=\eta + \epsilon$$

Or

$$E(Y) = \eta$$ $$Y \sim N(\eta, \sigma^2)$$ This last expression generalizes to the GLM, which has three parts:

  1. Response variables $Y_1, \ldots, Y_n$ which are assumed to share the same distribution from the exponential family
  2. A set of parameters $\beta$ and explanatory variables $X$, where $\eta = X^{'}\beta$
  3. A monotone link function $g$ such that $$ g(\mu) = \eta$$

where $\mu = E(Y)$.

The exponential family can be defined as any probability mass or density function that can be expressed in the following form:

$$f(y|\theta) = \exp(a(\theta)b(y) + c(\theta) + d(y))$$ where $\theta$ is the parameter and $y$ the data. A theory is built around these assumptions which generalizes the linear model.
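As a check on this definition, the Bernoulli distribution used later in this lab can be written in exactly this form:

$$f(y|p) = p^y(1-p)^{1-y} = \exp\left(y\,\log\frac{p}{1-p} + \log(1-p)\right)$$

so that $a(\theta) = \log\frac{p}{1-p}$ (the logit), $b(y) = y$, $c(\theta) = \log(1-p)$ and $d(y) = 0$.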

Maximum likelihood estimates for $\theta$ are made through Newton-Raphson optimization.

Notice that this applies to any $Y$ whose distribution belongs to the exponential family -- this means that $Y$ could be discrete or continuous.

This means that GLM's are very flexible.

We will take an example. Suppose that a dangerous level of DDT in the flesh of fish is $\ge 5$ ppm.

We can then take the ddt data set and break it into two parts according to whether fish have less than 5 ppm of DDT or not.

First prepare the data

Eat <-with(ddt, cut(DDT, breaks = c(0,5, max(DDT)+1), labels = c("OK", "Dang"), right = FALSE, ordered_result = TRUE))

summary(Eat)
class(Eat)

ddt <- within(ddt, Eat <- Eat)
head(ddt)
levels(Eat)

It is important to see that the new variable is not simply nominal: since the underlying variable DDT is continuous, OK represents values of $DDT < 5$ ppm and Dang represents values of $DDT \ge 5$ ppm. Hence Eat is actually an ordinal variable.

Now we will use GLM theory to analyze this, taking the response to be Bernoulli ($Bin(n = 1, p)$). The link is the logit. Let's review the three parts that must be present in a GLM:

  1. $Y_i \sim Bin(n=1, p) \in Expo.\; Family$
  2. $\eta_{i} = \beta_{0} + \beta_{1} LENGTH + \beta_{2} WEIGHT$
  3. $\log(\frac{p_i}{1-p_i}) = \text{logit}(p_i) = \eta_i$

Where $\mu_i = p_i = E(Y_i)$

We will use the function glm() to perform the logistic regression.

yglm <- glm(Eat ~ LENGTH + WEIGHT,family = binomial(link = "logit") ,data = ddt)
summary(yglm)
cf<-coef(yglm)

The estimating equation for the trend is:

$$\log(odds) = \hat{\beta}_0 + \hat{\beta}_1\,LENGTH + \hat{\beta}_2\,WEIGHT$$ where the estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ are cf[1], cf[2] and cf[3] from the fit above, and $odds = \frac{p}{1-p}$ -- notice that $p = \frac{\exp(\eta)}{1+\exp(\eta)}$

We can calculate these predicted probabilities in R

prob <-predict(yglm, type = "response")
head(prob,10) 
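As a sanity check (a sketch using the names cf and prob from the chunks above), the same probabilities can be computed by hand from the linear predictor and the inverse logit:

eta <- with(ddt, cf[1] + cf[2]*LENGTH + cf[3]*WEIGHT)  # linear predictor eta
p.hand <- exp(eta)/(1 + exp(eta))                      # inverse logit, same as plogis(eta)
head(cbind(p.hand, prob), 10)                          # should agree with predict(yglm, type = "response")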

Summary of Lab 4 main ideas

It is a good idea to start with a straight line and then see how the model is working by investigating the residuals. Typically a plot of the residuals versus the fitted values is a good idea, e.g. plot(ylm, which = 1).

If additional signal is present in the residual plot then you will need to adjust the basic straight-line model. Sometimes a higher-order polynomial will work well, or perhaps a non-linear model like an exponential y ~ b0*exp(b1*x) -- you can fit this with nls() -- see the examples above. Unless you are proficient with non-linear models it will be best for you to stick with linear models.
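For those who want to try it, here is a minimal sketch of fitting the exponential model directly with nls() on the ddt data; the starting values are borrowed from the geom_smooth() example above and may need adjusting for other data.

# exponential fit by non-linear least squares
exp.nls <- nls(WEIGHT ~ b0 * exp(b1 * LENGTH), data = ddt,
               start = list(b0 = 100, b1 = 0.05))
summary(exp.nls)
AIC(exp.nls)   # can be compared with the AIC of a competing lm() fit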

To check which model is better you will need a criterion. Many statisticians use AIC since it is generally applicable -- choose the model with the smallest AIC. If you use MLR (multiple linear regression), as in the case of a quadratic, then for our course we can use $R_a^2$ -- choose the model with the biggest adjusted $R^2$.

You may need to try a few linear models until the assumptions are satisfied. Once the best model is determined ("best" meaning it satisfies the assumptions of a linear model and has the biggest $R_a^2$), go ahead and use that model to solve whatever problems you have been confronted with.

Extra for experts

quad.lm = lm(Height~BHDiameter + I(BHDiameter ^ 2), data = spruce.df)

myplot = function(x) {
  quad.lm$coef[1] + quad.lm$coef[2] * x  + quad.lm$coef[3] * x ^ 2
}

plot(Height ~ BHDiameter, main = "Spruce Height Prediction",
     xlab = "Breast Height Diameter (cm)", ylab = "Height of Tree (m)",
     pch = 21, bg = "blue", cex = 1.1,
     xlim = c(0, max(spruce.df$BHDiameter)), ylim = c(0, max(spruce.df$Height)),
     data = spruce.df)
curve(myplot, lwd = 2, col = "steelblue", add = TRUE) # Add curve

# General line segments
with(spruce.df, segments(BHDiameter, Height, 
                         BHDiameter, quad.lm$coef[1] + quad.lm$coef[2] * BHDiameter  + quad.lm$coef[3] * BHDiameter ^ 2, col = "Black"))

lgcooks = spruce.df[c(18, 21, 24),] # 3 highest Cook's distances

# Highest Cook's distance line segments in thick red
with(lgcooks, segments(BHDiameter, Height, 
                         BHDiameter, quad.lm$coef[1] + quad.lm$coef[2] * BHDiameter  + quad.lm$coef[3] * BHDiameter ^ 2, col = "Red", lwd = 3))

# Number points
with(spruce.df, text(BHDiameter, Height, labels = row.names(spruce.df), pos = 4, cex = 0.5))

# Arrow to highest Cook's
with(spruce.df, arrows(BHDiameter[18], Height[24], BHDiameter[24], Height[24], col = "blue", lwd = 2))

# Text about highest Cook's
with(spruce.df, text(BHDiameter[18], Height[24], labels = c("Highest Cook's\ndistance   "), pos = 2, cex = 1))

Lab 4

Please download the following files into appropriate directories

rmdfiles("Lab4")


