```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
# directory to Lab
dirdl <- system.file("Lab4", package = "Intro2R") # create rmd link
library(Intro2R)
```
Lab 4 is primarily about fixing the problem apparent in Lab 3, namely the non-linear trend seen in the data.

This means that a different model must be used. But which model? Since we have one model by default -- the straight line -- we will compare it with another model on the basis of some criterion:
- Information measures
  - AIC (choose the model with the smallest AIC)
  - DIC
  - BIC
  - ...
- Penalized $R^2$, also known as adjusted R squared, $R_a^2$
- Mallows's $C_p$ (choose the model with the smaller $C_p$ -- equivalent to AIC for Gaussian regression)
- In cases where one MLR model is nested in the other, we can use an ANOVA test.

There are others.
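As a minimal sketch of how some of these criteria can be computed in R (assuming the `spruce` data used later in this lab is available), compare a straight line with a quadratic:

```r
# Compare a straight-line model with a quadratic on the spruce data
lin.lm  <- lm(Height ~ BHDiameter, data = spruce)
quad.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = spruce)

AIC(lin.lm, quad.lm)            # smaller AIC is preferred
summary(lin.lm)$adj.r.squared   # larger adjusted R^2 is preferred
summary(quad.lm)$adj.r.squared
anova(lin.lm, quad.lm)          # valid because the line is nested in the quadratic
```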
We have already been introduced to linear and non-linear models. Briefly, linear models can be expressed as a linear combination of the parameters and non-linear models cannot be.
In Lab 3 you were introduced to the shiny server, which enables you to create a dynamic input/output application.
Lab 4 does two major things:
- Fits a curve to the data
- Derives and applies the theory for piecewise regression
The following will remind you of some things you have learnt in the past and extend others.
In the R package `ggplot2` there is a function called `geom_smooth()` -- this enables you to create estimated trends for your data.

The word smoother is less technical than the precise idea of an estimated trend line. The term smoother suits the context of ggplot best because we will use non-parametric models, which seek only to produce a smoothed curve/line based on the data itself.

One example of a smoother is LOESS (locally estimated scatterplot smoothing), also known as LOWESS (locally weighted scatterplot smoothing). This is non-parametric, meaning that we do not invoke a parametric model and find estimates of its parameters; rather, we apply weighted least squares regression over a window of the data, where more weight is given to points in the middle of the window and less to those on the boundaries. The window is moved through the domain of the data and a curve is generated from the predicted response. This means that the curve is "data" generated and gives you a better idea of what the data is saying concerning the trend.
Let's go ahead and use LOESS to create a smoother:
```r
library(ggplot2)
gs <- ggplot(ddt, aes(x = LENGTH, y = WEIGHT)) +
  geom_point(aes(color = SPECIES)) +
  geom_smooth(method = "loess", formula = y ~ x)
gs
```
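The amount of smoothing can be tuned through the `span` argument of `geom_smooth()` (the fraction of points used in each local fit, `0.75` by default). For example, a smaller span follows the data more closely:

```r
# A smaller span gives a wigglier, more locally adaptive smoother
gs + geom_smooth(method = "loess", formula = y ~ x,
                 span = 0.3, se = FALSE, color = "red")
```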
You can make good use of smoothers to represent complex models by including an aesthetic that will divide the data and apply the model to subsets. In this case we have a non-linear model applied to each of the species.
```r
g <- ggplot(ddt, aes(x = LENGTH, y = WEIGHT)) +
  geom_point(aes(color = SPECIES))
g <- g + geom_smooth(method = "nls", formula = 'y ~ b0*exp(b1*x)',
                     method.args = list(start = list(b0 = 100, b1 = 0.05)),
                     se = FALSE, aes(color = SPECIES))
g <- g + labs(title = "nls fitted to subgroups of SPECIES")
g
```
You will need to understand the development of this:
Suppose we have two line segments which make a reasonable fit to the data, joining at a single point $(x_k, y_k)$, which we may call the change point. We wish to use the data to estimate the lines and some measure of fit to determine the change point.
Suppose that for line 1 and line 2 we have the following formulae:

$$l_1: y=\beta_0 +\beta_1 x$$
$$l_2: y = \beta_0 + \delta +(\beta_1 + \zeta)x$$
Then at the change point the two lines intersect:
$$\beta_0 +\beta_1x_k = \beta_0+\delta + (\beta_1 +\zeta)x_k$$
Hence we have
$$\delta=-\zeta x_k$$
Therefore we can write $l_2$ as

$$y = \beta_0 -\zeta x_k + (\beta_1 +\zeta)x$$

That is,

$$y = \beta_0 + \beta_1 x + \zeta (x-x_k)$$

So $l_2$ is $l_1$ with an adjustment term.
We will introduce an indicator function that is $1$ when $x>x_k$ and $0$ otherwise.
So
$$y = \beta_0 + \beta_1 x +\zeta (x-x_k)I(x>x_k)$$
We can fit this model with `lm()` and plot it with ggplot:

```r
library(ggplot2)
xk <- 18

dff <- spruce
df <- within(dff, X <- (BHDiameter - xk) * (BHDiameter > xk))
head(df)

ylm <- lm(Height ~ BHDiameter + X, data = df)
summary(ylm)

gg <- ggplot(df, aes(x = BHDiameter, y = Height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x + I((x-xk)*(x>xk))',
              se = FALSE, data = df)
gg
```
Since the next line segment is the addition of another interaction term, we can easily generalize this procedure. Say we have two change points and thus three line segments. Then:
$$ y=\beta_0 + \beta_1 x + \beta_2 (x-x_k)I(x> x_k) + \beta_3 (x-x_{k2})I(x>x_{k2}) $$
Add two change points to the Spruce.csv data set and make a plot of the line segments. Make $x_k=10$ and $x_{k2}=18$.
```r
spruce.df <- spruce
head(spruce.df)

# Piecewise mean function with two change points
myf2 <- function(x, xk, xk2, coef) {
  coef[1] + coef[2] * x +
    coef[3] * (x - xk) * (x - xk > 0) +
    coef[4] * (x - xk2) * (x - xk2 > 0)
}

# Fit the piecewise model and return its coefficients
coeff <- function(xk, xk2) {
  df <- within(spruce.df, {
    X  <- (BHDiameter - xk) * (BHDiameter > xk)
    X2 <- (BHDiameter - xk2) * (BHDiameter > xk2)
  })
  lmp <- lm(Height ~ BHDiameter + X + X2, data = df)
  coef(lmp)
}

with(spruce.df,
     plot(BHDiameter, Height, pch = 21, cex = 1.5, bg = "green",
          main = "Height Vs BHDiameter using piecewise regression"))
cf <- coeff(10, 18)
curve(myf2(x, 10, 18, cf), add = TRUE, lwd = 2, col = "Blue")
```
Make a shiny app that will plot the points and fit a three-segment model using change points set through widgets. Adjust the app so that the AIC computed with `AIC()` (keep `k = 2` as the default) is displayed on the upper plot.
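One possible minimal sketch of such an app follows; the widget names, slider ranges, and layout are illustrative assumptions, not the required solution:

```r
library(shiny)
library(Intro2R) # assumed to provide the spruce data

ui <- fluidPage(
  titlePanel("Piecewise regression with two change points"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("xk1", "First change point", min = 5, max = 20, value = 10, step = 0.5),
      sliderInput("xk2", "Second change point", min = 5, max = 20, value = 18, step = 0.5)
    ),
    mainPanel(plotOutput("piecewise"))
  )
)

server <- function(input, output) {
  output$piecewise <- renderPlot({
    xk1 <- min(input$xk1, input$xk2) # keep the change points ordered
    xk2 <- max(input$xk1, input$xk2)
    df <- within(spruce, {
      X  <- (BHDiameter - xk1) * (BHDiameter > xk1)
      X2 <- (BHDiameter - xk2) * (BHDiameter > xk2)
    })
    fit <- lm(Height ~ BHDiameter + X + X2, data = df)
    cf  <- coef(fit)
    plot(Height ~ BHDiameter, data = df, pch = 21, bg = "green",
         main = paste("AIC =", round(AIC(fit), 2))) # AIC() with default k = 2
    curve(cf[1] + cf[2] * x +
            cf[3] * (x - xk1) * (x > xk1) +
            cf[4] * (x - xk2) * (x > xk2),
          add = TRUE, lwd = 2, col = "blue")
  })
}

shinyApp(ui, server)
```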
Find optimal knots -- what criteria would you use?
```r
# Combinations of xk1 <= xk2
n <- 4
x <- seq(5, 20, length = n)
for (xk1 in x[x <= (max(x) - 3)]) {
  for (xk2 in x[x >= xk1]) {
    print(paste0("xk1=", xk1, ", ", "xk2=", xk2))
  }
}
```
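A hedged sketch of one possible answer: reuse the loop skeleton above on a finer grid, fit the piecewise model at each candidate pair of knots, and keep the pair with the smallest AIC:

```r
# Grid search over candidate knot pairs, scored by AIC (one possible criterion)
n <- 10
x <- seq(5, 20, length = n)
best <- c(AIC = Inf, xk1 = NA, xk2 = NA)
for (xk1 in x[x <= (max(x) - 3)]) {
  for (xk2 in x[x >= xk1]) {
    df <- within(spruce, {
      X  <- (BHDiameter - xk1) * (BHDiameter > xk1)
      X2 <- (BHDiameter - xk2) * (BHDiameter > xk2)
    })
    a <- AIC(lm(Height ~ BHDiameter + X + X2, data = df))
    if (a < best["AIC"]) best <- c(AIC = a, xk1 = xk1, xk2 = xk2)
  }
}
best # the knot pair with the smallest AIC over the grid
```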
This course will take you further by introducing you to other models -- remember, modeling must be understood well to be implemented with confidence. However, as a means of motivation we will learn more advanced applications of models from an introductory and practical standpoint.
If we extend the SLR model by adding more predictors, the model becomes MLR (multiple linear regression). We could express it this way:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_r x_r +\epsilon$$

where $\epsilon \sim N(0, \sigma^2)$.
Or we could define the linear predictor

$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_r x_r$$
$$Y=\eta + \epsilon$$
Or
$$E(Y) = \eta$$
$$Y\sim N(\eta, \sigma^2)$$

This last formulation generalizes to the GLM (generalized linear model), which contains three parts:
- Response variables $Y_1, \ldots, Y_n$, which are assumed to share the same distribution from the exponential family
- A set of parameters $\beta$ and explanatory variables $X$, where $\eta = X^{'}\beta$
- A monotone link function $g$ such that $g(\mu) = \eta$, where $\mu = E(Y)$
The exponential family can be defined as any probability or density function that can be expressed in the following form:

$$f(y|\theta) = \exp(a(\theta)b(y) + c(\theta) + d(y))$$

where $\theta$ is the parameter and $y$ the data. A theory built around these assumptions generalizes the linear model.
Maximum likelihood estimates for $\theta$ are made through Newton-Raphson optimization.
Notice that this applies to any $Y$ distributed according to a member of the exponential family -- this means that $Y$ could be discrete or continuous.
This means that GLMs are very flexible.
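For example, the Bernoulli distribution ($Bin(n=1,p)$) used below can be written in this form:

$$f(y|p) = p^y(1-p)^{1-y} = \exp\left(y\log\frac{p}{1-p} + \log(1-p)\right)$$

so that $a(\theta) = \log\frac{p}{1-p}$ (the logit), $b(y) = y$, $c(\theta) = \log(1-p)$ and $d(y) = 0$.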
We will take an example. Suppose that the dangerous level of DDT in the flesh of fish is $\ge 5$ ppm.

We can then take the ddt data set and break it into two parts according to whether fish have more or less than 5 ppm DDT.
First prepare the data:
```r
Eat <- with(ddt, cut(DDT,
                     breaks = c(0, 5, max(DDT) + 1),
                     labels = c("OK", "Dang"),
                     right = FALSE,
                     ordered_result = TRUE))
summary(Eat)
class(Eat)
ddt <- within(ddt, Eat <- Eat)
head(ddt)
levels(Eat)
```
It is important to see that the new variable created is not simply nominal, since we know that the underlying variable `DDT` is continuous: `OK` represents values of $DDT < 5$ ppm and `Dang` represents values of $DDT \ge 5$ ppm. Hence `Eat` is actually an ordinal variable.
Now we will use GLM theory to analyze this, taking the response to be a Bernoulli ($Bin(n=1, p)$). The link is the logit. Let's review the three parts that must be present in a GLM:
- $Y_i \sim Bin(n=1, p) \in Expo.\; Family$
- $\eta_{i} = \beta_{0} + \beta_{1} LENGTH + \beta_{2} WEIGHT$
- $\log(\frac{p_i}{1-p_i}) = \text{logit}(p_i) = \eta_i$

where $\mu_i = p_i = E(Y_i)$.
We will use the function `glm()` to perform the logistic regression:
```r
yglm <- glm(Eat ~ LENGTH + WEIGHT,
            family = binomial(link = "logit"),
            data = ddt)
summary(yglm)
cf <- coef(yglm)
```
The estimating equation for the trend is:

$$log(odds) = `r cf[1]` + `r cf[2]`\, LENGTH + `r cf[3]`\, WEIGHT$$
where $odds = \frac{p}{1-p}$ -- notice that $p = \frac{\exp(\eta)}{1+\exp(\eta)}$.
We can calculate these predicted probabilities in R:

```r
prob <- predict(yglm, type = "response")
head(prob, 10)
```
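As a sanity check (a minimal sketch using the `cf` and `prob` objects defined above), the same probabilities can be computed by hand from the linear predictor:

```r
# Manual inverse logit: should agree with predict(yglm, type = "response")
eta <- cf[1] + cf[2] * ddt$LENGTH + cf[3] * ddt$WEIGHT
p   <- exp(eta) / (1 + exp(eta))
head(cbind(manual = p, predict = prob), 10)
```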
It is a good idea to start with a straight line and then see how the model is working by investigating the residuals. Typically a plot of the residuals versus the fitted values is a good idea: `plot(ylm, which = 1)`.
If additional signal is present in the residual plot then you will need to adjust the basic straight-line model. Sometimes a higher-order polynomial will work well, or perhaps a non-linear model like an exponential `y ~ b0*exp(b1*x)` -- you can fit this with `nls()` -- see the examples above. Unless you are proficient with non-linear models it will be best for you to stick with linear models.
To check which model is better you will need a criterion. Many statisticians use AIC since it is generally applicable -- choose the model with the smallest AIC. If you use MLR (multiple linear regression), as in the case of a quadratic, then for our course we can use $R_a^2$ -- choose the model with the biggest adjusted R squared.
You may need to try a few linear models until the assumptions are confirmed. Once the best model is determined ("best" meaning it satisfies the assumptions of a linear model and has the biggest $R_a^2$), go ahead and use that model to solve whatever problems you have been confronted with.
```r
quad.lm <- lm(Height ~ BHDiameter + I(BHDiameter^2), data = spruce.df)

myplot <- function(x) {
  quad.lm$coef[1] + quad.lm$coef[2] * x + quad.lm$coef[3] * x^2
}

plot(Height ~ BHDiameter,
     main = "Spruce Height Prediction",
     xlab = "Breast Height Diameter (cm)",
     ylab = "Height of Tree (m)",
     pch = 21, bg = "blue", cex = 1.1,
     xlim = c(0, max(BHDiameter) * 1.0),
     ylim = c(0, max(Height) * 1.0),
     data = spruce.df)
curve(myplot, lwd = 2, col = "steelblue", add = TRUE) # Add curve

# General line segments
with(spruce.df,
     segments(BHDiameter, Height, BHDiameter,
              quad.lm$coef[1] + quad.lm$coef[2] * BHDiameter +
                quad.lm$coef[3] * BHDiameter^2,
              col = "Black"))

lgcooks <- spruce.df[c(18, 21, 24), ] # 3 highest Cook's distances

# Highest Cook's distance line segments in thick red
with(lgcooks,
     segments(BHDiameter, Height, BHDiameter,
              quad.lm$coef[1] + quad.lm$coef[2] * BHDiameter +
                quad.lm$coef[3] * BHDiameter^2,
              col = "Red", lwd = 3))

# Number points
with(spruce.df,
     text(Height ~ BHDiameter, labels = row.names(spruce.df),
          pos = 4, cex = 0.5))

# Arrow to highest Cook's
with(spruce.df,
     arrows(BHDiameter[18], Height[24], BHDiameter[24], Height[24],
            col = "blue", lwd = 2))

# Text about highest Cook's
with(spruce.df,
     text(BHDiameter[18], Height[24],
          labels = c("Highest Cook's\ndistance "), pos = 2, cex = 1))
```
Please download the following files into appropriate directories:

```r
rmdfiles("Lab4")
```