```r
# Load your libraries here
library('lehmansociology')
library('ggplot2')
```
Previously we looked at the 4 pairs of x and y variables in the anscombe dataset. We saw that although the variables were very similar to each other in terms of some statistics, such as the means, and the regression results were all identical, the plotted data looked quite different.
Remember you can see the anscombe data by typing View(anscombe) in the console.
Run the code below to review the plots.
```r
# Type your code here
ggplot(anscombe, aes(x = x1, y = y1)) + geom_point() +
  geom_smooth(method = "lm") + ggtitle("Results for x1 and y1")
ggplot(anscombe, aes(x = x2, y = y2)) + geom_point() +
  geom_smooth(method = "lm") + ggtitle("Results for x2 and y2")
ggplot(anscombe, aes(x = x3, y = y3)) + geom_point() +
  geom_smooth(method = "lm") + ggtitle("Results for x3 and y3")
ggplot(anscombe, aes(x = x4, y = y4)) + geom_point() +
  geom_smooth(method = "lm") + ggtitle("Results for x4 and y4")
```
Just use words to say what you see. You can mention individual points if it makes sense to do so.
Ordinary least squares regression is designed to estimate a straight line that has the "best" fit to the data. But what does "best" fit mean? We have already seen that sometimes the regression line is not really the most accurate way to summarize data.
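Concretely, "best" in ordinary least squares means the line that makes the sum of the squared vertical distances between the points and the line as small as possible. Here is a minimal sketch of that idea (ols_fit and the comparison line y = 3 + 0.6*x are just illustrative choices):

```r
# The OLS line has the smallest possible sum of squared errors.
ols_fit <- lm(y1 ~ x1, data = anscombe)
sum(resid(ols_fit)^2)
# Any other line, for example y = 3 + 0.6*x, has a larger sum:
sum((anscombe$y1 - (3 + 0.6 * anscombe$x1))^2)
```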
Remember that all of the regression results for the four pairs were the same. Let's just look at the summary for the first result to remind us of this. We'll also create the 3 other results objects.
```r
results1 <- lm(y1 ~ x1, data = anscombe)
summary(results1)
results2 <- lm(y2 ~ x2, data = anscombe)
results3 <- lm(y3 ~ x3, data = anscombe)
results4 <- lm(y4 ~ x4, data = anscombe)
```
The two things we want to look closely at here are the coefficient estimates and the Multiple R-squared value.
The multiple R squared represents how much of the variation in the dependent variable is "explained by" the independent variable. A 0 would mean none, a 1 would mean all. This is a proportion, so it has to be between 0 and 1.
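If you want to see where this number comes from, here is a sketch that computes it by hand for the first model (using results1 from above; ss_error and ss_total are just illustrative names):

```r
# Multiple R-squared by hand for the first model.
ss_error <- sum(resid(results1)^2)                    # variation left in the errors
ss_total <- sum((anscombe$y1 - mean(anscombe$y1))^2)  # total variation in y1
1 - ss_error / ss_total  # should match Multiple R-squared in summary(results1)
```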
The coefficients represent a regression equation:

predicted(y) = 3.0001 + .5001*x
Use R as a calculator below to calculate the predicted(y) for 0, 8, 9, 19 and a value of your choice.
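For example, here is the first one as a sketch (the other values follow the same pattern):

```r
# Predicted y when x = 8, using the coefficients above.
3.0001 + .5001 * 8  # 7.0009
```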
Fortunately, R will calculate the predicted values of y for each observation. These are found in fitted.values(lm_results_object). Fitted values and predicted values mean the same thing.
It will also calculate the actual y value minus the predicted value. These are called either residuals or errors. These are found in resid(lm_results_object). Let's get all 4 sets of actual x, actual y, predicted and residual.
```r
# Results 1 (x1, y1)
fitdata1 <- data.frame(x = anscombe$x1,
                       y = anscombe$y1,
                       predicted = fitted.values(results1),
                       error = resid(results1))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata1, x)
```
It turns out that one way to tell if a regression line is an appropriate approach for your data is to look for patterns or strange values, such as extremely large values, in the errors. If there are patterns, then the regression model you have used probably does not make sense.
Errors should be evenly (and randomly) distributed around the regression line.
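One quick way to check this, sketched here using the fitdata1 data frame created above, is to plot the errors against x and look for a pattern:

```r
# Errors should scatter randomly around zero if a straight line fits well.
ggplot(fitdata1, aes(x = x, y = error)) +
  geom_point() +
  geom_hline(yintercept = 0)
```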
Now we want to look closely at the last 3 models, since the first one looks fine.
```r
# Create a new data set without observation 8.
no_obs_8 <- anscombe[-8, ]
# You can View() it to see the new data, but not in your file.
results4 <- lm(y4 ~ x4, data = no_obs_8)
# Get the summary
summary(results4)
```
You might also want to try some of the techniques we used earlier, such as graphing or looking at the errors.
There is not a right answer, only well thought out answers.
Now let's look at the x3, y3 results.
Write the code to see what happens if we drop the outlier.
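One possible version, following the same pattern as the x4, y4 code above (no_obs_3 and results3_drop are just illustrative names), might look like this:

```r
# The outlier in the x3, y3 data is observation 3 (the point with y3 = 12.74).
no_obs_3 <- anscombe[-3, ]
results3_drop <- lm(y3 ~ x3, data = no_obs_3)
summary(results3_drop)
```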
You might also want to try some of the techniques we used earlier, such as graphing or looking at the errors.
Another way to approach the x3, y3 data would be to make a dichotomous variable representing observation 3 and add that to the regression.
```r
# Create the dichotomous variable: TRUE only for the outlying observation.
anscombe$obs3 <- anscombe$y3 == 12.74
# Add it to the model using a +.
results3 <- lm(y3 ~ x3 + obs3, data = anscombe)
```
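To see what the dichotomous variable does, you might look at the new summary and errors, following the earlier pattern (fitdata3 is just an illustrative name):

```r
summary(results3)
fitdata3 <- data.frame(x = anscombe$x3,
                       y = anscombe$y3,
                       predicted = fitted.values(results3),
                       error = resid(results3))
dplyr::arrange(fitdata3, x)
```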
There is not a right answer, only well thought out answers.
Finally, let's look at the x2, y2 results.
What did the plot of the x2 and y2 data look like? Really try to remember!
One kind of curved line in algebra is a parabola, which is the graph of a function that includes a squared value of x, for example y = a + b*x + c*x^2. (Google parabola if you need to.)
Let's try a model with a squared term. We add it using a + sign.
```r
anscombe$x2squared <- anscombe$x2^2
results2 <- lm(y2 ~ x2 + x2squared, data = anscombe)
```
You will definitely want to look at the predicted values and the errors. You may also want to plot x and predicted.
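A sketch of one way to do that (fitdata2 is just an illustrative name):

```r
fitdata2 <- data.frame(x = anscombe$x2,
                       y = anscombe$y2,
                       predicted = fitted.values(results2),
                       error = resid(results2))
dplyr::arrange(fitdata2, x)
# Plot x against the predicted values to see the curve.
ggplot(fitdata2, aes(x = x, y = predicted)) + geom_point() + geom_line()
```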
We should take two lessons from this. First, just because you can do something does not mean you should do it. A regression will run for all kinds of data but that does not mean it is right.
Second, always look at your data graphically in order to help decide whether a regression model makes sense and to spot problems such as outliers. Looking at residuals can give you the same information, especially as your models get more complex.
As a reader of regression results you should always ask whether the author has really investigated if there are any such issues in their data.