library(lehmansociology)
library(dplyr)
library(tidyr)
library(broom)
library(printr)
education_and_poverty<-create_educ_poverty_data()

We use regression (specifically ordinary least squares regression or OLS) to analyse linear relationships between variables.

For basic regression we use the lm() function. The name lm stands for "linear model", which makes it easy to remember.

Basics

In regression analysis you have a dependent variable and one or more independent variables. The dependent variable in OLS regression should be an interval-ratio variable. If you choose to use an ordinal variable you need to have a good reason to assume that it will act like an interval variable.

In this section of documentation we will focus on the case of one independent variable that is an interval or ratio variable.

The results of a regression give you the equation for a straight line that summarizes the linear relationship between your variables. Of course, this only makes sense if you have good reason to believe that the relationship is linear. You may remember the equation of a line from algebra:

$$ y = mx + b $$ where $b$ is the intercept and $m$ is the slope.

In statistics there is a similar idea, but instead of $y$ we have $\hat y$ (pronounced "y-hat") or $predicted(y)$, and we write the equation this way:
$$ \hat y = a + b_1 x_1 $$ where $a$ is the intercept and $b_1$ is the slope for the independent variable $x_1$.
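
To make the notation concrete, here is a minimal sketch using made-up numbers rather than the package data. The data frame `toy` and the object names are hypothetical and exist only for this illustration. Because the points are constructed to lie exactly on the line y = 2 + 3x, the fitted intercept and slope should come out as (very nearly) 2 and 3.

```r
# Made-up data that lie exactly on the line y = 2 + 3x (an assumption for illustration).
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 + 3 * toy$x

toy_fit <- lm(y ~ x, data = toy)

a  <- coefficients(toy_fit)[["(Intercept)"]]  # the intercept, a
b1 <- coefficients(toy_fit)[["x"]]            # the slope, b_1

# Predicted y when x = 10, computed directly from the equation y-hat = a + b_1 * x_1.
y_hat <- a + b1 * 10
y_hat
```

With real data the points will not fall exactly on the line, so the equation gives a predicted value rather than the observed one.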

The basic format for a regression analysis in R is

lm(dependent_variable ~ independent_variable, dataset)

# Spread out because the names are long and to make the different parts more visible.
lm(Percent.of.adults.with.less.than.a.high.school.diploma..2009.2013 
   ~ 
  Percent.of.adults.with.less.than.a.high.school.diploma..1970, 
  education.states)

However, more often you will want to use the assignment operator <- to store the analysis in an object, which lets you access much more detailed results.

results <- lm(Percent.of.adults.with.less.than.a.high.school.diploma..2009.2013 
   ~ 
  Percent.of.adults.with.less.than.a.high.school.diploma..1970, 
  education.states)


# Then you can get the coefficients

coefficients(results)

# tidy() will give you nicely printed coefficients but you must load the broom library
tidy(results)


# The summary (which gives you the R squared and other details)
summary(results)

# The glance() function gives you a nicer-looking version of the summary, but you must load
# the broom package.
glance(results)

# The residuals or errors (the difference between the data and the regression line)
resid(results)


# The fitted or predicted values (predicted dependent variable based on the line)
# Either predict() or fitted() will work

fitted(results)

# augment() from the broom package makes a nicer looking result. 
augment(results)
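
The relationship between observed values, fitted values and residuals can be checked directly. The sketch below uses `mtcars`, a data set built into R, instead of the package data so that it runs on its own; the object names are hypothetical.

```r
# mtcars is a data set built into R; it is used here only for illustration.
car_fit <- lm(mpg ~ wt, data = mtcars)

observed  <- mtcars$mpg
predicted <- fitted(car_fit)
errors    <- resid(car_fit)

# Each observed value equals its predicted value plus its residual.
all.equal(observed, unname(predicted + errors))

# With an intercept in the model, the residuals sum to (essentially) zero.
sum(errors)

# predict() can also give predictions for new values of the independent variable.
new_predictions <- predict(car_fit, newdata = data.frame(wt = c(2.5, 3.5)))
new_predictions
```

fitted() returns predictions only for the cases used to fit the model, while predict() with the newdata argument works for any values you supply.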

Intermediate Regression

There are several ways to make your regression analysis more powerful, and regression can also take advantage of some neat properties of the mean.

Run separate regressions for groups

One thing we might want to do is compare the regression results for different groups. One way to do this is to use the group_by() function from the dplyr package (notice this was loaded at the start of the file).

Note that this analysis will not run without dplyr and the region variable. First we group the data by region.

databyregion <- education_and_poverty  %>% 
  group_by(region)

Note that this presentation of the results depends on the broom package. You can also just display the results or use xtable to wrap the results for a PDF.

regbyregion<-databyregion  %>% 
  do(model = lm(PCTPOVALL_2013 ~ Percent.of.adults.with.less.than.a.high.school.diploma..2009.2013, data = .))

results <- tidy(regbyregion, model)
results

rsquared <- glance(regbyregion, model)
rsquared
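
If the do() and tidy() pattern above does not run with your installed package versions, the same idea can be sketched in base R with no extra packages. This is an alternative illustration, not the approach used in this document: it uses the built-in `mtcars` data, grouped by the number of cylinders, and all object names are hypothetical.

```r
# Split the built-in mtcars data by number of cylinders, then fit one model per group.
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(d) lm(mpg ~ wt, data = d))

# One column of coefficients (intercept and slope) per group.
coefs_by_group <- sapply(models, coefficients)
coefs_by_group

# R squared for each group.
rsq_by_group <- sapply(models, function(m) summary(m)$r.squared)
rsq_by_group
```

Comparing the columns of `coefs_by_group` shows how the slope and intercept differ from one group to the next.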

Use a dichotomous independent variable

Another thing you can do is use a dichotomous variable (one with only two attributes, such as "yes" and "no") as an independent variable.
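
As a minimal sketch of what this looks like, the example below uses the built-in `mtcars` data, where the variable `am` is already coded 0 or 1. This data set and the object names are assumptions made for illustration and are not part of the package data. When the independent variable is coded 0 and 1, the intercept is the mean of the dependent variable for the group coded 0, and the slope is the difference between the two group means.

```r
# In the built-in mtcars data, am is coded 0 (automatic) or 1 (manual).
dich_fit <- lm(mpg ~ am, data = mtcars)
coefficients(dich_fit)

mean_0 <- mean(mtcars$mpg[mtcars$am == 0])
mean_1 <- mean(mtcars$mpg[mtcars$am == 1])

# The intercept is the mean for the group coded 0,
# and the slope is the difference between the two group means.
all.equal(unname(coefficients(dich_fit)["(Intercept)"]), mean_0)
all.equal(unname(coefficients(dich_fit)["am"]), mean_1 - mean_0)
```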

Use multiple independent variables

Generally sociologists do not believe that there is just one important variable for explaining a dependent variable. This is for many reasons that won't be described here.

It is easy to add more independent variables to your analysis. All you need to do is add them using a + sign.

results2 <- lm(Percent.of.adults.with.less.than.a.high.school.diploma..2009.2013 
   ~ 
  Percent.of.adults.with.less.than.a.high.school.diploma..1970
  +
  Percent.of.adults.with.less.than.a.high.school.diploma..1990
  , 
  education.states)


summary(results2)
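
With two independent variables the equation becomes $\hat y = a + b_1x_1 + b_2x_2$. The sketch below, which uses the built-in `mtcars` data and hypothetical object names rather than the package data, computes one predicted value by hand from the coefficients and checks it against predict().

```r
# Two independent variables: weight (wt) and horsepower (hp).
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
b <- coefficients(multi_fit)

# A hypothetical car with wt = 3 and hp = 110.
by_hand <- b[["(Intercept)"]] + b[["wt"]] * 3 + b[["hp"]] * 110

# The same prediction using predict().
by_predict <- predict(multi_fit, newdata = data.frame(wt = 3, hp = 110))

all.equal(by_hand, unname(by_predict))
```

Each coefficient describes the change in the predicted value for a one-unit change in that variable while the other independent variable is held constant.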

Use a nominal independent variable

Normally we think of regression as analyzing interval or ratio variables, but it is also possible to use a nominal independent variable. In R nominal variables are stored as factors.

Here is a table of means for each region.

comparison_poverty <- education_and_poverty %>% 
  group_by(region) %>% 
  summarize(mean(PCTPOVALL_2013))

comparison_poverty

Here is a regression with PCTPOVALL_2013 as the dependent variable and region as the independent variable.

results3 <- lm(PCTPOVALL_2013 ~ region, education_and_poverty)

coefficients(results3)

Comparing the two results, we can see that there are only 3 region coefficients instead of the 4 that you might have expected. Also, the intercept is equal to the mean for the Midwest. Midwest is called the excluded value (also known as the reference category), and this kind of analysis will always exclude one value.
Now notice that the coefficients for the other regions are all very different from the region means. However, if you add the coefficient for a region to the intercept, you get the mean for that region. So the coefficients represent the difference from the mean of the excluded value that is associated with being in a particular region.

If you want to make a predicted value for a state in a given region you would do this:

If it is in the Midwest, just use the intercept. If it is in another region, add the coefficient for that region to the intercept. Ignore the coefficients for all of the other regions.

This is sometimes called regression with dummy variables.
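
The link between the coefficients and the group means can be checked with any factor. The sketch below uses the built-in `iris` data, where `Species` is a factor with three levels, as a stand-in for region; the data set and object names are assumptions made for illustration.

```r
# iris is built into R; Species is a factor with three levels.
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means

nominal_fit <- lm(Sepal.Length ~ Species, data = iris)
b <- coefficients(nominal_fit)
b

# The intercept is the mean of the excluded level (setosa, the first level).
all.equal(b[["(Intercept)"]], group_means[["setosa"]])

# Intercept plus a level's coefficient gives that level's mean.
all.equal(b[["(Intercept)"]] + b[["Speciesversicolor"]], group_means[["versicolor"]])
```

By default R excludes the first level of the factor, which is why the choice of excluded value depends on how the levels are ordered.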

Regression with a nominal dependent variable

Although nominal independent variables are relatively easy to incorporate, nominal dependent variables do not work well with ordinary least squares regression. They require a more advanced analysis that is beyond the scope of this document. When reading journals, if you see the term logistic regression or generalized linear model, it generally means that the article is using a dependent variable that doesn't work well with ordinary least squares regression. Logistic regression is what is specifically used with a dichotomous dependent variable.

You can find more useful information for regression analysis in the Making Graphs and Confidence Intervals documentation.



SOC345/lehmansociology documentation built on May 9, 2019, 11:41 a.m.