This lab is designed to help you understand the concept of a standard (z) score. It also has you thinking more about units of measurement and standard deviation.

# Load your libraries here
library(aws.s3)
library('lehmansociology')
library('ggplot2')
library(dplyr)

This next chunk loads the GSS data.

s3load('gss.Rda', bucket = 'lehmansociologydata')

This next chunk focuses on tvhours, which is an interval-ratio variable.

#the line below RECODES implausible values to missing values so that they are ignored 
# (note that this is hours per day so it cannot be greater than 24)
GSS$tvhours[GSS$tvhours >24]<-NA

Now that we've fixed the problem with having implausible values, we can look at average hours per day spent watching TV.

summary(GSS$tvhours)

Notice that summary on the line above will show you the number of NA (i.e., missing) values. However, recall from the swirl lesson called "Working with Variables" that if you have missing values R will give you a value of NA when you ask for specific statistics, such as the mean and sd. Therefore, the two lines of code below add na.rm=TRUE which tells R that those are missing values that should be ignored to compute the mean and sd.

mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours, na.rm=TRUE)
#note the added code section for geom_vline in the histogram below. 
ggplot_tvhours <-ggplot(GSS, aes(tvhours))
ggplot_tvhours + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
    ggtitle("") +   
    labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvhours, na.rm=TRUE),  color="blue", linetype="dashed", size=1) 

Describe what the code segment that begins with geom_vline adds to your histogram. Why is the line where it is? Are you surprised by the placement of the vertical line? Be specific.

# Now let's convert (AKA transform) the unit of measurement from hours to minutes 
# and create a new variable that measures the number of minutes
# spent watching tv per day.
GSS$tvmins <-GSS$tvhours*60
mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins, na.rm=TRUE)
# Now let's convert the unit of measurement from hours to days. 
# Note that multiplying by 1/24 is the same as dividing by 24 since there are
# 24 hours in a day.
GSS$tvdays <-GSS$tvhours*1/24
mean(GSS$tvdays, na.rm=TRUE)
sd(GSS$tvdays, na.rm=TRUE)
#Now let's see what happens if we convert the units to standard deviations. 
GSS$tvsd <- GSS$tvhours*1/sd(GSS$tvhours, na.rm=TRUE)
mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd, na.rm=TRUE)

Describe how the mean for tvmins, tvdays, and tvsd are related to the mean for tvhours.

What is unique about the standard deviation of tvsd?

Now let's make histograms of the new variables. Notice that we are changing the binwidth. Think about why we are changing the binwidth to 60 for minutes, 1/24 for hours and 1/sd(tvhrs) for standard deviations.
If you want, after your run them this way, you can change the binwidth back to 1 and see what happens.

ggplot_tvmins <-ggplot(GSS, aes(tvmins))
ggplot_tvmins + geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("")+  
    labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvmins, na.rm=TRUE),  color="green", linetype="dashed", size=1) 
ggplot_tvdays <-ggplot(GSS, aes(tvdays))
ggplot_tvdays + geom_histogram(binwidth =1/24, aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("") +   
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvdays, na.rm=TRUE),  color="orange", linetype="dashed", size=1) 
ggplot_tvsd <-ggplot(GSS, aes(tvsd))
ggplot_tvsd +
    geom_histogram(binwidth =1/sd(GSS$tvhours, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("") +   
    labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvsd, na.rm=TRUE),  color="purple", linetype="dashed", size=1) 

Now let's create some new variables where instead of the actual values we have the difference from the value to the mean. This is sometimes called "recentering." Then we will get the standard deviation and histogram for each new variable.

GSS$tvhours0<-GSS$tvhours - mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours0, na.rm=TRUE)
ggplot_tvhours0 <-ggplot(GSS, aes(tvhours0))
ggplot_tvhours0 + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("") +   
    labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvhours0, na.rm=TRUE),  color="blue", linetype="dashed", size=.5) 
GSS$tvmins0<-GSS$tvmins - mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins0, na.rm=TRUE)
ggplot_tvmins0 <-ggplot(GSS, aes(tvmins0))
ggplot_tvmins0 + 
    geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("") +   labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvmins0, na.rm=TRUE),  color="green", linetype="dashed", size=.5) 
GSS$tvdays0<-GSS$tvdays - mean(GSS$tvdays, na.rm=TRUE)
sd(GSS$tvdays0, na.rm=TRUE)
ggplot_tvdays0 <-ggplot(GSS, aes(tvdays0))
ggplot_tvdays0 + 
    geom_histogram(binwidth =1/24, aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("") +   
    labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvdays0, na.rm=TRUE),  color="orange", linetype="dashed", size=.5) 
GSS$tvsd0<-GSS$tvsd - mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd0, na.rm=TRUE)
ggplot_tvsd0 <-ggplot(GSS, aes(tvsd0))
ggplot_tvsd0 + 
    geom_histogram(binwidth =1/sd(GSS$tvhours0, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) + 
    ggtitle("") +   
    labs(y="Percent",     x="") + 
    geom_vline(xintercept=mean(GSS$tvsd0, na.rm=TRUE),  color="purple", linetype="dashed", size=.5) 

Why is subtracting the mean from each value sometimes called "recentering"?

What does a negative value on these variables mean?

Describe how the standard deviations for these recentered variables compare to the standard deviations for the previous (comparable) variables.

When we convert the value of an observation into units of "standard deviations above the mean" or "standard deviations below the mean" those new scores are called Z-SCORES.



lehmansociology/lehmansociology documentation built on May 21, 2022, 9:06 p.m.