208 Data Analysis

knitr::opts_chunk$set(fig.height=3)

## Do not delete this!
## It loads the s20x library for you. If you delete it 
## your document may not compile
library(s20x)
## From now on we also need the emmeans library:
library(emmeans)

Question 1

A researcher was interested in comparing e-mail habits of adults, depending on their marital status. They collected a random sample of data from the U.S. General Social Surveys.

The data is stored is email.csv and contains the following variables:

Variable | Description -----------|-------------------------------------------------------------------------- email | The number of e-mails a person sent in the last 24 hours (day) recorded at the time of the survey. marital | The persons current marital status (Married , Divorced or Never).

Instructions:

Comment on the plot,table and summary statistics of the data.
Fit an appropriate GLM to model the number of emails sent. Consider all possible pairwise comparisons of marital status. Generate confidence interval output for this model. See the note below.
Write appropriate Methods and Assumption Checks.
Write an appropriate Executive Summary.

Note: ANOVA tests for GLMs:

For regular GLM models, use the command: anova(fit, test = "Chisq") However, for quasi GLM models, use the command: anova(fit, test = "F")

Questions of Interest

It was of interest to compare e-mail habits of adults, depending on their marital status.

Read in and inspect the data:

load(system.file("extdata", "email.df.rda", package = "s20x"))

email.df = read.csv("email.csv",stringsAsFactors = TRUE)
with(email.df, table(email))  
boxplot(email~marital,horizontal=TRUE,data=email.df)
summaryStats(email~marital,data=email.df)

with(email.df, table(email))  
boxplot(email~marital,horizontal=TRUE,data=email.df)
summaryStats(email~marital,data=email.df)

Comment on plot(s) and/or summary statistics

It is difficult to see any difference between the three groups due to the amount of background variation. Most people tend to send 0, 1 or 2 e-mails per day, so all three groups have medians of 2. But all three groups have long upper tails. For the divorced group, the upper tail is slightly longer. The mean is highest in the divorced group and lowest in the married group.

The tables reveal that treating the response as counts is reasonable. The box plots reveal extreme skewness and the variance is considerably larger than the mean, therefore there is very clear evidence of overdispersion relative to the Poisson model. It will not be surprising to find that we need a quasi-Poisson model over an ordinary Poisson regression.

Fit an appropriate GLM:

email.fit1 = glm(email ~ marital, family=poisson, data = email.df)
summary(email.fit1)
plot(email.fit1, which = 1)

1 - pchisq(deviance(email.fit1), df = df.residual(email.fit1))
# Overdispersion is extremely strong

# Quasi-Poisson fit
email.fit2 = glm(email ~ marital, family = quasipoisson, data = email.df)
summary(email.fit2)
anova(email.fit2, test = "F")


confint(email.fit2)
exp(confint(email.fit2)[2,])
100 * (exp(confint(email.fit2)[2,]) - 1)

# Rotate factor
email.df <-within(email.df ,{maritalR <- factor(marital,levels = c("never", "married", "divorced"))})
email.fit3 = glm(email ~ maritalR, family = quasipoisson, data = email.df)
summary(email.fit3)

conf1 = t(100 * (exp(confint(email.fit2)[2,]) - 1))
conf1 = data.frame(conf1)
names(conf1) <- c("lower", "upper")
resultStr1 = paste0(sprintf("%.1f%%", abs(conf1$upper)), " and ", sprintf("%.1f%%", abs(conf1$lower)))

Method and Assumption Checks

A Poisson log-linear model was fitted to the counts of number of emails. The explanatory variables was marital status, a factor variable with 3 levels.

Residuals looked fine. The Residual deviance is much bigger than the residual deviance degrees of freedom, so it is no surprise that the P-value for the model adequacy test was small and comes out as 0. Therefore we have strong evidence that the data is over-dispersed so the model was refitted as quasipoisson.

The final model is approximately [ \log(\mu_i) = \beta_0 + \beta_1 \times maritalmarried_i + \beta_2 \times maritalnever_i ] where $\mu_i$ is the mean number of e-mail sent by person $i$, $maritalmarried_i$ is 1 if person $i$ is married and 0 otherwise, similarly for $maritalnever_i$. $Email_i$ has has an overdispersed distribution with underlying mean $\mu_i$.

Note 1: this is not a Poisson distribution in this case as we used a QuasiPoisson model.)

Note 2: could use $E[Count_i]$ for $\mu_i$ as have the same definition.

Executive Summary

We were interested in comparing the number of e-mails sent per day between people who were married, divorced or never married.

We found that there was evidence that married people send fewer e-mails than divorced people. There was weak evidence that married people also send fewer e-mails than people who have never been married.

We estimate that the expected number of e-mails that married people send is somewhere between r resultStr1[1] lower than that sent by people who have been divorced.