knitr::opts_chunk$set(fig.height=3)
## Do not delete this! ## It loads the s20x library for you. If you delete it ## your document may not compile library(s20x) ## From now on we also need the emmeans library: library(emmeans)
A researcher was interested in comparing e-mail habits of adults, depending on their marital status. They collected a random sample of data from the U.S. General Social Surveys.
The data is stored is email.csv and contains the following variables:
Variable | Description -----------|-------------------------------------------------------------------------- email | The number of e-mails a person sent in the last 24 hours (day) recorded at the time of the survey. marital | The persons current marital status (Married , Divorced or Never).
Instructions:
Note: ANOVA tests for GLMs:
For regular GLM models, use the command: anova(fit, test = "Chisq")
However, for quasi GLM models, use the command: anova(fit, test = "F")
It was of interest to compare e-mail habits of adults, depending on their marital status.
load(system.file("extdata", "email.df.rda", package = "s20x"))
email.df = read.csv("email.csv",stringsAsFactors = TRUE) with(email.df, table(email)) boxplot(email~marital,horizontal=TRUE,data=email.df) summaryStats(email~marital,data=email.df)
with(email.df, table(email)) boxplot(email~marital,horizontal=TRUE,data=email.df) summaryStats(email~marital,data=email.df)
It is difficult to see any difference between the three groups due to the amount of background variation. Most people tend to send 0, 1 or 2 e-mails per day, so all three groups have medians of 2. But all three groups have long upper tails. For the divorced group, the upper tail is slightly longer. The mean is highest in the divorced group and lowest in the married group.
The tables reveal that treating the response as counts is reasonable. The box plots reveal extreme skewness and the variance is considerably larger than the mean, therefore there is very clear evidence of overdispersion relative to the Poisson model. It will not be surprising to find that we need a quasi-Poisson model over an ordinary Poisson regression.
email.fit1 = glm(email ~ marital, family=poisson, data = email.df) summary(email.fit1) plot(email.fit1, which = 1) 1 - pchisq(deviance(email.fit1), df = df.residual(email.fit1)) # Overdispersion is extremely strong # Quasi-Poisson fit email.fit2 = glm(email ~ marital, family = quasipoisson, data = email.df) summary(email.fit2) anova(email.fit2, test = "F") confint(email.fit2) exp(confint(email.fit2)[2,]) 100 * (exp(confint(email.fit2)[2,]) - 1) # Rotate factor email.df <-within(email.df ,{maritalR <- factor(marital,levels = c("never", "married", "divorced"))}) email.fit3 = glm(email ~ maritalR, family = quasipoisson, data = email.df) summary(email.fit3)
conf1 = t(100 * (exp(confint(email.fit2)[2,]) - 1)) conf1 = data.frame(conf1) names(conf1) <- c("lower", "upper") resultStr1 = paste0(sprintf("%.1f%%", abs(conf1$upper)), " and ", sprintf("%.1f%%", abs(conf1$lower)))
A Poisson log-linear model was fitted to the counts of number of emails. The explanatory variables was marital status, a factor variable with 3 levels.
Residuals looked fine. The Residual deviance is much bigger than the residual deviance degrees of freedom, so it is no surprise that the P-value for the model adequacy test was small and comes out as 0. Therefore we have strong evidence that the data is over-dispersed so the model was refitted as quasipoisson.
The final model is approximately [ \log(\mu_i) = \beta_0 + \beta_1 \times maritalmarried_i + \beta_2 \times maritalnever_i ] where $\mu_i$ is the mean number of e-mail sent by person $i$, $maritalmarried_i$ is 1 if person $i$ is married and 0 otherwise, similarly for $maritalnever_i$. $Email_i$ has has an overdispersed distribution with underlying mean $\mu_i$.
Note 1: this is not a Poisson distribution in this case as we used a QuasiPoisson model.)
Note 2: could use $E[Count_i]$ for $\mu_i$ as have the same definition.
We were interested in comparing the number of e-mails sent per day between people who were married, divorced or never married.
We found that there was evidence that married people send fewer e-mails than divorced people. There was weak evidence that married people also send fewer e-mails than people who have never been married.
We estimate that the expected number of e-mails that married people send is somewhere between r resultStr1[1] lower than that sent by people who have been divorced.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.