In doomlab/learnSTATS: Introduction to Statistics (ANLY 500)

library(learnr)
library(learnSTATS)
library(faux)
library(mice)
library(corrplot)
data("datascreen_data")

datascreen_data$Gender <- factor(datascreen_data$Gender, 
                               levels = 1:2)
datascreen_data$Group <- factor(datascreen_data$Group, 
                              levels = 1:2)
datascreen_data <- subset(datascreen_data, 
                        complete.cases(datascreen_data))

datascreen_data <- sim_df(datascreen_data, #data frame
                      sample(50:100, 1), #how many of each group
                      between = c("Gender", "Group")) 

datascreen_data <- messy(datascreen_data, 
                     prop = .02,
                     2:6,
                     replace = NA)

datascreen_data <- datascreen_data[ , -1]

notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender, 
                        levels = c(1,2), 
                        labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group, 
                       levels = c(1,2), 
                       labels = c("Control", "Experimental"))
notypos[ , 3:5][ notypos[ , 3:5] > 125 ] <- NA
notypos[ , 3:5][ notypos[ , 3:5] < 2 ] <- NA
percentmiss <- function(x){ sum(is.na(x))/length(x) *100 }
missing <- apply(notypos, 1, percentmiss)
replacerows <- subset(notypos, missing <= 25)
norows <- subset(notypos, missing > 25)
replacecols <- replacerows[ , c(3:5)]
nocols <- replacerows[ , -c(3:5)]
tempnomiss <- mice(replacecols, print = FALSE)
nomiss <- complete(tempnomiss, 1)
allcolumns <- cbind(nocols, nomiss)
mahal <- mahalanobis(allcolumns[ , -c(1,2)], 
                    colMeans(allcolumns[ , -c(1,2)], na.rm = TRUE),
                    cov(allcolumns[ , -c(1,2)], use="pairwise.complete.obs"))
cutoff <- qchisq(1 - .001, ncol(allcolumns[ , -c(1,2)])) 
noout <- subset(allcolumns, mahal < cutoff)

knitr::opts_chunk$set(echo = FALSE)

Data Screening Assumptions

Last week, we worked on data screening for accuracy, missing, and outliers. This week you will get to practice screening for assumptions of parametric tests. We will work through how to use R to clean up our dataset from last week, testing the following assumptions:

Additivity
Normality
Linearity
Homogeneity
Homoscedasticity

Data Screening Week 2 Video Part 1

The following videos are provided as a lecture for the class material. The lectures are provided here as part of the flow for the course. You can view the lecture notes within R using vignette("Data-Screen-2", "learnSTATS"). You can skip these pages if you are in class to go on to the assignment.

Data Screening Week 2 Video Part 2

Exercises

In this next section, you will answer questions using the R code blocks provided. Be sure to use the solution option to see the answer if you need it!

Please enter your name for submission. If you do not need to submit, just type anything you'd like in this box.

question_text(
  "Student Name:",
  answer("Your Name", correct = TRUE),
  incorrect = "Thanks!",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Dataset

A large number employees participated in a company-wide experiment to test if an educational program would be effective at increasing employee satisfaction. Half of the employees were assigned to be in the control group, while the other half were assigned to be in the experimental group. The experimental group was the only group that received the educational intervention. All groups were given an employee satisfaction scale at time one to measure their initial levels of satisfaction. The same scale was then used half way through the program and at the end of the program. The goal of the experiment was to assess satisfaction to see if it increased across the measurements during the program as compared to a control group.

a) Gender (1 = male, 2 = female) b) Group (1 = control group, 2 = experimental group) c) 3 satisfaction scores, ranging from 2-125 points. Decimals are possible! The control group was measured at the same three time points, but did not take part in the educational program. i) Before the program ii) Half way through the program iii) After the program

head(datascreen_data)

Additivity

Using our noout dataset from the last steps of our outlier check, let's screen our continuous variables for additivity. Create a correlation table and visualization using the corrplot library.

library(corrplot)

library(corrplot)
corrplot(cor(noout[ , c(3:5)]))

question_text(
  "Do you meet the assumption for additivity?",
  answer("Yes!", correct = TRUE),
  incorrect = "Yes, because all of our correlations are below .9.",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Linearity

To get started, we need to create our fake regression and other variables that help us assess the rest of the assumptions. Using the code from class, create the random variable, fake regression, standardized residuals, and standardized fit values. Finally, add the qqnorm plot to assess for linearity.

##set up for all assumptions

##linearity

random <- rchisq(nrow(noout), 7)
fake <- lm(random~., data = noout)
standardized <- rstudent(fake)
fitvalues <- scale(fake$fitted.values)

qqnorm(standardized)
abline(0,1)

question_text(
  "Do you meet the assumption for linearity?",
  answer("Yes!", correct = TRUE),
  incorrect = "While each graph is different, likely yes because the dots line up on the line between -2 and 2.",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Normality

In this section, let's assess multivariate normality by examining a hist() on our standardized residuals.

random <- rchisq(nrow(noout), 7)
fake <- lm(random~., data = noout)
standardized <- rstudent(fake)
fitvalues <- scale(fake$fitted.values)

hist(standardized, breaks = 15)

question_text(
  "Do you meet the assumption for normality?",
  answer("Yes!", correct = TRUE),
  incorrect = "While each graph is different, you may have a little bit of positive skew.",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Homogeneity/Homoscedasticity

Last, let's include our residual scatterplot of our fitvalues and standardized residuals. Use this plot to help answer your final two assumptions!

random <- rchisq(nrow(noout), 7)
fake <- lm(random~., data = noout)
standardized <- rstudent(fake)
fitvalues <- scale(fake$fitted.values)

plot(fitvalues, standardized) 
abline(0,0)
abline(v = 0)

question_text(
  "Do you meet the assumption for homogeneity and homoscedasticity?",
  answer("Yes!", correct = TRUE),
  incorrect = "Everyone's graph will look different. You should check the spread of the dots around zero and the spread of the dots across the whole graph. Explain your answer: yes because it's all evenly spread or no because it looks like a giant triangle.",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Submit

On this page, you will create the submission for your instructor (if necessary). Please copy this report and submit using a Word document or paste into the text window of your submission page. Click "Generate Submission" to get your work!