library(learnr)
library(learnSTATS)
library(faux)
library(mice)

data("datascreen_data")
datascreen_data$Gender <- factor(datascreen_data$Gender, levels = 1:2)
datascreen_data$Group <- factor(datascreen_data$Group, levels = 1:2)
datascreen_data <- subset(datascreen_data, complete.cases(datascreen_data))
datascreen_data <- sim_df(datascreen_data, #data frame
                          sample(50:100, 1), #how many of each group
                          between = c("Gender", "Group"))
datascreen_data <- messy(datascreen_data, prop = .02, 2:6, replace = NA)
datascreen_data <- datascreen_data[ , -1]
knitr::opts_chunk$set(echo = FALSE)
Now that we know a little bit of R and a little bit of statistical theory, let's dive into the most critical component of any project: data screening. Remember: messy data produces messy answers. In this section, we cover the first steps of data screening: checking variables for typos and out-of-range values, handling missing data, and screening for outliers.
The following videos are provided as a lecture for the class material and are included here as part of the flow for the course. You can view the lecture notes within R using vignette("Data-Screen-1", "learnSTATS"). If you are attending class, you can skip these pages and go on to the assignment.
In this next section, you will answer questions using the R code blocks provided. Be sure to use the solution option to see the answer if you need it!
Please enter your name for submission. If you do not need to submit, just type anything you'd like in this box.
question_text(
  "Student Name:",
  answer("Your Name", correct = TRUE),
  incorrect = "Thanks!",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
A large number of employees participated in a company-wide experiment to test whether an educational program would be effective at increasing employee satisfaction. Half of the employees were assigned to the control group, while the other half were assigned to the experimental group. Only the experimental group received the educational intervention. Both groups were given an employee satisfaction scale at time one to measure their initial levels of satisfaction. The same scale was then used halfway through the program and at the end of the program. The goal of the experiment was to assess whether satisfaction increased across the measurements during the program as compared to the control group.
a) Gender (1 = male, 2 = female)
b) Group (1 = control group, 2 = experimental group)
c) Three satisfaction scores, ranging from 2 to 125 points (decimals are possible!), measured:
   i) Before the program
   ii) Halfway through the program
   iii) After the program

The control group was measured at the same three time points but did not take part in the educational program.
head(datascreen_data)
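Before fixing anything, it helps to confirm how R is currently storing each variable. The check below is only a sketch (your output will differ because the data are randomized), but it shows the variable types and the codes used in the two categorical variables.

##check variable types and the codes in the categorical variables
str(datascreen_data)
table(datascreen_data$Gender, useNA = "ifany")
table(datascreen_data$Group, useNA = "ifany")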
First, let's fix the issues with the categorical variables. We should factor each variable to accurately denote the proper labels (you can go back to the previous page to view this information again). We will save this dataset as notypos to get started.
notypos <- datascreen_data
notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender,
                         levels = c(1,2),
                         labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group,
                        levels = c(1,2),
                        labels = c("Control", "Experimental"))
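To confirm the relabeling worked, a quick look at the factor counts is useful. This is just a sketch of one way to verify; any summary of the two columns works.

##confirm the new labels and their counts
table(notypos$Gender)
table(notypos$Group)
summary(notypos)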
Next, we should check our three continuous variables for issues with their ranges. These variables should range from 2 to 125. Fix any out-of-range values below and use summary() to make sure things look correct.
notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender,
                         levels = c(1,2),
                         labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group,
                        levels = c(1,2),
                        labels = c("Control", "Experimental"))
##more than one way to do this, but here's an example
notypos[ , 3:5][ notypos[ , 3:5] > 125 ] <- NA
notypos[ , 3:5][ notypos[ , 3:5] < 2 ] <- NA
summary(notypos)
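If you prefer not to rely on column positions, the same fix can be written against column names. This sketch assumes the three satisfaction columns are named Begin, Middle, and After, as they are referred to later in this assignment.

##same range fix using column names instead of positions
scorecols <- c("Begin", "Middle", "After")
notypos[ , scorecols][ notypos[ , scorecols] > 125 ] <- NA
notypos[ , scorecols][ notypos[ , scorecols] < 2 ] <- NA
summary(notypos[ , scorecols])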
As we saw in the previous summary, we have missing data. Our normal rule for estimating missing data is 5%, but for this exercise, let's fill in data for anyone who is missing 25% or less of their data.
Start by creating a table of missing values by rows, and subset out the participants who are missing more than 25% of their data.
notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender,
                         levels = c(1,2),
                         labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group,
                        levels = c(1,2),
                        labels = c("Control", "Experimental"))
notypos[ , 3:5][ notypos[ , 3:5] > 125 ] <- NA
notypos[ , 3:5][ notypos[ , 3:5] < 2 ] <- NA
##include the percent missing function

##use apply to get the missing values

##use a table to see the missing values

##create replacement rows
replacerows <- subset()

##create no replacement rows
norows <- subset()
##include the percent missing function
percentmiss <- function(x){
  sum(is.na(x)) / length(x) * 100
}

##use apply to get the missing values
missing <- apply(notypos, 1, percentmiss)

##use a table to see the missing values
table(missing)

##create replacement rows
replacerows <- subset(notypos, missing <= 25)

##create no replacement rows
norows <- subset(notypos, missing > 25)
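As a sanity check, the two subsets should account for every row in the original data. The quick comparison below is only a sketch; the exact counts will depend on your randomized data.

##the kept and dropped rows should add up to the full dataset
nrow(replacerows)
nrow(norows)
nrow(replacerows) + nrow(norows) == nrow(notypos)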
Next, we will filter out the columns that we should not replace. Check each column's percentage of missing data (our rule is less than 5%) and exclude the categorical variables. Because the data are randomized for this assignment, you should estimate missing data for Begin, Middle, and After.
notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender,
                         levels = c(1,2),
                         labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group,
                        levels = c(1,2),
                        labels = c("Control", "Experimental"))
notypos[ , 3:5][ notypos[ , 3:5] > 125 ] <- NA
notypos[ , 3:5][ notypos[ , 3:5] < 2 ] <- NA
percentmiss <- function(x){
  sum(is.na(x)) / length(x) * 100
}
missing <- apply(notypos, 1, percentmiss)
replacerows <- subset(notypos, missing <= 25)
norows <- subset(notypos, missing > 25)
##use apply to get the missing values, no need to save it

##create replacement columns
replacecols <- replacerows[ , c()]

##create no replacement columns
nocols <- replacerows[ , c()]
##use apply to get the missing values, no need to save it
apply(replacerows, 2, percentmiss)

##create replacement columns
replacecols <- replacerows[ , c(3:5)]

##create no replacement columns
nocols <- replacerows[ , -c(3:5)]
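To confirm the split, the replacement set should hold only the three satisfaction scores, and the no-replacement set only the categorical variables. A short check, as a sketch:

##replacecols should hold the continuous scores, nocols the factors
names(replacecols)
names(nocols)
apply(replacecols, 2, percentmiss)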
In our last step, let's use mice() and complete() to create the replacement dataset. Note the last line of code, which adds the original two columns back in. We will not add the rows with missing data back in.
notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender,
                         levels = c(1,2),
                         labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group,
                        levels = c(1,2),
                        labels = c("Control", "Experimental"))
notypos[ , 3:5][ notypos[ , 3:5] > 125 ] <- NA
notypos[ , 3:5][ notypos[ , 3:5] < 2 ] <- NA
percentmiss <- function(x){
  sum(is.na(x)) / length(x) * 100
}
missing <- apply(notypos, 1, percentmiss)
replacerows <- subset(notypos, missing <= 25)
norows <- subset(notypos, missing > 25)
replacecols <- replacerows[ , c(3:5)]
nocols <- replacerows[ , -c(3:5)]
tempnomiss <- mice() ##fill me in
nomiss <- complete() ##fill me in
allcolumns <- cbind(nocols, nomiss)
tempnomiss <- mice(replacecols)
nomiss <- complete(tempnomiss, 1)
allcolumns <- cbind(nocols, nomiss)
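After imputation, it is worth confirming that no missing values remain in the estimated columns and that the combined dataset has the expected number of rows. This is only a sketch of one such check.

##count remaining missing values and check dimensions
sum(is.na(nomiss))
dim(allcolumns)
summary(allcolumns)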
Using the allcolumns dataset, create the mahal scores, along with the cutoff score for your values. Use a summary of whether your Mahalanobis scores are less than the cutoff to figure out how many outliers you have. Last, create the noout dataset by using subset() to eliminate any outliers. The next class assignment will use this dataset to continue data screening.
notypos <- datascreen_data
notypos$Gender <- factor(notypos$Gender,
                         levels = c(1,2),
                         labels = c("Male", "Female"))
notypos$Group <- factor(notypos$Group,
                        levels = c(1,2),
                        labels = c("Control", "Experimental"))
notypos[ , 3:5][ notypos[ , 3:5] > 125 ] <- NA
notypos[ , 3:5][ notypos[ , 3:5] < 2 ] <- NA
percentmiss <- function(x){
  sum(is.na(x)) / length(x) * 100
}
missing <- apply(notypos, 1, percentmiss)
replacerows <- subset(notypos, missing <= 25)
norows <- subset(notypos, missing > 25)
replacecols <- replacerows[ , c(3:5)]
nocols <- replacerows[ , -c(3:5)]
tempnomiss <- mice(replacecols, print = FALSE)
nomiss <- complete(tempnomiss, 1)
allcolumns <- cbind(nocols, nomiss)
mahal <- mahalanobis(allcolumns[ , -c(1,2)],
                     colMeans(allcolumns[ , -c(1,2)], na.rm = TRUE),
                     cov(allcolumns[ , -c(1,2)], use = "pairwise.complete.obs"))
cutoff <- qchisq(1 - .001, ncol(allcolumns[ , -c(1,2)]))
summary(mahal < cutoff)
noout <- subset(allcolumns, mahal < cutoff)
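You can also count the outliers directly by tabling the logical comparison, or by comparing the number of rows before and after subsetting. A short sketch:

##outliers are the rows flagged FALSE
table(mahal < cutoff)
nrow(allcolumns) - nrow(noout)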
question_text(
  "How many outliers did you have? Look at the FALSE section of your table.",
  answer("Depends", correct = TRUE),
  incorrect = "This answer will depend on your randomized data, but likely is less than 10 outliers.",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
On this page, you will create the submission for your instructor (if necessary). Please copy this report and submit it as a Word document, or paste it into the text window of your submission page. Click "Generate Submission" to get your work!
encoder_logic()
encoder_ui()