```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Garbage in, garbage out:
Our checks should include:
Note that assumption checks depend on the type of test, because different tests make different assumptions.
Normally our $\alpha$ value is set to .05, otherwise known as the Type 1 error rate.
In data screening, we want to use a stricter criterion.
(4) Assumptions for the analysis performed.
Assumptions include:
```r
library(rio)
master <- import("data/data_screening.csv")
str(master)
```
We will want to check for several issues:
```r
notypos <- master # update the dataset with each step
apply(notypos[ , c("Sex", "SES")], 2, table) # 3 here for sex is probably incorrect
```
Use the `factor()` function to relabel the categorical variables, but do not include the bad label, which drops that incorrect point.

```r
## fix the categorical labels and typos
notypos$Sex <- factor(notypos$Sex,
                      levels = c(1, 2), # no 3
                      labels = c("Women", "Men"))
notypos$SES <- factor(notypos$SES,
                      levels = c(1, 2, 3),
                      labels = c("Low", "Medium", "High"))
apply(notypos[ , c("Sex", "SES")], 2, table)
```
Use the `summary()` function to view the summary statistics for our continuous variables.

```r
summary(notypos)
```
```r
summary(notypos$Grade)
notypos$Grade[ notypos$Grade > 12 ] <- NA
summary(notypos$Grade)

summary(notypos$Absences)
notypos$Absences[ notypos$Absences > 15 ] <- NA
summary(notypos$Absences)
```
```r
names(notypos)
head(notypos[ , 6:19]) # lots of ways to do this part!
notypos[ , 6:19][ notypos[ , 6:19] > 7 ] <- NA
summary(notypos)
```
You want to make sure it's the data you expect, and the mean can be used to make that judgment. The standard deviation indicates the spread of the data.
```r
names(notypos)
apply(notypos[ , -c(1,3)], 2, mean, na.rm = TRUE)
apply(notypos[ , -c(1,3)], 2, sd, na.rm = TRUE)
```
`summary()` can give a quick view, and `apply()` can be used to count the number of `NA` values in each column.

```r
summary(notypos)
apply(notypos, 2, function(x) { sum(is.na(x)) })
```
Missing data is an important problem and leads us to ask ourselves one question: Why is this data missing?
MCAR: missing completely at random.
MNAR: missing not at random.
There are ways to test for the type of missingness, but most times you can easily spot problematic MNAR data by checking the percentage of missing data or by using the `View()` function.
You should not replace:
You can conservatively replace:
Note: there is a difference between missing data and incomplete data.
```r
knitr::include_graphics("pictures/datascreen/missing.png")
```
Mean substitution was a popular way to handle missing data: you simply replace each missing point with the mean of that variable.
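A minimal sketch of mean substitution, using a small hypothetical vector `x`:

```r
# Mean substitution: fill each NA with the variable's observed mean
x <- c(4, 7, NA, 5, NA, 6)           # hypothetical variable with missing points
x[is.na(x)] <- mean(x, na.rm = TRUE) # mean of the non-missing values is 5.5
x                                    # all values now complete
```

Because every replaced point sits exactly at the mean, this approach shrinks the variance, which is one reason multiple imputation is now preferred.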
Multiple imputation is now the most popular way to estimate missing data points because statistical software has made the process much easier.
```r
library(VIM, quietly = TRUE)
aggr(notypos, numbers = TRUE)
```
```r
percentmiss <- function(x){ sum(is.na(x)) / length(x) * 100 }
missing <- apply(notypos, 1, percentmiss)
table(missing)
```
```r
replace_rows <- subset(notypos, missing <= 5) # 5%
noreplace_rows <- subset(notypos, missing > 5)
nrow(notypos)
nrow(replace_rows)
nrow(noreplace_rows)
```
Check the missing percentage by column in `replace_rows`, because excluding the incomplete rows may have eliminated any issues by column.

```r
apply(replace_rows, 2, percentmiss)
```
`mice` uses all available information to estimate the missing data.

```r
replace_columns <- replace_rows[ , -c(1, 2, 4)]
noreplace_columns <- replace_rows[ , c(1, 2, 4)] # notice these are both replace_rows
```
The `mice()` function will figure out the type of data based on the column structure and replace the missing values with that type of data.

```r
library(mice)
temp_no_miss <- mice(replace_columns)
```
Once `mice` has run, pick one of the imputed datasets with `complete()` and combine everything back together.

```r
nomiss <- complete(temp_no_miss, 1) # pick a dataset 1-5

# combine back together
dim(notypos) # original data from previous step
dim(nomiss)  # replaced data

# get all columns
all_columns <- cbind(noreplace_columns, nomiss)
dim(all_columns)

# get all rows
all_rows <- rbind(noreplace_rows, all_columns)
dim(all_rows)
```
Definition: a case with an extreme value on one variable or on multiple variables.
Why does an outlier occur?
The logic of removing outliers:
There are two (2) types:
Univariate - when you have one (1) DV or Y variable.
Multivariate - when you have multiple continuous variables, measurements, or dependent variables.
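The univariate case above can be sketched with standardized z-scores; the data vector here is hypothetical, and the |z| > 3.29 cutoff corresponds to the stricter p < .001 screening criterion used later for the multivariate check:

```r
# Univariate outlier screen with z-scores (hypothetical data)
set.seed(1)
y <- c(rnorm(100, mean = 12, sd = 1), 50) # one clearly extreme case
z <- scale(y)                             # standardize: (y - mean(y)) / sd(y)
which(abs(z) > 3.29)                      # flag cases past the p < .001 cutoff
```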
Mahalanobis distance is a distance measure:
How do we know what is very far away?
Again, because we save multiple datasets, we can test the analysis with and without outliers to help us determine their impact on our analyses.
```r
## you can use all columns or all rows here
## however, all rows has missing data, which will not get a score
str(all_columns)
mahal <- mahalanobis(all_columns[ , -c(1,4)],
                     colMeans(all_columns[ , -c(1,4)], na.rm = TRUE),
                     cov(all_columns[ , -c(1,4)], use = "pairwise.complete.obs"))
```
```r
## remember to match the number of columns
cutoff <- qchisq(1 - .001, ncol(all_columns[ , -c(1,4)]))

## df and cutoff
ncol(all_columns[ , -c(1,4)])
cutoff

## how many outliers? Look at FALSE
summary(mahal < cutoff)

## eliminate
noout <- subset(all_columns, mahal < cutoff)
dim(all_columns)
dim(noout)
```
In this lecture, you have learned: