knitr::opts_chunk$set(echo = TRUE, comment = NA, error = TRUE)

Using on activity dataset

library(RepDataPeerAssessment1)

data("activity")
head(activity)
steps <- activity$steps
summary(steps)

We have r sum(is.na(steps)) NAs in this variable.

How important is it?

We have r round(mean(is.na(steps)) * 100, 0) percent of missing values!

How many values are zeros?

library(dplyr)

filter(activity) %>%
  summarize(zeros = mean(steps != 0, na.rm = TRUE))

Where our data lies

Source: http://thomasleeper.com/Rcourse/Tutorials/NAhandling.html

Analyze a vector or variable of the dataframe:

length(steps[is.na(steps) == TRUE])     # number of NAs

Analyze for NAs in a the whole dataframe:

length(steps[is.na(steps) == FALSE])     # number of non-NAs
summary(activity)

Visualizing missing data in a plot

image(is.na(activity),
      xaxt = "n",         # suppress plotting of x axis
      yaxt = "n",         # suppress plotting of y axis
      bty = "n"           # do not draw a box around
      )

# custom labels for x and y axis
axis(1, seq(0, 1, length.out = nrow(activity)), 1:nrow(activity), col = "white")
axis(2,                 # y axis
     c(0, 0.5, 1),      # sequence of labels
     names(activity),   # variables names of data frame
     col = "white",     # color
     las = 2)           # 2: perpendicular to the axis. Defaults = 0

box(lty = '1373', col = 'black')    # draw a box around

Using the mean to replace NA values

Source: https://www.r-bloggers.com/example-2014-5-simple-mean-imputation/

df = data.frame(x = 1:20, y = c(1:10,rep(NA,10)))
head(df)
tail(df)
df$y[is.na(df$y)] = mean(df$y, na.rm = TRUE)
tail(df)

Alternative using transform and ifelse

# alternative
df = data.frame(x = 1:20, y = c(1:10,rep(NA,10)))
head(df)
tail(df)

df = transform(df, y = ifelse(is.na(y), mean(y, na.rm=TRUE), y))
tail(df)

In the first example, we identify elements of y that are NA, and replace them with the mean, if so. In the second, we test each element of y; if it is NA, we replace with the mean, otherwise we replace with the original value.



AlfonsoRReyes/RepDataPeerAssessment1 documentation built on May 5, 2019, 4:53 a.m.