knitr::opts_chunk$set(echo = TRUE, comment = NA, error = TRUE)

A. Loading and preprocessing the data

We start by downloading the raw data using the link provided by the instructor. We use a function that will download the zip file, unpack it and place it in an indicated directory. The function is called downloadZip.

1. Load the data

Downloading and unpacking raw data file

library(RepDataPeerAssessment1)
library(rprojroot)

cat("Setting up the project folders:\n")
project.data <- find_package_root_file('data')
project.extdata <- find_package_root_file('inst/extdata')
project.R <- find_package_root_file('R')
project.data
project.R
project.extdata
downloadZip <- function(fileUrl, outDir="./data") {
  # function to download zipped file and unpack
  temp <- tempfile()
  download.file(fileUrl, temp, mode = "wb")
  unzip(temp, exdir = outDir)
}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
cat("Unpacking the raw data file:\n")
outDir <- project.extdata             # folder for raw data
downloadZip(fileUrl, outDir = outDir)   # download and unpack zip file

\

Create the RData file

More RData files may be generated during this assignment. They will be placed under the folder data.

Saving the raw dataset to the data folder

Reads the CSV raw data file from inst/extdata and save it under data. Then remove it from memory.

# save the dataset
activity.raw <- read.csv(paste(project.extdata, "activity.csv", sep = "/"))
save(activity.raw, file=paste(project.data, "activity.raw.RData", sep = "/"))
head(activity.raw)

Convert the variable date from as.factor to as.date

activity <- activity.raw
activity$date <- as.Date(activity.raw$date)
str(activity) 
# save the dataset
save(activity, file=paste(project.data, "activity.RData", sep = "/"))
head(activity)

Confirm it has been saved:

rm(activity)                                            # remove variable from memory
load(paste(project.data, "activity.RData", sep = "/"))  # load data file
cat("Checking dataset has the structure we want\n\n")
str(activity)
head(activity)
# file.exists(paste(project.data, "activity.RData", sep = "/"))  # we could use this too

\

2. Process and transform

Process/transform the data (if necessary) into a format suitable for the analysis.

Basic sanity check

library(RepDataPeerAssessment1)    #load my package

data("activity")
head(activity)

Show dimensions

dim(activity)

Names of the variables

names(activity)

Summary

suma <- summary(activity)
suma

Notice that we have r suma[7, 1].

B. What is the mean total number of steps taken per day?

We will ignore the NAs in this part of the assignment.

1. Calculate the total number of steps taken per day

# get only observations that are not NA
complete <- complete.cases(activity)
activity.cases <- activity[complete, ]
activity.NAs <- activity[!complete, ]       # NAs
activity.NAs.not <- activity.cases

cat("# of observations:\t", dim(activity.cases)[1], "\n")
cat("# of NAs:\t\t", dim(activity.NAs)[1], "\n")
plot(seq(1:nrow(activity.cases)), activity.cases$steps)

Histogram of total number of steps each day

byDate.steps.total <- aggregate(activity.cases$steps, 
                               by = list(activity.cases$date), sum)

# rename the variable to something meaningful
names(byDate.steps.total) <- c("Day", "total.steps")
# byDate.steps.total
hist(byDate.steps.total$total.steps)

Find the mean and the median total number of steps per day

mean.0 <- mean(byDate.steps.total$total.steps)
mean.0
median.0 <- median(byDate.steps.total$total.steps)
median.0

How many unique intervals are there?

We want to know how many unique intervals there are because we will need later to calculate the maximum steps per interval and we need this number to verify our count of intervals is correct.

unique(activity.cases$interval)

So, there are r length(unique(activity.cases$interval)) unique intervals, with the first interval being r min(activity.cases$interval) and the last r max(activity.cases$interval).


C. What is the average daily activity pattern?

1. Make a time series plot (i.e. type = "l")

Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)

byDate.steps.mean <- aggregate(activity.cases$steps, 
                               by = list(activity.cases$date), mean)

# rename the variable to something meaningful
names(byDate.steps.mean) <- c("Day", "mean.steps")
plot(byDate.steps.mean$Day, byDate.steps.mean$mean.steps, type = "l")

2. Which interval contain the maximum # of steps

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

Find day and interval with maximum number of steps

Let's find the maximum number of steps:

max.steps <- max(activity.cases$steps)
max.steps

Now, let's find which interval holds the maximum number of steps:

index <- which(activity.cases$steps == max.steps)
whole_row <- activity.cases[index, ]
whole_row

The interval with the maximum number of steps is r whole_row$interval.

Grouping by interval

byInterval <- aggregate(activity.cases$steps, 
                        by = list(activity.cases$interval), 
                        max)

names(byInterval) <- c("interval", "steps.max")
sorted <- byInterval[order(-byInterval$steps.max), ]   # order by steps, descending
cat("These are the top 10 intervals with more activity\n\n")
head(sorted, 10)

\

Find what are the intervals with more activity

We will plot the number maximum of steps in logarithmic scale:

plot(sorted$interval, sorted$steps.max,
     xlim = c(0, 2500), ylim = range(1:1000),
     panel.first = grid(lty = 1))
plot(sorted$interval, sorted$steps.max, log = "y",
     xlim = c(0, 2500), ylim = range(1:1000),
     panel.first = grid(lty = 1))

And zooming in in a non-logarithmic plot, we can see that the maximum values are those around 800.

# non-logarithmic plot
plot(sorted$interval, sorted$steps.max,
     xlim = range(0:2500), ylim = range(700:825),
     panel.first = grid(lty = 1))

What we can confirm is that the maximum activity occurs between the 500 and 2000 intervals.

D. Imputing missing values

1. Calculate the number of missing values

Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)

# get only observations that are  NA
complete <- complete.cases(activity)
activity.missing <- activity[!complete, ]
nrow(activity.missing)

2. Complete missing values

Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.

There are several ways we can complete the missing values and lots of R packages available to perform the imputation. We will make it the easier way. This is the strategy:

Motivation

Since there is no way to compare in a dataset with missing values how a imputation method performs, after several attempts on resolving this, I found in the literature that the imputation method selection is done based on a dataset without any missing values first.

  1. Using the dataset under analysis, we proceed to generate a new one but without any missing values.

  2. We apply several imputation methods on the new dataset but this time the missing values are entered under control, the amount and randomness which could be MCAR or MAR, among several others.

  3. The result of applying these methods will give us a performance plot of the methods used versus the Root Mean Squared Error (RMSE). We will chose the imputation method with the lowest RMSE.

  4. After selecting the imputation method, we go back to our original dataset with missing data and apply the imputation method with the best performance, as selected above.

  5. The result will be the data frame with imputed data.

Since this topic needs not to be sophisticated at this stage, we will use a pretty straight forward R package called imputTestbench. It is available in CRAN. We will describe the steps in detail below.

2.1. Generate a new dataset but without missing values

ok <- complete.cases(activity$steps)
steps.clean <- activity$steps[ok]
length(steps.clean)

2.2. Apply several imputation methods

library(imputeTestbench)

itb <- impute_errors(steps.clean, 
                     missPercentFrom = 0, missPercentTo = 10, 
                     interval = 1, blckper = TRUE, blck = 10)

itb

2.3. Performance plots

plot_errors(itb)

From the boxplot we can see that the methods with the lowest error are na.interp and na.interpolation. These two methods are included in the package _. We will be able to see this clearly with the line type plot:

plot_errors(dataIn = itb, plotType = "line")

Both interpolation methods are overlapping which means that there is not significant difference between applying any of them. Also we can see that the worst performing imputing methods are the Last Observation Carried Forward (LOCF) and the mean.

This how it looks four of the different imputation simulations at 10% missing data rate.

plot_impute(steps.clean, 
            methods = c("na.mean", "na.locf", "na.interp"), 
            missPercent = 10) 

2.4. Apply selected imputation method to the original dataset

We bring up the original data frame and get the vectors for the steps variable.

The function used is na.interp from the package 'forecast'. So, the imputation will be performed with na.interp.

library(forecast)

steps <- activity$steps

summary(activity$steps)
se.0 <- sd(activity$steps, 
           na.rm = TRUE) / sqrt(sum(!is.na(activity$steps)))
cat("SE(before imp.) = ", se.0, "\n\n")

steps.imp.interp <- na.interp(steps)     # interpolating the NA values

summary(steps.imp.interp)
se.1 <- sd(steps.imp.interp, 
           na.rm = TRUE) / sqrt(sum(!is.na(steps.imp.interp)))
cat("SE(after imp.) = ", se.1, "\n")

What we can see is that there is an improvement in the Standard Error (SE) from 0.91 to 0.79.

\

4. Make a histogram (with missing data filled in)

Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?

First, we put together the activity data frame with the imputed values.

activity.imp <- activity

activity.imp$steps <- as.numeric(steps.imp.interp)

summary(activity.imp)
str(activity.imp)
byDate.steps.total.1 <- aggregate(activity.imp$steps, 
                               by = list(activity.imp$date), sum)

# rename the variable to something meaningful
names(byDate.steps.total.1) <- c("Day", "total.steps")
summary(byDate.steps.total.1$total.steps)
hist(byDate.steps.total.1$total.steps)

Calculating the mean and the media for the imputed steps

mean.1 <- mean(byDate.steps.total.1$total.steps)
median.1 <- median(byDate.steps.total.1$total.steps)
mean.1
median.1

And we find that there is a difference between mean and the media with the dataset with missing data and the dataset with imputed values.

mean.0 - mean.1
median.0 - median.1

The change in the mean is r round((mean.0 - mean.1)/mean.0*100, 0) percent and of the median is r round((median.0 - median.1)/median.0*100, 0) percent.

\

E. Are there differences in activity patterns between weekdays and weekends?

1. Create a new factor variable

Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

activity.imp$week <- ifelse(weekdays(activity.imp$date) %in% c("Saturday", "Sunday"), "weekend", "weekday")

# View(activity.imp)

activity.imp$week <- as.factor(activity.imp$week)
str(activity.imp)

2. Make a panel

Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). See the README file in the GitHub repository to see an example of what this plot should look like using simulated data.

activity.imp.1 <- activity.imp

byInterval <- aggregate(activity.imp.1$steps, 
                        by = list(activity.imp.1$interval, activity.imp.1$week), mean)

names(byInterval) <- c("interval", "week", "steps.mean")
byInterval
library(lattice)
# xyplot(y ~ x | panel, data = dataset, type = "o")
xyplot(steps.mean ~ interval | as.factor(week), 
       data = byInterval, 
       type = "l", 
       layout=c(1,2))


AlfonsoRReyes/RepDataPeerAssessment1 documentation built on May 5, 2019, 4:53 a.m.