```r
knitr::opts_chunk$set(echo = TRUE, comment = NA, error = TRUE)
```
We start by downloading the raw data using the link provided by the instructor. We use a function that will download the zip file, unpack it, and place it in an indicated directory. The function is called `downloadZip`.
```r
library(RepDataPeerAssessment1)
library(rprojroot)

cat("Setting up the project folders:\n")
project.data    <- find_package_root_file('data')
project.extdata <- find_package_root_file('inst/extdata')
project.R       <- find_package_root_file('R')
project.data
project.R
project.extdata
```
```r
# function to download a zipped file and unpack it
downloadZip <- function(fileUrl, outDir = "./data") {
  temp <- tempfile()
  download.file(fileUrl, temp, mode = "wb")
  unzip(temp, exdir = outDir)
}
```
```r
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
cat("Unpacking the raw data file:\n")
outDir <- project.extdata              # folder for raw data
downloadZip(fileUrl, outDir = outDir)  # download and unpack zip file
```
More RData files may be generated during this assignment. They will be placed under the `data` folder.

Read the CSV raw data file from `inst/extdata` and save it under `data`, then remove it from memory.
```r
# save the dataset
activity.raw <- read.csv(paste(project.extdata, "activity.csv", sep = "/"))
save(activity.raw, file = paste(project.data, "activity.raw.RData", sep = "/"))
head(activity.raw)
```
Convert the variable `date` from `factor` to `Date`:
```r
activity <- activity.raw
activity$date <- as.Date(activity.raw$date)
str(activity)
```
```r
# save the dataset
save(activity, file = paste(project.data, "activity.RData", sep = "/"))
head(activity)
```
Confirm it has been saved:
```r
rm(activity)                                          # remove variable from memory
load(paste(project.data, "activity.RData", sep = "/"))  # load data file
cat("Checking dataset has the structure we want\n\n")
str(activity)
head(activity)
# file.exists(paste(project.data, "activity.RData", sep = "/"))  # we could use this too
```
Process/transform the data (if necessary) into a format suitable for the analysis.
```r
library(RepDataPeerAssessment1)  # load my package
data("activity")
head(activity)
```
Show dimensions
```r
dim(activity)
```
Names of the variables
```r
names(activity)
```
Summary
```r
suma <- summary(activity)
suma
```
Notice that we have `r suma[7, 1]` in the `steps` variable.
We will ignore the NAs in this part of the assignment.
```r
# get only observations that are not NA
complete <- complete.cases(activity)
activity.cases <- activity[complete, ]
activity.NAs <- activity[!complete, ]  # NAs
activity.NAs.not <- activity.cases
cat("# of observations:\t", dim(activity.cases)[1], "\n")
cat("# of NAs:\t\t", dim(activity.NAs)[1], "\n")
```
```r
plot(seq_len(nrow(activity.cases)), activity.cases$steps)
```
```r
byDate.steps.total <- aggregate(activity.cases$steps, by = list(activity.cases$date), sum)
# rename the variables to something meaningful
names(byDate.steps.total) <- c("Day", "total.steps")
# byDate.steps.total
hist(byDate.steps.total$total.steps)
```
```r
mean.0 <- mean(byDate.steps.total$total.steps)
mean.0
```
```r
median.0 <- median(byDate.steps.total$total.steps)
median.0
```
We want to know how many unique intervals there are; we will need this later to calculate the maximum steps per interval, and we need this number to verify that our count of intervals is correct.
```r
unique(activity.cases$interval)
```
So, there are `r length(unique(activity.cases$interval))` unique intervals, with the first interval being `r min(activity.cases$interval)` and the last `r max(activity.cases$interval)`.
Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
```r
byDate.steps.mean <- aggregate(activity.cases$steps, by = list(activity.cases$date), mean)
# rename the variables to something meaningful
names(byDate.steps.mean) <- c("Day", "mean.steps")
plot(byDate.steps.mean$Day, byDate.steps.mean$mean.steps, type = "l")
```
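Note that the chunk above aggregates by day, while the question asks for the 5-minute interval on the x-axis. A sketch of the by-interval version, assuming the `activity.cases` data frame defined earlier (the variable names here are illustrative):

```r
# average number of steps at each 5-minute interval, across all days
byInterval.steps.mean <- aggregate(activity.cases$steps,
                                   by = list(activity.cases$interval), mean)
names(byInterval.steps.mean) <- c("interval", "mean.steps")
plot(byInterval.steps.mean$interval, byInterval.steps.mean$mean.steps, type = "l")
```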
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
Let's find the maximum number of steps:
```r
max.steps <- max(activity.cases$steps)
max.steps
```
Now, let's find which interval holds the maximum number of steps:
```r
index <- which(activity.cases$steps == max.steps)
whole_row <- activity.cases[index, ]
whole_row
```
The interval with the maximum number of steps is `r whole_row$interval`.
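Strictly speaking, the question asks for the interval with the maximum number of steps *on average* across all days, not the single largest raw observation. A sketch of that calculation, assuming `activity.cases` as defined above:

```r
# interval whose mean step count (averaged across days) is highest
byInterval.mean <- aggregate(steps ~ interval, data = activity.cases, mean)
byInterval.mean$interval[which.max(byInterval.mean$steps)]
```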
Grouping by interval
```r
byInterval <- aggregate(activity.cases$steps, by = list(activity.cases$interval), max)
names(byInterval) <- c("interval", "steps.max")
sorted <- byInterval[order(-byInterval$steps.max), ]  # order by steps, descending
cat("These are the top 10 intervals with more activity\n\n")
head(sorted, 10)
```
We will plot the maximum number of steps, first on a linear scale and then on a logarithmic scale:
```r
plot(sorted$interval, sorted$steps.max, xlim = c(0, 2500), ylim = range(1:1000),
     panel.first = grid(lty = 1))
```
```r
plot(sorted$interval, sorted$steps.max, log = "y", xlim = c(0, 2500), ylim = range(1:1000),
     panel.first = grid(lty = 1))
```
And zooming in on a non-logarithmic plot, we can see that the maximum values are those around 800.
```r
# non-logarithmic plot
plot(sorted$interval, sorted$steps.max, xlim = range(0:2500), ylim = range(700:825),
     panel.first = grid(lty = 1))
```
What we can confirm is that the maximum activity occurs between intervals 500 and 2000.
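One way to check that claim numerically, assuming the `sorted` data frame built above (the cutoff of the top 50 intervals is an arbitrary choice for illustration):

```r
# fraction of the 50 most active intervals that lie between 500 and 2000
top50 <- head(sorted, 50)
mean(top50$interval >= 500 & top50$interval <= 2000)
```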
Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
```r
# get only observations that are NA
complete <- complete.cases(activity)
activity.missing <- activity[!complete, ]
nrow(activity.missing)
```
Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
There are several ways we can complete the missing values, and lots of R packages available to perform the imputation. We will do it the easy way. This is the strategy:
Motivation
Since there is no way to measure how an imputation method performs on a dataset with missing values, after several attempts at resolving this I found in the literature that imputation method selection is first done on a dataset without any missing values.
1. Using the dataset under analysis, we generate a new one without any missing values.
2. We apply several imputation methods to this new dataset, but this time the missing values are introduced under control, in both their amount and their randomness (which could be MCAR or MAR, among several others).
3. The result of applying these methods is a performance plot of each method versus its Root Mean Squared Error (RMSE). We will choose the imputation method with the lowest RMSE.
4. After selecting the imputation method, we go back to our original dataset with missing data and apply the best-performing method, as selected above.
5. The result will be a data frame with imputed data.
Since this topic need not be sophisticated at this stage, we will use a pretty straightforward R package called `imputeTestbench`. It is available on CRAN. We will describe the steps in detail below.
```r
ok <- complete.cases(activity$steps)
steps.clean <- activity$steps[ok]
length(steps.clean)
```
```r
library(imputeTestbench)
itb <- impute_errors(steps.clean, missPercentFrom = 0, missPercentTo = 10,
                     interval = 1, blckper = TRUE, blck = 10)
itb
```
```r
plot_errors(itb)
```
From the boxplot we can see that the methods with the lowest error are `na.interp` and `na.interpolation`. These two methods are included in the package _. We will be able to see this more clearly with the line-type plot:
```r
plot_errors(dataIn = itb, plotType = "line")
```
Both interpolation methods overlap, which means there is no significant difference between applying either of them. We can also see that the worst-performing imputation methods are Last Observation Carried Forward (LOCF) and the mean.
This is how the imputation simulations look for three of the methods at a 10% missing-data rate:
```r
plot_impute(steps.clean, methods = c("na.mean", "na.locf", "na.interp"),
            missPercent = 10)
```
We bring up the original data frame and get the vector for the `steps` variable. The function used is `na.interp` from the package `forecast`, so the imputation will be performed with `na.interp`.
```r
library(forecast)
steps <- activity$steps
summary(activity$steps)
se.0 <- sd(activity$steps, na.rm = TRUE) / sqrt(sum(!is.na(activity$steps)))
cat("SE(before imp.) = ", se.0, "\n\n")
steps.imp.interp <- na.interp(steps)  # interpolating the NA values
summary(steps.imp.interp)
se.1 <- sd(steps.imp.interp, na.rm = TRUE) / sqrt(sum(!is.na(steps.imp.interp)))
cat("SE(after imp.) = ", se.1, "\n")
```
What we can see is an improvement in the Standard Error (SE) from 0.91 to 0.79.
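As a quick arithmetic check, the relative reduction can be computed directly from the two standard errors (assuming `se.0` and `se.1` from the chunk above):

```r
round((se.0 - se.1) / se.0 * 100, 1)  # percent reduction in SE after imputation
```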
Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
First, we put together the activity data frame with the imputed values.
```r
activity.imp <- activity
activity.imp$steps <- as.numeric(steps.imp.interp)
summary(activity.imp)
str(activity.imp)
```
```r
byDate.steps.total.1 <- aggregate(activity.imp$steps, by = list(activity.imp$date), sum)
# rename the variables to something meaningful
names(byDate.steps.total.1) <- c("Day", "total.steps")
summary(byDate.steps.total.1$total.steps)
hist(byDate.steps.total.1$total.steps)
```
Calculating the mean and the median for the imputed steps:
```r
mean.1 <- mean(byDate.steps.total.1$total.steps)
median.1 <- median(byDate.steps.total.1$total.steps)
mean.1
median.1
```
And we find that there is a difference in the mean and the median between the dataset with missing data and the dataset with imputed values.
```r
mean.0 - mean.1
median.0 - median.1
```
The change in the mean is `r round((mean.0 - mean.1)/mean.0*100, 0)` percent, and in the median `r round((median.0 - median.1)/median.0*100, 0)` percent.
Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
```r
activity.imp$week <- ifelse(weekdays(activity.imp$date) %in% c("Saturday", "Sunday"),
                            "weekend", "weekday")
# View(activity.imp)
activity.imp$week <- as.factor(activity.imp$week)
str(activity.imp)
```
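As a sanity check on the new factor, the two levels can be tabulated (assuming `activity.imp` from the chunk above; weekend days should be roughly two sevenths of the observations):

```r
table(activity.imp$week)  # counts of weekday vs weekend observations
```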
Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). See the README file in the GitHub repository to see an example of what this plot should look like using simulated data.
```r
activity.imp.1 <- activity.imp
byInterval <- aggregate(activity.imp.1$steps,
                        by = list(activity.imp.1$interval, activity.imp.1$week), mean)
names(byInterval) <- c("interval", "week", "steps.mean")
byInterval
```
```r
library(lattice)
# xyplot(y ~ x | panel, data = dataset, type = "o")
xyplot(steps.mean ~ interval | as.factor(week), data = byInterval,
       type = "l", layout = c(1, 2))
```