Reproducible Research, Assignment 1

Alfonso R. Reyes

A. Loading and preprocessing the data

We start by downloading the raw data from the link provided by the instructor. The function downloadZip downloads the zip file and unpacks it into the directory we indicate.

1. Load the data

Downloading and unpacking raw data file


library(rprojroot)    # provides find_package_root_file()

cat("Setting up the project folders:\n")
<- find_package_root_file('data')
project.extdata <- find_package_root_file('inst/extdata')
project.R <- find_package_root_file('R')
Setting up the project folders:
[1] "/home/superuser/git.projects/RepDataPeerAssessment1/data"
[1] "/home/superuser/git.projects/RepDataPeerAssessment1/R"
[1] "/home/superuser/git.projects/RepDataPeerAssessment1/inst/extdata"
downloadZip <- function(fileUrl, outDir="./data") {
  # function to download zipped file and unpack
  temp <- tempfile()
  on.exit(unlink(temp))                       # delete the temporary file on exit
  download.file(fileUrl, temp, mode = "wb")   # binary mode keeps the zip intact
  unzip(temp, exdir = outDir)
}
fileUrl <- ""
cat("Unpacking the raw data file:\n")
Unpacking the raw data file:
outDir <- project.extdata             # folder for raw data
downloadZip(fileUrl, outDir = outDir)   # download and unpack zip file


Create the RData file

More RData files may be generated during this assignment. They will be placed under the folder data.

Saving the raw dataset to the data folder

Read the raw CSV file from inst/extdata and save it under data. Then remove it from memory.

# save the dataset
activity.raw <- read.csv(paste(project.extdata, "activity.csv", sep = "/"))
save(activity.raw, file=paste(, "activity.raw.RData", sep = "/"))
  steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25

Convert the variable date from factor to Date

activity <- activity.raw
activity$date <- as.Date(activity.raw$date)
'data.frame':   17568 obs. of  3 variables:
 $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ date    : Date, format: "2012-10-01" "2012-10-01" ...
 $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
# save the dataset
save(activity, file=paste(, "activity.RData", sep = "/"))
  steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25
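
The conversion step can be tried in isolation. The following is a minimal sketch on a made-up data frame (the name toy and its values are illustrative assumptions, not part of the assignment data), showing that as.Date accepts a factor column directly:

```r
# Toy data frame; names and values are made up for illustration
toy <- data.frame(
  steps    = c(NA, 4L, 10L),
  date     = factor(c("2012-10-01", "2012-10-01", "2012-10-02")),
  interval = c(0L, 5L, 0L)
)

class(toy$date)                  # "factor"
toy$date <- as.Date(toy$date)    # as.Date has a method for factors
class(toy$date)                  # "Date"
```

Since R 4.0.0, read.csv no longer converts strings to factors by default, so the date column may arrive as character instead. The same as.Date call handles that case as well.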

Confirm it has been saved:

rm(activity)                                            # remove variable from memory
load(paste(, "activity.RData", sep = "/"))  # load data file
cat("Checking dataset has the structure we want\n\n")
# file.exists(paste(, "activity.RData", sep = "/"))  # we could use this too
Checking dataset has the structure we want

'data.frame':   17568 obs. of  3 variables:
 $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ date    : Date, format: "2012-10-01" "2012-10-01" ...
 $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
  steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25
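
The save, remove, and reload cycle used above can be sketched on its own. This example is an illustration under assumptions: it writes to a temporary directory instead of the project data folder, and the names data.dir, rdata.file, and toy are hypothetical.

```r
# Hypothetical stand-in for the data folder: a temporary directory
data.dir   <- tempdir()
rdata.file <- file.path(data.dir, "toy.RData")

# Made-up data, for illustration only
toy <- data.frame(steps = c(NA, 4L), interval = c(0L, 5L))
save(toy, file = rdata.file)

rm(toy)                              # remove the object from memory
stopifnot(file.exists(rdata.file))   # the file is on disk
load(rdata.file)                     # restores the object under its original name
str(toy)
```

file.path builds the path with the platform's separator, which is one alternative to paste(..., sep = "/").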


2. Process and transform

Process/transform the data (if necessary) into a format suitable for the analysis.

Basic sanity check

library(RepDataPeerAssessment1)    #load my package

  steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25

Show dimensions

[1] 17568     3

Names of the variables

[1] "steps"    "date"     "interval"


suma <- summary(activity)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304                                          

Notice that the variable steps has 2304 NAs; date and interval have none.
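
The NA count reported by summary can also be obtained directly. Below is a minimal sketch on made-up data (toy and na.per.column are hypothetical names) that counts NAs per column and computes the share of missing values. For the real dataset, 2304 NAs out of 17568 observations is about 13.1%.

```r
# Toy data; values are made up for illustration
toy <- data.frame(
  steps    = c(NA, 4L, NA, 10L),
  date     = as.Date("2012-10-01") + c(0, 0, 1, 1),
  interval = c(0L, 5L, 0L, 5L)
)

na.per.column <- colSums(is.na(toy))   # number of NAs in each column
na.per.column

mean(is.na(toy$steps))                 # fraction of missing step counts
```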

B. What is the mean total number of steps taken per day?

We will ignore the NAs in this part of the assignment.

1. Calculate the total number of steps taken per day

# get only observations that are not NA
complete <- complete.cases(activity)
activity.cases <- activity[complete, ]
activity.NAs <- activity[!complete, ]       # NAs
activity.NAs.not <- activity.cases

cat("# of observations:\t", dim(activity.cases)[1], "\n")
cat("# of NAs:\t\t", dim(activity.NAs)[1], "\n")
# of observations:   15264 
# of NAs:        2304 
plot(seq_len(nrow(activity.cases)), activity.cases$steps)

Histogram of total number of steps each day

<- aggregate(activity.cases$steps, 
             by = list(activity.cases$date), sum)

# rename the variable to something meaningful
names( <- c("Day", "total.steps")

Find the mean and the median total number of steps per day

mean.0 <- mean($total.steps)
[1] 10766.19
median.0 <- median($total.steps)
[1] 10765
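
The per-day aggregation, histogram, mean, and median can be sketched end to end on a small example. The data and the names toy, toy.cases, and daily.totals are assumptions made for illustration; the steps mirror the ones above (drop NAs, sum by date, rename, summarise).

```r
# Toy data; values are made up for illustration
toy <- data.frame(
  steps = c(10L, 20L, NA, 5L, 15L, 30L),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02",
                    "2012-10-02", "2012-10-03", "2012-10-03"))
)

# keep only complete observations, then total the steps for each day
toy.cases    <- toy[complete.cases(toy), ]
daily.totals <- aggregate(toy.cases$steps, by = list(toy.cases$date), sum)
names(daily.totals) <- c("Day", "total.steps")

hist(daily.totals$total.steps,
     main = "Total steps per day (toy data)",
     xlab = "Total steps")

mean(daily.totals$total.steps)
median(daily.totals$total.steps)
```

Because the NA rows are dropped before aggregating, a day whose observations are all NA does not appear in the result at all, rather than appearing with a total of zero.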

How many unique intervals are there?

We count the unique intervals now because we will later calculate the maximum steps per interval, and this number lets us verify that our count of intervals is correct.
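
As a cross-check, the expected number of intervals can be derived independently of the data. The sketch below assumes that the interval identifier encodes the time of day as HHMM in 5-minute steps, which is consistent with the summary above (minimum 0, median 1177.5, maximum 2355) but is not stated explicitly. The names minutes, hours, intervals, and n.intervals are hypothetical.

```r
# Rebuild the interval identifiers: minutes 0, 5, ..., 55 within hours 0..23,
# coded as HHMM (an assumption consistent with the summary of the data)
minutes   <- seq(0, 55, by = 5)
hours     <- 0:23
intervals <- as.vector(outer(minutes, hours * 100, "+"))

n.intervals <- length(unique(intervals))
n.intervals             # 288 intervals per day
range(intervals)        # 0 2355
median(intervals)       # 1177.5

17568 / n.intervals     # 61 days of observations
```

With the dataset loaded, length(unique(activity$interval)) should give the same count if the assumption holds.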

# xyplot(y ~ x | panel, data = dataset, type = "o")
xyplot(steps.mean ~ interval | as.factor(week), 
       data = byInterval, 
       type = "l", 

