This document details the steps I took to analize and present the activity data given to us for peer assessment 1 in the Reproducible Research course. First, I loaded and preprocessed the activity data. Next, I found the mean number of total steps per day, made a histogram of the total number of steps per day, and calculated the mean and median of the total number of steps taken per day. After that, I looked into average daily patterns. Then, I imputed missing values and, finally, I looked at whether there was any difference in activity patterns between weekdays and weekends.

Loading and preprocessing the data

First I'll load the packages I'll need for the analysis.

library(dplyr)
library(ggplot2)

Next, I'll load the activity data from the file "activity.csv". Note that this step assumes this document is in the same directory as "activity.csv".

#setwd("~/Documents/DataScience/ReproducibleResearch/project1")

data <- read.csv("../../inst/extdata/activity.csv")
summary(data)
head(data)

The date column was read as a factor (character data), so I'll translate it into dates using the as.Date() function.

data$date <- as.Date(data$date)

What is mean total number of steps taken per day?

To find the mean total number of steps taken per day, first I'll find the total number of steps for each day (ignoring missing values for now).

totalStepsByDay <- data %>% 
  group_by(date) %>% 
  summarize(total.steps = sum(steps, na.rm = TRUE))
head(totalStepsByDay)

Next, I'll make a histogram of this data. The difference between a bar plot and a histogram is that a histogram shows the distribution of one quantitative variable, whereas a bar plot compares two categorical variables.

totalStepsByDay %>% 
  ggplot(aes(x = total.steps)) + 
  geom_histogram(binwidth = 250, fill = "#2b8cbe", color = "#636363")

In my graphs, I like to get the colors from the color brewer site.

For the last part of this section, I'll calculate and report the mean and median of the total number of steps taken per day using the summary function.

summary(totalStepsByDay$total.steps)

This person had a mean total steps per day of 9,354 and a median total steps per day of 10,400.

What is the average daily activity pattern?

For this section, I'll go back to the original data set and compute the average number of steps taken in each 5-minute interval, averaged across all days by grouping by interval (instead of date), ignoring missing values.

averageStepsPerInterval <- data %>% 
  group_by(interval) %>% 
  summarize(mean.steps = mean(steps, na.rm = TRUE))
head(averageStepsPerInterval)

Now that I have the averageStepsPerInterval data, I'll make a time series plot of the average by interval.

averageStepsPerInterval %>% 
  ggplot(aes(x = interval, y = mean.steps)) + geom_line(color = "#31a354")

From the plot, I can see that the maximum average number of steps occurs near the 800th interval. To find out precisely where it is, I'll select the interval for only the row where mean.steps is the maximum.

averageStepsPerInterval$interval[averageStepsPerInterval$mean.steps == max(averageStepsPerInterval$mean.steps)]

The intervals start at midnight (interval 0) so interval 835 is in the late morning. Probably this person walks during that time, or maybe jogs or performs some other regular activity.

Imputing missing values

There are quite a few missing values in the data. First, I'll see how many rows have NA values for each column.

sapply(data, function(x) sum(is.na(x)))

It looks like all the missing values are in the steps column. There are 2304 out of 17568 of them, or about r round((2304/17568)*100)%. There are several ways to deal with missing values. One way is that I could simply delete (or ignore) all the rows with missing values. (In my analysis so far I've been ignoring them.) But 13% of the data is a lot of data to lose. Thus, instead of deleting or ignoring them, I'll try imputing sensible values for the missing data.

I calculated the average steps per interval above. I'll use these values to fill in the missing data. In other words, for each interval where the number of steps is missing, I'll use the mean number of steps for that interval as the value.

# first join on interval so we have the mean values we need
J <- merge(data, averageStepsPerInterval, by = "interval")

# then fill in the steps column with values from the mean.steps column where steps is NA
J <- mutate(J, steps = ifelse(is.na(steps), mean.steps, steps))

# drop unnecessary columns
J <- J[c("steps", "date", "interval")]

# take a look at the result
head(J)

Now I'll see what impact imputing the missing values had on the total number of steps taken each day. First, I'll recompute the totalStepsByDay data.

totalStepsByDay2 <- J %>% 
  group_by(date) %>% 
  summarize(total.steps = sum(steps))

Now I'll recalculate the histogram from section one.

totalStepsByDay2 %>% ggplot(aes(x = total.steps)) + 
  geom_histogram(binwidth = 250, fill = "#e6550d", color = "#636363")

The distribution of total steps per day has shifted noticeably. There is a much higher peak around 10,800 steps. Also, the high peak at 0 has vanished. I'll use the summary function to see how this has changed the mean and median of the total steps.

summary(totalStepsByDay2$total.steps)

The median and mean are now the same, which makes sense since I replaced so many values by the mean for those intervals. Also, both values are higher. The median is higher by a small amount, the mean is higher by an appreciable amount. Although these values differ from the version of the data with missing values, this imputed data is probably more representative of the person's habits.

Are there differences in activity patterns between weekdays and weekends?

In the final section, I'll look at whether weekdays are different than weekends in terms of activity. First, I'll create a new column, a factor variable called "day" indicating what day of the week it is. Then I'll use that column to create a factor variable called "day.type" that has either the value "weekend" for a weekend day or "weekday" for a weekday.

J$day <- weekdays(J$date)
J <- mutate(J, day.type = ifelse((day == "Sunday") | (day == "Saturday"), "weekend", "weekday"))
J$day.type <- factor(J$day.type)

Now I'll make a multiplot of the 5-minute interval versus the average number of steps taken for both weekends and weekdays. First, I'll calculate the average number of steps for the two categories.

avgNumSteps2 <- J %>% group_by(day.type, interval) %>% summarize(avg.steps = mean(steps))

Then, I'll plot the result.

ggplot(avgNumSteps2, aes(x = interval, y = avg.steps)) + 
  geom_line(color = "#756bb1") +
  facet_grid(day.type ~ .)

There do appear to be different patterns in these graphs. In both graphs, the highest peaks occur in the mornings. However, during the weekdays, the morning peak is higher. During the weekend it seems overall activity is greater throughout the day, even though the highest peak is lower.



AlfonsoRReyes/RepDataPeerAssessment1 documentation built on May 5, 2019, 4:53 a.m.