It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the "quantified self" movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day.
The data for this assignment can be downloaded from the course web site:
Dataset: Activity monitoring data [52K]
The variables included in this dataset are:
steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and contains a total of 17,568 observations.
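The observation count can be sanity-checked from the description alone: two full months (October and November 2012) sampled every 5 minutes. The sketch below is only that arithmetic; the variable names are illustrative and not part of the original analysis.

```r
# Days from 2012-10-01 through 2012-11-30, inclusive
days_in_period <- as.integer(as.Date("2012-11-30") - as.Date("2012-10-01")) + 1L

# Number of 5-minute intervals in a 24-hour day
intervals_per_day <- (24L * 60L) %/% 5L

# Expected number of rows if every interval of every day is present
expected_rows <- days_in_period * intervals_per_day
expected_rows
```

Once the file has been read in, this value can be compared against nrow(data) to confirm that no intervals are missing from the file itself.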
These two packages are for creating new column(s) and plotting, respectively.
library(dplyr)
library(ggplot2)
This is the colour code used to fill the histograms and draw the line plots. You can change it to any colour you like.
colorcode <- "#36B8B8"
filename <- "../../inst/extdata/activity.csv"
Give the csv file a name, download the zip file, unzip it, read the csv file in, and get basic summary info.
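fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
# mode = "wb" keeps the zip archive intact on Windows
download.file(fileurl, destfile = 'data_activity.zip', mode = "wb")
# Extract into the directory that filename points to (the archive is expected to contain activity.csv)
unzip('data_activity.zip', exdir = dirname(filename))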
data <- read.csv(filename)
str(data)
summary(data)
Use aggregate to sum up the steps according to dates.
numStepsDaily <- aggregate(steps ~ date, data = data, FUN = sum, na.rm = FALSE)
head(numStepsDaily)
Convert the date column to the Date class with as.Date. The binwidth is chosen arbitrarily, so you can change it to suit your own taste.
data$date <- as.Date(data$date)
ggplot(numStepsDaily, aes(x = steps)) +
  geom_histogram(fill = colorcode, binwidth = 1000) +
  labs(title = 'Histogram of the total number of steps', x = "Number of steps per day", y = "Number of days")
meanNumStepsDaily <- mean(numStepsDaily$steps)
meanNumStepsDaily
medianNumStepsDaily <- median(numStepsDaily$steps)
medianNumStepsDaily
avgDailyActivity <- aggregate(steps ~ interval, data = data, FUN = mean, na.rm = TRUE)
head(avgDailyActivity)
ggplot(avgDailyActivity, aes(x = interval, y = steps)) +
  geom_line(colour = colorcode) +
  labs(title = "Average daily activity pattern", x = "Interval", y = "Steps")
maxNumOfSteps <- avgDailyActivity[which.max(avgDailyActivity$steps), ]
maxNumOfSteps['interval']
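The interval identifier is easier to interpret as a time of day. The sketch below assumes the identifier encodes clock time as HHMM (so 100 means 01:00 and 2355 means 23:55). The text above does not state this encoding, so treat it as an assumption to verify against unique(data$interval). The helper name intervalToClock is hypothetical.

```r
# Hypothetical helper: convert an HHMM-style interval identifier to "HH:MM".
# Assumes the identifier is hours * 100 + minutes.
intervalToClock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clockLabels <- intervalToClock(c(0, 5, 55, 100, 2355))
clockLabels
```

If the assumption holds, intervalToClock(maxNumOfSteps$interval) gives the time of day at which the average step count peaks.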
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
imputeMissing <- sum(is.na(data$steps))
imputeMissing
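The total count does not show whether the missing values are scattered across intervals or concentrated in whole days, which matters for how they bias daily totals. A minimal sketch of a per-day breakdown follows; it uses a small made-up data frame (toy) with the same column names so that it runs on its own.

```r
# Toy data with the same columns as the activity dataset (values are made up)
toy <- data.frame(
  date = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  steps = c(NA, NA, NA, 0, 12, NA)
)

# Count the missing step values for each date
naPerDay <- tapply(is.na(toy$steps), toy$date, sum)
naPerDay
```

Applied to the real data, tapply(is.na(data$steps), data$date, sum) gives the same breakdown; a day whose count equals the number of intervals per day is missing entirely.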
dataFilled <- data
meanSteps <- mean(data$steps, na.rm = TRUE)
dataFilled$steps[is.na(dataFilled$steps)] <- meanSteps
str(dataFilled)
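Filling every gap with the overall mean ignores the time of day. One alternative, shown here only as a sketch and not as the method used in this analysis, is to fill each missing value with the mean of its own 5-minute interval across days. The toy data frame and the names intervalMean and toyFilled are illustrative.

```r
# Toy data with the same columns as the activity dataset (values are made up)
toy <- data.frame(
  date = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps = c(NA, 10, 2, NA, 4, 30)
)

# Mean of the observed steps within each interval, repeated for every row
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their interval's mean
toyFilled <- toy
missing <- is.na(toy$steps)
toyFilled$steps[missing] <- intervalMean[missing]
toyFilled
```

This keeps the shape of the daily activity pattern in the imputed days, at the cost of a slightly more involved fill step.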
numStepsDailyFilled <- aggregate(steps ~ date, data = dataFilled, FUN = sum, na.rm = FALSE)
ggplot(numStepsDailyFilled, aes(x = steps)) +
  geom_histogram(fill = colorcode, binwidth = 1000) +
  labs(title = "Histogram of the total number of steps", x = "Number of steps per day", y = "Number of days")
Now that the NAs have been filled in, we would expect some changes in the mean and median:
meanNumStepsDailyFilled <- mean(numStepsDailyFilled$steps)
meanNumStepsDailyFilled
medianNumStepsDailyFilled <- median(numStepsDailyFilled$steps)
medianNumStepsDailyFilled
But this is not enough for us to see the changes. How about a quick, easy summary?
summary(numStepsDaily)
summary(numStepsDailyFilled)
Comparing the two summaries, the quantiles of the daily step totals have clearly changed.
The weekdays() function may be of some help for this part. Use the dataset with the filled-in missing values.
The dataset with the filled-in missing values is called dataFilled. We assign the dataset to a new one called dataNew, and use mutate in dplyr to create a new column called dayType.
dataNew <- dataFilled
# weekdays() returns locale-dependent names; "Saturday"/"Sunday" assume an English locale
dataNew <- dataNew %>%
  mutate(dayType = ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "weekend", "weekday"))
head(dataNew)
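Because weekdays() returns day names in the session's locale, comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch uses the numeric weekday from as.POSIXlt(), where 0 is Sunday and 6 is Saturday. The dates below are illustrative.

```r
# Friday through Monday in the first week of the study period
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is the day of the week as a number: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Classify without relying on locale-specific day names
dayType <- ifelse(wday %in% c(0, 6), "weekend", "weekday")
dayType
```

The same expression can be used inside mutate() by replacing dates with the date column.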
# Average the steps per interval separately for weekdays and weekends
numStepsDailyNew <- aggregate(steps ~ interval + dayType, data = dataNew, FUN = mean, na.rm = TRUE)
ggplot(numStepsDailyNew, aes(x = interval, y = steps, color = dayType)) +
  geom_line() +
  labs(title = "Average daily steps", x = "Interval", y = "Average number of steps") +
  facet_wrap(~dayType, ncol = 1, nrow = 2)