```{css, echo=FALSE} body { counter-reset: li; / initialize counter named li / }
ol { margin-left:0; / Remove the default left margin / padding-left:0; / Remove the default left padding / } ol > li { position:relative; / Create a positioning context / margin:0 0 10px 2em; / Give each list item a left margin to make room for the numbers / padding:10px 80px; / Add some spacing around the content / list-style:none; / Disable the normal item numbering / border-top:2px solid #317EAC; background:rgba(49, 126, 172, 0.1); } ol > li:before { content:"Exercise " counter(li); / Use the counter as content / counter-increment:li; / Increment the counter by 1 / / Position and style the number / position:absolute; top:-2px; left:-2em; -moz-box-sizing:border-box; -webkit-box-sizing:border-box; box-sizing:border-box; width:7em; / Some space between the number and the content in browsers that support generated content but not positioning it (Camino 2 is one example) / margin-right:8px; padding:4px; border-top:2px solid #317EAC; color:#fff; background:#317EAC; font-weight:bold; font-family:"Helvetica Neue", Arial, sans-serif; text-align:center; } li ol, li ul {margin-top:6px;} ol ol li:last-child {margin-bottom:0;}
.oyo ul { list-style-type:decimal; }
.oyo ul { list-style-type:decimal; }
hr { border: 1px solid #357FAA; }
div#boxedtext { background-color: rgba(86, 155, 189, 0.2); padding: 20px; margin-bottom: 20px; font-size: 10pt; }
div#template { margin-top: 30px; margin-bottom: 30px; color: #808080; border:1px solid #808080; padding: 10px 10px; background-color: rgba(128, 128, 128, 0.2); border-radius: 5px; }
div#license { margin-top: 30px; margin-bottom: 30px; color: #4C721D; border:1px solid #4C721D; padding: 10px 10px; background-color: rgba(76, 114, 29, 0.2); border-radius: 5px; }
/------------ Fancy blocks --------------/
.noteblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #042e6b; background: 1.5em center/2.5em no-repeat; background-image: url("images/info-circle.svg"); }
.tipblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #d5d5ce; background: 1.5em center/2em no-repeat; background-image: url("images/lightbulb.svg"); }
.warningblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #be4e00; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-triangle.svg"); }
.importantblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #c30000; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-circle.svg"); }
.exerciseblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #317EAC; background: rgba(49, 126, 172, 0.1) 1.5em center/2.5em no-repeat; background-image: url("images/flag.svg"); }
```r # packages library(MA22004labs) library(tidyverse) library(ggmosaic) library(janitor) library(learnr) library(gradethis) library(knitr) # tutorial options tutorial_options( # code running in exercise times out after 30 seconds # if it is taking more than 30 s something is wrong exercise.timelimit = 30 ) # use gradethis for checking gradethis_setup() options(width = 60) # hide non-exercise code chunks knitr::opts_chunk$set(echo = FALSE, error = TRUE, comment = "", fig.align = "center", fig.width = 6, fig.asp = 0.75, out.width = "100%") data(cdc)
Some define Statistics as the field that focuses on turning information into knowledge. The first step is to summarise and describe the raw data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, you'll also learn indispensable data processing and subsetting skills.
:::{.noteblock} The report for Lab 2 consists of answers to exercises 1-6 in the last section of this tutorial. Upload your submission to Gradescope as a PDF. Remember: to create your new lab report in RStudio, go to New File > R Markdown, then choose From Template and select MA22004 Lab Report from the list of templates. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the first Section, "RStudio and lab reports", of Lab 1. :::
In this lab, we will explore and visualise the data using the tidyverse suite of packages. The package ggmosaic provides the geom_mosaic
geometry for producing mosaic plots with ggplot
. The package janitor is for data cleaning and provides the function tabyl
for summarising data.
Recall that all the necessary packages can be loaded using the library
command.
library(MA22004labs) library(tidyverse) library(ggmosaic) library(janitor)
This lab will look at the CDC Health Data, loaded in this tutorial under the data frame cdc
.
:::{.tipblock} The CDC Health Data contains an extract from the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is an annual telephone survey of 350,000 people in the United States collected by the CDC. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
This data set contains a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset. The data set of 20,000 observations can be loaded using the data
command.
data(cdc)
:::
The data set cdc
that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R
calls this data format a data frame, a term used throughout the labs.
To view the names of the variables, one uses the names
function.
names(`___`)
names(cdc)
This returns the names genhlth
, exerany
, hlthplan
, smoke100
, height
, weight
, wtdesire
, age
, and gender
. Each of these variables corresponds to a question in the survey. For genhlth
, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany
variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan
indicates whether the respondent had health coverage (1) or did not (0). The smoke100
variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent's height
in inches, weight
in pounds as well as their desired weight, wtdesire
, age
in years, and gender
.
We can look at our data's first few entries (rows) with the head
function.
`___`(cdc)
head(cdc)
Similarly, we can look at the last few using the tail
function.
`___`(cdc)
tail(cdc)
Remember that you can use glimpse
to take a quick peek at your data to understand its contents better.
You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc
has 20,000 rows, so viewing the entire data set would mean flooding your screen. It's better to take small peeks at the data with head
, tail
or the subsetting techniques that you'll learn in a moment.
The BRFSS questionnaire is a massive trove of information. An excellent first step in any analysis is distilling all that information into a few summary statistics and graphics. As a simple example, the function summary
returns a numerical summary: minimum, first quartile, median, mean, second quartile, and maximum.
Try producing a summary for the weight
variable.
summary(`___`$`___`)
summary(cdc$weight)
If you wanted to compute the interquartile range (difference between 3rd and 1st quartile) for the respondents' weight, you would look at the output from the summary command above and then enter:
190 - 140
R
also has built-in functions to compute summary statistics one by one. For instance, the mean, median, and variance of weight
can be calculated using the obvious functions.
:::{.noteblock} Summary statistics: Some useful function calls for summary statistics for a single numerical variable are as follows:
mean
median
sd
var
IQR
min
max
Note that each function takes a single vector as an argument and returns a single value. :::
While it makes sense to describe a quantitative variable like weight
in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function tabyl
does this for you by counting the number of times each kind of response was given.
Use the tabyl
function to see the number of people who have smoked 100 cigarettes in their lifetime (smoke100
).
cdc |> tabyl(smoke100)
The output of the tabyl
function includes both the counts and the relative frequency distribution for each category of response to the question smoke100
(in this case, yes 1
or no 0
).
Next, make a bar plot of the entries in the tabyl
data frame using ggplot
with the geom_bar
geometry. Here we must set the option stat = "identity"
within the geometry (the default is stat = count
, but we've already summarised the data using tabyl
; there are other ways to produce the plot below...).
df <- cdc |> tabyl(smoke100) ggplot(`___`, aes(x = smoke100, y = `___`)) + geom_bar(stat = "identity")
df <- cdc |> tabyl(smoke100) ggplot(df, aes(x = smoke100, y = `___`)) + geom_bar(stat = "identity") # you can set `y` to `n` or `percentage` as you please
Notice what we've done here! We've computed a summary table of the responses to cdc$smoke100
.
:::{.exerciseblock}
Using the CDC Data, create a numerical summary for height
and age
, and compute the interquartile range for each. Compute the relative frequency distribution for gender
using tabyl
. How many males are in the sample? What proportion of the sample reports being in excellent health?
# The following commands can aid in answering the questions summary(cdc$height) summary(cdc$age) IQR(cdc$height) IQR(cdc$age) cdc |> tabyl(gender) cdc |> tabyl(genhlth)
question_text( "Enter your observations as free text below.", answer("C0rrect", correct = TRUE), incorrect = "Okay! [even if this button is red]", try_again_button = "Modify your answer", allow_retry = TRUE )
:::
The tabyl
command can be used to tabulate any number of variables that you provide. For example, we could use the following to examine which participants have smoked across each gender.
cdc |> tabyl(smoke100, gender)
Here, the first column named smoke100
contains the categories (levels) 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The columns m
and f
then refer to gender (m for male or f for female), and the numbers correspond to totals in each category.
Creating a mosaic plot of this table (gender
vs smoke100
) can be done using ggplot
and working directly from the original cdc
data frame by setting the aesthetic for x
in geom_mosaic
to product(X,Y)
.
ggplot(cdc) + geom_mosaic(aes(x = product(smoke100, gender)))
:::{.exerciseblock}
Generate a mosaic plot of gender
vs smoke100
from the CDC Health Data. What does the mosaic plot reveal about smoking habits and gender?
question_text( "Enter your observations as free text below.", answer("C0rrect", correct = TRUE), incorrect = "Okay! [even if this button is red]", try_again_button = "Modify your answer", allow_retry = TRUE )
:::
R
thinks about dataWe mentioned that R
stores data in data frames, which you might think of as a spreadsheet. Each row is a different observation (a different respondent), and each column is a separate variable (the first is genhlth
, the second exerany
and so on). Use the dim
function to view the size of the cdc
data frame; this will return the number of rows and columns. (In RStudio, you can see this information by simply looking next to the object name in the Environment pane in the upper right-hand corner).
R
)`___`(`___`)
dim(cdc)
Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format
cdc[567,6]
which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight
). We know that weight
is the 6th variable because it is the 6th entry in the list of variable names.
names(cdc)
To see the weights for the first ten respondents, we can type
cdc[1:10,6]
In this expression, we have asked just for rows in the range 1 through 10. R
uses the :
to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Finally, if we want all of the data for the first ten respondents, type
cdc[1:10,]
We get all the columns by leaving out an index or a range (we didn't type anything between the comma and the square bracket). When starting out in R
, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10. Display the weights of the second batch of 100 respondents.
cdc[`___`, `___`]
grade_this({ if (.result[3] == 110){ pass() } if (length(.result) != 100){ fail("This vector doesn't seem to be the correct size.") } if (.result[1] != 118){ fail("This vector doesn't seem to start at the correct index.") } fail() })
Recall that column 6 represents respondents' weight, so the above command reported all the weights in the data set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc)
to see all the variables in the CDC Health Data. We can use variable names to select items in our data set.
cdc$weight
The dollar sign tells R
to look in data frame cdc
for the column called weight
. Since that's a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing
cdc$weight[567]
Write a code to retrieve the weights of the first ten respondents using the dollar-sign notation.
`___`
The command above returns the same result as the cdc[1:10,6]
command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose to use depends on your personal preference.
tidy
way)It's often helpful to extract all individuals (cases) in a data set with specific characteristics. We accomplish this through conditioning commands. First, consider expressions like
cdc$gender == "m"
or
cdc$age > 30
These commands produce a series of TRUE
and FALSE
values. There is one value for each respondent, where TRUE
indicates that the person was male (via the first command) or older than 30 (second command).
Suppose we want to extract just the data for the men in the sample or just for those over 30. We can pipe the cdc
data frame to the filter
function to do that for us. For example, the command
mdata <- cdc |> filter(gender == "m") glimpse(mdata)
will create a new data set called mdata
that contains only the men from the cdc
data set. This new data set contains all the same variables but just under half the rows. The important thing is that we can carve up the data based on the values of one or more variables.
:::{.noteblock}
Logical operators: Filtering for certain observations (e.g. response by gender or age) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data.
To do so, you can use the filter
function and a series of logical operators.
The most commonly used logical operators for data analysis are as follows:
==
means "equal to"!=
means "not equal to">
or <
means "greater than" or "less than">=
or <=
means "greater than or equal to" or "less than or equal to"
:::Create a subset of the health data of men over thirty.
m_and_over30 <- `___` |> filter(`___`, `___`) m_and_over30
m_and_over30 <- cdc |> filter(gender == "m", age > 30) m_and_over30
Note that you can separate filter
conditions using commas if you want responses from males and over 30.
If you are interested in either men or over 30, then you can use the |
instead of the comma.
Create a subset of the health data of mean or people over thirty.
m_or_over30 <- `___` |> `___` m_or_over30
m_or_over30 <- cdc |> filter(gender == "m" | age > 30) m_or_over30
You may use as many "and" and "or" clauses as you like when filtering data.
:::{.exerciseblock}
Using the CDC Health Data, create a new object called under23_and_smoke
that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- `___`
under23_and_smoke
under23_and_smoke <- cdc |> filter(age < 23, smoke100 == 1) under23_and_smoke
:::
With our subsetting tools, we'll return to the day's task: making basic summaries of the CDC Health Data. We've already looked at categorical data such as smoke
and gender
, so let's turn our attention to quantitative data. Two common ways to visualise quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.
ggplot(cdc, aes(x = height)) + geom_boxplot()
You can compare the locations of the box components by examining the summary statistics.
summary(cdc$height)
Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a box plot is to provide a thumbnail sketch of a variable to compare across the categories (levels or qualitative characteristics) of a categorical variable. We can do this with ggplot
by adding a y
aesthetic to the mix.
ggplot(cdc, aes(x = height, y = gender)) + geom_boxplot()
Next, let's consider a new variable that doesn't show up directly in this data set: Body Mass Index (BMI) [W]. BMI is a weight-to-height ratio and can be calculated as:
[ \text{BMI} = \frac{\text{weight}~(lb)}{\text{height}~(in)^2} \cdot 703 ]
The multiplicative factor of 703 is the approximate conversion to change units from metric (meters and kilograms) to imperial (inches and pounds).
To create a new variable called bmi
, we use the mutate
function. Then we can create a box plot of bmi
by genhlth
.
cdc <- cdc |> mutate(bmi = 703 * weight / height^2) ggplot(cdc, aes(x = genhlth, y = bmi)) + geom_boxplot()
Notice that the first line above is just some arithmetic, but it's applied to all 20,000 numbers in the cdc
data set. For each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. We like R because it lets us perform computations like this using straightforward expressions.
:::{.exerciseblock}
Using the CDC Health Data, create a box plot of bmi
by genhlth
. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would be associated with BMI and indicate what the figure suggests.
cdc <- `___` |> mutate(bmi = 703 * `___` / `___`) ggplot(`___`) + `___`
question_text( "Enter your solution as free text below.", answer("C0rrect", correct = TRUE), incorrect = "Okay! [even if this button is red]", try_again_button = "Modify your answer", allow_retry = TRUE )
:::
Finally, let's make some histograms using the command hist
. Create a histogram for the age of our respondents.
ggplot(cdc, aes(x = age)) + geom_histogram()
Histograms are generally an excellent way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command geom_histogram
. Experiment with changing the number of bin
(e.g., replace the 30 below with a smaller and larger number).
ggplot(cdc, aes(x = age)) + geom_histogram(bins = 30)
How do these two histograms compare?
question_text( "Enter your observations as free text below.", answer("C0rrect", correct = TRUE), incorrect = "Okay! [even if this button is red]", try_again_button = "Modify your answer", allow_retry = TRUE )
We've done a good first pass at analysing the BRFSS questionnaire. We've found an interesting association between smoking and gender, and we can say something about the relationship between people's assessment of their general health and BMI. We've also picked up essential computing tools -- summary statistics, subsetting, and plots -- that will serve us well throughout this course.
:::{.warningblock} Please work through the interactive tutorial for this lab. Your lab report should provide answers to the exercises below. For full marks, please answer the exercises in complete sentences, be mindful of appropriate usage, spelling, and grammar, and use $\LaTeX$ for mathematics where applicable. Use the MA22004 Lab Report template to create your lab report. Upload your final submission as a PDF to Gradescope. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the interactive tutorial for Lab 1. :::
The following exercises refer to variables found in or derived from the CDC Health Data in the data frame cdc
, which is packaged with MA22004labs
.
:::{.tipblock} The CDC Health Data contains an extract from the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is an annual telephone survey of 350,000 people in the United States collected by the CDC. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
This data set contains a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset. :::
gender
vs smoke100
from the CDC Health Data using ggplot
. What does the mosaic plot reveal about smoking habits and gender? [2 marks]List the categorical variables from the CDC Health Data. Create a box plot to compare BMI across a categorical variable from cdc
(you may not use genhlth
) using ggplot
and mutate
. List the variable you chose, why you might think it would be associated with BMI and indicate what the figure suggests. [4 marks]
Using the CDC Health Data, make a scatter plot of weight versus desired weight using ggplot
. Describe the relationship between these two variables. [3 marks]
Using the CDC Health Data, consider a new variable, wdiff
, the difference between desired and current weight. Provide a code to create wdiff
using commands from tidyverse
. What type of data is wdiff
? If an observation wdiff
is 0, what does this mean about the person's weight and desired weight? What if wdiff
is positive or negative? [3 marks]
Using appropriate numerical summaries and ggplot
plots, describe the distribution of wdiff
, created in Exercise 4, in terms of its centre, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight? [4 marks]
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women (use ggplot
). [4 marks]
{style="border-width:0"}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This work is derivative of OpenIntro Labs and was modified by Eric Hall of the University of Dundee Mathematics Division. The original labs were adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from labs written by Mark Hansen of UCLA Statistics.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.