knitr::opts_chunk$set(echo = TRUE)
If you have not done so already (or if you're working in a new directory), please download files needed for the next two classes. Please press the green play arrow once to download the files. Once they are downloaded you do not need to run this again.
On the homework, you will also need to start by downloading files by pressing the green play arrow on an R chunk at the top of the document.
# If you don't have the SDS230 package working yet, you can download the files using the following commands
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/daily_bike_totals.rda", "daily_bike_totals.rda", mode = "wb")
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")
The code below loads the OkCupid data
Can you write code that will print out the mean income of the 2nd, 4th, and 12th person in the data set?
Here are steps you can use:
1. Create a vector called `income` that has everyone's incomes
2. Create a vector called `income3` that has the incomes of the 2nd, 4th and 12th person
3. Take the mean of the `income3` vector
# Load the data
profiles <- read.csv("profiles_revised.csv")
# View(profiles)
# Create a vector called `income` that has everyone's incomes
income <- profiles$income
# Create a vector called `income3` that has the incomes of the 2nd, 4th and 12th person
income3 <- income[c(2, 4, 12)]
# Take the mean of the `income3` vector
mean(income3)
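To see how indexing with a vector of positions works without needing the OkCupid file, here is a minimal sketch on a small made-up vector. The name `toy_income` and its values are hypothetical and are not taken from the data set.

```r
# Hypothetical incomes for 12 people (made-up values, not from the OkCupid data)
toy_income <- c(20, 35, 50, 40, 60, 25, 80, 30, 45, 55, 70, 90)

# c(2, 4, 12) builds a vector of positions;
# square brackets pull out the elements at those positions
toy_income3 <- toy_income[c(2, 4, 12)]
toy_income3

# Mean of the three selected values
mean(toy_income3)
```

The same pattern applies to any vector: build the positions with c() and put them inside the square brackets.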
Categorical variables take on one of a fixed number of possible values
For categorical variables, we usually want to view how many items are in each category and what proportion of the items fall in each category.
Let's examine categorical data by looking at drinking behavior on OkCupid
# Get information about drinking behavior
drinking_vec <- profiles$drinks
# Create a table showing how often people drink
drinks_table <- table(drinking_vec)
drinks_table
We can create a relative frequency table using the function: prop.table(my_table)
Can you create a relative frequency table for the drinking behavior of the people in the OkCupid data set in the R chunk below?
drinks_table <- table(profiles$drinks)
prop.table(drinks_table)
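As a small self-contained illustration of how counts and proportions relate, the sketch below builds both tables from a made-up vector. The name `toy_drinks` and its values are hypothetical. It also shows how missing answers are handled, which is an assumption about what might matter in the real data rather than something checked against it.

```r
# Hypothetical answers (made-up values, not from the OkCupid data)
toy_drinks <- c("socially", "often", "socially", "rarely",
                "socially", NA, "often", "socially")

# Counts per category; missing values (NA) are dropped by default
toy_table <- table(toy_drinks)
toy_table

# Proportions per category: each count divided by the total count
toy_props <- prop.table(toy_table)
toy_props

# The proportions add up to 1
sum(toy_props)

# useNA = "always" keeps missing answers as their own category
table(toy_drinks, useNA = "always")
```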
We can plot the number of items in each category using a bar plot: barplot(my_table)
Can you create a bar plot for the drinking behavior of the people in the OkCupid data set?
drinks_table <- table(profiles$drinks)
barplot(drinks_table)
Is there a problem with using the bar plot function without any of the extra arguments?
Can you figure out how to fix your plot?
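One possible fix is sketched below. It assumes the problem is that some category labels are too long to be drawn side by side, so it rotates and shrinks them. The counts in `toy_counts` are made up for illustration and are not the real OkCupid totals.

```r
# Hypothetical counts with long category names (made-up values)
toy_counts <- c("not at all" = 30, "rarely" = 60, "socially" = 400,
                "often" = 50, "very often" = 5, "desperately" = 3)

# las = 2 draws the axis labels perpendicular to the axis
# cex.names shrinks the category labels
# barplot() returns the positions of the bar midpoints
bar_midpoints <- barplot(toy_counts,
                         las = 2,
                         cex.names = 0.8,
                         ylab = "Number of people",
                         main = "Drinking behavior (toy data)")
```

The same arguments can be passed when plotting drinks_table. Whether rotating or shrinking the labels looks better depends on the size of the plotting window.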
We can also create pie charts using the pie function
pie(prop.table(table(profiles$sex, useNA = "always")))
Some pie charts are more informative than others
Our plots are dominated by social drinkers - let's remove them...
nonsocial_inds <- drinks_table < 10000
nonsocial_drinks_table <- drinks_table[nonsocial_inds]
pie(nonsocial_drinks_table)
barplot(nonsocial_drinks_table)
There are other websites/apps for dating as well
There are several summary statistics useful for describing quantitative data, such as the mean and the median. Use the mean() and median() functions to extract measures of central tendency for OkCupid users' heights.
mean(profiles$height)
What went wrong?
We can ignore missing values using the na.rm = TRUE argument
mean(profiles$height, na.rm = TRUE)
median(profiles$height, na.rm = TRUE)
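To see why a single missing value matters, here is a minimal sketch on a made-up vector. The name `toy_height` and its values are hypothetical and are not taken from the data set.

```r
# Hypothetical heights in inches, with one missing value (made-up values)
toy_height <- c(65, 70, NA, 68, 72)

# Arithmetic involving NA gives NA, so the mean is NA
mean(toy_height)

# Count how many values are missing
sum(is.na(toy_height))

# na.rm = TRUE drops the missing values before computing
mean(toy_height, na.rm = TRUE)
median(toy_height, na.rm = TRUE)
```

Counting the missing values first is a useful habit, since na.rm = TRUE silently changes how many cases the summary is based on.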
Fact: the average height of males in the US is 69.6", and of females is 64". Also, 60% of the OkCupid users in our data set are male. Is the height of the average OkCupid user what we would expect from the US population? Would we expect them to be the same?
expected_okcupid_height <- .6 * 69.6 + .4 * 64  # based on census data
observed_okcupid_height <- mean(profiles$height, na.rm = TRUE)
expected_okcupid_height
observed_okcupid_height
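The expected height is a weighted average of the two group means. As a sketch, the same calculation can be written with weighted.mean(), with the weights computed from a vector of sex labels instead of typed in by hand. The `toy_sex` vector is made up so that it has the 60/40 split quoted above, and it is not the real data.

```r
# Average heights in inches quoted in the text
us_heights <- c(m = 69.6, f = 64)

# Hypothetical sex labels (made-up values, not from the OkCupid data)
toy_sex <- c("m", "m", "f", "m", "f")

# Proportion of each sex, reordered to match the order of us_heights
sex_props <- prop.table(table(toy_sex))[names(us_heights)]
sex_props

# Weighted average of the two group means
weighted.mean(us_heights, w = sex_props)

# With the weights typed in directly, this is .6 * 69.6 + .4 * 64
weighted.mean(us_heights, w = c(0.6, 0.4))
```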
We can plot histograms of heights using the hist() function.
hist(profiles$height)
hist(profiles$height, breaks = 50)
We can add lines to our plots using the abline() function. For example abline(v = 60) would add a vertical line at the value of 60. Can you add a vertical line at the average OkCupid user's height?
hist(profiles$height, breaks = 50, xlim = c(50, 90))
abline(v = mean(profiles$height, na.rm = TRUE), col = "red")
Boxplots visually show a version of a five-number summary (min, Q1, median, Q3, max). We can create boxplots using the boxplot() function.
Create a boxplot of OkCupid users' heights.
boxplot(profiles$height, ylab = "Heights (inches)", main = "OkCupid users' heights")
If there are extreme outliers in a plot we need to investigate them. If they are errors we can remove them, otherwise we need to take them into account.
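One way to start investigating is to list the values that the boxplot draws as individual points. The sketch below does this with boxplot.stats() on a made-up vector. The name `toy_height` and its values, including the two suspicious ones, are hypothetical.

```r
# Hypothetical heights in inches, including two suspicious values (made-up)
toy_height <- c(62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 4, 95)

# Five-number summary: min, lower hinge, median, upper hinge, max
fivenum(toy_height)

# Values lying more than 1.5 box lengths beyond the box,
# i.e. the points that boxplot() draws individually
outlier_values <- boxplot.stats(toy_height)$out
outlier_values

# Positions of those values, so the corresponding cases can be inspected
outlier_positions <- which(toy_height %in% outlier_values)
outlier_positions
```

Once the positions are known, the full rows for those cases can be viewed to judge whether a value is a data entry error or a genuine but unusual observation.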
Let's now look at data from CitiBike in New York City. How many cases are there and how many variables? What does each case correspond to?
# download.file("https://yale.box.com/shared/static/t3ezfphfg729x03079aajop0d3f454wm.rda", "daily_bike_totals.rda")
load("daily_bike_totals.rda")
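The functions below are one way to answer questions about the number of cases and variables. The sketch runs on a small made-up data frame so that it works without the downloaded file. The name `toy_bike` and its values are hypothetical. The same calls can be applied to bike_daily_data, the object used in the plotting code in this document.

```r
# Hypothetical data frame with a similar layout (made-up values)
toy_bike <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  trips = c(1200, 1500, 900),
  max_temperature = c(40, 45, 38)
)

nrow(toy_bike)   # number of cases (rows)
ncol(toy_bike)   # number of variables (columns)
dim(toy_bike)    # rows and columns together
names(toy_bike)  # variable names
head(toy_bike)   # first few cases
```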
Scatter plots show the relationships between two quantitative variables. We can use the plot(x, y) function to create scatter plots. Create a scatter plot of the maximum temperature as a function of the minimum temperature. Also create a line plot of the number of trips as a function of the date using the plot(x, y, type = "o") function.
# scatter plot
plot(bike_daily_data$min_temperature, bike_daily_data$max_temperature)
# line plot
plot(bike_daily_data$date, bike_daily_data$trips, type = "o")
Can you calculate the mean height of only the men in the data set?
Steps:
1. Create a vector that has everyone's height called height
2. Create a vector that lists the self-reported sex of each person called sex
3. Create a Boolean vector for just the males called is_male that is TRUE if an individual's sex is male
4. Use the is_male vector to get just the heights of the men and store them in a vector called men_height
5. Take the mean of the men_height vector
# Load the data
profiles <- read.csv("profiles_revised.csv")
# View(profiles)
# We can extract the columns of a data frame as vector objects using the $ symbol
height <- profiles$height
# Get the sex of each individual
sex <- profiles$sex
# Create a Boolean vector that is TRUE if an individual is male
# use == to test if values are equal
is_male <- sex == "m"
# Get the heights of the men
men_height <- height[is_male]
# Get the mean height of men
mean(men_height, na.rm = TRUE)
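The Boolean indexing steps can be tried on made-up vectors to see what each intermediate object looks like. The names `toy_height` and `toy_sex` and their values are hypothetical. The last line shows tapply() as an alternative that computes the mean for every group at once. It is a different technique from the steps above, not a replacement for practicing them.

```r
# Hypothetical heights (inches) and self-reported sex (made-up values)
toy_height <- c(70, 64, 72, NA, 66, 68)
toy_sex    <- c("m", "f", "m", "m", "f", "m")

# == compares element by element and returns a Boolean vector
toy_is_male <- toy_sex == "m"
toy_is_male

# Keep only the heights where toy_is_male is TRUE
toy_men_height <- toy_height[toy_is_male]
mean(toy_men_height, na.rm = TRUE)

# tapply() splits the heights by sex and takes the mean of each group
group_means <- tapply(toy_height, toy_sex, mean, na.rm = TRUE)
group_means
```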