knitr::opts_chunk$set(echo = TRUE)
If you have not done so already (or if you're working in a new directory), please download the files needed for the next two classes by pressing the green play arrow once. Once the files are downloaded, you do not need to run this again.
On the homework, you will also need to start by downloading files, by pressing the green play arrow on an R chunk at the top of the document.
# If you don't have the SDS230 package working yet, you can download the files using the following commands
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/images/sds.png", "sds.png", mode = "wb")
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/daily_bike_totals.rda", "daily_bike_totals.rda", mode = "wb")
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")
Let's now continue learning R by looking at data frames. Data frames are structured data and can also be thought of as collections of vectors.
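To make the "collection of vectors" idea concrete, here is a minimal sketch that builds a small data frame from two vectors. The names toy_ages, toy_sex, and toy_profiles, and the values in them, are made up for illustration; they are not part of the OkCupid data.

```r
# Two vectors of the same length (made-up values for illustration)
toy_ages <- c(22, 35, 41)
toy_sex <- c("m", "f", "f")

# data.frame() combines equal-length vectors into columns
toy_profiles <- data.frame(age = toy_ages, sex = toy_sex)

# Each column can be pulled back out as a vector using the $ symbol
toy_profiles$age

# dim() reports the number of rows and columns
dim(toy_profiles)
```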
Let's look at data from the website OkCupid.
profiles <- read.csv("profiles_revised.csv")
#View(profiles)  # the View() function only works in RStudio!

# We can extract the columns of a data frame as vector objects using the $ symbol
the_ages <- profiles$age

# We can get the mean() age of OkCupid users
mean(the_ages)
We can extract rows from a data frame in much the same way that we extract values from a vector, by using square brackets.
profiles[1, ]  # returns the first row of the data frame

# we are using the head() function here so that we don't print out too much stuff!
head(profiles[, 1])  # returns the first column of the data

# Note: the first column of the profiles data frame is the variable age,
# so we can also get the first column using:
head(profiles$age)  # this is the same as profiles[, 1]
We can also create vectors of numbers or Booleans that specify which rows we want to extract from a data frame.
# create a vector with the numbers 1, 10, 20
my_vec <- c(1, 10, 20)

# use my_vec to get the 1st, 10th, and 20th profile
small_profiles <- profiles[my_vec, ]
Finally, we can also extract rows by creating a Boolean vector that has the same length as the number of rows in the data frame. Rows that line up with TRUE values will be extracted from the data frame, while rows that line up with FALSE values will not.
# create a vector of Booleans
my_bools <- c(TRUE, FALSE, TRUE)

# use the Boolean vector to get the 1st and 3rd rows
small_profiles[my_bools, ]

# dim() gives us the number of rows and columns in a data frame
dim(small_profiles)
dim(small_profiles[my_bools, ])
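In practice, Boolean vectors are usually created with a comparison rather than typed by hand. The sketch below uses a made-up data frame (toy_profiles is a hypothetical name, not the OkCupid data) to keep only the rows where age is over 30.

```r
# Made-up data for illustration
toy_profiles <- data.frame(age = c(22, 35, 41, 28),
                           drinks = c("socially", "rarely", "socially", "often"))

# The comparison returns one TRUE/FALSE value per row
over_30 <- toy_profiles$age > 30
over_30

# Use the Boolean vector to keep only the rows where the value is TRUE
older_profiles <- toy_profiles[over_30, ]
dim(older_profiles)
```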
Categorical variables take on one of a fixed number of possible values.
For categorical variables, we usually want to view how many items fall in each category, or what proportion of the items fall in each category.
Let's examine categorical data by looking at drinking behavior on OkCupid.
# Get information about drinking behavior
drinking_vec <- profiles$drinks

# Create a table showing how often people drink
drinks_table <- table(drinking_vec)
drinks_table
We can create a relative frequency table using the prop.table() function.
Can you create a relative frequency table for the drinking behavior of the people in the OkCupid data set in the R chunk below?
drinks_table <- table(profiles$drinks)
prop.table(drinks_table)
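As a rough check on what a relative frequency table contains, the sketch below computes the proportions by hand on a small made-up vector (toy_drinks is a hypothetical name) and compares them with prop.table().

```r
# Made-up responses for illustration
toy_drinks <- c("socially", "rarely", "socially", "often", "socially")

toy_table <- table(toy_drinks)

# Relative frequency = count in each category / total count
by_hand <- toy_table / sum(toy_table)

by_hand
prop.table(toy_table)

# The proportions in a relative frequency table sum to 1
sum(prop.table(toy_table))
```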
We can plot the number of items in each category using a bar plot, which is created with the barplot() function.
Can you create a bar plot for the drinking behavior of the people in the OkCupid data set?
drinks_table <- table(profiles$drinks)
barplot(drinks_table)
Is there a problem with using the bar plot function without any of the extra arguments?
Can you figure out how to fix your plot?
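One possible fix, assuming the problem is that some category labels are dropped when they do not fit under the bars, is to rotate the labels with las = 2 and shrink them with cex.names. The sketch uses a made-up table (toy_table is a hypothetical name) so that it runs on its own; the counts are not from the OkCupid data.

```r
# Made-up counts with long category names (not real OkCupid values)
toy_table <- table(c(rep("not at all", 3), rep("socially", 10),
                     rep("very often", 2), rep("desperately", 1)))

# las = 2 draws the axis labels perpendicular to the axis;
# cex.names scales the size of the bar labels
bar_positions <- barplot(toy_table,
                         las = 2,
                         cex.names = 0.8,
                         ylab = "Number of users",
                         main = "Drinking behavior (toy data)")
```

With long labels, the bottom margin may also need to be enlarged, for example with par(mar = ...), before calling barplot().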
We can also create pie charts using the pie() function.
pie(prop.table(table(profiles$sex, useNA = "always")))
Some pie charts are more informative than others
Our plots are dominated by social drinkers - let's remove them...
# keep only the categories with fewer than 10000 users (this drops the social drinkers)
nonsocial_inds <- drinks_table < 10000
nonsocial_drinks_table <- drinks_table[nonsocial_inds]

pie(nonsocial_drinks_table)
barplot(nonsocial_drinks_table)
There are other websites/apps for dating as well
There are several summary statistics that are useful for describing quantitative data, such as the mean and the median. Use the mean() and median() functions to compute measures of central tendency for OkCupid users' heights.
What went wrong?
We can ignore missing values using the na.rm = TRUE argument.
mean(profiles$height, na.rm = TRUE)
median(profiles$height, na.rm = TRUE)
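The effect of missing values can be seen on a small made-up vector (toy_heights is a hypothetical name, not the OkCupid data). This is a minimal sketch of how NA values propagate through mean() and how na.rm = TRUE changes the result.

```r
# Made-up heights with one missing value
toy_heights <- c(60, NA, 70, 65)

# Any arithmetic involving NA gives NA, so the mean is NA
mean(toy_heights)

# na.rm = TRUE drops the missing values before computing
mean(toy_heights, na.rm = TRUE)
median(toy_heights, na.rm = TRUE)

# is.na() can be used to count how many values are missing
sum(is.na(toy_heights))
```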
Fact: the average height of males in the US is 69.6", and of females is 64". Also, 60% of the OkCupid users in our data set are male. Is the height of the average OkCupid user what we would expect from the US population? Would we expect them to be the same?
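# expected height if OkCupid users matched the US averages, weighted by the 60/40 male/female split
expected_okcupid_height <- 0.6 * 69.6 + 0.4 * 64
observed_okcupid_height <- mean(profiles$height, na.rm = TRUE)

expected_okcupid_height
observed_okcupid_height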
We can plot histograms of heights using the hist() function.
hist(profiles$height)
hist(profiles$height, breaks = 50)
We can add lines to our plots using the abline() function. For example abline(v = 60) would add a vertical line at the value of 60. Can you add a vertical line at the average OkCupid user's height?
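# breaks = 50 is used here to match the earlier histogram (nclass = 50 is an equivalent older spelling)
hist(profiles$height, breaks = 50, xlim = c(50, 90))
abline(v = mean(profiles$height, na.rm = TRUE), col = "red")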
Boxplots visually show a version of a 5 number summary (min, Q1, median, Q3, max). We can create boxplots using the boxplot() function.
Create a boxplot of OkCupid users' heights.
boxplot(profiles$height, ylab = "Heights (inches)", main = "OkCupid users' heights")
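To connect the boxplot to its 5 number summary, the sketch below computes the summary on a small made-up vector (toy_heights is a hypothetical name). fivenum() returns Tukey's five numbers, whose hinges can differ slightly from the quartiles reported by quantile().

```r
# Made-up heights for illustration
toy_heights <- c(60, 62, 64, 66, 68, 70, 72)

# min, lower hinge, median, upper hinge, max
toy_summary <- fivenum(toy_heights)
toy_summary

# boxplot() draws the same summary; outliers, if any, appear as separate points
boxplot(toy_heights, ylab = "Heights (inches)", main = "Toy heights")
```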
If there are extreme outliers in a plot we need to investigate them. If they are errors we can remove them, otherwise we need to take them into account.
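A minimal sketch of how one might start investigating outliers, using a made-up vector with one implausible value (toy_heights is a hypothetical name). boxplot.stats() lists the points a boxplot would draw beyond its whiskers; whether such a point is an error is a judgment call that depends on the data.

```r
# Made-up heights; the last value looks like a data entry error
toy_heights <- c(62, 64, 65, 66, 67, 68, 70, 3)

# $out holds the values lying beyond the boxplot whiskers
flagged <- boxplot.stats(toy_heights)$out
flagged

# Find which entries were flagged so they can be inspected
flagged_positions <- which(toy_heights %in% flagged)
flagged_positions

# If we decide a value is an error, one option is to set it to NA
cleaned_heights <- toy_heights
cleaned_heights[flagged_positions] <- NA
mean(cleaned_heights, na.rm = TRUE)
```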
Let's now look at data from CitiBike in New York City. How many cases are there and how many variables? What does each case correspond to?
# download.file("https://yale.box.com/shared/static/t3ezfphfg729x03079aajop0d3f454wm.rda", "daily_bike_totals.rda")
load("daily_bike_totals.rda")
Scatter plots show the relationships between two quantitative variables. We can use the plot(x, y) function to create scatter plots. Create a scatter plot of the maximum temperature as a function of the minimum temperature. Also create a scatter plot of the number of trips as a function of the date.
plot(bike_daily_data$min_temperature, bike_daily_data$max_temperature)
plot(bike_daily_data$date, bike_daily_data$trips)
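The same plot(x, y) pattern can be tried on made-up data with axis labels added. In this sketch toy_bike_data and its values are hypothetical, and storing the date column as a Date object is an assumption about how the real data is organized.

```r
# Made-up daily data for illustration (not the CitiBike data)
toy_bike_data <- data.frame(
  date = as.Date("2024-01-01") + 0:4,
  min_temperature = c(28, 31, 35, 30, 40),
  max_temperature = c(39, 42, 50, 41, 55)
)

# plot(x, y): the first argument goes on the x-axis, the second on the y-axis
plot(toy_bike_data$min_temperature, toy_bike_data$max_temperature,
     xlab = "Minimum temperature", ylab = "Maximum temperature")

# A column of Date objects can be used on the x-axis as well
class(toy_bike_data$date)
plot(toy_bike_data$date, toy_bike_data$max_temperature,
     xlab = "Date", ylab = "Maximum temperature")
```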