knitr::opts_chunk$set(echo = TRUE)


Download files

If you have not done so already (or if you're working in a new directory), please download the files needed for the next two classes. Press the green play arrow once to download the files. Once they are downloaded, you do not need to run this again.

On the homework, you will also need to start by downloading files by pressing the green play arrow on an R chunk at the top of the document.

# If you don't have the SDS230 package working yet, you can download the files using the following commands

download.file("", "sds.png", mode = "wb")

download.file("", "daily_bike_totals.rda", mode = "wb")

download.file("", "profiles_revised.csv", mode = "wb")


Back to learning R

Data frames

Let's now continue learning R by looking at data frames. A data frame holds structured (tabular) data and can be thought of as a collection of vectors of the same length, one vector per column.

Let's look at data from the website OkCupid.

profiles <- read.csv("profiles_revised.csv")

# the View() function only works in RStudio!

# We can extract the columns of a data frame as vector objects using the $ symbol

# We can get the mean() age of OKCupid users
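
To illustrate the two comments above without needing the downloaded file, here is a minimal sketch on a small made-up data frame. The name `toy_profiles` and its values are hypothetical stand-ins for the real `profiles` data.

```r
# Hypothetical toy data standing in for the profiles data frame
toy_profiles <- data.frame(
  age = c(22, 35, 38, 23, 29),
  drinks = c("socially", "often", "socially", "rarely", "socially")
)

# Extract the age column as a vector using the $ symbol
ages <- toy_profiles$age

# Compute the mean age of the toy users
mean_age <- mean(ages)
print(mean_age)
```

With the real data, the same pattern would be `profiles$age` followed by `mean()`.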


We can extract rows from a data frame in much the same way as we extract values from a vector: by using square brackets. Inside the brackets, the row index comes before the comma and the column index comes after it.

# get the first row of the data frame

# get the first column of the data frame

# Note: the first column of the profiles data frame is the variable age, so we can also get the first column using...

# we can use the head() function here so that we don't print out too much stuff!
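
The steps in the comments above can be sketched as follows. This uses a made-up data frame (`toy_profiles` is a hypothetical name, not the course data) whose first column is `age`, mirroring the note about the real `profiles` data.

```r
# Hypothetical toy data; the first column is age, as in the profiles data
toy_profiles <- data.frame(
  age = c(22, 35, 38, 23, 29),
  height = c(75, 70, 68, 71, 66)
)

# Get the first row: row index before the comma, nothing after it
first_row <- toy_profiles[1, ]

# Get the first column: nothing before the comma, column index after it
first_col <- toy_profiles[, 1]

# The first column is age, so the $ symbol gives the same vector
same_col <- toy_profiles$age

# head() limits how much is printed; here, only the first 3 rows
first_three <- head(toy_profiles, 3)
print(first_three)
```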


We can also create vectors of numbers or Booleans specifying which rows we want to extract from a data frame

# create a vector with the numbers 1, 10, 20

# use my_vec to get the 1st, 10th, and 20th profile
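
Here is a small sketch of indexing rows with a numeric vector. The data frame `toy_profiles` is hypothetical and only needs at least 20 rows for the example to work; `my_vec` is the name used in the comments above.

```r
# Hypothetical toy data with 25 rows so that rows 1, 10, and 20 all exist
toy_profiles <- data.frame(id = 1:25, age = 18 + 1:25)

# Create a vector with the numbers 1, 10, 20
my_vec <- c(1, 10, 20)

# Use my_vec to get the 1st, 10th, and 20th rows
selected <- toy_profiles[my_vec, ]
print(selected)
```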


Finally, we can also extract rows by creating a Boolean vector that is of the same length as the number of rows in the data frame. TRUE values will be extracted from the data frame while FALSE values will not.

# create a vector of Booleans

# use the Boolean vector to get the 1st and 3rd row 

# dim() gives us the number of rows and columns in a data frame
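
A minimal sketch of Boolean indexing and `dim()`, again on made-up data (the names `toy_profiles` and `my_bool` are hypothetical):

```r
# Hypothetical toy data with three rows
toy_profiles <- data.frame(
  age = c(22, 35, 38),
  height = c(75, 70, 68)
)

# Create a vector of Booleans, one per row
my_bool <- c(TRUE, FALSE, TRUE)

# Rows where my_bool is TRUE are kept: the 1st and 3rd
kept_rows <- toy_profiles[my_bool, ]
print(kept_rows)

# dim() returns the number of rows, then the number of columns
toy_dim <- dim(toy_profiles)
print(toy_dim)
```

In practice, the Boolean vector usually comes from a comparison such as `toy_profiles$age > 30` rather than being typed by hand.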


Examining categorical data

Categorical variables take on one of a fixed number of possible values.

For categorical variables, we usually want to view:

Let's examine categorical data by looking at drinking behavior on OkCupid.

# Get information about drinking behavior

# Create a table showing how often people drink


We can create a relative frequency table using the function: prop.table(my_table)
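
As a sketch of how `table()` and `prop.table()` fit together, here is an example on a short made-up vector. The vector `drinks` and its values are hypothetical and are not the OkCupid data.

```r
# Hypothetical drinking responses for eight made-up users
drinks <- c("socially", "often", "socially", "rarely",
            "socially", "not at all", "socially", "rarely")

# Frequency table: how many times each category occurs
drink_table <- table(drinks)
print(drink_table)

# Relative frequency table: each count divided by the total
rel_freq <- prop.table(drink_table)
print(rel_freq)
```

The relative frequencies always sum to 1, which is a handy sanity check.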

Can you create a relative frequency table for the drinking behavior of the people in the OkCupid data set in the R chunk below?


Bar plots (no pun intended?)

We can plot the number of items in each category using a bar plot: barplot(my_table)

Can you create a bar plot for the drinking behavior of the people in the OkCupid data set?


Is there a problem with using the bar plot function without any of the extra arguments?

XKCD illustrates the point

Can you figure out how to fix your plot?
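
One possible fix, assuming the problem is that the default plot has no title or axis labels, is to pass the `main`, `xlab`, and `ylab` arguments. The sketch below uses a made-up `drinks` vector (hypothetical data), and the label text is only a suggestion.

```r
# Hypothetical drinking responses for eight made-up users
drinks <- c("socially", "often", "socially", "rarely",
            "socially", "not at all", "socially", "rarely")
drink_table <- table(drinks)

# Bar plot with a title and labeled axes
# barplot() also returns the x positions of the bar centers
bar_midpoints <- barplot(
  drink_table,
  main = "Drinking behavior (toy data)",
  xlab = "Drinking behavior",
  ylab = "Number of profiles"
)
```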

We can also create pie charts using the pie function
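
A minimal sketch of `pie()` on a made-up table (the `drinks` vector is hypothetical data):

```r
# Hypothetical drinking responses for eight made-up users
drinks <- c("socially", "often", "socially", "rarely",
            "socially", "not at all", "socially", "rarely")
drink_table <- table(drinks)

# pie() takes the same kind of table that barplot() does
pie(drink_table, main = "Drinking behavior (toy data)")
```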

Some pie charts are more informative than others


Our plots are dominated by social drinkers - let's remove them...
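
One way to remove a category is to filter the vector with a Boolean comparison before building the table. This is a sketch on made-up data (the `drinks` vector is hypothetical), and it assumes the category label is exactly "socially".

```r
# Hypothetical drinking responses for eight made-up users
drinks <- c("socially", "often", "socially", "rarely",
            "socially", "not at all", "socially", "rarely")

# Keep only the responses that are not "socially"
non_social <- drinks[drinks != "socially"]

# Rebuild the table and the plot without the dominant category
non_social_table <- table(non_social)
print(non_social_table)
barplot(non_social_table, ylab = "Number of profiles")
```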

There are other websites/apps for dating as well


Examining quantitative data

Several summary statistics are useful for describing quantitative data, such as the mean and the median. Use the mean() and median() functions to compute measures of central tendency for OkCupid users' heights.

What went wrong?
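
A common cause of this kind of problem is missing values: if a vector contains any `NA`, then `mean()` and `median()` return `NA` by default. Assuming that is what happened here, the sketch below shows the behavior and the `na.rm = TRUE` fix on a made-up `heights` vector (hypothetical data).

```r
# Hypothetical heights in inches, with one missing value
heights <- c(75, 70, NA, 68, 71)

# With an NA present, the default result is NA
mean_with_na <- mean(heights)
print(mean_with_na)

# na.rm = TRUE drops missing values before computing
mean_height <- mean(heights, na.rm = TRUE)
median_height <- median(heights, na.rm = TRUE)
print(mean_height)
print(median_height)
```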

Fact: the average height of males in the US is 69.6", and of females is 64". Also, 60% of the OkCupid users in our data set are male. Is the height of the average OkCupid user what we would expect from the US population? Would we expect them to be the same?
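
To get a benchmark for comparison, we can compute the average height we would expect if OkCupid users had the same heights as the US population but with a 60/40 male/female split. This is a sketch that uses only the numbers quoted above; the variable names are made up for illustration.

```r
# Numbers quoted in the text above
male_avg <- 69.6     # average US male height in inches
female_avg <- 64     # average US female height in inches
prop_male <- 0.6     # proportion of users in the data set who are male

# Weighted average of the two group means
expected_height <- prop_male * male_avg + (1 - prop_male) * female_avg
print(expected_height)
```

This works out to about 67.36 inches, which can be compared with the mean height computed from the data.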


We can plot histograms of heights using the hist() function.


We can add lines to our plots using the abline() function. For example abline(v = 60) would add a vertical line at the value of 60. Can you add a vertical line at the average OkCupid user's height?
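
Here is a small sketch of `hist()` followed by `abline()` on a made-up `heights` vector (hypothetical data, not the OkCupid heights). The color and labels are arbitrary choices.

```r
# Hypothetical heights in inches
heights <- c(62, 64, 65, 66, 67, 68, 68, 70, 71, 74)

# Draw the histogram; hist() also returns the bin counts
height_hist <- hist(
  heights,
  main = "Heights (toy data)",
  xlab = "Height (inches)"
)

# Add a vertical line at the mean height
mean_height <- mean(heights)
abline(v = mean_height, col = "red")
```

With real data that contains missing values, `mean()` would need `na.rm = TRUE` for the line to appear.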


Boxplots visually show a version of a five-number summary (min, Q1, median, Q3, max). We can create boxplots using the boxplot() function.

Create a boxplot of OkCupid users' heights.
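
A minimal sketch of `boxplot()` next to `fivenum()`, on a made-up `heights` vector (hypothetical data). Note that `fivenum()` reports Tukey's hinges, which can differ slightly from the quartiles given by other functions.

```r
# Hypothetical heights in inches
heights <- c(60, 64, 66, 68, 70, 72, 76)

# Numeric five-number summary: min, lower hinge, median, upper hinge, max
five_num <- fivenum(heights)
print(five_num)

# Draw the boxplot; the returned list holds the plotted statistics
box_info <- boxplot(heights, ylab = "Height (inches)")
```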


Let's now look at data from CitiBike in New York City. How many cases are there and how many variables? What does each case correspond to?

# download.file("", "daily_bike_totals.rda")
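
Files ending in `.rda` are read with `load()`, which restores the saved objects and returns their names. Since the name of the object stored in `daily_bike_totals.rda` is not given here, the sketch below saves and reloads a made-up data frame (`toy_bikes` is a hypothetical name) to show the pattern, then counts cases and variables with `dim()`.

```r
# Hypothetical toy data, saved to a temporary .rda file
toy_bikes <- data.frame(
  date = as.Date("2022-01-01") + 0:2,
  trips = c(100, 150, 120)
)
rda_path <- tempfile(fileext = ".rda")
save(toy_bikes, file = rda_path)
rm(toy_bikes)

# load() restores the objects and returns their names
loaded_names <- load(rda_path)
print(loaded_names)

# Look up the restored data frame by name and count rows and columns
bike_data <- get(loaded_names[1])
bike_dim <- dim(bike_data)
print(bike_dim)
```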



Scatter plots show the relationships between two quantitative variables. We can use the plot(x, y) function to create scatter plots. Create a scatter plot of the maximum temperature as a function of the minimum temperature. Also create a scatter plot of the number of trips as a function of the date.
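
The two scatter plots can be sketched on made-up data as follows. The data frame `toy_bikes` and its column names (`date`, `min_temp`, `max_temp`, `trips`) are hypothetical; the columns in the real CitiBike data may be named differently.

```r
# Hypothetical daily data for five days
toy_bikes <- data.frame(
  date = as.Date("2022-01-01") + 0:4,
  min_temp = c(28, 31, 35, 30, 26),
  max_temp = c(39, 45, 48, 41, 36),
  trips = c(1200, 1500, 1800, 1400, 1100)
)

# Maximum temperature as a function of minimum temperature:
# the first argument goes on the x-axis, the second on the y-axis
plot(toy_bikes$min_temp, toy_bikes$max_temp,
     xlab = "Minimum temperature", ylab = "Maximum temperature")

# Number of trips as a function of the date
plot(toy_bikes$date, toy_bikes$trips,
     xlab = "Date", ylab = "Number of trips")
```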

emeyers/SDS230 documentation built on Jan. 13, 2023, 5:16 a.m.