In emeyers/SDS230: Tools for the class Data Exploration and Analysis

knitr::opts_chunk$set(echo = TRUE)

$\$

Download files

If you have not done so already (or if you're working in a new directory), please download the files needed for the next two classes. Press the green play arrow once to download them. Once they are downloaded, you do not need to run this again.

On the homework, you will also need to start by downloading files by pressing the green play arrow on an R chunk at the top of the document.

# If you don't have the SDS230 package working yet, you can download the files using the following commands


download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/daily_bike_totals.rda", "daily_bike_totals.rda", mode = "wb")

download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

$\$

Warm up exercise

The code below loads the OkCupid data

Can you write code that will print out the mean income of the 2nd, 4th, and 12th people in the data set?

Here are steps you can use:

Create a vector called income that has everyone's incomes
Create a vector called income3 that has the incomes of the 2nd, 4th and 12th person
Take the mean of the income3 vector.

# Load the data
profiles <- read.csv("profiles_revised.csv")

# View(profiles)


# Create a vector called `income` that has everyone's incomes
income <- profiles$income


# Create a vector called `income3` that has the incomes of the 2nd, 4th and 12th person
income3 <- income[c(2, 4, 12)]


# Take the mean of the `income3` vector
mean(income3)

$\$

Examining categorical data

Categorical variables take on one of a fixed number of possible values

For categorical variables, we usually want to view:

How many items are in each category, or
The proportion (or percentage) of items in each category

Let's examine categorical data by looking at drinking behavior on OkCupid

# Get information about drinking behavior
drinking_vec <- profiles$drinks

# Create a table showing how often people drink
drinks_table <- table(drinking_vec)
drinks_table

$\$

We can create a relative frequency table using the function: prop.table(my_table)

Can you create a relative frequency table for the drinking behavior of the people in the OkCupid data set in the R chunk below?

drinks_table <- table(profiles$drinks)

prop.table(drinks_table)
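
As a small self-contained illustration, the sketch below uses a made-up vector of responses (not the OkCupid data) to show what prop.table() does: it divides each count by the total, so the proportions sum to 1. Multiplying by 100 gives percentages. The names toy_drinks, toy_table, and toy_props are hypothetical.

```r
# Hypothetical responses, made up for illustration (not the OkCupid data)
toy_drinks <- c("socially", "often", "socially", "not at all", "socially")

# Counts in each category
toy_table <- table(toy_drinks)

# Proportions: each count divided by the total number of responses
toy_props <- prop.table(toy_table)
toy_props

# The same result computed by hand
toy_table / sum(toy_table)

# Percentages, rounded to one decimal place
round(100 * toy_props, 1)
```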

$\$

Bar plots (no pun intended?)

We can plot the number of items in each category using a bar plot: barplot(my_table)

Can you create a bar plot for the drinking behavior of the people in the OkCupid data set?

drinks_table <- table(profiles$drinks)

barplot(drinks_table)

$\$

Is there a problem with using the bar plot function without any of the extra arguments?

XKCD illustrates the point

Can you figure out how to fix your plot?
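
One possible fix, assuming the problem is that the axes are unlabeled, is to pass labels to barplot(). The sketch below uses a made-up table so that it runs on its own; the label text is only a suggestion, and toy_table is a hypothetical name.

```r
# Hypothetical counts, made up for illustration (not the OkCupid data)
toy_table <- table(c("socially", "often", "socially", "not at all", "socially"))

# xlab, ylab, and main set the axis labels and the plot title
barplot(toy_table,
        xlab = "Drinking behavior",
        ylab = "Number of people",
        main = "Self-reported drinking behavior")
```

With the class data, the same arguments could be added to the barplot(drinks_table) call.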

We can also create pie charts using the pie function

pie(prop.table(table(profiles$sex, useNA = "always")))

Some pie charts are more informative than others

$\$

Our plots are dominated by social drinkers - let's remove them...

nonsocial_inds <- drinks_table < 10000

nonsocial_drinks_table <- drinks_table[nonsocial_inds]

pie(nonsocial_drinks_table)

barplot(nonsocial_drinks_table)
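
To see how this kind of filtering works, here is a minimal sketch on a made-up table (not the OkCupid data). The cutoff of 5 is arbitrary and chosen to fit the toy counts; toy_table, keep_inds, and small_table are hypothetical names.

```r
# Hypothetical counts, made up for illustration (not the OkCupid data)
toy_table <- table(c(rep("socially", 6), rep("often", 2), rep("rarely", 3)))

# Comparing a table to a number gives one TRUE/FALSE value per category
keep_inds <- toy_table < 5
keep_inds

# Indexing with the Boolean vector keeps only the categories marked TRUE
small_table <- toy_table[keep_inds]
small_table
```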

There are other websites/apps for dating as well

$\$

Examining quantitative data

There are several summary statistics useful for describing quantitative data, such as the mean and the median. Use the mean() and median() functions to extract measures of central tendency for OkCupid users' heights.

mean(profiles$height)

What went wrong?

We can ignore missing values using the na.rm = TRUE argument

mean(profiles$height, na.rm = TRUE)

median(profiles$height, na.rm = TRUE)
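
The behavior of missing values can be seen on a small made-up vector (not the OkCupid data): any arithmetic involving NA returns NA, which is why the first call to mean() above did not return a number. The name toy_heights is hypothetical.

```r
# Hypothetical heights in inches, with one missing value (made up for illustration)
toy_heights <- c(64, 70, NA, 68)

# Any arithmetic involving NA is NA, so the mean is NA as well
mean(toy_heights)

# na.rm = TRUE drops the missing value before averaging
mean(toy_heights, na.rm = TRUE)

# Count how many values are missing
sum(is.na(toy_heights))
```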

Fact: the average height of males in the US is 69.6", and of females is 64". Also, 60% of the OkCupid users in our data set are male. Is the height of the average OkCupid user what we would expect from the US population? Would we expect them to be the same?

expected_okcupid_height <-  .6 * 69.6 + .4 * 64  # based on census data

observed_okcupid_height <- mean(profiles$height, na.rm = TRUE)

expected_okcupid_height
observed_okcupid_height
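
The expected height above is a weighted average of the two group means. As a sketch, the same number can be obtained with base R's weighted.mean() function, using only the figures quoted in the text. The variable names are hypothetical.

```r
# Figures quoted in the text above
male_height   <- 69.6   # average US male height (inches)
female_height <- 64     # average US female height (inches)
prop_male     <- 0.6    # share of users in the data set who are male

# Weighted average written out term by term
by_hand <- prop_male * male_height + (1 - prop_male) * female_height

# The same calculation using weighted.mean()
with_function <- weighted.mean(c(male_height, female_height),
                               w = c(prop_male, 1 - prop_male))

by_hand
with_function
```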

$\$

We can plot histograms of heights using the hist() function.

hist(profiles$height)

hist(profiles$height, breaks = 50)

$\$

We can add lines to our plots using the abline() function. For example abline(v = 60) would add a vertical line at the value of 60. Can you add a vertical line at the average OkCupid user's height?

hist(profiles$height, breaks = 50, xlim = c(50, 90))

abline(v = mean(profiles$height, na.rm = TRUE), col = "red")

$\$

Boxplots visually show a version of a 5 number summary (min, Q1, median, Q3, max). We can create boxplots using the boxplot() function.

Create a boxplot of OkCupid users' heights.

boxplot(profiles$height, ylab = "Heights (inches)", main = "OkCupid users' heights")

If there are extreme outliers in a plot we need to investigate them. If they are errors we can remove them, otherwise we need to take them into account.
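
As a rough illustration of the numbers behind a boxplot, the sketch below uses made-up heights (not the OkCupid data) with one deliberately extreme value. quantile() prints a five-number summary, and boxplot.stats() lists the points that boxplot() would draw individually beyond the whiskers. Note that quantile() and boxplot() can compute quartiles slightly differently. The name toy_heights is hypothetical.

```r
# Hypothetical heights in inches, made up for illustration; 95 is a deliberate outlier
toy_heights <- c(62, 64, 65, 66, 67, 68, 69, 70, 72, 95)

# Minimum, Q1, median, Q3, and maximum (using quantile()'s default method)
quantile(toy_heights)

# Points that boxplot() would flag as outliers
boxplot.stats(toy_heights)$out
```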

$\$

Let's now look at data from CitiBike in New York City. How many cases are there and how many variables? What does each case correspond to?

# download.file("https://yale.box.com/shared/static/t3ezfphfg729x03079aajop0d3f454wm.rda", "daily_bike_totals.rda")

load("daily_bike_totals.rda")
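
One way to answer these questions is with dim(), names(), and head(). The sketch below applies them to a small made-up data frame so that it runs on its own. The column names follow those used in the plotting code in these notes, but the values are invented. With the real data, the same calls would be applied to the object loaded from the .rda file (referred to as bike_daily_data in these notes).

```r
# Hypothetical stand-in for the bike data; the values are made up for illustration
toy_bike_data <- data.frame(
  date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03")),
  trips = c(5500, 18800, 24300),
  min_temperature = c(7, 13, 16),
  max_temperature = c(19, 26, 30)
)

dim(toy_bike_data)     # number of cases (rows), then number of variables (columns)
names(toy_bike_data)   # variable names
head(toy_bike_data)    # first few cases
```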

$\$

Scatter plots show the relationship between two quantitative variables. We can use the plot(x, y) function to create scatter plots. Create a scatter plot of the maximum temperature as a function of the minimum temperature. Also create a line plot of the number of trips as a function of the date using the plot(x, y, type = "o") function.

# scatter plot
plot(bike_daily_data$min_temperature, bike_daily_data$max_temperature)



# line plot
plot(bike_daily_data$date, bike_daily_data$trips, type = "o")

Challenge exercise

Can you calculate the mean height of only the men in the data set?

Steps:

1. Create a vector called height that has everyone's height
2. Create a vector called sex that lists the self-reported sex of each person
3. Create a Boolean vector called is_male that is TRUE if an individual's sex is male
4. Use the is_male vector to get just the heights of the men and store them in a vector called men_height
5. Take the mean of the men_height vector

# Load the data
profiles <- read.csv("profiles_revised.csv")

# View(profiles)


# We can extract the columns of a data frame as vector objects using the $ symbol
height <- profiles$height


# Get the sex of each individual
sex <- profiles$sex


# Create a Boolean vector that is TRUE if an individual is male
# use == to test if values are equal
is_male <- sex == "m"


# Get the heights of the men
men_height <- height[is_male]


# Get the mean height of men
mean(men_height, na.rm = TRUE)
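
To make the intermediate steps visible, here is the same idea as a minimal sketch on made-up data (not the OkCupid data). All of the toy_ names and values are hypothetical.

```r
# Hypothetical data, made up for illustration (not the OkCupid data)
toy_height <- c(70, 64, 68, NA, 62)
toy_sex    <- c("m", "f", "m", "m", "f")

# One TRUE/FALSE value per person
toy_is_male <- toy_sex == "m"
toy_is_male

# Keep only the heights where toy_is_male is TRUE (the NA is kept at this step)
toy_men_height <- toy_height[toy_is_male]
toy_men_height

# na.rm = TRUE drops the missing height before averaging
mean(toy_men_height, na.rm = TRUE)
```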