$\$

# install.packages("gapminder")

SDS230::download_data("amazon.rda")

SDS230::download_image("qq-plot.png")

# devtools::install_github("dill/emoGG")
# install.packages("latex2exp")

library(latex2exp)

library(dplyr)


#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

# hide all plot output - useful for printing the code
# knitr::opts_chunk$set(fig.show='hide')

set.seed(123)

$\$

Overview

$\$

Part 1: Review and continuation of ggplot

ggplot2 is an R package created by Hadley Wickham that implements Leland Wilkinson's concept of a grammar of graphics. In the grammar of graphics (and ggplot), data graphics are composed of:

  1. The data being visualized

  2. Aesthetic mappings from variables in the data to visual properties such as position, color, and size

  3. Geometric objects (geoms) such as points, lines, and bars

  4. Scales that control how the mapped values are displayed

  5. Optional facets, statistics, and themes
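To make this concrete, here is a minimal sketch of the grammar using the built-in mtcars data (an illustrative example):

library(ggplot2)

# data: mtcars; aesthetic mappings: wt to x, mpg to y; geom: points
# (the default scales and theme fill in the rest of the grammar)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()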

$\$

Part 1.1: Multiple layers

We can also have multiple geom layers on a single graph by using the + symbol; e.g., ggplot(…) + geom_type1() + geom_type2().

library(ggplot2)
library(gapminder)


# Create a scatter plot of miles per gallon as a function of weight, then add a
# smoothed line using geom_smooth() and a vertical line using geom_vline()

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth() +
  geom_vline(xintercept = 3)

$\$

Part 1.2: Changing scales

Each visual attribute that has an aesthetic mapping has a default scale. We can change the scales used for each mapping using functions that start with scale_.

For example, we can change the x-scale from linear to logarithmic using scale_x_continuous(trans='log10'). Likewise, we can change the color scale using scale_color_manual().

Let's use the gapminder data to construct a visualization using this grammar that has:

  1. A basic frame constructed by calling the ggplot() function

  2. A geom of points where:

     - gdpPercap is mapped to the x position
     - lifeExp is mapped to the y position
     - continent is mapped to color

  3. Transformed scales, where:

     - the x values are put on a $log_{10}$ scale using scale_x_continuous()
     - the colors are chosen manually using scale_color_manual()

# changing the scale on the x-axis
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = .2) +
  scale_x_continuous(trans = 'log10')


# mapping continents to colors, and adding my own color scale
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point(alpha = .2) +
  scale_x_continuous(trans = 'log10') +
  scale_color_manual(values = c("red", "yellow", "green", "blue", "purple"))

$\$

Part 1.3: Color scales using scale_color_brewer

Let's try the color scale using a different qualitative color brewer palette.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, col = continent)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  scale_color_brewer(type = "qual", palette = 2)    # change the color scale
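To browse the available color brewer palettes (grouped into sequential, diverging, and qualitative families), we can display them all using the RColorBrewer package. This is just a quick aside; RColorBrewer is a separate package that may need to be installed first.

# install.packages("RColorBrewer")

# display all color brewer palettes, grouped by type
RColorBrewer::display.brewer.all()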

$\$

Part 1.4: Themes

We can also use different themes to change the appearance of our plot.

# Add theme_classic() to our plot

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  xlab("Weight") +
  ylab("Miles per Gallon") +
  theme_classic() +
  theme(                                              # modify the theme by:
        axis.text.y = element_blank(),                # - turning off the y-axis text
        plot.background = element_rect(fill = "red")  # - making the background red
        )                                             # see ?theme for more options


# install.packages("ggthemes")
library(ggthemes)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("Cars!") +
  theme_fivethirtyeight()
  # theme_wsj()

$\$

Part 2: Additional ggplot themes and features

$\$

Part 2.1: Plotly

The plotly package can be used to create interactive visualizations.

Let's recreate an interactive version of the gapminder data from 2007 using geom_point(), mapping country to a name aesthetic so that hovering over a point shows the country.

We will save our plot to an object g and then we can use ggplotly(g) to create an interactive visualization.

library(gapminder)
library(plotly)


g <- gapminder |> 
  filter(year == 2007) |>   
  ggplot(aes(x = gdpPercap, y = lifeExp, col = continent, name = country)) +      
  geom_point()   

ggplotly(g)

$\$

Part 2.2: emojis

Everyone loves emojis!

Let's use the Motor Trend car data (mtcars) to plot miles per gallon (mpg) as a function of weight using the proper glyph.

# devtools::install_github("dill/emoGG")
library(emoGG)
ggplot(mtcars, aes(wt, mpg)) +
  geom_emoji(emoji = "1f697")



# another option...
#install.packages("emojifont")
# 
# library(emojifont)
# 
# load.emojifont('OpenSansEmoji.ttf')
# 
# 
# ggplot(mtcars, aes(x = wt, y = mpg, label = emoji("car"), col = factor(cyl))) +       
#   geom_text(family="OpenSansEmoji", size=6)  +    
#   xlab("Weight") +   
#   ylab("Miles per Gallon") + 
#   theme_classic()

$\$

Part 2.3: animations

We can create animated GIFs too...

#install.packages('png')
#install.packages('gifski')
#install.packages('gganimate')


# library(gganimate)
# 
# 
# ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
#   geom_point(alpha = 0.7, show.legend = FALSE) +
#   scale_colour_manual(values = country_colors) +
#   scale_size(range = c(2, 12)) +
#   scale_x_log10() +
#   facet_wrap(~continent) +
#   # Here comes the gganimate specific bits
#   labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
#   transition_time(year) +
#   ease_aes('linear')

$\$

Part 3: Visual hypothesis tests

In "visual hypothesis tests", we assess whether we can identify which plot contains relationships in the data, from plots that show scrambled versions of the data. For more information about these tests, and an R package that implements these tests, see:

$\$

Part 3.1: An example problem

Let's run a visual hypothesis test to see if there is a correlation between the number of pages in a book and the price. We can start by stating the null and alternative hypotheses:

In words:

Null hypothesis: there is no relationship between the number of pages in a book and the price.

Alternative hypothesis: books that have more pages have higher prices.

In symbols:

$H_0: \rho = 0$

$H_A: \rho > 0$

Let's now run the visual hypothesis test.

$\$

Part 3.2: Running the analysis

Let's use the nullabor package to run the analysis. We will use the null_permute() function to randomly shuffle the List.Price variable, and we will use the lineup() function to create a data frame that has the real data along with 19 randomly shuffled data sets. We can then plot the real and shuffled data sets using ggplot() to see if we can identify the real one. We can also use the decrypt() function to reveal which data set is the real one.

# install.packages("nullabor")

library(nullabor)
library(ggplot2)

load("amazon.rda")

d <- lineup(null_permute("List.Price"), amazon)

ggplot(data = d, aes(x = NumPages, y = List.Price)) +
  geom_point() +
  facet_wrap(~ .sample)
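After we make our guess, we can check which panel holds the real data using decrypt(). When lineup() runs, it prints a decrypt("...") call to the console; pasting that call back in reveals the answer. The string is generated fresh for each lineup() call, so the line below is only a placeholder.

# paste in the exact decrypt("...") call that lineup() printed, e.g.:
# decrypt("....")   # placeholder: copy the string from your console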

The nullabor package can be used to examine other types of relationships, including examining differences in word usage (word clouds) and assessing whether there are trends on a map.

$\$

Part 4: Writing functions in R and conditional statements

Part 4.1: Writing functions

We can write functions in R using the function() function!

Let's write a function called cube_it() that takes a value x and returns x cubed.

# the square root function
sqrt(49)



# a function that takes a value to the third power
cube_it <- function(x) {

  x^3

}


cube_it(2)

$\$

Part 4.2: Additional function arguments

We can specify additional arguments which can also have default values.

Let's write a function called power_it() which takes a value x and a value pow and returns x raised to the power pow. Let's also have the default value of the pow argument be 3.

# a function that takes a value to the third power
power_it <- function(x, pow = 3) {

  x^pow

}



power_it(2)

power_it(2, 8)

$\$

Part 4.3: Writing our own permutation test function for correlation

Let's write our permutation test for testing whether there is a significant correlation in data.

We can then test it to see if there is a correlation between the price of a book and the number of pages in a book.

# write a cor_permtest function that tests whether there is a significant correlation
# between the values in two vectors
cor_permtest <- function(vec1, vec2, n_null = 10000) {


  # calculate the observed statistic
  obs_stat <- cor(vec1, vec2)


  # create the null distribution 
  null_dist <- rep(NA, n_null)
  for (i in 1:n_null) {
    null_dist[i] <- cor(vec1, sample(vec2))
  }


  # Calculate the two-tailed p-value
  pval_left <- mean(null_dist <= -1 * abs(obs_stat))
  pval_right <- mean(null_dist >= abs(obs_stat))

  pval <- pval_left + pval_right


  # return the p-value
  pval


}




# test the function to see if there is a correlation between the number of pages in a book
# and the price of a book 
load("amazon.rda")


cor_permtest(amazon$NumPages, amazon$List.Price)

Try this at home: write a function that can run a permutation test comparing the means of two different data samples (e.g., a function that can assess whether cognitive abilities differ depending on whether participants took a Ginkgo supplement or a placebo).
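If you get stuck, here is one possible sketch of such a function. The name mean_permtest() and its arguments are just illustrative choices, mirroring the structure of cor_permtest() above.

# a permutation test for a difference in means between two samples
mean_permtest <- function(group1, group2, n_null = 10000) {

  # calculate the observed statistic
  obs_stat <- mean(group1) - mean(group2)

  # create the null distribution by shuffling the group labels
  pooled <- c(group1, group2)
  n1 <- length(group1)

  null_dist <- rep(NA, n_null)
  for (i in 1:n_null) {
    shuffled <- sample(pooled)
    null_dist[i] <- mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
  }

  # return the two-tailed p-value
  mean(abs(null_dist) >= abs(obs_stat))
}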

$\$

Part 5: Examining how data is distributed using quantile-quantile plots

We can check whether empirical data appears to come from a particular distribution using quantile-quantile plots (qqplots). A qqplot plots the sorted data values in your sample as a function of the theoretical quantile values (i.e., the quantiles at evenly spaced probabilities).

Below is an illustration of quantile values for 10 data points that come from a normal distribution. If we have 10 data points in our sample, then to create a qqplot comparing our data to a standard normal distribution we would plot our data as a function of these theoretical quantile values. If the plot falls along a diagonal line, this indicates our data is consistent with the standard normal distribution.

Also see this explanation
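As a quick check on this idea, we can compute the theoretical quantiles for a sample of 10 points ourselves. ppoints(10) gives 10 evenly spaced probabilities that avoid 0 and 1, and qnorm() converts them to standard normal quantiles; this is just a small illustration of the quantities the plot is built from.

# theoretical standard normal quantiles for 10 evenly spaced probabilities
qnorm(ppoints(10))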

$\$

Part 5.1

Let's create a qqplot to assess whether the thicknesses of books on Amazon are normally distributed.

# get the thicknesses of the books in the Amazon data
load("amazon.rda")
thickness <- amazon$Thick



# view a histogram of book thicknesses
hist(thickness, breaks = 20)


# create a sequence of evenly spaced probability values between 0 and 1
# (ppoints() avoids 0 and 1 themselves, where qnorm() returns -Inf and Inf)
prob_area_vals <- ppoints(length(thickness))


# get the quantiles from these values
quantile_vals <- qnorm(prob_area_vals)


# create the qqplot
plot(quantile_vals, sort(thickness),
     xlab = "Normal quantiles",
     ylab = "Data quantiles",
     main = "Quantile-quantile plot")

$\$

Part 5.2

We can also use the qqnorm() function to do this more easily when comparing data to the normal distribution. Or, if we want a better visualization, we can use the qqPlot() function in the car package.

qqnorm(thickness)

#install.packages("car")
car::qqPlot(thickness)

This data is pretty normal, as we can see in the plots above. Let's look at some highly skewed data.

# data that is skewed to the right
exp_data <- rexp(1000)

hist(exp_data, breaks = 50, 
     main = "Data from an exponential distribution", 
     xlab = "Data values")

qqnorm(exp_data)



# data that is skewed to the left
exp_data <- -1 * rexp(1000)

hist(exp_data, breaks = 50, 
     main = "Left skewed data", 
     xlab = "Data values")

qqnorm(exp_data)

