```r
# install.packages("gapminder")
# install.packages("latex2exp")
SDS230::download_data("amazon.rda")
SDS230::download_image("qq-plot.png")
devtools::install_github("dill/emoGG")

library(latex2exp)
library(dplyr)

# options(scipen = 999)
knitr::opts_chunk$set(echo = TRUE)

# hide all plot output - useful for printing the code
# knitr::opts_chunk$set(fig.show='hide')

set.seed(123)
```
ggplot2 is an R package created by Hadley Wickham that implements Leland Wilkinson's concept of a grammar of graphics. In the grammar of graphics (and in ggplot2), data graphics are composed of:
- Data
- Aesthetic mappings from data variables to visual properties (position, color, etc.)
- Geometric objects (geoms) that display the data
- Scales that control how the mappings are rendered
- Facets, themes, and annotations that refine the plot
Let's use the gapminder data to construct a visualization using this grammar that has:
- A basic frame constructed by calling the `ggplot()` function
- A geom of points where:
  - `gdpPercap` is mapped to the x position
  - `lifeExp` is mapped to the y position
  - `continent` is mapped to color
- Transformed scales: a log10 scale on the x axis and a qualitative Brewer color palette
```r
library(ggplot2)
library(gapminder)

ggplot(data = gapminder) +                                          # create the basic frame
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent)) +  # add a geom with 3 aesthetic mappings
  scale_x_continuous(trans = "log10") +                             # change the scales
  scale_color_brewer(type = "qual", palette = 2)
```
Let's continue to use the gapminder data to construct a visualization using this grammar that has:
- The gapminder data filtered to only have data from France, Haiti, and the United States
- A geom of lines where:
  - `year` is mapped to the x position
  - `lifeExp` is mapped to the y position
- Faceting based on the country
- A second glyph layer with a red vertical line at the year 1969
- The Wall Street Journal theme from the ggthemes package
```r
library(dplyr)
library(ggthemes)

gapminder |>
  filter(country %in% c("France", "Haiti", "United States")) |>
  ggplot(aes(year, lifeExp)) +
  geom_line() +
  facet_wrap(~country) +                       # facet by country
  geom_vline(xintercept = 1969, col = "red") + # add a second glyph layer
  theme_wsj()
```
We can also create:

- Text as a glyph using `geom_text()`
- Annotations using `annotate(geom_type, geom_properties)`
- Manual modifications of the theme using `theme()` with different arguments
Let's create a plot where:

- It only has data from 2007
- A text glyph has the following properties:
  - `gdpPercap` is mapped to the x position
  - `lifeExp` is mapped to the y position
  - `continent` is mapped to color
  - `country` is mapped to the text label
- The x axis uses a log10 scale
- The legend is turned off by modifying the theme
- A text annotation is added at x = 600, y = 80
```r
gapminder |>
  filter(year == 2007) |>
  ggplot(aes(x = gdpPercap, y = lifeExp, label = country, color = continent)) +
  geom_text() +
  scale_x_continuous(trans = "log10") +
  theme(legend.position = "none") +
  annotate("text", x = 600, y = 80, label = "this plot is messy :(")
```
The plotly package can be used to create interactive visualizations.
Let's recreate an interactive version of the gapminder data from 2007 using `geom_point()` and:

- `gdpPercap` is mapped to the x position
- `lifeExp` is mapped to the y position
- `continent` is mapped to color
- `country` is mapped to the name

We will save our plot to an object `g` and then we can use `ggplotly(g)` to create an interactive visualization.
```r
library(gapminder)
library(plotly)

g <- gapminder |>
  filter(year == 2007) |>
  ggplot(aes(x = gdpPercap, y = lifeExp, col = continent, name = country)) +
  geom_point()

ggplotly(g)
```
Everyone loves emojis!
Let's use the motor trends car data to plot miles per gallon (mpg) as a function of weight using the proper glyph.
```r
# devtools::install_github("dill/emoGG")
library(emoGG)

ggplot(mtcars, aes(wt, mpg)) +
  geom_emoji(emoji = "1f697")

# another option...
# install.packages("emojifont")
#
# library(emojifont)
# load.emojifont('OpenSansEmoji.ttf')
#
# ggplot(mtcars, aes(x = wt, y = mpg, label = emoji("car"), col = factor(cyl))) +
#   geom_text(family = "OpenSansEmoji", size = 6) +
#   xlab("Weight") +
#   ylab("Miles per Gallon") +
#   theme_classic()
```
We can create animated GIFs too...
```r
# install.packages('png')
# install.packages('gifski')
# install.packages('gganimate')
#
# library(gganimate)
#
# ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
#   geom_point(alpha = 0.7, show.legend = FALSE) +
#   scale_colour_manual(values = country_colors) +
#   scale_size(range = c(2, 12)) +
#   scale_x_log10() +
#   facet_wrap(~continent) +
#   # Here come the gganimate-specific bits
#   labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
#   transition_time(year) +
#   ease_aes('linear')
```
In "visual hypothesis tests", we assess whether we can identify which plot contains relationships in the data, from plots that show scrambled versions of the data. These tests are implemented in the nullabor R package, which we use below.
Let's run a visual hypothesis test to see if there is a correlation between the number of pages in a book and the price. We can start by stating the null and alternative hypothesis:
In words:
Null hypothesis: there is no relationship between the number of pages in a book and the price.
Alternative hypothesis: books that have more pages have higher prices.
In symbols:
$H_0: \rho = 0$
$H_A: \rho > 0$
Let's now run the visual hypothesis test, where:

- The "observed visualization" (step 2) is the plot of the actual data.
- The "null visualizations" (step 3) are visualizations of data where the relationship between the number of pages and the price has been shuffled.
- The "p-value/decision" (steps 4-5) is whether we can identify the real plot from the shuffled plots.
Let's use the nullabor package to run the analysis. We will use the `null_permute()` function to randomly shuffle the `List.Price` variable, and we will use `lineup()` to create a data frame that has the real data and 19 other randomly shuffled data sets. We can then plot the real and shuffled data sets using `ggplot()` to see if we can identify the real one. We can also use the `decrypt()` function to reveal the answer about which data set is the real one.
```r
# install.packages("nullabor")
library(nullabor)
library(ggplot2)

load("amazon.rda")

d <- lineup(null_permute("List.Price"), amazon)

ggplot(data = d, aes(x = NumPages, y = List.Price)) +
  geom_point() +
  facet_wrap(~ .sample)
```
The nullabor package can be used to examine other types of relationships, including examining differences in word usage (word clouds) and assessing whether there are trends on a map.
We can write functions in R using the `function()` function! Let's write a function called `cube_it()` that takes a value `x` and returns `x` cubed.
```r
# the square root function
sqrt(49)

# a function that takes a value to the third power
cube_it <- function(x) {
  x^3
}

cube_it(2)
```
We can specify additional arguments, which can also have default values. Let's write a function called `power_it()` which takes a value `x` and a value `pow` and returns `x` to the `pow` power. Let's also have the default value of the `pow` argument be 3.
```r
# a function that raises a value to the pow power (default is the third power)
power_it <- function(x, pow = 3) {
  x^pow
}

power_it(2)
power_it(2, 8)
```
Let's write our permutation test for testing whether there is a significant correlation in data.
We can then test it to see if there is a correlation between the price of a book and the number of pages in a book.
```r
# write a cor_permtest function that tests whether there is a significant
# correlation between the values in two vectors
cor_permtest <- function(vec1, vec2, n_null = 10000) {

  # calculate the observed statistic
  obs_stat <- cor(vec1, vec2)

  # create the null distribution
  null_dist <- NULL
  for (i in 1:n_null) {
    curr_null_stat <- cor(vec1, sample(vec2))
    null_dist[i] <- curr_null_stat
  }

  # calculate the two-tailed p-value
  pval_left <- mean(null_dist <= -1 * abs(obs_stat))
  pval_right <- mean(null_dist >= abs(obs_stat))
  pval <- pval_left + pval_right

  # return the p-value
  pval
}

# test the function to see if there is a correlation between the number of
# pages in a book and the price of a book
load("amazon.rda")
cor_permtest(amazon$NumPages, amazon$List.Price)
```
Try this at home: write a function that can run a permutation test comparing the means of two different data samples (e.g., a function that can run a permutation test to assess whether cognitive abilities differ depending on whether participants took a Ginkgo supplement or a placebo).
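A minimal sketch of that exercise follows; the function name `mean_permtest()`, its arguments, and the simulated data are our own choices, not part of the original notes.

```r
# A permutation test comparing two group means: shuffle the pooled values
# to build the null distribution of the difference in means.
mean_permtest <- function(group1, group2, n_null = 10000) {

  # observed statistic: the difference between the two sample means
  obs_stat <- mean(group1) - mean(group2)

  # pool the data, then repeatedly shuffle and resplit it into two groups
  combined <- c(group1, group2)
  n1 <- length(group1)
  null_dist <- replicate(n_null, {
    shuffled <- sample(combined)
    mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
  })

  # two-tailed p-value
  mean(abs(null_dist) >= abs(obs_stat))
}

# e.g., simulated ginkgo vs. placebo cognitive scores
set.seed(123)
ginkgo  <- rnorm(30, mean = 100, sd = 15)
placebo <- rnorm(30, mean = 100, sd = 15)
mean_permtest(ginkgo, placebo)
```

Because both simulated groups are drawn from the same distribution, the p-value here should be large; a real difference between the groups would push it toward zero.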
We can check if empirical data seems to come from a particular distribution using quantile-quantile plots (qqplots). qqplots plot the sorted data values in your sample as a function of the theoretical quantile values (at evenly spaced probability areas).
Below is an illustration of quantile values for 10 data points that come from a normal distribution. If we have 10 data points in our sample, then to create a qqplot comparing our data to a standard normal distribution we would plot our data as a function of these theoretical quantile values. If the plot falls along a diagonal line, this indicates our data comes from the standard normal distribution.
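To make "evenly spaced probability areas" concrete, here is a small sketch (our own illustration, not part of the original notes) that computes the theoretical standard normal quantiles for a sample of 10 points; base R's `ppoints()` generates the evenly spaced probabilities.

```r
# Theoretical standard normal quantiles for n = 10 data points.
# ppoints(n) returns n evenly spaced probabilities strictly inside (0, 1),
# and qnorm() converts each probability into the corresponding quantile.
n <- 10
probs <- ppoints(n)
theoretical_quantiles <- qnorm(probs)

# the quantiles are symmetric around 0, as expected for a standard normal
round(theoretical_quantiles, 2)
```

In a qqplot we would then plot `sort(our_data)` against `theoretical_quantiles` and look for a straight line.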
Let's create a qqplot to assess whether the thickness of books on Amazon are normally distributed.
```r
# load the Amazon book data and get the thickness of each book
load("amazon.rda")
thickness <- amazon$Thick

# view a histogram of book thicknesses
hist(thickness, breaks = 20)

# create a sequence of evenly spaced probability areas strictly between 0 and 1
# (ppoints() avoids the probabilities 0 and 1, whose normal quantiles are infinite)
prob_area_vals <- ppoints(length(thickness))

# get the quantiles at these probabilities
quantile_vals <- qnorm(prob_area_vals)

# create the qqplot
plot(quantile_vals, sort(thickness),
     xlab = "Normal quantiles",
     ylab = "Data quantiles",
     main = "Quantile-quantile plot")
```
We can also use the `qqnorm()` function to do this more easily when comparing data to the normal distribution. Or, if we want a better visualization, we can use the `qqPlot()` function in the car package.
```r
qqnorm(thickness)

# install.packages("car")
car::qqPlot(thickness)
```
This data is pretty normal, as we can see in the plots above. Let's look at some highly skewed data.
```r
# data that is skewed to the right
exp_data <- rexp(1000)
hist(exp_data, breaks = 50,
     main = "Data from an exponential distribution",
     xlab = "Data values")
qqnorm(exp_data)

# data that is skewed to the left
exp_data <- -1 * rexp(1000)
hist(exp_data, breaks = 50,
     main = "Left skewed data",
     xlab = "Data values")
qqnorm(exp_data)
```