I'm going to give some examples of how to do common analyses you'll need for this class. I won't be spending much time on the statistical assumptions or diagnostics.
```r
library(tidyverse)
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data
```
R has two ways of representing textual data: character vectors (also called strings) and factors.
For the most part, it's easier and safer to work with character vectors. Most of the functions we'll be using convert between them automatically when necessary.
One important thing to note is that `ggplot` arranges character vectors alphabetically on its categorical scales, but orders factors by their level number. Thus, to change the order of categorical x-axes (and other scales), you need to make your categories into a factor. This is done with `fct_inorder()`, `fct_infreq()`, and related functions, which are part of the `forcats` package and loaded with tidyverse. The easiest one to use is `fct_inorder()`, which changes the level values to be in the order of your data; when combined with `arrange()` and other `dplyr` functions, this is quite flexible and powerful. For more information on these and other factor functions, take a look at the forcats website.
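For instance, here's a minimal sketch (using the `lizards` data loaded above) of reordering a categorical axis by frequency, so the most common color morph appears first:

```r
lizards |>
  mutate(Color_morph = fct_infreq(Color_morph)) |> # reorder levels by how often they occur
  ggplot() +
  aes(x = Color_morph) +
  geom_bar() # bars now run from most to least common
```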
If you want to manually create a factor, you can use the `factor()` command, which is in base R (no package).

```r
color_levels = c("Red", "Green", "Blue")

# Randomly select 20 colors from color_levels
color_example = sample(color_levels, size = 20, replace = TRUE)
color_example

# Convert it into a factor
color_as_factor = factor(color_example, levels = color_levels)
color_as_factor
```
These tests generally compare the frequencies of count data.
The easiest way to make a contingency table is from a data frame where each column is a categorical variable you want in the table and each row is an observation. Let's say we wanted to create a color morph by perch type contingency table with our lizard data.

```r
color_by_perch_tbl <- lizards |>
  select(Color_morph, Perch_type) |> # select only the columns you want
  table() # feed them into the table command
color_by_perch_tbl
```
If you want to switch your contingency table from counts to frequencies, just divide it by its sum:

```r
color_by_perch_tbl / sum(color_by_perch_tbl)
```
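As an aside, base R's `prop.table()` does the same division for you, and its `margin` argument gives row-wise or column-wise proportions instead:

```r
prop.table(color_by_perch_tbl)             # same as dividing by the sum
prop.table(color_by_perch_tbl, margin = 1) # each row sums to 1
```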
Once you have a contingency table, you can use `chisq.test()` to test for independence between the rows and columns. This provides you with your test statistic (`X-squared`), degrees of freedom, and p-value.

```r
chisq.test(color_by_perch_tbl)
```
Chi-squared tests assume that there are at least five observations in each cell of the contingency table. If this assumption fails, the resulting values aren't accurate. For example:

```r
site_a_contingency <- lizards |>
  filter(Site == "A") |> # cut down the data size
  select(Color_morph, Perch_type) |> # select only the columns you want
  table() # feed them into the table command
print(site_a_contingency)

chisq.test(site_a_contingency)
```
Note the warning that "Chi-squared approximation may be incorrect." In this case, it's a good idea to run Fisher's exact test, which investigates the same null hypothesis but works with low counts. Fisher's test is less powerful than the chi-squared test, so it should only be used when it's the only option.

```r
fisher.test(site_a_contingency)
```
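If you want to see which cells triggered the warning, the result of `chisq.test()` stores the expected counts in its `expected` element; cells below five are the problem:

```r
# Inspect the expected counts behind the warning
suppressWarnings(chisq.test(site_a_contingency))$expected
```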
You can also run a chi-squared or Fisher's exact test directly on two vectors or data frame columns:

```r
chisq.test(lizards$Color_morph, lizards$Perch_type)
```
Linear regression and analysis of variance (ANOVA) are both special cases of the general linear model (LM), which fits a continuous response (y) to one or more predictors (x). You specify linear models in R with a formula syntax, which generally takes the form `response ~ predictor`. Combining this formula with the `lm()` function and a dataset gives you the basis of a linear model.

Let's say we wanted to see how snout-vent length (SVL) affects mass:

```r
simple_reg <- lm(Mass ~ SVL, data = lizards)
simple_reg
```
By default, this creates an LM object, which tells us the regression coefficients. For a linear regression (continuous response), these tell us the regression equation; in this case, that for every $1 \text{ mm}$ increase in SVL, mass increases by $`r coef(simple_reg)[2] |> round(3) |> as.numeric()` \text{ g}$. Note that in this case, it may make sense to re-scale SVL to be in cm so that the coefficient is easier to interpret (e.g., use `lizards |> mutate(SVL_cm = SVL/10)`). To extract the coefficients directly, use:

```r
coef(simple_reg)
```
For more information, use the `summary()` function:

```r
summary(simple_reg)
```
The most important components of this are the $R^2$ (which is listed as `Multiple R-squared`), the standard errors and p-values for each of your coefficients (the `Coefficients` section), and the overall F-statistic and p-value (the last line).
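If you also want confidence intervals for the coefficients (which the summary doesn't print), base R's `confint()` works on any `lm` object:

```r
confint(simple_reg, level = 0.95) # 95% CIs for the intercept and the SVL slope
```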
A quick note on p-values:
For a simple linear regression, `ggplot` can automatically plot the trendline (a.k.a. fitted values) and confidence intervals with `geom_smooth()`.

```r
lizards |>
  ggplot() +
  aes(x = SVL, y = Mass) +
  geom_point() +
  geom_smooth(method = "lm", # use a linear regression
              se = TRUE,     # include a confidence interval around the line
              level = .95)   # the level of the confidence interval; default = 95%
```
For more complicated models, this approach may not work very well; it can be helpful to calculate the fitted values & confidence intervals from the model object and plot them directly.
You can get these values directly with the `predict()` function. The code below calculates the trendline and 95% confidence interval for the regression, and adds them to a data frame with the original data.

```r
simple_reg_predictions <- simple_reg |>
  # Predict fitted values with 95% confidence intervals
  predict(interval = "confidence", level = .95) |>
  # The output of predict() is a matrix; that's hard to work with, so...
  as_tibble() # we convert the output of predict() to a data frame (tibble)

simple_reg_plot_data <- lizards |>
  select(SVL, Mass) |> # we only need these columns
  bind_cols(simple_reg_predictions) # adds columns of simple_reg_predictions to our lizards

View(simple_reg_plot_data)
```
To plot this, you'd use a combination of `geom_line()` for the fitted values and `geom_ribbon()`, which creates the confidence interval region.

```r
ggplot(simple_reg_plot_data) +
  aes(x = SVL) +
  geom_point(aes(y = Mass), color = "cornflowerblue") +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = alpha("black", .2), # dark fill with 80% transparency (alpha)
              color = grey(.4),          # dark-ish grey border line
              linetype = 2) +            # dotted border line
  geom_line(aes(y = fit))
```
You can also use `predict()` to calculate the expected value of your response variable given new predictor(s); this is useful for interpolation and extrapolation.

```r
# Create a data frame with new predictors
svl_predictors = tibble(SVL = c(20, 150, 600)) # New predictors

mass_predictions <- predict(simple_reg, newdata = svl_predictors)
# The newdata argument is key here; if unspecified, it uses the original data

svl_predictors |> mutate(Mass_estimate = mass_predictions)
```
For more information, see the help page `?predict.lm`.
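One related option worth knowing: `predict()` can also return prediction intervals (for individual new observations rather than the mean trend), which are wider than confidence intervals:

```r
predict(simple_reg, newdata = svl_predictors,
        interval = "prediction", level = 0.95) # intervals for individual lizards
```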
Analysis of variance (ANOVA) is a special case of the linear model in which all of the predictors are categorical. However, ANOVAs are usually treated differently from regressions for historical reasons. In R, you fit an ANOVA in the same way as a regression (with the `lm()` command); however, it's common to use the `aov()` command on the `lm` output to reformat the results into a more traditional style. A simple (one-way) ANOVA follows:
```r
# Fit the model with a regression
simple_anova_lm <- lm(Diameter ~ Color_morph, data = lizards)

# Reformat into traditional ANOVA style; this is usually done with a pipe as part of the previous step
simple_anova <- aov(simple_anova_lm)

# Look at the ANOVA table
simple_anova |> summary()
```
The ANOVA table gives us estimates of variation explained by the predictor and the residuals. Note how different the output is from the summary of a regression, even though the underlying math is the same:

```r
summary(simple_anova_lm)
```
Two quick notes about this:

```r
ss <- anova(simple_anova)[["Sum Sq"]] # sums of squares, used for the R^2 calculation below
```

- The $R^2$ value isn't included in the `aov()` summary; it can be calculated as the total sum of squares of your predictors (in this case, `Color_morph`) divided by the total sum of squares including the residual. For this example, we have $`r ss[1]` / (`r ss[1]` + `r ss[2]`) = `r ss[1]/sum(ss)`$.
- The `aov()` table provides a line for each predicting factor and the residuals; the `lm()` summary includes estimates for the intercept and the specific levels of the factors (e.g., `Color_morphBrown`). These represent the dummy coded variables that R uses under the hood; you don't need to worry about this, but I've got an explanation for them below if you're interested.

Dummy coding: R converts an ANOVA into a (multiple) regression model by changing a categorical predictor with $n$ levels into $n-1$ predictor variables that can have values of 0 or 1. The first level of the factor doesn't get a dummy variable; its mean is represented by the regression's intercept. The other dummy coefficients represent the mean difference between the reference level and the level they represent. In our example, the reference category is Blue, so the regression's intercept is the mean Diameter for blue lizards. The average value for brown lizards would be the intercept plus the `Color_morphBrown` coefficient (and similarly for green). The `aov()` function reinterprets the dummy coded results as a traditional categorical analysis.
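If you want to see this dummy coding for yourself, base R's `model.matrix()` shows the 0/1 predictor columns R actually builds (purely illustrative):

```r
model.matrix(~ Color_morph, data = lizards) |>
  head() # first rows: an intercept column plus one 0/1 column per non-reference level
```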
To get the estimated mean and 95% confidence interval for each level, you can use `predict()` on the ANOVA, just as with a regression:

```r
# make a data frame with just your predictors (distinct removes duplicates)
color_levels <- lizards |> distinct(Color_morph)
color_levels

simple_anova_means <- # Calculate means & 95% CI
  predict(simple_anova, newdata = color_levels,
          interval = "confidence", level = 0.95) |>
  as_tibble() |> # Convert to data frame
  bind_cols(color_levels) # add the prediction data as more columns
simple_anova_means
```
In general, the ANOVA tests whether a factor (such as color morph) explains a significant amount of variation in the response. To test whether there are significant differences between specific levels of a factor, you need to run post-hoc tests on your ANOVA. A common post-hoc is Tukey's HSD (honest significant difference).
```r
simple_anova |> TukeyHSD()
```
You can use the `which` argument to specify particular level comparisons, but by default it compares all of them.
A traditional way of summarizing post-hoc tests is to assign one or more letters to significantly different levels, where levels with different letters are significantly different from each other. For example, the letter grouping `c("A", "AB", "B", "C")` would indicate that group 1 and group 3 are different from each other, but not from group 2; group 4 is different from everything. In the above example, everything is different, so we could assign the letters A through C.
```r
# We'll store these in a data frame for later use
anova_means_letters <- simple_anova_means |>
  arrange(Color_morph) |> # These will be plotted alphabetically, so we'll sort them that way to keep things easy
  mutate(anova_labels = c("A", "B", "C"))
anova_means_letters
```
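With many levels, assigning letters by hand gets error-prone. As a sketch, the `multcompView` package (an extra install, not otherwise used in this class) can derive the letters from the Tukey results:

```r
library(multcompView) # install.packages("multcompView") if needed
multcompLetters4(simple_anova, TukeyHSD(simple_anova)) # letters for each Color_morph level
```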
The basics of plotting ANOVA results were covered in the last part of Section \@ref(ggp_discx_conty), but we'll refine them a bit here. The main additions are the use of `geom_text()` to include the labels above each category and `scale_y_continuous()`, which we use to customize the y axis a bit so that it doesn't cut off the labels. Be sure to indicate in your figure captions what the letters mean.
```r
library(ggforce) # for geom_sina()

# We're going to add post-hoc letters to this plot
letter_y_position <- max(lizards$Diameter) * 1.05 # Put letters at the top

anova_plot <- ggplot(lizards) +
  aes(x = Color_morph) + # Color_morph column is in both lizards AND anova_means_letters
  # Show the raw data
  geom_sina(aes(y = Diameter),           # Diameter column is in lizards data frame
            color = alpha("blue", .2)) + # semi-transparent points
  geom_errorbar(
    aes(ymin = lwr, ymax = upr), # columns lwr and upr are in anova_means_letters
    data = anova_means_letters,  # Use the summary data frame instead of lizards
    color = "black",
    width = .3 # How wide the error bars are
  ) +
  geom_point(aes(y = fit), # fit column is in anova_means_letters
             color = "black", size = 2,
             data = anova_means_letters) +
  geom_text(aes(label = anova_labels), # label is the aesthetic used to indicate text
            y = letter_y_position, # This is a fixed spot, not an aesthetic
            fontface = "bold", size = 6, # font size
            data = anova_means_letters) +
  scale_y_continuous( # ensure the letters aren't cut off at the top
    limits = c(NA, letter_y_position), # This is the same as ylim()
    breaks = seq(from = 0, to = 50, by = 10), # set axis ticks every 10 points
    name = "Perch Diameter (mm)" # this is the same as ylab()
  ) +
  xlab("Color Morph")
anova_plot
```
While it's perfectly valid to use an ANOVA to compare the means of two samples, the t-test is the more traditional approach. You can do this with the `t.test()` function.
First, let's create a two-level categorical variable to work with. Normally, this would be done during data collection, but there isn't one in this dataset. Let's define lizards as big (SVL >= 65 mm) or small (SVL <= 55 mm) and remove the intermediates.
```r
lizards_by_size <- lizards |>
  filter(SVL <= 55 | SVL >= 65) |> # Remove the lizards of intermediate size
  mutate(Size_class = if_else(SVL >= 65, "big", "small")) # Define Size_class as big or small, based on SVL
```
Next, let's use a t-test to see if limb length differs between big and small lizards. First, we should look at the mean and standard deviation of each group. The classic t-test requires that the variances of the two groups are equal; if not, an adjustment (Welch's correction) needs to be applied. In practice, this adjustment has no ill effect when the variances are equal, so you may as well always apply it. In any case, it's a good idea to look at data summaries before running a test.
```r
lizard_size_summary_table <- lizards_by_size |>
  group_by(Size_class) |> # We want to summarize each size class
  summarize(mean = mean(Limb),
            sd = sd(Limb),
            var = var(Limb),
            n = n()) |>
  ungroup()
lizard_size_summary_table
```
In this case, the standard deviations are somewhat different, but not dramatically so. Either way, we'll run the test without assuming equal variances.
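(As an aside, if you want a formal check of equal variances, base R has `var.test()` for comparing two variances; eyeballing the summary table is usually sufficient, though.)

```r
var.test(Limb ~ Size_class, data = lizards_by_size) # F test for the ratio of two variances
```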
```r
lizard_t_test <- t.test(Limb ~ Size_class, # formula specifies response (on left) and predictor (on right)
                        data = lizards_by_size,
                        var.equal = FALSE) # Assume unequal variances; this is the default option
print(lizard_t_test)
```
When reporting a t-test, you should include t, the degrees of freedom, the p-value, and the effect size (Cohen's d). The effect size isn't provided for you directly, so you'll need to calculate it yourself.
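The first three of those live inside the test object (a standard `htest` list), so you can pull them out directly rather than copying from the printout:

```r
lizard_t_test$statistic # t
lizard_t_test$parameter # degrees of freedom
lizard_t_test$p.value   # p-value
```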
Cohen's d is essentially a standardized way of measuring the difference between two groups. Conceptually, it's the number of standard deviations of difference between the two groups. You can calculate it by taking the difference in the group means and dividing by the pooled standard deviation. To get the pooled standard deviation, calculate the average of the variances (weighted by sample size) and take the square root.

```r
lizard_size_summary_table |>
  summarize(mean_diff = diff(mean),
            pooled_variance = sum(var * n) / sum(n)) |> # weighted mean of the variances
  mutate(cohen_d = abs(mean_diff) / sqrt(pooled_variance)) # Cohen's d is always positive, so take abs() of mean_diff
```
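If you'd rather not do this by hand, packages exist for it; for example, the `effectsize` package (an extra install, not otherwise required for this class) provides a `cohens_d()` function that accepts the same formula syntax as `t.test()`:

```r
library(effectsize) # install.packages("effectsize") if needed
cohens_d(Limb ~ Size_class, data = lizards_by_size) # d with a 95% confidence interval
```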
You could report your results as follows (assuming that figure 1 compared the two groups):
Big lizards have significantly longer limbs than small lizards (d = 4.27, t = 45.7, df = 442.7, p < 0.0001; Figure 1).
Analysis of covariance (ANCOVA) is a statistical model that combines a linear regression with an ANOVA or t-test. Essentially, it's a regression that also allows the slope and/or intercept of the fitted line to vary between levels of a categorical variable. To keep things simple, we're going to focus on ANCOVAs that only test two-level categorical variables; however, there's no inherent reason the method can't handle more levels.
Let's make two versions of the `lizards` data, contrasting Brown with Blue and Brown with Green. I'm also re-defining Color_morph as a factor so that Brown is taken as the reference category (see dummy coding under the ANOVA section).

```r
lizards_brown_green = lizards |>
  filter(Color_morph != "Blue") |>
  mutate(Color_morph = factor(Color_morph, levels = c("Brown", "Green")))

lizards_brown_blue = lizards |>
  filter(Color_morph != "Green") |>
  mutate(Color_morph = factor(Color_morph, levels = c("Brown", "Blue")))
```
Let's test to see if the relationship between body size (`SVL`) and limb length (`Limb`) differs by color morph. To run an ANCOVA, you can use the `lm()` function with a formula like `Limb ~ SVL * Color_morph`.

```r
ancova_limb_green = lm(Limb ~ SVL * Color_morph, data = lizards_brown_green)
print(ancova_limb_green)

ancova_limb_blue = lm(Limb ~ SVL * Color_morph, data = lizards_brown_blue)
print(ancova_limb_blue)
```
This tells us that the ANCOVA has four coefficients:

- `(Intercept)`, the baseline intercept for the reference category, Brown.
- `SVL`, the baseline regression slope for the reference category, Brown.
- `Color_morphGreen` (or Blue), how the intercept differs between Brown and the other color; this is the main effect of color morph.
- `SVL:Color_morphGreen` (or Blue), how the slope differs between Brown and the other color; this is the SVL by color morph interaction term.

For both data sets, both the main effect and interaction of color morph are near zero; you can test this statistically with `summary()`.
```r
summary(ancova_limb_green)
summary(ancova_limb_blue)
```
This means that the two regression lines are essentially the same. Let's plot one of them to see what that looks like:
```r
lizards_brown_green |>
  ggplot() +
  aes(x = SVL, y = Limb, color = Color_morph) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_color_viridis_d(begin = 0.3, end = 1)
```
As you can see, the regression lines are almost indistinguishable.
What would an ANCOVA show if there are differences? Let's see if color morph affects the Tail:Diameter relationship instead:
```r
ancova_dia_blue = lm(Diameter ~ Tail * Color_morph, data = lizards_brown_blue)
print(ancova_dia_blue)
summary(ancova_dia_blue)

lizards_brown_blue |>
  ggplot() +
  aes(x = Tail, y = Diameter, color = Color_morph) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_color_viridis_d(begin = 0.3, end = 1)
```
In this case, you can see that there's a significant main effect but no significant interaction; this results in a separation between the two regression lines on the figure.
If you get this result, re-run the ANCOVA but replace the `*` with a `+`; this will remove the interaction effect and give you a better estimate.

```r
ancova_dia_blue_noint = lm(Diameter ~ Tail + Color_morph, data = lizards_brown_blue)
summary(ancova_dia_blue_noint)
```
Biologically, you can interpret this as blue lizards having a larger perch diameter than brown ones across a variety of tail sizes. Specifically, blue perch diameter is on average `r as.numeric(round(coef(ancova_dia_blue_noint)[3], 2))` mm wider (you get this from the main effect coefficient of the no-interaction model).
When reporting this, be sure to provide the regression equation (Intercept and slope of Tail) and the main effect.
But what if you have a significant interaction? Let's look at the Brown/Green comparison:
```r
ancova_dia_green = lm(Diameter ~ Tail * Color_morph, data = lizards_brown_green)
print(ancova_dia_green)
summary(ancova_dia_green)

lizards_brown_green |>
  ggplot() +
  aes(x = Tail, y = Diameter, color = Color_morph) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_color_viridis_d(begin = 0.3, end = 1)
```
In this case, we have a significant interaction, which indicates that the relationship between Diameter and Tail is different between color morphs; the graph backs this up.
The main effect is not significantly different in this case, but that generally doesn't matter (as it can't be interpreted without the slope).
To understand and report this, it's best to think of it as a pair of separate regressions.
The regression for Brown is determined by your intercept and your slope (e.g., `Diameter = Intercept + slope * Tail`). The regression for Green adds the Green-specific coefficients to those effects: `Diameter = (Intercept + Color_morphGreen) + (Tail slope + Tail:Color_morphGreen) * Tail`.
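Here's a small sketch of pulling those two lines out of the fitted model by name (the coefficient names match the `summary()` output):

```r
cf <- coef(ancova_dia_green)

# Brown (reference level): intercept and slope come straight from the model
brown_line <- c(intercept = cf[["(Intercept)"]],
                slope = cf[["Tail"]])

# Green: add the Green-specific offsets to the Brown coefficients
green_line <- c(intercept = cf[["(Intercept)"]] + cf[["Color_morphGreen"]],
                slope = cf[["Tail"]] + cf[["Tail:Color_morphGreen"]])

brown_line
green_line
```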
In this case, you'd report your regression slopes as:

- Brown: Diameter = `r round(coef(ancova_dia_green)[[1]], 2)` + `r round(coef(ancova_dia_green)[[2]], 2)` * Tail
- Green: Diameter = `r round(coef(ancova_dia_green)[[1]] + coef(ancova_dia_green)[[3]], 2)` + `r round(coef(ancova_dia_green)[[2]] + coef(ancova_dia_green)[[4]], 2)` * Tail
You should also include the $R^2$ and the p-values for the non-intercept coefficients for both cases.
```r
ggplot(lizards, aes(x = SVL, y = Diameter, color = Color_morph)) +
  geom_point() +
  geom_smooth(method = 'lm')
```