I'm going to give some examples of how to do common analyses you'll need for this class. I won't be spending much time on the statistical assumptions or diagnostics.
```r
library(tidyverse)
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data
```
R has two ways of representing textual data: character vectors (also called strings) and factors.
For the most part, it's easier and safer to work with character vectors. Most of the functions we'll be using convert between them automatically when necessary.
One important thing to note is that `ggplot` arranges character vectors alphabetically on its categorical scales, but orders factors by their level number. Thus, to change the order of categorical x-axes (and other scales), you need to make your categories into a factor. This is done with `fct_inorder()`, `fct_infreq()`, and related functions, which are part of the `forcats` package and loaded with tidyverse. The easiest one to use is `fct_inorder()`, which changes the level values to be in the order of your data; when combined with `arrange()` and other `dplyr` functions, this is quite flexible and powerful. For more information on these and other factor functions, take a look at the forcats website.
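For instance, here's a minimal sketch (using the `lizards` data loaded above) of reordering a categorical axis by frequency, so the most common color morph appears first:

```r
lizards |>
  mutate(Color_morph = fct_infreq(Color_morph)) |> # reorder levels by how often they occur
  ggplot() +
  aes(x = Color_morph) +
  geom_bar() # bars now run from most to least common
```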
If you want to manually create a factor, you can use the `factor()` command, which is in base R (no package).

```r
color_levels = c("Red", "Green", "Blue")

# Randomly select 20 colors from color_levels
color_example = sample(color_levels, size = 20, replace = TRUE)
color_example

# Convert it into a factor
color_as_factor = factor(color_example, levels = color_levels)
color_as_factor
```
These tests generally compare the frequencies of count data.
The easiest way to make a contingency table is from a data frame where each column is a categorical variable you want in the table and each row is an observation. Let's say we wanted to create a color morph by perch type contingency table with our lizard data.

```r
color_by_perch_tbl <- lizards |>
  select(Color_morph, Perch_type) |> # select only the columns you want
  table() # feed them into the table command
color_by_perch_tbl
```
If you want to switch your contingency table from counts to frequencies, just divide it by its sum:

```r
color_by_perch_tbl / sum(color_by_perch_tbl)
```
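As an aside, base R's `prop.table()` does the same division for you, and its `margin` argument gives row-wise or column-wise proportions instead:

```r
prop.table(color_by_perch_tbl)             # same as dividing by the sum
prop.table(color_by_perch_tbl, margin = 1) # each row sums to 1
```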
Once you have a contingency table, you can use `chisq.test()` to test for independence between the rows and columns. This provides you with your test statistic (`X-squared`), degrees of freedom, and p-value.

```r
chisq.test(color_by_perch_tbl)
```
Chi-squared tests assume that there are at least five observations in each cell of the contingency table. If this assumption fails, the resulting values aren't accurate. For example:

```r
site_a_contingency <- lizards |>
  filter(Site == "A") |> # cut down the data size
  select(Color_morph, Perch_type) |> # select only the columns you want
  table() # feed them into the table command
print(site_a_contingency)

chisq.test(site_a_contingency)
```
Note the warning that "Chi-squared approximation may be incorrect." In this case, it's a good idea to run Fisher's exact test, which investigates the same null hypothesis but works with low counts. Fisher's test is less powerful than the chi-squared test, so it should only be used when it's the only option.

```r
fisher.test(site_a_contingency)
```
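If you want to see which cells triggered the warning, the result of `chisq.test()` stores the expected counts in its `expected` element; cells below five are the problem:

```r
# Inspect the expected counts behind the warning
suppressWarnings(chisq.test(site_a_contingency))$expected
```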
You can also run a chi-squared or Fisher's exact test directly on two vectors or data frame columns:

```r
chisq.test(lizards$Color_morph, lizards$Perch_type)
```
Linear regression and analysis of variance (ANOVA) are both special cases of the general linear model (LM), which fits a continuous response (y) to one or more predictors (x). You specify linear models in R with a formula syntax, which generally takes the form `response ~ predictor`. Combining this formula with the `lm()` function and a dataset gives you the basis of a linear model.

Let's say we wanted to see how snout-vent length (SVL) affects mass:

```r
simple_reg <- lm(Mass ~ SVL, data = lizards)
simple_reg
```
By default, this creates an LM object, which tells us the regression coefficients. For a linear regression (continuous response), these tell us the regression equation; in this case, that for every $1 \text{ mm}$ increase in SVL, mass increases by $`r coef(simple_reg)[2] |> round(3) |> as.numeric()` \text{ g}$. Note that in this case, it may make sense to re-scale SVL to be in cm so that the coefficient is easier to interpret (e.g., use `lizards |> mutate(SVL_cm = SVL/10)`). To extract the coefficients directly, use:

```r
coef(simple_reg)
```
For more information, use the `summary()` function:

```r
summary(simple_reg)
```
The most important components of this are the $R^2$ (which is listed as `Multiple R-squared`), the standard errors and p-values for each of your coefficients (the `Coefficients` section), and the overall F-statistic and p-value (the last line).
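If you also want confidence intervals for the coefficients (which the summary doesn't print), base R's `confint()` works on any `lm` object:

```r
confint(simple_reg, level = 0.95) # 95% CIs for the intercept and the SVL slope
```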
A quick note on p-values:
For a simple linear regression, `ggplot` can automatically plot the trendline (a.k.a. fitted values) and confidence intervals with `geom_smooth()`.

```r
lizards |>
  ggplot() +
  aes(x = SVL, y = Mass) +
  geom_point() +
  geom_smooth(method = "lm", # use a linear regression
              se = TRUE,     # include a confidence interval around the line
              level = .95)   # the level of the confidence interval; default = 95%
```
For more complicated models, this approach may not work very well; it can be helpful to calculate the fitted values & confidence intervals from the model object and plot them directly.
You can get these values directly with the `predict()` function. The code below calculates the trendline and 95% confidence interval for the regression, and adds them to a data frame with the original data.

```r
simple_reg_predictions <- simple_reg |>
  # Predict fitted values with 95% confidence intervals
  predict(interval = "confidence", level = .95) |>
  # The output of predict() is a matrix; that's hard to work with, so...
  as_tibble() # we convert the output of predict() to a data frame (tibble)

simple_reg_plot_data <- lizards |>
  select(SVL, Mass) |> # we only need these columns
  bind_cols(simple_reg_predictions) # adds columns of simple_reg_predictions to our lizards

View(simple_reg_plot_data)
```
To plot this, you'd use a combination of `geom_line()` for the fitted values and `geom_ribbon()`, which creates the confidence interval region.

```r
ggplot(simple_reg_plot_data) +
  aes(x = SVL) +
  geom_point(aes(y = Mass), color = "cornflowerblue") +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = alpha("black", .2), # dark fill with 80% transparency (alpha)
              color = grey(.4),          # dark-ish grey border line
              linetype = 2) +            # dotted border line
  geom_line(aes(y = fit))
```
You can also use `predict()` to calculate the expected value of your response variable given new predictor(s); this is useful for interpolation and extrapolation.

```r
# Create a data frame with new predictors
svl_predictors = tibble(SVL = c(20, 150, 600)) # New predictors

mass_predictions <- predict(simple_reg, newdata = svl_predictors)
# The newdata argument is key here; if unspecified, it uses the original data

svl_predictors |> mutate(Mass_estimate = mass_predictions)
```
For more information, see the help page `?predict.lm`.
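One related option worth knowing: `predict()` can also return prediction intervals (for individual new observations rather than the mean trend), which are wider than confidence intervals:

```r
predict(simple_reg, newdata = svl_predictors,
        interval = "prediction", level = 0.95) # intervals for individual lizards
```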
Analysis of variance (ANOVA) is a special case of the linear model in which all of the predictors are categorical. However, ANOVAs are usually treated differently from regressions for historical reasons. In R, you fit an ANOVA in the same way as a regression (with the `lm()` command); however, it's common to use the `aov()` command on the `lm` output to reformat the results into a more traditional style. A simple (one-way) ANOVA follows:
```r
# Fit the model with a regression
simple_anova_lm <- lm(Diameter ~ Color_morph, data = lizards)

# Reformat into traditional ANOVA style; this is usually done with a pipe as part of the previous step
simple_anova <- aov(simple_anova_lm)

# Look at the ANOVA table
simple_anova |> summary()
```
The ANOVA table gives us estimates of variation explained by the predictor and the residuals. Note how different the output is from the summary of a regression, even though the underlying math is the same:

```r
summary(simple_anova_lm)
```
Two quick notes about this:

```r
ss <- anova(simple_anova)[["Sum Sq"]] # sums of squares, used for the R^2 calculation below
```

- The $R^2$ value isn't included in the `aov()` summary; it can be calculated as the total sum of squares of your predictors (in this case, `Color_morph`) divided by the total sum of squares including the residual. For this example, we have $`r ss[1]` / (`r ss[1]` + `r ss[2]`) = `r ss[1]/sum(ss)`$.
- The `aov()` table provides a line for each predicting factor and the residuals; the `lm()` summary includes estimates for the intercept and the specific levels of the factors (e.g., `Color_morphBrown`). These represent the dummy coded variables that R uses under the hood; you don't need to worry about this, but I've got an explanation for them below if you're interested.

Dummy coding: R converts an ANOVA into a (multiple) regression model by changing a categorical predictor with $n$ levels into $n-1$ predictor variables that can have values of 0 or 1. The first level of the factor doesn't get a dummy variable; its mean is represented by the regression's intercept. The other dummy coefficients represent the mean difference between the reference level and the level they represent. In our example, the reference category is Blue, so the regression's intercept is the mean Diameter for blue lizards. The average value for brown lizards would be the intercept plus the `Color_morphBrown` coefficient (and similarly for green). The `aov()` function reinterprets the dummy coded results as a traditional categorical analysis.
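If you want to see this dummy coding for yourself, base R's `model.matrix()` shows the 0/1 predictor columns R actually builds (purely illustrative):

```r
model.matrix(~ Color_morph, data = lizards) |>
  head() # first rows: an intercept column plus one 0/1 column per non-reference level
```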
To get the estimated mean and 95% confidence interval for each level, you can use `predict()` on the ANOVA, just as with a regression:

```r
# make a data frame with just your predictors (distinct removes duplicates)
color_levels <- lizards |> distinct(Color_morph)
color_levels

simple_anova_means <- # Calculate means & 95% CI
  predict(simple_anova, newdata = color_levels,
          interval = "confidence", level = 0.95) |>
  as_tibble() |> # Convert to data frame
  bind_cols(color_levels) # add the prediction data as more columns
simple_anova_means
```
In general, the ANOVA tests whether a factor (such as color morph) explains a significant amount of variation in the response. To test whether there are significant differences between specific levels of a factor, you need to run post-hoc tests on your ANOVA. A common post-hoc is Tukey's HSD (honest significant difference).
```r
simple_anova |> TukeyHSD()
```
You can use the `which` argument to specify particular level comparisons, but by default it compares all of them.
A traditional way of summarizing post-hoc tests is to assign one or more letters to significantly different levels, where levels with different letters are significantly different from each other. For example, the letter grouping `c("A", "AB", "B", "C")` would indicate that group 1 and group 3 are different from each other, but not from group 2; group 4 is different from everything. In the above example, everything is different, so we could assign the letters A through C.
```r
# We'll store these in a data frame for later use
anova_means_letters <- simple_anova_means |>
  arrange(Color_morph) |> # These will be plotted alphabetically, so we'll sort them that way to keep things easy
  mutate(anova_labels = c("A", "B", "C"))
anova_means_letters
```
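With many levels, assigning letters by hand gets error-prone. As a sketch, the `multcompView` package (an extra install, not otherwise used in this class) can derive the letters from the Tukey results:

```r
library(multcompView) # install.packages("multcompView") if needed
multcompLetters4(simple_anova, TukeyHSD(simple_anova)) # letters for each Color_morph level
```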
The basics of plotting ANOVA results were covered in the last part of Section \@ref(ggp_discx_conty), but we'll refine them a bit here. The main additions are the use of `geom_text()` to include the labels above each category and `scale_y_continuous()`, which we use to customize the y axis a bit so that it doesn't cut off the labels. Be sure to indicate in your figure captions what the letters mean.
```r
library(ggforce) # for geom_sina()

# We're going to add post-hoc letters to this plot
letter_y_position <- max(lizards$Diameter) * 1.05 # Put letters at the top

anova_plot <- ggplot(lizards) +
  aes(x = Color_morph) + # Color_morph column is in both lizards AND anova_means_letters
  # Show the raw data
  geom_sina(aes(y = Diameter),           # Diameter column is in lizards data frame
            color = alpha("blue", .2)) + # semi-transparent points
  geom_errorbar(
    aes(ymin = lwr, ymax = upr), # columns lwr and upr are in anova_means_letters
    data = anova_means_letters,  # Use the summary data frame instead of lizards
    color = "black",
    width = .3 # How wide the error bars are
  ) +
  geom_point(aes(y = fit), # fit column is in anova_means_letters
             color = "black", size = 2,
             data = anova_means_letters) +
  geom_text(aes(label = anova_labels), # label is the aesthetic used to indicate text
            y = letter_y_position, # This is a fixed spot, not an aesthetic
            fontface = "bold", size = 6, # font size
            data = anova_means_letters) +
  scale_y_continuous( # ensure the letters aren't cut off at the top
    limits = c(NA, letter_y_position), # This is the same as ylim()
    breaks = seq(from = 0, to = 50, by = 10), # set axis ticks every 10 points
    name = "Perch Diameter (mm)" # this is the same as ylab()
  ) +
  xlab("Color Morph")
anova_plot
```
While it's perfectly valid to use an ANOVA to compare the means of two samples, the t-test is the more traditional approach. You can do this with the `t.test()` function.
First, let's create a two-level categorical variable to work with. Normally, this would be done during data collection, but there isn't one in this dataset. Let's define lizards as big (SVL >= 65 mm) or small (SVL <= 55 mm) and remove the intermediates.
```r
lizards_by_size <- lizards |>
  filter(SVL <= 55 | SVL >= 65) |> # Remove the lizards of intermediate size
  mutate(Size_class = if_else(SVL >= 65, "big", "small")) # Define Size_class as big or small, based on SVL
```
Next, let's use a t-test to see if limb length differs between big and small lizards. First, we should look at the mean and standard deviation of each group. The classic t-test requires that the variances of the two groups are equal; if not, an adjustment (Welch's correction) needs to be applied. In practice, this adjustment has no ill effect when the variances are equal, so you may as well always apply it. In any case, it's a good idea to look at data summaries before running a test.
```r
lizard_size_summary_table <- lizards_by_size |>
  group_by(Size_class) |> # We want to summarize each size class
  summarize(mean = mean(Limb),
            sd = sd(Limb),
            var = var(Limb),
            n = n()) |>
  ungroup()
lizard_size_summary_table
```
In this case, the standard deviations are somewhat different, but not dramatically so. Either way, we'll run the test without assuming equal variances.
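(As an aside, if you want a formal check of equal variances, base R has `var.test()` for comparing two variances; eyeballing the summary table is usually sufficient, though.)

```r
var.test(Limb ~ Size_class, data = lizards_by_size) # F test for the ratio of two variances
```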
```r
lizard_t_test <- t.test(Limb ~ Size_class, # formula specifies response (on left) and predictor (on right)
                        data = lizards_by_size,
                        var.equal = FALSE) # Assume unequal variances; this is the default option
print(lizard_t_test)
```
When reporting a t-test, you should include t, the degrees of freedom, the p-value, and the effect size (Cohen's d). The effect size isn't provided for you directly, so you'll need to calculate it yourself.
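The first three of those live inside the test object (a standard `htest` list), so you can pull them out directly rather than copying from the printout:

```r
lizard_t_test$statistic # t
lizard_t_test$parameter # degrees of freedom
lizard_t_test$p.value   # p-value
```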
Cohen's d is essentially a standardized way of measuring the difference between two groups. Conceptually, it's the number of standard deviations of difference between the two groups. You can calculate it by taking the difference in the group means and dividing by the pooled standard deviation. To get the pooled standard deviation, calculate the average of the variances (weighted by sample size) and take the square root.

```r
lizard_size_summary_table |>
  summarize(mean_diff = diff(mean),
            pooled_variance = sum(var * n) / sum(n)) |> # weighted mean of the variances
  mutate(cohen_d = abs(mean_diff) / sqrt(pooled_variance)) # Cohen's d is always positive, so take abs() of mean_diff
```
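If you'd rather not do this by hand, packages exist for it; for example, the `effectsize` package (an extra install, not otherwise required for this class) provides a `cohens_d()` function that accepts the same formula syntax as `t.test()`:

```r
library(effectsize) # install.packages("effectsize") if needed
cohens_d(Limb ~ Size_class, data = lizards_by_size) # d with a 95% confidence interval
```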
You could report your results as follows (assuming that figure 1 compared the two groups):
Big lizards have significantly longer limbs than small lizards (d = 4.27, t = 45.7, df = 442.7, p < 0.0001; Figure 1).
Analysis of covariance (ANCOVA) is a statistical model that combines a linear regression with an ANOVA or t-test. Essentially, it's a regression that also allows the slope and/or intercept of the fitted line to vary between levels of a categorical variable. To keep things simple, we're going to focus on ANCOVAs that only test two-level categorical variables; however, there's no inherent reason the method can't handle more levels.
Let's make two versions of the `lizards` data, contrasting Brown with Blue and Brown with Green. I'm also re-defining Color_morph as a factor so that Brown is taken as the reference category (see dummy coding under the ANOVA section).

```r
lizards_brown_green = lizards |>
  filter(Color_morph != "Blue") |>
  mutate(Color_morph = factor(Color_morph, levels = c("Brown", "Green")))

lizards_brown_blue = lizards |>
  filter(Color_morph != "Green") |>
  mutate(Color_morph = factor(Color_morph, levels = c("Brown", "Blue")))
```
Let's test to see if the relationship between body size (`SVL`) and limb length (`Limb`) differs by color morph. To run an ANCOVA, you can use the `lm()` function with a formula like `Limb ~ SVL * Color_morph`.

```r
ancova_limb_green = lm(Limb ~ SVL * Color_morph, data = lizards_brown_green)
print(ancova_limb_green)

ancova_limb_blue = lm(Limb ~ SVL * Color_morph, data = lizards_brown_blue)
print(ancova_limb_blue)
```
This tells us that the ANCOVA has four coefficients:

- `(Intercept)`, the baseline intercept for the reference category, Brown.
- `SVL`, the baseline regression slope for the reference category, Brown.
- `Color_morphGreen` (or Blue), how the intercept differs between Brown and the other color; this is the main effect of color morph.
- `SVL:Color_morphGreen` (or Blue), how the slope differs between Brown and the other color; this is the SVL by color morph interaction term.

For both data sets, both the main effect and interaction of color morph are near zero; you can test this statistically with `summary()`.
```r
summary(ancova_limb_green)
summary(ancova_limb_blue)
```
This means that the two regression lines are essentially the same. Let's plot one of them to see what that looks like:
```r
lizards_brown_green |>
  ggplot() +
  aes(x = SVL, y = Limb, color = Color_morph) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_color_viridis_d(begin = 0.3, end = 1)
```
As you can see, the regression lines are almost indistinguishable.
What would an ANCOVA show if there are differences? Let's see if color morph affects the Tail:Diameter relationship instead:
```r
ancova_dia_blue = lm(Diameter ~ Tail * Color_morph, data = lizards_brown_blue)
print(ancova_dia_blue)
summary(ancova_dia_blue)

lizards_brown_blue |>
  ggplot() +
  aes(x = Tail, y = Diameter, color = Color_morph) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_color_viridis_d(begin = 0.3, end = 1)
```
In this case, you can see that there's a significant main effect but no significant interaction; this results in a separation between the two regression lines on the figure.
If you get this result, re-run the ANCOVA but replace the `*` with a `+`; this will remove the interaction effect and give you a better estimate.

```r
ancova_dia_blue_noint = lm(Diameter ~ Tail + Color_morph, data = lizards_brown_blue)
summary(ancova_dia_blue_noint)
```
Biologically, you can interpret this as blue lizards having a larger perch diameter than brown ones across a variety of tail sizes. Specifically, blue perch diameter is on average `r as.numeric(round(coef(ancova_dia_blue_noint)[3], 2))` mm wider (you get this from the main effect coefficient of the no-interaction model).
When reporting this, be sure to provide the regression equation (Intercept and slope of Tail) and the main effect.
But what if you have a significant interaction? Let's look at the Brown/Green comparison:
```r
ancova_dia_green = lm(Diameter ~ Tail * Color_morph, data = lizards_brown_green)
print(ancova_dia_green)
summary(ancova_dia_green)

lizards_brown_green |>
  ggplot() +
  aes(x = Tail, y = Diameter, color = Color_morph) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_color_viridis_d(begin = 0.3, end = 1)
```
In this case, we have a significant interaction, which indicates that the relationship between Diameter and Tail is different between color morphs; the graph backs this up.
The main effect is not significantly different in this case, but that generally doesn't matter (as it can't be interpreted without the slope).
To understand and report this, it's best to think of it as a pair of separate regressions.
The regression for Brown is determined by your intercept and your slope (e.g., `Diameter = Intercept + slope * Tail`). The regression for Green adds the Green-specific coefficients to those effects: `Diameter = (Intercept + Color_morphGreen) + (Tail slope + Tail:Color_morphGreen) * Tail`.
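Here's a small sketch of pulling those two lines out of the fitted model by name (the coefficient names match the `summary()` output):

```r
cf <- coef(ancova_dia_green)

# Brown (reference level): intercept and slope come straight from the model
brown_line <- c(intercept = cf[["(Intercept)"]],
                slope = cf[["Tail"]])

# Green: add the Green-specific offsets to the Brown coefficients
green_line <- c(intercept = cf[["(Intercept)"]] + cf[["Color_morphGreen"]],
                slope = cf[["Tail"]] + cf[["Tail:Color_morphGreen"]])

brown_line
green_line
```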
In this case, you'd report your regression slopes as:

- Brown: Diameter = `r round(coef(ancova_dia_green)[[1]], 2)` + `r round(coef(ancova_dia_green)[[2]], 2)` * Tail
- Green: Diameter = `r round(coef(ancova_dia_green)[[1]] + coef(ancova_dia_green)[[3]], 2)` + `r round(coef(ancova_dia_green)[[2]] + coef(ancova_dia_green)[[4]], 2)` * Tail
You should also include the $R^2$ and the p-values for the non-intercept coefficients for both cases.
```r
ggplot(lizards, aes(x = SVL, y = Diameter, color = Color_morph)) +
  geom_point() +
  geom_smooth(method = 'lm')
```