library(rbiom) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", R.options = list( pillar.print_min = 5, pillar.print_max = 5 ))
Regression functions are provided for examining the association between two continuous variables.
Examples of continuous variables are alpha diversity metrics, taxa abundances, and sample metadata such as age or BMI. These all have a numeric range in which a data point can take on any value in that range.
To start with a simple example, consider the children
dataset from the R package npregfast
. It records age, sex, and height for 2500 children aged 5 through 19 years.
head(npregfast::children)
First, let's plot 'age' vs 'height', and overlay a best-fit linear trendline.
With this plot, we're testing the hypothesis "as you get older, your height changes".
stats_corrplot( df = npregfast::children, # dataset x = 'age', # x-axis variable y = 'height', # y-axis variable layers = 'tp', # show trendline and points test = 'emtrends', # run stats on the slope fit = 'lm', # straight trendline pt.size = 0.2, # make points smaller pt.alpha = 0.2 ) # and semi-transparent
As you'd expect, the trendline's slope is positive (p-value = 0). Therefore it's safe to say that height changes with age.
Is the rate of growth influenced by sex? We can add stat.by = "sex"
to draw separate lines for males and females.
stats_corrplot( df = npregfast::children, # dataset x = 'age', # x-axis variable y = 'height', # y-axis variable stat.by = 'sex', # statistical groups <==== fit = 'lm', # straight trendline layers = 'tp', # show trendline and points test = 'emtrends', # run stats on the slopes pt.size = 0.2 ) # make points smaller
Here, a p-value of 3.9e-51 indicates that males and females do have different growth rates.
If you look at the captions of the last two plots, you'll notice that the p-value test automatically changes.
stat.by = NULL
, test = "emtrends"
will test if the slope is zero. stat.by
is not NULL
, test = "emtrends"
will test if the trendline slopes are equal.Linear trendlines probably aren't the best way to model growth rates. Setting fit = "gam"
will use a generalized additive model which fits several sub-ranges of age with independent splines.
stats_corrplot( df = npregfast::children, # dataset x = 'age', # x-axis variable y = 'height', # y-axis variable stat.by = 'sex', # statistical groups fit = 'gam', # smoothed trendline <==== layers = 'tp', # show trendline and points test = 'emtrends', # run stats on the slopes pt.size = 0.2 ) # make points smaller
This gives us a much better idea of the moving average over time.
We still have test = "emtrends"
, so the p-value is reported where the difference in slope is most significant - in this case, at age = 5.9 years.
Let's set test = "emmeans"
instead, and show the 95% confidence interval instead of all the data points.
stats_corrplot( df = npregfast::children, # dataset x = 'age', # x-axis variable y = 'height', # y-axis variable stat.by = 'sex', # statistical groups fit = 'gam', # smoothed trendline layers = 'tc', # show trendline with conf. int. test = 'emmeans' ) # run stats on the means <====
Differences in height are most significant at age = 16.8 (where p-value = 8.9e-40).
The complete statistical output is attached to the returned plot as $stats
.
You can also directly generate this table with the stats_table()
function.
st <- stats_table( df = npregfast::children, # dataset regr = 'age', # x-axis variable resp = 'height', # y-axis variable stat.by = 'sex', # statistical groups fit = 'gam', # generalized additive model test = 'emmeans' ) # run stats on the means # Print the table transposed, since it's one row by 14 columns. t(st)
Depending on the arguments given, the estimate term will be one of:
| Field | Description | test
| stat.by
|
| ----------- | ------------------------ | --------------------- | ---------- |
| .mean | Estimated marginal mean. | emmeans::emmeans()
| NULL
|
| .mean.diff | Difference in means. | emmeans::emmeans()
| not NULL
|
| .slope | Trendline slope. | emmeans::emtrends()
| NULL
|
| .slope.diff | Difference in slopes. | emmeans::emtrends()
| not NULL
|
Other fields in this table include:
| Field | Description |
| ------------ | ----------------------------------------------------- |
| .h1 | Alternate hypothesis. |
| .p.val | Probability that null hypothesis is correct. |
| .adj.p | .p.val
after adjusting for multiple comparisons. |
| .effect.size | Effect size. See emmeans::eff_size()
. |
| .lower | Confidence interval lower bound. |
| .upper | Confidence interval upper bound. |
| .se | Standard error. |
| .n | Number of samples. |
| .df | Degrees of freedom. |
| .t.ratio | (.mean, .mean.diff, .slope, or .slope.diff) / .se
|
| .r.sqr | Percent of variation explained by the model. |
| .adj.r | .r.sqr
, taking degrees of freedom into account. |
| .aic | Akaike Information Criterion (predictive models). |
| .bic | Bayesian Information Criterion (descriptive models). |
| .loglik | Log-likelihood goodness-of-fit score. |
| .fit.p | P-value for observing this fit by chance. |
Estimated marginal means (EMMs; also known as least-squares means) are means extracted from a statistical model. This allows EMMs to take into account more complex associations and produce confidence intervals in addition to estimates of the mean. This approach is employed by rbiom's 'emmeans'
and 'emtrends'
tests, using the emmeans::emmeans
package.
Additional information on EMMs is available at:
The statistics table has several fields for assessing how well the model fits the data.
.r.sqr
- Coefficient of Determination (R2); range: 0-1, lower is better. Percent of variation explained by the model..adj.r
- R2, taking degrees of freedom into account..aic
- Akaike Information Criterion; lower is better. Preferred when using model for prediction..bic
- Bayesian Information Criterion; lower is better. Preferred when using model for interpretation..loglik
- Log-Likelihood (negative values); higher values (closer to zero) are better..fit.p
- P-value for observing this fit by chance; range: 0-1, lower is better.Not all of these values can be computed for every model.
As an example, lets pull the AIC value from the statistics table generated in the last section.
st$.aic
These values can be used to decide which fit
argument to use ("lm"
, "log"
, or "gam"
).
plyr::ldply(c(lm="lm", log="log", gam="gam"), .id = "fit", function (fit) stats_table(npregfast::children, "age", "height", "sex", fit = fit) ) %>% dplyr::select(fit, .r.sqr:.fit.p)
Based on the above table, fit = "gam"
has the lowest AIC, lowest BIC, and highest Log-Likelihood, and is therefore the best model of the three to use for this set of data and variables.
The three columns .mean.diff
, .h1
, and .p.val
together show the hypothesis and its outcome. Above, we're asking if .mean.diff
is non-zero. Since .p.val
is less than 0.05 we can say that it is.
dplyr::select(st, .mean.diff:.p.val)
For a more complete review of alternative hypotheses, see the rbiom statistics article.
One of the best ways to assess the quality of your model is to interrogate the residuals. A "residual" is the vertical distance from a data point to the trendline; the difference between observed and predicted (fitted) values.
High quality models will have randomly distributed residuals. Three common plots for visualizing residuals are normal Q-Q plots, residual vs fitted value plots, and scale-location plots. You can add these plots to any rbiom corrplot by setting check = TRUE
.
stats_corrplot( df = npregfast::children, # dataset x = 'age', # x-axis variable y = 'height', # y-axis variable stat.by = 'sex', # statistical groups fit = 'gam', # smoothed trendline layers = 'tc', # show trendline with conf. int. test = 'emmeans', # run stats on the means check = TRUE ) # display diagnostic plots <====
The subtitle for each plot tells you what a high quality model will look like. For example, the "Normal Q-Q" plot says that "Points should fall on the line." In this case they do, so that particular check indicates high-quality.
These diagnostic plots are largely subjective/qualitative. That is, you must use your best judgement to decide if the residuals appear to pass or fail the checks.
A complete explanation of these diagnostic plots is beyond the scope of this article. Readers are encouraged to explore the topic further at sites such as:
Going forward we'll use the gems
dataset included with rbiom. GEMS is a case-control study comprising 1,006 stool samples from young children with or without diarrhea in four low-income countries. The original Nature article by Eric J. de Muinck and Pal Trosvik for this data set is available online for free at Individuality and Convergence of the Infant Gut Microbiota During the First Year of Life.
gems
glimpse(gems)
The gems
metadata includes two categorical fields and one continuous field:
| Field | Type | Values | | -------- | ----------- | ---------------------------------- | | diarrhea | categorical | Case or Control | | age | continuous | 0 - 4.8 (years old) | | country | categorical | Bangladesh, Gambia, Kenya, or Mali |
To start, let's select just the healthy controls from two countries, then rarefy those samples to an even sequencing depth.
hbk <- gems %>% subset(diarrhea == "Control") %>% subset(country %in% c("Bangladesh", "Kenya")) %>% rarefy() hbk$n_samples
Since we're working with an rbiom object - hbk
- we can use the convenience function adiv_corrplot()
, which combines the steps of generating a table of alpha diversity values (adiv_table()
) and generating a correlation plot (stats_corrplot()
).
adiv_corrplot( biom = hbk, # healthy controls from Bangladesh and Kenya x = "age", # x-axis variable stat.by = "country", # statistical groups fit = "gam", # smoothed trendline check = TRUE ) # display diagnostic plots
Here, the diagnostic plots look pretty good, so we can confidently say that children have different Shannon diversity index values, depending on whether they live in Bangladesh vs Kenya, and that this difference is most significant at around 1.35 years old.
Similar to adiv_corrplot()
, taxa_corrplot()
combines taxa_table()
and stats_corrplot()
into a single streamlined function.
We'll set layers = "tcrp"
to show the residuals on this main plot. This will help in explaining an artifact on the "Residuals vs Fitted" diagnostic plot.
taxa_corrplot( biom = hbk, # healthy controls from Bangladesh and Kenya layers = "tcrp", # show residuals on the main plot x = "age", # x-axis variable taxa = 1, # select the most abundant taxa rank = "Genus", # at the genus level stat.by = "country", # statistical groups fit = "gam", # smoothed trendline t.size = 2, # make the trendline thicker check = TRUE ) # display diagnostic plots
Above, all the orange and blue points falling on y=0 indicate there are zero Prevotella reads for that sample. This can confound taxa abundance analyses, as indicated by the diagnostic plots.
One method for addessing zero abundances is to rank tranform all the abundances. In rbiom, setting trans = "rank"
will rank transform all the abundances, without consideration of facets. The default parameters ties = "random", seed = 0
will assign unique integers (see base::rank()
for other ties
options).
This will improve the diagnostic plots, but adds noise to the data.
taxa_corrplot( biom = hbk, # healthy controls from Bangladesh and Kenya layers = "tcrp", # show residuals on the main plot x = "age", # x-axis variable taxa = 1, # select the most abundant taxa rank = "Genus", # at the genus level stat.by = "country", # statistical groups fit = "gam", # smoothed trendline trans = "rank", # rank-transform the abundances t.size = 2, # make the trendline thicker check = TRUE ) # display diagnostic plots
Like the last two corrplot functions, bdiv_corrplot()
combines bdiv_table()
and stats_corrplot()
to generate a beta diversity correlation plot from an rbiom object.
Beta diversity correlation plots are more conceptually complex, so let's quickly reveal what we'll be plotting.
As with most bdiv_*
functions, we can supply a within
and/or between
parameters to limit which pairs of samples are included in the plot. Setting within = "country"
as below will only plot pairs of samples that came from the same country, and prevent stat.by
from producing a "Bangladesh vs Kenya" trendline.
bdiv_corrplot( biom = hbk, # healthy controls from Bangladesh and Kenya x = "age", # x-axis variable stat.by = "country", # statistical groups within = "country", # don't compare samples from different countries fit = "gam", # smoothed trendline check = TRUE ) # display diagnostic plots
Here, the main plot tells us:
The diagnostic plots here indicate less than optimal distributions of residuals, with the "Normal Q-Q" plot showing that many more points lie above the trendline than below it. The main plot's trendlines don't spend much time between y = 0.75 and 0.83, hence the sparsity of points between x = 0.75 and 0.83 on the "Residuals vs Fitted" and "Scale-Location" plots.
Running bdiv_corrplot()
with trans = "rank"
could help make the residuals more normally distributed.
bdiv_corrplot( biom = hbk, # healthy controls from Bangladesh and Kenya x = "age", # x-axis variable stat.by = "country", # statistical groups within = "country", # don't compare samples from different countries fit = "gam", # smoothed trendline trans = "rank", # rank-transform the abundances check = TRUE ) # display diagnostic plots
After rank-transforming:
You could argue either for or against preferring the transformed figure over the original.
Comparing the AIC, BIC, and Log-Likelihood values for the untransformed and rank-transformed data can tell you which model is better, right? Wrong! These values scale with the range of the data, so the untransformed beta diversity data in the range of 0 - 1 will result in drastically different AIC, BIC, and Log-Likelihood values than the rank-transformed data in the range of 1 - 16400.
Generally speaking, you cannot compare goodness-of-fit values derived from different datasets. However, it is arguable that these two datasets are the same, just scaled differently. Therefore if you don't mind potentially incurring a reviewer's wrath, you can rescale both datasets to the same range and compare goodness-of-fit values on those.
plyr::ldply(c(none="none", rank="rank"), .id = "trans", function (trans) bdiv_table(biom = hbk, within = "country", trans = trans) %>% dplyr::mutate(.distance = scales::rescale(.distance, to = c(0, 1))) %>% stats_table(regr = "age", resp = ".distance", stat.by = "country", fit = "gam") %>% dplyr::select(.aic, .bic, .loglik) )
These goodness-of-fit metrics strongly favor the untransformed data.
By default, the *_corrplot()
functions will automatically find the x-axis location with the most significant term of interest. However, you can override this by providing the at
parameter.
stats_corrplot(npregfast::children, 'age', 'height', stat.by = 'sex', at = 12)
Although you're limited to one at
value for *_corrplot()
functions, you can provide multiple at
locations to stats_table()
or *_stats()
functions.
stats_table(npregfast::children, 'age', 'height', stat.by = 'sex', at = 11:15)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.