An Introduction to TITAN2

knitr::opts_chunk$set(echo=TRUE, collapse=TRUE, error=TRUE, comment = "#")

Introduction to Threshold Indicator Taxa Analysis with TITAN2

The purpose of this vignette is to walk users through the basic functionality of the package TITAN2 for analysis of taxon-specific contributions to community change along an environmental gradient. The Everglades data for this vignette was described in Baker and King (2010) first published by King and Richardson (2003).

For users familiar with the original version of titan(){.R} and associated functions, released as a text file for R 2.9.2 in Appendix 3 of Baker and King (2010), there are a number of updates included in this R package TITAN2, some of which may break your code. First, the original scripts have been divided into modular component functions to facilitate dissemination through CRAN. Second, a number of functions have been added to reduce run time and allow for the handling of large datasets. Third, several new analyses and plotting functions have been included, and others dropped. Fourth, several minor bugs have been fixed; these include screens for inappropriate data.

First Steps

For users new to TITAN2, the package can be loaded like any other R package downloaded from CRAN with install.packages("TITAN2"){.R}; this only needs to be done once. The package can then be loaded into an active R session with library(){.R}:

library("TITAN2")

Note that the package needs to be loaded once per R session.

As with the first version of the project, titan(){.R} requires 1) a site by taxa matrix and 2) an environmental gradient. To facilitate this discussion, we have included in TITAN2 examples of both: the datasets glades.taxa{.R} and glades.env{.R}. This is the same data included with the original version in Baker and King (2010). Users can load the taxa data with the data(){.R} function:

data(glades.taxa)
str(glades.taxa, list.len = 5)

See below for a discussion of how you can get your data into R.

The data(){.R} function loads the site by taxa matrix of 126 rows and 164 columns glades.taxa{.R}, and the str(){.R} ("structure") function gives a summary of what the data object is. The columns of glades.taxa{.R} contain $log_{10}(x+1)$ transformed densities of macroinvertebrate taxa (though transformation is usually unnecessary in TITAN2). The rows are sample units. Taxa names are 8-character abbreviations of scientific names. Only taxa with at least 5 occurrences are included. Taxa in your file must occur at least 3 times. Missing values are not allowed.

Loading the sample environmental gradient data is done similarly:

data(glades.env)
str(glades.env)

This dataset has 126 rows and 1 column, surface-water total phosphorus (ug/L). As with the taxa data, missing values are not allowed in the environmental gradient data.

We are now ready to run the titan(){.R} function. (Note: this takes a while to run, so beware.)

glades.titan <- titan(glades.env, glades.taxa)

In the above use of titan(){.R} many parameters are set by default, but they can be customized to the user's circumstance by direct specification. The default values assumed by the above call are

glades.titan <- titan(glades.env, glades.taxa, 
  minSplt = 5, numPerm = 250, boot = TRUE, nBoot = 500, imax = FALSE, 
  ivTot = FALSE, pur.cut = 0.95, rel.cut = 0.95, ncpus = 8, memory = FALSE
)

titan(){.R}'s extensive computations often require long run times, typically minutes or hours depending on the specifications. To save you some time from running the above function yourself, we've included the results in the built-in dataset glades.titan{.R}, accessible with the data(){.R} function as before:

data(glades.titan)
str(glades.titan, 1)

In titan(){.R} the primary arguments are the environmental gradient (here glades.env{.R}) and the taxa matrix (here glades.taxa{.R}). When you use the function, be sure to save the output to another variable (this is the glades.titan <- titan(gla... part above). Note that the deviance argument in titan(){.R} has been dropped from previous versions. Other arguments include the following; you can learn more about them in the documentation of titan(){.R} obtained by typing ?titan:

During normal program operation, the taxa dataset is screened to assess whether it is appropriate for titan(){.R} and the following messages are written to the screen:

cli::cli_alert_info("Screening taxa...")
cli::cli_alert("  100% occurrence detected 1 times (0.6% of taxa),")
cli::cli_alert("  use of TITAN less than ideal for this data type.")
cli::cli_alert_success("  Taxa frequency screen complete.")

The first screening checks for excessively rare or common taxa occurrences and returns a warning if 100% or below minimum number of occurrences are detected. In some cases where analysis would be nonsensical, titan(){.R} will end its run. The second screen checks to make sure the environmental gradient and the taxa matrix have the same number of records.

The next set of messages detail progress of the change point analysis computations on the observed data:

cli::cli_alert_info("Partitioning along gradient...")
cli::cli_alert_info("Calculating observed IndVal maxima and class values...")
cli::cli_alert_info("Calculating IndVals using mean relative abundance...")
cli::cli_alert_info("Permuting IndVal scores...")
cli::cli_alert_info("Summarizing observed results...")
cli::cli_alert_info("Estimating taxa change points using z-score maxima...")

The final message issued in a normal execution presents the progress of the bootstrap routine by counting off replicates (if ncpus == 1{.R}):

cli::cli_alert_info("Bootstrap resampling in sequence...")

followed by a progress bar or informing the user that the bootstrap is running in parallel (if ncpus == 8{.R} or greater):

cli::cli_alert_info("Bootstrap resampling in parallel using 8 CPUs... no index will be printed to screen.")

This is the most computationally intensive step in the analysis. The Everglades dataset takes ~6-7 seconds per bootstrap replicate, depending on processor speed. Thus, 500 replicates take 3500 seconds or 58 minutes and ~60 minutes for the entire run. However, setting ncpus = 2{.R} cuts this time in half to about 32 min, setting ncpus = 4{.R} lowers the time to about 15 min, etc. For initial explorations, users may consider setting boot = FALSE{.R}. However, we do not recommend any interpretation of graphical outputs or many tabular outputs until filtered by bootstrap diagnostics.

Note on Loading Your Own Data

As noted in the previous section, titan(){.R} requires two datasets: a site by taxa matrix and a environmental gradient dataset. The site by taxa matrix should ideally be of class data.frame{.R}, the standard data structure in R. These can be read into R in several ways, for example the read.table(){.R} or read.csv(){.R} functions. The readr package has other similar functions to help read data into R. The environmental gradient dataset can either be a data.frame{.R} object with one column or simply an atomic vector of class numeric{.R}. In addition to the the data-reading functions listed above, scan(){.R} can be helpful for reading data of this type into R. In addition to these, the file.choose(){.R} function allows users to select files interactively, which can help avoid pathing issues.

Tabular Results

Community change points are returned automatically at the end of each titan(){.R} call, but these tabular sum(z) change point results are part of the titan object and may be called as follows:

glades.titan$sumz.cp

The columns include the observed change point (cp{.R}) defined by the sum(z) maximum and selected quantiles (0.05-0.95) of the change points determined by resampling the observed data.

Rows include the change points corresponding to declining or negatively responding taxa (sumz-), those corresponding to the increasing or positively responding taxa (sumz+), and the corresponding scores using filtered versions of both sums. Filtering is achieved by computing the sum(z) scores using only those taxa that are determined to be pure and reliable indicators. FIltered scores are sensitive to pur.cut{.R} and rel.cut{.R} arguments at the titan(){.R} function call. Filtered scores may be used as a check to indicate whether impure or unreliable indicator taxa are contributing substantially to the pattern of sum(z) scores (in this case they do not).

To obtain tabular results for individual taxa we can simply type glades.titan$sppmax{.R}. Since this is a $164 \times 16$ matrix, we use the head(){.R} function to just display the top to get a feel for the results:

head(glades.titan$sppmax)

The resulting table lists all taxa as rows and the columns are as follows:

Remember that each titan{.R} object (created with the titan(){.R} function) contains a number of additional elements for post hoc exploration:

str(glades.titan, max.level = 1, give.attr = FALSE)

summary(glades.titan){.R} provides similar output. The items within each titan{.R} object include:

Visualizing Results

plot_sumz(){.R} and plot_sumz_density(){.R}

The community-level output may be viewed using the plot_sumz(){.R} or plot_sumz_density(){.R} functions. The plot_sumz_density(){.R} function is new to this version of TITAN2. We recommend its use because we think it will be both easier to understand than previous output and encourage appropriate interpretation of results.

There are many options for changing the look of the graphs; see their full documentations with ?plot_sumz{.R} or ?plot_sumz_density{.R}. An abbreviated version is presented below. To begin, here's how you use plot_sumz_density(){.R}:

plot_sumz_density(glades.titan)

plot_sumz_density(){.R} plots community level change identified with TITAN2. There are three component panels in the default output and we will consider them from the bottom to the top. The first plot is a slight modification of the original sum(z) plot first presented by Baker and King (2010). It shows the magnitude of change among taxa declining along the gradient (z-) and those increasing along the gradient. Peaks in the values indicate points along the environmental gradient that produce large amounts of change in community composition and/or structure. These are the nominal community change points. Plateaus denote regions of similar change. The sums are converted to areas using the logical argument ribbon{.R}, a line plot can be obtained by setting ribbon{.R} to FALSE{.R}, and the original point plot can be obtained by setting the logical argument points{.R} to TRUE{.R}.

plot_sumz_density(glades.titan, ribbon = FALSE, points = TRUE)

The second change from earlier versions of TITAN2 is that the logical argument filter{.R} defaults to TRUE{.R}. This argument ensures that the only taxa contributing to patterns in the sum(z) are pure and reliable taxa identified by the bootstrap results rather than all taxa. Setting filter{.R} to FALSE{.R} allows impure or unreliable taxa to contribute to apparent patterns of community change. In general, the filtered scores should show a similar pattern but be somewhat lower in magnitude depending on contributions from impure or unreliable taxa.

The middle panel shows the estimated probability density of the sum(z) across all bootstrap replicates. Previously these distributions were represented by cumulative frequency distributions of the sum(z) maxima. A narrow spread indicates a high degree of precision regarding the location of the community change point.

The top panel shows the observed sum(z-) and sum(z+) maxima as circles with the 95th percentile of their distributions as horzontal lines. The primary reason for this plot is to illustrate how much information is hidden by this technique relative to the probability density functions.

There are several additional plotting options in plot_sumz_density(){.R}. First, any of the three panels can be dropped by setting logical arguments sumz{.R}, density{.R}, or change_points{.R} to FALSE{.R}. Second, it is possible to display only the z- or z+ taxa through the use of the logical arguments sumz1{.R} and sumz2{.R}. Finally, the x axis label can be edited using the xlabel{.R} argument.

plot_sumz_density(glades.titan, 
  ribbon = TRUE, points = FALSE, sumz1 = FALSE, change_points = FALSE, 
  xlabel = expression(paste("Surface Water Total Phosphorus ("*mu*"g/l)"))
)

Besides the mandatory titan object, arguments include:

For plot_sumz{.R}, the older plotting function in TITAN2, filled and hollow symbols denote the magnitude of summed z scores of increasing (z+) or decreasing (z-) taxa. Peaks in the values indicate points along the environmental gradient that produce large amounts of change in community composition and/or structure. Solid and dashed lines are cumulative frequency distributions of sum(z-) and sum(z+) maxima (respectively) across bootstrap replicates. Vertical CFDs indicate narrow uncertainty about where the maximum chnage occurs, sloping or stair-step CFDs suggest broad uncertainty regarding the location of maximum change.

Note that users can adjust colors (e.g. col1 and fil1) and labels for the left hand y-axis using the argument y1label{.R} or the x-axis with xlabel{.R}. Setting filter = TRUE{.R} will display the filtered sum(z) scores. In general, the filtered scores should show a similar pattern but be somewhat lower in magnitude depending on contributions from impure or unreliable taxa:

plot_sumz(glades.titan, filter = TRUE)

Besides the mandatory titan object, arguments include:

plot_taxa_ridges(){.R}

Pure (purity>=0.95 or higher) and reliable (reliability>=0.95 or higher) indicator taxa that contribute to the overall sum(z) scores can be plotted using the function plot_taxa_ridges(){.R}.

Taxa that meet purity (pur.cut) and reliability (rel.cut) criteria are extracted from the full taxa output object titan.out$sppmax{.R}, separated into two groups: z- (negative responders, or decliners) and z+ (positive responders, or increasers).

These two groups of negative and positive responders are arranged into separate, stacked panels for plotting. Within each panel, taxa change points (values of x resulting in the largest indicator z value) are visualized as a probability density function of the nBoot bootstrapped replicates (n=500 or however many nBoot replicates specified by the user).

Taxa names are arranged discretely along the y-axis based on the rank order of the median (i.e., 50th percentile or 2nd quantile) change-point value from n bootstrap replicates. The rank order of taxa is low-to-high (ascending) for z- (negative, decliners) but high-to-low (descending) for z+ (positive, increasers), respectively.

There are many options for changing the look of the graph, so see the full documentation with ?plot_taxa_ridges{.R}, but you can get started by simply calling plot_taxa_ridges(){.R} on the output of titan(){.R}:

plot_taxa_ridges(glades.titan, axis.text.y = 8)

Optional arguments include:

Below is the same function with a user-specified x-label and n_ytaxa set to 50

plot_taxa_ridges(glades.titan, 
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)")), 
  n_ytaxa = 50
)

There are a total of 90 pure and reliable indicator taxa in the glades.titan{.R} output, thus 40 taxa were excluded from this plot because we set n_ytaxa = 50. The 50 taxa plotted are the top 50 out of 90 taxa based on descending rank order of purity{.R}, reliability{.R}, and z.median{.R}, respectively. This argument can be useful when there are too many pure and reliable taxa to effectively visualize in one plot.

Alternatively, set z1{.R} or z2 = FALSE.

plot_taxa_ridges(glades.titan, 
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)")), 
  z2 = FALSE
)

Gridlines are set to TRUE{.R} as the default, but can be removed if desired:

plot_taxa_ridges(glades.titan, 
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)")), 
  z2 = FALSE, grid = FALSE
)

Here's a final example with additional arguments: 1) d_lines = TRUE{.R} (dashed vertical line corresponding to the median change point from the density distribution function), 2) rel_heights{.R} set in proportion to the number of decreaser and increaser taxa, respectively, 3) xaxis = TRUE{.R}, 4) log-10 scaling of the x-axis, 5) specification of x-axis limits, 6) specification of x-axis breaks and tick mark values, and 7) resizing of x and y axes text and and x-axis title (axis.text.x{.R}, axis.text.y{.R}, and axis.title.x{.R}).

plot_taxa_ridges(
  glades.titan, 
  axis.text.x = 12, axis.text.y = 8, axis.title.x = 14,  
  rel_heights = c(0.45, 0.55), xaxis = TRUE, d_lines = TRUE, trans = "log10", 
  xlim = c(4, 200), breaks = c( 10, 20, 40, 80, 160), 
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)"))
)

plot_taxa(){.R}

Pure and reliable indicator taxa change points can be plotted using an older plotting function that was integral to the original TITAN2 release, plot_taxa(){.R}. We have replaced this plot with the distributions in plot_taxa_ridges(){.R} because it tended to emphasize the observed change points over the distribution of change points from the bootstrap replicates. Again, there are many options for changing the look of the graph, so see the full documentation with ?plot_taxa{.R}, but you can get started by simply calling plot_taxa(){.R} on the output of titan(){.R}:

plot_taxa(glades.titan, xlabel = "Surface Water TP (ug/l)")

Each plot includes robust declining taxa on the left axis, increasers on the right. Each median change point (50th percentile) of bootstrap replicates (or observed change point if z.med = FALSE) is indicated by the circular symbol; the horizontal lines suggest 5-95% quantiles from the bootstrapped change point distribution.

Optional arguments include:

log{.R}, at{.R}, xmin{.R}, xmax{.R}, xlabel{.R}, tck{.R}, bty{.R}, ntick{.R}, prtty{.R}, dig{.R}, leg.x{.R}, leg.y{.R}, cex{.R}, cex.axis{.R}, cex.leg{.R}, cex.lab{.R}, legend{.R}, col1{.R}, fil1{.R}, col2{.R}, and fil2{.R} are the same as in plot_sumz(){.R} above.

Here is the same function with z.med = FALSE{.R} (plots points at observed change-point value instead of 50% of bootreps; we recommend the default (TRUE{.R}), however, as it more robust, particularly in small data sets):

plot_taxa(
  glades.titan, z.med = FALSE,
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)"))
)

A more complete illustration of plot_taxa(), with z.med = TRUE{.R} (default), no legend, decreased size of taxa labels, increased size of x-axis labels, and added colored points and fill.

plot_taxa(
  glades.titan, 
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)")), 
  cex.taxa = 0.5, cex = 1.25, cex.axis = 1.1, legend = FALSE, 
  col1 = "black", col2 = "red", fil2 = "red"
)

Alternatively, with prob95 = TRUE:

plot_taxa(
  glades.titan, 
  xlabel = expression(paste("Surface water total phosphorus ("*mu*"g/l)")), 
  cex.taxa = 0.5, cex = 1.25, cex.axis = 1.1, legend = FALSE,  
  prob95 = TRUE, col1 = "black", col2 = "red", fil2 = "red"
)

Importantly, some of the control over the plotting that was optional in the original version has been internalized and warnings are issued when users plot results without using the bootstrap results. Tables of significant z- and z+ taxa used to make graph are automatically returned but no longer automatically written as text files (these can be easily extracted from the glades.titan$sppmax{.R} table).

plot_cps(){.R}

In TITAN2 2.0 and later, there are several plotting options; all contained within a single function centered on interpreting taxon-specific change points in the context of the results from the bootstrap resampling procedure.

The remaining arguments: xlabel{.R}, xmin{.R}, xmax{.R}, tck{.R}, bty{.R}, ntick{.R}, cex{.R}, cex.axis{.R}, cex.leg{.R}, cex.lab{.R}, leg.x{.R}, leg.y{.R} and legend{.R} are the same as in the previous plotting functions.

The default graphic from plot_cps(){.R} is a $z$-score weighted probability density function derived from the empirical distribution of change points across all bootstrap replicates plotted as a histogram for each pure and reliable taxon (blue for decreasers and red for increasers). NOTE: When you try to plot the matrix of plots consecutively you can generate an error about "Figure margins too large". This is easily addressed by closing the plot window, or turning the graphics device off.

plot_cps(glades.titan)

Typically, these histograms are visually scanned to distinguish those taxa whose bootstrapped change point distributions are clearly unimodal from those that are multimodal or more uniform. For any taxon of interest, a closer inspection can be achieved by setting the argument "taxaID" to a taxon ID number or label name. For example, in the plot above, if the declining (i.e., maxgrp == 1{.R}) taxon "ENALCIVI"{.R} is of interest, the user can set taxaID = 57{.R} or taxaID = "ENALCIVI" and either will produce the plot below. Again, you may need to close the plotting window or turn graphics off because we are switching plot margins.

plot_cps(glades.titan, taxaID = "ENALCIVI", xlabel = "Surface Water TP (ug/l)")

The base plot is simply taxon-specific abundance (black circles) along the environmental gradient. Overlaid near the top of the plot in red is the observed change point, or if cp.med = TRUE{.R}, the median change point across all bootstrap replicates. Overlaid on the abundance plot is the $z$-weighted (if z.weights = TRUE{.R}) probability density function of change point locations across all bootstrap replicates (blue histogram). Setting cp.hist = FALSE{.R} will remove the histogram. On the other hand, setting cp.trace = TRUE{.R} will plot the observed IndVal $z$ scores (in red; what TITAN2 uses by default if imax = FALSE{.R}) and the observed IndVal scores (in grey) for every candidate change point along the environmental gradient (see Baker and King 2013 for examples):

plot_cps(glades.titan, taxaID = "ENALCIVI", cp.trace = TRUE, xlabel = "Surface Water TP (ug/l)")

Below is an increasing taxon from the same data set with a conspicuously bimodal distribution of replicate change points:

plot_cps(glades.titan, taxaID = "OSTRASP5", cp.trace = TRUE, xlabel = "Surface Water TP (ug/l)")

Here it appears as if two potential change points were predominantly identified: one that kept nearly all of the abundance to the right, and another (the observed) than kept nearly all the absences to the left.

Finally, if taxa.dist = FALSE{.R}, instead of a taxon-specific plot, plot_cps(){.R} will produce a plot of the summed $z$-weighted (if z.weights = TRUE{.R}) probability densities across increasing (maxgrp == 1{.R}; blue) decreasing (maxgrp = 2; red) and all pure and reliable taxa (filter>0; grey):

plot_cps(glades.titan, taxa.dist = FALSE, xlabel = "Surface Water TP (ug/l)")

The accompanying table summarizes the location of both the $z$-weighted and unweighted maxima along the environmental gradient, as well as the magnitude of those maxima for decreasing taxa (maxgrp == 1{.R}), increasing taxa (maxgrp == 2{.R}), and all pure and reliable taxa combined. In this case there appears to be good correspondence between the sum(z) maxima and the summed $z$-weighted PDF maxima. In other words, the points at which community change appears to be greatest corresponds to where the sum of individual taxa changes tend to occur, even when different sets of observations are used. In practice, this is a desirable result we would expect to occur most of the time, yet exceptions to this pattern can also be informative. For example, in the plot above, there appears to be another candidate change point closer to ~10 ug/l where declining taxa tend to change frequently, and at both ~25 ug/l and ~40 ug/l there are large numbers of frequently occurring change points across both decliners and increasers even though they are not particularly large for either group by itself.

Alternatively, these different sums can be represented as a stacked bar chart by setting stacked = TRUE{.R}:

plot_cps(glades.titan, taxa.dist = FALSE, xlabel = "Surface Water TP (ug/l)", stacked = TRUE)

Here the blue and red circles correspond to the location of the respective maxima for decreasing and increasing taxa change points. In either case, the summed PDFs are meant to provide complementarity to the plot_sumz(){.R} graphic about where there are key values along the environmental gradient where many coincident change points occur regardless of the sample.

Citations:

Baker, M.E. and R. S. King. 2010. A new method for detecting and interpreting biodiversity and ecological community thresholds. Methods in Ecology and Evolution 1: 25-43.Z

Baker, M.E. and R. S. King. 2013. Of TITAN and strawmen: an appeal for greater understanding of community data. Freshwater Science 32(2):489-506.

King, R.S. and C. J. Richardson. 2003. Integrating bioassessment and ecological risk assessment: an approach to developing numerical water-quality criteria. Environmental Management 31(6):795-809.

Dufrene, M. and P. Legendre. 1997. Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological Monographs. 67:345-366.



Try the TITAN2 package in your browser

Any scripts or data that you put into this service are public.

TITAN2 documentation built on Nov. 15, 2023, 1:09 a.m.