secsse is an R package designed for multistate data sets under a concealed state and speciation ('hisse') framework. In this sense, it is parallel to the 'MuSSE' functionality implemented in 'diversitree', but it accounts for finding possible spurious relationships between traits and diversification rates ('false positives', Rabosky & Goldberg 2015) by testing against a 'hidden trait' (Beaulieu et al. 2013), which is responsible for more variation in diversification rates than the trait being investigated.

Similar to the 'diversitree' (Fitzjohn et al. 2012) and 'hisse' (Beaulieu & O'Meara 2016) packages, secsse uses two input files: a rooted, ultrametric tree in nexus format (for conversion of other formats to nexus, we refer to the documentation in package 'ape') and a data file with two columns, the first containing taxa names and the second a numeric code for trait state with a header (usually 0, 1, 2, 3, etc., but notice that 'NA' is a valid code too, if you are not sure what trait state to assign to a taxon). Here, we will use a simple trait dataset with only values 0 and 1, indicating presence and absence of a trait. A comma-separated value file (.csv) generated in MsExcel works particularly well. The *.csv file can be loaded into R using the read.csv() function. and should look like this:

library(secsse) data(traits) tail(traits)

This data set (here we see only the bottom lines of the data frame) has two character states labeled as 0 and 1. Ambiguity about trait state (you are not sure which trait state to assign a taxon too, or you have no data on trait state for a particular taxon), can be assigned using 'NA'. secsse handles 'NA' differently from a full trait state, in that it assigns probabilities to all trait states for a taxon demarcated with 'NA'.

The second object we need is an ultrametric phylogenetic tree, that is rooted and has labelled tips. One can load it in R by using read.nexus(). In our example we load a prepared phylogeny named "phylo_vignette":

data("phylo_vignette")

For running secsse it is important that tree tip labels agree with taxon names in the data file, but also that these are in the same order. For this purpose, we run the following piece of code prior to any analysis:

sorted_traits <- sortingtraits(traits, phylo_vignette)

If there is a mismatch in the number of taxa between data and tree file, you will receive an error message. However, to then identify which taxa are causing issues and if they are in the tree or data file, you can use the name.check function in the 'geiger'(Harmon et al. 2008) package:

library(geiger) #pick out all elements that do not agree between tree and data mismat <- name.check(phylo_vignette, traits) #this will call all taxa that are in the tree, but not the data file #mismat$tree_not_data #and conversely, #mismat$data_not_tree

If you have taxa in your tree file that do not appear in your trait file, it is
worth adding them with value `NA`

for trait state.
You can visualise the tip states using the package diversitree:

if (requireNamespace("diversitree")) { for_plot <- data.frame(trait = traits$trait, row.names = phylo_vignette$tip.label) diversitree::trait.plot(phylo_vignette, dat = for_plot, cols = list("trait" = c("blue", "red")), type = "p") }

After you are done properly setting up your data, you can proceed to setting parameters and constraints.

If the user wishes to assign a taxon to multiple trait states, because he/she is
unsure which state best describes the taxon, he/she can use `NA`

. `NA`

is used
when there is no information on possible state at all; for example when a state
was not measured or a taxon is unavailable for inspection. `NA`

means a taxon is
equally likely to pertain to any state. In case the user does have some
information, for example if a taxon can pertain to multiple states, or if there
is uncertainty regarding state but one or multiple states can with certainty be
excluded, secsse offers flexibility to handle ambiguity. In this case, the user
only needs to supply a trait file, with at least four columns, one for the taxon
name, and three for trait state. Below, we show an example of what the trait
info should be like (the column with species' names has been removed). If a
taxon may pertain to trait state 1 or 3, but not to 2, the three columns should
have at least the values 1 and a 3, but never 2 (species in the third row). On
the other hand, the species in the fifth row can pertain to all states: the
first column would have a 1, the second a 2, the third a 3 (although if you only
have this type of ambiguity, it is easier to assign `NA`

and use a single-column
data file).

# traits traits traits # [1,] 2 2 2 # [2,] 1 1 1 # [3,] 2 2 2 # [4,] 3 1 1 # [5,] 1 2 3

To perform a Maximum Likelihood analysis, secsse makes use of the
function `DDD::optimize()`

, which in turn, typically, uses the subplex
package to perform the Maximum Likelihood optimization. In such an
analysis, we need to specify which parameters we want to optimize, which
parameters to keep fix, and the initial values per parameter. We do so
by providing the structure of the input parameters (e.g. in vector,
matrix or list form), and within this structure we highlight values that
stay at zero with a 0, and parameters to be inferred with indexes 1, 2,
... n. The optimizer will then use these indexes to fill in the
associated parameters and perform the optimization. If this all seems a
bit unclear, please continue reading and look at the fully set up
parameterization for the maximum likelihood below to gain more insight.

In the ETD model, we assume that the examined trait affects diversification. In a secsse analysis we need to specify the structure of three distinct properties: the lambda list, the mu vector and the transition (Q) matrix. Each of these informs properties of the model of speciation, extinction and trait-shifts respectively.

Speciation in a secsse model is defined using a list of matrices, where each matrix highlights the state of the daughter species resulting from a speciation event. In our case, we have a trait with two states, and thus we will have to specify a list with two matrices, one for each state, where each matrix in turn will then specify the daughter states. We can do so by hand, but secsse includes functionality to do this in a more organized manner - this is especially useful if you have a trait with more than two states for instance. In this more organized manner, we can provide secsse with a matrix specifying the potential speciation results, and secsse will construct the lambda list accordingly:

spec_matrix <- c() spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1)) spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2)) lambda_list <- secsse::create_lambda_list(state_names = c(0, 1), num_concealed_states = 2, transition_matrix = spec_matrix, model = "ETD") lambda_list

Let's see what the code has done. First, we create a `spec_matrix`

, where
the first column indicates the parent species (0 or 1) and the second
and third column indicate the identities of the two daughter species. In
this case, we choose for symmetric speciation without a change of trait,
e.g. the daughters have the same trait as the parent. If you have
evidence of perhaps asymmetric inheritance, you can specify this here.
The fourth column indicates the associated rate indicator. In this case
we choose two different speciation rates. We choose two concealed
states, as it is good practice to have the same number of concealed
states as observed states. The resulting `lambda_list`

then contains four
entries, one for each unique state (see the names of the entries in the list),
that is, for each combination of observed and concealed states, where the
concealed states are indicated with a capital letter.
Looking at the first entry in the list, e.g. the
result of a speciation event starting with a parent in state 0A, will
result with rate 1 in two daughter species of state 0A as well. The way
to read this, is by looking at the row and column identifiers of the
entered rate. Similarly, for a speciation event starting in state 1A
(`lambda_list[[2]]`

), the two daughter species are 1A as well, but this
time with rate 2, as we specified that species with trait 1 will have a
different speciation rate. Note that here, rates 1 and 2 are ordered
with the observed trait, we will later explore the CTD model, where the
rates will be sorted according to the concealed state.

Having the speciation rates set, we can move on to extinction rates. Since we are using the ETD model, here we also expect the extinction rates to be different:

mu_vec <- secsse::create_mu_vector(state_names = c(0, 1), num_concealed_states = 2, model = "ETD", lambda_list = lambda_list) mu_vec

The function `create_mus_vector()`

takes the same standard information we
provided earlier, with as addition our previously made `lambda_list`

. It uses the
`lambda_list`

to identify the rate indicators (in this case 1 and 2) that
are already used and to thus pick new rates. We see that secsse has
created a named vector with two extinction rates (3 and 4), which are
associated with our observed traits 0 and 1.

Lastly, we need to specify our transition matrix. Often, Q matrices can get
quite large and complicated, the more states you are analyzing. We have devised
a tool to more easily put together Q matrices. This tool starts from the
so-called `shift_matrix`

, the basic matrix in which we only find information on
transitions between examined states. The information contained in this
`shift_matrix`

is then automatically mimicked for inclusion in the full matrix,
to ensure that the same complexity in examined state transitions is also found
in concealed states.
Instead of specifying the entire `shift_matrix`

, instead it suffices to
only specify the non-zero transitions. In this case these are from state 0
to 1, and vice versa:

shift_matrix <- c() shift_matrix <- rbind(shift_matrix, c(0, 1, 5)) shift_matrix <- rbind(shift_matrix, c(1, 0, 6)) q_matrix <- secsse::create_q_matrix(state_names = c(0, 1), num_concealed_states = 2, shift_matrix = shift_matrix, diff.conceal = TRUE) q_matrix

Thus, we first specify a matrix containing the potential state
transitions, here 0->1 and 1->0. Then, we use
`create_q_matrix()`

to create the q-matrix. By setting
`diff.conceal`

to `TRUE`

, we ensure that the concealed states will get
their own rates specified. Setting this to `FALSE`

would set their rates
equal to the observed rates (5 and 6). The way to read the transition
matrix is column-row, e.g. starting at state 0A, with rate 5 the species
will shift to state 1A and with rate 7 it will shift to state 0B. We
intentionally ignore 'double' shifts, e.g. from 0A to 1B, where both the
observed and the concealed trait shift at the same time. If you have
good evidence to include such shifts in your model, you can modify the
trans_matrix by hand of course.

We have now specified the required ingredients to perform Maximum Likelihood analyses. Prerequisites for performing Maximum Likelihood analyses with secsse are that we specify the ids of the rates we want optimized, and provide initial values. We can do so as follows:

idparsopt <- 1:8 # our maximum rate parameter was 8 idparsfix <- c(0) # we want to keep all zeros at zero initparsopt <- rep(0.1, 8) initparsfix <- c(0.0) # all zeros remain at zero. sampling_fraction <- c(1, 1)

Here, we specify that we want to optimize all parameters with rates 1,
2, ..., 8. We set these at initial values at 0.1 for all parameters. Here, we
will only use one starting point, but in practice it is often advisable
to explore multiple different initial values to avoid getting stuck in a
local optimum and missing the global optimum. `idparsfix`

and `initparsfix`

indicate that all entries with a zero are to be kept at the value zero.
Lastly, we set the sampling fraction to be c(1, 1), this indicates to
secsse that we have sampled per trait all species with that trait in our
dataset. Alternatively, if we know that perhaps some species with trait
0 are missing, we could specify that as c(0.8, 1.0). Thus, note that the
sampling fraction does not add up to 1 across traits, but within traits.

And now we can perform maximum likelihood:

idparslist <- list() idparslist[[1]] <- lambda_list idparslist[[2]] <- mu_vec idparslist[[3]] <- q_matrix answ <- secsse::cla_secsse_ml(phy = phylo_vignette, traits = traits$trait, num_concealed_states = 2, idparslist = idparslist, idparsopt = idparsopt, initparsopt = initparsopt, idparsfix = idparsfix, parsfix = initparsfix, sampling_fraction = sampling_fraction, verbose = FALSE)

We can now extract several pieces of information from the returned answer:

```{R ETD_res} ML_ETD <- answ$ML ETD_par <- secsse::extract_par_vals(idparslist, answ$MLpars) ML_ETD ETD_par spec_rates <- ETD_par[1:2] ext_rates <- ETD_par[3:4] Q_Examined <- ETD_par[5:6] Q_Concealed <- ETD_par[7:8] spec_rates ext_rates Q_Examined Q_Concealed

The function `extract_par_vals()` goes over the list `answ$MLpars` and places the found parameter values back in consecutive vector 1:8 in this case. Here, we find that the speciation rate of trait 1 is higher than the speciation rate of trait 0. ### CTD Let's compare our findings with a CTD model, e.g. a model centered around the concealed trait. Again, we need to specify our lambda list, mu vector and transition matrix. We will see that this is quite straightforward now that we have gotten the hang of how this works. #### Lambda matrices Again, we specify two distinct rates, indicating that the observed state inherits faithfully to the daughter species. However, this time, we set the model indicator to "CTD": ```r spec_matrix <- c() spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1)) spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 2)) lambda_list <- secsse::create_lambda_list(state_names = c(0, 1), num_concealed_states = 2, transition_matrix = spec_matrix, model = "CTD") lambda_list

The resulting `lambda_list`

now has the chosen rates 1 and 2 sorted
differently across the matrices, with matrices 1 and 2 containing rate
1, and matrices 3 and 4 containing rate 2. Looking at the column names
of the matrices, states 1 and 2 are states 0A and 1A, and states 3 and 4
are states 0B and 1B, in other words, speciation rate 1 is now
associated with all states with concealed state A, and speciation rate 2
is now associated with all states with concealed state B.

For the mu vector, we repeat the same we did for the ETD model:

mu_vec <- secsse::create_mu_vector(state_names = c(0, 1), num_concealed_states = 2, model = "CTD", lambda_list = lambda_list) mu_vec

Here, again, we see that whereas previously extinction rate 3 was associated with states 0A and 0B (e.g. all states with state 0), it is now associated with states 0A and 1A, e.g. all states associated with concealed state A.

Setting up the transition matrix is not different from the ETD model, the same transitions are possible:

shift_matrix <- c() shift_matrix <- rbind(shift_matrix, c(0, 1, 5)) shift_matrix <- rbind(shift_matrix, c(1, 0, 6)) q_matrix <- secsse::create_q_matrix(state_names = c(0, 1), num_concealed_states = 2, shift_matrix = shift_matrix, diff.conceal = TRUE) q_matrix

Now that we have specified our matrices, we can use the same code we used for the ETD model to perform our maximum likelihood:

idparsopt <- 1:8 # our maximum rate parameter was 8 idparsfix <- c(0) # we want to keep all zeros at zero initparsopt <- rep(0.1, 8) initparsfix <- c(0.0) # all zeros remain at zero. sampling_fraction <- c(1, 1) idparslist <- list() idparslist[[1]] <- lambda_list idparslist[[2]] <- mu_vec idparslist[[3]] <- q_matrix answ <- secsse::cla_secsse_ml(phy = phylo_vignette, traits = traits$trait, num_concealed_states = 2, idparslist = idparslist, idparsopt = idparsopt, initparsopt = initparsopt, idparsfix = idparsfix, parsfix = initparsfix, sampling_fraction = sampling_fraction, verbose = FALSE) ML_CTD <- answ$ML CTD_par <- secsse::extract_par_vals(idparslist, answ$MLpars) ML_CTD CTD_par spec_rates <- CTD_par[1:2] ext_rates <- CTD_par[3:4] Q_Examined <- CTD_par[5:6] Q_Concealed <- CTD_par[7:8] spec_rates ext_rates Q_Examined Q_Concealed

Here we now find that state A has a very low speciation rate, in
contrast to a much higher speciation rate for state B (remember that
speciation rate 1 is now associated with A, and not with state 0!).
Similarly, extinction rates for both states are also quite different,
with state A having a much lower extinction rate than state B. Examined
trait shifts (`Q_Examined`

) are quite low, whereas concealed trait shifts
seem to be quite high. The LogLikelihood seems to be lower than what we
found for the ETD model.

As a check, we will also fit a model where there is no trait effect - perhaps we are looking for an effect when there is none. This is always a good sanity check.

To specify the lambda matrices, this time we choose the same rate indicator across both states.

spec_matrix <- c() spec_matrix <- rbind(spec_matrix, c(0, 0, 0, 1)) spec_matrix <- rbind(spec_matrix, c(1, 1, 1, 1)) lambda_list <- secsse::create_lambda_list(state_names = c(0, 1), num_concealed_states = 2, transition_matrix = spec_matrix, model = "CR") lambda_list

The mu vector follows closely from this, having a shared extinction rate across all states:

mu_vec <- secsse::create_mu_vector(state_names = c(0, 1), num_concealed_states = 2, model = "CR", lambda_list = lambda_list) mu_vec

We will use the same transition matrix as used before, although one could perhaps argue that without a trait effect, all rates in the transition matrix (both forward and reverse trait shifts) should share the same rate. Here, we will choose the more parameter-rich version (Home assignment: try to modify the code to perform an analysis in which all rates in the transition matrix are the same).

shift_matrix <- c() shift_matrix <- rbind(shift_matrix, c(0, 1, 3)) shift_matrix <- rbind(shift_matrix, c(1, 0, 4)) q_matrix <- secsse::create_q_matrix(state_names = c(0, 1), num_concealed_states = 2, shift_matrix = shift_matrix, diff.conceal = TRUE) q_matrix

idparsopt <- 1:6 # our maximum rate parameter was 6 idparsfix <- c(0) # we want to keep all zeros at zero initparsopt <- rep(0.1, 6) initparsfix <- c(0.0) # all zeros remain at zero. sampling_fraction <- c(1, 1) idparslist <- list() idparslist[[1]] <- lambda_list idparslist[[2]] <- mu_vec idparslist[[3]] <- q_matrix answ <- secsse::cla_secsse_ml(phy = phylo_vignette, traits = traits$trait, num_concealed_states = 2, idparslist = idparslist, idparsopt = idparsopt, initparsopt = initparsopt, idparsfix = idparsfix, parsfix = initparsfix, sampling_fraction = sampling_fraction, verbose = FALSE) ML_CR <- answ$ML CR_par <- secsse::extract_par_vals(idparslist, answ$MLpars) ML_CR CR_par spec_rate <- CR_par[1] ext_rate <- CR_par[2] Q_Examined <- CR_par[3:4] Q_Concealed <- CR_par[5:6] spec_rate ext_rate Q_Examined Q_Concealed

We now recover a non-zero extinction rate, and much higher transition rates for the concealed than for the observed states.

Having collected the different log likelihoods, we can directly compare the models using AIC. Remembering that the AIC is 2k - 2LL, where k is the number of parameters of each model and LL is the Log Likelihood, we can calculate this as follows:

res <- data.frame(ll = c(ML_ETD, ML_CTD, ML_CR), k = c(8, 8, 6), model = c("ETD", "CTD", "CR")) res$AIC <- 2 * res$k - 2 * res$ll res

I can now reveal to you that the tree we used was generated using an ETD model, which we have correctly recovered!

If after reading these vignettes, you still have questions, please feel free to create an issue at the package's GitHub repository https://github.com/rsetienne/secsse/issues or e-mail the authors for help with this R package. Additionally, bug reports and feature requests are welcome by the same means.

Beaulieu, J. M., O'Meara, B. C., & Donoghue, M. J. (2013). Identifying hidden rate changes in the evolution of a binary morphological character: the evolution of plant habit in campanulid angiosperms. Systematic biology, 62(5), 725-737.

Beaulieu, J. M., & O'Meara, B. C. (2016). Detecting hidden diversification shifts in models of trait-dependent speciation and extinction. Systematic biology, 65(4), 583-601.

FitzJohn, R. G. (2012). Diversitree: comparative phylogenetic analyses of diversification in R. Methods in Ecology and Evolution, 3(6), 1084-1092.

Harmon, L. J., Weir, J. T., Brock, C. D., Glor, R. E., & Challenger, W. (2008). GEIGER: investigating evolutionary radiations. Bioinformatics, 24(1), 129-131.

Rabosky, D. L., & Goldberg, E. E. (2015). Model inadequacy and mistaken inferences of trait-dependent speciation. Systematic Biology, 64(2), 340-355.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.