Metatree: Builds metatree from source data

View source: R/Metatree.R

Description

Builds a metatree data set from a set of source data.

Usage

Metatree(
  MRPDirectory,
  XMLDirectory,
  InclusiveDataList = c(),
  ExclusiveDataList = c(),
  TargetClade = "",
  HigherTaxaToCollapse = c(),
  SpeciesToExclude = c(),
  MissingSpecies = "exclude",
  Interval = NULL,
  VeilLine = TRUE,
  IncludeSpecimenLevelOTUs = TRUE,
  BackboneConstraint = NULL,
  MonophylyConstraint = NULL,
  RelativeWeights = c(1, 1, 1, 1),
  WeightCombination = "sum",
  ReportContradictionsToScreen = FALSE,
  ExcludeTaxonomyMRP = FALSE
)

Arguments

MRPDirectory

The directory in which the MRP files are to be read from. See details.

XMLDirectory

The directory in which the XML files are to be read from. See details.

InclusiveDataList

A vector of the data sets to include in the metatree. Can be left empty to just read all files in MRPDirectory and XMLDirectory.

ExclusiveDataList

A vector of any data sets to exclude from the metatree. Can be left empty if all data sets in MRPDirectory and XMLDirectory are valid. (Intended to exclude things like oogenera or footprint analyses, other supertree data sets etc.)

TargetClade

The name of the target clade of the metatree (e.g., "Dinosauria"). OTUs outside of this clade will be pruned.

HigherTaxaToCollapse

Vector of any higher taxa to collapse (e.g., if you are focused on relationships in a stem-group). NB: It is very important that these are safely monophyletic or the results will be confounded.

SpeciesToExclude

Vector of any individual species to be excluded from the final metatree. Intended to deal with problematic taxa, for example, the dinosaurs Eshanosaurus and Ricardoestesia.

MissingSpecies

What to do with species assigned to the target clade, but not present in the source data. Options are: "exclude" (excludes these missing species; the default and safest option), "genus" (include those species in a genus-level polytomy if the genus is sampled in the source data), and "all" (every species assigned to the target clade will be included). Note that neither "genus" nor "all" should be used without careful checks of the taxonomy.

Interval

If restricting the sample to a specific interval of geologic time then use this option (passed to PaleobiologyDBDescendantFinder which should be consulted for formatting). Default is NULL (no restriction on ages of tips to be included).

VeilLine

A logical indicating whether to remove older data sets that do not increase taxonomic coverage (TRUE; the default and recommended) or not (FALSE). See Lloyd et al. (2016) and the details section below for more information.

IncludeSpecimenLevelOTUs

A logical indicating whether specimen-level OTUs should (TRUE; the default) or should not (FALSE) be included in the metatree. See details.

BackboneConstraint

The file name of one of the source data sets to be used as a backbone constraint (will enforce topology in final metatree but allows taxa not in topology to fall out inside the constraint). This is not required and the default (NULL) will mean no constraint is applied. See details for more information.

MonophylyConstraint

The file name of one of the source data sets to be used as a monophyly constraint (will enforce topology in final metatree and forces taxa not in topology to fall outside the constraint). This is not required and the default (NULL) will mean no constraint is applied. See details for more information.

RelativeWeights

A numeric vector of four values (default c(1, 1, 1, 1)) giving the respective weights to use for: 1) the input weights (the weights read in from the source MRP files), 2) the publication year weights (from equation 1 in the supplement of Lloyd et al. 2016), 3) the data set dependency weights (1 / the number of "sibling" data sets; see Lloyd et al. 2016), and 4) the within-matrix weights of individual clades (1 / number of conflicting clades). Zeroes exclude particular weighting types. E.g., to only use input weights use c(1, 0, 0, 0).

WeightCombination

How to combine the weights above. Must be one of either "product" or "sum". Note product will exclude zero weight values to avoid zero weight output. E.g., if only using input weights the result of combining weights will not be all zeroes simply because the other types of weight are set at zero.

ReportContradictionsToScreen

Logical indicating whether or not to print any taxonomy-phylogeny contradictions found to the screen. These can aid checking for congruence between taxonomy and phylogeny, i.e., they inform the user as to whether either the Paleobiology Database or the metadata might need amending.

ExcludeTaxonomyMRP

Logical indicating whether to exclude the taxonomy MRP. NOT RECOMMENDED.

Details

Introduction

Broadly speaking this function is an implementation and extension of the approach to generating composite phylogenetic trees laid out in Lloyd et al. (2016), which itself builds on Lloyd et al. (2008), namely the "metatree" approach. Metatrees are most comparable to formal supertrees but differ in that instead of published trees (figures in source publications) the input data are the original character-taxon matrices and/or sequence alignments. Thus metatrees can be considered superior to formal supertrees if you want to: 1) standardise the way input data are analysed, 2) choose the optimality criterion applied to inference instead of being forced to use whatever the original study used, 3) include non-focal species that may have been removed from published figures (improving taxon overlap), and 4) more properly incorporate phylogenetic uncertainty rather than being restricted to the use of consensus topologies.

Input values

Formatting input data

The implementation here assumes this data set reanalysis has already been performed, and the results have been encoded in NEXUS (Maddison et al. 1997) format using Matrix Representation with Parsimony (MRP; Baum 1992; Ragan 1992). Lloyd et al. (2016) further suggested that such MRP should represent every bipartition present across a sample of trees (e.g., all most parsimonious trees or all trees in a posterior sample). An example of how this should be formatted is shown below:

#NEXUS

BEGIN DATA;
  DIMENSIONS  NTAX=4 NCHAR=3;
  FORMAT SYMBOLS= " 0 1" MISSING=? GAP=- ;
MATRIX

Ancilla      000
Turrancilla  011
Ancillista   101
Amalda       111
;
END;

BEGIN ASSUMPTIONS;
  OPTIONS  DEFTYPE=unord PolyTcount=MINSTEPS ;
END;

Note that all characters must be either zero or one (i.e., the function cannot currently deal with Purvis MRP; Purvis 1995). Each MRP file should also have a corresponding metadata file in a specific format expressed as XML. Multiple examples of this format are available at graemetlloyd.com and a much more detailed description of the XML structure can be found in the help file for the ReadMetatreeXML function, but a simple version is also shown below. Note that this corresponds to the MRP file above.

<?xml version="1.0" standalone="yes"?>
<SourceTree>
  <Source>
    <Author>
      <List>Michaux, B.</List>
    </Author>
    <Year>1989</Year>
    <Title>Cladograms can reconstruct phylogenies: an example from the fossil record</Title>
    <Journal>Alcheringa</Journal>
    <Volume>13</Volume>
    <Pages>21-36</Pages>
    <Booktitle/>
    <Publisher/>
    <City/>
    <Editor/>
  </Source>
  <Taxa number="4">
    <List recon_name="Ancilla" recon_no="10760">Ancilla</List>
    <List recon_name="DELETE" recon_no="-1">Turrancilla</List>
    <List recon_name="Ancillista" recon_no="10763">Ancillista</List>
    <List recon_name="Amalda" recon_no="10743">Amalda</List>
  </Taxa>
  <Characters>
    <Molecular/>
    <Morphological number="11">
      <Type>Osteology</Type>
    </Morphological>
    <Behavioural/>
    <Other/>
  </Characters>
  <Analysis>
    <Type>Maximum Parsimony</Type>
  </Analysis>
  <Notes>Based on reanalysis of the original matrix.</Notes>
  <Filename>Michaux_1989aa</Filename>
  <Parent/>
  <Sibling/>
</SourceTree>

This format is itself a modification of that used by the Supertree Toolkit (Hill and Davis 2014) and contains two pieces of critical information used by the function: 1) how the Operational Taxonomic Units (OTUs) of the source data set should be reconciled to real taxa (the <Taxa> tag), and 2) how the source data are related to each other (the <Parent> and <Sibling> tags). At present the former operates exclusively using the Paleobiology Database, which is considered as the taxonomic authority. (NB: This means currently the metatree approach is better suited to extinct-only or total-group inference, but extant taxa are also included there.) For each OTU both the formal name (recon_name; as spelled in the database, with spaces replaced by underscores) and its associated database number (recon_no) should be provided. For example, the OTU name "Trex" would be reconciled to "Tyrannosaurus_rex" and the number "54833". These numbers can be found by looking up individual taxa in the database and referring to the URL (e.g., 54833).

Input directories

Once these files are prepared for use they should be placed in separate directories (one for MRP files and one for XML files) and the paths to these folders are the primary inputs to the function (as MRPDirectory and XMLDirectory, respectively). Note that aside from the file ending (mrp.nex for MRP files, .xml for XML files) filenames must match perfectly between corresponding MRP and XML files. For example, Eppes_etal_2005amrp.nex in the MRP directory must have a corresponding Eppes_etal_2005a.xml file in the XML directory. (Note the "mrp.nex" ending.)
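
The pairing rule above can be checked before running the function. The following is a minimal sketch in base R; CheckFilePairs is a hypothetical helper written for illustration (it is not part of the package), and it assumes the only difference between paired names is the "mrp.nex" versus ".xml" ending.

```r
# Hypothetical helper (not part of the package): report data sets lacking a partner file.
CheckFilePairs <- function(MRPFiles, XMLFiles) {
  # Strip the "mrp.nex" and ".xml" endings to recover the shared data set names.
  MRPNames <- sub("mrp\\.nex$", "", MRPFiles)
  XMLNames <- sub("\\.xml$", "", XMLFiles)
  list(
    MissingXML = setdiff(MRPNames, XMLNames), # MRP file present, XML file absent
    MissingMRP = setdiff(XMLNames, MRPNames)  # XML file present, MRP file absent
  )
}

# Toy file names; in practice these might come from list.files() on each directory.
Unpaired <- CheckFilePairs(
  MRPFiles = c("Eppes_etal_2005amrp.nex", "White_et_Pinkman_2008amrp.nex"),
  XMLFiles = c("Eppes_etal_2005a.xml")
)
print(Unpaired)
```

Any names reported in either element would need a matching file added (or the stray file removed) before the two directories are passed to the function.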

Data set inclusion and exclusion

By default the function will use all MRP and XML files found in the two directories, but the user may also wish to supply a limited list of data sets instead. To include only particular data sets the user should supply InclusiveDataList as a vector of strings corresponding to the file names (without extension). For example, InclusiveDataList = c("Eppes_etal_2005a", "White_et_Pinkman_2008a"). ExclusiveDataList works the same way but in reverse and might be preferable if the data sets to exclude are fewer than those to include.

Target clade

Generally speaking, not all taxa found in the source trees should be included in the final analysis, so it is necessary to supply a target clade for the function to use. Again, this depends on the Paleobiology Database and so this must be a valid taxon there and spelled correctly (e.g., TargetClade = "Dinosauria"). Any species not assigned to this taxon will be excluded from the final tree, so in practice you may wish to use a slightly more inclusive taxon (e.g., "Dinosauromorpha" rather than "Dinosauria"). In theory if you wanted to include all taxa you could supply the top of the taxonomic hierarchy, i.e., TargetClade = "Life".

Taxonomic options

It may be that a more specific complement of taxa is desired and there are multiple ways to achieve this. By default the function will only include species present in the sample (species-level reconciliations amongst the source data sets; MissingSpecies = "exclude"). This can be reduced or expanded depending on what is required.

More species can be included by changing the MissingSpecies option to something else. Additional taxa can be included if you use "genus" (all species in the Paleobiology Database assigned to a sampled genus are also included, emerging in a polytomy from that genus) or "all" (all species assigned to the target clade are included, again emerging from polytomies for each corresponding supraspecific taxon nested inside the target clade). These options represent a trade-off between taxonomic coverage and precision. If in doubt it is recommended that the user stick to the default here as more inclusive options are also likely to include problematic or obsolete taxa excluded from source data sets for good reasons, or inaccurate placements because the current taxonomic synthesis needs updating.

Species can also be individually excluded and in multiple ways. A simple means is by using the SpeciesToExclude option where the user may supply a vector of names to exclude from the final analysis. For example, SpeciesToExclude = c("Tyrannosaurus_rex", "Triceratops_horridus"). Again, underscores should separate genus and species and names must match the Paleobiology Database version. However, in practice it may be that the desired species to exclude really represent a clade (supraspecific taxon) and hence it would make more sense to provide a single name instead of many. This can be done with the HigherTaxaToCollapse option. Note that this will not remove the taxa completely but replace them with a single OTU that will appear in ALL CAPS in the final tree. For example, if the desired target clade is Dinosauria exclusive of crown-birds you could use HigherTaxaToCollapse = c("Neornithes"), which will replace all crown-bird OTUs with the single taxon "NEORNITHES". Note that this option should only be applied if the user is certain that the higher taxon is monophyletic in the Paleobiology Database, otherwise the results may be compromised.

Specifying a sampling interval

Another way of more specifically sampling taxa might be temporal. For example, if the desired tree is Cretaceous dinosaurs only. This can be specified using the Interval option. This works by using the PaleobiologyDBDescendantFinder function to identify valid species assigned to the specified interval. Here both a highest and a lowest interval must be specified, so for our Cretaceous example these would be the same, i.e., Interval = c("Cretaceous", "Cretaceous"). Note that currently the function only accepts geologic periods and not finer subdivisions. This option should also be used with caution as not all taxa in the database have temporal information (meaning Cretaceous species could be excluded unintentionally simply because their fossil occurrence(s) have not yet been entered into the database). Additionally, specimen-level OTUs (see below) cannot be excluded this way.

Use of a veil line

A common criticism of formal supertrees is that they treat all source data sets equally, regardless of age or quality. However, this is actually an issue of poor implementation rather than a limitation of the approach - good meta-analysis can (and should) differentially weight source data (see discussion on weights below). Another way to deal with this issue is to apply some form of "veil line" where a specified year is used and data sets older than this year excluded from the analysis. However, there are good reasons to not do this a priori. For example, choosing a year would ideally be based on some explicit quantitative criteri(a) that is actually tested. Additionally, excluding older data sets a priori might lead to critical dependency information also being excluded, compromising both data set weighting and pruning (see below).

Here the veil line criterion of Lloyd et al. (2016) is applied by default (VeilLine = TRUE). This assumes that the desired optimality criterion is taxonomic coverage and hence searches for the most recent year that can serve as a cut-off while still allowing all possible taxa to be included. To put this another way, for each valid OTU it finds the most recent source data set in which it was included. The oldest of these (the taxon least recently included in a phylogenetic analysis) will set the veil line.

Any data set older than this will not appear in the final analysis. However, any information included in such data sets pertaining to non-dependence is retained. In practice the major advantage here is to reduce the amount of data in the final output, but also to exclude older, potentially less accurate data.
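
The veil line logic described above can be sketched in a few lines of base R. This is an illustration only, using a made-up sampling table (the data set names, years and OTU names are all hypothetical) rather than the function's internal code.

```r
# Toy table: one row per (data set, OTU) pair, with the data set's publication year.
Sampling <- data.frame(
  DataSet = c("A_1990a", "A_1990a", "B_2005a", "B_2005a", "B_2005a", "C_2012a", "C_2012a"),
  Year = c(1990, 1990, 2005, 2005, 2005, 2012, 2012),
  OTU = c("Taxon_a", "Taxon_b", "Taxon_a", "Taxon_b", "Taxon_c", "Taxon_c", "Taxon_d"),
  stringsAsFactors = FALSE
)

# For each OTU find the most recent year in which it was included in a data set.
MostRecentInclusion <- tapply(Sampling$Year, Sampling$OTU, max)

# The oldest of these years sets the veil line.
VeilYear <- min(MostRecentInclusion)

# Data sets of this age or younger are retained.
RetainedDataSets <- unique(Sampling$DataSet[Sampling$Year >= VeilYear])
print(VeilYear)
print(RetainedDataSets)
```

Here the 1990 data set adds no taxa that are not also sampled in 2005 or later, so it falls behind the veil line without any loss of taxonomic coverage.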

Specimen-level OTUs

By default the function assumes the desired taxonomic-level of OTUs is the species level. Aside from not being implemented here, use of higher-levels such as the genus are strongly cautioned against as they can incorporate all kinds of problems into the analysis (e.g., many genera, extinct and extant, are para- or polyphyletic). However, in practice many source data sets of fossil taxa will include OTUs without a valid species name. For example, the dinosaur fossil nicknamed "Dave" known from specimen NGMC91 (Wikipedia page), which has been included in numerous phylogenetic analyses of theropod dinosaurs. Such specimen-level OTUs can still be included in the analysis by assigning them to the lowest-level taxon to which they can be safely assigned (e.g., genus, family etc.) and then including their specimen number in the name to ensure they are unique. For example, Lloyd et al. (2016) included two specimen-level OTUs assigned to the family Alvarezsauridae and given the names Alvarezsauridae_indet_MPC_100_99and120 and Alvarezsauridae_indet_YPM_1049 to designate them as separate OTUs.

In general such specimen-level OTUs are considered to be valuable to many forms of analysis that might be applied to the resulting metatree. For example, time-scaling (they may be the oldest or youngest members of their larger clade) or biogeography (they may represent some form of spatial extreme, e.g., the only member of the clade known from a specific region). However, they can complicate issues as well. For example, they are not recorded as "proper" taxa in the Paleobiology Database hence their validity must be determined by the user. Especially vexatious here is that the fate of many of these specimens is to be given valid names and hence without vigilant checking by the user the same OTU may be inadvertently included twice, first as a specimen-level OTU before it was named and then under its proper valid name. Thus the option to exclude such taxa (IncludeSpecimenLevelOTUs = FALSE) is offered here. Note that if these do not exist amongst the source data the issue is irrelevant.

Applying constraint(s)

As with any form of phylogenetic inference it may be desirable to specify some form of constraint to restrict the resulting topolog(ies). For example, a molecular scaffold may be desired or multiple data sets might be generated that reflect competing hypotheses of relationships.

In most phylogenetic software a constraint tree is specified as a single tree, but here the constraint must be included in the source data. In other words, it must be expressed as an MRP file and XML file just like any other data set. (To build an MRP file from a tree the user should consult the Tree2MRP function.) This is to limit the many problems that can come from separately specifying a tree, such as ensuring the taxa match up properly. However, it offers additional benefits too. In particular, the MRP encoding means the constraint can represent not just a single tree but a set of trees (e.g., a posterior sample from a Bayesian analysis of molecular data). Thus the result can be limited to a specific set of bipartitions without having to specify a single (consensus) tree that would unintentionally allow relationships not found in the original sample.

Only a single data set can be specified as a constraint, but two options may be used, one of either: BackboneConstraint or MonophylyConstraint. These represent a constraint tree that either allows taxa not included in the constraint to fit anywhere else (backbone) or forcing them to fall outside the constraint (monophyly). Note that if the constraint tree includes all the sampled OTUs then this option is irrelevant - either of BackboneConstraint or MonophylyConstraint could be used and the result would be identical.

Note that the constraint works by simply upweighting the constraint data set's MRP block over the remaining characters.

Weighting source data

A major failing of many formal supertree approaches is to weight all input data equally by default without any consideration of variation in information quality. This is problematic for multiple reasons, but it's also very easily dealt with as most parsimony-based inference software (the intended inference software for the final MRP matrix) can take weighting information as input and apply it accordingly. Here the function considers four possible types of weighting:

  1. Weighting by input values

  2. Weighting by year of publication

  3. Weighting by non-independence

  4. Weighting by data set size

Weighting by input values means applying the weights read in from the original MRP files. If none are found these are considered to be one for every character. In practice such weights could represent, for example, the posterior probability (relative frequency amongst a posterior sample) of each bipartition (MRP character). Alternatively they could simply represent some custom weighting scheme favoured by the user. A word of caution on using such weights, however: as the function runs, taxa can be altered (duplicated, pruned etc.) by the taxonomic reconciliation steps and hence characters can be deleted and thus the weighting potentially (partially) corrupted. The exact effect of this will depend on specific circumstances, but is something the user should be aware of.

Weighting by year of publication seems logical on a basic level: we assume that scientific knowledge improves over time. In other words more recent data sets should be upweighted over older data sets. Lloyd et al. (2016) proposed the following equation (slightly modified here) to capture this:

W_i = 2^(0.5(x_i - t_0))

Where W_i is the weight of the ith data set, x_i is the year of publication of the ith data set, and t_0 is the year of publication of the oldest included data set. In practice this describes a scenario where the weight assigned doubles in value every two years, e.g., a data set from 2000 would be weighted 1/32 the value of a data set from 2010.
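
As a worked illustration, the equation above can be evaluated directly in base R. YearWeight is a hypothetical helper name used only for this sketch, and it returns the raw values of the equation (any later rescaling applied by the function is not shown).

```r
# Hypothetical helper: raw publication year weights, W_i = 2^(0.5 * (x_i - t_0)).
YearWeight <- function(Years) {
  # t_0 is the publication year of the oldest included data set.
  2^(0.5 * (Years - min(Years)))
}

Years <- c(2000, 2004, 2010)
Weights <- YearWeight(Years)
print(Weights)

# Ratio of the 2010 weight to the 2000 weight: 2^(0.5 * 10) = 32.
print(Weights[3] / Weights[1])
```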

Weighting by non-independence is a means of acknowledging the fact that morphological data sets in particular tend not to be independent hypotheses of phylogeny, but rather frequently reuse previously published data sets often with little modification. The function used here can also exclude data sets based on non-independence. For example, if data set A is reused by two later authors (data sets B and C), then typically data set A can be excluded (equivalent of weighting zero). However, data sets B and C have equal claim to data set A's "weight" and hence should be weighted half each. Note that this process is more complex as branching trees of successive reuse emerge, making multiple data sets redundant and sharing the overall initial data set weight over sometimes a very large number of "descendant" data sets. This complexity is automatically handled by the function.

Weighting by data set size is complex. Generally speaking, source data sets with more OTUs will lead to more MRP characters and hence greater influence over the resulting super- or metatree. This in and of itself is not strictly a problem that requires intervention as more comprehensive data sets are generally of higher value, increasing both taxonomic overlap and total information. However, the metatree approach complicates this issue as it can take samples of trees as input and hence MRP characters can grow not just with the number of OTUs but with the amount of phylogenetic uncertainty in the source data set. Here intervention is required as otherwise uncertain data sets will have undue influence, especially as variation within subclades will effectively reinforce the existence of that subclade in the first place (the concern that prompted Purvis' MRP modification; Purvis 1995). This issue is dealt with here by searching for characters within data sets that conflict and downweighting these such that they sum to one (the weight assigned to characters that do not conflict). If there is no conflict (all characters are congruent) then they will all be weighted one.
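
One way to picture the conflict-based downweighting is sketched below in base R, using the small MRP matrix from the NEXUS example earlier. This is an assumption-laden illustration, not the function's own algorithm: it treats two MRP characters as conflicting when their clades overlap without one being nested inside the other, weights each character by 1 / (1 + number of characters it conflicts with), and ignores missing data.

```r
# MRP matrix from the NEXUS example (rows are OTUs, columns are MRP characters).
MRP <- matrix(
  c(0, 0, 0,
    0, 1, 1,
    1, 0, 1,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Ancilla", "Turrancilla", "Ancillista", "Amalda"), NULL)
)

# Assumed conflict rule: clades share members but neither is nested in the other.
CladesConflict <- function(x, y) {
  any(x == 1 & y == 1) && any(x == 1 & y == 0) && any(x == 0 & y == 1)
}

NCharacters <- ncol(MRP)
ConflictWeights <- vapply(seq_len(NCharacters), function(i) {
  # Flag every other character that conflicts with character i.
  Conflicting <- vapply(seq_len(NCharacters), function(j) {
    i != j && CladesConflict(MRP[, i], MRP[, j])
  }, logical(1))
  1 / (1 + sum(Conflicting))
}, numeric(1))
print(ConflictWeights)
```

In this toy case the first two characters conflict with each other and so share a total weight of one, whereas the third character is congruent with both and keeps a weight of one.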

Each of the four different weightings will initially be weighted on a zero to one scale, although in practice nothing will be weighted zero as this effectively excludes it from the analysis (except for the data sets made redundant through non-independence). However, this does not mean all different types of weighting will, or should, be applied in the final analysis. For example, if a priori weights are preferred then the other three types will presumably want to be ignored. Similarly, even if including two or more types of weighting it may be that the user will prefer to rely on one more than others. Both what to include and what to favour can be captured with the RelativeWeights option, which is simply a vector of four numbers corresponding to the four types of weight listed above and in that order. Thus to only include a priori weights everything else would be set to zero (RelativeWeights = c(1, 0, 0, 0)). Note that here the default (RelativeWeights = c(1, 1, 1, 1)) is not the recommended option. However, if wanting to account for both non-independence and year of publication but favour the year of publication a larger weight could be used for the latter (e.g., RelativeWeights = c(0, 10, 1, 0)).

If using two or more weights another consideration is how these should be combined in order to produce a single weight for each character. Lloyd et al. (2016) used the product, but here the sum is also offered as an option. This choice is made using the WeightCombination variable, with the sum as the default. Importantly, using the product and setting some relative weights to zero does not mean the result will be zero (instead the zero values are excluded).
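
The interaction between RelativeWeights and WeightCombination can be sketched as follows. This is a hedged illustration in base R: CombineWeights and the toy weight table are hypothetical, and the sketch assumes each weight type is simply multiplied by its relative weight before being combined, which may differ in detail from the function's internal calculation.

```r
# Hypothetical helper: combine four weight types into one weight per character.
CombineWeights <- function(WeightTable, RelativeWeights, WeightCombination = "sum") {
  # Assumed scaling: multiply each column (weight type) by its relative weight.
  Scaled <- sweep(WeightTable, 2, RelativeWeights, "*")
  if (WeightCombination == "sum") return(rowSums(Scaled))
  if (WeightCombination == "product") {
    # Weight types with a relative weight of zero are dropped, not multiplied in.
    Used <- RelativeWeights != 0
    return(apply(Scaled[, Used, drop = FALSE], 1, prod))
  }
  stop("WeightCombination must be one of \"product\" or \"sum\".")
}

# Toy weights for two characters (columns follow the order of RelativeWeights).
WeightTable <- matrix(
  c(1, 0.5, 0.25, 1,
    1, 1,   0.5,  0.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Character_1", "Character_2"),
    c("Input", "Year", "Dependence", "Conflict")
  )
)

SumWeights <- CombineWeights(WeightTable, c(1, 1, 1, 1), "sum")
ProductWeights <- CombineWeights(WeightTable, c(1, 0, 0, 0), "product")
print(SumWeights)
print(ProductWeights)
```

With RelativeWeights = c(1, 0, 0, 0) the product is taken over the input weights alone, so the result is not forced to zero by the three excluded weight types.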

Final weights may still differ from what the user expects, however, and this is because of two factors: 1) consideration of taxonomic information in inferring relationships and 2) weight limits set by TNT (Goloboff and Catalano 2016), the assumed inference software.

Following Lloyd et al. (2016) the metatree approach includes an additional "hidden" source tree based on the current Paleobiology Database taxonomic hierarchy. This is to provide some minimal information on where taxa go in the resulting tree, apply the missing species option (described above) and potentially break ties where no other information is available. However, unlike other approaches this is not a constraint and more generally phylogenetic source data is allowed to overrule taxonomy. This is achieved by setting the weight of all taxonomic MRP characters to one and starting all phylogenetic characters at a weight of 10 (i.e., minimally an order of magnitude higher than the taxonomic characters).

As the resulting MRP matrix is meant for parsimony analysis the most obvious destination software is TNT, which is optimised for fast searching of large data sets, but sets restrictions on the range of weights. Specifically, TNT can only accept input weights between 0.5 and 1000 and at a maximum precision of two decimal places (0.50-1000.00 total range). Thus weights between 10 and whatever the maximum value is are rescaled to fall on to a 10.00 to 1000.00 scale (maintaining the lower weights for taxonomy). (Lloyd et al. 2016 explored a means to extend this by using more states for each MRP character, effectively increasing the weight of an individual character from 0.50 to 31000.00. However, in practice using more character states causes a dramatic slowdown in performance in TNT and so that option is not implemented here.)
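
The rescaling step can be illustrated with a short base R sketch. RescaleWeights is a hypothetical helper, and a simple linear rescaling is assumed here purely for illustration; the source text does not state the exact transformation the function applies.

```r
# Hypothetical helper: map weights linearly on to a 10.00 to 1000.00 scale.
RescaleWeights <- function(Weights, Minimum = 10, Maximum = 1000) {
  Spread <- diff(range(Weights))
  # Assumption: if all weights are equal, place them all at the minimum.
  if (Spread == 0) return(rep(Minimum, length(Weights)))
  Scaled <- Minimum + (Weights - min(Weights)) / Spread * (Maximum - Minimum)
  # TNT accepts at most two decimal places.
  round(Scaled, 2)
}

PhylogeneticWeights <- RescaleWeights(c(10, 25, 40))
print(PhylogeneticWeights)
```

Taxonomic characters would keep their lower weight of one, so every phylogenetic character remains at least an order of magnitude heavier after rescaling.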

Taxonomy-phylogeny contradictions and "chunking"

In an ideal scenario taxonomic and phylogenetic hierarchies will be perfectly congruent (i.e., all taxa will be monophyletic), making the generation of metatrees a simple and fast affair. However, contradictions will arise when input phylogenies incorporate (for example) paraphyletic groups as supraspecific OTUs or the currently favoured Paleobiology Database opinion for a supraspecific taxon is non-monophyletic.

These conflicts may be correctable, either in the metadata (editing the XML file(s)) or the database (entering new data into the Paleobiology Database). However, before either can occur the user must be aware of the issues and this is not easily done through manual inspection of the data. For this purpose a series of checks are made to find any contradictions. This can be ignored by the user (ReportContradictionsToScreen = FALSE) or output to the screen for manual checking later (ReportContradictionsToScreen = TRUE).

Ultimately this information will also be used for another purpose: "chunking" the analysis. As metatrees get larger in size, moving from hundreds to thousands of tips, inference becomes computationally and temporally more expensive. However, given the reliance on taxonomy as an input tree if some clades are found to be monophyletic (no contradictions between taxonomy and phylogeny) then they can safely be broken out into separate analyses, or "chunks". Critically, this can dramatically speed up inference as it is much easier to infer optimal topologies for, say, ten analyses of 100 tips, than one analysis of 1000 tips.

Although this chunking isn't yet implemented automatically it can be done manually by the user using the MonophyleticTaxa output (see below) and the TargetClade and HigherTaxaToCollapse input options. As an example, let's imagine we are interested in generating a metatree of Pseudosuchia (crocodile-line archosaurs). We can initially set this as our target clade and run the metatree function. Once this is complete we can check the monophyletic taxa and might find a large subclade is monophyletic (e.g., Eusuchia, the crocodile crown-group). We can save some time in inference if we then use this information to generate two separate metatrees: 1) a Eusuchia tree (using this as the target clade) and, 2) a Pseudosuchian tree with the Eusuchian portion collapsed to a single tip.

Note that in practice large subclades may not be monophyletic and although a similar manual "chunking" could still be forced it is not recommended as it would compromise the results.
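
The manual chunking described above might be set up along the following lines. This is a sketch only: the directory paths are hypothetical placeholders, and the calls are only made if a Metatree function is available in the current session (they also require the input files and access to the Paleobiology Database).

```r
# Hypothetical directory paths; replace with the real locations of the input files.
SharedArguments <- list(
  MRPDirectory = "~/metatree/mrp",
  XMLDirectory = "~/metatree/xml"
)

# Chunk 1: the monophyletic subclade analysed on its own.
EusuchiaArguments <- c(SharedArguments, list(TargetClade = "Eusuchia"))

# Chunk 2: the wider clade with that subclade collapsed to a single tip.
PseudosuchiaArguments <- c(
  SharedArguments,
  list(TargetClade = "Pseudosuchia", HigherTaxaToCollapse = c("Eusuchia"))
)

# Only run the function if it is available in this session.
if (exists("Metatree", mode = "function")) {
  EusuchiaMetatree <- do.call(Metatree, EusuchiaArguments)
  PseudosuchiaMetatree <- do.call(Metatree, PseudosuchiaArguments)
}
```

The collapsed tip would be expected to appear as "EUSUCHIA" in the second tree, marking where the first tree can later be grafted back in.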

Operational steps

In practice the function operates in a linear fashion, taking the data through a multi-step process and associated checks, and informing the user with messages as it goes. It can be slow (minutes to hours) depending on the data set size and options applied, but primarily because it makes multiple calls to the Paleobiology Database API as it goes. These are essential to the operation of the function and are not easy to speed up so the user should factor in run time to their pipelines.

Users will also typically find the first time they run it on their data multiple errors will be found. However, these are typically caught by informative messages and easily fixed by modifying the input data. In other words, human error is pretty much inevitable and typographic errors in reconciled names (for example) will usually be found somewhere. However, these are critical to fix to avoid confounded output data. (It is always true that the worst errors are generated by code that doesn't produce any.)

[BELOW IS UNFINISHED]

Need to add:

- Linear list of operations of function

- Output explanations

Value

FullMRPMatrix

The full MRP matrix in the same format imported by read_nexus_matrix. This can be written out to a file in either NEXUS or TNT format using write_nexus_matrix or write_tnt_matrix.

STRMRPMatrix

The safe taxonomic reduction MRP matrix in the same format imported by read_nexus_matrix. This can be written out to a file in either NEXUS or TNT format using write_nexus_matrix or write_tnt_matrix. This is the matrix recommended for analysis as it will be smaller and faster than the full version and taxa can still be reinserted later using safe_taxonomic_reinsertion.

TaxonomyTree

The taxonomic hierarchy of the included taxa presented as an ape "phylo" object, with supraspecific taxa as node labels.

MonophyleticTaxa

A vector of taxa which can be considered monophyletic (no phylogenetic data in the sample contradicts the existence of these clades). The intended use of this is to identify smaller subsets of the data that can be analysed separately to "chunk" the metatree process into smaller, faster parts.
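
To illustrate the intended use, the sketch below splits a species list into chunks by monophyletic taxon. Everything in it is hypothetical: monophyletic_taxa stands in for the MonophyleticTaxa output, and the membership table is an invented lookup from species to higher taxon that a user would have to assemble themselves (for example from TaxonomyTree).

```r
# Hypothetical stand-in for the MonophyleticTaxa element of the output.
monophyletic_taxa <- c("CladeA", "CladeB")

# Hypothetical species-to-higher-taxon lookup; not produced by Metatree().
membership <- data.frame(
  species = c("sp1", "sp2", "sp3", "sp4", "sp5"),
  higher_taxon = c("CladeA", "CladeA", "CladeB", "CladeC", "CladeB"),
  stringsAsFactors = FALSE
)

# Species in a taxon that no source data contradict can form their own chunk.
in_safe_clade <- membership$higher_taxon %in% monophyletic_taxa
chunks <- split(
  membership$species[in_safe_clade],
  membership$higher_taxon[in_safe_clade]
)

# Everything else stays together in the main analysis.
remainder <- membership$species[!in_safe_clade]

print(chunks)
print(remainder)
```

Here CladeC is not listed as monophyletic, so its species are left in the remainder rather than forced into a chunk.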

SafelyRemovedTaxa

The results of the safe taxonomic reduction. This is the $str.list part of the output of safe_taxonomic_reduction and can be used to reinsert taxa later with the safe_taxonomic_reinsertion function.

RemovedSourceData

A vector of source data removed throughout the Metatree function. Note that currently the function does not distinguish between the reasons for removal (e.g., too many invalid taxa, too few taxa, redundant through non-independence, removed through the veil year process etc.). Importantly, it is therefore not safe to remove these data sets from the input, as they may still be contributing to the non-independence information.

VeilYear

The veil year applied (i.e., only data sets this age or younger are included in the output).
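
The comparison implied here can be sketched as follows. This is illustrative only: the veil year value is hypothetical, and the publication year is parsed from the data set name purely as an assumption for the example, which is not necessarily how Metatree obtains it. "In press" names carry no year and are treated here as the youngest data, which is also an assumption.

```r
# Data set names following the Author_YEARletter pattern used in the Examples.
data_sets <- c("Caldwell_1996a", "Gauthier_etal_1988b", "Young_2003a", "Moon_inpressa")

# Hypothetical veil year, for illustration only.
veil_year <- 1990L

# Parse a four-digit year from the name where one is present (assumption:
# the name encodes the publication year); otherwise leave it as NA.
year_pattern <- "^.*_([0-9]{4})[a-z]*$"
has_year <- grepl(year_pattern, data_sets)
years <- rep(NA_integer_, length(data_sets))
years[has_year] <- as.integer(sub(year_pattern, "\\1", data_sets[has_year]))

# Keep data sets published in the veil year or later; names without a
# year ("in press") are assumed to be the youngest and are kept.
kept <- data_sets[is.na(years) | years >= veil_year]

print(kept)
```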

DataSetDependenceWeights

Table of data set weightings used.

CharacterWeights

Table of character weightings used.

Author(s)

Graeme T. Lloyd graemetlloyd@gmail.com

References

Baum, B. R., 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon, 41, 3-10.

Goloboff, P. A. and Catalano, S. A., 2016. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics, 32, 221-238.

Hill, J. and Davis, K. E., 2014. The Supertree Toolkit 2: a new and improved software package with a Graphical User Interface for supertree construction. Biodiversity Data Journal, 2, e1053.

Lloyd, G. T., Davis, K. E., Pisani, D., Tarver, J. E., Ruta, M., Sakamoto, M., Hone, D. W. E., Jennings, R. and Benton, M. J., 2008. Dinosaurs and the Cretaceous Terrestrial Revolution. Proceedings of the Royal Society B, 275, 2483-2490.

Lloyd, G. T., Bapst, D. W., Friedman, M. and Davis, K. E., 2016. Probabilistic divergence time estimation without branch lengths: dating the origins of dinosaurs, avian flight, and crown birds. Biology Letters, 12, 20160609.

Maddison, D. R., Swofford, D. L. and Maddison, W. P., 1997. NEXUS: an extensible file format for systematic information. Systematic Biology, 46, 590-621.

Purvis, A., 1995. A modification to Baum and Ragan's method for combining phylogenetic trees. Systematic Biology, 44, 251-255.

Ragan, M., 1992. Phylogenetic inference based on matrix representation of trees. Molecular Phylogenetics and Evolution, 1, 113-126.

Examples

# Local test for Ichthyopterygia:
#Metatree(MRPDirectory = "/Users/eargtl/Documents/Homepage/www.graemetlloyd.com/mrp",
#  XMLDirectory = "/Users/eargtl/Documents/Homepage/www.graemetlloyd.com/xml",
#  TargetClade = "Ichthyopterygia", InclusiveDataList = sort(c(GetFilesForClade("matricht.html"),
#  "Bickelmann_etal_2009a", "Caldwell_1996a", "Chen_etal_2014ba", "Chen_etal_2014bb",
#  "deBraga_et_Rieppel_1997a", "Gauthier_etal_1988b", "Laurin_et_Reisz_1995a", "Muller_2004a",
#  "Reisz_etal_2011a", "Rieppel_et_Reisz_1999a", "Rieppel_et_deBraga_1996a", "Young_2003a")),
#  ExclusiveDataList = c("Averianov_inpressa", "Bravo_et_Gaete_2015a", "Brocklehurst_etal_2013a",
#  "Brocklehurst_etal_2015aa", "Brocklehurst_etal_2015ab", "Brocklehurst_etal_2015ac",
#  "Brocklehurst_etal_2015ad", "Brocklehurst_etal_2015ae", "Brocklehurst_etal_2015af",
#  "Bronzati_etal_2012a", "Bronzati_etal_2015ab", "Brusatte_etal_2009ba", "Campbell_etal_2016ab",
#  "Carr_et_Williamson_2004a", "Carr_etal_2017ab", "Frederickson_et_Tumarkin-Deratzian_2014aa",
#  "Frederickson_et_Tumarkin-Deratzian_2014ab", "Frederickson_et_Tumarkin-Deratzian_2014ac",
#  "Frederickson_et_Tumarkin-Deratzian_2014ad", "Garcia_etal_2006a", "Gatesy_etal_2004ab",
#  "Grellet-Tinner_2006a", "Grellet-Tinner_et_Chiappe_2004a", "Grellet-Tinner_et_Makovicky_2006a",
#  "Knoll_2008a", "Kurochkin_1996a", "Lopez-Martinez_et_Vicens_2012a", "Lu_etal_2014aa",
#  "Norden_etal_inpressa", "Pisani_etal_2002a", "Ruiz-Omenaca_etal_1997a", "Ruta_etal_2003ba",
#  "Ruta_etal_2003bb", "Ruta_etal_2007a", "Selles_et_Galobart_2016a", "Sereno_1993a", "Sidor_2001a",
#  "Skutschas_etal_inpressa", "Tanaka_etal_2011a", "Toljagic_et_Butler_2013a",
#  "Tsuihiji_etal_2011aa", "Varricchio_et_Jackson_2004a", "Vila_etal_2017a", "Wilson_2005aa",
#  "Wilson_2005ab", "Zelenitsky_et_Therrien_2008a"), MissingSpecies = "exclude",
#  BackboneConstraint = "Moon_inpressa", RelativeWeights = c(0, 100, 10, 1))

graemetlloyd/metatree documentation built on April 29, 2021, 2:32 a.m.