WFO.match: Standardize plant names according to World Flora Online...

View source: R/WFO.match.R

WFO.matchR Documentation

Standardize plant names according to World Flora Online taxonomic backbone

Description

This package checks a list of taxa (typically species) against the World Flora Online (WFO) taxonomic backbone. The user needs to first download a static copy of the Taxonomic Backbone data from https://www.worldfloraonline.org or https://zenodo.org/doi/10.5281/zenodo.7460141 (_DwC_backbone_R.zip).

Usage

    WFO.match(spec.data = NULL, WFO.file = NULL, WFO.data = NULL,
        no.dates = TRUE,
        spec.name = "spec.name", Genus = "Genus", Species = "Species",
        Infraspecific.rank = "Infraspecific.rank", Infraspecific = "Infraspecific",
        Authorship = "Authorship", First.dist = FALSE,
        acceptedNameUsageID.match = TRUE,
        Fuzzy = 0.1, Fuzzy.force = FALSE, Fuzzy.max = 250, Fuzzy.min = TRUE,
        Fuzzy.shortest = FALSE, Fuzzy.within = FALSE,
        Fuzzy.two = TRUE, Fuzzy.one = TRUE,
        squish = TRUE,
        spec.name.tolower = FALSE, spec.name.nonumber = TRUE, spec.name.nobrackets = TRUE,
        exclude.infraspecific = FALSE,
        infraspecific.excluded = c("cultivar.", "f.", "sect.", "subf.", "subg.",
            "subsp.", "subvar.", "var", "var.", "[infraspec.]", "fo.", "forma",
            "nothosubsp.", "nothovar.", "sect."),
        spec.name.sub = TRUE,
        sub.pattern=c(" sp[.] A", " sp[.] B", " sp[.] C", " sp[.]", " spp[.]", " pl[.]",
            " indet[.]", " ind[.]", " gen[.]", " g[.]", " fam[.]", " nov[.]", " prox[.]",
            " cf[.]", " aff[.]", " s[.]s[.]", " s[.]l[.]",
            " p[.]p[.]", " p[.] p[.]", "[?]", " inc[.]", " stet[.]", "Ca[.]",
            "nom[.] cons[.]", "nom[.] dub[.]", " nom[.] err[.]", " nom[.] illeg[.]",
            " nom[.] inval[.]", " nom[.] nov[.]", " nom[.] nud[.]", " nom[.] obl[.]",
            " nom[.] prot[.]", " nom[.] rej[.]", " nom[.] supp[.]", " sensu auct[.]"),
        verbose = TRUE, counter = 1000)

    WFO.url(WFO.result = NULL, browse = FALSE, browse.rows = c(1:1), ...)

    WFO.one(WFO.result = NULL, priority = "Accepted",
        spec.name = NULL, Auth.dist = NULL, Old.author.dist=NULL, First.dist = NULL,
        verbose = TRUE, counter = 1000)

    WFO.browse(taxon, WFO.file = NULL, WFO.data = NULL,
        accepted.only = FALSE, acceptedNameUsageID.match = TRUE, ...)

    WFO.synonyms(taxon, WFO.file = NULL, WFO.data = NULL, ...)

    WFO.family(taxon, WFO.file = NULL, WFO.data = NULL, ...)

Arguments

spec.data

A data.frame containing variables with species names. In case that a character vector is provided, then this vector will be converted to a data.frame

WFO.file

File name of the static copy of the Taxonomic Backbone. If not NULL, then data will be reloaded from this file.

WFO.data

Data set with the static copy of the Taxonomic Backbone. Ignored if WFO.file is not NULL.

no.dates

Speeding up the loading of the WFO.data by not loading fields of 'created' and 'modified'.

spec.name

Name of the column with taxonomic names. In case that a spec.name is provided, then separate genus and species names will be ignored. For function WFO.one, giving the name for this columns results in copying a submitted but unmatched plant name into the scientificName of the results.

Genus

Name of the column with the genus names.

Species

Name of the column with the species names.

Infraspecific.rank

Name of the column with the infraspecific rank (such as "subsp.", "var." or "cultivar.").

Infraspecific

Name of the column with the infraspecific names.

Authorship

Name of the column with the naming authorities.

First.dist

If TRUE, then calculate the fuzzy distance between the first words of the submitted and matched names (these are typically the genus names) .

acceptedNameUsageID.match

If TRUE, obtain the accepted name and others details from the earlier acceptedNameUsageID.

Fuzzy

If larger than 0, then attempt fuzzy matching in case an identifical taxonomic name is not found in the World Flora Online. This argument will be used as argument max.distance in the internally called agrep. Note that fuzzy matching is only possible for the spec.name.

Fuzzy.force

If TRUE, always use the fuzzy matching algorithm, even when the spec.name was matched exactly.

Fuzzy.max

Maximum number of fuzzy matches.

Fuzzy.min

If TRUE, limit the matching of names to those with the smallest Levenshtein distance, calculated via adist.

Fuzzy.shortest

If TRUE, limit the matching of names to those with the most similar length of characters (this feature is expected to eliminate matches at infraspecific levels, see examples).

Fuzzy.within

If TRUE, limit the matching of names to those that contain exactly the submitted plant name (this feature is expected to be useful when submitting plant names that only contain a subset of the first characters of the species name, in order to check for best matches manually afterwards).

Fuzzy.two

If TRUE, in case that there were no fuzzy matches, limit the terms to be matched to the first two (these are expected to be genus and species names).

Fuzzy.one

If TRUE, in case that there were no fuzzy matches, limit the terms to be matched to the first one (expected to be the genus name).

squish

If TRUE, remove repeated whitespace and white space from the start and end of the submitted full name via str_squish.

spec.name.tolower

If TRUE, then convert all characters of the spec.name to lower case via tolower.

spec.name.nonumber

If TRUE, then submitted spec.name that contain numbers will be interpreted as genera, only matching the first word.

spec.name.nobrackets

If TRUE, then submitted spec.name then sections of the submitted name after '(' will be removed. Note that this will also remove sections after ')', such as authorities for plant names that are in a separate column of WFO.

exclude.infraspecific

If TRUE, then exclude records that contain the infraspecific levels defined by infraspecific.excluded.

infraspecific.excluded

Infraspecific levels (available from column 'verbatimTaxonRank') excluded in the results. Note that levels are excluded both in direct matches and matches with the accepted name.

spec.name.sub

If TRUE, then delete sections of the spec.name that match the sub.pattern.

sub.pattern

Sections of the spec.name to be deleted

verbose

Give details on the fuzzy matching process.

counter

Progress on the matching process is reported by multiples of this counter.

WFO.result

Result obtained via WFO.match.

browse

If TRUE, then browse urls specified by browse.rows.

browse.rows

Indices of row with the urls to be browsed.

priority

Method of selecting the 1-to-1 matches. Option Accepted first limits candidates to accepted names, with a possible second step of eliminating accepted names that are synonyms. Option Synonym first limits candidates to those that are not synonyms, with a possible second step of eliminating names that are not accepted. When the number of matches is larger than one after these steps, a third algorithm picks the candidate with the smallest taxonID.

Auth.dist

In case that the name of the variable with the Levenshtein distance between the authorship names is provided, then the algorithm first prioritizes records with the best match between the submitted and matched author names.

Old.author.dist

In case that the name of the variable with the Levenshtein distance between the authorship names for the synonym matches is provided, then the algorithm first prioritizes records with the best match between the submitted and matched author names.

taxon

Character string with the name of the taxon for which information will be given (for families, different genera; for genera, different specieds; for species, infraspecific levels).

accepted.only

If TRUE, then only provide taxa with accepted names.

...

Other arguments for browseURL (WFO.url) or WFO.match (WFO.browse).

Details

The principal function (WFO.match) matches plant names. Columns retrieved from the World Flora Online are added to the provided input data.frame. In case that there are multiple matches, then rows from the input data.frame are repeated.

Column 'Unique' shows whether there was a unique match (or not match) in the WFO.

Column 'Matched' shows whether there was a match in the WFO.

Column 'Fuzzy' shows whether matching was done by the fuzzy method.

Column 'Fuzzy.dist' gives the Levenshtein distance calculated between submitted and matched plant names adist.

Column 'Auth.dist' gives the Levenshtein distance calculated between submitted and matched authorship names, if the former were provided adist.

Column 'Subseq' gives different numbers for different matches for the same plant name.

Column 'Hybrid' shows whether there was a hybrid character in the scientificName.

Column 'New.accepted' shows whether the species details correspond to the current accepted name.

Column 'Old.status' gives the taxonomic status of the first match with the non-blank acceptedNameUsageID.

Column 'Old.ID' gives the ID of the first match with the non-blank acceptedNameUsageID.

Column 'Old.name' gives the name of the first match with the non-blank acceptedNameUsageID.

The function was inspired on the Taxonstand package that matches plant names against The Plant List. Note that The Plant List has been static since 2013, but was used as the starting point for the Taxonomic Backbone of the World Flora Online.

Function WFO.one finds one unique matching name for each submitted name. Via priority = "Accepted", it first limits candidates to accepted names, with a possible second step of eliminating accepted names that are synonyms. Via priority = "Synonym", it first limits candidates to those that are not synonyms, with a possible second step of eliminating names that are not accepted. When the number of matches is larger than one after these steps, a third algorithm picks the candidate with the smallest taxonID. When a spec.name is given to WFO.one, the original submitted name is inserted for the scientificName.

When the user specifies the column with the Auth.dist, documenting the Levenshtein distance between the submitted and matched authorities, then WFO.one first prioritizes records with best match between Authorities.

Function WFO.browse lists all the genera for a family, all species for a genus or all infraspecific levels for a species.

Function WFO.synonyms gives all records with the acceptedNameUsageID equal to the matched accepted species shown in the first row.

Function WFO.family provides information on the order of vascular plants, based on information available from vascular.families. Based on an internal list of bryophyte families, when the submitted plant name is a bryophyte, the function returns 'bryophyte' instead.

Value

The main function returns a data.set with the matched species details from the WFO.

Author(s)

Roeland Kindt (World Agroforestry, CIFOR-ICRAF)

References

World Flora Online. An Online Flora of All Known Plants. https://www.worldfloraonline.org

Sigovini M, Keppel E, Tagliapietra. 2016. Open Nomenclature in the biodiversity era. Methods in Ecology and Evolution 7: 1217-1225.

Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388

See Also

WFO.match.fuzzyjoin

Examples


data(WFO.example)

spec.test <- data.frame(spec.name=c("Faidherbia albida", "Acacia albida",
    "Omalanthus populneus", "Pygeum afric"))

WFO.match(spec.data=spec.test, WFO.data=WFO.example, counter=1, verbose=TRUE)

# Also calculate the Levenshtein distance for the genus
WFO.match(spec.data=spec.test, WFO.data=WFO.example, First.dist=TRUE,
    counter=1, verbose=TRUE)

# Show all the fuzzy matches, which included those at infraspecifc level
e1 <- WFO.match(spec.data=spec.test, WFO.data=WFO.example, counter=1,
    Fuzzy.min=FALSE, Fuzzy.shortest=FALSE, verbose=TRUE)
e1

# Use function WFO.one for a 1-to-1 match between submitted and matched names
WFO.one(e1)

# Hybrid species
WFO.match("Arabis divaricarpa", WFO.data=WFO.example)
WFO.match("Arabis x divaricarpa", WFO.data=WFO.example)

# Convert capitals to lower case
WFO.match("FAIDHERBIA ALBIDA", WFO.data=WFO.example, spec.name.tolower=TRUE)

# Remove sections of plant names that are equal to ' sp.' or ' indet. '
WFO.match("Prunus sp.", WFO.data=WFO.example, spec.name.sub=TRUE)

# Get urls, but do not open any
e2 <- WFO.match(spec.data=spec.test, WFO.data=WFO.example, counter=1, verbose=TRUE)
WFO.url(e2, browse=FALSE, browse.rows=c(1:nrow(e2)))

# Include input species names where no matches were found
# This happens when the name with original species names is provided to WFO.one
x1 <- WFO.match("World agroforestry", WFO.data=WFO.example)
WFO.one(x1, spec.name="spec.name")

## Not run: 

# Cross-check with Taxonstand results
library(Taxonstand)
data(bryophytes)

# Give the file with the static copy of the Taxonomic Backbone data ('classification.txt')
# that was downloaded from \url{https://www.worldfloraonline.org/downloadData}.
# Possibly first use unzip(file.choose()) for the downloaded WFO_Backbone.zip
WFO.file.RK <- file.choose()

# check species name
w1 <- WFO.match(bryophytes[1:20, ], WFO.file=WFO.file.RK, spec.name="Full.name", counter=1)
w1


# check species name from list of names
w1 <- WFO.match(bryophytes$Full.name[1:20], WFO.file=WFO.file.RK, counter=1)

# re-check species names obtained via Taxonstand
# note that Taxonstand did not match some infraspecific names ('Higher.level')
r1 <- Taxonstand::TPL(bryophytes$Full.name[1:20], corr = TRUE)
w2 <- WFO.match(r1, WFO.file=WFO.file.RK, Genus="New.Genus", Species="New.Species",
          Infraspecific.rank="New.Infraspecific.rank", Infraspecific="New.Infraspecific", counter=1)
w2

# only check genus and species
# specify different names for infraspecific columns as default to Taxonstand
w3 <- WFO.match(r1, WFO.file=WFO.file.RK, Genus="New.Genus", Species="New.Species",
          Infraspecific.rank="none", Infraspecific="none", counter=1)

# note that the method above also retrieved infraspecific levels
# to only retrieve at the species level, match infraspecific levels with an empty column
r1$empty <- rep("", nrow(r1))
w4 <- WFO.match(r1, WFO.file=WFO.file.RK, Genus="New.Genus", Species="New.Species",
          Infraspecific.rank="empty", Infraspecific="empty", counter=1)

# as an alternative to the method above, exclude all documented infraspecific levels
# from the results
w5 <- WFO.match(r1, WFO.file=WFO.file.RK, Genus="New.Genus", Species="New.Species",
          exclude.infraspecific=TRUE, counter=1)

# save results to file
# utils::write.table(w4, quote=F, sep="\t", row.names=F, append=FALSE)

# limit the fuzzy matches to those that contain a shortened version of a species name
w6 <- WFO.match("Acacia caes", WFO.file=WFO.file.RK, Fuzzy=0.01, Fuzzy.within=TRUE, verbose=TRUE)

# show all the matches for a genus
spec.test1 <- data.frame(Genus=c("Casimiroa"))
w8 <- WFO.match(spec.test1, WFO.file=WFO.file.RK, exclude.infraspecific=TRUE, verbose=TRUE)

# show all listings at a next hierarchical level
WFO.data1 <- data.table::fread(WFO.file.RK, encoding="UTF-8")

WFO.browse("Pinaceae", WFO.data=WFO.data1)
WFO.browse("Pinaceae", WFO.data=WFO.data1, accepted.only=T)

WFO.browse("Tsuga", WFO.data=WFO.data1)
WFO.browse("Tsuga", WFO.data=WFO.data1, accepted.only=T)

WFO.browse("Olea europaea", WFO.data=WFO.data1)
WFO.browse("Olea europaea", WFO.data=WFO.data1, accepted.only=T)

# browsing only works at family, genus and species levels
# for orders, however, information is given from vascular.families
WFO.browse("Polypodiales", WFO.data=WFO.data1)

# submitting no name results in a list of all families
WFO.browse(, WFO.data=WFO.data1)

# give synonyms
WFO.synonyms("Olea europaea", WFO.data=WFO.data1)

# give order and other higher levels from family
WFO.family("Olea europaea", WFO.data=WFO.data1)

## End(Not run)

WorldFlora documentation built on Sept. 11, 2024, 8:24 p.m.