knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)

hasData <- requireNamespace("LCVP", quietly = TRUE) 

if (!hasData) {
    knitr::opts_chunk$set(eval = FALSE)
    msg <- paste("Note: Examples in this vignette require that the", 
                 "`LCVP` package be installed. The system",
                 "currently running this vignette does not have that package",
                 "installed, so code examples will not be evaluated.")
    msg <- paste(strwrap(msg), collapse = "\n")
    message(msg)
}

Background

When comparing or merging any list of plant species, for instance for compiling regional species lists, merging occurrence lists with trait data or phylogenies, the taxonomic names must be matched to avoid artificial inflation due to synonyms, data loss or erroneous matches due to homonyms. The lcvplants package facilitates this process by automatizing large-scale taxonomic harmonization of plant names by fuzzy matching and synonymy resolution against the Leipzig Catalogue of Vascular Plants as taxonomic backbone.

The Leipzig Catalogue of Vascular Plants (LCVP) is a novel global taxonomic backbone, updating The Plant List, comprising more than 1,300,000 names and 350,000 accepted taxa names. We described the LCVP in detail in the related scientific publication (Freiberg et al, 2020).

Running lcvplants

Installation

You can install lcvplants from github using the devtools package You may need to install devtools first, so if you do not have it installed, run the following code.

install.packages("devtools")

To use lcvplants you also need the data of the LCVP package, which you can install in the same way.

devtools::install_github("idiv-biodiversity/lcvplants")
devtools::install_github("idiv-biodiversity/LCVP")
library(lcvplants)

Taxonomic resolution

The basic function in lcvplants is lcvp_search. It allows users to input a list of species name that will be matched to the names in the LCVP data. It returns a data.frame indicating the full name matched — including the authority name and removing possible orthographic errors —, the taxonomic status, the accepted name, family and order.

lcvp_search("Hibiscus vitifolius")
knitr::kable(lcvp_search("Hibiscus vitifolius"))

The lcvp_search algorithm will first try to exactly match the binomial names provided by the user. If no match is found, it will try to find the closest name given the maximal transformation cost defined in the argument max_distance (see ?base::agrep for more details). When the argument show_correct = TRUE a column is added to the final result indicating whether the binomial name was exactly matched (TRUE), or if it is misspelled (FALSE).

lcvp_search("Hibiscus vitifoliuse", 
            max_distance = 0.1, 
            show_correct = TRUE)
knitr::kable(lcvp_search("Hibiscus vitifoliuse", 
                         max_distance = 0.1, 
                         show_correct = TRUE))

If more than one name is fuzzy matched (more than one closest match found), only the accepted or the first name will be returned. The function lcvp_fuzzy_search can be used to return all the closest results when argument keep_closest = TRUE, or all names within the max_distance defined when keep_closest = FALSE.

lcvp_fuzzy_search("Hibiscus vitifoliuse", 
                  max_distance = 0.1, 
                  keep_closest = TRUE)
knitr::kable(lcvp_fuzzy_search("Hibiscus vitifoliuse",
                               max_distance = 0.1, 
                               keep_closest = TRUE))

Both lcvp_search and lcvp_fuzzy_search allows multiple species search. If you need to run for many species consider setting the argument progress_bar = TRUE to track the execution progress.

splist <- c(
  "Hibiscus abelmoschus var. betulifolius Mast.",
  "Hibiscus abutiloides Willd.",
  "Hibiscus aculeatus",
  "Hibiscus acuminatus",
  "Hibiscus furcatuis" 
)
x <- lcvp_search(splist, max_distance = 0.2, show_correct = TRUE)
x
knitr::kable(x)

The function lcvp_summary gives a report on the searching results using the lcvp_search function. Indicating the number of species searched, how many were matched. Among the matched names, it indicates how many were exactly or fuzzy matched. Then it checks how many author and infracategory names were exactly matched. Note that if authors or infracategory is not provided, it will be considered a no match.

lcvp_summary(x)

Notice that the warning message indicated that more than one name was matched for some species. You can easily access these species using the following code.

sps_mult <- attr(x, "matched_mult")
sps_mult

And you can quickly use the lcvp_fuzzy_search on them.

lcvp_fuzzy_search(sps_mult)
knitr::kable(lcvp_fuzzy_search(sps_mult))

Search all vascular plant names of a Genus, Family, Order or Author

Users can search all plant taxa names listed in the "Leipzig Catalogue of Vascular Plants" (LCVP) by order, family, genus or author, using the lcvp_group_search function.

# Search by Genus
x <- lcvp_group_search(c("AA", "Adansonia"), search_by = "Genus")
head(x)
knitr::kable(head(x))

Users can choose to keep only certain taxonomic stauts, this includes "accepted", "synonym", "unresolved", and "external".

# Search by Author and keep only accepted names
x <- lcvp_group_search("Schltr.", search_by = "Author", status = "accepted")
head(x)
knitr::kable(head(x))

Basic output information

The output from lcvp_search, lcvp_fuzzy_search and lcvp_group_search is a data.frame (or list of data.frames) with the following columns:

If no match is found for one species it will return NA for the columns in the LCVP table. But, if no match is found for all species the function will return NULL and a warning message.

Matching and joining two lists of vascular plant names

Match

In many situations, researchers want to compare and match two lists of species name coming from different sources (e.g. phylogenies, spatial data, trait data, regional lists). The function lcvp_match can be used for that. It matches and compares two name lists based on the taxonomic resolution of plant taxa names listed in the LCVP.

# Generate two lists of species name
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))

Ordered is based on the splist1, and followed by non-matched names in splist2.

# Match both lists
x <- lcvp_match(splist1, splist2)
head(x)
knitr::kable(head(x))

If include_all = FALSE, non-matched names in splist2 are not included. And users can use the column Match.Position.2to1 to reorder splist2 to match splist1.

# Match both lists
matchLists <- lcvp_match(splist1, splist2, include_all = FALSE)
splist2_reordered <- splist2[matchLists$Match.Position.2to1]

Join

The lcvp_join function provides a quicker way to join two tables based on the taxonomic resolution of plant taxa names listed in the LCVP. It is inspired by the dplyr::join function. The function add the columns from a first table to a second table based on the list of species name in both tables. It first standardizes the species names in both tables based on the "Leipzig Catalogue of Vascular Plants" (LCVP) using the algorithm in lcvp_search. These standardized names of both tables are then matched using the algorithm in lcvp_match. The argument type indicates the kind of join to be done. The option "full" join will keep all species, "left" return all rows from the first table and drops the non-matches from the second, "right" do the same for the second table, and "inner" keep only matched species.

# Create data.frame1
splist1 <- sample(apply(LCVP::tab_lcvp[2:10, 2:3], 1, paste, collapse = " "))
tbl1 <-
  data.frame("Species" = splist1, "Trait1" = runif(length(splist1)))

# Create data.frame2
splist2 <- sample(apply(LCVP::tab_lcvp[11:3, 2:3], 1, paste, collapse = " "))
tbl2 <- data.frame(
  "Species" = splist2,
  "Trait2" = runif(length(splist2)),
  "Trait3" = runif(length(splist2)),
  "Trait4" = sample(c("a", "b"), length(splist2), replace = TRUE),
  "Trait5" = sample(c(TRUE, FALSE), length(splist2), replace = TRUE)
)
# Full join the two tables
x <- lcvp_join(tbl1, tbl2, c("Species", "Species"), type = "full")
head(x)
knitr::kable(head(x))

Because some names may turn out to be synonyms based on the LCVP taxonomic resolution, users may opt to solve duplicated species names by summarizing traits given provided functions for common classes of variables (numeric, character, and logical). This can be done by turning the option solve_duplicated = TRUE, and the algorithm will combine duplicated output names (Output.Taxon column) based on users-defined functions to summarize the information.

x <- lcvp_join(tbl1, tbl2, c("Species", "Species"), type = "full",
               solve_duplicated = TRUE)
head(x)
knitr::kable(head(x))

Because some users may want to solve duplicated names outside the lcvp_join function, the algorithm is available as an individual function names lcvp_solve_dups.

# Create a data.frame with duplicated names and different traits
splist <- sample(apply(LCVP::tab_lcvp[1:100, 2:3], 1, paste, collapse = " "))
search <- lcvp_search(splist)

tbl <- data.frame("Species" = search$Output.Taxon,
"Trait1" = runif(length(splist)),
"Trait2" = sample(c("a", "b"), length(splist), replace = TRUE),
"Trait3" = sample(c(TRUE, FALSE), length(splist), replace = TRUE))
# Solve with default parameters
x <- lcvp_solve_dups(tbl, 1)
head(x)
knitr::kable(head(x))

Parallel programing with lcvplants

Some users may want to use lcvplants functions to perform taxonomic harmonization for many thousands of species and wish to use the entire computational capacity available to speed up the processing time. Check this article, where we show how to run lcvp_search or lcvp_fuzzy_search in parallel in an efficient way to reduce computational time.

References

Freiberg, M., Winter, M., Gentile, A. et al. LCVP, The Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants. Sci Data 7, 416 (2020). https://doi.org/10.1038/s41597-020-00702-z



idiv-biodiversity/lcvplants documentation built on Nov. 18, 2022, 3:39 a.m.