gnfinderr is an experimental R wrapper for gnfinder, a library created by Dmitry Mozzherin for discovering scientific names in texts. gnfinder forms part of the Global Names Architecture (GNA) suite of tools for working with taxonomic names and biodiversity data. gnfinder is presently available as a command line tool and can also be used as a gRPC service, as a library, or in a Docker container. gnfinder is used to text mine the Biodiversity Heritage Library for taxonomic names at the scale of millions of pages.
gnfinderr has the modest ambition of providing a thin wrapper in R, with a focus on text mining smaller data frames and returning data frames to the user. gnfinderr uses the tidyverse and includes the pipe %>%.
gnfinderr is not intended for use on large datasets, as it will take a very long time. For those, use gnfinder directly. gnfinderr is useful for smaller datasets of up to a few thousand texts. This will give you a good flavour of what you can do with gnfinder if you are interested in scaling up later on.
gnfinderr is at an early experimental stage and presently just maps the main results to a data frame.
To work with gnfinderr you need to install gnfinder as a command line app for your operating system, as described at https://github.com/gnames/gnfinder. These steps are reproduced below.
Step 1: Get the latest release for your operating system from here https://github.com/gnames/gnfinder/releases
Make sure that you download the right version for your system to avoid considerable confusion.
Step 2: Linux or OSX
Move the gnfinder executable somewhere in your PATH (for example /usr/local/bin):
sudo mv path_to/gnfinder /usr/local/bin
Step 2: Windows
Here you have options.
One possible way would be to create a default folder for executables and place gnfinder there.
Press Windows+R and type "cmd". In the terminal window that appears, type:
mkdir C:\bin
copy path_to\gnfinder.exe C:\bin
Add the C:\bin directory to your PATH environment variable.
Step 3: Test run gnfinder
In a terminal, bring up the gnfinder page by running:
gnfinder
Run a quick query:
echo "Pomatomus saltator and Parus major" | gnfinder find -c -l eng
You are good to go.
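Since gnfinderr relies on the command line app, it can also help to confirm that your R session sees the executable. The following is a minimal sketch using base R's Sys.which(); it is an illustrative check and not part of gnfinderr itself.

```r
# Look up the gnfinder executable on the PATH as seen by this R session.
# Sys.which() returns an empty string when the program cannot be found.
gnfinder_path <- Sys.which("gnfinder")

if (nzchar(gnfinder_path)) {
  message("gnfinder found at: ", gnfinder_path)
} else {
  message("gnfinder was not found on the PATH; revisit Step 2.")
}
```

If the executable is not found from R but works in your terminal, the PATH used by your R session probably differs from the one used by your shell.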
Step 4: Install gnfinderr
gnfinderr is not on CRAN. The development version can be installed from GitHub with:
# install.packages("devtools")
devtools::install_github("poldham/gnfinderr")
This is presently a one-function package.
library(gnfinderr)
intro <- gnfinder(string = c("Lepidium meyennii is a hot plant.", "Escherichia coli is not a plant at all"))
gnfinderr is intended to be informative and to fail fast. The default search uses the combination of dictionaries and Bayes regression that powers gnfinder, and by default checks any names discovered against the Catalogue of Life.
An id column is generated and mapped to the input texts to facilitate joins. Alternatively, you can provide ids at input. Example datasets of various sizes are provided to experiment with. Be warned that the zootaxa_titles dataset with 20,000 titles takes a long time to run. Here we use the small fivetexts dataset to pass document ids to join the results.
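To show why the id column is useful, here is a minimal sketch of joining results back to the input texts with base R's merge(). The found data frame is a hand-made stand-in for illustration only: its name column and its contents are assumptions, not the actual output of gnfinderr.

```r
# Input texts with document ids supplied by the user.
texts <- data.frame(
  id = c("doc1", "doc2"),
  text = c("Pomatomus saltator and Parus major", "Escherichia coli in culture"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a results table: one row per name found,
# carrying the id of the text it came from (column names are assumed).
found <- data.frame(
  id = c("doc1", "doc1", "doc2"),
  name = c("Pomatomus saltator", "Parus major", "Escherichia coli"),
  stringsAsFactors = FALSE
)

# Join each discovered name back to its source text via the shared id.
joined <- merge(texts, found, by = "id")
```

Because one text can contain several names, the join repeats the source text once per name found.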
Most gnfinder options are available in gnfinderr. The exceptions are the language options, and metadata is not returned. By default all available options are switched on (including Bayes). We can change a setting by entering a value in the relevant argument.
Turning off check_names will speed up processing and simplify the returns:
library(gnfinderr)
df <- gnfinder(string = five$text, id = five$id, check_names = FALSE)
You can also turn off the Bayes regression by setting nobayes to TRUE:
df <- gnfinder(string = five$text, id = five$id, nobayes = TRUE, check_names = FALSE)
In general, name discovery in texts is facilitated by leaving the bayes setting as is.
When working with gnfinder it is best to be selective about the texts you pass in, to facilitate matching. For example, if you include article metadata such as author names and organisations, you may get unexpected false positive matches.
By default gnfinderr checks names against the Catalogue of Life taxonomy. The results are returned in a set of data frames under verification and include a best result.
For a list of available sources see https://index.globalnames.org/datasource. We can specify the sources we would like to check against by supplying a vector of ids to source_ids. Here we check against the sources with ids 1, 11 and 179:
df <- gnfinder(string = five$text, id = five$id, source_ids = c(1,11,179))
This will return a tibble (data frame) that contains other data frames under verification.
df$verification
The verification table contains the bestresult and preferredresults data frames, where the best result is the best match and the preferred results are from the taxonomic services given in source_ids. These tables include details of the taxon ids and taxonomic hierarchy.
gnfinder works with UTF-8 texts and should not be expected to work with other encodings. If you don't know whether your text is UTF-8 encoded, the package should tell you when you enter your texts.
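If you would like to check your texts yourself before passing them in, base R can do this. The sketch below uses validUTF8() to flag problem strings and iconv() to convert them. It assumes the offending text is Latin-1, which may not hold for your data, and it is independent of whatever check gnfinderr performs internally.

```r
# Two example strings: the second contains a raw Latin-1 byte (\xe9),
# which is not valid UTF-8.
texts <- c("Parus major", "Parus major, m\xe9sange")

# Flag which strings are valid UTF-8.
is_utf8 <- validUTF8(texts)

# Convert only the invalid strings, assuming they are Latin-1 encoded.
fixed <- texts
fixed[!is_utf8] <- iconv(texts[!is_utf8], from = "latin1", to = "UTF-8")
```

If you do not know the source encoding, inspect the flagged strings by hand before converting, because a wrong guess for the from argument will silently produce the wrong characters.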
R runs on a single core and so, for the time being, does gnfinderr. Work is planned to explore options for running in parallel. As such, expect the package to be slow on large numbers of texts. Turn off check_names for a performance boost.
At this early stage the focus is on:
Suggestions and comments are welcome on the GitHub Issues page. A Code of Conduct is provided for collaborators.