knitr::opts_chunk$set(comment = "#>", collapse = TRUE, warning = FALSE, message = FALSE)
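A typical workflow looks like: get occurrence data, build a species list from it, search BHL for metadata on those species, fetch the OCR text, then view or save the pages.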
A note about pipes: pipes (%>%) are relatively new in R. They make it much easier to chain together the steps you want to run, and the result should be easier for others to read as well. We use pipes in the examples here, but you don't have to use them.
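To make the pipe note concrete, here is a small sketch comparing a nested call with its piped equivalent. It assumes the magrittr package is installed (it provides %>%); the vector x is made up for illustration.

```r
# Assumes the magrittr package is installed; it provides the %>% operator.
library(magrittr)

x <- c(1, 4, 9)

# Nested form: read from the inside out.
nested <- sum(sqrt(x))

# Piped form: read left to right, one step at a time.
piped <- x %>% sqrt() %>% sum()

# Both forms compute the same value.
identical(nested, piped)
```

Either style works with the functions in this vignette; the piped form just passes the left-hand result as the first argument of the next call.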
library("spplit")
In this example, we'll do a geospatially defined query. You can also do a query based on taxonomic names, or a combination of geometry and taxonomic names.
Define a WKT (Well-Known Text) string. For more about geometry queries, see the Geospatial Queries vignette.
wkt <- 'POLYGON((-124.07 41.48,-119.99 41.48,-119.99 35.57,-124.07 35.57,-124.07 41.48))'
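If you would rather start from a bounding box than type the WKT by hand, something like the following base R sketch may help. Note that bbox_to_wkt() is a hypothetical helper written for this example, not a function in spplit. Coordinates are written as x y (longitude, then latitude), and the first corner is repeated at the end to close the polygon ring.

```r
# Hypothetical helper (not part of spplit): build a WKT polygon from a
# bounding box, listing corners in the same order as the string above.
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  corners <- rbind(
    c(xmin, ymax),
    c(xmax, ymax),
    c(xmax, ymin),
    c(xmin, ymin),
    c(xmin, ymax)  # repeat the first corner to close the ring
  )
  # "x y" pairs separated by commas
  coords <- paste(corners[, 1], corners[, 2], collapse = ",")
  paste0("POLYGON((", coords, "))")
}

wkt2 <- bbox_to_wkt(xmin = -124.07, ymin = 35.57, xmax = -119.99, ymax = 41.48)
wkt2
```

With these inputs the helper reproduces the polygon string defined above.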
Use sp_occ_idigbio() to execute a query against iDigBio data. For brevity we're limiting the query to 3 results, but you can set the limit to whatever you like, keeping in mind that the larger the limit value, the longer the query will take.
By default we're using the botany collection from the CalAcademy. If searching across all collections makes more sense, let me know and we can change that. You can easily change the CalAcademy collection being searched with the cas_coll parameter.
res <- sp_occ_idigbio(geometry = wkt, limit = 3)
res
After getting back some occurrence data, but before getting data from the Biodiversity Heritage Library (BHL), we need a species list, because there's no point in sending duplicate queries to BHL.
spplist <- res %>% sp_list()
spplist
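As a rough illustration of why a species list matters, the base R sketch below collapses repeated names so each taxon would only be queried once. The names are made up, and this is not a description of what sp_list() does internally.

```r
# Made-up occurrence records; several records can share one taxon name.
occ_names <- c("Draba aureola", "draba aureola",
               "Albizia lophantha", "Draba aureola")

# Normalise case, then keep each name once, so each taxon is queried once.
species <- sort(unique(tolower(occ_names)))
species
```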
For BHL, we first need to search BHL for the species list we just made. Some taxa may be in BHL and some may not. The returned data we'll get in this step will contain the information we need to get pages from texts that contain information about the taxa of interest.
bhldat <- spplist %>% sp_bhl_meta()
bhldat
You can combine BHL metadata into a data.frame for easier viewing/manipulation:
as_df(bhldat)
Note how each taxon gets a different data.frame. You can combine them all into one via something like dplyr::bind_rows().
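A minimal sketch of that combination step, assuming the dplyr package is installed and that the per-taxon data frames arrive as a named list. The data and column names below are made up for illustration and are not the real BHL metadata columns.

```r
# Assumes the dplyr package is installed.
library(dplyr)

# Made-up stand-ins for per-taxon data frames; column names are illustrative only.
per_taxon <- list(
  `albizia lophantha` = data.frame(item_id = c(1L, 2L), pages = c(1L, 2L)),
  `draba aureola`     = data.frame(item_id = 3L, pages = 3L)
)

# .id adds a column holding each list element's name.
combined <- bind_rows(per_taxon, .id = "taxon")
combined
```

Using the .id argument keeps track of which taxon each row came from after the data frames are stacked.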
Now that we have metadata, we want to get OCR pages, that is, the text of digitized pages generated through Optical Character Recognition.
ocrdat <- bhldat %>% sp_bhl_ocr()
ocrdat
#> $`albizia lophantha`
#> <bhl ocr'ed text>
#> Count: 10
#> no. pages / total character count [1st 10]:
#> 1 / 2826
#> 2 / 7498
#> 5 / 18561
#> 2 / 5010
#> 1 / 6238
#> 1 / 3305
#> 1 / 2052
#> 1 / 4135
#> 6 / 20270
#> 1 / 2257
#>
#> $`beckmannia syzigachne`
#> <bhl ocr'ed text>
#> Count: 121
#> no. pages / total character count [1st 10]:
#> 1 / 1682
#> 2 / 6801
#> 1 / 2456
#> 1 / 1600
#> 3 / 7152
#> 21 / 79223
#> 1 / 35
#> 1 / 3224
#> 1 / 2750
#> 1 / 1826
#>
#> $`draba aureola`
#> <bhl ocr'ed text>
#> Count: 31
#> no. pages / total character count [1st 10]:
#> 3 / 10860
#> 1 / 2620
#> 2 / 5764
#> 1 / 3838
#> 1 / 3838
#> 1 / 3877
#> 1 / 1847
#> 1 / 2272
#> 1 / 3774
#> 1 / 3234
Note that the OCR text we get back has essentially no structure. It's simply plain text, and there's little metadata to work with.
You can view the pages with a utility in this package, which will open a separate browser tab in your default browser for each taxon. Each taxon page has a different panel for each BHL item, where each item can have many pages.
viewer(ocrdat)
Note that right now the text is all concatenated together. I am working on making it prettier. Let me know if you think this viewer is useful.
After all the work above, you'll likely want to save the OCR pages to disk before quitting your R session.
sp_bhl_save(ocrdat)
#> ocr text written to files in ./albizia_lophantha
#> ocr text written to files in ./beckmannia_syzigachne
#> ocr text written to files in ./draba_aureola
By default, sp_bhl_save() saves each taxon to a separate folder, where each taxon folder will have every page as a separate file. In cases where it's ambiguous what the name of the folder should be, we give it a random alphanumeric string for a name.
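To picture the resulting layout, here is a base R sketch that writes made-up page text into one folder per taxon, with one file per page. It is an illustration only and not how sp_bhl_save() is implemented. The page file names and the ocr_demo directory are assumptions, and everything is written under tempdir() so nothing lands in your working directory.

```r
# Illustrative sketch only; this is not how sp_bhl_save() is implemented.
# Made-up OCR text: one character vector of pages per taxon.
ocr_pages <- list(
  `albizia lophantha` = c("text of page one", "text of page two"),
  `draba aureola`     = c("text of page one")
)

out_root <- file.path(tempdir(), "ocr_demo")  # write under a temp directory

for (taxon in names(ocr_pages)) {
  # Folder name: spaces replaced by underscores, as in the output above.
  taxon_dir <- file.path(out_root, gsub(" ", "_", taxon))
  dir.create(taxon_dir, recursive = TRUE, showWarnings = FALSE)
  pages <- ocr_pages[[taxon]]
  for (i in seq_along(pages)) {
    # One file per page; the file naming scheme is an assumption.
    writeLines(pages[i], file.path(taxon_dir, sprintf("page_%03d.txt", i)))
  }
}

# List what was written, relative to out_root.
written <- list.files(out_root, recursive = TRUE)
written
```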