knitr::opts_chunk$set(
    comment = "#>",
    collapse = TRUE,
    warning = FALSE,
    message = FALSE
)

Introduction to spplit

A typical workflow looks like:

A note about pipes: pipes (%>%) are a relatively new thing in R. They make constructing what you want to do in R much easier, and should be easier for others to read as well. We use pipes in examples here, but you don't have to use pipes.

load spplit

library("spplit")

Get occurrences

In this example, we'll do a geospatially defined query. You can also do a query based on taxonomic names, or a combination of geometry and taxonomic names.

Define a WKT string. For more about gemoetry queries, see the Geospatial Queries vignette.

wkt <- 'POLYGON((-124.07 41.48,-119.99 41.48,-119.99 35.57,-124.07 35.57,-124.07 41.48))'

Use sp_occ_idigbio() to execute a query against iDigBio data. For brevity we're limiting to 3 results, but you can set it at whatever you like, keeping in mind that the larger the limit value, the longer the query will take.

By default we're using the botany collection from the CalAcademy. If perhaps searching across all collections makes more sense let me know and we can change that. You can easily change the CalAcademy collection being searched with the cas_coll parameter.

res <- sp_occ_idigbio(geometry = wkt, limit = 3)
res

Get a species list

After getting back some occurrence data, but before getting some data from the BHL, we need a species list because there's no point in sending duplicate queries to BHL.

spplist <- res %>% sp_list()
spplist

Get BHL metadata

For BHL, we first need to search BHL for the species list we just made. Some taxa may be in BHL and some may not. The returned data we'll get in this step will contain the information we need to get pages from texts that contain information about the taxa of interest.

bhldat <- spplist %>% sp_bhl_meta()
bhldat

You can combine BHL metadata into a data.frame for easier viewing/manipulation:

as_df(bhldat)

Note how each taxon gets a different data.frame. You can combine them all into one via something like dplyr::bind_rows().

Get OCR text

Now that we have metadata, we want to get OCR pages - or text of digitized pages that has been generated through Optical Character Recognition.

ocrdat <- bhldat %>% sp_bhl_ocr()
ocrdat
#> $`albizia lophantha`
#> <bhl ocr'ed text>
#>   Count: 10
#>   no. pages / total character count [1st 10]:
#>     1 / 2826
#>     2 / 7498
#>     5 / 18561
#>     2 / 5010
#>     1 / 6238
#>     1 / 3305
#>     1 / 2052
#>     1 / 4135
#>     6 / 20270
#>     1 / 2257
#>
#> $`beckmannia syzigachne`
#> <bhl ocr'ed text>
#>   Count: 121
#>   no. pages / total character count [1st 10]:
#>     1 / 1682
#>     2 / 6801
#>     1 / 2456
#>     1 / 1600
#>     3 / 7152
#>     21 / 79223
#>     1 / 35
#>     1 / 3224
#>     1 / 2750
#>     1 / 1826
#>
#> $`draba aureola`
#> <bhl ocr'ed text>
#>   Count: 31
#>   no. pages / total character count [1st 10]:
#>     3 / 10860
#>     1 / 2620
#>     2 / 5764
#>     1 / 3838
#>     1 / 3838
#>     1 / 3877
#>     1 / 1847
#>     1 / 2272
#>     1 / 3774
#>     1 / 3234

Note that the OCR text we get back has essentially no structure. It's simply plain text, and there's little metadata to work with.

You can view the pages with a utility in this package, which will open a separate browser tab in your default browser for each taxon. Each taxon page has a different panel for each BHL item, where each item can have many pages.

viewer(ocrdat)

img

Note that right now the text is all concatentated together - I am working on making it prettier. Let me know if you think this viewer is useful.

Save text to disk

After all the work above, you'll likely want to save the OCR pages to disk before quitting your R session.

sp_bhl_save(ocrdat)
#> ocr text written to files in ./albizia_lophantha
#> ocr text written to files in ./beckmannia_syzigachne
#> ocr text written to files in ./draba_aureola

By default, sp_bhl_save() saves each each taxon to a separate folder, where each taxon folder will have every page as a separate file. In cases where it's ambiguous what the name of the folder should be we give it a random alphanumeric string for a name.



ropenscilabs/spplit documentation built on Sept. 29, 2020, 3:05 a.m.