knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
**Note that while in beta mode, you need to build the bulkUploader package first. Do this through the "build" tab, click "install and restart"
The purpose of the bulkUploader
package is to help data managers to bring their data into the Neotoma Paleoecology Database. Neotoma is a highly structured database that contains information on a large number of data types, including diatoms, ostracodes, pollen and mammal fossils.
Neotoma is really a "database of databases", and contains X constituent databases. To help data managers include their data in Neotoma, we have wrapped Neotoma package functions around tools to help manage, validate and upload their data into Neotoma.
library(bulkUploader) devtools::install_github('NeotomaDB/neotoma2') library(neotoma2) library(purrr) library(dplyr) library(httr)
The bulkUploader
package helps to manage data in Neotoma. There are helper functions that can assist a user in creating and manipulating neotoma2
data objects, but to specifically upload data to Neotoma, a user must be a Steward, with a username and password. Best practice here is to keep your credentials in a separate file, to be read in by the script you are using. If you are using version control software, you will also want to ensure that the credential file is added to your .gitignore
file (or svn:ignore
). Assuming you have a username ndevia
and a password StrongPassword
, you can create a file, pwcontrol.txt
in your working directory, with the contents:
ndevia StrongPassword
You will the read it in and add it to a bulkUploader
credential object using the addCredentials()
function:
pw <- readr::read_lines('pwcontrol.txt') cred <- addCredentials(username = pw[1], password = pw[2]) cred
Once you've loaded your credentials into the workflow, you can then validate them using the Neotoma API. If the credentials are valid, you will see not only the username and (obscured) password, but also the databases to which you have access. The following code shows the execution pattern, but is not actually executed. To execute the code yourself copy the
JLB NOTE: The text in the line above was not finished...changed eval=TRUE and ran it to execute
cred <- checkCredentials(cred) cred
And simply typing cred
will return:
User credentials: User name: someUser Password: *************** User credentials are valid for: dbid databasename 1 13 Academy of Natural Sciences of Drexel University 2 30 Diatom Paleolimnology Data Cooperative (DPDC)
The easiest data to enter into Neotoma is the publication data and contact data. From a given database, or data resource, we want to check that information is accurate. For contacts, unless we have ORCIDs for all users, we're stuck hand-checking information. For publication records we can use the DOI as a "gold standard" for publication information, since CrossRef manages this data for us, and provides a unique identifier that we can check against Neotoma's information.
However your publication data is organized, it must be aligned with Neotoma's so that it can be included in the database. The Neotoma publications
table uses the following fields in the publication
object:
| Field | Description |
=========== | =================
| publicationtype | The type of publication |
| publicationid | A unique Neotoma publication identifier |
| articletitle | The title of the document |
| year | The publishing year where available |
| journal | The journal (if available) |
| volume | Journal volume |
| issue | Journal issue |
| pages | Pages over which the item spans |
| citation | A plain text citation, generally APA format. |
| doi | The article DOI |
| author | A list of contacts
|
Users can directly add new information into Neotoma using the neotoma2::add_publication()
function, or they can use scrape_dois()
, a function in the bulkUploader
package, to validate publication information and to add additional metadata, where such metadata does not already exist.
data(pavela) # Note that `pavela` is a data element contained within the `bulkUploader` package, based on raw data from the PaVeLA database. It has info on both publications and authors neotoma <- pavela$PUBLIC # pull out just the publication dataframe and assign to the neotoma object # for pavela, removed citation and doi from this function because these fields are not available for this dataset pubList <- formatPubs(neotoma, year = 'AÑO_PUBLICACION', title = c('TITULO', 'NOMBRE_LIBRO.REVISTA')) # had to add a "../" to the savefile to account for the weird handling of Rmd files within R Studio vs Linux. outputs <- scrape_dois(pubList[1:10,], n = 3, savefile = '../data/pavelatab.RDS', restore = FALSE) results <- map(outputs, get_cr_results)
From this output we have a list
object called outputs
, with a length equal to the number of publications in the original data file. We have a list
called results
with a length equal to the length of outputs
. The results
object is made up of individual data.frame
s, with columns score
and diff
, showing the quality of each match returned by the CrossRef API, and the quality of subsequent matches (using the CrossRef scoring value). Knowing the diff
is useful to understand the quality of the matches. High quality matches have a high diff
score between the first and second matches, and a generally high score
value.
The score
gives the approximate match strength to a publication in CrossRef. The diff
lets us know what the score of the next closest match is. Because we need to work with the CrossRef API in a reasonable way, this matching tends to be slow.
Once this script runs we then have a set of publications for the constituent database, and for each described publication we have 0 or more potential matches. To generate a simple match we want to create some form of citation. We can do this using the very simple helper function makeCite()
.
neoCites <- makeCite(pubList$query.bibliographic) # original code listed pubList$citation, changed to pubList$query.bibliographic because no citation column in PaVeLA data neoCites <- makeCite(pubList$citation) crossCites <- outputs %>% map(cr_to_neotoma)
Now, for each publication we have a set of possible matches, and scores. We need to prompt the user to select the best match:
bestMatch <- matchPubs(cross = crossCites, pubs = neoCites, results = results)
The matchPubs()
function goes through each of the publications and provides the user with the ability to select the best match for a publication (if it exists) based on the CrossRef metadata.
For this process, we now have a variable bestMatch
that is a neotoma2
package publications
object. It contains all the information needed to add a publication to Neotoma (or to match a publication within Neotoma). By default bestMatch
does not include publicationid
because this data has only been drawn from CrossRef. We now need to test to see if the records are found in Neotoma. We use the get_publications()
method to do check Neotoma for existing publications. The API applies full text searching within the database against each publication
in the set of publications
. The function only runs against publication
objects that do not have publicationid
s.
matchedPubs <- get_publications(bestMatch)
Pavela notes: we do not need to use any code in this section for the first 10
When get_publications()
is applied to a record without a publicationid it returns the best matches that the API can find. These matches are effectively invisible to the user, but they can be shown using the showMatches()
function for any particular publication:
showMatch(matchedPubs[[1]])
So, a record may have n
potential matches and we can select a particular n
match in the selectMatch(x, n)
function. In the function x
is the specific publication, and n
is the match we want to select. If we pass selectMatch(x, n)
for a particular publication, with no option, or n=NA
, then all matches for the publication x
are cleared. In the code below you can see that we clear the matches, and then showMatch()
returns nothing.
matchedPubs[[1]] <- selectMatch(matchedPubs[[1]], 1) # something here overwrites the dataframe. bug in selectMatch code. showMatch(matchedPubs[[1]])
If you do have multiple matches and want to pick a specific one to assign to the unassigned publications, you can use the selectMatch(x,n)
function, specifying which publication you are replacing by assigning n
to the number of the best match. This also clears the close matches found for the record.
showMatch(matchedPubs[[2]]) matchedPubs[[2]] <- selectMatch(matchedPubs[[2]], 1) matchedPubs[[2]]
Once we've confirmed the records that exist as part of Neotoma, and identified the records that are not part of the database, we then want to upload the new publications to the database.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.