knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The bulkUploader package helps data managers bring their data into the Neotoma Paleoecology Database. Neotoma is a highly structured database that holds many data types, including diatoms, ostracodes, pollen and mammal fossils.

Neotoma is really a "database of databases", and contains X constituent databases. To help data managers contribute their data, bulkUploader combines functions from the neotoma2 package with tools to manage, validate and upload data to Neotoma.

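library(bulkUploader)
# Install neotoma2 from GitHub only if it is not already available
if (!requireNamespace("neotoma2", quietly = TRUE)) devtools::install_github('NeotomaDB/neotoma2')
library(neotoma2)
library(purrr)
library(dplyr)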

Managing your Data

The easiest data to enter into Neotoma are publication data and contact data. For any given database or data resource, we want to check that this information is accurate. For contacts, unless we have ORCIDs for all users, the information must be checked by hand. For publication records we can use the DOI as a "gold standard", since CrossRef manages this data for us and provides a unique identifier that we can check against Neotoma's information.

Publication Data

However your publication data is organized, it must be aligned with Neotoma's so that it can be included in the database. The Neotoma publications table uses the following fields in the publication object:

| Field | Description |
| ----------- | ----------------- |
| publicationtype | The type of publication |
| publicationid | A unique Neotoma publication identifier |
| articletitle | The title of the document |
| year | The publishing year where available |
| journal | The journal (if available) |
| volume | Journal volume |
| issue | Journal issue |
| pages | Pages over which the item spans |
| citation | A plain text citation, generally APA format. |
| doi | The article DOI |
| author | A list of contacts |
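As a rough illustration of what "aligning" a source table with these fields can mean, the sketch below renames columns of a small data frame to Neotoma field names using base R. The column names `Author`, `Year`, `Title` and `Journal` follow the `miomapcsv` columns used later in this document; the record itself and the `field_map` vector are made up for the example and are not part of `bulkUploader` or `neotoma2`.

```r
# A made-up source record; real data would come from your own database.
source_pubs <- data.frame(
  Author  = "Doe, J.",
  Year    = 2001,
  Title   = "An example title",
  Journal = "An Example Journal",
  stringsAsFactors = FALSE
)

# Hypothetical mapping: source column name -> Neotoma field name.
field_map <- c(
  Author  = "author",
  Year    = "year",
  Title   = "articletitle",
  Journal = "journal"
)

# Keep only the mapped columns, then rename them to the Neotoma fields.
aligned <- source_pubs[, names(field_map), drop = FALSE]
names(aligned) <- unname(field_map)

# Neotoma fields with no counterpart in the source are simply absent here.
missing_fields <- setdiff(c("volume", "issue", "pages", "doi"), names(aligned))
```

Keeping the mapping in a single named vector makes it easy to review which source columns feed which Neotoma fields before any upload is attempted.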

Users can directly add new information into Neotoma using the neotoma2::add_publication() function, or they can use scrape_dois(), a function in the bulkUploader package, to validate publication information and to add additional metadata, where such metadata does not already exist.

# `miomapcsv` is a dataset bundled with the `bulkUploader` package.
# data() loads it into the workspace under its own name, so no assignment is needed.

data(miomapcsv)

# Ideally this code would be cleaned up. Title and journal information appear in several columns, so more than one title column is passed to formatPubs().

pubList <- formatPubs(miomapcsv[1:3,], 
                      author='Author',
                      year='Year',
                      title=c('Title', 'Book.Title'))

outputs <- scrape_dois(pubList,
            n = 3, 
            savefile = './data/miotest.RDS', 
            restore = TRUE)

results <- map(outputs, get_cr_results)

This gives us a list object called outputs, with a length equal to the number of publications passed in from the original data file, and a list called results of the same length. Each element of results is a data.frame with the columns score and delta, which describe the quality of each match returned by the CrossRef API (using the CrossRef scoring value) relative to the matches that follow it. High quality matches have a generally high score and a large delta between the first and second matches.

The score gives the approximate strength of the match to a publication in CrossRef. The delta shows how that score compares with the score of the next closest match. Because requests to the CrossRef API must be made at a reasonable rate, this matching tends to be slow.
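To make the relationship between score and delta concrete, here is a minimal base R sketch. It assumes that delta is the drop in score from one match to the next, which is one reading of the description above and not necessarily how `get_cr_results()` computes it. The scores, the function name `flag_confident()` and both thresholds are invented for the example.

```r
# Hypothetical CrossRef scores for one publication, sorted best match first.
scores <- c(85.2, 41.7, 40.9)

# Assumed definition: delta is the drop in score to the next match.
# The last match has no successor, so its delta is NA.
matches <- data.frame(
  score = scores,
  delta = c(-diff(scores), NA)
)

# Illustrative screening rule with arbitrary thresholds: trust the top match
# only if it scores well and clearly beats the runner-up.
flag_confident <- function(matches, min_score = 60, min_delta = 20) {
  top <- matches[1, ]
  !is.na(top$delta) && top$score >= min_score && top$delta >= min_delta
}

is_confident <- flag_confident(matches)
```

A rule like this could be used to decide which publications need a closer look by hand, but suitable thresholds would need to be chosen by inspecting real results.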

Once this script has run we have a set of publications for the constituent database, and for each publication we have zero or more potential matches. To compare each publication against its candidate matches we first need some form of citation, which we can build using the very simple helper function makeCite().

miocites <- makeCite(miomapcsv$Author,
                     miomapcsv$Year,
                     miomapcsv$Title,
                     miomapcsv$Journal)

crossCites <- outputs %>% map(cr_to_neotoma)

Now, for each publication we have a set of possible matches, and scores. We need to prompt the user to select the best match:

bestMatch <- matchPubs(cross = crossCites,
                       pubs = miocites,
                       results = results)

The matchPubs() function goes through each of the publications and lets the user select the best match for a publication (if one exists) based on the CrossRef metadata.

Matching to Neotoma

We now have a variable, bestMatch, that is a neotoma2 package publications object. It contains all the information needed to add a publication to Neotoma (or to match a publication within Neotoma). By default bestMatch does not include a publicationid, because its data has only been drawn from CrossRef. We now need to find matches in Neotoma. To do that, we use the get_publications() method implemented for publication lists. This checks for publications that are missing publication IDs and returns that information.

matchedPubs <- get_publications(bestMatch)

Data States after Matching:

| Publication State | in Miomap | in CrossRef | in Neotoma |
| ----------- | -------------- | ------------ | ------------ |
| Only in Miomap | X | | |

Where there is a conflict between Neotoma and CrossRef/Miomap, either missing data needs to be filled in (easy), or the field in one source or the other needs to be updated.

attr(toUpdate=FALSE, conflict=NULL)
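The two cases described above (filling in missing data and flagging fields that disagree) can be sketched in base R as follows. This is only an illustration using plain lists: the helper names `is_blank()` and `reconcile()` are hypothetical, the records and the placeholder DOI are made up, and none of this reflects how `bulkUploader` or `neotoma2` actually store or resolve conflicts.

```r
# Treat NULL, zero-length, NA and empty-string values as missing.
is_blank <- function(x) {
  is.null(x) || length(x) == 0 ||
    (length(x) == 1 && (is.na(x) || identical(x, "")))
}

# Fill blank fields in `neotoma` from `crossref`, and record the names of
# fields where both sources have a value but the values differ.
reconcile <- function(neotoma, crossref) {
  conflict <- character(0)
  for (field in names(crossref)) {
    if (is_blank(neotoma[[field]])) {
      neotoma[[field]] <- crossref[[field]]
    } else if (!is_blank(crossref[[field]]) &&
               !identical(neotoma[[field]], crossref[[field]])) {
      conflict <- c(conflict, field)
    }
  }
  list(record = neotoma, conflict = conflict)
}

# Made-up records: the DOI is missing in one source and the years disagree.
neotoma_pub  <- list(articletitle = "An example title", year = 1999, doi = NA)
crossref_pub <- list(articletitle = "An example title", year = 2000,
                     doi = "10.0000/example")

out <- reconcile(neotoma_pub, crossref_pub)
```

Here the missing DOI is filled automatically, while the disagreement in `year` is only reported, leaving the decision about which source to trust to the data manager.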



NeotomaDB/bulkUploader documentation built on June 9, 2025, 10:49 p.m.