In kamclean/collaborator: Scalable multi-centre research using R and REDCap

knitr::opts_chunk$set(collapse = FALSE, warning=F, message=F)
library(dplyr);library(collaborator);library(knitr)

Collaborator: Generating Authorship Lists

Generating and formatting authorship lists for multi-centre research projects can be a challenging data wrangling task. In the case of collaborative research projects, there can be thousands of collaborators across hundreds of sites with a variety of roles.

Author name extraction using ORCID

ORCID provides a persistent digital identifier (an ORCID iD) that each individual own and control, and that distinguishes them from every other researcher. This is free to register for and can be used to empower collaborators to specify how their name should appear in publications. When working with 1000s collaborators, this provides a simple route to ensure accuate display of names on an authorship list and can be simply extracted from the ORCID website using the ORCID.

Validation of ORCID

ORCIDs follow a specific format of 16 characters in the format of "XXXX-XXXX-XXXX-XXXX" (16 characters in groups of 4 and separated by a dash). The extraction from the ORCID website will not work if not in this format. However, we can use the orcid_valid() function to investigate whether the ORCIDs on record are valid or not to use.

data <- tibble::tibble(n = c(1:7),
               orcid = c("0000-0001-6482-9086", "0000000250183066", "0000-0002-8738-4902",
                          "00O0-0002-8738-490X", "0000-0002-8738-490X", "0000-0002-8738-490", NA))

collaborator::orcid_valid(data, orcid = "orcid", reason = T) %>%
  knitr::kable()

This will output the same dataframe with the "orcid_valid" column appended with a correctly formatted orcid (if it is valid to use). All non-valid orcids will be listed as "NA". If you want to investigate further, you can use the argument ("reason==T") in the function to get additional columns:

"orcid_check_present": A binary value if any value was provided in that row or not (e.g. when a "NA" value was supplied)
"orcid_check_length": A binary value if the ORCID supplied is 16 characters or not (e.g. "0000-0002-8738-490" where a character has been missed)
"orcid_check_format": A binary value if the ORCID supplied fits the correct format of either 16 numbers or 15 numbers with an X at the end (e.g. "00O0-0002-8738-490X" value supplied)
"orcid_check_sum": ORCID uses an internal "checksum" to make sure not just any random set of 16 characters can be entered. This is a binary value if the ORCID supplied either passes or fails this "checksum" (e.g. "0000-0002-8738-490X" is indistinguishable from a valid ORCID, except it fails the checksum, so it had to have been entered incorrectly)

If any of the values above are "No", then the ORCID is not valid and so cannot be used. The final column "orcid_valid_reason" summarises all the reasons why an ORCID is not valid so these can be addressed.

Extraction of names from ORCID

Now we know what ORCIDs are valid, lets extract the names of just these using orcid_name(). Names on ORCID are recorded in 2 ways:

"Your given and family names" ("orcid_name_first" and "orcid_name_last").
"Your published name" (orcid_name_credit"): This is the full name displayed on ORCID, however this is not automatically separated into first name / last name.

Given this is recorded in 2 different ways, there can be discrepancies between the two methods (and why both are returned). It is recommended that "Your given and family names" is preferentially used since this avoids any confusion about first/middle vs last names for authorship lists (since the format required for authorship lists is often that given names are converted into initials).

data %>%
  collaborator::orcid_valid(data, orcid = "orcid", reason = F) %>%
  filter(is.na(orcid_valid)==F) %>%
  orcid_name(orcid = "orcid_valid", reason = F) %>%
  knitr::kable()

Formatting of names

If you need to format the names of collaborators as initials, this can be simply done using author_name(). This will convert every name in the "first_name" column into initials, which can be placed before or after the last name. This is shown in the "author_name" column below.

data %>%
  collaborator::orcid_valid(data, orcid = "orcid", reason = F) %>%
  collaborator::orcid_name(orcid = "orcid_valid", reason = F) %>%
  collaborator::author_name(first_name = "orcid_name_first", last_name = "orcid_name_last",position = "left", initial_max=3) %>%
  dplyr::select(n:orcid_valid, orcid_name_first:orcid_name_last, author_name)%>%
  knitr::kable()

Generating the formatted authorship list

Once you have your final list of authors, the report_auth() function aims to simplify the process of generating the fully formatted authorship list, with inbuilt flexibility in how these are presented.

Requirements

In order for the report_auth() function to operate as intended, we must first create a dataframe of all authors/collaborators containing at least 1 column: "name".

Example dataframe (data_author):

data_author <- collaborator::example_report_author
knitr::kable(head(data_author, n=10)) # Please note all names have been randomly generated

Main Features

(1) Basic Function

At it's most basic, report_auth() can produce a formatted list of a column of names.

  collaborator::report_auth(data_author) %>% # Please note all names have been randomly generated
  knitr::kable(, col.names= "")

(2) Grouping and subdivision of names

These names can be further grouped by another column in the dataframe:

collaborator::report_auth(data_author, group = "hospital") %>% # Please note all names have been randomly generated
  knitr::kable(col.names= "")

Or can be subdivided by another column in the dataframe:

collaborator::report_auth(data_author, subdivision = "country") %>% # Please note all names have been randomly generated
  knitr::kable(col.names= "")

Or groups can be further subdivided (for example by region/country, or by role)

collaborator::report_auth(data_author,
            group = "hospital",
            subdivision = "country") %>% # Please note all names have been randomly generated
  knitr::kable(col.names= "")

(3) Formatting

Clear and consistent formatting of authorship lists allows the contributions and affiliations of each collaborator/author to be represented. Within report_auth(), names are usually separated by a comma (","), with groups separated by a semicolon (";"). Furthermore the name of groups are separated by round brackets ("()"). However, there is a degree of inbuilt flexibility to facilitate customisation.

Below if for demonstration of this concept (not intented to reflect how these should be formatted!)

collaborator::report_auth(data_author, group="hospital", subdivision = "country",
            name_sep = " +", group_brachet = "[]",group_sep = " --- ") %>% # Please note all names have been randomly generated
  knitr::kable(col.names= "")