In dgrtwo/cord19: COVID-19 Open Research Dataset

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  cache = TRUE,
  cache.path = "README-cache/"
)

cord19

The cord19 package shares the COVID-19 Open Research Dataset (CORD-19) in a tidy form that is easily analyzed within R.

Installation

Install the package from GitHub as follows:

remotes::install_github("dgrtwo/cord19")

Papers

The package turns the CORD-19 dataset into a set of tidy tables.

For example, the paper metadata is stored in cord19_papers.

library(dplyr)
library(cord19)

cord19_papers

# Learn how many papers came from each journal
cord19_papers %>%
    count(journal, sort = TRUE)

Full text

Most usefully, cord19_paragraphs has the full text of the papers, with one observation for each paragraph.

cord19_paragraphs

# What are common sections
cord19_paragraphs %>%
    count(section, sort = TRUE)

This allows for some analysis with a package like tidytext.

library(tidytext)
set.seed(2020)

# Sample 100 random papers
paper_words <- cord19_paragraphs %>%
    filter(paper_id %in% sample(unique(paper_id), 100)) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word")

paper_words %>%
    count(word, sort = TRUE)

Citations

This also includes the articles cited by each paper.

cord19_paper_citations

What are the most commonly cited articles?

cord19_paper_citations %>%
    count(title, sort = TRUE)

We could use the widyr package to find which papers are often cited by the same paper.

library(widyr)

filtered_citations <- cord19_paper_citations %>%
    add_count(title) %>%
    filter(n >= 25)

# What papers are often cited by the same paper?
filtered_citations %>%
    pairwise_cor(title, paper_id, sort = TRUE)