knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  cache = TRUE,
  cache.path = "README-cache/"
)
The cord19 package shares the COVID-19 Open Research Dataset (CORD-19) in a tidy form that is easily analyzed within R.
Install the package from GitHub as follows:
remotes::install_github("dgrtwo/cord19")
The package turns the CORD-19 dataset into a set of tidy tables.
For example, the paper metadata is stored in cord19_papers.
library(dplyr)
library(cord19)

cord19_papers

# Learn how many papers came from each journal
cord19_papers %>%
  count(journal, sort = TRUE)
Most usefully, cord19_paragraphs has the full text of the papers, with one observation for each paragraph.
cord19_paragraphs

# What are common sections
cord19_paragraphs %>%
  count(section, sort = TRUE)
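Because the tables are tidy, they can be combined with ordinary dplyr joins. The sketch below uses two small made-up tables rather than the real data, and it assumes that cord19_papers carries the same paper_id key that cord19_paragraphs uses. The toy_* names and their rows are hypothetical.

```r
library(dplyr)

# Made-up stand-ins for cord19_papers and cord19_paragraphs (not real data).
# Assumption: both real tables share a paper_id column.
toy_papers <- tibble(
  paper_id = c("p1", "p2"),
  journal  = c("Journal A", "Journal B")
)

toy_paragraphs <- tibble(
  paper_id = c("p1", "p1", "p2"),
  section  = c("Introduction", "Methods", "Introduction"),
  text     = c("First paragraph.", "Second paragraph.", "Third paragraph.")
)

# Attach each paragraph's journal, then count paragraphs per journal
paragraphs_by_journal <- toy_paragraphs %>%
  inner_join(toy_papers, by = "paper_id") %>%
  count(journal, sort = TRUE)

paragraphs_by_journal
```

If the assumption holds, the same pipeline should work with cord19_paragraphs and cord19_papers in place of the toy tables.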
The one-row-per-paragraph layout of cord19_paragraphs lends itself to text analysis with a package like tidytext.
library(tidytext)
set.seed(2020)

# Sample 100 random papers
paper_words <- cord19_paragraphs %>%
  filter(paper_id %in% sample(unique(paper_id), 100)) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

paper_words %>%
  count(word, sort = TRUE)
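To see what the tokenizing step does to each paragraph, it can help to run the same two verbs on a tiny table. The sketch below uses made-up rows with the same column names as the code above (paper_id, section, text). The toy_* objects are hypothetical and not part of the package.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for cord19_paragraphs (illustrative rows, not real data)
toy_paragraphs <- tibble(
  paper_id = c("p1", "p1", "p2"),
  section  = c("Introduction", "Methods", "Introduction"),
  text     = c("The virus spreads quickly.",
               "We sampled the patients.",
               "The virus infects patients.")
)

toy_words <- toy_paragraphs %>%
  unnest_tokens(word, text) %>%        # one row per lowercased word
  anti_join(stop_words, by = "word")   # drop common words such as "the"

# The other columns (paper_id, section) are carried along with each word
toy_words %>%
  count(word, sort = TRUE)
```

Each paragraph becomes several rows, one per remaining word, which is why the result can be counted or grouped like any other tidy table.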
The package also includes the articles cited by each paper.
cord19_paper_citations
What are the most commonly cited articles?
cord19_paper_citations %>% count(title, sort = TRUE)
We could use the widyr package to find which papers are often cited by the same paper.
library(widyr)

filtered_citations <- cord19_paper_citations %>%
  add_count(title) %>%
  filter(n >= 25)

# What papers are often cited by the same paper?
filtered_citations %>%
  pairwise_cor(title, paper_id, sort = TRUE)
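As a rough illustration of what pairwise_cor measures here, the sketch below runs it on a made-up citation table. Titles "A" and "B" are always cited by the same papers, while "C" is cited only by papers that cite neither. The toy_* objects and their rows are hypothetical.

```r
library(dplyr)
library(widyr)

# Made-up stand-in for cord19_paper_citations (illustrative rows, not real data)
toy_citations <- tibble(
  paper_id = c("p1", "p1", "p2", "p2", "p3", "p4"),
  title    = c("A", "B", "A", "B", "C", "C")
)

# Correlate titles by which citing papers they appear in;
# the result has columns item1, item2, and correlation
toy_cors <- toy_citations %>%
  pairwise_cor(title, paper_id, sort = TRUE)

toy_cors
```

Titles that tend to appear in the same reference lists get a high correlation, and titles that rarely appear together get a low or negative one. On the real data, the n >= 25 filter keeps the comparison to titles cited often enough for the correlation to be meaningful.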