cord19_papers: Metadata for papers in the CORD-19 dataset
In dgrtwo/cord19: COVID-19 Open Research Dataset


Metadata such as titles, authors, journal, and publication IDs for each paper in the CORD-19 dataset. This comes from the all_sources_metadata_DATE.csv file in the decompressed dataset. Note that the papers have been deduplicated based on paper_id, doi, or title, and papers without a paper_id or title have been removed.
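The deduplication described above could look roughly like the following. This is a minimal sketch on an invented toy table, not the package's actual cleaning code: the object names raw_metadata and papers are hypothetical, and the order in which the three keys are applied is an assumption.

```r
library(dplyr)

# Toy stand-in for the raw metadata; all values are invented for illustration.
raw_metadata <- tibble(
  paper_id = c("a1", "a1", "b2", NA, "c3", "d4"),
  doi      = c("10.1/x", "10.1/x", "10.1/y", "10.1/z", "10.1/y", NA),
  title    = c("Paper X", "Paper X", "Paper Y", "Paper Z", "Paper Y again", NA)
)

papers <- raw_metadata %>%
  # Drop rows that lack a paper_id or a title
  filter(!is.na(paper_id), !is.na(title)) %>%
  # Keep the first row for each paper_id
  distinct(paper_id, .keep_all = TRUE) %>%
  # Keep the first row for each non-missing doi; rows with a missing doi are kept
  filter(is.na(doi) | !duplicated(doi)) %>%
  # Keep the first row for each title
  distinct(title, .keep_all = TRUE)

papers
```

In this toy input, the second "a1" row is dropped as a repeated paper_id, "c3" is dropped because it shares a doi with "b2", and the rows missing a paper_id or title are removed up front.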

cord19_papers

A tibble with one observation for each paper, and the following columns:

paper_id: Unique identifier that can link to full text and citations. SHA of the paper PDF.
source: Source (e.g. pubmed, CZI...)
title: Title
doi: Digital Object Identifier
pmcid: PubMed Central ID
pubmed_id: PubMed ID
license: License
abstract: Abstract
publish_time: Publication year
authors: Authors
journal: Journal
microsoft_academic_paper_id: Microsoft Academic Paper ID
who: WHO Covidence identifier
has_full_text: Whether full text is available for the paper
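As an illustration of how these columns might be combined, the sketch below summarizes full-text availability by source. It runs on a small invented table (toy_papers is hypothetical) rather than the real data, and it assumes has_full_text is stored as a logical column; substituting cord19_papers for toy_papers should work the same way if that assumption holds.

```r
library(dplyr)

# Hypothetical rows using a subset of the columns above; values are invented.
toy_papers <- tibble(
  paper_id      = c("a1", "b2", "c3", "d4"),
  source        = c("PMC", "CZI", "PMC", "biorxiv"),
  journal       = c("J Virol", NA, "Virology", NA),
  has_full_text = c(TRUE, FALSE, TRUE, TRUE)  # assumed to be logical
)

# Number of papers and share with full text, per source
full_text_by_source <- toy_papers %>%
  group_by(source) %>%
  summarize(
    n_papers      = n(),
    pct_full_text = mean(has_full_text)
  ) %>%
  arrange(desc(n_papers))

full_text_by_source
```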

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge, specifically the all_sources_metadata_DATE.csv file.

library(dplyr)

# What are the most common journals?
cord19_papers %>%
  count(journal, sort = TRUE)

# What are the most common words in titles (or abstracts)?
library(tidytext)

cord19_papers %>%
  # Split each title into one word per row
  unnest_tokens(word, title) %>%
  # Drop common stop words before counting
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)