cord19_papers: Metadata for papers in the CORD-19 dataset

Description

Metadata such as titles, authors, journal, and publication IDs for each paper in the CORD-19 dataset. This comes from the all_sources_metadata_DATE.csv file in the decompressed dataset. Note that the papers have been deduplicated based on paper_id, doi, or title, and papers without a paper_id or title have been removed.
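The deduplication described above could look roughly like the following sketch. This is not the package's actual preprocessing code: the `raw` tibble and its values are invented for illustration, and the order of the steps (and the choice to keep papers with a missing DOI) is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the raw metadata file; values are invented.
raw <- tibble(
  paper_id = c("a1", "a1", "b2", NA, "c3", "d4"),
  doi      = c("10.1/x", "10.1/x", "10.1/y", "10.1/z", NA, NA),
  title    = c("Paper A", "Paper A", "Paper B", "Paper C", "Paper D", "Paper D")
)

deduplicated <- raw %>%
  # Drop papers without a paper_id or title
  filter(!is.na(paper_id), !is.na(title)) %>%
  # Keep the first row for each paper_id
  distinct(paper_id, .keep_all = TRUE) %>%
  # Keep the first row for each DOI, but never collapse rows whose DOI is missing
  filter(is.na(doi) | !duplicated(doi)) %>%
  # Keep the first row for each title
  distinct(title, .keep_all = TRUE)

deduplicated
```

In this toy input, the repeated `a1` row, the row with no `paper_id`, and the second "Paper D" row are dropped.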

Usage

cord19_papers

Format

A tibble with one observation for each paper, and the following columns:

paper_id

Unique identifier that can link to full text and citations. SHA of the paper PDF.

source

Source (e.g. pubmed, CZI...)

title

Title

doi

Digital Object Identifier

pmcid

PubMed Central ID (PMCID)

pubmed_id

PubMed ID

license

License

abstract

Abstract

publish_time

Publication year

authors

Authors

journal

Journal

microsoft_academic_paper_id

Microsoft Academic Paper ID

who

WHO Covidence number

has_full_text

Whether the paper has full text available
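As a rough illustration of how these columns can be combined, the sketch below counts papers with full text by source. It uses a small invented tibble (`papers`) in place of `cord19_papers`, and it assumes `has_full_text` is a logical column, which this page does not confirm.

```r
library(dplyr)

# Hypothetical stand-in rows using the column names documented above;
# the values are invented, and has_full_text is assumed to be logical.
papers <- tibble(
  paper_id      = c("a1", "b2", "c3", "d4"),
  source        = c("PMC", "PMC", "CZI", "biorxiv"),
  has_full_text = c(TRUE, TRUE, TRUE, FALSE)
)

# Count the papers that have full text, by source
full_text_by_source <- papers %>%
  filter(has_full_text) %>%
  count(source, sort = TRUE)

full_text_by_source
```

The same pipeline should work on `cord19_papers` itself if the column types match these assumptions.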

Source

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge, specifically the all_sources_metadata_DATE.csv file.

Examples

library(dplyr)

# What are the most common journals?
cord19_papers %>%
  count(journal, sort = TRUE)

# What are the most common words in titles (or abstracts)?
library(tidytext)

cord19_papers %>%
  unnest_tokens(word, title) %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words, by = "word")

dgrtwo/cord19 documentation built on March 20, 2020, 12:44 a.m.