---
output:
  md_document:
    variant: markdown_github
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "README-"
)
```

# corpusdatr

A data package consisting of two corpora:

- `cdr_slate_ann`: a ~1 million word sample of Slate Magazine articles (1996-2000)
- `cdr_gnews_historical`: a ~1.3 million word collection of Google News articles (11/27/17 to 12/20/17)

Both corpora are sized well for demo and pedagogical purposes.

```r
library(corpusdatr)  # devtools::install_github("jaytimm/corpusdatr")
library(tidyverse)
```

## Slate Magazine corpus

The Slate Magazine corpus is derived from the Slate Magazine portion of the Open American National Corpus (OANC). The Slate sub-corpus of the OANC comprises over 4,500 articles (~4 million words) published between 1996 and 2000.

For the sake of manageability, the full Slate corpus is reduced here to 1,000 randomly selected articles ranging from 850 to 1,500 words in length. This amounts to a corpus of approximately 1 million words.

Each text has been annotated using the spacyr package. Tuples, derived from the token, lemma, and part-of-speech tags, have been added to these annotations, along with tuple character onsets/offsets. These additions facilitate regex search for complex lexical patterns and grammatical constructions.
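The idea can be sketched with toy data: once token/lemma/POS tuples are collapsed into a single string per text, a regular expression can target grammatical patterns rather than bare word forms. The tuple format and column name below are illustrative, not the package's exact representation.

```r
# Hypothetical tuple strings of the form <token~lemma~POS>; the actual
# delimiter and column name in cdr_slate_ann may differ.
toy <- data.frame(
  tup = c("<The~the~DT>", "<facts~fact~NNS>",
          "<were~be~VBD>", "<checked~check~VBN>"),
  stringsAsFactors = FALSE
)

# Collapse tuples into one searchable string.
tup_string <- paste(toy$tup, collapse = " ")

# Find passive-like sequences: any form of "be" followed by a past participle.
hits <- regmatches(
  tup_string,
  gregexpr("<[^>]+~be~[A-Z]+> <[^>]+~[^>]+~VBN>", tup_string)
)[[1]]
hits
## [1] "<were~be~VBD> <checked~check~VBN>"
```

Searching on the lemma slot (`~be~`) catches *was*, *were*, *been*, etc. in one pattern, which is the point of carrying token, lemma, and tag together.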

The corpus loads as a list of data frames called `cdr_slate_ann`. The original data frame corpus is also included in the package as `cdr_slate_corpus`.

### Slate Magazine metadata

```r
head(corpusdatr::cdr_slate_meta)
```

### Geo-political entities in Slate

Additionally included in the package is an sf points object containing the lat/lon coordinates of geo-political entities occurring in more than 1% of the texts comprising the Slate corpus. It is included to enable geographical analysis of text data.

```r
head(corpusdatr::cdr_slate_gpe)
```
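As a self-contained sketch of what such an object affords, the snippet below builds a tiny stand-in for `cdr_slate_gpe` from lon/lat pairs and plots it; the column names here are assumptions, and the real object may differ.

```r
library(sf)
library(ggplot2)

# A two-row stand-in for cdr_slate_gpe (column names are illustrative).
gpe_demo <- st_as_sf(
  data.frame(lemma = c("france", "china"),
             lon   = c(2.35, 104.20),
             lat   = c(48.86, 35.90)),
  coords = c("lon", "lat"), crs = 4326
)

ggplot(gpe_demo) +
  geom_sf() +
  geom_sf_text(aes(label = lemma), nudge_y = 4) +
  labs(title = "GPE mentions (toy data)")
```

With the real object, the same `geom_sf()` call maps entity mentions directly; point size or color can then encode text frequency.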

### Slate content

To get a quick sense of the content of the articles included in the corpus, we plot the most frequent named entities by category.

```r
corpusdatr::cdr_slate_ann %>%
  bind_rows() %>%
  filter(pos == 'ENTITY') %>%
  group_by(entity_type, lemma) %>%
  summarize(freq = n()) %>%
  group_by(entity_type) %>%
  top_n(n = 13, wt = freq) %>%
  arrange(lemma, desc(freq)) %>%
  filter(entity_type %in% c('PERSON', 'ORG', 'GPE', 'NORP')) %>%

  ggplot(aes(x = reorder(lemma, freq), y = freq, fill = entity_type)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~entity_type, scales = "free_y", ncol = 2) +
    coord_flip() +
    labs(title = "Named entities in Slate corpus (1996-2000)")
```

## Google News corpus

For demo purposes, it is always nice to have a current corpus on hand, as well as a corpus with time-series information. To this end, the package also includes a corpus of articles scraped from the web (via Google News' RSS feed) using my R package quicknews. A timed script was used to obtain and annotate articles three times a day for roughly three weeks (from 11/27/17 to 12/20/17). Search was limited to top stories in the United States. Again, spacyr was used to build annotations.

To avoid copyright issues, each constituent article in the corpus is reduced to a bag-of-words. The corpus comprises ~1,500 texts, ~1.3 million words, and ~200 unique media sources, and loads as a single data frame, `cdr_gnews_historical`.

Metadata for the corpus can be accessed via `cdr_gnews_meta`:

```r
head(corpusdatr::cdr_gnews_meta)
```
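Because capture dates span roughly three weeks, simple time-series summaries are possible. A sketch, assuming `cdr_gnews_meta` carries a per-article `date` column (the actual column name may differ):

```r
library(dplyr)
library(ggplot2)

corpusdatr::cdr_gnews_meta %>%
  count(date) %>%                  # articles captured per day
  ggplot(aes(x = date, y = n)) +
  geom_line() +
  labs(title = "Articles per day, Google News corpus",
       x = NULL, y = "articles")
```

The same grouping, joined back to the bag-of-words counts in `cdr_gnews_historical`, would track term frequencies over the collection window.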


jaytimm/corpusdatr documentation built on Aug. 5, 2020, 11:48 a.m.