---
output:
  md_document:
    variant: markdown_github
---


```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "README-")

library(tidyverse)
library(cleanNLP)
library(corpuslingr)  # devtools::install_github("jaytimm/corpuslingr")
library(quicknews)    # devtools::install_github("jaytimm/quicknews")
```

## Corpus preparation & annotation

To demo the search functionality of `corpuslingr`, we first build a small corpus of current news articles using my `quicknews` package. We apply the `qnews_get_meta`/`qnews_scrape_web` functions across multiple Google News sections to flesh out the corpus a bit and to add a genre-like dimension to it.

```{r}
topics <- c('nation', 'world', 'sports', 'science')

corpus <- lapply(topics, function(x) {
  quicknews::qnews_get_meta(language = "en", country = "us", type = "topic", search = x) %>%
    quicknews::qnews_scrape_web(link_var = 'link')}) %>%
  bind_rows() %>%
  mutate(doc_id = as.character(row_number()))  # Add doc_id
```

### clr_prep_corpus

This function performs two tasks. It eliminates unnecessary whitespace from the text column of a corpus dataframe object. Additionally, it attempts to trick annotators into treating hyphenated words as single tokens. With the exception of Stanford's CoreNLP (via `cleanNLP`), annotators tend to treat hyphenated words as multiple word tokens. For linguists interested in word formation processes, for example, this is disappointing. There is likely a less hacky way to do this.

```{r}
corpus <- clr_prep_corpus(corpus, hyphenate = TRUE)
```
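The hyphenation trick presumably amounts to a simple substitution applied before annotation; a minimal sketch of the idea (assumed, not `clr_prep_corpus`'s actual implementation):

```{r}
# Sketch only: join hyphenated words with a character the tokenizer
# will not split on, e.g. an underscore. (Assumed logic; not
# necessarily what clr_prep_corpus does internally.)
x <- "A state-of-the-art model"
gsub("(\\w)-(\\w)", "\\1_\\2", x)
## "A state_of_the_art model"
```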

### Annotate via cleanNLP and udpipe

For demo purposes, we use `udpipe` (via `cleanNLP`) to annotate the corpus dataframe object.

```{r}
cleanNLP::cnlp_init_udpipe(model_name = "english", feature_flag = FALSE, parser = "none")
ann_corpus <- cleanNLP::cnlp_annotate(corpus$text, as_strings = TRUE, doc_ids = corpus$doc_id)
```
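The column names referenced in the next step (`id`, `sid`, `word`, `lemma`, `pos`, `upos`) live in the annotation object's token table; a quick way to inspect them:

```{r}
# Inspect the token-level annotation whose columns get mapped to
# corpuslingr conventions below.
str(ann_corpus$token)
```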

### clr_set_corpus()

This function prepares the annotated corpus for complex, tuple-based search. Tuples are created, taking the form `<token~lemma~pos>`; tuple onsets/offsets are also set. Annotation output is homogenized, including column names, making text processing easier 'downstream.' Naming conventions established in the `spacyr` package are adopted here.

Lastly, the function splits the corpus into a list of dataframes by document. This is ultimately a search convenience.
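To make the tuple representation concrete, below is a minimal sketch of how regex search over tuples works in principle (invented token values; not `corpuslingr` internals):

```{r}
# One annotated sentence as a string of <token~lemma~pos> tuples
# (invented values, for illustration only).
tuples <- "<He~he~PRON> <gave~give~VERB> <it~it~PRON> <up~up~ADP>"

# A query like "VERB (*)? up" translates roughly to: a VERB tuple,
# an optional intervening tuple, then a tuple whose token is "up".
pattern <- "<[^>]+~VERB>( <[^>]+>)? <up~[^>]+>"
regmatches(tuples, gregexpr(pattern, tuples))
```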

```{r}
lingr_corpus <- ann_corpus$token %>%
  clr_set_corpus(doc_var = 'id',
                 token_var = 'word',
                 lemma_var = 'lemma',
                 tag_var = 'pos',
                 pos_var = 'upos',
                 sentence_var = 'sid',
                 meta = corpus[, c('doc_id', 'source', 'search')])
```

### clr_desc_corpus()

A simple function for describing an annotated corpus, providing some basic aggregate statistics at the corpus, genre, and text levels.

```{r}
summary <- corpuslingr::clr_desc_corpus(lingr_corpus, doc = "doc_id", sent = "sentence_id", tok = "token", upos = 'pos', genre = "search")
```

Corpus summary:

```{r}
summary$corpus
```

By genre:

```{r}
summary$genre
```

By text:

```{r}
head(summary$text)
```

## Search & aggregation functions

### Basic search syntax

The search syntax utilized here is modeled after the syntax implemented in the BYU suite of corpora. A full list of part-of-speech syntax can be viewed here.

```{r}
library(knitr)
corpuslingr::clr_ref_search_egs %>% kable(escape = TRUE, caption = "Search syntax examples")
```

### clr_search_gramx()

Search for all instantiations of a particular lexical pattern/grammatical construction devoid of context. This function enables fairly quick search.

```{r}
search1 <- "VERB (*)? up"

lingr_corpus %>%
  corpuslingr::clr_search_gramx(search = search1) %>%
  head()
```

### clr_get_freq()

A simple function for calculating text and token frequencies of search term(s). The `agg_var` parameter allows the user to specify how frequency counts are aggregated.

Note that generic noun phrases can be included as a search term (regex below), and can be specified in the query using `NPHR`.

```{r}
clr_ref_nounphrase
```

```{r}
search2 <- "*tial NOUNX"

lingr_corpus %>%
  corpuslingr::clr_search_gramx(search = search2) %>%
  corpuslingr::clr_get_freq(agg_var = 'token', toupper = TRUE) %>%
  head()
```

### clr_search_context()

A function that returns search terms with user-specified left and right contexts (`LW` and `RW`). Output is a list of two dataframes: a BOW (bag-of-words) dataframe object and a KWIC (keyword-in-context) dataframe object.

```{r}
search3 <- 'NPHR (do)? (NEG)? (THINK| BELIEVE )'

found_egs <- corpuslingr::clr_search_context(search = search3, corp = lingr_corpus, LW = 5, RW = 5)
```

### clr_context_kwic()

Access the KWIC object:

```{r}
found_egs %>%
  corpuslingr::clr_context_kwic() %>%  # Add genre.
  select(doc_id, kwic) %>%
  slice(1:15) %>%
  kable(escape = FALSE)
```

### clr_context_bow()

The `agg_var` parameter specifies how BOW counts are aggregated; `content_only` limits output to content words. Access the BOW object:

```{r}
search3 <- "White House"

corpuslingr::clr_search_context(search = search3, corp = lingr_corpus, LW = 10, RW = 10) %>%
  corpuslingr::clr_context_bow(content_only = TRUE, agg_var = c('searchLemma', 'lemma')) %>%
  head()
```

### clr_search_keyphrases()

Function for extracting key phrases from each text comprising a corpus based on tf-idf weights. The methods and logic underlying this function are described in more detail here.
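As a point of reference, the generic tf-idf weight is sketched below (a standard formulation; not necessarily the package's exact weighting):

```{r}
# Generic tf-idf: term frequency scaled by inverse document frequency.
# (Reference formula only; corpuslingr's internal computation may
# differ in detail.)
tf_idf <- function(tf, doc_freq, n_docs) tf * log(n_docs / doc_freq)
tf_idf(tf = 3, doc_freq = 2, n_docs = 20)  ## 3 * log(10) ~= 6.91
```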

The regex for key phrase search:

```{r}
clr_ref_keyphrase
```

The user can specify the number of key phrases to extract, how to aggregate and output them, and whether or not to use jitter to break ties among the top n key phrases.

```{r}
lingr_corpus %>%
  corpuslingr::clr_search_keyphrases(n = 5, key_var = 'lemma', flatten = TRUE, jitter = TRUE) %>%
  head() %>%
  kable()
```

