stylest vignette

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(kableExtra)

About stylest

This vignette describes the usage of stylest for estimating speaker (author) style distinctiveness.

Installation

Install stylest from CRAN by executing:

install.packages("stylest")

The dev version of stylest on GitHub may have additional features (and bugs) and is not guaranteed to be stable. Power users may install it with:

# install.packages("devtools")
# devtools::install_github("leslie-huang/stylest")

Load the package

stylest is built on top of the corpus package. corpus is needed to specify optional parameters in stylest (such as a text_filter), so we recommend installing and loading corpus as well.

library(stylest)
library(corpus)

Example: Fitting a model to English novels

Corpus

We will be using texts of the first lines of novels by Jane Austen, George Eliot, and Elizabeth Gaskell. Excerpts were obtained from the full texts of novels available on Project Gutenberg: http://gutenberg.org.

data(novels_excerpts)
# show a snippet of the data
kable(novels_excerpts[c(1,4,8), ]) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

The corpus should have at least one variable by which the texts can be grouped --- the most common examples are a "speaker" or "author" attribute. Here, we will use novels_excerpts$author.

unique(novels_excerpts$author)

Using stylest_select_vocab

This function uses n-fold cross-validation to identify the set of terms that maximizes the model's rate of predicting the speakers of out-of-sample texts. The texts are split into nfold folds; for each candidate frequency-percentile cutoff, a model restricted to the terms above that cutoff is fit on the training folds and used to predict the speakers of the held-out texts, and the cutoff with the lowest misclassification rate is selected.

(Vocabulary selection is optional; the model can be fit using all the terms in the support of the corpus.)

We recommend setting the seed before this step to ensure reproducible runs:

set.seed(1234)

Below are examples of stylest_select_vocab using the defaults and with custom parameters:

vocab_with_defaults <- stylest_select_vocab(novels_excerpts$text, novels_excerpts$author)

Tokenization options can optionally be passed via the filter argument; see the text_filter documentation in the corpus package for more information.

filter <- corpus::text_filter(drop_punct = TRUE, drop_number = TRUE)
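
To see the effect of these tokenization options, we can tokenize a small example string with corpus::text_tokens (the string here is just for illustration):

# punctuation and numbers are dropped by the filter defined above
corpus::text_tokens("The 3 sisters arrived, unannounced!", filter = filter)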

vocab_custom <- stylest_select_vocab(novels_excerpts$text, novels_excerpts$author, 
                                     filter = filter, smooth = 1, nfold = 10, 
                                     cutoff_pcts = c(50, 75, 99))

Let's look inside the vocab_with_defaults object.

# Percentile with best prediction rate
vocab_with_defaults$cutoff_pct_best

# Rate of INCORRECTLY predicted speakers of held-out texts
vocab_with_defaults$miss_pct

# Data on the setup:

# Percentiles tested
vocab_with_defaults$cutoff_pcts

# Number of folds
vocab_with_defaults$nfold
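
The best percentile is the candidate cutoff with the lowest average miss rate across folds. We can check this directly, assuming miss_pct is a folds-by-cutoffs matrix with one column per entry of cutoff_pcts:

# average miss rate per candidate cutoff; the lowest column should
# correspond to cutoff_pct_best
colMeans(vocab_with_defaults$miss_pct)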

Fitting a model

Using a percentile to select terms

With the best percentile identified as 90 percent, we can select the terms above that percentile to use in the model. Be sure to use the same text_filter here as in the previous step.

terms_90 <- stylest_terms(novels_excerpts$text, novels_excerpts$author, 90, filter = filter)
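
A quick look at the selected vocabulary, assuming stylest_terms returns a vector of the selected terms:

# how many terms were kept, and a few examples
length(terms_90)
head(terms_90)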

Fitting the model: basic

Below, we fit the model using the terms above the 90th percentile, using the same text_filter as before, and leaving the smoothing value for term frequencies as the default 0.5.

mod <- stylest_fit(novels_excerpts$text, novels_excerpts$author, terms = terms_90, filter = filter)
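
As noted earlier, vocabulary selection is optional. A minimal sketch of fitting on all terms in the support of the corpus, assuming stylest_fit uses every term when the terms argument is not supplied:

# fit without a restricted vocabulary (sketch only; not used below)
mod_all_terms <- stylest_fit(novels_excerpts$text, novels_excerpts$author, filter = filter)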

The model contains detailed information about token usage by each of the authors (see mod$rate); exploring this is left as an exercise.
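
For example, assuming mod$rate is a speakers-by-terms matrix of estimated term usage rates, a quick peek might look like:

# dimensions, and the estimated rates for the first few terms by author
dim(mod$rate)
mod$rate[, 1:5]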

Fitting the model: adding custom term weights

A new feature is the option to specify custom term weights in the form of a data frame. The intended use case is the mean cosine distance from a word's embedding representation to all other words in the vocabulary, but the weights can be anything the user chooses.

An example term_weights is shown below. When this argument is passed to stylest_fit(), the weights for terms in the model vocabulary will be extracted. Any term not included in term_weights will be assigned a default weight of 1.

term_weights <- data.frame("word" = c("the", "and", "Floccinaucinihilipilification"),
                           "mean_distance" = c(0.1,0.2,0.001))

term_weights

Below is an example of fitting the model with term_weights:

mod <- stylest_fit(novels_excerpts$text, novels_excerpts$author, 
                   terms = terms_90, filter = filter,
                   term_weights = term_weights,
                   weight_varname = "mean_distance")

The weights are stored in mod$weights.
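
For instance, we can inspect the first few stored weights; terms in the model vocabulary that were not listed in term_weights should carry the default weight of 1:

# a quick inspection of the stored weights
head(mod$weights)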

Using the model

Calculating speaker log odds

odds <- stylest_odds(mod, novels_excerpts$text, novels_excerpts$author)

We can examine the mean log odds that Jane Austen wrote Pride and Prejudice (in-sample).

# Pride and Prejudice
novels_excerpts$text[14]

odds$log_odds_avg[14]

odds$log_odds_se[14]

Predicting the speaker of a new text

In this example, the model is used to predict the speaker of a new text, in this case the opening passage of Northanger Abbey by Jane Austen.

Note that a prior may be specified, and may be useful for handling texts containing out-of-sample terms. Here, we do not specify a prior, so a uniform prior is used.

na_text <- "No one who had ever seen Catherine Morland in her infancy would have supposed 
            her born to be an heroine. Her situation in life, the character of her father 
            and mother, her own person and disposition, were all equally against her. Her 
            father was a clergyman, without being neglected, or poor, and a very respectable 
            man, though his name was Richard—and he had never been handsome. He had a 
            considerable independence besides two good livings—and he was not in the least 
            addicted to locking up his daughters."

pred <- stylest_predict(mod, na_text)

Viewing the result, and recovering the log probabilities calculated for each speaker, is simple:

pred$predicted

pred$log_probs
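
As noted above, a prior over the speakers may also be supplied. A minimal sketch, assuming prior takes a numeric vector with one prior probability per speaker (check ?stylest_predict for the exact format and ordering expected):

# an explicit uniform prior, which should match the default behavior
speakers <- sort(unique(novels_excerpts$author))
uniform_prior <- rep(1 / length(speakers), length(speakers))
pred_uniform <- stylest_predict(mod, na_text, prior = uniform_prior)
pred_uniform$predicted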

Influential terms

stylest_term_influence identifies terms' contributions to speakers' distinctiveness in a fitted model.

influential_terms <- stylest_term_influence(mod, novels_excerpts$text, novels_excerpts$author)

The mean and maximum influence can be accessed with $infl_avg and $infl_max, respectively.

The terms with the highest mean influence can be obtained:

kable(head(influential_terms[order(influential_terms$infl_avg, decreasing = TRUE), ])) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

And the least influential terms:

kable(tail(influential_terms[order(influential_terms$infl_avg, decreasing = TRUE), ])) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Citation

If you use this software, please cite:

Huang, L., Perry, P., & Spirling, A. (2020). A General Model of Author “Style” with Application to the UK House of Commons, 1935–2018. Political Analysis, 28(3), 412-434. https://doi.org/10.1017/pan.2019.49

@article{huang2020general,
  title={A General Model of Author “Style” with Application to the UK House of Commons, 1935--2018},
  author={Huang, Leslie and Perry, Patrick O and Spirling, Arthur},
  journal={Political Analysis},
  volume={28},
  number={3},
  pages={412--434},
  year={2020},
  publisher={Cambridge University Press}
}

Issues

Please submit any bugs, error reports, etc. on GitHub at: https://github.com/leslie-huang/stylest/issues.


