jsdivergence_shift | R Documentation
Shift object for calculating the Jensen-Shannon divergence (JSD) between two systems.
jsdivergence_shift(
  type2freq_1,
  type2freq_2,
  weight_1 = 0.5,
  weight_2 = 0.5,
  base = 2L,
  alpha = 1,
  reference_value = 0,
  normalization = "variation"
)
type2freq_1 |
A data.frame containing word types and their frequencies for the first system. |
type2freq_2 |
A data.frame containing word types and their frequencies for the second system. |
weight_1 |
Relative weight of type2freq_1 when constructing the mixed distribution. weight_1 and weight_2 should sum to 1. |
weight_2 |
Relative weight of type2freq_2 when constructing the mixed distribution. weight_1 and weight_2 should sum to 1. |
base |
The base for the logarithm when computing entropy scores. |
alpha |
The parameter for the generalized Tsallis entropy. Setting 'alpha = 1' recovers the Shannon entropy. |
reference_value |
Optional. String or numeric. The reference score used to partition scores into two different regimes. If 'average', the average score according to type2freq_1 and type2score_1 is used. If a lexicon is used for type2score, use the midpoint of that lexicon's scale. If no value is supplied, zero is used as the reference point. See Details for more information. |
normalization |
Optional. Default value: "variation". If 'variation', normalizes shift scores so that the sum of their absolute values is 1. If 'trajectory', normalizes them so that the sum of shift scores is 1 or -1. The trajectory normalization cannot be applied if the total shift score is 0, so scores are left unnormalized if the total is 0 and 'trajectory' is specified. |
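To make the two normalization modes concrete, here is a minimal base-R sketch of the behaviour described above. This is an illustration only, not the package's internal code; the function name `normalize_shift` is hypothetical.

```r
# Sketch of the two normalizations (hypothetical helper, not the package API)
normalize_shift <- function(scores, normalization = "variation") {
  if (normalization == "variation") {
    return(scores / sum(abs(scores)))  # absolute values now sum to 1
  }
  total <- sum(scores)
  if (total == 0) return(scores)       # trajectory undefined; leave as-is
  scores / abs(total)                  # shift scores now sum to +1 or -1
}

s <- c(0.2, -0.1, 0.4)
v <- normalize_shift(s, "variation")
t <- normalize_shift(s, "trajectory")
```

With 'variation', `sum(abs(v))` is exactly 1; with 'trajectory', `sum(t)` is +1 here because the raw scores sum to a positive total.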
The Jensen-Shannon divergence (JSD) accounts for some of the pathologies of the KLD. It does so by first creating a mixture text M,
M = π_1 P^{(1)} + π_2 P^{(2)},
where π_1 and π_2 are weights on the mixture between the two corpora. The JSD is then calculated as the average KLD of each text from the mixture text,
D^{(JS)} \bigl(P^{(1)} || P^{(2)}\bigr) = π_1 D^{(KL)} \bigl(P^{(1)} || M \bigr) + π_2 D^{(KL)} \bigl(P^{(2)} || M \bigr).
If the probability of a word in the mixture text is m_i = π_1 p_i^{(1)} + π_2 p_i^{(2)}, then an individual word's contribution to the JSD can be written as
δ JSD_i = m_i \log \frac{1}{m_i} - \biggl( π_1 p_i^{(1)} \log \frac{1}{p_i^{(1)}} + π_2 p_i^{(2)} \log \frac{1}{p_i^{(2)}} \biggr)
The JSD is well-defined for every word because the KLD is taken with respect to the mixture text M, which contains every word from both texts by design. Unlike the other measures, a word's JSD contribution is always positive, so we direct it in the word shift graph depending on the text in which it has the highest relative frequency. A word's contribution is zero if and only if p_i^{(1)} = p_i^{(2)}.
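The per-word contribution formula above can be sketched in base R. This is a toy illustration, not the package API: `jsd_contributions` is a hypothetical name, words absent from one text get probability zero, and we use the convention 0 · log(1/0) = 0.

```r
# Per-word JSD contributions for two toy frequency tables (hypothetical
# helper, not the package's internal function). Base-2 logs, equal weights.
jsd_contributions <- function(freq1, freq2, w1 = 0.5, w2 = 0.5, base = 2) {
  words <- union(names(freq1), names(freq2))
  p1 <- freq1[words]; p1[is.na(p1)] <- 0; p1 <- p1 / sum(p1)
  p2 <- freq2[words]; p2[is.na(p2)] <- 0; p2 <- p2 / sum(p2)
  m  <- w1 * p1 + w2 * p2                       # mixture distribution
  plogp <- function(p) ifelse(p > 0, p * log(1 / p, base = base), 0)
  # delta JSD_i = m_i log(1/m_i) - (w1 p1 log(1/p1) + w2 p2 log(1/p2))
  setNames(plogp(m) - (w1 * plogp(p1) + w2 * plogp(p2)), words)
}

f1 <- c(the = 10, economy = 5, hope = 2)
f2 <- c(the = 9, freedom = 4, hope = 3)
contrib <- jsd_contributions(f1, f2)
```

As stated above, every contribution is non-negative, and a word contributes zero exactly when its relative frequency is the same in both texts; summing `contrib` gives the total JSD.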
Like the Shannon entropy, the JSD can be generalized using the Tsallis entropy, and the order can be set through the parameter alpha.
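As a sketch of that generalization (again not the package's internal function, and using natural logarithms), the Tsallis entropy of order alpha is (1 - Σ p_i^α)/(α - 1), which approaches the Shannon entropy as alpha → 1:

```r
# Tsallis entropy of order alpha (hypothetical helper for illustration).
# At alpha = 1 it falls back to the Shannon entropy, which is the limit
# of the Tsallis formula as alpha -> 1.
tsallis_entropy <- function(p, alpha) {
  if (abs(alpha - 1) < 1e-12) return(-sum(p[p > 0] * log(p[p > 0])))
  (1 - sum(p^alpha)) / (alpha - 1)
}

p <- c(0.5, 0.3, 0.2)
tsallis_entropy(p, 1)        # Shannon entropy
tsallis_entropy(p, 1 + 1e-6) # numerically approaches the same value
```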
Quite often the JSD is effective at pulling out distinct words from each corpus (rather than "stop words"), but it is a more complex measure and so it is harder to properly interpret it as a whole.
The total Jensen-Shannon divergence can be accessed through the difference column in the shift object.
Returns a list object of class shift.
Other shifts: entropy_shift(), kldivergence_shift(), proportion_shift(), weighted_avg_shift()
library(shifterator)
library(quanteda)
library(quanteda.textstats)
library(dplyr)

reagan <- corpus_subset(data_corpus_inaugural, President == "Reagan") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as.data.frame() %>%  # move from classes frequency, textstat, and data.frame to data.frame
  select(feature, frequency)

bush <- corpus_subset(data_corpus_inaugural,
                      President == "Bush" & FirstName == "George W.") %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as.data.frame() %>%
  select(feature, frequency)

jsd <- jsdivergence_shift(reagan, bush)