keyperm-package: keyperm: Keyword Analysis Using Permutation Tests


keyperm: Keyword Analysis Using Permutation Tests

Description

Implementation of permutation-based keyword analysis for corpus linguistics.

Details

This package contains an implementation of the permutation testing approach to keyness as used in corpus linguistics.

Keywords are words that occur more frequently in one corpus than in another. Keyness is usually assessed using test statistics, for example the likelihood-ratio statistic computed on 2x2 contingency tables, resulting in a score for every term that appears in the corpora.

Conventionally, keyness scores are judged by reference to a limiting null distribution derived under a token-by-token sampling model. keyperm instead approximates the null distribution under a document-by-document sampling model.
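
To make the difference concrete, here is a minimal sketch (with made-up numbers, not part of keyperm's API) of the two ways an observed score can be turned into a p-value:

observed_llr <- 7.9                   # hypothetical observed LLR score for one term
perm_scores <- rchisq(1000, df = 1)   # stand-in for scores obtained from permuted labels

# token-by-token model: compare against the limiting chi-square(1) distribution
p_limit <- pchisq(observed_llr, df = 1, lower.tail = FALSE)

# document-by-document model: proportion of permuted scores at least as extreme
p_perm <- (sum(perm_scores >= observed_llr) + 1) / (length(perm_scores) + 1)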

The permutation distribution of a given keyness measure is calculated for each term by repeatedly shuffling the corpus labels of the documents. The number of documents per corpus is kept constant.
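
The following toy sketch illustrates the resampling step in plain R; keyperm's actual implementation runs in C++ via Rcpp on an indexed frequency list, and score_fun is a hypothetical scoring function, not part of the package:

# tdm: terms-by-documents count matrix; corpus: logical labels of the documents
perm_null_sketch <- function(tdm, corpus, score_fun, nperm = 1000) {
  replicate(nperm, {
    shuffled <- sample(corpus)  # shuffle labels; documents per corpus stay constant
    score_fun(tdm, shuffled)    # recompute keyness scores under the shuffled labels
  })
}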

Apart from null distributions of common test statistics like the LLR and the chi-square statistic, keyperm can also obtain the null distribution of the logratio measure, which is normally used as an effect size.

Currently, the following types of scores are supported (a computational sketch is given after the list):

llr

The log-likelihood ratio

chisq

The chi-square statistic

diff

Difference of relative frequencies

logratio

Binary logarithm of the ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.
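
For illustration, all four scores for a single term can be computed from its 2x2 contingency table as in the following sketch (illustrative code, not keyperm's internals; in particular, the exact form of the Laplace correction used by keyperm may differ):

# a, b: frequencies of the term in corpora A and B; n_a, n_b: total token counts
scores_sketch <- function(a, b, n_a, n_b, laplace = 1) {
  obs <- c(a, b, n_a - a, n_b - b)   # observed cells of the 2x2 table
  n <- n_a + n_b
  e <- c(n_a, n_b, n_a, n_b) * c(a + b, a + b, n - a - b, n - a - b) / n
  c(llr      = 2 * sum(ifelse(obs > 0, obs * log(obs / e), 0)),
    chisq    = sum((obs - e)^2 / e),
    diff     = a / n_a - b / n_b,
    logratio = log2(((a + laplace) / n_a) / ((b + laplace) / n_b)))
}

scores_sketch(a = 30, b = 5, n_a = 10000, n_b = 12000)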

The actual resampling procedure is implemented efficiently using the Rcpp package and a special data structure, the indexed frequency list. Currently, keyperm can generate indexed frequency lists from term-document matrices as implemented in the tm package.

Author(s)

Maintainer: Thoralf Mildenberger mild@zhaw.ch (ORCID)

Examples

library(tm)
library(keyperm)

# load subcorpora "acq" and "crude" from Reuters

data(acq)
data(crude)

# convert to term-document-matrices and combine into single tdm

acq_tdm <- TermDocumentMatrix(acq, control = list(removePunctuation = TRUE))
crude_tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE))
tdm <- c(acq_tdm, crude_tdm)

# generate a logical that indicates whether document comes from "acq" or "crude"

ndoc_A <- dim(acq_tdm)[2]
ndoc_B <- dim(crude_tdm)[2]
corpus <- rep(c(TRUE, FALSE), c(ndoc_A, ndoc_B))

# generate an indexed frequency list, the data structure used by keyperm

reuters_ifl <- create_ifl(tdm, corpus = corpus)

# calculate Log-Likelihood-Ratio scores for all terms and calculate
# p-values according to the (wrong) token-by-token sampling model

llr <- keyness_scores(reuters_ifl, type = "llr", laplace = 0)
head(round(pchisq(llr, df = 1, lower.tail = FALSE), digits = 4), n = 10)

# generate permutation distribution and p-values based on document-by-document sampling model

keyp <- keyperm(reuters_ifl, llr, type = "llr", 
                laplace = 0, output = "counts", nperm = 1000)
head(p_value(keyp, alternative = "greater"), n = 10)

# generate observed log-ratio values and (one-sided) p-values based
# on the permutation distribution (document-by-document sampling model)
# Laplace correction used (adding one occurrence to both corpora)

logratio <- keyness_scores(reuters_ifl, type = "logratio", laplace = 1)
keyp2 <- keyperm(reuters_ifl, logratio, type = "logratio", 
                laplace = 1, output = "counts", nperm = 1000)
head(p_value(keyp2, alternative = "greater"), n = 10)

# It may be of interest to improve the accuracy of the small p-values. 
# Think of this as spending the computational budget mainly 
# on the terms for which higher accuracy matters most.

pvals <- p_value(keyp2, alternative = "greater")
table(pvals < 0.1)

small_p <- which(pvals < 0.1)

logratio_subset <- logratio[small_p]
reuters_ifl_subset <- create_ifl(tdm, subset_terms = small_p, corpus = corpus)

keyp2_subset <- keyperm(reuters_ifl_subset, logratio_subset, type = "logratio", 
                 laplace = 1, output = "counts", nperm = 9000)

# combine counts from both runs using the combiner

keyp2_combined <- combine_results(keyp2, keyp2_subset)

# the larger p-values are based on 1000 random permutations, the smaller ones on 10000
# note that 10000 is still far too small for real applications

head(p_value(keyp2_combined, alternative = "greater"), n = 10)

