Collocations verification

Example by Jouni, via Ken

knitr::opts_chunk$set(echo = TRUE)

This follows up on this thread.

Jouni: please also see the code here that produces a unit test, from line 50 onward of tests/testthat/test-textstat_collocations.R in the collocations_verify branch. We want to make sure our tests are correct, and then make sure our code's output matches those correct tests.

Two models are fitted here: the one with all two-way interactions ((W1W2, W1W3, W2W3) in log-linear model notation) and the saturated model (W1W2W3). Each model is fitted in two equivalent ways: (1) as a Poisson log-linear model for the counts, and (2) as a binomial logistic model for the third word given the other two; note that here R wants the counts of both the "successes" (W3 is "tax") and the "failures" (W3 is other) as part of the definition of the response variable.
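For the binomial fits, a convenient way to supply both counts to glm() is a two-column response matrix with the successes in the first column and the failures in the second. A minimal sketch with hypothetical toy data (not the data used below):

# Two-column ("successes", "failures") response for a binomial glm (toy data)
toy <- data.frame(x = factor(c("a", "b")), nYes = c(3, 5), nNo = c(7, 5))
glm(cbind(nYes, nNo) ~ x, family = "binomial", data = toy)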

Here the likelihood ratio test statistic between these two models is 0.25604 (with 1 df), and the three-way interaction parameter (lambda) is -1.072 (with se = 2.164 and a z-statistic of -0.496, the square of which is 0.246, close to the LR test statistic). You can see in the output how these numbers appear under both ways of fitting the model.

As discussed before, it is not really necessary to fit the saturated model to get lambda and its standard error, as these are simple functions of the counts. The script also shows this calculation to get the -1.072 and the 2.164.
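Written out explicitly (with subscripts indexing the (word1, word2, word3) cell and c, g, t, o standing for "capital", "gains", "tax", "other"), the quantities computed in the script below are

$$
\hat\lambda = \log\frac{n_{cgt}\,n_{oot}\,n_{coo}\,n_{ogo}}{n_{ogt}\,n_{cot}\,n_{cgo}\,n_{ooo}},
\qquad
\widehat{\mathrm{se}}(\hat\lambda) = \sqrt{\sum_{i,j,k}\frac{1}{n_{ijk}}}.
$$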

Here the interaction parameter is not significant (although we cannot really conclude this, because the test assumes independence between different triplets), and its estimate is negative. Seeing "capital" and "gains" together thus increases the probability of seeing "tax", but by less than would be expected from the sum of the individual effects of "capital" and "gains". In sum, there is no evidence that "capital gains tax" is a true 3-gram in the text from which these counts came.

data.tmp <- data.frame(word1 = c("capital", "other", "capital", "other",
                                 "capital", "other", "capital", "other"),
                       word2 = c("gains", "gains", "other", "other",
                                 "gains", "gains", "other", "other"),
                       word3 = c("other", "other", "other", "other",
                                 "tax", "tax", "tax", "tax"),
                       n = c(1.5, 3.5, 2.5, 12.5, 5.5, 2.5, 1.5, 0.5),
                       stringsAsFactors = TRUE)  # factors are needed for relevel() below (R >= 4.0)

data.tmp

# For convenience of interpreting the output below, make "other" the reference level
data.tmp$word1 <- relevel(data.tmp$word1, "other")
data.tmp$word2 <- relevel(data.tmp$word2, "other")
data.tmp$word3 <- relevel(data.tmp$word3, "other")

# Models fitted as Poisson log-linear models:  
mP1.tmp <- glm(n ~ word1*word2 + word1*word3 + word2*word3, family = "poisson", data = data.tmp)
mP2.tmp <- glm(n ~ word1*word2*word3, family = "poisson", data = data.tmp)
summary(mP1.tmp)
summary(mP2.tmp)
anova(mP1.tmp, mP2.tmp, test = "Chisq")  # likelihood ratio (deviance) test
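As a cross-check on the numbers quoted above, the squared Wald z-statistic for the three-way term should be close to the deviance difference from anova(); the coefficient name below assumes the releveled factors defined above.

# Squared Wald z for the three-way interaction: about 0.246, close to the LR statistic
z3 <- coef(summary(mP2.tmp))["word1capital:word2gains:word3tax", "z value"]
z3^2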

# The interaction parameter and its standard error, calculated directly from the counts
(log(data.tmp$n[c(1,3,5,7)]) %*% c(-1, 1, 1, -1)) - (log(data.tmp$n[c(2,4,6,8)]) %*% c(-1, 1, 1, -1))
sqrt(sum(1/data.tmp$n))
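These hand-calculated values should match the Estimate and Std. Error of the three-way interaction row in summary(mP2.tmp); a direct comparison (coefficient name as above):

coef(summary(mP2.tmp))["word1capital:word2gains:word3tax", c("Estimate", "Std. Error")]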

# Models fitted as Binomial logistic models:  
# Reshape: one row per (word1, word2) combination, with counts for each word3 outcome
data2.tmp <- data.tmp[1:4, -(3:4)]     # word1, word2 for the four combinations
data2.tmp$nOther <- data.tmp$n[1:4]    # counts with word3 = "other" ("failures")
data2.tmp$nTax <- data.tmp$n[5:8]      # counts with word3 = "tax" ("successes")

data2.tmp
mB1.tmp <- glm(as.matrix(data2.tmp[, c("nTax", "nOther")]) ~ word1 + word2, family = "binomial", data = data2.tmp)
mB2.tmp <- glm(as.matrix(data2.tmp[, c("nTax", "nOther")]) ~ word1 * word2, family = "binomial", data = data2.tmp)

summary(mB1.tmp)
summary(mB2.tmp)
anova(mB1.tmp, mB2.tmp, test = "Chisq")  # likelihood ratio (deviance) test
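Finally, to make the equivalence of the two parameterizations explicit: the word1:word2 interaction in the saturated logistic model should equal (up to numerical precision) the three-way lambda from the saturated Poisson model. A quick check, assuming the coefficient names produced by the releveled factors:

# Both should be approximately -1.072
coef(mB2.tmp)["word1capital:word2gains"]
coef(mP2.tmp)["word1capital:word2gains:word3tax"]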

