ttMatrix: Construct a 'type-token' (tt) Matrix from a vector
In cysouw/qlcMatrix: Utility Sparse Matrix Functions for Quantitative Language Comparison

ttMatrix

R Documentation

Construct a ‘type-token’ (tt) Matrix from a vector

Description

A type-token matrix is a sparse matrix representation of a vector of entities. The rows of the matrix (‘types’) represent all different entities in the vector, and the columns of the matrix (‘tokens’) represent the entities themselves. The cells in the matrix represent which token belongs to which type. This is basically a convenience wrapper around factor and sparseMatrix, with an option to influence the ordering of the rows (‘types’) based on locale settings.

Usage

ttMatrix(vector, simplify = FALSE)

Arguments

`vector`	a vector of tokens to be represented as a sparse matrix. It will work without complaining just as well when given a factor, but be aware that the ordering of the levels in the factor depends on the locale, which is consistently set by `getOption("qlcMatrix.locale")`, see details. So better let this function turn the vector into a factor.
`simplify`	by default, the row and column names are not included into the matrix to keep the matrix as lean as possible. The row names (‘types’) are returned separately. Using `simplify = T` the row and columns names will be added into the matrix. Note that the column names are simply the vector that went into the function.

Details

This function is a rather low-level preparation for later high level functions. A few simple uses are described in the examples.

Ordering of the entities (‘collation’) is determined by a locale. By default R mostly uses ‘en_US.UTF-8’, though this depends on the installation. By default, the qlcMatrix package sets the ordering to ‘C’, which means that characters are always ordered according to their Unicode-number. For more information about locale settings, see locales. You can change the locale used in this package by changing the qlcMatrix.locale option, e.g. by the command options(qlcMatrix.locale = "en_US.UTF-8")

Value

By default (simplify = F), then the output is a list with two elements:

`M`	sparse pattern Matrix of type `ngCMatrix`. Because of the structure of these matrices, row-based encoding would be slightly more efficient. If RAM is crucial, consider storing the matrix as its transpose
`rownames`	a separate vector with the names of the types in the order of occurrence in the matrix. This vector is separated from the matrix itself for reasons of efficiency when dealing with many matrices.

When simplify = T, then only the matrix M with row and columns names is returned.

Author(s)

Michael Cysouw

Examples

# Consider two nominal variables
# one with eight categories, and one with three categories
var1 <- sample(8, 1000, TRUE)
var2 <- sample(3, 1000, TRUE)

# turn them into type-token matrices
M1 <- ttMatrix(var1, simplify = TRUE)
M2 <- ttMatrix(var2, simplify = TRUE)

# Then taking  the `residuals' from assocSparse ...
x <- as.matrix(assocSparse(t(M1), t(M2), method = res))

# ... is the same as the residuals as given by a chi-square
x2 <- chisq.test(var1, var2)$residuals
class(x2) <- "matrix"
all.equal(x, x2, check.attributes = FALSE) # TRUE


# A second quick example: consider a small piece of English text:
text <- "Once upon a time in midwinter, when the snowflakes were 
falling like feathers from heaven, a queen sat sewing at her window, 
which had a frame of black ebony wood. As she sewed she looked up at the snow 
and pricked her finger with her needle. Three drops of blood fell into the snow. 
The red on the white looked so beautiful that she thought to herself: 
If only I had a child as white as snow, as red as blood, and as black 
as the wood in this frame. Soon afterward she had a little daughter who was 
as white as snow, as red as blood, and as black as ebony wood, and therefore 
they called her Little Snow-White. And as soon as the child was born, 
the queen died." 

# split by characters, make lower-case, and turn into a type-token matrix
split.text <- tolower(strsplit(text,"")[[1]])
M <- ttMatrix(split.text, simplify = TRUE)

# rowSums give the character frequency
freq <- rowSums(M)
names(freq) <- rownames(M)
sort(freq, decreasing = TRUE)

# shift the matrix one character to the right using a bandSparse matrix
S <- bandSparse(n = ncol(M), k = 1)
N <- M %*% S

# use rKhatriRao on M and N to get frequencies of bigrams
B <- rKhatriRao(M, N, binder = "")
freqB <- rowSums(B$M)
names(freqB) <- B$rownames
sort(freqB, decreasing = TRUE)

# then the association between N and M is related 
# to the transition probabilities between the characters. 
P <- assocSparse(t(M), t(N))
plot(hclust(as.dist(-P), method = "ward.D"))

cysouw/qlcMatrix documentation built on July 3, 2024, 8:44 p.m.