Construct a ‘type-token’ (tt) Matrix from a vector
A type-token matrix is a sparse matrix representation of a vector of entities. The rows of the matrix (‘types’) represent the distinct entities in the vector, and the columns of the matrix (‘tokens’) represent the entities themselves, one column per element of the vector. The cells of the matrix indicate which token belongs to which type. This is basically a convenience wrapper around sparseMatrix, with an option to influence the ordering of the rows (‘types’) based on locale settings.
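The construction described above can be sketched by hand. The following is only an illustration, assuming the Matrix package is installed; the helper name `tt_sketch` and the names of the list elements are hypothetical, and this is not the function's actual implementation.

```r
library(Matrix)

# Hypothetical helper illustrating the idea; not the package's own code.
tt_sketch <- function(tokens) {
  tokens <- as.character(tokens)
  # Types are the distinct entities, ordered bytewise (as in the "C" locale)
  types <- sort(unique(tokens), method = "radix")
  # Row index of each token = position of its type
  i <- match(tokens, types)
  # Column index = position of the token in the vector
  j <- seq_along(tokens)
  # Pattern matrix (no x values): an entry marks that token j is of type i
  M <- sparseMatrix(i = i, j = j, dims = c(length(types), length(tokens)))
  list(M = M, rownames = types)
}

tokens <- c("b", "a", "b", "c", "a", "b")
tt <- tt_sketch(tokens)
tt$rownames      # "a" "b" "c"
rowSums(tt$M)    # 2 3 1, the frequency of each type
colSums(tt$M)    # all 1, every token belongs to exactly one type
```

Because each column holds exactly one entry, the matrix stays very sparse however long the vector is.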
a vector of tokens to be represented as a sparse matrix. A factor is also accepted without complaint, but be aware that the ordering of the levels of a factor depends on the locale, which this function handles transparently. It is therefore better to let this function turn the vector into a factor.
by default, the row and column names are not included in the matrix, to keep the matrix as lean as possible. The row names (‘types’) are returned separately. Using
locale determining the ordering (‘collation’) of the entities. By default R mostly uses ‘en_US.UTF-8’, though this might depend on the installation. By default, this function sets the ordering to ‘C’, which means that characters are ordered according to their Unicode code point. For more information about locale settings, see
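The difference between ‘C’ ordering and locale-dependent ordering can be seen in base R alone. This is a small sketch, independent of this function, that uses `sort(..., method = "radix")` as a stand-in for ‘C’ collation; the example vector is made up.

```r
x <- c("b", "A", "a", "B")

# Bytewise ("C") ordering: uppercase letters sort before lowercase ones
sort(x, method = "radix")    # "A" "B" "a" "b"

# Locale-dependent ordering; in many locales (e.g. en_US.UTF-8) this gives
# "a" "A" "b" "B", but the result depends on the installation
sort(x)

# factor() uses the locale-dependent sort for its levels by default;
# passing explicit levels makes the ordering reproducible
f <- factor(x, levels = sort(unique(x), method = "radix"))
levels(f)                    # "A" "B" "a" "b"
```

Since the order of the levels determines the order of the rows, fixing the collation keeps matrices comparable across machines.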
This function is a rather low-level preparation for later high-level functions. A few simple uses are described in the examples.
By default (simplify = F), the output is a list with two elements:
sparse pattern Matrix of type
a separate vector with the names of the types in the order of occurrence in the matrix. This vector is separated from the matrix itself for reasons of efficiency when dealing with many matrices.
When simplify = T, only the matrix M with row and column names is returned.
This function is used in various high-level functions like
# Consider two nominal variables
# one with eight categories, and one with three categories
var1 <- sample(8, 1000, TRUE)
var2 <- sample(3, 1000, TRUE)

# turn them into type-token matrices
M1 <- ttMatrix(var1, simplify = TRUE)
M2 <- ttMatrix(var2, simplify = TRUE)

# Then taking the `residuals' from assocSparse ...
x <- as.matrix(assocSparse(t(M1), t(M2), method = res))

# ... is the same as the residuals as given by a chi-square
x2 <- chisq.test(var1, var2)$residuals
class(x2) <- "matrix"
all.equal(x, x2, check.attributes = FALSE) # TRUE

# A second quick example: consider a small piece of English text:
text <- "Once upon a time in midwinter, when the snowflakes were falling like feathers from heaven, a queen sat sewing at her window, which had a frame of black ebony wood. As she sewed she looked up at the snow and pricked her finger with her needle. Three drops of blood fell into the snow. The red on the white looked so beautiful that she thought to herself: If only I had a child as white as snow, as red as blood, and as black as the wood in this frame. Soon afterward she had a little daughter who was as white as snow, as red as blood, and as black as ebony wood, and therefore they called her Little Snow-White. And as soon as the child was born, the queen died."

# split by characters, make lower-case, and turn into a type-token matrix
# (strsplit returns a list, so take its first element)
split.text <- tolower(strsplit(text, "")[[1]])
M <- ttMatrix(split.text, simplify = TRUE)

# rowSums give the character frequency
freq <- rowSums(M)
names(freq) <- rownames(M)
sort(freq, decreasing = TRUE)

# shift the matrix one character to the right using a bandSparse matrix
S <- bandSparse(n = ncol(M), k = 1)
N <- M %*% S

# use rKhatriRao on M and N to get frequencies of bigrams
B <- rKhatriRao(M, N, binder = "")
freqB <- rowSums(B$M)
names(freqB) <- B$rownames
sort(freqB, decreasing = TRUE)

# then the association between N and M is related
# to the transition probabilities between the characters.
P <- assocSparse(t(M), t(N))
plot(hclust(as.dist(-P), method = "ward"))