# Construct ‘part-whole’ (pw) Matrices from tokenized strings

### Description

A part-whole Matrix is a sparse matrix representation of a vector of strings (‘wholes’) split into smaller parts by a specified separator. It records which strings consist of which parts. By itself this transformation is not very interesting, but it enables more elaborate computations through simple matrix manipulations.

### Usage

`pwMatrix(strings, sep, gap.length, gap.symbol, simplify)`

### Arguments

| Argument | Description |
|---|---|
| `strings` | A vector (or list) of strings to be separated into parts. |
| `sep` | The separator to be used. Defaults to space. |
| `gap.length` | The number of gap symbols to add between each pair of strings. This only matters when generating higher-order n-gram statistics later on and no ordering of the strings is implied. For example, when the strings are alphabetically ordered words, bigram statistics should not count the bigram formed by the last character of one word and the first character of the next word. |
| `gap.symbol` | The gap symbol to insert (see `gap.length` above). It defaults to U+8901 ( · ), on the assumption that this character rarely occurs in data. |
| `simplify` | By default, the row and column names are not included in the matrix, to keep the matrix as lean as possible; the row names (‘parts’) are returned separately. Using `simplify = T` returns only the matrix, with row and column names included. |
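The effect of gap symbols on bigram statistics can be illustrated with a minimal base R sketch. It does not call `pwMatrix`; the helper `bigrams` and the gap symbol `"_"` are hypothetical choices made only for this illustration, and the sketch assumes a single gap symbol between each pair of words.

```r
# Three alphabetically ordered words; their order carries no meaning.
words <- c("an", "at", "be")

# Hypothetical helper: paste each token together with its successor.
bigrams <- function(tokens) {
  paste0(head(tokens, -1), tail(tokens, -1))
}

# Without gaps, bigrams run across word boundaries ("na", "tb").
flat <- unlist(strsplit(words, ""))
no_gap <- bigrams(flat)
no_gap
# "an" "na" "at" "tb" "be"

# With one gap symbol between each pair of words, boundary bigrams
# contain the gap symbol and can be filtered out.
gap <- "_"  # illustrative stand-in, not the package default
gapped <- head(unlist(lapply(strsplit(words, ""), function(w) c(w, gap))), -1)
with_gap <- bigrams(gapped)
clean <- with_gap[!grepl(gap, with_gap, fixed = TRUE)]
clean
# "an" "at" "be"
```

Only the within-word bigrams survive once the gap symbol is in place, which is the behaviour the `gap.length` description asks for.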

### Details

Internally, this basically uses `strsplit` and some cosmetic changes, returning a sparse matrix.
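As a rough illustration of that idea, the following sketch builds a part-whole matrix from `strsplit` and `Matrix::sparseMatrix`. It is not the package's implementation: `pw_sketch` is a hypothetical name, gap symbols and name handling are ignored, and the default `sep = ""` is chosen here only so that strings are split into single characters.

```r
library(Matrix)  # recommended package that ships with standard R installations

# Hypothetical helper, not the qlcMatrix code: one row per token,
# one column per input string.
pw_sketch <- function(strings, sep = "") {
  parts <- strsplit(strings, split = sep, fixed = TRUE)
  tokens <- unlist(parts)
  M <- sparseMatrix(
    i = seq_along(tokens),                              # row = token position
    j = rep(seq_along(strings), times = lengths(parts)), # column = source string
    dims = c(length(tokens), length(strings))
  )  # no 'x' given, so the result is a pattern (TRUE/FALSE) matrix
  list(M = M, rownames = tokens)
}

pw <- pw_sketch(c("this", "is", "an", "example"))
dim(pw$M)       # 15 tokens by 4 strings
pw$rownames     # the individual characters, in order
```

Each column then has as many non-zero entries as its string has parts, so `colSums(pw$M)` gives the length of each string in tokens.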

### Value

By default (when `simplify = F`) the output is a list with two elements, containing:

| Element | Description |
|---|---|
| `M` | A sparse pattern Matrix. |
| `rownames` | All different characters from the strings in order (i.e. all individual tokens of the original strings). |

When `simplify = T`, only the matrix `M`, with row and column names, is returned.

### Author(s)

Michael Cysouw

### See Also

Used in `splitStrings` and `splitWordlist`.

### Examples

```r
library(qlcMatrix)  # provides pwMatrix and ttMatrix
library(Matrix)     # provides bandSparse and sparse matrix methods

# By itself, this function does nothing really interesting
example <- c("this", "is", "an", "example")
pw <- pwMatrix(example)
pw

# However, making a type-token Matrix (with ttMatrix) of the rownames
# and then taking a matrix product results in the frequency of each
# element in each string. Multiplying by 1 turns the pattern matrices
# into numeric ones.
tt <- ttMatrix(pw$rownames)
distr <- (tt$M * 1) %*% (pw$M * 1)
rownames(distr) <- tt$rownames
colnames(distr) <- example
distr

# Use a banded sparse matrix with a superdiagonal ('shift matrix') to get
# co-occurrence counts of adjacent characters. Rows list the first
# character, columns the adjacent character. Non-zero entries give the
# number of co-occurrences.
S <- bandSparse(n = ncol(tt$M), k = 1) * 1
TT <- tt$M * 1
(C <- TT %*% S %*% t(TT))

# Show the non-zero entries as triplets:
s <- summary(C)
first <- tt$rownames[s[, 1]]
second <- tt$rownames[s[, 2]]
freq <- s[, 3]
data.frame(first, second, freq)
```