# splitStrings: Construct unigram and bigram matrices from a vector of... In qlcMatrix: Utility Sparse Matrix Functions for Quantitative Language Comparison

## Description

A (possibly large) vector of strings is separated into sparse pattern matrices, which allows for efficient computation on the strings.

## Usage

```r
splitStrings(strings, sep = "", bigrams = TRUE, boundary = TRUE,
  bigram.binder = "", gap.symbol = "\u2043", left.boundary = "#",
  right.boundary = "#", simplify = FALSE)
```

## Arguments

- `strings`: Vector of strings to be separated into sparse matrices.
- `sep`: Separator used to split the strings into parts. It is passed to `strsplit` internally, so no fine-grained control over the splitting is possible. If the splitting has to be exactly right, pre-process the strings by inserting a special symbol at the split positions and then split on that symbol.
- `bigrams`: By default, both unigrams and bigrams are computed. If bigrams are not needed, setting `bigrams = FALSE` saves resources.
- `boundary`: Should a start symbol and a stop symbol be added to each string? This is only used for determining bigrams and is ignored if `bigrams = FALSE`.
- `bigram.binder`: Only used when `bigrams = TRUE`. What symbol(s) should occur between the two parts of a bigram?
- `gap.symbol`: Only used when `bigrams = TRUE`. What symbol should be inserted to separate the strings? It defaults to U+2043 `HYPHEN BULLET`, on the assumption that this character rarely occurs in data. See `pwMatrix` for more explanation of why this gap symbol is needed.
- `left.boundary, right.boundary`: Symbols to be used as boundaries. Only used when `boundary = TRUE`.
- `simplify`: By default, various vectors and matrices are returned. When `simplify = TRUE`, only a single sparse matrix is returned. See Value.
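The pre-processing advice for `sep` can be sketched in base R. This is only an illustration under stated assumptions: the data, the decision to treat "sh" and "ch" as single segments, and the marker `_` (assumed not to occur in the data) are all hypothetical.

```r
# Hypothetical data in which "sh" and "ch" should count as one segment each
strings <- c("shop", "chips")

# Step 1: put a marker after every intended segment.
# The alternation tries the digraphs first, then any single character.
marked <- gsub("(sh|ch|.)", "\\1_", strings, perl = TRUE)

# Drop the trailing marker so that no empty segment can arise at the end
prepared <- sub("_$", "", marked)

# Step 2: split on the marker only (this mirrors what a call to
# strsplit with sep = "_" would do)
parts <- strsplit(prepared, "_", fixed = TRUE)
parts
```

With the qlcMatrix package loaded, the prepared strings could then be passed on as `splitStrings(prepared, sep = "_")`.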

## Value

By default, the output is a list of six elements:

- `segments`: A vector with all split parts (i.e. all tokens) in order of occurrence, with gap symbols separating the parts of different original strings.
- `unigrams`: A vector with all unique parts occurring in the segments.
- `bigrams`: Only present when `bigrams = TRUE`. A vector with all unique bigrams.
- `SW`: A sparse pattern matrix of class `ngCMatrix` specifying the distribution of segments (S) over the original strings (W, think ‘words’). This matrix is only interesting in combination with the following matrices.
- `US`: A sparse pattern matrix of class `ngCMatrix` specifying the distribution of the unique unigrams (U) over the tokenized segments (S).
- `BS`: Only present when `bigrams = TRUE`. A sparse pattern matrix of class `ngCMatrix` specifying the distribution of the unique bigrams (B) over the tokenized segments (S).
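To make the three vectors concrete, here is a minimal base R sketch of what `segments`, `unigrams` and `bigrams` contain conceptually. It is not the package's implementation: the helper `bigrams_of` is hypothetical, the results are kept in order of first occurrence (the function itself orders them differently, see Note), and whether the real `segments` vector also contains boundary symbols is left open.

```r
strings <- c("is", "an")
gap <- "\u2043"   # default gap.symbol

# Split each string into single characters, as sep = "" does
tokens <- strsplit(strings, "")

# All tokens in order, with a gap symbol between consecutive strings
segments <- head(unlist(lapply(tokens, function(x) c(x, gap))), -1)

# Unique parts, leaving out the gap symbol
unigrams <- unique(segments[segments != gap])

# Bigrams per string, with "#" added as left and right boundary
# (hypothetical helper, for illustration only)
bigrams_of <- function(x) {
  x <- c("#", x, "#")
  paste0(head(x, -1), tail(x, -1))
}
bigrams <- unique(unlist(lapply(tokens, bigrams_of)))

segments
unigrams
bigrams
```

Computing the bigrams per string, as done here, is one way to avoid bigrams that span two strings; the gap symbol serves the same purpose when working on one long vector of segments.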

When `simplify = TRUE`, the output is a single sparse matrix of class `dgCMatrix`. This is basically `BS %*% SW` (when `bigrams = TRUE`) or `US %*% SW` (when `bigrams = FALSE`), with row and column names added to the matrix.
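The matrix product behind the simplified output can be illustrated with small dense matrices in base R. This is a sketch under assumptions, not the package's code: it uses ordinary numeric matrices instead of sparse ones, leaves out gap and boundary symbols, and only covers the unigram case `US %*% SW`.

```r
strings <- c("banana", "nab")
tokens <- strsplit(strings, "")

segments <- unlist(tokens)   # gap symbols omitted for brevity
word_of_segment <- rep(seq_along(tokens), lengths(tokens))
unigrams <- unique(segments)

# US: unigrams (rows) by segments (columns), 1 where the segment is that unigram
US <- outer(unigrams, segments, "==") * 1

# SW: segments (rows) by strings (columns), 1 where the segment belongs to that string
SW <- outer(word_of_segment, seq_along(strings), "==") * 1

# The product counts how often each unigram occurs in each string
counts <- US %*% SW
dimnames(counts) <- list(unigrams, strings)
counts
```

Each column of the result is a count vector for one string, which is the shape that column-wise similarity measures expect.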

## Note

Because of some internal idiosyncrasies, the ordering of the bigrams is first by second element, and then by first element. This might change in future versions.

## Author(s)

Michael Cysouw

## See Also

`sim.strings` is a convenience function to quickly compute pairwise string similarities, based on `splitStrings`.

## Examples

```r
library(qlcMatrix)

# a simple example to see the function at work
example <- c("this","is","an","example")
splitStrings(example)
splitStrings(example, simplify = TRUE)

## Not run: 
# a bit larger, but still quick and efficient
# taking 15526 wordforms from the English Dalby Bible and splitting them into bigrams
data(bibles)
words <- splitText(bibles$eng)$wordforms
system.time( S <- splitStrings(words, simplify = TRUE) )

# and then taking the cosine similarity between the bigram-vectors for all word pairs
system.time( sim <- cosSparse(S) )

# most similar words to "father"
sort(sim["father",], decreasing = TRUE)[1:20]

## End(Not run)
```
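For readers who want to see what a cosine similarity between bigram vectors amounts to, the following base R sketch computes it between the columns of a small dense matrix. The matrix `S_toy` and its row and column names are invented for illustration; this is not the implementation of `cosSparse`, which works on sparse matrices.

```r
# Hypothetical bigram-by-word count matrix (rows: bigrams, columns: words)
S_toy <- cbind(w1 = c(1, 1, 0),
               w2 = c(1, 0, 1),
               w3 = c(1, 1, 0))
rownames(S_toy) <- c("#a", "ab", "ac")

# Euclidean length of every column
norms <- sqrt(colSums(S_toy^2))

# Cosine similarity: dot products of all column pairs, divided by the lengths
sim_toy <- crossprod(S_toy) / outer(norms, norms)
sim_toy
```

Words with identical bigram vectors (`w1` and `w3`) get a similarity of 1, and words that share only some bigrams get a value between 0 and 1.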