
A (possibly large) vector of strings is separated into sparse pattern matrices, which allows for efficient computation on the strings.

```
splitStrings(strings, sep = "", bigrams = TRUE, boundary = TRUE,
bigram.binder = "", gap.symbol = "\u2043", left.boundary = "#",
right.boundary = "#", simplify = FALSE)
```

`strings` |
Vector of strings to be separated into sparse matrices |

`sep` |
Separator used to split the strings into parts. This will be passed to |

`bigrams` |
By default, both unigrams and bigrams are computed. If bigrams are not needed, setting |

`boundary` |
Should a start symbol and a stop symbol be added to each string? This will only be used for the determination of bigrams, and will be ignored if |

`bigram.binder` |
Only when |

`gap.symbol` |
Only when |

`left.boundary, right.boundary` |
Symbols to be used as boundaries, only used when |

`simplify` |
By default, various vectors and matrices are returned. However, when |

By default, the output is a list of six elements:

`segments` |
A vector with all split parts (i.e. all tokens) in order of occurrence, with gap symbols separating the tokens of consecutive original strings. |

`unigrams` |
A vector with all unique parts occurring in the segments. |

`bigrams` |
Only present when |

`SW` |
A sparse pattern matrix of class |

`US` |
A sparse pattern matrix of class |

`BS` |
Only present when |

When `simplify = TRUE`, the output is a single sparse matrix of class `dgCMatrix`. This is basically BS %*% SW (when `bigrams = TRUE`) or US %*% SW (when `bigrams = FALSE`), with row and column names added to the matrix.
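To make the product US %*% SW more concrete, here is a minimal sketch in base R plus the Matrix package. It is not the package's implementation. It assumes that `SW` links segments to the strings they come from and that `US` links unigrams to segments, and it assumes that gap symbols are left out of the unigrams; the actual function may treat these details differently.

```r
library(Matrix)

strings <- c("this", "is", "an", "example")
gap <- "\u2043"   # same default gap symbol as in the usage above

parts     <- strsplit(strings, split = "")           # split into single characters
tokens    <- unlist(parts)
string_id <- rep(seq_along(parts), lengths(parts))   # source string of each token

# position of each token once one gap is placed between consecutive strings
pos <- seq_along(tokens) + string_id - 1
segments <- rep(gap, length(tokens) + length(strings) - 1)
segments[pos] <- tokens

unigrams <- unique(tokens)   # assumption: the gap symbol is not counted as a unigram

# sparse pattern matrices: segments x strings, and unigrams x segments
SW <- sparseMatrix(i = pos, j = string_id,
                   dims = c(length(segments), length(strings)))
US <- sparseMatrix(i = match(tokens, unigrams), j = pos,
                   dims = c(length(unigrams), length(segments)))

# multiply by 1 to get numeric matrices, so the product counts occurrences
counts <- (US * 1) %*% (SW * 1)
dimnames(counts) <- list(unigrams, strings)
counts
```

Under these assumptions, each column of `counts` holds the number of times every unigram occurs in the corresponding string.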

Because of some internal idiosyncrasies, the ordering of the bigrams is first by second element, and then by first element. This might change in future versions.
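The ordering described in this note can be illustrated with a small base R sketch. This is only an illustration of "second element first, then first element", not the code used by the function; the boundary symbol `#` is taken from the usage above, the empty binder is the documented default, and the C-locale (radix) sort is an assumption made here to keep the result reproducible.

```r
strings <- c("this", "is")

# collect (first, second) pairs of adjacent tokens, with "#" as boundary symbol
pairs <- do.call(rbind, lapply(strsplit(strings, ""), function(p) {
  tokens <- c("#", p, "#")
  cbind(first = head(tokens, -1), second = tail(tokens, -1))
}))
pairs <- unique(pairs)   # keep each bigram type once

# sort by the second element, break ties by the first element
ord <- order(pairs[, "second"], pairs[, "first"], method = "radix")
bigrams <- paste0(pairs[ord, "first"], pairs[ord, "second"])
bigrams
```

In this sketch the two bigrams ending in `i` (`#i` and `hi`) end up next to each other, ordered by their first element.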

Michael Cysouw

`sim.strings` is a convenience function to quickly compute pairwise string similarities, based on `splitStrings`.

```
# a simple example to see the function at work
example <- c("this","is","an","example")
splitStrings(example)
splitStrings(example, simplify = TRUE)
## Not run:
# a bit larger, but still quick and efficient
# taking 15526 wordforms from the English Dalby Bible and splitting them into bigrams
data(bibles)
words <- splitText(bibles$eng)$wordforms
system.time( S <- splitStrings(words, simplify = TRUE) )
# and then taking the cosine similarity between the bigram-vectors for all word pairs
system.time( sim <- cosSparse(S) )
# most similar words to "father"
sort(sim["father",], decreasing = TRUE)[1:20]
## End(Not run)
```
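The larger example depends on the bundled `bibles` data. The following self-contained sketch shows the same idea on four words, using only base R and the Matrix package: it builds a bigram-by-word count matrix by hand and computes cosine similarities between its columns. The helper `bigrams_of` and the word list are hypothetical and chosen for illustration; this is not how `splitStrings` or `cosSparse` are implemented.

```r
library(Matrix)

words <- c("father", "fathers", "mother", "this")

# hypothetical helper: character bigrams of one word, with "#" as boundary symbol
bigrams_of <- function(w) {
  tokens <- c("#", strsplit(w, "")[[1]], "#")
  paste0(head(tokens, -1), tail(tokens, -1))
}

big   <- lapply(words, bigrams_of)
types <- unique(unlist(big))

# bigram x word count matrix; repeated (i, j) entries are summed by sparseMatrix
S <- sparseMatrix(i = match(unlist(big), types),
                  j = rep(seq_along(words), lengths(big)),
                  x = 1,
                  dims = c(length(types), length(words)),
                  dimnames = list(types, words))

# cosine similarity between columns: dot products divided by the vector norms
norms <- sqrt(colSums(S^2))
sim   <- as.matrix(crossprod(S)) / outer(norms, norms)

# words most similar to "father"
sort(sim["father", ], decreasing = TRUE)
```

The dense `as.matrix()` step is fine for a handful of words; for thousands of strings a sparse similarity routine is the better choice.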
