txt.to.features: Split string of words or other countable features
In computationalstylistics/stylo: Stylometric Multivariate Analyses

txt.to.features

R Documentation

Split string of words or other countable features

Description

Function that converts a vector of tokenized words into a vector of features, either words or characters, and optionally combines them into n-grams.

Usage

txt.to.features(tokenized.text, features = "w", ngram.size = 1)

Arguments

`tokenized.text`	a vector of tokenized words
`features`	an option for specifying the desired type of feature: `"w"` for words, `"c"` for characters (default: `"w"`).
`ngram.size`	an optional argument (integer) indicating the value of n, or the size of n-grams to be created. If this argument is missing, the default value of 1 is used.

Details

This function carries out the preprocessing steps necessary for feature selection: it converts an input text into the type of sequences needed (words or characters, optionally grouped into n-grams) and returns a new vector of items. It invokes make.ngrams to combine single units into pairs, triplets, or longer n-grams. See help(make.ngrams) for details.
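
The two steps described above (choose the unit, then combine units into n-grams) can be illustrated with a minimal base-R sketch. This is not the stylo implementation: the names `sketch.ngrams` and `sketch.features` are hypothetical, and joining the words with a space before splitting into characters, as well as using a space as the n-gram separator, are assumptions made for illustration only.

```r
# Hypothetical helper (not part of stylo): sliding-window n-grams.
sketch.ngrams <- function(units, n = 1) {
  if (n <= 1) return(units)
  if (length(units) < n) return(character(0))
  starts <- seq_len(length(units) - n + 1)
  # for each start position, glue n consecutive units with a space (assumed separator)
  vapply(starts,
         function(i) paste(units[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical stand-in for the behaviour described in Details.
sketch.features <- function(tokenized.text, features = "w", ngram.size = 1) {
  units <- tokenized.text
  if (features == "c") {
    # assumption: words are joined with a space, then split into single characters
    units <- unlist(strsplit(paste(tokenized.text, collapse = " "), ""))
  }
  sketch.ngrams(units, ngram.size)
}

words <- c("quousque", "tandem", "abutere")

# word 2-grams
sketch.features(words, ngram.size = 2)
# [1] "quousque tandem" "tandem abutere"

# character 2-grams
sketch.features(c("ab", "cd"), features = "c", ngram.size = 2)
```

A vector of k units yields k - n + 1 n-grams in this sketch, so three words give two word 2-grams. The real function may differ in separators and edge-case handling; consult help(make.ngrams) for the actual behaviour.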

Author(s)

Maciej Eder, Mike Kestemont

Examples

# consider the string my.text:
my.text = "Quousque tandem abutere, Catilina, patientia nostra?"

# split it into a vector of consecutive words:
my.vector.of.words = txt.to.words(my.text)

# build a vector of word 2-grams:
txt.to.features(my.vector.of.words, ngram.size = 2)
 
# or produce character n-grams (in this case, character tetragrams):
txt.to.features(my.vector.of.words, features = "c", ngram.size = 4)

computationalstylistics/stylo documentation built on Jan. 4, 2025, 1:56 p.m.