txt.to.features: Split string of words or other countable features


txt.to.features    R Documentation

Split string of words or other countable features

Description

Converts a vector of tokenized words into the requested type of feature, either words or characters, and optionally combines those features into n-grams.

Usage

txt.to.features(tokenized.text, features = "w", ngram.size = 1)

Arguments

tokenized.text

a vector of tokenized words

features

the type of feature to extract: "w" for words, "c" for characters (default: "w").

ngram.size

an optional integer giving the value of n, i.e. the size of the n-grams to be created. If this argument is missing, the default value of 1 is used.

Details

Carries out the preprocessing steps necessary for feature selection: it converts an input text into the required type of sequence (n-grams etc.) and returns a new vector of items. The function invokes make.ngrams to combine single units into pairs, triplets, or longer n-grams. See help(make.ngrams) for details.
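
To illustrate what combining single units into n-grams means, here is a minimal base-R sketch. It is not the package's code: the helper sketch.ngrams is hypothetical, and its choices (joining units with a single space, returning the input unchanged when it is shorter than n) are assumptions made only for this illustration.

```r
# Hypothetical helper, NOT part of stylo: slide a window of size n over
# a vector of units and paste each window into one string.
sketch.ngrams = function(units, n = 1) {
  # assumption: nothing to combine if n is 1 or the input is too short
  if (n <= 1 || length(units) < n) return(units)
  starts = seq_len(length(units) - n + 1)
  vapply(starts,
         function(i) paste(units[i:(i + n - 1)], collapse = " "),
         character(1))
}

# word features: consecutive words are combined into pairs
words = c("quousque", "tandem", "abutere", "catilina")
word.bigrams = sketch.ngrams(words, n = 2)
print(word.bigrams)

# character features: join the words, split into single characters,
# then combine the characters into tetragrams
chars = unlist(strsplit(paste(words[1:2], collapse = " "), ""))
char.tetragrams = sketch.ngrams(chars, n = 4)
print(char.tetragrams)
```

In this sketch the only difference between word and character features is the unit passed to the windowing step; the actual behavior of txt.to.features and make.ngrams may differ in details such as separators and edge cases.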

Author(s)

Maciej Eder, Mike Kestemont

See Also

txt.to.words, txt.to.words.ext, make.ngrams

Examples

# load the package that provides txt.to.words and txt.to.features:
library(stylo)

# consider the string my.text:
my.text = "Quousque tandem abutere, Catilina, patientia nostra?"

# split it into a vector of consecutive words:
my.vector.of.words = txt.to.words(my.text)

# build a vector of word 2-grams:
txt.to.features(my.vector.of.words, ngram.size = 2)
 
# or produce character n-grams (in this case, character tetragrams):
txt.to.features(my.vector.of.words, features = "c", ngram.size = 4)

computationalstylistics/stylo documentation built on April 7, 2024, 4:12 p.m.