txt.to.features: Split string of words or other countable features



Description

Converts a vector of tokenized words into features, either words or characters, and optionally combines them into n-grams.

Usage

txt.to.features(tokenized.text, features = "w", ngram.size = 1)

Arguments

tokenized.text

a vector of tokenized words

features

the type of feature to produce: "w" for words or "c" for characters (default: "w").

ngram.size

an optional integer giving n, the size of the n-grams to be created. If this argument is omitted, the default value of 1 is used.

Details

Carries out the preprocessing steps needed for feature selection: it converts a tokenized text into the required type of sequence (words or characters, optionally grouped into n-grams) and returns a new vector of items. The function calls make.ngrams to combine single units into pairs, triplets or longer n-grams. See help(make.ngrams) for details.
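
The combining step can be pictured with a short base-R sketch. This is not the code of make.ngrams: the helper name sketch.ngrams is hypothetical, and the sketch only illustrates the sliding-window idea, assuming that the units of each n-gram are joined with a single space (as in the example output on this page).

```r
# Hypothetical helper, not part of stylo: slide a window of size n over a
# vector of units and join the units in each window with a single space.
sketch.ngrams = function(units, n = 1) {
  # with n = 1 there is nothing to combine
  if (n <= 1) {
    return(units)
  }
  # a choice made for this sketch: too-short input yields an empty vector
  if (length(units) < n) {
    return(character(0))
  }
  starts = seq_len(length(units) - n + 1)
  vapply(
    starts,
    function(i) paste(units[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

words = c("quousque", "tandem", "abutere", "catilina", "patientia", "nostra")
word.bigrams = sketch.ngrams(words, n = 2)
print(word.bigrams)
```

Six words yield five overlapping bigrams, one for each starting position that leaves room for a full window.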

Author(s)

Maciej Eder, Mike Kestemont

See Also

txt.to.words, txt.to.words.ext, make.ngrams

Examples

# load the package:
library(stylo)

# consider the string my.text:
my.text = "Quousque tandem abutere, Catilina, patientia nostra?"

# split it into a vector of consecutive words:
my.vector.of.words = txt.to.words(my.text)

# build a vector of word 2-grams:
txt.to.features(my.vector.of.words, ngram.size = 2)
 
# or produce character n-grams (in this case, character tetragrams):
txt.to.features(my.vector.of.words, features = "c", ngram.size = 4)

Example output

### stylo version: 0.6.9 ###

If you plan to cite this software (please do!), use the following reference:
    Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
    a package for computational text analysis. R Journal 8(1): 107-121.
    <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>

To get full BibTeX entry, type: citation("stylo")
Warning message:
no DISPLAY variable so Tk is not available 
[1] "quousque tandem"    "tandem abutere"     "abutere catilina"  
[4] "catilina patientia" "patientia nostra"  
 [1] "q u o u" "u o u s" "o u s q" "u s q u" "s q u e" "q u e  " "u e   t"
 [8] "e   t a" "  t a n" "t a n d" "a n d e" "n d e m" "d e m  " "e m   a"
[15] "m   a b" "  a b u" "a b u t" "b u t e" "u t e r" "t e r e" "e r e  "
[22] "r e   c" "e   c a" "  c a t" "c a t i" "a t i l" "t i l i" "i l i n"
[29] "l i n a" "i n a  " "n a   p" "a   p a" "  p a t" "p a t i" "a t i e"
[36] "t i e n" "i e n t" "e n t i" "n t i a" "t i a  " "i a   n" "a   n o"
[43] "  n o s" "n o s t" "o s t r" "s t r a"
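
The character output above is consistent with the words being joined by single spaces, split into single characters, and then grouped in windows of four, with the characters of each window again separated by spaces. A minimal base-R sketch under that assumption follows; it does not call stylo and may differ from the package's internals.

```r
# Assumption of this sketch: words are joined with a space, so that word
# boundaries survive as space characters inside the n-grams.
words = c("quousque", "tandem")
chars = unlist(strsplit(paste(words, collapse = " "), ""))

n = 4
starts = seq_len(length(chars) - n + 1)
char.tetragrams = vapply(
  starts,
  function(i) paste(chars[i:(i + n - 1)], collapse = " "),
  character(1)
)

# the first seven items correspond to the first row of the output above
print(head(char.tetragrams, 7))
```

Items such as "q u e  " end in two spaces: one is the separator added between characters, and the other is the word boundary itself.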

stylo documentation built on Dec. 6, 2020, 5:06 p.m.
