README.md
In cookm346/textfeats: Computes features for text

The textfeats package

This is an R package for computing text features for single words or longer documents.

Example Usage

install.packages("devtools")
library(devtools)

install_github("cookm346/textfeats")
library(textfeats)

Here is some basic text we can use to demonstrate the package. Each element of this character vector can be a single word or a longer text, up to a full document (e.g., a book).

x <- c("i am the WALRUS", "Jerk!!", "very very good")

The package computes frequency-based features for each string, including:

Number of words
Number of unique words
Number of characters
Number of unique characters
Number of punctuation characters (periods, commas, question marks, and exclamation marks)

count(x)

##                 n_words n_unique_words n_chars n_unique_chars n_periods n_commas n_question n_exclamation
## i am the WALRUS       4              4      15             13         0        0          0             0
## Jerk!!                1              1       6              5         0        0          0             2
## very very good        3              2      14              8         0        0          0             0
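
To make the column definitions concrete, here is a minimal base R sketch of how counts like these could be computed. It is an illustrative re-implementation, not the package's own code: `count_sketch` is a hypothetical name, and splitting on whitespace and comparing words case-sensitively are assumptions.

```r
# Illustrative sketch only; textfeats::count() may tokenize differently.
count_sketch <- function(x) {
  # Words: split each string on runs of whitespace
  words <- strsplit(trimws(x), "\\s+")
  # Characters: split each string into single characters (spaces included)
  chars <- strsplit(x, "")
  # Helper: how many times one character occurs in each string
  n_of <- function(ch) vapply(chars, function(s) sum(s == ch), integer(1))

  data.frame(
    n_words        = lengths(words),
    n_unique_words = vapply(words, function(w) length(unique(w)), integer(1)),
    n_chars        = nchar(x),
    n_unique_chars = vapply(chars, function(s) length(unique(s)), integer(1)),
    n_periods      = n_of("."),
    n_commas       = n_of(","),
    n_question     = n_of("?"),
    n_exclamation  = n_of("!"),
    row.names      = x  # requires the input strings to be unique
  )
}

x <- c("i am the WALRUS", "Jerk!!", "very very good")
count_sketch(x)
```

For the three example strings, this sketch gives the same counts as the table above. Note that spaces count as characters, and that upper- and lower-case letters are treated as distinct.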

The package also produces a set of sentiment-based features. The valence, arousal, and dominance of individual words or longer documents are summarized using the norms from Warriner et al. (2013).

warriner(x)

##                 valence arousal dominance
## i am the WALRUS    5.79    3.95      5.23
## Jerk!!             2.43    6.45      4.84
## very very good     7.89    3.66      6.41

For strings with more than one word, the mean valence, arousal, and dominance across the words are computed.
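
The averaging step can be sketched in base R as follows. This is a hedged illustration, not the package's implementation: `toy_norms` and `mean_norms` are hypothetical names, the ratings in `toy_norms` are invented for the example (they are not values from Warriner et al.), and the preprocessing (lowercasing, stripping punctuation, ignoring words without a rating) is an assumption.

```r
# Toy norms table: words and ratings are INVENTED for illustration,
# not taken from Warriner et al. (2013).
toy_norms <- data.frame(
  word      = c("good", "jerk", "walrus"),
  valence   = c(7.0, 2.0, 6.0),
  arousal   = c(4.0, 6.0, 4.0),
  dominance = c(6.0, 5.0, 5.0)
)

mean_norms <- function(x, norms) {
  # Assumed preprocessing: lowercase, drop punctuation, split on whitespace
  tokens <- strsplit(trimws(gsub("[[:punct:]]", "", tolower(x))), "\\s+")
  rated  <- setdiff(names(norms), "word")
  out <- t(vapply(tokens, function(w) {
    # Look up each word; words missing from the norms become NA rows
    hits <- norms[match(w, norms$word), rated, drop = FALSE]
    # Average over the words that do have a rating
    colMeans(hits, na.rm = TRUE)
  }, numeric(length(rated))))
  rownames(out) <- x
  out
}

mean_norms(c("good jerk", "Jerk!!", "very very good"), toy_norms)
```

In this sketch a string with no rated words yields `NaN`; how the package itself handles that case is not documented here.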

The package produces a set of features based on the parts of speech in the text (e.g., noun, verb, adjective). These part-of-speech classes are extracted from the Moby Part-of-Speech database, which contains fifteen different classes. The output below is truncated.

pos(x)

##                 Noun Plural Noun_Phrase Verb_participle Verb_transitive   ...
## i am the WALRUS    2      0           0               1               0   ...
## Jerk!!             1      0           0               1               1   ...
## very very good     1      0           0               0               0   ...

For strings with more than one word, the counts for each part-of-speech class are summed across the words.
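
A minimal sketch of the summing step, under stated assumptions: `toy_pos`, `classes`, and `pos_sketch` are hypothetical names, the lexicon entries are invented rather than taken from the Moby database, and a word is allowed to belong to several classes (as the output for "Jerk!!" above suggests).

```r
# Toy lexicon: INVENTED entries, not taken from the Moby database.
# Each word maps to one or more part-of-speech classes.
toy_pos <- list(
  jerk   = c("Noun", "Verb_transitive"),
  walrus = "Noun",
  good   = c("Noun", "Adjective")
)
classes <- c("Noun", "Verb_transitive", "Adjective")

pos_sketch <- function(x, lexicon, classes) {
  # Assumed preprocessing: lowercase, drop punctuation, split on whitespace
  tokens <- strsplit(trimws(gsub("[[:punct:]]", "", tolower(x))), "\\s+")
  out <- t(vapply(tokens, function(w) {
    # Collect the tags of every word; unknown words contribute nothing
    tags <- unlist(lexicon[w], use.names = FALSE)
    # Count the tags per class, keeping classes that have zero matches
    as.integer(table(factor(tags, levels = classes)))
  }, integer(length(classes))))
  dimnames(out) <- list(x, classes)
  out
}

pos_sketch(c("i am the WALRUS", "Jerk!!", "good good"), toy_pos, classes)
```

Because the counts are summed rather than averaged, a repeated word contributes once per occurrence in this sketch.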

The package fetches concreteness norms from Brysbaert et al. (2014). For strings with more than one word, the mean concreteness across the words is returned.

concreteness(x)

##                     Concreteness
## i am the WALRUS     2.796667
## Jerk!!              3.260000
## very very good      1.500000

Finally, the package generates a set of semantic vectors (i.e., word embeddings). Any set of word embeddings can be used. For strings with more than one word, the function computes the mean of the word vectors in the string (i.e., the semantic gist of the document). The output below is truncated.


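# Uncomment to download and load the TASA semantic space, which provides the TASA object used below
# load(url("http://www.lingexp.uni-tuebingen.de/z2/LSAspaces/TASA.rda"))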

semantics(x, TASA)

##                                [,1]        [,2]          [,3]          [,4]         [,5]          [,6]   ...
## i am the WALRUS    [1,] 0.049237273 0.064432684 -0.0106181763  0.0215646167 -0.000233423  0.0045926339   ...
## Jerk!!             [2,] 0.001102812 0.001143806 -0.0008592891 -0.0005428566  0.001024854 -0.0007934065   ...
## very very good     [3,] 0.049006431 0.009857605 -0.0135557641 -0.0073958977 -0.022383744  0.0004875048   ...
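
The "mean of the word vectors" idea can be sketched with a tiny embedding matrix. This is an illustration under assumptions, not the package's code: `toy_space` and `gist_sketch` are hypothetical names, the vectors are invented, the embedding matrix is assumed to have one row per word with the words as row names, and words missing from the matrix are simply skipped.

```r
# Toy 3-dimensional embeddings: INVENTED numbers, one row per word.
toy_space <- rbind(
  very = c(0.2, 0.0, 0.4),
  good = c(0.6, 0.2, 0.0),
  jerk = c(-0.4, 0.8, 0.2)
)

gist_sketch <- function(x, space) {
  # Assumed preprocessing: lowercase, drop punctuation, split on whitespace
  tokens <- strsplit(trimws(gsub("[[:punct:]]", "", tolower(x))), "\\s+")
  out <- t(vapply(tokens, function(w) {
    # Keep only the words that have a vector in the space
    known <- w[w %in% rownames(space)]
    if (length(known) == 0) return(rep(NA_real_, ncol(space)))
    # Average the word vectors; repeated words are counted each time
    colMeans(space[known, , drop = FALSE])
  }, numeric(ncol(space))))
  rownames(out) <- x
  out
}

gist_sketch(c("very very good", "Jerk!!", "walrus"), toy_space)
```

Here a string with no known words returns a row of `NA`. A real embedding space such as TASA has several hundred dimensions, but the averaging works the same way.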

cookm346/textfeats documentation built on April 24, 2020, 9:50 p.m.
